Introduction to LLMs in Python - Interview Questions and Answers

In this post, I explain LLMs in Python, Python Setup & Installation, Inference with Transformers, Calling ChatGPT API in Python, Python Local Deployment with Hugging Face Models, Prompt Engineering in Python, and Fine-Tuning & Custom Training (including LoRA). You can test your knowledge of LLMs in Python by attempting the Quiz after every set of Questions and Answers.
If you want my complete Introduction to LLMs in Python document that additionally includes the following important topics, you can message me on LinkedIn:
Python Advanced Techniques (Streaming, Batching & Callbacks), Python Efficiency & Optimization (quantization, distillation, and parameter-efficient tuning), Integration & Deployment Workflows, LLMs in Python Best Practices & Troubleshooting, and a consolidated Introduction to LLMs in Python Quiz (with answer explanations to reinforce learning).


Question: What do I mean by "Introduction to LLMs in Python"?
Answer: Introduction to LLMs in Python refers to the foundational knowledge and practical steps needed to use Large Language Models (LLMs) within the Python ecosystem. It covers concepts such as loading pre-trained models, tokenization, inference, and integration with APIs or libraries. It equips developers and technical users with the skills to harness LLM capabilities (like text generation, summarization, and translation) directly through Python code. Note: If you are new to Python, you can view an introduction to Python in the tutorial here. The full set of Python for beginners tutorials is available here.

Question: Why is Python the lingua franca (common language) for LLM development?
Answer: Python has extensive machine-learning libraries (such as Transformers, PyTorch, and TensorFlow), a vibrant ecosystem of wrapper utilities (like Hugging Face pipelines), and API clients (for OpenAI’s GPT services). Its readable syntax and broad community support enable fast experimentation and production deployment, making it the most popular choice for AI practitioners.

Question: What are typical use cases for LLMs in Python?
Answer: Developers and technical users can use LLMs in Python for tasks including automated documentation generation, code completion, chatbots, data extraction from unstructured text, and sentiment analysis. Python’s data-processing capabilities allow these models to integrate with web frameworks, data pipelines, and DevOps tools.
Example: A Python script using an LLM can parse customer reviews, summarize sentiment trends, and output a JSON report for BI dashboards. If you need video tutorials, please check out the Software and Testing Training playlists here.
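As an illustrative sketch of the review-analysis idea (the model choice, label names, and file name are assumptions rather than part of the original example), such a script could look like this:
from transformers import pipeline
import json

# Hypothetical example: classify a few reviews and write a small JSON summary
classifier = pipeline("sentiment-analysis")  # loads a default sentiment model
reviews = ["Great product, fast delivery!", "The app keeps crashing."]
results = classifier(reviews)  # e.g., [{"label": "POSITIVE", "score": 0.99}, ...]
summary = {
    "positive": sum(r["label"] == "POSITIVE" for r in results),
    "negative": sum(r["label"] == "NEGATIVE" for r in results),
}
with open("sentiment_report.json", "w") as f:
    json.dump(summary, f)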

Quiz
1. Which language is most used for LLM integration due to its rich AI libraries?
A. Java
B. Python (Correct)
C. C++
D. Ruby

2. Loading a pre-trained model and performing tokenization in Python typically involves which library?
A. NumPy
B. Transformers (Correct)
C. Requests
D. Matplotlib

3. An LLM use case that transforms raw text into structured key-value pairs exemplifies:
A. Code completion
B. Data extraction (Correct)
C. Model quantization
D. Image segmentation

4. The phrase "lingua franca" in this context refers to Python’s role as:
A. A spoken language for data scientists
B. A common programming language for LLM tasks (Correct)
C. A legacy scripting language
D. A proprietary AI framework

5. In a chatbot application, Python’s role is to:
A. Train the LLM from scratch
B. Serve as the interface for prompt sending and response handling (Correct)

Question: How can you confirm that you have the correct Python 3.8+ environment for LLM development?
Answer: You can verify your interpreter with the command python --version and, if necessary, install a compatible distribution (such as Anaconda or the official Python installer). To isolate dependencies, create a virtual environment with python -m venv venv, then activate it with source venv/bin/activate on Linux/macOS or venv\Scripts\activate on Windows. This ensures that package versions for LLM libraries remain consistent across projects.
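As a quick programmatic check (a minimal sketch), you can also confirm the interpreter version from within Python:
import sys
# Fail fast if the interpreter is older than Python 3.8
assert sys.version_info >= (3, 8), f"Python 3.8+ required, found {sys.version}"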

Question: How is the Transformers library installed and initialized in Python?
Answer: Within the activated environment, run pip install transformers. After installation, you can load a model and tokenizer with code such as:
from transformers import AutoModelForCausalLM, AutoTokenizer  
tokenizer = AutoTokenizer.from_pretrained("gpt2")  
model = AutoModelForCausalLM.from_pretrained("gpt2")
This setup provides the core classes needed for tokenization and inference.

Question: What are the steps to install and configure PyTorch for GPU acceleration?
Answer: First, determine your CUDA version (e.g., with nvidia-smi). Then install PyTorch with the matching CUDA toolkit using the command from the official site, for example:
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu117
Verify GPU availability in Python with:
import torch  
torch.cuda.is_available()  # Returns True if GPU is ready

Question: How can you add the OpenAI client library and authenticate API access?
Answer: Install via
pip install openai
Set your API key as an environment variable—
export OPENAI_API_KEY="sk-..."
on Linux/macOS or
set OPENAI_API_KEY="sk-..."
on Windows—so your scripts can call:
import os, openai
openai.api_key = os.getenv("OPENAI_API_KEY")
This avoids hard-coding secrets in your source code.

Question: What techniques help manage dependencies and their versions across different projects?
Answer: Use a requirements.txt file generated with
pip freeze > requirements.txt
to lock exact versions. This prevents version conflicts when multiple LLM projects coexist.
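To recreate the same environment on another machine or in CI, install from the locked file with the standard pip command:
pip install -r requirements.txt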

Connect with Inder P Singh (6 years' experience in AI and ML) on LinkedIn. You can message Inder if you need personalized training or want to collaborate on projects.

Quiz
1. Which command creates a virtual environment in Python?
A. python -m venv venv (Correct)
B. pip install venv
C. conda activate venv
D. python setup.py venv

2. To install the Transformers library, you use:
A. pip install torch
B. pip install transformers (Correct)
C. pip install openai
D. pip install numpy

3. How do you verify GPU availability for PyTorch?
A. torch.has_gpu()
B. torch.cuda.is_available() (Correct)
C. nvidia.check_gpu()
D. torch.device("cpu")

4. Securely setting your OpenAI API key involves:
A. Hard-coding it in your script
B. Storing it in config.json
C. Exporting it as an environment variable (Correct)
D. Passing it as a URL parameter

5. A requirements.txt file is generated with:
A. pip list > requirements.txt
B. pip freeze > requirements.txt (Correct)
C. pip install requirements.txt
D. pip dependency-list > requirements.txt

Question: How can you initialize a pipeline for text generation using Hugging Face?
Answer: You import and call the pipeline constructor from the Transformers library, specifying the task and model name. For example:
from transformers import pipeline
generator = pipeline("text-generation", model="gpt2")
This generator object handles both tokenization and decoding internally, allowing you to pass prompts and parameters.

Question: How does tokenization work before generating text?
Answer: The pipeline's tokenizer splits input strings into discrete tokens (meaning subword units) from the model's vocabulary. It then converts those tokens into numeric IDs. For instance:
tokens = generator.tokenizer("Hello, Inder!", return_tensors="pt")
Here, tokens.input_ids contains a tensor of IDs that the model consumes.
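To verify the mapping, you can decode the IDs back into text (standard Transformers usage):
print(tokens.input_ids)                                 # tensor of numeric token IDs
print(generator.tokenizer.decode(tokens.input_ids[0]))  # "Hello, Inder!"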

Question: How do you generate a simple completion once the pipeline is set up?
Answer: Call the generator with your prompt and configuration options like max_length and num_return_sequences:
result = generator("The future of AI is", max_length=30, num_return_sequences=1)
print(result[0]["generated_text"])
This returns the prompt plus its generated continuation, up to 30 tokens in total (max_length counts the prompt tokens as well).

Question: What parameters help control generation behavior in the pipeline?
Answer: Parameters such as temperature (for randomness), top_k or top_p (for sampling diversity), and do_sample (to enable sampling) adjust creativity and variation:
generator("Explain quantum computing:", max_length=50, temperature=0.7, top_p=0.9, do_sample=True)
# © Inder P Singh https://www.linkedin.com/in/inderpsingh

Quiz
1. Which function creates a text-generation pipeline?
A. AutoModel.from_pretrained
B. pipeline("text-generation", ...) (Correct)
C. generate_text(...)
D. TextGenerator()

2. In tokenization, what does the tokenizer output?
A. Raw strings
B. Numeric token IDs (Correct)
C. Model weights
D. Attention scores

3. The max_length parameter controls:
A. The number of pipelines created
B. The maximum tokens in the generated sequence (Correct)
C. The tokenizer vocabulary size
D. The number of threads used

4. To enable probabilistic sampling diversity, which parameter is used?
A. force_cpu
B. do_sample (Correct)
C. use_cache
D. return_tensors

Question: How can you authenticate when using OpenAI's ChatGPT API in Python?
Answer: You can install the openai package and set your API key as an environment variable—export OPENAI_API_KEY="sk-..." on Linux/macOS or set OPENAI_API_KEY="sk-..." on Windows. In your script, you then invoke:
import os, openai
openai.api_key = os.getenv("OPENAI_API_KEY")
This keeps your secret key out of source code and loaded at runtime.

Question: What structure does the messages parameter use in ChatGPT calls?
Answer: The messages argument is a list of role-tagged dictionaries defining the conversation. Each entry has "role" set to "system", "user", or "assistant", and a "content" string. For example:
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Summarize the benefits of LLMs in Python."}
]
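As the conversation continues, you append the assistant's reply and the next user turn to the same list before making the next call (a minimal sketch; reply stands for text extracted from the previous response):
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "Now give a short Python example."})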

Question: How can you make a synchronous ChatGPT call using the SDK?
Answer: Use openai.ChatCompletion.create, passing the model name and messages. For example:
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=messages,
    temperature=0.3,
    max_tokens=150
)
print(response.choices[0].message.content)
This blocks execution until the full response is received and available in response.choices.

Question: How can you handle streaming responses to display partial results in real time?
Answer: Enable the stream=True flag and iterate over the response generator. Each chunk contains delta segments that you can print as they arrive. This approach provides a more responsive experience, rendering output token by token:
stream = openai.ChatCompletion.create(
    model="gpt-4",
    messages=messages,
    stream=True
)
for chunk in stream:
    print(chunk.choices[0].delta.get("content", ""), end="", flush=True)

Quiz
1. Where should you store your OpenAI API key for secure access?
A. In your script as a constant
B. As an environment variable (Correct)
C. In a GitHub gist
D. In plain text logs

2. The messages list item with "role": "system" is used to:
A. Define user queries
B. Set high-level instructions for the assistant (Correct)
C. Stream responses
D. Format JSON output

3. In a synchronous call, the generated text is retrieved from:
A. response.data
B. response.choices[0].message.content (Correct)
C. response.text
D. response.streaming

4. To receive partial content as it’s generated, you must set:
A. stream=True (Correct)
B. do_sample=True
C. echo=True
D. max_tokens=1

Question: How can you download model weights for local inference using Hugging Face?
Answer: You can call the from_pretrained method on both the model and tokenizer classes, specifying the model identifier. For example:
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
This fetches the weights and vocabulary files into your local cache, making them available for offline use.

Question: How do you prepare the AutoTokenizer and AutoModelForCausalLM for inference?
Answer: After loading, you set the model to evaluation mode and move it to the appropriate device. For example:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

Question: What steps enable running inference on CPU versus GPU?
Answer: The device assignment determines where tensors reside. On GPU:
input_ids = tokenizer("Hello, world!",
               return_tensors="pt").input_ids.to(device)
outputs = model.generate(input_ids, max_length=50)
If device is CPU, the same code runs on the processor, though with higher latency. Always move both the model and the input tensors to the same device for consistent execution.
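To convert the generated token IDs back into readable text, decode the first returned sequence (standard Transformers usage):
print(tokenizer.decode(outputs[0], skip_special_tokens=True))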

Quiz
1. Which method fetches model weights and tokenizer files?
A. load_pretrained
B. from_pretrained (Correct)
C. download_model
D. init_pretrained

2. To switch the model to evaluation mode before inference, you call:
A. model.train()
B. model.eval() (Correct)
C. model.generate()
D. model.infer()

3. How do you determine whether to use GPU or CPU for inference?
A. By checking torch.device availability (Correct)
B. By reading a config file
C. By calling model.device()
D. By inspecting tokenizer attributes

4. To run inference on GPU, you must:
A. Only move the model to CUDA
B. Move both model and input tensors to CUDA (Correct)
C. Increase max_length
D. Enable do_train mode

Question: How can you build a prompt template programmatically in Python?
Answer: You can define a Python format string with placeholders for dynamic values. This approach centralizes prompt structure and allows easy substitution for different inputs. For example:
template = "Translate the following text to French:\n\n\"{text}\""  
def render_prompt(text):  
    return template.format(text=text)  
prompt = render_prompt("Good morning!")

Question: What is prompt chaining and how is it implemented in Python?
Answer: Prompt chaining sequences multiple calls to the API, using each response to form the next prompt. In Python:
first = client.chat(["Summarize this article: ..."])  
second = client.chat([f"Based on that summary, list three key takeaways:\n\n{first}"])
By passing first into the second call, you create a pipeline where each step builds on prior output (client.chat here stands for a thin wrapper around the API; a minimal sketch of such a helper appears after the next answer).

Question: How do you compare zero-shot and few-shot strategies in code?
Answer: For zero-shot, send only the instruction:
resp0 = client.chat([{"role":"user","content":"Explain quantum computing in simple terms."}])
For few-shot, include inline examples:
examples = [  
    {"role":"user","content":"Q: What is 2+2?\nA: 4"},  
    {"role":"user","content":"Q: What is 10-3?\nA: 7"}  
]  
prompt = examples + [{"role":"user","content":"Q: What is 5×3?\nA:"}]  
resp1 = client.chat(prompt)
Comparing resp0 with resp1 shows how the few-shot examples guide the model toward more accurate and consistently formatted answers.
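The client.chat calls above assume a thin helper around the OpenAI SDK that returns the assistant's text; a minimal sketch (the class and method names are illustrative) could be:
import os, openai
openai.api_key = os.getenv("OPENAI_API_KEY")

class Client:
    def chat(self, messages, model="gpt-4", **kwargs):
        # Accept plain strings or role-tagged dictionaries
        messages = [m if isinstance(m, dict) else {"role": "user", "content": m}
                    for m in messages]
        response = openai.ChatCompletion.create(model=model, messages=messages, **kwargs)
        return response.choices[0].message.content

client = Client()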

Quiz
1. Which Python construct allows dynamic insertion of variables into a prompt template?
A. Lambda functions
B. f-strings or str.format (Correct)
C. List comprehensions
D. Decorators

2. Prompt chaining in Python involves:
A. Encrypting prompts before sending
B. Passing the previous API response as part of the next prompt (Correct)
C. Using only system messages
D. Parallelizing API calls

3. A zero-shot prompt differs from a few-shot prompt by:
A. Including multiple examples
B. Relying solely on the instruction without examples (Correct)
C. Using a higher temperature
D. Always returning JSON

4. In a few-shot strategy, embedding two Q&A pairs before a new question helps to:
A. Decrease model temperature
B. Guide the model’s response format and improve accuracy (Correct)
C. Increase token usage efficiency
D. Force the model into eval mode

Question: How do you prepare a dataset for fine-tuning an LLM in Python?
Answer: You collect and clean target-domain text, then format it into examples. This is typically done with keys like {"prompt": "...", "completion": "..."}. You convert this list into a datasets.Dataset object:
from datasets import Dataset  
data = [{"prompt": "Hello, how are you?", "completion": "I am fine, thank you."}]  
ds = Dataset.from_list(data)  
ds = ds.train_test_split(test_size=0.1)
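Because the Trainer (next question) consumes tokenized examples rather than raw strings, a tokenization step is typically mapped over the dataset first; this is a minimal sketch in which the 512-token limit is an assumption:
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

def tokenize(example):
    # Concatenate prompt and completion into a single training sequence
    text = example["prompt"] + example["completion"]
    return tokenizer(text, truncation=True, max_length=512)

ds = ds.map(tokenize, remove_columns=["prompt", "completion"])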

Question: How is the Hugging Face Trainer API used to fine-tune a model?
Answer: You instantiate TrainingArguments to set parameters (such as per_device_train_batch_size, num_train_epochs, and output directory) and then create a Trainer with your model, tokenizer, dataset, and a data collator:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling  
args = TrainingArguments(output_dir="out", per_device_train_batch_size=4, num_train_epochs=3)  
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  
trainer = Trainer(model=model, args=args, train_dataset=ds["train"], data_collator=collator)  
trainer.train()

Question: What is the advantage of LoRA for custom training?
Answer: LoRA adds low-rank adapter matrices to each transformer layer, training only these small modules. This reduces GPU memory usage and training time, enabling fine-tuning with limited resources. Integration uses the peft library:
from peft import get_peft_model, LoraConfig  
config = LoraConfig(r=8, lora_alpha=16)  
peft_model = get_peft_model(model, config)  
trainer.model = peft_model  
trainer.train()
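After training, only the small adapter weights need to be saved; the directory name below is illustrative:
peft_model.save_pretrained("lora-adapters")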

Question: How do you execute a Python script to run fine-tuning from the command line?
Answer: You wrap your code in a script fine_tune.py and use standard Python invocation, passing hyperparameters via flags or environment variables:
python fine_tune.py --model_name gpt2 --train_file data.json --epochs 3
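Inside fine_tune.py, these flags are typically parsed with argparse (a minimal sketch; the flag names match the command above):
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model_name", default="gpt2")
parser.add_argument("--train_file", required=True)
parser.add_argument("--epochs", type=int, default=3)
args = parser.parse_args()
# args.model_name, args.train_file, and args.epochs then feed model loading and TrainingArguments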

Quiz
1. Which format is commonly used for fine-tuning examples?
A. Plain text files
B. Prompt-completion JSON objects (Correct)
C. XML documents
D. CSV without headers

2. The DataCollatorForLanguageModeling in a Trainer is responsible for:
A. Logging metrics
B. Preparing masked or causal batches (Correct)
C. Saving model checkpoints
D. Scheduling learning rate

3. The benefit of using LoRA over full fine-tuning is:
A. Increased model size
B. Lower resource usage and faster training (Correct)
C. Eliminating need for tokenization
D. Automatic hyperparameter tuning

4. To pass hyperparameters to a Python fine-tuning script, you typically use:
A. Hard-coded constants
B. argparse flags or environment variables (Correct)
C. Direct modifications in library source
D. Comments in the script
