13 July, 2025

Large Language Models (LLM) Concepts - Interview Questions and Answers

In this post, I explain What is LLM?, Language Modeling Basics, Tokenization & Words in LLMs, Neural Network Foundations, Transformer Architecture, Scaling: Parameters, FLOPs, Emergent Abilities, Architectural Variants, Training Paradigms, Sampling & Decoding Techniques, In-context Learning & Prompting, Hallucinations, Bias & Reliability, Explainability & Interpretability, Retrieval-Augmented Generation (RAG), Multimodality & Multimodal LLMs (MLLMs), and Domain-Specialization & Fine-tuning. If you want my complete Large Language Models (LLM) Concepts document that additionally explains the following topics, please message me on LinkedIn:
Top Models Overview (GPT-series, BERT family, PaLM, LLaMA, Claude, Gemini), Prompt Engineering Strategies, LLM Usage Patterns, Best Practices for Reliability, Efficiency Optimization and Integration & Tooling

Question: What is a Large Language Model (LLM)?
Answer: A Large Language Model (LLM) is a type of neural network–based language model trained on massive corpora of text to learn statistical patterns of human language. The full form "Large Language Model" emphasizes both the scale (often billions or trillions of parameters) and the focus on language understanding and generation. The term originated in the evolution from traditional n-gram models through recurrent neural networks to the breakthrough transformer architecture, which enabled effective scaling of both model size and training data. In practice, LLM refers to systems like GPT, BERT, and their derivatives, which can perform diverse tasks (from text completion to translation) by predicting the next token in a sequence based on their extensive internalized knowledge of syntax, semantics, and real-world context.

Question: What are large language models used for?
Answer: Large language models can be used for a wide array of applications: automated drafting of documents, conversational agents, code synthesis, content summarization, and more. By leveraging deep attention mechanisms, an LLM can generate fluent, contextually appropriate prose, answer complex queries, and adapt to specialized domains through fine-tuning. Professionals harness these capabilities to accelerate workflows, enhance decision support, and build intelligent systems that interact with humans in natural language. Example: A developer might prompt an LLM to draft a client-facing report outline; the model draws on its learned patterns to produce a coherent structure and suggested language, which the developer then refines and customizes for accuracy and tone.

Question: What is a language model and what are models of language production?
Answer: A language model is a statistical or neural construct that assigns probabilities to sequences of words, capturing the likelihood of a particular word following a given context. It embodies the principles of models of language production, which seek to replicate how humans generate coherent spoken or written text by learning patterns of syntax, semantics, and discourse. In essence, a language model learns to estimate P(wordₙ | word₁…wordₙ₋₁), enabling it to predict or generate the next token in a sequence by internalizing the distributional properties of language from vast text corpora.

Question: What is a language model example?
Answer:
Example: In an n-gram model, the probability of the next word depends only on the preceding (n–1) words, e.g., P(wₙ | wₙ₋₂, wₙ₋₁) for a trigram. This simple approach captures local context but struggles with long-range dependencies.
Example: A recurrent neural network (RNN) processes one token at a time, maintaining a hidden state that carries information from all previous tokens, which allows it to model longer contexts but may suffer from vanishing gradients.
Example: The transformer architecture revolutionizes language modeling by using self-attention mechanisms to evaluate relationships among all tokens in a sequence in parallel, achieving superior performance on tasks requiring both local and global context understanding.
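Example: A minimal Python sketch of a bigram (2-gram) language model; the toy corpus and function names are illustrative assumptions, not from this post:

from collections import defaultdict, Counter

corpus = "the cat sat on the mat . the cat slept on the sofa .".split()

# Count how often each word follows each preceding word (bigram counts).
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_probability(prev, nxt):
    # Estimate P(next | prev) from relative bigram frequencies.
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

print(next_word_probability("the", "cat"))  # 0.5 in this toy corpus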

Question: What is the difference between a token and a word in LLMs, and what is a vocabulary?
Answer: A word is the traditional linguistic unit—a sequence of characters separated by whitespace or punctuation in human language. A token, by contrast, is the elemental input unit that an LLM actually processes. Tokens can be entire words, punctuation marks, or fragments of words depending on the chosen tokenization scheme. The vocabulary of an LLM is the fixed set of all tokens it recognizes, typically ranging from tens of thousands to hundreds of thousands of entries. By mapping every possible input to one of these tokens, the model converts raw text into numerical IDs, enabling consistent downstream computation.

Question: What are subword units (or model words) and why are they used?
Answer: Subword units—often called model words—are fragments of words derived by algorithms like Byte Pair Encoding or WordPiece. They bridge the gap between full-word tokenization (which struggles with rare or novel words) and character-level tokenization (which can produce excessively long sequences). By breaking unfamiliar or compound words into known sub-components, the model limits its vocabulary size while retaining the ability to represent new terms.
Example: The word “unbelievable” might be split into “un”, “##believ”, “##able”. Each piece exists in the vocabulary, so the model can handle “unbelievable” even if it has never seen that exact word during training. This strategy reduces out-of-vocabulary failures and keeps sequence lengths manageable, improving both efficiency and generalization.
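Example: A short sketch using the Hugging Face transformers tokenizer, assuming the bert-base-uncased checkpoint is available; the exact subword split may differ from the illustrative pieces above:

from transformers import AutoTokenizer

# Load a WordPiece tokenizer (bert-base-uncased is an assumed checkpoint for illustration).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

pieces = tokenizer.tokenize("unbelievable")
print(pieces)                                   # subword pieces from the fixed vocabulary
print(tokenizer.convert_tokens_to_ids(pieces))  # the numerical token IDs the model actually sees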

Question: What are embeddings and why are they fundamental in LLMs?
Answer: Embeddings are learned dense-vector representations that map discrete tokens into a continuous numerical space. By assigning each token in the vocabulary to a point in a high-dimensional vector space, the model captures semantic and syntactic relationships—tokens with similar meanings lie close together. During training, these vectors adjust so that related words (e.g., “king” and “queen”) become geometrically aligned according to linguistic patterns. This continuous representation allows downstream neural layers to perform algebraic operations on language concepts rather than manipulating raw, sparse one-hot encodings.
Example: The token “computer” might map to a vector like [0.12, 0.45, 0.78,…], while “laptop” maps to [0.10, 0.40, 0.80,…], placing them near each other in embedding space because of their related meanings.
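Example: A minimal NumPy sketch of how embedding geometry reflects meaning; the vectors are toy values, not real model weights:

import numpy as np

# Toy 4-dimensional embeddings (illustrative values only).
embeddings = {
    "computer": np.array([0.12, 0.45, 0.78, 0.10]),
    "laptop":   np.array([0.10, 0.40, 0.80, 0.12]),
    "banana":   np.array([0.90, 0.05, 0.02, 0.60]),
}

def cosine_similarity(a, b):
    # Cosine similarity: close to 1.0 for semantically related tokens.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["computer"], embeddings["laptop"]))  # high (related meanings)
print(cosine_similarity(embeddings["computer"], embeddings["banana"]))  # noticeably lower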

Question: What is a feed-forward network in LLMs and how does it operate?
Answer: A feed-forward network (often called a point-wise MLP in transformer layers) applies two sequential linear transformations with a non-linear activation in between to each token’s representation independently. After the self-attention mechanism contextualizes each token, the feed-forward network projects that context vector into a higher dimensional space, applies a non-linearity (e.g., GeLU), and then projects it back to the model’s original dimension. This process injects complexity and non-linearity, enabling the model to learn sophisticated feature interactions and hierarchical abstractions beyond what attention alone can provide.
Example: Given a contextualized vector x, the network computes y = W₂·(GeLU(W₁·x + b₁)) + b₂, where W₁ and W₂ are learned weight matrices and b₁, b₂ are biases. This transforms x into richer representations before passing them to the next layer.
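Example: A NumPy sketch of the position-wise feed-forward computation y = W₂·GeLU(W₁·x + b₁) + b₂; the dimensions and random weights are toy assumptions:

import numpy as np

def gelu(x):
    # Tanh approximation of the GeLU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

d_model, d_ff = 8, 32                       # toy sizes; real models use much larger dimensions
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_ff, d_model)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_model, d_ff)), np.zeros(d_model)

def feed_forward(x):
    # Project up to d_ff, apply the non-linearity, project back down to d_model.
    return W2 @ gelu(W1 @ x + b1) + b2

x = rng.normal(size=d_model)                # one contextualized token vector
print(feed_forward(x).shape)                # (8,): back to the model dimension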

Question: What is the Transformer architecture?
Answer: The Transformer architecture is a neural network design that dispenses with recurrence and convolutions, relying instead on attention mechanisms to process entire sequences in parallel. It consists of stacked layers that alternate between self-attention modules and position-wise feed-forward networks, each wrapped in residual connections and layer normalization. By treating every token’s representation as a query, key, and value vector, the Transformer can dynamically weight the influence of all other tokens when encoding contextual information, enabling efficient modeling of both short- and long-range dependencies across very long text.

Question: What is self-attention and how does it work?
Answer: Self-attention computes pairwise interactions among all tokens in a sequence by projecting each token embedding into three distinct spaces—queries (Q), keys (K), and values (V). The attention score between tokens i and j is obtained as the scaled dot product of Qᵢ and Kⱼ, which is normalized via softmax to create weights that modulate Vⱼ when aggregating information for token i. This yields a context-aware representation for every position, allowing the model to focus selectively on relevant words regardless of their distance.
Example: In the sentence “The cat sat on the mat,” when encoding “mat,” the model can assign high attention weight to “sat” and “cat,” ensuring the generated representation captures the grammatical subject and action.
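Example: A compact NumPy sketch of scaled dot-product self-attention for a single head; shapes and random weights are toy assumptions:

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Project token embeddings into queries, keys, and values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # pairwise scaled dot products
    weights = softmax(scores)                # per-token attention distributions
    return weights @ V                       # context-aware representations

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))                 # 6 tokens, embedding dimension 16
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (6, 16)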

Question: What roles do the encoder and decoder play in a Transformer?
Answer: The encoder stack ingests an input sequence and transforms it into a rich sequence of continuous representations through repeated self-attention and feed-forward layers. In sequence-to-sequence settings, the decoder stack generates output tokens autoregressively: each decoder layer applies self-attention over previously generated tokens, then cross-attention over encoder outputs, followed by its own feed-forward network. This dual-attention scheme enables the decoder to ground its predictions in both its own context and the encoded source, making it ideal for tasks such as translation, summarization, and conditional text generation.

Question: What are model parameters and how does scale impact LLM performance?
Answer: Model parameters are the individual weights and biases in a neural network that are adjusted during training to capture language patterns. Scale refers to the total count of these parameters, ranging from millions in early models to hundreds of billions or even trillions in cutting-edge LLMs, together with the associated compute, measured in FLOPs (floating-point operations). As parameter count grows, the model’s capacity to memorize and generalize from vast text corpora increases, enabling finer-grained representations of syntax, semantics, and world knowledge. However, larger scale also demands exponentially more compute for training and inference, drives up latency and cost, and can give diminishing returns if not paired with architectural optimizations or efficient parallelism.
Example: A 175-billion-parameter model like GPT-3 requires on the order of 3×10^23 FLOPs during pre-training, delivering dramatic gains in text coherence over its 1.5-billion-parameter predecessor, yet its resource demands necessitate specialized clusters and optimized libraries.
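Example: A back-of-the-envelope check of that figure, using the widely cited approximation that pre-training compute is roughly 6 × parameters × training tokens (the 300-billion-token count is an assumption based on public reports about GPT-3):

params = 175e9                  # GPT-3 parameter count
tokens = 300e9                  # approximate pre-training tokens (assumption)
flops = 6 * params * tokens     # heuristic: ~6 FLOPs per parameter per training token
print(f"{flops:.2e}")           # about 3.15e+23, consistent with the 3×10^23 figure above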

Question: What are emergent abilities in LLMs and why do they matter?
Answer: Emergent abilities are capabilities that materialize only once an LLM crosses a critical scale threshold, appearing unpredictably rather than increasing smoothly with size. These include sophisticated reasoning, arithmetic, code synthesis, or translation prowess that smaller models lack despite similar training protocols. Such abilities suggest that large-scale models internalize latent structures of language and logic in ways that aren’t linearly extrapolated from smaller siblings. Recognizing and harnessing emergent behaviors empowers practitioners to unlock novel applications, but also raises challenges in predictability, safety, and alignment, as these latent capabilities can appear in unexpected contexts. If you like this blog post, I’m happy to explain further to you and answer your questions. You can message me on LinkedIn at https://www.linkedin.com/in/inderpsingh/
Example: Only above roughly 10 billion parameters do some LLMs begin to solve multi-step arithmetic or follow chain-of-thought (CoT) prompts reliably, exhibiting reasoning skills that simply did not exist in models at the 1-billion-parameter scale.

Question: What is an encoder-only architecture?
Answer: An encoder-only architecture processes the entire input sequence bidirectionally to build deep contextualized representations, optimizing for understanding tasks rather than generation. By attending to both left and right contexts simultaneously, it excels at comprehension-oriented objectives like masked language modeling and sentence classification.
Example: BERT masks tokens during pre-training and uses its encoder stack to predict them, making it highly effective for tasks such as named entity recognition and sentiment analysis.

Question: What is a decoder-only architecture?
Answer: A decoder-only architecture generates text autoregressively by predicting each next token based solely on previously generated tokens and the original prompt. This unidirectional flow supports fluent, coherent generation across diverse contexts, from dialogue to document completion.
Example: GPT models use stacked decoder layers with self-attention to produce human-like prose, code, or answers to open-ended queries without needing a separate encoder.

Question: What is an encoder-decoder architecture?
Answer: An encoder-decoder architecture combines an encoder to ingest and contextualize an input sequence with a decoder that attends to those contextual embeddings while generating an output sequence. This sequence-to-sequence design is ideal for conditional generation tasks where mapping between input and output domains is required.
Example: T5 uses its encoder to understand a text-to-text prompt (“translate English to German: …”) and its decoder to synthesize the translated output, supporting tasks like translation, summarization, and question answering.

Question: What are pre-training and self-supervised learning in LLMs?
Answer: Pre-training is the initial phase where an LLM ingests massive unlabeled text corpora to learn general language patterns by predicting masked or next tokens. This process uses self-supervised learning, meaning the model generates its own training signals—such as masking 15% of tokens and asking the model to recover them—without human annotations. Through repeated exposure, the LLM internalizes syntax, semantics, and world facts in its billions of model parameters, establishing a broad foundation for downstream tasks.
Example: During pre-training, the model sees "The cat ___ on the mat" with "sat" masked; it learns to predict "sat" by leveraging context understanding after vast reading across the internet.
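Example: A tiny Python sketch of building masked-language-modeling targets; the sentence and the 15% ratio mirror the description above, and the helper names are illustrative:

import random

tokens = "the cat sat on the mat".split()
MASK = "[MASK]"

random.seed(0)
# Choose roughly 15% of positions (at least one) as prediction targets.
num_to_mask = max(1, round(0.15 * len(tokens)))
mask_positions = set(random.sample(range(len(tokens)), k=num_to_mask))

masked = [MASK if i in mask_positions else tok for i, tok in enumerate(tokens)]
labels = [tok if i in mask_positions else None for i, tok in enumerate(tokens)]

print(masked)   # the masked position is hidden from the model
print(labels)   # the original token at that position is the training target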

Question: What is fine-tuning and how does it specialize an LLM?
Answer: Fine-tuning takes a pre-trained LLM and continues training on a smaller, task-specific labeled dataset. This refines the model’s weights to excel at defined objectives, such as sentiment analysis or question answering, by adjusting parameters toward the nuances of the target domain. Fine-tuning bridges the gap between generic language understanding and precise task performance, boosting accuracy and reducing hallucinations for specialized workflows.
Example: A pre-trained LLM fine-tuned on legal contracts learns legal terminology and clause structures, enabling it to classify contract types or flag unusual clauses with high precision.

Question: What is instruction tuning and why is it important?
Answer: Instruction tuning further refines an LLM by training on pairs of natural-language instructions and desired outputs. Rather than simply learning from input-output examples, the model learns to follow human-readable directions, improving its ability to generalize to new tasks specified at inference time. This paradigm elevates the model from pattern completer to an interactive assistant capable of interpreting diverse prompts with minimal examples.
Example: Given the instruction "Summarize this article in three bullet points," an instruction-tuned LLM structures its response as requested, even for articles it has never seen, because it has learned the mapping from instruction style to output format.

Question: What is distillation and how does it optimize LLM deployment?
Answer: Distillation compresses a large “teacher” LLM into a smaller “student” model by having the student mimic the teacher’s output distributions. The student learns to reproduce soft logits or probability distributions over tokens, capturing the teacher’s knowledge in a more compact architecture. This reduces inference latency, memory footprint, and cost while retaining a high fraction of the teacher’s performance—enabling practical deployment in real-time or resource-constrained environments (such as smartphones).
Example: A 175-billion-parameter teacher model can distill its behavior into a 10-billion-parameter student that runs on a single GPU, delivering near-teacher-level fluency for conversational tasks with significantly lower compute requirements.
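Example: A minimal NumPy sketch of the soft-target distillation loss (KL divergence between temperature-softened teacher and student distributions); the logits are toy values:

import numpy as np

def softmax(z, temperature=1.0):
    z = np.asarray(z, dtype=float) / temperature
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) on temperature-softened distributions, scaled by T^2.
    p = softmax(teacher_logits, temperature)     # soft targets from the teacher
    q = softmax(student_logits, temperature)     # student predictions
    return float(np.sum(p * np.log(p / q))) * temperature ** 2

teacher = [4.0, 1.5, 0.2, -1.0]                  # toy next-token logits
student = [3.0, 1.0, 0.5, -0.5]
print(distillation_loss(teacher, student))       # smaller value means a closer match to the teacher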

Question: What is Top-k sampling and how does it influence generation?
Answer: Top-k sampling restricts the model’s next-token selection to the k highest-probability tokens, then samples from that truncated distribution. By limiting choices to the most likely candidates, it avoids low-probability outliers while retaining randomness. This balances coherence and creativity: small k yields conservative, predictable text; larger k allows more diversity but risks incoherence.
Example: If the model’s next-token probabilities rank "the" (0.30), "a" (0.20), "this" (0.10), and dozens more below, setting k=3 means sampling only among "the," "a," and "this," preventing obscure tokens from appearing.
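Example: A short NumPy sketch of top-k sampling over a toy next-token distribution; the vocabulary and probabilities mirror the illustration above:

import numpy as np

def top_k_sample(probs, k, rng):
    probs = np.asarray(probs, dtype=float)
    top = np.argsort(probs)[-k:]              # indices of the k most probable tokens
    truncated = np.zeros_like(probs)
    truncated[top] = probs[top]
    truncated /= truncated.sum()              # renormalize over the top-k candidates
    return rng.choice(len(probs), p=truncated)

vocab = ["the", "a", "this", "zebra", "qux"]
probs = [0.30, 0.20, 0.10, 0.02, 0.01]        # remaining probability mass omitted for brevity
rng = np.random.default_rng(0)
print(vocab[top_k_sample(probs, k=3, rng=rng)])   # always one of "the", "a", "this"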

Question: What is nucleus sampling (top-p) and why use it?
Answer: Nucleus sampling—also called top-p—selects the smallest set of tokens whose cumulative probability meets or exceeds a threshold p (e.g., 0.9), then samples from that dynamic pool. Unlike a fixed k, the pool adapts to the model’s confidence: in high-certainty contexts it is small; in uncertain contexts it expands. This yields more reliable diversity control and better fluency across varied prompts.
Example: If the top probabilities are "the" (0.4), "and" (0.3), "to" (0.2), "in" (0.05), … setting p=0.85 includes "the," "and," and "to," since their sum (0.9) surpasses 0.85, while excluding lower tokens.
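Example: A matching NumPy sketch of nucleus (top-p) sampling; the probabilities mirror the illustration above:

import numpy as np

def top_p_sample(probs, p, rng):
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                 # tokens sorted by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1     # smallest prefix whose mass meets or exceeds p
    nucleus = order[:cutoff]
    truncated = np.zeros_like(probs)
    truncated[nucleus] = probs[nucleus]
    truncated /= truncated.sum()
    return rng.choice(len(probs), p=truncated)

vocab = ["the", "and", "to", "in", "of"]
probs = [0.40, 0.30, 0.20, 0.05, 0.05]
rng = np.random.default_rng(0)
print(vocab[top_p_sample(probs, p=0.85, rng=rng)])  # sampled only from "the", "and", "to"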

Question: How does beam search work and when is it preferred?
Answer: Beam search is a deterministic decoding strategy that maintains b parallel hypotheses (beams) at each step, expanding each by all possible next tokens and retaining the top b sequences by cumulative log-probability. It prioritizes globally coherent outputs by exploring multiple paths simultaneously, reducing the risk of locally optimal but globally suboptimal choices. Beam width b governs exploration depth versus computational cost.
Example: With b=3, the model keeps its three best partial sentences—e.g., "The cat sat," "The cat is," "A cat sat"—then extends and re-scores them at each time step, ultimately selecting the highest-scoring complete sentence.
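Example: A self-contained Python sketch of beam search over a toy next-token table that stands in for a real language model; the table and sentences are illustrative assumptions:

import math

def next_token_probs(sequence):
    # Toy conditional distributions keyed by the last token; a real system would query the LLM.
    table = {
        "<s>": {"The": 0.6, "A": 0.4},
        "The": {"cat": 0.7, "dog": 0.3},
        "A":   {"cat": 0.5, "dog": 0.5},
        "cat": {"sat": 0.6, "is": 0.4},
        "dog": {"sat": 0.5, "is": 0.5},
        "sat": {"</s>": 1.0},
        "is":  {"</s>": 1.0},
    }
    return table[sequence[-1]]

def beam_search(beam_width=3, max_len=5):
    beams = [(["<s>"], 0.0)]                         # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "</s>":
                candidates.append((seq, score))      # finished hypotheses carry over unchanged
                continue
            for tok, prob in next_token_probs(seq).items():
                candidates.append((seq + [tok], score + math.log(prob)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]

print(beam_search())   # (['<s>', 'The', 'cat', 'sat', '</s>'], highest cumulative log-probability)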

Question: What is in-context learning and prompting?
Answer: In-context learning refers to an LLM’s ability to adapt its output based solely on examples or instructions provided in the prompt, without updating its internal weights. The model treats the prompt as a temporary context window, extracting patterns from input-output pairs or directives and applying them to generate appropriate continuations. Prompting is the craft of designing that context—structuring instructions, examples, or questions—to steer the model toward desired behaviors, effectively “programming” it at inference time.

Question: What are zero-shot, few-shot, and chain-of-thought prompting?
Answer: Zero-shot prompting supplies only a task description or instruction, relying on the LLM’s pre-trained knowledge to perform without exemplars.
Example: Asking “Translate ‘Good morning’ to French.” yields “Bonjour” with no further context.
Few-shot prompting embeds a handful of input–output pairs in the prompt, demonstrating the task format so the model infers the mapping for new instances.
Example:
Q: Capital of Italy?
A: Rome
Q: Capital of Japan?
A: Tokyo
Q: Capital of Canada?
A:
The model completes “Ottawa” by analogy.
Chain-of-thought prompting encourages the model to articulate intermediate reasoning steps before the final answer, improving performance on complex tasks by making its internal deliberations explicit.

Question: What techniques enable explainability and interpretability in LLMs for transparency and debugging?
Answer: Techniques for model transparency include attention visualization, where the weights from self-attention layers are projected as heatmaps to reveal which tokens influence each prediction. By tracing high-attention scores, practitioners can detect spurious correlations—such as a model over-relying on punctuation to answer questions—and adjust prompts or fine-tuning data to correct behavior.
Example: Visualizing the attention pattern for the prompt "Paris is the capital of ___" shows strong links between "Paris" and the mask token, confirming that the model grounds its prediction in the correct context.

Question: What are feature-importance methods like LIME and SHAP?
Answer: Feature-importance methods like LIME and SHAP approximate the LLM’s local decision boundary by perturbing input tokens and measuring output changes. These approaches assign an importance score to each token, highlighting which words drive the model’s response.
Example: Applying SHAP to a sentiment-analysis prompt can uncover that the word "unfortunately" disproportionately flips the sentiment from positive to negative, guiding data augmentation to balance emotional cues.
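Example: A minimal perturbation-based importance sketch showing the core idea behind such methods (not the actual LIME or SHAP libraries); the scoring function and word weights are toy assumptions:

def sentiment_score(tokens):
    # Stand-in for a model's positive-sentiment probability (toy word weights).
    weights = {"love": 0.3, "great": 0.4, "unfortunately": -0.6, "broke": -0.3}
    return 0.5 + sum(weights.get(t, 0.0) for t in tokens)

sentence = "unfortunately the screen broke but I love the design".split()
baseline = sentiment_score(sentence)

for i, token in enumerate(sentence):
    perturbed = sentence[:i] + sentence[i + 1:]           # drop one token at a time
    importance = baseline - sentiment_score(perturbed)    # change in the model's output
    print(f"{token:15s} importance {importance:+.2f}")    # "unfortunately" gets the largest-magnitude (negative) score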

Question: What are probing classifiers and diagnostic heads?
Answer: Probing classifiers and diagnostic heads involve training lightweight classifiers on internal hidden states to test for encoded linguistic properties—such as part-of-speech tags or syntactic dependencies—revealing what knowledge the LLM has internalized.
Example: A probe trained on layer 5 embeddings might achieve high accuracy on subject–verb agreement tasks, indicating that early layers capture grammatical structure.
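Example: A scikit-learn sketch of a probing classifier; the hidden states here are random stand-ins, so accuracy stays near chance, whereas real layer activations with genuine labels would reveal whether the property is encoded:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(200, 64))      # stand-in for layer-5 hidden vectors (200 tokens, 64 dims)
labels = rng.integers(0, 2, size=200)           # toy labels, e.g. 0 = singular subject, 1 = plural subject

probe = LogisticRegression(max_iter=1000)       # lightweight diagnostic classifier
probe.fit(hidden_states[:150], labels[:150])
accuracy = probe.score(hidden_states[150:], labels[150:])
print(f"probe accuracy: {accuracy:.2f}")        # high accuracy on real activations suggests the property is encoded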

Question: What is counterfactual analysis and how do influence functions help in debugging LLMs?
Answer: Counterfactual analysis and influence functions trace the impact of specific training examples on a given prediction, pinpointing data points that cause unwanted behaviors. By identifying and editing or removing problematic examples from the training set, teams can reduce biases or hallucinations at their root.
Example: Influence functions might reveal that a rare, misannotated news article disproportionately drives a false historical claim, prompting its correction in the training corpus.

Question: What is Retrieval-Augmented Generation (RAG)?
Answer: Retrieval-Augmented Generation (RAG) is a hybrid framework that combines an LLM’s generative capabilities with an external retrieval system. When a prompt is input, the retrieval component searches a knowledge base, such as document embeddings in a vector database, for relevant context passages. These retrieved snippets are then concatenated with the original prompt and fed into the LLM, which generates responses grounded in up-to-date, authoritative sources rather than solely relying on its pre-trained weights. This architecture ensures the model can access fresh or domain-specific information on the fly while preserving the LLM’s fluent text generation.
Example: In a customer-support scenario, a RAG system retrieves the latest product manual section on "warranty policy" and includes verbatim policy language in its answer, ensuring the response reflects current terms and eliminates guesswork.
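Example: A minimal Python sketch of the retrieve-then-generate flow; the passages, embeddings, and query vector are toy assumptions standing in for a real embedding model and vector database:

import numpy as np

knowledge_base = [
    ("Warranty covers manufacturing defects for 24 months.", np.array([0.9, 0.1, 0.0])),
    ("The device charges fully in about two hours.",         np.array([0.1, 0.8, 0.2])),
    ("Returns are accepted within 30 days of purchase.",     np.array([0.3, 0.2, 0.7])),
]

def retrieve(query_embedding, top_k=1):
    # Rank passages by cosine similarity to the query embedding.
    scored = []
    for text, emb in knowledge_base:
        similarity = float(np.dot(query_embedding, emb) /
                           (np.linalg.norm(query_embedding) * np.linalg.norm(emb)))
        scored.append((similarity, text))
    return [text for _, text in sorted(scored, reverse=True)[:top_k]]

question = "How long is the warranty?"
query_embedding = np.array([0.85, 0.15, 0.05])     # assumed output of an embedding model
context = "\n".join(retrieve(query_embedding))

# The retrieved passage is concatenated with the prompt so the LLM answers from it.
prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer using only the context above."
print(prompt)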

Question: How does external context retrieval in RAG reduce misinformation?
Answer: By fetching and incorporating precise, source-verified content at inference time, external context retrieval anchors the LLM’s outputs to actual documents, dramatically lowering the risk of fabrications or outdated knowledge. Instead of hallucinating facts, the model quotes or paraphrases retrieved passages, and can even cite source identifiers back to users. This tight coupling of retrieval and generation creates a feedback loop: if the retrieved context lacks supporting evidence, the model signals uncertainty rather than confidently inventing details.

Question: What is multimodality in the context of LLMs and what are Multimodal LLMs (MLLMs)?
Answer: Multimodality refers to an AI model’s ability to process and integrate information from different data types—text, images, audio, or code—within a single architecture. Multimodal LLMs (MLLMs) extend pure-text LLMs by adding specialized encoders or tokenizers for non-text inputs, then unifying their representations through attention mechanisms. This fusion enables the model to reason across modalities, grounding language understanding in visual, auditory, or structural cues.
Example: Given an image of a bar chart and the prompt “Describe the trend,” an MLLM attends to visual tokens representing bars and textual tokens of the question to generate: “The chart shows sales rising steadily from Q1 to Q4.”

Question: How do MLLMs accept and utilize images, audio, and code?
Answer: MLLMs transform each modality into a common embedding space before feeding them into shared transformer layers. For images, a vision encoder (e.g., a convolutional or patch-based transformer) converts pixel arrays into token embeddings that align with word embeddings. For audio, a spectrogram or waveform encoder tokenizes sound patterns into sequences analogous to text tokens. For code, specialized tokenizers split syntax into logical units—identifiers, operators, literals—mirroring text tokenization. The joint attention layers then attend across all embeddings, enabling cross-modal reasoning.
Example: When provided with a short Python snippet and asked, “What does this function return?”, the model processes code tokens and delivers the expected return value explanation.
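Example: A NumPy sketch of turning an image into visual tokens (ViT-style 16×16 patches plus a linear projection); the image data and projection matrix are toy stand-ins:

import numpy as np

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))                  # stand-in for a 224x224 RGB image

patch = 16
patches = image.reshape(224 // patch, patch, 224 // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)   # (196, 768) flattened patches

projection = rng.normal(size=(patch * patch * 3, 1024))   # toy linear projection to the model dimension
visual_tokens = patches @ projection                      # 196 patch embeddings, aligned with word embeddings
print(visual_tokens.shape)                                # (196, 1024)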

Question: What is domain-specialization in the context of LLMs and how does fine-tuning enable it?
Answer: Domain-specialization tailors a pre-trained LLM to excel within a narrow field—such as healthcare, finance, or legal—by exposing it to sector-specific terminology, style, and knowledge. Fine-tuning achieves this by continuing training on a curated corpus of labeled or unlabeled texts from that domain, adjusting the model’s parameters so it prioritizes relevant concepts and patterns. This focused adaptation sharpens accuracy, reduces hallucinations on niche queries, and embeds domain conventions into the model’s latent space.
Example: In the healthcare domain, fine-tuning a general LLM on 50,000 annotated clinical discharge summaries yields a medical assistant that accurately extracts diagnoses, recommends follow-up tests, and drafts patient summaries in physician-approved language.

Question: How is fine-tuning for domain-specialization performed in practice?
Answer: The process begins by gathering a high-quality dataset: regulatory filings for finance, clinical notes for medicine, or case law for legal. Next, text is preprocessed and formatted—often with task-specific prompts (e.g., “Classify this medical note as diagnosis or treatment plan”). The LLM undergoes additional training epochs on this data at a lower learning rate to avoid catastrophic forgetting of general language skills. Validation monitors domain-relevant metrics (e.g., F1 score on medical entity recognition). Finally, the specialized model is deployed via APIs or integrated into pipelines, where it demonstrates heightened fluency and factual precision within its target sector.
Example: A legal LLM fine-tuned on thousands of judicial opinions and statutes can reliably extract legal citations, summarize rulings, and flag precedent-relevant clauses, helping lawyers draft memos more efficiently.
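Example: A hedged sketch of a domain fine-tuning run with the Hugging Face Trainer API; the checkpoint name, file names, and hyperparameters are illustrative assumptions to be swapped for your own domain corpus:

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"                  # assumed base model for a classification task
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Assumed local JSON files with "text" and "label" fields (e.g., clinical notes).
dataset = load_dataset("json", data_files={"train": "clinical_notes_train.json",
                                           "validation": "clinical_notes_val.json"})
tokenized = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
                        batched=True)

args = TrainingArguments(output_dir="domain-model",
                         learning_rate=2e-5,      # low learning rate to limit catastrophic forgetting
                         num_train_epochs=3,
                         per_device_train_batch_size=8)

trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"],
                  eval_dataset=tokenized["validation"])
trainer.train()                                   # validate afterwards on domain-relevant metrics (e.g., F1)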

Note: To get my latest publications, you can follow me on Kaggle here.
