
ML Q&A


Fundamental Concepts

  • What is bias vs variance?
  • What is overfitting vs underfitting?
  • How does regularization help prevent overfitting?
  • What is the difference between L1 and L2 regularization? (see the sketch below)
  • What is the difference between generative and discriminative models?
  • What is the difference between supervised and unsupervised learning?
  • What is the difference between LDA and PCA?
  • What is the curse of dimensionality?
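
For the L1 vs. L2 regularization question above, a minimal NumPy sketch contrasting the two penalties and their gradients (the weights and λ value are illustrative only):

```python
import numpy as np

def l1_penalty(w, lam):
    """L1 (lasso) penalty: lam * sum(|w|). Its subgradient lam * sign(w)
    pushes weights toward exactly zero, which yields sparse models."""
    return lam * np.sum(np.abs(w)), lam * np.sign(w)

def l2_penalty(w, lam):
    """L2 (ridge) penalty: lam * sum(w^2). Its gradient 2 * lam * w
    shrinks weights proportionally but rarely to exactly zero."""
    return lam * np.sum(w ** 2), 2 * lam * w

w = np.array([0.5, -0.01, 3.0])
print(l1_penalty(w, lam=0.1))
print(l2_penalty(w, lam=0.1))
```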

Evaluation Metrics

  • What are precision, recall, and F1 score? (see the sketch below)
  • What is an ROC curve, and how is it used?
  • What is AUC and how is it interpreted?
  • What is log loss / cross-entropy loss?
  • What are confusion matrices and how do you interpret them?
  • What is the difference between micro, macro, and weighted averaging in classification metrics?
  • How do you evaluate a model for multi-class vs multi-label classification?
  • What metrics are suitable for ranking models?
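
For the precision/recall/F1 question above, a minimal sketch computing all three from raw binary labels (the toy labels are illustrative only):

```python
def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall, and F1 from confusion-matrix counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```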

Optimization

  • How does gradient descent work? (see the sketch below)
  • What is stochastic gradient descent, and how is it different from full-batch gradient descent?
  • What is backpropagation, and how does it relate to neural networks?
  • What are common variants of gradient descent (SGD, Momentum, Adam, RMSProp)?
  • What are vanishing and exploding gradients?
  • How does learning rate affect training?
  • What is gradient clipping and why is it used?
  • How do optimizers differ in convergence speed and stability?
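
For the gradient descent and learning rate questions above, a minimal sketch minimizing f(x) = (x − 3)², showing how the same update rule diverges when the learning rate is too large:

```python
def gradient_descent(grad, x0, lr, steps):
    """Vanilla gradient descent: x <- x - lr * grad(x)."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def grad(x):
    """Gradient of f(x) = (x - 3)^2."""
    return 2 * (x - 3.0)

print(gradient_descent(grad, x0=0.0, lr=0.1, steps=100))  # converges near 3
print(gradient_descent(grad, x0=0.0, lr=1.1, steps=100))  # diverges: lr too high
```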

ML Algorithms

  • What is logistic regression? (forward calculation, loss function; see the sketch below)
  • What is K-Nearest Neighbors (KNN)?
  • What is decision tree learning?
  • What is random forest, and how does it differ from gradient boosting?
  • What is the difference between bagging and boosting?
  • What is K-Means clustering, and how does it work?
  • What is Support Vector Machine (SVM)?
  • What is Bayesian learning?
  • What is the difference between MAP and MLE?
  • What is Naive Bayes and when is it effective?
  • What are the assumptions behind linear regression?
  • How does ridge regression differ from lasso regression?
  • What are the strengths and weaknesses of tree-based methods?
  • What are ensemble methods, and why do they work?
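
For the logistic regression question above, a minimal NumPy sketch of the forward pass and the binary cross-entropy loss (weights and data are illustrative only):

```python
import numpy as np

def forward(X, w, b):
    """Logistic regression forward pass: p = sigmoid(X @ w + b)."""
    z = X @ w + b
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, p, eps=1e-12):
    """Binary cross-entropy: -mean(y*log p + (1-y)*log(1-p))."""
    p = np.clip(p, eps, 1 - eps)  # clip to avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.array([[0.5, 1.0], [2.0, -1.0]])
w = np.array([0.3, -0.2])
b = 0.1
p = forward(X, w, b)
print(p, cross_entropy(np.array([1, 0]), p))
```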

Data Issues

  • How does imbalanced data affect model performance?
  • What are common strategies to handle imbalanced data?
  • What is data drift and how do you handle it?
  • How do you handle missing data in ML pipelines?
  • How do you split your data into train/validation/test?
  • What are the benefits and risks of oversampling and undersampling? (see the sketch below)
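
For the oversampling question above, a minimal sketch of naive random oversampling; note the comment about applying it only to the training split:

```python
import random

def oversample_minority(X, y, seed=0):
    """Naive random oversampling: duplicate minority-class rows until both
    classes have equal counts. Apply only to the training split, or the
    duplicated rows will leak into validation/test."""
    rng = random.Random(seed)
    pos = [(x, t) for x, t in zip(X, y) if t == 1]
    neg = [(x, t) for x, t in zip(X, y) if t == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    data = pos + neg + extra
    rng.shuffle(data)
    return [x for x, _ in data], [t for _, t in data]

X, y = [[1], [2], [3], [4], [5]], [1, 0, 0, 0, 0]
Xb, yb = oversample_minority(X, y)
print(sum(yb), len(yb) - sum(yb))  # balanced class counts
```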

Deep Learning

  • What is the ReLU activation function?
  • How do you deal with vanishing gradients?
  • What are common activation functions and when to use them (ReLU, sigmoid, tanh, GELU, etc.)?
  • What are fully connected layers?
  • What is the role of initialization in deep learning?
  • What is dropout and how does it prevent overfitting? (see the sketch below)
  • What is the difference between an epoch, a batch, and an iteration?
  • What is early stopping and how does it help generalization?
  • What is the difference between feedforward networks and recurrent networks?
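
For the dropout question above, a minimal NumPy sketch of inverted dropout, the variant most frameworks implement:

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: during training, zero each unit with probability p
    and scale survivors by 1/(1-p) so the expected activation is unchanged;
    at inference time it is the identity."""
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones((2, 8))
print(dropout(h, p=0.5, training=True, rng=rng))
print(dropout(h, p=0.5, training=False, rng=rng))
```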

Transformers

  • Why do we divide the attention score by √dₖ in the Transformer? (see the sketch below)
  • Why are different weight matrices used to compute Q, K, and V?
  • Why do we use Multi-Head Attention?
  • What is the time complexity of attention?
  • What is KV cache and why is it used?
  • What is Multi-Query Attention (MQA)?
  • What is Grouped Query Attention (GQA)?
  • What are the shapes of Q, K, V in MHA, MQA, and GQA?
  • What is Flash Attention and how does it work?
  • How do you optimize memory usage in attention mechanisms?
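
For the √dₖ question above, a minimal NumPy sketch of single-head scaled dot-product attention (shapes and random inputs are illustrative only):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.
    The 1/sqrt(d_k) factor keeps the logits' variance near 1, so the
    softmax does not saturate as the head dimension grows."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 16)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 16)
```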

LLM Architecture & Inference

  • What are the key factors that affect LLM inference latency?
  • What are common LLM inference optimization techniques?
  • What is KV cache and how does it improve inference? (see the sketch below)
  • What is smart batching and how does it affect performance?
  • What is quantization and how does it help inference?
  • What is the tradeoff in using MQA or GQA?
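
For the KV cache question above, a minimal single-head NumPy sketch of autoregressive decoding with a cache; the weights, dimensions, and the `decode_step` helper are illustrative only:

```python
import numpy as np

def decode_step(x_t, Wq, Wk, Wv, cache):
    """One autoregressive step with a KV cache: only the new token's K and V
    are computed and appended; past keys/values are reused, so each step
    avoids recomputing attention inputs for the whole prefix."""
    q = x_t @ Wq
    cache["K"].append(x_t @ Wk)
    cache["V"].append(x_t @ Wv)
    K, V = np.stack(cache["K"]), np.stack(cache["V"])
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
cache = {"K": [], "V": []}
for t in range(5):                    # decode 5 tokens one at a time
    out = decode_step(rng.standard_normal(d), Wq, Wk, Wv, cache)
print(len(cache["K"]), out.shape)     # 5 cached keys, (8,) output
```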

LLM Fine-Tuning Techniques

  • What are typical fine-tuning methods for LLMs?
  • What is LoRA and how does it work? (see the sketch below)
  • How is the loss calculated in supervised fine-tuning (SFT)?
  • What is the difference between prefix tuning and prompt tuning?
  • What is RLHF and how does it work?
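
For the LoRA question above, a minimal NumPy sketch of the low-rank update W + (α/r)·BA with the standard initialization (A small random, B zero, so training starts from the frozen W); all names and sizes are illustrative:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """LoRA: freeze W and learn a low-rank update, h = x @ (W + (alpha/r) B A)^T.
    A is (r, d_in) and B is (d_out, r), so B @ A is a rank-r matrix with far
    fewer trainable parameters than W itself."""
    r = A.shape[0]
    delta = (alpha / r) * (B @ A)
    return x @ (W + delta).T

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 16, 2
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in)) * 0.01  # A: small random init
B = np.zeros((d_out, r))                   # B: zeros, so the initial delta is 0
x = rng.standard_normal((4, d_in))
print(lora_forward(x, W, A, B, alpha=16).shape)  # (4, 16)
```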

LLM Training Optimization

  • What is mixed-precision training and how does it work? (see the sketch below)
  • What are the benefits and tradeoffs of FP16 vs BF16?
  • What is gradient checkpointing?
  • What is Distributed Data Parallel (DDP)?
  • What is Fully Sharded Data Parallel (FSDP)?
  • What is ZeRO (Zero Redundancy Optimizer) and its stages?
  • How does ZeRO enable large-scale model training?
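
For the mixed-precision question above, a minimal NumPy sketch of the loss-scaling trick that keeps small gradients from flushing to zero when cast to FP16 (the scale value is illustrative):

```python
import numpy as np

def scaled_backward(grad_fp32, scale=1024.0):
    """Loss scaling from mixed-precision training: multiply the loss (hence
    the gradients) by `scale` before casting to FP16 so tiny values do not
    underflow, then unscale in FP32 before the optimizer step."""
    grad_fp16 = (grad_fp32 * scale).astype(np.float16)  # what FP16 can hold
    return grad_fp16.astype(np.float32) / scale         # unscaled for the update

tiny = np.array([1e-8], dtype=np.float32)
print(tiny.astype(np.float16))   # [0.] -- underflows in FP16
print(scaled_backward(tiny))     # ~[1e-08] -- survives thanks to scaling
```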

Retrieval-Augmented Generation (RAG)

  • Why use RAG for LLMs?
  • How does RAG architecture work? (see the sketch below)
  • What are the steps to build a RAG-based chatbot?
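
For the RAG questions above, a minimal end-to-end sketch of the retrieve-then-augment flow; the bag-of-words "embedding", the toy documents, and the prompt template are all stand-ins for a real encoder, vector store, and LLM call:

```python
from collections import Counter
from math import sqrt

DOCS = [
    "The KV cache stores past keys and values to speed up decoding.",
    "LoRA fine-tunes a model by learning low-rank weight updates.",
    "RAG retrieves relevant documents and adds them to the prompt.",
]

def embed(text):
    """Toy bag-of-words 'embedding'; a real system uses a trained encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=1):
    """Retrieval step: rank documents by similarity to the query."""
    q = embed(query)
    return sorted(DOCS, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

# Augmentation step: stuff the retrieved context into the prompt for the LLM.
query = "How does RAG work?"
context = "\n".join(retrieve(query))
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```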


Learning Resources

GitHub Repositories


Blogs & Articles

General ML

Clustering

Transformers & DL


Research Papers

  • Attention Is All You Need – Vaswani et al.
    The original paper that introduced the Transformer architecture.
  • Learning Transferable Visual Models From Natural Language Supervision – Radford et al.
    The OpenAI paper introducing CLIP, a vision-language model that aligns image and text representations.

Courses