All AI / ML Engineer Interview Flashcards
All 150 AI / ML Engineer interview flashcards.
Easy (50)
- What is supervised learning?
- What is unsupervised learning?
- What is overfitting?
- What is underfitting?
- Why do you split data into training and test sets?
- What is a validation set used for?
- What is k-fold cross-validation?
- What is the bias-variance tradeoff?
- What is a feature in machine learning?
- What is a label in supervised learning?
- What is the difference between classification and regression?
- What does accuracy measure?
- What is precision?
- What is recall?
- What is the F1 score?
- What is a confusion matrix?
- What is gradient descent?
- What is the learning rate?
- What is a loss function?
- What is an epoch?
- What is batch size?
- What is an activation function?
- What is the ReLU activation?
- What does the sigmoid function do?
- What is the softmax function?
- What is a neural network?
- What is a hidden layer?
- What are weights and biases in a neural network?
- What is backpropagation?
- What is regularization?
- What is the difference between L1 and L2 regularization?
- What is dropout?
- Why normalize or standardize features?
- What is one-hot encoding?
- What is feature scaling?
- What is linear regression?
- What does logistic regression predict?
- What is a decision tree?
- What is a random forest?
- How does k-nearest neighbors classify a point?
- What does the k-means algorithm do?
- What is a hyperparameter?
- What is the difference between a parameter and a hyperparameter?
- What is the difference between training and inference?
- What is label encoding?
- What is an outlier?
- What is data leakage in machine learning?
- What problem does an imbalanced dataset cause?
- What is ground truth?
- What does generalization mean in machine learning?
Medium (50)
- What is ensemble learning?
- How does gradient boosting work?
- Why is XGBoost popular for tabular data?
- What does ROC AUC measure?
- When is a precision-recall curve preferred over ROC?
- What does principal component analysis do?
- What is the curse of dimensionality?
- What is the vanishing gradient problem?
- What does batch normalization do?
- What is a CNN good at and why?
- What is the purpose of pooling layers?
- What is an RNN designed for?
- Why were LSTMs introduced?
- What is an embedding?
- What is transfer learning?
- What does fine-tuning a pretrained model involve?
- What is data augmentation?
- How do grid
- What is early stopping?
- How does class weighting help with imbalance?
- What does SMOTE do?
- Why use stratified sampling when splitting data?
- How can you handle missing feature values?
- What problem does multicollinearity cause?
- What is algorithmic bias and how does it arise?
- Why is cross-entropy used for classification?
- How does Adam differ from plain SGD?
- What does momentum add to gradient descent?
- Why does weight initialization matter?
- Why schedule the learning rate during training?
- How do learning curves diagnose a model?
- What is stacking in ensembles?
- What does it mean for a classifier to be calibrated?
- Why tune the classification threshold?
- Why is feature engineering important?
- What is target encoding and its main risk?
- Why can't you randomly shuffle a time series for train/test?
- What is data drift?
- How does concept drift differ from data drift?
- Why evaluate models both offline and online?
- How does bagging reduce error?
- How does boosting reduce error?
- What does precision at k measure?
- What does temperature do in a softmax?
- What is the tradeoff captured by sensitivity and specificity?
- What happens as you increase regularization strength?
- Why use mini-batch rather than full-batch or single-sample updates?
- How does label noise affect training?
- Why perform feature selection?
- How do preprocessing pipelines prevent leakage?
Hard (50)
- What makes the transformer architecture powerful?
- How is self-attention computed?
- Why use multiple attention heads?
- Why do transformers need positional encoding?
- What is the computational bottleneck of self-attention?
- How do BERT-style and GPT-style models differ?
- What is masked language modeling?
- How do autoregressive models generate text?
- How do greedy
- What problem does the KV cache solve?
- What does LoRA do?
- What is reinforcement learning from human feedback?
- How does a mixture-of-experts model save compute?
- What is knowledge distillation?
- How does quantization shrink and speed up models?
- What is model pruning?
- Why and how is gradient clipping used?
- Why do transformers use layer norm instead of batch norm?
- What problem do residual connections solve?
- What is the double descent phenomenon?
- What does the lottery ticket hypothesis claim?
- What is catastrophic forgetting?
- What is in-context learning in large language models?
- How does contrastive learning create useful representations?
- How does a generative adversarial network train?
- What is mode collapse in GANs?
- How do diffusion models generate data?
- What does a variational autoencoder learn?
- Why is the reparameterization trick needed in VAEs?
- What is the difference between epistemic and aleatoric uncertainty?
- What does label smoothing do?
- What problem does focal loss address?
- Why use approximate nearest neighbor search for embeddings?
- What is retrieval-augmented generation?
- Why is cosine similarity common for embeddings?
- Why do LLMs use subword tokenization like BPE?
- What does perplexity measure for a language model?
- What do neural scaling laws describe?
- How does mixed precision training help?
- How do data and model parallelism differ?
- What is gradient accumulation used for?
- What should you monitor for a deployed model?
- What is a shadow deployment?
- What problem does a feature store solve?
- What is training-serving skew?
- Why do large language models hallucinate?
- How can fine-tuning a large model on small data go wrong?
- What is active learning?
- Why monitor embedding drift in a deployed NLP system?
- How does speculative decoding speed up LLM inference?
