AI / ML Engineer Interview Flashcards

Question 1

What is supervised learning?

Accepted Answer

Supervised learning trains a model on input-output pairs where each example has a known label, so the model learns a mapping from features to targets that generalizes to new data. It covers classification (discrete labels) and regression (continuous values), and requires labeled data, which is often the costliest part.

Question 2

What is unsupervised learning?

Accepted Answer

Unsupervised learning discovers structure in data without labels, for example clustering similar points (k-means), reducing dimensionality (PCA), or density estimation. It is used for exploration, segmentation, and pretraining when labels are scarce or expensive, and is evaluated indirectly since there is no ground-truth target.

Question 3

What is overfitting?

Accepted Answer

Overfitting is when a model captures noise and idiosyncrasies of the training set rather than the underlying signal, so training error is low but test error is high. It is caused by excessive capacity or too little data, and is mitigated with regularization, more data, early stopping, and cross-validation.

Question 4

What is underfitting?

Accepted Answer

Underfitting is when a model lacks the capacity or training to capture the underlying relationship, giving high error on both training and test sets (high bias). Fixes include a more expressive model, better features, reduced regularization, or longer training.

Question 5

Why do you split data into training and test sets?

Accepted Answer

Splitting holds out a test set the model never trains on, giving an unbiased estimate of generalization. The model learns on the training set and is evaluated on the test set, preventing the optimistic bias of scoring on data it memorized. A typical split is 70-80 percent train and the rest test.

Question 6

What is a validation set used for?

Accepted Answer

A validation set is held out from training to tune hyperparameters, select architectures, and decide on early stopping without touching the test set. Keeping the test set untouched until the end prevents leaking tuning decisions into the final, reported generalization estimate.

Question 7

What is k-fold cross-validation?

Accepted Answer

K-fold cross-validation partitions the data into k folds, trains on k-1 folds and validates on the remaining one, rotating k times and averaging the scores. It gives a more stable, lower-variance estimate of performance than a single split and uses the data efficiently, at k times the compute cost.
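
The rotation described above can be sketched in plain Python. This is a minimal illustration, not a library API: the helper name k_fold_indices is hypothetical, and it uses contiguous folds without shuffling (in practice the data is usually shuffled first).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
```

Each index lands in exactly one validation fold, so every example is used for validation once and for training k-1 times.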

Question 8

What is the bias-variance tradeoff?

Accepted Answer

The bias-variance tradeoff describes how expected error decomposes into bias (error from oversimplified assumptions), variance (sensitivity to the training sample), and irreducible noise. Increasing model complexity lowers bias but raises variance; the goal is the sweet spot that minimizes total generalization error.

Question 9

What is a feature in machine learning?

Accepted Answer

A feature is a measurable input attribute the model uses to predict the target, for example age, pixel intensity, or a word count. Feature quality and engineering often matter more than the algorithm, since informative, well-scaled features make patterns learnable.

Question 10

What is a label in supervised learning?

Accepted Answer

A label is the ground-truth target paired with each training example, such as the class of an image or the price of a house. Supervised models learn to predict labels from features, and label quality directly bounds achievable accuracy.

Question 11

What is the difference between classification and regression?

Accepted Answer

Classification outputs a discrete class label (spam or not, digit 0-9), while regression outputs a continuous quantity (price, temperature). They differ in loss functions (cross-entropy vs squared error) and metrics (accuracy/F1 vs RMSE/MAE), though some tasks like ordinal prediction blur the line.

Question 12

What does accuracy measure?

Accepted Answer

Accuracy = (correct predictions) / (total predictions), the share of examples classified correctly. It is intuitive but misleading on imbalanced data, where predicting the majority class can score high while missing the minority class entirely, so precision, recall, or AUC are often better.

Question 13

What is precision?

Accepted Answer

Precision = TP / (TP + FP): among all positive predictions, how many were correct. High precision means few false positives, which matters when acting on a positive is costly, such as flagging fraud or recommending a treatment.

Question 14

What is recall?

Accepted Answer

Recall = TP / (TP + FN): among all actual positives, how many were found. High recall means few false negatives, which matters when missing a positive is costly, such as detecting disease or security threats. Recall usually trades off against precision.

Question 15

What is the F1 score?

Accepted Answer

F1 = 2 * (precision * recall) / (precision + recall), the harmonic mean that rewards a balance of both. It is useful on imbalanced data where a single metric is needed, and the harmonic mean punishes extreme imbalance between precision and recall more than the arithmetic mean would.
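
As a small worked sketch, the three formulas can be computed from raw counts. The function name and the choice to return 0.0 when a ratio is undefined are assumptions made for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from counts; 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Example counts (made up): 8 true positives, 2 false positives, 8 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
arithmetic_mean = (p + r) / 2
```

With precision 0.8 and recall 0.5, F1 comes out below the arithmetic mean, which is the penalty for imbalance the answer describes.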

Question 16

What is a confusion matrix?

Accepted Answer

A confusion matrix tabulates predictions against actual labels, with cells for true positives, true negatives, false positives, and false negatives. It exposes which classes are confused and is the basis for precision, recall, accuracy, and specificity rather than a single summary number.
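
A minimal sketch of tallying the four cells for a binary problem follows. The function name and the toy label lists are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
accuracy = (counts["tp"] + counts["tn"]) / len(y_true)
```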

Question 17

What is gradient descent?

Accepted Answer

Gradient descent iteratively updates parameters by stepping opposite the gradient of the loss: theta = theta - learning_rate * gradient. Each step moves downhill on the loss surface toward a (local) minimum; variants like stochastic and mini-batch gradient descent estimate the gradient from sampled data for faster, scalable updates.
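
The update rule can be sketched on a one-parameter toy problem. The quadratic loss (theta - 3)^2 and the function name are assumptions chosen so the optimum is known to be 3.

```python
def gradient_descent(grad, theta, learning_rate=0.1, steps=100):
    """Repeatedly apply theta = theta - learning_rate * gradient."""
    for _ in range(steps):
        theta = theta - learning_rate * grad(theta)
    return theta


# Toy loss: (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_hat = gradient_descent(lambda t: 2 * (t - 3), theta=0.0)
```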

Question 18

What is the learning rate?

Accepted Answer

The learning rate scales the size of parameter updates in gradient descent. Too high causes divergence or oscillation; too low makes training slow and prone to getting stuck. It is a key hyperparameter, often scheduled or adapted (warmup, decay, Adam) during training.
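
A rough sketch of the three regimes, again on the assumed toy loss (theta - 3)^2; the specific rates 0.001, 0.1, and 1.1 are picked only to make the contrast visible.

```python
def distance_after(learning_rate, steps=20, theta=0.0):
    """Minimize (theta - 3)^2 and return the final distance from the optimum."""
    for _ in range(steps):
        theta -= learning_rate * 2 * (theta - 3)
    return abs(theta - 3)


too_low = distance_after(0.001)    # barely moves in 20 steps
reasonable = distance_after(0.1)   # shrinks the error steadily
too_high = distance_after(1.1)     # overshoots further each step and diverges
```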

Question 19

What is a loss function?

Accepted Answer

A loss function quantifies the error between predictions and targets, providing the signal that optimization minimizes. Common choices are mean squared error for regression and cross-entropy for classification; for gradient-based training the loss must be differentiable, at least almost everywhere.

Question 20

What is an epoch?

Accepted Answer

An epoch is one complete pass over all training examples. Models typically train for many epochs so parameters converge, while monitoring validation loss to stop before overfitting. Within an epoch, data is processed in mini-batches.

Question 21

What is batch size?

Accepted Answer

Batch size is how many examples are used to compute one gradient update. Larger batches give smoother, more stable gradients and better hardware utilization but need more memory and can generalize worse; smaller batches add useful noise but are slower per epoch.

Question 22

What is an activation function?

Accepted Answer

An activation function introduces nonlinearity after a neuron's weighted sum, enabling networks to approximate complex, non-linear relationships. Without it, stacked layers collapse into a single linear map. Common choices include ReLU, sigmoid, and tanh.

Question 23

What is the ReLU activation?

Accepted Answer

ReLU(x) = max(0, x) outputs the input when positive and zero otherwise. It is cheap, mitigates vanishing gradients, and is the default hidden-layer activation, though it can suffer dying neurons (always-zero outputs), addressed by variants like leaky ReLU.

Question 24

What does the sigmoid function do?

Accepted Answer

Sigmoid(x) = 1 / (1 + e^-x) maps any real number to (0, 1), interpretable as a probability for binary classification output. In deep hidden layers it can cause vanishing gradients because its derivative is small for large magnitude inputs.
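
A minimal sketch of the function and its derivative, sigmoid(x) * (1 - sigmoid(x)), shows how the gradient shrinks for large inputs. The two-branch form is an implementation choice to avoid overflow, not part of the definition.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe: x is negative, so exp(x) cannot overflow
    return z / (1.0 + z)


def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)


grad_at_zero = sigmoid_grad(0.0)   # the maximum slope
grad_at_ten = sigmoid_grad(10.0)   # nearly flat
```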

Question 25

What is the softmax function?

Accepted Answer

Softmax exponentiates each score and normalizes: softmax(z_i) = e^{z_i} / sum_j e^{z_j}, producing a probability distribution over classes that sums to 1. It is the standard output for multi-class classification, paired with cross-entropy loss.
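
The formula can be sketched directly. Subtracting the maximum score before exponentiating is a common stability trick assumed here; it leaves the result unchanged because the shift cancels in the ratio.

```python
import math


def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max so exp() never overflows
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
large = softmax([1000.0, 1001.0])  # would overflow without the shift
```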

Question 26

What is a neural network?

Accepted Answer

A neural network is composed of layers of interconnected nodes (neurons), each applying a weighted sum and nonlinear activation. By stacking layers and training weights with backpropagation, it learns hierarchical representations capable of approximating complex functions.

Question 27

What is a hidden layer?

Accepted Answer

A hidden layer sits between input and output and transforms the data into intermediate features the next layer can use. Depth (more hidden layers) lets the network learn increasingly abstract, hierarchical representations, which is the essence of deep learning.

Question 28

What are weights and biases in a neural network?

Accepted Answer

Weights scale the contribution of each input to a neuron, and the bias shifts the weighted sum before activation: output = activation(w·x + b). These parameters are learned during training to minimize loss, and together define what the network computes.

Question 29

What is backpropagation?

Accepted Answer

Backpropagation applies the chain rule to compute the gradient of the loss with respect to every weight, propagating error signals from output back to input layer by layer. These gradients then drive gradient descent updates; it makes training deep networks computationally feasible.
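
To make the chain rule concrete, here is a sketch on an assumed two-weight network, y_hat = w2 * tanh(w1 * x), with squared-error loss. The network, the values, and the finite-difference helper are illustrative choices for checking the analytic gradients.

```python
import math


def loss(w1, w2, x, y):
    """Squared-error loss of the toy network y_hat = w2 * tanh(w1 * x)."""
    return 0.5 * (w2 * math.tanh(w1 * x) - y) ** 2


def backprop(w1, w2, x, y):
    """Return (dL/dw1, dL/dw2) via the chain rule, output layer first."""
    h = math.tanh(w1 * x)            # forward pass: hidden activation
    y_hat = w2 * h                   # forward pass: prediction
    d_y_hat = y_hat - y              # dL/dy_hat
    d_w2 = d_y_hat * h               # dL/dw2 = dL/dy_hat * dy_hat/dw2
    d_h = d_y_hat * w2               # error signal passed back to the hidden unit
    d_w1 = d_h * (1.0 - h * h) * x   # tanh'(z) = 1 - tanh(z)^2, and dz/dw1 = x
    return d_w1, d_w2


def numeric_grad(f, value, eps=1e-6):
    """Central finite difference, used only to check the analytic result."""
    return (f(value + eps) - f(value - eps)) / (2 * eps)


w1, w2, x, y = 0.5, -1.2, 2.0, 1.0
d_w1, d_w2 = backprop(w1, w2, x, y)
```

Comparing d_w1 and d_w2 against numeric_grad is the usual gradient check; backpropagation gets the same numbers in a single backward pass instead of two loss evaluations per weight.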

Question 30

What is regularization?

Accepted Answer

Regularization adds constraints or penalties that reduce model complexity and overfitting, trading a little training fit for better generalization. Examples include L1/L2 weight penalties, dropout, early stopping, and data augmentation.

Question 31

What is the difference between L1 and L2 regularization?

Accepted Answer

L1 (Lasso) penalizes the sum of absolute weights, driving some to exactly zero for feature selection and sparsity. L2 (Ridge) penalizes the sum of squared weights, shrinking them smoothly toward but not to zero. Elastic Net combines both.
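
One way to see the difference is the closed-form shrinkage each penalty applies to a single weight. This sketch assumes the one-dimensional problems 0.5 * (w - w0)^2 + lam * |w| for L1 and 0.5 * (w - w0)^2 + 0.5 * lam * w^2 for L2; the function names and values are illustrative.

```python
import math


def l1_shrink(w0, lam):
    """Soft-thresholding: minimizer of 0.5*(w - w0)^2 + lam*|w|."""
    return math.copysign(max(abs(w0) - lam, 0.0), w0)


def l2_shrink(w0, lam):
    """Proportional shrinkage: minimizer of 0.5*(w - w0)^2 + 0.5*lam*w^2."""
    return w0 / (1.0 + lam)


weights = [0.05, -0.3, 2.0]
l1_weights = [l1_shrink(w, lam=0.1) for w in weights]
l2_weights = [l2_shrink(w, lam=0.1) for w in weights]
```

The small weight 0.05 is set exactly to zero by the L1 step, while the L2 step only scales every weight down by the same factor.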

Question 32

What is dropout?

Accepted Answer

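Dropout randomly zeroes a fraction of neuron activations each training step, forcing the network to learn redundant, robust features rather than co-adapting. It acts like training an ensemble of subnetworks; at inference all neurons are used, with activations rescaled (during training in the common inverted-dropout form, or at inference in the original formulation) so expected magnitudes match.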
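
A minimal sketch of the inverted-dropout variant, where surviving activations are scaled up during training so inference needs no adjustment. The function signature is hypothetical, and drop_prob is assumed to be strictly below 1.

```python
import random


def dropout(activations, drop_prob, training, rng):
    """Inverted dropout: zero units with probability drop_prob, rescale the rest."""
    if not training or drop_prob == 0.0:
        return list(activations)  # inference: every unit is kept unchanged
    keep_prob = 1.0 - drop_prob
    # Dividing by keep_prob keeps the expected activation the same as at inference.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)
acts = [1.0] * 10000
train_out = dropout(acts, drop_prob=0.5, training=True, rng=rng)
eval_out = dropout(acts, drop_prob=0.5, training=False, rng=rng)
```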

Question 33

Why normalize or standardize features?

Accepted Answer

Normalization (scaling to [0,1]) or standardization (zero mean, unit variance via (x - mean)/std) puts features on comparable scales, which speeds and stabilizes gradient descent and prevents large-scale features from dominating. It is essential for distance-based and gradient-based models.
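
Both transforms can be sketched with the standard library. The helpers assume a non-constant feature (otherwise the denominators are zero) and use the population standard deviation; the sample values are made up.

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit variance: (x - mean) / std."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


ages = [20.0, 30.0, 40.0, 50.0]
ages_01 = min_max_scale(ages)
ages_z = standardize(ages)
```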

Question 34

What is one-hot encoding?

Accepted Answer

One-hot encoding converts a categorical variable into a binary vector where exactly one element is 1 for the present category and the rest are 0. It avoids implying false ordinal relationships between categories, at the cost of high dimensionality for many categories.
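
A small sketch of the encoding; ordering the categories alphabetically is an arbitrary choice made here so the column order is reproducible.

```python
def one_hot_encode(values):
    """Return (categories, vectors) with one binary column per category."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one position is set per example
        vectors.append(row)
    return categories, vectors


categories, vectors = one_hot_encode(["red", "green", "blue", "green"])
```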

Question 35

What is feature scaling?

Accepted Answer

Feature scaling transforms features to a common range or distribution (min-max or standardization) so models that rely on distances or gradients treat features fairly. Tree-based models are scale-invariant, but linear models, SVMs, KNN, and neural nets benefit greatly.

Question 36

What is linear regression?

Accepted Answer

Linear regression models the target as a weighted linear combination of features plus a bias: y = w·x + b, fit by minimizing squared error. It is interpretable and fast, but it assumes a roughly linear relationship and is sensitive to outliers and multicollinearity.

Question 37

What does logistic regression predict?

Accepted Answer

Logistic regression models the probability of a class by passing a linear combination of features through a sigmoid: p = 1/(1 + e^-(w·x+b)), trained with cross-entropy loss. Despite its name it is a classifier, and it is a strong, interpretable baseline.

Question 38

What is a decision tree?

Accepted Answer

A decision tree recursively splits the data on the feature and threshold that best separates the target (by Gini impurity or entropy), forming a tree whose leaves give predictions. It is interpretable and handles nonlinearity, but a single deep tree overfits easily.
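
The split criterion can be sketched with Gini impurity, 1 - sum(p_c^2) over the class proportions, and the size-weighted impurity of a candidate split. The function names are illustrative.

```python
from collections import Counter


def gini_impurity(labels):
    """1 - sum of squared class proportions; 0.0 means the node is pure."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted average impurity of the two children of a split."""
    n = len(left) + len(right)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)


parent = ["a", "a", "b", "b"]
perfect_split = split_impurity(["a", "a"], ["b", "b"])
useless_split = split_impurity(["a", "b"], ["a", "b"])
```

The tree prefers the split with the lowest weighted impurity, so here it would pick the one that separates the classes completely.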

Question 39

What is a random forest?

Accepted Answer

A random forest trains many decision trees on bootstrap samples with random feature subsets, then aggregates them by voting or averaging. This bagging reduces variance and overfitting versus a single tree, giving robust, strong performance with little tuning.

Question 40

How does k-nearest neighbors classify a point?

Accepted Answer

KNN classifies a new point by finding its k nearest training examples (by a distance metric like Euclidean) and taking the majority label, or averaging for regression. It is simple and non-parametric but slow at inference and sensitive to feature scaling and the curse of dimensionality.
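
A brute-force sketch of the classification rule with Euclidean distance. The toy points and the function name are assumptions; ties are broken by sort order here, which a real implementation would handle more deliberately.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
labels = ["a", "a", "a", "b", "b", "b"]
near_origin = knn_predict(points, labels, query=(0.5, 0.5))
near_far_cluster = knn_predict(points, labels, query=(5.5, 5.5))
```

Every prediction scans the whole training set, which is the inference cost mentioned in the answer.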

Question 41

What does the k-means algorithm do?

Accepted Answer

K-means partitions data into k clusters by alternately assigning each point to its nearest centroid and recomputing centroids as cluster means, minimizing within-cluster variance. You must choose k, it assumes roughly spherical clusters, and results depend on initialization, which k-means++ seeding makes more reliable.
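
The assign-then-update loop can be sketched in one dimension. This simplified version takes the starting centroids as an argument and runs a fixed number of iterations, both assumptions made to keep the example deterministic.

```python
def k_means_1d(points, centroids, iterations=10):
    """Alternate nearest-centroid assignment and mean update on 1-D data."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assignment step: pick the index of the closest centroid.
            nearest = min(range(len(centroids)), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster mean (keep it if empty).
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids


centers = k_means_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], centroids=[0.0, 5.0])
```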

Question 42

What is a hyperparameter?

Accepted Answer

A hyperparameter is configured before training and not learned from data, for example learning rate, number of layers, tree depth, or k. They are tuned via validation using grid search, random search, or Bayesian optimization to optimize generalization.

Question 43

What is the difference between a parameter and a hyperparameter?

Accepted Answer

Model parameters (weights, biases) are learned by optimization from the training data, while hyperparameters (learning rate, depth, regularization strength) are set beforehand and govern how learning proceeds. Parameters are outputs of training; hyperparameters are inputs to it.

Question 44

What is the difference between training and inference?

Accepted Answer

Training is the compute-heavy phase that fits parameters by minimizing loss over data. Inference (or serving) applies the fixed trained model to new inputs to produce predictions, optimized for latency and throughput rather than learning.

Question 45

What is label encoding?

Accepted Answer

Label encoding assigns each category a unique integer. It is compact and fine for tree-based models or truly ordinal variables, but for nominal categories in linear or distance-based models it falsely implies order and magnitude, where one-hot encoding is preferred.

Question 46

What is an outlier?

Accepted Answer

An outlier is an observation far from the bulk of the data, arising from errors, rare events, or heavy-tailed distributions. Outliers can distort means, variances, and models like linear regression, so they are detected (z-score, IQR) and handled by removal, capping, or robust methods.
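
The IQR rule mentioned above can be sketched as follows. The 1.5 multiplier is the conventional choice, and the sample values are made up; quartile conventions differ slightly between libraries.

```python
import statistics


def iqr_outliers(values, multiplier=1.5):
    """Return values outside [Q1 - m*IQR, Q3 + m*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [v for v in values if v < lower or v > upper]


readings = [10, 12, 11, 13, 12, 11, 10, 95]
outliers = iqr_outliers(readings)
```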

Question 47

What is data leakage in machine learning?

Accepted Answer

Data leakage occurs when the model is exposed to information it would not have at prediction time, such as target-derived features, future data, or fitting preprocessing on the full dataset before splitting. It produces over-optimistic validation scores that collapse in production.
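
The preprocessing case can be sketched with a standardizer: the statistics are computed from the training split only and then reused on the test split. The helper names and the toy values are hypothetical.

```python
import statistics


def fit_scaler(train_values):
    """Learn scaling statistics from the training split only."""
    return statistics.mean(train_values), statistics.pstdev(train_values)


def transform(values, mean, std):
    """Apply previously learned statistics to any split."""
    return [(v - mean) / std for v in values]


train = [1.0, 2.0, 3.0, 4.0]
test = [100.0]

mean, std = fit_scaler(train)             # correct: the test data is never seen here
test_scaled = transform(test, mean, std)

leaky_mean = statistics.mean(train + test)  # leaky: test data shifts the statistic
```

The leaky mean is pulled toward the test value, which is exactly the information the model should not have had during training.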

Question 48

What problem does an imbalanced dataset cause?

Accepted Answer

In an imbalanced dataset one class vastly outnumbers others, so a naive model maximizes accuracy by predicting the majority while failing on the important minority. Remedies include resampling (oversampling/SMOTE, undersampling), class weights, and using precision/recall, F1, or AUC instead of accuracy.

Question 49

What is ground truth?

Accepted Answer

Ground truth is the verified, correct label or value against which predictions are compared during training and evaluation. Its accuracy bounds model quality; noisy or biased ground truth (from labeling errors) caps how well any model can truly perform.

Question 50

What does generalization mean in machine learning?

Accepted Answer

Generalization is a model's ability to perform well on data drawn from the same distribution but not seen in training, which is the real goal of learning. It is estimated with held-out test data and improved by regularization, more and cleaner data, and appropriate model capacity.

Question 51

What is ensemble learning?

Accepted Answer

Ensemble learning combines several models so errors partially cancel, lowering variance or bias. Bagging (random forests) trains models in parallel on bootstrap samples to cut variance; boosting trains models sequentially to fix prior errors and cut bias; stacking learns a meta-model over base predictions.

Question 52

How does gradient boosting work?

Accepted Answer

Gradient boosting fits an additive ensemble where each new weak learner is trained on the negative gradient of the loss with respect to the current ensemble's predictions (the pseudo-residuals, which equal the ordinary residuals under squared error), then added with a shrinkage learning rate. It is powerful on tabular data but can overfit without regularization and early stopping.

Question 53

Why is XGBoost popular for tabular data?

Accepted Answer

XGBoost is an optimized gradient boosting library adding L1/L2 regularization, second-order gradients, sparsity-aware splitting, parallelized tree building, and built-in handling of missing values. Its speed, accuracy, and tunability make it a frequent winner on structured/tabular problems.

Question 54

What does ROC AUC measure?

Accepted Answer

The ROC curve plots true positive rate against false positive rate across thresholds; AUC is the area under it, equal to the probability a random positive is ranked above a random negative. AUC of 0.5 is random and 1.0 is perfect; it is threshold-independent but can be optimistic on heavy imbalance.
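
The ranking interpretation gives a direct, if quadratic-time, way to compute AUC: compare every positive with every negative. This sketch counts ties as half a win, a common convention assumed here; the labels and scores are toy values.

```python
def roc_auc(labels, scores):
    """Probability that a random positive scores above a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


auc = roc_auc(labels=[0, 0, 1, 1], scores=[0.1, 0.4, 0.35, 0.8])
```

Three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75.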

Question 55

When is a precision-recall curve preferred over ROC?

Accepted Answer

A precision-recall curve plots precision against recall across thresholds and is more informative than ROC when positives are rare, because ROC's false positive rate is dominated by abundant negatives and looks deceptively good. PR-AUC focuses on performance for the minority class that matters.

Question 56

What does principal component analysis do?

Accepted Answer

PCA finds orthogonal directions (principal components, the eigenvectors of the covariance matrix) capturing maximal variance, then projects data onto the top components to reduce dimensionality while retaining as much variance as possible. It requires centered data, and features should usually be standardized when their scales differ; the resulting components are uncorrelated but less interpretable.

Question 57

What is the curse of dimensionality?

Accepted Answer

The curse of dimensionality is that volume grows exponentially with dimensions, so data becomes sparse and nearly equidistant, weakening distance-based methods and requiring exponentially more data to cover the space. Mitigations include dimensionality reduction, feature selection, and regularization.

Question 58

What is the vanishing gradient problem?

Accepted Answer

Vanishing gradients occur when repeated multiplication of small derivatives (e.g. sigmoid/tanh) through many layers drives gradients toward zero, stalling learning in early layers. Remedies include ReLU activations, residual/skip connections, batch normalization, and careful weight initialization.

Question 59

What does batch normalization do?

Accepted Answer

Batch normalization standardizes each layer's pre-activations over the mini-batch (zero mean, unit variance) then rescales with learned parameters. It was introduced to reduce internal covariate shift (an explanation that is still debated), allows higher learning rates, adds mild regularization, and speeds convergence. Care is needed with small batches, and inference uses running statistics rather than per-batch statistics.
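
A sketch of the training-time computation for a single feature across one mini-batch. The names gamma and beta for the learned scale and shift, and the small eps for numerical safety, follow common convention; running statistics for inference are omitted.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize one feature over the batch, then apply learned scale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
out_mean = sum(normalized) / len(normalized)
out_var = sum((v - out_mean) ** 2 for v in normalized) / len(normalized)
```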

Question 60

What is a CNN good at and why?

Accepted Answer

A CNN uses convolutional filters that slide over the input to detect local patterns (edges, textures), with weight sharing that makes the features translation-equivariant, plus pooling for downsampling and approximate translation invariance. This drastically reduces parameters versus dense layers and exploits spatial structure, making it dominant for images and grid data.

Question 61

What is the purpose of pooling layers?

Accepted Answer

Pooling (max or average) aggregates over local regions to shrink feature maps, cutting computation and parameters while providing small translation invariance and a larger receptive field. Max pooling keeps the strongest activation; strided convolutions are a learnable alternative.
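
A minimal sketch of 2x2 max pooling with stride 2 on a nested-list feature map. The window size, stride, and input values are assumptions; any odd trailing row or column is simply dropped.

```python
def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 window."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


pooled = max_pool_2x2([
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 9, 2],
    [3, 2, 4, 6],
])
```

The 4x4 map shrinks to 2x2, a fourfold reduction in the activations passed to the next layer.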

Question 62

What is an RNN designed for?

Accepted Answer

An RNN processes sequences one step at a time, carrying a hidden state that summarizes prior context, sharing weights across steps. It suits text, time series, and audio, but vanilla RNNs struggle with long-range dependencies due to vanishing gradients, motivating LSTMs and attention.

Question 63

Why were LSTMs introduced?

Accepted Answer

LSTMs add a gated cell state (input, forget, output gates) that controls what information to keep, discard, or expose, letting gradients flow over long sequences without vanishing. This captures long-range dependencies far better than vanilla RNNs, though transformers now often outperform them.

Question 64

What is an embedding?

Accepted Answer

An embedding maps discrete tokens (words, users, items) to dense, lower-dimensional vectors learned so that similar items are nearby in vector space. Embeddings capture semantic relationships, generalize better than one-hot encoding, and can be pretrained and transferred.
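
"Nearby in vector space" is usually measured with cosine similarity. In this sketch the three-dimensional vectors are hand-written toy values, not learned embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embedding table, for illustration only.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}
cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
```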
