Data Scientist Interview Flashcards

Question 1

What is the difference between supervised and unsupervised learning?

Accepted Answer

Supervised learning trains on labeled examples (input X with known output y) to predict the target on new data, covering classification and regression. Unsupervised learning works on unlabeled data to discover structure, such as clustering (k-means) or dimensionality reduction (PCA). Semi-supervised learning sits between the two, combining a small labeled set with a larger unlabeled one, while reinforcement learning is a separate paradigm that learns from reward signals rather than labels.

Question 2

What is the difference between classification and regression?

Accepted Answer

Both are supervised, but classification predicts a discrete class label and is scored with accuracy, precision/recall, or AUC, while regression predicts a continuous quantity like price and is scored with MAE, MSE/RMSE, or R^2. The output type and loss function differ accordingly.

Question 3

When is the median preferred over the mean as a measure of central tendency?

Accepted Answer

Prefer the median when the distribution is skewed or has outliers, because the mean is pulled toward extreme values while the median, the 50th percentile, is robust to them. Incomes, for example, are right-skewed, so median income better represents a typical person than the mean does.
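
A quick sketch of the effect, using made-up income figures (the numbers are illustrative, not real data):

```python
from statistics import mean, median

# Hypothetical annual incomes; the last value is one extreme earner.
incomes = [30_000, 35_000, 40_000, 45_000, 1_000_000]

mean_income = mean(incomes)      # pulled far upward by the single extreme value
median_income = median(incomes)  # the middle value, unaffected by the extreme

print(mean_income, median_income)
```

Here the mean lands above four of the five incomes, while the median stays at a value an actual person in the list earns.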

Question 4

What is overfitting?

Accepted Answer

Overfitting is when a model fits the training data's noise and quirks, so training error is low but test or generalization error is high, a sign of high variance. Causes include too much model capacity or too little data. Remedies: more data, regularization, simpler models, and early stopping, with cross-validation used to detect the problem and tune those choices.

Question 5

What is underfitting?

Accepted Answer

Underfitting is when a model is too simple or under-trained to capture the underlying signal, so it has high bias and poor performance on both training and test sets. Fixes include adding features or capacity, reducing regularization, or training longer. It is the opposite end of the bias-variance trade-off from overfitting.

Question 6

Why do you split data into training and test sets?

Accepted Answer

Splitting holds out a test set the model never trains on, so its performance there estimates generalization to new data and exposes overfitting. A common split is 70-80% train and the rest test, often with a separate validation set or cross-validation for tuning so the test set stays untouched until the end.

Question 7

What is a feature in machine learning?

Accepted Answer

A feature is a measurable input variable (a column) that the model uses to predict the target, such as square footage when predicting price. Feature engineering (creating, transforming, and selecting good features) often matters more to accuracy than the choice of algorithm.

Question 8

What is a label (target) in supervised learning?

Accepted Answer

The label or target is the known correct output for each training example in supervised learning, such as the true class or value. The model learns a mapping from features to labels by minimizing the error between its predictions and these labels, so labeling quality directly bounds model quality.

Question 9

Why does correlation not imply causation?

Accepted Answer

Correlation measures that two variables move together, but that can arise from coincidence, reverse causation, or a confounder influencing both, like ice-cream sales and drownings both rising with temperature. Establishing causation requires controlled experiments such as A/B tests, or causal-inference methods that account for confounders.

Question 10

What is an outlier?

Accepted Answer

An outlier is an observation far from the rest of the data, possibly from measurement error, data-entry mistakes, or a genuine rare event. Common rules flag points beyond 1.5 x IQR from the quartiles or more than about 3 standard deviations from the mean. Handle by investigating, capping, or using robust methods rather than blindly deleting.
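
The 1.5 x IQR rule can be sketched as follows. The helper name iqr_outliers and the sample data are hypothetical, and the quartiles come from statistics.quantiles with its default method; other quartile conventions can give slightly different cut points.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up measurements with one suspicious reading.
data = [10, 12, 12, 13, 14, 15, 16, 95]
flagged = iqr_outliers(data)
print(flagged)
```

A flagged point is a candidate for investigation, not automatically an error.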

Question 11

What does standard deviation measure?

Accepted Answer

Standard deviation measures dispersion around the mean: sigma = sqrt( (1/N) * sum( (x_i - mean)^2 ) ), the square root of the variance. A larger sigma means more spread. For a normal distribution, about 68% of values lie within 1 sigma and 95% within 2 sigma of the mean.
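
A minimal sketch of the formula above, checked against the standard library. The function name population_std and the data are illustrative. Note that statistics.pstdev divides by N as in the formula, whereas statistics.stdev divides by N - 1 to estimate sigma from a sample.

```python
from math import sqrt
from statistics import pstdev, stdev

def population_std(xs):
    """sigma = sqrt((1/N) * sum((x_i - mean)^2))."""
    mu = sum(xs) / len(xs)
    return sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))

xs = [2, 4, 4, 4, 5, 5, 7, 9]
sigma = population_std(xs)   # population form, divides by N
sample_sigma = stdev(xs)     # sample form, divides by N - 1, slightly larger
print(sigma, pstdev(xs), sample_sigma)
```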

Question 12

What shape does a normal distribution have?

Accepted Answer

The normal (Gaussian) distribution is a symmetric bell curve fully described by its mean (center) and standard deviation (spread), with mean = median = mode. Its 68-95-99.7 rule says roughly 68%, 95%, and 99.7% of data fall within 1, 2, and 3 standard deviations. Many statistical methods assume approximate normality.

Question 13

What does a p-value represent?

Accepted Answer

A p-value is P(data at least as extreme as observed, given the null hypothesis is true). A small p-value, commonly below 0.05, means the data would be unlikely under the null, so we reject it. It is not the probability the null is true, nor the effect size, and for a fixed nonzero effect it tends to shrink as the sample grows, so a tiny p-value can accompany a trivially small effect.

Question 14

What is a null hypothesis?

Accepted Answer

The null hypothesis (H0) is the default claim of no effect, no difference, or no relationship, while the alternative (H1) is what you suspect. A test gathers evidence to reject H0 in favor of H1 if the p-value falls below the significance level alpha. Failing to reject H0 is not proof that it is true.

Question 15

What is mean imputation?

Accepted Answer

Mean imputation replaces missing numeric values with that column's mean. It is simple but shrinks variance and can distort relationships and correlations, and it is sensitive to outliers, where median imputation is more robust. Better options include model-based or multiple imputation plus a missingness-indicator feature.

Question 16

What is one-hot encoding?

Accepted Answer

One-hot encoding turns a categorical column into several binary 0/1 columns, one per category, so models needing numeric input can use it without implying a false order. The cost is dimensionality for high-cardinality features; alternatives include target or hashing encoding, and dropping one column avoids the dummy-variable trap.
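
A small pure-Python sketch of the idea. The one_hot helper and its drop_first flag are illustrative names, not a library API; in practice a library encoder would normally be used.

```python
def one_hot(values, drop_first=False):
    """Encode category labels as rows of 0/1 indicators, one column per category."""
    categories = sorted(set(values))
    if drop_first:
        # The dropped category is represented by an all-zero row,
        # which avoids the dummy-variable trap in linear models.
        categories = categories[1:]
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows

cols, rows = one_hot(["red", "green", "blue", "green"])
cols_dropped, rows_dropped = one_hot(["red", "green", "blue", "green"], drop_first=True)
print(cols, rows)
```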

Question 17

What does accuracy measure in classification?

Accepted Answer

Accuracy = (TP + TN) / (TP + TN + FP + FN), the share of all predictions that are correct. It is misleading on imbalanced classes, where predicting the majority class can score high while being useless, so pair it with precision, recall, F1, or AUC when classes are skewed.
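
To see why accuracy misleads on skewed classes, consider a hypothetical dataset with 95 negatives and 5 positives and a "model" that always predicts the majority class:

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical imbalanced labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # always predicts the majority class

acc = accuracy(y_true, y_pred)
positives_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
print(acc, positives_found)
```

The score is high even though not a single positive case is detected.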

Question 18

What does SQL GROUP BY do?

Accepted Answer

GROUP BY collapses rows that share the grouping column(s) into one row per group, over which aggregates like COUNT, SUM, and AVG are computed. Non-aggregated selected columns must appear in GROUP BY. Use WHERE to filter rows before grouping and HAVING to filter groups after aggregation.
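
The WHERE / GROUP BY / HAVING order can be tried with Python's built-in sqlite3 module. The orders table and its rows below are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, status TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("ann", "paid", 20.0),
        ("ann", "paid", 30.0),
        ("ann", "refunded", 99.0),
        ("bob", "paid", 5.0),
        ("cat", "paid", 40.0),
        ("cat", "paid", 60.0),
    ],
)

rows = conn.execute(
    """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    WHERE status = 'paid'        -- row filter, applied before grouping
    GROUP BY customer            -- one output row per customer
    HAVING SUM(amount) >= 50     -- group filter, applied after aggregation
    ORDER BY customer
    """
).fetchall()
conn.close()
print(rows)
```

The refunded order is removed by WHERE before any totals are computed, and bob's group is removed by HAVING because its total is below the cutoff.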

Question 19

Name three common SQL aggregate functions.

Accepted Answer

Common aggregates are COUNT (number of rows), SUM (total), AVG (mean), MIN, and MAX. They collapse many rows into one value, often per group via GROUP BY. Note COUNT(*) counts all rows while COUNT(col) ignores NULLs, and AVG also skips NULLs, which can surprise you.
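
The NULL behavior is easy to check with the built-in sqlite3 module; the scores table here is a made-up example with one missing value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (score REAL)")
conn.executemany(
    "INSERT INTO scores VALUES (?)",
    [(10.0,), (20.0,), (None,), (30.0,)],  # None is stored as NULL
)

count_all, count_col, avg_score = conn.execute(
    "SELECT COUNT(*), COUNT(score), AVG(score) FROM scores"
).fetchone()
conn.close()
print(count_all, count_col, avg_score)
```

AVG divides by the three non-NULL values, not by the four rows.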

Question 20

What does a histogram show?

Accepted Answer

A histogram shows a numeric variable's distribution by dividing its range into bins and plotting the count or density in each, revealing shape, center, spread, skew, and modality. Bin width matters: too few bins hide structure, too many add noise. It differs from a bar chart, which compares categories.

Question 21

What is a scatter plot used to visualize?

Accepted Answer

A scatter plot maps two numeric variables to x and y, with each point an observation, revealing correlation, direction, strength, nonlinearity, clusters, and outliers. Color or size can encode a third or fourth variable. It is the go-to chart for exploring pairwise relationships before modeling.

Question 22

What is the difference between categorical and numeric variables?

Accepted Answer

Categorical variables take discrete labels, either nominal with no order like color or ordinal with order like small/medium/large. Numeric variables are measurable quantities, discrete counts or continuous measurements. The type dictates valid operations, suitable charts, encodings, and which summary statistics and models apply.

Question 23

What is the difference between a sample and a population?

Accepted Answer

The population is the entire set you want to draw conclusions about, while a sample is the subset you actually measure because studying the whole population is usually infeasible. A representative, ideally random, sample lets you infer population parameters with quantified uncertainty; a biased sample yields misleading conclusions.

Question 24

What is bias in a model in simple terms?

Accepted Answer

In the bias-variance sense, bias is error from overly simplistic assumptions that make a model systematically miss the true relationship, that is, underfitting. High bias shows as poor training and test performance. It trades off against variance, since a more complex model lowers bias but tends to raise variance, so the goal is to balance both.

Question 25

What does high variance mean for a model?

Accepted Answer

Variance is how much a model's predictions would change if trained on a different sample; high variance means it is fitting noise and overfitting, doing well on training but poorly on test data. It trades off against bias, since total expected error = bias^2 + variance + irreducible noise. More data and regularization reduce variance.

Question 26

What is training data?

Accepted Answer

Training data is the set of examples (features plus labels, in supervised learning) that the model learns from by adjusting its parameters to minimize a loss. Its size, quality, and representativeness bound performance, and it must be kept separate from test data to get an honest generalization estimate.

Question 27

What is a Pandas DataFrame?

Accepted Answer

A DataFrame is pandas' 2D labeled table: columns can hold different dtypes, and both rows (the index) and columns are labeled, supporting vectorized operations, grouping, joining, and reshaping. Each column is a Series. It is the workhorse structure for data wrangling in Python.

Question 28

What does mean absolute error (MAE) measure?

Accepted Answer

Mean Absolute Error = (1/n) * sum( |y_i - yhat_i| ), the average absolute prediction error in the target's own units. Unlike RMSE it weights all errors linearly, so it is more robust to outliers but less sensitive to large misses. Lower is better.
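
A short sketch contrasting MAE with RMSE on two invented sets of predictions that have the same total absolute error, one spread evenly and one concentrated in a single large miss:

```python
from math import sqrt

def mae(y_true, y_pred):
    """Mean absolute error: (1/n) * sum(|y_i - yhat_i|)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error, which squares errors before averaging."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [10, 10, 10, 10]
even_errors = [9, 11, 9, 11]    # four errors of size 1
one_big_miss = [10, 10, 10, 14]  # a single error of size 4

print(mae(y_true, even_errors), rmse(y_true, even_errors))
print(mae(y_true, one_big_miss), rmse(y_true, one_big_miss))
```

MAE is identical in both cases, while RMSE doubles for the single large miss.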

Question 29

Why scale features before some algorithms?

Accepted Answer

Scaling (standardization to mean 0 and std 1, or min-max to 0-1) puts features on comparable ranges, so distance-based methods like kNN, k-means, and SVM are not dominated by large-magnitude features and gradient-descent models converge well. Tree-based models do not need it. Fit the scaler on training data only to avoid leakage.

Question 30

Name two ways to handle missing values.

Accepted Answer

Options include dropping rows or columns, which is only safe when missingness is small and random, or imputing: mean/median for numeric columns, mode for categorical ones, or model-based/multiple imputation for better accuracy. Consider why data is missing (MCAR, MAR, MNAR) and add a missingness indicator, since the fact of being missing can itself be informative.

Question 31

What is label encoding?

Accepted Answer

Label encoding assigns each category an integer, such as red=0, green=1, blue=2. It is compact but implies an ordinal relationship that may be false, which can mislead linear or distance-based models, so it suits ordinal features or tree-based models. For nominal features with non-tree models, prefer one-hot encoding.

Question 32

How do you find the median of a sorted list?

Accepted Answer

With the values already sorted, the median is the middle element for an odd count and the average of the two middle elements for an even count. As the 50th percentile it splits the data in half and is robust to outliers, unlike the mean.
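
The odd/even rule as a minimal function (median_of_sorted is an illustrative name; it assumes the input is already sorted):

```python
def median_of_sorted(xs):
    """Median of an already-sorted, non-empty list."""
    n = len(xs)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]                 # odd count: the middle element
    return (xs[mid - 1] + xs[mid]) / 2  # even count: average the two middle elements

print(median_of_sorted([1, 3, 5]), median_of_sorted([1, 3, 5, 7]))
```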

Question 33

What is the mode of a dataset?

Accepted Answer

The mode is the most frequently occurring value; a dataset can be unimodal, bimodal, or multimodal, or have no mode if all values are unique. It is the only central-tendency measure that works for nominal categorical data (ordinal data also supports a median), and for skewed numeric data it typically differs from the mean and median.

Question 34

How is the range of a dataset calculated?

Accepted Answer

Range = max - min, the simplest measure of spread. It is easy to compute but very sensitive to outliers because it depends only on the two extremes and ignores the distribution in between. The IQR is a more robust alternative, and the standard deviation, which uses every point, is usually more informative.

Question 35

What is the difference between a bar chart and a histogram?

Accepted Answer

A bar chart plots a value per discrete category, with gaps between bars and reorderable categories. A histogram bins a continuous variable into adjacent intervals and shows frequency, so the bars touch and the order is fixed. Using one for the other's purpose misrepresents the data.

Question 36

What is the difference between a dependent and independent variable?

Accepted Answer

The independent variable, the predictor or feature x, is what you set or vary, while the dependent variable, the response or target y, is the outcome you measure and hypothesize depends on x. In y = f(x), x is independent and y dependent. Controlling other factors is what lets you attribute changes in y to x.

Question 37

What is a cross-tabulation (contingency table)?

Accepted Answer

A cross-tabulation, or contingency table, counts how often each combination of two categorical variables occurs, with rows for one variable and columns for the other. It reveals associations between them and is the basis for a chi-square test of independence; the margins give the per-category totals.

Question 38

What does the 90th percentile mean?

Accepted Answer

The 90th percentile is the value below which 90% of the data lie, with 10% above. Percentiles describe position in a distribution: the 50th is the median, and the 25th and 75th are the quartiles whose difference is the IQR. They are useful for SLAs (p95/p99 latency) and robust spread summaries.

Question 39

What is data cleaning?

Accepted Answer

Data cleaning detects and fixes errors, duplicates, inconsistent formats or units, and missing or invalid values so downstream analysis is trustworthy; it often consumes a large share of a project's time. Steps include validating types and ranges, standardizing categories, deduping, and documenting decisions for reproducibility.

Question 40

What is binary classification?

Accepted Answer

Binary classification predicts one of two classes, positive or negative. A model outputs a probability that is thresholded, commonly at 0.5, into a class. Evaluate with a confusion matrix and precision, recall, F1, and ROC-AUC, tuning the threshold to balance false positives and false negatives for the use case.

Question 41

What is time series data?

Accepted Answer

Time series data is ordered by time, usually at regular intervals, so observations are not independent: trend, seasonality, and autocorrelation matter. This requires time-aware splits with no shuffling, specialized models like ARIMA or exponential smoothing or ML with lag features, and care to avoid leaking future information.

Question 42

What is random sampling?

Accepted Answer

Simple random sampling gives every member an equal, independent chance of selection, reducing selection bias so the sample represents the population. Variants include stratified sampling, which samples within subgroups to preserve proportions, and cluster sampling. Poor sampling like convenience samples introduces bias that no model can fix.

Question 43

What does boolean indexing do in pandas?

Accepted Answer

Boolean indexing filters a DataFrame or Series with a boolean mask, e.g. df[df['age'] > 30], keeping rows where the condition is True. You can combine conditions with & and | using parentheses, and use it to read or assign subsets. It is vectorized and fast compared with looping.

Question 44

What does min-max normalization do?

Accepted Answer

Min-max normalization rescales a feature with x' = (x - min) / (max - min), giving the range 0 to 1; a further linear map a + x' * (b - a) reaches any chosen interval [a, b]. It preserves the distribution's shape but is sensitive to outliers, which compress the rest of the values. Fit min and max on training data only to avoid leakage.
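
A sketch of fitting on the training split and reusing those statistics on the test split. The helper names fit_min_max and transform are hypothetical, as are the data.

```python
def fit_min_max(train):
    """Learn the scaling statistics from the training data only."""
    return min(train), max(train)

def transform(values, lo, hi):
    """Apply x' = (x - min) / (max - min) using previously fitted min and max."""
    if hi == lo:
        raise ValueError("constant feature: max equals min")
    return [(v - lo) / (hi - lo) for v in values]

train = [10, 20, 30, 50]
test = [20, 60]

lo, hi = fit_min_max(train)           # test data plays no part in the fit
train_scaled = transform(train, lo, hi)
test_scaled = transform(test, lo, hi)
print(train_scaled, test_scaled)
```

Because the test value 60 exceeds the training maximum, it scales to a value above 1, which is expected when the statistics come from training data alone.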

Question 45

What does it mean for a distribution to be right-skewed?

Accepted Answer

A right-skewed (positively skewed) distribution has a long right tail that typically pulls the mean above the median, which in turn sits above the mode; incomes are a classic example. Left-skew is the mirror image. Skew signals non-normality, so you might log-transform the variable or prefer the median over the mean.

Question 46

What is data aggregation?

Accepted Answer

Aggregation reduces many rows to summary values (totals, counts, averages, min/max), usually per group via GROUP BY in SQL or groupby in pandas. It is central to reporting and feature engineering, such as average spend per customer. Watch how NULLs or missing values are treated in each aggregate.

Question 47

What does a confusion matrix show?

Accepted Answer

A confusion matrix tabulates predictions versus actual classes into TP, FP, TN, and FN. From it you derive precision = TP/(TP+FP), recall = TP/(TP+FN), and accuracy = (TP+TN)/total. It shows what kinds of errors a classifier makes, which a single accuracy number hides, especially on imbalanced data.
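
A minimal sketch that tallies the four cells and derives the metrics from them. The function name confusion_counts and the labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1   # predicted positive, actually negative
        elif t == positive:
            fn += 1   # predicted negative, actually positive
        else:
            tn += 1
    return tp, fp, tn, fn

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(tp, fp, tn, fn, precision, recall, accuracy)
```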

Question 48

In a dataset predicting house price, which is the target?

Accepted Answer

When predicting house price, the price is the target, the y you want to predict, and the remaining columns like square footage, bedrooms, and location are the features, the X. The model learns a mapping from features to target, and choosing informative features is often the biggest driver of accuracy.

Question 49

What are descriptive statistics?

Accepted Answer

Descriptive statistics summarize a dataset's main features: central tendency (mean, median, mode), spread (range, IQR, standard deviation), and shape (skew, kurtosis), plus counts. They describe the data at hand without inferring beyond it, unlike inferential statistics, which generalize from a sample to a population.

Question 50

Why does the data type of a column matter in analysis?

Accepted Answer

A column's data type dictates valid operations (you can average numbers but not categories), how it sorts and compares, its memory use, and which charts and models apply. Wrong inference, such as numbers read as strings or dates as text, breaks aggregations and sorts, so verifying and casting dtypes early is essential.

Question 51

What is the bias-variance tradeoff?

Accepted Answer

Expected test error decomposes as bias^2 + variance + irreducible noise. A too-simple model has high bias and underfits; a too-complex one has high variance and overfits. Lowering one typically raises the other, so the goal is the sweet spot that minimizes total error, found via validation, regularization, and right-sizing model capacity.

Question 52

What is the difference between precision and recall?

Accepted Answer

Precision = TP/(TP+FP) answers: of those I flagged positive, how many really are? Recall = TP/(TP+FN) answers: of all actual positives, how many did I catch? They trade off: a stricter threshold raises precision but lowers recall. Choose based on the relative cost of false positives versus false negatives.

Question 53

What does the F1 score balance?

Accepted Answer

F1 = 2 * (precision * recall) / (precision + recall), the harmonic mean, which is high only when both precision and recall are high and punishes imbalance between them, making it better than accuracy on skewed classes. The F-beta generalization weights recall (beta > 1) or precision (beta < 1) more heavily.
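
The F-beta family can be written as one small function, with F1 as the beta = 1 case. The function name f_beta and the precision/recall values are illustrative.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 1 gives F1."""
    if precision == 0 and recall == 0:
        return 0.0  # common convention to avoid dividing by zero
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# High precision but poor recall: the arithmetic mean would be 0.5.
p, r = 0.9, 0.1
f1 = f_beta(p, r)
f2 = f_beta(p, r, beta=2.0)      # weights recall more, so it is punished harder
f_half = f_beta(p, r, beta=0.5)  # weights precision more, so it scores higher
print(f1, f2, f_half)
```

With these inputs F1 is well below the arithmetic mean of the two, which is the "punishes imbalance" behavior described above.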

Question 54

What does ROC AUC measure?

Accepted Answer

ROC AUC is the area under the curve of true-positive rate versus false-positive rate across all thresholds, and it equals the probability the model ranks a random positive above a random negative. 0.5 is chance and 1.0 perfect. It is threshold-independent but can look optimistic under heavy imbalance, where PR-AUC is preferred.
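
The ranking interpretation gives a direct, if slow, way to compute AUC: compare every positive with every negative and count how often the positive scores higher, with ties counted as half. This is a sketch of that pairwise definition on made-up scores, not an efficient implementation.

```python
def roc_auc(y_true, scores):
    """AUC as P(score of random positive > score of random negative)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)
```

Three of the four positive/negative pairs are ranked correctly here, so the result is 0.75.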

Question 55

What is k-fold cross-validation?

Accepted Answer

k-fold CV partitions the data into k folds, trains on k-1 and validates on the held-out fold, rotating so each fold is validated once, then averages the scores. This uses data efficiently and gives a more stable estimate than a single split. Use stratified folds for classification and time-based splits for time series.
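
The fold rotation can be sketched with plain index lists. The generator name k_fold_indices is hypothetical, and this version uses contiguous, unshuffled folds; for i.i.d. data one would normally shuffle first, and for classification use stratified folds.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds over range(n)."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    # Spread the remainder over the first n % k folds so sizes differ by at most 1.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

folds = list(k_fold_indices(n=7, k=3))
for train_idx, val_idx in folds:
    print(train_idx, val_idx)
```

Each index appears in exactly one validation fold, and the per-fold scores would then be averaged.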

Question 56

What do L1 and L2 regularization do?

Accepted Answer

Regularization adds a penalty on parameter magnitude to the loss to curb overfitting. L2 (ridge) adds lambda * sum(w^2), shrinking weights smoothly toward zero; L1 (lasso) adds lambda * sum(|w|), which can drive some weights exactly to zero and so performs feature selection. lambda controls strength, and elastic net combines both.
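
Why L1 produces exact zeros while L2 only shrinks can be seen in a one-weight toy problem. This sketch assumes a loss of 0.5 * (w - a)^2 for a single weight w, where a is the unregularized optimum; the closed-form minimizers below follow from that assumption and are not a general solver.

```python
def ridge_1d(a, lam):
    """argmin over w of 0.5 * (w - a)^2 + lam * w^2: proportional shrinkage."""
    return a / (1 + 2 * lam)

def lasso_1d(a, lam):
    """argmin over w of 0.5 * (w - a)^2 + lam * |w|: soft-thresholding."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0  # small weights are set exactly to zero

lam = 0.5
small_ridge, small_lasso = ridge_1d(0.3, lam), lasso_1d(0.3, lam)
large_ridge, large_lasso = ridge_1d(2.0, lam), lasso_1d(2.0, lam)
print(small_ridge, small_lasso, large_ridge, large_lasso)
```

The small weight survives under L2 in shrunken form but is zeroed by L1, which is the feature-selection effect.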

Question 57

What does gradient descent do?

Accepted Answer

Gradient descent updates parameters opposite the gradient of the loss: theta := theta - eta * grad(L), stepping downhill toward a minimum. Variants include batch, stochastic (one example at a time), and mini-batch, while momentum and Adam adapt the step. The learning rate eta governs the step size.
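
The update rule in code, applied to a toy one-parameter loss L(theta) = (theta - 3)^2 chosen purely for illustration (its minimum is at theta = 3):

```python
def gradient_descent(grad, theta0, eta, steps):
    """Repeat theta := theta - eta * grad(theta) for a fixed number of steps."""
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

# Gradient of L(theta) = (theta - 3)^2 is 2 * (theta - 3).
theta_hat = gradient_descent(lambda t: 2 * (t - 3), theta0=0.0, eta=0.1, steps=100)
print(theta_hat)
```

This is the batch form; stochastic and mini-batch variants differ only in estimating the gradient from a subset of the data at each step.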

Question 58

What happens if the learning rate is too high?

Accepted Answer

If the learning rate is too high, updates overshoot the minimum and the loss oscillates or diverges; too low and training is slow and may stall in poor minima. Good practice is to tune it, use schedules with decay or warmup, and adaptive optimizers like Adam to balance speed and stability.
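
A small experiment on the toy loss L(theta) = (theta - 3)^2 shows all three regimes. The specific learning rates are illustrative; on this particular quadratic, any eta above 1 makes each step overshoot by more than the previous error.

```python
def final_error(eta, steps=50, theta0=0.0, target=3.0):
    """Run gradient descent on (theta - target)^2 and return |theta - target|."""
    theta = theta0
    for _ in range(steps):
        grad = 2 * (theta - target)
        theta -= eta * grad
    return abs(theta - target)

err_low = final_error(eta=0.001)  # converging, but barely moved after 50 steps
err_good = final_error(eta=0.1)   # close to the minimum
err_high = final_error(eta=1.1)   # overshoots more each step and diverges
print(err_low, err_good, err_high)
```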

Question 59

Name two ways to handle class imbalance.

Accepted Answer

Tactics include resampling (oversample the minority, e.g. SMOTE, or undersample the majority), applying class weights so the loss penalizes minority errors more, adjusting the decision threshold, and evaluating with precision, recall, F1, or PR-AUC instead of accuracy, which is misleading when one class dominates.

Question 60

What is feature engineering?

Accepted Answer

Feature engineering creates, transforms, and selects input variables, such as interactions, ratios, aggregations, date parts, and text or embedding features, to expose signal the model can use. Domain knowledge guides it, and it often improves results more than swapping algorithms. Do transformations inside CV folds to avoid leakage.

Question 61

What problem does multicollinearity cause in linear regression?

Accepted Answer

Multicollinearity is when predictors are highly correlated, so the model cannot separate their individual effects: coefficient estimates become unstable, have inflated standard errors, and flip signs with small data changes, hurting interpretation though not necessarily prediction. Detect it with VIF and address it by dropping or combining features or using regularization or PCA.

Question 62

How does a decision tree make predictions?

Accepted Answer

A decision tree recursively splits the data on the feature and threshold that best reduce impurity: Gini or entropy for classification, variance for regression. A prediction follows the splits down to a leaf and returns that leaf's majority class or mean. Trees are interpretable but overfit without depth limits or pruning.
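
How a candidate split is scored can be sketched with Gini impurity on a single numeric feature. The helper names and the tiny dataset are invented; a real tree would search over all features and thresholds and then recurse.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_c^2) over the class proportions p_c."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(xs, ys, threshold):
    """Size-weighted Gini impurity of the two children of the split x <= threshold."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    n = len(ys)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

xs = [1, 2, 3, 10, 11, 12]
ys = ["a", "a", "a", "b", "b", "b"]

parent = gini(ys)
at_3 = split_impurity(xs, ys, threshold=3)  # separates the classes perfectly
at_2 = split_impurity(xs, ys, threshold=2)  # leaves one "a" on the wrong side
print(parent, at_3, at_2)
```

The tree would pick the threshold with the lowest weighted impurity, here the split at 3.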

Question 63

Why does a random forest usually beat a single decision tree?

Accepted Answer

A random forest trains many trees on bootstrap samples (bagging) and, at each split, considers a random subset of features, which de-correlates the trees. Averaging their predictions sharply reduces variance versus a single tree, improving generalization at the cost of interpretability. Out-of-bag samples give a built-in validation estimate.

Question 64

How does gradient boosting build a model?

Accepted Answer

Gradient boosting builds an additive ensemble stage by stage: each new weak learner, usually a shallow tree, fits the negative gradient of the loss from the current ensemble (the residuals, in the squared-error case) and is added with a learning-rate shrinkage. This strongly reduces bias but can overfit, so it needs tuning of depth, learning rate, and rounds with early stopping. XGBoost and LightGBM are popular implementations.
