
Data Cleaning and Standardization

Block 1: Data Acquisition and Pre-Processing

Topic 1.2 · 4 Objectives

1.2.1 Structured and Unstructured Data

Understand the difference between structured and unstructured data and how data structure impacts storage, retrieval, and analysis.

What Is Structured Data?

Structured data is data that conforms to a predefined schema with rows and columns. Every record follows the same format, making it easy to store in relational databases and spreadsheets and straightforward to query with SQL or pandas.

Exam Tip: Structured data is the easiest to analyze programmatically because its schema enforces consistency. Most exam questions focus on structured tabular data handled via pandas DataFrames.

Characteristics of Structured Data

| Characteristic | Description |
| --- | --- |
| Fixed schema | Column names, data types, and constraints are defined in advance. |
| Easy querying | SQL or pandas can filter, aggregate, and join data efficiently. |
| Low storage overhead | Columnar formats (e.g., Parquet) compress well because data is homogeneous. |
| High searchability | Indexing on columns enables fast lookups. |

What Is Unstructured Data?

Unstructured data has no predefined format or model. It includes free-form text, images, audio, and video. Because there is no schema, traditional relational queries do not work directly on it.

Processing Needs for Unstructured Data

Before analysis, unstructured data typically requires feature extraction or transformation to convert it into a structured format (see the sketch below).
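
As a minimal, hypothetical sketch of this idea (the column names and keyword below are illustrative assumptions, not part of the original material), the snippet derives simple structured features from free-form text so the result can be analyzed with pandas:

import pandas as pd

# Hypothetical free-form text (unstructured data)
reviews = pd.Series([
    "Great product, fast shipping!",
    "Terrible experience. Would not buy again.",
    "Okay value for the price."
])

# Extract simple structured features from the raw text
features = pd.DataFrame({
    'char_count': reviews.str.len(),                        # length of each review
    'word_count': reviews.str.split().str.len(),            # number of words
    'mentions_shipping': reviews.str.contains('shipping', case=False)  # keyword flag
})

print(features)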

How Data Structure Impacts Storage, Retrieval, and Analysis

| Aspect | Structured | Unstructured |
| --- | --- | --- |
| Storage | RDBMS, columnar stores (Parquet, Feather) | Object storage (S3), NoSQL (MongoDB), file systems |
| Retrieval | SQL queries, pandas indexing — fast and precise | Full-text search engines (Elasticsearch), specialized APIs |
| Analysis | Direct aggregation, statistical analysis, ML on tabular data | Requires a pre-processing pipeline to extract features first |
| Scalability | Scales well with indexing and partitioning | Requires distributed storage (Hadoop HDFS, cloud lakes) |

Key Takeaway: In practice, most data analysis projects begin by converting or extracting structured features from unstructured sources so that standard tools (pandas, SQL) can be used downstream.

1.2.2 Erroneous Data

Identify data errors and inconsistencies, understand types of missingness, and learn imputation and deduplication strategies.

Identifying Data Errors and Inconsistencies

Erroneous data is any value that does not accurately represent the real-world entity it is supposed to describe. Common categories include missing values, duplicate records, impossible or out-of-range values (e.g., a negative age), and inconsistent formatting.

Detecting Errors with pandas

import pandas as pd
import numpy as np

# Sample dataset
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Eve'],
    'age': [28, np.nan, -5, 28, 300],
    'salary': [50000, 60000, 55000, 50000, np.nan]
})

# 1. Check for missing values
print(df.isnull().sum())
# name      0
# age       1
# salary    1

# 2. Detailed info (non-null counts and dtypes)
df.info()

# 3. Detect duplicates
print(df.duplicated().sum())
# 1 duplicate row (index 3)

# 4. Quick statistical sanity check
print(df.describe())
# min age = -5, max age = 300 — clearly invalid

Types of Missingness

Understanding why data is missing is critical because the mechanism determines which imputation strategy is appropriate.

| Type | Definition | Example | Implication |
| --- | --- | --- | --- |
| MCAR (Missing Completely At Random) | The probability of being missing is the same for all observations — missingness is unrelated to any variable. | A lab sample is accidentally dropped. | Safe to drop or impute without introducing bias. |
| MAR (Missing At Random) | Missingness depends on observed variables but not on the missing value itself. | Males are less likely to report their weight, but the actual weight does not determine missingness. | Can be addressed with model-based imputation that conditions on the observed variable. |
| MNAR (Missing Not At Random) | Missingness depends on the unobserved (missing) value itself. | People with very high income are less likely to report it. | Most problematic. Imputation may be biased; requires domain knowledge or sensitivity analysis. |

Exam Warning: Know the three types of missingness and be able to match each to an example. This course frequently tests whether a described scenario is MCAR, MAR, or MNAR.

Data Imputation Methods

Imputation replaces missing values with estimated substitutes so that records are not lost. The correct method depends on the data distribution, the type of variable, and the missingness mechanism.

| Method | When to Use | pandas Syntax |
| --- | --- | --- |
| Mean | Numerical columns with a roughly symmetric distribution (sensitive to outliers). | df['col'].fillna(df['col'].mean()) |
| Median | Numerical columns with a skewed distribution or outliers. | df['col'].fillna(df['col'].median()) |
| Mode | Categorical columns, or any column where the most frequent value is a reasonable substitute. | df['col'].fillna(df['col'].mode()[0]) |
| Forward Fill (ffill) | Time-series data where the last observed value is a good proxy. | df['col'].ffill() |
| Backward Fill (bfill) | Time-series data where the next observed value is a better proxy. | df['col'].bfill() |

Imputation Examples

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'temperature': [22.1, np.nan, 23.4, np.nan, 21.8],
    'city': ['NYC', 'LA', np.nan, 'NYC', 'LA']
})

# Mean imputation (numerical)
df['temperature'] = df['temperature'].fillna(df['temperature'].mean())

# Mode imputation (categorical)
df['city'] = df['city'].fillna(df['city'].mode()[0])

# Forward fill (time-series friendly)
ts = pd.Series([10, np.nan, np.nan, 20, np.nan, 30])
print(ts.ffill())
# 0    10.0
# 1    10.0
# 2    10.0
# 3    20.0
# 4    20.0
# 5    30.0

# Backward fill
print(ts.bfill())
# 0    10.0
# 1    20.0
# 2    20.0
# 3    20.0
# 4    30.0
# 5    30.0

Dropping Duplicates

# Remove exact duplicate rows
df_clean = df.drop_duplicates()

# Remove duplicates based on specific columns
df_clean = df.drop_duplicates(subset=['name', 'age'], keep='first')
# keep='first' — keep the first occurrence (default)
# keep='last'  — keep the last occurrence
# keep=False   — drop ALL duplicates (including the first)

Implications of Data Correction and Removal

Every decision to correct or remove data affects data integrity: dropping rows shrinks the sample and can bias results if the missingness is not random, while imputing or correcting values introduces estimates that were never actually observed.

Best Practice: Always document every cleaning step you perform. Keep a copy of the raw data so that any transformation can be audited or reversed.
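
One lightweight way to follow this practice, shown here as an illustrative sketch (the variable names and log format are assumptions, not a prescribed convention), is to keep an untouched copy of the raw DataFrame and append a note for each cleaning step:

import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [28, np.nan, -5, 42]})

# Keep an untouched copy of the raw data for auditing
df_raw = df.copy()

# Record each cleaning step in a simple log
cleaning_log = []

# Step 1: flag impossible ages as missing
df.loc[~df['age'].between(0, 120), 'age'] = np.nan
cleaning_log.append("Set ages outside [0, 120] to NaN")

# Step 2: impute remaining missing ages with the median
df['age'] = df['age'].fillna(df['age'].median())
cleaning_log.append("Imputed missing ages with the median")

print(cleaning_log)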

Data Collection Importance for Outlier Detection

Outlier detection is only as good as the data that feeds it. Understanding how the data was collected helps determine whether an extreme value is a genuine rare event or a data-entry error.

High-Quality Data for Accurate Outlier Detection

Dirty data can mask real outliers or create false ones. Before running outlier detection, resolve missing values, duplicates, and obvious data-entry errors so that any extreme values you find reflect the underlying phenomenon rather than collection mistakes.

How Data Types Influence Outlier Detection Strategies

| Data Type | Outlier Detection Approach |
| --- | --- |
| Continuous numerical | IQR method, Z-score, modified Z-score, box plots |
| Categorical / ordinal | Frequency analysis — values with extremely low frequency may be errors |
| Date/time | Range checks (future dates, dates before system launch) and gap analysis |
| Text | Length distribution, unexpected characters, regex validation |
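
As a small, hypothetical illustration of the text row above (the column values and the regex pattern are assumptions), length statistics and a regex check can flag suspicious entries in a text column:

import pandas as pd

emails = pd.Series(['alice@mail.com', 'bob@mail.com', 'not-an-email', 'x'])

# Length distribution: very short strings stand out
print(emails.str.len().describe())

# Regex validation: entries that do not look like an email address
valid = emails.str.match(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')
print(emails[~valid])
# 2    not-an-email
# 3               x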

1.2.3 Data Normalization and Scaling

Learn Min-Max scaling, Z-score normalization, encoding methods, data reduction, and outlier handling techniques.

Why Normalization Is Needed

Features measured on different scales (e.g., income in thousands vs. age in decades) can distort distance-based algorithms (k-NN, k-Means, SVM) and slow gradient-descent convergence. Normalization puts all features on a comparable scale.

Key Insight: Normalization does not change the underlying relationships in the data — it only rescales values so that no single feature dominates due to its magnitude.
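
To see why scale matters for distance-based methods, the sketch below (toy numbers assumed for illustration, not from the original material) compares the Euclidean distance between two customers before and after Min-Max scaling; unscaled, the income feature dominates the distance almost entirely:

import numpy as np

# Two customers: [income in dollars, age in years]
a = np.array([50000.0, 25.0])
b = np.array([52000.0, 60.0])

# Unscaled distance is driven almost entirely by income
print(np.linalg.norm(a - b))  # ~2000.3 (the 35-year age gap barely registers)

# After Min-Max scaling both features to [0, 1] (toy min/max values assumed)
income_min, income_max = 30000, 120000
age_min, age_max = 18, 80
a_scaled = np.array([(a[0] - income_min) / (income_max - income_min),
                     (a[1] - age_min) / (age_max - age_min)])
b_scaled = np.array([(b[0] - income_min) / (income_max - income_min),
                     (b[1] - age_min) / (age_max - age_min)])
print(np.linalg.norm(a_scaled - b_scaled))  # ~0.565 (age now contributes meaningfully)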

Min-Max Scaling

Rescales values to the range [0, 1] (or any [a, b]).

Formula:

X_scaled = (X - X_min) / (X_max - X_min)

Min-Max Scaling Example

import pandas as pd

df = pd.DataFrame({
    'income': [30000, 50000, 80000, 120000],
    'age': [22, 35, 45, 60]
})

# Manual Min-Max scaling
for col in ['income', 'age']:
    df[col + '_scaled'] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

print(df)
#    income  age  income_scaled  age_scaled
# 0   30000   22       0.000000    0.000000
# 1   50000   35       0.222222    0.342105
# 2   80000   45       0.555556    0.605263
# 3  120000   60       1.000000    1.000000

Z-Score Normalization (Standardization)

Transforms values so they have a mean of 0 and a standard deviation of 1.

Formula:

Z = (X - mean) / std

Z-Score Normalization Example

import pandas as pd

df = pd.DataFrame({'score': [70, 80, 90, 100, 60]})

# Z-score normalization
df['score_z'] = (df['score'] - df['score'].mean()) / df['score'].std()

print(df)
#    score   score_z
# 0     70 -0.632456
# 1     80  0.000000
# 2     90  0.632456
# 3    100  1.264911
# 4     60 -1.264911

Comparison: Min-Max vs Z-Score

| Feature | Min-Max Scaling | Z-Score (Standardization) |
| --- | --- | --- |
| Output range | [0, 1] | Unbounded (centered at 0) |
| Outlier sensitivity | High | Moderate |
| Preserves distribution | Yes (same shape) | Yes (same shape) |
| Best for | Neural networks, image pixels | Linear regression, PCA, SVMs |
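
In practice both transformations are usually applied with scikit-learn rather than by hand; a brief sketch (assuming scikit-learn is installed) using MinMaxScaler and StandardScaler:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({'income': [30000, 50000, 80000, 120000],
                   'age': [22, 35, 45, 60]})

# Min-Max scaling to [0, 1]
df_minmax = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# Z-score standardization (mean 0, std 1)
# Note: StandardScaler divides by the population std (ddof=0), so the values
# differ slightly from the manual pandas example, which uses .std() (ddof=1).
df_standard = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

print(df_minmax)
print(df_standard)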

Encoding Categorical Variables

One-Hot Encoding

Creates a new binary (0/1) column for each unique category value. Use when there is no ordinal relationship between categories.

import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue', 'red']})

# One-hot encoding with pd.get_dummies()
df_encoded = pd.get_dummies(df, columns=['color'], dtype=int)
print(df_encoded)
#    color_blue  color_green  color_red
# 0           0            0          1
# 1           1            0          0
# 2           0            1          0
# 3           1            0          0
# 4           0            0          1

Watch Out: The Dummy Variable Trap — When using one-hot encoding in regression models, drop one column (e.g., drop_first=True) to avoid perfect multicollinearity. In tree-based models this is generally not necessary.

Label Encoding

Maps each category to a unique integer. Use for ordinal variables where the order matters (e.g., low < medium < high).

import pandas as pd
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df = pd.DataFrame({'size': ['small', 'medium', 'large', 'medium', 'small']})
df['size_encoded'] = le.fit_transform(df['size'])
print(df)
#      size  size_encoded
# 0   small             2
# 1  medium             1
# 2   large             0
# 3  medium             1
# 4   small             2

# To see the mapping:
print(dict(zip(le.classes_, le.transform(le.classes_))))
# {'large': 0, 'medium': 1, 'small': 2}

Note: LabelEncoder assigns integers alphabetically, not by semantic order. For truly ordinal encoding where you control the order, use a manual mapping: df['size'].map({'small': 0, 'medium': 1, 'large': 2}).

Data Reduction: Pros and Cons

Data reduction decreases the volume of data while retaining meaningful information. Common techniques include dimensionality reduction (PCA), feature selection, and aggregation.

| Pros | Cons |
| --- | --- |
| Faster computation and model training | Risk of losing important information |
| Reduced storage and memory requirements | May oversimplify complex relationships |
| Mitigates the curse of dimensionality | Reduced components can be hard to interpret |
| Can reduce noise and improve model performance | Requires careful tuning (e.g., choosing the number of components) |
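
A minimal sketch of the PCA technique mentioned above, assuming scikit-learn is available (the toy dataset and the choice of two components are illustrative, not prescriptive):

import numpy as np
from sklearn.decomposition import PCA

# Toy data: 5 samples, 4 correlated features with underlying rank-2 structure
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 4))

# Reduce 4 features to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (5, 2)
print(pca.explained_variance_ratio_.sum())  # close to 1.0 — little information lost here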

Outlier Handling

Detection with the IQR Method

The Interquartile Range (IQR) method defines outliers as values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR.

import pandas as pd

data = pd.Series([10, 12, 14, 15, 15, 16, 18, 19, 20, 100])

Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = data[(data < lower_bound) | (data > upper_bound)]
print(f"IQR: {IQR}")
print(f"Bounds: [{lower_bound}, {upper_bound}]")
print(f"Outliers: {outliers.tolist()}")
# IQR: 4.5
# Bounds: [7.5, 25.5]
# Outliers: [100]

Detection with the Z-Score Method

import pandas as pd

data = pd.Series([10, 12, 14, 15, 15, 16, 18, 19, 20, 100])

z_scores = (data - data.mean()) / data.std()
print(data[z_scores.abs() > 3].tolist())
# [] — in a sample this small, the extreme value inflates the standard deviation,
# so its Z-score is only about 2.83 and the usual |Z| > 3 rule misses it.

# With a lower cutoff the extreme value is flagged
print(data[z_scores.abs() > 2.5].tolist())
# [100]

Outlier Treatment Techniques

| Technique | Description | When to Use |
| --- | --- | --- |
| Remove | Delete the outlier rows. | When you are confident the values are errors. |
| Cap / Winsorize | Replace outliers with the nearest non-outlier value (e.g., the upper/lower bound). | When you want to keep the row but limit extreme influence. |
| Transform | Apply log, sqrt, or Box-Cox to compress the range. | When the data is right-skewed and outliers are legitimate. |
| Impute | Replace with mean/median. | When the value is clearly an error but removal would lose other column data. |
| Keep | Leave as-is and use robust methods (e.g., median instead of mean). | When outliers are genuine and informative. |
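
As a small illustration of the Cap / Winsorize row (reusing the IQR bounds computed earlier; a sketch rather than a prescribed recipe), pandas' clip() replaces values outside the bounds with the bounds themselves:

import pandas as pd

data = pd.Series([10, 12, 14, 15, 15, 16, 18, 19, 20, 100])

Q1, Q3 = data.quantile(0.25), data.quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Cap values at the IQR bounds instead of dropping them
capped = data.clip(lower=lower, upper=upper)
print(capped.tolist())
# [10.0, 12.0, 14.0, 15.0, 15.0, 16.0, 18.0, 19.0, 20.0, 25.5]
# The extreme value 100 is capped at the upper bound.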

Data Format Standardization

Ensuring consistent formats is essential for joining datasets and running comparisons.

import pandas as pd

# Standardize mixed date formats
# format='mixed' (pandas >= 2.0) lets each entry be parsed individually;
# older pandas versions infer the format per element by default.
dates = pd.Series(['2024-01-15', '01/20/2024', 'March 5, 2024'])
dates_clean = pd.to_datetime(dates, format='mixed')
print(dates_clean)
# 0   2024-01-15
# 1   2024-01-20
# 2   2024-03-05
# dtype: datetime64[ns]

# Standardize numerical strings with currency symbols
prices = pd.Series(['$1,200.50', '$850.00', '$3,100.75'])
prices_clean = prices.str.replace('[$,]', '', regex=True).astype(float)
print(prices_clean)
# 0    1200.50
# 1     850.00
# 2    3100.75
# dtype: float64

1.2.4 Applying Cleaning Techniques

Apply string manipulation, boolean normalization, type conversions, encoding, and binning in end-to-end cleaning pipelines.

String Manipulation and Cleaning

Text data is often messy. Common string cleaning operations include trimming whitespace, normalizing case, removing special characters, and extracting patterns with regex.

import pandas as pd

df = pd.DataFrame({
    'name': [' Alice ', 'BOB', 'charlie', ' Dave '],
    'email': ['ALICE@MAIL.COM', 'bob@mail.com', 'Charlie@Mail.Com', 'dave@MAIL.com'],
    'phone': ['(555) 123-4567', '555.234.5678', '555-345-6789', '5554567890']
})

# Strip leading/trailing whitespace
df['name'] = df['name'].str.strip()

# Normalize case — title case for names, lower for emails
df['name'] = df['name'].str.title()
df['email'] = df['email'].str.lower()

# Remove non-digit characters from phone numbers
df['phone'] = df['phone'].str.replace(r'\D', '', regex=True)

print(df)
#       name             email       phone
# 0    Alice    alice@mail.com  5551234567
# 1      Bob      bob@mail.com  5552345678
# 2  Charlie  charlie@mail.com  5553456789
# 3     Dave     dave@mail.com  5554567890

Boolean Normalization

Datasets may encode boolean values inconsistently: "Yes", "yes", "Y", 1, "true", etc. Normalize them to a single Python bool type.

import pandas as pd

df = pd.DataFrame({
    'subscribed': ['Yes', 'no', 'Y', 'TRUE', 'false', '1', '0']
})

# Map various representations to True/False
true_values = {'yes', 'y', 'true', '1'}
df['subscribed_clean'] = df['subscribed'].str.lower().isin(true_values)

print(df)
#   subscribed  subscribed_clean
# 0        Yes              True
# 1         no             False
# 2          Y              True
# 3       TRUE              True
# 4      false             False
# 5          1              True
# 6          0             False

String Case Normalization

pandas provides several case-conversion methods via the .str accessor:

| Method | Result | Use Case |
| --- | --- | --- |
| .str.lower() | 'hello world' | Email addresses, case-insensitive matching |
| .str.upper() | 'HELLO WORLD' | Country codes, acronyms |
| .str.title() | 'Hello World' | People's names, place names |
| .str.capitalize() | 'Hello world' | Sentence starts |

String-to-Number Conversions

import pandas as pd

df = pd.DataFrame({
    'value': ['42', '3.14', 'N/A', '100', 'abc']
})

# pd.to_numeric with errors='coerce' turns non-parseable strings into NaN
df['value_num'] = pd.to_numeric(df['value'], errors='coerce')
print(df)
#   value  value_num
# 0    42      42.00
# 1  3.14       3.14
# 2   N/A        NaN
# 3   100     100.00
# 4   abc        NaN

Exam Tip: pd.to_numeric(series, errors='coerce') is the safest way to convert a column that might contain non-numeric strings. The errors='coerce' parameter turns unparseable values into NaN instead of raising an error.

Imputation vs. Exclusion: Pros and Cons

| Strategy | Pros | Cons |
| --- | --- | --- |
| Imputation | Preserves sample size; utilizes available partial data; maintains statistical power | Introduces artificial values; can reduce variance; may mask underlying patterns |
| Exclusion (Dropping) | Simple to implement; no artificial data introduced; maintains data authenticity | Reduces sample size; can introduce selection bias (if not MCAR); loss of information |

Rule of Thumb: If a column has more than 40–50% missing values, consider dropping the entire column rather than imputing. If a row has most of its values missing, consider dropping the row. For smaller amounts of missingness, imputation is usually preferred.
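
A short sketch of how that rule of thumb might be applied with pandas (the 50% column cutoff and the row threshold are the illustrative values from above, not fixed rules):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0],  # 75% missing (a candidate for dropping)
    'a': [1, 2, np.nan, 4],
    'b': [10, 20, 30, 40]
})

# Fraction of missing values per column
missing_frac = df.isnull().mean()
print(missing_frac)

# Drop columns with more than 50% missing values
df = df.loc[:, missing_frac <= 0.5]

# Drop rows with fewer than half of their values present
df = df.dropna(thresh=int(df.shape[1] * 0.5))

# Impute the small amount of remaining missingness
df = df.fillna(df.median(numeric_only=True))
print(df)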

One-Hot Encoding for Categorical Variables

One-hot encoding converts each category level into a separate binary column. This is the standard approach for nominal (unordered) categories in most machine learning pipelines.

import pandas as pd

df = pd.DataFrame({
    'city': ['NYC', 'LA', 'Chicago', 'NYC', 'LA'],
    'sales': [200, 150, 300, 250, 180]
})

# One-hot encode 'city'
df_encoded = pd.get_dummies(df, columns=['city'], dtype=int)
print(df_encoded)
#    sales  city_Chicago  city_LA  city_NYC
# 0    200             0        0         1
# 1    150             0        1         0
# 2    300             1        0         0
# 3    250             0        0         1
# 4    180             0        1         0

# With drop_first=True to avoid multicollinearity
df_encoded2 = pd.get_dummies(df, columns=['city'], drop_first=True, dtype=int)
print(df_encoded2)
#    sales  city_LA  city_NYC
# 0    200        0        1
# 1    150        1        0
# 2    300        0        0
# 3    250        0        1
# 4    180        1        0

Bucketization (Binning) of Continuous Variables

Bucketization (also called binning or discretization) converts continuous numerical values into categorical groups (bins). This is useful when you want interpretable groups (e.g., age brackets), when you need to reduce the influence of noise and extreme values, or when a downstream model or report works with categories rather than raw values.

Using pd.cut() for Equal-Width Bins

import pandas as pd

df = pd.DataFrame({
    'age': [5, 17, 22, 35, 45, 55, 67, 78, 85]
})

# Define custom bin edges and labels
bins = [0, 18, 35, 50, 65, 100]
labels = ['Child', 'Young Adult', 'Middle Age', 'Senior', 'Elderly']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels)
print(df)
#    age    age_group
# 0    5        Child
# 1   17        Child
# 2   22  Young Adult
# 3   35  Young Adult
# 4   45   Middle Age
# 5   55       Senior
# 6   67      Elderly
# 7   78      Elderly
# 8   85      Elderly

Using pd.qcut() for Equal-Frequency Bins

# pd.qcut() creates bins with approximately equal numbers of observations
df['age_quartile'] = pd.qcut(df['age'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(df[['age', 'age_quartile']])
#    age age_quartile
# 0    5           Q1
# 1   17           Q1
# 2   22           Q1
# 3   35           Q2
# 4   45           Q2
# 5   55           Q3
# 6   67           Q3
# 7   78           Q4
# 8   85           Q4
# Note: quantile edges (22, 45, 67) fall into the lower bin because bins are right-inclusive.

pd.cut() vs pd.qcut(): pd.cut() divides the value range into equal-width intervals. pd.qcut() divides the data into intervals with roughly equal numbers of observations. Use qcut when you want balanced groups regardless of value distribution.

Comprehensive Cleaning Pipeline Example

Below is a realistic end-to-end cleaning pipeline that combines multiple techniques covered in this topic.

import pandas as pd
import numpy as np

# ---- 1. Load raw data ----
df = pd.DataFrame({
    'name': [' alice ', 'BOB', 'Charlie', ' alice ', 'eve', 'Frank'],
    'age': [28, np.nan, -5, 28, 300, 42],
    'salary': ['$50,000', '$60,000', '$55,000', '$50,000', np.nan, '$70,000'],
    'dept': ['Sales', 'IT', 'Sales', 'Sales', 'HR', 'IT'],
    'active': ['Yes', 'true', '1', 'Yes', 'no', 'FALSE']
})

# ---- 2. String cleaning ----
df['name'] = df['name'].str.strip().str.title()

# ---- 3. Boolean normalization ----
true_vals = {'yes', 'true', '1'}
df['active'] = df['active'].str.lower().isin(true_vals)

# ---- 4. Numeric standardization ----
df['salary'] = df['salary'].str.replace(r'[$,]', '', regex=True).astype(float)

# ---- 5. Remove duplicates ----
df = df.drop_duplicates(subset=['name', 'age'], keep='first')

# ---- 6. Handle invalid ages ----
df.loc[~df['age'].between(0, 120), 'age'] = np.nan

# ---- 7. Impute missing values ----
df['age'] = df['age'].fillna(df['age'].median())
df['salary'] = df['salary'].fillna(df['salary'].median())

# ---- 8. One-hot encode department ----
df = pd.get_dummies(df, columns=['dept'], dtype=int)

# ---- 9. Min-Max scale salary ----
df['salary_scaled'] = (
    (df['salary'] - df['salary'].min()) /
    (df['salary'].max() - df['salary'].min())
)

print(df)
# Clean, encoded, and scaled DataFrame ready for analysis

Practice Quiz — Topic 1.2

Test your understanding with 10 multiple-choice questions. The correct answer and an explanation follow each question.

Q1. Which type of data has a predefined schema with rows and columns?
A) Unstructured data
B) Structured data
C) Semi-structured data
D) Raw binary data
Correct: B) Structured data is organized in a tabular format with a fixed schema (predefined columns with data types). Relational databases and CSV files are common examples. Unstructured data (A) lacks a schema, semi-structured data (C) has tags/keys but no rigid tabular layout, and raw binary (D) is not a standard data classification category.
Q2. A survey records weight for most participants, but males are less likely to fill in the weight field. The actual weight does not influence whether the field is left blank. What type of missingness is this?
A) MCAR (Missing Completely At Random)
B) MAR (Missing At Random)
C) MNAR (Missing Not At Random)
D) Systematic missingness
Correct: B) The missingness depends on an observed variable (gender) but not on the missing value itself (weight). This matches the definition of MAR. If the actual weight determined whether it was reported (e.g., heavier people avoid reporting), it would be MNAR.
Q3. Which imputation method is MOST appropriate for a numerical column with a highly skewed distribution?
A) Mean imputation
B) Median imputation
C) Mode imputation
D) Forward fill
Correct: B) The median is robust to outliers and skewness, making it the best central-tendency imputer for skewed numerical data. Mean imputation (A) is sensitive to outliers and would be pulled toward the tail. Mode (C) is typically used for categorical data. Forward fill (D) is for time-series data.
Q4. What is the result of applying Min-Max scaling to a value that equals the column minimum?
A) 0
B) 1
C) -1
D) 0.5
Correct: A) The Min-Max formula is (X - X_min) / (X_max - X_min). When X equals X_min, the numerator is 0, so the result is 0. The maximum value maps to 1.
Q5. After Z-score normalization, what are the mean and standard deviation of the transformed feature?
A) Mean = 0, Std = 0
B) Mean = 0, Std = 1
C) Mean = 1, Std = 0
D) Mean = 0.5, Std = 0.5
Correct: B) Z-score normalization (standardization) transforms data so that the mean is 0 and the standard deviation is 1. The formula Z = (X - mean) / std achieves this by centering on the mean and scaling by the standard deviation.
Q6. Which encoding method should you use for a nominal categorical variable with no inherent order (e.g., color: red, blue, green)?
A) One-hot encoding
B) Label encoding
C) Ordinal encoding
D) Binary encoding
Correct: A) One-hot encoding creates a separate binary column for each category and does not imply any ordinal relationship. Label encoding (B) assigns integers that a model might misinterpret as having an order (e.g., 0 < 1 < 2), which is incorrect for nominal data.
Q7. Using the IQR method, a data point is considered an outlier if it falls:
A) Below Q1 or above Q3
B) Below Q1 - 1.0*IQR or above Q3 + 1.0*IQR
C) Below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
D) Below Q1 - 2.0*IQR or above Q3 + 2.0*IQR
Correct: C) The standard IQR rule defines outliers as observations below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. Values beyond Q1 - 3 * IQR or Q3 + 3 * IQR are sometimes called "extreme outliers."
Q8. What does pd.to_numeric(series, errors='coerce') do when it encounters a non-numeric string?
A) Raises a ValueError
B) Converts it to 0
C) Converts it to NaN
D) Skips the value and leaves the original string
Correct: C) The errors='coerce' parameter tells pandas to convert any value that cannot be parsed as a number into NaN rather than raising an error (errors='raise', the default) or leaving it as-is (errors='ignore').
Q9. What is the main difference between pd.cut() and pd.qcut()?
A) pd.cut() is for categorical data; pd.qcut() is for numerical data
B) pd.cut() creates equal-width bins; pd.qcut() creates equal-frequency bins
C) pd.cut() is deprecated in favor of pd.qcut()
D) pd.qcut() creates equal-width bins; pd.cut() creates equal-frequency bins
Correct: B) pd.cut() divides the value range into intervals of equal width (e.g., 0-10, 10-20, 20-30). pd.qcut() divides the data into intervals that each contain approximately the same number of observations (quantile-based). Both are used for numerical data.
Q10. A dataset has 10% missing values in a column and the data is MCAR. Which approach is MOST appropriate?
A) Drop the entire column
B) Impute with mean or median
C) Drop all rows with any missing values in the dataset
D) Replace with a constant value of 0
Correct: B) With only 10% missing and MCAR mechanism, imputation (mean for symmetric, median for skewed) is the most appropriate strategy. It preserves sample size without introducing bias (since MCAR means missingness is random). Dropping the column (A) is wasteful for 10% missingness. Dropping all rows (C) reduces data unnecessarily. Replacing with 0 (D) would introduce bias unless 0 is actually a meaningful value.
