Understand the difference between structured and unstructured data and how data structure impacts storage, retrieval, and analysis.
What Is Structured Data?
Structured data is data that conforms to a predefined schema with rows and columns. Every record follows the same format, making it easy to store in relational databases and spreadsheets and straightforward to query with SQL or pandas.
Relational databases — MySQL, PostgreSQL, SQLite tables with fixed columns and data types.
Spreadsheets — Excel / Google Sheets files with labelled columns.
CSV / TSV files — delimited text files where each row maps to a record and each column to a field.
Exam Tip: Structured data is the easiest to analyze programmatically because its schema enforces consistency. Most exam questions focus on structured tabular data handled via pandas DataFrames.
Characteristics of Structured Data
| Characteristic | Description |
| --- | --- |
| Fixed schema | Column names, data types, and constraints are defined in advance. |
| Easy querying | SQL or pandas can filter, aggregate, and join data efficiently. |
| Low storage overhead | Columnar formats (e.g., Parquet) compress well because data is homogeneous. |
| High searchability | Indexing on columns enables fast lookups. |
What Is Unstructured Data?
Unstructured data has no predefined format or model. It includes free-form text, images, audio, and video. Because there is no schema, traditional relational queries do not work directly on it.
Text — emails, social-media posts, PDF documents, log files.
Images — photographs, medical scans, satellite imagery.
Audio / Video — call-center recordings, surveillance footage.
Semi-structured — JSON, XML, and HTML sit between structured and unstructured; they have tags or keys but no rigid tabular schema.
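Semi-structured data can often be flattened into a tabular form. A minimal sketch using pandas' json_normalize (the nested records here are a made-up example):
import pandas as pd

# Made-up nested JSON records
records = [
    {'id': 1, 'user': {'name': 'Alice', 'city': 'Paris'}},
    {'id': 2, 'user': {'name': 'Bob', 'city': 'Lyon'}},
]
# json_normalize flattens nested keys into dotted column names
df = pd.json_normalize(records)
print(df.columns.tolist())  # ['id', 'user.name', 'user.city']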
Processing Needs for Unstructured Data
Before analysis, unstructured data typically requires feature extraction or transformation to convert it into a structured format:
Natural Language Processing (NLP) — tokenization, TF-IDF, sentiment scores for text.
Key Takeaway: In practice, most data analysis projects begin by converting or extracting structured features from unstructured sources so that standard tools (pandas, SQL) can be used downstream.
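For example, free-form text can be turned into a structured numeric matrix with TF-IDF. A minimal sketch using scikit-learn's TfidfVectorizer (the sample sentences are made up):
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Made-up example documents
docs = ['the cat sat', 'the dog barked', 'the cat chased the dog']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse document-term matrix

# Wrap the scores in a DataFrame so standard tabular tools work downstream
features = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(features.shape)  # (3, 6): one row per document, one column per term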
1.2.2 Erroneous Data
Identify data errors and inconsistencies, understand types of missingness, and learn imputation and deduplication strategies.
Identifying Data Errors and Inconsistencies
Erroneous data is any value that does not accurately represent the real-world entity it is supposed to describe. Common categories include:
Missing values — cells that contain NaN, None, or empty strings.
Inaccurate values — a person's age recorded as 250, or a negative price.
Misleading information — data that is technically valid but contextually wrong (e.g., recording height in inches when the column expects centimeters).
Duplicate records — the same observation entered more than once.
Invalid entries — text in a numeric column, impossible dates (Feb 30), or codes outside the defined set.
Numerical data problems — overflow, rounding errors, or mixed units within the same column.
Detecting Errors with pandas
import pandas as pd
import numpy as np
# Sample dataset
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Eve'],
'age': [28, np.nan, -5, 28, 300],
'salary': [50000, 60000, 55000, 50000, np.nan]
})
# 1. Check for missing values
print(df.isnull().sum())
# name      0
# age       1
# salary    1

# 2. Detailed info (non-null counts and dtypes)
df.info()

# 3. Detect duplicates
print(df.duplicated().sum())  # 1 duplicate row (index 3)

# 4. Quick statistical sanity check
print(df.describe())
# min age = -5, max age = 300 — clearly invalid
Types of Missingness
Understanding why data is missing is critical because the mechanism determines which imputation strategy is appropriate.
| Type | Definition | Example | Implication |
| --- | --- | --- | --- |
| MCAR (Missing Completely At Random) | The probability of being missing is the same for all observations — missingness is unrelated to any variable. | A lab sample is accidentally dropped. | Safe to drop or impute without introducing bias. |
| MAR (Missing At Random) | Missingness depends on observed variables but not on the missing value itself. | Males are less likely to report their weight, but the actual weight does not determine missingness. | Can be addressed with model-based imputation that conditions on the observed variable. |
| MNAR (Missing Not At Random) | Missingness depends on the unobserved (missing) value itself. | People with very high income are less likely to report it. | Most problematic. Imputation may be biased; requires domain knowledge or sensitivity analysis. |
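For MAR data, one simple way to condition on an observed variable is group-wise imputation. A minimal sketch (the column names are hypothetical):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'gender': ['M', 'M', 'F', 'F', 'M'],
    'weight': [80.0, np.nan, 60.0, 62.0, 85.0]
})
# Fill missing weight with the median weight of the same gender group
df['weight'] = df['weight'].fillna(
    df.groupby('gender')['weight'].transform('median')
)
print(df)  # the missing male weight becomes 82.5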
Exam Warning: Know the three types of missingness and be able to match each to an example. This course frequently tests whether a described scenario is MCAR, MAR, or MNAR.
Data Imputation Methods
Imputation replaces missing values with estimated substitutes so that records are not lost. The correct method depends on the data distribution, the type of variable, and the missingness mechanism.
| Method | When to Use | pandas Syntax |
| --- | --- | --- |
| Mean | Numerical columns with roughly symmetric distribution (sensitive to outliers). | df['col'].fillna(df['col'].mean()) |
| Median | Numerical columns with skewed distribution or outliers. | df['col'].fillna(df['col'].median()) |
| Mode | Categorical columns or any column where the most frequent value is a reasonable substitute. | df['col'].fillna(df['col'].mode()[0]) |
| Forward Fill (ffill) | Time-series data where the last observed value is a good proxy. | df['col'].ffill() |
| Backward Fill (bfill) | Time-series data where the next observed value is a better proxy. | df['col'].bfill() |
Removing Duplicate Records
drop_duplicates() removes repeated rows; the subset and keep parameters control which duplicates are dropped.
# Remove exact duplicate rows
df_clean = df.drop_duplicates()

# Remove duplicates based on specific columns
df_clean = df.drop_duplicates(subset=['name', 'age'], keep='first')

# keep='first' — keep the first occurrence (default)
# keep='last' — keep the last occurrence
# keep=False — drop ALL duplicates (including the first)
Implications of Data Correction and Removal
Every decision to correct or remove data affects data integrity:
Removing rows reduces sample size, potentially introducing selection bias if missingness is not MCAR.
Imputing values preserves sample size but may introduce artificial patterns or reduce variance.
Capping outliers prevents extreme values from distorting statistics but alters the true distribution.
Correcting errors (e.g., fixing typos) improves accuracy but must be documented for reproducibility.
Best Practice: Always document every cleaning step you perform. Keep a copy of the raw data so that any transformation can be audited or reversed.
Why Data Collection Matters for Outlier Detection
Outlier detection is only as good as the data that feeds it. Understanding how the data was collected helps determine whether an extreme value is a genuine rare event or a data-entry error.
Sensor data may contain spikes from calibration errors — domain knowledge tells you to discard them.
Survey data may have outliers due to misunderstanding the scale (e.g., answering "100" on a 1–10 scale).
Financial data often has legitimate outliers (market crashes) that should be kept.
High-Quality Data for Accurate Outlier Detection
Dirty data can mask real outliers or create false ones. Before running outlier detection:
Fix data-type issues (strings in numeric columns).
Handle missing values (NaN can skew mean/std calculations).
Standardize units (mixing kg and lbs will create phantom outliers).
How Data Types Influence Outlier Detection Strategies
| Data Type | Outlier Detection Approach |
| --- | --- |
| Continuous numerical | IQR method, Z-score, modified Z-score, box plots |
| Categorical / ordinal | Frequency analysis — values with extremely low frequency may be errors |
| Date/time | Range checks (future dates, dates before system launch) and gap analysis |
Learn Min-Max scaling, Z-score normalization, encoding methods, data reduction, and outlier handling techniques.
Why Normalization Is Needed
Features measured on different scales (e.g., income in thousands vs. age in decades) can distort distance-based algorithms (k-NN, k-Means, SVM) and slow gradient-descent convergence. Normalization puts all features on a comparable scale.
Key Insight: Normalization does not change the underlying relationships in the data — it only rescales values so that no single feature dominates due to its magnitude.
Min-Max Scaling
Rescales values to the range [0, 1] (or any [a, b]).
Formula:
X_scaled = (X - X_min) / (X_max - X_min)
Preserves the original distribution shape.
Sensitive to outliers (a single extreme value stretches the range).
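A minimal Min-Max sketch in pandas (the sample values are made up); scikit-learn's MinMaxScaler applies the same formula and also supports arbitrary [a, b] ranges via its feature_range parameter:
import pandas as pd

# Made-up sample values
s = pd.Series([10, 20, 30, 100])
scaled = (s - s.min()) / (s.max() - s.min())
print(scaled.round(3).tolist())  # [0.0, 0.111, 0.222, 1.0]
Z-Score Normalization
Standardizes values so the transformed feature has mean 0 and standard deviation 1.
Formula:
Z = (X - mean) / std
Because it uses the mean and standard deviation rather than the raw min/max range, it is generally less distorted by a single extreme value than Min-Max scaling. A one-line pandas equivalent (same made-up series):
z = (s - s.mean()) / s.std()  # resulting series has mean 0 and std 1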
Watch Out: The Dummy Variable Trap — When using one-hot encoding in regression models, drop one column (e.g., drop_first=True) to avoid perfect multicollinearity. In tree-based models this is generally not necessary.
Label Encoding
Maps each category to a unique integer. Use for ordinal variables where the order matters (e.g., low < medium < high).
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df = pd.DataFrame({'size': ['small', 'medium', 'large', 'medium', 'small']})
df['size_encoded'] = le.fit_transform(df['size'])
print(df)
# size size_encoded
# 0 small 2
# 1 medium 1
# 2 large 0
# 3 medium 1
# 4   small             2

# To see the mapping:
print(dict(zip(le.classes_, le.transform(le.classes_))))
# {'large': 0, 'medium': 1, 'small': 2}
Note: LabelEncoder assigns integers alphabetically, not by semantic order. For truly ordinal encoding where you control the order, use a manual mapping: df['size'].map({'small': 0, 'medium': 1, 'large': 2}).
Data Reduction: Pros and Cons
Data reduction decreases the volume of data while retaining meaningful information. Common techniques include dimensionality reduction (PCA), feature selection, and aggregation.
| Pros | Cons |
| --- | --- |
| Faster computation and model training | Risk of losing important information |
| Reduced storage and memory requirements | May oversimplify complex relationships |
| Mitigates the curse of dimensionality | Reduced components can be hard to interpret |
| Can reduce noise and improve model performance | Requires careful tuning (e.g., choosing the number of components) |
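A minimal dimensionality-reduction sketch with scikit-learn's PCA (the data is random, purely to show the API):
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # 100 samples, 10 features

pca = PCA(n_components=3)       # keep 3 principal components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (100, 3)
print(pca.explained_variance_ratio_)  # variance retained by each component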
Outlier Handling
Detection with the IQR Method
The Interquartile Range (IQR) method defines outliers as values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR.
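A minimal sketch of this rule in pandas (the sample values are made up; the multiplier is the standard 1.5 stated above):
import pandas as pd

ages = pd.Series([28, 31, 29, 35, 30, -5, 300])
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the IQR fences
outliers = ages[(ages < lower) | (ages > upper)]
print(outliers.tolist())  # [-5, 300]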
Apply string manipulation, boolean normalization, type conversions, encoding, and binning in end-to-end cleaning pipelines.
String Manipulation and Cleaning
Text data is often messy. Common string cleaning operations include trimming whitespace, normalizing case, removing special characters, and extracting patterns with regex.
import pandas as pd
df = pd.DataFrame({
'name': [' Alice ', 'BOB', 'charlie', ' Dave '],
'email': ['ALICE@MAIL.COM', 'bob@mail.com', 'Charlie@Mail.Com', 'dave@MAIL.com'],
'phone': ['(555) 123-4567', '555.234.5678', '555-345-6789', '5554567890']
})
# Strip leading/trailing whitespace
df['name'] = df['name'].str.strip()
# Normalize case — title case for names, lower for emails
df['name'] = df['name'].str.title()
df['email'] = df['email'].str.lower()
# Remove non-digit characters from phone numbers
df['phone'] = df['phone'].str.replace(r'\D', '', regex=True)
print(df)
# name email phone
# 0 Alice alice@mail.com 5551234567
# 1 Bob bob@mail.com 5552345678
# 2 Charlie charlie@mail.com 5553456789
# 3 Dave dave@mail.com 5554567890
Boolean Normalization
Datasets may encode boolean values inconsistently: "Yes", "yes", "Y", 1, "true", etc. Normalize them to a single Python bool type.
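A minimal sketch that maps common truthy/falsy spellings to Python booleans (the mapping itself is an assumption; extend it to whatever spellings your data actually contains):
import pandas as pd

df = pd.DataFrame({'subscribed': ['Yes', 'yes', 'Y', 'no', 'TRUE', '0', 1]})

# Assumed mapping of common spellings; unmapped values become NaN
bool_map = {'yes': True, 'y': True, 'true': True, '1': True,
            'no': False, 'n': False, 'false': False, '0': False}

# Normalize to string, trim, lowercase, then map to bool
df['subscribed'] = (
    df['subscribed'].astype(str).str.strip().str.lower().map(bool_map)
)
print(df['subscribed'].tolist())  # [True, True, True, False, True, False, True]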
Case Conversion Methods
pandas provides several case-conversion methods via the .str accessor:

| Method | Result (for 'Hello World') | Use Case |
| --- | --- | --- |
| .str.lower() | 'hello world' | Email addresses, case-insensitive matching |
| .str.upper() | 'HELLO WORLD' | Country codes, acronyms |
| .str.title() | 'Hello World' | People's names, place names |
| .str.capitalize() | 'Hello world' | Sentence starts |
String-to-Number Conversions
import pandas as pd
df = pd.DataFrame({
'value': ['42', '3.14', 'N/A', '100', 'abc']
})
# pd.to_numeric with errors='coerce' turns non-parseable strings to NaN
df['value_num'] = pd.to_numeric(df['value'], errors='coerce')
print(df)
# value value_num
# 0 42 42.00
# 1 3.14 3.14
# 2 N/A NaN
# 3 100 100.00
# 4 abc NaN
Exam Tip: pd.to_numeric(series, errors='coerce') is the safest way to convert a column that might contain non-numeric strings. The errors='coerce' parameter turns unparseable values into NaN instead of raising an error.
Imputation vs. Exclusion: Pros and Cons
| Strategy | Pros | Cons |
| --- | --- | --- |
| Imputation | Preserves sample size; utilizes available partial data; maintains statistical power | Introduces artificial values; can reduce variance; may mask underlying patterns |
| Exclusion (Dropping) | Simple to implement; no artificial data introduced; maintains data authenticity | Reduces sample size; can introduce selection bias (if not MCAR); loss of information |
Rule of Thumb: If a column has more than 40–50% missing values, consider dropping the entire column rather than imputing. If a row has most of its values missing, consider dropping the row. For smaller amounts of missingness, imputation is usually preferred.
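A minimal sketch applying those thresholds (the 50% cutoffs are the rule-of-thumb values above, not fixed constants):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [np.nan, np.nan, np.nan, 4],  # 75% missing -> drop the column
    'c': [1.0, np.nan, 3.0, 4.0],
})

# Drop columns where more than half the values are missing
df = df.loc[:, df.isnull().mean() <= 0.5]

# Drop rows with fewer than half of their values present
df = df.dropna(thresh=df.shape[1] // 2)
print(df.columns.tolist())  # ['a', 'c']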
One-Hot Encoding for Categorical Variables
One-hot encoding converts each category level into a separate binary column. This is the standard approach for nominal (unordered) categories in most machine learning pipelines.
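A minimal sketch with pd.get_dummies, using drop_first=True to sidestep the dummy variable trap mentioned earlier (the color column is a made-up example):
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})

# One binary column per category; drop_first=True removes one redundant column
encoded = pd.get_dummies(df, columns=['color'], drop_first=True)
print(encoded.columns.tolist())  # ['color_green', 'color_red'] ('blue' is the dropped baseline)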
pd.cut() vs pd.qcut(): pd.cut() divides the value range into equal-width intervals. pd.qcut() divides the data into intervals with roughly equal numbers of observations. Use qcut when you want balanced groups regardless of value distribution.
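A minimal sketch contrasting the two (the ages and bin counts are arbitrary):
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 60, 68])

# pd.cut: 3 bins of equal width across the value range
width_bins = pd.cut(ages, bins=3)
# pd.qcut: 3 bins each containing roughly the same number of observations
freq_bins = pd.qcut(ages, q=3)

print(width_bins.value_counts().sort_index())  # counts vary with the distribution
print(freq_bins.value_counts().sort_index())   # counts are roughly equal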
Comprehensive Cleaning Pipeline Example
Below is a realistic end-to-end cleaning pipeline that combines multiple techniques covered in this topic.
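A minimal sketch chaining the steps covered above; the column names and cleaning rules are illustrative assumptions:
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    'name': ['  Alice ', 'BOB', 'bob', 'Charlie'],
    'age': ['28', 'n/a', '34', '34'],
    'subscribed': ['Yes', 'no', 'NO', 'y'],
})

df = raw.copy()  # keep the raw data untouched for auditability

# 1. String cleaning: trim whitespace, normalize case
df['name'] = df['name'].str.strip().str.title()

# 2. Type conversion: coerce non-numeric ages to NaN, then median-impute
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df['age'] = df['age'].fillna(df['age'].median())

# 3. Boolean normalization (assumed mapping)
df['subscribed'] = df['subscribed'].str.lower().map(
    {'yes': True, 'y': True, 'no': False}
)

# 4. Deduplication on the cleaned name column
df = df.drop_duplicates(subset=['name'], keep='first')

print(df)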
Test your understanding with 10 multiple-choice questions. The correct answer and an explanation follow each question.
Q1. Which type of data has a predefined schema with rows and columns?
A) Unstructured data
B) Structured data
C) Semi-structured data
D) Raw binary data
Correct: B) Structured data is organized in a tabular format with a fixed schema (predefined columns with data types). Relational databases and CSV files are common examples. Unstructured data (A) lacks a schema, semi-structured data (C) has tags/keys but no rigid tabular layout, and raw binary (D) is not a standard data classification category.
Q2. A survey records weight for most participants, but males are less likely to fill in the weight field. The actual weight does not influence whether the field is left blank. What type of missingness is this?
A) MCAR (Missing Completely At Random)
B) MAR (Missing At Random)
C) MNAR (Missing Not At Random)
D) Systematic missingness
Correct: B) The missingness depends on an observed variable (gender) but not on the missing value itself (weight). This matches the definition of MAR. If the actual weight determined whether it was reported (e.g., heavier people avoid reporting), it would be MNAR.
Q3. Which imputation method is MOST appropriate for a numerical column with a highly skewed distribution?
A) Mean imputation
B) Median imputation
C) Mode imputation
D) Forward fill
Correct: B) The median is robust to outliers and skewness, making it the best central-tendency imputer for skewed numerical data. Mean imputation (A) is sensitive to outliers and would be pulled toward the tail. Mode (C) is typically used for categorical data. Forward fill (D) is for time-series data.
Q4. What is the result of applying Min-Max scaling to a value that equals the column minimum?
A) 0
B) 1
C) -1
D) 0.5
Correct: A) The Min-Max formula is (X - X_min) / (X_max - X_min). When X equals X_min, the numerator is 0, so the result is 0. The maximum value maps to 1.
Q5. After Z-score normalization, what are the mean and standard deviation of the transformed feature?
A) Mean = 0, Std = 0
B) Mean = 0, Std = 1
C) Mean = 1, Std = 0
D) Mean = 0.5, Std = 0.5
Correct: B) Z-score normalization (standardization) transforms data so that the mean is 0 and the standard deviation is 1. The formula Z = (X - mean) / std achieves this by centering on the mean and scaling by the standard deviation.
Q6. Which encoding method should you use for a nominal categorical variable with no inherent order (e.g., color: red, blue, green)?
A) One-hot encoding
B) Label encoding
C) Ordinal encoding
D) Binary encoding
Correct: A) One-hot encoding creates a separate binary column for each category and does not imply any ordinal relationship. Label encoding (B) assigns integers that a model might misinterpret as having an order (e.g., 0 < 1 < 2), which is incorrect for nominal data.
Q7. Using the IQR method, a data point is considered an outlier if it falls:
A) Below Q1 or above Q3
B) Below Q1 - 1.0*IQR or above Q3 + 1.0*IQR
C) Below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
D) Below Q1 - 2.0*IQR or above Q3 + 2.0*IQR
Correct: C) The standard IQR rule defines outliers as observations below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. Values beyond Q1 - 3 * IQR or Q3 + 3 * IQR are sometimes called "extreme outliers."
Q8. What does pd.to_numeric(series, errors='coerce') do when it encounters a non-numeric string?
A) Raises a ValueError
B) Converts it to 0
C) Converts it to NaN
D) Skips the value and leaves the original string
Correct: C) The errors='coerce' parameter tells pandas to convert any value that cannot be parsed as a number into NaN rather than raising an error (errors='raise', the default) or leaving it as-is (errors='ignore').
Q9. What is the main difference between pd.cut() and pd.qcut()?
A) pd.cut() is for categorical data; pd.qcut() is for numerical data
B) pd.cut() creates equal-width bins; pd.qcut() creates equal-frequency bins
Correct: B) pd.cut() divides the value range into intervals of equal width (e.g., 0-10, 10-20, 20-30). pd.qcut() divides the data into intervals that each contain approximately the same number of observations (quantile-based). Both are used for numerical data.
Q10. A dataset has 10% missing values in a column and the data is MCAR. Which approach is MOST appropriate?
A) Drop the entire column
B) Impute with mean or median
C) Drop all rows with any missing values in the dataset
D) Replace with a constant value of 0
Correct: B) With only 10% missing and MCAR mechanism, imputation (mean for symmetric, median for skewed) is the most appropriate strategy. It preserves sample size without introducing bias (since MCAR means missingness is random). Dropping the column (A) is wasteful for 10% missingness. Dropping all rows (C) reduces data unnecessarily. Replacing with 0 (D) would introduce bias unless 0 is actually a meaningful value.