Understand the difference between structured and unstructured data and how data structure impacts storage, retrieval, and analysis.
What Is Structured Data?
Structured data is data that conforms to a predefined schema with rows and columns. Every record follows the same format, making it easy to store in relational databases and spreadsheets and straightforward to query with SQL or pandas.
Relational databases — MySQL, PostgreSQL, SQLite tables with fixed columns and data types.
Spreadsheets — Excel / Google Sheets files with labelled columns.
CSV / TSV files — delimited text files where each row maps to a record and each column to a field.
Exam Tip: Structured data is the easiest to analyze programmatically because its schema enforces consistency. Most exam questions focus on structured tabular data handled via pandas DataFrames.
Characteristics of Structured Data
| Characteristic | Description |
| --- | --- |
| Fixed schema | Column names, data types, and constraints are defined in advance. |
| Easy querying | SQL or pandas can filter, aggregate, and join data efficiently. |
| Low storage overhead | Columnar formats (e.g., Parquet) compress well because data is homogeneous. |
| High searchability | Indexing on columns enables fast lookups. |
What Is Unstructured Data?
Unstructured data has no predefined format or model. It includes free-form text, images, audio, and video. Because there is no schema, traditional relational queries do not work directly on it.
Text — emails, social-media posts, PDF documents, log files.
Images — photographs, medical scans, satellite imagery.
Audio / Video — call-center recordings, surveillance footage.
Semi-structured — JSON, XML, and HTML sit between structured and unstructured; they have tags or keys but no rigid tabular schema.
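Semi-structured data can often be flattened into a tabular form. A minimal sketch using pandas' json_normalize (the nested records here are a made-up example):
import pandas as pd

# Made-up nested JSON records
records = [
    {'id': 1, 'user': {'name': 'Alice', 'city': 'Paris'}},
    {'id': 2, 'user': {'name': 'Bob', 'city': 'Lyon'}},
]
# json_normalize flattens nested keys into dotted column names
df = pd.json_normalize(records)
print(df.columns.tolist())  # ['id', 'user.name', 'user.city']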
Processing Needs for Unstructured Data
Before analysis, unstructured data typically requires feature extraction or transformation to convert it into a structured format:
Natural Language Processing (NLP) — tokenization, TF-IDF, sentiment scores for text.
Key Takeaway: In practice, most data analysis projects begin by converting or extracting structured features from unstructured sources so that standard tools (pandas, SQL) can be used downstream.
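For example, free-form text can be turned into a structured numeric matrix with TF-IDF. A minimal sketch using scikit-learn's TfidfVectorizer (the sample sentences are made up):
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Made-up example documents
docs = ['the cat sat', 'the dog barked', 'the cat chased the dog']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse document-term matrix

# Wrap the scores in a DataFrame so standard tabular tools work downstream
features = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(features.shape)  # (3, 6): one row per document, one column per term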
1.2.2 Erroneous Data
Identify data errors and inconsistencies, understand types of missingness, and learn imputation and deduplication strategies.
Identifying Data Errors and Inconsistencies
Erroneous data is any value that does not accurately represent the real-world entity it is supposed to describe. Common categories include:
Missing values — cells that contain NaN, None, or empty strings.
Inaccurate values — a person's age recorded as 250, or a negative price.
Misleading information — data that is technically valid but contextually wrong (e.g., recording height in inches when the column expects centimeters).
Duplicate records — the same observation entered more than once.
Invalid entries — text in a numeric column, impossible dates (Feb 30), or codes outside the defined set.
Numerical data problems — overflow, rounding errors, or mixed units within the same column.
Detecting Errors with pandas
import pandas as pd
import numpy as np
# Sample dataset
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Eve'],
'age': [28, np.nan, -5, 28, 300],
'salary': [50000, 60000, 55000, 50000, np.nan]
})
# 1. Check for missing values
print(df.isnull().sum())
# name      0
# age       1
# salary    1

# 2. Detailed info (non-null counts and dtypes)
df.info()

# 3. Detect duplicates
print(df.duplicated().sum())  # 1 duplicate row (index 3)

# 4. Quick statistical sanity check
print(df.describe())
# min age = -5, max age = 300 — clearly invalid
Types of Missingness
Understanding why data is missing is critical because the mechanism determines which imputation strategy is appropriate.
| Type | Definition | Example | Implication |
| --- | --- | --- | --- |
| MCAR (Missing Completely At Random) | The probability of being missing is the same for all observations — missingness is unrelated to any variable. | A lab sample is accidentally dropped. | Safe to drop or impute without introducing bias. |
| MAR (Missing At Random) | Missingness depends on observed variables but not on the missing value itself. | Males are less likely to report their weight, but the actual weight does not determine missingness. | Can be addressed with model-based imputation that conditions on the observed variable. |
| MNAR (Missing Not At Random) | Missingness depends on the unobserved (missing) value itself. | People with very high income are less likely to report it. | Most problematic. Imputation may be biased; requires domain knowledge or sensitivity analysis. |
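For MAR data, one simple way to condition on an observed variable is group-wise imputation. A minimal sketch (the column names are hypothetical):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'gender': ['M', 'M', 'F', 'F', 'M'],
    'weight': [80.0, np.nan, 60.0, 62.0, 85.0]
})
# Fill missing weight with the median weight of the same gender group
df['weight'] = df['weight'].fillna(
    df.groupby('gender')['weight'].transform('median')
)
print(df)  # the missing male weight becomes 82.5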
Exam Warning: Know the three types of missingness and be able to match each to an example. This course frequently tests whether a described scenario is MCAR, MAR, or MNAR.
Data Imputation Methods
Imputation replaces missing values with estimated substitutes so that records are not lost. The correct method depends on the data distribution, the type of variable, and the missingness mechanism.
| Method | When to Use | pandas Syntax |
| --- | --- | --- |
| Mean | Numerical columns with roughly symmetric distribution (sensitive to outliers). | df['col'].fillna(df['col'].mean()) |
| Median | Numerical columns with skewed distribution or outliers. | df['col'].fillna(df['col'].median()) |
| Mode | Categorical columns or any column where the most frequent value is a reasonable substitute. | df['col'].fillna(df['col'].mode()[0]) |
| Forward Fill (ffill) | Time-series data where the last observed value is a good proxy. | df['col'].ffill() |
| Backward Fill (bfill) | Time-series data where the next observed value is a better proxy. | df['col'].bfill() |
Removing Duplicate Records
drop_duplicates() removes repeated rows; the subset and keep parameters control which duplicates are dropped.
# Remove exact duplicate rows
df_clean = df.drop_duplicates()

# Remove duplicates based on specific columns
df_clean = df.drop_duplicates(subset=['name', 'age'], keep='first')

# keep='first' — keep the first occurrence (default)
# keep='last' — keep the last occurrence
# keep=False — drop ALL duplicates (including the first)
Implications of Data Correction and Removal
Every decision to correct or remove data affects data integrity:
Removing rows reduces sample size, potentially introducing selection bias if missingness is not MCAR.
Imputing values preserves sample size but may introduce artificial patterns or reduce variance.
Capping outliers prevents extreme values from distorting statistics but alters the true distribution.
Correcting errors (e.g., fixing typos) improves accuracy but must be documented for reproducibility.
Best Practice: Always document every cleaning step you perform. Keep a copy of the raw data so that any transformation can be audited or reversed.
Why Data Collection Matters for Outlier Detection
Outlier detection is only as good as the data that feeds it. Understanding how the data was collected helps determine whether an extreme value is a genuine rare event or a data-entry error.
Sensor data may contain spikes from calibration errors — domain knowledge tells you to discard them.
Survey data may have outliers due to misunderstanding the scale (e.g., answering "100" on a 1–10 scale).
Financial data often has legitimate outliers (market crashes) that should be kept.
High-Quality Data for Accurate Outlier Detection
Dirty data can mask real outliers or create false ones. Before running outlier detection:
Fix data-type issues (strings in numeric columns).
Handle missing values (NaN can skew mean/std calculations).
Standardize units (mixing kg and lbs will create phantom outliers).
How Data Types Influence Outlier Detection Strategies
| Data Type | Outlier Detection Approach |
| --- | --- |
| Continuous numerical | IQR method, Z-score, modified Z-score, box plots |
| Categorical / ordinal | Frequency analysis — values with extremely low frequency may be errors |
| Date/time | Range checks (future dates, dates before system launch) and gap analysis |
Learn Min-Max scaling, Z-score normalization, encoding methods, data reduction, and outlier handling techniques.
Why Normalization Is Needed
Features measured on different scales (e.g., income in thousands vs. age in decades) can distort distance-based algorithms (k-NN, k-Means, SVM) and slow gradient-descent convergence. Normalization puts all features on a comparable scale.
Key Insight: Normalization does not change the underlying relationships in the data — it only rescales values so that no single feature dominates due to its magnitude.
Min-Max Scaling
Rescales values to the range [0, 1] (or any [a, b]).
Formula:
X_scaled = (X - X_min) / (X_max - X_min)
Preserves the original distribution shape.
Sensitive to outliers (a single extreme value stretches the range).
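A minimal Min-Max sketch in pandas (the sample values are made up); scikit-learn's MinMaxScaler applies the same formula and also supports arbitrary [a, b] ranges via its feature_range parameter:
import pandas as pd

# Made-up sample values
s = pd.Series([10, 20, 30, 100])
scaled = (s - s.min()) / (s.max() - s.min())
print(scaled.round(3).tolist())  # [0.0, 0.111, 0.222, 1.0]
Z-Score Normalization
Standardizes values so the transformed feature has mean 0 and standard deviation 1.
Formula:
Z = (X - mean) / std
Because it uses the mean and standard deviation rather than the raw min/max range, it is generally less distorted by a single extreme value than Min-Max scaling. A one-line pandas equivalent (same made-up series):
z = (s - s.mean()) / s.std()  # resulting series has mean 0 and std 1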
Watch Out: The Dummy Variable Trap — When using one-hot encoding in regression models, drop one column (e.g., drop_first=True) to avoid perfect multicollinearity. In tree-based models this is generally not necessary.
Label Encoding
Maps each category to a unique integer. Use for ordinal variables where the order matters (e.g., low < medium < high).
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df = pd.DataFrame({'size': ['small', 'medium', 'large', 'medium', 'small']})
df['size_encoded'] = le.fit_transform(df['size'])
print(df)
# size size_encoded
# 0 small 2
# 1 medium 1
# 2 large 0
# 3 medium 1
# 4   small             2

# To see the mapping:
print(dict(zip(le.classes_, le.transform(le.classes_))))
# {'large': 0, 'medium': 1, 'small': 2}
Note: LabelEncoder assigns integers alphabetically, not by semantic order. For truly ordinal encoding where you control the order, use a manual mapping: df['size'].map({'small': 0, 'medium': 1, 'large': 2}).
Data Reduction: Pros and Cons
Data reduction decreases the volume of data while retaining meaningful information. Common techniques include dimensionality reduction (PCA), feature selection, and aggregation.
| Pros | Cons |
| --- | --- |
| Faster computation and model training | Risk of losing important information |
| Reduced storage and memory requirements | May oversimplify complex relationships |
| Mitigates the curse of dimensionality | Reduced components can be hard to interpret |
| Can reduce noise and improve model performance | Requires careful tuning (e.g., choosing the number of components) |
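A minimal dimensionality-reduction sketch with scikit-learn's PCA (the data is random, purely to show the API):
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # 100 samples, 10 features

pca = PCA(n_components=3)       # keep 3 principal components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (100, 3)
print(pca.explained_variance_ratio_)  # variance retained by each component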
Outlier Handling
Detection with the IQR Method
The Interquartile Range (IQR) method defines outliers as values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR.
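A minimal sketch of this rule in pandas (the sample values are made up; the multiplier is the standard 1.5 stated above):
import pandas as pd

ages = pd.Series([28, 31, 29, 35, 30, -5, 300])
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the IQR fences
outliers = ages[(ages < lower) | (ages > upper)]
print(outliers.tolist())  # [-5, 300]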
Apply string manipulation, boolean normalization, type conversions, encoding, and binning in end-to-end cleaning pipelines.
String Manipulation and Cleaning
Text data is often messy. Common string cleaning operations include trimming whitespace, normalizing case, removing special characters, and extracting patterns with regex.
import pandas as pd
df = pd.DataFrame({
'name': [' Alice ', 'BOB', 'charlie', ' Dave '],
'email': ['ALICE@MAIL.COM', 'bob@mail.com', 'Charlie@Mail.Com', 'dave@MAIL.com'],
'phone': ['(555) 123-4567', '555.234.5678', '555-345-6789', '5554567890']
})
# Strip leading/trailing whitespace
df['name'] = df['name'].str.strip()
# Normalize case — title case for names, lower for emails
df['name'] = df['name'].str.title()
df['email'] = df['email'].str.lower()
# Remove non-digit characters from phone numbers
df['phone'] = df['phone'].str.replace(r'\D', '', regex=True)
print(df)
# name email phone
# 0 Alice alice@mail.com 5551234567
# 1 Bob bob@mail.com 5552345678
# 2 Charlie charlie@mail.com 5553456789
# 3 Dave dave@mail.com 5554567890
Boolean Normalization
Datasets may encode boolean values inconsistently: "Yes", "yes", "Y", 1, "true", etc. Normalize them to a single Python bool type.
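A minimal sketch that maps common truthy/falsy spellings to Python booleans (the mapping itself is an assumption; extend it to whatever spellings your data actually contains):
import pandas as pd

df = pd.DataFrame({'subscribed': ['Yes', 'yes', 'Y', 'no', 'TRUE', '0', 1]})

# Assumed mapping of common spellings; unmapped values become NaN
bool_map = {'yes': True, 'y': True, 'true': True, '1': True,
            'no': False, 'n': False, 'false': False, '0': False}

# Normalize to string, trim, lowercase, then map to bool
df['subscribed'] = (
    df['subscribed'].astype(str).str.strip().str.lower().map(bool_map)
)
print(df['subscribed'].tolist())  # [True, True, True, False, True, False, True]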
Case Conversion Methods
pandas provides several case-conversion methods via the .str accessor:

| Method | Result (for 'Hello World') | Use Case |
| --- | --- | --- |
| .str.lower() | 'hello world' | Email addresses, case-insensitive matching |
| .str.upper() | 'HELLO WORLD' | Country codes, acronyms |
| .str.title() | 'Hello World' | People's names, place names |
| .str.capitalize() | 'Hello world' | Sentence starts |
String-to-Number Conversions
import pandas as pd
df = pd.DataFrame({
'value': ['42', '3.14', 'N/A', '100', 'abc']
})
# pd.to_numeric with errors='coerce' turns non-parseable strings to NaN
df['value_num'] = pd.to_numeric(df['value'], errors='coerce')
print(df)
# value value_num
# 0 42 42.00
# 1 3.14 3.14
# 2 N/A NaN
# 3 100 100.00
# 4 abc NaN
Exam Tip: pd.to_numeric(series, errors='coerce') is the safest way to convert a column that might contain non-numeric strings. The errors='coerce' parameter turns unparseable values into NaN instead of raising an error.
Imputation vs. Exclusion: Pros and Cons
| Strategy | Pros | Cons |
| --- | --- | --- |
| Imputation | Preserves sample size; utilizes available partial data; maintains statistical power | Introduces artificial values; can reduce variance; may mask underlying patterns |
| Exclusion (Dropping) | Simple to implement; no artificial data introduced; maintains data authenticity | Reduces sample size; can introduce selection bias (if not MCAR); loss of information |
Rule of Thumb: If a column has more than 40–50% missing values, consider dropping the entire column rather than imputing. If a row has most of its values missing, consider dropping the row. For smaller amounts of missingness, imputation is usually preferred.
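A minimal sketch applying those thresholds (the 50% cutoffs are the rule-of-thumb values above, not fixed constants):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [np.nan, np.nan, np.nan, 4],  # 75% missing -> drop the column
    'c': [1.0, np.nan, 3.0, 4.0],
})

# Drop columns where more than half the values are missing
df = df.loc[:, df.isnull().mean() <= 0.5]

# Drop rows with fewer than half of their values present
df = df.dropna(thresh=df.shape[1] // 2)
print(df.columns.tolist())  # ['a', 'c']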
One-Hot Encoding for Categorical Variables
One-hot encoding converts each category level into a separate binary column. This is the standard approach for nominal (unordered) categories in most machine learning pipelines.
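A minimal sketch with pd.get_dummies, using drop_first=True to sidestep the dummy variable trap mentioned earlier (the color column is a made-up example):
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})

# One binary column per category; drop_first=True removes one redundant column
encoded = pd.get_dummies(df, columns=['color'], drop_first=True)
print(encoded.columns.tolist())  # ['color_green', 'color_red'] ('blue' is the dropped baseline)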
pd.cut() vs pd.qcut(): pd.cut() divides the value range into equal-width intervals. pd.qcut() divides the data into intervals with roughly equal numbers of observations. Use qcut when you want balanced groups regardless of value distribution.
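A minimal sketch contrasting the two (the ages and bin counts are arbitrary):
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 60, 68])

# pd.cut: 3 bins of equal width across the value range
width_bins = pd.cut(ages, bins=3)
# pd.qcut: 3 bins each containing roughly the same number of observations
freq_bins = pd.qcut(ages, q=3)

print(width_bins.value_counts().sort_index())  # counts vary with the distribution
print(freq_bins.value_counts().sort_index())   # counts are roughly equal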
Comprehensive Cleaning Pipeline Example
Below is a realistic end-to-end cleaning pipeline that combines multiple techniques covered in this topic.
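A minimal sketch chaining the steps covered above; the column names and cleaning rules are illustrative assumptions:
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    'name': ['  Alice ', 'BOB', 'bob', 'Charlie'],
    'age': ['28', 'n/a', '34', '34'],
    'subscribed': ['Yes', 'no', 'NO', 'y'],
})

df = raw.copy()  # keep the raw data untouched for auditability

# 1. String cleaning: trim whitespace, normalize case
df['name'] = df['name'].str.strip().str.title()

# 2. Type conversion: coerce non-numeric ages to NaN, then median-impute
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df['age'] = df['age'].fillna(df['age'].median())

# 3. Boolean normalization (assumed mapping)
df['subscribed'] = df['subscribed'].str.lower().map(
    {'yes': True, 'y': True, 'no': False}
)

# 4. Deduplication on the cleaned name column
df = df.drop_duplicates(subset=['name'], keep='first')

print(df)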
Test your understanding with 10 multiple-choice questions. The correct answer and an explanation follow each question.
Q1. Which type of data has a predefined schema with rows and columns?
A) Unstructured data
B) Structured data
C) Semi-structured data
D) Raw binary data
Correct: B) Structured data is organized in a tabular format with a fixed schema (predefined columns with data types). Relational databases and CSV files are common examples. Unstructured data (A) lacks a schema, semi-structured data (C) has tags/keys but no rigid tabular layout, and raw binary (D) is not a standard data classification category.
Q2. A survey records weight for most participants, but males are less likely to fill in the weight field. The actual weight does not influence whether the field is left blank. What type of missingness is this?
A) MCAR (Missing Completely At Random)
B) MAR (Missing At Random)
C) MNAR (Missing Not At Random)
D) Systematic missingness
Correct: B) The missingness depends on an observed variable (gender) but not on the missing value itself (weight). This matches the definition of MAR. If the actual weight determined whether it was reported (e.g., heavier people avoid reporting), it would be MNAR.
Q3. Which imputation method is MOST appropriate for a numerical column with a highly skewed distribution?
A) Mean imputation
B) Median imputation
C) Mode imputation
D) Forward fill
Correct: B) The median is robust to outliers and skewness, making it the best central-tendency imputer for skewed numerical data. Mean imputation (A) is sensitive to outliers and would be pulled toward the tail. Mode (C) is typically used for categorical data. Forward fill (D) is for time-series data.
Q4. What is the result of applying Min-Max scaling to a value that equals the column minimum?
A) 0
B) 1
C) -1
D) 0.5
Correct: A) The Min-Max formula is (X - X_min) / (X_max - X_min). When X equals X_min, the numerator is 0, so the result is 0. The maximum value maps to 1.
Q5. After Z-score normalization, what are the mean and standard deviation of the transformed feature?
A) Mean = 0, Std = 0
B) Mean = 0, Std = 1
C) Mean = 1, Std = 0
D) Mean = 0.5, Std = 0.5
Correct: B) Z-score normalization (standardization) transforms data so that the mean is 0 and the standard deviation is 1. The formula Z = (X - mean) / std achieves this by centering on the mean and scaling by the standard deviation.
Q6. Which encoding method should you use for a nominal categorical variable with no inherent order (e.g., color: red, blue, green)?
A) One-hot encoding
B) Label encoding
C) Ordinal encoding
D) Binary encoding
Correct: A) One-hot encoding creates a separate binary column for each category and does not imply any ordinal relationship. Label encoding (B) assigns integers that a model might misinterpret as having an order (e.g., 0 < 1 < 2), which is incorrect for nominal data.
Q7. Using the IQR method, a data point is considered an outlier if it falls:
A) Below Q1 or above Q3
B) Below Q1 - 1.0*IQR or above Q3 + 1.0*IQR
C) Below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
D) Below Q1 - 2.0*IQR or above Q3 + 2.0*IQR
Correct: C) The standard IQR rule defines outliers as observations below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. Values beyond Q1 - 3 * IQR or Q3 + 3 * IQR are sometimes called "extreme outliers."
Q8. What does pd.to_numeric(series, errors='coerce') do when it encounters a non-numeric string?
A) Raises a ValueError
B) Converts it to 0
C) Converts it to NaN
D) Skips the value and leaves the original string
Correct: C) The errors='coerce' parameter tells pandas to convert any value that cannot be parsed as a number into NaN rather than raising an error (errors='raise', the default) or leaving it as-is (errors='ignore').
Q9. What is the main difference between pd.cut() and pd.qcut()?
A) pd.cut() is for categorical data; pd.qcut() is for numerical data
B) pd.cut() creates equal-width bins; pd.qcut() creates equal-frequency bins
Correct: B) pd.cut() divides the value range into intervals of equal width (e.g., 0-10, 10-20, 20-30). pd.qcut() divides the data into intervals that each contain approximately the same number of observations (quantile-based). Both are used for numerical data.
Q10. A dataset has 10% missing values in a column and the data is MCAR. Which approach is MOST appropriate?
A) Drop the entire column
B) Impute with mean or median
C) Drop all rows with any missing values in the dataset
D) Replace with a constant value of 0
Correct: B) With only 10% missing and MCAR mechanism, imputation (mean for symmetric, median for skewed) is the most appropriate strategy. It preserves sample size without introducing bias (since MCAR means missingness is random). Dropping the column (A) is wasteful for 10% missingness. Dropping all rows (C) reduces data unnecessarily. Replacing with 0 (D) would introduce bias unless 0 is actually a meaningful value.