Block 1: Data Acquisition and Pre-Processing
Topic 1.4 · 5 Objectives
Data comes in many file formats, and as a data analyst you must know when to use each format, how to read them into Python, and how to export your results. The four most common formats you should know are CSV, JSON, XML, and TXT.
| Format | Best For | Structure | Python Library |
|---|---|---|---|
| CSV | Tabular data (rows and columns) | Comma-separated values, one record per line | pandas, csv |
| JSON | Structured/nested data, APIs | Key-value pairs, supports nesting | pandas, json |
| XML | Hierarchical data, configuration files | Tags with attributes and nested elements | pandas, xml.etree.ElementTree |
| TXT | Unstructured text, logs | Plain text, custom delimiters possible | pandas, built-in open() |
CSV is the most common format for tabular data. Each line represents a row, and values are separated by commas (or other delimiters like tabs or semicolons). CSV files do not store data types, so everything is read as strings unless explicitly converted.
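A minimal sketch of reading a CSV with pandas; the file name `sales.csv` and its columns are hypothetical, and `dtype`/`parse_dates` show how to convert columns explicitly instead of leaving them as inferred strings:

```python
import pandas as pd

# Read a CSV file; 'sales.csv' and its columns are example names.
df = pd.read_csv(
    "sales.csv",
    dtype={"region": "string"},   # force a column to a specific dtype
    parse_dates=["order_date"],   # convert a date column while reading
)
print(df.dtypes)
```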
JSON stores data as key-value pairs and supports nested structures. It is the standard format for API responses and web data exchange. JSON preserves data types better than CSV (numbers, booleans, nulls).
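A small sketch of working with JSON in Python, using an inline example record to show how booleans and nulls survive parsing:

```python
import json
import pandas as pd

# Parse a JSON string into Python objects (dicts, lists, numbers, booleans, None).
raw = '{"user": "ana", "active": true, "scores": [88, 92, null]}'
record = json.loads(raw)
print(record["active"])          # True -- the boolean type is preserved

# json_normalize flattens a dict (including nested keys) into DataFrame columns.
df = pd.json_normalize(record)
```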
XML uses a tree structure with tags, attributes, and nested elements. It is common in legacy systems, configuration files, and enterprise data feeds.
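A minimal parsing sketch with the standard library's `xml.etree.ElementTree`, using a tiny inline XML document:

```python
import xml.etree.ElementTree as ET

# A small XML document: tags, attributes, and nested elements.
xml_data = """
<catalog>
  <book id="1"><title>Python Basics</title><price>29.99</price></book>
  <book id="2"><title>Data Analysis</title><price>39.99</price></book>
</catalog>
"""
root = ET.fromstring(xml_data)
for book in root.findall("book"):
    # .get() reads an attribute; .findtext() reads a child element's text.
    print(book.get("id"), book.findtext("title"), book.findtext("price"))
```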
TXT files are used for unstructured text, log files, or data with custom delimiters. They require careful handling since there is no inherent structure.
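A sketch of line-by-line processing with the built-in `open()`; the file name `app.log` and the `ERROR` filter are hypothetical:

```python
# Read a plain-text log file line by line; 'app.log' is an example file.
with open("app.log", encoding="utf-8") as fh:
    for line in fh:
        line = line.rstrip("\n")
        if "ERROR" in line:   # no inherent structure, so filter manually
            print(line)
```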
While not a plain text format, Excel files (.xlsx, .xls) are ubiquitous in business settings. Pandas provides excellent Excel support.
pd.read_csv() is the most versatile reader — it can handle CSV, TSV, and any custom-delimited TXT file by setting the sep parameter. Remember that pd.read_excel() requires the openpyxl engine for .xlsx files.
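A short sketch of both points; the file names are hypothetical, and reading `.xlsx` assumes `openpyxl` is installed (`pip install openpyxl`):

```python
import pandas as pd

# The same reader handles any delimiter; only the sep argument changes.
tsv = pd.read_csv("measurements.tsv", sep="\t")   # tab-separated
piped = pd.read_csv("export.txt", sep="|")        # pipe-separated TXT

# .xlsx files require the openpyxl engine to be installed.
xl = pd.read_excel("report.xlsx", sheet_name="Q1")
```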
Data analysts must know how to access data from various storage locations and then organize it effectively for analysis. This includes local files, databases, and online repositories.
Local files are the simplest data source. Use absolute or relative paths, and handle missing files gracefully.
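A minimal sketch of graceful file handling with `pathlib`; the path `data/customers.csv` is a hypothetical example:

```python
from pathlib import Path
import pandas as pd

path = Path("data") / "customers.csv"   # relative path, example file

if path.exists():
    df = pd.read_csv(path)
else:
    # Fail gracefully instead of letting FileNotFoundError crash the script.
    print(f"File not found: {path.resolve()}")
```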
Databases are the backbone of enterprise data storage. Python connects to databases using libraries like sqlite3, sqlalchemy, or database-specific drivers.
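A sketch using SQLite, which ships with Python; the database file and table are hypothetical:

```python
import sqlite3
import pandas as pd

# SQLite needs no external driver; 'company.db' is an example database.
conn = sqlite3.connect("company.db")

# Run a SQL query and load the result directly into a DataFrame.
df = pd.read_sql_query("SELECT name, salary FROM employees", conn)
conn.close()
```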
Many datasets are hosted online. Pandas can read directly from URLs, and libraries like sklearn.datasets provide built-in sample data.
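A quick sketch of both options; the URL is a placeholder:

```python
import pandas as pd
from sklearn.datasets import load_iris

# pandas readers accept URLs directly; this URL is a placeholder.
df_remote = pd.read_csv("https://example.com/data.csv")

# scikit-learn bundles small sample datasets for practice.
iris = load_iris(as_frame=True)
df_iris = iris.frame
```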
Once data is loaded, organizing it is essential. Sorting and filtering are the most fundamental data management operations.
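A minimal sketch of both operations on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ana", "Bo", "Cy"], "age": [34, 28, 45]})

# Sort by one or more columns; ascending can be a list matching the columns.
by_age = df.sort_values("age", ascending=False)

# Filter rows with a boolean mask.
over_30 = df[df["age"] > 30]
```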
When combining multiple boolean conditions in pandas, you must use & (and), | (or), and ~ (not) instead of Python's and, or, not. Each condition must be wrapped in parentheses.
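A sketch contrasting the correct operators with the common mistake, on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 35, 45], "salary": [40_000, 60_000, 80_000]})

# Correct: & / | / ~ with each condition in parentheses.
both = df[(df["age"] > 30) & (df["salary"] > 50_000)]
either = df[(df["age"] > 40) | (df["salary"] < 45_000)]
negated = df[~(df["age"] > 30)]

# Incorrect: df[df["age"] > 30 and df["salary"] > 50_000]
# raises "The truth value of a Series is ambiguous".
```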
Modern data analysis often requires extracting data from databases, APIs, and web pages. Understanding how to do this programmatically is a core skill for the exam.
APIs (Application Programming Interfaces) provide structured access to web data. The requests library is the standard tool for making HTTP requests in Python.
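A minimal request sketch; the endpoint URL and query parameter are hypothetical:

```python
import requests

# GET request to a placeholder endpoint.
response = requests.get(
    "https://api.example.com/v1/users",
    params={"page": 1},          # query string parameters
    timeout=10,                  # always set a timeout
)
response.raise_for_status()      # raise an error on 4xx/5xx responses
data = response.json()           # parse the JSON body into Python objects
```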
When data is not available via an API, web scraping allows you to extract information directly from HTML pages. The BeautifulSoup library (from bs4) parses HTML and provides methods to navigate and search the document tree.
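A parsing sketch on a small inline HTML snippet:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Quarterly Report</h1>
  <ul><li class="item">Revenue</li><li class="item">Costs</li></ul>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.find("h1").text)                    # first matching tag
for li in soup.find_all("li", class_="item"):  # all matching tags
    print(li.text)
```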
Pandas provides a shortcut for extracting tables from HTML pages:
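(A minimal sketch; the URL is a placeholder, and `pd.read_html()` needs an HTML parser such as lxml installed.)

```python
import pandas as pd

# read_html returns a list of DataFrames, one per <table> on the page.
tables = pd.read_html("https://example.com/stats.html")
df = tables[0]   # pick the table you need by position
```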
When extracting data from multiple sources, you often face compatibility and integrity issues. A common example is inconsistent missing-value encoding: one source marks missing data as N/A, another uses null, another uses empty strings. Harmonizing these representations is a prerequisite for merging.

Responsible web scraping follows established guidelines to respect website owners and legal boundaries.
| Practice | Description |
|---|---|
| Respect robots.txt | Check /robots.txt before scraping. It specifies which pages crawlers are allowed or disallowed to access. |
| Rate-limiting | Add delays between requests (e.g., time.sleep(1)) to avoid overwhelming the server. |
| Identify yourself | Set a descriptive User-Agent header so the server knows who is accessing the data. |
| Check Terms of Service | Some websites explicitly prohibit scraping in their ToS. Always review legal terms. |
| Prefer APIs | If a site offers an API, use it instead of scraping. APIs are designed for programmatic access. |
This course expects you to understand ethical scraping practices. Always check robots.txt, implement rate-limiting, and prefer APIs over scraping when available.
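A minimal sketch of a polite request loop that combines these practices; the URLs and the User-Agent contact string are hypothetical:

```python
import time
import requests

# Identify yourself and rate-limit requests; values here are examples.
headers = {"User-Agent": "data-course-bot/1.0 (contact@example.com)"}
urls = ["https://example.com/page1", "https://example.com/page2"]

for url in urls:
    response = requests.get(url, headers=headers, timeout=10)
    print(url, response.status_code)
    time.sleep(1)   # pause between requests to avoid overloading the server
```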
Spreadsheets (Excel, Google Sheets) remain a primary data tool in business. Knowing how to manipulate spreadsheets programmatically and follow best practices is an essential skill.
Good spreadsheet layout makes data easier to analyze and reduces errors: keep a single header row, store one record per row and one variable per column, and avoid merged cells and embedded summary rows.
While Python is the primary analysis tool on the exam, understanding common spreadsheet formulas demonstrates data literacy:
| Formula | Purpose | Pandas Equivalent |
|---|---|---|
| `=SUM(A1:A100)` | Sum values in a range | `df['A'].sum()` |
| `=AVERAGE(A1:A100)` | Calculate mean | `df['A'].mean()` |
| `=COUNTIF(A:A, "Yes")` | Count matching values | `(df['A'] == 'Yes').sum()` |
| `=VLOOKUP()` | Look up matching data | `pd.merge()` |
| `=IF(A1>100, "High", "Low")` | Conditional logic | `np.where(df['A']>100, 'High', 'Low')` |
The openpyxl library allows Python to read, write, and format Excel files with full control over styles, formulas, and layouts.
Know the difference between openpyxl (for .xlsx files with full formatting control) and pandas.to_excel() (for quick data export). The exam may test your understanding of when to use each approach.
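A sketch showing both approaches side by side; the output file name and data are hypothetical, and `openpyxl` must be installed:

```python
import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import Font

df = pd.DataFrame({"product": ["A", "B"], "sales": [120, 95]})

# Quick export: pandas writes the data, no custom formatting.
df.to_excel("report.xlsx", index=False)   # example output file

# Full control: reopen with openpyxl to style cells.
wb = load_workbook("report.xlsx")
ws = wb.active
for cell in ws[1]:                        # first row = header
    cell.font = Font(bold=True)
wb.save("report.xlsx")
```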
Data preparation is the most time-consuming phase of the data analysis pipeline — often accounting for 60-80% of a project. Proper preprocessing ensures accurate, reliable results downstream.
Before writing any code, effective data preparation begins with understanding what the analysis is meant to achieve, what the raw data actually contains, and who will consume the results.
The preprocessing steps you choose depend entirely on the analysis objective. Data prepared for a machine learning model (numerical, scaled, split) looks very different from data prepared for a business dashboard (aggregated, well-labeled, human-readable).
Consistent formatting is essential for accurate analysis, merging, and visualization.
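A small sketch of standardizing text and dates before a merge; the data is a toy example, and `format="mixed"` assumes pandas 2.0 or newer:

```python
import pandas as pd

df = pd.DataFrame({
    "city": [" New York", "new york ", "NEW YORK"],
    "date": ["2024-01-05", "01/06/2024", "Jan 7, 2024"],
})

# Standardize text: strip whitespace and normalize case before merging.
df["city"] = df["city"].str.strip().str.title()

# Standardize dates into a single datetime dtype.
# format="mixed" (pandas >= 2.0) parses heterogeneous date strings.
df["date"] = pd.to_datetime(df["date"], format="mixed")
```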
Understanding wide and long data formats is critical for the exam. Many analysis and visualization tools require one specific format.
| Aspect | Wide Format | Long Format |
|---|---|---|
| Structure | Each variable has its own column | Variables are stacked into key-value pairs |
| Rows | Fewer rows, more columns | More rows, fewer columns |
| Best for | Human reading, spreadsheets, pivot tables | Plotting, statistical modeling, groupby |
| Example use | Sales by quarter as separate columns | One row per (product, quarter) combination |
Remember the key difference: pd.melt() converts wide to long (unpivoting), while pd.pivot() converts long to wide (pivoting). If there are duplicate index-column combinations, use pd.pivot_table() with an aggregation function instead.
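A round-trip sketch on a toy sales table:

```python
import pandas as pd

wide = pd.DataFrame({
    "product": ["A", "B"],
    "Q1": [100, 80],
    "Q2": [120, 90],
})

# Wide -> long: one row per (product, quarter) pair.
long = wide.melt(id_vars="product", var_name="quarter", value_name="sales")

# Long -> wide: quarters back into their own columns.
back = long.pivot(index="product", columns="quarter", values="sales")
```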
Before training a machine learning model, you must split your data into training and testing sets. The training set is used to fit the model, while the test set evaluates its performance on unseen data.
Never perform feature engineering (scaling, encoding, imputation) before splitting! This causes data leakage — information from the test set influences the training process, producing overly optimistic performance estimates. Always split first, then preprocess each set independently (fitting transformers only on training data).
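A minimal leakage-free sketch on random toy data: split first, then fit the scaler on the training set only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 3)        # toy feature matrix
y = np.random.randint(0, 2, 100)  # toy binary target

# Split first ...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ... then fit the scaler on the training set ONLY and apply it to both.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)   # no fitting on test data
```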
Outliers can significantly affect statistical calculations and model performance. Understanding how to detect and handle them is a core preprocessing skill.
How you handle outliers shapes the rest of preprocessing: removing them shifts means and standard deviations, capping (winsorizing) them compresses the distribution's tails, and keeping them may push you toward robust statistics such as the median and IQR.
The choice of outlier handling strategy should be driven by your analysis objective and domain knowledge. Document every decision — removing or transforming data without justification undermines the credibility of your analysis.
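A minimal sketch of the conventional IQR rule on a toy Series:

```python
import pandas as pd

s = pd.Series([12, 14, 15, 16, 15, 14, 90])   # 90 is a likely outlier

# IQR rule: flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers)   # flags the value 90
```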
Test your understanding of the concepts covered in Topic 1.4 with the following review questions; each answer includes a brief explanation.
1. **Which pandas function is used to read CSV files?** `pd.read_csv()` is the standard pandas function for reading CSV files. It also supports custom delimiters via the `sep` parameter, making it versatile for TSV and other delimited text files as well.
2. **What does the `pd.melt()` function do?** `pd.melt()` unpivots a DataFrame from wide to long format. It takes columns and stacks them into rows as key-value pairs using the `id_vars`, `value_vars`, `var_name`, and `value_name` parameters.
3. **Which Python library is most commonly used for parsing HTML?** BeautifulSoup, from the `bs4` package, is the most commonly used library for parsing HTML content in Python. It provides intuitive methods like `find()` and `find_all()` to navigate and search the HTML document tree. While html5lib, lxml, and Scrapy are also web scraping tools, BeautifulSoup is the one emphasized on the exam.
4. **What does a website's `robots.txt` file specify?** The `robots.txt` file (located at the root of a website, e.g., https://example.com/robots.txt) specifies which pages or sections of the site web crawlers are allowed or disallowed to access. Checking this file is a fundamental ethical web scraping practice.
5. **What is the purpose of `random_state=42` in `train_test_split()`?** The `random_state` parameter seeds the random number generator used for shuffling the data before splitting. Using the same seed (e.g., 42) ensures that you get the exact same train/test split every time you run the code, which is essential for reproducibility.
6. **Which expression correctly filters a DataFrame on two conditions?**
   - `df[df['age'] > 30 and df['salary'] > 50000]`
   - `df[(df['age'] > 30) & (df['salary'] > 50000)]` (correct)
   - `df[df['age'] > 30 && df['salary'] > 50000]`
   - `df.filter(age > 30, salary > 50000)`

   Pandas requires the bitwise operators `&` (and), `|` (or), and `~` (not), with each condition wrapped in parentheses due to Python's operator precedence rules. Using Python's `and`/`or` keywords raises an ambiguity error.
7. **Under the IQR rule, a value is an outlier if it falls below `Q1 - 1.5 * IQR` or above which upper boundary?** Above `Q3 + 1.5 * IQR`, where `IQR = Q3 - Q1`. The 1.5 multiplier is the conventional threshold. Values beyond `3.0 * IQR` are sometimes called "extreme outliers."
8. **What does the `stratify` parameter in `train_test_split()` accomplish?** The `stratify` parameter ensures that the class distribution of the target variable is preserved in both the training and test sets. This is particularly important for imbalanced datasets: for example, if 10% of your data is class 1, stratification ensures that both train and test sets contain approximately 10% class 1 samples.