
Data Preparation Techniques

Block 1: Data Acquisition and Pre-Processing

Topic 1.4 · 5 Objectives

1.4.1 File Formats in Data Acquisition

Data comes in many file formats, and as a data analyst you must know when to use each format, how to read them into Python, and how to export your results. The four most common formats you should know are CSV, JSON, XML, and TXT.

Common File Formats

| Format | Best For | Structure | Python Library |
|--------|----------|-----------|----------------|
| CSV | Tabular data (rows and columns) | Comma-separated values, one record per line | pandas, csv |
| JSON | Structured/nested data, APIs | Key-value pairs, supports nesting | pandas, json |
| XML | Hierarchical data, configuration files | Tags with attributes and nested elements | pandas, xml.etree.ElementTree |
| TXT | Unstructured text, logs | Plain text, custom delimiters possible | pandas, built-in open() |

CSV (Comma-Separated Values)

CSV is the most common format for tabular data. Each line represents a row, and values are separated by commas (or other delimiters such as tabs or semicolons). The format itself does not store data types: the built-in csv module returns every value as a string, while pandas infers types from the values unless you convert them explicitly.

# Reading a CSV file with pandas
import pandas as pd

# Basic read
df = pd.read_csv('data/sales.csv')

# With options: custom delimiter, specific columns, date parsing
df = pd.read_csv(
    'data/sales.csv',
    sep=';',                     # semicolon delimiter
    usecols=['date', 'amount'],  # only load specific columns
    parse_dates=['date'],        # parse date column
    encoding='utf-8',            # specify encoding
    na_values=['N/A', '']        # treat these as NaN
)

# Writing a DataFrame to CSV
df.to_csv('output/cleaned_sales.csv', index=False)

JSON (JavaScript Object Notation)

JSON stores data as key-value pairs and supports nested structures. It is the standard format for API responses and web data exchange. JSON preserves data types better than CSV (numbers, booleans, nulls).

# Reading JSON data with pandas
import pandas as pd

# Read a JSON file (records orientation)
df = pd.read_json('data/employees.json')

# Read nested JSON using json_normalize
import json

with open('data/nested.json') as f:
    data = json.load(f)

df = pd.json_normalize(data, record_path='orders',
                       meta=['customer_id', 'name'])

# Writing to JSON
df.to_json('output/employees.json', orient='records', indent=2)

XML (Extensible Markup Language)

XML uses a tree structure with tags, attributes, and nested elements. It is common in legacy systems, configuration files, and enterprise data feeds.

# Reading XML with pandas (pandas 1.3+)
import pandas as pd

df = pd.read_xml('data/catalog.xml')

# Parsing XML manually with ElementTree
import xml.etree.ElementTree as ET

tree = ET.parse('data/catalog.xml')
root = tree.getroot()

rows = []
for item in root.findall('product'):
    rows.append({
        'name': item.find('name').text,
        'price': float(item.find('price').text),
        'category': item.get('category')  # attribute
    })

df = pd.DataFrame(rows)

TXT (Plain Text Files)

TXT files are used for unstructured text, log files, or data with custom delimiters. They require careful handling since there is no inherent structure.

import pandas as pd

# Reading a tab-delimited TXT file
df = pd.read_csv('data/log.txt', sep='\t')

# Reading unstructured text
with open('data/notes.txt', 'r') as f:
    content = f.read()
lines = content.splitlines()

# Reading fixed-width format text
df = pd.read_fwf('data/fixed_width.txt')

Reading Excel Files

While not a plain text format, Excel files (.xlsx, .xls) are ubiquitous in business settings. Pandas provides excellent Excel support.

import pandas as pd

# Reading an Excel file
df = pd.read_excel('data/report.xlsx')

# Read a specific sheet
df = pd.read_excel('data/report.xlsx', sheet_name='Q4 Sales')

# Read multiple sheets into a dictionary
sheets = pd.read_excel('data/report.xlsx', sheet_name=None)
# sheets['Q4 Sales'], sheets['Summary'], etc.

# Write to Excel
df.to_excel('output/report.xlsx', index=False, sheet_name='Results')

Study Tip

pd.read_csv() is the most versatile reader — it can handle CSV, TSV, and any custom-delimited TXT file by setting the sep parameter. Remember that pd.read_excel() requires the openpyxl engine for .xlsx files.

1.4.2 Accessing and Managing Datasets

Data analysts must know how to access data from various storage locations and then organize it effectively for analysis. This includes local files, databases, and online repositories.

Accessing Data from Local Files

Local files are the simplest data source. Use absolute or relative paths, and handle missing files gracefully.

import pandas as pd
import os

# Check if file exists before reading
file_path = 'data/customers.csv'
if os.path.exists(file_path):
    df = pd.read_csv(file_path)
    print(f"Loaded {len(df)} rows")
else:
    print(f"File not found: {file_path}")

# List all CSV files in a directory
csv_files = [f for f in os.listdir('data/') if f.endswith('.csv')]

Accessing Data from Databases

Databases are the backbone of enterprise data storage. Python connects to databases using libraries like sqlite3, sqlalchemy, or database-specific drivers.

import sqlite3
import pandas as pd

# Connect to a SQLite database
conn = sqlite3.connect('company.db')

# Read SQL query results directly into a DataFrame
# (pass values via params rather than quoting them into the SQL string)
df = pd.read_sql('SELECT * FROM employees WHERE department = ?',
                 conn, params=('Sales',))

# Always close the connection
conn.close()

# Better: use the connection as a context manager. Note that this
# commits or rolls back the transaction; it does not close the connection.
with sqlite3.connect('company.db') as conn:
    df = pd.read_sql_query('SELECT name, salary FROM employees', conn)
conn.close()

Accessing Data from Online Repositories

Many datasets are hosted online. Pandas can read directly from URLs, and libraries like sklearn.datasets provide built-in sample data.

import pandas as pd

# Read CSV directly from a URL
url = 'https://raw.githubusercontent.com/datasets/covid-19/master/data/countries-aggregated.csv'
df = pd.read_csv(url)

# scikit-learn built-in datasets
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame

Organizing, Sorting, and Filtering Data

Once data is loaded, organizing it is essential. Sorting and filtering are the most fundamental data management operations.

import pandas as pd

df = pd.read_csv('data/employees.csv')

# Sorting by a single column (ascending)
df_sorted = df.sort_values('salary')

# Sorting by multiple columns
df_sorted = df.sort_values(['department', 'salary'], ascending=[True, False])

# Filtering rows with boolean conditions
high_earners = df[df['salary'] > 80000]

# Multiple filter conditions
senior_sales = df[(df['department'] == 'Sales') & (df['years_exp'] >= 5)]

# Using .query() for readable filtering
result = df.query('salary > 80000 and department == "Sales"')

# Filtering with .isin() for multiple values
selected_depts = df[df['department'].isin(['Sales', 'Marketing', 'IT'])]

Key Point

When combining multiple boolean conditions in pandas, you must use & (and), | (or), and ~ (not) instead of Python's and, or, not. Each condition must be wrapped in parentheses.

1.4.3 Extracting Data from Various Sources

Modern data analysis often requires extracting data from databases, APIs, and web pages. Understanding how to do this programmatically is a core skill for the exam.

Extracting Data from APIs

APIs (Application Programming Interfaces) provide structured access to web data. The requests library is the standard tool for making HTTP requests in Python.

import requests
import pandas as pd

# GET request to a REST API
response = requests.get('https://api.example.com/data/products')

# Check the response status
if response.status_code == 200:
    data = response.json()  # Parse JSON response
    df = pd.DataFrame(data)
    print(df.head())
else:
    print(f"Error: {response.status_code}")

# API with query parameters
params = {
    'country': 'US',
    'year': 2024,
    'format': 'json'
}
response = requests.get('https://api.example.com/statistics', params=params)

# API with authentication header
headers = {'Authorization': 'Bearer YOUR_API_KEY'}
response = requests.get('https://api.example.com/secure/data', headers=headers)

Web Scraping with BeautifulSoup

When data is not available via an API, web scraping allows you to extract information directly from HTML pages. The BeautifulSoup library (from bs4) parses HTML and provides methods to navigate and search the document tree.

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 1: Fetch the web page
url = 'https://example.com/data-table'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Step 2: Find elements by tag, class, or id
title = soup.find('h1').text
paragraphs = soup.find_all('p')
specific_div = soup.find('div', class_='content')

# Step 3: Extract data from an HTML table
table = soup.find('table')
rows = table.find_all('tr')

data = []
for row in rows[1:]:  # skip header row
    cols = row.find_all('td')
    data.append([col.text.strip() for col in cols])

# Get headers
headers = [th.text.strip() for th in rows[0].find_all('th')]
df = pd.DataFrame(data, columns=headers)

Quick Alternative: pd.read_html()

Pandas provides a shortcut for extracting tables from HTML pages:

import pandas as pd

# read_html returns a list of DataFrames (one per table found)
tables = pd.read_html('https://example.com/data-table')
df = tables[0]  # first table on the page

Data Compatibility and Integrity Challenges

When extracting data from multiple sources, you often face compatibility and integrity issues (a small sketch after this list shows typical fixes):

  • Encoding mismatches: Different sources may use UTF-8, Latin-1, or other encodings. Mismatched encodings cause garbled characters.
  • Schema differences: Column names, data types, and formats may differ between sources (e.g., "Date" vs "date" vs "DATE").
  • Missing or inconsistent values: One source uses N/A, another uses null, another uses empty strings.
  • Timezone and locale differences: Dates may be in MM/DD/YYYY or DD/MM/YYYY format depending on the source.
  • Duplicate records: Merging data from multiple sources can introduce duplicates that need to be identified and resolved.
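
The sketch below shows how these issues are typically resolved before merging two extracts. The file names, column names, date formats, and sentinel values are all illustrative assumptions, not part of any specific dataset:

import pandas as pd

# Read two hypothetical extracts that use different encodings and
# different missing-value sentinels
df_a = pd.read_csv('data/source_a.csv', encoding='utf-8',
                   na_values=['N/A', 'null', ''])
df_b = pd.read_csv('data/source_b.csv', encoding='latin-1',
                   na_values=['N/A', 'null', ''])

# Align schemas: normalize column names so "Date" and "DATE" both become "date"
for df in (df_a, df_b):
    df.columns = df.columns.str.strip().str.lower()

# Standardize dates from each source's locale before combining
df_a['date'] = pd.to_datetime(df_a['date'], format='%m/%d/%Y')
df_b['date'] = pd.to_datetime(df_b['date'], format='%d/%m/%Y')

# Combine and drop duplicates introduced by overlapping sources
combined = pd.concat([df_a, df_b], ignore_index=True)
combined = combined.drop_duplicates()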

Ethical Web Scraping

Responsible web scraping follows established guidelines to respect website owners and legal boundaries.

| Practice | Description |
|----------|-------------|
| Respect robots.txt | Check /robots.txt before scraping. It specifies which pages crawlers are allowed or disallowed to access. |
| Rate-limiting | Add delays between requests (e.g., time.sleep(1)) to avoid overwhelming the server. |
| Identify yourself | Set a descriptive User-Agent header so the server knows who is accessing the data. |
| Check Terms of Service | Some websites explicitly prohibit scraping in their ToS. Always review legal terms. |
| Prefer APIs | If a site offers an API, use it instead of scraping. APIs are designed for programmatic access. |

import requests
import time
from urllib.robotparser import RobotFileParser

# Check robots.txt before scraping
rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

url = 'https://example.com/data-page'
if rp.can_fetch('*', url):
    response = requests.get(url, headers={'User-Agent': 'MyDataBot/1.0'})
    # Process the response...
    time.sleep(1)  # Rate limit: wait 1 second before next request
else:
    print("Scraping this page is disallowed by robots.txt")

Exam Warning

This course expects you to understand ethical scraping practices. Always check robots.txt, implement rate-limiting, and prefer APIs over scraping when available.

1.4.4 Spreadsheet Best Practices

Spreadsheets (Excel, Google Sheets) remain a primary data tool in business. Knowing how to manipulate spreadsheets programmatically and follow best practices is an essential skill.

Layout Adjustments

Good spreadsheet layout makes data easier to analyze and reduces errors (a quick programmatic check follows the list):

  • One header row: Column headers should be in the first row, with no merged cells or multi-line headers.
  • One record per row: Each row should represent a single observation or record.
  • No empty rows/columns: Blank rows or columns between data cause issues with sorting, filtering, and importing.
  • Consistent data per column: Each column should contain one data type (all numbers, all dates, etc.).
  • Avoid color-coding as data: Do not use cell color to represent categories. Use a dedicated column instead.
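
A quick, illustrative way to audit some of these rules after importing a sheet. The file name is an assumption, and the checks are a sketch, not an exhaustive validator:

import pandas as pd

df = pd.read_excel('data/report.xlsx')

# Fully empty rows or columns usually signal layout problems
empty_rows = df.index[df.isna().all(axis=1)].tolist()
empty_cols = [c for c in df.columns if df[c].isna().all()]

# A column mixing numbers and text often means inconsistent formatting
mixed_cols = [c for c in df.columns
              if df[c].dropna().map(type).nunique() > 1]

print(f"Empty rows: {empty_rows}")
print(f"Empty columns: {empty_cols}")
print(f"Mixed-type columns: {mixed_cols}")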

Formatting Best Practices

  • Use consistent date formats within each column (e.g., YYYY-MM-DD throughout).
  • Avoid leading/trailing spaces in text values.
  • Use number formatting (not text) for numeric columns to enable calculations.
  • Name your sheets descriptively (e.g., "Sales_Q4_2024" instead of "Sheet1").

Basic Formula Applications

While Python is the primary analysis tool on the exam, understanding common spreadsheet formulas demonstrates data literacy (a short sketch after the table works through the pandas column):

| Formula | Purpose | Pandas Equivalent |
|---------|---------|-------------------|
| =SUM(A1:A100) | Sum values in a range | df['A'].sum() |
| =AVERAGE(A1:A100) | Calculate mean | df['A'].mean() |
| =COUNTIF(A:A, "Yes") | Count matching values | (df['A'] == 'Yes').sum() |
| =VLOOKUP() | Look up matching data | pd.merge() |
| =IF(A1>100, "High", "Low") | Conditional logic | np.where(df['A']>100, 'High', 'Low') |
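
The following sketch works through the pandas column of the table on a small made-up DataFrame (column names and values are illustrative):

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [50, 120, 80, 200],
                   'flag': ['Yes', 'No', 'Yes', 'Yes']})

total = df['A'].sum()                     # like =SUM(A1:A100)
mean = df['A'].mean()                     # like =AVERAGE(A1:A100)
yes_count = (df['flag'] == 'Yes').sum()   # like =COUNTIF(A:A, "Yes")
df['level'] = np.where(df['A'] > 100, 'High', 'Low')  # like =IF(A1>100, ...)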

Programmatic Excel Formatting with openpyxl

The openpyxl library allows Python to read, write, and format Excel files with full control over styles, formulas, and layouts.

from openpyxl import Workbook
from openpyxl.styles import Font, Alignment, PatternFill

# Create a new workbook
wb = Workbook()
ws = wb.active
ws.title = "Sales Report"

# Add headers with formatting
headers = ['Product', 'Q1', 'Q2', 'Q3', 'Q4', 'Total']
for col, header in enumerate(headers, 1):
    cell = ws.cell(row=1, column=col, value=header)
    cell.font = Font(bold=True, color="FFFFFF")
    cell.fill = PatternFill(start_color="2563EB", fill_type="solid")
    cell.alignment = Alignment(horizontal="center")

# Add data rows
data = [
    ['Widget A', 1200, 1500, 1800, 2100],
    ['Widget B', 800, 950, 1100, 1300],
]
for row_idx, row_data in enumerate(data, 2):
    for col_idx, value in enumerate(row_data, 1):
        ws.cell(row=row_idx, column=col_idx, value=value)
    # Add a SUM formula for the Total column
    ws.cell(row=row_idx, column=6, value=f"=SUM(B{row_idx}:E{row_idx})")

# Auto-adjust column widths
for col in ws.columns:
    max_length = max(len(str(cell.value or "")) for cell in col)
    ws.column_dimensions[col[0].column_letter].width = max_length + 2

wb.save('output/formatted_report.xlsx')

Study Tip

Know the difference between openpyxl (for .xlsx files with full formatting control) and pandas.to_excel() (for quick data export). The exam may test your understanding of when to use each approach.

1.4.5 Preparing and Pre-processing Data

Data preparation is the most time-consuming phase of the data analysis pipeline — often accounting for 60-80% of a project. Proper preprocessing ensures accurate, reliable results downstream.

Understanding Context and Objectives

Before writing any code, effective data preparation begins with understanding:

  • Business context: What problem is the data meant to solve? What decisions will it support?
  • Analysis objectives: Are you building a predictive model, generating a report, or exploring for insights?
  • Stakeholder expectations: Who will use the results? Technical analysts need different outputs than executive decision-makers.
  • Data lineage: Where did the data come from? How was it collected? What biases might it contain?

Key Point

The preprocessing steps you choose depend entirely on the analysis objective. Data prepared for a machine learning model (numerical, scaled, split) looks very different from data prepared for a business dashboard (aggregated, well-labeled, human-readable).
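
As a contrived illustration, here is the same small table prepared both ways (the column names and values are made up):

import pandas as pd

df = pd.DataFrame({'region': ['East', 'West', 'East', 'West'],
                   'sales': [120.0, 80.0, 150.0, 95.0]})

# Dashboard-style: aggregated, well-labeled, human-readable
dashboard = (df.groupby('region', as_index=False)['sales']
               .sum()
               .rename(columns={'sales': 'Total Sales'}))

# Model-style: one-hot encoded categories and standardized numbers
model_ready = pd.get_dummies(df, columns=['region'])
model_ready['sales'] = ((model_ready['sales'] - model_ready['sales'].mean())
                        / model_ready['sales'].std())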

Sorting, Filtering, and Preparing Data

import pandas as pd
import numpy as np

df = pd.read_csv('data/transactions.csv')

# Sort by date (most recent first)
df = df.sort_values('transaction_date', ascending=False)

# Reset index after sorting
df = df.reset_index(drop=True)

# Filter out incomplete records
df = df.dropna(subset=['amount', 'customer_id'])

# Remove duplicates
df = df.drop_duplicates(subset=['transaction_id'], keep='first')

# Create new computed columns
df['year'] = pd.to_datetime(df['transaction_date']).dt.year
df['amount_category'] = pd.cut(df['amount'],
                               bins=[0, 50, 200, 1000, np.inf],
                               labels=['small', 'medium', 'large', 'enterprise'])

Proper Data Formatting

Consistent formatting is essential for accurate analysis, merging, and visualization.

Date-Time Consistency

# Convert string dates to datetime objects
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')

# Handle mixed date formats (format='mixed' requires pandas 2.0+;
# the older infer_datetime_format option is deprecated)
df['date'] = pd.to_datetime(df['date'], format='mixed')

# Standardize to ISO format (YYYY-MM-DD)
df['date_str'] = df['date'].dt.strftime('%Y-%m-%d')

# Extract components
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.day_name()
df['quarter'] = df['date'].dt.quarter

Aligning Data Structures

# Standardize column names
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Standardize string values
df['city'] = df['city'].str.strip().str.title()

# Ensure correct data types
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
df['is_active'] = df['is_active'].astype(bool)
df['category'] = df['category'].astype('category')

Dataset Structuring: Wide vs Long Formats

Understanding wide and long data formats is critical for the exam. Many analysis and visualization tools require one specific format.

| Aspect | Wide Format | Long Format |
|--------|-------------|-------------|
| Structure | Each variable has its own column | Variables are stacked into key-value pairs |
| Rows | Fewer rows, more columns | More rows, fewer columns |
| Best for | Human reading, spreadsheets, pivot tables | Plotting, statistical modeling, groupby |
| Example use | Sales by quarter as separate columns | One row per (product, quarter) combination |

Wide to Long with pd.melt()

import pandas as pd

# Wide format DataFrame
df_wide = pd.DataFrame({
    'product': ['Widget A', 'Widget B', 'Widget C'],
    'Q1': [100, 200, 150],
    'Q2': [120, 180, 170],
    'Q3': [140, 220, 160],
    'Q4': [160, 250, 190]
})
print(df_wide)
#     product   Q1   Q2   Q3   Q4
# 0  Widget A  100  120  140  160
# 1  Widget B  200  180  220  250
# 2  Widget C  150  170  160  190

# Convert wide to long format using melt
df_long = pd.melt(
    df_wide,
    id_vars=['product'],                  # columns to keep as identifiers
    value_vars=['Q1', 'Q2', 'Q3', 'Q4'],  # columns to unpivot
    var_name='quarter',                   # name for the variable column
    value_name='sales'                    # name for the value column
)
print(df_long)
#     product quarter  sales
# 0  Widget A      Q1    100
# 1  Widget B      Q1    200
# 2  Widget C      Q1    150
# 3  Widget A      Q2    120
# ...and so on for all 12 rows

Long to Wide with pd.pivot()

# Convert long back to wide format using pivot
df_back_to_wide = df_long.pivot(
    index='product',    # row identifiers
    columns='quarter',  # values become column names
    values='sales'      # values to fill cells
).reset_index()

# Remove the leftover 'quarter' name on the columns axis
df_back_to_wide.columns.name = None
print(df_back_to_wide)
#     product   Q1   Q2   Q3   Q4
# 0  Widget A  100  120  140  160
# 1  Widget B  200  180  220  250
# 2  Widget C  150  170  160  190

Study Tip

Remember the key difference: pd.melt() converts wide to long (unpivoting), while pd.pivot() converts long to wide (pivoting). If there are duplicate index-column combinations, use pd.pivot_table() with an aggregation function instead.
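
A minimal sketch of that duplicate-combination case, reusing the naming from the examples above:

import pandas as pd

# Two rows share the (Widget A, Q1) combination, so .pivot() would
# raise a ValueError; pivot_table aggregates them instead
df_dup = pd.DataFrame({
    'product': ['Widget A', 'Widget A', 'Widget B'],
    'quarter': ['Q1', 'Q1', 'Q1'],
    'sales': [100, 40, 200]
})

wide = pd.pivot_table(df_dup, index='product', columns='quarter',
                      values='sales', aggfunc='sum').reset_index()
# Widget A's Q1 cell is 140 (100 + 40)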

Train/Test Splitting for Machine Learning

Before training a machine learning model, you must split your data into training and testing sets. The training set is used to fit the model, while the test set evaluates its performance on unseen data.

from sklearn.model_selection import train_test_split
import pandas as pd

df = pd.read_csv('data/housing.csv')

# Separate features (X) and target (y)
X = df.drop('price', axis=1)
y = df['price']

# 80/20 train/test split with random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% for testing
    random_state=42   # ensures reproducible results
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

# For classification: use stratify to maintain class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42,
    stratify=y  # preserves target class ratios in both sets
)

Common Pitfall

Never perform feature engineering (scaling, encoding, imputation) before splitting! This causes data leakage — information from the test set influences the training process, producing overly optimistic performance estimates. Always split first, then preprocess each set independently (fitting transformers only on training data).
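
To make the rule concrete, here is a minimal sketch of leak-free scaling with scikit-learn's StandardScaler, reusing X and y from the split example above (and assuming all features are numeric):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# ...then fit the transformer on training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from train
X_test_scaled = scaler.transform(X_test)        # reuses training statistics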

Outlier Management Impact on Preprocessing

Outliers can significantly affect statistical calculations and model performance. Understanding how to detect and handle them is a core preprocessing skill.

Detecting Outliers

import pandas as pd
import numpy as np

df = pd.read_csv('data/salaries.csv')

# Method 1: IQR (Interquartile Range) method
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df['salary'] < lower_bound) | (df['salary'] > upper_bound)]
print(f"Found {len(outliers)} outliers")

# Method 2: Z-score method
from scipy import stats

z_scores = np.abs(stats.zscore(df['salary']))
outliers_z = df[z_scores > 3]  # values more than 3 std devs from mean

Handling Outliers

# Option 1: Remove outliers
df_clean = df[(df['salary'] >= lower_bound) & (df['salary'] <= upper_bound)]

# Option 2: Cap/clip outliers (winsorization)
df['salary_capped'] = df['salary'].clip(lower=lower_bound, upper=upper_bound)

# Option 3: Transform to reduce impact (log transformation)
df['salary_log'] = np.log1p(df['salary'])  # log(1 + x) handles zeros

# Option 4: Replace with median (for moderate outliers)
median_salary = df['salary'].median()
df.loc[df['salary'] > upper_bound, 'salary'] = median_salary

Impact of outlier decisions on preprocessing (a small numeric sketch follows the list):

  • Mean and standard deviation are highly sensitive to outliers. A few extreme values can skew the mean significantly, while the median is robust.
  • Normalization (min-max scaling) compresses all other values when outliers stretch the range. Consider clipping before scaling.
  • Model training: Linear regression and k-means clustering are particularly affected by outliers. Tree-based models (Random Forest, Gradient Boosting) are more robust.
  • Domain knowledge matters: Not all outliers are errors. A $10 million salary for a CEO is a valid data point, while a negative salary is an error. Always consider the business context before removing data.
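
A small numeric example of that sensitivity, using a made-up series with one extreme value:

import pandas as pd

s = pd.Series([40, 45, 50, 55, 60, 1000])  # one extreme value

print(s.mean())    # ~208.3 -- pulled far from the typical values
print(s.median())  # 52.5   -- barely affected

# Min-max scaling: the outlier stretches the range, so the five
# ordinary values are squeezed into roughly [0, 0.02]
scaled = (s - s.min()) / (s.max() - s.min())
print(scaled.round(3).tolist())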

Key Takeaway

The choice of outlier handling strategy should be driven by your analysis objective and domain knowledge. Document every decision — removing or transforming data without justification undermines the credibility of your analysis.

Practice Quiz: Data Preparation Techniques

Test your understanding of the concepts covered in Topic 1.4.

Q1: Which pandas function is used to read a comma-separated values file into a DataFrame?
A) pd.load_csv()
B) pd.read_csv()
C) pd.import_csv()
D) pd.open_csv()
Correct: B) pd.read_csv() is the standard pandas function for reading CSV files. It also supports custom delimiters via the sep parameter, making it versatile for TSV and other delimited text files as well.
Q2: What does the pd.melt() function do?
A) Converts a long-format DataFrame to wide format
B) Converts a wide-format DataFrame to long format
C) Removes duplicate rows from a DataFrame
D) Merges two DataFrames on a common key
Correct: B) pd.melt() unpivots a DataFrame from wide to long format. It takes columns and stacks them into rows as key-value pairs using id_vars, value_vars, var_name, and value_name parameters.
Q3: Which library is commonly used for parsing HTML content during web scraping in Python?
A) html5lib
B) lxml
C) BeautifulSoup (bs4)
D) Scrapy
Correct: C) BeautifulSoup from the bs4 package is the most commonly used library for parsing HTML content in Python. It provides intuitive methods like find() and find_all() to navigate and search the HTML document tree. html5lib and lxml are parser backends (often used by BeautifulSoup itself) and Scrapy is a full crawling framework, but BeautifulSoup is the one emphasized on the exam.
Q4: Before scraping a website, what file should you check to determine which pages you are allowed to crawl?
A) sitemap.xml
B) robots.txt
C) index.html
D) .htaccess
Correct: B) The robots.txt file (located at the root of a website, e.g., https://example.com/robots.txt) specifies which pages or sections of the site web crawlers are allowed or disallowed to access. Checking this file is a fundamental ethical web scraping practice.
Q5: What is the purpose of setting random_state=42 in train_test_split()?
A) It sets the test set size to 42%
B) It limits the training data to 42 samples
C) It ensures the split is reproducible across runs
D) It sets the number of cross-validation folds to 42
Correct: C) The random_state parameter seeds the random number generator used for shuffling the data before splitting. Using the same seed (e.g., 42) ensures that you get the exact same train/test split every time you run the code, which is essential for reproducibility.
Q6: Which of the following is the correct way to combine multiple boolean filter conditions in pandas?
A) df[df['age'] > 30 and df['salary'] > 50000]
B) df[(df['age'] > 30) & (df['salary'] > 50000)]
C) df[df['age'] > 30 && df['salary'] > 50000]
D) df.filter(age > 30, salary > 50000)
Correct: B) In pandas, boolean conditions must use bitwise operators: & (and), | (or), ~ (not). Each condition must be wrapped in parentheses due to Python's operator precedence rules. Using Python's and/or keywords raises an ambiguity error.
Q7: What is "data leakage" in the context of train/test splitting?
A) When training data is accidentally deleted
B) When the test set is too small to be meaningful
C) When information from the test set influences the training process
D) When the model overfits on the training data
Correct: C) Data leakage occurs when information from the test set "leaks" into the training process, producing overly optimistic performance estimates. A common cause is performing feature engineering (such as scaling or imputation) on the entire dataset before splitting. Always split first, then preprocess each set independently.
Q8: Which file format is best suited for storing hierarchical data with tags and attributes?
A) CSV
B) JSON
C) XML
D) TXT
Correct: C) XML (Extensible Markup Language) uses a tree structure with tags and attributes that is purpose-built for hierarchical data. While JSON also supports nesting, XML is specifically designed with a tag-based hierarchy and supports attributes on elements, making it the best choice for complex hierarchical structures.
Q9: In the IQR method for outlier detection, a value is considered an outlier if it falls below Q1 - 1.5 * IQR or above which upper boundary?
A) Q3 + 1.0 * IQR
B) Q3 + 1.5 * IQR
C) Q3 + 2.0 * IQR
D) Q3 + 3.0 * IQR
Correct: B) The standard IQR method defines outliers as values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where IQR = Q3 - Q1. The 1.5 multiplier is the conventional threshold. Values beyond 3.0 * IQR are sometimes called "extreme outliers."
Q10: What does the stratify parameter in train_test_split() accomplish?
A) It shuffles the data before splitting
B) It ensures equal numbers of samples in train and test sets
C) It preserves the proportion of target classes in both train and test sets
D) It removes outliers before splitting the data
Correct: C) The stratify parameter ensures that the class distribution of the target variable is preserved in both the training and test sets. This is particularly important for imbalanced datasets — for example, if 10% of your data is class 1, stratification ensures that both train and test sets contain approximately 10% class 1 samples.
