Block 1: Data Acquisition and Pre-Processing
Topic 1.4 · 5 Objectives
Data comes in many file formats, and as a data analyst you must know when to use each format, how to read them into Python, and how to export your results. The four most common formats you should know are CSV, JSON, XML, and TXT.
| Format | Best For | Structure | Python Library |
|---|---|---|---|
| CSV | Tabular data (rows and columns) | Comma-separated values, one record per line | pandas, csv |
| JSON | Structured/nested data, APIs | Key-value pairs, supports nesting | pandas, json |
| XML | Hierarchical data, configuration files | Tags with attributes and nested elements | pandas, xml.etree.ElementTree |
| TXT | Unstructured text, logs | Plain text, custom delimiters possible | pandas, built-in open() |
CSV is the most common format for tabular data. Each line represents a row, and values are separated by commas (or other delimiters like tabs or semicolons). CSV files do not store data types, so everything is read as strings unless explicitly converted.
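A minimal sketch of reading a CSV with pandas; the file name `sales.csv` and its columns are hypothetical, and `dtype`/`parse_dates` show how to convert columns explicitly instead of leaving them as inferred strings:

```python
import pandas as pd

# Read a CSV file; 'sales.csv' and its columns are example names.
df = pd.read_csv(
    "sales.csv",
    dtype={"region": "string"},   # force a column to a specific dtype
    parse_dates=["order_date"],   # convert a date column while reading
)
print(df.dtypes)
```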
JSON stores data as key-value pairs and supports nested structures. It is the standard format for API responses and web data exchange. JSON preserves data types better than CSV (numbers, booleans, nulls).
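A small sketch of working with JSON in Python, using an inline example record to show how booleans and nulls survive parsing:

```python
import json
import pandas as pd

# Parse a JSON string into Python objects (dicts, lists, numbers, booleans, None).
raw = '{"user": "ana", "active": true, "scores": [88, 92, null]}'
record = json.loads(raw)
print(record["active"])          # True -- the boolean type is preserved

# json_normalize flattens a dict (including nested keys) into DataFrame columns.
df = pd.json_normalize(record)
```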
XML uses a tree structure with tags, attributes, and nested elements. It is common in legacy systems, configuration files, and enterprise data feeds.
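A minimal parsing sketch with the standard library's `xml.etree.ElementTree`, using a tiny inline XML document:

```python
import xml.etree.ElementTree as ET

# A small XML document: tags, attributes, and nested elements.
xml_data = """
<catalog>
  <book id="1"><title>Python Basics</title><price>29.99</price></book>
  <book id="2"><title>Data Analysis</title><price>39.99</price></book>
</catalog>
"""
root = ET.fromstring(xml_data)
for book in root.findall("book"):
    # .get() reads an attribute; .findtext() reads a child element's text.
    print(book.get("id"), book.findtext("title"), book.findtext("price"))
```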
TXT files are used for unstructured text, log files, or data with custom delimiters. They require careful handling since there is no inherent structure.
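A sketch of line-by-line processing with the built-in `open()`; the file name `app.log` and the `ERROR` filter are hypothetical:

```python
# Read a plain-text log file line by line; 'app.log' is an example file.
with open("app.log", encoding="utf-8") as fh:
    for line in fh:
        line = line.rstrip("\n")
        if "ERROR" in line:   # no inherent structure, so filter manually
            print(line)
```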
While not a plain text format, Excel files (.xlsx, .xls) are ubiquitous in business settings. Pandas provides excellent Excel support.
pd.read_csv() is the most versatile reader — it can handle CSV, TSV, and any custom-delimited TXT file by setting the sep parameter. Remember that pd.read_excel() requires the openpyxl engine for .xlsx files.
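A short sketch of both points; the file names are hypothetical, and reading `.xlsx` assumes `openpyxl` is installed (`pip install openpyxl`):

```python
import pandas as pd

# The same reader handles any delimiter; only the sep argument changes.
tsv = pd.read_csv("measurements.tsv", sep="\t")   # tab-separated
piped = pd.read_csv("export.txt", sep="|")        # pipe-separated TXT

# .xlsx files require the openpyxl engine to be installed.
xl = pd.read_excel("report.xlsx", sheet_name="Q1")
```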
Data analysts must know how to access data from various storage locations and then organize it effectively for analysis. This includes local files, databases, and online repositories.
Local files are the simplest data source. Use absolute or relative paths, and handle missing files gracefully.
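A minimal sketch of graceful file handling with `pathlib`; the path `data/customers.csv` is a hypothetical example:

```python
from pathlib import Path
import pandas as pd

path = Path("data") / "customers.csv"   # relative path, example file

if path.exists():
    df = pd.read_csv(path)
else:
    # Fail gracefully instead of letting FileNotFoundError crash the script.
    print(f"File not found: {path.resolve()}")
```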
Databases are the backbone of enterprise data storage. Python connects to databases using libraries like sqlite3, sqlalchemy, or database-specific drivers.
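A sketch using SQLite, which ships with Python; the database file and table are hypothetical:

```python
import sqlite3
import pandas as pd

# SQLite needs no external driver; 'company.db' is an example database.
conn = sqlite3.connect("company.db")

# Run a SQL query and load the result directly into a DataFrame.
df = pd.read_sql_query("SELECT name, salary FROM employees", conn)
conn.close()
```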
Many datasets are hosted online. Pandas can read directly from URLs, and libraries like sklearn.datasets provide built-in sample data.
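A quick sketch of both options; the URL is a placeholder:

```python
import pandas as pd
from sklearn.datasets import load_iris

# pandas readers accept URLs directly; this URL is a placeholder.
df_remote = pd.read_csv("https://example.com/data.csv")

# scikit-learn bundles small sample datasets for practice.
iris = load_iris(as_frame=True)
df_iris = iris.frame
```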
Once data is loaded, organizing it is essential. Sorting and filtering are the most fundamental data management operations.
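A minimal sketch of both operations on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ana", "Bo", "Cy"], "age": [34, 28, 45]})

# Sort by one or more columns; ascending can be a list matching the columns.
by_age = df.sort_values("age", ascending=False)

# Filter rows with a boolean mask.
over_30 = df[df["age"] > 30]
```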
When combining multiple boolean conditions in pandas, you must use & (and), | (or), and ~ (not) instead of Python's and, or, not. Each condition must be wrapped in parentheses.
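A sketch contrasting the correct operators with the common mistake, on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 35, 45], "salary": [40_000, 60_000, 80_000]})

# Correct: & / | / ~ with each condition in parentheses.
both = df[(df["age"] > 30) & (df["salary"] > 50_000)]
either = df[(df["age"] > 40) | (df["salary"] < 45_000)]
negated = df[~(df["age"] > 30)]

# Incorrect: df[df["age"] > 30 and df["salary"] > 50_000]
# raises "The truth value of a Series is ambiguous".
```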
Modern data analysis often requires extracting data from databases, APIs, and web pages. Understanding how to do this programmatically is a core skill for the exam.
APIs (Application Programming Interfaces) provide structured access to web data. The requests library is the standard tool for making HTTP requests in Python.
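A minimal request sketch; the endpoint URL and query parameter are hypothetical:

```python
import requests

# GET request to a placeholder endpoint.
response = requests.get(
    "https://api.example.com/v1/users",
    params={"page": 1},          # query string parameters
    timeout=10,                  # always set a timeout
)
response.raise_for_status()      # raise an error on 4xx/5xx responses
data = response.json()           # parse the JSON body into Python objects
```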
When data is not available via an API, web scraping allows you to extract information directly from HTML pages. The BeautifulSoup library (from bs4) parses HTML and provides methods to navigate and search the document tree.
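A parsing sketch on a small inline HTML snippet:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Quarterly Report</h1>
  <ul><li class="item">Revenue</li><li class="item">Costs</li></ul>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.find("h1").text)                    # first matching tag
for li in soup.find_all("li", class_="item"):  # all matching tags
    print(li.text)
```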
Pandas provides a shortcut for extracting tables from HTML pages:
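(A minimal sketch; the URL is a placeholder, and `pd.read_html()` needs an HTML parser such as lxml installed.)

```python
import pandas as pd

# read_html returns a list of DataFrames, one per <table> on the page.
tables = pd.read_html("https://example.com/stats.html")
df = tables[0]   # pick the table you need by position
```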
When extracting data from multiple sources, you often face compatibility and integrity issues. A common example is inconsistent missing-value encoding: one source marks missing data as N/A, another uses null, another uses empty strings. Harmonizing these representations is a prerequisite for merging.

Responsible web scraping follows established guidelines to respect website owners and legal boundaries.
| Practice | Description |
|---|---|
| Respect robots.txt | Check /robots.txt before scraping. It specifies which pages crawlers are allowed or disallowed to access. |
| Rate-limiting | Add delays between requests (e.g., time.sleep(1)) to avoid overwhelming the server. |
| Identify yourself | Set a descriptive User-Agent header so the server knows who is accessing the data. |
| Check Terms of Service | Some websites explicitly prohibit scraping in their ToS. Always review legal terms. |
| Prefer APIs | If a site offers an API, use it instead of scraping. APIs are designed for programmatic access. |
This course expects you to understand ethical scraping practices. Always check robots.txt, implement rate-limiting, and prefer APIs over scraping when available.
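A minimal sketch of a polite request loop that combines these practices; the URLs and the User-Agent contact string are hypothetical:

```python
import time
import requests

# Identify yourself and rate-limit requests; values here are examples.
headers = {"User-Agent": "data-course-bot/1.0 (contact@example.com)"}
urls = ["https://example.com/page1", "https://example.com/page2"]

for url in urls:
    response = requests.get(url, headers=headers, timeout=10)
    print(url, response.status_code)
    time.sleep(1)   # pause between requests to avoid overloading the server
```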
Spreadsheets (Excel, Google Sheets) remain a primary data tool in business. Knowing how to manipulate spreadsheets programmatically and follow best practices is an essential skill.
Good spreadsheet layout makes data easier to analyze and reduces errors: keep a single header row, store one record per row and one variable per column, and avoid merged cells and embedded summary rows.
While Python is the primary analysis tool on the exam, understanding common spreadsheet formulas demonstrates data literacy:
| Formula | Purpose | Pandas Equivalent |
|---|---|---|
| `=SUM(A1:A100)` | Sum values in a range | `df['A'].sum()` |
| `=AVERAGE(A1:A100)` | Calculate mean | `df['A'].mean()` |
| `=COUNTIF(A:A, "Yes")` | Count matching values | `(df['A'] == 'Yes').sum()` |
| `=VLOOKUP()` | Look up matching data | `pd.merge()` |
| `=IF(A1>100, "High", "Low")` | Conditional logic | `np.where(df['A']>100, 'High', 'Low')` |
The openpyxl library allows Python to read, write, and format Excel files with full control over styles, formulas, and layouts.
Know the difference between openpyxl (for .xlsx files with full formatting control) and pandas.to_excel() (for quick data export). The exam may test your understanding of when to use each approach.
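A sketch showing both approaches side by side; the output file name and data are hypothetical, and `openpyxl` must be installed:

```python
import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import Font

df = pd.DataFrame({"product": ["A", "B"], "sales": [120, 95]})

# Quick export: pandas writes the data, no custom formatting.
df.to_excel("report.xlsx", index=False)   # example output file

# Full control: reopen with openpyxl to style cells.
wb = load_workbook("report.xlsx")
ws = wb.active
for cell in ws[1]:                        # first row = header
    cell.font = Font(bold=True)
wb.save("report.xlsx")
```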
Data preparation is the most time-consuming phase of the data analysis pipeline — often accounting for 60-80% of a project. Proper preprocessing ensures accurate, reliable results downstream.
Before writing any code, effective data preparation begins with understanding what the analysis is meant to achieve, what the raw data actually contains, and who will consume the results.
The preprocessing steps you choose depend entirely on the analysis objective. Data prepared for a machine learning model (numerical, scaled, split) looks very different from data prepared for a business dashboard (aggregated, well-labeled, human-readable).
Consistent formatting is essential for accurate analysis, merging, and visualization.
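A small sketch of standardizing text and dates before a merge; the data is a toy example, and `format="mixed"` assumes pandas 2.0 or newer:

```python
import pandas as pd

df = pd.DataFrame({
    "city": [" New York", "new york ", "NEW YORK"],
    "date": ["2024-01-05", "01/06/2024", "Jan 7, 2024"],
})

# Standardize text: strip whitespace and normalize case before merging.
df["city"] = df["city"].str.strip().str.title()

# Standardize dates into a single datetime dtype.
# format="mixed" (pandas >= 2.0) parses heterogeneous date strings.
df["date"] = pd.to_datetime(df["date"], format="mixed")
```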
Understanding wide and long data formats is critical for the exam. Many analysis and visualization tools require one specific format.
| Aspect | Wide Format | Long Format |
|---|---|---|
| Structure | Each variable has its own column | Variables are stacked into key-value pairs |
| Rows | Fewer rows, more columns | More rows, fewer columns |
| Best for | Human reading, spreadsheets, pivot tables | Plotting, statistical modeling, groupby |
| Example use | Sales by quarter as separate columns | One row per (product, quarter) combination |
Remember the key difference: pd.melt() converts wide to long (unpivoting), while pd.pivot() converts long to wide (pivoting). If there are duplicate index-column combinations, use pd.pivot_table() with an aggregation function instead.
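A round-trip sketch on a toy sales table:

```python
import pandas as pd

wide = pd.DataFrame({
    "product": ["A", "B"],
    "Q1": [100, 80],
    "Q2": [120, 90],
})

# Wide -> long: one row per (product, quarter) pair.
long = wide.melt(id_vars="product", var_name="quarter", value_name="sales")

# Long -> wide: quarters back into their own columns.
back = long.pivot(index="product", columns="quarter", values="sales")
```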
Before training a machine learning model, you must split your data into training and testing sets. The training set is used to fit the model, while the test set evaluates its performance on unseen data.
Never perform feature engineering (scaling, encoding, imputation) before splitting! This causes data leakage — information from the test set influences the training process, producing overly optimistic performance estimates. Always split first, then preprocess each set independently (fitting transformers only on training data).
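A minimal leakage-free sketch on random toy data: split first, then fit the scaler on the training set only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 3)        # toy feature matrix
y = np.random.randint(0, 2, 100)  # toy binary target

# Split first ...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ... then fit the scaler on the training set ONLY and apply it to both.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)   # no fitting on test data
```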
Outliers can significantly affect statistical calculations and model performance. Understanding how to detect and handle them is a core preprocessing skill.
How you handle outliers shapes the rest of preprocessing: removing them shifts means and standard deviations, capping (winsorizing) them compresses the distribution's tails, and keeping them may push you toward robust statistics such as the median and IQR.
The choice of outlier handling strategy should be driven by your analysis objective and domain knowledge. Document every decision — removing or transforming data without justification undermines the credibility of your analysis.
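A minimal sketch of the conventional IQR rule on a toy Series:

```python
import pandas as pd

s = pd.Series([12, 14, 15, 16, 15, 14, 90])   # 90 is a likely outlier

# IQR rule: flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers)   # flags the value 90
```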
Test your understanding of the concepts covered in Topic 1.4 with the following review questions; each answer includes a brief explanation.
1. **Which pandas function is used to read CSV files?** `pd.read_csv()` is the standard pandas function for reading CSV files. It also supports custom delimiters via the `sep` parameter, making it versatile for TSV and other delimited text files as well.
2. **What does the `pd.melt()` function do?** `pd.melt()` unpivots a DataFrame from wide to long format. It takes columns and stacks them into rows as key-value pairs using the `id_vars`, `value_vars`, `var_name`, and `value_name` parameters.
3. **Which Python library is most commonly used for parsing HTML?** BeautifulSoup, from the `bs4` package, is the most commonly used library for parsing HTML content in Python. It provides intuitive methods like `find()` and `find_all()` to navigate and search the HTML document tree. While html5lib, lxml, and Scrapy are also web scraping tools, BeautifulSoup is the one emphasized on the exam.
4. **What does a website's `robots.txt` file specify?** The `robots.txt` file (located at the root of a website, e.g., https://example.com/robots.txt) specifies which pages or sections of the site web crawlers are allowed or disallowed to access. Checking this file is a fundamental ethical web scraping practice.
5. **What is the purpose of `random_state=42` in `train_test_split()`?** The `random_state` parameter seeds the random number generator used for shuffling the data before splitting. Using the same seed (e.g., 42) ensures that you get the exact same train/test split every time you run the code, which is essential for reproducibility.
6. **Which expression correctly filters a DataFrame on two conditions?**
   - `df[df['age'] > 30 and df['salary'] > 50000]`
   - `df[(df['age'] > 30) & (df['salary'] > 50000)]` (correct)
   - `df[df['age'] > 30 && df['salary'] > 50000]`
   - `df.filter(age > 30, salary > 50000)`

   Pandas requires the bitwise operators `&` (and), `|` (or), and `~` (not), with each condition wrapped in parentheses due to Python's operator precedence rules. Using Python's `and`/`or` keywords raises an ambiguity error.
7. **Under the IQR rule, a value is an outlier if it falls below `Q1 - 1.5 * IQR` or above which upper boundary?** Above `Q3 + 1.5 * IQR`, where `IQR = Q3 - Q1`. The 1.5 multiplier is the conventional threshold. Values beyond `3.0 * IQR` are sometimes called "extreme outliers."
8. **What does the `stratify` parameter in `train_test_split()` accomplish?** The `stratify` parameter ensures that the class distribution of the target variable is preserved in both the training and test sets. This is particularly important for imbalanced datasets: for example, if 10% of your data is class 1, stratification ensures that both train and test sets contain approximately 10% class 1 samples.