Core Python Proficiency

Block 2: Programming and Database Skills

Topic 2.1 · 5 Objectives

2.1.1 Python Syntax and Control Structures

Variables and Data Types

Python is dynamically typed, meaning you do not declare a variable's type explicitly. The interpreter infers the type at runtime based on the assigned value. The core built-in data types you must know for the exam are:

  • int (e.g., 42): Whole numbers, arbitrary precision
  • float (e.g., 3.14): 64-bit floating point (IEEE 754)
  • str (e.g., "hello"): Immutable sequence of Unicode characters
  • bool (True / False): Logical values; subclass of int
# Variables and type checking in a data-analysis context
sample_size = 500               # int
avg_revenue = 12345.67          # float
dataset_name = "Q1_sales_2024"  # str
is_cleaned = False              # bool

print(type(sample_size))   # <class 'int'>
print(type(avg_revenue))   # <class 'float'>
print(type(dataset_name))  # <class 'str'>
print(type(is_cleaned))    # <class 'bool'>

Variable Scopes and the LEGB Rule

Python resolves variable names using the LEGB rule, searching in this order:

  1. Local — Variables defined inside the current function.
  2. Enclosing — Variables in the enclosing (outer) function's scope (relevant for nested functions).
  3. Global — Variables defined at the module level.
  4. Built-in — Names pre-defined by Python (e.g., print, len, range).
threshold = 0.05  # Global scope

def analyze_results(p_values):
    significance_count = 0  # Local scope

    def is_significant(p):
        # 'threshold' resolved via Local -> Enclosing -> Global (LEGB)
        return p < threshold

    for p in p_values:
        if is_significant(p):
            significance_count += 1
    return significance_count

results = [0.03, 0.12, 0.001, 0.07, 0.04]
print(analyze_results(results))  # 3
Exam Tip: global and nonlocal keywords

Use global to modify a global variable inside a function, and nonlocal to modify a variable in an enclosing function's scope. Avoid both when possible — prefer passing arguments and returning values.
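A minimal sketch of both keywords (the counter/accumulator names here are illustrative):

```python
counter = 0  # module-level (global) variable

def reset_counter():
    global counter  # rebind the module-level name
    counter = 0

def make_accumulator():
    total = 0  # lives in the enclosing scope of the inner function

    def add(value):
        nonlocal total  # rebind the enclosing function's variable
        total += value
        return total

    return add

acc = make_accumulator()
acc(10)
print(acc(5))  # 15
```

Without `nonlocal`, the line `total += value` would raise an UnboundLocalError, because the assignment makes `total` local to `add`.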

Control Structures

if / elif / else

def classify_correlation(r):
    """Classify a Pearson correlation coefficient."""
    if abs(r) >= 0.7:
        return "strong"
    elif abs(r) >= 0.4:
        return "moderate"
    elif abs(r) >= 0.2:
        return "weak"
    else:
        return "negligible"

print(classify_correlation(0.85))   # "strong"
print(classify_correlation(-0.35))  # "weak"

for Loops

# Iterate over rows of data to compute a running total
sales = [120, 340, 210, 450, 180]
cumulative = []
running_total = 0
for sale in sales:
    running_total += sale
    cumulative.append(running_total)
print(cumulative)  # [120, 460, 670, 1120, 1300]

while Loops

# Simulate convergence of a simple iterative algorithm
estimate = 10.0
tolerance = 0.001
iterations = 0
while abs(estimate ** 2 - 5.0) > tolerance:
    estimate = (estimate + 5.0 / estimate) / 2  # Newton's method for sqrt(5)
    iterations += 1
print(f"Converged to {estimate:.6f} in {iterations} iterations")

break and continue

data_points = [23, 45, -1, 67, 89, 12]

# Skip negative values, stop at values above 80
clean_data = []
for val in data_points:
    if val < 0:
        continue  # skip invalid entries
    if val > 80:
        break  # stop processing (outlier threshold)
    clean_data.append(val)
print(clean_data)  # [23, 45, 67]

List Comprehensions

List comprehensions provide a concise, readable way to create lists by applying an expression to each item in an iterable, optionally filtering with a condition.

# Basic comprehension: convert temperatures from Celsius to Fahrenheit
celsius = [0, 20, 37, 100]
fahrenheit = [c * 9/5 + 32 for c in celsius]
print(fahrenheit)  # [32.0, 68.0, 98.6, 212.0]

# Comprehension with condition: filter significant p-values
p_values = [0.03, 0.12, 0.001, 0.07, 0.04, 0.50]
significant = [p for p in p_values if p < 0.05]
print(significant)  # [0.03, 0.001, 0.04]

# Dictionary comprehension: column name mapping
raw_columns = ["First Name", "Last Name", "Annual Income"]
clean_map = {col: col.lower().replace(" ", "_") for col in raw_columns}
print(clean_map)
# {'First Name': 'first_name', 'Last Name': 'last_name', 'Annual Income': 'annual_income'}

Nested Loops

# Generate all combinations of parameters for a grid search
learning_rates = [0.01, 0.1]
max_depths = [3, 5, 10]

param_grid = []
for lr in learning_rates:
    for depth in max_depths:
        param_grid.append({"lr": lr, "max_depth": depth})
print(param_grid)
# [{'lr': 0.01, 'max_depth': 3}, {'lr': 0.01, 'max_depth': 5}, ...]

# Equivalent one-liner using a list comprehension
param_grid = [{"lr": lr, "max_depth": d} for lr in learning_rates for d in max_depths]

2.1.2 Python Functions

Defining Functions with def

Functions encapsulate reusable logic. In data analysis, they help create modular, testable data-processing pipelines.

def calculate_mean(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

scores = [85, 92, 78, 90, 88]
print(calculate_mean(scores))  # 86.6

Positional and Keyword Arguments

Positional arguments are matched by their position in the function call. Keyword arguments are matched by name, which makes calls more readable and order-independent.

def describe_column(data, column, decimal_places):
    """Print summary statistics for a column."""
    mean_val = round(sum(data) / len(data), decimal_places)
    min_val = min(data)
    max_val = max(data)
    print(f"{column}: mean={mean_val}, min={min_val}, max={max_val}")

revenue = [1200, 3400, 2100, 4500]

# Positional call
describe_column(revenue, "revenue", 2)

# Keyword call (order doesn't matter)
describe_column(decimal_places=2, column="revenue", data=revenue)

Optional vs Required Arguments and Default Values

def clean_text(text, lowercase=True, strip_whitespace=True):
    """Clean a text value with configurable options.

    Args:
        text: The input string (required).
        lowercase: Convert to lowercase (optional, default True).
        strip_whitespace: Remove leading/trailing spaces (optional, default True).
    """
    if strip_whitespace:
        text = text.strip()
    if lowercase:
        text = text.lower()
    return text

print(clean_text("  Hello World  "))                   # "hello world"
print(clean_text("  Hello World  ", lowercase=False))  # "Hello World"
Exam Tip: Mutable Default Arguments

Never use a mutable object (like a list or dict) as a default argument value. Default values are evaluated once, at function definition time, so the same object is shared across all calls. Use None as the default instead and create the mutable object inside the function body.
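The pitfall and the None-based fix, as a minimal sketch (the function names are illustrative):

```python
# The pitfall: the default list is created once, at function definition
def append_bad(value, bucket=[]):
    bucket.append(value)
    return bucket

print(append_bad(1))  # [1]
print(append_bad(2))  # [1, 2]  <- surprise: the same list is reused

# The fix: default to None and create a fresh list inside the function
def append_good(value, bucket=None):
    if bucket is None:
        bucket = []
    bucket.append(value)
    return bucket

print(append_good(1))  # [1]
print(append_good(2))  # [2]
```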

*args and **kwargs

*args collects extra positional arguments into a tuple. **kwargs collects extra keyword arguments into a dictionary. They enable flexible function signatures.

def log_metrics(experiment_name, *metrics, **metadata):
    """Log experiment metrics with optional metadata."""
    print(f"Experiment: {experiment_name}")
    print(f"  Metrics: {metrics}")
    for key, value in metadata.items():
        print(f"  {key}: {value}")

log_metrics(
    "model_v2",
    0.95, 0.87, 0.91,      # captured by *metrics
    dataset="training",    # captured by **metadata
    algorithm="random_forest"
)
# Experiment: model_v2
#   Metrics: (0.95, 0.87, 0.91)
#   dataset: training
#   algorithm: random_forest

Return Values and Multiple Returns

Functions can return multiple values as a tuple. This is commonly used to return both a result and a status, or multiple computed statistics at once.

def compute_stats(data):
    """Return mean, median, and standard deviation."""
    n = len(data)
    mean = sum(data) / n
    sorted_data = sorted(data)
    mid = n // 2
    median = (sorted_data[mid] + sorted_data[-mid - 1]) / 2
    variance = sum((x - mean) ** 2 for x in data) / n
    std_dev = variance ** 0.5
    return mean, median, std_dev  # returns a tuple

ages = [25, 30, 35, 40, 28, 33]

# Unpack the returned tuple
avg, med, sd = compute_stats(ages)
print(f"Mean: {avg:.2f}, Median: {med:.2f}, Std Dev: {sd:.2f}")
# Mean: 31.83, Median: 31.50, Std Dev: 4.88
Key Concept

When a function uses return a, b, c, Python actually returns a single tuple (a, b, c). You can unpack it into separate variables or keep it as a tuple.
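A tiny sketch showing both styles (the min_max helper is hypothetical):

```python
def min_max(values):
    return min(values), max(values)  # a single tuple is returned

result = min_max([3, 1, 4, 1, 5])
print(type(result))  # <class 'tuple'>
print(result)        # (1, 5)

low, high = result   # unpack afterwards, or directly at the call site
print(low, high)     # 1 5
```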

2.1.3 Python Data Science Ecosystem

Python's strength in data analysis comes from its rich ecosystem of specialized libraries. This course expects you to know which library to choose for a given task.

  • pandas (data manipulation and analysis): DataFrames, reading CSV/Excel, groupby, merging, reshaping, time series
  • numpy (numerical computing): N-dimensional arrays, vectorized math, linear algebra, random sampling
  • matplotlib (core plotting library): line plots, bar charts, histograms, scatter plots, full customization
  • seaborn (statistical visualization): heatmaps, pair plots, box plots, violin plots, distribution plots
  • scikit-learn (machine learning): classification, regression, clustering, model evaluation, preprocessing
  • scipy (scientific computing): statistical tests, optimization, interpolation, signal processing
  • statistics (basic statistics, stdlib): mean, median, mode, stdev, variance; no install required

When to Use Which Library

# Task: Read a CSV and compute summary stats per group
import pandas as pd
df = pd.read_csv("sales.csv")
summary = df.groupby("region")["revenue"].describe()

# Task: Fast element-wise math on large arrays
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
normalized = (arr - arr.mean()) / arr.std()

# Task: Create a quick histogram
import matplotlib.pyplot as plt
plt.hist(df["revenue"], bins=20, edgecolor="black")
plt.xlabel("Revenue")
plt.title("Revenue Distribution")
plt.show()

# Task: Correlation heatmap with minimal code
import seaborn as sns
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")

# Task: Train a linear regression model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Task: Perform a t-test
from scipy import stats
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Task: Quick mean/median without external deps
import statistics
print(statistics.mean([10, 20, 30]))    # 20
print(statistics.median([10, 20, 30]))  # 20
Quick Decision Guide

  • Tabular data? Use pandas.
  • Numerical arrays? Use numpy.
  • Statistical tests? Use scipy.stats.
  • Machine learning? Use scikit-learn.
  • Plotting? Start with matplotlib; use seaborn for statistical plots.
  • Simple stats without external dependencies? Use statistics (standard library).

2.1.4 Core Data Structures

Lists

Lists are ordered, mutable sequences. They are the most versatile data structure in Python and are used extensively for storing collections of data points.

# Creating lists
temperatures = [22.1, 23.4, 19.8, 25.6, 21.3]
empty_list = []
from_range = list(range(1, 6))  # [1, 2, 3, 4, 5]

# Indexing (0-based) and negative indexing
print(temperatures[0])   # 22.1 (first element)
print(temperatures[-1])  # 21.3 (last element)

# Slicing: list[start:stop:step]
print(temperatures[1:4])   # [23.4, 19.8, 25.6]
print(temperatures[::2])   # [22.1, 19.8, 21.3] (every 2nd element)
print(temperatures[::-1])  # [21.3, 25.6, 19.8, 23.4, 22.1] (reversed)

# Common methods
temperatures.append(24.0)          # add one element to the end
temperatures.extend([20.5, 22.8])  # add multiple elements
temperatures.pop()                 # remove and return the last element
temperatures.pop(0)                # remove and return the first element
temperatures.sort()                # sort in place (ascending)
temperatures.sort(reverse=True)    # sort descending
temperatures.insert(0, 18.0)       # insert at index 0
print(len(temperatures))           # number of elements

Tuples

Tuples are ordered, immutable sequences. Once created, their elements cannot be changed. Use tuples for data that should not be modified, like database records or coordinates.

# Creating tuples
coordinates = (40.7128, -74.0060)  # NYC latitude, longitude
single = (42,)                     # a single-element tuple needs a trailing comma
record = ("Alice", 28, "analyst")

# Packing and unpacking
name, age, role = record  # unpacking
print(f"{name} is a {age}-year-old {role}")

# Unpacking with * (extended unpacking)
first, *rest = [10, 20, 30, 40, 50]
print(first)  # 10
print(rest)   # [20, 30, 40, 50]

# Tuples as dictionary keys (because they are hashable)
sales_by_location = {
    ("US", "East"): 15000,
    ("US", "West"): 18000,
    ("EU", "North"): 12000,
}
print(sales_by_location[("US", "West")])  # 18000
Key Concept: Immutability

You cannot add, remove, or change elements of a tuple after creation. Attempting coordinates[0] = 41.0 raises a TypeError. This makes tuples safer for data that should not change and slightly faster than lists.
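A short demonstration of that TypeError, plus the idiomatic workaround of building a new tuple:

```python
coordinates = (40.7128, -74.0060)

try:
    coordinates[0] = 41.0  # item assignment is not allowed on a tuple
except TypeError as err:
    print(f"TypeError: {err}")

# To "change" a tuple, construct a new one instead
moved = (41.0,) + coordinates[1:]
print(moved)  # (41.0, -74.006)
```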

Dictionaries

Dictionaries store key-value pairs with O(1) average lookup time. They are ideal for mapping identifiers to records, configuration settings, and JSON-like data.

# Creating dictionaries
employee = {
    "name": "Alice",
    "department": "Data Science",
    "salary": 95000,
    "skills": ["Python", "SQL", "Tableau"],
}

# Accessing values
print(employee["name"])              # "Alice"
print(employee.get("title", "N/A"))  # "N/A" (safe access with default)

# Common methods
print(list(employee.keys()))    # ['name', 'department', 'salary', 'skills']
print(list(employee.values()))  # ['Alice', 'Data Science', 95000, [...]]
print(list(employee.items()))   # list of (key, value) tuples

# Updating and adding
employee["salary"] = 100000   # update an existing key
employee["title"] = "Senior"  # add a new key
employee.update({"location": "NYC", "remote": True})

# Nested dictionaries (common for JSON data)
dataset_info = {
    "train": {"rows": 8000, "features": 15},
    "test": {"rows": 2000, "features": 15},
}
print(dataset_info["train"]["rows"])  # 8000

# Iterating over a dictionary
for key, value in employee.items():
    print(f"{key}: {value}")

Sets

Sets are unordered collections of unique elements. They support fast membership testing and mathematical set operations.

# Creating sets
skills_alice = {"Python", "SQL", "Pandas", "Tableau"}
skills_bob = {"Python", "R", "SQL", "Excel"}

# Set operations
common = skills_alice & skills_bob  # intersection
print(common)  # {'Python', 'SQL'}

all_skills = skills_alice | skills_bob  # union
print(all_skills)  # {'Python', 'SQL', 'Pandas', 'Tableau', 'R', 'Excel'}

only_alice = skills_alice - skills_bob  # difference
print(only_alice)  # {'Pandas', 'Tableau'}

exclusive = skills_alice ^ skills_bob  # symmetric difference
print(exclusive)  # {'Pandas', 'Tableau', 'R', 'Excel'}

# Practical use: remove duplicates from data
raw_ids = [101, 102, 101, 103, 102, 104]
unique_ids = list(set(raw_ids))
print(unique_ids)  # [101, 102, 103, 104] (order may vary)

# Fast membership testing
valid_columns = {"name", "age", "salary", "department"}
print("age" in valid_columns)  # True (O(1) lookup)

Strings

Strings are immutable sequences of characters. Data analysts frequently use string methods for cleaning and formatting text data.

# Common string methods for data cleaning
raw = "  John Doe  "
print(raw.strip())          # "John Doe"
print(raw.lower())          # "  john doe  "
print(raw.upper())          # "  JOHN DOE  "
print(raw.strip().split())  # ['John', 'Doe']
print("-".join(["2024", "01", "15"]))  # "2024-01-15"

# Checking content
filename = "report_2024.csv"
print(filename.endswith(".csv"))      # True
print(filename.startswith("report"))  # True
print("12345".isdigit())              # True

# Replacing substrings
col_name = "Annual Revenue ($)"
clean_name = (
    col_name.replace(" ", "_")
    .replace("(", "")
    .replace(")", "")
    .replace("$", "usd")
    .lower()
)
print(clean_name)  # "annual_revenue_usd"

# f-strings (formatted string literals), Python 3.6+
metric = "accuracy"
value = 0.9534
print(f"Model {metric}: {value:.2%}")  # "Model accuracy: 95.34%"
print(f"Value: {value:.4f}")           # "Value: 0.9534"
print(f"Count: {1500000:,}")           # "Count: 1,500,000"

Choosing the Right Data Structure

  • list: an ordered collection that changes over time (mutable, maintains insertion order)
  • tuple: a fixed record, e.g., a database row (immutable, hashable, slightly faster)
  • dict: lookup by key / label mapping (O(1) key lookup, key-value semantics)
  • set: removing duplicates / membership tests (unique elements, O(1) membership test)
  • str: text data / labels (rich methods for manipulation and formatting)
Exam Tip

If a question asks about storing unique items or fast lookups, think set or dict. If order matters and elements may repeat, think list. If data should not change, think tuple.

2.1.5 Python Scripting Best Practices

PEP 8: Style Guide for Python Code

PEP 8 is the official style guide for Python code. Following PEP 8 ensures consistency and readability across projects and teams.

Naming Conventions

  • Variables and functions: snake_case (total_revenue, calculate_mean())
  • Constants: UPPER_SNAKE_CASE (MAX_RETRIES, DEFAULT_TIMEOUT)
  • Classes: PascalCase (DataProcessor, SalesReport)
  • Modules/packages: short lowercase names (utils.py, analysis.py)
  • Private attributes: leading underscore (_internal_cache)

Indentation and Line Length

# Use 4 spaces per indentation level (never tabs)
def process_data(df):
    for col in df.columns:
        if df[col].dtype == "object":
            df[col] = df[col].str.strip().str.lower()
    return df

# Maximum line length: 79 characters (code), 72 (docstrings/comments)
# Break long lines with parentheses (implicit continuation)
filtered_df = df[
    (df["revenue"] > 1000)
    & (df["region"] == "North")
    & (df["year"] >= 2023)
]

Whitespace Conventions

# YES: spaces around operators and after commas
x = 1
y = x + 2
data = [1, 2, 3]
result = calculate(a, b, c=10)

# NO: missing or extra spaces
# x=1
# data = [ 1 , 2 , 3 ]
# result = calculate( a , b , c = 10 )

# Blank lines: 2 blank lines before top-level functions/classes,
# 1 blank line between methods inside a class

Import Organization

# Imports should be at the top of the file, grouped in this order:

# 1. Standard library imports
import os
import sys
from datetime import datetime

# 2. Third-party library imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# 3. Local/project-specific imports
from utils import clean_data
from config import DATABASE_URL

PEP 257: Docstring Conventions

PEP 257 defines conventions for writing docstrings — the string literals that appear as the first statement in a module, function, class, or method.

One-line Docstrings

def square(x):
    """Return the square of a number."""
    return x ** 2

Multi-line Docstrings

import pandas as pd


def load_and_clean(filepath, drop_na=True, encoding="utf-8"):
    """Load a CSV file and apply standard cleaning steps.

    Read the file at the given path into a DataFrame, strip whitespace
    from string columns, and optionally drop rows with missing values.

    Args:
        filepath: Path to the CSV file.
        drop_na: If True, drop rows containing NaN values. Defaults to True.
        encoding: File encoding. Defaults to 'utf-8'.

    Returns:
        A cleaned pandas DataFrame ready for analysis.

    Raises:
        FileNotFoundError: If the file does not exist.
        ValueError: If the file is empty.
    """
    df = pd.read_csv(filepath, encoding=encoding)
    if df.empty:
        raise ValueError(f"File is empty: {filepath}")

    # Strip whitespace from string columns
    str_cols = df.select_dtypes(include="object").columns
    df[str_cols] = df[str_cols].apply(lambda c: c.str.strip())

    if drop_na:
        df = df.dropna()
    return df

Module and Class Docstrings

"""Data processing utilities for the sales analysis project. This module provides helper functions for loading, cleaning, and transforming sales data from CSV files and SQL databases. """ class SalesAnalyzer: """Analyze sales data and generate summary reports. Attributes: data: A pandas DataFrame containing sales records. period: The time period for the analysis (e.g., 'Q1_2024'). """ def __init__(self, data, period): """Initialize SalesAnalyzer with data and period.""" self.data = data self.period = period
PEP 257 Rules to Remember
  • Use triple double quotes ("""...""") for all docstrings.
  • One-line docstrings: opening and closing quotes on the same line, ending with a period.
  • Multi-line docstrings: summary line, blank line, then elaboration.
  • The closing """ goes on its own line for multi-line docstrings.
Common PEP 8 Violations on the Exam

Watch for questions that test whether you can spot style violations: using camelCase for variables, mixing tabs and spaces, putting imports at the bottom, missing blank lines between functions, or lines exceeding 79 characters.
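As an illustration (the function names here are hypothetical), the commented-out version below packs in several of these violations; the version that follows fixes them:

```python
# Spot-the-violation example. The commented-out version breaks several
# PEP 8 rules; the function below it fixes them.
#
#   def computeTotal(values):      # camelCase function name
#       total=0                    # no spaces around '='
#       for v in values: total+=v  # compound statement on one line
#       return total


def compute_total(values):
    """Return the sum of a sequence of numbers."""
    total = 0
    for value in values:
        total += value
    return total


print(compute_total([1, 2, 3]))  # 6
```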

Practice Quiz: Core Python Proficiency

Q1. In Python's LEGB rule, what does the "E" stand for?
A) External — variables imported from other modules
B) Enclosing — the scope of an enclosing (outer) function
C) Environment — OS environment variables
D) Evaluated — variables evaluated at compile time
Correct: B) LEGB stands for Local, Enclosing, Global, Built-in. The Enclosing scope refers to the local scope of any enclosing function, which is relevant when using nested (inner) functions.
Q2. What is the output of the following code?
data = [10, 20, 30, 40, 50]
result = data[1:4]
print(result)
A) [10, 20, 30, 40]
B) [20, 30, 40]
C) [20, 30, 40, 50]
D) [10, 20, 30]
Correct: B) Slicing with data[1:4] returns elements at indices 1, 2, and 3. The start index is inclusive and the stop index is exclusive, so we get [20, 30, 40].
Q3. Which of the following correctly defines a function with a default argument?
A) def process(data, verbose) = True:
B) def process(data, verbose=True):
C) def process(verbose=True, data):
D) def process(data; verbose=True):
Correct: B) Default values are specified with = inside the parentheses. Parameters with defaults must come after parameters without defaults, which makes C) a SyntaxError.
Q4. What does **kwargs collect in a function definition?
A) All positional arguments as a list
B) All positional arguments as a tuple
C) All extra keyword arguments as a dictionary
D) All arguments (both positional and keyword) as a dictionary
Correct: C) **kwargs collects any keyword arguments that are not explicitly defined in the function signature into a dictionary. *args collects extra positional arguments into a tuple.
Q5. Which library would you use to perform a t-test comparing two groups in Python?
A) pandas
B) numpy
C) scipy.stats
D) statistics
Correct: C) scipy.stats provides ttest_ind() for independent two-sample t-tests and ttest_rel() for paired t-tests. The statistics module only covers basic descriptive stats. numpy and pandas do not include hypothesis testing functions.
Q6. What is the key difference between a list and a tuple in Python?
A) Lists can contain mixed types; tuples cannot
B) Tuples are faster for element access; lists are faster for iteration
C) Lists are mutable; tuples are immutable
D) Lists are ordered; tuples are unordered
Correct: C) Both lists and tuples are ordered sequences that can contain mixed types. The fundamental difference is that lists are mutable (can be modified after creation) while tuples are immutable (cannot be changed after creation).
Q7. What does the following list comprehension produce?
result = [x**2 for x in range(6) if x % 2 == 0]
print(result)
A) [0, 1, 4, 9, 16, 25]
B) [0, 4, 16]
C) [4, 16, 36]
D) [1, 9, 25]
Correct: B) range(6) produces 0 through 5. The condition x % 2 == 0 filters to even numbers: 0, 2, 4. Squaring these gives [0, 4, 16].
Q8. According to PEP 8, which naming convention should be used for a Python function?
A) CalculateMean (PascalCase)
B) calculateMean (camelCase)
C) calculate_mean (snake_case)
D) CALCULATE_MEAN (UPPER_SNAKE_CASE)
Correct: C) PEP 8 specifies snake_case for functions and variables, PascalCase for classes, and UPPER_SNAKE_CASE for constants.
Q9. What is the result of the following set operation?
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
print(a - b)
A) {1, 2}
B) {5, 6}
C) {3, 4}
D) {1, 2, 5, 6}
Correct: A) The - operator computes the set difference: elements in a that are not in b. Since 3 and 4 are in both sets, only {1, 2} remain. Note: a ^ b (symmetric difference) would give {1, 2, 5, 6}.
Q10. According to PEP 257, which of the following is a correctly formatted one-line docstring?
A) # Return the square of x.
B) 'Return the square of x.'
C) """Return the square of x."""
D) """return the square of x"""
Correct: C) PEP 257 requires docstrings to use triple double quotes ("""). A one-line docstring should be a complete sentence starting with a capital letter and ending with a period, all on a single line. Option A is a comment (not a docstring), B uses single quotes, and D lacks capitalization and a period.
