Core Python Proficiency

Block 2: Programming and Database Skills

Topic 2.1 · 5 Objectives

2.1.1 Python Syntax and Control Structures

Variables and Data Types

Python is dynamically typed, meaning you do not declare a variable's type explicitly. The interpreter infers the type at runtime based on the assigned value. The core built-in data types you must know for the exam are:

  • int (e.g., 42): Whole numbers, arbitrary precision
  • float (e.g., 3.14): 64-bit floating point (IEEE 754)
  • str (e.g., "hello"): Immutable sequence of Unicode characters
  • bool (True / False): Logical values; subclass of int
# Variables and type checking in a data-analysis context
sample_size = 500               # int
avg_revenue = 12345.67          # float
dataset_name = "Q1_sales_2024"  # str
is_cleaned = False              # bool

print(type(sample_size))   # <class 'int'>
print(type(avg_revenue))   # <class 'float'>
print(type(dataset_name))  # <class 'str'>
print(type(is_cleaned))    # <class 'bool'>

Variable Scopes and the LEGB Rule

Python resolves variable names using the LEGB rule, searching in this order:

  1. Local — Variables defined inside the current function.
  2. Enclosing — Variables in the enclosing (outer) function's scope (relevant for nested functions).
  3. Global — Variables defined at the module level.
  4. Built-in — Names pre-defined by Python (e.g., print, len, range).
threshold = 0.05  # Global scope

def analyze_results(p_values):
    significance_count = 0  # Local scope

    def is_significant(p):
        # 'threshold' resolved via Local -> Enclosing -> Global (LEGB)
        return p < threshold

    for p in p_values:
        if is_significant(p):
            significance_count += 1
    return significance_count

results = [0.03, 0.12, 0.001, 0.07, 0.04]
print(analyze_results(results))  # 3
Exam Tip: global and nonlocal keywords

Use global to modify a global variable inside a function, and nonlocal to modify a variable in an enclosing function's scope. Avoid both when possible — prefer passing arguments and returning values.
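A minimal sketch of both keywords (the counter/accumulator names here are illustrative):

```python
counter = 0  # module-level (global) variable

def reset_counter():
    global counter  # rebind the module-level name
    counter = 0

def make_accumulator():
    total = 0  # lives in the enclosing scope of the inner function

    def add(value):
        nonlocal total  # rebind the enclosing function's variable
        total += value
        return total

    return add

acc = make_accumulator()
acc(10)
print(acc(5))  # 15
```

Without `nonlocal`, the line `total += value` would raise an UnboundLocalError, because the assignment makes `total` local to `add`.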

Control Structures

if / elif / else

def classify_correlation(r):
    """Classify a Pearson correlation coefficient."""
    if abs(r) >= 0.7:
        return "strong"
    elif abs(r) >= 0.4:
        return "moderate"
    elif abs(r) >= 0.2:
        return "weak"
    else:
        return "negligible"

print(classify_correlation(0.85))   # "strong"
print(classify_correlation(-0.35))  # "weak"

for Loops

# Iterate over rows of data to compute a running total
sales = [120, 340, 210, 450, 180]
cumulative = []
running_total = 0
for sale in sales:
    running_total += sale
    cumulative.append(running_total)
print(cumulative)  # [120, 460, 670, 1120, 1300]

while Loops

# Simulate convergence of a simple iterative algorithm
estimate = 10.0
tolerance = 0.001
iterations = 0
while abs(estimate ** 2 - 5.0) > tolerance:
    estimate = (estimate + 5.0 / estimate) / 2  # Newton's method for sqrt(5)
    iterations += 1
print(f"Converged to {estimate:.6f} in {iterations} iterations")

break and continue

data_points = [23, 45, -1, 67, 89, 12]

# Skip negative values, stop at values above 80
clean_data = []
for val in data_points:
    if val < 0:
        continue  # skip invalid entries
    if val > 80:
        break  # stop processing (outlier threshold)
    clean_data.append(val)
print(clean_data)  # [23, 45, 67]

List Comprehensions

List comprehensions provide a concise, readable way to create lists by applying an expression to each item in an iterable, optionally filtering with a condition.

# Basic comprehension: convert temperatures from Celsius to Fahrenheit
celsius = [0, 20, 37, 100]
fahrenheit = [c * 9/5 + 32 for c in celsius]
print(fahrenheit)  # [32.0, 68.0, 98.6, 212.0]

# Comprehension with condition: filter significant p-values
p_values = [0.03, 0.12, 0.001, 0.07, 0.04, 0.50]
significant = [p for p in p_values if p < 0.05]
print(significant)  # [0.03, 0.001, 0.04]

# Dictionary comprehension: column name mapping
raw_columns = ["First Name", "Last Name", "Annual Income"]
clean_map = {col: col.lower().replace(" ", "_") for col in raw_columns}
print(clean_map)
# {'First Name': 'first_name', 'Last Name': 'last_name', 'Annual Income': 'annual_income'}

Nested Loops

# Generate all combinations of parameters for a grid search
learning_rates = [0.01, 0.1]
max_depths = [3, 5, 10]

param_grid = []
for lr in learning_rates:
    for depth in max_depths:
        param_grid.append({"lr": lr, "max_depth": depth})
print(param_grid)
# [{'lr': 0.01, 'max_depth': 3}, {'lr': 0.01, 'max_depth': 5}, ...]

# Equivalent one-liner using a list comprehension
param_grid = [{"lr": lr, "max_depth": d} for lr in learning_rates for d in max_depths]

2.1.2 Python Functions

Defining Functions with def

Functions encapsulate reusable logic. In data analysis, they help create modular, testable data-processing pipelines.

def calculate_mean(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

scores = [85, 92, 78, 90, 88]
print(calculate_mean(scores))  # 86.6

Positional and Keyword Arguments

Positional arguments are matched by their position in the function call. Keyword arguments are matched by name, which makes calls more readable and order-independent.

def describe_column(data, column, decimal_places):
    """Print summary statistics for a column."""
    mean_val = round(sum(data) / len(data), decimal_places)
    min_val = min(data)
    max_val = max(data)
    print(f"{column}: mean={mean_val}, min={min_val}, max={max_val}")

revenue = [1200, 3400, 2100, 4500]

# Positional call
describe_column(revenue, "revenue", 2)

# Keyword call (order doesn't matter)
describe_column(decimal_places=2, column="revenue", data=revenue)

Optional vs Required Arguments and Default Values

def clean_text(text, lowercase=True, strip_whitespace=True):
    """Clean a text value with configurable options.

    Args:
        text: The input string (required).
        lowercase: Convert to lowercase (optional, default True).
        strip_whitespace: Remove leading/trailing spaces (optional, default True).
    """
    if strip_whitespace:
        text = text.strip()
    if lowercase:
        text = text.lower()
    return text

print(clean_text("  Hello World  "))                   # "hello world"
print(clean_text("  Hello World  ", lowercase=False))  # "Hello World"
Exam Tip: Mutable Default Arguments

Never use a mutable object (like a list or dict) as a default argument value. Default values are evaluated once, at function definition time, so the same object is shared across all calls. Use None as the default instead and create the mutable object inside the function body.
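The pitfall and the None-based fix, as a minimal sketch (the function names are illustrative):

```python
# The pitfall: the default list is created once, at function definition
def append_bad(value, bucket=[]):
    bucket.append(value)
    return bucket

print(append_bad(1))  # [1]
print(append_bad(2))  # [1, 2]  <- surprise: the same list is reused

# The fix: default to None and create a fresh list inside the function
def append_good(value, bucket=None):
    if bucket is None:
        bucket = []
    bucket.append(value)
    return bucket

print(append_good(1))  # [1]
print(append_good(2))  # [2]
```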

*args and **kwargs

*args collects extra positional arguments into a tuple. **kwargs collects extra keyword arguments into a dictionary. They enable flexible function signatures.

def log_metrics(experiment_name, *metrics, **metadata):
    """Log experiment metrics with optional metadata."""
    print(f"Experiment: {experiment_name}")
    print(f"  Metrics: {metrics}")
    for key, value in metadata.items():
        print(f"  {key}: {value}")

log_metrics(
    "model_v2",
    0.95, 0.87, 0.91,      # captured by *metrics
    dataset="training",    # captured by **metadata
    algorithm="random_forest"
)
# Experiment: model_v2
#   Metrics: (0.95, 0.87, 0.91)
#   dataset: training
#   algorithm: random_forest

Return Values and Multiple Returns

Functions can return multiple values as a tuple. This is commonly used to return both a result and a status, or multiple computed statistics at once.

def compute_stats(data):
    """Return mean, median, and standard deviation."""
    n = len(data)
    mean = sum(data) / n
    sorted_data = sorted(data)
    mid = n // 2
    median = (sorted_data[mid] + sorted_data[-mid - 1]) / 2
    variance = sum((x - mean) ** 2 for x in data) / n
    std_dev = variance ** 0.5
    return mean, median, std_dev  # returns a tuple

ages = [25, 30, 35, 40, 28, 33]

# Unpack the returned tuple
avg, med, sd = compute_stats(ages)
print(f"Mean: {avg:.2f}, Median: {med:.2f}, Std Dev: {sd:.2f}")
# Mean: 31.83, Median: 31.50, Std Dev: 4.88
Key Concept

When a function uses return a, b, c, Python actually returns a single tuple (a, b, c). You can unpack it into separate variables or keep it as a tuple.
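A tiny sketch showing both styles (the min_max helper is hypothetical):

```python
def min_max(values):
    return min(values), max(values)  # a single tuple is returned

result = min_max([3, 1, 4, 1, 5])
print(type(result))  # <class 'tuple'>
print(result)        # (1, 5)

low, high = result   # unpack afterwards, or directly at the call site
print(low, high)     # 1 5
```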

2.1.3 Python Data Science Ecosystem

Python's strength in data analysis comes from its rich ecosystem of specialized libraries. This course expects you to know which library to choose for a given task.

  • pandas (data manipulation and analysis): DataFrames, reading CSV/Excel, groupby, merging, reshaping, time series
  • numpy (numerical computing): N-dimensional arrays, vectorized math, linear algebra, random sampling
  • matplotlib (core plotting library): line plots, bar charts, histograms, scatter plots, full customization
  • seaborn (statistical visualization): heatmaps, pair plots, box plots, violin plots, distribution plots
  • scikit-learn (machine learning): classification, regression, clustering, model evaluation, preprocessing
  • scipy (scientific computing): statistical tests, optimization, interpolation, signal processing
  • statistics (basic statistics, stdlib): mean, median, mode, stdev, variance; no install required

When to Use Which Library

# Task: Read a CSV and compute summary stats per group
import pandas as pd
df = pd.read_csv("sales.csv")
summary = df.groupby("region")["revenue"].describe()

# Task: Fast element-wise math on large arrays
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
normalized = (arr - arr.mean()) / arr.std()

# Task: Create a quick histogram
import matplotlib.pyplot as plt
plt.hist(df["revenue"], bins=20, edgecolor="black")
plt.xlabel("Revenue")
plt.title("Revenue Distribution")
plt.show()

# Task: Correlation heatmap with minimal code
import seaborn as sns
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")

# Task: Train a linear regression model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Task: Perform a t-test
from scipy import stats
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Task: Quick mean/median without external deps
import statistics
print(statistics.mean([10, 20, 30]))    # 20
print(statistics.median([10, 20, 30]))  # 20
Quick Decision Guide

  • Tabular data? Use pandas.
  • Numerical arrays? Use numpy.
  • Statistical tests? Use scipy.stats.
  • Machine learning? Use scikit-learn.
  • Plotting? Start with matplotlib; use seaborn for statistical plots.
  • Simple stats without external dependencies? Use statistics (standard library).

2.1.4 Core Data Structures

Lists

Lists are ordered, mutable sequences. They are the most versatile data structure in Python and are used extensively for storing collections of data points.

# Creating lists
temperatures = [22.1, 23.4, 19.8, 25.6, 21.3]
empty_list = []
from_range = list(range(1, 6))  # [1, 2, 3, 4, 5]

# Indexing (0-based) and negative indexing
print(temperatures[0])   # 22.1 (first element)
print(temperatures[-1])  # 21.3 (last element)

# Slicing: list[start:stop:step]
print(temperatures[1:4])   # [23.4, 19.8, 25.6]
print(temperatures[::2])   # [22.1, 19.8, 21.3] (every 2nd element)
print(temperatures[::-1])  # [21.3, 25.6, 19.8, 23.4, 22.1] (reversed)

# Common methods
temperatures.append(24.0)          # add one element to the end
temperatures.extend([20.5, 22.8])  # add multiple elements
temperatures.pop()                 # remove and return the last element
temperatures.pop(0)                # remove and return the first element
temperatures.sort()                # sort in place (ascending)
temperatures.sort(reverse=True)    # sort descending
temperatures.insert(0, 18.0)       # insert at index 0
print(len(temperatures))           # number of elements

Tuples

Tuples are ordered, immutable sequences. Once created, their elements cannot be changed. Use tuples for data that should not be modified, like database records or coordinates.

# Creating tuples
coordinates = (40.7128, -74.0060)  # NYC latitude, longitude
single = (42,)                     # a single-element tuple needs a trailing comma
record = ("Alice", 28, "analyst")

# Packing and unpacking
name, age, role = record  # unpacking
print(f"{name} is a {age}-year-old {role}")

# Unpacking with * (extended unpacking)
first, *rest = [10, 20, 30, 40, 50]
print(first)  # 10
print(rest)   # [20, 30, 40, 50]

# Tuples as dictionary keys (because they are hashable)
sales_by_location = {
    ("US", "East"): 15000,
    ("US", "West"): 18000,
    ("EU", "North"): 12000,
}
print(sales_by_location[("US", "West")])  # 18000
Key Concept: Immutability

You cannot add, remove, or change elements of a tuple after creation. Attempting coordinates[0] = 41.0 raises a TypeError. This makes tuples safer for data that should not change and slightly faster than lists.
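A short demonstration of that TypeError, plus the idiomatic workaround of building a new tuple:

```python
coordinates = (40.7128, -74.0060)

try:
    coordinates[0] = 41.0  # item assignment is not allowed on a tuple
except TypeError as err:
    print(f"TypeError: {err}")

# To "change" a tuple, construct a new one instead
moved = (41.0,) + coordinates[1:]
print(moved)  # (41.0, -74.006)
```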

Dictionaries

Dictionaries store key-value pairs with O(1) average lookup time. They are ideal for mapping identifiers to records, configuration settings, and JSON-like data.

# Creating dictionaries
employee = {
    "name": "Alice",
    "department": "Data Science",
    "salary": 95000,
    "skills": ["Python", "SQL", "Tableau"],
}

# Accessing values
print(employee["name"])              # "Alice"
print(employee.get("title", "N/A"))  # "N/A" (safe access with default)

# Common methods
print(list(employee.keys()))    # ['name', 'department', 'salary', 'skills']
print(list(employee.values()))  # ['Alice', 'Data Science', 95000, [...]]
print(list(employee.items()))   # list of (key, value) tuples

# Updating and adding
employee["salary"] = 100000   # update an existing key
employee["title"] = "Senior"  # add a new key
employee.update({"location": "NYC", "remote": True})

# Nested dictionaries (common for JSON data)
dataset_info = {
    "train": {"rows": 8000, "features": 15},
    "test": {"rows": 2000, "features": 15},
}
print(dataset_info["train"]["rows"])  # 8000

# Iterating over a dictionary
for key, value in employee.items():
    print(f"{key}: {value}")

Sets

Sets are unordered collections of unique elements. They support fast membership testing and mathematical set operations.

# Creating sets
skills_alice = {"Python", "SQL", "Pandas", "Tableau"}
skills_bob = {"Python", "R", "SQL", "Excel"}

# Set operations
common = skills_alice & skills_bob  # intersection
print(common)  # {'Python', 'SQL'}

all_skills = skills_alice | skills_bob  # union
print(all_skills)  # {'Python', 'SQL', 'Pandas', 'Tableau', 'R', 'Excel'}

only_alice = skills_alice - skills_bob  # difference
print(only_alice)  # {'Pandas', 'Tableau'}

exclusive = skills_alice ^ skills_bob  # symmetric difference
print(exclusive)  # {'Pandas', 'Tableau', 'R', 'Excel'}

# Practical use: remove duplicates from data
raw_ids = [101, 102, 101, 103, 102, 104]
unique_ids = list(set(raw_ids))
print(unique_ids)  # [101, 102, 103, 104] (order may vary)

# Fast membership testing
valid_columns = {"name", "age", "salary", "department"}
print("age" in valid_columns)  # True (O(1) lookup)

Strings

Strings are immutable sequences of characters. Data analysts frequently use string methods for cleaning and formatting text data.

# Common string methods for data cleaning
raw = "  John Doe  "
print(raw.strip())          # "John Doe"
print(raw.lower())          # "  john doe  "
print(raw.upper())          # "  JOHN DOE  "
print(raw.strip().split())  # ['John', 'Doe']
print("-".join(["2024", "01", "15"]))  # "2024-01-15"

# Checking content
filename = "report_2024.csv"
print(filename.endswith(".csv"))      # True
print(filename.startswith("report"))  # True
print("12345".isdigit())              # True

# Replacing substrings
col_name = "Annual Revenue ($)"
clean_name = (
    col_name.replace(" ", "_")
    .replace("(", "")
    .replace(")", "")
    .replace("$", "usd")
    .lower()
)
print(clean_name)  # "annual_revenue_usd"

# f-strings (formatted string literals), Python 3.6+
metric = "accuracy"
value = 0.9534
print(f"Model {metric}: {value:.2%}")  # "Model accuracy: 95.34%"
print(f"Value: {value:.4f}")           # "Value: 0.9534"
print(f"Count: {1500000:,}")           # "Count: 1,500,000"

Choosing the Right Data Structure

  • list: an ordered collection that changes over time (mutable, maintains insertion order)
  • tuple: a fixed record, e.g., a database row (immutable, hashable, slightly faster)
  • dict: lookup by key / label mapping (O(1) key lookup, key-value semantics)
  • set: removing duplicates / membership tests (unique elements, O(1) membership test)
  • str: text data / labels (rich methods for manipulation and formatting)
Exam Tip

If a question asks about storing unique items or fast lookups, think set or dict. If order matters and elements may repeat, think list. If data should not change, think tuple.

2.1.5 Python Scripting Best Practices

PEP 8: Style Guide for Python Code

PEP 8 is the official style guide for Python code. Following PEP 8 ensures consistency and readability across projects and teams.

Naming Conventions

  • Variables and functions: snake_case (total_revenue, calculate_mean())
  • Constants: UPPER_SNAKE_CASE (MAX_RETRIES, DEFAULT_TIMEOUT)
  • Classes: PascalCase (DataProcessor, SalesReport)
  • Modules/packages: short lowercase names (utils.py, analysis.py)
  • Private attributes: leading underscore (_internal_cache)

Indentation and Line Length

# Use 4 spaces per indentation level (never tabs)
def process_data(df):
    for col in df.columns:
        if df[col].dtype == "object":
            df[col] = df[col].str.strip().str.lower()
    return df

# Maximum line length: 79 characters (code), 72 (docstrings/comments)
# Break long lines with parentheses (implicit continuation)
filtered_df = df[
    (df["revenue"] > 1000)
    & (df["region"] == "North")
    & (df["year"] >= 2023)
]

Whitespace Conventions

# YES: spaces around operators and after commas
x = 1
y = x + 2
data = [1, 2, 3]
result = calculate(a, b, c=10)

# NO: missing or extra spaces
# x=1
# data = [ 1 , 2 , 3 ]
# result = calculate( a , b , c = 10 )

# Blank lines: 2 blank lines before top-level functions/classes,
# 1 blank line between methods inside a class

Import Organization

# Imports should be at the top of the file, grouped in this order:

# 1. Standard library imports
import os
import sys
from datetime import datetime

# 2. Third-party library imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# 3. Local/project-specific imports
from utils import clean_data
from config import DATABASE_URL

PEP 257: Docstring Conventions

PEP 257 defines conventions for writing docstrings — the string literals that appear as the first statement in a module, function, class, or method.

One-line Docstrings

def square(x):
    """Return the square of a number."""
    return x ** 2

Multi-line Docstrings

import pandas as pd


def load_and_clean(filepath, drop_na=True, encoding="utf-8"):
    """Load a CSV file and apply standard cleaning steps.

    Read the file at the given path into a DataFrame, strip whitespace
    from string columns, and optionally drop rows with missing values.

    Args:
        filepath: Path to the CSV file.
        drop_na: If True, drop rows containing NaN values. Defaults to True.
        encoding: File encoding. Defaults to 'utf-8'.

    Returns:
        A cleaned pandas DataFrame ready for analysis.

    Raises:
        FileNotFoundError: If the file does not exist.
        ValueError: If the file is empty.
    """
    df = pd.read_csv(filepath, encoding=encoding)
    if df.empty:
        raise ValueError(f"File is empty: {filepath}")

    # Strip whitespace from string columns
    str_cols = df.select_dtypes(include="object").columns
    df[str_cols] = df[str_cols].apply(lambda c: c.str.strip())

    if drop_na:
        df = df.dropna()
    return df

Module and Class Docstrings

"""Data processing utilities for the sales analysis project. This module provides helper functions for loading, cleaning, and transforming sales data from CSV files and SQL databases. """ class SalesAnalyzer: """Analyze sales data and generate summary reports. Attributes: data: A pandas DataFrame containing sales records. period: The time period for the analysis (e.g., 'Q1_2024'). """ def __init__(self, data, period): """Initialize SalesAnalyzer with data and period.""" self.data = data self.period = period
PEP 257 Rules to Remember
  • Use triple double quotes ("""...""") for all docstrings.
  • One-line docstrings: opening and closing quotes on the same line, ending with a period.
  • Multi-line docstrings: summary line, blank line, then elaboration.
  • The closing """ goes on its own line for multi-line docstrings.
Common PEP 8 Violations on the Exam

Watch for questions that test whether you can spot style violations: using camelCase for variables, mixing tabs and spaces, putting imports at the bottom, missing blank lines between functions, or lines exceeding 79 characters.
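As an illustration (the function names here are hypothetical), the commented-out version below packs in several of these violations; the version that follows fixes them:

```python
# Spot-the-violation example. The commented-out version breaks several
# PEP 8 rules; the function below it fixes them.
#
#   def computeTotal(values):      # camelCase function name
#       total=0                    # no spaces around '='
#       for v in values: total+=v  # compound statement on one line
#       return total


def compute_total(values):
    """Return the sum of a sequence of numbers."""
    total = 0
    for value in values:
        total += value
    return total


print(compute_total([1, 2, 3]))  # 6
```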

Practice Quiz: Core Python Proficiency

Q1. In Python's LEGB rule, what does the "E" stand for?
A) External — variables imported from other modules
B) Enclosing — the scope of an enclosing (outer) function
C) Environment — OS environment variables
D) Evaluated — variables evaluated at compile time
Correct: B) LEGB stands for Local, Enclosing, Global, Built-in. The Enclosing scope refers to the local scope of any enclosing function, which is relevant when using nested (inner) functions.
Q2. What is the output of the following code?
data = [10, 20, 30, 40, 50]
result = data[1:4]
print(result)
A) [10, 20, 30, 40]
B) [20, 30, 40]
C) [20, 30, 40, 50]
D) [10, 20, 30]
Correct: B) Slicing with data[1:4] returns elements at indices 1, 2, and 3. The start index is inclusive and the stop index is exclusive, so we get [20, 30, 40].
Q3. Which of the following correctly defines a function with a default argument?
A) def process(data, verbose) = True:
B) def process(data, verbose=True):
C) def process(verbose=True, data):
D) def process(data; verbose=True):
Correct: B) Default values are specified with = inside the parentheses. Parameters with defaults must come after parameters without defaults, which makes C) a SyntaxError.
Q4. What does **kwargs collect in a function definition?
A) All positional arguments as a list
B) All positional arguments as a tuple
C) All extra keyword arguments as a dictionary
D) All arguments (both positional and keyword) as a dictionary
Correct: C) **kwargs collects any keyword arguments that are not explicitly defined in the function signature into a dictionary. *args collects extra positional arguments into a tuple.
Q5. Which library would you use to perform a t-test comparing two groups in Python?
A) pandas
B) numpy
C) scipy.stats
D) statistics
Correct: C) scipy.stats provides ttest_ind() for independent two-sample t-tests and ttest_rel() for paired t-tests. The statistics module only covers basic descriptive stats. numpy and pandas do not include hypothesis testing functions.
Q6. What is the key difference between a list and a tuple in Python?
A) Lists can contain mixed types; tuples cannot
B) Tuples are faster for element access; lists are faster for iteration
C) Lists are mutable; tuples are immutable
D) Lists are ordered; tuples are unordered
Correct: C) Both lists and tuples are ordered sequences that can contain mixed types. The fundamental difference is that lists are mutable (can be modified after creation) while tuples are immutable (cannot be changed after creation).
Q7. What does the following list comprehension produce?
result = [x**2 for x in range(6) if x % 2 == 0]
print(result)
A) [0, 1, 4, 9, 16, 25]
B) [0, 4, 16]
C) [4, 16, 36]
D) [1, 9, 25]
Correct: B) range(6) produces 0 through 5. The condition x % 2 == 0 filters to even numbers: 0, 2, 4. Squaring these gives [0, 4, 16].
Q8. According to PEP 8, which naming convention should be used for a Python function?
A) CalculateMean (PascalCase)
B) calculateMean (camelCase)
C) calculate_mean (snake_case)
D) CALCULATE_MEAN (UPPER_SNAKE_CASE)
Correct: C) PEP 8 specifies snake_case for functions and variables, PascalCase for classes, and UPPER_SNAKE_CASE for constants.
Q9. What is the result of the following set operation?
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
print(a - b)
A) {1, 2}
B) {5, 6}
C) {3, 4}
D) {1, 2, 5, 6}
Correct: A) The - operator computes the set difference: elements in a that are not in b. Since 3 and 4 are in both sets, only {1, 2} remain. Note: a ^ b (symmetric difference) would give {1, 2, 5, 6}.
Q10. According to PEP 257, which of the following is a correctly formatted one-line docstring?
A) # Return the square of x.
B) 'Return the square of x.'
C) """Return the square of x."""
D) """return the square of x"""
Correct: C) PEP 257 requires docstrings to use triple double quotes ("""). A one-line docstring should be a complete sentence starting with a capital letter and ending with a period, all on a single line. Option A is a comment (not a docstring), B uses single quotes, and D lacks capitalization and a period.
