Home

SQL for Data Analysts

Block 2: Programming and Database Skills

Topic 2.4 · 6 Objectives

2.4.1 SQL Queries for Data Retrieval

SELECT, FROM, WHERE Clauses

The SELECT statement is the foundation of SQL data retrieval. It specifies which columns to return. FROM identifies the table, and WHERE filters rows based on conditions.

-- Select all columns from the employees table SELECT * FROM employees; -- Select specific columns with a condition SELECT name, department, salary FROM employees WHERE salary > 50000; -- Multiple conditions using AND / OR SELECT name, department FROM employees WHERE department = 'Engineering' AND salary >= 60000; -- Pattern matching with LIKE SELECT name FROM employees WHERE name LIKE 'A%'; -- names starting with 'A' -- IN operator for matching multiple values SELECT * FROM employees WHERE department IN ('Sales', 'Marketing'); -- BETWEEN for range filtering SELECT * FROM employees WHERE salary BETWEEN 40000 AND 80000;

JOINs: INNER, LEFT, RIGHT, FULL

JOINs combine rows from two or more tables based on a related column. Understanding the different types is critical for data analysis.

JOIN TypeReturns
INNER JOINOnly rows with matching values in both tables
LEFT JOINAll rows from the left table, plus matches from the right (NULL if no match)
RIGHT JOINAll rows from the right table, plus matches from the left (NULL if no match)
FULL OUTER JOINAll rows from both tables (NULL where there is no match on either side)
-- Sample tables for JOIN examples: -- employees(id, name, dept_id) departments(id, dept_name) -- INNER JOIN: only employees who belong to a department SELECT e.name, d.dept_name FROM employees e INNER JOIN departments d ON e.dept_id = d.id; -- LEFT JOIN: all employees, even those without a department SELECT e.name, d.dept_name FROM employees e LEFT JOIN departments d ON e.dept_id = d.id; -- RIGHT JOIN: all departments, even those with no employees SELECT e.name, d.dept_name FROM employees e RIGHT JOIN departments d ON e.dept_id = d.id; -- FULL OUTER JOIN: all rows from both tables SELECT e.name, d.dept_name FROM employees e FULL OUTER JOIN departments d ON e.dept_id = d.id;
Exam Note: SQLite does not support RIGHT JOIN or FULL OUTER JOIN natively. In SQLite, you can simulate a FULL OUTER JOIN using a UNION of a LEFT JOIN and a subquery. The course exam may test your conceptual understanding of all four types.

GROUP BY, HAVING, ORDER BY, LIMIT

GROUP BY aggregates rows that share a value. HAVING filters groups (like WHERE but for aggregated data). ORDER BY sorts results, and LIMIT restricts the number of rows returned.

-- Count employees per department SELECT department, COUNT(*) AS emp_count FROM employees GROUP BY department; -- Average salary per department, only those with avg > 50000 SELECT department, AVG(salary) AS avg_salary, COUNT(*) AS emp_count FROM employees GROUP BY department HAVING AVG(salary) > 50000 ORDER BY avg_salary DESC; -- Get the top 5 highest-paid employees SELECT name, salary FROM employees ORDER BY salary DESC LIMIT 5;
Key Difference: WHERE filters individual rows before grouping. HAVING filters groups after aggregation. You cannot use aggregate functions in a WHERE clause.

SQL Aggregate Functions

Aggregate functions perform calculations across a set of rows and return a single value.

FunctionDescriptionExample
COUNT()Number of rowsCOUNT(*) or COUNT(column)
SUM()Total of numeric valuesSUM(salary)
AVG()Mean of numeric valuesAVG(salary)
MIN()Smallest valueMIN(salary)
MAX()Largest valueMAX(salary)
-- Using multiple aggregate functions SELECT COUNT(*) AS total_employees, SUM(salary) AS total_payroll, AVG(salary) AS avg_salary, MIN(salary) AS lowest_salary, MAX(salary) AS highest_salary FROM employees;

Filtering and Manipulating Data

SQL provides operators and functions for advanced data filtering and manipulation:

  • DISTINCT — removes duplicate rows from results
  • IS NULL / IS NOT NULL — checks for missing values
  • UPPER(), LOWER(), LENGTH() — string functions
  • ROUND(), ABS() — numeric functions
  • CASE WHEN — conditional logic within queries
  • Subqueries — nested queries for complex filtering
-- DISTINCT to remove duplicates SELECT DISTINCT department FROM employees; -- Filter for NULL values SELECT name FROM employees WHERE department IS NULL; -- CASE WHEN for conditional columns SELECT name, salary, CASE WHEN salary >= 80000 THEN 'Senior' WHEN salary >= 50000 THEN 'Mid' ELSE 'Junior' END AS level FROM employees; -- Subquery: employees earning above average SELECT name, salary FROM employees WHERE salary > (SELECT AVG(salary) FROM employees);

2.4.2 CRUD Operations

CRUD stands for Create, Read, Update, Delete — the four fundamental operations for persistent data storage.

CREATE TABLE

Defines a new table structure with columns and their data types.

CREATE TABLE employees ( id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT NOT NULL, department TEXT, salary REAL, hire_date TEXT );

INSERT INTO (Create)

Adds new rows to an existing table.

-- Insert a single row INSERT INTO employees (name, department, salary, hire_date) VALUES ('Alice Smith', 'Engineering', 75000, '2023-01-15'); -- Insert multiple rows INSERT INTO employees (name, department, salary, hire_date) VALUES ('Bob Jones', 'Marketing', 55000, '2023-03-20'), ('Carol Lee', 'Engineering', 82000, '2022-06-10'), ('David Kim', 'Sales', 60000, '2023-09-01');

SELECT (Read)

Retrieves data from one or more tables. This is the most commonly used SQL statement.

-- Read all data SELECT * FROM employees; -- Read with conditions and sorting SELECT name, salary FROM employees WHERE department = 'Engineering' ORDER BY salary DESC;

UPDATE

Modifies existing rows in a table. Always use a WHERE clause to target specific rows — omitting it updates every row.

-- Update salary for a specific employee UPDATE employees SET salary = 80000 WHERE name = 'Alice Smith'; -- Update multiple columns UPDATE employees SET department = 'Data Science', salary = 90000 WHERE id = 3;
Warning: Running UPDATE or DELETE without a WHERE clause will affect every row in the table. This is one of the most common and dangerous SQL mistakes.

DELETE

Removes rows from a table.

-- Delete a specific row DELETE FROM employees WHERE id = 4; -- Delete all rows matching a condition DELETE FROM employees WHERE department = 'Sales'; -- Delete ALL rows (use with extreme caution!) DELETE FROM employees;

2.4.3 Python-Database Connections

sqlite3 Module Usage

Python's built-in sqlite3 module provides a lightweight way to interact with SQLite databases without any additional installation. This is the most commonly tested database module on the exam.

import sqlite3 # Full sqlite3 workflow: create DB, table, insert, select, update, delete # 1. Connect to database (creates file if it doesn't exist) conn = sqlite3.connect('company.db') # 2. Create a cursor object cursor = conn.cursor() # 3. Create a table cursor.execute(''' CREATE TABLE IF NOT EXISTS employees ( id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT NOT NULL, department TEXT, salary REAL ) ''') # 4. Insert data cursor.execute(""" INSERT INTO employees (name, department, salary) VALUES ('Alice', 'Engineering', 75000) """) # Insert multiple rows at once employees_data = [ ('Bob', 'Marketing', 55000), ('Carol', 'Engineering', 82000), ('David', 'Sales', 60000), ] cursor.executemany( "INSERT INTO employees (name, department, salary) VALUES (?, ?, ?)", employees_data ) # 5. Commit changes (IMPORTANT: changes aren't saved until commit) conn.commit() # 6. Query (SELECT) data cursor.execute("SELECT * FROM employees") rows = cursor.fetchall() for row in rows: print(row) # (1, 'Alice', 'Engineering', 75000.0) # (2, 'Bob', 'Marketing', 55000.0) # ... # 7. Update data cursor.execute("UPDATE employees SET salary = ? WHERE name = ?", (80000, 'Alice')) conn.commit() # 8. Delete data cursor.execute("DELETE FROM employees WHERE name = ?", ('David',)) conn.commit() # 9. Close the connection conn.close()
Best Practice: Use a with statement (context manager) to ensure the connection is properly closed, even if an error occurs:
import sqlite3 with sqlite3.connect('company.db') as conn: cursor = conn.cursor() cursor.execute("SELECT * FROM employees") rows = cursor.fetchall() for row in rows: print(row) # Connection is automatically closed when the block exits

Fetch Methods

MethodReturnsUse Case
fetchone()A single row as a tuple, or NoneWhen you expect one result
fetchall()A list of all remaining rows as tuplesWhen the result set is small
fetchmany(n)A list of the next n rowsWhen processing in batches

pymysql Basics

pymysql is a third-party library for connecting to MySQL databases. It follows a similar pattern to sqlite3 but requires connection parameters.

import pymysql # Connect to a MySQL database conn = pymysql.connect( host='localhost', user='root', password='mypassword', database='company_db' ) cursor = conn.cursor() cursor.execute("SELECT * FROM employees WHERE salary > %s", (50000,)) rows = cursor.fetchall() conn.close()

Creating Connections, Cursors

The general workflow for any Python database interaction follows the DB-API 2.0 pattern:

  1. Connect — Establish a connection to the database using connect()
  2. Cursor — Create a cursor object with conn.cursor()
  3. Execute — Run SQL with cursor.execute() or cursor.executemany()
  4. Fetch — Retrieve results with fetchone(), fetchall(), or fetchmany()
  5. Commit — Save changes with conn.commit() (for INSERT/UPDATE/DELETE)
  6. Close — Close the connection with conn.close()

Common Connection Issues

IssueCauseSolution
OperationalErrorDatabase file not found or permissions errorCheck file path and permissions
Data not persistedForgot to call conn.commit()Always commit after INSERT/UPDATE/DELETE
ProgrammingErrorSQL syntax error or table doesn't existVerify SQL syntax and table names
Resource leakConnection not closedUse with statement or finally block
InterfaceErrorOperating on a closed cursor/connectionEnsure connection is still open before querying

2.4.4 Parameterized Queries

Why Parameterized Queries Matter

Parameterized queries (also called prepared statements) separate SQL logic from user-supplied data. Instead of inserting values directly into a query string, you use placeholders and pass values separately. This is the single most important defense against SQL injection attacks.

Never do this: Building SQL queries by concatenating user input with string formatting is dangerous and should always be avoided.
# DANGEROUS - Never do this! user_input = "Alice" query = f"SELECT * FROM employees WHERE name = '{user_input}'" cursor.execute(query) # Vulnerable to SQL injection! # SAFE - Always use parameterized queries cursor.execute("SELECT * FROM employees WHERE name = ?", (user_input,))

Using ? Placeholders with sqlite3

The sqlite3 module uses the ? character as its placeholder. Parameters are passed as a tuple.

import sqlite3 conn = sqlite3.connect('company.db') cursor = conn.cursor() # Single parameter (note the trailing comma to make it a tuple) name = "Alice" cursor.execute("SELECT * FROM employees WHERE name = ?", (name,)) # Multiple parameters dept = "Engineering" min_salary = 60000 cursor.execute( "SELECT * FROM employees WHERE department = ? AND salary > ?", (dept, min_salary) ) # Using executemany with parameterized queries new_employees = [ ('Eve', 'HR', 52000), ('Frank', 'Engineering', 78000), ] cursor.executemany( "INSERT INTO employees (name, department, salary) VALUES (?, ?, ?)", new_employees ) conn.commit() # Named placeholders (sqlite3 also supports dict-based params) cursor.execute( "SELECT * FROM employees WHERE name = :name AND department = :dept", {"name": "Alice", "dept": "Engineering"} )

Using %s with pymysql

The pymysql module uses %s as its placeholder (regardless of data type). Do not confuse this with Python's % string formatting operator — they are handled differently by the database driver.

import pymysql conn = pymysql.connect(host='localhost', user='root', password='pass', database='company_db') cursor = conn.cursor() # %s placeholder for ALL types (string, int, float, etc.) cursor.execute( "SELECT * FROM employees WHERE department = %s AND salary > %s", ('Engineering', 60000) ) results = cursor.fetchall() conn.close()
Placeholder Summary: sqlite3 uses ? (or :name for named params). pymysql uses %s. Both prevent SQL injection when used correctly.

Preventing SQL Injection

SQL injection occurs when an attacker manipulates a query by inserting malicious SQL through user input. Parameterized queries prevent this because the database engine treats parameter values as data, never as executable SQL.

# SQL INJECTION ATTACK EXAMPLE # Imagine a login form where the user enters their username: # Attacker enters: ' OR '1'='1' -- user_input = "' OR '1'='1' --" # VULNERABLE CODE (string concatenation) query = f"SELECT * FROM users WHERE username = '{user_input}'" # This becomes: # SELECT * FROM users WHERE username = '' OR '1'='1' --' # The condition '1'='1' is always true, so ALL rows are returned! # The -- comments out the rest of the query # SAFE CODE (parameterized query) cursor.execute("SELECT * FROM users WHERE username = ?", (user_input,)) # The database treats the entire input as a literal string value # It searches for a username that literally equals: ' OR '1'='1' -- # No rows match, and the attack fails

2.4.5 SQL Data Types in Python

SQL Types and Python Equivalents

When working with SQLite through Python, values are automatically converted between SQL types and Python types. Understanding this mapping is essential for the exam.

SQLite TypePython TypeDescriptionExample
INTEGERintWhole numbers42
REALfloatFloating-point numbers3.14
TEXTstrText strings"hello"
BLOBbytesBinary datab"\x00\x01"
NULLNoneMissing/absent valueNone

Type Conversion Examples

import sqlite3 conn = sqlite3.connect(':memory:') # In-memory database cursor = conn.cursor() # Create a table with various types cursor.execute(''' CREATE TABLE type_demo ( int_col INTEGER, real_col REAL, text_col TEXT, blob_col BLOB, null_col TEXT ) ''') # Insert Python values - automatic type conversion cursor.execute( "INSERT INTO type_demo VALUES (?, ?, ?, ?, ?)", (42, 3.14, "hello", b"\x00\x01\x02", None) ) conn.commit() # Fetch and check Python types cursor.execute("SELECT * FROM type_demo") row = cursor.fetchone() print(type(row[0])) # <class 'int'> print(type(row[1])) # <class 'float'> print(type(row[2])) # <class 'str'> print(type(row[3])) # <class 'bytes'> print(row[4]) # None print(row[4] is None) # True conn.close()
Key Insight: SQLite uses dynamic typing (type affinity), meaning it is flexible about what you store. A column declared as INTEGER can technically store a string. However, Python's sqlite3 module returns values in their natural Python types based on what SQLite actually stored.
# Handling date/time (stored as TEXT in SQLite) import sqlite3 from datetime import datetime conn = sqlite3.connect(':memory:') cursor = conn.cursor() cursor.execute("CREATE TABLE events (name TEXT, event_date TEXT)") # Store datetime as ISO format string now = datetime.now().isoformat() cursor.execute("INSERT INTO events VALUES (?, ?)", ("meeting", now)) conn.commit() # Retrieve and convert back to datetime cursor.execute("SELECT event_date FROM events") date_str = cursor.fetchone()[0] event_dt = datetime.fromisoformat(date_str) print(type(event_dt)) # <class 'datetime.datetime'> conn.close()

2.4.6 Database Security

SQL Injection Attacks Explained

SQL injection is a code injection technique that exploits security vulnerabilities in an application's database layer. It occurs when user input is incorporated into SQL queries without proper sanitization or parameterization.

How SQL Injection Works

An attacker crafts input that changes the intended SQL query's logic:

# Vulnerable application code def login(username, password): query = f"SELECT * FROM users WHERE username = '{username}' AND password = '{password}'" cursor.execute(query) return cursor.fetchone() # Normal usage: login("alice", "secret123") # Query: SELECT * FROM users WHERE username = 'alice' AND password = 'secret123' # ATTACK 1: Bypass authentication login("admin' --", "anything") # Query: SELECT * FROM users WHERE username = 'admin' --' AND password = 'anything' # The -- comments out the password check! # ATTACK 2: Return all users login("' OR 1=1 --", "anything") # Query: SELECT * FROM users WHERE username = '' OR 1=1 --' AND password = 'anything' # 1=1 is always true, so ALL users are returned # ATTACK 3: Drop a table login("'; DROP TABLE users; --", "anything") # Query: SELECT * FROM users WHERE username = ''; DROP TABLE users; --' # This could destroy the entire users table!
Real-World Impact: SQL injection has been responsible for major data breaches, including leaked passwords, stolen credit card numbers, and complete database destruction. It remains one of the OWASP Top 10 security risks.

Prevention Strategies

  1. Always use parameterized queries — This is the primary defense. Never build SQL strings with user input.
  2. Input validation — Validate and sanitize all user inputs (check length, type, allowed characters).
  3. Principle of least privilege — Database accounts used by applications should have minimal permissions (e.g., no DROP TABLE access).
  4. Use ORM libraries — Object-Relational Mappers like SQLAlchemy add an additional layer of protection by abstracting raw SQL.
  5. Error handling — Never expose raw database error messages to users, as they can reveal table structures.

Secure Query Writing in Python

import sqlite3 # SECURE: Complete example of safe database interaction def get_employee_by_name(name): """Safely query an employee by name using parameterized query.""" try: with sqlite3.connect('company.db') as conn: cursor = conn.cursor() cursor.execute( "SELECT * FROM employees WHERE name = ?", (name,) # Always a tuple! ) return cursor.fetchone() except sqlite3.Error as e: print(f"Database error: {e}") # Log internally return None # Don't expose error details to the user def add_employee(name, department, salary): """Safely insert a new employee.""" try: with sqlite3.connect('company.db') as conn: cursor = conn.cursor() cursor.execute( "INSERT INTO employees (name, department, salary) VALUES (?, ?, ?)", (name, department, salary) ) conn.commit() return cursor.lastrowid # Return the ID of the new row except sqlite3.IntegrityError: print("Constraint violation.") return None except sqlite3.Error as e: print(f"Database error: {e}") return None def search_employees(department=None, min_salary=None): """Safely build a dynamic query with optional filters.""" query = "SELECT * FROM employees WHERE 1=1" params = [] if department: query += " AND department = ?" params.append(department) if min_salary is not None: query += " AND salary >= ?" params.append(min_salary) with sqlite3.connect('company.db') as conn: cursor = conn.cursor() cursor.execute(query, params) return cursor.fetchall()
Exam Tip: This course will test your ability to identify vulnerable code and select the secure alternative. Remember: parameterized queries are always the correct answer for preventing SQL injection.

Practice Quiz: SQL for Data Analysts

Q1. Which SQL clause is used to filter groups created by GROUP BY?

A) WHERE
B) HAVING
C) FILTER
D) GROUP FILTER
Correct: B) HAVING — The HAVING clause filters groups after aggregation. WHERE filters individual rows before grouping. There is no FILTER or GROUP FILTER clause in standard SQL.

Q2. What does a LEFT JOIN return?

A) Only rows with matching values in both tables
B) All rows from the left table, with matching rows from the right (NULL if no match)
C) All rows from the right table, with matching rows from the left
D) All rows from both tables, regardless of matches
Correct: B) — A LEFT JOIN returns all rows from the left table. For each left row, if a matching row exists in the right table, it is included; otherwise, NULL values fill the right table's columns. Option A describes INNER JOIN, C describes RIGHT JOIN, and D describes FULL OUTER JOIN.

Q3. Which placeholder does the sqlite3 module use for parameterized queries?

A) %s
B) ?
C) ${}
D) @param
Correct: B) ? — The sqlite3 module uses ? as its placeholder (also supporting :name for named parameters). %s is used by pymysql. The other options are not used by standard Python database modules.

Q4. What is the correct order of operations in a Python database workflow?

A) Cursor → Connect → Execute → Commit → Close
B) Connect → Cursor → Execute → Commit → Close
C) Connect → Execute → Cursor → Commit → Close
D) Cursor → Execute → Connect → Close → Commit
Correct: B) — You must first establish a connection, then create a cursor from that connection, execute queries through the cursor, commit changes via the connection, and finally close the connection.

Q5. What Python type does a SQLite REAL column map to?

A) int
B) float
C) str
D) Decimal
Correct: B) float — SQLite REAL maps to Python's float. INTEGER maps to int, TEXT maps to str, BLOB maps to bytes, and NULL maps to None. Python's Decimal type is not natively used by sqlite3.

Q6. Which of the following is vulnerable to SQL injection?

A) cursor.execute(f"SELECT * FROM users WHERE id = {user_id}")
B) cursor.execute("SELECT * FROM users WHERE id = ?", (user_id,))
C) cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
D) cursor.execute("SELECT * FROM users WHERE id = :id", {"id": user_id})
Correct: A) — Using an f-string to embed user input directly into SQL is vulnerable to injection because the input becomes part of the SQL command itself. Options B, C, and D all use parameterized queries (with different placeholder styles), which safely separate data from SQL logic.

Q7. What does cursor.fetchone() return when no rows match the query?

A) An empty list []
B) An empty tuple ()
C) None
D) It raises a StopIteration exception
Correct: C) None — When there are no more rows to return, fetchone() returns None. fetchall() returns an empty list [] in the same scenario. No exception is raised.

Q8. What happens if you run an UPDATE statement without a WHERE clause?

A) A syntax error is raised
B) Only the first row is updated
C) Every row in the table is updated
D) No rows are updated
Correct: C) Every row in the table is updated — Without a WHERE clause, the UPDATE applies to all rows. This is syntactically valid SQL, so no error is raised. It is one of the most common and dangerous mistakes in SQL.

Q9. Which SQL function returns the number of rows in a result set?

A) SUM()
B) COUNT()
C) LEN()
D) TOTAL()
Correct: B) COUNT()COUNT(*) counts all rows; COUNT(column) counts non-NULL values. SUM() adds numeric values. LEN() is not a standard SQL function (LENGTH() is). TOTAL() exists in SQLite but returns the sum as a float, not a row count.

Q10. Why is conn.commit() necessary after INSERT, UPDATE, or DELETE?

A) It closes the database connection
B) It saves the changes to the database permanently
C) It creates a backup of the database
D) It validates the SQL syntax
Correct: B) — Database changes happen in a transaction. Until conn.commit() is called, changes exist only in memory and will be lost if the connection is closed or the program crashes. Committing writes the changes permanently to disk.

Navigation

2.4.1 SQL Queries for Data Retrieval 2.4.2 CRUD Operations 2.4.3 Python-Database Connections 2.4.4 Parameterized Queries 2.4.5 SQL Data Types in Python 2.4.6 Database Security Practice Quiz (10 Questions)