1. Extracting Data from Files

From a Single File

  • Use spark.read with format (e.g., CSV, JSON, Parquet).
  • Example:
    df = spark.read.format("csv").option("header", "true").load("/path/to/file.csv")
    

From a Directory of Files

  • Spark automatically reads every file in a directory when the path points to a directory.
  • Example:
    df = spark.read.parquet("/path/to/parquet_files/")
    
  • Glob Patterns: use wildcards in the path to restrict which files are read (e.g., *.json).
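As a sketch (the path is a placeholder), a glob can go straight into the path, in both the DataFrame reader and the SQL path syntax covered in the next section:

```sql
-- Read only the JSON files in the directory (hypothetical path)
SELECT * FROM json.`/path/to/logs/*.json`;
```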

2. Prefix in FROM Keyword (Data Type Identification)

  • In SQL, a format prefix before the backtick-quoted path tells Spark how to read the files:
    SELECT * FROM csv.`/path/to/file.csv` -- CSV file
    SELECT * FROM json.`/path/to/file.json` -- JSON file
    SELECT * FROM delta.`/path/to/delta_table` -- Delta Lake table
    

3. Creating Views, Temporary Views, and CTEs

Views

  • Persisted in metastore (visible across sessions).
    CREATE VIEW sales_view AS SELECT * FROM sales_raw;
    

Temporary Views

  • Session-scoped (dropped automatically when the session ends).
    df.createOrReplaceTempView("temp_sales")
    

CTEs (Common Table Expressions)

  • Query-scoped: a named subquery that exists only for the duration of a single statement; nothing is persisted.
    WITH cleaned_sales AS (
      SELECT * FROM sales_raw WHERE amount > 0
    )
    SELECT * FROM cleaned_sales;
    

4. External Tables vs. Delta Lake Tables

  • External Tables:
    • Point to files in non-Delta formats (CSV, Parquet); no Delta features (ACID transactions, time travel).
    • Example:
      CREATE TABLE external_sales USING PARQUET LOCATION '/path/to/parquet/';
      
  • Delta Lake Tables:
    • Support transactions, schema enforcement, optimizations.
    • Example:
      CREATE TABLE delta_sales USING DELTA LOCATION '/path/to/delta/';
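The time-travel feature mentioned above can be sketched as follows (the version number and timestamp are hypothetical):

```sql
-- Query an earlier snapshot of the Delta table by version...
SELECT * FROM delta_sales VERSION AS OF 1;
-- ...or by timestamp
SELECT * FROM delta_sales TIMESTAMP AS OF '2024-01-01';
```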
      

5. Creating Tables from JDBC & External CSV

From JDBC

  • Connect to databases (PostgreSQL, MySQL).
    jdbc_df = spark.read.format("jdbc") \
      .option("url", "jdbc:postgresql://host/db") \
      .option("dbtable", "schema.table") \
      .option("user", "user") \
      .option("password", "pass") \
      .load()
    

From External CSV

  • Directly create a table from CSV:
    CREATE TABLE csv_sales USING CSV OPTIONS (path "/path/to/file.csv", header "true");
    

6. Counting Records with count_if and count

  • count_if(condition): Counts rows matching condition.
    SELECT count_if(amount > 100) FROM sales; -- Counts high-value sales
    
  • count(*) vs. count(column):
    • count(*) → counts every row, including rows containing NULLs.
    • count(column) → counts only rows where that column is not NULL.
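A sketch of the difference, assuming a sales table in which amount may be NULL:

```sql
SELECT
  count(*)               AS all_rows,         -- every row, NULLs included
  count(amount)          AS non_null_amounts, -- skips rows where amount IS NULL
  count_if(amount > 100) AS high_value        -- same as count(CASE WHEN amount > 100 THEN 1 END)
FROM sales;
```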

7. Deduplicating Data

From Existing Delta Table

MERGE INTO deduped_sales AS target
USING (
  -- Keep only the latest row per id *before* merging; otherwise rows with
  -- rn > 1 would also be "not matched" and get inserted. The rn helper
  -- column is dropped so INSERT * matches the target schema.
  SELECT * EXCEPT (rn) FROM (
    SELECT *, row_number() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM sales_raw
  )
  WHERE rn = 1
) AS source
ON target.id = source.id
WHEN NOT MATCHED THEN INSERT *;

New Table Without Duplicates

CREATE TABLE clean_sales AS
SELECT DISTINCT * FROM sales_raw; -- Removes exact duplicates

Dedupe by Specific Columns

SELECT * FROM (
  SELECT *, row_number() OVER (PARTITION BY customer_id, product_id ORDER BY date DESC) AS rn
  FROM sales
) WHERE rn = 1;
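On recent Databricks runtimes the same dedupe can be written with QUALIFY, which filters on a window function without the wrapping subquery (a sketch, assuming QUALIFY is available in your runtime):

```sql
SELECT * FROM sales
QUALIFY row_number() OVER (PARTITION BY customer_id, product_id ORDER BY date DESC) = 1;
```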

8. Data Validation

Primary Key Uniqueness

SELECT customer_id, COUNT(*) 
FROM customers 
GROUP BY customer_id 
HAVING COUNT(*) > 1; -- Finds duplicates

One-to-One Field Validation

SELECT field1, COUNT(DISTINCT field2) 
FROM table 
GROUP BY field1 
HAVING COUNT(DISTINCT field2) > 1; -- Checks if field1 maps to multiple field2

Value Not Present

SELECT * FROM orders 
WHERE order_id NOT IN (SELECT order_id FROM returns); -- Unreturned orders
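Caveat: NOT IN returns no rows at all if the subquery yields any NULL order_id. A NULL-safe alternative is NOT EXISTS:

```sql
-- Unreturned orders, robust to NULL order_id values in returns
SELECT * FROM orders o
WHERE NOT EXISTS (
  SELECT 1 FROM returns r WHERE r.order_id = o.order_id
);
```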

9. Timestamp Operations

Casting to Timestamp

SELECT CAST(date_string AS TIMESTAMP) FROM events;
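CAST expects ISO-8601-style strings; for other layouts, to_timestamp with an explicit pattern is the usual route (a sketch; the format string below is hypothetical):

```sql
SELECT to_timestamp(date_string, 'MM/dd/yyyy HH:mm') FROM events;
```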

Extracting Calendar Data

SELECT 
  YEAR(timestamp) AS year,
  MONTH(timestamp) AS month,
  DAY(timestamp) AS day
FROM logs;

10. String & Nested Data Handling

Extract Pattern (Regex)

SELECT regexp_extract(description, 'Order #([0-9]+)', 1) AS order_num FROM sales;

Dot Syntax for Nested Data

SELECT user.name.first FROM json_table; -- Extracts a nested struct field

Array Functions

  • explode(): Converts array elements into rows.
    SELECT explode(items) FROM orders; -- One row per item
    
  • flatten(): Collapses an array of arrays into a single flat array (it merges, where explode splits into rows).
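A minimal sketch of the contrast, using literal arrays:

```sql
SELECT flatten(array(array(1, 2), array(3, 4))); -- one row: [1, 2, 3, 4]
SELECT explode(array(1, 2));                     -- two rows: 1 and 2
```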

11. Joins & Pivoting

Join Result Prediction

  • Inner Join: Only matching rows.
  • Left Join: All left rows + matched right rows (NULL if no match).
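The predictions above can be checked with inline tables (table and column names are hypothetical):

```sql
-- left table: ids 1, 2; right table: id 1 only
SELECT l.id, l.v, r.w
FROM VALUES (1, 'a'), (2, 'b') AS l(id, v)
LEFT JOIN VALUES (1, 'x') AS r(id, w)
  ON l.id = r.id;
-- (1, 'a', 'x') matches; (2, 'b', NULL) survives with a NULL right side.
-- An INNER JOIN would return only (1, 'a', 'x').
```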

PIVOT Clause (Long → Wide)

SELECT * FROM sales
PIVOT (SUM(amount) FOR product IN ('A', 'B', 'C'));

12. SQL UDFs (User-Defined Functions)

Define a UDF

CREATE FUNCTION get_discount(price DOUBLE) RETURNS DOUBLE
RETURN price * 0.1;

Location & Security

  • Stored in: the metastore, under the current schema (default unless a schema is specified).
  • Permissions:
    • GRANT SELECT ON FUNCTION get_discount TO user;

13. CASE/WHEN for Control Flow

SELECT 
  CASE 
    WHEN amount > 1000 THEN 'High'
    WHEN amount > 500 THEN 'Medium'
    ELSE 'Low'
  END AS priority
FROM sales;