1. Extracting Data from Files

From a Single File

  • Use spark.read with format (e.g., CSV, JSON, Parquet).
  • Example:
    df = spark.read.format("csv").option("header", "true").load("/path/to/file.csv")
    

From a Directory of Files

  • Spark automatically reads every file in a directory when the path points to a directory.
  • Example:
    df = spark.read.parquet("/path/to/parquet_files/")
    
  • Glob Patterns: use wildcards in the path to restrict which files are read (e.g., *.json).
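As a sketch (the path is a placeholder), a glob can go straight into the path, in both the DataFrame reader and the SQL path syntax covered in the next section:

```sql
-- Read only the JSON files in the directory (hypothetical path)
SELECT * FROM json.`/path/to/logs/*.json`;
```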

2. Prefix in FROM Keyword (Data Type Identification)

  • In SQL, a format prefix before the backtick-quoted path tells Spark how to read the files:
    SELECT * FROM csv.`/path/to/file.csv` -- CSV file
    SELECT * FROM json.`/path/to/file.json` -- JSON file
    SELECT * FROM delta.`/path/to/delta_table` -- Delta Lake table
    

3. Creating Views, Temporary Views, and CTEs

Views

  • Persisted in metastore (visible across sessions).
    CREATE VIEW sales_view AS SELECT * FROM sales_raw;
    

Temporary Views

  • Session-scoped (dropped automatically when the session ends).
    df.createOrReplaceTempView("temp_sales")
    

CTEs (Common Table Expressions)

  • Query-scoped: a named subquery that exists only for the duration of a single statement; nothing is persisted.
    WITH cleaned_sales AS (
      SELECT * FROM sales_raw WHERE amount > 0
    )
    SELECT * FROM cleaned_sales;
    

4. External Tables vs. Delta Lake Tables

  • External Tables:
    • Point to files in non-Delta formats (CSV, Parquet); no Delta features (ACID transactions, time travel).
    • Example:
      CREATE TABLE external_sales USING PARQUET LOCATION '/path/to/parquet/';
      
  • Delta Lake Tables:
    • Support transactions, schema enforcement, optimizations.
    • Example:
      CREATE TABLE delta_sales USING DELTA LOCATION '/path/to/delta/';
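The time-travel feature mentioned above can be sketched as follows (the version number and timestamp are hypothetical):

```sql
-- Query an earlier snapshot of the Delta table by version...
SELECT * FROM delta_sales VERSION AS OF 1;
-- ...or by timestamp
SELECT * FROM delta_sales TIMESTAMP AS OF '2024-01-01';
```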
      

5. Creating Tables from JDBC & External CSV

From JDBC

  • Connect to databases (PostgreSQL, MySQL).
    jdbc_df = spark.read.format("jdbc") \
      .option("url", "jdbc:postgresql://host/db") \
      .option("dbtable", "schema.table") \
      .option("user", "user") \
      .option("password", "pass") \
      .load()
    

From External CSV

  • Directly create a table from CSV:
    CREATE TABLE csv_sales USING CSV OPTIONS (path "/path/to/file.csv", header "true");
    

6. Counting Records with count_if and count

  • count_if(condition): Counts rows matching condition.
    SELECT count_if(amount > 100) FROM sales; -- Counts high-value sales
    
  • count(*) vs. count(column):
    • count(*) → counts every row, including rows containing NULLs.
    • count(column) → counts only rows where that column is not NULL.
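A sketch of the difference, assuming a sales table in which amount may be NULL:

```sql
SELECT
  count(*)               AS all_rows,         -- every row, NULLs included
  count(amount)          AS non_null_amounts, -- skips rows where amount IS NULL
  count_if(amount > 100) AS high_value        -- same as count(CASE WHEN amount > 100 THEN 1 END)
FROM sales;
```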

7. Deduplicating Data

From Existing Delta Table

MERGE INTO deduped_sales AS target
USING (
  -- Keep only the latest row per id *before* merging; otherwise rows with
  -- rn > 1 would also be "not matched" and get inserted. The rn helper
  -- column is dropped so INSERT * matches the target schema.
  SELECT * EXCEPT (rn) FROM (
    SELECT *, row_number() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM sales_raw
  )
  WHERE rn = 1
) AS source
ON target.id = source.id
WHEN NOT MATCHED THEN INSERT *;

New Table Without Duplicates

CREATE TABLE clean_sales AS
SELECT DISTINCT * FROM sales_raw; -- Removes exact duplicates

Dedupe by Specific Columns

SELECT * FROM (
  SELECT *, row_number() OVER (PARTITION BY customer_id, product_id ORDER BY date DESC) AS rn
  FROM sales
) WHERE rn = 1;
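On recent Databricks runtimes the same dedupe can be written with QUALIFY, which filters on a window function without the wrapping subquery (a sketch, assuming QUALIFY is available in your runtime):

```sql
SELECT * FROM sales
QUALIFY row_number() OVER (PARTITION BY customer_id, product_id ORDER BY date DESC) = 1;
```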

8. Data Validation

Primary Key Uniqueness

SELECT customer_id, COUNT(*) 
FROM customers 
GROUP BY customer_id 
HAVING COUNT(*) > 1; -- Finds duplicates

One-to-One Field Validation

SELECT field1, COUNT(DISTINCT field2) 
FROM table 
GROUP BY field1 
HAVING COUNT(DISTINCT field2) > 1; -- Checks if field1 maps to multiple field2

Value Not Present

SELECT * FROM orders 
WHERE order_id NOT IN (SELECT order_id FROM returns); -- Unreturned orders
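Caveat: NOT IN returns no rows at all if the subquery yields any NULL order_id. A NULL-safe alternative is NOT EXISTS:

```sql
-- Unreturned orders, robust to NULL order_id values in returns
SELECT * FROM orders o
WHERE NOT EXISTS (
  SELECT 1 FROM returns r WHERE r.order_id = o.order_id
);
```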

9. Timestamp Operations

Casting to Timestamp

SELECT CAST(date_string AS TIMESTAMP) FROM events;
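CAST expects ISO-8601-style strings; for other layouts, to_timestamp with an explicit pattern is the usual route (a sketch; the format string below is hypothetical):

```sql
SELECT to_timestamp(date_string, 'MM/dd/yyyy HH:mm') FROM events;
```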

Extracting Calendar Data

SELECT 
  YEAR(timestamp) AS year,
  MONTH(timestamp) AS month,
  DAY(timestamp) AS day
FROM logs;

10. String & Nested Data Handling

Extract Pattern (Regex)

SELECT regexp_extract(description, 'Order #([0-9]+)', 1) AS order_num FROM sales;

Dot Syntax for Nested Data

SELECT user.name.first FROM json_table; -- Extracts a nested struct field

Array Functions

  • explode(): Converts array elements into rows.
    SELECT explode(items) FROM orders; -- One row per item
    
  • flatten(): Collapses an array of arrays into a single flat array (it merges, where explode splits into rows).
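A minimal sketch of the contrast, using literal arrays:

```sql
SELECT flatten(array(array(1, 2), array(3, 4))); -- one row: [1, 2, 3, 4]
SELECT explode(array(1, 2));                     -- two rows: 1 and 2
```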

11. Joins & Pivoting

Join Result Prediction

  • Inner Join: Only matching rows.
  • Left Join: All left rows + matched right rows (NULL if no match).
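The predictions above can be checked with inline tables (table and column names are hypothetical):

```sql
-- left table: ids 1, 2; right table: id 1 only
SELECT l.id, l.v, r.w
FROM VALUES (1, 'a'), (2, 'b') AS l(id, v)
LEFT JOIN VALUES (1, 'x') AS r(id, w)
  ON l.id = r.id;
-- (1, 'a', 'x') matches; (2, 'b', NULL) survives with a NULL right side.
-- An INNER JOIN would return only (1, 'a', 'x').
```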

PIVOT Clause (Long → Wide)

SELECT * FROM sales
PIVOT (SUM(amount) FOR product IN ('A', 'B', 'C'));

12. SQL UDFs (User-Defined Functions)

Define a UDF

CREATE FUNCTION get_discount(price DOUBLE) RETURNS DOUBLE
RETURN price * 0.1;

Location & Security

  • Stored in: the metastore, under the current schema (default unless a schema is specified).
  • Permissions:
    • GRANT SELECT ON FUNCTION get_discount TO user;

13. CASE/WHEN for Control Flow

SELECT 
  CASE 
    WHEN amount > 1000 THEN 'High'
    WHEN amount > 500 THEN 'Medium'
    ELSE 'Low'
  END AS priority
FROM sales;