Data Engineer Associate
ELT with Apache Spark
1. Extracting Data from Files
From a Single File
- Use
spark.read
with format (e.g., CSV, JSON, Parquet). - Example:
From a Directory of Files
- Spark automatically reads all files in a directory if path is given.
- Example:
- Glob Patterns: Filter files (e.g.,
*.json
).
2. Prefix in FROM
Keyword (Data Type Identification)
- In SQL, the prefix after
FROM
indicates the file type:
3. Creating Views, Temporary Views, and CTEs
Views
- Persisted in metastore (visible across sessions).
Temporary Views
- Session-scoped (disappears after session ends).
CTEs (Common Table Expressions)
- In-memory, query-scoped.
4. External Tables vs. Delta Lake Tables
- External Tables:
- Point to files (CSV, Parquet) without Delta features (ACID, time travel).
- Example:
- Delta Lake Tables:
- Support transactions, schema enforcement, optimizations.
- Example:
5. Creating Tables from JDBC & External CSV
From JDBC
- Connect to databases (PostgreSQL, MySQL).
From External CSV
- Directly create a table from CSV:
6. Counting Records with count_if
and count
count_if(condition)
: Counts rows matching condition.count(*)
vs.count(column)
:count(*)
→ All rows (including NULLs).count(column)
→ Skips NULLs.
7. Deduplicating Data
From Existing Delta Table
New Table Without Duplicates
Dedupe by Specific Columns
8. Data Validation
Primary Key Uniqueness
One-to-One Field Validation
Value Not Present
9. Timestamp Operations
Casting to Timestamp
Extracting Calendar Data
10. String & Nested Data Handling
Extract Pattern (Regex)
Dot Syntax for Nested Data
Array Functions
explode()
: Converts array elements into rows.flatten()
: Merges arrays (vs.explode
which splits).
11. Joins & Pivoting
Join Result Prediction
- Inner Join: Only matching rows.
- Left Join: All left rows + matched right rows (NULL if no match).
PIVOT Clause (Long → Wide)
12. SQL UDFs (User-Defined Functions)
Define a UDF
Location & Security
- Stored in: Hive metastore (
default
schema by default). - Permissions:
GRANT SELECT ON FUNCTION get_discount TO user;