Requirements Gathering
Introduction
Goal: Understand how to gather requirements and design data systems that meet stakeholder needs and business goals.
Key Concepts:
Hierarchy of Needs: Business goals → Stakeholder needs → System requirements (functional and non-functional).
Requirements Gathering: Conversations with stakeholders to understand their needs and current systems.
Documentation: Clearly document functional and non-functional requirements.
Hierarchy of Needs
Business Goals 🚩:
High-level objectives (e.g., increase revenue, improve customer retention, expand to new markets).
Example: “The company aims to grow by launching new products and improving customer retention.”
Stakeholder Needs:
What stakeholders (e.g., marketing, data scientists) need in order to achieve the business goals.
Example: “Marketing needs real-time dashboards to monitor product sales and a recommender system for personalized product recommendations.”
System Requirements:
Functional Requirements: What the system must do (e.g., serve data no more than one hour old).
Non-Functional Requirements: Characteristics of the system (e.g., scalability, reliability, latency).
Requirements Gathering Process
Identify Stakeholders:
Talk to leadership (e.g., CTO, CEO) to understand business goals.
Engage with end users (e.g., marketing, data scientists) to understand their needs.
Understand Current Systems:
Learn about existing systems and their limitations.
Example: Marketing currently gets daily data files, but they need real-time data.
Ask Key Questions:
What actions will stakeholders take with the data?
What problems exist with the current system?
Who else should you talk to for more information?
Document Requirements:
Use a hierarchical format to connect business goals, stakeholder needs, and system requirements.
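One way to keep that hierarchy explicit is to record it as nested data, so every system requirement can be traced back to a stakeholder need and a business goal. The sketch below is only an illustration: the class names (`BusinessGoal`, `StakeholderNeed`, `SystemRequirement`) and the `render` helper are hypothetical, and the sample entries are taken from the examples in these notes.

```python
from dataclasses import dataclass, field


@dataclass
class SystemRequirement:
    kind: str          # "functional" or "non-functional"
    description: str


@dataclass
class StakeholderNeed:
    stakeholder: str
    need: str
    requirements: list = field(default_factory=list)


@dataclass
class BusinessGoal:
    goal: str
    needs: list = field(default_factory=list)


def render(goal: BusinessGoal) -> str:
    """Render the hierarchy as an indented plain-text outline."""
    lines = [f"Business goal: {goal.goal}"]
    for need in goal.needs:
        lines.append(f"  Stakeholder need ({need.stakeholder}): {need.need}")
        for req in need.requirements:
            lines.append(f"    [{req.kind}] {req.description}")
    return "\n".join(lines)


doc = BusinessGoal(
    goal="Improve customer retention",
    needs=[
        StakeholderNeed(
            stakeholder="Marketing",
            need="Real-time dashboards to monitor product sales",
            requirements=[
                SystemRequirement("functional", "Serve data no more than one hour old"),
                SystemRequirement("non-functional", "Handle peak user activity without slowing down"),
            ],
        )
    ],
)

print(render(doc))
```

Keeping the document as data rather than free text makes it easy to spot a requirement that has no stakeholder need behind it.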
Functional Requirements
Analytics Dashboards:
Serve data no more than one hour old.
Example: “The system must provide real-time product sales data for marketing dashboards.”
Recommender System:
Provide training data for the recommender model.
Ingest, transform, and serve user data to the model.
Return product recommendations to the sales platform.
Example: “The system must serve personalized product recommendations based on user behavior.”
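A functional requirement such as "serve data no more than one hour old" is easier to verify once it is written as a check. The following is a minimal sketch, assuming the pipeline records a timezone-aware timestamp for its last successful load; the names `MAX_AGE` and `is_fresh` are illustrative, not part of any specific tool.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# From the dashboard requirement: data must be no more than one hour old.
MAX_AGE = timedelta(hours=1)


def is_fresh(last_loaded_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if the last successful load is within the allowed age."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_loaded_at <= MAX_AGE


# Usage with fixed timestamps so the result is reproducible.
check_time = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_fresh(check_time - timedelta(minutes=30), now=check_time))  # within the hour
print(is_fresh(check_time - timedelta(hours=2), now=check_time))     # too old
```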
Non-Functional Requirements
Analytics Dashboards:
Scalability: Handle peak user activity without slowing down.
Reliability: Perform data quality checks to ensure data conforms to the expected format.
Maintainability: Easily adapt to changes in data schema.
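The reliability and maintainability points for the dashboards can be illustrated with a small format check. This is a sketch under assumptions: the field names and types in `EXPECTED_SCHEMA` are hypothetical, and reporting unknown fields instead of rejecting them is one possible way to tolerate schema changes, not a prescribed one.

```python
# Hypothetical expected format for one sales record.
EXPECTED_SCHEMA = {"product_code": str, "quantity": int, "unit_price": float}


def validate_record(record: dict):
    """Return (errors, unexpected_fields) for a single record."""
    errors = []
    for name, expected_type in EXPECTED_SCHEMA.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            actual = type(record[name]).__name__
            errors.append(f"{name}: expected {expected_type.__name__}, got {actual}")
    # Unknown fields are reported rather than treated as errors, so a new
    # upstream column does not stop the pipeline.
    unexpected = sorted(set(record) - set(EXPECTED_SCHEMA))
    return errors, unexpected


errors, unexpected = validate_record(
    {"product_code": "S10_1678", "quantity": "3", "discount": 0.1}
)
print(errors)
print(unexpected)
```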
Recommender System:
Latency: Serve recommendations in less than one second.
Scalability: Handle maximum concurrent users.
Reliability: Default to popular products if the recommender fails.
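The latency and reliability requirements for the recommender can be combined into a single fallback rule: if the model errors out or does not answer within the time budget, return popular products instead. The sketch below assumes the model is any callable that takes a user ID; `POPULAR_PRODUCTS`, `recommend`, and the thread-pool approach are illustrative choices, not the course's implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical precomputed list of popular product codes.
POPULAR_PRODUCTS = ["P-001", "P-002", "P-003"]

# Shared pool so a slow model call does not block the caller after the timeout.
_executor = ThreadPoolExecutor(max_workers=4)


def recommend(user_id, model, timeout_s=1.0):
    """Return model recommendations, or popular products on error or timeout."""
    future = _executor.submit(model, user_id)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Covers both a timeout and any exception raised inside the model.
        return list(POPULAR_PRODUCTS)


def slow_model(user_id):
    time.sleep(0.3)  # simulates a model that misses the latency budget
    return ["P-999"]


print(recommend(42, slow_model, timeout_s=0.05))
```

Note that the timed-out call keeps running in the background thread; a production system would also need a way to cancel or bound that work.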
Conversations with Stakeholders
Marketing Team:
Needs real-time dashboards to monitor product sales and react to demand spikes.
Wants a personalized recommender system for customers.
Data Scientists:
Currently work with daily data files but need real-time data for dashboards and recommender models.
Software Engineers:
Plan to set up a read replica database and API for continuous data access.
Will notify data engineers of schema changes and system outages.
Trade-Offs in Requirements Gathering
Iron Triangle 🔼:
Scope: Features and functionality of the system.
Timeline: How quickly the system needs to be built.
Cost: Budget constraints for the project.
Key Insight: You can’t optimize all three simultaneously (e.g., building fast and cheap usually means cutting scope or compromising quality).
Solution:
Build loosely coupled systems for flexibility.
Make reversible decisions (two-way doors).
Deeply understand stakeholder needs to prioritize effectively.
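Loose coupling can be made concrete in code: if the pipeline depends only on a small interface, the storage behind it can be swapped later, which keeps that choice a two-way door. The sketch below is an assumption-laden illustration; `RecordSink`, `InMemorySink`, and `run_pipeline` are hypothetical names.

```python
from typing import Iterable, Protocol


class RecordSink(Protocol):
    """Anything that can accept records; the pipeline knows nothing else about it."""

    def write(self, records: Iterable[dict]) -> int: ...


class InMemorySink:
    """Stand-in destination; a file or database sink could replace it unchanged."""

    def __init__(self) -> None:
        self.rows = []

    def write(self, records: Iterable[dict]) -> int:
        before = len(self.rows)
        self.rows.extend(records)
        return len(self.rows) - before


def run_pipeline(source: Iterable[dict], sink: RecordSink) -> int:
    """Drop records without a product code and hand the rest to the sink."""
    cleaned = (r for r in source if r.get("product_code"))
    return sink.write(cleaned)


sink = InMemorySink()
written = run_pipeline([{"product_code": "A"}, {"product_code": ""}, {}], sink)
print(written)
```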
Sample Project: Recommender System
Key Components of the Project
Recommender System: A content-based recommender system is being developed to recommend products to users based on:
User features: Customer number, credit limit, city, postal code, country.
Product features: Product code, quantity in stock, buy price, MSRP, product line, product scale.
User interactions: Products browsed or added to the cart.
Two Data Pipelines:
Batch Data Pipeline: Delivers training data to the data scientist for model retraining.
Streaming Data Pipeline: Provides real-time product recommendations to users based on their activity.
Functional Requirements
Batch Pipeline:
Deliver training data in tabular format.
Include user features, product features, and user ratings (1-5).
Support retraining the model periodically (weekly, monthly, or quarterly).
Handle modifications in data format (e.g., new user or product features).
Streaming Pipeline:
Provide real-time recommendations with sub-second latency (under 1 second).
Handle up to 10,000 concurrent users, with potential for growth.
Use a pre-trained recommender system to generate recommendations.
Save model outputs for later analysis.
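The batch requirements above (tabular output that combines user features, product features, and ratings from 1 to 5, while tolerating new features) can be sketched as a simple join. The sample values are made up for illustration, the column names are assumptions based on the feature lists in these notes, and `build_training_rows` and `to_csv` are hypothetical helpers.

```python
import csv
import io

# Made-up sample data keyed by customer number and product code.
users = {103: {"credit_limit": 21000.0, "city": "Nantes", "country": "France"}}
products = {"S10_1678": {"product_line": "Motorcycles", "msrp": 95.70}}
ratings = [(103, "S10_1678", 4)]  # (customer_number, product_code, rating)


def build_training_rows(users, products, ratings):
    """Join each rating with its user and product features into one flat row."""
    rows = []
    for customer_number, product_code, rating in ratings:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating out of range: {rating}")
        row = {"customer_number": customer_number, "product_code": product_code}
        row.update(users[customer_number])
        row.update(products[product_code])
        row["rating"] = rating
        rows.append(row)
    return rows


def to_csv(rows) -> str:
    """Write rows as CSV; columns come from the data, so new features flow through."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv(build_training_rows(users, products, ratings)))
```

Deriving the columns from the data rather than hard-coding them is one way to handle a new user or product feature without changing the export code.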
Non-Functional Requirements
Latency: Recommendations must be generated in under 1 second to match page rendering times.
Scalability: The system must handle spikes of up to 10,000 concurrent users and scale as the company grows.
Flexibility: The system should accommodate changes in data format (e.g., new features).
Operational Overhead: Minimize the effort required to deliver new batches of training data.
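A latency requirement is only testable once it is clear how it is measured. The sketch below checks measured response times against the 1-second budget using a nearest-rank percentile; choosing the 95th percentile is an assumption made here for illustration (the notes do not say which statistic the requirement applies to), and the sample timings are invented.

```python
import math

LATENCY_BUDGET_S = 1.0  # from the requirement: under 1 second


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of timings."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_budget(samples, pct=95, budget_s=LATENCY_BUDGET_S):
    """True if the chosen percentile is strictly under the budget."""
    return percentile(samples, pct) < budget_s


# Invented response times in seconds.
timings = [0.12, 0.20, 0.35, 0.40, 0.90, 1.40]
print(percentile(timings, 50), meets_budget(timings, pct=50))
print(percentile(timings, 95), meets_budget(timings, pct=95))
```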
Recommender System Details
Content-Based Recommender:
Uses vector embeddings for users and products to find similarities.
Predicts user ratings for products based on embeddings.
Combines recommendations from:
User features (e.g., “Based on your profile, you may like…”).
Product interactions (e.g., “Based on your browsing history, you may like…”).
Vector Database:
Stores precomputed product embeddings for faster similarity searches.
Organizes embeddings so similar products are close together, speeding up retrieval.
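The idea of finding similar items through embeddings can be shown with cosine similarity over a handful of vectors. This is a minimal sketch: the embeddings are invented three-dimensional examples, and the brute-force scan in `top_k` is a stand-in for what a vector database does with an index, which avoids comparing against every stored vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented precomputed product embeddings.
PRODUCT_EMBEDDINGS = {
    "classic_car": [0.9, 0.1, 0.0],
    "vintage_car": [0.8, 0.2, 0.1],
    "motorcycle": [0.1, 0.9, 0.2],
}


def top_k(query_embedding, k=2):
    """Return the k product names most similar to the query embedding."""
    scored = sorted(
        PRODUCT_EMBEDDINGS.items(),
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


print(top_k([1.0, 0.0, 0.0]))
```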
Implementation Steps
Extract Requirements: Identify functional and non-functional requirements from the conversation.
Select Tools: Choose AWS tools and services that meet the requirements.
Lab Exercise:
Set up a batch pipeline to deliver training data.
Use a pre-trained recommender system for the streaming pipeline.
Implement a vector database for fast similarity searches.
Key Takeaways
Stakeholder Engagement:
Talk to leadership, end users, and source system owners to understand needs and constraints.
Documentation:
Clearly document functional and non-functional requirements using a hierarchical format.
Trade-Offs:
Balance scope, timeline, and cost by applying principles like loose coupling and reversible decisions.
System Design:
Design systems that are scalable, reliable, and maintainable to meet stakeholder needs and business goals.
Source: DeepLearning.AI data engineering course.