Intro to Data Engineering
Requirements Gathering
Introduction
- Goal: Understand how to gather requirements and design data systems that meet stakeholder needs and business goals.
- Key Concepts:
- Hierarchy of Needs: Business goals → Stakeholder needs → System requirements (functional and non-functional).
- Requirements Gathering: Conversations with stakeholders to understand their needs and current systems.
- Documentation: Clearly document functional and non-functional requirements.
Hierarchy of Needs
- Business Goals🚩:
- High-level objectives (e.g., increase revenue, improve customer retention, expand to new markets).
- Example: “The company aims to grow by launching new products and improving customer retention.”
- Stakeholder Needs:
- What stakeholders (e.g., marketing, data scientists) need to achieve business goals.
- Example: “Marketing needs real-time dashboards to monitor product sales and a recommender system for personalized product recommendations.”
- System Requirements:
- Functional Requirements: What the system must do (e.g., serve data no more than one hour old).
- Non-Functional Requirements: Characteristics of the system (e.g., scalability, reliability, latency).
Requirements Gathering Process
- Identify Stakeholders:
- Talk to leadership (e.g., CTO, CEO) to understand business goals.
- Engage with end users (e.g., marketing, data scientists) to understand their needs.
- Understand Current Systems:
- Learn about existing systems and their limitations.
- Example: Marketing currently gets daily data files, but they need real-time data.
- Ask Key Questions:
- What actions will stakeholders take with the data?
- What problems exist with the current system?
- Who else should you talk to for more information?
- Document Requirements:
- Use a hierarchical format to connect business goals, stakeholder needs, and system requirements.
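One way to picture the hierarchical format is as nested records, where each system requirement traces back to a stakeholder need and a business goal. The sketch below is only an illustration: the class names (BusinessGoal, StakeholderNeed, SystemRequirement) and the render helper are hypothetical, not part of the course material.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SystemRequirement:
    description: str
    kind: str  # "functional" or "non-functional"


@dataclass
class StakeholderNeed:
    stakeholder: str
    need: str
    requirements: List[SystemRequirement] = field(default_factory=list)


@dataclass
class BusinessGoal:
    goal: str
    needs: List[StakeholderNeed] = field(default_factory=list)


def render(goal: BusinessGoal) -> str:
    """Render the hierarchy as an indented plain-text outline."""
    lines = [f"Business goal: {goal.goal}"]
    for need in goal.needs:
        lines.append(f"  Stakeholder need ({need.stakeholder}): {need.need}")
        for req in need.requirements:
            lines.append(f"    [{req.kind}] {req.description}")
    return "\n".join(lines)


# Example values taken from the notes; the structure itself is an assumption.
example = BusinessGoal(
    goal="Improve customer retention",
    needs=[
        StakeholderNeed(
            stakeholder="Marketing",
            need="Dashboards to monitor product sales",
            requirements=[
                SystemRequirement("Serve data no more than one hour old", "functional"),
                SystemRequirement("Handle peak user activity without slowing down", "non-functional"),
            ],
        )
    ],
)

print(render(example))
```

Keeping the link from requirement to need to goal explicit makes it easier to see which requirements to revisit when a goal changes.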
Functional Requirements
- Analytics Dashboards:
- Serve data no more than one hour old.
- Example: “The system must provide near-real-time product sales data (no more than one hour old) for marketing dashboards.”
- Recommender System:
- Provide training data for the recommender model.
- Ingest, transform, and serve user data to the model.
- Return product recommendations to the sales platform.
- Example: “The system must serve personalized product recommendations based on user behavior.”
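The one-hour freshness requirement for the dashboards is concrete enough to test automatically. The following is a minimal sketch, assuming the pipeline can report the timestamp of the newest record it serves; the names is_fresh and MAX_STALENESS are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_STALENESS = timedelta(hours=1)  # "no more than one hour old"


def is_fresh(latest_record_time: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if the newest served record is at most one hour old."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - latest_record_time <= MAX_STALENESS


# Usage with a fixed clock so the result is reproducible.
check_time = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_fresh(check_time - timedelta(minutes=30), check_time))  # within the limit
print(is_fresh(check_time - timedelta(hours=2), check_time))     # too old
```

A check like this could run on a schedule and raise an alert when it fails, which turns the written requirement into something observable.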
Non-Functional Requirements
- Analytics Dashboards:
- Scalability: Handle peak user activity without slowing down.
- Reliability: Perform data quality checks to ensure data conforms to the expected format.
- Maintainability: Easily adapt to changes in data schema.
- Recommender System:
- Latency: Serve recommendations in less than one second.
- Scalability: Handle maximum concurrent users.
- Reliability: Default to popular products if the recommender fails.
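The latency and reliability requirements for the recommender can be combined in one wrapper: call the model with a time budget, and return popular products if the call fails or runs too long. This is a sketch under assumptions, not the course's implementation. The POPULAR_PRODUCTS list, the model callable, and the function name are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, List

LATENCY_BUDGET_S = 1.0  # "less than one second"
POPULAR_PRODUCTS = ["P-001", "P-002", "P-003"]  # hypothetical fallback list

# Shared pool, so a timed-out call does not block the caller on shutdown.
_pool = ThreadPoolExecutor(max_workers=4)


def recommend_with_fallback(
    model: Callable[[str], List[str]],
    user_id: str,
    budget_s: float = LATENCY_BUDGET_S,
) -> List[str]:
    """Return model recommendations, or popular products on failure or timeout."""
    future = _pool.submit(model, user_id)
    try:
        return future.result(timeout=budget_s)
    except FutureTimeout:
        future.cancel()  # best effort; a call that is already running keeps going
        return POPULAR_PRODUCTS
    except Exception:
        # Any model error degrades to the popular-products default.
        return POPULAR_PRODUCTS
```

The caller always gets a list back within roughly the budget, so a recommender outage degrades the page instead of breaking it.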
Conversations with Stakeholders
- Marketing Team:
- Needs real-time dashboards to monitor product sales and react to demand spikes.
- Wants a personalized recommender system for customers.
- Data Scientists:
- Currently work with daily data files but need real-time data for dashboards and recommender models.
- Software Engineers:
- Plan to set up a read replica database and API for continuous data access.
- Will notify data engineers of schema changes and system outages.
Trade-Offs in Requirements Gathering
- Iron Triangle 🔼:
- Scope: Features and functionality of the system.
- Timeline: How quickly the system needs to be built.
- Cost: Budget constraints for the project.
- Key Insight: You can’t optimize all three simultaneously (e.g., building fast and cheap usually means cutting scope or quality).
- Solution:
- Build loosely coupled systems for flexibility.
- Make reversible decisions (two-way doors).
- Deeply understand stakeholder needs to prioritize effectively.
Sample Project: Recommender System
Key Components of the Project
- Recommender System: A content-based recommender system is being developed to recommend products to users based on:
- User features: Customer number, credit limit, city, postal code, country.
- Product features: Product code, quantity in stock, buy price, MSRP, product line, product scale.
- User interactions: Products browsed or added to the cart.
- Two Data Pipelines:
- Batch Data Pipeline: Delivers training data to the data scientist for model retraining.
- Streaming Data Pipeline: Provides real-time product recommendations to users based on their activity.
Functional Requirements
- Batch Pipeline:
- Deliver training data in tabular format.
- Include user features, product features, and user ratings (1-5).
- Support retraining the model periodically (weekly, monthly, or quarterly).
- Handle modifications in data format (e.g., new user or product features).
- Streaming Pipeline:
- Provide real-time recommendations with subsecond latency (under 1 second).
- Handle up to 10,000 concurrent users, with potential for growth.
- Use a pre-trained recommender model to generate recommendations.
- Save model outputs for later analysis.
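The batch pipeline's output can be pictured as one flat row per rating, with the user and product features joined in. The sketch below uses plain dictionaries and assumes snake_case field names such as customer_number and product_code. Those names and the build_training_rows helper are illustrative, not taken from the course.

```python
def build_training_rows(users, products, ratings):
    """Join ratings with user and product features into flat, tabular rows."""
    users_by_id = {u["customer_number"]: u for u in users}
    products_by_code = {p["product_code"]: p for p in products}
    rows = []
    for r in ratings:
        user = users_by_id.get(r["customer_number"])
        product = products_by_code.get(r["product_code"])
        if user is None or product is None:
            continue  # skip ratings that reference unknown users or products
        if not 1 <= r["rating"] <= 5:
            continue  # ratings are expected on a 1-5 scale
        # Merging whole dicts lets newly added feature columns pass through unchanged.
        rows.append({**user, **product, "rating": r["rating"]})
    return rows


# Made-up sample data for illustration only.
users = [{"customer_number": 1, "city": "Paris", "credit_limit": 5000}]
products = [{"product_code": "S10", "product_line": "Cars", "msrp": 95.7}]
ratings = [
    {"customer_number": 1, "product_code": "S10", "rating": 4},
    {"customer_number": 2, "product_code": "S10", "rating": 5},  # unknown user
    {"customer_number": 1, "product_code": "S10", "rating": 9},  # out of range
]
rows = build_training_rows(users, products, ratings)
print(rows)
```

Because the join copies every feature it finds rather than a fixed column list, adding a new user or product feature does not require changing this step.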
Non-Functional Requirements
- Latency: Recommendations must be generated in under 1 second to match page rendering times.
- Scalability: The system must handle spikes of up to 10,000 concurrent users and scale as the company grows.
- Flexibility: The system should accommodate changes in data format (e.g., new features).
- Operational Overhead: Minimize the effort required to deliver new batches of training data.
Recommender System Details
- Content-Based Recommender:
- Uses vector embeddings for users and products to find similarities.
- Predicts user ratings for products based on embeddings.
- Combines recommendations from:
- User features (e.g., “Based on your profile, you may like…”).
- Product interactions (e.g., “Based on your browsing history, you may like…”).
- Vector Database:
- Stores precomputed product embeddings for faster similarity searches.
- Organizes embeddings so similar products are close together, speeding up retrieval.
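To make the similarity search concrete, the sketch below ranks a few made-up product embeddings by cosine similarity to a query vector. It is a brute-force scan for illustration only: a real vector database would use an index to avoid comparing against every product, and cosine similarity is an assumed metric, since the notes do not say which one the course uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query, product_embeddings, k=2):
    """Return the k product codes whose embeddings are most similar to the query."""
    scored = [
        (cosine_similarity(query, vec), code)
        for code, vec in product_embeddings.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [code for _, code in scored[:k]]


# Hypothetical three-dimensional embeddings; real ones have many more dimensions.
product_embeddings = {
    "classic_car": [0.9, 0.1, 0.0],
    "vintage_car": [0.8, 0.2, 0.1],
    "ship": [0.0, 0.1, 0.9],
}

print(top_k_similar([1.0, 0.0, 0.0], product_embeddings, k=2))
# → ['classic_car', 'vintage_car']
```

The two car embeddings point in nearly the same direction as the query, so they rank ahead of the ship, which is the sense in which similar products sit "close together".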
Implementation Steps
- Extract Requirements: Identify functional and non-functional requirements from the conversation.
- Select Tools: Choose AWS tools and services that meet the requirements.
- Lab Exercise:
- Set up a batch pipeline to deliver training data.
- Use a pre-trained recommender system for the streaming pipeline.
- Implement a vector database for fast similarity searches.
Key Takeaways
- Stakeholder Engagement:
- Talk to leadership, end users, and source system owners to understand needs and constraints.
- Documentation:
- Clearly document functional and non-functional requirements using a hierarchical format.
- Trade-Offs:
- Balance scope, timeline, and cost by applying principles like loose coupling and reversible decisions.
- System Design:
- Design systems that are scalable, reliable, and maintainable to meet stakeholder needs and business goals.
Source: DeepLearning.AI data engineering course.