Introduction

  • Goal: Understand how to gather requirements and design data systems that meet stakeholder needs and business goals.
  • Key Concepts:
    • Hierarchy of Needs: Business goals → Stakeholder needs → System requirements (functional and non-functional).
    • Requirements Gathering: Conversations with stakeholders to understand their needs and current systems.
    • Documentation: Clearly document functional and non-functional requirements.

Hierarchy of Needs

  1. Business Goals🚩:
    • High-level objectives (e.g., increase revenue, improve customer retention, expand to new markets).
    • Example: “The company aims to grow by launching new products and improving customer retention.”
  2. Stakeholder Needs:
    • What stakeholders (e.g., marketing, data scientists) need to achieve business goals.
    • Example: “Marketing needs real-time dashboards to monitor product sales and a recommender system for personalized product recommendations.”
  3. System Requirements:
    • Functional Requirements: What the system must do (e.g., serve data no more than one hour old).
    • Non-Functional Requirements: Characteristics of the system (e.g., scalability, reliability, latency).

Requirements Gathering Process

  1. Identify Stakeholders:
    • Talk to leadership (e.g., CTO, CEO) to understand business goals.
    • Engage with end users (e.g., marketing, data scientists) to understand their needs.
  2. Understand Current Systems:
    • Learn about existing systems and their limitations.
    • Example: Marketing currently gets daily data files, but they need real-time data.
  3. Ask Key Questions:
    • What actions will stakeholders take with the data?
    • What problems exist with the current system?
    • Who else should you talk to for more information?
  4. Document Requirements:
    • Use a hierarchical format to connect business goals, stakeholder needs, and system requirements.
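
One way to keep the hierarchy explicit is to record it as nested data, so every system requirement stays traceable to the stakeholder need and business goal behind it. The sketch below is illustrative only: the class names (BusinessGoal, StakeholderNeed, Requirement) and the render helper are assumptions, not part of the course material, and the sample entries are taken from the examples in these notes.

```python
from dataclasses import dataclass, field


@dataclass
class Requirement:
    description: str
    kind: str  # "functional" or "non-functional"


@dataclass
class StakeholderNeed:
    stakeholder: str
    need: str
    requirements: list[Requirement] = field(default_factory=list)


@dataclass
class BusinessGoal:
    goal: str
    needs: list[StakeholderNeed] = field(default_factory=list)


def render(goal: BusinessGoal) -> str:
    """Render the hierarchy as an indented plain-text outline."""
    lines = [f"Business goal: {goal.goal}"]
    for need in goal.needs:
        lines.append(f"  {need.stakeholder} needs: {need.need}")
        for req in need.requirements:
            lines.append(f"    [{req.kind}] {req.description}")
    return "\n".join(lines)


# Sample entries borrowed from the examples in these notes.
doc = BusinessGoal(
    goal="Improve customer retention",
    needs=[
        StakeholderNeed(
            stakeholder="Marketing",
            need="Dashboards to monitor product sales",
            requirements=[
                Requirement("Serve data no more than one hour old", "functional"),
                Requirement("Handle peak user activity without slowing down", "non-functional"),
            ],
        )
    ],
)
print(render(doc))
```

Keeping the three levels in one structure makes it easy to spot a requirement that has no stakeholder need behind it, or a need that has no requirement yet.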

Functional Requirements

  • Analytics Dashboards:
    • Serve data no more than one hour old.
    • Example: “The system must provide near-real-time product sales data (no more than one hour old) for marketing dashboards.”
  • Recommender System:
    • Provide training data for the recommender model.
    • Ingest, transform, and serve user data to the model.
    • Return product recommendations to the sales platform.
    • Example: “The system must serve personalized product recommendations based on user behavior.”
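
The "no more than one hour old" requirement can be turned into an automated check. The sketch below is a minimal illustration, assuming the pipeline records a timestamp for the most recent load; the names is_fresh, MAX_STALENESS, and last_loaded_at are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# From the functional requirement: served data is at most one hour old.
MAX_STALENESS = timedelta(hours=1)


def is_fresh(
    last_loaded_at: datetime,
    now: datetime,
    max_staleness: timedelta = MAX_STALENESS,
) -> bool:
    """Return True if the most recent load is within the allowed staleness."""
    return now - last_loaded_at <= max_staleness


# Fixed timestamps keep the example deterministic.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
half_hour_old = datetime(2024, 1, 1, 11, 30, tzinfo=timezone.utc)
two_hours_old = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)

print(is_fresh(half_hour_old, now))  # True: within the one-hour limit
print(is_fresh(two_hours_old, now))  # False: older than one hour
```

A check like this could run on a schedule and raise an alert when it returns False, which turns the written requirement into something that is continuously verified.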

Non-Functional Requirements

  • Analytics Dashboards:
    • Scalability: Handle peak user activity without slowing down.
    • Reliability: Perform data quality checks to ensure data conforms to the expected format.
    • Maintainability: Easily adapt to changes in data schema.
  • Recommender System:
    • Latency: Serve recommendations in less than one second.
    • Scalability: Handle the expected peak number of concurrent users.
    • Reliability: Default to popular products if the recommender fails.
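
The latency and reliability requirements for the recommender can be combined in one wrapper: call the model with a time budget and return popular products if the call fails or takes too long. This is a minimal sketch under stated assumptions. The function recommend_with_fallback, the product codes, and the model callables are all hypothetical, and a thread pool timeout is only one possible way to enforce the budget.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

POPULAR_PRODUCTS = ["P-001", "P-002", "P-003"]  # hypothetical product codes
LATENCY_BUDGET_SECONDS = 1.0  # from the requirement: under one second

# Shared pool so that a slow model call does not block the caller.
_executor = ThreadPoolExecutor(max_workers=4)


def recommend_with_fallback(
    model_call: Callable[[int], list[str]],
    customer_number: int,
    timeout: float = LATENCY_BUDGET_SECONDS,
) -> list[str]:
    """Return model recommendations, or popular products on failure or timeout."""
    future = _executor.submit(model_call, customer_number)
    try:
        return future.result(timeout=timeout)
    except Exception:
        # Covers both a model error and a timeout while waiting for the result.
        return list(POPULAR_PRODUCTS)


def broken_model(customer_number: int) -> list[str]:
    raise RuntimeError("model unavailable")


def slow_model(customer_number: int) -> list[str]:
    time.sleep(0.3)  # simulates a model that misses the latency budget
    return ["P-999"]


print(recommend_with_fallback(broken_model, 103))
print(recommend_with_fallback(slow_model, 103, timeout=0.05))
```

Note that the slow call keeps running in the background after the timeout; a production system would also need to cancel or bound that work.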

Conversations with Stakeholders

  1. Marketing Team:
    • Needs real-time dashboards to monitor product sales and react to demand spikes.
    • Wants a personalized recommender system for customers.
  2. Data Scientists:
    • Currently work with daily data files but need real-time data for dashboards and recommender models.
  3. Software Engineers:
    • Plan to set up a read replica database and API for continuous data access.
    • Will notify data engineers of schema changes and system outages.

Trade-Offs in Requirements Gathering

  • Iron Triangle 🔼:
    • Scope: Features and functionality of the system.
    • Timeline: How quickly the system needs to be built.
    • Cost: Budget constraints for the project.
  • Key Insight: You can’t optimize all three simultaneously (e.g., building fast and cheap usually means cutting scope or quality).
  • Solution:
    • Build loosely coupled systems for flexibility.
    • Make reversible decisions (two-way doors).
    • Deeply understand stakeholder needs to prioritize effectively.
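
Loose coupling and reversible decisions can be illustrated in code: if the consuming system depends only on a small interface, the component behind it can be replaced later without changing the consumer. The sketch below is an assumption-laden illustration, not the course's design; Recommender, PopularProductsRecommender, and SalesPlatform are hypothetical names.

```python
from typing import Protocol


class Recommender(Protocol):
    """The only contract the sales platform depends on."""

    def recommend(self, customer_number: int) -> list[str]: ...


class PopularProductsRecommender:
    """Simple first version: everyone gets the same popular products."""

    def __init__(self, popular: list[str]) -> None:
        self._popular = list(popular)

    def recommend(self, customer_number: int) -> list[str]:
        return list(self._popular)


class SalesPlatform:
    """Consumer that knows the interface, not the implementation."""

    def __init__(self, recommender: Recommender) -> None:
        self._recommender = recommender

    def homepage_products(self, customer_number: int) -> list[str]:
        return self._recommender.recommend(customer_number)[:3]


platform = SalesPlatform(PopularProductsRecommender(["P-001", "P-002", "P-003", "P-004"]))
print(platform.homepage_products(103))
```

Swapping in a model-backed recommender later only requires a new class with the same recommend method, which is what makes the initial choice a two-way door.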

Sample Project: Recommender System

Key Components of the Project

  • Recommender System: A content-based recommender system is being developed to recommend products to users based on:
    • User features: Customer number, credit limit, city, postal code, country.
    • Product features: Product code, quantity in stock, buy price, MSRP, product line, product scale.
    • User interactions: Products browsed or added to the cart.
  • Two Data Pipelines:
    1. Batch Data Pipeline: Delivers training data to the data scientist for model retraining.
    2. Streaming Data Pipeline: Provides real-time product recommendations to users based on their activity.

Functional Requirements

  • Batch Pipeline:
    • Deliver training data in tabular format.
    • Include user features, product features, and user ratings (1-5).
    • Support retraining the model periodically (weekly, monthly, or quarterly).
    • Handle modifications in data format (e.g., new user or product features).
  • Streaming Pipeline:
    • Provide real-time recommendations with sub-second latency (under 1 second).
    • Handle up to 10,000 concurrent users, with potential for growth.
    • Use a pre-trained recommender model to generate recommendations.
    • Save model outputs for later analysis.
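
For the batch pipeline, "tabular format" amounts to joining each rating with the matching user and product features into one flat row. The sketch below uses only the standard library and made-up sample values; the function names build_training_rows and to_csv, and the field names, are assumptions for illustration. Because rows are built from whatever features are present, a newly added feature flows into the output without code changes.

```python
import csv
import io


def build_training_rows(users: dict, products: dict, ratings: list[dict]) -> list[dict]:
    """Join ratings with user and product features into flat rows."""
    rows = []
    for rating in ratings:
        user = users.get(rating["customer_number"])
        product = products.get(rating["product_code"])
        if user is None or product is None:
            continue  # skip ratings that reference unknown users or products
        row = {
            "customer_number": rating["customer_number"],
            "product_code": rating["product_code"],
        }
        # Assumes user and product feature names do not collide.
        row.update(user)
        row.update(product)
        row["rating"] = rating["rating"]
        rows.append(row)
    return rows


def to_csv(rows: list[dict]) -> str:
    """Serialize rows to CSV, using the union of keys as the header."""
    fieldnames: list[str] = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample values, for illustration only.
users = {103: {"city": "Nantes", "country": "France", "credit_limit": 21000.0}}
products = {"P-001": {"product_line": "Motorcycles", "msrp": 95.7}}
ratings = [
    {"customer_number": 103, "product_code": "P-001", "rating": 4},
    {"customer_number": 999, "product_code": "P-001", "rating": 2},  # unknown user
]

rows = build_training_rows(users, products, ratings)
print(to_csv(rows))
```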

Non-Functional Requirements

  • Latency: Recommendations must be generated in under 1 second to match page rendering times.
  • Scalability: The system must handle spikes of up to 10,000 concurrent users and scale as the company grows.
  • Flexibility: The system should accommodate changes in data format (e.g., new features).
  • Operational Overhead: Minimize the effort required to deliver new batches of training data.
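
The flexibility requirement suggests validating only the fields the pipeline depends on and tolerating extra ones, so a new user or product feature does not break ingestion. The following sketch shows one way to do that; REQUIRED_FIELDS and validate_record are hypothetical names, and the 1-5 rating range comes from the batch pipeline requirements in these notes.

```python
# Fields the pipeline depends on, with their expected Python types.
REQUIRED_FIELDS = {"customer_number": int, "product_code": str, "rating": int}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is valid.

    Extra fields are allowed on purpose so that new features can be added
    upstream without failing validation.
    """
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            actual = type(record[name]).__name__
            errors.append(f"{name}: expected {expected_type.__name__}, got {actual}")
    rating = record.get("rating")
    if isinstance(rating, int) and not 1 <= rating <= 5:
        errors.append("rating: must be between 1 and 5")
    return errors


good = {"customer_number": 103, "product_code": "P-001", "rating": 4, "new_feature": "x"}
bad = {"customer_number": "103", "rating": 7}

print(validate_record(good))
print(validate_record(bad))
```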

Recommender System Details

  • Content-Based Recommender:
    • Uses vector embeddings for users and products to find similarities.
    • Predicts user ratings for products based on embeddings.
    • Combines recommendations from:
      • User features (e.g., “Based on your profile, you may like…”).
      • Product interactions (e.g., “Based on your browsing history, you may like…”).
  • Vector Database:
    • Stores precomputed product embeddings for faster similarity searches.
    • Organizes embeddings so similar products are close together, speeding up retrieval.
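
The similarity search that a vector database performs can be illustrated with a brute-force version: score every stored product embedding against a query vector and keep the top matches. This is a sketch only. Real vector databases typically use approximate nearest-neighbor indexes rather than scanning every vector, and the embedding values, product names, and the functions cosine_similarity and most_similar below are made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query: list[float], embeddings: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k product keys whose embeddings are closest to the query."""
    scored = sorted(
        embeddings.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [code for code, _ in scored[:k]]


# Toy 3-dimensional embeddings; real ones have many more dimensions.
PRODUCT_EMBEDDINGS = {
    "classic_car_a": [0.9, 0.1, 0.0],
    "classic_car_b": [0.8, 0.2, 0.1],
    "motorcycle_a": [0.1, 0.9, 0.2],
}

print(most_similar([1.0, 0.0, 0.0], PRODUCT_EMBEDDINGS, k=2))
```

The brute-force scan costs time proportional to the number of products per query, which is the cost an index over precomputed embeddings is meant to avoid.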

Implementation Steps

  1. Extract Requirements: Identify functional and non-functional requirements from the stakeholder conversation.
  2. Select Tools: Choose AWS tools and services that meet the requirements.
  3. Lab Exercise:
    • Set up a batch pipeline to deliver training data.
    • Use a pre-trained recommender system for the streaming pipeline.
    • Implement a vector database for fast similarity searches.

Key Takeaways

  1. Stakeholder Engagement:
    • Talk to leadership, end users, and source system owners to understand needs and constraints.
  2. Documentation:
    • Clearly document functional and non-functional requirements using a hierarchical format.
  3. Trade-Offs:
    • Balance scope, timeline, and cost by applying principles like loose coupling and reversible decisions.
  4. System Design:
    • Design systems that are scalable, reliable, and maintainable to meet stakeholder needs and business goals.

Source: DeepLearning.AI data engineering course.