Requirements Gathering

Introduction

Goal: Understand how to gather requirements and design data systems that meet stakeholder needs and business goals.
Key Concepts:
- Hierarchy of Needs: Business goals → Stakeholder needs → System requirements (functional and non-functional).
- Requirements Gathering: Conversations with stakeholders to understand their needs and current systems.
- Documentation: Clearly document functional and non-functional requirements.

Hierarchy of Needs

Business Goals🚩:
- High-level objectives (e.g., increase revenue, improve customer retention, expand to new markets).
- Example: “The company aims to grow by launching new products and improving customer retention.”
Stakeholder Needs:
- What stakeholders (e.g., marketing, data scientists) need to achieve business goals.
- Example: “Marketing needs real-time dashboards to monitor product sales and a recommender system for personalized product recommendations.”
System Requirements:
- Functional Requirements: What the system must do (e.g., serve data no more than one hour old).
- Non-Functional Requirements: Characteristics of the system (e.g., scalability, reliability, latency).

Requirements Gathering Process

Identify Stakeholders:
- Talk to leadership (e.g., CTO, CEO) to understand business goals.
- Engage with end users (e.g., marketing, data scientists) to understand their needs.
Understand Current Systems:
- Learn about existing systems and their limitations.
- Example: Marketing currently gets daily data files, but they need real-time data.
Ask Key Questions:
- What actions will stakeholders take with the data?
- What problems exist with the current system?
- Who else should you talk to for more information?
Document Requirements:
- Use a hierarchical format to connect business goals, stakeholder needs, and system requirements.
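
One way to keep the hierarchy explicit is to record it as nested data, so that every system requirement traces back to a stakeholder need and a business goal. The following is a minimal sketch, not a prescribed format. The class names (BusinessGoal, StakeholderNeed, Requirement) and the render helper are illustrative assumptions; the example entries reuse requirements from these notes.

```python
from dataclasses import dataclass, field


@dataclass
class Requirement:
    kind: str          # "functional" or "non-functional"
    description: str


@dataclass
class StakeholderNeed:
    stakeholder: str
    need: str
    requirements: list[Requirement] = field(default_factory=list)


@dataclass
class BusinessGoal:
    goal: str
    needs: list[StakeholderNeed] = field(default_factory=list)


def render(goal: BusinessGoal) -> str:
    """Return an indented outline: goal, then needs, then requirements."""
    lines = [f"Business goal: {goal.goal}"]
    for need in goal.needs:
        lines.append(f"  {need.stakeholder} needs: {need.need}")
        for req in need.requirements:
            lines.append(f"    [{req.kind}] {req.description}")
    return "\n".join(lines)


# Hypothetical entry built from the examples in these notes.
goal = BusinessGoal(
    goal="Improve customer retention",
    needs=[
        StakeholderNeed(
            stakeholder="Marketing",
            need="Dashboards to monitor product sales",
            requirements=[
                Requirement("functional", "Serve data no more than one hour old"),
                Requirement("non-functional", "Handle peak user activity without slowing down"),
            ],
        )
    ],
)
print(render(goal))
```

Printing the outline gives a document that can be reviewed with stakeholders line by line.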

Functional Requirements

Analytics Dashboards:
- Serve data no more than one hour old.
- Example: “The system must provide near-real-time product sales data (no more than one hour old) for marketing dashboards.”
Recommender System:
- Provide training data for the recommender model.
- Ingest, transform, and serve user data to the model.
- Return product recommendations to the sales platform.
- Example: “The system must serve personalized product recommendations based on user behavior.”
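
A freshness requirement such as "no more than one hour old" becomes testable once it is written as a check. The sketch below assumes the pipeline records a timestamp for the newest loaded record; the name is_fresh and the last_loaded_at parameter are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# From the dashboard requirement: data no more than one hour old.
MAX_AGE = timedelta(hours=1)


def is_fresh(last_loaded_at: datetime, now: datetime, max_age: timedelta = MAX_AGE) -> bool:
    """True if the newest loaded record is within the allowed age."""
    return now - last_loaded_at <= max_age


# Fixed timestamps so the example is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_fresh(now - timedelta(minutes=45), now))  # True
print(is_fresh(now - timedelta(minutes=90), now))  # False
```

A check like this could run on a schedule and raise an alert when it returns False.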

Non-Functional Requirements

Analytics Dashboards:
- Scalability: Handle peak user activity without slowing down.
- Reliability: Perform data quality checks to ensure data conforms to the expected format.
- Maintainability: Easily adapt to changes in data schema.
Recommender System:
- Latency: Serve recommendations in less than one second.
- Scalability: Handle maximum concurrent users.
- Reliability: Default to popular products if the recommender fails.
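
The latency and reliability requirements for the recommender can be combined in one wrapper: call the model with a time limit and return popular products if it fails or takes too long. This is a sketch under stated assumptions. The function recommend_with_fallback, the POPULAR_PRODUCTS list, and the three stand-in recommenders are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

POPULAR_PRODUCTS = ["P-001", "P-002", "P-003"]  # hypothetical fallback list
_executor = ThreadPoolExecutor(max_workers=4)


def recommend_with_fallback(recommender, user_id, timeout_s=1.0):
    """Return the recommender's output, or popular products on failure or timeout."""
    future = _executor.submit(recommender, user_id)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Covers both a timeout and an error raised inside the recommender.
        future.cancel()  # no effect if the call is already running
        return POPULAR_PRODUCTS


# Stand-in recommenders used only to exercise the wrapper.
def healthy(user_id):
    return [f"rec-for-{user_id}"]


def broken(user_id):
    raise RuntimeError("model unavailable")


def slow(user_id):
    time.sleep(0.3)
    return ["too-late"]


print(recommend_with_fallback(healthy, 42))
print(recommend_with_fallback(broken, 42))
print(recommend_with_fallback(slow, 42, timeout_s=0.05))
```

Note that the timeout bounds how long the caller waits; the slow call itself still runs to completion in the background thread.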

Conversations with Stakeholders

Marketing Team:
- Needs real-time dashboards to monitor product sales and react to demand spikes.
- Wants a personalized recommender system for customers.
Data Scientists:
- Currently work with daily data files but need real-time data for dashboards and recommender models.
Software Engineers:
- Plan to set up a read replica database and API for continuous data access.
- Will notify data engineers of schema changes and system outages.

Trade-Offs in Requirements Gathering

Iron Triangle 🔼:
- Scope: Features and functionality of the system.
- Timeline: How quickly the system needs to be built.
- Cost: Budget constraints for the project.
Key Insight: You can’t optimize all three simultaneously (e.g., a fast, cheap build usually means cutting scope or quality).
Solution:
- Build loosely coupled systems for flexibility.
- Make reversible decisions (two-way doors).
- Deeply understand stakeholder needs to prioritize effectively.
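
Loose coupling can be illustrated with a small interface: downstream code depends only on a read method, so switching from daily files to a read replica API stays a reversible decision. The sketch below is an assumption-laden illustration; SalesSource, DailyFileSource, ReplicaApiSource, and total_sales are hypothetical names, and both sources return in-memory rows rather than touching real files or APIs.

```python
from typing import Iterable, Protocol


class SalesSource(Protocol):
    """Anything that can yield sales rows as dictionaries."""

    def read(self) -> Iterable[dict]: ...


class DailyFileSource:
    """Stand-in for the existing daily data files."""

    def __init__(self, rows):
        self._rows = rows

    def read(self):
        return iter(self._rows)


class ReplicaApiSource:
    """Stand-in for an API in front of a read replica."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable that returns rows

    def read(self):
        return iter(self._fetch())


def total_sales(source: SalesSource) -> float:
    # Depends only on the interface, not on where the data comes from.
    return sum(row["amount"] for row in source.read())


rows = [{"amount": 10.0}, {"amount": 5.5}]
print(total_sales(DailyFileSource(rows)))
print(total_sales(ReplicaApiSource(lambda: rows)))
```

Because total_sales never names a concrete source, replacing one source with the other requires no change to the downstream code.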

Sample Project: Recommender System

Key Components of the Project

Recommender System: A content-based recommender system is being developed to recommend products to users based on:
- User features: Customer number, credit limit, city, postal code, country.
- Product features: Product code, quantity in stock, buy price, MSRP, product line, product scale.
- User interactions: Products browsed or added to the cart.
Two Data Pipelines:
1. Batch Data Pipeline: Delivers training data to the data scientist for model retraining.
2. Streaming Data Pipeline: Provides real-time product recommendations to users based on their activity.

Functional Requirements

Batch Pipeline:
- Deliver training data in tabular format.
- Include user features, product features, and user ratings (1-5).
- Support retraining the model periodically (weekly, monthly, or quarterly).
- Handle modifications in data format (e.g., new user or product features).
Streaming Pipeline:
- Provide real-time recommendations with sub-second latency (under 1 second).
- Handle up to 10,000 concurrent users, with potential for growth.
- Use a pre-trained recommender model to generate recommendations.
- Save model outputs for later analysis.
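
The batch requirements can be sketched as a join of ratings with user and product features into one table. The example below is a minimal illustration using only the standard library. The function names and all sample values are made up, and the column names are assumptions loosely based on the features listed above. The header is derived from the rows themselves, so a new user or product feature appears as a new column without code changes.

```python
import csv
import io

# Made-up sample data for illustration only.
users = {1: {"credit_limit": 5000, "country": "France"}}
products = {"A1": {"product_line": "Motorcycles", "msrp": 99.5}}
ratings = [
    {"customer_number": 1, "product_code": "A1", "rating": 4},
    {"customer_number": 2, "product_code": "A1", "rating": 3},  # unknown user
    {"customer_number": 1, "product_code": "A1", "rating": 9},  # out of range
]


def build_training_rows(users, products, ratings):
    """Join each valid rating with its user and product features."""
    rows = []
    for r in ratings:
        user = users.get(r["customer_number"])
        product = products.get(r["product_code"])
        if user is None or product is None:
            continue  # skip ratings that reference unknown users or products
        if not 1 <= r["rating"] <= 5:
            continue  # ratings are expected on a 1-5 scale
        rows.append({**r, **user, **product})
    return rows


def to_csv(rows):
    """Write rows as CSV, taking the column list from the data."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


print(to_csv(build_training_rows(users, products, ratings)))
```

Silently skipping invalid ratings is a design choice made here for brevity; a real pipeline might log or quarantine them instead.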

Non-Functional Requirements

Latency: Recommendations must be generated in under 1 second to match page rendering times.
Scalability: The system must handle spikes of up to 10,000 concurrent users and scale as the company grows.
Flexibility: The system should accommodate changes in data format (e.g., new features).
Operational Overhead: Minimize the effort required to deliver new batches of training data.

Recommender System Details

Content-Based Recommender:
- Uses vector embeddings for users and products to find similarities.
- Predicts user ratings for products based on embeddings.
- Combines recommendations from:
  - User features (e.g., “Based on your profile, you may like…”).
  - Product interactions (e.g., “Based on your browsing history, you may like…”).
Vector Database:
- Stores precomputed product embeddings for faster similarity searches.
- Organizes embeddings so similar products are close together, speeding up retrieval.
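
The similarity search can be illustrated with cosine similarity over a handful of embeddings. This is a brute-force sketch that compares the query against every product; a vector database avoids that full scan by indexing the embeddings. The function names, the three-dimensional vectors, and the product keys are all made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query, embeddings, k=2):
    """Return the k product keys whose embeddings are most similar to the query."""
    scored = [(cosine_similarity(query, vec), code) for code, vec in embeddings.items()]
    scored.sort(reverse=True)
    return [code for _, code in scored[:k]]


# Made-up precomputed product embeddings.
product_embeddings = {
    "car_a": [0.9, 0.1, 0.0],
    "car_b": [0.8, 0.2, 0.1],
    "plane_a": [0.0, 0.1, 0.9],
}
user_embedding = [1.0, 0.0, 0.0]

print(top_k_similar(user_embedding, product_embeddings, k=2))  # ['car_a', 'car_b']
```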

Implementation Steps

Extract Requirements: Identify functional and non-functional requirements from the conversation.
Select Tools: Choose AWS tools and services that meet the requirements.
Lab Exercise:
- Set up a batch pipeline to deliver training data.
- Use a pre-trained recommender model for the streaming pipeline.
- Implement vector database for fast similarity searches.

Key Takeaways

Stakeholder Engagement:
- Talk to leadership, end users, and source system owners to understand needs and constraints.
Documentation:
- Clearly document functional and non-functional requirements using a hierarchical format.
Trade-Offs:
- Balance scope, timeline, and cost by applying principles like loose coupling and reversible decisions.
System Design:
- Design systems that are scalable, reliable, and maintainable to meet stakeholder needs and business goals.

Source: DeepLearning.ai data engineering course.
