Create DataFrames

Different ways to create DataFrames in PySpark

1. From a List of Tuples

This method is great for small datasets where rows can be manually defined.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("CreateDataFrames").getOrCreate()

data = [("Arjun", 25, "Mumbai"),
        ("Meera", 30, "Delhi"),
        ("Ravi", 22, "Bangalore")]

schema = StructType([
    StructField("Name", StringType(), True),  
    StructField("Age", IntegerType(), True), 
    StructField("City", StringType(), True) 
])

df = spark.createDataFrame(data, schema)
df.show()
df.printSchema()

Output:

+-----+---+---------+
| Name|Age|     City|
+-----+---+---------+
|Arjun| 25|   Mumbai|
|Meera| 30|    Delhi|
| Ravi| 22|Bangalore|
+-----+---+---------+

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- City: string (nullable = true)

2. From a List of Dictionaries

In this example we do not provide a schema; Spark infers it automatically from the keys and values of the dictionaries.


data = [{"Name": "Priya", "Age": 28, "City": "Chennai"},
        {"Name": "Ramesh", "Age": 35, "City": "Hyderabad"}]

df = spark.createDataFrame(data)
df.show()
df.printSchema()

Output:

Notice below that the column order in the DataFrame differs from the key order in the dictionaries above. When Spark infers the schema from dictionaries, it sorts the field names alphabetically (Age, City, Name), and Age is inferred as a long rather than an integer.

+---+---------+------+
|Age|     City|  Name|
+---+---------+------+
| 28|  Chennai| Priya|
| 35|Hyderabad|Ramesh|
+---+---------+------+

root
 |-- Age: long (nullable = true)
 |-- City: string (nullable = true)
 |-- Name: string (nullable = true)

Using explicit schema:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define schema
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("City", StringType(), True)
])

# Create DataFrame with schema
df = spark.createDataFrame(data, schema=schema)
df.show()
df.printSchema()

Output:

Because the schema was provided explicitly when creating the DataFrame, the column order and data types come out as expected (Age is an integer rather than a long).

+------+---+---------+
|  Name|Age|     City|
+------+---+---------+
| Priya| 28|  Chennai|
|Ramesh| 35|Hyderabad|
+------+---+---------+

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- City: string (nullable = true)

3. From an RDD

RDDs (Resilient Distributed Datasets) are the foundation of Spark, and you can convert them to DataFrames.

# Create an RDD from a Python list
data = [("Arjun", 25, "Mumbai"),
        ("Meera", 30, "Delhi"),
        ("Ravi", 22, "Bangalore")]
columns = ["Name", "Age", "City"]
rdd = spark.sparkContext.parallelize(data)

df = rdd.toDF(columns) # Convert to a DataFrame
df.show()
df.printSchema()

Output:

+-----+---+---------+
| Name|Age|     City|
+-----+---+---------+
|Arjun| 25|   Mumbai|
|Meera| 30|    Delhi|
| Ravi| 22|Bangalore|
+-----+---+---------+

root
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)
 |-- City: string (nullable = true)
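
Note that toDF() infers Age as a long. If you want to control the column types, you can pass the RDD together with an explicit schema to createDataFrame(). A minimal sketch, reusing the schema from section 1:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("City", StringType(), True)
])

# Pass the RDD and the schema together; Age is now an integer instead of a long
df = spark.createDataFrame(rdd, schema)
df.printSchema()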

4. From a Pandas DataFrame

If you already have a pandas DataFrame, you can convert it to a PySpark DataFrame.

import pandas as pd

# Create pandas DataFrame
pandas_df = pd.DataFrame({
    "Name": ["Anjali", "Vikram", "Sita"],
    "Age": [27, 32, 23],
    "City": ["Kolkata", "Jaipur", "Lucknow"]
})

# Convert to PySpark DataFrame
df = spark.createDataFrame(pandas_df)
df.show()
df.printSchema()

Output:

+------+---+-------+
|  Name|Age|   City|
+------+---+-------+
|Anjali| 27|Kolkata|
|Vikram| 32| Jaipur|
|  Sita| 23|Lucknow|
+------+---+-------+

root
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)
 |-- City: string (nullable = true)
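
For larger pandas DataFrames, the conversion can be sped up with Apache Arrow. A minimal sketch, assuming Spark 3.x (where the setting below is named) and that the pyarrow package is installed:

# Enable Arrow-based conversion between pandas and Spark (requires pyarrow)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.createDataFrame(pandas_df)

# The reverse conversion (Spark -> pandas) also benefits from Arrow
pandas_again = df.toPandas()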

5. From a CSV File

This is useful for loading larger datasets stored in files.

# Load CSV file into DataFrame
df = spark.read.csv("path/to/your/file.csv", header=True, inferSchema=True)
df.show()
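
With inferSchema=True, Spark makes an extra pass over the file to work out the column types. For large files you can avoid that pass by supplying the schema yourself. A minimal sketch, assuming the CSV has the same Name, Age, City columns used above (the path is a placeholder):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("City", StringType(), True)
])

# Read the CSV with an explicit schema instead of inferring it
df = spark.read.csv("path/to/your/file.csv", header=True, schema=schema)
df.printSchema()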

6. From a JSON File

Create a DataFrame directly from a JSON file.

# Load JSON file
df = spark.read.json("path/to/your/file.json")
df.show()
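
By default, spark.read.json() expects one JSON object per line (JSON Lines format). If the file instead contains a single JSON document spanning multiple lines, the multiLine option can be used. A sketch with a placeholder path:

# Read a file containing one JSON document that spans multiple lines
df = spark.read.json("path/to/your/file.json", multiLine=True)
df.show()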

7. Programmatically with Row Objects

Row objects let you define rows with named fields, which gives a more structured way to build the data.

from pyspark.sql import Row

# Create Row objects
data = [Row(Name="Suresh", Age=26, City="Thiruvananthapuram"),
        Row(Name="Lakshmi", Age=31, City="Patna")]

# Create DataFrame
df = spark.createDataFrame(data)
df.show()
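
A Row can also be used as a reusable record template: calling Row with field names returns a constructor whose positional arguments fill those fields. A minimal sketch with the same columns:

from pyspark.sql import Row

# Define a reusable Row "template" with named fields
Person = Row("Name", "Age", "City")

data = [Person("Suresh", 26, "Thiruvananthapuram"),
        Person("Lakshmi", 31, "Patna")]

df = spark.createDataFrame(data)
df.show()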

8. Using Range Function

Use spark.range() to create a DataFrame with a sequence of numbers. The end value is exclusive, so range(1, 5) produces 1 through 4.

# Create a DataFrame with numbers from 1 to 4
df = spark.range(1, 5).toDF("Numbers")
df.show()

Output:

+-------+
|Numbers|
+-------+
|      1|
|      2|
|      3|
|      4|
+-------+
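
spark.range() also accepts a step argument. For example, a short sketch producing the even numbers 0 through 8:

# start=0, end=10 (exclusive), step=2
df_even = spark.range(0, 10, 2).toDF("Numbers")
df_even.show()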

9. Using SQL Query on Existing Data

You can create a DataFrame by running an SQL query on an existing table or view.

# Register an existing DataFrame (one with Name and Age columns,
# e.g. the DataFrame from section 1) as a temporary view
df.createOrReplaceTempView("people")

# Query the view
result_df = spark.sql("SELECT Name, Age FROM people WHERE Age > 25")
result_df.show()
