
Big Data System Design Uber Eats

How to Design the Data Infra for Uber Eats Order Tracking Page

Intro

A real-world example of building a scalable data analytics platform. Let’s walk through a technical scenario that combines Kafka, Flink, Spark, Hadoop, Iceberg, Airflow, and Presto into a comprehensive real-time data processing and analytics pipeline:

Assumptions for the Technical Workflow:

  • An Uber Eats-like app tracks the location of drivers and communicates order details to users.
  • The system must handle real-time processing, such as matching drivers with orders, and longer-term analytical processing, like optimizing driver routes and forecasting delivery times.

Overall Architecture Design

Simple Architecture Diagram

Ascii Style

 Driver Apps                        Order System & User Apps
      |                                  |            ^
      | Publish GPS locations            | Order      | Real-time results
      |                                  | events     | (matching, ETAs)
      v                                  v            |
    Kafka (driver-locations & order-events topics)    |
      |                                               |
      | Both streams                                  |
      v                                               |
    Flink ---- Compute aggregations, join streams, ---+
      |        manage stateful processing
      |
      | Write Flink results
      v
    Hadoop (HDFS) <---- Batch results from Spark <---- Airflow orchestrating tasks
      |
      v
  Iceberg Tables
      |
      | SQL queries for analytics
      v
    Presto
      |
      | Visualization & Reporting
      v
  Business Apps

Mermaid Style

graph TB
    DA[Driver Apps] -->|Publish GPS locations| K[Kafka]
    K -->|Driver locations stream| F[Flink]
    K -->|Order events stream| F
    F -->|Real-time processing| OA[Order System & User Apps]
    F -->|Write Flink results| H[Hadoop]
    
    SP[Spark for Batch Analysis] -->|Write batch results| H
    A[Airflow] -->|Schedule & orchestrate| SP
    
    H -->|Data management| I[Iceberg Tables]
    
    P[Presto] -->|SQL queries| I
    
    B[Business Apps] -->|Visualization & Reporting| P
    
    %% Define styling
    classDef default font-size:16px,color:#fff,fill:#000,stroke:#fff,stroke-width:2px;
    linkStyle default stroke:white,stroke-width:1px;

Technical Workflow:

1. Data Ingestion (Apache Kafka):

  • Kafka Producers: The driver app constantly publishes GPS coordinates to a Kafka topic driver-locations, using a producer client library (a minimal producer sketch follows this list). Each message is keyed by driver ID so that all location updates for a given driver land on the same partition, preserving their order.
  • Kafka Consumers: Other systems subscribe to the driver-locations topic to receive updates. Additionally, consumer applications are set up for topics like order-events that handle order placement, acceptance, and completions.
  • Kafka topics are partitioned and replicated to ensure scalability and fault tolerance.
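
As a rough illustration, the producer side could look like the sketch below. It assumes the kafka-python client; the broker address, topic name, and payload fields are invented for the example.

```python
# Hedged sketch of the driver-location producer, assuming the kafka-python client.
# Broker address, topic name, and payload fields are illustrative.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka-broker:9092",                      # assumed broker
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_location(driver_id: str, lat: float, lon: float) -> None:
    # Keying by driver_id keeps all updates for one driver on one partition,
    # so consumers see them in order.
    event = {"driver_id": driver_id, "lat": lat, "lon": lon,
             "ts": int(time.time() * 1000)}
    producer.send("driver-locations", key=driver_id, value=event)

publish_location("driver-42", 37.7749, -122.4194)
producer.flush()
```
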
2. Real-Time Stream Processing (Apache Flink):

  • Flink Jobs: Flink jobs are configured to subscribe to driver-locations and order-events (a PyFlink sketch follows this list). They perform tasks such as:
    • Windowed aggregations to compute the latest location of each driver.
    • Join operations between driver locations and pending orders to facilitate real-time matching and dispatching.
    • Stateful processing for ETA calculations and sending notifications to customers via external systems when a driver is nearby.
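
A minimal PyFlink sketch of the windowed-aggregation piece is shown below. It assumes PyFlink with the Kafka SQL connector on the classpath; the topic name, fields, and window size are illustrative, and the order-matching join is omitted for brevity.

```python
# Hedged PyFlink sketch: windowed aggregation over the driver-locations stream.
# Assumes the Kafka SQL connector JAR is available; names and sizes are illustrative.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE driver_locations (
        driver_id STRING,
        lat DOUBLE,
        lon DOUBLE,
        ts TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'driver-locations',
        'properties.bootstrap.servers' = 'kafka-broker:9092',
        'properties.group.id' = 'flink-driver-locations',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
    )
""")

# Ten-second tumbling windows: updates per driver and the last event time seen.
t_env.execute_sql("""
    SELECT driver_id, window_start, window_end,
           COUNT(*) AS updates, MAX(ts) AS last_seen
    FROM TABLE(TUMBLE(TABLE driver_locations, DESCRIPTOR(ts), INTERVAL '10' SECOND))
    GROUP BY driver_id, window_start, window_end
""").print()
```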

3. Batch Processing and Analysis (Apache Spark):

  • Spark Jobs: Scheduled batch Spark jobs are written in Scala or Python and handle complex analytical tasks (a PySpark sketch follows this list), including:
    • Training machine learning models on historical delivery data to predict delivery times (using MLlib).
    • Aggregating delivery data nightly to create performance metrics per driver or per region.
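
The nightly per-driver aggregation could be sketched roughly as follows; the table and column names are invented for the example, and a date-partitioned deliveries table is assumed.

```python
# Hedged sketch of a nightly Spark batch job computing per-driver delivery metrics.
# Table and column names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("nightly-driver-metrics").getOrCreate()

run_date = "2024-05-01"  # would normally come from the orchestrator
deliveries = spark.read.table("analytics.deliveries").where(F.col("delivery_date") == run_date)

driver_metrics = deliveries.groupBy("driver_id").agg(
    F.count("*").alias("deliveries"),
    F.avg("delivery_minutes").alias("avg_delivery_minutes"),
    F.expr("percentile_approx(delivery_minutes, 0.95)").alias("p95_delivery_minutes"),
)

# Overwrite the day's partition so reruns of the same date stay idempotent.
(driver_metrics
    .withColumn("delivery_date", F.lit(run_date))
    .writeTo("analytics.driver_metrics_daily")
    .overwritePartitions())
```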

4. Workflow Orchestration (Apache Airflow):

  • Airflow DAGs (Directed Acyclic Graphs): Define the dependent jobs and workflows that orchestrate processing tasks. An Airflow DAG might include (a stripped-down sketch follows this list):
    • A task to run a Spark job that rebuilds machine learning models.
    • Another task to batch-process the day’s data and update Iceberg tables.
    • Scheduling and triggering Flink jobs for real-time streaming when needed.
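
A stripped-down version of such a DAG might look like this; it assumes the Apache Spark provider package, and the script paths and connection ID are placeholders.

```python
# Hedged sketch of a nightly Airflow DAG that refreshes Iceberg tables and then
# retrains the ETA model. Script paths and connection IDs are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="nightly_delivery_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",   # every night at 02:00
    catchup=False,
) as dag:
    update_iceberg_tables = SparkSubmitOperator(
        task_id="update_iceberg_tables",
        application="/jobs/update_iceberg_tables.py",   # hypothetical script
        conn_id="spark_default",
    )
    retrain_eta_model = SparkSubmitOperator(
        task_id="retrain_eta_model",
        application="/jobs/train_eta_model.py",         # hypothetical script
        conn_id="spark_default",
    )
    update_iceberg_tables >> retrain_eta_model
```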

5. Data Storage (Hadoop + Apache Iceberg):

  • HDFS: Raw data from Kafka is landed into HDFS, and processed data is stored there for long-term historical analysis.
  • Iceberg Tables: Transactional data tables managed by Iceberg provide consistent snapshots of large-scale datasets and support schema evolution for the raw Kafka data and the results of Spark jobs.
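
For instance, a raw-locations table could be declared as an Iceberg table on HDFS roughly like this; the catalog name, warehouse path, and schema are illustrative.

```python
# Hedged sketch: declaring a partitioned Iceberg table on HDFS via Spark SQL.
# Catalog name, warehouse path, and columns are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-ddl")
    .config("spark.sql.catalog.hadoop_cat", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hadoop_cat.type", "hadoop")
    .config("spark.sql.catalog.hadoop_cat.warehouse", "hdfs:///warehouse/iceberg")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS hadoop_cat.raw")
spark.sql("""
    CREATE TABLE IF NOT EXISTS hadoop_cat.raw.driver_locations (
        driver_id STRING,
        lat       DOUBLE,
        lon       DOUBLE,
        event_ts  TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")
```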

6. Interactive Querying (Presto):

  • Presto SQL Queries: Analysts use Presto to perform SQL queries on stored Hadoop data and Iceberg tables to gain insights or create real-time dashboards. Example queries (sketched after this list) might include:
    • SQL to join today’s real-time data with historical trends.
    • Aggregation queries to analyze delivery efficiency across different areas.
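
For example, an analyst might run something like the following through the Presto Python client (presto-python-client); the host, catalog, schema, and table names are placeholders.

```python
# Hedged sketch of an ad-hoc Presto query over an Iceberg-backed table.
# Host, catalog, schema, and table names are illustrative.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator", port=8080, user="analyst",
    catalog="iceberg", schema="analytics",
)
cur = conn.cursor()
cur.execute("""
    SELECT region,
           avg(delivery_minutes) AS avg_delivery_minutes,
           count(*)              AS deliveries
    FROM deliveries
    WHERE delivery_date >= DATE '2024-05-01'
    GROUP BY region
    ORDER BY avg_delivery_minutes
""")
for region, avg_minutes, n in cur.fetchall():
    print(f"{region}: {avg_minutes:.1f} min over {n} deliveries")
```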

7. Data Lifecycle and Schema Management:

  • Iceberg table partitioning is configured for efficient data retrieval, e.g., by time window or geographical region.
  • Data retention policies are configured in Kafka to govern how long data is kept, considering storage costs and compliance requirements.
  • Iceberg’s schema evolution allows for seamless transitions when altering data structures or incorporating new data sources.
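
As a rough illustration, both schema and partition evolution are plain DDL statements in Spark SQL; this sketch assumes the Iceberg SQL extensions are enabled and reuses the catalog layout from the earlier DDL sketch.

```python
# Hedged sketch of Iceberg schema and partition evolution via Spark SQL.
# Assumes the Iceberg Spark SQL extensions are enabled; names are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-evolution")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.hadoop_cat", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hadoop_cat.type", "hadoop")
    .config("spark.sql.catalog.hadoop_cat.warehouse", "hdfs:///warehouse/iceberg")
    .getOrCreate()
)

# Add a column without rewriting existing data files.
spark.sql("ALTER TABLE hadoop_cat.raw.driver_locations ADD COLUMN speed_kmh DOUBLE")

# Evolve the partition spec; only newly written data uses the new layout.
spark.sql("ALTER TABLE hadoop_cat.raw.driver_locations ADD PARTITION FIELD bucket(16, driver_id)")
```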

8. Advanced Machine Learning and Analytics (Spark + MLlib):

  • Historical data in Iceberg tables is used to train and refine predictive models for delivery ETA, taking advantage of features like lookback windows and seasonal trends (a condensed training sketch follows this list).
  • Airflow DAGs keep model training and evaluations consistent, reproducible, and on schedule.
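
A condensed sketch of such a training job is below; the feature columns, label, and table name are invented for the example.

```python
# Hedged sketch of training a delivery-ETA regression model with Spark MLlib.
# Feature columns, label, and table name are illustrative.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("eta-model-training").getOrCreate()
history = spark.read.table("analytics.deliveries_history")

assembler = VectorAssembler(
    inputCols=["distance_km", "hour_of_day", "day_of_week", "restaurant_prep_minutes"],
    outputCol="features",
)
gbt = GBTRegressor(featuresCol="features", labelCol="delivery_minutes")

train, test = history.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[assembler, gbt]).fit(train)

rmse = RegressionEvaluator(labelCol="delivery_minutes", metricName="rmse") \
    .evaluate(model.transform(test))
print(f"ETA model RMSE: {rmse:.1f} minutes")
```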

9. Monitoring and Governance:

  • Comprehensive logging is set up for all components, including Kafka brokers, Flink job managers, and Spark driver/executor processes.
  • A monitoring solution (such as Prometheus with Grafana) watches over the health and performance of the infrastructure, ensuring SLAs are met.
  • Data governance is enforced via policies and access controls for sensitive data in HDFS and Iceberg, audited by tools such as Apache Ranger.

The synergy in such a technical workflow allows an Uber Eats-like service to leverage strengths from each tool — Kafka for ingestion, Flink for real-time processing, Spark for batch processing and ML, Airflow for orchestration, Iceberg for managing large-scale data tables, and Presto for interactive querying. This enables real-time communication to users and drivers while also providing a platform for advanced analytics and strategic decision-making based on historical and current data.

Technical Detail Follow-ups

Kafka

  • Q: How do you ensure data consistency in your Kafka cluster?

    • A: Kafka ensures data consistency through replicated partitions: each partition has one leader and a set of in-sync replicas (ISRs). The leader handles all read and write requests for that partition, and the ISRs replicate the leader’s data. Consumers typically read from the leader, so they receive the most up-to-date committed records. Producer settings such as acks=all tighten the guarantee further (a small producer-side sketch follows these questions).
  • Q: Can you explain the role of Zookeeper in a Kafka ecosystem?

    • A: Zookeeper plays a critical role in Kafka’s architecture; it manages and coordinates Kafka brokers. It’s responsible for leader election for partitions, managing configuration information, and maintaining a list of Kafka brokers and topics, as well as their status.
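
On the producer side, durability can be tightened with acknowledgement settings, for example (kafka-python assumed; the broker address and topic are placeholders, and min.insync.replicas is a topic/broker-side setting):

```python
# Hedged sketch: a producer that waits for all in-sync replicas to acknowledge.
# Pairs with a topic-level min.insync.replicas setting configured on the brokers.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka-broker:9092",  # assumed broker
    acks="all",                             # leader plus all ISRs must confirm the write
    retries=5,
)
producer.send("order-events", b'{"order_id": "o-123", "status": "PLACED"}')
producer.flush()
```
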
Flink

  • Q: What advantages does Flink provide when dealing with stateful computations?

    • A: Flink offers robust state management and fault tolerance for stateful computations on streaming data. It periodically takes consistent snapshots of all state during an application’s execution, which can be restored after a failure. Flink’s checkpointing and savepoint features are key for ensuring exactly-once processing semantics.
  • Q: How does Flink handle event time and out-of-order events?

    • A: Flink features an event time concept that deals with the time at which events actually occurred, as opposed to when they are processed by the system. With watermarks, special markers that flow with the stream and track the progress of event time, Flink can handle out-of-order and late-arriving data.
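
Tying the two answers above together, a minimal PyFlink setup might enable checkpointing and declare a watermark strategy like this (the interval values are illustrative):

```python
# Hedged sketch: checkpointing for fault tolerance plus a bounded-out-of-orderness
# watermark strategy for event time. Intervals are illustrative.
from pyflink.common.time import Duration
from pyflink.common.watermark_strategy import WatermarkStrategy
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000)  # snapshot all operator state every 60 seconds

# Events may arrive up to 5 seconds late and still land in the right window.
watermarks = WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(5))
# A source stream would then be wired up with:
#   stream.assign_timestamps_and_watermarks(watermarks)
```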

Spark

  • Q: What is the difference between RDDs, DataFrames, and Datasets in Spark?

    • A: RDD (Resilient Distributed Dataset) is the fundamental data structure in Spark: an immutable, distributed collection of objects that can be processed in parallel. A DataFrame is a distributed collection of data organized into named columns, which provides a higher-level abstraction and optimization opportunities through the Catalyst optimizer. Datasets (available in Scala and Java) are a type-safe variant of DataFrames that let users work with strongly-typed JVM objects. (A short comparison sketch follows these questions.)
  • Q: How does Spark achieve fault tolerance?

    • A: Spark achieves fault tolerance through its immutable data structures and lineage graph. If a partition is lost, Spark can recompute it by replaying the transformations recorded in the RDD’s lineage. This significantly reduces the need for data replication for fault tolerance purposes.
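
To make the RDD vs. DataFrame distinction concrete, here is the same per-driver average computed at both levels; the column names are invented, and Datasets are Scala/Java-only so they are not shown.

```python
# Hedged sketch: the same aggregation expressed against an RDD and a DataFrame.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
rows = [("driver-1", 34.0), ("driver-1", 28.5), ("driver-2", 41.2)]

# RDD: schema-less tuples, aggregation logic written by hand.
rdd_avg = (
    spark.sparkContext.parallelize(rows)
    .mapValues(lambda minutes: (minutes, 1))
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    .mapValues(lambda s: s[0] / s[1])
)

# DataFrame: named columns with a schema, optimized by Catalyst.
df_avg = (
    spark.createDataFrame(rows, ["driver_id", "delivery_minutes"])
    .groupBy("driver_id")
    .agg(F.avg("delivery_minutes").alias("avg_minutes"))
)

print(sorted(rdd_avg.collect()))
df_avg.show()
```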

Hadoop

  • Q: What are some of the core components of a Hadoop ecosystem?

    • A: The core of Hadoop consists of HDFS (Hadoop Distributed File System) for reliable storage of massive datasets, YARN (Yet Another Resource Negotiator) for cluster resource management, and the MapReduce programming model for processing large datasets in parallel.
  • Q: How is data stored and processed in Hadoop?

    • A: In Hadoop, data is stored in HDFS, a distributed file system that spans the nodes of the cluster. Data is processed with MapReduce jobs, where data is first mapped, then sorted/shuffled, and finally reduced to produce the output; each of these steps runs in parallel across the cluster’s nodes.

Iceberg

  • Q: What is Apache Iceberg and how does it help with managing large data sets?

    • A: Apache Iceberg is a table format for large data sets that enables high-performance queries in distributed computing environments. It adds more structure and optimization to data storage, providing benefits like snapshot isolation, schema evolution, and efficient querying through hidden partitioning.
  • Q: Can Iceberg tables work with both batch and streaming data?

    • A: Yes, Iceberg tables are designed to support both batch and stream processing workloads. They can be read and written with traditional batch operations or incrementally by streaming workloads, providing flexibility and ensuring up-to-date views of the data.

Airflow

  • Q: What are the benefits of using Apache Airflow for workflow orchestration?

    • A: Apache Airflow provides a highly customizable platform for scheduling and coordinating complex data pipelines. It enables developers to define workflows as code, which allows for easy versioning, testing, and reusability. Its rich user interface aids in visualizing workflows, monitoring job progress, and diagnosing issues.
  • Q: How do you ensure that a task in Airflow is idempotent?

    • A: To make a task in Airflow idempotent (producing the same result no matter how many times it runs), structure it so the output depends only on the inputs and the logical execution date: for example, overwrite a date-keyed partition or table rather than appending to it, and keep any state in external systems rather than relying on prior runs. A small sketch of this pattern follows.
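
In the sketch below, the output location is derived from Airflow's logical date, so reruns overwrite rather than append; the paths and names are placeholders.

```python
# Hedged sketch of an idempotent daily task: the output location is derived from the
# logical date, so re-running the same date replaces, rather than duplicates, results.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def build_daily_metrics(ds: str, **_):
    # 'ds' is the DAG run's logical date (YYYY-MM-DD); a real task would overwrite
    # this (hypothetical) partition path with freshly computed metrics.
    output_path = f"hdfs:///metrics/daily/date={ds}"
    print(f"(re)building metrics at {output_path}")

with DAG(
    dag_id="idempotent_daily_metrics",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="build_daily_metrics", python_callable=build_daily_metrics)
```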

Presto

  • Q: Why would you use Presto over other SQL-on-Hadoop engines?

    • A: Presto is designed for low-latency queries and is generally faster than other SQL-on-Hadoop engines due to its distributed architecture and query execution strategies. Unlike others, Presto does not use MapReduce; it processes data using an in-memory pipeline execution model which is significantly faster.
  • Q: How does Presto interact with different data sources, such as Iceberg or Kafka?

    • A: Presto has a connector architecture that allows it to interact with various data sources. For Iceberg, the Presto Iceberg connector provides support for reading from and writing to Iceberg tables. For Kafka, the Presto Kafka connector allows Presto to query Kafka topics directly as if they were tables. This flexibility makes it easy to combine and query data across different storage systems.

Design Choice Follow-ups

Kafka

  • Q: Why Kafka for tracking driver locations in real time and not other systems like RabbitMQ or ActiveMQ?

    • A: Kafka is built for high throughput and is capable of handling the vast volume of location updates generated by drivers—potentially millions of messages per second. Unlike RabbitMQ or ActiveMQ, Kafka excels at handling large streams of data efficiently, making it ideal for real-time analytics use cases. Also, Kafka’s durable storage model allows us to reprocess historical data if needed.

    • Pros:
      - High throughput and scalability.
      - Built-in partitioning and replication.
      - Fault tolerance.

    • Cons:
      - Complexity in setup and management.
      - Broker-centric storage with less queuing flexibility.

    • Alternatives:
      - RabbitMQ or Pulsar for simpler setups or additional messaging features, but with potential scalability limitations.

  • Q: Why use Apache Flink for real-time processing of driver locations?

    • A: Flink provides low-latency, accurate stream processing, which is critical for the driver location updates to reach customers in a timely manner. It’s also good at handling stateful computations, like aggregating driver locations over windows of time. Apache Flink was chosen over alternatives like Apache Storm due to its greater throughput and ease of state management. However, we also considered Apache Kafka Streams for its tight integration with Kafka, but Flink offered more advanced windowing and state management features.

    • Pros:
      - High performance and low latency.
      - Advanced state management.
      - Effective event-time handling.

    • Cons:
      - Complexity for simpler stream processing needs.
      - Possible overkill for less demanding tasks.

    • Alternatives:
      - Apache Storm for competent stream processing or Kafka Streams for Kafka-centric applications.

Spark

  • Q: Why is Apache Spark preferred for batch processing and ML over other frameworks?

    • A: Spark provides a robust MLlib for machine learning, which we use to predict delivery times. It’s optimized for iterative algorithms, such as those used in machine learning, and its in-memory computing capabilities speed up processing. This makes it highly effective for our driver location data analysis. An alternative could be Hadoop MapReduce, but it’s a lot slower and less suited for complex analytics and ML tasks that Spark excels at.

    • Pros:
      - Fast performance with in-memory computing.
      - Comprehensive MLlib for machine learning tasks.
      - Unified APIs for diverse workloads.

    • Cons:
      - Resource-intensive, especially for memory.

    • Alternatives:
      - Hadoop MapReduce for cost-effective batch processing or cloud solutions like AWS EMR for managed services.

Hadoop (HDFS)

  • Q: Why prefer HDFS over cloud object storage like Amazon S3?

    • A: HDFS offers high throughput and is well-suited for the big data ecosystem, particularly with other Hadoop components. It integrates seamlessly with our data processing engines like Spark and our metadata management through Iceberg. While cloud object storage solutions are great for scalability and cost-effectiveness, HDFS gives us more control over our data locality and performance, which is important for our time-sensitive analytical workloads.

    • Pros:
      - Smooth integration with Hadoop-based tools.
      - High throughput and performance.

    • Cons:
      - Complexity in management and operations.
      - Less elasticity than cloud storage.

    • Alternatives:
      - Cloud solutions like Amazon S3 for scalability and simple operation.

Iceberg

  • Q: How does Apache Iceberg improve large data table management and what are other options?

    • A: Iceberg offers reliable, scalable management of big data tables, addressing problems that older table layouts struggle with, such as concurrency, versioning, and schema evolution. Iceberg integrates well with Spark and Flink, allowing us to easily manage table formats on HDFS. One alternative is Apache Hive, but it has limitations with metadata scaling that Iceberg was designed to overcome.

    • Pros:
      - Efficient metadata management.
      - Seamless schema evolution.
      - Performance optimization through hidden partitioning.

    • Cons:
      - Less mature compared to formats like Parquet.

    • Alternatives:
      - Apache Hive for a traditional data warehouse approach, or Delta Lake for ACID transactions and building data pipelines.

Airflow

  • Q: Why Airflow over other workflow tools?

    • A: Airflow excels at defining, scheduling, and monitoring complex data pipelines. Its ability to code workflows as Directed Acyclic Graphs (DAGs) offers the flexibility and transparency we need. Moreover, Airflow’s user interface and community support are excellent. As alternatives, we considered Luigi and Apache NiFi, but Airflow’s integration with our data stack and its intuitive UI made it the ideal choice for us.

    • Pros:
      - Flexibility with a Python-based programming model.
      - Rich UI for management and monitoring.
      - Broad integration with data processing tools.

    • Cons:
      - Potential resource intensity for simple needs.
      - Learning curve for developers unfamiliar with Python.

    • Alternatives:
      - Luigi for simpler Python-centric workflows or Apache NiFi for a more visual-centric approach.

Presto

  • Q: Why Presto for interactive querying against other technologies?

    • A: Presto’s distributed in-memory architecture enables us to perform high-speed, ad-hoc queries directly on our data lake without data movement. It’s flexible and supports a wide range of data sources, maintaining high performance for interactive analytics. We considered Apache Drill, but Presto’s SQL compatibility and performance made it the better fit for our diverse data requirements.

    • Pros:
      - Fast querying over various data sources.
      - Strong SQL support for diverse BI tools.

    • Cons:
      - Costly for large datasets due to in-memory processing.
      - Suboptimal for extended, complex queries.

    • Alternatives:
      - Apache Drill for similar querying capabilities or managed cloud services like Amazon Athena for less operational involvement.