
Apache Airflow 101

Zhe-You Liu · 8 mins · Data Engineering

Focused on contributing to open source, distributed systems, and data engineering.

What is Apache Airflow?

Apache Airflow – A platform to programmatically author, schedule, and monitor workflows

airflow-logo

Apache Airflow is an open-source workflow management platform
that enables developers to define data pipelines using the Workflow as Code paradigm.

In Airflow, an entire workflow is called a DAG
which stands for Directed Acyclic Graph.

A DAG consists of multiple Tasks,
and these tasks can have arbitrary dependencies between them,
as long as the dependencies do not form a cycle.

Airflow supports various scheduling strategies
and allows developers to define complex DAGs. It also provides a comprehensive UI for monitoring and managing workflows.
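As a quick illustration of the Workflow as Code idea, here is a minimal sketch of a DAG with two dependent tasks, assuming Airflow 2.4+ and the TaskFlow API; the DAG ID, schedule, and task logic are made up for illustration:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="example_etl",            # hypothetical DAG ID
    schedule="@daily",               # run once per day
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
def example_etl():
    @task
    def extract():
        # Placeholder for pulling data from a source system
        return [1, 2, 3]

    @task
    def load(rows):
        # Placeholder for writing data to a destination
        print(f"loaded {len(rows)} rows")

    # Task dependency: extract runs before load
    load(extract())


example_etl()
```

The DAG Processor picks this file up, renders it into a DAG of two tasks, and the Scheduler then runs it once per day.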

Airflow is one of the most popular workflow orchestration tools today:

  • Over 40,000 stars on GitHub
  • The 5th largest project under the Apache Software Foundation
  • Used by more than 20,000 companies worldwide (according to HG Insights)

Key Features of Apache Airflow

The Role of Apache Airflow in a Data Pipeline

Airflow is often described as a Data Orchestration Tool.

In the context of a data platform architecture,
Airflow functions as the orchestration layer—essentially the brain of the entire platform.
It is responsible for precisely scheduling, coordinating, logging, and monitoring the status of every task within each workflow.

orchestration-layer

Reference from chaossearch.io / cloud-data-platform-architecture-guide

In large-scale data platforms,
Airflow typically sits at the top layer, orchestrating every step from the data source all the way to the final data product delivered to end users.
However, each individual task is usually executed by specialized tools dedicated to specific purposes.

For example:

  • An Airflow DAG might be responsible for periodically initiating data ingestion, then running ETL processes, and finally landing the data into the appropriate tables or views based on its type.
    • Within this process:
      • Data ingestion might be handled by tools like Airbyte
      • ETL might be processed by Spark, Flink, DuckDB, or other compute engines
    • Airflow’s focus is solely on orchestration (a sketch follows below)

    For small to medium-sized data platforms,
    using Airflow with LocalExecutor or CeleryExecutor is often more than sufficient!
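To make the orchestration-only role above concrete, here is a minimal sketch of a DAG that only triggers external systems instead of processing data itself. It assumes the Airbyte and Spark provider packages are installed; the connection IDs, the Airbyte connection UUID, and the Spark application path are all hypothetical:

```python
from datetime import datetime

from airflow.decorators import dag
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator


@dag(
    dag_id="ingest_then_transform",   # hypothetical DAG ID
    schedule="@hourly",
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
def ingest_then_transform():
    # Ask Airbyte to run an existing sync; Airflow only orchestrates it
    ingest = AirbyteTriggerSyncOperator(
        task_id="airbyte_sync",
        airbyte_conn_id="airbyte_default",                       # hypothetical connection
        connection_id="00000000-0000-0000-0000-000000000000",    # hypothetical sync UUID
    )

    # Hand the heavy lifting to a Spark job; Airflow just submits and monitors it
    transform = SparkSubmitOperator(
        task_id="spark_etl",
        conn_id="spark_default",                                 # hypothetical connection
        application="/opt/jobs/etl_job.py",                      # hypothetical Spark application
    )

    ingest >> transform


ingest_then_transform()
```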

The relationship between Airflow and other common components of a data pipeline is illustrated below:

Airflow in Data Pipeline

Reference from ml4devs.com / scalable-efficient-big-data-analytics-machine-learning-pipeline-architecture-on-cloud

What Problems Does Airflow Solve?

Based on my experience working in data teams,
Airflow addresses several key challenges:

  • Workflow Observability

    • The Grid View (DagRun/TaskInstance records) can act as a real-time metric dashboard for pipeline execution.
    • Examples:
        1. If the UI shows (or you receive alerts about) frequent failures in a specific task like xxx:
           • This could indicate an issue with the component or service responsible for that task.
        2. If recent DAG runs are taking longer to complete:
           • You can investigate specific DagRuns and TaskInstances to identify performance issues,
             such as increased data volume or a performance degradation in one of the tasks.
  • Retryability of Workflows

    • DAGs or individual tasks can be manually retried directly from the UI.
    • This reduces operational overhead.
    • Common scenarios:
        1. Downstream services occasionally experience transient failures:
           • Operations can simply review TaskInstance logs and retry the affected tasks or entire DAGs—all from the UI.
        2. Upstream services produce incorrect data, requiring reprocessing of X days of data:
           • You can clear the affected DagRuns (or trigger a backfill) for that date range and let Airflow re-run them automatically.
  • Dynamic DAG Generation Based on Configuration

    • DAGs can be dynamically created based on configuration within the DAG code itself.
    • Example:
      • Use a list of current customer IDs from a config file to dynamically generate DAGs like client_<client_id>_dag.
      • This enables dynamic generation and updating of DAGs (see the sketch after this list).
  • Automated Onboarding

    • While backfilling might not be necessary for every data team, it is a common use case in onboarding scenarios:
      • Example:
        • A @daily DAG generates a daily snapshot view.
        • When a new client is onboarded, setting up catchup or triggering a backfill ensures their historical data is automatically processed.
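A minimal sketch of how the retry, dynamic DAG generation, and catchup points above could look in DAG code; the client list, DAG IDs, and task logic are hypothetical:

```python
from datetime import datetime, timedelta

from airflow.decorators import dag, task

# Hypothetical config; in practice this might come from a YAML file, a database, or an API.
CLIENT_IDS = ["acme", "globex", "initech"]


def make_client_dag(client: str):
    @dag(
        dag_id=f"client_{client}_dag",     # e.g. client_acme_dag
        schedule="@daily",
        start_date=datetime(2024, 1, 1),
        catchup=True,                      # backfill historical runs automatically when onboarding
        default_args={
            "retries": 3,                  # retry transient downstream failures
            "retry_delay": timedelta(minutes=5),
        },
    )
    def client_snapshot():
        @task
        def build_daily_snapshot():
            # Placeholder for building this client's daily snapshot view
            print(f"building snapshot for {client}")

        build_daily_snapshot()

    return client_snapshot()


# One DAG per client; bind each to a module-level name so the DAG Processor picks it up.
for client_id in CLIENT_IDS:
    globals()[f"client_{client_id}_dag"] = make_client_dag(client_id)
```

When a new client ID is added to the config, a new DAG appears on the next parse, and catchup=True makes Airflow schedule all DagRuns since start_date, which covers the onboarding backfill case.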

In summary,
using Airflow can significantly reduce operational workload and streamline workflow management.

Airflow Architecture

airflow-architecture

Reference from airflow.apache.org / 2.10.5/core-concepts

  • Meta Database

    • All DAG and Task states are stored in the Meta Database.
    • It is recommended to use a connection pool proxy like PgBouncer,
      as nearly all components access the Meta Database constantly.
  • Scheduler

    • Responsible for monitoring all tasks and DAGs:
      • Checks if any DAG needs to trigger a new DagRun.
      • Checks within each DagRun if any TaskInstances or the entire DAG need to be scheduled.
      • Selects TaskInstances to be scheduled, and adds them to the execution queue considering execution pools and concurrency limits.
    • See more about how to fine-tune the Scheduler.
  • Worker

    • Executes the callable for each TaskInstance.
  • Triggerer

    • Used for Deferrable Tasks (see the sketch after this list).
      • Use case: tasks that wait for an external system to change state or need to be delayed for a long time.
      • The triggerer runs the tasks' triggers as asyncio coroutines, performing the polling in the background.
  • DAG Processor (DAG Parser)

    • Periodically parses DAG files (Python scripts) and stores them as Dag records in the Meta Database.
  • Web Server (API Server)

    • Provides the UI and REST API for interacting with Airflow.
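To make the Triggerer's role concrete, here is a minimal sketch of a deferrable wait using the built-in TimeDeltaSensorAsync, assuming Airflow 2.2+ with a triggerer process running; the DAG ID and downstream task are hypothetical:

```python
from datetime import datetime, timedelta

from airflow.decorators import dag, task
from airflow.sensors.time_delta import TimeDeltaSensorAsync


@dag(
    dag_id="deferrable_wait_example",    # hypothetical DAG ID
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
def deferrable_wait_example():
    # While waiting, the task is deferred: the worker slot is released and the
    # triggerer runs an asyncio trigger until the target time is reached.
    wait_an_hour = TimeDeltaSensorAsync(
        task_id="wait_an_hour",
        delta=timedelta(hours=1),        # wait one hour past the end of the data interval
    )

    @task
    def downstream():
        # Placeholder for whatever should run after the wait
        print("done waiting")

    wait_an_hour >> downstream()


deferrable_wait_example()
```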

Common Use Cases for Airflow

Ideal Scenarios for Using Airflow

  • Recurring Workflows

    • Workflows that must run at regular time intervals.
    • For example: every few hours, days, or weeks.
  • Workflows Split into Retryable Tasks

    • Suitable when a workflow involves multiple services and each task may need to be retried manually.
    • Example: a multi-step ETL process spanning several services.
  • Workflows with Complex Task Dependencies

    • Where task execution order depends on certain conditions or outcomes (see the sketch after this list).
  • Workflows that Expand Dynamically Based on External State

    • Dynamic DAG rendering based on configuration, metadata, or queries.
    • Example: DAGs and tasks are determined based on dynamic configs or query results.
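For the complex-dependencies case, here is a minimal sketch of conditional execution with the TaskFlow branch decorator (Airflow 2.3+); the task names, threshold, and row counts are made up:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="branching_example",          # hypothetical DAG ID
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
def branching_example():
    @task
    def count_new_rows() -> int:
        # Placeholder: in reality this might query a source table
        return 42

    @task.branch
    def choose_path(row_count: int) -> str:
        # Return the task_id of the branch that should run
        return "full_refresh" if row_count > 1000 else "incremental_load"

    @task
    def full_refresh():
        print("rebuilding the whole table")

    @task
    def incremental_load():
        print("loading only the new rows")

    choose_path(count_new_rows()) >> [full_refresh(), incremental_load()]


branching_example()
```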

Real-World Use Cases of Airflow

  • Data Pipelines / ETL / Business Intelligence
  • Infrastructure Automation
  • MLOps (Machine Learning Operations)

These use cases align closely with the ideal scenarios outlined above.

For more practical examples, refer to astronomer.io / use-cases

Limitations of Airflow

Based on the Airflow architecture diagram above, we can already infer the following:

The Meta Database I/O is the bottleneck of Airflow!

Every component constantly reads from and writes to the Meta Database.
For example: each DagRun creates a new record; each TaskInstance creates another.
These are continuously updated based on their execution status.

Therefore, Airflow is not suitable for:

  • Low-latency, event-driven workflows

    • Here, “low-latency” refers to millisecond-level execution.
        1. As mentioned in the architecture, even with the Triggerer component, the internal implementation polls the external system every few seconds.
        2. Even if you use the REST API to trigger a DAG, the run still has to go through the Scheduler before being picked up by a Worker (see the sketch after this list).
  • Consumers processing thousands to tens of thousands of messages per second
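As an illustration of point 2 above, even an externally triggered DagRun only lands in the queued state and still waits for the Scheduler. A minimal sketch against the Airflow 2.x stable REST API, assuming the basic-auth API backend is enabled; the host, DAG ID, and credentials are hypothetical:

```python
import requests

# POST /api/v1/dags/{dag_id}/dagRuns creates a new DagRun (Airflow 2.x stable REST API).
response = requests.post(
    "http://localhost:8080/api/v1/dags/example_etl/dagRuns",   # hypothetical host and DAG ID
    auth=("admin", "admin"),                                   # hypothetical credentials
    json={"conf": {"triggered_by": "external-system"}},
    timeout=10,
)
response.raise_for_status()

# The run starts out "queued"; the Scheduler still has to schedule it and a Worker
# still has to pick it up, so end-to-end latency is seconds at best, not milliseconds.
print(response.json()["state"])
```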

Currently, the maximum throughput of a single Airflow cluster is on the order of ten to fifteen DAG runs per second.

Based on:
Airflow Summit 2024: How we Tuned our Airflow to Make 1.2 Million DAG Runs per Day
Calculation: 1,200,000 / 86,400 ≈ 13.9 DAG runs per second
Even with fine-tuned Airflow configurations,
database I/O bottlenecks still cap throughput at roughly this level.

Anti-pattern: Using a single Airflow cluster as a consumer downstream of RabbitMQ or Kafka.
A single Airflow cluster is not suitable for handling thousands or tens of thousands of messages per second.

The emphasis on single cluster is important because:
It’s possible to leverage the concept of partitions and use multiple Airflow clusters to handle different Kafka topics.

For example:
airflow-cluster-a handles all messages from topic-group-a-*
airflow-cluster-b handles all messages from topic-group-b-*

By partitioning the workload, we can achieve Kafka-level message throughput
while preserving the benefits of easy monitoring and manual retries at the DAG or Task level.

Conclusion

Through this article, you should now have a solid understanding of Apache Airflow’s overall architecture and its suitable use cases.

  • For small to medium-sized data pipelines:

    • Running Airflow with the LocalExecutor or CeleryExecutor is often more than sufficient.
  • For large-scale data pipelines:

    • Let Airflow focus solely on its role as the orchestration layer.
      • Delegate actual task execution to specialized tools for each domain.
    • Fine-tune the Scheduler and Meta Database settings for performance and scalability.
  • Leverage Airflow’s built-in strengths to build a flexible, observable, and retryable data pipeline.

At the same time, it’s crucial to understand Airflow’s limitations:

For scenarios requiring millisecond-level latency or message throughput of thousands to tens of thousands of messages per second,
Apache Airflow may not be the best fit.
