
Apache Airflow 101

Zhe-You Liu · 8 mins · Data Engineering

Focused on contributing to open source, distributed systems, and data engineering.

What is Apache Airflow?

Apache Airflow – A platform to programmatically author, schedule, and monitor workflows

airflow-logo

Apache Airflow is an open-source workflow management platform
that enables developers to define data pipelines using the Workflow as Code paradigm.

In Airflow, an entire workflow is called a DAG
which stands for Directed Acyclic Graph.

A DAG consists of multiple Tasks,
and these tasks can have arbitrary dependencies between them,
as long as the dependencies do not form a cycle.

Airflow supports various scheduling strategies
and allows developers to define complex DAGs. It also provides a comprehensive UI for monitoring and managing workflows.
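As a quick illustration of the Workflow as Code idea, here is a minimal sketch of a DAG with two dependent tasks, assuming Airflow 2.4+ and the TaskFlow API; the DAG ID, schedule, and task logic are made up for illustration:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="example_etl",            # hypothetical DAG ID
    schedule="@daily",               # run once per day
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
def example_etl():
    @task
    def extract():
        # Placeholder for pulling data from a source system
        return [1, 2, 3]

    @task
    def load(rows):
        # Placeholder for writing data to a destination
        print(f"loaded {len(rows)} rows")

    # Task dependency: extract runs before load
    load(extract())


example_etl()
```

The DAG Processor picks this file up, renders it into a DAG of two tasks, and the Scheduler then runs it once per day.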

Airflow is one of the most popular workflow orchestration tools today:

  • Over 40,000 stars on GitHub
  • The 5th largest project under the Apache Software Foundation
  • Used by more than 20,000 companies worldwide (according to HG Insights)

Key Features of Apache Airflow

The Role of Apache Airflow in a Data Pipeline

Airflow is often described as a Data Orchestration Tool.

In the context of a data platform architecture,
Airflow functions as the orchestration layer—essentially the brain of the entire platform.
It is responsible for precisely scheduling, coordinating, logging, and monitoring the status of every task within each workflow.

orchestration-layer

Reference from chaossearch.io / cloud-data-platform-architecture-guide

In large-scale data platforms,
Airflow typically sits at the top layer, orchestrating every step from the data source all the way to the final data product delivered to end users.
However, each individual task is usually executed by specialized tools dedicated to specific purposes.

For example:

  • An Airflow DAG might be responsible for periodically initiating data ingestion, then running ETL processes, and finally landing the data into the appropriate tables or views based on its type.
    • Within this process:
      • Data ingestion might be handled by tools like Airbyte
      • ETL might be processed by Spark, Flink, DuckDB, or other compute engines
    • Airflow’s focus is solely on orchestration (a sketch follows below)

    For small to medium-sized data platforms,
    using Airflow with LocalExecutor or CeleryExecutor is often more than sufficient!
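To make the orchestration-only role above concrete, here is a minimal sketch of a DAG that only triggers external systems instead of processing data itself. It assumes the Airbyte and Spark provider packages are installed; the connection IDs, the Airbyte connection UUID, and the Spark application path are all hypothetical:

```python
from datetime import datetime

from airflow.decorators import dag
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator


@dag(
    dag_id="ingest_then_transform",   # hypothetical DAG ID
    schedule="@hourly",
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
def ingest_then_transform():
    # Ask Airbyte to run an existing sync; Airflow only orchestrates it
    ingest = AirbyteTriggerSyncOperator(
        task_id="airbyte_sync",
        airbyte_conn_id="airbyte_default",                       # hypothetical connection
        connection_id="00000000-0000-0000-0000-000000000000",    # hypothetical sync UUID
    )

    # Hand the heavy lifting to a Spark job; Airflow just submits and monitors it
    transform = SparkSubmitOperator(
        task_id="spark_etl",
        conn_id="spark_default",                                 # hypothetical connection
        application="/opt/jobs/etl_job.py",                      # hypothetical Spark application
    )

    ingest >> transform


ingest_then_transform()
```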

The relationship between Airflow and other common components of a data pipeline is illustrated below:

Airflow in Data Pipeline

Reference from ml4devs.com / scalable-efficient-big-data-analytics-machine-learning-pipeline-architecture-on-cloud

What Problems Does Airflow Solve?

Based on my experience working in data teams,
Airflow addresses several key challenges:

  • Workflow Observability

    • The Grid View (DagRun/TaskInstance records) can act as a real-time metric dashboard for pipeline execution.
    • Examples:
        1. If the UI shows (or you receive alerts about) frequent failures in a specific task like xxx:
           • This could indicate an issue with the component or service responsible for that task.
        2. If recent DAG runs are taking longer to complete:
           • You can investigate specific DagRuns and TaskInstances to identify performance issues,
             such as increased data volume or a performance degradation in one of the tasks.
  • Retryability of Workflows

    • DAGs or individual tasks can be manually retried directly from the UI.
    • This reduces operational overhead.
    • Common scenarios:
        1. Downstream services occasionally experience transient failures:
           • Operations can simply review TaskInstance logs and retry the affected tasks or entire DAGs—all from the UI.
        2. Upstream services produce incorrect data, requiring reprocessing of X days of data:
           • You can clear the affected DagRuns (or trigger a backfill) for that date range and let Airflow re-run them automatically.
  • Dynamic DAG Generation Based on Configuration

    • DAGs can be dynamically created based on configuration within the DAG code itself.
    • Example:
      • Use a list of current customer IDs from a config file to dynamically generate DAGs like client_<client_id>_dag.
      • This enables dynamic generation and updating of DAGs (see the sketch after this list).
  • Automated Onboarding

    • While backfilling might not be necessary for every data team, it is a common use case in onboarding scenarios:
      • Example:
        • A @daily DAG generates a daily snapshot view.
        • When a new client is onboarded, setting up catchup or triggering a backfill ensures their historical data is automatically processed.
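A minimal sketch of how the retry, dynamic DAG generation, and catchup points above could look in DAG code; the client list, DAG IDs, and task logic are hypothetical:

```python
from datetime import datetime, timedelta

from airflow.decorators import dag, task

# Hypothetical config; in practice this might come from a YAML file, a database, or an API.
CLIENT_IDS = ["acme", "globex", "initech"]


def make_client_dag(client: str):
    @dag(
        dag_id=f"client_{client}_dag",     # e.g. client_acme_dag
        schedule="@daily",
        start_date=datetime(2024, 1, 1),
        catchup=True,                      # backfill historical runs automatically when onboarding
        default_args={
            "retries": 3,                  # retry transient downstream failures
            "retry_delay": timedelta(minutes=5),
        },
    )
    def client_snapshot():
        @task
        def build_daily_snapshot():
            # Placeholder for building this client's daily snapshot view
            print(f"building snapshot for {client}")

        build_daily_snapshot()

    return client_snapshot()


# One DAG per client; bind each to a module-level name so the DAG Processor picks it up.
for client_id in CLIENT_IDS:
    globals()[f"client_{client_id}_dag"] = make_client_dag(client_id)
```

When a new client ID is added to the config, a new DAG appears on the next parse, and catchup=True makes Airflow schedule all DagRuns since start_date, which covers the onboarding backfill case.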

In summary,
using Airflow can significantly reduce operational workload and streamline workflow management.

Airflow Architecture

airflow-architecture

Reference from airflow.apache.org / 2.10.5/core-concepts

  • Meta Database

    • All DAG and Task states are stored in the Meta Database.
    • It is recommended to use a connection pool proxy like PgBouncer,
      as nearly all components access the Meta Database constantly.
  • Scheduler

    • Responsible for monitoring all tasks and DAGs:
      • Checks if any DAG needs to trigger a new DagRun.
      • Checks within each DagRun if any TaskInstances or the entire DAG need to be scheduled.
      • Selects TaskInstances to be scheduled, and adds them to the execution queue considering execution pools and concurrency limits.
    • See more about how to fine-tune the Scheduler.
  • Worker

    • Executes the callable for each TaskInstance.
  • Triggerer

    • Used for Deferrable Tasks (see the sketch after this list).
      • Use case: tasks that wait for an external system to change state or need to be delayed for a long time.
      • The triggerer runs the tasks' triggers as asyncio coroutines, performing the polling in the background.
  • DAG Processor (DAG Parser)

    • Periodically parses DAG files (Python scripts) and stores them as Dag records in the Meta Database.
  • Web Server (API Server)

    • Provides the UI and REST API for interacting with Airflow.
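To make the Triggerer's role concrete, here is a minimal sketch of a deferrable wait using the built-in TimeDeltaSensorAsync, assuming Airflow 2.2+ with a triggerer process running; the DAG ID and downstream task are hypothetical:

```python
from datetime import datetime, timedelta

from airflow.decorators import dag, task
from airflow.sensors.time_delta import TimeDeltaSensorAsync


@dag(
    dag_id="deferrable_wait_example",    # hypothetical DAG ID
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
def deferrable_wait_example():
    # While waiting, the task is deferred: the worker slot is released and the
    # triggerer runs an asyncio trigger until the target time is reached.
    wait_an_hour = TimeDeltaSensorAsync(
        task_id="wait_an_hour",
        delta=timedelta(hours=1),        # wait one hour past the end of the data interval
    )

    @task
    def downstream():
        # Placeholder for whatever should run after the wait
        print("done waiting")

    wait_an_hour >> downstream()


deferrable_wait_example()
```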

Common Use Cases for Airflow

Ideal Scenarios for Using Airflow

  • Recurring Workflows

    • Workflows that must run at regular time intervals.
    • For example: every few hours, days, or weeks.
  • Workflows Split into Retryable Tasks

    • Suitable when a workflow involves multiple services and each task may need to be retried manually.
    • Example: a multi-step ETL process spanning several services.
  • Workflows with Complex Task Dependencies

    • Where task execution order depends on certain conditions or outcomes (see the sketch after this list).
  • Workflows that Expand Dynamically Based on External State

    • Dynamic DAG rendering based on configuration, metadata, or queries.
    • Example: DAGs and tasks are determined based on dynamic configs or query results.
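For the complex-dependencies case, here is a minimal sketch of conditional execution with the TaskFlow branch decorator (Airflow 2.3+); the task names, threshold, and row counts are made up:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="branching_example",          # hypothetical DAG ID
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
def branching_example():
    @task
    def count_new_rows() -> int:
        # Placeholder: in reality this might query a source table
        return 42

    @task.branch
    def choose_path(row_count: int) -> str:
        # Return the task_id of the branch that should run
        return "full_refresh" if row_count > 1000 else "incremental_load"

    @task
    def full_refresh():
        print("rebuilding the whole table")

    @task
    def incremental_load():
        print("loading only the new rows")

    choose_path(count_new_rows()) >> [full_refresh(), incremental_load()]


branching_example()
```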

Real-World Use Cases of Airflow

  • Data Pipelines / ETL / Business Intelligence
  • Infrastructure Automation
  • MLOps (Machine Learning Operations)

These use cases align closely with the ideal scenarios outlined above.

For more practical examples, refer to astronomer.io / use-cases

Limitations of Airflow

Based on the Airflow architecture diagram above, we can already infer the following:

The Meta Database I/O is the bottleneck of Airflow!

Every component constantly reads from and writes to the Meta Database.
For example: each DagRun creates a new record; each TaskInstance creates another.
These are continuously updated based on their execution status.

Therefore, Airflow is not suitable for:

  • Low-latency, event-driven workflows

    • Here, “low-latency” refers to millisecond-level execution.
        1. As mentioned in the architecture, even with the Triggerer component, the internal implementation polls the external system every few seconds.
        2. Even if you use the REST API to trigger a DAG, the run still has to go through the Scheduler before being picked up by a Worker (see the sketch after this list).
  • Consumers processing thousands to tens of thousands of messages per second
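As an illustration of point 2 above, even an externally triggered DagRun only lands in the queued state and still waits for the Scheduler. A minimal sketch against the Airflow 2.x stable REST API, assuming the basic-auth API backend is enabled; the host, DAG ID, and credentials are hypothetical:

```python
import requests

# POST /api/v1/dags/{dag_id}/dagRuns creates a new DagRun (Airflow 2.x stable REST API).
response = requests.post(
    "http://localhost:8080/api/v1/dags/example_etl/dagRuns",   # hypothetical host and DAG ID
    auth=("admin", "admin"),                                   # hypothetical credentials
    json={"conf": {"triggered_by": "external-system"}},
    timeout=10,
)
response.raise_for_status()

# The run starts out "queued"; the Scheduler still has to schedule it and a Worker
# still has to pick it up, so end-to-end latency is seconds at best, not milliseconds.
print(response.json()["state"])
```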

Currently, the maximum throughput of a single Airflow cluster is on the order of ten to fifteen DAG runs per second.

Based on:
Airflow Summit 2024: How we Tuned our Airflow to Make 1.2 Million DAG Runs per Day
Calculation: 1,200,000 / 86,400 ≈ 13.9 DAG runs per second
Even with fine-tuned Airflow configurations,
database I/O bottlenecks still cap throughput at roughly this level.

Anti-pattern: Using a single Airflow cluster as a consumer downstream of RabbitMQ or Kafka.
A single Airflow cluster is not suitable for handling thousands or tens of thousands of messages per second.

The emphasis on single cluster is important because:
It’s possible to leverage the concept of partitions and use multiple Airflow clusters to handle different Kafka topics.

For example:
airflow-cluster-a handles all messages from topic-group-a-*
airflow-cluster-b handles all messages from topic-group-b-*

By partitioning the workload, we can achieve Kafka-level message throughput
while preserving the benefits of easy monitoring and manual retries at the DAG or Task level.

Conclusion

Through this article, you should now have a solid understanding of Apache Airflow’s overall architecture and its suitable use cases.

  • For small to medium-sized data pipelines:

    • Running Airflow with the LocalExecutor or CeleryExecutor is often more than sufficient.
  • For large-scale data pipelines:

    • Let Airflow focus solely on its role as the orchestration layer.
      • Delegate actual task execution to specialized tools for each domain.
    • Fine-tune the Scheduler and Meta Database settings for performance and scalability.
  • Leverage Airflow’s built-in strengths to build a flexible, observable, and retryable data pipeline.

At the same time, it’s crucial to understand Airflow’s limitations:

For scenarios requiring millisecond-level latency or message throughput of thousands to tens of thousands of messages per second,
Apache Airflow may not be the best fit.
