Contributing to Apache Airflow from 0

Liu Zhe You · 7 min read
Focused on contributing to open source, distributed systems, and data engineering.


Why Choose Apache Airflow?

I wanted to start contributing to a Top-Level Project from the Apache Software Foundation, and Apache Airflow stood out: it has 38.6k stars on GitHub, our own Data Team regards it as a crucial Data Engineering tool, and it's written in Python, the language I'm most familiar with.

Background

I’m Liu Zhe You (Jason), a CSIE junior from Taiwan 🇹🇼 with a passion for OSS contributions and an interest in backend distributed systems.

Before I actually started contributing to Apache Airflow, I had only interned in a Data Engineering–related department for just over 3 months.
I didn’t even get a chance to write a DAG; I was mainly handling general backend tasks.

Contribution Statistics

Let me first share the current contribution statistics:

Total PR Count: 50+

Link to Total Merged PR

GitHub Contribution Ranking (Since the Project’s Inception): Rank 72

Link to Contribution Graph on GitHub

OSS Rank Contribution Ranking (Weighted by Recent Contributions): Rank 29

Link to OSS Rank

First PR

I officially started contributing to Apache Airflow in early October 2024, after noticing the issue Fix PythonOperator DAG error when DAG has hyphen in name.

It was marked as a good first issue, so I traced the problem and found that it seemed to require only a one-line change in the code. I decided to give it a try.

Open Source For You

Open Source For You is an organization in Taiwan dedicated to actively contributing to open source.

Here’s a more in-depth introduction to Open Source For You through the Kafka Community Spotlight: TAIWAN 🇹🇼 by Stanislav’s Big Data Stream. In addition to #kafka, our community also includes #airflow.

Since the first issue was DAG-related, I needed to reproduce it inside the Breeze container, but I ran into problems following the steps in the Breeze Container documentation.
So I reached out to committer @Lee-W for help.

I guess you could say I became Lee-W's mentee (lol). From then on, whenever I ran into problems, needed a PR review, or needed labels applied, I'd ask for his help!

Lee-W’s blog: Contributing to Airflow 101: Sort of a Mentor(?), I Guess…

First PR Merged

I submitted my first Apache Airflow PR: Fix PythonOperator DAG error when DAG has hyphen in name #42902.
Interestingly, a colleague from another department, @josix, whom I hadn't met before, helped review it!

Although I only changed one line of code, the PR involved over 20 comments exchanged during the iterative review process.
It made me realize that open source contribution isn’t as simple as just changing one line of code.

This was especially true for the unit tests: having mostly written integration tests before, I wasn't very experienced with mocking.
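
For anyone in the same spot, here's a generic sketch of the mock.patch pattern such unit tests rely on (not the actual PR's test; all names here are made up):

from unittest import mock


def send_alert(message: str) -> None:
    """Hypothetical collaborator that would hit an external service."""
    raise RuntimeError("network call; should be mocked in unit tests")


def process(value: int) -> int:
    """Hypothetical unit under test."""
    if value < 0:
        send_alert(f"negative value: {value}")
        return 0
    return value * 2


def test_process_alerts_on_negative():
    # Patch the collaborator so the test never leaves the process.
    with mock.patch(f"{__name__}.send_alert") as mock_alert:
        assert process(-1) == 0
        mock_alert.assert_called_once_with("negative value: -1")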

How I Manage Tasks

Initially, I used HackMD to simply record potential issues to investigate in Markdown.


Issue list recorded using HackMD.

Now, I manage tasks using the Kanban board in GitHub Projects, since I often work on 2–3 issues at once.
Some tasks are under development, some await code review, and others spotted in the Issue List go into the Backlog.


PR list managed via GitHub Projects.

First 50 PRs

The AIP-XX references below each point to a proposal among the Airflow Improvement Proposals (AIPs).

AIP-84: Modern REST API

Back in October last year, many AIP-84 issues were opened, primarily aimed at migrating the legacy API (written in Flask) to a FastAPI-based API.
Since I was most familiar with FastAPI at the time, I ended up taking on nearly 10 API migrations.

During these API migrations, I learned quite a bit about Airflow’s architecture and became acquainted with commonly used pytest fixtures for testing, such as dag_maker, dag_bag, create_dag_run, create_task_instances, and more.
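
For example, dag_maker lets a test declare a DAG inline and register it for the test. A minimal sketch of how these fixtures are typically used inside Airflow's test suite (simplified; fixture details and import paths vary across Airflow versions):

from airflow.operators.empty import EmptyOperator


def test_dagrun_is_created(dag_maker, session):
    # dag_maker builds the DAG and syncs it to the test database.
    with dag_maker(dag_id="example_dag", session=session):
        EmptyOperator(task_id="noop")

    # create_dagrun() returns a DagRun bound to the same test session.
    dag_run = dag_maker.create_dagrun()
    assert dag_run.dag_id == "example_dag"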

Refactoring the Parameter System

Context
Under the FastAPI framework, each filter (more precisely, each query parameter) inherits from BaseParam.
When there are many filters in an API, using the BaseParam architecture helps keep the router layer clean.

The definition of BaseParam is as follows:

# Imports added here for context; in Airflow this class lives in the
# common/parameters.py module.
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Any, Generic, TypeVar

from sqlalchemy import ColumnElement, Select
from typing_extensions import Self

T = TypeVar("T")


class BaseParam(Generic[T], ABC):
    """Base class for filters."""

    def __init__(self, value: T | None = None, skip_none: bool = True) -> None:
        self.value = value
        self.attribute: ColumnElement | None = None
        self.skip_none = skip_none

    @abstractmethod
    def to_orm(self, select: Select) -> Select:
        # Apply this filter to a SQLAlchemy Select statement.
        pass

    def set_value(self, value: T | None) -> Self:
        self.value = value
        return self

    @abstractmethod
    def depends(self, *args: Any, **kwargs: Any) -> Self:
        # Used as the FastAPI dependency that binds the query parameter.
        pass

Problem
As more and more APIs are migrated to FastAPI, each one adds another class inheriting from BaseParam in the common/parameters.py module.
For an API with n filterable fields, this results in n additional boilerplate classes.

Therefore, a universal factory is needed to generate these filters with the type bindings FastAPI requires.
After this refactoring PR, over 50 APIs use the resulting filter_param_factory; a simplified sketch of the idea follows.
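
A minimal sketch of the factory idea (my simplification, not Airflow's exact filter_param_factory signature): bind the query parameter's type at runtime so FastAPI can still parse and document it, instead of hand-writing one subclass per filter.

from typing import Any, Callable, Optional

from fastapi import Query
from sqlalchemy import ColumnElement, Select


class SimpleFilter:
    """Stand-in for a BaseParam-style filter produced by the factory."""

    def __init__(self, attribute: ColumnElement, value: Any, skip_none: bool = True) -> None:
        self.attribute = attribute
        self.value = value
        self.skip_none = skip_none

    def to_orm(self, select: Select) -> Select:
        # Skip filtering entirely when no value was supplied.
        if self.value is None and self.skip_none:
            return select
        return select.where(self.attribute == self.value)


def filter_param_factory(attribute: ColumnElement, value_type: type) -> Callable[..., SimpleFilter]:
    # FastAPI inspects the dependency's signature to parse and document
    # the query parameter, so the type is bound at runtime instead of
    # being hard-coded in a dedicated subclass.
    def depends_filter(value: Optional[value_type] = Query(default=None)) -> SimpleFilter:  # type: ignore[valid-type]
        return SimpleFilter(attribute=attribute, value=value)

    return depends_filter

A router can then declare a parameter like value_filter: SimpleFilter = Depends(filter_param_factory(MyModel.my_column, str)) (hypothetical names) and apply value_filter.to_orm(...) to its base query, keeping the router layer clean.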

Global Unique Constraint Handler

This approach leverages FastAPI’s Exception Handler to process the Unique Constraint Error raised by SQLAlchemy,
eliminating the need to handle this exception in each individual router.
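
A minimal sketch of the approach, assuming a simplified FastAPI app (not the exact Airflow handler):

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from sqlalchemy.exc import IntegrityError

app = FastAPI()


@app.exception_handler(IntegrityError)
async def unique_constraint_handler(request: Request, exc: IntegrityError) -> JSONResponse:
    # 409 Conflict is the conventional response for unique-constraint violations.
    return JSONResponse(
        status_code=409,
        content={"detail": "Unique constraint violation", "origin": str(exc.orig)},
    )

With one handler registered on the app, any router whose database flush raises IntegrityError gets a consistent 409 response for free.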

Fixing the Display of Logs after Applying Filters

Fix wrong display of multiline messages in the log after filtering #44457

Before the fix, error highlighting in the logs was determined solely by a regex that searched for the string ERROR in the current line, so continuation lines of a multiline message lost their highlighting.
The fix instead tracks a currentLevel for the log stream, ensuring that every line belonging to an ERROR-level record stays highlighted.
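
The actual fix lives in the UI's front-end code; this Python sketch only illustrates the logic change, with made-up names:

import re

LEVEL_RE = re.compile(r"\b(DEBUG|INFO|WARNING|ERROR|CRITICAL)\b")


def highlight_errors(lines: list[str]) -> list[bool]:
    """Mark every line that belongs to an ERROR-level record."""
    current_level = None
    flags = []
    for line in lines:
        match = LEVEL_RE.search(line)
        if match:
            # A new log record starts here; remember its level.
            current_level = match.group(1)
        # Continuation lines inherit the level of the record they belong
        # to, so multiline ERROR messages stay fully highlighted.
        flags.append(current_level == "ERROR")
    return flags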

(Screenshot: the log display after the fix)

Since this directly affects the log page used by end users, it felt like a particularly rewarding PR.

Although the old UI is likely to be deprecated in the future, this PR will at least be included in version 2.10.x.

Removing AIP-44 Internal API

Next, I encountered the Meta Issue Removal of AIP-44 code #44436.

The Internal API can be understood as an internal RPC layer (a JSON-RPC-style API over HTTP).
This was my first encounter with a crowdsourced issue. Its value mainly comes from the fact that, starting with Airflow 3.0, components such as the Task SDK and Operators should no longer access the Metadata Database directly.

The Internal API is exactly such a portion of the codebase that accesses the Metadata Database directly, and it had been criticized for being difficult to trace.
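
To make the pattern concrete, here is my simplified reconstruction of what was being removed (names abbreviated, transport omitted; not the exact AIP-44 code):

from functools import wraps

DB_ISOLATION_ENABLED = True  # in Airflow this came from configuration


def _call_internal_api(method: str, args: tuple, kwargs: dict):
    # The real code serialized the call and sent it to the Internal API's
    # RPC endpoint; the transport is omitted in this sketch.
    raise NotImplementedError


def internal_api_call(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        if DB_ISOLATION_ENABLED:
            # Route the call over RPC instead of touching the database.
            return _call_internal_api(func.__qualname__, args, kwargs)
        return func(*args, **kwargs)  # direct Metadata Database access

    return wrapper

Removing AIP-44 largely meant deleting these wrappers and the RPC plumbing behind them.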

Open Source For You – Winter Koi Fish Season

Since these crowdsourced issues are typically solved by many people together, everyone picks up a batch of tasks (possibly 5–10 sub-tasks) at a time.

Around this time, Open Source For You organized the Winter Koi Fish Season event, offering a Starbucks coffee reward to the top 3 contributors of the week.

With some previously pending PRs also getting merged, this wave of Internal API removals brought me as many as 15 merged PRs in a single week, so I unexpectedly won a Starbucks coffee reward! 😆


The Facebook post from Open Source For You

Next Steps

Continuing to Explore the Core

I plan to delve deeper into Airflow’s architecture, focusing on core components such as the Scheduler, Trigger, and Executor.

I also intend to explore feature details related to Airflow 3. Currently, I’m involved in issues related to AIP-63: DAG Versioning and AIP-66: DAGs Bundles & Parsing.

While tackling tasks, it’s important to consider not just how to solve them but also the rationale behind the design and the true value of the issue, rather than simply aiming to rack up numbers.

Engaging More in Community Discussions

This mainly includes participating in:

  • GitHub Issues
  • Developer Mailing List
  • Slack
  • AIP Documentation

Answering More Questions on Slack

Answering questions on Slack is also part of engaging in community discussions.
Whenever I have free time and come across topics I’m familiar with, I help answer questions in channels such as:

  • #new-contributor
  • #contributor
  • #airflow
  • #user-troubleshooting

Conclusion

Contributing to Apache Airflow has been incredibly rewarding, and it’s a truly unique experience to collaborate with top developers from around the world!


My GitHub HeatMap - Apache Airflow

There’s a special sense of accomplishment when a PR gets merged, along with the recognition from reviewers.
It’s a bit like solving algorithm problems in high school—except now, your contribution might actually be used by a company somewhere in the world!
It’s far more meaningful than just practicing coding challenges.

In the future, I will write more in-depth PR write-ups to document my experiences, hoping to help others who want to contribute to Apache Airflow.
