I wanted to start contributing to a Top-Level Project of the Apache Software Foundation. I noticed that Apache Airflow has 38.6k stars on GitHub, and even within our Data Team, Airflow is recognized as a crucial tool in Data Engineering. Plus, Python is the language I'm most familiar with.
I’m Liu Zhe You (Jason), a CSIE junior from Taiwan 🇹🇼 with a passion for OSS contributions and an interest in backend distributed systems.
Before I actually started contributing to Apache Airflow, I had only interned in a Data Engineering–related department for just over 3 months. I didn’t even get a chance to write a DAG; I was mainly handling general backend tasks.
It was marked as a good first issue, so I traced the problem and found that it seemed to require only a one-line change in the code. I decided to give it a try.
Since the first issue was related to DAGs, I had to reproduce the problem locally, but I ran into trouble following the steps in the Breeze container documentation. I reached out to committer @Lee-W for help.
I guess you could say I became a mentee of Lee-W (lol). From then on, whenever I encountered problems, needed a PR review, or labeling assistance, I’d always ask for his help!
Although I only changed one line of code, the PR involved over 20 comments exchanged during the iterative review process. It made me realize that open source contribution isn’t as simple as just changing one line of code.
This was especially true on the unit-test side: since I had primarily written integration tests before, I wasn't very experienced with mocking.
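In hindsight, the core idea is simple: replace a real collaborator with a stand-in you control, then assert on the interaction. A minimal, generic example with unittest.mock (nothing Airflow-specific; the function under test is invented for illustration):

```python
from unittest import mock


def get_status(client) -> str:
    return client.fetch()["status"]


def test_get_status_with_a_mocked_client():
    client = mock.Mock()
    client.fetch.return_value = {"status": "ok"}  # stub out the real call
    assert get_status(client) == "ok"
    client.fetch.assert_called_once()  # verify the interaction happened
```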
Initially, I used HackMD to simply record potential issues to investigate in Markdown.
Issue list recorded using HackMD.
Now, I manage tasks using the Kanban board in GitHub Projects, since I often work on 2–3 issues at once. Some tasks are under development, some await code review, and others spotted in the Issue List go into the Backlog.
Back in October last year, many AIP-84 issues were opened, primarily aimed at migrating the legacy API (written in Flask) to a FastAPI-based API. Since I was most familiar with FastAPI at the time, I ended up taking on nearly 10 API migrations.
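Conceptually, each migration replaces a Flask view with a typed FastAPI route. A schematic sketch with a hypothetical endpoint (not the actual Airflow code):

```python
from fastapi import APIRouter

# Legacy Flask style, roughly:
#   @app.route("/dags/<dag_id>", methods=["GET"])
#   def get_dag(dag_id): ...

router = APIRouter(prefix="/dags", tags=["DAG"])


@router.get("/{dag_id}")
def get_dag(dag_id: str) -> dict:
    # the real endpoints return typed response models, not raw dicts
    return {"dag_id": dag_id}
```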
During these API migrations, I learned quite a bit about Airflow’s architecture and became acquainted with commonly used pytest fixtures for testing, such as dag_maker, dag_bag, create_dag_run, create_task_instances, and more.
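For instance, dag_maker lets a test declare a throwaway DAG inline and then create DagRuns against it. A hedged sketch of the typical usage (the DAG id and task below are invented for illustration):

```python
from airflow.operators.empty import EmptyOperator


def test_dagrun_has_one_task_instance(dag_maker, session):
    # dag_maker registers the DAG declared inside the context manager
    with dag_maker(dag_id="example_dag", session=session):
        EmptyOperator(task_id="noop")
    dag_run = dag_maker.create_dagrun()
    assert len(dag_run.task_instances) == 1
```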
Context

Under the FastAPI framework, each filter (more precisely, each query parameter) inherits from BaseParam. When an API has many filters, the BaseParam architecture helps keep the router layer clean.
```python
from abc import ABC, abstractmethod
from typing import Any, Generic, TypeVar

from sqlalchemy.sql import ColumnElement, Select
from typing_extensions import Self

T = TypeVar("T")


class BaseParam(Generic[T], ABC):
    """Base class for filters."""

    def __init__(self, value: T | None = None, skip_none: bool = True) -> None:
        self.value = value
        self.attribute: ColumnElement | None = None
        self.skip_none = skip_none

    @abstractmethod
    def to_orm(self, select: Select) -> Select:
        pass

    def set_value(self, value: T | None) -> Self:
        self.value = value
        return self

    @abstractmethod
    def depends(self, *args: Any, **kwargs: Any) -> Self:
        pass
```
Problem

As more and more APIs are migrated to FastAPI, each one adds a class that inherits from BaseParam to the common/parameters.py module. For an API with n entities, this results in n additional classes.
Therefore, a universal Factory Pattern is needed to generate these classes with the appropriate type bindings required by FastAPI. After this refactoring PR, over 50 APIs utilize the filter_param_factory.
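A minimal sketch of the factory idea, reusing the BaseParam class shown above (illustrative only; Airflow's real filter_param_factory supports more options than this):

```python
from fastapi import Query
from sqlalchemy.sql import ColumnElement, Select


def filter_param_factory(attribute: ColumnElement, value_type: type) -> type:
    """Build a BaseParam subclass bound to one column and one value type."""

    class _FilterParam(BaseParam):
        def to_orm(self, select: Select) -> Select:
            if self.value is None and self.skip_none:
                return select
            return select.where(attribute == self.value)

        # the annotation binds the concrete type so FastAPI can validate it
        def depends(self, value: value_type | None = Query(default=None)):
            return self.set_value(value)

    return _FilterParam
```

The payoff is that adding a filter for a new column becomes a single factory call instead of another hand-written class.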
This approach leverages FastAPI’s Exception Handler to process the Unique Constraint Error raised by SQLAlchemy, eliminating the need to handle this exception in each individual router.
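The shape of that pattern, as a hedged sketch (the status code and message here are illustrative, not necessarily what Airflow returns):

```python
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from sqlalchemy.exc import IntegrityError

app = FastAPI()


@app.exception_handler(IntegrityError)
async def handle_unique_constraint(request: Request, exc: IntegrityError) -> JSONResponse:
    # one central handler turns DB-level unique violations into an HTTP error,
    # so individual routers need no try/except around their inserts
    return JSONResponse(status_code=409, content={"detail": "Unique constraint violated"})
```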
Fixing the Display of Logs after Applying Filters
Before the fix, error highlighting in the logs was determined solely by a regex that searched for the string ERROR in the current line. Instead, it should rely on a currentLevel variable that tracks the current log level, so that every line belonging to an ERROR-level record is highlighted, including continuation lines such as traceback frames.
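To make the change concrete, here is a Python sketch of that logic (the actual fix lives in the Airflow UI's JavaScript; the names and the level regex here are assumptions):

```python
import re

# Matches the log-level token of a record; real Airflow log formats vary,
# so treat this pattern as an assumption for the sketch.
LEVEL_RE = re.compile(r"\b(DEBUG|INFO|WARNING|ERROR|CRITICAL)\b")


def highlight_errors(lines: list[str]) -> list[bool]:
    """Flag every line that belongs to an ERROR-level record."""
    current_level = None
    flags = []
    for line in lines:
        match = LEVEL_RE.search(line)
        if match:
            current_level = match.group(1)  # a new record starts here
        # continuation lines inherit current_level instead of being re-matched
        flags.append(current_level == "ERROR")
    return flags
```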
Since this directly affects the log page used by end users, it felt like a particularly rewarding PR.
Although the old UI is likely to be deprecated in the future, this PR will at least be included in version 2.10.x.
The Internal API can be understood as an internal RPC layer (implemented with Thrift RPC). This was my first encounter with a crowdsourced issue. Its main value lies in the fact that, starting with Airflow 3.0, components such as the Task SDK and Operators should no longer access the Metadata Database directly.
The Internal API is the portion of the codebase that directly accesses the Metadata Database, and it has been criticized for being difficult to trace.
Since these crowdsourced issues are typically solved by many people together, everyone picks up a batch of tasks (possibly 5–10 sub-tasks) at a time.
Around this time, Open Source For You organized the Winter Koi Fish Season event, offering a Starbucks coffee reward to the top 3 contributors of the week.
With some pending PRs finally getting merged, this wave of Internal API removals resulted in as many as 15 of my PRs being merged in a single week, so I unexpectedly won a Starbucks coffee! 😆
While tackling tasks, it’s important to consider not just how to solve them but also the rationale behind the design and the true value of the issue, rather than simply aiming to rack up numbers.
Answering questions on Slack is also part of engaging in community discussions. Whenever I have free time and come across topics I’m familiar with, I help answer questions in channels such as:
Contributing to Apache Airflow has been incredibly rewarding, and it’s a truly unique experience to collaborate with top developers from around the world!
There’s a special sense of accomplishment when a PR gets merged, along with the recognition from reviewers. It’s a bit like solving algorithm problems in high school—except now, your contribution might actually be used by a company somewhere in the world! It’s far more meaningful than just practicing coding challenges.
In the future, I will write more in-depth PR write-ups to document my experiences, hoping to help others who want to contribute to Apache Airflow.