I am Liu Zhe-You (Jason), currently a junior at NCKU CSIE. I focus on contributing to open source and have a keen interest in Distributed Systems and Data Engineering.
Before contributing to Apache Airflow, I had only interned in a Data Engineering-related department for about three months. Interestingly, my tasks didn’t even involve Airflow—I was mainly responsible for general backend development.
When I decided to contribute to open source, I wanted to start with a Top-Level Project from the Apache Software Foundation. I noticed that Apache Airflow had about 39.2k stars, and after consulting with Data Teams, I confirmed that Airflow is indeed a crucial tool in Data Engineering. It was a perfect fit, especially since Python is my most familiar language.
Before diving in, I want to share what I’ve gained from contributing to open source. Hopefully, this will inspire those who are still hesitating to get started!
Growth at Both the Code Level and System Level
Apache Top-Level Projects are massive in scale. You’ll encounter numerous design patterns and learn how large-scale software achieves scalability and fault tolerance. Additionally, you’ll see how CI pipelines are designed to ensure system stability while using minimal resources.
For example, in Airflow, every new feature or refactor requires careful consideration of backward compatibility. Even a small change to the codebase could potentially affect users worldwide. This kind of experience is hard to gain from personal side projects or even internal company projects.
Opportunity to Collaborate with Top Developers from Around the World
This is one of the coolest parts of contributing to open source. Even though I’m based in Taiwan, the moment I submit a PR on GitHub, I have the chance to receive feedback from PMCs, Committers, and developers with 10+ or even 20+ years of experience from around the world. Sometimes, I even get to collaborate with them on new features and refactors—a fantastic learning opportunity!
A Way to Prove Your Skills
Contributing to large open-source projects is a great way to showcase your abilities. Since all PRs are publicly available on GitHub, they serve as proof of your:
Problem-solving skills
Code quality
Communication ability
Currently, there are only about 9,000 Apache Committers worldwide. That makes committer status a highly valuable credential.
Here’s a summary of my contributions so far. I started contributing to Apache Airflow in early October 2024, and the following stats are as of March 14, 2025.
Since it was labeled as a “good first issue”, I decided to trace the problem. It turned out that the fix only required a single line of code, so I went ahead and submitted my first PR!
The “4” in the name is pronounced “Si” in Mandarin and stands in for the English word “For.” So, OpenSource4You can be read as “Open Source For You.”
OpenSource4You is a non-profit organization in Taiwan dedicated to hands-on open-source contributions. It provides mentorship for contributing to various open-source projects, including:
Since my first issue was related to DAGs, I followed the documentation and tried to reproduce the problem in the Breeze Container. However, I ran into some issues along the way.
I reached out to Committer Wei Lee for help. From that moment on, I unofficially became Wei Lee’s mentee xD. Whenever I ran into problems or needed a PR review or label, I would ask for his help!
Although the fix was just a single line of code, the PR went through 20+ rounds of comments and revisions. That experience taught me that open-source contribution is much more than just modifying one line of code.
In particular, unit testing was a challenge for me. I was more familiar with integration tests, so I had little experience with mocking before.
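For readers who are equally new to mocking, here is a minimal sketch of the idea using Python’s standard `unittest.mock`. The `LogFetcher` class and its `client.download` method are hypothetical names made up for illustration; they are not Airflow APIs and this is not the test from my PR.

```python
from unittest import mock


class LogFetcher:
    """Hypothetical class under test: reads a log file through a storage client."""

    def __init__(self, client):
        self.client = client

    def fetch(self, key):
        # In production this would be a network call to remote storage.
        raw = self.client.download(key)
        return raw.decode("utf-8").splitlines()


def test_fetch_splits_lines():
    # Replace the real client with a Mock so the test needs no network access.
    fake_client = mock.Mock()
    fake_client.download.return_value = b"line 1\nline 2"

    fetcher = LogFetcher(fake_client)

    # Check the behaviour of our own code...
    assert fetcher.fetch("dag/task/1.log") == ["line 1", "line 2"]
    # ...and check how it talked to the dependency.
    fake_client.download.assert_called_once_with("dag/task/1.log")


test_fetch_splits_lines()
```

The difference from an integration test is that only `LogFetcher` itself is exercised; the storage client is swapped for a stand-in whose return value and call history the test controls.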
meta issue: Issues with multiple subtasks, often involving refactoring or migrating multiple modules
Meta Issues are especially beginner-friendly because once you complete one subtask, the remaining ones follow a similar pattern, making them easier to solve.
If your goal is to build up issue contributions, this is one of the fastest ways. It also increases your visibility in the community and helps establish your presence.
This caught my attention because adding a feature to download logs from cloud storage wouldn’t actually solve the root cause of the OOM issue. Instead, the real problem was sorting and merging logs in memory, which needed to be fixed.
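To show why in-memory sorting and merging matters, here is a rough sketch of the alternative: merging log streams lazily, assuming each source is already sorted by timestamp. It uses the standard library’s `heapq.merge`; the record shape and the sample data are assumptions for illustration, and this is not the actual Airflow change.

```python
import heapq
from typing import Iterable, Iterator, Tuple

# Assumed record shape for this sketch: (ISO-8601 timestamp, message).
LogRecord = Tuple[str, str]


def merge_log_streams(*streams: Iterable[LogRecord]) -> Iterator[LogRecord]:
    """Lazily merge log streams that are each already sorted by timestamp.

    heapq.merge holds only one pending record per stream, so memory use
    grows with the number of streams instead of the total log size.
    """
    return heapq.merge(*streams, key=lambda record: record[0])


# Made-up sample data standing in for two log sources.
worker_logs = [
    ("2025-01-01T00:00:01", "task started"),
    ("2025-01-01T00:00:04", "task finished"),
]
triggerer_logs = [
    ("2025-01-01T00:00:02", "trigger fired"),
    ("2025-01-01T00:00:03", "trigger done"),
]

merged = list(merge_log_streams(iter(worker_logs), iter(triggerer_logs)))
for timestamp, message in merged:
    print(timestamp, message)
```

Collecting every line into one list and calling `sorted()` would produce the same order, but it requires the whole log to sit in memory at once, which is exactly the kind of behaviour that leads to OOM on large logs.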
So I started structuring my PRs based on The Golden Circle:
Why: Why is this PR needed?
How: What was changed? What’s the approach or strategy?
What: What is the expected result or outcome?
Sometimes, I only include Why & How or Why & What. The distinction between How and What can be blurry depending on how I describe the PR changes. Personally, I include all three only for larger changes.
The goal is simply to help reviewers quickly understand why the PR is necessary and what has been changed.
Airflow discussions primarily happen on GitHub Issues (unless it’s urgent or requires in-depth discussion, in which case Slack is used).
So when asking questions or discussing an issue, it’s important to clearly provide all relevant context.
Since maintainers might be in different time zones, each round of communication can take several hours. The clearer you explain the issue, the fewer back-and-forth messages will be needed.
Before working on a PR, it’s helpful to discuss your approach with relevant stakeholders. For example, you can briefly explain your solution in the GitHub Issue or Slack and CC key stakeholders.
(For instance, if solving an issue requires modifying another component, you can tag the maintainer of that component in the issue comments.)
This helps avoid extensive, unnecessary PR revisions and also increases visibility so that stakeholders can help review or provide feedback.
Now, I manage everything directly with GitHub Projects’ Kanban board.
You can create your own Kanban board in your forked repo.
Since I often have 2-3 issues in progress, with some waiting for code review and others pending reviewer feedback, a Jira-like Kanban helps keep track of everything efficiently. This way, I don’t lose track of my ongoing issues.
If I find an interesting issue while browsing or think of something to work on, I add it to the Backlog—so I don’t forget it later.
A few weeks ago, though, Pierre Jeambrun tagged me and asked if I wanted to take it on. That was a pretty special moment: it felt like I wasn’t just a beginner anymore. Being recognized by a PMC member as capable of handling this issue was really encouraging.
I actually started working on this issue back in December last year. I conducted a comprehensive benchmark and proof of concept in the issue discussion, which showed that the change could reduce memory usage by 90%. It was also the first PR where a PMC member praised my work (I think?).
At the time, I thought that once it got merged, I could asynchronously refactor 10 different providers, essentially giving me 10 extra PRs to work on.
This was my first major refactor directly related to a core feature, but ironically, I was nominated as a committer before the PR was even merged.
It still hasn’t been merged, but it should land before Airflow 2.10.6 or 2.11.0! If it’s going into the Airflow 3.0 main branch, I’ll need to spend some time resolving conflicts 🚧.
The actual code change for this issue was pretty quick to write. All the related tests passed individually, but only when running the full test suite did the failure appear. Even after retrying multiple times, it still failed.
Jarek suggested that I might need to bisect the tests to identify the one causing a side effect:
Likely this is a side effect of some other test that does not clean up after itself. You can repeat what CI is doing — i.e. run the Core test type… In many cases, you can guess which tests are related to your changes. What I often do in such cases is try to bisect the issue— instead of running the whole test_type (“Core”) test suite, I enter Breeze and run individual test packages/modules seen in the output.
In the end, binary search actually helped me find the problematic test! I never expected this kind of issue to happen. ~Who knew contributing to open source also required psychic abilities?~
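The bisecting idea can be sketched roughly as follows, assuming exactly one earlier test module leaves state behind. The function name, the module paths, and the simulated runner are all hypothetical; in a real session the callback would run the chosen modules followed by the failing test (for example inside Breeze) and report whether it failed.

```python
from typing import Callable, Sequence


def find_polluter(
    candidates: Sequence[str],
    victim_fails_after: Callable[[Sequence[str]], bool],
) -> str:
    """Binary-search for the single test module whose side effect breaks the victim.

    victim_fails_after(subset) must return True when running `subset` and then
    the victim test makes the victim fail.
    """
    if not candidates:
        raise ValueError("need at least one candidate test module")

    lo, hi = 0, len(candidates)
    # Invariant: the polluter is somewhere in candidates[lo:hi].
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if victim_fails_after(candidates[lo:mid]):
            hi = mid  # the polluter is in the first half
        else:
            lo = mid  # the first half is clean, so it is in the second half
    return candidates[lo]


# Simulated runner for the sketch: pretend test_c.py forgets to clean up.
suspects = [f"tests/test_{name}.py" for name in "abcde"]


def simulated_run(subset: Sequence[str]) -> bool:
    return "tests/test_c.py" in subset


print(find_polluter(suspects, simulated_run))
```

Each round halves the list of suspects, so even a large test type needs only a handful of runs. The approach breaks down if the failure depends on a combination of tests rather than a single one.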
This was the last PR I submitted before being invited as a committer.
One afternoon, just before heading to my PE class, I saw that Ash had tagged me in the #internal-ci-cd channel on Slack, saying that the Kubernetes test I had fixed earlier was still very flaky.
TL;DR: My Kubernetes test fix was still unstable, and the CI failure rate was quite high.
So, right before leaving, I quickly drafted a fix and speed-ran a PR in 20 minutes. Luckily, I got it right on the first commit!
The funny part? Jarek commented below: “Looking at it with 🍿”
Definitely the most nerve-wracking yet satisfying PR so far!
Being a committer ≠ fully understanding the entire project. Personally, I’d say my current understanding of Airflow is only about 15%.
(So far, I’m most familiar with API server, Task Log, Auth Manager, Executor, and Kubernetes Tests.)
There are still many core Airflow features I need to dive deeper into, such as Scheduler, Trigger, Pool, and TaskSDK (a new feature in Airflow 3.0), which are still quite unfamiliar to me.
Previously, I focused more on solving issues and spent relatively less time reviewing PRs.
Moving forward, I plan to review more PRs, including ones outside my expertise, so I can explore related contexts while reviewing.
Join the Fun: “Let’s Contribute to Apache Airflow!”
If you’re interested in Python and Data Engineering and want to start contributing to a world-class open-source project, but you’re worried about complex setups or needing a high-end computer, why not try contributing to Apache Airflow?
Apache Airflow offers an excellent developer experience. I never expected an open-source project to have a dedicated CLI just to make life easier for contributors and CI!
With this CLI, you can effortlessly run unit tests, integration tests, Kubernetes tests, or even spin up an Airflow system with different executors.
It also has an incredibly robust CI system. There are over 100 pre-commit hooks, covering everything from basic linting and type checking to generating documentation, ERDs, and frontend API services— all designed to maintain high PR quality.
The pre-commit hooks run automatically during git commit. Even the “Available pre-commit checks” documentation is automatically updated by one of the pre-commit hooks!
The project also has well-defined GitHub Labels, with over 250 labels to help categorize different issues efficiently.
~And, of course, plenty of fun memes~
Overall, I’d say Apache Airflow is an incredibly beginner-friendly open-source project!
Lee-W, for patiently reviewing PRs, adding labels, and re-triggering CI runs. Also, for sharing new issues in the OpenSource4You Slack channel.
Jarek, for always providing insightful feedback that improves PRs, for often being the first to respond online (seemingly 24/7!), and for inviting me as a committer. Thank you!
Pierre Jeambrun, for reviewing countless API-related PRs (probably over 40!) and helping debug strange test failures.
Chia-Ping Tsai, founder of @opensource4you. Without this community, I wouldn’t have imagined being able to contribute to a global open-source project from Taiwan. It gave me the courage to jump in and start contributing!
I truly understand now what @Lee-W meant by: “A developer who gets code reviews is living the dream.”
Without the support of these amazing people, my open-source journey wouldn’t have been this smooth!