Launcher

You've reached the end of the results

This year I’ve quit my 9 to 5 job in pursuit of building a career as a freelance analytics engineer… which turned out to be easier said than done.

Yes, there are more work opportunities now than ever, but one has to consider that 4 years ago you had to only compete with other developers when applying for gigs.

But today, in the age of 💩GPT, you also have to stand out from hundreds of applications coming in from AI platforms that promise to 10x your number of clients in just a month.

Why I did the DAG Authoring Certification Exam

Needing to differentiate myself from the mass of chatbots and 10x vibe coders I discovered 1 thing they have in common. Both vibe coders and LLMs suck at certification exams.

A good certification exam puts you in front of a real problem, testing not only your knowledge of the tool, but your critical thinking as well.

And to the question of which certification exams to take on first: the best certificates are the ones you get for free.

It just so happened that Astronomer was giving out free shots at the new Airflow 3.0 DAG authoring certificate as a reward for participating in their annual user survey…

Post Summary;

Here are the topics I’ll talk about in this post:

  1. What you’ll learn in the DAG Authoring Certificate Even though I’ve been working with the tool intensively for 4+ years, there were some things I had no idea about, in this section, I’ll list a few I found most interesting.

  2. Sample exam questions. Just like any good college buddy, I made sure to write down all the questions that were confusing or hard for me, so you know what to prepare for.

  3. Study guide. Where to start? What topics to check out? In this section I’m going give you a 3 step study plan that leads to success.

What can you learn in the Airflow DAG Authoring Certificate

Some of these are new to Airflow 3.0, some of these I just never had a chance to use. But here are some cool things you’ll learn while doing the Airflow DAG Authoring Certificate.

DAG Versioning

If you’ve ever worked with Airlfow in production environments, you may know how big of a pain can DAG versioning used to be.

Airflow assumes that the most recent DAG code applies to all of its runs.

Because of this, if you decide to remove, or add a task from a DAG that was already running, your UI would get messed up.

Code changes during execution create unpredictable behaviour.

Airflow always runs the latest version of your tasks code, even if you made a change in the middle of the execution of the pipeline.

To solve this Airflow 3.0. introduces the concept of DAG Bundles

Important

DAG bundles are collections of one or multiple DAGs and their associated files that can be sourced from various locations.

At the moment Airflow supports two types of DAG bundles:

  • Local DAG bundle -> the old un-versioned system
  • Git DAG bundle -> uses git for detecting version changes in the DAG code

You can even have multiple DAG bundles, allowing your to organize your DAGs into more than one Git repository.

To configure a versioned DAG bundle, add the following to the airflow.cfg file:

airflow.cfg
[dag_processor]
dag_bundle_config_list = [
{
"name": "my_git_repo",
"classpath": "airflow.providers.git.bundles.git.GitDagBundle",
"kwargs": {"tracking_ref": "main", "git_conn_id": "my_git_conn"}
},
{
"name": "dags-folder",
"classpath": "airflow.dag_processing.bundles.local.LocalDagBundle",
"kwargs": {}
}
]

XCOM Map and Zip functions

I knew about the native map and zip functions in Python, but I had no idea XCom had also implemented them.

Map — Takes a Python function as an argument and maps the values of XComArgs (but doesn’t create an Airflow task).

dag_with_map.py
# Define the DAG
with DAG(
dag_id="example_dag",
start_date=datetime(2025, 1, 1),
schedule=None, # This DAG does not run on a schedule
) as dag:
# Return a list of folder paths
def list_folders():
return ["usr/folder_a", "usr/folder_b", "usr/folder_c"]
list_files = PythonOperator(
task_id="list_files",
python_callable=list_folders
)
# Append 'data/' to each folder path
# Using the .output.map() to process each path individually
files_processed = list_files.output.map(lambda path: path + "data/")

Zip — useful for merging the output of different tasks to pass as a result to another task.

dag_with_zip.py
@dag(start_date=datetime(2025, 1, 1), schedule=None, dag_id="download_dag")
def download_dag():
@task
def get_paths():
return ['/usr/local/', '/bin/test/', '/home/me/']
@task
def get_filenames():
return ['file_a', 'file_b', 'file_c']
@task
def get_extensions():
return ['.txt', '.zip', '.parquet']
@task
def download_files(paths, filenames, extensions):
# Combine path, filename, and extension
for path, filename, ext in zip(paths, filenames, extensions):
full_path = f"{path}{filename}{ext}"
print(f"Downloading {full_path}...")
# Expected output:
# Downloading /usr/local/file_a.txt...
# Downloading /bin/test/file_b.zip...
# Downloading /home/me/file_c.parquet...

Parametrized task groups

Not only can TaskGroups be parametrized in Airflow, but you can also:

  • Call the parameterized task group function with different arguments, to create multiple instances of the same task group which greatly impacts re usability of DAG code.
  • Import task groups from external modules which means less DAG code in a single file.
  • Dynamically map task group to create multiple instances based on runtime data.
parametrized_task_group.py
@task_group(group_id="file_processor")
def process_file(file_path: str)
@task
def extract_data(path: str):
return f"data_from_{path}"
@task
def transform_data(data: str):
return f"transformed_{data}"
return transform_data(extract_data(file_path))
# Create multiple instances with expand()
file_paths = ["/data/file1.csv", "data/file2.csv", "data/file3.csv"]
mapped_groups = process_file.expand(file_path=file_paths)
# Access specific instances
downstream_task = pull_xcom(mapped_groups)

Triggering DAGs using Assets

Assets are a neat trick when you want to make your workflows more data driven. They allow your DAGs to trigger based on updates, rather than just time-based schedules.

Note

Assets create a visible relationship between the DAGs and the data they process.

Each time an asset is updated by a DAG, an AssetEvent is dispatched and sent to other DAGs listening for changes.

I did know about assets before I started the exam, but what I didn’t know that the asset can carry metadata between DAGs.

If we want to attach some metadata to this AssetEvent, we can do so by yielding it from a task.

outputing_metadata.py
my_asset = Asset("my_asset")
@task(outlets=[my_asset])
def attach_metadata():
num = 23
yield Metadata(my_asset, {"my_num": num})
return "hello :)"
attach_metadata()

Certificate question examples

Now that you have an idea what you can learn during the certification, I’ll go through some of the harder questions questions on the exam, so you know what to prepare for.

What is an Asset? (Select all that apply)

Correct answers:

  • (T) An Airflow asset represents a logical grouping of data
  • (T) An Airflow asset is technically a DAG
Don’t forget to mark that second option as True.

When you define an asset using the @asset taskflow decorator, Airflow actually creates a DAG with the same id in the background.

Is the task below idempotent?

execute_query = SQLExecuteQueryOperator(
task_id="execute_query",
sql="SELECT * FROM my_table WHERE date = '{{ ds }}' LIMIT 1;",
split_statements=True,
return_last=False,
)

Correct answer: Yes

I initially wanted to answer “no” because of the LIMIT 1 part of the query. A query like this can, if run multiple times, return different results. But the exam ignores this, what’s important is that {{ ds }} is used in the date filter.

This makes the function non-deterministic, but it is idempotent.

What does print_user output to the standard output?

@dag
def my_dag():
@task
def get_user() -> Dict[str, Any]:
return {
'id': 1,
'name': 'John',
'city': 'New York'
}
@task_group
def get_user_info(my_user: Dict[str, Any]):
@task
def get_city() -> str:
return my_user['city']
@task
def get_name() -> str:
return my_user['name']
@task
def print_user(name: str, city: str):
print(f"Name: {name}, City: {city}")
print_user(get_name(), get_city())
user = get_user()
get_user_info(user)
my_dag()

Correct answer: Nothing, the DAG has an error

The two tasks: get_city and get_name cannot directly access variables from the scope above. It has to explicitly be passed to them.

How many XComs does this task produce?

@dag
def my_dag():
@task(multiple_outputs=True)
def my_task():
return {
'id': 1,
'name': 'John'
}
my_task()
my_dag()

Correct answer: 3

It is a little bit confusing, but the 3 XComs the task produces are the following:

  1. id -> 1
  2. name -> “Johnn”
  3. return_value -> {“id”: 1, “name”:” “john”}

Which XCom method should you use to retrieve results from a specific mapped task instance?

@task
def get_result_from_mapped_task(ti):
# How to get result from the 2nd mapped task instance?
pass

Correct answer: ti.xcom_pull(key=’return_value’, map_indexes=1)

I have no comments for this one. Had no idea about the map_indexes parameter

Does this DAG cause a parsing error?

@dag
def my_dag():
@task
def task_a():
print("Hello from task A!")
@task
def task_b():
print("Hello from task B!")
@task
def task_c():
print("Hello from task C!")
task_a >> task_b >> task_c
my_dag()

Correct answer: Yes

The tasks defined using the taskflow operator have to be invoked. The correct chaining would look like this: task_a() >> task_b() >> task_c()

What task ID will choose_branch return?

@task.branch
def choose_branch(result):
if result > 0.5 and result < 1:
return ['task_a', 'task_b']
elif > 1:
return ['task_c']
choose_branch(0)

Correct answer: This code raises an error

This question is very missleading. elif > 1 is not valid Python syntax. It should be written: elif result > 1

Study guide

Photo by Aaron Burden on Unsplash

Before you start the certification, here are a couple of tips:

The exam is not hard per-se but it does tackle more advanced topics like dynamic tasks, data aware mapping and XCom management, and there are some confusing questions.

You’ll have enough time for everything. 60 minutes for 75 multiple choice questions is more than enough. The passing score is quite generous. You need to answer only 53 of the 75 questions correctly to get certified.

Step #1 - finish the Fundamentals certificate first

Before you jump into the DAG Authoring certificate, do the Apache Airflow 3 Fundamentals first.

This certification handles the basic concepts of Airflow, and while it doesn’t cover a ton of topics it presents a lot of challenging questions that repeat in the DAG Authoring certificates.

This certificate should be passable with just 1-2 years of hands-on Airflow experience …Or if you just hit the books hard enough.

Step #2 - go through the astronomer learning paths

The Astronomer Academy is a free source of educational materials on the topic of Apache Airflow and some more general data engineering concepts.

It has a ton of video materials, cheat sheets and quizes that will prepare you for the certification.

Marc Lamberti is great at breaking down complex Airflow concepts and creating video tutorials that don’t take up much of your time.

Step #3 - apply the things you learn

Getting hands on is always the best way to gain a strong grasp on topics you learn through the documentation and video materials.

Are you struggling to grasp some concepts like data assets or XCom, then set up a local airflow instance and start coding.

Resources