Get started with Apache Airflow®

This tutorial will get you started as quickly as possible while explaining the core concepts of Apache Airflow. You will explore galaxies 🌌 while extending an existing workflow with modern Airflow features, setting you up for diving into the world of data orchestration with Apache Airflow.

Whether you are an absolute Airflow beginner or already know some of the concepts, five minutes from now you will have your first data pipeline, aka Dag, running in a fully functional Airflow environment!

Set up in minutes

Get a fully functional Airflow environment running in your browser with zero local setup using Astro IDE.

Build your first pipeline

Create and run an ETL pipeline that processes galaxy data with extraction, transformation, and loading steps.

Master core concepts

Learn Dags, tasks, operators, dependencies, and asset-aware scheduling through hands-on practice.

Step 1: Astro trial and Astro IDE setup

  1. The first step is to start a free Astro trial.

    All Astro accounts have access to the Astro IDE, which is the easiest way to develop Airflow Dags right in your browser. You can deploy your Dags directly from the Astro IDE to an Astro Deployment, an Airflow environment running in the cloud. After entering your email address, the trial setup consists of the following four steps:

  2. Choose between professional and personal. The choice has no impact on this tutorial.

  3. Enter an organization and workspace name. Each customer has a dedicated organization on Astro. Each team or project has a workspace, which is a collection of deployments. A deployment is an Airflow environment hosted on Astro. For this tutorial, you can use any names.

  4. You can choose to upload Dags, use a template, or start with an empty workspace. For this tutorial, choose start with a template.

  5. Choose the ETL template.

    Astro trial flow

    Astro Concepts
    • Astro: Fully-managed platform that helps teams write and run data pipelines with Airflow at any scale.
    • Astro IDE: In-browser IDE with context-aware AI and zero local setup.
    • Organization: Each customer has a dedicated org on Astro.
    • Workspace: Each team or project has a dedicated workspace, containing a collection of deployments.
    • Deployment: Airflow environment hosted on Astro.
    • Summary: 1 Organization → n Workspaces → n Deployments → 1 Airflow instance per Deployment.

    Once your environment is created, you will find yourself in the Astro IDE with your very first ETL Dag, ready to be deployed! The Python code is a programmatic representation of your workflow; clicking the Start Test Deployment button in the top right starts a fully functional Airflow environment and deploys your code to it.

  6. Click the Start Test Deployment button and wait for the deployment to finish.

    Astro trial flow

  7. Your first Airflow Dag is deployed and ready to be executed. Click on the dropdown next to Sync to Test and select Open Airflow.

    Open Airflow from Astro IDE

    The Airflow UI home dashboard of your Airflow instance will open in a new browser tab.

    Airflow home dashboard

Step 2: Run your first Dag

  1. Within the navbar on the left, click on Dags.

    This view shows all your Dags defined in your Python code. The ETL template comes with one Dag named example_etl_galaxies.

    Dags view

    This ETL (Extract, Transform, Load) pipeline retrieves data about galaxies, filters them based on their distance from the Milky Way, and stores the results in a DuckDB database.

    Graph representation of the Dag

    Tasks breakdown

    • create_galaxy_table_in_duckdb: Creates a table in DuckDB with columns for galaxy name, distances, type, and characteristics.
    • extract_galaxy_data: Retrieves raw data about 20 galaxies and returns it as a pandas DataFrame.
    • transform_galaxy_data: Filters the galaxy data to keep only galaxies within a specified distance from the Milky Way (default: 500,000 light years).
    • load_galaxy_data: Inserts the filtered galaxy data into the DuckDB table and produces an Airflow Asset update.
    • print_loaded_galaxies: Queries and prints all stored galaxies from DuckDB, sorted by distance from the Milky Way.

    Task dependencies

    • create_galaxy_table_in_duckdb → load_galaxy_data (table must exist before loading)
    • extract_galaxy_data → transform_galaxy_data (raw data is needed for filtering)
    • transform_galaxy_data → load_galaxy_data (filtered data is needed for loading)
    • load_galaxy_data → print_loaded_galaxies (data must be loaded before printing)
  2. Let’s run the pipeline! Click the play button next to the Dag.

    Trigger Dag run via Dags view

    The button opens a trigger dialog that lets you start a single run or a backfill over a range of dates right from the UI. Dags can also define parameters that keep parts of your pipeline configurable at trigger time (see the short sketch after these steps).

  3. Select Single Run, keep the parameters at their defaults, and click the Trigger button.

    Trigger Dag run dialog

    Your Dag will start, and under Latest Run in the Dags view you will see the currently running instance.

  4. Click that run date to go to the individual Dag run view.

    Latest run in the Dags view

    Watch the Dag run finish and explore the grid and graph views (buttons on the top left), two different representations of your pipeline.

  5. Once all tasks have finished successfully, open the grid view and click the print_loaded_galaxies task, the last step in your pipeline graph.

    Task selection of a Dag run in the grid view

    This opens the logs of that task instance, where you can see the output: a table of galaxies with their distance from the Milky Way and from our solar system, as well as the type of each galaxy.

    Task logs in the Airflow UI
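The parameters you saw in the trigger dialog are declared at the Dag level. Here is a minimal sketch of how such a parameter could be defined and read inside a task; the parameter name max_distance_lightyears is hypothetical and not necessarily what the ETL template uses:

# A minimal sketch of Dag-level params (hypothetical parameter name).
from airflow.sdk import Param, dag, task


@dag(params={"max_distance_lightyears": Param(500_000, type="integer")})
def configurable_pipeline():

    @task
    def filter_galaxies(**context):
        # The value chosen in the trigger dialog is available in the task context.
        max_distance = context["params"]["max_distance_lightyears"]
        print(f"Keeping galaxies closer than {max_distance} light years")

    filter_galaxies()


configurable_pipeline()

When you trigger such a Dag, the dialog renders each parameter as an editable field, prefilled with its default value.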

You just set up your Airflow development environment, started your first Airflow environment, and deployed and ran your first Dag. Take a moment to check the time and internalize what happened. Isn’t this amazing?

Take your time to explore the UI, trigger more runs, check the logs of other tasks, and make yourself familiar with the interface. Feel free to read the Airflow UI guide for a deep dive into its different views and functionality.

Step 3: Understand the basic concepts

Once you’ve finished your exploration, switch back to the Astro IDE and have a look at the Python code inside example_etl_galaxies.py. The code contains a lot of comments explaining each step in detail. However, let’s get an overview before you dive into details.

The Python file contains the following key elements:

  • Imports: All modules, classes, and functions needed for your implementation. Always use the Airflow Task SDK by importing from airflow.sdk, as this is the user-facing SDK.
  • Constants: Any constant values, in our case the connection string for our DuckDB instance.
  • Dag definition: The data pipeline together with settings like its schedule.
  • Tasks: The units of work. Tasks should be atomic and idempotent (producing the same result when run multiple times with the same inputs).
  • Dependencies: We define how the tasks are connected, so Airflow knows how to construct the graph.
# imports
from airflow.sdk import Asset, chain, Param, dag, task
# ...

# constants
_DUCKDB_INSTANCE_NAME = os.getenv("DUCKDB_INSTANCE_NAME", "include/astronomy.db")
_DUCKDB_TABLE_NAME = os.getenv("DUCKDB_TABLE_NAME", "galaxy_data")
_DUCKDB_TABLE_URI = f"duckdb://{_DUCKDB_INSTANCE_NAME}/{_DUCKDB_TABLE_NAME}"
# ...

# Dag definition
@dag(...)
def example_etl_galaxies():

    # tasks
    @task(retries=2)
    def create_galaxy_table_in_duckdb(...):
        # ...

    @task
    def extract_galaxy_data(...):
        # ...

    @task
    def transform_galaxy_data(...):
        # ...

    @task
    def load_galaxy_data(...):
        # ...

    @task
    def print_loaded_galaxies(...):
        # ...

    # create task instances and define implicit dependencies
    create_galaxy_table_in_duckdb_obj = create_galaxy_table_in_duckdb()
    extract_galaxy_data_obj = extract_galaxy_data()
    transform_galaxy_data_obj = transform_galaxy_data(extract_galaxy_data_obj)
    load_galaxy_data_obj = load_galaxy_data(transform_galaxy_data_obj)

    # define explicit dependencies
    chain(
        create_galaxy_table_in_duckdb_obj, load_galaxy_data_obj, print_loaded_galaxies()
    )


# Instantiate the Dag
example_etl_galaxies()
Airflow Concepts
  • Dag: Your entire pipeline from start to finish, consisting of one or more tasks.
  • Task: A unit of work within your pipeline.
  • Operator/Decorator: The template/class that defines what work a task does, serving as the building blocks of pipelines.
    • Traditional: task = PythonOperator(...) → returns the operator directly.
    • TaskFlow API: @task def my_task(): ... → creates the operator, wrapped in an XComArg (see the sketch after this list).
    • Many decorators available: @task, @task.bash, @task.docker, @task.kubernetes, etc.
    • XComArg: Wrapper enabling automatic data passing and dependency inference.
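To make the Traditional vs TaskFlow comparison above concrete, here is a minimal sketch; the say_hello tasks are hypothetical, and it assumes Airflow 3, where PythonOperator ships in the standard provider:

# A minimal sketch contrasting the two styles (hypothetical say_hello tasks).
# Assumes Airflow 3, where PythonOperator lives in the standard provider.
from airflow.providers.standard.operators.python import PythonOperator
from airflow.sdk import dag, task


def _say_hello():
    print("hello from the traditional style")


@dag(schedule=None)
def style_comparison():

    # Traditional style: instantiate the operator class directly.
    say_hello_traditional = PythonOperator(
        task_id="say_hello_traditional",
        python_callable=_say_hello,
    )

    # TaskFlow style: the decorator builds the operator for you, and calling
    # the function returns an XComArg that also carries dependency information.
    @task
    def say_hello_taskflow():
        print("hello from the TaskFlow style")

    say_hello_traditional >> say_hello_taskflow()


style_comparison()

Both styles end up as tasks in the same graph; the TaskFlow version simply removes boilerplate and makes passing return values between tasks automatic.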

Step 4: Extend the demo project

Let’s level up! Now that you’ve run your first Dag, we’ll extend the project by adding a second Dag that builds on top of the first one.

We’ll create a galaxy_maintenance Dag that allows you to manually enter new galaxy data through an interactive form. The data will be automatically added to the database and validated with automated quality checks.

What you’ll learn:

By the end of this section, you’ll have a powerful toolbox of concepts to explore Airflow further and confidently jump into your first real-world ETL/ELT project!

Step 4.1: Add provider packages

The example_etl_galaxies Dag currently connects directly to the DuckDB database using:

cursor = duckdb.connect(duckdb_instance_name)

While this works, Airflow offers a better approach: common SQL operators that execute queries using Airflow connections. This unifies and simplifies SQL workloads across your pipelines. Let’s set this up properly.
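As a preview of the pattern we are building toward (the full Dag follows in Step 4.4), a query run through an Airflow connection is just an operator plus a conn_id. This sketch assumes the duckdb_astronomy connection you will create in Step 4.2:

from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

count_galaxies = SQLExecuteQueryOperator(
    task_id="count_galaxies",  # hypothetical task, not part of the template
    conn_id="duckdb_astronomy",  # Airflow connection created in Step 4.2
    sql="SELECT COUNT(*) FROM galaxy_data;",
    show_return_value_in_logs=True,
)

Because the operator only references a connection ID, swapping DuckDB for another SQL database later is mostly a matter of changing the connection.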

Airflow’s core functionality can be extended with provider packages for specific use cases. We need two providers for our DuckDB connection.

  1. Open the requirements.txt file in the Astro IDE.

  2. Add the following lines at the bottom:

    apache-airflow-providers-common-sql==1.28.2
    airflow-provider-duckdb==0.2.0
  3. Since we added new dependencies, we need to sync the changes. Click on Sync to Test and wait for the changes to be deployed.

Step 4.2: Set up a connection

An Airflow connection stores configuration details for connecting to external tools in your data ecosystem. Most hooks (what is a hook?) and operators that interact with external systems require a connection.
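A hook resolves the connection by its ID at runtime, so no paths or credentials live in your task code. Here is a minimal sketch, assuming the DuckDB provider exposes a DuckDBHook under duckdb_provider.hooks.duckdb_hook (check the airflow-provider-duckdb documentation for the exact import path and keyword name):

# Minimal sketch: using the connection created below via the provider's hook.
# The import path and keyword name are assumptions based on the
# airflow-provider-duckdb package; verify them against its documentation.
from airflow.sdk import task
from duckdb_provider.hooks.duckdb_hook import DuckDBHook


@task
def count_galaxies_with_hook():
    hook = DuckDBHook(duckdb_conn_id="duckdb_astronomy")  # the Connection ID created below
    conn = hook.get_conn()
    print(conn.execute("SELECT COUNT(*) FROM galaxy_data").fetchone())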

To create the connection:

  1. Open Airflow and click Admin in the left navbar

  2. Select Connections

  3. Click Add Connection (top right)

  4. Enter the following details:

    • Connection ID: duckdb_astronomy
    • Connection Type: DuckDB
    • Host: include/astronomy.db
    • Keep the rest empty

    Add connection to DuckDB database

  5. Save the connection and you’re now ready to connect! You can find the Airflow tasks that use this connection in the example code in Step 4.4.

Astro Concepts

We just added a connection to our deployment (a single Airflow instance). If we deployed our Dags to another environment or recreated the test deployment, we’d need to add the connection again. Astro offers a helpful solution: under Environment → Connections in the Astro platform, you can set up workspace-wide connections that are available across all your Airflow instances. See Manage Airflow connections and variables in the Astro documentation.

Airflow Concepts
  • Provider package: Provider packages are installable modules that contain pre-built decorators, operators, hooks, and sensors for integrating with external services and extending Airflow functionality.
  • Connection: Connections in Airflow are sets of configurations used to connect with other tools in the data ecosystem.

Step 4.3: Prepare test deployment for advanced usage

The test deployment is a fully functional but minimal Airflow setup. To enable advanced features like asset-aware scheduling (explained later), we need to apply a quick configuration change.

  1. In the Astro IDE, click the dropdown next to Sync to Test (top right).
  2. Select Test Deployment Details.
  3. Navigate to the Environment tab, click Edit Deployment Variables, and remove AIRFLOW__SCHEDULER__USE_JOB_SCHEDULE by clicking the trash bin icon next to it.
  4. Click Update Environment Variables (bottom right) and you’re ready to go! Head back to the Astro IDE.

Change test deployment environment

Step 4.4: Implement Dag with human-in-the-loop

  1. Within the Astro IDE, create a new file by right-clicking on the dags folder → New File… and name it galaxy_maintenance.py.

  2. Paste the following content:

    from airflow.sdk import chain, dag, Asset, Param
    from airflow.providers.standard.operators.hitl import HITLEntryOperator
    from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator, SQLColumnCheckOperator

    _DUCKDB_TABLE_URI = "duckdb://include/astronomy.db/galaxy_data"
    _DUCKDB_CONN_ID = "duckdb_astronomy"

    galaxy_table_asset = Asset(_DUCKDB_TABLE_URI)


    @dag(schedule=galaxy_table_asset)
    def galaxy_maintenance():

        _enter_galaxy_details = HITLEntryOperator(
            task_id="enter_galaxy_details",
            subject="Please provide required information: ",
            params={
                "name": Param("", type="string"),
                "distance_from_milkyway": Param(10000, type="number"),
                "distance_from_solarsystem": Param(10000, type="number"),
                "type_of_galaxy": Param("Dwarf", type="string", enum=[
                    "Dwarf Spheroidal",
                    "Dwarf",
                    "Irregular",
                    "Spiral"
                ]),
                "characteristics": Param("", type="string")
            }
        )

        _insert_galaxy_details = SQLExecuteQueryOperator(
            task_id="insert_galaxy_details",
            conn_id=_DUCKDB_CONN_ID,
            show_return_value_in_logs=True,
            sql="""
                -- in case db was removed due to sync
                CREATE TABLE IF NOT EXISTS galaxy_data (
                    name STRING PRIMARY KEY,
                    distance_from_milkyway INT,
                    distance_from_solarsystem INT,
                    type_of_galaxy STRING,
                    characteristics STRING
                );
                INSERT OR IGNORE INTO galaxy_data BY NAME
                SELECT
                    $name AS name,
                    $distance_from_milkyway AS distance_from_milkyway,
                    $distance_from_solarsystem AS distance_from_solarsystem,
                    $type_of_galaxy AS type_of_galaxy,
                    $characteristics AS characteristics
            """,
            parameters={
                "name": "{{ task_instance.xcom_pull('enter_galaxy_details')['params_input']['name'] }}",
                "distance_from_milkyway": "{{ task_instance.xcom_pull('enter_galaxy_details')['params_input']['distance_from_milkyway'] }}",
                "distance_from_solarsystem": "{{ task_instance.xcom_pull('enter_galaxy_details')['params_input']['distance_from_solarsystem'] }}",
                "type_of_galaxy": "{{ task_instance.xcom_pull('enter_galaxy_details')['params_input']['type_of_galaxy'] }}",
                "characteristics": "{{ task_instance.xcom_pull('enter_galaxy_details')['params_input']['characteristics'] }}"
            }
        )

        _galaxy_dq_checks = SQLColumnCheckOperator(
            task_id="dq_checks",
            conn_id=_DUCKDB_CONN_ID,
            table="galaxy_data",
            column_mapping={
                "distance_from_milkyway": {
                    "min": {"geq_to": 10000},
                    "max": {"leq_to": 900000},
                },
                "distance_from_solarsystem": {
                    "min": {"geq_to": 10000},
                    "max": {"leq_to": 900000},
                },
            },
        )

        chain(_enter_galaxy_details, _insert_galaxy_details, _galaxy_dq_checks)


    galaxy_maintenance()

    This maintenance pipeline is triggered automatically whenever the galaxy data table is updated. It allows manual entry of new galaxy data through a human-in-the-loop interface, inserts the data into DuckDB, and runs data quality checks to ensure the values are within acceptable ranges.

    Graph representation of the maintenance Dag

    Tasks Breakdown

    • enter_galaxy_details: Pauses the pipeline and prompts a user to manually enter galaxy information (name, distances, type, and characteristics) through a form interface.
    • insert_galaxy_details: Inserts the user-provided galaxy data into the DuckDB table using the values collected from the previous task.
    • dq_checks: Validates the data quality by checking that distance values are within acceptable ranges (between 10,000 and 900,000 light years).

    Task Dependencies

    • enter_galaxy_details → insert_galaxy_details (user input needed before insertion)
    • insert_galaxy_details → dq_checks (data must be inserted before validation)
    Airflow Concepts
    • Human-in-the-loop: Human-in-the-loop workflows are processes that require human intervention, for example to approve or reject an AI-generated output, or to choose a branch in a Dag depending on the result of an upstream task.
    • SQL operators: The common SQL provider is a great place to start when looking for SQL-related operators. It includes the SQLExecuteQueryOperator, a generic operator that works with a variety of databases, including Snowflake and Postgres. It also comes with data quality operators, like the SQLColumnCheckOperator.
    • Parameters: You can use parameters to write dynamic queries with placeholders. These are handled at the database-driver level.
  3. Click Sync to Test (top right) to sync your changes to the test deployment.

  4. Once the sync process finishes, head back to the Airflow UI.

  5. Open the Dags view, and a new Dag should appear in the list.

Notice how the schedule is set to be triggered whenever the asset named duckdb://include/astronomy.db/galaxy_data is updated.

Second Dag in the Dags view

Our first Dag updates this asset when data is loaded to DuckDB by using the outlets parameter:

@task(outlets=[Asset(_DUCKDB_TABLE_URI)])
def load_galaxy_data(
    filtered_galaxy_df: pd.DataFrame,
    duckdb_instance_name: str = _DUCKDB_INSTANCE_NAME,
    table_name: str = _DUCKDB_TABLE_NAME,
):
    # ...
Airflow Concepts
  • Asset (object): Logical representation of data (a table, model, or file) used to establish dependencies. Can be used imperatively (code-based) or declaratively (implicit definition via the @asset decorator).
  • Asset event: Each time an asset is updated, the system creates an asset event object. This object includes the ID of the Dag that produced the update, the update timestamp, and optional custom information.
  • Asset-aware scheduling: Set the schedule of a Dag to one or more assets, optionally with a logical expression using AND (&) and OR (|) operators, so that the Dag is triggered when these assets receive asset update events.
  • Producer task: A task that produces updates to one or more assets provided to its outlets parameter, creating asset events when it completes successfully.
  • Materialize: Running a producer task, which updates an asset.
  • @asset: Declarative shortcut (Dag + task + asset(s) in one); see the sketch after this list.
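To tie these concepts together, here is a minimal sketch of an asset expression schedule and the @asset shortcut; the asset URIs and Dag names are hypothetical and not part of the template:

# Hypothetical assets and Dags, shown only to illustrate the concepts above.
from airflow.sdk import Asset, asset, dag, task

raw_galaxies = Asset("duckdb://include/astronomy.db/raw_galaxies")
galaxy_stats = Asset("duckdb://include/astronomy.db/galaxy_stats")


@asset(schedule="@daily")
def galaxy_snapshot():
    # Declarative shortcut: one definition creates a Dag, a task, and an
    # asset named "galaxy_snapshot" that is updated when the task succeeds.
    return {"galaxies_seen": 20}


@dag(schedule=(raw_galaxies & galaxy_stats))  # run only when BOTH assets receive new events
def combine_galaxy_reports():

    @task
    def report():
        print("Both upstream assets were refreshed.")

    report()


combine_galaxy_reports()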

Step 4.5: Try your advanced Dag

Time to see asset-aware scheduling and your new Dag in action!

We store our DuckDB database in a project file (include/astronomy.db). The example_etl_galaxies Dag creates a table in this database, but the file isn’t included in our auto-generated project repository. As a result, each time we sync changes to the deployment, the database file disappears. To handle this, the insert_galaxy_details task in the second Dag uses CREATE TABLE IF NOT EXISTS in case the database file was removed between runs. To improve this, we could use a persistent database service, for example Snowflake or BigQuery.

  1. Trigger example_etl_galaxies and observe what happens.

    You’ll notice that galaxy_maintenance starts as soon as example_etl_galaxies finishes. More precisely, it starts when example_etl_galaxies updates the asset that galaxy_maintenance is scheduled on.

  2. Once galaxy_maintenance is running, open the latest run and you’ll notice there’s a required action. This is part of the human-in-the-loop feature: your task is waiting for user input.

    Required actions for a Dag run

  3. Take time to explore the Airflow UI and see where these required actions are visible!

  4. Open the required action to see the form we defined in the code, and enter the following details:

    • name: Astro
    • distance_from_milkyway: 10000
    • distance_from_solarsystem: 10000
    • type_of_galaxy: Dwarf
    • characteristics: Looks amazing

    Human-in-the-loop form

  5. Click OK and observe how the pipeline proceeds. Pay close attention to the dq_checks task, which successfully validates the data.

  6. Try it again by running galaxy_maintenance once more. This time, enter 42 as the distance and observe how the dq_checks task fails because the data quality check detected an issue with your galaxy data.

Conclusion and next steps

Congratulations 🎉! You’ve just built two interconnected data pipelines using Apache Airflow, and along the way you’ve learned the fundamental concepts that power modern data orchestration.

In this tutorial, you:

  • Set up a complete Airflow development environment in minutes using Astro IDE
  • Built and ran your first ETL pipeline with extraction, transformation, and loading steps
  • Mastered core Airflow concepts: Dags, tasks, operators, and dependencies
  • Extended your project with provider packages and database connections
  • Implemented human-in-the-loop workflows for manual data entry
  • Added automated data quality checks to ensure data integrity
  • Used asset-aware scheduling to create a dependency between two Dags

Ready to dive deeper?

Explore more guides:

Join the academy and get certified: