Apache Airflow: The Unsung Hero in Everyday Data Science

Introduction

In the dynamic world of Data Science, efficiency and automation are paramount. A productive Data Scientist strives to optimize every process so that data is clean, accurate, and readily available. Many Data Scientists rely heavily on Apache Airflow to build and maintain the tables that underpin their work across various domains and industries. Despite its immense utility, Airflow is often overlooked in conversations about becoming a Data Scientist. This article explores how Airflow is used in practical applications and why it deserves more recognition in the data science community.

Why Apache Airflow?

Apache Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring workflows. Workflows are defined in Python as directed acyclic graphs (DAGs) of tasks, which lets users automate complex processes and ensures tasks are completed accurately and on time. For Data Scientists, this means more time can be spent on analysis and less on manual data management.
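
To make that concrete, here is a minimal sketch of what an Airflow DAG looks like, assuming Airflow 2.x; the dag_id, schedule, and placeholder task are purely illustrative and not part of any real pipeline:

from datetime import datetime

from airflow.models import DAG
from airflow.operators.python import PythonOperator

# Placeholder task: in a real pipeline this would pull data from a source
# and load it into a table.
def extract_and_load(**kwargs):
    print("Running the daily extract-and-load step")

# The DAG object ties tasks to a schedule; Airflow's scheduler and UI then
# handle running and monitoring each day's run.
dag = DAG(
    dag_id="daily_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 6 * * *",  # every day at 06:00
    catchup=False,
)

extract_task = PythonOperator(
    task_id="extract_and_load",
    python_callable=extract_and_load,
    dag=dag,
)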

Use Cases

Aggregating Metrics for Different Stakeholders

One of the primary uses of Airflow is to aggregate metrics from different domains into domain-specific tables. For instance, it can compile category data, business intelligence data, and cohort and lifecycle metrics, among others. Airflow orchestrates these tasks seamlessly, ensuring each table is refreshed with the latest information and that stakeholders receive accurate and timely insights.
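
One way to structure this is a single DAG that fans out into one aggregation task per domain. The sketch below assumes Airflow 2.x; the domain names, dag_id, and placeholder aggregation function are illustrative, not the article's actual pipeline:

from datetime import datetime

from airflow.models import DAG
from airflow.operators.python import PythonOperator

# Placeholder: run the domain-specific aggregation query and write the
# result to that domain's table.
def aggregate_domain(domain, **kwargs):
    print(f"Aggregating metrics for the {domain} domain")

dag = DAG(
    dag_id="domain_metrics",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 5 * * *",  # every day at 05:00
    catchup=False,
)

# One task per stakeholder domain; they have no dependencies on each other,
# so Airflow can run them in parallel.
for domain in ["category", "business_intelligence", "cohort_lifecycle"]:
    PythonOperator(
        task_id=f"aggregate_{domain}",
        python_callable=aggregate_domain,
        op_kwargs={"domain": domain},
        dag=dag,
    )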

Automating Reports

Reporting is a critical part of any Data Scientist’s job. Automating daily, weekly, and monthly reports with Airflow not only saves time but also ensures the reports are consistent and accurate. Airflow lets users define schedules and task dependencies, so reports are generated and distributed without manual intervention. This automation is crucial for maintaining a smooth flow of information across the organization.
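
A minimal sketch of an automated weekly report follows, assuming Airflow 2.x with SMTP configured for email; the dag_id, recipient address, and report contents are illustrative:

from datetime import datetime

from airflow.models import DAG
from airflow.operators.email import EmailOperator
from airflow.operators.python import PythonOperator

# Placeholder: query the reporting tables and render the weekly summary.
# The return value is pushed to XCom automatically.
def build_report(**kwargs):
    return "<p>Weekly metrics summary goes here.</p>"

dag = DAG(
    dag_id="weekly_report",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 8 * * 1",  # every Monday at 08:00
    catchup=False,
)

generate = PythonOperator(
    task_id="build_report",
    python_callable=build_report,
    dag=dag,
)

send = EmailOperator(
    task_id="send_report",
    to=["stakeholders@example.com"],
    subject="Weekly metrics report",
    html_content="{{ ti.xcom_pull(task_ids='build_report') }}",  # body from the previous task
    dag=dag,
)

generate >> send  # the email goes out only after the report is built

Because the callable's return value is pushed to XCom, the EmailOperator can template it straight into the message body, keeping the generation and distribution steps cleanly separated.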

Backfilling Tables

Backfilling tables, especially those spanning multiple years, can be a daunting task. Airflow handles this through its catchup setting and its backfill command, which re-run a DAG for every past schedule interval so that historical data is accurately captured and integrated. This capability is essential for comprehensive data analysis and decision-making.
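
Here is a sketch of a backfill-friendly DAG, assuming Airflow 2.x; the dag_id and placeholder loading task are illustrative. Setting catchup=True tells the scheduler to create a run for every schedule interval between start_date and now:

from datetime import datetime

from airflow.models import DAG
from airflow.operators.python import PythonOperator

# Placeholder: load the partition for this run's logical date.
def load_partition(**kwargs):
    print(f"Loading data for {kwargs['ds']}")  # 'ds' is the run's logical date, e.g. 2020-01-01

dag = DAG(
    dag_id="historical_load",
    start_date=datetime(2020, 1, 1),  # several years in the past
    schedule_interval="@daily",
    catchup=True,  # create a run for every missed daily interval since start_date
)

load = PythonOperator(
    task_id="load_partition",
    python_callable=load_partition,
    dag=dag,
)

Ad-hoc backfills over a specific date range can also be triggered from the command line with the airflow dags backfill command, passing a start and end date.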

Breakdown of an Example DAG

**This is not a real DAG; it is a simplified, illustrative example.**

# Importing necessary libraries for the DAG
from datetime import datetime, timedelta
from airflow.models import DAG
from tahoe_api.api import table_delete, table_create, table_rename
from contrib.shared.utils.airflow_base.operators import TahoeQueryOperator
from contrib.shared.utils.airflow_base.utilities import AirflowUtilities
from contrib.shared import TAHOE_POOL, TAHOE_QUEUE

# Check if the DAG is running in a testing environment
DEBUG_MODE = AirflowUtilities.is_testing_host()

# Setting execution pool and queue based on environment (testing or production)
EXECUTION_POOL = TAHOE_POOL.TESTING if DEBUG_MODE else TAHOE_POOL.PROD
EXECUTION_QUEUE = TAHOE_QUEUE.TESTING if DEBUG_MODE else TAHOE_QUEUE.PROD

# Configuring default arguments for the DAG
default_args = AirflowUtilities.update_default_args(
    owner="user",
    queue=EXECUTION_QUEUE,
    pool=EXECUTION_POOL,
    start_date=datetime(2023, 1, 22),
    email=["user@example.com"]  # Email notifications for DAG issues
)

# Database and table settings based on environment
TARGET_DB = 'test' if DEBUG_MODE else 'actual_schema'
TARGET_TABLE = 'category_table'
TARGET_TABLE_TMP = 'category_table_tmp'  # Temporary table for new data

# Define the DAG: schedule and configuration
dag = DAG(
    dag_id="category_dag",
    default_args=default_args,
    schedule_interval="0 7 * * *"  # Scheduled to run daily at 07:00 AM
)

# Schema for the temporary table
SCHEMA = [
    ["product_id", "string"],
    ["product_category_name", "string"]
]

# Function to handle pre-processing tasks (e.g., setting up temporary tables)
def pre_process(**kwargs):
    table_delete(TARGET_DB, TARGET_TABLE_TMP)
    table_create(TARGET_DB, TARGET_TABLE_TMP, SCHEMA)

# Function to handle post-processing tasks (e.g., updating main tables)
def post_process(results, **kwargs):
    table_delete(TARGET_DB, TARGET_TABLE)
    table_rename(TARGET_DB, TARGET_TABLE_TMP, TARGET_TABLE)

# Builds the SQL query for inserting data into the temporary table
def build_insert_query(**kwargs):
    sql = """
        INSERT INTO {db}.{target}
        WITH new_products AS (
            SELECT product_id
            FROM new_product_table
        )
        SELECT a.product_id, b.product_category_name
        FROM new_products a
        LEFT JOIN category_mapping b ON a.product_id = b.product_id
    """.format(db=TARGET_DB, target=TARGET_TABLE_TMP)
    return sql

# Operator to execute the query, with defined pre and post-processing tasks
product_categories = TahoeQueryOperator(
    task_id='product_category_mapping',
    dag=dag,
    build_query=build_insert_query,
    pre_process=pre_process,
    post_process=post_process,
    db_engine='presto',
    dependencies=[]
)

This DAG runs every day at 7 AM to keep the category_table in the database fresh and accurate. It first deletes and recreates a temporary table, fills it with updated data by running a SQL query, and then swaps it into place by dropping the old table and renaming the temporary one. Because the new data is fully built before the old table is replaced, downstream consumers never query a partially loaded table. This process ensures that category data is always current, which is crucial for smooth operations.

Why Airflow is Overlooked

Despite its powerful capabilities, Airflow is not often spoken about in data science circles. Many budding data scientists focus on learning algorithms and modeling techniques, overlooking the importance of data management and automation. However, clean, easy-to-work-with data is foundational to any data science project. Without it, the full potential of data science skills cannot be realized.

Airflow simplifies the complexity of data workflows, ensuring data is always in the right place at the right time. It allows data scientists to focus on what they do best — analyzing data and deriving insights.

Final Thoughts

Apache Airflow is a versatile and powerful tool that deserves more attention in the data science community. Its ability to automate and streamline data workflows is invaluable, especially in fast-paced environments. With Airflow in place, data stays accurate and up-to-date, allowing data scientists to be their true “Data Scientist” selves. If you haven’t explored Airflow yet, give it a try; it might just become your new favorite tool.

Embrace the power of automation and take your Data Science projects to the next level with Apache Airflow.

Clean data makes better Data Scientists!

Additional Resources

For those eager to dive deeper into Apache Airflow, here are some valuable resources and articles:

  1. Official Apache Airflow Documentation
    The official documentation is a comprehensive resource for understanding Airflow’s capabilities, installation processes, and advanced configurations.

  2. Airflow Summit Talks
    The Airflow Summit features numerous talks and tutorials by experts who share their experiences and best practices.
