Stop Creating Bad DAGs — Optimize Your Airflow Environment By Improving Your Python Code

By Alvaro Leandro Cavalcante Carneiro | January 31, 2025



Apache Airflow is among the most popular orchestration tools in the data field, powering workflows for companies worldwide. However, anyone who has worked with Airflow in a production environment, especially a complex one, knows that it can occasionally present problems and weird bugs.

Among the many components you need to manage in an Airflow environment, one critical metric often flies under the radar: DAG parse time. Monitoring and optimizing parse time is essential to avoid performance bottlenecks and ensure the correct functioning of your orchestrations, as we'll explore in this article.

That said, this tutorial introduces airflow-parse-bench, an open-source tool I developed to help data engineers monitor and optimize their Airflow environments, providing insights to reduce code complexity and parse time.

When it comes to Airflow, DAG parse time is often an overlooked metric. Parsing occurs every time Airflow processes your Python files to build the DAGs dynamically.

By default, all your DAGs are parsed every 30 seconds, a frequency controlled by the configuration variable min_file_process_interval. This means that every 30 seconds, all the Python code in your dags folder is read, imported, and processed to generate DAG objects containing the tasks to be scheduled. Successfully processed files are then added to the DAG Bag.

Two key Airflow components handle this process:

• DagFileProcessorManager: runs a loop that determines which Python files need to be processed.
• DagFileProcessorProcess: works on an individual file and turns it into one or more DAG objects.

Together, both components (commonly known as the dag processor) are executed by the Airflow Scheduler, ensuring that your DAG objects are updated before being triggered. However, for scalability and security reasons, it is also possible to run your dag processor as a separate component in your cluster.
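For reference, on Airflow 2.3 or later (an assumption; this mode isn't available on older versions), the standalone dag processor can be started as its own service:

# Requires standalone_dag_processor = True in the [scheduler] config section
airflow dag-processor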

If your environment only has a few dozen DAGs, the parsing process is unlikely to cause any problems. However, it's common to find production environments with hundreds or even thousands of DAGs. In this case, if your parse time is too high, it can lead to:

• Delayed DAG scheduling.
• Increased resource utilization.
• Environment heartbeat issues.
• Scheduler failures.
• Excessive CPU and memory usage, wasting resources.

Now, imagine an environment with hundreds of DAGs containing unnecessarily complex parsing logic. Small inefficiencies can quickly turn into significant problems, affecting the stability and performance of your entire Airflow setup.

When writing Airflow DAGs, there are some important best practices to keep in mind. Although you can find plenty of tutorials on how to improve your DAGs, I'll summarize some of the key principles that can significantly enhance DAG performance.

Limit Top-Level Code

One of the most common causes of high DAG parsing times is inefficient or complex top-level code. Top-level code in an Airflow DAG file is executed every time the Scheduler parses the file. If this code includes resource-intensive operations, such as database queries, API calls, or dynamic task generation, it can significantly impact parsing performance.

The following code shows an example of a non-optimized DAG:
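A minimal sketch of such a DAG (assuming Airflow 2.x; the API endpoint, column name, and IDs are hypothetical), with an API request and DataFrame processing at the top level:

import requests
import pandas as pd
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Top-level code: executed on every parse cycle, not just when the task runs
response = requests.get("https://api.example.com/data")  # hypothetical endpoint
df = pd.DataFrame(response.json())
active_rows = df[df["status"] == "active"]  # hypothetical filtering logic

def process_data():
    print(f"Processing {len(active_rows)} rows")

with DAG(
    dag_id="non_optimized_dag",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
) as dag:
    PythonOperator(task_id="process_data", python_callable=process_data)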

In this case, every time the file is parsed by the Scheduler, the top-level code is executed, making an API request and processing the DataFrame, which can significantly increase the parse time.

Another important factor contributing to slow parsing is top-level imports. Every library imported at the top level is loaded into memory during parsing, which can be time-consuming. To avoid this, you can move imports into functions or task definitions.

The following code shows a better version of the same DAG:
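Continuing the sketch above under the same assumptions, both the imports and the expensive work move inside the callable, so parsing only builds the DAG structure:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def process_data():
    # Imports and I/O now happen only at task runtime, not during parsing
    import requests
    import pandas as pd

    response = requests.get("https://api.example.com/data")  # hypothetical endpoint
    df = pd.DataFrame(response.json())
    active_rows = df[df["status"] == "active"]
    print(f"Processing {len(active_rows)} rows")

with DAG(
    dag_id="optimized_dag",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
) as dag:
    PythonOperator(task_id="process_data", python_callable=process_data)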

Avoid Xcoms and Variables in Top-Level Code

Still on the same topic, it's particularly important to avoid using Xcoms and Variables in your top-level code. As stated in Google's documentation:

If you are using Variable.get() in top level code, every time the .py file is parsed, Airflow executes a Variable.get() which opens a session to the DB. This can dramatically slow down parse times.

To address this, consider using a JSON dictionary to retrieve multiple variables in a single database query, rather than making several Variable.get() calls. Alternatively, use Jinja templates, as variables retrieved this way are only processed during task execution, not during DAG parsing.
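As a brief sketch of both options (the variable name and JSON keys are hypothetical):

from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash import BashOperator

# One DB query instead of several: store related values as a single JSON
# Variable, e.g. {"bucket": "my-bucket", "region": "us-east-1"}
config = Variable.get("my_dag_config", deserialize_json=True)  # still top-level, but one query

with DAG(dag_id="variables_demo", start_date=datetime(2025, 1, 1), schedule=None) as dag:
    # Better still: Jinja templates are rendered only at task execution time,
    # so DAG parsing never opens a database session at all
    BashOperator(
        task_id="print_bucket",
        bash_command="echo {{ var.json.my_dag_config.bucket }}",
    )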

Remove Unnecessary DAGs

Although it seems obvious, it's always worth remembering to periodically clean up unnecessary DAGs and files from your environment:

• Remove unused DAGs: Check your dags folder and delete any files that are no longer needed.
• Use .airflowignore: Specify the files Airflow should intentionally ignore, skipping parsing (see the example below).
• Review paused DAGs: Paused DAGs are still parsed by the Scheduler, consuming resources. If they are no longer required, consider removing or archiving them.
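For instance, a minimal .airflowignore (the patterns are hypothetical; by default each line is a regular expression matched against paths under the dags folder):

# .airflowignore
legacy_dags/
scratch_.*\.py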

    Change Airflow Configurations

Finally, you can change some Airflow configurations to reduce the Scheduler's resource usage (a configuration sketch follows the list):

• min_file_process_interval: This setting controls how often (in seconds) Airflow parses your DAG files. Increasing it from the default 30 seconds can reduce the Scheduler's load at the cost of slower DAG updates.
• dag_dir_list_interval: This determines how often (in seconds) Airflow scans the dags directory for new DAGs. If you deploy new DAGs infrequently, consider increasing this interval to reduce CPU usage.
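As a sketch, both settings live in the [scheduler] section of airflow.cfg (the values below are illustrative, not recommendations):

[scheduler]
min_file_process_interval = 120
dag_dir_list_interval = 600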

We've talked a lot about the importance of writing optimized DAGs to maintain a healthy Airflow environment. But how do you actually measure the parse time of your DAGs? Fortunately, there are several ways to do this, depending on your Airflow deployment and operating system.

For example, if you have a Cloud Composer deployment, you can easily retrieve a DAG parse report by executing the following command with the Google CLI:

gcloud composer environments run $ENVIRONMENT_NAME \
    --location $LOCATION \
    dags report

While retrieving parse metrics is straightforward, measuring the effectiveness of your code optimizations can be less so. Every time you modify your code, you need to redeploy the updated Python file to your cloud provider, wait for the DAG to be parsed, and then extract a new report, a slow and time-consuming process.

Another possible approach, if you're on Linux or Mac, is to run this command to measure the parse time locally on your machine:

time python airflow/example_dags/example.py

However, while simple, this approach is not practical for systematically measuring and comparing the parse times of multiple DAGs.

To address these challenges, I created airflow-parse-bench, a Python library that simplifies measuring and comparing the parse times of your DAGs using Airflow's native parse method.

The airflow-parse-bench tool makes it easy to store parse times, compare results, and standardize comparisons across your DAGs.

Installing the Library

Before installing, it's recommended to use a virtualenv to avoid library conflicts. Once that's set up, you can install the package by running the following command:

pip install airflow-parse-bench

Note: This command only installs the essential dependencies (related to Airflow and Airflow providers). You must manually install any additional libraries your DAGs depend on.

For example, if a DAG uses boto3 to interact with AWS, make sure boto3 is installed in your environment. Otherwise, you will encounter parse errors.

After that, you need to initialize your Airflow database by executing the following command:

    airflow db init

In addition, if your DAGs use Airflow Variables, you must define them locally as well. You don't need to put real values in the variables, as the actual values aren't required for parsing purposes:

    airflow variables set MY_VARIABLE 'ANY TEST VALUE'

Without this, you'll encounter an error like:

    error: 'Variable MY_VARIABLE doesn't exist'

Using the Tool

After installing the library, you can begin measuring parse times. For example, suppose you have a DAG file named dag_test.py containing the non-optimized DAG code used in the example above.

To measure its parse time, simply run:

    airflow-parse-bench --path dag_test.py

This execution produces the following output:

Execution result. Image by author.

As observed, our DAG had a parse time of 0.61 seconds. If I run the command again, I'll see some small variations, as parse times can vary slightly across runs due to system and environmental factors:

Result of another execution of the same DAG. Image by author.

To produce a more reliable number, it's possible to aggregate multiple executions by specifying the number of iterations:

    airflow-parse-bench --path dag_test.py --num-iterations 5

Although it takes a bit longer to finish, this calculates the average parse time across five executions.

Now, to evaluate the impact of the optimizations described above, I replaced the code in my dag_test.py with the optimized version shared earlier. After executing the same command, I got the following result:

Parse result of the optimized code. Image by author.

As you can see, just applying some good practices reduced the DAG parse time by almost 0.5 seconds, highlighting the importance of the changes we made!

There are a few other features that I think are worth sharing.

As a reminder, if you have any doubts or problems using the tool, you can access the complete documentation on GitHub.

Besides that, to view all the parameters supported by the library, simply run:

    airflow-parse-bench --help

Testing Multiple DAGs

Often, you'll have dozens of DAGs whose parse times you want to measure. To cover this use case, I created a folder named dags and put four Python files inside it.

To measure the parse times for all the DAGs in a folder, just specify the folder path in the --path parameter:

    airflow-parse-bench --path my_path/dags

Running this command produces a table summarizing the parse times for all the DAGs in the folder:

Testing the parse time of multiple DAGs. Image by author.

By default, the table is sorted from the fastest to the slowest DAG. However, you can reverse the order by using the --order parameter:

airflow-parse-bench --path my_path/dags --order desc

Inverted sorting order. Image by author.

    Skipping Unchanged DAGs

The --skip-unchanged parameter can be especially useful during development. As the name suggests, this option skips the parse execution for DAGs that haven't been modified since the last execution:

    airflow-parse-bench --path my_path/dags --skip-unchanged

As shown below, when the DAGs remain unchanged, the output shows no difference in parse times:

Output with no difference for unchanged files. Image by author.

    Resetting the Database

All DAG information, including metrics and history, is stored in a local SQLite database. If you want to clear all stored data and start fresh, use the --reset-db flag:

    airflow-parse-bench --path my_path/dags --reset-db

This command resets the database and processes the DAGs as if it were the first execution.

Parse time is an important metric for maintaining scalable and efficient Airflow environments, especially as your orchestration requirements become increasingly complex.

For this reason, the airflow-parse-bench library can be a valuable tool for helping data engineers create better DAGs. By testing your DAGs' parse time locally, you can easily and quickly find your code's bottlenecks, making your DAGs faster and more performant.

Since the code is executed locally, the resulting parse time won't be identical to the one in your Airflow cluster. However, if you can reduce the parse time on your local machine, the same improvement is likely to carry over to your cloud environment.

Finally, this project is open for collaboration! If you have suggestions, ideas, or improvements, feel free to contribute on GitHub.


