    From Configuration to Orchestration: Building an ETL Workflow with AWS Is No Longer a Struggle

    By Editor Times Featured · June 20, 2025 · 8 Mins Read

    AWS continues to lead the cloud industry with a whopping 32% market share, thanks to its early market entry, strong technology and comprehensive service offerings. Nevertheless, many users find AWS challenging to navigate, and this discontent has led more companies and organisations to prefer its rivals, Microsoft Azure and Google Cloud Platform.

    Despite its steeper learning curve and less intuitive interface, AWS remains the top cloud provider thanks to its reliability, hybrid cloud capabilities and the broadest range of services. More importantly, choosing the right strategies can significantly reduce configuration complexity, streamline workflows, and improve performance.

    In this article, I'll introduce an efficient way to set up a complete ETL pipeline with orchestration on AWS, based on my own experience. It should also give you a refreshed view of data production on AWS, or make configuration feel like less of a struggle if this is your first time using AWS for such tasks.

    Strategy for Designing an Efficient Data Pipeline

    AWS has the most comprehensive ecosystem, with a vast range of services. Building a production-ready data warehouse on AWS requires at least the following services:

    • IAM – although this service isn't part of any step in the workflow, it's the foundation for accessing all the other services.
    • AWS S3 – data lake storage
    • AWS Glue – ETL processing
    • Amazon Redshift – data warehouse
    • CloudWatch – monitoring and logging

    You also need access to Airflow if you have to schedule more complex dependencies and perform advanced retries for error handling, although Redshift can handle some basic cron jobs.

    To make your work easier, I highly recommend installing an IDE (Visual Studio Code or PyCharm, or of course your own favourite IDE). An IDE dramatically improves your efficiency with complex Python code, local testing/debugging, version-control integration and team collaboration. In the next section, I'll show the step-by-step configuration.

    Initial Setup

    Here are the steps of the initial configuration:

    • Launch a virtual environment in your IDE
    • Install dependencies – basically, we need to install the libraries that will be used later on.
    pip install apache-airflow==2.7.0 boto3 pandas pyspark sqlalchemy
    • Install the AWS CLI – this step allows you to write scripts to automate various AWS operations and makes the management of AWS resources more efficient.
    • AWS configuration – run aws configure and make sure to enter these IAM user credentials when prompted:
      • AWS Access Key ID: from your IAM user.
      • AWS Secret Access Key: from your IAM user.
      • Default region: us-east-1 (or your preferred region)
      • Default output format: json.
    • Integrate Airflow – here are the steps:
      • Initialize Airflow
      • Create DAG files in Airflow
      • Run the web server at http://localhost:8080 (login: admin/admin)
      • Open another terminal tab and start the scheduler
    export AIRFLOW_HOME=$(pwd)/airflow
    airflow db init                   # initialize Airflow
    airflow users create \
      --username admin \
      --password admin \
      --firstname Admin \
      --lastname User \
      --role Admin \
      --email admin@example.com
    airflow webserver --port 8080     # run the webserver
    airflow scheduler                 # start the scheduler (in another terminal)
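
    Before moving on, it's worth confirming that the credentials you entered with aws configure are actually picked up. A minimal sketch using boto3 (already installed in the dependency step); it simply echoes the IAM identity and default region:

    import boto3

    # Quick sanity check that the AWS CLI credentials are visible to boto3
    session = boto3.Session()
    identity = session.client("sts").get_caller_identity()
    print(f"Authenticated as: {identity['Arn']}")
    print(f"Default region: {session.region_name}")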

    Development Workflow: COVID-19 Data Case Study

    I'm using JHU's public COVID-19 dataset (CC BY 4.0 licensed) for demonstration purposes. You can refer to the data here.

    The chart below shows the workflow from data ingestion to loading data into Redshift tables in the development environment.

    Development workflow created by author

    Data Ingestion

    In the first step of data ingestion to AWS S3, I processed the data by melting it to long format and converting the date format. I saved the data in Parquet format to improve storage efficiency, enhance query performance and reduce storage costs. The code for this step is below:

    import pandas as pd
    from datetime import datetime
    import os
    import boto3
    import sys
    
    def process_covid_data():
        try:
            # Load raw data
            url = "https://github.com/CSSEGISandData/COVID-19/raw/master/archived_data/archived_time_series/time_series_19-covid-Confirmed_archived_0325.csv"
            df = pd.read_csv(url)
            
            # --- Data Processing ---
            # 1. Melt to long format
            df = df.melt(
                id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
                var_name='date_str',
                value_name='confirmed_cases'
            )
            
            # 2. Convert dates (JHU format: MM/DD/YY) and drop unparseable rows
            df['date'] = pd.to_datetime(
                df['date_str'],
                format='%m/%d/%y',
                errors='coerce'
            )
            df = df.dropna(subset=['date'])
            
            # 3. Save as partitioned Parquet
            output_dir = "covid_processed"
            df.to_parquet(
                output_dir,
                engine='pyarrow',
                compression='snappy',
                partition_cols=['date']
            )
            
            # 4. Upload to S3
            s3 = boto3.client('s3')
            total_files = 0
            
            for root, _, files in os.walk(output_dir):
                for file in files:
                    local_path = os.path.join(root, file)
                    s3_path = os.path.join(
                        'raw/covid/',
                        os.path.relpath(local_path, output_dir)
                    )
                    s3.upload_file(
                        Filename=local_path,
                        Bucket='my-dev-bucket',
                        Key=s3_path
                    )
                total_files += len(files)
            
            print(f"Successfully processed and uploaded {total_files} Parquet files")
            print(f"Data covers from {df['date'].min()} to {df['date'].max()}")
            return True
    
        except Exception as e:
            print(f"Error: {str(e)}", file=sys.stderr)
            return False
    
    if __name__ == "__main__":
        process_covid_data()

    After running the Python code, you should be able to see the Parquet files in the S3 bucket, under the folder 'raw/covid/'.

    Screenshot by author
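
    If you prefer to verify the upload from code rather than the console, a small boto3 check along these lines works (bucket and prefix follow the example above):

    import boto3

    # List a few of the uploaded Parquet objects under raw/covid/
    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket="my-dev-bucket", Prefix="raw/covid/", MaxKeys=10)
    for obj in response.get("Contents", []):
        print(obj["Key"], obj["Size"])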

    ETL Pipeline Development

    AWS Glue is mainly used for ETL pipeline development. Although it can also be used for data ingestion even before the data has been loaded into S3, its strength lies in processing data once it's in S3 for data warehousing purposes. Here's the PySpark script for the data transformation:

    # transform_covid.py
    from pyspark.context import SparkContext
    from pyspark.sql.functions import *
    from awsglue.context import GlueContext
    
    glueContext = GlueContext(SparkContext.getOrCreate())
    df = glueContext.create_dynamic_frame.from_options(
        "s3",
        {"paths": ["s3://my-dev-bucket/raw/covid/"]},
        format="parquet"
    ).toDF()
    
    # Add transformations here
    df_transformed = df.withColumn("load_date", current_date())
    
    # Write to processed zone
    df_transformed.write.parquet(
        "s3://my-dev-bucket/processed/covid/",
        mode="overwrite"
    )
    Screenshot by author
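
    Once this script is registered as a Glue job (the job name below is my assumption; create the job in the Glue console or via your preferred tooling first), you can also trigger a run from Python instead of the console:

    import boto3

    # Start a run of the Glue job that wraps transform_covid.py
    glue = boto3.client("glue")
    run = glue.start_job_run(JobName="dev_covid_transformation")
    print(f"Started Glue job run: {run['JobRunId']}")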

    The next step is to load the data into Redshift. In the Redshift console, click on “Query Editor v2” on the left-hand side; there you can edit your SQL code and run the Redshift COPY.

    -- Create a table covid_data in the dev schema
    CREATE TABLE dev.covid_data (
        "Province/State" VARCHAR(100),
        "Country/Region" VARCHAR(100),
        "Lat" FLOAT8,
        "Long" FLOAT8,
        date_str VARCHAR(100),
        confirmed_cases FLOAT8
    )
    DISTKEY("Country/Region")
    SORTKEY(date_str);
    -- COPY data to Redshift
    COPY dev.covid_data (
        "Province/State",
        "Country/Region",
        "Lat",
        "Long",
        date_str,
        confirmed_cases
    )
    FROM 's3://my-dev-bucket/processed/covid/'
    IAM_ROLE 'arn:aws:iam::your-account-id:role/RedshiftLoadRole'
    REGION 'your-region'
    FORMAT PARQUET;

    Then you'll see the data successfully loaded into the data warehouse.

    Screenshot by author
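
    As an optional programmatic check of the load (a sketch only; the cluster identifier and database user below are placeholders), you can count the rows through the Redshift Data API:

    import boto3

    # Submit a row-count query via the Redshift Data API; no database driver needed
    client = boto3.client("redshift-data")
    resp = client.execute_statement(
        ClusterIdentifier="my-dev-cluster",   # placeholder: your cluster name
        Database="dev",
        DbUser="awsuser",                     # placeholder: your database user
        Sql="SELECT COUNT(*) FROM dev.covid_data;",
    )
    # Retrieve the result later with describe_statement / get_statement_result
    print(f"Submitted statement {resp['Id']}")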

    Pipeline Automation

    The easiest way to automate your data pipeline is to schedule jobs under Redshift Query Editor v2 by creating a stored procedure (I have a more detailed introduction to SQL stored procedures; you can refer to this article).

    CREATE OR REPLACE PROCEDURE dev.run_covid_etl()
    AS $$
    BEGIN
      TRUNCATE TABLE dev.covid_data;
      COPY dev.covid_data
      FROM 's3://simba-dev-bucket/raw/covid'
      IAM_ROLE 'arn:aws:iam::your-account-id:role/RedshiftLoadRole'
      REGION 'your-region'
      FORMAT PARQUET;
    END;
    $$ LANGUAGE plpgsql;
    Screenshot by author

    Alternatively, you can run Airflow for scheduled jobs.

    from datetime import datetime
    from airflow import DAG
    from airflow.providers.amazon.aws.operators.redshift_sql import RedshiftSQLOperator
    
    default_args = {
        'owner': 'data_team',
        'depends_on_past': False,
        'start_date': datetime(2023, 1, 1),
        'retries': 2
    }
    
    with DAG(
        'redshift_etl_dev',
        default_args=default_args,
        schedule_interval='@daily',
        catchup=False
    ) as dag:
    
        run_etl = RedshiftSQLOperator(
            task_id='run_covid_etl',
            redshift_conn_id='redshift_dev',
            sql='CALL dev.run_covid_etl()',
        )
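
    Note that redshift_conn_id='redshift_dev' refers to an Airflow connection you still need to create. One way to do that (a sketch with placeholder host and credentials; you can equally define the connection in the Airflow UI under Admin → Connections) is to register it programmatically:

    from airflow.models import Connection
    from airflow.settings import Session
    
    # Register a Redshift connection named "redshift_dev" in the Airflow metadata DB
    conn = Connection(
        conn_id="redshift_dev",
        conn_type="redshift",
        host="your-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
        schema="dev",
        login="awsuser",           # placeholder
        password="your-password",  # placeholder
        port=5439,
    )
    session = Session()
    session.add(conn)
    session.commit()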

    Production Workflow

    An Airflow DAG is powerful for orchestrating your entire ETL pipeline when there are many dependencies, and it's also good practice in a production environment.

    After creating and testing your ETL pipeline, you can automate your tasks in the production environment using Airflow.

    Production workflow created by author

    Here is a checklist of key preparation steps to support a successful deployment in Airflow:

    • Create the S3 bucket my-prod-bucket
    • Create the Glue job prod_covid_transformation in the AWS Console
    • Create the Redshift stored procedure prod.load_covid_data()
    • Configure Airflow
    • Configure SMTP for emails in airflow.cfg

    Then the deployment of the data pipeline in Airflow is:

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
    from airflow.providers.amazon.aws.operators.redshift_sql import RedshiftSQLOperator
    from airflow.operators.email import EmailOperator
    
    # 1. DAG CONFIGURATION
    default_args = {
        'owner': 'data_team',
        'retries': 3,
        'retry_delay': timedelta(minutes=5),
        'start_date': datetime(2023, 1, 1)
    }
    
    # 2. DATA INGESTION FUNCTION
    def load_covid_data():
        import pandas as pd
        import boto3
        
        url = "https://github.com/CSSEGISandData/COVID-19/raw/master/archived_data/archived_time_series/time_series_19-covid-Confirmed_archived_0325.csv"
        df = pd.read_csv(url)
    
        df = df.melt(
            id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
            var_name='date_str',
            value_name='confirmed_cases'
        )
        df['date'] = pd.to_datetime(df['date_str'], format='%m/%d/%y')
        
        df.to_parquet(
            's3://my-prod-bucket/raw/covid/',
            engine='pyarrow',
            partition_cols=['date']
        )
    
    # 3. DAG DEFINITION
    with DAG(
        'covid_etl',
        default_args=default_args,
        schedule_interval='@daily',
        catchup=False
    ) as dag:
    
        # Task 1: Ingest data
        ingest = PythonOperator(
            task_id='ingest_data',
            python_callable=load_covid_data
        )
    
        # Task 2: Transform with Glue
        transform = GlueJobOperator(
            task_id='transform_data',
            job_name='prod_covid_transformation',
            script_args={
                '--input_path': 's3://my-prod-bucket/raw/covid/',
                '--output_path': 's3://my-prod-bucket/processed/covid/'
            }
        )
    
        # Task 3: Load to Redshift
        load = RedshiftSQLOperator(
            task_id='load_data',
            sql="CALL prod.load_covid_data()"
        )
    
        # Task 4: Notifications
        notify = EmailOperator(
            task_id='send_email',
            to='your-email-address',
            subject='ETL Status: {{ ds }}',
            html_content='ETL job completed: View Logs'
        )
    
        # Define the task order so the pipeline runs end to end
        ingest >> transform >> load >> notify
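
    The GlueJobOperator above passes --input_path and --output_path to the job, so the prod_covid_transformation script needs to read them. Here is a minimal sketch of how that script might look, mirroring the dev transform from earlier (the argument handling is my assumption, since the production job script isn't shown in the original):

    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext
    from pyspark.sql.functions import current_date
    
    # Read the arguments supplied by the GlueJobOperator via script_args
    args = getResolvedOptions(sys.argv, ["input_path", "output_path"])
    
    glueContext = GlueContext(SparkContext.getOrCreate())
    df = glueContext.create_dynamic_frame.from_options(
        "s3",
        {"paths": [args["input_path"]]},
        format="parquet"
    ).toDF()
    
    # Same transformation as the dev script, written back to the processed zone
    df.withColumn("load_date", current_date()).write.parquet(
        args["output_path"],
        mode="overwrite"
    )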

    My Final Thoughts

    Although some users, especially those who are new to the cloud and looking for simple solutions, tend to be daunted by AWS's high barrier to entry and overwhelmed by its enormous choice of services, it's worth the time and effort, and here are the reasons:

    • The process of configuration, and of designing, building and testing the data pipelines, gives you a deep understanding of a typical data engineering workflow. These skills will benefit you even if you build your projects with other cloud services, such as Azure, GCP or Alibaba Cloud.
    • The mature ecosystem that AWS has and the huge array of services it offers enable users to customise their data architecture strategies and enjoy more flexibility and scalability in their projects.

    Thanks for reading! I hope this article helps you build your cloud-based data pipeline!


