
    Scaling Feature Engineering Pipelines with Feast and Ray

    By Editor Times Featured | February 26, 2026 | 11 Mins Read


    In a project involving the building of propensity models to predict customers’ potential purchases, I encountered feature engineering issues that I had seen numerous times before.

    These challenges can be broadly classified into two categories:

    1) Inadequate Feature Management

    • Definitions, lineage, and versions of features generated by the team were not systematically tracked, thereby limiting feature reuse and reproducibility of model runs.
    • Feature logic was manually maintained across separate training and inference scripts, leading to a risk of inconsistent features for training and inference (i.e., training-serving skew).
    • Features were stored as flat files (e.g., CSV), which lack schema enforcement and support for low-latency or scalable access.

    2) High Feature Engineering Latency

    • Heavy feature engineering workloads often arise when dealing with time-series data, where multiple window-based transformations must be computed.
    • When these computations are executed sequentially rather than optimized for parallel execution, the latency of feature engineering can increase significantly.

    In this article, I explain the concepts and implementation of feature stores (Feast) and distributed compute frameworks (Ray) for feature engineering in production machine learning (ML) pipelines.

    Contents

    (1) Example Use Case
    (2) Understanding Feast and Ray
    (3) Roles of Feast and Ray in Feature Engineering
    (4) Code Walkthrough

    The accompanying code is available in a GitHub repo.


    (1) Example Use Case

    (i) Objective

    To illustrate the capabilities of Feast and Ray, our example scenario involves building an ML pipeline to train and serve a 30-day customer purchase propensity model.


    (ii) Dataset

    We will use the UCI Online Retail dataset (CC BY 4.0), which comprises purchase transactions for a UK online retailer between December 2010 and December 2011.

    Fig. 1: Sample rows of UCI Online Retail dataset | Image by author

    (iii) Feature Engineering Approach

    We will keep the feature engineering scope simple by limiting it to the following features (based on a 90-day lookback window unless otherwise stated):

    Recency, Frequency, Monetary Value (RFM) features

    • recency_days: Days since last purchase
    • frequency: Number of distinct orders
    • monetary: Total monetary spend
    • tenure_days: Days since first-ever purchase (all-time)

    Customer behavioral features

    • avg_order_value: Mean spend per order
    • avg_basket_size: Mean number of items per order
    • n_unique_products: Product diversity
    • return_rate: Share of cancelled orders
    • avg_days_between_purchases: Mean days between purchases
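
    To make the RFM definitions above concrete, here is a minimal, dependency-free sketch for a single customer. The Order record, its field names, and the sample values are assumptions for illustration rather than the dataset's actual columns, and the article's pipeline computes these features on DataFrames instead.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Order:
    """Hypothetical order record; field names are assumptions for illustration."""
    order_id: str
    order_date: date
    amount: float


def rfm_features(orders, cutoff, lookback_days=90):
    """Compute RFM features for one customer from orders placed before `cutoff`."""
    window_start = cutoff - timedelta(days=lookback_days)
    history = [o for o in orders if o.order_date < cutoff]
    window = [o for o in history if o.order_date >= window_start]
    if not window:
        return None  # no activity inside the lookback window
    return {
        "recency_days": (cutoff - max(o.order_date for o in window)).days,
        "frequency": len({o.order_id for o in window}),
        "monetary": round(sum(o.amount for o in window), 2),
        # tenure uses the all-time history, not just the 90-day window
        "tenure_days": (cutoff - min(o.order_date for o in history)).days,
    }


orders = [
    Order("A1", date(2011, 1, 10), 50.0),  # outside the 90-day window
    Order("A2", date(2011, 5, 1), 20.0),
    Order("A3", date(2011, 5, 20), 30.0),
]
features = rfm_features(orders, cutoff=date(2011, 5, 30))
```

    Here only the two May orders fall inside the lookback window, while tenure_days is still measured from the January order.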

    (iv) Rolling Window Design

    The features are computed from a 90-day window before each cutoff date, and purchase labels (1 = at least one purchase, 0 = no purchase) are computed from a 30-day window after each cutoff.

    Given that the cutoff dates are spaced 30 days apart, this produces nine snapshots from the dataset:

    Fig. 2: Rolling window timeline for feature generation and prediction labels | Image by author
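
    The snapshot count can be checked with a short sketch. It assumes the transactions span 2010-12-01 to 2011-12-09 and that the first cutoff sits exactly 90 days after the first transaction date; the accompanying repo may pick its cutoff dates differently, and the function name is hypothetical.

```python
from datetime import date, timedelta


def rolling_cutoffs(first_date, last_date, lookback_days=90, horizon_days=30, step_days=30):
    """Return cutoffs with a full lookback window before and a full label window after."""
    cutoff = first_date + timedelta(days=lookback_days)
    last_cutoff = last_date - timedelta(days=horizon_days)
    cutoffs = []
    while cutoff <= last_cutoff:
        cutoffs.append(cutoff)
        cutoff += timedelta(days=step_days)
    return cutoffs


# Assumed transaction date range for the UCI Online Retail dataset.
cutoffs = rolling_cutoffs(date(2010, 12, 1), date(2011, 12, 9))
```

    Under these assumptions the cutoffs run from 2011-03-01 to 2011-10-27, which gives the nine snapshots mentioned above.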

    (2) Understanding Feast and Ray

    (i) About Feast

    First, let’s understand what a feature store is.

    A feature store is a centralized data repository that manages, stores, and serves machine learning features, acting as a single source of truth for both training and serving.

    Feature stores offer key benefits in managing feature pipelines:

    • Enforce consistency between training and serving data
    • Prevent data leakage by ensuring features use only data available at the time of prediction (i.e., point-in-time correct data)
    • Allow cross-team reuse of features and feature pipelines
    • Track feature versions, lineage, and metadata for governance

    Feast (short for Feature Store) is an open-source feature store that delivers feature data at scale during training and inference.

    It integrates with multiple database backends and ML frameworks, and can run on or off cloud platforms.

    Fig. 3: Feast architecture. Note that data transformation for feature engineering typically sits outside of the Feast framework | Image used under Apache License 2.0

    Feast supports both online (for real-time inference) and offline (for batch predictions) feature retrieval, though our focus is on offline features, as batch prediction is more relevant for our purchase propensity use case.


    (ii) About Ray

    Ray is an open-source general-purpose distributed computing framework designed to scale ML applications from a single machine to large clusters. It can run on any machine, cluster, cloud provider, or Kubernetes.

    Ray offers a range of capabilities, and the one we’ll use is the core distributed runtime called Ray Core.

    Fig. 4: Overview of the Ray framework | Image used under Apache License 2.0

    Ray Core provides low-level primitives for the parallel execution of Python functions as distributed tasks and for managing tasks across available compute resources.


    (3) Roles of Feast and Ray in Feature Engineering

    Let’s look at the areas where Feast and Ray help address feature engineering challenges.

    (i) Feature Store Setup with Feast

    For our case, we’ll set up an offline feature store using Feast. Our RFM and customer behavior features will be registered in the feature store for centralized access.

    In Feast terminology, offline features are also termed ‘historical’ features.


    (ii) Feature Retrieval with Feast and Ray

    With our Feast feature store ready, we can enable the retrieval of relevant features from it during both model training and inference.

    We must first be clear about these three concepts: Entity, Feature, and Feature View.

    • An entity is the primary key used to retrieve features. It basically refers to the identifier “object” for each feature row (e.g., user_id, account_id, etc.).
    • A feature is a single typed attribute associated with each entity (e.g., avg_basket_size).
    • A feature view defines a group of related features for an entity, sourced from a dataset. Think of it as a table with a primary key (e.g., user_id) coupled with relevant feature columns.

    Fig. 5: Example illustration of entity, feature, and feature view | Image by author

    Event timestamps are an essential component of feature views, as they allow us to generate point-in-time correct feature data for training and inference.

    Say we now want to obtain these offline features for training or inference. Here’s how it is done:

    1. An entity DataFrame is first created, containing the entity keys and an event timestamp for each row. It corresponds to the two left-most columns in Fig. 5 above.
    2. A point-in-time correct join occurs between the entity DataFrame and the feature tables defined by the different Feature Views.

    The output is a combined dataset containing all the requested features for the specified set of entities and timestamps.
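
    The idea behind a point-in-time correct join can be illustrated with a small pure-Python sketch. This is not Feast's implementation: the data layout and function name are hypothetical, and it ignores details such as a feature view's ttl. For each entity row, it attaches the most recent feature row whose timestamp is at or before the requested timestamp.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical feature table: per entity, rows sorted by event timestamp.
feature_rows = {
    "c1": [(date(2011, 3, 1), {"frequency": 2}), (date(2011, 3, 31), {"frequency": 5})],
    "c2": [(date(2011, 3, 31), {"frequency": 1})],
}


def point_in_time_join(entity_rows, feature_rows):
    """Attach to each (entity, timestamp) the latest feature row at or before it."""
    joined = []
    for entity_id, ts in entity_rows:
        rows = feature_rows.get(entity_id, [])
        timestamps = [row_ts for row_ts, _ in rows]
        idx = bisect_right(timestamps, ts) - 1  # last row with row_ts <= ts
        values = rows[idx][1] if idx >= 0 else {}  # no earlier row: no features
        joined.append({"customer_id": entity_id, "event_timestamp": ts, **values})
    return joined


entity_rows = [("c1", date(2011, 3, 15)), ("c1", date(2011, 4, 30)), ("c2", date(2011, 3, 1))]
result = point_in_time_join(entity_rows, feature_rows)
```

    The request for c2 on 2011-03-01 gets no feature values because its only feature row is dated later, which is exactly the leakage the join is meant to prevent.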

    So where does Ray come in here?

    The Ray Offline Store is a distributed compute engine that enables faster, more scalable feature retrieval, especially for large datasets. It does so by parallelizing data access and join operations:

    • Data (I/O) Access: Distributed data reads by splitting Parquet files across multiple workers, where each worker reads a different partition in parallel
    • Join Operations: Splits the entity DataFrame so that each partition independently performs temporal joins to retrieve the feature values per entity before a given timestamp. With multiple feature views, Ray parallelizes the computationally intensive joins to scale efficiently.

    (iii) Feature Engineering with Ray

    The feature engineering function for generating RFM and customer behavior features must be applied to each 90-day window (i.e., nine independent cutoff dates, each requiring the same computation).

    Ray Core turns each function call into a remote task, enabling the feature engineering to run in parallel across available cores (or machines in a cluster).


    (4) Code Walkthrough

    (4.1) Initial Setup

    We install the following Python dependencies:

    feast[ray]==0.60.0
    openpyxl==3.1.5
    psycopg2-binary==2.9.11
    ray==2.54.0
    scikit-learn==1.8.0
    xgboost==3.2.0

    As we’ll use PostgreSQL for the feature registry, make sure that Docker is installed and running before running docker compose up -d to start the PostgreSQL container.


    (4.2) Prepare Data

    Apart from data ingestion and cleaning, there are two preparation steps to execute:

    • Rolling Cutoff Generation: Creates nine snapshots spaced 30 days apart. Each cutoff date defines a training/prediction point at which features are computed from the 90 days preceding it, and target labels are computed from the 30 days after it.
    • Label Creation: For each cutoff, create a binary target label indicating whether a customer made at least one purchase within the 30-day window after the cutoff.
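
    As a rough sketch of the label creation step, the function below marks a customer as 1 when at least one purchase falls in the 30 days starting at the cutoff. The function name, the input layout, and the choice to include the cutoff day in the label window are assumptions for illustration, not the repo's exact logic.

```python
from datetime import date, timedelta


def make_labels(customer_ids, purchases, cutoff, horizon_days=30):
    """Return {customer_id: 1 or 0} for purchases in [cutoff, cutoff + horizon_days)."""
    window_end = cutoff + timedelta(days=horizon_days)
    buyers = {cid for cid, purchase_date in purchases if cutoff <= purchase_date < window_end}
    return {cid: int(cid in buyers) for cid in customer_ids}


purchases = [
    ("c1", date(2011, 6, 5)),   # inside the label window
    ("c2", date(2011, 7, 15)),  # after the label window
    ("c3", date(2011, 5, 29)),  # before the cutoff, so part of the feature window
]
labels = make_labels(["c1", "c2", "c3"], purchases, cutoff=date(2011, 5, 30))
```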

    (4.3) Run Ray-Based Feature Engineering

    After defining the code to generate RFM and customer behavior features, let’s parallelize the execution using Ray for each rolling window.

    We start by creating a function (compute_features_for_cutoff) that wraps all the relevant feature engineering steps for each cutoff.

    The @ray.remote decorator registers the function as a remote task to be run asynchronously in separate workers.

    The data preparation and feature engineering pipeline is then run, and here’s how Ray is involved in it:

    • ray.init() starts a Ray cluster and enables distributed execution across all local cores by default.
    • ray.put(df) stores the cleaned DataFrame in Ray’s shared memory (aka distributed object store) and returns a reference (ObjectRef) so that all parallel tasks can access the DataFrame without copying it. This helps to improve memory efficiency and task launch performance.
    • compute_features_for_cutoff.remote(...) sends our feature computation tasks to Ray’s scheduler, where Ray assigns each task to a worker for parallel execution and returns a reference to each task’s output.
    • futures = [...] stores all references returned by each .remote() call. They represent all the in-flight parallel tasks that have been launched.
    • ray.get(futures) retrieves all the actual return values from the parallel task executions in one go.
    • The script then extracts and concatenates per-cutoff RFM and behavior features into two DataFrames and saves them locally as Parquet files.
    • ray.shutdown() stops the Ray runtime and releases the resources it allocated.
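
    The steps above can be sketched as follows. This is a simplified stand-in rather than the repo's code: compute_features_for_cutoff here only counts orders per customer, the input is a plain list instead of a DataFrame, and run_parallel is a hypothetical helper. It wraps the function with ray.remote(...) at call time, which is the function-call form of the @ray.remote decorator.

```python
from datetime import date, timedelta


def compute_features_for_cutoff(orders, cutoff, lookback_days=90):
    """Simplified stand-in: order count per customer within the lookback window."""
    window_start = cutoff - timedelta(days=lookback_days)
    counts = {}
    for customer_id, order_date in orders:
        if window_start <= order_date < cutoff:
            counts[customer_id] = counts.get(customer_id, 0) + 1
    return {"cutoff": cutoff, "frequency": counts}


def run_parallel(orders, cutoffs):
    """Fan out one Ray task per cutoff and gather the results."""
    import ray  # imported here so the helper above stays usable without Ray

    ray.init(ignore_reinit_error=True)
    try:
        # ray.remote(fn) is the function-call form of the @ray.remote decorator.
        remote_fn = ray.remote(compute_features_for_cutoff)
        orders_ref = ray.put(orders)  # store the shared input once in the object store
        futures = [remote_fn.remote(orders_ref, cutoff) for cutoff in cutoffs]
        return ray.get(futures)  # blocks until every task has finished
    finally:
        ray.shutdown()


orders = [("c1", date(2011, 2, 1)), ("c1", date(2011, 2, 20)), ("c2", date(2011, 4, 2))]
local_result = compute_features_for_cutoff(orders, date(2011, 3, 1))
```

    Calling run_parallel(orders, [date(2011, 3, 1), date(2011, 3, 31)]) requires Ray to be installed and returns one result per cutoff, in the same order as the cutoffs passed in.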

    While our features are stored locally in this case, do note that offline feature data is typically stored in data warehouses or data lakes (e.g., S3, BigQuery, etc.) in production settings.


    (4.4) Set Up Feast Feature Registry

    So far, we have covered the transformation and storage aspects of feature engineering. Let us move on to the Feast feature registry.

    A feature registry is the centralized catalog of feature definitions and metadata that serves as a single source of truth for feature information.

    There are two key components in the registry setup: Definitions and Configuration.


    Definitions

    We first define the Python objects to represent the features engineered so far. For example, one of the first objects to determine is the Entity (i.e., the primary key that links the feature rows).

    Next, we define the data sources in which our feature data are stored.

    Note that the timestamp_field is essential, as it enables correct point-in-time data views and joins when features are retrieved for training or inference.

    After defining entities and data sources, we can define the feature views. Given that we have two sets of features (RFM and customer behavior), we expect to have two feature views.

    The schema (field names, dtypes) is important for ensuring that feature data is properly validated and registered.

    Configuration

    The feature registry configuration is defined in a YAML file called feature_store.yaml.

    The configuration tells Feast what infrastructure to use and where its metadata and feature data live, and it generally comprises the following:

    • Project name: Namespace for the project
    • Provider: Execution environment (e.g., local, Kubernetes, cloud)
    • Registry location: Location of feature metadata storage (file or databases like PostgreSQL)
    • Offline store: Location from which historical feature data is read
    • Online store: Location from which low-latency features are served (not relevant in our case)

    In our case, we use PostgreSQL (running in a Docker container) for the feature registry and the Ray offline store for optimized feature retrieval.

    We use PostgreSQL instead of local SQLite to simulate production-grade infrastructure for the feature registry setup, where multiple services can access the registry concurrently.

    Feast Apply

    Once definitions and configuration are set up, we run feast apply to register and synchronize the definitions with the registry and provision the required infrastructure.

    The command can be found in the Makefile:

    # Step 2: Register Feast feature definitions in PostgreSQL registry
    apply:
     cd feature_store && feast apply

    (4.5) Retrieve Features for Model Training

    Once our feature store is ready, we proceed with training the ML model.

    We start by creating the entity spine for retrieval (i.e., the two columns of customer_id and event_timestamp), which Feast uses to retrieve the correct feature snapshot.

    We then execute the retrieval of features for model training at runtime:

    • FeatureStore is the Feast object that is used to define, create, and retrieve features at runtime
    • get_historical_features() is designed for offline feature retrieval (as opposed to get_online_features()), and it expects the entity DataFrame and the list of features to retrieve. The distributed reads and point-in-time joins of feature data occur here.
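
    A hedged sketch of this retrieval flow is shown below. The feature view name, feature list, helper function names, and sample customer IDs are assumptions, and the repo path simply mirrors the feature_store directory used by the Makefile. The retrieval function only works against a feature store that has already been registered with feast apply.

```python
from datetime import datetime

# Feature references use Feast's "<feature_view>:<feature>" format.
# The view and feature names below are assumptions for illustration.
RFM_FEATURES = ["recency_days", "frequency", "monetary", "tenure_days"]
FEATURE_REFS = [f"rfm_features:{name}" for name in RFM_FEATURES]


def build_entity_rows(customer_ids, cutoff):
    """One row per customer, stamped with the cutoff used for the point-in-time join."""
    return [{"customer_id": cid, "event_timestamp": cutoff} for cid in customer_ids]


def retrieve_training_features(entity_rows, repo_path="feature_store"):
    """Fetch historical features for the entity spine from the offline store."""
    import pandas as pd  # imported lazily; only needed when retrieval runs
    from feast import FeatureStore

    store = FeatureStore(repo_path=repo_path)
    entity_df = pd.DataFrame(entity_rows)
    job = store.get_historical_features(entity_df=entity_df, features=FEATURE_REFS)
    return job.to_df()


# Hypothetical customer IDs and cutoff, used only to show the spine's shape.
spine = build_entity_rows([12346, 12347], cutoff=datetime(2011, 3, 1))
```

    Passing spine to retrieve_training_features returns a DataFrame with the spine columns plus one column per requested feature, ready to be joined with the labels for training.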

    (4.7) Retrieve Features for Inference

    We end off by generating predictions from our trained model.

    The feature retrieval code for inference is largely similar to that for training, since we are reaping the benefits of a consistent feature store.

    The main difference comes from the different cutoff dates used.


    Wrapping It Up

    Feature engineering is a critical component of building ML models, but it also introduces data management challenges if not properly handled.

    In this article, we demonstrated how to use Feast and Ray to improve the management, reusability, and efficiency of feature engineering.

    Understanding and applying these concepts will enable teams to build efficient ML pipelines with scalable feature engineering capabilities.


