    System Design Series: Apache Flink from 10,000 Feet, and Building a Flink-powered Recommendation Engine

By Editor Times Featured, April 30, 2026. 17 min read.


For a while, I've had Apache Flink on my "things I really want to learn properly" list. I'd seen it mentioned alongside Kafka, heard it come up in conversations about real-time pipelines, and sort of understood the use case. But I'd never actually sat down and learned it properly.

If you feel the same way, you're in good company. There's good reason to learn Flink: it's one of the most popular tools in software engineering right now. Netflix uses it for near-real-time anomaly detection in their streaming infrastructure. Alibaba reportedly runs one of the largest Flink deployments in the world, processing hundreds of billions of events per day across tens of thousands of machines. Uber built their analytics platform around it. Flink has become the backbone of how some of the most data-intensive companies in the world process data as it happens. So if Flink has been on your list too, this is a good time to actually understand it.

So I dove in. And I was genuinely surprised, not just by what Flink is, but by why it exists and how it's built. The story of Flink is really the story of a much deeper idea: how to make sense of high-scale, continuously streaming data. The problem statement is actually quite simple: how do you build real-world, practical answers from a massive scale of continuous data? This post is my attempt to explain that idea from the ground up, and to show you where Flink fits into it.

    Let’s dive in.

Before We Start

Two concepts come up constantly in this post that are worth making sure we're on the same page about before we go further.

What's a stream? A stream is a continuous, potentially never-ending sequence of records arriving over time. Think about a user browsing a website: every page view, every click, every scroll is an event being produced. One after another, in real time. There's no natural "end" to this; as long as the user is active, events keep coming. That's a stream.

Image by author

What's batch processing? Batch processing means taking a finite, bounded collection of records and processing it all at once. Instead of reacting to each event as it arrives, you collect events for a period of time, say an hour, and then run a computation over all of them together. The computation has a clear start and a clear end.

Image by author

Both are legitimate ways to process data. The tension between them is what Flink was built to resolve, and we'll get there.

Back to the Problem: How We Actually Produce Data

Let me make this concrete with an example we'll use throughout this post.

Imagine you're building a recommendation engine, the kind that shows users "you may also like these" based on what they've been viewing. To do this well, your system needs to know things like:
What has this user been clicking on in the last few minutes?
What items are trending right now across all users?
Which products did this user view but not purchase in their last session?

Now, where does that data come from? Every time a user opens a product page, you record an event. Every click, every purchase, every search: your application is continuously writing records that look roughly like this:

{ "user_id": "u-8821", "item_id": "p-443", "event_type": "view", "timestamp": "2024-03-10T14:32:01Z" }
{ "user_id": "u-1042", "item_id": "p-117", "event_type": "purchase", "timestamp": "2024-03-10T14:32:03Z" }
{ "user_id": "u-8821", "item_id": "p-501", "event_type": "click", "timestamp": "2024-03-10T14:32:07Z" }

One record every few seconds for every user, across millions of concurrent users, continuously. That's your data. Not a file. Not a table that refreshes once a day. A stream: an ongoing, never-ending sequence of events that describes what your users are doing right now.

Again, this is what that stream looks like:

Image by author

And yet the dominant paradigm for years was to take that stream and... ignore the fact that it was a stream. Dump the events into files every hour. Wait for the batch job to run. Then serve recommendations based on what users were doing last hour.

Image by author

Why? Because batch processing is conceptually simple. You know exactly what data you have. You can reason about the computation clearly: it starts, it runs, it finishes. Systems like Hadoop and MapReduce (you don't need to know these in depth for this post) were built around this model and scaled to huge data sizes. They worked.

But there's a fundamental cost: latency. If your batch job runs every hour, then in the worst case, a user's behavior right now won't influence their recommendations for up to an hour. For a recommendation engine, that means a user who just showed strong interest in hiking gear gets shown laptop accessories, because the system hasn't caught up yet. The user searched for a hiking rucksack, and you need to show them tents and hiking poles on the next page load, not an hour later.

For fraud detection, hourly latency means fraudulent transactions go undetected for an hour. For a live dashboard, it means your "real-time" metrics can be up to 59 minutes stale. The cost of batch is that events happen in real time, but your system only learns about them on a schedule.

So as data volumes grew and latency requirements tightened, engineers started building streaming systems alongside their batch systems: systems that could process each event as it arrived, in milliseconds. Apache Storm was an early leader here. Amazon Kinesis. LinkedIn's Samza.

Image by author

But building a new streaming system while maintaining an existing batch system isn't so simple. Now you have two systems to maintain. Your streaming pipeline computed approximate, real-time results. Your batch pipeline ran overnight and produced accurate, complete results. You had to write the same business logic twice: once for each system, in different frameworks, in different languages, kept in sync manually. When the batch job and the streaming job disagreed on a number (and they always disagreed eventually), you had to figure out which one was wrong.

Your recommendation engine in this new world now looks like this: a streaming component that updates recommendations in near-real-time based on recent events, and a batch component that rebuilds the full recommendation model every night based on historical data.

Two codebases. Two deployment pipelines. Two sets of bugs. One serving layer trying to reconcile them.

The Key Insight: Batch Is Just a Special Case of Streaming

Here's the idea at the heart of Flink, and it's quite simple:

A bounded data set is just a special case of an unbounded data stream that happens to end.

Your historical database of 5 years of user events: that's a stream that started 5 years ago and stopped today. Your log files from last month: that's a stream with a beginning and an end. The difference between "batch data" and "streaming data" is not a fundamental difference about the nature of the data. At the end of the day, it's just JSON events of what the user searched and clicked on. The question is whether the stream is still flowing or has stopped.

Going back to our recommendation engine: the "historical data" you process in your nightly batch job and the "real-time events" you process in your streaming pipeline are both just records in the same sequence of user events. The only difference is when you read them. The nightly batch job reads records from 6 months ago. The streaming pipeline reads records from 6 seconds ago. Same data, different time window.
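To make this concrete, here's a small Python sketch (my own illustration, not Flink's API; all names are made up) of one aggregation applied to both a bounded collection and an unbounded stream:

```python
from itertools import count, islice

def view_counts(events):
    """One aggregation, written once: count views per item.
    Works on any iterable of events, finite or infinite."""
    counts = {}
    for event in events:
        counts[event["item_id"]] = counts.get(event["item_id"], 0) + 1
    return counts

# "Batch": a bounded collection, e.g. yesterday's events loaded from a file.
historical = [
    {"item_id": "p-443"}, {"item_id": "p-117"}, {"item_id": "p-443"},
]
print(view_counts(historical))  # {'p-443': 2, 'p-117': 1}

# "Streaming": an unbounded generator of live events. We can't consume it
# all, so we read a finite slice of it -- which is exactly the point:
# a bounded data set is just a stream we chose to stop reading.
def live_events():
    for i in count():
        yield {"item_id": f"p-{i % 2}"}

print(view_counts(islice(live_events(), 4)))  # {'p-0': 2, 'p-1': 2}
```

Same function, same logic; only the boundedness of the input differs.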

If you build a system that processes streams natively, and handles both infinite streams and finite ones, you don't need separate systems. You don't need to maintain two codebases. You have one engine, one set of logic, and you point it at whatever slice of the stream you need.

That's what Flink tries to do.

    So What Is Apache Flink?

Apache Flink is a distributed stream processing framework. It takes a potentially unbounded stream of records (or a bounded batch of records, same thing), processes it in parallel across a cluster of machines, and produces results continuously as data flows through.

Image by author

Internally, Flink jobs are written in code and converted to a DAG. For example, this is what code for a Flink job might look like (it's not important to understand all the details; this is just to give a rough idea):

// ── 1. SOURCES ──────────────────────────────────────────────
searches = readFromKafka("search-events")
clicks = readFromKafka("click-events")


// ── 2. PER-USER ACTIVITY (windowed aggregation) ─────────────
// group events by user, compute rolling features over the last 30 min
userActivity = (searches + clicks)
.keyBy(userId)
.window(slidingWindow(size=30min, slide=1min))
.aggregate(activityAggregator)
// → { userId, recentQueries, recentClicks, categories, ... }


// ── 3. USER EMBEDDING (call user-tower model) ───────────────
// turn the activity features into a vector
userState = userActivity.asyncMap(callUserTowerModel)
// → { userId, embedding[128], features }


// ── 4. CANDIDATE GENERATION (2 sources, then merge) ─────────
annCandidates = userState.asyncMap(vectorAnnLookup) // ~500 items
trendingCandidates = userState.asyncMap(trendingLookup) // ~200 items

allCandidates = (annCandidates + trendingCandidates)
.keyBy(userId)
.window(2sec)
.reduce(mergeAndDedupe)
// → { userId, candidates: ~1000 itemIds }


// ── 5. FETCH ITEM FEATURES (batched lookup) ─────────────────
scoringInputs = allCandidates
.joinWith(userState, on=userId)
.asyncMap(fetchItemFeatures)
// → { userId, userFeatures, [(itemId, itemFeatures) × ~1000] }


// ── 6. RANKING (call ranking model) ─────────────────────────
ranked = scoringInputs.asyncMap(callRankingModel)
// → { userId, top 100 (itemId, score) pairs }


// ── 7. SINK ─────────────────────────────────────────────────
ranked.writeTo(redis)

Internally, Flink breaks this code down into a graph of physical tasks to be executed, and splits those tasks into a smaller set of parallel "subtasks":

Image by author

Flink pushes tasks to worker nodes. Each worker runs its assigned tasks continuously, sends periodic heartbeats back to Flink, and reports if a task fails so Flink can restart it.

Image by author

Let's break down the core concepts of Flink.

Core Concepts

    Streams and Operators

Let me start from the simplest possible picture and build up.

Every Flink program is a dataflow graph: a set of operators connected by data streams. Don't worry if this sounds abstract right now; we'll build the picture piece by piece and it'll click quickly.

Sources produce records (reading from Kafka, a file, a database).

Operators transform them.

Sinks consume the output (writing to a database, another Kafka topic, a dashboard).

An operator is a unit of processing logic. For our recommendation engine, an operator might filter out bot traffic, or enrich an event with product metadata, or count how many times each product was viewed. Each operator receives records from one or more input streams, does something with them, and emits records to one or more output streams.

A stream is the sequence of records flowing between operators. In our case, a stream of user events: view events, click events, purchase events, one after another as they happen.

That's the basic shape of any Flink job.
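
As a toy sketch of that source → operator → sink shape (plain Python generators, not the Flink API; all names here are invented for illustration):

```python
# Source: produces records (here, a hardcoded list standing in for Kafka).
def source():
    yield {"user_id": "u-8821", "event_type": "view", "item_id": "p-443"}
    yield {"user_id": "bot-1",  "event_type": "view", "item_id": "p-443"}
    yield {"user_id": "u-1042", "event_type": "purchase", "item_id": "p-117"}

# Operator: one unit of processing logic -- filter out bot traffic.
def drop_bots(events):
    for event in events:
        if not event["user_id"].startswith("bot-"):
            yield event

# Sink: consumes the output (here, just collecting into a list).
sink = list(drop_bots(source()))
print([e["user_id"] for e in sink])  # ['u-8821', 'u-1042']
```

In a real job, the operator would be a Flink transformation and the sink a database or Kafka topic, but the dataflow shape is the same.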

    Parallelism

A single machine can process events fast, but if you're handling millions of users, a single machine isn't enough. Flink solves this by running every operator in parallel: each operator is split into multiple subtasks that run concurrently on different machines in your cluster.

If you have a Filter operator with parallelism 4, there are four instances running concurrently, each processing a different portion of the stream. Add more machines, get more subtasks, handle higher volumes. This is how Flink scales to billions of events per day.

For our recommendation engine, this means the window aggregation for 10 million users isn't running sequentially on one machine; it's split across dozens of workers.
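
The routing trick that makes keyed parallelism work is hash partitioning: events for the same key always go to the same subtask. A minimal sketch of the idea (my own simplification, not Flink's actual scheduler):

```python
PARALLELISM = 4

def subtask_for(user_id: str) -> int:
    """Route every event for the same user to the same subtask,
    so that subtask can hold all of that user's state locally."""
    return hash(user_id) % PARALLELISM

# 24 events from 6 users, scattered through the stream.
events = [{"user_id": f"u-{n % 6}", "seq": n} for n in range(24)]

# Each subtask (partition) receives only its own slice of the stream.
partitions = {i: [] for i in range(PARALLELISM)}
for event in events:
    partitions[subtask_for(event["user_id"])].append(event)

# Every event lands in exactly one partition, and all events for a
# given user land in the same one, so that user's state lives in one place.
assert sum(len(p) for p in partitions.values()) == len(events)
```

This is why Flink's `keyBy` comes before stateful operations: it guarantees a user's events and that user's state end up on the same machine.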

    State

Going back to our recommendation engine: when a user views a product, that single event by itself tells you almost nothing. You need context. What else has this user been viewing in the past few minutes? Have they been browsing products in the same category? Did they almost purchase something similar last session? To answer these questions, your system needs memory: it needs to remember what happened before.

In the early days of stream processing, most systems were stateless. Each event was processed in isolation: the operator saw the event, transformed it, moved on. No memory of what came before. This worked fine for simple pipelines, like filtering out bot traffic or enriching events with metadata from a lookup table. But it was fundamentally too limited for anything that required reasoning about patterns over time.

Think about what our recommendation engine actually needs to do. For every incoming event, it needs to ask: "What has user u-8821 done in the last 10 minutes?" To answer that question, someone has to be keeping a running list of user u-8821's recent events. And user u-1042's recent events. And all the other users'. That's state: data that accumulates and evolves as records flow through the operator, rather than being derived fresh from each individual record.

Flink makes state a first-class concept. An operator can declare state explicitly: a counter, a hash map keyed by user ID, a sorted list of recent events. Flink gives you that state as a managed object you can read and write during processing. For our recommendation engine, the state might be a hash map from user ID to "list of product IDs viewed in the last 10 minutes." Every time a new view event arrives, you look up the user in the map, append the item, and trim events older than 10 minutes.
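
Here's a minimal Python sketch of that keyed state (an in-memory stand-in assuming a single process; in Flink this would be managed state that gets snapshotted for you):

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 10 * 60  # keep 10 minutes of history per user

# Keyed state: user_id -> recent (timestamp, item_id) pairs.
recent_views = defaultdict(deque)

def on_view_event(user_id: str, item_id: str, ts: float) -> list:
    """Process one view event: append it, trim anything older than
    10 minutes, and return the user's current recent-item list."""
    views = recent_views[user_id]
    views.append((ts, item_id))
    while views and views[0][0] < ts - WINDOW_SECONDS:
        views.popleft()
    return [item for _, item in views]

print(on_view_event("u-8821", "p-443", ts=0))    # ['p-443']
print(on_view_event("u-8821", "p-501", ts=300))  # ['p-443', 'p-501']
print(on_view_event("u-8821", "p-117", ts=700))  # ['p-501', 'p-117']
```

Note how the third call drops p-443: by ts=700 it is more than 10 minutes old. The hard part, covered next, is what happens to this in-memory structure when the machine holding it crashes.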

But managing state in a distributed system is genuinely hard. What happens when the machine running your operator crashes? That in-memory hash map is gone. Flink handles this: it periodically snapshots all operator state to durable storage, so on recovery it can restore everything to where it was before the failure. And it guarantees that state updates are applied exactly once: even if a machine crashes and the same events are replayed during recovery, your counts won't be doubled.

We'll go deep on how Flink achieves exactly-once guarantees in a future architecture post. For now, just know that Flink gives you state that feels as reliable as writing to a database, with the performance of an in-memory hash map.

Windows

We've got a stream of user events, operators running in parallel, and state accumulating per user. Now here's a problem that comes up almost immediately in any real aggregation.

Let's say you want to compute "the 10 most viewed products in the last 5 minutes" to power a "trending now" section of your website. You have an operator that's counting views per product. But your stream is infinite. When do you emit a result? You can't wait until "all the events" arrive; they never stop arriving.

You need a way to slice the infinite stream into finite pieces and compute over each piece. That's a window.

A window is a bounded chunk of your stream. You define it, Flink groups the events into that chunk, and when the chunk is "full," it runs your aggregation and emits a result. Flink has several window types: tumbling windows, sliding windows, session windows, and so on. It's not critical to understand the differences between each window type, but the gist of windows is that they look at data for some time interval.
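
A minimal sketch of a tumbling-window aggregation in plain Python (illustrative only; in Flink, window assignment and firing happen inside the engine):

```python
from collections import Counter, defaultdict

WINDOW_MS = 5 * 60 * 1000  # 5-minute tumbling windows

# (timestamp_ms, item_id) view events from the stream
events = [
    (10_000, "p-443"), (20_000, "p-117"), (30_000, "p-443"),    # window 0
    (310_000, "p-501"), (320_000, "p-501"), (330_000, "p-443"), # window 1
]

# Assign each event to the window containing its timestamp, then
# aggregate per window: views per product.
windows = defaultdict(Counter)
for ts, item in events:
    window_start = (ts // WINDOW_MS) * WINDOW_MS
    windows[window_start][item] += 1

for start, counts in sorted(windows.items()):
    item, views = counts.most_common(1)[0]
    print(f"window starting at {start}ms: top item {item} with {views} views")
# window starting at 0ms: top item p-443 with 2 views
# window starting at 300000ms: top item p-501 with 2 views
```

Each window is a finite chunk, so the aggregation can finish and emit a result even though the stream itself never ends.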

Tidbits from the Original Paper

I spent some time reading the 2015 Apache Flink paper, "Apache Flink: Stream and Batch Processing in a Single Engine" by Carbone, Katsifodimos, Ewen, Markl, Haridi, and Tzoumas. A few things from the paper add useful color to what we covered above:

On Fault Tolerance and Exactly-Once Guarantees

The paper describes exactly-once semantics this way: "Flink provides strict exactly-once-processing consistency guarantees for stateful operators through a combination of distributed snapshots and partial re-execution upon recovery." The key phrase there is *partial* re-execution: when a machine fails, Flink doesn't restart the whole job from the beginning. It rolls back all operators to their last successful snapshot, then replays only the input from that point forward. The maximum amount of reprocessing is bounded by the gap between two consecutive checkpoints, which is a tunable parameter.

The mechanism that makes this work without pausing the computation is called Asynchronous Barrier Snapshotting (ABS), and it's genuinely clever. We'll cover it in full detail in the next post. But the headline is: Flink injects special "barrier" markers into the data stream, which flow through the operators like regular records. When an operator receives a barrier, it snapshots its state to durable storage and forwards the barrier downstream, all while continuing to process records. No pause, no freeze, no missed events.
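
To build intuition for why checkpoint-plus-replay yields exactly-once state, here's a deliberately simplified single-operator simulation (my own toy model, not the ABS protocol itself):

```python
# An operator counts events and checkpoints its state every 3 events.
# A checkpoint records the state AND the input offset together, so
# recovery knows exactly where to resume reading.
events = list(range(10))
CHECKPOINT_EVERY = 3

state = {"count": 0}
checkpoint = {"offset": 0, "state": {"count": 0}}  # "durable storage"

for offset, _ in enumerate(events):
    state["count"] += 1
    if (offset + 1) % CHECKPOINT_EVERY == 0:
        checkpoint = {"offset": offset + 1, "state": dict(state)}
    if offset == 7:
        break  # simulate a crash after the 8th event (last checkpoint: offset 6)

# Recovery: roll back to the snapshot, replay input from its offset only.
state = dict(checkpoint["state"])
for _ in events[checkpoint["offset"]:]:
    state["count"] += 1

print(state["count"])  # 10 -- each event counted exactly once
```

Events 6 and 7 were processed twice in wall-clock terms, but their first effects were discarded along with the post-checkpoint state, so the final count reflects each event exactly once.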

    On Unified Batch and Stream Processing

One of the clearest statements in the paper is this: "A bounded data set is a special case of an unbounded data stream." The authors are making a philosophical claim, not just a technical one. And they back it up: "Batch computations are executed by the same runtime as streaming computations. The runtime executable may be parameterized with blocked data streams to break up large computations into isolated stages that are scheduled successively."

In plain terms: there is no separate batch engine in Flink. Batch jobs run on the very same distributed dataflow runtime that processes your Kafka streams. The only difference is that batch jobs use "blocked" data exchange between stages: the upstream operator finishes entirely before the downstream one starts. Everything else (the operator model, the state management, the serialization) is the same.

Going back to our recommendation engine: this means the job that counts real-time view trends and the job that processes 6 months of historical events for model retraining can share the same operators, the same cluster, and the same codebase. The paper's promise is that the Lambda Architecture, with its two systems and two codebases, is simply no longer necessary.

    Wrapping Up

Let's quickly do a TL;DR:

Data is produced as continuous streams, but we've historically forced it into batches, creating latency and the operational pain of maintaining two systems.

Flink is built on the insight that batch is just a special case of streaming, and unifies both in a single engine.

The core building blocks are: operators (processing logic), streams (data in motion), state (memory that persists across records), and windows (bounded slices of a stream for computation).

Fault tolerance with exactly-once guarantees is built in.

Ideally, I would have liked to go in depth on each of these topics (and there's plenty of depth in them), but this post has already become quite long, so I'll defer that to future Sanil for now. You can also follow me on LinkedIn for more byte-sized posts and to see what I'm learning about right now.

We talked a lot about Apache Kafka (given it's the backbone of most data architectures), but did you ever wonder how Apache Kafka works and what its architecture is like? I was surprised to learn how simple Kafka really is under the hood. I wrote a whole blog post about it here:
System Design Series: Apache Kafka from 10,000 feet
Let's look at what Kafka is, how it works and when should we use it! (medium.com)

If you're looking for something more in-depth, I'd recommend checking out one of my most popular posts on Temporal, a workflow orchestration tool, with in-depth explanations of how events are scheduled, started, and completed:
System Design Series: A Step-by-Step Breakdown of Temporal's Internal Architecture
A step-by-step deep dive into Temporal's architecture, covering workflows, tasks, shards, partitions, and how Temporal… (medium.com)



