    System Design Series: Apache Flink from 10,000 Feet, and Building a Flink-powered Recommendation Engine

By Editor Times Featured, April 30, 2026. 17 min read.


For a while, I've had Apache Flink on my "things I really want to learn properly" list. I'd seen it mentioned alongside Kafka, heard it come up in conversations about real-time pipelines, and sort of understood the use case. But I'd never actually sat down and learned it properly.

If you feel the same way, you're in good company. There's good reason to learn Flink: it's one of the most popular tools in software engineering right now. Netflix uses it for near-real-time anomaly detection in their streaming infrastructure. Alibaba reportedly runs one of the largest Flink deployments in the world, processing hundreds of billions of events per day across tens of thousands of machines. Uber built their analytics platform around it. Flink has become the backbone of how some of the most data-intensive companies in the world process data as it happens. So if Flink has been on your list too, this is a good time to actually understand it.

So I dove in. And I was genuinely surprised, not just by what Flink is, but by why it exists and how it's built. The story of Flink is really the story of a much deeper idea: how to make sense of high-scale, continuously streaming data. The problem statement is actually quite simple: how do you build real-world, practical answers from a massive scale of continuous data? This post is my attempt to explain that idea from the ground up, and to show you where Flink fits into it.

    Let’s dive in.

Before We Start

Two concepts come up constantly in this post that are worth making sure we're on the same page about before we go further.

What's a stream? A stream is a continuous, potentially never-ending sequence of records arriving over time. Think about a user browsing a website: every page view, every click, every scroll is an event being produced. One after another, in real time. There's no natural "end" to this; as long as the user is active, events keep coming. That's a stream.

Image by author

What's batch processing? Batch processing means taking a finite, bounded collection of records and processing it all at once. Instead of reacting to each event as it arrives, you collect events for a period of time, say an hour, and then run a computation over all of them together. The computation has a clear start and a clear end.

Image by author

Both are legitimate ways to process data. The tension between them is what Flink was built to resolve, and we'll get there.

Back to the Problem: How We Actually Produce Data

Let me make this concrete with an example we'll use throughout this post.

Imagine you're building a recommendation engine, the kind that shows users "you may also like these" based on what they've been viewing. To do this well, your system needs to know things like:
What has this user been clicking on in the last few minutes?
What items are trending right now across all users?
Which products did this user view but not purchase in their last session?

Now, where does that data come from? Every time a user opens a product page, you record an event. Every click, every purchase, every search: your application is continuously writing records that look roughly like this:

{ "user_id": "u-8821", "item_id": "p-443", "event_type": "view", "timestamp": "2024-03-10T14:32:01Z" }
{ "user_id": "u-1042", "item_id": "p-117", "event_type": "purchase", "timestamp": "2024-03-10T14:32:03Z" }
{ "user_id": "u-8821", "item_id": "p-501", "event_type": "click", "timestamp": "2024-03-10T14:32:07Z" }

One record every few seconds for every user, across millions of concurrent users, continuously. That's your data. Not a file. Not a table that refreshes once a day. A stream: an ongoing, never-ending sequence of events that describes what your users are doing right now.

Again, this is what that stream looks like:

Image by author

And yet the dominant paradigm for years was to take that stream and... ignore the fact that it was a stream. Dump the events into files every hour. Wait for the batch job to run. Then serve recommendations based on what users were doing last hour.

Image by author

Why? Because batch processing is conceptually simple. You know exactly what data you have. You can reason about the computation clearly: it starts, it runs, it finishes. Systems like Hadoop and MapReduce (you don't need to know these in depth for this post) were built around this model and scaled to huge data sizes. They worked.

But there's a fundamental cost: latency. If your batch job runs every hour, then in the worst case, a user's behavior right now won't influence their recommendations for up to an hour. For a recommendation engine, that means a user who just showed strong interest in hiking gear gets shown laptop accessories, because the system hasn't caught up yet. The user searched for a hiking rucksack, and you need to show them tents and hiking poles on the next page load, not an hour later.

For fraud detection, hourly latency means fraudulent transactions go undetected for an hour. For a live dashboard, it means your "real-time" metrics can be up to 59 minutes stale. The cost of batch is that events happen in real time, but your system only learns about them on a schedule.

So as data volumes grew and latency requirements tightened, engineers started building streaming systems alongside their batch systems: systems that could process each event as it arrived, in milliseconds. Apache Storm was an early leader here. Amazon Kinesis. LinkedIn's Samza.

Image by author

But building a new streaming system while maintaining an existing batch system isn't so simple. Now you have two systems to maintain. Your streaming pipeline computed approximate, real-time results. Your batch pipeline ran overnight and produced accurate, complete results. You had to write the same business logic twice: once for each system, in different frameworks, in different languages, kept in sync manually. When the batch job and the streaming job disagreed on a number (and they always disagreed eventually), you had to figure out which one was wrong.

Your recommendation engine in this new world now looks like this: a streaming component that updates recommendations in near-real-time based on recent events, and a batch component that rebuilds the full recommendation model every night based on historical data.

Two codebases. Two deployment pipelines. Two sets of bugs. One serving layer trying to reconcile them.

The Key Insight: Batch Is Just a Special Case of Streaming

Here's the idea at the heart of Flink, and it's quite simple:

A bounded data set is just a special case of an unbounded data stream that happens to end.

Your historical database of 5 years of user events: that's a stream that started 5 years ago and stopped today. Your log files from last month: that's a stream with a beginning and an end. The difference between "batch data" and "streaming data" is not a fundamental difference about the nature of the data. At the end of the day, it's just JSON events of what the user searched and clicked on. The question is whether the stream is still flowing or has stopped.

Going back to our recommendation engine: the "historical data" you process in your nightly batch job and the "real-time events" you process in your streaming pipeline are both just records in the same sequence of user events. The only difference is when you read them. The nightly batch job reads records from 6 months ago. The streaming pipeline reads records from 6 seconds ago. Same data, different time window.
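To make this concrete, here's a small Python sketch (my own illustration, not Flink's API; all names are made up) of one aggregation applied to both a bounded collection and an unbounded stream:

```python
from itertools import count, islice

def view_counts(events):
    """One aggregation, written once: count views per item.
    Works on any iterable of events, finite or infinite."""
    counts = {}
    for event in events:
        counts[event["item_id"]] = counts.get(event["item_id"], 0) + 1
    return counts

# "Batch": a bounded collection, e.g. yesterday's events loaded from a file.
historical = [
    {"item_id": "p-443"}, {"item_id": "p-117"}, {"item_id": "p-443"},
]
print(view_counts(historical))  # {'p-443': 2, 'p-117': 1}

# "Streaming": an unbounded generator of live events. We can't consume it
# all, so we read a finite slice of it -- which is exactly the point:
# a bounded data set is just a stream we chose to stop reading.
def live_events():
    for i in count():
        yield {"item_id": f"p-{i % 2}"}

print(view_counts(islice(live_events(), 4)))  # {'p-0': 2, 'p-1': 2}
```

Same function, same logic; only the boundedness of the input differs.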

If you build a system that processes streams natively, and handles both infinite streams and finite ones, you don't need separate systems. You don't need to maintain two codebases. You have one engine, one set of logic, and you point it at whatever slice of the stream you need.

That's what Flink tries to do.

    So What Is Apache Flink?

Apache Flink is a distributed stream processing framework. It takes a potentially unbounded stream of records (or a bounded batch of records, same thing), processes it in parallel across a cluster of machines, and produces results continuously as data flows through.

Image by author

Internally, Flink jobs are written in code and converted to a DAG. For example, this is what code for a Flink job might look like (it's not important to understand all the details; this is just to give a rough idea):

// ── 1. SOURCES ──────────────────────────────────────────────
searches = readFromKafka("search-events")
clicks = readFromKafka("click-events")


// ── 2. PER-USER ACTIVITY (windowed aggregation) ─────────────
// group events by user, compute rolling features over the last 30 min
userActivity = (searches + clicks)
.keyBy(userId)
.window(slidingWindow(size=30min, slide=1min))
.aggregate(activityAggregator)
// → { userId, recentQueries, recentClicks, categories, ... }


// ── 3. USER EMBEDDING (call user-tower model) ───────────────
// turn the activity features into a vector
userState = userActivity.asyncMap(callUserTowerModel)
// → { userId, embedding[128], features }


// ── 4. CANDIDATE GENERATION (2 sources, then merge) ─────────
annCandidates = userState.asyncMap(vectorAnnLookup) // ~500 items
trendingCandidates = userState.asyncMap(trendingLookup) // ~200 items

allCandidates = (annCandidates + trendingCandidates)
.keyBy(userId)
.window(2sec)
.reduce(mergeAndDedupe)
// → { userId, candidates: ~1000 itemIds }


// ── 5. FETCH ITEM FEATURES (batched lookup) ─────────────────
scoringInputs = allCandidates
.joinWith(userState, on=userId)
.asyncMap(fetchItemFeatures)
// → { userId, userFeatures, [(itemId, itemFeatures) × ~1000] }


// ── 6. RANKING (call ranking model) ─────────────────────────
ranked = scoringInputs.asyncMap(callRankingModel)
// → { userId, top 100 (itemId, score) pairs }


// ── 7. SINK ─────────────────────────────────────────────────
ranked.writeTo(redis)

Internally, Flink breaks this code down into a graph of physical tasks to be executed, and splits those tasks into a smaller set of parallel "subtasks":

Image by author

Flink pushes tasks to worker nodes. Each worker runs its assigned tasks continuously, sends periodic heartbeats back to Flink, and reports if a task fails so Flink can restart it.

Image by author

Let's break down the core concepts of Flink.

Core Concepts

    Streams and Operators

Let me start from the simplest possible picture and build up.

Every Flink program is a dataflow graph: a set of operators connected by data streams. Don't worry if this sounds abstract right now; we'll build the picture piece by piece and it'll click quickly.

Sources produce records (reading from Kafka, a file, a database).

Operators transform them.

Sinks consume the output (writing to a database, another Kafka topic, a dashboard).

An operator is a unit of processing logic. For our recommendation engine, an operator might filter out bot traffic, or enrich an event with product metadata, or count how many times each product was viewed. Each operator receives records from one or more input streams, does something with them, and emits records to one or more output streams.

A stream is the sequence of records flowing between operators. In our case, a stream of user events: view events, click events, purchase events, one after another as they happen.

That's the basic shape of any Flink job.
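
As a toy sketch of that source → operator → sink shape (plain Python generators, not the Flink API; all names here are invented for illustration):

```python
# Source: produces records (here, a hardcoded list standing in for Kafka).
def source():
    yield {"user_id": "u-8821", "event_type": "view", "item_id": "p-443"}
    yield {"user_id": "bot-1",  "event_type": "view", "item_id": "p-443"}
    yield {"user_id": "u-1042", "event_type": "purchase", "item_id": "p-117"}

# Operator: one unit of processing logic -- filter out bot traffic.
def drop_bots(events):
    for event in events:
        if not event["user_id"].startswith("bot-"):
            yield event

# Sink: consumes the output (here, just collecting into a list).
sink = list(drop_bots(source()))
print([e["user_id"] for e in sink])  # ['u-8821', 'u-1042']
```

In a real job, the operator would be a Flink transformation and the sink a database or Kafka topic, but the dataflow shape is the same.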

    Parallelism

A single machine can process events fast, but if you're handling millions of users, a single machine isn't enough. Flink solves this by running every operator in parallel: each operator is split into multiple subtasks that run concurrently on different machines in your cluster.

If you have a Filter operator with parallelism 4, there are four instances running concurrently, each processing a different portion of the stream. Add more machines, get more subtasks, handle higher volumes. This is how Flink scales to billions of events per day.

For our recommendation engine, this means the window aggregation for 10 million users isn't running sequentially on one machine; it's split across dozens of workers.
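
The routing trick that makes keyed parallelism work is hash partitioning: events for the same key always go to the same subtask. A minimal sketch of the idea (my own simplification, not Flink's actual scheduler):

```python
PARALLELISM = 4

def subtask_for(user_id: str) -> int:
    """Route every event for the same user to the same subtask,
    so that subtask can hold all of that user's state locally."""
    return hash(user_id) % PARALLELISM

# 24 events from 6 users, scattered through the stream.
events = [{"user_id": f"u-{n % 6}", "seq": n} for n in range(24)]

# Each subtask (partition) receives only its own slice of the stream.
partitions = {i: [] for i in range(PARALLELISM)}
for event in events:
    partitions[subtask_for(event["user_id"])].append(event)

# Every event lands in exactly one partition, and all events for a
# given user land in the same one, so that user's state lives in one place.
assert sum(len(p) for p in partitions.values()) == len(events)
```

This is why Flink's `keyBy` comes before stateful operations: it guarantees a user's events and that user's state end up on the same machine.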

    State

Going back to our recommendation engine: when a user views a product, that single event by itself tells you almost nothing. You need context. What else has this user been viewing in the past few minutes? Have they been browsing products in the same category? Did they almost purchase something similar last session? To answer these questions, your system needs memory: it needs to remember what happened before.

In the early days of stream processing, most systems were stateless. Each event was processed in isolation: the operator saw the event, transformed it, moved on. No memory of what came before. This worked fine for simple pipelines, like filtering out bot traffic or enriching events with metadata from a lookup table. But it was fundamentally too limited for anything that required reasoning about patterns over time.

Think about what our recommendation engine actually needs to do. For every incoming event, it needs to ask: "What has user u-8821 done in the last 10 minutes?" To answer that question, someone has to be keeping a running list of user u-8821's recent events. And user u-1042's recent events. And all the other users'. That's state: data that accumulates and evolves as records flow through the operator, rather than being derived fresh from each individual record.

Flink makes state a first-class concept. An operator can declare state explicitly: a counter, a hash map keyed by user ID, a sorted list of recent events. Flink gives you that state as a managed object you can read and write during processing. For our recommendation engine, the state might be a hash map from user ID to "list of product IDs viewed in the last 10 minutes." Every time a new view event arrives, you look up the user in the map, append the item, and trim events older than 10 minutes.
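
Here's a minimal Python sketch of that keyed state (an in-memory stand-in assuming a single process; in Flink this would be managed state that gets snapshotted for you):

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 10 * 60  # keep 10 minutes of history per user

# Keyed state: user_id -> recent (timestamp, item_id) pairs.
recent_views = defaultdict(deque)

def on_view_event(user_id: str, item_id: str, ts: float) -> list:
    """Process one view event: append it, trim anything older than
    10 minutes, and return the user's current recent-item list."""
    views = recent_views[user_id]
    views.append((ts, item_id))
    while views and views[0][0] < ts - WINDOW_SECONDS:
        views.popleft()
    return [item for _, item in views]

print(on_view_event("u-8821", "p-443", ts=0))    # ['p-443']
print(on_view_event("u-8821", "p-501", ts=300))  # ['p-443', 'p-501']
print(on_view_event("u-8821", "p-117", ts=700))  # ['p-501', 'p-117']
```

Note how the third call drops p-443: by ts=700 it is more than 10 minutes old. The hard part, covered next, is what happens to this in-memory structure when the machine holding it crashes.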

But managing state in a distributed system is genuinely hard. What happens when the machine running your operator crashes? That in-memory hash map is gone. Flink handles this: it periodically snapshots all operator state to durable storage, so on recovery it can restore everything to where it was before the failure. And it guarantees that state updates are applied exactly once: even if a machine crashes and the same events are replayed during recovery, your counts won't be doubled.

We'll go deep on how Flink achieves exactly-once guarantees in a future architecture post. For now, just know that Flink gives you state that feels as reliable as writing to a database, with the performance of an in-memory hash map.

Windows

We've got a stream of user events, operators running in parallel, and state accumulating per user. Now here's a problem that comes up almost immediately in any real aggregation.

Let's say you want to compute "the 10 most viewed products in the last 5 minutes" to power a "trending now" section of your website. You have an operator that's counting views per product. But your stream is infinite. When do you emit a result? You can't wait until "all the events" arrive; they never stop arriving.

You need a way to slice the infinite stream into finite pieces and compute over each piece. That's a window.

A window is a bounded chunk of your stream. You define it, Flink groups the events into that chunk, and when the chunk is "full," it runs your aggregation and emits a result. Flink has several window types: tumbling windows, sliding windows, session windows, and so on. It's not critical to understand the differences between each window type, but the gist of windows is that they look at data for some time interval.
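
A minimal sketch of a tumbling-window aggregation in plain Python (illustrative only; in Flink, window assignment and firing happen inside the engine):

```python
from collections import Counter, defaultdict

WINDOW_MS = 5 * 60 * 1000  # 5-minute tumbling windows

# (timestamp_ms, item_id) view events from the stream
events = [
    (10_000, "p-443"), (20_000, "p-117"), (30_000, "p-443"),    # window 0
    (310_000, "p-501"), (320_000, "p-501"), (330_000, "p-443"), # window 1
]

# Assign each event to the window containing its timestamp, then
# aggregate per window: views per product.
windows = defaultdict(Counter)
for ts, item in events:
    window_start = (ts // WINDOW_MS) * WINDOW_MS
    windows[window_start][item] += 1

for start, counts in sorted(windows.items()):
    item, views = counts.most_common(1)[0]
    print(f"window starting at {start}ms: top item {item} with {views} views")
# window starting at 0ms: top item p-443 with 2 views
# window starting at 300000ms: top item p-501 with 2 views
```

Each window is a finite chunk, so the aggregation can finish and emit a result even though the stream itself never ends.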

Tidbits from the Original Paper

I spent some time reading the 2015 Apache Flink paper, "Apache Flink: Stream and Batch Processing in a Single Engine" by Carbone, Katsifodimos, Ewen, Markl, Haridi, and Tzoumas. A few things from the paper add useful color to what we covered above:

On Fault Tolerance and Exactly-Once Guarantees

The paper describes exactly-once semantics this way: "Flink provides strict exactly-once-processing consistency guarantees for stateful operators through a combination of distributed snapshots and partial re-execution upon recovery." The key phrase there is *partial* re-execution: when a machine fails, Flink doesn't restart the whole job from the beginning. It rolls back all operators to their last successful snapshot, then replays only the input from that point forward. The maximum amount of reprocessing is bounded by the gap between two consecutive checkpoints, which is a tunable parameter.

The mechanism that makes this work without pausing the computation is called Asynchronous Barrier Snapshotting (ABS), and it's genuinely clever. We'll cover it in full detail in the next post. But the headline is: Flink injects special "barrier" markers into the data stream, which flow through the operators like regular records. When an operator receives a barrier, it snapshots its state to durable storage and forwards the barrier downstream, all while continuing to process records. No pause, no freeze, no missed events.
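
To build intuition for why checkpoint-plus-replay yields exactly-once state, here's a deliberately simplified single-operator simulation (my own toy model, not the ABS protocol itself):

```python
# An operator counts events and checkpoints its state every 3 events.
# A checkpoint records the state AND the input offset together, so
# recovery knows exactly where to resume reading.
events = list(range(10))
CHECKPOINT_EVERY = 3

state = {"count": 0}
checkpoint = {"offset": 0, "state": {"count": 0}}  # "durable storage"

for offset, _ in enumerate(events):
    state["count"] += 1
    if (offset + 1) % CHECKPOINT_EVERY == 0:
        checkpoint = {"offset": offset + 1, "state": dict(state)}
    if offset == 7:
        break  # simulate a crash after the 8th event (last checkpoint: offset 6)

# Recovery: roll back to the snapshot, replay input from its offset only.
state = dict(checkpoint["state"])
for _ in events[checkpoint["offset"]:]:
    state["count"] += 1

print(state["count"])  # 10 -- each event counted exactly once
```

Events 6 and 7 were processed twice in wall-clock terms, but their first effects were discarded along with the post-checkpoint state, so the final count reflects each event exactly once.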

    On Unified Batch and Stream Processing

One of the clearest statements in the paper is this: "A bounded data set is a special case of an unbounded data stream." The authors are making a philosophical claim, not just a technical one. And they back it up: "Batch computations are executed by the same runtime as streaming computations. The runtime executable may be parameterized with blocked data streams to break up large computations into isolated stages that are scheduled successively."

In plain terms: there is no separate batch engine in Flink. Batch jobs run on the very same distributed dataflow runtime that processes your Kafka streams. The only difference is that batch jobs use "blocked" data exchange between stages: the upstream operator finishes entirely before the downstream one starts. Everything else (the operator model, the state management, the serialization) is the same.

Going back to our recommendation engine: this means the job that counts real-time view trends and the job that processes 6 months of historical events for model retraining can share the same operators, the same cluster, and the same codebase. The paper's promise is that the Lambda Architecture, with its two systems and two codebases, is simply no longer necessary.

    Wrapping Up

Let's quickly do a TL;DR:

Data is produced as continuous streams, but we've historically forced it into batches, creating latency and the operational pain of maintaining two systems.

Flink is built on the insight that batch is just a special case of streaming, and unifies both in a single engine.

The core building blocks are: operators (processing logic), streams (data in motion), state (memory that persists across records), and windows (bounded slices of a stream for computation).

Fault tolerance with exactly-once guarantees is built in.

Ideally, I would have liked to go in depth on each of these topics (and there's plenty of depth in them), but this post has already become quite long, so I'll defer that to future Sanil for now. You can also follow me on LinkedIn for more byte-sized posts and to see what I'm learning about right now.

We talked a lot about Apache Kafka (given it's the backbone of most data architectures), but did you ever wonder how Apache Kafka works and what its architecture is like? I was surprised to learn how simple Kafka really is under the hood. I wrote a whole blog post about it here:
System Design Series: Apache Kafka from 10,000 feet
Let's look at what Kafka is, how it works and when should we use it! (medium.com)

If you're looking for something more in-depth, I'd recommend checking out one of my most popular posts on Temporal, a workflow orchestration tool, with in-depth explanations of how events are scheduled, started, and completed:
System Design Series: A Step-by-Step Breakdown of Temporal's Internal Architecture
A step-by-step deep dive into Temporal's architecture, covering workflows, tasks, shards, partitions, and how Temporal… (medium.com)



