My Apache Flink Pains

Flink is powerful, but it's a brutal babysitting job. From "GC death spirals" to week-long schema evolution nightmares, the operational burden is immense. My Flink war stories reveal the true, hidden cost in engineers and time.

By Bo Lei, Co-Founder & CTO, Fleak

The True Cost of Stateful Streaming: My Apache Flink Pains

Apache Flink is one of the most powerful stream processing frameworks out there. I get why it’s so popular. Its ability to handle massive throughput with low latency, all while managing complex state with exactly-once guarantees, is a huge technical achievement.

In my past roles, I've been deep in the trenches with Flink, scaling it for pipelines processing enormous amounts of data. And I can tell you, that power comes at a steep, steep price.

I've watched teams get completely bogged down, spending more time debugging Flink's internals than building their own product. This isn't a post to talk you out of Flink. It's a post to share what I learned the hard way, so you can go in with your eyes open.

I'm not going to give you a tidy "three-point list." We're just going to talk about the things that caused me the most pain. It really boils down to state, operations, and the real cost in people and money.

The State Management Nightmare

Flink's primary selling point is its state management. It's what lets you do complex sessionization or build 24-hour analytics. It's also, in my experience, the single biggest source of pain.

To give its guarantees, Flink is constantly snapshotting your application's state (checkpoints) so it can recover from failures. Where you store this state is your first big decision, and it’s a bad one either way.
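
To make that concrete, here is a minimal sketch of what the checkpointing knobs look like in the DataStream API. The intervals and timeouts are illustrative placeholders rather than recommendations, and the exact classes shift slightly between Flink versions:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot all operator state every 60s with exactly-once semantics.
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig checkpoints = env.getCheckpointConfig();
        checkpoints.setCheckpointTimeout(120_000L);          // give up on a snapshot after 2 min
        checkpoints.setMinPauseBetweenCheckpoints(30_000L);  // breathing room between snapshots
        checkpoints.setTolerableCheckpointFailureNumber(3);  // don't kill the job on one slow checkpoint

        // ... define sources, keyed state, sinks ...
        // env.execute("my-stateful-job");
    }
}
```

Every one of those numbers is something you will end up revisiting as your state grows.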

Your first option is the heap-based state backend (HashMapStateBackend in recent Flink releases). It's fast because all your state lives in the JVM heap. It's also a ticking time bomb.

I remember one specific incident. A job that had been stable for months suddenly started falling over. We saw checkpoint timeouts all over the place. TaskManagers were getting killed, the job would restart, and the cycle would repeat. The weird part is that our traffic hadn't spiked. The load looked normal.

We burned two days on this. It turned out to be GC. Someone on another team had pushed a "minor" change that added a new field to one of the objects we were holding in state. It was a tiny change, but it was just enough to push our heap usage over a critical threshold. The GC pauses went from a healthy ~200ms to over five seconds.

In Flink, five seconds is an eternity. The checkpoint barriers would time out, Flink's coordination would fail, and the job would die. That’s the "GC death spiral" people talk about, and it's a nightmare to diagnose because nothing looks wrong... until everything is on fire.

Your other option is the RocksDBStateBackend. This puts your state on disk, so you're not limited by RAM. Great, right? Except now you’ve just signed yourself up to become a RocksDB tuning expert, on top of being a Flink expert. You'll be spending your nights reading about block cache sizes, write buffers, and compaction, all because you wanted to count things in a window.
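
For a sense of what that signs you up for, here is a hedged sketch of the configuration involved. These config keys exist in recent Flink releases, but their names and defaults drift between versions, and every value shown is a placeholder rather than advice:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbBackendSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Keep state in embedded RocksDB and ship checkpoints to durable storage (path is a placeholder).
        conf.setString("state.backend", "rocksdb");
        conf.setString("state.checkpoints.dir", "s3://my-bucket/flink/checkpoints");

        // Option A: let Flink carve RocksDB's memory out of managed memory for you.
        conf.setString("state.backend.rocksdb.memory.managed", "true");

        // Option B: switch that off and own the low-level knobs yourself, which is where
        // the block cache, write buffer, thread, and compaction reading begins.
        // conf.setString("state.backend.rocksdb.memory.managed", "false");
        // conf.setString("state.backend.rocksdb.block.cache-size", "256mb");
        // conf.setString("state.backend.rocksdb.writebuffer.size", "64mb");
        // conf.setString("state.backend.rocksdb.writebuffer.count", "4");
        // conf.setString("state.backend.rocksdb.thread.num", "4");

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
        // ... pipeline ...
    }
}
```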

The Operational Burden is Not Optional

A Flink job isn't a "fire and forget" application. It's a living, breathing, distributed system that you have to babysit.

I was lucky when I was at Netflix. We had a dedicated platform team that had built an incredible, custom observability suite just for Flink. We had dashboards for everything: per-task backpressure, checkpoint durations, watermark lag, JVM metrics. We needed all of it.

Most teams don't have this. They're flying blind. Backpressure is the most critical metric, and it's notoriously hard to spot without the right tools.
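
If you do end up in that spot, Flink's REST API at least exposes a backpressure sampling endpoint you can poll by hand. The host, job ID, and vertex ID below are placeholders you'd pull from the /jobs endpoints first:

```bash
# Placeholders: <jobmanager-host>, <job-id>, <vertex-id>.
# List jobs, then a job's vertices, then sample backpressure for one vertex.
curl http://<jobmanager-host>:8081/jobs
curl http://<jobmanager-host>:8081/jobs/<job-id>
curl http://<jobmanager-host>:8081/jobs/<job-id>/vertices/<vertex-id>/backpressure
# The last call reports a per-subtask backpressure level (ok / low / high),
# which is a poor substitute for real dashboards, but better than guessing.
```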

But the real operational pain comes when you need to change anything.

Because your job is stateful, you can't just deploy a new version. The "simple" upgrade path (sketched with the flink CLI just after this list) is:

  1. Take a savepoint (a manual snapshot) of the running job.

  2. Stop the job.

  3. Deploy your new code.

  4. Restart the job, telling it to resume from the savepoint.
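
With the stock flink CLI, those steps look roughly like this. The job ID, savepoint directory, class, and jar names are placeholders, and the exact flags vary a little across Flink versions:

```bash
# Steps 1 and 2: trigger a savepoint and stop the job in one go
# (older workflows use `flink savepoint <job-id> <dir>` followed by `flink cancel <job-id>`).
flink stop --savepointPath s3://my-bucket/flink/savepoints <job-id>

# Step 3: ship the new jar however you normally deploy artifacts.

# Step 4: resume the new code from the savepoint written above.
flink run -s s3://my-bucket/flink/savepoints/savepoint-<id> \
  -c com.example.MyStreamingJob my-streaming-job-2.0.jar
```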

This process is incredibly fragile. And if you changed any of the classes you're using in your state, you're now in the world of schema evolution. This is a process that can, and will, ruin your week. I've personally lost days trying to figure out why a new serializer was "not found" and how to map an old state class to a new one.
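
To make "schema evolution" concrete: Flink's built-in POJO serializer can usually survive adding or removing fields on a state class, but renaming fields, changing their types, or silently falling back to Kryo is where a savepoint restore starts failing. A hypothetical example of the distinction:

```java
// Hypothetical class held in keyed state, e.g. ValueState<UserSession>.
// Flink's POJO serializer can typically evolve this across a savepoint restore if you
// only add or remove fields. Renaming a field, changing a field's type, or renaming
// the class itself are the changes that break the restore.
public class UserSession {
    public String userId;          // existing field: fine
    public long startTimestampMs;  // changing this type (say, to Instant) is not evolvable
    public int eventCount;         // added in v2: old entries restore with the default value (0)

    // POJO rules: public no-arg constructor plus public (or getter/setter) fields.
    // Break them and Flink quietly falls back to Kryo, which cannot evolve at all.
    public UserSession() {}
}
```

When the class genuinely has to change shape, the usual escape hatches are rewriting the savepoint offline with the State Processor API or writing a custom serializer with explicit migration logic. Neither is a quick afternoon.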

Let's Talk About the Real Cost: People and Money

This is the part most folks miss. The cost of Flink isn't your AWS bill (though that's high, too). The real cost is in your engineers.

When you adopt Flink, you are committing to hiring or training specialists. You can't just ask a data engineer who knows Spark or SQL to be productive in Flink in a week. The learning curve is brutal, not for the API, but for the internals you'll be forced to debug.

Every hour your team spends trying to understand Flink's memory model or why a checkpoint is failing is an hour they aren't building your actual product. It demands a platform-level investment. For any serious Flink deployment, you're going to end up with a team of SREs and platform engineers whose full-time job is just keeping the Flink clusters alive.

The computation cost is also non-trivial. These are long-running, 24/7 jobs. You can't just scale them to zero. And because recovery from a failure is so resource-intensive (reloading all that state), you often have to over-provision your clusters just to handle a restart, even if the steady-state load is much lower.

So, When is Flink the Right Tool?

After all that, it sounds like I'm saying "never use Flink." I'm not. But my bar for it is high:

  1. Is my problem fundamentally stateful? I'm not talking about a simple 5-minute counter. I mean true Complex Event Processing (CEP), correlating events over long time windows, or building sessions where data can arrive out of order (there's a sketch of that shape of job after this list).

  2. Are "exactly-once" semantics a hard, non-negotiable business requirement? (Hint: most of the time, "at-least-once" is fine).

  3. Do I have a dedicated platform engineering team, or the budget and political will to build one, whose job it will be to own this platform?
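
To make the first point concrete, this is roughly the shape of job Flink is actually built for: event-time sessionization over data that arrives out of order. The event type, values, and session summary below are hypothetical; only the watermark and windowing APIs are Flink's, and their details shift between versions:

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class SessionizationSketch {

    // Hypothetical event type; a real job would deserialize this from Kafka.
    public static class ClickEvent {
        public String userId;
        public long eventTimeMs;
        public ClickEvent() {}
        public ClickEvent(String userId, long eventTimeMs) {
            this.userId = userId;
            this.eventTimeMs = eventTimeMs;
        }
    }

    // Summarize one user session; real jobs accumulate far richer state here.
    public static class CountPerSession
            implements WindowFunction<ClickEvent, String, String, TimeWindow> {
        @Override
        public void apply(String userId, TimeWindow window,
                          Iterable<ClickEvent> events, Collector<String> out) {
            long count = 0;
            for (ClickEvent ignored : events) {
                count++;
            }
            out.collect(userId + " session with " + count + " events ending at " + window.getEnd());
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in source with a few hard-coded events.
        DataStream<ClickEvent> events = env.fromElements(
                new ClickEvent("user-1", 1_000L),
                new ClickEvent("user-1", 120_000L),
                new ClickEvent("user-2", 5_000L));

        events
            // Event time, tolerating records that arrive up to 10 minutes out of order.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<ClickEvent>forBoundedOutOfOrderness(Duration.ofMinutes(10))
                    .withTimestampAssigner((event, ts) -> event.eventTimeMs))
            .keyBy(event -> event.userId)
            // A "session" is any burst of activity with gaps shorter than 30 minutes.
            .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
            .apply(new CountPerSession())
            .print();

        env.execute("sessionization-sketch");
    }
}
```

If your problem doesn't look something like this, the rest of Flink's machinery is overhead you're paying for nothing.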

If your problem is stateless (just reading from Kafka, transforming a message, and writing it out), please, don't use Flink. You're bringing in a mountain of complexity for no reason.

Flink is a specialist's tool. It's built for a class of problems most companies don't actually have. Before you get pulled in by the impressive feature list, be brutally honest about your problem, and even more honest about your team's and your company's willingness to pay the very real, very high price.
