We’ve been hearing a lot about real-time data platforms, streaming data, and the messaging technologies that enable streaming, such as Apache Kafka.
The idea is to turn your streaming data into real-time data by using it while it still has value. This isn’t always easy to achieve. In fact, it’s almost never easy to achieve, and most companies are still struggling to do it day in, day out without sacrificing other things, such as consistency and reliability.
This blog explores why.
Real-time data platforms are rarely 100% real time
The first thing we need to consider is that very few real-time data platforms exist in some kind of mysterious void where they don’t need to speak to any other, slower system. Real-time latency might be a non-negotiable requirement of what you’re building and/or doing, but it rarely works alone. At a minimum, it needs to tell some back office system somewhere what it’s done, so that the company can update its financials.
The same thing is true in reverse: all real-time systems have, at a minimum, an updatable list of products and their prices, or something functionally equivalent. So while upstream feeds might be small in comparison, they’re still critical to the functioning of the system as a whole.
Doing real-time and streaming at the same time is challenging
Historically, sending and receiving data to and from real-time systems was an awkward batch process that tended to get tacked on as an afterthought. Legacy RDBMS products weren’t very good at this, as you tended to get big, expensive SQL operations that had a nasty habit of going wrong at two in the morning. In theory, the widespread adoption of streaming ought to make life a lot simpler, but the reality is sadly different. Real-time and streaming systems see the world very differently from the batch systems they replace. For example, there isn’t really such a thing as ‘complete’ in a streaming world, whereas we used to be able to assume the contents of a file or the results of a big query were, by definition, up to date.
There’s a fairly deep-seated and hard-to-fix ‘impedance mismatch’ that isn’t visible when you first sketch this out on a whiteboard but becomes visible at implementation.
The first issue is that most real-time systems think in terms of 1-10ms, whereas streaming thinks in multiples of 100ms. So a real-time system might have finished with something and moved on to another task long before the streaming subsystem becomes aware the data item even exists. The problem starts a few seconds later, when you want to know an aggregate, like the total number of widgets you sold in the last minute. Because there’s a lag of undefined duration between an event happening and you reliably knowing about it downstream, there’s a window of time in which neither the real-time system nor your back-office system knows for certain what’s going on.
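This mismatch can be made concrete with a small sketch. The numbers below are invented for illustration: a fixed 500ms delivery lag between the real-time system and the streaming layer, and a handful of sale events. The point is that the same aggregate, asked at the same moment, gives two different answers depending on which side of the lag you sit.

```python
# Illustrative sketch (hypothetical numbers): why a per-minute aggregate
# queried downstream can disagree with what the real-time system already knows.

# Events the real-time system processed, as (event_time_ms, widgets_sold).
events = [(0, 1), (150, 2), (400, 1), (900, 3)]

STREAM_LAG_MS = 500  # assumed delivery lag to the streaming layer


def realtime_total(events, query_time_ms):
    """Total the real-time system itself has already counted."""
    return sum(n for t, n in events if t <= query_time_ms)


def downstream_total(events, query_time_ms, lag_ms=STREAM_LAG_MS):
    """Total visible downstream: only events already delivered by query time."""
    return sum(n for t, n in events if t + lag_ms <= query_time_ms)


# At t=1000ms the real-time system has counted every sale...
assert realtime_total(events, 1000) == 7
# ...but downstream only sees events at least 500ms old: the 900ms sale is missing.
assert downstream_total(events, 1000) == 4
```

In a real deployment the lag isn’t a neat constant, which is exactly why the gap is so hard to reason about.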
Scaling is another issue. Popular streaming systems such as Kafka can be very brittle when it comes to rapid changes in volumes, so people tend to tolerate transient backlogs, which makes the time-lag issue we just discussed even worse.
In-house solutions connecting real-time data processing to streaming data are problematic
The obvious approach to this issue is for the developers to build some form of CDC (Change Data Capture) functionality into the base application. This is a lot of work, much more than it first appears, because if you want your downstream numbers to match, it needs to be an ACID-compliant CDC system, with a guaranteed 1:1 relationship between an event happening and a CDC message reaching its destination, even in the case of outages.
Most NoSQL products or DataGrids can’t do this without heroic levels of effort by developers, and you’re often left with an unpleasant choice between slow and complicated or fast and unreliable. The bottom line is that CDC systems are complicated and tricky enough to spawn entire families of products to do the job.
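One common way to get that 1:1 guarantee is the ‘transactional outbox’ pattern, sketched below with SQLite for brevity. Table and column names are invented for illustration. The key idea is that the business write and the outbound CDC message are committed in a single transaction, so either both exist or neither does; a separate relay then drains the outbox.

```python
import sqlite3

# Minimal sketch of the 'transactional outbox' pattern: the business row and
# the CDC message are written in ONE atomic transaction, giving a guaranteed
# 1:1 relationship between an event and an outbound message, even across crashes.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, widget TEXT, qty INTEGER);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT, sent INTEGER DEFAULT 0);
""")


def place_order(widget, qty):
    with db:  # single atomic transaction: both inserts commit, or neither does
        db.execute("INSERT INTO orders (widget, qty) VALUES (?, ?)", (widget, qty))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (f'{{"widget": "{widget}", "qty": {qty}}}',))


place_order("sprocket", 3)

# A separate relay process would poll the outbox, publish each row to the
# stream, and mark it sent; after an outage it simply resumes from unsent rows.
unsent = db.execute("SELECT payload FROM outbox WHERE sent = 0").fetchall()
assert len(unsent) == 1
```

This is simple to sketch and genuinely hard to productionize, which is why, as noted above, entire product families exist to do the job.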
What about buying an off-the-shelf streaming CDC solution and adding it to my data platform?
This may or may not help. Assuming the new system actually works as advertised, it will allow you to get data out of your real-time data processing platform reasonably quickly and effectively, but you still need to consider the following caveats and ‘gotchas’:
Most CDC systems were designed with smaller scales in mind.
Modern systems are producing 10 to 100x the output of ones from a few years ago. Any CDC system you integrate needs to keep pace not just with the demands you face today but with those you’ll face 3-5 years from now.
Copying every change to every record may not be what you actually want.
For a lot of modern systems, the sheer amount of activity means that sending every single change downstream may overwhelm back-end systems. In many cases, you’d be better off doing what the telco industry calls ‘aggregation’ — consolidating a stream of hundreds of technical events into a much smaller one consisting of ‘business-relevant’ events. By aggregating before records are sent downstream, you avoid having to build an elaborate system to do the same aggregation on arrival.
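A minimal sketch of that telco-style aggregation, with invented event shapes: many fine-grained technical events (here, per-packet byte counts) collapse into one business-relevant usage record per session before anything crosses to the back end.

```python
from collections import defaultdict

# Hypothetical sketch of telco-style 'aggregation': collapse a stream of
# fine-grained technical events into far fewer business-relevant records
# before sending anything downstream. Event shapes are invented for illustration.
technical_events = [
    {"session": "a", "bytes": 512},
    {"session": "a", "bytes": 2048},
    {"session": "b", "bytes": 128},
    {"session": "a", "bytes": 1024},
]


def aggregate(events):
    """Roll per-packet events up into one usage record per session."""
    totals = defaultdict(int)
    for e in events:
        totals[e["session"]] += e["bytes"]
    return [{"session": s, "total_bytes": n} for s, n in sorted(totals.items())]


records = aggregate(technical_events)
# Four technical events became two business-relevant records.
assert records == [
    {"session": "a", "total_bytes": 3584},
    {"session": "b", "total_bytes": 128},
]
```

The downstream system now receives a record per session rather than a record per packet, which is usually what the business actually wanted to know in the first place.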
Off-the-shelf CDC may have the same problems with latency and lag we discussed earlier.
Without extensive testing, it’s hard to tell whether an off-the-shelf CDC product will work. A fundamental issue is that, ironically, precisely because this is a mature market sector, many of the offerings were designed for yesterday’s scale and will struggle with the volumes we can expect in the future.
How Active(SD) Helps
Volt’s Active(SD) [Streaming Decisions] combines processing, storage, and streaming into a single platform, allowing companies to avoid the issues mentioned above. Active(SD) can store data to make sense of late-arriving events, then compare incoming records with stored ones, and finally stream the data you need when you need it. A key feature is that it can be made to look like a Kafka cluster on your network, so existing applications that send data to Kafka can be re-plumbed so they send it to Volt.
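Because Active(SD) can present itself as a Kafka cluster, the re-plumbing is, in principle, a connection-configuration change rather than a code change. The sketch below illustrates the idea with a generic Kafka producer configuration; the hostnames are invented and do not refer to real endpoints.

```python
# Illustrative only: if the new platform speaks the Kafka protocol, an existing
# producer keeps its code and topics and simply points at a different endpoint.
# Hostnames below are hypothetical.
producer_config = {
    "bootstrap.servers": "kafka-broker-1:9092",  # current Kafka cluster
    "acks": "all",
}

# Re-plumb: same client, same topics, different endpoint.
producer_config["bootstrap.servers"] = "volt-activesd-1:9092"

assert producer_config["bootstrap.servers"] == "volt-activesd-1:9092"
```

In practice you would verify protocol compatibility (client versions, security settings, and so on) against the vendor’s documentation before making the switch.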