Fast Data Recipe: Design Data Pipelines

Jan 18, 2016

Key Takeaways

Processing big data effectively often requires multiple database engines, each specialized to a purpose.
Data arriving at high-velocity, ingest-oriented systems needs to be processed and captured into volume-oriented systems; advanced cases require communication between these systems.
Transformations can be executed in a streaming fashion before the data reaches the long-term repository, avoiding ETL.
Real-time applications processing incoming events often require analytics from backend systems, necessitating a data management system for holding and regularly updating state.
When connecting multiple systems, ensure they have an independent fate; any part of the pipeline should be able to fail while leaving other systems available and functional.

Processing big data effectively often requires multiple database engines, each specialized to a purpose. Databases that are very good at event-oriented, real-time processing are likely not good at batch analytics against large volumes. Here’s a quick look at another of the Fast Data recipes from the ebook “Fast Data: Smart and at Scale,” which Ryan Betts and I authored.

Data arriving at high-velocity, ingest-oriented systems needs to be processed and captured into volume-oriented systems. In more advanced cases, reports, analytics, and predictive models generated from volume-oriented systems need to be communicated to velocity-oriented systems to support real-time applications. Real-time analytics from the velocity side need to be integrated into operational dashboards or downstream applications that process real-time alerts, alarms, insights, and trends.

In practice, this means that many big data applications sit on top of a platform of tools. Data and processing outputs move between all of these systems. Designing that dataflow—designing a processing pipeline—that coordinates these different platform components is key to solving many big data challenges.

Pattern: Use Streaming Transformations to Avoid ETL
Pattern: Connect Big Data Analytics to Real-Time Stream Processing
Pattern: Use Loose Coupling to Improve Reliability

Pattern: Use Streaming Transformations to Avoid ETL

New events being captured into a long-term repository often require transformation, filtering, or processing before they are available for reporting use cases. There are at least two approaches to running these transformations.

All of the data can be landed in a long-term repository and then extracted, transformed, and reloaded in its final form.
The transformations can be executed in a streaming fashion before the data reaches the long-term repository.
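
The second approach can be sketched as a chain of generators, where each event is filtered and transformed in flight and only the finished record is stored. This is a minimal illustration, not a production design: the comma-separated event format, the function names, and the in-memory list standing in for the long-term repository are all assumptions made for the example.

```python
from typing import Iterable, Iterator

def parse(lines: Iterable[str]) -> Iterator[dict]:
    """Turn raw 'user,action,amount' lines into records, dropping malformed ones."""
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) != 3:
            continue  # filter: wrong number of fields
        user, action, amount = parts
        try:
            value = float(amount)
        except ValueError:
            continue  # filter: amount is not numeric
        yield {"user": user, "action": action, "amount": value}

def enrich(records: Iterable[dict]) -> Iterator[dict]:
    """Transform: add a derived field before the record is stored."""
    for rec in records:
        yield {**rec, "large": rec["amount"] >= 100.0}

def load(records: Iterable[dict], repository: list) -> int:
    """Append finished records to the (hypothetical) long-term repository."""
    count = 0
    for rec in records:
        repository.append(rec)
        count += 1
    return count

# Hypothetical incoming events; two of the four are malformed.
raw_events = ["alice,buy,120.0", "bad line", "bob,buy,5.5", "carol,buy,abc"]
repository: list = []
stored = load(enrich(parse(raw_events)), repository)
```

Because each stage is a generator, events flow through one at a time and the repository only ever receives data in its final form, so no later extract-and-reload pass is needed.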

Pattern: Connect Big Data Analytics to Real-Time Stream Processing

Real-time applications processing incoming events often require analytics from backend systems. This introduces a few important requirements. First, the fast data, velocity-oriented application requires a data management system capable of holding the state generated by the batch system; second, this state needs to be regularly updated or replaced in full. There are a few common ways to manage the refresh cycle—the best tradeoff will depend on your specific application.

Other applications require the analytics data to be strictly consistent: it is not enough for each record to be internally consistent, because producing a correct result requires the set of records as a whole to be consistent. A reasonable approach to transferring this report data from the batch analytics system to the real-time system is to write the data to a shadow table. Once the shadow table is completely written, it can be atomically renamed, or swapped, with the main table that the application addresses. The application will then see either only data from the previous version of the report or only data from the new version, but never a mix of both in a single query.
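
The shadow-table swap can be sketched with Python's built-in sqlite3 module. SQLite is only a stand-in here, and the table names (report, report_shadow, report_old) and columns are assumptions made for the example; whether renames can be grouped into one atomic transaction depends on the database you actually use.

```python
import sqlite3

def publish_report(conn: sqlite3.Connection, rows: list) -> None:
    """Load a full report into a shadow table, then swap it in atomically."""
    # Build the new version off to the side; readers still see the old table.
    conn.execute("DROP TABLE IF EXISTS report_shadow")
    conn.execute("CREATE TABLE report_shadow (segment TEXT, score REAL)")
    conn.executemany("INSERT INTO report_shadow VALUES (?, ?)", rows)

    # Swap both names inside one transaction so no query sees a mix.
    conn.execute("BEGIN")
    try:
        conn.execute("DROP TABLE IF EXISTS report_old")
        conn.execute("ALTER TABLE report RENAME TO report_old")
        conn.execute("ALTER TABLE report_shadow RENAME TO report")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise

# isolation_level=None lets the code issue BEGIN/COMMIT explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE report (segment TEXT, score REAL)")
conn.execute("INSERT INTO report VALUES ('a', 0.1)")

publish_report(conn, [("a", 0.9), ("b", 0.4)])
current = conn.execute(
    "SELECT segment, score FROM report ORDER BY segment"
).fetchall()
previous = conn.execute("SELECT segment, score FROM report_old").fetchall()
```

Keeping the previous version around as report_old is a design choice in this sketch, not a requirement of the pattern; it makes it easy to inspect or restore the last report.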

Pattern: Use Loose Coupling to Improve Reliability

When connecting multiple systems, it is imperative that all systems have an independent fate. Any part of the pipeline should be able to fail while leaving other systems available and functional. If the batch back end is offline, the high-velocity front end should still be operating, and vice versa.
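
One way to give the two sides an independent fate is to place a buffer between them, so the front end never calls the back end directly. The sketch below assumes a hypothetical back end that raises BackendDown while it is offline; the class names and the bounded in-memory buffer are illustrative choices, and a real pipeline would more likely use a durable queue.

```python
from collections import deque

class BackendDown(Exception):
    """Raised by the (hypothetical) batch back end when it is unavailable."""

class FakeBackend:
    """Stand-in for the batch system; 'up' can be toggled to simulate an outage."""
    def __init__(self):
        self.up = False
        self.received = []

    def send(self, event):
        if not self.up:
            raise BackendDown()
        self.received.append(event)

class BufferedForwarder:
    def __init__(self, send, max_buffered: int = 1000):
        self._send = send
        # Bounded buffer: when full, the oldest events are discarded.
        self._buffer = deque(maxlen=max_buffered)

    def accept(self, event) -> None:
        """Front-end path: only touches the local buffer, never the back end."""
        self._buffer.append(event)

    def flush(self) -> int:
        """Deliver buffered events in order; stop at the first failure."""
        delivered = 0
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except BackendDown:
                break  # keep the remaining events for the next attempt
            self._buffer.popleft()
            delivered += 1
        return delivered

    def pending(self) -> int:
        return len(self._buffer)

backend = FakeBackend()
forwarder = BufferedForwarder(backend.send)
for event in ("e1", "e2", "e3"):
    forwarder.accept(event)          # succeeds even though the back end is down

first_attempt = forwarder.flush()    # back end offline: nothing is delivered
pending_during_outage = forwarder.pending()
backend.up = True
second_attempt = forwarder.flush()   # back end restored: buffer drains in order
```

The bounded buffer trades completeness for availability: during a long outage the front end keeps running, but the oldest events are eventually dropped. Whether that tradeoff is acceptable depends on the application.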

In every pipeline there is, by definition, a slowest component: the bottleneck. When designing, explicitly choose the component that will be your bottleneck. Having many systems, each with identical performance, means a minor degradation to any one of them will create a new overall bottleneck, which is operationally painful. It is often better to choose your most reliable component, or your most expensive resource, as the bottleneck. Overall, you will achieve a more predictable level of reliability.
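
A small calculation illustrates the point. The stage names and capacities below are made-up numbers, chosen only to compare a pipeline where every stage has identical capacity with one where a single stage is deliberately the slowest.

```python
def bottleneck(capacity: dict) -> tuple:
    """Return (stage, events_per_second) for the slowest stage."""
    stage = min(capacity, key=capacity.get)
    return stage, capacity[stage]

def degrade(capacity: dict, stage: str, percent: int) -> dict:
    """Return a copy of the capacities with one stage slowed by 'percent'."""
    slowed = capacity[stage] * (100 - percent) // 100
    return {**capacity, stage: slowed}

# Hypothetical capacities in events per second.
balanced = {"ingest": 10_000, "transform": 10_000, "export": 10_000}
planned = {"ingest": 10_000, "transform": 15_000, "export": 15_000}

# Apply the same 10% slowdown to the transform stage in both designs.
balanced_after = bottleneck(degrade(balanced, "transform", 10))
planned_after = bottleneck(degrade(planned, "transform", 10))
```

In the balanced design the slowdown moves the bottleneck to the transform stage and cuts overall throughput to 9,000 events per second. In the planned design the ingest stage remains the bottleneck at 10,000 events per second, so the same degradation has no effect on the pipeline as a whole.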

This was a quick overview of effective data pipeline recipes. If you are interested in this topic and would like to read more, download the ebook “Fast Data: Smart and at Scale.”

About Author

Adrian Scholes
