Here at Volt Active Data, we have long stressed the importance of using an in-memory data management platform for stateful stream processing as a trend for the future. It seems that now, with the arrival of the 5G era, the market is finally ready for this functionality.
New Messaging in the Marketplace: How it Falls Short
Vendors of traditional message bus technologies are now trying to play catch-up. Confluent is the newest vendor to jump on the SQL bandwagon (with kSQL), and it recently announced “ksqlDB: a streaming database for Apache Kafka”. In its “What’s new/major features” blog post, Confluent put a marketing spin on features that have existed in databases for decades. It announced:
- Push queries, which are referred to as continuous queries in the database industry, and
- Pull queries, which are essentially long-established traditional database queries on locally stored data.
Confluent and kSQL/ksqlDB, however, still have a long way to go before they come close to offering ANSI SQL compliance, or even the most basic SQL features required for stateful processing of streaming event data.
What Does kSQL Lack?
The following are a few examples of some of the core database features that kSQL lacks:
- Efficient state management/disaster recovery
Apache Kafka was originally architected as a message hub for stateless message processing. With stateful processing, however, it is also necessary to recover the state after human error or a network or hardware failure. State recovery requires messages to be replayed, which can take hours to days depending on the size of your data (1). Can your mission-critical, revenue-generating application really afford that much downtime?
Volt Active Data’s data management platform offers comprehensive data replication, both inter-cluster and intra-cluster (Active-Passive and Active-Active). These replication options ensure High Availability (HA) in the face of hardware, software, or network failures. Learn more about Volt Active Data’s data replication capabilities here.
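To make the difference between replay-based recovery and replication concrete, here is a minimal sketch in plain Python. It uses neither Kafka nor Volt Active Data APIs; the `Node` class, the event log, and the cost counters are all hypothetical, and the only point is that replay work grows with the size of the log while a replica that has been kept in step needs no replay at all.

```python
class Node:
    """Hypothetical stateful processor that keeps running balances."""

    def __init__(self):
        self.state = {}
        self.events_applied = 0

    def apply(self, event):
        account, amount = event
        self.state[account] = self.state.get(account, 0) + amount
        self.events_applied += 1


# Illustrative event log: 10,000 deposits spread over 10 accounts.
log = [(f"acct-{i % 10}", 1) for i in range(10_000)]

# Normal operation: the primary and its replica both apply every event.
primary, replica = Node(), Node()
for event in log:
    primary.apply(event)
    replica.apply(event)

# Failure: the primary's in-memory state is lost.
lost_state = primary.state
primary = None

# Option A: rebuild on a fresh node by replaying the whole log.
rebuilt = Node()
for event in log:
    rebuilt.apply(event)
replay_cost = rebuilt.events_applied  # grows with the size of the log

# Option B: promote the replica, which already holds the state.
promoted = replica
failover_cost = 0  # no events need to be re-applied
```

Both options end with the same state. What differs is how much work stands between the failure and the first answered query.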
- Indexing / sorting data
Indexing and sorting data are fundamental to distributed processing in database management systems. When Big Data comes at you really fast, indexing is the key to delivering low-latency query responses. To index or sort data, Apache Kafka creates a new internal repartitioning topic, then saves and replicates all of the data in it. Whenever the user changes a key to facilitate real-time analytics, another topic is created to approximate the index or sort. These topics add up quickly, especially as your business grows and you rely on more real-time analytics. The overload of keys can put tremendous pressure on the message broker and add significant latency to analytical queries (1). This problem is a classic case of Apache Kafka putting the cart before the horse. The best practice in Operational Database Management Systems (ODBMS), and with Volt Active Data, is to create an optimal data model/schema that matches the data analysis needs of your application, including the indexing and sorting of data. As new data comes in, it fits right into the established, optimized, application-specific data model. With Apache Kafka, however, there is no existing data model, hence the reliance on creating new keys to index or sort data.
Learn more about creating an optimal database schema here.
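As a rough illustration of the schema-first approach described above, the sketch below declares an index alongside the table before any data arrives, then checks that a filtered, sorted query is served from that index. It uses Python's built-in `sqlite3` module purely as a stand-in for a SQL database; the `events` table, its columns, and the index name are assumptions made for the example, not a Volt Active Data schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (
        device_id INTEGER NOT NULL,
        region    TEXT    NOT NULL,
        ts        INTEGER NOT NULL,
        reading   REAL    NOT NULL
    );
    -- Declared once, with the table, to match the query we expect to run.
    CREATE INDEX idx_events_region_ts ON events (region, ts);
""")

# Illustrative rows: odd device ids are in "emea", even ones in "apac".
rows = [(i, "emea" if i % 2 else "apac", 1_000 - i, i * 0.5) for i in range(100)]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

query = "SELECT device_id, ts FROM events WHERE region = ? ORDER BY ts"

# Ask SQLite how it intends to run the query; the last column is the detail text.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("emea",)).fetchall()
plan_text = " ".join(row[3] for row in plan)

result = conn.execute(query, ("emea",)).fetchall()
```

Printing `plan_text` shows which index the engine picked. New rows inserted later slot into the same index without any change to the data model.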
- SQL queries across partitions
Data partitioning helps deliver high throughput, blazing-fast query performance, simplified data management, and high availability. To benefit from data partitioning, however, it is also important to give users the flexibility to query across partitions. This important feature enables app developers to customize their database to the requirements of their unique application. While Apache Kafka provides the ability to partition data, it does not offer the ability to run SQL queries across partitions. This imposes a severe handicap on the app developer’s ability to build a highly performant application that is customized to a specific use case.
Volt Active Data enables users to take full advantage of data partitioning: by distributing data across multiple nodes, Volt Active Data can achieve very high query performance. Since both the data and the processing are partitioned, multiple queries can run in parallel, and queries can also run across multiple partitions. Learn more about how Volt Active Data partitioning works here.
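The difference between a single-partition query and a cross-partition query can be sketched in a few lines of plain Python. This is a toy model, not Volt Active Data's implementation: the partition count, the `partition_for` hash, and the row layout are all assumptions chosen for illustration.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 4  # illustrative value


def partition_for(key: str) -> int:
    """Stable hash so a given key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


# Each partition holds only the rows whose partitioning key hashes to it.
partitions = [[] for _ in range(NUM_PARTITIONS)]


def insert(row: dict) -> None:
    partitions[partition_for(row["customer"])].append(row)


def total_for_customer(customer: str) -> float:
    """Single-partition query: only one partition is touched."""
    part = partitions[partition_for(customer)]
    return sum(r["amount"] for r in part if r["customer"] == customer)


def total_by_region(region: str) -> float:
    """Cross-partition query: scan every partition in parallel, then merge."""
    def scan(part):
        return sum(r["amount"] for r in part if r["region"] == region)

    with ThreadPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
        return sum(pool.map(scan, partitions))


# Illustrative data: 100 rows, 10 customers, two regions.
for i in range(100):
    insert({
        "customer": f"c{i % 10}",
        "region": "east" if i % 2 == 0 else "west",
        "amount": 1.0,
    })
```

Here `total_for_customer("c3")` reads one partition, while `total_by_region("east")` fans out to all four and combines the partial sums. A system that offers only the first kind of query forces the application to do the fan-out itself.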
- Replicated tables
Replicating some database tables (such as read-only tables) across all of the nodes in your cluster provides additional flexibility to the app developer. Replicating tables keeps their data highly available; for example, tables that need to be accessed frequently by columns other than the partitioning column are often replicated. kSQL does not offer this feature.
Replicated tables are a handy feature that Volt Active Data offers in addition to partitioning. Learn more about Volt Active Data’s replicated tables.
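To show why a replicated table is useful next to partitioned data, the following toy model in plain Python gives every partition its own full copy of a small lookup table, so a join against it never has to leave the partition. The `Partition` class, the `countries` lookup, and the order rows are hypothetical and exist only for this sketch.

```python
import zlib

NUM_PARTITIONS = 3  # illustrative value

# Small, read-mostly lookup table that will be copied to every partition.
COUNTRY_NAMES = {"DE": "Germany", "JP": "Japan", "US": "United States"}


def partition_for(key: str) -> int:
    """Stable hash so a given key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


class Partition:
    def __init__(self, lookup: dict):
        self.orders = []                # partitioned: a subset of all orders
        self.countries = dict(lookup)   # replicated: a full local copy

    def orders_with_country_name(self) -> list:
        """Local join: no other partition is consulted."""
        return [(o["order_id"], self.countries[o["country"]]) for o in self.orders]


cluster = [Partition(COUNTRY_NAMES) for _ in range(NUM_PARTITIONS)]

# Orders are partitioned by order_id, not by country.
codes = list(COUNTRY_NAMES)
for i in range(9):
    order = {"order_id": f"order-{i}", "country": codes[i % 3]}
    cluster[partition_for(order["order_id"])].orders.append(order)

# Every partition resolves country names on its own; results are then combined.
joined = [row for part in cluster for row in part.orders_with_country_name()]
```

The trade-off is that any write to the lookup table has to reach every copy, which is why the technique suits small tables that change rarely.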
Features Absent from kSQL
In addition to some of the basic data management capabilities discussed above, there are numerous other advanced transactional and analytical features that are absent from kSQL, such as:
- ACID transactions,
- the ability to embed Machine Learning models in the database for real-time cognitive actionability,
- stored procedures / user-defined functions,
- local, in-memory availability of state (kSQL uses an external key/value store, RocksDB, to provide some state, which is limited by its reliance on key/value lookups),
- High Availability,
- durability, and many more!
Message Bus: Not the Right Tool for Every Job
Apache Kafka/kSQL is disk-based and extremely slow for mission-critical, latency-sensitive applications. It is a terrific message bus that is extremely proficient at delivering messages from multiple publishers to multiple subscribers. However, when it comes to data storage and stateful stream processing of data in motion, kSQL has numerous handicaps. In the words of Dr. Michael Stonebraker, it is crucial to “use the right tool for the right job”. Employing a message bus (with minimal database functionality bolted on) to process data for your mission-critical application is akin to adding eggs to a cake after it is fully baked.
Volt Active Data is a proven data management platform that provides comprehensive translytical stateful processing on streaming data. Volt Active Data has been fully ANSI SQL compliant since its inception. It has been proven in production environments, powering data-hungry, mission-critical applications for close to a decade. To learn more about stateful stream processing with embedded cognitive capabilities, download this white paper.