Volt Active Data hired Kyle Kingsbury, creator of the Jepsen tests, to build a new, stronger Jepsen test specifically for Volt Active Data. We promise strong serializability in a distributed database, a stronger promise than almost any other system makes, and we’ve been working with Kingsbury to validate that promise.
What is Jepsen Testing?
“Jepsen is an effort to improve the safety of distributed databases, queues, consensus systems, etc. It encompasses a software library for systems testing, as well as blog posts, and conference talks exploring particular systems’ failure modes. In each post we explore whether the system lives up to its documentation’s claims, file new bugs, and suggest recommendations for operators.”
- Volt Active Data 6.4 has passed official Jepsen testing performed by Kyle Kingsbury, Jepsen’s creator.
- Volt Active Data 6.4 has passed more stringent testing than any other system Jepsen has tested.
- Jepsen found several issues in Volt Active Data 6.3 and we fixed every one. Our tolerance for these bugs is zero.
- We have integrated Jepsen testing into our automated testing for each Volt Active Data build. We have made, and will continue to make, our Jepsen and non-Jepsen tests stronger and better.
- We stake our reputation on correctness, consistency, and safety.
You can read Kingsbury’s detailed post on his experiences here: https://aphyr.com/posts/331-jepsen-voltdb-6-3
The most important quote:
“Volt Active Data’s pre-6.4 development builds have now passed all the original Jepsen tests, as well as more aggressive elaborations on their themes. Version 6.4 appears to provide strong serializability: the strongest safety invariant of any system we’ve tested thus far.”
Jepsen has proven its value as a tool in any distributed system tester’s arsenal, and multiple people have asked us about Jepsen and Volt Active Data specifically. Jepsen is both famous and notorious in the database industry for finding undiscovered problems with distributed systems. There are few tests of mettle as recognizable as Jepsen in our community.
While we had been planning to do the testing ourselves, we understood that nothing we did would have the same credibility as a test run by Kyle Kingsbury himself, creator of Jepsen and embarrasser of databases. When Kingsbury started his Jepsen-For-Hire business last fall, we immediately got in line, and over the past two months, we’ve been working closely with him as he tested Volt Active Data.
The Most Stringent Jepsen Tests So Far
The default consistency setting in Volt Active Data is strong serializability. This combines the ACID guarantee of serializable transactions (every transaction appears to execute in some single global order) with CP-in-CAP-style linearizability (each operation appears to take effect at some point between the client sending it and the client receiving the response, so the global order agrees with real time). Peter Bailis, notable database researcher and professor at Stanford University, has a blog post on the difference for those who want technical detail: Linearizability versus Serializability.
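To make the distinction concrete, here is a minimal sketch of a brute-force checker for a single register. It is an illustration only: the `Op` type and the function names are hypothetical, the approach tries every ordering and so only works on tiny histories, and it is not how Jepsen's checker or Volt Active Data works internally. A history is serializable if some ordering of the operations is legal for a register; it is linearizable only if such an ordering also respects real time, meaning an operation that finished before another started must come first.

```python
from itertools import permutations
from typing import NamedTuple


class Op(NamedTuple):
    """One completed client operation on a single register (hypothetical type)."""
    kind: str     # "write" or "read"
    value: int    # value written, or value the read returned
    start: float  # time the client sent the request
    end: float    # time the client received the response


def is_legal_register(order, initial=0):
    """True if every read in this order returns the most recent write."""
    current = initial
    for op in order:
        if op.kind == "write":
            current = op.value
        elif op.value != current:
            return False
    return True


def respects_real_time(order):
    """True if no operation is placed before one that finished before it began."""
    for i, earlier in enumerate(order):
        for later in order[i + 1:]:
            if later.end < earlier.start:
                return False
    return True


def is_serializable(history, initial=0):
    """Some total order is legal, regardless of wall-clock time."""
    return any(is_legal_register(o, initial) for o in permutations(history))


def is_linearizable(history, initial=0):
    """Some total order is legal and also agrees with real-time order."""
    return any(
        respects_real_time(o) and is_legal_register(o, initial)
        for o in permutations(history)
    )


# A stale read: the write of 1 completed before the read began,
# yet the read still returned the initial value 0.
stale = [Op("write", 1, 0.0, 1.0), Op("read", 0, 2.0, 3.0)]
print(is_serializable(stale), is_linearizable(stale))
```

The stale-read history passes the serializability check (ordering the read before the write is legal) but fails the linearizability check, which is the gap that a strong serializability guarantee closes.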
We’d also like to point out that conventional wisdom holds that this kind of consistency is too expensive, and that you have to accept less (often much less) in order to scale. Volt Active Data manages to be fast and consistent by leveraging smart design and by specifically making some tradeoffs about what applications can do. You can read more about this on the Volt Active Data website: Reasons Behind the Volt Active Data Architecture.
As Kingsbury discusses in his post, Volt Active Data was run not only against the Jepsen tests that look for linearizability faults, but also against Jepsen’s new multi-key linearizability tests. These check Volt Active Data’s multi-statement, multi-key transactions for strong serializability. They don’t even apply to other systems with multi-key transactions, because they require linearizability on top of serializability (that is, strong serializability). Since few other distributed databases promise this (possibly none), it’s a test that only really applies to Volt Active Data.
Are the Issues Found Serious?
Jepsen found an issue in versions of Volt Active Data prior to 6.4 that could lead to stale reads or even dirty reads of uncommitted data under certain network partition scenarios. A user’s likelihood of encountering this issue is hard to predict and varies by application. If encountered, the seriousness of its effects varies depending on the application as well.
Jepsen also found two issues where writes could be lost under certain partition scenarios. These issues are more serious, but also easier to avoid because they are only possible to hit on uncommon deployment configurations. We have identified one production deployment out of hundreds we know about that is susceptible to these issues.
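As a rough illustration of what "lost writes" means in a test like this, the sketch below compares the writes a database acknowledged to clients against what a final read returns once the partition has healed. The function name and the sample data are hypothetical and made up for illustration; this is not the Jepsen harness itself, only the basic idea of the check.

```python
def find_lost_writes(acknowledged, final_read):
    """Return acknowledged values that are missing from the final read.

    acknowledged: values the database confirmed as successfully written
    final_read:   values observed when reading everything back at the end
    """
    return sorted(set(acknowledged) - set(final_read))


# Hypothetical run: four inserts were acknowledged, but one is gone afterwards.
acknowledged = [1, 2, 3, 4]
final_read = [1, 2, 4]
print(find_lost_writes(acknowledged, final_read))
```

Any value in the result is a write the client was told had succeeded but that the database no longer contains, which is why we treat this class of bug as release-blocking.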
Of course we consider all correctness and data loss bugs to be drop-everything-and-fix serious. If we start getting grey—weighing the likelihood of this and the impact of that—we start down a slippery slope. To our engineering team, it must be black and white.
To our users, it is less straightforward. Many users and apps will be unaffected by these issues, but their impact on others is less clear. We’ve already reached out to our customers and other users known to be in production with Volt Active Data.
We have additional in-depth detail available on a technical companion page focused on these issues:
- Volt Active Data single-partition read-only transactions can read stale or uncommitted data under certain network partition scenarios.
If you have questions about how likely you are to hit these issues in your deployment and/or you are unable to update to 6.4, please reach out to Volt Active Data support at firstname.lastname@example.org.
Reproducible and Open
We believe having Kingsbury run these tests himself adds credibility to the results, but everything Kingsbury has done is reproducible and open. You can find the Jepsen driver for Volt Active Data at https://github.com/jepsen-io/voltdb, which allows you to run the full Jepsen tests described in Kingsbury’s blog post against a 30-day trial of Volt Active Data, or your own licensed copy.
We’d like to point out that this kind of reproducible testing is only possible because Volt Active Data is standalone software that our users fully control. People often cite lock-in as a major tradeoff of popular Database-as-a-Service offerings, but it’s also important to note that this kind of fault-injection-based testing just isn’t possible when you don’t control the environment.
Data Safety and Correct Answers Are Our Highest Priority
Among the things that set Volt Active Data apart is our combination of a strongly ACID relational SQL database on a natively clustered platform. We’re database people, but we’re also distributed systems people.
Selling a data product on its strong consistency and robust fault tolerance can be challenging, and is based on credibility and trust. Our marketing material can tell you your data is safe, but anyone can do that. We show you we take this seriously through our actions.
Take today for example. We hired Jepsen as soon as we could. We held the release of Volt Active Data 6.4 until every bug was fixed. In one case, we made a minor performance sacrifice to make sure default consistency settings were as strong as we promised.
Volt Active Data is, first and foremost, an operational database for the 21st century. To us, that means it has to check a few boxes:
- You can trust Volt Active Data with your data. Keeping your data safe is job #1 at Volt Active Data. Any data corruption or loss issues are release-blocking bugs that are prioritized over all other work.
- The reads, computations and writes you do in a Volt Active Data transaction are 100% correct as of the time the transaction executes. We believe it is our job to worry about as much of the complexity of distributed systems and database consistency models as we can, leaving the developer to focus primarily on his or her business logic.
- It needs to be as easy as we can make it to keep Volt Active Data running for years on end. We don’t require an external ZooKeeper. We don’t have different kinds of nodes in a cluster. Installation is as simple as unpacking a tarball. Replacing failed nodes or recovering a full cluster are single command-line operations.
- Our testing has to be outstanding and as transparent as possible.
There are certainly other things we care about. We spend lots of time making Volt Active Data easier to use and also expanding our use cases. For example, in 6.0, we added support for geospatial types, queries and indexes.
But being the best at something means focus and prioritization. At Volt Active Data, building an operational database means data safety, high availability, and management and administrative simplicity. This is why Volt Active Data is trusted by many of the world’s largest telecommunications networks in their critical infrastructures. These customers are not easy to please, but it’s exactly these high operational standards that allow us to stand apart from other systems.
The immediate next step for us after Jepsen involved getting the Jepsen testing harness into our continuous integration process. This allows us to test all nightly builds and upcoming releases automatically. It also allows us to run Jepsen on specific branches as we develop new features.
We’re also in the middle of a post-mortem on our other tests. Some of our tests overlap with the kinds of issues Jepsen is designed to find. These tests have found many issues over the years and have been invaluable in making Volt Active Data as robust as it is. Still, Jepsen found a few issues that weren’t covered by our existing tests. We are working to understand why these issues weren’t found, and also what kinds of things we can change to find these issues in the future. In the meantime, as mentioned above, we are internally running Jepsen regularly alongside our existing tests.
As we push forward on the 6.x releases, and ultimately to 7.0 and beyond, we plan to continue to expand our existing tests and create new ones that come at problems from new directions.
We’ve created a Transaction and Consistency FAQ with some additional background. We hope this helps readers understand how Volt Active Data works and what Jepsen found. Kingsbury’s post also mentions some future areas he’d like to explore with Volt Active Data and we cover those topics too.
It’s our goal at Volt Active Data to make each release stronger than the previous ones. When customers ask me what the best release of Volt Active Data is, I always say the latest one. I owe my confidence to an engineering team that cares about quality, and a continuous integration effort that makes it easy to build on past releases and make software better, all while minimizing regressions.
Finally, we’d especially like to thank Kyle Kingsbury for his work on this project. It’s been a pleasure working with him, and it’s always very helpful whenever we have a third-party expert evaluate our software and give us feedback.
So give Volt Active Data a try and feel free to reach out to us if you have questions.