NoSQL Due Diligence, Part 3: the Open Source Debacle


November 03, 2021

Introduction

The concept of open source has been integral to promoting a generation of NoSQL database platforms, each of which has a VC-funded backer and many of which have now had successful IPOs. That success has created some interesting dilemmas:

  1. Have these much-publicized ‘liquidity events’ created market expectations that are in direct conflict with the very idea (and mission) of open source?
  2. How much will the willingness of hyperscalers such as AWS to offer ‘store brand’ versions of many of these products upend the carefully laid plans of many database vendors? 
  3. Will adding new features to products that weren’t originally architected for them create its own set of longer-term problems for users?

In order to address these dilemmas, let’s first take a look at how we got here.  

How We Ended Up With 130 Legacy RDBMS Alternatives

Much has already been written about why the Open Source NoSQL Database Revolution happened. Needless to say, an explosion of innovation and creativity saw more than 130 databases pop up over the span of about five years, roughly between 2007 and 2012.

For perspective, the 451 Group was able to produce a parody of London’s subway system showing how all the technologies related to each other:

Source: 451 Group

The vast majority of these, including Volt Active Data, are open source to some degree or other. This was partly a reaction to the deeply proprietary nature of legacy RDBMS, but it had the side effect of making these technologies much more attractive to early adopters who felt they could remove much of the risk of using a new technology if they had unrestricted access to the source code.

Enter VC: The Current Open Source NoSQL Market

We are now in a very different situation. Clearly, the industry could never sustain the super-saturation of 130 players, and over time, data platforms have started to either disappear or become dormant.  

We also saw the arrival of venture capital, with a willingness to make large investments. The VC industry observed a huge market being disrupted and wanted to be part of it. By becoming involved, they offered the original developers a chance of a significant personal payday and also provided funding to support developer communities.

The business plan was usually:

  1. Spend money on marketing and support to jumpstart a community of developers. 
  2. Pay developers to build out the product, even though the code is open source. 
  3. Encourage developers to answer questions in forums such as StackOverflow to drive adoption and create confidence in the viability of the underlying open source project.
  4. Charge enterprise customers for support and an expanded feature set.

While this is a relatively standard VC approach, the key thing to understand is that the VCs viewed open source as a step on the path to a liquidity event, not a worthy goal in itself. As MongoDB CEO Dev Ittycheria said:

“We didn’t open source it to get help from the community, to make the product better. We open sourced as a freemium strategy; to drive adoption.”

As an overarching strategy, this worked wonderfully because developers could deploy enterprise-grade products without having to go through an acquisition process. 

But Ittycheria also made it clear that being open source does not mean disavowing ownership of the code, and he used the language of a patent lawyer to do so:

“MongoDB was built by MongoDB. There was no prior art.”

Having looked through all the contributors to MongoDB, I’d say this statement is 99% fair. My point is, while the code may well be open source, there’s often very little community involvement. 

VC funding made a massive difference in the open source space. Depending on your perspective, the VCs have either massively accelerated an inevitable shakeout or unfairly loaded the dice in favor of their portfolio. 

With the honorable exception of Postgres, almost every database has been affected by this. We are now entering a new era where the oligopoly of legacy RDBMS products is being replaced by a different oligopoly of new technology data platforms. What do they have in common? They’ve all had VC funding and now need to meet investor expectations instead of keeping developers happy. For example, MongoDB now has a market cap of US$34B, which implies it needs to earn around US$1.7B a year to keep investors happy. In 2021, it had a loss of US$271M on a turnover of US$590M. Confluent is positioning itself as a database and has a market cap of US$17B. It faces similar expectations.
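The arithmetic behind that earnings figure can be sketched as follows. This is a rough illustration only: the 5% earnings yield (a price-to-earnings ratio of 20) is an assumption inferred from the ratio of US$1.7B to US$34B, and the function and constant names are made up for this example.

```python
# Rough sketch: what annual earnings does a given market cap imply?
# ASSUMPTION: investors expect a 5% earnings yield (P/E of 20). This figure
# is inferred from the numbers quoted in the text, not stated by any vendor.

def implied_annual_earnings(market_cap_usd, earnings_yield=0.05):
    """Return the annual earnings implied by a market cap at a given yield."""
    return market_cap_usd * earnings_yield

MONGODB_MARKET_CAP = 34e9      # US$34B, as quoted above
MONGODB_2021_NET = -271e6      # US$271M loss in 2021, as quoted above
CONFLUENT_MARKET_CAP = 17e9    # US$17B, as quoted above

mongodb_target = implied_annual_earnings(MONGODB_MARKET_CAP)
mongodb_gap = mongodb_target - MONGODB_2021_NET
confluent_target = implied_annual_earnings(CONFLUENT_MARKET_CAP)

print(f"MongoDB implied earnings:   US${mongodb_target / 1e9:.2f}B per year")
print(f"MongoDB gap from 2021 loss: US${mongodb_gap / 1e9:.2f}B per year")
print(f"Confluent implied earnings: US${confluent_target / 1e9:.2f}B per year")
```

Under that assumed yield, the distance between the 2021 loss and the implied earnings target is roughly US$2B a year.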

Given the historical size of the database market (between US$35B and $50B) and the extreme difficulty of moving applications off legacy RDBMS, it’s clear that these companies are going to have to become very, very good at extracting revenue from the Fortune 500 in order to keep stockholders happy. Which is unfortunate, because the hyperscalers have their own plans for that end of the database market.

Enter the Hyperscalers 

Remember the VC open source business plan? The one involving keeping the product free for small users and making it up with revenue from enterprise customers? Well, that plan didn’t foresee hyperscalers offering hosted versions of open source products to those same customers. 

As part of their DBaaS offerings, they included support, which means you could use data platform ‘X’ without ever speaking to or paying the people who wrote data platform ‘X’. Instead, you paid the hyperscaler for hosting plus support and also became ensnared in their broader ecosystem. Both end users and hyperscalers were happy, but the original code developer was left with a giant revenue hole to fill. 

Vendors have responded to these events by doing three things: 

  1. Adopting the Server Side Public Licence (SSPL)
  2. Hosting their own products, and 
  3. Trying to add features to capture more of the legacy RDBMS market. 

Let’s take a close look at each strategy. 

SSPL and NoSQL 

The SSPL is now seeing widespread adoption. While the vendors continue to paint themselves as paragons of open source virtue, they make it clear that the intent of the new license is to “restrict cloud service providers from offering our software as a service”. 

The SSPL is a logical response to what’s happened, but it has an unpleasant side effect. Aside from the obvious issues with the source code inevitably being forked and the developer community being torn asunder, there’s a more serious problem if you are one of the Fortune 500s everybody wants as a customer: the SSPL wording may prevent you from offering the open source database in question as DBaaS from a private or corporate cloud, as well as a public one.

Why? 

The issue is that the license says that you must share any related source code if you make the service available to ‘third parties’. What, exactly, is a ‘third party’? Many large corporations consist of dozens or even hundreds of different legal entities, any one of which would be a ‘third party’ in the eyes of the license. This is not idle catastrophization on our part. We’re aware of at least one very large German conglomerate that has walked away from an SSPL product in their corporate cloud as a result of this.

Vendor-Hosted Solutions

The second strategy open source NoSQL database vendors have adopted to cope with hyperscalers is to host their own products. MongoDB, DataStax, and Couchbase are going down this route. While it gives them a path to revenue, it does leave end users wondering what ‘open source’ even means in this scenario. 

What’s even weirder is the cognitive dissonance we encounter when some of the same people who insisted that they would only ever use a database if they had control over the source code and did all their own support are apparently happy to use a DBaaS product with zero visibility into what goes on internally. 

We’re also seeing “hosted-only” products such as Snowflake, MongoDB Atlas, and Couchbase Capella appearing. How this works in the long term depends on whether the vendors are ever going to be able to get the economies of scale needed to compete with hyperscalers. Users will also have to understand what it will cost, from both a financial and a latency perspective, to route traffic between their DBaaS vendor’s cloud and the hyperscalers. 

Adding More Features

The third strategy is to add more features to NoSQL products to absorb more of the market for legacy RDBMS products. We’re now seeing this happen on a wide scale, with ACID transactions and SQL being retrofitted to newer NoSQL data platforms. 

Here’s where history starts to repeat itself.  

When they first appeared, early legacy RDBMS products were considerably faster than the ‘state of the art’. But as the vendors added more and more features targeting specific use cases, complexity, slowness and bloat emerged, especially because many of these use cases require the data platform to act in ways it wasn’t designed to act. This is one of the reasons we had a revolution in data platforms in the first place.  

Let’s look at two specific examples: SQL and ACID transactions. 

SQL retrofits

The whole point of using a schemaless structure to store data is to give developers total authority as to what gets stored, in what format, and as part of what master record. These design choices operate at the level of individual records, not ‘tables’. 

How, then, does a SQL parser even know which ‘tables’ exist, when one of them may occur as a single entry three levels down out of millions of records? If I, as a developer, add a new ‘table’ inside a new JSON object I’m inserting, how does the data platform know to add it to the dictionary of ‘tables’? Note that I’m not suggesting for a second that a well-funded, publicly quoted corporation lacks the resources or skill to do this. Instead, I’m trying to make the point that any implementation will be messy, will have side effects, and will slow down the product as a whole. 
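To make the problem concrete, here is a minimal sketch of what ‘table’ discovery over schemaless documents might involve. It is not how any particular product does it: the rule that every nested array of objects counts as a candidate ‘table’, the path notation, and the names discover_tables and build_catalog are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, Optional, Set

# Maps a path such as "$.orders" to the set of 'column' names seen there.
Catalog = Dict[str, Set[str]]

def discover_tables(node: Any, path: str = "$",
                    catalog: Optional[Catalog] = None) -> Catalog:
    """Walk one document and record every nested array of objects as a
    candidate 'table' (an assumed rule, chosen for illustration)."""
    if catalog is None:
        catalog = {}
    if isinstance(node, dict):
        for key, value in node.items():
            discover_tables(value, f"{path}.{key}", catalog)
    elif isinstance(node, list):
        for item in node:
            if isinstance(item, dict):
                # Each object in the array contributes its keys as 'columns'.
                catalog.setdefault(path, set()).update(item.keys())
            discover_tables(item, path, catalog)
    return catalog

def build_catalog(documents: Iterable[Any]) -> Catalog:
    """Every document must be visited: any one of them may add a 'table'."""
    catalog: Catalog = {}
    for doc in documents:
        discover_tables(doc, "$", catalog)
    return catalog

docs = [
    {"customer": "a", "orders": [{"id": 1, "total": 9.5}]},
    {"customer": "b"},
    # Only this record contains the nested 'lines' table.
    {"customer": "c", "orders": [{"id": 2, "lines": [{"sku": "x", "qty": 3}]}]},
]

catalog = build_catalog(docs)
for table_path in sorted(catalog):
    print(table_path, sorted(catalog[table_path]))
```

Even in this toy version, the catalog is only correct after every record has been inspected, and it has to be revised on every insert that introduces a new nested structure. That bookkeeping is the kind of side effect described above.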

And this is just the tip of the iceberg when it comes to retrofitting SQL. What we’re seeing now are ‘SQL’ implementations that are cosmetically compliant but frequently have deep semantic differences, with kSQL’s implementation of SELECT being a classic example of this. The thinking behind it is really clever, but the implementation is very different from what an average SQL user was expecting.

ACID transactions

ACID transactions are another area in which retrofitting will go badly. Volt Active Data does ACID transactions at scale, so we know a lot about the architectural choices that need to be made very early on to enable this. 

This problem can’t be solved with clever programming or more CPU power. Instead, you need to reduce the duration of transactions to the absolute minimum, which means making architectural choices that reduce the number of network trips, guarantee serialization, and move logic to the database server. 

Developers instinctively dislike this, but the science and benchmarks are on our side. If you’ve written a data platform without making ACID a ‘day 1’ issue, you will find that the best you can achieve, short of a disruptive rewrite that breaks existing applications, is to match the row-locking capabilities and non-scalable ACID performance of legacy RDBMS products, and in doing so you will slow everything else down. 
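A back-of-the-envelope model shows why transaction duration matters so much. This is a sketch under stated assumptions, not a benchmark: the latency figures are hypothetical, the model considers a single contended row, and it assumes locks are held for the whole time a transaction is open, including any network round trips between statements.

```python
# Toy model: how long is a contended row locked, and how many transactions
# per second can touch it? All numbers below are hypothetical.

def lock_hold_ms(round_trips_while_open, network_rtt_ms,
                 statements, exec_ms_per_statement):
    """Time the transaction stays open: waiting on the network between
    statements, plus the time spent executing the statements themselves."""
    network_wait = round_trips_while_open * network_rtt_ms
    execution = statements * exec_ms_per_statement
    return network_wait + execution

def max_tps_on_one_row(hold_ms):
    """Upper bound on serialized transactions per second for one hot row."""
    return 1000.0 / hold_ms

RTT_MS = 0.5       # assumed client-to-database round trip
EXEC_MS = 0.05     # assumed execution time per statement
STATEMENTS = 4     # assumed statements per transaction

# Client-driven: the client waits on a round trip between each statement.
client_hold = lock_hold_ms(STATEMENTS - 1, RTT_MS, STATEMENTS, EXEC_MS)

# Server-side logic: one call runs all statements; no round trips while open.
server_hold = lock_hold_ms(0, RTT_MS, STATEMENTS, EXEC_MS)

print(f"client-driven: {client_hold:.2f} ms held, "
      f"{max_tps_on_one_row(client_hold):.0f} tps ceiling")
print(f"server-side:   {server_hold:.2f} ms held, "
      f"{max_tps_on_one_row(server_hold):.0f} tps ceiling")
```

With these assumed figures, network waiting dominates the client-driven case. Faster CPUs shrink only the execution term, which is why the fix has to be architectural rather than a matter of clever programming or more hardware.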

These are but two examples of the challenges a NoSQL vendor is going to face if it tries to get into a ‘feature war’ with legacy RDBMS. The bottom line is that the old saying that ‘A jack of all trades is a master of none’ holds true here.

Conclusion

With the honorable exception of Postgres, the surviving open source data platforms face real challenges. The same funding that enabled them to become wildly successful now requires them to find sources of revenue. But by offering ‘store brand’ versions, the hyperscalers are forcing them either to add features to broaden the appeal of their products, regardless of underlying architectural sanity, or to do a ‘Silicon Valley’-style pivot into the hosting business while simultaneously repairing relationships with the Fortune 500 companies that were harmed by the deployment of the SSPL. 

Let’s be honest: The dream of open source data platforms may not be dying, but it’s not going well. The answer is to find a happy medium: a platform able to offer ACID transactions at scale without compromising on performance, and without making promises or value proclamations that it’s not going to be able to keep in the face of a rapidly changing data world. 

Volt Active Data foresaw the NoSQL open source debacle and stepped in to help. We’re not asking you to plunge right in—we’re just asking you to take a look around. You can start here

Read our paper SQL VS. NOSQL VS. NEWSQL to learn why NoSQL (and SQL) mostly fall short of 5G.
