We thank Jeremy Winters of Volt Active Data partner Full360 for this guest post, which first appeared on Full360’s Elastic Ninja blog. It is published here in its entirety.
About a year ago, a Full360 gaming customer had a requirement to create an analytics data warehouse for in-game events. These events are generated by various mobile and desktop versions of their games, representing significant actions in the game such as wins, losses, level ups, and in-game currency transactions. The final target for the data was a Vertica database running on AWS EC2 instances… to be analyzed and visualized with Tableau.
The twist in this story for Full360 is that the events were not being collected yet… so I was tasked with determining the mechanism for capturing the events and staging them in S3 before they were loaded into the database. It was obvious that this mechanism needed to be an HTTP-based service of some sort. At the time I was familiar with building web services using JRuby, Sinatra, and Puma… a combination that allows for a truly concurrent solution that is relatively easy to build.
Around that time Full360 became a partner of Volt Active Data. I was browsing the Volt Active Data documentation and saw two features that took JRuby off the table and made Volt Active Data a slam dunk for this requirement.
To give a little background, Volt Active Data is an in-memory SQL database with an MPP, shared-nothing architecture. These principles are very similar to the architecture of Vertica, which is not surprising as they are both the brainchildren of Michael Stonebraker, who recently won the Turing Award for his contributions to database systems technology. A database like Vertica is intended for high-performance querying of very large datasets, while Volt Active Data is a bit of the opposite, designed for high throughput of complex transactional events. This throughput comes from design choices such as the in-memory model and partitioned stored procedures. Volt Active Data is an excellent tool for achieving a high throughput of intelligent operations such as personalization, fraud detection, and analytics that capitalize on the value of freshly generated data.
But… we’re not doing any of that fancy stuff here! We are simply collecting events for ingestion into a data warehouse.
The two features of Volt Active Data that rang a bell for me were the following:
- JSON Interface – Volt Active Data provides a simple REST-like HTTP interface that allows you to call procedures and pass parameters via the URL in a JSON format. This functionality can be enabled for any Volt Active Data stored procedure.
- Export tables – Volt Active Data is a SQL database that supports the standard SELECT, INSERT, and UPDATE statements (among others). Tables can be defined as export-only; such tables accept only INSERT statements. When a record is inserted, it is immediately queued for export to a text file in a directory defined on the Volt Active Data server. These files are rotated at an interval defined by the database configuration.
Defining the table as export is easy… you simply create the table definition as you normally would in a DDL statement, then declare it as an export table:
create table sessions (
  session_start timestamp
  ,user_id bigint
  ,session_id bigint
  ,b_user_id bigint
  ...
);

export table sessions;
Export functionality is enabled in the deployment.xml file that contains the resource definition of the Volt Active Data instance you are running. Note that the example below not only defines the export parameters, but enables the JSON API interface.
<?xml version="1.0"?>
<deployment>
  <cluster hostcount="1" sitesperhost="2" kfactor="0" />
  <httpd enabled="true">
    <jsonapi enabled="true" />
  </httpd>
  <!-- file export connector; the rotation interval ("period") is in minutes,
       and the nonce/outdir values below are illustrative -->
  <export enabled="true" target="file">
    <configuration>
      <property name="type">csv</property>
      <property name="nonce">sessions</property>
      <property name="outdir">/var/voltdb/export</property>
      <property name="period">5</property>
    </configuration>
  </export>
  <users>
    <user name="someuser" password="abc123" roles="device_role"/>
  </users>
</deployment>
When the database is running… any rows inserted into export-only tables are written directly to text files in a basic CSV format. To complement the export process, I built a simple job that checks for new files (after they are rotated) and uploads them to Amazon S3 with a filename based on a sortable timestamp and the server hostname, allowing multiple servers to collect events and write them safely into the same S3 bucket.
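The key-naming scheme can be sketched as follows. This is a minimal illustration of the idea, not the production job (which handles the actual upload via the AWS SDK); the path, hostname, and timestamp values are made up:

```python
import os
import socket
import time

def s3_key_for(path, hostname=None, ts=None):
    # Sortable UTC timestamp first, so an S3 listing returns files in
    # chronological order; hostname second, so multiple collectors
    # writing to the same bucket never produce colliding keys.
    hostname = hostname or socket.gethostname()
    ts = ts or time.strftime("%Y%m%d%H%M%S", time.gmtime())
    return "%s-%s-%s" % (ts, hostname, os.path.basename(path))

print(s3_key_for("/var/voltdb/export/sessions.csv",
                 hostname="collector-1", ts="20150601120000"))
```

Because the timestamp leads the key, a consumer can list the bucket with a simple prefix scan and load files in the order they were produced.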
Getting events into Volt Active Data is quite easy when using the JSON API. This API allows your application (in this case the mobile game) to send events to Volt Active Data by way of simple HTTP calls such as the example below…
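The call shape is a GET against the JSON API endpoint (/api/1.0/ on the HTTP port, 8080 by default) with the procedure name and a JSON-encoded parameter array as query arguments. A minimal sketch — the host and parameter values are hypothetical, and SESSIONS.insert is the default CRUD procedure Volt Active Data generates for the sessions table:

```python
import json
from urllib.parse import urlencode

def voltdb_call_url(host, procedure, parameters):
    # The JSON API takes the procedure name plus a JSON-encoded
    # parameter array as URL query arguments.
    query = urlencode({"Procedure": procedure,
                       "Parameters": json.dumps(parameters)})
    return "http://%s:8080/api/1.0/?%s" % (host, query)

# Insert one game event into the sessions export table
# (all values here are made up for illustration).
url = voltdb_call_url("volt.example.com", "SESSIONS.insert",
                      ["2015-06-01 12:00:00", 42, 1001, 7])
print(url)
```

The mobile client simply issues this URL as an HTTP GET; no driver or client library is required.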
Of course the example above uses plain HTTP… but in production we put an Amazon Elastic Load Balancer in front, terminating TLS/HTTPS to encrypt the traffic. The ELB also lets us spread the incoming events across multiple Volt Active Data servers running in different Availability Zones (isolated data centers), so an entire data center can fail without our losing the ability to collect events.
Each Volt Active Data server works in isolation, gathering events and writing them to files without awareness of the other servers. This customer typically gathers around 10M events per day. That volume is handled by two m1.large servers, neither of which ever goes past 20% CPU. If they were to require more horsepower, we could easily scale up the instances or add more of them to handle the load.
Behind the scenes we have a few other tricks, such as the way we generated all of the Volt Active Data event definitions (Java source files) using a metadata-driven Ruby template system… but that’s a story for another time!