Why We Built Splice Machine

To Simplify Distributed Data Management and ML

Monte Zweben
6 min read · May 27, 2020

Mission

I am Monte Zweben, CEO and co-founder of Splice Machine (https://www.splicemachine.com), a database company striving to make it really easy to build operational applications that take full advantage of big data and machine learning.

Our mission is to make distributed data management and ML accessible to all.

Data Platform Complexity

Building a big data application usually requires the integration of a few data platforms. You typically need an operational database, an analytical database, and a data science/ML platform. Many of you are all too familiar with this exercise. For example, one full data stack for an ML application might glue together a scale-out NoSQL database like MongoDB to power the application, Snowflake to drive analytics, and Databricks to get Jupyter notebooks, SparkML, MLFlow, and model deployment to SageMaker or Azure ML.

While using any one of these platforms is easy, managing all of these infrastructure components together — deploying them and keeping them performant and stable in production over time — is hard. You need to learn how they all work under the hood, become an expert on each, manually instrument each, choose dozens of infrastructure parameters and configurations, and go through painfully slow iterations to develop, debug, and productionize the “complete” stack. And worst of all, this loosely coupled architecture typically moves data between these platforms, leading to lots of latency.

Under The Hood

After multiple startups, including Rocket Fuel and Blue Martini, my co-founders and I had become extremely frustrated with this process and decided to do something about it. In response, we built Splice Machine so that our peer application developers, data scientists and data engineers could focus on their core mission — building models, pipelines, and applications — without having to integrate the platforms or manage all the DevOps.

OLTP & OLAP Combined

To realize this goal, we took a few of the open source components discussed above and tightly integrated them on Kubernetes so that developers can simply use them as a complete platform. No engineering required. We built an open source, scale-out SQL RDBMS that chooses between different engines under the hood based on the nature of the query.

Our OLTP engine is the Apache HBase key-value store. For those unfamiliar with HBase, here is the one-liner: HBase auto-shards data across a cluster and keeps the data ordered by a primary key for fast lookups, mutations, and range scans. You don't use HBase per se on Splice, and you are not limited by its API or its quirks. We've engineered that out; you just interact with ANSI SQL. In addition, and this was the most technically challenging effort, we built an ACID transaction layer over this key-value store: a full SQL MVCC implementation with snapshot isolation semantics. This is not bulk ACID like Databricks Delta or Hive; this is true OLTP ACID compliance that enables concurrent mutation of individual cells without locking.

Our cost-based optimizer estimates the number of rows each query will scan and, based on that estimate, chooses either Spark or HBase to execute the compiled SQL plan. We've integrated HBase and Spark tightly with our MVCC so that all computations respect ACID compliance, pass data to each other efficiently, and background processes like compaction are scheduled on Spark, relieving HBase of memory contention. We also support full columnar ORC/Parquet tables as first-class objects for analytical performance. You can join row-based and columnar tables together and create very interesting hybrid views.
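
To make the dual-engine idea concrete, here is a minimal sketch of what this looks like from an application. The DSN, schema, and table names below are illustrative assumptions, not taken from our documentation; the point is that both the transactional and the analytical statements go through the same ANSI SQL interface, and the optimizer picks the engine.

    # A minimal sketch, assuming an ODBC DSN named "splicedb" has been
    # configured for the cluster; DSN, schema, and table names are illustrative.
    import pyodbc

    conn = pyodbc.connect("DSN=splicedb", autocommit=False)
    cur = conn.cursor()

    # OLTP-style work: a single-row lookup and an update. The optimizer
    # estimates a tiny scan, so the plan executes on the HBase engine.
    cur.execute("SELECT status FROM orders WHERE order_id = ?", (1001,))
    print(cur.fetchone())
    cur.execute("UPDATE orders SET status = 'SHIPPED' WHERE order_id = ?", (1001,))
    conn.commit()  # a real ACID commit, with snapshot isolation underneath

    # OLAP-style work: a large aggregate that joins a row-based operational
    # table with a columnar Parquet table. The estimated row count is high,
    # so the same SQL compiles to a Spark plan instead.
    cur.execute("""
        SELECT o.region, SUM(h.amount) AS total
        FROM   orders o                  -- row-based table
        JOIN   order_history_parquet h   -- columnar ORC/Parquet table
          ON   o.order_id = h.order_id
        GROUP  BY o.region
    """)
    for row in cur.fetchall():
        print(row)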

Adding Machine Learning

On top of the database we tightly integrated a number of machine learning components such as:

  • JupyterHub
  • Spark ML
  • BeakerX
  • MLFlow
  • H2O
  • Scikit
  • Keras
  • PyTorch

We also built in deployment managers for AWS SageMaker and Azure ML, as well as a deployment mechanism native to our database. In our Jupyter system, every user gets their own notebook infrastructure with BeakerX preconfigured. One of the most important features of this customization is polyglot programming, which lets data scientists write code in a number of different languages, including Python, SQL, JavaScript, Java, Scala, Ruby, HTML, Markdown, sh, and even LaTeX, all within the same notebook. Better still, you can share variables across these polyglot cells. Finally, we created a Native Spark Datasource so that data scientists and data engineers can manipulate Spark DataFrames as the input and output of database CRUD operations without the serialization overhead of traditional JDBC/ODBC connectivity to Spark.
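
As a rough sketch of the Native Spark Datasource in use: the class and method names below follow our open-source splicemachine Python package, but treat the exact signatures and the table names as assumptions rather than a reference.

    # A sketch of reading and writing Spark DataFrames through the Native
    # Spark Datasource; signatures and table names are assumptions.
    from pyspark.sql import SparkSession
    from splicemachine.spark import PySpliceContext

    spark = SparkSession.builder.getOrCreate()
    splice = PySpliceContext(spark)

    # Pull a query result straight into a Spark DataFrame, without a JDBC
    # serialization hop between the database and Spark.
    features = splice.df("SELECT * FROM retail.customer_features")

    # Do ordinary Spark feature engineering, then write the result back
    # as a transactional insert.
    enriched = features.withColumn("high_value", features.lifetime_spend > 10000)
    splice.insert(enriched, "retail.customer_features_enriched")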

We also wanted to provide a complete solution for model workflow management and deployment. We selected the open-source MLFlow library for tracking ML models and experiments because it was the most flexible. We have a design philosophy of not imposing any particular modeling workflow, and not restricting the parameters, metrics, or model library that data scientists can track.
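
For illustration, a run tracked with the standard open-source MLFlow API looks like the snippet below; in our stack these calls simply point at the tracking service we host, and nothing about the model library, parameters, or metrics is prescribed. The model, data, and logged values here are placeholders.

    # Tracking an experiment with the standard open-source MLflow API.
    # The model, data, and logged values are placeholders.
    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    with mlflow.start_run(run_name="rf-baseline"):
        clf = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=0)
        clf.fit(X_train, y_train)

        # Log whatever parameters and metrics make sense for this experiment;
        # no particular workflow is imposed.
        mlflow.log_param("n_estimators", 200)
        mlflow.log_param("max_depth", 8)
        mlflow.log_metric("accuracy", accuracy_score(y_test, clf.predict(X_test)))
        mlflow.sklearn.log_model(clf, "model")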

We made some cool extensions as well, because MLFlow is tightly integrated with our database. We can store the models themselves in the database, and we provide RBAC-level access privileges that enhance model governance. Moreover, we automate the logging of parameters and metrics for large classes of libraries, making notebooks less cluttered with logging statements.

And finally, we can deploy models directly to the database. This is a bit subtle but powerful. We serialize the models, store them in the database, and automatically create a scoring function in the database. Then we generate a trigger on a feature-vector table so that when new records are inserted into that table, the scoring function fires and inserts predictions directly into a prediction table. This is all turnkey, with no code to write. You can monitor the predictions in that table for skew, and the application can simply read its predictions from the table. We believe this simplifies the use of models for application developers because real-time predictions become simple table lookups. We also support other deployment mechanisms like AWS SageMaker and Azure ML.
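
From the application developer's point of view, a deployed model then looks roughly like the following; the DSN, table, and column names are made up for illustration, and the generated scoring function and trigger stay behind the scenes.

    # A sketch of the application side of database deployment; the DSN and
    # the table/column names are illustrative assumptions.
    import pyodbc

    conn = pyodbc.connect("DSN=splicedb", autocommit=True)
    cur = conn.cursor()

    # Insert a new feature vector. The deployment trigger fires on insert and
    # the scoring function writes the model's prediction for this row.
    cur.execute(
        "INSERT INTO ml.customer_features (customer_id, recency, frequency, monetary) "
        "VALUES (?, ?, ?, ?)",
        (42, 3, 17, 850.0),
    )

    # Getting the real-time prediction is now just a table lookup.
    cur.execute(
        "SELECT prediction FROM ml.customer_predictions WHERE customer_id = ?",
        (42,),
    )
    print(cur.fetchone())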

Work To Still Do

Splice Machine supports both OLTP and OLAP. It's an engineering truism that you can't be optimal at everything simultaneously. That said, our OLTP performance is similar to that of BigTable-style architectures, and our OLAP performance is comparable to Spark SQL (i.e., really good). We include an HTAP benchmark so engineers can test different workloads. Also, we don't support JSON as a first-class datatype yet, but most customers marshal their data into a relational schema using Spark's JSON support functions or GraphQL. Last, our cloud service's UI/UX is not as polished as some others', but it's improving rapidly.
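
As one sketch of that JSON workaround, assuming the documents live in cloud storage and with made-up field names, Spark can flatten them into a relational shape before loading:

    # A sketch of flattening JSON documents with Spark before loading them;
    # the path and field names are illustrative.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()

    # Spark infers a schema from the JSON documents.
    raw = spark.read.json("s3://example-bucket/events/*.json")

    # Project the nested documents into flat, typed columns that map onto a table.
    events = raw.select(
        col("event_id").cast("bigint"),
        col("user.id").alias("user_id"),
        col("payload.amount").cast("double").alias("amount"),
        col("ts").cast("timestamp"),
    )

    # From here the DataFrame can be written into the database, for example
    # with the Native Spark Datasource shown earlier.
    events.show()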

Kubernetes

We deploy all of the components directly on Kubernetes. We hide this complexity from our users by deploying Splice Machine in their cloud account on a Kubernetes cluster that we manage for them. Our users can simply interact with our web UI and our API/CLI — they don’t need to poke around Kubernetes.

Provision Databases and ML Managers Easily

The platform is available on AWS, Azure, and GCP (beta). Some of our customers use us for new AI applications that need ML models, like one customer that is building a medical advisory application to predict the trajectory of neurological diseases. Many customers also use us to migrate legacy applications to the cloud, like one insurance customer that moved its Db2-based claims application. In some cases, such as old IBM Db2 applications, we can migrate the application with little to no code change because of our Db2 dialect extensions. Last, some customers use us the way Snowflake is typically used: to offload workloads from expensive data warehouses like Teradata.

Feedback

Like any software provider, we overlap with many efforts out there, but we hope you'll find our approach unique! We're excited to share our story with the community today, and we look forward to hearing about your experiences deploying ML applications in the data engineering and data science spaces. Have you had to glue these platforms together as we described above, and did you feel the frustrations we talked about? If you are considering connecting things like Dynamo or Mongo with Snowflake and Databricks for your next project, does our platform look appealing? We invite you to give it a try at cloud.splicemachine.io. We are happy to give you an unlimited free trial, meaning any size cluster, in your cloud account in exchange for your feedback.

Thank you!

Monte


Monte Zweben

CEO and co-founder of Splice Machine. Carnegie Mellon CS Advisory Board, NASA AI Deputy Chief, CEO Blue Martini Software, Red Pepper Software, Rocket Fuel