Data Lake on ACID


2019-10-22, 11:00 Posted by: Tomi Asp

 

Have you ever dreamed of a data platform that allows you to

  • Maintain versions of data sets used to train Machine Learning (ML) models
  • Provide a full audit log of any changes made to the data
  • Have ACID (Atomicity, Consistency, Isolation, Durability) transactions in your data lake
  • Use schemas to help with data quality
  • Handle metadata with ease?

[Figure: data platform]

Databricks includes Delta Lake, an open source storage layer that enables ACID transactions on data lakes. It does this with a transaction log that records every transaction made to the data lake. To learn more about how these logs are used within Databricks, refer to their own documentation. Many of the capabilities mentioned above are enabled thanks to this log.
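To make the idea concrete, here is a minimal sketch in plain Python of how an ordered log of commits can give you both versioned snapshots and an audit trail. It is a toy model under stated assumptions, not Delta Lake's implementation: the `Commit` and `ToyLog` classes, the file names and the user names are all hypothetical, and a real Delta table keeps its log as files next to the data rather than in memory.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Commit:
    """One atomic transaction: files added and removed, plus who did it."""
    user: str
    operation: str
    added: tuple = ()
    removed: tuple = ()


@dataclass
class ToyLog:
    """Append-only list of commits; the list index acts as the table version."""
    commits: list = field(default_factory=list)

    def commit(self, entry: Commit) -> int:
        # A commit is all-or-nothing: it either lands in the log or it does not.
        self.commits.append(entry)
        return len(self.commits) - 1

    def snapshot(self, version: int) -> set:
        """Replay commits 0..version to get the data files live at that version."""
        if not 0 <= version < len(self.commits):
            raise ValueError(f"unknown version {version}")
        files = set()
        for entry in self.commits[: version + 1]:
            files |= set(entry.added)
            files -= set(entry.removed)
        return files

    def history(self) -> list:
        """Audit trail: (version, user, operation) for every commit."""
        return [(v, c.user, c.operation) for v, c in enumerate(self.commits)]


log = ToyLog()
v0 = log.commit(Commit("alice", "WRITE", added=("part-0.parquet",)))
v1 = log.commit(
    Commit("bob", "MERGE", added=("part-1.parquet",), removed=("part-0.parquet",))
)

print(log.snapshot(v0))  # files as of the first version
print(log.snapshot(v1))  # files after the second commit
print(log.history())     # who did what, in order
```

Because old commits are never rewritten, any earlier version can be rebuilt by replaying the log up to that point, which is what makes it possible to pin an ML training run to a specific version of the data.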

With Lambda architecture

Lambda architecture is a data-processing architecture that enables batch and stream processing in a single system. The purpose is to manage big-data workloads more efficiently, with lower latency and increased throughput.

Lambda architecture is made of three separate layers: a batch layer, a speed layer and a serving layer. In this design, the batch pipeline includes a Databricks Jobs operation: every batch input invokes the necessary Databricks job, which updates the Delta table and provides batch data to the speed layer.

[Figure: lambda view]
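The way the three layers fit together can be sketched in a few lines of plain Python. This is only an illustration of the general pattern, not the Databricks pipeline in the design above: the `aggregate` and `serve` functions and the sensor data are hypothetical, and a real system would run the batch and speed layers as separate jobs.

```python
from collections import defaultdict


def aggregate(events):
    """Sum amounts per key; used here by both the batch and the speed layer."""
    totals = defaultdict(float)
    for key, amount in events:
        totals[key] += amount
    return dict(totals)


def serve(batch_view, speed_view):
    """Serving layer: merge the precomputed batch view with recent increments."""
    merged = dict(batch_view)
    for key, amount in speed_view.items():
        merged[key] = merged.get(key, 0.0) + amount
    return merged


# Events already processed by the last batch run (hypothetical data).
batch_events = [("sensor-a", 10.0), ("sensor-b", 5.0), ("sensor-a", 2.5)]
# Events that arrived after that run and are only known to the speed layer.
recent_events = [("sensor-a", 1.0), ("sensor-c", 4.0)]

batch_view = aggregate(batch_events)
speed_view = aggregate(recent_events)
print(serve(batch_view, speed_view))
```

The batch view is complete but stale, the speed view is fresh but partial, and a query against the serving layer sees both combined. When the next batch run absorbs the recent events, the speed view for that period can be discarded.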

With MLOps

The following design also includes Azure’s machine learning service to manage, deploy and monitor models developed with Databricks, or any other tool for that matter. With Databricks’ transaction logs and Azure’s machine learning service, you can build an end-to-end audit trail capability to see which data version was used, who last edited that data and who deployed the model into production.
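As a rough illustration of what such an audit trail needs to record, the sketch below ties a deployed model to the data version it was trained on and to the people involved. Everything in it is hypothetical: the `Deployment` record, the `trace` helper and the names are made up for the example and do not correspond to any Databricks or Azure API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Hypothetical audit record tying a deployed model to its training data."""
    model: str
    model_version: int
    data_version: int    # table version the model was trained on
    data_edited_by: str  # last editor of that data version
    deployed_by: str     # who promoted the model to production


def trace(deployments, model):
    """Return the most recent deployment record for a model, or None."""
    matches = [d for d in deployments if d.model == model]
    return max(matches, key=lambda d: d.model_version, default=None)


deployments = [
    Deployment("churn", 1, data_version=3, data_edited_by="alice", deployed_by="carol"),
    Deployment("churn", 2, data_version=5, data_edited_by="bob", deployed_by="carol"),
]

latest = trace(deployments, "churn")
print(latest.data_version, latest.data_edited_by, latest.deployed_by)
```

In practice the data version and its last editor would come from the table's transaction log, and the deployment details from the model management service, so the record is assembled rather than typed in by hand.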

Together with a full audit trail, you can use these services to build sophisticated automated ML pipelines that track model degradation and automatically create and deploy new models.

Get in touch 

This is not to say that this is the only approach or the best one for you. The cloud enables many variations of data platforms for managing your data, of which this is only one. If you want to discuss your needs and find the best one for you, I am sure one of our sales guys would be happy to contact you.

We are also going to Microsoft Ignite 2019. I will be there as well if you want to meet up and discuss data over a beer. Niko buys.



Do you need help with your data driven business? Contact us

Niko Ylänen

Sales Executive
