In this post, we walk through the design process of a serverless data lake built on cloud-native technologies.
What is a data lake?
In short, a data lake is a storage system that allows you to store all your data in a centralized manner, without the need to carefully structure it first. Data can be stored for a specific use case, but that is not a prerequisite; it can be collected and stored in a data lake with just the idea that it might have some use in the future. The rapid development of data analytics software for data-driven companies has opened new possibilities to utilize data that, until recently, was categorized as useless or obsolete. Every second, a huge amount of data is created from which value may be extracted, if only it can be analyzed. This ranges from machine learning and aggregating real-time streaming data to analyzing the whole history of log files stored in a data lake.
Consider the sheer volume of data collected across a modern organization. One of the main enablers for the current wave of data lakes has clearly been inexpensive cloud storage and compute, available at any scale to anyone. This can be understood as a purely technical or economical foundation for the data-driven model. Another, sometimes even bigger, challenge an organization must tackle is breaking down the data silos that prevent access to data scattered across different locations and different systems.
A data lake is not a data warehouse
Now that we have some idea of what a data lake is, one may argue that it is the same thing as an older concept, the data warehouse. But even though both are used for storing vast amounts of data, the terms data lake and data warehouse are not interchangeable. They are very different concepts and, in many cases, the two can co-exist.
A data warehouse is a relational, hierarchical data repository, typically integrating data from multiple sources. Data loaded into a data warehouse is expected to conform to a specific schema (the schema-on-write model). This means that before loading the data, an ETL (Extract, Transform, Load) operation must be run to extract the valuable data and transform it to fit a purpose. This pattern makes it mandatory to understand the data before it can be loaded into the warehouse. Are there anomalies in the data that should cause records to be discarded? What is the data structure in the source system, and what is it in the warehouse? Is this data valuable at all? All of these questions must be answered beforehand.
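To make the schema-on-write pattern concrete, here is a minimal ETL sketch in Python. The record fields, the anomaly and the in-memory "warehouse" are all hypothetical, chosen only to illustrate that non-conforming data is rejected before it is loaded.

```python
# Minimal schema-on-write ETL sketch; field names and records are hypothetical.
RAW_EVENTS = [
    {"user": "alice", "amount": "19.90", "ts": "2019-03-01T10:00:00"},
    {"user": "bob", "amount": "n/a", "ts": "2019-03-01T10:05:00"},  # anomaly
]

def extract():
    """Extract: read records from the source system."""
    return RAW_EVENTS

def transform(records):
    """Transform: enforce the warehouse schema, discarding anomalies."""
    clean = []
    for rec in records:
        try:
            clean.append({
                "user": str(rec["user"]),
                "amount": float(rec["amount"]),  # must parse as a number
                "ts": rec["ts"],
            })
        except (KeyError, ValueError):
            continue  # does not fit the schema: never reaches the warehouse
    return clean

def load(records, warehouse):
    """Load: append conforming rows to the warehouse table."""
    warehouse.extend(records)

warehouse_table = []
load(transform(extract()), warehouse_table)
```

Note how the decision about what the data must look like is made before loading; only the conforming record ends up in the warehouse.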
By the time the data is made available for use, it has already been enriched, transformed and cleaned. Data in a data warehouse is typically of high quality, serving many use cases as a single source of truth.
In contrast, data lakes rely on a schema-on-read model, which makes it easy to import new data into the lake: structured, semi-structured or unstructured, it doesn't matter. Typically, data intake is not curated for quality; ingested data is kept in its purest form, as raw data. Where a data warehouse gatekeeper asks whether there is a business requirement for the imported data, a data lake gatekeeper asks no such questions. With the right analysis, the right tools and the right hypothesis, any imported data may turn out to be valuable.
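The schema-on-read model can be sketched in a few lines of Python. Here raw JSON lines land in the lake untouched, and a caller-chosen schema is applied only at read time; the field names and schemas are assumptions for illustration.

```python
import json

# Schema-on-read sketch: raw records are stored as-is; structure is
# imposed only when the data is read. Fields here are hypothetical.
raw_zone = [
    '{"user": "alice", "amount": "19.90", "extra": {"device": "mobile"}}',
    '{"user": "bob"}',  # missing fields are perfectly fine at ingest time
]

def read_with_schema(raw_lines, schema):
    """Apply the consumer's schema while reading; the raw data stays intact."""
    for line in raw_lines:
        rec = json.loads(line)
        yield {field: cast(rec.get(field)) for field, cast in schema.items()}

# Different consumers can read the very same raw data with different schemas.
billing_schema = {"user": str, "amount": lambda v: float(v or 0.0)}
rows = list(read_with_schema(raw_zone, billing_schema))
```

The raw zone never changes; a second consumer interested only in devices could define its own schema over the same lines without any re-ingestion.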
After data is ingested into a data lake, each data element is assigned a unique identifier and a set of metadata. This is called data cataloging, or a metastore, and it is an important principle in data lake architecture. With a data catalog, queries can be limited to a smaller data set instead of processing all the data in the lake. When the data is read, it is transformed and a schema is applied to meet the needs of the analysis or use case.
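A toy catalog makes the pruning benefit visible. Each registered object gets a unique identifier and metadata, and a query first consults the catalog to decide which objects it must scan; the paths and metadata fields below are hypothetical.

```python
import uuid

# Toy data catalog sketch: every ingested object gets a unique id and
# metadata. Object paths and metadata fields are illustrative only.
catalog = []

def register(path, **metadata):
    """Assign a unique identifier and attach metadata to an ingested object."""
    entry = {"id": str(uuid.uuid4()), "path": path, **metadata}
    catalog.append(entry)
    return entry

register("raw/events/2019/03/01/part-0.json", source="webshop", day="2019-03-01")
register("raw/events/2019/03/02/part-0.json", source="webshop", day="2019-03-02")
register("raw/logs/2019/03/01/app.log", source="app", day="2019-03-01")

def paths_for(**filters):
    """Return only the objects a query needs, instead of the whole lake."""
    return [e["path"] for e in catalog
            if all(e.get(k) == v for k, v in filters.items())]

hits = paths_for(source="webshop", day="2019-03-01")
```

Real implementations (a Hive metastore, AWS Glue Data Catalog) do the same job at scale, but the principle is this simple: metadata lets a query skip most of the lake.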
Hopefully this short comparison helps to clarify the key differences between a data lake and a data warehouse. It also underlines the strategy and technology that must be adopted when moving towards a flexible data lake architecture.
A data lake is an abstract idea. Implementations vary from on-premises, block-storage-based HDFS (Hadoop Distributed File System) to cloud-native object storage offering virtually limitless capacity, such as AWS Simple Storage Service (S3) or Azure Storage. The first problem of a data lake, creating the storage and cataloging the data so that it can be queried, is relatively easy to solve; the main challenge of generating business value from that data is the hard one. It is not a purely technical problem, but rather a question of applying a complete data management mindset to a company's data.
We wanted to utilize a modern cloud-native serverless architecture for the data lake: offering a standardized core architecture while tailoring it to each company's varying data landscape. One of the main objectives was to reduce a key metric, time-to-interpret for ingested data: how fast can sensible information be extracted from the data lake in a cost-efficient way? Part of the technical solution was maximizing direct query capabilities against the stored data, minimizing the need to move data between consuming systems or to import it into an external data warehouse for analysis.
We also wanted to go beyond the data lake as chaotic storage, and instead build a thought-out solution for big data analytics that empowers the use of collected data. Be it a protected sandbox for exploring and understanding the collected data, or a use case of transforming and loading data from the lake into external business intelligence systems, we wanted to ensure that the core functionality is available and easily tailored to customer needs.
In the design process, the following questions were addressed:
Picture: Data management platform
How to ingest data?
- Make ingest patterns as versatile as possible. Data is collected and moved to the data lake from various sources, with the possibility of real-time streaming.
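One common ingest convention worth sketching is date-partitioned object keys: whatever the source, events land under a predictable, Hive-style prefix so that later queries can prune by partition. The bucket layout and naming below are assumptions for illustration, not a fixed design.

```python
from datetime import datetime, timezone

# Ingest sketch: build a date-partitioned object key (a common S3 layout).
# The "raw/<source>/year=/month=/day=" convention here is an assumption.
def object_key(source, event_time, seq):
    """Return a Hive-style partitioned key for an ingested batch of events."""
    return (f"raw/{source}/year={event_time:%Y}/month={event_time:%m}/"
            f"day={event_time:%d}/part-{seq:05d}.json")

key = object_key("webshop", datetime(2019, 3, 1, tzinfo=timezone.utc), 0)
# In a real pipeline the record body would be PUT to this key via an
# object storage client; here we only show the naming convention.
```

The same convention works for both batch drops and streaming delivery, since a stream writer only needs the event timestamp to pick the right partition.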
How to store the data?
- Security. To protect data assets, granular access control to the data is mandatory. Various roles, stakeholders and data-consuming tools must have a controlled way to access the data, and all access requests must be logged.
- Processed data store. Batch jobs automatically transform selected raw data into enriched, ready-to-be-consumed data.
- Utilize the Parquet columnar storage format for fast and cost-efficient queries.
- Data life-cycle policies for moving data between storage classes to meet cost and regulatory requirements.
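As a concrete example of the last point, on S3 a life-cycle policy is just a declarative rule set attached to the bucket. The sketch below shows the shape of such a configuration as a Python dict (the form boto3's `put_bucket_lifecycle_configuration` accepts); the prefix, transition days and retention period are illustrative assumptions, not recommendations.

```python
# Sketch of an S3 lifecycle configuration; all values are illustrative.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "age-out-raw-data",
            "Filter": {"Prefix": "raw/"},   # applies only to the raw zone
            "Status": "Enabled",
            "Transitions": [
                # Move aging raw data to cheaper storage classes.
                {"Days": 90, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
            # Delete after an (assumed) regulatory retention period.
            "Expiration": {"Days": 2555},
        }
    ]
}
```

The same pattern scales to many rules: one per data zone, each with its own prefix filter, transitions and expiration.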
How to analyze and consume?
- Raw data moves the responsibility of making sense of the data to the moment it is consumed. On the other hand, raw data also helps in finding new ways to explore the data, going beyond the structured model of a data warehouse.
- Business intelligence tools and external data consumers need to integrate seamlessly with the data lake. ODBC, JDBC and support for the Hadoop framework must be covered.
- Faster insights: data can be queried as it is, providing an interface to the data that delivers results faster than a traditional data warehouse model.
A data lake without data makes no sense. The basic architecture can therefore be set up beforehand, but only after a company's data stores and data flows have been analyzed can a data lake be fully deployed to produce business value.
In this post we have set the stage: what a data lake is and what challenges it needs to solve. In the following post, we will focus on the technical details: how to implement a serverless data lake, what the options were, and what the trade-offs of each option were.