To avoid getting into an infinite spiral of blog posts or Google searches, we will try to move away from the trend a little to focus on understanding what a modern data stack consists of and, above all, review the benefits that implementing it can give us.
Lately, we’ve been hearing and reading a lot about the Modern Data Stack (MDS from now on), to the point that one of the most prominent companies in the world of data (if not the most prominent), Databricks, is promoting the concept along with other vendors. The term has become so popular, and so distorted, that there is even talk of a post-MDS era.
What’s new, Doc?
It is important to start by clarifying that the implementation of a stack of this style is based on three pillars:
1) Adoption of cloud services to take advantage of all their benefits: flexibility, pay-per-use, and scalability, to mention the most important. In addition, we will find a large offering of purpose-built services that we can implement as needed.
2) Separation of the architecture into layers in such a way that we can build and validate it in an iterative and modular way, avoiding monolithic solutions that do not scale.
3) Change of the ETL paradigm to ELT. We leave behind the good old days when a single tool was in charge of extracting the source data, processing it, and loading it into our final destination, a Data Warehouse. The proposal now is to load the data as it is in the source into a central repository (a Data Lake, a Data Warehouse, which is still a valid option, or a Lakehouse) and only then transform it. This way we take advantage of distributed data storage and processing.
Because these three points have become a kind of industry standard, new companies have emerged on top of them, focused on building tools with a very specific purpose. The novelty, then, is that we now have a large number of tools to solve very specific problems when building a modern data architecture.
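To make the third pillar concrete, here is a minimal sketch of the load-first, transform-later flow. It is only an illustration under stated assumptions: an in-memory SQLite database stands in for the central repository, and the table names (`raw_orders`, `customer_revenue`) and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical source records, kept exactly as the source emits them:
# inconsistent casing, stray spaces, and amounts stored as text.
RAW_ORDERS = [
    ("1", " Alice ", "120.50"),
    ("2", "bob", "80"),
    ("3", "ALICE", "19.50"),
]


def extract_and_load(conn, records):
    """E + L: copy the source rows into a raw table without touching them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders (id TEXT, customer TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", records)


def transform(conn):
    """T: clean and aggregate with SQL executed by the repository's own engine."""
    conn.execute("DROP TABLE IF EXISTS customer_revenue")
    conn.execute(
        """
        CREATE TABLE customer_revenue AS
        SELECT lower(trim(customer)) AS customer_name,
               sum(CAST(amount AS REAL)) AS revenue
        FROM raw_orders
        GROUP BY lower(trim(customer))
        """
    )


conn = sqlite3.connect(":memory:")  # stand-in for the central repository
extract_and_load(conn, RAW_ORDERS)
transform(conn)
revenue = conn.execute(
    "SELECT customer_name, revenue FROM customer_revenue ORDER BY customer_name"
).fetchall()
print(revenue)
```

The raw table is never modified, so if the business logic changes we can rebuild `customer_revenue` from it without going back to the source.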
To bring this down to earth, let’s review the different layers of these architectures. For each one, we will see:
The desirable characteristics to look for.
Some available tools.
Benefits that we can obtain.
Storage and computing
As we mentioned, modern data architectures have a central repository where we will store our data, either in its original version (raw data) or in a later stage, after applying transformations or business logic. That is why, in addition to storage, the technology we implement should provide us with the necessary computing to access the data efficiently.
Whatever tool we choose, let’s keep in mind that this layer will be the center of the architecture and, therefore, the one that will have the most influence on the rest.
In terms of the desirable characteristics that we should look for, we can list:
Separation of storage and computing: it is crucial that we can scale them separately as needed. Ingesting more data should not force us to acquire more computing capacity, and vice versa.
Pay-per-use: one of the most outstanding features of operating in the cloud. What we pay should be directly related to the use we make of the data, without the need for upfront investments.
Integration: if this tool will be the center of the architecture, it goes without saying that it must integrate with, at least:
ELT tools
Visualization tools.
Open formats: the adoption of open data formats is increasing, either traditional, such as Parquet or JSON, or the evolutions that allow database-type operations (transactions with ACID characteristics), such as Delta Lake or Apache Iceberg. That is why it is essential that we can consume these types of files without being tied to a proprietary format that generates vendor lock-in.
Security: if we are going to store all our data, it goes without saying that security plays a key role. Among other things, the tool has to allow:
Granular access: each user will only access the data on which they have permissions, with the possibility of applying them down to a row and column level.
Encryption.
Isolation: if required, we could configure the environment in such a way that data exchange takes place within the resources we are using without going through the Internet.
Performance.
The most popular tools in this layer are:
Amazon Redshift.
Snowflake.
BigQuery.
Databricks.
Azure Synapse.
Regarding the benefits of implementing these tools, we have:
Cost efficiency associated with pay-per-use.
Scalability as we can increase storage and/or computing capacity on demand and easily.
Data availability at the time of loading.
Unified data repository avoiding silos of information.
From here on we will refer to this layer as the Data Warehouse (DW).
Ingestion
In this layer, we will take care of feeding the tool that we will use in the storage and computing layer. That is, we will bring data from different sources and then use it in various ways.
Within the scope of MDS, what will interest us the most is:
Zero Coding: so that data engineers do not spend time on tasks that the tool can solve for us.
Variety of connectors: the way we will interact with these tools is through the use of connectors that we will configure in a visual interface.
Source: any application or product that generates data, from a CRM to tracking tools.
Destination: DW.
Some available tools:
Open Source:
Airbyte.
Meltano.
Paid:
Fivetran.
Stitch.
Matillion.
The benefits of using these tools: focus on the what, not the how. That is:
More data in less time.
Focus on the business: the resources and development time saved on integrations can be dedicated to tasks that add value.
Robustness.
Scalability.
Transformation
The data we ingest will not generate any value by itself. We need to clean it, enrich it, join it with other data and, finally, generate datasets that can be exploited later, either by users or by applications.
What characteristics are we interested in for this layer?
Use of the DW as a processing engine.
Ability to delegate tasks that do not add value to the tool.
Automatic validations.
Metadata documentation: as we are building datasets that will be used by other people, it is important, if not essential, to document the metadata so that the datasets are easier to discover and use.
Data as Code, as a step toward adopting DataOps. For that, we need the tool to support:
Reusability.
Versioning.
CI/CD.
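As an illustration of automatic validations treated as code, the sketch below declares a few checks as plain data and runs each one as a query that counts violations. It is not how any particular tool implements its tests; the `customers` table, its columns, and the check names are assumptions made for the example, and SQLite stands in for the DW.

```python
import sqlite3

# Declarative checks, kept next to the model so they can be versioned with it.
# Table and column names are hypothetical.
CHECKS = [
    ("customers", "id", "not_null"),
    ("customers", "id", "unique"),
    ("customers", "email", "not_null"),
]

# Each check is a query that counts violations; zero means the check passes.
CHECK_SQL = {
    "not_null": "SELECT count(*) FROM {table} WHERE {column} IS NULL",
    "unique": (
        "SELECT count(*) FROM ("
        "SELECT {column} FROM {table} WHERE {column} IS NOT NULL "
        "GROUP BY {column} HAVING count(*) > 1)"
    ),
}


def run_checks(conn, checks):
    """Run every check and return {(table, column, check): violation_count}."""
    results = {}
    for table, column, check in checks:
        # Identifiers come from trusted, version-controlled config, not user input.
        sql = CHECK_SQL[check].format(table=table, column=column)
        results[(table, column, check)] = conn.execute(sql).fetchone()[0]
    return results


conn = sqlite3.connect(":memory:")  # stand-in for the DW
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (2, "c@example.com")],
)
results = run_checks(conn, CHECKS)
failed = [name for name, violations in results.items() if violations > 0]
print(failed)
```

Because the checks are just data in a file, they can be reviewed, versioned, and executed in a CI/CD pipeline like any other code.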
The star tool in this layer is dbt, which in the last two years has established itself as almost the only option.
The benefits we will obtain in this case are:
Shorter development times.
Improvements in the quality and availability of data.
Maintainable and clean code.
Increased understanding through documentation and data lineage.
Exploitation
Having a data architecture will not be useful if we do not use the data we generate.
Although there are many ways to exploit them, the most common are:
Visualization.
Machine Learning.
Operationalization (easier done than said): also known as reverse ETL, it basically means acting on the data directly by feeding it back into the operational tools. For example, we could send a field calculated in the DW to the contacts in a CRM.
The main feature to keep in mind is the ability of these tools to connect to and operate directly on the DW. If we build our data architecture around this unified information repository, it is reasonable to design our processes and developments to use it, as much as possible, as the single source for data exploitation.
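The CRM example can be sketched in a few lines. This is a simplified illustration, not the API of any real CRM or reverse ETL product: the `orders` table, the `lifetime_value` property, and the `send` callback are all hypothetical, and SQLite stands in for the DW.

```python
import sqlite3


def build_crm_updates(conn):
    """Compute a field in the DW and shape it as one update per CRM contact."""
    rows = conn.execute(
        "SELECT email, sum(amount) FROM orders GROUP BY email ORDER BY email"
    ).fetchall()
    return [
        {"email": email, "properties": {"lifetime_value": lifetime_value}}
        for email, lifetime_value in rows
    ]


def sync(updates, send):
    """Push every update through `send`, a placeholder for a CRM client call."""
    for update in updates:
        send(update)
    return len(updates)


conn = sqlite3.connect(":memory:")  # stand-in for the DW
conn.execute("CREATE TABLE orders (email TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana@example.com", 50.0), ("ana@example.com", 25.0), ("li@example.com", 10.0)],
)

sent = []  # collects what would be sent to the CRM
synced = sync(build_crm_updates(conn), sent.append)
print(synced, sent)
```

In a real setup, `send` would wrap the CRM's own client, and the query would read from a dataset built in the transformation layer.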
Regarding the availability of tools, visualization is where we will find the most alternatives, mainly from the traditional Business Intelligence world (Power BI, Tableau, Superset, and Metabase, among paid and free options). It is worth mentioning the growing trend of decoupling the semantic layer from these tools in order to manage it separately, which is generally known as Headless BI. In short, how the tables of our data model are joined, where to find metrics and dimensions, and how to calculate KPIs are all defined in one tool and then consumed from another. We believe that the evolution and adoption of this paradigm is the best way to achieve the ideal of self-service BI.
Regarding Machine Learning, although tools such as Continual or Neptune are beginning to emerge under the MDS umbrella, the most important progress is the recognition of the lack of processes and the chaos that reigns in the development and implementation (if we are lucky) of models. To improve this situation, teams are beginning to incorporate the practice of MLOps, which is more of a cultural issue than a question of which tools to use. Even so, new technologies will certainly emerge that, as in other layers, will allow us to deliver value much faster by solving issues that today we must resolve manually.
Lastly, we mentioned reverse ETL and how it allows us to intervene directly in business operations. The most popular tools today are Grouparoo (Open Source), Census, and Hightouch (both paid).
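One way to picture a decoupled semantic layer is as a small, shared definition of metrics and dimensions that any consumer can compile into SQL. The sketch below is a toy version of that idea, not the format of any specific Headless BI product; the `SEMANTIC_MODEL` structure, the `orders` table, and the metric names are assumptions for the example.

```python
import sqlite3

# Hypothetical semantic layer: metrics and dimensions are defined once, by name.
SEMANTIC_MODEL = {
    "table": "orders",
    "metrics": {"revenue": "sum(amount)", "order_count": "count(*)"},
    "dimensions": {"country": "country"},
}


def compile_query(model, metric, dimension):
    """Turn a (metric, dimension) request into SQL using the shared definitions."""
    metric_sql = model["metrics"][metric]
    dimension_sql = model["dimensions"][dimension]
    return (
        f"SELECT {dimension_sql} AS {dimension}, {metric_sql} AS {metric} "
        f"FROM {model['table']} GROUP BY {dimension_sql} ORDER BY {dimension_sql}"
    )


conn = sqlite3.connect(":memory:")  # stand-in for the DW
conn.execute("CREATE TABLE orders (country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("AR", 10.0), ("AR", 5.0), ("US", 7.0)],
)

# A dashboard and a notebook would both ask for "revenue by country"
# and get the same calculation, because the definition lives in one place.
revenue_by_country = conn.execute(
    compile_query(SEMANTIC_MODEL, "revenue", "country")
).fetchall()
orders_by_country = conn.execute(
    compile_query(SEMANTIC_MODEL, "order_count", "country")
).fetchall()
print(revenue_by_country, orders_by_country)
```

If the definition of `revenue` changes, it changes in one place and every consumer picks it up.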
Orchestration
The final layer that brings together our entire stack.
If connectors are important in other layers, they become crucial here as the tool must communicate with each of the layers we have detailed.
It is also important here to be able to write our pipelines as code in order to achieve reusability, versioning, and the rest of the features we have already mentioned.
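A pipeline written as code can be as simple as a dictionary of tasks and their dependencies. The sketch below uses Python's standard `graphlib` module to run tasks in dependency order. It is a toy illustration: real orchestrators add scheduling, retries, and monitoring on top of this idea, and the task names here are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "ingest": set(),
    "transform": {"ingest"},
    "validate": {"transform"},
    "reverse_etl": {"validate"},
    "refresh_dashboards": {"validate"},
}


def run(pipeline, tasks):
    """Execute every task after all of its dependencies; return the run order."""
    executed = []
    for name in TopologicalSorter(pipeline).static_order():
        tasks[name]()
        executed.append(name)
    return executed


log = []
# Placeholder task bodies: each one only records that it ran.
tasks = {name: (lambda name=name: log.append(name)) for name in PIPELINE}
order = run(PIPELINE, tasks)
print(order)
```

Since the pipeline is plain code, it can live in the same repository as the transformations and go through the same review and versioning process.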
Two other fundamental features in this layer are:
Observability: we need to measure and monitor every step of our pipeline to anticipate errors and correct them preventively.
Idempotence: our code must be prepared so that, when reprocessing is needed (and anyone who says it has never happened is lying), running it with the same set of input parameters produces exactly the same result.
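A common way to get idempotent loads is to replace a whole partition instead of appending to it. The sketch below shows that pattern under simple assumptions: SQLite stands in for the DW, and the `daily_sales` table and `load_day` function are hypothetical names.

```python
import sqlite3


def load_day(conn, day, rows):
    """Replace the whole partition for `day`, so reruns never duplicate data."""
    with conn:  # one transaction: the delete and the insert succeed or fail together
        conn.execute("DELETE FROM daily_sales WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_sales (day, store, total) VALUES (?, ?, ?)",
            [(day, store, total) for store, total in rows],
        )


conn = sqlite3.connect(":memory:")  # stand-in for the DW
conn.execute("CREATE TABLE daily_sales (day TEXT, store TEXT, total REAL)")

rows = [("north", 100.0), ("south", 250.0)]
load_day(conn, "2024-01-01", rows)
first_run = conn.execute(
    "SELECT day, store, total FROM daily_sales ORDER BY store"
).fetchall()

# Reprocessing the same day with the same input leaves the table unchanged.
load_day(conn, "2024-01-01", rows)
second_run = conn.execute(
    "SELECT day, store, total FROM daily_sales ORDER BY store"
).fetchall()
print(first_run == second_run)
```

A plain `INSERT` without the `DELETE` would double the rows on the second run, which is exactly the kind of surprise that idempotence is meant to avoid.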
The most widely adopted tool today is Apache Airflow, with two new contenders that were developed with the promise of solving or improving problems that we can find when using Airflow: Dagster and Prefect.
Myth or reality 🤔
Now that we have covered each layer of an MDS, let’s see some of the myths that often appear:
“If I incorporate the MDS tools, I modernize my architecture.”
On the one hand, yes, but the reality is that we must incorporate them with the characteristics and benefits we want to obtain in mind. If we add tools without clear criteria, the remedy will be worse than the disease.
“If I don’t use the MDS tools, my stack is obsolete.”
This is the inverse of the previous myth. The answer does not depend on the tools we use, but on how rigid our architecture is and, above all, on how quickly we are delivering value to the business.
“There is only one way to implement MDS.”
Of course not. We have countless alternatives, not only in tools (each with its pros and cons) but also depending on the maturity level or even the composition of the team. These factors can even lead us to dispense with a layer altogether.
“With an MDS I solve all my use cases.”
Maybe yes, but it’s most likely that cases such as real-time data processing or machine learning on unstructured data will be left out.
“The MDS is costly”
The advantage of working with cloud services and thinking about architecture in a modular way is that we can plug and unplug tools as convenient and pay only for what we use. We can easily start with a stack that requires an investment of less than $100 a month.
“The MDS is complex”
It can be as complex as we want it to be but, as we always prefer at Minimalistech, it is better to Keep It Simple, Stupid.
By Minimalistech's editorial team.
Minimalistech has more than 10 years of experience in providing a wide range of technology solutions based on the latest standards. We have built successful partnerships with several top-performing SF Bay Area companies, enhancing their potential and growth by providing highly skilled IT engineers to work on their developments and projects.