Data Lakes: Uniting Systems
It’s been several years since I dabbled in semi-structured data theory. Its mathematical formulas were so complex that the set equations I already knew looked simple next to them.
At that time we had no clear application for semi-structured databases, as happens with any new theoretical branch. There was hardly any talk of indexing, with some kind of tree, the XML data that had become so fashionable with the rise of WSDL 1.0 Web Services. But an era of highly dynamic data arrived, and the rules imposed by relational databases fell short of the accelerated pace of new web and mobile applications.
In those early Web Services, building the famous “Envelope” cost more computation than the payload it was meant to carry. Imagine hauling a single small package in a trailer over a long distance; that is how optimized everything was back then, and I’m talking about the 2000s. I still remember the first optimizations I was involved with: waiting for that truck to fill up before sending it out over the wire. We were dealing with millions of messages per second, so the truck filled quickly, and sending an envelope that carried more data than its own overhead was a big improvement.
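The "fill the truck before sending it" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Batcher` class, the `overhead_ratio` helper and the 400-byte envelope cost are hypothetical and do not come from any real Web Services stack.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed fixed cost of one envelope, in bytes (hypothetical figure).
ENVELOPE_OVERHEAD_BYTES = 400


@dataclass
class Batcher:
    """Collects messages and ships them together once the 'truck' is full."""
    capacity: int
    pending: List[str] = field(default_factory=list)
    sent: List[List[str]] = field(default_factory=list)

    def add(self, message: str) -> None:
        self.pending.append(message)
        if len(self.pending) >= self.capacity:
            self.flush()

    def flush(self) -> None:
        # Ship whatever is waiting as a single envelope.
        if self.pending:
            self.sent.append(list(self.pending))
            self.pending.clear()


def overhead_ratio(envelopes: List[List[str]]) -> float:
    """Fraction of transmitted bytes spent on envelopes rather than payload."""
    payload = sum(len(m.encode("utf-8")) for env in envelopes for m in env)
    overhead = ENVELOPE_OVERHEAD_BYTES * len(envelopes)
    return overhead / (overhead + payload)
```

With a capacity of 3, seven messages travel in three envelopes instead of seven, so a smaller share of the traffic is envelope overhead.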
And all of it for schema compliance, to give continuity to the structured data in our databases of that time. Then the first semi-structured database management systems arrived, such as MongoDB, which store JSON documents and let us exploit them with their own query language, MQL, to query and traverse paths within flexibly structured data.
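To make "traversing paths" concrete, here is a minimal sketch in plain Python. It is not MQL and does not use any MongoDB driver; `get_path` is a hypothetical helper that only imitates the dotted-path idea over a nested document.

```python
from typing import Any


def get_path(document: Any, path: str, default: Any = None) -> Any:
    """Follow a dotted path such as 'customer.address.city' through
    nested dicts and lists; return `default` if any step is missing."""
    current = document
    for part in path.split("."):
        if isinstance(current, dict) and part in current:
            current = current[part]
        elif (isinstance(current, list) and part.isdigit()
              and int(part) < len(current)):
            current = current[int(part)]
        else:
            return default
    return current
```

Because a missing step returns a default instead of failing, two documents with different shapes can be queried with the same path, which is the flexibility a fixed relational schema does not give.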
And all this reduces the amount of data traveling over our networks. It is especially noticeable on smartphones: when they lose a little signal, the data still arrives and we can look up what we need without much trouble, with the exception of videos or images, but that’s a whole other topic.
Now we are taking another leap in all this theory with the arrival of Data Lakes. Although data lakes have been discussed for a while, there was no clear implementation, and it sounded more like a theory without applications. Those first implementations of Apache Data Lakes were a headache to shape, and the processing and response speeds were so disappointing that it seemed better to just put MongoDB behind everything again.
But things have improved. We no longer have only the Apache option; Microsoft has joined in too, making an implementation of the architecture doable, though still complex. And there are further options such as Elasticsearch, which give hope for simple and useful implementations.
But after all this, what is a Data Lake? It is an information repository, like a database, with the difference that the information comes from a variety of sources, with different data structures and varying availability. This is especially useful now that “there is an app for everything”: a GPS app, an app to manage money, an app to place orders, and so on. For an organization that has already implemented a separate system for each area, it becomes complicated and very expensive to pull all that information into a single point and then exploit it by extracting the corresponding KPIs.
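As a rough illustration of that definition, the sketch below keeps records from several apps in one place without forcing a common schema, then derives one KPI across them. Everything in it is hypothetical: `TinyLake`, the source names and the `AMOUNT_FIELD` mapping are invented for the example and do not reflect any particular product.

```python
from typing import Any, Dict, List


class TinyLake:
    """Stores raw records as they arrive, tagged only with their source."""

    def __init__(self) -> None:
        self._records: List[Dict[str, Any]] = []

    def ingest(self, source: str, record: Dict[str, Any]) -> None:
        # No schema check: the record is kept exactly as the source sent it.
        self._records.append({"source": source, "raw": record})

    def records(self) -> List[Dict[str, Any]]:
        return list(self._records)


# Assumed mapping: which field holds a monetary amount in each source.
AMOUNT_FIELD = {"orders_app": "total", "finance_app": "amount_usd"}


def total_revenue(lake: TinyLake) -> float:
    """A KPI computed at read time across sources with different schemas."""
    total = 0.0
    for entry in lake.records():
        field_name = AMOUNT_FIELD.get(entry["source"])
        if field_name is not None and field_name in entry["raw"]:
            total += float(entry["raw"][field_name])
    return total
```

The design choice worth noting is that the structure is interpreted when the KPI is read, not when the data is written, so adding a new app means adding a mapping entry rather than migrating a schema.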
That’s what a data lake is for. It brings information together regardless of which schema it belongs to. Data lakes specialize in storing and indexing semi-structured data. Response times over a significant amount of data may still not be what is expected, but that can be solved quite well with cache components such as Redis, so users see little impact on their tasks while being able to consume and interact with several systems at the same time from a single point.
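The caching idea can be sketched as a cache-aside lookup. To keep the example self-contained it uses an in-memory dictionary as a stand-in for Redis, and the loader passed in stands for whatever slow data lake query sits behind it; `CacheAside` and its parameters are assumptions made for illustration.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CacheAside:
    """Serve repeated lookups from a cache; hit the slow source on a miss.

    The dict below is a stand-in for a real cache such as Redis.
    """

    def __init__(self, loader: Callable[[str], Any],
                 ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader          # slow query against the lake
        self._ttl = ttl_seconds        # how long a cached answer stays valid
        self._clock = clock            # injectable so expiry can be tested
        self._store: Dict[str, Tuple[float, Any]] = {}
        self.misses = 0

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]            # cache hit, still fresh
        self.misses += 1
        value = self._loader(key)      # cache miss: ask the slow source
        self._store[key] = (now + self._ttl, value)
        return value
```

The expiry time is the trade-off: a longer one spares the lake more queries, while a shorter one keeps the answers closer to the live data.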
It is a technology that honestly displeased me a lot 2 years ago, especially given where I learned about it. Now I realize its potential and, above all, how useful it is for creating new digital business ideas in this new world that is so dynamic and full of information. Today we can tell you that we have completed more than 2 Data Lake implementations, and the results are fantastic for the end user.