At Enable, we see more and more companies in sectors such as energy, clean cooking or water filtration asking for data lake implementations. But what is all this buzz about? What problem are data lakes attempting to solve? How do you know if a data lake is right for your situation? Are data lakes always the best fit? What are the benefits and potential pitfalls of data lakes and how well do they work for SMEs in emerging markets? These are some of the questions we will try to answer in this article.
Before we explore data lakes in more detail, it is worth taking a moment to recognize why you should be thinking about data at all. Data can be the key to unlocking the potential of a company. Through the right use of it, businesses can gain insights that drive strategic decisions, detect fraud, or even get access to funding by providing investors with the information they require. Data is the key driver for success and growth for SMEs and big market players alike, and technology is what empowers effective use of it.
Now that we have established that being able to use and analyse data is vital for the success of organisations, the next question is: how do we achieve this? The answer is through effective data management, and the keystone of that management is an appropriate data infrastructure. Any decision about data infrastructure should be preceded by answering questions such as:
- What problems are we intending to address?
- What is the goal we are trying to achieve?
All too often we see companies asking for a data lake simply because everyone seems to be talking about it, without fully understanding what it actually is or what benefits it could bring. With this mindset, even if a data lake turns out to be the right choice, it will be hard for management to take full advantage of the solution.
What is a data lake?
A data lake is a type of architectural design for data storage. Put simply, it is used to store raw data collected from multiple sources across a company’s applications, and to access all of it from one place.
Contrary to some misconceptions floating around, implementing a data lake does not rule out a data warehouse. A data lake can in fact be used either alone or in combination with a data warehouse, depending on the specific needs of the company in question.
What are the potential advantages and risks for SMEs in emerging markets?
Speed of implementation and low data transformation
One of the biggest advantages of a data lake is the speed with which it can be implemented compared to more traditional data infrastructure solutions. Since it is essentially a data dump, very little transformation needs to be done. This is especially useful for fast-growing companies in emerging markets, whose needs often change rapidly. A data lake allows them to avoid locking themselves into a strictly predefined structure that is difficult to revise, while still being able to access all data as it comes in. This brings us to the next advantage, which is…
Access to all data from one place
One of the biggest headaches for the companies we work with is that their data is often stored in so many places, and maintained by so many different employees, that management struggles to find the relevant information when it is really needed. A data lake can automate the flow of data from databases, Google Sheets, applications and even Excel sheets. This means that C-level executives can access it all in close to real time from one interface, giving them a more complete view instantly, without even needing to know where the data is stored.
It also makes it faster and easier to show potential investors the data they are interested in, monitor whether the company is on track with its OKRs, implement commission-based plans and detect suspicious activity, especially when working with subcontractors.
Because both structured and unstructured data can be stored together with little transformation, a data lake is much easier to maintain. This is particularly important given the scarce IT and data resources available in emerging markets. You don’t want to worry about the complex transformations and deep analysis that a data warehouse implementation requires, especially if your company is just starting out or your IT team is not yet well established. We often see SMEs in African countries struggle to find highly skilled IT staff, or not prioritise hiring them. Going for a simpler data infrastructure solution such as a data lake can give them more time and flexibility as they grow and remove the burden of maintaining a complex solution, while still delivering plenty of benefits.
Easing the burden on application performance while being able to provide aggregated reports and dashboards to management and investors
Querying live applications can slow them down significantly, especially when dealing with low-bandwidth networks or applications that need to work offline. By using a data lake as the source for dashboards and reports, you remove this performance burden.
Whether you implement the solution on AWS, Google Cloud or Microsoft Azure, or even leverage Snowflake, the storage services used for data lakes are among the cheapest options available. Take, for example, this comparison: AWS Redshift, a classic data warehouse solution, costs at least $182 per month for the smallest 15GB machine (roughly $0.25 per hour), whilst storing the same amount of data on AWS S3, where your data lake could “sit”, would cost roughly $0.30 per month ($0.02 per GB per month). This makes the data lake storage roughly 600 times cheaper! Clearly, a data lake is a no-brainer if you’re currently on a tight budget but still want the benefits of not querying live data directly and of combining all data sources in one place.
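The comparison can be reproduced with a few lines of arithmetic. The sketch below only uses the monthly Redshift figure and the per-GB S3 price quoted in this article; the variable and function names are ours, and real cloud prices vary by region and change over time, so treat the result as indicative.

```python
# Figures quoted in the article; actual prices vary by region and over time.
REDSHIFT_MONTHLY_USD = 182.0   # smallest Redshift machine, per month
S3_USD_PER_GB_MONTH = 0.02     # approximate S3 storage price
DATA_SIZE_GB = 15              # amount of data being compared


def monthly_storage_cost(size_gb: float, usd_per_gb_month: float) -> float:
    """Monthly cost of storing size_gb at a flat per-GB price."""
    return size_gb * usd_per_gb_month


s3_monthly_usd = monthly_storage_cost(DATA_SIZE_GB, S3_USD_PER_GB_MONTH)
cost_ratio = REDSHIFT_MONTHLY_USD / s3_monthly_usd

print(f"S3: ${s3_monthly_usd:.2f}/month, Redshift: ${REDSHIFT_MONTHLY_USD:.2f}/month")
print(f"Redshift costs about {cost_ratio:.0f}x more")
```

Note that this compares storage only. Querying the data in the lake usually carries its own, separate costs.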
Another advantage of a data lake is that even though all data is stored in one place, it can be made highly secure through the use of security rules. These rules define who can access the different parts of the data lake at either a user or a group level. This way you have a clear overview of, and full control over, these permissions from a single place.
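To make the idea of user-level and group-level rules concrete, here is a minimal sketch. In practice these rules are configured in your cloud provider's access management tooling rather than written by hand; the `AccessRule` class, the folder prefixes and the user and group names below are all hypothetical and only illustrate the principle of denying access by default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRule:
    """Grants read access to everything under one prefix of the lake."""
    prefix: str
    allowed_groups: frozenset = frozenset()
    allowed_users: frozenset = frozenset()


# Hypothetical rules: one per area of the data lake.
RULES = [
    AccessRule("sales/", allowed_groups=frozenset({"management", "sales"})),
    AccessRule(
        "payroll/",
        allowed_groups=frozenset({"management"}),
        allowed_users=frozenset({"hr_lead"}),
    ),
]


def can_read(user: str, groups: set, path: str) -> bool:
    """Deny by default; allow only if a matching rule names the user or one of their groups."""
    for rule in RULES:
        if path.startswith(rule.prefix):
            if user in rule.allowed_users or groups & rule.allowed_groups:
                return True
    return False


print(can_read("amina", {"sales"}, "sales/2024/orders.csv"))      # sales staff, sales data
print(can_read("amina", {"sales"}, "payroll/2024/salaries.csv"))  # sales staff, payroll data
```

Keeping all rules in one list like this mirrors the benefit described above: permissions can be reviewed and changed in a single place.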
Simplifying data warehouse implementation
As you get more familiar with all the data you’re gathering, your requirements for Business Intelligence will develop, and you might find that the data you want to present now requires complex transformations before it can be used on a dashboard. If that is the case, don’t worry: you haven’t thrown away your money on a data lake implementation. Having all data already gathered in one place makes it much easier for data scientists to explore, and the data lake itself can even be used as a staging area for the data warehouse. The transformation pipelines can be connected directly to the data lake and load the data into the data warehouse, leaving the live data safe from heavy querying that could overload your databases and harm the performance of your applications.
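As a rough illustration of the staging-area idea, the sketch below reads a raw export as it might sit in the lake, casts the values to proper types and loads them into a typed table. The CSV content, the `payments` table and the `load_into_warehouse` function are hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export as it might sit in the data lake (untyped text, untouched).
RAW_PAYMENTS_CSV = """customer_id,amount,paid_on
C001,12.50,2024-01-03
C002,7.00,2024-01-03
C001,12.50,2024-02-03
"""


def load_into_warehouse(raw_csv: str, conn: sqlite3.Connection) -> None:
    """Read raw rows from the lake, cast types, and load them into a typed table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(customer_id TEXT, amount REAL, paid_on TEXT)"
    )
    rows = [
        (r["customer_id"], float(r["amount"]), r["paid_on"])
        for r in csv.DictReader(io.StringIO(raw_csv))
    ]
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for a real data warehouse
load_into_warehouse(RAW_PAYMENTS_CSV, warehouse)

# Dashboards query the warehouse, never the live application database.
payment_totals = warehouse.execute(
    "SELECT customer_id, SUM(amount) FROM payments "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()
print(payment_totals)
```

The live application is never touched in this flow: the pipeline reads only from the lake and writes only to the warehouse.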
So data lakes sound awesome, don’t they? Before we get too excited, let’s take a look at some of the drawbacks.
Incorrect use of raw data can lead to skewed results
The data in the data lake is raw, and different sources might need to be linked before meaningful conclusions can be drawn. The risk is that faulty analysis, whether due to incorrect queries or invalid assumptions, leads to incorrect conclusions. Analysing and linking data requires both technical skills and a good understanding of the data structure, and both need to be developed with and by the users throughout the implementation. These challenges can be minimised by predefining the most common queries (so that they can be run at the click of a button) or by scheduling specific reports.
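A predefined query can be as simple as a vetted function that users call by name instead of writing their own joins. The sketch below links two hypothetical sources, a customer list and a payment log, by `customer_id`; all records, field names and the `PREDEFINED_QUERIES` registry are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical raw records from two separate sources in the lake.
CUSTOMERS = [
    {"customer_id": "C001", "region": "North"},
    {"customer_id": "C002", "region": "South"},
]
PAYMENTS = [
    {"customer_id": "C001", "amount": 12.5},
    {"customer_id": "C002", "amount": 7.0},
    {"customer_id": "C001", "amount": 12.5},
    {"customer_id": "C999", "amount": 3.0},  # no matching customer record
]


def revenue_by_region(customers, payments):
    """Link payments to customers by customer_id and total the amounts per region.

    Payments with no matching customer are returned separately
    instead of being dropped silently.
    """
    region_of = {c["customer_id"]: c["region"] for c in customers}
    totals = defaultdict(float)
    unmatched = []
    for p in payments:
        region = region_of.get(p["customer_id"])
        if region is None:
            unmatched.append(p)
        else:
            totals[region] += p["amount"]
    return dict(totals), unmatched


# Users pick a vetted query by name rather than writing their own.
PREDEFINED_QUERIES = {"revenue_by_region": revenue_by_region}

region_totals, unmatched_payments = PREDEFINED_QUERIES["revenue_by_region"](
    CUSTOMERS, PAYMENTS
)
print(region_totals)
print(unmatched_payments)
```

Returning the unmatched payments explicitly is a deliberate choice: a query that silently ignores records it cannot link is exactly the kind of invalid assumption that skews results.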
Data quality risk
The data that flows into the data lake is not cleaned or processed in any way, which means its quality cannot be guaranteed. However, by browsing the data directly in the data lake it is possible to detect potential problems with the data, such as agents leaving key customer information empty or entering their own phone number for multiple customers.
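Checks like the two mentioned above are straightforward to automate once the data sits in one place. The following is a minimal sketch over made-up customer records; the field names and both helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical customer records as entered by field agents.
CUSTOMER_RECORDS = [
    {"customer_id": "C001", "name": "Amina", "phone": "0700000001", "agent": "A1"},
    {"customer_id": "C002", "name": "", "phone": "0700000009", "agent": "A2"},
    {"customer_id": "C003", "name": "Joseph", "phone": "0700000009", "agent": "A2"},
    {"customer_id": "C004", "name": "Grace", "phone": "0700000009", "agent": "A2"},
]


def find_missing(records, required_fields):
    """Return (customer_id, field) pairs where a required field is empty."""
    return [
        (r["customer_id"], f)
        for r in records
        for f in required_fields
        if not str(r.get(f) or "").strip()
    ]


def find_shared_phones(records, min_customers=2):
    """Return phone numbers registered against at least min_customers distinct customers."""
    customers_by_phone = defaultdict(set)
    for r in records:
        if r.get("phone"):
            customers_by_phone[r["phone"]].add(r["customer_id"])
    return {
        phone: sorted(ids)
        for phone, ids in customers_by_phone.items()
        if len(ids) >= min_customers
    }


missing_fields = find_missing(CUSTOMER_RECORDS, ["name", "phone"])
shared_phones = find_shared_phones(CUSTOMER_RECORDS)
print(missing_fields)
print(shared_phones)
```

A shared phone number is not proof of wrongdoing, as family members may share a handset, so results like these are best treated as prompts for follow-up rather than conclusions.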
The other side of the coin for security
Since all data is accessible from one place, special care needs to be taken when defining security policies so that sensitive information is protected from users who shouldn’t be allowed to access it.
We have seen companies in sectors such as paygo solar, water filtration franchises, solar water pump distribution, C&I solar and many more benefit hugely from successful data lake implementations. If this sounds like something that could benefit your company, or if you need help with any other data needs, please do get in touch with our Principal Data Consultant at firstname.lastname@example.org so that we can look into your specific case.