Data Lake
Lake of DATA? What??
As per Wikipedia, Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software.
You have probably heard of the Data Warehouse (DW) and the Data Mart (DM). Data Lake, however, is a newer term in the field of Big Data, and one you should know if you are thinking of pursuing a career in Business Intelligence. It is important to understand the difference between a Data Warehouse and a Data Lake because they serve different purposes.
A Data Warehouse holds structured data that has been processed and organized into a single schema. All analysis and reporting are then done on this cleansed data. But given present and future needs, even a data warehouse is not enough to store all the information. The reason is that it only holds data that has already been cleansed and modeled for known subject areas, so anything that does not fit the schema is left out. What if we want a storage system where we can put all the raw information and create meaningful data out of it when needed?
A Data Lake (DL) can be thought of as a pool of data where data is stored in its natural form. In technical terms, a Data Lake is a repository of all enterprise data, including source data in its raw format as well as transformed data used for tasks such as reporting, visualization, advanced analytics, and machine learning. By enterprise data, we mean that a Data Lake can include structured data from relational databases, semi-structured data (XML, HTML) and unstructured data (images, videos, audio, etc.). Data Lakes are mostly used by Data Scientists. One major advantage of a DL over a DW is that it is highly accessible and quick to update.
The term Data Lake often comes with a related term, Data Swamp. A Data Swamp is a highly unorganized and unmanaged Data Lake that is either inaccessible to its intended users or provides little value. Organizations often end up creating a Data Swamp while trying to create a Data Lake.
What makes a Data Lake turn into a Data Swamp?
1. Lack of Metadata
Metadata is often described as “data about data”, or more specifically, data that provides information about other data. Think of it as the hashtags we use in our Twitter or LinkedIn posts. In a similar manner, metadata in a data lake acts as a tagging system that makes data easy to search. A data swamp lacks metadata, which makes searching for data very difficult.
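To make the hashtag analogy concrete, here is a minimal sketch of a tag-based metadata catalog in Python. Everything in it (the CatalogEntry and MetadataCatalog classes, the file paths, the tag names) is made up for illustration and is not how any particular data lake product works; real catalogs store far richer metadata.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata describing one object stored in the lake (hypothetical schema)."""
    path: str
    source: str
    tags: set[str] = field(default_factory=set)


class MetadataCatalog:
    """A toy in-memory catalog that lets users search the lake by tag."""

    def __init__(self) -> None:
        self._entries: list[CatalogEntry] = []

    def register(self, path: str, source: str, tags: set[str]) -> None:
        # Lower-case the tags so "PII" and "pii" match the same search.
        self._entries.append(CatalogEntry(path, source, {t.lower() for t in tags}))

    def find_by_tag(self, tag: str) -> list[str]:
        # Return the paths of every object carrying the tag, in insertion order.
        return [e.path for e in self._entries if tag.lower() in e.tags]


catalog = MetadataCatalog()
catalog.register("raw/crm/customers.csv", "crm", {"customers", "PII"})
catalog.register("raw/web/clicks.json", "web", {"clickstream"})
catalog.register("raw/erp/invoices.xml", "erp", {"finance", "customers"})

customer_paths = catalog.find_by_tag("customers")
print(customer_paths)
```

Without the tags, the only way to find customer data would be to open every file and look, which is exactly the situation in a data swamp.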
2. Irrelevant Data
Gathering information from various sources often results in collecting data with no goal. A data lake can easily turn into a data swamp if an organization collects data without setting any parameters about the kinds of data it wants to gather and why. The result is a well-organized data lake turned into a data swamp, flooded with data that may never be needed.
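One simple way to picture "setting parameters" is a collection policy that is checked before anything is ingested. The sketch below is only an illustration: the COLLECTION_POLICY dictionary, the should_ingest function and the data kinds are all hypothetical.

```python
# Hypothetical collection policy: each accepted kind of data is listed
# together with the reason the organization wants to keep it.
COLLECTION_POLICY = {
    "orders": "revenue reporting",
    "clickstream": "product analytics",
}


def should_ingest(kind: str) -> bool:
    """Accept a dataset only if the policy names a purpose for its kind."""
    return kind in COLLECTION_POLICY


incoming = ["orders", "printer_logs", "clickstream"]
accepted = [kind for kind in incoming if should_ingest(kind)]
rejected = [kind for kind in incoming if not should_ingest(kind)]

print(accepted)
print(rejected)
```

Here printer_logs is turned away because nobody has stated why it should be collected. The point is not the code itself but the habit of asking "what kind of data is this, and why do we want it?" before it lands in the lake.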
3. Lack of Data Governance
Data Governance is very important from an organization's point of view. It defines how data should be treated, who should handle it, how long the company should retain it, where it should go, and so on. Excellent data governance and high data quality go hand in hand. A data swamp lacks data governance, which results in all the information being dumped in one place and puts a question mark on data quality. A lack of governance can also put an organization at risk when it comes to audits.
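Two of the governance questions above, who handles the data and how long it is retained, can be written down as explicit rules. The following is a rough sketch under assumed names: the GovernanceRule class, the RULES table, the zone names, the owners and the retention periods are all invented for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class GovernanceRule:
    """Who is responsible for a zone of the lake and how long its data is kept."""
    owner: str
    retention_days: int


# Hypothetical rules, one per zone of the lake.
RULES = {
    "raw/crm": GovernanceRule(owner="crm-team", retention_days=365),
    "raw/web": GovernanceRule(owner="web-team", retention_days=90),
}


def is_expired(zone: str, ingested_on: date, today: date) -> bool:
    """Return True when the data has outlived the retention period of its zone."""
    rule = RULES[zone]
    return today - ingested_on > timedelta(days=rule.retention_days)


today = date(2024, 6, 1)
web_expired = is_expired("raw/web", date(2024, 1, 1), today)
crm_expired = is_expired("raw/crm", date(2024, 1, 1), today)

print(web_expired)
print(crm_expired)
```

Data ingested on 1 January 2024 is 152 days old on 1 June 2024, so it has passed the 90-day limit of the web zone but not the 365-day limit of the CRM zone. Having such rules written down is also what lets an organization answer an auditor's questions.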
4. Lack of Data Cleaning Strategy
A data swamp suffers from an immense lack of data cleaning strategy. It might contain duplicate data as well as erroneous data. It is like dumping everything in one place and never bothering about what is there and what needs to be cleaned. Hence, it is very important for an organization to make, and stick to, plans for regularly cleaning its data.
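A regular cleaning job could look something like the sketch below, which drops exact duplicates and sets aside erroneous records. The clean function, the record fields (id, amount) and the rule for what counts as erroneous are assumptions made for this example; a real cleaning strategy depends on the data at hand.

```python
from __future__ import annotations


def clean(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into (kept, rejected), dropping exact duplicates."""
    seen = set()
    kept, rejected = [], []
    for record in records:
        # Assumed rule: a record is erroneous if it has no id
        # or carries a negative amount.
        if record.get("id") is None or record.get("amount", 0) < 0:
            rejected.append(record)
            continue
        key = (record["id"], record["amount"])
        if key in seen:
            continue  # exact duplicate of a record already kept
        seen.add(key)
        kept.append(record)
    return kept, rejected


records = [
    {"id": 1, "amount": 10.0},
    {"id": 1, "amount": 10.0},   # duplicate
    {"id": 2, "amount": -5.0},   # negative amount
    {"id": None, "amount": 3.0}, # missing id
    {"id": 3, "amount": 7.5},
]

kept, rejected = clean(records)
print([r["id"] for r in kept])
print(len(rejected))
```

Rejected records are returned rather than silently deleted, so someone can review them and fix the source that produced them.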