A data lake is a logical starting point for becoming data-driven in an agile way. In this post we will look at a Hadoop-based data lake, its high-level architecture, and its components.
A data lake is a centralized repository for all data collected from a variety of sources. It:
- Stores both relational and non-relational data in raw form, at the lowest granularity.
- Retains data for as long as required.
- Serves as a staging layer for further structured and unstructured analysis.
The following is a high-level picture of what a data lake might look like and some of the technologies involved.
Raw Data Ingestion
This layer extracts data from various conventional and unstructured data sources, in batch or streaming mode. Data ingested from any of these sources is first stored in raw form in HDFS. Strictly speaking, this raw staged data is the data lake: it can serve as the source for any subsequent analysis.
Tools like Flume, Kafka, Filebeat, and Logstash can be used to forward data from data sources into HDFS.
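To make the landing step concrete, here is a minimal sketch of a raw-zone writer. It is an illustration only: the local filesystem stands in for HDFS, and the `land_event` function, the `/tmp/datalake/raw` path, and the `dt=` partition convention are all hypothetical choices; in a real cluster a Flume or Kafka sink would do this writing.

```python
import json
import os
from datetime import datetime, timezone

# Hypothetical raw-zone root; on a real cluster this would be an HDFS path
# written by Flume/Kafka sinks, not the local filesystem.
RAW_ROOT = "/tmp/datalake/raw"

def land_event(source: str, event: dict) -> str:
    """Append one event, untouched, into a date-partitioned landing path."""
    ts = datetime.now(timezone.utc)
    part_dir = os.path.join(RAW_ROOT, source, ts.strftime("dt=%Y-%m-%d"))
    os.makedirs(part_dir, exist_ok=True)
    path = os.path.join(part_dir, "events.jsonl")
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")  # raw form: no transformation yet
    return path

path = land_event("web_clicks", {"user": "u1", "page": "/home"})
print(path)
```

The point of the sketch is that ingestion stores events verbatim; schema and cleanup are deferred to the processing layer.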
Data Processing / Analytics
Raw data is deduplicated, transformed, indexed, and stored in a queryable format. Data may be partitioned by time interval to reduce the amount of data scanned at query time.
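As a toy sketch of the deduplicate-then-partition step (the `event_id` and `dt` fields are assumed, not prescribed by any of the tools above):

```python
# Toy raw events: duplicates share the same event_id.
raw = [
    {"event_id": 1, "dt": "2023-05-01", "amount": 10},
    {"event_id": 1, "dt": "2023-05-01", "amount": 10},  # duplicate
    {"event_id": 2, "dt": "2023-05-02", "amount": 25},
]

def dedupe_and_partition(events):
    """Drop repeated event_ids, then bucket records by their date partition."""
    seen, partitions = set(), {}
    for e in events:
        if e["event_id"] in seen:
            continue
        seen.add(e["event_id"])
        partitions.setdefault(e["dt"], []).append(e)
    return partitions

parts = dedupe_and_partition(raw)
print(sorted(parts))             # one bucket per day
print(len(parts["2023-05-01"]))  # duplicate removed
```

A query for a single day now only has to touch that day's bucket, which is the same idea Hive-style `dt=` partitions implement on disk.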
Tools like Pig, Presto, Spark, and Storm can be used to process batch or streaming data. NoSQL databases like MongoDB, or storage engines like Solr, can also be used for data analytics or indexing.
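Engines like Presto or Spark exploit the time partitioning above through partition pruning: a query filtered on the partition column never reads the other partitions. A library-free sketch of the same idea (the directory layout, `query_day` function, and field names are hypothetical):

```python
import glob
import json
import os

ROOT = "/tmp/datalake/demo"  # stand-in for an HDFS processed zone

# Seed two date partitions so the query below has something to prune.
seed = {
    "2023-05-01": [{"page": "/home"}],
    "2023-05-02": [{"page": "/buy"}, {"page": "/home"}],
}
for day, rows in seed.items():
    d = os.path.join(ROOT, "web_clicks", f"dt={day}")
    os.makedirs(d, exist_ok=True)
    with open(os.path.join(d, "events.jsonl"), "w") as f:
        f.writelines(json.dumps(r) + "\n" for r in rows)

def query_day(source: str, day: str):
    """Partition pruning: open only files under the requested dt= directory."""
    pattern = os.path.join(ROOT, source, f"dt={day}", "*.jsonl")
    rows = []
    for path in glob.glob(pattern):
        with open(path) as f:
            rows.extend(json.loads(line) for line in f)
    return rows

print(len(query_day("web_clicks", "2023-05-02")))
```

However many days of history the lake holds, a one-day query reads one directory.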
Processed data can be visualized to gain better insight. Visualization tools such as Zeppelin or D3/NVD3 can be used for this purpose. Existing BI tools can connect directly to Hadoop/Hive using JDBC/ODBC connectors.