Today, we have a variety of data sources: social media, sensors, databases, data warehouses, file systems, logs, and web/application servers. The data generated from these sources is needed for processing, dashboards, predictions, or later use, and there are many data ingestion tools available to move it. In this post, let us look at what data ingestion is and at a list of data ingestion tools.

Data Ingestion:

Data ingestion is the process of importing, transferring, loading, and processing data for later use or storage in a database. It involves collecting data from multiple sources and detecting changes in that data (change data capture, or CDC). Data ingestion can be either real time or batch. In real-time ingestion, each event is imported as it is emitted by the source. In batch ingestion, data items are imported in discrete chunks at periodic intervals of time.
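To make the distinction concrete, here is a minimal Python sketch, not tied to any of the tools below, that contrasts the two modes for a hypothetical log file named app.log (the file name and polling interval are assumptions for the example): the batch function imports everything accumulated so far in one chunk, while the real-time function tails the file and yields each new event as it arrives.

```python
import time
from pathlib import Path

LOG_FILE = Path("app.log")  # hypothetical source file


def ingest_batch():
    """Batch mode: import everything accumulated so far as one discrete chunk."""
    if not LOG_FILE.exists():
        return []
    with LOG_FILE.open() as f:
        return [line.rstrip("\n") for line in f]


def ingest_realtime(poll_seconds=1.0):
    """Real-time mode: yield each new event as it is appended to the file."""
    with LOG_FILE.open() as f:
        f.seek(0, 2)  # start at the end; only new events are picked up
        while True:
            line = f.readline()
            if line:
                yield line.rstrip("\n")
            else:
                time.sleep(poll_seconds)


if __name__ == "__main__":
    print("batch import:", ingest_batch())
    # for event in ingest_realtime():
    #     print("streamed event:", event)
```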

Data Ingestion Tools:

Apache Flume: Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralised data store. A Flume agent consists of three core components: Source, Channel, and Sink.

  • Source: An interface implementation that can consume events delivered to it via a specific mechanism.
  • Channel: A transient store for events, where events are delivered to the channel via sources operating within the agent. An event put in a channel stays in that channel until a sink removes it for further transport.
  • Sink: An interface implementation that can remove events from a channel and transmit them to the next agent in the flow, or to the event’s final destination. Sinks that transmit the event to its final destination are also known as terminal sinks.

Flume is quite popular in the Hadoop ecosystem and is typically used for real-time log ingestion.
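Flume itself is configured through agent property files rather than code, but the source → channel → sink pattern is easy to picture. The following Python sketch is purely illustrative (none of these classes are part of Flume): a source hands events to a channel, the channel buffers them, and a sink drains them to a destination.

```python
from queue import Queue


class Source:
    """Illustrative source: consumes events from some external mechanism."""
    def __init__(self, channel):
        self.channel = channel

    def put(self, event):
        self.channel.put(event)  # hand the event over to the channel


class Channel(Queue):
    """Illustrative channel: a transient buffer between source and sink."""
    pass


class Sink:
    """Illustrative sink: removes events from the channel and delivers them."""
    def __init__(self, channel, destination):
        self.channel = channel
        self.destination = destination

    def drain(self):
        while not self.channel.empty():
            event = self.channel.get()        # the event leaves the channel...
            self.destination.append(event)    # ...only when the sink removes it


# Wire up a toy agent: source -> channel -> sink -> destination store.
store = []
channel = Channel()
source = Source(channel)
sink = Sink(channel, store)

source.put({"level": "INFO", "msg": "user logged in"})
sink.drain()
print(store)
```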

Apache Kafka: Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design. A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.

Kafka is a very popular data ingestion tool, with a wide variety of use cases ranging from microservices to IoT. It is a highly scalable, real-time framework that can handle millions of messages per second; published performance benchmarks are available online.
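As a quick illustration of how data lands in Kafka, here is a minimal producer/consumer sketch using the third-party kafka-python client (one of several available clients); the broker address and topic name are assumptions for the example.

```python
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

BROKER = "localhost:9092"   # assumed broker address
TOPIC = "ingest-events"     # assumed topic name

# Produce: push a few events into the topic.
producer = KafkaProducer(bootstrap_servers=BROKER)
for i in range(3):
    producer.send(TOPIC, value=f"event-{i}".encode("utf-8"))
producer.flush()

# Consume: read the events back from the beginning of the topic.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.topic, message.offset, message.value)
```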

Apache Sqoop: Sqoop is a tool designed to transfer bulk data between Hadoop and relational databases. It imports from and exports to relational databases, enterprise data warehouses, and NoSQL databases, and populates tables in HDFS, Hive, and HBase. It uses MapReduce to import and export the data.

This is a batch ingestion framework.
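Sqoop is driven from the command line rather than through a library API. A typical table import, sketched here from Python via subprocess so the example stays self-contained, might look like the following; the JDBC URL, credentials, table name, and target directory are all hypothetical.

```python
import subprocess

# Hypothetical connection details; replace with your own.
sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/sales",  # source database
    "--username", "etl_user",
    "--table", "orders",                            # table to import
    "--target-dir", "/data/raw/orders",             # HDFS destination
    "--num-mappers", "4",                           # parallel map tasks
]

# Each mapper runs as a MapReduce task that copies a slice of the table.
subprocess.run(sqoop_import, check=True)
```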

Apache NiFi: NiFi is a real-time integrated data logistics and simple event processing platform. Apache NiFi automates the movement of data between disparate data sources and systems, making data ingestion fast, easy, and secure. NiFi allows you to trace your data in real time, just like you could trace a delivery.

StreamSets: StreamSets Data Collector is an enterprise-grade, open source, continuous big data ingestion infrastructure. It has an advanced and easy-to-use user interface that lets data scientists, developers, and data infrastructure teams create data pipelines in a fraction of the time typically required for complex ingest scenarios. StreamSets Data Collector reads from and writes to a large number of endpoints, including S3, JDBC, Hadoop, Kafka, Cassandra, and many others. You can use Python, JavaScript, and Java Expression Language, in addition to a large number of pre-built stages, to transform and process the data on the fly. For fault tolerance and scale-out, you can set up data pipelines in cluster mode and perform fine-grained monitoring at every stage of the pipeline.

Filebeat: Filebeat is a lightweight shipper for forwarding and centralizing log data. Installed as an agent on your servers, Filebeat monitors the log files or locations that you specify, collects log events, and forwards them to either Elasticsearch or Logstash for indexing.

Logstash: Logstash is part of the Elastic Stack along with Beats, Elasticsearch and Kibana. Logstash is a server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to the destination.
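Logstash itself is configured with its own pipeline configuration files, but applications can push events to it over the network. As a rough sketch, assuming a Logstash pipeline has been set up with a tcp input and a json_lines codec listening on port 5000 (both assumptions for this example), a Python application could send an event like this:

```python
import json
import socket

LOGSTASH_HOST = "localhost"  # assumed Logstash host
LOGSTASH_PORT = 5000         # assumed port of a tcp input with json_lines codec

event = {"service": "checkout", "level": "ERROR", "message": "payment timeout"}

# json_lines expects one JSON document per line, so terminate with a newline.
with socket.create_connection((LOGSTASH_HOST, LOGSTASH_PORT)) as sock:
    sock.sendall((json.dumps(event) + "\n").encode("utf-8"))
```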

Apache Gobblin: Apache Gobblin is a universal data ingestion framework for extracting, transforming, and loading large volumes of data from a variety of data sources, e.g., databases, REST APIs, FTP/SFTP servers, filers, etc., onto Hadoop. Apache Gobblin handles the common routine tasks required for all data ingestion ETLs, including job/task scheduling, task partitioning, error handling, state management, data quality checking, data publishing, etc. Gobblin ingests data from different data sources in the same execution framework and manages the metadata of different sources all in one place. This, combined with other features such as auto scalability, fault tolerance, data quality assurance, extensibility, and the ability to handle data model evolution, makes Gobblin an easy-to-use, self-serving, and efficient data ingestion framework.

Suro: Suro is a data pipeline service for collecting, aggregating, and dispatching large volumes of application events, including log data. It has the following features:

  • It is distributed and can be horizontally scaled.
  • It supports streaming data flow, large number of connections, and high throughput.
  • It allows dynamically dispatching events to different locations with flexible dispatching rules.
  • It has a simple and flexible architecture to allow users to add additional data destinations.
  • It fits well into the NetflixOSS ecosystem.
  • It is a best-effort data pipeline with support for flexible retries and store-and-forward to minimize message loss.

Apache Chukwa: Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.

If there are any other data ingestion tools worth mentioning, please share them in the comment section.

