There are many stream processing frameworks, such as Apache Storm, Apache Spark, Apache Samza, and Apache Flink, each with its own features. In this post, let's look at the features of Apache Flink.

[Image: Flink architecture]

Before we get started with the features, let's see what Flink is.

Flink: Apache Flink is a distributed streaming dataflow engine written in Java and Scala. Flink provides a high-throughput, low-latency streaming engine as well as support for event-time processing and state management. Flink applications are fault-tolerant in the event of machine failure and support exactly-once semantics. Programs can be written in Java, Scala, Python, and SQL and are automatically compiled and optimized into dataflow programs that are executed in a cluster or cloud environment.

Flink does not provide its own data storage system. Instead, it provides data source and sink connectors to systems such as Amazon Kinesis, Apache Kafka, HDFS, Apache Cassandra, and Elasticsearch.

Now let's look at the features of Flink, each with a short description:

Data Processing: Apache Flink provides a single runtime for both streaming and batch processing.

Streaming Engine: Apache Flink is a true streaming engine. It uses streams for all workloads: streaming, SQL, micro-batch, and batch. A batch is simply a finite set of streamed data.
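
To make the "batch is a finite stream" idea concrete, here is a minimal sketch in plain Python. It does not use the Flink API; the names `running_word_count` and `endless_source` are made up for illustration. The same operator runs unchanged over a bounded input and over an unbounded one.

```python
import itertools
from typing import Dict, Iterable, Iterator, Tuple


def running_word_count(words: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit an updated (word, count) pair for every incoming word."""
    counts: Dict[str, int] = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]


def endless_source() -> Iterator[str]:
    """Hypothetical unbounded source: repeats three words forever."""
    return itertools.cycle(["a", "b", "a"])


# Bounded input ("batch"): the stream ends, so the final counts can be collected.
batch_result = dict(running_word_count(["a", "b", "a"]))

# Unbounded input ("streaming"): same operator, but we can only observe a prefix.
stream_prefix = list(itertools.islice(running_word_count(endless_source()), 4))
```

The only difference between the two runs is whether the input ever ends, which is the point the feature description makes.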

Data Flow: Flink supports controlled cyclic dependency graphs at runtime. This lets it represent machine learning algorithms very efficiently.

Computation Model: Flink uses a continuous-flow, operator-based streaming model. A continuous-flow operator processes each record as it arrives, without waiting to collect the data into a batch first.
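
The difference between a continuous-flow operator and a micro-batch one can be sketched in a few lines of plain Python. This is only an illustration, not Flink code: the event tuples, the doubling "operator", and the one-second batch interval are all assumptions made for the example. Each function returns the time at which a result would become available.

```python
from typing import List, Tuple

Event = Tuple[float, int]  # (arrival time in seconds, value)


def continuous(events: List[Event]) -> List[Tuple[float, int]]:
    """Process each event at its arrival time; return (emit_time, result)."""
    return [(t, v * 2) for t, v in events]


def micro_batch(events: List[Event], interval: float) -> List[Tuple[float, int]]:
    """Buffer events and process them only when their batch interval closes."""
    out = []
    for t, v in events:
        # The result is not available until the end of the enclosing interval.
        batch_end = (int(t // interval) + 1) * interval
        out.append((batch_end, v * 2))
    return out


events = [(0.1, 1), (0.4, 2), (1.2, 3)]
continuous_out = continuous(events)
micro_batch_out = micro_batch(events, interval=1.0)
```

In the continuous case the results carry the arrival times 0.1, 0.4, and 1.2, while the micro-batch results only appear at the interval boundaries 1.0 and 2.0.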

Performance: Flink is built for high throughput and low latency. It uses native closed-loop iteration operators, which make machine learning and graph processing faster.

Memory management: Flink provides automatic memory management through its own memory management system, separate from Java’s garbage collector.

Fault tolerance: Flink's fault tolerance mechanism is based on Chandy-Lamport distributed snapshots. The mechanism is lightweight, so it maintains high throughput rates and provides strong consistency guarantees at the same time.
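
The core idea behind barrier-based snapshots can be shown with a toy, single-operator sketch in plain Python. This is a heavy simplification and not Flink's implementation: the `Barrier` and `SummingOperator` classes are hypothetical, and a real job has many operators and must align barriers across several input channels. Here a barrier marker travels in the stream alongside the records, and the operator copies its state when the barrier reaches it, without stopping the stream.

```python
from typing import Dict, Union


class Barrier:
    """Marker injected into the stream to delimit a checkpoint (hypothetical)."""

    def __init__(self, checkpoint_id: int) -> None:
        self.checkpoint_id = checkpoint_id


class SummingOperator:
    """Stateful operator that keeps a running sum of the records it sees."""

    def __init__(self) -> None:
        self.total = 0
        self.snapshots: Dict[int, int] = {}

    def process(self, element: Union[int, Barrier]) -> None:
        if isinstance(element, Barrier):
            # Snapshot covers exactly the records that arrived before the barrier.
            self.snapshots[element.checkpoint_id] = self.total
        else:
            self.total += element


op = SummingOperator()
for element in [1, 2, Barrier(1), 3, Barrier(2), 4]:
    op.process(element)
```

After the loop, checkpoint 1 holds the sum of the records before the first barrier (3), checkpoint 2 holds 6, and processing has continued up to a running total of 10.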

Scalability: Apache Flink is also highly scalable; you can keep adding nodes to the cluster.

Iterative Processing: Flink iterates over data using its streaming architecture. It can be instructed to process only the parts of the data that have actually changed, which significantly increases the performance of the job.
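
Processing only what changed is often illustrated with connected components on a graph. The sketch below is plain Python rather than Flink's iteration API, and the function and variable names are assumptions chosen for the example. It keeps a "workset" of vertices whose label changed in the previous round and touches only those in the next round.

```python
from typing import Dict, Iterable, List, Set, Tuple


def connected_components(
    vertices: Iterable[int], edges: List[Tuple[int, int]]
) -> Dict[int, int]:
    """Label every vertex with the smallest vertex id in its component."""
    vertices = list(vertices)
    neighbors: Dict[int, Set[int]] = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    label = {v: v for v in vertices}  # current solution
    workset = set(vertices)           # vertices whose label changed last round

    while workset:
        next_workset: Set[int] = set()
        for v in workset:             # unchanged vertices are never revisited
            for n in neighbors[v]:
                if label[v] < label[n]:
                    label[n] = label[v]
                    next_workset.add(n)
        workset = next_workset
    return label


labels = connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
```

As the labels settle, the workset shrinks, so later rounds do far less work than a full pass over the graph would.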

Recovery: Flink supports a checkpointing mechanism that stores the program's position in the data sources and sinks, the state of windows, and user-defined state, and uses them to recover a streaming job after a failure.
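
Here is a minimal sketch of recovery from a checkpoint, again in plain Python and not tied to the Flink API. The `run_job` function, the `SimulatedFailure` exception, and the checkpoint dictionary layout are all hypothetical. The checkpoint stores the source offset together with the operator state, so after a crash the job resumes from the last checkpoint and replays only the records after it.

```python
from typing import Dict, List, Optional


class SimulatedFailure(Exception):
    """Stands in for a machine failure in this example."""


def run_job(
    records: List[int],
    checkpoint: Dict[str, int],
    checkpoint_every: int = 2,
    fail_at: Optional[int] = None,
) -> int:
    """Resume from `checkpoint`, sum the remaining records, keep checkpointing.

    `checkpoint` is {"offset": ..., "total": ...} and is updated in place.
    """
    offset, total = checkpoint["offset"], checkpoint["total"]
    while offset < len(records):
        if offset == fail_at:
            raise SimulatedFailure(f"crash before record {offset}")
        total += records[offset]
        offset += 1
        if offset % checkpoint_every == 0:
            # Source position and state are saved together, as one unit.
            checkpoint["offset"], checkpoint["total"] = offset, total
    return total


records = [10, 20, 30, 40, 50]
checkpoint = {"offset": 0, "total": 0}
try:
    run_job(records, checkpoint, fail_at=3)
except SimulatedFailure:
    pass  # progress made after the last checkpoint is lost

state_after_crash = dict(checkpoint)
recovered_total = run_job(records, checkpoint)  # resumes from offset 2
```

Because the offset and the running total are restored together, the third record is replayed exactly once into the restored state, and the final result matches a run with no failure.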

Hadoop Compatibility: Apache Flink is a scalable data analytics framework that is compatible with Hadoop. It provides a Hadoop Compatibility package to wrap functions implemented against Hadoop’s MapReduce interfaces and embed them in Flink programs.

Caching: It can cache data in memory for further iterations to enhance its performance.

Machine Learning: Flink has FlinkML, a machine learning library for Flink. Because Flink supports controlled cyclic dependency graphs at runtime, it can represent ML algorithms more efficiently than a DAG representation can.

SQL Support: Flink's Table API is a SQL-like expression language with a data-frame-style DSL. Flink also accepts plain SQL queries, which run on the same runtime as Table API programs.

If I have missed any features, please mention them.
