Flink Series (0): Preparation (Stream Processing Basics)

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams.

Apache Flink is a distributed, stateful stream processing engine.

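This post kicks off a series of study notes and summaries on Flink (https://flink.apache.org/). It is the preparation installment and mainly introduces the basic concepts of stream processing. Don't underestimate the theory: it helps a great deal with the material that follows.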

Many of the terms below come from the official Flink glossary: https://ci.apache.org/projects/flink/flink-docs-release-1.11/concepts/glossary.html

What is a data stream

An event is a statement about a change of the state of the domain modelled by the application. Events can be input and/or output of a stream or batch processing application.

Any kind of data is produced as a stream of events. Credit card transactions, sensor measurements, machine logs, or user interactions on a website or mobile application, all of these data are generated as a stream.

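Data stream: a potentially unbounded sequence of events. An event can be real-time monitoring data, a sensor measurement, a money transfer, a logistics update, an online shopping order, a user action in a UI, and so on.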

Streams are further divided into unbounded streams and bounded streams, as shown in the figure below:

Figure: unbounded streams vs. bounded streams

Unbounded streams have a start but no defined end. They do not terminate and provide data as it is generated.

Bounded streams have a defined start and end. Bounded streams can be processed by ingesting all data before performing any computations.
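To make the distinction concrete, here is a minimal sketch in plain Python (not Flink API). The function names `bounded_stream`, `unbounded_stream`, and `running_sum` are made up for illustration. It assumes a bounded stream can be modeled as a finite list and an unbounded stream as an endless generator.

```python
import itertools
from typing import Iterable, Iterator, List


def bounded_stream() -> List[int]:
    """A bounded stream: defined start and end, so it can be fully materialized."""
    return [3, 1, 2]


def unbounded_stream() -> Iterator[int]:
    """An unbounded stream: has a start (1) but never terminates."""
    return itertools.count(start=1)


def running_sum(events: Iterable[int]) -> Iterator[int]:
    """Emit an updated result per event instead of waiting for the end of input."""
    total = 0
    for value in events:
        total += value
        yield total


# Bounded: ingest everything first, then compute (sorting needs the full input).
batch_result = sorted(bounded_stream())

# Unbounded: results are produced as data arrives; we only look at the first five.
first_five = list(itertools.islice(running_sum(unbounded_stream()), 5))

print(batch_result)
print(first_five)
```

Calling `sorted` on `unbounded_stream()` would never return, which is why computations on unbounded streams have to be incremental.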

Operations on data streams

A stream processing engine usually provides a set of operations for ingesting (Source), transforming (Transformation), and emitting (Sink) data streams.

The figure below is borrowed from the official Flink 1.10 documentation (https://ci.apache.org/projects/flink/flink-docs-release-1.10/concepts/programming-model.html):

Dataflows

A logical graph is a directed graph where the nodes are Operators and the edges define input/output-relationships of the operators and correspond to data streams or data sets. A logical graph is created by submitting jobs from a Flink Application. Logical graphs are also often referred to as dataflow graphs.

Operator: Node of a Logical Graph. An Operator performs a certain operation, which is usually executed by a Function. Sources and Sinks are special Operators for data ingestion and data egress.

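Dataflow graph: describes how data flows between operations. A dataflow graph is usually a directed graph. Its nodes are called operators (also often called operations) and represent computation; its edges represent data dependencies. An operator takes data from its inputs, performs a computation on it, and emits data to its outputs for further processing. An operator with no input is called a source, and an operator with no output is called a sink. A dataflow graph must have at least one source and one sink.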
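The source and sink definitions above can be sketched with a plain adjacency map. This is an illustration only: the operator names and the helpers `find_sources`, `find_sinks`, and `is_valid_dataflow` are hypothetical, not part of Flink.

```python
from typing import Dict, List

# Hypothetical logical graph: operator name -> list of downstream operators.
graph: Dict[str, List[str]] = {
    "source": ["map"],
    "map": ["key_by"],
    "key_by": ["sink"],
    "sink": [],
}


def find_sources(g: Dict[str, List[str]]) -> List[str]:
    """Operators that never appear as a downstream target have no input."""
    have_input = {dst for targets in g.values() for dst in targets}
    return sorted(op for op in g if op not in have_input)


def find_sinks(g: Dict[str, List[str]]) -> List[str]:
    """Operators with an empty downstream list have no output."""
    return sorted(op for op, targets in g.items() if not targets)


def is_valid_dataflow(g: Dict[str, List[str]]) -> bool:
    """Check the rule stated above: at least one source and one sink."""
    return bool(find_sources(g)) and bool(find_sinks(g))


print(find_sources(graph), find_sinks(graph), is_valid_dataflow(graph))
```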

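A dataflow graph is called a logical graph because it expresses only the computation logic. In actual distributed execution, each operator may run as multiple parallel instances on different machines.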

Parallel Dataflows

Source, Sink, and Transformation

Sources are where your program reads its input from.

Data sinks consume DataStreams and forward them to files, sockets, external systems, or print them.

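Data ingestion (Source) and data egress (Sink) operations let the stream processing engine communicate with external systems. Data ingestion fetches raw data from an external source and converts it into a format suitable for subsequent processing by the engine. The operator that implements ingestion is called a Source; its data can come from sockets, files, Kafka, and so on. Data egress emits data from the engine in a form suitable for external storage. The operator responsible for egress is called a Sink; it can write to files, databases, Kafka, and so on.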
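The two roles described above can be sketched in plain Python. This is not Flink's connector API; the `source` and `sink` functions, the `sensor_id,value` line format, and the in-memory `stored` list standing in for a file or database are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

Reading = Tuple[str, float]


def source(raw_lines: Iterable[str]) -> Iterator[Reading]:
    """Ingest raw external data and convert it to the engine's internal format."""
    for line in raw_lines:
        sensor_id, value = line.strip().split(",")
        yield sensor_id, float(value)


def sink(readings: Iterable[Reading], storage: List[str]) -> None:
    """Emit internal records in a format suited to the external store."""
    for sensor_id, value in readings:
        storage.append(f"{sensor_id}={value:.1f}")


# Raw input as it might arrive from a socket or file (assumed format).
raw = ["sensor_1,21.5\n", "sensor_2,19.0\n"]

stored: List[str] = []  # in-memory stand-in for a file, database, or topic
sink(source(raw), stored)
print(stored)
```

Keeping the parsing in the source and the formatting in the sink means whatever sits between them only ever sees the internal `Reading` tuples.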

A Transformation is applied on one or more data streams or data sets and results in one or more output data streams or data sets. A transformation might change a data stream or data set on a per-record basis, but might also only change its partitioning or perform an aggregation.

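Between ingestion and egress there are often many transformation operations. A transformation processes each event individually: it reads events one by one, applies some transformation to each, and produces a new output stream. An operator can also take multiple input streams or produce multiple output streams at once, for example to split one stream or merge several.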
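As a rough sketch of per-event transformation, stream splitting, and stream merging, consider the plain Python below. The helper names `map_events`, `split`, and `merge` are hypothetical and not Flink operators. Two simplifications are assumed: `split` materializes lists, which only works on bounded input, and `merge` concatenates its inputs, whereas a real engine would interleave events in arrival order.

```python
import itertools
from typing import Callable, Iterable, Iterator, List, Tuple


def map_events(events: Iterable[int], fn: Callable[[int], int]) -> Iterator[int]:
    """Read events one by one and emit a transformed event for each."""
    for event in events:
        yield fn(event)


def split(
    events: Iterable[int], predicate: Callable[[int], bool]
) -> Tuple[List[int], List[int]]:
    """Split one stream into two outputs (bounded input assumed)."""
    matched: List[int] = []
    rest: List[int] = []
    for event in events:
        (matched if predicate(event) else rest).append(event)
    return matched, rest


def merge(*streams: Iterable[int]) -> Iterator[int]:
    """Merge several streams into one (simple concatenation in this sketch)."""
    return itertools.chain(*streams)


doubled = list(map_events([1, 2, 3, 4], lambda x: x * 2))
large, small = split(doubled, lambda x: x > 4)
merged = list(merge(small, large))
print(doubled, large, small, merged)
```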

How to understand "stateful"
