Flink系列(0)——准备篇(流处理基础) (3)

基于时间的滚动窗口

A tumbling windows assigner assigns each element to a window of a specified window size. Tumbling windows have a fixed size and do not overlap. For example, if you specify a tumbling window with a size of 5 minutes, the current window will be evaluated and a new window will be started every five minutes as illustrated by the following figure.

滚动窗口将事件分配到长度固定且互不重叠的桶中。

基于数量的(count-based)滚动窗口定义了在触发计算前需要集齐多少条事件;

基于时间的(time-based)滚动窗口定义了在桶中缓冲事件的时间间隔。如上图所示,基于时间间隔的滚动窗口将事件汇集到桶中,每隔一段时间(window size)触发一次计算。

滑动窗口

基于时间的滑动窗口

The sliding windows assigner assigns elements to windows of fixed length. Similar to a tumbling windows assigner, the size of the windows is configured by the window size parameter. An additional window slide parameter controls how frequently a sliding window is started. Hence, sliding windows can be overlapping if the slide is smaller than the window size. In this case elements are assigned to multiple windows.

滑动窗口将事件分配到大小固定且允许相互重叠的桶中,这意味着每个事件可能会同时属于多个桶。

For example, you could have windows of size 10 minutes that slides by 5 minutes. With this you get every 5 minutes a window that contains the events that arrived during the last 10 minutes as depicted by the following figure.

我们通过指定长度(size)和滑动间隔(slide)来定义滑动窗口。滑动间隔决定每隔多久生成一个新的桶。举个例子,SlidingTimeWindows(size = 10min, slide = 5min)表示的语义是每隔5分钟统计一次最近10分钟的数据。

会话窗口

会话窗口

The session windows assigner groups elements by sessions of activity. Session windows do not overlap and do not have a fixed start and end time, in contrast to tumbling windows and sliding windows. Instead a session window closes when it does not receive elements for a certain period of time, i.e., when a gap of inactivity occurred. A session window assigner can be configured with either a static session gap or with a session gap extractor function which defines how long the period of inactivity is. When this period expires, the current session closes and subsequent elements are assigned to a new session window.

会话窗口在一些实际场景中非常有用,这些场景既不适合用滚动窗口也不适合用滑动窗口。比如说,有一个应用要在线分析用户行为,在该应用中我们要把事件按照用户的同一活动或者会话来源进行分组。会话由发生在相邻时间内的一系列事件外加一段非活动时间组成。具体来说,用户浏览一连串新闻文章的交互过程可以看做是一个会话。由于会话长度并非预先定义好,而是和实际数据有关,所以无论是滚动还是滑动窗口都无法适用该场景。而我们需要一个窗口操作,能将属于同一会话的事件分配到相同桶中。会话窗口根据会话间隔(session gap)将事件分为不同的会话,该间隔值定义了会话在关闭前的非活动时间长度。

时间语义

相关文档:https://ci.apache.org/projects/flink/flink-docs-release-1.11/concepts/timely-stream-processing.html

上面介绍的几种窗口类型都要在生成结果前缓冲数据,不难发现时间成了应用中较为核心的要素。举例来说,算子在计算时是应该依赖事件实际发生的时间还是应用处理事件的时间呢?这需要根据具体的应用场景来分析。

流式应用中有两个不同概念的时间,即处理时间(Processing time)和事件时间(Event time)。如下图所示:

内容版权声明:除非注明,否则皆为本站原创文章。

转载注明出处:https://www.heiqu.com/wpywzg.html