On 6 June 2019, the paper “D2IA: Stream Analytics on User-Defined Event Intervals” by Ahmed Awad, Riccardo Tommasini, Mahmoud Kamel, Emanuele Della Valle and Sherif Sakr won the best paper award at the 31st International Conference on Advanced Information Systems Engineering (CAiSE’19). CAiSE is one of the oldest information systems conferences and is ranked top tier (Tier A).
- Ahmed Awad, Senior Research Fellow of Big Data at the University of Tartu
- Mahmoud Kamel, MSc student at the University of Tartu
- Sherif Sakr, Research Professor of Big Data at the University of Tartu
Nowadays, data are generated at unprecedented rates and in a wide variety of formats. The traditional data analytics lifecycle, which starts with an offline collection of data for analysis, therefore does not fit many of today’s scenarios and applications, such as fraud detection, smart transportation and plant monitoring. In these settings, data have to be analyzed as they are generated, or with a very short delay. This is known as data stream analytics.
In general, data stream analytics can be divided into pattern matching, which relies on complex event processing (CEP), and so-called continuous queries, which compute aggregates over predefined intervals of the data stream by means of windows. The two worlds have been somewhat disconnected. In the former case, the result consists of the constituent events that matched the pattern. In the latter, summaries (aggregates) are computed over windows that are mainly time-based, e.g., every 5 minutes.
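To make the contrast concrete, here is a minimal Python sketch of the two styles over the same small stream. The readings, the 300-second window width, the threshold and both function names are illustrative assumptions; this is plain Python, not the API of any CEP or stream processing engine.

```python
from collections import defaultdict

# Hypothetical stream of (timestamp_seconds, temperature) readings.
readings = [(0, 20.0), (120, 21.0), (290, 26.0), (310, 27.0), (400, 28.0), (650, 22.0)]


def tumbling_average(events, width):
    """Continuous-query style: one average per fixed time window of `width` seconds."""
    buckets = defaultdict(list)
    for ts, value in events:
        # Align each event to the start of the window it falls into.
        buckets[ts // width * width].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


def match_rising_pairs(events, threshold):
    """CEP style: return the constituent events of every match of the pattern
    'a reading above `threshold` directly followed by a higher reading'."""
    matches = []
    for first, second in zip(events, events[1:]):
        if first[1] > threshold and second[1] > first[1]:
            matches.append((first, second))
    return matches


window_averages = tumbling_average(readings, 300)    # keyed by window start time
pattern_matches = match_rising_pairs(readings, 25.0)  # list of matched event pairs
```

The windowed query produces a summary for every 5-minute slot regardless of what the data look like, while the pattern query returns only the raw events that took part in a match.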
In our paper, we have built two families of operators that bridge the gap between the two worlds. In other words, a user can define a query, e.g. an aggregate, on a data-driven window whose boundaries are determined by conditions defined by the user. This gives the user more expressiveness and, at the same time, incurs much less overhead on the processing system. User-defined windows are constructed internally by means of CEP patterns (this is how the two worlds are linked) and are defined as event intervals. That is, they have a starting time point and an ending time point. In addition, the stream elements that realize the pattern form the window (interval) content that is used to compute the aggregate query designed by the user.
With more than one D2IA query (pattern) defined, it is possible to introduce another level of analytics: studying the relationships among different intervals, which might come from different sources. For example, in a smart home setting, it is possible to detect the intervals in which the temperature rises above a certain threshold while the air conditioning has not started yet. To reason about interval relations, we fall back on Allen’s algebra for intervals. However, we optimize the computation by exploiting properties of these interval operators and by operating on maximal intervals.
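The smart home example can be sketched in a few lines of Python. This is only an illustration of the idea, not the D2IA operators or their ESPER and Flink implementations: the event shape, the `Interval` class, the function names and the sample data are all assumptions, an interval is assumed to end at the timestamp of its last matching event, and only one of the thirteen Allen relations ("during") is shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Event = Tuple[int, float]  # hypothetical event shape: (timestamp, value)


@dataclass(frozen=True)
class Interval:
    start: int                 # timestamp of the first event in the run
    end: int                   # timestamp of the last event in the run
    values: Tuple[float, ...]  # interval content used by the aggregate

    def average(self) -> float:
        return sum(self.values) / len(self.values)


def _close(run: List[Event]) -> Interval:
    return Interval(run[0][0], run[-1][0], tuple(v for _, v in run))


def maximal_intervals(events: List[Event], condition: Callable[[float], bool]) -> List[Interval]:
    """Group consecutive events that satisfy `condition` into maximal intervals."""
    intervals, run = [], []
    for ts, value in events:
        if condition(value):
            run.append((ts, value))
        elif run:
            intervals.append(_close(run))
            run = []
    if run:  # the stream ended while a run was still open
        intervals.append(_close(run))
    return intervals


def during(a: Interval, b: Interval) -> bool:
    """Allen's 'during' relation: a starts after b starts and ends before b ends."""
    return b.start < a.start and a.end < b.end


# Hypothetical sensor streams: temperature readings and AC state (0.0 = off, 1.0 = on).
temperature = [(0, 22.0), (10, 26.0), (20, 27.0), (30, 28.0), (40, 23.0), (50, 26.5), (60, 24.0)]
ac_state = [(0, 0.0), (5, 0.0), (35, 0.0), (45, 1.0), (55, 1.0), (65, 0.0)]

hot = maximal_intervals(temperature, lambda t: t > 25.0)
ac_off = maximal_intervals(ac_state, lambda s: s == 0.0)

# Hot intervals that lie entirely within a period in which the AC was off.
hot_without_ac = [h for h in hot for off in ac_off if during(h, off)]
```

Working on maximal intervals keeps the number of candidate pairs small: each run of consecutive matching events yields a single interval rather than one interval per sub-run.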
The figure below summarizes our contribution. The homogeneous interval event (HoIE) operator generates windows (intervals) that are based on atomic events of the same type, whereas the heterogeneous interval event operator works on events of different types. The resulting intervals can then be fed to the interval operator, which emits the matching events again as a stream.
We have realized our operators in two systems that are representative of centralized and distributed (scalable) stream processing, ESPER and Apache Flink respectively. The source code and the experimental setup are available in our GitHub repository.
The full text of the paper can be found here.