Enhancing stream reasoning by modeling the importance of the streaming data

Yan, Rui
Electronic thesis
Computer science
In general terms, this dissertation delivers a conceptual model and supporting infrastructure that facilitate its application in stream reasoning. The first contribution is a novel notion of semantic importance. It is formalized in an ontology, represented as a priority vector, and works with carefully extended window semantics. The second contribution is a general sequential stream reasoning architecture, which both shows how semantic importance can be used in stream reasoning systems and provides pragmatic performance metrics for configuring such systems at different scales. Two exemplar real-world use cases are implemented and evaluated on top of this architecture and semantic importance. The third contribution is a generalization and benchmark framework for semantic importance, focused on how to reuse and benchmark semantic importance in a generic and quantitative way. Semantic importance is generalized by connecting it to state-of-the-art stream reasoning techniques. The framework also provides a benchmark interface compatible with a wide range of continuous queries, ontologies, and data streams, together with a set of built-in data-aware window management strategies enabled by semantic importance. The key performance indicators recorded by the benchmark include precision, response time, memory consumption, and throughput. The results are analyzed and visualized to support decisions on how to compose and deploy suitable semantic importance in real use cases.
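Stream reasoning is a new research area that aims to integrate reasoning into stream data processing, so that not only surface-level information but also deeper, hidden information can be extracted from data streams. The research questions of this dissertation center on modeling the importance of streaming data. In general, windows are a widely used method for processing streaming data. A window can observe only a finite number of data items, and the information these items provide is often very limited, so query results are not necessarily accurate. For example, if the distance between two data items is greater than the window length, and both items are required to answer a query, then the query will clearly produce a false negative: an answer should exist, yet the query returns none. This problem can be addressed by increasing the window length: as long as the window is long enough, both required items are guaranteed to be in the window, and the correct answer is returned. In practice, however, this approach is not feasible. In general, the distance between two data items cannot be known in advance in order to set the window size, and it may not even be known how many necessary data items the current window contains. Blindly increasing the window length only increases the amount of data in the window, which adds to the pressure on system response.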
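The false-negative scenario described above can be illustrated with a minimal sketch. It assumes a count-based FIFO window and a toy query that is answerable only when all required items are in the window at the same time; the function name `fifo_join` and the sample stream are hypothetical and are not taken from the dissertation.

```python
from collections import deque


def fifo_join(stream, window_size, needed):
    """Return True if all items in `needed` are ever in the window together.

    Hypothetical helper: the window is count-based and strictly FIFO.
    """
    # A deque with maxlen silently drops the oldest item when it is full.
    window = deque(maxlen=window_size)
    for item in stream:
        window.append(item)
        if needed.issubset(window):
            return True
    return False


# "a" and "b" are both required, but they arrive four positions apart.
stream = ["a", "x", "x", "x", "b"]
needed = {"a", "b"}

print(fifo_join(stream, 3, needed))  # False: "a" was evicted before "b" arrived
print(fifo_join(stream, 5, needed))  # True: the window is long enough to hold both
```

Enlarging the window fixes this particular stream, but only because the gap between the two required items is known here; on a live stream that gap is not known in advance, and a larger window also holds more irrelevant items.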
Streaming data intrinsically has many different orderings, such as temporality, precision, provenance, and trust. If these diverse orderings can be used to model data importance, stream reasoning can benefit by becoming data-discriminative: a system that understands the concept of importance can identify and leverage the more important data that are crucial to query answering, which can improve system performance. The notion that models data importance is named semantic importance. It is an umbrella concept with multiple branches, each of which models one aspect of the currently included data orderings. Combinations of different branches describe the data importance and enable smart, flexible window management strategies in a space previously dominated and limited by FIFO.
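As a rough illustration of importance-aware window management, the sketch below combines per-ordering scores with a weight vector and evicts the least important item instead of the oldest. Everything here is an assumption made for illustration: the `Item` fields, the weighted-sum scoring, the class name `ImportanceWindow`, and the fixed scores are hypothetical and do not reproduce the dissertation's ontology or window semantics.

```python
from dataclasses import dataclass


@dataclass
class Item:
    payload: str
    timestamp: int
    scores: tuple  # one score per ordering, e.g. (recency, trust), each in [0, 1]


def importance(item, priority_vector):
    """Weighted sum of the per-ordering scores (an assumed combination rule)."""
    return sum(w * s for w, s in zip(priority_vector, item.scores))


class ImportanceWindow:
    """Fixed-capacity window that evicts the least important item."""

    def __init__(self, capacity, priority_vector):
        self.capacity = capacity
        self.priority_vector = priority_vector
        self.items = []

    def add(self, item):
        """Admit `item`; return the evicted item, or None if nothing was evicted."""
        self.items.append(item)
        if len(self.items) <= self.capacity:
            return None
        # Lowest importance goes first; ties are broken by oldest timestamp.
        victim = min(
            self.items,
            key=lambda it: (importance(it, self.priority_vector), it.timestamp),
        )
        self.items.remove(victim)
        return victim


# Weights favour trust (0.8) over recency (0.2).
window = ImportanceWindow(capacity=2, priority_vector=(0.2, 0.8))
window.add(Item("a", 0, (0.0, 1.0)))  # old but highly trusted
window.add(Item("b", 1, (0.5, 0.1)))  # newer but barely trusted
evicted = window.add(Item("c", 2, (1.0, 0.5)))

print(evicted.payload)                    # "b"; a FIFO window would have evicted "a"
print([it.payload for it in window.items])
```

For simplicity the scores are fixed when an item is created; a real system would presumably need to update time-dependent scores such as recency as the stream advances.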
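In fact, the window management strategy of the vast majority of stream processing systems is essentially first in, first out (FIFO): the data that entered the window first must leave first to make room for the newest data. The FIFO strategy decides when to remove or admit data based solely on timestamps, and rests on a "default conventional assumption": the older the data, the less important it is. In response to this assumption, this dissertation first shows that, in some situations, managing data under FIFO causes two problems, "early eviction" and "early expiration", and points out that a management strategy based only on timestamps cannot effectively improve the performance of stream reasoning systems. Second, this dissertation proposes a novel concept, "semantic importance", to model the importance of streaming data. Its purpose is to manage data according to an importance ranking of the streaming data, thereby shortening the response time of stream reasoning systems, reducing memory consumption, increasing throughput, and improving accuracy. Third, to validate semantic importance and to provide an example system that deploys it, this dissertation proposes a sequential stream reasoning architecture and implements two real-world use cases. The results show that semantic importance can enhance stream reasoning systems reasonably well. Finally, this dissertation proposes a generalization and benchmarking tool for semantic importance (SIGenBench), which aims to further generalize the concept and to quantify its benefit for four performance indicators of stream reasoning systems (accuracy, memory consumption, response time, and throughput).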
Streaming data is boundless, enormous, and heterogeneous, which, in addition to temporal constraints, adds extra dimensions to the challenge of realizing the vision of stream reasoning. A widely adopted way to process streams is to use a window that isolates the latest portion of the stream. This snapshot, mostly managed by the first in, first out (FIFO) strategy under a popular silent assumption that the latest data is the most important, is all that a window can know about the stream, so it inevitably provides only limited information during processing. However, the importance of data does not necessarily depend on arrival timestamps alone. If the latest data does not convey the information needed to answer the query, there is no need to do anything with it other than evict it.
The demand for extracting hidden information from data streams is rising. However, traditional stream processing systems cannot meet this demand, as they were not designed to do so. This has given birth to the new research domain of stream reasoning, which aims to bring semantic reasoning into stream processing. An example is predicting a highway traffic jam from explicit sensor data streams reporting the number and speed of cars. It is very easy for humans to observe the traffic and then forecast congestion, because humans know that more cars and lower speeds usually lead to a traffic jam. Unfortunately, machines do not. What they can "see" is probably just a sequence of numbers separated by commas.
Issue Date: May 2018
School of Science
Rensselaer Polytechnic Institute, Troy, NY