One goal of reusable elements is to handle different degrees of information availability when managing stream data in smart cities. Specifically, we focus on transportation data streams that come from buses, trams, bicycles, etc. When analyzing congestion, predicting travel times and delays, detecting anomalies, and performing trip planning in transportation systems, one often faces situations in which the recorded data lack information that is essential to completing these tasks.
Urban data is produced by a large array of sources and can be leveraged to detect disasters (e.g., car accidents), monitor special events (e.g., festivals), or improve city efficiency (e.g., provide delay predictions in public transportation). Since these issues are common to practically any medium-to-large city, any solution should be as flexible as possible with respect to the available data, so that it remains portable despite the heterogeneity of city ecosystems.
A major obstacle to a high degree of portability is the difference in the levels of available information, given the variety of data sources that are available in a particular city. As an example of a Smart City application, we focus on predicting congestion in cities, based on historical and current public transportation data. We aim at smooth portability of tools developed for one city to the data that is available in another. Our experience in the INSIGHT (http://www.insight-ict.eu/) and VaVeL European projects shows that deploying an algorithm developed in one city (Dublin in this case) in another (Warsaw) is at times impossible due to different levels of data availability. Ignoring information availability is therefore likely to jeopardize the generalization of Big Data event-based systems. Even across different areas of the same city we may find different levels of availability. For instance, some sensors (measuring traffic, capturing video, etc.) may be accessible in one suburb but not in another, due to either cost constraints or infrastructure failures.
REMI is a reusable elements framework that handles varying degrees of information availability by design, from two complementary angles: graceful degradation (GRADE) and data enrichment (DARE). In a nutshell, we develop reusable machine learning black boxes for mining and aggregating streaming data, either to infer missing data from available data, or to adapt the expected accuracy to the data that is available. Being able to reuse these off-the-shelf modules can considerably speed up the development of big data applications involving streaming data. In addition, this design supports fault tolerance of smart city data architectures, since it enables operation even when information sources become unavailable.
Reusable elements should be designed to adapt to various information levels, depending on the available data sources. Therefore, prior to algorithmic solutions, we consider the layering of possible information availability. This involves identifying the minimum data requirements, that is, the data that are necessary to provide a meaningful answer to the targeted problem and that can be safely assumed to be available in any deployment setting. On top of this basic layer, one can design levels of increased information availability, adding new data sources that may be less widely available or, alternatively, more expensive.
Figure 1 shows an implementation of the aforementioned information layering approach, as applied to our transportation use case. We assume that the basic information needed is a stream of tram GPS positions, without which it would be impossible to perform online spatio-temporal analysis. At the next layer, we add the road network, which is often publicly available. The third layer introduces the locations of tram stations, and finally, in the last layer, we consider the schedule (the planned arrival and departure times of trams at stations). Note that the data that is often available in Smart Cities corresponds to one of these four levels of information.
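The four layers described above can be sketched in code. The following is a minimal illustration, not REMI's actual implementation: the names `InfoLayer` and `information_level` are hypothetical, and the sketch assumes that layers are cumulative, so a layer only counts if every layer below it is also available.

```python
from enum import IntEnum
from typing import Iterable


class InfoLayer(IntEnum):
    """The four information layers of the tram use case (hypothetical names)."""
    GPS_STREAM = 1         # basic layer: stream of tram GPS positions
    ROAD_NETWORK = 2       # network of roads
    STATION_LOCATIONS = 3  # locations of tram stations
    SCHEDULE = 4           # planned arrival and departure times


def information_level(available: Iterable[InfoLayer]) -> int:
    """Return the highest level k such that layers 1..k are all available.

    Assumption of this sketch: layers are cumulative. Returns 0 when even
    the basic GPS layer is missing.
    """
    present = set(available)
    level = 0
    for layer in InfoLayer:  # iterates in definition order, 1 to 4
        if layer not in present:
            break
        level = int(layer)
    return level


if __name__ == "__main__":
    city_a = [InfoLayer.GPS_STREAM, InfoLayer.ROAD_NETWORK]
    print(information_level(city_a))
```

Under the cumulative assumption, a deployment that has GPS positions and station locations but no road network is treated as level 1; a less strict design could instead track each layer independently.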
Built on top of the information layering approach, REMI makes it possible to analyze congestion seamlessly at various information levels using (1) graceful degradation and/or (2) data enrichment.
The graceful degradation (GRADE) approach copes with the unavailability of some data sources by decreasing the accuracy of the output.
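One way to picture graceful degradation is as a dispatcher that falls back to a coarser analysis when higher layers are missing. The sketch below is an assumption-laden illustration, not the GRADE algorithm itself: `select_analyzer`, the registry layout, and the three placeholder analyses are all hypothetical.

```python
from typing import Callable, Dict

# An analyzer is represented here by a placeholder callable that returns
# a description of the output it would produce.
Analyzer = Callable[[], str]


def select_analyzer(level: int, analyzers: Dict[int, Analyzer]) -> Analyzer:
    """Pick the most demanding analyzer whose required level is satisfied.

    `analyzers` maps a required information level to an analyzer
    (hypothetical registry layout). Lower levels yield coarser output.
    """
    usable = [required for required in analyzers if required <= level]
    if not usable:
        raise ValueError("no analyzer can run with the available information")
    return analyzers[max(usable)]


# Hypothetical registry: the output becomes coarser as fewer layers exist.
ANALYZERS: Dict[int, Analyzer] = {
    1: lambda: "congestion per grid cell",
    2: lambda: "congestion per road segment",
    4: lambda: "delay per station against the schedule",
}

if __name__ == "__main__":
    # Level 3 has no dedicated analyzer, so the level-2 analysis is used.
    print(select_analyzer(3, ANALYZERS)())
```

The point of the design is that the application keeps producing an answer at every information level at or above the basic layer; only the granularity of that answer changes.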
Alternatively, data enrichment (DARE) infers the data of some missing layers from the layers that are available. Unlike GRADE, the error when analyzing congestion with DARE comes from the inherent inaccuracies of the inference (e.g., due to statistical errors) rather than from information that is unavailable at a given layer.
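As a toy illustration of data enrichment, the sketch below guesses the missing station layer from the GPS layer by looking for grid cells in which trams are repeatedly observed standing still. This heuristic, the function name `infer_stop_cells`, and all thresholds are assumptions made for illustration; the text does not specify how DARE performs its inference.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# One GPS reading: (latitude, longitude, speed in m/s).
Reading = Tuple[float, float, float]


def infer_stop_cells(
    readings: Iterable[Reading],
    cell_size: float = 0.001,  # grid cell size in degrees (assumed)
    max_speed: float = 1.0,    # at or below this speed a tram counts as stopped
    min_count: int = 3,        # stationary readings needed to accept a cell
) -> List[Tuple[int, int]]:
    """Return grid cells that likely contain a station.

    A cell qualifies when at least `min_count` stationary readings fall
    into it. Isolated stops (e.g., at a traffic light) are filtered out,
    although repeated ones would still be misclassified.
    """
    counts: Counter = Counter()
    for lat, lon, speed in readings:
        if speed <= max_speed:
            cell = (round(lat / cell_size), round(lon / cell_size))
            counts[cell] += 1
    return sorted(cell for cell, n in counts.items() if n >= min_count)


if __name__ == "__main__":
    sample = [
        (52.2301, 21.0101, 0.0),
        (52.2302, 21.0102, 0.5),
        (52.2301, 21.0099, 0.2),
        (52.2351, 21.0151, 9.0),  # moving tram, ignored
        (52.2401, 21.0201, 0.0),  # single stop, below min_count
    ]
    print(infer_stop_cells(sample))
```

The inferred cells are only estimates, which is exactly the kind of inference error described above: the downstream congestion analysis runs as if the station layer were present, but inherits the inaccuracy of the guess.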