Project Details

ProvDS: Uncertain Provenance Management over Incomplete Linked Data Streams

Subject Area: Security and Dependability, Operating-, Communication- and Distributed Systems
Term: 2017 to 2022
Project identifier: Deutsche Forschungsgemeinschaft (DFG) - Project number 323223507
 
Final Report Year: 2022

Final Report Abstract

The Internet of Things (IoT) has gained significant momentum and provides intelligent, advanced services for industrial production and everyday life. New and versatile IoT applications bring new challenges with respect to heterogeneity, complexity, and dynamic data production and consumption, for example security and privacy challenges and difficulties in identity management. To address these problems, data provenance is necessary: it provides an essential building block by tracking the origins of data and their entire processing histories. Implementing a data provenance management system in an IoT environment raises a number of challenges: (1) How can we effectively and efficiently capture the provenance of data origins and the complete history of data generation and processing? (2) How can we store the large amount of provenance data in a timely fashion? (3) How can users query the captured provenance data flexibly, efficiently and accurately? (4) How can we effectively recover missing data (incompleteness) when the IoT data or the provenance data itself is incomplete? To address all these challenges, we investigated and implemented a data provenance management system based on event sourcing and CQRS (Command Query Responsibility Segregation) that fulfils eight state-of-the-art implementation requirements proposed in the IoT provenance research community: completeness, granularity, depth, accuracy, efficiency, scalability, integrity, and freshness. The ProvDS system is implemented as an asynchronous microservice, so it can work as a standalone provenance management system and can also be used as a provenance management service that communicates either with batch-mode IoT data repositories or with systems handling real-time, dynamic IoT data streams. To address the incompleteness problems efficiently, we conducted an extensive literature review and investigated 270 incomplete data handling methods, which are based on statistical or machine learning techniques.
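The event-sourcing and CQRS design named above can be illustrated with a minimal sketch. It is not the ProvDS implementation: the class names (`ProvenanceEvent`, `EventStore`, `LineageView`), the event fields, and the in-memory storage are all assumptions made for illustration. The write side only appends immutable events; the read side keeps a separate projection that answers queries.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List


@dataclass(frozen=True)
class ProvenanceEvent:
    """One immutable provenance record (hypothetical schema)."""
    entity_id: str    # the data item the event describes
    activity: str     # what happened to it, e.g. "captured"
    agent: str        # who or what performed the activity
    timestamp: float  # event time supplied by the caller


class EventStore:
    """Write side: an append-only log. Events are never updated or deleted."""

    def __init__(self) -> None:
        self._log: List[ProvenanceEvent] = []
        self._handlers: List[Callable[[ProvenanceEvent], None]] = []

    def subscribe(self, handler: Callable[[ProvenanceEvent], None]) -> None:
        self._handlers.append(handler)

    def append(self, event: ProvenanceEvent) -> None:
        self._log.append(event)
        # Notify read-side projections about the new event.
        for handler in self._handlers:
            handler(event)

    def replay(self) -> Iterator[ProvenanceEvent]:
        """Iterate over all stored events, e.g. to rebuild a projection."""
        return iter(self._log)


class LineageView:
    """Read side: a projection built from events and used only for queries."""

    def __init__(self) -> None:
        self._history: Dict[str, List[ProvenanceEvent]] = {}

    def apply(self, event: ProvenanceEvent) -> None:
        self._history.setdefault(event.entity_id, []).append(event)

    def history(self, entity_id: str) -> List[ProvenanceEvent]:
        """Return the processing history of one entity, oldest event first."""
        return sorted(self._history.get(entity_id, []), key=lambda e: e.timestamp)


if __name__ == "__main__":
    store = EventStore()
    view = LineageView()
    store.subscribe(view.apply)
    store.append(ProvenanceEvent("reading-1", "captured", "sensor-A", 1.0))
    store.append(ProvenanceEvent("reading-1", "aggregated", "gateway-B", 2.0))
    print([e.activity for e in view.history("reading-1")])
    # prints ['captured', 'aggregated']
```

Because the log is the single source of truth, a read-side view can be discarded and rebuilt at any time by replaying the events, which is one reason this pattern is often paired with provenance tracking.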
Of these 270 methods, we empirically evaluated the 66 most representative, using an incomplete data handling methods selection framework that we developed to facilitate our evaluations. We evaluated our ProvDS provenance management system with three real-world data sets: the IoT-23 data set of captured malicious and benign IoT network traffic, the Billion Triples Challenge (BTC) data set, and the Web Data Commons (WDC) data set. Experiments showed that our vanilla ProvDS provenance management system, ProvDS-V, which records basic provenance information, answered provenance queries with time efficiency similar to that of the vanilla TripleProv system, which has no provenance capabilities. In addition, our triple-level ProvDS system, ProvDS-Prov, which records complete provenance information, outperformed all triple-level TripleProv variants in terms of time efficiency, on average by 14.6%, 5.4%, 16.8%, and 4.5% on the BTC data set and by 12.4%, 5%, 31.8%, and 10.5% on the WDC data set (percentages are given in the order TripleProv-SG, TripleProv-SA, TripleProv-TG, and TripleProv-TA). In the evaluations of completeness accuracy and depth accuracy, our ProvDS system achieved average accuracies of 0.9878, 0.9928, and 0.9825 (completeness) and 0.9679, 0.9545, and 0.95575 (depth) for Vertex Overlap (VO), Edge Overlap (EO), and Vertex Edge Overlap (VEO), respectively. Finally, our ProvDS methods selection framework currently achieves an accuracy of more than 54.6% in selecting suitable incomplete data handling methods for the different classification and anomaly detection analyses.
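The abstract does not define the VO, EO, and VEO scores. As a rough sketch, they can be read as set-overlap measures between a reference provenance graph and a captured one. The formulas below are an assumption: VEO follows the common graph-similarity definition 2(|V ∩ V'| + |E ∩ E'|) / (|V| + |V'| + |E| + |E'|), and VO and EO are taken as the analogous Dice-style ratios over vertices and edges alone. The function names are hypothetical, and the project may have used different definitions.

```python
from typing import Hashable, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def _dice(a: Set, b: Set) -> float:
    """Dice-style overlap of two sets; two empty sets count as identical."""
    total = len(a) + len(b)
    return 1.0 if total == 0 else 2 * len(a & b) / total


def vertex_overlap(v_ref: Set[Hashable], v_cap: Set[Hashable]) -> float:
    """Assumed VO: overlap of the vertex sets only."""
    return _dice(v_ref, v_cap)


def edge_overlap(e_ref: Set[Edge], e_cap: Set[Edge]) -> float:
    """Assumed EO: overlap of the (directed) edge sets only."""
    return _dice(e_ref, e_cap)


def vertex_edge_overlap(v_ref: Set[Hashable], e_ref: Set[Edge],
                        v_cap: Set[Hashable], e_cap: Set[Edge]) -> float:
    """VEO: shared vertices and edges relative to the total size of both graphs."""
    total = len(v_ref) + len(v_cap) + len(e_ref) + len(e_cap)
    if total == 0:
        return 1.0
    return 2 * (len(v_ref & v_cap) + len(e_ref & e_cap)) / total


if __name__ == "__main__":
    # Reference graph a -> b -> c; the captured graph misses vertex c and edge b -> c.
    v_ref, e_ref = {"a", "b", "c"}, {("a", "b"), ("b", "c")}
    v_cap, e_cap = {"a", "b"}, {("a", "b")}
    print(vertex_overlap(v_ref, v_cap))                     # 0.8
    print(vertex_edge_overlap(v_ref, e_ref, v_cap, e_cap))  # 0.75
```

Under these assumed definitions, a score of 1.0 means the captured provenance graph matches the reference exactly, so the reported averages close to 1 would correspond to nearly complete provenance graphs.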
