Project Details

Web Data Analytics and Scientific Workflows

Subject Area: Security and Dependability, Operating-, Communication- and Distributed Systems
Term: from 2013 to 2017
Project identifier: Deutsche Forschungsgemeinschaft (DFG) - Project number 132320961
 
The main objective of this project is to enhance Stratosphere's ability to quickly analyze large, evolving datasets for problems that require iterative analytical programs. We focus on two demanding areas: Web data and Scientific Workflows (SWfs). The first area deals with the analysis of unstructured, heterogeneous, and distributed web content. It builds on work performed in the first phase, in which our subproject was concerned with research in declarative information extraction. In the second phase, we focus on the analysis of textual web data, including the dynamic acquisition of such data through focused web crawls. This requires a number of enhancements to Stratosphere, such as specific operators for dealing with web data, methods for using advanced information extraction to improve the accuracy of focused crawling, and a novel execution model that supports the inherently iterative nature of focused crawling.

The second research area targets the implementation, optimization, and efficient execution of SWfs as dataflow programs, in particular analysis workflows for next-generation sequencing data in the Life Sciences. The cost of producing genome data has fallen so steeply that sequencing is close to becoming standard in clinical practice, creating an avalanche of data that must be analyzed by a multitude of programs running in pipelines. SWfs are essentially dataflow programs and are thus generally well suited to being managed by a system such as Stratosphere. However, genome analysis in particular has properties that must be considered during dataflow optimization and parallelization. First, large-scale sequencing is becoming routine even outside large centers, which means that analysis must also run efficiently on small clusters and should optimally exploit multi-core machines. Second, there is no single, standardized analysis pipeline; rather, genomes are subjected to various independent analyses, resulting in complex workloads with partly overlapping functionality, which leaves much room for holistic workload optimization.
DFG Programme: Research Units
 
 
