DAPHNE: Integrated Data Analysis Pipelines for Large-Scale Data Management, HPC, and Machine Learning
The DAPHNE project aims to define and build an open and extensible system infrastructure for integrated data analysis pipelines, including data management and processing, high-performance computing (HPC), and machine learning (ML) training and scoring. Key observations are that (1) systems in these areas share many compilation and runtime techniques, (2) there is a trend towards complex data analysis pipelines that combine these systems, and (3) the underlying, increasingly heterogeneous hardware infrastructure is converging as well. Yet the programming paradigms, cluster resource management, and data formats and representations differ substantially. Therefore, this project brings together a joint consortium of experts from the data management, ML systems, and HPC communities to systematically investigate the system infrastructure, language abstractions, compilation and runtime techniques, and systems and tools needed to increase productivity when building such data analysis pipelines and to eliminate unnecessary performance bottlenecks.
- System Architecture, APIs and DSL: Improve the productivity of developing integrated data analysis pipelines via appropriate APIs and a domain-specific language, together with an overall system architecture for seamless integration with existing data processing frameworks, HPC libraries, and ML systems. A major goal is an open, extensible reference implementation of the necessary compiler and runtime infrastructure to simplify the integration of current and future state-of-the-art methods.
- Hierarchical Scheduling and Task Planning: Improve the utilization of existing computing clusters, multiple heterogeneous hardware devices, and the capabilities of modern storage and memory technologies through improved scheduling as well as static (compile-time) task planning. In this context, we also aim to automatically exploit data characteristics such as sort order, degree of redundancy, and matrix/tensor sparsity.
- Use Cases and Benchmarking: The technological results will be evaluated on a variety of real-world use cases and datasets as well as a new benchmark developed as part of the DAPHNE project. We aim to improve the accuracy and runtime of these use cases, which combine data management, machine learning, and HPC; this exploratory analysis also serves as a qualitative study of productivity improvements. The real-world use cases will further be generalized into a benchmark for integrated data analysis pipelines that quantifies progress relative to the state of the art.
The DAPHNE project is funded by the European Union’s Horizon 2020 research and innovation programme under grant agreement number 957407 for the period from December 2020 through November 2024.