Diaspora
(Argonne/LCLS)
ASCR DE-FOA-0002902, PI Ian Foster, 2023 - 2028, $150k
Adaptive resource management and partitioning (primary); modeling. $2M/year, 5 years
Hypothesis: Technological advances make it feasible to construct a resilient, secure, and high-performance event fabric that extends across edge sensors, experimental facilities, clouds, and supercomputers to provide timely, reliable, and accurate information about data, workflow, and resource status. This fabric can be used to develop new resilience solutions that will accelerate progress in science domains of high importance to DOE, and forge new connections between previously disparate applications and environments.
Motivation: Long-running AI-driven materials design campaigns involving 1000s of experiments and simulations; rural and urban sensor networks that leverage advanced ML to guide data collection; smart instruments at light sources that integrate simulation and AI methods for real-time interpretation of terabit/s data streams; multi-messenger astronomy, linking optical, radio, and gravitational wave detectors; an exabyte climate data facility that schedules analyses and caches to support interactive analysis by 1000s of climate scientists worldwide. These and other applications of high importance to DOE share a common need for resilience solutions that scale to large data, complex workflows, numerous distributed resources, and extended timescales.
Problem statement: Distributed science applications require resilience solutions for 1) data generated by sensors and computations, and by the software that manipulates those data, to satisfy application-specific accessibility and performance requirements; and 2) workflows used to generate and manipulate data, to ensure workflow progress. Increasing scale and heterogeneity make approaches based on synchronous communications impractical. Instead, we hypothesize (see above) that new resilience approaches are needed in which a distributed and hierarchical event propagation and summarization fabric provides event streams that serve as the foundation for broadly applicable resilience solutions that reduce coordination and synchronization overheads.
Key unknowns: Establishing the validity of the hypothesis just articulated requires resolving five key unknowns: 1) the resilience requirements of science applications; 2) the feasibility and architecture of a science event fabric that can support data, workflow, and resource resilience across a computing continuum spanning sensors, edge, and HPC, while supporting a broad spectrum of applications; 3) how data-driven models, constructed from historical data and event fabric data streams, can be used to reason about consistent states and quality of service in order to guide resilience decisions; 4) how to build effective data, workflow, and resource resilience solutions on top of the event fabric and the data-driven models; 5) how best to incorporate resilience solutions into real DOE-relevant applications.
Preliminary work: Team members have developed and demonstrated technological advances that make the creation of such a unifying fabric conceivable. These include the hybrid software-as-a-service (SaaS) architecture that has enabled Globus to deliver reliable and secure data movement services to 100,000s of scientists and engineers, including at most DOE laboratories and many universities; automation capabilities for experimental data analysis flows, developed within Braid; innovative edge computing architectures developed within funcX, each deployed widely in DOE-relevant applications; high-speed data movement via GridFTP and SciStream; highly scalable HPC data services, checkpointing methods, and fault-tolerant data cache and movement capabilities in Mochi; and new data models focusing on versioning and state capture in DataStates.
Architectural sketch: A simple distributed data pipeline consists of data source, preprocessing, ML model training, and experiment steering steps. Event fabrics provide resource and workflow information that is used, along with historical data, to 1) construct digital twin models of resource and workflow properties, 2) select resources and workflow configurations to meet data and workflow resilience goals, and 3) build a resilience plan to survive and/or adapt to unexpected events with minimal quality of service loss. Proven technologies are integrated and enhanced to provide the uniform security, data, and compute fabric required for state maintenance at different spatial and temporal scales, and for dynamic (re)configuration.
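To make this sketch concrete, the following is a minimal, illustrative Python sketch: pipeline stages publish status events to an in-process publish-subscribe fabric, and a resilience monitor consumes them to decide how to react to a failure. The names (Event, EventFabric, the workflow.status topic) are hypothetical and not Diaspora or Globus APIs; a real deployment would use a distributed fabric rather than an in-process one.

# Illustrative sketch only: Event, EventFabric, and the topic names are
# hypothetical, not part of any Diaspora or Globus API.
import time
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Event:
    """A status event emitted by a pipeline stage or resource."""
    topic: str                      # e.g., "workflow.status", "resource.status"
    source: str                     # emitting component, e.g., "preprocess@edge-01"
    payload: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)


class EventFabric:
    """Minimal in-process publish-subscribe fabric (stand-in for a distributed one)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Event], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Event], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, event: Event) -> None:
        for handler in self._subscribers[event.topic]:
            handler(event)


def resilience_monitor(fabric: EventFabric) -> None:
    """Consume workflow events and react to failures (here: just log a decision)."""
    def on_event(event: Event) -> None:
        if event.payload.get("state") == "failed":
            # A real service would consult digital-twin models and a resilience plan;
            # here we only record the decision.
            print(f"[monitor] {event.source} failed -> schedule replay on backup resource")
    fabric.subscribe("workflow.status", on_event)


def run_stage(fabric: EventFabric, name: str, ok: bool = True) -> None:
    """A pipeline stage that reports its outcome to the event fabric."""
    state = "completed" if ok else "failed"
    fabric.publish(Event("workflow.status", name, {"state": state}))


if __name__ == "__main__":
    fabric = EventFabric()
    resilience_monitor(fabric)
    for stage, ok in [("data-source", True), ("preprocess", True),
                      ("train", False), ("steer", True)]:
        run_stage(fabric, stage, ok)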
Research plan: We propose a research program that investigates each unknown simultaneously, in an iterative fashion. We organize this work in terms of four thrusts.
1) Event fabric: Capturing and processing geographically distributed events at the rates necessary to meet application needs and inform predictive models will require a new distributed approach to logging and communication. To this end, we will architect, develop, and evolve a scalable and resilient event fabric that builds upon industry-standard methods (e.g., Apache Kafka). A key challenge in this context is devising new decentralized, hierarchical methods that capture, aggregate, and disseminate events at scale among interested parties (e.g., using a publish-subscribe pattern; see the Kafka sketch after this plan).
2) Resilient services: We will explore new resilient abstractions, such as resilient data streams and data pipelines enhanced with lineage that allows roll-back to consistent states and/or buffering and replay of lost or missing data (see the lineage sketch after this plan). Combined with the event fabric and application-level hints that specify goals and constraints, these abstractions will underpin novel adaptive resilience services designed to handle unexpected events with minimal loss in the quality of service perceived by the workflow. We will integrate these services with other distributed system components, from edge to HPC, considering their consumption and use not only by individual local components but also by collections of components that make up the global compute and data fabric.
3) Modeling: We will combine historical data and real-time event data to construct digital twins of infrastructure, services, and applications. Prior work by our team has demonstrated the ability to predict various workflow-level characteristics (e.g., wide-area transfer time, transfer failures, distributed task execution time). Here we will combine historical and real-time data sources spanning heterogeneous resources and applications, and develop robust models capable of predicting actionable adverse events (see the modeling sketch after this plan). We will integrate these models with the services described above to enable real-time decisions to be enacted by the compute and data fabric (e.g., via application-level policies).
4) Applications: We will apply the event fabric, resilient services, and modeling methods in application experiments involving light source beamlines, sensor networks, multi-messenger astronomy, and federated learning.
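As a concrete starting point for the event fabric thrust, the sketch below shows how status events might be published to and consumed from an Apache Kafka topic using the confluent-kafka Python client. The broker address, topic name, and event schema are assumptions made for illustration, not project decisions.

# Illustrative Kafka publish-subscribe sketch (Thrust 1). The broker address,
# topic name, and event schema are assumptions, not project decisions.
# Requires: pip install confluent-kafka, and a reachable Kafka broker.
import json

from confluent_kafka import Consumer, Producer

BROKER = "localhost:9092"           # assumed local broker for the sketch
TOPIC = "diaspora.workflow.status"  # hypothetical topic name


def publish_status(source: str, state: str) -> None:
    """Publish one workflow-status event."""
    producer = Producer({"bootstrap.servers": BROKER})
    event = {"source": source, "state": state}
    producer.produce(TOPIC, value=json.dumps(event).encode("utf-8"))
    producer.flush()  # block until the event is delivered


def consume_status(max_events: int = 10) -> None:
    """Consume workflow-status events and print them."""
    consumer = Consumer({
        "bootstrap.servers": BROKER,
        "group.id": "resilience-monitor",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe([TOPIC])
    seen = 0
    while seen < max_events:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        print("received:", json.loads(msg.value()))
        seen += 1
    consumer.close()


if __name__ == "__main__":
    publish_status("preprocess@edge-01", "failed")
    consume_status(max_events=1)

In the envisioned fabric, such flat topics would be complemented by the decentralized, hierarchical aggregation and summarization described above, so that only summarized or actionable events cross wide-area links.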
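For the resilient services thrust, the following is a minimal sketch of the lineage idea, assuming a hypothetical ResilientStream that retains records with their source offsets so that a consumer can roll back to the last consistent state and replay uncommitted data. The class and method names are illustrative only.

# Illustrative lineage/replay sketch (Thrust 2). ResilientStream and its
# methods are hypothetical names used only to make the idea concrete.
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Record:
    offset: int        # position in the upstream source (the lineage handle)
    value: Any


class ResilientStream:
    """A stream that buffers records with lineage so consumers can replay."""

    def __init__(self) -> None:
        self._buffer: List[Record] = []
        self._committed: int = -1   # offset of the last consistent state

    def append(self, value: Any) -> Record:
        record = Record(offset=len(self._buffer), value=value)
        self._buffer.append(record)
        return record

    def commit(self, offset: int) -> None:
        """Mark everything up to `offset` as part of a consistent state."""
        self._committed = offset

    def replay_from_last_commit(self) -> List[Record]:
        """Return the records after the last consistent state, for reprocessing."""
        return [r for r in self._buffer if r.offset > self._committed]


if __name__ == "__main__":
    stream = ResilientStream()
    for value in ["frame-0", "frame-1", "frame-2", "frame-3"]:
        stream.append(value)
    stream.commit(1)   # frames 0-1 reached a consistent state
    # A downstream failure is detected: roll back and replay uncommitted records.
    print([r.value for r in stream.replay_from_last_commit()])  # ['frame-2', 'frame-3']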
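For the modeling thrust, a hedged sketch of how a predictive model might be trained from historical transfer records to flag transfers at risk of failure. The synthetic data, features, and model choice are assumptions for illustration, not project results.

# Illustrative predictive-model sketch (Thrust 3). The synthetic data, features,
# and model choice are assumptions for illustration, not project results.
# Requires: pip install scikit-learn numpy
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic "historical" transfer records: [size_GB, concurrent_transfers, link_load]
n = 2000
X = np.column_stack([
    rng.uniform(0.1, 500.0, n),   # transfer size in GB
    rng.integers(1, 32, n),       # concurrent transfers on the endpoint
    rng.uniform(0.0, 1.0, n),     # observed link load
])
# Failures are more likely for large transfers on heavily loaded links (toy rule).
y = ((X[:, 0] > 300) & (X[:, 2] > 0.7)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("held-out accuracy:", model.score(X_test, y_test))
# In the envisioned system, such a model would be fed by event-fabric streams and
# consulted before scheduling, e.g., to re-route a transfer predicted to fail.
print("predicted failure risk:", model.predict_proba([[450.0, 16, 0.9]])[0, 1])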