LLAna
(LCLS/NERSC)
The LBNL/LCLS Pilot for Data Analytics Project (LLAna Project) was a one-year pilot effort to address much-needed capabilities throughout the data analysis pipeline for LCLS. The work improved the free-electron laser (FEL) user experience when running at NERSC. The R&D projects addressed 1) data handling and management with HDF5, 2) scheduling, managing, and optimizing workflows on NERSC supercomputers, and 3) data analysis and visualization with Jupyter.
The HDF5 part of the project developed an HDF5 interface to the LCLS-II data format so that users who depend on tools other than the LCLS analysis framework, such as MATLAB, can transparently access the raw data while leveraging HDF5's scalable performance across platforms. Why not use HDF5 as the data format in the first place? HDF5 did not have adequate performance and support for the variable lengths and shapes of LCLS data (data reduction means that each event can have a different size), nor the ability to stream data to a file while simultaneously reading from that file, a necessary precursor to analyzing data as they are being taken. This project implemented some capabilities to handle variable-length data and reading-while-writing.
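As a rough illustration only (not the project's actual interface), the h5py sketch below shows the two HDF5 building blocks involved: a variable-length dataset whose events differ in size, and an extendable dataset written in SWMR (single-writer/multiple-reader) mode so a separate reader can follow the file while it is being written. The file and dataset names are hypothetical, and stock HDF5 SWMR does not itself support variable-length datatypes, which is part of what had to be worked around.

```python
import numpy as np
import h5py

# Minimal sketch, not the project's implementation.

# 1) Variable-length events: each event is a 1-D float32 array of a
#    different size, mimicking reduced LCLS data.
vlen_f32 = h5py.vlen_dtype(np.dtype("float32"))
with h5py.File("events_vlen.h5", "w") as f:
    dset = f.create_dataset("events", shape=(0,), maxshape=(None,), dtype=vlen_f32)
    for i in range(10):
        event = np.random.rand(np.random.randint(1, 100)).astype("float32")
        dset.resize(i + 1, axis=0)
        dset[i] = event

# 2) Reading while writing (SWMR) on a fixed-width, extendable dataset.
with h5py.File("events_swmr.h5", "w", libver="latest") as f:
    dset = f.create_dataset("frames", shape=(0, 128), maxshape=(None, 128),
                            dtype="float32", chunks=(1, 128))
    f.swmr_mode = True          # readers may now open the file with swmr=True
    for i in range(10):
        dset.resize(i + 1, axis=0)
        dset[i] = np.random.rand(128).astype("float32")
        dset.flush()            # make the new frame visible to SWMR readers

# A concurrent reader in another process would follow the file with:
#   with h5py.File("events_swmr.h5", "r", libver="latest", swmr=True) as f:
#       f["frames"].refresh()   # pick up newly flushed frames
```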
The second thrust was scheduling, managing, and optimizing workflows on NERSC supercomputers: building support for scheduling I/O-intensive workflows and developing tools to manage data analysis pipelines on complex HPC hardware. Most LCLS-II jobs are I/O limited, and HPC systems currently struggle to assign computing resources efficiently to such jobs. This foundational work informed the long-term enhancements needed in policies and methods at the system-software level.

Scientific advances increasingly depend on the ability of researchers to harness high-performance computing and data infrastructure to operate on large scientific data sets produced by experiments, observations, and simulations. The increasing complexity of computer hardware and the exponential growth in scientific data create the risk that these capabilities will be fully leveraged by only a shrinking number of expert users. This project advanced our understanding of user needs and of workflow, hardware, and software characteristics, enabling us to build data management systems and future HPC systems that provide automation and guidance to the user (i.e., self-guiding and self-tuning).

The project benchmarked the Serial Femtosecond X-ray (SFX) workflow, which reconstructs molecular structure from millions of measurements of (slightly different) samples. These applications are unlike the traditional parallel applications that have run at HPC centers. This work identified and optimized some of the I/O challenges of these workflows and addressed the difficulties of running them in HPC environments. For data analysis workflows with large input datasets, eliminating redundant I/O operations by sharing files and balancing I/O operations between processes can significantly reduce both the volume of I/O and the time spent in I/O operations. Similarly, we observed that staging input data in the Burst Buffer significantly reduces run time for large node-count runs. These workflows therefore need solutions in which input, intermediate, and output data are automatically managed across the different HPC storage layers. Integrating automatic data management with runtime optimizations would significantly improve end-to-end workflow performance as well as the overall user experience.
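The following hypothetical mpi4py/h5py sketch illustrates the shared-file pattern described above: rather than every process redundantly reading the full input file, all ranks open a single shared file and each reads only its balanced slice of the events. The file and dataset names are placeholders, and the example assumes an MPI-enabled build of HDF5/h5py; it is not the project's SFX code.

```python
from mpi4py import MPI
import h5py
import numpy as np

# Hypothetical sketch of balanced, shared-file input I/O.
comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

with h5py.File("run_events.h5", "r", driver="mpio", comm=comm) as f:
    dset = f["frames"]                      # placeholder dataset, shape (n_events, 128)
    n_events = dset.shape[0]

    # Balance I/O: each rank reads a contiguous, roughly equal block of events.
    start = rank * n_events // size
    stop = (rank + 1) * n_events // size
    local = dset[start:stop]                # each rank reads ~1/size of the data

# Each rank processes only its own events; results are reduced at the end.
local_sum = float(np.sum(local, dtype="float64"))
total = comm.reduce(local_sum, op=MPI.SUM, root=0)
if rank == 0:
    print("total intensity:", total)
```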
Finally, Jupyter is popular among LCLS users for its ability to analyze and visualize data directly through a web browser. This project helped develop the capability to scale data processing over thousands of cores or more, which is critical to the adoption of Jupyter for LCLS-II, and it enabled Jupyter-based experimental workflows on the NERSC supercomputer. We iteratively developed notebook-based workflows that leverage NERSC HPC resources directly through the Jupyter interface while interacting with the results from these jobs, and we identified key scaling bottlenecks and gaps in the interactive user experience.
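The sketch below illustrates the general pattern rather than the project's actual notebooks: a Jupyter cell submits a Slurm batch job to the HPC system, waits for it to leave the queue, and then loads the job's output back into the notebook for interactive analysis. The script and file names are placeholders.

```python
import subprocess
import time
import h5py

# Hypothetical notebook-driven workflow: submit, wait, then analyze interactively.

def submit(script):
    """Submit a batch script with sbatch and return the Slurm job id."""
    out = subprocess.run(["sbatch", "--parsable", script],
                         check=True, capture_output=True, text=True)
    return out.stdout.strip().split(";")[0]

def wait(job_id, poll=30):
    """Block until the job no longer appears in the Slurm queue."""
    while True:
        out = subprocess.run(["squeue", "-h", "-j", job_id],
                             capture_output=True, text=True)
        if not out.stdout.strip():
            return
        time.sleep(poll)

job_id = submit("analysis_job.sh")   # placeholder batch script
wait(job_id)

# Interact with the results directly in the notebook (e.g., plot them).
with h5py.File("results.h5", "r") as f:   # placeholder output file
    summary = f["summary"][...]
print(summary.shape)
```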