Data Systems Publications & Invited Talks
Publications:
Thayer, Jana & Chen, Zhantao & Claus, Richard & Damiani, Daniel & Ford, Christopher & Dubrovin, Mikhail & Elmir, Victor & Kroeger, Wilko & Li, Xiang & Marchesini, Stefano & Mariani, Valerio & Melchiori, Riccardo & Nelson, Silke & Peck, Ariana & Perazzo, Amedeo & Poitevin, Frederic & O’Grady, Christopher & Otero, Julieth & Quijano, Omar & Yoon, Chun Hong. (2024). Massive Scale Data Analytics at LCLS-II. EPJ Web of Conferences. 295. doi: 10.1051/epjconf/202429513002
Schwarz, N., Campbell, S., Hexemer, A., Mehta, A., Thayer, J.: Enabling scientific discovery at next-generation light sources with advanced AI and HPC. In: Nichols, J., Verastegui, B., Maccabe, A.B., Hernandez, O., Parete-Koon, S., Ahearn, T. (eds.) SMC 2020. CCIS, vol. 1315, pp. 145–156. Springer, Cham (2020). doi: 10.1007/978-3-030-63393-6_10
Invited Talks:
1) Artificial Intelligence & Robotics for Modern Accelerator-Based Light Sources (AIRA) virtual workshop, July 5th-8th, 2021
Title: Data Processing at the Linac Coherent Light Source
Presenter: Jana Thayer
2) Synchrotron Radiation Instrumentation virtual conference, March 28th - April 1, 2022
Title: Data Processing at the Linac Coherent Light Source
Presenter: Chuck Yoon
The increase in data volume generated by LCLS-II poses formidable challenges for data acquisition, processing, and management, owing to the high data throughput and the intensive computational demand of scientific interpretation. The LCLS-II Data System includes a feature extraction layer designed to reduce the data volumes by at least an order of magnitude while preserving the science content of the data. A real-time analysis framework provides visualization and graphically configurable analysis of data on the timescale of seconds. A fast feedback layer offers dedicated processing resources to the running experiment to provide data quality feedback within minutes. We will present an overview of the LCLS-II Data System architecture with an emphasis on real-time feedback for automation and tuning.
Plenary Talk, Title: Massive Scale Data Analytics at LCLS-II
Presenter: Jana Thayer
The increasing volumes of data produced at light sources such as the Linac Coherent Light Source (LCLS) enable the direct observation of materials and molecular assemblies at the length and timescales of molecular and atomic motion. This exponential increase in the scale and speed of data production is prohibitive to traditional analysis workflows that rely on scientists tuning parameters during live experiments to adapt data collection and analysis. User facilities will increasingly rely on the automated delivery of actionable information in real time for rapid experiment adaptation, which presents a considerable challenge for data acquisition, data processing, data management, and workflow orchestration. In addition, researchers' desire to accelerate science requires rapid analysis, dynamic integration of experiment and theory, the ability to visualize results in near real time, and the introduction of ML and AI techniques. We present the LCLS-II Data System architecture, which is designed to address these challenges via an adaptable data reduction pipeline (DRP) that reduces data volume on the fly, online monitoring analysis software for real-time data visualization and experiment feedback, and the ability to scale to computing needs by utilizing local and remote compute resources, such as the ASCR Leadership Class Facilities, to enable quasi-real-time data analysis in minutes. We discuss the overall challenges facing LCLS, our ongoing work to develop a system responsive to these challenges, and our vision for future developments.
Title: Data Analytics at the Linac Coherent Light Source
Session A079: Dealing with the Data Deluge
Presenter: Jana Thayer
5) International Forum on Detectors for Photon Science (IFDEPS), Port Jefferson, NY, March 17-20, 2024
Invited Talk, Title: Future Directions in Detector and Data Systems Integration through the lens of LCLS
Presenter: Jana Thayer
The LCLS-II Data System architecture addresses the challenges in data acquisition, data processing, data management, and workflow orchestration posed by the increase in data rate, volume, and complexity generated by the Linac Coherent Light Source upgrade. However, the exponential increase in the scale and speed of the data is prohibitive to traditional data analysis workflows, which rely on scientists painstakingly tuning parameters during live experiments to guide data collection and analysis. Instead, the automated delivery of actionable information about the experiment in real time and near real time is needed to enable experiment steering and experiment design. Data processing and feature extraction at the detector (ASIC/FPGA) level and Edge Machine Learning (EdgeML) have been identified as a strategic solution to data processing problems in large high-rate detectors. This technology deploys high-rate, low-latency artificial intelligence (AI) inference engines early in the detector chain to reduce the amount of data that needs to be processed in the back end. Challenges and potential new directions in detector and data system integration will be discussed.
6) Synchrotron Radiation Instrumentation conference, Hamburg, Germany, August 26-30, 2024
Plenary Talk, Title: How I Learned to Stop Worrying and Love the Data Deluge
Presenter: Jana Thayer
Invited Talk, Title: How I Learned to Stop Worrying and Love the Data Deluge
Session: Non-Traditional Applications of HPC, Wednesday, Nov 20, 2024, 10:30 AM
Presenter: Jana Thayer
Advanced data and computing systems are vital to LCLS operations, data interpretation, and overall scientific productivity. The transition to MHz-era operation marks a fundamental change in scale that requires new infrastructure and architectures to link LCLS to the scale of computing needed for scientific interpretation. The LCLS-II Data System leverages access to high-performance computing (HPC) to reduce time to science, improve the efficiency and quality of acquired data sets, and solve exascale problems that cannot be solved by other means. Feature-extracted information generated in the data analysis pipeline - at the edge, on local compute, or on remote HPC resources - can be used to steer experiments and inform user decisions during beam time. AI/ML presents new opportunities to rapidly analyze large datasets and direct experiments, but creates its own challenges in scaling, adaptability, complexity, and trustworthiness. Collectively, these advances are poised to significantly enhance experimental output and enable groundbreaking scientific exploration. We discuss the overall challenges facing LCLS and explore the opportunities afforded by fully leveraging the remote HPC resources of the DOE complex.
8) NERSC@50: Then, Now, and Into the Future, Berkeley, CA, October 22-24, 2024
Invited Talk, Title: How NERSC Helped Me Stop Worrying and Learn to Love the Data Deluge
Presenter: Jana Thayer
Advanced data and computing systems are vital to LCLS operations, data interpretation, and overall scientific productivity. The transition to MHz-era operation marks a fundamental change in scale that requires new infrastructure and architectures to link LCLS to the scale of computing needed for scientific interpretation. The LCLS-II Data System leverages access to high-performance computing (HPC) to reduce time to science, improve the efficiency and quality of acquired data sets, and solve exascale problems that cannot be solved by other means. Feature-extracted information generated in the data analysis pipeline - at the edge, on local compute, or on remote HPC resources - can be used to steer experiments and inform user decisions during beam time. AI/ML presents new opportunities to rapidly analyze large datasets and direct experiments, but creates its own challenges in scaling, adaptability, complexity, and trustworthiness. Collectively, these advances are poised to significantly enhance experimental output and enable groundbreaking scientific exploration. We discuss the overall challenges facing LCLS and explore the opportunities afforded by fully leveraging the remote HPC resources of the DOE complex.
9) Confab25, San Francisco, CA, April 7-10, 2025
Keynote Talk, Title: Love in the Time of Exascale: Orchestrating a Future-Proof Romance Between LCLS-II, AI, and DOE’s HPC Ecosystem
Session: Integrated Research Infrastructure Workflows: Science & Technology Example, Wednesday, April 9, 2025
Advanced data and computing systems are the backbone of LCLS science, enabling rapid interpretation of experiments at unprecedented scales. The transition to MHz-era operation demands a seamless integration of edge processing, high-performance computing (HPC), and intelligent data workflows to bridge the gap between raw data and scientific insight. In this talk, I will share my perspective on how integrated infrastructure—spanning real-time data reduction at the detector edge, SLAC’s Shared Science Data Facility, ESnet’s high-speed networking, and DOE leadership-class HPC resources—enables transformative science at LCLS.
The LCLS-II Data System exemplifies this paradigm: leveraging DOE’s distributed HPC ecosystem, we reduce time-to-science by streaming feature-extracted data to steer experiments during beam time. For example, AI/ML-driven analysis at NERSC, rapid re-training of models at ALCF, and large-scale training on streamed data at OLCF have enabled adaptive decision-making, turning terabyte-scale datasets into actionable feedback in minutes. Yet challenges remain—scaling AI workflows, ensuring trust in autonomous systems, and democratizing access to exascale resources across the DOE science community.
Looking ahead, I will outline a vision for infrastructure evolution over the next few years: embedding AI-driven “smart steering” into experimental workflows, from edge to exascale; expanding partnerships like the LCLS-S3DF-NERSC model to create a unified DOE “Superfacility” ecosystem, including the use of innovative traffic shaping and steering technologies from ESnet; and using initiatives like HPDF and IRI to transform LCLS data into a national resource that fuels AI innovation and multi-modal science.
These advances are not hypothetical—they are being tested today. For instance, S3DF’s role as a gateway to ASCR facilities demonstrates how integrated workflows can accelerate discoveries in quantum materials and ultrafast chemistry. By aligning detector design, real-time analytics, and shared infrastructure, we are building a future where every photon, pixel, and processor collaborates to push scientific frontiers. DOE’s integrated research infrastructure is rewriting the playbook for big-data science. Together, we can shape its evolution to unlock breakthroughs we’ve only begun to imagine.