Most data produced in modern research is never looked at
“In 2022 the Vera C. Rubin Observatory in Chile will collect 20 terabytes/night as part of the Legacy Survey of Space and Time (LSST); in 2028 the Square Kilometre Array will generate 100 times that amount” (Scott A. Klasky, The data deluge: Overcoming the barriers to extreme scale science, slide deck, 2022)
Modern research is producing vast amounts of data. Unfortunately, a large part of this data is simply stored: because of its size and complexity, it is never looked at and never used in the research.
In my own experience as a scientist, I ran into problems because analyzing data simply took too long and was too complicated. The amount of data that my research produced was huge and unmanageable. We tried compressing the data, but this did not solve the issue. I needed a new way of working with data to really access and exploit the huge amounts of information that I was generating in my research.
“So, I set out to create a new framework - a next level of abstraction for data analysis for modern research to benefit from the deluge of data” - Scott A. Klasky, Ph.D. in Physics, Oak Ridge National Laboratory
Imagine bringing more speed and precision into your research
Is there a way to make data analysis more effective for the benefit of the research itself and for the daily work of modern scientists?
What if you could perform data analysis more efficiently by using autonomous ML/AI algorithms to look at incoming data? What if you could actually work with and analyze most of your data faster and more efficiently than is the case today?
What if you could look at a selected data set in an earlier phase of a project to get a faster and more reliable indication of whether your hypothesis is correct, or whether you need to slightly alter your assumptions moving forward? Would your research benefit from a higher degree of precision if you were able to steer the experiment in the right direction at an earlier stage in the process?
Today, software applications that help analyze data are typically programmed to write to and read from storage and to perform in-line or in-transit visualization. The vision is to let applications publish and subscribe to data and, with no modification to your code, switch transparently from online (in-memory) analysis to offline (storage) analysis. In such a system, data can stream in a “refactored” manner so that the “most important” information is prioritized in the streams, and data written to and read from storage is highly optimized for the HPC resources and queryable.
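To make the publish/subscribe idea concrete, here is a minimal Python sketch. It is not the ADIOS API: the names `MemoryTransport`, `FileTransport`, `make_transport`, `simulate` and `analyze` are hypothetical and chosen only for illustration. It assumes a producer that publishes small per-step records and a consumer that subscribes to them, and it shows how a configuration value alone could decide whether the data stays in memory or goes through storage.

```python
import json
import os
import tempfile
from collections import deque


class MemoryTransport:
    """Hypothetical stand-in for online analysis: steps stay in memory."""

    def __init__(self):
        self._steps = deque()

    def put(self, step):
        self._steps.append(step)

    def steps(self):
        # Hand out steps in the order they were published.
        while self._steps:
            yield self._steps.popleft()


class FileTransport:
    """Hypothetical stand-in for offline analysis: steps go to a JSON-lines file."""

    def __init__(self, path):
        self._path = path

    def put(self, step):
        with open(self._path, "a", encoding="utf-8") as f:
            f.write(json.dumps(step) + "\n")

    def steps(self):
        with open(self._path, encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)


def make_transport(config):
    """Pick the transport from configuration, not from application code."""
    if config["transport"] == "memory":
        return MemoryTransport()
    if config["transport"] == "file":
        return FileTransport(config["path"])
    raise ValueError(f"unknown transport: {config['transport']!r}")


def simulate(transport, n_steps=3):
    """Producer: publishes one small record per step."""
    for i in range(n_steps):
        transport.put({"step": i, "values": [i, 2 * i, 3 * i]})


def analyze(transport):
    """Consumer: subscribes to the steps and returns each step's maximum."""
    return [max(step["values"]) for step in transport.steps()]


def run(config):
    transport = make_transport(config)
    simulate(transport)
    return analyze(transport)


if __name__ == "__main__":
    print(run({"transport": "memory"}))
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "steps.jsonl")
        print(run({"transport": "file", "path": path}))
```

Neither `simulate` nor `analyze` knows which transport is in use, so moving from in-memory to storage-based analysis only changes the configuration dictionary. A real system would additionally have to handle concurrency, prioritization of the streamed data and optimized storage layouts, which this sketch leaves out.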
This new way of working with and analyzing data can greatly benefit modern research in two key areas: it can considerably speed up the work, and it can enhance the precision and efficiency of the research. But solutions need to be available for that.
“There is a need for more research and implementation of new storage infrastructures on large scale HPC systems”, explains Eske Christiansen, HPC Manager, DeiC.
Rethinking software stacks on HPC systems
Modern research is generating huge, unmanageable amounts of data. If we want to exploit more of that data and reap the benefits of both speed and precision on current and next-generation exascale HPC systems, we also need to investigate better approaches to, and new ways of thinking about, the software stacks on HPC systems.
“Storage I/O for large scale exascale systems is the new battleground for getting the best performance out of the HPC systems”, continues Eske Christiansen, HPC Manager, DeiC.
A solution might be at hand in the ADIOS middle-layer software. During 2023, DeiC will look further into this area in close collaboration with facilities like LUMI.
Thank you to Scott A. Klasky, Frederic Suter, and Norbert Podhorszki from Oak Ridge National Laboratory for this interesting day, and the compelling insights into a new way of analyzing research data.
Would you like to know more? See the resources below:
Oak Ridge National Laboratory: Solving the Big Problems | ORNL
ADIOS software: ADIOS 2: The Adaptable Input/Output System version 2 (ADIOS2 2.8.3 documentation)