This paper presents a generic procedure for implementing a scalable, high-performance data analysis framework for large-scale scientific simulations within an in-situ infrastructure. It demonstrates a unique capability for global Earth system simulations using advanced computing technologies (i.e., automated code analysis and instrumentation), in-situ infrastructure (i.e., ADIOS), and big data analysis engines (i.e., SciKit-learn). The paper also includes a case study that analyzes a global Earth system simulation by integrating a scalable in-situ infrastructure with an advanced data processing package. The in-situ data analysis framework can provide new insights for scientific discovery in multiscale modeling paradigms.
Earth system models (ESMs) are essential tools for understanding Earth system dynamics and projecting future climate scenarios. It is well known that the validation and verification of Earth system processes within ESMs are quite challenging [
Developing an in-situ data analysis platform for Earth system modeling requires several practical solutions, including 1) automated instrumentation of large-scale code to extract the data of interest; 2) efficient packing and transfer of highly customized data types; 3) optimal tuning of the underlying in-situ infrastructure based on the unique characteristics of the application; and 4) seamless integration with external data processing and machine learning packages.
In this paper, we first present the design considerations and key components of a data analysis framework for Earth system simulation. We then describe the computing and software environment used in our study. Finally, we present a case study designed to demonstrate the practical usefulness of the data analysis system and its computing performance.
The data analysis framework developed in our study consists of four major components: 1) source code analysis and data capture; 2) data packing and conversion tools; 3) in-situ infrastructure; and 4) data analysis packages. The workflow of the framework is illustrated in
In this step, we analyze the source code dependencies using information extracted from compilers or language parsers. The main goal is to understand the software structure and to capture the internal data structures and scientific workflow of the source code. For a given function or module of interest, we use programming language parsers to analyze the source code and store the program information as an abstract syntax tree (AST). Then, we conduct recursive name resolution through the AST to capture the input and output data streams of the target function. Finally, we generate a code segment and insert it into the original source code to package all the data of interest into a contiguous memory buffer ready for in-situ data transfer. More detailed information on the source code analysis and data capture can be found in previous publications [
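As a rough illustration of the AST-based capture step, the sketch below uses Python's standard `ast` module as a stand-in for the Fortran parsers used in this work. The function `canopy_flux` and the helper `capture_io` are hypothetical names introduced only for this example, and the name resolution shown here is a single-function simplification, not the recursive procedure applied to ELM.

```python
import ast

# Hypothetical target function, standing in for a model subroutine.
SOURCE = '''
def canopy_flux(t_leaf, lai, albedo):
    absorbed = (1.0 - albedo) * lai
    flux = absorbed * t_leaf
    return flux
'''

def capture_io(source: str, func_name: str) -> dict:
    """Return the argument names (inputs) and returned names (outputs)
    of one function, found by walking the AST of the source text."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            inputs = [a.arg for a in node.args.args]
            outputs = []
            for sub in ast.walk(node):
                if isinstance(sub, ast.Return) and sub.value is not None:
                    names = {n.id for n in ast.walk(sub.value)
                             if isinstance(n, ast.Name)}
                    outputs.extend(sorted(names))
            return {"inputs": inputs, "outputs": outputs}
    raise KeyError(func_name)

io = capture_io(SOURCE, "canopy_flux")
print(io)
```

The resulting input and output lists are the kind of information a code generator would need in order to emit the packing statements inserted into the original source.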
Like most scientific codes, Earth system models use highly customized data structures. As mentioned in the previous section, the data are packed into a contiguous memory block for efficient transfer through the underlying in-situ infrastructure. The internal layout of each customized data structure has to be recorded as well, so that the customized data types can be reconstructed after the transfer. It is also worth mentioning that most in-situ infrastructures use specific data formats to improve their performance; it is therefore convenient to create a software tool for data conversion between applications and in-situ infrastructures.
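The packing and reconstruction idea can be sketched in a few lines of NumPy. This is an illustrative sketch under our own assumptions, not the conversion tool used in the study: the helpers `pack` and `unpack`, the field names, and the choice of a float64 transfer buffer are all hypothetical.

```python
import numpy as np

def pack(fields: dict):
    """Flatten named arrays into one contiguous float64 buffer and
    record the metadata needed to rebuild each array afterwards."""
    meta, chunks, offset = [], [], 0
    for name, arr in fields.items():
        arr = np.asarray(arr)
        meta.append({"name": name, "shape": arr.shape,
                     "dtype": arr.dtype.str, "offset": offset,
                     "size": arr.size})
        chunks.append(arr.astype(np.float64).ravel())
        offset += arr.size
    buffer = np.concatenate(chunks) if chunks else np.empty(0)
    return buffer, meta

def unpack(buffer, meta):
    """Rebuild the named arrays from the flat buffer and its metadata."""
    out = {}
    for m in meta:
        flat = buffer[m["offset"]:m["offset"] + m["size"]]
        out[m["name"]] = flat.astype(np.dtype(m["dtype"])).reshape(m["shape"])
    return out

# Hypothetical fields with different shapes and types.
fields = {
    "tlai": np.arange(6, dtype=np.float64).reshape(2, 3),
    "pft_index": np.array([1, 4, 7], dtype=np.int32),
}
buffer, meta = pack(fields)
restored = unpack(buffer, meta)
```

Keeping the metadata separate from the payload means that only one flat array has to cross the in-situ transport, while the receiver can still recover the original shapes and types.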
By now, several in-situ infrastructures have emerged, including data analysis and visualization toolkits (such as ParaView [
There are several comprehensive packages widely used for big data analysis, including graph database and machine learning [
The Energy Exascale Earth System Model (E3SM) is a national effort to address challenging and demanding climate-change research imperatives. Within the E3SM modeling framework, the E3SM Land Model (ELM) is a process-based model that represents the energy-water-biogeochemistry interactions between the atmosphere and the terrestrial landscape. We implement an in-situ data analysis system for ELM and focus on key biogeophysical and biogeochemical functions.
Due to the complexity of ELM, the validation and verification of terrestrial system processes are quite challenging. Scientists routinely use post-simulation approaches to analyze results. Generating data for post-simulation investigation of Earth system processes quickly becomes cumbersome once a simulation reaches a large scale, with a huge amount of data and a daunting input/output cost. A previous pilot effort focused on demonstrating a concept and small-scale (pointwise) prototype [
The platform used in this study is a Linux cluster within the Computing and Data Environment for Science (CADES) at Oak Ridge National Laboratory. The cluster consists of 48 Cray CS400 nodes. Each node contains two 16-core Intel E5-2698v3 processors (32 cores per node), 128 GB of RAM, dual-port Mellanox FDR InfiniBand and 10GbE networking, and a 250 GB local hard drive. The cluster shares a petascale parallel file system with other clusters within CADES. External users can access these ORNL HPC clusters via two dedicated login nodes.
The ELM used in the study comes from the latest version of the E3SM code as of 2017. Previously, we developed two methods for parsing and analyzing the ELM code, based on KGEN and the PGI Fortran compiler [
In this study, we configure the global ELM simulation on a 0.5 × 0.5 degree grid.
The simulation starts at the spinup stage, driven by the 1920-1948 climate datasets developed by the Climatic Research Unit in the United Kingdom, and assumes constant CO2 and land use at 1850 levels. For the purpose of testing our approach, the spinup runs for only 320 years, without the in-situ data analysis tools plugged in; this is long enough to demonstrate how plants evolve in warm regions such as the tropics. The simulation then restarts from the end of the spinup run and continues with a transient run (1850 to 2010). The transient run is configured to simulate the historical Earth system behavior since the industrial revolution. It is one of the commonly used simulation configurations for future scenario projections.
For the demonstration, we capture and analyze the values of the Total Leaf Area Index (TLAI) throughout a half-year global terrestrial ecosystem simulation and calculate data characteristics of TLAI (such as statistics and principal components) while the simulation is running, using built-in functions from SciKit-learn. Leaf area index (LAI), i.e., the projected one-sided foliage area per unit ground surface area, is a dimensionless quantity that characterizes the size of the vegetation foliage. Earth system models use it to predict photosynthetic primary production and evapotranspiration, and it can be regarded as an indicator of plant growth or greenness. As such, LAI plays an essential role in theoretical production ecology. In ELM, LAI is simulated for a total of 17 individual vegetation types in a grid cell. Then, taking the actual vegetation coverage of each grid cell into account, these individual LAIs are fraction-weighted to calculate the total LAI (TLAI) at each grid cell on the land portion of the Earth.
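The fraction-weighted aggregation described above can be written as a single array operation. The sketch below is a minimal illustration with synthetic data: the array names `lai` and `frac`, the random values, and the assumption that the vegetation fractions in each cell sum to one are ours and are not taken from ELM.

```python
import numpy as np

# 17 vegetation types on a 0.5-degree global grid (360 x 720 cells).
N_PFT, N_LAT, N_LON = 17, 360, 720

rng = np.random.default_rng(0)

# Synthetic per-type LAI values, for illustration only.
lai = rng.uniform(0.0, 6.0, size=(N_PFT, N_LAT, N_LON))

# Synthetic coverage fractions, normalized so that the fractions of
# the 17 types sum to one in every grid cell (an assumption).
frac = rng.random((N_PFT, N_LAT, N_LON))
frac /= frac.sum(axis=0, keepdims=True)

# TLAI: sum over vegetation types of (LAI x coverage fraction).
tlai = (lai * frac).sum(axis=0)

print(tlai.shape)
```

Because the weights are non-negative and sum to one, each TLAI value lies between the smallest and largest per-type LAI in its cell, which is a convenient sanity check on the captured data.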
at the very beginning of the simulation (12 am GMT, January 1, 1850). The TLAI at each grid cell around the globe (360 × 720 cells in total) is calculated as a weighted sum: the LAI of each vegetation type (called a patch) multiplied by the fraction of the grid cell area covered by that vegetation type. As expected, TLAI values of tropical cells, such as the Amazon and other low-altitude, well-vegetated areas, are much higher than TLAI elsewhere. The TLAI map also shows that the simulation begins during winter in the Northern Hemisphere, since TLAI is low in the northern mid-latitudes.
This test is designed to demonstrate automated data processing using a machine learning package, SciKit-learn. After the data of interest are extracted and transferred from the simulation, the transferred data arrays are converted into NumPy arrays that can be accessed by a Python application. In our test, we use the built-in statistical functions of SciKit-learn to classify TLAI values during the simulation, so that the seasonal pattern of global vegetation greenness can be recognized dynamically.
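One possible way to classify TLAI values on the analysis side is sketched below. The text does not state which SciKit-learn routine was used, so the choice of `KMeans`, the three greenness classes, and the synthetic TLAI values are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Synthetic TLAI samples standing in for one transferred snapshot:
# sparse, moderate, and dense vegetation (illustrative values only).
tlai = np.concatenate([
    rng.normal(0.3, 0.05, 200),
    rng.normal(2.5, 0.2, 200),
    rng.normal(5.5, 0.2, 200),
]).clip(min=0.0)

# scikit-learn expects a 2-D array of shape (n_samples, n_features).
X = tlai.reshape(-1, 1)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Relabel clusters so that class 0 is the least green and class 2
# the greenest, independent of the arbitrary cluster numbering.
order = np.argsort(model.cluster_centers_.ravel())
rank = np.empty_like(order)
rank[order] = np.arange(order.size)
classes = rank[model.labels_]

print(np.bincount(classes))
```

Reshaping `classes` back to the grid layout would give a greenness class map for each time step, which could then be compared across steps to follow the seasonal pattern.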
The high-resolution snapshot of raw model data output, like the one in
changes. Post data analysis, such as classification in
Two sets of experiments have been conducted to understand simulation performance. We first record the walltime used for data collection, packing and transfer. The pie chart of time division is illustrated in the left graph of
We have presented the design considerations of a data analysis framework for Earth system simulation based on automated source code instrumentation, in-situ infrastructure, and real-time data processing. We have designed a case study of Earth system simulation to demonstrate the practical usefulness of the data analysis system and its computing performance. By integrating external data processing packages, such as SciKit-learn, SPACK, and TensorFlow, we can easily apply novel machine learning approaches to study simulation results on the fly. We believe that in-situ processing is a feasible way to investigate large-scale climate simulations without intensive human interaction, and that it avoids the prohibitive I/O cost of post-processing on high-performance computing platforms. Future efforts will take two directions. We will focus on tuning the performance on high-end computers, such as Titan and Summit-Dev at the National Center for Computational Science at Oak Ridge National Laboratory. We will also work to integrate our data analysis system with other external packages, such as TensorFlow, for further large-scale ecosystem simulation data analysis on hybrid architectures. Science efforts will focus on identifying relationships between extreme weather events and long-term ecosystem behaviors.
This research was funded by the U.S. Department of Energy, Office of Science, Biological and Environmental Research program (E3SM and TES). This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
Wang, D., Luo, X., Yuan, F. and Podhorszki, N. (2017) A Data Analysis Framework for Earth System Simulation within an In-Situ Infrastructure. Journal of Computer and Communications, 5, 76-85. https://doi.org/10.4236/jcc.2017.514007