MapReduce, a parallel data processing framework pioneered by Google, has proven effective for handling big data challenges. Hadoop, an open source implementation of MapReduce, has gained increasing popularity over the past several years. However, Hadoop is not designed to handle spatiotemporal data. To bridge this gap, we propose a novel spatiotemporal indexing approach that significantly accelerates MapReduce querying and processing of big climate data stored in its native format.
Structure of the spatiotemporal index
The Spatiotemporal Indexing Approach (SIA) bridges the gap between array-based data models and block-oriented HDFS storage models by linking the logical spatiotemporal information (space, time, and variables) to the physical location information (node, file, and byte). Based on the index, a grid partition algorithm was developed to optimize MapReduce processing performance by maximizing data locality and balancing the workload across cluster nodes.
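To make the logical-to-physical link concrete, the following is a minimal Python sketch, not the published SIA implementation. It assumes each index record pairs a grid's logical coordinates (variable, time step) with its physical location (replica nodes, file, byte offset and length). The names GridRecord, query, and partition_by_locality, the file path, and the node names are all hypothetical. A simple greedy rule stands in for the grid partition algorithm: each grid is assigned to the least-loaded node that already stores a replica, which favors data locality while keeping the workload roughly balanced.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class GridRecord:
    """One index entry (hypothetical layout, for illustration only)."""
    # Logical spatiotemporal information
    variable: str
    time_step: int
    # Physical location information
    nodes: Tuple[str, ...]  # hosts holding a replica of the enclosing block
    file_path: str
    byte_offset: int
    byte_length: int


def query(index: Iterable[GridRecord], variables: Set[str],
          t_start: int, t_end: int) -> List[GridRecord]:
    """Return the records matching the variables and inclusive time range."""
    return [r for r in index
            if r.variable in variables and t_start <= r.time_step <= t_end]


def partition_by_locality(records: Iterable[GridRecord]) -> Dict[str, List[GridRecord]]:
    """Greedy stand-in for a grid partition step (an assumption, not the
    published algorithm): place each grid on the least-loaded node that
    holds a local replica, processing the largest grids first."""
    load: Dict[str, int] = defaultdict(int)
    plan: Dict[str, List[GridRecord]] = defaultdict(list)
    for rec in sorted(records, key=lambda r: r.byte_length, reverse=True):
        # Ties on load are broken by node name so the result is deterministic.
        node = min(rec.nodes, key=lambda n: (load[n], n))
        plan[node].append(rec)
        load[node] += rec.byte_length
    return dict(plan)


# Hypothetical index contents: the path and node names are made up.
index = [
    GridRecord("T", 0, ("node1", "node2"), "/data/sample.nc", 0, 100),
    GridRecord("T", 1, ("node1", "node2"), "/data/sample.nc", 100, 100),
    GridRecord("T", 2, ("node2", "node3"), "/data/sample.nc", 200, 100),
    GridRecord("U", 0, ("node1", "node3"), "/data/sample.nc", 300, 100),
]

hits = query(index, {"T"}, 0, 1)
plan = partition_by_locality(hits)
for node, recs in sorted(plan.items()):
    print(node, [(r.variable, r.time_step, r.byte_offset) for r in recs])
```

In this sketch a query touches only the byte ranges of the matching grids, and each resulting task is scheduled on a node that already holds the data, so no block has to be read in full or moved across the network.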
SIA was adopted by NASA as one of the key technologies in their Data Analytics and Storage System (DASS). SIA has also been extended and adapted to build a hierarchical indexing strategy for optimizing Apache Spark with HDFS to efficiently query big geospatial raster data (Hu et al., 2020).
Figure. Benchmarking performance measurements for the Spatiotemporal Indexing Approach (SIA) on multiple Portable Operating System Interface (POSIX) architectures leveraging connectors to the Hadoop Distributed File System (HDFS) environment; lower runtime is better. Carrie Spear, Michael Bowen, NASA/Goddard (Source: https://www.nas.nasa.gov/SC16/demos/demo37.html)
Credit/Source: Carrie Spear (email@example.com), HPC Architect/Contractor at the NASA Center for Climate Simulation (NCCS). http://files.gpfsug.org/presentations/2016/SC16/06_-_Carrie_Spear_-_Spectrum_Sclale_and_HDFS.pdf
Li, Z., Hu, F., Schnase, J. L., Duffy, D. Q., Lee, T., Bowen, M. K., & Yang, C. (2017). A spatiotemporal indexing approach for efficient processing of big array-based climate data with MapReduce. International Journal of Geographical Information Science, 31(1), 17-35.
Hu, F., Yang, C., Jiang, Y., Li, Y., Song, W., Duffy, D. Q., … & Lee, T. (2020). A hierarchical indexing strategy for optimizing Apache Spark with HDFS to efficiently query big geospatial raster data. International Journal of Digital Earth, 13(3), 410-428.