MapReduce, a parallel data processing framework pioneered by Google, has proven effective for handling big data challenges. Hadoop, an open source implementation of MapReduce, has gained popularity over the past several years. However, Hadoop is not designed to handle spatiotemporal data. To bridge this gap, we propose a novel spatiotemporal indexing approach that significantly accelerates MapReduce-based querying and processing of big climate data in its native format.
The Spatiotemporal Indexing Approach (SIA) bridges the gap between array-based data models and block-oriented HDFS storage models by linking the logical spatiotemporal information (space, time, and variables) to the physical location information (node, file, and byte). Based on the index, a grid partition algorithm was developed to optimize MapReduce processing performance by maximizing data locality and balancing the workload across cluster nodes.
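To make the idea concrete, the following is a minimal sketch of how such an index and a locality-aware partition might look. It is not the published SIA implementation: the `IndexEntry` fields, the file paths, the node names, and the greedy assignment rule are all assumptions chosen for illustration. Each entry maps a logical key (variable, time step, level) to a physical location (replica nodes, file, byte offset), and `partition` assigns every grid to the least-loaded node that already stores it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One grid: logical key -> physical location (all fields hypothetical)."""
    variable: str   # logical: variable name, e.g. "T"
    time_step: int  # logical: index along the time axis
    level: int      # logical: vertical level, standing in for the spatial key
    nodes: tuple    # physical: nodes holding a replica of the block
    path: str       # physical: file that contains the grid
    offset: int     # physical: byte offset of the grid within the file
    length: int     # physical: size of the grid in bytes


def query(index, variable, t_start, t_end):
    """Return entries for one variable within an inclusive time range."""
    return [e for e in index
            if e.variable == variable and t_start <= e.time_step <= t_end]


def partition(entries):
    """Greedy stand-in for a grid partition step (an assumption, not SIA's
    algorithm): each grid goes to the least-loaded node that already holds
    it, so reads stay local and the byte load stays roughly even."""
    load = defaultdict(int)
    plan = defaultdict(list)
    # Place larger grids first so the greedy choice balances better.
    for e in sorted(entries, key=lambda e: -e.length):
        node = min(e.nodes, key=lambda n: (load[n], n))
        plan[node].append(e)
        load[node] += e.length
    return dict(plan), dict(load)


# Hypothetical index over two files replicated across three nodes.
demo_index = [
    IndexEntry("T", 0, 0, ("n1", "n2"), "/data/a.nc", 0, 100),
    IndexEntry("T", 1, 0, ("n1", "n2"), "/data/a.nc", 100, 100),
    IndexEntry("T", 2, 0, ("n2", "n3"), "/data/b.nc", 0, 100),
    IndexEntry("U", 0, 0, ("n1", "n3"), "/data/b.nc", 100, 100),
]

hits = query(demo_index, "T", 0, 2)
plan, load = partition(hits)
for node in sorted(plan):
    print(node, [(e.path, e.offset) for e in plan[node]], load[node])
```

Because every grid is assigned only to a node listed in its own `nodes` tuple, each map task in this sketch can read its bytes locally, while the running `load` counter keeps any single node from collecting a disproportionate share of the work.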
SIA was adopted by NASA as one of the key technologies in its Data Analytics and Storage System (DASS). SIA has also been extended and adapted to build a hierarchical indexing strategy for optimizing Apache Spark with HDFS to efficiently query big geospatial raster data (Hu et al., 2020).
“Testing the Spatiotemporal Indexing Approach (SIA) under a variety of configurations, including HDFS, General Parallel File System (GPFS), and Lustre, has helped to clarify and define the architecture of the DASS hardware and the software stack. The DASS will provide engineers and scientists with a platform for analyzing large climate datasets without the need to move the data.”
Credit/Source: Carrie Spear (email@example.com), HPC Architect/Contractor at the NASA Center for Climate Simulation (NCCS). http://files.gpfsug.org/presentations/2016/SC16/06_-_Carrie_Spear_-_Spectrum_Sclale_and_HDFS.pdf
Li, Z., Hu, F., Schnase, J. L., Duffy, D. Q., Lee, T., Bowen, M. K., & Yang, C. (2017). A spatiotemporal indexing approach for efficient processing of big array-based climate data with MapReduce. International Journal of Geographical Information Science, 31(1), 17-35.
Hu, F., Yang, C., Jiang, Y., Li, Y., Song, W., Duffy, D. Q., … Lee, T. (2020). A hierarchical indexing strategy for optimizing Apache Spark with HDFS to efficiently query big geospatial raster data. International Journal of Digital Earth, 13(3), 410-428.