The time OSM mappers invest in labeling the world is valuable. We present how methods from remote sensing, distributed big data computing, and artificial intelligence can be combined to support human analysis of geo-spatial data.
The past decade has seen a dramatic increase in the number of openly available geo-spatial datasets, such as
- multi-spectral and RADAR satellite imagery from space agencies like ESA and NASA,
- government-sponsored, high-resolution aerial survey raster data, e.g., from the USDA (U.S. Department of Agriculture),
- weather reanalysis model data based on sensor networks, e.g., published by the PRISM Climate Group,
- geo-tagged messages and images from social media platforms such as Twitter and Instagram.
Together, these sources accumulate geo-tagged information at data rates easily exceeding tens of terabytes a day**. Given that an open-source project such as OSM relies on volunteers who spend their valuable time*** generating vector datasets that annotate and update information on roads, buildings, land cover, points of interest, etc., it is natural to ask how sources of freely available spatio-temporal information might support and guide mappers in their work.
At the same time, major progress has been made in the "open-source digital arena" of big data processing and artificial intelligence (AI). For example, distributed non-relational database systems such as HBase (https://hbase.apache.org/) and in-memory distributed compute frameworks such as Spark run on commodity hardware to scale analytics. Deep learning libraries such as PyTorch (https://pytorch.org), combined with the rapidly growing number of neural network architectures published by academia, enable state-of-the-art computer vision algorithms that can be leveraged for remote sensing tasks such as detection of buildings, land classification, and change detection.
Our presentation will discuss and demonstrate how to link tools from big data analytics and machine learning to geo-spatial datasets at scale in order to extract value from openly available spatio-temporal datasets to the potential benefit of OSM mappers.
In particular, we show the design of a system that employs the key-value store HBase to index spatio-temporal satellite imagery so that Spark SQL (https://spark.apache.org/sql/) user-defined functions can act on it to remotely identify human signatures on Earth's surface with the aid of AI.
Finally, for pixel-wise land classification, we use the 1-meter resolution USDA aerial survey data and information derived from the OpenStreetMap project. The goal is to establish a scalable pixel-level translation model from aerial image to OSM map, where colors and shapes define the land classification, i.e., forestry, grassland, building, road, etc. The USDA aerial survey is refreshed every other year, so we expect to translate the latest aerial survey to OSM and compare the result with the current OSM state to identify changes in actual land use. This information will guide the OSM community to where the map needs to be updated.
We believe that the techniques and use cases presented will help to identify "hot spots" where OSM needs human labor most, either in mapping or in updating labels. Moreover, we hope to spark a scientific, strategic, and technical conversation with the OSM community on needs regarding semi-automated support systems for global mapping. If time permits, as a bonus, we will introduce the open-source tool https://github.com/IBM/ibmpairs to interact with the spatio-temporal platform PAIRS that supports our research.
**Although precise numbers are hard to find in the literature, a rough estimate based on ESA’s Sentinel satellite imagery can be made: assuming approximately global coverage over land at 30-meter resolution in 10 spectral bands once a week, the 150 million square kilometers of global land surface generate about 15*10^7*10*1000 pixel values per week, i.e., a rate on the order of 6*10^12 bytes per week, assuming 4-byte floating-point numbers to store each value.
Scanning the list of openly available data from NASA and ESA alone, it is fair to assume there exists at least a handful of such products. Please note that there is typically an inverse relation between the spatial and temporal resolution of geo-spatial data products: the coarser the spatial resolution, the more often data is published.
***At the beginning of 2018, the overall OSM XML history file was 0.1TB compressed (2TB uncompressed). Five years earlier, the historical data had accumulated to about 40GB compressed. Assuming an OSM mapper adds a new record/modification (nodes+ways+tags) on the order of tens of KB within minutes, here taken as 20KB per minute, the ratio of time invested to real time reads [60*10^6KB*20/(20KB/60s)]/(5*365*24*60*60s) ~ 10^6/(4*10^4) ~ 25, where 60*10^6KB is the compressed growth over the five years and the factor 20 converts compressed to uncompressed size. I.e., an estimated 25 mappers have to work around the clock to generate OSM labels.
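The two footnote estimates can be reproduced with a few lines of Python. This is a minimal sketch using only the rounded figures quoted above; the function and parameter names are hypothetical, and the factor 1000 is read here as pixels per square kilometer at 30-meter resolution ((1000/30)^2 is about 1100, rounded down), which is an assumption about the authors' intent.

```python
# Back-of-envelope estimates from the two footnotes. All inputs are the
# rounded figures quoted in the text, not measured values.

SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def imagery_bytes_per_week(land_area_km2=15e7, pixels_per_km2=1000,
                           bands=10, bytes_per_value=4):
    """Weekly data volume of one global land product at roughly 30 m."""
    # pixels_per_km2=1000 is an assumed reading of the factor 1000 in the text.
    return land_area_km2 * pixels_per_km2 * bands * bytes_per_value


def mappers_around_the_clock(compressed_growth_kb=60e6, compression_ratio=20,
                             kb_per_edit=20, seconds_per_edit=60, years=5):
    """Ratio of mapping time invested to elapsed real time."""
    # Convert the compressed growth of the history file to uncompressed size.
    uncompressed_kb = compressed_growth_kb * compression_ratio
    # Time needed to produce that volume at the assumed editing rate.
    seconds_invested = uncompressed_kb / (kb_per_edit / seconds_per_edit)
    return seconds_invested / (years * SECONDS_PER_YEAR)


if __name__ == "__main__":
    print(f"imagery: {imagery_bytes_per_week():.1e} bytes per week")
    print(f"mappers: {mappers_around_the_clock():.1f} working around the clock")
```

With these inputs the first estimate gives 6*10^12 bytes per week, and the second gives a ratio of roughly 23, which the footnote rounds to about 25. Both are order-of-magnitude figures and change directly with the assumed editing rate and compression ratio.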
Speakers: Rui Zhang, Marcus Freitag