We develop and test user-embedding approaches to vandalism detection in OSM. We demonstrate improvements over previous vandalism detection methods and show how the user embeddings can also be applied to detect different communities of mappers. We validated the embedding model with a prepared vandalism corpus that we are also releasing to the OSM community.
With more than 11B edits from 1.6M unique mappers and a database that anyone can edit, OpenStreetMap (OSM) inevitably contains vandalism. Our approach to detecting it leverages the analytical power and scalability of machine learning through OSM user embeddings. Embeddings are effective at capturing semantic similarities between entities that are not explicitly represented in the data. Word embeddings were first introduced on the assumption that words adjacent to each other share similar meanings [1,2]. The concept has since been extended beyond words to any entity, so long as one can produce a meaningful sequence of the entities. We therefore build OSM user embeddings with mappers as entities by constructing sequences of mappers based on shared editing histories and similar behaviors.
**Methods**
_Creating a Vandalism Corpus_
Development of automated vandalism detection methods in OSM has been slow in part because there is no published corpus of bad or vandalized edits on which to train and validate [3]. Vandalized name attributes are especially problematic because this text is rendered on the basemap. The most infamous instance of this type of vandalism was the changing of "New York City" to an ethnic slur; this name attribute was subsequently rendered on maps drawing from OSM data [4]. As part of this work, we construct and make available the first OSM vandalism corpus for the name attribute of OSM features. Potential examples of vandalism were collected from the OSM Changeset Analyzer (OSMCha) web-based validation tool. These records were then manually reviewed by the Facebook mapping team to identify egregious name changes. Negative samples (non-vandalism) were randomly sampled from a previously validated vandalism-free snapshot of OSM. All of our examples are extracted from OSM data only; no external or conflated data sources are used.
_User Embeddings_
To construct meaningful sequences of OSM users where adjacent users share similar mapping patterns, we analyzed the edit history of every OSM object and the temporal/semantic editing patterns of individual mappers. These sequences were then fed into a word2vec skip-gram model to train OSM user embeddings.
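As a rough illustration of what a skip-gram model sees during training, the sketch below expands one sequence of mappers into (center, context) pairs. The function name `skipgram_pairs` and the window size are assumptions made for this example; the abstract does not state the hyperparameters that were used, and actual training would typically be delegated to a word2vec implementation.

```python
def skipgram_pairs(sequence, window=2):
    """Return (center, context) pairs for every user in a sequence.

    `window` is a hypothetical choice; the abstract does not report it.
    """
    pairs = []
    for i, center in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a user is never their own context
                pairs.append((center, sequence[j]))
    return pairs


# Illustrative user names, not real OSM accounts.
pairs = skipgram_pairs(["bob", "alice", "carol"], window=1)
print(pairs)
```

Users that appear near each other in many sequences end up in many shared pairs, which is what pulls their embeddings together.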
**Shared object editing histories** are sequences of OSM users who have edited the same object, in chronological order of editing. These sequences represent mappers who share interest in the same objects on the map. This yields 2B sequences of mappers.
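A minimal sketch of how such sequences could be assembled from a flat list of edits. The record layout `(object_id, timestamp, user)`, the function name, and the decision to drop single-user sequences are assumptions for illustration, not details taken from the pipeline described here.

```python
from collections import defaultdict


def shared_object_sequences(edits, min_length=2):
    """Group edits by object and list each object's mappers chronologically.

    `min_length` is an assumed filter: a one-user sequence provides no
    neighbouring users to learn from.
    """
    by_object = defaultdict(list)
    for object_id, timestamp, user in edits:
        by_object[object_id].append((timestamp, user))

    sequences = {}
    for object_id, history in by_object.items():
        # ISO 8601 timestamps in one format and time zone sort as strings.
        history.sort(key=lambda item: item[0])
        users = [user for _, user in history]
        if len(users) >= min_length:
            sequences[object_id] = users
    return sequences


# Illustrative records, not real OSM data.
edits = [
    ("way/1", "2020-03-01T10:00:00Z", "alice"),
    ("way/1", "2019-07-15T08:30:00Z", "bob"),
    ("node/7", "2021-01-02T12:00:00Z", "carol"),
    ("way/1", "2021-05-20T16:45:00Z", "carol"),
]
sequences = shared_object_sequences(edits)
print(sequences)
```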
**Semantic and temporal mapping patterns** are sequences of OSM users that have shared editing characteristics with regard to how and when they edit the map. Starting with _changesets_, we extract the following keys for each OSM element edited in a given changeset when present: `addr:country`, `admin_level`, `amenity`, `building`, `highway`, `natural`, `place`, `source`. Additionally, we extract the following metadata: the presence of a `name` tag, the `version` number, the editing software (e.g. iD editor, JOSM), and any hashtags (possibly denoting specific mapping campaigns). Finally, we group all of these edits by two types of temporal patterns: first, the date of the changeset, and second, the hour of the week of the changeset, per year (with 168 hours in a week, we aggregate across each _week-hour_ in a given year). This yields 30M sequences of mappers.
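The two temporal groupings can be sketched as follows. The changeset dictionary layout, the helper names, the use of UTC, and the way a temporal key is paired with a single extracted feature are all assumptions for illustration; the abstract does not specify how the semantic features and temporal buckets are combined.

```python
from collections import defaultdict
from datetime import datetime, timezone


def week_hour(ts):
    """Hour of the week in 0..167, taking Monday 00:00 as hour 0."""
    return ts.weekday() * 24 + ts.hour


def temporal_keys(created_at):
    """Return the per-date key and the per-year week-hour key."""
    ts = datetime.fromisoformat(created_at).astimezone(timezone.utc)
    return [
        ("date", ts.date().isoformat()),
        ("week_hour", ts.year, week_hour(ts)),
    ]


def group_mappers(changesets):
    """Collect users that share a temporal key and an extracted feature."""
    groups = defaultdict(list)
    ordered = sorted(changesets, key=lambda c: datetime.fromisoformat(c["created_at"]))
    for cs in ordered:
        for key in temporal_keys(cs["created_at"]):
            for feature in cs["features"]:
                groups[(key, feature)].append(cs["user"])
    return groups


# Illustrative changesets, not real OSM data.
changesets = [
    {"user": "alice", "created_at": "2021-03-01T09:15:00+00:00", "features": ["highway", "editor=JOSM"]},
    {"user": "bob", "created_at": "2021-03-08T09:40:00+00:00", "features": ["highway"]},
    {"user": "carol", "created_at": "2021-03-01T17:05:00+00:00", "features": ["building"]},
]
groups = group_mappers(changesets)
print(groups[(("week_hour", 2021, 9), "highway")])
```

In this toy input, alice and bob edit highways a week apart but in the same week-hour, so they land in the same week-hour sequence while remaining in separate per-date sequences.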
**Results**
_Community Detection_
OSM comprises many distinct groups of mappers; considering each of these groups a different sub-community makes OSM a "community of communities" [5]. The temporal and semantic editing patterns were specifically designed to create sequences of mappers with a high likelihood of belonging to the same community. One easy-to-identify type of community is the corporate editing team: a group of employees paid to edit OSM [6]. Results of corporate editing team detection can be easily validated against published lists of known editors.
The five largest corporate mapping teams are Apple (>1,200 mappers), Amazon (>700), Grab (>550), Facebook (>250), and Kaart (>200). These counts are based on extracting affiliation from a mapper's OSM user profile, looking for sentences such as "I work for Amazon" and are likely an under-representation [7].
To validate the model's ability to identify members of an editing team based on editing semantics, we used cosine similarity to compare users. First, we identified the 100 most similar users to the _top 10 most active mappers_ in each company (by number of changesets). Next, we counted how many of these 100 most similar users are also on that team. This is a measure of recall for our model.
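A small sketch of this validation step, using toy two-dimensional vectors. The function names are hypothetical, and ranking candidates by their mean similarity to the seed users is an assumption: the abstract does not say how similarities to the ten seed mappers were combined.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(embeddings, seeds, k):
    """Rank non-seed users by mean cosine similarity to the seed users."""
    scores = {}
    for user, vector in embeddings.items():
        if user in seeds:
            continue
        sims = [cosine_similarity(vector, embeddings[s]) for s in seeds]
        scores[user] = sum(sims) / len(sims)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [(user, scores[user]) for user in ranked[:k]]


def team_share(neighbours, team):
    """Fraction of the retrieved neighbours that belong to the team."""
    return sum(1 for user, _ in neighbours if user in team) / len(neighbours)


# Toy embeddings: a1..a4 stand in for one team, x1 and x2 for outsiders.
embeddings = {
    "a1": [1.0, 0.0], "a2": [0.9, 0.1], "a3": [0.95, 0.05],
    "a4": [0.8, 0.2], "x1": [0.0, 1.0], "x2": [0.1, 0.9],
}
team = {"a1", "a2", "a3", "a4"}
neighbours = most_similar(embeddings, seeds=["a1", "a2"], k=2)
print([user for user, _ in neighbours], team_share(neighbours, team))
```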
Amazon is the most identifiable team, with all 100 of the most similar editors also belonging to the Amazon Logistics data team. The mean cosine similarity (`mcs`) among these 100 mappers is 0.98. Apple is the second most identifiable with 97% of the top 100 most similar mappers also belonging to the Apple data team and an `mcs` between the top 10 and these 97 users of 0.94. Third was Kaart, with 96% and `mcs=0.88`. Facebook was fourth with 87% and `mcs=0.87`. The Grab data team, however, was more difficult to identify: only 68% of the top 100 most similar mappers were also part of the Grab data team. The `mcs` between these 68 mappers and the top 10, however, is high at 0.94.
_Vandalism Detection_
To detect vandalism, we train a Gradient Boosting Decision Tree (GBDT) model on metadata, user reputation, object history, and content features. We incorporated OSM user embeddings into this model by creating two embedding features, `kmeans_cluster` and `cos_sim_last_5_users`. To create `kmeans_cluster`, we ran k-means clustering on OSM users, assigned a cluster to every user with an embedding, and then encoded each cluster by the average number of changesets edited by its members. The idea behind `cos_sim_last_5_users` is that users who are similar to each other are more likely to edit the same objects. Starting with an edit to an OSM object, we compute the cosine similarity between the user responsible for the edit and the previous five mappers that edited the object.
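The two features might be computed along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, the cluster assignments are taken as given (for example from any k-means implementation), and averaging the similarities to the previous mappers into a single value is an assumed aggregation that the abstract does not specify.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cos_sim_last_n_users(editor, previous_editors, embeddings, n=5):
    """Mean similarity between the editor and the last n prior editors.

    Returns None when the feature is undefined, e.g. for a newly created
    object or a user without an embedding. Averaging is an assumption.
    """
    if editor not in embeddings:
        return None
    recent = [u for u in previous_editors[-n:] if u in embeddings]
    if not recent:
        return None
    sims = [cosine_similarity(embeddings[editor], embeddings[u]) for u in recent]
    return sum(sims) / len(sims)


def encode_clusters(user_cluster, user_changesets):
    """Map each cluster id to the mean changeset count of its members."""
    totals = defaultdict(lambda: [0, 0])  # cluster -> [sum, member count]
    for user, cluster in user_cluster.items():
        totals[cluster][0] += user_changesets.get(user, 0)
        totals[cluster][1] += 1
    return {cluster: total / count for cluster, (total, count) in totals.items()}


# Toy inputs, not real OSM users.
embeddings = {"alice": [1.0, 0.0], "bob": [1.0, 0.0], "carol": [0.0, 1.0]}
feature = cos_sim_last_n_users("alice", ["carol", "bob"], embeddings)
new_object = cos_sim_last_n_users("alice", [], embeddings)
encoding = encode_clusters(
    {"alice": 0, "bob": 0, "carol": 1},
    {"alice": 100, "bob": 300, "carol": 4},
)
print(feature, new_object, encoding)
```

Returning `None` for objects without history mirrors the idea that this feature only covers edits to objects that already have prior editors.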
Next, we trained a new model that includes the embedding features and observed a relative improvement of 1.3% in our primary metric, the area under the receiver operating characteristic curve (AUC-ROC). The `kmeans_cluster` feature ranks 2nd of 49 in feature importance, with a coverage of 99.9%. The `cos_sim_last_5_users` feature ranks 16th of 49, largely because of its relatively low coverage of 64%: many edits in OSM create new objects, for which there can be no editing history.
Because of the AUC improvements and high feature importance, Facebook has deployed this model in production to detect vandalism, as a part of the data validation in the Facebook Map and Daylight Map, a validated, vandalism-free distribution of OSM [8].
_Vandalism Corpus_
The accurately labeled dataset of vandalism to named elements in OSM is a tremendous asset to researchers hoping to further the work of automated vandalism detection. As part of the continual quality-assurance work at Facebook, teams of professional mappers are consistently labeling and improving this running list. As part of this work, we are publishing this fully labeled vandalism corpus for others in the OSM research community to use [9].