Similarity Detection in Online Integrity

FOSDEM 2023

How Meta manages to take down millions of pictures, videos and texts that violate its community standards, all of them adversarially engineered, from a catalog that counts in the trillions. We'll talk about open source technologies that embrace vector search, the state of the art in neural and non-neural embeddings, as well as turnkey solutions.

Content moderation is a problem that affects every service that hosts user-uploaded media. From avatars to a personal collection of pictures, the platform holds the responsibility of removing violating content. The problem can be tackled with classifiers, with human moderators and by comparing media signatures; this presentation is about the last of these. Similarity Detection is an approach that tries to detect media based on an archive of "definitions" (yes, like antivirus software) of things that have already been classified as violating. But how do we measure similarity between images from the perspective of a machine (not to mention video/audio clips of different lengths)? The answer is not MD5... We'll talk about how we do it, which technologies you can use too, and how we can leverage a public, crowdsourced archive of signatures to defeat various threats, from terrorism to misinformation to child exploitation.
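To see why MD5 is not the answer, here is a minimal Python sketch. It compares a cryptographic digest with a toy "average hash" on a tiny 8x8 grayscale image. The average hash is only a simple stand-in chosen for illustration, not necessarily one of the algorithms covered in the talk, and the helper names (average_hash, hamming, md5_hex) and the sample images are assumptions made up for this example.

```python
import hashlib


def average_hash(pixels):
    """Toy perceptual hash: one bit per pixel, 1 if brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if p > mean else "0" for p in flat)


def hamming(a, b):
    """Number of positions where two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def md5_hex(pixels):
    """Cryptographic digest of the raw pixel bytes."""
    return hashlib.md5(bytes(p for row in pixels for p in row)).hexdigest()


# Hypothetical 8x8 grayscale images (values 0..255), standing in for real media.
original = [[x * 32 for x in range(8)] for _ in range(8)]         # horizontal gradient
altered = [[min(255, p + 3) for p in row] for row in original]    # slightly brightened copy
different = [[255 if (x + y) % 2 else 0 for x in range(8)] for y in range(8)]  # checkerboard

# A tiny edit produces a different MD5, so exact matching misses the copy.
print(md5_hex(original) == md5_hex(altered))                      # False

# The perceptual signatures stay close for the copy and far for unrelated content.
print(hamming(average_hash(original), average_hash(altered)))     # 0
print(hamming(average_hash(original), average_hash(different)))   # 32
```

A match is then a matter of thresholding the distance: signatures within a few bits of a known "definition" are treated as the same media. The threshold itself is a tuning choice that this sketch does not try to settle.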

Speakers: Alberto Massidda