Data Quality in Wikidata

Wikimania 2019

Within a few years, Wikidata has developed into a central knowledge base for structured data through the collaborative efforts of its peer production community. One benefit of peer production is that knowledge is curated and maintained by a wide range of editors with different cultural and educational backgrounds and levels of experience, which should result in fewer biases and a more diverse knowledge base. Ensuring data quality is thus of utmost importance: the goal of Wikidata is to [[wm2015:Submissions/State_of_Wikidata_-_giving_more_people_more_access_to_more_knowledge_one_edit_at_a_time|“give more people more access to knowledge”]], and the data therefore needs to be “fit for use by data consumers” ([http://mitiq.mit.edu/Documents/Publications/TDQMpub/14_Beyond_Accuracy.pdf Wang et al., 1996]). The Wikidata community has already developed methods and tools that monitor relative completeness (e.g., the Recoin gadget; [http://www.simonrazniewski.com/wp-content/uploads/2018_Wiki-workshop.pdf Balaraman et al., 2018]), encourage link validation and correction (e.g., [[metawiki:Mix'n'match|Mix’N’Match]]), and help editors [[c:File:Wikimania_2018_-_data_quality_in_Wikidata_poster.pdf|observe recent changes and identify vandalism]]. Moreover, the community started a global discussion about relevant dimensions of data quality in a recent [[wikidata:Wikidata:Requests_for_comment/Data_quality_framework_for_Wikidata|RFC]]. The RFC took a survey of Linked Data quality methods as its starting point, in order to better describe and categorize quality issues and to add further quality aspects and dimensions, with the goal of developing a data quality framework for Wikidata. [[Category:2019:Quality submissions]]
Despite this progress, recent research has shown that a Western perspective dominates the languages represented in Wikidata ([https://dl.acm.org/citation.cfm?id=3233965 Kaffee and Simperl, 2018]), so more work is needed to achieve greater knowledge diversity. Supporting such diversity is therefore a major data quality concern: Wikidata should cover a wide variety of topics, drawn from various trustworthy sources, even where the facts they report contradict one another. In this talk, we would like to present '''a classification of existing tools for data quality monitoring and data quality assurance in the context of Wikidata''' (extending [https://docs.google.com/presentation/d/1rwjqzPaHTsXNNqDc2Op1-qSbcFyaFwOSnkEkStp5L3E/edit#slide=id.g15105b408d_0_287 previous work]), drawing the Wikimedia community’s attention to gaps and to opportunities for editors and developers to improve the collaborative data management cycle. Additionally, we will compare data quality management strategies in Wikidata and Wikipedia and summarize scientific findings relevant to the topic.

Speakers: Cristina Sarasua, Mariam Farda-Sarbas, Claudia Müller-Birn, Lydia Pintscher