S136: Centralized interface for extracting big data from web archives - new perspectives for web archive data research

Bibliothekskongress 2022

Session: Authors' Rights and Usage Rights (S136)

Centralized Interface for Extracting Big Data from Web Archives - New Perspectives for Web Archive Data Research
T. Foltýn¹, M. Haškovcová²
¹National Library of the Czech Republic, Director General, Prague, Czech Republic, ²National Library of the Czech Republic, Web Archiving Department, Prague, Czech Republic

Abstract Text: Webarchiv, the Czech web archive of the National Library of the Czech Republic, is a digital library of Czech web resources that has collected more than 400 TB of data in the twenty years of its existence. In response to the growing interest of social scientists in these datasets, the research project "Development of the Centralized Interface for Extracting Big Data from Web Archives" was launched. Its goal was to create a user interface that allows researchers to work with large amounts of data. The National Library of the Czech Republic collaborated on the project with the Department of Cybernetics of the Faculty of Applied Sciences of the University of West Bohemia and with the Institute of Sociology of the Czech Academy of Sciences. The project was accepted into the Ministry of Culture's programme supporting applied research and experimental development of national and cultural identity (NAKI). Webarchiv contributes the large datasets and the expertise related to archiving them, and manages the infrastructure for data clustering, including Hadoop and HBase solutions. The University of West Bohemia designs the technical solutions, such as machine processing of large volumes of data, automatic recognition of information from video and audio files, and analysis and automatic detection of topics in text documents. It uses approaches based on deep neural networks for document classification. Within the project, the Institute of Sociology defines the research requirements, representing the needs of the social science research community, and the software solution is being developed in cooperation with the external company InQool. The outcome of this unique research project, which started in 2018 and will be completed in 2022, is a faceted full-text search engine for analyzing large quantities of web archive data, with an integrated application for exporting selected datasets.
It brings new perspectives on how data from the web archive can be researched and made available to professional researchers.

Speakers: Tomáš Foltýn