
A Distributed Graph Database for Large-Scale Text Analytics

Funding: Fonds National de la Recherche > CORE
Start date: 1 January 2018
End date: 31 December 2020


The World Wide Web (referred to below simply as “the Web”) is the most comprehensive, but likely also the most complex, source of information that we have access to today. More than 95 percent of all information in the so-called surface Web, i.e., the part of the Web that is publicly accessible either as static pages or in the form of dynamically created content, is in fact estimated to consist of text. This textual data is only occasionally interspersed with semi-structured components such as form fields, lists, and tables, or so-called “infoboxes” in Wikipedia. These infoboxes, plus perhaps some additional metadata, nevertheless still constitute the main source of information for all of the currently available Web-extracted knowledge bases, such as DBpedia, YAGO, Freebase, and Wikidata. This means that we in fact exploit only a tiny fraction of the information published on the Web for Information Extraction (IE) purposes.

By exploiting the syntactic and semantic dependencies of information conveyed in Web documents, BigText aims to build a large-scale, distributed graph database of highly interlinked and semantically enriched documents. This database is meant to serve as a basis for high-accuracy retrieval of information, for mining syntactic and semantic relationships among real-world entities, and, more broadly, for a whole line of online analytical tasks that support further text and knowledge mining. In other words, we intend to investigate a radically new approach to information access and retrieval that bridges the three key areas of Information Extraction, Information Retrieval, and Big Data Analytics.