[DBpedia-discussion] Strengthening the NIF dataset extraction

Markus Freudenberg

2017-07-11 12:19:56 UTC

Dear community,

You have probably heard about the newest regular addition to the DBpedia
dataset family, published with the latest release
<http://wiki.dbpedia.org/datasets/dbpedia-version-2016-10>. The NIF
<http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core/nif-core.html>
annotation datasets record the whole wiki text, its basic structure
(sections, titles, paragraphs, etc.) and the included text links. As a
by-product, the abstracts are extracted, along with other useful data (e.g.
all included tables in their raw HTML format and equations in MathML).

While we are happy with the results of this extraction, it can still be
improved. Since I tested mainly on the English and German Wikipedia, the
output for other languages might include HTML artifacts, or the configured
CSS paths might not find the intended elements.

If you are interested in improving this extraction for a given language, you
can contribute by specifying or updating the CSS selectors that point out,
remove or replace HTML elements.
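
To give a rough idea of what a "remove" selector does, here is a minimal
Python sketch. It is not the extraction framework's code or configuration
format. The class SelectorStripper, the function strip_elements and the two
selectors used below are hypothetical and chosen for illustration only. The
sketch supports just the simple "tag.class" selector form and assumes
well-nested HTML.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SelectorStripper(HTMLParser):
    """Collect text while skipping elements matching 'tag.class' selectors.

    Hypothetical helper for illustration; assumes well-nested HTML.
    """

    def __init__(self, remove_selectors):
        super().__init__(convert_charrefs=True)
        # Turn each selector "tag.class" into a (tag, class) pair.
        self.remove = {tuple(s.split(".", 1)) for s in remove_selectors}
        self.skip_depth = 0  # > 0 while inside a removed element
        self.chunks = []

    def _matches(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return any((tag, c) in self.remove for c in classes)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside a removed element, every nested element is skipped too.
        if self.skip_depth or self._matches(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def strip_elements(html, remove_selectors):
    """Return the text of `html` without the elements named by the selectors."""
    parser = SelectorStripper(remove_selectors)
    parser.feed(html)
    parser.close()
    # Collapse whitespace left behind by removed elements.
    return " ".join("".join(parser.chunks).split())


sample = ('<p>Leipzig is a city<sup class="reference">[1]</sup> in Saxony.'
          '<span class="mw-editsection">[edit]</span></p>')
print(strip_elements(sample, ["sup.reference", "span.mw-editsection"]))
```

Without the two selectors, the "[1]" and "[edit]" fragments would end up in
the extracted text, which is the kind of HTML artifact a language-specific
selector list is meant to prevent.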

This wiki page provides more insight into this topic:
https://github.com/dbpedia/extraction-framework/wiki/Improving-CSS-selectors-for-NIF-extraction
Any additional questions can be directed to me.

Markus Freudenberg

Release Manager, DBpedia <http://wiki.dbpedia.org>