NLP with Linguistic Linked Open Data

Linked Data has considerable potential for language learning and Natural Language Processing (NLP). Open Data has become popular in a variety of applications; in linguistics, however, a large share of data is still published in proprietary, closed formats that are not publicly available on the Web. By following Linked Data principles, language resources can be published and interlinked openly on the Web, so that rich language datasets can be stored, connected, and exploited through machine-readable ontologies (Linguistic Linked Open Data). The first principle of Linked Data assigns a unique identifier (URI) to every element of a resource: each entry in a lexicon, each document in a corpus, each token in a corpus, and each data category used for annotation. Every such element thereby becomes globally and unambiguously identifiable. The Resource Description Framework (RDF), a core Semantic Web standard, provides a data model based on labelled, directed multigraphs that can be serialized in different formats, and both lexical-semantic resources and annotated corpora can be modelled as such graphs. A number of RDF-based machine-readable vocabularies and ontologies can be applied directly to linguistic resources, for example to state that two resources are identical with owl:sameAs, to express concept hierarchies with rdfs:subClassOf, to relate vocabulary concepts hierarchically with skos:broader, or to encode linguistic annotations with nif:lemma, nif:word, and nif:sentence. The Web Ontology Language (OWL) is a highly expressive ontology language that supports the definition of axioms to constrain vocabulary implementations through formal data types, and it makes it possible to check a lexicon or an annotated corpus for consistency.
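As a minimal sketch of the ideas above, the following Python snippet gives a sentence and each of its tokens a URI, describes them with a few terms from the NIF Core vocabulary (nif:Sentence, nif:Word, nif:anchorOf, nif:lemma, nif:sentence), and serializes the resulting graph as N-Triples. The example.org document URI, the token naming scheme, and the helper functions are assumptions made purely for illustration; a real project would more likely rely on an RDF library than on hand-built strings.

```python
# NIF Core namespace and rdf:type; the example.org URI below is hypothetical.
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DOC = "http://example.org/corpus/doc1#"  # made-up corpus document


def uri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value):
    """Render a plain string literal with minimal escaping."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def annotate(sentence_id, tokens):
    """Build triples for one sentence; tokens are (surface form, lemma) pairs."""
    sentence = uri(f"{DOC}{sentence_id}")
    triples = [(sentence, uri(RDF_TYPE), uri(NIF + "Sentence"))]
    for index, (form, lemma) in enumerate(tokens, start=1):
        # Every token receives its own URI (illustrative naming scheme).
        token = uri(f"{DOC}{sentence_id}_t{index}")
        triples += [
            (token, uri(RDF_TYPE), uri(NIF + "Word")),
            (token, uri(NIF + "anchorOf"), literal(form)),
            (token, uri(NIF + "lemma"), literal(lemma)),
            (token, uri(NIF + "sentence"), sentence),
        ]
    return triples


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


triples = annotate("s1", [("Dogs", "dog"), ("bark", "bark")])
print(to_ntriples(triples))
```

Each tuple is one labelled, directed edge of the graph. Because every node is a URI, a second dataset could attach further statements to the same tokens, or link them to lexicon entries, without any change to this corpus.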