LOD Datasets


LOD datasets are structured collections of thousands or millions of RDF triples that together cover a field of interest. A dataset collects descriptions of entities within that field, and these descriptions often share a common URI prefix (for example, http://dbpedia.org/resource/). The authors of the largest datasets provide features that make their structured data easy to access, such as downloadable compressed files of the datasets or an infrastructure for efficient querying.

RDF Crawling

Similar to the web crawlers that systematically browse conventional websites for indexing, Semantic Web crawlers browse semantic content to extract structured data and automatically find relationships between seemingly unrelated entities. LOD datasets should be published in a way that makes them reachable through RDF crawling.
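
The link-following idea behind RDF crawling can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `TOY_WEB` is a hypothetical in-memory stand-in for documents that a real crawler would retrieve over HTTP, and `fetch` and `crawl` are hypothetical helper names, not part of any crawler library.

```python
from collections import deque

# Hypothetical in-memory stand-in for the Web: URI -> triples served at that URI.
TOY_WEB = {
    "http://example.org/a": [
        ("http://example.org/a", "http://example.org/knows", "http://example.org/b"),
    ],
    "http://example.org/b": [
        ("http://example.org/b", "http://example.org/knows", "http://example.org/c"),
        ("http://example.org/b", "http://example.org/name", "B"),
    ],
    "http://example.org/c": [],
}


def fetch(uri):
    """Return the triples published at `uri` (an empty list if nothing is served)."""
    return TOY_WEB.get(uri, [])


def crawl(seed, fetch, max_documents=100):
    """Breadth-first crawl: look up a URI, keep its triples, follow URI objects."""
    seen = {seed}
    queue = deque([seed])
    triples = []
    fetched = 0
    while queue and fetched < max_documents:
        uri = queue.popleft()
        fetched += 1
        for s, p, o in fetch(uri):
            triples.append((s, p, o))
            # Only objects that look like HTTP URIs are followed; literals are not.
            if isinstance(o, str) and o.startswith("http") and o not in seen:
                seen.add(o)
                queue.append(o)
    return triples


collected = crawl("http://example.org/a", fetch)
```

Passing `fetch` as a parameter keeps the traversal logic separate from the retrieval step, so a real HTTP-based lookup could be swapped in without changing `crawl`.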

RDF Dumps

The most popular LOD datasets are regularly published as a downloadable compressed file (usually Gzip or bzip2), called an RDF dump, which contains the latest version of the dataset. An RDF dump should be a valid RDF file, and a dump serialized as RDF/XML should be valid XML at the same time. The dump files are compressed because datasets containing millions of RDF triples are quite large. As a rule of thumb, the size of a Gzip-compressed RDF dump is approximately 100 MB per 10 million triples, but it also depends on the RDF serialization of the dataset.
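
As a rough illustration of the size estimate and of working with a compressed dump, the following sketch uses only the Python standard library. It assumes the dump is serialized as N-Triples (one statement per line), and the sample data and the function names `estimated_dump_size_mb` and `count_ntriples` are hypothetical.

```python
import gzip
import io

# Rule of thumb quoted above; the real figure varies with the serialization.
MB_PER_10M_TRIPLES = 100


def estimated_dump_size_mb(triple_count):
    """Rough size in MB of a Gzip-compressed dump with `triple_count` triples."""
    return triple_count / 10_000_000 * MB_PER_10M_TRIPLES


def count_ntriples(fileobj):
    """Count statements in a Gzip-compressed N-Triples stream.

    N-Triples puts one statement on each line, so non-empty,
    non-comment lines are counted without parsing them.
    """
    count = 0
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line and not line.startswith("#"):
                count += 1
    return count


# Tiny hypothetical dump, compressed in memory instead of downloaded.
sample = (
    '<http://example.org/a> <http://example.org/name> "A" .\n'
    "# a comment line\n"
    "<http://example.org/a> <http://example.org/knows> <http://example.org/b> .\n"
)
dump = io.BytesIO(gzip.compress(sample.encode("utf-8")))

triples_in_sample = count_ntriples(dump)
size_for_50m = estimated_dump_size_mb(50_000_000)
```

Reading the stream line by line avoids decompressing the whole dump into memory, which matters for dumps with millions of triples.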

SPARQL Endpoints

Similar to relational database queries in MySQL, the data of semantic datasets can also be retrieved through powerful queries. The query language designed specifically for RDF datasets is called SPARQL (pronounced “sparkle”, a recursive acronym for SPARQL Protocol and RDF Query Language). Some datasets provide a SPARQL endpoint, which is an address where you can run SPARQL queries directly (powered by a back-end database engine and an HTTP/SPARQL server). For example, the SPARQL endpoint of the LOD dataset of Leslie Sikos is http://www.lesliesikos.com/sparql-endpoint/. The four basic query types for retrieving data from a dataset are:

  • SELECT query: extracts raw values from a SPARQL endpoint, where the results are returned in a table format
  • CONSTRUCT query: extracts information from a SPARQL endpoint and transforms the results into RDF
  • ASK query: returns a Boolean (True or False) result for a query on a SPARQL endpoint
  • DESCRIBE query: extracts an RDF graph from a SPARQL endpoint
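
The four query types listed above can be sketched as plain strings, together with the HTTP GET request that would carry each one to an endpoint. This is only an illustration under stated assumptions: the endpoint address `http://example.org/sparql`, the example queries, and the helper `build_request` are hypothetical, the media types are common choices rather than requirements, and nothing is actually sent over the network.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint address

FOAF_PREFIX = "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"

# One hypothetical example per query type.
QUERIES = {
    "SELECT": FOAF_PREFIX + "SELECT ?name WHERE { ?p foaf:name ?name }",
    "CONSTRUCT": FOAF_PREFIX
    + "CONSTRUCT { ?p foaf:nick ?name } WHERE { ?p foaf:name ?name }",
    "ASK": FOAF_PREFIX + "ASK { ?p a foaf:Person }",
    "DESCRIBE": "DESCRIBE <http://example.org/a>",
}

# SELECT and ASK return result sets or Booleans; CONSTRUCT and DESCRIBE return RDF.
ACCEPT = {
    "SELECT": "application/sparql-results+json",
    "ASK": "application/sparql-results+json",
    "CONSTRUCT": "text/turtle",
    "DESCRIBE": "text/turtle",
}


def build_request(form):
    """Return (url, headers) for a GET request that passes the query
    in the `query` parameter, as the SPARQL Protocol specifies."""
    url = ENDPOINT + "?" + urlencode({"query": QUERIES[form]})
    return url, {"Accept": ACCEPT[form]}


ask_url, ask_headers = build_request("ASK")
```

The returned URL and headers could then be handed to any HTTP client; which result formats an endpoint actually supports depends on the server.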

The WHERE block is used in SPARQL queries to restrict the query (in DESCRIBE queries, WHERE is optional). For example, to read the names of every person in a dataset, you can perform the following query:


PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name
WHERE {
  ?person a foaf:Person.
  ?person foaf:name ?name.
}
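
To make the meaning of the two triple patterns in this query concrete, the following sketch evaluates them by hand over a small in-memory list of triples. It is not a SPARQL engine; the toy dataset and the function `select_person_names` are hypothetical and only mimic how the shared variable ?person joins the patterns.

```python
# The keyword "a" in SPARQL abbreviates the rdf:type property.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical toy dataset of (subject, predicate, object) triples.
triples = [
    ("http://example.org/alice", RDF_TYPE, FOAF + "Person"),
    ("http://example.org/alice", FOAF + "name", "Alice"),
    ("http://example.org/acme", RDF_TYPE, FOAF + "Organization"),
    ("http://example.org/acme", FOAF + "name", "Acme"),
    ("http://example.org/bob", RDF_TYPE, FOAF + "Person"),
    ("http://example.org/bob", FOAF + "name", "Bob"),
]


def select_person_names(triples):
    """Mimic: SELECT ?name WHERE { ?person a foaf:Person. ?person foaf:name ?name. }"""
    # Pattern 1: ?person a foaf:Person
    persons = {s for s, p, o in triples if p == RDF_TYPE and o == FOAF + "Person"}
    # Pattern 2: ?person foaf:name ?name, joined with pattern 1 on ?person
    return [o for s, p, o in triples if p == FOAF + "name" and s in persons]


names = select_person_names(triples)
```

The organization has a name as well, but it is left out because it does not match the first pattern, which is how the WHERE block restricts the result.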

The Largest Linked Datasets

LOD Dataset Collections

LOD datasets can be registered and managed using datahub.io, an open data registry. datahub.io is used by governments, research institutions, and other organizations. Powered by structured data, datahub.io provides efficient search and faceting, browsing of user data, and data previews using maps, graphs, and tables.

The LOD Cloud Diagram

The LOD Cloud Diagram represents datasets with at least 1,000 RDF triples and the links between them. The size of each bubble corresponds to the amount of data stored in the dataset. In the middle of the cloud you can see the largest datasets, DBpedia and GeoNames, followed by the W3C.

LOD Cloud Diagram
