Structured collections of thousands or millions of RDF triples that cover a field of interest are called LOD datasets. Each LOD dataset collects descriptions of entities within its field of interest, and these descriptions often share a common URI prefix (for example, http://dbpedia.org/resource/). The authors of the largest datasets provide features that make their structured data easy to access, such as downloadable compressed files of the dataset or an infrastructure for efficient querying.
RDF Crawling
Similar to the web crawlers that systematically browse conventional web sites for indexing, Semantic Web crawlers browse semantic contents to extract structured data and automatically find relationships between seemingly unrelated entities. LOD datasets should be published so that they are available through RDF crawling.
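The crawling idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: instead of dereferencing URIs over HTTP, it uses a toy in-memory "web" in which each URI maps to the triples its document would contain. The TOY_WEB data, the example.org URIs, and the crawl function are all hypothetical.

```python
from collections import deque

FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical stand-in for dereferencing URIs over HTTP:
# each URI maps to the triples its document would contain.
TOY_WEB = {
    "http://example.org/alice": [
        ("http://example.org/alice", FOAF + "knows", "http://example.org/bob"),
    ],
    "http://example.org/bob": [
        ("http://example.org/bob", FOAF + "knows", "http://example.org/carol"),
        ("http://example.org/bob", FOAF + "name", "Bob"),
    ],
    "http://example.org/carol": [],
}


def crawl(seed, max_documents=100):
    """Breadth-first crawl from a seed URI; returns (triples, visited URIs)."""
    visited = set()
    queue = deque([seed])
    triples = []
    while queue and len(visited) < max_documents:
        uri = queue.popleft()
        if uri in visited:
            continue
        visited.add(uri)
        for s, p, o in TOY_WEB.get(uri, []):
            triples.append((s, p, o))
            # Follow only objects that look like URIs, not literals.
            if o.startswith("http://") and o not in visited:
                queue.append(o)
    return triples, visited


triples, visited = crawl("http://example.org/alice")
```

Starting from a single seed, the crawler reaches carol only through the link published in bob's description, which is how links between datasets let a crawler discover relationships between entities.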
RDF Dumps
The most popular LOD datasets are regularly published as a downloadable compressed file (usually Gzip or bzip2), called an RDF dump, which contains the latest version of the dataset. RDF dumps must be valid RDF, and dumps serialized as RDF/XML must also be well-formed XML. RDF dump files are compressed because datasets containing millions of RDF triples are quite large. A Gzip-compressed RDF dump takes approximately 100 MB per 10 million triples, although the exact size depends on the RDF serialization of the dataset.
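As a rough illustration, the following Python sketch applies the rule of thumb above and counts the triples in a Gzip-compressed dump. It assumes the dump uses the N-Triples serialization (one triple per line), which is only one of several possibilities; the function names and the sample data are hypothetical.

```python
import gzip
import io

# Rule of thumb from the text: about 100 MB (Gzip) per 10 million triples.
MB_PER_10M_TRIPLES = 100


def estimate_dump_size_mb(triple_count):
    """Rough Gzip-compressed dump size in MB for a given triple count."""
    return triple_count / 10_000_000 * MB_PER_10M_TRIPLES


def count_triples(gz_stream):
    """Count statements in a Gzip-compressed N-Triples stream."""
    count = 0
    with gzip.open(gz_stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            # Skip blank lines and N-Triples comment lines.
            if line and not line.startswith("#"):
                count += 1
    return count


# Build a tiny in-memory dump so the sketch runs without any files.
SAMPLE = (
    '<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .\n'
    "# a comment line\n"
    '<http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Bob" .\n'
)
sample_dump = io.BytesIO(gzip.compress(SAMPLE.encode("utf-8")))
triple_count = count_triples(sample_dump)
```

Reading the stream line by line keeps memory use flat, which matters for real dumps that do not fit in memory once decompressed.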
SPARQL Endpoints
Just as relational databases such as MySQL are queried with SQL, semantic datasets can be queried with a dedicated query language. The query language designed specifically for RDF datasets is called SPARQL (pronounced “sparkle”, a recursive acronym for SPARQL Protocol and RDF Query Language). Some datasets provide a SPARQL endpoint, which is an address at which you can run SPARQL queries directly (powered by a back-end database engine and an HTTP/SPARQL server). For example, the SPARQL endpoint of the LOD dataset of Leslie Sikos is http://www.lesliesikos.com/sparql-endpoint/. The four basic query types for retrieving data from a dataset are:
SELECT query: extracts raw values from a SPARQL endpoint and returns the results in a table format
CONSTRUCT query: extracts information from a SPARQL endpoint and transforms the results into RDF
ASK query: returns a Boolean (true or false) result for a query on a SPARQL endpoint
DESCRIBE query: extracts an RDF graph from a SPARQL endpoint
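The difference between the four query types shows up in the result format a client asks for. The sketch below prepares, but does not send, an HTTP GET request in the style of the SPARQL Protocol using only the Python standard library. The endpoint URL http://example.org/sparql is hypothetical, and the choice of JSON results and Turtle is an assumption; a given endpoint may support other formats.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Media types by query type. SELECT and ASK return SPARQL results
# (a table or a Boolean); CONSTRUCT and DESCRIBE return RDF graphs.
ACCEPT_BY_FORM = {
    "SELECT": "application/sparql-results+json",
    "ASK": "application/sparql-results+json",
    "CONSTRUCT": "text/turtle",
    "DESCRIBE": "text/turtle",
}


def build_request(endpoint, query_form, query):
    """Prepare (but do not send) a GET request carrying a SPARQL query."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": ACCEPT_BY_FORM[query_form]})


request = build_request(
    "http://example.org/sparql",  # hypothetical endpoint
    "ASK",
    "ASK { ?s ?p ?o }",
)
```

Passing the prepared request to urllib.request.urlopen would send it; that step is left out here so the sketch runs without network access.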
The WHERE block is used in SPARQL queries to restrict the query (in DESCRIBE queries, WHERE is optional). For example, to read the names of every person in a dataset, you can perform the following query:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name
WHERE {
  ?person a foaf:Person .
  ?person foaf:name ?name .
}
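To make the matching in the WHERE block concrete, here is a minimal Python sketch that evaluates the same two triple patterns by hand over a toy list of triples. It is not how a SPARQL engine is implemented; the TRIPLES data, the example.org URIs, and the select_person_names function are hypothetical.

```python
# In SPARQL, the keyword "a" abbreviates the rdf:type predicate.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical toy dataset of (subject, predicate, object) triples.
TRIPLES = [
    ("http://example.org/alice", RDF_TYPE, FOAF + "Person"),
    ("http://example.org/alice", FOAF + "name", "Alice"),
    ("http://example.org/bob", RDF_TYPE, FOAF + "Person"),
    ("http://example.org/bob", FOAF + "name", "Bob"),
    ("http://example.org/acme", RDF_TYPE, FOAF + "Organization"),
    ("http://example.org/acme", FOAF + "name", "Acme"),
]


def select_person_names(triples):
    """Return the ?name bindings of the SELECT query above."""
    # Pattern 1: ?person a foaf:Person .
    persons = {s for s, p, o in triples if p == RDF_TYPE and o == FOAF + "Person"}
    # Pattern 2: ?person foaf:name ?name .  (joined on the shared ?person)
    return [o for s, p, o in triples if p == FOAF + "name" and s in persons]


names = select_person_names(TRIPLES)
```

The organization has a foaf:name too, but it is excluded because the shared ?person variable must satisfy both patterns at once.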
The Largest Linked Datasets
LOD Dataset Collections
LOD datasets can be registered and managed using datahub.io, an open data registry used by governments, research institutions, and other organizations. Powered by structured data, datahub.io provides efficient search and faceting, browsing of user data, and previews of data as maps, graphs, and tables.
The LOD Cloud Diagram
The LOD Cloud Diagram represents datasets with at least 1,000 RDF triples and the links between them. The size of each bubble corresponds to the amount of data stored in the dataset. In the middle of the cloud you can see the largest datasets, DBpedia and GeoNames, followed by the W3C.