Semantic Web: Machine-Readable, Structured Data With Meaningful Annotations

Until recently, software agents could not handle many kinds of information that could have been associated with files. File structure and extensions provided some information about files, but much could not be expressed. For example, a file with a .jpg extension has always represented a JPEG image, but it carried no information about the shutter speed, exposure program, F-stop, aperture, ISO speed rating, or focal length until the introduction of metadata formats such as Exif and XMP. However, embedding metadata in binary files is still not the most efficient way to share it, especially when the metadata is more generic. In the digital era, electronic files are being sold (e-books, MP3 files, and so on) that might be retrieved or played on many types of devices. A variety of metadata technologies can be used to express arbitrary information and represent any kind of knowledge associated with electronic documents in a machine-readable format.

Machine-readable data (automated data) is data stored in a format that automated software agents can access and process without human intervention. To browsers, web documents traditionally consisted of human-readable data only; in effect, information was conflated with the containers that held it. In contrast to the conventional Web (the “Web of documents”), the Semantic Web is the “Web of data.” The Semantic Web provides machine-processable data, making it possible for software agents to “understand” the meaning of information (in other words, semantics) presented by web documents. This capability can be used by a variety of services, such as museum sites, community sites, or podcasting.

Note that the word semantic is used on the Web in other contexts as well. For example, HTML5 supports semantic (in other words, meaningful) structuring elements, but this expression refers to the “meaning” of elements. In this context, the word semantic contrasts the “meaning” of elements, such as that of section (a thematic grouping), with the generic elements of older HTML versions, such as the “meaningless” div. The semantics of markup elements should not be confused with the semantics (in other words, machine-processability) of metadata annotations and web ontologies used on the Semantic Web. The latter can provide far more sophisticated data than the meaning of a markup element.

Conventional web documents can be extended with additional data that adds meaning to them rather than structure alone. The Semantic Web is an approach built on this idea. As early as 2001, Tim Berners-Lee described the reason for the existence of the Semantic Web. On the Semantic Web, data can be retrieved automatically from seemingly unrelated fields in order to combine them, find relations, and make discoveries. The Semantic Web should be considered an extension of the conventional Web.

Two terms are frequently associated with the Semantic Web, although neither of them has a clear definition: Web 2.0 and Web 3.0. Web 2.0 is an umbrella term used for a collection of technologies that form the second generation of the Web, such as Extensible Markup Language (XML), Asynchronous JavaScript and XML (Ajax), Really Simple Syndication (RSS), and Session Initiation Protocol (SIP). They are the underlying technologies and standards behind instant messaging, Voice over IP, wikis, blogs, forums, and syndication. The next generation of web services is increasingly denoted as Web 3.0, an umbrella term usually referring to customization, semantic content, and more sophisticated web applications that move toward Artificial Intelligence (AI), including computer-generated content.

The Semantic Web is a major aspect of Web 2.0 and Web 3.0. Web 3.0 can be considered a superset of the Semantic Web that features social connections and personalization. Several technologies contribute to the sharing of such information instead of web pages alone, and the number of Semantic Web applications is constantly increasing.

On the Semantic Web, there is a variety of structured data, usually expressed in, or based on, the Resource Description Framework (RDF). Similar to conventional conceptual modeling approaches, such as class diagrams and entity-relationship models, the RDF data model is based on statements that describe resources, especially web resources, in the form of subject-predicate-object expressions. The subject corresponds to the resource. The predicate expresses a relationship between the subject and the object. The object is the value or other resource to which the subject is related. Such expressions are called triples. For example, the statement “The sky is blue” can be expressed in an RDF triple as follows:

  • Subject: “The sky”
  • Predicate: “is”
  • Object: “blue”

RDF is an abstract model that has several serialization formats. Consequently, the syntax of the triple varies from format to format. Keep in mind that RDF is a concept, not a syntax.

The authors of the “conventional” Web usually publish unstructured data, because they do not know about the power of structured data, find RDF too complex, or do not know how to create and publish RDF in any of its serialization formats. The following technologies address this problem by adding structured data to conventional (X)HTML markup, which can then be extracted by appropriate software and converted to RDF:

  • Microformats, which reuse markup attributes
  • Microdata, which extends HTML5 markup with structured metadata
  • RDFa (RDF in attributes), which expresses RDF in markup attributes that are not part of (X)HTML vocabularies
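
The extraction step can be sketched for the microdata case as follows. This is a simplified illustration using Python's standard html.parser module, not a conforming microdata processor: it assumes a single, non-nested item that declares an itemtype, reads property values from element text only, and derives the vocabulary from the itemtype by a shortcut. The sample markup, the subject IRI, and the names MicrodataCollector and microdata_to_triples are invented for the example.

```python
from html.parser import HTMLParser


class MicrodataCollector(HTMLParser):
    """Collect itemprop values from a single, non-nested microdata item."""

    def __init__(self):
        super().__init__()
        self.itemtype = None   # value of the first itemtype attribute seen
        self.properties = []   # (property name, text value) pairs
        self._pending = None   # itemprop still waiting for its text content

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if self.itemtype is None and attributes.get("itemtype"):
            self.itemtype = attributes["itemtype"]
        self._pending = attributes.get("itemprop")

    def handle_data(self, data):
        text = data.strip()
        if self._pending and text:
            self.properties.append((self._pending, text))
            self._pending = None

    def handle_endtag(self, tag):
        self._pending = None


def microdata_to_triples(html, subject):
    """Turn the collected properties into (subject, predicate, object) tuples."""
    collector = MicrodataCollector()
    collector.feed(html)
    collector.close()
    # Simplification: everything up to the last "/" of itemtype is the vocabulary.
    vocabulary = collector.itemtype.rsplit("/", 1)[0] + "/"
    return [(subject, vocabulary + name, value)
            for name, value in collector.properties]


SAMPLE = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
  <span itemprop="jobTitle">Editor</span>
</div>
"""

# The subject IRI is hypothetical; the sample markup is invented for illustration.
for triple in microdata_to_triples(SAMPLE, "http://example.org/people/jane"):
    print(triple)
```

Microformats and RDFa would need different attribute handling (class and rel values, or attributes such as property and typeof), but the overall shape of the extraction is similar.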

All data controlled by conventional web applications are kept by the applications themselves, making a significant share of data and their relationships virtually unavailable for automated processing. Semantic Web applications, on the other hand, can access this data through the general web architecture and transfer structured data between applications and web sites. Semantic Web technologies can be applied in a wide variety of areas, such as web search, data integration, resource discovery and classification, cataloging, intelligent software agents, content rating, and intellectual property rights descriptions. A much wider range of tasks can be performed on Semantic Web pages than on conventional ones; for example, relationships between data and even sentences can be processed automatically, and with much higher efficiency.

A very promising approach provides direct mapping of relational data to RDF, making it possible to share the data of relational databases on the Semantic Web. Since relational databases are extremely popular in computing, databases that have been stored on local hard drives up to now can be shared on the Semantic Web. Commercial RDF database software packages are already available on the market (5Store, AllegroGraph, BigData, Oracle, OWLIM, Talis Platform, Virtuoso, and so on). Semantic tools can also be used in a variety of other areas, including business process modeling or diagnostic applications.
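
The idea of mapping relational data directly to RDF can be illustrated with a small sketch: each row becomes a subject, each column becomes a predicate, and each cell value becomes an object. The example below uses an in-memory SQLite database; the base IRI, the person table, and the IRI patterns are assumptions loosely modeled on the general idea, not a complete implementation of any standardized mapping. The table name is interpolated into the query, so it must come from a trusted source.

```python
import sqlite3

BASE = "http://example.org/db/"  # hypothetical base IRI


def table_to_triples(conn, table, key_column):
    """Map each row to triples: one subject per row, one predicate per column."""
    cursor = conn.execute(f'SELECT * FROM "{table}" ORDER BY "{key_column}"')
    columns = [description[0] for description in cursor.description]
    triples = []
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"{BASE}{table}/{key_column}={record[key_column]}"
        for column, value in record.items():
            if value is not None:  # NULL values produce no triple
                triples.append((subject, f"{BASE}{table}#{column}", value))
    return triples


# Invented sample data for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, zip TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(1, "Alice", "90210"), (2, "Bob", None)],
)

triples = table_to_triples(conn, "person", "id")
for triple in triples:
    print(triple)
```

A fuller mapping would also emit a type statement per row and turn foreign keys into links between row subjects; both are left out here to keep the sketch short.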

Structured Data

Data should be structured to support advanced processability and searchability by data type. Structured data is data organized according to a defined structure so that its individual items are identifiable. Such data has been used for decades in computing, for example in Access and SQL databases, where queries can be performed to retrieve a specific piece of information (for example, a ZIP code). In contrast to relational databases, most data on the Web is stored in (X)HTML documents that contain unstructured data. Conventional web documents contain large amounts of unstructured data that can be rendered in web browsers. This approach works satisfactorily for publishing purposes; however, a large amount of data stored in, or associated with, web documents cannot be processed this way. According to Berners-Lee, the data used to describe social connections between people is a good example of that kind of data:

“The Web is more a social creation than a technical one. I designed it for a social effect - to help people work together - and not as a technical toy. The ultimate goal of the Web is to support and improve our weblike existence in the world. We clump into families, associations, and companies. We develop trust across the miles and distrust around the corner. What we believe, endorse, agree with, and depend on is representable and, increasingly, represented on the Web. We all have to ensure that the society we build with the Web is of the sort we intend.”

Linked Open Data

Linked Data (also known as Linking Data) can be applied to improve the exploitation of the “Web of data.” The expression refers to the publishing of structured data in such a way that typed links are created between data from different sources to provide a higher level of usability. By using Linked Data, it is possible to find other, related data. Structured data should meet four requirements to be called Linked Data:

  • URIs should be assigned to all entities of the dataset.
  • HTTP URIs are required to ensure that all entities can be referenced and cited by users and user agents.
  • Entities should be described using standard formats such as RDF/XML.
  • Links should be created to other, related entity URIs.
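
Some of these requirements can be checked mechanically. The following sketch, a rough illustration rather than a validator, tests a list of (subject, predicate, object) tuples for HTTP URIs and for at least one link that leaves the dataset. The third requirement concerns the publishing format and is not checked here. The function names and the example.org dataset prefix are assumptions.

```python
from urllib.parse import urlparse


def is_http_uri(value):
    """True if value is a string that parses as an absolute http(s) URI."""
    if not isinstance(value, str):
        return False
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def linked_data_problems(triples, dataset_prefix):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    for subject, predicate, _ in triples:
        # Requirements 1 and 2: entities are named with HTTP URIs.
        for term in (subject, predicate):
            if not is_http_uri(term):
                problems.append(f"not an HTTP URI: {term}")
    # Requirement 4: at least one object is an HTTP URI outside this dataset.
    if not any(is_http_uri(o) and not o.startswith(dataset_prefix)
               for _, _, o in triples):
        problems.append("no link to an entity outside the dataset")
    return problems


# Hypothetical dataset under example.org with one outgoing link.
PREFIX = "http://example.org/"
city = PREFIX + "city/berlin"
good = [
    (city, "http://www.w3.org/2000/01/rdf-schema#label", "Berlin"),
    (city, "http://www.w3.org/2002/07/owl#sameAs", "http://dbpedia.org/resource/Berlin"),
]
bad = [("urn:x:berlin", "http://www.w3.org/2000/01/rdf-schema#label", "Berlin")]

print(linked_data_problems(good, PREFIX))
print(linked_data_problems(bad, PREFIX))
```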

All data that fulfill these requirements and are released to the public are called Linked Open Data (LOD). The variety of datasets published as Linked Data is represented by the LOD cloud diagram. The image collects the datasets published according to the Linked Data principles and represents links between them. The size of the bubbles corresponds to the number of triples stored in each dataset. Contributors include the Linking Open Data community project, individuals, and organizations.

Different Approaches - Different Annotations And Syntaxes

Metadata is structured data describing the features and content of web sites. The meta tags written in (X)HTML head sections, which do not require additional technologies, can be used to describe general data about web pages. Semantic, machine-readable labels can be provided as attribute values of (X)HTML or XML elements by microdata, microformats, or RDFa. There are several metadata technologies, and many apply different annotations. For example, the description of a person can be expressed in RDFa, microdata, the hCard microformat (based on vCard), and further vocabularies such as FOAF or DOAC. Special metadata such as licensing can also be provided with different notations. The license of an image can differ from the license of the web page containing it. Providing license metadata can be beneficial to every web site, especially those whose own copyright terms differ from those of the user content they host, such as image-sharing portals like Flickr. Image licenses can be provided in basic markup, microdata, the rel="license" microformat, and RDFa.
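
To illustrate the two simplest annotations mentioned here, conventional meta tags and the rel="license" microformat, the following sketch collects both from a page with Python's standard html.parser module. The sample page and the class name PageMetadata are invented for the example, and the parser only looks at meta elements with a name attribute and at a and link elements carrying rel="license".

```python
from html.parser import HTMLParser


class PageMetadata(HTMLParser):
    """Collect <meta name=... content=...> pairs and rel="license" link targets."""

    def __init__(self):
        super().__init__()
        self.meta = {}      # meta name -> content
        self.licenses = []  # href values of license links, in document order

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (tag == "meta" and attributes.get("name")
                and attributes.get("content") is not None):
            self.meta[attributes["name"].lower()] = attributes["content"]
        # rel holds a space-separated list of link types, so split before testing.
        rel_values = (attributes.get("rel") or "").lower().split()
        if tag in ("a", "link") and "license" in rel_values and attributes.get("href"):
            self.licenses.append(attributes["href"])


# Invented sample page for illustration.
SAMPLE = """
<html><head>
  <meta name="description" content="Holiday photos">
  <meta name="author" content="Jane Doe">
</head><body>
  <img src="beach.jpg" alt="Beach">
  <a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a>
</body></html>
"""

page = PageMetadata()
page.feed(SAMPLE)
page.close()
print(page.meta)
print(page.licenses)
```

Note that a plain rel="license" link does not say whether the license applies to the page or to a particular image on it; expressing that distinction is where richer annotations such as RDFa become useful.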

Several metadata technologies can be written in a variety of syntaxes. In the case of microformats, for example, the syntax differs depending on the markup language they are embedded into. In other cases, alternative syntaxes exist to reduce complexity (for example, the various RDF syntaxes).

There are many machine-readable metadata annotations, semantically meaningful attributes, vocabularies, schemes, and ontologies available, including but not limited to the following:

  • General metadata in the markup: Conventional meta tags
  • Microformats: Metadata provided as attribute values of markup elements
  • Microdata: A metadata annotation for general metadata embedding in HTML5
  • RDF: A standardized framework for Semantic Web data models
  • OWL: A knowledge representation language for describing and sharing web ontologies that formally represent knowledge as a set of concepts within a domain and the relationships between those concepts
  • FOAF and DOAC: Machine-readable ontologies for people and their professional capabilities
  • XMP, Rich Snippets, SearchMonkey RDFa: Metadata formats for images and video clips

After gaining popularity on large-scale industrial portals and especially online community portals, some features of the Semantic Web, together with personalization, are now ubiquitous. The variety of metadata annotations can significantly extend the possibilities of web documents. They can also considerably improve the effectiveness of web searches. Good examples are HTML5 microdata and RDFa, both of which can be retrieved by Google as Rich Snippets. RDF would be one of the best choices to add structure to the Web and change conventional search engines that apply brute-force approaches.