Conventional web sites rely on markup languages for document structure, style sheets for appearance, and scripts for behavior, but their content is readable by humans only. When searching for “Jaguar” on the Web, for example, traditional search engine algorithms cannot always tell the difference between the British luxury car and the South American predator.
A typical web page contains structuring elements, formatted text, and, in some cases, multimedia objects. By default, the headings, texts, links, and other web site components created by the web designer are meaningless to computers. While browsers can display web documents based on the markup, only the human mind can interpret the meaning of the information, so there is a huge gap between what computers and humans understand. Even if alternate text is specified for images (an alt attribute with a descriptive value on img elements), the data is not structured or linked to related data, and the human-readable words of conventional web page paragraphs are not associated with any particular software syntax or structure. Without context, the information provided by web sites can be ambiguous to search engines.
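The gap described above can be illustrated with a minimal sketch using Python's standard html.parser module. The page fragment, the file name jaguar.jpg, and the class name MarkupView are hypothetical and serve illustration only. The parser recovers element names, attributes, and text, but nothing in its output says whether "Jaguar" refers to a car or an animal, or what the number 80 measures.

```python
from html.parser import HTMLParser


class MarkupView(HTMLParser):
    """Collects what a generic parser can see: tags, attributes, and text."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # Attributes arrive as (name, value) pairs of plain strings.
        self.seen.append((tag, dict(attrs)))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.seen.append(("text", text))


# Hypothetical page fragment, made up for this illustration.
page = (
    '<h1>Jaguar</h1>'
    '<img src="jaguar.jpg" alt="A jaguar at rest">'
    '<p>Top speed: 80</p>'
)

view = MarkupView()
view.feed(page)
view.close()

for item in view.seen:
    # Every item is structure plus strings; no item carries meaning.
    print(item)
```

Even the alternate text is just another string to the program: it is not typed, not linked to any related data, and not tied to a definition of what a jaguar is.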
The concept of machine-readable data is not new, and it is not limited to the Web. Think of credit cards or barcodes, both of which contain human-readable and machine-readable data. A single person or product, however, can have more than one identifier, which can cause ambiguity.
Even well-formed XML documents, which follow rigorous syntax rules, have serious limitations when it comes to machine-processability. For instance, if an XML entity is defined between
Content can be made machine-processable and unambiguous by adding organized (structured) data to web sites, either as an extension of the markup or as dedicated external metadata files, and by linking it to other, related structured datasets. Structured data files support a much wider range of tasks than conventional web sites and are far more efficient to process. For example, consider the number 87 in a movie description, which represents the running time of the movie. If the description is written as plain text, 87 is a meaningless string for computers. If it is expressed in XML and declared as a positive integer, computers will treat it as a positive integer rather than as a sequence of consecutive alphanumeric characters, and can perform calculations with it. However, without structured data, such as RDF data, 87 still has no meaning. Since numbers can appear in ISBNs, phone numbers, and so on, a formal definition of the corresponding property semantically enriches the description, which enables complex automated tasks.
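The three stages of the running time example can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not a full RDF implementation: the subject and property IRIs under http://example.org/ are hypothetical, the Triple class and helper functions are made up for this example, and only the XML Schema datatype IRI for positiveInteger is a real, standardized identifier. The statement is serialized in N-Triples syntax, one of the standard RDF serializations.

```python
from typing import NamedTuple

# Real datatype IRI from XML Schema; everything under example.org is hypothetical.
XSD_POSITIVE_INTEGER = "http://www.w3.org/2001/XMLSchema#positiveInteger"
MOVIE = "http://example.org/movie/1"
RUNNING_TIME = "http://example.org/vocab#runningTimeInMinutes"


class Triple(NamedTuple):
    """A single statement: subject, property, and a typed literal value."""
    subject: str
    predicate: str
    value: str
    datatype: str


def to_ntriples(t: Triple) -> str:
    # Serialize as one N-Triples line (no string escaping in this sketch).
    return f'<{t.subject}> <{t.predicate}> "{t.value}"^^<{t.datatype}> .'


def typed_value(t: Triple):
    # The datatype tells the program how to interpret the lexical form.
    if t.datatype == XSD_POSITIVE_INTEGER:
        number = int(t.value)
        if number <= 0:
            raise ValueError("not a positive integer")
        return number
    return t.value


# Stage 1: plain text. "87" is only a substring with no type and no role.
plain_text = "A short comedy, 87, suitable for all ages."

# Stages 2 and 3: the value is typed, and the property says what it measures.
running_time = Triple(MOVIE, RUNNING_TIME, "87", XSD_POSITIVE_INTEGER)

print(to_ntriples(running_time))
print(typed_value(running_time) + 3)  # arithmetic is possible on the typed value
```

The datatype alone corresponds to the XML stage: it permits calculations but does not say what is being counted. The property IRI is what distinguishes a running time from an ISBN fragment or a phone number, provided the property is formally defined in a vocabulary that software can look up.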