In HTML5, developers can choose between two syntaxes, since the language can be written either in HTML or in XML syntax (HTML5 and XHTML5, respectively). XHTML5 is the XML serialization of HTML5, and its syntax is described by the HTML5 specification. It should be kept in mind, however, that XHTML5 is an application of XML. In other words, HTML5 and XHTML5 have identical vocabulary (the same set of elements and attributes) but different parsing rules. HTML5 documents might also be well-formed XML documents. Such markup is often referred to as “polyglot” markup: the overlap language of documents that are HTML5 and XML documents at the same time. The HTML5 and XHTML5 serializations are largely cross-compatible, but XHTML5 has a stricter syntax, and some XML features, such as processing instructions, are not valid in HTML5.
Documents served with an XML MIME type, such as application/xhtml+xml, are treated as XML documents by browsers, i.e., they are parsed by an XML processor. It is important to keep in mind that XML and HTML are processed differently. Even a minor syntax error prevents an XML document (or a document that claims to be XML) from being rendered correctly, whereas the HTML parser recovers from the same error and continues. A parsing error in an XML document can easily result in a “Yellow Screen of Death”.
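The difference in error handling can be sketched with Python's standard library. This is only an approximation under stated assumptions: html.parser is a lenient tokenizer rather than a full browser HTML5 parser, and the names BROKEN, TagCollector, and parses_as_xml are illustrative, not taken from any specification.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# A fragment with an unclosed <b> element: tolerable in HTML, fatal in XML.
BROKEN = "<p>Some <b>bold text</p>"


class TagCollector(HTMLParser):
    """Records every start tag the lenient HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parses_as_xml(markup):
    """Return True if an XML processor accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


collector = TagCollector()
collector.feed(BROKEN)
collector.close()

html_tags = collector.tags       # the HTML tokenizer keeps going past the error
xml_ok = parses_as_xml(BROKEN)   # the XML processor stops at the mismatched tag
```

Here the HTML side still reports both start tags, while the XML side rejects the whole fragment, which is roughly what a browser does when the same markup is served as application/xhtml+xml.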
Syntax And Restrictions
While most HTML elements could always be used in the corresponding XHTML 1.0 flavor (HTML 4.01 Transitional elements in XHTML 1.0 Transitional, and HTML 4.01 Strict elements in XHTML 1.0 Strict), some elements introduced in the XHTML specifications applied to XHTML exclusively. The difference between the HTML and XHTML vocabularies disappeared completely with the introduction of the latest markup versions, HTML5 and XHTML5, which have exactly the same elements and attributes. XHTML5 can nevertheless be regarded as the most rigorous of these markup languages. While some developers consider XHTML too verbose, it is not only stricter but also more precise than HTML5. The major differences between HTML5 and XHTML5 can be summarized as follows.
- Well-formedness is required. All elements must be closed. Nesting should be done in the proper order. Overlapping elements are incorrect in XHTML5.
- Names are in lowercase. Since XML is case-sensitive, all XHTML5 element and attribute names must be in lowercase.
- End tags are required. In HTML5, the end tag of several elements can be omitted, which is not allowed in XHTML5. All elements that are declared in the specification as empty elements (such as meta, link, br, hr, img, and input) can be closed either by an end tag (similar to nonempty elements) or by the shorthand notation, in which a space and a slash character are inserted before the closing angle bracket. Tags written in the shorthand notation are also known as self-closing tags. In XHTML5, all unterminated elements are incorrect, including unterminated empty elements. The script element can be written either in the full form (with the end tag) or in the shorthand notation, depending on its attributes and on whether it has inline content; the full form is the safer choice if the document might also be processed by an HTML parser.
- Attribute values must be quoted and all attributes must include values in XHTML5. Unquoted attribute values are not allowed in XHTML5.
- Attribute minimization is forbidden. Attribute-value pairs must be written in full. Attribute names such as compact and checked cannot be used in elements without specifying their values (for example, checked="checked" rather than checked alone).
- Whitespace handling differs in XHTML5. In contrast to HTML5, whitespace characters in XHTML5 attribute values are normalized by the XML processor. According to the XML specification, each whitespace character (#x20, #xD, #xA, #x9) in an attribute value is replaced by a single space (#x20); for attribute types other than CDATA, leading and trailing spaces are also stripped and sequences of spaces are collapsed to a single interword space.
- Script and style elements in XHTML5 are processed differently than in HTML5. While the content type of the script and style elements is character data (CDATA) in HTML, it is parsed character data (#PCDATA) in XHTML5. Consequently, < and & are handled as the beginning of markup, while references such as &lt; and &amp; are recognized as entities. Wrapping the script or style content in a CDATA section avoids this: XML processors recognize CDATA sections and pass their contents through as character data. They are represented as nodes in the Document Object Model (DOM). Alternatively, external script files and style sheet files can be used, eliminating the need for unescaped script or style contents.
- Identifiers must be declared by the id attribute. XHTML documents must use the id attribute when defining fragment identifiers on markup elements.
- Element prohibitions apply. In XHTML5, elements cannot be nested arbitrarily. Those who are not familiar with XHTML5 often commit nesting errors. The nesting rules should not be confused with overlapping, which is strictly forbidden in XHTML5. Unlike in HTML5, texts cannot be provided directly in the XHTML5 body without wrapping them in container elements (such as p).
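- Most special characters should be written directly in the markup instead of using named character entities. An XML processor that has no DTD to consult recognizes only the five predefined entities (&lt;, &gt;, &amp;, &quot;, and &apos;), although numeric character references remain available. Using characters directly with UTF-8 encoding is strongly recommended.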
- Dashes in comments are limited. Double dashes can be provided only at the beginning and end of XHTML comments.
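Several of the rules above can be checked mechanically with a generic XML processor. The following is a minimal sketch using Python's xml.etree.ElementTree; the helper well_formed and the CASES table are illustrative assumptions. Note that an XML processor only tests well-formedness: it will not complain about uppercase element names that match each other, nor about XHTML5 nesting prohibitions, which require a conformance checker.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def well_formed(fragment):
    """Wrap a fragment in a minimal XHTML body and report whether it parses as XML."""
    document = f'<html xmlns="{XHTML_NS}"><body>{fragment}</body></html>'
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Fragment -> whether an XML processor is expected to accept it.
CASES = {
    "<p><b><i>text</b></i></p>": False,                      # overlapping elements
    "<p><b><i>text</i></b></p>": True,                       # properly nested
    "<p>line<br>break</p>": False,                           # unterminated empty element
    "<p>line<br />break</p>": True,                          # shorthand notation
    "<p>line<br></br>break</p>": True,                       # explicit end tag
    "<input type=checkbox />": False,                        # unquoted attribute value
    '<input type="checkbox" checked />': False,              # minimized attribute
    '<input type="checkbox" checked="checked" />': True,     # attribute written in full
    "<script>if (a < b) run();</script>": False,             # raw < inside #PCDATA
    "<script><![CDATA[if (a < b) run();]]></script>": True,  # CDATA section
    "<!-- a -- b --><p>text</p>": False,                     # double dash inside a comment
    "<!-- a - b --><p>text</p>": True,                       # single dashes are allowed
}

results = {fragment: well_formed(fragment) for fragment in CASES}

# Attribute-value normalization: the tab and the newline each become one space.
link = ET.fromstring('<a title="one\ttwo\nthree" />')
normalized_title = link.get("title")

# The content of a CDATA section reaches the application as plain character data.
script = ET.fromstring("<script><![CDATA[if (a < b) run();]]></script>")
script_text = script.text
```

Each fragment is wrapped in a minimal html and body element only so that it can be parsed as a complete document; the wrapper is not a full XHTML5 page (it has no head or title).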
Character Encoding Declarations
Character encoding of XHTML5 documents can be determined in many ways:
- Using the HTTP header
- Using in-document declarations
  - Pragma directive
  - Meta charset attribute
  - XML declaration
The older kind of declaration (the meta http-equiv pragma directive) should be placed at the top of the head element. HTML5 also introduced the shorter meta charset attribute; either form can be used, but not both in the same document. It should also be ensured that the whole declaration fits within the first 1024 bytes of the document. This kind of meta element declaration cannot be used in the head element of XHTML5 documents if the character encoding is UTF-16. A byte-order mark should be present at the beginning of UTF-16 encoded files. The encoding declaration of XHTML documents depends on which MIME type they are served with. If they are served as text/html, the pragma directive can be used at the top of the head element. XHTML documents served as XML can use the encoding declaration of the XML declaration on the first line of the document. It should be ensured that there is no other content before the declaration, not even whitespace (only a byte-order mark may precede it).
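The role of the XML declaration can be illustrated with a generic XML processor. The sketch below uses Python's xml.etree.ElementTree on raw bytes; the sample documents and the helper text_of are illustrative assumptions, and ISO-8859-1 is chosen only because a single non-ASCII byte makes the effect easy to see.

```python
import xml.etree.ElementTree as ET

# "é" is the single byte 0xE9 in ISO-8859-1.
LATIN1_DOC = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b'<p xmlns="http://www.w3.org/1999/xhtml">caf\xe9</p>'
)

# The same document with a newline in front of the XML declaration.
MISPLACED_DECL = b"\n" + LATIN1_DOC

# The same bytes without any declaration; the processor then assumes UTF-8.
UNDECLARED = b'<p xmlns="http://www.w3.org/1999/xhtml">caf\xe9</p>'


def text_of(document_bytes):
    """Parse raw bytes as XML; return the root element's text, or None on a parse error."""
    try:
        return ET.fromstring(document_bytes).text
    except ET.ParseError:
        return None


decoded = text_of(LATIN1_DOC)         # encoding is taken from the XML declaration
rejected = text_of(MISPLACED_DECL)    # declaration is no longer at the very start
undeclared = text_of(UNDECLARED)      # 0xE9 is not a valid UTF-8 sequence here
```

With the declaration in place the byte is decoded as intended; moving the declaration off the first position, or omitting it for a non-UTF-8 document, makes the processor reject the input.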
In spite of the advantages of XHTML5, HTML5 has become the recommended markup language due to its simplicity and suitability for everyday purposes. However, web designers should keep in mind that well-formedness, proper document structure, and correct element use should always be provided in the markup regardless of the serialization used even if the HTML parser is “more forgiving” than the XML parser.