XHTML
Since January 2000 all W3C Recommendations for HTML have been based on XML rather than SGML, using the abbreviation XHTML (Extensible HyperText Markup Language). The language specification requires XHTML Web documents to be well-formed XML documents – this allows for more rigorous and robust documents while using tags familiar from HTML.
One of the most noticeable differences between HTML and XHTML is the rule that all tags must be closed: empty HTML tags such as <br> must either be closed with a regular end-tag (<br></br>) or replaced by a special form, <br /> (the space before the ‘/’ at the end of the tag is optional, but frequently used because it enables some pre-XML Web browsers, and SGML parsers, to accept the tag). Another is that all attribute values in tags must be quoted. Finally, all tag and attribute names must be lowercase in order to be valid; HTML, on the other hand, was case-insensitive.
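The closing and quoting rules can be checked with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed is hypothetical and chosen only for illustration. It tests well-formedness only. The lowercase rule is a validity constraint of XHTML, so a generic XML parser will not enforce it, although XML's case-sensitivity does mean that <P> cannot be closed by </p>.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


samples = [
    "<p>line one<br>line two</p>",       # HTML style: <br> is never closed
    "<p>line one<br />line two</p>",     # XHTML empty-element form
    "<p>line one<br></br>line two</p>",  # regular end-tag
    "<p class=intro>text</p>",           # unquoted attribute value
    '<p class="intro">text</p>',         # quoted attribute value
    "<P>text</p>",                       # start and end tags differ in case
]

for sample in samples:
    print(f"{is_well_formed(sample)!s:5}  {sample}")
```

Only the second, third, and fifth samples are accepted; the others are rejected by the parser before any HTML-specific rule is consulted.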
Other XML-based applications
Many XML-based applications now exist, including Resource Description Framework (RDF), XForms, DocBook, SOAP and the Web Ontology Language (OWL). For a partial list of these see List of XML markup languages.
Features
A common feature of many markup languages is that they intermix the text of a document with markup instructions in the same data stream or file. This is not necessary; it is possible to isolate markup from text content, using pointers, offsets, IDs, or other methods to co-ordinate the two. Such “standoff markup” is typical for the internal representations that programs use to work with marked-up documents. However, embedded or “inline” markup is much more common elsewhere. Here, for example, is a small section of text marked up in HTML:
<h1>Anatidae</h1>
<p>
  The family <i>Anatidae</i> includes ducks, geese, and swans,
  but <em>not</em> the closely-related screamers.
</p>
The codes enclosed in angle-brackets <like this> are markup instructions (known as tags), while the text between these instructions is the actual text of the document. The codes h1, p, and em are examples of semantic markup, in that they describe the intended purpose or meaning of the text they include. Specifically, h1 means “this is a first-level heading”, p means “this is a paragraph”, and em means “this is an emphasized word or phrase”. A program interpreting such markup may apply its own rules or styles for presenting the various pieces of text, using different typefaces, boldness, font size, indentation, colour, or other styles, as desired. A tag such as “h1” (heading level 1) might be presented in a large bold sans-serif typeface, for example, or in a monospaced (typewriter-style) document it might be underscored – or it might not change the presentation at all.
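The idea that each program applies its own presentation rules to the same semantic tags can be sketched with Python's standard html.parser module. The class StyledRenderer, the function render, and the two style tables are hypothetical names introduced for this illustration, and the sketch assumes a small, properly nested fragment.

```python
from html.parser import HTMLParser


class StyledRenderer(HTMLParser):
    """Render a nested HTML fragment as plain text using a caller-supplied style table."""

    def __init__(self, styles):
        super().__init__()
        self.styles = styles  # tag name -> function taking and returning a string
        self.stack = []       # open elements as (tag, list of text parts)
        self.output = []      # finished top-level pieces

    def handle_starttag(self, tag, attrs):
        self.stack.append((tag, []))

    def handle_data(self, data):
        # Text belongs to the innermost open element, if there is one.
        target = self.stack[-1][1] if self.stack else self.output
        target.append(data)

    def handle_endtag(self, tag):
        if not self.stack:
            return
        open_tag, parts = self.stack.pop()
        # Tags missing from the style table are passed through unchanged.
        styled = self.styles.get(open_tag, lambda s: s)("".join(parts))
        target = self.stack[-1][1] if self.stack else self.output
        target.append(styled)


def render(source, styles):
    parser = StyledRenderer(styles)
    parser.feed(source)
    parser.close()
    return "".join(parser.output)


source = "<h1>Anatidae</h1><p>Ducks are <em>not</em> screamers.</p>"

# Two assumed presentations of the same semantic markup.
shouting = {"h1": lambda s: s.upper() + "\n", "p": lambda s: s + "\n", "em": lambda s: f"*{s}*"}
underlined = {"h1": lambda s: f"{s}\n{'=' * len(s)}\n", "p": lambda s: s + "\n", "em": lambda s: f"_{s}_"}

print(render(source, shouting))
print(render(source, underlined))
```

The source text is identical in both calls; only the style table changes, which is the separation that semantic markup is meant to allow.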
In contrast, the i tag in HTML is an example of presentational markup; it is generally used to specify a particular characteristic of the text (in this case, the use of an italic typeface) without specifying the reason for that appearance.
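The standoff approach mentioned earlier in this section, in which markup is kept apart from the text and tied to it by offsets, can be sketched as follows. The function to_inline and the (start, end, tag) annotation format are assumptions made for this example rather than any standard representation; the sketch also assumes non-empty, properly nested spans.

```python
def to_inline(text, annotations):
    """Merge standoff annotations (start, end, tag) into inline markup.

    Offsets are half-open character positions into `text`.
    """
    opens, closes = {}, {}
    for index, (start, end, tag) in enumerate(annotations):
        opens.setdefault(start, []).append((end, index, tag))
        closes.setdefault(end, []).append((start, index, tag))

    out = []
    for pos in range(len(text) + 1):
        # Close the innermost elements first (those that started latest).
        for _, _, tag in sorted(closes.get(pos, []), key=lambda c: (-c[0], -c[1])):
            out.append(f"</{tag}>")
        # Open the outermost elements first (those that end latest).
        for _, _, tag in sorted(opens.get(pos, []), key=lambda o: (-o[0], o[1])):
            out.append(f"<{tag}>")
        if pos < len(text):
            out.append(text[pos])
    return "".join(out)


text = "The family Anatidae includes ducks."
annotations = [
    (0, len(text), "p"),  # the whole sentence is a paragraph
    (11, 19, "i"),        # characters 11..18 spell "Anatidae"
]

print(to_inline(text, annotations))
```

Because the text itself is never modified, several independent annotation sets can refer to the same characters, which is one reason programs may prefer this form internally.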
The Text Encoding Initiative (TEI) has published extensive guidelines for how to encode texts of interest in the humanities and social sciences, developed through years of international cooperative work. These guidelines are used by projects encoding historical documents, the works of particular scholars, periods, or genres, and so on.
Alternative usage
While the idea of markup language originated with text documents, markup languages are increasingly used in other areas which involve the presentation of various types of information, including playlists, vector graphics, web services, content syndication, and user interfaces. Most of these are XML applications, because XML is a well-defined and extensible language.
The use of XML has also led to the possibility of combining multiple markup languages into a single profile, such as XHTML+SMIL and XHTML+MathML+SVG.[9]
Markup languages, and more generally data description languages (not necessarily textual markup), are not programming languages: they are data, not code. They are therefore more easily manipulated than programming languages – for example, web pages are presented as HTML documents, not C code, and thus can be embedded within other web pages, displayed when only partially received, and so forth. This leads to the web design principle of the “Rule of Least Power”, which advocates using the least (computationally) powerful language that satisfies a task, to facilitate such manipulation and reuse.
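As a rough illustration of the point about partially received documents, the sketch below feeds an HTML fragment to Python's standard html.parser module in two pieces, as if it were arriving over a network. The class name TextCollector and the chunk boundaries are assumptions made for this example. Because the document is data rather than a program that must run to completion, the text seen so far is usable before the rest arrives.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text content as soon as the parser reports it (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_data(self, data):
        self.seen.append(data)


parser = TextCollector()

# First chunk: the heading is complete, the paragraph is cut off mid-sentence.
parser.feed("<h1>Anatidae</h1><p>Ducks, geese")
after_first_chunk = list(parser.seen)

# Second chunk: the remainder of the document.
parser.feed(", and swans.</p>")
parser.close()

print(after_first_chunk)
print("".join(parser.seen))
```

The heading is already available after the first call to feed, even though the paragraph element has not yet been closed.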