The Emperor has New Clothes : HTML Recast as an XML Application

Published on: Sunday 30th January 2000 By: Pankaj Kamthan

Introduction
Markup Languages on the Web : The Evolution
XHTML Syntax and Semantics : A Tour
Transition from HTML to XHTML
Translating HTML to XHTML
- HTML Tidy
  - HTML Tidy Offline
  - HTML Tidy Online
- HTML-Kit
Translating XML to XHTML
- XSL Transformations
Authoring XHTML
Validating XHTML
- W3C HTML Validation Service
- HTML::Validator
Rendering XHTML
Conclusion
Acknowledgements
References
Appendix : HTML Tidy Configuration Options

Introduction

After a period of dormancy, a relatively silent but significant change has taken place that will redefine the way HTML is currently served, received and processed. This evolution has been in the direction of XML (Extensible Markup Language), which in the last year has emerged as a powerful meta-markup language for the Web.

Since the majority of information on the Web still exists in the form of HTML, there is a pressing need for a mechanism that makes the transition from HTML to XML as transparent as possible. XHTML, the subject of this article, is intended to fill this gap. (The term XHTML should not be confused with the HTML editor Xhtml.) It is intended to be used as a language for content that is both XML-conforming and, if some simple guidelines are followed, operates in HTML 4 conforming user agents.

There are several questions that an author needs to ask him/herself:

What are the limitations inherent in the design of HTML? Is there a global (approach to the issue) and open (nonproprietary) solution that circumvents (any of) those limitations?
What is the purpose of extending HTML? Why and what should we know about it, particularly in terms of how we markup our documents now?

In what follows, we assume that the reader is already familiar with HTML 4. Some knowledge of XML can also be useful.

Markup Languages on the Web : The Evolution

The evolution of markup languages for the Web went through two major milestones: development of HTML, which built the Web architecture, and XML, which is poised to consolidate the same.

SGML/HTML

SGML (Standard Generalized Markup Language) is a meta-language for describing markup languages, particularly those used in electronic document exchange, document management, and document publishing. It is both feature-rich and flexible. This flexibility, however, comes with a level of complexity that has inhibited its adoption on the Web.

HTML is an SGML application, and is widely regarded as the standard publishing language of the Web. HTML was originally conceived as a language for the exchange of scientific and technical documents, suitable for use by non-document specialists. HTML addressed the problem of SGML complexity by specifying a small set of structural and semantic tags suitable for authoring relatively simple documents. In addition to simplifying the document structure, HTML added support for hypertext. Later, multimedia capabilities were added.

In a remarkably short time span, HTML became very popular and rapidly outgrew its original purpose. Since its inception, HTML has gone through several stages of evolution, both standard and proprietary, to adapt it to a very diverse and international community wishing to make information available on the Web. This evolution has led to interoperability problems for documents across different user agents. With the rapid proliferation of heterogeneous software and platforms, the situation has worsened.

The current HTML standard, HTML 4.0 Specification, was announced as a W3C Recommendation on December 18, 1997 and revised on April 24, 1998. It has recently been revised to HTML 4.01 Specification with minor changes. By "HTML 4," we will mean HTML 4.01 unless stated otherwise.

XML/XHTML

XML was conceived as a means of regaining the power and flexibility of SGML. While retaining all of SGML's commonly used and beneficial features, XML removes many of its more complex features. The XML 1.0 Specification was announced as a W3C Recommendation on February 10, 1998. As the acceptance of XML grew, it was realized that there was a very large base of Web documents whose transition to XML posed a major issue. The past few years also saw a rise in access to the Web from a variety of devices with varying computing power. The questions in the development of XHTML were: how to design the next generation language for Web documents without obsoleting the existing content on the Web, and how to create a markup language that supports device-independence. The result is XHTML.

XHTML is a family of current and future document types and modules that reproduce, subset, and extend HTML 4. XHTML family document types are XML based, and ultimately are designed to work in conjunction with XML-based user agents. XHTML 1.0 is the first document type in the XHTML family. It is a reformulation of the three HTML 4 document types as applications of XML 1.0. XHTML 1.0 was announced as a W3C Recommendation on January 26, 2000.

Advantages of XHTML

XHTML 1.0 connects the present Web to the future Web
-Tim Berners-Lee, in XHTML 1.0 Press Release

The XHTML 1.0 (and XHTML family, in general) offers several advantages to an author:

Extensibility and Modularity. The benefits of XML syntax include extensibility and modularity. With HTML, authors had a fixed set of elements to use, with no variation. There is a constant need to extend the functionality of HTML, which leads to combining HTML with other types of markup. Under HTML, the addition of a new group of elements requires alteration of the entire DTD (Document Type Definition). XHTML, being based on XML, is extensible. With XHTML 1.0, via XML Namespaces, elements from other XML vocabularies can be added without altering the entire DTD that the document is based on. XHTML is modular, and so it can be used in conjunction with other XML applications such as Mathematical Markup Language (MathML), Scalable Vector Graphics (SVG), Resource Description Framework (RDF), and more. The modular design reflects the realization that a one-size-fits-all approach will no longer work on the Web where browsers vary enormously in their capabilities. For example, a browser in a PDA can't offer the same experience as a high-end multimedia desktop computer because of the differences in the screens and memory.
Interoperability and Accessibility. Currently most Internet access is through browsers on desktop computers, though alternative ways are constantly being introduced. It is expected that, in the next few years, there will be an increasing use of non-desktop devices like PDAs, digital TVs, and other alternative platforms to access Web documents. In most cases, these devices will not have the processing power of a desktop computer, and browsers on them will be less able to tolerate malformed markup when rendering a document. People browsing the Web with cellular phones or other mobile devices would want Web content tailored to their needs. Furthermore, people with disabilities need ways to transform content into accessible formats. The XHTML family is designed with general user agent interoperability and accessibility in mind. XHTML 1.0 provides the foundation for device-independent Web access. XML documents can be transformed using XSL Transformations (XSLT), and rendered using independent style sheets such as CSS style sheets. XHTML 1.1, currently under development, equipped with device-specific style sheets and Composite Capability/Preference Profiles (CC/PP), a protocol which allows a user to describe both user preferences and device capabilities, will bring mobile and other devices to the Web as full participants.
Software Development. HTML, being based on SGML, does not require a document to be well-formed. For example, in HTML a <dt> tag might be terminated by </dt>, by another <dt>, by a <dd>, by a </dl>, or by something else. This makes developing processing software difficult, as one has to "predict" all possibilities and include an appropriate response for each. Since XHTML 1.0 is based on XML, every XHTML 1.0 document must be well-formed, thus excluding "unpredictable" scenarios.
Familiarity with Power. XHTML 1.0 combines the familiarity of HTML with the power of XML. XHTML 1.0 allows authors to create Web documents that work with current HTML browsers and that may be processed by XML-enabled software as well. Authors writing XHTML can use the well-known elements of HTML 4, but with XML syntax, which promotes markup conformance.
Support. There is a broad industry support for XHTML, as indicated in the wide range of testimonials, with multiple implementations already available.

XHTML Syntax and Semantics : A Tour

In our context, a DTD is a collection of XML declarations that, taken together, define the legal structure, elements, and attributes available for use in a document that complies with the DTD. It is a definition of what qualifies as legal syntax in XHTML 1.0. Each DTD can be recognized by a unique label called a formal public identifier (FPI). The semantics of the XHTML 1.0 elements and their attributes are as defined by the HTML 4 Specification.

XHTML DTDs

XHTML 1.0 specifies three XML document types that correspond to the three HTML 4 DTDs: Strict, Transitional, and Frameset.

XHTML 1.0 Strict.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

You should use this when you want presentational markup to be separated from the structure of the document. Cascading Style Sheets (CSS) can be used for the purpose of presentation.

XHTML 1.0 Transitional.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

You should use this when presentational markup needs to be embedded in the document (for example, to support browsers that are not CSS-compliant).

XHTML 1.0 Frameset.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">

You should use this when you want to have an HTML Frames-based document.

The system identifier may be changed to reflect local system conventions. These DTDs approximate the HTML 4 DTDs, although a DTD for XHTML 1.0 is more restrictive than the corresponding DTD for HTML 4, since XML is more restrictive than SGML.

XHTML Entities

The XHTML entity sets are the same as for HTML 4, but have been modified to be valid XML 1.0 entity declarations. The modified set includes Latin-1 characters, special characters and symbols.

XHTML Namespace

There are cases where different XML documents may have the same element or attribute names. Combining element names from different documents into one document then becomes an issue. For example, a title element may mean the title of a person in one document and of a book in another. Including the information from both documents in a single document will lead to a conflict.

XML Namespaces provide a method for qualifying element and attribute names used in XML documents by associating them with namespaces, which are identified by URI references. This allows element names from different documents to be combined in one document without conflict in cases where names happen to be identical. The mechanism uses qualified names: a prefix, bound to the URI of the intended namespace, distinguishes otherwise identically named XML elements.

The XHTML namespace is given to be http://www.w3.org/1999/xhtml.
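
To make the title conflict described above concrete, here is a minimal Python sketch using the standard library's ElementTree parser. The two namespace URIs and the bk and pn prefixes are hypothetical and chosen only for illustration; they do not come from any specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the "book" and "person" namespace URIs below are
# made up for this illustration.
DOC = """\
<catalog xmlns:bk="http://example.org/ns/book"
         xmlns:pn="http://example.org/ns/person">
  <bk:title>XML in Practice</bk:title>
  <pn:title>Dr.</pn:title>
</catalog>"""

root = ET.fromstring(DOC)

# ElementTree expands each prefixed name to "{namespace-URI}local-name",
# so the two title elements remain distinguishable.
names = [child.tag for child in root]
for name in names:
    print(name)
```

The prefix itself carries no meaning; it is the URI it is bound to that distinguishes one title from the other.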

XHTML Document Conformance

A strictly conforming XHTML 1.0 document is a document restricted to tags and attributes from the XHTML 1.0 namespace. Such a document must meet all of the following criteria:

It must validate against one of the three DTDs.
The root element of the document must be html.
The root element of the document must designate an XHTML 1.0 namespace using the xmlns attribute.
There must be a DOCTYPE declaration in the document prior to the root element. If present, the public identifier included in the DOCTYPE declaration must reference one of the three DTDs using the respective FPI.

Example 1. The following is an example of a minimal XHTML 1.0 document:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
  <title>The Minimal XHTML 1.0 Document</title>
</head>
<body>
  <p>This is a minimal XHTML 1.0 document.</p>
</body>
</html>
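
Some of the conformance criteria listed above can be checked mechanically. The following Python sketch, using only the standard library, checks the DOCTYPE public identifier and the root element with its namespace. It does not validate against the DTD (the standard library parser is non-validating), and the function name check_structure is our own choice for this illustration.

```python
import re
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
XHTML_FPIS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD XHTML 1.0 Transitional//EN",
    "-//W3C//DTD XHTML 1.0 Frameset//EN",
}

MINIMAL = """\
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
  <title>The Minimal XHTML 1.0 Document</title>
</head>
<body>
  <p>This is a minimal XHTML 1.0 document.</p>
</body>
</html>
"""


def check_structure(text):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    # The DOCTYPE must name one of the three XHTML 1.0 public identifiers.
    match = re.search(r'<!DOCTYPE\s+html\s+PUBLIC\s+"([^"]*)"', text)
    if match is None:
        problems.append("missing DOCTYPE declaration")
    elif match.group(1) not in XHTML_FPIS:
        problems.append("unknown public identifier: " + match.group(1))
    # Parsing raises ParseError if the document is not well-formed.
    root = ET.fromstring(text.encode("utf-8"))
    if root.tag != "{%s}html" % XHTML_NS:
        problems.append("root element must be html in the XHTML namespace")
    return problems


print(check_structure(MINIMAL))
```

A full conformance check would still require a validating parser and the DTD itself.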

XHTML Documents with Non-XHTML Namespace

The XHTML 1.0 namespace may be used with other XML namespaces. Though such documents are well-formed with respect to XML syntax, they are not strictly conforming XHTML 1.0 documents.

Example 2. Non-XHTML Namespace in XHTML 1.0 Document. The following example shows the way in which XHTML 1.0 could be used in conjunction with the MathML 1.01 Specification:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
  <title>A MathML Example</title>
</head>
<body>
  <p>The following is an example of MathML Content markup:</p>
  <math xmlns="http://www.w3.org/1998/Math/MathML">
    <apply><root/><ci>x</ci></apply>
  </math>
</body>
</html>

Example 3. XHTML 1.0 Namespace in a Non-XHTML Namespace Document. SVG provides extensibility by inclusion of elements from foreign namespaces. The following example shows the way in which XHTML 1.0 content could be embedded in an SVG document:

<?xml version="1.0" standalone="yes"?>
<svg width="500" height="400" xmlns="http://www.w3.org/Graphics/SVG/SVG-20000802.dtd">
<defs> 
  <html:html xmlns:html="http://www.w3.org/1999/xhtml">
  <html:head>
    <html:title>Sales By Department</html:title>
  </html:head>
  <html:body>
    <html:table align="center" cellpadding="5" border="1">
    <html:tr><html:th>Department</html:th><html:th>Revenue</html:th><html:th>Profit</html:th></html:tr>
    <html:tr><html:td>A</html:td><html:td>100</html:td><html:td>5</html:td></html:tr>
    <html:tr><html:td>B</html:td><html:td>200</html:td><html:td>15</html:td></html:tr>
    </html:table>
    </html:body>
  </html:html> 
</defs> 
<desc>Sales By Department</desc> 
<g style="stroke: #000000">
  <line x1="10" x2="150" y1="150" y2="150" style="stroke: #000080"/>
  <line x1="10" x2="10" y1="10" y2="150" style="stroke: #000080"/>
  <text x="10" y="10" style="font-size:14; font-family: Tahoma">Profit</text>
  <text x="10" y="165" style="font-size:14; font-family: Tahoma">0</text>
  <text x="150" y="165" style="font-size:14; font-family: Tahoma">Department</text>
  <rect style="fill: blue; stroke: red" x="40" y="125" width="20" height="25"/>
  <text x="40" y="165" style="font-size:12; font-family: Tahoma">Products</text>
  <text x="40" y="120" style="font-size:12; font-family: Tahoma">5</text>
  <rect style="fill: blue; stroke: red" x="90" y="75" width="20" height="75"/>
  <text x="90" y="165" style="font-size:12; font-family: Tahoma">Services</text>
  <text x="90" y="70" style="font-size:12; font-family: Tahoma">15</text>
</g>
</svg>

It is preferable to move the style attributes into a separate CSS style sheet.

HTML vs. XHTML Syntax

This section outlines the differences between HTML and XHTML 1.0 syntax. Most of the differences stem from the fact that XHTML 1.0 is an XML application, and certain practices that are legal in SGML-based HTML 4 need to be revised. Wherever relevant, we will include suggestions according to the HTML Compatibility Guidelines on ensuring backward compatibility.

Internet Media Type

XHTML 1.0 documents which follow the HTML Compatibility Guidelines may be labeled with the Internet Media Type "text/html", as they are backward compatible with most HTML browsers. The general recommended MIME labeling for XML-based applications has yet to be resolved.

Processing Instructions

The XML declaration, such as <?xml version="1.0" encoding="UTF-8"?>, is not required in all XHTML 1.0 documents. Such a declaration is, however, required when the (Unicode-based) character encoding of the document is other than the default UTF-8 or UTF-16. Therefore, it is a good practice to use XML declarations such as the above in all XHTML 1.0 documents.

DOCTYPE Declaration

The purpose of the DOCTYPE declaration is to declare that the document adheres to a specified DTD. An XHTML 1.0 document must have a DOCTYPE declaration just before the <html> tag and must reference one of the three XHTML 1.0 DTDs.

Element Language

The language of an element is specified by using both the lang and xml:lang attributes. The value of the xml:lang attribute takes precedence.

Root Element

The root element of an XHTML 1.0 document ("html") must designate an XHTML 1.0 namespace.

Well-Formedness

A document is well-formed when it is structured according to the rules defined in Section 2.1 of the XML 1.0 Specification. An XHTML 1.0 document must be well-formed. Basically, this means that elements, delimited by their start and end tags, are nested properly within one another. Although overlapping is illegal in SGML, it is widely tolerated in existing browsers. Example:

HTML	XHTML
`<p>This is <b>bold.</p></b>`	`<p>This is <b>bold.</b></p>`
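
Any XML parser will reject the overlapping form. As a minimal sketch in Python (the helper name is_well_formed is ours), the two fragments from the table can be tested with the standard library parser:

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


overlapping = "<p>This is <b>bold.</p></b>"
nested = "<p>This is <b>bold.</b></p>"

print(is_well_formed(overlapping))  # False: </p> arrives while <b> is open
print(is_well_formed(nested))       # True: elements are properly nested
```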

Element and Attribute Names

Since XML is case-sensitive, all HTML element and attribute names must be in lowercase. Example:

HTML	XHTML
`<BODY BGCOLOR="#ffffff">`	`<body bgcolor="#ffffff">`

In an effort towards migrating existing documents in HTML to XHTML 1.0, you can use HTML-Kit to change the case of element and attribute names in a document. Here is an example document before and after the case change and conversion from HTML to XHTML 1.0.

Pattern matching capabilities of Perl can also be used to change case. This solution will have the advantage of doing that for an entire set of HTML documents and will be portable across platforms which have Perl installed on them. However, a generic solution which changes the case of element and attribute names while ignoring attribute values, may be nontrivial.

Attribute Values

User-defined attribute values, however, can be in any case. For example, the "#ffffff" above can also be written as "#FFFFFF". The above is only for illustration; HTML 4.01 discourages the use of presentational attributes such as bgcolor. It is suggested that all presentation be carried out via CSS.

Empty Elements

Empty elements must either have an end tag or the start tag must end with />. This is also sometimes called a self-terminating element. Example:

HTML	XHTML
`<hr>`	`<hr />` or `<hr></hr>`

Note the space between the element text and the />. <hr />, known as the minimized tag syntax, is preferred over <hr></hr> as the latter gives unexpected results on some user agents.

Non-Empty Elements

A non-empty element is one whose content model (as designated by the DTD) is not EMPTY. (The term "empty" is being used in a set-theoretic sense.) Due to its SGML legacy, the start and end tags for elements in HTML can (depending on the element) be either required or optional. For example, for the "head" element, both the start and end tags are optional; for the "p" element, the end tag is optional; and for the "title" element, both the start and end tags are required. This is not possible in XML, where both the start and end tags are required for all non-empty elements. XHTML 1.0 therefore does not provide such liberty of choice: all elements other than those declared as EMPTY in the DTD must have an end tag. (Empty elements are the exception and are treated differently, as indicated above.) Example:

HTML	XHTML
`<p>`A paragraph	`<p>`A paragraph`</p>`

One important implication of this is that it increases the size of the document, which in turn has an effect on the delivery time.

There is a difference between an empty element and an empty instance of an element (non-empty element). Therefore, non-empty elements with empty content should not use the minimized form. For example, an empty paragraph (a paragraph with no content to encapsulate) should be represented as <p></p> (and not <p />).
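
Software that generates XHTML has to respect the same distinction. The following Python sketch chooses the serialized form from the set of elements declared EMPTY in the XHTML 1.0 DTDs. The serialize helper is our own and is only meant for a single element without children, since the standard library applies the short_empty_elements setting to the whole subtree.

```python
import xml.etree.ElementTree as ET

# Elements declared EMPTY in the XHTML 1.0 DTDs.
EMPTY_ELEMENTS = {
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param",
}


def serialize(element):
    """Serialize one childless element, minimizing only EMPTY elements."""
    short = element.tag in EMPTY_ELEMENTS
    return ET.tostring(element, encoding="unicode", short_empty_elements=short)


print(serialize(ET.Element("hr")))  # minimized form, with a space before />
print(serialize(ET.Element("p")))   # start tag followed by end tag

# By default the serializer minimizes every element without content,
# which is the form to avoid for an empty paragraph.
default_p = ET.tostring(ET.Element("p"), encoding="unicode")
```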

Attribute Values

All attribute values, including those which appear to be numeric, must be quoted (in single or double quotes). Example:

HTML	XHTML
`<table border=1>`	`<table border="1">`

This would need some adjustment in practices of dynamically generating HTML. If double quotes are used, for example, in case of Perl:
print "<table border=1>\n"; will need to be changed to
print "<table border=\"1\">\n"; with the quotes being escaped.
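
Escaping quotes by hand becomes error-prone once attribute values themselves contain quotes or ampersands. As a hedged sketch of the same idea in Python rather than Perl, the standard library function xml.sax.saxutils.quoteattr quotes and escapes a value in one step; the start_tag helper is hypothetical and exists only for this illustration.

```python
from xml.sax.saxutils import quoteattr


def start_tag(name, **attributes):
    """Build a start tag whose attribute values are always quoted."""
    parts = [name]
    for attr, value in attributes.items():
        # quoteattr adds the surrounding quotes and escapes &, < and >.
        parts.append("%s=%s" % (attr, quoteattr(str(value))))
    return "<%s>" % " ".join(parts)


print(start_tag("table", border=1))
print(start_tag("a", title='Tom & "Jerry"'))
```

When the value contains double quotes, quoteattr switches to single quotes as delimiters, which XML allows equally.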

Attribute Minimization

An attribute is said to be minimized when there is only one value for it. For example, in the form element <input type="checkbox" ... checked>, the attribute checked has been minimized. XML does not support attribute minimization. In XHTML 1.0, attribute-value pairs cannot be minimized and must be written in full (as if they had values). Example:

HTML	XHTML
`<input type="checkbox" ... checked>`	`<input type="checkbox" ... checked="checked" />`
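
Lowercase names, quoted values and unminimized attributes can be applied mechanically if the markup is read with a parser instead of pattern matching, which avoids touching the case of attribute values. Below is a minimal Python sketch built on the standard library's html.parser; the class and function names are our own. It does not add missing end tags or repair nesting, which is what a tool such as HTML Tidy is for.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import quoteattr

# Elements declared EMPTY in the XHTML 1.0 DTDs.
EMPTY_ELEMENTS = {
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param",
}


class XhtmlTagRewriter(HTMLParser):
    """Re-emit tags in XHTML syntax; text and references pass through."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.pieces = []

    def _start(self, tag, attrs):
        # The parser has already lowercased tag and attribute names.
        parts = [tag]
        for name, value in attrs:
            if value is None:      # minimized attribute, e.g. CHECKED
                value = name
            parts.append(name + "=" + quoteattr(value))
        end = " />" if tag in EMPTY_ELEMENTS else ">"
        self.pieces.append("<" + " ".join(parts) + end)

    def handle_starttag(self, tag, attrs):
        self._start(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._start(tag, attrs)
        if tag not in EMPTY_ELEMENTS:
            self.pieces.append("</%s>" % tag)

    def handle_endtag(self, tag):
        if tag not in EMPTY_ELEMENTS:
            self.pieces.append("</%s>" % tag)

    def handle_data(self, data):
        self.pieces.append(data)

    def handle_entityref(self, name):
        self.pieces.append("&%s;" % name)

    def handle_charref(self, name):
        self.pieces.append("&#%s;" % name)


def convert(html):
    rewriter = XhtmlTagRewriter()
    rewriter.feed(html)
    rewriter.close()
    return "".join(rewriter.pieces)


print(convert("<BODY BGCOLOR=#ffffff><INPUT TYPE=checkbox CHECKED><HR></BODY>"))
```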

Elements with id and name Attributes

HTML 4.0 defined the name attribute for certain elements (a, applet, frame, iframe, img, and map) and also introduced the id attribute. Both of these attributes are designed to be used as fragment identifiers. In XHTML 1.0, the name attribute of these elements is formally deprecated. XHTML 1.0 documents must use the id attribute when defining fragment identifiers, even on elements that previously had a name attribute.

When serving XHTML 1.0 documents as media type text/html, see the HTML Compatibility Guidelines for information on ensuring backward compatibility.

White Spaces in Attribute Values

White space characters, such as a line break, are handled differently by different user agents. When white space is used in attribute values, user agents strip leading and trailing white space and map sequences of one or more white space characters to a single inter-word space (an ASCII space character for western scripts). Therefore, you should avoid line breaks and multiple white space characters within attribute values.
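
The normalization described above can be sketched in a line of Python. This illustrates the rule itself, not the code of any particular user agent, and the function name is our own.

```python
def normalize_attribute_value(value):
    """Strip leading/trailing white space and collapse internal runs."""
    # str.split() with no argument splits on any run of white space,
    # including line breaks and tabs, and drops empty leading/trailing items.
    return " ".join(value.split())


print(normalize_attribute_value("  first\n   second\tthird "))
```

Authoring values that are already in this normalized form means every user agent sees the same string.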

Script and Style elements

The main purpose for character data (CDATA) sections is to ignore characters that would otherwise be regarded as markup. The only delimiter that is recognized in a CDATA is the string "]]>" which ends the CDATA section.

In XHTML 1.0, the script and style elements are declared as having parsed character data (#PCDATA) content. As a result, and contrary to the possible intended purpose, < will be treated as the start of markup, and an entity reference such as &lt; will be recognized by the XML processor and expanded to <. Use of & and ]]> within scripts and style sheets will also create confusion. Therefore, the content of the script or style element should be wrapped within a CDATA marked section to avoid any unintended interpretation. Example:

HTML	XHTML
`<script type="text/javascript"> ... script content ... </script>`	`<script type="text/javascript"> <![CDATA[ ... script content ... ]]> </script>`

This only applies when you're including code inside these elements. It does not apply to external references to scripts and style sheets, for example, <script type="text/javascript" src="foo.js"></script> and <link rel="stylesheet" href="bar.css" type="text/css" />. Therefore, as an alternative to using the CDATA wrapper, you can use external scripts and style sheets.
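
The effect of the CDATA wrapper is easy to observe with an XML parser. In the Python sketch below, the script content (a comparison using < and &&) is a made-up example, and the helper name parses is ours.

```python
import xml.etree.ElementTree as ET

# Hypothetical script content containing < and &, written both ways.
BARE = '<script type="text/javascript">if (a < b && c) { go(); }</script>'
WRAPPED = ('<script type="text/javascript">'
           '<![CDATA[if (a < b && c) { go(); }]]></script>')


def parses(markup):
    """Return True if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


print(parses(BARE))     # False: < is read as the start of markup
print(parses(WRAPPED))  # True: the CDATA section shields the content

# The parser hands back the script text unchanged.
script_text = ET.fromstring(WRAPPED).text
print(script_text)
```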

Form Elements

The isindex element is deprecated in favour of the input element. If used at all, only one isindex element should be included in the document head.

Entity Sets

The XHTML 1.0 entity sets are the same as for HTML 4, but have been modified to be valid XML 1.0 entity declarations.

SGML Exclusions

SGML gives the author of a DTD the ability to exclude specific elements from being contained within an element. Such prohibitions or exclusions are not possible in XML. The following XHTML 1.0 elements have prohibitions on which elements they can contain:

a cannot contain other a elements.
pre cannot contain the img, object, big, small, sub, or sup elements.
button cannot contain the input, select, textarea, label, button, form, fieldset, iframe or isindex elements.
label cannot contain other label elements.
form cannot contain other form elements.
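
Because a DTD-based XML validator cannot enforce these prohibitions, a separate check is needed. The following Python sketch walks a parsed document and reports violations of the list above. It assumes the prohibitions apply at any depth, as SGML exclusions do, and the function name find_violations is our own.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"

# The prohibitions listed above, keyed by the containing element.
PROHIBITED = {
    "a": {"a"},
    "pre": {"img", "object", "big", "small", "sub", "sup"},
    "button": {"input", "select", "textarea", "label", "button",
               "form", "fieldset", "iframe", "isindex"},
    "label": {"label"},
    "form": {"form"},
}


def find_violations(root):
    """Return (container, prohibited descendant) pairs found under root."""
    violations = []
    for element in root.iter():
        outer = element.tag.replace(XHTML, "")
        banned = PROHIBITED.get(outer)
        if not banned:
            continue
        for descendant in element.iter():
            if descendant is element:
                continue
            inner = descendant.tag.replace(XHTML, "")
            if inner in banned:
                violations.append((outer, inner))
    return violations


DOC = ('<body xmlns="http://www.w3.org/1999/xhtml">'
       '<a href="x">one <b><a href="y">two</a></b></a><pre>ok</pre></body>')
print(find_violations(ET.fromstring(DOC)))
```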

Transition from HTML to XHTML

Migration of content from HTML to XHTML 1.0 offers various benefits:

XHTML 1.0 documents are XML 1.0 conforming. So they are readily viewed, edited, and validated with standard XML tools.
XHTML 1.0 documents served as media type text/html will continue to operate (as well as or even better than they did before) in existing HTML 4-conforming user agents, as well as in new XHTML 1.0 conforming user agents. XHTML 1.0 documents with appropriate style sheet support will operate just as well in XML-based user agents as they do in HTML-based user agents.
XHTML 1.0 documents can utilize applications (for example, JavaScript scripts and Java applets) that rely upon either the HTML DOM or the XML DOM.

The HTML to XHTML 1.0 transition, though simple, deserves careful consideration as it involves several issues:

Migrating legacy (existing) HTML documents to XHTML 1.0. This can be done by translating HTML to XHTML and subsequently using HTML validation techniques.
Authoring (new) XHTML 1.0 documents. This can be done by deploying available XHTML authoring environments.
In both the above cases, keeping backward compatibility with existing HTML user agents. This can be done by following HTML Compatibility Guidelines.

Translating HTML to XHTML

One approach towards the transition of (existing legacy) HTML documents to XML is to translate them to XHTML. As seen in the section on HTML vs. XHTML Syntax, this can require quite a few changes (particularly if the original document did not conform to HTML 4). This section looks at some tools that make that task easier. One such tool is HTML Tidy.

HTML Tidy

To convert documents from HTML to XHTML 1.0, you can use HTML Tidy, a free utility available for a variety of platforms. It also cleans up markup errors, and reformats the markup for legibility and maintenance.

HTML Tidy Offline

You can run Tidy from the command line as follows:

tidy [[options] filename]*

By default, input and output are directed to stdin and stdout respectively, and errors to stderr. This can be changed by using one or more of the various useful options that Tidy offers, which can be seen by:

tidy -help

A selected list of some commonly used options is given in the table below; for an entire list, you can refer to the official documentation:

HTML Tidy Command Line Option	Function
`-help`	List command line options
`-config file`	Read config file
`-m`	Modify original files
`-f file`	Write errors to file
`-asxml`	Convert HTML to XML
`-xml`	Use this when input is in XML

The -m option should be used with care as it modifies the original file. The -f option can be used to direct the errors to a file, which makes it easier to review them later. For instance:

tidy -m -f error.txt file.html

runs Tidy on the file file.html, updating it in place and writing the error messages to the file error.txt.

The most convenient way to configure Tidy is to use a configuration file. If you have a config file named config.txt, you can instruct Tidy to use it via the command line option -config config.txt, as in:

tidy -config config.txt file1.html file2.html

There are again various options that are supported. A list of some commonly used options is given in the Appendix. Here is a sample configuration file:

// A sample configuration file for HTML Tidy
indent: auto
wrap: 72
markup: yes
output-xml: no
input-xml: no
show-warnings: yes
numeric-entities: yes
quote-marks: yes
quote-nbsp: yes
quote-ampersand: yes
output-xhtml: yes

If you have several HTML files in the same directory and want to use Tidy on all of them simultaneously, you can do the following:

DOS. Use for command, as in:

for %i in (*.html) do tidy %i

UNIX. Use a UNIX shell script, as in:

#!/bin/sh

# Run Tidy on every HTML file in the current directory.
for i in *.html
do
  /absolute_path_to/tidy -config config.txt "$i"
done
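
Where a single script has to run on both DOS/Windows and UNIX, the same loop can be written in a scripting language. The Python sketch below is one possible variant. It uses only the Tidy options listed in the table above; the per-file error log name (the input name plus .err) and the helper names are our own assumptions, and it assumes that a tidy executable is on the search path.

```python
import glob
import subprocess


def tidy_command(html_file, config="config.txt", tidy="tidy"):
    """Build the Tidy command line for one file.

    -config, -m and -f are the options described in the table above;
    the ".err" suffix for the error log is an arbitrary choice.
    """
    return [tidy, "-config", config, "-m", "-f", html_file + ".err", html_file]


def tidy_directory(pattern="*.html"):
    """Run Tidy on every file matching the pattern, one after another."""
    for html_file in sorted(glob.glob(pattern)):
        result = subprocess.run(tidy_command(html_file))
        # Report the exit status rather than stopping at the first problem.
        print(html_file, "exit status", result.returncode)


print(" ".join(tidy_command("file1.html")))
```

Calling tidy_directory() processes the current directory. Since -m modifies the original files, it is worth trying the script on a copy first.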

HTML Tidy Online

A CGI-based online interface to HTML Tidy is also available: you submit the URI of the HTML document that you wish to convert to XHTML 1.0.

CYAN is another online interface to HTML Tidy with various input/output options.

HTML-Kit

HTML-Kit is a freely available program for Windows 9x/NT, designed to help HTML authors to edit, format, validate, preview and publish documents on the Web. It includes a customizable GUI to HTML Tidy for converting documents from HTML to XHTML 1.0. Among other possibilities of viewing, it provides a view with split windows, one with the original markup and the other with the markup after transformation. Errors and suggestions for improving the markup are also reported in a separate window.

An illustration of the output of a document after an HTML to XHTML 1.0 conversion is available. Documents authored using HTML-Kit can carry the HTML-Kit icon as a sign of support.

Translating XML to XHTML

There may be times when a need arises to translate a set of XML files to XHTML 1.0, say, for presenting them to HTML 4 user agents. An efficient way of doing that is to use XSLT.

XSL Transformations

XSLT, the "transformational" part of the Extensible Stylesheet Language (XSL), is a language for transforming XML documents into other XML documents (as well as documents in other formats). Here is an example of using XSLT to create an XHTML 1.0 document from a given XML document.

The XML document is sale1999.xml and the XSL stylesheet is sales.xsl. The files are processed using XT, for example, under Windows 9x/NT, as:

C:\path_to\xt sale1999.xml sales.xsl sale1999.htm

where we use the XML document sale1999.xml as input, apply the XSL stylesheet sales.xsl, and direct the output to an HTML document sale1999.htm.

Authoring XHTML

Many existing visual HTML editors may not be useful in authoring XHTML 1.0 documents as they often insert tags on their own and do not follow the rules that produce syntactically-correct documents. (Even though documents with malformed markup are often tolerated by some browsers at present, that may not always be the case with strictly XHTML-compliant renderers in the future.) Therefore, a judicious choice among visual editors is necessary. One example is HTML-Kit discussed above.

Specialized XHTML 1.0 visual editors can also be used, as we will see later in this section. General purpose text editors, such as Emacs, can be useful as well. For details of editing HTML with Emacs, see the article HTML Editing with Emacs.

HTML Compatibility Guidelines

There are HTML Compatibility Guidelines for authors who wish their XHTML 1.0 documents to render on existing HTML user agents. They suggest the appropriate use of various items (Processing Instructions, Empty Elements, Element Minimization and Empty Element Content, Embedded Style Sheets and Scripts, Line Breaks within Attribute Values, Isindex, The lang and xml:lang Attributes, Fragment Identifiers, Character Encoding, Boolean Attributes, DOM, Using Ampersands in Attribute Values, CSS).

Amaya

Amaya is a freely available W3C test-bed browser/editor that supports XHTML 1.0. The documents, however, must be labeled as text/html as Amaya does not support XML per se. Amaya gives various options for output format, including XHTML. Amaya can be particularly useful for authoring XHTML 1.0 documents with embedded MathML markup, as its rendering/editor components support both MathML and XHTML 1.0. All one has to do is include the appropriate XML namespace for MathML in the XHTML 1.0 document.

XML Spy

XML Spy is a commercial XML editor with support for XHTML 1.0, that provides three advanced views of the documents: an Enhanced Grid View for structured editing, a Source View with syntax-coloring for low-level work, and an integrated Browser View. An example illustration of the Enhanced Grid View is available.

Mozquito Factory

Mozquito Factory is a commercial XML editor that is also the first authoring environment especially designed for the XHTML family of document types. Among many features, it can check XHTML 1.0 documents for well-formedness and validity (against any one of the three DTDs it corresponds to). For example, test results of XHTML 1.0 well-formedness and validity for the document minxhtml.htm are shown in the snapshot. XHTML 1.0 documents that validate with Mozquito Factory may display the corresponding icon.

Validating XHTML

Validation is a process whereby documents are verified against the associated DTD, ensuring that the structure, use of elements, and use of attributes are consistent with the definitions in the DTD. Validating an XHTML 1.0 document involves verifying its markup against one of the three associated DTDs. For details on the issue of HTML validation, see the article Stop! Is Your HTML Document Valid?
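
A validator first has to know which of the three DTDs a document claims to conform to. The sketch below does only that step: it reads the public identifier from the DOCTYPE declaration and maps it to a DTD name. It does not validate anything, and the function name `declared_dtd` is hypothetical.

```python
import re

# Public identifiers of the three XHTML 1.0 DTDs.
XHTML_1_0_DTDS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN": "strict",
    "-//W3C//DTD XHTML 1.0 Transitional//EN": "transitional",
    "-//W3C//DTD XHTML 1.0 Frameset//EN": "frameset",
}

DOCTYPE = re.compile(r'<!DOCTYPE\s+html\s+PUBLIC\s+"([^"]+)"')


def declared_dtd(markup):
    """Return 'strict', 'transitional' or 'frameset', or None if the
    document declares no XHTML 1.0 DTD."""
    match = DOCTYPE.search(markup)
    if match is None:
        return None
    return XHTML_1_0_DTDS.get(match.group(1))
```

A document for which this returns None has no XHTML 1.0 DTD to be validated against, so it is a useful check to run before submitting a page to a validator.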

W3C HTML Validation Service

You can validate XHTML 1.0 by using the online W3C HTML Validation Service which is based on an SGML parser. It comes with various options, such as including Weblint results and displaying the parse tree. It does not support file upload though.

You may display the "Valid XHTML 1.0!" icon on any page that validates. Here is the markup you could use to add the icon to your Web page:

<a href="http://validator.w3.org/check/referer">
  <img src="http://validator.w3.org/images/vxhtml10" 
       alt="Valid XHTML 1.0!" height="31" width="88" />
</a>

HTML::VALIDATOR

HTML::Validator is a Perl module that can be used to check the validity of HTML, XHTML, SGML, or XML files offline. The validation is performed by the nsgmls SGML parser.

There are some additional requirements in XHTML which the XML data content models cannot check, primarily due to the limitations of DTDs. For such cases, an Exclusion Validator for XHTML in XSL is available.
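
The best-known of these requirements are the element prohibitions listed in the XHTML 1.0 specification, for example that an `a` element must not contain another `a` element at any depth. The sketch below shows the idea in Python rather than XSL; it is not the Exclusion Validator mentioned above, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Element prohibitions from the XHTML 1.0 specification: the key element
# must not contain any of the listed elements, at any depth of nesting.
PROHIBITED = {
    "a": {"a"},
    "pre": {"img", "object", "big", "small", "sub", "sup"},
    "button": {"input", "select", "textarea", "label", "button",
               "form", "fieldset", "iframe", "isindex"},
    "label": {"label"},
    "form": {"form"},
}


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return element.tag.rsplit("}", 1)[-1]


def exclusion_violations(markup):
    """Return (ancestor, descendant) pairs that break a prohibition."""
    root = ET.fromstring(markup)
    violations = []
    for ancestor in root.iter():
        banned = PROHIBITED.get(local_name(ancestor))
        if not banned:
            continue
        for descendant in ancestor.iter():
            if descendant is not ancestor and local_name(descendant) in banned:
                violations.append((local_name(ancestor), local_name(descendant)))
    return violations
```

A DTD content model can forbid an `a` directly inside an `a`, but not one nested several levels down inside other inline elements, which is why a separate pass of this kind is needed.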

Rendering XHTML

When HTML Compatibility Guidelines are followed, XHTML 1.0 documents can be rendered on HTML 4-compliant browsers. Opera and Amaya have native support for rendering XHTML 1.0.

Conclusion

XHTML : HTML, Alive and Well ... and Better

HTML, as the lingua franca of the Web, has begun to show signs of aging. XHTML has come to the rescue with its family of document types that will provide interoperability on a wide range of new devices.

XHTML 1.0 is the first member of the XHTML family that recasts HTML 4 in XML 1.0. It provides the bridge to authors for entering the structured data world of XML, while still being able to maintain operability with user agents that support HTML 4. The other members of the family are based on a Modularization of XHTML, which provides a means for subsetting and extending XHTML for emerging platforms such as mobile computers and cellular phones. These initiatives include XHTML 1.1 - Module-based XHTML, XHTML Basic, XHTML Events module, and Ruby Annotation.

Transition from HTML to XHTML 1.0 is straightforward if you follow the suggested guidelines for authoring content. This transition has an additional benefit of a smooth passage to XML and use of its applications, while still maintaining backward and future compatibility of the content.

"Welcome to the XML World! Would you please check your (malformed HTML) baggage at the (validator) door? Thank you, and enjoy your (markup) journey."

Acknowledgements

I am grateful to Hsueh-Ieng Pai and Martin Webb for their critical reading and various useful suggestions. I would also especially like to thank all the users for their feedback since this work first appeared in August 1999.

References

Extensible Markup Language (XML) 1.0 Specification - Tim Bray, Jean Paoli, C. M. Sperberg-McQueen (Editors). W3C Recommendation, February 10, 1998.
XHTML™ 1.0: The Extensible HyperText Markup Language — A Reformulation of HTML 4.0 in XML 1.0 - Steve Pemberton, et al. (Editors). W3C Recommendation, January 26, 2000.
HTML 4.01 Specification - Dave Raggett, Arnaud Le Hors, Ian Jacobs (Editors). W3C Recommendation, December 24, 1999.
XSL Transformations (XSLT) Version 1.0 - James Clark (Editor). W3C Recommendation, November 16, 1999.
HTML Tidy - A tool for detecting and correcting a wide range of markup errors prevalent in an HTML document, that can also be used as a tool for converting HTML to XHTML.
HTML-Kit - A free program for Windows 9x/NT designed for HTML authoring. It provides a customizable GUI to HTML Tidy.
Amaya - W3C's test-bed browser/editor with support for XHTML.
XML Spy - A validating XML editor for Windows 9x/NT that supports XHTML.
Mozquito Factory - An XHTML authoring environment for Windows 9x/NT.
W3C HTML Validation Service - An HTML validation service that checks HTML documents for compliance with W3C Specifications.

APPENDIX : HTML Tidy Configuration Options

The following table gives a selected list of options used by HTML Tidy (assuming the input to be an HTML file and output to be an XHTML file); for an entire list, you can refer to the official documentation.

Alternatives separated by | are the permitted values of an option; yes | no options are Boolean. (The original table indicates each default in green.)

HTML Tidy Configuration Option	Function
markup: yes | no	Determines whether Tidy generates a pretty printed version of the markup.
indent: yes | no | auto	Tidy uses this to indent block-level tags. If set to auto, Tidy will decide whether or not to indent the content of tags depending on whether or not the content includes a block-level element.
output-xml: yes | no	Tidy uses this to generate the pretty printed output, writing it as well-formed XML. Any entities not defined in XML 1.0 will be written as numeric entities to allow them to be parsed by an XML parser. The tags and attributes will be in the case used in the input document, regardless of other options.
add-xml-pi: yes | no	Tidy uses this to add the XML processing instruction when outputting XML or XHTML.
output-xhtml: yes | no	Tidy uses this to generate the pretty printed output, writing it as XHTML. This option causes Tidy to set the doctype and default namespace as appropriate to XHTML. If a doctype or namespace is given, they will be checked for consistency with the content of the document. In the case of an inconsistency, the corrected values will appear in the output. For XHTML, entities can be written as named or numeric entities according to the value of the "numeric-entities" property. The tags and attributes will be output in the case used in the input document, regardless of other options.
doctype: omit | auto | strict | loose | <fpi>	This property controls the doctype declaration generated by Tidy. omit: the output file will not contain a doctype declaration. auto: Tidy will use an educated guess based upon the contents of the document. strict: the doctype is set to the strict DTD. loose: the doctype is set to the loose (transitional) DTD. <fpi>: the doctype is set to the given formal public identifier; for an XHTML document, Tidy will set the system identifier to the empty string.
numeric-entities: yes | no	Causes entities other than the basic XML 1.0 named entities to be written in the numeric rather than the named entity form.
quote-ampersand: yes | no	This causes "&" characters to be written out as &amp;.
logical-emphasis: yes | no	This causes Tidy to replace any occurrence of `<i>` by `<em>` and any occurrence of `<b>` by `<strong>`. In both cases, the attributes are preserved unchanged.
enclose-text: yes | no	This causes Tidy to enclose any text it finds in the body element within a `<p>` tag. This can be useful when you want to use an existing HTML file with a style sheet.
write-back: yes | no	Tidy uses this to write back the tidied markup to the same file it read from.
error-file: filename	Writes errors and warnings to the named file rather than to stderr.
show-warnings: yes | no	If set to no, warnings are suppressed. This can be useful when a few errors are hidden in a long list of warnings.
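
These options are written to a configuration file as one "name: value" pair per line. As a minimal sketch, the following generates such a file from a dictionary; the function name `tidy_config` and the particular selection of values in `XHTML_OPTIONS` are assumptions made for illustration, not recommended settings.

```python
def tidy_config(options):
    """Render options as Tidy configuration text, one 'name: value' per line.

    Python booleans are written as yes/no; other values are written as given.
    """
    lines = []
    for name, value in options.items():
        if isinstance(value, bool):
            value = "yes" if value else "no"
        lines.append(f"{name}: {value}")
    return "\n".join(lines) + "\n"


# Illustrative selection of options from the table above for HTML to XHTML.
XHTML_OPTIONS = {
    "output-xhtml": True,
    "add-xml-pi": False,
    "doctype": "strict",
    "numeric-entities": True,
    "indent": "auto",
    "show-warnings": True,
}

if __name__ == "__main__":
    print(tidy_config(XHTML_OPTIONS), end="")
```

The printed text can be saved to a file and handed to Tidy through its configuration-file option; check the official documentation for the exact command-line syntax of the version in use.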
