XML Entities and their Applications

Published on: Sunday 21st May 2000 By: Pankaj Kamthan

Introduction
Motivation for Entities
Basic Characteristics of Entities
A Taxonomy of Entities
Internal General Entities
External General Entities
- Uses of External General Entities
Parameter Entities
- Uses of Parameter Entities
The Document Entity
Entity Declarations and References
Entity Attributes
- Entity Attributes vs. Entities in Attribute Values
Entity Standardization
Entity Recognition, Processing and Expansion
Entity Authoring
Entity Management
- Entity Conversion from SGML to XML
Applications of Entities
Limitations of Entity use
Conclusion
References

Introduction

XML provides both a logical structure and a physical structure to a document. The logical structure tells what elements are to be included in a document and in what order. The physical structure governs the content of a document in the form of storage units called entities. In general, entities allow you to assign a name to some content and use that name to refer to that content.

Entities have several uses. For example, they allow you to create "macros" for content, which can then be included in the document simply by referring to the macro's name. Special characters or images, for instance, can be included in the form of entities. This makes the same content reusable without unnecessary repetition in the document. It also makes management and future modification more efficient, since the content can reside in a central location.

The goal of this article is to discuss different aspects of XML entities, including some possible applications and scenarios of use. We assume that the reader is familiar with the basics of XML and DTD syntax.

Motivation for Entities

There are several reasons that have motivated the introduction of entities:

In order for the XML processor to recognize markup in your document, it must be able to distinguish markup from content. Therefore, some characters have been reserved to identify the start of markup. For instance, the left angle bracket (<), identifies the beginning of an element start- or end-tag. In order to insert these characters into your XML document as content, there must be an alternative way to represent them. Entities provide a way to represent these special characters. (If not encoded, an XML parser will interpret them as characters introducing elements.)
When authoring a DTD, there are often times when certain information repeats itself. By defining this repetitious information separately, one can reuse it by pointing only to the definitions. This leads to a considerable reduction in the size of DTDs, as well as a modular form. Any changes in the repetitious information will then only need to be made at a single place (the definition) and not at every place it is used, which helps in maintenance.
There are times when including content of external files, including ones which have non-XML content, is necessary. Entities provide a facility to include or point to the content of these external files.

The sections that follow present examples that elaborate these points.

Basic Characteristics of Entities

This section provides the basic characteristics that are inherent in all entities as well as relevant definitions.

The "Non-vacuous" Nature of Entities. All entities have content (they are nonempty). The set of entities in an XML document is nonempty. Each XML document has one special entity called the document entity.
Internal and External Entities. There are two different contexts in which you can use entities: in the DTD and in the documents. In a document, entities are declared in the document prolog and referenced within the document root element. An internal entity is one whose content is given directly in its declaration in the DTD, and for which there is no separate associated physical object. External entities are the ones which are not internal.
Parsed and Unparsed Entities. There are two types of entities: parsed and unparsed. An internal entity is a parsed entity. An external entity can be either parsed or unparsed. Note that this distinction is from an XML processor's point of view (whether it is supposed to parse an entity or not) and not from the point of view of an entity's content (whether it is parsable or not). A parsed entity's contents are referred to as its replacement text; this text is considered an integral part of the document. An unparsed entity is a resource whose contents may or may not be text, and if text, may be other than XML. Each unparsed entity has an associated notation, identified by name.
General and Parameter Entities. General entities, which can be either parsed or unparsed, are for use within the document content. Parameter entities are parsed entities for use within the DTD. These two types of entities use different forms of declarations and references and are recognized in different contexts.
Entity Names. All entities, except the document entity and the external DTD subset, are identified by an entity name. Entities must have a unique name within their namespace. General and parameter entities occupy different namespaces; a parameter entity and a general entity with the same name are two distinct entities. In case of a parsed entity, the name identifies the entity in an entity reference. In the case of an unparsed entity, the name identifies the entity in the value of an ENTITY or ENTITIES attribute. The details are discussed in the section Entity Declarations and References and Entity Attributes.
Entity Placeholders. Entities can exist in various forms, including: as a single file, a stream of bytes, or generated in real-time from a database.
Entity Filename Extensions. For those entities that exist as a file, there is no imposition on the filename extension. By convention, however, .ent is commonly used for external general entities and .mod commonly used for external parameter entities.

The notion of entities has apparent similarities to various concepts found in modern object-oriented system programming languages, such as C++ and Java (which is not surprising considering that XML reflects an inclination towards object serialization):

Modularization. Modular use such as #include in C++ or import in Java, idea of preprocessing as in Server-Side Includes (SSIs) and Active Server Pages (ASP).
Strong Typing. A parallel to "strong typing" in the sense that once an entity is declared of one type (say, general) it can not be used (referenced) as another type (say, parameter) without altering its declaration.
Inheritance. Entity declarations in an internal subset of a DTD overriding those in the external DTD (or multiple DTDs) mimic child object overriding the methods of its parent object (or parent objects, in case of multiple inheritance). There is "interface inheritance" in the sense that one can use an unparsed entity (say binary) without knowing how it was implemented.
Encapsulation. The entities are "encapsulated" in the sense that they are not visible to the user.

A Taxonomy of Entities

Entities can be categorized as follows: Internal vs. External, General vs. Parameter, Parsed vs. Unparsed. Among the possible eight (2³) combinations, only the following five different entity categories are considered legal: internal general parsed, internal parameter parsed, external general parsed, external parameter parsed, and external general unparsed. Some conclusions that can be drawn from this classification are: any entity appearing in the internal subset of a DTD is always parsed, parameter entities are always parsed, and there is only one type of entity that is unparsed. Furthermore, parameter entities can appear only in the internal or external subset of a DTD, where treatment in both cases is identical.

Thus, with appropriate references to whether the entity is parsed or unparsed, we can effectively reduce our categories of discussion to the following:

Internal general entities.
External general entities.
Parameter entities.

When concepts apply to an overlapping category, we will use terms that apply to all without any loss of generality or any potential for ambiguity. For example, by internal entities we mean both internal general and internal parameter entities, by general entities we mean both internal general and external general entities, and by parameter entities we mean both internal parameter and external parameter entities.

Internal General Entities

Internal entities function as shortcuts or macros. With internal entities, you can associate an essentially arbitrary piece of text with a name. The text that is inserted by an entity reference is called the replacement text. The replacement text of an internal entity can contain markup (elements, attributes, processing instructions, other entity references, and so on), but the content must be balanced (any element that is started in an entity must end in the same entity). There is no separate physical storage object; the content of the entity is given in the declaration. Internal entities can include references to other internal entities, but it is an error for such references to be recursive, either directly or indirectly.

Internal entities are important from a processing point of view. Every conforming XML processor is required to do internal entity processing, if the entities are declared in the internal DTD subset.

Predefined Entities

There are five predefined internal entities in XML, as shown in Table 1. The < and & characters in the declarations of "lt" and "amp" are doubly escaped to meet the requirement that entity replacement be well-formed.

Entity	Entity Name	Replacement Text
The left angle bracket (<)	`lt`	`&#60;`
The right angle bracket (>)	`gt`	`>`
The ampersand (&)	`amp`	`&#38;`
The single quote or apostrophe (')	`apos`	`'`
The double quote (")	`quot`	`"`

Table 1. Predefined Entities.
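
The double escaping is easier to see in the declarations themselves. The following is a sketch of how a document could declare the five predefined entities in its internal subset, following the form given in the XML 1.0 Recommendation. The root element name algebra and its content are only illustrative.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE algebra [
  <!-- lt and amp are doubly escaped so that their replacement text
       is itself a character reference (see Table 1) -->
  <!ENTITY lt   "&#38;#60;">
  <!ENTITY gt   "&#62;">
  <!ENTITY amp  "&#38;#38;">
  <!ENTITY apos "&#39;">
  <!ENTITY quot "&#34;">
]>
<algebra>0 &lt; x &amp; x &gt; 1</algebra>
```

In the literal entity value of lt, the character reference &#38; is resolved first, which leaves the replacement text &#60;. That text is turned into the < character only when &lt; is referenced and the replacement text is parsed.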

All XML processors are required to recognize these entities whether they are declared or not. For interoperability, valid XML documents should declare these entities, prior to use. If the entities in question are declared, they must be declared as internal entities whose replacement text is the single character being escaped or a character reference to that character. For example, the fragment

<?xml version="1.0" encoding="UTF-8"?>
<algebra>
&quot;{x: 0 &lt; x &amp; x &gt; 1} has
&apos;no solution&apos; in the set of real numbers.&quot;
</algebra>

when processed yields:

"{x: 0 < x & x > 1} has 'no solution' in the set of real numbers."

Character References

A character reference refers to a specific character in the ISO/IEC 10646 (now part of Unicode) character set. Character references allow you to include any Unicode character directly in your document, even if it is unavailable on your keyboard or is not portable across applications and operating systems. Character references are also very useful for dealing with multilingual texts.

Character references are numeric and can be used without any special declaration. If the character reference begins just with "&#", the digits up to the terminating ; provide a decimal representation of the Unicode character number. If the character reference begins with "&#x", the digits and letters up to the terminating ; provide a hexadecimal representation of the Unicode character number.
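
As a small illustration of the two notations, the sketch below writes the same two characters once in decimal and once in hexadecimal. The element and attribute names are hypothetical; only the character numbers (169 = xA9 for the copyright sign, 8706 = x2202 for the partial differential sign) come from Unicode.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<symbols>
  <!-- Both attribute values denote the copyright sign -->
  <copyright decimal="&#169;" hex="&#xA9;"/>
  <!-- Both attribute values denote the partial differential sign -->
  <partial decimal="&#8706;" hex="&#x2202;"/>
</symbols>
```

No declaration is needed for any of these references, so the document is well-formed without a DTD.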

REMARKS

Making a character reference does not guarantee that a processing or display system will be able to do anything specific with the character. For example, &#8706; would insert a "partial differential" symbol (∂), but whether or not you can display or print that character is a different matter.
In addition to the use of decimals like in SGML (and therefore HTML), hexadecimal support was deemed necessary since the character tables in the Unicode standard are labelled in hexadecimal. This makes the XML Specification from/to Unicode Specification transition transparent without the need of any conversions.

Character References vs. Entity References

Character references are similar in appearance to entity references but are processed differently. They are expanded immediately when recognized by the parser, and the resulting character is treated as data rather than as markup. As an example, &#34; in an attribute value is equivalent to a literal " character that does not terminate the value, so a character reference (like the predefined entity &quot;) can be used to escape the quotation character that delimits an attribute value.

Numeric character references can also be used to escape the left angle bracket and other delimiters for which there exists a set of predefined entities. However, in general, using names is better than using character references since names are more (humanly) "intuitive." Thus, for example, to encode <, &lt; is preferred over &#60;.

Uses of Internal General Entities

The uses of internal entities depend on the two possible views of internal entities: as macros (from a programming viewpoint) and as "boilerplates" (from a publishing viewpoint). They are useful in many situations:

Internal entities allow you to define shortcuts for frequently typed text, such as boilerplate text or text that is expected to change, such as the revision status of a document. Using entities, one can define abbreviated notations of such text strings. Declaration at a single place makes storage and management easier, and also improves accuracy while saving network bandwidth.
XML numeric character references that belong to the Unicode character set in either decimal or hexadecimal representation are not very intuitive. Using entities, one can define abbreviations to them and use entity references instead of the character references.
Many XML documents, such as those in URL databases, contain lots of URLs. URLs lack persistence and are difficult to maintain. Use of entities can help make this easier.
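
The three uses listed above can be sketched in a single small document. Everything in it is hypothetical (the element names, the entity names status, copy and specs, and the example.org address); it is meant only to show where the declarations and references go.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE memo [
  <!-- Boilerplate text that is expected to change -->
  <!ENTITY status "Revision 3 (draft)">
  <!-- A readable name for a numeric character reference -->
  <!ENTITY copy   "&#169;">
  <!-- A base URL kept in one place -->
  <!ENTITY specs  "http://www.example.org/specs">
]>
<memo>
  <title>Status: &status;</title>
  <legal>&copy; 2000 Example Organization</legal>
  <link href="&specs;/xml.html"/>
</memo>
```

The document is well-formed but not valid, since the internal subset declares no elements. Changing the revision status or the base URL then means editing one declaration rather than every occurrence.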

External General Entities

External entities are all those that are not internal. They can consist of more than a single element, such as an external entity consisting of some character data with embedded inline markup. The tags in an external entity must be well balanced within the entity (you can not start a tag in an entity and end it in your document or in another entity). External entities can reference internal or other external entities, but must not contain a recursive reference to themselves, either directly or indirectly. The same external entity can be referred to several times in a single document (if the document is to be validated, however, ID attributes should not be used in the external entity). It is legal to have several external entities that all refer to the same external file. There are no additional restrictions placed on the character encodings used by external entities. In particular, external entities with different encodings can be used in the same document.

External entities, like internal entities, have names and are referenced in the same manner, although they are declared differently, which serves as one way of recognizing them.

Uses of External General Entities

The following are some primary uses of external general entities:

External entities allow an XML document to incorporate content from external files. This provides a mechanism for dividing a document up into logical chunks, each of which can be stored in a separate file and external entities can be used to "refer" to them. When the parent file is parsed by an XML processor, it will have the effect of inserting the contents of each of the individual files at that location of the respective entity references.
External entities can also be unparsed, in which case they may contain binary data; such entities are used to reference images and other non-XML content from the document.

Parameter Entities

Parameter entities are only available within the internal and external subsets of the DTD. They can be either internal or external, but they can not refer to non-XML data.

Uses of Parameter Entities

Parameter entities are most frequently used to customize and extend DTDs. They act as variables for use inside the DTD, which makes it possible to modularize a DTD for reuse and ease of maintenance.
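
A minimal sketch of this "variable" style follows. It assumes a hypothetical external DTD fragment (it could be stored as, say, text.mod) with made-up element names. The fragment has to live in the external subset or an external parameter entity, because XML does not allow parameter entity references inside markup declarations in the internal subset.

```xml
<!-- Hypothetical external DTD fragment (for example, text.mod) -->
<!ENTITY % inline "em | strong | code">

<!-- Each content model picks up the same list of inline elements -->
<!ELEMENT p      (#PCDATA | %inline;)*>
<!ELEMENT li     (#PCDATA | %inline;)*>
<!ELEMENT em     (#PCDATA)>
<!ELEMENT strong (#PCDATA)>
<!ELEMENT code   (#PCDATA)>
```

Adding a new inline element then requires a change to the inline entity only, not to every content model that uses it.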

The Document Entity

The document entity serves as the root of the entity tree and a starting-point for an XML processor, and may contain the whole document. The first design goal of XML says that "XML shall be straightforwardly usable over the Internet." In a networking environment, it is typically expensive to serve documents split in multiple modules. The document entity is all that a non-validating XML processor is required to read. The advantage of this is that documents can be authored in multiple independent modules, composed together, and delivered as a single component - the document entity.

If the XML document is in a file, the document entity is that file. If the XML document is being accessed via a URL, then the stream of bytes obtained by calling a retrieval function is the document entity.

The document entity is special in many ways. The differences between the document entity and any other external parsed entity are:

The document entity can begin with an XML declaration, other external parsed entities with a text declaration.
The document entity can contain a document type declaration.
Unlike other entities, the document entity has no name and can appear on a processor input stream without any identification at all.

Entity Declarations and References

Entity declarations carry the following general characteristics:

All entities are declared with the ENTITY declaration. The exact form of the declaration distinguishes between internal, external, and parameter entities.
Entities can be declared in three places:
- Internal DTD. You can define entities in your local document as part of the DOCTYPE definition. This option should be chosen when the entity is being used only in this one particular file.
- External files: You can also link to external files that contain the entity data. This, too, is done through the DOCTYPE definition. This option should be chosen when the entity is being used in many documents. (Parameter entities can not be declared in this way as they only appear in the DTD.)
- External DTD.
If the same entity is declared more than once, only the first declaration applies and the internal subset is processed before the external subset. A useful implication of this is, since the internal subset of the DTD is considered to appear "first", you can put entity declarations in it to override those in the external part of the DTD. Though not required, an XML processor may issue a warning if entities are declared multiple times.
Entities must be declared before they can be referenced. (There is a subtle difference in cases of undeclared entity references in well-formedness and validity contexts, but one should avoid the practice of referencing an entity without declaring it first.)
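
The override rule can be sketched with a single document. The file name house.dtd, the element name report and the entity name status are all assumptions made for the illustration; the content of the external DTD is shown in a comment.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE report SYSTEM "house.dtd" [
  <!-- Assumed content of the hypothetical house.dtd:
         <!ELEMENT report (#PCDATA)>
         <!ENTITY status "draft">
  -->
  <!-- Declared in the internal subset, so this declaration is seen first -->
  <!ENTITY status "final">
]>
<report>Status: &status;</report>
```

Because the internal subset is processed before the external subset and only the first declaration of an entity is binding, a processor that reads both subsets would be expected to expand &status; to final rather than draft.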

Entity references carry the following general characteristics:

General and parameter entities are referenced differently, depending on the use of the combination of symbols ampersand (&), percent sign (%) and semicolon (;) as delimiters. This serves as one way of identifying them in the documents in which they appear.
An entity reference must not contain the name of an unparsed entity.
Parsed entities are invoked by name using entity references. A parsed entity must not contain a recursive reference to itself, either directly or indirectly. The invocation of unparsed entities is a combination of a necessary and sufficient condition: Unparsed entities are invoked by name given in the value of ENTITY or ENTITIES attributes, and conversely, unparsed entities can be referred to only in attribute values declared to be of type ENTITY or ENTITIES.

Internal General Entity Declarations and References

All internal entities must be declared in the internal or external DTD subsets. Entity references should follow their declaration in the source. At the time of being parsed, an entity reference at a given point in an XML document instance triggers the substitution of its contents at that point. Entity definitions can themselves refer to other internal and previously defined entities. You create internal entities with entity declarations in the internal subset of the DTD.

It is an error to insert an entity reference to an unparsed entity directly into the flow of an XML document. Unparsed entities can only be used as attribute values on elements with ENTITY attributes.

Internal general entity declarations, which can occur only in the DTD, have the following form:

<!ENTITY entityname "replacement text">

You can use either double or single quotes to delimit the replacement text. For example, if your document frequently refers to, say, "World Wide Web Corporation," you could declare it as an entity:

<!ENTITY wwwc "World Wide Web Corporation">

You can then insert it as needed in your document with the entity reference &wwwc;, which saves typing it out each time.

Internal general entity references can occur both in the DTD and the document instance. They consist of an ampersand (&), followed by the name of the entity, followed by a semicolon (;). They are not expanded in the DTD and are of the form

&entityname;

For example, to refer to "John Wiley & Sons" in a document by &jws;, we can have the following entity:

<!ENTITY jws "John Wiley &amp; Sons">

If an internal entity will be used in several documents, it is preferable to add that to an external file (or DTD) instead of declaring them in the internal DTD subset. This has the following advantages:

It avoids having to include the declarations in each document, which gets particularly cumbersome and error-prone to author when there are a number of entities to be declared.
Multiple authors can make use of the entities without having to declare them over and over again. Such external entity files can evolve into a useful enterprise-wide database.
If modification is ever required in the entity name or replacement text, (such as, from World Wide Web Corporation to World Wide Web Corporation, Inc.) global changes can be made simultaneously.

An entity declaration can include another entity reference within it as long as the reference is not, either directly or indirectly, recursive. Therefore, recursive entity declarations such as the following should not be used:

<!ENTITY endless "&loop;">
<!ENTITY loop "&endless;">

CONSTRUCTION OF INTERNAL ENTITY REPLACEMENT TEXT

The construction of internal entity replacement text discusses the treatment of internal entities, general as well as parameter, and distinguishes two forms of the entity's value: the literal entity value and the replacement text.

Literal Entity Value. The literal entity value is the quoted string actually present in the entity declaration, corresponding to the entity value. The literal entity value as given in an internal entity declaration may contain character, parameter-entity, and general-entity references. Such references must be contained entirely within the literal entity value.
Replacement Text. The replacement text is the content of the entity after replacement of character references and parameter entity references. That is, the replacement text contains (1) the replacement text of any parameter entities referred to, and (2) the character referred to in place of any character reference in the literal entity value. General entity references, however, are left as-is, unexpanded.

Example 1. This example shows that a literal entity value and the replacement text can be different. It also shows that the replacement text can be different from what may eventually appear in the document's content or an attribute value once a general entity is referenced and expanded.

Consider the declarations:

<!ENTITY % company    "Yoyodyne, Inc.">
<!ENTITY   legal "All Rights Reserved.">
<!ENTITY   signature   "&#xA9; 1999 %company;. &legal;">

The literal entity value for the entity "signature" is:

&#xA9; 1999 %company;. &legal;

The replacement text for the entity "signature" is:

© 1999 Yoyodyne, Inc. &legal;

The entity "signature" once referenced and expanded in the document's content or an attribute value is:

© 1999 Yoyodyne, Inc. All Rights Reserved.

External General Entity Declarations and References

External entity declarations come in two forms depending on whether the entity is in an XML format or in a non-XML format. External general entities are referenced in the same manner as the internal general entities.

External General Entities with an XML Format

There are two cases here.

I. The External Identifier contains a System Identifier.

When the external entity contains some XML format, the declaration can have the following form:

<!ENTITY entityname SYSTEM "system-identifier">

The external identifier consists of the keyword SYSTEM followed by a system literal. The system literal, also known as the system identifier, takes the form of a URI and is used to retrieve the entity. Often the URI can be a simple filename.

Example 2. XML documents can be of arbitrary complexity and size, and delivery performance can become an issue. External entity declarations can be useful for managing large documents by dividing them into components, which can later be included in the parent document via entity references. Given such a document, filename.xml, you can divide it into logical components (say, sections) and store them in separate files (say, section_1.xml, ..., section_m.xml). The external entities that incorporate these sections into your document can be declared like so:

<!ENTITY section1 SYSTEM "path/to/section_1.xml">
...
<!ENTITY sectionm SYSTEM "path/to/section_m.xml">

Then, the "parent" document filename.xml (which is actually the document entity) can consist of just the references to the entities:

<?xml version="1.0" encoding="UTF-8"?>
<document>
  &section1;
  ...
  &sectionm;
</document>

Thus, we see that entire documents can be created using entities.
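
Putting the two pieces of Example 2 together, the declarations have to appear in the document type declaration of the parent document before the references can be used. The sketch below assumes just two sections and reuses the path pattern from the example; the root element name document is taken from the listing above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document [
  <!-- Two sections shown; the file names follow the pattern used above -->
  <!ENTITY section1 SYSTEM "path/to/section_1.xml">
  <!ENTITY section2 SYSTEM "path/to/section_2.xml">
]>
<document>
  &section1;
  &section2;
</document>
```

Each section file must be well balanced on its own. Since a non-validating processor is only required to read the document entity, whether the sections are actually pulled in can depend on the processor being used.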

II. The External Identifier contains a Public Identifier and a System Identifier.

A public identifier is a name that is intended to be semantically interoperable across different computer environments. It is an ISO standard (ISO/IEC 9070:1991). In addition to a system identifier, an external identifier may include a public identifier. In that case the external identifier begins with the keyword PUBLIC, followed by a public identifier literal and then a system literal in the form of a URI.

When the external entity contains some XML format, the declaration can thus also have the following form:

<!ENTITY entityname PUBLIC "public-identifier" "system-identifier">

A public identifier can be useful when working with an entity that is publicly available. The XML processor can check the public identifier against a list of resources to which it is connected and determine whether or not it needs to get a new copy of the entity. However, until such public information storage mechanisms become more widely available and canonical, the system identifier will be more commonly used. Thus, XML allows you to use public identifiers, but you still have to provide a system identifier (URI) for each external entity. An XML processor attempting to retrieve the entity's content may use the public identifier to try to generate an alternative URI. If the processor is unable to do so, it must use the URI specified in the system literal.

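The following declaration makes use of the XHTML 1.0 public identifier (the entity name xhtml1 is illustrative; because the target is a DTD, it is declared here as a parameter entity):

<!ENTITY % xhtml1 PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">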

External General Entities with a Non-XML Format and Notation Declarations

External entities that refer to non-XML data files must declare that the data they contain is not in XML. This is accomplished by using notation declarations.

A notation is a name, with an associated external identifier, which can be applied to unparsed entities (and also, when used in a NOTATION attribute, to the content of elements). It identifies by name the format of unparsed entities, the format of elements with a notation attribute, and the application to which a processing instruction is addressed. The basic idea is that the notation (name and an external identifier) should be helpful in handling the data (unparsed entities) to which it is attached. A notation declaration thus consists of a name for the notation, and an external identifier that allows the XML processor to locate an application to process data that is flagged to be in the given notation.

The notation declaration has the form:

<!NOTATION notationname PUBLIC "public-identifier">
<!NOTATION notationname PUBLIC "public-identifier" "system-identifier">
<!NOTATION notationname SYSTEM "system-identifier">

If both public and system identifiers are provided, the public identifier must come first, and the system identifier must not be preceded by the keyword SYSTEM. An external entity is tied to a notation by the NDATA keyword in its declaration; entities declared with a notation in this way are unparsed entities.

An example is:

<!NOTATION GIF87A SYSTEM "C:\Program Files\gifviewer.exe">

This information is passed by the XML processor to the application, which may use it however it wishes to. Note that XML processors provide applications with the name and external identifier(s) of any notation declared and referred to. They may additionally help the application process the data described in the notation by resolving the external identifier into the system identifier, file name, or other information. However, it is not within the scope of an XML processor to deal with the unparsed entities.

There are specific attributes in the DTD syntax to express external general entities with non-XML data. See the section on Entity Attributes for more details.
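
To show how a notation, an unparsed entity and an ENTITY attribute fit together, here is a minimal sketch. The element names album and photo, the attribute src, the entity logo, the notation name gif and the file path are all hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE album [
  <!ELEMENT album (photo+)>
  <!ELEMENT photo EMPTY>
  <!-- The src attribute may only name a declared unparsed entity -->
  <!ATTLIST photo src ENTITY #REQUIRED>
  <!-- The notation names the data format -->
  <!NOTATION gif SYSTEM "image/gif">
  <!-- NDATA marks the entity as unparsed and ties it to the notation -->
  <!ENTITY logo SYSTEM "images/logo.gif" NDATA gif>
]>
<album>
  <photo src="logo"/>
</album>
```

Note that the attribute value is the bare entity name logo, not the reference &logo;. The processor does not read the image itself; it hands the entity's identifiers and the notation to the application.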

Parameter Entity Declarations and References

Parameter entity declarations, which can occur only in the DTD, are identified by a % preceding the entity name. They are of the form:

<!ENTITY % parameterentityname1 "replacement text">
<!ENTITY % parameterentityname2 SYSTEM "URI">

Note the space following the % in the declaration. The declaration of a parameter entity must precede any reference to it.

Parameter entity references can occur only in the DTD part of the document. They consist of a percent symbol (%), followed by the name of the entity, followed by a semicolon (;). They are of the form

%parameterentityname;

The following declaration defines the parameter entity subdtd, which refers to a part of a DTD and can be referenced from inside the internal or external DTD subsets:

<!ENTITY % subdtd SYSTEM "http://path/to/filename.dtd">
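
Declaring an external parameter entity does not by itself pull in the declarations it points to; the entity also has to be referenced. A minimal sketch, assuming a hypothetical file common.dtd that declares an item element:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE catalog [
  <!-- common.dtd is a hypothetical file assumed to declare the item element -->
  <!ENTITY % common SYSTEM "common.dtd">
  <!-- The reference below inserts the declarations from common.dtd here -->
  %common;
  <!ELEMENT catalog (item*)>
]>
<catalog/>
```

In the internal subset, a parameter entity reference like %common; may appear only between markup declarations, as it does here, and not inside one.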

Example 3. This example shows a section (which has been edited here for convenience) of the SVG DTD that uses parameter entities in the definition of the circle element:

<!-- Allow at most one of description and title, supplied in any order. -->
<!ENTITY % descTitle "((desc,title?)|(title,desc?)?)">

<!-- Allow extending the DTD with internal subset for graphics elements. -->
<!ENTITY % geExt "">

<!-- Use of entities in the circle element. -->
<!ENTITY % circleExt "">
<!ELEMENT circle (%descTitle;,(animate|set|animateMotion|animateColor|animateTransform 
                  %geExt;%circleExt;)*)
>

REMARKS

Parameter entities are only recognized in markup declarations (in the DTD, for example).
Parameter entities, although they can appear in a document outside the DTD, should not be used in the content of your document as they are not recognized. Suppose there is a parameter entity named foo. Then, if the string %foo; appears somewhere in the document outside of the DTD, it is not an error. It is just the string %foo;.
The use of parameter entities in other entity values is not recommended.
Parameter entity references are immediately expanded in the DTD and their replacement text is part of the declaration, whereas non-parameter entity references are not expanded.

Parameter Entity Declarations and Conditional Sections

Conditional sections are a mechanism for parameterizing the DTD. Note, however, that you cannot use conditional sections in the internal subset of XML documents.

Conditional Sections are portions of the external DTD subset which are included in, or excluded from, the logical structure of the DTD based on the keyword which governs them. Like the internal and external DTD subsets, a conditional section may contain one or more complete declarations, comments, processing instructions, or nested conditional sections, intermingled with white space.

The markup declarations allowed in a conditional section are element type declarations, attribute-list declarations, entity declarations, and notation declarations. These declarations may be contained in whole or in part within parameter entities. A conditional section is introduced by the sequence:

<![KEYWORD[ ... ]]>

The KEYWORDs are:

INCLUDE. INCLUDE indicates that the contents of the conditional section are part of the DTD.
IGNORE. IGNORE indicates that the contents of the conditional section are not logically part of the DTD and that the text in the section should be ignored (it completely disappears from the parsed document). Note that for reliable parsing, the contents of even ignored conditional sections must be read in order to detect nested conditional sections and ensure that the end of the outermost (ignored) conditional section is properly detected.
CDATA. CDATA uses the same bracket syntax but marks a CDATA section rather than a conditional section: markup inside the section is not recognized, and only the closing characters ]]> are significant. (CDATA sections are allowed in the content of an XML document, but the keyword cannot be supplied by a parameter entity.)

If a conditional section with a keyword of INCLUDE occurs within a larger conditional section with a keyword of IGNORE, both the outer and the inner conditional sections are ignored. If the keyword of the conditional section is a parameter entity reference, the parameter entity is replaced by its content before the processor decides whether to include or ignore the conditional section.

Example 4. The following example illustrates the use of INCLUDE and IGNORE. A company can use these constructs to present a document (brochure) for internal (private) and external (public) use.

<!ENTITY % private "INCLUDE">
<!ENTITY % public "IGNORE">

<![%private;[
<!ELEMENT brochure (comments*, cost, title, body)>
]]>
<![%public;[
<!ELEMENT brochure (title, body)>
]]>
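
The effect of switching the two keywords can be sketched with Python's xml.parsers.expat module. This is only an illustration under stated assumptions: the DTD of Example 4 is supplied from a string as the external subset (conditional sections are not permitted in the internal subset), and the file name brochure.dtd and the helper brochure_children are hypothetical.

```python
from xml.parsers import expat

# External subset from Example 4, with the two keywords left as placeholders.
DTD_TEMPLATE = """<!ENTITY % private "{private}">
<!ENTITY % public "{public}">
<![%private;[
<!ELEMENT brochure (comments*, cost, title, body)>
]]>
<![%public;[
<!ELEMENT brochure (title, body)>
]]>
"""

def brochure_children(private, public):
    """Return the element declarations that survive the conditional sections."""
    dtd_bytes = DTD_TEMPLATE.format(private=private, public=public).encode("utf-8")
    found = []

    def on_element_decl(name, model):
        # model is (type, quantifier, name, children); keep the child names.
        found.append((name, [child[2] for child in model[3]]))

    parser = expat.ParserCreate()
    # Ask expat to read parameter entities, including the external subset.
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
    parser.ElementDeclHandler = on_element_decl

    def on_external_entity(context, base, system_id, public_id):
        # Serve the hypothetical brochure.dtd from memory instead of a file.
        sub = parser.ExternalEntityParserCreate(context)
        sub.ElementDeclHandler = on_element_decl
        sub.Parse(dtd_bytes, True)
        return 1

    parser.ExternalEntityRefHandler = on_external_entity
    parser.Parse(b'<!DOCTYPE brochure SYSTEM "brochure.dtd"><brochure/>', True)
    return found

print(brochure_children("INCLUDE", "IGNORE"))
print(brochure_children("IGNORE", "INCLUDE"))
```

Only the declaration inside the section whose keyword resolves to INCLUDE is reported, so exchanging the two entity values switches between the private and the public content model without touching the declarations themselves.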

Entity Attributes

References to an unparsed entity can occur only in attribute values that were declared to be of types ENTITY and ENTITIES. As with attribute declarations, the first occurrence of an entity declaration takes precedence. This allows declarations to be made in the DTD's internal subset which is read prior to the external subset, thus overriding possible definitions for the same entity name in the external subset.

For handling nonparsable data, we must specify the NDATA keyword followed by the name of a notation. This allows the data to be passed to and handled by an application capable of interpreting that notation.

A common use of unparsed entities in entity attributes is in XML elements that incorporate graphics into a document.

Example 5. This example illustrates the use of the ENTITY attribute:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document [
  <!ELEMENT document (graphics)+>
  <!ELEMENT graphics EMPTY>
  <!ATTLIST graphics image ENTITY #REQUIRED 
                     alternative CDATA #IMPLIED>
  <!NOTATION PNG SYSTEM "/usr/local/bin/PNGViewer">
  <!ENTITY logo SYSTEM "logo.png" NDATA PNG>
  <!ENTITY wwwc "World Wide Web Corporation">
]>

<document>
  <!-- The following image is the World Wide Web Corporation logo. -->
  <graphics image="logo" alternative="&wwwc; Logo"/>
</document>

In this example, image is an ENTITY attribute and contains only the name (logo) of the unparsed entity (logo.png), whereas alternative is a CDATA attribute whose value contains a reference (&wwwc;) to the internal general entity (wwwc), which is a parsed entity.
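
How an application might connect the attribute value to the unparsed entity and its notation can be sketched as follows. The sketch assumes Python's xml.parsers.expat module, a non-validating parser that reports declarations through callbacks; the bookkeeping dictionaries and handler names are hypothetical and not part of any standard API.

```python
from xml.parsers import expat

EXAMPLE_5 = """<!DOCTYPE document [
  <!ELEMENT document (graphics)+>
  <!ELEMENT graphics EMPTY>
  <!ATTLIST graphics image ENTITY #REQUIRED
                     alternative CDATA #IMPLIED>
  <!NOTATION PNG SYSTEM "/usr/local/bin/PNGViewer">
  <!ENTITY logo SYSTEM "logo.png" NDATA PNG>
  <!ENTITY wwwc "World Wide Web Corporation">
]>
<document>
  <graphics image="logo" alternative="&wwwc; Logo"/>
</document>"""

unparsed = {}         # entity name -> (system identifier, notation name)
notations = {}        # notation name -> system identifier
entity_attrs = set()  # (element, attribute) pairs declared as ENTITY
resolved = []         # what the application learns about each image
alternatives = []     # expanded values of the alternative attribute

def on_entity_decl(name, is_param, value, base, system_id, public_id, notation):
    if notation is not None:  # only unparsed entities carry a notation
        unparsed[name] = (system_id, notation)

def on_notation_decl(name, base, system_id, public_id):
    notations[name] = system_id

def on_attlist_decl(elname, attname, att_type, default, required):
    if att_type == "ENTITY":
        entity_attrs.add((elname, attname))

def on_start(name, attrs):
    for attname, value in attrs.items():
        if (name, attname) in entity_attrs and value in unparsed:
            system_id, notation = unparsed[value]
            resolved.append((value, system_id, notation, notations.get(notation)))
    if "alternative" in attrs:
        alternatives.append(attrs["alternative"])

parser = expat.ParserCreate()
parser.EntityDeclHandler = on_entity_decl
parser.NotationDeclHandler = on_notation_decl
parser.AttlistDeclHandler = on_attlist_decl
parser.StartElementHandler = on_start
parser.Parse(EXAMPLE_5, True)

print(resolved)
print(alternatives)
```

The parser never opens logo.png. It only reports the identifiers, and the application decides what to do with them, while the general entity reference in the alternative attribute is expanded by the parser itself.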

Example 6. Some data does not lend itself to expression in XML syntax, and some legacy data may be considered unfit for a transition to XML for one reason or another. Such can be the case for data in binary formats. For example, for photographs, JFIF (more commonly known as JPEG) is preferred over Portable Network Graphics (PNG) or Scalable Vector Graphics (SVG). External entities can be used in such cases for backward compatibility. As an example, suppose we have an image, ottawa.jpg, of the City of Ottawa, Canada. Then, the following declaration declares the entity ottawa as a JPEG image:

<!ENTITY ottawa SYSTEM "ottawa.jpg" NDATA JPG>

Entities declared this way cannot be inserted directly into the document and, conversely, entities declared without a notation cannot be used as the value of an entity attribute. Entities declared with a notation must be used as entity attributes of elements:

<graphics image="ottawa"></graphics>

There are differences between entity attributes and entity references in attribute values, as discussed in the next section.

Entity Attributes vs. Entities in Attribute Values

Unparsed entities are allowed in entity attributes, where parsed entities are forbidden. Unparsed entities may be referred to only in attribute values declared to be of type ENTITY or ENTITIES. An ENTITY attribute can only contain the name of an unparsed entity. In particular, it contains the name of the entity, and not a reference to the entity.

In the next example, the attributes point to external data in the form of unparsed entities:

<!-- Attribute b points to a single unparsed entity -->
<!ATTLIST a b ENTITY #IMPLIED>

<!-- Attribute c points to multiple unparsed entities -->
<!ATTLIST a c ENTITIES #IMPLIED>

Parsed entities are allowed in the body of the document, whereas unparsed entities are forbidden. References to internal entities in attribute values are allowed; external entity references in attribute values are not. An entity reference must not contain the name of an unparsed entity. The following:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE document [
  <!ENTITY logo SYSTEM "logo.png" NDATA PNG>
]>

<document>
  <!-- The following image is the World Wide Web Corporation logo. -->
  ..
<p>The company logo: &logo;</p>
</document>

is not allowed because an unparsed entity could be binary (as above), and embedding it in the midst of some text would be indecipherable (at least to humans).
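
A parser is expected to reject such a document. The following sketch checks this with Python's xml.etree.ElementTree (an assumption of this illustration; other processors may word the error differently). The elided part of the document is omitted.

```python
import xml.etree.ElementTree as ET

# The forbidden case: a reference to an unparsed (NDATA) entity in content.
BAD_DOC = """<!DOCTYPE document [
  <!ENTITY logo SYSTEM "logo.png" NDATA PNG>
]>
<document>
<p>The company logo: &logo;</p>
</document>"""

try:
    ET.fromstring(BAD_DOC)
    rejected = False
except ET.ParseError as err:
    # The reference violates a well-formedness constraint, so parsing stops.
    rejected = True
    print("rejected:", err)
```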

Entity Standardization

There are sets of entities which are frequently used by large communities of geographically dispersed people with common goals. Such is the case, for example, among users of mathematical notation or natural language characters. This raises the possibility of the same symbol being used under different names, or even of name collisions. For example, one user may use &diff; and another may use &difference; to mean the difference of two sets (say, A - B) in a Set Theory context, whereas a third user may use &diff; to denote the differential sign (say, dx) used in Calculus. To avoid this situation, the process of associating names with entities needs to be formalized and standardized. The two major bodies that carry out this standardization process are ISO and Unicode.

Entity Recognition, Processing and Expansion

This section summarizes the entity expansion and the (expected) treatment of entities and references by an XML processor. The details are given in Section 4.4 and Appendix D of the XML Specification.

As described in the specification's treatment of entities and references, an XML processor inserts the replacement text of a parsed entity into the document wherever a reference to that entity occurs. The specification includes a table that summarizes the contexts in which character references, entity references, and invocations of unparsed entities might appear, and the required behaviour of an XML processor in each case. The table covers what is and is not recognized; when an entity is included; what the processor should do (include or not include) when it comes across external parsed entities or parameter entities; what to do when an entity reference appears in an attribute value, or a parameter entity reference appears in a literal entity value; what to do when the name of an unparsed entity appears as a token in the value of an attribute of declared type ENTITY or ENTITIES; what to do when a general entity reference appears in the EntityValue in an entity declaration; and what types of references are forbidden. Some highlights and useful consequences are:

The % character, which would introduce a parameter entity reference when present in the DTD, has no special significance outside the DTD and is not recognized there.
The names of unparsed entities are not recognized except when they appear in the value of an appropriately declared attribute.
The replacement text of an entity (except when in an attribute value or entity declaration) can contain other entity references, which are processed recursively.
When the name of an unparsed entity appears as a token in the value of an attribute of declared type ENTITY or ENTITIES, a validating processor must inform the application of the system and public (if any) identifiers for both the entity and its associated notation. What the application does with them is the responsibility of the application, not the processor.
The following are not allowed: a reference to an unparsed entity; any character or general entity reference in the DTD (except within an EntityValue or AttValue); and a reference to an external entity in an attribute value.

Different processors may differ in their implementation of entities. For example, in Microsoft XML Parser (MSXML) the entity declaration <!ENTITY lt "<"> is considered invalid since it cannot be legally expanded.

XML authoring software usually supports entity expansion. A standalone entity expander, entity.exe, is also available; it simply expands the entities in an XML file.

Character references are expanded immediately. General entity references that appear in the entity value of an entity declaration are not expanded (bypassed and left as is) by an XML processor until the entity being declared is referenced. Thus, general entities do not have to be declared before being used and the order of general entity declarations is unimportant. (However, the declaration of a general entity must precede any reference to it which appears in a default value in an attribute-list (ATTLIST) declaration.) For example, the following set of declarations:

<!ENTITY wwwc "&www; Corporation">
<!ENTITY www  "World Wide Web">

are legal in the internal subset because the entity reference "&www;" is not expanded until "&wwwc;" is itself referenced, by which time both declarations have been read.
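
The order independence of these two declarations can be confirmed with a small sketch, again assuming Python's xml.etree.ElementTree as the processor; the wrapping document element is hypothetical.

```python
import xml.etree.ElementTree as ET

# wwwc refers to www before www has been declared.
ORDER_DOC = """<!DOCTYPE document [
  <!ENTITY wwwc "&www; Corporation">
  <!ENTITY www  "World Wide Web">
]>
<document>&wwwc;</document>"""

# The inner reference is resolved only when &wwwc; is used in the content,
# by which point the whole internal subset has been read.
expanded = ET.fromstring(ORDER_DOC).text
print(expanded)
```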

Example 7. The following is an example which illustrates a sequence of complex interactions in entity recognition, processing and expansion:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document [
  <!ELEMENT document (quote)+>
  <!ELEMENT quote (#PCDATA)>
  <!ENTITY r 'Rumple'>
  <!ENTITY s 'stilskin'>
  <!ENTITY % y '&#60;!ENTITY x "&#38;r;&#38;s;"&#62;'>
%y;
]>

<document>
  <quote>
    &#34;My name is &x;, but you can call me r12n.&#34;
  </quote>
</document>

which upon parsing (with, for example, XML4C) should yield:

<document>
  <quote>
    &quot;My name is Rumplestilskin, but you can call me r12n.&quot;
  </quote>
</document>
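
Example 7 can also be tried with Python's xml.parsers.expat module in place of XML4C. This is a sketch under one notable assumption: parameter entity parsing has to be switched on explicitly, because expat does not expand %y; by default.

```python
from xml.parsers import expat

EXAMPLE_7 = """<!DOCTYPE document [
  <!ELEMENT document (quote)+>
  <!ELEMENT quote (#PCDATA)>
  <!ENTITY r 'Rumple'>
  <!ENTITY s 'stilskin'>
  <!ENTITY % y '&#60;!ENTITY x "&#38;r;&#38;s;"&#62;'>
%y;
]>
<document>
  <quote>
    &#34;My name is &x;, but you can call me r12n.&#34;
  </quote>
</document>"""

chunks = []
parser = expat.ParserCreate()
# Without this call the reference %y; is skipped and x is never declared.
parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
# Character data may arrive in several pieces, so collect and join them.
parser.CharacterDataHandler = chunks.append
parser.Parse(EXAMPLE_7, True)

quote_text = "".join(chunks).strip()
print(quote_text)
```

The character references in the value of y are expanded when y is declared, which produces the declaration of x; the references to r and s inside x are resolved only when &x; is used in the quote element.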

Entity Authoring

XML authoring software that supports DTDs will typically provide features for checking entity well-formedness and validity. XML Spy is a commercial XML editor with such support. For Example 5, we have the Enhanced Grid View, the Source View, and the validation test, illustrated in the screenshots below.


Screenshots: Enhanced Grid View, Source View, Validation Test

As seen previously, one of the options for declaring entities is via a URI. When this is the case, the XML processor is in some sense performing a client-side include in an XML document. It is possible to do this dynamically by extending an XML parser that implements the XML DOM so that it changes the values of entities after it has parsed the DTD, but before it begins parsing the document. (It is not possible to do that by just walking the DOM and looking for entity reference nodes, since certain entity references are resolved at parse time and adjacent text nodes are normalized.) This can be useful if the entities are available publicly but copyrighted, and explicit distribution, except from the originating source, is not allowed.

It may be desirable to view or edit one or more of the entities or parts of entities while having no interest, need, or ability to view or edit the entire document. The XML Fragment Interchange defines a way to send only such fragments of an XML document.

Entity Management

Since external entities in different documents can refer to the same files on your file system, they provide an opportunity for a reuse strategy. DTDs which make large-scale [re]use of entities require an entity management system, which XML, by itself, does not provide.

OmniMark is an SGML/XML content management system which provides a variety of support for entities. In the absence of an entity management system (such systems are often commercial and can be expensive), certain entity-related tasks can be carried out via scripts or style sheets. If you used to declare entities within XML document files, there is a Perl script that will move them to external DTD files. Using a DSSSL script, MathML entities were extracted from unicode.xml to yield the MathML DTD.

Entity Conversion From SGML to XML

SGML entities differ from XML entities in various significant ways, particularly in terms of flexibility. For example, SGML parameter entity declarations are not restricted to the DTD. The transition of an SGML entity base to XML then becomes an issue. There have been efforts to convert HTML entities (which are SGML entities) to an XML format; recently, they were converted for use in XHTML 1.0. Similar efforts have been carried out for DocBook.

Applications of Entities

The use of entities is already pervasive in HTML. Several kinds of symbols are commonly used, for example in legal contexts (copyright, registered, trademark), financial contexts (currencies), and internationalization contexts (Arabic, Cyrillic, Greek).

Data Management

Entities can make entering and managing data easier in various respects:

Data Reuse. Entities can be quite useful in reuse of "blocks" of information:
- Static Blocks. Mathematical formulas, theorems, and author names are examples of static "blocks" of information. Instead of including them in a particular document, one can "componentize" them in separate entity files. One can then have "many-to-one" (many documents accessing one entity file) and "one-to-many" access.
- Dynamic Blocks. Legal or contact information are examples of dynamic "blocks" of information. There are times when a company's legal information (copyright, terms of use, acceptable use policy) has not been decided at the time of production of a document. When encapsulated in an entity, the legal information becomes independent of the document production process, and can be added at any time during the cycle. If the company plans to move, the contact (postal) address will change. With the use of entities, global modifications to the contact information can be made readily. This is similar to the situation of a pointer to a variable in C.
Typing. Data reuse lessens the burden of typing. Instead of typing the same text over and over again, you can define an internal entity to contain the text and then use the entity wherever you want to insert the text. Also, the fewer times the data is entered, the lower the potential for inadvertently introducing typing errors. This results in improved overall quality of typing. Because the entity is expanded by the parser, you can be assured that you'll get the same text in every location.
Maintenance. Modularization and data reuse help reduce the time and effort spent maintaining large sets of documents and, as a consequence, help focus on other priorities, such as the creation of the content itself.

These possibilities have been illustrated at an elementary level in Example 2 and Example 4.

Entities for Mathematical Characters in XHTML

The entity base for basic mathematical notation has been strong in HTML (and now XHTML) since HTML 3.2 (though support for rendering in widely-used browsers has always been lacking).

Example 8. Let A and B be two sets. The statement that the number of elements in A union B is equal to the number of elements in A plus the number of elements in B minus the number of elements in A intersection B can be expressed as:

n(A &#8746; B) = n(A) + n(B) &#8722; n(A &#8745; B).
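
As a quick check of what these character references denote, the line can be run through a parser. The sketch below assumes Python's xml.etree.ElementTree; the wrapping p element is hypothetical.

```python
import xml.etree.ElementTree as ET

# The formula from Example 8, wrapped in a hypothetical <p> element.
FORMULA = "<p>n(A &#8746; B) = n(A) + n(B) &#8722; n(A &#8745; B).</p>"

# Character references are replaced by the characters they name:
# 8746 is UNION, 8722 is MINUS SIGN, and 8745 is INTERSECTION.
formula_text = ET.fromstring(FORMULA).text
print(formula_text)
```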

More complex constructions are possible. We have extracted useful mathematically-oriented entities from XHTML 1.0 and made them available. They could be used as follows:

<!ENTITY % HTMLmath SYSTEM "xhtml-math.ent">
%HTMLmath;

Entities in MathML

Notation is at the heart of mathematical representation and evolution of mathematics as a subject. As a result, mathematics makes use of a very large collection of symbols. It is difficult to write mathematics fluently if these characters are not available for use in coding. Furthermore, it is difficult to read mathematics if glyphs are not available for presentation on specific display devices.

MathML has extensive support for entities and characters. The MathML effort has directly taken on the specification of part of the full mechanism of proceeding from notation to final presentation, and is collaborating with organizations (ISO, The STIX Project, Elsevier Science, Wolfram Research) that are undertaking specification of the rest.

TtH, software for translating [LA]TEX to HTML, and its successor TtM, software for translating [LA]TEX to MathML, make wide use of symbols for rendering mathematical notation.

DTD Customization and Extension

Parameter entities are often used in large-scale DTDs for compactness (and thus time and space efficiency) and for ease of maintenance. See the section on Parameter Entities. You cannot use character or general entities in the DTD; you have to use parameter entities. This is because DTDs use a different syntax from that of documents. Parameter entities are not needed for schemas based on XML syntax (such as XML Schema).

One can extend the capabilities of a DTD using entities in two different ways: (1) by adding an entity to (the internal DTD subset of) an XML document or, (2) by adding an entity to the external DTD subset.

Example 9. This example shows how you can add a "nonstandard" entity to an XHTML document.

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
[
  <!ENTITY r "Rumplestilskin">
]>

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head><title>My Name is Not Bond</title></head>
<body>
  &#34;My name is &r;, but you can call me r12n.&#34;
</body>
</html>

DocBook is a widely-used DTD in publishing and makes sophisticated use of parameter entities for customization.

Extending XHTML DTD by the MathML DTD

Let *ML be an XML vocabulary. In order to combine *ML and XHTML into a single DTD, an author would:

Decide where *ML content should be legal in the document.
Add the *ML root element to the content model at that point.

An example of a complex module that can be added to XHTML is the MathML DTD.

Example 10. Extension of XHTML 1.0 Transitional DTD by the MathML 1.0 DTD.

<!-- This assumes that the DTD's are available locally. -->
<!ENTITY % xhtml SYSTEM "xhtml1-transitional.dtd">
<!ENTITY % mathml SYSTEM "mathml1.dtd">

<!-- We will use MathML entities. Declare XHTML entities empty (to override them). -->
<!ENTITY % HTMLlat1 "">
<!ENTITY % HTMLsymbol "">
<!ENTITY % HTMLspecial "">

<!-- Add MathML to HTML Content Model at an appropriate place. -->
<!ENTITY % misc "ins | del | script | noscript | mathml:math">

<!-- Load XHTML and MathML. -->
%xhtml;
%mathml;

A formal framework for the modularization of XHTML has been initiated. Using it, XHTML can be extended by adding a module to XHTML (or to a subset of XHTML). The module being added is incorporated in the DTD by reference, rather than by explicitly including the new definitions in the DTD. XHTML 1.1 has a single DTD (as opposed to XHTML 1.0, which has three DTDs). An example of the extension of the XHTML 1.1 DTD by the MathML 1.0 DTD has been provided.

REMARKS

The elements in the MathML DTD must be declared with a namespace prefix on them as XHTML requires that new elements and attributes be in their own namespaces.
If there is a conflict between element (or attribute) names, the extension, even though valid, will not work.

Limitations of Entity use

One should not increase, beyond what is necessary, the number of entities required to explain anything.
- William of Ockham (1285-1349)

Although entities have various advantages and uses, there are a few caveats:

Processing. Non-validating parsers are not required to resolve entities declared outside the document (in the external DTD subset). In fact, non-validating parsers may not perform entity expansion at all. Therefore, certain well-formedness errors, specifically those that require reading external entities, may not be detected by a non-validating processor. Examples include the constraints entitled Entity Declared, Parsed Entity, and No Recursion, as well as some of the cases described as Forbidden in XML Processor Treatment of Entities and References. The information passed from the processor to the application may vary, depending on whether the processor reads parameter and external entities. For example, a non-validating processor may not normalize attribute values, include the replacement text of internal entities, or supply default attribute values, where doing so depends on having read declarations in external or parameter entities.
Rendering. The extent of support for entities varies across browsers. Currently, the widely-used browsers do not support them entirely. This is particularly the case with the support for character references.
Interoperability. There are factors pertaining to entity management in SGML that impede interoperability of SGML documents. Such issues are also a cause of concern in XML documents. For example, when XML documents are moved to different computers, the association of system identifiers (URIs) in entity declarations with the correct storage objects may not be preserved. This is because URIs lack persistence.

Conclusion

Entities are a fundamental concept in XML. The use of entities offers a number of benefits, such as the possibility of defining commonly used text in a single location for reuse, and of dividing monolithic documents into manageable modules. When used appropriately, they can be the underlying foundation of large-scale document production and management in both internal (intranet) and external (Internet) environments, with long-term potential benefits.

Acknowledgements

This work has benefited from the insightful annotations by Tim Bray, co-editor of the XML 1.0 Specification, without which it would have been incomplete. I would also like to thank Martin Webb, who made several useful editorial suggestions.

References

Extensible Markup Language (XML) 1.0 Specification - Tim Bray, Jean Paoli, C. M. Sperberg-McQueen (Editors). W3C Recommendation, February 10, 1998.
The Annotated XML 1.0 Specification - By Tim Bray. XML.com.
Structuring XML Documents - By David Megginson, Prentice-Hall, 1998. Chapter 1, Section 1.4 discusses XML entities.
Mathematical Markup Language (MathML™) 1.01 Specification - Patrick Ion, Robert Miner, Stephen Buswell, Nico Poppelier (Editors). W3C Recommendation, April 7, 1998. Revised July 7, 1999. Chapter 6. Entities, Characters and Fonts.
Entities: What are They Good For? - By Norman Walsh. This includes descriptions of Types of Entities and Entity Declarations, Attributes and Expansion.

Related items

Perspectives of XML in E-Commerce

XML Conformance : The Burden of Proof

XMLization of Graphics

XML Euphoria in Perspective

XML and CSS : Structured Markup with Display Semantics

XML Namespaces : Universal Identification in XML Markup

The Emperor has New Clothes : HTML Recast as an XML Application

XML - What's in it for us?