XML - What's in it for us?

Published on: Saturday 28th March 1998 By: Janus Boye

Introduction
What is XML?
Some XML history
Will it replace HTML?
Building XML
Web applications of XML
XML software
XML and style sheets
The future of XML
Conclusion
Useful links

Introduction

The second XML (Extensible Markup Language) draft is out (recommended by the W3C as of 10 February 1998). Because it is only a recommendation, every implementation of XML is a guess at what XML will eventually become, and the recommendation itself remains open for discussion.

"What's in it for us," you ask? Quite a bit. XML offers the most near-term benefits for professional web developers, in particular those who are working on putting large numbers of complex documents on-line. HTML is quite limiting. It does not offer very rich semantics to describe a document.

If you're designing data-hungry sites, especially for intranets, you should be getting excited about XML, because in XML you'll be able to create and respond to a much richer set of data elements. That will in turn let you build more individualised dynamic sites and pages. For example, your site's users could access information across databases and types of data without having to rely on a search engine.

Currently, Microsoft Internet Explorer 4.0 is the only browser that supports XML, but more on that later.

What is XML?

XML is all about metadata and the idea that certain groups of people have similar needs for describing and organising the data they use. Like HTML, XML is a set of tags and declarations -- but rather than being concerned with formatting information on a page, XML focuses on providing information about the data itself and how it relates to other data.

Some data types are pretty much universal (<First Name>, <Address>, <City>, and so forth). Others are industry or even company-specific (<price>, <manufacturer>, <componentID>). Healthcare organisations, for example, have a whole set of data types and acronyms understandable (some would say penetrable) only to claims processors. XML allows each of these data types to be easily recognised and, for site developers, used to create sites optimised around both the data and the people using it.

XML differs from HTML in three major respects:

Information providers can define new tag and attribute names at will.
Document structures can be nested to any level of complexity.
Any XML document can contain an optional description of its grammar for use by applications that need to perform structural validation.
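
As a minimal sketch of these three points, the Python snippet below reads a small document whose tags are made up by the information provider and nested two levels deep. The <catalogue> and <component> wrappers, the sample values and the use of Python's standard xml.etree.ElementTree module are assumptions for illustration only; the optional grammar description is simply left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: every tag name here is invented for the example.
DOCUMENT = """
<catalogue>
  <component>
    <componentID>A-100</componentID>
    <manufacturer>Example Corp</manufacturer>
    <price currency="USD">12.50</price>
  </component>
</catalogue>
"""

def read_components(xml_text):
    """Return one dict per <component>, keyed by child tag name."""
    root = ET.fromstring(xml_text)
    return [{child.tag: child.text for child in component}
            for component in root.findall("component")]

components = read_components(DOCUMENT)
print(components[0]["manufacturer"])
```

Because the tags name the data rather than its formatting, the program can ask for the manufacturer or the price directly instead of guessing from layout.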

XML is not backwards compatible with existing HTML documents, but documents conforming to HTML 3.2 can easily be converted to XML, as can generic SGML documents and documents generated from databases.

Some XML history

In November 1996 the initial XML draft was presented at the SGML 96 Conference in Boston. Then, in March 1997, the Graphic Communications Association held the 1st XML Conference in San Diego.

In April 1997 came the initial XML Linking Working Draft, which was revised in July. August 1997 brought the revised XML Syntax Working Draft, and the XML Developers Day was held in Montreal, Canada, on August 21st.

In October the W3C published a note on 'W3C Data Formats on XML, SGML, HTML, and RDF'.

In December the XML 1.0 Proposed Recommendation arrived, and as of February 1998 the 2nd XML draft is out.

Enough history, let's get right at it....

Will it replace HTML?

Doubtful. At least not in the near term. Initially, I expect we will see XML used as a storage format and HTML used as the display format: just run your XML document through a filter and out comes an HTML document. This will provide backward compatibility for legacy HTML browsers. I think we will continue to see this for at least 2-3 years, although use of native XML browsers will increase throughout that time and eventually eclipse HTML, relegating it to purely legacy support. This all depends on the tools.
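
The "filter" idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the <article>, <headline> and <body> tags, the to_html function and the chosen HTML layout are all hypothetical, and a real filter would be driven by a style sheet rather than hard-coded.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical stored document; the tag names are illustrative only.
STORED = "<article><headline>XML &amp; HTML</headline><body>Stored as XML.</body></article>"

def to_html(xml_text):
    """Map each custom element onto a plain HTML element for display."""
    root = ET.fromstring(xml_text)
    headline = escape(root.findtext("headline", default=""))
    body = escape(root.findtext("body", default=""))
    return "<html><body><h1>%s</h1><p>%s</p></body></html>" % (headline, body)

print(to_html(STORED))
```

The stored document keeps its descriptive tags; only the copy sent to a legacy browser is flattened into display markup.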

Initially, XML will be difficult to use, and expensive. Only large firms that have clear and distinct needs, and the money to support it, will use it.

HTML has the advantage of being very simple to use. XML is not difficult, but it's not that easy either. So HTML will probably continue to be used by the general public because it's so simple to use.

Software development efforts take time though. It takes 6 months to do a good new product revision, plus a beta-testing cycle, which means it could be a year or more before many of these products become available.

Furthermore, the XML standard isn't even all the way hammered out. The XML data, style and linking pieces have yet to be completed. Each is in various draft stages. We may not see a complete cohesive XML standard before December 31, 1998. On the other hand, the whole XML standards process has been moving along at quite a rapid pace, so we might be surprised and see something sooner.

Considering the time it takes for a technology to become truly mainstream, a 2-3 year adoption curve, with a couple of years tacked on to that until we see the really spectacular implementations, is probably not unrealistic, in my opinion.

XML and HTML complement each other. Browsers will be able to process both, and future HTML standards will likely allow mixing HTML and XML in the same document.

What about existing HTML documents? Am I going to have to re-code all of them in XML? Will XML-native browsers support HTML documents as well? These are all open questions.

Basically, if your HTML document uses quotes around ALL of the attributes and closes ALL of the tags, then it's awfully close to being well formed.
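
A rough way to test this on your own pages is to hand the markup to any XML parser and see whether it is accepted. The sketch below uses Python's standard xml.etree.ElementTree module as a stand-in for such a parser; the is_well_formed helper and the two sample fragments are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """True if a non-validating XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# Typical HTML: an unquoted attribute value and an unclosed <P>.
loose = "<BODY><P ALIGN=center>Hello</BODY>"
# The same content with the attribute quoted and every tag closed.
tidy = '<BODY><P ALIGN="center">Hello</P></BODY>'

print(is_well_formed(loose), is_well_formed(tidy))
```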

Realistically, I think we will see browsers that support both XML and HTML, just as early web browsers built in support for FTP and Gopher in addition to HTTP. These protocols continue to be supported. So you won't necessarily have to convert all of your documents. On the other hand, you may want to. To help with that, we will probably see HTML-to-XML conversion utilities. Naturally, the quality of the resulting documents will vary. Some will be good. Some won't be. Automation can only take you so far.

Building XML

XML comes in two flavours: well formed and valid. Well formed is the easier standard to meet. It just requires that a document has an XML prolog, that all elements be nested cleanly, and that all start tags have matching end tags. "Empty" tags like IMG, which don't normally have closing tags, may end with a "/>" instead of receiving a full end tag. For instance, the HTML:

<IMG SRC="mygif.gif">

will become either

<IMG SRC="mygif.gif"></IMG>

or

<IMG SRC="mygif.gif"/>
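
To a parser the two spellings of an empty element are the same thing. A small check, assuming Python's standard xml.etree.ElementTree module as the parser (the describe helper is made up for the comparison):

```python
import xml.etree.ElementTree as ET

full = ET.fromstring('<IMG SRC="mygif.gif"></IMG>')
short = ET.fromstring('<IMG SRC="mygif.gif"/>')

def describe(element):
    """Summarise an element as (name, attributes, text, number of children)."""
    return (element.tag, dict(element.attrib), element.text, len(element))

# Both forms yield an IMG element with one attribute and no content.
print(describe(full) == describe(short))
```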

The XML prolog is the most obvious change from either SGML or HTML:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

The version attribute should always be included, to protect documents against changes in the standard. standalone announces whether the document depends on markup declarations held outside it: "yes" says that no outside declarations affect what the parser passes to the application, while "no" says that there may be some, for example in an external DTD. (Earlier working drafts used an RMD, or Required Markup Declaration, attribute with the values "NONE," "INTERNAL" and "ALL" for this purpose.) encoding tells the parser which character encoding the document uses. UTF-8, an encoding of Unicode, is the default. (XML parsers must also accept the 16-bit UTF-16 encoding of Unicode, however.)

These minimal changes to the world of mark-up make life much easier for parser developers, who no longer have to support poorly coded HTML missing half its end tags. Before a document can call itself well formed XML, it has to meet minimum requirements. This requires some extra effort from those creating documents, but makes it possible for programmers to build much more reliable systems with much less effort.

Valid documents must be accompanied by a document type definition (DTD) that defines their structure. The DTD may be included as part of the document itself, or it may be stored in a separate document. Most complex DTDs will probably be stored as separate documents. A DTD is basically a list of element, entity and attribute declarations in a simplified SGML declaration style.
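
To make the idea concrete, here is a sketch of a document that carries its own small DTD, together with one narrow check against it. The <contact> vocabulary and the undeclared_elements helper are hypothetical, and the check is far weaker than real validation: Python's standard xml.parsers.expat module is a non-validating parser, so the code only confirms that every element used has a declaration, not that the elements appear in the declared order.

```python
import xml.parsers.expat

# Hypothetical document with its DTD included in the document itself.
DOC = """<?xml version="1.0"?>
<!DOCTYPE contact [
  <!ELEMENT contact (name, address, city)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT address (#PCDATA)>
  <!ELEMENT city (#PCDATA)>
]>
<contact>
  <name>Jane Doe</name>
  <address>1 Example Street</address>
  <city>Springfield</city>
</contact>
"""

def undeclared_elements(document):
    """Return the names of elements used in the body but not declared in the DTD."""
    declared = set()
    used = set()

    def on_declaration(name, model):
        declared.add(name)

    def on_start(name, attributes):
        used.add(name)

    parser = xml.parsers.expat.ParserCreate()
    parser.ElementDeclHandler = on_declaration
    parser.StartElementHandler = on_start
    parser.Parse(document, True)
    return used - declared

print(sorted(undeclared_elements(DOC)))
```

A validating parser, such as some of those listed under XML software, would go further and reject a document whose structure departs from the DTD.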

Web applications of XML

The applications that will drive the acceptance of XML are those that cannot be accomplished within the limitations of HTML. These applications can be divided into four broad categories:

Applications that require the Web client to mediate between two or more heterogeneous databases.
Applications that attempt to distribute a significant proportion of the processing load from the Web server to the Web client.
Applications that require the Web client to present different views of the same data to different users.
Applications in which intelligent Web agents attempt to tailor information discovery to the needs of individual users.

The alternative to XML for these applications is proprietary code embedded as "script elements" in HTML documents, and delivered in conjunction with proprietary browser plug-ins or Java applets. XML derives from a philosophy that data belongs to its creators and that content providers are best served by a data format that does not bind them to particular script languages, authoring tools, and delivery engines, but provides a standardised, vendor-independent, level playing field upon which different authoring and delivery tools may freely compete.

An example of the first category of XML applications could be an information tracking system for a home health care agency. This app could then have the following functions, not all of which can be accomplished in HTML:

Log into the hospital's web site.
Access the patient's medical records in a Web-based interface that represents the records for that patient with a folder icon.
Drag the folder from the app over to the internal database.
Drop it into the database.

The app could use XML tags such as <allergies>, <drug-reaction>, and so on.
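
What the "drop" step might do behind the scenes can be sketched as follows. Everything here is an assumption made for illustration: the <patient> wrapper, the id and drug attributes, the sample values and the dict standing in for the agency's internal database. Only the <allergies> and <drug-reaction> tag names come from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical record as the hospital's site might serve it.
RECORD = """
<patient id="p-001">
  <allergies>penicillin</allergies>
  <drug-reaction drug="aspirin">rash</drug-reaction>
</patient>
"""

def import_record(xml_text, database):
    """Copy the tagged fields of one patient record into a dict-based store."""
    patient = ET.fromstring(xml_text)
    database[patient.get("id")] = {
        "allergies": [a.text for a in patient.findall("allergies")],
        "drug_reactions": {r.get("drug"): r.text
                           for r in patient.findall("drug-reaction")},
    }
    return database

agency_db = import_record(RECORD, {})
print(agency_db["p-001"]["allergies"])
```

Because both sides agree on what the tags mean, the agency's program can file each field in the right place without knowing how the hospital's own database is organised.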

The House of Worship site already uses XML as a way to let its members share information, especially religious discourse. The move is one of the first implementations of the next-generation Web authoring language by an independent site. Among others, HOW introduces the <PRAYER> and <SCRIPTURE> tags.

XML software

Good tools are going to be the thing that makes XML work. XML is complex enough that you are not going to want to do much of it by hand. The nice thing is that it looks like a lot of tools vendors are going to support it. Microsoft is talking about making it the native default file format for upcoming versions of MS Office, including Word, Excel and PowerPoint. This could mean that one could serve these documents directly onto the web without having to convert them or mark them up by hand. I suspect that we will see similar functionality in a future version of Corel WordPerfect as well, although I haven't heard any announcements yet. Tool support will be necessary, not just for putting primary content documents into XML format, but also for supporting and maintaining large collections of documents. Tools will also be necessary for creating new documents that combine content from different sources. There is plenty of opportunity for innovation here.

A number of other commercial vendors are preparing XML software tools. In addition, aided by XML's relative simplicity, many individuals and academic institutions are undertaking XML efforts.

As part of IBM's support for the World Wide Web Consortium's (W3C) endorsement of XML 1.0 as a Web standard, IBM has released an alpha version of its XML for Java technology. The W3C, an international group that oversees Web standards, is promoting XML as a language to let applications interchange data with greater precision than standard HTML can provide.

Web developers seeking to increase their familiarity with XML should check out XML for Java -- developed at IBM's Tokyo Research Lab, and available on IBM's alphaWorks. XML for Java is an XML processor written entirely in Java; with it, Web developers can parse, process, and create XML documents.

Leading examples of XML tools available free for non-commercial use include the following:

NXP is a validating XML parser written in Java by Norbert Mikula.
http://www.edu.uni-klu.ac.at/~nmikula/NXP

Lark is a non-validating XML processor written in Java by Tim Bray.
http://www.textuality.com/Lark/

XP is another non-validating XML processor written in Java, by James Clark.
http://www.jclark.com/xml/xp/index.html

MSXML is a validating XML parser written in Java by Microsoft.
http://www.microsoft.com/xml/parser/jparser.asp

TclXML is a validating XML parser written in Tcl by Steve Ball.
http://tcltk.anu.edu.au/XML/

LT XML is an XML developers' toolkit from the Language Technology Group at the University of Edinburgh.
http://www.ltg.ed.ac.uk/software/xml/

JUMBO is a Java-based XML browser designed for the Chemical Mark-up Language, an XML application developed by Peter Murray-Rust.
http://www.venus.co.uk/~pmr/README

DSC is a DSSSL syntax checker and development environment available from the Language Technology Group at the University of Edinburgh.
http://www.ltg.ed.ac.uk/~ht/dsc-blurb.html

XML and style sheets

Because XML is really about specifying characteristics of data, and not simply presenting it, you will need to write style sheets to use it. Since DHTML, CSS and CDF are all standards supported by both Netscape and Microsoft, you can start using XML today. Also, new tools are constantly emerging to evaluate your XML conventions and ensure that other parsers can use them as you intended.

The Extensible Style Language (XSL) represents an early attempt to create a more dynamic and powerful notation for defining document style, and to augment the capabilities of the Cascading Style Sheets work (CSS1 and CSS2) already in place at the W3C. Objectives here include a model that can dynamically resize itself completely around base font selections (which CSS cannot currently handle) and more powerful, interactive support for document styles and rendering. At present, this work is largely experimental and most active development uses CSS1 or CSS2 style sheets for production. But just as XML represents a strict subset of SGML, the work on XSL derives in large part from the DSSSL style sheet language developed in the SGML community.

XSL can handle an unlimited number of tags, each in an unlimited number of ways, by virtue of its extensibility. It brings advanced layout features to the Web, such as rotated text, multiple columns, and independent regions. It supports international scripts, all the way to mixing left-to-right, right-to-left, and top-to-bottom scripts on a single page.

The future of XML

XML could take many different directions. Since it is still only a recommendation, many things can (and probably will) happen. One direction is to serve as an alternative to HTML. This particular use is going to take a little while to mature, because you need to populate the world with the tools to create XML documents and the people who know how to use them and put them up on their sites. And that's going to take time, because it's going to require that people make a mental shift in how they conceive of data.

In the very short term the main impact of XML for Web developers will be its use in a variety of special-purpose facilities, such as Microsoft Corp.'s Channel Definition Format (CDF), Marimba's Open Software Description (OSD) protocol and Vignette and its partners' Information and Content Exchange (ICE). These are simple, easily described languages that do special-purpose tasks such as channel description, download automation and syndication negotiation. Anyone who wants to play in these application spaces will have to learn how to read and write the appropriate XML-based languages. Fortunately, this is easy, since generating XML is trivial, and parsing it can be done with any number of freeware parsers available right now in C or Java.
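
How little code the read and write sides need can be sketched as below. The <channel> and <item> vocabulary is invented for the example and is not the actual CDF, OSD or ICE format; the build_channel and read_items helpers and the example.com addresses are likewise hypothetical, and Python's standard xml.etree.ElementTree module stands in for the freeware parsers mentioned above.

```python
import xml.etree.ElementTree as ET

def build_channel(title, item_urls):
    """Serialise a hypothetical channel description to an XML string."""
    channel = ET.Element("channel", {"title": title})
    for url in item_urls:
        ET.SubElement(channel, "item", {"href": url})
    return ET.tostring(channel, encoding="unicode")

def read_items(xml_text):
    """Parse the channel back and list the address of every item."""
    return [item.get("href") for item in ET.fromstring(xml_text).iter("item")]

xml_text = build_channel("News", ["http://example.com/a", "http://example.com/b"])
print(xml_text)
print(read_items(xml_text))
```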

Coming shortly thereafter will probably be RDF, or Resource Description Framework, a general metadata exchange mechanism based on XML, currently in the process of being drafted at W3C. This has the potential to bring dramatic benefits to the worlds of searching, retrieval and many other aspects of content automation.

Conclusion

If you want to be an early adopter, now is the time to start reading the standards, looking at the specs, and starting to think how you could use this technology. XML is not going to catch on all by itself. It takes people to support it, to build the tools and create the content using it. XML seems to have a lot of industry support behind it. It offers the potential to do a lot of things that people want to be able to do.

If you think it can work for you, look into it more. Then make an informed decision. Take a look at it. Find out how it's being used. Try it for yourself, just to play with, or in a small pilot project. If the tools aren't mature enough yet, then wait a few months and look again later. If XML turns out to be a good technology it will succeed. If not, people will pass it by.

Everyone, including Netscape, is supporting XML. Netscape already uses it to some extent, mainly in its own internal output, and IE4 supports it partially, but not completely. Support will be strong in the 5th generation of both browsers. XML is extremely helpful in establishing means to speak with databases and as a way to have PDF type output, but with access to the data on the browser.

I urge you to learn it for the future. If you want an excellent use of it now, and a great program in addition, get Frontier5 (http://www.scripting.com/frontier5/xml/), which uses it to quickly generate HTML on the fly.

The bad news: At the time of this writing, browsers are between generations, not yet fully ready to embrace these new technologies and standards. But this lag may be just what hatching standards need, giving developers enough time to rethink the way their Web applications should work before a rewoven Web hits with full force, starting at the end of this year.

Comments? Lavish praise? Flaming criticism? Other ideas? Contact janus@janusboye.dk

Useful links

The main W3C XML page:
http://www.w3.org/XML

W3C Recommendation:
http://www.w3.org/TR/REC-xml

XML Linking draft:
http://www.w3.org/TR/WD-xml-link

XSL Style Proposal:
http://www.w3.org/TR/NOTE-XSL.html

XML Data Note:
http://www.w3.org/TR/1998/NOTE-XML-data-0105/

Microsoft's XML Page:
http://www.microsoft.com/xml

The House of Worship (uses XML):
http://www.housesofworship.net/
