Member

CDK - 2/14/2006 4:00:58 PM

Sending XHTML as text/html (USEFUL information for those who need it)

Hi,

I just thought this might help someone here
(for example, if you're having layout problems with Firefox, you might have to deliver your XHTML page as text/html).

------------------------------
Sending XHTML as text/html Considered Harmful
=============================================

Author: Ian Hickson <ian@hixie.ch> (Comments welcome.)

Abstract
--------

A number of problems resulting from the use of the text/html MIME type
in conjunction with XHTML content are discussed. It is suggested that
XHTML delivered as text/html is broken and XHTML delivered as text/xml
is risky, so authors intending their work for public consumption
should stick to HTML 4.01, and authors who wish to use XHTML should
deliver their markup as application/xhtml+xml.

Translations
------------

Une traduction française est disponible:
http://www.hixie.ch/advocacy/xhtml.fr

Context
-------

This was originally written in September 2002 in the context of this
Web log entry:

http://ln.hixie.ch/?start=1031465247&count=1

It has since been regularly updated to correct errors that have been
brought up in various mailing lists and other discussion forums. As of
late 2004, it is still just as relevant as when it was originally
written.

Note that this document compares XHTML 1.0 compliant to appendix C to
HTML 4.01, because that is the only variant of XHTML that may be sent
as text/html.

Executive Summary
-----------------

If you use XHTML, you should deliver it with the application/xhtml+xml
MIME type. If you do not do so, you should use HTML4 instead of XHTML.
The alternative, using XHTML but delivering it as text/html, causes
numerous problems that are outlined below.

Unfortunately, IE6 does not support application/xhtml+xml (in fact, it
does not support XHTML at all).

Why using text/html for XHTML is bad
------------------------------------

What usually happens to authors who decide to send XHTML as text/html
is the following:

1. Authors write XHTML that makes assumptions that are only valid for
tag soup or HTML4 UAs, and not XHTML UAs, and send it as
text/html. (The common assumptions are listed below.)

2. Authors find everything works fine.

3. Time passes.

4. Author decides to send the same content as application/xhtml+xml,
because it is, after all, XHTML.

5. Author finds site breaks horribly. (See below for a list of
reasons why.)

6. Author blames XHTML.

Steps 1 to 5 have been seen by every single person I have spoken to
who has switched to using the XHTML MIME type. The only reason step 6
didn't happen in those cases is that they were advanced authors who
understood how to fix their content.

SPECIFIC PROBLEMS

These are the issues that affect documents when they are switched from
text/html to application/xhtml+xml:

* <script> and <style> elements in XHTML sent as text/html have to be
escaped using ridiculously complicated strings.

This is because in XHTML, <script> and <style> elements are #PCDATA
blocks, not #CDATA blocks, and therefore <!-- and --> really _are_
comment tags, and are not ignored by the XHTML parser. To escape
script in an XHTML document which may be handled as either HTML4 or
XHTML, you have to use:

<script type="text/javascript"><!--//--><![CDATA[//><!--
...
//--><!]]></script>

To embed CSS in an XHTML document which may be handled as either
HTML4 or XHTML, you have to use:

<style type="text/css"><!--/*--><![CDATA[/*><!--*/
...
/*]]>*/--></style>

Yes, it's pretty ridiculous. If documents _aren't_ escaped like
this, then the contents of <script> and <style> elements get
dropped on the floor when parsed as true XHTML.

(This is all assuming you want your pages to work with older
browsers as well as XHTML browsers. If you only care about XHTML
and HTML4 browsers, you can make it a bit simpler.)

* A CSS stylesheet written for an HTML4 document is interpreted
slightly differently in an XHTML context (e.g. the <body> element
is not magical in XHTML, tag names must be written in lowercase in
XHTML). Thus documents change rendering when parsed as XHTML.

* A DOM-based script written for an HTML4 document has subtly
different semantics in an XHTML context (e.g. element names are
case insensitive and returned in uppercase in HTML4, case sensitive
and always lowercase in XHTML; you have to use the namespace-aware
methods in XHTML, but not in HTML4). BUT, if you send your
documents as text/html, then they will use the HTML4 semantics
DESPITE being XHTML! Thus, scripts are highly likely to break when
the document is parsed as XHTML.

* Scripts that use document.write() will not work in XHTML contexts.
(You have to use DOM Core methods.)

* Current UAs are, for text/html content, HTML4 user agents (at best)
and certainly not XHTML user agents. Therefore if you send them
XHTML you are sending them content in a language which is not
native to them, and instead relying on their error handling. Since
this is not defined in any specification, it may vary from one user
agent to the other.

* XHTML documents that use the "/>" notation, as in "<link />" have
very different semantics when parsed as HTML4. So if there were to
be a fully compliant HTML4 UA, it would be quite correct to show
">" characters all over the page.

For more details on this see the first bullet point in the section
entitled "The Myth of "HTML-compatible XHTML 1.0 documents"".
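
The case-sensitivity points in the list above can be tried outside a browser. The following is only a rough sketch, not part of the original text: it uses Python's standard html.parser as a stand-in for a lenient text/html parser and xml.etree.ElementTree as a stand-in for an XML parser. The sample markup and the names MIXED, TagNames, html_tag_names and xml_tag_names are made up for the illustration. Note that Python's HTML parser folds names to lowercase, whereas the text says HTML4 DOMs report uppercase; the only point shown is that one side folds case and the other does not.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample written with uppercase tag names.
MIXED = "<HTML><BODY><P>text</P></BODY></HTML>"


class TagNames(HTMLParser):
    """Record the name of every start tag the HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


def html_tag_names(markup):
    """Tag names as seen by the case-folding HTML parser."""
    parser = TagNames()
    parser.feed(markup)
    parser.close()
    return parser.names


def xml_tag_names(markup):
    """Tag names as seen by the case-preserving XML parser."""
    return [element.tag for element in ET.fromstring(markup).iter()]


print(html_tag_names(MIXED))
print(xml_tag_names(MIXED))
# A lowercase lookup finds nothing in the XML tree.
print(ET.fromstring(MIXED).find(".//p"))
```

A script or selector that relies on one spelling of an element name can therefore match under one parser and silently fail under the other.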

COPY AND PASTE

The worst problem, and the main reason (I suspect) for most of the
REALLY invalid XHTML pages out there, is that authors who have no clue
about XHTML simply copied and pasted their DOCTYPE from another
document. So even if you write valid XHTML, by using XHTML, you are
likely to encourage authors who do not know enough to write valid
XHTML to claim to do so.

Why trying to use XHTML and then sending it as text/html is bad
---------------------------------------------------------------

These are not likely to be problems for authors who regularly validate
their pages, but other authors will run into these problems.

* Documents sent as text/html are handled as tag soup [1] by most UAs.

This is the key. If you send XHTML as text/html, as far as browsers
are concerned, you are just sending them Tag Soup. It doesn't
matter if it validates, they are just going to be treating it the
same way as plain old HTML 3.2 or random HTML garbage.

Since most authors only check their documents using one or two UAs,
rather than using a validator, this means that authors are not
checking for validity, and thus most documents that claim to be
XHTML on the web now are invalid.

See, for example, this study:
http://www.goer.org/Journal/2003/Apr/index.html#results
...but if you don't believe it, feel free to do your own. In any
random sample of documents that appear to claim to be XHTML, the
overwhelming majority of documents are invalid.

Therefore the main advantage of using XHTML, that errors are caught
early because it _has_ to be valid, is lost if the document is then
sent as text/html. (Yes, I said _most_ authors. If you are one of
the few authors who understands how to avoid the issues raised in
this document and does validate all their markup, then this
document probably does not apply to you -- see Appendix B.)

* If you ever switch your documents that claim to be XHTML from
text/html to application/xhtml+xml, then you will in all likelihood
end up with a considerable number of XML errors, meaning your
content won't be readable by users. (See above: most of these
documents do not validate.)

* If a user saves such a text/html document to disk and later
reopens it locally, triggering the content type sniffing code since
filesystems typically do not include file type information, the
document could be reopened as XML, potentially resulting in
validation errors, parsing differences, or styling differences.
(The same differences as if you start sending the file with an XML
MIME type.)

* The only real advantage to using XHTML rather than HTML4 is that it
is then possible to use XML tools with it. However, if tools are
being used, then the same tools might as well produce HTML4 for you.
Alternatively, the tools could take SGML as input instead of XML.
(SGML is over a decade older than XML and the tools have existed
for years.)

* HTML 4.01 contains everything that XHTML 1.0 contains, so there is
little reason to use XHTML in the real world. It appears the main
reason is simply "jumping on the bandwagon" of using the latest and
(perceived) greatest thing.
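
The "considerable number of XML errors" mentioned above can be illustrated with a small sketch. This is an assumption-laden illustration rather than anything from the original text: Python's html.parser stands in for a tag soup parser, xml.etree.ElementTree stands in for an XML parser, and the names MISNESTED, TextCollector, readable_as_tag_soup and readable_as_xml are invented for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Misnested markup of the kind tag soup parsers accept without complaint.
MISNESTED = "<p><b> foo <i> bar </b> baz </i></p>"


class TextCollector(HTMLParser):
    """Lenient parser: gathers text and never checks element nesting."""

    def __init__(self):
        super().__init__()
        self.text = []

    def handle_data(self, data):
        self.text.append(data)


def readable_as_tag_soup(markup):
    """Return the text a lenient text/html parser recovers."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.text)


def readable_as_xml(markup):
    """Return the text an XML parser recovers, or None on a fatal error."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return None
    return "".join(root.itertext())


print(repr(readable_as_tag_soup(MISNESTED)))  # the text survives
print(readable_as_xml(MISNESTED))             # None: not well-formed
```

The lenient parser hands back all of the text, while the XML parser refuses the whole document at the first nesting error, which is what a reader would see after the MIME type is switched.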

The Myth of "HTML-compatible XHTML 1.0 documents"
-------------------------------------------------

RFC 2854 refers to "a profile of use of XHTML which is compatible
with HTML 4.01". There is no such thing. Documents that follow the
guidelines in appendix C are not valid HTML 4.01 documents. They just
happen to be close enough that tag soup parsers are able to handle
them just like most of the other pages on the Web.

The simplest examples of this are:

* The "/>" empty tag syntax actually has totally different meaning in
HTML4. (It's the SHORTTAG minimisation feature known as NET, if I
recall the name correctly.) Specifically, the XHTML

<p> Hello <br /> World </p>

...is, if interpreted as HTML4, exactly equivalent to:

<p> Hello <br>> World </p>

...and should really be rendered as:

Hello
> World

* Script and style elements cannot have their contents hidden from
legacy UAs. The following XHTML:

<style type="text/css">

</style>

...is exactly equivalent to the following HTML4:

<style type="text/css">

</style>

...because comments are not ignored in XHTML <style> blocks.

* The "xmlns" attribute is invalid HTML4.

* The XHTML DOCTYPEs are not valid HTML4 DOCTYPEs.
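
The point about hiding style contents can be checked with a short sketch. It is only an illustration under stated assumptions: the sample document, the CSS rule in it, and the names SAMPLE, StyleCollector, xml_style_text and html_style_text are all hypothetical, Python's xml.etree.ElementTree plays the part of an XHTML parser, and html.parser plays the part of a text/html parser.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical document that "hides" a style rule inside comment markers.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title>'
    '<style type="text/css"><!--\n'
    'p { color: red; }\n'
    '--></style></head><body><p>hi</p></body></html>'
)

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def xml_style_text(markup):
    """Style text seen by an XML parser: the comment is a real comment."""
    root = ET.fromstring(markup)
    style = root.find(f".//{XHTML_NS}style")
    return style.text or ""


class StyleCollector(HTMLParser):
    """Collect the character data found inside <style> elements."""

    def __init__(self):
        super().__init__()
        self.in_style = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False

    def handle_data(self, data):
        if self.in_style:
            self.chunks.append(data)


def html_style_text(markup):
    """Style text seen by an HTML parser: the markers are just text."""
    parser = StyleCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


print(repr(xml_style_text(SAMPLE)))
print(repr(html_style_text(SAMPLE)))
```

Under the XML parser the rule disappears along with the comment, while the HTML parser passes the whole block through as style text.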

Using XHTML and sending it as text/html is effectively the same, from
an HTML4 point of view, as writing tag soup (see "Why UAs can't handle
XHTML sent as text/html as XML" below).

Note: This is covered by HTMLWG issue XHTML-1.0/6232:
http://hades.mn.aptest.com/cgi-bin/voyager-issues/XHTML-1.0?id=6232;expression=appendix%20c;user=guest

Why UAs can't handle XHTML sent as text/html as XML
---------------------------------------------------

* Documents sent as text/html are handled as tag soup by most UAs.
This means that authors are not checking for validity, and thus
most XHTML documents on the web now are invalid. A conforming XML
UA would thus be unable to show as many documents as current UAs,
and would therefore never get enough marketshare to be relevant.

* It is impossible to reliably autodetect XHTML when sent as
text/html. This is why UAs could not ever treat text/html documents
as XML, even if they did not care about not being usable (see the
first point in this section).

+ You can't sniff for the five characters "<?xml" because:

- The <?xml ... ?> header is optional per Appendix C, and it is
recommended not to include it as it causes IE6 to trigger
quirks mode.

- SGML can also contain PIs (see the example below).

+ You can't trigger from the DOCTYPE since the W3C might introduce
new XHTML DOCTYPEs in future, so you don't know which DOCTYPEs
to look for. (Not to mention that DOCTYPEs are optional for
well-formed XHTML documents, DOCTYPE parsing is hard, DOCTYPEs
may be hidden in comments, and DOCTYPE sniffing has been called
harmful by many leading figures at the W3C and elsewhere.)

+ You can't trigger off the "<html xmlns" string because it might
be there but hidden in a comment (you'd need a complete XML
parser to step past comments, PIs, internal subsets, etc).

e.g. what language is this text/html document in?:

<?xml this is not?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0//EN"
[  ]>
<!--
This is a comment. This document is not XHTML.
<html xmlns="http://www.w3.org/1999/xhtml"/>
Ok, I'm done now. -->
<html>
<title> Need a title in HTML4! </title>
<p> This is a valid HTML4 document.
</html>

* Even if you could detect XHTML, what do you do with a document that
is not well formed (such as the example above)? If you fall back on
HTML4, then there is no advantage to using an XML processor, and you
might as well always treat it as HTML4.

* The HTML working group said that UAs should not do this:
http://lists.w3.org/Archives/Public/www-html/2000Sep/0024.html
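
The sniffing argument above can be sketched in a few lines. This is a hedged illustration, not a description of any real UA: naive_sniff_is_xhtml is a hypothetical string-matching heuristic of the kind argued against, is_well_formed_xml uses Python's xml.etree.ElementTree as a stand-in XML parser, and TRICKY is adapted from the example document in this section with the "<!--" comment opener written out as an assumption.

```python
import xml.etree.ElementTree as ET

# Adapted from the example document in the text; the "<!--" opener is an
# assumption about the intended markup.
TRICKY = """<?xml this is not?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0//EN"
[  ]>
<!--
This is a comment. This document is not XHTML.
<html xmlns="http://www.w3.org/1999/xhtml"/>
Ok, I'm done now. -->
<html>
<title> Need a title in HTML4! </title>
<p> This is a valid HTML4 document.
</html>
"""


def naive_sniff_is_xhtml(document):
    """Hypothetical heuristic: look for tell-tale XHTML strings."""
    return document.startswith("<?xml") or "<html xmlns" in document


def is_well_formed_xml(document):
    """True only if a real XML parser accepts the whole document."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


print(naive_sniff_is_xhtml(TRICKY))  # the heuristic is fooled
print(is_well_formed_xml(TRICKY))    # a real XML parser rejects it
```

Both string checks fire on a document that is not XHTML, so a UA trusting them would hand the page to an XML parser that cannot read it.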

The advantages of XHTML
-----------------------

When sent as application/xhtml+xml, XHTML has several advantages:

1. XHTML content will be able to be mixed-and-matched with content
from other well-known namespaces (in particular, MathML). This
is the main advantage for content authors.

2. UAs will immediately catch well-formedness errors.

3. Tools interacting with XHTML documents are guaranteed a
well-formed document.

4. XHTML content can be parsed with a simpler parser than tag soup
can, and a _much_ simpler parser than SGML can.

However, none of these apply when an XHTML document is sent as
text/html, and since authors feel their pages should be readable on
the most popular Web browser, which does not support
application/xhtml+xml, there is basically no point in using XHTML at
the moment.

Conclusion
----------

There are few advantages to using XHTML if you are sending the content
as text/html, and many disadvantages.

In addition, currently, the majority (over 90% by most counts) of the
UA market is unable to correctly render real XHTML content sent as
text/xml (or other XML MIME types). For example, point IE at:

http://www.mozillaquestquest.com/

Only Mozilla, Mozilla-based browsers such as Netscape 6 and 7, recent
versions of Opera, and Safari, are able to correctly render that site.
(IE6 shows a DOM tree!)

Authors who are not willing to use one of the XML MIME types should
stick to writing valid HTML 4.01 for the time being. Once user agents
that support XML and XHTML sent as one of the XML MIME types are
widespread, then authors may reconsider learning and using XHTML.

(Advanced authors should also see appendix B.)

Further Reading
---------------

I wrote another document on a related matter: people wanting UAs to
treat XHTML documents sent as text/html as XML and not tag soup.

http://www.damowmow.com/playground/xhtml-in-uas.xhtml

Henri Sivonen wrote a similar document asking what is the point of
XHTML:

http://www.hut.fi/u/hsivonen/xhtml-the-point

There are also many mailing list posts on this matter, e.g. on
www-talk. The following post summarises some issues relating to using
text/html for XHTML content containing XML extensions:

http://lists.w3.org/Archives/Public/www-talk/2001MayJun/0046.html

Some people have run into the problems this document mentions, for
example:

http://flrant.com/index.php?id=P21

There are also some interesting points made in other posts, for
example:

| > But does Mozilla call its xml parser for http://www.w3.org/ ?
|
| Nope. If it did, it would render the page without any expanded
| character entity references, since Mozilla is not a validating
| parser and thus skips parsing the DTD and thus doesn't know what
| &nbsp;, &middot; and &copy; are. Not to mention that it would end up
| ignoring the print-media specific section of the stylesheet, which
| uses uppercase element names and thus wouldn't match any of the
| lower case elements (line 138 of the first stylesheet), and it would
| use an unexpected background colour for the page because the
| stylesheet sets the background on <body> and not <html>, which in
| XHTML will result in a different rendering to the equivalent in
| HTML4 (same sheet, line 5).
-- http://lists.w3.org/Archives/Public/www-talk/2001MayJun/0004.html
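
The entity remark in the post quoted above can be reproduced with a non-validating parser. The sketch below is an illustration only, with hypothetical names (FRAGMENT, parse_without_dtd, expand_html_entities). It uses Python's xml.etree.ElementTree, which reports an undefined entity as an error rather than leaving the reference unexpanded as the post describes for Mozilla. The cause is the same in both cases: no DTD has been read, so the names are unknown.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical fragment using HTML named character references.
FRAGMENT = "<p>A&nbsp;B &copy; 2002</p>"


def parse_without_dtd(markup):
    """Parse with an XML parser that has read no DTD."""
    try:
        return "".join(ET.fromstring(markup).itertext())
    except ET.ParseError as error:
        return f"error: {error}"


def expand_html_entities(text):
    """Expand named references the way an HTML-aware consumer would."""
    return html.unescape(text)


print(parse_without_dtd(FRAGMENT))
print(repr(expand_html_entities("A&nbsp;B &copy; 2002")))
```

Only the five references predefined by XML (&lt;, &gt;, &amp;, &quot; and &apos;) are safe without a DTD; the HTML names work only when something supplies their definitions.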

Or this post, near the end of the thread:

| I'm still looking for a good reason to write websites in XHTML _at
| the moment_, given that the majority of web browsers don't grok
| XHTML. The only reason I was given (by Dan Connolly [1]) is that it
| makes managing the content using XML tools easier... but it would be
| just as easy to convert the XML to tag soup or HTML before
| publishing it, so I'm not sure I understand that. And even then,
| having the content as XML for content management is one thing, but
| why does that require a minority of web browsers to have to treat
| the document as XML instead of tag soup? What's the advantage of
| doing that? And even _then_, if the person in control of the content
| is using XML tools and so on, they are almost certainly in control
| of the website as well, so why not do the content type munging on
| the server side instead of campaigning for UA authors to spend their
| already restricted resources on implementing content type sniffing?
|
| [1] http://lists.w3.org/Archives/Public/www-talk/2001MayJun/0031.html
-- http://lists.w3.org/Archives/Public/www-talk/2001JulAug/0005.html

Appendix A: application/xhtml+xml
---------------------------------

See: http://ln.hixie.ch/?start=1036767231&count=1

Appendix B: Advanced Authors
----------------------------

Some advanced authors are able to send back XHTML as
application/xhtml+xml to UAs that support it, and as text/html to
legacy UAs.

Assuming you are using XHTML 1.0 compliant to Appendix C (or have
otherwise checked that the XHTML 1.0 you send is compatible with Tag
Soup processors), then that's fine. All I am saying in this document
is that sending XHTML as text/html ONLY is harmful.
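
One way such a server-side choice might look is sketched below. This is a minimal sketch under assumptions, not a recommended implementation: the function names parse_accept and choose_content_type are hypothetical, the Accept parsing is deliberately simplified, and the policy (use application/xhtml+xml only when the UA lists it explicitly with a q-value at least as high as text/html) is one possible choice among several.

```python
XHTML_TYPE = "application/xhtml+xml"
HTML_TYPE = "text/html"


def parse_accept(header):
    """Map each media range in an Accept header to its q-value."""
    ranges = {}
    for item in header.split(","):
        parts = [part.strip() for part in item.split(";")]
        media_range = parts[0].lower()
        if not media_range:
            continue
        quality = 1.0  # q defaults to 1 when absent
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat an unparsable q as not acceptable
        ranges[media_range] = quality
    return ranges


def choose_content_type(accept_header):
    """Pick the XHTML MIME type only for UAs that explicitly list it."""
    ranges = parse_accept(accept_header or "")
    xhtml_q = ranges.get(XHTML_TYPE, 0.0)
    html_q = ranges.get(HTML_TYPE, 0.0)
    if xhtml_q > 0.0 and xhtml_q >= html_q:
        return XHTML_TYPE
    return HTML_TYPE


print(choose_content_type("text/html,application/xhtml+xml,*/*;q=0.8"))
print(choose_content_type("*/*"))
```

A wildcard such as "*/*" is intentionally not treated as support for XHTML, since legacy UAs send it too. A server that varies its response this way would normally also send a "Vary: Accept" header so that caches keep the two variants apart.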

Note: Sending XHTML 1.1 as text/html is NEVER fine. There is no spec
that allows this. Sending XHTML 2.0 as anything in a production
(non-testing) context is NEVER fine either, since that spec has not
reached CR yet.

Also note that I would personally suggest that even advanced authors
not use XHTML sent as text/html, since many authors copy and paste
markup from others and thus may easily end up copying the valid XHTML
markup but using it as HTML4.

Appendix C: Acknowledgements
----------------------------

Thanks to Nick Boalch for the abstract. Thanks to Dan Connolly for
pedantry that has improved the quality of this document. Thanks to Ted
Shaneyfelt and many others for suggesting improvements to the text.

Appendix D: Footnotes
---------------------

[1] The term "handled as tag soup" refers to the fact that UAs
typically are very lenient in their error handling, and do not support
any of the "advanced" SGML features. For example, browsers treat the
string "<br/>" as "<br>" and not "<br>>", the latter being what
SGML says they should do. Similarly, real world UAs have no problem
dealing with content such as "<b> foo <i> bar </b> baz </i>" even
though according to the HTML4 spec that is meaningless.

SOURCE: http://hixie.ch/advocacy/xhtml


Guest

admin - 2/14/2006 9:20:39 PM

Re: Sending XHTML as text/html (USEFUL information for those who need it)

Thank you very much for posting this here. We will consider this issue for Kentico CMS 2.0.

Best Regards,

Member

CDK - 2/16/2006 4:22:01 AM

just so you know this works in firefox!

If you have layout problems in Firefox when using "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">",
do not want to use HTML formatting, and still wish to use XHTML,
then do the following!

Applying XHTML in Microsoft FrontPage
-------------------------------------
1) Open the file in source code mode.
2) Right-click and choose "Apply XML Formatting Rules" (this will convert everything into XML formatting).
3) Replace your <!DOCTYPE... > with this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
4) Make sure you have <html xmlns="http://www.w3.org/1999/xhtml" > above the header (remember to close it!).

Now, as long as you have the XML formatting rules switched on, you will have a well-formed XHTML file.

Good luck :)


Guest

admin - 2/16/2006 10:22:56 AM

Re: just so you know this works in firefox!

Thank you, again!

Best Regards