
Should I write Polyglot HTML5 documents?

Developer https://www.devze.com 2023-01-05 01:02 Source: Internet

I've been considering converting my current HTML5 documents to polyglot HTML5 ones. I figure that even if they only ever get served as text/html, the extra checks of writing it XML would help to keep my coding habits tidy and valid.

Is there anything particularly thrilling in the HTML5-only space that would make this an unwise choice?

Secondly, the specs are a bit hazy on how to validate a polyglot document. I assume the basics are:

  1. No errors when run through the W3C Validator as HTML5
  2. No errors when run through an XML parser

But are there any other rules I'm missing?

Thirdly, seeing as it is a polyglot, does anyone know of any caveats to serving it as application/xhtml+xml to supporting browsers and text/html to non-supporting ones?

Edit: After a bit of experimenting I found that entities like &nbsp; break in XHTML5 (no DTD). That XML parser is a bit of a double-edged sword. I guess I've answered my third question already.


Work on defining how to create HTML5 polyglot documents is currently ongoing, but see http://dev.w3.org/html5/html-xhtml-author-guide/html-xhtml-authoring-guide.html for an early draft. It's certainly possible to do, but it does require a good deal of coding discipline, and you will need to decide whether it's worth the effort. Although I create HTML 4.01/XHTML 1.0 polyglot documents, I create them using an XML tool chain which guarantees XML well-formedness, and I have specialized code to ensure compatibility with HTML non-void elements and valid XML characters. Direct hand coding would be very difficult.

One known current issue in HTML5 is the srcdoc attribute on the iframe element. Because the value of the attribute contains markup, certain characters need to be escaped. The HTML5 draft spec describes how to do this for the HTML serialization, but not (the last time I looked) how to do it in the XHTML serialization.
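As a rough sketch of the escaping involved (not taken from either spec draft): escaping &, < and " in the srcdoc value should keep it safe inside a double-quoted attribute under both serializations, since the HTML syntax needs & and " escaped and XML additionally needs <. The helper name escape_srcdoc is made up for illustration, and Python's standard library stands in here for a real tool chain.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def escape_srcdoc(inner_markup: str) -> str:
    """Escape markup for use inside a double-quoted srcdoc attribute.

    Hypothetical helper: escape() handles &, < and >, and the extra
    mapping covers the double quote that delimits the attribute.
    """
    return escape(inner_markup, {'"': "&quot;"})


inner = '<p>Tom & "Jerry"</p>'
iframe = f'<iframe srcdoc="{escape_srcdoc(inner)}"></iframe>'

# An XML parser should hand back the original markup unchanged.
round_tripped = ET.fromstring(iframe).get("srcdoc")
print(iframe)
print(round_tripped == inner)
```

Escaping > as well is harmless in both syntaxes; it only makes the attribute value a little longer.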


I'm late to the party but after 5 years the question is still relevant. On one hand closing all my tags strongly appeals to me. For people reading it, for easier editing, for Great Justice. OTOH, looking at the gory details of the polyglot spec — http://www.sitepoint.com/have-you-considered-polyglot-markup/ has a convenient summary at the end — it's clear to me I can't get it all right by hand.

https://developer.mozilla.org/en/docs/Writing_JavaScript_for_XHTML also sheds interesting light on why XHTML failed: the very choice to use an XML MIME type has various side effects at run time. By now it should be routine for good JS code to handle these (e.g. always lowercase tag names before comparing) but I don't want all that. There are enough cross-browser issues to test for as-is, thank you.

So I think there is a useful middle way:

  1. For now serve only as text/html. Stop worrying that it will actually parse as exactly the same DOM with same runtime behavior in both HTML and XML modes.

  2. Only strive that it parses as some well-formed XML. It helps readers, it helps editors, and it lets me use an XML parser on my own documents.

    Unfortunately, polyglot tools are rare to non-existent. It's hard to even serialize back XML in a way that also passes the HTML requirements...

    • No brainer: always self close void tags (<hr/>) and separately close non-void tags (<script ...></script>).

    • No brainers: use lowercase tag and attribute names (except in some SVG, but foreign content uses XML rules anyway), always quote attribute values, always provide attribute values (selected="selected" is more verbose than standalone selected but I can live with that).

    • Inline <script> and <style> are most annoying. I can't use & or < inside without breaking XML parsing. I need:

      <script>/*<![CDATA[*/
         foo < bar && bar < baz;
      /*]]>*/</script>
      

    ...and that's about it! Not caring about XML namespaces or matching HTML's implied DOM for tables drops about half the rules :-)

  3. Await some future when I can directly go to authoring XHTML, skipping polyglotness. The benefits are that I'll be able to forget the tag-closing limitations and will be able to directly consume and produce it with XML tools. Sure, neglecting XML namespaces and other things now will make the switch harder, but I think I'll create more new documents in this future than convert existing ones.

    Actually I'm not entirely sure what's stopping me from living in that future right now. Is it only IE 8? I'm also a tiny bit concerned about the all-or-nothing error handling. I'm slightly hoping a future HTML spec will find a way to shrink the HTML vs XML gaps, e.g. make browsers accept <hr></hr> and <script .../> in HTML, while still retaining HTML error handling.

    Also, tools. Having libraries in many languages that can serialize to polyglot markup would make it feasible for programs to generate it. Having tools to validate and convert HTML5 <-> polyglot <-> XHTML5 would help. Otherwise, it's pretty much doomed.
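The "parses as some well-formed XML" goal from item 2 can be checked with any off-the-shelf XML parser. Below is a minimal sketch using Python's standard library; the sample document and the is_well_formed helper are invented for illustration, and the check says nothing about HTML validity, which still needs a separate validator run.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that follows the rules above: void tags are
# self-closed, non-void tags are closed separately, attributes are
# quoted and have values, and inline script sits in a commented CDATA
# section.
POLYGLOT_SAMPLE = """<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Sample</title>
<script>/*<![CDATA[*/
  var ok = 1 < 2 && 2 < 3;
/*]]>*/</script>
</head>
<body>
<hr/>
<select><option selected="selected">a</option></select>
</body>
</html>
"""

# Typical HTML-only habits that an XML parser rejects.
HTML_ONLY_SAMPLES = [
    "<p>line one<br>line two</p>",       # unclosed void tag
    "<select><option selected>a</option></select>",  # attribute without value
    "<script>foo < bar && bar < baz;</script>",      # raw < and & in script
]


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(POLYGLOT_SAMPLE))
print([is_well_formed(sample) for sample in HTML_ONLY_SAMPLES])
```

Running a check like this over existing pages gives a quick list of the files that still rely on HTML-only shortcuts.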


Given that the W3C's documentation on the differences between HTML and XHTML isn't even finished, it's probably not worth your time to try to do polyglot. Not yet anyways.... give it another couple of years.

In any event, only in the extremely narrow circumstances where you are actively planning on parsing your HTML as XML for some specific purpose, should you invest the extra time in XML-compliance. There are no benefits of doing it purely for consumption by web browsers -- only drawbacks.


Should you? Yes. But first some clarification on a couple points.

Sending the Content-Type: application/xhtml+xml header only means the document goes through an XML parser; it still has all the benefits of HTML5 as far as I can tell.
About &nbsp;: it isn't defined in XML. The only character entity references XML defines are lt, gt, apos, quot, and amp; you will need to use numeric character references for anything else. The code for nbsp is &#xa0; or &#160;. I personally prefer hex because Unicode code points are written that way (U+00A0).
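The entity behaviour is easy to reproduce outside a browser. Here is a small sketch with Python's standard XML parser, which, like a browser in XML mode without a DTD, knows only the five predefined entities; the parse_text helper is made up for this example.

```python
import xml.etree.ElementTree as ET


def parse_text(fragment: str):
    """Return the text of the root element, or None if parsing fails."""
    try:
        return ET.fromstring(fragment).text
    except ET.ParseError:
        return None


named = parse_text("<p>a&nbsp;b</p>")         # named entity, undefined in XML
hex_ref = parse_text("<p>a&#xa0;b</p>")       # hexadecimal character reference
dec_ref = parse_text("<p>a&#160;b</p>")       # decimal character reference
predefined = parse_text("<p>&lt;&gt;&apos;&quot;&amp;</p>")

print(named)                  # None: the parser rejects &nbsp;
print(hex_ref == "a\u00a0b")  # both numeric forms decode to U+00A0
print(dec_ref == "a\u00a0b")
print(predefined)
```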

Sending the header is useful for testing because you can quickly find problems with your markup such as unclosed tags, stray end tags, or text that could be interpreted as a tag: basically anything that can break the look or even the functionality of your site.
Most significantly, in my opinion: if you accept user input and the page fails to parse, that generally means you didn't escape their data and are leaving yourself open to a vulnerability. Parsed as HTML, you might never notice a problem until someone starts injecting scripts to harass your users or steal data.
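To illustrate the point about user input, here is a minimal sketch in Python. The render_comment function is a stand-in for whatever templating a real site uses, not any particular framework's API: unescaped input trips the XML parser, while escaped input parses and round-trips.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def render_comment(user_text: str, *, escape_input: bool) -> str:
    """Hypothetical template: wrap user text in a paragraph element."""
    body = escape(user_text) if escape_input else user_text
    return f"<p>{body}</p>"


def parses_as_xml(markup: str) -> bool:
    """Return True if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


user_text = "1 < 2 & <b>unclosed"

raw_page = render_comment(user_text, escape_input=False)
safe_page = render_comment(user_text, escape_input=True)

print(parses_as_xml(raw_page))    # the missing escaping shows up as a parse error
print(parses_as_xml(safe_page))
print(ET.fromstring(safe_page).text == user_text)
```

This is only a smoke test: injected markup that happens to be well-formed, such as a complete <script>...</script> element, would still parse, so escaping remains the actual defence.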

This page is pretty good about explaining what polyglot markup is: https://blog.whatwg.org/xhtml5-in-a-nutshell


This sounds like a very difficult thing to do. One of the downfalls of XHTML was that it wasn't possible to steer successfully between the competing demands of XML and vintage HTML.

I think if you write HTML5 and validate it successfully, you will have as tidy and valid a document as anyone would need.


This wiki has some information not present in the W3C document: http://wiki.whatwg.org/wiki/HTML_vs._XHTML
