I'm looking for a machine-readable version of the HTML5 specs, akin to a DTD, although any format would do as long as it's parsable.
The HTML5 specs don't seem to contain anything of the sort, so my first idea was to look into validators. I dug into the sources of the validator.nu validator, but it seems that the schema they use is built by parsing the specs (i.e. parsing the spec's HTML and its English text), and I would have to build the validator myself to generate it.
More specifically, I'm looking for a list of elements, their content models, and a list of their attributes with their type and whether they are required or they have a default value.
Finally, I should mention that I'm not looking to validate specific documents; for that I would use W3C's validator or validator.nu directly. I'm looking for the specs in a form that I can use in my own applications.
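To make the request concrete, the kind of data described above could be modeled roughly as follows. This is only a sketch: the class and field names (`ElementSpec`, `AttributeSpec`, `value_type`, `missing_required`) are hypothetical, and the `img` entries are hand-written for illustration rather than taken from any published schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class AttributeSpec:
    """One attribute of an element (hypothetical shape)."""
    name: str
    value_type: str                  # e.g. "url", "text", "boolean"
    required: bool = False
    default: Optional[str] = None


@dataclass
class ElementSpec:
    """One element with its content model and attributes (hypothetical shape)."""
    name: str
    content_model: str               # e.g. "flow", "phrasing", "nothing"
    attributes: Dict[str, AttributeSpec] = field(default_factory=dict)

    def add(self, attr: AttributeSpec) -> None:
        self.attributes[attr.name] = attr

    def missing_required(self, present: Set[str]) -> List[str]:
        """Return required attribute names that are not in `present`."""
        return sorted(
            a.name for a in self.attributes.values()
            if a.required and a.name not in present
        )


# Illustrative data only, typed in by hand.
img = ElementSpec(name="img", content_model="nothing")
img.add(AttributeSpec("src", "url", required=True))
img.add(AttributeSpec("alt", "text"))

print(img.missing_required({"alt"}))
```

An application could load a table of such records from whatever machine-readable source turns up (JSON, a RelaxNG schema, a scraped spec) and query it without running a full validator.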
Trawling through the W3C's site, I can only see two things of interest on this:
- "As HTML5 is no longer formally based upon SGML, the DOCTYPE no longer serves this purpose, and thus no longer needs to refer to a DTD." from the HTML5 working draft. It doesn't say there isn't one, just that clients don't need one.
- HTML5 is obviously still a working draft, not a specification, which implies that a DTD may be published later.
I've looked as hard as you probably have and found nothing concrete. I think validator.nu's approach is the best one, as the working draft is likely to change several times before a specification is ever agreed upon. If someone did publish an unofficial DTD, it would need constant maintenance.
+1 great question, I wish I could find a concrete answer. I hope someone else can!
I've read this question and its answers and decided to start a new project: WHATWG HTML5 Standard Parser. Currently, it parses the single-page version of the standard's HTML page and provides the elements together with their allowed attributes.
Hope to get something started... Pull requests are welcome!
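For readers wondering what scraping elements and attributes out of the standard might look like, here is a minimal sketch using only Python's standard library. It is not the project's actual code. The markup in `SAMPLE` is an assumed, heavily simplified shape for an element definition (a heading containing `<dfn><code>name</code></dfn>`, followed by a `<dl class="element">` listing attributes); the real standard is far larger and its markup may differ, so treat the selectors as placeholders.

```python
from html.parser import HTMLParser

# Assumed, simplified stand-in for a fragment of the spec's HTML.
SAMPLE = """
<h4 id="the-a-element">The <dfn><code>a</code></dfn> element</h4>
<dl class="element">
  <dt>Content attributes:</dt>
  <dd><code>href</code></dd>
  <dd><code>target</code></dd>
</dl>
<h4 id="the-img-element">The <dfn><code>img</code></dfn> element</h4>
<dl class="element">
  <dt>Content attributes:</dt>
  <dd><code>src</code></dd>
  <dd><code>alt</code></dd>
</dl>
"""


class SpecScraper(HTMLParser):
    """Collect element names and the attribute names listed under each."""

    def __init__(self):
        super().__init__()
        self.elements = {}           # element name -> list of attribute names
        self.stack = []              # currently open tags
        self.current = None          # element whose <dl> is being read
        self.in_element_dl = False

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "dl" and ("class", "element") in attrs:
            self.in_element_dl = True

    def handle_endtag(self, tag):
        if tag == "dl":
            self.in_element_dl = False
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        # Only text directly inside <code> is of interest.
        if not text or not self.stack or self.stack[-1] != "code":
            return
        if "h4" in self.stack and "dfn" in self.stack:
            # A heading like: The <dfn><code>a</code></dfn> element
            self.current = text
            self.elements.setdefault(text, [])
        elif self.in_element_dl and "dd" in self.stack and self.current:
            self.elements[self.current].append(text)


scraper = SpecScraper()
scraper.feed(SAMPLE)
print(scraper.elements)
# {'a': ['href', 'target'], 'img': ['src', 'alt']}
```

The same state-machine idea would extend to the other rows of an element definition, such as the content model, though prose-only rules in the standard cannot be recovered this way.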
There isn't a BNF/CFG for HTML5 because HTML5 is partially about progressive enhancement and fixing errors silently. If a page features broken markup, it's the browser's duty to display the page as well as it can and not complain to the user.
More about this history can be read at Dive Into HTML5 / How Did We Get Here?:
As you might expect, the fact that “broken” HTML markup still worked in web browsers led authors to create broken HTML pages. A lot of broken pages. By some estimates, over 99% of HTML pages on the web today have at least one error in them. But because these errors don’t cause browsers to display visible error messages, nobody ever fixes them.
I guess this isn't particularly helpful, so my apologies. You could try looking at the XHTML 1.1 DTD or SGML DTD as starting points. Or, if you want a heuristic-based best-attempt approach, check out an HTML parser such as Beautiful Soup.
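The "accept anything, never complain" behaviour described above can be seen even in a simple tokenizer. The sketch below uses Python's standard-library `html.parser` as a stand-in; it is neither Beautiful Soup nor a conforming HTML5 parser, and the `TagLogger` class is a made-up name for illustration. It only shows that deliberately broken markup is consumed without raising an error.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record every start and end tag seen, without judging validity."""

    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)

    def handle_endtag(self, tag):
        self.closed.append(tag)


# Unclosed <p> and <b>, plus a stray </i> that was never opened.
broken = "<p>first <b>bold<p>second</i>"

logger = TagLogger()
logger.feed(broken)
logger.close()

print(logger.opened)  # ['p', 'b', 'p']
print(logger.closed)  # ['i']
```

Nothing here decides whether the input is valid; that judgement is exactly what a grammar or schema would have to supply.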
UPDATING
Since 2014-10-28, HTML5 has been a W3C Recommendation (!). But this question is not obsolete: validators are now more complex than a simple DTD.
ANSWER
There is no simple parser, as @ruediste's clues show. Today, perhaps the best parser is at https://validator.nu/. So:
- You give the first part of the answer yourself: it takes a complex parser, and validator.nu is a good one.
- The W3C Recommendation of 2014-10-28 confirms that there is no simple artifact (like a DTD or a list of elements) that can say "this is valid HTML5".
- ... this other question shows that, perhaps, only context (use/community) can validate the list of tags and attributes.
NEW as of April 2019: the WHATWG HTML5 spec as JSON, although it is very incomplete and a work in progress.
It uses Python to parse the multipage standard.
Full disclosure: I made this.
See also
HTML5 RelaxNG schemas