
When matching HTML or XML tags, should one worry about casing?

https://www.devze.com 2023-01-05 08:28 Source: web
If you are parsing HTML or XML (with Python) and looking for certain tags, lowercasing or uppercasing an entire document so that your comparisons are accurate can hurt performance. What percentage (estimated) of XML and HTML documents use any uppercase characters in their tags?


XML (and XHTML) tags are case-sensitive ... so <this> and <tHis> would be different elements.
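To illustrate the point, here is a minimal sketch using the standard library's xml.etree.ElementTree. The sample document and the variable names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: two sibling elements whose names differ only in case.
doc = "<root><this>a</this><tHis>b</tHis></root>"
root = ET.fromstring(doc)

# The parser keeps tag names exactly as written.
tags = [child.tag for child in root]

# An exact-name lookup only finds the element with matching case.
exact_matches = root.findall("this")

print(tags)
print([element.text for element in exact_matches])
```

Because the names are distinct, lowercasing an XML document before parsing would merge elements that the format treats as different.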

However, a lot (rough estimate) of HTML (not XHTML) documents use randomly cased tags.
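For HTML, the standard library's html.parser already passes tag names to its handlers in lowercase, so the document itself never needs to be lowercased. A small sketch, with a hypothetical TagCollector class and made-up input:

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper that records every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over the tag name already converted to lowercase.
        self.tags.append(tag)


collector = TagCollector()
collector.feed("<HTML><Body><P>hi</P><bR></Body></HTML>")
collector.close()

print(collector.tags)
```

Comparisons against lowercase names such as "p" or "br" then work regardless of how the page author cased the tags.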


Only if you're using XHTML, which is case-sensitive. HTML is not, so you can ignore case differences there. Test for the doctype before worrying about checking for case.
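One way to test for the doctype is sketched below. It is only a heuristic: the helper names looks_like_xhtml and tag_matches, the regular expression, and the 1024-character probe window are assumptions made for this example, and a document can lack a doctype or carry a misleading one.

```python
import re

# Assumption: an XHTML document declares one of the W3C XHTML DTDs near the top.
_XHTML_DOCTYPE = re.compile(
    r'<!DOCTYPE\s+html\s+PUBLIC\s+"-//W3C//DTD XHTML', re.IGNORECASE
)


def looks_like_xhtml(text, probe_chars=1024):
    """Return True if the start of the document carries an XHTML doctype."""
    return _XHTML_DOCTYPE.search(text[:probe_chars]) is not None


def tag_matches(found, wanted, case_sensitive):
    """Compare two tag names, ignoring case unless case_sensitive is set."""
    if case_sensitive:
        return found == wanted
    return found.lower() == wanted.lower()


xhtml_doc = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html></html>'
)
html_doc = "<!DOCTYPE html><HTML></HTML>"

for document in (xhtml_doc, html_doc):
    strict = looks_like_xhtml(document)
    print(strict, tag_matches("HTML", "html", case_sensitive=strict))
```

Only the short tag names are lowercased here, never the whole document.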


I think you're overly concerned about performance. If you're talking about arbitrary web pages, 90% of them will be HTML, not XHTML, so you should do case-insensitive comparisons. Lowercasing a string is extremely fast, and should be less than 1% of the total time of your parser. If you're not sure, carefully time your parser on a document that's already all lowercase, with and without the lowercase conversions.
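A rough way to check this on your own machine is to time the conversion by itself with timeit. The helper time_lowercasing and the sample input are hypothetical, and this measures only lower(), not a full parser run with and without it.

```python
import timeit


def time_lowercasing(document, repeats=5):
    """Return the best-of-N wall-clock time, in seconds, for one document.lower() call."""
    return min(timeit.repeat(document.lower, number=1, repeat=repeats))


# Made-up input: a few hundred kilobytes of uppercase markup.
sample = "<HTML><BODY><P>Hello</P></BODY></HTML>" * 10_000

seconds = time_lowercasing(sample)
print(f"lower() on {len(sample):,} characters: {seconds:.6f} s")
```

Comparing that figure with the total time your parser spends on the same document shows whether the conversion matters in practice.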

Even a pure-Python implementation of lower() would be negligible compared to the rest of the parsing, but it's better than that - CPython implements lower() in C code, so it really is as fast as possible.

Remember, premature optimization is the root of all evil. Make your program correct first, then make it fast.
