I'm currently using a combination of urllib2, pyquery, and json to scrape a site, and now I find that I need to extract some data from JavaScript. One thought would be to use a JavaScript engine (like V8), but that seems like overkill for what I need. I would use regular expressions, but the expression for this seems way too complex.
JavaScript:
(function(){DOM.appendContent(this, HTML("<html>"));;})
I need to extract the <html>, but I'm not entirely sure how to do so. The <html> itself can contain basically every character under the sun, so [^"] won't work.
Any thoughts?
Why regex? Can't you just use two substrings as you know how many characters you want to trim off the beginning and end?
string[42:-7]
As well as being quicker than a regex, it then doesn't matter if quotes inside <html> are escaped or not.
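A minimal sketch of the slicing approach in Python, assuming the wrapper around <html> is always exactly the text shown in the question. The names PREFIX, SUFFIX and extract_html are made up for illustration. Taking the lengths from the literal wrapper strings gives the same 42 and 7 as above, and the startswith/endswith check guards against the site changing its output.

```python
# Hypothetical names; the wrapper text is copied from the question.
PREFIX = '(function(){DOM.appendContent(this, HTML("'
SUFFIX = '"));;})'


def extract_html(script):
    """Return whatever sits between the known prefix and suffix."""
    if not (script.startswith(PREFIX) and script.endswith(SUFFIX)):
        raise ValueError("script does not have the expected wrapper")
    # Equivalent to script[42:-7] for this particular wrapper.
    return script[len(PREFIX):-len(SUFFIX)]
```

The result is still the raw JavaScript string literal body, so any \" sequences remain escaped. If the response may carry surrounding whitespace, calling script.strip() before slicing is probably a good idea.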
If every occurrence of " inside the HTML code is escaped as \" (it is a JavaScript string after all), you could use
HTML\("((?:\\"|.)*?)"\)
to capture the argument passed to HTML in the first capturing group.
Note that this regex is not yet escaped for use inside a JavaScript string literal.
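Since the question is about scraping from Python, here is a rough sketch of how that regex might be used with the re module. A raw string avoids the extra layer of escaping mentioned above. The names HTML_ARG, extract_html_arg and unescape_js_string are hypothetical, and re.DOTALL is an added assumption so that . also matches newlines. Decoding the escapes with json.loads is likewise an assumption: it only handles escapes that JSON shares with JavaScript (\", \\, \n, \uXXXX and so on), not ones like \x41 or \'.

```python
import json
import re

# The regex from the answer, written as a raw string so the backslashes
# reach the regex engine unchanged. DOTALL is an assumption, not part of
# the original pattern.
HTML_ARG = re.compile(r'HTML\("((?:\\"|.)*?)"\)', re.DOTALL)


def unescape_js_string(raw):
    """Decode the body of a double-quoted JS string via the JSON parser.

    Only valid for escapes that JSON and JavaScript have in common.
    """
    return json.loads('"' + raw + '"')


def extract_html_arg(script):
    """Return the decoded argument of HTML("..."), or None if not found."""
    match = HTML_ARG.search(script)
    if match is None:
        return None
    return unescape_js_string(match.group(1))
```

With the example from the question, extract_html_arg would be called on the whole script text and return the markup with \" turned back into ". If the markup uses escapes outside the JSON subset, json.loads raises ValueError, and a real JavaScript parser may be the safer choice.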