Validation Failed: "EntityRef: expecting ';'"

Developer https://www.devze.com 2023-01-10 20:14 Source: Internet
Hi I've got some XML that won't validate. I've narrowed down the problem to this bit:

<script type="text/javascript">document.getelementbyid("oxm-1f4a4485-5a1d-45f9-a989-9c65a0b9ceb6").src="http://bid.website.net/display?l=h4siaaaaaaaaad2nmq6cqbrenycw7qjyolfccxmregvcoae0u0sly_agtvaewwn4bg_havwbnebpvmzkkzra_kzzdvoloq4u-hjnp7sii0rxcbzz5vl5kxsrds6wtsfbxmcr9chysuhqbecuckb8cvx4m-pbcxugtdrll6d3dqtihnqukth2yvdkptr67cuzfvlxjlinkul9634lpal_h4mwhso8aabzhw1cdcwjxl6xivgv8agrjxjc_gaaaa==&p=h4siaaaaaaaaabxkmq7cmaxaurcqjjrrsfqqsrm7x3fsrwyvosda8qnj_3ojfgb49o45pblq7e80syzjhopggso9wyzpcpntzkxk1ldtbbi7otmxfj9da1wpjcf10vtxdj9e5_utyj19k2lfssepld5agnqaaaa=&url=http%3a%2f%2flocalhost%2fproject-debug%2fproject.html";</script>

I put it in an XML validator and it spat out:

This page contains the following errors: error on line 1 at column 16: EntityRef: expecting ';'

Any ideas as to where the missing ';' is supposed to go? Is there another problem?


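You have unescaped ampersands (&) in your URL. An XML parser treats & as the start of an entity reference and expects a name followed by ';', which is why it complains about a missing ';'. The ampersands either need to be (a) changed to character entities (&amp;), or (b) enclosed in a CDATA section.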

A CDATA section lets you leave special characters like & unescaped, so that'd be easiest:

<script type="text/javascript">
// <![CDATA[
    document.getElementById(...).src="...";
// ]]>
</script>

You can include anything you want inside of a CDATA section aside from the exact character sequence ]]>. The // comments are there to make sure browsers that don't understand CDATA sections ignore the <![CDATA[ and ]]> markers.
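If the script text is generated by code, the ]]> restriction can be handled mechanically. Here is a minimal sketch, assuming the page is parsed as XML; wrapScriptInCdata is a made-up helper name and the URL is a placeholder, not the one from the question. It splits any ]]> across two adjacent CDATA sections, which an XML parser reads back as the original text. An HTML parser would not, so this only makes sense for XML/XHTML output.

```javascript
// Hypothetical helper (not from the original answer): wraps script text in a
// CDATA section guarded by JavaScript line comments.
function wrapScriptInCdata(scriptText) {
  // "]]>" would close the section early, so split it across two sections:
  // the first ends after "]]", the second starts with ">".
  const safe = scriptText.split("]]>").join("]]]]><![CDATA[>");
  return "// <![CDATA[\n" + safe + "\n// ]]>";
}

// Placeholder URL with unescaped ampersands, as in the question.
const wrapped = wrapScriptInCdata('el.src = "http://example.invalid/display?l=1&p=2";');
console.log(wrapped);
```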

By the way, JavaScript is case sensitive. That should be getElementById not getelementbyid.
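For option (a), escaping by hand gets tedious with a long URL, so it may help to do it in code before the markup is written out. A rough sketch, assuming the output is parsed as XML (the parser turns &amp; back into & before the script sees it); escapeXml is a hypothetical helper and the URL is a placeholder:

```javascript
// Hypothetical helper: escapes characters that are special in XML text and
// attribute values. "&" is replaced first so that the entities produced by
// the later replacements are not escaped a second time.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Placeholder URL with the same shape as the one in the question.
const url = "http://example.invalid/display?l=abc&p=def&url=http%3a%2f%2flocalhost";
console.log(escapeXml(url));
// prints: http://example.invalid/display?l=abc&amp;p=def&amp;url=http%3a%2f%2flocalhost
```

Note that this escapes every ampersand, so it should only be applied to raw text that has not been escaped already.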


Modifying the content isn't always possible, e.g. if you're scraping a website.

You can't just str_replace '&' with '&amp;' because the HTML might include valid HTML entities, and you'd get something like "&amp;amp;"

Here's a regex which should replace bare ampersands with the HTML entity for an ampersand, without breaking good HTML entities:

$html = preg_replace("|&([^;]+?)[\s<&]|","&amp;$1 ",$html);

I used it to scrape about 700 pages without any problems :)
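One thing to watch with the pattern above: it also consumes the character that ends the match (whitespace, < or &) and puts a space in its place, so a URL like ?l=1&p=2&url=3 loses its second ampersand. A possible alternative, sketched here in JavaScript rather than PHP, uses a negative lookahead so nothing after the & is consumed. The names are made up for illustration, and it only checks the shape of a reference (named, decimal or hex), not whether the entity name actually exists.

```javascript
// Matches "&" only when it does NOT start something shaped like a named
// (&amp;), decimal (&#38;) or hexadecimal (&#x26;) reference.
// The lookahead inspects the following text without consuming it.
const BARE_AMPERSAND = /&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)/g;

// Hypothetical helper: escapes bare ampersands, leaves existing references alone.
function escapeBareAmpersands(html) {
  return html.replace(BARE_AMPERSAND, "&amp;");
}

console.log(escapeBareAmpersands("?l=1&p=2 &amp; &#38; Q&A"));
// prints: ?l=1&amp;p=2 &amp; &#38; Q&amp;A
```

The same pattern should carry over to preg_replace with '&amp;' as the replacement, though I have only checked the JavaScript version.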
