Hi I've got some XML that won't validate. I've narrowed down the problem to this bit:
<script type="text/javascript">document.getelementbyid("oxm-1f4a4485-5a1d-45f9-a989-9c65a0b9ceb6").src="http://bid.website.net/display?l=h4siaaaaaaaaad2nmq6cqbrenycw7qjyolfccxmregvcoae0u0sly_agtvaewwn4bg_havwbnebpvmzkkzra_kzzdvoloq4u-hjnp7sii0rxcbzz5vl5kxsrds6wtsfbxmcr9chysuhqbecuckb8cvx4m-pbcxugtdrll6d3dqtihnqukth2yvdkptr67cuzfvlxjlinkul9634lpal_h4mwhso8aabzhw1cdcwjxl6xivgv8agrjxjc_gaaaa==&p=h4siaaaaaaaaabxkmq7cmaxaurcqjjrrsfqqsrm7x3fsrwyvosda8qnj_3ojfgb49o45pblq7e80syzjhopggso9wyzpcpntzkxk1ldtbbi7otmxfj9da1wpjcf10vtxdj9e5_utyj19k2lfssepld5agnqaaaa=&url=http%3a%2f%2flocalhost%2fproject-debug%2fproject.html";</script>
I put it in an XML validator and it spat out:
This page contains the following errors: error on line 1 at column 16: EntityRef: expecting ';'
Any ideas as to where the missing ';' is supposed to go? Is there another problem?
You have unescaped ampersands (&) in your URL. They either need to be (a) changed to character entities (&amp;), or (b) enclosed in a CDATA section.
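For option (a), here is a minimal sketch of the escaping step. It assumes the input is a raw URL that has not been escaped yet. The helper name `escapeAmpersands` and the shortened URL are made up for illustration; they are not from the question.

```javascript
// Hypothetical helper: escape every ampersand in a raw (not yet escaped) string.
function escapeAmpersands(raw) {
  return raw.replace(/&/g, "&amp;");
}

// Shortened stand-in for the URL in the question (query values are placeholders).
const url = "http://bid.website.net/display?l=abc&p=def&url=http%3a%2f%2flocalhost";
console.log(escapeAmpersands(url));
```

Only run this on text that contains no entities yet, otherwise an existing &amp; gets escaped a second time.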
A CDATA section lets you leave special characters like & unescaped, so that'd be easiest:
<script type="text/javascript">
// <![CDATA[
document.getElementById(...).src="...";
// ]]>
</script>
You can include anything you want inside of a CDATA section aside from the exact character sequence ]]>. The // comments are there to make sure browsers that don't understand CDATA sections ignore the <![CDATA[ and ]]> markers.
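If you generate the CDATA wrapper from code and can't rule out ]]> in the content, one common workaround is to end the section after the ]] and start a new one before the >. A rough sketch, with `toCdata` as a made-up helper name:

```javascript
// Hypothetical helper: wrap arbitrary text in one or more CDATA sections.
function toCdata(text) {
  // Split every "]]>" so that "]]" closes one section and ">" opens the next.
  return "<![CDATA[" + text.split("]]>").join("]]]]><![CDATA[>") + "]]>";
}

console.log(toCdata('el.src = "page?a=1&b=2";'));
```

An XML parser concatenates adjacent CDATA sections, so the content comes back out with the original ]]> intact.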
By the way, JavaScript is case sensitive. That should be getElementById, not getelementbyid.
Modifying the content isn't always possible, e.g. if you're scraping a website.
You can't just str_replace '&' with '&amp;', because the HTML might already include valid HTML entities, and you'd get something like "&amp;amp;".
Here's a regex which should replace bare ampersands with the HTML entity for an ampersand, without breaking valid HTML entities:
$html = preg_replace("|&([^;]+?)[\s<&]|","&amp;$1 ",$html);
I used it to scrape about 700 pages without any problems :)
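For comparison, here is a sketch of the same idea in JavaScript using a negative lookahead. This is a different pattern from the PHP one above, not a port of it: it escapes an ampersand only when it is not followed by something shaped like a named, decimal, or hexadecimal character reference, and it leaves the surrounding characters alone (the PHP pattern swaps the character after the match for a space). The names `BARE_AMPERSAND` and `escapeBareAmpersands` are made up for illustration.

```javascript
// Matches "&" only when it does NOT start a named (&amp;), decimal (&#38;)
// or hexadecimal (&#x26;) character reference.
const BARE_AMPERSAND = /&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/g;

// Hypothetical helper: escape bare ampersands, keep existing entities as they are.
function escapeBareAmpersands(html) {
  return html.replace(BARE_AMPERSAND, "&amp;");
}

console.log(escapeBareAmpersands('<a href="page?x=1&y=2">Tom &amp; Jerry</a>'));
```

One limitation: the check is purely about shape, so text like &foo; is left untouched even if foo isn't a defined entity.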