I'm looking for a way to rewrite all the links in a cURL response so that they point to my site.
Let's say my site is example.com, and I make a cURL request to site.com. site.com contains various links:
<a href="http://smthing.com">Something!</a>
<some html>......
<a href="http://google.com">Google!</a>
<more html>
<a href="#" onclick="window.location.href='http://somethingElse.com'">Something else</a>
My goal is to prefix every link with example.com/?url={THE URL OF THE LINK} (i.e. my site).
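To make the target format concrete, here is a minimal Python sketch of the prefixing step on its own. The function name `proxied_url`, the `http://` scheme on example.com, and the choice to percent-encode the original URL are all assumptions for illustration; they are not part of my current code.

```python
from urllib.parse import quote

# Assumed prefix: the question only says "example.com/?url=", the scheme is a guess.
PROXY_PREFIX = "http://example.com/?url="


def proxied_url(original_url: str) -> str:
    """Return the original URL wrapped in the proxy prefix.

    The original URL is percent-encoded (safe="" also encodes ':' and '/')
    so that any '?' or '&' it contains is not mistaken for part of the
    proxy's own query string.
    """
    return PROXY_PREFIX + quote(original_url, safe="")


if __name__ == "__main__":
    print(proxied_url("http://google.com"))
    # prints http://example.com/?url=http%3A%2F%2Fgoogle.com
```

The receiving script on example.com would then have to decode the `url` parameter before making its own request.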
My current solution uses a regex to catch and process all the links. This works most of the time, but every so often I run into invalid HTML that the regex fails on. The regex has another disadvantage: it can't catch onclick="" actions and other link scenarios.
I've heard of several alternatives, such as URL rewriting and a reverse proxy. Can any of them achieve my goal?
Thanks..
You should absolutely be able to use regex for that. However, your code will have to be more robust to handle inline scripting. Analyze a large sample of anchor attributes to determine all the link formats you need to handle, beyond /href=""/ and /window.location.href/.
You will also have to parse any referenced script files to see what their event handlers do.
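As a rough illustration of the idea, here is a minimal Python sketch that handles several link formats in one regex pass. The set of formats (`href=`, `location=`, `window.open(`), the name `rewrite_links`, the `http://` scheme on example.com, and the percent-encoding are assumptions for the example; a real sample of pages will almost certainly turn up more formats, and this does not touch external script files.

```python
import re
from urllib.parse import quote

# Assumed prefix; the scheme is a guess.
PROXY_PREFIX = "http://example.com/?url="

# One pattern, one pass, so a link is never rewritten twice.
# Group 1: what precedes the URL (href=, location=, window.open( ).
#          "href=" also covers "window.location.href=" in inline scripts.
# Group 2: the opening quote, which must be matched again by \2.
# Group 3: an absolute http(s) URL containing no quote characters.
LINK_RE = re.compile(
    r"""(\bhref\s*=\s*|\blocation\s*=\s*|\bwindow\.open\(\s*)"""
    r"""(["'])(https?://[^"']+)\2""",
    re.IGNORECASE,
)


def _rewrite(match: re.Match) -> str:
    lead, quote_char, url = match.group(1), match.group(2), match.group(3)
    # Percent-encode the original URL so it survives as a single query value.
    return f"{lead}{quote_char}{PROXY_PREFIX}{quote(url, safe='')}{quote_char}"


def rewrite_links(html: str) -> str:
    """Prefix every absolute link that LINK_RE recognises."""
    return LINK_RE.sub(_rewrite, html)


if __name__ == "__main__":
    sample = (
        '<a href="http://google.com">Google!</a>\n'
        '<a href="#" onclick="window.location.href='
        "'http://somethingElse.com'\">Something else</a>"
    )
    print(rewrite_links(sample))
```

Relative links, unquoted attribute values, and URLs assembled at runtime in JavaScript are not matched by this pattern and would need separate handling.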