I know I can check the response header's 'last-modified' value to determine when the web page was last modified, but in many instances that header is NOT provided. Also, in many instances the content itself hasn't changed, but the current time/date is displayed on the page, thus giving the appearance of a modification.
Any suggestions on how to overcome the above issues and determine if a web page has been (truly) modified?
Thanks.
Sure. Define for yourself what counts as a "modification" (for example, only things in the "content" div) and only look at that.
If you can't find a way to decide whether something's been changed, then you can't expect a computer to…
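As a rough illustration of "only look at the content div", here is a minimal Python sketch that hashes just the text inside one element and ignores the rest of the page. The id "content", the class name RegionText and the function name region_fingerprint are all made up for this example, and the parser assumes reasonably well-formed HTML (unclosed tags such as a bare <p> would throw off the nesting count).

```python
import hashlib
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class RegionText(HTMLParser):
    """Collects the text inside the first element whose id equals region_id."""

    def __init__(self, region_id):
        super().__init__()
        self.region_id = region_id
        self.depth = 0       # 0 means we are outside the region
        self.done = False    # True once the region has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.region_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def region_fingerprint(html, region_id="content"):
    """Return a SHA-256 hex digest of the whitespace-normalized region text."""
    parser = RegionText(region_id)
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Store the fingerprint from the previous visit and compare it with the new one; a clock or banner outside the chosen element will not change it.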
You are asking two questions here:
- When was it modified?
- Was it modified?
To answer question #1, you'd have to check the page at an interval that matches the granularity you need, e.g. every hour, every day, or every week. This could be quite resource intensive, so consider whether you really need to know this.
To answer question #2, you need to compare something. You could do what @Paul Rosnia suggested, but if they so much as add a comma, the page will be considered modified.
Then, you might also want to see what has been modified. For that you'd have to save the content and compare the saved copy with the current one in order to highlight the changes.
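One way to highlight what changed between a saved copy and the current copy is a line-based diff. Below is a small Python sketch using the standard difflib module; the function name describe_changes and the labels "cached" and "current" are just illustrative choices.

```python
import difflib


def describe_changes(old_text, new_text):
    """Return unified-diff lines between two versions of a page's text."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    # lineterm="" because splitlines() already removed the newlines.
    return list(difflib.unified_diff(
        old_lines, new_lines,
        fromfile="cached", tofile="current", lineterm=""))
```

An empty list means the two versions are identical; otherwise lines starting with "-" were removed and lines starting with "+" were added.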
You could use http://php.net/manual/en/function.file-get-contents.php and a CRON job to cache the page on your server and then periodically compare the live page against your cache. The comparison will be the tricky part, since you have to write specific code to ignore the things that don't matter to you, e.g. date/time stamps, header changes, menu changes, etc.
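The cache-and-compare idea might look something like the following Python sketch (Python rather than PHP, purely for illustration). It strips a couple of volatile patterns before comparing against the cached copy. The two regular expressions only cover ISO-style dates and clock times and are assumptions about what the page shows; you would replace them with patterns for whatever your target page actually varies. The names normalize and changed_since_cache are hypothetical.

```python
import re
from pathlib import Path

# Patterns for text that changes on every request but is not a "real" edit.
# These are examples only; adjust them to the page you are watching.
VOLATILE_PATTERNS = [
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),                        # 2011-03-07
    re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?(?:\s?[AaPp][Mm])?"),  # 10:01, 10:01:22 PM
]


def normalize(text):
    """Remove volatile fragments and collapse whitespace."""
    for pattern in VOLATILE_PATTERNS:
        text = pattern.sub("", text)
    return " ".join(text.split())


def changed_since_cache(text, cache_path):
    """Compare normalized text with the cached copy, then update the cache.

    Returns False on the first run, when there is nothing to compare against.
    """
    cache = Path(cache_path)
    current = normalize(text)
    previous = cache.read_text(encoding="utf-8") if cache.exists() else None
    cache.write_text(current, encoding="utf-8")
    return previous is not None and previous != current
```

A cron job would fetch the page, pass its text to changed_since_cache, and act only when it returns True.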
The sure-fire way to detect page changes is to download and checksum it. If the checksum changes, the page has been edited (with extremely high certainty).
Here's an example that works on the command line (md5 is the macOS/BSD tool; on Linux use md5sum instead):
curl -s news.ycombinator.com | md5 #=> d86582bec138c051b0d8322f7823a23c
That was a few minutes ago. If you run it now you'll get a different answer!