
Compare two websites and see if they are "equal?"

https://www.devze.com 2023-01-08 12:08 Source: web
We are migrating web servers, and it would be nice to have an automated way to check some of the basic site structure to see if the rendered pages are the same on the new server as on the old server. Does anyone know of anything to assist with this task?


Get the formatted output of both sites (here we use w3m, but lynx can also work):

w3m -dump http://google.com 2>/dev/null > /tmp/1.html
w3m -dump http://google.de 2>/dev/null > /tmp/2.html

Then use wdiff, which can report how similar the two texts are as a percentage.

wdiff -nis /tmp/1.html /tmp/2.html

It can also be easier to see the differences using colordiff.

wdiff -nis /tmp/1.html /tmp/2.html | colordiff

Excerpt of output:

Web Images Vidéos Maps [-Actualités-] Livres {+Traduction+} Gmail plus »
[-iGoogle |-]
Paramètres | Connexion

                           Google [hp1] [hp2]
                                  [hp3] [-Français-] {+Deutschland+}

           [                                                         ] Recherche
                                                                       avancéeOutils
                      [Recherche Google][J'ai de la chance]            linguistiques


/tmp/1.html: 43 words  39 90% common  3 6% deleted  1 2% changed
/tmp/2.html: 49 words  39 79% common  9 18% inserted  1 2% changed

(It actually rendered google.com in French... funny.)

The common % values show how similar the two texts are. Plus, you can easily see the differences word by word (instead of line by line, which can be cluttered).


The catch is how to check the 'rendered' pages. If the pages don't have any dynamic content, the easiest way is to generate hashes for the files using the md5 or sha1 commands and check them against the files on the new server.
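A minimal sketch of that hash check, assuming the static files from both servers have already been copied into two local directories (the names old/ and new/ and the sample page are illustrative):

```shell
# Hypothetical layout: the static pages from each server have already
# been copied into old/ and new/.
mkdir -p old new
echo '<html><body>hello</body></html>' > old/index.html
echo '<html><body>hello</body></html>' > new/index.html

# Hash every file in each tree, then compare the two sorted hash lists.
(cd old && find . -type f -exec md5sum {} +) | sort > /tmp/old.md5
(cd new && find . -type f -exec md5sum {} +) | sort > /tmp/new.md5

diff /tmp/old.md5 /tmp/new.md5 && echo "all files match"
```

Any file whose content differs between the trees will show up in the diff of the hash lists.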

If the pages have dynamic content, you will have to download the site using a tool like wget:

wget --mirror http://thewebsite/thepages

and then use diff as suggested by Warner, or do the hash thing again. I think diff may be the best way to go, since even a change of a single character will alter the hash.
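A page-by-page version of that check can be sketched as follows; the file:// base URLs stand in for the real http:// servers, and the page list is illustrative:

```shell
# Two tiny local "sites" stand in for the old and new servers; in
# practice OLD and NEW would be http:// base URLs.
mkdir -p /tmp/oldsite /tmp/newsite
echo '<h1>home</h1>'   > /tmp/oldsite/index.html
echo '<h1>home</h1>'   > /tmp/newsite/index.html
echo '<h1>about</h1>'  > /tmp/oldsite/about.html
echo '<h1>About!</h1>' > /tmp/newsite/about.html

OLD="file:///tmp/oldsite"
NEW="file:///tmp/newsite"

# Fetch each page from both servers and report whether it changed.
for page in index.html about.html; do
  curl -s "$OLD/$page" > /tmp/page-old
  curl -s "$NEW/$page" > /tmp/page-new
  if cmp -s /tmp/page-old /tmp/page-new; then
    echo "SAME $page"
  else
    echo "DIFF $page"
  fi
done > /tmp/compare-report.txt

cat /tmp/compare-report.txt
```

The report gives a quick overview of which pages need a closer look with diff or wdiff.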


I've created the following PHP code that does what Weboide suggests here. Thanks, Weboide!

The paste is here:

http://pastebin.com/0V7sVNEq


Using the open source tool recheck-web (https://github.com/retest/recheck-web), there are two possibilities:

  • Create a Selenium test that checks all of your URLs on the old server, creating Golden Masters. Then run that test against the new server to see how the pages differ.
  • Use the free and open-source Chrome extension (https://github.com/retest/recheck-web-chrome-extension), which internally uses recheck-web to do the same: https://chrome.google.com/webstore/detail/recheck-web-demo/ifbcdobnjihilgldbjeomakdaejhplii

For both solutions, you currently need to list all relevant URLs manually. In most situations, this shouldn't be a big problem. recheck-web will compare the rendered websites and show you exactly where they differ (e.g. a different font, different meta tags, even different link URLs). And it gives you powerful filters to let you focus on what is relevant to you.

Disclaimer: I have helped create recheck-web.


Copy the files to the same server in /tmp/directory1 and /tmp/directory2 and run the following command:

diff -r /tmp/directory1 /tmp/directory2

For all intents and purposes, you can put them in your preferred location with your preferred naming convention.

Edit 1

You could potentially use lynx -dump or a wget and run a diff on the results.


Short of rendering each page, taking screen captures, and comparing those screenshots, I don't think it's possible to compare the rendered pages.

However, it is certainly possible to compare the downloaded website after downloading recursively with wget.

  wget [option]... [URL]...

   -m
   --mirror
       Turn on options suitable for mirroring.  This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP
       directory listings.  It is currently equivalent to -r -N -l inf --no-remove-listing.

The next step would then be to do the recursive diff that Warner recommended.
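That recursive diff can be kept brief with `diff -rq`, which only reports which files differ; the directory names below are illustrative stand-ins for two trees produced by wget --mirror:

```shell
# Stand-ins for two directories produced by wget --mirror.
mkdir -p /tmp/site-old/css /tmp/site-new/css
echo 'body { margin: 0; }' > /tmp/site-old/css/main.css
echo 'body { margin: 0; }' > /tmp/site-new/css/main.css
echo 'old homepage'        > /tmp/site-old/index.html
echo 'new homepage'        > /tmp/site-new/index.html

# -r recurses; -q only reports WHICH files differ, not how.
# diff exits 1 when differences exist, so don't let that abort a script.
diff -rq /tmp/site-old /tmp/site-new > /tmp/rdiff.txt || true
cat /tmp/rdiff.txt
```

Dropping -q on the files it flags then shows the actual line-level changes.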

