matching common strings between two data sets

I am working on a website conversion. I have a dump of the database backend as an sql file. I also have a scrape of the website from wget.

What I want to do is map database tables and columns to directories, pages, and sections of pages in the scrape. I'd like to automate this.

Is there a tool or script out there that could pull strings from one source and look for them in the other? Ideally, it would return a set of results that say something like:

string "piece of website content here" on line 453 in table.sql matches string in website.com/subdirectory/certain_page.asp on line 56.

I don't want to do line comparisons because lines from the database dump (INSERT INTO table VALUES (...) ) aren't going to match lines in the page where it actually populates (<div id='left_column'><div id='left_content'>...</div></div>).

I realize this is a computationally intensive task, but even letting it run over the weekend is fine.

I've found similar questions, but I don't have enough CS background to know if they are identical to my problem or not. SO kindly suggested this question, but it appears to be dealing with a known set of needles to match against the haystack. In my case, I need to compare haystack to haystack, and see matching straws of hay.

Is there a command-line script or command out there, or is this something I need to build? If I build it, should I use the Aho–Corasick algorithm, as suggested in the other question?

So your two questions are 1) Is there already a solution that will do what you want, and 2) Should you use the Aho-Corasick algorithm.

The first answer is that I doubt you'll find a ready-built tool that will meet your needs. The second answer is that, since you don't care about performance and have a limited CS background, you should use whatever algorithm you find simplest to implement.

I will go one step further and propose an architecture.

First, you need to be able to parse the .sql files in a meaningful way: go line by line and return the table name, column name, and value. A StreamReader is probably best for this.
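As a rough illustration of the first parser, here is a minimal Python sketch. It assumes the dump contains single-line statements of the form `INSERT INTO table VALUES (...)` with single-quoted string literals, which may not hold for every dump. The name `iter_sql_strings` is made up for this example. It only recovers the table name and line number; mapping values to column names would need the matching CREATE TABLE statements and a fuller parser.

```python
import re

# Table name of an INSERT statement; quoting around the name is optional.
INSERT_RE = re.compile(r"INSERT\s+INTO\s+[`\"]?(\w+)[`\"]?", re.IGNORECASE)
# A single-quoted SQL string literal, allowing '' and backslash escapes.
STRING_RE = re.compile(r"'((?:[^'\\]|''|\\.)*)'")


def iter_sql_strings(lines):
    """Yield (line_number, table, value) for each string literal on an INSERT line.

    Assumption: each INSERT statement sits on a single line of the dump.
    """
    for line_number, line in enumerate(lines, start=1):
        match = INSERT_RE.search(line)
        if not match:
            continue  # comments, CREATE TABLE, and so on
        table = match.group(1)
        # Only look at the part of the line after the table name.
        for literal in STRING_RE.finditer(line, match.end()):
            value = literal.group(1).replace("''", "'").replace("\\'", "'")
            yield line_number, table, value
```

Usage would be something like `for line_no, table, value in iter_sql_strings(open("table.sql", encoding="utf-8")): ...`, where the file name and encoding are placeholders for whatever your dump actually uses.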

Second, you need a parser for your web pages that goes element by element and returns each text node along with the name of each parent element, all the way up to the html element, plus the name of the file it came from. An XmlTextReader or a similar streaming XML parser, such as a SAX parser, is probably best, as long as it tolerates markup that is not well-formed XML.
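For the second parser, a sketch along the following lines might be a starting point. It swaps the XML readers mentioned above for Python's standard-library `html.parser`, which does not require well-formed markup. The class name `TextNodeCollector`, the helper `text_nodes`, and the `tag#id` path notation are choices made for this example only.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not tracked as parents.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextNodeCollector(HTMLParser):
    """Collect (line_number, element_path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # currently open elements as (tag, label) pairs
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        element_id = dict(attrs).get("id")
        label = f"{tag}#{element_id}" if element_id else tag
        self.stack.append((tag, label))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closing tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text or any(tag in ("script", "style") for tag, _ in self.stack):
            return
        path = "/".join(label for _, label in self.stack)
        # getpos() reports the line on which this run of text starts.
        self.nodes.append((self.getpos()[0], path, text))


def text_nodes(html):
    """Return the text nodes of one page as (line_number, path, text) tuples."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes
```

To cover the whole scrape, you would call `text_nodes` on each file that wget saved and keep the file name next to each result.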

You would need to tie these two parsers together with a mutual search algorithm of some sort. You will have to customize it to suit your needs. Aho-Corasick will apparently get you the best performance if you can pull it off. A naive algorithm is easy to implement, though, and here's how:

Assuming your two parsers loop through each field (on the one hand) and each text node (on the other), pick one of them and have it go through every string in its data source. For each string, call the other parser to search the other data source for all possible matches, and log the ones it finds.
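The naive approach just described could look roughly like this in Python. It is only a sketch: the function name `naive_matches`, the tuple layouts, and the default dump name are assumptions for this example, and it presumes both sources have already been reduced to plain strings with their line numbers.

```python
def naive_matches(db_strings, page_strings, dump_name="table.sql"):
    """Compare every database string against every page string.

    db_strings:   iterable of (sql_line, value) pairs
    page_strings: iterable of (page, page_line, text) triples
    Yields one human-readable report line per match.
    """
    page_strings = list(page_strings)  # scanned once per database string
    for sql_line, value in db_strings:
        if not value:
            continue  # an empty string is a substring of everything
        for page, page_line, text in page_strings:
            if value in text:  # plain substring test, no cleverness
                yield (f'string "{value}" on line {sql_line} in {dump_name} '
                       f'matches string in {page} on line {page_line}')
```

The running time grows with the number of database strings times the number of page strings, which is the price of simplicity. If a weekend-long run is acceptable, that may be good enough.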

This cannot work, at least not reliably. In the best case you would match every piece of data to its counterpart in your HTML files, but you would also get many false positives, for example user names that happen to be ordinary words.

Furthermore, text is often manipulated before it is displayed: sites capitalize titles, truncate text for previews, and so on.
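Some of these problems can be softened, though not eliminated, by normalizing both sides before comparing and by ignoring very short values. The Python sketch below is one hypothetical way to do that. The names `normalize` and `loose_match` and the threshold of 20 characters are arbitrary choices for illustration. It does not handle text that is rewritten more heavily, such as dates reformatted by the site's code.

```python
import re


def normalize(text):
    """Collapse whitespace and ignore case, so 'About  Us' equals 'ABOUT US'."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def loose_match(db_value, page_text, min_length=20):
    """Return 'full', 'truncated', or None for one database value and one page text."""
    needle = normalize(db_value)
    hay = normalize(page_text)
    if len(needle) < min_length:
        return None  # too short to be distinctive (e.g. a user name like 'admin')
    if needle in hay:
        return "full"
    # A preview often shows only the start of the stored text, followed by an ellipsis.
    shown = hay.rstrip(".… ")
    if len(shown) >= min_length and needle.startswith(shown):
        return "truncated"
    return None
```

Raising `min_length` trades missed matches for fewer false positives. Whatever it reports still has to be checked by hand.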

AFAIK there is no such tool, and in my opinion no tool could solve your problem adequately.

Your best option is to get the source code the site uses (or used) and analyze it. If that fails or is not possible, you will have to analyze the database manually: get as much content as possible from the site and try to fit the pieces of the puzzle together.