What algorithm helps to identify repeating sequences in a DOM?

Developer https://www.devze.com 2023-01-14 20:43 Source: web
Is there a good algorithm I might apply to a DOM to lead me to groups of probably related nodes? The ultimate goal is to get something useful to assist extracting things like TOC's and "blog rolls" from websites. If something like this already exists, I'd be happy if someone let me know that as well.

I realize it's not something I can hope to do deterministically. The reason I suspect there might be a solution out there already comes from recently stepping through the 'diff algorithm' which deals with common sequences. I'm not sure if it's a leap or not to go from 'common' to 'repeating'...


"Related" is a very general term, in that it is always going to depend largely upon what the actual data is and which relationships you're trying to infer. I don't quite understand why you're talking about "repeating sequences" as a metric for "relatedness". Strictly speaking, there is not really any "sequence" in a DOM. It's a tree, so you can only talk about ordering (and therefore sequencing) with respect to parent/child relationships or sibling relationships. I'm not sure that you mean any of these things.

That said, there are some things you can say about DOMs. They are trees, so you're essentially looking to identify sub-trees with similar shape, I assume?
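To make "sub-trees with similar shape" concrete, here is a minimal sketch of one possible interpretation, not an established algorithm: give each node a structural signature built from tag names, then report runs of siblings under one parent that share a signature. The names `shape` and `repeated_groups`, the `depth` and `min_repeats` parameters, and the sample markup are all made up for illustration, and the input is assumed to be well-formed XML (real-world HTML would need a more tolerant parser).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def shape(node, depth=2):
    """Structural signature: tag name plus child signatures, `depth` levels down."""
    if depth == 0:
        return node.tag
    return node.tag + "(" + ",".join(shape(c, depth - 1) for c in node) + ")"

def repeated_groups(root, min_repeats=3):
    """Return lists of sibling nodes that share the same shape under one parent."""
    groups = []
    for parent in root.iter():
        by_shape = defaultdict(list)
        for child in parent:
            by_shape[shape(child)].append(child)
        groups.extend(g for g in by_shape.values() if len(g) >= min_repeats)
    return groups

# Hypothetical sample document: one content block and one list of links.
SAMPLE = """
<body>
  <div><h1>Title</h1><p>Intro</p></div>
  <ul>
    <li><a href="#a">One</a></li>
    <li><a href="#b">Two</a></li>
    <li><a href="#c">Three</a></li>
  </ul>
</body>
"""

root = ET.fromstring(SAMPLE)
groups = repeated_groups(root)
for group in groups:
    print(shape(group[0]), "x", len(group))
```

On this sample the only group found is the three `li` items. Comparing tag names alone is deliberately crude; attributes or text length could be folded into the signature if tags alone match too much.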

One approach you might take is to take two such DOMs and attempt to relate similar nodes (e.g. ones with known attributes or particular nodes) by adding edges (making the whole thing a connected graph), and then calculate a clique.

Other than that, I'm not sure there are many more specific methods I could suggest without a slightly more complete problem description.


You just have to pick an example of a "definitely interesting" node and invent a good similarity relation; then all similar nodes will be interesting. Similarity might be based on factors like: height of path to root, attribute values, tag names, positions among siblings, all of the above for several levels of parent nodes, etc. I used this approach and it worked surprisingly well.
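A rough sketch of what such a similarity relation might look like, under stated assumptions: the features used here (depth below the root, plus tag name and `class` attribute for the node and two levels of ancestors), the Jaccard score, and the `0.8` threshold are illustrative choices, not the ones actually used in the approach described above. The helper names and sample markup are hypothetical, and the input is assumed to be well-formed XML.

```python
import xml.etree.ElementTree as ET

def depth_of(node, parents):
    """Number of ancestors between `node` and the root."""
    d = 0
    while node in parents:
        node = parents[node]
        d += 1
    return d

def features(node, parents, levels=2):
    """Feature set: depth, plus tag and class for the node and `levels` ancestors."""
    feats = {("depth", depth_of(node, parents))}
    cur = node
    for level in range(levels + 1):
        feats.add(("tag", level, cur.tag))
        feats.add(("class", level, cur.get("class", "")))
        if cur not in parents:
            break
        cur = parents[cur]
    return feats

def similarity(a, b, parents):
    """Jaccard overlap of the two feature sets, between 0.0 and 1.0."""
    fa, fb = features(a, parents), features(b, parents)
    return len(fa & fb) / len(fa | fb)

def similar_nodes(root, example, threshold=0.8):
    """All nodes other than `example` whose similarity to it meets the threshold."""
    parents = {child: parent for parent in root.iter() for child in parent}
    return [n for n in root.iter()
            if n is not example and similarity(example, n, parents) >= threshold]

# Hypothetical sample page with a post link and a blog roll.
SAMPLE = """
<body>
  <div class="post"><a href="/p/1">Post</a></div>
  <div class="sidebar">
    <ul class="blogroll">
      <li><a href="http://a.example">A</a></li>
      <li><a href="http://b.example">B</a></li>
      <li><a href="http://c.example">C</a></li>
    </ul>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE)
example = root.find(".//ul/li/a")  # the "definitely interesting" node
matches = similar_nodes(root, example)
print([n.get("href") for n in matches])
```

Here the first blog-roll link is the example, and the other two blog-roll links are the only nodes that clear the threshold; the link inside the post shares a tag but not its ancestry, so it scores low. Position among siblings is left out of this sketch because it would penalise exactly the repeated items being looked for.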
