Generalizing XPaths

Developer https://www.devze.com 2023-02-15 07:39 Source: Internet
I would like to seek your help with a problem I am trying to tackle involving XPaths.

I am trying to generalize multiple XPaths provided by a user into a single XPath that best 'fits' all the provided examples. This is for a web scraping system I am building.

E.g., suppose the user gives the following XPaths (each pointing to a link in the 'Spotlight' section of the Google News page):

Good examples:

/html/body/div[@id='page']/div/div[@id='main-wrapper']/div[@id='main']/div/div/div[3]/div[1]/table[@id='main-am2-pane']/tbody/tr/td[@id='rt-col']/div[3]/div[@id='s_en_us:ir']/div[2]/div[1]/div[2]/a[@id='MAE4AUgAUABgAmoCdXM']/span

/html/body/div[@id='page']/div/div[@id='main-wrapper']/div[@id='main']/div/div/div[3]/div[1]/table[@id='main-am2-pane']/tbody/tr/td[@id='rt-col']/div[3]/div[@id='s_en_us:ir']/div[2]/div[6]/div[2]/a[@id='MAE4AUgFUABgAmoCdXM']/span

/html/body/div[@id='page']/div/div[@id='main-wrapper']/div[@id='main']/div/div/div[3]/div[1]/table[@id='main-am2-pane']/tbody/tr/td[@id='rt-col']/div[3]/div[@id='s_en_us:ir']/div[2]/div[12]/div[2]/a[@id='MAE4AUgLUABgAmoCdXM']/span

Bad example (pointing to a link in another section):

/html/body/div[@id='page']/div/div[@id='main-wrapper']/div[@id='main']/div/div/div[3]/div[1]/table[@id='main-am2-pane']/tbody/tr/td[@id='lt-col']/div[2]/div[@id='replaceable-section-blended']/div[1]/div[4]/div/h2/a[@id='MAA4AEgFUABgAWoCdXM']/span

It should be able to generalize and produce an XPath expression that selects all the links in the 'Spotlight' section. (It should also be able to throw out the incorrect XPath given.)

Generalized XPath

/html/body/div[@id='page']/div/div[@id='main-wrapper']/div[@id='main']/div/div/div[3]/div[1]/table[@id='main-am2-pane']/tbody/tr/td[@id='rt-col']/div[3]/div[@id='s_en_us:ir']/div[2]/div/div[2]/a[@id='MAE4AUgLUABgAmoCdXM']/span

Could you kindly advise me on how to go about it? I was thinking of using the Longest Common Substring strategy, but that would over-generalize if a bad example is given (like the fourth example above). Are there any libraries or open source projects that address this?

I saw some similar posts (finding common ancestor from a group of xpath? and Howto find the first common XPath ancestor in Javascript?). However, they are about finding the longest common ancestor.

I am writing it in JavaScript, as a Firefox extension.

Thanks for your time and any help would be greatly appreciated!


This is essentially an automaton minimization problem. You have (XPath1|XPath2|XPath3) and you would like to get a minimal automaton XPath4 that matches the same nodes. There is also the question of whether the minimization may lose information (like JPEG) or must be exact. For exact minimization you could google "Algorithms for Minimization of Finite-State Automata".

OK, the simplest way is to find a common subsequence: convert each XPath step to a character, then run a character-based common-subsequence finder over the list of strings. So we have, for example:

adcba, acba, adba --common subsequence--> aba --general regexp--> a.*b.*a --convert back to XPath--> ...
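The pipeline above could be sketched in JavaScript roughly as follows. This is only an illustration under some assumptions: the function names (`splitSteps`, `lcs`, `commonSteps`, `toLooseXPath`) are made up for this example, each XPath step is used directly as a token instead of being mapped to a character, predicates are assumed not to contain `/`, and `//` (descendant) plays the role of `.*`. Folding the pairwise LCS over the list is a heuristic and is not guaranteed to find the longest subsequence common to all inputs.

```javascript
// Split an absolute XPath into its steps, e.g. "/a/b[1]" -> ["a", "b[1]"].
// Assumes no "/" appears inside predicates.
function splitSteps(xpath) {
  return xpath.split('/').map(s => s.trim()).filter(s => s.length > 0);
}

// Dynamic-programming longest common subsequence of two token arrays.
function lcs(a, b) {
  // dp[i][j] = LCS length of a[i..] and b[j..]
  const dp = Array.from({ length: a.length + 1 },
                        () => new Array(b.length + 1).fill(0));
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      dp[i][j] = a[i] === b[j]
        ? dp[i + 1][j + 1] + 1
        : Math.max(dp[i + 1][j], dp[i][j + 1]);
    }
  }
  // Walk the table forward to recover one LCS.
  const out = [];
  let i = 0, j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) { out.push(a[i]); i++; j++; }
    else if (dp[i + 1][j] >= dp[i][j + 1]) i++;
    else j++;
  }
  return out;
}

// Heuristic: fold the pairwise LCS over all sequences (needs at least one).
function commonSteps(sequences) {
  return sequences.reduce((acc, seq) => lcs(acc, seq));
}

// Join the surviving steps with "//", the XPath analogue of ".*".
function toLooseXPath(steps) {
  return '//' + steps.join('//');
}
```

With the character example, `commonSteps(['adcba', 'acba', 'adba'].map(s => s.split('')))` gives `['a', 'b', 'a']`. For XPaths you would pass `xpaths.map(splitSteps)` and feed the result to `toLooseXPath`.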

You can also try putting something less general in place of .*
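One possible way to be less general than `.*`, sketched here as an assumption and not as an established algorithm: bucket the example paths by their sequence of tag names, keep only the largest bucket (so an outlier such as the bad example in the question falls away), and then compare the remaining paths position by position, dropping a step's predicate only where the examples disagree. The names `stepsOf`, `tagOf` and `generalize` are hypothetical, and the tag-name bucketing is a simple heuristic that will not separate paths differing only in attribute values.

```javascript
// Split an absolute XPath into steps; assumes no "/" inside predicates.
function stepsOf(xpath) {
  return xpath.split('/').map(s => s.trim()).filter(s => s.length > 0);
}

// Tag name of a step: "div[3]" -> "div", "a[@id='x']" -> "a".
function tagOf(step) {
  const k = step.indexOf('[');
  return k === -1 ? step : step.slice(0, k);
}

function generalize(xpaths) {
  // 1. Bucket the paths by their tag-name signature.
  const groups = new Map();
  for (const xp of xpaths) {
    const steps = stepsOf(xp);
    const key = steps.map(tagOf).join('/');
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(steps);
  }

  // 2. Keep the largest bucket; smaller ones are treated as bad examples.
  //    On a tie, the bucket seen first wins.
  let best = [];
  for (const g of groups.values()) {
    if (g.length > best.length) best = g;
  }
  if (best.length === 0) return null; // no input paths

  // 3. Position by position: keep a step if every path agrees on it,
  //    otherwise fall back to the bare tag name.
  const merged = best[0].map((step, i) =>
    best.every(steps => steps[i] === step) ? step : tagOf(step));
  return '/' + merged.join('/');
}
```

For instance, given three paths of the shape `/html/body/div[@id='s']/div[N]/a[@id='...']/span` with different `N` and `id` values, plus one path that goes through an extra `h2`, this returns `/html/body/div[@id='s']/div/a/span`.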
