Extracting data from an ASPX page

Developer https://www.devze.com 2023-01-16 15:33 Source: the web

I've been entrusted with an idiotic and retarded task by my boss.

The task is: given a web application that returns a paginated table, write software that "reads and parses it", since there is no web service that provides the raw data. It's like a "spider" or "crawler" application to steal data that is not meant to be accessed programmatically.

Now the thing: the application is built with the standard ASPX WebForms engine, so there are no ordinary URLs or POSTs, just the dreadful postback engine crowded with JavaScript and inaccessible HTML. The pagination links call the infamous javascript:__doPostBack(param, param), so I don't think it would work even if I tried to simulate clicks on those links.

There are also inputs to filter the results and they are also part of the postback mechanism, so I can't simulate a regular post to get the results.

I was forced to do something like this in the past, but it was on a standard-like website with parameters in the querystring like pagesize and pagenumber so I was able to sort it out.

Does anyone have even a vague idea whether this is doable, or should I tell my boss to quit asking me to do this retarded stuff?

EDIT: maybe I was a bit unclear about what I have to achieve. I have to parse, extract and convert that data to another format - let's say Excel - and not just read it. And this must be automated, without user input. I don't think Selenium would cut it.

EDIT: I just blogged about this situation. Anyone interested can check my post at http://matteomosca.com/archive/2010/09/14/unethical-programming.aspx and comment on it.

Stop disregarding the tools suggested.

No, the parser you could write yourself isn't WatiN or Selenium; both of those will work in this scenario.

P.S. Had you mentioned needing to extract the data from Flash/Flex/Silverlight or similar, this would be a different answer.

By the way, the reason to proceed or not is definitely not technical, but ethical and maybe even legal. See my comment on the question for my opinion on this.

WatiN will help you navigate the site from the perspective of the UI and grab the HTML for you, and you can find information on .NET DOM parsers here.
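Once a tool like WatiN has handed you the page's HTML, the remaining step from the question (turning the table into something Excel can open) could look roughly like the sketch below. It is written in Python with only the standard library rather than a .NET DOM parser, and the sample markup, the `TableExtractor` class and the `table_to_csv` function are made-up names for illustration, not anything from the real application. It assumes a flat table; a WebForms grid whose pager row contains a nested table would need extra handling.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed in bad markup


def table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Hypothetical markup standing in for one page of results.
SAMPLE_HTML = """
<table id="grid">
  <tr><th>Id</th><th>Name</th></tr>
  <tr><td>1</td><td>Alice</td></tr>
  <tr><td>2</td><td>Bob, Jr.</td></tr>
</table>
"""

if __name__ == "__main__":
    print(table_to_csv(SAMPLE_HTML))
```

CSV is used here only because Excel opens it directly; appending each page's rows to the same file would give one sheet for the whole result set.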

I already commented, but I think this is actually an answer.
You need a tool that can click client-side links and wait while the page reloads. Tools like Selenium can do that. Also (from the comments): WatiN and Watir.
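For context on what such a tool triggers when it "clicks" a paging link: in WebForms, __doPostBack(target, argument) normally writes its two arguments into the hidden fields __EVENTTARGET and __EVENTARGUMENT and submits the page's single form, with the other hidden fields (such as __VIEWSTATE) sent back unchanged. The Python sketch below only builds that form body from a page's HTML; it is not a replacement for Selenium or WatiN. The markup, the control name `GridView1`, the argument `Page$2` and the helper names are assumptions for illustration, and a real page may need its visible inputs, cookies and more included.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldCollector(HTMLParser):
    """Gathers name/value pairs of all <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback(html, event_target, event_argument=""):
    """Return the form fields a __doPostBack(target, argument) would submit.

    Only hidden inputs are copied; text boxes and drop-downs used as
    filters would have to be added by the caller.
    """
    collector = HiddenFieldCollector()
    collector.feed(html)
    collector.close()
    payload = dict(collector.fields)
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = event_argument
    return payload


# Hypothetical page fragment; real __VIEWSTATE values are much longer.
SAMPLE_PAGE = """
<form method="post" action="Results.aspx">
  <input type="hidden" name="__EVENTTARGET" value="" />
  <input type="hidden" name="__EVENTARGUMENT" value="" />
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />
  <input type="hidden" name="__EVENTVALIDATION" value="/wEWAg==" />
  <input type="text" name="txtFilter" value="abc" />
</form>
"""

if __name__ == "__main__":
    fields = build_postback(SAMPLE_PAGE, "GridView1", "Page$2")
    # URL-encoded body that would be POSTed to the form's action URL.
    print(urlencode(fields))
```

Each response carries fresh hidden field values, so they have to be re-read from every page before requesting the next one. That bookkeeping is exactly what a browser-driving tool does for you.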

@Insane, the CDC's website has this exact problem, and the data is public (and we taxpayers have paid for it). I'm trying to get the survey and question data from http://wwwn.cdc.gov/qbank/Survey.aspx and it's absurdly difficult. It's not illegal or unethical, just a terrible implementation that appears to be intentionally making it difficult to get the data (it's also inaccessible to search engines).

I think Selenium is going to work for us, thanks for the suggestion.