Page posting issue when screen scraping


I am working on screen scraping and have done it successfully on three websites, but I have an issue with the last one.

Here is my URL. When I submit it with my parameters, the site posts them to another page and shows the result there; the result displays fine on that second page.

Here is my test.

However, when I request the page from my application, I don't have an option to post, so it only fetches the HTML of the requested page, which is the test link mentioned above that carries the parameters in the URL to get the result.

How can I handle this situation? Please give me a hint.

Thanks

Here is my C# code; I am using Html Agility Pack:

// HtmlWeb.Load issues a GET, so it only fetches the HTML of the requested
// URL; there is no way to post form data with it.
HtmlWeb hw = new HtmlWeb();
string url = "http://mysampleURL";
HtmlDocument doc = hw.Load(url);


Use the WebClient class to POST the form from the first page with the expected input values. The input values can be found in the source of the first page, but it's also possible to capture them using Fiddler, which is imho a great tool for these scenarios.

Example:

using System.Collections.Specialized;
using System.Net;
using System.Text;

// Collect the form fields exactly as the first page submits them.
NameValueCollection values = new NameValueCollection();
values.Add("action", "hotelPackageWizard@searchHotelOnly");
values.Add("packageType", "HOTEL_ONLY");
// etc..

WebClient webclient = new WebClient();
webclient.Headers.Add("Content-Type", "application/x-www-form-urlencoded");

// UploadValues sends the POST and returns the response body as bytes.
byte[] responseArray = webclient.UploadValues("http://www.expedia.com/Hotels?rfrr=-905&", "POST", values);
string response = Encoding.ASCII.GetString(responseArray);
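
Since you are already using Html Agility Pack, you can then feed the returned HTML straight into it; a minimal sketch building on the response string above:

// Parse the POST response with Html Agility Pack; LoadHtml takes an HTML
// string instead of fetching a URL like HtmlWeb.Load does.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(response);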


If the resource requires a POST, then you MUST submit a POST.

This is a fairly simple task. Here is an example from Rick Strahl's blog. The code is a bit rustic but works and will get you heading in the right direction:

string lcUrl = "http://www.west-wind.com/testpage.wwd";
HttpWebRequest loHttp = (HttpWebRequest)WebRequest.Create(lcUrl);

// *** Send any POST data
string lcPostData =
    "Name=" + HttpUtility.UrlEncode("Rick Strahl") +
    "&Company=" + HttpUtility.UrlEncode("West Wind ");

loHttp.Method = "POST";
byte[] lbPostBuffer = System.Text.Encoding.GetEncoding(1252).GetBytes(lcPostData);
loHttp.ContentLength = lbPostBuffer.Length;

Stream loPostData = loHttp.GetRequestStream();
loPostData.Write(lbPostBuffer, 0, lbPostBuffer.Length);
loPostData.Close();

HttpWebResponse loWebResponse = (HttpWebResponse)loHttp.GetResponse();

Encoding enc = System.Text.Encoding.GetEncoding(1252);

StreamReader loResponseStream =
    new StreamReader(loWebResponse.GetResponseStream(), enc);

string lcHtml = loResponseStream.ReadToEnd();

loWebResponse.Close();
loResponseStream.Close();
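
If you are on .NET 4.5 or later, the same POST can be written more compactly with HttpClient; a sketch of the equivalent request (the URL and field values are just the example's placeholders):

using System.Collections.Generic;
using System.Net.Http;

// FormUrlEncodedContent handles the application/x-www-form-urlencoded
// encoding, and PostAsync submits the POST (run inside an async method).
var client = new HttpClient();
var form = new FormUrlEncodedContent(new Dictionary<string, string>
{
    { "Name", "Rick Strahl" },
    { "Company", "West Wind" }
});
HttpResponseMessage response = await client.PostAsync("http://www.west-wind.com/testpage.wwd", form);
string html = await response.Content.ReadAsStringAsync();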


For screen-scraping tasks that involve posting forms such as log-ins, maintaining cookies, and taking care of XSRF tokens, one solution is to use cURL. But it is not easy.
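
In .NET, the cookie part of this can at least be covered by sharing a CookieContainer across requests; a minimal sketch, with placeholder URLs:

using System.Net;

// A CookieContainer shared across requests keeps the session cookie from
// the login POST, so later requests are sent as the logged-in user.
CookieContainer cookies = new CookieContainer();

HttpWebRequest login = (HttpWebRequest)WebRequest.Create("http://example.com/login");
login.Method = "POST";
login.CookieContainer = cookies;
// ... write the form body as in the examples above, then:
login.GetResponse().Close();

HttpWebRequest page = (HttpWebRequest)WebRequest.Create("http://example.com/data");
page.CookieContainer = cookies; // reuses the login session cookie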

I then explored Selenium and I love it. There are two steps: 1) install the Selenium IDE (works only in Firefox), and 2) install the Selenium RC Server.

After starting Selenium IDE, go to the site you are trying to automate and start recording the events you perform on it. Think of it as recording a macro in the browser. Afterwards, you get the code output in the language you want, as in the C# sketch below.
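
For C#, the exported script drives the RC server through the DefaultSelenium client, roughly like this (the site, locators, and values are hypothetical):

using Selenium; // Selenium RC .NET client

// Connects to the Selenium RC server on localhost:4444, replays the
// recorded actions in Firefox, and grabs the rendered HTML.
ISelenium selenium = new DefaultSelenium("localhost", 4444, "*firefox", "http://example.com/");
selenium.Start();
selenium.Open("/search");               // hypothetical page
selenium.Type("q", "hotels");           // hypothetical field locator
selenium.Click("btnSearch");            // hypothetical button locator
selenium.WaitForPageToLoad("30000");
string html = selenium.GetHtmlSource(); // the fully rendered page source
selenium.Stop();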

Just so you know, Browsermob uses Selenium for load testing and for automating tasks in the browser.

I've uploaded a ppt that I made a while back. It should save you a good amount of time: http://www.4shared.com/get/tlwT3qb_/SeleniumInstructions.html

In the above link, select the regular download option.

I spent a good amount of time figuring this out, so I thought it might save somebody else's time.

