
How to grab dynamic content on website and save it?

Developer https://www.devze.com 2022-12-26 18:35 Source: web

For example I need to grab from http://gmail.com/ the number of free storage:

Over <span id=quota>2757.272164</span> megabytes (and counting) of free storage.

And then store those numbers in a MySql database. The number, as you can see, is dynamically changing.

Is there a way I can set up a server-side script that will grab that number every time it changes and save it to the database?

Thanks.


Since Gmail doesn't provide any API to get this information, it sounds like you want to do some web scraping.

Web scraping (also called Web harvesting or Web data extraction) is a computer software technique of extracting information from websites

There are numerous ways of doing this, as mentioned in the wikipedia article linked before:

Human copy-and-paste: Sometimes even the best Web-scraping technology cannot replace a human's manual examination and copy-and-paste, and sometimes this may be the only workable solution when the websites being scraped explicitly set up barriers to prevent machine automation.

Text grepping and regular expression matching: A simple yet powerful approach to extract information from Web pages can be based on the UNIX grep command or regular expression matching facilities of programming languages (for instance Perl or Python).

HTTP programming: Static and dynamic Web pages can be retrieved by posting HTTP requests to the remote Web server using socket programming.

DOM parsing: By embedding a full-fledged Web browser, such as the Internet Explorer or the Mozilla Web browser control, programs can retrieve the dynamic contents generated by client side scripts. These Web browser controls also parse Web pages into a DOM tree, based on which programs can retrieve parts of the Web pages.

HTML parsers: Some semi-structured data query languages, such as the XML query language (XQL) and the hyper-text query language (HTQL), can be used to parse HTML pages and to retrieve and transform Web content.

Web-scraping software: There are many Web-scraping software packages available that can be used to customize Web-scraping solutions. These packages may provide a Web recording interface that removes the necessity of manually writing Web-scraping code, scripting functions that can be used to extract and transform Web content, and database interfaces that can store the scraped data in local databases.

Semantic annotation recognizing: The Web pages may embrace metadata or semantic markups/annotations which can be made use of to locate specific data snippets. If the annotations are embedded in the pages, as Microformat does, this technique can be viewed as a special case of DOM parsing. In another case, the annotations, organized into a semantic layer, are stored and managed separately from the Web pages, so the Web scrapers can retrieve the data schema and instructions from this layer before scraping the pages.
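To make the text-grepping option from the list above concrete, here is a minimal Python sketch that pulls the number out of the quota span with a regular expression. The HTML snippet is hard-coded for illustration; fetching the live page is a separate step.

```python
import re

# A sample of the markup described in the question; in practice this
# would be the HTML fetched from the live page.
html = 'Over <span id=quota>2757.272164</span> megabytes (and counting) of free storage.'

# Text grepping / regular-expression matching: extract the number from
# the span with id=quota. The quotes around the id are optional in the
# original markup, so the pattern allows both forms.
match = re.search(r'<span id="?quota"?>([\d.]+)</span>', html)
if match:
    quota = float(match.group(1))
    print(quota)  # 2757.272164
```

Note that regular expressions are brittle against markup changes; for anything beyond a single well-known span, one of the DOM- or HTML-parser approaches above is more robust.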

And before I continue, please keep in mind the legal implications of all this. I don't know if it's compliant with Gmail's terms of service, and I would recommend checking them before moving forward. You might also end up getting your IP blacklisted or running into similar issues.

All that being said, I'd say that in your case you need some kind of spider and DOM parser to log into Gmail and find the data you want. The choice of tool will depend on your technology stack.

As a Ruby dev, I like using Mechanize and Nokogiri. Using PHP, you could take a look at solutions like Sphider.


Initially I thought this was impossible, assuming the number was generated by JavaScript.

But if you switch off JavaScript, the number is still there in the span tag; a JavaScript function probably just increments it at a regular interval.

So you can use curl, fopen, etc. to read the contents from the URL, parse them for this value, and store it in the database. Then set up a cron job to do this on a regular basis.
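The fetch/parse/store steps above can be sketched as follows. This is a minimal illustration, not a production script: it uses Python's standard library instead of curl, and sqlite3 stands in for the MySQL connection (in production you would swap in a MySQL driver such as mysqlclient and real credentials). The table and column names are made up for the example.

```python
import re
import sqlite3
import urllib.request

def fetch_page(url):
    """Fetch the raw HTML of a page (the curl/fopen step)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode('utf-8', errors='replace')

def parse_quota(html):
    """Extract the quota figure from the span tag, or None if absent."""
    match = re.search(r'<span id="?quota"?>([\d.]+)</span>', html)
    return float(match.group(1)) if match else None

def store_quota(conn, value):
    """Insert the reading with a timestamp; the table is created on demand."""
    conn.execute('CREATE TABLE IF NOT EXISTS quota_history '
                 '(taken_at TEXT DEFAULT CURRENT_TIMESTAMP, megabytes REAL)')
    conn.execute('INSERT INTO quota_history (megabytes) VALUES (?)', (value,))
    conn.commit()

if __name__ == '__main__':
    # Stand-in for the MySQL connection; the HTML is hard-coded here
    # so the sketch runs without hitting the network.
    conn = sqlite3.connect(':memory:')
    value = parse_quota('Over <span id=quota>2757.272164</span> megabytes')
    store_quota(conn, value)
```

A cron entry running such a script once a minute would look like `* * * * * /usr/bin/python3 /path/to/grab_quota.py` (path hypothetical).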

There are many references on how to do this, including on SO. If you get stuck, just open another question.

Warning: Google has ways of finding out if its apps are being scraped, and it will block your IP for a certain period of time. Read Google's small print. It's happened to me.


One way I can see you doing this (which may not be the most efficient) is to use PHP and YQL (from Yahoo!). With YQL, you can specify the web page (www.gmail.com) and the XPath expression that gets you the value inside the span tag. It's essentially web scraping, but YQL provides a nice way to do it in maybe 4-5 lines of code.

You can wrap this whole thing inside a function that gets called every x seconds, or whatever time period you are looking for.
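A generic polling wrapper like the one described could look like this in Python. The `grab` callable is a hypothetical stand-in for whatever function performs the YQL/scrape-and-store step; the loop structure is the point here.

```python
import time

def poll(grab, interval_seconds, iterations):
    """Call `grab` every `interval_seconds` for `iterations` rounds.

    `grab` is a stand-in for the function that fetches and stores the
    quota value; its return values are collected for inspection.
    """
    results = []
    for _ in range(iterations):
        results.append(grab())
        time.sleep(interval_seconds)
    return results
```

For intervals of a minute or more, a cron job is usually a better fit than a long-running loop like this, since cron survives process crashes and reboots.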


Leaving aside the legality issues in this particular case, I would suggest the following:

When you find yourself attacking something seemingly impossible, stop and think about where the impossibility comes from, and whether you have chosen the right approach.

Do you really think that anyone in their right mind would issue a new HTTP connection, or even worse hold an open comet connection, just to check whether the shared storage total has grown? For an anonymous user? Just look for the function that computes the value from some initial value and the current time.
