I have been playing with the idea of building a simple screen-scraper with jQuery, and I am wondering if the following is possible.
I have a simple HTML page and am attempting (if this is possible) to grab the contents of all of the list items from another page, like so:
Main Page:
<!-- jQuery -->
<script type='text/javascript'>
$(document).ready(function(){
    $.getJSON("[URL to other page]", function(data){
        // Iterate through the <li> data returned from the URL
        $.each(data.items, function(i, item){
            $("<li/>").text(item).appendTo("#data");
        });
    });
});
</script>
<!-- HTML -->
<html>
<body>
<div id='data'></div>
</body>
</html>
Other Page:
<!-- HTML -->
<body>
<p><b>Items to Scrape</b></p>
<ul>
    <li>I want to scrape what is here</li>
    <li>and what is here</li>
    <li>and here as well</li>
    <li>and append it in the main page</li>
</ul>
</body>
So, is it possible using jQuery to pull all of the list item contents from an external page and append them inside of a div?
Use $.ajax to load the other page into a variable, then create a temporary element and use .html() to set its contents to the returned value. Loop through the element's children of nodeType 1 and keep their first children's nodeValues. If the external page is not on your web server, you will need to proxy the file with your own web server (see the sketch after the code below).
Something like this:
$.ajax({
    url: "/thePageToScrape.html",
    dataType: 'text',
    success: function(data) {
        // Parse the fetched HTML in a detached element, then grab the <li> nodes
        var elements = $("<div>").html(data)[0].getElementsByTagName("ul")[0].getElementsByTagName("li");
        for (var i = 0; i < elements.length; i++) {
            var theText = elements[i].firstChild.nodeValue;
            // Do something here
        }
    }
});
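To expand on the proxy note above, here is a minimal sketch of a same-origin proxy, assuming a Node.js server with Express (the route name and target URL are hypothetical):
var express = require('express');
var http = require('http');

var app = express();

// Relay the remote page so the browser can fetch it same-origin
app.get('/thePageToScrape.html', function(req, res) {
    http.get('http://example.com/thePageToScrape.html', function(remote) {
        res.set('Content-Type', 'text/html');
        remote.pipe(res);
    });
});

app.listen(3000);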
Simple scraping with jQuery...
// Get HTML from page
$.get('http://example.com/', function(html) {
    // Loop through elements you want to scrape content from
    $(html).find("ul").find("li").each(function() {
        var text = $(this).text();
        // Do something with content
    });
});
$.get("/path/to/other/page",function(data){
$('#data').append($('li',data));
}
If this is for the same domain then there is no problem - the jQuery solutions above are good.
But otherwise you can't access content from an arbitrary website, because that is considered a security risk. See same origin policy.
There are of course server-side workarounds, such as a web proxy or CORS headers. Or, if you're lucky, the site will support JSONP.
But if you want a client-side solution that works with an arbitrary website and web browser, you are out of luck. There is a proposal to relax this policy, but it won't affect current web browsers.
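For the JSONP case, here is a minimal sketch of the client side, assuming the remote site exposes a JSONP endpoint returning an items array (the URL and the items field are hypothetical):
$.ajax({
    url: "http://other-site.example/items",  // hypothetical JSONP endpoint
    dataType: "jsonp",                       // jQuery adds the callback parameter for you
    success: function(data) {
        $.each(data.items, function(i, item) {
            $("<li/>").text(item).appendTo("#data");
        });
    }
});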
You may want to consider pjscrape:
http://nrabinowitz.github.io/pjscrape/
It allows you to do this from the command line, using JavaScript and jQuery. It does this by using PhantomJS, a headless WebKit browser (it has no window and exists only for your script's usage, so you can load complex websites that use AJAX and it will work just as if it were a real browser).
The examples are self-explanatory and I believe this works on all platforms (including Windows).
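As a rough sketch, a pjscrape config might look like the following (the target URL is hypothetical); per the project's docs, you save it to a file and run it with phantomjs pjscrape.js my_config.js:
pjs.addSuite({
    url: 'http://example.com/thePageToScrape.html',  // hypothetical target page
    scraper: function() {
        // Runs inside the loaded page; pjscrape injects jQuery for you
        return $('ul li').map(function() {
            return $(this).text();
        }).toArray();
    }
});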
Use YQL or Yahoo Pipes to make the cross-domain request for the raw page HTML content. The Yahoo Pipe or YQL query will return this as JSON that can be processed by jQuery to extract and display the required data.
On the downside: YQL and Yahoo Pipes OBEY the robots.txt file for the target domain, and if the page is too long the Yahoo Pipes regex commands will not run.
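As a sketch of the YQL pattern (the select-from-html query follows Yahoo's public YQL API as it existed; the target URL is hypothetical):
$.getJSON("https://query.yahooapis.com/v1/public/yql?callback=?", {
    q: 'select * from html where url="http://example.com/" and xpath="//ul/li"',
    format: "json"
}, function(data) {
    // The scraped <li> nodes come back under data.query.results
    console.log(data.query.results);
});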
I am sure you will hit the CORS issue with these requests in many cases, so you will need a way to work around CORS. One option is a proxy service such as anyorigin.com:
var name = "kk";
var url = "http://anyorigin.com/go?url=" + encodeURIComponent("https://www.yoursite.xyz/" + name) + "&callback=?";
$.getJSON(url, function(response) {
    console.log(response);
});