How can I grab PDF links from a website with a Python script?

Quite often I have to download PDFs from websites, but sometimes they are not all on one page. The links are divided across paginated pages and I have to click through every page to collect them.

I am learning Python and I want to write a script where I can put in a website URL and it extracts the PDF links from that website.

I am new to Python, so can anyone please give me directions on how I can do it?


Pretty simple with urllib2, urlparse and lxml. I've commented things more verbosely since you're new to Python:

# modules we're using (you'll need to download lxml)
import lxml.html, urllib2, urlparse

# the url of the page you want to scrape
base_url = 'http://www.renderx.com/demos/examples.html'

# fetch the page
res = urllib2.urlopen(base_url)

# parse the response into an xml tree
tree = lxml.html.fromstring(res.read())

# construct a namespace dictionary to pass to the xpath() call
# this lets us use regular expressions in the xpath
ns = {'re': 'http://exslt.org/regular-expressions'}

# iterate over all <a> tags whose href ends in ".pdf" (case-insensitive)
for node in tree.xpath('//a[re:test(@href, "\.pdf$", "i")]', namespaces=ns):

    # print the href, joining it to the base_url
    print urlparse.urljoin(base_url, node.attrib['href'])

Result:

http://www.renderx.com/files/demos/examples/Fund.pdf
http://www.renderx.com/files/demos/examples/FundII.pdf
http://www.renderx.com/files/demos/examples/FundIII.pdf
...
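Note that urllib2 and urlparse are Python 2 modules (and print is a statement there). If you are on Python 3, a roughly equivalent sketch, assuming the same page and lxml installed, would be:

# Python 3: urllib2 and urlparse moved into urllib.request and urllib.parse
import lxml.html
from urllib.request import urlopen
from urllib.parse import urljoin

base_url = 'http://www.renderx.com/demos/examples.html'

# fetch and parse the page
tree = lxml.html.fromstring(urlopen(base_url).read())

# same EXSLT regular-expressions namespace as above
ns = {'re': 'http://exslt.org/regular-expressions'}

# print every href ending in ".pdf" (case-insensitive), resolved against base_url
for node in tree.xpath(r'//a[re:test(@href, "\.pdf$", "i")]', namespaces=ns):
    print(urljoin(base_url, node.attrib['href']))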


If there are a lot of pages with links, you can try the excellent framework Scrapy (http://scrapy.org/). It is pretty easy to understand how to use it, and it can download the PDF files you need.
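As a rough sketch only (not from the original answer), a Scrapy spider for this job might look like the following; the start URL and the CSS selector for the "next page" link are hypothetical and need to be adapted to the actual site:

import scrapy

class PdfLinkSpider(scrapy.Spider):
    name = 'pdf_links'
    # hypothetical starting page; replace with the paginated site you want to crawl
    start_urls = ['http://www.example.com/documents?page=1']

    def parse(self, response):
        # yield every link whose href ends in ".pdf", resolved to an absolute URL
        for href in response.css('a::attr(href)').getall():
            if href.lower().endswith('.pdf'):
                yield {'pdf_url': response.urljoin(href)}

        # follow the pagination link, if the page has one (selector is a guess)
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Save it as pdf_spider.py and run it with scrapy runspider pdf_spider.py -o pdfs.json to collect the links into a JSON file.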


(Answering from my phone, so it may not be very readable.)

If you are going to grab things from a website that is all static pages, you can easily grab the HTML with requests:

import requests

# .text gives you the HTML of the page as a string
page_content = requests.get(url).text

But if you grab things from something like a community or social website, there will be anti-scraping measures in place. (How to get past these annoying obstacles becomes the real problem.)

  • First way: make your requests look more like a browser (a human). Add the headers (you can use the dev tools in Chrome, or Fiddler, to copy them) and make the right POST form; copy exactly what the browser posts. Get the cookies and add them to the requests. (A requests sketch is given after this list.)

  • Second way: use Selenium and a browser driver. Selenium drives a real browser (in my case, chromedriver). Remember to add chromedriver to your PATH, or load the driver executable in code, e.g. driver = webdriver.Chrome(path) (I am not sure that setup line is exact).

    driver.get(url) really opens the URL in a browser, so it reduces the difficulty of grabbing things.

    Get the web page with page = driver.page_source.

    Some websites redirect through several pages, which can cause errors. Make your script wait until a certain element shows up:

    In Python syntax (the full imports are in the sketch below):
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'youKnowThereIsAnElementID')))

    Or use an implicit wait, for however long you like:

driver.implicitly_wait(5)  # wait up to 5 seconds for elements to appear

And you can control the website through WebDriver (clicking, filling forms, and so on). I am not going to describe that here; you can look up the module. A consolidated sketch of both ways follows below.
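To make the two ways above concrete, here is a minimal sketch of the first way with requests; the header values, URLs and form field names are only illustrative placeholders, not taken from any real site:

import requests

# copy the headers a real browser sends (values here are illustrative)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Referer': 'http://www.example.com/login',
}

# a Session keeps the cookies between requests automatically
session = requests.Session()

# post the same form the browser posts (field names are made up for this sketch)
session.post('http://www.example.com/login', headers=headers,
             data={'username': 'me', 'password': 'secret'})

# later requests reuse the login cookies kept in the session
page_content = session.get('http://www.example.com/documents', headers=headers).text

And a minimal sketch of the second way with Selenium, using the Python spelling of the explicit and implicit waits mentioned above (the URL and element id are placeholders):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()   # chromedriver must be on your PATH
driver.implicitly_wait(5)     # implicit wait: up to 5 seconds for elements

driver.get('http://www.example.com/documents')

# explicit wait: block until a specific element is present on the page
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'youKnowThereIsAnElementID')))

# the rendered HTML, after any JavaScript has run
page = driver.page_source

driver.quit()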
