I need to fetch data from a website using PHP and save it in a MySQL database. I also want to fetch the images and save them on my server so that I can display them on my site. I heard that an API can be used for this, but I would like to know whether or not I can do this using cURL. I want to fetch a huge amount of data on a daily basis, so will using cURL consume a large amount of server-side resources? What other methods exist to fetch data?
I think this is more of a Stack Overflow question, but I will try to answer.
From what you describe, it seems like you want a generic web crawler. There are a few solutions, and writing your own is relatively easy.
The problem is that PHP and cURL are slow, and you will most probably run into memory issues and script execution time limits down the line. PHP is just not designed to run in an infinite loop.
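If you do go the PHP route, a long-running crawl should run from the CLI, not through a web server, and you can relax the limits mentioned above. A minimal sketch (the `256M` limit is an assumed value you would tune):

```php
<?php
// Sketch: settings for a long-running CLI crawl script.
// Run this from the command line, not through a web server.
set_time_limit(0);               // disable the max-execution-time limit
ini_set('memory_limit', '256M'); // raise the memory ceiling (assumed value, adjust)
gc_enable();                     // let PHP reclaim memory between batches
```

Even with these settings, a better pattern is to process pages in batches and restart the script periodically (e.g. via cron) rather than relying on one truly infinite loop.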
How I would do it with a custom crawler:
Respect robots.txt! Respect connection limits!
PHP: cURL the URL, load it into the DOM (the lazy way) or parse out all the tags (the `a` tags give you the next links), then download whatever the `img` tags point to. Add the `a` tag hrefs to a hashmap and a queue: the hashmap is there so you never recrawl pages you have already visited, and the queue holds the next jobs. Rinse, repeat, and you are in business.
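The loop above can be sketched roughly like this; the function names, seed URL, and limits are illustrative assumptions, not a finished crawler (no robots.txt handling, URL normalization, or rate limiting):

```php
<?php
// Sketch of the crawl loop: cURL a page, parse the DOM for <a>/<img> tags,
// track visited URLs in a hashmap, queue newly found links.

function fetchPage(string $url): ?string {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 15,
        CURLOPT_USERAGENT      => 'MyCrawler/0.1', // identify your bot
    ]);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html === false ? null : $html;
}

// Pull link hrefs and image srcs out of an HTML string.
function extractTags(string $html): array {
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // @ silences warnings from messy real-world markup
    $links = $images = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $links[] = $a->getAttribute('href');
    }
    foreach ($dom->getElementsByTagName('img') as $img) {
        $images[] = $img->getAttribute('src');
    }
    return [$links, $images];
}

function crawl(string $seed, int $maxPages = 50): array {
    $visited = []; // hashmap: url => true, so we never recrawl
    $queue   = [$seed]; // queue: next jobs
    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;
        }
        $visited[$url] = true;
        $html = fetchPage($url);
        if ($html === null) {
            continue;
        }
        [$links, $images] = extractTags($html);
        foreach ($links as $href) {
            if ($href !== '' && !isset($visited[$href])) {
                $queue[] = $href;
            }
        }
        // Here you would save $images with fetchPage() + file_put_contents(),
        // and sleep between requests to respect the server.
    }
    return array_keys($visited);
}
```

Usage would be something like `crawl('https://example.com/');`. Note a real crawler also needs to resolve relative URLs against the page URL before queueing them.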
Java: a WebDriver + Chrome + BrowserMob crawler can be put together in a few lines of code, and it will catch JavaScript-rendered content you would otherwise miss. Slow, but easy and lazy. You can intercept all the images directly from the proxy.
Java/C#: a proper, asynchronous, high-performance crawler with something like the Majestic-12 HTML parser behind it. You can get to 2000 pages processed per minute, and you will win the eternal hatred of every webmaster.
You can also take a look at Lucene, part of the Apache project.