
How to store crawled data from webpages

I want to build an educational search engine on my web app, so I decided to crawl about 10 websites using PHP from my web page and store the data in my database for later searching. How do I retrieve this data and store it in my database?


You can grab them with the file_get_contents() function. So you'd have:

$homepage = file_get_contents('http://www.example.com/homepage');

This function returns the page as a string.
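To cover the storage half of the question, here is a minimal sketch that inserts the fetched string into MySQL with PDO. The database name, credentials, and the pages table are placeholder assumptions, not anything given in the question:

<?php
// Minimal sketch: fetch one page and store it, assuming a table such as
//   CREATE TABLE pages (id INT AUTO_INCREMENT PRIMARY KEY,
//                       url VARCHAR(2048), body MEDIUMTEXT);
// Credentials and names below are placeholders.
$url  = 'http://www.example.com/homepage';
$body = file_get_contents($url);

if ($body !== false) {
    $pdo  = new PDO('mysql:host=localhost;dbname=search', 'user', 'password');
    $stmt = $pdo->prepare('INSERT INTO pages (url, body) VALUES (?, ?)');
    $stmt->execute([$url, $body]);
}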

Hope this helps. Cheers


Building a crawler, I would make the list of URLs to fetch and then fetch them:

A. Make the list

  1. Define the starting URL to crawl.
  2. Add this URL to the list of URLs to crawl (the job list).
  3. Define the max depth.
  4. Parse the first page, find all the href attributes, and extract the links.
  5. For each link: if it is from the same domain or relative, add it to the job list.
  6. Remove the current URL from the job list.
  7. Restart from the next URL in the job list if it is non-empty (a rough sketch of this loop follows the list).
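Here is a rough sketch of the loop described above. The start URL and max depth are illustrative, and extract_links() is a hypothetical helper (one possible implementation is shown a bit further down):

<?php
// Rough sketch of the job-list loop above. The start URL and depth limit
// are illustrative; extract_links() is a hypothetical helper.
$start    = 'http://www.example.com/';
$host     = parse_url($start, PHP_URL_HOST);
$maxDepth = 2;

$jobs    = [[$start, 0]];      // job list of [url, depth] pairs
$visited = [$start => true];   // URLs already queued, so each is crawled once

while ($jobs) {
    [$url, $depth] = array_shift($jobs);      // take the next job (steps 6-7)
    $html = @file_get_contents($url);
    if ($html === false) {
        continue;                             // fetch failed, skip this page
    }

    // ... store $html in the database here (part B) ...

    if ($depth >= $maxDepth) {
        continue;                             // step 3: respect the max depth
    }

    foreach (extract_links($html, $url) as $link) {
        // Step 5: same-domain links only, and only ones not queued before.
        if (parse_url($link, PHP_URL_HOST) === $host && !isset($visited[$link])) {
            $visited[$link] = true;
            $jobs[] = [$link, $depth + 1];
        }
    }
}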

For this you could use this class, which makes parsing HTML really easy: https://simplehtmldom.sourceforge.io/
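For illustration, here is one way the hypothetical extract_links() helper from the sketch above could look with that library. It only resolves root-relative links, which is a simplification; full relative-URL resolution needs more work:

<?php
require_once 'simple_html_dom.php';

// Hypothetical helper used in the crawler sketch above: collect the href of
// every <a> tag and crudely resolve root-relative links against the base URL.
function extract_links(string $html, string $baseUrl): array
{
    $links = [];
    $dom = str_get_html($html);
    if ($dom === false) {
        return $links;                        // page could not be parsed
    }
    foreach ($dom->find('a') as $a) {
        $href = $a->href;
        if (strpos($href, '/') === 0) {       // root-relative, e.g. /about
            $href = parse_url($baseUrl, PHP_URL_SCHEME) . '://'
                  . parse_url($baseUrl, PHP_URL_HOST) . $href;
        }
        $links[] = $href;
    }
    return $links;
}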

B. Get content

Loop over the job list built in step A and get the content; file_get_contents() will do this for you: https://www.php.net/file-get-contents
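file_get_contents() also accepts a stream context, which is useful here for setting a timeout and a User-Agent so one slow or picky site does not stall the whole crawl. A small sketch, with illustrative values and $urls standing in for whatever job list you built in part A:

<?php
// Fetch each URL with a timeout and User-Agent via a stream context.
// Values are illustrative; $urls is a plain list of URLs to fetch.
$context = stream_context_create([
    'http' => [
        'timeout'    => 10,                   // seconds
        'user_agent' => 'MyEduCrawler/0.1',   // placeholder name
    ],
]);

foreach ($urls as $url) {
    $body = @file_get_contents($url, false, $context);
    if ($body === false) {
        continue;                             // skip unreachable pages
    }
    // ... insert $body into the database, as in the PDO example above ...
}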

This is basically just a starting point. In step A, you should keep a list of already-parsed URLs so that each one is checked only once. Query strings are also something to watch for, so that you avoid scanning the same page multiple times under different query strings (see the sketch below).
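One simple way to handle the query-string caveat is to normalize each URL before the already-parsed check. Whether dropping the query string is safe depends on whether the site serves different content per query string, so treat this as a sketch:

<?php
// Strip the query string (and fragment) so /page?a=1 and /page?a=2 count as
// the same URL in the already-parsed list.
function normalize_url(string $url): string
{
    $p = parse_url($url);
    return ($p['scheme'] ?? 'http') . '://'
         . ($p['host'] ?? '')
         . ($p['path'] ?? '/');
}

// Usage: check the normalized form against the visited list:
// if (!isset($visited[normalize_url($link)])) { ... }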

