
How to store crawled data from webpages

I want to build an educational search engine on my web app, so I decided to crawl about 10 websites using PHP from my web page and store the data in my database for later searching. How do I retrieve this data and store it in my database?


You can grab them with the file_get_contents() function. So you'd have:

$homepage = file_get_contents('http://www.example.com/homepage');

This function returns the page as a string.
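To cover the storage half of the question, here is a minimal sketch that inserts the fetched string into MySQL with PDO. The database name, credentials, and the pages table are placeholder assumptions, not anything given in the question:

<?php
// Minimal sketch: fetch one page and store it, assuming a table such as
//   CREATE TABLE pages (id INT AUTO_INCREMENT PRIMARY KEY,
//                       url VARCHAR(2048), body MEDIUMTEXT);
// Credentials and names below are placeholders.
$url  = 'http://www.example.com/homepage';
$body = file_get_contents($url);

if ($body !== false) {
    $pdo  = new PDO('mysql:host=localhost;dbname=search', 'user', 'password');
    $stmt = $pdo->prepare('INSERT INTO pages (url, body) VALUES (?, ?)');
    $stmt->execute([$url, $body]);
}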

Hope this helps. Cheers


Building a crawler, I would make the list of URLs to fetch and then fetch them:

A. Make the list

  1. Define the starting URL to crawl.
  2. Add this URL to the list of URLs to crawl (the job list).
  3. Define the max depth.
  4. Parse the first page, find all the href attributes, and extract the links.
  5. For each link: if it is from the same domain or relative, add it to the job list.
  6. Remove the current URL from the job list.
  7. Restart from the next URL in the job list if it is non-empty (a rough sketch of this loop follows the list).
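Here is a rough sketch of the loop described above. The start URL and max depth are illustrative, and extract_links() is a hypothetical helper (one possible implementation is shown a bit further down):

<?php
// Rough sketch of the job-list loop above. The start URL and depth limit
// are illustrative; extract_links() is a hypothetical helper.
$start    = 'http://www.example.com/';
$host     = parse_url($start, PHP_URL_HOST);
$maxDepth = 2;

$jobs    = [[$start, 0]];      // job list of [url, depth] pairs
$visited = [$start => true];   // URLs already queued, so each is crawled once

while ($jobs) {
    [$url, $depth] = array_shift($jobs);      // take the next job (steps 6-7)
    $html = @file_get_contents($url);
    if ($html === false) {
        continue;                             // fetch failed, skip this page
    }

    // ... store $html in the database here (part B) ...

    if ($depth >= $maxDepth) {
        continue;                             // step 3: respect the max depth
    }

    foreach (extract_links($html, $url) as $link) {
        // Step 5: same-domain links only, and only ones not queued before.
        if (parse_url($link, PHP_URL_HOST) === $host && !isset($visited[$link])) {
            $visited[$link] = true;
            $jobs[] = [$link, $depth + 1];
        }
    }
}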

For this you could use this class, which makes parsing HTML really easy: https://simplehtmldom.sourceforge.io/
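For illustration, here is one way the hypothetical extract_links() helper from the sketch above could look with that library. It only resolves root-relative links, which is a simplification; full relative-URL resolution needs more work:

<?php
require_once 'simple_html_dom.php';

// Hypothetical helper used in the crawler sketch above: collect the href of
// every <a> tag and crudely resolve root-relative links against the base URL.
function extract_links(string $html, string $baseUrl): array
{
    $links = [];
    $dom = str_get_html($html);
    if ($dom === false) {
        return $links;                        // page could not be parsed
    }
    foreach ($dom->find('a') as $a) {
        $href = $a->href;
        if (strpos($href, '/') === 0) {       // root-relative, e.g. /about
            $href = parse_url($baseUrl, PHP_URL_SCHEME) . '://'
                  . parse_url($baseUrl, PHP_URL_HOST) . $href;
        }
        $links[] = $href;
    }
    return $links;
}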

B. Get content

Loop over the job list built in step A and get the content; file_get_contents() will do this for you: https://www.php.net/file-get-contents
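file_get_contents() also accepts a stream context, which is useful here for setting a timeout and a User-Agent so one slow or picky site does not stall the whole crawl. A small sketch, with illustrative values and $urls standing in for whatever job list you built in part A:

<?php
// Fetch each URL with a timeout and User-Agent via a stream context.
// Values are illustrative; $urls is a plain list of URLs to fetch.
$context = stream_context_create([
    'http' => [
        'timeout'    => 10,                   // seconds
        'user_agent' => 'MyEduCrawler/0.1',   // placeholder name
    ],
]);

foreach ($urls as $url) {
    $body = @file_get_contents($url, false, $context);
    if ($body === false) {
        continue;                             // skip unreachable pages
    }
    // ... insert $body into the database, as in the PDO example above ...
}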

This is basically just a starting point. In step A, you should keep a list of already-parsed URLs so that each one is checked only once. Query strings are also something to watch for, so that you avoid scanning the same page multiple times under different query strings (see the sketch below).
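One simple way to handle the query-string caveat is to normalize each URL before the already-parsed check. Whether dropping the query string is safe depends on whether the site serves different content per query string, so treat this as a sketch:

<?php
// Strip the query string (and fragment) so /page?a=1 and /page?a=2 count as
// the same URL in the already-parsed list.
function normalize_url(string $url): string
{
    $p = parse_url($url);
    return ($p['scheme'] ?? 'http') . '://'
         . ($p['host'] ?? '')
         . ($p['path'] ?? '/');
}

// Usage: check the normalized form against the visited list:
// if (!isset($visited[normalize_url($link)])) { ... }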

