We can easily find subdirectories on our local disc using os.walk() but what if those directories are not local and are on a web server?
For example, I have a website called http://www.geoglobaldomination.org. There are a couple subdirectories开发者_高级运维 that are NOT referenced on the home page. eg.http://www.geoglobaldomination.org/kml and http://www.geoglobaldomination.org/kml/temp.
How can I find those subdirectories using a simple python crawler without using the HTML tags as a reference point?
Anything you want to access off a remote server needs to be made public in some way. There is no automatic discovery mechanism -- this is why search engines want site maps for a website. The best practice in this case is to make a sitemap and have your crawler start there.
Well, in the most general sense you can't.
There are some websites that may give you an index of subdirectories when you end your uri with a '/', or a "index.html", but they don't have to. A website author can basically return anything they want when you visit their site (with a browser or a program). They could return a NOT FOUND (even when the document you requests exists at the exact location you request).
It is completely implementation-dependent.
精彩评论