Does anyone know in which programming language the Googlebot was written?
Or, more generally, in which language are efficient web-crawlers written?
I've seen many written in Java, but it doesn't seem to me the most appropriate language for developing a web crawler because it creates far too much overhead (I tried the Heritrix web crawler, and it's extremely heavy).
An educated guess is Python, since they employ its creator. However, I imagine their crawler is a distributed application that takes advantage of MapReduce, in which case it might actually be a C/C++ application.
This is beside the point, though. You can write an efficient web crawler in many different languages and still get the same result. A hammer will still hit a nail whether it is yellow or blue. Pick your favorite color and use it correctly.
The very earliest version, Backrub, was written in Python and Java.
This might help. It's the original Google paper:
http://infolab.stanford.edu/~backrub/google.html
I don't know about Googlebot (most likely C or Python), but there are some good ones out there in both Java and .NET.
One of the more popular open source options is Nutch (often used with Lucene).
Nutch itself is written in Java and is fairly efficient. There's also a .NET port called Nutch.NET.
I don't think the language will matter as much as the specific implementation.
What kind of overhead are you worried about in Java? Memory? Processing power?
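To illustrate why the implementation tends to matter more than the language, here is a minimal sketch of the core of a crawler: a breadth-first frontier with a "seen" set so no URL is queued twice. This is only an assumption about what a basic crawler loop looks like, not how Googlebot, Heritrix, or Nutch actually work. The names `crawl`, `extract_links`, `LinkParser`, and the `fetch` callback are hypothetical; `fetch` is injected so the download step (and its cost) can be swapped out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collects the href values of <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute, fragment-free URLs found in the page."""
    parser = LinkParser()
    parser.feed(html)
    # Resolve relative links and drop #fragments so duplicates collapse.
    return [urldefrag(urljoin(base_url, href))[0] for href in parser.links]


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl starting at `seed`.

    `fetch(url)` is assumed to return the page HTML as a string,
    or None if the page could not be retrieved.
    """
    frontier = deque([seed])  # URLs waiting to be fetched
    seen = {seed}             # URLs already queued, to avoid repeats
    visited = []              # URLs successfully fetched, in order
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

In a sketch like this, most of the time goes to waiting on `fetch`, not to the loop itself, which is one reason the choice of language is usually secondary. A real crawler would also need things this sketch leaves out, such as robots.txt handling, per-host rate limiting, and persistent storage for the frontier.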