I have a very basic web crawler. The database table which stores the links it finds has a Unique index on the url field.
The logic I have so far is that for each link found on a page, the application queries the links table to see if the link already exists. If it doesn't, it will insert it.
In trying to get the best performance from the script, would it be OK to skip the initial query that checks whether the link already exists? If the link is a duplicate, the insert will fail against the unique index anyway.
There will be more failed insert attempts this way, but it would eliminate the need for a full SELECT query for every link found.
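The insert-and-catch approach described above can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module as a stand-in for whatever database the crawler actually uses; the `links` schema and the `add_link` helper are hypothetical, and the exact duplicate-key exception will differ by database driver:

```python
import sqlite3

# Hypothetical schema: a links table with a unique index on url.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (url TEXT UNIQUE)")

def add_link(conn, url):
    """Insert the url, treating a duplicate-key failure as 'already seen'.

    No SELECT is issued first; the unique index does the existence check.
    """
    try:
        conn.execute("INSERT INTO links (url) VALUES (?)", (url,))
        conn.commit()
        return True   # newly inserted
    except sqlite3.IntegrityError:
        return False  # unique constraint rejected it: link already exists

print(add_link(conn, "http://example.com"))  # True: first insert succeeds
print(add_link(conn, "http://example.com"))  # False: duplicate is rejected
```

With other drivers (MySQLdb, psycopg2, etc.) the exception class changes, but the pattern of relying on the unique constraint instead of a prior SELECT is the same.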
I would guess that running the select first will be faster, but testing is more reliable than intuition.
The results depend on the relative cost of a select, a successful insert, and a failed insert. It is entirely possible that raising the error for a failed insert takes much more time than the extra select, yet if failures are infrequent enough, the total cost is still lower than running the extra select for every link.
For example, say a select takes 1 ms, a successful insert takes 20 ms, and a failed insert takes 10 ms (all numbers completely invented). If 99 out of every 100 items succeed, then 100 select/insert operations take 2080 ms while insert/fail takes only 1990 ms. If, on the other hand, only 10 in 100 inserts succeed, then 100 select/insert operations take 300 ms while 100 insert/fail operations take 1100 ms.
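The arithmetic above generalizes to a simple cost model. The sketch below reproduces both scenarios from the invented numbers; the function names and parameters are illustrative, not from any library:

```python
def cost_select_then_insert(n, successes, t_select, t_insert_ok):
    # One select per item, plus an insert only for items that don't exist yet.
    return n * t_select + successes * t_insert_ok

def cost_insert_and_catch(n, successes, t_insert_ok, t_insert_fail):
    # No selects: every item is inserted; duplicates pay the failure cost.
    return successes * t_insert_ok + (n - successes) * t_insert_fail

# 99 of 100 links are new (times in ms, from the example above)
print(cost_select_then_insert(100, 99, 1, 20))  # 2080
print(cost_insert_and_catch(100, 99, 20, 10))   # 1990

# only 10 of 100 links are new
print(cost_select_then_insert(100, 10, 1, 20))  # 300
print(cost_insert_and_catch(100, 10, 20, 10))   # 1100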
Short answer: time it.
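A rough way to time it: run both strategies over the same mix of new and duplicate URLs and compare. This sketch uses an in-memory SQLite database purely as a proxy; real numbers will depend on the actual database, indexes, and network round-trips, so treat it as a template rather than a verdict:

```python
import sqlite3
import time

def bench(urls):
    """Time select-then-insert vs insert-and-catch on the same workload."""
    # Strategy 1: check existence with a SELECT before inserting.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE links (url TEXT UNIQUE)")
    t0 = time.perf_counter()
    for url in urls:
        exists = conn.execute(
            "SELECT 1 FROM links WHERE url = ?", (url,)).fetchone()
        if exists is None:
            conn.execute("INSERT INTO links (url) VALUES (?)", (url,))
    select_then_insert = time.perf_counter() - t0

    # Strategy 2: always insert, let the unique index reject duplicates.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE links (url TEXT UNIQUE)")
    t0 = time.perf_counter()
    for url in urls:
        try:
            conn.execute("INSERT INTO links (url) VALUES (?)", (url,))
        except sqlite3.IntegrityError:
            pass  # duplicate: ignore and move on
    insert_and_catch = time.perf_counter() - t0
    return select_then_insert, insert_and_catch

# 90% duplicates, mimicking a crawl that keeps rediscovering the same links
urls = [f"http://example.com/page{i % 10}" for i in range(100)]
a, b = bench(urls)
print(f"select/insert: {a:.4f}s  insert/fail: {b:.4f}s")
```

Swapping in the real database connection and a realistic duplicate ratio for the crawl is what makes the comparison meaningful.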