I'm writing an application that needs to deal with millions of URLs. It also needs to do retrieval by URL.
My table currently looks like this:
CREATE TABLE Pages (
id bigint(20) unsigned NOT NULL,
url varchar(4096) COLLATE utf8_unicode_ci NOT NULL,
url_crc int(10) unsigned NOT NULL, -- CRC32 values are unsigned 32-bit, so unsigned is needed here
PRIMARY KEY (id),
KEY url_crc (url_crc)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
The idea behind this structure is to do lookups by a CRC32 hash of the URL, since a B-tree index would be very inefficient on URLs, which tend to have long common prefixes (and InnoDB doesn't support hash indexes). CRC32 collisions are filtered out by comparing against the full URL. A sample retrieval query looks like this:
SELECT id
FROM Pages
WHERE url_crc = 2842100667
AND url = 'example.com/page.html';
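For reference, url_crc is populated at insert time; here's an illustrative insert using MySQL's built-in CRC32() function (computing the hash in the application works the same way):
INSERT INTO Pages (id, url, url_crc)
VALUES (1, 'example.com/page.html', CRC32('example.com/page.html'));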
The problem I have is preventing duplicate entries from being inserted. The application will always check the database for an existing entry before inserting a new one, but it's likely that multiple queries for the same new URL will run concurrently, so duplicate CRC32s and URLs will be entered.
I don't want to create a unique index on url as it would be gigantic. I also don't want to write-lock the table on every insert, since that would destroy concurrent insert performance. Is there an efficient way to solve this issue?
Edit: To go into a bit more detail about the usage: it's a real-time table for looking up content in response to a URL. By looking up the URL, I can find the internal id for it, and then use that id to find the content for a page. New URLs are added to the system all the time, and I have no idea what those URLs will be beforehand. When new URLs are first referenced, they will likely be slammed by simultaneous requests for the same URL, perhaps hundreds per second, which is why I'm concerned about the race condition when adding new content. The results need to be immediate; there can't be noticeable read lag (sub-second lag is okay).
To start, only a few thousand new URLs will be added per day, but the system will need to handle many times that before we have time to move to a more scalable solution next year.
One other issue with just using a unique index on url is that URLs can exceed the maximum key length of a unique index, so even dropping the CRC32 trick doesn't solve the problem of preventing duplicate URLs.
Have you actually benchmarked and found the B-tree to be a problem? I sense premature optimization.
Second, if you're worried about the start of all the strings being the same, one answer is to index your URLs reversed (last character first). I don't think MySQL can do that natively, but you could reverse the data in your app before storing it. Or just don't use MySQL.
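A rough sketch of the reversed-URL idea; the url_rev column and the 255-character prefix are my assumptions (the prefix keeps the key within InnoDB's 767-byte-per-column limit under utf8):
ALTER TABLE Pages
  ADD COLUMN url_rev varchar(4096) COLLATE utf8_unicode_ci NOT NULL,
  ADD KEY url_rev (url_rev(255)); -- prefix index; the varying end of the URL now sorts first

-- Store REVERSE(url) in url_rev on insert, then look up with:
SELECT id
FROM Pages
WHERE url_rev = REVERSE('example.com/page.html')
  AND url = 'example.com/page.html';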
Have you considered creating a UNIQUE INDEX on (url_crc, url)? It may be 'gigantic', but with the number of collisions you're going to get from CRC32, it will probably help the performance of your page retrieval function while also preventing duplicate URLs.
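Something like the following, as a sketch. The url(255) prefix is an assumption to keep the key within InnoDB's length limit (which, as you noted, full-length URLs exceed); in theory it could reject two distinct URLs that share both a CRC32 value and their first 255 characters:
ALTER TABLE Pages
  ADD UNIQUE INDEX url_crc_url (url_crc, url(255)); -- prefix is an assumption to fit key-length limits

-- With the unique index in place, the race can be resolved at insert time:
INSERT IGNORE INTO Pages (id, url, url_crc)
VALUES (1, 'example.com/page.html', CRC32('example.com/page.html'));
INSERT IGNORE silently drops the row when a concurrent request got there first, so no table-level locking is needed.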
Another thing to consider would be allowing duplicates to be inserted, and removing them nightly (or whenever traffic is low) with a script.
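A possible cleanup query (a sketch that keeps the lowest id for each URL; joining on url_crc first lets it use your existing index):
-- Delete every row whose URL already exists under a lower id.
DELETE p2
FROM Pages p1
JOIN Pages p2
  ON p1.url_crc = p2.url_crc
 AND p1.url = p2.url
 AND p1.id < p2.id;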
In addition to your Pages table, create 3 additional tables with the same columns (PagesInsertA, PagesInsertB and PagesInsertC). When inserting URLs, check Pages for an existing entry, and if it's not there, insert the URL into PagesInsertA. You can either use a unique index on that smaller table, or include a step to remove duplicates later (discussed below). At the end of your rotation period (perhaps one minute; see the discussion below for constraints), switch to inserting new URLs into PagesInsertB and perform the following steps on PagesInsertA:

1. Remove duplicates (if you weren't using a unique index).
2. Remove any entries that duplicate entries in PagesInsertC (that table will be empty the first time around, but not the second).
3. Add the entries from PagesInsertA to Pages.
4. Empty PagesInsertC.
At the end of the second period, switch to inserting new URLs into PagesInsertC, and perform the steps above on PagesInsertB (the only difference being that you will remove entries also found in PagesInsertA, and empty PagesInsertA at the end). Continue rotating the table that new URLs are inserted into (A -> B -> C -> A -> ...).
A minimum of 3 insert tables is needed because there will be a lag between switching URL insertion to a new insert table and inserting the cleaned-up rows from the previous insert table into Pages. I used 1 minute as the time between switches in this example, but you can make it smaller as long as the insertion from PagesInsertA into Pages and the emptying of PagesInsertC (for example) complete before the switch from inserting new URLs into PagesInsertB to PagesInsertC.
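Here is a sketch of one rotation step in SQL, assuming the switch from PagesInsertA to PagesInsertB has just happened (table and column names follow the schema above):
-- 1. Remove duplicates within PagesInsertA (skip if it has a unique index).
DELETE a2
FROM PagesInsertA a1
JOIN PagesInsertA a2
  ON a1.url_crc = a2.url_crc AND a1.url = a2.url AND a1.id < a2.id;

-- 2. Remove entries that also landed in the previous insert table.
DELETE a
FROM PagesInsertA a
JOIN PagesInsertC c
  ON a.url_crc = c.url_crc AND a.url = c.url;

-- 3. Move the cleaned rows into Pages, skipping any that are already there.
INSERT INTO Pages (id, url, url_crc)
SELECT a.id, a.url, a.url_crc
FROM PagesInsertA a
LEFT JOIN Pages p
  ON p.url_crc = a.url_crc AND p.url = a.url
WHERE p.id IS NULL;

-- 4. Empty the table that was cleaned two periods ago.
TRUNCATE TABLE PagesInsertC;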