I have 2 databases and I need to link information between two big tables (more than 3M entries each, continuously growing). The 1st database has a table 'pages' that stores various information about web pages, and includes the URL of each one. The column 'URL' is a varchar(512) and has no index.
The 2nd database has a table 'urlHops' defined as:
CREATE TABLE urlHops (
    dest varchar(512) NOT NULL,
    src varchar(512) DEFAULT NULL,
    timestamp timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
    KEY dest_key (dest),
    KEY src_key (src)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
Now I basically need to run queries like this efficiently:
select p.id, p.URL from db1.pages p, db2.urlHops u where u.src = p.URL and u.dest = ?
At first, I thought of adding an index on pages(URL). But it's a very long column, and I already issue a lot of INSERTs and UPDATEs on the same table (far more than the number of SELECTs that would use this index).
Other possible solutions I thought of are:
- adding a column to pages that stores the md5 hash of the URL, and indexing it; this way I could run queries using the md5 of the URL, with the advantage of an index on a smaller column.
- adding another table that contains only the page id and page URL, indexing both columns. But this may be a waste of space, its only advantage being that it does not slow down the inserts and updates I execute on 'pages'.
I don't want to slow down the inserts and updates, but at the same time I would like to be able to run the queries on the URL efficiently. Any advice? My primary concern is performance; if needed, wasting some disk space is not a problem.
Thank you, regards
Davide
The MD5 hash suggestion you had is very good; it's documented in High Performance MySQL, 2nd Ed. There are a couple of tricks to get it to work (shown here with CRC32 instead of MD5):
CREATE TABLE urls (
    id INT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
    url VARCHAR(255) NOT NULL,
    url_crc32 INT UNSIGNED NOT NULL,
    INDEX (url_crc32)
);
Select queries have to look like this:
SELECT * FROM urls WHERE url='http://stackoverflow.com' AND url_crc32=crc32('http://stackoverflow.com');
The url_crc32 condition is there so the query can use the index; including url in the WHERE clause filters out any rows that merely share the same hash (collisions).
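A minimal Python sketch of this lookup pattern, for illustration only. It uses an in-memory SQLite database as a stand-in for MySQL, and the helper names (url_crc32, make_db, insert_url, find_url) are made up for the example. The hash is computed in the application with zlib.crc32, which is assumed to agree with MySQL's CRC32(); verify that before mixing the two on the same column.

```python
import sqlite3
import zlib


def url_crc32(url: str) -> int:
    """Unsigned 32-bit CRC of the URL's UTF-8 bytes."""
    return zlib.crc32(url.encode("utf-8")) & 0xFFFFFFFF


def make_db() -> sqlite3.Connection:
    # SQLite stands in for MySQL; the schema mirrors the urls table above.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE urls ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " url TEXT NOT NULL,"
        " url_crc32 INTEGER NOT NULL)"
    )
    # Only the small hash column is indexed, not the long URL.
    conn.execute("CREATE INDEX idx_url_crc32 ON urls (url_crc32)")
    return conn


def insert_url(conn: sqlite3.Connection, url: str) -> int:
    # The hash is stored alongside the URL at write time.
    cur = conn.execute(
        "INSERT INTO urls (url, url_crc32) VALUES (?, ?)",
        (url, url_crc32(url)),
    )
    return cur.lastrowid


def find_url(conn: sqlite3.Connection, url: str):
    # The hash condition narrows the search via the index;
    # the url condition rejects rows that only share the hash.
    row = conn.execute(
        "SELECT id FROM urls WHERE url_crc32 = ? AND url = ?",
        (url_crc32(url), url),
    ).fetchone()
    return row[0] if row else None
```

The cost on the write side is one extra hash computation per INSERT or UPDATE of the URL, plus maintenance of a 4-byte index entry instead of an entry of up to 512 bytes.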
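I'd probably recommend crc32 over md5. A crc32 value fits in a 4-byte INT UNSIGNED, while an md5 digest takes 16 bytes (or 32 characters if stored as hex). There will be a few more collisions, but you have a higher chance of fitting the whole index in memory.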
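The trade-off between crc32 and md5 can be put into rough numbers. The sketch below is a back-of-the-envelope estimate, not a measurement: it counts raw key bytes only (ignoring InnoDB per-entry overhead and page fill), assumes hashes are uniformly distributed, and uses the 3M row count from the question. The function names are hypothetical.

```python
import hashlib

CRC32_BYTES = 4                            # fits an INT UNSIGNED column
MD5_BYTES = hashlib.md5().digest_size      # binary MD5 digest size in bytes
ROWS = 3_000_000                           # table size mentioned in the question


def raw_key_megabytes(bytes_per_key: int, rows: int = ROWS) -> float:
    """Raw size of the indexed key values alone, in MB (10**6 bytes)."""
    return bytes_per_key * rows / 1e6


def expected_colliding_pairs(rows: int, hash_bits: int) -> float:
    """Birthday-style estimate: number of row pairs sharing a hash value."""
    pairs = rows * (rows - 1) / 2
    return pairs / 2 ** hash_bits


if __name__ == "__main__":
    print("crc32 keys:", raw_key_megabytes(CRC32_BYTES), "MB")
    print("md5 keys:  ", raw_key_megabytes(MD5_BYTES), "MB")
    print("crc32 colliding pairs:", expected_colliding_pairs(ROWS, 32))
    print("md5 colliding pairs:  ", expected_colliding_pairs(ROWS, 128))
```

Under these assumptions, crc32 yields on the order of a thousand colliding pairs across 3M URLs, so a lookup occasionally touches an extra row, which the url comparison in the WHERE clause then discards.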
If pages and URLs have a 1-to-1 relationship and that table has a unique id (primary key?), you could store that id value in the src and dest fields of the urlHops table instead of the full URL.
This would make indexing and joins much more efficient.
I would create a page_url table that has an auto-increment integer primary key and your URL value. Then update Pages and urlHops to use page_url.id.
Your urlHops would become (dest int,src int,...)
Your Pages table would replace url with pageid.
Index the page_url.url field, and you should be good to go.
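The layout described above could look roughly like the following Python sketch. It again uses in-memory SQLite as a stand-in for MySQL and keeps everything in one database for brevity. The column name url_id and the helpers make_db, url_id and pages_hopping_to are assumptions made for the example; the UNIQUE constraint on page_url.url plays the role of the index on that column.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page_url (
            id  INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT NOT NULL UNIQUE          -- UNIQUE also creates the index
        );
        CREATE TABLE pages (
            id     INTEGER PRIMARY KEY AUTOINCREMENT,
            url_id INTEGER NOT NULL REFERENCES page_url(id)
        );
        CREATE TABLE urlHops (
            dest INTEGER NOT NULL REFERENCES page_url(id),
            src  INTEGER REFERENCES page_url(id)
        );
        CREATE INDEX dest_key ON urlHops (dest);
        CREATE INDEX src_key ON urlHops (src);
        CREATE INDEX pages_url_id ON pages (url_id);
        """
    )
    return conn


def url_id(conn: sqlite3.Connection, url: str) -> int:
    """Return the id for a URL, inserting it into page_url if it is new."""
    conn.execute("INSERT OR IGNORE INTO page_url (url) VALUES (?)", (url,))
    return conn.execute(
        "SELECT id FROM page_url WHERE url = ?", (url,)
    ).fetchone()[0]


def pages_hopping_to(conn: sqlite3.Connection, dest_url: str):
    """Pages whose URL appears as src of a hop ending at dest_url."""
    return conn.execute(
        "SELECT p.id, pu.url FROM urlHops u "
        "JOIN pages p ON p.url_id = u.src "
        "JOIN page_url pu ON pu.id = p.url_id "
        "WHERE u.dest = (SELECT id FROM page_url WHERE url = ?) "
        "ORDER BY p.id",
        (dest_url,),
    ).fetchall()
```

With this layout the long string is compared once, when the destination URL is resolved to its id; the join between urlHops and pages then runs entirely on integer keys.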