开发者

High-performance multi-tier tag filtering

开发者 https://www.devze.com 2023-03-26 00:14 出处:网络
I have a large database of artists, albums, and tracks. Each of these items may have one or more tags assigned via glue tables (track_attributes, album_attributes, artist_attributes). There are severa

I have a large database of artists, albums, and tracks. Each of these items may have one or more tags assigned via glue tables (track_attributes, album_attributes, artist_attributes). There are several thousand (or even hundred thousand) tags applicable to each item type.

I am trying to accomplish two tasks, and I'm having a very hard time getting the queries to perform acceptably.

Task 1) Get all tracks that have any given tags (if provided) by artists that have any given tags (if provided) on albums with any given tags (if provided). Any set of tags may not be present (i.e. only a track tag is active, no artist or album tags)

Variation: The results are also presentable by artist or by album rather than by track

Task 2) Get a list of tags that are applied to the results from the previous filter, along with a count of how many tracks have each given tag.

What I am after is some general guidance in approach. I have tried temp tables, inner joins, IN(), all my efforts thus far result in slow responses. A good example of the results I am after can be seen here: http://www.yachtworld.com/core/listing/advancedSearch.jsp, except they only have one tier of tags, I am dealing with three.

Table structures:

Table: attribute_tag_groups
   Column   |          Type               |   
------------+-----------------------------+
 id         | integer                     |
 name       | character varying(255)      | 
 type       | enum (track, album, artist) | 

Table: attribute_tags
   Column                       |          Type               |   
--------------------------------+-----------------------------+
 id                             | integer                     |
 attribute_tag_group_id         | integer                     |
 name                           | character varying(255)      | 

Table: track_attribute_tags
   Column   |          Type               |   
------------+-----------------------------+
 track_id   | integer                     |
 tag_id     | integer                     | 

Table: artist_attribute_tags
   Column   |          Type               |   
------------+-----------------------------+
 artist_id  | integer                     |
 tag_id     | integer                     | 

Table: album_attribute_tags
   Column   |          Type               |   
------------+-----------------------------+
 album_id   | integer                     |
 tag_id     | integer                     | 

Table: artists
   Column   |          Type               |   
--------开发者_开发知识库----+-----------------------------+
 id         | integer                     |
 name       | varchar(350)                | 

Table: albums
   Column   |          Type               |   
------------+-----------------------------+
 id         | integer                     |
 artist_id  | integer                     | 
 name       | varchar(300)                | 

Table: tracks
   Column    |          Type               |   
-------------+-----------------------------+
 id          | integer                     |
 artist_id   | integer                     | 
 album_id    | integer                     | 
 compilation | boolean                     | 
 name        | varchar(300)                | 

EDIT I am using PHP, and I am not opposed to doing any sorting or other hijinx in script, my #1 concern is speed of return.


If you want speed, I would suggest you look into Solr/Lucene. You can store your data, and have very speedy lookups by calling Solr and parsing the result from PHP. And as an added benefit you get faceted searches as well (which is task 2 of your question if I interpret it correctly). The downside is of course that you might have redundant information (once stored in DB, once in the Solr document store). And it does take a while to setup (well, you could learn a lot from Drupal Solr integration).

Just check out the PHP reference docs for Solr.

Here's on article on how to use Solr with PHP, just in case : http://www.ibm.com/developerworks/opensource/library/os-php-apachesolr/.


You probably should try to denormalize your data. Your structure is optimised for insert/update load, but not for queries. As I got it, your will have much more select queries than insert/update queries.

For example you can do something like this:

store your data in normalized structure.

create agregate table like this

  track_id, artist_tags, album_tags, track_tags
   1 , jazz/pop/,  jazz/rock, /heavy-metal/  

    or 

    track_id, artist_tags, album_tags, track_tags
    1 , 1/2/,  1/3, 4/

to spead up search you probably should create FULLTEXT index on *_tags columns

query this table with sql like

select * from aggregate where album_tags  MATCH (track_tags) AGAINST ('rock')

rebuild this table incrementally once a day.


I think the answer greately depends on how much money you wish to spend on your project - there are some tasks that are even theoretically impossible to accomplish given strict conditions(for example that you must use only one weak server). I will assume that you are ready to upgrade your system.

First of all - your table structure forces JOIN's - I think you should avoid them if possible when writing high performace applications. I don't know "attribute_tag_groups" is, so I propose a table structure: tag(varchar 255), id(int), id_type(enum (track, album, artist)). Id can be artist_id,track_id or album_id depending on id_type. This way you will be able too lokup all your data in one table, but of cource it will use much more memory.

Next - you should consider using several databases. It will help even more if each database contains only part of your data(each lookup will be faster). Deciding how to spread your data between databases is usually rather hard task: I suggest you make some statistics about tag length, find ranges of length that will get similar trac/artists results count and hard-code it into your lookup code.

Of cource you should consider MySql tuning(I am sure you did that, but just in case) - all your tables should reside in RAM - if that is impossible try to get SSD discs, raids etc.. Proper indexing and database types/settings are really important too (MySql may even show some bottlenecks in internal statistics).

This suggestion may sound mad - but sometimes it is good to let PHP do some calculations that MySql can do itself. MySql databases are much harder to scale, while a server for PHP processing can be added in in the matter of minutes. And different PHP threads can run on different CPU cores - MySql have problems with it. You can increase your PHP performace by using some advanced modules(you can even write them yourself - profile your PHP scripts and hard code bottlenecks in fast C code).

Last but I think the most important - you must use some type of caching. I know that it is really hard, but I don't think that there was any big project without a really good caching system. In your case some tags will surely be much more popular then others, so it should greately increase performance. Caching is a form of art - depending on how much time you can spend on it and how much resources are avaliable you can make 99% of all requests use cache.

Using other databases/indexing tools may help you, but you should always consider theoretical query speed comparison(O(n), O(nlog(n))...) to understand if they can really help you - using this tools sometimes give you low performance gain(like constant 20%), but they may complicate your application design and most of the time it is not worth it.


From my experience most 'slow' MySQL database doesn't have correct index and/or queries. So I would check these first:

  1. Make sure all data talbes' id fields is primary index. Just in case.
  2. For all data tables, create an index on the external id fields and then the id, so that MySQL can use it in search.
  3. For your glue tables, setting a primary key on the two fields, first the subject, then the tag. This is for normal browsing. Then create a normal index on the tag id. This is for searching.
  4. Still slow? Are you using MyISAM for your tables? It is designed for quick queries.
  5. If still slow, run an EXPLAIN on a slow query and post both the query and result in the question. Preferably with an importable sql dump of your complete database structure.


Things you may give a try:

  • Use a Query Analyzer to explore the bottlenecks of your querys. (In most times the underlying DBS is quite doing an amazing job in optimizing)

  • Your table structure is well normalized but personal experience showed me that you can archive much greater performance levels with structures that enable you to avoid joins& subquerys. For your case i would suggest to store the tag information in one field. (This requires support by the underlying DBS)

So far.


Check your indices, and if they are used correctly. Maybe MySQL isn't up to the task. PostgreSQL should be similiar to use but has better performance in complex situations.

On a completely different track, google map-reduce and use one of these new fancy no-SQL databases for really really large data sets. This can do distributed search on multiple servers in parallel.

0

精彩评论

暂无评论...
验证码 换一张
取 消