Process optimization for large sets of data_问答_开发者

I currently have a project where we are dealing with 30million+ keywords for PPC advertising. We maintain these lists in Oracle. There are times where we need to remove certain keywords from the list. The process includes various match-type policies to determine if the keywords should be removed:

EXACT: WHERE keyword = '{term}'
CONTAINS: WHERE keyword LIKE '%{term}%'
TOKEN: WHERE keyword LIKE '% {term} %' OR keyword LIKE '{term} %' OR keyword LIKE '% {term}'

Now, when a list is processed, it can only use one of the match-types listed above. But, all 30mil+ keywords must be scanned for matches, returning the results for the matches. Currently, this process can take hours/days to pro开发者_Python百科cess depending on the number of keywords in the list of keywords to search for.

Do you have any suggestions on how to optimize the process so this will run much faster?

UPDATE: Here is an example query to search for Holiday Inn:

SELECT * FROM keyword_list 
WHERE
(
lower(text) LIKE 'holiday inn' OR
lower(text) LIKE '% holiday inn %' OR
lower(text) LIKE 'holiday inn %'
);

Here is the pastebin for the output of EXPLAIN: http://pastebin.com/tk74uhP4

Some additional information that may be useful. A keyword can consist of multiple words like:

this is a sample keyword
i like my keywords
keywords are great

Never use a LIKE match starting with "%" o large sets of data - it can not use the table index on that field and will do a table scan. This is your source of slowness.

The only matches that can use the index are the ones starting with hardcoded string (e.g. keyword LIKE '{term} %').

To work around this problem, create a new indexing table (not to be confused with database's table index) mapping individual terms to keyword strings contining those terms; then your keyword LIKE '% {term} %' becomes t1.keyword = index_table.keyword and index_table.term="{term}".

I know that mine approach can look like heresies for RDBMS guys but I verified it many times in practice and there is no magic. One should just know little bit about possible IO and processing rates and some of simple calculation. In short, RDBMS is not right tool for this sort of processing.

From mine experience perl is able do regexp scan roughly in millions per second. I don't know how fast you are able dump it from database (MySQL can up to 200krows/s so you can dump all your keywords in 2.5 min, I know that Oracle is much worse here but I hope it is not more than ten times i.e. 25 min). If your data are average 20 chars your dump will be 600MB, for 100 chars it is 3GB. It means that with slow 100MB/s HD your IO will take from 6s to 30s. (All involved IO is sequential!) It is almost nothing in comparison with time of dump and processing in perl. Your scan can slow down to 100k/s depending of number of keywords you would like to remove (I have experienced regexp with 500 branching patterns with this speed) so you can process resulting data in less than 5 minutes. If resulting cardinality will not be huge (in tens of hundreds) output IO should not be problem. Anyway your processing should be in minutes, not hours. If you generate whole keyword values for deletion you can use index in delete operation, so you will generate series of DELETE FROM <table> WHERE keyword IN (...) stuffed with keywords to remove in amount up to maximal length of SQL statement. You can also try variant where you will upload this data to temporary table and then use join. I don't know what would be faster in Oracle. It would take about 10 minutes in MySQL. You are unlucky that you have to deal with Oracle but you should be able remove hundreds of {term}'s in less than hour.

P.S.: I would recommend you to use something with better regular expressions like http://code.google.com/p/re2/ (included in V8 aka node.js) or new binary module in Erlang R14A but weak regexp engine in perl would not be weak point in this task, it would be RDBMS.

I think the problem is one of how the keywords are stored. If I'm interpreting your code correctly, the KEYWORD column is made up of a string of blank-separated keyword values, such as

KEYWORD1 KEYWORD2 KEYWORD3

Because of this you're forced to use LIKE to do your searches, and that's probably the souce of the slowness.

Although I realize this may be somewhat painful, it might be better to create a second table, perhaps called KEYWORDS, which would contain the individual keywords which relate to a given base table record (I'll refer to the base table as PPC since I don't know what's it really called). Assuming that your current base table looks like this:

CREATE TABLE PPC
 (ID_PPC       NUMBER PRIMARY KEY,
  KEYWORD      VARCHAR2(1000),
  <other fields>...);

What you could do would be to rebuild the tables as follows:

CREATE TABLE NEW_PPC
 (ID_PPC       NUMBER PRIMARY KEY,
  <other fields>...);

CREATE TABLE NEW_PPC_KEYWORD
 (ID_NEW_PPC       NUMBER,
  KEYWORD      VARCHAR2(25),  -- or whatever is appropriate for a single keyword
  PRIMARY KEY (ID_NEW_PPC, KEYWORD));

CREATE INDEX NEW_PPC_KEYWORD_1
  ON NEW_PPC_KEYWORD(KEYWORD);

You'd populate the NEW_PPC_KEYWORD table by pulling out the individual keywords from the old PPC.KEYWORD field, putting them into the NEW_PPC_KEYWORD table. With only one keyword in each record in NEW_PPC_KEYWORD you could now use a simple join to pull all the records in NEW_PPC which had a keyword by doing something like

SELECT P.*
  FROM NEW_PPC P
INNER JOIN NEW_PPC_KEYWORD K
  ON (K.ID_NEW_PPC = P.ID_NEW_PPC)
WHERE K.KEYWORD = '<whatever>';

Share and enjoy.

The info is insufficient to give any concrete advice. If the expensive LIKE matching is unavoidable then the only thing I see at the moment is this:

Currently, this process can take hours/days to process depending on the number of keywords in the list of keywords to search for.

Have you tried to cache the results of the queries in a table? Keyed by the input keyword?

Because I do not believe that the whole data set, all keywords can change overnight. And since they do not change very often it makes sense to simply keep the results in a extra table precomputed so that future queries for the keyword can be resolved via cache instead of going again over the 30Mil entries. Obviously, some sort of periodic maintenance has to be done on the cache table: when keywords are modified/deleted and when the lists are modified the cache entries have to be updated recomputed. To simplify the update, one would keep in the cache table also the ID of the original rows in keyword_list table which contributed the results.

To the UPDATE: Insert data into the keyword_list table already lower-cased. Use extra row if the original case is needed for later.

In the past I have participated in the design of one ad system. I do not remember all the details but the most striking difference is that we were tokenizing everything and giving every unique word an id. And keywords were not free form - they were also in DB table, were also tokenized. So we never actually matched the keywords as strings: queries were like:

select AD.id
from DICT, AD
where 
  DICT.word = :input_word and
  DICT.word_id = AD.word_id

DICT is a table with words and AD (analogue of your keyword_list) with the words from ads.

Essentially one can summarize the problem you experience as "full table scan". This is pretty common issue, often highlighting poor design of data layout. Search the net for more information on what can be done. SO has many entries too.

Your explain plan says this query should take a minute, but it's actually taking hours? A simple test on my home PC verifies that a minute seems reasonable for this query. And on a server with some decent IO this should probably only take a few seconds.

Is the problem that you're running the same query dozens of times sequentially for different keywords? If so, you need to combine all the searches together so you only scan the table once.

You could look into Oracle Text indexing. It is designed to support the kind of in-text search you are talking about.

My advice is to raise the cach size to hundreds of gb. Throw hardware at it. If you cant build a Beowulf cluster or build a binAry space search engine.