JOIN query is far too slow. Won't use INDEX?

I have a transitional table that I temporarily fill with some values before querying it and destroying it.

CREATE TABLE SearchListA (
  `pTime` int unsigned NOT NULL,
  `STD` double unsigned NOT NULL,
  `STD_Pos` int unsigned NOT NULL,
  `SearchEnd` int unsigned NOT NULL,
  UNIQUE INDEX (`pTime`, `STD` ASC) USING BTREE
) ENGINE = MEMORY;

It looks as such:

+------------+------------+---------+------------+
| pTime      | STD        | STD_Pos | SearchEnd  |
+------------+------------+---------+------------+
| 1105715400 | 1.58474499 |       0 | 1105723200 |
| 1106297700 |  2.5997839 |       0 | 1106544000 |
| 1107440400 | 2.04860375 |       0 | 1107440700 |
| 1107440700 | 1.58864998 |       0 | 1107467400 |
| 1107467400 | 1.55207218 |       0 | 1107790500 |
| 1107790500 | 2.04239417 |       0 | 1108022100 |
| 1108022100 | 1.61385678 |       0 | 1108128000 |
| 1108771500 | 1.58835083 |       0 | 1108771800 |
| 1108771800 | 1.65734727 |       0 | 1108772100 |
| 1108772100 | 2.09378189 |       0 | 1109027700 |
+------------+------------+---------+------------+

Only columns pTime and SearchEnd are relevant to my problem.

My intention is to use this table to speed up searching through a much larger, static table.

The first column, pTime, is where the search should start.

The fourth column, SearchEnd, is where the search should end.

The larger table is similar; it looks like this:

CREATE TABLE `b50d1_abs` (
  `pTime` int(10) unsigned NOT NULL,
  `Slope` double NOT NULL,
  `STD` double NOT NULL,
  `Slope_Pos` int(11) NOT NULL,
  `STD_Pos` int(11) NOT NULL,
  PRIMARY KEY (`pTime`),
  KEY `Slope` (`Slope`) USING BTREE,
  KEY `STD` (`STD`),
  KEY `ID1` (`pTime`,`STD`) USING BTREE
) ENGINE=MyISAM DEFAULT CHARSET=latin1 MIN_ROWS=339331 MAX_ROWS=539331 PACK_KEYS=1 ROW_FORMAT=FIXED;

+------------+-------------+------------+-----------+---------+
| pTime      | Slope       | STD        | Slope_Pos | STD_Pos |
+------------+-------------+------------+-----------+---------+
| 1107309300 |  1.63257919 | 1.39241698 |         0 |       1 |
| 1107314400 |   6.8959276 | 0.22425643 |         1 |       1 |
| 1107323100 | 18.19909502 | 1.46854808 |         1 |       0 |
| 1107335400 |  2.50135747 |  0.4736305 |         0 |       0 |
| 1107362100 |  4.28778281 | 0.85576985 |         0 |       1 |
| 1107363300 |  6.96289593 | 1.41299044 |         0 |       0 |
| 1107363900 |  8.10316742 |  0.2859726 |         0 |       0 |
| 1107367500 | 16.62443439 | 0.61587645 |         0 |       0 |
| 1107368400 | 19.37918552 | 1.18746968 |         0 |       0 |
| 1107369300 | 21.94570136 | 0.94261744 |         0 |       0 |
| 1107371400 | 25.85701357 |  0.2741292 |         0 |       1 |
| 1107375300 | 21.98914027 | 1.59521158 |         0 |       1 |
| 1107375600 | 20.80542986 | 1.59231289 |         0 |       1 |
| 1107375900 | 19.62714932 | 1.50661679 |         0 |       1 |
| 1107381900 |  8.23167421 | 0.98048205 |         1 |       1 |
| 1107383400 | 10.68778281 | 1.41607579 |         1 |       0 |
+------------+-------------+------------+-----------+---------+
...etc (439340 rows)

Here, the columns pTime, STD, and STD_Pos are relevant to my problem.

For every element in the smaller table (SearchListA), I need to search the specified range within the larger table (b50d1_abs) and return the row with the lowest b50d1_abs.pTime that is higher than the current SearchListA.pTime and that also matches the following conditions:

SearchListA.STD < b50d1_abs.STD AND SearchListA.STD_Pos <> b50d1_abs.STD_Pos

AND

b50d1_abs.pTime < SearchListA.SearchEnd

The latter condition is simply to reduce the length of the search.
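
To make that concrete: for a single row of SearchListA, the lookup I want is equivalent to something like this (values taken from the first sample row above; just an illustration, not the query I actually run):

-- One-off probe for the first SearchListA row
-- (pTime = 1105715400, STD = 1.58474499, STD_Pos = 0, SearchEnd = 1105723200)
SELECT pTime, STD, STD_Pos
FROM b50d1_abs
WHERE pTime > 1105715400      -- higher than SearchListA.pTime
  AND pTime < 1105723200      -- below SearchListA.SearchEnd
  AND STD > 1.58474499        -- SearchListA.STD < b50d1_abs.STD
  AND STD_Pos <> 0            -- SearchListA.STD_Pos <> b50d1_abs.STD_Pos
ORDER BY pTime                -- lowest matching pTime first
LIMIT 1;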

This seems to me like a pretty straightforward query that should be able to use indexes, especially since all values are unsigned numbers, but I cannot get it to execute nearly fast enough! I think it is because it rescans the entire table each time instead of just omitting values from it.

I would be extremely grateful if someone takes a look at my code and figures out a more efficient way to go about this:

SELECT
    m.pTime as OpenTime,
    m.STD,
    m.STD_Pos,
    mu.pTime AS CloseTime
FROM
    SearchListA m
    JOIN b50d1_abs mu ON mu.pTime = (
        SELECT
            md.pTime
        FROM
            b50d1_abs as md
        WHERE
            md.pTime > m.pTime
            AND md.pTime <= m.SearchEnd
            AND m.STD < md.STD AND m.STD_Pos <> md.STD_Pos
        LIMIT 1
    );

Here is my EXPLAIN EXTENDED output:

+----+--------------------+-------+--------+-----------------+---------+---------+------+--------+----------+--------------------------+
| id | select_type        | table | type   | possible_keys   | key     | key_len | ref  | rows   | filtered | Extra                    |
+----+--------------------+-------+--------+-----------------+---------+---------+------+--------+----------+--------------------------+
|  1 | PRIMARY            | m     | ALL    | NULL            | NULL    | NULL    | NULL |    365 |   100.00 |                          |
|  1 | PRIMARY            | mu    | eq_ref | PRIMARY,ID1     | PRIMARY | 4       | func |      1 |   100.00 | Using where; Using index |
|  2 | DEPENDENT SUBQUERY | md    | ALL    | PRIMARY,STD,ID1 | NULL    | NULL    | NULL | 439340 |   100.00 | Using where              |
+----+--------------------+-------+--------+-----------------+---------+---------+------+--------+----------+--------------------------+

It looks like the slowest step (row #2, the dependent subquery) doesn't use indexes at all!

If I try FORCE INDEX, the index is listed under possible_keys, but key still shows NULL and the query still takes an extremely long time (over 80 seconds).

I need to get this query under 10 seconds, and even 10 is too long.


Your subquery is a dependent subquery, so the best case is that it's going to be evaluated once for every row in table m. Since m contains few rows, that would be OK.

But if you put that subquery in a JOIN condition, it is going to be executed (rows in m) * (rows in mu) times, no matter what; with 365 rows in m and 439,340 rows in mu, that is roughly 160 million evaluations.

Note that your results may be incorrect, since you say:

return the row with the lowest b50d1_abs.pTime

but you don't specify that anywhere.

Try this query:

SELECT   
m.pTime as OpenTime,
m.STD,
m.STD_Pos,
(
    SELECT min( big.pTime )
    FROM   b50d1_abs as big
    WHERE   big.pTime >  m.pTime
        AND big.pTime <= m.SearchEnd
        AND m.STD < big.STD AND m.STD_Pos <> big.STD_Pos
) AS CloseTime
FROM SearchListA m

or this one:

SELECT   
m.pTime as OpenTime,
m.STD,
m.STD_Pos,
min( big.pTime ) AS CloseTime
FROM   
    SearchListA m
    JOIN b50d1_abs as big ON (
        big.pTime >  m.pTime
        AND big.pTime <= m.SearchEnd
        AND m.STD < big.STD AND m.STD_Pos <> big.STD_Pos
    )
GROUP BY m.pTime

(if you also want rows where the search was unsuccessful, make that a LEFT JOIN).
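
For example, the LEFT JOIN version of query 2 would be the following (a sketch; unmatched rows come back with a NULL CloseTime):

SELECT
m.pTime as OpenTime,
m.STD,
m.STD_Pos,
min( big.pTime ) AS CloseTime
FROM
    SearchListA m
    LEFT JOIN b50d1_abs as big ON (
        big.pTime >  m.pTime
        AND big.pTime <= m.SearchEnd
        AND m.STD < big.STD AND m.STD_Pos <> big.STD_Pos
    )
GROUP BY m.pTime

And here is a third variant, using ORDER BY ... LIMIT 1 instead of min():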

SELECT   
m.pTime as OpenTime,
m.STD,
m.STD_Pos,
(
    SELECT big.pTime
    FROM   b50d1_abs as big
    WHERE   big.pTime >  m.pTime
        AND big.pTime <= m.SearchEnd
        AND m.STD < big.STD AND m.STD_Pos <> big.STD_Pos
    ORDER BY big.pTime LIMIT 1
) AS CloseTime
FROM SearchListA m

(Try an index on b50d1_abs(pTime, STD, STD_Pos).)
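
Something like this (the index name ID2 is arbitrary):

ALTER TABLE b50d1_abs ADD INDEX ID2 (pTime, STD, STD_Pos);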

FYI, here are some tests using Postgres on a test data set that should look like yours (maybe remotely, lol):

CREATE TABLE small ( 
    pTime INT PRIMARY KEY, 
    STD FLOAT NOT NULL, 
    STD_POS BOOL NOT NULL, 
    SearchEnd INT NOT NULL 
);

CREATE TABLE big( 
    pTime INTEGER PRIMARY KEY, 
    Slope FLOAT NOT NULL, 
    STD FLOAT NOT NULL, 
    Slope_Pos BOOL NOT NULL, 
    STD_POS BOOL NOT NULL 
);

INSERT INTO small SELECT 
    n*100000, 
    random(), 
    random()<0.1,
    n*100000+random()*50000 
    FROM generate_series( 1, 365 ) n;

INSERT INTO big SELECT 
    n*100,
    random(),
    random(),
    random() > 0.5,
    random() > 0.5
    FROM generate_series( 1, 500000 ) n;

Query 1 :  6.90 ms (yes milliseconds)
Query 2 : 48.20 ms
Query 3 :  6.46 ms
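
If you want to reproduce these numbers at home, psql's \timing switch (or prefixing a query with EXPLAIN ANALYZE) will show where the time goes; just an illustration, e.g. query 1 rewritten against the test tables:

-- In psql: toggle client-side timing, then run the query with a plan dump
\timing
EXPLAIN ANALYZE
SELECT
    m.pTime,
    m.STD,
    m.STD_POS,
    (
        SELECT min( big.pTime )
        FROM   big
        WHERE   big.pTime >  m.pTime
            AND big.pTime <= m.SearchEnd
            AND m.STD < big.STD AND m.STD_POS <> big.STD_POS
    ) AS CloseTime
FROM small m;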


I'll start a new answer because it starts to look like a mess ;)

With your data, using MySQL 5.1.41, I get:

Query 1 : takes forever, Ctrl-C
Query 2 : 520 ms
Query 3 : takes forever, Ctrl-C

EXPLAIN for query 2 looks good:

+----+-------------+-------+------+---------------------+------+---------+------+--------+------------------------------------------------+
| id | select_type | table | type | possible_keys       | key  | key_len | ref  | rows   | Extra                                          |
+----+-------------+-------+------+---------------------+------+---------+------+--------+------------------------------------------------+
|  1 | SIMPLE      | m     | ALL  | PRIMARY,STD,ID1,ID2 | NULL | NULL    | NULL |    743 | Using temporary; Using filesort                |
|  1 | SIMPLE      | big   | ALL  | PRIMARY,ID1,ID2     | NULL | NULL    | NULL | 439340 | Range checked for each record (index map: 0x7) |
+----+-------------+-------+------+---------------------+------+---------+------+--------+------------------------------------------------+

So, I loaded your data into postgres...

Query 1 :  14.8 ms
Query 2 : 100   ms
Query 3 :  14.8 ms (same plan as 1)

In fact, rewriting query 2 in the form of query 1 (or 3) works around a little optimizer shortcoming and finds the optimal query plan for this scenario.

Would you recommend using Postgres over MySQL for this scenario? Speed is extremely important to me.

Well, I don't know why MySQL barfs so much on queries 1 and 3 (which are pretty simple and easy); in fact, it should even beat Postgres (using an index-only scan), but apparently not, eh. You should ask a MySQL specialist!

I'm more used to Postgres... I got fed up with MySQL a long time ago! If you need complex queries, Postgres usually wins big time (but you'll need to re-learn how to optimize and tune your new database)...
