开发者

Sorting of large dataset fails to complete

开发者 https://www.devze.com 2023-02-25 14:14 出处:网络
This is a followup to debugging a mysql insert fail in php as I now have time to continue on the project.

This is a followup to debugging a mysql insert fail in php as I now have time to continue on the project.

I have three tables in this problem:

840,721 posters in poster_data

58,506 poster categories in poster_categories

17,629,007 (17 million+) poster/category combinations in poster_prodcat

Based on the Efficient Pagination presentation by Yahoo!, I'm trying to add a category rank number to poster_prodcat so we can page by rank as opposed to using limits and offsets. Despite cranking up my php mysql connect timeout to 3600 (crazy 开发者_如何学JAVAI know) and turning of the php timeout, the ordering never seems to complete. maybe several tens to maybe a hundred thousand or so, but never the full 17,000,000 set.

Here's the script:

$sql1="select distinct apcatnum from poster_prodcat";
$result1 = mysql_query($sql1);

while ($cats = mysql_fetch_array ($result1)) {
  $sql2 = "SELECT poster_data.apnumber,poster_data.aptitle 
           FROM poster_prodcat,poster_data 
           WHERE poster_prodcat.apcatnum ='$cats[apcatnum]' 
           AND poster_data.apnumber = poster_prodcat.apnumber 
           ORDER BY aptitle ASC";
  $result2 = mysql_query($sql2);
  $ordernum=1;

  while ($order = mysql_fetch_array ($result2)) {
    $sql3 = "UPDATE poster_prodcat SET catorder='$ordernum' 
             WHERE apnumber='$order[apnumber]' AND apcatnum='$cats[apcatnum]'";
    $result3 = mysql_query($sql3);
    $ordernum++;
    }
  }

This is on a 2 gig server that also hosts the site. The timeouts are long and the server's not crashing, so I don't see what's stopping it from completing. Can I do this on this server, or since this is a once a month or so operation, should I just create some massive memory EC2 instance, do the sorting there and downloaded the massaged tables?

Thanks.

Here's the structure of poster_data (with some fields removed that aren't involved in the selects):

CREATE  TABLE  `poster_data` (
`apnumber` mediumint( 8  )  NOT  NULL DEFAULT  '0',
`aptitle` varchar( 255  )  NOT  NULL DEFAULT  '',
`aptype` varchar( 100 ) NOT NULL DEFAULT '',
[snip]
UNIQUE  KEY  `posterid` (  `apnumber`  ) ,
KEY  `aptitle` (  `aptitle`  ) ,
KEY  `aptype` (  `aptype`  ) ,
KEY  `title_type` (  `aptitle` ,  `aptype`  )  ) ENGINE  = InnoDB;

poster_prodcat:

CREATE TABLE `poster_prodcat` (
`apcatnum` mediumint( 8 ) NOT NULL DEFAULT '0',
`apnumber` mediumint( 8 ) NOT NULL DEFAULT '0',
`catorder` mediumint( 7 ) NOT NULL DEFAULT '0',
PRIMARY KEY ( `apcatnum` , `apnumber` ) ,
KEY `apcatnum` ( `apcatnum` ) ,
KEY `apnumber` ( `apnumber` ) ,
KEY `catorder` ( `catorder` )
) ENGINE = InnoDB /*!50100 PARTITION BY HASH (apcatnum) PARTITIONS 10 */;


  while ($order = mysql_fetch_array ($result2)) {
    $sql3 = "UPDATE poster_prodcat SET catorder='$ordernum' 
             WHERE apnumber='$order[apnumber]' AND apcatnum='$cats[apcatnum]'";
    $result3 = mysql_query($sql3);
    $ordernum++;
    }
  }

Does this mean you're executing 17 million separate transactions? If you can sustain a thousand transactions per second, this part alone will take about 5 hours, right?

Wikipedia (readily available, but not what I'd consider authoritative) says this about innodb transactions.

When operating in fully ACID-compliant modes, InnoDB must do a flush to disk at least once per transaction, though it will combine flushes for inserts from multiple connections. For typical rotating hard drives or arrays, this will impose a limit of about 200 update transactions per second.

0

精彩评论

暂无评论...
验证码 换一张
取 消