Optimization of psql querys called from perl_问答_开发者

I have been struggling for a while trying to optimize the time and resources consumed by a couple of querys Im calling from a perl script. The problem is as follows:

Expected behavior: There is a table with over 4 millions registers, one of the fields (I'll call it f_path from now on) of this table is a string that stands for directory path that may or may not (most do)contain a substring in a date format. The behavior expected from the script is to take a sub-set of these registers, filter it using the date-like substrings from the f_path and store them in a local file to perform some operations on them and use the data in some graphic-generating functions. The filter is to work as follows: To get all the records with a correspoding date substring in the f_path, that are in a user defined date range. The user defines this range providing a last date (the newest date to be fetched) and a number that represent how many dates to fetch from the last date backwards. The dates follow a somewhat sequential pattern but that can't be counted on (there may be gaps). All the dates that comply with filter parameters p开发者_C百科rovided by the user are to be stored in a variable in perl, in a file, or both, since they are to be used by other scripts later, and are also used to control the graph functions mentioned prevously.

Current behavior: There is no filtering. The current query fetches all the 4 million+ registers and stores them in a file locally, then it runs yet another query on the same table that fetches all the distinct values for the f_path (a couple of values with the same date substring may be picked since the directory path may be differente even if the date is the same) and stores them in a local variable in perl, where later a regex is applied to fetch the date-like substrings and store them in a file. Problem: This is utterly inefficient since it is querying the whole table twice and recovering lots of data that isn't going to be used.

Fixes attempted: Everything I have tried has, sadly, ended up being even slower than the current script:

Create a temp table from a select operation to the table with all the data, using the substring function, a regex, and the filter parameters provided by the user to obtain the set of dates (not the whole directory path that is the value f_path) to be used as a filter. Then make a join between this table and the table with the required data, on the f_path field (that is actually called proj_base), applying the substring function to get the date-like substring and verify if it matches with any of the values in the temp table. This approach has the added disadvantage that as with the current script, I will have to query the table again later in the script to get the set of dates that act as a filter. I might be able to use the temp table I created previosuly, but I'm not sure if it will still be around when I query again for the dates, havent tried that. It looks like this:

psql -U postgres -c "CREATE TEMP TABLE temp_filter_trees AS SELECT DISTINCT substring(proj_base, '[0-9]{4}_[0-9]{2}_[0-9]{2}') AS "TREE" FROM testcase ORDER BY TREE DESC LIMIT 30; SELECT mem_tot_size, mem_size_inc_pct, tot_sim_time, duration, cpu_util_pct, cpu_tot_time, result, test_host, proj_base, topology FROM testcase, temp_filter_trees WHERE substring(proj_base, '[0-9]{4}_[0-9]{2}_[0-9]{2}') = temp_filter_trees.TREE " $proj > $dir/log_file;

Query the table for all the dates that comply with the filter parameters, using the substring function and regex and storing them in a list variable. Then loop applying a regex to the fields of this list variable to assertain that it only contains dates, and to build a regex with all of the dates to be used as a filter parameter in the query for al the data. Use the pattern obtained from the loop as a filter parameter in the query that fetches all the data. It looks like this:

@data = psql -U postgres -c "SELECT DISTINCT substring(proj_base, '[0-9]{4}_[0-9]{2}_[0-9]{2}') AS "TREE" FROM testcase ORDER BY TREE DESC LIMIT 30" $proj;

foreach(@data)
{ 
next unless m/(\d{4}\w{1}\d{2}\w{1}\d{2})$/; 
if(defined $psqlpattern){
$psqlpattern = $psqlpattern ."|$1";
}
else{
$psqlpattern = "'" . $1;
}
push @trees, $1;

}

$psqlpattern = $psqlpattern . "'";

psql -U postgres -c "SELECT mem_tot_size, mem_size_inc_pct, tot_sim_time, duration, cpu_util_pct, cpu_tot_time, result, test_host, proj_base, topology FROM testcase WHERE proj_base ~ $psqlpattern" $proj > $dir/log_file;

Other Ideas: I also thought about just using the substring query on the f_path field in the query that fetches all the data, to check if the date-like substring is in the range specified by the user by just asking if it is older than the the last date and newer than the older date in the filter, that must have been previously fetched anyway. I haven't tried this but seemed that could be a t least a little more efficient that the other things I have tried.

Any insight or help you can provide will be greatly appreciated. Thanks for your time!

The perl i/o operations to read up the records is expensive, especially for 4M records. So the general approach I would recommend is to do as much work as you can within the database, and to concentrate on tuning the database queries. I would also try to achieve this with a single pass and to avoid copying the records to a temp table.

So as a starting point, perhaps see if you can achieve what you want with a single query that uses psql's SIMILAR TO operator, or possibly the regexp_matches function. Then add another condition to filter based on the date.