I want to merge the following data using sql:
no  code  area  rate  startdate  enddate
1   0101  EU    0.1%  20050101   20051231
2   0101  EU    0.1%  20060101   20061231
3   0101  EU    0.1%  20080101   20081231
4   0101  EFTA  0.2%  20050101   20051231
5   0101  EFTA  0.1%  20060101   20061231
- In the above example no 1 and 2 should be merged as they have the same code, same rate, have subsequent dates and are from the same area.
- No 3 should not be merged as the dates are not subsequent.
- 4 and 5 should not be merged as they have different rates.
The end result should look as follows:
no  code  area  rate  startdate  enddate
1   0101  EU    0.1%  20050101   20061231
3   0101  EU    0.1%  20080101   20081231
4   0101  EFTA  0.2%  20050101   20051231
5   0101  EFTA  0.1%  20060101   20061231
Is there a way to do this using only sql? I am using postgres.
Thanks
Assuming this is a one-off thing you want to do, and isn't part of some ongoing production solution, I think the simplest way to do this is with a cursor.
I don't know Postgres, but it seems they have cursors which work in a similar way to MS-Sql: http://www.postgresql.org/docs/current/static/plpgsql-cursors.html
Cursors allow you to "loop through" a select statement, putting the values into temporary variables where you can then work on them. It's more akin to procedural programming.
Performance isn't great, which is why it's OK as a one-off thing, but there may be a more complex (but more performant) solution if you have performance issues.
I would start by creating a cursor to loop through the table, putting the code, area, rate, startdate and enddate into variables. (If you sort the select statement by code, area, rate and startdate you'll guarantee that all your records will appear in the right order for you.) For each record you can compare the current code, area and rate to the values in the variables, and if they're the same and the start date directly follows the stored end date, update the end date variable.
Use a 2nd table (or a table variable) for the results. If the record doesn't match the variables, then everything in the variables is a new row candidate. Write it out, then update the variables with the new data and continue. (Remember to write out the last candidate after the loop ends.)
In one pass you'll have created your new table. You can either then delete the original one and replace, or store it elsewhere.
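To make the loop concrete, here is a rough sketch of the single pass, written in Python against an in-memory SQLite table purely so it can be run as-is; in Postgres the same logic would live in a PL/pgSQL loop over a cursor. The table name your_table, the column name row_no (standing in for "no") and the helper next_day are assumptions made for the illustration, and dates are assumed to be stored as YYYYMMDD text.

```python
import sqlite3
from datetime import datetime, timedelta

ROWS = [
    (1, "0101", "EU",   "0.1%", "20050101", "20051231"),
    (2, "0101", "EU",   "0.1%", "20060101", "20061231"),
    (3, "0101", "EU",   "0.1%", "20080101", "20081231"),
    (4, "0101", "EFTA", "0.2%", "20050101", "20051231"),
    (5, "0101", "EFTA", "0.1%", "20060101", "20061231"),
]

def next_day(yyyymmdd):
    """Hypothetical helper: the day after a YYYYMMDD string, same format."""
    day = datetime.strptime(yyyymmdd, "%Y%m%d") + timedelta(days=1)
    return day.strftime("%Y%m%d")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_table (row_no INTEGER, code TEXT, area TEXT, "
    "rate TEXT, startdate TEXT, enddate TEXT)"
)
conn.executemany("INSERT INTO your_table VALUES (?, ?, ?, ?, ?, ?)", ROWS)

merged = []      # plays the role of the 2nd (result) table
current = None   # the "variables": [row_no, code, area, rate, start, end]

cursor = conn.execute(
    "SELECT row_no, code, area, rate, startdate, enddate "
    "FROM your_table ORDER BY code, area, rate, startdate"
)
for row_no, code, area, rate, start, end in cursor:
    same_key = current is not None and current[1:4] == [code, area, rate]
    if same_key and start == next_day(current[5]):
        # Same code/area/rate and no gap: just extend the end date.
        current[5] = end
    else:
        # Anything held in the variables is a finished row: write it out.
        if current is not None:
            merged.append(tuple(current))
        current = [row_no, code, area, rate, start, end]

# Do not forget the last candidate after the loop.
if current is not None:
    merged.append(tuple(current))

merged.sort()  # back to the original numbering order for display
for row in merged:
    print(row)
```

The check on the start date is what keeps row 3 separate; comparing only code and rate would merge it into row 1 despite the gap.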
Maybe I'm missing something, but to me it seems that you want:
SELECT DISTINCT ON (code, area, rate, startdate, enddate) no, code, area, rate, startdate, enddate FROM your_table
I have faced the same problem. The approach I use is to 'divide' the ordered list into 'groups' of rows having subsequent dates, by marking the first ('original') row of each 'group', which is the one that is not subsequent to its predecessor, and then finding the last row of each 'group'. The end date of that last row is the 'aggregated' end date.
Create an additional table with a new column Subsequent (int). Copy all rows from the original table into it, filling the new column as follows: if the start date of the row directly follows the end date of the previous row (in the example, it is the next day), then Subsequent = 1 (this row is subsequent to the previous one); otherwise 0 (this row starts a new sequence). To find the end date of the previous row, use the lag() function partitioned by code, area and rate and ordered by start date. For instance, in the example above line no. 2 will have Subsequent = 1 and all others Subsequent = 0.
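A small runnable sketch of this flagging step follows. It uses Python's sqlite3 module as a stand-in for Postgres (window functions need SQLite 3.25 or newer), so the date arithmetic is in SQLite's dialect; in Postgres the "next day" expression would be written differently. The names your_table and row_no are assumptions, dates are assumed to be YYYYMMDD text, and rate is included in the partition on the assumption that a rate change should always start a new sequence.

```python
import sqlite3

ROWS = [
    (1, "0101", "EU",   "0.1%", "20050101", "20051231"),
    (2, "0101", "EU",   "0.1%", "20060101", "20061231"),
    (3, "0101", "EU",   "0.1%", "20080101", "20081231"),
    (4, "0101", "EFTA", "0.2%", "20050101", "20051231"),
    (5, "0101", "EFTA", "0.1%", "20060101", "20061231"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_table (row_no INTEGER, code TEXT, area TEXT, "
    "rate TEXT, startdate TEXT, enddate TEXT)"
)
conn.executemany("INSERT INTO your_table VALUES (?, ?, ?, ?, ?, ?)", ROWS)

# Inner query: fetch the previous row's end date with lag().
# Outer query: flag the row as subsequent when its start date is the
# day after that previous end date (a NULL prev_end yields 0).
FLAG_SQL = """
SELECT row_no,
       CASE WHEN startdate = strftime('%Y%m%d',
                     substr(prev_end, 1, 4) || '-' ||
                     substr(prev_end, 5, 2) || '-' ||
                     substr(prev_end, 7, 2),
                     '+1 day')
            THEN 1 ELSE 0
       END AS subsequent
FROM (
    SELECT row_no, startdate,
           lag(enddate) OVER (
               PARTITION BY code, area, rate
               ORDER BY startdate
           ) AS prev_end
    FROM your_table
) AS t
ORDER BY row_no
"""

flags = dict(conn.execute(FLAG_SQL).fetchall())
for row_no, subsequent in flags.items():
    print(row_no, subsequent)
```

Only row 2 comes out flagged, because row 3 starts a year after row 2 ends and rows 4 and 5 sit in different rate partitions.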
Write a query which will select from above table only rows where Subsequent = 0, i.e. 'original' ones, with code and rate, and then find the maximum end date among all subsequent rows following after the original row, for each original row selected.
Maximum end date can be found using subquery like this:
(A) Try to find it among subsequent rows following this original one:
select the maximum end date from all rows which meet these criteria:
- subsequent is 1, i.e. only subsequent rows
- code equals the code of the original row
- rate equals the rate of the original row
- start date greater than start date of original row
- end date less than start date of next original row, or, if there is no next original row, then end date = max of all end dates over this combination of code and rate
Next original row can be found by another subquery: select all rows where
- subsequent = 0
- start date > start date of original row
- code and rate are the same as in original row.
(B) If above returned null, i.e. the original row has no subsequent rows, then end date for this row is its own end date.
The coalesce function chooses between (A) and (B) nicely.
The result of the query should be the list you want to get in your example.
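Putting (A) and (B) together, the whole thing might look roughly like the sketch below. Again it runs the SQL through Python's sqlite3 module (SQLite 3.25 or newer) only so that it is executable as-is; apart from the date arithmetic the query should carry over to Postgres with little change. The names your_table, row_no and flagged are assumptions, area is matched along with code and rate, and '99999999' is an arbitrary sentinel used when there is no next original row.

```python
import sqlite3

ROWS = [
    (1, "0101", "EU",   "0.1%", "20050101", "20051231"),
    (2, "0101", "EU",   "0.1%", "20060101", "20061231"),
    (3, "0101", "EU",   "0.1%", "20080101", "20081231"),
    (4, "0101", "EFTA", "0.2%", "20050101", "20051231"),
    (5, "0101", "EFTA", "0.1%", "20060101", "20061231"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_table (row_no INTEGER, code TEXT, area TEXT, "
    "rate TEXT, startdate TEXT, enddate TEXT)"
)
conn.executemany("INSERT INTO your_table VALUES (?, ?, ?, ?, ?, ?)", ROWS)

MERGE_SQL = """
WITH flagged AS (
    -- Step 1: mark rows that directly follow their predecessor.
    SELECT t.*,
           CASE WHEN startdate = strftime('%Y%m%d',
                         substr(prev_end, 1, 4) || '-' ||
                         substr(prev_end, 5, 2) || '-' ||
                         substr(prev_end, 7, 2),
                         '+1 day')
                THEN 1 ELSE 0
           END AS subsequent
    FROM (
        SELECT y.*,
               lag(enddate) OVER (
                   PARTITION BY code, area, rate
                   ORDER BY startdate
               ) AS prev_end
        FROM your_table AS y
    ) AS t
)
-- Step 2: keep the 'original' rows and look up their merged end date.
SELECT o.row_no, o.code, o.area, o.rate, o.startdate,
       COALESCE(
           -- (A) latest end date among the subsequent rows of this group
           (SELECT max(s.enddate)
            FROM flagged AS s
            WHERE s.subsequent = 1
              AND s.code = o.code AND s.area = o.area AND s.rate = o.rate
              AND s.startdate > o.startdate
              AND s.enddate < COALESCE(
                      -- start date of the next original row, if any
                      (SELECT min(n.startdate)
                       FROM flagged AS n
                       WHERE n.subsequent = 0
                         AND n.code = o.code AND n.area = o.area
                         AND n.rate = o.rate
                         AND n.startdate > o.startdate),
                      '99999999')),
           -- (B) no subsequent rows: keep the row's own end date
           o.enddate
       ) AS enddate
FROM flagged AS o
WHERE o.subsequent = 0
ORDER BY o.row_no
"""

result = conn.execute(MERGE_SQL).fetchall()
for row in result:
    print(row)
```

Comparing YYYYMMDD strings with < and max() works here only because that format sorts the same way lexically and chronologically; with a real date column the comparisons would be on dates directly.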