开发者

Rename data, then delete duplicates

开发者 https://www.devze.com 2022-12-18 05:54 出处:网络
I\'ve got a large dataset, some of which is is duplicate records, which are identifiable by dupes in two fields.

I've got a large dataset, some of which is is duplicate records, which are identifiable by dupes in two fields.

To find these records, the following query works:

SELECT * FROM supplierstuffs
GROUP BY "Supplier Code", "Cost ex Tax"
HAVING count("Des开发者_开发百科cription") > 1

Basically what I want to do is cat together all the values of "Description" to form a single row, then replace all of the duplicated rows with the single row.

This is my half broken query so far, it's kludgy and horrid. My primary goal is to get this working - but if I learn some new tricks in sql along the way that's not at all a bad thing.

UPDATE supplierstuffs SET "Description" = 
(SELECT array_to_string(array_accum("Description"), ', ') FROM supplierstuffs
GROUP BY "Supplier Code", "Cost ex Tax"
HAVING count("Description") > 1)
WHERE .....

This is far as I've gotten. What should I be reading to get a bit further? I've read a couple of books and a lot of webpages on the topic. However in this case I think my problem is not limited lack of SQL (ok, it's not my only problem) but more approaching the problem the wrong way.

EDIT 1:

'Name'; 'Supplier Code'; 'Desciption';
"7CPS PODIUM S/SLV CRICKET POLO";"7CPS";"04 -14, S - 3XL"
"7CP PODIUM CRICKET PANT  ";"7CP";"08 -14, S - 2XL"
"7CPT PODIUM 3/4 SLV CRICKET POLO";"7CPT";"04 -14, S - 3XL"
"7CPL PODIUM L/SLV CRICKET POLO";"7CPL";"04 -14, S - 3XL"
"T444MS Cool dry breathable sporty T-shirts";"T444MS";"XS - 2XL, XS - 2XL"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL, 8-16"

^^ is what I want to create from vv

"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"S - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"8-16"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T444MS Cool dry breathable sporty T-shirts";"T444MS";"XS - 2XL"
"T444MS Cool dry breathable sporty T-shirts";"T444MS";"XS - 2XL"
"T444MS Cool dry breathable sporty T-shirts";"T444MS";"XS - 2XL"
"T444MS Cool dry breathable sporty T-shirts";"T444MS";"XS - 2XL"
"7CP PODIUM CRICKET PANT  ";"7CP";"08 -14"
"7CP PODIUM CRICKET PANT  ";"7CP";"S - 2XL"
"7CPL PODIUM L/SLV CRICKET POLO";"7CPL";"04 -14"
"7CPL PODIUM L/SLV CRICKET POLO";"7CPL";"S - 3XL"
"7CPS PODIUM S/SLV CRICKET POLO";"7CPS";"04 -14"
"7CPS PODIUM S/SLV CRICKET POLO";"7CPS";"S - 3XL"
"7CPT PODIUM 3/4 SLV CRICKET POLO";"7CPT";"04 -14"
"7CPT PODIUM 3/4 SLV CRICKET POLO";"7CPT";"S - 3XL"

^^ noting that the lines not having more than one description line need to remain untouched.

I've so far created the new records in a new table with:

INSERT INTO tmptable
SELECT "Name" , "Supplier Code", array_to_string(array_accum("Description"), ', ')
FROM supplierstuffs

GROUP BY "Name", "Supplier Code", "Description"
    HAVING count("Description") > 1 

So now all that remains is to delete the records that were caught by the cat command. It seems I can't DELETE FROM with a having clause? I'm thinking that DELETE FROM table WHERE oid IN (SELECT OID's using having clause) Will work?

EDIT 2:

SELECT array_accum(oid)
FROM supplierstuffs

GROUP BY "Name", "Supplier Code", "Colour", "Cost ex Tax"
    HAVING count("Description") > 1 

returns a few arrays of 2 oids, all of which need to be delorted. I feel I'm very close, and yet so far. Thanks in advance


The following approach will work

  1. Identify only the duplicate rows and store them in a new table.
  2. Delete duplicate rows from parent table
  3. Concatenate the description column in the table containing only duplicate rows. Concatenate using a group by clause.
  4. Insert all rows from the result of step 3 into the original table.


So what you currently have is something like this ...

DESCRIPTION            SUPPLIER_CODE  COST_EX_TAX
Widget                 X23                  42.00 
Brass gadget           X23                  42.00 
Flange                 X42                  23.00 
Flange, steel          X42                  23.00 

... and what you want is ...

DESCRIPTION            SUPPLIER_CODE  COST_EX_TAX
Brass gadget, Widget   X23                  42.00 
Flange, Flange, steel  X42                  23.00 

This still doesn't seem like the right approach. That concatenated DESCRIPTION seems wrong to me. However you know your data and your customer's requirements better than I do.

0

精彩评论

暂无评论...
验证码 换一张
取 消