I have several billion rows of data in CSV files. Each row can have anything from 10 to 20 columns. I want to use COPY FROM to load the data into a table containing 20 columns. If a specific CSV row only contains 10 columns of data, then I expect COPY FROM to set the rest of the columns (for which the values are missing) to NULL. I specify DEFAULT NULL on every column in the CREATE TABLE statement.
MY QUESTION: Can this be开发者_开发技巧 done using COPY FROM?
EDIT: Greenplum (a database based upon PostgreSQL) has a switch named FILL MISSING FIELDS, which does what I describe (see their documentation here). What workarounds would you recommend for PostgreSQL?
Write a pre-processing script to just add some extra commas on the lines that don't have enough columns, or to transform the CSV into TSV (tab-separated) and put "\N" in the extra columns.
I don't think you can make COPY FROM deal with different number of columns inside the same file.
If it's always the same 10 columns that are missing, a workaround could be to first load everything into a staging table that has a single text
column.
After that, you can use SQL to split the line and extract the columns, something like this:
INSERT INTO target_table (col1, col2, col3, col4, col5, ...) SELECT columns[1], columns[2], ... FROM ( SELECT string_to_array(big_column, ',') as columns FROM staging_table ) t WHERE array_length(columns) = 10
and then do a similar thing with array_length(columns) = 20
In a context of etl
and data-warehouse
-- my suggestion would be to actually avoid the "shortcut" you are looking for.
ETL is a process, frequently implemented as ECCD (Extract, Clean, Conform, Deliver). You could treat those files as "Extracted", so simply implement data cleaning and conforming as different steps -- you will need some extra disk space for that. All conformed files should have the "final" (all columns) structure. Then deliver (COPY FROM
) those conformed files.
This way you will also be able to document the ETL process and what happens to the missing fields in each step.
It is a usual practice to archive (disk, DVD) original customer files and conformed versions for audit and debug purposes.
From the PostgreSQL manual:
COPY FROM will raise an error if any line of the input file contains more or fewer columns than are expected.
Read the first line of your CSV file to see how many columns you have to name in the COPY statement.
精彩评论