Database ETL Design Question_问答_开发者_运维开发者技术经验分享

A dataset I receive for routine refresh purposes contains a date field that's actually VARCHAR.

As this will be an indexed/searched field, I'm left with...

1) Converting the field to DATETIME and validating and normalizing the data values when refreshing

or...

2) Leavi开发者_开发技巧ng the data as-is and forming my queries to accommodate various valid date formats, i.e., WHERE DateField = 'CCYYMMDD' OR DateField = 'MM/DD/CCYY' OR ....

The refresh would be on a monthly basis; "cleaning" the data would add about 35% time to the ETL cycle. My queries on the date field would all be equalities; I do not need to range search. Also, I'm a one man shop, so the more hands-off the overall solution the better.

So which scenario am I better off doing? All opinions appreciated.

I think this is a great question. Here's my opinion:

I'm a big believer in the idea that in the long run you'll save more time and have fewer headaches by using data types for the purpose for which they were intended. That means dates in date fields, characters in character fields, etc. If you go with option 2 you'll need to remember to code for all the various possible date formats every time you query the table. If you set this down and come back a year from now, are you going to remember?

By contrast, if you use a date field and do the upfront work in the ETL process of dealing with the dates properly, you will always know just how to interact with the field. And I'm not even going into performance implications.

And in this case, I'm not sure you'll even see a short-term benefit. If there are, for example 5 different possible date formats in the source data, you'll need to account for those one way or another; either in the ETL or in the output queries. The code to transform those 5 formats in ETL is not materially more complicated than the code to manage those 5 formats in the output queries.

And if the data could literally arrive in an infinite number of formats, you have big problems either way. Either your ETL will break or your queries will break. It is, to a certain extent, an irreducible complexity.

I would suggest that you take the time to code the proper transforms into your ETL. But do yourself a favor and code a preprocessing step that identifies dates in formats that won't properly transform and alerts you to them. If you see patterns; i.e., if any format shows up more than once, code a transform for it. Over time you'll be left manually cleaning fewer and fewer of those nasty dates. With luck, your 35% will drop to 5% or less.

Good luck!

You are better off cleaning the data. First dates which are not good dates are meaningless so it's pointless to store them. Second, it is harder to fix a bad datatype choice later than it is to never make it. Querying will not only be easier but it will be faster than if you use a varchar. And things like ordering will work correctly as well as date functions. Third, I can't imagine that cleaning this would add that much to your import, I clean data all the time without it being a problem. But if it does, then clean the data in a staging table that no other process is using (so you aren't affecting users on prod) and then do the load to the prod tables from nice clean data.

Clean the data up front and store the dates as dates.

I work with systems that store dates as strings and there appear to be an unlimited number of ways to store the dates. This makes it very difficult to create a query to will work against a future new date format.

If you store dates as strings then you should apply constraints to make sure the data is stored in the proper format. Or, just convert the date strings to dates and let the database apply the valid date constraint itself. It is usually best to let the database do the work for you.

Definitely better off cleaning the data and loading into date column as this will ensure the integrity.