How can I avoid NULLs in my database, while also representing missing data?_问答_开发者

In SQL and Relational Theory (C.J. Date, 2009) chapter 4 advocates avoiding duplicate rows, and also to avoid NULL attributes in the data we store. While I have no troubles avoiding duplicate rows, I am struggling to see how I can model data without making use of NULL. Take the following, for example - which is a bit from work.

We have an artist table, which has, amongst other columns, a gender column. This is a foreign key to the gender table. However, for some artists, we don't know their gender - for example we've been given a list of new music which has no descriptions of the artist. How, without using NULL is one meant to represent this data? The only solution I see is to add a new gender, "unknown", to the gender table.

While I am thoroughly enjoying this book, I was really disappointed when the chapter concluded with:

Of course, if nulls are prohibited, then missing information will have to be handled by some other means. Unfortunately, those other means are much too complex to be discussed in detail here.

Which is a real shame - because this was the solution I was waiting to read about! There is a reference to read the appendix which has lots of publications to read, bu开发者_如何学Pythont I was hoping for a little bit more of a down to earth summary before I dived into reading these.

I'm getting a few people commenting that they don't understand why I wish to avoid 'NULL' so I will quote the book again. Take the following query:

SELECT s.sno, p.pno
  FROM s, p
 WHERE s.city <> p.city
    OR p.city <> 'Paris'

Now, take the example that s.city is London, and p.city is Paris. In this case, London <> Paris, so the query is true. Now take the case that p.city is not Paris, and is infact xyz. In this case, (London <> xyz) OR (xyz <> Paris) is also True. So, given any data - this query is true. However, if xyz is 'NULL' the scenario changes. In this case both of these expressions are neither True nor False, they are in fact, Unknown. And in this case because the result is unknown you will not get any rows returned.

The move from 2 value logic to 3 value logic can easily introduce bugs like this. Infact, I just introduced one at work which motivated this very post. I wanted all rows where the type != 0 However, this actually ends up matching type == 0 OR type IS NULL - confusing behavior.

Whether or not I model my data with or without NULL in the future is unclear, but I'm very curious what the other solutions are. (I too have always been of the argument that if you don't know, you should use NULL).

Good on you, for eliminating Nulls. I have never allowed Nulls in any of my databases.

Of course, if nulls are prohibited, then missing information will have to be handled by some other means. Unfortunately, those other means are much too complex to be discussed in detail here.

Actually it is not so hard at all. There are three alternatives.

Here's a paper on How To Handle Missing Information Without Using NULL by H Darwen, that may help to get your head around the problem.

1.1. Sixth Normal Form is the answer. But you do not have to normalise your entire database to 6NF. For each column that is optional, you need a child table off the main table, with just the PK, which is also the FK, because it is a 1::0-1 relation. Other than the PK, the only column is the optional column.

Look at this Data Model; AssetSerial on page 4 is a classic case: not allAssets have SerialNumbers; but when they do, I want them to store them; more important I want to ensure that they are Unique.

(For the OO people out there, incidentally, that is a three level class diagram in Relational notation, a "Concrete Table Inheritance", no big deal, we've had it fro 30 years.)

1.2. For each such table, use a View to provide the 5NF form of the table. Sure, use Null (or any value that is appropriate for the column) to identify the absence of the column for any row. But do not update via the view.

1.3 Do not use straight joins to grab the 6NF column. Do not use outer joins, either (and have the server fill in a Null for the missing rows). Use a subquery to populate the column, and specify the value that you want returned for a missing value (except if you have Oracle, because its Subquery processing is even worse than its set processing). Eg. and just an eg. you can convert a numeric column to string, and use "Missing" for the missing rows.

When you do not want to go that far (6NF), you have two more options.

You can use Null substitutes. I use CHAR(0) for character colomns and 0 for numeric. But I do not allow that for FKs. Obviously you need a value that is outside the normal range of data. This does not allow Three Valued Logic.
In addition to (2), for each Nullable column, you need a boolean Indicator. For the example of the Sex column, the Indicator would be something like SexIsMissing or SexLess (sorry). This allows very tight Three Valued Logic. Many people in that 5% like it because the db remains at 5NF (and less tables); the columns with missing info are loaded with values that are never used; they are only used if the Indicator is false. If you have an enterprise db, you can wrap that in a Function, and always use the UDF, not the raw column.

Of course, in all cases, you can never get away from writing code that is required to handle the missing info. Whether it is ISNULL(), or a subquery for the 6NF column, or an Indicator to check before using the value, or an UDF.

If Null has a specific meaning ... then it is not a Null! By definition, Null is the Unknown Value.

So how do you design without NULLS? That was the original question.

It's actually quite easy. You design such that whenever you have to leave some data missing, you can do so by leaving a whole row missing. If a row isn't there, it isn't a row full of NULLs. It just plain isn't there.

So, in the case of "DateOfDeath", we have a table with two columns, namely, PersonId and DateOfDeath. PersonId references Id in the Persons table. If there is no DateOfDeath to be stored, we don't store the row. End of discussion.

If you do an OUTER JOIN between this and the Persons table, you'll get a NULL for the DateOfDeath wherever there was no row. And if you use this in a where clause, you'll get the usual perplexing behavior concerning 3-value logic. If you do an INNER JOIN, the rows for which there is no DateOfDeath will simply disappear from the join.

A design that permits every column to be NOT NULL enforced has been called sixth normal form.

Having said all that, I often allow NULLs in non critical columns. And I don't have a succinct way of telling you how I determine that a column is critical.

Quite simply by storing only the known information - in other words the Closed World Assumption. Aim to be in at least Boyce Codd / Fifth Normal Form and you won't go far wrong.

nulls are a consequence of theory meeting reality and having to be adjusted to be usable. In my opinion attempting to avoid all null values will ultimately lead to uglier and less maintainable code than just using null where appropriate.

NULLs are required - theres no need to replace them

The enitre definition of NULL is that its unknown - simply replacing this with arbitrary type is doing the same thing, so why?

For the comments below:

Just tried this - neither is true:

declare @x char
set @x = null

if @x = @x
begin
select 'true'
end

if @x <> @x
begin
select 'false'
end

I can only take this to mean that because null is unknown then it can't be said that it equals or does not equal - hence both statements are false

NULL could/should be used as long as:

A) You have a business reason. For example, in a table of payments, a NULL payment value would mean it was never paid. A 0.00 payment value would mean we intentionally paid nothing. For medical charts, a NULL value for a blood pressure reading would mean you didn't take a BP, a 0 value would mean the patient is dead. This is a significant distinction, and necessary in certain applications.

B) Your queries account for it. If you understand the affect of NULL on IN, EXISTS, inequality operators (like you specified in OP), etc. then it shouldn't be an issue. If you have NULL now in your tables and don't want the value for certain applications, you can employ views and either COALESCE or ISNULL to populate different values if the source table has a NULL.

EDIT:

To address OP's questions about "real world" inequalities/equalities using NULL, this is a great example I use sometimes.

You are at a party with 3 other people. You know that one person is named "John" but don't know the others.

Logically, the answer for "How many people are named Joe" is unknown or NULL. In SQL, this would be something like

SELECT name FROM party where NAME = 'Joe' You would get no rows since you don't know their names. They may or may not be Joe.

Your inequality would be:

SELECT name from party where NAME <> 'Joe' You would only get a return value for "John" since John's name is all you know. The other people may or may not be Joe, but you have no way to know.

I disagree with the author and would claim that NULL is actually the CORRECT way to handle missing data for optional fields. In fact, it's the reason that NULL exists at all...

For your specific problem regarding gender:

Are you sure you want a gender table and incur the cost of an extra join for every query? For simple enumerated types it's not unreasonable to make the field an int and define 1=male, 2=female, NULL=unknown.

Do not allow a column to be defined as NULL if at all possible. For me it does not have anything to do with the business rule of what you want NULL to mean it has to do with disk I\O.

In SQL Server a nullable column, say a character 10, will take one bit in a bitmap when null and 10 bytes when not nullable. So how does having a null hurt disk I/O. The way it hurts is when a value is inserted into a column where a null used to be. Since SQL did not reserve space there is not room in the row to just put the value so SQL Server has to shift data around to make room. Page splits, fragmentation, updating the RID if this is a HEAP, etc all hurt disk I/O.

BTW if there is a gender table we could add another row for "Unable to determine the true sexual origin or state of the individual".