(De)Normalization of two relations_问答_开发者

People who read C.J.Date's Introduction to Database System or books of similar level should not have problems with definition of normalization and denormalization.

However, memory is not what it used to be and I find myself often looking at some design and saying that it is not normalized even though I can not find which of the normal forms it is breaking.

The actual example that illustrate it is:

If we have relations

r1 (A, B, C) and r2 (A, D)

with FDs: AB->C and A->D

and r1 represent detailed data, while r2 is summary of that data (in another words each instance of D is a function of values in r1. in this example let it be subtotal of values C according to A from r1).

Example instance

So, even though I can not say that it breaks for example 2NF or 3NF, I seem to be stuck on the idea that the design is still denormalised in the following sense (from Codd, E.F. "Further Normalization of the Data Base Relational Model", p. 34, commenting on the reasons to normalize beyond 1NF):

To free the collection of relations from undesirable insertion, update and deletion dependencies;

To reduce the need for restructuring the collection of relations as new types of data are introduced, and thus increase the life span of application programs;

To make the relational model more informative to users;

To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.

As I can say, that if we define D as a sum of all Cs from r1 where A from r1 is equal to A from r2 then, if we update C in r1 and we don't update D in r2, we can end up with undesirable update dependency and the data ends up in inconsistent state I find this reason to call r1 and r2 denormalized and to think of them as denormalized. (In fact whole r2 is a function of r1 and bring zero new facts into the model; r2 = f(r1))

So the questions are

can we call r1 and r2 denormalized?
if yes, why? if not, why? (according to which rule? or according to which definition?)

NOTE:

To those who find the question(s) interesting enough to put in an answer, I kindly ask to provide either something quotable or to put it in a form of specific assumptions and开发者_开发技巧 conclusions (or in another words, if you are going to put in your opinion, please follow it with some reasoning).

EDIT I accepted dportas answer. I'll try to add a bit to it here: C.J.Date can makes a clear and strict distinction:

Much of design theory has to do with reducing redundancy; normalization reduces redundancy within relvars, orthogonality reduces it across relvars.

quoted from Database in depth: relational theory for practitioners

and on the next page

just as a failure to normalize all the way implies redundancy and can lead to certain anomalies, so too can a failure to adhere to orthogonality.

Assuming AB is a key in r1 and A is a key in r2 then it seems that the schema is in 6NF. The Relational Database Dictionary (Date) defines denormalization as:

Replacing a set of relvars R1, R2, . . ., Rn by their join R, such that for all i the projection of R on the attributes of Ri is guaranteed to be equal to Ri (i = 1, 2, . . ., n).

Fundamentally, normalization/denormalization is about composition and nonloss decomposition using projection and join operators. In this example you have redundancy caused by a different operator: summation. I expect it would be quite possible in principle to form a theory of "normalization" for operators other than projection and join, perhaps even for non-relational functions like summation. That's not how normalization is conventionally defined however and in the absence of any sound basis for doing otherwise I think we ought to apply the technical meaning denormalization as defined by Date in the above quotation.

Your definition for column D in r2, "a sum of all Cs from r1 where A from r1 is equal to A from r2", is a constraint on D. More formally, where Σ is summation, π is projection and σ is selection,

(a,d) ∈ r2 ⇔ (a, d) = (a, Σ c), a ∈ π_A(r1), c ∈ π_C(σ_A=a(r1))

Since this constraint is neither a domain constraint nor a key constraint, r2 is not in Domain/Key Normal Form (DKNF).

DKNF is the only normal form of which I'm aware that isn't defined in terms of a single relation, chiefly because it's defined in terms of constraints rather than dependencies.

so r2 is a function of r1 which mean r2 is a materialized view of that function of r1

and in that example the it would be a view of select A, sum(C) from r1 group by A

views are not covered in codd's work on normalization, but i think he did write about them

materializing a view is normally done for caching reasons which some might think of as a form of denormalization so there were papers on automatically deciding which view to materialize thus making it just something the database could do with view to make them faster sometimes

but as updates to views are normally not permitted although i think i read that codd said something like all views that can be update-able should be and there were papers on getting that to work in some complex cases

I think the pair of relations violates fifth normal form.

R2 is a projection of R1. Some argue that SUM is outside the scope of the relational model. In this case, SUM is a trivial extension of COUNT, which is within the scope of the relational model.