Why References Aren't Pointers
Let's start with some bad reasons for exclusively using surrogate keys:
- References to surrogates take less space than natural ones. Those familiar with the relational model will instantly recognize this argument as crossing logical/physical boundaries. In other words, an issue such as "space" is purely dependant on the system which implements the given logical schema. A hypothetical system, for instance, might store related relation variables in a common structure where the common attributes are not duplicated, or may have some surrogate mechanism built in. Even for systems with more direct physical isomorphisms, altering the logical schema for such physical considerations is paramount to optimizing for storage.
- No need to worry about deducing the natural keys. It should be clear that if the database designer isn't considering the cardinality of various attribute combinations, that designer is not likely to arrive at a successful schema design.
- Allows auditing of what was changed in a particular data "slot". It is not very likely that the auditing requirements call for knowing that the current home phone number was entered in place of the former work phone number. I've only seen this as meaningful in cases where there is a natural order to the data, in which case a monotonically increasing series of numbers--as is often used for surrogates--actually is a natural key.
- References to natural keys automatically enforce certain constraints. This mechanism allows for referencing relvars to state dependencies that would otherwise be complicated to state.
- Simplicity. By introducing additional attributes and keys, surrogates increase the complexity of the schema.
- Queriability is often improved by proliferating meaningful rather than meaningless attributes. Joins can be avoided in cases where the key contains the desired information. Depending on the implementation of the system, this might also provide a performance benefit.
- Logical defects manifest themselves earlier. Surrogates often hide design problems until systems are far along or are in production. This is so for at least two reasons: 1) it is easier to neglect careful thought about the key; and 2) it is possible to enter data that might otherwise violate an integrity rule.
- When a row in the relation variable (relvar) is logically unique regardless of the other attributes. This goes back to the notion of a "slot".
- If there is no clear natural key. Much has been written on this topic, but basically: only choose a natural key that is relatively concise, unchanging, and unquestionably unique.
- If the natural key is compound. This is a gray area because referencing compound keys often allows for the automation of otherwise difficult to enforce integrity constraints. At the same time, heavily compound references often imply integrity constraints that are not desired.
- Surrogate's don't violate some logical pureness rule. The real world is littered with examples of surrogates, so in fact surrogates are often natural! The real world is also a useful reference for examples of where surrogates work well and not so well (think SSNs).
The practice of using surrogate keys for all relvars is probably as common as it is due to the industry's familiarity with Object Oriented languages, which provide purely surrogate semantics for addressing. For those looking to get more out of their database system than data storage for their objects, natural keys should be considered thoroughly before resorting to surrogates.
In conclusion, as has been stated many times in other places, this is a design issue and thus I can offer no solid rules, just general ideas. I will say though, that considering all of the above trade-offs, it should be pretty clear that any design that takes the extreme of using all surrogates is probably not the best design.