Hi Dave,
Of course a database should hold consistent data, but even with PK/FK/CHECK constraints there might be bad data in a source system, or the data is only consistent within a single source system while you load multiple sources into a DWH. Data quality is an important topic, and you might be surprised how low it is in some DWHs :-)
A table in a warehouse is usually loaded in batches, i.e. hundreds/thousands/millions of rows at a time. When you INSERT/SELECT into the target table and there's a PK violation, you get an error message and the transaction is rolled back. To avoid this potential rollback you have to write a query that filters out the rows violating the PK in the SELECT part. Now you're sure the load will not fail, but when the data is already checked, why should the DBMS do a second check?
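A minimal sketch of that pre-filtering, using SQLite through Python's sqlite3 as a stand-in for Teradata (where you would probably write the de-duplication with QUALIFY instead of a derived table). The table and column names (stage_customer, dwh_customer, load_ts) are made up for illustration, and the window function needs SQLite 3.25 or newer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stage_customer (cust_id INTEGER, name TEXT, load_ts INTEGER);
    -- The target has no physical PK; cust_id is only the logical key.
    CREATE TABLE dwh_customer (cust_id INTEGER, name TEXT);
    INSERT INTO dwh_customer VALUES (1, 'Alice');
    INSERT INTO stage_customer VALUES
        (1, 'Alice again', 10),  -- key already exists in the target
        (2, 'Bob', 10),
        (2, 'Bob dup', 20),      -- duplicate key inside the batch
        (3, 'Carol', 10);
""")

# Keep only the newest staging row per key, and only keys not yet in the target.
conn.execute("""
    INSERT INTO dwh_customer (cust_id, name)
    SELECT cust_id, name
    FROM (
        SELECT cust_id, name,
               ROW_NUMBER() OVER (PARTITION BY cust_id
                                  ORDER BY load_ts DESC) AS rn
        FROM stage_customer
    ) AS s
    WHERE rn = 1
      AND NOT EXISTS (SELECT 1 FROM dwh_customer t
                      WHERE t.cust_id = s.cust_id)
""")

rows = conn.execute(
    "SELECT cust_id, name FROM dwh_customer ORDER BY cust_id"
).fetchall()
print(rows)  # [(1, 'Alice'), (2, 'Bob dup'), (3, 'Carol')]
```

The SELECT already guarantees uniqueness on cust_id, so a physical PK on the target would only repeat the same check.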
Or you load using a MERGE (so you don't have to care about updates vs. inserts). When you match source and target on the logical PK columns, you don't even have to care about PK violations.
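To illustrate the matching on the logical key: SQLite has no MERGE, so this sketch emulates one with an UPDATE followed by an INSERT in a single transaction (a swapped-in technique, not what Teradata does internally). The tables dwh_product and stage_product are hypothetical, and the staging table is assumed to hold at most one row per prod_id:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- No PK on the target; prod_id is the logical key used for matching.
    CREATE TABLE dwh_product (prod_id INTEGER, price_cents INTEGER);
    CREATE TABLE stage_product (prod_id INTEGER, price_cents INTEGER);
    INSERT INTO dwh_product VALUES (1, 999), (2, 1999);
    INSERT INTO stage_product VALUES (2, 1749), (3, 450);
""")

# Roughly what a standard-SQL MERGE would express:
#   MERGE INTO dwh_product t USING stage_product s ON t.prod_id = s.prod_id
#   WHEN MATCHED THEN UPDATE SET price_cents = s.price_cents
#   WHEN NOT MATCHED THEN INSERT (prod_id, price_cents)
#                         VALUES (s.prod_id, s.price_cents)
with conn:  # both steps commit together or not at all
    # "WHEN MATCHED": overwrite rows whose key is present in staging.
    conn.execute("""
        UPDATE dwh_product
        SET price_cents = (SELECT s.price_cents FROM stage_product s
                           WHERE s.prod_id = dwh_product.prod_id)
        WHERE EXISTS (SELECT 1 FROM stage_product s
                      WHERE s.prod_id = dwh_product.prod_id)
    """)
    # "WHEN NOT MATCHED": insert keys the target does not have yet.
    conn.execute("""
        INSERT INTO dwh_product (prod_id, price_cents)
        SELECT s.prod_id, s.price_cents
        FROM stage_product s
        WHERE NOT EXISTS (SELECT 1 FROM dwh_product t
                          WHERE t.prod_id = s.prod_id)
    """)

merged = conn.execute(
    "SELECT prod_id, price_cents FROM dwh_product ORDER BY prod_id"
).fetchall()
print(merged)  # [(1, 999), (2, 1749), (3, 450)]
```

Because every source row either updates its match or is inserted as a new key, the load itself cannot create a duplicate key, as long as the source is unique on that key.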
A PK/UNIQUE constraint might also be a huge overhead (if the PK is not implemented as a UPI, it will be a unique secondary index). If it's just a logical constraint on columns that are hardly used in WHERE conditions, you can easily avoid that overhead by not implementing it.
Foreign keys are similar: one row violating the constraint will cause a rollback. Additionally you might have slowly changing dimensions where you keep older versions of the data (an UPDATE inserts a new row and updates the old row's valid_to). But FKs were designed for OLTP systems; you can't declare "table1.column REFERENCES table2(column) AND table1.datecol BETWEEN table2.valid_from AND table2.valid_to" (unless you use Teradata's Temporal feature). And finally, you sometimes reload or recreate tables in a warehouse; try that when the table is referenced in a referential constraint :-)
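Since a plain FK can't express the valid_from/valid_to condition, that kind of referential check usually ends up as a query in the load or in a data-quality job. A rough sketch, again with SQLite via Python's sqlite3 standing in for Teradata; dim_customer, fact_sales and their columns are invented for the example, and dates are ISO strings so that BETWEEN compares them correctly:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Slowly changing dimension: one row per version of a customer.
    CREATE TABLE dim_customer (cust_id INTEGER, segment TEXT,
                               valid_from TEXT, valid_to TEXT);
    CREATE TABLE fact_sales (sale_id INTEGER, cust_id INTEGER, sale_date TEXT);
    INSERT INTO dim_customer VALUES
        (1, 'retail',  '2020-01-01', '2020-12-31'),
        (1, 'premium', '2021-01-01', '9999-12-31');
    INSERT INTO fact_sales VALUES
        (100, 1, '2020-06-15'),  -- matches the 'retail' version
        (101, 1, '2021-03-01'),  -- matches the 'premium' version
        (102, 1, '2019-05-01'),  -- customer known, but no version valid then
        (103, 2, '2021-03-01');  -- customer not in the dimension at all
""")

# Fact rows without a dimension version valid on the sale date.
orphans = conn.execute("""
    SELECT f.sale_id
    FROM fact_sales f
    LEFT JOIN dim_customer d
      ON d.cust_id = f.cust_id
     AND f.sale_date BETWEEN d.valid_from AND d.valid_to
    WHERE d.cust_id IS NULL
    ORDER BY f.sale_id
""").fetchall()
print(orphans)  # [(102,), (103,)]
```

A conventional FK on cust_id alone would accept sale 102, because customer 1 exists in the dimension; only the date-range join shows that no version covers that date.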
Dieter