Saturday 23 April 2011

Optimise your time dimension

I think everyone who designs big dbs wants to make them well optimised. Every DWH has certainly a time dimension. In my last Job I worked as a DWH developer and I had to work with MS SQL Servers and it was never really questioned to apply some optimisation on the time dimension. In my current Job I work with open source dbs and it was a huge improvement using the following optimisation. Usually you design a time dimension as a table with the according columns:

import table:

Basically that’s how you get the orders maybe delivered from your staging server. The “date” column has a datetime datatype. For example this import table contains million of rows and 20 columns. You then create many different dimensions from this import table to make it easier to query. Like one dimension for orders, one for products etc. Otherwise it would take forever to query something from the import table.

Fact table:

Order dim:

Product dim

Now it’s better to query the data. For instance:

SELECT date, o.order_id, p.description, o.order, o.discount
FROM fact_orders f
JOIN dim_order o ON f.order_id = o.order_id
JOIN dim_product p ON f.product_id = p.product_id

That’s the generally the most common way I’ve seen, but I prefer to use a unix timestamp. Now what you can do is the following.You change the datetime column to an integer, because when querying an integer it’s much quicker than a datetime. The only drawback is that you can’t use it when you want to save anything before the 1st of Jan 1970.

For example, you can convert 2011-04-22 10:30:12 to 1303486212, which represents the time in seconds between 1970-01-01 00:00:00 and 2011-04-22 10:30:12.

How to use that in a query?

SELECT CAST(date as datetime), o.order_id, p.description, o.order, o.discount
FROM fact_orders f
JOIN dim_order o ON f.order_id = o.order_id
JOIN dim_product p ON f.product_id = p.product_id