There are primarily two types of indexes available to you in SQL Server: Clustered and Non-Clustered. How and when you use them impacts the performance of your data warehouse dramatically. And yet I’ve found that there still exists a bit of confusion as to when and why one should use one type of index over the other. That’s the reason why I’m writing this little article – to hopefully add a small bit of insight to this critically important subject.
Let’s just start with what a Clustered Index is. It’s an index which physically maintains the order of your records. It does this by making it so the leaf level of the index is the data page itself with the rows ordered within each page. Because there can only be one physical order in a table, there can only be on Clustered Index. When you add new records to the table, it will be inserted in the proper, ordered location.
I’d like you to think about the Clustered Index as an Encyclopedia. All the entries in an encyclopedia are in order already, so when you look for “Yaroslav Hasek”, you go to the F-H book, page to the entry you want in the “H” section, and get to your entry. By the time you find it, you’re already at the place where the data for the great Czech writer is stored. As a fan of the author, I’m sure glad that the indexing of the encyclopedia, like the Clustered Index in SQL Server, takes me straight to the data I’m looking for.
The Non-Clustered Index, on the other hand, behaves completely differently. The behavior of the Non-Clustered Indexes also changes according to whether the table contains a Clustered Index or not.
Let’s start with how a Non-Clustered Index helps with a search when a table contains no Clustered Index . This kind of index is structured like a B-Tree, the same as the upper nodes of the Clustered Index , but the difference is that the leaf nodes contain a hex notated Row ID to indicate in which page and offset the data is located in the database. This functionality is more akin to the index in the back of a history book. If you want to find out something about Naples, you go to the back of the book, find the letter “N”, and scroll to “Naples”. Instead of getting the data about Naples you want, you get a page number which you still have to go to in order to get your data. That Row ID number is like that page number. After the index tells SQL Server where the data is, it has to go find where it’s physically located and then retrieve it.
A table with no Clustered Index is also known as a “Heap”, in large part because the data is stored in a completely unordered sequence. This is great for tables where INSERT transactions have to be as fast as possible. They’re fast for the same reason I sometimes have an unordered pile of papers on my desk. It only takes a nanosecond to add a new sheet of paper to the pile (I just throw it on top). On the other hand, the retrieval of one of those papers is a bit more time consuming than if I had spent the time just filing them in order as they came in.
A Non-Clustered Index on a table with a Clustered Index in it behaves just a little differently than one on a heap. When you build a Non-Clustered Index in this case, the leaf node of the index doesn’t contain a reference to the location of the record – instead, it contains a reference to the Clustered Index key of that record. That means that once the value in the index has been found, it takes the Clustered Index key value and performs a lookup based on that Clustered Index key.
So just to recap:
1) A Clustered Index leaf node IS the data and causes the table to be physically ordered by the key of the index
2) A table with only Non-Clustered Indexes is a heap. The leaf nodes contain a pointer to the location of the data
3) A table with both clustered and Non-Clustered Indexes does not use a pointer in the leaf nodes. Instead, it performs a search based on the Clustered Index key.
There is one interesting fact that differentiates heaps from non-heaps. It can happen that a record in a table changes size and no longer fits in the allocated space for that record. When a Varchar column that occupied 8 bytes before now occupies 16 bytes after an update, the record might have to change physical location if it no longer fits in that page, and move to another page. In a heap, indexes don’t change their pointer location. They continue to point to the old location since nobody wants to incur the overhead of rebuilding index pointers every time a record has to move. So what does the pointer point to if the record is no longer there? The pointer in this case points to a forwarding pointer which contains another pointer to the new location. And if that record has to move again, well .. SQL Server just leaves yet another pointer to the new location. So now we have an index pointer pointing to a pointer that points to another pointer. Each one of these pointer redirections is called a “hop”. I’ve seen tables that required more than 23 hops to get to a record. You can avoid this by simply rebuilding the indexes every so often.
A table with a Clustered Index does not have to worry about this kind of fragmentation because the Non-Clustered Indexes point to a Clustered Index key. If the record changes location, the key does not have to change. No forwarding pointers. But the effort to keep the tables in proper order can cause some page splitting. This is not as bad as the forward pointer issues by any stretch but if it’s a common occurrence, then an index rebuild will also take care of the problem.
Now that we know all this, the question often comes of when a table should have a Clustered Index . In my opinion: always. I can hardly think of any reason not to have one. If anyone can give me a good reason why not, I’d love to hear it. This stems from the principle that every table in a database should have a primary key. I suspect Microsoft agrees with me. If you create a primary key on a table, it automatically becomes supported by a Clustered Index unless you specifically tell it not to. But the main reason is that the searching mechanism of the Clustered Index is faster than using a pointer page offset that you find in a heap.
There are a few things to keep in mind. Unless your Clustered Index is the ONLY index in the table, then try to use the smallest key possible such as an integer. Do NOT use multiple columns as the key for a Clustered Index unless it’s the only index. Here is the reason:
Remember that all Non-Clustered Indexes in a non-heap table contain the Clustered Index in their key structure. So if your Non-Clustered Index is keyed on last names and your Clustered Index is keyed on an integer ID field, then each entry in your Non-Clustered will contain the entry for the last name, say “Jones”, with a search key of say 235. If instead of a 4 byte integer ID on Clustered Index you have a 16 byte GUID, then the “Jones” entry and every other last name will have to carry an extra 16 bytes around. If you want your indexes to be as fast as possible, they need to have the fewest nodes possible. Each node can only hold a certain number of bytes, so if your Clustered Index key is large, the key of all the Non-Clustered Indexes will be at least as large. They will then require more nodes, which require more node splits and jumps which then slow down the searches. So as much as possible, you need to keep Clustered Index keys small.
Bottom line: Make sure every table has a Clustered Index and keep the keys of that index as small as you can. Your tables will be better organized and will perform better when queried.
Next week, I’ll talk about indexing strategies for Star Schemas ..
Categories: Clustered Index, Data Warehouse, Indexes
Hi,
This articles provided me with real insights of primary indexes and surely helped me in big way in clearing the thought process.
-Sandeep Goel