Friday, December 20, 2013

Identity Vs Uniqueidentifier primary key


  • The reason why people suggest you avoid the unique identifier data type is complicated.  You need to understand how things work "behind the scenes".
  • First, let's investigate what happens with an integer key set to Identity(1,1).  Primary keys are implemented through indexes.  SQL Server stores index data on the hard drive.  All data is stored in 8 kilobyte pages (including indexes).
  • Now, an integer takes 4 bytes, so you can store roughly 2,000 values in an index page.  Data in an index is sorted, which is why index searches are so much faster.  Finally, consider what happens when you add a new value.  Since this is an identity column, the new value is guaranteed to be larger than any existing piece of data, so it will be added to the end of the list.  If the last data page is NOT full, SQL Server simply writes the data to the end of the last index page.  If the page happens to be full (which will only occur once every 2000 inserts), then a new page is allocated and the new value is written to the first spot if the newly allocated data page.
  • Now let's think about the uniqueidentifier primary key.  A uniqueidentifier requires 16 bytes per value, so the most you can fit in an 8K data page is 500 values.  This means your index will be 4 times larger simply because of the data type.  This also means that joins and searches will be slightly slower too.  It takes more time to compare a 16 byte value than it does to compare a 4 byte value.
  • More importantly, think about the data stored in the index. If you use NewId() to create your unique identifer, you can run in to some serious performance problems. Remember earlier I said that the data in an index is sorted.  This also holds true for unique identifiers.  NewId() will return a unique number which is likely to fall somewhere in the middle of your existing numbers. So.... when you insert a row, SQL needs to determine which data page to store the new value, then, if the data page is full, it must split the page in to 2 pages.  1/2 the data from the existing page is copied to the new page (leaving each page 1/2 full), and then the new value is inserted.  Even if the page isn't full, most of the time, the data in the page will need to be rearranged to accommodate the new piece of data (some of the rows will need to be pushed down so that the new value can be inserted in to the correct spot).
  • This problem occurs if you use the NewId() function for your unique identifier.  If you use the NewSequentialId() function, there is less of a problem because you are guaranteed to get a GUID that is greater than your existing values.  Your new value will be inserted at the end, and you won't have page splits to worry about.  You still have a 16 byte value to compare with other 16 byte values (instead of 4 byte integers), so your joins and filters will be slightly slower than if you had used integer keys.  
  • It's important to put this in to perspective.  If you have a table with 10's of millions of rows, with inserts, updates, and deletes occurring 100's or 1000's of times per second, then you really need to worry about this stuff.  With smaller tables that have less usage, it really doesn't matter because you will not notice the performance issues that this can cause.

No comments:

Post a Comment