
Index Maintenance.

Why do index maintenance? If you don't know the answer to that, then I'm afraid this code is probably overkill for you and I don't think you should run it.

I've been doing index maintenance on servers for a while, and it normally goes quite well with a flask of coffee, a couple of open Query Analyser windows, a Windows Explorer window (pre-2012), or DBCC SQLPERF(logspace) or the DMV; you might also keep an eye on the log file and log backup sizes.
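For reference, these are the two quick ways of watching log space I mean; the DMV I have in mind is sys.dm_db_log_space_usage (SQL 2012+):

-- Classic: one row per database with log size and percent used.
DBCC SQLPERF(LOGSPACE);

-- DMV equivalent for the current database.
SELECT total_log_size_in_bytes / 1048576 AS log_size_mb,
       used_log_space_in_bytes / 1048576 AS used_mb,
       used_log_space_in_percent
FROM sys.dm_db_log_space_usage;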

Now you might not require all or any of the above. This is probably because, like www.derivco.com 2006-2011, you have more grunty hardware than you know what to do with, or your databases aren't really a size where index maintenance is a problem; you could probably still run the Maintenance Plan Wizard or a 3rd-party tool. That isn't bad, but then this code is not for you, though it would be handy to work through if you want to understand SQL a little better.

Where to start? Let's start with what we want to do: we want to rebuild indexes, and we might want to do this based on some rules, like what the fragmentation on a table is, what the size of the indexes is, and which databases I want to do index builds on.

True as the DBA story goes, for the sake of not stepping on toes or whole feet, I had to reuse some of the procs that were available. In some cases I completely rewrote them and kept only the name, and that is why the names look the way they do.

Let’s go over a few concepts.

Index physical stats: sys.dm_db_index_physical_stats is the DMV that contains the information on fragmentation and index size for a specific database and object (table).
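For example, pulling fragmentation and size for every index in one database (SAMPLED mode, which is what the tool uses; the database name is made up):

SELECT ps.object_id, ps.index_id, ps.partition_number,
       ps.avg_fragmentation_in_percent, ps.page_count
FROM sys.dm_db_index_physical_stats
     (DB_ID(N'MyDatabase'), NULL, NULL, NULL, 'SAMPLED') AS ps
ORDER BY ps.avg_fragmentation_in_percent DESC;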

Recovery models: these determine how data is logged (for recovery) in SQL Server. Index rebuilds, and online rebuilds even more so, do a jolly good job of growing the log, so depending on the recovery model your log file will be affected.

Log backups: these are more of an issue when it comes to log shipping for a full recovery model database. How frequently do you back up the log, what is your network throughput, what is the drive space for your log, and what is the drive space for your log backups? Log backups, log truncation and checkpoints go hand in hand.

Memory, aka buffers. Anything done in SQL uses the buffer pool; if I build an index OFFLINE or ONLINE, the pages of the index will end up in sys.dm_os_buffer_descriptors. Memory is rather finite, and the time that objects stay in memory, as well as contention, is monitored with Windows \ SQL DMV performance counters such as page life expectancy, buffer cache hit ratio, the buffer manager object counters, etc. You can use the is_modified column to determine the size of the buffer in transit, modified in memory but not yet written to disk; this happens when data is ... well ... modified.
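As a quick illustration, buffer pool usage per database with the dirty (is_modified) pages split out; pages are 8KB each:

SELECT DB_NAME(database_id) AS database_name,
       COUNT_BIG(*) * 8 / 1024 AS buffer_mb,
       SUM(CASE WHEN is_modified = 1 THEN 1 ELSE 0 END) * 8 / 1024 AS dirty_mb
FROM sys.dm_os_buffer_descriptors
GROUP BY database_id
ORDER BY buffer_mb DESC;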

Availability Groups, new in SQL 2012, and mirroring before that, saw delays in synchronization because of the dependency on the log; then came log compression and things got better. A tip on this: create a VLAN for your traffic over the Availability Group endpoint. This will ensure traffic between nodes is optimized, especially when it comes to synchronous configuration, and it will also ensure you don't interfere with production traffic.

Space, disk space. This is such a poor type of problem to have, but I see many successful companies bickering over spending money on storage. OFFLINE builds use less storage because the index is rebuilt in place, so to speak, while with ONLINE a copy is created and swapped in. I think the SQLskills folks will have a heart attack at the brevity of that explanation, but the moral here is that ONLINE builds will use more space everywhere, in the log file and the data file, and a build or rebuild will also use space in tempdb if you use the SORT_IN_TEMPDB option.
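To make the trade-off concrete, this is roughly the statement shape in question (the table and index names are made up):

-- ONLINE keeps the old index available while the copy is built, at the cost
-- of extra space in the data file and log; SORT_IN_TEMPDB moves the sort
-- work into tempdb instead of the user database.
ALTER INDEX IX_Orders_OrderDate ON dbo.Orders
REBUILD WITH (ONLINE = ON, SORT_IN_TEMPDB = ON);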

Those manual index builds.

If you build indexes you might have a "soft" maintenance period. More and more companies are running 24/7, so you don't really have a proper maintenance window, but there might be an hour or three before the next BI load and those temporary batch processes that have since been stuck into the job schedule for your database and become permanent.

So when the clock strikes maintenance hour you hit REBUILD, GO and expect your pumpkin to turn into a grand coach, make it to the party, have a ball and be back before it turns into a pumpkin again, hopefully not leaving something behind at the party.

Based on my experience working at really big and really small places this seems to be the norm.

In this tool of mine I have somewhat moved away from this concept, at least for full recovery model databases.

Those automated builds.

I have taken exactly what I would normally do and put it into code. It's not really new and it's not really rocket science either, but I've had this implemented at an e-commerce search system with no downtime, on Standard Edition with simple recovery (www.trademe.co.nz, 2011-2013).

Let’s get on with it shall we.

Objects.

[Schema diagram]

IndexRebuildQueueAdd – This proc reads from the physical stats DMV based on database, fragmentation threshold and index size parameters, and puts the output into the IndexRebuildQueue table. Very simple, yes?
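A minimal sketch of the idea, assuming a simple queue table layout (the parameter and column names here are my shorthand, not necessarily the shipped schema):

CREATE PROCEDURE dbo.IndexRebuildQueueAdd
    @DatabaseName     sysname,
    @FragThresholdPct float  = 30.0,  -- minimum fragmentation to queue
    @MinPageCount     bigint = 1000   -- ignore trivially small indexes
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO dbo.IndexRebuildQueue
        (DatabaseName, ObjectId, IndexId, PartitionNumber,
         FragmentationPct, PageCount, QueuedAt)
    SELECT @DatabaseName, ps.object_id, ps.index_id, ps.partition_number,
           ps.avg_fragmentation_in_percent, ps.page_count, GETDATE()
    FROM sys.dm_db_index_physical_stats
         (DB_ID(@DatabaseName), NULL, NULL, NULL, 'SAMPLED') AS ps
    WHERE ps.avg_fragmentation_in_percent >= @FragThresholdPct
      AND ps.page_count >= @MinPageCount;
END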

Some notes on the physical stats DMV: on large databases (2TB databases, or 300GB+ tables) I have seen poor performance from this view. Thus I would not suggest running it with anything greater than SAMPLED mode, which is what I used (I required some of the sampled output). And don't run it in your maintenance period; I would suggest you do this beforehand. It's rather unintrusive (or at least less intrusive).

IndexRebuildProcessor – This proc reads from the queue and determines whether we can process the index, based on some rules sitting entirely in the object IsNowAGoodTime. Once processed (processed implying either successful or failed), the output is written to the IndexRebuildHistory table.

That is really the summary of it.

Now that you have the overview we can look at a bit of process flow.

[Simple diagram]

The queue would preferably be populated before your maintenance period. Once populated, the processor reads off the queue; if it finds an index it will pass it through to IsNowAGoodTime, which validates the following.

Is 60% of the log file big enough to fit one and a half times the index size? This is to ensure there is enough space in the log, so we don't run out of log space. The 60% is a rather custom rule and could be adjusted up to, say, 90%.

That, if the index is built, there is at least 30% free in the log. This is also a custom rule which can be tweaked according to your policies or business.

Then there is a check for filegroup space in the filegroup where the index resides. I haven't really tested partitioning specifically, but everything else works.

Once there is enough space, memory is checked for low page life expectancy. If the page life is below 300 seconds, a delay is set to generate a wait for the number of seconds we are short of 300. I.e. if we have a page life of 200 seconds, the delay would be 100 seconds. The processor then checks again and proceeds or delays further.
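The counter lives in sys.dm_os_performance_counters; a rough sketch of that check and delay (the 300-second floor is the rule described above):

DECLARE @Ple int, @Delay varchar(8);

SELECT @Ple = cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy'
  AND object_name LIKE '%Buffer Manager%';

IF @Ple < 300
BEGIN
    -- wait for however many seconds we are short of 300, as hh:mm:ss
    SET @Delay = CONVERT(varchar(8), DATEADD(SECOND, 300 - @Ple, 0), 108);
    WAITFOR DELAY @Delay;
END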

Then follow the locking \ blocking \ HADR latency and other important process checks. If there are blocks, locks, backups, restores or other index processes running, we don't proceed.

Along with the ONLINE option, the fill factor will also be toggled: 100% fill for the clustered index, assuming that you have best-practice clustered indexes. If you do not, then modify this. This calculation will eventually be refined. What needs to happen is that once you have sufficient data, you can record the fill factor at build time and how long the index takes to come back onto the fragmentation radar after being rebuilt; you can then decrease the fill factor to leave the proper free space in the pages and reduce fragmentation. This can be tuned up to a point, based on how frequently you can afford to do maintenance.

The next step is to determine the backup interval. This is the log backup interval for a database that is in full recovery; the check is not performed for other recovery models. It forms part of a bigger check which looks at the backup size for the last interval and also calculates how many log backups we have per hour. If there are no log backups for a full recovery system within the last hour, the process stops for this index.

Once the interval (the number of log backups per hour) is available, this value is multiplied by the log size. This gives us a total throughput per hour. This limit is set because we don't want to autogrow the log file and cause potential failures with automation. If an index's size is greater than the throughput for an interval, it will not be rebuilt.

I.e. log backups run every 10 minutes, or 6 per hour. The log size is 100mb. The throughput per hour would be 600mb, or 100mb per 10 minutes. Thus we can build 2 indexes of 50mb each (this is an example, so given internal overhead the actual index would be around 37mb), or we could build 4 indexes of 20mb each. Once built, we have to wait for the log backup to complete in order for the log to be truncated. This allows a steady maintenance pace as opposed to just hitting the server with everything at once.
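The cadence and volume can be read straight out of msdb; a hedged sketch, with the database name made up:

-- Log backups in the last hour and the total log volume they moved.
DECLARE @Db sysname = N'MyDatabase';

SELECT COUNT(*) AS log_backups_last_hour,
       SUM(backup_size) / 1048576 AS log_mb_last_hour
FROM msdb.dbo.backupset
WHERE database_name = @Db
  AND [type] = 'L' -- log backups only
  AND backup_finish_date >= DATEADD(HOUR, -1, GETDATE());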

The values for the log backup interval, the log size and the maintenance period are also used to throttle the builds if we are in full recovery.

I.e. a normal index maintenance job will put a hard delay after each index, such as 1 minute, and a DBA would tweak this depending on how the job runs over the maintenance period (or less than the maintenance period). This could potentially overrun, or run too fast and let the log grow too big and mess up log shipping or other log-centric systems, replication, etc.

There is also a bit that checks the maintenance hours. This can be hardcoded (configured) in the proc; it is currently set up between 22:00 and 05:00.

How the throttle is calculated.

Based on our example above, we have 6 backups per hour of 100mb. We also know the queue size, based on the pages for the queued indexes, and we know our maintenance period. Taking all of those values, we know we need to process [x]mb in the maintenance period, and we only have a set number of intervals in which the log backups run. So the sum of the queue is divided by the number of log backups in the maintenance period. That gives us the minimum amount of [x]mb that we need to process in each interval in order to complete all the indexes within the maintenance period.

Example.

6 backups per hour.

340mb queue size.

1 hour maintenance period.

Minimum build per interval = 340 / 6 ≈ 57mb

This means we need to build at least 57mb of indexes in each interval (60 minutes per hour, 6 backups, i.e. 1 backup every 10 minutes). Thus in each 10-minute period we need to build 57mb. This is done by looking at the QueueHistory table for the database on which we are building indexes; we look at the sizes of the entries that are completed.
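In code the target is just this (the variable names are mine):

-- Minimum mb of completed index builds per log backup interval.
DECLARE @QueueMb          int = 340, -- sum of queued index sizes
        @BackupsPerHour   int = 6,
        @MaintenanceHours int = 1;

SELECT CEILING(1.0 * @QueueMb / (@BackupsPerHour * @MaintenanceHours))
       AS MinMbPerInterval; -- 57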

Program flow.

[Program flow diagram]

Index 1 (40mb) is read off the queue. We read the history and find no rows. This means we are under the 57mb limit and need to build more indexes, so the index is processed and recorded in the history. Now we read another index off the queue, 26mb. We read the history and find the 40mb index; 40mb is still less than 57mb, thus we proceed with the build (you might pick up here that it's important that we don't include the 26mb index in the equation, otherwise we would build less than required). Once that index is built, we process the next index, 10mb. We read the history and find 66mb completed. We only required a throughput of 57mb in the interval, thus the process is throttled and delayed for 1 minute. (I could calculate the required waiting time based on the difference between the last log backup time and getdate(); maybe V2.)

If it’s a good time.

Once all the checks complete successfully, the index build proceeds. There are currently two options for building: one caters for indexes and the other caters for heaps. The ONLINE flag can be toggled for both. There is no defrag option; that was intentional. If you want to add defragmentation options you could make use of the retry parameter.
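The two build shapes look roughly like this; the object names are illustrative, and ONLINE and FILLFACTOR are the toggles discussed above:

-- Path 1: a proper index, clustered or nonclustered.
ALTER INDEX PK_Orders ON dbo.Orders
REBUILD WITH (ONLINE = ON, FILLFACTOR = 100);

-- Path 2: a heap (no clustered index), so the table itself is rebuilt.
-- ONLINE = ON can be added here too where the edition supports it.
ALTER TABLE dbo.StagingLoad REBUILD;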

Depending on the index, one of the build methods is chosen. When the build is started, a monitoring job is kicked off. The monitoring job executes a procedure that runs in a loop, iterating every 500ms. This procedure watches the index processor and will kill any transaction not falling within the specified blocking threshold or the log free-space threshold. If a kill occurs, the job updates the history table with a corresponding note and then completes.
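A hedged sketch of that watchdog; the spid, the blocking threshold and the history update are illustrative, and a real version would be handed the processor's session id:

DECLARE @RebuildSpid int = 75,   -- session running the rebuild (assumed input)
        @MaxBlockMs  int = 5000, -- blocking threshold (assumption)
        @Sql         nvarchar(50);

WHILE 1 = 1
BEGIN
    WAITFOR DELAY '00:00:00.500'; -- re-check every 500ms

    -- if the rebuild is blocking anyone past the threshold, kill it
    IF EXISTS (SELECT 1 FROM sys.dm_exec_requests AS r
               WHERE r.blocking_session_id = @RebuildSpid
                 AND r.wait_time > @MaxBlockMs)
    BEGIN
        SET @Sql = N'KILL ' + CAST(@RebuildSpid AS nvarchar(10));
        EXEC (@Sql); -- KILL does not accept a variable directly
        -- UPDATE dbo.IndexRebuildHistory ... note the kill here (sketch)
        BREAK;
    END

    -- stop monitoring once the rebuild session has finished
    IF NOT EXISTS (SELECT 1 FROM sys.dm_exec_requests
                   WHERE session_id = @RebuildSpid)
        BREAK;
END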

In summary.

You have a process selecting indexes into a queue based on rules. Once the indexes are queued, they have to comply with the rules in order to be processed. During processing they have to affect the system minimally, in a defined way, otherwise they get killed and retried at a later stage.

The rules are based around system performance, system throughput, backup intervals, log sizes, disk space, and some good peaceful sleep for DBAs.

You can download the files here: Index Maintenance. It's a zip source, so once downloaded, rename it and extract. Since I wrote it, the code has been in use at two companies where I have worked. It's been a project in the making; schema \ object names have been chosen to reuse as much existing code as possible in order not to step on people's toes, more specifically the turf of previously employed DBAs.

I hope this helps.

Feel free to hit me up for the source code or other DBA questions.
