This article presents 27 SQL performance tuning tips. The impact of each tip depends on:
- which DBMS you are using.
- how your data is organised.
- your current hardware configuration.
- the volume of data in your database.
As a result, some tips may help you and others may not. Try each of the proposed SQL performance tuning techniques and, each time, measure the performance increase, or lack of it.
Use an index on the most used columns
Create an index on the columns most frequently used in where clauses.
The queries that your application uses the most look like this:
select * from salesperson where last_name = 'wilson'
In such queries, the column used in the where statement is last_name. By creating an index on last_name, you can increase these queries’ performance.
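As a sketch (the table and index names are hypothetical), such an index could be created like this:

```sql
-- Speeds up lookups by last name in the salesperson table.
create index idx_salesperson_last_name on salesperson (last_name);
```
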
This tuning technique has drawbacks. Indeed, indexes:
- slow down insert and update operations.
- take up space on the disk.
Therefore, be careful to add an index only when it is useful.
Moreover, adding an index will not increase performance if:
- the table contains only a few rows.
- a large percentage of rows have the same value in the indexed column. Example: select * from users where is_employee = 'true'. In this example, if 80% of users are employees, creating an index on is_employee will not help.
Do not forget to tune queries included in functions
Once I added an index on a column used in a where statement. The query was faster, but still slow.
After some investigations I noticed that the part of the query just after the select statement was using a function. This function itself used a query which was lacking an index.
Let us say you created an index on first_name, last_name. In that case the part of the query where first_name = 'John' and last_name = 'Doe' is well tuned. But the function called "perf" can itself hide a slow query:
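As an illustration (all table, column and function names here are hypothetical), the outer query can look well tuned while the function hides an untuned one:

```sql
-- Outer query: well tuned thanks to the index on (first_name, last_name).
select perf(id) from salesperson
where first_name = 'John' and last_name = 'Doe';

-- Inside the function perf, a query like this one runs for each row:
--   select sum(amount) from sales where salesperson_id = @id;
-- Without an index on sales (salesperson_id), this inner query is slow.
```
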
If there is no index on salesperson_id in table sales, you can improve the function’s performance by adding this index. As a result your first query, which uses this function, will be faster.
In other words: do not forget to tune queries inside functions and stored procedures.
Use indexes when joining two tables
Let us say you are joining two tables using a query. This query could be faster if you create an index on columns used to join these two tables.
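A sketch of such a join, with hypothetical clients and salesperson tables:

```sql
-- Joins clients to their salesperson through clients.salesperson_id.
select c.name, s.last_name
from clients c
join salesperson s on s.id = c.salesperson_id;
```
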
In this example you can create an index on both id of salesperson and salesperson_id of clients.
If you are often using the column first_name in the where clause after this join, you can also create a composite index on id, first_name instead of an index on id.
Composite index: do not omit the first column
Your index can include more than one column. It is then called a composite index.
Example: an index includes first_name and last_name, in this order. When you look for a first name and a last name:
- the DBMS looks for the provided first name (Alfie in the diagram below) among the first names contained in the index (Amelia, Alfie, Archie, Alexander,...).
- the index returns a group of last names corresponding to this first name (Adams, Arrow, Willis, Johansen).
- the DBMS looks for the provided last name among this group of last names.
- once it finds it, the index indicates where the data is in the table (line corresponding to employee number 5).
If you do not give the first name in the query (in the diagram below, on the bottom left corner, we give the last name Richardson without giving the first name), then the first step cannot be executed. As a result, this composite index is not used.
Composite index: do not omit the first column in the order by clause
The previous chapter advises not to omit the composite index’s first column in the where clause. Forgetting to apply this principle on the order by clause can slow down your query as well.
If your composite index is on invoice_id, position_id, then
order by position_id can be slower than order by invoice_id, position_id (missing first column of the composite index).
Also, for some DBMSes, order by position_id, invoice_id (wrong order) can be slower than order by invoice_id, position_id (correct order).
Use of caching
The principle of caching is that if someone asks the same question twice, you give the same reply the second time without redoing the work:
- a user requests information from your application.
- your application remembers the request.
- your application looks for information in the database using an SQL query, and gets the needed information.
- your application remembers the association between the request and the corresponding information returned.
- your application returns the information to the user.
- each time the user asks for the same information, your application returns the answer it kept in memory, without querying the database again.
The diagram below illustrates these steps:
This SQL performance tuning technique has a drawback: another user or process can update the data between the moment the result of the SQL query is saved and the moment you use the cache (the moment you ask for the same information). Therefore, the answer to the same question could differ after a while.
As a result, you must find out the longest you can wait before getting the most up to date information (in other words, before refreshing the cache). Example:
- for a debit card transaction you must always get the most up to date information immediately, so caching is not an option. Indeed, you cannot wait even for a second in order to avoid fraud.
- to display a list of countries in a combo box you can wait a couple of minutes, even a couple of hours, before updated data appears.
In other words, the most efficient SQL query is the query that you don’t have to execute.
Invest in SSD disks
Replacing an HDD drive with an SSD drive significantly increases your SQL queries' performance. Indeed, it is the disk which stores your data: faster reads and writes on the disk result in faster reads and updates of data.
The other advantage of using an SSD drive is that data fragmentation no longer hurts read performance, and failures due to moving mechanical parts disappear.
The main downsides of SSD drives are that:
- SSD drives’ cost per megabyte is higher.
- you cannot find an SSD drive as big as the biggest HDD drive on the market.
You can overcome these disadvantages if you:
- keep the volume of your database low enough to fit on an SSD drive.
- distribute the data between several drives.
Note that if you have a huge number of requests per second, your SSD drive may not have the capacity to read the required amount of megabytes per second (throughput). To correct this situation, you can either use drives that have a higher throughput, or separate the processes into two flows, each one working on a separate SSD drive.
Tune the machine on which the DBMS is running
Check the level of RAM and CPU use on the machine hosting the DBMS. Remember that:
- your DBMS is software like any other, using RAM and processor power. If your machine does not have enough RAM or its processor is not fast enough, the DBMS will slow down. In that case, you should add more RAM or invest in a machine with a faster CPU. Most of the time, adding more RAM is the most efficient option.
- when the computer starts lacking memory, it uses the hard drive to get extra memory. If this hard drive is an HDD drive, it results in a significant decrease in performance.
- other processes running on the same machine can take too much CPU or too much memory. This slows down the DBMS. These other processes can be another application you own, or a virus. If your database is on a machine shared with other companies, there will also be more processes running on your machine, using your RAM and CPU. As a result, the following actions can help you tune your database performance: a) tracking viruses, malware and spyware, b) removing unnecessary services or programs from your machine, c) using a dedicated machine or a virtual machine instead of a shared machine.
Important note regarding RAM: adding more RAM influences performance, but so can:
- the type of RAM used: dynamic or static. Static RAM (SRAM) is faster but more expensive.
- the RAM speed: DDR (double data rate) RAM is faster (typically 266 MHz instead of 100 MHz) but more expensive. The RAM speed is usually written in the RAM specification.
Regarding CPU, the following factors can influence its performance:
- number of clock cycles per second ( 1 Gigahertz = 1 billion clock cycles per second).
- L1 and L2 cache: processors can keep (cache) data in their own dedicated memory. This helps process data faster, as accessing this data takes less time than accessing the same data in main RAM. L1 cache sits closest to the core; L2 cache is larger but slightly slower to access.
- number of cores: each core can process data independently. As a result, the more cores a processor has, the more data can be processed in parallel. This can increase the overall performance of the processor (provided that your software runs actions in parallel). The number of threads is the number of instruction streams that can run in parallel. Usually it equals the number of cores, when each core runs one thread at a time. On some processors, each core can run several threads at the same time, which increases the total number of threads.
- architecture type: for the same clock speed, a 64-bit processor can be faster than a 32-bit processor.
Example of specification:
Processors: 16 x 3.8 GHz processor cores: means 16 cores running at 3.8GHz each
Level 2 (L2) cache: 256 KB L2 cache per core
Level 3 (L3) cache: 10 MB L3 cache per core
RAM related: up to 1 TB of 1066 MHz DDR3: 1 TB is the amount of RAM and 1066 MHz its speed.
The operating system can also influence your machine's performance. Some computers running Windows become slower after a few months due to disk fragmentation. This happens mainly when you often install new software. Reinstalling the operating system makes the computer fast again, though rarely as fast as initially.
Some people argue that Linux offers better performance on the same hardware. They write that:
- Windows runs more heavy processes in the background, including antiviruses.
- Windows file system leads to more data fragmentation.
- Windows focuses on being compatible with older versions of software. This is called backward compatibility. The side effect of backward compatibility can be a slower performance.
- Linux architecture (monolithic kernel) favors speed of communication between the operating system and the hardware (disk, RAM, CPU). Also, in the case of Linux, the kernel (the part of the operating system responsible for communication between the software and the hardware) is supposed to be updated more often by developers than in the case of Windows, as there are more developers working on enhancing Linux.
Windows fans will argue that these arguments are irrelevant. Linux fans will say these arguments are valid. The only way to know who is right or wrong is to try both operating systems with the same hardware and compare the performance and the number of failures of your application under heavy load.
Create clustered index on most used columns when selecting ranges
Clustered indexes physically order lines in a table.
Example: if you have a clustered index on social_security_number in the table clients:
- when selecting rows in the table without using order by (Example: select * from clients), they will appear ordered by the social_security_number, even if you did not insert them in this order.
- if you select all clients whose social_security_number is within a given range, the query will be faster with the clustered index. Indeed, these rows are already stored next to each other. So the DBMS just needs to find the first matching row, and then read the following rows until the end of the range.
- ordering by social_security_number in an SQL query will also be faster as the rows in the table are already in the right order.
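A sketch of such an index and of a range selection benefiting from it (SQL Server syntax; the table and column names are hypothetical):

```sql
-- Physically orders the rows of clients by social_security_number.
create clustered index idx_clients_ssn on clients (social_security_number);

-- Range selections over the clustered column benefit most:
select * from clients
where social_security_number between '1000000' and '1500000';
```
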
Unfortunately, you cannot always use this technique. There are two limitations:
- you can only have one clustered index per table. Choose the one that is the most used for range selections and ordering.
- depending on the DBMS, the clustered index may have to be unique, or duplicate values may be handled less efficiently. Example: think twice before creating a clustered index on first_name, last_name in the table employees if there is a possibility that two employees will have the same first name and last name.
Use joins instead of correlated subqueries
Joins and correlated subqueries are two methods to combine two tables. In some DBMSes (Sybase, MySQL) and for a high volume of data, joins are faster.
Example - correlated sub-query:
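A sketch, with hypothetical clients and salesperson tables:

```sql
-- The subquery runs once per row of clients.
select c.name,
       (select s.last_name
        from salesperson s
        where s.id = c.salesperson_id) as salesperson_name
from clients c;
```
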
Example - join:
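The same result written as a join (hypothetical tables):

```sql
select c.name, s.last_name as salesperson_name
from clients c
left join salesperson s on s.id = c.salesperson_id;
```
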
Both queries return the same results, but on some DBMSes, the second one is faster.
When comparing both queries' performance, you should clear the cache. If you do not, the second query will always appear faster, whether it is better tuned or not. This is how to clear the cache (be careful to execute this only on a test database):
- on SQL Server:
DBCC DROPCLEANBUFFERS
- on MySQL:
SELECT SQL_NO_CACHE * FROM table_name (table_name being the table the query you want to test selects from).
- on Oracle:
alter system flush buffer_cache;
Do not use non-deterministic functions on the left hand side of a comparison
We talked about caching earlier in this article. Some DBMSes also use caching, answering faster to an SQL query that has already been executed.
In some cases an SQL query can prevent this mechanism. This happens when you write, in your where clause, on the left hand side of the comparison (=,>,<,...), a function that does not always return the same result. getdate() is an example of such a function, which returns a different result depending on the current time.
Example: a query returning clients whose status changed more than 50 days ago.
Query before tuning (getdate() function on the left hand side of the comparator ">="):
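A hedged sketch of both versions (T-SQL syntax; the table and column names are hypothetical). In the tuned version, the getdate() call is isolated on the right hand side:

```sql
-- Before tuning: the computed, non-deterministic expression
-- sits on the left hand side of the comparison.
select * from clients
where datediff(day, status_change_date, getdate()) >= 50;

-- After tuning (roughly equivalent, up to day boundaries):
-- the column stands alone on the left hand side.
select * from clients
where status_change_date <= dateadd(day, -50, getdate());
```
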
Add a column containing reverse names when looking for a string at the end of a column
Let us say you have an index on a text field. A query looking for a string at the end of this column is sometimes slower than a query looking for a string at the beginning of the column.
Likewise, in a dictionary, it would be much faster to find a word starting with the letter “a” than finding a word ending with this letter “a”.
The solution is to:
- create a new column in the same table (let us call it column B).
- use a trigger so that each time you insert or update a row, you put in this new column B the reverse value of the column you want to use in the where clause (let us call it column A). To execute that operation use the function reverse.
- when looking for a string at the end of the column A, match the reverse of the string you are looking for to the new column B.
Example - query:
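A sketch of such a slow query, looking for first names ending in "fia" (hypothetical clients table):

```sql
-- The leading wildcard prevents an index on first_name from helping.
select * from clients where first_name like '%fia';
```
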
Example - solution:
- You create a column reverse_first_name.
- Each time you create a new row or update the value of first_name, the trigger populates reverse_first_name using reverse(first_name).
- Use the query:
select * from clients where reverse_first_name like 'aif%' (the reverse of "fia" being "aif")
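The trigger from step two could look like this (T-SQL syntax; all names are hypothetical):

```sql
create trigger trg_clients_reverse_name on clients
after insert, update
as
begin
    -- Keep reverse_first_name in sync with first_name.
    update c
    set reverse_first_name = reverse(c.first_name)
    from clients c
    join inserted i on i.id = c.id;
end
```
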
Do not select all columns
In your query, select only the columns that are necessary for further processing. As a result:
- the DBMS will have less data to read.
- less data will transit through the network.
Example: if you only need the first_name and last_name of the client, then use “select first_name, last_name” instead of “select *”.
Execute updates in batches
If you are writing a script to update a large amount of data, execute these changes in batches. This means that you separate the full list of updates into chunks, and update chunk by chunk. Example: you update 1000 rows in the clients table, then another 1000 rows, until you have updated all the rows in the table.
This improves the update script’s performance. You can also use this technique for inserts. For more details, see the blog on how to insert data massively.
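A hedged sketch of a batched update (T-SQL syntax; the table, columns and batch size are hypothetical):

```sql
while 1 = 1
begin
    -- Update at most 1000 rows per iteration.
    update top (1000) clients
    set status = 'inactive'
    where last_order_date < '2015-01-01'
      and status <> 'inactive';

    -- Stop once no row is left to update.
    if @@rowcount = 0 break;
end
```
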
Give exactly the same data type to columns used to join tables
When you join two tables, you match a column in the left table with a column in the right table. When doing that, make sure the two columns' data types correspond.
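A sketch of such a join (hypothetical tables), where the two matched columns share one type, for example integer:

```sql
-- clients.salesperson_id and salesperson.id are both declared as integer.
select c.name, s.last_name
from clients c
join salesperson s on s.id = c.salesperson_id;
```
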
In this example, the column salesperson_id of table clients should have the same type (example: integer) as the column called id in the salesperson table.
Split voluminous tables into several tables
If you have a table storing a huge amount of information, you can split this table’s data between several tables. Each resulting table will contain less rows than the corresponding original table.
The clients_bills table has a huge number of rows. You can divide the volume by 4 by splitting the data in clients_bills table into 4 tables. You can name these tables clients_bills_1, clients_bills_2, clients_bills_3, clients_bills_4.
In order to split the table, you need to find out a rule to tell which row goes into which target table.
In our example, you can use the bill’s year to split the data. You will store bills corresponding to year 2016 in one table, bills corresponding to year 2015 in a second table, ....
Once you find out this rule, you need to change your code to read, update and insert data in the right target table. To make things clearer you can name the tables clients_bills_2016, clients_bills_2015,... instead of clients_bills_1, clients_bills_2, ...
Increase the speed at which information is sent to the user
SQL queries are fetching information from the database. Once this is done, you still need to send this information to the user requesting it. As a result, you need to work on the speed at which the information is sent to the user:
- make sure you send as little data as possible (selecting only the columns that are necessary instead of using "select *").
- make sure the internet connection between the machine hosting the database and the outside world is able to send a large amount of data per second (bandwidth). At home, your internet connection may be sufficient to surf on the internet and upload source code or scripts, but insufficient to send data from the database to thousands of users. If you have a large number of users, consider using a hosting company and take bandwidth into account when comparing companies. Example of offer: "uplink with a guaranteed 20 Mbit connection".
- make sure, if possible, that your database is as close as possible to the users or applications accessing the information.
The last criterion is the distance between the database and the person or application requesting the information. If you are outsourcing the hosting of your database server, you may sometimes overlook in which countries your servers are. You can reach a situation where your web application server is in the US, the database server in Asia and visitors in Europe. In that case you lose time sending information back and forth between continents. Avoid keeping different components in different locations.
In order to do that, you can divide your data between continents. Data related to one continent would be in a database hosted on this continent. You would have one database server for America, another for Asia,... Each database would have the same tables but would contain only rows corresponding to its respective continent. Example: an American client database would contain only data about clients in Americas and would be in the US; the Asian client database would contain data about clients in Asia and would be in Singapore,...
Use a temporary table when joining tables
When joining two voluminous tables, you can enhance the query using the following technique:
- you query one of the two tables.
- you store the result of this query in a temporary table.
- you join this temporary table with the second table.
To choose which of the two tables to use in step one, pick the one whose result, stored in the temporary table, is as small as possible.
The more tables you are joining the bigger the potential gain in performance.
Example - before applying the technique:
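A sketch of the direct join (hypothetical tables):

```sql
select c.name, b.amount
from clients c
join clients_bills b on b.client_id = c.id
where c.country = 'US';
```
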
Example - after applying the technique:
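The same query rewritten with a temporary table (T-SQL syntax; all names are hypothetical):

```sql
-- Steps 1 and 2: query the smaller side and store the result.
select id, name
into #us_clients
from clients
where country = 'US';

-- Step 3: join the temporary table with the second table.
select c.name, b.amount
from #us_clients c
join clients_bills b on b.client_id = c.id;
```
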
Use stored procedures
The article about stored procedures lists the advantages of using a stored procedure in the chapter "advantages and disadvantages of using stored procedures".
If you are executing several SQL operations in one transaction, each operation depending on the previous one(s), stored procedures can help you.
Indeed, with a stored procedure you do only one call to the database. Without using a stored procedure you do as many calls as there are operations to execute. Remember that at each call to the database you need to establish a connection with the server, send data to the server and receive the result from the server. As a result, stored procedures can reduce the overall time spent executing a group of SQL statements.
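As an illustration (T-SQL syntax; all names are hypothetical), two dependent statements grouped in one stored procedure, hence a single call to the database:

```sql
create procedure reassign_client
    @client_id int,
    @new_salesperson_id int
as
begin
    update clients
    set salesperson_id = @new_salesperson_id
    where id = @client_id;

    -- Depends on the previous statement having run.
    insert into clients_history (client_id, event)
    values (@client_id, 'salesperson reassigned');
end
```
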
For some DBMSes using a stored procedure increases performance even when you are executing one SQL statement inside it. This is because when you create your stored procedure the DBMS interprets the SQL statement included in this stored procedure. As a result, when you call the stored procedure the DBMS does not need to make this interpretation anymore. When calling the SQL query directly, the DBMS may have to make the interpretation of the SQL statement each time the query is called. As there is less to do at each call, we may have a performance gain.
I write “may have”, because if the DBMS keeps this interpretation in memory, during the next executions the DBMS will not need to make the interpretation again. In that case, the direct call to the SQL query will be as fast as the call to the stored procedure.
Delete records from underlying tables (possibly archive the deleted records)
If you have voluminous tables, you can increase your SQL performance by moving data that is no longer necessary for your application to separate tables. As a result, the data in the database will be less voluminous and therefore faster to access.
In order to apply this tip, you need to find out rules to determine which data is not useful anymore.
Examples of such rules:
- it is not useful to keep information about bills older than 5 years.
- it is not useful to keep information about clients that you deactivated more than 1 year ago.
- it is not useful to keep data about promotions whose end date is before today.
Then, once you find out the rules to archive data, you need to write a script that:
- finds rows fulfilling these rules.
- copies these rows to another table not accessed by your application.
- removes the rows from the table accessed by the application once the copy is successful.
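A hedged sketch of such a script, applying the first rule (T-SQL syntax; table and column names are hypothetical; in a real script, wrap both statements in one transaction so no row is lost between the copy and the delete):

```sql
-- Steps 1 and 2: copy bills older than 5 years to the archive table.
insert into clients_bills_archive
select * from clients_bills
where bill_date < dateadd(year, -5, getdate());

-- Step 3: remove the archived rows from the live table.
delete from clients_bills
where bill_date < dateadd(year, -5, getdate());
```
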
You also need to create an application (even basic) that helps finding information in the archive.
This technique is interesting if you can archive a significant amount of data. If the rules defined during the first step are such that they apply to a low percentage of your data, then this technique is not worth implementing.
Rebuild indexes
Each time you update, insert or delete rows in a table that has an index, the DBMS re-calculates the index to take this change into account.
It is possible that after many re-calculations, the index loses some of its efficiency. One of the reasons is that, as a result of these operations, data gets scattered on the disk. It then takes longer for the DBMS to look for data. Note that this problem does not apply to SSD drives.
You can solve this problem by dropping the index and then creating it again.
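A sketch (the index and table names are hypothetical); note that some DBMSes offer a dedicated rebuild command instead:

```sql
drop index idx_clients_last_name on clients;
create index idx_clients_last_name on clients (last_name);

-- On SQL Server, the same can be done in one statement:
-- alter index idx_clients_last_name on clients rebuild;
```
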
Rebuild tables
The previous technique was to rebuild the indexes. For the same reasons, you can also rebuild tables. To do that, you need to:
- dump the table (data + structure) in a file.
- shut down your application.
- rename the table (example: add “_old” to the table name).
- re-create the table (data + structure) using the dump created in step 1.
- start your application and test it.
- after making sure your system is working correctly with the new table, you can delete the renamed table. You can also choose to keep it as a backup.
If possible test this process on a test environment before doing it in the main database.
Drop indexes and foreign keys before loading data
If you are planning to update or insert a massive amount of data using a script, indexes can slow your script down.
This happens because each time the table’s content changes, the index also needs to change.
In that case, you should drop the indexes before the script execution. Of course, delete only the indexes on tables whose data the script modifies. Once the script completes, recreate the deleted indexes.
This technique also applies to foreign keys. Foreign keys protect the table's integrity: for example, they prevent having a child row without a parent row. On the other hand, they can slow down updates and inserts, as the DBMS needs to check that each update or insert complies with the foreign key.
As with indexes, you delete the foreign keys, execute your updates and inserts, and then recreate the foreign keys.
Online versus batch
Every time the user requests an update, an insert or a delete, decide:
- if it is critical to execute the update, insert or delete as soon as the user requests it.
- if it is critical to make the user wait for the completion of the update, insert or delete before they can use the application further. Example: should the application hang as long as the bank has not accepted the credit card transaction?
If the answer to these two questions is no, find out how long it can wait. Then, code the following logic:
- when the user clicks on the update button, your application shows a message confirming the update is going to happen. The system does not do the update immediately, but remembers it has to do it. The user can keep using the application, although the update is not done yet.
- cyclically (during hours when few people are using your application, if it is OK to wait until then), the system executes the pending updates.
- optionally, the application sends a message to the user that it completed the updates.
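A minimal sketch of this queueing logic in SQL (T-SQL syntax; all table and column names are hypothetical):

```sql
-- When the user clicks the update button, only record the request:
insert into pending_updates (client_id, new_status, requested_at)
values (42, 'premium', getdate());

-- A scheduled off-peak job later applies all pending requests at once:
update c
set c.status = p.new_status
from clients c
join pending_updates p on p.client_id = c.id;

delete from pending_updates;
```
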
The result of operating like this is that the user spends less time waiting before being able to execute the next action on your application. This is interesting when the transaction is complex and requires several seconds while a lot of users are accessing your application. It is less interesting if the update is very fast even at peak times.
Put the most selective criteria first in the index and in the where clause
If your index includes several columns (composite index), the index should start with the column containing the fewest duplicate values (the most selective column).
Example: a table storing invoice lines. Consider the following scenario:
- each invoice has a unique ID.
- each invoice has several lines (each line on the invoice showing a purchased article) whose IDs are 1, 2, 3, ...
In that case create an index on invoice_id, line_id and not on line_id, invoice_id. Why? Let us compare the two options:
- invoice_id, line_id: you select first the invoice_id. Then you have to find the requested line among the lines belonging to this invoice. That would represent a dozen to a hundred lines only.
- line_id, invoice_id: you select first all lines having the same line_id. Let us say you select line_id =1. All invoices have a first line (assuming no invoice is empty), so in this first step you return all the invoices. As a result, during the second step you will still have to find your invoice among millions of rows.
We just saw that this technique implies putting the columns in the right order when creating the composite index. On some DBMSes, ordering these columns the same way in your where clause can also help. In our example, prefer:
select amount from invoice_lines where invoice_id = 234566 and line_id = 1
over:
select amount from invoice_lines where line_id = 1 and invoice_id = 234566
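The corresponding index, as a sketch (the table and index names are hypothetical):

```sql
-- Most selective column first: invoice_id, then line_id.
create index idx_invoice_lines on invoice_lines (invoice_id, line_id);
```
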
Use less storage space on each column
Using less space on columns will speed up your SQL queries only in rare cases, but:
- If your columns use less space, your database will use less space. This is an advantage, as it takes less time to backup a smaller database. It takes less time also to copy it to another server over the network.
- By limiting column size you are enforcing data validation. You could prevent for example data insertion that does not make sense, like last names of 1000 characters long, invalid status codes, invalid answer to a yes or no question...
If you decide to go for it, here is how you can gain space: determine what is the biggest size the data in each column can reasonably have. Then choose the smallest data type that can hold such a value:
- If the value is just yes or no (0 or 1), use a bit (on some DBMSes, several bit columns can even share a single byte of storage).
- In the case of a number: if you are sure it will be between 0 and 255 and will contain no decimals, use a tinyint (the size of a tinyint is 1 byte, which is 8 bits).
- In the case of a number: if you are sure it will be less than 32,767 and will have no decimals, use a smallint (size: 2 bytes).
- In the case of a number: if you are sure it will be less than 2,147,483,647 and will have no decimals, use an integer (size: 4 bytes). If not, use a bigint (size: 8 bytes).
- In the case of a number that has decimals, find out the total number of digits really needed as well as the number of digits needed after the decimal point. The total number of digits determines the storage size: 1-9 digits: 5 bytes. 10-19 digits: 9 bytes. 20-28 digits: 13 bytes. 29-38 digits: 17 bytes. Then, the more digits you use after the decimal point, the less digits you can use before the decimal point.
- If you store dates without hours, minutes and seconds, use smalldatetime.
- In the case of text: when all values in the column have the same length, use char(x). It uses 2 bytes less space for the same data length than varchar(x). Example: an airport code whose length is always 3 letters. Fields in char(3) will always have a storage size of 3 bytes. Fields in varchar(3) will have a storage size equal to the length of the data in the field + 2 bytes. As the data length is always 3 in our example, the storage size will always be 3 + 2 = 5 bytes.
- In the case of text with various lengths: it is best to use varchar over char. Example: a column contains strings whose biggest reasonable length is 100 characters. Let us say that on average the string length is 25 in this column. Then the storage size in the case of varchar(100) is on average 25 + 2 = 27 bytes. On the other hand, the storage size for char(100) is always 100 bytes.
If you choose varchar, you could decide what is the biggest reasonable length of the strings contained in the column. But as the storage size in varchar(x) depends on the string size, why bother determining what to put instead of x? Example: you know that the strings stored will never go over 200 characters, but the storage size of each string will be the same whether you declare the column as varchar(200) or varchar(1000). Indeed, the storage size of each string is equal to the string actual size + 2.
You will find some experts arguing that it is irrelevant for performance whether you declare varchar(200) or varchar(1000). Others say that it can impact performance. Their argument is that varchar(200) gives the DBMS an indication of the estimated storage size per field in this column. As every indication can influence how the DBMS plans to execute the query, it can make sense to give an estimate as close as possible to reality to "help" the DBMS do its job well.
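As a sketch (the table and columns are hypothetical), choosing compact types column by column:

```sql
create table users (
    id int not null,              -- up to about 2.1 billion users
    is_employee bit not null,     -- yes/no flag
    age tinyint null,             -- 0 to 255 is enough
    country_code char(2) null,    -- fixed length, e.g. 'US'
    last_name varchar(100) null   -- variable length
);
```
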
Choose between select count greater than 0 and exists
In some cases a query containing "where exists" performs differently from a query counting the number of rows and checking whether the count is greater than 0. Both approaches return the same result, but their speed may differ.
You want to display only clients that have at least one salesperson assigned.
Using "where exists":
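A sketch of both versions (hypothetical tables):

```sql
-- Using "where exists": can stop at the first matching salesperson.
select c.name
from clients c
where exists (select 1 from salesperson s where s.id = c.salesperson_id);

-- Using a count greater than 0: may count every matching row.
select c.name
from clients c
where (select count(*) from salesperson s where s.id = c.salesperson_id) > 0;
```
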
Again, when comparing the two SQL queries' performance, clear the database cache between the execution of the first query and the execution of the second query.
I hope these tips will help you tune your SQL queries' performance. Congratulations on your patience.