This article shows how to insert a massive amount of data in your tables. You will learn how to write SQL server scripts to:
- insert data in a loop.
- improve the performance of the insertion. This is done by inserting data in chunks.
- give indication about the script’s progress.
- in case of error, indicate the exact cause and avoid data corruption.
- introduce duplicate data.
- execute the SQL script from a batch script (dos).
This article will also show you two full working examples of such an SQL script.
The scripts shown in this article use techniques explained in the previous article about creating random data.
Inserting data in a loop
In order to insert a huge amount of data, create a loop:
- the script will execute a group of SQL inserts several times.
- at each execution, we add 1 to a counter (@counter).
- when this counter is greater than a maximum (@max_count), the script will stop executing these statements.
use TestDatabase: this has nothing to do with loops, but the first thing you should do in the script is to indicate which database you are using.
LOOP_START: : “LOOP_START:” is like a bookmark marking a line in the script. When “GOTO LOOP_START” is encountered, the script goes back to the bookmark “LOOP_START:”. It then starts executing commands from this line again. As a result, lines between “LOOP_START” and “GOTO LOOP_START” are executed again and again, in a loop.
select @counter = @counter + 1 : we add 1 to the counter “@counter” at each loop. As a result, the value of @counter is equal to the number of loops the script executed.
if (@counter > @max_count) begin goto LOOP_END end : when the counter is greater than max_count, we go out of the loop. Indeed, the marker “LOOP_END:” is located after “GOTO LOOP_START” (“GOTO LOOP_START” is the end of the loop). As a result, once the script goes to the line marked with “LOOP_END:”, it will not go back to the line marked with “LOOP_START:” anymore. It is like jumping out of the loop.
How to insert data in chunks
When inserting a huge number of rows in the database, you need to apply the following trick to insert rows faster:
- you present to the database a group of inserts (for example one thousand) to execute.
- then you ask the database to execute this group of inserts. This group of inserts forms a transaction. Asking the database to execute this group of inserts is called committing the transaction.
- once the transaction is committed, you mark a pause, and present to the database the next chunk.
- you repeat the same sequence until all your inserts are executed.
In this way, the database executes faster than when you present the inserts to execute one by one.
begin tran yourTransaction : this marks the beginning of the transaction.
commit tran yourTransaction : this marks the end of the transaction. It is also asking the database to execute the inserts included in the transaction. The script will submit every SQL command encore between the beginning and the end of the transaction to the database for execution. Note that the script will not execute any of these commands before it reaches “commit tran yourTransaction”.
if (@counter % @chunk_size) = 0 : this means “if @counter is a multiple of @chunk_size”. Example: chunk size is 1000, then transactions are committed at loop number 1000, 2000, 3000, … As a result, the transaction is committed every 1000 inserts. At loop number 1000, we commit all inserts included in loop number 1 to 1000,...
This means that we keep informing the database about inserts to execute (without asking for execution) until we reach a thousands inserts. Then we ask the database to execute (commit) this thousands inserts. Then we mark a pause of one second (WAITFOR DELAY '00:00:01'). After that, we begin a new transaction (begin tran yourTransaction).
Note that there is also a “commit tran yourTransaction” after the label “LOOP_END:”. This is to commit all the inserts presented to the database during the last loops. Example: size of chunk is 1000 and maximum number of inserts is 3500. You commit at 1000, 2000 and 3000 rows. Once you go out of the loop (when reaching 3500), you need to commit inserts between number 3000 and 3500.
Give more information about script execution
When an administrator executes a script, he likes to see:
- when the script started.
- what is the current progress.
- when the script ended.
- how long it took to execute.
Moreover, you want to show to the administrator only the relevant information. As a result, if you are executing a hundred thousands transaction, do not display the information “row inserted successfully” for each transaction.
set nocount on: prevents the script from logging on the screen all messages from the database. The only messages that will be displayed will be the messages we decided to show to the person running the script.
select @start_date=GETDATE() select @start_date as 'script started at': these commands get and then display the current date and time.
if (@counter % @chunk_size) = 0 begin select cast((@counter*100/@max_count) as char) + '%' as 'Progress' : at each chunk (group of inserts), we give the progress of the script in percentage. @counter is the current loop number, @max_count is the total number of loops to execute. cast(... as char) is used to convert the progress percentage into a string in order to be able to add the character “%” to it.
When executing long scripts, the administrator needs to know from time to time that something is happening. It is too scary to wait hours for a script to execute without seeing anything logged on the screen.
RAISERROR('',10,1) WITH NOWAIT: this forces SQL server to show the information to the user executing the script before the script is completed.
datediff("hour",@start_date,@end_date) as 'hour',datediff("minute",@start_date,@end_date) as 'min',datediff("second",@start_date,@end_date)as 'sec': This gives the difference between the time when the script was started and the time the script completed. It displays the difference in hour, minutes and seconds.
Rolling back the changes if something happened
The purpose of our scripts is to insert fake rows in the database. As a result, if a group of inserts is failing, the script does not need to stop execution. Indeed, it does not matter if a couple of rows is not inserted, as these rows do not correspond to real data.
On the other side, if at each loop we are inserting data in several tables. Therefore, we want that in case an insert fails on one table, we cancel the inserts on the other tables. Example: at loop number 555, we insert a client in the clients table, and then we fail inserting this client’s status in the status table. In that case, we want to cancel the insertion of this client in the client table. This is in order to avoid inconsistencies in the database (client without a status).
begin TRY ... end TRY : there are several SQL commands between “begin try” and “end try”. If one of these commands fails:
- the script cancels all the commands between “begin try” and “end try”.
- the script starts executing what is between “begin catch” and “end catch”.
- after that, the script does not stop executing, it just goes further.
Note that if nothing fails between "begin try" and "end try", then statements between "begin catch" and "end catch" are not executed.
select @error_count = @error_count + 1 : @error_count keeps track of the number of times a group of inserts failed. If the script starts executing what is between “begin catch” and “end catch”, it means it encountered one more error when executing the group of inserts between “begin try” and “end try”.
if @error_count < @max_error
: we set a maximum number of errors acceptable. It is OK to let pass a few errors while we are having lunch. But if the scripts encounters too many errors, it does not make sense to go on with the execution. As a result:
- if @error_count < @max_error, then we continue the execution (we go back to the beginning of the loop).
- if @error_count ≥ @max_error, we stop the execution (we go to after the end of the loop).
Indicating the exact cause of errors when they occur
If the script encounters an error, we want to know as much as possible about:
- the error returned by the database.
- the value of the parameters you were using when executing the transaction that failed.
Inserting duplicate data
In real life some people have the same first name. For example you can have two Johns in a company. In our script we will insert once in a while the same first name.
if (@counter % @duplicate_ratio) = 0 : this condition will be true every time @counter (counting the number of loops) will be a multiple of @duplicate_ratio. Example: @duplicate_ratio is 100; then every 100 transaction, we insert a duplicate value.
select @first_name = @duplicate_value_holder: this introduces a duplicate. Instead of generating a new first name randomly, we use the first name generated during the last loop.
exec getRandomWord @first_name output,4,5,'UL' select @duplicate_value_holder = @first_name: in other cases, we generate a new first name and keep this first name in the variable @duplicate_value_holder. This variable will be useful to create the next duplicate.
Executing the SQL script from a batch script
If you execute a huge amount of operations using SQL management studio, then sql management studio could lack memory. In that case, it is better to execute your script using the command line.
To do that:
- save your SQL script in a file (give this file the extension “.sql”).
- find out where is the sqlcmd executable on your computer. On Windows 8, click on the Windows button, type “sqlcmd.exe”. You will see appear on the upper right hand side a dialog “Search”, with underneath “SQLCMD.EXE” appearing. Right click on “SQLCMD.EXE”, and choose “open in the folder”. You will then see what is the path of the path of this executable.
- find out what is your server (it is written on the login dialog box of SQL management studio, in front of “server name”).
- create a file with extension “.bat".
- inside this file bat:
write a command to go to the directory where you installed SQL management studio. Example:
cd c:\Program Files\Microsoft SQL Server\Client SDK\ODBC\110\Tools\Binn
write a command to execute your script:
sqlcmd -S servername\SQLExpress -U SqlUserAccount -P SqlPassword
-r1 -i c:\yourscriptpath\your_sql_script.sql
In this line:
- replace servername by the server’s name.
- replace SqlUserAccount and SqlPassword by the administrator’s account and password.
- replace c:\yourscriptpath by the path where the script is located, and your_sql_script.sql by your script’s name.
“-r1” is an option to enable the script to show messages on the screen before the end of the script (in order to show the progress).