How to pass arrays to SQL? TVP – The Ultimate Guide
Have you always wanted to know how to pass arrays to SQL Server? Watch this Ultimate Guide to Table-Valued Parameters!
Hello and welcome! My name is Vedran Kesegić and this video is an absolute “must” for database developers. A fast-performing database – everyone wants that! For fast data processing, it is essential to process data in SETS. That means thousands of rows in a single command, NOT thousands of commands that each process a single row. Note the difference!
That means we need a way to send an array of rows into one SQL command. This is a basic building block for any well-performing database application – essential, I would say mandatory, for good performance.
We will start as a beginner and finish with more advanced techniques.
The Task
Let’s say we have one simple task: update the Status column in a SQL Server table for some rows. We have a list of IDs, possibly a long list, that identifies the rows, and a new status. The list of IDs changes with each execution, in both values and length.
We will evolve the solution step by step, from the worst to the best possible one.
Solution 1: Absolutely the worst practice would be to build a CSV list with baked-in values:
UPDATE MyTable SET Status = 123
WHERE Id IN (4123, 4132, 4213, 4231, 4312, 4321, 3124, 3142, 3214, 3241, 3412, 3421)
Here is another variation with “OR” instead of commas (OR-separated values):
UPDATE MyTable SET Status=123
WHERE Id=4123 OR Id=4132 OR Id=4213 OR Id=4231 OR Id=4312 OR Id=4321 OR Id=3124 OR Id=3142 OR Id=3214 OR Id=3241 OR Id=3412 OR Id=3421
Why is this bad?
Because we need to leverage the plan cache of SQL Server. This cache has an amazing ability to speed up our database queries by caching and later reusing execution plans.
Creating a good query plan takes CPU time and compilation locks. Caching the plan skips plan creation on subsequent calls, which results in faster execution.
If we do not use the cache, our code performs poorly – painfully slow – and it does not scale.
In order to use the cache, the SQL statement must be exactly the SAME as the cached statement, byte for byte; otherwise the cached plan will NOT be reused.
If we change the IDs on every execution, the query statement is different each time, the cached plan is NOT reused, and a new plan is compiled on every execution.
Compilation takes CPU time, which makes our query slow and consumes server resources. It also takes compilation LOCKS that block other sessions, slowing down even unrelated sessions across the entire server.
The server’s plan cache becomes flooded with plans that are never reused, because the values are “baked” into the statement and the statements are all different.
Those frequent compilations and the constant churning of the cache cause server-wide damage, slowing down almost all queries, not just this one.
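If you want to see this flooding on your own server, one way is to peek into the plan cache DMVs. The query below is a generic sketch (not tied to this example) that lists single-use ad-hoc plans:

-- Single-use ad-hoc plans sitting in the plan cache (a rough sketch)
SELECT cp.usecounts, cp.size_in_bytes, st.text
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) st
WHERE cp.objtype = 'Adhoc'
  AND cp.usecounts = 1
ORDER BY cp.size_in_bytes DESC;

Many nearly identical statements with usecounts = 1 are exactly the “flood” described above.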
Another downside is vulnerability to SQL injection attacks. Hackers can break in fairly easily.
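To illustrate the idea with a simplified, hypothetical sketch (the @userInput value is made up; in reality the concatenation would happen in the app):

-- Simulating app-side string concatenation; @userInput is a hypothetical attacker-controlled value
DECLARE @userInput NVARCHAR(200) = N'0); DELETE FROM dbo.MyTable; --';
DECLARE @sql NVARCHAR(MAX) = N'UPDATE MyTable SET Status = 123 WHERE Id IN (' + @userInput + N')';
PRINT @sql; -- the injected DELETE has become part of the command

A parameter, by contrast, is always treated as a value and never as executable SQL.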
So that solution is really a bad practice, and hopefully whoever uses it will watch this video and learn a better approach.
For now, we will fix the plan cache flooding issue by introducing parameters, so the plan can be reused!
Solution 2: A CSV of parameters looks like this:
DECLARE @Status INT=123;
DECLARE @Id1 INT=4123, @Id2 INT=4132, @Id3 INT=4213, @Id4 INT=4231, @Id5 INT=4312, @Id6 INT=4321,
        @Id7 INT=3124, @Id8 INT=3142, @Id9 INT=3214, @Id10 INT=3241, @Id11 INT=3412, @Id12 INT=3421;
-- The command
UPDATE MyTable SET Status=@Status
WHERE Id IN (@Id1, @Id2, @Id3, @Id4, @Id5, @Id6, @Id7, @Id8, @Id9, @Id10, @Id11, @Id12)
Although we are using parameters, this is still a bad approach. Because the number of parameters varies, the number of distinct statements (and plans) is large, and we do not reuse plans much.
This solution does not fix the plan cache flood, the CPU overhead of compilation, or the blocking caused by compile locks. It still does not scale.
It fixes only the vulnerability to SQL injection attacks.
Some frameworks auto-generate this bad practice – Entity Framework, for example.
EF generates really bad SQL code. It would probably win the gold medal in a “bad practices in SQL” competition. It is hard to find so many bad practices in one place. Code that performs super-slow and does not scale is a “gem” if you need to demo bad practices or show “how NOT to code” examples in SQL trainings. But if you use EF in production and your database is getting bigger, chances are you already have a ton of performance issues.
—EPP—
Speaking of slow databases, I routinely optimize SQL Server database systems for companies. Reach out to me and you will be amazed at what can be done with minimal effort – not just database improvements, but also changes to the app so it works faster with the database.
I’ve optimized so many systems that it is now an optimization package, sold with a known result, time, and price in advance. All results are measured before vs. after, with graphs and numbers, so you can see exactly what we achieved. Users feel the difference and are happier with your product. Fewer CPU cores are needed after optimization, which cuts expensive SQL licensing costs – and that makes your CFO happy. Everybody wins. Do not hesitate to reach out; you will be glad you did.
—
To reuse the plan cache, we will introduce parameters that are INDEPENDENT of the number of IDs.
Solution 3: We will pass a parameter containing CSV. Here is the code:
-- Parameter declaration and values
DECLARE @Status INT=123; -- in dotnet we use the SqlParameter class instead of DECLARE-ing them here
DECLARE @IdList VARCHAR(MAX)='4123, 4132, 4213, 4231, 4312, 4321, 3124, 3142, 3214, 3241, 3412, 3421'
-- Statement. We could also use XML or JSON as variations of the same solution.
-- Note: STRING_SPLIT returns a column named "value"
UPDATE MyTable SET Status=@Status
WHERE Id IN (SELECT t.value FROM STRING_SPLIT(@IdList, ',') t)
Notice the statement – it is always the same, even when different values and lists of different lengths are sent, because the entire list is in the single @IdList parameter.
Now the plan cache is reused – excellent!
But this solution has a big flaw: unpacking CSV, XML, or JSON in T-SQL takes a TON of CPU time and performs really slowly.
Unpacking takes dramatically longer as sets grow, burns CPU time, and does not scale well; for bigger sets it can slow down to a crawl.
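A rough way to measure the unpacking cost on your own server (a sketch only; the row count is arbitrary, STRING_AGG needs SQL 2017+ and STRING_SPLIT needs compatibility level 130+):

-- Build a large CSV, then measure how much CPU the splitting alone burns
DECLARE @IdList VARCHAR(MAX);
SELECT @IdList = STRING_AGG(CAST(v.n AS VARCHAR(MAX)), ',')
FROM (SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
      FROM sys.all_columns a CROSS JOIN sys.all_columns b) v;

SET STATISTICS TIME ON;
SELECT COUNT(*) FROM STRING_SPLIT(@IdList, ','); -- check "CPU time" in the Messages tab
SET STATISTICS TIME OFF;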
What can we do? How to fix that CPU issue and get a better performance?
Solution 4: By using Table-Valued Parameters!
If you are familiar with table variables, TVP is something similar. It is a table variable that is passed as a parameter.
Some think that a TVP can only be used to pass an array of records to a stored procedure, but it can pass arrays to ANY SQL statement, not just a stored procedure.
Almost every popular ORM framework supports TVPs, including Dapper and EF, so you can use TVPs with almost any ORM and they perform really well – even with EF!
But TVP requires a small preparation.
We need to have a TYPE already created inside the database:
CREATE TYPE dbo.IntCollection AS TABLE(IntValue INT PRIMARY KEY);
The name is generic to make that type reusable in any scenario, not tied to just one particular table.
We do not want to create a bunch of similar types; we want them to be reusable – therefore we use generic names that reflect only the data type.
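Following the same naming idea, a few more generic types might look like this (hypothetical examples; create only what you actually need):

CREATE TYPE dbo.BigIntCollection AS TABLE(BigIntValue BIGINT PRIMARY KEY);
CREATE TYPE dbo.GuidCollection AS TABLE(GuidValue UNIQUEIDENTIFIER PRIMARY KEY);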
As a good practice, we will also use table aliases and two-part names, e.g. “dbo.table”:
-- Declare variables (SqlParameter class in ado.net)
DECLARE @Status INT=123;
DECLARE @Ids dbo.IntCollection;
INSERT INTO @Ids VALUES (4123), (4132), (4213), (4231), (4312), (4321), (3124), (3142), (3214), (3241), (3412), (3421)
-- SQL command uses TVP (Table-Valued Parameter):
UPDATE m SET m.Status = @Status
FROM dbo.MyTable m
WHERE m.Id IN (SELECT t.IntValue FROM @Ids t)
This is much better! It reuses plans from the cache and skips compilations. It does NOT burn extra CPU or take locks for parsing and compiling – it performs fast, way faster than any example so far! It does not flood the cache, it scales well, and it is also resistant to SQL injection – very good!
Is it possible to improve on that even further?
Solution 5: Yes! It is a known issue that passing bigger sets of rows through a TVP, e.g. more than 10 000 rows, performs much slower than usual.
That means, we should stay below 10 000 rows per iteration.
Another thing is “LOCK ESCALATION”. If we insert/update/delete more than roughly 5000 rows at once, a “lock escalation” occurs, locking the entire table. That blocks everyone from changing rows in that table (“everybody waits”), and such code does not scale.
Maybe you have had a situation where someone tried to DELETE a bunch of rows and all users were blocked? Calling and screaming “what is happening with the database”?
That is a “lock escalation”. It happens for INSERT, UPDATE, DELETE, and MERGE when 5000+ rows are changed in a single SQL command.
That limit of 5000 is not strict – it can be much higher or even lower depending on the situation on the server – but if you stay below 5000 rows, it is reasonably safe to assume lock escalation will not occur.
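If you want to verify what happened, one rough way (a sketch; run it inside an open transaction, before committing) is to look at the locks held by your session:

-- One exclusive lock on the OBJECT resource, instead of thousands of KEY/PAGE locks,
-- is the sign that lock escalation occurred
SELECT resource_type, request_mode, COUNT(*) AS lock_count
FROM sys.dm_tran_locks
WHERE request_session_id = @@SPID
GROUP BY resource_type, request_mode
ORDER BY lock_count DESC;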
Therefore, we will pass 4000 rows at a time, in a loop from the app side, until all rows are processed.
We will also leverage the fact that IDs are unique, guaranteed by the PK on our TABLE TYPE, and will do a simple JOIN with the TVP:
UPDATE m SET m.Status = @Status
FROM dbo.MyTable m
JOIN @Ids t ON t.IntValue = m.Id
How cool is that? 🙂 The best solution looks the simplest of all!
This performs great, scales well, and does NOT lock the entire table, so multiple users (or threads) can work on that table.
Multiple threads get us even more speed, until blocking or lock collisions become significant and prevent you from speeding up any further.
Let’s dive into dotnet code for this example:
DEMO: TVP in ado.net
Prepare the TABLE TYPE and the table:
CREATE TYPE dbo.IntCollection AS TABLE(IntValue INT PRIMARY KEY);
GO
IF OBJECT_ID('dbo.MyTable') IS NOT NULL DROP TABLE dbo.MyTable;
CREATE TABLE dbo.MyTable
(
    Id INT IDENTITY CONSTRAINT PK_MyTable PRIMARY KEY,
    Status INT NOT NULL
);
GO
INSERT INTO dbo.MyTable(Status)
SELECT TOP 100000 1 FROM sys.all_columns, sys.all_columns c
GO
C# code:
// requires: using System; using System.Data; using System.Data.SqlClient;
int totalRows = 30000;
DataTable ids = new DataTable();
ids.Columns.Add("Id", typeof(int));
for (int i = 1; i <= totalRows; i++)
{
    //ids.Rows.Add(Guid.NewGuid().GetHashCode()); // random int
    ids.Rows.Add(i);
}

using (SqlConnection con = new SqlConnection("Server=SQL2019;Initial Catalog=tempdb;Integrated Security=true;"))
{
    con.Open();

    // With Table-Valued Parameters (TVP)
    using (SqlCommand cmd = new SqlCommand("UPDATE m SET m.Status = @Status FROM dbo.MyTable m JOIN @Ids t ON t.IntValue = m.Id", con))
    {
        cmd.Parameters.Add("@Status", SqlDbType.Int).Value = Guid.NewGuid().GetHashCode(); // random int
        cmd.Parameters.Add("@Ids", SqlDbType.Structured).Value = ids; // DataTable
        cmd.Parameters["@Ids"].TypeName = "dbo.IntCollection";
        cmd.ExecuteNonQuery();
    }

    // Example of SELECT with a scalar parameter
    using (SqlCommand cmd = new SqlCommand("SELECT o.name, o.modify_date FROM sys.objects o WHERE o.modify_date < @Date", con))
    {
        cmd.Parameters.Add("@Date", SqlDbType.DateTime2).Value = DateTime.Now;
        DataTable result = new DataTable();
        result.Load(cmd.ExecuteReader());
        Console.WriteLine("First table name: {0}", result.Rows[0].ItemArray[0]);
        Console.WriteLine("Rows returned: {0}", result.Rows.Count);
    }
}
Bonus – advanced technique: dynamic SQL with TVP
Where would we use dynamic SQL?
Normally we use static SQL, but one of the situations where dynamic SQL shines is the so-called “OPTIONAL FILTERS” problem.
That is when the WHERE clause contains filters that are sometimes ON and sometimes OFF, depending on a parameter value.
For example:
AND (t.Column = @parameter OR @parameter IS NULL)
– this “OR” is a telltale sign of an “optional filter”.
The “OR” acts like a switch: “give me these specific rows OR all rows”. In the execution plan, that should translate to “pick rows by index” or “full-scan all rows”.
There is another variation: “AND t.Column = ISNULL(@parameter, t.Column)”, with the same devastating effect.
To satisfy both cases with only one plan, SQL has to pick the broader one – a FULL SCAN! You can imagine how badly that performs, especially on big tables.
It is such a common case that it deserves its own video, but we will briefly show it here.
Let’s say the list of IDs is “optional”. Empty list (no rows) means “give us all rows”.
What would the static SQL solution be? Classic static SQL performs badly because it has only one plan for both cases: the FULL SCAN!
DECLARE @Status INT=123;
DECLARE @Ids dbo.IntCollection;
INSERT INTO @Ids VALUES (4123), (4132), (4213), (4231), (4312), (4321), (3124), (3142), (3214), (3241), (3412), (3421)
UPDATE m SET m.Status = @Status
FROM dbo.MyTable m
WHERE (NOT EXISTS(SELECT 1 FROM @Ids) OR m.Id IN (SELECT t.IntValue FROM @Ids t))
The ideal solution would use multiple execution plans, each one optimal for its combination of filters.
We will achieve that using dynamic SQL:
DECLARE @Status INT = 123;
DECLARE @Ids dbo.IntCollection;
INSERT INTO @Ids VALUES (4123), (4132), (4213), (4231), (4312), (4321), (3124), (3142), (3214), (3241), (3412), (3421)
DECLARE @sql NVARCHAR(MAX)='UPDATE m SET m.Status = @Status FROM dbo.MyTable m
--IDS-- JOIN @Ids t ON t.IntValue = m.Id'
IF EXISTS(SELECT 1 FROM @Ids) SET @sql = REPLACE(@sql, '--IDS--', ''); -- makes the filter active by removing the comment mark
EXEC sp_executesql @sql, N'@Status INT, @Ids dbo.IntCollection READONLY', @Status, @Ids
PRINT @sql
Did you notice? There is no OR here! We got rid of the “OR”s, as an OR would force us to use the slow “full scan” plan.
We will test both cases – an empty and a non-empty list of IDs – by commenting out the INSERT line.
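For clarity, this is roughly what PRINT @sql shows in the two cases (reconstructed from the REPLACE logic above, not captured output):

-- Empty ID list: '--IDS--' stays, the JOIN remains commented out, all rows are updated
UPDATE m SET m.Status = @Status FROM dbo.MyTable m
--IDS-- JOIN @Ids t ON t.IntValue = m.Id

-- Non-empty ID list: '--IDS--' is removed, so the JOIN (the filter) becomes active
UPDATE m SET m.Status = @Status FROM dbo.MyTable m
 JOIN @Ids t ON t.IntValue = m.Id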
Observe actual execution plans – they are DIFFERENT and optimal for each case – exactly what we wanted!
That makes this “optional filter” solution the best possible. Getting rid of the “OR”s makes the query run fast every time!
Summary
To sum it up:
- Do NOT use comma-separated values or “OR” lists.
- TVPs are much faster than CSV, XML, or JSON – avoid those in favor of TVPs!
- Use Table-Valued Parameters instead, and pass them to a stored procedure or to bare commands like in our examples.
- TVPs are supported in all relevant ORMs. They also reuse the plan cache of your SQL Server without flooding it, which means they scale well. Therefore, TVPs are highly recommended.
- We also learned how to use TVPs in dynamic SQL to solve the “optional filter” problem. We got rid of the “OR”s so the query can use an index when the filter is enabled and a full scan when the filter is disabled.
Thank you for watching! Press subscribe, like, please share, and see you in the next video. Bye!
This is an excellent write-up and really good video walkthrough. Thank you for sharing.
Was wondering if you had any suggestions on a generic enough StringCollection data type (similar to your IntCollection example) that would be useful for passing in tables of strings?
To be generic enough for most use cases you’d have to make the length quite long perhaps, but that seems problematic. How might you approach this?
Absolutely, a string collection is a frequently used generic collection type. Depending on your situation, you may want to have two versions: one for short strings (VARCHAR(500), for example; you may use Unicode here if on SQL2019+), and one for potentially very long strings (NVARCHAR(MAX)).
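For example, the two types might look like this (hypothetical names, just a sketch):

CREATE TYPE dbo.ShortStringCollection AS TABLE(StringValue VARCHAR(500) PRIMARY KEY);
CREATE TYPE dbo.LongStringCollection AS TABLE(StringValue NVARCHAR(MAX)); -- MAX columns cannot be part of a PRIMARY KEY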
Fantastic, another sound strategy. Thank you again for your insight.
Another question this raises for me (an admitted neophyte to SQL): solution 3, passing a CSV parameter, is deemed less than ideal because one has to unpack the CSV parameter, which is CPU intensive.
But with solution 4 — let’s say for example your TVP will be of a ShortStringCollection type — you’ll still have to unpack the original csv string anyway to insert the separate values into the TVP. Right?
So are you really gaining the benefit of avoiding the CPU time? Or is it because the unpacking only happens on the TVP insert and not at the WHERE/JOIN level which makes the difference?
I’m sure I’m missing something?
Once you have the values in a TVP, you use them directly; you do not have to unpack them. E.g. you join to @TVP, you insert FROM @TVP – there are no conversions. If an app has the data as a CSV string, for example, and it packs it into the TVP (prepares the parameter), that is where the packing occurs and CPU is spent. BUT we spend the CPU of the app/web server, not the DB server CPU – and that is very important. DB CPU is very expensive: you usually pay the SQL license based on CPU cores, and you want to avoid spending the DB’s CPU. Web/app CPU, on the other hand, is free (compared to the SQL license cost). PLUS, you can have as many web/app servers as you like – a farm of servers – hitting one central database; their CPUs are very easy to scale out (getting more of the same machines), unlike DB CPU, which usually scales up (getting a bigger machine).
Would it be possible to use temp tables instead of TVPs in this case? I’m in a similar situation to the one described in this article, where I need to send data from client to server and use it to query a base table. The issue in my case is that we don’t have access to create TVP types with my login.
You can use temp tables, especially if you want to send more than 10K rows at a time (a TVP is good only for <10K rows at a time, preferably around 1K rows). BUT temp tables cause recompiles and the “optional SPID” issue that floods the plan cache, so they are not suitable for super-frequent calls; they are more suitable for bigger but less frequent calls. In your case, I would send the CREATE TYPE script to the DB owner and ask them to install it. That way you would not have to ask for privileges and you would get your TVP in place.
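A minimal sketch of the temp-table variant (assuming the app fills #Ids, e.g. with SqlBulkCopy):

-- Same pattern as the TVP solution, but with a session temp table
CREATE TABLE #Ids (Id INT PRIMARY KEY);
-- ... the app bulk-inserts the ID list into #Ids here ...
DECLARE @Status INT = 123;
UPDATE m SET m.Status = @Status
FROM dbo.MyTable m
JOIN #Ids t ON t.Id = m.Id;
DROP TABLE #Ids;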