SQL Server 2005 XML data type

UPDATE: This issue is not related to the XML; I duplicated the table using an nvarchar(MAX) column instead and still see the same issue. I will repost a new topic.
I have a table with about a million records, and the table has an XML column. The query runs extremely slowly, even when selecting just an ID. Is there anything I can do to increase the speed? I tried setting "text in row" on, but SQL Server will not allow me to; I receive the error "Cannot switch to in row text in table".
I would appreciate any help in a fix or knowledge that I seem to be missing.
Thanks
TABLE
/****** Object: Table [dbo].[Audit] Script Date: 08/14/2009 09:49:01 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Audit](
[ID] [int] IDENTITY(1,1) NOT NULL,
[ParoleeID] [int] NOT NULL,
[Page] [int] NOT NULL,
[ObjectID] [int] NOT NULL,
[Data] [xml] NOT NULL,
[Created] [datetime] NULL,
CONSTRAINT [PK_Audit] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
QUERY
DECLARE @ID int
SET @ID = NULL
DECLARE @ParoleeID int
SET @ParoleeID = 158
DECLARE @Page int
SET @Page = 2
DECLARE @ObjectID int
SET @ObjectID = 93
DECLARE @Created datetime
SET @Created = NULL
SET NOCOUNT ON;
SELECT TOP 1 [Audit].* FROM [Audit]
WHERE
(@ID IS NULL OR Audit.ID = @ID) AND
(@ParoleeID IS NULL OR Audit.ParoleeID = @ParoleeID) AND
(@Page IS NULL OR Audit.Page = @Page) AND
(@ObjectID IS NULL OR Audit.ObjectID = @ObjectID) AND
(@Created IS NULL OR (Audit.Created > @Created AND Audit.Created < DATEADD(d, 1, @Created)))

You need to create a primary XML index on the column. Above anything else, having this will assist ALL your queries.
Once you have it, you can create secondary XML indexes on the XML data.
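For the Audit table in the question, that would look something like this (index names are assumed; a primary XML index requires a clustered primary key on the table, which PK_Audit provides):
CREATE PRIMARY XML INDEX PXML_Audit_Data
ON dbo.Audit (Data);
-- Optional secondary XML indexes are layered on top of the primary one:
-- CREATE XML INDEX SXML_Audit_Data_Path ON dbo.Audit (Data)
-- USING XML INDEX PXML_Audit_Data FOR PATH;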
From experience, though, if you can store some information in relational columns, SQL Server is much better at searching and indexing those than XML. I.e. any key columns and commonly searched data should be stored relationally where possible.

Sql Server 2005 – Twelve Tips For Optimizing Query Performance by Tony Wright
Turn on the execution plan, and statistics
Use Clustered Indexes
Use Indexed Views
Use Covering Indexes
Keep your clustered index small.
Avoid cursors
Archive old data
Partition your data correctly
Remove user-defined inline scalar functions
Use APPLY
Use computed columns
Use the correct transaction isolation level
http://tonesdotnetblog.wordpress.com/2008/05/26/twelve-tips-for-optimising-sql-server-2005-queries/

I had the very same scenario - and the solution in our case is computed columns.
For those bits of information that you need frequently from your XML, we created a computed column on the "hosting" table which reaches into the XML and pulls out the necessary value using XPath. In most cases we're even able to persist this computed column, so that it becomes part of the table and can be queried and even indexed; query speed on those columns is absolutely no problem anymore.
We also tried XML indices in the beginning, but their disadvantage is the fact that they're absolutely HUGE on disk - this may or may not be a problem. Since we needed to ship back and forth the whole database frequently (as a SQL backup), we eventually gave up on them.
OK, to set up a computed column that retrieves bits of information from your XML, you first need to create a stored function which takes the XML as a parameter, extracts whatever information you need, and passes it back, something like this:
CREATE FUNCTION dbo.GetShopOrderID(@ShopOrder XML)
RETURNS VARCHAR(100)
WITH SCHEMABINDING -- required so the computed column can be PERSISTED
AS BEGIN
DECLARE @ShopOrderID VARCHAR(100)
SELECT
@ShopOrderID = @ShopOrder.value('(ActivateOrderRequest/ActivateOrder/OrderHead/OrderNumber)[1]', 'varchar(100)')
RETURN @ShopOrderID
END
Then, you'll need to add a computed column to your table and connect it to this stored function:
ALTER TABLE dbo.YourTable
ADD ShopOrderID AS dbo.GetShopOrderID(ShopOrderXML) PERSISTED
Now, you can easily select data from your table using this new column, as if it were a normal column:
SELECT (fields) FROM dbo.YourTable
WHERE ShopOrderID LIKE 'OSA%'
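And because the column is persisted, it can be indexed like any ordinary column; a minimal sketch (the index name is assumed):
CREATE NONCLUSTERED INDEX IX_YourTable_ShopOrderID
ON dbo.YourTable (ShopOrderID);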
Best of all - whenever you update your XML, all the computed columns are updated as well - they're always in sync, no triggers or other black magic needed!
Marc

Some information like the query you run, the table structure, the XML content etc would definitely help. A lot...
Without any info, I will guess. The query is running slow when selecting just an ID because you don't have an index on ID.
Updated
There are at least a few serious problems with your query.
Unless an ID is provided, the table can only be scanned end-to-end, because there are no indexes on the other columns.
Even if an ID is provided, the condition (@ID IS NULL OR ID = @ID) is not guaranteed to be SARGable, so it may still result in a table scan.
And most importantly: the query will generate a plan 'optimized' for the first set of parameters it sees, and it will reuse this plan for any combination of parameters, no matter which of them are NULL. That would make a difference if there were some variation in the access paths to choose from (i.e. indexes), but as it stands the query can only choose between a scan and a seek when @ID is present, and due to the way it is constructed it will pretty much always choose a scan because of the OR.
With this table design your query will run slow today, slower tomorrow, and impossibly slow next week as the size increases. You must look back at your requirements, decide which fields are important to query on, index them, and provide separate queries for them. OR-ing together all the possible filters like this is not going to work.
The XML you're trying to retrieve has absolutely nothing to do with the performance problem. You are simply brute-forcing a table scan and expecting SQL Server to magically find the records you want.
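If the catch-all shape absolutely has to stay, one commonly used mitigation (my addition, not part of the original answer) is OPTION (RECOMPILE), which compiles a fresh plan for the actual parameter values on every execution, at the cost of a compile per call; on newer builds the optimizer can then fold the "@x IS NULL" branches away entirely:
SELECT TOP 1 [Audit].*
FROM [Audit]
WHERE (@ID IS NULL OR Audit.ID = @ID)
AND (@ParoleeID IS NULL OR Audit.ParoleeID = @ParoleeID)
AND (@Page IS NULL OR Audit.Page = @Page)
AND (@ObjectID IS NULL OR Audit.ObjectID = @ObjectID)
OPTION (RECOMPILE);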
The better fix: if you want to retrieve a specific ParoleeID, Page and ObjectID, index the fields you search on and run a query for those and only those:
CREATE INDEX idx_Audit_ParoleeID ON Audit(ParoleeID);
CREATE INDEX idx_Audit_Page ON Audit(Page);
CREATE INDEX idx_Audit_ObjectID ON Audit(ObjectID);
GO
DECLARE @ParoleeID int
SET @ParoleeID = 158
DECLARE @Page int
SET @Page = 2
DECLARE @ObjectID int
SET @ObjectID = 93
SET NOCOUNT ON;
SELECT TOP 1 [Audit].* FROM [Audit]
WHERE Audit.ParoleeID = @ParoleeID
AND Audit.Page = @Page
AND Audit.ObjectID = @ObjectID;
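A possible refinement (again my note, not the answerer's): since the query filters on all three columns at once, a single composite index can satisfy it with one seek, rather than relying on the optimizer to intersect three single-column indexes:
CREATE INDEX idx_Audit_ParoleeID_Page_ObjectID
ON Audit (ParoleeID, Page, ObjectID);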

Related

Query equals check with 1-3 row result prefers NonClustered Index Scan

Here is my table, which has an order_number column. The table has fewer than 500 rows at the moment. A non-clustered index on order_number has been created.
CREATE TABLE [outbound_service].[shipment_line]
(
[id] [uniqueidentifier] NOT NULL,
[shipment_id] [uniqueidentifier] NOT NULL,
[order_number] [varchar](255) NOT NULL,
.... 18 other columns
CONSTRAINT [PK_SHIPMENT_LINE]
PRIMARY KEY CLUSTERED ([id] ASC)
WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY],
CONSTRAINT [uk_order_order_line_number]
UNIQUE NONCLUSTERED ([order_number] ASC, [order_line_number] ASC)
WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX IX_shipment_line_order
ON outbound_service.shipment_line(order_number ASC)
Here is my simple equality-check query, which returns at most 5 rows.
DECLARE @P0 nvarchar(400) = 'LG-ORD-002';
SELECT TOP 1 sl.order_number
FROM outbound_service.shipment_line sl
WHERE sl.order_number = @P0
I expected a nonclustered index seek, but I see an index scan happening, even with very limited data, at most 5 rows per order_number.
If I run the query without bind parameters, I see the index seek.
I have another database where I expect millions of rows, and I am worried about this scan: under high concurrency it drives this query to 100% CPU and slows down the rest of the workflows.
What could be the reason that SQL Server seems to prefer a scan over a seek here, when the data to return from the index is so minimal?
Your first execution plan shows a CONVERT_IMPLICIT. I would always align the parameter type with the column's type and size. In addition, if you want to pick the result with TOP, I would suggest adding ORDER BY order_number, for two reasons:
The non-clustered index IX_shipment_line_order can then be chosen without a sorting cost.
Without ORDER BY, TOP gives no guarantee of which row you get.
Like this query:
DECLARE @P0 [varchar](255) = 'LG-ORD-002';
SELECT TOP 1 sl.order_number
FROM outbound_service.shipment_line sl
WHERE sl.order_number = @P0
ORDER BY sl.order_number
Here is the final answer, after digging a bit into Hibernate and the SQL driver code and using SQL Profiler to understand the impact on queries. The Hibernate logs give no clue about the VARCHAR-to-NVARCHAR conversion. It is the MS SQL driver that, by default, passes all String variables in a JPA entity as nvarchar parameters into the SQL WHERE clause, even though the underlying mapped columns are plain varchar. This forces the SQL engine to implicitly convert the column data to nvarchar as well.
Solution:
Passing sendStringParametersAsUnicode=false in the connection URL (e.g. jdbc:sqlserver://host:1433;databaseName=mydb;sendStringParametersAsUnicode=false) changes the driver's default behaviour, so that all entity Strings are sent as varchar in the WHERE clause. If a specific field does need nvarchar and the driver does not account for it automatically, that field in the entity can be annotated with Hibernate's @Nationalized annotation; Hibernate then sets the right expectation for the SQL driver.
Setting sendStringParametersAsUnicode to false is helpful when you have many varchar columns that never need to store international characters and you refer to those columns heavily in WHERE clauses. It avoids the implicit conversions that lead to index scans.

What is the recommended way for adding a SQL Server stored procedure to an Entity Framework model

Summary
In our application we use Entity Framework 6 and SQL Server 2016. When importing a stored procedure with complex logic into the EF model, in particular one using temp tables, we put this on top:
SET FMTONLY OFF
This causes the stored procedure to run its full logic and hence return the correct output format, which allows the EF model to generate the complex type correctly.
In general, it works OK. However, Microsoft documentation https://learn.microsoft.com/en-us/sql/t-sql/statements/set-fmtonly-transact-sql?view=sql-server-ver15 states: Do not use this feature. So I'm wondering what the recommended way is to import a stored procedure and generate a complex type correctly.
More details
Firstly, I'm confused by the note "Do not use this feature" on the Microsoft page about FMTONLY. Isn't it the same command (with option ON) that the EF model updater itself calls? That would mean the EF model updater uses a feature which should not be used.
Secondly, I assume this EF model updater behaviour aims to prevent unwanted manipulation of DB data when a stored procedure is executed with all input arguments equal to NULL. But the actual behaviour does not seem aligned with this assumption.
Consider a simple stored procedure below where I tried to avoid using SET FMTONLY OFF. I analysed the exact behaviour using SQL profiler while updating the EF model from Visual Studio (2013, CU 5, Pro).
SET NOCOUNT ON;
IF (1=0) --FOR EF edmx update this will actually execute
BEGIN
DECLARE @tempTable TABLE
(
[id] [int],
[date_local] [datetime2](7),
[value] [float] NULL
)
SELECT * FROM @tempTable
RETURN 0
END
CREATE TABLE #TempValue
(
[id] [int] NOT NULL,
[date_local] [datetime2](7) NOT NULL,
[value] [float] NULL
CONSTRAINT [PK_TempValue]
PRIMARY KEY CLUSTERED ([id] ASC, [date_local] ASC)
WITH (PAD_INDEX =OFF, STATISTICS_NORECOMPUTE = OFF,
IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]
INSERT INTO #TempValue (id, date_local, value)
SELECT @id, *
FROM OPENJSON(@valueJson)
WITH (
date_local datetime2(7) '$.dateTime',
value float '$.value'
);
SELECT * FROM #TempValue
Points:
The EF model updater calls it, goes inside IF (1=0) (as expected with SET FMTONLY ON), and runs SELECT * FROM @tempTable correctly.
It ignores RETURN 0 and continues to run the rest of the stored procedure body.
It skips CREATE TABLE #TempValue but then tries to run INSERT INTO #TempValue, which of course doesn't exist, so it stops there and doesn't execute the final SELECT * FROM #TempValue, which would also fail.
In this case the complex type was generated correctly, because the last successful select was SELECT * FROM @tempTable. But I feel this was just luck, because no subsequent selects actually ran. To me, this whole behaviour is flawed, in particular skipping RETURN 0 and CREATE TABLE #TempValue but then allowing the INSERT INTO; if you allow INSERT INTO, surely you're not preserving the DB data?
So the 2 workarounds we tried that actually work are as follows:
Put SET FMTONLY OFF on top. It works, but I'm a bit worried by that "Do not use this feature" note on the Microsoft website.
Comment out the whole SP body and add DECLARE @tempTable TABLE ... SELECT * FROM @tempTable as in the example above. After updating the EF model, remove the added chunk and uncomment the original code. This has to be done every time the EF model is updated, so it's not very efficient.
What is a proper and recommended way to do it?

SQL Server getting unique records

I'm starting a project where I need to ensure that a large volume of users can obtain a promotional code for multiple promotions.
The codes have a monetary value attached to them, so it is vital that only one code goes out to each user and that no two users can ever receive the same code.
My plan so far is to create a table in SQL Server known as code_pool and do a bulk insert of codes each time.
The SQL Server 2005 table would look like this...
[id](int pk), [promo_code] varchar(150), [promotion_id](int fk)
Users would then retrieve each code using a stored proc that gets the first record from the table for that promotion, then deletes (or updates) the record before returning the code as the result of the procedure.
My question is how do I ensure that the record is properly locked so that only one user may ever obtain each record and that no two users accessing the proc concurrently will ever receive the same code?
Do I need to lock the table/records and if so how would this stack up in a busy production environment?
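One common way to write that claim step, so no two concurrent callers can ever get the same row, is a single DELETE with an OUTPUT clause and the READPAST hint. This is a sketch against the proposed code_pool table (the procedure name is assumed), not code from the question:
CREATE PROCEDURE dbo.ClaimPromoCode
@promotion_id int
AS
BEGIN
SET NOCOUNT ON;
DECLARE @claimed TABLE (promo_code varchar(150));
-- Claim and return one code in a single atomic statement.
-- READPAST skips rows locked by concurrent callers, so nobody
-- blocks and no two callers can receive the same row.
DELETE TOP (1)
FROM dbo.code_pool WITH (ROWLOCK, READPAST)
OUTPUT deleted.promo_code INTO @claimed
WHERE promotion_id = @promotion_id;
SELECT promo_code FROM @claimed; -- empty result = pool exhausted
END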
One very handy built-in data type for generating unique codes that are not easily guessable is the uniqueidentifier data type. You can use it to generate a unique code by giving the column an auto-generated value (using the newid() function). Because GUIDs are in hex and, unlike identity columns, are not generated sequentially, it isn't possible to predict which codes have been or will be generated, which makes your process less vulnerable to someone simply trying codes in sequence. The number of possible uniqueidentifiers is very large.
I've made the assumption that you will only want one promo code per person for each of your promos. The way you can do this in your database is by having a table, my example calls it PromoTest, with a composite primary key on the person and promo-category columns, which ensures each combination remains unique. I didn't add a concept of 'Used' to indicate whether the person has used the code, but that's quite trivial to add.
To create your table with the primary key constraint and the auto-generated value run the following:
CREATE TABLE [dbo].[PromoTest](
[personid] [bigint] NOT NULL, [promocategory] [int] NOT NULL,
[promocode] [uniqueidentifier] NOT NULL,
CONSTRAINT [PK_PromoTest] PRIMARY KEY CLUSTERED (
[personid] ASC,
[promocategory] ASC )
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS
= ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] ) ON [PRIMARY]
GO
ALTER TABLE [dbo].[PromoTest] ADD CONSTRAINT [DF_PromoTest_promocode] DEFAULT (newid()) FOR [promocode]
To then have a stored procedure that inserts a new promo code or selects the existing one is quite trivial, and due to the primary key constraint you cannot physically insert two codes of the same type for the same person.
The stored procedure can be defined as follows:
CREATE PROCEDURE GetOrCreatePromoCode
-- Add the parameters for the stored procedure here
@PersonId bigint,
@PromoCategory int,
@PromoCode uniqueidentifier OUT
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
-- Insert statements for procedure here
IF (NOT EXISTS(SELECT PromoCode FROM PromoTest WHERE personid = @PersonId AND promocategory = @PromoCategory))
BEGIN
INSERT INTO PromoTest (personid, promocategory) VALUES (@PersonId, @PromoCategory)
END
SET @PromoCode = (SELECT PromoCode FROM PromoTest WHERE personid = @PersonId AND promocategory = @PromoCategory)
END
GO
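Calling it then looks like this (a usage sketch with made-up values):
DECLARE @code uniqueidentifier;
EXEC dbo.GetOrCreatePromoCode
@PersonId = 42,
@PromoCategory = 1,
@PromoCode = @code OUTPUT;
SELECT @code AS promocode;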
Don't you want to add a column, e.g. in_use (int)? When you generate a new promo code, in_use = 0; when your stored proc obtains a non-used promo code, it selects the first code where in_use = 0 and then updates it to 1.
Why not use something similar, but like this:
Table UsedCodes
[id] int identity PK,
[userId] whatever,
[PromoId] int FK
Table Promotions
[PromoId] int pk,
[PromoCode] nvarchar
When a user gets a promo code, you would insert a value into used codes, and the promotion code delivered to them would be a concatenation of the promo code and the ID in the used codes table.
You could then also enforce a unique constraint on (UserId, PromoId) in the used codes table, to ensure only a single code per promo per user.
This has the advantage of retaining a record of the codes used, and reduces the complexity of needing your bulk insert, which could potentially introduce dups by accident. It also has the advantage of requiring no locking...
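A minimal sketch of that design (column types are assumed, since the answer leaves them open):
CREATE TABLE dbo.Promotions (
[PromoId] int NOT NULL PRIMARY KEY,
[PromoCode] nvarchar(50) NOT NULL
);
CREATE TABLE dbo.UsedCodes (
[id] int IDENTITY(1,1) NOT NULL PRIMARY KEY,
[userId] int NOT NULL,
[PromoId] int NOT NULL REFERENCES dbo.Promotions ([PromoId]),
CONSTRAINT [UQ_UsedCodes_User_PromoId] UNIQUE ([userId], [PromoId]) -- one code per promo per user
);
The code delivered to the user would then be, for example, PromoCode + '-' + CAST(id AS varchar(10)).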
John P has given you an excellent answer but I find GUIDs are unwieldy for use as a voucher code due to the string length.
Have a look at the question how to generate a voucher code in c#? and of course my answer :) Although you are asking for a SQL solution you can probably adapt the ideas given there.
I would also recommend that you do not delete the codes which have been used; otherwise, how will you be sure a code presented by a customer was created by your system?

SQL-Statement with dynamic table-names or redesign?

I have a MS SQL 2008 database which stores data for creating a weighted, undirected graph. The data is stored in tables with the following structure:
[id1] [int] NOT NULL,
[id2] [int] NOT NULL,
[weight] [float] NOT NULL
where [id1] and [id2] represents the two connected nodes and [weight] the weight of the edge that connects these nodes.
There are several different algorithms, that create the graph from some basic data. For each algorithm, I want to store the graph-data in a separate table. Those tables all have the same structure (as shown above) and use a specified prefix (similarityALB, similaritybyArticle, similaritybyCategory, ...) so I can identify them as graph-tables.
The client program can select, which table (i.e. by which algorithm the graph is created) to use for the further operations.
Access to the data is done through stored procedures. As I have different tables, I would need to use a variable table name, e.g.:
SELECT id1, id2, weight FROM @tableName
This doesn't work because SQL Server doesn't support variable table names in a statement. I have searched the web, and all solutions to this problem use a dynamic SQL EXEC() statement, e.g.:
EXEC('SELECT id1, id2, weight FROM ' + @tableName)
As most of them mention, this makes the statement prone to SQL injection, which I'd like to avoid. A simple redesign idea would be to put all the different graphs into one table and add a column to identify the different graphs:
[graphId] [int] NOT NULL,
[id1] [int] NOT NULL,
[id2] [int] NOT NULL,
[weight] [float] NOT NULL
My problem with this solution is that the graphs can be very large depending on the algorithm used (up to 500 million entries). I need to index the table over (id1, id2) and (id2, id1). Putting them all in one big table would make the table even bigger (and requests slower). Adding a new graph would perform badly because of the active indexes. And deleting a graph could no longer be done with TRUNCATE; I would need to use
DELETE FROM myTable WHERE graphId = @Id
which performs very badly on large tables and creates a very large logfile (which would exceed my disk space once the graph is big enough). So I'd like to keep independent tables for each graph.
Any suggestions on how to solve this problem, either by finding a way to parametrize the table name or by redesigning the database structure while avoiding the aforementioned problems?
SQL injection can easily be avoided in this case by comparing @tableName to the names of the existing tables. If it isn't one of them, it's bad input. (Obligatory xkcd reference: that is, unless you have a table called "bobby'; drop table students;")
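A minimal sketch of that validate-then-execute approach (the procedure name is assumed; QUOTENAME adds a second layer of protection in case a table name contains odd characters):
CREATE PROCEDURE dbo.GetGraph (
@tableName sysname
)
AS
BEGIN
-- Only proceed if the name matches an existing user table.
IF NOT EXISTS (SELECT 1 FROM sys.tables WHERE name = @tableName)
BEGIN
RAISERROR('Unknown graph table.', 16, 1);
RETURN;
END
DECLARE @sql nvarchar(max);
SET @sql = N'SELECT id1, id2, [weight] FROM dbo.' + QUOTENAME(@tableName);
EXEC sp_executesql @sql;
END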
Anyway, regarding your performance problems: with partitioned tables (available since SQL Server 2005), you can have the same advantages as several separate tables, but without the need for dynamic SQL.
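A rough sketch of that layout (all names are assumed; note that in the SQL Server 2008 era, table partitioning required Enterprise Edition):
-- One boundary value per graph id; extend as graphs are added.
CREATE PARTITION FUNCTION pfGraphId (int)
AS RANGE LEFT FOR VALUES (1, 2, 3);
CREATE PARTITION SCHEME psGraphId
AS PARTITION pfGraphId ALL TO ([PRIMARY]);
CREATE TABLE dbo.Graph (
[graphId] int NOT NULL,
[id1] int NOT NULL,
[id2] int NOT NULL,
[weight] float NOT NULL
) ON psGraphId (graphId);
-- Deleting one graph is then a metadata-only operation: switch its
-- partition out to an empty table of identical structure and truncate that.
-- ALTER TABLE dbo.Graph SWITCH PARTITION 2 TO dbo.GraphStaging;
-- TRUNCATE TABLE dbo.GraphStaging;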
Maybe I did not understand everything, but:
CREATE PROCEDURE dbo.GetMyData (
@TableName AS varchar(50)
)
AS
BEGIN
IF @TableName = 'Table_1'
BEGIN
SELECT id1
,id2
,[weight]
FROM dbo.Table_1
END
IF @TableName = 'Table_2'
BEGIN
SELECT id1
,id2
,[weight]
FROM dbo.Table_2
END
END
and then:
EXEC dbo.GetMyData @TableName = 'Table_1'
A different technique involves using synonyms dynamically, for example:
DECLARE @TableName varchar(50)
SET @TableName = 'Table_1'
-- drop synonym if it exists
IF object_id('dbo.MyCurrentTable', 'SN') IS NOT NULL
DROP SYNONYM dbo.MyCurrentTable ;
-- create synonym for the current table
IF @TableName = 'Table_1'
CREATE SYNONYM dbo.MyCurrentTable FOR dbo.Table_1 ;
IF @TableName = 'Table_2'
CREATE SYNONYM dbo.MyCurrentTable FOR dbo.Table_2 ;
-- use synonym
SELECT id1, id2, [weight]
FROM dbo.MyCurrentTable
Partitioned tables may be the answer to your problem. I've also got another idea, one that's "the other way around":
each graph keeps its own table (so you can still TRUNCATE TABLE)
define a view (with the structure you mentioned for your redesigned table) as a UNION ALL over all graph tables; a sketch follows below
I have no idea about the performance of a SELECT on this view and so on, but it may give you what you are looking for. I'd be interested in the results if you try this out.
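A minimal sketch of such a view, using two of the graph-table names from the question:
CREATE VIEW dbo.AllGraphs
AS
SELECT 'ALB' AS graphName, id1, id2, [weight] FROM dbo.similarityALB
UNION ALL
SELECT 'byArticle', id1, id2, [weight] FROM dbo.similaritybyArticle;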

SQL Server speed

I previously posted a question about my query speed with an XML column. After some further investigation I have found that the problem is not with the XML as previously thought. The table schema and query are very simple. There are over 800K rows; everything used to run smoothly, but with the increase in records the query now takes almost a minute to run.
The Table:
/****** Object: Table [dbo].[Audit] Script Date: 08/14/2009 09:49:01 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Audit](
[ID] [int] IDENTITY(1,1) NOT NULL,
[PID] [int] NOT NULL,
[Page] [int] NOT NULL,
[ObjectID] [int] NOT NULL,
[Data] [nvarchar](max) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[Created] [datetime] NULL,
CONSTRAINT [PK_Audit] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
The query:
SELECT *
FROM Audit
WHERE PID = 158
AND Page = 2
AND ObjectID = 93
The query only returns 26 records, and the interesting thing is that if I add "TOP 26" the query executes in less than a second, while if I change it to "TOP 27" it takes a minute. Even changing the query to SELECT ID makes no difference.
Any help is appreciated
You have an index on ID, but your query is filtering on other columns instead, so you're probably getting a full table scan. Changing to SELECT ID makes no difference because ID isn't anywhere in the WHERE clause.
It's quick when you ask for TOP 26 because it can quit once it finds 26 rows because you don't have any ORDER BY clause. Changing it to TOP 27 means that, once it finds the first 26 (which are all the matches, according to your post), it can't quit looking; it has to continue to search until it either finds a 27th matching row or reaches the end of the data.
A SHOW PLAN would have shown you the problem pretty quickly.
Add an index on the PID, Page and ObjectID fields.
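For example (a sketch; the index name is assumed):
CREATE NONCLUSTERED INDEX IX_Audit_PID_Page_ObjectID
ON dbo.Audit (PID, Page, ObjectID);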
Why not add a covering index for the Page and ObjectID columns and call it a day?
I think you should add non-unique indexes on the columns you want to search. Indexing will certainly reduce the search time. Whether you request a single column or multiple columns in the SELECT makes no difference; it is the per-row comparisons that need to be reduced, and indexing does that.
The 26 rows are probably near the start of the table: when you scan, you find them quickly and abort the rest of the scan; when looking for a 27th that doesn't exist, you scan the entire table, which is slow!
When looking into this type of problem, try the following from Management Studio:
Run SET SHOWPLAN_ALL ON.
Then run your query and look for the word "SCAN"; that is most likely where your query is running slow. Figure out why no index is being used.
In this case you need to add an index. I generally add an index based on how I query the data: if you always supply one of the three (PID, Page and ObjectID), add an index with that column first, and add another column to that index if you sometimes have that value too, etc.
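For example (SET SHOWPLAN_ALL must be the only statement in its batch, hence the GO separators; while it is ON, queries are not actually executed, only their estimated plans are returned):
SET SHOWPLAN_ALL ON;
GO
SELECT * FROM Audit WHERE PID = 158 AND Page = 2 AND ObjectID = 93;
GO
SET SHOWPLAN_ALL OFF;
GO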
