For a personal project I'm trying, for reasons of performance and security, to add display information in an XML field on the main table.
In this case Orders and Orderlines.
The current setup is:
tblOrders has 1 Index: Clustered on UID
tblOrderItems has 1 Index: Clustered on UID
tblOrder.OrderLines (XML) has 2 indexes: a primary XML index and a secondary XML index on PATH.
Now I'm trying the following 2 queries:
SELECT Ord.UID
, Item.DomainName
, Item.BasicInfo
, Item.Base
, Item.Period
FROM tblOrder Ord
INNER JOIN tblOrderItem Item
ON Item.OrderID = Ord.UID
WHERE Item.DomainName = 'domainname.com'
and
SELECT
UID
, c.value('(DomainName)[1]','nvarchar(150)') AS DomainName
, c.value('(BasicInfo)[1]','nvarchar(150)') AS [Basic Info]
, c.value('(Base)[1]','float') AS [Base Price]
, c.value('(Period)[1]','smallint') AS Period
FROM tblOrder
CROSS APPLY tblOrder.OrderLines.nodes('/OrderItem/line') as t(c)
WHERE c.value('(DomainName)[1]','nvarchar(150)') = 'domainname.com'
The first one has an average time of 4 ms while the second has an average time of 38 ms.
Both tests were done with the same data, which is not a lot since I'm trying to decide what data model to use.
My question at last: is it possible to rewrite the XML column / XML query to make that one more performant than the regular inner join?
Thanks.
First of all, SQL Server is a relational database.
The whole point of relational databases is normalization.
First Normal Form:
A database is in first normal form if it satisfies the following conditions:
Contains only atomic values
There are no repeating groups
An atomic value is a value that cannot be divided.
Using an XML column you insert non-atomic data. Then, when retrieving the data, you need to parse it to get specific values. Parsing is almost always more expensive than a simple JOIN. So the first approach is better.
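That said, if you stay with the relational model, the join query above can usually be made even cheaper with a covering nonclustered index on the filter column. A minimal sketch, assuming the column list from your first query (the index name is arbitrary):
-- Hypothetical covering index: seeks on DomainName and includes the
-- other selected columns so the query avoids key lookups.
CREATE NONCLUSTERED INDEX IX_tblOrderItem_DomainName
ON tblOrderItem (DomainName)
INCLUDE (OrderID, BasicInfo, Base, Period);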
My data is structured as follows:
create table data (id int, cluster int, weight float);
insert into data values (99,1,4);
insert into data values (99,2,3);
insert into data values (99,3,4);
insert into data values (1234,2,5);
insert into data values (1234,3,2);
insert into data values (1234,4,3);
Then I have to impute some values because the vector has to be of a certain length x:
declare @startnum int = 0;
declare @endnum int = 4036;
with gen as (
    select @startnum as num
    union all
    select num + 1 from gen where num + 1 <= @endnum
)
select * into gen from gen -- store the numbers
option (maxrecursion 10000);
I then have to cross join the values stored in gen, but this is done on two very large tables (not as in the current example). Currently my query has been running for over 2 hours and I'm starting to think there is something wrong. Any ideas on how I can make this procedure faster and more correct?
Here's what I'm doing right now.
select id, cluster, max(v) as weight
from (select id, cluster, case when cluster = num then weight else 0 end as v
      from data
      cross join gen) cross_num
group by id, cluster;
go
EDIT: It is the last query that is running very slowly, and of course I have a super large dataset :)
Note: I also wonder what Paste The Plan is exactly; I actually don't know how to look for this. Can someone give me a resource I can look up and try to understand?
So, the problem here is that you're creating a massive Cartesian product and aggregating at the same time.
However, we might be able to cheat if your data lines up well. This may also totally backfire if it lines up poorly. I can't see your data so I don't know what's going on.
I'm going to write this using an empty staging table or temp table. You could write it as a series of CTE expressions, but in my experience those do not perform quite as well. Ideally you can take these queries and wrap them in a stored procedure.
So, the primary key for your table can't be id, cluster, because you're aggregating on that group. If id, cluster is not very selective -- meaning that there are a lot of records for each id, cluster combination -- then we might be able to significantly reduce the amount of work done here. If there are 5 records for each id, cluster, then this will probably not help much, but if there are 100,000 for each id, cluster then it will probably help a lot.
First, create your gen table. I recommend creating a clustered primary key on gen.num.
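For example, a minimal sketch (the table and constraint names are placeholders; the numbers CTE just mirrors the one from your question):
CREATE TABLE gen (
    num int NOT NULL,
    CONSTRAINT PK_gen PRIMARY KEY CLUSTERED (num)
);
DECLARE @startnum int = 0, @endnum int = 4036;
WITH nums AS (
    SELECT @startnum AS num
    UNION ALL
    SELECT num + 1 FROM nums WHERE num + 1 <= @endnum
)
INSERT INTO gen (num)
SELECT num FROM nums
OPTION (MAXRECURSION 10000);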
Second, let's start building the data. Remember, I'm assuming StagingTable is empty.
Here's the first query that does the real work:
INSERT INTO StagingTable (id, cluster, weight)
SELECT id, cluster, MAX(weight) AS weight
FROM data
GROUP BY id, cluster
The query would benefit from an index, but whether (id, cluster, weight) is better or worse than (cluster, id, weight) will depend on your data. However, before you run this you should disable any indexes on StagingTable and then rebuild the index after running at least this first insert.
Depending on your data, you may require or benefit from or should avoid using a WHERE cluster BETWEEN 0 AND 4036 clause on the above query as well. It's not clear to me if there are 4037 clusters numbered 0 to 4036, or if you're only interested in clusters 0 to 4036 but there are more, or if you're only interested in creating "default" records of weight 0 for clusters 0 to 4036 but want all clusters aggregated if they happen to go higher.
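As a rough sketch of that disable-then-rebuild pattern, assuming a nonclustered index named IX_Staging_id_cluster (the name is a placeholder; note this only works for nonclustered indexes, since disabling a clustered index makes the table unwritable):
ALTER INDEX IX_Staging_id_cluster ON StagingTable DISABLE;
-- run the first INSERT ... SELECT shown above here
ALTER INDEX IX_Staging_id_cluster ON StagingTable REBUILD;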
Now, think about what's in StagingTable. Everything that we've loaded into that table is everywhere that there is an id, cluster in the data table. Critically, every id we might need will be in StagingTable, even if it's missing one or more values for cluster.
Now we just need to fill in the missing cluster values for each id, and we know that the weight of the missing clusters is 0.
INSERT INTO StagingTable (id, cluster, weight)
SELECT DISTINCT s.id, g.num, 0
FROM StagingTable s
INNER JOIN gen g
ON g.num BETWEEN 0 AND 4036
WHERE NOT EXISTS (
SELECT 1
FROM StagingTable s2
WHERE s2.id = s.id
AND s2.cluster = g.num
)
The INNER JOIN gen g ON g.num BETWEEN 0 AND 4036 may not be necessary if gen is always going to be numbers 0 to 4036. In that case you can just use CROSS JOIN gen g.
The EXISTS is necessary to remove the duplicate rows.
Again, this query could benefit from an index on StagingTable, but without seeing your actual data it's a little difficult to tell exactly what you need (id, cluster) is one possibility, but (cluster, id) may actually work better. Ideally, it should be a clustered primary key.
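For example, a purely illustrative definition (column types are guessed from your data table):
CREATE TABLE StagingTable (
    id int NOT NULL,
    cluster int NOT NULL,
    weight float NOT NULL,
    CONSTRAINT PK_StagingTable PRIMARY KEY CLUSTERED (id, cluster)
);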
Edit: Just realized my original second query wouldn't work in some cases. I've modified it to correct the logic.
I have a simple DB table with ONLY 5 columns and no primary key, holding 7 billion+ (7,50,01,771) rows. Yes, you read that correctly. It has one clustered index.
(Screenshot: DB table columns)
(Screenshot: clustered index)
If I write a simple select query to get data, it takes 7-8 minutes to return. Now you get my next question: what techniques can I apply to this DB table so that I can get the data in time?
In the actual scenario, this table is joined with 2 temp tables that have WHERE clauses and filtered data. Please find my query below for reference.
SELECT dt.ZipFrom, dt.ZipTo, dt.Total_time,
       sz.storelocation, sz.AcctShip, sz.Licensee, sz.Entity
FROM #Zips z
INNER JOIN DriveTime_ZIPtoZIP dt ON ZipFrom = z.zip
INNER JOIN #storeZips sz ON ZipTo = sz.zip
ORDER BY z.zip DESC, Total_time ASC
Thanks
You can index according to the WHERE conditions in the query. However, this comes at a cost: storage.
The ORDER BY clause is also important. If you have to use ORDER BY in your query, you can index for it as well.
But do not forget the cost of indexing ...
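If you do decide to index, one plausible index for the query above might look like this (just a sketch; the name is arbitrary and whether it helps depends on what the existing clustered index already covers):
CREATE NONCLUSTERED INDEX IX_DriveTime_ZipFrom_ZipTo
ON DriveTime_ZIPtoZIP (ZipFrom, ZipTo)
INCLUDE (Total_time);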
For sync purposes, I am trying to get a subset of the existing objects in a table.
The table has two fields, [Group] and Member, which are both stringified Guids.
All rows together may be too large to fit into a DataTable; I already encountered an OutOfMemory exception. But I have to check that everything I need right now is in the DataTable. So I take the Guids I want to check (they come in chunks of 1000) and query only for the related objects.
So, instead of filling my datatable once with all
SELECT * FROM Group_Membership
I am running the following SQL query against my SQL database to get related objects for one thousand Guids at a time:
SELECT *
FROM Group_Membership
WHERE
[Group] IN (@Guid0, @Guid1, @Guid2, @Guid3, @Guid4, @Guid5, ..., @Guid999)
The table in question now contains a total of 142 entries, and the query already times out (CommandTimeout = 30 seconds). On other tables, which are not as sparsely populated, similar queries don't time out.
Could someone shed some light on the logic of SQL Server and whether/how I could hint it into the right direction?
I already tried to add a nonclustered index on the column Group, but it didn't help.
I'm not sure that WHERE IN will be able to make full use of an index on [Group], if at all. However, if you had a second table containing the GUID values, and furthermore if that column had an index, then a join might perform very fast.
Create a temporary table for the GUIDs and populate it:
CREATE TABLE #Guids (
Guid varchar(255)
)
INSERT INTO #Guids (Guid)
VALUES
(@Guid0), (@Guid1), (@Guid2), (@Guid3), (@Guid4), ...
CREATE INDEX Idx_Guid ON #Guids (Guid);
Now try rephrasing your current query using a join instead of a WHERE IN (...):
SELECT *
FROM Group_Membership t1
INNER JOIN #Guids t2
ON t1.[Group] = t2.Guid;
As a disclaimer, if this doesn't improve the performance, it could be because your table has low cardinality. In such a case, an index might not be very effective.
I'm using Azure SQL Database & MS SQL Server Management Studio, and I'm wondering if it's possible to create a self-referencing table that maintains itself.
I have three tables: Race, Runner, Names. The Race table includes the following columns:
Race_ID (PK)
Race_Date
Race_Distance
Number_of_Runners
The second table is Runner. Runner contains the following columns:
Runner_Id (PK)
Race_ID (Foreign Key)
Name_ID
Finish_Position
Prior_Race_ID
The Names Table includes the following columns:
Full Name
Name_ID
The column of interest is Prior_Race_ID in the Runner table. I'd like to automatically populate this field via a Trigger or Stored Procedure, but I'm not sure if it's possible to do so or how to go about it. The goal would be to be able to get all of a runner's races very quickly and easily by traversing the Prior_Race_ID field.
Can anyone point me to a good resource or reference that explains whether and how this is achievable? Also, if there is a preferred approach to achieving my objective, please do share that.
Thanks for your input.
Okay, so we want, for each Competitor (better name than Names?), to find their two most recent races. You'd write a query like this:
SELECT
    * --TODO - Specific columns
FROM
    (SELECT
         *, --TODO - Specific columns
         ROW_NUMBER() OVER (PARTITION BY n.Name_ID ORDER BY r.Race_Date DESC) rn
     FROM
         Names n
         INNER JOIN Runners rs ON n.Name_ID = rs.Name_ID
         INNER JOIN Races r ON rs.Race_ID = r.Race_ID
    ) t
WHERE
    t.rn IN (1, 2)
That should produce two rows per competitor. If needed, you can then PIVOT this data if you want a single row per competitor, but I'd usually leave that up to the presentation layer, rather than do it in SQL.
And so, no, I wouldn't even have a Prior_Race_ID column. As a general rule, don't store data that can be calculated - that just introduces opportunities for that data to be incorrect compared to the base data.
Run the following SQL (the TOP (1) with ORDER BY is there so that only one prior race is picked per row, even when a runner has more than one race on the same day):
UPDATE r1
SET r1.Prior_Race_ID =
    (SELECT TOP (1) ra2.Race_ID
     FROM Runner r2
     INNER JOIN Race ra2 ON ra2.Race_ID = r2.Race_ID
     WHERE r2.Name_ID = r1.Name_ID AND ra2.Race_Date < ra1.Race_Date
     ORDER BY ra2.Race_Date DESC)
FROM Runner r1
INNER JOIN Race ra1 ON ra1.Race_ID = r1.Race_ID;
I have benefited from this website for a long time now. This is my first question on the site. It is regarding performance tuning a reporting query. Here it goes.
SELECT Count(b1.primkey)
from tableA b1 --WITH (NOLOCK)
join tableA b2 --WITH (NOLOCK)
on b1.email = b2.email
and DateDiff(day, b2.BookedDate , b1.BookedDate) > 1
tableA has around 7 million rows. Email is a varchar(100) field. BookedDate is a datetime field. primkey is a primary key column that is an int.
My purpose in writing this query is to find the count of entries that have the same email IDs but have come in one day late. This query takes about 45 minutes to run. I really want to reduce the time it takes to execute.
Since this is for reporting, I tried in vain to use the --WITH (NOLOCK) option to improve the read time. I have a columnstore index on tableA and I know that it is being used by the SQL optimizer - I can see it in the execution plan. I am using SQL Server 2012.
Can someone tell me in such a case, what would be better? Using a nonclustered index on email or a nonclustered columnstore index on tableA?
Please help me.
Your query is relatively complex. You are essentially joining two tables that have 7 million records each on a column that is not unique.
How about the following query instead:
select Email
from TableA
group by Email
having MAX(BookedDate) > MIN(BookedDate) + 1
Also make sure you have an index with Email and BookedDate.
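Something along these lines, for example (the index name is arbitrary):
CREATE NONCLUSTERED INDEX IX_TableA_Email_BookedDate
ON TableA (Email, BookedDate);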
Hope this helps.
You have 3 options here:
1. Create a clustered index on the email field, at least for the larger table. But I suppose there are other queries running on these tables, and the clustered index is needed on other fields.
2. Move emails to another table, and store email IDs in TableA and TableB; a join on an int field would be much faster than one on varchar fields.
3. Create indexes on the email fields with BookedDate as an included column (no need to include primkey; you can count on another field, or use count(*)). Code: create index idx_email on TableA (Email) include (BookedDate)
I think the third option is the one you should go with. There's not much work to be done, and there will be a great performance gain. The only problem is that an index on a varchar field will take a lot of space and impact insert/update operations; but you said that this is a reporting db, so I think you can allow that.