Here is simplified version of my schema. Using Sql Server 2012 enterprise edition.
CREATE table #abc (a INT , b INT);
CREATE TABLE #def ( a INT , c INT ,d INT);
INSERT INTO #abc values(1,23),(1,24);
INSERT INTO #def VALUES(1,53,54),(1,56,57)
Table #abc JOINs TO #def ON COLUMN a
Basically it is concatenation of rows from both tables based on column a. Tried inner join\cross apply but they all results in cross join kind of resultset understandably . I have workaround using another temp table(then update) but kind of feel that this can be done easily in single select . I am missing something simple here.
Need output like this:
a b c d
1 23 53 54
1 24 56 57
Thanks
-N
You need some sort of sequence number to join the tables together. You can generate one using row_number() as follows:
select a.a, a.b, d.c, d.d
from (select a.*, row_number() over (order by (select NULL)) as seqnum
from #abc a
) a join
(select d.*, row_number() over (order by (select NULL)) as seqnum
from #def d
) d
on a.seqnum = d.seqnum;
Now the caution, caution, caution. The order by clause does not really specify the ordering, so the sequence numbers may not be what you expect. You should really have a column to specify the ordering.
You need to have a unique key value in each row to be able to join the tables in the way you would like. Then, an inner join will return the result set you require.
If you introduce referential integrity between the tables, then this will be enforced and return the expected results.
Related
I use SQL Server 2016. Below is the rows in table: test_account. You can see the values of updDtm and fileCreateTime are identical. id is the primary key.
id accno updDtm fileCreatedTime
-----------------------------------------------------------------------
1 123456789 2022-07-27 09:41:10.0000000 2022-07-27 11:33:33.8300000
2 123456789 2022-07-27 09:41:10.0000000 2022-07-27 11:33:33.8300000
3 123456789 2022-07-27 09:41:10.0000000 2022-07-27 11:33:33.8300000
I want to query the latest account id which accno is 123456789 order by updDtm, fileCreatedTime
I run the following SQL, the output result is id = 1
SELECT t.id
FROM
(SELECT
ROW_NUMBER() OVER(PARTITION BY a.accno ORDER BY a.updDtm desc, a.fileCreatedTime DESC) AS seq,
a.id, a.accno, a.updDtm, a.fileCreatedTime
FROM
test_account a) AS t
WHERE t.seq = 1
My question is does the query result is repeatable and reliable (always output id=1 either run 1 time or multiple times) when the values of columns updDtm and fileCreatedTime are identical or just output the random id?
I read some articles and learn that for MySql and Oracle the query result is not reliable and reproducible. How about SQL Server?
The context of this documentation reference is ORDER BY usage with OFFSET and FETCH but the same considerations apply to all ORDER BY usage, including windowing functions like ROW_NUMBER(). In summary,
To achieve stable results between query requests, the following conditions must be met:
The underlying data that is used by the query must not change.
The ORDER BY clause contains a column or combination of columns that are guaranteed to be unique.
I'm trying to find an case to test if the query would output result
other than id=1 but with no luck
The ordering of rows when duplicate ORDER BY values exist is undefined (a.k.a. non-deterministic and arbitrary) because it depends on the execution plan (which may vary due to available indexes, stats, and the optimizer), parallelism, database engine internals, and even physical data storage. The example below yields different results due to a parallel plan on my test instance.
DROP TABLE IF EXISTS dbo.test_account;
CREATE TABLE dbo.test_account(
id int NOT NULL
CONSTRAINT pk_test_account PRIMARY KEY CLUSTERED
, accno int NOT NULL
, updDtm datetime2 NOT NULL
, fileCreatedTime datetime2 NOT NULL
);
--insert 100K rows
WITH
t10 AS (SELECT n FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) t(n))
,t1k AS (SELECT 0 AS n FROM t10 AS a CROSS JOIN t10 AS b CROSS JOIN t10 AS c)
,t1g AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) AS num FROM t1k AS a CROSS JOIN t1k AS b CROSS JOIN t1k AS c)
INSERT INTO dbo.test_account (id, accno, updDtm, fileCreatedTime)
SELECT num, 123456789, '2022-07-27 09:41:10.0000000', '2022-07-27 11:33:33.8300000'
FROM t1g
WHERE num <= 100000;
GO
--run query 10 times
SELECT t.id
FROM
(SELECT
ROW_NUMBER() OVER(PARTITION BY a.accno ORDER BY a.updDtm desc, a.fileCreatedTime DESC) AS seq,
a.id, a.accno, a.updDtm, a.fileCreatedTime
FROM
test_account a) AS t
WHERE t.seq = 1;
GO 10
Example results:
1
27001
25945
57071
62813
1
1
1
36450
78805
The simple solution is to add the primary key as the last column to the ORDER BY clause to break ties. This returns the same id value (1) in every iteration regardless of the execution plan and indexes.
SELECT t.id
FROM
(SELECT
ROW_NUMBER() OVER(PARTITION BY a.accno ORDER BY a.updDtm desc, a.fileCreatedTime DESC, a.id) AS seq,
a.id, a.accno, a.updDtm, a.fileCreatedTime
FROM
test_account a) AS t
WHERE t.seq = 1;
GO 10
On a side note, this index will optimize the query:
CREATE NONCLUSTERED INDEX idx ON dbo.test_account(accno, updDtm DESC, fileCreatedTime DESC, id);
I have a long list of string values that I would like to count, in one particular column of a table. I know this works for counting all unique values.
SELECT
code_id ,
COUNT(*) AS num
FROM
mydb
GROUP BY
code_id
ORDER BY
code_id
I only have a certain selection of values to count, therefore do now want all. My list is long, but for example, if I just wanted to count the numbers of strings 'ax1', 'c39', and 'x1a' in my code_id column? I've seen examples with multiple lines of code, one for each value which will be huge for counting many values. I'm hoping for something like :
SELECT
code_id ,
COUNT(* = ('ax1, 'c39', 'x1a')) AS num
FROM
mydb
GROUP BY
code_id
ORDER BY
code_id
Desired output would be
code_id count
ax1 39
c39 42
x1a 0
Is there an easy way, rather than a line of code for each value to be counted?
Create a CTE that returns all the string values and a LEFT join to your table to aggregate:
WITH cte AS (SELECT code_id FROM (VALUES ('ax1'), ('c39'), ('x1a')) c(code_id))
SELECT c.code_id,
COUNT(t.code_id) AS num
FROM cte c LEFT JOIN tablename t
ON t.code_id = c.code_id
GROUP BY c.code_id;
See the demo.
I think this should work.
SELECT
code_id ,
sum(1) AS num
FROM Mydb
WHERE code_id in ('ax1', 'c39', 'x1a')
GROUP BY code_id
ORDER BY code_id
I have a 3 tables from which contain this data:
Table 1:
Table 2:
Table 3:
Output:
I have tried using Pivot but it has to have an aggregate function in it.
SELECT
project_code, project_name, fk_prj_project_id,
[A], [B], [C], [D]
FROM
(SELECT
project_code, project_name, employee_name,
fk_prj_project_id, fk_prj_project_id AS nm,
activity_details
FROM
PRJ_MST_PROJECT AS a
LEFT JOIN
PRJ_TNS_DAILY_SUMMARY AS b ON a.pk_prj_project_id = b.fk_prj_project_id
LEFT JOIN
HRM_EMP_MST_EMPLOYEE AS c ON b.fk_hrm_emp_employee_id = c.pk_hrm_emp_employee_id
WHERE
a.project_status = 0
AND b.transaction_status = 1
AND CONVERT(date, b.transaction_date, 103) = CONVERT(date, '15/04/2021', 103)) x
PIVOT
(MAX(nm)
FOR nm IN ([A], [B], [C], [D])
) p
The problem is you set your PIVOT to look for values of nm in A, B, C, and D, but nm is an alias for fk_prj_project_id, which has possible values of 1, 2, 3, 4, and 5. So there are no A, B, C, or D values to be had. I don't even see a name for the column that holds A, B, C, and D, but whatever column that is needs to be what you put in the "FOR ___ IN" section of your pivot.
Test your query by commenting out the reference to the pivot columns in the SELECT and comment out the word PIVOT and everything after it and re-run your query. You should see some column with values A, B, C, D. If you don't, fix your query so you do. Once you do, that column is what you PIVOT on (put it between FOR and IN in the pivot block).
Oh, and if you provide data in a usable format people might run your query and give you directly usable results, it's a lot to ask to have people enter your data to get to help you so meet them half way. A link to sqlfiddle is ideal, but even just a bunch of DECLARE #T1 and INSERT INTO T1 VALUES statements is usually enough to get significantly better help.
EDIT:
Nice job with the Fiddle!
OK, so using your data, we can test out actual queries. For PIVOT to work, we need a column to look up (employee name), a column to aggregate (activity_details), and some columns that will be constant across the rows produced (the project's name and ID). You're working with text not numbers, so your aggregation can't be mathematical, leaving you with pretty much just MAX or MIN. To make sure you get the right (newest) one, I first built a table of comments and numbered them by how new they were, then I picked just the newest comment for each (project, user) pair. cteCommentNewest is the result of that.
Now with a clean (and verified) table to pivot, the actual pivot syntax is simple. Well, as simple as Pivot can be, it's inherently pretty confusing IMHO, but structuring it this way keeps the actual PIVOT as clean as possible.
Note that the query is in twice, I tested it as a static query before converting it to dynamic because it's much easier to troubleshoot a static query, then I left it in in case you want to experiment with it. You don't need it for the final solution to work.
Here's the final code, fully tested and producing the specified output:
DECLARE #cols3 AS NVARCHAR(MAX)
DECLARE #query3 AS NVARCHAR(MAX)=''
DECLARE #dt varchar(100)='14/04/2021'
select #cols3 = STUFF((SELECT ',' + QUOTENAME(employee_name)
from dbo.HRM_EMP_MST_EMPLOYEE
order by employee_name
FOR XML PATH(''), TYPE
).value('.', 'NVARCHAR(MAX)')
,1,1,'')
--SELECT #cols3 --Test column list for dynamic query
--Test the core functions of pivot before making dynamic
;with cteCommentsAll as (
SELECT P.project_code , P.project_name, C.activity_details , E.employee_name
, ROW_NUMBER () over (PARTITION BY P.project_code , E.employee_name ORDER BY C.transaction_date DESC) as Newness
FROM dbo.PRJ_MST_PROJECT as P --Projects
LEFT OUTER JOIN dbo.PRJ_TNS_DAILY_SUMMARY as C --Comments on projects
ON P.pk_prj_project_id = C.fk_prj_project_id --Get all projects, then all comments for each project
LEFT OUTER JOIN dbo.HRM_EMP_MST_EMPLOYEE as E --Employees who commented
on E.pk_hrm_emp_employee_id = C.fk_hrm_emp_employee_id
), cteCommentsNewest as (
SELECT project_code , project_name, activity_details , employee_name
FROM cteCommentsAll WHERE Newness = 1 --Only one comment per user per project of CROSS problems
)
SELECT *
FROM cteCommentsNewest as N --TEST up to this point to see the raw table
PIVOT (MAX(activity_details) FOR employee_name IN (A, B, C) ) as P
--Put the working query, modified for dynamic columns, into a variable
set #query3 = N'
;with cteCommentsAll as (
SELECT P.project_code , P.project_name, C.activity_details , E.employee_name
, ROW_NUMBER () over (PARTITION BY P.project_code , E.employee_name ORDER BY C.transaction_date DESC) as Newness
FROM dbo.PRJ_MST_PROJECT as P --Projects
LEFT OUTER JOIN dbo.PRJ_TNS_DAILY_SUMMARY as C --Comments on projects
ON P.pk_prj_project_id = C.fk_prj_project_id --Get all projects, then all comments for each project
LEFT OUTER JOIN dbo.HRM_EMP_MST_EMPLOYEE as E --Employees who commented
on E.pk_hrm_emp_employee_id = C.fk_hrm_emp_employee_id
), cteCommentsNewest as (
SELECT project_code , project_name, activity_details , employee_name
FROM cteCommentsAll WHERE Newness = 1 --Only one comment per user per project of CROSS problems
)SELECT *
FROM cteCommentsNewest as N
PIVOT (MAX(activity_details) FOR employee_name IN (' + #cols3 + ') ) as P
'
exec sp_executesql #query3
which produces the following output
project_code
project_name
A
B
C
MOA20171
Project A
some remark By Employee A on 14
NULL
some remark By Employee C on 14
MOA20172
Project B
NULL
NULL
some remark By Employee C on 15
MOA20173
Project C
NULL
NULL
NULL
This is my first question on here, so I apologize if I break any rules.
Here's the situation. I have a table that lists all the employees and the building to which they are assigned, plus training hours, with ssn as the id column, I have another table that list all the employees in the company, also with ssn, but including name, and other personal data. The second table contains multiple records for each employee, at different points in time. What I need to do is select all the records in the first table from a certain building, then get the most recent name from the second table, plus allow the result set to be sorted by any of the columns returned.
I have this in place, and it works fine, it is just very slow.
A very simplified version of the tables are:
table1 (ssn CHAR(9), buildingNumber CHAR(7), trainingHours(DEC(5,2)) (7200 rows)
table2 (ssn CHAR(9), fName VARCHAR(20), lName VARCHAR(20), sequence INT) (708,000 rows)
The sequence column in table 2 is a number that corresponds to a predetermined date to enter these records, the higher number, the more recent the entry. It is common/expected that each employee has several records. But several may not have the most recent(i.e. '8').
My SProc is:
#BuildingNumber CHAR(7), #SortField VARCHAR(25)
BEGIN
DECLARE #returnValue TABLE(ssn CHAR(9), buildingNumber CAHR(7), fname VARCHAR(20), lName VARCHAR(20), rowNumber INT)
INSERT INTO #returnValue(...)
SELECT(ssn,buildingNum,fname,lname,rowNum)
FROM SELECT(...,CASE #SortField Row_Number() OVER (PARTITION BY buildingNumber ORDER BY {sortField column} END AS RowNumber)
FROM table1 a
OUTER APPLY(SELECT TOP 1 fName,lName FROM table2 WHERE ssn = a.ssn ORDER BY sequence DESC) AS e
where buildingNumber = #BuildingNumber
SELECT * from #returnValue ORDER BY RowNumber
END
I have indexes for the following:
table1: buildingNumber(non-unique,nonclustered)
table2: sequence_ssn(unique,nonclustered)
Like I said this gets me the correct result set, but it is rather slow. Is there a better way to go about doing this?
It's not possible to change the database structure or the way table 2 operates. Trust me if it were it would be done. Are there any indexes I could make that would help speed this up?
I've looked at the execution plans, and it has a clustered index scan on table 2(18%), then a compute scalar(0%), then an eager spool(59%), then a filter(0%), then top n sort(14%).
That's 78% of the execution so I know it's in the section to get the names, just not sure of a better(faster) way to do it.
The reason I'm asking is that table 1 needs to be updated with current data. This is done through a webpage with a radgrid control. It has a range, start index, all that, and it takes forever for the users to update their data.
I can change how the update process is done, but I thought I'd ask about the query first.
Thanks in advance.
I would approach this with window functions. The idea is to assign a sequence number to records in the table with duplicates (I think table2), such as the most recent records have a value of 1. Then just select this as the most recent record:
select t1.*, t2.*
from table1 t1 join
(select t2.*,
row_number() over (partition by ssn order by sequence desc) as seqnum
from table2 t2
) t2
on t1.ssn = t1.ssn and t2.seqnum = 1
where t1.buildingNumber = #BuildingNumber;
My second suggestion is to use a user-defined function rather than a stored procedure:
create function XXX (
#BuildingNumber int
)
returns table as
return (
select t1.ssn, t1.buildingNum, t2.fname, t2.lname, rowNum
from table1 t1 join
(select t2.*,
row_number() over (partition by ssn order by sequence desc) as seqnum
from table2 t2
) t2
on t1.ssn = t1.ssn and t2.seqnum = 1
where t1.buildingNumber = #BuildingNumber;
);
(This doesn't have the logic for the ordering because that doesn't seem to be the central focus of the question.)
You can then call it as:
select *
from dbo.XXX(<building number>);
EDIT:
The following may speed it up further, because you are only selecting a small(ish) subset of the employees:
select *
from (select t1.*, t2.*, row_number() over (partition by ssn order by sequence desc) as seqnum
from table1 t1 join
table2 t2
on t1.ssn = t1.ssn
where t1.buildingNumber = #BuildingNumber
) t
where seqnum = 1;
And, finally, I suspect that the following might be the fastest:
select t1.*, t2.*, row_number() over (partition by ssn order by sequence desc) as seqnum
from table1 t1 join
table2 t2
on t1.ssn = t1.ssn
where t1.buildingNumber = #BuildingNumber and
t2.sequence = (select max(sequence) from table2 t2a where t2a.ssn = t1.ssn)
In all these cases, an index on table2(ssn, sequence) should help performance.
Try using some temp tables instead of the table variables. Not sure what kind of system you are working on, but I have had pretty good luck. Temp tables actually write to the drive so you wont be holding and processing so much in memory. Depending on other system usage this might do the trick.
Simple define the temp table using #Tablename instead of #Tablename. Put the name sorting subquery in a temp table before everything else fires off and make a join to it.
Just make sure to drop the table at the end. It will drop the table at the end of the SP when it disconnects, but it is a good idea to make tell it to drop to be on the safe side.
I'm trying to select records with a statement
SELECT *
FROM A
WHERE
LEFT(B, 5) IN
(SELECT * FROM
(SELECT LEFT(A.B,5), COUNT(DISTINCT A.C) c_count
FROM A
GROUP BY LEFT(B,5)
) p1
WHERE p1.c_count = 1
)
AND C IN
(SELECT * FROM
(SELECT A.C , COUNT(DISTINCT LEFT(A.B,5)) b_count
FROM A
GROUP BY C
) p2
WHERE p2.b_count = 1)
which takes a long time to run ~15 sec.
Is there a better way of writing this SQL?
If you would like to represent Set Difference (A-B) in SQL, here is solution for you.
Let's say you have two tables A and B, and you want to retrieve all records that exist only in A but not in B, where A and B have a relationship via an attribute named ID.
An efficient query for this is:
# (A-B)
SELECT DISTINCT A.* FROM (A LEFT OUTER JOIN B on A.ID=B.ID) WHERE B.ID IS NULL
-from Jayaram Timsina's blog.
You don't need to return data from the nested subqueries. I'm not sure this will make a difference withiut indexing but it's easier to read.
And EXISTS/JOIN is probably nicer IMHO then using IN
SELECT *
FROM
A
JOIN
(SELECT LEFT(B,5) AS b1
FROM A
GROUP BY LEFT(B,5)
HAVING COUNT(DISTINCT C) = 1
) t1 On LEFT(A.B, 5) = t1.b1
JOIN
(SELECT C AS C1
FROM A
GROUP BY C
HAVING COUNT(DISTINCT LEFT(B,5)) = 1
) t2 ON A.C = t2.c1
But you'll need a computed column as marc_s said at least
And 2 indexes: one on (computed, C) and another on (C, computed)
Well, not sure what you're really trying to do here - but obviously, that LEFT(B, 5) expression keeps popping up. Since you're using a function, you're giving up any chance to use an index.
What you could do in your SQL Server table is to create a computed, persisted column for that expression, and then put an index on that:
ALTER TABLE A
ADD LeftB5 AS LEFT(B, 5) PERSISTED
CREATE NONCLUSTERED INDEX IX_LeftB5 ON dbo.A(LeftB5)
Now use the new computed column LeftB5 instead of LEFT(B, 5) anywhere in your query - that should help to speed up certain lookups and GROUP BY operations.
Also - you have a GROUP BY C in there - is that column C indexed?
If you are looking for just set difference between table1 and table2,
the below query is simple that gives the rows that are in table1, but not in table2, such that both tables are instances of the same schema with column names as
columnone, columntwo, ...
with
col1 as (
select columnone from table2
),
col2 as (
select columntwo from table2
)
...
select * from table1
where (
columnone not in col1
and columntwo not in col2
...
);