sql count query over a count query

sql count query over a count query - sql-server

following is my query to get the count of documentid
select count(did),did
from (
select count(FileDataId) as fid,
documentid as did,
FileDataId
from DocumentHistory
where documentid in (
select documentid
from Document
where DocumentTypeId=11 and
IsVerified=0 and
lastmodified < '5/jan/2019'
)
group by DocumentId,
filedataid
having count(FileDataId)<2
)
I am getting error as
Incorrect syntax near ')'.
If i run the inner query, it is bringing me result
I like to know how many times the did is repeating in the result

I guess it's because you haven't put an alias for your subquery. You're also missing a grouping in the outer query. The formatting of your query could be improved too:
select
count(a.did),
a.did
from (
select
count(dh.FileDataId) as fid,
dh.documentid as did,
dh.FileDataId
from DocumentHistory dh
INNER JOIN Document d on d.documentid = dh.documentid
where d.DocumentTypeId=11 and d.IsVerified=0 and d.lastmodified < '2019-01-05'
group by DocumentId, filedataid
having count(FileDataId)<2
) a
GROUP BY did
As well as incorporating simon's suggestion
surrounding the date format (yyyy-mm-dd is ISO, and not subject to localisation problems like a date that contains words- your query might not work on eg a Spanish database), I swapped the IN for an INNER JOIN too; though query optimizers can generally rewrite IN to behave like a join you should aim to avoid using IN for lists longer than you'd be prepared to write by hand. Some databases are better optimised for join than in
Note, having said that, that there's a slight difference in behaviour between an INNER JOIN and IN, if the IN(...) list contains duplicates you won't get repeated rows out of documenthistory result but you would with an inner join. (I expect that documentid is a primary key of document so duplicates wouldn't appear in this case)

Related

Using SQL to combine detailed and aggregated results

I am developing a report against a SQL Server database. Using the query presented here...
SELECT
f.FacilityID as 'FID',
COUNT (DISTINCT f.PhoneTypeID) as 'Ptypes',
COUNT (DISTINCT f.PhoneID) as 'Pnumbers'
from dbo.FacilityPhones as f
inner join
dbo.Phones as ph
f.PhoneID = ph.PhoneID
group by f.FacilityID
having COUNT(DISTINCT f.PhoneTypeID)<>COUNT(DISTINCT f.PhoneId);
...I have identified 107 records where the number of phone numbers present for a Facility differs from the number of phone number types (e.g., there are two distinct phone numbers, both listed as primary).
I would like to be able to produce a detailed report that would list phone numbers and phone types for each facility, but ONLY when the distinct counts differ.
Is there a way to do this with a single query? Or would I need to save the summaries to a temp table, then join back to that temp table to get the details?

Not sure what fields exist in dbo.Phone; but assume the number comes from there... Likely need to join to the type table to get it's description as well...
This uses a common table expression to get your base list of items an then a correlated subquery to ensure only those facilities in your cte are displayed.
WITH CTE AS (
SELECT f.FacilityID as 'FID'
, COUNT (DISTINCT f.PhoneTypeID) as 'Ptypes'
, COUNT (DISTINCT f.PhoneID) as 'Pnumbers'
FROM dbo.FacilityPhones as f
GROUP BY f.FacilityID
HAVING COUNT(DISTINCT f.PhoneTypeID)<>COUNT(DISTINCT f.PhoneId))
SELECT *
FROM dbo.FaclityPhones FP
INNER JOIN dbo.Phones as ph
ON FP.PhoneID = ph.PhoneID
WHERE EXISTS (SELECT 1
FROM CTE
WHERE FID = FP.FacilityID)
The where clause here just says only show those FacilityID's and associated records if the FacilityID exists in your original query (CTE) (107) If we needed data from the CTE we'd join to it; but as it's simply restricting data placing it in the where clause and using an exists will likely be more efficient.

SQL Server : group all data by one column

I need some help in writing a SQL Server stored procedure. All data group by Train_B_N.
my table data
Expected result :
expecting output
with CTE as
(
select Train_B_N, Duration,Date,Trainer,Train_code,Training_Program
from Train_M
group by Train_B_N
)
select
*
from Train_M as m
join CTE as c on c.Train_B_N = m.Train_B_N
whats wrong with my query?

The GROUP BY smashes the table together, so having columns that are not GROUPED combine would cause problems with the data.
select Train_B_N, Duration,Date,Trainer,Train_code,Training_Program
from Train_M
group by Train_B_N
By ANSI standard, the GROUP BY must include all columns that are in the SELECT statement which are not in an aggregate function. No exceptions.
WITH CTE AS (SELECT TRAIN_B_N, MAX(DATE) AS Last_Date
FROM TRAIN_M
GROUP BY TRAIN_B_N)
SELECT A.Train_B_N, Duration, Date,Trainer,Train_code,Training_Program
FROM TRAIN_M AS A
INNER JOIN CTE ON CTE.Train_B_N = A.Train_B_N
AND CTE.Last_Date = A.Date
This example would return the last training program, trainer, train_code used by that ID.
This is accomplished from MAX(DATE) aggregate function, which kept the greatest (latest) DATE in the table. And since the GROUP BY smashed the rows to their distinct groupings, the JOIN only returns a subset of the table's results.
Keep in mind that SQL will return #table_rows X #Matching_rows, and if your #Matching_rows cardinality is greater than one, you will get extra rows.
Look up GROUP BY - MSDN. I suggest you read everything outside the syntax examples initially and obsorb what the purpose of the clause is.
Also, next time, try googling your problem like this: 'GROUP BY, SQL' or insert the error code given by your IDE (SSMS or otherwise). You need to understand why things work...and SO is here to help, not be your google search engine. ;)
Hope you find this begins your interest in learning all about SQL. :D

TSQL query to merge data from multiple tables that may or may not have matching rows?

For example, suppose we're conducting research where students can take up to 10 different tests, and each table in the database stores all the students' responses for one test. The tables are named after each test as: T1, T2, ... , T10. Suppose each table has a primary key column 'Username' that identifies each student. Students may or may not have completed each test, so there may or may not be a record in each table for each student.
What is the correct SQL Query to return all the test data from all tables, with one row per student (one row per username)? I want the simplest query possible that returns the correct results. I would also like to coalesce the Username fields into a single Username field in the final query.
To clarify, I understand that SQL has a major limitation in that it does not support a syntax to select all columns except one or more fields like "select *[^ExcludeColumn1][^ExcludeColumn2]". To avoid specifically naming all columns in the final query, it would be acceptable to leave all the Username columns there, as long as it includes a coalesced Username field at the beginning named something like RowID.
As for the overall query, one option would be to perform a union all on the username column of all ten tables, then select the distinct usernames across all tables, then perform a series of left joins against the list of distinct usernames on all 10 tables. That would result in a very straightforward query where each left join is performed on the same distinct set of usernames, but I want to avoid a separate up-front query for distinct usernames. (Although if that's the best option, let me know). It would look something like this:
select * from
(select distinct coalesce(t1.Username,t2.Username,...,t10.Username) as RowID from t1,t2,t3,t4,t5,t6,t7,t8,t9,t10) distinct_usernames
left join t1 on t1.Username = distinct_usernames.RowID
left join t2 on t2.Username = distinct_usernames.RowID
...
left join t10 on t10.Username = distinct_usernames.RowID
Although that is short and easy to write, it is incredibly inefficient and would take hours to run on test tables with 5000+ rows each, so with an adjustment, an equivalent version that runs in a few seconds is:
select * from (
select distinct Username as RowID from (
select Username from t1
union all
select Username from t2
union all
...
select Username from t10
) all_usernames) distinct_usernames
left join t1 on t1.Username = distinct_usernames.RowID
left join t2 on t2.Username = distinct_usernames.RowID
...
left join t10 on t10.Username = distinct_usernames.RowID
I think that what I have above might be the most efficient and correct query (takes only a couple seconds to run and returns correct result set), but I also thought perhaps it could be simplified with some kind of full join. The problem is that full joins get confusing with more than two tables, because without pre-determining the usernames, each subsequent table would have to match records against any of the preceding tables, resulting in a query where each additional table has "[previous table count] + 1" conditions on matching the username.

Assuming that Username is unique in each table, your second query would be the way I would try first, with the slight modifications of removing distinct and simply using union (which implies distinct) rather than union all:
select *
from (
select Username from t1
union
select Username from t2
union
-- ...
select Username from t10
) distinct_usernames
left join t1 on t1.Username = distinct_usernames.Username
left join t2 on t2.Username = distinct_usernames.Username
-- ...
left join t10 on t10.Username = distinct_usernames.Username
From there I would make sure that Username is indexed, possibly even using it as the clustered index. I've also had optimization luck in the past by implementing your distinct_usernames as a temp table (possibly indexed, or an indexed view) at the beginning of the proc, but only testing would determine if that were worthwhile.
A full outer join would require a bunch of or conditions or coalesce arguments, though it could be worth a try on just a few tables to see if the performance is there. I can't try to out-guess what your query engine will like best.
Also, getting just the column names that you want could be done with a query to sys.columns or information_schema.columns and using dynamic SQL to build your query as a string and then executing that.

Performance Issues with Count(*) in SQL Server

I am having some performance issues with a query I am running in SQL Server 2008. I have the following query:
Query1:
SELECT GroupID, COUNT(*) AS TotalRows FROM Table1
INNER JOIN (
SELECT Column1 FROM Table2 WHERE GroupID = #GroupID
) AS Table2
ON Table2.Column1 = Table1.Column1
WHERE CONTAINS(Table1.*, #Word) GROUP BY GroupID
Table1 contains about 500,000 rows. Table2 contains about 50,000, but will eventually contain millions. Playing around with the query, I found that re-writing the query as follows will reduce the execution time of the query to under 1 second.
Query 2:
SELECT GroupID FROM Table1
INNER JOIN (
SELECT Column1 FROM Table2 WHERE GroupID = #GroupID
) AS Table2 ON Table2.Column1 = Table1.Column1
WHERE CONTAINS(Table1.*, #Word)
What I do not understand is it is a simple count query. If I execute the following query on Table 1, it returns in < 1 s:
Query 3:
SELECT Count(*) FROM Table1
This query returns around 500,000 as the result.
However, the Original query (Query 1) mentioned above only returns a count of 50,000 and takes 3s to execute even though simply removing the GROUP BY (Query 2) reduces the execution time to < 1s.
I do not believe this is an indexing issue as I already have indexes on the appropriate columns. Any help would be very appreciated.

Performing a simple COUNT(*) FROM table can do a much more efficient scan of the clustered index, since it doesn't have to care about any filtering, joining, grouping, etc. The queries that include full-text search predicates and mysterious subqueries have to do a lot more work. The count is not the most expensive part there - I bet they're still relatively slow if you leave the count out but leave the group by in, e.g.:
SELECT GroupID FROM Table1
INNER JOIN (
SELECT Column1 FROM Table2 WHERE GroupID = #GroupID
) AS Table2 ON Table2.Column1 = Table1.Column1
WHERE CONTAINS(Table1.*, #Word)
GROUP BY GroupID;
Looking at the provided actual execution plan in the free SQL Sentry Plan Explorer*, I see this:
And this:
Which lead me to believe you should:
Update the statistics on both Inventory and A001_Store_Inventory so that the optimizer can get a better rowcount estimate (which could lead to a better plan shape).
Ensure that Inventory.ItemNumber and A001_Store_Inventory.ItemNumber are the same data type to avoid an implicit conversion.
(*) disclaimer: I work for SQL Sentry.

You should have a look at the query plan to see what SQL Server is doing to retrieve the data you requested. Also, I think it would be better to rewrite your original query as follows:
SELECT
Table1.GroupID -- When you use JOINs, it's always better to specify Table (or Alias) names
,COUNT(Table1.GroupID) AS TotalRows
FROM
Table1
INNER JOIN
Table2 ON
(Table2.Column1 = Table1.Column1) AND
(Table2.GroupID = #GroupID)
WHERE
CONTAINS(Table1.*, #Word)
GROUP BY
Table1.GroupID
Also, keep in mind that a simple COUNT and a COUNT with a JOIN and GROUP BY are not the same thing. In one case, it's just a matter of going through an index and making a count, in the other there are other tables and grouping involved, which can be time consuming depending on several factors.

SQL Server 2005 Performance: Distinct or full table in WHERE IN statement

We have two Tables:
Document: id, title, document_type_id, showon_id
DocumentType: id, name
Relationship: DocumentType hasMany Documents. (Document.document_type_id = DocumentType.id)
We wish to retrieve a list of all document types for one given ShowOn_Id.
We see two possiblities:
SELECT DocumentType.*
FROM DocumentType
WHERE DocumentType.id IN (
SELECT DISTINCT Document.document_type_id FROM Document WHERE showon_id = 42
);
SELECT DocumentType.*
FROM DocumentType
WHERE DocumentType.id IN (
SELECT Document.document_type_id FROM Document WHERE showon_id = 42
);
Our question is: when and if is it better to use the DISTINCT to get the smaller record set versus retrieving the whole table and the IN statement walking the table to the first match. (We guess that's what it does ;-))
Is this different for different databases, is there a common answer?
Or is there a better way of doing it? (We are in .NET land)

You can use a join:
SELECT DISTINCT DocumentType.*
FROM DocumentType
INNER JOIN Document
ON DocumentType.id=Document.document_type_id
WHERE Document.showon_id = 42
I think it's the best way to do it.

For the best performance you should use:
SELECT DISTINCT dt.*
FROM
DocumentType dt
INNER JOIN Document d ON dt.id=d.document_type_id and d.showon_id = 42
Joins are very efficient at bridging multiple tables where as the nested query in the Where clause will need to perform a separate result selection that will filter down the From clause results. The join statement is also much more readable.
I would also put an index on showon_id, in addition to the primary keys and foreign key relationship.
My answer differs from wmasm's answer only by moving the showon_id filter up to the inner join. For MS SQL 2k5, I think the interpreter is smart enough to do this automatically, but you always want to work with the smallest result set possible. Bringing your filters up to inner join statements can limit the number of rows the query has to work with when joining many tables together. If you do this though, you should understand that this happens for every row comparison so complex filters (such as like x = '%a' or function calls) are better left for the Where clause so that the inner joins may filter out unnecessary comparisons.

Use an EXISTS. It sometimes is faster, but in my opinion, more readable than a DISTINCT and JOIN. Just for kicks, pls reply with the query plan for this query and the JOIN above, and see if anything is different (they may be optimized down to the same plan). If they are the same, I'd recommend the EXISTS as it is closer to a "plain language" description than a JOIN (because you don't want any of the data from Document, etc.)
SELECT whatever
FROM DocumentType dt
WHERE EXISTS( SELECT *
FROM Document
WHERE dt.id = document_type_id
AND showon_id = 42)
To get the query plan (ref: http://msdn.microsoft.com/en-us/library/ms180765(SQL.90).aspx), do:
SET SHOWPLAN_TEXT ON
GO
SELECT ...
GO

From my point of view it should not make any difference inside SQL Server (but who knows how this is implemented).
Think of it this way: to return the resultset the server needs to go into the Document table and retrieve all document_type_id WHERE showon_id = 42. In the process of retrieving the document_type_ids (e.g. by index seeking) it puts them into a hash table. When this process has finished the hash table will contain distinct values anyway. After that the query execution goes inside the Document_Type table, scans the primary key and probes into the hash table. Note that this depends, e.g. maybe it's more efficient to not use a hash table, when the expected row count from the Document table it low compared to Document_Type, but in general you get the same query plan as for the query wmasm just suggested.

Follow up on Matt's answer:
I've enabled the query plan and tested the following four different queries that have come up so far:
SELECT DocumentType.* FROM DocumentType WHERE DocumentType.id IN (SELECT DISTINCT Document.document_type_id FROM Document WHERE showon_id = 42);
SELECT DocumentType.* FROM DocumentType WHERE DocumentType.id IN (SELECT Document.document_type_id FROM Document WHERE showon_id = 42);
SELECT DISTINCT DocumentType.* FROM DocumentType INNER JOIN Document ON DocumentType.id=Document.document_type_id WHERE Document.showon_id = 42;
SELECT DocumentType.* FROM DocumentType WHERE EXISTS ( SELECT * FROM Document WHERE DocumentType.id=Document.document_type_id AND showon_id = 42);
The query plan for all four queries turned out to be the same:
|--Hash Match(Right Semi Join, HASH:([Document].[document_type_id])=([DocumentType].[Id]))
|--Hash Match(Inner Join, HASH:([Document].[Title], [Uniq1005])=([Document].[Title], [Uniq1005]), RESIDUAL:([Document].[Title] as [Document].[Title] = [Document].[Title] as [Document].[Title] AND [Uniq1005] = [Uniq1005]))
| |--Index Seek(OBJECT:([Document].[IX_Document_3] AS [Document]), SEEK:([Document].[showon_id]=(1)) ORDERED FORWARD)
| |--Index Scan(OBJECT:([Document].[IX_Document_1] AS [Document]))
|--Table Scan(OBJECT:([DocumentType] AS [DocumentType]))
I am not sure what every line and element means, but it seems that from the performance perspective it does not matter how you construct the query for this kind of problem...

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight