As you can see in the linked ER diagram, I have two tables, Department and Admissions. My goal is to print out only the ReshNum of the Department that has the most Admissions.
My first attempt was this:
select top 1 count(*) as Number_of_Adm, Department.ReshNum
from Admission
inner join Department on Admission.DepartmentId = Department.id
group by Department.ReshNum
order by Number_of_Adm;
It's pretty straightforward: it counts the rows, groups them by department, and prints out the top answer after ordering by the highest count. My problem is that it prints both the count and the ReshNum.
I'm trying to print only the ReshNum (the branch name/serial number). I've read up on subqueries to try to get the count in a subquery and then pick the branch out from that query, but I can't get it to work.
Any tips?
You just need to select the column you need and move the count to the order by criteria.
Using table aliases also helps make your query easier to follow, especially with more columns and tables in the query.
You also say you want the most, so I assume you'll need to order descending.
select top (1) d.ReshNum
from Admission a
inner join Department d
on a.DepartmentId = d.id
group by d.ReshNum
Order By count(*) desc;
Great question! Stu's answer is probably the optimal way, depending on your indexes.
Just for posterity, since your inquiry includes how to make a subquery work, here is an alternative using a subquery. As far as I can tell, at least on my database, SQL Query Optimizer plans the two queries out with about the same performance on either version.
Subqueries can be really useful in tons of scenarios, like when you want to add another field to display and group by without having to add every single other field on the table in the group by clause.
SELECT TOP 1 x.ReshNum /*, x.DepartmentName*/
FROM
(
SELECT count(*) AS CountOfAdmissions, d.ReshNum /*, d.DepartmentName*/
FROM Admission a
INNER JOIN Department d ON a.DepartmentId = d.Id
GROUP BY d.ReshNum /*, d.DepartmentName*/
/*HAVING count(*) > 50*/
) x
ORDER BY CountOfAdmissions DESC
How it works:
You need to wrap your subquery in parentheses and give that subquery an alias. Above, I have it aliased as x just outside the closing parenthesis, as an arbitrary identifier. You could certainly alias it as depts or mySubQuery or whatever reads well to you in the resulting overall query.
Next, it's important to notice that while the group by clause can be included inside the subquery, the order by clause cannot. So you have to keep the order by clause on the outside of the query, which means you are actually ordering the results of the subquery, not the results of the actual table. That could be great for performance, because the result of your subquery is likely to be vastly smaller than the whole table. However, it will not use your table's index that way, so depending on your indexes, that bonus may wash out or even be worse than ordering without a subquery.
Last, one of the benefits of this kind of subquery approach is that you could easily throw in another field if you want, like the department name for example, without costing very much in performance. Above I have that hypothetical department name field commented out between the /* and */ flags. Note that it is referenced with the d table alias on the inside of the subquery, but uses the subquery's x alias outside of the subquery.
Just as a bonus, in case it comes up, there is also a having clause commented out above that you might be able to use, just to show what can be done inside the subquery.
Cheers
I have a query in SQL that I want to convert to SOQL.
I know that a LEFT JOIN is not possible in SOQL, but I don't know how to write this in SOQL.
I want to check Cases without Decision__c. There is a Lookup relation between Case(Id) and Decision__c (Case__c).
That would be in SQL:
Select Id
FROM Case
LEFT JOIN Decision__c D
on D.Case__c = Case.Id
WHERE D.Case__c IS NULL
I exported all Cases (Case) and all Decisions (Decision__c) to Excel. With a VLOOKUP I connected each Case with its Decision; an error meant no linked Decision.
I also loaded the objects into Power Query and performed a left join to merge the two queries. Those with no Decision were easily filtered out (null value).
So I got my list of Cases without Decision, but I want to know if I can get this list with a SOQL query, instead of these extra steps.
To put it simply, you literally select Cases without a Decision__c; the query should look like this:
SELECT Id FROM Case WHERE Id NOT IN (SELECT Case__c FROM Decision__c)
Although we don't have JOINs in Salesforce, we can use several kinds of "subqueries" to help filter records.
Refer to the following link:
https://developer.salesforce.com/docs/atlas.en-us.soql_sosl.meta/soql_sosl/sforce_api_calls_soql_select.htm
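For example, the complementary semi-join (Cases that do have at least one Decision__c) is just the positive form of the same pattern; a minimal sketch, assuming the same Case__c lookup field:
SELECT Id FROM Case WHERE Id IN (SELECT Case__c FROM Decision__c)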
I asked a question about aliases recently: Discerning between alias, temp table, etc [SQL Server].
I got the impression that tables and resulting queries had to be named using aliases.
select customers.name as 'Customers'
from customers
where customers.id not in
(
select customerid from orders
)
In fact, when I add an alias there I get an error. What gives?
When working with "tables" - that is, anything that can use a JOIN - a name of some sort is needed. For example, if your query was written as:
select customers.name as 'Customers'
from customers
LEFT JOIN (
select customerid from orders
) ___
WHERE ___ is null
Then you need to name the derived table and fill in the blanks, because SQL syntax requires a name for a derived table in a JOIN.
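Filled in, that sketch might look like the following (assuming orders.customerid refers to customers.id; the alias o is just illustrative):
select customers.name as 'Customers'
from customers
left join (
    select customerid from orders
) o
    on o.customerid = customers.id
where o.customerid is null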
However, in your sample code:
select customers.name as 'Customers'
from customers
where customers.id not in
(
select customerid from orders
)
The syntax does not require a name, and so the nested query does not require naming.
Aliases are there for convenience most of the time. There are times when you are required to use them, though.
https://www.geeksforgeeks.org/sql-aliases/
Temporary tables, derived look-ups (subqueries), common table expressions (CTEs), duplicate table names in JOINs, and a couple of other things I'm sure I'm forgetting are the only times you need to use an alias.
Most of the time, it's simply to rename something because it's long, complex, a duplicate column name, or just to make things simpler or more readable.
The query you post won't likely need an alias, but using one makes things easier when you are using the results in code, as well as when/if you add more columns to the query.
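For instance, a self-join is one of those cases where aliases are required, because the same table name appears twice; the table and column names here are purely hypothetical:
select e.name, m.name as manager_name
from employees e
join employees m
    on m.id = e.manager_id;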
Side note:
You may see a lot of single-letter abbreviations in people's SQL. This is common, but it's bad form. They will also likely abbreviate with the first letter of every word in a table name, such as cal for ClientAddressLookup, which is also not great form; however, typing ClientAddressLookup for each of the 12 columns you need when JOINing with other tables isn't great either. I'm as guilty of this as everyone else; just know that using good aliases is just as necessary and useful as using good names for your variables in code.
Hi, I am hoping someone can help with my SQL theory. I have to create a set of reports which use joins from multiple tables. These reports are running far slower than I would like, and I am hoping to optimize my SQL, although my knowledge has hit a wall and I can't seem to find anything on Google.
I am hoping someone here can give me some best practice guidance.
Essentially I am trying to filter the result set as it comes back, to reduce the number of rows included in later joins:
Items INNER JOIN BlueItems ON Items.ItemID = BlueItems.ItemID AND BlueItems.shape = 'square'
LEFT JOIN ItemHistory ON Items.ItemID = ItemHistory.ItemsID
LEFT JOIN ItemDates ON Items.ItemID = ItemDates.ItemID
WHERE ItemDates.ManufactureDate BETWEEN '01/01/2017' AND '01/05/2017'
I figure that inner joining on BlueItems that are squares vastly reduces the data set at this point?
I also understand that the WHERE clause is intelligent enough to reduce the data set at run time? Am I mistaken? Is it returning all the data and then just filtering on that data?
Any guidance on how to speed this kind of query up would be fantastic; indexes and such have already been put in place. Unfortunately the database is actually looked after by someone else and I am simply creating reports based on their database, which limits me to optimizing my queries rather than the data itself.
I guess at this point it's time for me to improve my knowledge of how SQL handles the various ways you can filter data, and to understand which actually reduce the dataset used and which simply filter it. Any guidance would be very appreciated!
You mentioned that the primary keys are all indexed, but this is always the case for primary key fields. The only portion of your current query which would possibly benefit from this is the first join with Items. For the other joins and the WHERE clause, these primary key fields are not being used.
For this particular query, I would suggest the following indices:
ALTER TABLE BlueItems ADD INDEX bi_item_idx (ItemID, shape)
ALTER TABLE ItemHistory ADD INDEX ih_item_idx (ItemID)
ALTER TABLE ItemDates ADD INDEX id_idx (ItemID, ManufactureDate)
For the ItemHistory table, the index ih_item_idx should speed up the join involving the ItemID foreign key. A column by the same name is also involved with the other two joins, and hence is part of the other indices. The reason for the composite indices (i.e. indices involving more than one column) is that we want to cover all the columns which appear in either the join or the WHERE clause.
These comments are not really an answer, but they're too big to put in a comment...
If the dates are being passed in as parameters (I'm guessing they are), then it might be parameter sniffing that is causing the issue. The query may be using a bad plan.
I've seen this a lot, especially when using the BETWEEN operator. A quick thing to try is adding OPTION (RECOMPILE) to the end of your query. This might seem counterintuitive, but just try it. Although compiled queries should be faster than recompiling, if a bad plan is being used, it can slow things down A LOT.
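As a sketch, the hint just goes at the very end of the statement, after the WHERE clause; for example, on a cut-down version of your query:
SELECT Items.ItemID
FROM Items
INNER JOIN ItemDates ON Items.ItemID = ItemDates.ItemID
WHERE ItemDates.ManufactureDate BETWEEN '01/01/2017' AND '01/05/2017'
OPTION (RECOMPILE);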
Also, if ItemDates is big, try dumping your filtered results to a temp table and joining to that, so something like:
SELECT * INTO #id FROM ItemDates i WHERE i.ManufactureDate BETWEEN '01/01/2017' AND '01/05/2017'
Then change your main query to be something like:
SELECT *
FROM Items
JOIN BlueItems ON Items.ItemID = BlueItems.ItemID AND BlueItems.shape = 'square'
JOIN #id i ON Items.ItemID = i.ItemID
LEFT JOIN ItemHistory ON Items.ItemID = ItemHistory.ItemsID
I also changed the join on ItemDates from a LEFT JOIN to a JOIN (implicitly an INNER JOIN), as you are only selecting items which have a match in ItemDates, so a LEFT JOIN makes no sense.
I have a huge select statement, with multiple inner joins, that brings back 20 columns of info.
Is there some way to filter the result to do a unique (or distinct) based on a single column only?
Another way of thinking about it is that when it does a join, it only grabs the first result of the join for a given ID, then it halts and moves on to joining by the next ID.
I've successfully used GROUP BY and DISTINCT, but these require you to specify many columns, not just one, which appears to slow the query down by an order of magnitude.
Update
The answer by @Martin Smith works perfectly.
When I updated the query to use this technique:
It more than doubled in speed (1663ms down to 740ms)
It used less T-SQL code (no need to add lots of parameters to the GROUP BY clause).
It's more maintainable.
Caveat (very minor)
Note that you should only use the answer from @Martin Smith if you are absolutely sure that the rows that will be eliminated are always duplicates; otherwise this query will be non-deterministic (i.e. it could bring back different results from run to run).
This is not an issue with GROUP BY, as the T-SQL syntax parser will prevent this from ever occurring, i.e. it will only let you bring back results where there is no possibility of duplicates.
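For example, the GROUP BY version of the same idea (with hypothetical column names) forces every selected column to be either grouped or aggregated, which is exactly why duplicates can't slip through:
SELECT YourCol, MIN(YourOtherCol) AS YourOtherCol
FROM YourTable
GROUP BY YourCol;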
You can use ROW_NUMBER for this:
WITH T AS
(
SELECT ROW_NUMBER() OVER (PARTITION BY YourCol ORDER BY YourOtherCol) AS RN,
--Rest of your query here
)
SELECT *
FROM T
WHERE RN=1
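As an illustration only, with a hypothetical Orders table where you want exactly one row per CustomerId (keeping the most recent order), the filled-in pattern might look like this:
WITH T AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY o.CustomerId ORDER BY o.OrderDate DESC) AS RN,
           o.*
    FROM Orders o
)
SELECT *
FROM T
WHERE RN = 1;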
We have an SQL Server that gets daily imports of data files from clients. This data is interrelated and we are always scrubbing it and having to look for suspect duplicate records between these files.
Finding and tagging suspect records can get pretty complicated. We use logic that requires some field values to be the same, allows some field values to differ, and allows a range to be specified for how different certain field values can be. The only way we've found to do it is by using a cursor based process, and it places a heavy burden on the database.
So I wanted to ask if there's a more efficient way to do this. I've heard it said that there's almost always a more efficient way to replace cursors with clever JOINS. But I have to admit I'm having a lot of trouble with this one.
For a concrete example suppose we have 1 table, an "orders" table, with the following 6 fields.
(order_id, customer_id, product_id, quantity, sale_date, price)
We want to look through the records to find suspect duplicates on the following example criteria. These get increasingly harder.
Records that have the same product_id, sale_date, and quantity but different customer_id's should be marked as suspect duplicates for review
Records that have the same customer_id, product_id, quantity and have sale_dates within five days of each other should be marked as suspect duplicates for review
Records that have the same customer_id and product_id, quantities that differ by no more than 20 units, and sale dates within five days of each other should be considered suspect.
Is it possible to satisfy each one of these criteria with a single SQL Query that uses JOINS? Is this the most efficient way to do this?
If this gets much more involved, you might be looking at a simple ETL process to do the heavy lifting for you: the load to the database should be manageable in the sense that you will be loading into your ETL environment, running transformations/checks/comparisons, and then writing your results to perhaps a staging table that outputs the stats you need. It sounds like a lot of work, but once it is set up, tweaking it is no great pain.
On the other hand, if you are looking at comparing vast amounts of data, then that might entail significant network traffic.
I am thinking efficient will mean adding indexes to the fields whose contents you are examining. I'm not sure offhand whether a mega-join is what you need, or whether you just need to list the primary keys of the suspect records into a holding table so you can list the problems later. I.e., do you need to know why each record is suspect in the result set?
You could
-- Assuming some pkid (primary key) has been added
1.
select o.pkid, o.order_id, o.customer_id, o.product_id, o.quantity, o.sale_date
from orders o
join orders o2 on o.product_id = o2.product_id and o.sale_date = o2.sale_date
and o.quantity = o2.quantity and o.customer_id <> o2.customer_id
Then keep joining up more copies of orders, I suppose.
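For the second criterion (same customer_id, product_id and quantity, with sale dates within five days of each other), a similar self-join sketch, under the same pkid assumption, might look like:
select o.pkid, o.order_id, o.customer_id, o.product_id, o.quantity, o.sale_date
from orders o
join orders o2 on o.customer_id = o2.customer_id and o.product_id = o2.product_id
and o.quantity = o2.quantity and o.pkid <> o2.pkid
and abs(datediff(day, o.sale_date, o2.sale_date)) <= 5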
You can do this with a single CASE expression. In the scenario below, the value of MarkedForReview will tell you which of your three tests (1, 2, or 3) triggered the review. Note that I have to check for the conditions of the third test before the second test.
With InputData As
(
Select O.order_id, O.product_id, O.sale_date, O.quantity, O.customer_id
, Case
When O.sale_date = O2.sale_date Then 1
When Abs(DateDiff(d, O.sale_date, O2.sale_date)) <= 5
And Abs( O.quantity - O2.quantity ) <= 20 Then 3
When Abs(DateDiff(d, O.sale_date, O2.sale_date)) <= 5 Then 2
Else 0
End As MarkedForReview
From Orders As O
Left Join Orders As O2
On O2.order_id <> O.order_id
And O2.customer_id = O.customer_id
And O2.product_id = O.product_id
)
Select order_id, product_id, sale_date, quantity, customer_id
From InputData
Where MarkedForReview <> 0
Btw, if you are using something prior to SQL Server 2005, you can achieve the equivalent query using a derived table. Also note that you can return the id of the complementary order that triggered the review. Both orders that trigger a review will obviously be returned.
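A derived-table version of the same query (a sketch, for pre-2005 compatibility) simply inlines what is in the CTE above:
Select order_id, product_id, sale_date, quantity, customer_id
From (
    Select O.order_id, O.product_id, O.sale_date, O.quantity, O.customer_id
        , Case
            When O.sale_date = O2.sale_date Then 1
            When Abs(DateDiff(d, O.sale_date, O2.sale_date)) <= 5
                And Abs(O.quantity - O2.quantity) <= 20 Then 3
            When Abs(DateDiff(d, O.sale_date, O2.sale_date)) <= 5 Then 2
            Else 0
          End As MarkedForReview
    From Orders As O
    Left Join Orders As O2
        On O2.order_id <> O.order_id
        And O2.customer_id = O.customer_id
        And O2.product_id = O.product_id
) As InputData
Where MarkedForReview <> 0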