I ran into a problem and am trying to find a general solution to it using a join.
I have two tables:
http://pastebin.com/q5yws5Ym (not sure how to enforce the styling)
and I want to generate something like
http://pastebin.com/GscBUrYS
(There are more parameters, but I'm interested in how I would do it for something like this.)
While I was able to get a similar result with self-joins and equi-joins, it generated a lot of unneeded rows, which I'm not sure how to remove automatically.
Try something along the lines of:
SELECT user.user_id, j1.user_param, j1.user_value, j2.user_param, j2.user_value
FROM user
JOIN users_info j1 ON user.user_id = j1.user_id
JOIN users_info j2 ON user.user_id = j2.user_id
WHERE j1.user_param != j2.user_param
GROUP BY user.user_id -- note: selecting non-grouped columns like this only works in MySQL's permissive GROUP BY mode
You may need some more "exclusion" clauses in the WHERE to make sure that every combination is only selected once, but the general idea should work (for a given and limited number of different user_param values).
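For example, one such exclusion clause is to compare the parameters with < instead of !=, which keeps only one of the two mirrored (j1, j2) / (j2, j1) combinations per user. This is just a sketch under the same assumptions about the table and column names:

SELECT u.user_id,
       j1.user_param AS param_1, j1.user_value AS value_1,
       j2.user_param AS param_2, j2.user_value AS value_2
FROM user u
JOIN users_info j1 ON u.user_id = j1.user_id
JOIN users_info j2 ON u.user_id = j2.user_id
WHERE j1.user_param < j2.user_param

With exactly two parameters per user this yields one row per user without needing the GROUP BY.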
I'm kind of new to writing SQL and I have a question about joins. Here's an example select:
select bb.name from big_box bb, middle_box mb, little_box lb
where lb.color = 'green' and lb.parent_box = mb and mb.parent_box = bb;
So let's say that I'm looking for the names of all the big boxes that have nested somewhere inside them a little box that's green. If I understand correctly, the above syntax is another way of getting the same results that we could get by using the 'join' keyword.
Questions: is the above select statement efficient for the task it's doing? If not, what is a better way to do it? Is the statement syntactic sugar for a join or is it actually doing something else?
If you have links to any good material on the subject I'd gladly read it, but since I don't know exactly what this technique is called I'm having trouble googling it.
You are using implicit join syntax. This is equivalent to using the JOIN keyword, but it is a good idea to avoid this syntax completely and instead use explicit joins:
SELECT bb.name
FROM big_box bb
JOIN middle_box mb ON mb.parent_box = bb.id
JOIN little_box lb ON lb.parent_box = mb.id
WHERE lb.color = 'green'
You were also missing the column name in the join condition. I have guessed that the column is called id.
This type of query should be efficient if the tables are indexed correctly. In particular there should be foreign key constraints on the join conditions and an index on little_box.color.
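For instance, a hedged sketch of the supporting index and constraints (the constraint/index names and the id / parent_box columns are guesses, as above):

-- Hypothetical names; assumes the id and parent_box columns guessed earlier.
CREATE INDEX ix_little_box_color ON little_box (color);

ALTER TABLE little_box
    ADD CONSTRAINT fk_little_box_parent
    FOREIGN KEY (parent_box) REFERENCES middle_box (id);

ALTER TABLE middle_box
    ADD CONSTRAINT fk_middle_box_parent
    FOREIGN KEY (parent_box) REFERENCES big_box (id);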
An issue with your query is that if there are multiple green boxes inside a single box you will get duplicate rows returned. These duplicates can be removed by adding DISTINCT after SELECT.
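For example (still using the guessed id columns):

SELECT DISTINCT bb.name
FROM big_box bb
JOIN middle_box mb ON mb.parent_box = bb.id
JOIN little_box lb ON lb.parent_box = mb.id
WHERE lb.color = 'green'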
I have something of a search app. There are 7 fields (first name, last name, phone, street, city, shop number, credit card number) where the user can enter parameters, and it will find matching clients in the database. Everything works with AND conditions, so when the first name is 'Andy' and the last name is 'Larkin' it only finds Andy Larkins, etc. The user can leave a field empty; that means when only the first name is 'Andy', it should find all the Andys, etc. The database looks like this:
The 'Relation' table connects a person and a shop. A person must have at least 1 address and 1 shop, can have multiple addresses and multiple shops, and can have no credit card or multiple credit cards. Now, I have to handle all the filtering in a single query; I can't check some conditions first and then construct the query differently, I just don't have that option.
When I search by first name or last name it's fast (both are in the Person table), but when I search by phone number or credit card number it takes a lot of time. There is a lot of data in the database, but still, my query is bad; I'm not really good at writing queries, especially in Oracle. Here's the query:
SELECT
    PERSON.personId,
    PERSON.firstName,
    PERSON.lastName,
    ADDRESS.street,
    ADDRESS.city,
    ADDRESS.phoneNumber
FROM
    PERSON
    LEFT JOIN ADDRESS ON PERSON.personId = ADDRESS.personId
    LEFT JOIN RELATION ON PERSON.personId = RELATION.personId
    LEFT JOIN SHOPS ON RELATION.shopId = SHOPS.shopId
    LEFT JOIN CREDITCARDS ON PERSON.personId = CREDITCARDS.personId
WHERE
    PERSON.firstName = NVL(?, PERSON.firstName) AND
    PERSON.lastName = NVL(?, PERSON.lastName) AND
    ADDRESS.phoneNumber = NVL(?, ADDRESS.phoneNumber) AND
    ADDRESS.street = NVL(?, ADDRESS.street) AND
    ADDRESS.city = NVL(?, ADDRESS.city) AND
    SHOPS.shopNumber = NVL(?, SHOPS.shopNumber) AND
    CREDITCARDS.creditCardNumber = NVL(?, CREDITCARDS.creditCardNumber);
The parameters that the user left empty are passed as NULLs; that's why I use NVL. When I delete all the conditions and leave, let's say, just a credit card number, it's fast, so I guess that means all the unnecessary condition checking is slowing the query down. I don't really need that condition checking in most cases; it's just there in case the user passes something.
If I had the option to check the conditions first and only then construct the query, I would just add the conditions that are needed, but I don't have that option. I was thinking about adding some 'IFs' in the query, but I'm not sure that's even possible; all I could find was 'IF/CASE WHEN', but I couldn't find any examples that apply to my case. I also tried this:
...WHERE (? IS NULL OR (PERSON.firstName = NVL(?, PERSON.firstName))) AND...
That didn't help, and I got tons of duplicated results (differing only in the address or something; a person can have multiple addresses), even with 'DISTINCT'.
It's not homework; the database is huge, with a lot of other fields, but I simplified it here, and there is also a lot of data in it. Thanks for the help.
A few things to think about here.
Be careful about queries that might not make sense, such as those that filter on both a credit card number and an address. Queries of that nature fall into a fan trap.
Creating referential integrity constraints in the database will allow the optimizer to do join elimination.
It would be much better for the optimizer if you could build the query's WHERE clause dynamically, rather than using NVL functions.
A nested select on the shops might improve performance, especially considering it's outer joined. The query below should be enough to give you the idea.
Regarding de-duplication: it's hard because you are selecting IDs, so DISTINCT won't help. You'd probably have to use GROUP BY, and that might slow the query down even more.
If sorting can be done on the client, it might help with performance. If the amount of data being returned is significant due to fan-out of the relational data and GROUP BY isn't a good option, then creating a stored procedure might be the best choice, so that most of the work is done on the database and minimal data goes over the wire.
SELECT
    p.personId,
    p.firstName,
    a.city,
    a.phoneNumber,
    shop.shopNumber
FROM
    PERSON p,
    ADDRESS a,
    CREDITCARDS c,
    (SELECT r.personId, s.shopId, s.shopNumber
       FROM SHOPS s, RELATION r
      WHERE s.shopId = r.shopId) shop
WHERE
    p.personId = a.personId AND
    p.personId = c.personId AND
    p.personId = shop.personId (+)
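A further hedged sketch, not part of the answer above: if the goal is one row per person with no duplicates at all, the optional filters can be pushed into EXISTS subqueries so that PERSON stays the only row source. Each optional parameter is then bound twice (once for the IS NULL test, once for the comparison); the street and city filters would follow the same pattern, and any address or card columns to display would need to be fetched separately or aggregated.

SELECT p.personId, p.firstName, p.lastName
FROM PERSON p
WHERE p.firstName = NVL(?, p.firstName)
  AND p.lastName = NVL(?, p.lastName)
  AND (? IS NULL OR EXISTS (SELECT 1 FROM ADDRESS a
                            WHERE a.personId = p.personId
                              AND a.phoneNumber = ?))
  AND (? IS NULL OR EXISTS (SELECT 1 FROM RELATION r
                            JOIN SHOPS s ON s.shopId = r.shopId
                            WHERE r.personId = p.personId
                              AND s.shopNumber = ?))
  AND (? IS NULL OR EXISTS (SELECT 1 FROM CREDITCARDS c
                            WHERE c.personId = p.personId
                              AND c.creditCardNumber = ?))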
Does a join change the sort order of the original table?
For example
SELECT * FROM #empScheduled
AND
SELECT es.job_code, es.shift_id, es.unit_id, jd.job_description, s.shift_description, u.unit_code,
       [group] = (SELECT job_description FROM dbo.job_desc WHERE job_code = jd.job_group_with),
       es.new_group
FROM #empScheduled es
JOIN job_desc jd ON es.job_code = jd.job_code
JOIN dbo.shifts s ON es.shift_id = s.shift_id
JOIN dbo.units u ON es.unit_id = u.unit_id
will produce the same records but in a different sort order! Why does that happen, and what is the way to stop it and produce the result in the same sort order as the original table WITHOUT having to use ORDER BY? Thanks.
If you don't specify an ORDER BY, there is no defined order on the result tuples. A simple SELECT * FROM x will return the tuples in the order they are stored, since only a table scan is done. However, if a join or similar is done, there is no way of knowing how the intermediate results are handled and the order of the results is undefined.
It is never a good idea to want the result "in the same order as the original table", since this is not a definition of an order. If you want to preserve the insert order, use an ascending primary key and simply order by that. This way, you can always restore the "original" order of your tuples.
There is no inherent sort order in SQL Server.
In your case, it is likely due to indexes on your joined table and their sorted order.
If you want consistently sorted results, you must use an ORDER BY clause.
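For example, a minimal sketch (emp_sched_id is a hypothetical ascending key column in #empScheduled; substitute whatever key reflects the insert order):

SELECT es.job_code, es.shift_id, es.unit_id, jd.job_description,
       s.shift_description, u.unit_code, es.new_group
FROM #empScheduled es
JOIN job_desc jd ON es.job_code = jd.job_code
JOIN dbo.shifts s ON es.shift_id = s.shift_id
JOIN dbo.units u ON es.unit_id = u.unit_id
ORDER BY es.emp_sched_id   -- hypothetical key column that reflects insert order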
You might try the FORCE ORDER hint to explicitly force the join order. Otherwise the optimizer can order the joins any way it thinks is best. Not sure why you cannot use ORDER BY though.
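For reference, the hint goes at the end of the statement, e.g.:

SELECT es.job_code, jd.job_description
FROM #empScheduled es
JOIN job_desc jd ON es.job_code = jd.job_code
OPTION (FORCE ORDER)

Note that it constrains the join order the optimizer uses, not the order of the returned rows, so it is not a substitute for ORDER BY.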
I'm using SQL Server 2008. I have a table with over 3 million records, which is related to another table with a million records.
I have spent a few days experimenting with different ways of querying these tables. I have it down to two radically different queries, both of which take 6s to execute on my laptop.
The first query uses a brute force method of evaluating possibly likely matches, and removes incorrect matches via aggregate summation calculations.
The second gets all possibly likely matches, then removes incorrect matches via an EXCEPT query that uses two dedicated indexes to find the low and high mismatches.
Logically, one would expect the brute-force query to be slow and the indexed one to be fast. Not so. And I have experimented heavily with indexes until I got the best speed.
Further, the brute force query doesn't require as many indexes, which means that technically it would yield better overall system performance.
Below are the two execution plans. If you can't see them, please let me know and I'll re-post them in landscape orientation / mail them to you.
Brute-force query:
SELECT ProductID, [Rank]
FROM (
    SELECT p.ProductID, ptr.[Rank],
           SUM(CASE
                   WHEN p.ParamLo < si.LowMin OR
                        p.ParamHi > si.HiMax THEN 1
                   ELSE 0
               END) AS Fail
    FROM dbo.SearchItemsGet(@SearchID, NULL) AS si
    JOIN dbo.ProductDefs AS pd
        ON pd.ParamTypeID = si.ParamTypeID
    JOIN dbo.Params AS p
        ON p.ProductDefID = pd.ProductDefID
    JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
        ON ptr.ProductTypeID = pd.ProductTypeID
    WHERE si.Mode IN (1, 2)
    GROUP BY p.ProductID, ptr.[Rank]
) AS t
WHERE t.Fail = 0
Index-based exception query:
WITH si AS (
    SELECT DISTINCT pd.ProductDefID, si.LowMin, si.HiMax
    FROM dbo.SearchItemsGet(@SearchID, NULL) AS si
    JOIN dbo.ProductDefs AS pd
        ON pd.ParamTypeID = si.ParamTypeID
    JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
        ON ptr.ProductTypeID = pd.ProductTypeID
    WHERE si.Mode IN (1, 2)
)
SELECT p.ProductID
FROM dbo.Params AS p
JOIN si
    ON si.ProductDefID = p.ProductDefID
EXCEPT
SELECT p.ProductID
FROM dbo.Params AS p
JOIN si
    ON si.ProductDefID = p.ProductDefID
WHERE p.ParamLo < si.LowMin OR p.ParamHi > si.HiMax
My question is, based on the execution plans, which one looks more efficient? I realize that things may change as my data grows.
EDIT:
I have updated the indexes, and now have the following execution plan for the second query:
Trust the optimizer.
Write the query that most simply expresses what you're trying to achieve. If you're having performance problems with that query, then you should look at whether there are any missing indexes. But you still shouldn't have to explicitly work with these indexes.
Don't concern yourself with considerations of how you might implement such a search.
In very rare circumstances, you may need to further force the query to use particular indexes (via hints), but this is probably < 0.1% of queries.
In your posted plans, your "optimized" version is causing scans against two indexes of your (I presume) Params table (PK_Params_1, IX_Params_1). Without seeing the queries, it's difficult to know why this is happening, but if you're comparing a single scan against a table ("brute force") with two scans, it's easy to see why the second isn't more efficient.
I think I'd try:
SELECT p.ProductID, ptr.[Rank]
FROM dbo.SearchItemsGet(@SearchID, NULL) AS si
JOIN dbo.ProductDefs AS pd
    ON pd.ParamTypeID = si.ParamTypeID
JOIN dbo.Params AS p
    ON p.ProductDefID = pd.ProductDefID
JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
    ON ptr.ProductTypeID = pd.ProductTypeID
LEFT JOIN dbo.Params AS p_anti
    ON p_anti.ProductDefID = pd.ProductDefID AND
       (p_anti.ParamLo < si.LowMin OR p_anti.ParamHi > si.HiMax)
WHERE si.Mode IN (1, 2)
  AND p_anti.ProductID IS NULL
GROUP BY p.ProductID, ptr.[Rank]
I.e. introduce an anti-join that eliminates the results you don't want.
In SQL Server Management Studio, put both queries in the same query window and get the query plan for both at once. It should determine the query plans for both and give you a 'percent of total batch' for each one. The query with the lower percent of the total batch will be the better performing one.
Does 6 seconds on a laptop = .006 seconds on production hardware? The part of your queries that worries me is the clustered index scans shown in the query plan. In my experience, any time a query plan includes a CI scan it means the query will only get slower as data is added to the table.
What do the two functions return? It appears they are the cause of the table scans. Is it possible to persist the data in the DB and update the LowMin and HiMax as rows are added?
Looking at the two execution plans, neither is very good. Look how far to the left the wide lines are. The wide lines mean there are many rows. We need to reduce the number of rows earlier in the process so we do not work with such large hash tables, large sorts, and nested loops.
BTW how many rows does your source have and how many rows are included in the result set?
Thank you all for your input and help.
From reading what you wrote, experimenting, and digging into the execution plan, I discovered that the answer is the tipping point.
There were too many records being returned to warrant use of the index.
See here (Kimberly Tripp).
I have a basic "property bag" table that stores attributes about my primary table "Card." So when I want to start doing some advanced searching for cards, I can do something like this:
SELECT dbo.Card.Id, dbo.Card.Name
FROM dbo.Card
INNER JOIN dbo.CardProperty ON dbo.CardProperty.IdCrd = dbo.Card.Id
WHERE dbo.CardProperty.IdPrp = 3 AND dbo.CardProperty.Value = 'Fiend'
INTERSECT
SELECT dbo.Card.Id, dbo.Card.Name
FROM dbo.Card
INNER JOIN dbo.CardProperty ON dbo.CardProperty.IdCrd = dbo.Card.Id
WHERE (dbo.CardProperty.IdPrp = 10 AND (dbo.CardProperty.Value = 'Wind' OR dbo.CardProperty.Value = 'Fire'))
What I need to do is to extract this idea into some kind of stored procedure, so that ideally I can pass in a list of property/value combinations and get the results of the search.
Initially this is going to be a "strict" search meaning that the results must match all elements in the query, but I'd also like to have a "loose" query so that it would match any of the results in the query.
I can't quite seem to wrap my head around this one. My previous version of this was to generate some massive SQL query with a lot of AND/OR clauses in it, but I'm hoping to do something a little more elegant this time. How do I go about doing this?
It seems to me that you have an EAV model here.
If you're using SQL Server 2005 and up, I'd suggest you use the XML datatype for this:
http://weblogs.sqlteam.com/mladenp/archive/2006/10/14/14032.aspx
It makes searching and filtering much easier with the built-in XML querying capabilities.
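For example, a minimal sketch of that idea (assuming SQL Server 2005+ and that the column names from the question are accurate): the wanted property/value pairs are passed as a single XML parameter, shredded with nodes()/value(), and the "strict" case is enforced by requiring a match for every distinct property id:

DECLARE @criteria xml
SET @criteria = N'<search>
                    <prop id="3"  value="Fiend" />
                    <prop id="10" value="Wind" />
                    <prop id="10" value="Fire" />
                  </search>'

SELECT c.Id, c.Name
FROM dbo.Card AS c
JOIN dbo.CardProperty AS cp ON cp.IdCrd = c.Id
JOIN @criteria.nodes('/search/prop') AS x(n)
  ON  cp.IdPrp = x.n.value('@id', 'int')
  AND cp.Value = x.n.value('@value', 'nvarchar(100)')   -- assumes Value is (n)varchar
GROUP BY c.Id, c.Name
-- "strict" search: every distinct property id in the criteria must be matched;
-- drop the HAVING clause for the "loose" (match any) variant
HAVING COUNT(DISTINCT cp.IdPrp) =
       (SELECT COUNT(*) FROM (SELECT DISTINCT t.n.value('@id', 'int') AS IdPrp
                              FROM @criteria.nodes('/search/prop') AS t(n)) AS d)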
If you can't change your model, then look at this:
http://weblogs.sqlteam.com/davidm/articles/12117.aspx