Trying to join tables using ARRAY_CONTAINS yields never-ending run - snowflake-cloud-data-platform

We have three tables: "USERS", "COMPANIES" and "FILES". Each USER has a variant column named "COMPANY_IDS", a simple array whose elements match records in the "COMPANIES" table.
Each FILE belongs to a single COMPANY via its "COMPANY_ID" field.
We would like to join each USER with every FILE that it's connected to via any of the COMPANIES it's associated with.
This naive solution:
SELECT u._id as user_id, z._id as file_id, z.SENT_ON
FROM users u,
LATERAL (SELECT f._id, f.SENT_ON
from FILES f
where ( ARRAY_CONTAINS(TO_VARIANT(f.COMPANY_ID), TO_ARRAY(u.COMPANY_IDS)))) z
Takes forever and never finishes.
A more convoluted solution that avoids the "ARRAY_CONTAINS" function - finishes in a second and a half:
SELECT u._id as user_id, max(files.sent_on)
FROM users u
LEFT JOIN (
select *
FROM companies c,
lateral (select flt.value as cid , us._id as uid
from users us,
lateral flatten ( company_ids) as flt
where cid = c._id) ccc
) x on (x.uid = u._id)
LEFT JOIN files
on x.cid = files.company_id
GROUP BY u._id
Is there something preventing ARRAY_CONTAINS from being used properly in "ON" clauses or as part of the "WHERE" clause of correlated lateral sub-queries?
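One plausible way to think about the difference (an assumption, not a statement about Snowflake's actual planner): a predicate like ARRAY_CONTAINS is not an equality between two columns, so the engine may end up testing every (user, file) pair, whereas flattening the array first produces plain (user, company) rows that can be matched by equality. The Python sketch below only illustrates that idea on made-up in-memory data; the sample rows and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample data standing in for the USERS and FILES tables.
users = [
    {"_id": "u1", "company_ids": ["c1", "c2"]},
    {"_id": "u2", "company_ids": ["c2"]},
    {"_id": "u3", "company_ids": []},
]
files = [
    {"_id": "f1", "company_id": "c1"},
    {"_id": "f2", "company_id": "c2"},
    {"_id": "f3", "company_id": "c9"},
]


def join_with_contains(users, files):
    """Test every (user, file) pair, like a join on ARRAY_CONTAINS."""
    return sorted(
        (u["_id"], f["_id"])
        for u in users
        for f in files
        if f["company_id"] in u["company_ids"]
    )


def join_with_flatten(users, files):
    """Flatten the arrays first, then match by equality via a hash lookup."""
    files_by_company = defaultdict(list)
    for f in files:
        files_by_company[f["company_id"]].append(f["_id"])
    pairs = set()  # a set, so a company id repeated in one array adds nothing
    for u in users:
        for cid in u["company_ids"]:  # the "flatten" step
            for file_id in files_by_company.get(cid, []):
                pairs.add((u["_id"], file_id))
    return sorted(pairs)


pairs_contains = join_with_contains(users, files)
pairs_flatten = join_with_flatten(users, files)
print(pairs_flatten)
```

Both functions return the same pairs; the first does work proportional to users times files, the second roughly to the number of flattened rows plus matches. Note this sketch is an inner join, so users without any file (u3 here) are dropped, unlike the LEFT JOIN version in the question.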

Related

Understanding DISTINCT vs DISTINCT ON vs Group by

I have a query which returns a set of 'records'.
The result is always from the same table, and should always be unique. It has a set of inner joins to filter the rows down to the appropriate subset.
The query is returning roughly 10 columns.
However, I found that it was returning duplicate rows, so I added select distinct to the query, which solved the duplication problem but has significant performance issues.
My understanding is that select distinct on (records.id), id... will return the same result in this case, as all duplicates would have the same primary key, and seems to be about twice as fast.
My other tests show that group by records.id is even faster again, and seems to do the same thing?
Am I correct that all three of these approaches will always return the same set of single table records?
Also, is there an easy way to compare the results of different approaches to ensure the same set is being returned?
Here is my query:
SELECT DISTINCT records.*
FROM records
INNER JOIN records parents on parents.path @> records.path
INNER JOIN record_types ON record_types.id = records.record_type_id
INNER JOIN user_roles ON user_roles.record_id = parents.id AND user_roles.user_id = _user_id
INNER JOIN memberships ON memberships.role_id = user_roles.role_id
INNER JOIN roles ON roles.id = memberships.role_id
INNER JOIN groups ON memberships.group_id = groups.id AND
groups.id = record_types.view_group_id
Any individual record can have a tree of 'parent' records. This is done using the ltree extension. Effectively, we are looking to see if the user has a role which is in a group which is defined as the 'view group' for either the current record, or any of the parents. The query is actually a function, and _user_id is being passed in.
Since you are only selecting from records, you don't need DISTINCT; the records are already distinct (I presume).
So the duplicates you encounter could be caused by all the joins, for instance if more than one role or group membership matches one of your records, the same record will be combined with each of these references.
SELECT *
FROM records r
WHERE EXISTS (
SELECT *
FROM records pa
JOIN record_types typ ON typ.id = r.record_type_id
JOIN user_roles ur ON ur.record_id = pa.id AND ur.user_id = _user_id
JOIN memberships mem ON mem.role_id = ur.role_id
JOIN roles ON roles.id = mem.role_id
JOIN groups gr ON mem.group_id = gr.id AND gr.id = typ.view_group_id
WHERE pa.path @> r.path -- pa is r itself or one of its ancestors
)
;
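On the question of how to compare the approaches: one simple check is to run each variant on the same data and diff the results, for example with EXCEPT in both directions. The sketch below does this in Python with an in-memory SQLite database. The `grants` table is a hypothetical stand-in for the whole chain of joins in the question, and SQLite is used only because it ships with Python (it has no DISTINCT ON, so that variant is not shown).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE grants (record_id INTEGER, user_id INTEGER);  -- hypothetical
    INSERT INTO records VALUES (1, 'a'), (2, 'b'), (3, 'c');
    -- record 1 is reachable through two grants, record 2 through one
    INSERT INTO grants VALUES (1, 7), (1, 7), (2, 7), (3, 8);
""")

queries = {
    "plain_join": """
        SELECT r.* FROM records r
        JOIN grants g ON g.record_id = r.id AND g.user_id = 7
    """,
    "distinct": """
        SELECT DISTINCT r.* FROM records r
        JOIN grants g ON g.record_id = r.id AND g.user_id = 7
    """,
    "group_by": """
        SELECT r.* FROM records r
        JOIN grants g ON g.record_id = r.id AND g.user_id = 7
        GROUP BY r.id
    """,
    "exists": """
        SELECT r.* FROM records r
        WHERE EXISTS (
            SELECT 1 FROM grants g
            WHERE g.record_id = r.id AND g.user_id = 7
        )
    """,
}

results = {name: sorted(conn.execute(sql).fetchall())
           for name, sql in queries.items()}


def difference(a, b):
    """Rows returned by query a but not by query b (set semantics)."""
    return conn.execute(
        f"SELECT * FROM ({queries[a]}) EXCEPT SELECT * FROM ({queries[b]})"
    ).fetchall()


for name, rows in results.items():
    print(name, rows)
print(difference("distinct", "exists"), difference("exists", "distinct"))
```

The plain join returns record 1 twice, while the other three variants agree. EXCEPT compares as sets, so it will not notice duplicate rows; comparing row counts as well covers that case.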

SQL - join two tables based on up-to-date entries

I have two tables
1. A table of TestModules:
[image: TestModules]
2. A table of TestModule_Results:
[image: TestModule_Results]
To get the required information for each TestModule, I am using a FULL OUTER JOIN, and it works fine:
[image: FULL OUTER JOIN result]
But what is required is slightly different. The above picture shows that TestModuleID = 5 is listed twice, and the requirement is to list the 'up-to-date' results based on time 'ChangedAt'
Of course, I can do the following:
SELECT TOP 1 * FROM TestModule_Results
WHERE DeviceID = 'xxx' and TestModuleID = 'yyy'
ORDER BY ChangedAt DESC
But this solution is for a single row, and I want to do it in a stored procedure.
Expected output should be like:
[image: ExpectedOutput]
Any advice on how I can implement it in a stored procedure?
Use a Common Table Expression and ROW_NUMBER to add a field identifying the newest results, if any, and select just those.
--NOTE: a Common Table Expression requires the previous command
--to be explicitly terminated; prepending a ; covers that
;WITH cteTR as (
SELECT *
, ROW_NUMBER() OVER (PARTITION BY DeviceID, TestModuleID
ORDER BY ChangedAt DESC) AS ResultOrder
FROM TestModule_Results
--cteTR is now just like TestModule_Results but has an
--additional field ResultOrder that is 1 for the newest,
--2 for the second newest, etc. for every unique (DeviceID,TestModuleID) pair
)
SELECT *
FROM TestModules as M --Use INNER JOIN to get only modules with results,
--or LEFT OUTER JOIN to include modules without any results yet
INNER JOIN cteTR as R
ON M.DeviceID = R.DeviceID AND M.TestModuleID = R.TestModuleID
WHERE R.ResultOrder = 1
-- OR R.ResultOrder IS NULL --add if Left Outer Join
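To try the ROW_NUMBER pattern without a SQL Server instance, here is a small sketch in Python using SQLite (window functions need SQLite 3.25 or newer). The `Result` column and the sample rows are made up for illustration. It uses the LEFT JOIN variant and puts `ResultOrder = 1` in the ON clause, which has the same effect as the commented-out `IS NULL` condition above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TestModules (DeviceID TEXT, TestModuleID INTEGER);
    CREATE TABLE TestModule_Results (
        DeviceID TEXT, TestModuleID INTEGER, Result TEXT, ChangedAt TEXT
    );
    INSERT INTO TestModules VALUES ('d1', 4), ('d1', 5), ('d1', 6);
    -- module 5 has two results; module 6 has none yet
    INSERT INTO TestModule_Results VALUES
        ('d1', 4, 'pass', '2020-01-01'),
        ('d1', 5, 'fail', '2020-01-01'),
        ('d1', 5, 'pass', '2020-02-01');
""")

latest = conn.execute("""
    WITH cteTR AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY DeviceID, TestModuleID
                   ORDER BY ChangedAt DESC
               ) AS ResultOrder
        FROM TestModule_Results
    )
    SELECT M.DeviceID, M.TestModuleID, R.Result, R.ChangedAt
    FROM TestModules AS M
    LEFT JOIN cteTR AS R
      ON M.DeviceID = R.DeviceID
     AND M.TestModuleID = R.TestModuleID
     AND R.ResultOrder = 1          -- keep only the newest result per module
    ORDER BY M.TestModuleID
""").fetchall()

for row in latest:
    print(row)
```

Module 5 appears once with its newest result, and module 6 appears with NULLs because it has no results.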
You say "this solution is for a single row"? Excellent. Use CROSS APPLY and change the WHERE clause from hand-input literal to the fields of the original table. APPLY operates at row level.
SELECT *
FROM TestModules t
CROSS APPLY
(
SELECT TOP 1 * FROM TestModule_Results r
WHERE r.DeviceID = t.DeviceID -- put the connecting fields here
AND r.TestModuleID = t.TestModuleID
ORDER BY r.ChangedAt DESC
) tr

Select value from different tables based on the column value in SQL Server

I have a main table A with the following fields:
Then I have three separate tables, each for Buildings, Classrooms and Offices. All these tables have two columns; ID and Name.
I want to query the table A to get the following result:
How can I do this?
Your data isn't really normalized. Having three separate tables all serving the same lookup is causing you some headache, so I unioned the 3 tables together and created a 'src' column so you could join table A's type and ID back to B's ID and src. You'd have been better off with one table, non-repeating IDs, and a type ID to specify whether it's a building, classroom or office.
Select *
from A
LEFT JOIN (SELECT 'Building' as src, ID, Name FROM Buildings UNION ALL
SELECT 'Classroom' as src, ID, Name FROM Classrooms UNION ALL
SELECT 'Office' as src, ID, Name FROM Offices) B
on A.Location_Type = B.Src
and A.LocationID = B.ID
I used a left join here in case not all records in A have an associated record in B. However, an inner join should work as well.
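Here is a runnable sketch of the union-with-src approach, using Python and SQLite. Since the question's table layouts were not preserved, the columns of A and all sample rows are assumptions. Note that ID 1 exists in both Buildings and Classrooms; the src column is what keeps them apart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- assumed layout: A stores a location type plus an ID into that table
    CREATE TABLE A (ID INTEGER, Location_Type TEXT, LocationID INTEGER);
    CREATE TABLE Buildings  (ID INTEGER, Name TEXT);
    CREATE TABLE Classrooms (ID INTEGER, Name TEXT);
    CREATE TABLE Offices    (ID INTEGER, Name TEXT);
    INSERT INTO A VALUES (1, 'Building', 1), (2, 'Classroom', 1), (3, 'Office', 9);
    INSERT INTO Buildings  VALUES (1, 'Main Hall');
    INSERT INTO Classrooms VALUES (1, 'Room 101');
    INSERT INTO Offices    VALUES (1, 'Dean');
""")

rows = conn.execute("""
    SELECT A.ID, A.Location_Type, B.Name
    FROM A
    LEFT JOIN (
        SELECT 'Building' AS src, ID, Name FROM Buildings
        UNION ALL
        SELECT 'Classroom' AS src, ID, Name FROM Classrooms
        UNION ALL
        SELECT 'Office' AS src, ID, Name FROM Offices
    ) B ON A.Location_Type = B.src AND A.LocationID = B.ID
    ORDER BY A.ID
""").fetchall()

for row in rows:
    print(row)
```

Row 3 points at an office that does not exist, so the LEFT JOIN keeps it with a NULL name; an INNER JOIN would drop it.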
Should be doable with the use of a join.
Something along those lines should work:
SELECT tabelA.ID, tabelA.Subject, tabelA.Date, tabelA.locationType, tabelB.location
FROM tabelA INNER JOIN tabelB on tabelA.locationID = tabelB.locationID

Query values from another table based on column value

I am attempting to create a query that pulls information from two other tables, however I only know which table to pull from based on a column in another table. I'm currently looking into doing this using a stored procedure (e.g. build the query and then run it) but I wanted to know if there is a better way to do this, or if I could accomplish it in a single query.
In terms of the connections, IDs are unique across the entire database, so no two IDs will overlap. However, I do not know which subtable the ID relates to. I am able to find this by pulling in an unrelated table that happens to have the information (call it the Object Table). One of the columns will give me the table name for the information (in my example below, Person). I have drafted a simple example below. Can you see any way I could accomplish this in a single query? Something like this is what I am aiming for, but I am starting to think it's not possible.
SELECT * FROM base_table
LEFT JOIN object ON object.id = base_table.role
LEFT JOIN [object.type] tmp ON tmp.entity_id = base_table.entity_id
id | role | entity_id (Base Table)
---------------------
1 | 101 | 1000
id | type (Objects Table)
------------
101| person
entity_id | name | etc.. (Person Table)
------------------------
1000 | Bob | ...
I also expect unions might be a possible solution, but other than just joining all the possible tables and parsing the columns to match up properly (which could be as many as 20 tables), I'd rather not. This solution is also a bit of a nuisance since the columns don't always match in a good way (e.g. the Person table doesn't have similar columns to the Address table).
I don't think the left join idea is that bad if you just ignore object type.
Since each ID is unique, you don't need to look at type at all if you use coalesce. So, to use TT's model as an example:
SELECT bt.*,
COALESCE(P.f1, L.f1, C.f1) AS f1,
-- ...,
COALESCE(P.fn, L.fn, C.fn) AS fn
FROM
base_table AS bt
LEFT JOIN Person AS P ON P.entity_id = bt.entity_id
LEFT JOIN [Legal Person] AS L ON L.entity_id = bt.entity_id
LEFT JOIN Counterpart AS C ON C.entity_id = bt.entity_id
Depending on your data size and indexes this might perform faster or the same as TT's example -- remember there is only 1 select with N joins while TT's has N selects, 2N joins. It really depends on your data.
If there is some field (fz) that does not show up in all types, then you just don't include that table's column in the coalesce clause.
I think this style might be easier to maintain and understand, and it will be as fast as or faster than TT's code.
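A small sketch of the COALESCE approach, run through Python and SQLite so it can be tried locally. The `name` column and the sample rows are made up; the table names follow the example above. It relies on the question's statement that entity IDs never overlap between detail tables, so at most one of the joined tables matches each row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE base_table (id INTEGER, role INTEGER, entity_id INTEGER);
    CREATE TABLE Person         (entity_id INTEGER, name TEXT);
    CREATE TABLE "Legal Person" (entity_id INTEGER, name TEXT);
    CREATE TABLE Counterpart    (entity_id INTEGER, name TEXT);
    INSERT INTO base_table VALUES
        (1, 101, 1000), (2, 102, 2000), (3, 103, 3000), (4, 104, 4000);
    INSERT INTO Person         VALUES (1000, 'Bob');
    INSERT INTO "Legal Person" VALUES (2000, 'Acme Ltd');
    INSERT INTO Counterpart    VALUES (3000, 'Bank X');
""")

rows = conn.execute("""
    SELECT bt.id,
           COALESCE(P.name, L.name, C.name) AS name  -- first non-NULL wins
    FROM base_table AS bt
    LEFT JOIN Person         AS P ON P.entity_id = bt.entity_id
    LEFT JOIN "Legal Person" AS L ON L.entity_id = bt.entity_id
    LEFT JOIN Counterpart    AS C ON C.entity_id = bt.entity_id
    ORDER BY bt.id
""").fetchall()

for row in rows:
    print(row)
```

The row with entity_id 4000 has no match in any detail table, so its name comes back as NULL.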
What you probably want to do is the following: for each possible detail-table (i.e. the possible values in [object.type]), write a query that only links with that one detail-table and has a WHERE clause to restrict to the proper entities. Then do a UNION ALL for all those queries.
Say you have Person, Legal Person and Counterpart as possible values in [object.type]. Suppose the detail-tables have the same names. You can write:
SELECT
bt.*,
dt.f1,
-- ...,
dt.fn
FROM
base_table AS bt
INNER JOIN object AS o ON o.id = bt.role
INNER JOIN Person AS dt ON dt.entity_id = bt.entity_id
WHERE
o.type='Person'
UNION ALL
SELECT
bt.*,
dt.f1,
-- ...,
dt.fn
FROM
base_table AS bt
INNER JOIN object AS o ON o.id = bt.role
INNER JOIN [Legal Person] AS dt ON dt.entity_id = bt.entity_id
WHERE
o.type='Legal Person'
UNION ALL
SELECT
bt.*,
dt.f1,
-- ...,
dt.fn
FROM
base_table AS bt
INNER JOIN object AS o ON o.id = bt.role
INNER JOIN Counterpart AS dt ON dt.entity_id = bt.entity_id
WHERE
o.type='Counterpart'
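With as many as 20 detail tables, writing out every branch by hand gets tedious, so the UNION ALL text could be generated instead. The sketch below does that in Python against SQLite. The helper `build_union_query`, the `name` column and the sample data are hypothetical, and the list of table names is assumed to come from a trusted source, since identifiers cannot be passed as bound parameters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE base_table (id INTEGER, role INTEGER, entity_id INTEGER);
    CREATE TABLE object (id INTEGER, type TEXT);
    CREATE TABLE Person         (entity_id INTEGER, name TEXT);
    CREATE TABLE "Legal Person" (entity_id INTEGER, name TEXT);
    CREATE TABLE Counterpart    (entity_id INTEGER, name TEXT);
    INSERT INTO base_table VALUES (1, 101, 1000), (2, 102, 2000), (3, 103, 3000);
    INSERT INTO object VALUES
        (101, 'Person'), (102, 'Legal Person'), (103, 'Counterpart');
    INSERT INTO Person         VALUES (1000, 'Bob');
    INSERT INTO "Legal Person" VALUES (2000, 'Acme Ltd');
    INSERT INTO Counterpart    VALUES (3000, 'Bank X');
""")

# Possible values of object.type; detail tables are assumed to share the name.
detail_tables = ["Person", "Legal Person", "Counterpart"]


def build_union_query(tables):
    """Return one SELECT per detail table, glued together with UNION ALL."""
    parts = []
    for table in tables:
        quoted = '"' + table.replace('"', '""') + '"'  # quote the identifier
        parts.append(f"""
            SELECT bt.id, o.type, dt.name
            FROM base_table AS bt
            INNER JOIN object AS o ON o.id = bt.role
            INNER JOIN {quoted} AS dt ON dt.entity_id = bt.entity_id
            WHERE o.type = ?
        """)
    return "UNION ALL".join(parts) + " ORDER BY 1"


query = build_union_query(detail_tables)
# One bound parameter per branch, in the same order as the branches.
rows = conn.execute(query, detail_tables).fetchall()

for row in rows:
    print(row)
```

In SQL Server the same idea would be dynamic SQL built inside the stored procedure; the Python version here is only meant to show the shape of the generated query.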

TSQL query to merge data from multiple tables that may or may not have matching rows?

For example, suppose we're conducting research where students can take up to 10 different tests, and each table in the database stores all the students' responses for one test. The tables are named after each test as: T1, T2, ... , T10. Suppose each table has a primary key column 'Username' that identifies each student. Students may or may not have completed each test, so there may or may not be a record in each table for each student.
What is the correct SQL Query to return all the test data from all tables, with one row per student (one row per username)? I want the simplest query possible that returns the correct results. I would also like to coalesce the Username fields into a single Username field in the final query.
To clarify, I understand that SQL has a major limitation in that it does not support a syntax to select all columns except one or more fields like "select *[^ExcludeColumn1][^ExcludeColumn2]". To avoid specifically naming all columns in the final query, it would be acceptable to leave all the Username columns there, as long as it includes a coalesced Username field at the beginning named something like RowID.
As for the overall query, one option would be to perform a union all on the username column of all ten tables, then select the distinct usernames across all tables, then perform a series of left joins against the list of distinct usernames on all 10 tables. That would result in a very straightforward query where each left join is performed on the same distinct set of usernames, but I want to avoid a separate up-front query for distinct usernames. (Although if that's the best option, let me know). It would look something like this:
select * from
(select distinct coalesce(t1.Username,t2.Username,...,t10.Username) as RowID from t1,t2,t3,t4,t5,t6,t7,t8,t9,t10) distinct_usernames
left join t1 on t1.Username = distinct_usernames.RowID
left join t2 on t2.Username = distinct_usernames.RowID
...
left join t10 on t10.Username = distinct_usernames.RowID
Although that is short and easy to write, it is incredibly inefficient and would take hours to run on test tables with 5000+ rows each, so with an adjustment, an equivalent version that runs in a few seconds is:
select * from (
select distinct Username as RowID from (
select Username from t1
union all
select Username from t2
union all
...
select Username from t10
) all_usernames) distinct_usernames
left join t1 on t1.Username = distinct_usernames.RowID
left join t2 on t2.Username = distinct_usernames.RowID
...
left join t10 on t10.Username = distinct_usernames.RowID
I think that what I have above might be the most efficient and correct query (takes only a couple seconds to run and returns correct result set), but I also thought perhaps it could be simplified with some kind of full join. The problem is that full joins get confusing with more than two tables, because without pre-determining the usernames, each subsequent table would have to match records against any of the preceding tables, resulting in a query where each additional table has "[previous table count] + 1" conditions on matching the username.
Assuming that Username is unique in each table, your second query would be the way I would try first, with the slight modifications of removing distinct and simply using union (which implies distinct) rather than union all:
select *
from (
select Username from t1
union
select Username from t2
union
-- ...
select Username from t10
) distinct_usernames
left join t1 on t1.Username = distinct_usernames.Username
left join t2 on t2.Username = distinct_usernames.Username
-- ...
left join t10 on t10.Username = distinct_usernames.Username
From there I would make sure that Username is indexed, possibly even using it as the clustered index. I've also had optimization luck in the past by implementing your distinct_usernames as a temp table (possibly indexed, or an indexed view) at the beginning of the proc, but only testing would determine if that were worthwhile.
A full outer join would require a bunch of or conditions or coalesce arguments, though it could be worth a try on just a few tables to see if the performance is there. I can't try to out-guess what your query engine will like best.
Also, getting just the column names that you want could be done with a query to sys.columns or information_schema.columns and using dynamic SQL to build your query as a string and then executing that.
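As a rough illustration of the last two points, the sketch below builds the UNION-plus-LEFT-JOIN query as a string and picks the columns from the catalog so that only one username column appears. It uses Python with SQLite, where `PRAGMA table_info` plays the role that sys.columns or information_schema.columns would play in SQL Server. Three small tables stand in for T1 to T10, and the q1..q3 columns and sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (Username TEXT PRIMARY KEY, q1 INTEGER);
    CREATE TABLE t2 (Username TEXT PRIMARY KEY, q2 INTEGER);
    CREATE TABLE t3 (Username TEXT PRIMARY KEY, q3 INTEGER);
    INSERT INTO t1 VALUES ('ann', 1), ('bob', 2);
    INSERT INTO t2 VALUES ('bob', 3), ('cy', 4);
    INSERT INTO t3 VALUES ('ann', 5);
""")

tables = ["t1", "t2", "t3"]  # trusted, hard-coded names

# UNION (not UNION ALL) already removes duplicate usernames.
usernames = "\n    UNION\n".join(f"    SELECT Username FROM {t}" for t in tables)
joins = "\n".join(f"LEFT JOIN {t} ON {t}.Username = u.Username" for t in tables)

# Collect every non-key column from the catalog so Username appears only once.
select_cols = ["u.Username AS RowID"]
for t in tables:
    for col in conn.execute(f"PRAGMA table_info({t})"):
        column_name = col[1]  # table_info rows are (cid, name, type, ...)
        if column_name != "Username":
            select_cols.append(f"{t}.{column_name}")

query = (
    f"SELECT {', '.join(select_cols)}\n"
    f"FROM (\n{usernames}\n) AS u\n"
    f"{joins}\n"
    "ORDER BY RowID"
)
rows = conn.execute(query).fetchall()

print(query)
for row in rows:
    print(row)
```

Each student gets exactly one row, with NULLs for the tests they did not take.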
