ClickHouse correlated queries/joins with multiple inequalities

I have this query in which I am trying to join 2 tables based on 4 conditions:
select
*
from table1 t1
asof left join table2 t2 on t1.start=t2.start and t1.end=t2.end and (t1.spending between t2.spending-10 and t2.spending+10) and (t1.timestamp_ between subtractSeconds(t2.timestamp_, 10) and addSeconds(t2.timestamp_, 10))
and t2.id in ('623d9d7ce62af3465dadf31f')
where t1.id in ('623d9d86e62af3465dadf327')
SETTINGS join_use_nulls = 1
My conditions are:
i. t1.start=t2.start
ii. t1.end=t2.end
iii. t1.spending between t2.spending - 10 and t2.spending + 10
iv. t1.timestamp_ between subtractSeconds(t2.timestamp_, 10) and addSeconds(t2.timestamp_, 10)
I am trying to join them, but ClickHouse only supports one inequality condition. I cannot use subqueries because I have to correlate the data, and ClickHouse does not support correlated queries. I have tried another query using a WITH clause and has, but has is also not supported. I have looked at dictionaries, but as far as I understand these conditions would have to be defined in the dictionary itself, which I can't do because my last two conditions (iii and iv) are generated dynamically.
I have seen these links, but I wasn't able to solve my issue:
https://github.com/ClickHouse/ClickHouse/issues/3627
https://clickhouse.com/docs/en/sql-reference/statements/select/join/
clickhouse : left join using IN
https://github.com/ClickHouse/ClickHouse/issues/5736
https://ittone.ma/ittone/clickhouse-asof-join-with-multiple-inequalities/
ClickHouse left join between
Get retention analytics: ASOF JOIN with multiple inequalities
Can someone please tell me how to run this query? Any help would be appreciated.
Thanks.
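A workaround sometimes used for this kind of restriction is to keep only the equality conditions (i, ii) in ON and move the range conditions (iii, iv) into WHERE. The sketch below only illustrates that shape. It runs on Python's built-in sqlite3 as a stand-in for ClickHouse, the sample rows are invented, and timestamps are plain integer seconds so that `timestamp_ - 10` replaces subtractSeconds. Note that it has inner-join semantics: table1 rows without a partner are dropped, unlike the LEFT JOIN in the question.

```python
import sqlite3

# Assumption: SQLite stands in for ClickHouse; all rows below are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id TEXT, "start" TEXT, "end" TEXT, spending REAL, timestamp_ INTEGER);
    CREATE TABLE table2 (id TEXT, "start" TEXT, "end" TEXT, spending REAL, timestamp_ INTEGER);
    INSERT INTO table1 VALUES ('a1', 'X', 'Y', 100, 1000);
    INSERT INTO table2 VALUES
        ('b1', 'X', 'Y', 105, 1004),   -- all four conditions hold
        ('b2', 'X', 'Y', 150, 1004),   -- spending differs by more than 10
        ('b3', 'X', 'Y', 105, 1100),   -- timestamp differs by more than 10 s
        ('b4', 'X', 'Z', 105, 1004);   -- "end" differs
""")

# Equalities (i, ii) go in ON; the two range checks (iii, iv) go in WHERE.
matches = conn.execute("""
    SELECT t1.id, t2.id
    FROM table1 t1
    JOIN table2 t2
      ON t1."start" = t2."start" AND t1."end" = t2."end"
    WHERE t1.spending   BETWEEN t2.spending - 10   AND t2.spending + 10
      AND t1.timestamp_ BETWEEN t2.timestamp_ - 10 AND t2.timestamp_ + 10
    ORDER BY t2.id
""").fetchall()

print(matches)
```

Only the pair (a1, b1) survives. The join first produces every pair that agrees on start and end, so this can get expensive when many rows share the same keys.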

Related

Is there an equivalent to OR clause in CONTAINSTABLE - FULL TEXT INDEX

I am trying to improve string searching and chose a full-text index strategy.
However, after implementing it, I still see a performance hit when searching for multiple strings across multiple full-text indexed tables combined with OR clauses.
(E.g. WHERE CONTAINS(F.*,'%Gayan%') OR CONTAINS(P.FirstName,'%John%'))
As a solution, I am trying to use CONTAINSTABLE, expecting a performance improvement.
Now I am facing an issue with CONTAINSTABLE when joining tables with a LEFT JOIN.
Please go through the example below.
Query 1
SELECT F.Name,p.*
FROM P.Role PR
INNER JOIN P.Building F ON PR.PID = F.PID
LEFT JOIN CONTAINSTABLE(P.Building,*,'%John%') AS FFTIndex ON F.ID = FFTIndex.[Key]
LEFT JOIN P.Relationship PRSHIP ON PR.id = prship.ToRoleID
LEFT JOIN P.Role PR2 ON PRSHIP.ToRoleID = PR2.ID
LEFT JOIN P.Person p ON pr2.ID = p.PID
LEFT JOIN CONTAINSTABLE(P.Person,FirstName,'%John%') AS PFTIndex ON P.ID = PFTIndex.[Key]
WHERE F.Name IS NOT NULL
This produces the below result.
Query 2
SELECT F.Name,p.*
FROM P.Role PR
INNER JOIN P.Building F ON PR.PID = F.PID
INNER JOIN P.Relationship PRSHIP ON PR.id = prship.ToRoleID
INNER JOIN P.Role PR2 ON PRSHIP.ToRoleID = PR2.ID
INNER JOIN P.Person p ON pr2.ID = p.PID
WHERE CONTAINS(F.*,'%Gayan%') OR CONTAINS(P.FirstName,'%John%')
AND F.Name IS NOT NULL
Result
Expectation
I want Query 1 to behave like a SQL Server OR clause. As I understand it, the first CONTAINSTABLE in Query 1 joins its matches to the Building table and the remaining rows are ignored, so the CONTAINSTABLE on the Person table only sees data already filtered by the keyword in the Building table.
If the keyword = Building, I want to match the keyword in both tables, regardless of whether a matching record exists in both of them. Having a matching record in either table is enough.
Summary
Query 2 performs well but slows down as the number of words in the indexes grows. Query 1 seems better optimized (according to multiple online resources and the MS documentation),
however, it does not give me the expected output.
Is there any way to solve this problem?
I am not strictly attached to CONTAINSTABLE. Suggestions for other optimization methods are also welcome.
Thank you.
It's hard to say definitively without your full data set, but here are a couple of options to explore.
Remove Invalid % Wildcards
Why are you using '%SearchTerm%'? Does performance improve if you use the search term without the wildcards (%)? If you want a word that matches a prefix, try something like
WHERE CONTAINS (String,'"SearchTerm*"')
Try Temp Tables
My guess is that CONTAINS is slightly faster than CONTAINSTABLE as it doesn't calculate a rank, but I don't know if anyone has ever attempted to benchmark it. Either way, I'd try saving the matches to a temp table before joining to the rest of the tables. This may allow the optimizer to create a better execution plan.
SELECT ID INTO #Temp
FROM YourTable
WHERE CONTAINS (String,'"SearchTerm"')
SELECT *
FROM #Temp
INNER JOIN...
Optimize Full Text Index by Removing Noisy Words
You might find you have some noisy words, i.e. words that recur many times in your data but carry no meaning, like "the" or perhaps some business jargon. Adding these to your stop list means your full-text index will ignore them, making the index smaller and thus faster.
The query below lists the indexed words with the most frequent at the top:
Select *
From sys.dm_fts_index_keywords(Db_Id(),Object_Id('dbo.YourTable') /*Replace with your table name*/)
Order By document_count Desc
This OR That Criteria
Your WHERE CONTAINS(F.*,'%Gayan%') OR CONTAINS(P.FirstName,'%John%') criteria, where you want this or that, is tricky. OR clauses generally perform poorly even when using simple equality operators.
I'd try either running two queries and taking the UNION of the results, like:
SELECT * FROM Table1 F
/*Other joins and stuff*/
WHERE CONTAINS(F.*,'%Gayan%')
UNION
SELECT * FROM Table2 P
/*Other joins and stuff*/
WHERE CONTAINS(P.FirstName,'%John%')
Or, and this is much more work, you could load all your data into one giant denormalized table with all your columns, then apply a full-text index to that table and adjust your search criteria accordingly. It would probably be the fastest method for searching, but you'd then have to keep the data in sync between the denormalized table and the underlying normalized tables.
SELECT B.*,P.* INTO DenormalizedTable
FROM Building AS B
INNER JOIN People AS P
CREATE FULL TEXT INDEX ft ON DenormalizedTable
etc...
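The UNION rewrite suggested above can be checked on a toy data set. This is a minimal sketch, not the poster's schema: it uses Python's built-in sqlite3, LIKE stands in for the full-text CONTAINS predicate, and the building and person tables and their rows are made up. Unlike the answer's example, both UNION branches here repeat the same join so their columns line up.

```python
import sqlite3

# Assumption: SQLite + LIKE stand in for SQL Server full-text search.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE building (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE person (id INTEGER PRIMARY KEY, building_id INTEGER, first_name TEXT);
    INSERT INTO building VALUES (1, 'Gayan Tower'), (2, 'Main Hall'), (3, 'Gayan Annex');
    INSERT INTO person VALUES (10, 1, 'John'), (11, 2, 'John'), (12, 2, 'Mary'), (13, 3, 'Mary');
""")

base = """
    SELECT b.name, p.first_name
    FROM building b
    JOIN person p ON p.building_id = b.id
"""

# One query with an OR across columns of two different tables.
or_rows = conn.execute(
    base + " WHERE b.name LIKE '%Gayan%' OR p.first_name LIKE '%John%' ORDER BY 1, 2"
).fetchall()

# Two single-predicate queries combined with UNION.
union_rows = conn.execute(
    base + " WHERE b.name LIKE '%Gayan%'"
    + " UNION "
    + base + " WHERE p.first_name LIKE '%John%'"
    + " ORDER BY 1, 2"
).fetchall()

print(or_rows == union_rows)
```

UNION removes duplicate rows, so the two forms only agree when the OR query cannot return duplicates. Whether the split is faster on a real full-text index still has to be measured.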

MS SQL Server trouble with JOIN

I'm new to SQL and need a push in the right direction.
I currently have a working SQL query accessing 3 tables in a database, and I need to add a JOIN using a 4th table, but the syntax escapes me. To simplify, what I have now is:
SELECT
t1_col1, t1_col2, t2_col1, t2_col2, t3_col1
FROM
table1, table2, table3
WHERE
{some conditions}
ORDER BY
t1_col1 ASC;
What I need to do is to add a LEFT OUTER JOIN selecting 2 columns from table4 and have ON t1_field1 = t4_field1, but whatever I try, I'm getting syntax errors all over the place. I don't seem to understand the correct syntax.
I tried
SELECT *
FROM table1
LEFT OUTER JOIN table2;
which has no errors, but as soon as I start SELECTing columns and adding conditions, I get stuck.
I would greatly appreciate any assistance with this.
You do not specify the join criteria. Those would be in your WHERE clause under "some conditions", so I will make some up in order to show the syntax. The syntax you show is often termed "old". It has been discouraged for 15 years or more in the SQL Server documentation. Microsoft consistently threatens to stop recognizing the syntax, but apparently they have not followed through on that threat.
The syntax errors you are getting occur because you are mixing the old style joins (comma separated with WHERE clause) with the new style (LEFT OUTER JOIN) with ON clauses.
Your existing query should be changed to something like this. The tables are aliased because it makes it easier to read and is customary. I just made up the JOIN criteria.
SELECT t1_col1, t1_col2, t2_col1, t2_col2, t3_col1
FROM table1 t1
INNER JOIN table2 t2 ON t2.one_ID = t1.one_ID
INNER JOIN table3 t3 ON t3.two_ID = t2.two_ID
LEFT OUTER JOIN table4 t4 ON t4.three_ID = t3.three_ID
I hope that helps with "a push in the right direction."
You may also want to read this post that explains the different ways to join tables in a query. What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?
Also for the record the "OLD STYLE" of joining tables (NOT RECOMMENDED) would look like this (but do NOT do this - it is a horrible way to write SQL). And it does not work for left outer joins. Get familiar with using the ...JOIN...ON... syntax:
SELECT t1_col1, t1_col2, t2_col1, t2_col2, t3_col1
FROM table1 t1, table2 t2, table3 t3
LEFT OUTER JOIN table4 t4 ON t4.three_ID = t3.three_ID
WHERE
t2.one_ID = t1.one_ID
AND t3.two_ID = t2.two_ID
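To see how the explicit JOIN syntax behaves, here is a small sketch on Python's built-in sqlite3. The table and key names follow the made-up ones in the answer above; the rows and the t4_col1 column are hypothetical. It checks that the comma-style query and the INNER JOIN query return the same rows, and then adds the LEFT OUTER JOIN to table4.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; rows and t4_col1 are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (one_ID INTEGER, t1_col1 TEXT);
    CREATE TABLE table2 (one_ID INTEGER, two_ID INTEGER, t2_col1 TEXT);
    CREATE TABLE table3 (two_ID INTEGER, three_ID INTEGER, t3_col1 TEXT);
    CREATE TABLE table4 (three_ID INTEGER, t4_col1 TEXT);
    INSERT INTO table1 VALUES (1, 'p'), (2, 'q');
    INSERT INTO table2 VALUES (1, 10, 'r'), (2, 20, 's');
    INSERT INTO table3 VALUES (10, 100, 'u'), (20, 200, 'v');
    INSERT INTO table4 VALUES (100, 'w');
""")

# Old style: comma-separated tables, join criteria in WHERE.
old_style = conn.execute("""
    SELECT t1_col1, t2_col1, t3_col1
    FROM table1 t1, table2 t2, table3 t3
    WHERE t2.one_ID = t1.one_ID AND t3.two_ID = t2.two_ID
    ORDER BY t1_col1
""").fetchall()

# Same query with explicit INNER JOIN ... ON.
new_style = conn.execute("""
    SELECT t1_col1, t2_col1, t3_col1
    FROM table1 t1
    INNER JOIN table2 t2 ON t2.one_ID = t1.one_ID
    INNER JOIN table3 t3 ON t3.two_ID = t2.two_ID
    ORDER BY t1_col1
""").fetchall()

# Adding the fourth table: rows without a table4 match keep NULL (None) there.
with_t4 = conn.execute("""
    SELECT t1_col1, t2_col1, t3_col1, t4_col1
    FROM table1 t1
    INNER JOIN table2 t2 ON t2.one_ID = t1.one_ID
    INNER JOIN table3 t3 ON t3.two_ID = t2.two_ID
    LEFT OUTER JOIN table4 t4 ON t4.three_ID = t3.three_ID
    ORDER BY t1_col1
""").fetchall()

print(with_t4)
```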

I want to know why merge join cannot be done without equal sign(=) on ON clause in SQL Server

I'm recently studying SQL Server. I read a book and it says that merge join requires at least one equal sign(=) on ON clause in SQL Server.
So, I tried this query below and I found that the error occurs.
SELECT *
FROM TABLE_1 AS T1
INNER MERGE JOIN TABLE_2 AS T2
ON T1.COL_1 > T2.COL_2
Error Message:
Msg 8622, Level 16, State 1, Line 6 Query processor could not produce
a query plan because of the hints defined in this query. Resubmit the
query without specifying any hints and without using SET FORCEPLAN.
And this book also says this can be done in case of Full Outer Join.
So I tried the query below and found that it ran successfully with no errors.
SELECT *
FROM TABLE_1 AS T1
FULL OUTER MERGE JOIN TABLE_2 AS T2
ON T1.COL_1 > T2.COL_2
I tried to search for the reason but I couldn't find any explanation about this.
Can anyone tell me why SQL Server doesn't allow merge join without an equality operator unless it's full outer join?
Thank you for reading my question
You used INNER MERGE JOIN. MERGE is a join hint that tells the SQL engine which join algorithm to use.
Merge joins use the fastest algorithm, since each row only needs to be read once from the source inputs. Also, optimizations in other join operators can give those operators better performance under certain conditions.
If you must use the '>' operator, use a regular INNER JOIN, like this:
SELECT *
FROM TABLE_1 AS T1
INNER JOIN TABLE_2 AS T2
ON T1.COL_1 > T2.COL_2
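The textbook merge join helps show why an equality matters. The sketch below is a generic illustration in Python, not SQL Server's actual implementation, and the function name and sample rows are made up. Both inputs are sorted on the join key, and the equality test is what decides which cursor may advance and which rows can be discarded for good. With only a `>` predicate there is no such point, which is one plausible reading of why the hint is rejected.

```python
def merge_join(left, right, key_left, key_right):
    """Inner equi-join of two lists that are already sorted on their join keys."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        kl, kr = key_left(left[i]), key_right(right[j])
        if kl < kr:
            i += 1          # left row can never match anything further right
        elif kl > kr:
            j += 1          # right row can never match anything further left
        else:
            # Find the run of right rows sharing this key ...
            j_end = j
            while j_end < len(right) and key_right(right[j_end]) == kl:
                j_end += 1
            # ... and pair it with every left row sharing the same key.
            while i < len(left) and key_left(left[i]) == kl:
                for r in right[j:j_end]:
                    out.append((left[i], r))
                i += 1
            j = j_end
    return out


# Hypothetical sample data, sorted on the first tuple element.
t1 = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
t2 = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
pairs = merge_join(t1, t2, key_left=lambda r: r[0], key_right=lambda r: r[0])
print(pairs)
```

When keys are unique on at least one side, each input row is visited once. With duplicate keys on both sides, the run of equal right rows is revisited for every matching left row.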

PostgreSQL: Using AND statement in LEFT JOIN is not working as expected

This query returns all the elements in the table la and all nulls for fields coming from the lar table which is not what I expected.
SELECT
la.listing_id,
la.id,
lar.*
FROM la
LEFT JOIN lar
ON lar.application_id = la.id AND la.listing_id = 2780;
This query returns correct and expected results but shouldn't both queries do the same thing ?
SELECT
la.listing_id,
la.id,
lar.*
FROM la
LEFT JOIN lar
ON lar.application_id = la.id
WHERE la.listing_id = 2780;
What am I missing here?
I want to make conditional joins, as I have noticed that for complex queries PostgreSQL does the join first and then applies the WHERE clause, which is actually very slow. How can I make the database filter out some records before doing the JOIN?
The confusion around LEFT JOIN and WHERE clause has been clarified many times:
SQL / PostgreSQL left join ignores "on = constant" predicate, on left table
This interesting question remains:
How to make the database filter out some records before doing the JOIN?
There are no explicit query hints in Postgres. (Which is a matter of ongoing debate.) But there are still various tricks to make Postgres bend your way.
But first, ask yourself: Why did the query planner estimate the chosen plan to be cheaper to begin with? Is your server configuration basically sane? Cost settings adequate? autovacuum running? Postgres version outdated? Are you working around an underlying problem that should really be fixed?
If you force Postgres to do it your way, you should be sure it won't backfire after a version upgrade or an update to the server configuration ... You'd better know exactly what you are doing.
That said, you can force Postgres to "filter out some records before doing the JOIN" with a subquery where you add OFFSET 0 - which is just noise, logically, but prevents Postgres from rearranging it into the form of a regular join. (Query hint after all)
SELECT la.listing_id, la.id, lar.*
FROM (
SELECT listing_id, id
FROM la
WHERE listing_id = 2780
OFFSET 0
) la
LEFT JOIN lar ON lar.application_id = la.id;
Or you can use a CTE (less obscure, but more expensive). Or other tricks like setting certain config parameters. Or, in this particular case, I would use a LATERAL join to the same effect:
SELECT la.listing_id, la.id, lar.*
FROM la
LEFT JOIN LATERAL (
SELECT *
FROM lar
WHERE application_id = la.id
) lar ON true
WHERE la.listing_id = 2780;
Related:
Sample Query to show Cardinality estimation error in PostgreSQL
Here is an extensive blog post on query hints by 2ndQuadrant. Five years old but still valid.
The LEFT JOIN keyword returns all rows from the left table (table1), with the matching rows in the right table (table2). The result is NULL in the right side when there is no match.
So no matter how you try to filter with AND la.listing_id = 2780, you still get all the rows from the first table. But only those with la.listing_id = 2780 will have something other than NULL on the right side.
The behaviour is different if you use INNER JOIN: in that case only the matching rows are produced, and the AND condition filters the rows.
So to make the first query work you need to add a WHERE condition on a column from the right table, e.g. WHERE lar.application_id IS NOT NULL.
The problem with the second query is that it will try to JOIN every row and only then filter down to the ones you need.
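The difference between the two queries in the question can be reproduced on a toy data set. This is a minimal sketch using Python's built-in sqlite3 in place of PostgreSQL. The la and lar names follow the question; the note column and the rows are invented for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE la  (id INTEGER PRIMARY KEY, listing_id INTEGER);
    CREATE TABLE lar (application_id INTEGER, note TEXT);
    INSERT INTO la  VALUES (1, 2780), (2, 2780), (3, 9999);
    INSERT INTO lar VALUES (1, 'a'), (3, 'c');
""")

# Condition on the LEFT table inside ON: every la row is kept, and lar
# columns are filled only where the whole ON condition is true.
on_rows = conn.execute("""
    SELECT la.id, la.listing_id, lar.note
    FROM la
    LEFT JOIN lar
      ON lar.application_id = la.id AND la.listing_id = 2780
    ORDER BY la.id
""").fetchall()

# Same condition in WHERE: it filters the la rows themselves.
where_rows = conn.execute("""
    SELECT la.id, la.listing_id, lar.note
    FROM la
    LEFT JOIN lar ON lar.application_id = la.id
    WHERE la.listing_id = 2780
    ORDER BY la.id
""").fetchall()

print(on_rows)
print(where_rows)
```

Row 3 (listing 9999) appears in the first result with a NULL note even though lar has a matching application_id, because the ON condition as a whole is false for it. The WHERE version drops that row entirely.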

How can I speed up this SQL view?

I'm a beginner at this, so I hope you can help. I'm working in SQL Server 2008 R2 and have a view that is composed of four tables all joined together:
SELECT DISTINCT ad.award_id,
bl.funding_id,
bl.budget_line,
dd4.monthnumberofyear AS month,
dd4.yearcalendar AS year,
CASE
WHEN frb.full_value IS NULL THEN '0'
ELSE frb.full_value
END AS Expenditure_value,
bl.budget_id,
frb.accode,
'Actual' AS Type
FROM dw.dbo.dimdate5 AS dd4
LEFT OUTER JOIN dbo.award_data AS ad
ON dd4.fulldate BETWEEN ad.usethisstartdate AND
ad.usethisenddate
LEFT OUTER JOIN dbo.budget_line AS bl
ON bl.award_id = ad.award_id
LEFT OUTER JOIN dw.dbo.fctresearchbalances AS frb
ON frb.el3 = bl.award_id
AND frb.element4groupidnew = bl.budget_line
AND dd4.yearfiscal = frb.yr
AND dd4.monthnumberfiscal = frb.period
The view has 9 columns and 1.5 million rows and growing. A select * from this view was taking 20 minutes for all the rows. I added indexes on the fields in the tables that are joined on and that improved it to 10 minutes. My question is what else could I do to get the select to run faster?
Many thanks, Violet.
Try getting rid of the case statement.
If you have 1.5 million rows and you're interested in an aggregation of those rows rather than the whole set, you might want to sum the rows in fctResearchBalances first and then do the joins.
(It's a bit difficult to determine what else you might benefit from, without seeing the access plan.)
1. You can use a stored procedure to benefit from the buffer cache.
2. You can use an indexed view, which means creating an index on a schema-bound view.
3. You can use query hints in the joins to make the query optimizer use a specific kind of join.
4. You can use table partitioning.
SELECT DISTINCT --#1 - potential bottleneck
ad.award_id
, bl.funding_id
, bl.budget_line
, [month] = dd4.monthnumberofyear
, [year] = dd4.yearcalendar
, Expenditure_value = ISNULL(frb.full_value, '0')
, bl.budget_id
, frb.accode
, [type] = 'Actual'
FROM dbo.dimdate5 dd4
LEFT JOIN dbo.award_data ad ON dd4.fulldate BETWEEN ad.usethisstartdate AND ad.usethisenddate
LEFT JOIN dbo.budget_line bl ON bl.award_id = ad.award_id
LEFT JOIN dbo.fctresearchbalances frb ON frb.el3 = bl.award_id --#2 - join by multiple columns
AND frb.element4groupidnew = bl.budget_line
AND dd4.yearfiscal = frb.yr
AND dd4.monthnumberfiscal = frb.period
The CASE expression can be replaced by
COALESCE(frb.full_value,'0') AS Expenditure_value
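The equivalence of the two forms can be checked quickly. A minimal sketch on Python's built-in sqlite3 (standing in for SQL Server) follows. The frb alias and the full_value and accode columns come from the view above, while the text column type and the three rows are assumptions made for the example.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; the sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE frb (accode TEXT, full_value TEXT);
    INSERT INTO frb VALUES ('A1', '125.50'), ('A2', NULL), ('A3', '0');
""")

rows = conn.execute("""
    SELECT accode,
           CASE WHEN full_value IS NULL THEN '0' ELSE full_value END AS via_case,
           COALESCE(full_value, '0') AS via_coalesce
    FROM frb
    ORDER BY accode
""").fetchall()

print(rows)
```

This only shows that both expressions return the same values. Whether the rewrite changes the run time of the view is a separate question that the execution plan would have to answer.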
Without more info it's not possible to tell exactly what is wrong, but here are some pointers.
When you have so many LEFT JOINs, the order of the joins can make a difference.
Do you have standard indexes or covering indexes with included columns?
If you don't have covering indexes, then primary keys matter in the joins. Including all the primary key columns in the join condition will speed up the query.
Then look at your data: do you need all those LEFT JOINs, based on the foreign keys between those tables? Depending on your keys, a LEFT JOIN may be equivalent to an INNER JOIN.
And with all those LEFT JOINs, is having a DISTINCT really useful?
How much RAM do you have? If you have 8GB+ then 1.5m rows is nothing for SQL Server. You need to optimise those joins.
