Apache Flink - enable join ordering

I have noticed that Apache Flink does not optimise the order in which tables are joined. At the moment, it keeps the user-specified join order (basically, it takes the query literally). I suppose that Apache Calcite can optimise the order of joins, but for some reason these rules are not in use in Apache Flink.
If, for example, we have two tables 'R' and 'S':
private val tableEnv: BatchTableEnvironment = TableEnvironment.getTableEnvironment(env)
private val fileNumber = 1
tableEnv.registerTableSource("R", getDataSourceR(fileNumber))
tableEnv.registerTableSource("S", getDataSourceS(fileNumber))
private val r = tableEnv.scan("R")
private val s = tableEnv.scan("S")
and we suppose that 'S' is empty and we want to join these tables in two ways:
val tableOne = r.as("x1, x2").join(r.as("x3, x4")).where("x2 === x3").select("x1, x4")
  .join(s.as("x5, x6")).where("x4 === x5").select("x1, x6")
val tableTwo = s.as("x1, x2").join(r.as("x3, x4")).where("x2 === x3").select("x1, x4")
  .join(r.as("x5, x6")).where("x4 === x5").select("x1, x6")
If we count the number of rows in tableOne and in tableTwo, the result will be zero in both cases.
The problem is that evaluating tableOne will take much longer than evaluating tableTwo.
Is there any way to automatically optimise the order in which the joins are executed, or even to enable cost-based plan optimisation by adding some statistics? How can these statistics be added?
In the documentation at this link it is written that it may be necessary to change the Table environment's CalciteConfig, but it is not clear to me how to do that.
Please help.

Join reordering is not enabled because Flink does not handle statistics well. Reordering joins without reasonably accurate cardinality estimates is basically gambling. Therefore, join reordering is disabled and tables are joined in the order provided by the user. This gives deterministic and controllable behavior.
However, you can pass optimization rules to the optimizer by providing a TableConfig with a CalciteConfig when creating the TableEnvironment, i.e., TableEnvironment.getTableEnvironment(env, yourTableConfig). In the CalciteConfig you can add optimization rules to different optimization phases. You probably want to add JoinCommuteRule and JoinAssociateRule to the logical optimization phase. You will probably also have to dig into the code to check how to pass statistics into the optimizer.
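For illustration, here is a minimal sketch of that wiring, assuming a Flink 1.x batch TableEnvironment; the CalciteConfigBuilder method names are from that era's API and should be verified against your Flink version:
import org.apache.calcite.rel.rules.{JoinAssociateRule, JoinCommuteRule}
import org.apache.calcite.tools.RuleSets
import org.apache.flink.table.api.{TableConfig, TableEnvironment}
import org.apache.flink.table.calcite.{CalciteConfig, CalciteConfigBuilder}

// Add the join-reordering rules to the logical optimization phase.
val calciteConfig: CalciteConfig = new CalciteConfigBuilder()
  .addLogicalOptRuleSet(RuleSets.ofList(
    JoinCommuteRule.INSTANCE,   // allows swapping the two inputs of a join
    JoinAssociateRule.INSTANCE  // allows re-associating nested joins
  ))
  .build()

// Attach it to a TableConfig and create the environment with that config.
val tableConfig = new TableConfig()
tableConfig.setCalciteConfig(calciteConfig)
val tableEnv = TableEnvironment.getTableEnvironment(env, tableConfig)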

Related

Making a postgres query less expensive for the DB

In a SQL query I have to join many tables, and it's very expensive for the DB.
In the DB a hostgroup has many hosts; there are about 20 hostgroups, and there are 4 hostgroups that I don't use.
I was wondering: if I add a "not in" operator to my query, excluding those 4 hostgroups, will the query be less expensive? Or will it just make things worse by using more resources on the DB?
This is my query, just in case:
select history.clock, hstgrp.name as hostgroup, hstgrp.groupid as hgid , hosts.name as hostname ,
items.name as item, hosts.hostid, history.value as porcentaje, items.key_ as key ,items.itemid,
applications.name as appname, applications.applicationid as appid
FROM history
join items_applications on history.itemid = items_applications.itemid
join applications on items_applications.applicationid = applications.applicationid
join items on items.itemid = history.itemid
join hosts on items.hostid = hosts.hostid
join hosts_groups on hosts.hostid = hosts_groups.hostid
join hstgrp on hosts_groups.groupid = hstgrp.groupid
where lower(items.name) SIMILAR TO lower('Used disk space%|Used disk space on%')
and hstgrp.name not in ('Discovered', 'Discover VMs') -- the additional filter in question
The additional filter certainly cannot hurt, but unless it is very selective, it will probably not reduce the execution time significantly.
I am reduced to guessing, since you didn't add EXPLAIN (ANALYZE, BUFFERS) output to the question, but I'd assume that the query returns a lot of rows and is bound to be slow.
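For reference, that output is produced by prefixing the query from the question:
EXPLAIN (ANALYZE, BUFFERS)
select history.clock, hstgrp.name as hostgroup, ...
-- (the rest of the query, unchanged)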
You could change the SIMILAR TO condition to a plain LIKE (the second alternative in your pattern is already covered by the first):
WHERE lower(items.name) LIKE lower('Used disk space%')
and support it with an index:
CREATE INDEX ON items (lower(name) text_pattern_ops);
Perhaps that will speed up the execution somewhat.

CTE performance in the execution plan: is it displayed twice, or processed twice?

This is the SQL with a common table expression (CTE). Note that USERS_PROJECTS_CTE is used twice.
WITH USERS_PROJECTS_CTE (PRO_ID, SHOW_IAS, USERNAME)
AS
(
SELECT up.PRO_ID, up.SHOW_IAS, ISNULL(u.FIRST_NAME, '') + ' ' + ISNULL(u.SECOND_NAME, '')
FROM SFMIS07_PRO.USERS_PROJECTS up
INNER JOIN SFMIS07_ADM.USERS AS u
ON up.USER_ID = u.ID
WHERE up.IS_RESP_PERSON = 1 AND up.valid_to is null
)
SELECT up.PRO_ID,
up1.USERNAME as RESP_USER1,
up2.USERNAME as RESP_USER2,
up.COUNT_
FROM SFMIS07_PRO.PRO_RESP_USERS_KERNEL_MV AS up
LEFT JOIN USERS_PROJECTS_CTE AS up1 ON up.PRO_ID = up1.PRO_ID AND up1.SHOW_IAS=1
LEFT JOIN USERS_PROJECTS_CTE AS up2 ON up.PRO_ID = up2.PRO_ID AND up2.SHOW_IAS=0
The execution plan (image omitted) shows the CTE twice.
Questions:
Am I right that the CTE is not only displayed twice but also processed twice?
Is it possible to inform the QO to reuse the CTE?
Is it possible for the QO in principle to detect "the same SQL fragment" and reuse results (I imagine the realization of this - by copying already prepared data)?
How to optimize the query (without using temporal tables :) ?
Am I right that the CTE is not only displayed twice but also processed twice?
Yes
Is it possible to inform the QO to reuse the CTE?
Not directly but there are some hacks to encourage this.
Is it possible for the QO in principle to detect "the same SQL fragment" and reuse results (I imagine the realization of this - by copying already prepared data)?
In principle, yes. See the Microsoft Research paper Efficient Exploitation of Similar Subexpressions for Query Processing for examples.
How to optimize the query (without using temporal tables :) ?
The most reliable way would be to use a temporary (not temporal) table. See Provide a hint to force intermediate materialization of CTEs or derived tables for a more hacky workaround.
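For illustration, a minimal sketch of that temporary-table approach, reusing the tables and columns from the question (untested):
-- Materialize the CTE's result once...
SELECT up.PRO_ID, up.SHOW_IAS,
       ISNULL(u.FIRST_NAME, '') + ' ' + ISNULL(u.SECOND_NAME, '') AS USERNAME
INTO #USERS_PROJECTS
FROM SFMIS07_PRO.USERS_PROJECTS AS up
INNER JOIN SFMIS07_ADM.USERS AS u ON up.USER_ID = u.ID
WHERE up.IS_RESP_PERSON = 1 AND up.valid_to IS NULL;

-- ...then join the materialized rows twice.
SELECT up.PRO_ID,
       up1.USERNAME AS RESP_USER1,
       up2.USERNAME AS RESP_USER2,
       up.COUNT_
FROM SFMIS07_PRO.PRO_RESP_USERS_KERNEL_MV AS up
LEFT JOIN #USERS_PROJECTS AS up1 ON up.PRO_ID = up1.PRO_ID AND up1.SHOW_IAS = 1
LEFT JOIN #USERS_PROJECTS AS up2 ON up.PRO_ID = up2.PRO_ID AND up2.SHOW_IAS = 0;

DROP TABLE #USERS_PROJECTS;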

Oracle Spatial SDO_RELATE: why does combining individually specified masks with UNION ALL/INTERSECT give better performance?

Initially I was trying to find out why it's so slow to do a spatial query with multiple SDO_RELATE predicates in a single SELECT statement, like this one:
SELECT * FROM geom_table a
WHERE SDO_RELATE(a.geom_column, SDO_GEOMETRY(...), 'mask=inside')='TRUE' AND
SDO_RELATE(a.geom_column, SDO_GEOMETRY(...), 'mask=anyinteract')='TRUE';
Note that the two SDO_GEOMETRY arguments may not necessarily be the same, so this is a bit different from SDO_RELATE(a.geom_column, the_same_geometry, 'mask=inside+anyinteract')='TRUE'.
Then I found this paragraph in the Oracle documentation for SDO_RELATE:
Although multiple masks can be combined using the logical Boolean operator OR, for example, 'mask=touch+coveredby', better performance may result if the spatial query specifies each mask individually and uses the UNION ALL syntax to combine the results. This is due to internal optimizations that Spatial can apply under certain conditions when masks are specified singly rather than grouped within the same SDO_RELATE operator call. (There are two exceptions, inside+coveredby and contains+covers, where the combination performs better than the UNION ALL alternative.) For example, consider the following query using the logical Boolean operator OR to group multiple masks:
SELECT a.gid
FROM polygons a, query_polys B
WHERE B.gid = 1
  AND SDO_RELATE(A.Geometry, B.Geometry, 'mask=touch+coveredby') = 'TRUE';
The preceding query may result in better performance if it is expressed as follows, using UNION ALL to combine results of multiple SDO_RELATE operator calls, each with a single mask:
SELECT a.gid
FROM polygons a, query_polys B
WHERE B.gid = 1
  AND SDO_RELATE(A.Geometry, B.Geometry, 'mask=touch') = 'TRUE'
UNION ALL
SELECT a.gid
FROM polygons a, query_polys B
WHERE B.gid = 1
  AND SDO_RELATE(A.Geometry, B.Geometry, 'mask=coveredby') = 'TRUE';
This somehow answers my question, but it still only says "due to internal optimizations that Spatial can apply under certain conditions". So I have two questions:
What does "internal optimization" mean? Does it have something to do with the spatial index? (I'm not sure if I'm too demanding with this question; maybe only developers at Oracle know.)
The Oracle documentation doesn't say anything about my original problem, i.e. SDO_RELATE(..., 'mask=inside') AND SDO_RELATE(..., 'mask=anyinteract') in a single SELECT. Why does it also have very bad performance? Does it work similarly to SDO_RELATE(..., 'mask=inside+anyinteract')?
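For reference, applying the documentation's advice to my original query would mean rewriting the AND of the two single-mask predicates with INTERSECT, along these lines (a sketch; the SDO_GEOMETRY arguments are elided exactly as above):
SELECT * FROM geom_table a
WHERE SDO_RELATE(a.geom_column, SDO_GEOMETRY(...), 'mask=inside') = 'TRUE'
INTERSECT
SELECT * FROM geom_table a
WHERE SDO_RELATE(a.geom_column, SDO_GEOMETRY(...), 'mask=anyinteract') = 'TRUE';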

Two radically different queries against 4 mil records execute in the same time - one uses brute force

I'm using SQL Server 2008. I have a table with over 3 million records, which is related to another table with a million records.
I have spent a few days experimenting with different ways of querying these tables. I have it down to two radically different queries, both of which take 6s to execute on my laptop.
The first query uses a brute force method of evaluating possibly likely matches, and removes incorrect matches via aggregate summation calculations.
The second gets all possibly likely matches, then removes incorrect matches via an EXCEPT query that uses two dedicated indexes to find the low and high mismatches.
Logically, one would expect the brute-force query to be slow and the indexed one to be fast. Not so. And I have experimented heavily with indexes until I got the best speed.
Further, the brute-force query doesn't require as many indexes, which means that technically it would yield better overall system performance.
Below are the two execution plans. If you can't see them, please let me know and I'll re-post them in landscape orientation / mail them to you.
Brute-force query:
SELECT ProductID, [Rank]
FROM (
SELECT p.ProductID, ptr.[Rank], SUM(CASE
WHEN p.ParamLo < si.LowMin OR
p.ParamHi > si.HiMax THEN 1
ELSE 0
END) AS Fail
FROM dbo.SearchItemsGet(@SearchID, NULL) AS si
JOIN dbo.ProductDefs AS pd
ON pd.ParamTypeID = si.ParamTypeID
JOIN dbo.Params AS p
ON p.ProductDefID = pd.ProductDefID
JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
ON ptr.ProductTypeID = pd.ProductTypeID
WHERE si.Mode IN (1, 2)
GROUP BY p.ProductID, ptr.[Rank]
) AS t
WHERE t.Fail = 0
Index-based exception query:
WITH si AS (
SELECT DISTINCT pd.ProductDefID, si.LowMin, si.HiMax
FROM dbo.SearchItemsGet(@SearchID, NULL) AS si
JOIN dbo.ProductDefs AS pd
ON pd.ParamTypeID = si.ParamTypeID
JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
ON ptr.ProductTypeID = pd.ProductTypeID
WHERE si.Mode IN (1, 2)
)
SELECT p.ProductID
FROM dbo.Params AS p
JOIN si
ON si.ProductDefID = p.ProductDefID
EXCEPT
SELECT p.ProductID
FROM dbo.Params AS p
JOIN si
ON si.ProductDefID = p.ProductDefID
WHERE p.ParamLo < si.LowMin OR p.ParamHi > si.HiMax
My question is: based on the execution plans, which one looks more efficient? I realize that things may change as my data grows.
EDIT:
I have updated the indexes and now have the following execution plan for the second query (plan image omitted):
Trust the optimizer.
Write the query that most simply expresses what you're trying to achieve. If you're having performance problems with that query, then you should look at whether there are any missing indexes. But you still shouldn't have to work with those indexes explicitly.
Don't concern yourself with considerations of how you might implement such a search.
In very rare circumstances, you may need to go further and force the query to use particular indexes (via hints), but this applies to probably < 0.1% of queries.
In your posted plans, your "optimized" version is causing scans against 2 indexes of your (I presume) Params table (PK_Params_1, IX_Params_1). Without seeing the queries, it's difficult to know why this is happening, but if you're comparing a single scan of a table ("brute force") against two scans, it's easy to see why the second isn't more efficient.
I think I'd try:
SELECT p.ProductID, ptr.[Rank]
FROM dbo.SearchItemsGet(@SearchID, NULL) AS si
JOIN dbo.ProductDefs AS pd
ON pd.ParamTypeID = si.ParamTypeID
JOIN dbo.Params AS p
ON p.ProductDefID = pd.ProductDefID
JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
ON ptr.ProductTypeID = pd.ProductTypeID
LEFT JOIN Params p_anti
on p_anti.ProductDefId = pd.ProductDefID and
(p_anti.ParamLo < si.LowMin or p_anti.ParamHi > si.HiMax)
WHERE si.Mode IN (1, 2)
AND p_anti.ProductID is null
GROUP BY p.ProductID, ptr.[Rank]
I.e. introduce an anti-join that eliminates the results you don't want.
In SQL Server Management Studio, put both queries in the same query window and get the query plan for both at once. It will determine the query plans for both and give you a 'percent of total batch' for each one. The query with the lower percent of the total batch will be the better-performing one.
Does 6 seconds on a laptop = 0.006 seconds on production hardware? The part of your queries that worries me is the clustered index scans shown in the query plan. In my experience, any time a query plan includes a CI scan, it means the query will only get slower as data is added to the table.
What do the two functions yield, as it appears they are the cause of the table scans? Is it possible to persist the data in the DB and update the LowMin and HiMax as rows are added?
Looking at the two execution plans, neither is very good. Look how far to the left the wide lines are. The wide lines mean there are many rows. We need to reduce the number of rows earlier in the process so that we do not work with such large hash tables, large sorts, and nested loops.
BTW, how many rows does your source have, and how many rows are included in the result set?
Thank you all for your input and help.
From reading what you wrote, experimenting, and digging into the execution plan, I discovered that the answer is the tipping point.
There were too many records being returned to warrant use of the index.
See here (Kimberly Tripp).
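A follow-up note for anyone verifying a tipping-point diagnosis: one rough check on SQL Server 2008 is to force an index seek with a table hint and compare the I/O against the optimizer's chosen scan. A hypothetical sketch against the Params table, assuming a seekable index on ProductDefID exists:
SET STATISTICS IO ON;

-- If the forced seek does more reads than the optimizer's scan, the scan was
-- the right choice: the query is past the tipping point.
SELECT p.ProductID
FROM dbo.Params AS p WITH (FORCESEEK)
JOIN dbo.ProductDefs AS pd
  ON pd.ProductDefID = p.ProductDefID;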

Why Is My Inline Table UDF so much slower when I use variable parameters rather than constant parameters?

I have a table-valued, inline UDF. I want to filter the results of that UDF to get one particular value. When I specify the filter using a constant parameter, everything is great and performance is almost instantaneous. When I specify the filter using a variable parameter, it takes a significantly larger chunk of time, on the order of 500x more logical reads and 20x greater duration.
The execution plan shows that in the variable parameter case the filter is not applied until very late in the process, causing multiple index scans rather than the seeks that are performed in the constant case.
I guess my questions are: Why, since I'm specifying a single filter parameter that is going to be highly selective against an indexed field, does my performance go into the weeds when that parameter is in a variable? Is there anything I can do about this?
Does it have something to do with the analytic function in the query?
Here are my queries:
CREATE FUNCTION fn_test()
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
SELECT DISTINCT GCN_SEQNO, Drug_package_version_ID
FROM
(
SELECT COALESCE(ndctbla.GCN_SEQNO, ndctblb.GCN_SEQNO) AS GCN_SEQNO,
       dpv.Drug_package_version_ID,
       ROW_NUMBER() OVER (PARTITION BY dpv.Drug_package_version_id
                          ORDER BY ndctbla.GCN_SEQNO DESC) AS Predicate
FROM dbo.Drug_Package_Version dpv
LEFT JOIN dbo.NDC ndctbla ON ndctbla.NDC = dpv.Sp_package_code
LEFT JOIN dbo.NDC ndctblb ON ndctblb.SPC_NDC = dpv.Sp_package_code
) iq
WHERE Predicate = 1
GO
GRANT SELECT ON fn_test TO public
GO
-- very fast
SELECT GCN_SEQNO
FROM dbo.fn_test()
WHERE Drug_package_version_id = 10000
GO
-- comparatively slow
DECLARE #dpvid int
SET #dpvid = 10000
SELECT GCN_SEQNO
FROM dbo.fn_test()
WHERE Drug_package_version_id = #dpvid
Once you create a new projection through a UDF, you can't expect the indexes on the columns of the original table to still apply to that projection. When you filter on the projection (and not inside the UDF, against the original table with the indexes), the indexes no longer apply.
What you want to do is parameterize the function so that the filter value is passed in and applied inside the UDF.
If you find that you have too many fields you want to parameterize on, then you might want to take a look at indexed views: you can create your projection, index it, and then run queries against that.
Put simply, the constant is easy to evaluate when the plan is built; the local variable is not, especially with the ranking function and the filter Predicate = 1.
Paraphrasing casparOne: you need to push the filter as far inwards as possible, so that you filter on dpv.Drug_package_version_id inside the iq derived table.
If you do that, then you also have no need for the PARTITION BY, because you have only a single dpv.Drug_package_version_id. Then you can do a cleaner ...TOP 1 ... ORDER BY ndctbla.GCN_SEQNO DESC.
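A minimal sketch combining both suggestions (fn_test_param is a hypothetical name, untested against the real schema):
CREATE FUNCTION fn_test_param (@dpvid int)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
-- The filter is applied inside the UDF, so the optimizer can seek on
-- Drug_package_version_ID; with a single ID there is no need for
-- ROW_NUMBER() ... PARTITION BY, and TOP 1 ... ORDER BY picks the same row.
SELECT TOP 1
       COALESCE(ndctbla.GCN_SEQNO, ndctblb.GCN_SEQNO) AS GCN_SEQNO,
       dpv.Drug_package_version_ID
FROM dbo.Drug_Package_Version dpv
LEFT JOIN dbo.NDC ndctbla ON ndctbla.NDC = dpv.Sp_package_code
LEFT JOIN dbo.NDC ndctblb ON ndctblb.SPC_NDC = dpv.Sp_package_code
WHERE dpv.Drug_package_version_ID = @dpvid
ORDER BY ndctbla.GCN_SEQNO DESC
GO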
The responses I got were good, and I learned from them, but I think I've found an answer that satisfies me.
I do think it's the use of the PARTITION BY clause that is causing the problem here. I reformulated the UDF using a variant of the self-join idiom:
SELECT t1.A, t1.B, t1.C
FROM T t1
INNER JOIN
(
SELECT A, MAX(C) AS C
FROM T
GROUP BY A
) t2 ON t1.A = t2.A AND t1.C = t2.C
Ironically, this is more performant than the SQL 2008-specific query, and the optimizer also has no problem joining this version of the query using variables rather than constants. At this point, I'm concluding that the optimizer just doesn't handle the more recent SQL extensions as well as the older constructs. As a bonus, I can make use of the UDF now on my not-yet-upgraded SQL 2000 platforms.
Thanks for your help, everyone!
