BigQuery GENERATE_UUID() and CTEs - database

This behavior surprised me a little bit.
When you generate a UUID in a CTE (to make a row id, etc.) and then reference that CTE later in the query, you'll find that the value changes. It seems that generate_uuid() is being called twice (once per reference) instead of once. Does anyone know why BigQuery behaves this way, and what this behavior is called?
I was using generate_uuid() to create a row_id and found that my later joins produced no matches because of this. The best way around it I've found is to create a table from the first CTE, which cements the UUID in place for future use.
Still curious to know more about the why and what behind this.
with _first as (
select generate_uuid() as row_id
)
,_second as (
select * from _first
)
select row_id from _first
union all
select row_id from _second

curious to know more about the why and what behind this
This is by design:
WITH clauses are not materialized. Placing all your queries in WITH clauses and then running UNION ALL is a misuse of the WITH clause.
If a query appears in more than one WITH clause, it executes in each clause.
You can see it in the documentation: "Do not treat WITH clauses as prepared statements."
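A minimal sketch of the workaround mentioned above, assuming the statements run together as a BigQuery script so that a temporary table is available (the table name here is just illustrative):
create temp table first_ids as
select generate_uuid() as row_id;

select row_id from first_ids
union all
select row_id from first_ids;
-- both branches now read the stored value, so the row_id no longer changes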

Related

Microsoft SQL Server: run arbitrary query and save result into temp table

Given an arbitrary select query, how can I save its results into a temporary table?
To simplify things let's assume the select query does not contain an order by clause at the top level; it's not dynamic SQL; it really is a select (not a stored procedure call), and it's a single query (not something that returns multiple result sets). All of the columns have an explicit name. How can I run it and save the results to a temp table? Either by processing the SQL on the client side, or by something clever in T-SQL.
I am not asking about any particular query -- obviously, given some particular SQL I could rewrite it by hand to save into a temp table -- but about a rule that will work in general and can be programmed.
One possible "answer" that does not work in general
For simple queries you can do
select * into #tmp from (undl) x where undl is the underlying SQL query. But this fails if undl is a more complex query; for example if it uses common table expressions using with.
For similar reasons with x as (undl) select * into #tmp from x does not work in general; with clauses cannot be nested.
My current approach, but not easy to program
The best I've found is to find the top level select of the query and munge it to add into #tmp just before the from keyword. But finding which select to munge is not easy; it requires parsing the whole query in the general case.
Possible solution with user-defined function
One approach may be to create a user-defined function wrapping the query, then select * into #tmp from dbo.my_function() and drop the function afterwards. Is there something better?
More detail on why the simple approach fails when the underlying uses CTEs. Suppose I try the rule select * into #tmp from (undl) x where undl is the underlying SQL. Now let undl be with mycte as (select 5 as mycol) select mycol from mycte. Once the rule is applied, the final query is select * into #tmp from (with mycte as (select 5 as mycol) select mycol from mycte) x which is not valid SQL, at least not on my version (MSSQL 2016). with clauses cannot be nested.
To be clear, CTEs must be defined at the top level before the select. They cannot be nested and cannot appear in subqueries. I fully understand that and it's why I am asking this question. An attempt to wrap the SQL that ends up trying to nest the CTEs will not work. I am looking for an approach that will work.
"Put an into right before the select". This will certainly work but requires parsing the SQL in the general case. It's not always obvious (to a computer program) which select needs to change. I did try the rule of adding it to the last select in the query, but this also fails. For example if the underlying query is
with mycte as (select 5 as mycol) select mycol from mycte except select 6
then the into #x needs to be added to the second select, not to the one that appears after except. Getting this right in the general case involves parsing the SQL into a syntax tree.
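For that example the rewrite would look like this; as far as I know SQL Server only accepts INTO in the first branch of a set operation, which is exactly why the placement matters:
with mycte as (select 5 as mycol)
select mycol into #x from mycte
except
select 6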
In the end creating a user-defined function appears to be the only general answer. If undl is the underlying select query, then you can say
create function dbo.myfunc() returns table as return (undl)
go
select * into #tmp from dbo.myfunc()
go
drop function dbo.myfunc
go
The pseudo-SQL go indicates starting a new batch. The create function must be executed in one batch before the select, otherwise you get a syntax error. (Just separating them with ; is not enough.)
This approach works even when undl contains subqueries or common table expressions using with. However, it does not work when the query uses temporary tables.
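For instance, wrapping the earlier CTE example would look roughly like this (illustrative sketch; inline table-valued functions do accept a WITH clause inside the RETURN as far as I'm aware):
create function dbo.myfunc() returns table as return
(with mycte as (select 5 as mycol) select mycol from mycte)
go
select * into #tmp from dbo.myfunc()
go
drop function dbo.myfunc
go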

How to force SQL Server to process CONTAINS clauses before WHERE clauses?

I have a SQL query that uses both standard WHERE clauses and full text index CONTAINS clauses. The query is built dynamically from code and includes a variable number of WHERE and CONTAINS clauses.
In order for the query to be fast, it is very important that the full text index be searched before the rest of the criteria are applied.
However, SQL Server chooses to process the WHERE clauses before the CONTAINS clauses, and that causes table scans and makes the query very slow.
I'm able to rewrite this using two queries and a temporary table. When I do so, the query executes 10 times faster. But I don't want to do that in the code that creates the query because it is too complex.
Is there a way to force SQL Server to process the CONTAINS before anything else? I can't force a plan (USE PLAN) because the query is built dynamically and varies a lot.
Note: I have the same problem on SQL Server 2005 and SQL Server 2008.
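For reference, the two-query/temp-table rewrite described above looks roughly like this; every table, column, and search-term name below is a placeholder:
-- step 1: let the full text index do its work first
SELECT KeyCol
INTO #ft_hits
FROM dbo.MyTable
WHERE CONTAINS(SearchCol, 'search term');

-- step 2: apply the ordinary WHERE conditions to the (small) hit list
SELECT t.*
FROM dbo.MyTable AS t
INNER JOIN #ft_hits AS h
    ON h.KeyCol = t.KeyCol
WHERE t.OtherCol = 'other value';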
You can signal your intent to the optimiser like this
SELECT *
FROM
(
    SELECT *
    FROM ...
    WHERE CONTAINS(...)
) T1
WHERE
    (normal conditions)
However, SQL is declarative: you say what you want, not how to do it. So the optimiser may decide to ignore the nesting above.
You can force the derived table with CONTAINS to be materialised before the classic WHERE clause is applied. I won't guarantee performance.
SELECT *
FROM
(
    SELECT TOP 2000000000
        *
    FROM
        ....
    WHERE
        CONTAINS(...)
    ORDER BY
        SomeID
) T1
WHERE
    (normal conditions)
Try doing it with 2 queries without temp tables:
SELECT *
FROM table
WHERE id IN (
    SELECT id
    FROM table
    WHERE contains_criteria
)
AND further_where_clauses
As I noted above, this is NOT as clean a way to "materialize" the derived table as the TOP clause that #gbn proposed, but a loop join hint forces an order of evaluation, and has worked for me in the past (admittedly usually with two different tables involved). There are a couple of problems though:
The query is ugly
You still don't get any guarantees that the other WHERE parameters don't get evaluated until after the join (I'll be interested to see what you get)
Here it is though, given that you asked:
SELECT OriginalTable.XXX
FROM (
    SELECT XXX
    FROM OriginalTable
    WHERE CONTAINS(XXX)
) AS ContainsCheck
INNER LOOP JOIN OriginalTable
    ON ContainsCheck.PrimaryKeyColumns = OriginalTable.PrimaryKeyColumns
   AND OriginalTable.OtherWhereConditions = OtherValues

Is it okay to use table look up functionalities in scalar function?

In our case we have some business logic that looks into several tables in a certain order, so that the first non-null value found is used. The lookup is not hard, but it does take several lines of SQL to accomplish. I have read about scalar-valued functions in SQL Server, but I don't know whether the recompilation issue affects me enough to do it in a less convenient way.
So what's the general rule of thumb?
Would you rather have something like
select id, dbo.udfGetFirstNonNull(id) from mytable
Or are table-valued functions any better than scalar ones?
select id,
(select firstNonNull from udfGetFirstNonNull(id)) as firstNonNull
from myTable
The scalar UDF will run the lookup once for each row in myTable, so the cost grows with the number of rows; effectively you have a CURSOR. If you only have a few rows it won't matter, of course.
I do the same myself where I don't expect a lot of rows (more than a few hundred).
However, I would consider a table-valued function where I've placed "foo" below; "foo" could also be a CTE inside a UDF (not tested):
select M.id,
       foo.firstNonNull
from myTable as M
join (
    -- per-id "first" non-null value; MIN is used as the pick here (not tested)
    select id, min(value) as firstNonNull
    from OtherTable
    where value is not null
    group by id
) as foo
    on M.id = foo.id
Your first query is fine. One place I work for is absolutely obsessed with speed and optimization, and they use UDFs heavily in this way.
I think for readability and maintainability I would prefer the scalar function, since a single value is what it is returning.
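For what it's worth, an inline table-valued version of udfGetFirstNonNull might look roughly like this; the lookup tables and columns are invented for the sketch, not the asker's real schema:
create function dbo.udfGetFirstNonNull (@id int)
returns table
as return
(
    -- check the lookup tables in priority order and take the first non-null hit
    select coalesce(
        (select top 1 a.val from TableA as a where a.id = @id and a.val is not null),
        (select top 1 b.val from TableB as b where b.id = @id and b.val is not null),
        (select top 1 c.val from TableC as c where c.id = @id and c.val is not null)
    ) as firstNonNull
)
Because it is an inline function, the optimizer can expand it into the outer query instead of invoking a scalar function once per row.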

Optimizing ROW_NUMBER() in SQL Server

We have a number of machines which record data into a database at sporadic intervals. For each record, I'd like to obtain the time period between this recording and the previous recording.
I can do this using ROW_NUMBER as follows:
WITH TempTable AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Machine_ID ORDER BY Date_Time) AS Ordering
FROM dbo.DataTable
)
SELECT [Current].*, Previous.Date_Time AS PreviousDateTime
FROM TempTable AS [Current]
INNER JOIN TempTable AS Previous
ON [Current].Machine_ID = Previous.Machine_ID
AND Previous.Ordering = [Current].Ordering + 1
The problem is, it goes really slow (several minutes on a table with about 10k entries) - I tried creating separate indexes on Machine_ID and Date_Time, and a single combined index, but nothing helps.
Is there any way to rewrite this query to make it go faster?
The given ROW_NUMBER() partition and order require an index on (Machine_ID, Date_Time) to satisfy in one pass:
CREATE INDEX idxMachineIDDateTime ON DataTable (Machine_ID, Date_Time);
Separate indexes on Machine_ID and Date_Time will help little, if any.
How does it compare to this version?:
SELECT x.*
,(SELECT MAX(Date_Time)
FROM dbo.DataTable
WHERE Machine_ID = x.Machine_ID
AND Date_Time < x.Date_Time
) AS PreviousDateTime
FROM dbo.DataTable AS x
Or this version?:
SELECT x.*
,triang_join.PreviousDateTime
FROM dbo.DataTable AS x
INNER JOIN (
SELECT l.Machine_ID, l.Date_Time, MAX(r.Date_Time) AS PreviousDateTime
FROM dbo.DataTable AS l
LEFT JOIN dbo.DataTable AS r
ON l.Machine_ID = r.Machine_ID
AND l.Date_Time > r.Date_Time
GROUP BY l.Machine_ID, l.Date_Time
) AS triang_join
ON triang_join.Machine_ID = x.Machine_ID
AND triang_join.Date_Time = x.Date_Time
Both would perform best with an index on (Machine_ID, Date_Time), and for correct results I'm assuming that combination is unique.
You haven't mentioned what is hidden away in *, and that can sometimes mean a lot, since a (Machine_ID, Date_Time) index will not generally be covering; if you have a lot of columns there, or they hold a lot of data, ...
If the number of rows in dbo.DataTable is large, it is likely that you are experiencing the issue because the CTE is joined to itself and therefore evaluated more than once. There is a blog post explaining the issue in some detail here.
Occasionally in such cases I have resorted to inserting the result of the CTE query into a temporary table and then doing the joins against that temporary table (although this has usually been for cases where a large number of joins against the temp table are required; in the case of a single join the performance difference will be less noticeable).
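A rough sketch of that rewrite for the query in this question (untested; the index choice is just a guess):
SELECT *,
       ROW_NUMBER() OVER (PARTITION BY Machine_ID ORDER BY Date_Time) AS Ordering
INTO #Numbered
FROM dbo.DataTable;

CREATE CLUSTERED INDEX IX_Numbered ON #Numbered (Machine_ID, Ordering);

SELECT [Current].*, Previous.Date_Time AS PreviousDateTime
FROM #Numbered AS [Current]
INNER JOIN #Numbered AS Previous
    ON [Current].Machine_ID = Previous.Machine_ID
   AND Previous.Ordering = [Current].Ordering + 1;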
I have had some strange performance problems using CTEs in SQL Server 2005. In many cases, replacing the CTE with a real temp table solved the problem.
I would try this before going any further with using a CTE.
I never found any explanation for the performance problems I've seen, and really didn't have any time to dig into the root causes. However I always suspected that the engine couldn't optimize the CTE in the same way that it can optimize a temp table (which can be indexed if more optimization is needed).
Update
After your comment that this is a view, I would first test the query with a temp table to see if that performs better.
If it does, and using a stored proc is not an option, you might consider making the current CTE into an indexed/materialized view. You will want to read up on the subject before going down this road, as whether this is a good idea depends on a lot of factors, not the least of which is how often the data is updated.
What if you use a trigger to store the last timestamp and subtract each time to get the difference?
If you require this data often, rather than calculating it each time you pull the data, why not add a column and calculate/populate it whenever a row is added?
(Remus' compound index will make the query fast; running it only once should make it faster still.)
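A hedged sketch of that idea, assuming a nullable PreviousDateTime column has been added to dbo.DataTable and that (Machine_ID, Date_Time) is unique; everything here is illustrative:
CREATE TRIGGER trg_DataTable_SetPreviousDateTime
ON dbo.DataTable
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- fill in the previous recording time for each newly inserted row
    UPDATE d
    SET d.PreviousDateTime =
        (SELECT MAX(p.Date_Time)
         FROM dbo.DataTable AS p
         WHERE p.Machine_ID = d.Machine_ID
           AND p.Date_Time < d.Date_Time)
    FROM dbo.DataTable AS d
    INNER JOIN inserted AS i
        ON i.Machine_ID = d.Machine_ID
       AND i.Date_Time = d.Date_Time;
END;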

How to simplify this SQL query

The table, Query, has 2 columns (functionId, depFunctionId).
I want all values that appear in either functionid or depfunctionid.
I am using this:
select distinct depfunctionid from Query
union
select distinct functionid from Query
How can I do it better?
I think that's the best you'll get.
That's as good as it gets, I think...
Lose the DISTINCT clauses, as your UNION (vs UNION ALL) will take care of removing duplicates.
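That is, the query can simply read:
select depfunctionid from Query
union
select functionid from Query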
An alternative - but perhaps less clear and probably with the same execution plan - would be to do a FULL JOIN across the 2 columns.
SELECT
    COALESCE(Query1.FunctionId, Query2.DepFunctionId) AS FunctionId
FROM Query AS Query1
FULL OUTER JOIN Query AS Query2
    ON Query1.FunctionId = Query2.DepFunctionId
I am almost sure you can lose the DISTINCTs.
When you use UNION instead of UNION ALL, duplicate results are thrown away.
It all depends on how heavy your inline view query is. The key to better performance would be to execute it only once, but that is not possible given the data it returns.
If you do it like this:
select depfunctionid, functionid from Query
group by depfunctionid, functionid
it is very likely that you'll get repeated results for depfunctionid or functionid.
I may be wrong, but it seems to me that you're trying to retrieve a tree of dependencies. If that's the case, I personally would try a materialized path approach.
If the materialized path is stored in a self-referencing table, I would retrieve the tree using something like:
select asrt2.function_id
from a_self_referencing_table asrt1,
     a_self_referencing_table asrt2
where asrt1.function_name = 'blah function'
  and asrt2.materialized_path like (asrt1.materialized_path || '%')
order by asrt2.materialized_path, asrt2.some_child_node_ordering_column
This would retrieve the whole tree in the proper order. What sucks is having to construct the materialized path based on the function_id and parent_function_id (or in your case, functionid and depfunctionid), but a trigger could take care of it quite easily.

Resources