Simple distinct count across BigQuery arrays without using HLL or UDFs

Like in the example here, I want to distinct count across BigQuery arrays: Distinct Count across Bigquery arrays
However, I have a few extra requirements that make the solution provided in that post infeasible for me:
The solution must not use UDFs (too slow).
The solution must not use the HLL functions (the count must be exact).
The solution must not use the SELECT-from-SELECT pattern shown in the linked solution, as it needs to aggregate over a flexible group of dimensions selected by an end user through a BI tool.
So, while this extended example (containing user as a grouping dimension) works using HLL:
#standardSQL
WITH test AS (
  SELECT 'A' AS User, DATE('2018-01-01') AS ReportDate, 2 AS value, [1,2,3] AS key UNION ALL
  SELECT 'A' AS User, DATE('2018-01-02') AS ReportDate, 3 AS value, [1,4,5] AS key UNION ALL
  SELECT 'B' AS User, DATE('2018-01-02') AS ReportDate, 4 AS value, [4,5,6,7,8] AS key UNION ALL
  SELECT 'B' AS User, DATE('2018-01-02') AS ReportDate, 5 AS value, [3,4,5,6,7] AS key
)
SELECT
  User,
  SUM(value) total_value,
  HLL_COUNT.MERGE((
    SELECT HLL_COUNT.INIT(key)
    FROM UNNEST(key) key
  )) AS unique_key_count
FROM test
GROUP BY User
I need a version that accomplishes this distinct aggregated array counting while meeting the requirements above.
Again, this means it should also work properly if I group only on ReportDate, on a combination of User / ReportDate, or in a scenario where this example is extended with additional dimensions.

#standardSQL
WITH test AS
(
SELECT 'A' AS User, DATE('2018-01-01') AS ReportDate, 2 AS value, [1,2,3] AS key UNION ALL
SELECT 'A' AS User, DATE('2018-01-02') AS ReportDate, 3 AS value, [1,4,5] AS key UNION ALL
SELECT 'B' AS User, DATE('2018-01-02') AS ReportDate, 4 AS value, [4,5,6,7,8] AS key UNION ALL
SELECT 'B' AS User, DATE('2018-01-02') AS ReportDate, 5 AS value, [3,4,5,6,7] AS key
)
SELECT
  User,
  SUM(IF(flag = 0, value, 0)) total_value,
  COUNT(DISTINCT key) unique_key_count
FROM test, UNNEST(key) key WITH OFFSET flag
GROUP BY User
with this result:
Row User total_value unique_key_count
1 A 5 5
2 B 9 6
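How this works: UNNEST(key) WITH OFFSET emits one row per array element together with the element's position, so SUM(IF(flag = 0, value, 0)) counts each source row's value exactly once, while COUNT(DISTINCT key) counts the flattened elements. Since this is an ordinary GROUP BY, the grouping dimensions can be swapped freely. As a sketch of mine (not from the original answer), the same data grouped by ReportDate instead:
#standardSQL
WITH test AS
(
  SELECT 'A' AS User, DATE('2018-01-01') AS ReportDate, 2 AS value, [1,2,3] AS key UNION ALL
  SELECT 'A' AS User, DATE('2018-01-02') AS ReportDate, 3 AS value, [1,4,5] AS key UNION ALL
  SELECT 'B' AS User, DATE('2018-01-02') AS ReportDate, 4 AS value, [4,5,6,7,8] AS key UNION ALL
  SELECT 'B' AS User, DATE('2018-01-02') AS ReportDate, 5 AS value, [3,4,5,6,7] AS key
)
SELECT
  ReportDate,
  SUM(IF(flag = 0, value, 0)) total_value,   -- each source row's value counted once
  COUNT(DISTINCT key) unique_key_count       -- distinct elements across all arrays in the group
FROM test, UNNEST(key) key WITH OFFSET flag
GROUP BY ReportDate
which should yield total_value 2 / unique_key_count 3 for 2018-01-01, and total_value 12 / unique_key_count 7 for 2018-01-02. One caveat: the implicit CROSS JOIN drops source rows whose key array is empty; a LEFT JOIN ... UNNEST would be needed to keep them.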

Related

How to filter IDs based on dates?

I have the following table:
ID | DATES
---+-----------
 1 | 02-09-2010
 2 | 03-08-2011
 1 | 08-01-2011
 3 | 04-03-2010
I am looking for IDs that have at least one date before 05-01-2010 AND at least one date after 05-02-2010.
I tried the following:
WHERE tb1.DATES < '05-01-2010' AND tb1.DATES > '05-02-2010'
I don't think it's correct, because I wasn't getting the right IDs when I ran it; there's something wrong with that logic.
Can someone explain what I am doing wrong here?
The SQL command SELECT * FROM tb1 WHERE tb1.DATES < '05-01-2010' AND tb1.DATES > '05-02-2010' asks "find all the rows where the DATES field is before 1 May and after 2 May", which, when put into English, obviously matches no rows at all.
Instead, the command should ask "find all the IDs which have one record before 1 May and another record after 2 May", which creates the need to look at multiple records for each ID.
As @Martheen suggested, you could do this with two (sub)queries, e.g.,
SELECT A.ID
FROM
(SELECT DISTINCT tb1.ID
FROM mytable tb1
WHERE tb1.[dates] < '20100501'
) AS A
INNER JOIN
(SELECT DISTINCT tb1.ID
FROM mytable tb1
WHERE tb1.[dates] > '20100502'
) AS B
ON A.ID = B.ID;
or using INTERSECT
SELECT DISTINCT tb1.ID
FROM mytable tb1
WHERE tb1.[dates] < '20100501'
INTERSECT
SELECT mt2.ID
FROM mytable mt2
WHERE mt2.[dates] > '20100502';
The use of DISTINCT in the above is so that you only get one row per ID, no matter how many rows they have before/after the relevant dates.
You could also do it via GROUP BY and HAVING, which in this particular case is easy: if any dates are before 1 May, then the earliest date must be before 1 May (and correspondingly for the max date and 2 May), e.g.,
SELECT mt1.ID
FROM mytable mt1
GROUP BY mt1.ID
HAVING MIN(mt1.[dates]) < '20100501' AND MAX(mt1.[dates]) > '20100502';
Here is a db<>fiddle with all 3 of these; all provide the same answer (one row, with ID = 1).
Finally, you should use an unambiguous format for your dates. My preferred one is 'yyyyMMdd' with no dashes/slashes/etc., as those make dates ambiguous.
Different countries/servers/etc. will convert the dates you have there differently; see, e.g., SQL Server UTC string comparison not working.
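As a minimal T-SQL sketch of mine (not from the original answer) showing the ambiguity: the same dashed string parses to different dates depending on the session's DATEFORMAT, while 'yyyyMMdd' does not:
-- '05-01-2010' is read as 1 May or 5 January depending on session settings
SET DATEFORMAT mdy;
SELECT CAST('05-01-2010' AS date);  -- 2010-05-01
SET DATEFORMAT dmy;
SELECT CAST('05-01-2010' AS date);  -- 2010-01-05
-- the unseparated form is interpreted the same way under both settings
SELECT CAST('20100501' AS date);    -- 2010-05-01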
One solution is to use BETWEEN to specify the range:
SELECT * FROM Table_name
WHERE From_date BETWEEN '2013-01-03' AND '2013-01-09'
The other solution is the one you mentioned, but check that the logic is correct:
SELECT * FROM Table_name
WHERE From_date > '2010-01-05' AND From_date < '2010-02-05'

SQL CAST causes Arithmetic Overflow Error even when row is NOT in result set

This one has us perplexed ...
We have a query that uses CAST to convert a float to a decimal; this query joins a number of tables to find the rows to return. One of the rows in one of the tables contains a value that, when CAST to a decimal, causes an Arithmetic Overflow error.
The strange thing is that the row containing this value is NOT one of the rows being returned in the result set.
Overly simplified example:
ID Value
1 1.1
2 11.1
3 11111.1
Query:
SELECT Id, CAST(value as decimal(4,1))
FROM <complex number of joins>
WHERE <conditions that don't return row with Id 3>
... Arithmetic Error
If we explicitly exclude that row in the WHERE clause then the error goes away.
E.g. WHERE ... AND Id <> 3
... works fine
Does anyone know how this is possible?
NOTE: The issue here is not that the CAST fails on row with Id 3!
The issue is that the WHERE clause excludes the row with Id 3, and yet the query still fails. How can the query fail if the row with value 11111.1 is not being returned by the WHERE clause?
The type DECIMAL(4, 1) means a total of four places of precision, one of which is to the right of the decimal place. So, to accommodate the value 11111.1, you would need at least DECIMAL(6, 1). The following query should work:
SELECT Id, CAST(value AS DECIMAL(6,1))
FROM <complex number of joins>
WHERE <conditions that don't return row with Id 3>
At least, the above would work for the three points of sample data you provided.
Demo
It seems that the filter is being applied after the cast, not before it; SQL Server does not guarantee the order in which predicates and expressions are evaluated. Take a look at the execution plan for your query to help you understand the order of operations.
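Given that the evaluation order can't be relied on, a common defensive pattern (my addition, not from the original answers) is to make the cast itself safe, so it cannot overflow no matter when it is evaluated. On SQL Server 2012+ TRY_CAST returns NULL instead of raising the error; on older versions a CASE guard usually achieves the same:
-- SQL Server 2012+: TRY_CAST yields NULL instead of an arithmetic overflow
SELECT Id, TRY_CAST(value AS DECIMAL(4,1))
FROM <complex number of joins>
WHERE <conditions that don't return row with Id 3>
-- Pre-2012: only cast values that actually fit in DECIMAL(4,1), i.e. below 1000 in magnitude
SELECT Id, CASE WHEN ABS(value) < 1000 THEN CAST(value AS DECIMAL(4,1)) END
FROM <complex number of joins>
WHERE <conditions that don't return row with Id 3>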
It is not because your WHERE condition filters the data; it is because you have chosen too small a precision in the CAST. You should change it to DECIMAL(8,2), or to whatever the maximum length of your column data is. The following examples show how it works.
The following works, as it doesn't fetch any data:
WITH yourTable AS (
SELECT 1 AS ID, '1.1' AS Value UNION ALL
SELECT 2, '11.1' UNION ALL
SELECT 3, '11111.1313'
)
SELECT Id, CAST(value as decimal(4,1)) AS Id_casted
FROM yourTable WHERE yourTable.ID=4
The following won't work, because the decimal value exceeds the conversion precision:
WITH yourTable AS (
SELECT 1 AS ID, '1.1' AS Value UNION ALL
SELECT 2, '11.1' UNION ALL
SELECT 3, '11111.1313'
)
SELECT Id, CAST(value as decimal(4,1)) AS Id_casted
FROM yourTable WHERE yourTable.ID=3
You can solve this by changing DECIMAL(4,1) (3 digits before the decimal point) to DECIMAL(8,2) (6 digits before the decimal point):
WITH yourTable AS (
SELECT 1 AS ID, '1.1' AS Value UNION ALL
SELECT 2, '11.1' UNION ALL
SELECT 3, '11111.1313'
)
SELECT Id, CAST(value as DECIMAL(8,2)) AS Id_casted
FROM yourTable WHERE yourTable.ID=3
Finally, the following code shows the exception directly: the maximum value for NUMERIC(5,2) is 999.99, so assigning 1000.554 throws one:
DECLARE @aritherror NUMERIC(5,2)
SET @aritherror = 1000.554
SELECT @aritherror

MS SQL Query in Access via ODBC connecting to SQL Server

Due to a few limitations (which I won't go into here), our architecture uses queries in Access running via the ODBC SQL Server driver.
The following query produces 2 errors:
SELECT Tbl2.columnid,
Tbl2.label,
Tbl2.group1,
Tbl2.group2,
Count(Tbl2.columnid) AS Total
FROM (SELECT scanned AS Group1,
false AS Group2,
scanned AS Label,
scanned AS ColumnID
FROM (SELECT *,
( quantity - productqty ) AS Variance
FROM order_line
WHERE processed = false) AS Tbl1
WHERE wsid = 1 ) AS Tbl2
WHERE Tbl2.columnid = false
GROUP BY Tbl2.group1,
Tbl2.group2,
Tbl2.columnid,
Tbl2.label
ORDER BY Tbl2.group1 DESC,
Tbl2.group2
Error 1: Each GROUP BY Expression must contain at least one column that is an outer reference: (#164)
Error 2: The ORDER BY position number 0 is out of range of the number of items in the Select list (#108)
It's important to note that "scanned" is a BIT field in SQL Server (and therefore Group1, Label, and ColumnID are also bits). I believe this is the reason why GROUP BY and ORDER BY treat it as a constant (value = 0), resulting in these errors.
But I do not know how to resolve these issues. Any suggestions would be great!
PS - The reason why 2 subqueries are being used is due to other constraints; we are trying to get ID, Label, and Counts for a column in Kanban.
Based on DRapp's comment and suggestion, the following works:
SELECT Tbl2.columnid,
Tbl2.label,
Tbl2.group1,
Tbl2.group2,
Count(Tbl2.columnid) AS Total
FROM (SELECT IIf(scanned=True, 'Y', 'N') AS Group1,
'N' AS Group2,
IIf(scanned=True, 'Y', 'N') AS Label,
IIf(scanned=True, 'Y', 'N') AS ColumnID
FROM (SELECT *,
( quantity - productqty ) AS Variance
FROM order_line
WHERE processed = false) AS Tbl1
WHERE wsid = 1 ) AS Tbl2
WHERE Tbl2.columnid = 'N'
GROUP BY Tbl2.group1,
Tbl2.group2,
Tbl2.columnid,
Tbl2.label
ORDER BY Tbl2.group1 DESC,
Tbl2.group2
Not ideal (the first subquery is generated dynamically and now needs extra handling when the group field is a bit), but it works! Still open to any other solutions.
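A hypothetical alternative (untested, and not from the original thread): instead of mapping the bit to 'Y'/'N' strings, convert it to an integer with Access's CInt() so the grouped expression is no longer a bare bit constant. Only the inner projection changes:
SELECT CInt(scanned) AS Group1,
       0 AS Group2,
       CInt(scanned) AS Label,
       CInt(scanned) AS ColumnID
FROM (SELECT *,
             ( quantity - productqty ) AS Variance
      FROM order_line
      WHERE processed = False) AS Tbl1
WHERE wsid = 1
with the outer filter becoming WHERE Tbl2.columnid = 0. Whether the ODBC driver folds CInt() into the remote query or evaluates it locally would need checking against a trace.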

SQL to split a column values into rows in Netezza

I have data in the following form in a column. The values within the column are separated by two spaces.
4EG C6CC C6DE 6MM C6LL L3BC C3
I need to split it into rows as below. I tried using REGEXP_SUBSTR to do it, but it looks like it's not in the SQL toolkit. Any suggestions?
1. 4EG
2. C6CC
3. C6DE
4. 6MM
5. C6LL
6. L3BC
7. C3
This has been answered here: http://nz2nz.blogspot.com/2016/09/netezza-transpose-delimited-string-into.html?m=1
Please note the comment at the bottom about the best-performing way of using the array functions. I have measured the use of regexp_extract_all_sp() versus repeated regex matches, and the benefit can be quite large.
The examples from nz2nz.blogspot.com are hard to follow. I was able to piece together this method:
with n_rows as ( -- update on your end
    select row_number() over (partition by 1 order by some_field) as seq_num
    from any_table_with_more_rows_than_delimited_values
)
, find_values as ( -- fake data
    select 'A' as id, '10,20,30' as orig_values
    union select 'B', '5,4,3,2,1'
)
select
    id,
    seq_num,
    orig_values,
    array_split(orig_values, ',') as array_list,
    get_value_varchar(array_list, seq_num) as value
from
    find_values
    cross join n_rows
where
    seq_num <= regexp_match_count(orig_values, ',') + 1 -- one row for each value in list
order by
    id,
    seq_num
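Adapting that pattern to the original two-space delimiter, a sketch of mine (not from the linked post); it assumes the toolkit functions array_split(), get_value_varchar(), and regexp_match_count() are installed, and that array_split() accepts a multi-character delimiter (if it does not, first collapse the double spaces to a single character with regexp_replace()):
with n_rows as (
    select row_number() over (partition by 1 order by some_field) as seq_num
    from any_table_with_more_rows_than_delimited_values
)
, find_values as (
    select '4EG  C6CC  C6DE  6MM  C6LL  L3BC  C3' as orig_values
)
select
    seq_num,
    get_value_varchar(array_split(orig_values, '  '), seq_num) as value
from
    find_values
    cross join n_rows
where
    seq_num <= regexp_match_count(orig_values, '  ') + 1 -- one row per delimited value
order by
    seq_num
which should return the seven values 4EG through C3, one per row.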

BigQuery standard SQL: how to group by an ARRAY field

My table has two columns, id and a. Column id contains a number, column a contains an array of strings. I want to count the number of unique id for a given array, equality between arrays being defined as "same size, same string for each index".
When using GROUP BY a, I get Grouping by expressions of type ARRAY is not allowed. I can use something like GROUP BY ARRAY_TO_STRING(a, ","), but then the two arrays ["a,b"] and ["a","b"] are grouped together, and I lose the "real" value of my array (so if I want to use it later in another query, I have to split the string).
The values in this field array come from the user, so I can't assume that some character is simply never going to be there (and use it as a separator).
Instead of GROUP BY ARRAY_TO_STRING(a, ",") use GROUP BY TO_JSON_STRING(a)
so your query will look like below
#standardsql
SELECT
TO_JSON_STRING(a) arr,
COUNT(DISTINCT id) cnt
FROM `project.dataset.table`
GROUP BY arr
You can test it with dummy data like below
#standardsql
WITH `project.dataset.table` AS (
SELECT 1 id, ["a,b", "c"] a UNION ALL
SELECT 1, ["a","b,c"]
)
SELECT
TO_JSON_STRING(a) arr,
COUNT(DISTINCT id) cnt
FROM `project.dataset.table`
GROUP BY arr
with this result:
Row arr cnt
1 ["a,b","c"] 1
2 ["a","b,c"] 1
Update based on @Ted's comment:
#standardsql
SELECT
ANY_VALUE(a) a,
COUNT(DISTINCT id) cnt
FROM `project.dataset.table`
GROUP BY TO_JSON_STRING(a)
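This version keeps the real ARRAY in the output; ANY_VALUE is safe here because all arrays in a group serialize to the same JSON string and are therefore equal. As a quick sketch of mine (same dummy table assumed), the grouped value can then be used as an array again downstream:
#standardsql
SELECT arr[SAFE_OFFSET(0)] AS first_elem, cnt
FROM (
  SELECT ANY_VALUE(a) AS arr, COUNT(DISTINCT id) AS cnt
  FROM `project.dataset.table`
  GROUP BY TO_JSON_STRING(a)
)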
Alternatively, you can use a separator other than a comma, e.g.
ARRAY_TO_STRING(a, "|")
though, as the question notes, user-supplied values could contain that character too.
