Snowflake count nulls in all columns

I've seen a few questions like this - Count NULL Values from multiple columns with SQL
But is there really no way to count the NULLs in a table with, say, over 30 columns, without specifying them all by name?

But is there really not a way to count nulls in a table with say, over 30 columns? Like I don't want to specify them all by name?
Yes, exactly that. I don't understand why it's so difficult; it's one line in pandas.
The key point here is that if something is not provided "batteries included", you need to write your own version. It is not as hard as it may look.
Let's say the input table is as follows:
CREATE OR REPLACE TABLE t AS SELECT $1 AS col1, $2 AS col2, $3 AS col3, $4 AS col4
FROM VALUES (1,2,3,10),(NULL,2,3,10),(NULL,NULL,4,10),(NULL,NULL,NULL,10);
SELECT * FROM t;
/*
+------+------+------+------+
| COL1 | COL2 | COL3 | COL4 |
+------+------+------+------+
| 1    | 2    | 3    | 10   |
| NULL | 2    | 3    | 10   |
| NULL | NULL | 4    | 10   |
| NULL | NULL | NULL | 10   |
+------+------+------+------+
*/
You probably already know how to write the query that gives the desired output, but since it was not provided in the question, I will use my own version:
WITH cte AS (
    SELECT
         COUNT(*) AS total_rows
        ,total_rows - COUNT(col1) AS col1
        ,total_rows - COUNT(col2) AS col2
        ,total_rows - COUNT(col3) AS col3
        ,total_rows - COUNT(col4) AS col4
    FROM t
)
SELECT COLUMN_NAME, NULLS_COLUMN_COUNT, SUM(NULLS_COLUMN_COUNT) OVER() AS NULLS_TOTAL_COUNT
FROM cte
UNPIVOT (NULLS_COLUMN_COUNT FOR COLUMN_NAME IN (col1, col2, col3, col4))
ORDER BY COLUMN_NAME;
/*
+-------------+--------------------+-------------------+
| COLUMN_NAME | NULLS_COLUMN_COUNT | NULLS_TOTAL_COUNT |
+-------------+--------------------+-------------------+
| COL1        | 3                  | 6                 |
| COL2        | 2                  | 6                 |
| COL3        | 1                  | 6                 |
| COL4        | 0                  | 6                 |
+-------------+--------------------+-------------------+
*/
Here we can see that the query is "static" in nature, with only a few moving parts (column_count_list / table_name / column_list):
WITH cte AS (
    SELECT
         COUNT(*) AS total_rows
         <column_count_list>
    FROM <table_name>
)
SELECT COLUMN_NAME, NULLS_COLUMN_COUNT, SUM(NULLS_COLUMN_COUNT) OVER() AS NULLS_TOTAL_COUNT
FROM cte
UNPIVOT (NULLS_COLUMN_COUNT FOR COLUMN_NAME IN (<column_list>))
ORDER BY COLUMN_NAME;
Now using the metadata and variables:
-- input
SET sch_name = 'my_schema';
SET tab_name = 't';

SELECT
     LISTAGG(c.COLUMN_NAME, ', ') WITHIN GROUP(ORDER BY c.COLUMN_NAME) AS column_list
    ,ANY_VALUE(c.TABLE_SCHEMA || '.' || c.TABLE_NAME) AS full_table_name
    ,LISTAGG(REPLACE(SPACE(6) || ',total_rows - COUNT(<col_name>) AS <col_name>'
                     || CHAR(13)
                     , '<col_name>', c.COLUMN_NAME), '')
         WITHIN GROUP(ORDER BY COLUMN_NAME) AS column_count_list
    ,REPLACE(REPLACE(REPLACE(
'WITH cte AS (
    SELECT
         COUNT(*) AS total_rows
         <column_count_list>
    FROM <table_name>
)
SELECT COLUMN_NAME, NULLS_COLUMN_COUNT, SUM(NULLS_COLUMN_COUNT) OVER() AS NULLS_TOTAL_COUNT
FROM cte
UNPIVOT (NULLS_COLUMN_COUNT FOR COLUMN_NAME IN (<column_list>))
ORDER BY COLUMN_NAME;'
     ,'<column_count_list>', column_count_list)
     ,'<table_name>', full_table_name)
     ,'<column_list>', column_list) AS query_to_run
FROM INFORMATION_SCHEMA.COLUMNS c
WHERE TABLE_SCHEMA = UPPER($sch_name)
  AND TABLE_NAME = UPPER($tab_name);
Running the code above will generate the query to be run. Copying that generated query and executing it gives the desired output. This template could be further refined and wrapped in a stored procedure if needed (I will leave that as an exercise).
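That said, for reference, here is a minimal Snowflake Scripting sketch of such a wrapper. It is my own sketch, not part of the original answer: the procedure and variable names are made up, and it assumes the session's current database contains the target schema.
CREATE OR REPLACE PROCEDURE count_nulls(sch_name VARCHAR, tab_name VARCHAR)
RETURNS TABLE(column_name VARCHAR, nulls_column_count NUMBER, nulls_total_count NUMBER)
LANGUAGE SQL
AS
$$
DECLARE
    col_list   VARCHAR;
    count_list VARCHAR;
    query_text VARCHAR;
    res        RESULTSET;
BEGIN
    -- Build the column list and the per-column count expressions from the metadata
    SELECT LISTAGG(column_name, ', ') WITHIN GROUP (ORDER BY column_name),
           LISTAGG(',total_rows - COUNT(' || column_name || ') AS ' || column_name, ' ')
               WITHIN GROUP (ORDER BY column_name)
      INTO :col_list, :count_list
      FROM INFORMATION_SCHEMA.COLUMNS
     WHERE table_schema = UPPER(:sch_name)
       AND table_name   = UPPER(:tab_name);

    -- Assemble the same query as the template above and run it dynamically
    query_text := 'WITH cte AS (SELECT COUNT(*) AS total_rows ' || count_list ||
                  ' FROM ' || sch_name || '.' || tab_name || ') ' ||
                  'SELECT column_name, nulls_column_count, ' ||
                  'SUM(nulls_column_count) OVER() AS nulls_total_count ' ||
                  'FROM cte UNPIVOT (nulls_column_count FOR column_name IN (' || col_list || ')) ' ||
                  'ORDER BY column_name';
    res := (EXECUTE IMMEDIATE :query_text);
    RETURN TABLE(res);
END;
$$;

-- Hypothetical usage:
CALL count_nulls('my_schema', 't');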

@chris you should note that the metadata in Snowflake is similar to SQL Server's, so anything you want to know at the metadata level has most likely already been solved by SQL Server practitioners.
See this link - Count number of NULL values in each column in SQL
This is different in Oracle, where the metadata tables also give the number of NULLs in each column as well as the density.


SQL Server 2017 - get column name, datatype and value of table

I thought it was a simple task, but after a couple of hours I'm still struggling :-(
I want to get the list of column names of a table, together with their data types and the values contained in the columns, but I have no idea how to bind the table itself to get the current values:
DECLARE @TTab TABLE
(
    fieldName nvarchar(128),
    dataType nvarchar(64),
    currentValue nvarchar(128)
)

INSERT INTO @TTab (fieldName, dataType)
SELECT
    i.COLUMN_NAME,
    i.DATA_TYPE
FROM
    INFORMATION_SCHEMA.COLUMNS i
WHERE
    i.TABLE_NAME = 'Users'
Expected result:
+-----------+----------+--------------+
| fieldName | dataType | currentValue |
+-----------+----------+--------------+
| userName  | nvarchar | John         |
| active    | bit      | true         |
| age       | int      | 43           |
| balance   | money    | 25.20        |
+-----------+----------+--------------+
In general the answer is: No, this is impossible. But there is a hack using text-based containers like XML or JSON (v2016+):
--Let's create a test table with some rows
CREATE TABLE dbo.TestGetMetaData(ID INT IDENTITY,PreName VARCHAR(100),LastName NVARCHAR(MAX),DOB DATE);
INSERT INTO dbo.TestGetMetaData(PreName,LastName,DOB) VALUES
('Tim','Smith','20000101')
,('Tom','Blake','20000202')
,('Kim','Black','20000303')
GO
--Here's the query
SELECT C.colName
,C.colValue
,D.*
FROM
(
SELECT t.* FROM dbo.TestGetMetaData t
WHERE t.Id=2
FOR XML PATH(''),TYPE
) A(rowSet)
CROSS APPLY A.rowSet.nodes('*') B(col)
CROSS APPLY(VALUES(B.col.value('local-name(.)','nvarchar(500)')
,B.col.value('text()[1]', 'nvarchar(max)'))) C(colName,colValue)
LEFT JOIN INFORMATION_SCHEMA.COLUMNS D ON D.TABLE_SCHEMA='dbo'
AND D.TABLE_NAME='TestGetMetaData'
AND D.COLUMN_NAME=C.colName;
GO
--Clean-Up (careful with real data)
DROP TABLE dbo.TestGetMetaData;
GO
Part of the result
+----------+------------+-----------+--------------------------+-------------+
| colName  | colValue   | DATA_TYPE | CHARACTER_MAXIMUM_LENGTH | IS_NULLABLE |
+----------+------------+-----------+--------------------------+-------------+
| ID       | 2          | int       | NULL                     | NO          |
| PreName  | Tom        | varchar   | 100                      | YES         |
| LastName | Blake      | nvarchar  | -1                       | YES         |
| DOB      | 2000-02-02 | date      | NULL                     | YES         |
+----------+------------+-----------+--------------------------+-------------+
The idea in short:
Using FOR XML PATH(''),TYPE will create an XML representing your SELECT's result set.
The big advantage with this: the XML's elements will carry the column names.
We can use CROSS APPLY to get each column's name and value.
Now we can JOIN the metadata from INFORMATION_SCHEMA.COLUMNS.
One hint: all values will actually be of type nvarchar(max).
The values being strings might lead to unexpected results due to implicit conversions, or to trouble with BLOBs.
UPDATE
The following query wouldn't even need to specify the table's name in the JOIN:
SELECT C.colName
,C.colValue
,D.DATA_TYPE,D.CHARACTER_MAXIMUM_LENGTH,IS_NULLABLE
FROM
(
SELECT * FROM dbo.TestGetMetaData
WHERE Id=2
FOR XML AUTO,TYPE
) A(rowSet)
CROSS APPLY A.rowSet.nodes('/*/@*') B(attr)
CROSS APPLY(VALUES(A.rowSet.value('local-name(/*[1])','nvarchar(500)')
,B.attr.value('local-name(.)','nvarchar(500)')
,B.attr.value('.', 'nvarchar(max)'))) C(tblName,colName,colValue)
LEFT JOIN INFORMATION_SCHEMA.COLUMNS D ON CONCAT(D.TABLE_SCHEMA,'.',D.TABLE_NAME)=C.tblName
AND D.COLUMN_NAME=C.colName;
Why?
Using FOR XML AUTO will produce attribute-centric XML. The element's name will be the table's name, while the values sit in attributes.
UPDATE 2
Fully generic function:
CREATE FUNCTION dbo.GetRowWithMetaData(@input XML)
RETURNS TABLE
AS
RETURN
SELECT C.colName
      ,C.colValue
      ,D.*
FROM @input.nodes('/*/@*') B(attr)
CROSS APPLY(VALUES(@input.value('local-name(/*[1])','nvarchar(500)')
                  ,B.attr.value('local-name(.)','nvarchar(500)')
                  ,B.attr.value('.', 'nvarchar(max)'))) C(tblName,colName,colValue)
LEFT JOIN INFORMATION_SCHEMA.COLUMNS D ON CONCAT(D.TABLE_SCHEMA,'.',D.TABLE_NAME)=C.tblName
                                      AND D.COLUMN_NAME=C.colName;
--You call it like this (note the extra parentheses!)
SELECT * FROM dbo.GetRowWithMetaData((SELECT * FROM dbo.TestGetMetaData WHERE ID=2 FOR XML AUTO));
As you can see, the function does not even have to know anything in advance...
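Since the answer mentions JSON as an alternative container, here is a minimal sketch of the same idea with FOR JSON / OPENJSON (SQL Server 2016+). This is my own sketch, not part of the original answer; note that FOR JSON skips NULL columns by default and every value comes back as a string.
--Build a JSON object for the row, then shred it back into key/value pairs
DECLARE @json NVARCHAR(MAX) =
    (SELECT t.* FROM dbo.TestGetMetaData t WHERE t.ID = 2
     FOR JSON PATH, WITHOUT_ARRAY_WRAPPER);

SELECT j.[key]   AS colName
      ,j.[value] AS colValue
      ,D.DATA_TYPE, D.CHARACTER_MAXIMUM_LENGTH, D.IS_NULLABLE
FROM OPENJSON(@json) j
LEFT JOIN INFORMATION_SCHEMA.COLUMNS D
       ON D.TABLE_SCHEMA = 'dbo'
      AND D.TABLE_NAME   = 'TestGetMetaData'
      AND D.COLUMN_NAME  = j.[key];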

How to find distinct values of string columns in Hive?

I have a comma-separated column (string) with duplicate values. I want to remove the duplicates:
e.g.
column_name
-----------------
gun,gun,man,gun,man
shuttle,enemy,enemy,run
hit,chase
I want result like:
column_name
----------------
gun,man
shuttle,enemy,run
hit,chase
I am using a Hive database.
Option 1: keep last occurrence
This will keep the last occurrence of every word.
E.g. 'hello,world,hello,world,hello' will result in 'world,hello'
select regexp_replace
(
column_name
,'(?<=^|,)(?<word>.*?),(?=.*(?<=,)\\k<word>(?=,|$))'
,''
)
from mytable
;
+-------------------+
| gun,man           |
| shuttle,enemy,run |
| hit,chase         |
+-------------------+
Option 2: keep first occurrence
This will keep the first occurrence of every word.
E.g. 'hello,world,hello,world,hello' will result in 'hello,world'
select reverse
(
regexp_replace
(
reverse(column_name)
,'(?<=^|,)(?<word>.*?),(?=.*(?<=,)\\k<word>(?=,|$))'
,''
)
)
from mytable
;
Option 3: sorted
E.g. 'Cherry,Apple,Cherry,Cherry,Cherry,Banana,Apple' will result in 'Apple,Banana,Cherry'
select regexp_replace
(
concat_ws(',',sort_array(split(column_name,',')))
,'(?<=^|,)(?<word>.*?)(,\\k<word>(?=,|$))+'
,'${word}'
)
from mytable
;
If the order of the values is not a concern:
with mytable as (
select 'gun,gun,man,gun,man' as column_name union
select 'shuttle,enemy,enemy,run' as column_name union
select 'hit,chase' as column_name
) -- test data
SELECT column_name, concat_ws(',',collect_set(item)) from (
select distinct column_name, s.item from mytable
lateral view explode(split(column_name,',')) s as item
) t
group by column_name
;
+--------------------------+--------------------+
| column_name              | _c1                |
+--------------------------+--------------------+
| gun,gun,man,gun,man      | gun,man            |
| hit,chase                | chase,hit          |
| shuttle,enemy,enemy,run  | enemy,run,shuttle  |
+--------------------------+--------------------+
If you want to keep the values in their original order:
with mytable as (
select 'gun,gun,man,gun,man' as column_name union
select 'shuttle,enemy,enemy,run' as column_name union
select 'hit,chase' as column_name
) -- test data
select column_name,concat_ws(',',collect_set(item)) as column_name_distincted
from (
select column_name,item, min(pos) as pos
from (
select column_name,pos,item
from mytable
lateral view posexplode(split(column_name,',')) s as pos,item
) t
group by column_name,item
order by column_name,pos
) t
group by column_name
;
+--------------------------+-------------------------+
| column_name              | column_name_distincted  |
+--------------------------+-------------------------+
| gun,gun,man,gun,man      | gun,man                 |
| hit,chase                | hit,chase               |
| shuttle,enemy,enemy,run  | shuttle,enemy,run       |
+--------------------------+-------------------------+
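If the goal is to persist the de-duplicated strings rather than just select them, the table can be overwritten in place. This is only a sketch using Option 1's expression; it assumes the table really is named mytable and contains only this single column.
-- Overwrite the table with the de-duplicated values (keeps the last occurrence of each word)
INSERT OVERWRITE TABLE mytable
SELECT regexp_replace
       (
           column_name
          ,'(?<=^|,)(?<word>.*?),(?=.*(?<=,)\\k<word>(?=,|$))'
          ,''
       ) AS column_name
FROM mytable;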

SQL Server to combine multi rows into single row where col0=col1

My table:
| PreviousID | CurrentID | Data   |
|------------|-----------|--------|
| 1          | 2         | Data 1 |
| 2          | 3         | Data 2 |
| 3          | 4         | Data 3 |
| 4          | 5         | Data 4 |
The result I am looking for:
Select .... where PreviousID = 1:
| Col0 | Col1 | Col2   | Col3 | Col4   | Col5 | Col6   | Col7 | Col8   |
| 1    | 2    | Data 1 | 3    | Data 2 | 4    | Data 3 | 5    | Data 4 |
Select ..... where PreviousID = 2:
| Col0 | Col1 | Col2   | Col3 | Col4   | Col5 | Col6   |
| 2    | 3    | Data 2 | 4    | Data 3 | 5    | Data 4 |
I tried to create a SQL Server query to get this result with no luck. Please help me out.
We can do this in a few steps:
Declare and set a variable to use for our root node, and create a temporary table to store the results from our recursive query
Insert the results from the recursive query into the temporary table
Generate and execute dynamic sql to pivot() the temporary table
(alternate) Generate and execute dynamic sql to use conditional aggregation instead of pivot()
rextester demo: http://rextester.com/MRFZC75180
test setup:
create table t (PreviousID int, CurrentID int, Data varchar(32));
insert into t values
(1,2,'Data 1'),(2,3,'Data 2'),(3,4,'Data 3'),(4,5,'Data 4');
Declare and set a variable to use for our root node, and create a temporary table to store the results from our recursive query:
declare @PreviousId int = 2;
create table #temp (PreviousID int
    , Level int
    , Col varchar(32)
    , Value varchar(32)
    , rn int
);
Insert the results from the recursive query into the temporary table
;with cte as (
select PreviousID, CurrentID, Data, level = 0
from t
where previousId = @PreviousId
union all
select c.PreviousID, c.CurrentID, c.Data, level = p.level +1
from t c
inner join cte as p
on c.PreviousID = p.CurrentID
)
insert into #temp
select p.PreviousId, t.level, x.col, x.value
, rn = row_number() over (order by t.level, x.col)
from cte t
cross apply (
select top 1
PreviousId
from cte i
order by level
) as p (PreviousId)
cross apply (
values ('CurrentId',convert(varchar(32),CurrentId)),('Data',Data)
) as x (col,value);
results so far:
+------------+-------+-----------+--------+----+
| PreviousID | Level | Col | Value | rn |
+------------+-------+-----------+--------+----+
| 2          | 0     | CurrentId | 3      | 1  |
| 2          | 0     | Data      | Data 2 | 2  |
| 2          | 1     | CurrentId | 4      | 3  |
| 2          | 1     | Data      | Data 3 | 4  |
| 2          | 2     | CurrentId | 5      | 5  |
| 2          | 2     | Data      | Data 4 | 6  |
+------------+-------+-----------+--------+----+
Generate and execute dynamic sql to pivot() the temporary table.
/* pivot */
declare @cols nvarchar(max);
declare @sql nvarchar(max);
select @cols = stuff((
select
', Col'+convert(nvarchar(10),rn)
from #temp
order by 1
for xml path (''), type).value('.','nvarchar(max)')
,1,1,'')
select @sql ='
select Col0=PreviousID, ' + @cols +'
from (
select PreviousID, Value, rn= ''Col''+convert(nvarchar(10),rn)
from #temp
) as t
pivot (max([Value]) for [rn] in (' + @cols +')) p'
select @sql as CodeGenerated;
exec sp_executesql @sql;
code generated:
select Col0=PreviousID, Col1, Col2, Col3, Col4, Col5, Col6
from (
select PreviousID, Value, rn= 'Col'+convert(nvarchar(10),rn)
from #temp
) as t
pivot (max([Value]) for [rn] in ( Col1, Col2, Col3, Col4, Col5, Col6)) p
returns:
+------+------+--------+------+--------+------+--------+
| Col0 | Col1 | Col2   | Col3 | Col4   | Col5 | Col6   |
+------+------+--------+------+--------+------+--------+
| 2    | 3    | Data 2 | 4    | Data 3 | 5    | Data 4 |
+------+------+--------+------+--------+------+--------+
(alternate) Generate and execute dynamic sql to use conditional aggregation instead of pivot():
/* conditional aggregation */
--declare @cols nvarchar(max);
--declare @sql nvarchar(max);
select @cols = stuff((
select
char(10)+' , '
+ 'Col'+convert(nvarchar(10),rn)
+' = max(case when rn = '+convert(nvarchar(10),rn)+' then Value end)'
from #temp
order by 1
for xml path (''), type).value('.','nvarchar(max)')
,1,0,'')
select @sql ='
select Col0 = PreviousID'+@cols+'
from #temp
group by PreviousID'
select @sql as CodeGenerated;
exec sp_executesql @sql;
code generated:
select Col0 = PreviousID
, Col1 = max(case when rn = 1 then Value end)
, Col2 = max(case when rn = 2 then Value end)
, Col3 = max(case when rn = 3 then Value end)
, Col4 = max(case when rn = 4 then Value end)
, Col5 = max(case when rn = 5 then Value end)
, Col6 = max(case when rn = 6 then Value end)
from #temp
group by PreviousID
returns:
+------+------+--------+------+--------+------+--------+
| Col0 | Col1 | Col2   | Col3 | Col4   | Col5 | Col6   |
+------+------+--------+------+--------+------+--------+
| 2    | 3    | Data 2 | 4    | Data 3 | 5    | Data 4 |
+------+------+--------+------+--------+------+--------+
I think you could try concatenating the columns and giving them aliases.
For a better understanding, go through this link: https://www.mssqltips.com/sqlservertip/2985/concatenate-sql-server-columns-into-a-string-with-concat/
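To illustrate that idea with the sample data (a rough sketch only: unlike the recursive approach above, a plain self-join like this only follows one hop of the PreviousID/CurrentID chain):
SELECT t1.PreviousID AS Col0
     , CONCAT(t1.CurrentID, ' | ', t1.Data, ' | ', t2.CurrentID, ' | ', t2.Data) AS Combined
FROM t t1
LEFT JOIN t t2
    ON t2.PreviousID = t1.CurrentID
WHERE t1.PreviousID = 2;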

SQL Server making rows into columns

I'm trying to take three tables that I have and show the data in the way the user asked for. The tables look like this. (I should add that I am using MS SQL Server.)
First Table: The ID is varchar, since it's an ID they use to identify assets and they use numbers as well as letters.
aID | status | group  |
------------------------
1   | acti   | group1 |
2   | inac   | group2 |
A3  | acti   | group1 |
Second Table: This table is fixed. It has around 20 values and the IDs are all numbers
atID | traitname |
-------------------
1    | trait1    |
2    | trait2    |
3    | trait3    |
Third Table: This table is used to identify the traits the assets in the first table have. The fields that have the same name as fields in the above tables are obviously linked.
tID | aID | atID | trait |
---------------------------
1   | 1   | 1    | NAME  |
2   | 1   | 2    | INFO  |
3   | 2   | 3    | GOES  |
4   | 2   | 1    | HERE  |
5   | A3  | 2    | HAHA  |
Now, the user wants the program to output the data in the following format:
aID | status | group  | trait1 | trait2 | trait3 |
---------------------------------------------------
1   | acti   | group1 | NAME   | INFO   | NULL   |
2   | inac   | group2 | HERE   | NULL   | GOES   |
A3  | acti   | group1 | NULL   | HAHA   | NULL   |
I understand that to achieve this I have to use the PIVOT command in SQL. However, I've read about it and tried to understand it, but I just can't seem to get it, especially the part where it asks for a MAX value. I don't get why I need that MAX.
Also, the examples I've seen are for one table. I'm not sure if I can do it with three tables. I do have a query that joins all three of them with the information I need. However, I don't know how to proceed from there. Please, any help with this will be appreciated. Thank you.
There are several ways that you can get the result, including using the PIVOT function.
You can use an aggregate function with a CASE expression:
select t1.aid, t1.status, t1.[group],
max(case when t2.traitname = 'trait1' then t3.trait end) trait1,
max(case when t2.traitname = 'trait2' then t3.trait end) trait2,
max(case when t2.traitname = 'trait3' then t3.trait end) trait3
from table1 t1
inner join table3 t3
on t1.aid = t3.aid
inner join table2 t2
on t3.atid = t2.atid
group by t1.aid, t1.status, t1.[group];
See SQL Fiddle with Demo
The PIVOT function requires an aggregate function; this is why you need to use either MIN or MAX (since you have a string value). Because each aID has at most one row per traitname, the MAX simply returns that single trait value.
If you have a limited number of traitnames then you could hard-code the query:
select aid, status, [group],
trait1, trait2, trait3
from
(
select t1.aid,
t1.status,
t1.[group],
t2.traitname,
t3.trait
from table1 t1
inner join table3 t3
on t1.aid = t3.aid
inner join table2 t2
on t3.atid = t2.atid
) d
pivot
(
max(trait)
for traitname in (trait1, trait2, trait3)
) piv;
See SQL Fiddle with Demo.
If you have an unknown number of values, then you will want to look at using dynamic SQL to get the final result:
DECLARE @cols AS NVARCHAR(MAX),
@query AS NVARCHAR(MAX)
select @cols = STUFF((SELECT distinct ',' + QUOTENAME(traitname)
from Table2
FOR XML PATH(''), TYPE
).value('.', 'NVARCHAR(MAX)')
,1,1,'')
set @query = 'SELECT aid, status, [group],' + @cols + '
from
(
select t1.aid,
t1.status,
t1.[group],
t2.traitname,
t3.trait
from table1 t1
inner join table3 t3
on t1.aid = t3.aid
inner join table2 t2
on t3.atid = t2.atid
) x
pivot
(
max(trait)
for traitname in (' + @cols + ')
) p '
execute sp_executesql @query;
See SQL Fiddle with Demo
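For reference, with the three sample traitnames the dynamic statement above would generate roughly the following (a sketch; QUOTENAME brackets the names and the actual column order follows the DISTINCT output):
SELECT aid, status, [group], [trait1], [trait2], [trait3]
from
(
  select t1.aid,
    t1.status,
    t1.[group],
    t2.traitname,
    t3.trait
  from table1 t1
  inner join table3 t3
    on t1.aid = t3.aid
  inner join table2 t2
    on t3.atid = t2.atid
) x
pivot
(
  max(trait)
  for traitname in ([trait1], [trait2], [trait3])
) p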

TSQL - Merge two tables

I have the following task: I have two single-column tables in a procedure, and both of them have the same number of rows. I'd like to "merge" them so I get a resulting table with 2 columns. Is there some easy way to do this?
In the worst case I could try to add a primary key and use INSERT INTO ... SELECT with a JOIN, but that requires quite big changes in the code I already have, so I decided to ask you guys.
Just to explain my answer below, here's an example. I have the following tables:
tableA
col1
----
1
2
3
4
tableB
col2
----
a
b
c
d
Resulting table:
col1 | col2
-----------
1    | a
2    | b
3    | c
4    | d
You can do this:
SELECT t1.col1, t2.col2
INTO NewTable
FROM
(
    SELECT col1, ROW_NUMBER() OVER(ORDER BY (SELECT 1)) AS RN
    FROM tableA
) AS t1
INNER JOIN
(
    SELECT col2, ROW_NUMBER() OVER(ORDER BY (SELECT 1)) AS RN
    FROM tableB
) AS t2 ON t1.RN = t2.RN
This will create a brand new table NewTable with the two columns from the two tables:
| COL1 | COL2 |
---------------
| 1    | a    |
| 2    | b    |
| 3    | c    |
| 4    | d    |
See it in action here:
SQL Fiddle Demo.
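One caveat worth adding: ORDER BY (SELECT 1) pairs the rows in an arbitrary, non-guaranteed order. If each table has a column that defines the intended order, use it inside ROW_NUMBER() instead; a sketch, where sort_col is a hypothetical ordering column:
SELECT t1.col1, t2.col2
INTO NewTableOrdered
FROM
(
    SELECT col1, ROW_NUMBER() OVER(ORDER BY sort_col) AS RN
    FROM tableA
) AS t1
INNER JOIN
(
    SELECT col2, ROW_NUMBER() OVER(ORDER BY sort_col) AS RN
    FROM tableB
) AS t2 ON t1.RN = t2.RN;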
