Group by. Ignore value in group clause - sql-server

I have a complex SQL query with several joins and subqueries. I want to know if there's a simple way to generate several groups using a GROUP BY clause for the same value of the grouping fields, when these fields have some special value.
For example, if I'm grouping by fieldA and fieldB, I get groups for different values of fieldA and fieldB, but when fieldA takes a special constant value like "specialValue", I want a separate group for each record.
For example, if I have these records:
fieldA | fieldB | fieldC
_______________________
val1 | val2 | 1
val1 | val2 | 2
val2 | valx | 3
val2 | valx | 4
specialValue | vala | 5
specialValue | vala | 6
Selecting (fieldA, fieldB, max(fieldC)) and grouping by (fieldA, fieldB), but ignoring "specialValue" in fieldA, I would obtain the following results:
fieldA | fieldB | fieldC
_______________________
val1 | val2 | 2
val2 | valx | 4
specialValue | vala | 5 <-- Two rows
specialValue | vala | 6 <--
I want to do this in the simplest possible way, if possible without joins or subqueries, because the query is already too complex.
Thanks

What about
SELECT fieldA, fieldB, MAX(fieldC)
FROM table
WHERE fieldA <> 'specialValue'
GROUP BY fieldA, fieldB
UNION ALL
SELECT fieldA, fieldB, fieldC
FROM table
WHERE fieldA = 'specialValue'
?
Or alternatively with a subselect, although the availability of ROW_NUMBER and the OVER clause will depend on your version of SQL Server [1].
SELECT fieldA, fieldB, MAX(fieldC)
FROM (
SELECT fieldA, fieldB, fieldC,
case when fieldA = 'specialValue' then
ROW_NUMBER() over (ORDER BY fieldA)
end as rownum
FROM t
) as subselect
GROUP BY rownum, fieldA, fieldB
[1] 2008 to present - source
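To check the UNION ALL idea against the sample rows from the question, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption made purely so the example is self-contained), and the table name t is taken from the second query above.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (fieldA TEXT, fieldB TEXT, fieldC INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("val1", "val2", 1),
        ("val1", "val2", 2),
        ("val2", "valx", 3),
        ("val2", "valx", 4),
        ("specialValue", "vala", 5),
        ("specialValue", "vala", 6),
    ],
)

# Grouped rows for ordinary values, plus every 'specialValue' row ungrouped.
rows = conn.execute(
    """
    SELECT fieldA, fieldB, MAX(fieldC) AS fieldC
    FROM t
    WHERE fieldA <> 'specialValue'
    GROUP BY fieldA, fieldB
    UNION ALL
    SELECT fieldA, fieldB, fieldC
    FROM t
    WHERE fieldA = 'specialValue'
    """
).fetchall()

# Sort in Python so the result does not depend on the engine's row order.
union_rows = sorted(rows, key=lambda r: r[2])
for row in union_rows:
    print(row)
```

The two 'specialValue' rows come back separately, while the other rows collapse to one row per (fieldA, fieldB) pair.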

You may want to look into using cube and/or rollup.

Does the table have a primary key column that is different from these? You could do:
group by case when fieldA = 'specialValue' then primarykey else fieldA end
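A rough sketch of the primary-key idea, again using Python's sqlite3 as a stand-in for SQL Server. The id column is hypothetical (the question does not say whether a key exists), and this variant keeps fieldA and fieldB in the GROUP BY and adds the CASE expression as an extra grouping term, so ordinary rows get NULL there and still group normally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# 'id' is a hypothetical primary key; the question does not name one.
conn.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY, fieldA TEXT, fieldB TEXT, fieldC INTEGER)"
)
conn.executemany(
    "INSERT INTO t (fieldA, fieldB, fieldC) VALUES (?, ?, ?)",
    [
        ("val1", "val2", 1),
        ("val1", "val2", 2),
        ("val2", "valx", 3),
        ("val2", "valx", 4),
        ("specialValue", "vala", 5),
        ("specialValue", "vala", 6),
    ],
)

# The CASE yields the row's key only for 'specialValue' rows, so each of
# those rows forms its own group; all other rows share NULL in that term.
rows = conn.execute(
    """
    SELECT fieldA, fieldB, MAX(fieldC) AS fieldC
    FROM t
    GROUP BY fieldA, fieldB,
             CASE WHEN fieldA = 'specialValue' THEN id END
    """
).fetchall()

case_rows = sorted(rows, key=lambda r: r[2])
for row in case_rows:
    print(row)
```

This needs no join or subquery, which is what the question asked for, but it only works if some unique column is available.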

Related

Snowflake count nulls in all columns

I've seen a few questions like this - Count NULL Values from multiple columns with SQL
But is there really not a way to count nulls in a table with say, over 30 columns? Like I don't want to specify them all by name?
yes exactly that. I don't understand why it's so difficult - it's like 1 line in pandas?
The key point here is that if something is not provided "batteries included", you need to write your own version. It is not as hard as it may look.
Let's say the input table is as follows:
CREATE OR REPLACE TABLE t AS SELECT $1 AS col1, $2 AS col2, $3 AS col3, $4 AS col4
FROM VALUES (1,2,3,10),(NULL,2,3,10),(NULL,NULL,4,10),(NULL,NULL,NULL,10);
SELECT * FROM t;
/*
+------+------+------+------+
| COL1 | COL2 | COL3 | COL4 |
+------+------+------+------+
| 1 | 2 | 3 | 10 |
| NULL | 2 | 3 | 10 |
| NULL | NULL | 4 | 10 |
| NULL | NULL | NULL | 10 |
+------+------+------+------+
*/
You probably know how to write the query that gives the desired output, but as it was not provided in the question I will use my own version:
WITH cte AS (
SELECT
COUNT(*) AS total_rows
,total_rows - COUNT(col1) AS col1
,total_rows - COUNT(col2) AS col2
,total_rows - COUNT(col3) AS col3
,total_rows - COUNT(col4) AS col4
FROM t
)
SELECT COLUMN_NAME, NULLS_COLUMN_COUNT,SUM(NULLS_COLUMN_COUNT) OVER() AS NULLS_TOTAL_COUNT
FROM cte
UNPIVOT (NULLS_COLUMN_COUNT FOR COLUMN_NAME IN (col1,col2,col3, col4))
ORDER BY COLUMN_NAME;
/*
+-------------+--------------------+-------------------+
| COLUMN_NAME | NULLS_COLUMN_COUNT | NULLS_TOTAL_COUNT |
+-------------+--------------------+-------------------+
| COL1 | 3 | 6 |
| COL2 | 2 | 6 |
| COL3 | 1 | 6 |
| COL4 | 0 | 6 |
+-------------+--------------------+-------------------+
*/
Here we can see that the query is "static" in nature, with a few moving parts (column_count_list/table_name/column_list):
WITH cte AS (
SELECT
COUNT(*) AS total_rows
<column_count_list>
FROM <table_name>
)
SELECT COLUMN_NAME, NULLS_COLUMN_COUNT,SUM(NULLS_COLUMN_COUNT) OVER() AS NULLS_TOTAL_COUNT
FROM cte
UNPIVOT (NULLS_COLUMN_COUNT FOR COLUMN_NAME IN (<column_list>))
ORDER BY COLUMN_NAME;
Now using the metadata and variables:
-- input
SET sch_name = 'my_schema';
SET tab_name = 't';
SELECT
LISTAGG(c.COLUMN_NAME, ', ') WITHIN GROUP(ORDER BY c.COLUMN_NAME) AS column_list
,ANY_VALUE(c.TABLE_SCHEMA || '.' || c.TABLE_NAME) AS full_table_name
,LISTAGG(REPLACE(SPACE(6) || ',total_rows - COUNT(<col_name>) AS <col_name>'
|| CHAR(13)
, '<col_name>', c.COLUMN_NAME), '')
WITHIN GROUP(ORDER BY COLUMN_NAME) AS column_count_list
,REPLACE(REPLACE(REPLACE(
'WITH cte AS (
SELECT
COUNT(*) AS total_rows
<column_count_list>
FROM <table_name>
)
SELECT COLUMN_NAME, NULLS_COLUMN_COUNT,SUM(NULLS_COLUMN_COUNT) OVER() AS NULLS_TOTAL_COUNT
FROM cte
UNPIVOT (NULLS_COLUMN_COUNT FOR COLUMN_NAME IN (<column_list>))
ORDER BY COLUMN_NAME;'
,'<column_count_list>', column_count_list)
,'<table_name>', full_table_name)
,'<column_list>', column_list) AS query_to_run
FROM INFORMATION_SCHEMA.COLUMNS c
WHERE TABLE_SCHEMA = UPPER($sch_name)
AND TABLE_NAME = UPPER($tab_name);
Running the code will generate the query to be run.
Copying the output and running it will give the result. This template could be further refined and wrapped in a stored procedure if needed (but I will leave that as an exercise).
@chris you should note that the metadata in Snowflake is similar to SQL Server. So anything you want to know at metadata level would already have been solved by SQL Server practitioners.
See this link - Count number of NULL values in each column in SQL
This is different in Oracle where the metadata table gives the number of nulls in each column as well as density.
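The same "read the column list from metadata, then build the query" idea can be sketched outside Snowflake. The example below is only an illustration using Python's sqlite3: PRAGMA table_info stands in for INFORMATION_SCHEMA.COLUMNS, and null_counts is a hypothetical helper name, not something from any of the answers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 INTEGER, col2 INTEGER, col3 INTEGER, col4 INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [(1, 2, 3, 10), (None, 2, 3, 10), (None, None, 4, 10), (None, None, None, 10)],
)


def null_counts(conn, table_name):
    """Return {column_name: number_of_NULLs} for every column of table_name.

    table_name is interpolated into the SQL text, so it must come from a
    trusted source (identifiers cannot be bound as parameters).
    """
    # Each row of PRAGMA table_info describes one column; index 1 is its name.
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table_name}")')]
    # COUNT(col) skips NULLs, so COUNT(*) - COUNT(col) is the NULL count.
    select_list = ", ".join(f'COUNT(*) - COUNT("{c}")' for c in columns)
    counts = conn.execute(f'SELECT {select_list} FROM "{table_name}"').fetchone()
    return dict(zip(columns, counts))


result = null_counts(conn, "t")
print(result)
print("total:", sum(result.values()))
```

No column is named by hand, so the same helper works for a table with 30 or more columns.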

Remove Almost Duplicate Rows in SQL

I have found a lot of examples online of how to remove duplicate rows in a SQL table but I cannot figure out how to remove almost duplicate rows.
Data Example
+--------+----------+--------+
| Col1 | Col2 | NumCol |
+--------+----------+--------+
| USA | Organic | 300 |
| USA | Organic | 400 |
| Canada | Referral | 120 |
| Canada | Referral | 120 |
+--------+----------+--------+
Desired Output
+--------+----------+--------+
| Col1 | Col2 | NumCol |
+--------+----------+--------+
| USA | Organic | 400 |
| Canada | Referral | 120 |
+--------+----------+--------+
In this example, if 2 rows are identical then I would like one of them to be removed. In addition, if 2 rows match based on Col1 and Col2, then I would like the row with the lesser value in NumCol to be removed.
My SQL Server Express code is:
WITH CTE AS(
SELECT [Col1]
,[Col2]
,[NumCol]
, RN = ROW_NUMBER()OVER(PARTITION BY [Col1]
,[Col2]
,[NumCol] ORDER BY [Col1])
FROM table
)
DELETE FROM CTE WHERE RN > 1
This code does a good job of deleting duplicates but it doesn't get rid of rows where only Col1 and Col2 match but not NumCol. How should I approach something like this? I'm a newbie to SQL, so any explanation in layman's terms is appreciated!
You can let the row numbers restart per (Col1, Col2) pair by changing:
RN = ROW_NUMBER()OVER(PARTITION BY [Col1]
,[Col2]
,[NumCol] ORDER BY [Col1])
To:
RN = ROW_NUMBER() OVER(
PARTITION BY Col1, Col2
ORDER BY NumCol desc)
The order by NumCol desc makes sure that the rows with the lower NumCol are removed.
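Here is a small runnable sketch of that partitioning on the sample data. It uses Python's sqlite3 rather than SQL Server Express (an assumption for the sake of a self-contained example); SQLite cannot delete through a CTE, so the numbered rows are matched back by rowid instead. The table name traffic is made up, and window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# 'traffic' is a hypothetical table name; the question only says "table".
conn.execute("CREATE TABLE traffic (Col1 TEXT, Col2 TEXT, NumCol INTEGER)")
conn.executemany(
    "INSERT INTO traffic VALUES (?, ?, ?)",
    [
        ("USA", "Organic", 300),
        ("USA", "Organic", 400),
        ("Canada", "Referral", 120),
        ("Canada", "Referral", 120),
    ],
)

# Number rows within each (Col1, Col2) pair, highest NumCol first,
# then delete everything except row number 1 of each pair.
conn.execute(
    """
    DELETE FROM traffic
    WHERE rowid IN (
        SELECT rid
        FROM (
            SELECT rowid AS rid,
                   ROW_NUMBER() OVER (
                       PARTITION BY Col1, Col2
                       ORDER BY NumCol DESC
                   ) AS rn
            FROM traffic
        )
        WHERE rn > 1
    )
    """
)

remaining = sorted(conn.execute("SELECT Col1, Col2, NumCol FROM traffic").fetchall())
for row in remaining:
    print(row)
```

Exact duplicates and "almost duplicates" are handled by the same rule, because both end up with a row number greater than 1 inside their pair.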

How to select last occurrence of duplicating record in oracle

I am having a problem with an Oracle query where the basic goal is to get the last row of each set of recurring rows, but there's a complication that you'll understand from the data:
Suppose I have a table that looks like this:
ID | COL1 | COL2 | COL3 | UPDATED_DATE
------|------|------|------|-------------
001 | a | b | c | 14/05/2013
002 | a | b | c | 16/05/2013
003 | a | b | c | 12/05/2013
You should be able to guess that since columns 1 to 3 have the same values for all 3 rows they are re-occurring data. The problem is, I want to get the latest updated row, which means row #2.
I have an existing query that works if the table is without ID column, but I still need that column, so if anybody could help me point out what I'm doing wrong, that'd be great.
select col1,
col2,
col3,
max(updated_date)
from tbl
group by col1, col2, col3;
The above query returns me row #2, which is correct, but I still need the ID.
Note: I know that I could wrap the above query in another query that selects the ID column based on the 4 columns, but since I'm dealing with millions of records, the re-query would make the app very inefficient.
Try
WITH qry AS
(
SELECT ID, COL1, COL2, COL3, updated_date,
ROW_NUMBER() OVER (PARTITION BY COL1, COL2, COL3 ORDER BY updated_date DESC) rank
FROM tbl
)
SELECT ID, COL1, COL2, COL3, updated_date
FROM qry
WHERE rank = 1
or
SELECT t1.ID, t2.COL1, t2.COL2, t2.COL3, t2.updated_date
FROM tbl t1 JOIN
(
SELECT COL1, COL2, COL3, MAX(updated_date) updated_date
FROM tbl
GROUP BY COL1, COL2, COL3
) t2 ON t1.COL1 = t2.COL1
AND t1.COL2 = t2.COL2
AND t1.COL3 = t2.COL3
AND t1.updated_date = t2.updated_date
Output in both cases:
| ID | COL1 | COL2 | COL3 | UPDATED_DATE |
--------------------------------------------------------
| 2 | a | b | c | May, 16 2013 00:00:00+0000 |
Here is SQLFiddle demo for both queries.
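For anyone without the fiddle at hand, the ROW_NUMBER variant can be tried locally. This is a sketch in Python's sqlite3, standing in for Oracle; the dates are stored as ISO text (an assumption) so that text ordering matches date ordering, and the alias rn is used instead of rank.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (id TEXT, col1 TEXT, col2 TEXT, col3 TEXT, updated_date TEXT)"
)
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?, ?)",
    [
        ("001", "a", "b", "c", "2013-05-14"),
        ("002", "a", "b", "c", "2013-05-16"),
        ("003", "a", "b", "c", "2013-05-12"),
    ],
)

# Rank rows inside each (col1, col2, col3) group, newest first,
# and keep only the newest row together with its id.
latest = conn.execute(
    """
    WITH ranked AS (
        SELECT id, col1, col2, col3, updated_date,
               ROW_NUMBER() OVER (
                   PARTITION BY col1, col2, col3
                   ORDER BY updated_date DESC
               ) AS rn
        FROM tbl
    )
    SELECT id, col1, col2, col3, updated_date
    FROM ranked
    WHERE rn = 1
    """
).fetchall()

print(latest)
```

The table is read once here, whereas the join variant reads it twice, which may matter with millions of rows.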

TSQL - Merge two tables

I have the following task: I have two single-column tables in a procedure and both of them have the same number of rows. I'd like to "merge" them so I get a resulting table with 2 columns. Is there some easy way to do this?
In the worst case I could try to add a primary key and use INSERT INTO ... SELECT with JOIN, but that requires quite big changes in the code I already have, so I decided to ask you guys.
To illustrate, here's an example. I have the following tables:
tableA
col1
----
1
2
3
4
tableB
col2
----
a
b
c
d
Resulting table:
col1 | col2
1 | a
2 | b
3 | c
4 | d
You can do this:
SELECT t1.col1, t2.col2
INTO NewTable
FROM
(
-- Number the rows of each table; ORDER BY (SELECT 1) gives an arbitrary, not guaranteed, order
SELECT col1, ROW_NUMBER() OVER(ORDER BY (SELECT 1)) AS RN
FROM tableA
) AS t1
INNER JOIN
(
SELECT col2, ROW_NUMBER() OVER(ORDER BY (SELECT 1)) AS RN
FROM tableB
) AS t2 ON t1.RN = t2.RN
This will create a brand new table NewTable with the two columns from the two tables:
| COL1 | COL2 |
---------------
| 1 | a |
| 2 | b |
| 3 | c |
| 4 | d |
See it in action here:
SQL Fiddle Demo.
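The pairing-by-row-number idea can also be tried locally. This is a sketch in Python's sqlite3 as a stand-in for T-SQL; it orders by SQLite's implicit rowid, which is an assumption that insertion order is the order you want to pair on. Without some ordering column, the pairing is not guaranteed in any engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tableA (col1 INTEGER)")
conn.execute("CREATE TABLE tableB (col2 TEXT)")
conn.executemany("INSERT INTO tableA VALUES (?)", [(1,), (2,), (3,), (4,)])
conn.executemany("INSERT INTO tableB VALUES (?)", [("a",), ("b",), ("c",), ("d",)])

# Give each row a position number in its own table, then join on that number.
merged = conn.execute(
    """
    SELECT a.col1, b.col2
    FROM (SELECT col1, ROW_NUMBER() OVER (ORDER BY rowid) AS rn FROM tableA) AS a
    JOIN (SELECT col2, ROW_NUMBER() OVER (ORDER BY rowid) AS rn FROM tableB) AS b
      ON a.rn = b.rn
    ORDER BY a.rn
    """
).fetchall()

for row in merged:
    print(row)
```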

Postgres: Query that can filter during table join

I have a Postgres database with duplicated entries in one of the tables. I would like to show the created_on column.
Table1
id | number
1 | 123
2 | 124
3 | 125
4 | 126
Table2
id | number | created_on
1 | 123 | 3/29
2 | 123 | 4/3
3 | 124 | 3/31
4 | 124 | 4/1
In Table2 the numbers are duplicated. I would like to write a single query that lists the following:
id | number | created_on
1 | 123 | 4/3
2 | 124 | 4/1
For duplicated entries, only the latest entry should be included. How could I write that SQL query?
SELECT DISTINCT ON (Table1.number) Table1.id, Table2.number, Table2.created_on FROM Table1
JOIN Table2 ON Table1.number=Table2.number
ORDER BY Table2.created_on;
Actually, when I tried putting 'DISTINCT ON' and 'ORDER BY' in a single query (with JOIN), it gave me the following error:
SELECT DISTINCT ON expressions must match initial ORDER BY expressions
The columns in DISTINCT ON () have to come first in the ORDER BY clause. Also, if you want the latest created_on date, you should order by created_on DESC:
SELECT DISTINCT ON (Table1.number) Table1.id, Table2.number, Table2.created_on
FROM Table1
JOIN Table2
ON Table1.number=Table2.number
ORDER BY Table1.number,Table2.created_on DESC;
http://sqlfiddle.com/#!12/5538a/2
As you said in the comment: created_on=date_trunc('day', now()), so the data type of the field created_on is timestamp. Here is what you can do:
SELECT table_1.id, table_1.number, max(created_on) as created_on
FROM table_1
inner join table_2 using(number)
group by table_1.id, table_1.number
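The GROUP BY / max() variant is portable enough to try outside Postgres. Below is a sketch in Python's sqlite3 (a stand-in, since SQLite has no DISTINCT ON). The question gives dates only as month/day, so they are stored here as zero-padded 'MM-DD' text, which is an assumption that keeps text ordering equal to date ordering within one year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, number INTEGER)")
conn.execute("CREATE TABLE table2 (id INTEGER, number INTEGER, created_on TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(1, 123), (2, 124), (3, 125), (4, 126)],
)
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?, ?)",
    [(1, 123, "03-29"), (2, 123, "04-03"), (3, 124, "03-31"), (4, 124, "04-01")],
)

# One row per table1 entry that has matches in table2, with its latest date.
latest_per_number = conn.execute(
    """
    SELECT t1.id, t1.number, MAX(t2.created_on) AS created_on
    FROM table1 AS t1
    JOIN table2 AS t2 ON t1.number = t2.number
    GROUP BY t1.id, t1.number
    ORDER BY t1.id
    """
).fetchall()

for row in latest_per_number:
    print(row)
```

Numbers 125 and 126 drop out because the inner join finds no rows for them in table2, which matches the output the question asks for.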

Resources