Talend: Get most common value in a column - database

I have a table with a couple hundred rows. I want to know the most common value of the data in one of the columns. How do I go about that?

I recommend doing it in your SQL query, with something like this:
select top 1 column, count(*) cnt
from table
group by column
order by count(*) desc
This syntax has to be adapted to your RDBMS (TOP is SQL Server syntax). For instance, in Oracle it would be something like this:
select column from (
select column, count(*)
from table
group by column
order by count(*) desc
) where rownum = 1
If you want to do it in Talend, you can use:
Input -- tAggregateRow -- tSortRow -- tSampleRow -- Output
In tAggregateRow, use a count function to count how often each value occurs in your column. Then sort the counts in descending order in tSortRow, and keep only the first row with tSampleRow (just put "1").
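The count, sort, take-first idea can be tried outside Talend as well. The following is a minimal sketch using Python's built-in sqlite3 module; the table `items`, the column `color`, and the sample rows are made up for illustration, and `LIMIT 1` is SQLite's counterpart to `TOP 1`.

```python
import sqlite3

# Hypothetical sample data: one column whose most common value we want.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (color TEXT)")
conn.executemany(
    "INSERT INTO items (color) VALUES (?)",
    [("red",), ("blue",), ("red",), ("green",), ("red",), ("blue",)],
)

# Count per value (the tAggregateRow step), sort by count descending
# (the tSortRow step), keep the first row (the tSampleRow step).
most_common = conn.execute(
    """
    SELECT color, COUNT(*) AS cnt
    FROM items
    GROUP BY color
    ORDER BY cnt DESC
    LIMIT 1
    """
).fetchone()

print(most_common)
```

If two values are tied for the highest count, which one comes back is not defined unless you add a tiebreaker to the ORDER BY.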


Row number for same value

The result of my SQL Server query returns 3 columns.
Select Id, InItemId, Qty
from Mytable
order by InItemId
I need to add a column, call it row, that starts at 1 and increases by 1 for each row that has the same InItemId value.
So the result should be:
Thank you !
Use row_number():
select row_number() over (partition by initemid order by initemid) as row,
t.*
from Mytable t;
Note: There is no ordering within a given value of initemid. SQL tables represent unordered sets and there is no obvious column to use for ordering.
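As a rough illustration of the per-group numbering, here is a sketch that runs the same kind of query against SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or later). The sample rows are invented, and ordering by `Id` inside each partition is an assumption made only so the numbering is deterministic; the answer above deliberately leaves that order undefined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Mytable (Id INTEGER, InItemId INTEGER, Qty INTEGER)")
conn.executemany(
    "INSERT INTO Mytable VALUES (?, ?, ?)",
    [(1, 10, 5), (2, 10, 7), (3, 20, 1), (4, 10, 2), (5, 20, 9)],
)

# Numbering restarts at 1 for every distinct InItemId.
# ORDER BY Id inside the window is an assumed tiebreaker.
rows = conn.execute(
    """
    SELECT ROW_NUMBER() OVER (PARTITION BY InItemId ORDER BY Id) AS row_num,
           Id, InItemId, Qty
    FROM Mytable
    ORDER BY InItemId, row_num
    """
).fetchall()

for row in rows:
    print(row)
```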

SQL Query to return rows with the most columns populated

Azure SQL Server 2019.
We have a table Table1 with over 100 columns of differing types of nvarchar data, all of which are allowed NULL values, and where there could be anywhere from 1 to 100 columns populated in a given record. I need to formulate a query that returns the rows ranked by how many columns have values in them, in descending order.
I started going down a road of using DATALENGTH and having to type out the name of every single column, but I can only imagine there has to be a more efficient way. Assuming the column names are column1, column2, column3 etc, how would I accomplish this?
How about a lateral join that unpivots the columns to rows? This requires enumerating the columns just once, like so:
select t.*, c.cnt
from mytable t
cross apply (
select count(*) cnt
from (values (t.column1), (t.column2), (t.column3)) x(col)
where col is not null
) c
order by c.cnt desc
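If typing out 100 column names is the main pain point, the column list can also be generated from the catalog. The sketch below uses Python's sqlite3 module and a different technique from the CROSS APPLY above: SQLite has no CROSS APPLY, so it sums `(col IS NOT NULL)` terms, which evaluate to 1 or 0 in SQLite. The `id` column and the sample rows are assumptions for illustration; on SQL Server you would read the names from the system catalog instead of `PRAGMA table_info`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Table1 (id INTEGER PRIMARY KEY, column1 TEXT, column2 TEXT, column3 TEXT)"
)
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?, ?)",
    [(1, "a", None, None), (2, "a", "b", "c"), (3, None, "b", "c")],
)

# Read the column names from the catalog instead of typing them out.
# Each PRAGMA row is (cid, name, type, notnull, dflt_value, pk).
columns = [row[1] for row in conn.execute("PRAGMA table_info(Table1)") if row[1] != "id"]

# Each (col IS NOT NULL) term is 1 or 0, so the sum is the populated-column count.
count_expr = " + ".join(f'("{name}" IS NOT NULL)' for name in columns)

ranked = conn.execute(
    f"SELECT id, {count_expr} AS cnt FROM Table1 ORDER BY cnt DESC, id"
).fetchall()

print(ranked)
```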

How to find and delete all duplicates from SQL Server database

I'm new to SQL in general and I need to delete all duplicates in a given database.
For the moment, I use this DB to experiment with some things.
The table currently looks like this :
I know I can find all duplicates using this query :
SELECT COUNT(*) AS NBR_DOUBLES, Name, Owner
FROM dbo.animals
GROUP BY Name, Owner
HAVING COUNT(*) > 1
but I have a lot of trouble finding an adapted and updated solution to not only find all the duplicates, but also delete them all, only leaving one of each.
Thanks a lot for taking some of your time to help me.
;WITH numbered AS (
SELECT ROW_NUMBER() OVER(PARTITION BY Name, Owner ORDER BY Name, Owner) AS _dupe_num
FROM dbo.Animals
)
DELETE FROM numbered WHERE _dupe_num > 1;
This will delete all but one of each occurrence with the same Name & Owner. If you need it to be more specific, extend the PARTITION BY clause. If you want it to take the entire record into account, add all your fields.
The record left behind is currently arbitrary, since you do not seem to have any field that could provide an ordering.
What you want to do is use a projection that numbers each record within a given duplicate set. You can do that with a Windowing Function, like this:
SELECT Name, Owner
,Row_Number() OVER ( PARTITION BY Name, Owner ORDER BY Name, Owner, Birth) AS RowNum
FROM dbo.animals
ORDER BY Name, Owner
This should give you results like this:
Name Owner RowNum
Ecstasy Sacha 1
Ecstasy Sacha 2
Ecstasy Sacha 3
Gremlin Max 1
Gremlin Max 2
Gremlin Max 3
Outch Max 1
Outch Max 2
Outch Max 3
Now you want to convert this to a DELETE statement that has a WHERE clause targeting rows with RowNum > 1. The way to use a windowing function with a DELETE is to first include the windowing function as part of a common table expression (CTE), like this:
WITH dupes AS
(
SELECT Name, Owner,
Row_Number() OVER ( PARTITION BY Name, Owner ORDER BY Name, Owner, Birth) AS RowNum
FROM dbo.animals
)
DELETE FROM dupes WHERE RowNum > 1;
This will delete later duplicates, but leave row #1 for each group intact. The only trick now is to make sure row #1 is the correct row, since not all of your duplicates have the same values for the Birth or Death columns. This is the reason I included the Birth column in the windowing function, while other answers (so far) have not. You need to decide if you want to keep the oldest animal or the youngest, and optionally change the Birth order in the OVER clause to match your needs.
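To see the numbering-then-delete idea run end to end, here is a sketch against SQLite via Python's sqlite3 module. SQLite cannot delete through a CTE the way SQL Server can, so this variant deletes by `rowid` instead; that substitution, the `Birth` values, and the choice to keep the earliest `Birth` in each group are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animals (Name TEXT, Owner TEXT, Birth TEXT)")
conn.executemany(
    "INSERT INTO animals VALUES (?, ?, ?)",
    [
        ("Gremlin", "Max", "2010-05-01"),
        ("Gremlin", "Max", "2008-01-15"),
        ("Gremlin", "Max", "2012-09-30"),
        ("Outch", "Max", "2011-03-03"),
        ("Outch", "Max", "2009-07-07"),
        ("Ecstasy", "Sacha", "2013-12-24"),
    ],
)

# Number the rows inside each (Name, Owner) group, earliest Birth first,
# then delete every row whose number is greater than 1.
conn.execute(
    """
    DELETE FROM animals
    WHERE rowid IN (
        SELECT rid FROM (
            SELECT rowid AS rid,
                   ROW_NUMBER() OVER (PARTITION BY Name, Owner ORDER BY Birth) AS RowNum
            FROM animals
        )
        WHERE RowNum > 1
    )
    """
)

remaining = conn.execute(
    "SELECT Name, Owner, Birth FROM animals ORDER BY Name, Owner"
).fetchall()
print(remaining)
```

Switching the window to `ORDER BY Birth DESC` would keep the youngest animal in each group instead.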
Use a CTE. Here is a sample:
Create table #Table1(Field1 varchar(100));
Insert into #Table1 values
('a'),('b'),('f'),('g'),('a'),('b');
Select * from #Table1;
WITH CTE AS(
SELECT Field1,
RN = ROW_NUMBER()OVER(PARTITION BY Field1 ORDER BY Field1)
FROM #Table1
)
--SELECT * FROM CTE WHERE RN > 1
DELETE FROM CTE WHERE RN > 1
What I am doing is numbering the rows. Rows that are duplicates according to the PARTITION BY columns are numbered sequentially (1, 2, 3, ...); a row without duplicates gets 1.
Then I delete the records whose row number is greater than 1.
You will have to adapt the PARTITION BY columns to your own table to get the output you want.
output :
Select * from #Table1;
Field1
---------
a
b
f
g
a
b
/*with cte as (...) SELECT * FROM CTE;*/
Field1 RN
------- -----
a 1
a 2
b 1
b 2
f 1
g 1
If dbo.animals had an ID column, I believe you could use this. It keeps the row with the lowest ID in each (Name, Owner) group and deletes the rest:
DELETE FROM dbo.animals WHERE ID NOT IN
(
SELECT MIN(ID)
FROM dbo.animals
GROUP BY Name, Owner
)

Retrieving X rows from an ordered CTE, TOP vs Range

Objective:
I want to know which performs better when retrieving a fixed number of rows from a CTE that is already ordered.
Example:
Say I have a CTE(intentionally simplified) that looks like this, and I only want the top 5 rows :
WITH cte
AS (
SELECT Id = RANK() OVER (ORDER BY t.ActionID asc)
, t.Name
FROM tblSample AS t -- tblSample is indexed on Id
)
Which is faster:
SELECT TOP 5 * FROM cte
OR
SELECT * FROM cte WHERE Id BETWEEN 1 AND 5 ?
Notes:
I am not a DB programmer, so to me the TOP solution seems better: once SQL Server finds the 5th row, it will stop executing and "return" (100% assumption), while with the other method I feel it will unnecessarily process the whole CTE.
My question is about a CTE. Would the answer be the same if it were a table?
The most important thing to note is that the two queries will not always produce the same result set. Consider the following data:
CREATE TABLE #tblSample (ActionId int not null, name varchar(10) not null);
INSERT #tblSample VALUES (1,'aaa'),(2,'bbb'),(3,'ccc');
Both of these will produce the same result:
WITH CTE AS
(
SELECT id = RANK() OVER (ORDER BY t.ActionID asc), t.name
FROM #tblSample t
)
SELECT TOP(2) * FROM CTE;
WITH CTE AS
(
SELECT id = RANK() OVER (ORDER BY t.ActionID asc), t.name
FROM #tblSample t
)
SELECT * FROM CTE WHERE id BETWEEN 1 AND 2;
Now let's do this update:
UPDATE #tblSample SET ActionId = 1;
After this update, the first query still returns two rows, while the second query returns three, because all three rows now tie at rank 1. Keep in mind too that, without an ORDER BY in the TOP query, the results are not guaranteed, because there is no default order in SQL.
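The tie behaviour is easy to reproduce. This sketch replays the same data in SQLite through Python's sqlite3 module; `LIMIT 2` stands in for `TOP(2)`, and the helper `row_counts` is a made-up name used only to run both queries before and after the update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblSample (ActionId INTEGER NOT NULL, name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO tblSample VALUES (?, ?)", [(1, "aaa"), (2, "bbb"), (3, "ccc")]
)

RANKED = "SELECT RANK() OVER (ORDER BY ActionId) AS id, name FROM tblSample"


def row_counts(connection):
    """Return (rows from the LIMIT query, rows from the BETWEEN query)."""
    top = connection.execute(
        f"WITH cte AS ({RANKED}) SELECT * FROM cte ORDER BY id LIMIT 2"
    ).fetchall()
    ranged = connection.execute(
        f"WITH cte AS ({RANKED}) SELECT * FROM cte WHERE id BETWEEN 1 AND 2"
    ).fetchall()
    return len(top), len(ranged)


before = row_counts(conn)

# Make every row tie: RANK() now assigns 1 to all three rows.
conn.execute("UPDATE tblSample SET ActionId = 1")
after = row_counts(conn)

print(before, after)
```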
With that out of the way - which performs better? It depends. It depends on your indexing, your statistics, number of rows, and the execution plan that the SQL Engine goes with.
TOP 5 returns whichever 5 rows come first in the access path the engine chooses (for example, an index defined on the table), whereas Id BETWEEN 1 AND 5 filters on the Id column, using an index seek or a scan depending on the available indexes and the selected columns. These are two different queries. The BETWEEN query may be slow if you have no index on Id.
Let me try to explain with an example.
Assume this is your data:
create index nci_name on yourcte(id) include(name)
--drop index nci_name on yourcte
;with cte as (
select * from yourcte )
select top 5 * from cte
;with cte as (
select * from yourcte )
select * from cte where id between 1 and 5
First I create an index on id with name included. With that index, the second query does an index seek, while the first does an index scan and then takes the top 5, so in this case the second approach is better.
See the execution plan:
Now I remove the index by executing:
--drop index nci_name on yourcte
Now both approaches do a table scan.
Note the difference between the two table scans: the first reads only 5 rows, while the second reads 10 rows and then applies the predicate.
See the execution plan properties for the first plan.
For the second approach it reads 10 rows.
Now the first approach is better.
In your case this index needs to be on ActionId, which determines the id.
Hence performance depends on how you index your base table.
In order to get the RANK() which you are calculating in your cte it must sort all the data by t.ActionID. Sorting is a blocking operation: the entire input must be processed before a single row is output.
So in this case whether you select any five rows, or if you take the five that sorted to the top of the pile is probably irrelevant.

How does DISTINCT work in SQL Server 2008 R2? Are there other options? [duplicate]

I need to retrieve all rows from a table where 2 columns combined are all different. So I want all the sales that do not have any other sales that happened on the same day for the same price. The sales that are unique based on day and price will get updated to an active status.
So I'm thinking:
UPDATE sales
SET status = 'ACTIVE'
WHERE id IN (SELECT DISTINCT (saleprice, saledate), id, count(id)
FROM sales
HAVING count = 1)
But my brain hurts going any farther than that.
SELECT DISTINCT a,b,c FROM t
is roughly equivalent to:
SELECT a,b,c FROM t GROUP BY a,b,c
It's a good idea to get used to the GROUP BY syntax, as it's more powerful.
For your query, I'd do it like this:
UPDATE sales
SET status='ACTIVE'
WHERE id IN
(
SELECT id
FROM sales S
INNER JOIN
(
SELECT saleprice, saledate
FROM sales
GROUP BY saleprice, saledate
HAVING COUNT(*) = 1
) T
ON S.saleprice=T.saleprice AND s.saledate=T.saledate
)
If you put together the answers so far, clean up and improve, you would arrive at this superior query:
UPDATE sales
SET status = 'ACTIVE'
WHERE (saleprice, saledate) IN (
SELECT saleprice, saledate
FROM sales
GROUP BY saleprice, saledate
HAVING count(*) = 1
);
This is much faster than either of them: it beat the currently accepted answer by a factor of 10 to 15 in my tests on PostgreSQL 8.4 and 9.1. Note that the row-value IN used here is PostgreSQL syntax; SQL Server does not accept it.
But this is still far from optimal. Use a NOT EXISTS (anti-)semi-join for even better performance. EXISTS is standard SQL, has been around forever (at least since PostgreSQL 7.2, long before this question was asked) and fits the presented requirements perfectly:
UPDATE sales s
SET status = 'ACTIVE'
WHERE NOT EXISTS (
SELECT FROM sales s1 -- SELECT list can be empty for EXISTS
WHERE s.saleprice = s1.saleprice
AND s.saledate = s1.saledate
AND s.id <> s1.id -- except for row itself
)
AND s.status IS DISTINCT FROM 'ACTIVE'; -- avoid empty updates. see below
Unique key to identify row
If you don't have a primary or unique key for the table (id in the example), you can substitute with the system column ctid for the purpose of this query (but not for some other purposes):
AND s1.ctid <> s.ctid
Every table should have a primary key. Add one if you didn't have one, yet. I suggest a serial or an IDENTITY column in Postgres 10+.
Related:
In-order sequence generation
Auto increment table column
How is this faster?
The subquery in the EXISTS anti-semi-join can stop evaluating as soon as the first dupe is found (no point in looking further). For a base table with few duplicates this is only mildly more efficient. With lots of duplicates this becomes way more efficient.
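For a quick, runnable check of the NOT EXISTS form, here is a sketch in SQLite via Python's sqlite3 module. It is an adaptation, not the PostgreSQL statement above: the outer table is referenced by name rather than by alias, and SQLite's `IS NOT` plays the role of `IS DISTINCT FROM`. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, saleprice INTEGER, saledate TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, 100, "2024-01-01", "NEW"),     # same price and day as id 2
        (2, 100, "2024-01-01", "NEW"),     # same price and day as id 1
        (3, 100, "2024-01-02", "NEW"),     # unique
        (4, 250, "2024-01-02", "ACTIVE"),  # unique, already active
        (5, 300, "2024-01-03", None),      # unique, status not set yet
    ],
)

# Activate sales that have no other sale with the same price and date.
# The last condition skips rows that are already ACTIVE (empty updates).
cur = conn.execute(
    """
    UPDATE sales
    SET status = 'ACTIVE'
    WHERE NOT EXISTS (
            SELECT 1 FROM sales AS s1
            WHERE s1.saleprice = sales.saleprice
              AND s1.saledate = sales.saledate
              AND s1.id <> sales.id
          )
      AND status IS NOT 'ACTIVE'
    """
)
updated = cur.rowcount
statuses = dict(conn.execute("SELECT id, status FROM sales"))

print(updated, statuses)
```

Only ids 3 and 5 are written; id 4 qualifies as unique but is skipped by the status check.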
Exclude empty updates
For rows that already have status = 'ACTIVE' this update would not change anything, but still insert a new row version at full cost (minor exceptions apply). Normally, you do not want this. Add another WHERE condition like demonstrated above to avoid this and make it even faster:
If status is defined NOT NULL, you can simplify to:
AND status <> 'ACTIVE';
The data type of the column must support the <> operator. Some types like json don't. See:
How to query a json column for empty objects?
Subtle difference in NULL handling
This query (unlike the currently accepted answer by Joel) does not treat NULL values as equal. The following two rows for (saleprice, saledate) would qualify as "distinct" (though looking identical to the human eye):
(123, NULL)
(123, NULL)
Also passes in a unique index and almost anywhere else, since NULL values do not compare equal according to the SQL standard. See:
Create unique constraint with null columns
OTOH, GROUP BY, DISTINCT or DISTINCT ON () treat NULL values as equal. Use an appropriate query style depending on what you want to achieve. You can still use this faster query with IS NOT DISTINCT FROM instead of = for any or all comparisons to make NULL compare equal. More:
How to delete duplicate rows without unique identifier
If all columns being compared are defined NOT NULL, there is no room for disagreement.
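The NULL difference can be observed directly. In this sketch (SQLite via Python's sqlite3 module, invented sample rows) the helper `unique_ids` is a hypothetical name; it runs the NOT EXISTS query once with `=` and once with SQLite's `IS`, which is SQLite's null-safe comparison and is used here as a stand-in for `IS NOT DISTINCT FROM`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, saleprice INTEGER, saledate TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 123, None), (2, 123, None), (3, 456, "2024-01-01")],
)


def unique_ids(comparison):
    """Ids of rows with no other row matching on price and date under `comparison`."""
    query = f"""
        SELECT id FROM sales AS s
        WHERE NOT EXISTS (
            SELECT 1 FROM sales AS s1
            WHERE s1.saleprice {comparison} s.saleprice
              AND s1.saledate {comparison} s.saledate
              AND s1.id <> s.id
        )
        ORDER BY id
    """
    return [row[0] for row in conn.execute(query)]


strict = unique_ids("=")      # NULL = NULL is unknown, so the NULL rows never match
null_safe = unique_ids("IS")  # IS treats two NULLs as equal

# GROUP BY also treats NULLs as equal, so the two NULL rows form one group of 2.
groups = conn.execute(
    "SELECT saleprice, saledate FROM sales "
    "GROUP BY saleprice, saledate HAVING COUNT(*) = 1"
).fetchall()

print(strict, null_safe, groups)
```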
The problem with your query is that when using a GROUP BY clause (which you essentially do by using distinct) you can only use columns that you group by or aggregate functions. You cannot use the column id because there are potentially different values. In your case there is always only one value because of the HAVING clause, but most RDBMS are not smart enough to recognize that.
This should work however (and doesn't need a join):
UPDATE sales
SET status='ACTIVE'
WHERE id IN (
SELECT MIN(id) FROM sales
GROUP BY saleprice, saledate
HAVING COUNT(id) = 1
)
You could also use MAX or AVG instead of MIN, it is only important to use a function that returns the value of the column if there is only one matching row.
If your DBMS doesn't support DISTINCT over multiple columns written like this:
select distinct(col1, col2) from table
then a multi-column DISTINCT can generally be executed safely as follows:
select distinct * from (select col1, col2 from table ) as x
This works on most DBMSs. I expect it to be faster than the GROUP BY solution because it avoids the grouping functionality, though that depends on the plan your engine chooses.
I want to select the distinct values from one column 'GrondOfLucht' but they should be sorted in the order as given in the column 'sortering'. I cannot get the distinct values of just one column using
Select distinct GrondOfLucht,sortering
from CorWijzeVanAanleg
order by sortering
It will also give the column 'sortering' and because 'GrondOfLucht' AND 'sortering' is not unique, the result will be ALL rows.
Use GROUP BY on 'GrondOfLucht' alone to get its distinct values, and order them by the smallest 'sortering' in each group:
SELECT GrondOfLucht
FROM dbo.CorWijzeVanAanleg
GROUP BY GrondOfLucht
ORDER BY MIN(sortering)
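A small sketch of the difference, run in SQLite through Python's sqlite3 module with invented rows: DISTINCT over both columns returns every distinct pair, while grouping by `GrondOfLucht` alone and ordering by `MIN(sortering)` returns each value once, in the intended order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CorWijzeVanAanleg (GrondOfLucht TEXT, sortering INTEGER)")
conn.executemany(
    "INSERT INTO CorWijzeVanAanleg VALUES (?, ?)",
    [("Lucht", 3), ("Grond", 1), ("Lucht", 4), ("Grond", 2)],
)

# DISTINCT applies to the whole row, so every (value, sortering) pair survives.
pairs = conn.execute(
    "SELECT DISTINCT GrondOfLucht, sortering FROM CorWijzeVanAanleg ORDER BY sortering"
).fetchall()

# One row per GrondOfLucht, ordered by the lowest sortering within each group.
ordered = [
    row[0]
    for row in conn.execute(
        "SELECT GrondOfLucht FROM CorWijzeVanAanleg "
        "GROUP BY GrondOfLucht ORDER BY MIN(sortering)"
    )
]

print(len(pairs), ordered)
```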
