SQL Pivot Table producing duplicates

Developers,
I am new to pivot tables, and am having a little problem with duplicates. My table, before pivoting looks like so:
location | food
Tennessee | pear
Tennessee | orange
Florida | orange
Florida | apple
Virginia | pear
Here is the code to pivot, which works fine:
SELECT PivotTable.location, [apple], [orange], [pear]
FROM
(SELECT location, food FROM someTable) as inventory
PIVOT
(COUNT(inventory.food) FOR inventory.location IN ([apple],[orange],[pear])) AS PivotTable
This produces an output like so:
Location | Apple | Orange | Pear
Tennessee | 0 | 1 | 1
Florida | 1 | 1 | 0
Virginia | 0 | 0 | 1
Which as I said works fine. However, I added new columns for comments to my original table, like so:
location | food | apple_comments | orange_comments | pear_comments
Tennessee | pear | NULL | NULL | NULL
Tennessee | orange | NULL | very juicy | NULL
Florida | orange | NULL | NULL | NULL
Florida | apple | crisp | NULL | NULL
Virginia | pear | NULL | NULL | tasty
Here is my altered pivot table to account for the comments:
SELECT PivotTable.location, [apple], [apple_comments], [orange], [orange_comments], [pear], [pear_comments]
FROM
(SELECT location, food, apple_comments, orange_comments, pear_comments FROM someTable) as inventory
PIVOT
(COUNT(inventory.food) FOR inventory.location IN ([apple],[orange],[pear])) AS PivotTable
This produces an output like so:
Location | Apple | apple_comments | Orange | Orange_comments | Pear | Pear_comments
Tennessee | 0 | NULL | 0 | NULL | 1 | NULL
Tennessee | 0 | NULL | 1 | very juicy | 0 | NULL
Florida | 0 | NULL | 1 | NULL | 0 | NULL
Florida | 1 | crisp | 0 | NULL | 0 | NULL
Virginia | 0 | NULL | 0 | NULL | 1 | tasty
So, essentially, once the comments are added it produces one output row per source row whenever a location has more than one entry. In the case of Virginia, there is only one entry, so the row turns out fine.
It almost seems like I need to do another pivot or something. Can anyone offer advice on where I'm going wrong?
Sorry. The desired output should look like so:
Location | Apple | apple_comments | Orange | Orange_comments | Pear | Pear_comments
Tennessee | 0 | NULL | 1 | very juicy | 1 | NULL
Florida | 1 | crisp | 1 | NULL | 0 | NULL
Virginia | 0 | NULL | 0 | NULL | 1 | tasty
Essentially, merging the duplicates into one row.
Thanks.

The fundamental problem is that PIVOT implicitly groups by every column that is not used in the aggregate or the FOR clause, so you have effectively told it to group by the comment columns in addition to the location column. One solution is to roll up the comments into a delimited list, like so:
Select location
, Sum( Case When S.food = 'Apple' Then 1 Else 0 End ) As Apple
, Stuff(
(
Select ', ' + S1.Apple_Comments
From SomeTable As S1
Where S1.location = S.location
And S1.Apple_Comments Is Not Null
Group By S1.Apple_Comments
For Xml Path(''), type
).value('.','nvarchar(max)')
, 1, 2, '') As Apple_Comments
, Sum( Case When S.food = 'Orange' Then 1 Else 0 End ) As Orange
, Stuff(
(
Select ', ' + S1.Orange_Comments
From SomeTable As S1
Where S1.location = S.location
And S1.Orange_Comments Is Not Null
Group By S1.Orange_Comments
For Xml Path(''), type
).value('.','nvarchar(max)')
, 1, 2, '') As Orange_Comments
, Sum( Case When S.food = 'Pear' Then 1 Else 0 End ) As Pear
, Stuff(
(
Select ', ' + S1.Pear_Comments
From SomeTable As S1
Where S1.location = S.location
And S1.Pear_Comments Is Not Null
Group By S1.Pear_Comments
For Xml Path(''), type
).value('.','nvarchar(max)')
, 1, 2, '') As Pear_Comments
From SomeTable As S
Group By S.location

Found the answer (utilizes the 'with CTE' and MAX functions):
;With CTE as (
SELECT PivotTable.location, [apple], [apple_comments], [orange], [orange_comments], [pear], [pear_comments]
FROM
(SELECT location, food, apple_comments, orange_comments, pear_comments FROM someTable) as inventory
PIVOT
(COUNT(inventory.food) FOR inventory.location IN ([apple],[orange],[pear])) AS PivotTable)
select location, MAX([apple]) as [apple], MAX([apple_comments]) as [apple_comments],MAX([orange]) as [orange],
MAX([orange_comments]) as [orange_comments], MAX([pear]) as [pear], MAX([pear_comments]) as [pear_comments]
from CTE group by location
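Since the comment columns are already split per food, plain conditional aggregation may be enough here and avoids PIVOT altogether. The following is a minimal sketch, not a definitive solution. It assumes the sample rows from the question, loaded into a table variable (`@someTable` is a hypothetical stand-in for the real table), and at most one non-NULL comment per location and food.

```sql
-- Hypothetical stand-in for someTable, filled with the question's sample rows.
DECLARE @someTable TABLE (
    location        varchar(20),
    food            varchar(20),
    apple_comments  varchar(50),
    orange_comments varchar(50),
    pear_comments   varchar(50)
);

INSERT INTO @someTable VALUES
    ('Tennessee', 'pear',   NULL,    NULL,         NULL),
    ('Tennessee', 'orange', NULL,    'very juicy', NULL),
    ('Florida',   'orange', NULL,    NULL,         NULL),
    ('Florida',   'apple',  'crisp', NULL,         NULL),
    ('Virginia',  'pear',   NULL,    NULL,         'tasty');

-- One row per location: COUNT over a CASE counts the matching foods,
-- MAX skips NULLs and so picks the single non-NULL comment in the group.
SELECT location,
       COUNT(CASE WHEN food = 'apple'  THEN 1 END) AS apple,
       MAX(apple_comments)                         AS apple_comments,
       COUNT(CASE WHEN food = 'orange' THEN 1 END) AS orange,
       MAX(orange_comments)                        AS orange_comments,
       COUNT(CASE WHEN food = 'pear'   THEN 1 END) AS pear,
       MAX(pear_comments)                          AS pear_comments
FROM @someTable
GROUP BY location;
```

If a location can carry several comments for the same food, MAX keeps only the one that sorts last, whereas the FOR XML PATH roll-up keeps them all.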

SQL Server: how count from value from dynamic columns?

I have data:
+ Subject
___________________
| SubID | SubName |
|-------|---------|
| 1 | English |
|-------|---------|
| 2 | Spanish |
|-------|---------|
| 3 | Korean |
|_______|_________|
+ Student
______________________________________
| StuID | StuName | Gender | SubID |
|---------|---------|--------|--------|
| 1 | David | M | 1,2 |
|---------|---------|--------|--------|
| 2 | Lucy | F | 2,3 |
|_________|_________|________|________|
I want to query result as:
____________________________________
| SubID | SubName | Female | Male |
|--------|---------|--------|------|
| 1 | English | 0 | 1 |
|--------|---------|--------|------|
| 2 | Spanish | 1 | 1 |
|--------|---------|--------|------|
| 3 | Korean | 1 | 0 |
|________|_________|________|______|
This is my query:
SELECT
SubID, SubName, 0 AS Female, 0 AS Male
FROM Subject
I don't know how to replace the 0s with the real counts.

Because you made the mistake of storing CSV data in your tables, we will have to do some SQL Olympics to get your result set. We can join the two tables on the condition that the SubID from the subject table appears somewhere in the CSV list of IDs in the student table. Then aggregate by subject and count the number of males and females.
SELECT
s.SubID,
s.SubName,
COUNT(CASE WHEN st.Gender = 'F' THEN 1 END) Female,
COUNT(CASE WHEN st.Gender = 'M' THEN 1 END) Male
FROM Subject s
LEFT JOIN Student st
ON ',' + CONVERT(varchar(10), st.SubID) + ',' LIKE
'%,' + CONVERT(varchar(10), s.SubID) + ',%'
GROUP BY
s.SubID,
s.SubName;
But, you would be best off refactoring your table design to normalize the data better. Here is an example of a student table which looks a bit better:
+---------+---------+--------+--------+
| StuID | StuName | Gender | SubID |
+---------+---------+--------+--------+
| 1 | David | M | 1 |
+---------+---------+--------+--------+
| 1 | David | M | 2 |
+---------+---------+--------+--------+
| 2 | Lucy | F | 2 |
+---------+---------+--------+--------+
| 2 | Lucy | F | 3 |
+---------+---------+--------+--------+
We can go a bit further, and even store the metadata separately from the StuID and SubID relationship. But even using just the above would have avoided the ugly join condition.
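As a rough illustration of the "go a bit further" design, the sketch below keeps student metadata in one table and the student/subject relationship in a separate junction table. The table name `@StudentSubject` and the use of table variables are assumptions made for the example; only the sample data comes from the question.

```sql
-- Subjects and students hold metadata only.
DECLARE @Subject TABLE (SubID int PRIMARY KEY, SubName varchar(30));
DECLARE @Student TABLE (StuID int PRIMARY KEY, StuName varchar(30), Gender char(1));
-- Hypothetical junction table: one row per (student, subject) pair.
DECLARE @StudentSubject TABLE (StuID int, SubID int, PRIMARY KEY (StuID, SubID));

INSERT INTO @Subject VALUES (1, 'English'), (2, 'Spanish'), (3, 'Korean');
INSERT INTO @Student VALUES (1, 'David', 'M'), (2, 'Lucy', 'F');
INSERT INTO @StudentSubject VALUES (1, 1), (1, 2), (2, 2), (2, 3);

-- Plain equality joins replace the LIKE-based join condition.
SELECT s.SubID,
       s.SubName,
       COUNT(CASE WHEN st.Gender = 'F' THEN 1 END) AS Female,
       COUNT(CASE WHEN st.Gender = 'M' THEN 1 END) AS Male
FROM @Subject AS s
LEFT JOIN @StudentSubject AS ss ON ss.SubID = s.SubID
LEFT JOIN @Student AS st ON st.StuID = ss.StuID
GROUP BY s.SubID, s.SubName
ORDER BY s.SubID;
```

With this layout the joins can use ordinary indexes, and each student's name and gender are stored only once.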

If your version of SQL Server is SQL Server 2016 or above, you can use the STRING_SPLIT function to get the expected results.
create table Subjects
(
SubID int,
SubName varchar(30)
)
insert into Subjects values
(1,'English'),
(2,'Spanish'),
(3,'Korean')
create table student
(
StuID int,
StuName varchar(30),
Gender varchar(10),
SubID varchar(10)
)
insert into student values
(1,'David','M','1,2'),
(2,'Lucy','F','2,3')
--Query
;WITH CTE AS
(
SELECT
S.Gender,
S1.value AS SubID
FROM student S
CROSS APPLY STRING_SPLIT(S.SubID,',') S1
)
select
T.SubID,
T.SubName,
COUNT(CASE T1.Gender WHEN 'F' THEN 1 END) AS Female,
COUNT(CASE T1.Gender WHEN 'M' THEN 1 END) AS Male
from Subjects T
LEFT JOIN CTE T1 ON T.SubID=T1.SubID
GROUP BY T.SubID,T.SubName
ORDER BY T.SubID
--Output
/*
SubID SubName Female Male
----------- ------------------------------ ----------- -----------
1 English 0 1
2 Spanish 1 1
3 Korean 1 0
*/

Getting values from a table that's inside a table (unpivot / cross apply)

I'm having a serious problem with one of my import tables. I've imported an Excel file to a SQL Server table. The table ImportExcelFile now looks like this (simplified):
+----------+-------------------+-----------+------------+--------+--------+-----+---------+
| ImportId | Excelfile | SheetName | Field1 | Field2 | Field3 | ... | Field10 |
+----------+-------------------+-----------+------------+--------+--------+-----+---------+
| 1 | C:\Temp\Test.xlsx | Sheet1 | Age / Year | 2010 | 2011 | | 2018 |
| 2 | C:\Temp\Test.xlsx | Sheet1 | 0 | Value1 | Value2 | | Value9 |
| 3 | C:\Temp\Test.xlsx | Sheet1 | 1 | Value1 | Value2 | | Value9 |
| 4 | C:\Temp\Test.xlsx | Sheet1 | 2 | Value1 | Value2 | | Value9 |
| 5 | C:\Temp\Test.xlsx | Sheet1 | 3 | Value1 | Value2 | | Value9 |
| 6 | C:\Temp\Test.xlsx | Sheet1 | 4 | Value1 | Value2 | | Value9 |
| 7 | C:\Temp\Test.xlsx | Sheet1 | 5 | NULL | NULL | | NULL |
+----------+-------------------+-----------+------------+--------+--------+-----+---------+
I now want to insert those values from Field1 to Field10 to the table AgeYear(in my original table there are about 70 columns and 120 rows). The first row (Age / Year, 2010, 2011, ...) is the header row. The column Field1 is the leading column. I want to save the values in the following format:
+-----------+-----+------+--------+
| SheetName | Age | Year | Value |
+-----------+-----+------+--------+
| Sheet1 | 0 | 2010 | Value1 |
| Sheet1 | 0 | 2011 | Value2 |
| ... | ... | ... | ... |
| Sheet1 | 0 | 2018 | Value9 |
| Sheet1 | 1 | 2010 | Value1 |
| Sheet1 | 1 | 2011 | Value2 |
| ... | ... | ... | ... |
| Sheet1 | 1 | 2018 | Value9 |
| ... | ... | ... | ... |
+-----------+-----+------+--------+
I've tried the following query:
DECLARE @sql NVARCHAR(MAX) =
';WITH cte AS
(
SELECT i.SheetName,
ROW_NUMBER() OVER(PARTITION BY i.SheetName ORDER BY i.SheetName) AS rn,
' + @columns + ' -- @columns = ''Field1, Field2, Field3, Field4, ...''
FROM dbo.ImportExcelFile i
WHERE i.Sheetname LIKE ''Sheet1''
)
SELECT SheetName,
age Age,
y.[Year]
FROM cte
CROSS APPLY
(
SELECT Field1 age
FROM dbo.ImportExcelFile
WHERE SheetName LIKE ''Sheet1''
AND ISNUMERIC(Field1) = 1
) a (age)
UNPIVOT
(
[Year] FOR [Years] IN (' + @columns + ')
) y
WHERE rn = 1'
EXEC (@sql)
So far I'm getting the desired ages and years. My problem is that I don't know how I could get the values. With UNPIVOT I don't get the NULL values. Instead it fills the whole table with the same values even if they are NULL in the source table.
Could you please help me?

Perhaps an alternative approach. This is not dynamic, but with the help of a CROSS APPLY and a JOIN...
The drawback is that you'll have to define the 70 fields.
Example
;with cte0 as (
Select A.ImportId
,A.SheetName
,Age = A.Field1
,B.*
From ImportExcelFile A
Cross Apply ( values ('Field2',Field2)
,('Field3',Field3)
,('Field10',Field10)
) B (Item,Value)
)
,cte1 as ( Select * from cte0 where ImportId=1 )
Select A.SheetName
,[Age] = try_convert(int,A.Age)
,[Year] = try_convert(int,B.Value)
,[Value] = A.Value
From cte0 A
Join cte1 B on A.Item=B.Item
Where A.ImportId>1
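A slightly more compact variation on the same idea is to join each data row to the header row first and then pair header and value inside a single VALUES list. This is only a sketch under stated assumptions: the table variable `@ImportExcelFile`, the three-field layout, and the sample values are made up for illustration, and the header row is assumed to be the one with ImportId = 1. Unlike UNPIVOT, CROSS APPLY with VALUES keeps rows whose value is NULL.

```sql
-- Hypothetical, cut-down stand-in for dbo.ImportExcelFile.
DECLARE @ImportExcelFile TABLE (
    ImportId  int,
    SheetName varchar(20),
    Field1    varchar(20),
    Field2    varchar(20),
    Field3    varchar(20)
);

INSERT INTO @ImportExcelFile VALUES
    (1, 'Sheet1', 'Age / Year', '2010', '2011'),  -- header row
    (2, 'Sheet1', '0',          'a',    'b'),
    (3, 'Sheet1', '1',          NULL,   'd');     -- NULL value is preserved

SELECT d.SheetName,
       TRY_CONVERT(int, d.Field1)   AS Age,
       TRY_CONVERT(int, v.YearText) AS [Year],
       v.Value
FROM @ImportExcelFile AS d
JOIN @ImportExcelFile AS h
  ON h.SheetName = d.SheetName
 AND h.ImportId  = 1                              -- assumed header row
CROSS APPLY (VALUES (h.Field2, d.Field2),
                    (h.Field3, d.Field3)) AS v (YearText, Value)
WHERE d.ImportId > 1;
```

The VALUES list still has to name every field, so for 70 columns it would probably be generated with dynamic SQL.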

MS sql server 2014 select dynamic pivot with lag using last known value for the pivoted column

I have not found an answer yet, so this may not be possible.
I am looking for a pivot query that will replace a pivoted NULL row with the last value available for the column that was not NULL. If the First row is Null then rows are NULL until a row has a value.
Update: when CID changes, the rows start over. So if the first row of CID 3 is NULL, then the value is NULL.
Here is my pivot query
DECLARE @Columns AS VARCHAR(MAX)
DECLARE @Query VARCHAR(MAX)
DECLARE @TEMP_DB VARCHAR(255)
SET @TEMP_DB = 'Demo_DataSet'
SELECT @Columns =
COALESCE(@Columns + ', ','') + QUOTENAME(AttrName)
FROM
(
SELECT DISTINCT AttrName
FROM Demo_FirstPass_Data_Raw
) AS B
ORDER BY B.AttrName
SET @Query = '
WITH PivotData AS
(
SELECT
DocID
, Customer
, Version
, CID
, AttrName
, AttrText
FROM Demo_FirstPass_Data_Raw
)
SELECT
DocID
, Customer
, Version
, CID
, ' + @Columns + '
INTO Demo_FirstPass_Data_Pivot
FROM PivotData
PIVOT
(
MAX(AttrText)
FOR AttrName
IN (' + @Columns + ')
) AS PivotResult
Where Version = Version
ORDER BY DocID, Version, CID'
DECLARE @SQL_SCRIPT VARCHAR(MAX)
SET @SQL_SCRIPT = REPLACE(@Query, '' + @TEMP_DB + '', @TEMP_DB)
EXECUTE (@SQL_SCRIPT)
My result is
DocID | Customer | Version | CID | Username | Sales_Order | Date | Description
1852 | Acme | 1 | 2 | User1 | NULL | 11/17/2010 | Product
1852 | Acme | 2 | 2 | NULL | NULL | NULL | NULL
1852 | Acme | 3 | 2 | NULL | NULL | 12/15/2010 | NULL
1852 | Acme | 4 | 2 | NULL | NULL | NULL | NULL
1852 | Acme | 5 | 2 | NULL | S-0001 | 11/17/2010 | NULL
1852 | Acme | 7 | 2 | NULL | S-0001 | NULL | NULL
1852 | Acme | 8 | 2 | NULL | NULL | 1/14/2011 | NULL
1852 | Acme | 9 | 2 | NULL | NULL | NULL | NULL
1852 | Acme | 10 | 2 | NULL | NULL | NULL | NULL
1852 | Acme | 1 | 3 | User2 | NULL | 10/10/2010 | Product
1852 | Acme | 2 | 3 | NULL | NULL | NULL | NULL
1852 | Acme | 3 | 3 | NULL | NULL | 12/15/2010 | NULL
What I am looking for is
DocID | Customer | Version | CID | Username | Sales_Order | Date | Description
1852 | Acme | 1 | 2 | User1 | NULL | 11/17/2010 | Product
1852 | Acme | 2 | 2 | User1 | NULL | 11/17/2010 | Product
1852 | Acme | 3 | 2 | User1 | NULL | 12/15/2010 | Product
1852 | Acme | 4 | 2 | User1 | NULL | 12/15/2010 | Product
1852 | Acme | 5 | 2 | User1 | S-0001 | 11/17/2010 | Product
1852 | Acme | 7 | 2 | User1 | S-0001 | 11/17/2010 | Product
1852 | Acme | 8 | 2 | User1 | S-0001 | 1/14/2011 | Product
1852 | Acme | 9 | 2 | User1 | S-0001 | 1/14/2011 | Product
1852 | Acme | 10 | 2 | User1 | S-0001 | 1/14/2011 | Product
1852 | Acme | 1 | 3 | User2 | NULL | 10/10/2010 | Product
1852 | Acme | 2 | 3 | User2 | NULL | 10/10/2010 | Product
1852 | Acme | 3 | 3 | User2 | NULL | 12/15/2010 | Product
Any help is appreciated.

For an unknown number of columns, and to integrate into a dynamic pivot, one option is to generate the code for a recursive CTE and use that to retain the last non-NULL value based on your partitions, like so:
declare @Columns as nvarchar(max)
declare @Query nvarchar(max)
declare @temp_db nvarchar(255)
set @temp_db = 'Demo_DataSet'
select @Columns =
coalesce(@Columns + ', ','') + quotename(AttrName)
from
(
select distinct AttrName
from Demo_FirstPass_Data_Raw
) as B
order by B.AttrName
/* generate isnull statements for columns in recursive cte */
declare @isnull nvarchar(max) = stuff((
select distinct ', isnull(t.'+quotename(d.AttrName)+',cte.'+quotename(d.AttrName)+')'
from Demo_FirstPass_Data_Raw d
order by 1
for xml path (''), type).value('(./text())[1]','nvarchar(max)')
,1,2,'')
set @Query = 'with PivotData as (
select Docid, Customer, Version, cid, AttrName, AttrText
from Demo_FirstPass_Data_Raw
)
, t as (
select
Docid, Customer, Version, cid
, ' + @Columns + '
, rn = row_number() over (partition by DocId, Customer, cid order by Version)
from PivotData
pivot(max(AttrText) for AttrName in (' + @Columns + ')) as PivotResult
)
, cte as (
select [Docid], [Customer], [Version], [cid], ' + @Columns + ', rn
from t
where version = 1
union all
select t.[Docid], t.[Customer], t.[Version], t.[cid]
, '+ @isnull + '
'+',t.rn
from t
inner join cte
on t.rn = cte.rn+1
and t.docid = cte.docid
and t.customer = cte.customer
and t.cid = cte.cid
)
select *
from cte
order by docid, customer, cid, version
'
select @query
exec sp_executesql @query
rextester demo: http://rextester.com/OQZOW62536
code generated:
with PivotData as (
select Docid, Customer, Version, cid, AttrName, AttrText
from Demo_FirstPass_Data_Raw
)
, t as (
select
Docid, Customer, Version, cid
, [Date], [Description], [Sales_Order], [Username]
, rn = row_number() over (partition by DocId, Customer, cid order by Version)
from PivotData
pivot(max(AttrText) for AttrName in ([Date], [Description], [Sales_Order], [Username])) as PivotResult
)
, cte as (
select [Docid], [Customer], [Version], [cid], [Date], [Description], [Sales_Order], [Username], rn
from t
where version = 1
union all
select t.[Docid], t.[Customer], t.[Version], t.[cid]
, isnull(t.[Date],cte.[Date]), isnull(t.[Description],cte.[Description]), isnull(t.[Sales_Order],cte.[Sales_Order]), isnull(t.[Username],cte.[Username])
,t.rn
from t
inner join cte
on t.rn = cte.rn+1
and t.docid = cte.docid
and t.customer = cte.customer
and t.cid = cte.cid
)
select *
from cte
order by docid, customer, cid, version
results:
+-------+----------+---------+-----+------------+-------------+-------------+----------+----+
| Docid | Customer | Version | cid | Date | Description | Sales_Order | Username | rn |
+-------+----------+---------+-----+------------+-------------+-------------+----------+----+
| 1852 | Acme | 1 | 2 | 2010-11-17 | Product | NULL | User1 | 1 |
| 1852 | Acme | 2 | 2 | 2010-11-17 | Product | NULL | User1 | 2 |
| 1852 | Acme | 3 | 2 | 2010-12-15 | Product | NULL | User1 | 3 |
| 1852 | Acme | 4 | 2 | 2010-12-15 | Product | NULL | User1 | 4 |
| 1852 | Acme | 5 | 2 | 2010-11-17 | Product | S-0001 | User1 | 5 |
| 1852 | Acme | 7 | 2 | 2010-11-17 | Product | S-0001 | User1 | 6 |
| 1852 | Acme | 8 | 2 | 2011-01-14 | Product | S-0001 | User1 | 7 |
| 1852 | Acme | 9 | 2 | 2011-01-14 | Product | S-0001 | User1 | 8 |
| 1852 | Acme | 10 | 2 | 2011-01-14 | Product | S-0001 | User1 | 9 |
| 1852 | Acme | 1 | 3 | 2010-10-10 | Product | NULL | User2 | 1 |
| 1852 | Acme | 2 | 3 | 2010-10-10 | Product | NULL | User2 | 2 |
| 1852 | Acme | 3 | 3 | 2010-12-15 | Product | NULL | User2 | 3 |
+-------+----------+---------+-----+------------+-------------+-------------+----------+----+
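For comparison, the same "carry the last known value forward" behaviour can also be sketched without recursion, using one OUTER APPLY per column that looks back for the most recent non-NULL value within the same DocID and CID. This is an illustrative sketch only: `@Pivoted` is a hypothetical table variable holding a few already-pivoted rows from the question, and only two of the attribute columns are shown.

```sql
-- Hypothetical stand-in for the pivoted result (two attribute columns only).
DECLARE @Pivoted TABLE (
    DocID    int,
    CID      int,
    Version  int,
    Username varchar(20),
    [Date]   varchar(20)
);

INSERT INTO @Pivoted VALUES
    (1852, 2, 1, 'User1', '11/17/2010'),
    (1852, 2, 2, NULL,    NULL),
    (1852, 2, 3, NULL,    '12/15/2010'),
    (1852, 3, 1, 'User2', '10/10/2010'),
    (1852, 3, 2, NULL,    NULL);

SELECT p.DocID, p.CID, p.Version, u.Username, d.[Date]
FROM @Pivoted AS p
-- Latest non-NULL Username at or before this version, same DocID and CID.
OUTER APPLY (SELECT TOP (1) x.Username
             FROM @Pivoted AS x
             WHERE x.DocID = p.DocID AND x.CID = p.CID
               AND x.Version <= p.Version
               AND x.Username IS NOT NULL
             ORDER BY x.Version DESC) AS u
-- Same lookup for the Date column.
OUTER APPLY (SELECT TOP (1) x.[Date]
             FROM @Pivoted AS x
             WHERE x.DocID = p.DocID AND x.CID = p.CID
               AND x.Version <= p.Version
               AND x.[Date] IS NOT NULL
             ORDER BY x.Version DESC) AS d
ORDER BY p.DocID, p.CID, p.Version;
```

Because the lookup is restricted to the same CID, a column that starts out NULL for a new CID stays NULL until that CID gets a value. In a dynamic pivot, the APPLY blocks would have to be generated per column in the same way as the isnull list.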

Selecting the longest string in each field

I am trying to clean up a data set similar in structure to the following table:
dataSource
| ID_dec | ID_base | name | field1 | field2 | field3 |
| 1.01 | 1 | AAA | Cat | Brown | Domesticated |
| 1.02 | 1 | AAA | Cat | Brown | Domesticated |
| 1.03 | 1 | AAA | Feline | NULL | Dom. |
| 1.04 | 1 | AAA | Beautiful cat | NULL | NULL |
| 1.05 | 1 | AAA | NULL | Light Brown | NULL |
| 2.01 | 2 | BBB | Dog | Black | Wild |
| 2.02 | 2 | BBB | Barker | NULL | NULL |
| 3.01 | 3 | CCC | Bird | Yellow | Domesticated |
| 4.01 | 4 | DDD | Snake | NULL | NULL |
| 4.02 | 4 | DDD | NULL | Green | NULL |
| 4.03 | 4 | DDD | NULL | Forest Green | NULL |
| 4.04 | 4 | DDD | NULL | Green | Wild |
| 4.05 | 4 | DDD | NULL | NULL | Wild |
I want to pull the longest string of each combination of field[N] and ID_base, like so:
result
| ID_base | name | field1 | field2 | field3 |
| 1 | AAA | Beautiful cat | Light Brown | Domesticated |
| 2 | BBB | Barker | Black | Wild |
| 3 | CCC | Bird | Yellow | Domesticated |
| 4 | DDD | Snake | Forest Green | Wild |
This has been asked before, but only for a single field. The following SQL gets me the desired result, but feels inefficient when scaled up to the real data set of 37 fields and 5665 rows (4029 ID_bases, with at most 10 ID_decs for a single ID_base):
SELECT DISTINCT a.id_base, a.name, b.result, c.result, d.result
FROM
dataSource a
LEFT JOIN
(
SELECT y.id_base, max(y.field1) result
FROM dataSource y
LEFT JOIN
(
SELECT id_base, max(len(field1)) leng
FROM dataSource
GROUP BY id_base
) z
ON y.id_base = z.id_base
WHERE len(y.field1) = z.leng
GROUP BY y.id_base
) b
ON a.id_base = b.id_base
LEFT JOIN
(
SELECT y.id_base, max(y.field2) result
FROM dataSource y
LEFT JOIN
(
SELECT id_base, max(len(field2)) leng
FROM dataSource
GROUP BY id_base
) z
ON y.id_base = z.id_base
WHERE len(y.field2) = z.leng
GROUP BY y.id_base
) c
ON a.id_base = c.id_base
LEFT JOIN
(
SELECT y.id_base, max(y.field3) result
FROM dataSource y
LEFT JOIN
(
SELECT id_base, max(len(field3)) leng
FROM dataSource
GROUP BY id_base
) z
ON y.id_base = z.id_base
WHERE len(y.field3) = z.leng
GROUP BY y.id_base
) d
ON a.id_base = d.id_base
What is the best way to go about this query?

WITH a AS (
SELECT id_base, name, max(len(field1)) l1, max(len(field2)) l2, max(len(field3)) l3
FROM datasource
GROUP BY id_base, name
)
SELECT a.*,
(SELECT TOP 1 field1 FROM datasource WHERE id_base = a.id_base AND len(field1) = a.l1),
(SELECT TOP 1 field2 FROM datasource WHERE id_base = a.id_base AND len(field2) = a.l2),
(SELECT TOP 1 field3 FROM datasource WHERE id_base = a.id_base AND len(field3) = a.l3)
from a

Another simpler variation:
SELECT
t.id_base,
t.name,
(SELECT TOP 1 field1 FROM dataSource WHERE id_base = t.id_base ORDER BY LEN(field1) DESC) AS field1,
(SELECT TOP 1 field2 FROM dataSource WHERE id_base = t.id_base ORDER BY LEN(field2) DESC) AS field2,
(SELECT TOP 1 field3 FROM dataSource WHERE id_base = t.id_base ORDER BY LEN(field3) DESC) AS field3
FROM (SELECT DISTINCT id_base, name FROM dataSource) t

Select coalesce(t1.ID_base, t2.ID_base, t3.ID_base) base,
coalesce(t1.Name, t2.Name, t3.Name) Name,
coalesce(t1.field1, t2.field1, t3.field1) field1,
coalesce(t1.field2, t2.field2, t3.field2) field2,
coalesce(t1.field3, t2.field3, t3.field3) field3
from dataSource t1
full join dataSource t2 on t2.ID_base = t1.ID_base
and len(t1.field1) = (Select Max(len(field1)) from dataSource
where ID_base = t1.ID_base)
and len(t2.field2) = (Select Max(len(field2)) from dataSource
where ID_base = t2.ID_base)
full join dataSource t3 on t3.ID_base = t1.ID_base
and len(t3.field3) = (Select Max(len(field3)) from dataSource
where ID_base = t3.ID_base)

Group By same or similar string sql

1) Suppose I have a table like this:
| id | color_code | fruit |
|:------|--------------|----------------:|
| 1 | 000001 | apple |
| 2 | 000001 | apple |
| 3 | 000001 | apple |
| 4 | 000002 | lemon |
| 5 | 000002 | lemon |
| 6 | 000003 | grapes |
| 7 | 000003 | grapes |
How can I group the fruit column by the color_code column in SQL Server?
Something like this, I suppose:
| id | color_code | fruit | group_concat(id) |
|:------|--------------|-----------------|---------------------|
| 1 | 000001 | apple | 1,2,3 |
| 4 | 000002 | lemon | 4,5 |
| 6 | 000003 | grapes | 6,7 |
2) What if I have 3 tables (shown below) that need to be joined? How can I achieve this?
series_no table:
| id | desc_seriesno |
|:------|----------------:|
| 7040 | AU1011 |
| 7041 | AU1022 |
| 7042 | AU1033 |
| 7043 | AU1044 |
| 7044 | AU1055 |
| 7045 | AU1066 |
brand table:
| id | desc_brand |
|:------|----------------:|
| 1020 | Audi |
| 1021 | Bentley |
| 1022 | Ford |
| 1023 | BMW |
| 1024 | Mazda |
| 1025 | Toyota |
car_info table:
| seriesno_id | brand_id | color |
|:---------------|------------|--------:|
| 7040 | 1020 | white |
| 7040 | 1020 | black |
| 7040 | 1020 | pink |
| 7041 | 1021 | yellow |
| 7041 | 1021 | brown |
| 7042 | 1022 | purple |
| 7042 | 1022 | black |
| 7042 | 1022 | green |
| 7043 | 1023 | blue |
| 7044 | 1024 | red |
| 7045 | 1025 | maroon |
| 7045 | 1025 | white |
This is my current query with SQL Server 2014:
SELECT SN.id AS seriesid, B.id AS brandid, B.desc_brand
FROM [db1].[dbo].[series_no] SN
LEFT JOIN [db1].[dbo].[car_info] CI
ON CI.seriesno_id = SN.id
RIGHT JOIN [db1].[dbo].[brand] B
ON B.id = CI.brand_id
GROUP BY SN.id, B.id
ORDER BY SN.id ASC
Unfortunately, it gave me an error, since I cannot group by similar strings this way.
I want the result to look like this:
| seriesid | brandid | desc_brand | count |
|:-----------|------------|---------------|-------|
| 7040 | 1020 | Audi | 3 |
| 7041 | 1021 | Bentley | 2 |
| 7042 | 1022 | Ford | 3 |
| 7043 | 1023 | BMW | 1 |
| 7044 | 1024 | Mazda | 1 |
| 7045 | 1025 | Toyota | 2 |

1. Fruit Color
Assuming the table name is FruitColor, you can get the desired output with the following query:
SELECT MIN(id) AS id
, color_code
, fruit
, group_concat_id = STUFF((SELECT ',' + CAST(id AS VARCHAR)
FROM FruitColor AS fci
WHERE fci.fruit = fc.fruit AND fci.color_code = fc.color_code
FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 1, '')
FROM FruitColor AS fc
GROUP BY color_code, fruit
ORDER BY id;
The MIN() selects the first id of the group.
Since SQL Server has no built-in GROUP_CONCAT function like MySQL, you have to use the STUFF function and FOR XML PATH. To learn more about grouped concatenation, see https://sqlperformance.com/2014/08/t-sql-queries/sql-server-grouped-concatenation
You can customize the WHERE clause to match only by color_code.
2. There are several options for this:
Option (a): Show counts for series with brands
SELECT seriesno_id AS seriesid, ci.brand_id AS brandid, desc_brand, COUNT(*) AS [count]
FROM db1.dbo.car_info AS ci
LEFT JOIN db1.dbo.brand AS b ON (b.id = ci.brand_id)
GROUP BY seriesno_id, ci.brand_id, desc_brand;
Here you don't need the series table if you only want to show counts for cars that have a brand.
You may not need the RIGHT JOIN on the brand table either, because if the brand table contains a record that is not in the car_info table, seriesno_id would be NULL.
Option (b): Show counts for all the series with or without a brand
SELECT sn.id AS seriesid, ci.brand_id AS brandid, desc_brand, COUNT(ci.seriesno_id) AS [count] -- counts 0 for a series with no car_info rows
FROM db1.dbo.series_no AS sn
LEFT JOIN db1.dbo.car_info AS ci ON (ci.seriesno_id = sn.id)
LEFT JOIN db1.dbo.brand AS b ON (b.id = ci.brand_id)
GROUP BY sn.id, ci.brand_id, desc_brand;
Option (c): The workaround for selecting a column which is not in the GROUP BY
SELECT seriesno_id AS seriesid, ci.brand_id AS brandid, MAX(desc_brand) AS desc_brand, COUNT(*) AS [count]
FROM db1.dbo.car_info AS ci
LEFT JOIN db1.dbo.brand AS b ON (b.id = ci.brand_id)
GROUP BY seriesno_id, ci.brand_id;
Here, if we are certain that each brand has only one desc_brand, we can use an aggregate on it.
This is because applying an aggregate to a single value returns that value. I used MAX here.
Personally I would go with option (a) as it makes more sense.
Update on GROUP BY exception for desc_brand being NTEXT...
Cast desc_brand to NVARCHAR to avoid the exception.
CAST(desc_brand AS NVARCHAR(200))
Also, I highly recommend using VARCHAR / NVARCHAR instead of TEXT, NTEXT, CHAR, etc., because the latter usually occupy more memory.
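For the grouped concatenation in part 1, newer versions offer a shorter route: STRING_AGG, available from SQL Server 2017 onward (so not on the SQL Server 2014 instance mentioned in the question). The following is a minimal sketch using a table variable `@FruitColor` as a hypothetical stand-in for the real table, filled with the question's sample rows.

```sql
-- Hypothetical stand-in for the FruitColor table.
DECLARE @FruitColor TABLE (id int, color_code char(6), fruit varchar(20));

INSERT INTO @FruitColor VALUES
    (1, '000001', 'apple'),
    (2, '000001', 'apple'),
    (3, '000001', 'apple'),
    (4, '000002', 'lemon'),
    (5, '000002', 'lemon'),
    (6, '000003', 'grapes'),
    (7, '000003', 'grapes');

-- STRING_AGG needs a character argument, hence the CAST;
-- WITHIN GROUP fixes the order of the ids inside each list.
SELECT MIN(id) AS id,
       color_code,
       fruit,
       STRING_AGG(CAST(id AS varchar(10)), ',') WITHIN GROUP (ORDER BY id) AS group_concat_id
FROM @FruitColor
GROUP BY color_code, fruit
ORDER BY MIN(id);
```

On older versions, the STUFF plus FOR XML PATH pattern shown above remains the usual workaround.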

SELECT
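id = LEFT(group_concat, CHARINDEX(',', group_concat + ',') - 1), -- first id in the list; SUBSTRING(group_concat,1,1) would truncate multi-digit ids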
color_code,
fruit,
group_concat
FROM(
SELECT distinct
m.color_code,
m.fruit,
group_concat = STUFF((SELECT ',' + CONVERT(varchar(10),md.id)
FROM [Test_1].[dbo].[Stuff] md
WHERE m.fruit = md.fruit
AND m.color_code = md.color_code
FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 1, '')
FROM [Test_1].[dbo].[Stuff] m)x

For the first question, use the code below:
SELECT distinct
m.color_code
, m.fruit
, group_concat = STUFF((
SELECT ',' + CONVERT(varchar(10),md.id)
FROM dbo.tablename md
WHERE m.fruit = md.fruit and m.color_code = md.color_code
FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 1, '')
FROM dbo.tablename m
For the second question:
SELECT SN.id AS seriesid, B.id AS brandid, B.desc_brand, count(*) AS [count]
FROM [db1].[dbo].[series_no] SN
LEFT JOIN [db1].[dbo].[car_info] CI
ON CI.seriesno_id = SN.id
RIGHT JOIN [db1].[dbo].[brand] B
ON B.id = CI.brand_id
GROUP BY SN.id, B.id ,B.desc_brand
ORDER BY 4 ASC