How to group by column having spelling mistakes - sql-server

While working with some legacy data, I want to group the data on a column ignoring spelling mistakes. I think SOUNDEX() could do the job to achieve the desired result. Here is what I tried:
SELECT soundex(AREA)
FROM MASTER
GROUP BY soundex(AREA)
ORDER BY soundex(AREA)
But (obviously) the SOUNDEX returned 4-character code in result rows like this, loosing actual strings:
A131
A200
A236
How could I include at least one occurrence from the group into the query result instead of 4-character code.

SELECT soundex(AREA) as snd_AREA, min(AREA) as AREA_EXAMPLE_1, max(AREA) as AREA_EXAMPLE_2
from MASTER
group by soundex(AREA)
order by AREA_EXAMPLE_1
;
In MySQL you could select group_concat(distinct AREA) as list_area to get all the versions, and I don't know about that in SQL-Server, but min and max give two examples of the areas, and you wanted to discard the diffs anyway.

You could also use row_number() to get one row for each soundex(area) value:
select AREA, snd
from
(
select AREA, soundex(AREA) snd,
row_number() over(partition by soundex(AREA)
order by soundex(AREA)) rn
from master
) x
where rn = 1
See SQL Fiddle with Demo

Related

Return column with varying values depending on change points

I'm fairly new to Microsoft SQL Server, so maybe this is very simple yet I just don't have the experience to pull from.
The data I have is similar to the first three columns shown (A, B, C). I want to use those columns to return the data in the yellow highlighted column (D). Basically, I'm trying to show all values of a variable from the current week onward, including when there are change points of the variable. The value of the variable should continue forward in time until the value of the variable changes (column C).
Thanks in advance.
SELECT T1.*, COALESCE(SQ.NewValue, T1.StartingValue) FROM YourTable T1
OUTER APPLY (SELECT TOP 1 T2.NewValue FROM YourTable T2
WHERE T1.Week <= T2.week AND
T2.NewValue IS NOT NULL
ORDER BY T2.Week DESC) SQ
One way is to make Column D a correlated sub-query that gets the most recent previous value of C that is not NULL.
One method, which doesn't need 2 table scans is to use a CTE to create a "group number" and then the OVER clause with a MAX:
WITH VTE AS (
SELECT *
FROM (VALUES(1,0.5,NULL),
(2,0.5,1),
(3,0.5,NULL),
(4,0.5,NULL),
(5,0.5,0.8),
(6,0.5,NULL)) V(WeekNo, Starting, New)),
CTE AS(
SELECT *,
COUNT(New) OVER (ORDER BY WeekNo ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Grp
FROM VTE)
SELECT WeekNo, Starting, New,
ISNULL(MAX(New) OVER (PARTITION BY CTE.Grp),Starting) AS Result
FROM CTE
ORDER BY WeekNo;

How ROW_NUMBER used with insertions?

I've multipe uniond statements in MSSQL Server that is very hard to find a unique column among the result.
I need to have a unique value per each row, so I've used ROW_NUMBER() function.
This result set is being copied to other place (actually a SOLR index).
In the next time I will run the same query, I need to pick only the newly added rows.
So, I need to confirm that, the newly added rows will be numbered afterward the last row_number value of the last time.
In other words, Is the ROW_NUMBER functions orders the results with the insertion order - suppose I don't adding any ORDER BY clause?
If no, (as I think), Is there any alternatives?
Thanks.
Without seeing the sql I can only give the general answer that MS Sql does not guarantee the order of select statements without an order clause so that would mean that the row_number may not be the insertion order.
I guess you can do something like this..
;WITH
cte
AS
(
SELECT * , rn = ROW_NUMBER() OVER (ORDER BY SomeColumn)
FROM
(
/* Your Union Queries here*/
)q
)
INSERT INTO Destination_Table
SELECT * FROM
CTE LEFT JOIN Destination_Table
ON CTE.Refrencing_Column = Destination_Table.Refrencing_Column
WHERE Destination_Table.Refrencing_Column IS NULL
I would suggest you consider 'timestamping' the row with the time it was inserted. Or adding an identity column to the table.
But what it sounds like you want to do is get current max id and then add the row_number to it.
Select col1, col2, mid + row_number() over(order by smt) id
From (
Select col1, col2, (select max(id) from tbl) mid
From query
) t

Row_Number Over Where RowNumber between

I'm try to select a certain rows from my table using the row_number over. However, the sql will prompt the error msg "Invalid column name 'ROWNUMBERS' ". Anyone can correct me?
SELECT ROW_NUMBER() OVER (ORDER BY Price ASC) AS ROWNUMBERS, *
FROM Product
WHERE ROWNUMBERS BETWEEN #fromCount AND #toCount
Attempting to reference the aliased column in the WHERE clause does not work because of the logical query processing taking place. The WHERE is evaluated before the SELECT clause. Therefore, the column ROWNUMBERS does not exist when WHERE is evaluated.
The correct way to reference the column in this example would be:
SELECT a.*
FROM
(SELECT ROW_NUMBER() OVER (ORDER BY Price ASC) AS ROWNUMBERS, *
FROM Product) a
WHERE a.ROWNUMBERS BETWEEN #fromCount AND #toCount
For your reference, the order for operations is:
FROM
WHERE
GROUP BY
HAVING
SELECT
ORDER BY
There is another answer here that solves the specific error reported. However, I also want to address the wider problem. It looks a lot like what you are doing here is paging your results for display. If that is the case, and if you can use Sql Server 2012, there is a better way now. Take a look at OFFSET/FETCH:
SELECT First Name + ' ' + Last Name
FROM Employees
ORDER BY First Name
OFFSET 10 ROWS FETCH NEXT 5 ROWS ONLY;
That would show the third page of a query where the page size is 5.

Replace Group By clause with any other clause

In below query, I am using GROUP BY clause to get list of recently updated records depends on updated date. But I would like to have the query without a GROUP BY clause because of some internal reasons. Can please any one help me to solve this.
SELECT Proj_UpdatedDate,
Proj_UpdatedBy
FROM ProjectProgress PP
WHERE Proj_UpdatedDate IN (SELECT MAX(Proj_UpdatedDate)
FROM ProjectProgress
GROUP BY
Proj_ProjectID)
ORDER BY
Proj_ProjectID
Using TOP 1 should give you the same result assuming you meant the MAX(Proj_UpdatedDate):
SELECT Proj_UpdatedDate,
Proj_UpdatedBy
FROM ProjectProgress PP
WHERE Proj_UpdatedDate IN (SELECT TOP 1 Proj_UpdatedDate
FROM ProjectProgress
ORDER BY Proj_UpdatedDate DESC)
ORDER BY
Proj_ProjectID
However your query actually returns multiple dates since it's GROUPED BY Proj_ProjectId (the max date for each project). Is that your desired outcome - to show a list of dates that the projects were updated and by whom?
If so, try using ROW_NUMBER():
SELECT Proj_UpdatedDate, Proj_UpdatedBy
FROM (
SELECT ROW_NUMBER() OVER (PARTITION BY Proj_ProjectID ORDER BY Proj_UpdatedBy DESC) rn,
Proj_UpdatedDate,
Proj_UpdatedBy
FROM ProjectProgress
) t
WHERE rn = 1
And here is the SQL Fiddle. This assumes you are running SQL Server 2005 or greater.
Good luck.

How to elegantly write a SQL ORDER BY (which is invalid in inline query) but required for aggregate GROUP BY?

I have a simple query that runs in SQL 2008 and uses a custom CLR aggregate function, dbo.string_concat which aggregates a collection of strings.
I require the comments ordered sequentially hence the ORDER BY requirement.
The query I have has an awful TOP statement in it to allow ORDER BY to work for the aggregate function otherwise the comments will be in no particular order when they are concatenated by the function.
Here's the current query:
SELECT ID, dbo.string_concat(Comment)
FROM (
SELECT TOP 10000000000000 ID, Comment, CommentDate
FROM Comments
ORDER BY ID, CommentDate DESC
) x
GROUP BY ID
Is there a more elegant way to rewrite this statement?
So... what you want is comments concatenated in order of ID then CommentDate of the most recent comment?
Couldn't you just do
SELECT ID, dbo.string_concat(Comment)
FROM Comments
GROUP BY ID
ORDER BY ID, MAX(CommentDate) DESC
Edit: Misunderstood your objective. Best I can come up with is that you could clean up your query a fair bit by making it SELECT TOP 100 PERCENT, it's still using a top but at least it gets around having an arbitrary number as the limit.
Since you're using sql server 2008, you can use a Common Table Expression:
WITH cte_ordered (ID, Comment, CommentDate)
AS
(
SELECT ID, Comment, CommentDate
FROM Comments
ORDER BY ID, CommentDate DESC
)
SELECT ID, dbo.string_concat(Comment)
FROM cte_ordered
GROUP BY ID

Resources