How to find misspellings in data

I am trying to find the misspellings in the TOWN__C field. The data looks something like the sample below. There is no specific pattern: the misspelling can be at the beginning, in the middle, or at the end, and its length can vary too.
I am using SQL Server Management Studio to execute the queries. I used SUBSTRING together with a left outer join to find duplicates, but that does not return only the misspellings, so I still need to go through the data manually.
Data ->
Achampet
ACHEMPET
AGIA
AGIYA
ASHOK NAGAR
ASHOKNAGAR
ASHOKNAGER
SQL query which I am using ->
Select distinct(T3.TOWN__C)
From (Select T1.Sub_Str, Count(T1.Sub_Str) as Y
From (SELECT TOWN__C, SUBSTRING(TOWN__C, 1, 3) as Sub_Str
FROM [SALESFORCE].[dbo].[Outlet Master] group by TOWN__C)T1
Group by T1.Sub_Str having count(*)> 1)T2
Left outer join
[SALESFORCE].[dbo].[Outlet Master]T3
On T2.Sub_Str = SUBSTRING(T3.TOWN__C, 1, 3)
Order by T3.TOWN__C
Is there a way to find out all such cases using SQL or Excel or anything else?

Here's an example using SOUNDEX, to try to locate values where multiple spellings have been used for "similar" names:
declare @t table (town varchar(35) not null)
insert into @t(town) values
('Achampet'),
('ACHEMPET'),
('AGIA'),
('AGIYA'),
('ASHOK NAGAR'),
('ASHOKNAGAR'),
('ASHOKNAGER'),
('Downtown'),
('DOWNTOWN'),
('DownTown')
select
    v.*
from
    (select
        *,
        MIN(town) OVER (PARTITION BY town_sound) as minTown,
        MAX(town) OVER (PARTITION BY town_sound) as maxTown
    from
        @t
        cross apply
        (select SOUNDEX(REPLACE(town,' ','')) as town_sound) u
    ) v
where minTown != maxTown
Note that this doesn't return "downtown" where the only variations are in capitalization, but does return all of the values in your given sample data, which I assume were all meant to be found as possible misspellings.
Also note that SOUNDEX has had a chequered history and under older versions of SQL Server it was usually recommended that a "better" soundex be implemented as a UDF. You should be able to find versions of that with a simple search, if required.
Note, also, that Soundex was specifically designed around English pronunciation. Again, you may be able to find a better tailored function as a UDF for specific other languages.
Results:
town          town_sound minTown       maxTown
------------- ---------- ------------- -------------
AGIA          A200       AGIA          AGIYA
AGIYA         A200       AGIA          AGIYA
ASHOK NAGAR   A225       ASHOK NAGAR   ASHOKNAGER
ASHOKNAGAR    A225       ASHOK NAGAR   ASHOKNAGER
ASHOKNAGER    A225       ASHOK NAGAR   ASHOKNAGER
Achampet      A251       Achampet      ACHEMPET
ACHEMPET      A251       Achampet      ACHEMPET
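
Since the question also allows tools other than SQL, here is a minimal Python sketch of the same idea done outside the database. It swaps SOUNDEX for a plain string-similarity ratio from the standard library's difflib, which is not tied to English pronunciation. The helper names normalize and near_duplicate_pairs and the 0.8 threshold are assumptions made for illustration, not part of the answer above; the threshold would need tuning on real data.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name):
    # Compare names case-insensitively and without spaces,
    # mirroring REPLACE(town, ' ', '') in the SQL above.
    return name.upper().replace(" ", "")


def near_duplicate_pairs(names, threshold=0.8):
    # threshold is a hypothetical cut-off chosen for this sample data.
    distinct = sorted(set(names))
    pairs = []
    for a, b in combinations(distinct, 2):
        if a.upper() == b.upper():
            # Variations in capitalization only are not reported.
            continue
        ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if ratio >= threshold:
            pairs.append((a, b))
    return pairs


towns = [
    "Achampet", "ACHEMPET", "AGIA", "AGIYA",
    "ASHOK NAGAR", "ASHOKNAGAR", "ASHOKNAGER",
    "Downtown", "DOWNTOWN", "DownTown",
]

pairs = near_duplicate_pairs(towns)
for a, b in pairs:
    print(a, "<->", b)
```

Comparing every pair is quadratic in the number of distinct towns, so for a large table it may help to compare only names that share a first letter or a SOUNDEX code.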

Related

Why Hibernate HSQL Concat is not working for MSSQL?

So, I have Hibernate 5.3.1 in a project that connects to different engines (MySQL, Oracle, PostgreSQL and MS SQL), so I can't use native queries.
Let's say I have 3 records in a table, all with the same datetime, but I need to group them only by date (not time), for example 2019-12-04.
I execute this query:
SELECT
CONCAT(year(tx.date_), month(tx.date_), day(tx.date_)),
iss.code,
COUNT(tx.id)
FROM
tx_ tx
JOIN
issuer_ iss
ON
tx.id_issuer = iss.id
GROUP BY
CONCAT(year(tx.date_), month(tx.date_), day(tx.date_)), iss.code
But when I test it against SQL Server 2017, instead of returning 20191204 it returns 2035. In Oracle and MySQL it works fine.
Does anyone have any idea why this happens? I've tried different approaches, such as using + instead of CONCAT, but the result is the same.
I've also tried extracting the parts separately (without CONCAT), and they come back correct. The problem is that I need to group by the complete date.
And just for the record, the field is declared as datetime2 in the database.

How about simply adding them, instead of using CONCAT.
(year(tx.date_)*10000 + month(tx.date_)*100 + day(tx.date_)*1) AS datenum
Thus, try this:
SELECT
CAST((year(tx.date_)*10000 + month(tx.date_)*100 + day(tx.date_)*1) AS string) AS datenum,
iss.code
FROM tx_ tx
JOIN issuer_ iss
ON tx.id_issuer = iss.id
GROUP BY year(tx.date_), month(tx.date_), day(tx.date_), iss.code
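
The 2035 reported in the question matches what you get when the three integer parts are added instead of concatenated (2019 + 12 + 4). The arithmetic can be checked with a small Python sketch; the function names here are hypothetical and only illustrate the numbers, not what Hibernate generates.

```python
def summed(year, month, day):
    # What happens when the parts are treated as numbers and added.
    return year + month + day


def datenum(year, month, day):
    # The weighted sum suggested above: each part gets its own digits.
    return year * 10000 + month * 100 + day


def naive_concat(year, month, day):
    # String concatenation without zero padding.
    return f"{year}{month}{day}"


def padded_concat(year, month, day):
    # String concatenation with fixed-width parts.
    return f"{year:04d}{month:02d}{day:02d}"


print(summed(2019, 12, 4))         # 2035
print(datenum(2019, 12, 4))        # 20191204
print(naive_concat(2019, 12, 4))   # 2019124
print(naive_concat(2019, 1, 24))   # 2019124
print(padded_concat(2019, 12, 4))  # 20191204
```

Note that concatenating unpadded parts can give the same key for two different dates (2019-12-04 and 2019-01-24 above), so the weighted sum or a zero-padded form is the safer grouping key.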

Thanks to Gert Arnold for the hint. I just didn't realize that the query was adding the values as if they were numbers in MSSQL.
Finally, I managed to make it work in all 4 RDBMSs by casting to string first:
SELECT
CONCAT(CAST(year(tx.date_) AS string), CAST(month(tx.date_) AS string), CAST(day(tx.date_) AS string)),
iss.code
FROM
tx_ tx
JOIN
issuer_ iss
ON
tx.id_issuer = iss.id
GROUP BY
CONCAT(year(tx.date_), month(tx.date_), day(tx.date_)), iss.code
I also tried casting to TEXT, but that throws an exception in MySQL.

Why use concat() to begin with?
Assuming Hibernate takes care of converting the non-standard year(), month() and day() functions, then the following should work on any DBMS
SELECT year(tx.date_), month(tx.date_), day(tx.date_), iss.code
FROM tx_ tx
JOIN issuer_ iss ON tx.id_issuer = iss.id
GROUP BY year(tx.date_), month(tx.date_), day(tx.date_), iss.code

How to get the result of CONNECT_BY_ISCYCLE and CONNECT_BY_ISLEAF in snowflake without using them?

I need to write hierarchical queries, and I need the results of CONNECT_BY_ISCYCLE and CONNECT_BY_ISLEAF, but these features are supported in Oracle and not in Snowflake.
What are the alternative ways to implement the functionality of CONNECT_BY_ISCYCLE and CONNECT_BY_ISLEAF in Snowflake, given that these keywords are not supported there?

Wonder if you have taken a look at the following Snowflake features?
https://docs.snowflake.net/manuals/user-guide/queries-hierarchical.html#using-connect-by-or-recursive-ctes-to-query-hierarchical-data

Yes, I took a look there. I also took a look at https://docs.snowflake.net/manuals/sql-reference/constructs/connect-by.html where it clearly says that these features are not supported in Snowflake.
I was trying the code block below to find an alternative, but I am getting a variety of errors in Snowflake.
person_vertex as (
select
emp_number,
user_id
from person
),
person_edges as (
select
supervisor_emp_number,
emp_number
from person
where supervisor_emp_number is not null
),
select
pv.emp_number emp_id_pk,
level,
CONNECT_BY_ROOT pv.emp_number AS root,
concat(SYS_CONNECT_BY_PATH(pv.emp_number,':'),':') as path,
-- CONNECT_BY_ISCYCLE AS iscyclic, ------------------- no idea how to implement this
-- CONNECT_BY_ISLEAF as isleaf ------------------- i tried below block, but it is not working
case
when (pe.supervisor_emp_number in (select emp_number from pv)) then 0
else 1
end AS isleaf
from person_vertex pv
left join person_edges pe on pv.emp_number = pe.emp_number
connect by prior A.emp_number = A.supervisor_emp_number
start with A.supervisor_emp_number is null
Any help with this block is really appreciated.
Thanks.
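
As a way to reason about what the two pseudocolumns mean, here is a minimal Python sketch that walks a supervisor/employee edge list and derives both flags from the path taken so far. It assumes the usual reading: a row is a leaf when it has no children left to descend into, and a row is flagged as cyclic when one of its children already appears on its own path. The function name walk_hierarchy and the sample edges are hypothetical. The same path-tracking idea could in principle be carried in a recursive CTE, which the linked Snowflake page lists as an alternative to CONNECT BY.

```python
def walk_hierarchy(edges, roots):
    # edges: list of (supervisor_emp_number, emp_number) pairs.
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)

    rows = []

    def visit(node, path):
        kids = children.get(node, [])
        # Cyclic: some child is already an ancestor of (or is) this node.
        is_cycle = int(any(k in path for k in kids))
        # Do not descend into ancestors, otherwise the walk never ends.
        next_kids = [k for k in kids if k not in path]
        is_leaf = int(not next_kids)
        rows.append({
            "emp": node,
            "level": len(path),
            "root": path[0],
            "path": ":" + ":".join(str(p) for p in path) + ":",
            "isleaf": is_leaf,
            "iscycle": is_cycle,
        })
        for k in next_kids:
            visit(k, path + [k])

    for r in roots:
        visit(r, [r])
    return rows


# Hypothetical data: 1 supervises 2 and 3, 2 supervises 4, and 4 points back to 2.
rows = walk_hierarchy([(1, 2), (1, 3), (2, 4), (4, 2)], roots=[1])
for row in rows:
    print(row)
```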

Give reasons as to why my AND clause returns an empty string in SQL

I am new to SQL, and it does not seem to work like traditional coding. Anyway, I am trying to figure out why my results end up empty, but only when the first AND condition is present. If I remove that condition, the code works. The syntax seems correct. What I am trying to do is match channel names that end in 'P' with names that end in 'HDP', and not match on channel numbers. Maybe I am wrong on the syntax. Any help on this matter would be appreciated. Also, I am using Microsoft SQL Server Management Studio 2012.
How the results should look:
SELECT a.ChannelNumber AS "Standard Channel",
a.DisplayName AS "Standard Name",
b.ChannelNumber AS "HD Channel",
b.DisplayName AS "HD Name"
FROM CHANNEL a CROSS JOIN CHANNEL b
WHERE b.ChannelNumber <> a.ChannelNumber
AND b.DisplayName = a.DisplayName /*this is what is giving me problems*/
AND RIGHT(b.DisplayName, 3) LIKE '%HDP'
AND RIGHT(a.DisplayName, 1) LIKE '%P';

Ultimately you want things like AETVP and AETVHDP to be "equal". This doesn't seem like a use case for a Cross Join. You can break this down with a CTE.
First you'll define your HD channels, then your Standard Channels. In each of those blocks you can get the core part of the channel's name (the part without the P or HDP). Then join those together on the CoreName. This will enable us to join AETV to AETV
WITH HdChannels
AS (
SELECT *
,CoreName = left(DisplayName, len(DisplayName) - len('HDP'))
FROM Channel
WHERE displayName LIKE '%HDP'
)
,StdChannels
AS (
SELECT *
,CoreName = left(DisplayName, len(DisplayName) - len('P'))
FROM Channel
WHERE displayName LIKE '%P'
AND displayName NOT LIKE '%HDP'
)
SELECT std.ChannelNumber AS [Standard Channel]
,std.DisplayName AS [Standard Name]
,hd.ChannelNumber AS [HD Channel]
,hd.DisplayName AS [HD Name]
FROM HdChannels hd
INNER JOIN StdChannels std ON std.CoreName = hd.CoreName
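
The core-name join above can be sketched in a few lines of Python to make the matching rule explicit. The function name pair_channels is hypothetical, the channels ABCP and ABCHDP are borrowed from the other answer's sample, and XYZP is made up to show an unmatched standard channel. The sketch also assumes each core name occurs at most once per kind.

```python
def pair_channels(channels):
    # channels: list of (ChannelNumber, DisplayName) tuples.
    hd = {}
    std = {}
    for number, name in channels:
        if name.endswith("HDP"):
            # Core name is the display name without the trailing 'HDP'.
            hd[name[:-3]] = (number, name)
        elif name.endswith("P"):
            # Core name is the display name without the trailing 'P'.
            std[name[:-1]] = (number, name)
    # Inner join on the core name: keep standard channels that have an HD twin.
    return [std[core] + hd[core] for core in sorted(std) if core in hd]


matched = pair_channels([(3, "ABCP"), (25, "ABCHDP"), (7, "XYZP")])
print(matched)
```

The HD check has to come first because every name ending in 'HDP' also ends in 'P', which is the same reason the SQL needs the NOT LIKE '%HDP' filter.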

To answer your question,
"Give reasons as to why my AND clause returns an empty string in SQL"
it's because you've declared a.DisplayName = b.DisplayName in the WHERE. And that can't be the case according to the picture of the output you've linked to because the Display names are spelled differently.
The only difference between standard and HD channel names is that HD names end with "HDP". Standard names never end with "HDP", though they do end with a "P".
In the absence of sample data, I've included the most basic example I could think of, using a table variable.
DECLARE @CHANNEL TABLE(ChannelNumber int, DisplayName varchar(100))
INSERT INTO @CHANNEL VALUES
(3, 'ABCP'), (25, 'ABCHDP')
SELECT a.ChannelNumber AS "Standard Channel",
a.DisplayName AS "Standard Name",
b.ChannelNumber AS "HD Channel",
b.DisplayName AS "HD Name"
FROM @CHANNEL a CROSS JOIN @CHANNEL b
WHERE LEFT(a.DisplayName, LEN(a.DisplayName) - 1) + 'HDP' = b.DisplayName
AND a.DisplayName NOT LIKE '%HDP'
AND b.DisplayName LIKE '%HDP'
AND a.ChannelNumber <> b.ChannelNumber
Produces output:
Standard Channel Standard Name HD Channel HD Name
3 ABCP 25 ABCHDP
The algorithm identifies standard channels (NOT LIKE '%HDP') and HD channels (LIKE '%HDP') on the left and right sides of the CROSS JOIN.
Notice that in your code you wrote AND RIGHT(b.DisplayName, 3) LIKE '%HDP'. It is unnecessary to wrap the column in RIGHT with a character count when LIKE '%HDP' already matches the end of the string.
LEFT(a.DisplayName, LEN(a.DisplayName) - 1) + 'HDP' cuts off the last character of the standard channel's DisplayName (which is always a 'P' by its naming convention) and appends 'HDP' to the result. This is compared to the DisplayName of HD channels, which always ends with 'HDP'.
When the conditions match you get a row of data.
Looking at the filtering conditions- you can see that a.DisplayName can never equal b.DisplayName

Attempting to run a while loop in my select statement under cases in SQL Server 2012

The Data
Let us say I have a field in SQL that consists of multi-line Information, each of which consists of i topics, each topic consisting of m points of information. Topics are prefaced with 'i.' and information with a dash. It looks something like:
________________________________________________
|Number | Information
|===============================================
|1 | 1. Topic 1.1
| | -Info 1.1.1
| | - ... [more info]
| | 2. Topic 1.2
| | -Info 1.2.1
| | - ...[more info]
| | ... [more topics]
|_______|_____________________________
|2 | 1. Topic 2.1
|....and so on
The Current System
What I am doing with this information is parsing out each topic into its own column, then unpivoting those columns and searching for Topics that contain a given keyword @keyword.
Currently the code reads something like:
Select
Number
,Case When Information LIKE '%1. %2. %'
Then substring (Information, charindex('1.',Information),
charindex('2.', Information) -(charindex('1.',Information)+2) )
Else Information
End as [Topic1]
,Case When Information LIKE '%2. %3. %'
Then substring (Information, charindex('2.',Information),
charindex('3.', Information) -(charindex('2.',Information)+2) )
Else 'N/A'
End as [Topic2]
...repeat 2nd case for each set of numbers up to '%20. %21. %'
The only reason the first one is different is that if it doesn't match the pattern, I want to grab the whole field so that I don't miss anything. I then unpivot the Topic fields that I just created into a general [Topic] field, and then use WHERE [Topic] LIKE '%' +@keyword+'%' to pull out any particular topics and their associated case number to output as my final table. The cases can have anywhere from 1 to 40+ topics attached, with 1-7 attached info fields per topic.
The Desired Modification
Notice: To make the code easier to read, I will not be writing my substring code in proper syntax, instead opting to write substring(Information,ci(@Iter), ci(@Iter+1)-ci(@Iter)) to denote the substring running from the position given by '(iter).' to the position given by '(iter+1).'
What I would like to do is to perform the following:
Declare @Iter smallint
Declare @Result varchar(max)
Select
Number
, Set @Iter=1
Set @Result = ' '
Case When Information LIKE '%'+@keyword+'%' --keyword chosen at front end
Then While @Iter < @n --@n set by the user from front end
Begin
Case When Information LIKE '%' + cast(@Iter as varchar(5))
+ '. %'+cast((@Iter+1) as varchar(5))+'. %'
and substring(Information,ci(@Iter), ci(@Iter+1)-ci(@Iter) )
LIKE '%'+@keyword+'%'
Then Set @Result = @Result +substring(Information,ci(@Iter),
ci(@Iter+1)-ci(@Iter) )
Else Set @Result = @Result end
Set @Iter = @Iter +1
End
Else ' ' end [Result]
The Explanation
In case what I want isn't clear, I'll run through what I'm trying to accomplish
I want to output a list of case numbers that include Topics that include the keyword.
For each case in the list I want to output only those topics that include the keyword.
I want to allow the end user of the report to choose how many Topics in each case they'll search.
I don't want to have to create a table with a column for each Topic when I can't know how many the user will want to create.
Due to these considerations it feels like a loop would be the best option, but there are problems in trying to accomplish that.
The Problem
SQL server won't allow me to utilize a loop in my Select statement--Incorrect syntax near 'While'.
The place where the information comes from prohibits normalization of the information in the table I'm searching
Even if it didn't I am barred from creating my own permanent tables at work, so I can't normalize the data for all incoming data
I am also not allowed to write my own stored procedures.
If there is any way (for example through a cte) to implement these changes, I'm open to hearing them! I'm mostly looking at ways to make the code less daunting looking (20 cases to produce 20 fields in my current cte looks scary, which then needs 3 ctes just to unpack properly [unpivot, removal of certain cases meeting certain conditions, combination into a workable output table])
Thanks in advance for reading this and helping!

I think you're working too hard.
If all you need are topic names and numbers, isn't it easier to split the Information column by newlines and then collect all lines that start with a number rather than a dash? By then, you will have a list of strings that look like:
Topic 1.1
Topic 2.1
And then it's easy to just match the lines against the keyword?
Something like this untested SQL:
select SUBSTRING(s.value, 1, PATINDEX('% %', s.value) - 1) AS topicId
, SUBSTRING(s.value, PATINDEX('% %', s.value), LEN(s.value)) AS topicText
from [table that would make Codd cry] t
cross apply STRING_SPLIT(t.Information, CHAR(13)) s
where s.value LIKE '[0-9]%' -- Starts with a number
AND s.value LIKE @keywords --matches keywords
Not sure if you can create functions or you have STRING_SPLIT available in your SQL Server version, but if you don't, there are some string splitting CTEs you can find on the net to do the job for you
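
To make the split-by-line idea concrete, here is a minimal Python sketch of the same logic. Unlike the SQL above, it keeps each topic's dash lines attached to the topic so that a keyword found in the info lines also counts; that extension, the function name topics_matching, and the sample text are assumptions for illustration.

```python
import re


def topics_matching(information, keyword):
    topics = []
    for line in information.splitlines():
        line = line.strip()
        if not line:
            continue
        if re.match(r"\d+\.\s", line):
            # A line such as '2. Topic 1.2' starts a new topic.
            topics.append([line])
        elif topics:
            # Dash lines belong to the most recent topic.
            topics[-1].append(line)
        else:
            # Text before the first numbered topic is kept as its own entry.
            topics.append([line])
    joined = ["\n".join(t) for t in topics]
    # Case-insensitive keyword match, similar to LIKE '%keyword%'.
    return [t for t in joined if keyword.lower() in t.lower()]


sample = "1. Topic 1.1\n-Info 1.1.1\n2. Topic 1.2\n-Info 1.2.1"
print(topics_matching(sample, "1.2.1"))
```

Because the topics are collected in a list, there is no need to know in advance how many topics a case has.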

How do I update an XML column in sql server by checking for the value of two nodes including one which needs to do a contains (like) comparison

I have an xml column called OrderXML in an Orders table...
there is an XML XPath like this in the table...
/Order/InternalInformation/InternalOrderBreakout/InternalOrderHeader/InternalOrderDetails/InternalOrderDetail
There, InternalOrderDetails contains many InternalOrderDetail nodes like this...
<InternalOrderDetails>
<InternalOrderDetail>
<Item_Number>FBL11REFBK</Item_Number>
<CountOfNumber>10</CountOfNumber>
<PriceLevel>FREE</PriceLevel>
</InternalOrderDetail>
<InternalOrderDetail>
<Item_Number>FCL13COTRGUID</Item_Number>
<CountOfNumber>2</CountOfNumber>
<PriceLevel>NONFREE</PriceLevel>
</InternalOrderDetail>
</InternalOrderDetails>
My end goal is to modify the XML in the OrderXML column IF the Item_Number of the node contains COTRGUID (like '%COTRGUID') AND the PriceLevel=NONFREE. If that condition is met, I want to change the PriceLevel value to FREE.
I am having trouble both with creating the XPath expression that finds the correct nodes (using the OrderXML.value or OrderXML.exist functions) and with updating the XML using the OrderXML.modify function.
I have tried the following for the where clause:
WHERE OrderXML.value('(/Order/InternalInformation/InternalOrderBreakout/InternalOrderHeader/InternalOrderDetails/InternalOrderDetail/Item_Number/node())[1]','nvarchar(64)') like '%13COTRGUID'
That does work, but it seems to me that I need to ALSO include my second condition (PriceLevel=NONFREE) in the same where clause and I cannot figure out how to do it. Perhaps I can put in an AND for the second condition like this...
AND OrderXML.value('(/Order/InternalInformation/InternalOrderBreakout/InternalOrderHeader/InternalOrderDetails/InternalOrderDetail/PriceLevel/node())[1]','nvarchar(64)') = 'NONFREE'
but I am afraid it will end up operating like an OR since it is an XML query.
Once I get the WHERE clause right I will update the column using a SET like this:
UPDATE Orders SET orderXml.modify('replace value of (/Order/InternalInformation/InternalOrderBreakout/InternalOrderHeader/InternalOrderDetails/InternalOrderDetail/PriceLevel[1]/text())[1] with "NONFREE"')
However, I ran this statement on some test data and none of the XML columns were updated (even though it said zz rows affected).
I have been at this for several hours to no avail. Help is appreciated. Thanks.

If you don't have more than one node matching your condition in each row of the Orders table, you can use this:
update orders set
data.modify('
replace value of
(
/Order/InternalInformation/InternalOrderBreakout/
InternalOrderHeader/InternalOrderDetails/
InternalOrderDetail[
Item_Number[contains(., "COTRGUID")] and
PriceLevel="NONFREE"
]/PriceLevel/text()
)[1]
with "FREE"
');
sql fiddle demo
If you could have more than one matching node in one row, there are several possible solutions, none of which is really elegant, sadly.
You can reconstruct all xmls in table - sql fiddle demo
or you can do your updates in the loop - sql fiddle demo
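
The condition in the XPath predicate above (Item_Number contains COTRGUID and PriceLevel equals NONFREE) can be illustrated outside SQL Server with Python's standard xml.etree.ElementTree. This is only a sketch of the matching rule applied to every matching node, not a replacement for modify(); the function name free_matching_details is hypothetical and the sample XML is the fragment from the question.

```python
import xml.etree.ElementTree as ET


def free_matching_details(order_xml):
    root = ET.fromstring(order_xml)
    changed = 0
    # Visit every InternalOrderDetail, not just the first one.
    for detail in root.iter("InternalOrderDetail"):
        item = detail.findtext("Item_Number", default="")
        price = detail.find("PriceLevel")
        if "COTRGUID" in item and price is not None and price.text == "NONFREE":
            price.text = "FREE"
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


sample_xml = (
    "<InternalOrderDetails>"
    "<InternalOrderDetail>"
    "<Item_Number>FBL11REFBK</Item_Number>"
    "<CountOfNumber>10</CountOfNumber>"
    "<PriceLevel>FREE</PriceLevel>"
    "</InternalOrderDetail>"
    "<InternalOrderDetail>"
    "<Item_Number>FCL13COTRGUID</Item_Number>"
    "<CountOfNumber>2</CountOfNumber>"
    "<PriceLevel>NONFREE</PriceLevel>"
    "</InternalOrderDetail>"
    "</InternalOrderDetails>"
)

updated_xml, changed = free_matching_details(sample_xml)
print(changed)
print(updated_xml)
```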

This may get you off the hump.
Replace #HolderTable with the name of your table.
SELECT T2.myAlias.query('./../PriceLevel[1]').value('.' , 'varchar(64)') as MyXmlFragmentValue
FROM #HolderTable
CROSS APPLY OrderXML.nodes('/InternalOrderDetails/InternalOrderDetail/Item_Number') as T2(myAlias)
SELECT T2.myAlias.query('.') as MyXmlFragment
FROM #HolderTable
CROSS APPLY OrderXML.nodes('/InternalOrderDetails/InternalOrderDetail/Item_Number') as T2(myAlias)
EDIT:
UPDATE
#HolderTable
SET
OrderXML.modify('replace value of (/InternalOrderDetails/InternalOrderDetail/PriceLevel/text())[1] with "MyNewValue"')
WHERE
OrderXML.value('(/InternalOrderDetails/InternalOrderDetail/PriceLevel)[1]', 'varchar(64)') = 'FREE'
print @@ROWCOUNT
Your issue is the [1] in the above.
Why did I put it there?
Here is a sentence from the URL listed below.
Note that the target being updated must be, at most, one node that is explicitly specified in the path expression by adding a "[1]" at the end of the expression.
http://msdn.microsoft.com/en-us/library/ms190675.aspx
EDIT.
I think I've discovered the root of your frustration. (No fix, just the problem).
Note below, the second query works.
So I think the [1] in some cases is saying "only search the first node".....and not (as you and I were hoping)...... "use the first node..after you find a match".
UPDATE
#HolderTable
SET
OrderXML.modify('replace value of (/InternalOrderDetails/InternalOrderDetail/PriceLevel/text())[1] with "MyNewValue001"')
WHERE
OrderXML.value('(/InternalOrderDetails/InternalOrderDetail/PriceLevel[text() = "NONFREE"])[1]', 'varchar(64)') = 'NONFREE'
/* and OrderXML.value('(/InternalOrderDetails/InternalOrderDetail/Item_Number)[1]', 'varchar(64)') like '%COTRGUID' */
UPDATE
#HolderTable
SET
OrderXML.modify('replace value of (/InternalOrderDetails/InternalOrderDetail/PriceLevel/text())[1] with "MyNewValue002"')
WHERE
OrderXML.value('(/InternalOrderDetails/InternalOrderDetail/PriceLevel[text() = "FREE"])[1]', 'varchar(64)') = 'FREE'

Try this :
;with InternalOrderDetail as (SELECT id,
Tbl.Col.value('Item_Number[1]', 'varchar(40)') Item_Number,
Tbl.Col.value('CountOfNumber[1]', 'int') CountOfNumber,
case
when Tbl.Col.value('Item_Number[1]', 'varchar(40)') like '%COTRGUID'
and Tbl.Col.value('PriceLevel[1]', 'varchar(40)')='NONFREE'
then 'FREE'
else
Tbl.Col.value('PriceLevel[1]', 'varchar(40)')
end
PriceLevel
FROM (select id ,orderxml from demo)
as a cross apply orderxml.nodes('//InternalOrderDetail')
as
tbl(col) ) ,
cte_data as(SELECT
ID,
'<InternalOrderDetails>'+(SELECT ITEM_NUMBER,COUNTOFNUMBER,PRICELEVEL
FROM InternalOrderDetail
where ID=Results.ID
FOR XML AUTO, ELEMENTS)+'</InternalOrderDetails>' as XML_data
FROM InternalOrderDetail Results
GROUP BY ID)
update demo set orderxml=cast(xml_data as xml)
from demo
inner join cte_data on demo.id=cte_data.id
where cast(orderxml as varchar(2000))!=xml_data;
select * from demo;
SQL Fiddle
I have handled following cases :
1. As required both where clause in question.
2. It will update all <Item_Number> like '%COTRGUID' and <PriceLevel>= NONFREE in one
node, not just the first one.
It may require minor changes for your data and tables.
