SQL String: Counting Words inside a String - sql-server

I searched through many of the questions here, but the only ones with decent answers are for other languages such as JavaScript.
I have a simple task in SQL that I can't seem to find a simple way to do.
I just need to count the number of "words" inside a SQL string (a sentence). You can see why "words" is in quotes in my examples. The "words" are delimited by white space.
Sample sentences:
1. I am not your father.
2. Where are your brother,sister,mother?
3. Where are your brother, sister and mother?
4. Who are you?
Desired answer:
1. 5
2. 4
3. 7
4. 3
As you can see, I need to count the "words" disregarding the symbols (I have to treat them as part of the word). So in sample no. 2:
(1)Where (2)are (3)your (4)brother,sister,mother? = 4
I can handle the multiple whitespaces by doing a replace like this:
REPLACE(string, '  ', ' ') -> 2 whitespaces to 1
REPLACE(string, '   ', ' ') -> 3 whitespaces to 1 and so on..
What SQL function can I use to do this? I use SQL Server 2012 but need a function that works in SQL Server 2008 as well.

Here is one way to do it:
Create and populate sample table (Please save us this step in your future questions)
DECLARE @T AS TABLE
(
id int identity(1,1),
string varchar(100)
)
INSERT INTO @T VALUES
('I am not your father.'),
('Where are your brother,sister,mother?'),
('Where are your brother, sister and mother?'),
('Who are you?')
Use a cte to replace multiple spaces with a single space (Thanks to Gordon Linoff's answer here)
;WITH CTE AS
(
SELECT Id,
REPLACE(REPLACE(REPLACE(string, ' ', '><' -- Each single space becomes '><'
), '<>', '' -- Adjacent pairs collapse, leaving one '><' per run of spaces
), '><', ' ' -- The surviving '><' becomes a single space
) as string
FROM @T
)
Query the CTE - length of the string - length of the string without spaces + 1:
SELECT id, LEN(string) - LEN(REPLACE(string, ' ', '')) + 1 as CountWords
FROM CTE
Results:
id CountWords
1 5
2 4
3 7
4 3
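A sketch of the same idea for a single variable, in case the strings can also carry leading or trailing spaces. The variable names are just for illustration, and it assumes the text never contains the '<' or '>' characters, since the replace trick uses them as placeholders:

```sql
-- Hypothetical sample value with leading, trailing and repeated spaces
DECLARE @s varchar(100) = '  Who   are you?  ';

-- Trim both ends, then collapse every run of spaces into one space
DECLARE @clean varchar(100) =
    REPLACE(REPLACE(REPLACE(LTRIM(RTRIM(@s)), ' ', '><'), '<>', ''), '><', ' ');

-- Spaces between words + 1, or 0 when nothing is left after trimming
SELECT CASE WHEN @clean = '' THEN 0
            ELSE LEN(@clean) - LEN(REPLACE(@clean, ' ', '')) + 1
       END AS CountWords; -- 3 for the sample value
```

Trimming first matters because LEN ignores trailing spaces but still counts leading ones, so an untrimmed string can report one word too many.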

This is a minor improvement of @ZoharPeled's answer. This can also handle 0 length values:
DECLARE @t AS TABLE(id int identity(1,1), string varchar(100))
INSERT INTO @t VALUES
('I am not your father.'),
('Where are your brother,sister,mother?'),
('Where are your brother, sister and mother?'),
('Who are you?'),
('')
;WITH CTE AS
(
SELECT
Id,
REPLACE(REPLACE(string,' ', '><'), '<>', '') string
FROM @t
)
SELECT
id,
LEN(' '+string)-LEN(REPLACE(string, '><', ' ')) CountWords
FROM CTE

To handle multiple spaces too, use the method shown here
Declare @s varchar(100)
set @s='Who are you?'
set @s=ltrim(rtrim(@s))
-- keep replacing double spaces until none are left
while charindex('  ',@s)>0
Begin
set @s=replace(@s,'  ',' ')
end
select len(@s)-len(replace(@s,' ',''))+1 as word_count
https://exploresql.com/2018/07/31/how-to-count-number-of-words-in-a-sentence/

I found this query more useful than the first. It omits extra characters, numbers and symbols, so it counts just the words within a passage (note that string_split needs SQL Server 2016 or later)...
drop table if exists #t
create table #t (id int identity(1,1), c1 varchar(2000))
insert into #t (c1)
values
('Alireza Sattarzadeh Farkoush '),
('yes it is the   best .'),
('abc def ghja a the . asw'),
('?>< 123 ...!  z a b'),
('Wallex is   the greatest exchange in the .. world a after binance ...!')
select c1 , Count(*)
from (
select id, c1, value  
from #t t
cross apply (
select rtrim(ltrim(value)) as value from string_split(c1,' ')) a
where len(value) > 1 and value like '%[a-Z]%'
) Final
group by c1

Related

Replace specials chars with HTML entities

I have the following in table TABLE
id content
-------------------------------------
1 Hellö world, I äm text
2 ènd there äré many more chars
3 that are speçial in my dat£base
I now need to export these records into HTML files, using bcp:
set @command = 'bcp "select [content] from [TABLE] where [id] = ' +
@id + '" queryout "' + @filename + '.html" -S ' + @instance +
' -c -U ' + @username + ' -P ' + @password
exec xp_cmdshell @command, no_output
To make the output look correct, I need to first replace all special characters with their respective HTML entities (pseudo)
insert into [#temp_html] ..
replace(replace([content], 'ö', '&ouml;'), 'ä', '&auml;')
But by now, I have 30 nested replaces and it's starting to look insane.
After much searching, I found this post which uses a HTML conversion table but it is too advanced for me to understand:
The table does not list the special chars itself as they are in my text (ö, à etc) but UnicodeHex. Do I need to add them to the table to make the conversions that I need?
I am having trouble understanding how to update my script to replace all special chars. Can someone please show me a snippet of (pseudo) code?
One way to do that with a translation table is using a recursive cte to do the replaces, and one more cte to get only the last row of each translated value.
First, create and populate sample table (Please save us this step in your future questions):
DECLARE @T AS TABLE
(
id int,
content nvarchar(100)
)
INSERT INTO @T (id, content) VALUES
(1, 'Hellö world, I äm text'),
(2, 'ènd there äré many more chars'),
(3, 'that are speçial in my dat£base')
Then, create and populate the translation table (I don't know the HTML entities for these chars, so I've just used numbers [plus it's easier to see in the results]). Also, please note that this can be done using yet another cte in the chain.
DECLARE @Translations AS TABLE
(
str nchar(1),
replacement nvarchar(10)
)
INSERT INTO @Translations (str, replacement) VALUES
('ö', '-1-'),
('ä', '-2-'),
('è', '-3-'),
('ä', '-4-'),
('é', '-5-'),
('ç', '-6-'),
('£', '-7-')
Now, the first cte will do the replaces, and the second cte just adds a row_number so that for each id, the last value of lvl will get 1:
;WITH CTETranslations AS
(
SELECT id, content, 1 As lvl
FROM @T
UNION ALL
SELECT id, CAST(REPLACE(content, str, replacement) as nvarchar(100)), lvl+1
FROM CTETranslations
JOIN @Translations
ON content LIKE '%' + str + '%'
), cteNumberedTranslation AS
(
SELECT id, content, ROW_NUMBER() OVER(PARTITION BY Id ORDER BY lvl DESC) rn
FROM CTETranslations
)
Select from the second cte where rn = 1, I've joined the original table to show the source and translation side by side:
SELECT r.id, s.content, r.content
FROM @T s
JOIN cteNumberedTranslation r
ON s.Id = r.Id
WHERE rn = 1
ORDER BY Id
Results:
id content content
1 Hellö world, I äm text Hell-1- world, I -4-m text
2 ènd there äré many more chars -3-nd there -4-r-5- many more chars
3 that are speçial in my dat£base that are spe-6-ial in my dat-7-base
Please note that if your content has more than 100 special chars, you will need to add the maxrecursion 0 hint to the final select:
SELECT r.id, s.content, r.content
FROM @T s
JOIN cteNumberedTranslation r
ON s.Id = r.Id
WHERE rn = 1
ORDER BY Id
OPTION ( MAXRECURSION 0 );
See a live demo on rextester.
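For the actual export, the translation table could hold the real HTML entity names instead of numbers. A minimal sketch, assuming only the six characters from the sample data and a 400-character column width that I picked to leave room for the longer replacements (neither comes from the original post):

```sql
-- Sample row; nvarchar(400) is an assumed width so the longer entities fit
DECLARE @T AS TABLE (id int, content nvarchar(400));
INSERT INTO @T (id, content) VALUES (1, N'Hellö world, I äm text');

-- Translation table using standard HTML entity names
DECLARE @Translations AS TABLE (str nchar(1), replacement nvarchar(10));
INSERT INTO @Translations (str, replacement) VALUES
(N'ö', N'&ouml;'), (N'ä', N'&auml;'), (N'è', N'&egrave;'),
(N'é', N'&eacute;'), (N'ç', N'&ccedil;'), (N'£', N'&pound;');

;WITH CTETranslations AS
(
    SELECT id, content, 1 AS lvl
    FROM @T
    UNION ALL
    -- One replacement per recursion level, for every char still present
    SELECT c.id,
           CAST(REPLACE(c.content, t.str, t.replacement) AS nvarchar(400)),
           c.lvl + 1
    FROM CTETranslations AS c
    JOIN @Translations AS t
      ON c.content LIKE N'%' + t.str + N'%'
), cteNumberedTranslation AS
(
    SELECT id, content,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY lvl DESC) AS rn
    FROM CTETranslations
)
SELECT id, content
FROM cteNumberedTranslation
WHERE rn = 1;
```

One thing to watch with this approach: a row that maps '&' to '&amp;' cannot simply be added to the table, because every replacement contains '&' itself and the recursion would keep matching it.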

TSQL: How insert separator between each character in a string

I have a string like this:
Apple
I want to include a separator after each character so the end result will turn out like this:
A,p,p,l,e
In C#, we have a one-liner to achieve the above with Regex.Replace("Apple", ".{1}", "$0,");
I can only think of looping over each character with charindex to append the separator, but that seems a little complicated. Is there a simpler, more elegant way to achieve this?
Thanks HABO for the suggestions. I was able to generate the result I want using the code, but it took me a while to really understand how it works.
After some searching, I managed to find a useful article on inserting empty spaces between characters, which is easier for me to understand.
I modified the code a little to define and include the desired separator instead of fixing it to a space:
DECLARE @pos INT = 2 -- location where we want first space
DECLARE @result VARCHAR(100) = 'Apple'
DECLARE @separator nvarchar(5) = ','
WHILE @pos < LEN(@result)+1
BEGIN
SET @result = STUFF(@result, @pos, 0, @separator);
SET @pos = @pos+2;
END
select @result; -- Output: A,p,p,l,e
Reference
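The loop above moves forward by 2 each time, which only fits a one-character separator. A sketch of a variant that steps by the separator length instead, assuming varchar data in a single-byte collation so that DATALENGTH equals the character count (the @step variable is my own addition):

```sql
DECLARE @result varchar(100) = 'Apple';
DECLARE @separator varchar(5) = ', ';
-- DATALENGTH rather than LEN, because LEN ignores the trailing space
DECLARE @step int = DATALENGTH(@separator) + 1;
DECLARE @pos int = 2; -- position of the second character

WHILE @pos <= DATALENGTH(@result)
BEGIN
    -- Insert the separator in front of the character at @pos
    SET @result = STUFF(@result, @pos, 0, @separator);
    -- Skip the separator plus the character it was inserted before
    SET @pos = @pos + @step;
END

SELECT @result; -- A, p, p, l, e
```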
In the following SQL scripts, I get each character using the SUBSTRING() function with a numbers table (I used the spt_values view here for simplicity) and then concatenate them via two different methods; you can choose one.
If you are using SQL Server 2017, there is a new SQL string aggregation function.
The first script uses the string_agg function:
declare @str nvarchar(max) = 'Apple'
SELECT
string_agg( substring(@str,number,1) , ',') Within Group (Order By number)
FROM master..spt_values n
WHERE
Type = 'P' and
Number between 1 and len(@str)
If you are working with a previous version, you can use string concatenation with FOR XML PATH and the STUFF function as follows:
declare @str nvarchar(max) = 'Apple'
; with cte as (
SELECT
number,
substring(@str,number,1) as L
FROM master..spt_values n
WHERE
Type = 'P' and
Number between 1 and len(@str)
)
SELECT
STUFF(
(
SELECT
',' + L
FROM cte
order by number
FOR XML PATH('')
), 1, 1, ''
)
Both solutions yield the same result, I hope it helps
If you have SQL Server 2017 and a copy of ngrams8k it's ultra simple:
declare @word varchar(100) = 'apple';
select newString = string_agg(token, ',') within group (order by position)
from dbo.ngrams8k(@word,1);
For pre-2017 systems it's almost as simple:
declare @word varchar(100) = 'apple';
select newstring =
( select token + case len(@word)+1-position when 1 then '' else ',' end
from dbo.ngrams8k(@word,1)
order by position
for xml path(''))
One ugly way to do it is to split the string into characters, ideally using a numbers table, and reassemble it with the desired separator.
A less efficient implementation uses recursion in a CTE to split the characters and insert the separator between pairs of characters as it goes:
declare @Sample as VarChar(20) = 'Apple';
declare @Separator as Char = ',';
with Characters as (
select 1 as Position, Substring( @Sample, 1, 1 ) as Character
union all
select Position + 1,
case when Position & 1 = 1 then @Separator else Substring( @Sample, Position / 2 + 1, 1 ) end
from Characters
where Position < 2 * Len( @Sample ) - 1 )
select Stuff( ( select Character + '' from Characters order by Position for XML Path( '' ) ), 1, 0, '' ) as Result;
You can replace the select Stuff... line with select * from Characters; to see what's going on.
Try this
declare @var varchar(50) ='Apple'
;WITH CTE
AS
(
SELECT
SeqNo = 1,
MyStr = @var,
OpStr = CAST('' AS VARCHAR(50))
UNION ALL
SELECT
SeqNo = SeqNo+1,
MyStr = MyStR,
OpStr = CAST(ISNULL(OpStr,'')+SUBSTRING(MyStR,SeqNo,1)+',' AS VARCHAR(50))
FROM CTE
WHERE SeqNo <= LEN(@var)
)
SELECT
OpStr = LEFT(OpStr,LEN(OpStr)-1)
FROM CTE
WHERE SeqNo = LEN(@Var)+1

Issue with patindex and unicode character '-'

I have a string called Dats which is either of the general appearance xxxx-nnnnn (where x is a character, and n is a number) or nnn-nnnnnn.
I want to return only the numbers.
For this I've tried:
SELECT Distinct dats,
Left(SubString(artikelnr, PatIndex('%[0-9.-]%', artikelnr), 8000), PatIndex('%[^0-9.-]%', SubString(artikelnr, PatIndex('%[0-9.-]%', artikelnr), 8000) + 'X')-1)
FROM ThatDatabase
It is almost what I want. It removes the regular characters x, but it does not remove the unicode character -. How can I remove this as well? Also, it seems rather inefficient to have two PatIndex functions for every row; is there a way to avoid this? (This will be used on a big database where the result of this query will be used as keys).
EDIT: Updated as a new database sometimes contained additional -'s or . together with -.
DECLARE @T as table
(
dats nvarchar(100)
)
INSERT INTO @T VALUES
('111BWA30'),
('115-200-11'),
('115-22.4-1'),
('10.000.22'),
('600F-FFF200')
I wasn't sure if you wanted the numbers before the - char as well, but if you do, here is one way to do it:
Create and populate sample table (Please save us this step in your future questions)
DECLARE @T as table
(
dats nvarchar(10)
)
INSERT INTO @T VALUES
('abcde-1234'),
('23-343')
The query:
SELECT dats,
case when patindex('%[^0-9]-[0-9]%', dats) > 0 then
right(dats, len(dats) - patindex('%-[0-9]%', dats))
else
stuff(dats, charindex('-', dats), 1, '')
end As NumbersOnly
FROM @T
Results:
dats NumbersOnly
abcde-1234 1234
23-343 23343
If you want only the numbers to the right of the - char, it's simpler:
SELECT dats,
right(dats, len(dats) - patindex('%-[0-9]%', dats)) As RightNumbersOnly
FROM @T
Results:
dats RightNumbersOnly
abcde-1234 1234
23-343 343
If you know which characters you need to remove, then use the REPLACE function:
DECLARE @T as table
(
dats nvarchar(100)
)
INSERT INTO @T
VALUES
('111BWA30'),
('115-200-11'),
('115-22.4-1'),
('10.000.22'),
('600F-FFF200')
SELECT REPLACE(REPLACE(dats, '.', ''), '-', '')
FROM @T
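When the unwanted characters are not known up front, one option is to strip everything that is not a digit, one character at a time. A small sketch on a single variable, using one of the sample values from the question (the loop itself is my own illustration, not part of the answers above):

```sql
DECLARE @dats nvarchar(100) = N'600F-FFF200';

-- Remove the first non-digit character until none are left
WHILE PATINDEX('%[^0-9]%', @dats) > 0
    SET @dats = STUFF(@dats, PATINDEX('%[^0-9]%', @dats), 1, '');

SELECT @dats AS NumbersOnly; -- 600200
```

This is a row-by-row loop, so for a big table it would probably have to be wrapped in a function and measured before relying on it for keys.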

select only integer from right before certain character

I want a simple solution to select only numbers from right of a VARCHAR column
before a certain character.
For example, in the strings below, I only want to select numbers before slash character. The numbers vary, it can be 1 digit or more.
'ST/11/SCI/1' 'ST/11/SCI/22' 'ST/11/BIO/854' 'ST/11/BIO/5421'
You can use REVERSE with CHARINDEX so as to locate the position of the last '/' character. Then use RIGHT to extract the number:
DECLARE @str VARCHAR(100) = 'ST/11/BIO/854'
SELECT RIGHT(@str, CHARINDEX('/', REVERSE(@str)) - 1)
Edit:
To get the second number starting from the end you can use:
DECLARE @str VARCHAR(100) = 'ST/11/BIO/1288/544'
SELECT SUBSTRING(@str, s.l - q2.x + 2, q2.x - q1.x - 1)
FROM (SELECT @str AS v) AS t
CROSS APPLY (SELECT CHARINDEX('/', REVERSE(@str))) AS q1(x)
CROSS APPLY (SELECT CHARINDEX('/', REVERSE(@str), q1.x + 1)) AS q2(x)
CROSS APPLY (SELECT LEN(@str)) AS s(l)
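One caveat with the RIGHT/REVERSE expression above: if a value has no '/' at all, CHARINDEX returns 0 and RIGHT is called with -1, which raises an error. A possible guard, sketched here with a made-up second value, is to prefix the string with a '/' before reversing it, so a slash is always found:

```sql
DECLARE @str VARCHAR(100) = 'ST/11/BIO/854';
DECLARE @noSlash VARCHAR(100) = '854'; -- hypothetical value with no '/'

-- The extra '/' only matters when the value has none of its own;
-- in that case the whole value is returned
SELECT RIGHT(@str, CHARINDEX('/', REVERSE('/' + @str)) - 1) AS WithSlash,       -- 854
       RIGHT(@noSlash, CHARINDEX('/', REVERSE('/' + @noSlash)) - 1) AS WithoutSlash; -- 854
```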
Use SUBSTRING, CHARINDEX, REVERSE
REVERSE : It will REVERSE your original string value
CHARINDEX : It will find index/location of any character from string value
SUBSTRING : It will split your string value
DECLARE @myString as VARCHAR(50)='ST/11/SCI/22'
SELECT
REVERSE
(
SUBSTRING
(
REVERSE(@myString),
0,
CHARINDEX('/',REVERSE(@myString))
)
)
I'm posting this as another answer, because the approach is completely different from the one in my other answer:
Starting with SQL Server 2012 (thx @NEER!) there is PARSENAME, which is a very straightforward approach to split a dot-delimited string into up to 4 parts:
DECLARE @stringTable TABLE(string VARCHAR(100));
INSERT INTO @stringTable VALUES
('ST/11/SCI/1'),('ST/11/SCI/22'),('ST/11/BIO/854'),('ST/11/BIO/5421');
SELECT PARSENAME(REPLACE(s.string,'/','.'),1)
FROM @stringTable AS s
The result
1
22
854
5421
If you need all data you can split this easily and type-safe:
DECLARE @stringTable TABLE(string VARCHAR(100));
INSERT INTO @stringTable VALUES
('ST/11/SCI/1'),('ST/11/SCI/22'),('ST/11/BIO/854'),('ST/11/BIO/5421');
SELECT s.string
,x.value('/x[1]','nvarchar(10)') AS Part1
,x.value('/x[2]','int') AS Part2
,x.value('/x[3]','nvarchar(10)') AS Part3
,x.value('/x[4]','int') AS Part4
FROM @stringTable AS s
CROSS APPLY(SELECT CAST('<x>' + REPLACE(s.string,'/','</x><x>')+'</x>' AS XML)) AS A(x)
The result
string Part1 Part2 Part3 Part4
ST/11/SCI/1 ST 11 SCI 1
ST/11/SCI/22 ST 11 SCI 22
ST/11/BIO/854 ST 11 BIO 854
ST/11/BIO/5421 ST 11 BIO 5421
You can use a combination of reverse and charindex :
select reverse(left(reverse(...), charindex('/', reverse(...)) -1))
Documentation of reverse : https://msdn.microsoft.com/fr-fr/library/ms180040.aspx
Documentation of charindex : https://msdn.microsoft.com/fr-fr/library/ms186323.aspx
Test: select reverse(left(reverse('ST/11/SCI/2'), charindex('/', reverse('ST/11/SCI/2')) -1))
Output: 2
Note: We don't know your version of SQL Server, so check the documentation of reverse/charindex to see which versions are supported!

Keep only desired characters and separate with semicolon in T-SQL

The problem:
I have text data imported into the db with a lot of unwanted characters. I need to keep only 4 capital letter strings within the imported text string. Example:
1447;#MIBD (This is a nice name);#2056;#LKRE (Very nice name indeed)
this could be in one column in one row of my table. What I need to extract from the string is:
MIBD and LKRE
And the result should preferably be the desired strings separated with semicolons.
It should be applied to the whole column and I cannot know how many of these 4 upper case letter strings might appear in one row.
I went through all sorts of functions like PATINDEX etc. but really do not know how to approach it. Thanks for any help!
Try this; it assumes that the four-character code is always preceded by ;# . As PATINDEX is case insensitive here, I have added an additional check to verify that all four characters are capital letters.
DECLARE @MyTable Table( ID INT, MyString VARCHAR(8000))
INSERT INTO @MyTable
VALUES
(1, '1447;#MIBD (This is a nice name);#2056;#LKRE (Very nice name indeed)')
,(2, ';#DBCC (This is a nice name);#2056;#LLC (Very nice name indeed) ;#ABCD')
,(3, ';#AaaA;#OPQR;1234 (and) ;#WXYZ')
,(4, ';#abc this empty string without any code')
;WITH CTE AS
(
SELECT ID
,SUBSTRING(MyString, PATINDEX('%;#[A-Z][A-Z][A-Z][A-Z]%',MyString)+2, 4) AS NewString
,STUFF(MyString, 1, PATINDEX('%;#[A-Z][A-Z][A-Z][A-Z]%',MyString)+5, '') AS MyString -- +5 stops right after the code, so a directly following ;# survives
FROM @MyTable m
WHERE PATINDEX('%;#[A-Z][A-Z][A-Z][A-Z]%',MyString) > 0
UNION ALL
SELECT ID
,SUBSTRING(MyString, PATINDEX('%;#[A-Z][A-Z][A-Z][A-Z]%',MyString)+2, 4) AS NewString
,STUFF(MyString, 1, PATINDEX('%;#[A-Z][A-Z][A-Z][A-Z]%',MyString)+5, '') AS MyString
FROM CTE c
WHERE PATINDEX('%;#[A-Z][A-Z][A-Z][A-Z]%',MyString) > 0
)
SELECT c.ID,
STUFF(( SELECT '; ' + NewString
FROM CTE c1
WHERE c1.ID = c.ID
AND ASCII(SUBSTRING(NewString, 1, 1)) BETWEEN ASCII('A') AND ASCII('Z') -- first char
AND ASCII(SUBSTRING(NewString, 2, 1)) BETWEEN ASCII('A') AND ASCII('Z') -- second char
AND ASCII(SUBSTRING(NewString, 3, 1)) BETWEEN ASCII('A') AND ASCII('Z') -- third char
AND ASCII(SUBSTRING(NewString, 4, 1)) BETWEEN ASCII('A') AND ASCII('Z') -- fourth char
FOR XML PATH(''), TYPE).value('.', 'VARCHAR(MAX)') -- use the value clause to handle xml character issues like &,",>,<
,1,2,'') AS CodeList -- strip the leading '; '
FROM CTE c
GROUP BY ID
OPTION (MAXRECURSION 0);
I came to something like this so far:
ALTER FUNCTION CleanData
(
-- Parameters here
@Text AS VARCHAR(4000)
)
RETURNS VARCHAR(4000)
AS
BEGIN
WHILE PATINDEX('%[0-9#;()]%', @Text) > 0
BEGIN
SET @Text = STUFF(@Text, PATINDEX('%[0-9#;()]%', @Text), 1, '')
END
RETURN @Text
END
But what I get is the initials and the characters in parentheses, as PATINDEX cannot distinguish between upper and lower case. Maybe it will be helpful for somebody else.
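PATINDEX follows the collation of its input, so one way to make the [A-Z] range case sensitive is to force a binary collation on the searched string. A small sketch, assuming the database default collation is case insensitive and using Latin1_General_BIN as one possible choice:

```sql
DECLARE @s varchar(100) = ';#AaaA;#OPQR;1234 (and) ;#WXYZ';

SELECT
    -- Under a case-insensitive default collation this matches ;#AaaA at position 1
    PATINDEX('%;#[A-Z][A-Z][A-Z][A-Z]%', @s) AS DefaultCollation,
    -- With a binary collation [A-Z] covers only upper case, so the first match is ;#OPQR at position 7
    PATINDEX('%;#[A-Z][A-Z][A-Z][A-Z]%', @s COLLATE Latin1_General_BIN) AS BinaryCollation;
```

With the collation applied inside the PATINDEX calls, the separate ASCII range checks on each character should no longer be needed.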

Resources