Word popularity leaderboard in SQL Server based message-board - sql-server

In a SQL server database, I have a table Messages with the following columns:
Id INT IDENTITY(1,1)
Detail VARCHAR(5000)
DatetimeEntered DATETIME
PersonEntered VARCHAR(25)
Messages are pretty basic, and only allow alphanumeric characters and a handful of special characters, which are as follows:
`¬!"£$%^&*()-_=+[{]};:'@#~\|,<.>/?
Ignoring the bulk of the special characters bar the apostrophe, what I need is a way to list each word along with how many times the word occurs in the Detail column, which I can then filter by PersonEntered and DatetimeEntered.
Example output:
Word Frequency
-----------------
a 11280
the 10102
and 8845
when 2024
don't 2013
.
.
.
It doesn't need to be particularly clever. It is perfectly fine if dont and don't are treated as separate words.
I'm having trouble splitting out the words into a temporary table called #Words.
Once I have a temporary table, I would apply the following query:
SELECT
Word,
COUNT(*) AS WordCount
FROM #Words
GROUP BY Word
ORDER BY COUNT(*) DESC
Please help.

Personally, I would strip out almost all the special characters, and then use a splitter on the space character. Of your permitted characters, only ' is going to appear in a word; anything else is going to be grammatical.
You haven't posted what version of SQL Server you're using, so I'm going to use SQL Server 2017 syntax. If you don't have the latest version, you'll need to replace TRANSLATE with a nested REPLACE (so REPLACE(REPLACE(REPLACE(REPLACE(... REPLACE(M.Detail, '¬',' '),...),'/',' '),'?',' ')) and find a string splitter (for example, Jeff Moden's DelimitedSplit8K).
USE Sandbox;
GO
CREATE TABLE [Messages] (Detail varchar(5000));
INSERT INTO [Messages]
VALUES ('Personally, I would strip out almost all the special characters, and then use a splitter on the space character. Of your permitted characters, only `''` is going to appear in a word; anything else is going to be grammatical. You haven''t posted what version of SQL you''re using, so I''ve going to use SQL Server 2017 syntax. If you don''t have the latest version, you''ll need to replace `TRANSLATE` with a nested `REPLACE` (So `REPLACE(REPLACE(REPLACE(REPLACE(... REPLACE(M.Detail, ''¬'','' ''),...),''/'','' ''),''?'','' '')`, and find a string splitter (for example, Jeff Moden''s [DelimitedSplit8K](http://www.sqlservercentral.com/articles/Tally+Table/72993/)).'),
('As a note, this is going to perform **AWFULLY**. SQL Server is not designed for this type of work. I also imagine you''ll get some odd results and it''ll include numbers in there. Things like dates are going to get split out,, numbers like `9,000,000` would be treated as the words `9` and `000`, and hyperlinks will be separated.')
GO
WITH Replacements AS(
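-- TRANSLATE requires its 2nd and 3rd arguments to have the same length: 33 characters map to 33 spaces
SELECT TRANSLATE(Detail, '`¬!"£$%^&*()-_=+[{]};:@#~\|,<.>/?', REPLICATE(' ', 33)) AS StrippedDetail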
FROM [Messages] M)
SELECT SS.[value], COUNT(*) AS WordCount
FROM Replacements R
CROSS APPLY string_split(R.StrippedDetail,' ') SS
WHERE LEN(SS.[value]) > 0
GROUP BY SS.[value]
ORDER BY WordCount DESC;
GO
DROP TABLE [Messages];
As a note, this is going to perform AWFULLY. SQL Server is not designed for this type of work. I also imagine you'll get some odd results, and it'll include numbers. Things like dates are going to get split up, numbers like 9,000,000 would be treated as the words 9 and 000, and hyperlinks will be separated.
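To tie this back to the filters in the question, the same query can be restricted by PersonEntered and DatetimeEntered inside the CTE. This is only a minimal sketch, assuming the full Messages table from the question (not the one-column demo table above) and SQL Server 2017 or later. The variable names and filter values are hypothetical, and the character list assumes a collation whose code page can store ¬ and £ in a varchar:

```sql
-- Hypothetical filter values; replace with your own.
DECLARE @Person varchar(25) = 'SomeUser';
DECLARE @From datetime = '20180101';
DECLARE @To datetime = '20180201';
-- Characters to blank out; the apostrophe is deliberately left out.
DECLARE @Strip varchar(40) = '`¬!"£$%^&*()-_=+[{]};:@#~\|,<.>/?';

WITH Replacements AS (
    -- TRANSLATE needs a replacement string of the same length as @Strip.
    SELECT TRANSLATE(M.Detail, @Strip, REPLICATE(' ', LEN(@Strip))) AS StrippedDetail
    FROM [Messages] M
    WHERE M.PersonEntered = @Person
      AND M.DatetimeEntered >= @From
      AND M.DatetimeEntered < @To   -- half-open date range
)
SELECT SS.[value] AS Word, COUNT(*) AS Frequency
FROM Replacements R
CROSS APPLY STRING_SPLIT(R.StrippedDetail, ' ') SS
WHERE LEN(SS.[value]) > 0           -- drop the empty strings left by repeated spaces
GROUP BY SS.[value]
ORDER BY Frequency DESC;
```

Filtering inside the CTE means only the selected messages are stripped and split, which matters given how expensive the splitting step is.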

Related

Customize Normalization in SQL Server Full Text Search by replacing characters

I want to customize SQL Server FTS to handle language-specific features better.
In many languages, such as Persian and Arabic, there are similar characters that a proper search should treat as identical, like these groups:
['آ' , 'ا' , 'ء' , 'ا']
['ي' , 'ی' , 'ئ']
Currently my best solution is to store a duplicate of the data in a new column with these characters replaced by a representative member, normalize the search term the same way, and search against the duplicated column.
Is there any way to tell SQL Server to treat all members of these groups as an identical character?
As far as I understand, this would be used for suggestion purposes, so being perfectly accurate is not important.
In Farsi, none of the characters in the list above actually share the same meaning, but they do share a short form in some writing cases ('آ' != 'اِ', but both can be written as 'ا').
SCENARIO 1 : THE INPUT TEXT IS IN COMPLETE FORM
Imagine "محمّد" is a record in a table named 'table' with the columns (id int, text nvarchar(12)).
After removing the special character, we can use the following command:
select * from [db].[dbo].[table] where REPLACE(text, N' ّ ', N'') = REPLACE(N'محمد', N' ّ ', N'');
the result would be
SCENARIO 2: THE INPUT IS IN SHORT FORMAT
Imagine "محمد" is a record in a table named 'table' with the columns (id int, text nvarchar(12)).
In this scenario we need to do some processing on the text before we query the database.
For example, if "محمد" is the input and we have a list of these special characters, it can easily be searched with a query such as:
select * from [db].[dbo].[table] where REPLACE(text, N' ّ ', N'') = N'محمد';
Note:
This solution is not ideal, because the input should not have to be altered on the client side; it would be better if SQL Server could be configured to handle this.
For readers who don't know Farsi: the asker simply wants to tell SQL Server that A and the characters in the list ["B","C"] have the same value, so:
when the word "dad" is searched, any existing words "dbd" or "dcd" are returned too.
Addition:
Some sets of characters can have the same meaning and some sometimes do not (['ي','أ'] are the same but ['آ','اِ'] are not), so for the first scenario we get:
select * from [db].[dbo].[table] where text like N'%هی[أي]ت' and text like N'هی[أي]ت%';

Oracle data is returned with spaces between characters

I am trying to retrieve data (select * ...) from a SQL Server database into an Oracle database using a dblink. In my SQL Server database, I have columns AddressLine1 and AddressLine2 of type nvarchar.
I am running the script below in SQL Developer (v 4.1.3.20). The results appear with spaces between characters. I also tried Benthic and SQL*Plus and the results are the same: spaces between characters.
SELECT
c.CandidateID,
pa."AddressLine1", pa."AddressLine2"
FROM
CANDIDATES c --Oracle table
INNER JOIN
PostalAddress@HIM pa ON pa."EntityID" = c.CandidateID -- SQL Server table
-- @HIM is the dblink name
This screenshot shows the results (when copying blank spaces are copied):
I also tried to cast the results to varchar and the results are the same. I tried to trim the spaces and also tried to replace the whitespace with NULL, but the results remain the same.
Any suggestions would be greatly appreciated. Thank you.
Your problem does in fact appear to have something to do with the encoding. Specifically, your text seems to be getting decoded using a character set where the width is two bytes, yet your ASCII data is only taking up one byte.
As a temporary fix, consider the following query:
SELECT REGEXP_REPLACE('6 2 1 1  W r i g h t s v i l l e  A v e', ' ([^ ])', '\1')
FROM dual;
This outputs 6211 Wrightsville Ave, which is what you want. Note that I assume that every character has an extra ghost space, the result of which is that words which were originally separated by one space would now be separated by two spaces.
This isn't the best solution for so many reasons. From a regex point of view, a much tighter answer could be given using lookarounds, but REGEXP_REPLACE does not appear to support them.

T-SQL Regex for social security number (SQL Server 2008 R2)

I need to find invalid social security numbers in a varchar field in a SQL Server 2008 database table. (Valid SSNs are defined as being in the format ###-##-####; it doesn't matter what the numbers are, as long as they are in that "3-digit dash 2-digit dash 4-digit" pattern.)
I do have a working regex:
SELECT *
FROM mytable
WHERE ssn NOT LIKE '[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]'
That does find the invalid SSNs in the column, but I know (okay - I'm pretty sure) that there is a way to shorten that to indicate that the previous pattern can have x iterations.
I thought this would work:
'[0-9]{3}-[0-9]{2}-[0-9]{4}'
But it doesn't.
Is there a shorter regex than the one above in the select, or not? Or perhaps there is, but T-SQL/SQL Server 2008 doesn't support it!?
If you plan to get a shorter variant of your LIKE expression, then the answer is no.
In T-SQL, you can only use the following wildcards in the pattern:
%
Any string of zero or more characters.
WHERE title LIKE '%computer%' finds all book titles with the word computer anywhere in the book title.
_ (underscore)
Any single character.
WHERE au_fname LIKE '_ean' finds all four-letter first names that end with ean (Dean, Sean, and so on).
[ ]
Any single character within the specified range ([a-f]) or set ([abcdef]).
WHERE au_lname LIKE '[C-P]arsen' finds author last names ending with arsen and starting with any single character between C and P, for example Carsen, Larsen, Karsen, and so on. In range searches, the characters included in the range may vary depending on the sorting rules of the collation.
[^]
Any single character not within the specified range ([^a-f]) or set ([^abcdef]).
So, your LIKE statement is already the shortest possible expression. No limiting quantifiers (like {min,max}) can be used, nor shorthand classes like \d.
If you were using MySQL, you could use a richer set of regex utilities, but that is not the case here.
I suggest another solution like this:
-- Use `REPLICATE` if you really want to use a number to repeat
DECLARE @rgx nvarchar(max) = REPLICATE('#', 3) + '-' +
REPLICATE('#', 2) + '-' +
REPLICATE('#', 4);
-- or use your simple format string instead (alternative to the declaration above):
-- DECLARE @rgx nvarchar(max) = '###-##-####';
-- then use this to get your final `LIKE` string.
SET @rgx = REPLACE(@rgx, '#', '[0-9]');
And you can also use something like '_' for characters then replace it with [A-Z] and so on.
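Putting it together, the generated pattern can be passed straight to NOT LIKE. The following is a small sketch rather than a drop-in solution: it assumes a throwaway table variable with made-up sample values, and the names @fmt, @pattern and @mytable are illustrative only.

```sql
-- Build the LIKE pattern from a readable format string.
DECLARE @fmt varchar(20) = '###-##-####';
DECLARE @pattern varchar(200) = REPLACE(@fmt, '#', '[0-9]');

-- Hypothetical sample data standing in for the real table.
DECLARE @mytable TABLE (ssn varchar(20));
INSERT INTO @mytable (ssn) VALUES ('123-45-6789');  -- well formed
INSERT INTO @mytable (ssn) VALUES ('123456789');    -- no dashes
INSERT INTO @mytable (ssn) VALUES ('12-345-6789');  -- dashes in the wrong places

-- Return only the values that do not match the pattern.
SELECT ssn
FROM @mytable
WHERE ssn NOT LIKE @pattern;
```

Only the two malformed values are returned; the well-formed one matches the pattern and is filtered out.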

Identify all strings in SQL Server code (red color - like in SSMS)

I was not able to solve this by myself so I hope I didn't miss any similar post here and I'm not wasting your time.
What I want is to identify (get a list) of all strings used in SQL Server code.
Example:
select 'WordToCatch1' as 'Column1'
from Table1
where Column2 = 'WordToCatch2'
If you put the above code into SSMS, all three words in apostrophes will be red, but only 'WordToCatch1' and 'WordToCatch2' are "real" strings used in the code.
My goal is to find all those "real" strings in any code.
For example, if I have a stored procedure 10k rows long, it would be impossible to search it manually, so I want something that will find all those "real" strings for me and return a list of them.
Thanks in advance!
The trouble is, Column1 is not particularly different from WordToCatch1 and WordToCatch2, unless you parse the SQL yourself. You could modify your query to take the quotes away from Column1 and it will show up coloured black.
I guess a simple regex would find all identifiers after an AS keyword, which would be easier than fully parsing the SQL, provided all the unwanted strings are like that and it's not just an example.

Apostrophes and SQL Server FT search

I have set up FT search in SQL Server 2005 but I can't seem to find a way to match the "Lias" keyword to a record with "Lia's". What I basically want is to allow people to search without the apostrophe.
I have been on and off this problem for quite some time now so any help will really be a blessing.
EDIT 2: just realised this doesn't actually resolve your problem, please ignore and see other answer! The code below will return results for a case when a user has inserted an apostrophe which shouldn't be there, such as "abandoned it's cargo".
I don't have FT installed locally and have not tested this - you can use the syntax of CONTAINS to test for both the original occurrence and one with the apostrophe stripped, i.e.:
SELECT *
FROM table
WHERE CONTAINS ('value' OR Replace('value', '''',''))
EDIT: You can search for phrases using double quotes, e.g.
SELECT *
FROM table
WHERE CONTAINS ("this phrase" OR Replace("this phrase", '''',''))
See MSDN documentation for CONTAINS. This actually indicates the punctuation is ignored anyway, but again I haven't tested; it may be worth just trying CONTAINS('value') on its own.
I haven't used FT, but in doing queries on varchar columns, and looking for surnames such as O'Reilly, I've used:
surname like Replace( @search, '''', '') + '%' or
Replace( surname,'''','') like @search + '%'
This allows for the apostrophe to be in either the database value or the search term. It's also obviously going to perform like a dog with a large table.
The alternative (also probably not a good one) would be to save a 2nd copy of the data, stripped of non-alpha characters, and search (also?) against that copy. So the original would contain Lia's and the 2nd copy Lias, at the cost of doubling the amount of storage, etc.
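One way to keep that second copy in sync without application code is a persisted computed column. This is a swapped-in technique, not something the answer above prescribes. It is sketched here against a hypothetical dbo.People table, and it strips only apostrophes rather than all non-alpha characters:

```sql
-- Hypothetical table; SurnameStripped is maintained by SQL Server itself.
CREATE TABLE dbo.People (
    Id int IDENTITY(1,1) PRIMARY KEY,
    Surname varchar(50) NOT NULL,
    SurnameStripped AS REPLACE(Surname, '''', '') PERSISTED
);
-- Index the stripped copy so prefix searches do not have to scan the table.
CREATE INDEX IX_People_SurnameStripped ON dbo.People (SurnameStripped);

INSERT INTO dbo.People (Surname) VALUES ('O''Reilly');
INSERT INTO dbo.People (Surname) VALUES ('Lia''s');

-- Strip the apostrophe from the search term too, then compare like with like.
DECLARE @search varchar(50);
SET @search = 'Lias';

SELECT Surname
FROM dbo.People
WHERE SurnameStripped LIKE REPLACE(@search, '''', '') + '%';
```

With this layout the search term matches whether or not the user typed the apostrophe, and the REPLACE on the column is paid once at write time instead of on every query.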
Another attempt:
SELECT surname
FROM table
WHERE surname LIKE '%value%'
OR REPLACE(surname,'''','') LIKE '%value%'
This works for me (without FT enabled), i.e. I get the same results when searching for O'Connor or OConnor.

Resources