T-SQL Regex for social security number (SQL Server 2008 R2) - sql-server

I need to find invalid social security numbers in a varchar field in a SQL Server 2008 database table. (A valid SSN is defined as one in the format ###-##-####. It doesn't matter what the digits are, as long as they follow the "3-digit dash 2-digit dash 4-digit" pattern.)
I do have a working regex:
SELECT *
FROM mytable
WHERE ssn NOT LIKE '[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]'
That does find the invalid SSNs in the column, but I know (okay - I'm pretty sure) that there is a way to shorten that to indicate that the previous pattern can have x iterations.
I thought this would work:
'[0-9]{3}-[0-9]{2}-[0-9]{4}'
But it doesn't.
Is there a shorter regex than the one above in the select, or not? Or perhaps there is, but T-SQL/SQL Server 2008 doesn't support it!?

If you plan to get a shorter variant of your LIKE expression, then the answer is no.
In T-SQL, you can only use the following wildcards in the pattern:
%
Any string of zero or more characters.
WHERE title LIKE '%computer%' finds all book titles with the word computer anywhere in the book title.
_ (underscore)
Any single character.
WHERE au_fname LIKE '_ean' finds all four-letter first names that end with ean (Dean, Sean, and so on).
[ ]
Any single character within the specified range ([a-f]) or set ([abcdef]).
WHERE au_lname LIKE '[C-P]arsen' finds author last names ending with arsen and starting with any single character between C and P, for example Carsen, Larsen, Karsen, and so on. In range searches, the characters included in the range may vary depending on the sorting rules of the collation.
[^]
Any single character not within the specified range ([^a-f]) or set ([^abcdef]).
So, your LIKE pattern is already the shortest possible expression. LIKE supports neither limiting quantifiers (such as {min,max}) nor shorthand classes like \d.
If you were using MySQL, you could use a richer set of regex utilities, but that is not the case here.

I suggest using another solution like this:
-- Use `REPLICATE` if you really want to use a number to repeat
DECLARE @rgx nvarchar(max) = REPLICATE('#', 3) + '-' +
                             REPLICATE('#', 2) + '-' +
                             REPLICATE('#', 4);
-- or use your simple format string instead of the declaration above:
-- DECLARE @rgx nvarchar(max) = '###-##-####';
-- then use this to get your final `LIKE` string.
SET @rgx = REPLACE(@rgx, '#', '[0-9]');
You can also use another placeholder such as '_' for letters and then replace it with [A-Z], and so on.
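To show how the expanded mask plugs into the original query, here is a minimal sketch. The table variable @ssns and its sample rows are made up for illustration and stand in for the real table.

```sql
-- Hypothetical sample data; @ssns stands in for the real table.
DECLARE @ssns TABLE (ssn varchar(20));
INSERT INTO @ssns (ssn)
VALUES ('123-45-6789'), ('123456789'), ('12-345-6789'), ('123-45-678X');

-- Expand the readable mask: every # becomes the single-digit class [0-9].
DECLARE @mask varchar(20) = '###-##-####';
DECLARE @pattern varchar(200) = REPLACE(@mask, '#', '[0-9]');

-- Rows that do not match the pattern are the invalid SSNs.
SELECT ssn
FROM @ssns
WHERE ssn NOT LIKE @pattern;
```

Note that rows where ssn is NULL are not returned by NOT LIKE, so add OR ssn IS NULL if missing values should count as invalid too.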

Related

Word popularity leaderboard in SQL Server based message-board

In a SQL server database, I have a table Messages with the following columns:
Id INT IDENTITY(1,1)
Detail VARCHAR(5000)
DatetimeEntered DATETIME
PersonEntered VARCHAR(25)
Messages are pretty basic, and only allow alphanumeric characters and a handful of special characters, which are as follows:
`¬!"£$%^&*()-_=+[{]};:'@#~\|,<.>/?
Ignoring the bulk of the special characters bar the apostrophe, what I need is a way to list each word along with how many times the word occurs in the Detail column, which I can then filter by PersonEntered and DatetimeEntered.
Example output:
Word Frequency
-----------------
a 11280
the 10102
and 8845
when 2024
don't 2013
.
.
.
It doesn't need to be particularly clever. It is perfectly fine if dont and don't are treated as separate words.
I'm having trouble splitting out the words into a temporary table called #Words.
Once I have a temporary table, I would apply the following query:
SELECT
Word,
COUNT(*) AS WordCount
FROM #Words
GROUP BY Word
ORDER BY COUNT(*) DESC
Please help.
Personally, I would strip out almost all the special characters, and then use a splitter on the space character. Of your permitted characters, only ' is going to appear in a word; anything else is going to be grammatical.
You haven't posted what version of SQL Server you're using, so I'm going to use SQL Server 2017 syntax. If you don't have the latest version, you'll need to replace TRANSLATE with a nested REPLACE, such as REPLACE(REPLACE(REPLACE(REPLACE(... REPLACE(M.Detail, '¬',' '),...),'/',' '),'?',' '), and find a string splitter (for example, Jeff Moden's DelimitedSplit8K).
USE Sandbox;
GO
CREATE TABLE [Messages] (Detail varchar(5000));
INSERT INTO [Messages]
VALUES ('Personally, I would strip out almost all the special characters, and then use a splitter on the space character. Of your permitted characters, only `''` is going to appear in a word; anything else is going to be grammatical. You haven''t posted what version of SQL you''re using, so I''ve going to use SQL Server 2017 syntax. If you don''t have the latest version, you''ll need to replace `TRANSLATE` with a nested `REPLACE` (So `REPLACE(REPLACE(REPLACE(REPLACE(... REPLACE(M.Detail, ''¬'','' ''),...),''/'','' ''),''?'','' '')`, and find a string splitter (for example, Jeff Moden''s [DelimitedSplit8K](http://www.sqlservercentral.com/articles/Tally+Table/72993/)).'),
('As a note, this is going to perform **AWFULLY**. SQL Server is not designed for this type of work. I also imagine you''ll get some odd results and it''ll include numbers in there. Things like dates are going to get split out,, numbers like `9,000,000` would be treated as the words `9` and `000`, and hyperlinks will be separated.')
GO
WITH Replacements AS(
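-- TRANSLATE needs exactly one replacement character per stripped character (33 here)
SELECT TRANSLATE(Detail, '`¬!"£$%^&*()-_=+[{]};:@#~\|,<.>/?', REPLICATE(' ', 33)) AS StrippedDetail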
FROM [Messages] M)
SELECT SS.[value], COUNT(*) AS WordCount
FROM Replacements R
CROSS APPLY string_split(R.StrippedDetail,' ') SS
WHERE LEN(SS.[value]) > 0
GROUP BY SS.[value]
ORDER BY WordCount DESC;
GO
DROP TABLE [Messages];
As a note, this is going to perform AWFULLY. SQL Server is not designed for this type of work. I also imagine you'll get some odd results, and it'll include numbers in there. Things like dates are going to get split up, numbers like 9,000,000 would be treated as the words 9 and 000, and hyperlinks will be separated.
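Since the question also asks to filter by PersonEntered and DatetimeEntered, the same approach can be sketched against the real table as follows. This assumes SQL Server 2017 or later, that the table lives in the dbo schema, and that the stripped characters include @ and #; the filter values are invented placeholders.

```sql
-- Sketch only: assumes SQL Server 2017+ (TRANSLATE, STRING_SPLIT) and the
-- Messages table from the question. The filter values below are made up.
DECLARE @strip  varchar(40) = '`¬!"£$%^&*()-_=+[{]};:@#~\|,<.>/?';
DECLARE @person varchar(25) = 'SomeUser';   -- hypothetical person
DECLARE @from   datetime    = '20180101';   -- hypothetical start (inclusive)
DECLARE @to     datetime    = '20190101';   -- hypothetical end (exclusive)

SELECT SS.[value] AS Word, COUNT(*) AS WordCount
FROM dbo.[Messages] AS M
CROSS APPLY STRING_SPLIT(
        -- Build the replacement string from the length of @strip so the
        -- two TRANSLATE arguments always have the same length.
        TRANSLATE(M.Detail, @strip, REPLICATE(' ', LEN(@strip))), ' ') AS SS
WHERE M.PersonEntered = @person
  AND M.DatetimeEntered >= @from
  AND M.DatetimeEntered <  @to
  AND SS.[value] <> ''          -- drop the empty strings left by repeated spaces
GROUP BY SS.[value]
ORDER BY WordCount DESC;
```

Filtering on the Messages columns before splitting keeps the number of rows passed to the splitter as small as possible.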

Encoding error reading Greek characters string from SQL database

I have a search form (with method GET) with only one text field, named “search_field”. When a user submits the form, the characters the user typed are appended to the URL. For example, if the user types "blablabla", the generated URL will look something like this:
results.asp?search_field=blablabla
In my MSSQL 2012 database I have a table named “Products” with a column named “kodikos” in it.
I want to display all the records whose “kodikos” column contains the typed characters. My SQL select statement is the following:
"SELECT * FROM dbo.Products WHERE dbo.Products.kodikos LIKE '%' + ? + '%' "
(The question mark is the “search_field” parameter that contains the characters typed by the user.)
All of the above works perfectly and I get the correct results. The problem I am facing is with Greek characters. For example, when the user types “fff”, my code works perfectly and finds all the records containing the characters “fff”. It also works perfectly with numbers. But if the user types the Greek characters “φφφ”, I get no results, even though there are a lot of records containing “φφφ”. The Greek characters are not recognized at all.
For your information:
On my local PC, with the same SQL Server version, the Greek characters are recognized correctly by my code, because my regional settings are set to Greek. But the same code on the hosting server in the US does not recognize them.
All of my pages use UTF-8 encoding.
Does anyone have any idea how to solve this issue?
SQL Server knows two encodings natively:
2-byte-unicode (in most cases NVARCHAR)
extended ASCII in connection with a collation (in most cases VARCHAR)
I assume that the language you are calling this from uses 2-byte-unicode for normal strings. This is pretty usual today...
I assume that your column Products.kodikos is of type NVARCHAR (2-byte-unicode). In this case it should help to force your search string to be 2-byte-unicode too. Try
LIKE N'%' + CAST(? AS NVARCHAR(MAX)) + N'%'
If your column is not 2-byte encoded, it might help to use COLLATE to force your search string to know your special characters.
If you pass this string into a SQL Server routine as-is, you should make sure that the accepting parameter is 2-byte-unicode too.
You have to make sure your search string is two-byte encoded, using the N'' notation...
For instance, the following query uses a string that is two byte encoded:
SELECT * FROM dbo.Products WHERE dbo.Products.kodikos LIKE N'%φφφ%'
But this query uses a string that is not two byte encoded (you won't get any results):
SELECT * FROM dbo.Products WHERE dbo.Products.kodikos LIKE '%φφφ%'
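The same idea with the search term held in a Unicode variable, which is roughly what an NVARCHAR parameter passed from the application would look like. This is only a sketch: the table variable and its sample rows are invented for illustration, and it assumes the kodikos column is NVARCHAR.

```sql
-- Hypothetical stand-in for dbo.Products with an NVARCHAR column.
DECLARE @products TABLE (kodikos nvarchar(50));
INSERT INTO @products (kodikos)
VALUES (N'abcφφφ123'), (N'fff-001');

-- The search term is declared as NVARCHAR and assigned with the N prefix,
-- so the Greek characters are never converted to a non-Unicode code page.
DECLARE @search nvarchar(100) = N'φφφ';

SELECT kodikos
FROM @products
WHERE kodikos LIKE N'%' + @search + N'%';
```

If the value arrives through a VARCHAR parameter or a literal without the N prefix, the conversion happens before the comparison, and casting it to NVARCHAR afterwards cannot bring the original characters back.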

Tdbf/tdataset sorting multiple fields in delphi

I have a Delphi application that uses TDbf, which is based on TDataSet and has the advantage of not requiring the BDE engine. I needed the table to be sorted, and I did this on a single field by adding an IndexDef and then specifying the IndexFieldNames.
I am now trying to get it to sort on two fields, i.e. group the men together and the women together, and then have each group sorted on salary, so that we see females from lowest earner to highest earner, followed by men in the same way.
I have read every piece of material stating that you simply specify the sort field of the IndexDef as 'gender+salary'. When I try to use the index, I am told that '+' is not a valid field name. I have tried every delimiter, including '.', ',', '&' and ';'. Every delimiter gets picked up as a field that doesn't exist. What is the correct way to sort a table on multiple fields?
Thanks in advance
Clinton Brits
xBASE (dBASE and its derivatives) requires that fields in an index all be converted to the same data type, typically strings. Doing that typically requires some common functions:
DTOS() - Converts an xBASE date to the format CCYYMMDD as a string
STR() - Converts a numeric to a string, with an optional width specifier (default 10) and number of digits to the right of the decimal point. Specifically, the syntax is specified as STR(<numeric> [, <width> [, <decimaldigits>] ]).
SUBSTR() - Extracts a portion of a string from another, with a specified starting position and number of characters
IIF() - Immediate IF, used to convert logicals (e.g., IIF(Married = .T., 'Y', 'N'))
Index expressions are indeed combined with the + operator. The error you're receiving is probably because you haven't converted to a common data type.
As you've specified the Gender column (probably defined as CHAR 1) and Salary column (probably a NUMERIC of some size), you can use something like
Dbf1.AddIndex('GENDER_SAL', 'GENDER + STR(SALARY, 10, 0)', []);
This creates an index on an expression like F 10000, F 200000, M 12000, where SALARY is converted to the default width of 10 characters (left-padded with spaces) and no decimal digits. This should work for you.
I have not used the component, but it looks like they want to use index expressions that are similar to what we used to use in dBase III. On page 7 in the PDF version of the documentation, they offer an example under the Expressions topic:
Dbf1.AddIndex('INDEX1', 'DTOS(DATEFIELD)+SUBSTR(LONGFIELD,1,10)+SUBSTR(LONGFIELD2,1,20)', []);
You could try their SubStr function on your fields with parameters that would include the whole string and see if that at least gets you a result.

SQL Server 2008 Empty String vs. Space

I ran into something a little odd this morning and thought I'd submit it for commentary.
Can someone explain why the following SQL query prints 'equal' when run against SQL 2008? The db compatibility level is set to 100.
if '' = ' '
print 'equal'
else
print 'not equal'
And this returns 0:
select (LEN(' '))
It appears to be auto trimming the space. I have no idea if this was the case in previous versions of SQL Server, and I no longer have any around to even test it.
I ran into this because a production query was returning incorrect results. I cannot find this behavior documented anywhere.
Does anyone have any information on this?
varchars and equality are thorny in TSQL. The LEN function says:
Returns the number of characters, rather than the number of bytes, of the given string expression, excluding trailing blanks.
You need to use DATALENGTH to get a true byte count of the data in question. If you have unicode data, note that the value you get in this situation will not be the same as the length of the text.
print(DATALENGTH(' ')) --1
print(LEN(' ')) --0
When it comes to equality of expressions, the two strings are compared for equality like this:
Get Shorter string
Pad with blanks until length equals that of longer string
Compare the two
It's the middle step that is causing unexpected results - after that step, you are effectively comparing whitespace against whitespace - hence they are seen to be equal.
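The padding rule and the difference between LEN and DATALENGTH can be seen side by side in a small sketch. The variable names are arbitrary, and the extra DATALENGTH condition is one possible way to make the comparison strict, not the only one.

```sql
DECLARE @a varchar(10) = 'cat';
DECLARE @b varchar(10) = 'cat' + SPACE(3);   -- 'cat' followed by three spaces

SELECT
    -- = pads the shorter string, so the trailing spaces are ignored
    CASE WHEN @a = @b THEN 'equal' ELSE 'not equal' END AS EqualsResult,
    -- LEN excludes trailing blanks; DATALENGTH counts every byte
    LEN(@a)        AS LenA,
    LEN(@b)        AS LenB,
    DATALENGTH(@a) AS BytesA,
    DATALENGTH(@b) AS BytesB,
    -- adding a DATALENGTH check makes trailing spaces count
    CASE WHEN @a = @b AND DATALENGTH(@a) = DATALENGTH(@b)
         THEN 'same' ELSE 'different' END AS StrictResult;
```

Here EqualsResult is 'equal' and both LEN values are 3, while DATALENGTH reports 3 and 6 bytes, so StrictResult is 'different'.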
LIKE behaves better than = in the "blanks" situation because it doesn't perform blank-padding on the pattern you were trying to match:
if '' = ' '
print 'eq'
else
print 'ne'
Will give eq while:
if '' LIKE ' '
print 'eq'
else
print 'ne'
Will give ne
Careful with LIKE though: it is not symmetrical: it treats trailing whitespace as significant in the pattern (RHS) but not the match expression (LHS). The following is taken from here:
declare @Space nvarchar(10)
declare @Space2 nvarchar(10)
set @Space = ''
set @Space2 = ' '
if @Space like @Space2
print '@Space Like @Space2'
else
print '@Space Not Like @Space2'
if @Space2 like @Space
print '@Space2 Like @Space'
else
print '@Space2 Not Like @Space'
@Space Not Like @Space2
@Space2 Like @Space
The = operator in T-SQL is not so much "equals" as it is "are the same word/phrase, according to the collation of the expression's context," and LEN is "the number of characters in the word/phrase." No collations treat trailing blanks as part of the word/phrase preceding them (though they do treat leading blanks as part of the string they precede).
If you need to distinguish 'this' from 'this ', you shouldn't use the "are the same word or phrase" operator because 'this' and 'this ' are the same word.
Contributing to the way = works is the idea that the string-equality operator should depend on its arguments' contents and on the collation context of the expression, but it shouldn't depend on the types of the arguments, if they are both string types.
The natural language concept of "these are the same word" isn't typically precise enough to be able to be captured by a mathematical operator like =, and there's no concept of string type in natural language. Context (i.e., collation) matters (and exists in natural language) and is part of the story, and additional properties (some that seem quirky) are part of the definition of = in order to make it well-defined in the unnatural world of data.
On the type issue, you wouldn't want words to change when they are stored in different string types. For example, the types VARCHAR(10), CHAR(10), and CHAR(3) can all hold representations of the word 'cat', and ? = 'cat' should let us decide if a value of any of these types holds the word 'cat' (with issues of case and accent determined by the collation).
Response to JohnFx's comment:
See Using char and varchar Data in Books Online. Quoting from that page, emphasis mine:
Each char and varchar data value has a collation. Collations define
attributes such as the bit patterns used to represent each character,
comparison rules, and sensitivity to case or accenting.
I agree it could be easier to find, but it's documented.
Worth noting, too, is that SQL's semantics, where = has to do with the real-world data and the context of the comparison (as opposed to something about bits stored on the computer) has been part of SQL for a long time. The premise of RDBMSs and SQL is the faithful representation of real-world data, hence its support for collations many years before similar ideas (such as CultureInfo) entered the realm of Algol-like languages. The premise of those languages (at least until very recently) was problem-solving in engineering, not management of business data. (Recently, the use of similar languages in non-engineering applications like search is making some inroads, but Java, C#, and so on are still struggling with their non-businessy roots.)
In my opinion, it's not fair to criticize SQL for being different from "most programming languages." SQL was designed to support a framework for business data modeling that's very different from engineering, so the language is different (and better for its goal).
Heck, when SQL was first specified, some languages didn't have any built-in string type. And in some languages still, the equals operator between strings doesn't compare character data at all, but compares references! It wouldn't surprise me if in another decade or two, the idea that == is culture-dependent becomes the norm.
I found this blog article which describes the behavior and explains why.
The SQL standard requires that string comparisons, effectively, pad the shorter string with space characters. This leads to the surprising result that N'' = N' ' (the empty string equals a string of one or more space characters) and more generally any string equals another string if they differ only by trailing spaces. This can be a problem in some contexts.
More information is also available in MSKB316626
There was a similar question a while ago where I looked into a similar problem here
Instead of LEN(' '), use DATALENGTH(' ') - that gives you the correct value.
The solutions were to use a LIKE clause as explained in my answer in there, and/or include a 2nd condition in the WHERE clause to check DATALENGTH too.
Have a read of that question and links in there.
To compare a value to a literal space, you may also use this technique as an alternative to the LIKE statement:
IF ASCII('') = 32 PRINT 'equal' ELSE PRINT 'not equal'
Sometimes one has to deal with spaces in data, with or without any other characters, even though the idea of using Null is better - but not always usable.
I did run into the described situation and solved it this way:
... where ('>' + @space + '<') <> ('>' + @space2 + '<')
Of course you wouldn't do that for large amount of data but it works quick and easy for some hundred lines ...
As SQL-92, section 8.2 <comparison predicate>, says:
If the length in characters of X is not equal to the length in characters of Y, then the shorter string is effectively replaced, for the purposes of comparison, with a copy of itself that has been extended to the length of the longer string by concatenation on the right of one or more pad characters, where the pad character is chosen based on CS. If CS has the NO PAD attribute, then the pad character is an implementation-dependent character different from any character in the character set of X and Y that collates less than any string under CS. Otherwise, the pad character is a <space>.
How to distinguish records on select with char/varchar fields on SQL Server:
Example:
declare @mayvar as varchar(10)
set @mayvar = 'data '
select mykey, myfield from mytable where myfield = @mayvar
expected
mykey (int) | myfield (varchar10)
1 | 'data '
obtained
mykey | myfield
1 | 'data'
2 | 'data '
even if I write
select mykey, myfield from mytable where myfield = 'data' (without final blank)
I get the same results.
How did I solve it? Like this:
select mykey, myfield
from mytable
where myfield = @mayvar
and DATALENGTH(isnull(myfield,'')) = DATALENGTH(@mayvar)
and if there is an index on myfield, it'll be used in each case.
I hope it will be helpful.
Another way is to transform the strings so that the space has a value.
E.g., replace the space with a known character such as _
if REPLACE('hello',' ','_') = REPLACE('hello ',' ','_')
print 'equal'
else
print 'not equal'
returns: not equal
Not ideal, and probably slow, but is another quick way forward when needed quickly.

Function to find the Exact match in Microsoft SQL Server

What is the way to find the exactly matching substring in the given string in Microsoft SQL server?
For example, in the string '0000020354', I want to find '20354'. Of course it has to be an exact match. I tried to use CHARINDEX(@providerId, external_prv_id) > -1, but the problem with CHARINDEX is that it gives me the index as soon as it finds the first match.
Basically I am looking for a function like indexOf("") in Microsoft SQL Server.
Assuming @ProviderId is a VARCHAR
You could just use LIKE:
SELECT Id FROM TableName WHERE Column LIKE '%' + @ProviderId + '%'
Which will return rows where Column contains 2034.
And if you don't want to use LIKE, you can use PATINDEX:
SELECT Id FROM TableName WHERE PATINDEX('%' + @ProviderId + '%', Column) > 0
Which returns the starting position of any match that it finds.
What's the data you're storing? It sounds like another storage type (e.g. a separate table) might be more suitable.
Ahh, 2034 was a typo. What I don't understand from your question is that you say you need the exact match. If CHARINDEX returns non-zero for '20354' you know that it's matched '20354'. If you don't know what @providerId is, return that in your query along with the result of CHARINDEX. Similarly, if you want external_prv_id, include that, e.g.:
SELECT external_prv_id, CHARINDEX(@providerId, external_prv_id)
FROM TableName -- placeholder; use your actual table name
WHERE CHARINDEX(@providerId, external_prv_id) > 0
(Note that CHARINDEX returning 0 means it was not found.)
If you actually mean that '20354' could include wildcards, you need PATINDEX.
The LIKE %VAL% stuff will be overly broad. For example, if the database contains 00000012345 and you search for 1234, you'll pull this row, which is not what the OP intends (if I'm understanding the "EXACT" part correctly).
What you want is a regular expression that does something like: any number of zeroes followed by the match and end of line.
From this question we know how to trim leading zeroes:
Better techniques for trimming leading zeros in SQL Server?
SUBSTRING(str_col, PATINDEX('%[^0]%', str_col+'.'), LEN(str_col))
So, combine that with your query, and you can do something like the following:
WHERE SUBSTRING(external_prv_id, PATINDEX('%[^0]%', external_prv_id+'.'), LEN(external_prv_id)) = '12345'
Of course, the better (best?) solution would be to store them as INTEGERS so you get full indexability and don't have to muck with all of this crap. If you REALLY need to store the exact string then you have a couple of options:
store the normalized integer results in another column and use that for all internal queries
always store an integer but then pad with zeros upon query (my vote)
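To see the difference between the substring search and the leading-zero trim, here is a small sketch. The table variable and its sample values are invented, and it assumes the IDs are digit strings padded on the left with zeros.

```sql
-- Hypothetical sample data standing in for the real table.
DECLARE @providers TABLE (external_prv_id varchar(20));
INSERT INTO @providers (external_prv_id)
VALUES ('0000020354'), ('0000120354'), ('0000203540');

DECLARE @providerId varchar(20) = '20354';

-- Substring search: all three rows contain '20354' somewhere.
SELECT external_prv_id
FROM @providers
WHERE external_prv_id LIKE '%' + @providerId + '%';

-- Exact match after trimming leading zeros: only '0000020354' qualifies.
SELECT external_prv_id
FROM @providers
WHERE SUBSTRING(external_prv_id,
                PATINDEX('%[^0]%', external_prv_id + '.'),
                LEN(external_prv_id)) = @providerId;
```

Wrapping the column in SUBSTRING and PATINDEX prevents an index seek on external_prv_id, which is one more reason to store a normalized value as suggested above.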

Resources