Collation for URL - sql-server

Warning: I know very little about database collations so apologies in advance if any of this is obvious...
We've got a database column that contains urls. We'd like to place a unique constraint/index on this column.
It's come to my attention that under the default db collation Latin1_General_CI_AS, dupes exist in this column because (for instance) the url http://1.2.3.4:5678/someResource and http://1.2.3.4:5678/SomeResource are considered equal. Frequently this is not the case... the kind of server this url points at is case sensitive.
What would be the most appropriate collation for such a column? Obviously case-sensitivity is a must, but Latin1_General? Are urls Latin1_General? I'm not bothered about a lexicographical ordering, but equality for unique indexes/grouping is important.

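You can alter the table to set a case-sensitive (CS) collation for this column:
ALTER TABLE dbo.MyTable
ALTER COLUMN URLColumn varchar(max) COLLATE Latin1_General_CS_AS
Note that a varchar(max) column cannot be an index key column, so pick a bounded length if you want the unique index.
You can also specify the collation in an individual SQL statement: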
SELECT * FROM dbo.MyTable
WHERE UrlColumn like '%AbC%' COLLATE Latin1_General_CS_AS
Here is a short article for reference.

The letters CI in the collation name indicate case insensitivity.
A URL uses only a small subset of Latin characters and symbols, so try Latin1_General_CS_AI.

Latin1_General uses code page 1252, and the characters allowed in a URL are all included in that code page, so you can say that URLs are Latin1_General.
You just have to select the case-sensitive option, Latin1_General_CS_AS.

RFC 3986 says:
The ABNF notation defines its terminal values to be non-negative
integers (codepoints) based on the US-ASCII coded character set
[ASCII].
Wikipedia says that the allowed characters are:
Unreserved
May be encoded but it is not necessary
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 - _ . ~
Reserved
Have to be encoded sometimes
! * ' ( ) ; : @ & = + $ , / ? % # [ ]
There seem to be no conflicts between these characters in comparison operations. You can also use the HASHBYTES function to make this comparison.
But this kind of operation is not the major problem. The major problem is that http://domain:80 and http://domain may be the same URL. Likewise, a URL with percent-encoded characters may look different from its unencoded form while pointing at the same resource.
In my opinion, RDBMSs will eventually incorporate this kind of structure as new data types: url, phone number, email address, mac address, password, latitude, longitude, ... . I think that a collation can help but won't solve this issue.
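To make the point about http://domain:80 and http://domain concrete, here is a minimal Python sketch of normalizing a URL before storing or comparing it. The names normalize_url and DEFAULT_PORTS are made up for this example, only http and https default ports are handled, and decoding every percent-escape in the path is a simplification (it would also fold %2F into /), so treat it as an illustration rather than a complete normalizer.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit

# Assumption: only these two schemes get default-port handling in this sketch.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Return a canonical form of url for equality checks (hypothetical helper)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # host names are case-insensitive
    port = parts.port
    # Drop the port when it is the default for the scheme (userinfo is ignored).
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    # Decode then re-encode the path so %7E and ~ compare equal.
    # The case of the path is preserved, because it can be significant.
    path = quote(unquote(parts.path), safe="/~")
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


print(normalize_url("http://domain:80") == normalize_url("http://domain"))
print(normalize_url("http://1.2.3.4:5678/someResource")
      == normalize_url("http://1.2.3.4:5678/SomeResource"))
```

The first comparison prints True and the second prints False. The normalized value is what you would store in the case-sensitive column, so the unique index only ever sees canonical forms.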

Related

Convert Excel formulae (logic) to SQL Server

I'm looking for help on converting an Excel formula to SQL Server.
=If(AND(N3="A", R3>O3),
R3,If(AND(N3="P",S3>O3),S3,If(N3="D","",If(OR(Q3="P",Q3="A")*AND(P3>TODAY(),P3>O3),P3,O3))))
Here is the SQL I tried. Columns N and Q are varchar and the other fields are datetime in SQL Server. In the SQL statement below, I have tried replacing the AND marked in bold with OR. When I use AND (bold) I get the right data in some cases; when I use OR, I get the right data in other cases. Here is the database structure with insert statements.
https://www.db-fiddle.com/f/iHYxufV2NuyXwHeBM832NS/4
create table dbo.test (id int, N varchar(10), O datetime, P datetime, Q varchar(10), R datetime, S datetime)
select case when N='A' and R>O THEN R
when N='P' and S>O then S
when N='D' then ''
when (Q='P' or Q='Á') **and** p>getdate() and P>O then P else O end data
from test
Output for the above fiddle:
id-data
1-2020-11-20 00:00:00
2-2021-02-15 00:00:00
3-2021-04-11 00:00:00
4-2021-04-16 00:00:00
5-2021-04-30 00:00:00

The problem is * is a multiplication operator, but both sides of the expression are boolean rather than numeric. I think what's going on is Excel is converting the boolean true/false values to 1 and 0 for the multiplication operation.
If this is correct, then AND is the correct operator and almost everything else in the translation is correct.
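There is one other mistake. when N='D' then '' is wrong, because the other result values all appear to be DateTime columns. A CASE expression returns a single type, and DateTime takes precedence over varchar, so SQL Server converts the empty string to 1900-01-01 instead of showing a blank. Instead, you need when N='D' then NULL.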
CASE WHEN N = 'A' AND R > O THEN R
WHEN N = 'P' AND S > O THEN S
WHEN N = 'D' THEN NULL
WHEN Q IN ('P', 'A') AND P > current_timestamp AND P > O THEN P
ELSE O END
If you really need an empty string, you can convert the result and coalesce to empty string at a different level, but don't do it inside the CASE expression.
It's also worth noting the DateTime/String mismatch could entirely explain the strange results. If you have a sample somewhere for testing with the columns represented as Varchar values instead of Date or DateTime, then the comparisons could be wrong, throwing off the results. For example, O comes before S in the third row of sample data if they are compared as strings instead of dates.
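The claim that OR(...)*AND(...) behaves like a logical AND can be checked with a small truth table. The sketch below uses Python, where booleans coerce to 1 and 0 in arithmetic much as they do in Excel; excel_style is a made-up name, and q_match and date_ok stand in for the OR(...) and AND(...) parts of the formula.

```python
from itertools import product


def excel_style(q_match: bool, date_ok: bool) -> int:
    # TRUE/FALSE are coerced to 1/0 before multiplying.
    return int(q_match) * int(date_ok)


# Exhaustive truth table: the product is nonzero exactly when both sides are true.
for q_match, date_ok in product([False, True], repeat=2):
    result = excel_style(q_match, date_ok)
    assert bool(result) == (q_match and date_ok)
    print(q_match, date_ok, result)
```

Only the True/True row yields 1, which is why AND is the right operator in the SQL translation.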

If this is your verbatim code, you have an accented A in this line:
when (Q='P' or Q='Á') and p>getdate() and P>O then P else O end data
A and Á are not equivalent, so that may be short circuiting your OR and failing to return values for any Q = 'A' values that aren't handled further up in the logic.
Other than that your logic looks equivalent. The use of OR(...)*AND(...) is odd but does produce a 1/0 value, and your conversion into SQL has the correct boolean operators to match that logic.
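As a quick illustration of why 'A' and 'Á' do not match under an accent-sensitive comparison, here is a small Python sketch. Python compares code points, which is only a stand-in for the collation rules in SQL Server, and strip_accents is a hypothetical helper written for this example.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks after canonical decomposition (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


plain, accented = "A", "\u00c1"  # "A" and "Á"
print(plain == accented)                 # False
print(strip_accents(accented) == plain)  # True
```

Whether SQL Server treats the two as equal depends on the accent sensitivity (AS or AI) of the collation in use, so retyping the literal as a plain A is the safer fix.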

Why character d & f ignored for oracle Number field in where condition?

As mentioned in the question title, the characters d and f appear to be ignored in an Oracle WHERE condition.
The query below runs without any error:
select employee_id from employees where employee_id > 106f
But if I put any letter other than d or f after 106, an ORA-00933: SQL command not properly ended error is thrown, because employee_id is of datatype Number.
Why this strange behaviour? It also only happens for a single letter after the number; if I specify 106df it throws an error (which is correct).

According to Oracle docs, d and f are allowable suffixes for numeric literals, denoting 64-bit (double) and 32-bit (float) binary floating-point types. In your case, the type doesn't make any difference (it probably just gets converted back to integer for the comparison, and with no loss of accuracy because 106 is small enough to be represented exactly as a float), so it looks like nothing is happening. Other letters, and 106df, aren't allowed by the syntax. (e is allowed, but only if followed by a number.)
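The remark that 106 is small enough to be represented exactly as a 32-bit float can be checked outside Oracle. The sketch below uses Python's struct module as a stand-in for a 32-bit binary float; roundtrip_float32 is a made-up helper name.

```python
import struct


def roundtrip_float32(value: float) -> float:
    """Store value as a 32-bit IEEE 754 float and read it back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


print(roundtrip_float32(106.0) == 106.0)            # True: 106 fits exactly
print(roundtrip_float32(16777217.0) == 16777217.0)  # False: 2**24 + 1 does not fit
```

A 32-bit float has a 24-bit significand, so every integer up to 2**24 survives the round trip, which is why 106f compares the same as 106.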

T-SQL ORDER BY ignores " '-' + ... " but not " '+' + ... "

So I recently encountered a weird bug when comparing two values.
My values were in a range from -1 to 2.
Sometimes it thought that -1 was bigger than 0. The solution was easy: apparently the column was set to varchar(50) instead of int.
But this made me wonder why it happened, because even if the column was varchar(50), '-' should have a lower char value than '0' (the char value for '-' is 45 and for '0' it is 48).
I ran some tests and, as far as I can tell, '-' is the only character that ORDER BY doesn't care about.
Example:
SELECT
A.x
FROM
(
VALUES
('-5'), ('-4'), ('-3'), ('-2'), ('-1'),
('0'), ('1'), ('2'), ('3'), ('4'), ('5')
) A(x)
ORDER BY
A.x;
SELECT
B.x
FROM
(
VALUES
('+5'), ('+4'), ('+3'), ('+2'), ('+1'),
('0'), ('1'), ('2'), ('3'), ('4'), ('5')
) B(x)
ORDER BY
B.x
Result:
Result of A
0
1
-1
2
-2
3
-3
4
-4
5
-5
Result of B
+1
+2
+3
+4
+5
0
1
2
3
4
5
(+ has a charvalue of 43)
The '+' order by feels right but the '-' seems... wrong
Does anyone know why it is like this?
Additional info
Server version: 12.0.4213
Collation: Finnish_Swedish_CI_AS
No clue what else could skew the result. Ask if you need more information.

Found out why.
TLDR: Non-Unicode and Unicode collations sort '-' differently.
"A SQL collation's rules for sorting non-Unicode data are incompatible
with any sort routine that is provided by the Microsoft Windows
operating system; however, the sorting of Unicode data is compatible
with a particular version of the Windows sorting rules. Because the
comparison rules for non-Unicode and Unicode data are different, when
you use a SQL collation you might see different results for
comparisons of the same characters, depending on the underlying data
type. For example, if you are using the SQL collation
"SQL_Latin1_General_CP1_CI_AS", the non-Unicode string 'a-c' is less
than the string 'ab' because the hyphen ("-") is sorted as a separate
character that comes before "b". However, if you convert these strings
to Unicode and you perform the same comparison, the Unicode string
N'a-c' is considered to be greater than N'ab' because the Unicode
sorting rules use a "word sort" that ignores the hyphen."
Source: https://support.microsoft.com/en-us/kb/322112
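The "word sort" described in the quote can be imitated with a toy sort key that compares strings with hyphens removed and only uses the hyphen to break ties. This Python sketch is a simplified model written for illustration, not the actual Windows or SQL Server collation algorithm, and word_sort_key is a made-up name.

```python
def word_sort_key(s: str):
    # Toy model of a "word sort": compare with hyphens removed, and use the
    # presence of a hyphen only to break ties between otherwise equal strings.
    return (s.replace("-", ""), "-" in s)


minus = ["-5", "-4", "-3", "-2", "-1", "0", "1", "2", "3", "4", "5"]
plus = ["+5", "+4", "+3", "+2", "+1", "0", "1", "2", "3", "4", "5"]

print(sorted(minus, key=word_sort_key))
# ['0', '1', '-1', '2', '-2', '3', '-3', '4', '-4', '5', '-5']
print(sorted(plus))  # plain code point order; '+' (43) sorts before '0' (48)
# ['+1', '+2', '+3', '+4', '+5', '0', '1', '2', '3', '4', '5']
print(sorted(minus))  # what plain code point order would give for the '-' list
# ['-1', '-2', '-3', '-4', '-5', '0', '1', '2', '3', '4', '5']
```

The first and second outputs match the two result sets in the question, which is consistent with the hyphen being ignored while the plus sign is treated as an ordinary character.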

How do I match a substring of variable length?

I am importing data into my SQL database from an Excel spreadsheet.
The imp table is the imported data, the app table is the existing database table.
app.ReceiptId is formatted as "A" followed by some numbers. Formerly it was 4 digits, but now it may be 4 or 5 digits.
Examples:
A1234
A9876
A10001
imp.ref is a free-text reference field from Excel. It consists of some arbitrary length description, then the ReceiptId, followed by an irrelevant reference number in the format " - BZ-0987654321" (which is sometimes cropped short, or even missing entirely).
Examples:
SHORT DESC A1234 - BZ-0987654321
LONGER DESCRIPTION A9876 - BZ-123
REALLY LONG DESCRIPTION A2345 - B
REALLY REALLY LONG DESCRIPTION A23456
The code below works for a 4-digit ReceiptId, but will not correctly capture a 5-digit one.
UPDATE app
SET
[...]
FROM imp
INNER JOIN app
ON app.ReceiptId = right(right(rtrim(replace(replace(imp.ref,'-',''),'B','')),5)
+ rtrim(left(imp.ref,charindex(' - BZ-',imp.ref))),5)
How can I change the code so it captures either 4 (A1234) or 5 (A12345) digits?

As ughai rightly wrote in his comment, it's not recommended to use anything other than columns in the ON clause of a join.
The reason is that wrapping columns in functions prevents SQL Server from using any indexes it could otherwise use on those columns.
Therefore, I would suggest adding another column to the imp table that holds the actual ReceiptId and is calculated during the import process itself.
I think the best way of extracting the ReceiptId from the ref column is using SUBSTRING with PATINDEX:
SELECT ref,
RTRIM(SUBSTRING(ref, PATINDEX('%A[0-9][0-9][0-9][0-9]%', ref), 6)) As ReceiptId
FROM imp
Update
After the conversation with t-clausen-dk in the comments, I came up with this:
SELECT ref,
CASE WHEN PATINDEX('%[ ]A[0-9][0-9][0-9][0-9][0-9 ]%', ref) > 0
OR PATINDEX('A[0-9][0-9][0-9][0-9][0-9 ]%', ref) = 1 THEN
SUBSTRING(ref, PATINDEX('%A[0-9][0-9][0-9][0-9][0-9 ]%', ref), 6)
ELSE
NULL
END As ReceiptId
FROM imp
This returns NULL if there is no match. A match is a substring consisting of A followed by 4 or 5 digits, separated by spaces from the rest of the string; it can be found at the start, middle or end of the string.
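If the ReceiptId is computed during the import step, as suggested above, the same matching rule can be written with a regular expression outside the database. The following Python sketch is one possible way to do that; it swaps PATINDEX for Python's re module, and RECEIPT_RE and extract_receipt_id are names made up for this example.

```python
import re
from typing import Optional

# "A" followed by 4 or 5 digits, bounded by whitespace or the string edges.
RECEIPT_RE = re.compile(r"(?<!\S)A[0-9]{4,5}(?!\S)")


def extract_receipt_id(ref: str) -> Optional[str]:
    """Return the first receipt id found in ref, or None if there is none."""
    match = RECEIPT_RE.search(ref)
    return match.group(0) if match else None


samples = [
    "SHORT DESC A1234 - BZ-0987654321",
    "LONGER DESCRIPTION A9876 - BZ-123",
    "REALLY LONG DESCRIPTION A2345 - B",
    "REALLY REALLY LONG DESCRIPTION A23456",
]
for ref in samples:
    print(extract_receipt_id(ref))
```

For the four sample rows from the question this prints A1234, A9876, A2345 and A23456. The extracted value can then be stored in its own column and joined on directly.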

Try this. It removes all characters before the A[number][number][number][number] match and takes the first 5 or 6 characters from that point:
UPDATE app
SET
[...]
FROM imp
INNER JOIN app
ON app.ReceiptId in
(
left(stuff(ref,1, patindex('%A[0-9][0-9][0-9][0-9][ ]%', imp.ref + ' ') - 1, ''), 5),
left(stuff(ref,1, patindex('%A[0-9][0-9][0-9][0-9][0-9][ ]%', imp.ref + ' ') - 1, ''), 6)
)
When comparing for equality, trailing spaces are not taken into account.

Find all special characters in a column in SQL Server 2008

I need to find the occurrence of all special characters in a column in SQL Server 2008. So, I don't care about A, B, C ... 8, 9, 0, but I do care about !, #, &, etc.
The easiest way to do so, in my mind, would exclude A, B, C, ... 8, 9, 0, but if I wrote a statement to exclude those, I would miss entries that had ! and A. So, it seems to me that I would have to get a list of every non-alphabet / non-number character, then run a SELECT with a LIKE and Wildcard qualifiers.
Here is what I would run:
SELECT Col1
FROM TABLE
WHERE Col1 LIKE ('!', '@', '#', '$', '%'....)
However, I don't think you can run multiple qualifiers, can you? Is there a way I could accomplish this?

Negatives are your friend here:
SELECT Col1
FROM TABLE
WHERE Col1 like '%[^a-Z0-9]%'
Which says that you want any rows where Col1 consists of any number of characters, then one character not in the set a-Z0-9, and then any number of characters.
If you have a case sensitive collation, it's important that you use a range that includes both upper and lower case A, a, Z and z, which is what I've given (originally I had it the wrong way around. a comes before A. Z comes after z)
Or, to put it another way, you could have written your original WHERE as:
Col1 LIKE '%[!@#$%]%'
But, as you observed, you'd need to know all of the characters to include in the [].
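The negated character class idea carries over to regular expressions, which can be handy for checking sample values before running the query. This Python sketch is only an analogue of the LIKE pattern above, with the ranges spelled out as A-Z, a-z and 0-9; SPECIAL_RE and special_characters are made-up names.

```python
import re
from typing import List

# Any single character outside A-Z, a-z and 0-9 counts as "special" here.
SPECIAL_RE = re.compile(r"[^A-Za-z0-9]")


def special_characters(value: str) -> List[str]:
    """Return every special character in value, in order of appearance."""
    return SPECIAL_RE.findall(value)


print(special_characters("abc123"))   # []
print(special_characters("a!b@c"))    # ['!', '@']
print(special_characters("Test'Me"))  # ["'"]
```

Note that spaces and punctuation such as commas also count as special under this definition, so add them inside the brackets if they should be allowed.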

The following Transact-SQL script works for all languages (international). The solution is not to check for alphanumeric characters but to check that the string contains no special characters.
DECLARE @teststring nvarchar(max)
SET @teststring = 'Test''Me'
SELECT 'IS ALPHANUMERIC: ' + @teststring
WHERE @teststring NOT LIKE '%[-!#%&+,./:;<=>@`{|}~"()*\\\_\^\?\[\]\'']%' {ESCAPE '\'}

Select * from TableName Where ColumnName LIKE '%[^A-Za-z0-9, ]%'
This will give you all the row which contains any special character.

select count(*) from dbo.tablename where address_line_1 LIKE '%[\'']%' {eSCAPE'\'}
