Is there any way in SQL Server of determining what a character in a code page would represent without actually creating a test database of that collation?
Example: if I create a test database with collation SQL_Ukrainian_CP1251_CS_AS and then run SELECT CHAR(255), it returns я.
However, if I try the following on a database with SQL_Latin1_General_CP1_CS_AS collation:
SELECT CHAR(255) COLLATE SQL_Ukrainian_CP1251_CS_AS
It returns y
SELECT CHAR(255)
returns ÿ, so it is obviously going via the database's default collation first and then trying to find the closest equivalent to that in the explicit collation. Can this be avoided?
Actually, I have found an answer to my question now. It's a bit clunky, but it does the job, unless there's a better way out there?
SET NOCOUNT ON;
CREATE TABLE #Collations
(
code TINYINT PRIMARY KEY
);
WITH E00(N) AS (SELECT 1 UNION ALL SELECT 1), --2
E02(N) AS (SELECT 1 FROM E00 a, E00 b), --4
E04(N) AS (SELECT 1 FROM E02 a, E02 b), --16
E08(N) AS (SELECT 1 FROM E04 a, E04 b) --256
INSERT INTO #Collations
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) - 1
FROM E08
DECLARE @AlterScript NVARCHAR(MAX) = ''
SELECT @AlterScript = @AlterScript + '
RAISERROR(''Processing ' + name + ''',0,1) WITH NOWAIT;
ALTER TABLE #Collations ADD ' + name + ' CHAR(1) COLLATE ' + name + ';
EXEC(''UPDATE #Collations SET ' + name + '=CAST(code AS BINARY(1))'');
EXEC(''UPDATE #Collations SET ' + name + '=NULL WHERE ASCII(' + name + ') <> code'');
'
FROM sys.fn_helpcollations()
WHERE name LIKE '%CS_AS'
AND name NOT IN /*Unicode Only Collations*/
( 'Assamese_100_CS_AS', 'Bengali_100_CS_AS',
'Divehi_90_CS_AS', 'Divehi_100_CS_AS' ,
'Indic_General_90_CS_AS', 'Indic_General_100_CS_AS',
'Khmer_100_CS_AS', 'Lao_100_CS_AS',
'Maltese_100_CS_AS', 'Maori_100_CS_AS',
'Nepali_100_CS_AS', 'Pashto_100_CS_AS',
'Syriac_90_CS_AS', 'Syriac_100_CS_AS',
'Tibetan_100_CS_AS' )
EXEC (@AlterScript)
SELECT * FROM #Collations
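-- For example, to answer the original question with a lookup (run this before the
-- DROP below; the column name was generated from the collation name by the script):
SELECT code, SQL_Ukrainian_CP1251_CS_AS
FROM #Collations
WHERE code = 255; -- shows я for the Ukrainian CP1251 collation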
DROP TABLE #Collations
While MS SQL supports both code pages and Unicode, it unhelpfully doesn't provide any functions to convert between the two, so figuring out what character is represented by a value in a different code page is a pig.
There are two potential methods I've seen to handle conversions. One is detailed here:
http://www.codeguru.com/cpp/data/data-misc/values/article.php/c4571
and involves bolting a custom conversion program onto the database and using that for conversions.
The other is to construct a DB table consisting of
[CodePage], [ANSI Value], [UnicodeValue]
with the Unicode value stored either as the int representing the Unicode character (to be converted using NCHAR()) or as the NCHAR itself.
You're using the collation SQL_Ukrainian_CP1251_CS_AS, which is code page 1251 (CP1251, from the centre of the string). You can grab its translation table here: http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1251.TXT
It's a TSV, so after trimming the top off, the raw data should import fairly cleanly.
Personally I'd lean more towards the latter than the former especially for a production server as the former may introduce instability.
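A minimal sketch of that mapping-table approach (the table and column names here are hypothetical, and the lookup assumes CP1251.TXT has been imported):
CREATE TABLE dbo.CodePageMap
(
CodePage INT NOT NULL,
AnsiValue TINYINT NOT NULL,
UnicodeValue NCHAR(1) NOT NULL,
CONSTRAINT PK_CodePageMap PRIMARY KEY (CodePage, AnsiValue)
);
-- after importing the mapping file, the original question becomes a simple lookup:
SELECT UnicodeValue
FROM dbo.CodePageMap
WHERE CodePage = 1251
AND AnsiValue = 255; -- should return я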
I have a data set of about 33 million rows and 20 columns. One of the columns is a raw data tab I'm using to extract relevant data from, including IDs and account numbers.
I extracted a column for User IDs into a temporary table to trim the User IDs of spaces. I'm now trying to add the trimmed User ID column back into the original data set using this code:
SELECT *
FROM [dbo].[DATA] AS A
INNER JOIN #TempTable AS B ON A.[RawColumn] = B.[RawColumn]
Extracting the User IDs and trimming the spaces took about a minute for each query. However, running this last query, I'm at the 2-hour mark and I'm only 2% of the way through the dataset.
Is there a better way to run the query?
I'm running the query in SQL Server 2014 Management Studio
Thanks
Update:
I continued to let it run through the night. When I got back into work, only 6 million of the 33 million rows had been completed. I cancelled the execution and I'm trying to add a smaller primary key (the only other key I could see on the table was [RawColumn], which was a very long string of text) using:
ALTER TABLE [dbo].[DATA]
ADD ID INT IDENTITY(1,1)
Right now I'm an hour into the execution.
Next, I'm planning to make it the primary key using
ALTER TABLE dbo.[DATA]
ADD CONSTRAINT PK_DATA PRIMARY KEY (ID)
I'm not familiar with using indexes. I've tried looking up on Stack Overflow how to create one, but from what I'm reading, it sounds like it would take just as long to create an index as it would to run this query. Am I wrong about that?
For context on the RawColumn data, it looks something like this:
FirstName: John LastName: Smith UserID: JohnS Account#: 000-000-0000
Update #2:
I'm now learning that using "ALTER TABLE" is a bad idea. I should have done a little bit more research into how to add a primary key to a table.
Update #3:
Here's the code I used to extract the "UserID" code out of the "RawColumn" data.
DROP TABLE #TempTable1
GO
SELECT [RAWColumn],
SUBSTRING([RAWColumn], CHARINDEX('USERID:', [RAWColumn])+LEN('USERID:'), CHARINDEX('Account#:', [RAWColumn])-Charindex('Username:', [RAWColumn]) - LEN('Account#:') - LEN('USERID:')) AS 'USERID_NEW'
INTO #TempTable1
FROM [dbo].[DATA]
Next I trimmed the data from the temporary tables
DROP TABLE #TempTable2
GO
SELECT [RawColumn],
LTRIM([USERID_NEW]) AS 'USERID_NEW'
INTO #TempTable2
FROM #TempTable1
So now I'm trying to get the data from #TEMPTABLE2 back into my original [DATA] table. Hopefully this is more clear now.
So I think your parsing code is a little bit wrong. Here's an approach that doesn't assume that the values appear in any particular order. It does assume that the header/tag name has a space after the colon character, and it assumes that the value ends at the subsequent space character. Here's a snippet that manipulates a single value.
declare @dat varchar(128) = 'FirstName: John LastName: Smith UserID: JohnS Account#: 000-000-0000';
declare @tag varchar(16) = 'UserID: ';
/* datalength() counts the trailing space character unlike len() */
declare @idx int = charindex(@tag, @dat) + datalength(@tag);
select substring(@dat, @idx, charindex(' ', @dat + ' ', @idx + 1) - @idx) as UserID
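For the sample string above, this returns JohnS.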
To use it in a single query without the temporary variable, the most straightforward approach is to just replace each instance of "@idx" with the original expression:
declare @tag varchar(16) = 'UserID: ';
select RawColumn,
substring(
RawColumn,
charindex(@tag, RawColumn) + datalength(@tag),
charindex(
' ', RawColumn + ' ',
charindex(@tag, RawColumn) + datalength(@tag) + 1
) - (charindex(@tag, RawColumn) + datalength(@tag))
) as UserID
from dbo.DATA;
As an update it looks something like this:
declare @tag varchar(16) = 'UserID: ';
update dbo.DATA
set UserID =
substring(
RawColumn,
charindex(@tag, RawColumn) + datalength(@tag),
charindex(
' ', RawColumn + ' ',
charindex(@tag, RawColumn) + datalength(@tag) + 1
) - (charindex(@tag, RawColumn) + datalength(@tag))
);
You also appear to be ignoring upper/lower case in your string matches. It's not clear to me whether you need to consider that more carefully.
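If it does matter, one option (just a sketch; the collation name is an example) is to force the case sensitivity you want on the search itself with an explicit COLLATE clause, which takes precedence over the column's implicit collation:
declare @tag varchar(16) = 'UserID: ';
-- case-insensitive search regardless of how dbo.DATA is collated:
select charindex(@tag, RawColumn collate Latin1_General_CI_AS)
from dbo.DATA;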
I have a question about SQL Server: I have a database column with a pattern which is like this:
1. up to 10 digits
2. then a comma
3. up to 10 digits
4. then a semicolon
e.g.
100000161, 100000031; 100000243, 100000021;
100000161, 100000031; 100000243, 100000021;
and I want to extract, within the pattern, the first digits (up to 10) (1.) and then a semicolon (4.)
(or, in other words, remove everything from the semicolon to the next semicolon)
100000161; 100000243; 100000161; 100000243;
Can you please advise me how to establish this in SQL Server? I'm not very familiar with regex and therefore have no clue how to fix this.
Thanks,
Alex
Try this
Declare @Sql Table (SqlCol nvarchar(max))
INSERT INTO @Sql
SELECT '100000161,100000031;100000243,100000021;100000161,100000031;100000243,100000021;'
;WITH cte
AS (SELECT Row_number()
OVER(
ORDER BY (SELECT NULL)) AS Rno,
split.a.value('.', 'VARCHAR(1000)') AS Data
FROM (SELECT Cast('<S>'
+ Replace(Replace(sqlcol, ';', ','), ',',
'</S><S>')
+ '</S>' AS XML) AS Data
FROM @Sql) AS A
CROSS APPLY Data.nodes('/S') AS Split(a))
SELECT Stuff((SELECT '; ' + data
FROM cte
WHERE rno%2 <> 0
AND data <> ''
FOR xml path ('')), 1, 2, '') AS ExpectedData
ExpectedData
-------------
100000161; 100000243; 100000161; 100000243
I believe this will get you what you are after, as long as that pattern truly holds. If not, it's fairly easy to ensure it does conform to that pattern and then apply this:
Select Substring(TargetCol, 1, Charindex(',', TargetCol) - 1) + ';' From TargetTable
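For example, a quick conformance filter could look like this (a sketch; it only keeps rows that start with digits followed by a comma):
Select Substring(TargetCol, 1, Charindex(',', TargetCol) - 1) + ';'
From TargetTable
Where TargetCol Like '[0-9]%,%'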
You can take advantage of SQL Server's XML support to convert the input string into an XML value and query it with XQuery and XPath expressions.
For example, the following query will replace each ; with </b><a> and each , with </a><b> to turn each string into <a>100000161</a><a>100000243</a><a />. After that, you can select individual <a> nodes with /a[1], /a[2]:
declare @table table (it nvarchar(200))
insert into @table values
('100000161, 100000031; 100000243, 100000021;'),
('100000161, 100000031; 100000243, 100000021;')
select
xCol.value('/a[1]','nvarchar(200)'),
xCol.value('/a[2]','nvarchar(200)')
from (
select convert(xml, '<a>'
+ replace(replace(replace(it,';','</b><a>'),',','</a><b>'),' ','')
+ '</a>')
.query('a') as xCol
from @table) as tmp
-------------------------
A1 A2
100000161 100000243
100000161 100000243
value extracts a single value from an XML field. nodes returns a table of nodes that match the XPath expression. The following query will return all "keys":
select
a.value('.','nvarchar(200)')
from (
select convert(xml, '<a>'
+ replace(replace(replace(it,';','</b><a>'),',','</a><b>'),' ','')
+ '</a>')
.query('a') as xCol
from @table) as tmp
cross apply xCol.nodes('a') as y(a)
where a.value('.','nvarchar(200)')<>''
------------
100000161
100000243
100000161
100000243
With 200K rows of data though, I'd seriously consider transforming the data when loading it and storing it in individual, indexable columns, or adding a separate, related table. Applying string manipulation functions on a column means that the server can't use any covering indexes to speed up queries.
If that's not possible (why?) I'd consider at least adding a separate XML-typed column that would contain the same data in XML form, to allow the creation of an XML index.
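A rough sketch of that second option (table and column names are hypothetical; a primary XML index also requires a clustered primary key on the table):
ALTER TABLE dbo.SourceData ADD it_xml XML NULL;
UPDATE dbo.SourceData
SET it_xml = convert(xml, '<a>'
+ replace(replace(replace(it,';','</b><a>'),',','</a><b>'),' ','')
+ '</a>');
CREATE PRIMARY XML INDEX IX_SourceData_it_xml ON dbo.SourceData (it_xml);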
I have a column with the name of a person in the following format: "LAST NAME, FIRST NAME"
Only Upper Cases Allowed
Space after comma optional
I would like to use a regular expression like: [A-Z]+,[ ]?[A-Z]+ but I do not know how to do this in T-SQL. In Oracle, I would use REGEXP_LIKE, is there something similar for SQL Server 2016?
I need something like the following:
UPDATE table
SET is_correct_format = 'YES'
WHERE REGEXP_LIKE(table.name,'[A-Z]+,[ ]?[A-Z]+');
First, case sensitivity depends on the collation of the DB, though with LIKE you can specify case comparisons. With that... here is some Boolean logic to take care of the cases you stated. Though, you may need to add additional clauses if you discover some bogus input.
declare @table table (Person varchar(64), is_correct_format varchar(3) default 'NO')
insert into @table (Person)
values
('LowerCase, Here'),
('CORRECTLY, FORMATTED'),
('CORRECTLY,FORMATTEDTWO'),
('ONLY FIRST UPPER, LowerLast'),
('WEGOT, FormaNUMB3RStted'),
('NoComma Formatted'),
('CORRECTLY, TWOCOMMA, A'),
(',COMMA FIRST'),
('COMMA LAST,'),
('SPACE BEFORE COMMA , GOOD'),
(' SPACE AT BEGINNING, GOOD')
update @table
set is_correct_format = 'YES'
where
Person not like '%[^A-Z, ]%' --check for non characters, excluding comma and spaces
and len(replace(Person,' ','')) = len(replace(replace(Person,' ',''),',','')) + 1 --make sure there is only one comma
and charindex(',',Person) <> 1 --make sure the comma isn't at the beginning
and charindex(',',Person) <> len(Person) --make sure the comma isn't at the end
and substring(Person,charindex(',',Person) - 1,1) <> ' ' --make sure there isn't a space before comma
and left(Person,1) <> ' ' --check preceding spaces
and UPPER(Person) = Person collate Latin1_General_CS_AS --check collation for CI default (only upper cases)
select * from @table
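With this sample data, only the CORRECTLY, FORMATTED and CORRECTLY,FORMATTEDTWO rows end up flagged YES; every other row trips at least one of the checks above.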
The T-SQL equivalent could look like this. I'm not vouching for the efficiency of this solution.
declare @table as table(name varchar(20), is_Correct_format varchar(5))
insert into @table(name) Values
('Smith, Jon')
,('se7en, six')
,('Billy bob')
UPDATE @table
SET is_correct_format = 'YES'
WHERE
replace(name, ', ', ',x')
like (replicate('[a-z]', charindex(',', name) - 1)
+ ','
+ replicate('[a-z]', len(name) - charindex(',', name)) )
select * from @table
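Here only Smith, Jon ends up flagged YES: se7en, six fails on the digit, and Billy bob has no comma, so the constructed pattern is NULL and the LIKE never matches.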
The optional space is hard to solve, so since it's next to a legal character, I'm just replacing it with another legal character when it's there.
T-SQL does not provide the kind of repeating pattern (* or +) found in regex, so you have to count the characters and build the character class that many times into your search pattern.
I split the string at the comma, counted the alphas before and after, and built a search pattern to match.
Clunky, but doable.
I have the following table:
Select
name,
address,
description
from dbo.users
I would like to search all this table for any characters that are UNICODE but not ASCII. Is this possible?
You can find non-ASCII characters quite simply:
SELECT NAME, ADDRESS, DESCRIPTION
FROM DBO.USERS
WHERE NAME != CAST(NAME AS VARCHAR(4000))
OR ADDRESS != CAST(ADDRESS AS VARCHAR(4000))
OR DESCRIPTION != CAST(DESCRIPTION AS VARCHAR(4000))
If you want to determine if there are any characters in an NVARCHAR / NCHAR / NTEXT column that cannot be converted to VARCHAR, you need to convert to VARCHAR using the _BIN2 variation of the collation being used for that particular column. For example, if a particular column is using Albanian_100_CI_AS, then you would specify Albanian_100_BIN2 for the test.
The reason for using a _BIN2 collation is that non-binary collations will only find instances where there is at least one character that does not have any mapping at all in the code page and is thus converted into ?. But non-binary collations do not catch instances where there are characters that don't have a direct mapping into the code page, but instead have a "best fit" mapping.
For example, the superscript 2 character, ², has a direct mapping in code page 1252, so definitely no problem there. On the other hand, it doesn't have a direct mapping in code page 1250 (used by the Albanian collations), but it does have a "best fit" mapping which converts it into a regular 2. The problem with the non-binary collation is that 2 will equate to ² and so it won't register as a row that can't convert to VARCHAR. For example:
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE French_100_CI_AS); -- Code Page 1252
-- ²
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS); -- Code Page 1250
-- 2
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS)
WHERE N'²' <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS));
-- (no rows returned)
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_BIN2)
WHERE N'²' <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_BIN2));
-- 2
Ideally you would convert back to NVARCHAR explicitly for the code to be clear on what it's doing, though not doing this will still implicitly convert back to NVARCHAR, so the behavior is the same either way.
Please note that only MAX types are used. Do not use NVARCHAR(4000) or VARCHAR(4000) else you might get false positives due to truncation of data in NVARCHAR(MAX) columns.
So, in terms of the example code in the question, the query would be (assuming that a Latin1_General collation is being used):
SELECT usr.*
FROM dbo.[users] usr
WHERE usr.[name] <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), usr.[name] COLLATE Latin1_General_100_BIN2))
OR usr.[address] <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), usr.[address] COLLATE Latin1_General_100_BIN2))
OR usr.[description] <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), usr.[description] COLLATE Latin1_General_100_BIN2));
There doesn't seem to be an inbuilt function for this as far as I can tell. A brute force approach is to pass each character to ascii and then pass the result to char and check if it returns '?', which would mean the character is out of range. You can write a UDF with the below code as reference, but I should think that it is a very inefficient solution:
declare @i int = 1
declare @x nvarchar(10) = N'vsdǣf'
declare @result nvarchar(100) = N''
while (@i <= len(@x))
begin
if char(ascii(substring(@x,@i,1))) = '?'
begin
set @result = @result + substring(@x,@i,1)
end
set @i = @i+1
end
select @result
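Wrapped up as a UDF it might look like this sketch (dbo.NonAsciiChars is a hypothetical name; the extra <> N'?' test avoids flagging literal question marks, which the loop above would misreport):
CREATE FUNCTION dbo.NonAsciiChars (@s NVARCHAR(4000))
RETURNS NVARCHAR(4000)
AS
BEGIN
    DECLARE @i INT = 1, @out NVARCHAR(4000) = N'';
    WHILE @i <= LEN(@s)
    BEGIN
        -- a character that converts to '?' but isn't '?' has no mapping in the code page
        IF CHAR(ASCII(SUBSTRING(@s, @i, 1))) = '?' AND SUBSTRING(@s, @i, 1) <> N'?'
            SET @out = @out + SUBSTRING(@s, @i, 1);
        SET @i = @i + 1;
    END;
    RETURN @out;
END;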
I am stuck on converting a varchar column UserID to INT. I know, please don't ask why this UserID column was not created as INT initially, long story.
So I tried this, but it doesn't work and gives me an error:
select CAST(userID AS int) from audit
Error:
Conversion failed when converting the varchar value
'1581............................................................................................................................' to data type int.
I did select len(userID) from audit and it returns 128 characters, which are not spaces.
I tried to detect the ASCII values of the characters trailing the ID number, and the ASCII value = 0.
I have also tried LTRIM, RTRIM, and replacing char(0) with '', but it does not work.
The only way it works is when I specify a fixed number of characters, like this below, but UserID is not always 4 characters.
select CAST(LEFT(userID, 4) AS int) from audit
You could try updating the table to get rid of these characters:
UPDATE dbo.[audit]
SET UserID = REPLACE(UserID, CHAR(0), '')
WHERE CHARINDEX(CHAR(0), UserID) > 0;
But then you'll also need to fix whatever is putting this bad data into the table in the first place. In the meantime perhaps try:
SELECT CONVERT(INT, REPLACE(UserID, CHAR(0), ''))
FROM dbo.[audit];
But that is not a long term solution. Fix the data (and the data type while you're at it). If you can't fix the data type immediately, then you can quickly find the culprit by adding a check constraint:
ALTER TABLE dbo.[audit]
ADD CONSTRAINT do_not_allow_stupid_data
CHECK (CHARINDEX(CHAR(0), UserID) = 0);
EDIT
Ok, so that is definitely a 4-digit integer followed by six instances of CHAR(0). And the workaround I posted definitely works for me:
DECLARE @foo TABLE(UserID VARCHAR(32));
INSERT @foo SELECT 0x31353831000000000000;
-- this succeeds:
SELECT CONVERT(INT, REPLACE(UserID, CHAR(0), '')) FROM @foo;
-- this fails:
SELECT CONVERT(INT, UserID) FROM @foo;
Please confirm that this code on its own (well, the first SELECT, anyway) works for you. If it does then the error you are getting is from a different non-numeric character in a different row (and if it doesn't then perhaps you have a build where a particular bug hasn't been fixed). To try and narrow it down you can take random values from the following query and then loop through the characters:
SELECT UserID, CONVERT(VARBINARY(32), UserID)
FROM dbo.[audit]
WHERE UserID LIKE '%[^0-9]%';
So take a random row, and then paste the output into a query like this:
DECLARE @x VARCHAR(32), @i INT;
SET @x = CONVERT(VARCHAR(32), 0x...); -- paste the value here
SET @i = 1;
WHILE @i <= LEN(@x)
BEGIN
PRINT RTRIM(@i) + ' = ' + RTRIM(ASCII(SUBSTRING(@x, @i, 1)))
SET @i = @i + 1;
END
This may take some trial and error before you encounter a row that fails for some other reason than CHAR(0) - since you can't really filter out the rows that contain CHAR(0) because they could contain CHAR(0) and CHAR(something else). For all we know you have values in the table like:
SELECT '15' + CHAR(9) + '23' + CHAR(0);
...which also can't be converted to an integer, whether you've replaced CHAR(0) or not.
I know you don't want to hear it, but I am really glad this is painful for people, because now they have more war stories to push back when people make very poor decisions about data types.
This question has got 91,000 views, so perhaps many people are looking for a more generic solution to the issue in the title, "error converting varchar to INT".
If you are on SQL Server 2012+ one way of handling this invalid data is to use TRY_CAST
SELECT TRY_CAST (userID AS INT)
FROM audit
On previous versions you could use
SELECT CASE
WHEN ISNUMERIC(RTRIM(userID) + '.0e0') = 1
AND LEN(userID) <= 11
THEN CAST(userID AS INT)
END
FROM audit
Both return NULL if the value cannot be cast.
In the specific case that you have in your question with known bad values I would use the following however.
CAST(REPLACE(userID COLLATE Latin1_General_Bin, CHAR(0),'') AS INT)
Trying to replace the null character is often problematic except if using a binary collation.
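For illustration (the non-binary result varies with the database collation, which is why it's "often" rather than "always" a problem):
DECLARE @v VARCHAR(32) = '1581' + CHAR(0);
SELECT DATALENGTH(REPLACE(@v, CHAR(0), '')) AS default_collation, -- may still be 5 under some collations
DATALENGTH(REPLACE(@v COLLATE Latin1_General_Bin, CHAR(0), '')) AS binary_collation; -- 4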
This is more for someone searching for a result than the original poster. This worked for me...
declare @value varchar(max) = 'sad';
select sum(cast(iif(isnumeric(@value) = 1, @value, 0) as bigint));
returns 0
declare @value varchar(max) = '3';
select sum(cast(iif(isnumeric(@value) = 1, @value, 0) as bigint));
returns 3
I would try trimming the number to see what you get:
select len(rtrim(ltrim(userid))) from audit
if that returns the correct value then just do:
select convert(int, rtrim(ltrim(userid))) from audit
if that doesn't return the correct value then I would do a replace to remove the embedded null characters:
select convert(int, replace(userid, char(0), '')) from audit
This is how I solved the problem in my case:
First of all I made sure the column I need to convert to integer doesn't contain any spaces:
update data set col1 = TRIM(col1)
I also checked whether the column only contains numeric digits.
You can check it by:
select * from data where col1 like '%[^0-9]%' order by col1
If any nonnumeric values are present, you can save them to another table and remove them from the table you are working on.
select * into nonnumeric_data from data where col1 like '%[^0-9]%'
delete from data where col1 like '%[^0-9]%'
The problems with my data were the cases above. So after fixing them, I created a bigint column and set the values of the varchar column into the integer column I created.
alter table data add int_col1 bigint
update data set int_col1 = CAST(col1 AS BIGINT)
This worked for me, hope you find it useful as well.