Prevent bad data being selected with SQL Server COLLATE - sql-server

I have two databases and I am moving data from one database to another. On some columns I need to resolve a collation conflict between "Latin1_General_CI_AS" and "Latin1_General_CI_AI". So I currently do:
SELECT [column] COLLATE Latin1_General_CI_AI
FROM xxx
This works in the main but in some fields I see characters like:
�
in the data that has been copied across. It seems some characters from the original DB are lost when the collation is changed on the new DB. I know I can use REPLACE to get rid of them, but rather than searching the entire record set afterwards, is there a way to filter out the special characters, or prevent them from coming across in the first place?

Related

SQL: Update Unicode Data in column to Accented characters

I have a column which was set to Varchar and the database set to SQL_Latin1_General_CP1_CI_AS.
When a user entered their name into our web front end and saved the data, accented characters were not saved correctly.
The web user was entering the following, "Béala" but this was being saved on the database as the following, "Béala".
I believe that changing the column from Varchar to NVarchar should prevent this from happening going forward(?), however, I have two questions.
1) How do I perform a select on the existing data in the column and display it correctly?
select CONVERT(NVARCHAR(100),strAddress1) from [dbo].[tblCustomer]
This still shows the data incorrectly.
2) How do I update the data in the column once converted to NVarchar to save the accented characters correctly?
Many thanks,
Ray.
The only idea that comes to mind is to write an UPDATE that reverses the bad load. A garbled sequence such as 'é' always corresponds to exactly one character (in this case 'é'), so you have to identify every special character that was changed this way and map each one back (a simple UPDATE with CASE and REPLACE). Of course, the column must be converted to nvarchar first.
This addresses both question 1 and question 2: once the UPDATE described above has run, the data is correct in the table and will be displayed correctly.
Here is a way to convert accented text to plain, unaccented characters:
select 'Réunion', cast('Réunion' as varchar(100)) COLLATE SQL_Latin1_General_CP1253_CI_AI
To list all the collations available in SQL Server, you can run this query:
SELECT name, description
FROM sys.fn_helpcollations();
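The "Béala" stored as "Béala" pattern from the question is what UTF-8 bytes look like when they are read as Windows-1252, the code page behind SQL_Latin1_General_CP1_CI_AS. As a minimal sketch, assuming that is how the rows were damaged and that each value was garbled only once, the mapping can be reversed outside the database, here in Python. The function name repair_mojibake is made up for illustration, and values that do not fit the pattern are returned unchanged.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were mis-read as Windows-1252.

    Assumption: the value was garbled exactly once in this way.
    Values that do not fit the pattern are returned unchanged.
    """
    try:
        # Turn the garbled characters back into the original bytes,
        # then decode those bytes as the UTF-8 they really were.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(repair_mojibake("Béala"))    # Béala
print(repair_mojibake("Réunion"))  # Réunion
print(repair_mojibake("Béala"))     # Béala (already correct, left alone)
```

A repair like this would have to run in client code and write the result back to an nvarchar column; it only illustrates why each garbled pair maps back to exactly one character.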

How to remove white spaces from columns in SQL Server 2000?

I have an old SQL Server 2000 database from which I read data. My problem is every query involving a String returns a value the size of that column filled with blank spaces.
e.g: let's say we have a column called NAME CHAR(20). All queries would return:
"John "
instead of just "John".
Is there a configuration or parameter in my database that causes this, or anything at all that can be changed to avoid it? Thank you.
EDIT:
I'd like to clarify: I'm reading my DB using JPA Repositories. I don't want to physically remove the whitespace from the columns, or trim the values manually using RTRIM/LTRIM/REPLACE. I'm just trying to retrieve the column without trailing spaces, without adding any extra strain to the query or trimming the fields programmatically.
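You can use RTRIM/LTRIM to remove the leading and trailing spaces:
select RTRIM(LTRIM(column_name)) from table_name
or REPLACE, which removes every space, including any spaces inside the value:
select replace(column_name, ' ', '') from table_name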
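The two queries above are not equivalent. A small Python sketch makes the difference visible; the CHAR(20) padding is simulated with ljust and the sample name is made up for illustration.

```python
# Simulate what a CHAR(20) column returns: the value padded with spaces to 20 characters.
stored = "John Smith".ljust(20)

trimmed = stored.strip(" ")         # like RTRIM(LTRIM(column_name)): only the padding goes
replaced = stored.replace(" ", "")  # like REPLACE(column_name, ' ', ''): inner spaces go too

print(len(stored))     # 20
print(repr(trimmed))   # 'John Smith'
print(repr(replaced))  # 'JohnSmith'
```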

SQL Server 2012- Server collation and database collation

I have SQL Server 2012 installed that is used for a few different applications. One of our applications needs to be installed, but the company is saying that:
The SQL collation isn't correct, it needs to be: SQL_Latin1_General_CP1_CI_AS
You can just uninstall the SQL Server Database Engine & upon reinstall select the right collation.
What possible reason would this company have to want to change the collation of the database engine itself?
Yes, you are able to set the collation at the database level. To do so, here is an example:
USE master;
GO
ALTER DATABASE <DatabaseName>
COLLATE SQL_Latin1_General_CP1_CI_AS;
GO
You can alter the database Collation even after you have created the database, using the following query:
USE master;
GO
ALTER DATABASE Database_Name
COLLATE Your_New_Collation;
GO
For more information on database collation, read here.
What possible reason would this company have to want to change the collation of the database engine itself?
The other two answers are speaking in terms of Database-level Collation, not Instance-level Collation (i.e. "database engine itself"). The most likely reason that the vendor has for wanting a highly specific Collation (not just a case-insensitive one of your choosing, for example) is that, like most folks, they don't really understand how Collations work, but what they do know is that their application works (i.e. does not get Collation conflict errors) when the Instance and Database both have a Collation of SQL_Latin1_General_CP1_CI_AS, which is the Collation of their Instance and Database (that they develop the app on), because that is the default Collation when installing on an OS having English as its language.
I'm guessing that they have probably had some customers report problems that they didn't know how to fix, but narrowed it down to those Instances not having SQL_Latin1_General_CP1_CI_AS as the Instance / Server -level Collation. The Instance-level Collation controls not just tempdb meta-data (and default column Collation when no COLLATE keyword is specified when creating local or global temporary tables), which has been mentioned by others, but also name resolution for variables / parameters, cursors, and GOTO labels. Even if unlikely that they would be using GOTO statements, they are certainly using variables / parameters, and likely enough to be using cursors.
What this means is that they likely had problems in one or more of the following areas:
Collation conflict errors related to temporary tables:
tempdb being in the Collation of the Instance does not always mean that there will be problems, even if the COLLATE keyword was never used in a CREATE TABLE #[#]... statement. Collation conflicts only occur when attempting to combine or compare two string columns. So assuming that they created a temporary table and used it in conjunction with a table in their Database, they would need to be JOINing on those string columns, or concatenating them, or combining them via UNION, or something along those lines. Under these circumstances, an error will occur if the Collations of the two columns are not identical.
Unexpected behavior:
Comparing a string column of a table to a variable or parameter will use the Collation of the column. Given their requirement for you to use SQL_Latin1_General_CP1_CI_AS, this vendor is clearly expecting case-insensitive comparisons. Since string columns of temp tables (that were not created using the COLLATE keyword) take on the Collation of the Instance, if the Instance is using a binary or case-sensitive Collation, then their application will not be returning all of the data that they were expecting it to return.
Code compilation errors:
Since the Instance-level Collation controls resolution of variable / parameter / cursor names, if they have inconsistent casing in any of their variable / parameter / cursor names, then errors will occur when attempting to execute the code. For example, doing this:
DECLARE @CustomerID INT;
SET @customerid = 5;
would get the following error:
Msg 137, Level 15, State 1, Line XXXXX
Must declare the scalar variable "@customerid".
Similarly, they would get:
Msg 16916, Level 16, State 1, Line XXXXX
A cursor with the name 'Customers' does not exist.
if they did this:
DECLARE customers CURSOR FOR SELECT 1 AS [Bob];
OPEN Customers;
These problems are easy enough to avoid, simply by doing the following:
1) Specify the COLLATE keyword on string columns when creating temporary tables (local or global). Using COLLATE DATABASE_DEFAULT is handy if the Database itself is not guaranteed to have a particular Collation. But if the Collation of the Database is always the same, then you can specify either DATABASE_DEFAULT or the particular Collation. Though I suppose DATABASE_DEFAULT works in both cases, so maybe it's the easier choice.
2) Be consistent in casing of identifiers, especially variables / parameters. And to be more complete, I should mention that Instance-level meta-data is also affected by the Instance-level Collation (e.g. names of Logins, Databases, server-Roles, SQL Agent Jobs, SQL Agent Job Steps, etc). So being consistent with casing in all areas is the safest bet.
Am I being unfair in assuming that the vendor doesn't understand how Collations work? Well, according to a comment made by the O.P. on M.Ali's answer:
I got this reply from him: "It's the other way around, you need the new SQL instance collation to match the old SQL collation when attaching databases to it. The collation is used in the functioning of the database, not just something that gets set when it's created."
the answer is "no". There are two problems here:
1) No, the Collations of the source and destination Instances do not need to match when attaching a Database to a new Instance. In fact, you can even attach a system DB to an Instance that has a different Collation, thereby having a mismatch between the attached system DB and the Instance and the other system DBs.
2) It's unclear if "database" in that last sentence means actual Database or the Instance (sometimes people use the term "database" to refer to the RDBMS as a whole). If it means actual "Database", then that is entirely irrelevant because the issue at hand is the Instance-level Collation. But, if the vendor meant the Instance, then while true that the Collation is used in normal operations (as noted above), this only shows awareness of simple cause-effect relationship and not actual understanding. Actual understanding would lead to doing those simple fixes (noted above) such that the Instance-level Collation was a non-issue.
If needing to change the Collation of the Instance, please see:
Changing the Collation of the Instance, the Databases, and All Columns in All User Databases: What Could Possibly Go Wrong?
For more info on working with Collations / encodings / Unicode / etc, please visit:
Collations.Info

in sql server, what is: Latin1_General_CI_AI versus Latin1_General_CI_AS

I am using SQL Compare to compare two versions of a database. It keeps highlighting differences in the nvarchar fields, where it shows one db that has:
Latin1_General_CI_AS
and the other one has this:
Latin1_General_CI_AI
Can someone please explain what this is, and whether I should be worried about this difference?
AS means Accent Sensitive and AI means Accent Insensitive.
Lòpez and Lopez are treated as the same if the collation is Accent Insensitive.
What the comparison is showing is that the two columns have different collations. The collation of a (text) field affects how it is both stored and compared.
The particular difference in your case is that accents on characters will be ignored when comparisons and sorting are done under Latin1_General_CI_AI.
When you install SQL Server, you set a default collation for the whole server. You can also set a collation per database and per column, meaning that you can mix them within a database (whether you want to depends on your particular case). The MSDN page I linked to has more information on collations, how to choose the best one, and how to set them.
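To make the CI_AS versus CI_AI difference concrete, here is a rough Python analogue. It is only a sketch: SQL Server's collation rules are more involved than stripping combining marks, and the helper names (strip_accents, equal_ci_as, equal_ci_ai) are made up for illustration.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def equal_ci_as(a: str, b: str) -> bool:
    """Case-insensitive, accent-sensitive: a rough analogue of a CI_AS comparison."""
    norm_a = unicodedata.normalize("NFC", a).casefold()
    norm_b = unicodedata.normalize("NFC", b).casefold()
    return norm_a == norm_b


def equal_ci_ai(a: str, b: str) -> bool:
    """Case-insensitive, accent-insensitive: a rough analogue of a CI_AI comparison."""
    return strip_accents(a).casefold() == strip_accents(b).casefold()


print(equal_ci_as("Lòpez", "Lopez"))  # False: the accent matters
print(equal_ci_ai("Lòpez", "Lopez"))  # True: the accent is ignored
print(equal_ci_as("LOPEZ", "lopez"))  # True: case is ignored in both
```

In practice this means a WHERE clause or a unique index can match different rows in the two databases, which is the main thing to check before deciding whether the difference matters.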

How can I recover Unicode data which displays in SQL Server as ?????

I have a database in SQL Server containing a column which needs to contain Unicode data (it contains user's addresses from all over the world e.g. القاهرة‎ for Cairo)
This column is an nvarchar column with a collation of database default (Latin1_General_CI_AS), but I've noticed that data inserted into it via SQL statements containing non-English characters displays as ?????.
The cause seems to be that I wasn't using the N prefix, i.e. I was writing:
INSERT INTO table (address) VALUES ('القاهرة')
Instead of:
INSERT INTO table (address) VALUES (N'القاهرة')
I was under the impression that Unicode would automatically be converted for nvarchar columns and I didn't need this prefix, but this appears to be incorrect.
The problem is I still have some data in this column which appears as ????? in SQL Server Management Studio and I don't know what it is!
Is the data still there but in an incorrect character encoding preventing it from displaying but still salvageable (and if so how can I recover it?), or is it gone for good?
Thanks,
Tom
To find out what SQL Server really stores, use
SELECT CONVERT(VARBINARY(MAX), 'some text')
I just tried this with umlauted characters and with Arabic text (copied from Wikipedia; I can't read it), both as plain strings and as N'' Unicode strings.
The result is that Arabic text in non-Unicode strings really does end up as question marks (0x3F) in the conversion to VARCHAR, so for rows inserted that way the original characters are no longer stored.
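The effect of that conversion can be reproduced outside SQL Server. The sketch below assumes that Windows code page 1252 is a fair stand-in for the non-Unicode conversion a Latin1_General collation applies to a string literal without the N prefix; it shows that every Arabic character collapses to the same byte, so nothing distinguishes one lost character from another.

```python
# The Arabic word for Cairo from the question, written as escapes to avoid
# any invisible direction marks sneaking into the literal.
original = "\u0627\u0644\u0642\u0627\u0647\u0631\u0629"

# Assumption: cp1252 stands in for the VARCHAR conversion; characters the
# code page cannot represent are replaced with '?'.
stored = original.encode("cp1252", errors="replace")

print(stored.hex())             # 3f3f3f3f3f3f3f
print(stored.decode("cp1252"))  # ???????
print(stored.decode("cp1252") == original)  # False: the text cannot be rebuilt
```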
SSMS sometimes won't display all characters. I just tried what you had and it worked for me; copy and paste it into Word and it might display correctly.
Usually, if SSMS can't display a character, it shows boxes, not ?.
Try writing a small client that retrieves this data to a file or web page. Check ALL your code for any other inserts or updates that might convert the data to varchar before storing it in tables.
