Migrating from SQL Server to Oracle: varchar length issues - sql-server

I'm facing a strange issue trying to move from SQL Server to Oracle.
In one of my tables I have a column defined as NVARCHAR(255).
After reading a bit I understood that SQL Server counts characters while Oracle counts bytes.
So I defined my column in Oracle as VARCHAR(510), since 255 * 2 = 510.
But when using sqlldr to load the data from a tab-delimited text file I get an error indicating some entries exceeded the length of this column.
After checking in SQL Server using:
SELECT MAX(DATALENGTH(column))
FROM table
I get that the max data length is 510.
I do use the Hebrew_CI_AS collation, even though I don't think it changes anything...
I also checked in SQL Server whether any of the entries contain a TAB, but no... so I guess it's not corrupted data.
Anyone have an idea?
EDIT
After further checking I've noticed that the issue is in the data file (in addition to the issue solved by Justin Cave's post).
I have changed the row delimiter to '^', since none of my data contains this character, and '|^|' as the column delimiter,
creating a control file as follows:
load data
infile data.txt "str '^'"
badfile "data_BAD.txt"
discardfile "data_DSC.txt"
into table table
FIELDS TERMINATED BY '|^|' TRAILING NULLCOLS
(
col1,
col2,
col3,
col4,
col5,
col6
)
The problem is that my data contains <CR> and sqlldr expects a stream file, and therefore fails on the <CR>! I do not want to change the data, since it is textual data (error messages, for example).

What is your database character set?
SELECT parameter, value
FROM v$nls_parameters
WHERE parameter LIKE '%CHARACTERSET'
Assuming that your database character set is AL32UTF8, each character could require up to 4 bytes of storage (though almost every useful character can be represented with at most 3 bytes of storage). So you could declare your column as VARCHAR2(1020) to ensure that you have enough space.
You could also simply use character length semantics. If you declare your column VARCHAR2(255 CHAR), you'll allocate space for 255 characters regardless of the amount of space that requires. If you change the NLS_LENGTH_SEMANTICS initialization parameter from the default BYTE to CHAR, you'll change the default so that VARCHAR2(255) is interpreted as VARCHAR2(255 CHAR) rather than VARCHAR2(255 BYTE). Note that the 4000-byte limit on a VARCHAR2 remains even if you are using character length semantics.
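For example (just a sketch using a made-up table name), the two declarations below reserve room for 255 characters and 255 bytes respectively, and the session-level switch changes the default interpretation:
CREATE TABLE my_table (
  col_char VARCHAR2(255 CHAR),  -- always room for 255 characters, whatever their byte size
  col_byte VARCHAR2(255 BYTE)   -- room for 255 bytes, i.e. fewer multi-byte Hebrew characters in AL32UTF8
);
ALTER SESSION SET NLS_LENGTH_SEMANTICS = CHAR;  -- from here on, VARCHAR2(n) means n characters by default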
If your data contains line breaks, do you need the TRAILING NULLCOLS parameter? That implies that sometimes columns may be omitted from the end of a logical row. If you combine columns that may be omitted with columns that contain line breaks and data that is not enclosed by at least an optional enclosure character, it's not obvious to me how you would begin to identify where a logical row ended and where it began. If you don't actually need the TRAILING NULLCOLS parameter, you should be able to use the CONTINUEIF parameter to combine multiple physical rows into a single logical row. If you can change the data file format, I'd strongly suggest adding an optional enclosure character.
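For illustration only, building on the control file from the question, adding an optional enclosure character (assuming the export is changed to wrap each field in double quotes) would look something like this:
load data
infile data.txt "str '^'"
badfile "data_BAD.txt"
discardfile "data_DSC.txt"
into table table
FIELDS TERMINATED BY '|^|' OPTIONALLY ENCLOSED BY '"' TRAILING NULLCOLS
(
col1,
col2,
col3,
col4,
col5,
col6
)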

The number of bytes used by an NVARCHAR field is equal to two times the number of characters plus two (see http://msdn.microsoft.com/en-us/library/ms186939.aspx), so if you make your VARCHAR field 512 you may be OK. There's also some indication that some character sets use 4 bytes per character, but I've found no indication that Hebrew is one of these character sets.
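As a quick cross-check on the SQL Server side (using the same placeholder names as the question), comparing LEN to DATALENGTH shows the two-bytes-per-character relationship for NVARCHAR data:
SELECT MAX(LEN([column])) AS max_characters,       -- number of characters
       MAX(DATALENGTH([column])) AS max_bytes      -- number of bytes; about 2 x characters for NVARCHAR
FROM [table];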

Related

How is Unicode (UTF-16) data that is out of collation stored in varchar column?

This is a purely theoretical question to wrap my head around.
Let's say I have the Unicode cyclone symbol (🌀, U+1F300). If I try to store it in a varchar column that has the default Latin1_General_CI_AS collation, the cyclone symbol cannot fit into the single byte that is used per symbol in varchar...
The ways I can see this being handled:
Like JavaScript does for symbols outside the Basic Multilingual Plane (BMP), where it stores them as 2 symbols (surrogate pairs), and then additional processing is needed to put them back together...
Just truncate the symbol, store the first byte and drop the second... (data is toast - you should have read the manual...)
Data is destroyed and nothing of use is saved... (data is toast - you should have read the manual...)
Some other option that is outside of my mental capacity...
I have done some research after inserting a couple of different Unicode symbols:
INSERT INTO [Table] (Field1)
VALUES ('ðŸ‘―')
INSERT INTO [Table] (Field1)
VALUES ('🌀')
and then reading them back as bytes:
SELECT CAST(field1 AS varbinary(10))
In both cases I got 0x3F3F.
0x3F in ASCII is ? (question mark), i.e. two question marks (??), which is also what I see when doing a normal SELECT *. Does that mean the data is toast and not even the first byte is being stored?
How is Unicode data that is out of collation stored in varchar column?
The data is toast and is exactly what you see, 2 x 0x3F bytes. This happens during the type conversion prior to the insert and is effectively the same as cast('ðŸ‘―' as varbinary(2)), which is also 0x3F3F (as opposed to casting N'ðŸ‘―').
When Unicode data must be inserted into non-Unicode columns, the columns are internally converted from Unicode by using the WideCharToMultiByte API and the code page associated with the collation. If a character cannot be represented on the given code page, the character is replaced by a question mark (?). (Ref.)
Yes the data has gone.
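A minimal sketch of that conversion (the temp table and values here are purely illustrative): inserting the same symbol with and without the N prefix and then looking at the raw bytes shows the replacement happening before the data ever reaches the column:
CREATE TABLE #unicode_demo (v varchar(10), n nvarchar(10));
INSERT INTO #unicode_demo (v, n) VALUES ('🌀', N'🌀');
SELECT CAST(v AS varbinary(10)) AS v_bytes,   -- expect 0x3F3F: both halves of the surrogate pair became '?'
       CAST(n AS varbinary(10)) AS n_bytes    -- expect the original UTF-16LE surrogate pair bytes
FROM #unicode_demo;
DROP TABLE #unicode_demo;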
Varchar requires less space compared to NVarchar, but that reduction comes at a cost: there is no room for a varchar to store Unicode characters (at 1 byte per character the internal lookup just isn't big enough).
From Microsoft's Developer Network:
...consider using the Unicode nchar or nvarchar data types to minimize character conversion issues.
As you've spotted, unsupported characters are replaced with question marks.

ORA-01401: inserted value too large for column CHAR

I'm pretty new to Oracle SQL and my class is going over bulk loading at the moment. I pretty much get the idea; however, I am having a little trouble getting it to read all of my records.
This is my SQL File;
PROMPT Creating Table 'CUSTOMER'
CREATE TABLE CUSTOMER
(CustomerPhoneKey CHAR(10) PRIMARY KEY
,CustomerLastName VARCHAR(15)
,CustomerFirstName VARCHAR(15)
,CustomerAddress1 VARCHAR(15)
,CutomerAddress2 VARCHAR(30)
,CustomerCity VARCHAR(15)
,CustomerState VARCHAR(5)
,CustomerZip VARCHAR(5)
);
Quick and easy. Now this is my control file to load in the data:
LOAD DATA
INFILE Customer.dat
INTO TABLE Customer
FIELDS TERMINATED BY"|"
(CustomerPhoneKey, CustomerLastName, CustomerFirstName, CustomerAddress1 , CutomerAddress2, CustomerCity, CustomerState, CustomerZip)
Then the data file:
2065552123|Lamont|Jason|NULL|161 South Western Ave|NULL|NULL|98001
2065553252|Johnston|Mark|Apt. 304|1215 Terrace Avenue|Seattle|WA|98001
2065552963|Lewis|Clark|NULL|520 East Lake Way|NULL|NULL|98002
2065553213|Anderson|Karl|Apt 10|222 Southern Street|NULL|NULL|98001
2065552217|Wong|Frank|NULL|2832 Washington Ave|Seattle|WA|98002
2065556623|Jimenez|Maria|Apt 13 B|1200 Norton Way|NULL|NULL|98003
The problem is that only the last record,
2065556623|Jimenez|Maria|Apt 13 B|1200 Norton Way|NULL|NULL|98003
is being loaded. The rest end up in my bad file.
So I took a look at my log file, and the errors I'm getting are:
Record 1: Rejected - Error on table CUSTOMER, column CUSTOMERZIP.
ORA-01401: inserted value too large for column
Record 2: Rejected - Error on table CUSTOMER, column CUSTOMERZIP.
ORA-01401: inserted value too large for column
Record 3: Rejected - Error on table CUSTOMER, column CUSTOMERZIP.
ORA-01401: inserted value too large for column
Record 4: Rejected - Error on table CUSTOMER, column CUSTOMERZIP.
ORA-01401: inserted value too large for column
Record 5: Rejected - Error on table CUSTOMER, column CUSTOMERZIP.
ORA-01401: inserted value too large for column
Table CUSTOMER: 1 Row successfully loaded. 5 Rows not loaded due
to data errors. 0 Rows not loaded because all WHEN clauses were
failed. 0 Rows not loaded because all fields were null.
Onto the question. I see that CustomerZip is the problem, and initially I had it as CHAR(5) -- I did this because my understanding of the data type is that for numeric values like a zip code, where I won't be doing arithmetic operations, it is better to store them as CHAR. Also, I did not use VARCHAR2(5) initially because, seeing as it is a zip code, I don't want the value to vary; it should always be 5 characters. Now maybe I'm just misunderstanding this, so if there is anyone who can clear that up, that would be awesome.
My second question is: how do I fix this problem? Given the above understanding of these data types, it doesn't make sense why neither CHAR(5) nor VARCHAR2(5) works, as I am getting the same errors for both.
It makes even less sense that one record (the last one) actually works.
Thank you for the help in advance.
Your data file has extra, invisible characters. We can't see the original but presumably it was created in Windows and has CRLF new line separators; and you're running SQL*Loader in a UNIX/Linux environment that is only expecting line feed (LF). The carriage return (CR) characters are still in the file, and Oracle is seeing them as part of the ZIP field in the file.
The last line doesn't have a CRLF (or any new-line marker), so on that line - and only that line - the ZIP field is seen as five characters. For all the others it's seen as six, e.g. 98001^M.
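You can reproduce the effect with a quick mock-up in Oracle (illustrative only, not the actual rejected data); DUMP shows the six bytes that the loader is trying to squeeze into the five-character column:
SELECT DUMP('98001' || CHR(13)) AS zip_with_cr FROM dual;
-- Typ=96 Len=6: 57,56,48,48,49,13  (the trailing 13 is the carriage return)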
You can read more about the default behaviour in the documentation:
On UNIX-based platforms, if no terminator_string is specified, then SQL*Loader defaults to the line feed character, \n.
On Windows NT, if no terminator_string is specified, then SQL*Loader uses either \n or \r\n as the record terminator, depending on which one it finds first in the data file. This means that if you know that one or more records in your data file has \n embedded in a field, but you want \r\n to be used as the record terminator, then you must specify it.
If you open the data file in an editor like vi or vim, you'll see those extra ^M control characters.
There are several ways to fix this. You can modify the file; the simplest way to do that is to copy and paste the data into a new file created in the environment you'll run SQL*Loader in. There are utilities to convert line endings if you prefer, e.g. dos2unix. Or your Windows editor may be able to save the file without the CRs. You could also add an extra field delimiter to the data file, as Ditto suggested.
Or you could tell SQL*Loader to expect CRLF by changing the INFILE line:
LOAD DATA
INFILE Customer.dat "str '\r\n'"
INTO TABLE Customer
...
... though that will then cause problems if you do supply a file created in Linux, without the CR characters.
There is a utility, dos2unix, that is present on almost all UNIX machines. If you run it, you can output the data file with the DOS/Windows CRLF combination stripped.

Unable to return query Thai data

I have a table with NVARCHAR(255) columns that contain both Thai and English text data.
In SSMS I can query the table and return all the rows easily enough. But if I then query specifically for one of the Thai results, it returns no rows.
SELECT TOP 1000 [Province]
,[District]
,[SubDistrict]
,[Branch ]
FROM [THDocuworldRego].[dbo].[allDistricsBranches]
Returns
Province District SubDistrict Branch
อุตรดิตถ์ ลับแล ศรีพนมมาศ Northern
Bangkok Khlong Toei Khlong Tan SSS1
But this query:
SELECT [Province]
,[District]
,[SubDistrict]
,[Branch ]
FROM [THDocuworldRego].[dbo].[allDistricsBranches]
where [Province] LIKE 'อุตรดิตถ์'
Returns no rows.
What do I need to do to get the expected results?
The collation set is Latin1_General_CI_AS.
The data is displayed and inserted with no errors; I just can't search.
Two problems:
The string being passed into the LIKE clause is VARCHAR due to not being prefixed with a capital "N". For example:
SELECT 'อุตรดิตถ์' AS [VARCHAR], N'อุตรดิตถ์' AS [NVARCHAR]
-- ????????? อุตรดิตถ์
What is happening here is that when SQL Server is parsing the query batch, it needs to determine the exact type and value of all literals / constants. So it figures out that 12 is an INT and 12.0 is a NUMERIC, etc. It knows that N'ดิ' is NVARCHAR, which is an all-inclusive character set, so it takes the value as is. BUT, as noted before, 'ดิ' is VARCHAR, which is an 8-bit encoding, which means that the character set is controlled by a Code Page. For string literals and variables / parameters, the Code Page used for VARCHAR data is the Database's default Collation. If there are characters in the string that are not available in the Code Page used by the Database's default Collation, they are either converted to a "best fit" mapping, if such a mapping exists, or else they become the default replacement character: ?.
Technically speaking, since the Database's default Collation controls string literals (and variables), and since there is a Code Page for "Thai" (available in Windows Collations), then it would be possible to have a VARCHAR string containing Thai characters (meaning: 'ดิ', without the "N" prefix, would work). But that would require changing the Database's default Collation, and that is A LOT more work than simply prefixing the string literal with "N".
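As a quick diagnostic (not part of the original answer), you can check which collation, and therefore which code page, your non-prefixed string literals are parsed under:
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS DatabaseDefaultCollation;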
For an in-depth look at this behavior, please see my two-part series:
Which Collation is Used to Convert NVARCHAR to VARCHAR in a WHERE Condition? (Part A of 2: “Duck”)
Which Collation is Used to Convert NVARCHAR to VARCHAR in a WHERE Condition? (Part B of 2: “Rabbit”)
You need to add the wildcard characters to both ends:
N'%อุตรดิตถ์%'
The end result will look like:
WHERE [Province] LIKE N'%อุตรดิตถ์%'
EDIT:
I just edited the question to format the "results" to be more readable. It now appears that the following might also work (since no wildcards are being used in the LIKE predicate in the question):
WHERE [Province] = N'อุตรดิตถ์'
EDIT 2:
A string (i.e. something inside of single-quotes) is VARCHAR if there is no "N" prefixed to the string literal. It doesn't matter what the destination datatype is (e.g. an NVARCHAR(255) column). The issue here is the datatype of the source data, and that source is a string literal. And unlike a string in .NET, SQL Server handles 'string' as an 8-bit encoding (VARCHAR; ASCII values 0 - 127 same across all Code Pages, Extended ASCII values 128 - 255 determined by the Code Page, and potentially 2-byte sequences for Double-Byte Character Sets) and N'string' as UTF-16 Little Endian (NVARCHAR; Unicode character set, 2-byte sequences for BMP characters 0 - 65535, two 2-byte sequences for Code Points above 65535). Using 'string' is the same as passing in a VARCHAR variable. For example:
DECLARE @ASCII VARCHAR(20);
SET @ASCII = N'อุตรดิตถ์';
SELECT @ASCII AS [ImplicitlyConverted]
-- ?????????
Could be a number of things!
First off, print out the value of the column and your query string in hex:
SELECT convert(varbinary(20), Province) as stored, convert(varbinary(20), 'อุตรดิตถ์') as query from allDistricsBranches;
This should give you some insight into the problem. I think the most likely cause is the ั and ิ characters being typed in the wrong sequence. They are displayed as part of the main letter but are stored internally as separate characters.
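One way to check for that (an illustrative diagnostic, not from the original answer) is to look at the code points position by position; the combining vowel marks are stored after their base consonant and must appear in the same order in both the stored value and the search string:
SELECT UNICODE(SUBSTRING(N'อุ', 1, 1)) AS first_code_point,   -- 3629 = U+0E2D, the base consonant
       UNICODE(SUBSTRING(N'อุ', 2, 1)) AS second_code_point;  -- 3640 = U+0E38, the combining vowel mark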

TClientDataset Widestring field doubles in size after reading NVARCHAR from database

I'm converting one of our Delphi 7 projects to Delphi XE3 because we want to support Unicode. We're using MS SQL Server 2008/R2 as our database server. After changing some database fields from VARCHAR to NVARCHAR (and the fields in the accompanying ClientDatasets to ftWideString), random crashes started to occur. While debugging I noticed some unexpected behaviour by the TClientDataset/DbExpress:
For an NVARCHAR(10) database column I manually create a TWideStringField in a ClientDataset and set the 'Size' property to 10. The 'DataSize' property of the field tells me 22 bytes are needed, which is expected since TWideStringField's encoding is UTF-16, so it needs two bytes per character plus some space for storing the length. Now when I call 'CreateDataset' on the ClientDataset and write the dataset to XML (using .SaveToFile), in the XML file the field is defined as
<FIELD WIDTH="20" fieldtype="string.uni" attrname="TEST"/>
which looks ok to me.
Now, instead of calling .CreateDataset I call .Open on the TClientDataset so that it gets its data through the linked components ->TDatasetProvider->TSQLDataset (.CommandText = a simple select * from table)->TSQLConnection. When I inspect the properties of the field in my watch list, Size is still 10, Datasize is still 22. After saving to XML file however, the field is defined as
<FIELD WIDTH="40" fieldtype="string.uni" attrname="TEST"/>
...the width has doubled?
Finally, if I call .Open on the TClientDataset without creating any field definitions in advance at all, the Size of the field will afterwards be 20 (incorrect!) and DataSize 42. After saving to XML, the field is still defined as
<FIELD WIDTH="40" fieldtype="string.uni" attrname="TEST"/>
Does anyone have any idea what is going wrong here?
Check the field type and its size at the SQLCommand component (which is before the DatasetProvider).
Size doubling may be the result of two implicit "conversions": first, the server provides NVARCHAR data which is stored into an ANSI-string field (and every byte becomes a separate character); second, it is stored into the ClientDataset's field of type WideString and each character becomes 2 bytes (the size doubles).
Note that in prior versions of Delphi a string field size mismatch between the ClientDataset's field and the corresponding Query/Command field did not result in an exception, but starting from one of the XE versions it often results in an AV. So you have to check string field sizes carefully during migration.
Sounds like the column datatype change has created unexpected issues for you. My suggestion is to:
1. back up the table (there are multiple ways of doing this; pick your poison, figuratively speaking),
2. delete the table,
3. recreate the table,
4. import the data from the old table into the newly created table, and see if that helps.
SQL tables DO NOT like it when column datatypes get changed, and unexpected issues may arise from doing just that. So try that; worst case scenario, you have wasted maybe ten minutes of your time on a possible solution.

varchar field length reported incorrectly

I have a varchar(800) column that is a primary key in one table and a FK to another.
The problem is that if I do len(field), it says 186. If I copy/paste the text and check it in Notepad or something, I get 198 characters.
The content is this:
http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNGGTo8JmCWDydNA19MrL4aON-02pA&url=http://creativity-online.com/news/chrysler-nokia-target-among-winners-of-teds-first-ad-contest/149189
Any ideas on why the length difference?
EDIT
You are right. I was using a web-based SQL manager and that tricked me.
Thank you.
Are you HTML encoding the URL after you have read it from the database?
moriartyn suggested that the SQL Server len function would count &amp; as a single character, but that is not the case. However, if the actual content in the field is not HTML encoded, and it's HTML encoded when inserted in the page, that would change each & character into &amp;, which would account for the extra length.
My guess is that because there are three &amp; in your text, the SQL Server len function is counting each of those as just & (one character), while Notepad is counting them as five characters each; that would give you twelve extra in that count.
186:
http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNGGTo8JmCWDydNA19MrL4aON-02pA&url=http://creativity-online.com/news/chrysler-nokia-target-among-winners-of-teds-first-ad-contest/149189
198:
http://news.google.com/news/url?sa=t&amp;fd=R&amp;usg=AFQjCNGGTo8JmCWDydNA19MrL4aON-02pA&amp;url=http://creativity-online.com/news/chrysler-nokia-target-among-winners-of-teds-first-ad-contest/149189
Note the & vs &amp;: there are 3 of them, and &amp; is 4 characters longer each time, so 3 x 4 = 12.
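The arithmetic is easy to verify directly (purely illustrative):
SELECT LEN('&') AS raw_len,         -- 1 character as stored in the database
       LEN('&amp;') AS encoded_len; -- 5 characters once HTML-encoded; 3 ampersands x 4 extra characters = 12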
You are neither comparing equally nor comparing the same strings.
In SQL:
SELECT
LEN('http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNGGTo8JmCWDydNA19MrL4aON-02pA&url=http://creativity-online.com/news/chrysler-nokia-target-among-winners-of-teds-first-ad-contest/149189'),
LEN('http://news.google.com/news/url?sa=t&amp;fd=R&amp;usg=AFQjCNGGTo8JmCWDydNA19MrL4aON-02pA&amp;url=http://creativity-online.com/news/chrysler-nokia-target-among-winners-of-teds-first-ad-contest/149189')
