How are Oracle CLOBs handled internally?

I have a column (CLOB type) in a database that holds JSON strings. The size of these JSON strings can be quite variable. I have heard that when these strings are shorter than 4000 characters, Oracle treats the CLOBs as VARCHAR internally. However, I am curious how exactly this process works. My interest is in performance and in being able to visually inspect the JSON being stored.
If a CLOB in the DB has 50 characters, does Oracle treat this single object as VARCHAR2(50)? Do all CLOBs stored in the column need to be less than 4000 characters for Oracle to treat the whole column as a VARCHAR? How does this all work?

Oracle does not always treat short CLOB values as VARCHAR2 values. It only does this if you allow it to do so, using the CLOB storage option of ENABLE STORAGE IN ROW. E.g.,
create table clob_test (
    id number NOT NULL PRIMARY KEY,
    v1 varchar2(60),
    c1 clob
) lob(c1) store as (enable storage in row);
In this case, Oracle will store the data for C1 in the table blocks, right next to the values for ID and V1. It will do this as long as the length of the CLOB value stays just under 4000 bytes (i.e., 4000 minus the system control information that takes up space in the CLOB).
In this case, the CLOB data will be read like a VARCHAR2 (e.g., the storage CHUNK size becomes irrelevant).
If the CLOB grows too big, Oracle will quietly move it out of the block into separate storage, like any big CLOB value.
If a CLOB in the DB has 50 characters does Oracle treat this single object as VARCHAR2(50)?
Basically yes, if the CLOB was created with ENABLE STORAGE IN ROW. This option cannot be altered after the fact. I wouldn't count on Oracle treating the CLOB exactly like a VARCHAR2 in every respect. E.g., there is system control information stored in the in-row CLOB that is not stored in a VARCHAR2 column. But for many practical purposes, including performance, they're very similar.
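If you want to check whether a particular CLOB column was created with ENABLE STORAGE IN ROW, the data dictionary exposes it. A minimal sketch against the CLOB_TEST table above:
-- IN_ROW = 'YES' means short values are kept in the table block as described;
-- CHUNK only matters once a value moves out of row.
select table_name, column_name, in_row, chunk
from   user_lobs
where  table_name = 'CLOB_TEST';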
Do all CLOBs stored in the column need to be less than 4000 characters for Oracle to treat the whole column as a VARCHAR?
No. It's on a row-by-row basis.
How does this all work?
I explained what I know as best I could. Oracle doesn't publish its internal algorithms.

Related

Changing a column data type results in an index that is too large

In a SQL Server table I have a field CarDealerOwner of type nvarchar(255). Now I want to change the maximum length of the accepted values from 255 to 900 characters.
When I manually try to change it from 255 to 900, SQL Server Management Studio pops up with the message:
Changing a column data type results in an index that is too large.
....
Changing the data type of column CarDealerOwner causes the following indexes to exceed the maximum index size of 900 bytes:...
Do you want to proceed with the data type change and delete the indexes?
Does this actually mean that I would need to recreate the index again?
Is there a smarter way?
The 900-byte limit on the indexed field size is inherent to SQL Server, so you can't change that.
NVARCHAR() columns take two bytes per character, so you could resize your column to 450 instead of 900.
You could change the data type from NVARCHAR to VARCHAR, as long as your data is in Western European languages. That is probably not a good assumption.
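If you decide to keep the index and simply stay under the limit, one possible sequence looks like this (the table and index names are hypothetical, and NVARCHAR(450) keeps the index key at 900 bytes):
-- Hypothetical object names; keep the column's nullability as it is today.
DROP INDEX IX_CarDealerOwner ON dbo.CarDealer;

ALTER TABLE dbo.CarDealer
    ALTER COLUMN CarDealerOwner NVARCHAR(450) NULL;  -- 450 characters * 2 bytes = 900 bytes

CREATE INDEX IX_CarDealerOwner ON dbo.CarDealer (CarDealerOwner);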

Table with lots of text columns. Should I use nvarchar or text?

I have a table with 210 text columns.
The columns contain comments of a sort and are not always filled.
What is a better solution: NVARCHAR(2000) or TEXT?
If I choose NVARCHAR and later want to increase the length to NVARCHAR(8000), does this affect the physical size of a row?
Text types are deprecated:
ntext, text, and image data types will be removed in a future version of Microsoft SQL Server. Avoid using these data types in new development work, and plan to modify applications that currently use them. Use nvarchar(max), varchar(max), and varbinary(max) instead.
see MSDN: ntext, text, and image
Therefore you should use varchar(x), or nvarchar(x) if you need Unicode.
You must choose a size big enough for what you want to store or use varchar(max).
Besides, the columns do not all need to be declared with the same size.
The physical size of a variable-length type is based on what you store in it. If you later decide to change the size from 100 to 1000, a 50-character string will still occupy the same number of bytes. However, a 200-character string will only fit once the column is (n)varchar(200) or larger, e.g. (n)varchar(1000).
If you have a lot of NULL values, you should consider using the SPARSE option.
This MSDN link gives more details and data on the potential gain from SPARSE columns. You can expect roughly a 40% space saving in a varchar column containing 60% NULL values.
Use Sparse Columns
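For illustration, a minimal sketch (the table and column names are made up) combining variable-length comment columns with SPARSE for the mostly-empty ones:
CREATE TABLE dbo.Remarks (
    Id          INT IDENTITY(1,1) PRIMARY KEY,
    Comment01   NVARCHAR(2000) SPARSE NULL,  -- SPARSE pays off when most rows leave this NULL
    Comment02   NVARCHAR(2000) SPARSE NULL,
    CommentLong NVARCHAR(MAX) NULL           -- use MAX where 2000 may not be enough
);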
Let's start with the fact that text, ntext and image are all deprecated.
According to MSDN:
ntext, text, and image data types will be removed in a future version of Microsoft SQL Server. Avoid using these data types in new development work, and plan to modify applications that currently use them. Use nvarchar(max), varchar(max), and varbinary(max) instead.
Now let's deal with the fact that since the 2005 version you can specify max as the length of columns, effectively eliminating the need for
text, ntext and image - the replacements are
varchar(max), nvarchar(max) and varbinary(max) respectively.
As for storage size:
for varchar and varbinary, the storage size is the actual length of the data entered + 2 bytes;
for nvarchar, the storage size, in bytes, is two times the actual length of data entered + 2 bytes.
All this information is available in the MSDN pages I've linked to.
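You can see the doubling for yourself with DATALENGTH, which returns the number of bytes a value occupies (the literal below is arbitrary):
SELECT DATALENGTH(CAST(N'abc' AS NVARCHAR(2000))) AS nvarchar_bytes,  -- 6: two bytes per character
       DATALENGTH(CAST('abc'  AS VARCHAR(2000)))  AS varchar_bytes;   -- 3: one byte per character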
Don't use TEXT, it's deprecated!
Use NVARCHAR if you expect international data (special characters) and the size of your DB is not critical.
Using VARCHAR or NVARCHAR stores a reference to the actual string within your row. There is no problem with defining VARCHAR(MAX) or NVARCHAR(MAX) right from the beginning; it will not take more space than starting with a smaller size.
How many rows will there be? What is the expected percentage of your columns to be filled? Will the structure change in future? How are these values used? Will you filter them? Search them? Search for text parts?
You see, the answer is not that easy :-)
You should read about SPARSE columns (very good, if you have a high "not filled" rate) or even an XML column

Does the number of fields in a table affect performance even if not referenced?

I'm reading and parsing CSV files into a SQL Server 2008 database. This process uses a generic CSV parser for all files.
The CSV parser is placing the parsed fields into a generic field import table (F001 VARCHAR(MAX) NULL, F002 VARCHAR(MAX) NULL, Fnnn ...) which another process then moves into real tables using SQL code that knows which parsed field (Fnnn) goes to which field in the destination table. So once in the table, only the fields that are being copied are referenced. Some of the files can get quite large (a million rows).
The question is: does the number of fields in a table significantly affect performance or memory usage? Even if most of the fields are not referenced. The only operations performed on the field import tables are an INSERT and then a SELECT to move the data into another table, there aren't any JOINs or WHEREs on the field data.
Currently, I have three field import tables, one with 20 fields, one with 50 fields and one with 100 fields (this being the max number of fields I've encountered so far). There is currently logic to use the smallest file possible.
I'd like to make this process more generic, and have a single table of 1000 fields (I'm aware of the 1024 columns limit). And yes, some of the planned files to be processed (from 3rd parties) will be in the 900-1000 field range.
For most files, there will be less than 50 fields.
At this point, dealing with the existing three field import tables (plus planned tables for more fields (200, 500, 1000?)) is becoming a logistical nightmare in the code, and dealing with a single table would resolve a lot of issues, provided I don't give up much performance.
First, to answer the question as stated:
Does the number of fields in a table affect performance even if not referenced?
If the fields are fixed-length (*INT, *MONEY, DATE/TIME/DATETIME/etc, UNIQUEIDENTIFIER, etc) AND the field is not marked as SPARSE or Compression hasn't been enabled (both started in SQL Server 2008), then the full size of the field is taken up (even if NULL) and this does affect performance, even if the fields are not in the SELECT list.
If the fields are variable length and NULL (or empty), then they just take up a small amount of space in the Page Header.
Regarding space in general, is this table a heap (no clustered index) or clustered? And how are you clearing the table out for each new import? If it is a heap and you are just doing a DELETE, then it might not be getting rid of all of the unused pages. You would know if there is a problem by seeing space taken up even with 0 rows when doing sp_spaceused. Suggestions 2 and 3 below would naturally not have such a problem.
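A quick way to check, assuming a staging table name along the lines of the ones described (the name below is hypothetical):
EXEC sp_spaceused N'dbo.FieldImport100';  -- with 0 rows, the reserved/unused figures should be near zero
-- TRUNCATE TABLE deallocates the pages, unlike a plain DELETE on a heap:
TRUNCATE TABLE dbo.FieldImport100;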
Now, some ideas:
Have you considered using SSIS to handle this dynamically?
Since you seem to have a single-threaded process, why not create a global temporary table at the start of the process each time? Or, drop and recreate a real table in tempdb? Either way, if you know the destination, you can even dynamically create this import table with the destination field names and datatypes. Even if the CSV importer doesn't know of the destination, at the beginning of the process you can call a proc that would know of the destination, can create the "temp" table, and then the importer can still generically import into a standard table name with no fields specified and not error if the fields in the table are NULLable and are at least as many as there are columns in the file.
Does the incoming CSV data have embedded returns, quotes, and/or delimiters? Do you manipulate the data between the staging table and destination table? It might be possible to dynamically import directly into the destination table, with proper datatypes, but no in-transit manipulation. Another option is doing this in SQLCLR. You can write a stored procedure to open a file and spit out the split fields while doing an INSERT INTO...EXEC. Or, if you don't want to write your own, take a look at the SQL# SQLCLR library, specifically the File_SplitIntoFields stored procedure. This proc is only available in the Full / paid-for version, and I am the creator of SQL#, but it does seem ideally suited to this situation.
Given that:
all fields import as text
destination field names and types are known
number of fields differs between destination tables
what about having a single XML field and importing each line as a single-level document with each field being <F001>, <F002>, etc? By doing this you wouldn't have to worry about number of fields or have any fields that are unused. And in fact, since the destination field names are known to the process, you could even use those names to name the elements in the XML document for each row. So the rows could look like:
ID   LoadFileID   ImportLine
1    1            <row><FirstName>Bob</FirstName><LastName>Villa</LastName></row>
2    1            <row><Number>555-555-5555</Number><Type>Cell</Type></row>
Yes, the data itself will take up more space than the current VARCHAR(MAX) fields, both due to XML being double-byte and the inherent bulkiness of the element tags to begin with. But then you aren't locked into any physical structure. And just looking at the data will be easier to identify issues since you will be looking at real field names instead of F001, F002, etc.
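A minimal sketch of that staging table (the names follow the example rows above but are otherwise illustrative):
CREATE TABLE dbo.ImportStaging (
    ID         INT IDENTITY(1,1) PRIMARY KEY,
    LoadFileID INT NOT NULL,
    ImportLine XML NOT NULL   -- one <row> document per CSV line
);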
In terms of at least speeding up the process of reading the file, splitting the fields, and inserting, you should use Table-Valued Parameters (TVPs) to stream the data into the import table. I have a few answers here that show various implementations of the method, differing mainly based on the source of the data (file vs a collection already in memory, etc):
How can I insert 10 million records in the shortest time possible?
Pass Dictionary<string,int> to Stored Procedure T-SQL
Storing a Dictionary<int,string> or KeyValuePair in a database
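A rough sketch of the SQL side of that approach (the type, procedure, and table names here are hypothetical, and a real staging table would have many more Fnnn columns):
CREATE TYPE dbo.ImportRow AS TABLE (
    F001 VARCHAR(MAX) NULL,
    F002 VARCHAR(MAX) NULL
);
GO
CREATE PROCEDURE dbo.LoadImportRows (@Rows dbo.ImportRow READONLY)
AS
BEGIN
    -- The client streams rows into @Rows (SqlDbType.Structured), and they land
    -- in the staging table in one set-based statement instead of row-by-row INSERTs.
    INSERT INTO dbo.FieldImport (F001, F002)
    SELECT F001, F002
    FROM   @Rows;
END;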
As was correctly pointed out in the comments, even if your table has 1000 columns and most of them are NULL, it should not affect performance much, since NULLs will not waste a lot of space.
You mentioned that you may have real data with 900-1000 non-NULL columns. If you are planning to import such files, you may come across another limitation of SQL Server. Yes, the maximum number of columns in a table is 1024, but there is a limit of 8060 bytes per row. If your columns are varchar(max), then each such column will consume 24 bytes out of 8060 in the actual row and the rest of the data will be pushed off-row:
SQL Server supports row-overflow storage which enables variable length
columns to be pushed off-row. Only a 24-byte root is stored in the
main record for variable length columns pushed out of row; because of
this, the effective row limit is higher than in previous releases of
SQL Server. For more information, see the "Row-Overflow Data Exceeding
8 KB" topic in SQL Server Books Online.
So, in practice you can have a table with only 8060 / 24 = 335 nvarchar(max) non-NULL columns. (Strictly speaking, even a bit fewer, since there are other headers as well.)
There are so-called wide tables that can have up to 30,000 columns, but the maximum size of the wide table row is 8,019 bytes. So, they will not really help you in this case.
Yes. Large records take up more space on disk and in memory, which means loading them is slower than loading small records and fewer of them fit in memory. Both effects will hurt performance.

SQL Server TO Oracle table creation

We are using MS-SQL and Oracle as our database.
We have used Hibernate annotations to create the tables; in the annotated class file we have declared the column definition as
@Column(name="UCAALSNO", nullable=false, columnDefinition="nvarchar(20)")
and this works fine for MS-SQL.
But when it comes to Oracle, nvarchar throws an exception, as Oracle supports only nvarchar2.
How can the annotation be written so that the nvarchar data type works for both databases?
You could use NCHAR:
In MSSQL:
nchar [ ( n ) ]
Fixed-length Unicode string data. n defines the string length and must
be a value from 1 through 4,000. The storage size is two times n
bytes. When the collation code page uses double-byte characters, the
storage size is still n bytes. Depending on the string, the storage
size of n bytes can be less than the value specified for n. The ISO
synonyms for nchar are national char and national character.
while in Oracle:
NCHAR
The maximum length of an NCHAR column is 2000 bytes. It can hold up to
2000 characters. The actual data is subject to the maximum byte limit
of 2000. The two size constraints must be satisfied simultaneously at
run time.
NCHAR occupies fixed space, so for a very large table there will be a considerable space difference between NCHAR and NVARCHAR; you should take this into consideration.
I usually have incremental DB schema migration scripts for my production DBs and I only rely on Hibernate DDL generation for my integration testing in-memory databases (e.g. HSQLDB or H2). This way I choose the production schema types first and the "columnDefinition" only applies to the testing schema, so there is no conflict.
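A sketch of that split, using the column from the question (the table name is made up): the type lives in each database's own migration script, and the annotation drops columnDefinition altogether.
-- SQL Server migration script
CREATE TABLE ucaa_records ( UCAALSNO NVARCHAR(20) NOT NULL );

-- Oracle migration script
CREATE TABLE ucaa_records ( UCAALSNO NVARCHAR2(20) NOT NULL );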
You might want to read this too; it sidesteps the additional complexity of N(VAR)CHAR2 by setting a default Unicode character encoding for the whole database:
Given that, I'd much rather go with the approach that maximizes
flexibility going forward, and that's converting the entire database
to Unicode (AL32UTF8 presumably) and just using that.
Although you might be advised to use VARCHAR2, VARCHAR has been a synonym for VARCHAR2 for a long time now.
So quoting a DBA opinion:
The Oracle 9.2 and 8.1.7 documentation say essentially the same thing,
so even though Oracle continually discourages the use of VARCHAR, so
far they haven't done anything to change its parity with VARCHAR2.
I'd say give it a try for VARCHAR too, as it's supported on most DBs.
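If you want to see what Oracle actually records for a VARCHAR column, a minimal sketch (the table name is made up):
create table varchar_demo ( v varchar(20) );

-- Oracle stores the column as VARCHAR2; the data dictionary shows the resulting type.
select column_name, data_type
from   user_tab_columns
where  table_name = 'VARCHAR_DEMO';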

what advantage does TEXT have over varchar when required length <8000?

SQL Server Text type vs. varchar data type:
As a rule of thumb, if you ever need you text value to exceed 200
characters AND do not use join on this column, use TEXT.
Otherwise use VARCHAR.
Assuming my data is now 4000 characters AND I do not use a join on this column. By that quote, it is more advantageous to use TEXT/varchar(max) compared to using varchar(4000).
Why so? (what advantage does TEXT/varchar(max) have over normal varchar in this case?)
TEXT is deprecated, use nvarchar(max), varchar(max), and varbinary(max) instead: http://msdn.microsoft.com/en-us/library/ms187993.aspx
I disagree with the 200-character rule of thumb because it isn't explained, unless it relates to the deprecated "text in row" option.
If your data is 4000 characters, then use char(4000); it is fixed length.
Text is deprecated
BLOB types are slower
In old versions of SQL (2000 and earlier?) there was a max row length of 8 KB (or 8060 bytes). If you used varchar for lots of long text columns they would be included in this length, whereas any text columns would not, so you can keep more text in a row.
This issue has been worked around in more recent versions of SQL Server.
This MSDN page includes the statement:
SQL Server 2005 supports row-overflow storage which enables variable
length columns to be pushed off-row. Only a 24-byte root is stored in
the main record for variable length columns pushed out of row; because
of this, the effective row limit is higher than in previous releases
of SQL Server. For more information, see the "Row-Overflow Data
Exceeding 8 KB" topic in SQL Server 2005 Books Online.
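To make the replacement concrete, a small sketch (the table name is made up):
CREATE TABLE dbo.notes_example (
    id       INT NOT NULL PRIMARY KEY,
    body_old TEXT         NULL,  -- the deprecated type the quotes above refer to
    body_new VARCHAR(MAX) NULL   -- the preferred replacement; stored in-row when small enough, off-row otherwise
);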
