I have a lot of legacy T-SQL code (over a thousand places) that only makes INSERTs into a varchar(8000) column in a utility table. Our needs have changed and that column now needs to handle larger values. As a result I need to make that column varchar(max). This is just a plain data column: there are no searches performed on it, no index on it, only one procedure reads it, and it is insert-and-forget for the application (almost like a log entry).
I plan on making changes in only a few places that will actually generate the larger data, and in the single stored procedure that processes this column.
Are there any hidden pitfalls changing a column from varchar(8000) to a varchar(max)?
Will all the T-SQL string functions work the same: LEN(), RTRIM(), SUBSTRING(), etc.?
Can anyone imagine any reason why I'd have to make any changes to the code that thinks the column is still varchar(8000)?
All MAX types have a small performance penalty, see Performance comparison of varchar(max) vs. varchar(N).
If your maintenance includes online operations (online index rebuilds), you will lose the ability to do them. Online operations are not supported for tables with BLOB columns:
Clustered indexes must be created, rebuilt, or dropped offline when the underlying table contains large object (LOB) data types: image, ntext, text, varchar(max), nvarchar(max), varbinary(max), and xml.
Nonunique nonclustered indexes can be created online when the table contains LOB data types but none of these columns are used in the index definition as either key or nonkey (included) columns. Nonclustered indexes defined with LOB data type columns must be created or rebuilt offline.
The performance penalty is really small, so I wouldn't worry about it. The loss of the ability to do online rebuilds may be a problem for really hot tables that must stay online. Unless online operations are a must, I'd vote to go for it and change it to MAX.
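For what it's worth, the change itself is a single statement; a minimal sketch, assuming a hypothetical dbo.UtilityLog table and LogText column (keep whatever NULL/NOT NULL setting the column already has):

-- Widen the column in place; existing rows are unaffected until rewritten
ALTER TABLE dbo.UtilityLog
    ALTER COLUMN LogText varchar(max) NULL;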
Crystal Reports 12 (and other versions, as far as I know) doesn't handle varchar(max) properly and interprets it as varchar(255) which leads to truncated data in reports.
So if you're using Crystal Reports, that's a disadvantage to varchar(max). Or a disadvantage to using Crystal, to be precise.
See:
http://www.crystalreportsbook.com/Forum/forum_posts.asp?TID=5843&PID=17503
http://michaeltbeeitprof.blogspot.com/2010/05/crystal-xi-and-varcharmax-aka-memo.html
If you genuinely don't need indexes and it is a large column, you should be fine. varchar(max) appears to be exactly what you need, and you will have fewer problems with existing code than you would if you used text.
Make sure to test any updates where text is added to the existing text. It should work using regular concatenation, but I'd want to be able to prove it.
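For instance, a quick test along these lines (table and column names are hypothetical) would prove the concatenation behaviour past the old 8000-character limit:

-- REPLICATE is fed a varchar(max), so the result is not capped at 8000 characters
UPDATE dbo.UtilityLog
SET LogText = LogText + REPLICATE(CONVERT(varchar(max), 'x'), 10000)
WHERE LogEntryId = 1;

-- Should report a length greater than 8000 if concatenation behaves as expected
SELECT LEN(LogText) FROM dbo.UtilityLog WHERE LogEntryId = 1;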
I have a SAP ASE 16.0 SP02 PL06 and would like to know if the data of the row changed.
My use-case is that I have a table with the data and the same table with all previous statuses of all rows (it is history of the "data evolution", if you like) plus some auditing columns (rowID, historizationDate). And I need to know if the last historized version and the current version of a given row differs.
First I was overjoyed by the HASH() function until I found it takes only a single piece of data, e.g. one cell, a constant, or a combination of those. Then my idea moved to an ugly and dirty hack: concatenate all columns of the given row and compare those directly (this leads to a lot of convert(varchar, column), but no hashing).
Is there a better solution with respect to the constraints given below?
Constraints: I cannot alter the original table, the solution has to be as fast as reasonably possible (high throughput through the table, high concurrency, literally the heart of the whole database), and the source table has a 4-column primary key and high tens of columns overall. No sensitive data (like passwords) that needs hashing is present.
Drastically simplified structure of tables:
Original:
CREATE TABLE data (
dataID int,
column1 int,
column2 datetime,
...)
History:
CREATE TABLE dataHistory (
rowID int identity,
historizationDate datetime default getDate(),
dataID int,
column1 int,
column2 datetime,
...)
EDIT: As per @markp's comment: The table is not accessed directly by users, but through a stored procedure. So when all checking/preparation is done, the data is saved into the table. The problem is that after all the checking the new data can still be rejected, not because of some invalidity/referential integrity/etc., but because the source of the data is deemed less reliable than the data already present. This checking is done through a few dozen procedures nested several layers deep and for each column separately, so altering all those procedures to see if/what they did to the data is not a very viable solution... (Yes, the system is very well matured, having some 20 years of age.)
The final use of all those historized data is to see what data was present in the database at any given moment to draw some business conclusions from them (e.g. how often or how much are those data changed, if operational procedures are adhered to and similar uses).
First I was overjoyed by the HASH() function until I found it takes only a single piece of data, e.g. one cell, a constant, or a combination of those. Then my idea moved to an ugly and dirty hack: concatenate all columns of the given row and compare those directly
SAP ASE also has a hashbytes() function that can do something like this:
select hashbytes('sha1',col1,col2,col3) from mytable
Note: in some versions of ASE 16, using 'md5' with hashbytes() causes major memory leakage. Also, you can't use "*" for the column list.
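Applied to the simplified tables in the question, a check could look roughly like this (a sketch only; the real join would have to use the full 4-column primary key, and the column list is assumed):

-- Flag rows whose current values differ from their latest historized version
select d.dataID
from data d
join dataHistory h
  on h.dataID = d.dataID
 and h.rowID = (select max(h2.rowID)
                from dataHistory h2
                where h2.dataID = d.dataID)
where hashbytes('sha1', d.column1, d.column2)
   <> hashbytes('sha1', h.column1, h.column2)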
Is there a tool out there that can analyse SQL Server databases for potential problems?
For example:
a foreign key column that is not indexed
an index on a uniqueidentifier column that has no FILL FACTOR
a LastModifiedDate DATETIME column that has no UPDATE trigger to update the datetime
a large index with "high" fragmentation
a non-fragmented index that exists in multiple extents
a trigger that does not contain SET NOCOUNT ON (leaving it susceptible to "A trigger returned a resultset and/or was running with SET NOCOUNT OFF while another outstanding result set was active.")
a database, table, stored procedure, trigger, view, created with SET ANSI_NULLS OFF
a database or table with SET ANSI_PADDING OFF
a database or table created with SET CONCAT_NULL_YIELDS_NULL OFF
a highly fragmented index that might benefit from a lower FILLFACTOR (i.e. more padding)
a table with a very wide clustered index (e.g. uniqueidentifier+uniqueidentifier)
a table with a non-unique clustered index
use of text/ntext rather than varchar(max)/nvarchar(max)
use of varchar in columns that could likely contain localized strings and should be nvarchar (e.g. Name, FirstName, LastName, BusinessName, CountryName, City)
use of *=, =*, *=* rather than LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN
trigger that returns a results set
any column declared as timestamp rather than rowversion
a nullable timestamp column
use of image rather than varbinary(max)
databases not in simple mode (or a log file more than 100x the size of the data file)
Is there an FxCop for SQL Server?
Note: The Microsoft SQL Server 2008 R2 Best Practices Analyzer doesn't fit the bill.
There's SQLCop - free, and quite an interesting tool, too!
There is a tool called Static Code Analysis (not exactly a great name given its collision with VS-integrated FxCop) that is included with Visual Studio Premium and Ultimate that can cover at least the design-time subset of your rules. You can also add your own rules if the in-box rule set doesn't do everything you want.
Check out SQL Enlight - http://www.ubitsoft.com/products/sqlenlight/sqlenlight.php
I'm not aware of one. It would be welcome.
I post this as an answer because I actually went a long way toward implementing monitoring for many of these things, and much of it can be done in straight T-SQL - the majority of the examples you give can be checked by inspecting the metadata.
After writing a large number of "system health" procedures and some organization around them, I wrote a framework for something like this myself, using metadata including extended properties. It allowed objects to be marked for exclusion from warnings using extended properties, and rules could be categorized. I included examples of some rules and their implementations in my metadata presentation: http://code.google.com/p/caderoux/source/browse/#hg%2FLeversAndTurtles This also includes a Windows Forms app which will call the system, but the system itself is entirely coded and organized in T-SQL.
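As one illustration of the metadata approach, the first item on your list (a foreign key column that is not indexed) can be approximated with a query like this; the single-column simplification is mine:

-- Foreign key columns that are not the leading key column of any index (simplified sketch)
SELECT OBJECT_NAME(fkc.parent_object_id) AS table_name,
       COL_NAME(fkc.parent_object_id, fkc.parent_column_id) AS column_name
FROM sys.foreign_key_columns AS fkc
WHERE NOT EXISTS
(
    SELECT 1
    FROM sys.index_columns AS ic
    WHERE ic.object_id = fkc.parent_object_id
      AND ic.column_id = fkc.parent_column_id
      AND ic.key_ordinal = 1
);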
Take a look at SQLCop. It's the closest I've seen to FXCop.
I have a table with about 100,000 rows that used to look more or less like this:
id varchar(20),
omg varchar(10),
ponies varchar(3000)
When adding support for international characters, we had to redefine the ponies column as an NCLOB, since 3000 (multibyte) characters is too big for an NVARCHAR2:
id varchar(20),
omg varchar(10),
ponies nclob
We read from the table using a prepared statement in java:
select omg, ponies from tbl where id = ?
After the 'ponies' column was changed to an NCLOB and some other tables were changed to use nchar columns, Oracle 11g decided to do a full table scan instead of using the index on the id column, which causes our application to grind to a halt.
When adding a hint to the query, the index is used and everything is "fine", or rather just a little slower than it was when the column was a varchar.
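For reference, the hint in question is an ordinary index hint along these lines (the index name here is made up):

select /*+ INDEX(tbl tbl_id_ix) */ omg, ponies
from tbl
where id = ?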
We have defined the following connection properties:
oracle.jdbc.convertNcharLiterals="true"
defaultNChar=true
By the way, the database statistics are up to date.
I have not had time to look at all queries, so I don't know if other indexes are ignored, but do I have to worry that the defaultNChar setting is somehow confusing the optimizer since the id is not an nchar? It would be rather awkward to either sprinkle hints on virtually all queries or redefine all keys.
Alternatively, is the full table scan regarded as insignificant because a "large" nclob is going to be loaded anyway? That assumption seems to be off by 3 orders of magnitude, and I would like to believe that Oracle is smarter than that.
Or is it just bad luck? Or, something else? Is it possible to fix without hints?
The problem turns out to be the jdbc-flag defaultNChar=true.
Oracle's optimizer will not use indexes created on char/varchar2 columns if the parameter is sent as an nchar/nvarchar. This almost makes sense, as I suppose you could otherwise get phantom results.
We are mostly using stored procedures, with the parameters defined as char/varchar2, which forces a conversion before the query is executed, so we didn't notice this effect except in a few places where dynamic SQL is used.
The solution is to convert the database to AL32UTF8 and get rid of the nchar columns.
When you redid the statistics, did you estimate, or did you use dbms_stats.gather_table_stats with an estimate_percent > 50%? If you didn't, then use dbms_stats with a 100% estimate_percent.
If your table has only 3 columns and these are the ones you're returning, then the best index is all 3 columns, no matter what you hint and even if the id index is unique. As it stands, your explain plan should be a unique index scan followed by a table access by rowid. If you index all 3 columns this becomes just a unique scan, as all the information you're returning will be in the index already and there's no need to re-access the table to get it. The order would be id, omg, ponies to make use of it in the where clause. This would effectively make your table an index-organized table, which would be easier than having a separate index. Obviously, gather stats afterwards.
Having said all that, I'm not actually certain you can put an NCLOB in an index, and no matter what you do the size of the column will have an impact: the longer it is, the more disk reads you will have to do.
Sorry, but I don't understand why you changed your ponies column from varchar to a CLOB. If the maximum length in this column is 3000 characters, why don't you use an NVARCHAR2 column instead? As far as I know, NVARCHAR2 can hold up to 4000 characters.
But you're right, the maximum column size allowed is 2000 characters when the national character set is AL16UTF16 and 4000 when it is UTF8.
I have a table on SQL Server 2005 that was about 4gb in size.
(about 17 million records)
I changed one of the fields from datatype char(30) to char(60) (there are 25 fields in total, most of which are char(10), so the amount of char space adds up to about 300).
This caused the table to double in size (over 9gb)
I then changed the char(60) to varchar(60) and ran a function to cut the extra whitespace out of the data (so as to reduce the average length of the data in the field to about 15).
This did not reduce the table size. Shrinking the database did not help either.
Short of actually recreating the table structure and copying the data over (that's 17 million records!) is there a less drastic way of getting the size back down again?
You have not cleaned or compacted any data, even with a "shrink database".
DBCC CLEANTABLE
Reclaims space from dropped variable-length columns in tables or indexed views.
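If you want to try it, the call is just this (database and table names are placeholders):

DBCC CLEANTABLE ('MyDatabase', 'dbo.MyTable');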
However, a simple index rebuild, if there is a clustered index, should also do it:
ALTER INDEX ALL ON dbo.Mytable REBUILD
A worked example from Tony Rogerson
Well, it's clear you're not getting any space back! :-)
When you changed your text fields to CHAR(60), they were all padded to capacity with spaces, so ALL your fields really became 60 characters long.
Changing that back to VARCHAR(60) won't help by itself - the values are still all 60 characters long....
What you really need to do is run RTRIM over all those fields to reduce them back to their trimmed length, and then shrink the database.
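On SQL Server 2005 there is no TRIM(), so that pass would look something like this (names are placeholders):

-- Strip the trailing padding left over from the CHAR(60) days
UPDATE dbo.MyTable
SET MyWideColumn = RTRIM(MyWideColumn);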
After you've done that, you need to REBUILD your clustered index in order to reclaim some of that wasted space. The clustered index is really where your data lives - you can rebuild it like this:
ALTER INDEX IndexName ON YourTable REBUILD
By default, your primary key is your clustered index (unless you've specified otherwise).
Marc
I know I'm not answering your question as asked, but have you considered archiving some of the data to a history table and working with fewer rows?
Most of the time you might think at first glance that you need all that data all the time, but when you actually sit down and examine it, there are cases where that's not true. Or at least I've experienced that situation before.
I had a similar problem here: SQL Server, Converting NTEXT to NVARCHAR(MAX), which was related to changing ntext to nvarchar(max).
I had to do an UPDATE MyTable SET MyValue = MyValue in order to get it to resize everything nicely.
This obviously takes quite a long time with a lot of records. There were a number of suggestions on how to do it better. The key one was a temporary flag indicating whether a row had been done or not, and then updating a few thousand rows at a time in a loop until it was all done. This meant I had "some" control over how much it was doing at once.
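A rough sketch of that batching idea, assuming a hypothetical temporary IsResized flag column has been added:

-- Touch a few thousand rows per pass until every row has been rewritten
WHILE 1 = 1
BEGIN
    UPDATE TOP (5000) dbo.MyTable
    SET MyValue = MyValue,
        IsResized = 1
    WHERE IsResized = 0;

    IF @@ROWCOUNT = 0 BREAK;
END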
On another note, if you really want to shrink the database as much as possible, it can help to switch the recovery model down to simple, shrink the transaction logs, reorganise all the data in the pages, then set it back to the full recovery model. Be careful though: shrinking databases is generally not advisable, and if you reduce the recovery model of a live database you are asking for something to go wrong.
Alternatively, you could do a full table rebuild to ensure there's no extra data hanging around anywhere:
CREATE TABLE tmp_table(<column definitions>);
GO
INSERT INTO tmp_table(<columns>) SELECT <columns> FROM <table>;
GO
DROP TABLE <table>;
GO
EXEC sp_rename N'tmp_table', N'<table>';
GO
Of course, things get more complicated with identity, indexes, etc etc...
I have been dancing around this issue for a while but it keeps coming up. We have a system where many of our tables start with a description that is originally stored as an NVARCHAR(150), and then we get a ticket asking to expand the field size to 250, then 1000, etc., etc...
This cycle is repeated on every "note" field and/or "description" field we add to most tables. Of course my concern is performance and breaking the 8k limit of the page. However, my other concern is making the system less maintainable by breaking these fields out of EVERY table in the system into a lazy-loaded reference.
So here I am faced with the same 2 options that have been staring me in the face (others are welcome); please lend me your opinions.
Change all my notes and/or descriptions to NVARCHAR(MAX) and make sure we exclude these fields in all listings. Basically never do a SELECT * FROM [TableName] unless it is only retrieving one record.
Remove all notes and/or description fields and replace them with a foreign key reference to a [Notes] table.
CREATE TABLE [dbo].[Notes] (
    [NoteId] [int] NOT NULL,
    [NoteText] [NVARCHAR](MAX) NOT NULL )
Obviously I would prefer to use option 1, because so much in our system will change if we go with 2. However if option 2 is really the only good way to proceed, then at least I can say these changes are necessary and I have done the homework.
UPDATE:
I ran several tests on a sample database with 100,000 records in it. What I found is that, because of clustered index scans, the IO required for option 1 is "roughly" twice that of option 2. If I select a large number of records (1000 or more), option 1 is twice as slow even if I do not include the large text field in the select. As I request fewer rows the lines blur more. In a web app, page sizes of 50 or so are the norm, so option 1 will work, but I will be converting all instances to option 2 in the (very) near future for scalability.
Option 2 is better for several reasons:
When querying your tables, the large text fields fill up pages quickly, forcing the database to scan more pages to retrieve data. This is especially taxing when you don't actually need to return the text data.
As you mentioned, it gives you a clean break to change the data type in one swoop. Microsoft has deprecated TEXT in SQL Server 2008, so you should stick with VARCHAR/VARBINARY.
Separate filegroups. Having all your text data in a slower, cheaper storage location might be something you decide to pursue in the future. If not, no harm, no foul.
While Option 1 is easier for now, Option 2 will give you more flexibility in the long-term. My suggestion would be to implement a simple proof-of-concept with the "notes" information separated from the main table and perform some of your queries on both examples. Compare the execution plans, client statistics and logical I/O reads (SET STATISTICS IO ON) for some of your queries against these tables.
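Something as simple as this makes the logical-read difference visible (the two table names are placeholders for the inline and split-out designs):

SET STATISTICS IO ON;

-- Option 1: notes stored inline in the wide table
SELECT TOP (1000) Id, Title FROM dbo.ItemsInline ORDER BY Id;

-- Option 2: notes split out into a separate Notes table
SELECT TOP (1000) Id, Title FROM dbo.ItemsNarrow ORDER BY Id;

SET STATISTICS IO OFF;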
A quick note to those suggesting the use of a TEXT/NTEXT from MSDN:
This feature will be removed in a future version of Microsoft SQL Server. Avoid using this feature in new development work, and plan to modify applications that currently use this feature. Use varchar(max), nvarchar(max) and varbinary(max) data types instead. For more information, see Using Large-Value Data Types.
I'd go with Option 2.
You can create a view that joins the two tables to make the transition easier on everyone, and then go through a clean-up process that removes the view and uses the single table wherever possible.
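A minimal sketch of that transition view, assuming hypothetical Id/Description columns and a NoteId column on the main table:

-- Presents the old "wide" shape on top of the split tables
CREATE VIEW dbo.MyTableWithNotes
AS
SELECT t.Id,
       t.Description,
       n.NoteText
FROM dbo.MyTable AS t
LEFT JOIN dbo.Notes AS n
    ON n.NoteId = t.NoteId;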
You want to use a TEXT field. TEXT fields aren't stored directly in the row; instead, it stores a pointer to the text data. This is transparent to queries, though - if you ask for a TEXT field, it will return the actual text, not the pointer.
Essentially, using a TEXT field is somewhat between your two solutions. It keeps your table rows much smaller than using a varchar, but you'll still want to avoid asking for them in your queries if possible.
The TEXT/NTEXT data type has practically unlimited length while taking up next to nothing in your record.
It comes with a few strings attached, like special behavior with string functions, but for a secondary "notes/description" type of field these may be less of a problem.
Just to expand on Option 2
You could:
Rename existing MyTable to MyTable_V2
Move the Notes column into a joined Notes table (with 1:1 joining ID)
Create a VIEW called MyTable that joins MyTable_V2 and Notes tables
Create an INSTEAD OF trigger on the MyTable view which saves the Notes column into the Notes table (IF NULL then delete any existing Notes row, if NOT NULL then insert if not found, otherwise update), and performs the appropriate action on the MyTable_V2 table (a much-simplified sketch follows after these steps)
Note: We've had trouble doing this where there is a computed column in MyTable_V2 (I think that was the problem; either way, we've hit snags when doing this with "unusual" tables)
All new Insert/Update/Delete code should be written to operate directly on MyTable_V2 and Notes tables
Optionally: have the INSTEAD OF trigger on MyTable log the fact that it was called (it can do this minimally, e.g. UPDATE a pre-existing log table row with GetDate() only if the existing row's date is > 24 hours old, so it will only do an update once a day).
When you are no longer getting any log records you can drop the INSTEAD OF trigger on MyTable view and you are now fully MyTable_V2 compliant!
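A much-simplified sketch of the UPDATE half of that trigger (INSERT and DELETE need the same treatment, and all names beyond MyTable_V2 and Notes are assumptions):

CREATE TRIGGER trg_MyTable_Update ON dbo.MyTable
INSTEAD OF UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Write the ordinary columns through to the real table
    UPDATE t
    SET t.Description = i.Description
    FROM dbo.MyTable_V2 AS t
    JOIN inserted AS i ON i.Id = t.Id;

    -- Notes present: update if a Notes row exists, otherwise insert one
    UPDATE n
    SET n.NoteText = i.Notes
    FROM dbo.Notes AS n
    JOIN inserted AS i ON i.Id = n.NoteId
    WHERE i.Notes IS NOT NULL;

    INSERT INTO dbo.Notes (NoteId, NoteText)
    SELECT i.Id, i.Notes
    FROM inserted AS i
    WHERE i.Notes IS NOT NULL
      AND NOT EXISTS (SELECT 1 FROM dbo.Notes AS n WHERE n.NoteId = i.Id);

    -- Notes NULL: delete any existing Notes row
    DELETE n
    FROM dbo.Notes AS n
    JOIN inserted AS i ON i.Id = n.NoteId
    WHERE i.Notes IS NULL;
END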
Huge amount of hassle to implement, as you surmised.
Alternatively, trawl the code for all references to MyTable and change them to MyTable_V2, put a VIEW in place of MyTable for SELECT only, and don't create the INSTEAD OF trigger.
My plan would be to fix all Insert/Update/Delete statements referencing the now-deprecated MyTable. For me this would be made somewhat easier because we use unique names for all tables and columns in the database, and we use the same names in all application code, so confidence that I had found all instances with a simple FIND would be high.
P.S. Option 2 is also preferable if you have any SELECT * lying around. We have had clients whose application performance went downhill fast when they added large Text/Blob columns to existing tables, because of "lazy" SELECT * statements. Hopefully that isn't the case in your shop, though!