SQL Server varbinary(max) and varchar(max) data in a separate table - sql-server

Using SQL Server 2005 Standard Edition with SP2.
I need to design a table where I will be storing a text file (~200 KB) along with a filename, description and datetime.
Should the varchar(max) and varbinary(max) data be stored in a separate table, or should the LOB columns be part of the main table?
Per this thread:
What is the benefit of having varbinary field in a separate 1-1 table?
there are no performance or operational benefits, which I agree with to some extent. However, I can see two benefits:
storing the LOB data in a separate table means it can be placed on a separate filegroup
you cannot rebuild an index ONLINE on a table containing LOB data types
Any suggestions would be appreciated.

I would advise against separation. It complicates the design significantly for little or no benefit. As you probably know, SQL Server already stores LOBs on separate allocation units, as described in Table and Index Organization.
Your first concern (separate filegroup allocation for the LOB data) can be addressed explicitly, as Mikael has already pointed out, by appropriately specifying the desired filegroup in the CREATE TABLE statement.
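For illustration, a minimal sketch of such a CREATE TABLE (table, column and filegroup names are hypothetical, and it assumes a filegroup named LOB_FG has already been added to the database):

CREATE TABLE dbo.StoredFile (
    StoredFileId int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    FileName     nvarchar(260)  NOT NULL,
    Description  nvarchar(1000) NULL,
    CreatedAt    datetime       NOT NULL DEFAULT GETDATE(),
    FileText     varchar(max)   NULL,
    FileContent  varbinary(max) NULL
)
ON [PRIMARY]             -- in-row data stays on the default filegroup
TEXTIMAGE_ON [LOB_FG];   -- the LOB allocation unit lives on the separate filegroup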
Your second concern is no longer a concern with SQL Server 2012; see Online Index Operations for Indexes containing LOB columns. Even prior to SQL Server 2012 you could reorganize indexes with LOBs without problems (and REORGANIZE is online). Given that a full index rebuild is a very expensive operation (an online rebuild must be done at the table/index level; there is no partition-level online rebuild option), are you sure you want to complicate the design to accommodate something that is, on one hand, seldom required, and on the other hand, will be available when you upgrade to SQL 2012?

I can answer your question in one simple word: KISS.
Which of course stands for... Keep It Simple Stupid.
Adding a table is generally a no-no unless you really need one to solve a problem.
Generally, I disagree with splitting tables. It adds complexity to databases and code. Having useless columns in a table is a bad thing, but it's not as bad as multiple tables when you only need one.
Cases where you would consider adding another table:
Some of your columns are BLOBs (greater than page size) that are rarely used, while other columns with small data sizes are accessed frequently.
If you lack a brain.
If you are evil.
Or... if you are trying to piss-off your coworkers.

Related

SQL Server: Best technique to regenerate a computed table

We have a few tables that are periodically recomputed within SQL Server. The computation takes a few seconds to a few minutes and we do the following:
Dump the results in computed_table_tmp
Drop computed_table
Rename computed_table_tmp to computed_table (and all indexes).
However, we still seem to run into concurrency issues where our application requests a view that utilizes this computed table at the precise moment when it no longer exists.
What would be the best technique to avoid this type of problem while ensuring high availability?
If this table is part of your high-availability requirement, then you can't do this the way you've been doing it. Dropping a table in a production SQL environment breaks the concept of high availability.
You might be able to accomplish what you're trying to achieve by creating one or more partitions on this table. A partitioned table is divided into subgroups of rows that can be spread across more than one filegroup in your database. For querying purposes, however, the table is still a single logical entity. The advantage of using a table partition is that you can move around subsets of your data without breaking the integrity of the database, i.e., high-availability is still in place.
In your scenario, you'd have to modify your process such that all activity takes place in the production version of the table. The new rows are dumped into a separate partition, based on the value of your partition function. Then you'll need to switch the partitions.
One of the things you'll need to do is identify a column in your table that may be used as the partition column, which determines which partition a row will be allocated to. This might be, for example, a datetime column indicating when the row was generated. You can even use a computed column for this purpose, provided it is a PERSISTED column.
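As a rough sketch of the mechanics (all names invented, Enterprise Edition assumed): the freshly computed rows go into a staging table with the same structure and indexes, which is then swapped into the live table with a metadata-only SWITCH, so the table itself is never dropped.

CREATE PARTITION FUNCTION pf_ComputedDate (datetime)
    AS RANGE RIGHT FOR VALUES ('2011-01-01', '2011-02-01');

CREATE PARTITION SCHEME ps_ComputedDate
    AS PARTITION pf_ComputedDate ALL TO ([PRIMARY]);

-- computed_table would be created ON ps_ComputedDate(YourDateColumn).
-- The staging table must match the table's structure and indexes, sit on the
-- same filegroup, and carry a CHECK constraint confining its rows to the target
-- partition's range; the target partition must be empty before the switch.
ALTER TABLE computed_table_staging
    SWITCH TO computed_table PARTITION 2;

Because SWITCH only updates metadata, it completes almost instantly.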
One caveat: Table partitioning is not available in all editions of SQL Server... I don't believe Standard has it.

Handling large datasets with SQL Server

I'm looking to manage a large dataset of log files. There is an average of 1.5 million new events per month that I'm trying to keep. I've used Access in the past, though it's clearly not meant for this, and managing the dataset is a nightmare because I'm having to split it into months.
For the most part, I just need to filter event types and count them. But before I do a bunch of work on the data import side of things, I wanted to see if anyone can verify that SQL Server is a good choice for this. Is there a limit on the number of entries I should avoid by archiving? Is there a way of archiving entries?
The other part is that I'm entering logs from multiple sources. With this number of entries, is it wise to put them all into the same table, or should each source have its own table to make queries faster?
edit...
There would be no joins, and about 10 columns. Data would be filtered through a view, and I'd like to know whether a SELECT query that filters on one or more columns would have a reasonable response time. Does creating a set of views speed things up for frequent queries?
In my experience, SQL Server is a fine choice for this, and you can definitely expect better performance from SQL Server than MS-Access, with generally more optimization methods at your disposal.
I would probably go ahead and put this stuff into SQL Server Express as you've said, hopefully installed on the best machine you can use (though you did mention only 2GB of RAM). Use one table so long as it only represents one thing (I would think a pilot's flight log and a software error log wouldn't be in the same "log" table, as an absurdly contrived example). Check your performance. If it's an issue, move forward with any number of optimization techniques available to your edition of SQL Server.
Here's how I would probably do it initially:
Create your table with a non-clustered primary key, if you use a PK on your log table at all -- I normally use an identity column to give me a guaranteed order of events (unlike duplicate datetimes) and to show possible log insert failures (missing identities). Set a clustered index on the main datetime column (you mentioned that you're already splitting into separate tables by month, so I assume you'll query this way, too). If you have a few queries that you run on this table routinely, by all means make views of them, but don't expect a speedup simply by doing so. You'll more than likely want to look at indexing your table based upon the WHERE clauses in those queries; that is where you'll be giving SQL Server the information it needs to run those queries efficiently.
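A minimal sketch of that layout (table, column and index names are illustrative, not from the question):

CREATE TABLE dbo.EventLog (
    EventLogId int IDENTITY(1,1) NOT NULL,
    EventTime  datetime     NOT NULL,
    EventType  tinyint      NOT NULL,
    SourceId   smallint     NOT NULL,
    Message    varchar(500) NULL,
    CONSTRAINT PK_EventLog PRIMARY KEY NONCLUSTERED (EventLogId)
);

-- Cluster on the datetime column so date-range queries read contiguous pages.
CREATE CLUSTERED INDEX CIX_EventLog_EventTime ON dbo.EventLog (EventTime);

-- Supports the common "filter by event type, then count" pattern.
CREATE NONCLUSTERED INDEX IX_EventLog_EventType ON dbo.EventLog (EventType, EventTime);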
If you're unable to get your desired performance through optimizing your queries, indexes, using the smallest possible datatypes (especially on your indexed columns) and running on decent hardware, it may be time to try partitioned views (which require some form of ongoing maintenance) or partitioning your table. Unfortunately, SQL Server Express may limit you on what you can do with partitioning, and you'll have to decide if you need to move to a more feature-filled edition of SQL Server. You could always test partitioning with the Enterprise evaluation or Developer editions.
Update:
For the most part, I just need to filter event types and count the number.
Since past logs don't change (sort of like past sales data), storing the past aggregate numbers is an often-used strategy in this scenario. You can create a table which simply stores your counts for each month and insert new counts once a month (or week, day, etc.) with a scheduled job of some sort. Using the clustered index on your datetime column, SQL Server could much more easily aggregate the current month's numbers from the live table and add them to the stored aggregates for displaying the current values of total counts and such.
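A sketch of such a scheduled rollup, reusing the illustrative EventLog table from above (the summary table and its columns are hypothetical):

DECLARE @MonthStart datetime;
SET @MonthStart = '20090101';  -- first day of the month being rolled up

INSERT INTO dbo.EventCountsByMonth (MonthStart, EventType, EventCount)
SELECT @MonthStart, EventType, COUNT(*)
FROM dbo.EventLog
WHERE EventTime >= @MonthStart
  AND EventTime <  DATEADD(month, 1, @MonthStart)
GROUP BY EventType;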
Sounds like one table to me; it would need indexes on exactly the sets of columns you will filter on. Restricting access through views is generally a good idea and ensures your indexes will actually get used.
Putting each source into its own table will require UNION in your queries later, and SQL Server is not very good at optimizing UNION queries.
"Archiving" entries can of course be done manually, by moving entries in a date range to another table (which can live on another disk or database), or by using "partitioning", which means you can put parts of a table (e.g. defined by date ranges) on different disks. You have to plan for the partitions when you plan your SQL Server installation.
Be aware that Express edition is limited to a 4 GB database size, so at 1.5 million rows per month this could be a problem.
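A sketch of the manual archiving route mentioned above, which also helps stay under such size limits (names hypothetical; the archive table is assumed to have the same columns as the log table but without the IDENTITY property, and it can live in another database):

BEGIN TRANSACTION;

INSERT INTO dbo.EventLogArchive (EventLogId, EventTime, EventType, SourceId, Message)
SELECT EventLogId, EventTime, EventType, SourceId, Message
FROM dbo.EventLog
WHERE EventTime >= '20090101' AND EventTime < '20090201';

DELETE FROM dbo.EventLog
WHERE EventTime >= '20090101' AND EventTime < '20090201';

COMMIT TRANSACTION;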
I have a table like yours with 20M rows and have little problem querying and even joining, as long as the indexes are used.

Are these tables too big for SQL Server or Oracle

I'm not much of a database guru so I would like some advice.
Background
We have 4 tables that are currently stored in Sybase IQ. We don't currently have any choice over this, we're basically stuck with what someone else decided for us. Sybase IQ is a column-oriented database that is perfect for a data warehouse. Unfortunately, my project needs to do a lot of transactional updating (we're more of an operational database) so I'm looking for more mainstream alternatives.
Question
Given these tables' dimensions, would anyone consider SQL Server or Oracle to be a viable alternative?
Table 1 : 172 columns * 32 million rows
Table 2 : 453 columns * 7 million rows
Table 3 : 112 columns * 13 million rows
Table 4 : 147 columns * 2.5 million rows
Given the size of data what are the things I should be concerned about in terms of database choice, server configuration, memory, platform, etc.?
Yes, both should be able to handle your tables (if your server is suited for it). But I would consider redesigning your database a bit. Even in a data warehouse where you denormalize your data, a table with 453 columns is not normal.
It really depends on what's in the columns. If there are lots of big VARCHAR columns -- and they are frequently filled to near capacity -- then you could be in for some problems. If it's all integer data then you should be fine.
453 * 4 = 1,812 -- if all columns are 4-byte integers, row size is ~1.8 KB
453 * 255 = 115,515 -- if all columns are VARCHAR(255), theoretical row size is ~112 KB
The rule of thumb is that row size should not exceed the disk block size, which is generally 8k. As you can see, your big table is not a problem in this regard if it consists entirely of 4-byte integers but if it consists of 255-char VARCHAR columns then you could be exceeding the limit substantially. This 8k limit used to be a hard limit in SQL Server but I think these days it's just a soft limit and performance guideline.
Note that VARCHAR columns don't necessarily consume memory commensurate with the size you specify for them. That is the max size, but they only consume as much as they need. If the actual data in the VARCHAR columns is always 3-4 chars long then size will be similar to that of integer columns regardless of whether you created them as VARCHAR(4) or VARCHAR(255).
The general rule is that you want row size to be small so that there are many rows per disk block, this reduces the number of disk reads necessary to scan the table. Once you get above 8k you have two reads per row.
Oracle has another potential problem which is that ANSI joins have a hard limit on the total number of columns in all tables in the join. You can avoid this by avoiding the Oracle ANSI join syntax. (There are equivalents that don't suffer from this bug.) I don't recall what the limit is or which versions it applies to (I don't think it's been fixed yet).
The numbers of rows you're talking about should be no problem at all, presuming you have adequate hardware.
With suitably sized hardware and an I/O subsystem to meet your demands, both are quite adequate - whilst you have a lot of columns, the row counts are really very low - we regularly use datasets that are expressed in billions, not millions. (Just do not try it on SQL 2000 :) )
If you know your usage and I/O requirements, most I/O vendors will translate that into hardware specs for you. Memory, processors etc. are again dependent on workloads that only you can model.
Oracle 11g has no problems with such data and structure.
More info at: http://neworacledba.blogspot.com/2008/05/database-limits.html
Regards.
Oracle limitations
SQL Server limitations
You might be close on SQL Server, depending on what data types you have in that 453 column table (note the bytes per row limitation, but also read the footnote). I know you said that this is normalized, but I suggest looking at your workflow and considering ways of reducing the column count.
Also, these tables are big enough that hardware considerations are a major issue with performance. You'll need an experienced DBA to help you spec and set up the server with either RDBMS. Properly configuring your disk subsystem will be vital. You will probably also want to consider table partitioning among other things to help with performance, but this all depends on exactly how the data is being used.
Based on your comments in the other answers I think what I'd recommend is:
1) Isolate which data is actually updated vs. which data is more or less read-only (or updated infrequently)
2) Move the updated data to separate tables joined on an id to the bigger tables (deleting those columns from the big tables)
3) Do your OLTP transactions against the smaller, more relational tables
4) Use inner joins to hook back up to the big tables to retrieve data when necessary (see the sketch below)
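A bare-bones illustration of that split (all names invented):

-- The wide, mostly read-only table keeps the rarely-changing columns.
CREATE TABLE dbo.Wide_Static (
    RowId bigint NOT NULL PRIMARY KEY
    -- ... the hundreds of rarely-changing columns ...
);

-- A narrow table holds only what the application actually updates.
CREATE TABLE dbo.Wide_Updated (
    RowId       bigint   NOT NULL PRIMARY KEY
        REFERENCES dbo.Wide_Static (RowId),
    Status      tinyint  NOT NULL,
    LastUpdated datetime NOT NULL
);

-- OLTP writes touch only the narrow table; reads join back to the wide table
-- only when those columns are actually needed.
SELECT u.RowId, u.Status, s.*
FROM dbo.Wide_Updated AS u
INNER JOIN dbo.Wide_Static AS s ON s.RowId = u.RowId;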
As others have noted you are trying to make the DB do both OLTP and OLAP at the same time and that is difficult. Server settings need to be tweaked differently for either scenario.
Either SQL Server or Oracle should work. I use census data as well and my giganto table has around 300+ columns. I use SQL Server 2005 and it complains that if all the columns were to be filled to their capacity it would exceed the max possible size for a record. We use our census data in an OLAP fashion, so it isn't such a big deal to have so many columns.
Are all of the columns in all of those tables updated by your application?
You could consider having data marts (AKA an operational or online data store) that are updated during the day, with the new records migrated into the main warehouse at night. I say this because rows with massive numbers of columns are going to be slower to insert and update, so you may want to consider tailoring your specific online architecture to your application's update requirements.
Asking one DB to act as an operational and warehouse system at the same time is still a bit of a tall order. I would consider using SQL Server or Oracle for the operational system and having a separate DW for reporting and analytics, probably keeping the system you have.
Expect some table redesign and normalization to happen on the operational side to fit the row-per-page size limitations of row-based storage.
If you need to have fast updates of the DW, you may consider an EP approach to ETL, as opposed to standard (scheduled) ETL.
Considering that you are in the early stages of this, take a look at Microsoft's project Madison, which is an auto-scalable DW appliance up to hundreds of TB. They have already shipped some installations.
I would very carefully consider switching from a column oriented database to a relational one. Column oriented databases are indeed inadequate for operational work as updates are very slow, but they are more than adequate for reporting and business intelligence support.
More often than not one has to split the operational work into an OLTP database containing the current activity needed for operations (accounts, inventory etc.) and use an ETL process to populate the data warehouse (history, trends). A column-oriented DW will beat a relational one hands down in almost any circumstance, so I wouldn't give up Sybase IQ so easily. Perhaps you can design your system to have an operational OLTP side using your relational product of choice (I would choose SQL Server, but I'm biased) and keep the OLAP part you have now.
Sybase have a product called RAP that combines IQ with an in-memory instance of ASE (their relational database) which is designed to help in situations such as this.
Your data isn't so vast that you couldn't consider moving to a row-oriented database but, depending on the structure of the data, you could end up using considerably more disk space and slowing down many kinds of queries.
Disclaimer: I do work for Sybase but not currently on the ASE/IQ/RAP side.

What is the benefit of having varbinary field in a separate 1-1 table?

I need to store binary files in a varbinary(max) column on SQL Server 2005 like this:
FileInfo
FileInfoId int, PK, identity
FileText varchar(max) (can be null)
FileCreatedDate datetime etc.
FileContent
FileInfoId int, PK, FK
FileContent varbinary(max)
FileInfo has a one to one relationship with FileContent. The FileText is meant to be used when there is no file to upload, and only text will be entered manually for an item. I'm not sure what percentage of items will have a binary file.
Should I create the second table? Would there be any performance improvements with the two-table design? Are there any logical benefits?
I've found this page, but not sure if it applies in my case.
There is no performance or operational advantage. Since SQL 2005 the LOB types are already stored for you by the engine in a separate allocation unit, a separate b-tree. If you study the Table and Index Organization of SQL Server you'll see that every partition has up to 3 allocation units: data, LOB and row-overflow:
(diagram of the table and index organization, showing the data, LOB and row-overflow allocation units per partition; source: s-msft.com)
A LOB field (varchar(max), nvarchar(max), varbinary(max), XML, CLR UDTs as well as the deprecated types text, ntext and image) will have in the data record itself, in the clustered index, only a very small footprint: a pointer into the LOB allocation unit, see Anatomy of a Record.
By storing a LOB explicitly in a separate table you gain absolutely nothing. You just add unneeded complexity, as formerly atomic updates now have to spread themselves across two separate tables, complicating the application and the application's transaction structure.
If the LOB content is an entire file then perhaps you should consider upgrading to SQL 2008 and using FILESTREAM.
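For reference, a sketch of what that could look like on SQL Server 2008 (assuming FILESTREAM is enabled on the instance and a FILESTREAM filegroup named FS_FG has been added to the database; a FILESTREAM table also needs a ROWGUIDCOL column with a unique constraint):

CREATE TABLE dbo.FileInfo (
    FileInfoId      int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    FileText        varchar(max) NULL,
    FileCreatedDate datetime NOT NULL,
    RowGuid         uniqueidentifier ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
    FileContent     varbinary(max) FILESTREAM NULL  -- contents stored as files in the FILESTREAM container
)
FILESTREAM_ON FS_FG;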
There is no real logical advantage to this two-table design: since the relationship is 1-1, you might as well have all the info bundled in the FileInfo table. However, there are serious operational and performance advantages, in particular if your binary data is more than a few hundred bytes in size, on average.
EDIT: As pointed out by Remus Rusanu, on some DBMS implementations such as SQL 2005, the large object types are transparently stored in a separate table, effectively alleviating the practical drawback of having big records. The introduction of this feature implicitly confirms the [true] single-table approach's weakness.
I merely scanned the SO posting referenced in this question. I generally think that while that other posting makes a few valid points, such as intrinsic data integrity (since all CRUD actions on a given item are atomic), on the whole, and except for relatively atypical use cases (such as using the item table as a repository mostly queried for single items at a time), the performance advantage is with the two-table approach (whereby indexes on the "header" table will be more efficient, queries that do not require the binary data will return much more quickly, etc.).
And the two-table approach has further benefits in case the design evolves to supply different types of binary objects in different contexts. For example, say these items are images (GIFs, JPGs etc.). At a later date you may want to also provide a small preview version of these images (and/or a hi-resolution version), the choice being driven by the context (user preference, low-bandwidth clients, subscriber vs. visitor etc.). In such a case not only are the operational issues associated with the single-table approach made more acute, the model also becomes more versatile.
It can help to separate IMAGE, (N)TEXT, (N)VARCHAR(max) and VARBINARY(max) columns out of wider tables purely because of some restrictions in SQL Server.
For example, before 2012 it was not possible to rebuild a clustered index online if the table contained LOBs. On the other hand, you might not care about those restrictions, so setting up the table according to how your data is related is the better thing to do.
In case you want to physically keep the LOB data out of the table's in-row allocation unit, you can still set the "large value types out of row" table option.
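A minimal example, assuming the single-table FileInfo design from the question (1 turns the option on):

EXEC sp_tableoption 'dbo.FileInfo', 'large value types out of row', 1;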

Performance implications of computed columns in SQL Server 2005 database?

The situation: we have a large database with a number of denormalized tables. We frequently have to resummarize the data to keep the summary tables in synch. We've talked on and off about using computed columns to keep the data fresh. We've also talked about triggers, but that's a separate discussion.
In our summary tables, we denormalized the table such that the Standard ID as well as the Standard Description is stored in the table. This inherently assumes that the table will be resummarized often enough so that if they change the standard description, it will also change it in the summary table.
A bad assumption.
Question:
What if we made the Standard Description in the summary table a derived/computed column which selects the standard description from the standard table?
Is there a tremendous performance hit by dropping a computed column on a table with 100,000-500,000 rows?
Computed columns are fine when they are not calculation-intensive and are not executed on a large number of rows. Your question is "will there be a hit by dropping the computed column?" Unless this column is part of an index that is used by queries (a REALLY bad idea to index a computed column -- I don't know if you even can, depending on your DB), dropping it can't hurt your performance (less data to query and crunch).
If the standard table has the description, then you should be joining it in from the id and not using any computation.
You alluded to what may be a real problem, and that is the schema of your database. I have had problems like this before, where a system was built to handle one thing, and something like reporting needs to be bolted on/in. Without refactoring your schema to balance all of the needs, Sunny's idea of using views is just about the only easy way.
If you want to post some cleansed DDL and data, and an example of what you are trying to get out of the db, we may be able to give you a less subjective answer.
A computed column in a table can only be derived from values on that row. You can't have a lookup in the computed column. For that you would require a view.
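A minimal sketch of such a view (table and column names invented):

CREATE VIEW dbo.SummaryWithDescription
AS
SELECT s.SummaryId,
       s.StandardId,
       std.StandardDescription,
       s.SummaryValue
FROM dbo.SummaryTable AS s
INNER JOIN dbo.StandardTable AS std
    ON std.StandardId = s.StandardId;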
On a table that small, denormalising the name into the table will probably have a negligible performance impact. You can use DBCC PINTABLE to hint to the server to keep the table in the cache.
If you need the updates to be made in realtime then really your only option is triggers. Putting a clustered index on the ID column corresponding to the name you are updating should reduce the amount of I/O overall (the records for a given ID will be in the same block or set of blocks) so try this if the triggers are causing performance issues.
Just to clarify the issue for SQL 2005 and up:
This functionality was introduced for performance in SQL Server version 6.5. DBCC PINTABLE has highly unwanted side-effects. These include the potential to damage the buffer pool. DBCC PINTABLE is not required and has been removed to prevent additional problems. The syntax for this command still works but does not affect the server.
