Performance overhead of adding a BLOB field to a table - sql-server

I am trying to decide between two possible implementations and am eager to choose the best one :)
I need to add an optional BLOB field to a table which currently only has 3 simple fields. It is predicted that the new field will be used in fewer than 10%, maybe even less than 5% of cases so it will be null for most rows - in fact most of our customers will probably never have any BLOB data in there.
A colleague's first inclination was to add a new table to hold just the BLOBs, with a (nullable) foreign key in the first table. He predicts this will have performance benefits when querying the first table.
My thoughts were that it is more logical and easier to store the BLOB directly in the original table. None of our queries do SELECT * from that table so my intuition is that storing it directly won't have a significant performance overhead.
I'm going to benchmark both choices but I was hoping some SQL gurus had any advice from experience.
Using MSSQL and Oracle.

For MSSQL, the blobs will be stored on a separate page in the database so they should not affect performance if the column is null.
If you use the (deprecated) IMAGE data type then the data is always stored out of the row.
If you use the varbinary(max) data type then data larger than about 8 KB is stored outside the row; smaller values may be stored in the row depending on the table's 'large value types out of row' option.
If you only have a few rows with blobs the performance should not be affected.
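As a minimal sketch (the table and column names are made up for illustration), the column can simply be added as nullable, and small values can optionally be pushed off-row as well:

    -- Add the optional BLOB column; rows that leave it NULL carry essentially no extra cost.
    ALTER TABLE dbo.MyTable
        ADD ExtraData varbinary(max) NULL;

    -- Optional: force even small BLOB values off the data page so the base row size
    -- is unaffected by the new column.
    EXEC sp_tableoption 'dbo.MyTable', 'large value types out of row', 1;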

Related

SQL Server table design with non-fixed columns

I need your help designing a table.
I have some group tables, and we need to load data into them from XML files that contain column names and data.
The column names are actually indexed versions of a main column, like activity_col1, activity_col2, and so on, and they are not fixed: the same file may contain 1000 columns one time and only 10 column values another time. There is also a maximum limit, so no file will contain more than 2000 columns per group.
So I need to design the best possible table for this, and I also need to aggregate the column values. The files contain minute-level data; I need to store it in a minute table and then aggregate that data by hour, day, week, and month.
If I create the maximum number of columns in all tables, data will not arrive in all of them every time, so that design seems poor because most of the values will be NULL.
If I instead insert the column names as rows in a column_name column, with each value in a values column, then the aggregation becomes a tedious task for me and will impact performance.
Please suggest.
One option would be EAV, but it's more complicated to build, to query and to insert, and readability is very low.
Since you require a schema-less design allowing an effectively unlimited number of columns, your best bet is probably a NoSQL solution, even though the weaknesses of EAV relative to relational databases also apply to NoSQL alternatives.
Also take a look here:
Benefits of NoSQL
Recommendations (in priority order):
EAV, if you are using a relational database. This is where you turn either the whole table, or a portion of it (in another table), on its side. It is a good choice if you already have a relational database in-house that you can't easily move away from (a minimal sketch follows at the end of this answer).
NoSQL, if the kind of DBMS doesn't matter to you. It is very flexible and fast, although not all of the report writers out there support this style of storage. There are many NoSQL database implementations; the one that seems to be most popular right now is MongoDB.
And the last option, which I don't recommend: standard tables with XML columns, if you don't need to query the data and just want it stored and retrieved as plain text for some occasional extra use.
I hope this is helpful :)
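To make the EAV option concrete, here is a rough sketch of what the minute-level table and an hourly roll-up could look like (all table and column names are hypothetical):

    -- One row per (group, column, minute) instead of up to 2000 physical columns.
    CREATE TABLE dbo.activity_min (
        group_id     int           NOT NULL,
        column_name  varchar(50)   NOT NULL,  -- e.g. 'activity_col37'
        recorded_at  datetime      NOT NULL,  -- minute-level timestamp
        col_value    decimal(18,4) NULL,
        CONSTRAINT PK_activity_min PRIMARY KEY (group_id, column_name, recorded_at)
    );

    -- Same shape for the hourly rollup.
    CREATE TABLE dbo.activity_hour (
        group_id     int           NOT NULL,
        column_name  varchar(50)   NOT NULL,
        hour_start   datetime      NOT NULL,
        col_value    decimal(18,4) NULL,
        CONSTRAINT PK_activity_hour PRIMARY KEY (group_id, column_name, hour_start)
    );

    -- Hourly aggregation becomes a single GROUP BY instead of 2000 separate SUMs.
    INSERT INTO dbo.activity_hour (group_id, column_name, hour_start, col_value)
    SELECT group_id,
           column_name,
           DATEADD(hour, DATEDIFF(hour, 0, recorded_at), 0),  -- truncate to the hour
           SUM(col_value)
    FROM   dbo.activity_min
    GROUP  BY group_id, column_name, DATEADD(hour, DATEDIFF(hour, 0, recorded_at), 0);

The day, week, and month tables would follow the same pattern, each aggregating from the level below it.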

Impacts of Large Nullable Columns in SQL Server

I'm designing a schema where certain members can upload images (based on a permission). I'm planning on doing this using a varbinary(max) column.
What are the storage and performance implications to consider between the two following designs (apart from the obvious fact that the latter is one-to-many, which can be constrained easily enough)?
A single table with a nullable varbinary(max) column
Two tables, one for Members, the second for Pictures
Clearly an additional left join will slow performance, but will the single-table approach require more storage space? (I don't normally consider storage size a bigger concern than performance, but for this project I have fairly tight limits with my hosting provider.)
A variable-length nullable column that is NULL takes no space in the table.
When you do store the BLOB, it may be stored in-row or off-row, depending on size etc. This applies whether you use 1 or 2 tables.
If you have a separate table, you'd additionally need to store the primary key of Members (or give it its own key with a FK in Members). However, this is trivial compared to your picture size.
Personally, I'd use one table to keep it simple.
Unless, say, I wanted to use FILESTREAM, or to use a different filegroup for the BLOBs.
Store the images in the same table. There will be no storage or speed benefit to storing them in a separate table, except perhaps if you have zillions of members and only 10 of them have a picture.
Since SQL Server does not store a nullable variable-length column at all when its value is NULL, you may even gain a speed benefit compared with the two-table design.
Consider using a FILESTREAM column if your images are big enough (say, more than 1 MB). It stores the images as files in the file system, which speeds up read/write operations, while keeping them consistent with backups.
A better option still: store the images on disk and add a nullable field with the file name (path) to the Members table.
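For comparison, a minimal sketch of the two designs being discussed (all names are illustrative, not a prescribed schema):

    -- Option 1: single table; the picture is simply NULL for members without one.
    CREATE TABLE dbo.Members (
        MemberId  int IDENTITY PRIMARY KEY,
        UserName  nvarchar(100)  NOT NULL,
        Picture   varbinary(max) NULL      -- takes no row space while NULL
    );

    -- Option 2: separate table, joined only when the picture is actually needed
    -- (in this design, Members would not have the Picture column above).
    CREATE TABLE dbo.MemberPictures (
        MemberId  int NOT NULL
            CONSTRAINT FK_MemberPictures_Members REFERENCES dbo.Members (MemberId),
        Picture   varbinary(max) NOT NULL,
        CONSTRAINT PK_MemberPictures PRIMARY KEY (MemberId)
    );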

Can the size of a column in the database slow a query?

I have a table with a column that contains HTML content and is relatively large compared to the other columns.
Can having such a large column slow the queries on this table?
Do I need to put this big field in another table?
The TOAST technique should handle this for you: beyond a given size the value is transparently stored in a _toast table, and some internal work is done to avoid slowing down your queries (check the given link).
But of course, if you always retrieve the whole content you'll lose time in the retrieval. It's also clear that queries on this table that don't use this column won't suffer from its size.
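If you are curious how much of the table actually lives in TOAST storage, a quick check along these lines can help (assuming the table is called html_pages):

    -- Main table size alone vs. total size including TOAST data and indexes.
    SELECT pg_size_pretty(pg_relation_size('html_pages'))       AS main_table,
           pg_size_pretty(pg_total_relation_size('html_pages')) AS total_incl_toast_and_indexes;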
The bigger the database the slower the queries. Always.
It's likely that if you have a large column, there is going to be more disk I/O, since caching the column itself takes more space. However, putting it in a different table won't necessarily alleviate this issue (other than the issue below). When you don't explicitly need the actual HTML data, be sure not to SELECT it.
Sometimes the ordering of the columns can matter because of the way rows are stored. If you're really worried about it, store this column last so it doesn't get paged in when selecting the other columns.
You would have to look at how Postgres stores things internally to see whether you need to split this out, but a very large field can cause the on-disk storage of the row to be broken up, which adds to the time it takes to access it.
Further, returning 10,000 bytes of data instead of 100 bytes for one record is clearly going to be slower, and the more records, the slower it gets. If you are doing SELECT * this is clearly a problem, especially if you usually do not need the HTML.
Another consideration could be putting the HTML information in a NoSQL database. This kind of document data is what they excel at. There's no reason you can't use a relational database for some info and a NoSQL database for other info.

How to design this database?

I have to design a database to store log data, but I don't have prior experience. My table contains about 19 columns (about 500 bytes per row) and grows by up to 30,000 new rows a day. My app must be able to query this table efficiently.
I'm using SQL Server 2005.
How can I design this database?
EDIT: the data I want to store is a mix of types: datetime, string, short and int. NULL cells are about 25% of the total :)
However else you'll do lookups, a logging table will almost certainly have a timestamp column. You'll want to cluster on that timestamp first to keep inserts efficient. That may mean also always constraining your queries to specific date ranges, so that the selectivity on your clustered index is good.
You'll also want indexes for the fields you'll query on most often, but don't jump the gun here. You can add the indexes later. Profile first so you know which indexes you'll really need. On a table with a lot of inserts, unwanted indexes can hurt your performance.
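As a rough sketch of that advice (the table and columns here are invented for the example), the table might be clustered on the timestamp up front, with secondary indexes added only after profiling:

    -- Cluster on the insert order (timestamp) so new rows are appended at the end.
    CREATE TABLE dbo.AppLog (
        LogId     bigint IDENTITY NOT NULL,
        LoggedAt  datetime        NOT NULL,
        Source    varchar(50)     NULL,
        Message   varchar(400)    NULL,
        CONSTRAINT PK_AppLog PRIMARY KEY NONCLUSTERED (LogId)
    );

    CREATE CLUSTERED INDEX CIX_AppLog_LoggedAt ON dbo.AppLog (LoggedAt);

    -- Added later, only once profiling shows queries actually filter on Source.
    CREATE NONCLUSTERED INDEX IX_AppLog_Source ON dbo.AppLog (Source, LoggedAt);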
Well, given the description you've provided all you can really do is ensure that your data is normalized and that your 19 columns don't lead you to a "sparse" table (meaning that a great number of those columns are null).
If you'd like to add some more data (your existing schema and some sample data, perhaps) then I can offer more specific advice.
Throw an index on every column you'll be querying against.
Huge amounts of test data, and execution plans (with query analyzer) are your friend here.
In addition to the comment on sparse tables, you should index the table on the columns you wish to query.
Alternatively, you could test it using the profiler and see what the profiler suggests in terms of indexing based on actual usage.
Some optimisations you could make:
Cluster your data based on the most likely look-up criteria (e.g. clustered primary key on each row's creation date-time will make look-ups of this nature very fast).
Assuming that rows are written one at a time (not in batch) and that each row is inserted but never updated, you could code all select statements to use the "with (NOLOCK)" option. This will offer a massive performance improvement if you have many readers as you're completely bypassing the lock system. The risk of reading invalid data is greatly reduced given the structure of the table.
If you're able to post your table definition I may be able to offer more advice.
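For illustration, a read of the last hour of data with the hint described above might look like this (reusing the hypothetical AppLog table from the earlier sketch):

    -- The date-range predicate keeps the clustered index selective;
    -- NOLOCK lets readers skip shared locks, at the cost of possible dirty reads.
    SELECT LoggedAt, Source, Message
    FROM   dbo.AppLog WITH (NOLOCK)
    WHERE  LoggedAt >= DATEADD(hour, -1, GETDATE());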

Pros and Cons of massive table that controls all data flow with stored procs

A DBA (with only 2 years of Google for training) has created a massive data management table (108 columns and growing) containing all necessary attributes for any data flow in the system. We'll call this table BFT for short.
Of these columns:
10 are for meta-data references.
15 are for data source and temporal tracking
1 instance of new/curr columns for textual data
10 instances of new/current/delta/ratio/range columns for multi-value numeric updates
totaling 50 columns.
Multi-valued numeric updates usually only need 2-5 of the update groups.
Batches of 15K-1500K records are loaded into the BFT and processed by stored procs with logic to validate those records and shuffle them off to permanent storage in about 30 other tables.
In most of the record loads, 50-70 of the columns are empty throughout the entire process.
I am no database expert, but this model and process seem to smell a little. I don't know enough to say why, though, and I don't want to complain without being able to offer an alternative.
Given this very small insight into the data processing model, does anyone have thoughts or suggestions? Can the database (SQL Server) be trusted to handle records with mostly empty columns efficiently, or does processing in this manner waste lots of cycles, memory, etc.?
Sounds like he reinvented BizTalk.
I typically have multiple staging tables corresponding to the input loads. These may or may not correspond to the destination tables, but we don't do what you're talking about. If he doesn't like to have a lot of what are basically temporary work tables, they could be put into their own schema or even a separate database.
As for the columns which are empty: if they aren't referenced in the particular query that is processing the BFT, it doesn't matter. HOWEVER, it becomes much more crucial that the index chosen is a non-clustered covering index. When your BFT is used and a table scan or clustered index scan is chosen, the unused columns have to be read and ignored or skipped, and this definitely seems to affect processing in my experience. With a non-clustered index scan or seek, fewer columns are read, and hopefully these don't include (m)any of the unused columns.
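To illustrate the covering-index point, a hedged sketch against a hypothetical slice of the BFT (the column names are invented, since the real layout isn't shown):

    -- A non-clustered index that covers the validation query reads only the listed
    -- columns, so the 50-70 empty columns are never touched by that query.
    CREATE NONCLUSTERED INDEX IX_BFT_BatchStatus
        ON dbo.BFT (batch_id, record_status)
        INCLUDE (source_system, loaded_at);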
Normalization is the keyword here. If you have so many NULL values, chances are high that you're wasting a lot of space. Normalizing the table should also make data integrity in this table easier to enforce.
One thing that might make things a little more flexible (other than normalizing) could be to create one or more views or table functions to present the data. Particularly if the table is outside your control, these would enable you to filter the spurious crap out and grab only what you need from the table.
However, if you're going to be one of the people who will be working with (and frowning every time you have to crack open) that massive table, you might want to trump the DBA's "design" and normalize that beast, and maybe give the DBA the task of creating some views and/or table functions to help you out.
I currently work with a similar but not so huge table which has been around on our system for years and has had new fields and indices and constraints rather hastily tacked on Frankenstein-style. Unfortunately some other workgroups rely on the structure as gospel, so we've created such views and functions to enable us to "shape" the data the way we need it.
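Such a view can be as simple as projecting the handful of columns a given consumer actually uses (again, the column names here are invented for the example):

    -- Hides the unused columns from downstream consumers of the BFT.
    CREATE VIEW dbo.vw_BFT_NumericUpdates
    AS
    SELECT batch_id,
           record_status,
           new_value_1, curr_value_1, delta_1,
           loaded_at
    FROM   dbo.BFT
    WHERE  record_status = 'VALIDATED';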
