Impacts of Large Nullable Columns in SQL Server

I'm designing a schema where certain members can upload images (based on a permission). I'm planning on doing this using a varbinary(max) column.
What are the storage and performance implications to consider between the following two designs (apart from the obvious fact that the latter is one-to-many; that can be constrained easily enough)?
A single table with a nullable varbinary(max) column
Two tables, one for Members, the second for Pictures
Clearly an additional left join will slow performance, but will the single-table approach require more storage space? (I don't normally weigh storage size over performance, but for this project I have fairly tight limits with my hosting provider.)
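For concreteness, a rough sketch of the two candidate designs (table and column names are hypothetical):

    CREATE TABLE Members (
        MemberId  int IDENTITY PRIMARY KEY,
        UserName  nvarchar(100) NOT NULL,
        Picture   varbinary(max) NULL   -- design 1: nullable column in the same table
    );

    -- Design 2: drop the Picture column above and keep a separate 1:1 table instead.
    CREATE TABLE MemberPictures (
        MemberId  int PRIMARY KEY REFERENCES Members(MemberId),
        Picture   varbinary(max) NOT NULL
    );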

A variable-length nullable column that is NULL takes no space in the row.
When you do store the BLOB, it may be stored in-row or off-row, depending on size etc. This applies whether you use one table or two.
If you have a separate table, you'd additionally need to store the primary key of Members (or it gets its own key, with an FK in Members). However, this is trivial compared to your picture size.
Personally, I'd use one table to keep it simple.
The exception would be if, say, I wanted to use FILESTREAM, or to put the BLOBs on a different filegroup.
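If the separate-filegroup route appeals, a minimal sketch (BLOBS is a hypothetical filegroup that must already exist in the database):

    CREATE TABLE dbo.Members (
        MemberId  int IDENTITY PRIMARY KEY,
        UserName  nvarchar(100) NOT NULL,
        Picture   varbinary(max) NULL
    ) ON [PRIMARY]
      TEXTIMAGE_ON [BLOBS];  -- LOB pages are allocated on the BLOBS filegroup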

Store the images in the same table. There will be no storage or speed benefit to storing them in a separate table, except if you'll have zillions of members and only ten of them have a picture.
Since SQL Server does not store a nullable variable-length column at all when its value is NULL, you may even gain a speed benefit compared to the two-table design.
Consider using a FILESTREAM column if your images are big enough (say, more than 1 MB). It stores the images as files in the file system, which speeds up read/write operations while preserving backup consistency.
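A minimal FILESTREAM sketch (SQL Server 2008 and later; assumes FILESTREAM is enabled on the instance and the database has a FILESTREAM filegroup; names are hypothetical):

    CREATE TABLE dbo.Members (
        MemberId  int IDENTITY PRIMARY KEY,
        RowGuid   uniqueidentifier ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
        Picture   varbinary(max) FILESTREAM NULL  -- stored as a file in the FILESTREAM store
    );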

A better option: store the images on disk and add a nullable field with the file name (path) to the Members table.
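A minimal sketch of that approach (the column name is hypothetical; 260 matches the classic Windows MAX_PATH):

    ALTER TABLE Members ADD PicturePath nvarchar(260) NULL;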

Related

Is there a disadvantage to having large columns in your database?

My database stores user stats on a variety of questions. There is no table of question types, so instead of using a join table on the question types, I've just stored the user stats for each type of question the user has done in a serialized hash-map in the user table. Obviously this has led to some decently sized user rows - the serialized stats for my own user are around 950 characters, and I can imagine them easily growing to 5 kB for power users.
I have never read an example of a column this large in any book. Will performance be greatly hindered by having such large/variable columns in my table? Should I add in a table for question types, and make the user stats a separate table as well?
I am currently using PostgreSQL, if that's relevant.
I've seen this serialized approach on systems like ProcessMaker, which is a web workflow and BPM app and stores its data in a serialized fashion. It performs quite well, but building reports based on this data is really tricky.
You can (and should) normalize your database, which is OK if your information model doesn't change too often.
Otherwise, you may want to try non-relational databases like RavenDB, MongoDB, etc.
The big disadvantage has to do with what happens with a select *. If you have a specific field list, you are not likely to have a big problem but with select * with a lot of TOASTed columns, you have a lot of extra random disk I/O unless everything fits in memory. Selecting fewer columns makes things better.
In an object-relational database like PostgreSQL, database normalization poses different tradeoffs than in a purely relational model. In general it is still a good thing (as I say push the relational model as far as it can comfortably go before doing OR stuff in your db), but it isn't the absolute necessity that you might think of it as being in a purely relational db. Additionally you can add functions to process that data with regexps, extract elements from JSON, etc, and pull those back into your relational queries. So for data that cannot comfortably be normalized, big amorphous "docdb" fields are not that big of a problem.
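As a sketch of that last point, assuming the stats were serialized as JSON in a text column (all names are hypothetical; PostgreSQL 9.3+), you can pull individual elements back into a relational query:

    SELECT user_id,
           (stats::json ->> 'correct_answers')::int AS correct_answers
    FROM   users
    WHERE  (stats::json ->> 'correct_answers')::int > 100;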
Depends on the predominant queries you need:
If you need queries that select all (or most) of the columns, then this is the optimal design.
If, however, you select mostly on a subset of columns, then it might be worth trying to "vertically partition"1 the table, so you avoid I/O for the "unneeded" columns and increase the cache efficiency.2
Of course, all this is under assumption that the serialized data behaves as "black box" from the database perspective. If you need to search or constrain that data in some fashion, then just storing a dummy byte array would violate the principle of atomicity and therefore the 1NF, so you'd need to consider normalizing your data...
1 I.e. move the rarely used columns to a second table, which is in a 1:1 relationship to the original table (see the sketch after these footnotes). If you are using BLOBs, a similar effect can be achieved by declaring what portion of the BLOB should be kept "in-line": the remainder of any BLOB that exceeds that limit is stored in a set of pages separate from the table's "core" pages.
2 DBMSes typically implement caching at the page level, so the wider the rows, the less of them will fit into a single page on disk, and therefore into a single page in cache.
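A sketch of the vertical partition from footnote 1, in PostgreSQL terms (all names are hypothetical):

    -- Hot, frequently-read columns stay in the main table.
    CREATE TABLE users (
        user_id   integer PRIMARY KEY,
        username  text NOT NULL
    );

    -- The bulky serialized stats move to a 1:1 side table,
    -- so scans of users no longer drag the wide column along.
    CREATE TABLE user_stats (
        user_id   integer PRIMARY KEY REFERENCES users(user_id),
        stats     text NOT NULL
    );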
You can't search in serialized arrays.

SQL Server varbinary(max) and varchar(max) data in a separate table

Using SQL Server 2005 Standard Edition with SP2.
I need to design a table where I will be storing a text file (~200 KB) along with a filename, description, and datetime.
Should the varchar(max) and varbinary(max) data be stored in a separate table, or should the LOB columns be part of the main table?
Per this thread
What is the benefit of having varbinary field in a separate 1-1 table?
there are no performance or operational benefits. I agree to some extent; however, I can see two benefits:
the LOB data can be stored in a separate table, which can be placed on a separate filegroup
you cannot rebuild an index ONLINE on a table containing LOB data types
Any suggestions would be appreciated.
I would advise against separation. It complicates the design significantly for little or no benefit. As you probably know, SQL Server already stores LOBs on separate allocation units, as described in Table and Index Organization.
Your first concern (separate filegroup allocation for the LOB data) can be addressed explicitly, as Mikael has already pointed out, by appropriately specifying the desired filegroup in the CREATE TABLE statement.
Your second concern is no longer a concern with SQL Server 2012; see Online Index Operations for Indexes containing LOB columns. Even prior to SQL Server 2012 you could reorganize indexes with LOBs without problems (and REORGANIZE is online; see the example below). Given that a full index rebuild is a very expensive operation (an online rebuild must be done at the table/index level; there are no partition-level online rebuild options), are you sure you want to complicate the design to accommodate something that is, on one hand, seldom required, and on the other hand, will be available when you upgrade to SQL 2012?
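For reference, a reorganize with LOB compaction looks like this (index and table names are hypothetical):

    ALTER INDEX PK_FileStore ON dbo.FileStore
        REORGANIZE WITH (LOB_COMPACTION = ON);  -- always an online operation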
I can answer your question in one simple word: KISS.
Which of course stands for... Keep It Simple, Stupid.
Adding a table is generally a no-no unless you really need one to solve a problem.
Generally, I disagree with splitting tables. It adds complexity to databases and code. Having useless columns in a table is a bad thing, but it's not as bad as multiple tables when you only need one.
Cases where you would consider adding another table:
Some of your columns are BLOBs (greater than page size) that are rarely used, while other columns with small data sizes are accessed frequently.
If you lack a brain.
If you are evil.
Or... if you are trying to piss-off your coworkers.

Best Database to store html-files (or files in general)

What is the best database type (document-oriented, relational, key-value, etc.) to store an HTML file (small sizes, ~max 700 KB) in?
Currently I'm using sqlite3 with Python, but it seems to get pretty slow once the number of entries/files exceeds 3000 (the .db file is about 260 MB by then). Besides that, SQLite is not suited for multiprocessing use cases.
The SQLite schema looks like this:
CREATE TABLE articles (
    url TEXT NOT NULL, published DATETIME, title TEXT,
    fetched TEXT NOT NULL, section TEXT,
    PRIMARY KEY (url),
    FOREIGN KEY (url) REFERENCES contents(url)
);
CREATE TABLE contents (
    url TEXT NOT NULL, date DATETIME, content TEXT,
    PRIMARY KEY (url)
);
CREATE TABLE shares (
    url TEXT NOT NULL, date DATETIME,
    likes INTEGER NOT NULL, totals INTEGER NOT NULL, clicks INTEGER,
    comments INTEGER NOT NULL, share INTEGER NOT NULL, tweets INTEGER NOT NULL,
    PRIMARY KEY (date, url),
    FOREIGN KEY (url) REFERENCES articles(url)
);
And the HTML files go into contents.
For a document-centric database that uses a URL as the primary key, and which also has to support multiple concurrent writers, you might wish to consider one of the noSQL databases over SQLite. There are currently 122 of them listed here.
What does "pretty slow" mean to you? And are you certain the perceived slowness is # the database?
So you think SQLite should be scalable enough in general?
There is no "in general" scenario in the actual world. No, I do not think it would scale well for a document-centric application where the records can be 500K. SQLite is not optimized to scale well in a BUSY MULTIPLE CONCURRENT WRITERS SCENARIO, where "busy" is a multivariable function involving the number of writes per second and the size of the record being written and how many indexes are on the table. In brief, the more disk-intensive (ergo time-consuming) the write operation, the less well it well scale. In other words, the larger the record and/or the more heavily indexed the table is, the fewer writes-per-second can be accommodated. And a 500K record is a very large record indeed. You'd be better served with MVCC.

What is the benefit of having varbinary field in a separate 1-1 table?

I need to store binary files in a varbinary(max) column on SQL Server 2005 like this:
FileInfo
FileInfoId int, PK, identity
FileText varchar(max) (can be null)
FileCreatedDate datetime etc.
FileContent
FileInfoId int, PK, FK
FileContent varbinary(max)
FileInfo has a one to one relationship with FileContent. The FileText is meant to be used when there is no file to upload, and only text will be entered manually for an item. I'm not sure what percentage of items will have a binary file.
Should I create the second table? Would there be any performance improvements with the two-table design? Are there any logical benefits?
I've found this page, but not sure if it applies in my case.
There is no performance or operational advantage. Since SQL 2005 the LOB types are already stored for you by the engine in a separate allocation unit, a separate b-tree. If you study the Table and Index Organization of SQL Server you'll see that every partition has up to 3 allocation units: data, LOB and row-overflow.
A LOB field (varchar(max), nvarchar(max), varbinary(max), XML, CLR UDTs as well as the deprecated types text, ntext and image) will have in the data record itself, in the clustered index, only a very small footprint: a pointer into the LOB allocation unit, see Anatomy of a Record.
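If you want to see those allocation units for yourself, a query along these lines works (dbo.FileInfo stands in for your table; for ordinary tables hobt_id equals partition_id, so joining on partition_id returns all three unit types):

    SELECT p.index_id, au.type_desc, au.total_pages
    FROM   sys.partitions AS p
    JOIN   sys.allocation_units AS au
           ON au.container_id = p.partition_id
    WHERE  p.object_id = OBJECT_ID(N'dbo.FileInfo');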
By storing a LOB explicitly in a separate table you gain absolutely nothing. You just add unneeded complexity as former atomic updates have to distribute themselves now into two separate tables, complicating the application and the application transaction structure.
If the LOB content is an entire file then perhaps you should consider upgrading to SQL Server 2008 and using FILESTREAM.
There is no real logical advantage to this two-table design; since the relationship is 1:1, you might as well have all the info bundled in the FileInfo table. However, there are serious operational and performance advantages, in particular if your binary data is more than a few hundred bytes in size, on average.
EDIT: As pointed out by Remus Rusanu, on some DBMS implementations such as SQL 2005, the large object types are transparently stored in a separate table, effectively alleviating the practical drawback of having big records. The introduction of this feature implicitly confirms the [true] single-table approach's weakness.
I merely scanned the SO posting referenced in this question. While that other posting makes a few valid points, such as intrinsic data integrity (since all CRUD actions on a given item are atomic), on the whole, and except in relatively atypical use cases (such as using the item table as a repository mostly queried for single items at a time), the performance advantage is with the two-table approach (indexes on the "header" table will be more efficient, queries that do not require the binary data will return much more quickly, etc.).
And the two-table approach has further benefits in case the design evolves to supply different types of binary objects in different contexts. For example, say these items are images (GIFs, JPGs, etc.). At a later date you may want to also provide a small preview version of these images (and/or a hi-resolution version), with the choice driven by context (user preference, low-bandwidth clients, subscriber vs. visitor, etc.). In such a case, not only are the operational issues associated with the single-table approach made more acute, but the model also becomes more versatile.
It can help to separate IMAGE, (N)TEXT, (N)VARCHAR(max) and VARBINARY(max) columns out of wider tables purely because of some restrictions in SQL Server.
For example, before 2012 it was not possible to rebuild a clustered index online if the table contained LOBs. On the other hand, you might not care about those restrictions, in which case setting up the table the way your data is actually related is the better thing to do.
In case you want to physically keep the LOB data out of the table's allocation unit, you can still set the "large value types out of row" table option.
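For example, using the FileInfo table from the question (when enabled, the option stores all large value types off-row, leaving only a pointer in the row):

    EXEC sp_tableoption 'dbo.FileInfo', 'large value types out of row', 1;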

Performance overhead of adding a BLOB field to a table

I am trying to decide between two possible implementations and am eager to choose the best one :)
I need to add an optional BLOB field to a table which currently has only 3 simple fields. It is predicted that the new field will be used in fewer than 10%, maybe even less than 5%, of cases, so it will be null for most rows; in fact most of our customers will probably never have any BLOB data in there.
A colleague's first inclination was to add a new table to hold just the BLOBs, with a (nullable) foreign key in the first table. He predicts this will have performance benefits when querying the first table.
My thoughts were that it is more logical and easier to store the BLOB directly in the original table. None of our queries do SELECT * from that table so my intuition is that storing it directly won't have a significant performance overhead.
I'm going to benchmark both choices but I was hoping some SQL gurus had any advice from experience.
Using MSSQL and Oracle.
For MSSQL, the blobs will be stored on separate pages in the database, so they should not affect performance when the column is null.
If you use the IMAGE data type then the data is always stored out of the row.
If you use the varbinary(max) data type then if the data is > 8kb it is stored outside the row, otherwise it may be stored in the row depending on the table options.
If you only have a few rows with blobs the performance should not be affected.
