Performance loss storing JSON in SQL Server database - sql-server

Currently I am storing JSON in my database as VARCHAR(max) that contains some transformations. One of our techs is asking to store the original JSON it was transformed from.
I'm afraid that if I add another JSON column it is going to bloat the page size and lead to slower access times. On the other hand this table is not going to be real big (about 100 rows max with each JSON column taking 4-6 K bytes) and could get accessed as much as 4 or 5 times a minute.
Am I being a stingy gatekeeper mercilessly abusing our techs or a sagacious architect keeping the system scalable?
Also, I'm curious about the (relatively) new filestream/BLOBs type. From what I've read I get the feeling that BLOBs are stored in some separate place such that relational queries aren't slowed down at all. Would switching varchar to filestream help?

Generally, BLOB storage is preferred when the objects being stored are, on average, larger than 1 MB.
I think you should be good keeping them in the same database. 100 rows is not much for a database.
Also, what is the use case for keeping the original as well as the transformed JSON? If the original JSON is not going to be used as part of normal processing and just needs to be kept for reference, I would suggest keeping a separate table, dumping the original JSON there with a reference key, and using the original only when needed.
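A minimal sketch of that layout — all table and column names here are placeholders, not from the asker's schema:

```sql
-- Separate table holding the untransformed JSON, keyed back to the main row.
CREATE TABLE dbo.TransformedDocs (
    DocId           INT IDENTITY PRIMARY KEY,
    TransformedJson VARCHAR(MAX) NOT NULL
);

CREATE TABLE dbo.OriginalDocs (
    DocId        INT PRIMARY KEY
        REFERENCES dbo.TransformedDocs (DocId),
    OriginalJson VARCHAR(MAX) NOT NULL
);

-- Normal processing touches only TransformedDocs; the original is
-- fetched explicitly (here @DocId is a parameter) when someone needs it:
SELECT o.OriginalJson
FROM dbo.OriginalDocs AS o
WHERE o.DocId = @DocId;
```

This keeps the hot table's rows narrow while still letting you join back to the original on demand.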

Your use case doesn't sound too demanding. 4-6 KB and fewer than 100, or even 1,000 rows for that matter, is still pretty light. Though I know the expected use case almost never ends up being the actual use case. If people use the table for things other than the JSON field, you might not want them pulling back the JSON because of the potential size and unnecessary bloat.
Good thing SQL Server has some less complex options to help us out: https://msdn.microsoft.com/en-us/library/ms173530(v=sql.100).aspx
I would suggest looking at the table option Large Value Types out of Row, as it is forward compatible and the text in row option is deprecated. Essentially these options store those large text fields off of the primary page, allowing the core data to live where it needs to live and the extra STUFF to have a different home.
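Enabling that option is a one-liner; the table name below is a placeholder:

```sql
-- Push VARCHAR(MAX)/NVARCHAR(MAX) values off the in-row data pages so
-- scans of the other columns stay narrow.
EXEC sp_tableoption
    @TableNamePattern = N'dbo.TransformedDocs',
    @OptionName       = 'large value types out of row',
    @OptionValue      = 'ON';
```

Note that the option only affects values written after it is set; rows that already exist keep their in-row storage until they are updated.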

Related

Is there a disadvantage to having large columns in your database?

My database stores user stats on a variety of questions. There is no table of question types, so instead of using a join table on the question types, I've just stored the user stats for each type of question the user has done in a serialized hash-map in the user table. Obviously this has led to some decently sized user rows - the serialized stats for my own user is around 950 characters, and I can imagine them easily growing to 5 kb on power users.
I have never read an example of a column this large in any book. Will performance be greatly hindered by having such large/variable columns in my table? Should I add in a table for question types, and make the user stats a separate table as well?
I am currently using PostgreSQL, if that's relevant.
I've seen this serialized approach on systems like ProcessMaker, which is a web workflow and BPM app and stores its data in a serialized fashion. It performs quite well, but building reports based on this data is really tricky.
You can (and should) normalize your database, which is OK if your information model doesn't change so often.
Otherwise, you may want to try non-relational databases like RavenDB, MongoDB, etc.
The big disadvantage has to do with what happens with a select *. If you have a specific field list, you are not likely to have a big problem but with select * with a lot of TOASTed columns, you have a lot of extra random disk I/O unless everything fits in memory. Selecting fewer columns makes things better.
In an object-relational database like PostgreSQL, database normalization poses different tradeoffs than in a purely relational model. In general it is still a good thing (as I say push the relational model as far as it can comfortably go before doing OR stuff in your db), but it isn't the absolute necessity that you might think of it as being in a purely relational db. Additionally you can add functions to process that data with regexps, extract elements from JSON, etc, and pull those back into your relational queries. So for data that cannot comfortably be normalized, big amorphous "docdb" fields are not that big of a problem.
Depends on the predominant queries you need:
If you need queries that select all (or most) of the columns, then this is the optimal design.
If, however, you select mostly on a subset of columns, then it might be worth trying to "vertically partition"1 the table, so you avoid I/O for the "unneeded" columns and increase the cache efficiency.2
Of course, all this is under assumption that the serialized data behaves as "black box" from the database perspective. If you need to search or constrain that data in some fashion, then just storing a dummy byte array would violate the principle of atomicity and therefore the 1NF, so you'd need to consider normalizing your data...
1 I.e. move the rarely used columns to a second table, which is in 1:1 relationship to the original table. If you are using BLOBs, similar effect could be achieved by declaring what portion of the BLOB should be kept "in-line" - the remainder of any BLOB that exceeds that limit will be stored to a set of pages separate from the table's "core" pages.
2 DBMSes typically implement caching at the page level, so the wider the rows, the less of them will fit into a single page on disk, and therefore into a single page in cache.
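Footnote 1's vertical partition could look like this in PostgreSQL — the names are hypothetical, matching the question's scenario:

```sql
-- Hot, frequently queried columns stay in the main table...
CREATE TABLE users (
    user_id SERIAL PRIMARY KEY,
    name    TEXT NOT NULL,
    email   TEXT NOT NULL
);

-- ...while the bulky serialized stats move to a 1:1 side table.
CREATE TABLE user_stats (
    user_id INT PRIMARY KEY REFERENCES users (user_id),
    stats   TEXT NOT NULL    -- the serialized hash-map
);

-- Queries that don't need the stats never touch the wide rows:
SELECT u.name, u.email
FROM users AS u
WHERE u.user_id = 42;
```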
You can't search in serialized arrays.

Can the size of a column in the database slow a query?

I have a table with a column that contains HTML content and is relatively large compared to the other columns.
Can having a column of that size slow down queries on this table?
Do I need to put these big fields in another table?
The TOAST technique should handle this for you: beyond a given size, the values will be transparently stored in a _toast table, and some internal things are done to avoid slowing down your queries (check the given link).
But of course, if you always retrieve the whole content, you'll lose time in the retrieval. And it's also clear that queries on this table that don't use this column won't suffer from its size.
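In practice that means listing columns instead of SELECT *, and optionally telling PostgreSQL how to store the big column. A sketch, with invented table/column names:

```sql
-- Force the html column out-of-line (TOASTed but uncompressed);
-- EXTERNAL trades disk space for faster substring access.
ALTER TABLE pages ALTER COLUMN html SET STORAGE EXTERNAL;

-- Cheap: the TOASTed html value is never fetched.
SELECT page_id, title, updated_at FROM pages;

-- Expensive: every row drags its full HTML payload along.
SELECT * FROM pages;
```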
The bigger the database the slower the queries. Always.
It's likely that if you have large column, there is going to be more disk I/O since caching the column itself takes more space. However, putting these in a different table won't likely alleviate this issue (other than the issue below). When you don't explicitly need the actual HTML data, be sure not to SELECT it.
Sometimes the ordering of the columns can matter because of the way rows are stored. If you're really worried about it, store it as the last column so it doesn't get paged in when selecting other columns.
You would have to look at how Postgres internally stores things to see if you need to split this out, but a very large field could cause the way the data is stored on disk to be broken up, which adds to the time it takes to access it.
Further, returning 10,000 bytes of data versus 100 bytes for one record is clearly going to be slower, and the more records, the slower. If you are doing SELECT * this is clearly a problem, especially if you usually do not need the HTML.
Another consideration could be putting the HTML information in a NoSQL database. This kind of document information is what they excel at. There's no reason you can't use both a relational database for some info and a NoSQL database for other info.

Database storage requirements and management for lots of numerical data

I'm trying to figure out how to manage and serve a lot of numerical data. Not sure an SQL database is the right approach. Scenario as follows:
10000 sets of time series data collected per hour
5 floating point values per set
About 5000 hours worth of data collected
So that gives me about 250 million values in total. I need to query this set of data by set ID and by time. If possible, also filter by one or two of the values themselves. I'm also continuously adding to this data.
This seems like a lot of data. Assuming 4 bytes per value, that's about 1 GB. I don't know what a general "overhead multiplier" for an SQL database is. Let's say it's 2; then that's 2 GB of disk space.
What are good approaches to handling this data? Some options I can see:
Single PostgreSQL table with indices on ID, time
Single SQLite table -- this seemed to be unbearably slow
One SQLite file per set -- lots of .sqlite files in this case
Something like MongoDB? Don't even know how this would work ...
Appreciate commentary from those who have done this before.
Mongo is a document store; it might work for your data, but I don't have much experience with it.
I can tell you that PostgreSQL will be a good choice. It will be able to handle that kind of data. SQLite is definitely not optimized for those use cases.
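A possible PostgreSQL layout for the numbers described above — names are illustrative, and REAL is 4 bytes, matching the question's size estimate:

```sql
-- 5 floats per reading, keyed by set and time.
CREATE TABLE readings (
    set_id INT         NOT NULL,
    ts     TIMESTAMPTZ NOT NULL,
    v1 REAL, v2 REAL, v3 REAL, v4 REAL, v5 REAL,
    PRIMARY KEY (set_id, ts)
);

-- The composite primary key already serves "by set ID and by time":
SELECT ts, v1, v2
FROM readings
WHERE set_id = 1234
  AND ts BETWEEN '2012-01-01' AND '2012-01-02';

-- Filtering on a value benefits from an extra index if that's common:
CREATE INDEX readings_v1_idx ON readings (v1);
```

Continuous appends fit this shape well, since new rows arrive in roughly (set_id, ts) order.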

How do databases perform on dense data?

Suppose you have a dense table with an integer primary key, where you know the table will contain 99% of all values from 0 to 1,000,000.
A super-efficient way to implement such a table is an array (or a flat file on disk), assuming a fixed record size.
Is there a way to achieve similar efficiency using a database?
Clarification - When stored in a simple table / array, access to entries are O(1) - just a memory read (or read from disk). As I understand, all databases store their nodes in trees, so they cannot achieve identical performance - access to an average node will take a few hops.
Perhaps I don't understand your question, but a database is designed to handle data. I work with databases all day long that have millions of rows. They are efficient enough.
I don't know what your definition of "achieve similar efficiency using a database" means. In a database (from my experience), what exactly you are trying to do matters for performance.
If you simply need a single record based on a primary key, then the database should be naturally efficient enough, assuming it is properly structured (for example, 3NF).
Again, you need to design your database to be efficient for what you need. Furthermore, consider how you will write queries against the database in a given structure.
In my work, I've been able to cut query execution time from >15 minutes to 1 or 2 seconds simply by optimizing my joins, the where clause and overall query structure. Proper indexing, obviously, is also important.
Also, consider the database engine you are going to use. I've been assuming SQL Server or MySQL, but those may not be right. I've heard (but have never tested the idea) that SQLite is very quick - faster than either of the aforementioned. There are also many other options, I'm sure.
Update: Based on your explanation in the comments, I'd say no - you can't. You are asking about mechanisms designed for two completely different things. A database persists data over a long period of time and is usually optimized for many connections and data reads/writes. In your description, the data in an array, in memory, is for a single program to access, and that program owns the memory. It's not (usually) shared. I do not see how you could achieve the same performance.
Another thought: The absolute closest thing you could get to this, in SQL Server specifically, is using a table variable. A table variable (in theory) is held in memory only. I've heard people refer to table variables as SQL Server's "array". Any regular table write or create statement prompts the RDBMS to write to disk (I think, first to the log and then to the data files). And large data reads can also cause the DB to write to private temp tables to store data for later use.
There is not much you can do to specify how data will be physically stored in database. Most you can do is to specify if data and indices will be stored separately or data will be stored in one index tree (clustered index as Brian described).
But in your case this does not matter at all because of:
All databases heavily use caching. 1,000,000 records can hardly exceed 1 GB of memory, so your complete database will quickly end up in the database cache.
If you are reading a single record at a time, the main overhead you will see is accessing data over the database protocol. The process goes something like this:
connect to database - open communication channel
send SQL text from application to database
database analyzes SQL (parse SQL, checks if SQL command is previously compiled, compiles command if it is first time issued, ...)
database executes the SQL. After a few executions, data from your example will be cached in memory, so execution will be very fast.
database packs fetched records for transport to application
data is sent over communication channel
database component in application unpacks received data into some dataset representation (e.g. ADO.Net dataset)
In your scenario, executing the SQL and finding the records takes very little time compared to the total time needed to get the data from the database to the application. Even if you could force the database to store data in an array, there would be no visible gain.
If you've got a decent number of records in a DB (and 1MM is decent, not really that big), then indexes are your friend.
You're talking about old fixed record length flat files. And yes, they are super-efficient compared to databases, but like structure/value arrays vs. classes, they just do not have the kind of features that we typically expect today.
Things like:
searching on different columns/combinations
variable length columns
nullable columns
editability
restructuring
concurrency control
transaction control
etc., etc.
Create a table with an ID column and a bit column. Use a clustered index for the ID column (the ID column is your primary key). Insert all 1,000,000 elements (do so in order or it will be slow). This is somewhat inefficient in terms of space (you're using O(n log n) space instead of O(n)).
I don't claim this is efficient, but it will be stored in a similar manner to how an array would have been stored.
Note that the ID column can be marked as an identity/counter column in most DB systems, in which case you can just insert 1,000,000 items and it will do the counting for you. I am not sure whether such a DB avoids explicitly storing the counter's value, but if it does then you'd only end up using O(n) space.
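A sketch of that layout in SQL Server syntax (names are placeholders):

```sql
-- The primary key is clustered by default in SQL Server, so rows are
-- physically ordered by id -- the closest relational analogue of an array.
CREATE TABLE dense (
    id  INT IDENTITY (0, 1) PRIMARY KEY CLUSTERED,
    val BIT NOT NULL
);

-- Point lookup: a short B-tree descent rather than O(1) arithmetic,
-- but with the pages cached the difference is hard to measure.
SELECT val FROM dense WHERE id = 123456;
```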
When your primary key is an integer sequence, it can be a good idea to have a reverse-key index. This makes sure that contiguous values are spread apart in the index tree.
However, there is a catch - with reverse indexes you will not be able to do range searches.
The big question is: efficient for what?
For Oracle, ideas might include:
read access by id: index organized table (this might be what you are looking for)
insert only, no update: no indexes, no spare space
read access full table scan: compressed
high concurrent write when id comes from a sequence: reverse index
For the actual question, precisely as asked: write all rows into a single BLOB (the table contains one column and one row). You might be able to access this like an array, but I am not sure, since I don't know what operations are possible on BLOBs. Even if it works, I don't think this approach would be useful in any realistic scenario.
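For reference, the first Oracle idea in the list above (an index-organized table) looks like this; the column names are invented:

```sql
-- All columns live in the primary-key B-tree itself: no separate heap
-- segment, so a lookup by id is a single index descent.
CREATE TABLE dense_iot (
    id  NUMBER PRIMARY KEY,
    val NUMBER
)
ORGANIZATION INDEX;
```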

Database data upload design question

I'm looking for some design help here.
I'm doing work for a client that requires me to store data about their tens of thousands of employees. The data is being given to me in Excel spreadsheets, one for each city/country in which they have offices.
I have a database that contains a spreadsheets table and a data table. The data table has a column spreadsheet_id which links it back to the spreadsheets table so that I know which spreadsheet each data row came from. I also have a simple shell script which uploads the data to the database.
So far so good. However, there's some data missing from the original spreadsheets, and instead of giving me just the missing data, the client is giving me a modified version of the original spreadsheet with the new data appended to it. I cannot simply overwrite the original data since the data was already used and there are other tables that link to it.
The question is - how do I handle this? It seems to me that I have the following options:
Upload the entire modified spreadsheet, and mark the original as 'inactive'.
PROS: It's simple, straightforward, and easily automated.
CONS: There's a lot of redundant data being stored in the database unnecessarily, especially if the spreadsheet changes numerous times.
Do a diff on the spreadsheets and only upload the rows that changed.
PROS: Less data gets loaded into the database.
CONS: It's at least partially manual, and therefore prone to error. It also means that the database will no longer tell the entire story - e.g. if some data is missing at some later date, I will not be able to authoritatively say that I never got the data just by querying the database. And will doing diffs continue working even if I have to do it multiple times?
Write a process that compares each spreadsheet row with what's in the database, inserts the rows that have changed data, and sets the original data row to inactive. (I have to keep track of the original data also, so I can't overwrite it.)
PROS: It's automated.
CONS: It will take time to write and test such a process, and it will be very difficult for me to justify the time spent doing so.
I'm hoping to come up with a fourth and better solution. Any ideas as to what that might be?
If you have no way to be 100% certain you can avoid human error in option 2, don't do it.
Option 3: It should not be too difficult (or time-consuming) to write a VBA script that does the comparison for you. VBA is not fast, but you can let it run overnight. It should not take more than one or two hours to get it running error-free.
Option 1: This would be my preferred approach: Fast, simple, and I can't think of anything that could go wrong right now. (Well, you should first mark the original as 'inactive', then upload the new data set IMO). Especially if this can happen more often in the future, having a stable and fast process to deal with it is important.
If you are really worried about all the inactive entries, you can also delete them after your update (delete from spreadsheets where status='inactive' or somesuch). But so far, all databases I have seen in my work had lots of those. I wouldn't worry too much about it.
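Option 1, with the inactive-first ordering suggested above, might look like this — the schema is guessed from the question, and the filename and variable names are hypothetical:

```sql
-- Step 1: retire the old upload first, so a failed load never leaves
-- two active copies of the same office's spreadsheet.
UPDATE spreadsheets
SET    status = 'inactive'
WHERE  spreadsheet_id = @old_spreadsheet_id;

-- Step 2: register the replacement, then bulk-load its rows into the
-- data table with the new spreadsheet_id.
INSERT INTO spreadsheets (filename, status, uploaded_at)
VALUES ('berlin_v2.xlsx', 'active', CURRENT_TIMESTAMP);

-- Optional cleanup later, once nothing references the old rows:
-- DELETE FROM data WHERE spreadsheet_id = @old_spreadsheet_id;
-- DELETE FROM spreadsheets WHERE status = 'inactive';
```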
