database design and performance - database

I have some questions regarding database performance in general. I'm using SQLite, but I assume that the performance remarks apply to all relational databases?
I have a database that contains a table that stores data for about 200 variables. I write about 50 variables per second to the table. A written variable contains the id of the variable, a value and a timestamp. Reading is done very rarely but needs to be as fast as possible to get the data per variable in chronological order. When I do a query I always just need the data of 1 variable.
How do I design the database so the reading is as fast as possible:
1. I make 1 table that contains all the variables. The variable is stored as an id. I index the table on the id and timestamp. The bad part is that the index makes the writes slower.
2. I make 200 tables, one for each variable, and index the timestamp.
I think the second solution is the most performant, but creating a table for each variable doesn't seem right. Can someone give me some advice?
Thanks

If you really want to use a database, use the first approach, but make sure you are inserting your data in a single transaction; benchmarks show it makes writing much faster.
Are your searches performed on variable name/id AND timestamp, or on variable name only? Indexing on timestamp may not be necessary...
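As a rough illustration of the single-table approach with batched writes, here is a minimal sketch using Python's standard sqlite3 module. The table name samples, the column names, and the helper functions are made up for this example. It assumes a composite index on (variable_id, ts), which serves the "one variable, chronological order" read described in the question, and it wraps each batch of inserts in one transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # hypothetical in-memory database for the demo
conn.execute(
    "CREATE TABLE samples ("
    " variable_id INTEGER NOT NULL,"
    " ts REAL NOT NULL,"
    " value REAL NOT NULL)"
)
# Composite index: equality lookup on variable_id, rows then come back ordered by ts.
conn.execute("CREATE INDEX idx_samples_var_ts ON samples (variable_id, ts)")


def write_batch(conn, rows):
    """Insert many (variable_id, ts, value) rows inside one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT INTO samples (variable_id, ts, value) VALUES (?, ?, ?)", rows
        )


def read_variable(conn, variable_id):
    """Return (ts, value) pairs for one variable in chronological order."""
    cur = conn.execute(
        "SELECT ts, value FROM samples WHERE variable_id = ? ORDER BY ts",
        (variable_id,),
    )
    return cur.fetchall()


# Two seconds' worth of data: 50 variables per second, one transaction per batch.
now = 1000.0
write_batch(conn, [(var_id, now, float(var_id)) for var_id in range(50)])
write_batch(conn, [(var_id, now + 1.0, float(var_id) * 2) for var_id in range(50)])

history = read_variable(conn, 7)
```

Whether the index cost matters at 50 rows per second is something to measure on the real workload rather than assume.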

Are you sure you need a database? By the sounds of it, a flat-file will work well enough for you, and you don't sound like you actually need any of the trappings of a database. Just create a flat-file for each variable and keep handles to each open. Write to them through your standard buffered IO as often as you need. To read, just open one file and deserialize.

If you are using a relational database, I am guessing those variables are all related? If they are just values, for instance, settings, then maybe a file or something similar may be better.
If you only ever have to query values for ONE variable, then, if you insist on using a database (which may not be a bad thing!), then you should create one table per variable:
id (unsigned int, auto-increment, primary key)
timestamp (datetime)
variable (whatever it is supposed to be)
Do not skimp on data just because "it might take more room on the hard drive" - that only leads to trouble.

Related

MS SQL: What is more efficient? Using a junction table or storing everything in a varchar?

Here is a simple question to which I would like an answer:
We have a member table. Each member practices one, many or no sports. Initially we (the developers) created a [member] table, a [sports] table and a [member_sports] table, just as we have always done.
However our client here doesn't like this and wants to store all the sports that the member practices in a single varchar column, separated with a special character.
So if:
1 is football
2 is tennis
3 is ping-pong
4 is swimming
and I like swimming and ping-pong, my favourite sports will be stored into the varchar column as:
x3,x4
Now we don't want to just walk up to the client and claim that his system isn't right. We would like to back it up with proof that the operation to fetch the sports from [member_sports] is more efficient than simply storing the fields as a varchar.
Is there any documentation that can back our claims? Help!
Ask your client if they care about storing accurate information [1] rather than random strings.
Then set them a series of challenges. First, ensure that the sport information is in the correct "domain". For the member_sports table, that is:
sport_id int not null
         ^
         |--correct type
For their "store everything in a varchar column" solution, I guess you're writing a CHECK constraint. A regex would probably help here but there's no native support for regex in SQL Server - so you're either bodging it or calling out to a CLR function to make sure that only actual int values are stored.
Next, we not only want to make sure that the domain is correct but that the sports are actually defined in your system. For member_sports, that's:
CONSTRAINT FK_Member_Sports_Sports FOREIGN KEY (Sport_ID) references Sports (Sport_ID)
For their "store everything in a varchar column" I guess this is going to be a far more complex CHECK constraint using UDFs to query other tables. It's going to be messy and procedural. Plus if you want to prevent a row from being removed from sports while it's still referenced by any member, you're talking about a trigger on the sports table that has to query every row in members [2].
Finally, let's say that it's meaningless for the same sport to be recorded for a single member multiple times. For member_sports, that is (if it's not the PK):
CONSTRAINT UQ_Member_Sports UNIQUE (Member_ID,Sport_ID)
For their "store everything in a varchar column" it's another horrifically procedural UDF called from a CHECK constraint.
Even if the varchar variant performed better for certain values of "performs better" (unlikely, since you need to rip strings apart and T-SQL's string manipulation functions are notoriously weak; see above re: regex), how do they propose to ensure that the data is meaningful and not nonsense?
Writing the procedural variants that can also cope with nonsense is an even more challenging endeavour.
In case it's not clear from the above - I am a big fan of Declarative Referential Integrity (DRI). Stating what you want versus focussing on mechanisms is a huge part of why SQL appeals to me. You construct the right DRI and know that your data is always correct (or, at least, as you expect it to be)
[1] "The application will always do this correctly" isn't a good answer. If you manage to build an application and related database in which nobody ever writes some direct SQL to fix something, I guess you'll be the first.
But in most circumstances, there's always more than one application, and even if the other application is a direct SQL client only employed by developers, you're already beyond being able to trust that the application will always act correctly. And bugs in applications are far more likely than bugs in a SQL database engine's implementation of constraints, which has been tested far more times than any individual application's attempt to enforce constraints.
[2] Let alone the far more likely query: find all members who are associated with a particular sport. A second index on member_sports makes this a trivial query [3]. No indexes help the "it's somewhere in this string" solution, and you're looking at a table scan with no indexing opportunities.
[3] Any index that has sport_id first should be able to satisfy such a query.
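To make the declarative constraints above concrete, here is a small sketch. It uses SQLite through Python's sqlite3 module as a stand-in for SQL Server, so the syntax and typing rules differ from T-SQL. The index name, the sample member, and the helper functions are invented for the illustration. It shows the foreign key rejecting an unknown sport, the UNIQUE constraint rejecting a duplicate, and the engine refusing to delete a sport that is still referenced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
conn.executescript("""
CREATE TABLE member (member_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE sports (sport_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE member_sports (
    member_id INTEGER NOT NULL REFERENCES member (member_id),
    sport_id  INTEGER NOT NULL REFERENCES sports (sport_id),
    UNIQUE (member_id, sport_id)
);
-- Index with sport_id first, for "who practices sport X?" queries.
CREATE INDEX idx_member_sports_sport ON member_sports (sport_id, member_id);
INSERT INTO member VALUES (1, 'Ann');
INSERT INTO sports VALUES (1, 'football'), (2, 'tennis'), (3, 'ping-pong'), (4, 'swimming');
INSERT INTO member_sports VALUES (1, 3), (1, 4);
""")


def try_insert(member_id, sport_id):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO member_sports VALUES (?, ?)", (member_id, sport_id)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def try_delete_sport(sport_id):
    """Return True if the sport was deleted, False if it is still referenced."""
    try:
        with conn:
            conn.execute("DELETE FROM sports WHERE sport_id = ?", (sport_id,))
        return True
    except sqlite3.IntegrityError:
        return False


accepted_new = try_insert(1, 2)         # tennis exists and is not yet recorded
accepted_unknown = try_insert(1, 99)    # no such sport: foreign key violation
accepted_duplicate = try_insert(1, 3)   # already recorded: unique violation
deleted_in_use = try_delete_sport(4)    # swimming is still referenced by Ann

swimmers = conn.execute(
    "SELECT member_id FROM member_sports WHERE sport_id = ?", (4,)
).fetchall()
```

None of these checks required procedural code; each is a one-line declaration in the schema.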

Using dependent types to provide a compile-time proof that some integer is a valid row id in a database?

In my never-ending wandering in dependent type land, a strange idea came into my head. I do a lot of database programming, and it would be nice if I could get rid of all that sanity checking and validity checking. One especially annoying case is those functions that accept an Integer and expect it to be a valid row id of some certain table. A very silly example is:
function loadStudent(studentId: Integer) : Student
Supposing my language of choice supports dependent types in their full glory, would it be possible to utilize the type system to make loadStudent accept only valid studentId values :
function loadStudent(studentId : ValidRowId("students_table") ) : Student
If yes, how do I write a data constructor for ValidRowId type? All the examples I have seen thus far were pure (no IO involved).
Maybe I'm misunderstanding the question, but I don't see how it's possible without doing IO. How can you know that an id is valid without searching the database to see if there is a record with that id?
I suppose that you could, at program start up time, read all the current IDs into a table in memory and then do your checks against that. But you would have to somehow know if another user had added or deleted records after you created the table.
Okay, you could have all threads on all computers that access the database communicate with some central server that keeps this master list so that it would always be current. But we already have a central place that keeps track of all the IDs currently in use in the database: it's called "the database". What would be the advantage of going to a whole bunch of work to maintain a duplicate copy of a subset of the data on the database? It's unlikely you'd get much performance gain, and you'd create the possibility that bugs in your code, bad connections, etc, would result in the data getting out of sync.
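The point about IO can be illustrated without dependent types. The sketch below is in Python with SQLite, not the asker's hypothetical language; ValidStudentId, check_student_id, and the table layout are invented for the example. It uses a "smart constructor": the only intended way to obtain the wrapper is a function that queries the database. Even so, the wrapper only proves the id existed when it was checked, which is the staleness problem described above.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ValidStudentId:
    """An id that was present in students_table at the moment it was checked."""
    value: int


def check_student_id(conn, candidate: int) -> Optional[ValidStudentId]:
    """Smart constructor: performs IO, returns None when the id does not exist."""
    row = conn.execute(
        "SELECT 1 FROM students_table WHERE id = ?", (candidate,)
    ).fetchone()
    return ValidStudentId(candidate) if row else None


def load_student(conn, student_id: ValidStudentId):
    """Still returns None if the row disappeared after the check."""
    return conn.execute(
        "SELECT id, name FROM students_table WHERE id = ?", (student_id.value,)
    ).fetchone()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students_table (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
with conn:
    conn.execute("INSERT INTO students_table VALUES (1, 'Ann')")

checked = check_student_id(conn, 1)
unknown = check_student_id(conn, 2)
first_load = load_student(conn, checked)

# Another user deletes the row after the check; the "proof" is now stale.
with conn:
    conn.execute("DELETE FROM students_table WHERE id = 1")
second_load = load_student(conn, checked)
```

So the type narrows where validation happens, but it cannot remove the need to handle a missing row.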

is delimiting data in a database field ok

Is delimiting data in a database field something that would be ok to do?
Something like
create table column_names (
id int identity (1,1) PRIMARY KEY,
column_name varchar(5000)
);
and then storing data in it as follows
INSERT INTO column_names (column_name) VALUES ('stocknum|name|price');
No. This is bad:
in order to create new queries you have to track down how things are stored.
queries that join on price or name or stocknum are going to be nasty
the database can't assign data types to the data or validate it
you can't create constraints on any of this data now
Basically you're subverting the RDBMS' scheme for handling things and making up your own, so you're limiting how much the RDBMS tools can help you and you've made the system harder to understand for new people.
The only possible advantage of this kind of system that I can think of is that it can serve as a workaround to avoid dealing with a totally impossible DBA who vetoes all schema changes regardless of merit. Which can happen, unfortunately.
Of course there's an exception to everything. I'm currently on a project with audit-logging requirements that are pretty stringent. The logging is done to a database, and we're using delimited fields for storing the data because the application is never going to interact with it; it gets written once and left alone.
Almost certainly not.
It violates principles of normalization. The data stored in a particular row of a particular column should be atomic-- you shouldn't be able to parse the data into smaller component parts.
It makes it substantially more difficult to get acceptable performance. Every piece of code that queries this table will need to know how to parse the data which is generally going to mean that more data needs to be read off disk and potentially sent over the network to the client. Every query that has to parse this data is going to have to be more complex which tends to cause grief for the query optimizer. Concatenated data cannot generally be indexed effectively for searches-- you'd have to do something like a full-text index with custom delimiters rather than a nice standard index on a character string. And if you ever have to update one of the delimited values (i.e. because a product name changes), those updates are going to have to scan every row in the table, parse the data, decide whether to actually update the row, and then update a ton of rows.
It makes the application much more brittle. What happens when someone decides to include a | character in the name attribute, for example? Even if you specify an optional enclosure in the spec (i.e. | is allowed if the entire token is enclosed in double quotes), what fraction of the bits of code that actually parse this column are going to implement and test that correctly?
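The brittleness is easy to reproduce. In this sketch (Python, with SQLite standing in for whatever engine is used; parse_packed and the product table are invented for the illustration), a naive parser for the 'stocknum|name|price' layout fails as soon as a name contains the delimiter, while the same value stored in its own column round-trips unchanged.

```python
import sqlite3


def parse_packed(packed):
    """Split a 'stocknum|name|price' string; raises ValueError on a wrong field count."""
    stocknum, name, price = packed.split("|")
    return {"stocknum": stocknum, "name": name, "price": float(price)}


ok = parse_packed("A100|Widget|9.99")

try:
    parse_packed("A101|Nuts|Bolts|4.50")  # the name itself contains the delimiter
    packed_survived = True
except ValueError:
    packed_survived = False

# With one column per attribute, the delimiter character is just data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product ("
    " stocknum TEXT PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " price REAL NOT NULL)"
)
with conn:
    conn.execute("INSERT INTO product VALUES (?, ?, ?)", ("A101", "Nuts|Bolts", 4.50))

row = conn.execute(
    "SELECT name, price FROM product WHERE stocknum = ?", ("A101",)
).fetchone()
```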

what's best way to leave empty database cells?

I'm not that experienced with databases. If I have a database table containing a lot of empty cells, what's the best way to leave them (e.g. so performance isn't degraded, memory is not consumed, if this is even possible)?
I know there's a "null" value. Is there a "none" value or equivalent that has no drawbacks? Or by just not filling the cell, it's considered empty, so there's nothing left to do? Sorry if it's a silly question. Sometimes you don't know what you don't know...
Not trying to get into a discussion of normalizing the database. Just wondering what the conventional wisdom is for blank/empty/none cells.
Thanks
The convention is to use null to signify a missing value. That's the purpose of null in SQL.
Noted database researcher C. J. Date writes frequently about his objections to the handling of null in SQL at a logical level, and he would say any column that may be missing belongs in a separate table, so that the absence of a row corresponds to a missing value.
I'm not aware of any serious efficiency drawbacks of using null. The efficiency of any feature depends on the specific database implementation you use. You haven't said if you use MySQL, Oracle, Microsoft SQL Server, or other.
MySQL's InnoDB storage engine, for example, doesn't store nulls among the columns of a row; it just stores the non-null columns. Other databases may do this differently. Likewise nulls in indexes should be handled efficiently, but it varies from product to product.
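For readers new to NULL, a short sketch of how it behaves in queries may help. This uses SQLite via Python's sqlite3 module; the person table and its rows are invented for the example, and other engines can differ in details (for instance in how they treat empty strings). It shows that a missing value is found with IS NULL rather than = NULL, that an empty string is a different thing from NULL here, and that COUNT(column) skips NULLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL, phone TEXT)"
)
with conn:
    conn.executemany(
        "INSERT INTO person (name, phone) VALUES (?, ?)",
        [("Ann", "555-0100"), ("Bob", None), ("Cy", "")],  # None is stored as NULL
    )

# IS NULL finds the missing value; the empty string is not NULL in SQLite.
missing = [
    name
    for (name,) in conn.execute("SELECT name FROM person WHERE phone IS NULL")
]

# Comparing with = NULL never matches, because the comparison itself yields NULL.
equals_null = conn.execute("SELECT name FROM person WHERE phone = NULL").fetchall()

# COUNT(*) counts rows, COUNT(phone) counts only non-NULL phone values.
total_rows, phones_present = conn.execute(
    "SELECT COUNT(*), COUNT(phone) FROM person"
).fetchone()
```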
Use NULL. That's what it's for.
Normally databases are said to have rows and columns. If the column does not require a value, it holds nothing (aka NULL) until it is updated with a value. That is best practice for most databases, though not all databases have the NULL value--some use an empty string, but they are the exception.
With regard to space utilization -- disk is relatively inexpensive these days, so worries about space consumption are no longer as prevalent as they once were, except in gargantuan databases, perhaps. You can get better performance out of a database if you use all fixed-size datatypes, but once you start allowing variable-sized string types (e.g. varchar, nvarchar), that optimization is no longer possible.
In brief, don't worry about performance for the time being, at least until you get your feet wet.
It is possible, but consider:
Are they supposed to be not-empty? Should you implement not null?
Is it a workflow -- so they are empty now, but most of them will be filled in the future?
If both are NO, then you may consider re-design. Edit your question and post the schema you have now.
There are several schools of thought in this. The first is to use null when the data is not known - that's what it's for.
The second is to not allow nulls and either separate out all the fields that could be null to relational tables or to create "fake" values to replace null. For varchar this would usually be the empty string, but the problem arises as to what the fake value should be for a date field or a numeric one. Then you have to write code to exclude the fake data, just like you have to write code to deal with the nulls.
Personally, I prefer to use nulls, with some judicious moving of data to child tables if the data is truly a different entity. Often these fields turn out to need the one-to-many structure of a parent-child relationship anyway: if you may or may not know a person's phone number, put it in a separate phone table, and you will often discover that you needed to store multiple phone numbers anyway.

Store array of numbers in database field

Context: SQL Server 2008, C#
I have an array of integers (0-10 elements). Data doesn't change often, but is retrieved often.
I could create a separate table to store the numbers, but for some reason it feels like that wouldn't be optimal.
Question #1: Should I store my array in a separate table? Please give reasons for one way or the other.
Question #2: (regardless of what the answer to Q#1 is), what's the "best" way to store int[] in database field? XML? JSON? CSV?
EDIT:
Some background: the numbers being stored are just some coefficients that don't participate in any relationship, and are always used as an array (i.e. a value is never retrieved or used in isolation).
Separate table, normalized
Not as XML or JSON, but separate numbers in separate rows
No matter what you think, it's the best way. You can thank me later
The "best" way to store data in a database is the way that is most conducive to the operations that will be performed on it and the one which makes maintenance easiest. It is this latter requirement which should lead you to a normalized solution, which means storing the integers in a table with a relationship. Beyond being easier to update, it is easier for the next developer who comes after you to understand what information is stored and how.
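A minimal sketch of the normalized layout, assuming SQLite via Python's sqlite3 module rather than SQL Server 2008 and C#. The table names model and model_coefficient, the position column, and the helper functions are invented for the illustration. Each integer gets its own row, and a position column preserves the array order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
conn.executescript("""
CREATE TABLE model (model_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE model_coefficient (
    model_id INTEGER NOT NULL REFERENCES model (model_id),
    position INTEGER NOT NULL,   -- index of the value within the array
    value    INTEGER NOT NULL,
    PRIMARY KEY (model_id, position)
);
""")


def save_coefficients(conn, model_id, values):
    """Replace the stored array for one model, one row per element."""
    with conn:
        conn.execute("DELETE FROM model_coefficient WHERE model_id = ?", (model_id,))
        conn.executemany(
            "INSERT INTO model_coefficient (model_id, position, value) VALUES (?, ?, ?)",
            [(model_id, i, v) for i, v in enumerate(values)],
        )


def load_coefficients(conn, model_id):
    """Return the array in its original order."""
    rows = conn.execute(
        "SELECT value FROM model_coefficient WHERE model_id = ? ORDER BY position",
        (model_id,),
    )
    return [value for (value,) in rows]


with conn:
    conn.execute("INSERT INTO model VALUES (1, 'demo')")
save_coefficients(conn, 1, [10, 2, 44, 1])
loaded = load_coefficients(conn, 1)
```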
Store it as a JSON array but know that all accesses will now be for the entire array - no individual read/writes to specific coefficients.
In our case, we're storing them as a JSON array. Like your case, there is no relationship between individual array numbers - the array only makes sense as a unit, and as a unit it DOES have a relationship with other columns in the table. By the way, everything else IS normalized. I liken it to this: If you were going to store a 10 byte chunk, you'd save it packed in a single column of VARBINARY(10). You wouldn't shard it into 10 bytes, store each in a column of VARBINARY(1) and then stitch them together with a foreign key. I mean you could - but it wouldn't make any sense.
YOU as the developer will need to understand how 'monolithic' that array of ints really is.
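For comparison, the single-column JSON variant could look roughly like this. It is sketched in Python with SQLite instead of C# and SQL Server; the model table and the helper functions are invented for the illustration. Note that changing one coefficient means reading, modifying, and rewriting the whole array, which is the trade-off described above.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE model (model_id INTEGER PRIMARY KEY, coefficients TEXT NOT NULL)"
)


def save_model(conn, model_id, coefficients):
    """Store the whole array as one JSON string, replacing any previous value."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO model (model_id, coefficients) VALUES (?, ?)",
            (model_id, json.dumps(list(coefficients))),
        )


def load_coefficients(conn, model_id):
    """Return the array as a Python list, or None if the model does not exist."""
    row = conn.execute(
        "SELECT coefficients FROM model WHERE model_id = ?", (model_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


save_model(conn, 1, [10, 2, 44, 1])

# Updating a single element is a read-modify-write of the entire array.
coeffs = load_coefficients(conn, 1)
coeffs[2] = 45
save_model(conn, 1, coeffs)

reloaded = load_coefficients(conn, 1)
```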
A separate table would be the most "normalized" way to do this. And it is better in the long run, probably, since you won't have to parse the value of the column to extract each integer.
If you want you could use an XML column to store the data, too.
Sparse columns may be another option for you, too.
If you want to keep it really simple you could just delimit the values: 10;2;44;1
I think that, since you are talking about SQL Server, your app may be a data-driven application. If that is the case, I would definitely keep the array in the database as a separate table with a record for each value. It will be normalized and optimized for retrieval. Even if you only have a few values in the array, you may need to combine that data with other retrieved data that has to be joined with your array values, and SQL is optimized for that case through indexes, foreign keys, etc. (normalized).
That being said, you can always hard code the 10 values in your code and save the round trip to the DB if you don't need to change the values. It depends on how your application works and what this array is going to be used for.
I agree with all the others that a separate normalized table is best. But if you insist on having it all in the same table, don't place the array in a single column. Instead, create 10 columns and store each array value in a different column. It will save you the parsing and update problems.

Resources