DB FK/PK keys performance - database

In our DB we have a single central table with millions of rows that is constantly being inserted into and updated.
This table has a single column acting as the unique identifier, and it is used to link the content of this table to multiple tables in a one-to-many relation.
This means that when inserting an entry into, say, a USERS table, the same transaction also populates USERS_PETS and USERS_PARENTS (and 10 more), with multiple rows each, based on the same unique identifier from the main table.
Since the application using this DB is constantly inserting new entries and updating existing ones, the relation between these tables is kept only at the application level (i.e. a logical ERD instead of handling this via FK/PK declarations).
Questions:
Is it correct to assume that, from a pure performance point of view, this is the best approach?
Is there a way to set these keys (so that the DB will be more self-descriptive) without impacting performance?

This is the worst possible approach and I guarantee you will have data integrity issues eventually. Data integrity is far more critical than performance. This is stupid and short-sighted.

No, for the same reason we use seatbelts in cars even when we are in a hurry. The difference is negligible and totally not worth it.
Some DBMS vendors may offer a way of declaring constraints while not enforcing them. In Oracle, for example, you can specify the integrity constraint state as DISABLE NOVALIDATE, as sketched below.
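For illustration, a minimal sketch in Oracle syntax, using the hypothetical USERS / USERS_PETS tables from the question (the column names are assumptions):

    -- declare the relationship for documentation and tooling,
    -- but never check it, neither for existing rows nor for new ones
    ALTER TABLE users_pets
      ADD CONSTRAINT fk_users_pets_user
      FOREIGN KEY (user_id) REFERENCES users (user_id)
      DISABLE NOVALIDATE;

Adding RELY in front of DISABLE NOVALIDATE additionally tells the Oracle optimizer that it may trust the constraint when rewriting queries, at the price of wrong results if the data really is inconsistent.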

You base data integrity on hope. Hope doesn't scale well.
And there's no such thing as "pure performance point of view". Unless, that is, you never read from the database. If you only insert, never update, never delete, and never read, you can make a case that there exists a "pure performance point of view". But if you ever update, delete, or read, then performance isn't a point--it's more like a surface or a solid, and all you can do is move the balancing point around among inserts, updates, deletes, and reads.
And, because somebody reading this still won't get it, the most critical part of read performance is getting back the right answer. If you can't guarantee the right answer, sensible people won't care how marginally faster your inserts are.

Related

SQL Server: Many columns in a table vs Fewer columns in two tables

I have a database table (called Fields) which has about 35 columns. 11 of them always contain the same constant values across all of its roughly 300,000 rows, and act as metadata.
The downside of this structure is that, when I need to update those 11 columns' values, I need to go and update all 300,000 rows.
I could move all the common data into a different table, and update it only one time, in one place, instead of 300,000 places.
However, if I do it like this, when I display the fields, I need to create INNER JOINs between the two tables, which I know makes the SELECT statement slower.
I must say that updating the columns occurs more rarely than reading (displaying) the data.
How do you suggest I store the data to obtain the best performance?
I could move all the common data into a different table, and update it only one time, in one place, instead of 300,000 places.
I.e. sane database design and standard normalization.
This is not about "many empty fields"; it is brutally about tons of redundant data. Constants should be isolated in a separate table. This may also make things faster: it allows the database to use memory more efficiently, because your database is a lot smaller.
I would suggest going with a separate table unless you've concealed something significant (of course it would be better to try and measure, but I suspect you already know the answer).
You can actually get faster selects as well: joining a small table is cheaper than fetching the same data 300,000 times.
This is a classic example of denormalized design. Sometimes denormalization is done for (SELECT) performance, but always in a deliberate, measurable way. Have you actually measured whether you gain any performance from it?
If your data fits into cache, and/or the JOIN is unusually expensive¹, then there may well be some performance benefit from avoiding the JOIN. However, the denormalized data is larger and will push at the limits of your cache sooner, increasing the I/O and likely reversing any gains you may have reaped from avoiding the JOIN; you might actually lose performance.
And of course, getting incorrect data is useless, no matter how quickly you can get it. The denormalization makes your database less resilient to data inconsistencies², and the performance difference would have to be pretty dramatic to justify this risk.
¹ Which doesn't look to be the case here.
² E.g. have you considered what happens in a concurrent environment where one application modifies existing rows and another application inserts a new row, but with old values (since the first application hasn't committed yet, there is no way for the second application to know that there was a change)?
The best way is to separate the data and form a second table with those 11 columns; call it something like a MASTER DATA table, which will have a primary key.
This primary key can then be referenced as a foreign key in those 300,000 rows in the first table, as in the sketch below.
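A hedged sketch of that split, with hypothetical column names since the question doesn't list them:

    -- the 11 constant 'metadata' columns move into a master table
    CREATE TABLE fields_master (
        master_id     INT PRIMARY KEY,
        source_system VARCHAR(50),
        import_batch  VARCHAR(50)
        -- ... the other 9 shared columns ...
    );

    CREATE TABLE fields (
        field_id  INT PRIMARY KEY,
        master_id INT NOT NULL REFERENCES fields_master (master_id),
        value     VARCHAR(255)
        -- ... the remaining ~24 per-row columns ...
    );

    -- updating the shared values now touches 1 row instead of 300,000
    UPDATE fields_master SET source_system = 'NEW' WHERE master_id = 1;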

When do we need to use a 1-to-1 relationship in database design?

When do we need to use a 1-to-1 relationship in database design? In my opinion, if two tables are in a 1-to-1 relationship, they can be combined into one table. Is this true?
1) Vertical partitioning for large tables to reduce I/O and cache requirements: separate columns that are queried often from those queried rarely (see the sketch after this list).
2) Adding a column to a production system when the ALTER TABLE is "too expensive".
3) Super-type/subtype pattern.
4) Vertical partitioning to benefit from table (join) elimination, provided the optimizer supports it (again to reduce I/O and cache).
5) Anchor modeling: similar to 4, but down to 6NF.
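As an illustration of items 1 and 4, a minimal sketch of such a vertical split (hypothetical names); the two tables share the same key, and the child's primary key doubles as the foreign key, which enforces the 1-to-1:

    -- hot columns, queried constantly
    CREATE TABLE customers (
        customer_id INT PRIMARY KEY,
        name        VARCHAR(100)
    );

    -- cold columns, queried rarely
    CREATE TABLE customer_details (
        customer_id INT PRIMARY KEY REFERENCES customers (customer_id),
        biography   VARCHAR(4000)
    );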
On occasion it's useful for table locks. When you add a column to a table, the whole table gets locked until it's fully rewritten. This has little to no impact when your database has 100k rows. But if you have 100M rows, or 1B rows, it's a whole different story...
It's also useful to avoid dead rows that take up too much space. If you're using MVCC and some of your columns are regularly overwritten, it occasionally makes sense to place them in a separate table. Arguably, auto-vacuuming eventually kicks in, but for the sake of sparing the hard drive, it's better to vacuum a few int fields in a separate table than a whole bunch of entire rows full of text, varchar(n) and who knows what else.
A last reason would be the abuse of SELECT * in ORMs. If you're storing images or blog posts/articles, for instance, it may make sense to store the blob/text field in a separate table, because every time the row gets loaded for one reason or another, your ORM is going to load all of it. When you only need the URL of your image or post, the last thing you want is to pull the whole binary/text from the database; and yet your ORM will do just that...
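A hedged sketch of those last two points, with hypothetical names; the bulky text and the frequently overwritten counter each live apart from the narrow row the ORM loads by default:

    CREATE TABLE posts (
        post_id INT PRIMARY KEY,
        title   VARCHAR(200),
        url     VARCHAR(200)
    );

    -- bulky field the ORM shouldn't drag in on every 'select *'
    CREATE TABLE post_bodies (
        post_id INT PRIMARY KEY REFERENCES posts (post_id),
        body    TEXT
    );

    -- frequently overwritten value: under MVCC this churn now produces
    -- dead rows of a few ints, not dead copies of whole text rows
    CREATE TABLE post_stats (
        post_id    INT PRIMARY KEY REFERENCES posts (post_id),
        view_count INT NOT NULL
    );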
Yes, in general.
One exception may be if you want to assign privileges differently to a subset of columns.
Also consider that combining the tables is only equivalent when both sides of the relationship are required.
One reason would be to put frequently accessed data in one table and extremely rarely accessed data in another table. It would run faster and save some memory.
But I would have to have my arm twisted hard before I would do it.

Access database performance

For a few different reasons one of my projects is hosted on a shared hosting server and developed in ASP.NET/C# with Access databases (not a choice, so don't laugh at this limitation; it's not from me).
Most of my queries are on the last few records of the databases they are querying.
My question is in 2 parts:
1) Is the order of the records in the database only visual, or is there an actual difference internally? More specifically, the reason I ask is that, the way it is currently designed, all records (for all tables in this project) are ordered ascending by a row-identifying key (which is an AutoNumber field). But since over 80% of my queries will be querying fields that should be towards the end of the table, would it increase query performance if I set the table to show the most recent record at the top instead of at the end?
2) Is there any other performance tuning that can be done to help with Access tables?
"Access" and "performance" is an euphemism but the database type wasn't a choice
and so far it hasn't proven to be a big problem but if I can help the performance
I would sure like to do whatever I can.
Thanks.
Edit:
No, I'm not currently experiencing issues with my current setup, just trying to look forward and optimize everything.
Yes, I do have indexes, and I have a primary key (automatically indexed) on the unique record identifier for each of my tables. I definitely should have mentioned that.
You're all saying the same thing: I'm already doing all that can be done for Access performance. I'll give the accepted answer to whoever answered fastest.
Thanks everyone.
As far as I know...
1 - That change would just be visual. There'd be no impact.
2 - Make sure your fields are indexed. If the fields you are querying on are unique, then make sure you make the fields a unique key.
Yes, there is an actual order to the records in the database; setting the defaults in the table preferences isn't going to change that.
I would ensure there are indexes on all your WHERE clause columns. This is a rule of thumb; it would rarely be optimal, but you would have to do workload testing against different database setups to prove the most optimal solution.
I work daily with a legacy Access system that can be reasonably fast with concurrent users, but only for a smallish number of them.
You can use indexes on the fields you search on (aren't you already?).
http://www.google.com.br/search?q=microsoft+access+indexes
The order is most likely not the problem. Besides, I don't think you can really change it in Access anyway.
What is important is how you are accessing those records. Are you accessing them directly by the record ID? Whatever criteria you use to find the data you need, you should have an appropriate index defined.
By default, there will only be an index on the primary key column, so if you're using any other column (or combination of columns), you should create one or more indexes.
Don't just create an index on every column though. More indexes means Access will need to maintain them all when a new record is inserted or updated, which makes it slower.
Here's one article about indexes in Access.
Have a look at the field or fields you're using to query your data and make sure you have an index on those fields. If it's the same as SQL Server, you won't need to include the primary key in the index (assuming it's clustered on this), as it's included by default.
If you're running queries on a small subset of fields, you could make your index a 'covering' index by including all the fields required. There's a space trade-off here, so I really only recommend it for 5 fields or fewer, depending on your requirements.
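For instance, a sketch in SQL Server syntax (since the answer draws the analogy), with made-up table and column names; Access itself has no INCLUDE clause, so the closest equivalent there is a plain multi-column index on the queried fields:

    -- the index 'covers' the query below: no lookup into the base table
    CREATE NONCLUSTERED INDEX IX_Orders_Customer
        ON Orders (CustomerID, OrderDate)
        INCLUDE (Total);

    -- answered entirely from the index
    SELECT OrderDate, Total
    FROM Orders
    WHERE CustomerID = 42;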
Are you actually experiencing a performance problem now, or is this just a general optimization question? Also, from your post it sounds like you are talking about a db with one table; is that accurate? If you are already experiencing a problem and you are dealing with concurrent access, some answers might be:
1) Indexing fields used in WHERE clauses (mentioned already).
2) Splitting tables. For example, if the bulk of your queries only touch recent rows (as implied in your question), create an archive table for older records (sketched after this list). Or, if the bulk of your performance hits are from reads (complicated reports) and you don't want to impinge on performance for people adding records, create a separate reporting table structure and query off of that.
3) If this is a reporting scenario, all queries are similar or the same, concurrency is somewhat high (a very relative number, given Access) and the data is not extremely volatile, consider persisting the data to a file that can be periodically updated, thus offloading the querying workload from the Access engine.
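A sketch of the archive idea from item 2, using Access SQL date literals and made-up names; old rows are copied out and then removed, so the hot table stays small:

    -- copy pre-2009 records into the archive table...
    INSERT INTO orders_archive
    SELECT * FROM orders WHERE order_date < #1/1/2009#;

    -- ...then remove them from the live table
    DELETE FROM orders WHERE order_date < #1/1/2009#;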
In regard to table order, Jet/ACE writes the actual table data in PK order. If you want a different order, change the PK.
But this oughtn't to be a significant issue.
Indexes on the fields other than the PK that you sort on should make sorting pretty fast. I have apps with 100s of thousands of records that return subsets of data in non-PK sorted order more-or-less instantaneously.
I think you're engaging in "premature optimization," worrying about something before you actually have an issue.
The only circumstances in which I think you'd have a performance problem is if you had a table of 100s of thousands of records and you were trying to present the whole thing to the end user. That would be a phenomenally user-hostile thing to do, so I don't think it's something you should be worrying about.
If it really is a concern, then you should consider changing your PK from the Autonumber to a natural key (though that can be problematic, given real-world data and the prohibition on non-Null fields in compound unique indexes).
I've got a couple of things to add that I didn't notice being mentioned here, at least not explicitly:
Field length: create your fields as large as you'll need them, but don't go over. For instance, if you have a number field and the value will never be over 1,000 (for the sake of argument), don't type it as a Long Integer; something smaller like Integer would be more appropriate. Or use a Single instead of a Double for decimal numbers, etc. By the same token, if you have a text field that won't have more than 50 chars, don't set it up for 255. Sounds obvious, but it's done all the time, often with the idea that "I might need that space in the future", and your app suffers in the meantime.
Not to beat the indexing thing to death, but tables that you're joining together in your queries should have relationships established. This will create indexes on the foreign keys, which greatly increases the performance of table joins. (NOTE: double-check any foreign keys to make sure they did indeed get indexed; I've seen cases where they haven't, so apparently a relationship doesn't necessarily mean that the proper indexes have been created.)
Apparently, compacting your DB regularly can help performance as well; this reduces internal fragmentation of the file and can speed things up that way.
Access actually has a Performance Analyzer, under Tools > Analyze > Performance; it might be worth running it on your tables and queries at least to see what it comes up with. The Table Analyzer (available from the same menu) can help you split out tables with a lot of redundant data. Obviously, use it with caution, but it could be helpful.
This link has a bunch of stuff on Access performance optimization, covering pretty much all aspects of the database (tables, queries, forms, etc.); it'd be worth checking out for sure.
http://office.microsoft.com/en-us/access/hp051874531033.aspx
To understand the answers here, it is useful to consider how Access works: in an unindexed table, there is unlikely to be any value in organising the data so that recently accessed records are at the end. Indeed, by virtue of the fact that Access (the Jet engine) is an ISAM database (http://en.wikipedia.org/wiki/ISAM), it's the other way around. That's rather moot, however, as I would never suggest putting frequently accessed values at the top of a table; it is best, as others have said, to rely on useful indexes.

What advantages do constraints provide to a database?

I realize this question may seem a little on the "green" side, but after the number of "enterprise" or "commercial" databases I've encountered, I've begun to ask this question. What advantages do constraints provide to a database? I'm asking more about foreign key constraints than unique constraints. Do they offer performance gains, or just data integrity?
I've been rather surprised at the number of relational databases without foreign keys, or even without specified primary keys (just not-null constraints or a unique constraint on a field).
Thoughts?
"just" data integrity? You say that like it's a minor thing. In all applications, it's critical. So yes, it provides that, and it's a huge benefit.
Data integrity is what they offer. If anything, they have a performance cost (albeit a very minor one).
They provide both performance and data integrity, and the latter is paramount to any serious system. I cringe every time I see a database without any foreign keys, where all integrity is handled through triggers (if at all). And I've seen quite a few of those out there.
The following, assuming you get the constraint right in the first place:
1) Your data will be valid with respect to the constraint.
2) The database knows your data is valid with respect to the constraint and can use this when querying or updating the database, e.g. removing an unnecessary join for a query on a view (see the sketch after this list).
3) The constraint is documented for future users of the database.
4) A violation of the constraint will be caught as soon as possible, not in some later unrelated process that fails.
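A hedged sketch of point 2, with hypothetical tables; given a trusted FK plus a NOT NULL joining column, optimizers such as SQL Server's or Oracle's can eliminate the join entirely:

    -- the FK proves every order has exactly one matching customer
    CREATE VIEW order_summary AS
        SELECT o.order_id, o.total, c.name
        FROM orders o
        JOIN customers c ON c.customer_id = o.customer_id;

    -- no customer columns requested: the optimizer may drop
    -- the join to customers from the plan altogether
    SELECT order_id, total FROM order_summary;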
In relational theory, a database that allows inconsistent data isn't really a relational database. Foreign keys are necessary for data integrity and consistency, to keep the database "relational"; i.e. so the logical model of the database is always correct.
In practical terms, it's usually easier to define a foreign key and let the DB engine handle making sure the relation is valid (see the sketch below). The other options are:
1) Nothing: guaranteed data corruption at some point.
2) DB triggers: which will usually be slower and less performant.
3) Application code: which will eventually cause problems when either you forget to call the right code or another application accesses the database.
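A minimal sketch of that first option, using the department example that comes up later in this thread (hypothetical tables):

    CREATE TABLE departments (
        dept_no INT PRIMARY KEY,
        name    VARCHAR(50)
    );

    CREATE TABLE employees (
        emp_no  INT PRIMARY KEY,
        dept_no INT NOT NULL REFERENCES departments (dept_no)
    );

    -- rejected by the engine: department 99 doesn't exist,
    -- and no application-side check was needed to catch it
    INSERT INTO employees (emp_no, dept_no) VALUES (1, 99);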
Data is an asset. Lots of textbooks state that.
But it is actually wrong. It should rather say "correct data is an asset, incorrect data is a liability".
And database constraints give you the best possible guarantee that data is correct.
In some DBMSs (e.g. Oracle) constraints can actually improve the performance of some queries, since the optimiser can use the constraints to gain knowledge about the structure of the data. For some examples, see this Oracle Magazine article.
I would say all required constraints must be in the database. Foreign key constraints prevent unusable data. They aren't a nice-to-have; they are a requirement unless you want a useless database. Foreign keys may hurt the performance of deletes and updates, but that is OK. Is it better to take a little longer to do a delete (or to tell the application not to delete this person because he has orders in the system), or to delete the user but not his data? Lack of foreign keys may cause unexpected and often serious problems in querying the data. For instance, the cost reports may depend on all the tables having related data, and so may fail to show important data because one or more tables have nothing to join to.
Unique constraints are also a requirement of any decent database. If a field or group of fields must be unique, failing to define this at the database level is to create data problems that are extremely hard to fix.
You don't mention other constraints, but you should. Any business rule that must always be applied to all data in the table should always be applied in the database: through a datatype (such as a datetime datatype, which will not accept '02\31\2009' as a valid date), through a constraint (say, one that does not allow the field to have a value greater than 100), or through a trigger if the logic is so complex it cannot be handled by an ordinary constraint. (Triggers are tricky to write if you don't know what you are doing, so if you have logic this complex, you hopefully have a database professional on your team.) The order is important, too: datatypes are the first choice, followed by constraints, followed by triggers as a last choice.
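Hedged sketches of those mechanisms in that order of preference, with made-up columns (triggers omitted, since they depend on the vendor's procedural dialect):

    CREATE TABLE discounts (
        discount_id  INT PRIMARY KEY,
        code         VARCHAR(20) NOT NULL UNIQUE,   -- unique business key, declared in the DB
        created_at   DATETIME NOT NULL,             -- datatype rejects '02\31\2009' outright
        discount_pct INT NOT NULL
            CHECK (discount_pct BETWEEN 0 AND 100)  -- constraint: no values greater than 100
    );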
Simpler Application Code
One nice thing they provide is that your application code has to do a lot less error checking and validation. Contrast these two bits of code and multiply by thousands of operations and you can see there's a big win.
    get department number for employee   # guaranteed valid, thanks to the constraint
    do something with department number

vs.

    get department number for employee
    if department number is empty
        ...
    else if department number not in list of good department numbers
        ...
    else
        do something with department number
Of course, people that ignore constraints probably don't put a lot of effort into code validation anyway... :-/
Oh, and if the data constraints change, it's a database configuration issue and not a code change issue.
Integrity constraints are especially important when you integrate several applications using a shared database.
You may be able to properly manage data integrity in a single application's code (and even if you don't, at least the broken data affects only that application), but with multiple apps it gets hairy (and at the least redundant).
"Oh, and if the data constraints change, it's a database configuration issue and not a code change issue."
Unless it's a constraint that disappears from the design. In that case, there definitely IS a code impact, because some code might have been written that depends on that removed constraint being there.
That must always be taken into consideration when relaxing or removing any declared constraint.
Other than that, you are of course completely right.

Have you ever worked with a database with no "relations", no "PKs", and no "FKs", just raw data?

Nowadays, I'm working on a database with no relations, PKs, or FKs, just raw data.
I can say that the database is just a set of papers.
When I asked about this, the answer I got was: "Hide the business."
Also, one of my friends said this always happens in "large systems".
In large systems, they are trying to hide their business through raw data.
Regarding development: relations, constraints, and validation are done in the database using triggers, and of course in the user interface.
What do you think regarding this?
Well, this may have a point on large databases, when you need fast response on massive DML (INSERT / UPDATE / DELETE).
The problem is that if you rely on the database's way of ensuring integrity, you can hardly optimize it.
There is also a thing called SQL/PLSQL context switching in Oracle: if you create an empty trigger on a table, it will slow down DML by about 20 times, merely because the trigger exists.
In Oracle, when you write an ON UPDATE trigger and update 50,000 rows in the table, the trigger and the query in it get called 50,000 times. Foreign keys perform better, but they may also get laggy (and you can do nothing about their underlying queries).
In this case, it's better to put the values you want to update into a temporary table, issue a MERGE, check integrity before and after, and apply the business rules. A single query that processes 50,000 rows works faster than a loop of 50,000 queries processing a single row each (see the sketch below).
Of course, it's very hard to implement, and it only pays for itself when you have a really large database and need to perform really massive updates on it.
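A hedged sketch of that pattern in Oracle syntax, with hypothetical table names; the staged rows are applied set-wise in one statement instead of 50,000 single-row updates:

    -- stage the changed rows (e.g. bulk-loaded into a temporary table),
    -- then apply them all at once
    MERGE INTO accounts a
    USING staged_updates s
       ON (a.account_id = s.account_id)
    WHEN MATCHED THEN
        UPDATE SET a.balance = s.new_balance
    WHEN NOT MATCHED THEN
        INSERT (account_id, balance)
        VALUES (s.account_id, s.new_balance);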
In Oracle, in any case, FOREIGN KEY constraints perform better than triggers implementing the same functionality.
PRIMARY KEYs will most likely improve performance, as a primary key implies creating a UNIQUE INDEX on the constrained field, and this index may be used efficiently in queries. A UNIQUE INDEX is also the natural and most efficient way to enforce uniqueness.
But of course, like any index, it slows down INSERTs, and those UPDATEs and DELETEs whose WHERE condition is not selective.
I.e. if you need to UPDATE or DELETE 1 row out of 2,000,000, then the index is your friend; if you need to UPDATE or DELETE 1,500,000 rows out of 2,000,000, the index is your enemy. It's a matter of tradeoffs.
You may also see my answer here.
I would say sacrificing data integrity for security through obscurity is a bad trade.
I think I came across at least two applications whose databases lacked relations and FKs.
The idea is probably that it's more difficult to reverse engineer the brilliant database schema.
The side effect is that often the applications are not so good at checking constraints themselves, leading to a lot of rubbish data in the database, which in fact does make it more difficult to reverse engineer, as FK constraints are not enforced ;)
My view is that once it's a database, somebody else can look into it, and trying to work around this "feature" of visibility is pointless and generally Not a Good Thing, considering the drawbacks (no relations, no SPs, no triggers, etc.).
Primary keys, relations, etc. are tools for making database development easier and for making the final result faster and more efficient. I can only think of a few rare cases where not having a key/index would be a good idea.
Did your friend explain why they held this view?
This kind of thing can happen in financial systems. It's the opposite of what you'd expect: you'd think that because it's finance, best practice would be applied more rigorously. However, the converse is often true. Many of these databases may have started out in Excel.
I have seen a number of databases where they don't bother with FKs or PKs. I can't say I like it, but sometimes you just have to live with it, or leave and go work somewhere else for less money but with more database integrity.
Perhaps this is why they need you.
(me=MSSQL)
Deleting a row from a large table that has lots of FKs is slow.
Our app has never been denied a DELETE by an FK constraint violation on that table.
I have considered dropping the FKs to improve performance.
Too chicken to have actually done it though :(
PS We would keep the FKs on the DEV / TEST systems
Maybe this was converted from a former system that used a file-based structure to hold the data. The tables in the new database are just a reflection of the individual files.
