While working on a content management system, I've hit a bit of a wall. Coming back to my data model, I've noticed some issues that could become more problematic over time.
Namely, I want to maintain an audit trail (change log) of record modifications by user (even modifications to user records would be logged). Because an arbitrary number of modules can be included, I cannot use a per-table auto-increment field for my primary keys, as it would inevitably cause conflicts when maintaining those keys in a single table.
The audit trail would keep records of user_id, record_id, timestamp, action (INSERT/UPDATE/DELETE), and archive (a serialized copy of the old record).
I've considered a few possible solutions to the issue, such as generating a UUID primary key in application logic (to ensure cross-database-platform compatibility).
Another option I've considered (and I'm sure the consensus will be negative for even considering this method) is creating a RecordKey table to maintain a globally auto-incremented key. However, I'm sure there are far better methods to achieve this.
Ultimately, I'm curious what options I should consider in attempting to implement this. For example, I intend to support (to start with, at least) both MySQL and SQLite3 storage, but I'm concerned about how each database would handle UUIDs.
Edit, to make my question less vague: would using global IDs be a recommended solution for my problem? If so, with a 128-bit UUID (application- or database-generated), what can I do in my table design to help maximize query efficiency?
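For a concrete picture of what I have in mind, a table keyed this way might look roughly like the following in MySQL (just a sketch; the table and column names are made up). My understanding is that a 128-bit UUID can be stored as BINARY(16) rather than CHAR(36), which keeps the key and every index built on it smaller; in SQLite the same column could simply be a BLOB or TEXT:

    -- Sketch only: the UUID is generated in application code and stored
    -- as 16 raw bytes instead of a 36-character string.
    CREATE TABLE article (
        id         BINARY(16)   NOT NULL,  -- application-generated UUID
        title      VARCHAR(200) NOT NULL,
        body       TEXT,
        created_at DATETIME     NOT NULL,
        PRIMARY KEY (id)
    );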
Ok, you've hit a brick wall. And you realise that actually the db design has problems. And you are going to keep hitting this same brick wall many times in the future. And your future is not looking bright. And you want to change that. Good.
But what you have not yet done is figure out what the actual cause of this is. You cannot escape from the predictable future until you do that. And if you do that properly, there will not be a brick wall, at least not this particular brick wall.
First, you went and stuck Idiot columns on all the tables to force uniqueness, without really understanding the identifiers and keys that were naturally used to find the data. Those are the bricks the wall is made from. That was an unconsidered knee-jerk reaction to a problem that demanded consideration. That is what you will have to re-visit.
Do not repeat the same mistake again. Whacking in GUIDs or UUIDs, or 32-byte Idiot columns, to fix your NUMERIC(10,0) Idiot columns will not do anything, except make the db much fatter, and all accesses, especially joins, much slower. The wall will be made of concrete blocks and it will hit you every hour.
Go back and look at the tables, and design them with a view to being tables, in a database. That means your starting point is No Surrogate Keys, no Idiot columns. When you are done, you will have very few Id columns. Not zero, not all tables, but very few. Therefore you have very few bricks in the wall. I have recently posted a detailed set of steps required, so please refer to:
Link to Answer re Identifiers
What is the justification for having one audit table containing the audit records of all tables? Do you enjoy meeting brick walls? Do you want the concurrency and speed of the db to be bottlenecked on the Insert hot-spot in one file?
Audit requirements have been implemented in dbs for over 40 years, so the chances of your users having some requirement that has not already been catered for are not very high. May as well do it properly. The only correct method (for an Rdb) for audit tables is to have one audit table per auditable real table. The PK will be the original table PK plus DateTime (compound keys are normal in a modern database). Additional columns will be UserId and Action. The row itself is the before image (the new image is the single current row in the main table). Use the exact same column names. Do not pack it into one gigantic string.
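To make that concrete, a minimal sketch (the table and column names are only illustrative):

    -- The main table holds the single current (after) image.
    CREATE TABLE Product (
        ProductCode CHAR(10)      NOT NULL,
        Name        VARCHAR(100)  NOT NULL,
        Price       DECIMAL(10,2) NOT NULL,
        PRIMARY KEY (ProductCode)
    );

    -- One audit table per real table, with the exact same column names;
    -- PK = original PK plus DateTime; each row is a before image.
    CREATE TABLE ProductAudit (
        ProductCode CHAR(10)      NOT NULL,
        AuditDtm    DATETIME      NOT NULL,
        UserId      CHAR(16)      NOT NULL,
        Action      CHAR(1)       NOT NULL,  -- I, U or D
        Name        VARCHAR(100)  NOT NULL,
        Price       DECIMAL(10,2) NOT NULL,
        PRIMARY KEY (ProductCode, AuditDtm)
    );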
If you do not need the data (before image), then stop recording it. It is very silly to be recording all that volume for no reason. Recovery can be obtained from the backups.
Yes, a single RecordKey table is a monstrosity. And yet another guaranteed method of single-threading the database.
Do not react to my post, I can already see from your comments that you have all the "right" reasons for doing the wrong thing, and keeping your brick walls intact. I am trying to help you destroy them. Consider it carefully for a few days before responding.
How about keeping the record_id local to each table, and adding another column, table_name, to the audit table to make a composite key?
This way you can also easily filter your audit log by table_name (which will be tricky with arbitrary UUID or sequence numbers). So even if you do not go with this solution, consider adding the table_name column anyway for the sake of querying the log later.
In order to fit the record_id from all tables into the same column, you would still need to enforce that all tables use the same data type for their ids (but it seems like you were planning to do that anyway).
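As a sketch (names are purely illustrative), such an audit table could be declared like this:

    -- One shared audit table; (table_name, record_id, changed_at) is the key.
    CREATE TABLE audit_log (
        table_name VARCHAR(64) NOT NULL,   -- which table the record lives in
        record_id  BIGINT      NOT NULL,   -- id local to that table
        changed_at DATETIME    NOT NULL,
        user_id    BIGINT      NOT NULL,
        action     VARCHAR(6)  NOT NULL,   -- INSERT / UPDATE / DELETE
        archive    TEXT,                   -- serialized copy of the old record
        PRIMARY KEY (table_name, record_id, changed_at)
    );

    -- Filtering the log by table is then trivial:
    -- SELECT * FROM audit_log WHERE table_name = 'users' ORDER BY changed_at DESC;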
A more powerful scheme is to create an audit table that mirrors the structure of each table, rather than putting the whole audit trail in one place. This "shadow table" model makes it easier to query the audit trail.
I have encountered the following dilemma several times and would be interested to hear how others have handled it, or whether there is a canonical way to address the situation.
In some domains, one is naturally led to consider very wide tables. Take, for instance, time-series surveys that evolve over many years. Such surveys can have hundreds, if not thousands, of variables, though typically only a few thousand or tens of thousands of rows. It is absolutely natural to consider such a result set as a table where each variable corresponds to a column; however, in SQL Server at least, one is limited to 1024 (non-sparse) columns.
The obvious workarounds are to
Distribute each record over multiple tables
Stuff the data into a single table with columns of say, ResponseId, VariableName, ResponseValue
Number 2, I think, is very bad for a number of reasons (difficult to query, suboptimal storage, etc.), so really the first choice is the only viable option I see. This choice could perhaps be improved by grouping columns that are likely to be queried together into the same table, but one can't really know this until the database is actually being used.
So, my basic question is: Are there better ways to handle this situation?
You might want to put a view in front of the tables to make them appear as if they are a single table. The upside is that you can rearrange the storage later without queries needing to change. The downside is that a modification through the view can only affect one base table at a time. If necessary, you could mitigate this with stored procedures for frequently used modifications. Based on your use case of time-series surveys, it sounds like inserts and selects are far more frequent than updates or deletes, so this might be a viable way to stay flexible without forcing your clients to update if you need to rearrange things later.
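A minimal sketch of that arrangement (the split tables and column names are hypothetical):

    -- The survey responses are physically split across two tables sharing a key.
    CREATE TABLE Response_Part1 (ResponseId INT PRIMARY KEY, Q0001 INT, Q0002 INT);
    CREATE TABLE Response_Part2 (ResponseId INT PRIMARY KEY, Q0901 INT, Q0902 INT);

    -- The view presents them to clients as a single wide "table".
    CREATE VIEW Response AS
    SELECT p1.ResponseId, p1.Q0001, p1.Q0002, p2.Q0901, p2.Q0902
    FROM Response_Part1 AS p1
    JOIN Response_Part2 AS p2 ON p2.ResponseId = p1.ResponseId;

(In SQL Server the CREATE VIEW would go in its own batch.)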
Hmmm, it really depends on what you do with it. If you want to keep the table as wide as it is (possibly this is for OLAP or a data warehouse), I would just use proper indexes. Based on the columns that are selected most often, you could also use covering indexes, and based on the rows that are searched most often, filtered indexes. If there are, let's say, billions of records in the table, you could partition the table as well. If you do want to store the data over multiple tables, use proper normalization techniques, probably up to 3NF or BCNF (the "3.5NF" you sometimes see), to divide the big table into smaller ones. I would use your first method, normalization, to store the data for the big table, just because it makes more sense to me that way.
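For example, in SQL Server (the table and column names are placeholders):

    -- Covering index: the INCLUDEd columns let frequent SELECTs be answered
    -- from the index alone, without key lookups into the wide row.
    CREATE NONCLUSTERED INDEX IX_Response_Year_Region
        ON SurveyResponse (SurveyYear, Region)
        INCLUDE (Q0001, Q0002, Q0003);

    -- Filtered index: only the rows that are searched most often are indexed.
    CREATE NONCLUSTERED INDEX IX_Response_Recent
        ON SurveyResponse (Region)
        WHERE SurveyYear >= 2015;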
This is an old topic, but something we are currently working on resolving. Neither of the above answers really gives as many benefits as the solution we feel we have found.
We previously believed that having wide tables wasn't really a problem. Having spent time analysing this, we have seen the light and realised that the costs of inserts/updates are indeed getting out of control.
As John states above, the solution is really to create a VIEW to provide your application with a consistent schema. One of the challenges in any redesign may be, as in our case, that you have thousands or millions of lines of code referencing an old wide table and you may want to provide backwards compatibility.
Views can also be used for UPDATEs and INSERTs, as John alludes to. However, a problem we found initially was that, taking the example of myWideTable, which may have hundreds of columns, split into myWideTable_a with columns a, b and c and myWideTable_b with columns x, y and z, an insert to the view which only sets column a will only insert a record into myWideTable_a.
This causes a problem when you later want to update your record and set myWideTable.z, as the update will fail.
The solution we're adopting, and performance testing, is to have an INSTEAD OF trigger on inserts to the view that always inserts into our split tables, so that we can continue to update or read from the view with impunity.
The question of whether this trigger on inserts adds more overhead than a wide table is still open, but it is clear that it will improve subsequent writes to columns in each split table.
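A cut-down version of the kind of trigger described above, using the myWideTable_a / myWideTable_b example and an assumed shared id column (a sketch, not our production code):

    CREATE TRIGGER trg_myWideTable_Insert
    ON myWideTable          -- the view over myWideTable_a and myWideTable_b
    INSTEAD OF INSERT
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Always create a row in both split tables, so a later UPDATE that
        -- only touches column z still finds a row in myWideTable_b.
        INSERT INTO myWideTable_a (id, a, b, c)
        SELECT id, a, b, c FROM inserted;

        INSERT INTO myWideTable_b (id, x, y, z)
        SELECT id, x, y, z FROM inserted;
    END;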
In our DB we have a single central table with millions of rows that is constantly being inserted into and updated.
This table has a single column acting as the unique identifier, which is used to link its content to multiple other tables in one-to-many relations.
This means that when inserting an entry into, say, the USERS table, the same transaction also populates USERS_PETS and USERS_PARENTS (and 10 more) with multiple rows, based on the same unique identifier from the main table.
Since the application using this DB is constantly inserting new entries and updating existing ones, the relations between these tables are kept only at the application level (i.e. a logical ERD instead of handling this via FK/PK declarations).
Questions:
Is it correct to assume that, from a pure performance point of view, this is the best approach?
Is there a way to declare these keys (so that the DB is more self-describing) without impacting performance?
This is the worst possible approach and I guarantee you will have data integrity issues eventually. Data integrity is far more critical than performance. This is stupid and short-sighted.
No, for the same reason we use seatbelts in cars even when we are in a hurry. The difference is negligible and totally not worth it.
Some specific DBMS vendors may offer a way of declaring constraints while not enforcing them. In Oracle, for example, you can specify the integrity constraint state as DISABLE NOVALIDATE.
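For example, in Oracle (the constraint name and column names are assumed):

    -- The relationship is declared, so it shows up in the data dictionary and
    -- in tools, but Oracle neither validates existing rows (NOVALIDATE) nor
    -- enforces it on new DML (DISABLE).
    ALTER TABLE users_pets
        ADD CONSTRAINT fk_users_pets_user
        FOREIGN KEY (user_id) REFERENCES users (id)
        DISABLE NOVALIDATE;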
You base data integrity on hope. Hope doesn't scale well.
And there's no such thing as "pure performance point of view". Unless, that is, you never read from the database. If you only insert, never update, never delete, and never read, you can make a case that there exists a "pure performance point of view". But if you ever update, delete, or read, then performance isn't a point--it's more like a surface or a solid, and all you can do is move the balancing point around among inserts, updates, deletes, and reads.
And, because somebody reading this still won't get it, the most critical part of read performance is getting back the right answer. If you can't guarantee the right answer, sensible people won't care how marginally faster your inserts are.
We have a mid-size SQL Server based application that has no indexes defined. Not even on the identity columns. I suggested to our moderately expensive application consultant that perhaps we might get better performance (particularly as our database grows) by creating some indexes on appropriate fields, and he said:
"Indexes will significantly impact other areas of the application and customers should not create them under any circumstances."
Anybody ever heard of anything like this? Are there ever circumstances where one should not create any indexes? I can see nothing special about this app - it's got int identity columns, then lots of string columns, bunch of relational tables but nothing special or weird that I can see.
Thanks!
[EDIT: the identity columns are not using "identity specification", they seem to be set by the program, looking at the database with Management Studio, I can find NO indexes...]
FOLLOWUP: At a conference I asked the CEO (and chief architect) of the company producing this product about this. His response was that, for small to midsize deployments, they felt the overhead of maintaining indexes would hurt the overall user experience (the application does a lot of writes) more than the indexes would help, but that for large databases they do create indexes. The tech support guy was just overzealous and very unhelpful with his answer. Mystery solved.
Hire me and I'll create the indexes for you. Fourteen years of Sybase/SQL Server experience tells me to create those !darn! indexes, unless your tables have fewer than 500 records each.
My rule of thumb is that an index node holds roughly 1000 entries.
The other thing you need to look out for is whether your consultant has normalized the tables. Perhaps the tables have 500 fields/columns, containing more than one conceptual entity, or even a whole dozen of them. That could be why he is nervous about creating indexes: if there are 12 conceptual entities in a table, there would be at least 12 sets of indexes, in which case he is absolutely right: under no circumstances... blah blah.
However, if he indeed does have 500 columns, or detectably multiple conceptual entities per table, then he is a very, very lousy data design engineer. In all my years working with more experienced data engineers, our tables rarely exceed 20 columns: 5 on the low side, 10 on average. Sometimes, for performance's sake, we do allow mixing two entities in a table, or horizontalizing row occurrences into columns of a table.
When you look at the table design you can see, even with an untrained eye, Product, Project, BuildSheet, FloorPlan, Equipment, etc. records all rolled into one long row. You cannot mix all these entities together in one table.
That is the only reason I know why he could advise you against having indexes. If he is doing that, you should know that he is fraudulently representing his data design skills to your company and you should immediately drop him from your weekly contractual expenses.
OK, after reading Larry's post, I agree with him too.
There is such a thing as over-indexing, especially in INSERT and UPDATE heavy applications with very large tables. So the answer to the question in your title is yes, it can sometimes be a bad idea to add indexes.
That's quite a different question from the one you ask in the body of your question, which is "Is it ever normal to have NO indexes in a SQL Server database?". The answer is that unless you're using the database as a "write-only" system, in which data is added but only read after being bulk extracted and transformed into another data store, it's exceedingly unusual not to have some indexes in the database.
Your consultant's statement is odd enough to make me believe that you may have left some important information out of your description. If not, I'd say he's nuts.
Do you have the disk space to spare? I've seen cases where the indexes weighed more than the table.
However, no indexes whatsoever? There can't be a case for that, except when all read operations need the entire table.
Columns with key constraints will have an implicit index on them anyway. So if you're always selecting by the primary key, then there's no point adding more indexes. If you're selecting by other criteria, then it makes sense to add indexes on those columns that you're querying on.
It also depends on how insert-heavy your data is. If you're inserting more often than you're querying, then the overhead of keeping the indexes up to date can make your inserts slower.
But to say you "should not create [indexes] under any circumstances" is a bit much.
What I would recommend is that you capture some of your typical queries with the SQL Server Profiler tool and feed the resulting trace to the Database Engine Tuning Advisor, which will recommend the indexes that will have the biggest effect on performance.
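For instance, if a frequently run query filters on a couple of non-key columns, an index matching that WHERE clause is usually the first thing to try (a sketch with made-up names):

    -- Typical query: SELECT ... FROM Orders WHERE CustomerId = @c AND OrderDate >= @d
    CREATE NONCLUSTERED INDEX IX_Orders_Customer_Date
        ON Orders (CustomerId, OrderDate);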
In most run-of-the-mill applications, the impact of indexes on insertion performance is a bit of a non-issue. You're usually better off creating the index, and if insertion performance drops dramatically (which it probably won't) you can try something else. Obviously there are some exceptions where you should be more careful, like tables that are used for logging, for instance.
As mentioned, disk space can be an issue.
Creating irrelevant indexes (e.g. duplicates) will also waste microseconds and occasionally result in a bad query execution plan.
The other problem I've seen is with strangely coded third-party applications that generate parts of the database at runtime, and can delete or choke on indexes that they don't know about.
In the vast majority of cases though, a carefully chosen index will only be a benefit.
Not having indexes on id columns sounds really unusual and I would find any justification for not including them to smell very fishy.
You should be aware that if you are doing a high volume of commits to the database, adding more indexes will affect the speed of insertion, but no index on id? Wow.
It would be good to get better justification of exactly how adding extra indexes might cause problems though.
The more indexes you have, the slower data inserts and modifications will be. Make sure that you add indexes only when appropriate and write queries that can take advantage of those indexes. Also, if the selectivity level of your index is low, it will not be used effectively.
I would say that if your server is having trouble with CPU time, indexes could be a solution. If you are querying tables without indexes, the server needs a lot more resources, and if the tables have millions of records it can become a serious problem. I recently brought a CPU down from 80-90% load all the time to 10-20% just by putting the right indexes in place.
If you are using MS SQL, you can check Activity Monitor to see which queries are expensive and create indexes based on their WHERE clauses or joins. Under Recent Expensive Queries you can then right-click an entry and view the complete query text.
For a few different reasons, one of my projects is hosted on a shared hosting server and developed in ASP.NET/C# with Access databases. (Not my choice, so don't laugh at this limitation; it's not from me.)
Most of my queries are on the last few records of the databases they are querying.
My question is in 2 parts:
1- Is the order of the records in the database only visual, or is there an actual difference internally? More specifically: the way it is currently designed, all records (for all databases in this project) are ordered ascending by a row-identifying key (an AutoNumber field). Since over 80% of my queries will hit fields that should be towards the end of the table, would it increase query performance if I set the table to show the most recent record at the top instead of at the end?
2- Is there any other performance tuning that can be done to help with Access tables?
"Access" and "performance" is an euphemism but the database type wasn't a choice
and so far it hasn't proven to be a big problem but if I can help the performance
I would sure like to do whatever I can.
Thanks.
Edit:
No, I'm not currently experiencing issues with my current setup, just trying to look forward and optimize everything.
Yes, I do have indexes and a primary key (which is automatically indexed) on the unique record identifier for each of my tables. I definitely should have mentioned that.
You're all saying the same thing: I'm already doing all that can be done for Access performance. I'll give the accepted answer to the one that was fastest to answer.
Thanks everyone.
As far as I know...
1 - That change would just be visual. There'd be no impact.
2 - Make sure your fields are indexed. If the fields you are querying on are unique, then make sure you make the fields a unique key.
Yes, there is an actual order to the records in the database. Setting the defaults in the table preferences isn't going to change that.
I would ensure there are indexes on all your WHERE-clause columns. That is only a rule of thumb; it would rarely be optimal, and you would have to do workload testing against different database setups to find the truly optimal solution.
I work daily with a legacy Access system that can be reasonably fast with concurrent users, but only for a smallish number of users.
You can use indexes on the fields you search for (aren't you already?).
http://www.google.com.br/search?q=microsoft+access+indexes
The order is most likely not the problem. Besides, I don't think you can really change it in Access anyway.
What is important is how you are accessing those records. Are you accessing them directly by the record ID? Whatever criteria you use to find the data you need, you should have an appropriate index defined.
By default, there will only be an index on the primary key column, so if you're using any other column (or combination of columns), you should create one or more indexes.
Don't just create an index on every column though. More indexes means Access will need to maintain them all when a new record is inserted or updated, which makes it slower.
Here's one article about indexes in Access.
Have a look at the field or fields you're using to query your data and make sure you have an index on those fields. If it works the same as SQL Server, you won't need to include the primary key in the index (assuming the table is clustered on it) as it's included by default.
If you're running queries on a small sub-set of fields you could get your index to be a 'covering' index by including all the fields required, there's a space trade-off here, so I really only recommend it for 5 fields or less, depending on your requirements.
Are you actually experiencing a performance problem now, or is this just a general optimization question? Also, from your post it sounds like you are talking about a db with one table; is that accurate? If you are already experiencing a problem and you are dealing with concurrent access, some answers might be:
1) indexing fields used in where clauses (mentioned already)
2) Splitting tables. For example, if 80% of your table rows are rarely accessed (as implied in your question), create an archive table for older records; a sketch follows after this list. Or, if the bulk of your performance hits come from reads (complicated reports) and you don't want to impinge on performance for people adding records, create a separate reporting table structure and query off of that.
3) If this is a reporting scenario, all queries are similar or the same, concurrency is somewhat high (very relative number given Access) and the data is not extremely volatile, consider persisting the data to a file that can be periodically updated, thus offloading the querying workload from the Access engine.
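A minimal sketch of the archive idea from point 2 (Jet/Access SQL; the table names are hypothetical, and OrdersArchive is assumed to already exist with the same structure):

    -- Copy old rows to the archive table, then remove them from the live table.
    -- Run periodically, e.g. once a month.
    INSERT INTO OrdersArchive
    SELECT * FROM Orders WHERE OrderDate < #01/01/2008#;

    DELETE FROM Orders WHERE OrderDate < #01/01/2008#;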
In regard to table order, Jet/ACE writes the actual table data in PK order. If you want a different order, change the PK.
But this oughtn't be a significant issue.
Indexes on the fields other than the PK that you sort on should make sorting pretty fast. I have apps with 100s of thousands of records that return subsets of data in non-PK sorted order more-or-less instantaneously.
I think you're engaging in "premature optimization," worrying about something before you actually have an issue.
The only circumstances in which I think you'd have a performance problem is if you had a table of 100s of thousands of records and you were trying to present the whole thing to the end user. That would be a phenomenally user-hostile thing to do, so I don't think it's something you should be worrying about.
If it really is a concern, then you should consider changing your PK from the Autonumber to a natural key (though that can be problematic, given real-world data and the prohibition on non-Null fields in compound unique indexes).
I've got a couple of things to add that I didn't notice being mentioned here, at least not explicitly:
Field length: create your fields as large as you'll need them, but don't go over. For instance, if you have a number field and the value will never be over 1000 (for the sake of argument), don't type it as a Long Integer; something smaller like Integer would be more appropriate. Likewise, use a Single instead of a Double for decimal numbers, and if you have a text field that won't hold more than 50 chars, don't set it up for 255. It sounds obvious, but it's often done with the idea that "I might need that space in the future", and your app suffers in the meantime.
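In Access DDL terms, the difference looks something like this (just a sketch; the names and exact types are illustrative):

    -- Right-sized fields: SHORT (Integer) instead of LONG (Long Integer),
    -- SINGLE instead of DOUBLE, TEXT(50) instead of TEXT(255).
    CREATE TABLE Widget (
        WidgetId   COUNTER CONSTRAINT PK_Widget PRIMARY KEY,
        Quantity   SHORT,      -- values never exceed 1000
        UnitPrice  SINGLE,
        WidgetName TEXT(50)
    );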
Not to beat the indexing thing to death, but tables that you're joining together in your queries should have relationships established; this will create indexes on the foreign keys, which greatly increases the performance of table joins. (NOTE: double-check any foreign keys to make sure they did indeed get indexed. I've seen cases where they weren't, so apparently a relationship doesn't necessarily mean that the proper indexes have been created.)
Apparently compacting your DB regularly can help performance as well, this reduces internal fragmentation of the file and can speed things up that way.
Access actually has a Performance Analyzer, under Tools > Analyze > Performance; it might be worth running it on your tables and queries at least to see what it comes up with. The Table Analyzer (available from the same menu) can help you split out tables with a lot of redundant data. Obviously, use it with caution, but it could be helpful.
This link has a bunch of material on Access performance optimization, covering pretty much all aspects of the database (tables, queries, forms, etc.); it's worth checking out for sure.
http://office.microsoft.com/en-us/access/hp051874531033.aspx
To understand the answers here, it is useful to consider how Access works: in an un-indexed table, there is unlikely to be any value in organising the data so that recently accessed records are at the end. Indeed, by virtue of the fact that Access / the JET engine is an ISAM database (http://en.wikipedia.org/wiki/ISAM), it's rather the other way around. That's rather moot, however, as I would never suggest putting frequently accessed values at the top of a table; it is best, as others have said, to rely on useful indexes.
Nowadays, I'm working on a database with no relations, PKs, or FKs, just raw data.
I can say that the database is just a set of papers.
When I asked about this, the answer I got was: "Hide the business."
Also, one of my friends said this always happens in "large systems".
In large systems, they are trying to hide their business through raw data.
Regarding development: relations, constraints, and validation are handled in the database using triggers, and of course in the user interface.
What do you think regarding this?
Well, this may have a point with large databases, when you need fast response to massive DML (INSERT / UPDATE / DELETE).
The problem is that if you rely on the database's built-in way of ensuring integrity, you can hardly optimize it.
There is also a thing called SQL/PLSQL context switching in Oracle: if you create an empty trigger on a table, it will slow DML down by about 20 times, purely because the trigger exists.
In Oracle, when you write an ON UPDATE trigger and update 50,000 rows in the table, the trigger (and the query in it) gets called 50,000 times. Foreign keys perform better, but they may also get laggy (and you can do nothing about the underlying queries).
In this case, it's better to put the values you want to apply into a temporary table, issue a single MERGE, check integrity before and after, and apply the business rules there. A single query that processes 50,000 rows works faster than a loop of 50,000 queries each processing a single row.
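A rough sketch of that pattern in Oracle (the table and column names are invented):

    -- Stage the new values, then apply them in one set-based statement instead
    -- of firing a row-level trigger 50,000 times.
    MERGE INTO accounts a
    USING staged_updates s
       ON (a.account_id = s.account_id)
    WHEN MATCHED THEN
        UPDATE SET a.balance    = s.balance,
                   a.updated_at = SYSDATE
    WHEN NOT MATCHED THEN
        INSERT (account_id, balance, updated_at)
        VALUES (s.account_id, s.balance, SYSDATE);

    -- Integrity and business-rule checks can be run against staged_updates
    -- before the MERGE and against accounts afterwards.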
Of course, it's very hard to implement and only pays for itself when you have a really large database and need to perform really massive updates on it.
In Oracle, in any case, FOREIGN KEY constraints perform better than triggers implementing the same functionality.
PRIMARY KEYs will most likely improve performance, as a primary key implies creating a UNIQUE INDEX on the constrained field, and this index may be used efficiently in queries. A UNIQUE INDEX is also the natural and most efficient way to enforce uniqueness.
But of course, like any index, it slows down INSERTs, and those UPDATEs and DELETEs whose WHERE condition is not selective.
I.e. if you need to UPDATE or DELETE 1 row out of 2,000,000, the index is your friend; if you need to UPDATE or DELETE 1,500,000 rows out of 2,000,000, the index is your enemy. It's a matter of tradeoff.
You may also see my answer here.
I would say sacrificing data integrity for security through obscurity is a bad trade.
I think I came across at least two applications with databases lacking relations and FKs.
The idea is probably that it's more difficult to reverse engineer the brillant database schema.
The side effect is that often the applications are not so good at checking constraints themselves, leading to a lot of rubbish data in the database, which in fact does make it more difficult to reverse engineer, as FK constraints are not enforced ;)
My view is that once it's a database, somebody else can look into it, and trying to work around this "feature" of visibility is pointless and generally Not a Good Thing, considering the drawbacks (no relations, no SPs, no triggers, etc).
Primary keys, relations, etc. are tools for making database development easier and for making the final result faster and more efficient. I can only think of a few rare cases where not having a key/index would be a good idea.
Did your friend explain why they held this view?
This kind of thing can happen in financial systems. It's the opposite of what you'd expect; you'd think that because it's finance, best practice would be applied more rigorously. However, the converse is often true. Many of these databases may have started out in Excel.
I have seen a number of databases where they don't bother with FKs or PKs. I can't say I like it, but sometimes you just have to live with it, or leave and go work somewhere else for less money but with more database integrity.
Perhaps this is why they need you.
(me=MSSQL)
Deleting a row from a large table that has lots of FKs is slow.
Our app has never had a DELETE denied by an FK constraint violation on that table.
I have considered dropping the FKs to improve performance.
Too chicken to have actually done it though :(
PS We would keep the FKs on the DEV / TEST systems
Maybe this was converted from a former system that used a file based structure to hold the data. The tables in the new database are just a reflection of the individual files.