I have about 200 settings per user. These include notification settings and settings for tracking user activity on objects. The problem is how to store them in the DB. Should each setting be a row or a column? If a column, the table will have 200 columns. If a row, the table only needs about 3 columns, but 200 rows per user x even 10 million users = not good.
So how else can I store all these settings?
These settings are a mix of text entry and FK lookups to other tables.
Serializing the data almost always turns out to be a bad idea, because in doing so you cripple the dbms. All of the man years that went into producing an efficient dbms will be wasted on a serialized bucket of bits.
If you have application logic hooked up against each setting, I think you should implement it as either:
1 column per setting in the settings table.
This makes it easier to leverage the power of your dbms, with constraint checking, referential integrity, correct data type for your values, plenty of information to the optimizer. The downside is that row size grows.
or
1 table per setting (or group of related settings).
This has all of the benefits of the above, but trades row size for a performance penalty when you need to fetch most or all of the settings at once. When settings are optional, this alternative will be significantly smaller if the actual data is sparse.
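To make the two options concrete, here is a minimal SQL sketch; the setting names (notify_email, signature_text and so on) are invented for illustration and are not from the question:

    -- Option A: one column per setting in a single settings table
    CREATE TABLE users (
        user_id INT PRIMARY KEY
        -- ... username, password, etc.
    );

    CREATE TABLE user_settings (
        user_id        INT PRIMARY KEY REFERENCES users(user_id),
        notify_email   BIT NOT NULL DEFAULT 1,     -- simple flag setting
        digest_freq_id INT NULL,                   -- FK lookup setting, e.g. to a digest_frequencies table
        signature_text VARCHAR(500) NULL           -- free-text setting
        -- ... one column per remaining setting
    );

    -- Option B: one table per group of related settings (friendlier when data is sparse)
    CREATE TABLE user_notification_settings (
        user_id      INT PRIMARY KEY REFERENCES users(user_id),
        notify_email BIT NOT NULL DEFAULT 1,
        notify_sms   BIT NOT NULL DEFAULT 0
    );

    CREATE TABLE user_tracking_settings (
        user_id        INT PRIMARY KEY REFERENCES users(user_id),
        track_activity BIT NOT NULL DEFAULT 1,
        retention_days INT NULL CHECK (retention_days > 0)
    );

In both cases the DBMS can enforce types, defaults, check constraints and referential integrity per setting.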
Also, lots of columns is often a "smell", that suggests you haven't normalized your data correctly, but it doesn't have to be that way. Only you know your data.
That you have 200 settings to track suggests a flexible database schema. As such I'd suggest a hybrid approach:
Users: a table with one row per user, for properties that a user will likely always have, such as username and password. This table may also hold foreign keys, but that is a heuristic and depends on whether the relationship is zero-to-one or zero-to-many; the latter requires a separate table.
Features: two options
table with row per feature, effectively a hashtable, with userId, name and value columns. This could also be a spot for foreign relationships, but you would not be able to enforce data integrity in this setup.
XML, but only with a database whose features allow you to query inside the XML, or only for data that you do not need to query and will only work with on your application server.
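A rough sketch of the hashtable-style option, with invented names; note that the value column has to fall back to a generic text type, which is exactly why integrity is hard to enforce here:

    CREATE TABLE user_features (
        user_id INT NOT NULL,            -- FK to the users table
        name    VARCHAR(100) NOT NULL,   -- setting name
        value   VARCHAR(4000) NULL,      -- everything stored as text
        PRIMARY KEY (user_id, name)
    );

    -- Fetching one setting vs. all settings for a user
    SELECT value FROM user_features WHERE user_id = 42 AND name = 'notify_email';
    SELECT name, value FROM user_features WHERE user_id = 42;

Only settings a user has actually set need a row, which keeps sparse data small, but the DBMS can no longer check types or foreign keys per setting.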
I think the bigger answer is that you are not going to arrive at one solution from your original question; instead you will need to use both approaches to suit the data.
I think 200 columns is definitely not a good idea, because of the difficulty in writing stored procs, manually viewing the data, or extending to more settings later.
You could try XML for all these 200 settings; then you will have only one row per user: the username and the corresponding settings XML. Again, it will limit your querying capabilities, although many DBs now support querying XML. You can also specifically check out XML DBs.
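If you do go the XML route, a sketch on SQL Server might look like the following (the element names are made up); the xml type lets you query into the document with XQuery, at some cost:

    -- One row per user, all settings in a single XML column (SQL Server syntax)
    CREATE TABLE user_settings_xml (
        user_id  INT PRIMARY KEY,
        settings XML NOT NULL
    );

    INSERT INTO user_settings_xml (user_id, settings)
    VALUES (42, '<settings><notify_email>1</notify_email><theme>dark</theme></settings>');

    -- Pull a single setting back out with XQuery
    SELECT settings.value('(/settings/notify_email)[1]', 'bit') AS notify_email
    FROM   user_settings_xml
    WHERE  user_id = 42;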
Related
I have to create a database to store information being sent and received to / from a 3rd party web service portal. There are about 150 fields of information to be sent, though I can remove about 50 of those fields by normalising (there are three sets of addresses that can be saved in an address table, for example). However, this still leaves a table that could potentially have 100 columns.
I've come up with two ways of handling this though I'm not sure which to use:
1. Have a table with 100 columns and three references to an address table.
2. Break it down into maybe 15-20 separate dedicated tables.
Option 1 seems the quickest as it involves the fewest joins, but the idea of a table with 100 columns doesn't feel right.
Option 2 feels better and would break things down into more manageable chunks, but it won't save any database space and will increase the number of joins. Pretty much all the columns in the database will have a value, and I cannot normalise these columns any further.
My question is: in this situation, is it acceptable to have a table with c.100 columns in it, or should I try to break it down over several tables for presentation?
Please note: the table structure will not change over the course of its usage; a new database would be created for a new version of the web service portal. I have no control over the web service data structure.
Edit: @Oded's answer below has made me think a bit more about how the data will be accessed; it will really only be accessed in whole and not in part. I wouldn't, for example, need to return columns 5-20 on a regular basis.
Answer: I accepted Oded's answer; the comments posted after it helped me make up my mind, and I decided to go with option 1. As the data is accessed in full, having one table seems the better solution. If, for example, I regularly wanted to access columns 5-20 rather than the full table row, then I'd look at breaking it up into separate tables for performance reasons.
Speaking from a relational purist point of view - first, there is nothing against having 100 columns in a table, if they are related. The point here is that if after normalizing you still have 100 columns, that's OK.
But you should normalize, and in the process you may very well end up with 15-20 separate dedicated tables, which most relational database professionals would agree is a better design (it avoids data duplication and the associated update/delete anomalies, gives a smaller data footprint, etc.).
Pragmatically, however, if there is a measurable performance problem, it may be sensible to denormalize your design for performance benefit. The key here is measurable. Don't optimize before you have an actual problem.
In that respect, I'd say you should go with the set of 15-20 tables as an initial design.
From MSDN: Maximum Capacity Specifications for SQL Server:
Columns per nonwide table: 1,024
Columns per wide table: 30,000
So I think 100 columns is OK in your case. You may also want to note (from the same link):
Columns per primary key: 16
Of course, this only applies if you need the data purely as a log for the service.
If, after reading from the service, you need to maintain the data, then normalising seems better...
If you find it easier to "manage" tables with fewer columns, however you happen to define manageability (e.g. less horizontal scrolling when looking at the table data in SSMS), you can break the table up into several tables with 1-to-1 relationships without violating the rules of normalization.
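As an illustration of that kind of 1-to-1 split, with invented table and column names, the second table simply reuses the first table's primary key:

    -- Parent table with the core columns
    CREATE TABLE service_message (
        message_id INT PRIMARY KEY,
        sent_at    DATETIME    NOT NULL,
        status     VARCHAR(20) NOT NULL
        -- ... other frequently used columns
    );

    -- 1-to-1 extension table holding another slice of the columns
    CREATE TABLE service_message_detail (
        message_id INT PRIMARY KEY REFERENCES service_message(message_id),
        field_21   VARCHAR(100) NULL,
        field_22   VARCHAR(100) NULL
        -- ... remaining columns
    );

    -- Reassemble the full row when needed
    SELECT m.*, d.*
    FROM   service_message m
    JOIN   service_message_detail d ON d.message_id = m.message_id;

The shared primary key keeps the split lossless; it is purely a manageability choice, not extra normalization.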
I have a table of about 1,000 records. Around half of them will utilise a set of fields containing certain characteristics; there are about 10 relevant fields. The other half of the records won't need that information filled in.
This table is central to the database and will be taking the bulk of the operations. Though at only around 1,000 records, it's not much.
The hardware that the database is stored on is old and slow (spinning hard drive not SSD... ) so I want to have a fairly optimised structure to make the most of it. Obviously the increased size of the database alone due to the blank fields isn't a major concern, but if it's slowing down queries then that's not good.
I guess I should describe the setup. Currently it's an Access 2007 client and an Access backend, though the backend will soon move to SQL Server. Currently the backend is on the main server rack, but when moved to SQL Server it will get its own older server rack.
So should I make a separate table to store the aforementioned set of characteristics, or should I leave it as is?
The querying overhead of putting the optional fields into a separate table and then using a join doesn't provide much benefit to size or data management, especially if it's 1-to-1 like in your example. For size, the optional fields being NULL won't affect you much. And yes, 75% is a reasonable rough threshold for when you should start moving things out, but even then you're not actually normalizing anything by moving out the optional fields (if they are 1-to-1 with the record and you will always fetch them along with the main record).
Worth noting: with most DBs, getting large rows in single queries is better than issuing several small queries, in case you later have the urge to fetch the optional data from the second table in a separate query. In Access 2007 this may matter less, though.
And regardless of whether or not you move those optional fields out, add indexes on the fields you use in a WHERE/HAVING/JOIN.
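For example, assuming a hypothetical orders table with these columns, you would index just the columns that actually appear in your filters and joins:

    -- Index the columns used for filtering and joining, not every column
    CREATE INDEX idx_orders_customer_id ON orders (customer_id);
    CREATE INDEX idx_orders_status_date ON orders (status, order_date);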
My impression from what you've said is that you should use separate tables. The dependencies you want to represent and the needs of data integrity ("business rules") should determine which table(s) any attribute goes in.
In your case it sounds like you have two kinds of facts to be represented. Those fact types have distinct sets of attributes and therefore they belong in different tables. If you combine two different fact types into one table and make one set of attributes nullable then you could compromise data integrity: i.e. by permitting values for some attribute when the business rules require no such value and by allowing a value to be absent when business rules in fact require it.
For a more formal way of answering this, see Fifth Normal Form and the Principle of Orthogonal Design. If you aren't already aware of those design principles then you should familiarise yourself with them.
Vertical partitioning makes sense for a large data set to make the cache more efficient. 1000 rows doesn't qualify as "large" even on a rather old hardware.
So unless there are other reasons to redesign this table (you didn't merge lookups into it, did you?), you are good to go.
I was going to include 'status', 'date_created' and 'date_updated' columns in every table in the database.
'status' is for soft deletion of rows.
Then I've seen a few people also add 'user_created' and 'user_updated' columns to each table.
If I add those columns too, then I will have at least 5 columns for every table.
Will this be too much overhead?
Do you think it's a good idea to have those five columns?
Also, does the 'user' in 'user_created' mean database user? or application user?
As per the comments above, I would advise adding auditing only to those tables that actually require it.
You generally want to audit the application user - in many instances, applications (such as Web or SOA) may be connecting all users with the same credential, so storing the DB login is pointless.
IMHO, the date created / last date updated / last updated by patterns never give the full picture, as you will only be able to see who made the last change, not what was changed. If you are doing auditing, I would suggest that instead you do a full change audit using patterns such as an audit trigger. You can also avoid using triggers if your inserts / updates / deletes to your tables are encapsulated, e.g. via stored procedures. True, the audit tables will grow very large, but they will generally not be queried much (generally just in witch-hunts), and they can be archived, easily partitioned by date, and made read-only. With a separate audit table, you won't need a DateCreated or LastDateUpdated column, as this can be derived. You will generally still need the last-change user, however, as SQL will not be able to derive the application user.
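Here is a sketch of the audit-trigger pattern on SQL Server, using a made-up Customer table; the LastChangedBy value is assumed to be written on the base row by the application, since SQL cannot derive the application user:

    -- Hypothetical base table; the application sets LastChangedBy on every change
    CREATE TABLE Customer (
        CustomerId    INT          PRIMARY KEY,
        Name          VARCHAR(100) NOT NULL,
        LastChangedBy VARCHAR(100) NULL
    );

    -- Audit table capturing every change
    CREATE TABLE CustomerAudit (
        AuditId       INT IDENTITY(1,1) PRIMARY KEY,
        CustomerId    INT          NOT NULL,
        Name          VARCHAR(100) NULL,
        AuditAction   CHAR(1)      NOT NULL,               -- 'I', 'U' or 'D'
        AuditDate     DATETIME     NOT NULL DEFAULT GETDATE(),
        LastChangedBy VARCHAR(100) NULL
    );
    GO
    CREATE TRIGGER trg_Customer_Audit ON Customer
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Rows in "inserted" are inserts, or the new side of updates
        INSERT INTO CustomerAudit (CustomerId, Name, AuditAction, LastChangedBy)
        SELECT i.CustomerId, i.Name,
               CASE WHEN EXISTS (SELECT 1 FROM deleted d WHERE d.CustomerId = i.CustomerId)
                    THEN 'U' ELSE 'I' END,
               i.LastChangedBy
        FROM inserted i;

        -- Rows only in "deleted" are deletes
        INSERT INTO CustomerAudit (CustomerId, Name, AuditAction, LastChangedBy)
        SELECT d.CustomerId, d.Name, 'D', d.LastChangedBy
        FROM deleted d
        WHERE NOT EXISTS (SELECT 1 FROM inserted i WHERE i.CustomerId = d.CustomerId);
    END;

With this in place, created/updated dates and the full change history can all be derived from CustomerAudit.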
If you do decide on logical deletes, I would avoid using 'status' as the field indicating a logical delete, as it is likely you have tables which model a genuine process state (e.g. payment status). Using a bit or char field such as ActiveYN or IsActive is common for logical deletes.
Logical deletes can be cumbersome, as all your queries will need to filter out Active = N records, and keeping deleted records in your transaction tables can make those tables larger than necessary, especially on many-to-many / junction tables. Performance can also be impacted, as a 2-state field is unlikely to be selective enough to be useful in indexes. In this case, physical deletes with the full audit might make better sense.
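A minimal sketch of that logical-delete pattern, using an invented Payment table:

    CREATE TABLE Payment (
        PaymentId INT PRIMARY KEY,
        Amount    DECIMAL(10,2) NOT NULL,
        IsActive  CHAR(1) NOT NULL DEFAULT 'Y'   -- logical-delete flag
    );

    -- "Deleting" a payment just flips the flag
    UPDATE Payment SET IsActive = 'N' WHERE PaymentId = 42;

    -- Every query now has to remember the filter
    SELECT PaymentId, Amount FROM Payment WHERE IsActive = 'Y';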
I've used all five before, sure. When I want to track who, through a web app, is creating and (last) editing records, and when that happens, I include timestamps and the logged-in user (but not the DB user; that's not how my system is set up; we use one account for all DB interaction).
Likewise, status can also be useful if users are changing a record's, well, status. If it goes from being "Online" to "Offline" to "Archive", that record can reflect that.
However, I don't use these for every table, nor should you. Sometimes I have tables that are meant only to store parts of a record (normalized), or that simply have no need for a status or for created-when/by-whom columns.
What you should be considering for every table is a primary key field. Unless you are more sophisticated in your approach than you sound, you will almost always want one. Some things don't necessarily need one (a states list, for instance, could just put a unique constraint on the abbreviation). But this is more important to most of your tables than a series of timestamp and status fields.
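For illustration (table names invented), the usual surrogate-key case next to the states-list case mentioned above:

    -- Typical surrogate primary key
    CREATE TABLE customer (
        customer_id INT IDENTITY(1,1) PRIMARY KEY,
        name        VARCHAR(100) NOT NULL
    );

    -- A states list can get by with a natural key instead
    CREATE TABLE state (
        abbreviation CHAR(2) PRIMARY KEY,    -- or a UNIQUE constraint alongside a surrogate key
        state_name   VARCHAR(50) NOT NULL
    );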
Simple answer - only put it in your database what you need in your database.
For a few different reasons one of my projects is hosted on a shared hosting server and developed in ASP.NET/C# with Access databases (not a choice, so don't laugh at this limitation; it's not from me).
Most of my queries are on the last few records of the databases they are querying.
My question is in 2 parts:
1- Is the order of the records in the database only visual, or is there an actual difference internally? More specifically, the reason I ask is that, as currently designed, all records (for all databases in this project) are ordered ascending by a row-identifying key (an AutoNumber field); but since over 80% of my queries will be querying fields that should be towards the end of the table, would it increase query performance if I set the table to show the most recent record at the top instead of at the end?
2- Are there any other performance tuning that can be done to help with access tables?
"Access" and "performance" is an euphemism but the database type wasn't a choice
and so far it hasn't proven to be a big problem but if I can help the performance
I would sure like to do whatever I can.
Thanks.
Edit:
No, I'm not currently experiencing issues with my current setup, just trying to look forward and optimize everything.
Yes, I do have indexes, and I have a primary key (automatically indexed) on the unique record identifier for each of my tables. I definitely should have mentioned that.
You're all saying the same thing: I'm already doing all that can be done for Access performance. I'll give the "accepted answer" to whoever was fastest to answer.
Thanks everyone.
As far as I know...
1 - That change would just be visual. There'd be no impact.
2 - Make sure your fields are indexed. If the fields you are querying on are unique, then make sure you make the fields a unique key.
Yes there is an actual order to the records in the database. Setting the defaults on the table preference isn't going to change that.
I would ensure there are indexes on all your where clause columns. This is a rule of thumb. It would rarely be optimal, but you would have to do workload testing against different database setups to prove the most optimal solution.
I work daily with a legacy Access system that can be reasonably fast with concurrent users, but only for a smallish number of users.
You can use indexes on the fields you search for (aren't you already?).
http://www.google.com.br/search?q=microsoft+access+indexes
The order is most likely not the problem. Besides, I don't think you can really change it in Access anyway.
What is important is how you are accessing those records. Are you accessing them directly by the record ID? Whatever criteria you use to find the data you need, you should have an appropriate index defined.
By default, there will only be an index on the primary key column, so if you're using any other column (or combination of columns), you should create one or more indexes.
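For example, in Access you can add the index from the table design view, or with Jet DDL like the following, assuming a hypothetical Orders table with a CustomerID field (run it from the SQL view of a query or via ADO):

    CREATE INDEX idxOrdersCustomer ON Orders (CustomerID);

    -- Make it unique if the column's values should never repeat
    CREATE UNIQUE INDEX idxOrdersNumber ON Orders (OrderNumber);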
Don't just create an index on every column though. More indexes means Access will need to maintain them all when a new record is inserted or updated, which makes it slower.
Here's one article about indexes in Access.
Have a look at the field or fields you're using to query your data and make sure you have an index on those fields. If it's the same as SQL Server, you won't need to include the primary key in the index (assuming it's clustering on this) as it's included by default.
If you're running queries on a small sub-set of fields you could get your index to be a 'covering' index by including all the fields required, there's a space trade-off here, so I really only recommend it for 5 fields or less, depending on your requirements.
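On SQL Server, that kind of covering index uses the INCLUDE clause; here is a sketch with an invented Orders table (Access itself has no INCLUDE, so there you would simply add the extra fields to the index key):

    CREATE TABLE Orders (
        OrderID     INT PRIMARY KEY,
        CustomerID  INT           NOT NULL,
        OrderDate   DATETIME      NOT NULL,
        TotalAmount DECIMAL(10,2) NOT NULL
    );

    -- The WHERE column is the key; the selected columns ride along in INCLUDE
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerID_Covering
        ON Orders (CustomerID)
        INCLUDE (OrderDate, TotalAmount);

    -- This query can be answered from the index alone (a covering index)
    SELECT OrderDate, TotalAmount FROM Orders WHERE CustomerID = 42;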
Are you actually experiencing a performance problem now or is this just a general optimization question? Also from your post it sounds like you are talking about a db with 1 table, is that accurate? If you are already experiencing a problem and you are dealing with concurrent access, some answers might be:
1) indexing fields used in where clauses (mentioned already)
2) Splitting tables. For example, if 80% of your table rows are rarely accessed (as implied in your question), create an archive table for older records. Or, if the bulk of your performance hits come from reads (complicated reports) and you don't want to impinge on performance for people adding records, create a separate reporting table structure and query off that.
3) If this is a reporting scenario, all queries are similar or the same, concurrency is somewhat high (very relative number given Access) and the data is not extremely volatile, consider persisting the data to a file that can be periodically updated, thus offloading the querying workload from the Access engine.
In regard to table order, Jet/ACE writes the actual table data in PK order. If you want a different order, change the PK.
But this oughtn't be a significant issue.
Indexes on the fields other than the PK that you sort on should make sorting pretty fast. I have apps with 100s of thousands of records that return subsets of data in non-PK sorted order more-or-less instantaneously.
I think you're engaging in "premature optimization," worrying about something before you actually have an issue.
The only circumstances in which I think you'd have a performance problem is if you had a table of 100s of thousands of records and you were trying to present the whole thing to the end user. That would be a phenomenally user-hostile thing to do, so I don't think it's something you should be worrying about.
If it really is a concern, then you should consider changing your PK from the Autonumber to a natural key (though that can be problematic, given real-world data and the prohibition on non-Null fields in compound unique indexes).
I've got a couple of things to add that I didn't notice being mentioned here, at least not explicitly:
Field length: create your fields as large as you'll need them, but don't go over. For instance, if you have a number field and the value will never be over 1000 (for the sake of argument), then don't type it as a Long Integer; something smaller like Integer would be more appropriate. Likewise, use a Single instead of a Double for decimal numbers, and if a text field won't hold more than 50 chars, don't set it up for 255, and so on. It sounds obvious, but it's done all the time, often with the idea that "I might need that space in the future", and your app suffers in the meantime.
Not to beat the indexing thing to death... but tables that you're joining together in your queries should have relationships established; this creates indexes on the foreign keys, which greatly increases the performance of table joins. (NOTE: double-check any foreign keys to make sure they did indeed get indexed. I've seen cases where they haven't been, so apparently a relationship doesn't necessarily mean that the proper indexes have been created.)
Apparently compacting your DB regularly can help performance as well; it reduces internal fragmentation of the file and can speed things up that way.
Access actually has a Performance Analyzer, under Tools > Analyze > Performance; it might be worth running it on your tables and queries at least to see what it comes up with. The Table Analyzer (available from the same menu) can help you split out tables with a lot of redundant data. Obviously, use it with caution, but it could be helpful.
This link has a bunch of stuff on access performance optimization on pretty much all aspects of the database, tables, queries, forms, etc - it'd be worth checking out for sure.
http://office.microsoft.com/en-us/access/hp051874531033.aspx
To understand the answers here, it is useful to consider how Access works: in an un-indexed table there is unlikely to be any value in organising the data so that recently accessed records are at the end. Indeed, by virtue of the fact that Access / the JET engine is an ISAM database (http://en.wikipedia.org/wiki/ISAM), it's the other way around. That's rather moot, however, as I would never suggest putting frequently accessed values at the top of a table; as others have said, it is best to rely on useful indexes.
Our DBA (with only 2 years of Google for training) has created a massive data management table (108 columns and growing) containing all the necessary attributes for any data flow in the system. We'll call this table BFT for short.
Of these columns:
10 are for meta-data references.
15 are for data source and temporal tracking
1 instance of new/curr columns for textual data
10 instances of new/current/delta/ratio/range columns for multi-value numeric updates, totaling 50 columns.
Multi valued numeric updates usually only need 2-5 of the update groups.
Batches of 15K-1500K records are loaded into the BFT and processed by stored procs with logic to validate those records and shuffle them off to permanent storage in about 30 other tables.
In most of the record loads, 50-70 of the columns are empty throughout the entire process.
I am no database expert, but this model and process seems to smell a little, but I don't know enough to say why, and don't want to complain without being able to offer an alternative.
Given this very small insight into the data processing model, does anyone have thoughts or suggestions? Can the database (SQL Server) be trusted to handle records with mostly empty columns efficiently, or does processing in this manner waste lots of cycles, memory, etc.?
Sounds like he reinvented BizTalk.
I typically have multiple staging tables corresponding to the input loads. These may or may not correspond to the destination tables, but we don't do what you're talking about. If he doesn't like to have a lot of what are basically temporary work tables, they could be put into their own schema or even a separate database.
As far as the columns which are empty: if they aren't referenced in the particular query which is processing the BFT, it doesn't matter. HOWEVER, it becomes much more crucial that the index chosen is a non-clustered covering index. When your BFT is used and a table scan or clustered index scan is chosen, the unused columns have to be read and ignored or skipped, and this definitely seems to affect processing in my experience. Whereas with a non-clustered index scan or seek, fewer columns are read, and hopefully this doesn't include (m)any of the unused columns.
Normalization is the keyword here. If you have so many NULL values, chances are high that you're wasting a lot of space. Normalizing the table should also make data integrity in this table easier to enforce.
One thing that might make things a little more flexible (other than normalizing) could be to create one or more views or table functions to present the data. Particularly if the table is outside your control, these would enable you to filter the spurious crap out and grab only what you need from the table.
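For instance, a narrow view over the wide table lets each process see only the columns it actually uses; the BFT column names below are invented for illustration:

    -- View exposing only the columns one process cares about
    CREATE VIEW dbo.BFT_NumericUpdates
    AS
    SELECT  record_id,
            source_system,
            load_batch_id,
            value_new_1, value_curr_1, value_delta_1
    FROM    dbo.BFT
    WHERE   value_new_1 IS NOT NULL;   -- skip rows where this update group is unused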
However, if you're going to be one of the people who will be working with (and frowning every time you have to crack open) that massive table, you might want to trump the DBA's "design" and normalize that beast, and maybe give the DBA the task of creating some views and/or table functions to help you out.
I currently work with a similar but not so huge table which has been around on our system for years and has had new fields and indices and constraints rather hastily tacked on Frankenstein-style. Unfortunately some other workgroups rely on the structure as gospel, so we've created such views and functions to enable us to "shape" the data the way we need it.