Store 2-dim table of attendance in database? - database

We have a web-based application, backed by a MySQL database.
One part of the system that we're coding requires us to store attendance (i.e. yes/no) to sessions for users for each day of a week. For example, we'd need to store Monday through to Friday, then for each day, morning, lunch, afternoon, evening sessions etc. So essentially it's a 2-dim array.
I was wondering what's the cleanest way of storing this in the database?
At the moment, the person working on this seems to be leaning towards storing this as one int for each day, with 1's representing attendance and 0's representing not attending. I think what they mean to do is use a bitmask (e.g. 13 for 1101, so every session except afternoon). They're just storing it as literal 0's and 1's for some strange reason.
I thought it might be easier to store it as a list of bools (bits/tinyints), e.g. monday_morning, monday_lunch, monday_afternoon etc., as it's semantically more "correct" (I think?), it'll probably be easier to extend/maintain, and I also seem to be the only one on the team with any inkling of how to do bit-operations...lol.
Another way I was thinking was just to have a 1:1 table for each user, with a list of all the times they are attending, for example. Efficiency of this approach? (Not sure what sort of read/write patterns, but I'm guessing a fairly even spread of read/modifies).
What are some recommendations on this? Or are there better ways of storing this data?
Also, as a side-note, it probably will be boolean - it's doubtful we'll need to store more states than attending/not-attending in the table, and if we do, we are prepared to re-work the schema. Or do people suggest strongly going for ints over bits?
Cheers,
Victor

I would normalize it and have three tables: users, sessions, and sessions_attended. Users would contain information about the user, sessions would contain information about the session, and sessions_attended would be a join table indicating which sessions the user attended. Index your tables properly and the resulting joins should be pretty efficient.
select u.name, s.name
from users u join sessions_attended a on u.user_id = a.user_id
join sessions s on s.session_id = a.session_id
where s.course = ...some course id...
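As a rough, runnable sketch of the three-table layout described above, the following uses Python's built-in sqlite3 module as a stand-in for MySQL. The column names (day, slot) and the sample rows are assumptions made for illustration, not part of the original answer.

```python
import sqlite3

# SQLite stands in for MySQL; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE sessions (
    session_id INTEGER PRIMARY KEY,
    day  TEXT NOT NULL,   -- e.g. 'Monday'
    slot TEXT NOT NULL    -- e.g. 'morning'
);
-- Join table: one row per (user, session) that was actually attended.
CREATE TABLE sessions_attended (
    user_id    INTEGER NOT NULL REFERENCES users(user_id),
    session_id INTEGER NOT NULL REFERENCES sessions(session_id),
    PRIMARY KEY (user_id, session_id)
);
INSERT INTO users VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO sessions VALUES (1, 'Monday', 'morning'), (2, 'Monday', 'lunch');
INSERT INTO sessions_attended VALUES (1, 1), (1, 2), (2, 2);
""")

def attendees(day, slot):
    """Return the names of users who attended the given session, sorted."""
    rows = conn.execute(
        """SELECT u.name
           FROM users u
           JOIN sessions_attended a ON u.user_id = a.user_id
           JOIN sessions s ON s.session_id = a.session_id
           WHERE s.day = ? AND s.slot = ?
           ORDER BY u.name""",
        (day, slot),
    ).fetchall()
    return [name for (name,) in rows]

print(attendees("Monday", "lunch"))
```

Absence is represented simply by the lack of a row in sessions_attended, and the composite primary key prevents recording the same attendance twice.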

Your second approach (the individual columns) is "more correct" in that it doesn't violate first normal form. The bitmask approach does, since you're storing more than one value in a single column (you're storing values for multiple sessions).
And don't store a bit internally. You aren't going to see any decrease in storage over, say, a tinyint (the engine isn't going to allocate exactly one bit for you, it will just restrict the acceptable values). You may as well use a tinyint and give yourself some breathing room.
Edit
As pointed out by Mark, if you have multiple bit columns it can pack them into a single byte, but worrying about whether the data takes up one byte or four is likely a premature optimization. The most normalized solution is the one suggested where you have an individual table that indicates which sessions the participant attended. If your sessions truly are fixed, then I would likely go with having separate columns for each session over either the bitmask or the fully normalized solution.
The bitmask obfuscates the data and requires bitwise operations (obviously). These can be confusing in query syntax, since you're making multiple uses of the words or and and. This approach also can't be indexed, so finding all participants who attended, say, the morning or the morning and evening sessions will require a table scan every time.
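To make the bitwise cost concrete, here is a small sketch of the bitmask approach. The bit assignments are an assumption chosen to match the question's example (13 = 1101 meaning every session except afternoon), and sqlite3 is used as a stand-in for MySQL.

```python
import sqlite3

# Assumed bit layout, most significant bit first: morning, lunch, afternoon, evening.
MORNING, LUNCH, AFTERNOON, EVENING = 0b1000, 0b0100, 0b0010, 0b0001
NAMES = {MORNING: "morning", LUNCH: "lunch", AFTERNOON: "afternoon", EVENING: "evening"}

def sessions_in(mask):
    """Decode a day's bitmask into the list of attended session names."""
    return [name for bit, name in NAMES.items() if mask & bit]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE attendance (user_id INTEGER PRIMARY KEY, monday INTEGER NOT NULL)")
conn.executemany("INSERT INTO attendance VALUES (?, ?)", [(1, 13), (2, 2), (3, 9)])

def attended_all(required):
    """Users whose Monday mask has every bit in `required` set.

    The predicate applies a bitwise expression to the column, so the
    database has to evaluate it row by row.
    """
    rows = conn.execute(
        "SELECT user_id FROM attendance WHERE (monday & ?) = ? ORDER BY user_id",
        (required, required),
    ).fetchall()
    return [uid for (uid,) in rows]

print(sessions_in(13))
print(attended_all(MORNING | EVENING))
```

Anyone reading the query has to know the bit layout to understand what `(monday & 9) = 9` means, which is the obfuscation the answer refers to.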
The fully normalized solution will complicate queries of the data. While it will support indexing, it will require a full join for every session type you want to check.
The one-column-per-session approach seems like the best solution. You're still only dealing with one row of data, but you can also query with meaningful syntax and take advantage of indexes.
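A minimal sketch of the one-column-per-session layout follows, again using sqlite3 in place of MySQL. Only Monday's columns are shown, and the table name, index name and sample rows are assumptions. Whether an index on a two-valued column actually helps depends on the data distribution, so treat the index as something to benchmark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE weekly_attendance (
    user_id          INTEGER PRIMARY KEY,
    monday_morning   INTEGER NOT NULL DEFAULT 0,
    monday_lunch     INTEGER NOT NULL DEFAULT 0,
    monday_afternoon INTEGER NOT NULL DEFAULT 0,
    monday_evening   INTEGER NOT NULL DEFAULT 0
    -- the remaining days would follow the same pattern
);
CREATE INDEX idx_monday_morning ON weekly_attendance (monday_morning);
INSERT INTO weekly_attendance VALUES
    (1, 1, 1, 0, 1),
    (2, 0, 0, 1, 0),
    (3, 1, 0, 0, 1);
""")

def morning_and_evening():
    """Users who attended both the Monday morning and Monday evening sessions."""
    rows = conn.execute(
        """SELECT user_id FROM weekly_attendance
           WHERE monday_morning = 1 AND monday_evening = 1
           ORDER BY user_id"""
    ).fetchall()
    return [uid for (uid,) in rows]

print(morning_and_evening())
```

The query reads like the question being asked, with no knowledge of a bit layout required.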

Related

Bitemporal Database Design Question

I am designing a database that needs to store transaction time and valid time, and I am struggling with how to effectively store the data and whether or not to fully time-normalize attributes. For instance I have a table Client that has the following attributes: ID, Name, ClientType (e.g. corporation), RelationshipType (e.g. client, prospect), RelationshipStatus (e.g. Active, Inactive, Closed). ClientType, RelationshipType, and RelationshipStatus are time varying fields. Performance is a concern as this information will link to large datasets from legacy systems. At the same time the database structure needs to be easily maintainable and modifiable.
I am planning on splitting out audit trail and point-in-time history into separate tables, but I’m struggling with how to best do this.
Some ideas I have:
1)Three tables: Client, ClientHist, and ClientAudit. Client will contain the current state. ClientHist will contain any previously valid states, and ClientAudit will be for auditing purposes. For ease of discussion, let’s forget about ClientAudit and assume the user never makes a data entry mistake. Doing it this way, I have two ways I can update the data. First, I could always require the user to provide an effective date and save a record out to ClientHist, which would result in a record being written to ClientHist each time a field is changed. Alternatively, I could only require the user to provide an effective date when one of the time varying attributes (i.e. ClientType, RelationshipType, RelationshipStatus) changes. This would result in a record being written to ClientHist only when a time varying attribute is changed.
2) I could split out the time varying attributes into one or more tables. If I go this route, do I put all three in one table or create two tables (one for RelationshipType and RelationshipStatus and one for ClientType). Creating multiple tables for time varying attributes does significantly increase the complexity of the database design. Each table will have associated audit tables as well.
Any thoughts?
A lot depends (or so I think) on how frequently the time-sensitive data will be changed. If changes are infrequent, then I'd go with (1), but if changes happen a lot and not necessarily to all the time-sensitive values at once, then (2) might be more efficient--but I'd want to think that over very carefully first, since it would be hard to manage and maintain.
I like the idea of requiring users to enter effective dates, because this could serve to reduce just how much detail you are saving--for example, however many changes they make today, it only produces that one History row that comes into effect tomorrow (though the audit table might get pretty big). But can you actually get users to enter what is somewhat abstract data?
You might want to try a single Client table with 4 date columns to handle the 2 temporal dimensions.
Something like (client_id, ..., valid_dt_start, valid_dt_end, audit_dt_start, audit_dt_end).
This design is very simple to work with, and I would try it and see how it scales before going with something more complicated.
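One way the four-date-column design could be queried is sketched below with sqlite3. The half-open intervals, the '9999-12-31' sentinel for "still open", and the sample history are all assumptions made for illustration; the answer only specifies the column names.

```python
import sqlite3

OPEN = "9999-12-31"  # assumed sentinel meaning "no end date yet"

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE client (
    client_id           INTEGER NOT NULL,
    relationship_status TEXT NOT NULL,
    valid_dt_start      TEXT NOT NULL,  -- when the fact became true
    valid_dt_end        TEXT NOT NULL,  -- when it stopped being true (exclusive)
    audit_dt_start      TEXT NOT NULL,  -- when this row was recorded
    audit_dt_end        TEXT NOT NULL   -- when it was superseded (exclusive)
)""")
conn.executemany("INSERT INTO client VALUES (?, ?, ?, ?, ?, ?)", [
    # Recorded on 2020-01-01: client is a prospect, open-ended.
    (1, "Prospect", "2020-01-01", OPEN, "2020-01-01", "2020-06-01"),
    # Recorded on 2020-06-01: the client actually became active on 2020-03-01.
    (1, "Prospect", "2020-01-01", "2020-03-01", "2020-06-01", OPEN),
    (1, "Active",   "2020-03-01", OPEN,         "2020-06-01", OPEN),
])

def status_as_of(client_id, valid_on, known_on):
    """Status that applied on `valid_on`, according to what was recorded by `known_on`."""
    row = conn.execute(
        """SELECT relationship_status FROM client
           WHERE client_id = ?
             AND valid_dt_start <= ? AND ? < valid_dt_end
             AND audit_dt_start <= ? AND ? < audit_dt_end""",
        (client_id, valid_on, valid_on, known_on, known_on),
    ).fetchone()
    return row[0] if row else None

print(status_as_of(1, "2020-04-01", "2020-05-01"))  # what we believed in May
print(status_as_of(1, "2020-04-01", "2020-07-01"))  # what we believed in July
```

ISO-formatted date strings compare correctly as text, which is why plain string comparison works here; a real schema would more likely use native date columns.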

Serializing data into a single text field - denormalization gone too far?

I'd love some opinions on whether this database design I'm currently pursuing is sound or not.
Let's assume I'm building a table called "Home". This table has a text field called "rooms". In this field is the serialized data for a set of rooms that this house has. My first instinct was to, of course, normalize this data into a separate "Rooms" table. However, due to some frustrating experiences with overly normalized databases in the past, I stopped to ask myself a few questions:
Will I ever need to find a specific room?
Will I ever need to update an individual room?
Will any Home records ever share Room records?
The answer to each of these questions is "no". Room records are all unique to each Home. Queries will never need to be performed to find out how many Homes in the database have bathrooms, for instance. Data will always be pulled from the perspective of the Home. The number of bedrooms and bathrooms will be explicitly stored on the Home record for searching.
So instead of having to constantly join Rooms, I wondered what would be the harm in serializing this data and just popping it into a text field.
This makes a lot of sense to me, but I'm hoping for a sanity check. Thanks for any input!
A pragmatic answer...
a) probability that you might want to decompose it in the future
b) benefit of not doing so now
c) cost of changing the schema later on.
If a * c > b then you should decompose now.
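The rule of thumb above can be written out as a tiny function. The parameter names and the use of developer-days as the unit are assumptions; the answer itself does not say how a, b and c should be measured, only that b and c need to be in the same unit.

```python
def should_decompose_now(p_future_need, benefit_of_deferring, cost_of_later_change):
    """Apply the a * c > b rule.

    p_future_need        -- a: probability (0..1) you will want the data decomposed later
    benefit_of_deferring -- b: what you save by not decomposing now (assumed unit: days)
    cost_of_later_change -- c: cost of changing the schema later (same unit as b)
    """
    expected_cost_of_waiting = p_future_need * cost_of_later_change
    return expected_cost_of_waiting > benefit_of_deferring

# A 30% chance of a 10-day migration outweighs a 2-day saving today.
print(should_decompose_now(0.3, 2, 10))
# A 10% chance of the same migration does not.
print(should_decompose_now(0.1, 2, 10))
```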
Well, you might not have a need TODAY to query to find out things like:
What is the average number of bathrooms in a home in Ohio?
Where do homes have more bedrooms? The East Coast or the West Coast?
How does house price correlate with the size of the master bedroom? What would be the average dollar value return of increasing the master bedroom size by 30%?
etc, etc.
You will be in a much better position in the future if you design your foundation correctly to begin with... no matter how enticing the short-cut may seem right now.
Plus, with a separate ROOMS table, you will be able to add additional room fields that make sense later (like width/height, color, floor level, etc.) which would all be very hard if the data were just globbed into a single field.
People will want to query in unexpected ways, like:
I have bad knees. Can you list houses with the master bedroom and master bathroom on the first floor?
In general, having a ROOMS table will just make your application more powerful, and easier to use.
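For illustration, the "bad knees" query could look roughly like this once rooms live in their own table. This is a sketch using sqlite3; the column names (kind, floor_level), the home_rooms view that hides the join, and the sample data are assumptions rather than anything from the original post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE home (home_id INTEGER PRIMARY KEY, address TEXT NOT NULL);
CREATE TABLE rooms (
    room_id     INTEGER PRIMARY KEY,
    home_id     INTEGER NOT NULL REFERENCES home(home_id),
    kind        TEXT NOT NULL,
    floor_level INTEGER NOT NULL
);
-- A view keeps the join in one place so callers do not repeat it.
CREATE VIEW home_rooms AS
    SELECT h.home_id, h.address, r.kind, r.floor_level
    FROM home h JOIN rooms r ON r.home_id = h.home_id;
INSERT INTO home VALUES (1, '1 Elm St'), (2, '2 Oak Ave');
INSERT INTO rooms VALUES
    (1, 1, 'master bedroom', 1), (2, 1, 'master bathroom', 1),
    (3, 2, 'master bedroom', 2), (4, 2, 'master bathroom', 1);
""")

def homes_with_master_suite_on_floor(floor):
    """Addresses where both the master bedroom and master bathroom are on `floor`."""
    rows = conn.execute(
        """SELECT address FROM home_rooms
           WHERE kind IN ('master bedroom', 'master bathroom') AND floor_level = ?
           GROUP BY home_id, address
           HAVING COUNT(DISTINCT kind) = 2
           ORDER BY address""",
        (floor,),
    ).fetchall()
    return [address for (address,) in rows]

print(homes_with_master_suite_on_floor(1))
```

With the rooms serialized into a text field, the same question would require loading and deserializing every Home row in application code.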
Hey, I get what you're saying about "overly normalized data". We've all been there, and it DOES bite. However, having a ROOMS table in a database with housing info isn't being "overly normalized". It's just building the app the right way.
In addition to what others have said about doing the right thing, I would like to add a comment about performance.
Since you will be storing the serialized room data as a column in table Home, the row size will increase significantly. This will result in worse performance for all other queries.
Well, you say that room records are unique, but you can't enforce that. So you have no way to know this for sure in your current design: all of your code would have to be perfect in maintaining it.
"constantly joining" isn't that hard to do, but if it is, you can always make a View for that, and you're done.

Database design question - which is the best solution?

I'm using Firebird 2.1 and I'm looking for the best way to solve this issue.
I'm writing a calendaring application. Different users' calendar entries are stored in a big Calendar table. Each calendar entry can have a reminder set - only one reminder/entry.
Statistically, the Calendar table could grow to hundreds of thousands of records over time, while there are going to be far fewer reminders.
I need to query the reminders on a constant basis.
Which is the best option?
A) Store the reminders' info in the Calendar table (in which case I'm going to query hundreds of thousands of records for IsReminder = 1)
B) Create a separate Reminders table which contains only the ID of calendar entries which have reminders set, then query the two tables with a JOIN operation (or maybe create a view on them)
C) I can store all information about reminders in the Reminders table, then query only this table. The downside is that some information needs to be duplicated in both tables, like in order to show the reminder, I'll need to know and store the event's starttime in the Reminders table - thus I'm maintaining two tables with the same values.
What do you think?
And one more question: The Calendar table will contain the calendars of multiple users, separated only by a UserID field. Since there can be only 4-5 users, even if I put an index on this field, its selectivity is going to be very bad - which is not good for a table with hundreds of thousands of records. Is there a workaround here?
Thanks!
There are advantages and drawbacks to all three choices. Which one is best depends on details you have not provided. In general, don't worry too much about selecting three or four entries out of a hundred thousand, provided the indexes you have set up allow the right retrieval strategy. If you don't understand indexing, you're likely to be in trouble no matter which of the three choices you make.
If it were me, I would go with choice B. I'd also store any attributes of a reminder in the table for reminders.
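A sketch of choice B, with the reminder's own attributes kept in the reminders table, might look like this. sqlite3 stands in for Firebird 2.1 here, and the column names (including minutes_before) and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE calendar (
    event_id   INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL,
    title      TEXT NOT NULL,
    start_time TEXT NOT NULL
);
-- event_id as primary key enforces "only one reminder per entry".
CREATE TABLE reminders (
    event_id       INTEGER PRIMARY KEY REFERENCES calendar(event_id),
    minutes_before INTEGER NOT NULL
);
INSERT INTO calendar VALUES
    (1, 1, 'Dentist', '2024-05-01 09:00'),
    (2, 1, 'Lunch',   '2024-05-01 12:00'),
    (3, 2, 'Standup', '2024-05-01 10:00');
INSERT INTO reminders VALUES (1, 30), (3, 5);
""")

def reminders_for(user_id):
    """Return (title, start_time, minutes_before) for a user's reminders."""
    return conn.execute(
        """SELECT c.title, c.start_time, r.minutes_before
           FROM reminders r
           JOIN calendar c ON c.event_id = r.event_id
           WHERE c.user_id = ?
           ORDER BY c.start_time""",
        (user_id,),
    ).fetchall()

print(reminders_for(1))
```

The start time is read from the calendar table at query time, so nothing is duplicated and the two tables cannot disagree about when an event starts.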
Be very careful about whether you identify an event by EventId alone or by (UserId, EventId). If you choose the latter, it behooves you to use a compound primary key for the Event table. Don't worry too much about compound primary keys, especially with Firebird.
If you declare a compound primary key, be aware that declaring (UserId, EventId) will not have the same consequences as declaring (EventId, UserId). They are logically equivalent, but the structure of the automatically generated index will be different in the two cases.
This in turn will affect the speed of queries like "find all the reminders for a given user".
Again, if it were me, I'd avoid choice C. The introduction of harmful redundancy into a schema carries with it the responsibility for some very careful programming when you go to update the data. Otherwise, you can end up with a database that stores contradictory versions of the same fact in different places of the database.
And, if you really want to know the effect on performance, try all three ways, load with test data, and do your own benchmarks.
I think you need to create realistic, fake user data and measure the difference with some typical queries you expect to run.
Indexing, query optimization and the types of query results you need can make a big difference, so it's not easy to say what's best without knowing more.
When choosing Option (A) you should
provide an index on "IsReminder" (or a combined index on IsReminder, UserId, whatever fits best to your intended queries)
make sure your queries use this index
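The two points above can be sketched as follows, using sqlite3 as a stand-in for Firebird. The index name, column names and sample rows are assumptions. EXPLAIN QUERY PLAN is SQLite's way of showing the chosen plan; Firebird reports plans through its own tooling, so the second function only illustrates the idea of checking.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE calendar (
    event_id    INTEGER PRIMARY KEY,
    user_id     INTEGER NOT NULL,
    title       TEXT NOT NULL,
    start_time  TEXT NOT NULL,
    is_reminder INTEGER NOT NULL DEFAULT 0
);
-- Combined index matching the WHERE clause of the reminder query.
CREATE INDEX idx_calendar_reminder ON calendar (is_reminder, user_id);
INSERT INTO calendar VALUES
    (1, 1, 'Dentist', '2024-05-01 09:00', 1),
    (2, 1, 'Lunch',   '2024-05-01 12:00', 0),
    (3, 2, 'Standup', '2024-05-01 10:00', 1);
""")

REMINDER_SQL = (
    "SELECT title FROM calendar "
    "WHERE is_reminder = 1 AND user_id = ? ORDER BY start_time"
)

def reminders_for(user_id):
    """Titles of the user's entries that have a reminder set."""
    return [title for (title,) in conn.execute(REMINDER_SQL, (user_id,))]

def query_plan(user_id):
    """Return the planner's description of how REMINDER_SQL would be executed."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + REMINDER_SQL, (user_id,)).fetchall()
    return [row[-1] for row in rows]  # last column holds the human-readable detail

print(reminders_for(1))
for step in query_plan(1):
    print(step)
```

Reading the plan output on realistic data volumes is the practical way to confirm the index is being used rather than assuming it.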
Option B is preferable over A if you have more than a boolean flag to store for each reminder (for example, the number of minutes before the event at which the user should be notified). You should, however, estimate how often your program will have to JOIN both tables.
If you can, avoid option C. If you don't want to benchmark all three cases, I suggest starting with A or B, according to the described circumstances, and probably the solution you choose will be fast enough, so you don't have to bother with the other cases.

Database Design Questions - Need Clarifications

I'm designing a database using SQL Server 2005.
The main concept of our site is to import XML feeds from suppliers.
Different suppliers can have different representations of the data.
The problem is that I need to design a table to store the imported information.
Some of the columns are fixed, meaning all supplier products must have similar data coming from the feed, like name, code, price, status, etc.
But some products have optional details.
For example, one product might have a color property while another might not.
What is the best way to store this kind of scenario in the database?
Should I create a table for the mandatory columns and other tables to hold the optional columns?
Or should I list all the columns first and put them into one table? (There might be a lot of null values.)
There will be thousands of products, and database speed is essential.
We will be doing a lot of product comparisons across different suppliers.
Our database will be something like www.pricerunner.co.uk
I hope I've explained the concept well.
Thousands of products (so thousands of rows). That's really not many at all, so you could normalize the optional data into a few separate tables without a dramatic effect on query time.
I would say put your indexes in the correct place, optimize your queries, make sure you have filegroups split up nicely, etc (just the usual regular old database stuff) and you should be good.
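As a rough sketch of splitting the optional data into its own table, the following uses sqlite3 in place of SQL Server 2005. The product_color table, the column names and the sample rows are assumptions; a real feed would probably need several such optional tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Mandatory columns that every supplier feed provides.
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    supplier   TEXT NOT NULL,
    name       TEXT NOT NULL,
    code       TEXT NOT NULL,
    price      REAL NOT NULL,
    status     TEXT NOT NULL
);
-- Optional detail: a row exists only for products that have a color.
CREATE TABLE product_color (
    product_id INTEGER PRIMARY KEY REFERENCES products(product_id),
    color      TEXT NOT NULL
);
INSERT INTO products VALUES
    (1, 'SupplierA', 'Kettle', 'K-100', 19.99, 'active'),
    (2, 'SupplierB', 'Kettle', 'KT-7',  17.50, 'active');
INSERT INTO product_color VALUES (1, 'red');
""")

def compare(name):
    """Return (supplier, price, color) per matching product, cheapest first.

    LEFT JOIN keeps products that have no color row; their color is None.
    """
    return conn.execute(
        """SELECT p.supplier, p.price, c.color
           FROM products p
           LEFT JOIN product_color c ON c.product_id = p.product_id
           WHERE p.name = ?
           ORDER BY p.price""",
        (name,),
    ).fetchall()

print(compare("Kettle"))
```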
Depends on how you want to access it.
As you say, speed is important - but what are you going to do with those extra, optional, bits of information? Do you need to store them at all? Assuming you do, how often do you need to access them?
Essentially, if you will always need to at least check if they're there, probably better to put them into one table. If you need to check anyway, might as well get it over with as part of the initial query.
If, on the other hand, you can usually run without bothering to check for these extra pieces, and only need to bother when specially requested, then it might be better to put them into a different table. The join (or subsequent lookup) will be expensive - much more expensive than pulling nulls for empty columns - but if it's very infrequent, would probably cost less in runtime execution in the long run.
Also bear in mind the tradeoff in storage and transport terms - storing lots of empty fields does take some space, and sending back lots of empty fields takes network bandwidth.
If disk space is not a concern, but bandwidth is, make sure the application is carefully designed to minimise unnecessary lookups; then, with tight queries, you can store the extra (optional) data but not pass it back unless it's requested.
So, it really all depends on what's important to you. Once you know what your overriding design concerns are, you will know which compromises to make to address those concerns at the expense of others. A balancing act.

Strategy in storing ad-hoc numbers/constants?

I have a need to store a number of ad-hoc figures and constants for calculation.
These numbers change periodically but they are different type of values. One might be a balance, a money amount, another might be an interest rate, and yet another might be a ratio of some kind.
These numbers are then used in a calculation that involve other more structured figures.
I'm not certain what the best way to store these in a relational DB is - that's the choice of storage for the app.
One way, which I've done before, is to create a very generic table that stores the values as text. I might store the data type along with it, but the consumer knows what type it is, so in some situations I didn't even need to store the data type. This kind of works fine, but I am not very fond of the solution.
Should I break down each of the numbers into specific categories and create tables that way? For example, create Rates table, and Balances table, etc.?
Yes, you should definitely structure your database accordingly. Having a generic table holding text values is not a great solution, and it also adds overhead when using those values in programs that may pull that data for some calculations.
Keeping each of the tables and values separated allows you to do things like adding dates and statuses to your values (perhaps some are active while others aren't?) and also allows you to keep an accurate history (what if I want to see a particular rate from last year?). It also makes things easier for those who come behind you to sift through your data.
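A dedicated rates table with an effective date and a status could be sketched like this, using sqlite3. The table layout, the 'base_interest' name and the sample values are assumptions for illustration; a production schema would likely use an exact decimal type rather than REAL for financial figures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rates (
    rate_id        INTEGER PRIMARY KEY,
    name           TEXT NOT NULL,
    value          REAL NOT NULL,
    effective_from TEXT NOT NULL,             -- ISO date, e.g. '2024-01-01'
    status         TEXT NOT NULL DEFAULT 'active',
    UNIQUE (name, effective_from)
);
INSERT INTO rates (name, value, effective_from) VALUES
    ('base_interest', 0.05,  '2023-01-01'),
    ('base_interest', 0.055, '2024-01-01');
""")

def rate_on(name, date):
    """Return the active rate in force on `date`, or None if none applied yet."""
    row = conn.execute(
        """SELECT value FROM rates
           WHERE name = ? AND status = 'active' AND effective_from <= ?
           ORDER BY effective_from DESC
           LIMIT 1""",
        (name, date),
    ).fetchone()
    return row[0] if row else None

print(rate_on("base_interest", "2023-06-15"))  # last year's rate
print(rate_on("base_interest", "2024-03-01"))  # current rate
```

Because each change is a new row rather than an overwrite, the "rate from last year" question is answered by the same query with a different date.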
I suggest reading this article on database normalization.

Resources