I'm making an API for movies, TV shows, actors, etc. with Web API 2 and SQL Server. The database now has more than 30 tables, most of them storing data that users will be able to edit.
How should I store old versions of entries?
Say someone edits the description, runtime and tagline for an entry (a movie) in the movies table.
I'll have a table (movies_old) where I store the editable fields from 'movies', plus who edited them and when.
All in the same database. The '???_old' tables have no relationships.
I'm very new to database design. Is there something obviously wrong with this?
To my mind, there are two issues here: what table you store the data in, and what goes in the "historical value" field.
On the first question, there are two obvious options: Store old and new records in the same table, with some sort of indication of which is "current" and which is "history", or have a separate table for history.
The main advantage of one table is that you have a simpler schema. This is especially true if the table contains many fields. If there are two tables, then all the field definitions are duplicated. When you move data from the current table to the history table, you have to copy every field, and if the list of fields changes, or their formats change, you have to remember to update the copy. Any queries that show the history have to read two tables. Etc. But with one table, all that goes away. Converting a record from current to history just means changing the setting of the "is_current" flag or however you indicate it.
The main advantages of two tables are: (a) access is probably somewhat faster, as you don't have so many irrelevant records to skip over; (b) when reading the current table you don't have to worry about excluding the history records.
Oh, an annoying thing about SQL: In principle you could put a date on each record, and then the record with the latest date is the current one. In practice this is a pain: you usually have to have an inner query to find the latest date, and then feed this back in to an outer query that re-reads the record with that date. (Some SQL engines have ways around this. Postgres, for example.) So in practice, you need an "is_current" flag, probably 1 for current and 0 for history or some such.
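To make the single-table option concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than SQL Server. The table layout, the column names and the save_edit helper are assumptions made up for illustration, not part of the original question. It shows the "is_current" flag next to the inner-query-on-latest-date alternative described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE movies (
        row_id     INTEGER PRIMARY KEY,
        movie_id   INTEGER NOT NULL,   -- logical id shared by all versions
        tagline    TEXT,
        edited_by  TEXT,
        edited_at  TEXT NOT NULL,      -- ISO-8601 text, so it sorts correctly
        is_current INTEGER NOT NULL DEFAULT 1
    )
""")

def save_edit(conn, movie_id, tagline, edited_by, edited_at):
    """Hypothetical helper: demote the current row, then insert the new one."""
    with conn:  # both statements run in one transaction
        conn.execute(
            "UPDATE movies SET is_current = 0 "
            "WHERE movie_id = ? AND is_current = 1",
            (movie_id,),
        )
        conn.execute(
            "INSERT INTO movies (movie_id, tagline, edited_by, edited_at) "
            "VALUES (?, ?, ?, ?)",
            (movie_id, tagline, edited_by, edited_at),
        )

save_edit(conn, 1, "In space no one can hear you", "alice", "2020-01-01T10:00:00")
save_edit(conn, 1, "In space no one can hear you scream", "bob", "2020-01-02T09:30:00")

# With the flag, reading the current version is a plain filter.
current_by_flag = conn.execute(
    "SELECT tagline FROM movies WHERE movie_id = 1 AND is_current = 1"
).fetchone()[0]

# Without the flag, an inner query has to find the latest date first.
current_by_date = conn.execute("""
    SELECT tagline FROM movies
    WHERE movie_id = 1
      AND edited_at = (SELECT MAX(edited_at) FROM movies WHERE movie_id = 1)
""").fetchone()[0]
```

Both queries return the same row here; the flag version simply avoids the second pass over the table.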
The other issue is what to put in the contents. If you're dealing with short fields, customer number and amount billed and so forth, then the simple and easy thing to do is just store the complete old contents in one record and the complete new contents in the new record. But if you're dealing with a long text block, like a plot synopsis or a review, there could be many small editorial changes. If every time someone fixes a grammar or spelling error, we have a whole new record with the entire 1000 characters, of which 5 characters are different, this could really clutter up the database. If that's the case, you might want to investigate ways to store changes more efficiently. This may or may not be an issue for you.
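One possible way to store changes more compactly is to keep only a delta between the old and new text. The sketch below uses Python's standard difflib module; the make_delta and apply_delta helpers and the delta format are assumptions invented for this example, and whether the saving is worth the extra complexity depends on your data.

```python
import difflib

def make_delta(old, new):
    """Describe how to turn `old` into `new`, storing only the changed text."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged span: remember only its position in the old text.
            delta.append(("copy", i1, i2))
        else:
            # replace / insert / delete: keep the new text (empty for delete).
            delta.append(("text", new[j1:j2]))
    return delta

def apply_delta(old, delta):
    """Rebuild the new text from the old text plus a delta."""
    parts = []
    for op in delta:
        if op[0] == "copy":
            parts.append(old[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)

old_text = "The crew of a spaceship recieve a distress call."
new_text = "The crew of a spaceship receive a distress call."
delta = make_delta(old_text, new_text)
rebuilt = apply_delta(old_text, delta)
```

This is a forward delta (old to new), so reading the latest version means replaying deltas from a stored base text. Storing the latest text in full and keeping reverse deltas for older versions is the other common arrangement.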
Related
I'm working with an existing MS SQL database and ASP.NET web application. An update is needed in a table, but in order to add the new data and have it display correctly in the site, I need to be able to take a series of existing records, essentially "push them down", and then add the new data in the open space created.
Is there a cleaner, more efficient method than by just creating a new record that's a copy of the last related record, and then essentially doing a copy-and-paste for the remaining records until I reach the insertion point? There are quite a few records to move and I'd prefer something that isn't as mind-numbing and potentially error-prone as that.
I know the existing site and database aren't designed optimally for inserting new data into this table unless it's added to the end, but reconfiguring the database and stored procedures is not an option I presently have.
-- EDIT --
For additional requested information...
Screen shot of table definition:
Screen shot of some table data (filtered by TemplateID):
When looking at the table data, there are a couple of other template ID values that bring back slightly more complex data. The issue is that this data needs to maintain this order, which happens to be the order in which it was entered, since it gets returned and displayed in the order shown. The new data needs to be entered before one of these lettered subject headers. Honestly, I think this is not the best way to do this, but I had no hand in the design. It was created by a different company, and mine was hired to handle updates and maintenance after the creators became unpleasant to work with. A different template ID value brings back two levels of headers, which doesn't make my task any easier or alterations much cleaner, considering the CS code that calls the stored procedures is completely separated from the code that builds the contents of the pages, and the organizational structure is tough to follow. There are some very poor naming conventions in places.
At any rate, there needs to be an insertion into this group of data under the "A" header value. The same needs to occur with another chunk associated with a different template ID, and there is another main header below the insertion point:
I am designing a database that needs to store transaction time and valid time, and I am struggling with how to effectively store the data and whether or not to fully time-normalize attributes. For instance I have a table Client that has the following attributes: ID, Name, ClientType (e.g. corporation), RelationshipType (e.g. client, prospect), RelationshipStatus (e.g. Active, Inactive, Closed). ClientType, RelationshipType, and RelationshipStatus are time varying fields. Performance is a concern as this information will link to large datasets from legacy systems. At the same time the database structure needs to be easily maintainable and modifiable.
I am planning on splitting out audit trail and point-in-time history into separate tables, but I’m struggling with how to best do this.
Some ideas I have:
1) Three tables: Client, ClientHist, and ClientAudit. Client will contain the current state. ClientHist will contain any previously valid states, and ClientAudit will be for auditing purposes. For ease of discussion, let’s forget about ClientAudit and assume the user never makes a data entry mistake. Doing it this way, I have two ways I can update the data. First, I could always require the user to provide an effective date and save a record out to ClientHist, which would result in a record being written to ClientHist each time a field is changed. Alternatively, I could only require the user to provide an effective date when one of the time varying attributes (i.e. ClientType, RelationshipType, RelationshipStatus) changes. This would result in a record being written to ClientHist only when a time varying attribute is changed.
2) I could split out the time varying attributes into one or more tables. If I go this route, do I put all three in one table or create two tables (one for RelationshipType and RelationshipStatus and one for ClientType)? Creating multiple tables for time varying attributes does significantly increase the complexity of the database design. Each table will have associated audit tables as well.
Any thoughts?
A lot depends (or so I think) on how frequently the time-sensitive data will be changed. If changes are infrequent, then I'd go with (1), but if changes happen a lot and not necessarily to all the time-sensitive values at once, then (2) might be more efficient--but I'd want to think that over very carefully first, since it would be hard to manage and maintain.
I like the idea of requiring users to enter effective dates, because this could serve to reduce just how much detail you are saving. For example, however many changes they make today, it only produces that one History row that comes into effect tomorrow (though the audit table might get pretty big). But can you actually get users to enter what is somewhat abstract data?
You might want to try a single Client table with 4 date columns to handle the 2 temporal dimensions.
Something like (client_id, ..., valid_dt_start, valid_dt_end, audit_dt_start, audit_dt_end).
This design is very simple to work with, and I would try it and see how it scales before going with something more complicated.
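As a rough illustration of the four-date-column idea, here is a sketch using Python's sqlite3 module. The sample data, the "9999-12-31" open-ended sentinel, the half-open date ranges and the status_as_of helper are all assumptions chosen for the example; only the four column names come from the suggestion above.

```python
import sqlite3

OPEN_END = "9999-12-31"  # assumed sentinel for "no end date yet"

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE client (
        client_id           INTEGER NOT NULL,
        relationship_status TEXT NOT NULL,
        valid_dt_start      TEXT NOT NULL,  -- when the fact is true in the real world
        valid_dt_end        TEXT NOT NULL,
        audit_dt_start      TEXT NOT NULL,  -- when the database believed it
        audit_dt_end        TEXT NOT NULL
    )
""")
conn.executemany("INSERT INTO client VALUES (?, ?, ?, ?, ?, ?)", [
    # Recorded on 2020-01-01: Active from 2020-01-01 with no known end.
    (1, "Active", "2020-01-01", OPEN_END, "2020-01-01", "2020-06-15"),
    # On 2020-06-15 we learn the client went Inactive on 2020-06-01, so the
    # row above is closed in audit time and replaced by these two rows.
    (1, "Active", "2020-01-01", "2020-06-01", "2020-06-15", OPEN_END),
    (1, "Inactive", "2020-06-01", OPEN_END, "2020-06-15", OPEN_END),
])

def status_as_of(conn, client_id, valid_on, known_on):
    """Status that held on `valid_on`, according to what was recorded by `known_on`."""
    row = conn.execute("""
        SELECT relationship_status FROM client
        WHERE client_id = ?
          AND valid_dt_start <= ? AND ? < valid_dt_end
          AND audit_dt_start <= ? AND ? < audit_dt_end
    """, (client_id, valid_on, valid_on, known_on, known_on)).fetchone()
    return row[0] if row else None

before_correction = status_as_of(conn, 1, "2020-06-10", "2020-06-10")
after_correction = status_as_of(conn, 1, "2020-06-10", "2020-07-01")
```

The same real-world date gives two answers depending on the audit date, which is the point of keeping both dimensions.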
I was building an RSS reader, which stores the articles it pulls in a database (SQLite in particular, but I don't think that matters).
Anyway, when I originally designed and coded it, the idea was to create a new table for every feed the user is subscribed to, and to have a big meta table. After reading a bit more about database management, I found another way to handle this was to have two tables, the meta table, and a table for every item in the rss feed, and in that table, have a column with the id of the feed it came from.
So, is there any major reason why I should switch the model that I'm using to be a large items table, rather than having one for each feed the user is subscribed to?
From what you wrote:
"to create a new table for every feed the user is subscribed to"
In the database world, at least to me, that is insane.
Just picture a user who wants to subscribe to 1,000 RSS feeds: will you create 1,000 tables? No way.
You can relate your data through primary keys and foreign keys, so why not use that strength?
First, it will be easier to write your queries. You won't have to worry about table names. You will have a table rssfeed and a table post, and everything will be linked together.
Spend time modelling your database. In your case it won't be that hard.
You might need 3 to 4 tables to handle RSS feeds, posts, and metadata.
Ask another question here, such as "How do I design a database for this need?"
People will be happy to help you.
Asking will save you time and money (even if that is not the point), and will help you follow best practices and avoid an ugly design.
The typical way of storing such data (assuming that the structure of the data is the same for all feeds) is indeed to have a single table for all feeds.
Why? Because this will allow you to access all feeds in the same way. For example, let's say you want to combine all feeds in a single view, or calculate some kind of statistic on all of your feeds. With everything in a single table this is extremely simple; with a separate table per feed it becomes much more complex, without any added value (as far as I can see).
It's a matter of simplicity of coding versus the probably slight performance edge of having one table per RSS feed. Having one table (rather than one per feed) means your code doesn't have to do any DDL and you could more easily do cross-RSS-feed searching; but queries and updates could be a little slower. I'd probably opt for a single table with a Feed column (indexed) to make searches simpler.
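Here is a small sketch of the single-table layout with an indexed feed column, using Python's sqlite3 module. The table names, column names and sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE feed (
        feed_id INTEGER PRIMARY KEY,
        title   TEXT NOT NULL,
        url     TEXT NOT NULL UNIQUE
    );
    CREATE TABLE item (
        item_id   INTEGER PRIMARY KEY,
        feed_id   INTEGER NOT NULL REFERENCES feed(feed_id),
        title     TEXT NOT NULL,
        published TEXT NOT NULL
    );
    CREATE INDEX idx_item_feed ON item(feed_id);
""")
conn.executemany("INSERT INTO feed (feed_id, title, url) VALUES (?, ?, ?)", [
    (1, "Feed A", "https://example.com/a.xml"),
    (2, "Feed B", "https://example.com/b.xml"),
])
conn.executemany("INSERT INTO item (feed_id, title, published) VALUES (?, ?, ?)", [
    (1, "A first", "2020-01-01"),
    (2, "B first", "2020-01-02"),
    (1, "A second", "2020-01-03"),
])

# Items of one feed: the feed id is a query parameter, not a table name.
feed_a_titles = [row[0] for row in conn.execute(
    "SELECT title FROM item WHERE feed_id = ? ORDER BY published", (1,)
)]

# Combined view across every feed, newest first.
combined = conn.execute("""
    SELECT feed.title, item.title
    FROM item JOIN feed USING (feed_id)
    ORDER BY item.published DESC
""").fetchall()

# A statistic over all feeds in one statement.
counts = dict(conn.execute("SELECT feed_id, COUNT(*) FROM item GROUP BY feed_id"))
```

Subscribing to a new feed is then just an INSERT into feed; no DDL is needed at runtime.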
I'm going to try to keep this question database agnostic, but I have an interesting problem that I need to tackle and I thought I'd open up the floor for suggestions and feedback.
I need to be able to download data from a feed source and store it in a database of some kind. The data needs to be merged into the existing data, and I need to be able to query for the data as of any given date. It's the part in bold that I'd like to talk about.
Essentially what this problem boils down to is that I need to persist an object graph to an OLTP database and be able to query it temporally.
In the simple case of one table this problem is very simple: you have a date range indicating the valid time span for each record, and then you pass in an as-of date and select only the rows that are valid at that point in time. The issues arise when you have more than one table.
Let's take the case of having two tables, Order-*Item.
When we query for an order we can apply the same as-of date to the item table. All is well, but what happens if we want to modify an order? Now we need to copy the order row and set the date ranges so that the valid-to on the old row and the valid-from on the new row are both set to now. We also have to copy the items or, if we change our model, copy the references to the items.
Even in this simple case things are starting to get complicated.
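The two-table case can be sketched as follows with Python's sqlite3 module. The schema, the "9999-12-31" open-ended sentinel and the replace_order and order_as_of helpers are assumptions made up for this example, and the sketch takes the "copy the items" route rather than copying references.

```python
import sqlite3

OPEN_END = "9999-12-31"  # assumed sentinel for "still valid"

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id INTEGER NOT NULL, customer TEXT NOT NULL,
        valid_from TEXT NOT NULL, valid_to TEXT NOT NULL
    );
    CREATE TABLE items (
        order_id INTEGER NOT NULL, sku TEXT NOT NULL, qty INTEGER NOT NULL,
        valid_from TEXT NOT NULL, valid_to TEXT NOT NULL
    );
""")

def replace_order(conn, order_id, customer, items, now):
    """Close the open rows at `now`, then insert a full new copy of the order."""
    with conn:  # one transaction for both tables
        for table in ("orders", "items"):
            conn.execute(
                f"UPDATE {table} SET valid_to = ? WHERE order_id = ? AND valid_to = ?",
                (now, order_id, OPEN_END),
            )
        conn.execute("INSERT INTO orders VALUES (?, ?, ?, ?)",
                     (order_id, customer, now, OPEN_END))
        conn.executemany("INSERT INTO items VALUES (?, ?, ?, ?, ?)",
                         [(order_id, sku, qty, now, OPEN_END) for sku, qty in items])

def order_as_of(conn, order_id, as_of):
    """Return (customer, [(sku, qty), ...]) as valid at `as_of`, or None."""
    cond = "order_id = ? AND valid_from <= ? AND ? < valid_to"
    args = (order_id, as_of, as_of)
    order = conn.execute(f"SELECT customer FROM orders WHERE {cond}", args).fetchone()
    if order is None:
        return None
    items = conn.execute(
        f"SELECT sku, qty FROM items WHERE {cond} ORDER BY sku", args
    ).fetchall()
    return order[0], items

replace_order(conn, 1, "ACME", [("bolt", 10), ("nut", 10)], "2020-01-01")
replace_order(conn, 1, "ACME", [("bolt", 25)], "2020-02-01")
january_view = order_as_of(conn, 1, "2020-01-15")
```

Every modification rewrites all the item rows, which is simple but grows quickly; a self-referential graph would need the same treatment applied to each level.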
My problem is exacerbated because I have a self-referential object graph, so to use the above model you'd have Order-*Item-*Order.
What would you do? How do you structure your databases when you need versioning of rows and temporal queries?
Back in the day, Developing Time-Oriented Database Applications in SQL was the best source of info for temporal databases. Published in 1999, the copyright has reverted to the author, and the link goes to his PDF version of the book. Look here for more of his publications, and for a link to the compressed content of the CDROM.
I'm a J2EE developer and we are using Hibernate mapping with a PostgreSQL database.
We have to keep track of any changes that occur in the database; in other words, all previous and current values of any field should be saved. Each field can be of any type (bytea, int, char...).
With a simple table it is easy, but with a graph of objects things are more difficult.
So, from a UML point of view, we have a graph of objects to store in the database, along with every change and the user who made it.
Any ideas or patterns for how to do that?
A common way to do this is by storing versions of objects.
If you add a "version" and a "deleted" field to each table that you want to keep an audit trail on, then instead of doing normal updates and deletes, follow these rules:
Insert - Set the version number to 0 and insert as normal.
Update - Increment the version number and do an insert instead.
Delete - Increment the version number, set the deleted field to true and do an insert instead.
Retrieve - Get the record with the highest version number and return that.
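The four rules above might look like this in code. This is a sketch in Python with sqlite3 rather than Hibernate and PostgreSQL; the person table and the helper names are hypothetical. One assumption beyond the rules as stated: retrieve returns None when the latest version is marked deleted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person (
        id      INTEGER NOT NULL,
        version INTEGER NOT NULL,
        deleted INTEGER NOT NULL DEFAULT 0,
        name    TEXT,
        PRIMARY KEY (id, version)
    )
""")

def _latest(conn, person_id):
    """Row with the highest version number for this id, or None."""
    return conn.execute(
        "SELECT version, deleted, name FROM person "
        "WHERE id = ? ORDER BY version DESC LIMIT 1",
        (person_id,),
    ).fetchone()

def insert(conn, person_id, name):
    # Insert: version 0, otherwise a normal insert.
    conn.execute("INSERT INTO person (id, version, name) VALUES (?, 0, ?)",
                 (person_id, name))

def update(conn, person_id, name):
    # Update: increment the version and insert a new row instead.
    version, _, _ = _latest(conn, person_id)
    conn.execute("INSERT INTO person (id, version, name) VALUES (?, ?, ?)",
                 (person_id, version + 1, name))

def delete(conn, person_id):
    # Delete: increment the version, set the deleted flag, insert a new row.
    version, _, name = _latest(conn, person_id)
    conn.execute(
        "INSERT INTO person (id, version, deleted, name) VALUES (?, ?, 1, ?)",
        (person_id, version + 1, name),
    )

def retrieve(conn, person_id):
    # Retrieve: highest version wins; treat a deleted latest row as absent.
    row = _latest(conn, person_id)
    if row is None or row[1]:
        return None
    return row[2]

insert(conn, 1, "Ann")
update(conn, 1, "Anne")
current_name = retrieve(conn, 1)
```

In a real application the read-then-insert in update and delete would need to run inside one transaction, so that two concurrent writers cannot compute the same version number.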
If you follow this pattern, every time you update you will create a new record rather than overwriting the old data, so you will always be able to track back and see all the old objects.
This will work exactly the same for graphs of objects, just add the new fields to each table within the object graph, and handle each insert/update/delete for each table as described above.
If you need to know which user made the modification, you just add a "ModifiedBy" field as well.
(You can either do this processing in your DA layer code, or if you prefer you can use database triggers to catch your update/delete/retrieve calls and re-process them following the rules.)
Obviously, you need to consider space requirements, as every single update will result in a completely new record. If your application is update-heavy, you are going to generate a lot of data. It's common to also include a "last modified time" field so you can process the database offline and delete data older than required.
Current RDBMS implementations are not very good at handling temporal data. That's one reason why maintaining separate journalling tables through triggers is the usual approach. (The other is that audit trails frequently have different use cases to regular data, and having them in separate tables makes it easier to manage access to them). Oracle does a pretty slick job of hiding the plumbing in its Total Recall product, but being Oracle it charges $$$ for this.
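For a feel of the trigger-maintained journal table, here is a minimal sketch using SQLite through Python's sqlite3 module. PostgreSQL trigger syntax differs (it uses trigger functions), and the customer and customer_journal tables are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        email       TEXT NOT NULL
    );

    -- Separate journal table: same data columns plus audit metadata.
    CREATE TABLE customer_journal (
        journal_id  INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        email       TEXT NOT NULL,
        changed_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    -- Copy the old values into the journal whenever a row is updated.
    CREATE TRIGGER customer_after_update AFTER UPDATE ON customer
    BEGIN
        INSERT INTO customer_journal (customer_id, email)
        VALUES (OLD.customer_id, OLD.email);
    END;
""")

conn.execute("INSERT INTO customer (customer_id, email) VALUES (1, 'old@example.com')")
conn.execute("UPDATE customer SET email = 'new@example.com' WHERE customer_id = 1")

live_email = conn.execute(
    "SELECT email FROM customer WHERE customer_id = 1"
).fetchone()[0]
journal_rows = conn.execute(
    "SELECT customer_id, email FROM customer_journal ORDER BY journal_id"
).fetchall()
```

The application code only ever issues ordinary updates; the history is captured by the database, and access to the journal table can be granted separately from the live table.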
Scott Bailey has published a presentation on temporal data in PostgreSQL. Alas it won't help you right now but it seems like some features planned for 8.5 and 8.6 will enable the transparent storage of time-related data. Find out more.