If I want to track revision history (i.e. keep an audit trail) on a foo table, I might add a trigger to it, such that every UPDATE to foo.some_text_column also results in an INSERT into a foo_history table.
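For concreteness, here is a minimal sketch of that kind of trigger, using SQLite through Python's sqlite3 module. The foo_history columns (foo_id, old_text, changed_at) and the trigger name are assumptions made for illustration, and this version stores the whole previous text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo (
        id INTEGER PRIMARY KEY,
        some_text_column TEXT NOT NULL
    );
    -- Hypothetical history table; the column names are assumptions.
    CREATE TABLE foo_history (
        history_id INTEGER PRIMARY KEY,
        foo_id INTEGER NOT NULL REFERENCES foo(id),
        old_text TEXT NOT NULL,
        changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );
    -- Fires only when some_text_column is named in the UPDATE.
    CREATE TRIGGER foo_audit
    AFTER UPDATE OF some_text_column ON foo
    BEGIN
        INSERT INTO foo_history (foo_id, old_text)
        VALUES (OLD.id, OLD.some_text_column);
    END;
""")

conn.execute("INSERT INTO foo (id, some_text_column) VALUES (1, 'abc')")
conn.execute("UPDATE foo SET some_text_column = 'def' WHERE id = 1")

# The previous value is now in the history table; foo holds the new one.
history = conn.execute("SELECT foo_id, old_text FROM foo_history").fetchall()
current = conn.execute(
    "SELECT some_text_column FROM foo WHERE id = 1"
).fetchone()[0]
```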
My question is, in that foo_history table, is it better to save the entire contents of foo.some_text_column, or to save just a "diff" of the changes?
In other words, if I update 'abc' to 'def', is it better to keep the whole original text:
abc
or is it better to keep something more like ...
- abc
+ def
Obviously, the diff takes up more space for this short string, but when you have larger strings, with edits of just a few characters, the diff will be a lot smaller.
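To make that size trade-off concrete, here is a rough sketch using Python's standard difflib module. The helper name line_diff and the 200-line sample text are made up for illustration, and a line-based unified diff is only one possible diff format.

```python
import difflib

def line_diff(old: str, new: str) -> str:
    """Return a unified diff of two texts, compared line by line."""
    diff = difflib.unified_diff(
        old.splitlines(), new.splitlines(), lineterm="", n=0
    )
    return "\n".join(diff)

# Short string: the diff carries header lines, so it is longer than the text.
short_old, short_new = "abc", "def"
short_diff = line_diff(short_old, short_new)

# Longer text where only one line changes: the diff is much smaller.
long_old = "\n".join(f"line {i} of a long document" for i in range(200))
long_new = long_old.replace("line 100 of", "line 100 in")
long_diff = line_diff(long_old, long_new)

print(len(short_old), len(short_diff))
print(len(long_old), len(long_diff))
```

Note that reading an old version back from diffs means replaying them against a known full copy, which is extra work that storing the whole text avoids.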
Also, I know databases are good at optimizing redundant data, so maybe storing the whole original text really isn't that bad? I'm honestly not sure which option makes more sense, technically, and would welcome an informed opinion.
I'm jumping into a project that's been running for some time. One of my first tasks is to add a few columns that will essentially replace an existing column. What should I do with the old data?
The new columns are meant to "decompose" an existing column, as a way to add more granular details to the value. The following structure is conceptually the same as what I'm dealing with:
# Current Schema
TotalPrice: BigInt
# New Schema
BasePrice: BigInt
Markup: BigInt
Tax: BigInt
Conceptually speaking, TotalPrice == (BasePrice + Markup + Tax).
As part of this migration, what's the best way to deal with all the rows that already have values for TotalPrice? I've worked out two options, and I'm looking for some authoritative guidance for which approach is "better" in terms of maintenance, reasoning, etc. I'm open to alternative approaches too.
Keep TotalPrice
Hold onto the old data as-is, make the column read-only via the ORM (I'm using Django), and introduce conditionals in the code to check for a value in this "legacy" column first. This feels more complicated on the code level, but preserves the data in its originally intended mental model, making it easier to reason about and work with in the future.
Moving TotalPrice into one of the new columns
Hold onto the data, but re-label it, so to speak. This would make the code cleaner, but set us up for potentially weird situations where only one of the new columns would have a value for a lot of records, while the expected situation is that all 3 of the new columns will have a value > 0.
To me, it seems like the first approach is better for the long term. It's more explicit (records with TotalPrice represent the information when it was created), and requires less commenting to explain "what's going on here", when dealing with columns that have an implied second meaning (e.g. BasePrice is both the base price but sometimes the TotalPrice for old records). But I'm not totally sure if holding onto this column and the associated code flows are worth an easier mental model.
Imagine a lot of code that looks like:
if obj.total_price:
    return obj.total_price
else:
    return obj.base_price + obj.markup + obj.tax
Where we'll always have to do a form of duck-typing, to see if it's a "legacy" record.
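One way to keep that check from spreading through the code base is to put it in a single property. The sketch below uses a plain dataclass as a stand-in for the Django model; the class name PricedItem and the field name legacy_total_price are assumptions, not part of the existing schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PricedItem:
    """Stand-in for the ORM model; the price fields follow the question."""
    base_price: int = 0
    markup: int = 0
    tax: int = 0
    # Hypothetical name for the old column; set only on pre-migration rows.
    legacy_total_price: Optional[int] = None

    @property
    def total_price(self) -> int:
        # Compare against None rather than relying on truthiness, so a
        # legacy total of 0 is still treated as a legacy value.
        if self.legacy_total_price is not None:
            return self.legacy_total_price
        return self.base_price + self.markup + self.tax

legacy = PricedItem(legacy_total_price=1200)
modern = PricedItem(base_price=1000, markup=150, tax=50)
print(legacy.total_price, modern.total_price)
```

Callers then only ever read total_price, and the "is this a legacy record?" branch lives in one place.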
"It doesn't really matter" is also an acceptable answer!
I am creating a note system and want my notes to be editable, but also want them to never be deleted so I'm compromising with keeping a history of the different changes made to them. So I have come up with one idea where each note table looks like this:
Note
--------
+id
+content
+author
+timestamp
+edited
In this version, if edited is anything other than null, the note has been edited and the value points to the note id of its ancestor. It is essentially a linked list. I'm not very happy with that, though, as most notes won't be edited, so there's just a bunch of nulls sitting around.
My other idea was to create a table like:
Note
-------
+id
+content
+author
+timestamp
and also a table like:
Edited_Notes
-----------
+id
+note_id
Then, whenever a note is loaded, just check whether it's been added to Edited_Notes. If it has been, then obviously it's been edited. I'm worried that searching through this table every time a note is opened by hundreds of users could be taxing for the database though, especially if I add an ability to see all note history for a single note at once.
I am not a db designer, so this is pretty new to me. Would these kinds of transactions even scratch a database's capabilities? Is there a better way to go about it?
There is no reason to avoid empty columns - storage is cheap (too cheap to measure for most systems).
What's usually expensive is developer time. I'd optimize for the most obvious, easy-to-understand solution, that describes your business domain in the cleanest possible way. In my opinion, option 1 does that; option 2 would probably require significant additional queries on most screens.
If I understand the solution correctly, a deep parent-child relation could be a problem for the database even with a small number of records, because you would need to join the table to itself several times, once per change in the chain.
Instead, I would recommend a history table, separate from the note table, with exactly the same structure plus a parent id reference to the actual note table (or you can just use the same note id as a non-primary field). Whenever something changes, move the old data to the history table with the parent reference id.
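A minimal sketch of that separate history table, using SQLite through Python's sqlite3 module. The table name note_history, the index, and the edit_note helper are assumptions for illustration; the note columns follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE note (
        id INTEGER PRIMARY KEY,
        content TEXT NOT NULL,
        author TEXT NOT NULL,
        timestamp TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );
    -- Same structure as note, plus note_id pointing at the live row.
    CREATE TABLE note_history (
        id INTEGER PRIMARY KEY,
        note_id INTEGER NOT NULL REFERENCES note(id),
        content TEXT NOT NULL,
        author TEXT NOT NULL,
        timestamp TEXT NOT NULL
    );
    CREATE INDEX note_history_by_note ON note_history (note_id, id);
""")

def edit_note(conn, note_id, new_content, editor):
    """Copy the current row into note_history, then overwrite it."""
    with conn:  # both statements commit together or roll back together
        conn.execute(
            "INSERT INTO note_history (note_id, content, author, timestamp) "
            "SELECT id, content, author, timestamp FROM note WHERE id = ?",
            (note_id,),
        )
        conn.execute(
            "UPDATE note SET content = ?, author = ?, "
            "timestamp = CURRENT_TIMESTAMP WHERE id = ?",
            (new_content, editor, note_id),
        )

conn.execute(
    "INSERT INTO note (id, content, author) VALUES (1, 'first draft', 'ann')"
)
edit_note(conn, 1, "second draft", "bob")
edit_note(conn, 1, "final", "ann")

# Loading a note touches only the note table; history is a single
# indexed lookup, with no self-joins.
current = conn.execute("SELECT content FROM note WHERE id = 1").fetchone()[0]
history = [row[0] for row in conn.execute(
    "SELECT content FROM note_history WHERE note_id = ? ORDER BY id", (1,)
)]
```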
I'm creating a small game composed of weapons. Weapons have characteristics, like accuracy. When a player crafts such a weapon, a value between min and max is generated for each characteristic. For example, the accuracy of a new gun is a number between 2 and 5.
My question is: should I store the minimum and maximum values in the database, or should they be hard-coded?
I understand that putting them in the database lets me change these values easily. However, they won't change very often, and doing this means making a database request whenever I need them. Moreover, it means having many more tables. On the other hand, is it good practice to store this directly in the code?
In conclusion, I really don't know which solution to choose, as both have advantages and disadvantages.
If you have attributes of an entity, then you should store them in the database.
That is what databases are for, storing data. I can see no advantage to hardcoding such values. Worse, the values might be used in different places in your code. And, when you update them, you might end up with inconsistent values throughout the code.
EDIT:
If these are default values, then I can imagine storing them in the code along with all the other information about the weapon -- name of the weapon, category, and so on. Those values are the source information for the weapons.
I still think it would be better to have a Weapons table or WeaponDefaults table so these are in the database. Right now, you might think the defaults are only used in one place. You would be surprised how software can grow. Also, having them in the database makes the values more maintainable.
I would have to agree with @Gordon_Linoff.
I don't think you will end up with "way more tables", maybe one or two. If you had a table that had fields of ID, Weapon, Min, Max ...
Then you could do a recordset search when needed. As you said, these variables might never change, but changing them in a single spot seems much more admin-friendly than scouring code that you have left alone for a long time. My two cents' worth.
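As a rough sketch of such a table, here is one version using SQLite through Python's sqlite3 module. The table name weapon_stat_range, the extra "damage" stat, and the helper functions are assumptions for illustration. Loading the ranges once at startup is one way to avoid a database request on every craft.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE weapon_stat_range (
        weapon TEXT NOT NULL,
        stat TEXT NOT NULL,
        min_value INTEGER NOT NULL,
        max_value INTEGER NOT NULL,
        PRIMARY KEY (weapon, stat)
    )
""")
conn.executemany(
    "INSERT INTO weapon_stat_range VALUES (?, ?, ?, ?)",
    # "damage" is a made-up second stat; accuracy 2..5 is from the question.
    [("gun", "accuracy", 2, 5), ("gun", "damage", 10, 20)],
)

def load_ranges(conn):
    """Read all ranges once so crafting needs no further queries."""
    rows = conn.execute(
        "SELECT weapon, stat, min_value, max_value FROM weapon_stat_range"
    )
    ranges = {}
    for weapon, stat, low, high in rows:
        ranges.setdefault(weapon, {})[stat] = (low, high)
    return ranges

def craft(ranges, weapon, rng=random):
    """Roll one value per stat, inclusive of both bounds."""
    return {
        stat: rng.randint(low, high)
        for stat, (low, high) in ranges[weapon].items()
    }

RANGES = load_ranges(conn)
gun = craft(RANGES, "gun", random.Random(0))
print(gun)
```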
I'm making an api for movie/tv/actors etc. with web api 2 and sql server. The database now has >30 tables, most of them storing data users will be able to edit.
How should I store old version of entries?
Say someone edits the description, runtime and tagline for an entry (movie) in the movies table.
I'll have a table (movies_old) where I store the editable fields from 'movies', plus who edited them and when.
All in the same database. The '???_old' tables have no relationships.
I'm very new to database design. Is there something obviously wrong with this?
To my mind, there are two issues here: what table you store the data in, and what goes in the "historical value" field.
On the first question, there are two obvious options: Store old and new records in the same table, with some sort of indication of which is "current" and which is "history", or have a separate table for history.
The main advantage of one table is that you have a simpler schema. This is especially true if the table contains many fields. If there are two tables, then all the field definitions are duplicated. When you move data from the current table to the history table, you have to copy every field, and if the list of fields changes, or their formats change, you have to remember to update the copy. Any queries that show the history have to read two tables. Etc. But with one table, all that goes away. Converting a record from current to history just means changing the setting of the "is_current" flag or however you indicate it.
The main advantages of two tables are, (a) Access is probably somewhat faster, as you don't have so many irrelevant records to skip over. (b) When reading the current table you don't have to worry about excluding the history records.
Oh, an annoying thing about SQL: In principle you could put a date on each record, and then the record with the latest date is the current one. In practice this is a pain: you usually have to have an inner query to find the latest date, and then feed this back in to an outer query that re-reads the record with that date. (Some SQL engines have ways around this. Postgres, for example.) So in practice, you need an "is_current" flag, probably 1 for current and 0 for history or some such.
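To illustrate the difference, here is a small sketch using SQLite through Python's sqlite3 module. The movie_version table and its columns are invented for the example. Both queries return the same rows, but the date-based one needs the inner query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie_version (
        movie_id INTEGER NOT NULL,
        tagline TEXT NOT NULL,
        edited_at TEXT NOT NULL,
        is_current INTEGER NOT NULL DEFAULT 0
    );
    INSERT INTO movie_version VALUES
        (1, 'old tagline',  '2020-01-01', 0),
        (1, 'new tagline',  '2021-06-01', 1),
        (2, 'only tagline', '2019-03-15', 1);
""")

# Date-based: the inner query finds each movie's latest date, and the
# outer query re-reads the row that carries that date.
latest_by_date = conn.execute("""
    SELECT v.movie_id, v.tagline
    FROM movie_version AS v
    WHERE v.edited_at = (
        SELECT MAX(edited_at)
        FROM movie_version
        WHERE movie_id = v.movie_id
    )
    ORDER BY v.movie_id
""").fetchall()

# Flag-based: a plain filter, no inner query needed.
latest_by_flag = conn.execute(
    "SELECT movie_id, tagline FROM movie_version "
    "WHERE is_current = 1 ORDER BY movie_id"
).fetchall()
```

The flag has its own cost: whatever writes a new version must also clear the flag on the previous one, ideally in the same transaction.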
The other issue is what to put in the contents. If you're dealing with short fields, customer number and amount billed and so forth, then the simple and easy thing to do is just store the complete old contents in one record and the complete new contents in the new record. But if you're dealing with a long text block, like a plot synopsis or a review, there could be many small editorial changes. If every time someone fixes a grammar or spelling error, we have a whole new record with the entire 1000 characters, of which 5 characters are different, this could really clutter up the database. If that's the case you might want to investigate ways to store changes more efficiently. May or may not be an issue to you.
I have some questions regarding database performance in general. I'm using Sqlite, but I assume that the performance remarks are applicable to all relational databases?
I have a database that contains a table that stores data of about 200 variables. I write about 50 variables per second to the table. A written variable contains the id of the variable, a value and a timestamp. Reading is done very rarely, but needs to be as fast as possible to get the data per variable in chronological order. When I do a query I always just need to get the data of 1 variable.
How do I design the database so the reading is as fast as possible:
1. I make 1 table that contains all the variables. The variable is stored as an id. I index the table on the id and timestamp. The bad part is that the index makes the writes slower.
2. I make 200 tables, one for each variable, and index the timestamp.
I think the second solution is the most performant, but creating a table for each variable doesn't seem right. Can someone give me some advice?
Thanks
If you really want to use a database, use the first approach, but make sure you are inserting your data in a single transaction; benchmarks show it makes writing much faster.
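Here is a minimal sketch of the first approach with batched writes, using Python's sqlite3 module. The table and column names (sample, variable_id, ts, value) and the helper functions are assumptions for illustration, and the numbers are not a benchmark.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample (
        variable_id INTEGER NOT NULL,
        ts REAL NOT NULL,
        value REAL NOT NULL
    );
    -- Composite index: finds one variable's rows already ordered by time.
    CREATE INDEX sample_by_variable_time ON sample (variable_id, ts);
""")

def write_batch(conn, rows):
    """Insert many samples inside one transaction (a single commit)."""
    with conn:
        conn.executemany(
            "INSERT INTO sample (variable_id, ts, value) VALUES (?, ?, ?)",
            rows,
        )

def read_variable(conn, variable_id):
    """Return (ts, value) pairs for one variable in chronological order."""
    return conn.execute(
        "SELECT ts, value FROM sample WHERE variable_id = ? ORDER BY ts",
        (variable_id,),
    ).fetchall()

start = time.time()
# Two batches, each holding one second's worth of data for 50 variables.
write_batch(conn, [(vid, start, float(vid)) for vid in range(50)])
write_batch(conn, [(vid, start + 1.0, float(vid) * 2) for vid in range(50)])

series = read_variable(conn, 7)
total = conn.execute("SELECT COUNT(*) FROM sample").fetchone()[0]
```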
Are your searches performed on variable name/id AND timestamp, or on variable name only? Indexing on timestamp may not be necessary...
Are you sure you need a database? By the sounds of it, a flat-file will work well enough for you, and you don't sound like you actually need any of the trappings of a database. Just create a flat-file for each variable and keep handles to each open. Write to them through your standard buffered IO as often as you need. To read, just open one file and deserialize.
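A rough sketch of that flat-file idea in Python, assuming one CSV-style text file per variable. The class name VariableLog and the file naming scheme are made up for illustration.

```python
import os
import tempfile

class VariableLog:
    """Append-only text file per variable, with handles kept open."""

    def __init__(self, directory):
        self.directory = directory
        self.handles = {}

    def _path(self, variable_id):
        return os.path.join(self.directory, f"var_{variable_id}.csv")

    def write(self, variable_id, timestamp, value):
        handle = self.handles.get(variable_id)
        if handle is None:
            handle = open(self._path(variable_id), "a", encoding="utf-8")
            self.handles[variable_id] = handle
        # Goes through standard buffered IO; close() flushes it to disk.
        handle.write(f"{timestamp},{value}\n")

    def close(self):
        for handle in self.handles.values():
            handle.close()
        self.handles.clear()

    def read(self, variable_id):
        """Return (timestamp, value) pairs in the order they were written."""
        with open(self._path(variable_id), encoding="utf-8") as f:
            return [tuple(float(x) for x in line.split(",")) for line in f]

workdir = tempfile.mkdtemp()
log = VariableLog(workdir)
for t in range(3):
    log.write(7, float(t), t * 1.5)
log.close()  # close (or flush) before reading, or buffered rows may be missed
points = log.read(7)
```

Since rows are appended as they arrive, the file is already in chronological order as long as timestamps only increase.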
If you are using a relational database, I am guessing those variables are all related? If they are just values, for instance, settings, then maybe a file or something similar may be better.
If you only ever have to query values for ONE variable, and you insist on using a database (which may not be a bad thing!), then you should create one table per variable:
id (unsigned int, auto-increment, primary key)
timestamp (datetime)
variable (whatever it is supposed to be)
Do not skimp on data just because "it might take more room on the hard drive" - that only leads to trouble.