Tracking changes in data with no primary key - database

I am writing an application that uses CC-CEDICT, a CC licensed Chinese-English dictionary.
The dictionary is available only as a zipped text file (4MB) with entries in the following format:
Traditional Simplified [pin1 yin1] /English equivalent 1/equivalent 2/
This is sample data:
是 是 [shi4] /is/are/am/yes/to be/
昰 是 [shi4] /variant of 是[shi4]/used in given names/
時 时 [Shi2] /surname Shi/
時 时 [shi2] /o'clock/time/when/hour/season/period/
I chose those lines deliberately to illustrate my problem. The data has no discernible key by which an individual word can be identified.
The English definitions can change, and do, as the dictionary is constantly updated. Suppose that in one update the two definitions of 時 时 change, so the next download contains the lines:
時 时 [Shi2] /last name Shi/
時 时 [shi2] /o'clock/time period/when/hour/season/
How am I to tell which records have been updated? The problem is especially acute when the translation is a single word that changes completely.
I am after a strategy for keying this dictionary. So far my best idea is to take (Simplified, Traditional) as the key, and treat the duplicates as a special case - in their own table perhaps?

The issue is one of perspective.
You say that your records have no key, but in fact the whole record is the key - assuming you have no identical duplicate records.
Therefore there are no updates, only inserts and deletes.
You can track which records are deleted and which are inserted in order to highlight changes in your dictionary.
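
For example, with each download loaded into its own staging table, a whole-row set difference yields both change sets. A minimal sketch in standard SQL; the table and column names are illustrative:

-- Rows in the new download but not the old: the inserts.
SELECT traditional, simplified, pinyin, definitions FROM cedict_new
EXCEPT
SELECT traditional, simplified, pinyin, definitions FROM cedict_old;

-- Rows in the old download but not the new: the deletes.
SELECT traditional, simplified, pinyin, definitions FROM cedict_old
EXCEPT
SELECT traditional, simplified, pinyin, definitions FROM cedict_new;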
If you really want to treat definition replacements as updates, then you're going to have to come up with a scheme that (a) creates a unique key for records and (b) allows you to recognize when a new definition list should be considered a modification of an existing definition list.
Part (a) is easy, add your own surrogate key. This could be unique across all definitions, or just across combinations of (Simplified, Traditional).
Part (b) is harder. At what point do you say "surname Shi" is related to "last name Shi"? I suggest coming up with some kind of text comparison function that yields a numeric score, and picking a threshold at which you call it an update instead of a delete and insert. The threshold will be arbitrary, but then two people could disagree over what is an update and what isn't from one case to the next anyway.
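
As a sketch of the scoring idea, here is one way to do it in PostgreSQL with the pg_trgm extension (an assumption on my part; any string-similarity function would work). The deleted_rows and inserted_rows tables stand for the two change sets from the diff above:

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Pair each deleted row with inserted rows for the same headword and
-- keep the pairs that are similar enough to call updates.
SELECT d.definitions AS old_defs,
       i.definitions AS new_defs,
       similarity(d.definitions, i.definitions) AS score
FROM   deleted_rows d
JOIN   inserted_rows i ON i.traditional = d.traditional
                      AND i.simplified  = d.simplified
                      AND i.pinyin      = d.pinyin
WHERE  similarity(d.definitions, i.definitions) > 0.4;  -- arbitrary threshold

With pg_trgm, "surname Shi" vs "last name Shi" scores far higher than two unrelated definition lists would, so such a pair would be reported as an update rather than a delete plus an insert.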

This is not a solution, but it may give you (or others) some ideas.
How about modeling this as a hierarchy, Word->Meaning->Translation?
Compute a hash of each translation, sum the hash values of all translations and store that in the corresponding "meaning" record, then sum the hash values of all meanings and store that in the Word record. (Yes, this is denormalized.)
You would have to recompute all the hash values for all records in the file every time. Then you could simply compare the currently stored hash value for a "word" with the hash value you just computed. If they differ, something has changed: there was a new meaning, or a new translation, or a translation was removed, etcetera. You could then remove the word entirely (cascaded) and re-insert the new "sub tree", or, if you want to complicate things, descend into the hierarchy and try to detect exactly what changed.
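
As a sketch of the idea in PostgreSQL (md5() and string_agg() are assumptions; any hash would do, and hashing an ordered concatenation stands in here for summing per-row hashes):

-- One hash per word over its full, ordered definition list; a change
-- anywhere below the word changes the word's hash.
SELECT traditional,
       simplified,
       md5(string_agg(pinyin || '|' || definitions, '/'
                      ORDER BY pinyin, definitions)) AS word_hash
FROM   cedict_new
GROUP  BY traditional, simplified;

Comparing these hashes against the stored ones flags exactly the words whose subtrees need to be re-examined or re-inserted.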

How to update translation table for translation dictionary database

I am implementing a database for a translation dictionary, and am using the design indicated here.
Is there any way to update an entry in the translation table? Or would you need to have a primary key as well in order to facilitate any updates? Ideally, there wouldn't need to be updates, but it is conceivable a translation could be incorrect and need to be changed.
It seems you could delete the incorrect translation and insert a new one. In my case, I have a server DB, and an Android app that will pull down the languages it needs, and the associated words and translations, into a local DB. In this case, while it may be simple to delete the incorrect translation on the server, how would the client know, unless it deleted and repopulated the entire translation table?
Is a primary key, plus a UNIQUE constraint on the two word_id columns, the best way around this?
You can update an entry in the translation table with a statement such as:
update TRANSLATION_EN_DE
set ID_DE = 3
where ID_DE = 2 and
ID_EN = 1;
I would not have one table per language though.
Add a new table for unique languages, and add its primary key to a words table that holds all languages.
Then your translation table would have "word_from" and "word_to" columns.
It will make your design and code much simpler.
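
A sketch of that design (the names are illustrative):

CREATE TABLE language (
    id   INTEGER PRIMARY KEY,
    name VARCHAR(50) NOT NULL UNIQUE
);

CREATE TABLE word (
    id          INTEGER PRIMARY KEY,
    language_id INTEGER NOT NULL REFERENCES language(id),
    word        VARCHAR(100) NOT NULL
);

-- One translation table for all language pairs.
CREATE TABLE translation (
    word_from INTEGER NOT NULL REFERENCES word(id),
    word_to   INTEGER NOT NULL REFERENCES word(id),
    PRIMARY KEY (word_from, word_to)
);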
To propagate changes to the client you'd probably want to version all of the changes in a new column on all tables to take account of new words/translations, spelling corrections, possible removal of words/translations, and have the client record the version number up to which it has retrieved data.
Since you might have deletes that you want to propagate you'll need to use a "soft delete" flag in the tables, because otherwise there would be no record in the table to hold the version number.
You'd probably also want a table holding those version numbers as a unique key, with text explaining the type of changes that have taken place, and the timestamp for the change. Remove the timestamp columns from all other tables.
So when you make a new batch of changes, create a new version record, make all of the required changes, and then commit all changes in a single transaction. Then the entire change set becomes visible to other database users, and they can very efficiently check whether they are up to date or not, and retrieve only the relevant changes.
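
A sketch of the versioning scheme (table and column names are assumptions):

CREATE TABLE change_set (
    version     INTEGER PRIMARY KEY,
    description VARCHAR(200) NOT NULL,
    changed_at  TIMESTAMP NOT NULL
);

-- Added to translation and every other versioned table:
--   version INTEGER NOT NULL REFERENCES change_set(version)
--   deleted BOOLEAN NOT NULL DEFAULT FALSE  -- the soft-delete flag

-- The client remembers the highest version it has synced and pulls
-- only newer rows, including soft-deleted ones:
SELECT word_from, word_to, deleted
FROM   translation
WHERE  version > 41;  -- 41 = the client's last synced version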

Using B-trees in a database

I have to implement a database using B-trees for a school project. The database is for storing audio files (songs), and a number of different queries can be made, like asking for all the songs of a given artist or on a specific album.
The intuitive idea is to use one B-tree for each field (songs, albums, artists, ...). The problem is that one can be asked to delete any member of any field, and in case you delete an artist you have to delete all his albums and songs from the other B-trees, keeping in mind that, for example, all the songs of a given artist don't have to be near each other in the B-tree that corresponds to songs.
My question is: is there a way to do so (delete the songs after an artist has been deleted) without having to iterate over all elements of the other B-trees? I'm not looking for code, just ideas, because all the ones I've come up with are brute force.
This is my understanding and may not be entirely right.
Typically in a database implementation B-trees are used for indexes, so unless you want to force your user to index every column, defaulting to creating a B-tree for each field is unnecessary. Although this many indexes will lead to a fast read in virtually every case (with an index on everything, you won't have to do a full table scan), it will also cause an extremely slow insert/update/delete, as the corresponding data has to be updated in each tree. As I'm sure you know, modern databases force you to have at least one index (the primary key), so you will have at least one B-tree keyed on the primary key, with a pointer to the appropriate record.
Every node in a B-tree index should have a pointer/reference to the full object it represents.
Future indexes you create would include the attributes you specify in the index, such as song name, artist, etc., but will still contain the pointer/reference to the corresponding record. Thus when you modify, let's say, the song title, you modify the referenced record which all the indexes point to. If any index has the modified attribute as one of its keys, you will have to modify the values in that index itself.
Unfortunately I believe you are correct that you will have to brute-force your way through the other B-trees when deleting/updating; this is one of the downsides of using a lot of indexes (slower update/delete time). If you just delete the referenced nodes, you will likely end up with pointers to deleted objects, which will (depending on your language) give you some form of NullPointerException. To prevent this, the references will have to be removed from all the trees.
Keep in mind though that doing a full scan of your indexes will still be much better than doing full table scans.
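
In SQL terms, what this answer describes looks like the following (the schema is illustrative). Each index is its own B-tree, and every write must maintain all of them:

CREATE TABLE song (
    id        INTEGER PRIMARY KEY,   -- one B-tree
    title     VARCHAR(200) NOT NULL,
    artist_id INTEGER NOT NULL,
    album_id  INTEGER NOT NULL
);

CREATE INDEX idx_song_artist ON song (artist_id);  -- a second B-tree
CREATE INDEX idx_song_album  ON song (album_id);   -- a third B-tree

-- This delete must remove entries from all three trees, not just one.
DELETE FROM song WHERE artist_id = 7;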

Finding a suitable data structure for deletion from both lists

This might get deleted, since it involves idea sharing, which is not quite allowed on Stack Overflow, but before that happens, getting any ideas from solid programmers would be a win for me.
Assume that you have a class Student, stored in the database, and this class has a list property called favoriteTeachers. This list constantly gets updated by the system and contains the ids of teachers.
You also have a class Teacher, also stored in the database, which likewise has a list property favoriteStudents. It is again updated constantly and contains the ids of students.
In our system, when a student calls a function (let's say notMyFavoriteTeacher), our system has to apply the changes below:
Delete the given teacher's id from the student's favoriteTeachers list
Delete the student's id from the given teacher's favoriteStudents list
I was worried that the number of rows updated could exhaust the database, so instead of mapping students to their favorite teachers in a separate table as (user_id, teacher_id) rows, I created a column and stored a string containing the teacher ids separated by commas (e.g. "1,2,14,4,25"). The same applies for the teacher as well.
However, when we call this function we face another problem. For this operation you need to convert the string to a list, find the element by linear search, delete it, convert the list back to a string, and push it back to the db; and you have to do the same for the teacher class as well. Without the string method deletion would be easier, but since we would be handling deletion and addition operations around 2k times a day, I did not think separate tables would be feasible.
I wanted to ask in order to decrease the number of operations, could a data structure be chosen such that it would increase the efficiency?
Storing a relation as an array in a single column is a violation of first normal form, and should not be done without good reason. Although various forms of denormalization may result in increased efficiency in some cases, I don't see this case being one of them. What's worse, you'll get no help from the database in enforcing referential integrity. And some operations will result in guaranteed row scans: when deleting a teacher, you will have to examine every student row to remove the teacher from each student's favorite list. The same goes for deleting a student.
Relational databases are designed and built to link rows to other rows. You need a very good reason to keep them from doing what they're designed to do. You should go ahead and design a proper relational schema, and only if actual measurement shows that it is too slow should you worry about its performance.
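
A sketch of that schema (names are illustrative; it assumes student and teacher tables with integer id primary keys):

CREATE TABLE favorite_teacher (
    student_id INTEGER NOT NULL REFERENCES student(id),
    teacher_id INTEGER NOT NULL REFERENCES teacher(id),
    PRIMARY KEY (student_id, teacher_id)
);

CREATE TABLE favorite_student (
    teacher_id INTEGER NOT NULL REFERENCES teacher(id),
    student_id INTEGER NOT NULL REFERENCES student(id),
    PRIMARY KEY (teacher_id, student_id)
);

-- notMyFavoriteTeacher then becomes two indexed single-row deletes,
-- committed together:
DELETE FROM favorite_teacher WHERE student_id = 42 AND teacher_id = 7;
DELETE FROM favorite_student WHERE teacher_id = 7 AND student_id = 42;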
First of all, I don't understand your choice of storing the ids of favorite teachers/students as comma-separated strings, because whether you use comma-separated values or a table with a (studentId, teacherId) structure, you do exactly two row updates/deletes (first on the favorite-teachers side, second on the favorite-students side).
But one way of optimizing performance given your current data structure would be to keep the comma-separated strings sorted. I mean, from the very formation of the rows, keep your comma-separated ids like "1,5,7,15". This way, if you convert the string to a list, you can perform binary search, which takes O(log n) time instead of O(n).
You are losing all the benefits provided by any RDBMS by storing the ids as comma-separated strings. Create a separate table with student_id and favorite teacher_id. Apply filtering conditions (either for student or for teacher) before joining it to the main tables.
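
For example, a sketch of "filter, then join" against the junction table above (the teacher columns are assumptions):

SELECT t.id, t.name
FROM   favorite_teacher ft
JOIN   teacher t ON t.id = ft.teacher_id
WHERE  ft.student_id = 42;  -- filter on the student first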

General database design: Is it ever considered "okay" to create a non-normalized table on purpose?

After-edit: Wow, this question got long. Please forgive =\
I am creating a new table consisting of over 30 columns. These columns are largely populated by selections made from dropdown lists and their options are largely logically related. For example, a dropdown labeled Review Period will have options such as Monthly, Semi-Annually, and Yearly. I came up with a workable method to normalize these options down to numeric identifiers by creating a primitives lookup table that stores values such as Monthly, Semi-Annually, and Yearly. I then store the IDs of these primitives in the table of record and use a view to join that table out to my lookup table. With this view in place, the table of record can contain raw data that only the application understands while allowing external applications and admins to run SQL against the view and return data that is translated into friendly information.
It just got complicated. Now these dropdown lists are going to have non-logically-related items. For example, the Review Period dropdown list now needs to have options of NA and Manual. This blows my entire grouping scheme out of the water.
Similar constructs that have been used in this application have resorted to storing repeated string values across multiple records. This means you could have hundreds of records with the string 'Monthly' stored in the table's ReviewPeriod column. The thought of this happening has made me cringe since I've started working here, but now I am starting to think that non-normalized data may be the best option here.
The only other way I can think of doing this using my initial method while allowing it to be dynamic and support the constant adding of new options to any dropdown list at any time is this: When saving the data to the database, iterate through every single property of my business object (.NET class in this case) and check for any string value that exists in the primitives table. If it doesn't, add it and return the auto-generated unique identifier for storage in the table of record. It seems so complicated, but is this what one is to go through for the sake of normalized data?
Anything is possible. Nobody is going to haul you off to denormalization jail and revoke your DBA card. I would say that you should know the rules and what breaking them means. Once you have those in hand, it's up to you and your best judgment to do what you think is best.
"I came up with a workable method to normalize these options down to numeric identifiers by creating a primitives lookup table that stores values such as Monthly, Semi-Annually, and Yearly. I then store the IDs of these primitives in the table of record and use a view to join that table out to my lookup table."
Replacing text with ID numbers has nothing at all to do with normalization. You're describing a choice of surrogate keys over natural keys. Sometimes surrogate keys are a good choice, and sometimes surrogate keys are a bad choice. (More often a bad choice than you might believe.)
"This means you could have hundreds of records with the string 'Monthly' stored in the table's ReviewPeriod column. The thought of this happening has made me cringe since I've started working here, but now I am starting to think that non-normalized data may be the best option here."
Storing the string "Monthly" in multiple rows has nothing to do with normalization. (Or with denormalization.) This seems to be related to the notion that normalization means "replace all text with id numbers". Storing text in your database shouldn't make you cringe. VARCHAR(n) is there for a reason.
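
As a sketch of what referential integrity without ID translation looks like (names are illustrative), the lookup table can use the text itself as a natural key:

CREATE TABLE review_period (
    period VARCHAR(20) PRIMARY KEY  -- 'Monthly', 'Semi-Annually', ...
);

CREATE TABLE record_of_interest (
    id            INTEGER PRIMARY KEY,
    review_period VARCHAR(20) NOT NULL REFERENCES review_period(period)
);

The table of record stays human-readable, no view is needed to decode it, and the foreign key still restricts the column to valid values.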
"The only other way I can think of doing this using my initial method while allowing it to be dynamic and support the constant adding of new options to any dropdown list at any time is this: When saving the data to the database, iterate through every single property of my business object (.NET class in this case) and check for any string value that exists in the primitives table. If it doesn't, add it and return the auto-generated unique identifier for storage in the table of record."
Let's think about this informally for a minute.
Foreign keys provide referential integrity. Their purpose is to limit the values allowed in a column. Informally, the referenced table provides a set of valid values. Values that aren't in that table aren't allowed in the referencing column of other tables.
But no matter what the user types in, you're going to add it to that table of valid values.
If you're going to accept everything the user types in the first place, why use a foreign key at all?
The main problem here is that you've been poorly served by the people who taught you (mis-taught you) the relational model. (And, probably, equally poorly by the people who taught you SQL.) I hope you can unlearn those mistaken notions quickly, and soon make real progress.

Style Question - Database Table with Many Fields

I'm starting a new project where I have to parse a document and store it in a database. This document contains several sections of simple key-value pairs - about 10 sections and about 100 pairs in total. I could have one table per section, and they would all map one-to-one to an aggregate. Or I could have one table with about 100 fields. I'm stuck because I don't want to make a single table that big, but I also don't want to make that many one-to-one mappings either. So, do I make the big table, or do I make a bunch of smaller tables? Effectively, there wouldn't really be a difference as far as I can tell. If there is, please inform me.
EDIT
An example is desired so I will provide something that might help.
Document
- Section Title 1
- k1: val1
- k2: val2
...
- Section Title 2
- k10: val10
...
...
- Section Title n
- kn-1: valn-1
- kn: valn
And I have to use a relational database so don't bother suggesting otherwise.
If you have many, many instances of this big document to store (now and/or over time), and if each instance of this document will have values for those 100+ columns, and if you want the power and flexibility inherent in storing all that data across rows and columns within an RDBMS, then I'd store it all as one big (albeit ugly) table.
If all the "items" in a given section are always filled, but individual sections may or may not be filled, then there might be value in having one table per section... but it doesn't sound like this is the case.
Be wary of those "ifs" above. If any of them are too shaky, then the big table idea may be more pain than it's worth, and alternate ideas (such as #9000's NoSQL idea) might be better.
If the data is just for read-only purposes and your XML doesn't mandate DB schema changes (alters), then I don't see any problem denormalizing to a single table. The other alternative might be to look at EAV models.
CREATE TABLE document (
    id   INTEGER PRIMARY KEY,   -- a surrogate key
    name VARCHAR(100) NOT NULL  -- the "natural" key
);

CREATE TABLE content (
    document_id   INTEGER NOT NULL REFERENCES document(id),  -- the parent document
    section_title VARCHAR(100) NOT NULL,
    name          VARCHAR(100) NOT NULL,
    value         VARCHAR(400),
    PRIMARY KEY (document_id, section_title, name)
);
Yes, you have hundreds of rows of name/value pairs per document. However, you can easily add names and values without having to revise the database schema.
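
A sketch of using it, with values from the example above:

-- New keys need only new rows, never an ALTER TABLE.
INSERT INTO content (document_id, section_title, name, value)
VALUES (1, 'Section Title 1', 'k1', 'val1'),
       (1, 'Section Title 1', 'k2', 'val2');

-- Rebuild one section of one document:
SELECT name, value
FROM   content
WHERE  document_id = 1 AND section_title = 'Section Title 1';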
