Database design for dictionary application

I'd like to develop a dictionary application for mobile devices. The dictionary uses an offline file/database to translate words. It translates between just two languages, for example an English-Spanish dictionary.
I have a simple design in mind. It would have two tables: an English table and a Spanish table.
Each table contains:
word_id = the id, which would be a foreign key for the other table
word = the word
word_description
correspond_trans_id = the id in the other table of the word that translates this word into the other language.
Because this is a mobile application, the database is SQLite.
The definition data for each table is supplied ordered by the 'word' field. However, I'm still thinking about what happens when definitions are added. Because the table is ordered by 'word', is there a way to insert a new record so that it stays in order by word? Or is there any idea to make this more efficient?

For each word there are usually a few possible translations depending on the context. If you'd like to build a bidirectional dictionary for two languages, you need at least three tables:
ENGLISH
ID | WORD
1 | 'dictionary'
GERMAN
ID | WORD
1 | 'lexikon'
2 | 'wörterbuch'
TRANSLATION_EN_DE
ID_EN | ID_DE
1 | 1
1 | 2
The first two tables contain all the words known in each language, and the bidirectional mapping is done by the third, mapping table. This is a common n:n mapping case.
With two more tables you can always add a new language to your dictionary. If you do it with one table, you'll have multiple definitions for a single word and thus no normalized database.
You can also merge your language tables into a single table and define each word's language with another column (referencing a language table). In that case you'll need a two-column index on the language and the word itself.
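A minimal sketch of the three-table layout above, using Python's built-in sqlite3 module since the question targets SQLite. The lowercase table names, the UNIQUE constraints and the two helper functions are assumptions made for illustration; they are not part of the answer itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE english (id INTEGER PRIMARY KEY, word TEXT NOT NULL UNIQUE);
CREATE TABLE german  (id INTEGER PRIMARY KEY, word TEXT NOT NULL UNIQUE);
-- n:n mapping table: one row per (English word, German word) pair
CREATE TABLE translation_en_de (
    id_en INTEGER NOT NULL REFERENCES english(id),
    id_de INTEGER NOT NULL REFERENCES german(id),
    PRIMARY KEY (id_en, id_de)
);
INSERT INTO english VALUES (1, 'dictionary');
INSERT INTO german  VALUES (1, 'lexikon'), (2, 'wörterbuch');
INSERT INTO translation_en_de VALUES (1, 1), (1, 2);
""")


def english_to_german(conn, word):
    """Return every German translation of an English word (hypothetical helper)."""
    rows = conn.execute(
        """SELECT g.word FROM english e
           JOIN translation_en_de t ON t.id_en = e.id
           JOIN german g ON g.id = t.id_de
           WHERE e.word = ? ORDER BY g.word""",
        (word,),
    )
    return [r[0] for r in rows]


def german_to_english(conn, word):
    """Same mapping table, read in the other direction (hypothetical helper)."""
    rows = conn.execute(
        """SELECT e.word FROM german g
           JOIN translation_en_de t ON t.id_de = g.id
           JOIN english e ON e.id = t.id_en
           WHERE g.word = ? ORDER BY e.word""",
        (word,),
    )
    return [r[0] for r in rows]
```

The same mapping rows serve both directions, so no translation is stored twice.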

What do you intend to do when a word in language 1 can be translated by more than one word in language 2? I think you have to use something like wursT's design to handle that.
RE inserting records in alphabetical order: You do not normally worry about the physical ordering of records in a database. You use an ORDER BY clause to retrieve them in any desired order, and an index to make it efficient. There is nothing in the SQL standard to control physical ordering. Umm, I recall coming across something about forcing a physical ordering on some database I worked with, I think it was MySQL, but most will not give you any control of this. I haven't worked with SQLite so I can't say if it provides a way.
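As a rough illustration of the ORDER BY plus index approach in SQLite (via Python's sqlite3 module): rows are inserted in arbitrary order and sorted only when read. The table layout, index name and sample words are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE english (word_id INTEGER PRIMARY KEY, word TEXT NOT NULL)")
# The index lets the engine return rows sorted by word without re-sorting the table.
conn.execute("CREATE INDEX idx_english_word ON english(word)")

# Initial load, deliberately not in alphabetical order.
for w in ["zebra", "apple", "mango"]:
    conn.execute("INSERT INTO english (word) VALUES (?)", (w,))

# A later addition: no need to find the "right place" for it.
conn.execute("INSERT INTO english (word) VALUES (?)", ("banana",))

# Ordering is requested at query time.
words = [row[0] for row in conn.execute("SELECT word FROM english ORDER BY word")]
```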

Surely the relationship between words and their possible translations is one-to-many or many-to-many. I'm not clear how you will represent this in your model. Seems like you may need at least one more table.

I agree with Matt - to make life much easier I would stick with one table. Also, if you plan to use CoreData, the index modelling of traditional database design differs from the object-graph-based model used when working in Objective-C/iOS.
It's very easy to think along the traditional lines of Select querying and inner / outer joins but for example your column 'correspond_trans_id' would normally be handled by setting a 'relationship' when defining your data model for the two tables (if you are using CoreData of course).
In essence unless there is a good reason to have two tables I would stick with just one.
In relation to the ordering, you might not need to keep the order of words in the dataset. I'm guessing you want to keep everything Alphabetical which would involve some work if the data were to ever change, even for just one table.
Again using CoreData, NSFetchRequest and NSSortDescriptor, it is very easy to return a set of records ordered by a specified column, freeing you from having to worry about amends and additions to your database.
If you have any questions give me a shout.

Related

SQL DB schema best practice for List item table

I have a table, say Table1, which has the following columns
1. Id
2. Name
3. TransportModeId
4. ParkingId
5. ActivityId
Columns 3, 4 and 5 are foreign keys, and all three reference simple list tables which have the following columns
1. Id
2. Item
For simplicity I have shown 3 tables; my actual schema contains almost 25 list tables.
What would be the best practice?
Option 1.
Keep all list tables separate, which will create 25 tables, but on the other hand I will have a clean, modular schema.
Option 2.
Make one table with a self join and add all the items to that table. A row whose ParentId is null represents the name of a list table; it can be referenced from more than one other table as described above, and it has to be kept in some kind of common module.
thanks
Option 1 is how it is normally done when designing a system that is not supposed to be highly configurable by the end user or implementer. It has several important advantages; two of them:
When you need to add an extra attribute to any of the enumerations (e.g. a parking location to the Parking enumeration), it is quite simple and does not produce extra problems.
It is optimized for speed, using the relational database engine's native algorithms for linking records.
As for Option 2:
It is something called generalization. You take several types with similar attributes (methods) and create a class/table with a structure that fits different purposes.
The self reference, as you describe it, is not a good idea for Option 2; rather, make a reference to another EnumerationType table containing type names like Parking, Activity etc. with an id.
Using this approach could make sense if you need to let the end user configure the attributes himself within your app. Otherwise it could cause you problems when you find out that different enumeration tables need to have different structures.
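To make the two options concrete, here is a small sketch using Python's sqlite3 module. The table and column names (transport_mode, parking, enumeration_type, enumeration_item) and the sample rows are hypothetical; only the overall shape follows the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
-- Option 1: one small table per list; each can grow its own columns.
CREATE TABLE transport_mode (id INTEGER PRIMARY KEY, item TEXT NOT NULL);
CREATE TABLE parking (
    id INTEGER PRIMARY KEY,
    item TEXT NOT NULL,
    location TEXT              -- extra attribute only Parking needs
);

-- Option 2 (generalized): a type table plus one shared item table.
CREATE TABLE enumeration_type (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE enumeration_item (
    id INTEGER PRIMARY KEY,
    type_id INTEGER NOT NULL REFERENCES enumeration_type(id),
    item TEXT NOT NULL
);
INSERT INTO enumeration_type VALUES (1, 'Parking'), (2, 'Activity');
INSERT INTO enumeration_item (type_id, item) VALUES
    (1, 'Street'), (1, 'Garage'), (2, 'Hiking');
""")


def items_of_type(conn, type_name):
    """List the items of one enumeration in the generalized layout."""
    rows = conn.execute(
        """SELECT i.item FROM enumeration_item i
           JOIN enumeration_type t ON t.id = i.type_id
           WHERE t.name = ? ORDER BY i.item""",
        (type_name,),
    )
    return [r[0] for r in rows]
```

In the generalized layout a Parking-only attribute such as location would have to become a nullable column shared by every enumeration, which is the structural problem the answer warns about.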

Parent child design to easily identify child type

In our database design we have a couple of tables that describe different objects but which are of the same basic type. As describing the actual tables and what each column is doing would take a long time I'm going to try to simplify it by using a similar structured example based on a job database.
So say we have following tables:
These tables have no connections between each other but share identical columns. So the first step was to unify the identical columns and introduce a unique personId:
Now we have the "header" columns in person that are then linked to the more specific job tables using a 1 to 1 relation using the personId PK as the FK. In our use case a person can only ever have one job so the personId is also unique across the Taxi driver, Programmer and Construction worker tables.
While this structure works we now have the use case where in our application we get the personId and want to get the data of the respective job table. This gets us to the problem that we can't immediately know what kind of job the person with this personId is doing.
A few options we came up with to solve this issue:
Deal with it in the backend
This means leaving the architecture as it is and looking for the right table in the backend code. This could mean looking through every table present and/or constructing a semi-complicated join select in which we have to sift through all columns to find the ones which are filled.
All in all: possible, but it means a lot of unnecessary selects. We would also like to keep such database-oriented logic in the actual database.
Using a Type Field
This means adding a field column in the Person table filled for example with numbers to determine the correct child table like:
So you could add a 0 in Type if it's a taxi driver, a 1 if it's a programmer and so on...
While this greatly reduces the amount of backend logic, we then have to make sure that the numbers we use in the Type field are known in the backend and don't ever change.
Use separate IDs for each table
That means every job gets its own ID (has to be nullable) in Person like:
Now it's easy to find out which job each person has due to the others having an empty ID.
So my question is: which one of these designs is the best practice? Am I missing an obvious solution here?
Bill Karwin made a good explanation on a problem similar to this one. https://stackoverflow.com/a/695860/7451039
We've now decided to go with the second option because it seems to come with the fewest drawbacks, as described by the other commenters and posters. As there was no actual answer portraying the second option as a solution, I will try to summarize our reasoning:
Against Option 1:
There is no way to distinguish the type from looking at the parent table. As a result the backend would have to include all the logic, which includes scanning all tables for the one that contains the id. While you can compress most of the logic into a single big join select, it would still be a lot more logic than the other options need.
Against Option 3:
As @yuri-g said, this one is technically not possible, as the separate IDs could not be set up as primary keys. They would have to be nullable and as a result can't be part of a primary key, essentially rendering the parent table useless, as one of the reasons for it was to have a unique personId across the tables.
Against a single table containing all columns:
For smaller use cases like the one I described in the question this might be viable, but we are talking about a bunch of tables, each having roughly 2-6 columns. This option would turn into a column mess really quickly.
Against a flat design with a key-value table:
Our properties have completely different data types, different constraints and foreign key relations. All of this would be impossible or difficult in this design.
Against custom database objects containing the child-specific properties:
While this option, which @Matthew McPeak suggested, might be viable for a lot of people, our database design never really used objects, so introducing them to the mix would likely cause more confusion than it would help us.
In favor of the second option:
This option is easy to use in our table oriented database structure, makes it easy to distinguish the proper child table and does not need a lot of reworking to introduce. Especially since we already have something similar to a Type table that we can easily use for this purpose.
The third option, as you describe it, is impossible: no RDBMS (at least, none that I personally know of) would allow you to use NULLs in a PK (even a composite one).
The second is realistic.
And yes, the first would take up to N queries to poll the relatives in order to determine the actual type (where N is the number of types).
Although you won't get away with one query in the second case either: there would always be two of them, because you can't JOIN unless you know exactly what you should be joining.
So basically there are flaws in your design, and you should consider other options.
Like denormalization: inline the non-shared attributes into the parent table anyway; the fields then become NULL for non-corresponding types.
Or a flexible, flat list of attribute-value pairs related through the primary key (yes, schema enforcement is a trade-off).
Or switch to a column-oriented DB: that's a case for it.
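The two-query lookup that the type-field option implies can be sketched as follows with Python's sqlite3 module. The column names (license_no, main_language), the sample rows and the JOB_TABLES mapping are hypothetical; the codes 0 and 1 follow the question's example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can be converted to dicts by column name
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    type INTEGER NOT NULL        -- 0 = taxi driver, 1 = programmer
);
CREATE TABLE taxi_driver (
    person_id INTEGER PRIMARY KEY REFERENCES person(person_id),
    license_no TEXT
);
CREATE TABLE programmer (
    person_id INTEGER PRIMARY KEY REFERENCES person(person_id),
    main_language TEXT
);
INSERT INTO person VALUES (1, 'Ann', 0), (2, 'Bob', 1);
INSERT INTO taxi_driver VALUES (1, 'T-100');
INSERT INTO programmer VALUES (2, 'SQL');
""")

# Fixed mapping from type code to child table; table names never come from user input.
JOB_TABLES = {0: "taxi_driver", 1: "programmer"}


def load_job(conn, person_id):
    """Query 1 reads the type; query 2 reads the matching child table."""
    head = conn.execute(
        "SELECT type FROM person WHERE person_id = ?", (person_id,)
    ).fetchone()
    if head is None:
        return None
    table = JOB_TABLES[head["type"]]
    job = conn.execute(
        f"SELECT * FROM {table} WHERE person_id = ?", (person_id,)
    ).fetchone()
    return table, (dict(job) if job is not None else None)
```

The mapping in JOB_TABLES is exactly the piece of knowledge the question worries about: it has to stay in sync with the values stored in the type column.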

Reverse index implementation with spring-data and postgres

I have a table, say orders, which has a column, say an alphanumeric, 15-character itemId, and a bunch of other columns. The same itemId can be repeated up to, say, 900 times for very popular items, which means the data will be repeated about 900 times. Obviously, we need to separate it out. However, we need the lookup for the list of items to be very quick and efficient. I read up a bit and thought reverse indexing would be a good way to achieve this. However, I am a bit confused about the actual implementation. I couldn't find any examples online other than http://blog.frankel.ch/tag/spring-data , but it uses solr. I was thinking of creating an items-orders table and adding a repository class which will have a method to . However, since there is a many-to-many relation between items and orders, it will require a join table. This makes me think that I am on the wrong path, as I intended the items-orders table itself as a kind of join table, since it only has itemId and orderId in it.
I am pretty sure I am doing something wrong. Any pointers are greatly appreciated. Sorry for a basic question, but I could not find much information with samples online.
thanks,
Alice
You're on the right track with an item-orders link table. You will probably find that you end up using the table for additional columns you haven't considered yet (quantity, price, etc.).
The main thing to start with is making sure your database design is right; look up the basic normalization rules about not duplicating information. Also, when you create your tables, make sure you explicitly tell the database about the relationships between the tables using FOREIGN KEY and PRIMARY KEY constraints.
Once you have the correct logical structure in place you can see if you have any performance issues that require you to do anything clever.
Relational databases were designed to do exactly what you're contemplating, though, so the performance will probably be much better than you fear. Premature optimization would be a huge mistake.
You mentioned solr; this is a generic text search engine (sort of like Google). For your requirements you want to stick to a pure relational database. You want a database that delivers exact results based on exact criteria (exactly which products have been included in an order, etc.); you don't want any fuzzy matching or artificial intelligence guessing about what has been ordered.
You might also store the product catalogue in solr so the user could look for products that mention pink, blue or purple in the description and come in a size 4, etc.; then, once the product has been chosen, use the itemId in the relational database.
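A rough sketch of the link-table design described above, written against SQLite through Python's sqlite3 module rather than Postgres and spring-data, so the DDL is only indicative. All table, column and index names, the quantity column and the sample data are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE items  (item_id TEXT PRIMARY KEY, description TEXT NOT NULL);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT NOT NULL);
-- Link table: one row per (order, item); item details are stored once in items.
CREATE TABLE item_orders (
    order_id INTEGER NOT NULL REFERENCES orders(order_id),
    item_id  TEXT    NOT NULL REFERENCES items(item_id),
    quantity INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (order_id, item_id)
);
-- Supports the "which orders contain this item?" direction.
CREATE INDEX idx_item_orders_item ON item_orders(item_id);

INSERT INTO items  VALUES ('ABC123DEF456GHI', 'pink shirt, size 4');
INSERT INTO orders VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
INSERT INTO item_orders (order_id, item_id, quantity) VALUES
    (1, 'ABC123DEF456GHI', 2), (3, 'ABC123DEF456GHI', 1);
""")


def orders_for_item(conn, item_id):
    """Exact lookup of all orders that include the given item."""
    rows = conn.execute(
        "SELECT order_id FROM item_orders WHERE item_id = ? ORDER BY order_id",
        (item_id,),
    )
    return [r[0] for r in rows]
```

The composite primary key covers lookups by order, and the extra index covers lookups by item, which is the effect the question was hoping to get from a reverse index.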

Single Big SQL Server lookup table

I have a SQL Server 2008 database with a snowflake-style schema, so lots of different lookup tables, like Language, Countries, States, Status, etc. All these lookup tables have almost identical structures: two columns, Code and Decode. My project manager would like all of these different tables to be one BIG table, so I would need another column, say CodeCategory, and my primary key columns for this big table would be CodeCategory and Code. The problem is that for any of the tables that have the actual code (say Language Code), I cannot establish a foreign key relationship into this big decode table, as the CodeCategory would not be in the fact table, just the code. And codes by themselves will not be unique (they will be within a CodeCategory), so I cannot make an FK from just the fact table code field into the big lookup table's Code field.
So am I missing something, or is this impossible to do and still be able to do FKs in the related tables? I wish I could do this: have a FK where one of the columns I was matching to in the lookup table would match to a string constant. Like this (I know this is impossible but it gives you an idea what I want to do):
ALTER TABLE [dbo].[Users] WITH CHECK ADD CONSTRAINT [FK_User_AppCodes]
FOREIGN KEY('Language', [LanguageCode])
REFERENCES [dbo].[AppCodes] ([AppCodeCategory], [AppCode])
The above does not work, but if it did I would have the FK I need. Where I have the string 'Language', is there any way in T-SQL to substitute the table name from code instead?
I absolutely need the FKs, so if nothing like this is possible, then I will have to stick with my many little lookup tables. Any assistance would be appreciated.
Brian
It is not impossible to accomplish this, but it is impossible to accomplish this and not hurt the system on several levels.
While a single lookup table (as has been pointed out already) is a truly horrible idea, I will say that this pattern does not require a single field PK or that it be auto-generated. It requires a composite PK comprised of ([AppCodeCategory], [AppCode]) and then BOTH fields need to be present in the fact table that would have a composite FK of both fields back to the PK. Again, this is not an endorsement of this particular end-goal, just a technical note that it is possible to have composite PKs and FKs in other, more appropriate scenarios.
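Purely as a technical illustration of the composite PK/FK mechanics described above (not an endorsement of the single-table design), here is a sketch using SQLite through Python's sqlite3 module instead of SQL Server. The table and column names are hypothetical, and the CHECK constraint that pins the category to 'Language' is an addition of this sketch, not something the answer proposes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE app_codes (
    app_code_category TEXT NOT NULL,
    app_code          TEXT NOT NULL,
    decode            TEXT NOT NULL,
    PRIMARY KEY (app_code_category, app_code)
);
CREATE TABLE users (
    user_id           INTEGER PRIMARY KEY,
    -- The category has to be stored in the referencing table as well.
    language_category TEXT NOT NULL CHECK (language_category = 'Language'),
    language_code     TEXT NOT NULL,
    FOREIGN KEY (language_category, language_code)
        REFERENCES app_codes (app_code_category, app_code)
);
INSERT INTO app_codes VALUES
    ('Language', 'EN', 'English'),
    ('Country',  'US', 'United States');
""")


def insert_user(conn, user_id, category, code):
    """Return True if the row satisfies both the CHECK and the composite FK."""
    try:
        conn.execute("INSERT INTO users VALUES (?, ?, ?)", (user_id, category, code))
        return True
    except sqlite3.IntegrityError:
        return False
```

Every referencing column needs its own category column in this layout, which is the per-lookup overhead the rest of the answer goes on to criticize.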
The main problem with this type of approach to constants is that each constant is truly its own thing: Languages, Countries, States, Statii, etc are all completely separate entities. While the structure of them in the database is the same (as of today), the data within that structure does not represent the same things. You would be locked into a model that either disallows adding additional lookup fields later (such as ISO codes for Language and Country but not the others, or something related to States that is not applicable to the others), or would require adding NULLable fields with no way to know which Category/ies they applied to (have fun debugging issues related to that and/or explaining to the new person -- who has been there for 2 days and is tasked with writing a new report -- that the 3 digit ISO Country Code does not apply to the "Deleted" status).
This approach also requires that you maintain an arbitrary "Category" field in all related tables. And that is per lookup. So if you have CountryCode, LanguageCode, and StateCode in the fact table, each of those FKs gets a matching CategoryID field, so now that is 6 fields instead of 3. Even if you were able to use TINYINT for CategoryID, if your fact table has even 200 million rows, then those three extra 1 byte fields now take up 600 MB, which adversely affects performance. And let's not forget that backups will take longer and take up more space, but disk is cheap, right? Oh, and if backups take longer, then restores also take longer, right? Oh, but the table has closer to 1 billion rows? Even better ;-).
While this approach looks maybe "cleaner" or "easier" now, it is actually more costly in the long run, especially in terms of wasted developer time, as you (and/or others) in the future try to work around issues related to this poor design choice.
Has anyone even asked your project manager what the intended benefit of this is? It is a reasonable question if you are going to spend some amount of hours making changes to the system that there be a stated benefit for that time spent. It certainly does not make interacting with the data any easier, and in fact will make it harder, especially if you choose a string for the "Category" instead of a TINYINT or maybe SMALLINT.
If your PM still presses for this change, then it should be required, as part of that project, to also change any enums in the app code accordingly so that they match what is in the database. Since the database is having its values munged together, you can accomplish that in C# (assuming your app code is in C#, if not then translate to whatever is appropriate) by setting the enum values explicitly with a pattern of the first X digits are the "category" and the remaining Y digits are the "value". For example:
Assume the "Country" category == 1 and the "Language" category == 2, you could do:
enum AppCodes
{
    // Countries: category 1, followed by a 6-digit value
    UnitedStates = 1000001,
    Canada = 1000002,
    SomewhereElse = 1000003,
    // Languages: category 2, followed by a 6-digit value
    EnglishUS = 2000001,
    EnglishUK = 2000002,
    French = 2000003
};
Absurd? Completely. But also analogous to the request of merging all lookup tables into a single table. What's good for the goose is good for the gander, right?
Is this being suggested so you can minimise the number of admin screens you need for CRUD operations on your standing data? I've been here before and decided it was better/safer/easier to build a generic screen which used metadata to decide what table to extract from/write to. It was a bit more work to build but kept the database schema 'correct'.
All the standing data tables had the same basic structure, they were mainly for dropdown population with occasional additional fields for business rule purposes.

Style Question - Database Table with Many Fields

I'm starting a new project where I have to parse a document and store it in a database. This document contains several sections of simple key-value pairs - about 10 sections and about 100 pairs in total. I could have one table per section, and they all map one-to-one to an aggregate. Or I could have one table with about 100 fields. I'm stuck because I don't want to make a single table that big, but I also don't want to make that many one-to-one mappings either. So, do I make the big table, or do I make a bunch of smaller tables? Effectively, there wouldn't really be a difference as far as I can tell. If there are, please inform me.
EDIT
An example is desired so I will provide something that might help.
Document
- Section Title 1
  - k1: val1
  - k2: val2
  ...
- Section Title 2
  - k10: val10
  ...
...
- Section Title n
  - kn-1: valn-1
  - kn: valn
And I have to use a relational database so don't bother suggesting otherwise.
If you have many, many instances of this big document to store (now and/or over time), and if each instance of this document will have values for those 100+ columns, and if you want the power and flexibility inherent in storing all that data across rows and columns within an RDBMS, then I'd store it all as one big (albeit ugly) table.
If all the "items" in a given section are always filled, but individual sections may or may not be filled, then there might be value in having one table per section... but it doesn't sound like this is the case.
Be wary of these "ifs" above. If any of them are too shaky, then the big table idea may be more pain than it's worth, and alternate ideas (such as @9000's NoSQL idea) might be better.
If the data is read-only and your xml doesn't force you to make DB schema changes (alters), then I don't see any problem with de-normalizing to a single table. The other alternative might be to look at EAV models
Table document(
PK - a surrogate key
name - the "natural" key
)
Table content(
PK - the PK of the parent document
section title
name
value
)
Yes, you have 100's of rows of name/value pairs per document. However, you can easily add names and values without having to revise the database.
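The document/content layout above might look roughly like this in SQLite, driven from Python's sqlite3 module. The column types, the composite primary key on content and the two helper functions are assumptions; the outline above only names the columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE document (
    id   INTEGER PRIMARY KEY,      -- surrogate key
    name TEXT NOT NULL UNIQUE      -- the "natural" key
);
CREATE TABLE content (
    document_id   INTEGER NOT NULL REFERENCES document(id),
    section_title TEXT NOT NULL,
    name          TEXT NOT NULL,
    value         TEXT,
    PRIMARY KEY (document_id, section_title, name)
);
""")


def store_document(conn, name, sections):
    """Store {section title: {key: value}} as one row per key-value pair."""
    doc_id = conn.execute(
        "INSERT INTO document (name) VALUES (?)", (name,)
    ).lastrowid
    conn.executemany(
        "INSERT INTO content VALUES (?, ?, ?, ?)",
        [
            (doc_id, title, key, value)
            for title, pairs in sections.items()
            for key, value in pairs.items()
        ],
    )
    return doc_id


def load_section(conn, doc_id, section_title):
    """Read one section back as a dict of key-value pairs."""
    rows = conn.execute(
        "SELECT name, value FROM content WHERE document_id = ? AND section_title = ?",
        (doc_id, section_title),
    )
    return dict(rows.fetchall())


doc_id = store_document(
    conn,
    "doc-1",
    {"Section Title 1": {"k1": "val1", "k2": "val2"},
     "Section Title 2": {"k10": "val10"}},
)
```

A new key in a later document needs no ALTER TABLE here; the trade-off is that every value is stored as text and the schema cannot enforce per-key types.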
