How should I design a database ER-diagram for this problem?

I have a specific problem to solve but can't work out the design.
I have 5-6 statements that I need to store in my database. The system is like a news feed.
Statement 1 : A installed this website.
Statement 2 : A added store R in provinceX
Statement 3 : B reviewed store R
Statement 4 : A edited store R
Statement 5 : A added product P in product_category1
Statement 6 : B reviewed product P
Note that A, B (people's names), store R (the name of the store the person added), product P, provinceX, and product_category1 are dynamic data.
So far, my idea is:
person_table(id, name, age, ...)
store_table(sid, store_name, province_id, ...)
product_table(pid, product_name, ...)
But how about the feed_table?
How should I design the database to store this data, and how do I query it for display?
Thank you.

There are two approaches to this kind of problem:
1. You design your tables in such a way that you have no repetition of information; the feed you're interested in can then be constructed from the existing tables in a performant manner; or
2. You repeat certain data to make your feed easier to implement.
Personally I would probably go for (2) and have a new table:
Feed: id, person_id, store_id, action_id, province_id, product_category_id
with the last two fields being optional, based on the action (review, edit, add, etc).
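A minimal sketch of what that could look like in SQL (MySQL-style syntax assumed; the action_table verb lookup, the product_id column, and the created_at timestamp are illustrative additions, not part of the answer above):

CREATE TABLE feed (
    id                  INT PRIMARY KEY AUTO_INCREMENT,
    person_id           INT NOT NULL,        -- who did it
    action_id           INT NOT NULL,        -- installed / added / reviewed / edited
    store_id            INT NULL,            -- set for store-related actions
    product_id          INT NULL,            -- illustrative: set for product actions
    province_id         INT NULL,            -- set only for "added store"
    product_category_id INT NULL,            -- set only for "added product"
    created_at          DATETIME NOT NULL    -- illustrative: lets you sort the feed
);

-- Rendering the latest items is then a straightforward join:
SELECT f.created_at, p.name, a.verb, s.store_name
FROM feed f
JOIN person_table p ON p.id = f.person_id
JOIN action_table a ON a.id = f.action_id    -- hypothetical verb lookup table
LEFT JOIN store_table s ON s.sid = f.store_id
ORDER BY f.created_at DESC
LIMIT 20;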
Purists will argue that repeated data is bad (certainly normal-form seeks to factor it out) but in the real world database schemas do this all the time for performance reasons.
Think about it this way: what do you spend most of your time doing in your application?
1. Viewing the feed (reading); or
2. Doing actions (writing).
If it's (1), which I suspect it is, then a feed table makes sense. You typically want to optimize for reads not writes as writes occur far less often.

Related

How to store static data that will be localized?

I am developing a health care system, and when a doctor starts to type a diagnosis, I want him to be able to select from a list that is displayed for him instead of typing it out.
The list contains diseases and symptoms that will then be inserted into a diagnosis table in the database.
I did that for two reasons:
I want all doctors to use the same list of symptoms when writing their diagnosis to work on that data later on, instead of each one typing his own way.
The data will be localized and translated to different languages when displayed to different regions.
I am facing a problem here: should I put all of these in a lookup table in the database or in a config file, given that there are 3,000 rows in 7 languages (each language will have its own column) and I may add or remove data at any time?
I would put them in a database. I find it easier to maintain, and faster to query than a config file.
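A minimal sketch of the lookup-table layout the question describes, with one column per language (all names and the choice of languages are illustrative):

CREATE TABLE diagnosis_term (
    id  INT PRIMARY KEY,
    en  NVARCHAR(200) NOT NULL,  -- English
    fr  NVARCHAR(200) NULL,      -- French
    ar  NVARCHAR(200) NULL       -- ... one column per supported language, 7 in total
);

If the set of languages is likely to grow, a normalized alternative is a (term_id, language_code, text) translation table: adding a language then means inserting rows rather than altering the schema.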

Bitemporal Database Design Question

I am designing a database that needs to store transaction time and valid time, and I am struggling with how to effectively store the data and whether or not to fully time-normalize attributes. For instance, I have a table Client that has the following attributes: ID, Name, ClientType (e.g. corporation), RelationshipType (e.g. client, prospect), RelationshipStatus (e.g. Active, Inactive, Closed). ClientType, RelationshipType, and RelationshipStatus are time-varying fields. Performance is a concern, as this information will link to large datasets from legacy systems. At the same time, the database structure needs to be easily maintainable and modifiable.
I am planning on splitting out audit trail and point-in-time history into separate tables, but I’m struggling with how to best do this.
Some ideas I have:
1) Three tables: Client, ClientHist, and ClientAudit. Client will contain the current state. ClientHist will contain any previously valid states, and ClientAudit will be for auditing purposes. For ease of discussion, let's forget about ClientAudit and assume the user never makes a data entry mistake. Doing it this way, I have two ways I can update the data. First, I could always require the user to provide an effective date and save a record out to ClientHist, which would result in a record being written to ClientHist each time a field is changed. Alternatively, I could only require the user to provide an effective date when one of the time-varying attributes (i.e. ClientType, RelationshipType, RelationshipStatus) changes. This would result in a record being written to ClientHist only when a time-varying attribute is changed.
2) I could split out the time-varying attributes into one or more tables. If I go this route, do I put all three in one table or create two tables (one for RelationshipType and RelationshipStatus, and one for ClientType)? Creating multiple tables for time-varying attributes does significantly increase the complexity of the database design. Each table will have associated audit tables as well.
Any thoughts?
A lot depends (or so I think) on how frequently the time-sensitive data will be changed. If changes are infrequent, then I'd go with (1), but if changes happen a lot and not necessarily to all the time-sensitive values at once, then (2) might be more efficient--but I'd want to think that over very carefully first, since it would be hard to manage and maintain.
I like the idea of requiring users to enter effective dates, because this could serve to reduce just how much detail you are saving--for example, however many changes they make today, it only produces that one History row that comes into effect tomorrow (though the audit table might get pretty big). But can you actually get users to enter what is somewhat abstract data?
You might want to try a single Client table with four date columns to handle the two temporal dimensions.
Something like (client_id, ..., valid_dt_start, valid_dt_end, audit_dt_start, audit_dt_end).
This design is very simple to work with, and I would try it and see how it scales before going with something more complicated.
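A minimal sketch of that single-table layout, with illustrative column types, a sentinel end date (an assumption), and a sample point-in-time query:

CREATE TABLE client (
    client_id           INT          NOT NULL,
    name                VARCHAR(100) NOT NULL,
    client_type         VARCHAR(50),
    relationship_type   VARCHAR(50),
    relationship_status VARCHAR(50),
    valid_dt_start      DATE     NOT NULL,  -- when the fact became true in reality
    valid_dt_end        DATE     NOT NULL,  -- '9999-12-31' = still true
    audit_dt_start      DATETIME NOT NULL,  -- when we recorded the row
    audit_dt_end        DATETIME NOT NULL,  -- when the row was superseded
    PRIMARY KEY (client_id, valid_dt_start, audit_dt_start)
);

-- "As of what we knew on 2009-06-01, what was client 42's state on 2009-01-15?"
SELECT *
FROM client
WHERE client_id = 42
  AND valid_dt_start <= '2009-01-15' AND valid_dt_end > '2009-01-15'
  AND audit_dt_start <= '2009-06-01' AND audit_dt_end > '2009-06-01';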

Preferred way of retrieving row with multiple relating rows

I'm currently hand-writing a DAL in C# with SqlDataReader and stored procedures. Performance is important, but it still should be maintainable...
Let's say there's a table recipes
(recipeID, author, timeNeeded, yummyFactor, ...)
and a table ingredients
(recipeID, name, amount, yummyContributionFactor, ...)
Now I'd like to query like 200 recipes with their ingredients. I see the following possibilities:
Query all recipes, then query the ingredients for each recipe.
This would of course result in maaany queries.
Query all recipes and their ingredients in a big joined list. This will cause a lot of useless traffic, because every recipe data will be transmitted multiple times.
Query all recipes, then query all the ingredients at once by passing the list of recipeIDs back to the database. Alternatively, issue both queries at once and return multiple result sets. Back in the DAL, associate the ingredients with the recipes by their recipeID.
Exotic way: Cursor through all recipes and return, for each recipe, two separate result sets for the recipe and its ingredients. Is there a limit on the number of result sets?
For more variety, the recipes can be selected by a list of IDs from the DAL or by some parametrized SQL condition.
Which one do you think has the best performance/mess ratio?
If you only need to join two tables and an "ingredient" isn't a huge amount of data, the best balance of performance and maintainability is likely to be a single joined query. Yes, you are repeating some data in the results, but unless you have 100,000 rows and it's overloading the database server/network, it's too soon to be optimizing.
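A minimal sketch of that single joined query, using the tables from the question (the WHERE clause is a placeholder for however you select your ~200 recipes):

-- One round trip; the recipe columns repeat once per ingredient row.
SELECT r.recipeID, r.author, r.timeNeeded, r.yummyFactor,
       i.name, i.amount, i.yummyContributionFactor
FROM recipes r
JOIN ingredients i ON i.recipeID = r.recipeID
WHERE r.timeNeeded < 30   -- placeholder condition
ORDER BY r.recipeID;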
The story is a little bit different if you have many layers of joins each with decreasing cardinality. For example, in one of my apps I have something like the following:
Event -> EventType   -> EventCategory
                     -> EventPriority
      -> EventSource -> EventSourceType -> Vendor
A query like this results in a significant amount of duplication which is unacceptable when there are 100k events to retrieve, 1000 event types, maybe 10 categories/priorities, 50 sources, and 5 vendors. So in that case, I have a stored procedure that returns multiple result sets:
All 100k Events with just EventTypeID
The 1000 EventTypes with CategoryID, PriorityID, etc. that apply to these Events
The 10 EventCategories and EventPriorities that apply to the above EventTypes
The 50 EventSources that generated the 100k events
And so on, you get the idea.
Because the cardinality goes down so drastically, it is much quicker to download only what is needed here and use a few dictionaries on the client side to piece it together (if that is even necessary). In some cases the low-cardinality data may even be cached in memory and never retrieved from the database at all (except on app start or when the data is changed).
The determining factors in using an approach such as this are a very high number of results and a steep decrease in cardinality for the joins, in other words fanning in. This is actually the reverse of most usages and probably the reverse of what you are doing here. If you are selecting "recipes" and joining to "ingredients", you are probably fanning out, which can make this approach wasteful, especially if there are only two tables to join.
So I'm just putting it out there that this is a possible alternative if performance becomes an issue down the road; at this point in your design, before you have real-world performance data, I would simply go the route of using a single joined result set.
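If it does become an issue, here is a hedged sketch of the multiple-result-set approach in T-SQL (SQL Server is assumed since the asker mentions SqlDataReader; the selection condition and names are illustrative). The client reads the second set via SqlDataReader.NextResult() and stitches rows together by recipeID:

CREATE PROCEDURE GetRecipesWithIngredients
AS
BEGIN
    -- Stage the ~200 selected recipes once (placeholder condition).
    SELECT recipeID, author, timeNeeded, yummyFactor
    INTO #selected
    FROM recipes
    WHERE timeNeeded < 30;

    -- Result set 1: the recipes themselves.
    SELECT * FROM #selected;

    -- Result set 2: all their ingredients in one go.
    SELECT i.recipeID, i.name, i.amount, i.yummyContributionFactor
    FROM ingredients i
    JOIN #selected s ON s.recipeID = i.recipeID;
END;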
The best performance/mess ratio is 42.
On a more serious note, go with the simplest solution: retrieve everything with a single query. Don't optimize before you encounter a performance issue. "Premature optimization is the root of all evil" :)
One stored proc that returns 2 datasets: "recipe header" and "recipe details"?
This is what I'd do if I needed the data all at once in one go. If I don't need it in one go, I'd still get 2 datasets but with less data.
We've found it slightly easier to work with this in the client rather than one big query as Andomar suggested, but his/her answer is still very valid.
I would look at the bigger picture - do you really need to retrieve ingredients for 200 recipes? What happens when you have 2,000?
For example, if this is in a web page I would have the 200 recipes listed (if not less because of paging), and when the user clicked on one to see the ingredient then I would get the ingredients from the database.
If this isn't doable, I would have 1 stored proc that returns one DataSet containing 2 tables. One with the recipes and the second with the list of ingredients.
"I'm currently hand-writing a DAL in C#..." As a side note, you might want to check out the post: Generate Data Access Layer Methods From Stored Procs. It can save you a lot of time.

Help with designing a schema for a lyrics database

I'd like to work on a project, but it's a little odd. I want to create a site that shows lyrics and their translations, but they are shown simultaneously side-by-side (so this isn't just a normal i18n of the site).
I have normalized the tables like this (formatted to show hierarchy).
artists
    artistNames
albums
    albumNames
tracks
    trackNames
    trackLyrics
user
So questions,
First, that'll be a whopping seven joins. I must have written pretty small queries in the past because I've never come across something like this. Is joining so many tables a bad thing? I'm pretty sure I'll be using SQLite for this project, but does anyone think PostgreSQL or MySQL could perform better with a pretty big join like this?
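For concreteness, the join in question would look something like this; the key columns and the language filter are assumptions, since the question doesn't list them:

SELECT an.name AS artist, aln.name AS album, tn.name AS track, tl.lyrics
FROM artists ar
JOIN artistNames an ON an.artistID = ar.id  AND an.language = 'en'
JOIN albums al      ON al.artistID = ar.id
JOIN albumNames aln ON aln.albumID = al.id  AND aln.language = 'en'
JOIN tracks t       ON t.albumID   = al.id
JOIN trackNames tn  ON tn.trackID  = t.id   AND tn.language = 'en'
JOIN trackLyrics tl ON tl.trackID  = t.id   AND tl.language = 'en'
WHERE ar.id = 1;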
Second, my current self-built framework uses a data mapper to create domain objects. This is the first time I will be working with so many one-to-many relationships, so my mapper really only takes one row as one object. For example,
id      name
------  ------------
1       Jackie Chan
2       Stephen Chow
So it's super easy to map objects. But with those one-to-many relationships...
id      language    name
------  ----------  ------------
1       en          Jackie Chan
1       zh          陳港生
2       en          Stephen Chow
2       zh          周星馳
...I'm not sure what to do. Is looping through the result set to create a massive array and feeding it to my domain object factory the only option when dealing with a data set like this?
<?php
array(
    array(
        'id' => 1,
        'names' => array(
            'en' => 'Jackie Chan',
            'zh' => '陳港生'
        )
    ),
    array(
        'id' => 2,
        'names' => array(
            'en' => 'Stephen Chow',
            'zh' => '周星馳'
        )
    )
);
?>
I have an itch to just denormalize these tables so I can get my one row per object application working, but I've always read this is not the way to go.
Third, does this schema sound right for the job?
Twelve-way joins are not unheard of in serious industrial work. You need sufficient hardware, a strong DBMS, and good database design. A seven-way join should be a breeze for any good environment.
You separate out data, as needed, to avoid difficulties like database update anomalies. These anomalies are what you get when you don't follow the normalization rules. You join data as needed to get the data that you need in a single result.
Sometimes it's better to ignore some of the normalization rules when you build a database. In that case, you need an alternative set of design principles in order to avoid design by trial and error. The amount of joining you are doing has little to do with the disadvantages of looping through results or unfortunate mapping between tuples and objects.
Most of the mappings between tuples (table rows) and objects are done in an incorrect fashion. A tuple is an object, but it isn't an application-oriented object. This can cause either performance problems or difficult programming, or both.
As far as you can avoid it, don't loop through results, one row at a time. Deal with results as a set of data. If you can't do that in PHP, then you need to learn how, or get a better programming environment.
Just a note. I'm not really sure that 7 tables is that big a join. I seem to remember that Postgres has a special query optimiser (based on a genetic algorithm, no less) that only kicks in once you join 12 tables or more.
The general rule is to make the schema as normalized as possible, then perform stress tests with the expected amount of data. If you find performance bottlenecks, you should try to optimize in the following order:
1. Profile and optimize queries
2. Add indices to the schema
3. Add hints for the query optimizer (I don't know if SQLite has any, but most databases do)
If these steps do not yield any performance benefit, consider denormalizing the database.
Denormalizing a database is usually needed only if you work with "large" amounts of data. I checked several lyrics databases on the internet, and the largest I found has lyrics for about 400,000 songs. Let's assume you can find 1,000,000 lyrics performed by 500,000 artists. That is an amount of data that any database can easily handle on an average modern computer.
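For step 2 above, the indices that matter for this schema are composites covering the join key plus the language filter (names assumed from the question's table list):

CREATE INDEX idx_artistNames_lookup ON artistNames (artistID, language);
CREATE INDEX idx_albumNames_lookup  ON albumNames  (albumID,  language);
CREATE INDEX idx_trackNames_lookup  ON trackNames  (trackID,  language);
CREATE INDEX idx_trackLyrics_lookup ON trackLyrics (trackID,  language);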
Doing this many joins shouldn't be an issue on any serious DB. I haven't worked with SQLite enough to know if it's in the "serious" category. The only way to find out would be to create your schema, load up a lot of data and start looking at query plans (visual explains are very useful here). When I am doing these kinds of tests, I usually shoot for 10x the data I expect to have in production. If things work OK with this much data, I know I should be OK with real data.
Also, depending on how you need to retrieve the data, you may want to try subqueries instead of joins:
select a.*, (select r.name from artist r where r.id = a.artist and r.locale = 'en') as artist_name from album a where a.id = 1;
I've helped a friend optimize a web storefront. In your case, it's a lot the same.
First. What is your priority, webpage speed or update speed?
Normal forms were designed to make data maintenance simple. If Prince changes his name again, voila, just one row is updated. But if you want your web pages to render as fast as possible, then third normal form isn't your best plan. Yes, everyone is correct that it will do a 7-way join no problem, but that will be dozens of I/Os... an index lookup on every table, then table access by rowid, then again and again. If you denormalize for webpage loading speed, you may do 2 or 3 I/Os. This also allows for greater scaling: since every page hit needs fewer I/Os to complete, you'll be able to serve more simultaneous hits before maxing out your I/O.
But there's no reason not to do both. You can keep the base data, the official copy, in normal form, then write a script that generates a denormalized table for web performance. If it's not that big, you can regenerate the whole thing in a few minutes of maintenance downtime. If it is very big, you may need to be smart about the update and only change what needs changing, keeping change vectors in an intermediate driving table.
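A sketch of that regeneration script under the same assumed names; CREATE TABLE ... AS SELECT works in SQLite, MySQL, and PostgreSQL:

-- Rebuild the read-optimized table from the normalized base tables.
DROP TABLE IF EXISTS track_page;
CREATE TABLE track_page AS
SELECT t.id      AS trackID,
       an.name   AS artist_name,
       aln.name  AS album_name,
       tn.name   AS track_name,
       tl.lyrics AS lyrics
FROM tracks t
JOIN albums al      ON al.id = t.albumID
JOIN artists ar     ON ar.id = al.artistID
JOIN artistNames an ON an.artistID = ar.id AND an.language = 'en'
JOIN albumNames aln ON aln.albumID = al.id AND aln.language = 'en'
JOIN trackNames tn  ON tn.trackID  = t.id  AND tn.language  = 'en'
JOIN trackLyrics tl ON tl.trackID  = t.id  AND tl.language  = 'en';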
But at the heart of your design I have a question.
Artist names change over time. John Cougar became John Cougar Melonhead (or something) and then later he became John Mellencamp. Do you care which John did a song? Will you stamp the entries with from and to valid dates?
It looks like you have a 1-n relationship from artists to albums, but that really should be many-to-many.
Sometimes the same album is released more than once, with different included tracks and sometimes with different names for a track. Think international releases. Or bonus tracks. How will you know that's all the same album?
If you don't care about those details then why bother with normalization? If Jon and Vangelis is 1 artist, then there is simply no need to normalize. You're not interested in the answers normalization will provide.

Maintain history in a database

I am designing a database that must maintain a history of employee salary and movements within the organization. Basically, my design has 3 tables (I mean, there are more tables, but for this question I'll mention 3, so bear with me): an Employee table (containing the most current salary, position data, etc.), a SalaryHistory table (salary, date, reason, etc.), and a MovementHistory table (title, dept., comments). I'll be using LINQ to SQL, so what I was thinking is that every time employee data is updated, the old values will be copied to their respective history tables. Is this a good approach? Should I do it using LINQ to SQL or triggers? Thanks for any help, suggestions or ideas.
Have a look at http://www.simple-talk.com/sql/database-administration/database-design-a-point-in-time-architecture .
Basically, the article suggests that you have the following columns in the tables you need to track history for:
* DateCreated – the actual date on which the given row was inserted.
* DateEffective – the date on which the given row became effective.
* DateEnd – the date on which the given row ceased to be effective.
* DateReplaced – the date on which the given row was replaced by another row.
* OperatorCode – the unique identifier of the person (or system) that created the row.
DateEffective and DateEnd together tell you the time for which the row was valid (or the time for which an employee was in a department, or the time for which he earned a particular salary).
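Applied to the asker's SalaryHistory table, the article's columns could look like this (the types and the sentinel DateEnd value are assumptions):

CREATE TABLE SalaryHistory (
    EmployeeID    INT            NOT NULL,
    Salary        DECIMAL(12,2)  NOT NULL,
    Reason        VARCHAR(200)   NULL,
    DateCreated   DATETIME       NOT NULL,  -- when the row was inserted
    DateEffective DATETIME       NOT NULL,  -- when this salary took effect
    DateEnd       DATETIME       NOT NULL,  -- '9999-12-31' while still current
    DateReplaced  DATETIME       NULL,      -- when a correction superseded it
    OperatorCode  VARCHAR(30)    NOT NULL   -- who or what created the row
);

-- "What was employee 42 earning on 2009-03-01?"
SELECT Salary
FROM SalaryHistory
WHERE EmployeeID = 42
  AND DateEffective <= '2009-03-01'
  AND DateEnd > '2009-03-01';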
It is a good idea to keep that logic internal to the database: that's basically why triggers exist. I say this carefully, however, as there are plenty of reasons to keep it external. Often times - especially with a technology as easy as LINQ-to-SQL - it is easier to write the code externally. In my experience, more people could write that logic in C#/LINQ than could do it correctly using a trigger.
Triggers are fast - they're compiled! However, they're very easy to misuse and make your logic overcomplex to a degree that performance can degrade rapidly. Considering how simple your use case is, I would opt to use triggers, but that's me personally.
Triggers will likely be faster, and don't require a "middle man" to get the job done, eliminating at least one chance for errors.
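A minimal sketch of such a trigger in T-SQL (SQL Server is assumed since the asker uses LINQ to SQL; the column names are illustrative):

-- Copies the old salary into SalaryHistory whenever it actually changes.
CREATE TRIGGER trg_Employee_SalaryHistory
ON Employee
AFTER UPDATE
AS
BEGIN
    INSERT INTO SalaryHistory (EmployeeID, Salary, SalaryDate, Reason)
    SELECT d.EmployeeID, d.Salary, GETDATE(), 'Updated via Employee'
    FROM deleted d
    JOIN inserted i ON i.EmployeeID = d.EmployeeID
    WHERE i.Salary <> d.Salary;   -- skip updates that didn't touch the salary
END;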
Depending on your database of choice, you can just use one table with OIDs enabled on it, plus two more columns, "flag" and "previous". Never update this table, only insert. Add a trigger so that when a row is added for employee #id, all existing records for employee #id get a flag of "old" and the new row's "previous" value is set to the previous row.
I think this belongs in the database for two reasons.
First, middle tiers come and go, but databases are forever. This year Java EJBs, next year .NET, the year after that something else. The data remains, in my experience.
Second, if the database is shared at all it should not have to rely on every application that uses it to know how to maintain its data integrity. I would consider this an example of encapsulation of the database. Why force knowledge and maintenance of the history on every client?
Triggers make your front-end easier to migrate to something else and they will keep the database consistent no matter how data is inserted/updated/removed.
Besides, in your case I would write the salaries straight to the salary history - from your description I don't see a reason why you should go via an update trigger on the employee table.
