Have multiple table copies in databases for easy join queries, or do the data association in the program? [closed]

In a large system which uses multiple databases.
E.g.:
db_trade, used for trade information storage
db_fund, used for user account storage
db_auth, used for authentication and authorization
In this case user_info is common information. The trade system and fund system UIs need to display trade or account information together with user info; for better performance, the SQL query needs a LEFT JOIN against user_info.
I want to know how to design this in a larger system:
Perform the data association in the program?
Sync the user_info table into every database?

There are pros and cons to each approach:
The normalized approach stores each piece of data exactly once, and is better from a data integrity perspective. This is the traditional approach used in relational database design. For example, in a banking system you would probably not keep the current account balance in more than one place, right? Because then when you change it in one place, the other one becomes inconsistent, which may lead to wrong business decisions.
The denormalized approach lets you store multiple copies of the same data in different places, and is better for performance. This is the approach generally recommended for big data and NoSQL database design. An example where this makes sense: suppose you are designing a chat system, and you need to display messages next to the name of the message author. You will probably prefer to store the display name next to the message, and not just the user ID, so that you don't need to perform an expensive join every time you display messages.
If you denormalize, you need to take care of data integrity at the application level. First, make sure you're clear about what the source of truth is. It's fine to have multiple copies of user_info ready to be fetched with low latency, but there should be one place where the most correct and up-to-date user info can be found. This is the master table for user info; the other copies of user info should derive from it. So you must decide which one of the databases in your design is the master of user info.
In the end, you have to make a tradeoff between consistency and performance (which is closely related to availability).
If user_info doesn't change a lot, and you have lots of queries and lots of users, and performance is your main concern, go with the denormalized approach and sync the user_info table in every database. Your application will have to keep those tables as consistent as you need, either by database-level replication or by some application logic, as sketched below.
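A minimal sketch of the application-logic option in Python, using sqlite3 as a stand-in for the real drivers; the database paths, table, and column names are illustrative assumptions, not anything from the question. The master copy is written first, and the other copies are updated best-effort afterwards:

```python
import sqlite3  # stand-in for your real database drivers

MASTER = "db_auth.sqlite"                 # assumed master of user_info
REPLICAS = ["db_trade.sqlite", "db_fund.sqlite"]

def update_display_name(user_id: int, display_name: str) -> None:
    # 1. Write to the source of truth first.
    with sqlite3.connect(MASTER) as master:
        master.execute(
            "UPDATE user_info SET display_name = ? WHERE user_id = ?",
            (display_name, user_id),
        )
    # 2. Best-effort propagation to the copies; a real system would log
    #    failures and repair them later (e.g. a reconciliation job, not
    #    shown), so reads from a copy may be briefly stale.
    for path in REPLICAS:
        try:
            with sqlite3.connect(path) as replica:
                replica.execute(
                    "UPDATE user_info SET display_name = ? WHERE user_id = ?",
                    (display_name, user_id),
                )
        except sqlite3.Error:
            pass  # in a real system: log and enqueue for retry
```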
If you must have strongly consistent views of the user_info in every query (which is not a typical situation), you may need to sacrifice performance and keep all user info in one location.
Generally, big data systems sacrifice consistency in favor of performance and availability.

Related

Database schema for Partners [closed]

We have an application to manage companies, teams, branches, employees, etc., and have different tables for that. Now we have a requirement to give our technology partners access to the same system so that they can do the same things we do, but at the same time we need to supervise these partners in our system.
So in terms of DB schema, what will be the best way to manage them?
1) Duplicate the entire schema for partners; for that we would have to duplicate around 50-60 tables, and many more in the future as the system grows.
2) Create a flag in each table which tells whether a row belongs to an internal or external entity.
Please share suggestions if anyone has any experience.
Consider the following points before finalizing either approach.
Do you want a holistic view of the data?
By this I mean: do you want to view the data your partner creates and the data you create in a single report or form? If the answer is yes, then it makes sense to store the data in the same set of tables and differentiate the rows based on some set of columns.
Is your application functionality going to vary significantly?
If the answer to this question is no, then it makes sense to keep the data in the same set of tables. This way any changes you make to your system will automatically be reflected for all users, and you won't have to replicate your code across schemas / databases.
Are you and your partner going to use the same master / reference data?
If the answer to this question is yes, then again it makes sense to use the same set of tables, since you will do away with unnecessary redundant data.
Implementation
Rather than creating a flag, I would recommend creating a master table, say user_master. The key of this table should be made available in every transaction table. This way, if you want to bring in a second partner down the line, you can make a new entry in your user_master table and make the necessary modifications to your application code. Your application code should manage the security; needless to say, you need to implement as much security as possible at the database level too. A sketch of this layout follows.
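Here is that sketch, using Python's sqlite3 for illustration; apart from user_master itself, the table and column names (orders, user_type) are assumptions, not from the question:

```python
import sqlite3

# Every transaction table carries the user_master key, so internal rows and
# partner rows live side by side and are filtered by their owner.
DDL = """
CREATE TABLE user_master (
    user_id   INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    user_type TEXT NOT NULL CHECK (user_type IN ('INTERNAL', 'PARTNER'))
);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    user_id  INTEGER NOT NULL REFERENCES user_master(user_id),
    amount   REAL NOT NULL
);
"""

with sqlite3.connect(":memory:") as db:
    db.executescript(DDL)
    db.execute("INSERT INTO user_master VALUES (42, 'Acme Corp', 'PARTNER')")
    db.execute("INSERT INTO orders VALUES (1, 42, 99.5)")
    # The application enforces that a partner only ever queries with their
    # own user_id, so they never see internal rows:
    rows = db.execute(
        "SELECT o.order_id, o.amount FROM orders o "
        "JOIN user_master u ON u.user_id = o.user_id "
        "WHERE u.user_type = 'PARTNER' AND u.user_id = ?",
        (42,),
    ).fetchall()
    print(rows)  # [(1, 99.5)]
```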
Other Suggestions
To physically separate the data of these entities you can implement either partitioning or sharding, depending on the DB you are using.
Perform thorough regression testing and check that your data is not visible in partner reports or forms. Also check that a partner is not able to update or insert your data.
Since the data in your system will increase significantly, it makes sense to performance-test your reports, forms, and programs.
If you are using indexes, you will need to revisit them, since your WHERE conditions will change.
Also, revisit your keys and relationships.
Neither of the approaches you asked about is advisable. You need to follow these guidelines to secure your whole system and audit your technology partner as well:
[1] Create a module on the Admin side which shows you the existing tables as well as tables that will be added in the future.
[2] Create a database user for your technology partner and grant permissions on those objects.
[3] Keep one audit-trail table, and insert an entry with the user name, IP, etc. into it. That way you have complete tracking of the activity carried out by your technology partner; a sketch follows.
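Here is a minimal sketch of that audit-trail table, using Python's sqlite3 for illustration; the column layout and the helper function are assumptions:

```python
import sqlite3
from datetime import datetime, timezone

with sqlite3.connect(":memory:") as db:
    db.execute("""
        CREATE TABLE audit_trail (
            audit_id   INTEGER PRIMARY KEY,
            user_name  TEXT NOT NULL,
            ip_address TEXT NOT NULL,
            action     TEXT NOT NULL,
            acted_at   TEXT NOT NULL   -- ISO-8601 UTC timestamp
        )
    """)

    def record_action(user_name: str, ip_address: str, action: str) -> None:
        """Insert one audit entry; call from every partner-facing endpoint."""
        db.execute(
            "INSERT INTO audit_trail (user_name, ip_address, action, acted_at)"
            " VALUES (?, ?, ?, ?)",
            (user_name, ip_address, action,
             datetime.now(timezone.utc).isoformat()),
        )

    record_action("partner_acme", "203.0.113.7", "updated order 1")
```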

I want to move data from a SQL Server DB to HBase/Cassandra etc. How do I decide which big data database to use? [closed]

I need to develop a plan to move data from a SQL Server DB to one of the big data databases. Some of the questions that I have thought of are:
How big is the data?
What is the expected growth rate for this data?
What kinds of queries will be run frequently? E.g. look-up, range-scan, full-scan, etc.
How frequently will the data be moved from source to destination?
Can anyone help add to this questionnaire?
Firstly, how big the data is doesn't matter! This point can barely be used to decide which NoSQL DB to use, as most NoSQL DBs are built for easy scalability and storage. So what matters is the queries you will fire rather than how much data there is. (Unless, of course, you intend to use it to store and access very small amounts of data, because that would be a little expensive in many of the NoSQL DBs.) Your first question must be: why consider NoSQL at all? Can't an RDBMS handle it?
The expected growth rate is a considerable parameter, but then again not a decisive one, since most NoSQL DBs support storage of large amounts of data without scalability issues.
The most important one in your list is "What kind of queries will be run?"
This matters most, since an RDBMS stores data as tuples and it is easier to select tuples and output them with smaller amounts of data; it is faster at executing SELECT * queries (as it uses row-wise storage). But coming to NoSQL, most DBs are columnar, i.e. column-oriented DBMSs.
Row-oriented systems: as data is inserted into the table, it is assigned an internal ID, the rowid, that is used internally to refer to the data. In this case the records have sequential rowids independent of any user-assigned key such as an empid.
Column-oriented systems: a column-oriented database serializes all of the values of one column together, then the values of the next column, and so on.
Comparisons between row-oriented and column-oriented databases are typically concerned with the efficiency of hard-disk access for a given workload, as seek time is incredibly long compared to the other bottlenecks in computers.
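To make the layout difference concrete, here is a tiny Python sketch that serializes the same made-up records both ways:

```python
# The same three records laid out row-wise and column-wise; the data and
# field names are made up for illustration.
records = [
    (1, "Alice", 4200),   # (rowid, name, salary)
    (2, "Bob",   3100),
    (3, "Carol", 5000),
]

# Row-oriented: each record's fields are contiguous. Good when you usually
# fetch whole records (typical OLTP access).
row_layout = [field for record in records for field in record]
# -> [1, 'Alice', 4200, 2, 'Bob', 3100, 3, 'Carol', 5000]

# Column-oriented: all values of one column are contiguous. Good when you
# scan or aggregate a single column (typical analytics), since only that
# column's bytes need to be read from disk.
column_layout = [value for column in zip(*records) for value in column]
# -> [1, 2, 3, 'Alice', 'Bob', 'Carol', 4200, 3100, 5000]
```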
"How frequently will the data be moved/accessed?" is again a good question, as accesses are costly and a few NoSQL DBs are very slow the first time a query is issued (e.g. Hive).
Other parameters you may consider are:
Are updates of rows (data in the table) required? (Hive has problems with updates; you usually have to delete and re-insert.)
Why are you using the database? (Search, deriving relationships, analytics, etc.) What types of operations will you want to perform on the data?
Will it require relationship searches, as in the case of Facebook's DB (Presto)?
Will it require aggregations?
Will it be used to relate various columns to derive insights (i.e. analytics)?
Last but a very important one: do you want to store the data on HDFS (Hadoop Distributed File System) as files, in your DB's specific storage format, or somewhere else? This is important since your processing depends on how your data is stored: whether it can be accessed directly or needs a query call, which may be time-consuming, and so on.
A couple more pointers:
The type of NoSQL DB that suits your requirement, i.e. key-value, document, column-family, or graph databases.
The CAP theorem, to decide which is more critical among consistency, availability, and partition tolerance.

Database Design - LKP Schemas for Read-Only Tables? Best Practices? [closed]

I am relatively new to coding in the Microsoft stack, and some practices in my new workplace differ from things I've seen before. Namely, I have seen a practice where read-only tables (ones the application is not meant to insert into, edit, or delete from) are placed in an "lkp" schema: "lkp.EmailType", "lkp.Gender", "lkp.Prefix", and so on.
However, when I started developing some MVC5 apps using Entity Framework and a database-first approach, I noticed while debugging that the generated code attempts to both pluralize the table name and change the schema, so "lkp.Gender" queries become a SELECT against "dbo.Genders". After looking into the pluralization functionality, it seems best practice leans toward pluralizing table names, so I went ahead and did that for this application (this is a new application; we are using a DB structure similar to prior ones but do not have to keep it the same).
The last thing I need to do is change these table schemas to "dbo" as opposed to "lkp". In talking with some coworkers about their other projects, they found that while read-only lookup tables might use the dbo schema, they might be named differently, such as "dbo.LkpGenders" or the like.
It takes a bit of work to remove the constraints on other tables that reference these LKP tables, so before I put too much effort toward this change I wanted to ask the community whether it is even a good idea, and whether to put my time toward making LKP tables work or doing away with them.
In short: is the use of LKP schemas for read-only tables an old practice, or is it still a good idea and I have just been in other workplaces and projects that were doing it "wrong"? As an added bonus, it would be good to know why MVC5/EF uses the dbo schema on something it created an EDMX from just fine. Should I be using a naming convention, DB views, or LKP schemas for this kind of read-only lookup data?
Some thoughts:
I like plural table names. A row can contain an entity; a table can contain many entities. But naming conventions should be guidelines rather than carved-in-stone rules; no single rule will be the best alternative in every situation, so allow some flexibility.
My only exception to that last caveat is to name tables and views identically. That is, the database object Employees could be either a table or view. The apps that use it wouldn't know which one it is (or care) and the DB developers could quickly find out (if it was pertinent). There is absolutely no reason to differentiate between tables and views by name and many good reasons to abstract tables and views to just "data sources".
The practice of keeping significant tables in their own database/schema is a valid one. There is something about these tables (being read-only) that groups them together organizationally, so it can make sense to group them together physically. The problem comes when there are other such attributes: read-only employee data, read-only financial data, etc. If employee and financial data are also segregated into their own database/schema, which is the more significant attribute for determining where they live: read-only, or employee/financial?
In your particular instance, I would not think that "read-only" is significant enough to rate segregation. Firstly, read-only is not a universal constraint -- someone must be able to maintain the data. So it is "read-only here, writable there". Secondly, just about any grouping of data can have some of that data that is generally read-only. Does it make sense to gather read-only data that is of use only to application X and read-only data that is of use only to application Y in the same place just because they are both read-only? And suppose application X now needs to see (read-only, of course) some of application Y's data to implement a new feature? Would that data be subject to relocation to the read-only database?
A better alternative would be to place X-only data in its own location, Y-only data in its own location, and so forth. Company-wide data would go in dbo. Each location could have different requirements for the same common data: read-only for some, writable for others. These differing requirements could be implemented by local views. A do-nothing "instead of" trigger on the view would render it completely read-only, while a view with working triggers would be indistinguishable from the underlying table(s). Each application would have its own view in its own space, with triggers as appropriate, so each sees the same data but only one can manipulate it. A sketch of the do-nothing trigger follows.
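Here is a minimal runnable sketch of the idea. It uses sqlite3 so it can run standalone; the answer's context is SQL Server, where the INSTEAD OF trigger syntax differs, and all names are made up:

```python
import sqlite3

with sqlite3.connect(":memory:") as db:
    db.executescript("""
        CREATE TABLE Employees (emp_id INTEGER PRIMARY KEY, name TEXT);
        INSERT INTO Employees VALUES (1, 'Alice');

        -- The application sees only this view, never the table.
        CREATE VIEW EmployeesRO AS SELECT emp_id, name FROM Employees;

        -- RAISE(IGNORE) silently abandons the triggering statement, so
        -- writes through the view become no-ops: effectively read-only.
        CREATE TRIGGER EmployeesRO_ins INSTEAD OF INSERT ON EmployeesRO
        BEGIN SELECT RAISE(IGNORE); END;
        CREATE TRIGGER EmployeesRO_upd INSTEAD OF UPDATE ON EmployeesRO
        BEGIN SELECT RAISE(IGNORE); END;
        CREATE TRIGGER EmployeesRO_del INSTEAD OF DELETE ON EmployeesRO
        BEGIN SELECT RAISE(IGNORE); END;
    """)
    db.execute("UPDATE EmployeesRO SET name = 'Mallory' WHERE emp_id = 1")
    print(db.execute("SELECT name FROM EmployeesRO").fetchone())  # ('Alice',)
```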
Another advantage to accessing common (dbo) data or shared data from another location through local views is that each application, even though they are looking at the same data, may want the data in different formats and/or different field names. Views allow you to provide the data to each application exactly the way that application wants to see it.
This can also greatly improve the maintainability of your physical data. If a table needs to be normalized or denormalized, or a field renamed, added, or dropped entirely, go ahead and do it. Just rewrite the views to minimize, if not completely eliminate, the differences that make it back to the apps. The application code may not have to change at all. How's that for cool?

Good place to look for example Database Designs - Best practices [closed]

I have been given the task of designing a database to store a lot of information for our company. Because the task is rather big and contains multiple modules where users should be able to do stuff, I'm worried about getting the data model right; I just don't want to end up with a badly designed database.
I want to have some decent examples of database structures for contracts / billing / orders etc to combine those in one nice relational database. Are there any resources out there that can help me with some examples regarding this?
Barry Williams has published a library of about six hundred data models for all sorts of applications. It will almost certainly give you a "starter for ten" for each of your subsystems. Access to this library is free, so check it out.
It sounds like this is a big "enterprise-y" application your organisation wants, and you seem to be a bit of a beginner with databases. If at all possible, you should start with a single subsystem - say, Orders - and get that working: not just the database tables but also a skeleton front end for it. Once that is good enough, add another related subsystem, such as Billing. You don't want to end up with a sprawling monster.
Also make sure you have a decent data modelling tool. SQL Power Architect is nice enough for a free tool.
Before you start read up on normalization until you have no questions about it at all. If you only did this in school, you probably don't know enough about it to design yet.
Gather your requirements for each module carefully. You need to know:
Business rules (which are specific to the application, and which must be enforced in the database when they apply to all records no matter the source),
Whether there are legal or regulatory concerns (HIPAA, for instance, or Sarbanes-Oxley requirements),
Security (does the data need to be encrypted?),
What data you need to store and why (is this data available anywhere else?),
Which pieces of data will only ever have one row and which will need multiple rows,
How you intend to enforce uniqueness of the rows in each table: do you have a natural key, or do you need a surrogate key (a surrogate key is advisable in almost all cases)?
Do you need replication?
Do you need auditing?
How is the data going to be entered into the database? Will it come from the application one record at a time (or even from multiple applications), or will some of it come from bulk inserts from an ETL tool or from another database?
Do you need to know who entered each record and when (highly likely this will be necessary in an enterprise system)?
What kinds of lookup tables will you need? Data entry is much more accurate when you can use lookup tables and restrict users to the listed values.
What kind of data validation do you need?
Roughly how many records will the system have? You need an idea of this to know how big to make your test data.
How are you going to query the data? Will you be using stored procs, an ORM, or dynamic queries?
Some very basic things to remember in your design: choose the right data type for your data. Do not store dates, or numbers you intend to do math on, in string fields. Do store numbers that are not candidates for math (part numbers, zip codes, phone numbers, etc.) as string data, as you may need leading zeros. Do not store more than one piece of information in a field: no comma-separated lists (these indicate the need for a related table), and while you are at it, if you find yourself doing something like phone1, phone2, phone3, stop right away and design a related table (see the sketch below). Do use foreign keys for data integrity purposes.
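For instance, here is a minimal sketch of replacing phone1/phone2/phone3 with a related table, using sqlite3 for illustration; all names are made up:

```python
import sqlite3

with sqlite3.connect(":memory:") as db:
    db.executescript("""
        CREATE TABLE customers (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL
        );
        -- One row per phone number instead of phone1/phone2/phone3 columns:
        -- a customer can now have any number of phones, each stored once.
        CREATE TABLE customer_phones (
            customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
            phone       TEXT NOT NULL,  -- text keeps the leading zeros
            phone_type  TEXT NOT NULL
                        CHECK (phone_type IN ('home', 'work', 'mobile')),
            PRIMARY KEY (customer_id, phone)
        );
    """)
    db.execute("INSERT INTO customers VALUES (1, 'Alice')")
    db.executemany(
        "INSERT INTO customer_phones VALUES (?, ?, ?)",
        [(1, "0201 555 0100", "home"), (1, "0701 555 0199", "mobile")],
    )
```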
All the way through your design, consider data integrity. Data that has no integrity is meaningless and useless. Do design for performance: this is critical in database design and is NOT premature optimization. Databases do not refactor easily, so it is important to get the most critical parts of the performance equation right the first time. In fact, all databases need to be designed for data integrity, performance, and security.
Do not be afraid to have multiple joins; properly indexed, they will perform just fine. Do not try to put everything into an entity-attribute-value table; use those as sparingly as possible. Try to learn to think in terms of handling sets of data (contrasted in the sketch below); it will help your design. Databases are optimized to do things in sets.
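Here is a small sketch of the row-by-row versus set-based contrast, with made-up table and column names:

```python
import sqlite3

with sqlite3.connect(":memory:") as db:
    db.execute("CREATE TABLE prices (item_id INTEGER PRIMARY KEY, price REAL)")
    db.executemany("INSERT INTO prices VALUES (?, ?)",
                   [(i, 10.0) for i in range(1000)])

    # Row-by-row: one statement per row, 1000 separate executions.
    for (item_id,) in db.execute("SELECT item_id FROM prices").fetchall():
        db.execute("UPDATE prices SET price = price * 1.1 WHERE item_id = ?",
                   (item_id,))

    # Set-based: one statement over the whole set. This is the style the
    # engine is optimized for and the one the answer recommends.
    db.execute("UPDATE prices SET price = price * 1.1")
```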
There's more but this is enough to start digesting.
Try to keep your concerns separate here. Users being able to update the database is more of an "application design" problem. If you get your database design right then it should be a case of developing a nice front end for it.
The first thing to look at is normalization: the process of eliminating redundant data from your tables. This will help keep your database neat, storing only information that is relevant to your needs.
The Data Model Resource Book.
http://www.amazon.com/Data-Model-Resource-Book-Vol/dp/0471380237/ref=dp_cp_ob_b_title_0
HEAVY stuff, but very well thought out; three volumes in all.
It has a lot of very well-thought-out generic structures, but they are NOT easy, as they cover everything ;) Always a good starting point, though.
The database should not be the model; it is used to save information between sessions of work.
You should not build your application upon a data model, but upon a good object-oriented model that follows the business logic.
Once your object model is done, then think about how you can save and load it, with all the database design that goes with it.
(But apparently your company just wants you to design a database, not an application?)

Data model for a workflow/business process application [closed]

What should the data model for a workflow application be? Currently we are using an entity-attribute-value (EAV) based model in SQL Server 2000, with users able to create dynamic forms (in ASP.NET), but as the data grows, performance is degrading, it is hard to generate reports, and it is worse when too many users query the data concurrently.
As you have probably realized, the problem with an EAV model is that tables grow very large and queries grow very complex very quickly. For example, EAV-based queries typically require lots of subqueries just to get at the same data that would be trivial to select if you were using more traditionally-structured tables.
Unfortunately, it is quite difficult to move to a traditionally-structured relational model while simultaneously leaving old forms open to modification.
Thus, my suggestion: consider closing changes on well-established forms and moving their data to standard, normalized tables. For example, if you have a set of shipping forms that are not likely to change (or whose changes you could manage through app changes because they happen so rarely), then you could create a fixed table and copy the existing data out of your EAV table(s), as sketched below. This would A) improve your ability to do reporting, B) reduce the amount of data in your existing EAV table(s), and C) improve your ability to support concurrent users and improve performance, because you could build more appropriate indexes on your data.
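Here is a minimal sketch of that copy-out step, using sqlite3 for illustration; the EAV layout and the attribute names are assumptions, not from the question:

```python
import sqlite3

with sqlite3.connect(":memory:") as db:
    db.executescript("""
        CREATE TABLE eav (entity_id INTEGER, attr TEXT, value TEXT);
        INSERT INTO eav VALUES
            (1, 'ship_to',   'Alice'), (1, 'weight_kg', '2.5'),
            (2, 'ship_to',   'Bob'),   (2, 'weight_kg', '7.0');

        -- Fixed, normalized destination with real column types.
        CREATE TABLE shipping (
            shipment_id INTEGER PRIMARY KEY,
            ship_to     TEXT NOT NULL,
            weight_kg   REAL NOT NULL
        );

        -- Classic pivot: one MAX(CASE ...) per attribute, grouped by entity.
        INSERT INTO shipping (shipment_id, ship_to, weight_kg)
        SELECT entity_id,
               MAX(CASE WHEN attr = 'ship_to'   THEN value END),
               MAX(CASE WHEN attr = 'weight_kg' THEN CAST(value AS REAL) END)
        FROM eav
        GROUP BY entity_id;
    """)
    print(db.execute("SELECT * FROM shipping").fetchall())
    # [(1, 'Alice', 2.5), (2, 'Bob', 7.0)]
```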
In short, think of the dynamic EAV-based system as a way to collect users' needs (they tell you by building their forms) and NOT as the permanent storage. As the forms evolve into their final form, you transition to fixed tables in order to gain the benefits discussed above.
One last thing. If all of this isn't possible, have you considered segmenting your EAV table into multiple category-specific tables? For example, put all of your shipping forms in one table, personnel forms in a second, and so on. It won't solve the query-structure problem (the need for subqueries), but it will help shrink your tables and improve performance.
I hope this helps - I do sympathize with your plight as I've been in a similar situation myself!
Typically, when your database schema becomes very large and multiple users try to access the same information in many different ways, data warehousing is applied in order to reduce the major load on the database server. Unlike your traditional schema, where you are more than likely using normalization to keep data integrity, a data warehouse is optimized for speed, and multiple copies of your data are stored.
Try using the relational model of data. It works.
