What is the best database design for thousand rows - sql-server

I'm about to start a Database Design that will simply manage users under companies.
Each company will have a admin area that can manage users
Each company will have around 25.000 users
Client believes to have around 50 companies to start
My main question is
Should I create tables based on Companies? like
users_company_0001 users_company_0002 users_company_0003 ...
as each company will never use "other" users and nothing will need to sum/count different tables in all user_company (a simple JOIN will do the trick, though it's more expensive (time) it will work as having the main picture, this will never be needed.
or should I just create a users table to have (50 x 25000) 1 250 000 users (and growing).
I'm thinking about the first option, though, I'm not sure how would I use Entity Framework on such layout... I would probably need to go back to the 90's and generate my Data Logic Layer by hand.
has it will be a simple call to Store Procedures containing the Company Id
What will you suggest?
The system application will be ASP.NET (probably MVC, I'm still trying to figure this out as all my knowledge is on webforms, though I saw Scott Hanselman MVC videos - seams easy - but I know it will not be that easy as problems will come and I will take more time to fix them), plus Microsoft SQL.

Even though you've described this as a 1-many relationship, I'd still design the DB as many-to-many to guard against a future change in requirements. Something like:

Having worked with a multi-terabyte SQL Server database, and having experience with hundreds of tables over the course of my career with multi-million rows, I can tell you with full assurance that SQL Server can handle a your company and users tables without partitioning. It's always there when you need it, but your worry shouldn't be about your tables - pick the simplest schema that meets your needs. If you want to do something to optimize performance, your bottleneck will almost assuredly be your disks. Don't buy large, slow disks. Get yourself a bunch of small, high RPM disks and spread your data out across them as much as possible, and don't share disks with your logs and your data. With databases, you're almost always better off achieving performance with good hardware, a good disk subsystem, and proper indexing. Don't compromise and over complicate your schema trying to anticipate performance - you'll regret it. I've seen really big databases where that sort of thing was necessary, but yours ain't it.

re: Should I create tables based on Companies?
yes
like
users_company_0001 users_company_0002 users_company_0003
no, like
companyID companyName, contactID
or should I just create a users table to have (50 x 25000) 1 250 000 users (and growing)
yes

I think you should create separate tables for Company and User. Then
a third table to connect the two: CompanyAdmin. Something like:
Company(Company_Id, Company_name, ...)
User(User_Id, User_name, ...)
CompanyAdmin(Company_id, User_id)
This way you can add users and/or companies without affecting the number
of tables you need to manage. It is generally a bad design where you need
to modify the database (ie. add tables) when new data (companies) are added to the system.
With proper indexing, the join costs in a database containing
a few million rows should not be a problem.
Finally, if you ever need to change or record additional information about
Companies, Users or the relationship between them, this setup should
have the least amount of impact on your application.

Related

Database table creation for large data

I am making a client management application in which I am storing the data of employee , admin and company. In the future the database will have hundreds of companies registered. I am thinking to go for the best approach to database design.
I can think of 2 approaches:
Making all tables of app separately for each company
Storing all data in app database
Can you suggest the best way to do that?
Please note that all 3 tables are linked on the basis of ids and there will be hundreds of companies and each company will have many admin and each admin will have hundreds of employee . What would be the best approach to do with security and query performance
With the partial information you provided, it look like 3 normalized tables is what you need, plus the auxiliar data like lookups and other stuff.
But when you design a database you would need to consider many more point like, security, visibility, client access methods, etc
For example if you want to ensure isolation, and don't allow users to have any visibility to other's data, you could create dynamically a schema per company, create user and access rights for each schema dynamically. Then you'll need support these stuff in the DAL, which in fact will be quite fat.
Another approach for the DAl could be exposing views that always return subsets for one company.
A big reason reason that I would suggest going for the normalized approach is that maintenance will be much easier this way.
From a SQL point of view I don't see any performance advantage having many tables or just 3, efficiency of the indexes, and smart DAL will make the difference.
The performance of the query doesn't much depends on the size of table but it depends more on the indexes you have on that table. so you need to put clustered and non clustered indexes as per your requirement and i can guarantee that up to 10 GB of data you will not face any problem
This is a classic problem shared my most web business services: for discussions of the factors involved, Google "multi-tenant architecture."
You almost certainly want to put all companies into a common set of tables: each data table should reference the company key, and all queries should join on that key, among their other criteria. This allows the best overall performance, and saves you the potential maintenance nightmare of duplicating views, stored procedures and so on hundreds of times, or of having to apply the same structural changes to hundreds of tables should you wish to add a field or a table.
To help assure that you don't inadvertently intermingle data from different customers, it might be useful to do all data access through a validated set of stored procedures (all of which take the company ID as a parameter).
Hundreds of parallel databases will not scale very well: the DB server will constantly be pushing tables and indexes out of memory to accommodate the next query, resulting in disk thrashing and poor performance, as well. There is only pain down that path.
depending on the use-cases of your application there is no "best" way.
Please explain the operations your application will provide so we can get further insight into your problem.
The data to be stored seemed to be structured so a relational database at a first glance would work out well, but stick to the point i marked above.
You have not said how this data links at all or if there are even any links between them. However, at a guess, you need 3 tables.
EmployeeTable
AdminTable
CompanyTable
Each with the required properties in there, without additional information I'm not able to provide any more guidance.

Data historic in a business application

I have work with a few databases up to now and the philosophys where verry different. It got me wondering,
Is it a good idea to duplicate tables for historic purpose in a business application?
By buisiness application i mean :
a software used by an enterprise to manage all of his data (eg. invoices, clients, stocks [if applicable], etc)
By 'duplicating tables' i mean :
when, lets say your invoices, goes out of date (like after one year, after being invoiced and paid, w/e), you can store them into 'historic' tables which makes them aviable for consultation but shouldent be modified. Same thing clients inactive for years.
Pros :
Using historic tables can accelerate researches trough actually used data since it make your actually used tables smaller.
Better separation of historic and actual data
Easier to remove data from the database to store it on hard media without affecting your database, (more predictable beacause the data had no chance of being used since it was in an historic table). This often happend after 10 years when you got unused data.
Cons :
Make your database have up to 2 times more tables.
Make your database more complex
Make your program more complex for reports since you sometimes have to import twice the amount of tables.
Archiving is a key aspect of enterprise applications, but in general, I'd recommend against it unless you really, really need it.
Archiving means you either accept you can't get at historical data before a specific date, or that you create some scheme for managing "current" and "historical" data; your solution (archive tables) is one solution to this problem.
Neither solution is all that nice - archive tables mean lots of duplicated code/data, complex archival procedures (esp. with foreign key relationships), lots of opportunity for errors.
I do believe the concept of "time" should be baked into the domain and data model for most business applications, along with mutability - you shouldn't be able to change an order once it's been confirmed, but you should be able to add products to a new order.
As for your pros:
In general, I don't think you'd notice the performance impact unless you're talking about very, very large scale businesses. I don't think - on modern SQL server solutions - you'd notice the speed difference between querying 10.000 customer records or 1.000.000 customer records.
The definition of "historic" is actually rather tricky - most businesses have to keep historical around for regulatory and tax purposes, often for many years; they'll probably want to be able to analyse trends over several years, etc. If the business wants to see "how many widgets did we sell per month over the last 5 years", that means you have to keep 5 years of data around somehow (either "raw" or pre-aggregated).
Yes, separating out data would be easier. Building a feature today - which you have to maintain every time you change the application - for pay-off in 10 years seems a poor investment to me...
I would only have a "duplicate" type table to store historic VERSIONS of each record, like a change log. Even a change log is not a duplicate as it would have to have info on when it was changed, etc. As a general practice,I would not recommend migrating rows from an active to a historical table. You'd have to manage different versions of queries to find the data in two places! Use a status to control if the data can be changed. I could see it may be done if there are certain circumstances for a particular application. Once you start adding foreign keys, it becomes difficult to remove data. If you had a truly enterprise business application and you attempted to remove invoices, you have all sorts of issues with FKs to other tables, accounts payable/receivable, costs of raw materials, profits from sales, shipping info, etc.

Will creating seperate databases in SQL Server give me better performance?

All, I'm a programmer by trade but for this particular project I'm finidng myself being the DBA as well. Here is the scenario I'm faced with:
Web app with anywhere from 400-1000 customers. A customer is a "physical company", each of which has n-number of uers. Each customer (company) has on average 1GB worth of data (total of about 200 million rows). Each company has probably 80% similar data in terms of the type of data stored. The other 20% is custom data that the companies can themselves define (basically custom fields).
I am trying to figure out the best way to scale this on the cheap when you conisder that the customers need pretty good reaction time. For example, customer X might want to grab all records where last name like 'smith' and phone like '555' where as customer Y might want to grab all records where account number equals '1526A'.
Bottom line, performance is key and I'm finding it hard to decide what to index and if that is even going to help me given the fact these guys can basically create their own query through the UI.
My question is, what would you do? Do you think it would be wise to break each customer out into it's own DB? Total DB size at the moment is around 400GB.
It is a complete re-write so I have the fortune of being able to start fresh if needed. Any thoughts, hints would be greatly appreciated.
Bottom line, performance is key and
I'm finding it hard to decide what to
index and if that is even going to
help me given the fact these guys can
basically create their own query
through the UI.
Bottom line, you're ceding your DB performance to the whims of your clients. If they're able to "create their own query", then they're able to "create their own REALLY BAD queries".
So, if you run this in a shared environment (i.e. the same hardware), then customer A's awful table scans can saturate the I/O for everyone else.
If they're on the same database server, then Customer A's scans get to flush all of your other customers data from the data cache.
Basically, the more you "share", the more one customer can impact the operations of other customers. If you give customers the capability to do expensive things, and share much of it, then everyone suffers.
So, the options are a) don't let the customers do silly things or b) keep the customers as separated as practical so that when one does do silly things, the phones don't light up from all of the other customers.
If you don't know "what to index" then you are not offering much control over what the customers can do, and thus the silly thing factor goes way up.
You would probably get quite far by offering several popular, pre-made SQL views that the customers can select from, and then they're limited to simply filtering and possibly ordering the results. Then you optimize around execution of those views.
It's likely that surprisingly few "general" views can cover a large amount of the use cases.
Generic, silly queries can be delegated to a batch process that runs overnight, during off hours, or to a separate machine that doesn't impact transactional performance, such as a nightly snapshot with "everything but todays data" on it. Let them run historic queries against that.
The SO question How to design a multi tenant database has a link to a decent article on the tradeoffs along the spectrum from "shared nothing" to "shared everything". Also, SO has a tag for those kinds of questions; I added it for you.
Creating separate databases on the same server won't help you get better performance. The performance optimisations available to you with multiple databases are just the same as you can achieve with one database.
Separate databases might make sense for administrative reasons - if different backup or availability requirements apply to different customers for example.
It's still probably sensible to build your application so that it can support multiple databases so that you have the option of scaling out over multiple DB servers.
If you have seperate databases the 80% that is the same beciomes almost impossible to keep the same over time. YOu will end up spending far more money for maintenance.
Luckly SQL Server has some options for you. First put the customer sspeicifc information in the same database in a separate schema and the common stuff in a differnt schema(create a common schema and a schema for each client).
Next set up data partitioning by client. This can require the proper hardware to do this effectively.
Now you have one code base for common which will promugate changes to all clients at once and clients are separated for performance using the partitions.

Few database design questions relating to user content site

Designing a user content website (kind of similar to yelp but for a different market and with photo sharing) and had few databse questions:
Does each user get their own set of
tables or are we storing multiple
user data into common tables? Since
this even a social network, when
user sizes grows for scalability
databases are usually partitioned
off. Different sets of users are
sent separately, so what is the best
approach? I guess some data like
user accounts can be in common
tables but wall posts, photos etc
each user will get their own table?
If so, then if we have 10 million
users then that means 10 million x
what ever number of tables per user?
This is currently being designed in
MySQL
How does the user tables know what
to create each time a user joins the
site? I am assuming there may be a
system table template from which it
is pulling in the fields?
In addition to the above question,
if tomorrow we modify tables,
add/remove features, to roll the
changes down to all the live user
accounts/tables - I know from a page
point of view we have the master
template, but for the database, how
will the user tables be updated? Is
that something we manually do or the
table will keep checking like every
24 hrs with the system tables for
updates to its structure?
If the above is all true, that means we are maintaining 1 master set of tables with system default values, then each user get the same value copied to their tables? Some fields like say Maximum failed login attempts before system locks account. One we have a system default of 5 login attempts within 30 minutes. But I want to allow users also to specify their own number to customize their won security, so that means they can overwrite the system default in their own table?
Thanks.
Users should not get their own set of tables. It will most likely not perform as well as one table (properly indexed), and schema changes will have to be deployed to all user tables.
You could have default values specified on the table for things that are optional.
With difficulty. With one set of tables it will be a lot easier, and probably faster.
That sort of data should be stored in a User Preferences table that stores all preferences for all users. Again, don't duplicate the schema for all users.
Generally the idea of creating separate tables for each entity (in this case users) is not a good idea. If each table is separate querying may be cumbersome.
If your table is large you should optimize the table with indexes. If it gets very large, you also may want to look into partitioning tables.
This allows you to see the table as 1 object, though it is logically split up - the DBMS handles most of the work and presents you with 1 object. This way you SELECT, INSERT, UPDATE, ALTER etc as normal, and the DB figures out which partition the SQL refers to and performs the command.
Not splitting up the tables by users, instead using indexes and partitions, would deal with scalability while maintaining performance. if you don't split up the tables manually, this also makes that points 2, 3, and 4 moot.
Here's a link to partitioning tables (SQL Server-specific):
http://databases.about.com/od/sqlserver/a/partitioning.htm
It doesn't make any kind of sense to me to create a set of tables for each user. If you have a common set of tables for all users then I think that avoids all the issues you are asking about.
It sounds like you need to locate a primer on relational database design basics. Regardless of the type of application you are designing, you should start there. Learn how joins work, indices, primary and foreign keys, and so on. Learn about basic database normalization.
It's not customary to create new tables on-the-fly in an application; it's usually unnecessary in a properly designed schema. Usually schema changes are done at deployment time. The only time "users" get their own tables is an artifact of a provisioning decision, wherein each "user" is effectively a tenant in a walled-off garden; this only makes sense if each "user" (more likely, a company or organization) never needs access to anything that other users in the system have stored.
There are mechanisms for dealing with loosely structured types of information in databases, but if you find yourself reaching for this often (the most common method is called Entity-Attribute-Value), your problem is either not quite correctly modeled, or you may not actually need a relational database, in which case it might be better off with a document-oriented database like CouchDB/MongoDB.
Adding, based on your updated comments/notes:
Your concerns about the number of records in a particular table are most likely premature. Get something working first. Most modern DBMSes, including newer versions of MySql, support mechanisms beyond indices and clustered indices that can help deal with large numbers of records. To wit, in MS Sql Server you can create a partition function on fields on a table; MySql 5.1+ has a few similar partitioning options based on hash functions, ranges, or other mechanisms. Follow well-established conventions for database design modeling your domain as sensibly as possible, then adjust when you run into problems. First adjust using the tools available within your choice of database, then consider more drastic measures only when you can prove they are needed. There are other kinds of denormalization that are more likely to make sense before you would even want to consider having something as unidiomatic to database systems as a "table per user" model; even if I were to look at that route, I'd probably consider something like materialized views first.
I agree with the comments above that say that a table per user is a bad idea. Also, while it's a good idea to have strategies in mind now for how you can cope when things get really big, I'd concentrate on getting things right for a small number of users first - if no-one wants to / is able to use your service, then unfortunately you won't be faced with the problem of lots of users.
A common approach among very large sites is database sharding. The summary is: you have N instances of your database in parallel (on separate machines), and each holds 1/N of the total data. There's some shared way of knowing which instance holds a given bit of data. To access some data you have 2 steps, rather than the 1 you might expect:
Work out which shard holds the data
Go to that shard for the data
There are problems with this, such as: you set up e.g. 8 shards and they all fill up, so you want to share the data over e.g. 20 shards -> migrating data between shards.

'Followers' and efficiency

I am designing an app that would involve users 'following' each other's activity, in the twitter sense, but I am not very experienced with database/query design/efficiency. Are there best practices for managing this, pitfalls to avoid, etc.? I gather this can create a very large load on the db if not done properly (or maybe even then?).
If it makes a difference it is likely that people will 'follow' only a relatively small number of people (but a person may have many followers). However this is not certain, and I wouldn't want to count on it.
Any advice gratefully received. Thanks.
Pretty simple and easy to do with full normalisation. If you have a table of users, each with a unique ID, you would have a TABLE_FOLLOWERS table with the columns, USERID and FOLLOWERID which would describe all the followers for each user as a one to one to many relationship.
Even with millions of assosciations on a half decent database server this will perform well and fast as long as you are using a good database (IE, not MS-Access).
The model is fairly simple. The problem is in the size of the Subscription table; if there are 1 million users, and each subscribes to 1000, then the Subscription table has 1 billion rows.
That depends on how many users you expect to need to support; how many followers you expect users to have; and what sort of funding/development-effort you expect to have access to should your answers to the previous questions prove optimistic.
For a small scale project I would likely ignore the database, design the application as a simple object model with User objects that maintain a List[followers]. Keep it all in RAM for normal operation and use an ORM to persist to a database periodically (probably postgresql or mysql).
For a larger project I would not be using a relational database at all; but exactly what I would use would depend on the specific details of the project.
If you are only trying to spike the concept, go with the ORM approach; but, keep in mind it won't scale.
You probably should read http://highscalability.com/ and it's articles on how this is managed by the big sites.

Resources