Database table creation for large data

I am making a client management application in which I store data for employees, admins and companies. In the future the database will have hundreds of companies registered, so I want to choose the best approach to the database design.
I can think of 2 approaches:
Making all tables of app separately for each company
Storing all data in app database
Can you suggest the best way to do that?
Please note that all 3 tables are linked by IDs. There will be hundreds of companies, each company will have many admins, and each admin will have hundreds of employees. What would be the best approach with regard to security and query performance?

With the partial information you provided, it looks like 3 normalized tables are what you need, plus auxiliary data such as lookup tables.
But when you design a database you need to consider many more points, such as security, visibility, client access methods, etc.
For example, if you want to ensure isolation and not allow users any visibility into others' data, you could dynamically create a schema per company, and create a user and access rights for each schema. You would then need to support all of this in the DAL, which will become quite fat.
Another approach for the DAL could be exposing views that always return the subset of rows for one company.
A big reason I would suggest going for the normalized approach is that maintenance will be much easier this way.
From a SQL point of view I don't see any performance advantage in having many tables rather than just 3. Efficient indexes and a smart DAL will make the difference.

The performance of a query doesn't depend much on the size of the table; it depends more on the indexes you have on that table. So you need to add clustered and non-clustered indexes as your requirements dictate. In my experience, up to about 10 GB of data you should not face any problem.

This is a classic problem shared by most web business services: for discussions of the factors involved, Google "multi-tenant architecture."
You almost certainly want to put all companies into a common set of tables: each data table should reference the company key, and all queries should join on that key, among their other criteria. This allows the best overall performance, and saves you the potential maintenance nightmare of duplicating views, stored procedures and so on hundreds of times, or of having to apply the same structural changes to hundreds of tables should you wish to add a field or a table.
To help assure that you don't inadvertently intermingle data from different customers, it might be useful to do all data access through a validated set of stored procedures (all of which take the company ID as a parameter).
Hundreds of parallel databases will not scale very well: the DB server will constantly be pushing tables and indexes out of memory to accommodate the next query, resulting in disk thrashing and poor performance. There is only pain down that path.
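As a rough illustration of the shared-table layout described above, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the real database server. The table names, column names and the `employees_for_company` helper are assumptions made up for this example, not something from the question. The point is only that every data table carries the company key and that reads go through one function that requires it.

```python
import sqlite3

# Hypothetical schema: all names below are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company  (company_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE admin    (admin_id INTEGER PRIMARY KEY,
                       company_id INTEGER NOT NULL REFERENCES company(company_id),
                       name TEXT NOT NULL);
CREATE TABLE employee (employee_id INTEGER PRIMARY KEY,
                       company_id INTEGER NOT NULL REFERENCES company(company_id),
                       admin_id INTEGER NOT NULL REFERENCES admin(admin_id),
                       name TEXT NOT NULL);
-- Index leading with the tenant key so per-company lookups can use it.
CREATE INDEX ix_employee_company ON employee (company_id, admin_id);
""")

conn.executemany("INSERT INTO company VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])
conn.executemany("INSERT INTO admin VALUES (?, ?, ?)", [(10, 1, "Ann"), (20, 2, "Bob")])
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)",
                 [(100, 1, 10, "Eve"), (101, 1, 10, "Dan"), (200, 2, 20, "Zed")])

def employees_for_company(conn, company_id):
    """Single entry point for reads; the company key is a mandatory parameter."""
    rows = conn.execute(
        "SELECT name FROM employee WHERE company_id = ? ORDER BY name",
        (company_id,))
    return [r[0] for r in rows]

acme_staff = employees_for_company(conn, 1)
globex_staff = employees_for_company(conn, 2)
```

In a real deployment the same role would be played by the validated stored procedures mentioned above; the Python function only mimics their shape.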

Depending on the use cases of your application, there is no "best" way.
Please explain the operations your application will provide so we can get further insight into your problem.
The data to be stored seems to be structured, so at first glance a relational database would work out well, but keep in mind the point I made above.

You have not said how this data links at all or if there are even any links between them. However, at a guess, you need 3 tables.
EmployeeTable
AdminTable
CompanyTable
Each with the required properties in there. Without additional information I'm not able to provide any more guidance.

Related

Will creating separate databases in SQL Server give me better performance?

All, I'm a programmer by trade but for this particular project I'm finding myself being the DBA as well. Here is the scenario I'm faced with:
Web app with anywhere from 400-1000 customers. A customer is a "physical company", each of which has n-number of users. Each customer (company) has on average 1GB worth of data (total of about 200 million rows). Each company has probably 80% similar data in terms of the type of data stored. The other 20% is custom data that the companies can themselves define (basically custom fields).
I am trying to figure out the best way to scale this on the cheap when you consider that the customers need pretty good reaction time. For example, customer X might want to grab all records where last name like 'smith' and phone like '555' whereas customer Y might want to grab all records where account number equals '1526A'.
Bottom line, performance is key and I'm finding it hard to decide what to index and if that is even going to help me given the fact these guys can basically create their own query through the UI.
My question is, what would you do? Do you think it would be wise to break each customer out into its own DB? Total DB size at the moment is around 400GB.
It is a complete re-write so I have the fortune of being able to start fresh if needed. Any thoughts, hints would be greatly appreciated.

"Bottom line, performance is key and I'm finding it hard to decide what to index and if that is even going to help me given the fact these guys can basically create their own query through the UI."
Bottom line, you're ceding your DB performance to the whims of your clients. If they're able to "create their own query", then they're able to "create their own REALLY BAD queries".
So, if you run this in a shared environment (i.e. the same hardware), then customer A's awful table scans can saturate the I/O for everyone else.
If they're on the same database server, then Customer A's scans get to flush all of your other customers' data from the data cache.
Basically, the more you "share", the more one customer can impact the operations of other customers. If you give customers the capability to do expensive things, and share much of it, then everyone suffers.
So, the options are a) don't let the customers do silly things or b) keep the customers as separated as practical so that when one does do silly things, the phones don't light up from all of the other customers.
If you don't know "what to index" then you are not offering much control over what the customers can do, and thus the silly thing factor goes way up.
You would probably get quite far by offering several popular, pre-made SQL views that the customers can select from, and then they're limited to simply filtering and possibly ordering the results. Then you optimize around execution of those views.
It's likely that surprisingly few "general" views can cover a large amount of the use cases.
Generic, silly queries can be delegated to a batch process that runs overnight, during off hours, or to a separate machine that doesn't impact transactional performance, such as a nightly snapshot with "everything but today's data" on it. Let them run historic queries against that.
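To make the "pre-made views plus limited filtering" idea concrete, here is a small sketch in Python with sqlite3 standing in for SQL Server. Everything in it is hypothetical: the `PREMADE_VIEWS` catalogue, the `run_view` helper and the `contact` table are invented for illustration. Customers pick a view by name and may only filter or order on columns you have whitelisted, so you know in advance which predicates to index for.

```python
import sqlite3

# Hypothetical catalogue of pre-made queries; names and columns are assumptions.
PREMADE_VIEWS = {
    "contacts": {
        "sql": "SELECT last_name, phone, account_no FROM contact WHERE company_id = ?",
        "filters": {"last_name": "last_name LIKE ?", "account_no": "account_no = ?"},
        "order_by": {"last_name", "account_no"},
    },
}

def run_view(conn, view, company_id, filters=None, order_by=None):
    """Run a pre-made view with only whitelisted filters and ordering."""
    spec = PREMADE_VIEWS[view]  # an unknown view name raises KeyError
    sql, params = spec["sql"], [company_id]
    for column, value in (filters or {}).items():
        if column not in spec["filters"]:
            raise ValueError(f"filter not allowed: {column}")
        sql += " AND " + spec["filters"][column]
        params.append(value)  # values are always bound, never concatenated
    if order_by is not None:
        if order_by not in spec["order_by"]:
            raise ValueError(f"ordering not allowed: {order_by}")
        sql += " ORDER BY " + order_by
    return conn.execute(sql, params).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contact (company_id INTEGER, last_name TEXT, "
             "phone TEXT, account_no TEXT)")
conn.executemany("INSERT INTO contact VALUES (?, ?, ?, ?)", [
    (1, "Smith", "555-0100", "1526A"),
    (1, "Jones", "555-0101", "9000B"),
    (2, "Smith", "555-0199", "7777C"),
])
smiths = run_view(conn, "contacts", 1, {"last_name": "Smith%"}, order_by="last_name")
```

A request to filter on a column outside the whitelist (here, `phone`) is rejected before any SQL reaches the server, which is what keeps the "silly thing factor" down.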

The SO question How to design a multi tenant database has a link to a decent article on the tradeoffs along the spectrum from "shared nothing" to "shared everything". Also, SO has a tag for those kinds of questions; I added it for you.

Creating separate databases on the same server won't help you get better performance. The performance optimisations available to you with multiple databases are just the same as you can achieve with one database.
Separate databases might make sense for administrative reasons - if different backup or availability requirements apply to different customers for example.
It's still probably sensible to build your application so that it can support multiple databases so that you have the option of scaling out over multiple DB servers.

If you have separate databases, the 80% that is the same becomes almost impossible to keep the same over time. You will end up spending far more money on maintenance.
Luckily SQL Server has some options for you. First, put the customer-specific information in the same database in a separate schema and the common stuff in a different schema (create a common schema and a schema for each client).
Next, set up data partitioning by client. This can require the proper hardware to do effectively.
Now you have one code base for the common part, which will propagate changes to all clients at once, and clients are separated for performance using the partitions.

What is the best database design for thousand rows

I'm about to start a Database Design that will simply manage users under companies.
Each company will have an admin area that can manage users
Each company will have around 25,000 users
Client believes to have around 50 companies to start
My main question is
Should I create tables based on Companies? like
users_company_0001 users_company_0002 users_company_0003 ...
since each company will never use "other" users, and nothing will need to sum/count across the different users_company tables (a simple JOIN would do the trick for getting the overall picture; it is more expensive in time, but it would work, though this will never be needed).
or should I just create a users table to have (50 x 25000) 1,250,000 users (and growing).
I'm thinking about the first option, though I'm not sure how I would use Entity Framework on such a layout... I would probably need to go back to the 90's and generate my Data Logic Layer by hand,
as it would be a simple call to stored procedures taking the Company Id.
What would you suggest?
The application will be ASP.NET (probably MVC, I'm still trying to figure this out as all my knowledge is in WebForms, though I saw Scott Hanselman's MVC videos - seems easy - but I know it will not be that easy, as problems will come and I will take more time to fix them), plus Microsoft SQL.

Even though you've described this as a 1-many relationship, I'd still design the DB as many-to-many to guard against a future change in requirements. Something like:

Having worked with a multi-terabyte SQL Server database, and having experience with hundreds of tables over the course of my career with multi-million rows, I can tell you with full assurance that SQL Server can handle your company and users tables without partitioning. It's always there when you need it, but your worry shouldn't be about your tables - pick the simplest schema that meets your needs. If you want to do something to optimize performance, your bottleneck will almost assuredly be your disks. Don't buy large, slow disks. Get yourself a bunch of small, high RPM disks and spread your data out across them as much as possible, and don't share disks with your logs and your data. With databases, you're almost always better off achieving performance with good hardware, a good disk subsystem, and proper indexing. Don't compromise and over complicate your schema trying to anticipate performance - you'll regret it. I've seen really big databases where that sort of thing was necessary, but yours ain't it.

re: Should I create tables based on Companies?
yes
like
users_company_0001 users_company_0002 users_company_0003
no, like
companyID companyName, contactID
or should I just create a users table to have (50 x 25000) 1 250 000 users (and growing)
yes

I think you should create separate tables for Company and User. Then
a third table to connect the two: CompanyAdmin. Something like:
Company(Company_Id, Company_name, ...)
User(User_Id, User_name, ...)
CompanyAdmin(Company_id, User_id)
This way you can add users and/or companies without affecting the number
of tables you need to manage. It is generally a bad design where you need
to modify the database (ie. add tables) when new data (companies) are added to the system.
With proper indexing, the join costs in a database containing
a few million rows should not be a problem.
Finally, if you ever need to change or record additional information about
Companies, Users or the relationship between them, this setup should
have the least amount of impact on your application.
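Here is a minimal, runnable sketch of that three-table layout, using Python's sqlite3 module as a stand-in for SQL Server. The table and column names follow the outline above; the sample rows and the `users_of` helper are assumptions added for illustration. (In SQL Server itself, `User` is a reserved word, so you would bracket it or pick another name.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Company (Company_Id INTEGER PRIMARY KEY, Company_name TEXT NOT NULL);
CREATE TABLE User    (User_Id INTEGER PRIMARY KEY, User_name TEXT NOT NULL);
-- Link table: one row per (company, user) pair.
CREATE TABLE CompanyAdmin (
    Company_Id INTEGER NOT NULL REFERENCES Company(Company_Id),
    User_Id    INTEGER NOT NULL REFERENCES User(User_Id),
    PRIMARY KEY (Company_Id, User_Id)
);
""")
conn.executemany("INSERT INTO Company VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])
conn.executemany("INSERT INTO User VALUES (?, ?)", [(1, "ann"), (2, "bob"), (3, "cy")])
conn.executemany("INSERT INTO CompanyAdmin VALUES (?, ?)", [(1, 1), (1, 2), (2, 3)])

def users_of(conn, company_name):
    """Hypothetical helper: list the users linked to one company."""
    rows = conn.execute("""
        SELECT u.User_name
        FROM Company c
        JOIN CompanyAdmin ca ON ca.Company_Id = c.Company_Id
        JOIN User u ON u.User_Id = ca.User_Id
        WHERE c.Company_name = ?
        ORDER BY u.User_name""", (company_name,))
    return [r[0] for r in rows]

# Onboarding a new company only adds rows; no new table is created.
conn.execute("INSERT INTO Company VALUES (3, 'Initech')")
conn.execute("INSERT INTO User VALUES (4, 'dee')")
conn.execute("INSERT INTO CompanyAdmin VALUES (3, 4)")
```

The composite primary key on the link table doubles as the index that keeps the join cheap when looking up users by company.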

Users should not get their own set of tables. It will most likely not perform as well as one table (properly indexed), and schema changes will have to be deployed to all user tables.
You could have default values specified on the table for things that are optional.
With difficulty. With one set of tables it will be a lot easier, and probably faster.
That sort of data should be stored in a User Preferences table that stores all preferences for all users. Again, don't duplicate the schema for all users.

Generally the idea of creating separate tables for each entity (in this case users) is not a good idea. If each table is separate querying may be cumbersome.
If your table is large you should optimize the table with indexes. If it gets very large, you also may want to look into partitioning tables.
This allows you to see the table as 1 object, though it is logically split up - the DBMS handles most of the work and presents you with 1 object. This way you SELECT, INSERT, UPDATE, ALTER etc as normal, and the DB figures out which partition the SQL refers to and performs the command.
Not splitting up the tables by user, and instead using indexes and partitions, would deal with scalability while maintaining performance. If you don't split up the tables manually, that also makes points 2, 3, and 4 moot.
Here's a link to partitioning tables (SQL Server-specific):
http://databases.about.com/od/sqlserver/a/partitioning.htm

It doesn't make any kind of sense to me to create a set of tables for each user. If you have a common set of tables for all users then I think that avoids all the issues you are asking about.

It sounds like you need to locate a primer on relational database design basics. Regardless of the type of application you are designing, you should start there. Learn how joins work, indices, primary and foreign keys, and so on. Learn about basic database normalization.
It's not customary to create new tables on-the-fly in an application; it's usually unnecessary in a properly designed schema. Usually schema changes are done at deployment time. The only time "users" get their own tables is an artifact of a provisioning decision, wherein each "user" is effectively a tenant in a walled-off garden; this only makes sense if each "user" (more likely, a company or organization) never needs access to anything that other users in the system have stored.
There are mechanisms for dealing with loosely structured types of information in databases, but if you find yourself reaching for this often (the most common method is called Entity-Attribute-Value), your problem is either not quite correctly modeled, or you may not actually need a relational database, in which case it might be better off with a document-oriented database like CouchDB/MongoDB.
Adding, based on your updated comments/notes:
Your concerns about the number of records in a particular table are most likely premature. Get something working first. Most modern DBMSes, including newer versions of MySQL, support mechanisms beyond indices and clustered indices that can help deal with large numbers of records. To wit, in MS SQL Server you can create a partition function on fields on a table; MySQL 5.1+ has a few similar partitioning options based on hash functions, ranges, or other mechanisms. Follow well-established conventions for database design, modeling your domain as sensibly as possible, then adjust when you run into problems. First adjust using the tools available within your choice of database, then consider more drastic measures only when you can prove they are needed. There are other kinds of denormalization that are more likely to make sense before you would even want to consider having something as unidiomatic to database systems as a "table per user" model; even if I were to look at that route, I'd probably consider something like materialized views first.

I agree with the comments above that say that a table per user is a bad idea. Also, while it's a good idea to have strategies in mind now for how you can cope when things get really big, I'd concentrate on getting things right for a small number of users first - if no-one wants to / is able to use your service, then unfortunately you won't be faced with the problem of lots of users.
A common approach among very large sites is database sharding. The summary is: you have N instances of your database in parallel (on separate machines), and each holds 1/N of the total data. There's some shared way of knowing which instance holds a given bit of data. To access some data you have 2 steps, rather than the 1 you might expect:
Work out which shard holds the data
Go to that shard for the data
There are problems with this, such as: you set up e.g. 8 shards and they all fill up, so you want to share the data over e.g. 20 shards -> migrating data between shards.
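The two steps, and the migration problem, can be sketched in a few lines of Python. This is a toy under stated assumptions: the `ShardRouter` class and `moved_keys` function are invented for illustration, in-memory dicts stand in for separate database servers, and the routing rule (hash of the key modulo the shard count) is just one simple choice; real sites often use lookup tables or consistent hashing instead.

```python
import hashlib

class ShardRouter:
    """Toy router: each dict stands in for one database instance."""

    def __init__(self, shard_count):
        self.shards = [dict() for _ in range(shard_count)]

    def shard_for(self, key):
        # Step 1: work out which shard holds the data (stable hash, then modulo).
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, value):
        self.shards[self.shard_for(key)][key] = value

    def get(self, key):
        # Step 2: go to that shard for the data.
        return self.shards[self.shard_for(key)].get(key)

def moved_keys(keys, old_count, new_count):
    """Keys whose home shard changes when the shard count changes."""
    old, new = ShardRouter(old_count), ShardRouter(new_count)
    return [k for k in keys if old.shard_for(k) != new.shard_for(k)]

router = ShardRouter(8)
for user_id in range(1000):
    router.put(user_id, f"user-{user_id}")

# Growing from 8 to 20 shards forces many keys to be relocated.
relocated = moved_keys(range(1000), 8, 20)
```

With plain modulo routing, changing the shard count typically relocates a large share of the keys, which is exactly the migration cost described above.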

SQL Server architecture guidance

We are designing a new version of our existing product on a new schema.
It's an internal web application with possibly 100 concurrent users (max). This will run on a SQL Server 2008 database.
One of the recent discussion items is whether we should have a single database or split the database across 2 separate databases for performance reasons.
The database could grow anywhere from 50-100GB over 5 years.
We are Developers and not DBAs so it would be nice to get some general guidance.
[I know the answer is not simple as it depends on the schema, archiving policy, amount of data etc. ]
Option 1 Single Main Database
[This is my preferred option].
The plan would be to have all the tables in a single database and possibly to use file groups and partitioning to separate the data if required across multiple disks. [Use schema if appropriate]. This should deal with the performance concerns
One of the comments on this was that a single server instance would still be processing this data, so there would still be a processing bottleneck.
For reporting we could have a separate reporting DB but this is still being discussed.
Option 2 Split the database into 2 separate databases
DB1 - Customers, Accounts, Customer resources etc
DB2 - This would contain the bulk of the data [i.e. Vehicle tracking data, financial transaction tables etc].
These tables would typically contain a lot of data. [It could reside on a separate server if required]
This plan would involve keeping the main data in a smaller database [DB1] and retaining the [mainly] read only transaction type data in a separate DB [DB2]. The UI would mainly read from DB1 and thus be more responsive.
[I'm aware that this option makes it harder for Referential Integrity to be enforced.]
Points for consideration
As we are at the design stage we can at least make proper use of indexes to deal with performance issues, so that's why option 1 is attractive to me, and it's more of a standard approach.
For both options we are considering implementing an archiving database.
Apologies for the long Question. In summary the question is 1 DB or 2?
Thanks in advance,
Liam

Option 1 in my opinion is the way to go.
CPU is very unlikely to be your bottleneck with 100 concurrent users providing your workload. You could acquire a single multi-socket server with additional CPU capacity available via hot swap technology to offer room to grow should you wish. Dependent on your availability requirements you could also consider using a Clustering solution to allow for swapping in more processing CPU resource by forced fail over to another node.
The performance of your disk subsystem is going to be your biggest concern. Your design decisions will be influenced by the storage solution you use, which I assume will be SAN technology.
As a minimum you will want to place your LOG (RAID 1) and DATA (RAID 10 or 5, dependent on workload) files on separate LUNs.
Dependent on your table access you may wish to consider placing different filegroups on separate LUNs. Partitioning your table data could prove advantageous to you, but only for large tables.

50 to 100GB and 100 users is a pretty small database by most standards today. Don't over engineer your solution by trying to solve problems that you haven't even seen yet. Splitting it into two databases, especially on two different servers will create a mountain of headaches that you're better off without. Concentrate your efforts on creating a useful product instead.

I agree with the other comments stating that between 50 and 100GB is small these days. I'd also agree that you shouldn't overengineer.
But, if there is an obvious (or not so obvious) logical separation between the entities you store (like you say, one being read-write and the other parts mainly read-only), I'd still split it in different dbs. At least I would design it in a way I could easily factor one piece out. Security would be one reason, management/backup/restore another, easier serviceability (because inherently the design will be better factored and parts better isolated from each other), and, in SQL Server, ability to scale out (or the lack thereof if it is a single database). Separating login and content databases for example often makes sense for bigger web applications.
And, if you really want a sound design, separate your entities in a single db, using different schemas, putting proper permissions on objects, you end up with almost the same effort in my eyes.
Microsoft products like SharePoint, TFS and BizTalk all use several different databases (Though I do not pretend to be aware of the reasons / probably just the outcome of the way they organize their teams).
Especially with regard to that you cannot scale out a single database instance on SQL Server (clustering needs multiple instances), I'd be tempted to split it.
@John: I would never use RAID5. It serves no purpose other than to hurt performance. I agree with the RAID10 approach.

Putting data in another database is not going to make the slightest difference to performance. Performance is a factor of other things entirely.
A reason to create a new database is for maintenance and administration reasons. For example if one set of data needs a different backup and recovery policy or has higher availability requirements.

DB Strategy for inserting into a high read table (Sql Server)

Looking for strategies for a very large table with data maintained for reporting and historical purposes, a very small subset of that data is used in daily operations.
Background:
We have Visitor and Visits tables which are continuously updated by our consumer facing site. These tables contain information on every visit and visitor, including bots and crawlers, direct traffic that does not result in a conversion, etc.
Our back end site allows management of the visitors (leads) from the front end site. Most of the management occurs on a small subset of our visitors (visitors that become leads). The vast majority of the data in our visitor and visit tables is maintained only for a much smaller subset of user activity (basically reporting type functionality). This is NOT an indexing problem; we have done all we can with indexing and keeping our indexes clean, small, and not fragmented.
ps: We do not currently have the budget or expertise for a data warehouse.
The problem:
We would like the system to be more responsive to our end users when they are querying, for instance, the list of their assigned leads. Currently the query is against a huge data set of mostly irrelevant data.
I am pondering a few ideas. One involves new tables and a fairly major re-architecture; I'm not asking for help on that. The other involves creating redundant data (for instance a Visitor_Archive and a Visitor_Small table), where the larger visitor and visit tables exist for inserts and history/reporting, and the smaller visitor1 table would exist for managing leads: sending a lead an email, looking up a lead's phone number, getting my list of leads, etc.
The reason I am reaching out is that I would love opinions on the best way to keep the Visitor_Archive and the Visitor_Small tables in sync...
Replication? Can I use replication to replicate only data with a certain column value (FooID = x)?
Any other strategies?

It sounds like your table is a perfect candidate for partitioning. Since you didn't mention it, I'll briefly describe it, and give you some links, in case you're not aware of it.
Partitioning lets you divide the rows of a table/index across multiple physical or logical devices, and it is specifically meant to improve performance on data sets where you may only need a known subset of the data to work with at any time. Partitioning a table still allows you to interact with it as one table (you don't need to reference partitions or anything in your queries), but SQL Server is able to perform several optimizations on queries that only involve one partition of the data. In fact, in Designing Partitions to Manage Subsets of Data, the AdventureWorks examples pretty much match your exact scenario.
I would do a bit of research, starting here and working your way down: Partitioned Tables and Indexes.

Simple solution: create a separate, de-normalized table with all the fields in it. Create a stored procedure that will update this table on your schedule. Create a SQL Agent job to call the SP.
Index the table as you see how it's queried.
If you need to purge history, create another table to hold it and another SP to populate it and clean main report table.
You may end up with multiple report tables - it's OK - space is cheap these days.
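A rough sketch of that refresh step, written in Python with sqlite3 as a stand-in for SQL Server. The `visitor`, `visit` and `lead_report` tables, their columns, and the `refresh_lead_report` function are all assumptions invented for this example; in the real system the function body would live in the stored procedure and a SQL Agent job would call it on a schedule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visitor (visitor_id INTEGER PRIMARY KEY, email TEXT,
                      is_lead INTEGER NOT NULL);
CREATE TABLE visit   (visit_id INTEGER PRIMARY KEY,
                      visitor_id INTEGER NOT NULL, visited_at TEXT NOT NULL);
-- Small de-normalized table holding only what the lead screens need.
CREATE TABLE lead_report (visitor_id INTEGER PRIMARY KEY, email TEXT,
                          visit_count INTEGER, last_visit TEXT);
""")
conn.executemany("INSERT INTO visitor VALUES (?, ?, ?)", [
    (1, "a@example.com", 1),   # lead
    (2, None, 0),              # bot or crawler, never a lead
    (3, "c@example.com", 1),   # lead with no recorded visits
])
conn.executemany("INSERT INTO visit VALUES (?, ?, ?)", [
    (1, 1, "2024-01-01"), (2, 1, "2024-01-05"), (3, 2, "2024-01-02"),
])

def refresh_lead_report(conn):
    """Rebuild the report table from the large tables in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM lead_report")
        conn.execute("""
            INSERT INTO lead_report (visitor_id, email, visit_count, last_visit)
            SELECT v.visitor_id, v.email, COUNT(s.visit_id), MAX(s.visited_at)
            FROM visitor v
            LEFT JOIN visit s ON s.visitor_id = v.visitor_id
            WHERE v.is_lead = 1
            GROUP BY v.visitor_id, v.email""")

refresh_lead_report(conn)
report = conn.execute("SELECT * FROM lead_report ORDER BY visitor_id").fetchall()
```

A full rebuild is the simplest variant; if the refresh becomes too slow, an incremental version that only touches rows changed since the last run is the usual next step.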
