DB Strategy for inserting into a high read table (Sql Server) - sql-server

Looking for strategies for a very large table with data maintained for reporting and historical purposes, a very small subset of that data is used in daily operations.
Background:
We have Visitor and Visits tables which are continuously updated by our consumer facing site. These tables contain information on every visit and visitor, including bots and crawlers, direct traffic that does not result in a conversion, etc.
Our back end site allows management of the visitors (leads) from the front end site. Most of the management occurs on a small subset of our visitors (visitors that become leads). The vast majority of the data in our visitor and visit tables is maintained only for a much smaller subset of user activity (basically reporting type functionality). This is NOT an indexing problem; we have done all we can with indexing and keeping our indexes clean, small, and not fragmented.
ps: We do not currently have the budget or expertise for a data warehouse.
The problem:
We would like the system to be more responsive to our end users when they are querying, for instance, the list of their assigned leads. Currently the query is against a huge data set of mostly irrelevant data.
I am pondering a few ideas. One involves new tables and a fairly major re-architecture; I'm not asking for help on that. The other involves creating redundant data (for instance a Visitor_Archive and a Visitor_Small table), where the larger visitor and visit tables exist for inserts and history/reporting, and the smaller Visitor_Small table exists for managing leads: sending a lead an email, looking up a lead's phone number, listing my leads, etc.
The reason I am reaching out is that I would love opinions on the best way to keep the Visitor_Archive and the Visitor_Small tables in sync...
Replication? Can I use replication to replicate only data with a certain column value (FooID = x)?
Any other strategies?

It sounds like your table is a perfect candidate for partitioning. Since you didn't mention it, I'll briefly describe it, and give you some links, in case you're not aware of it.
Partitioning lets you divide the rows of a table/index across multiple physical or logical devices, and it is specifically meant to improve performance of data sets where you may only need a known subset of the data to work with at any time. Partitioning a table still allows you to interact with it as one table (you don't need to reference partitions or anything in your queries), but SQL Server is able to perform several optimizations on queries that only involve one partition of the data. In fact, in Designing Partitions to Manage Subsets of Data, the AdventureWorks examples pretty much match your exact scenario.
I would do a bit of research, starting here and working your way down: Partitioned Tables and Indexes.
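To make the idea concrete, here is a minimal Python sketch of how a range partition function maps a key value to a partition number. It only mimics the boundary rule (in SQL Server the engine does this for you); the boundary values and the `partition_number` helper are hypothetical, chosen for illustration.

```python
from bisect import bisect_left, bisect_right

def partition_number(value, boundaries, range_right=False):
    """Return the 1-based partition number for `value`.

    `boundaries` must be sorted ascending. With range_right=False a
    boundary value belongs to the partition on its left (RANGE LEFT);
    with range_right=True it belongs to the partition on its right.
    """
    if range_right:
        return bisect_right(boundaries, value) + 1
    return bisect_left(boundaries, value) + 1

# Hypothetical boundaries on a visit-date key stored as YYYYMMDD.
boundaries = [20230101, 20240101]

old_visit = partition_number(20220615, boundaries)   # before first boundary
on_boundary_left = partition_number(20230101, boundaries)
on_boundary_right = partition_number(20230101, boundaries, range_right=True)
recent_visit = partition_number(20250301, boundaries)
```

A query that filters on the partitioning key only needs to touch the partitions whose ranges overlap the filter, which is where the benefit for "small working subset" workloads comes from.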

Simple solution: create a separate, de-normalized table with all the fields in it. Create a stored procedure that will update this table on your schedule. Create a SQL Agent job to call the SP.
Index the table as you see how it's queried.
If you need to purge history, create another table to hold it and another SP to populate it and clean main report table.
You may end up with multiple report tables - it's OK - space is cheap these days.
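As a rough sketch of that refresh step, the following Python uses SQLite as a stand-in for SQL Server. The table layout, the `is_lead` flag, and the `refresh_leads` function are assumptions made for illustration; in SQL Server the body of `refresh_leads` would live in the stored procedure that the Agent job calls.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visitor (
    visitor_id INTEGER PRIMARY KEY,
    email      TEXT,
    is_lead    INTEGER NOT NULL DEFAULT 0   -- hypothetical flag column
);
CREATE TABLE visitor_small (
    visitor_id INTEGER PRIMARY KEY,
    email      TEXT
);
""")
conn.executemany(
    "INSERT INTO visitor (visitor_id, email, is_lead) VALUES (?, ?, ?)",
    [(1, None, 0), (2, "b@example.com", 1), (3, None, 0)],
)
conn.commit()

def refresh_leads(conn):
    """Rebuild the small table from the big one in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM visitor_small")
        conn.execute(
            "INSERT INTO visitor_small (visitor_id, email) "
            "SELECT visitor_id, email FROM visitor WHERE is_lead = 1"
        )

refresh_leads(conn)
lead_rows = conn.execute(
    "SELECT visitor_id, email FROM visitor_small ORDER BY visitor_id"
).fetchall()
```

A full rebuild is the simplest variant; if the lead subset grows large, the same procedure could copy only rows changed since the last run.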

Related

Redshift - DataWarehouse Data Refresh

I have my data warehouse built on Amazon Redshift. The problem I am currently facing is that I have a huge fact table (about 500M rows) in my schema with data for about 10 clients. I have processes that periodically (mostly daily) generate data for this fact table and require a refresh, meaning: delete the old data and insert the newly generated data.
The problem is that this bulk delete-insert operation leaves holes in my fact table, which then needs a VACUUM that is time consuming and hence can't be done immediately. And this fact table (with huge holes due to deleted data) dramatically impacts the snapshot time of the process that consumes data from the fact and dimension tables and refreshes it in the downstream presentation area. How can I optimize such a bulk data refresh in a DWH environment?
I believe this should be a well known problem in DWH with some recommended ways to solve. Can anyone please point out the recommended solution?
P.S: One solution can be to create table per client and have a view on top of it which does a union of all the underlying tables. In this case, if I break the fact table per client it is pretty small and can be vacuumed quickly after delete-insert, but looking for solutions with better maintainability.
You might try different types of vacuum. There is "VACUUM DELETE ONLY", which will reclaim the space but won't re-sort the rows; I'm not sure if it's applicable to your use case.
More info here: http://docs.aws.amazon.com/redshift/latest/dg/t_Reclaiming_storage_space202.html
Alternatively, I used the deep copy approach when I was fighting with vacuuming tables with too many columns. The problem with this could be that you will need a lot of space for the intermediate step.
More info here:
http://docs.aws.amazon.com/redshift/latest/dg/performing-a-deep-copy.html
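The shape of a deep copy is: create a new table with the same definition, bulk-insert the live rows into it, drop the original, and rename the copy. The Python sketch below shows only that sequence, with SQLite standing in for Redshift; the `fact` table and its columns are hypothetical, and Redshift-specific details such as sort and distribution keys are not modeled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact (client_id INTEGER, amount REAL);
INSERT INTO fact VALUES (1, 10.0), (2, 20.0), (1, 30.0);
-- Daily refresh for client 1: delete old rows, insert new ones.
DELETE FROM fact WHERE client_id = 1;
INSERT INTO fact VALUES (1, 15.0);
""")

def deep_copy(conn):
    """Rewrite the table into a fresh copy and swap it in."""
    conn.executescript("""
    BEGIN;
    CREATE TABLE fact_copy (client_id INTEGER, amount REAL);
    INSERT INTO fact_copy SELECT * FROM fact;
    DROP TABLE fact;
    ALTER TABLE fact_copy RENAME TO fact;
    COMMIT;
    """)

deep_copy(conn)
fact_rows = conn.execute(
    "SELECT client_id, amount FROM fact ORDER BY client_id"
).fetchall()
```

The intermediate copy is why this approach temporarily needs extra space, as noted above.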

Database table creation for large data

I am making a client management application in which I am storing the data of employees, admins and companies. In the future the database will have hundreds of companies registered. I want to choose the best approach to the database design.
I can think of 2 approaches:
Making all tables of app separately for each company
Storing all data in app database
Can you suggest the best way to do that?
Please note that all 3 tables are linked on the basis of ids, there will be hundreds of companies, each company will have many admins, and each admin will have hundreds of employees. What would be the best approach with regard to security and query performance?
With the partial information you provided, it looks like 3 normalized tables are what you need, plus auxiliary data like lookups and other stuff.
But when you design a database you need to consider many more points, like security, visibility, client access methods, etc.
For example, if you want to ensure isolation and not allow users any visibility into others' data, you could dynamically create a schema per company, and create the users and access rights for each schema dynamically. Then you'll need to support all of this in the DAL, which in fact will be quite fat.
Another approach for the DAL could be exposing views that always return the subset for one company.
A big reason that I would suggest going for the normalized approach is that maintenance will be much easier this way.
From a SQL point of view I don't see any performance advantage having many tables or just 3, efficiency of the indexes, and smart DAL will make the difference.
The performance of the query doesn't depend much on the size of the table; it depends more on the indexes you have on that table. So you need to put clustered and non-clustered indexes in place as per your requirements, and I can guarantee that up to 10 GB of data you will not face any problem.
This is a classic problem shared by most web business services: for discussions of the factors involved, Google "multi-tenant architecture."
You almost certainly want to put all companies into a common set of tables: each data table should reference the company key, and all queries should join on that key, among their other criteria. This allows the best overall performance, and saves you the potential maintenance nightmare of duplicating views, stored procedures and so on hundreds of times, or of having to apply the same structural changes to hundreds of tables should you wish to add a field or a table.
To help assure that you don't inadvertently intermingle data from different customers, it might be useful to do all data access through a validated set of stored procedures (all of which take the company ID as a parameter).
Hundreds of parallel databases will not scale very well: the DB server will constantly be pushing tables and indexes out of memory to accommodate the next query, resulting in disk thrashing and poor performance, as well. There is only pain down that path.
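A minimal sketch of the shared-tables design, written in Python against SQLite rather than SQL Server. The table and column names are hypothetical, and `employees_for_company` plays the role of a stored procedure that always takes the company ID as a parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company (
    company_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY,
    company_id  INTEGER NOT NULL REFERENCES company(company_id),
    name        TEXT NOT NULL
);
-- Every tenant-scoped query filters on company_id, so index it.
CREATE INDEX ix_employee_company ON employee (company_id);

INSERT INTO company VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO employee VALUES (1, 1, 'Ann'), (2, 1, 'Bob'), (3, 2, 'Cy');
""")

def employees_for_company(conn, company_id):
    """Single access path: the company key is a required parameter."""
    rows = conn.execute(
        "SELECT name FROM employee WHERE company_id = ? ORDER BY name",
        (company_id,),
    )
    return [r[0] for r in rows]

acme_staff = employees_for_company(conn, 1)
globex_staff = employees_for_company(conn, 2)
```

Because callers cannot query `employee` without supplying a company key, one tenant's rows do not leak into another tenant's results by accident.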
Depending on the use cases of your application, there is no "best" way.
Please explain the operations your application will provide so we can get further insight into your problem.
The data to be stored seems to be structured, so at first glance a relational database would work out well, but keep the point above in mind.
You have not said how this data links at all or if there are even any links between them. However, at a guess, you need 3 tables.
EmployeeTable
AdminTable
CompanyTable
Each with the required properties in there, without additional information I'm not able to provide any more guidance.

Parallel bulk loading using partition switching of indexed table in SQL Server 2008

This is a follow up to a previous question of mine after definitely deciding on partition switching as the best way to quickly get data into a heavily indexed fact type table that needs to remain available to readers.
While it seems to be the best way, it is not quite good enough to really satisfy the requirement to allow several (< 5) users to bulk insert at the same time, have the new data indexed and to appear in the indexed views (not necessarily real indexed views, just selects that rely on indices).
The idea of partitioning was that each partition and the index subtree rooted at the partition could, in parallel, be locked as read-only, copied into a working table, new data inserted/updated and the indexes rebuilt then switched back into the main table so readers aren't affected.
The problem is the single working table. Each parallel bulk insert needs its own copy, with the same constraints as the main table to allow switching.
So far I've hit several walls trying to get around this bottleneck:
I tried partitioning the working table using the same partition function. This doesn't work because you can't disable the indexes on a partition basis to insert into one while rebuilding the index on another.
Creating a temporary table as the working table. This doesn't work because, while you can use the same index names, you can't easily dynamically create the constraints and can't switch that in anyway.
Have a fixed set of named working tables? How can I select one and work with it under an alias so I have just one stored proc?
Dynamic SQL? I've tried very hard to avoid going that route. It's complicated as it is.
Big challenge but has anyone got any ideas before I accept the bottleneck? Would Sql 2012 help? How do proper data warehouses cope with this?
How do proper data warehouses cope with this? Compromise and set realistic goals for the EDW. The data warehouse can't be everything to everyone. Make sure that what you're implementing is the best solution for the business (not just the techies/analysts). Are your goals realistic if you cannot find solutions from experienced peers and experts?
Associate a cost with all of the hoops you jump through. Does the data really need to be up to the minute? What if I told you that we needed to spend another $200,000 on storage because we're constantly duplicating partitions and rebuilding indexes and the current solution can't keep up with the IOPS demand? At some point, they're going to figure out that it's not free. While you don't need to just say no, you do need to be realistic and up-front about the cost associated. Additionally, your storage admin will thank you.
As for 2012, there is a new columnstore index which can reduce or replace all of the current nonclustered indexes you're using to cover all your analysts' search requests. It's highly compressed, covers a very wide variety of search arguments, and utilizes the new Batch execution mode. It performs best on low selectivity queries like the ones frequently performed on fact tables. The one catch is that you can't directly do updates. You'll have to switch the partition out to a staging table, drop the columnstore on the staging table, update the staging table, add the columnstore back, then switch the partition back into the fact table. It sounds like a lot, but it could be significantly faster and require less IO than maintaining all of those nonclustered indexes.
My question has always been "Is it really a fact table if it is constantly changing?". This is not OLTP is it? Try offsetting transactions or at least push all updates to a scheduled off-peak time. Updating fact tables is becoming a thing of the past. All of the big boys are moving toward the "Update frowned upon" column oriented architecture for data warehousing. PowerPivot and the Analysis Services Tabular Model are built on the columnstore technology.
Finally, review Kimball's DW Toolkit books. He has several that lay out best practices and cover edge-case scenarios. What I learned from them was that Data Warehouse Development is not just Database Development on steroids. It also involves politics and focusing resources on what's best for the business.

Few database design questions relating to user content site

I'm designing a user content website (kind of similar to Yelp, but for a different market and with photo sharing) and have a few database questions:
1. Does each user get their own set of tables, or are we storing multiple users' data in common tables? Since this is also a social network, databases are usually partitioned off for scalability as the user base grows, with different sets of users handled separately, so what is the best approach? I guess some data like user accounts can be in common tables, but for wall posts, photos, etc., will each user get their own table? If so, then with 10 million users, does that mean 10 million x whatever number of tables per user? This is currently being designed in MySQL.
2. How do the user tables know what to create each time a user joins the site? I am assuming there may be a system table template from which the fields are pulled?
3. In addition to the above question, if tomorrow we modify tables or add/remove features, how do we roll the changes down to all the live user accounts/tables? I know that from a page point of view we have the master template, but for the database, how will the user tables be updated? Is that something we do manually, or will each table check the system tables, say every 24 hours, for updates to its structure?
4. If the above is all true, that means we are maintaining 1 master set of tables with system default values, and then each user gets the same values copied to their tables? Take a field like the maximum failed login attempts before the system locks an account. Say we have a system default of 5 login attempts within 30 minutes, but I want to allow users to specify their own number to customize their own security. Does that mean they can overwrite the system default in their own table?
Thanks.
1. Users should not get their own set of tables. It will most likely not perform as well as one table (properly indexed), and schema changes will have to be deployed to all user tables.
2. You could have default values specified on the table for things that are optional.
3. With difficulty. With one set of tables it will be a lot easier, and probably faster.
4. That sort of data should be stored in a User Preferences table that stores all preferences for all users. Again, don't duplicate the schema for all users.
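One way to sketch the shared preferences table with a system default that a user can override is shown below, in Python against SQLite. The table names, the `max_failed_logins` key, and the `effective_pref` helper are hypothetical; the default of 5 attempts comes from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE system_default (
    pref_name  TEXT PRIMARY KEY,
    pref_value INTEGER NOT NULL
);
CREATE TABLE user_preference (
    user_id    INTEGER NOT NULL,
    pref_name  TEXT NOT NULL,
    pref_value INTEGER NOT NULL,
    PRIMARY KEY (user_id, pref_name)
);
INSERT INTO system_default VALUES ('max_failed_logins', 5);
-- User 1 tightens their own limit; user 2 has no override row.
INSERT INTO user_preference VALUES (1, 'max_failed_logins', 3);
""")

def effective_pref(conn, user_id, pref_name):
    """Return the user's override if present, else the system default."""
    row = conn.execute(
        """
        SELECT COALESCE(up.pref_value, sd.pref_value)
        FROM system_default AS sd
        LEFT JOIN user_preference AS up
               ON up.pref_name = sd.pref_name AND up.user_id = ?
        WHERE sd.pref_name = ?
        """,
        (user_id, pref_name),
    ).fetchone()
    return row[0] if row else None

limit_user_1 = effective_pref(conn, 1, "max_failed_logins")
limit_user_2 = effective_pref(conn, 2, "max_failed_logins")
```

Only users who change a setting get a row, so nothing has to be copied when an account is created, and changing a system default takes one update.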
Generally the idea of creating separate tables for each entity (in this case users) is not a good idea. If each table is separate querying may be cumbersome.
If your table is large you should optimize the table with indexes. If it gets very large, you also may want to look into partitioning tables.
This allows you to see the table as 1 object, though it is logically split up - the DBMS handles most of the work and presents you with 1 object. This way you SELECT, INSERT, UPDATE, ALTER etc as normal, and the DB figures out which partition the SQL refers to and performs the command.
Not splitting up the tables by user, and instead using indexes and partitions, would deal with scalability while maintaining performance. If you don't split up the tables manually, points 2, 3, and 4 also become moot.
Here's a link to partitioning tables (SQL Server-specific):
http://databases.about.com/od/sqlserver/a/partitioning.htm
It doesn't make any kind of sense to me to create a set of tables for each user. If you have a common set of tables for all users then I think that avoids all the issues you are asking about.
It sounds like you need to locate a primer on relational database design basics. Regardless of the type of application you are designing, you should start there. Learn how joins work, indices, primary and foreign keys, and so on. Learn about basic database normalization.
It's not customary to create new tables on-the-fly in an application; it's usually unnecessary in a properly designed schema. Usually schema changes are done at deployment time. The only time "users" get their own tables is an artifact of a provisioning decision, wherein each "user" is effectively a tenant in a walled-off garden; this only makes sense if each "user" (more likely, a company or organization) never needs access to anything that other users in the system have stored.
There are mechanisms for dealing with loosely structured types of information in databases, but if you find yourself reaching for this often (the most common method is called Entity-Attribute-Value), your problem is either not quite correctly modeled, or you may not actually need a relational database, in which case it might be better off with a document-oriented database like CouchDB/MongoDB.
Adding, based on your updated comments/notes:
Your concerns about the number of records in a particular table are most likely premature. Get something working first. Most modern DBMSes, including newer versions of MySQL, support mechanisms beyond indices and clustered indices that can help deal with large numbers of records. To wit, in MS SQL Server you can create a partition function on fields of a table; MySQL 5.1+ has a few similar partitioning options based on hash functions, ranges, or other mechanisms. Follow well-established conventions for database design, modeling your domain as sensibly as possible, then adjust when you run into problems. First adjust using the tools available within your choice of database, then consider more drastic measures only when you can prove they are needed. There are other kinds of denormalization that are more likely to make sense before you would even want to consider having something as unidiomatic to database systems as a "table per user" model; even if I were to look at that route, I'd probably consider something like materialized views first.
I agree with the comments above that say that a table per user is a bad idea. Also, while it's a good idea to have strategies in mind now for how you can cope when things get really big, I'd concentrate on getting things right for a small number of users first - if no-one wants to / is able to use your service, then unfortunately you won't be faced with the problem of lots of users.
A common approach among very large sites is database sharding. The summary is: you have N instances of your database in parallel (on separate machines), and each holds 1/N of the total data. There's some shared way of knowing which instance holds a given bit of data. To access some data you have 2 steps, rather than the 1 you might expect:
Work out which shard holds the data
Go to that shard for the data
There are problems with this, such as: you set up e.g. 8 shards and they all fill up, so you want to share the data over e.g. 20 shards -> migrating data between shards.
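The two-step lookup and the resharding problem can be sketched in a few lines of Python. This assumes the simplest possible scheme, a stable hash of the key modulo the shard count; the `shard_for` function and the `user-N` keys are hypothetical, and real deployments often use a lookup directory or consistent hashing instead.

```python
import hashlib

def shard_for(key: str, shard_count: int) -> int:
    """Step 1: work out which shard holds the data for `key`."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

keys = [f"user-{i}" for i in range(1000)]

# Step 2 would be to query the database instance with this index.
placement_8 = {k: shard_for(k, 8) for k in keys}

# Growing from 8 to 20 shards changes the home shard of many keys,
# and each of those rows has to be migrated.
moved = sum(1 for k in keys if shard_for(k, 8) != shard_for(k, 20))
```

With plain modulo placement a large share of keys move whenever the shard count changes, which is the migration cost described above.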

Database design: Running totals of row counts

I have run into the following situation several times, and was wondering what best practices say about this situation:
Rows are inserted into a table as users complete some action. For example, every time a user visits a specific portion of a website, a row is inserted indicating their IP address, username, and referring URL. Elsewhere, I want to show summary information about those actions. In our example, I'd want to allow administrators to log onto the website and see how many visits there are for a specific user.
The most natural way to do this (IMO) is to insert a row for every visit and, every time the administrator requests totals, count up the number of rows in the appropriate table for that user. However, in situations like these there can be thousands and thousands of rows per user. If administrators frequently request the totals, constantly requesting the counts could create quite a load on the database. So it seems like the right solution is to insert individual rows but simultaneously keep some kind of summary data with running totals as data is inserted (to avoid recalculating those totals over and over).
What are the best practices or most common database schema design for this situation? You can ignore the specific example I made up, my real question is how to handle cases like this dealing with high-volume data and frequently-requested totals or counts of that data.
Here are a couple of practices; the one you select will depend upon your specific situation:
Trust your database engine: Many database engines will automatically cache the query plans (and results) of frequently used queries. Even if the underlying data has changed, the query plan itself will remain the same. Appropriate parts of indexes will be kept in main memory, making rerunning a given query almost free. The most you may need to do in this case is tune the database's parameters.
Denormalize your database: While 3rd-Normal Form (3NF) is still considered the appropriate database design, for performance reasons it can become necessary to add additional tables that include summary values that would normally be calculated as necessary via a SELECT ... GROUP BY ... query. Frequently these other tables are kept up to date by the use of triggers, stored procedures, or background processes. See Wikipedia for more about Denormalization.
Data warehousing: With a data warehouse, the goal is to push copies of live data to secondary databases (warehouses) for query and special reporting purposes. This is usually done with background processes using whatever replication techniques your database supports. These warehouses are frequently indexed more rigorously than may otherwise be needed for your base application, with the intent to support large queries with massive amounts of historical data.
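As an illustration of the denormalization option, here is a minimal Python sketch using SQLite, where a trigger keeps a per-user running total in step with the detail rows. The `visit` and `visit_total` tables and the trigger name are hypothetical; other engines have their own trigger syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visit (
    username TEXT NOT NULL,
    ip       TEXT,
    referrer TEXT
);
CREATE TABLE visit_total (
    username    TEXT PRIMARY KEY,
    visit_count INTEGER NOT NULL
);
-- Keep the summary row current on every insert into the detail table.
CREATE TRIGGER visit_after_insert AFTER INSERT ON visit
BEGIN
    INSERT OR IGNORE INTO visit_total (username, visit_count)
    VALUES (NEW.username, 0);
    UPDATE visit_total
       SET visit_count = visit_count + 1
     WHERE username = NEW.username;
END;
""")

conn.executemany(
    "INSERT INTO visit (username, ip, referrer) VALUES (?, ?, ?)",
    [("alice", "10.0.0.1", None), ("bob", "10.0.0.2", None),
     ("alice", "10.0.0.1", "example.com")],
)
conn.commit()

# Cheap read: one row per user instead of a COUNT(*) over the detail rows.
alice_total = conn.execute(
    "SELECT visit_count FROM visit_total WHERE username = ?", ("alice",)
).fetchone()[0]
alice_counted = conn.execute(
    "SELECT COUNT(*) FROM visit WHERE username = ?", ("alice",)
).fetchone()[0]
```

The trade-off is extra work on every insert in exchange for constant-time reads of the totals; a background job that recomputes the summary periodically is the alternative when insert latency matters more than freshness.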

Resources