I have run into the following situation several times, and was wondering what best practices say about this situation:
Rows are inserted into a table as users complete some action. For example, every time a user visits a specific portion of a website, a row is inserted indicating their IP address, username, and referring URL. Elsewhere, I want to show summary information about those actions. In our example, I'd want to allow administrators to log onto the website and see how many visits there are for a specific user.
The most natural way to do this (IMO) is to insert a row for every visit and, every time the administrator requests totals, count up the number of rows in the appropriate table for that user. However, in situations like these there can be thousands and thousands of rows per user. If administrators frequently request the totals, constantly requesting the counts could create quite a load on the database. So it seems like the right solution is to insert individual rows but simultaneously keep some kind of summary data with running totals as data is inserted (to avoid recalculating those totals over and over).
What are the best practices or most common database schema designs for this situation? You can ignore the specific example I made up; my real question is how to handle cases like this, dealing with high-volume data and frequently requested totals or counts of that data.
Here are a couple of practices; the one you select will depend upon your specific situation:
Trust your database engine: Many database engines will automatically cache the query plans (and results) of frequently used queries. Even if the underlying data has changed, the query plan itself will remain the same. Appropriate parts of indexes will be kept in main memory, making rerunning a given query almost free. The most you may need to do in this case is tune the database's parameters.
Denormalize your database: While 3rd Normal Form (3NF) is still considered the appropriate database design, for performance reasons it can become necessary to add additional tables that include summary values that would normally be calculated as necessary via a SELECT ... GROUP BY ... query. Frequently these other tables are kept up to date by the use of triggers, stored procedures, or background processes. See Wikipedia for more about Denormalization.
Data warehousing: With a data warehouse, the goal is to push copies of live data to secondary databases (warehouses) for query and special reporting purposes. This is usually done with background processes using whatever replication techniques your database supports. These warehouses are frequently indexed more rigorously than may otherwise be needed for your base application, with the intent to support large queries with massive amounts of historical data.
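To make the denormalization option concrete, here is a minimal sketch of a summary table kept up to date by a trigger. It uses Python's built-in sqlite3 module only so that it runs anywhere; the table names (visits, visit_totals), the trigger name, and the helper functions are made up for illustration, and trigger syntax differs between database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visits (
    id         INTEGER PRIMARY KEY,
    username   TEXT NOT NULL,
    ip_address TEXT NOT NULL,
    referrer   TEXT
);

-- Denormalized running totals: one row per user.
CREATE TABLE visit_totals (
    username    TEXT PRIMARY KEY,
    visit_count INTEGER NOT NULL
);

-- Keep the running total in step with every inserted visit.
CREATE TRIGGER visits_after_insert AFTER INSERT ON visits
BEGIN
    INSERT OR IGNORE INTO visit_totals (username, visit_count)
    VALUES (NEW.username, 0);
    UPDATE visit_totals
       SET visit_count = visit_count + 1
     WHERE username = NEW.username;
END;
""")


def record_visit(username, ip_address, referrer):
    """Insert one detail row; the trigger updates the summary row."""
    with conn:
        conn.execute(
            "INSERT INTO visits (username, ip_address, referrer) VALUES (?, ?, ?)",
            (username, ip_address, referrer),
        )


def visit_count(username):
    """Read the precomputed total instead of running COUNT(*) over visits."""
    row = conn.execute(
        "SELECT visit_count FROM visit_totals WHERE username = ?", (username,)
    ).fetchone()
    return row[0] if row else 0


for _ in range(3):
    record_visit("alice", "203.0.113.7", "https://example.com/")
record_visit("bob", "203.0.113.8", None)
print(visit_count("alice"), visit_count("bob"))
```

The cost is a little extra work on every insert in exchange for a cheap single-row read when an administrator asks for a total. Whether that trade is worth it depends on how the write rate compares with the read rate.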
Related
I have a primary user table and a secondary 1-M table that stores user interests.
One of the main queries users make is querying other users with similar interests (this is done quite frequently). Since I know that the user interests are updated much less frequently than the selection queries are run, and the queries are sometimes only run against a subset (not against the full user database), I would like to run a query with every update that populates a relationship table between other users, and run the selection queries off this table.
My question is two-fold:
Is this a good practice in general or is dynamic calculation with each fetch recommended instead?
In the interest of normalization, should I simply store the number of similar interests and run queries to fetch the individual details when required?
Thanks!
I'm working on an app that has a Slack style workspace architecture where the user can access the same function of the application under multiple "instances" (workspaces).
I'm going to continue with using Slack as an example to explain my issue.
When any action is taken in my application I need to validate that the user has the rights to perform an action on the specified resource and that the resource is within the same workspace as the user.
The first tables I create, such as Users, have a simple database relationship to the workspace, for example a WorkspaceId field in the Users table.
My issue is that as I create more tables which are "further" away, such as UserSettings, which might have a one-to-one relationship to the Users table, I now have to join to the Users record to get the workspace that the UserSettings record belongs to.
So now I am wondering whether it is worth adding a WorkspaceId value to all tables, since otherwise I will end up doing a lot of JOINs in my database to keep verifying that the user has permissions to that resource.
Looking for advice/architecture patterns which may help with the scenario.
I'm assuming your main concern with multiple JOIN statements is that the query performance will suffer. Multiple JOIN statements don't always mean a query will be slow. Query performance depends on many factors: how large the dataset is, how well indexed it is, which database engine you use, and ultimately what the query plan is. You'll only end up with lots of JOIN statements if you decide to normalize the database that way. Using a full third normal form is rarely the right choice for a schema because of the potential performance impacts it can have. Some duplication of data is generally okay; the trade-off you are making is storage cost vs. query performance. To decide how to normalize the database there are many questions you should be asking. Here are some that come to mind:
What type of queries do you expect to make?
How often will each type of query be made?
How often will the data change and can a cache be used?
Does a different storage technology better suit the use case?
Is some of the data small enough that it can be all in one table?
In my experience designing user management systems, you usually end up with a cache or similar mechanism that gives fast access to a given user's permissions and has an acceptable expiry window. This means you only query the database for a given user when the cached entry expires, and use the cache the majority of the time. This is why many security systems and user systems don't apply settings changes immediately. The more granular and flexible the type of permission you want to grant users, the more expensive the query is going to be because of the complexity. At that point you can decide to denormalize the data or use a caching mechanism.
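As a rough illustration of the expiry-window idea, here is a small in-process cache sketch. The class name PermissionCache, the loader function, and the permission strings are all hypothetical; in a real system the loader would run the (possibly JOIN-heavy) permission query, and you might use a shared cache rather than a per-process dictionary.

```python
import time


class PermissionCache:
    """Caches each user's permissions for ttl_seconds before reloading."""

    def __init__(self, load_permissions, ttl_seconds=300.0, clock=time.monotonic):
        self._load = load_permissions   # called only on a miss or after expiry
        self._ttl = ttl_seconds
        self._clock = clock             # injectable so expiry can be tested
        self._entries = {}              # user_id -> (expires_at, permissions)

    def get(self, user_id):
        now = self._clock()
        entry = self._entries.get(user_id)
        if entry is not None and entry[0] > now:
            return entry[1]             # still fresh: no database round trip
        permissions = self._load(user_id)
        self._entries[user_id] = (now + self._ttl, permissions)
        return permissions

    def invalidate(self, user_id):
        """Drop a user's entry, e.g. right after their permissions change."""
        self._entries.pop(user_id, None)


def load_permissions_from_db(user_id):
    # Stand-in for the real permission query against the database.
    return frozenset({"workspace:1:read"})


cache = PermissionCache(load_permissions_from_db, ttl_seconds=60.0)
print("workspace:1:read" in cache.get(42))
```

The expiry window is the knob: a longer window means fewer database queries but a longer delay before a revoked permission takes effect, unless you also call invalidate when permissions change.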
My current employer has a huge table of items. Each item has user_id and obviously item_id properties. To improve performance and high availability my team decided to shard the table.
We are discussing two strategies:
Shard by item_id
In terms of high availability, if a shard is down then all users temporarily lose 1/N of their items. Performance will be even across all shards (random distribution).
Shard by user_id
If a shard is down then 1 of N users won't be able to access their items. Performance might not be even, because we have users with thousands of items as well as users with just one item. Also, there is a big disadvantage: now we need to pass both item_id and user_id in order to access an item.
So my question is: which one to choose? Maybe you can guide me with some mathematical formula to decide which one is better in different circumstances.
P.S. We already have replicas but they are becoming useless for our write throughput
UPDATE
We have SERP pages where we need to get items by ids, as well as pages like the user profile where the user wants to see his/her items. The first pattern is the most frequently used, unlike the second one.
We can easily give up on ACID transactions because we've started to build microservices (so eventually almost all big entities will be encapsulated in a specific microservice).
I see a couple of ways to attack this:
How do you intend to shard? Separate master servers, or separate schemas serviced by the same server but by different storage backends?
How do you access this data? Is it basically key/value? Do you need to query all of a user's items at once? How transactional do your CRUD operations need to be?
Do you foresee unbalanced shards being a problem, based on the data you're storing?
Do you need to do relational queries of this data against other data in your system?
TradeOffs
If you split shards across server/database instance boundaries, sharding by item_id means you will not be able to do a single query for info about a single user_id... you will need to query every shard and then aggregate the results at the application level. I find the aggregation has a lot more pitfalls than you'd think... better to keep this in the database.
If you can use a single database instance, sharding by creating tables/schemas that are backed by different storage subsystems would allow you to scale writes while still being able to do relational queries across them. All of your eggs are still in 1 server basket with this method, though.
If you shard by user_id, and you want to rebalance your shards by moving a user to another shard, you will need to atomically move all of the user's rows at once. This can be difficult if there are lots of rows. If you shard by item_id, you can move one item at a time. This allows you to incrementally rebalance your shards, which is awesome.
If you intend to split these into separate servers such that you cannot do relational queries across schemas, it might be better to use a key/value store such as DynamoDB. Then you only have to worry about one endpoint, and the sharding is done at the database layer. No middleware to determine which shard to use!
The key tradeoff seems to be the ability to query about all of a particular user's data (sharding by user_id), vs easier balancing and rebalancing of data across shards (sharding by item_id).
I would focus on the question of how you need to store and access your data. If you truly only need access by item_id, then shard by item_id. Avoid splitting your database in ways counterproductive to how you query it.
If you're still unsure, note that you can shard by item_id and then choose to shard by user_id later (you would do this by rebalancing based on user_id and then enforcing new rows only getting written to the shard their user_id belongs to).
Based on your update, it sounds like your primary concerns are not relational queries, but rather scaling writes to this particular pool of data. If that's the case, sharding by item_id allows you the most flexibility to rebalance your data over time, and is less likely to develop hot spots or become unbalanced in the first place. This comes at the price of having to aggregate queries based on user_id across shards, but as long as those "all items for a given user" queries do not need consistency guarantees, you should be fine.
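To show what that trade-off looks like in application code, here is a toy sketch of sharding by item_id. The ShardedItemStore class and its methods are invented for illustration, and plain dictionaries stand in for the separate database servers. A lookup by item_id touches exactly one shard, while "all items for a user" has to ask every shard and merge the results.

```python
import hashlib


class ShardedItemStore:
    """Toy store sharded by item_id; each dict stands in for one database."""

    def __init__(self, shard_count):
        self._shards = [{} for _ in range(shard_count)]

    def _shard_for(self, item_id):
        # Hash the id so placement is stable and roughly evenly spread.
        digest = hashlib.sha256(str(item_id).encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._shards)
        return self._shards[index]

    def put(self, item_id, user_id, payload):
        self._shard_for(item_id)[item_id] = {
            "item_id": item_id,
            "user_id": user_id,
            "payload": payload,
        }

    def get(self, item_id):
        # Single-shard lookup: only item_id is needed.
        return self._shard_for(item_id).get(item_id)

    def items_for_user(self, user_id):
        # Scatter-gather: query every shard, then aggregate in the application.
        found = []
        for shard in self._shards:
            found.extend(row for row in shard.values() if row["user_id"] == user_id)
        return sorted(found, key=lambda row: row["item_id"])


store = ShardedItemStore(shard_count=4)
for item_id in range(1, 11):
    store.put(item_id, "u1" if item_id % 2 == 0 else "u2", f"item {item_id}")

print(store.get(3)["user_id"])
print([row["item_id"] for row in store.items_for_user("u1")])
```

Note that items_for_user sees each shard at a slightly different moment, which is why the per-user view only works well when it does not need consistency guarantees.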
I'm afraid there is no formula that can calculate the answer for all cases. It depends on your data schema and on your system's functional requirements.
If in your system a single item_id has a sensible meaning on its own and your users usually work with data from separate item_ids (as in an Instagram-like service, where item_ids correspond to user photos), I would suggest sharding by item_id, because this choice has a lot of advantages from a technical point of view:
ensures even load across all shards
ensures graceful degradation of your service: when a shard is down users lose access to 1/N of their items, but they can still work with their other items
you do not have to pass user_id to access item_id
There are also some disadvantages with this approach. For example, it will be more difficult to back up all items of a given user.
When a user's items only make sense as a complete series, it is more reasonable to shard by user_id.
I'm designing a user content website (kind of similar to Yelp, but for a different market and with photo sharing) and have a few database questions:
1. Does each user get their own set of tables, or are we storing multiple users' data in common tables? Since this is even a social network, databases are usually partitioned for scalability as the number of users grows. Different sets of users are kept separately, so what is the best approach? I guess some data like user accounts can be in common tables, but for wall posts, photos, etc., will each user get their own table? If so, then with 10 million users, does that mean 10 million times whatever number of tables per user? This is currently being designed in MySQL.
2. How do the user tables know what to create each time a user joins the site? I am assuming there may be a system table template from which the fields are pulled?
3. In addition to the above question, if tomorrow we modify tables or add/remove features, how do we roll the changes down to all the live user accounts/tables? I know that from a page point of view we have the master template, but for the database, how will the user tables be updated? Is that something we do manually, or will each table keep checking, say every 24 hours, with the system tables for updates to its structure?
4. If the above is all true, that means we maintain one master set of tables with system default values, and each user gets the same values copied to their tables? Take a field like the maximum number of failed login attempts before the system locks the account. Say we have a system default of 5 login attempts within 30 minutes, but I want to allow users to specify their own number to customize their own security. Does that mean they can overwrite the system default in their own table?
Thanks.
1. Users should not get their own set of tables. It will most likely not perform as well as one table (properly indexed), and schema changes will have to be deployed to all user tables.
2. You could have default values specified on the table for things that are optional.
3. With difficulty. With one set of tables it will be a lot easier, and probably faster.
4. That sort of data should be stored in a User Preferences table that stores all preferences for all users. Again, don't duplicate the schema for all users.
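A minimal sketch of the shared preferences table, again using Python's sqlite3 so it is runnable as-is. The table and column names (system_settings, user_preferences, max_failed_logins) are assumptions made for this example; the idea is that a NULL preference falls back to the system default, so nothing has to be copied per user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id       INTEGER PRIMARY KEY,
    username TEXT NOT NULL UNIQUE
);

-- One row per system-wide default.
CREATE TABLE system_settings (
    name      TEXT PRIMARY KEY,
    int_value INTEGER NOT NULL
);

-- One shared table for every user's preferences; NULL means "use the default".
CREATE TABLE user_preferences (
    user_id           INTEGER PRIMARY KEY REFERENCES users(id),
    max_failed_logins INTEGER
);

INSERT INTO system_settings (name, int_value) VALUES ('max_failed_logins', 5);
INSERT INTO users (id, username) VALUES (1, 'alice'), (2, 'bob');
INSERT INTO user_preferences (user_id, max_failed_logins) VALUES (1, 3);
""")


def effective_max_failed_logins(user_id):
    """Return the user's own limit if set, otherwise the system default."""
    row = conn.execute(
        """
        SELECT COALESCE(p.max_failed_logins, s.int_value)
          FROM system_settings AS s
          LEFT JOIN user_preferences AS p ON p.user_id = ?
         WHERE s.name = 'max_failed_logins'
        """,
        (user_id,),
    ).fetchone()
    return row[0]


print(effective_max_failed_logins(1))  # alice overrides the default
print(effective_max_failed_logins(2))  # bob has no row, so the default applies
```

With this layout, changing the system default is a single-row update, and adding a new preference is one schema change rather than one per user.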
Generally, creating separate tables for each entity (in this case users) is not a good idea. If each table is separate, querying may be cumbersome.
If your table is large you should optimize the table with indexes. If it gets very large, you also may want to look into partitioning tables.
This allows you to see the table as 1 object, though it is logically split up - the DBMS handles most of the work and presents you with 1 object. This way you SELECT, INSERT, UPDATE, ALTER etc as normal, and the DB figures out which partition the SQL refers to and performs the command.
Not splitting up the tables by users, instead using indexes and partitions, would deal with scalability while maintaining performance. If you don't split up the tables manually, this also makes points 2, 3, and 4 moot.
Here's a link to partitioning tables (SQL Server-specific):
http://databases.about.com/od/sqlserver/a/partitioning.htm
It doesn't make any kind of sense to me to create a set of tables for each user. If you have a common set of tables for all users then I think that avoids all the issues you are asking about.
It sounds like you need to locate a primer on relational database design basics. Regardless of the type of application you are designing, you should start there. Learn how joins work, indices, primary and foreign keys, and so on. Learn about basic database normalization.
It's not customary to create new tables on-the-fly in an application; it's usually unnecessary in a properly designed schema. Usually schema changes are done at deployment time. The only time "users" get their own tables is an artifact of a provisioning decision, wherein each "user" is effectively a tenant in a walled-off garden; this only makes sense if each "user" (more likely, a company or organization) never needs access to anything that other users in the system have stored.
There are mechanisms for dealing with loosely structured types of information in databases, but if you find yourself reaching for this often (the most common method is called Entity-Attribute-Value), your problem is either not quite correctly modeled, or you may not actually need a relational database, in which case you might be better off with a document-oriented database like CouchDB/MongoDB.
Adding, based on your updated comments/notes:
Your concerns about the number of records in a particular table are most likely premature. Get something working first. Most modern DBMSes, including newer versions of MySQL, support mechanisms beyond indices and clustered indices that can help deal with large numbers of records. To wit, in MS SQL Server you can create a partition function on fields on a table; MySQL 5.1+ has a few similar partitioning options based on hash functions, ranges, or other mechanisms. Follow well-established conventions for database design, modeling your domain as sensibly as possible, then adjust when you run into problems. First adjust using the tools available within your choice of database, then consider more drastic measures only when you can prove they are needed. There are other kinds of denormalization that are more likely to make sense before you would even want to consider having something as unidiomatic to database systems as a "table per user" model; even if I were to look at that route, I'd probably consider something like materialized views first.
I agree with the comments above that say that a table per user is a bad idea. Also, while it's a good idea to have strategies in mind now for how you can cope when things get really big, I'd concentrate on getting things right for a small number of users first - if no-one wants to / is able to use your service, then unfortunately you won't be faced with the problem of lots of users.
A common approach among very large sites is database sharding. The summary is: you have N instances of your database in parallel (on separate machines), and each holds 1/N of the total data. There's some shared way of knowing which instance holds a given bit of data. To access some data you have 2 steps, rather than the 1 you might expect:
Work out which shard holds the data
Go to that shard for the data
There are problems with this, such as: you set up e.g. 8 shards and they all fill up, so you want to share the data over e.g. 20 shards -> migrating data between shards.
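Here is a small sketch of the two-step lookup and of why growing from 8 to 20 shards is painful. It assumes the simplest possible "shared way of knowing" (key modulo shard count) and sequential integer ids, both of which are assumptions for this example rather than anything a particular site uses.

```python
def shard_for(key, shard_count):
    """Step 1: work out which shard holds the data (simple modulo placement)."""
    return key % shard_count


def fraction_moved(keys, old_count, new_count):
    """Share of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for key in keys
        if shard_for(key, old_count) != shard_for(key, new_count)
    )
    return moved / len(keys)


user_ids = range(4000)

# Step 2 would be "go to that shard for the data"; here we only print the index.
print(shard_for(1234, 8))

# How much data has to migrate when going from 8 shards to 20?
print(fraction_moved(user_ids, 8, 20))
```

With modulo placement and these ids, 80% of the keys end up on a different shard after the change, so almost all of the data has to be migrated. Schemes such as a lookup directory or consistent hashing exist to reduce that, at the cost of more moving parts.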
I'm looking for strategies for a very large table whose data is maintained for reporting and historical purposes; only a very small subset of that data is used in daily operations.
Background:
We have Visitor and Visits tables which are continuously updated by our consumer facing site. These tables contain information on every visit and visitor, including bots and crawlers, direct traffic that does not result in a conversion, etc.
Our back end site allows management of the visitors (leads) from the front end site. Most of the management occurs on a small subset of our visitors (visitors that become leads). The vast majority of the data in our visitor and visit tables is maintained only for a much smaller subset of user activity (basically reporting type functionality). This is NOT an indexing problem; we have done all we can with indexing and keeping our indexes clean, small, and not fragmented.
ps: We do not currently have the budget or expertise for a data warehouse.
The problem:
We would like the system to be more responsive to our end users when they are querying, for instance, the list of their assigned leads. Currently the query is against a huge data set of mostly irrelevant data.
I am pondering a few ideas. One involves new tables and a fairly major re-architecture; I'm not asking for help on that. The other involves creating redundant data (for instance a Visitor_Archive and a Visitor_Small table), where the larger visitor and visit tables exist for inserts and history/reporting, and the smaller table would exist for managing leads: sending a lead an email, looking up a lead's phone number, getting my list of leads, etc.
The reason I am reaching out is that I would love opinions on the best way to keep the Visitor_Archive and the Visitor_Small tables in sync...
Replication? Can I use replication to replicate only data with a certain column value (FooID = x)?
Any other strategies?
It sounds like your table is a perfect candidate for partitioning. Since you didn't mention it, I'll briefly describe it, and give you some links, in case you're not aware of it.
Partitioning lets you divide the rows of a table/index across multiple physical or logical devices, and is specifically meant to improve performance on data sets where you may only need a known subset of the data to work with at any time. Partitioning a table still allows you to interact with it as one table (you don't need to reference partitions or anything in your queries), but SQL Server is able to perform several optimizations on queries that only involve one partition of the data. In fact, in Designing Partitions to Manage Subsets of Data, the AdventureWorks examples pretty much match your exact scenario.
I would do a bit of research, starting here and working your way down: Partitioned Tables and Indexes.
Simple solution: create a separate, denormalized table with all the fields in it. Create a stored procedure that will update this table on your schedule. Create a SQL Agent job to call the SP.
Index the table as you see how it's queried.
If you need to purge history, create another table to hold it and another SP to populate it and clean the main report table.
You may end up with multiple report tables - it's OK - space is cheap these days.
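The shape of that scheduled refresh, sketched in Python with sqlite3 so it can be run as-is. The Python function stands in for the stored procedure, and whatever scheduler you have (SQL Agent, cron, and so on) would call it. The visitor/visitor_small columns and the is_lead flag are assumptions for the example, not the asker's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Large table that takes every insert (bots, crawlers, non-converting visits).
CREATE TABLE visitor (
    id      INTEGER PRIMARY KEY,
    name    TEXT,
    email   TEXT,
    is_lead INTEGER NOT NULL DEFAULT 0
);

-- Small table that the lead-management screens query.
CREATE TABLE visitor_small (
    id    INTEGER PRIMARY KEY,
    name  TEXT,
    email TEXT
);
""")


def refresh_visitor_small(connection):
    """Rebuild the small table from the large one in a single transaction."""
    with connection:
        connection.execute("DELETE FROM visitor_small")
        connection.execute(
            "INSERT INTO visitor_small (id, name, email) "
            "SELECT id, name, email FROM visitor WHERE is_lead = 1"
        )


conn.executemany(
    "INSERT INTO visitor (name, email, is_lead) VALUES (?, ?, ?)",
    [
        ("crawler-bot", None, 0),
        ("Ann", "ann@example.com", 1),
        ("Raj", "raj@example.com", 0),
    ],
)
conn.commit()

refresh_visitor_small(conn)
print(conn.execute("SELECT COUNT(*) FROM visitor_small").fetchone()[0])
```

A full rebuild like this is the simplest thing that works; if the small table grows or needs to be fresher than the schedule allows, an incremental update keyed on a modified timestamp would be the next step to consider.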