Sharding database by user_id vs by entity_id

My current employer has a huge table of items. Each item has user_id and item_id properties. To improve performance and high availability, my team decided to shard the table.
We are discussing two strategies:
Shard by item_id
In terms of high availability, if a shard is down then all users temporarily lose 1/N of their items. Performance will be even across all shards (random distribution).
Shard by user_id
If a shard is down then 1/N of users won't be able to access their items. Performance might not be even, because we have users with thousands of items as well as users with just one item. Also, there is a big disadvantage: now we need to pass both user_id and item_id in order to access an item.
So my question is: which one should we choose? Maybe you can guide me to some mathematical formula to decide which one is better in different circumstances?
P.S. We already have replicas but they are becoming useless for our write throughput
UPDATE
We have SERP pages where we need to get items by their ids, as well as pages like the user profile where a user wants to see his/her items. The first pattern is used far more frequently than the second.
We can easily give up ACID transactions because we've started to build microservices (so eventually almost all big entities will be encapsulated in a specific microservice).

I see a couple of ways to attack this:
How do you intend to shard? Separate master servers, or separate schemas served by the same server but backed by different storage backends?
How do you access this data? Is it basically key/value? Do you need to query all of a user's items at once? How transactional do your CRUD operations need to be?
Do you foresee unbalanced shards being a problem, based on the data you're storing?
Do you need to do relational queries of this data against other data
in your system?
Trade-offs
If you split shards across server/database instance boundaries, sharding by item_id means you will not be able to do a single query for info about a single user_id... you will need to query every shard and then aggregate the results at the application level. I find the aggregation has a lot more pitfalls than you'd think... better to keep this in the database.
If you can use a single database instance, sharding by creating tables/schemas that are backed by different storage subsystems would allow you to scale writes while still being able to do relational queries across them. All of your eggs are still in one server basket with this method, though.
If you shard by user_id, and you want to rebalance your shards by moving a user to another shard, you will need to atomically move all of the user's rows at once. This can be difficult if there are lots of rows. If you shard by item_id, you can move one item at a time. This allows you to incrementally rebalance your shards, which is awesome.
If you intend to split these into separate servers such that you cannot do relational queries across schemas, it might be better to use a key/value store such as DynamoDB. Then you only have to worry about one endpoint, and the sharding is done at the database layer. No middleware to determine which shard to use!
The key tradeoff seems to be the ability to query all of a particular user's data in one place (sharding by user_id) versus easier balancing and rebalancing of data across shards (sharding by item_id).
I would focus on the question of how you need to store and access your data. If you truly only need access by item_id, then shard by item_id. Avoid splitting your database in ways counterproductive to how you query it.
If you're still unsure, note that you can shard by item_id and then choose to shard by user_id later (you would do this by rebalancing based on user_id and then enforcing that new rows are only written to the shard their user_id belongs to).
Based on your update, it sounds like your primary concerns are not relational queries, but rather scaling writes to this particular pool of data. If that's the case, sharding by item_id allows you the most flexibility to rebalance your data over time, and is less likely to develop hot spots or become unbalanced in the first place. This comes at the price of having to aggregate queries based on user_id across shards, but as long as those "all items for a given user" queries do not need consistency guarantees, you should be fine.
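To make that trade-off concrete, here is a minimal sketch of item_id-based routing (my own illustration, not part of the answer above; in-memory dicts stand in for real shard connections, and the function names are made up): a point lookup touches exactly one shard, while "all items for a given user" has to scatter-gather across every shard and merge the results at the application level. If those per-user reads are rare, as your update suggests, the scatter-gather cost is usually acceptable.

import hashlib
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 4
# Stand-ins for real shard connections; in practice each entry would be a
# connection pool to a separate database server.
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(item_id):
    # Use a stable hash (not Python's built-in hash(), which is randomized
    # per process) so an item always maps to the same shard.
    digest = hashlib.md5(item_id.encode()).hexdigest()
    return shards[int(digest, 16) % NUM_SHARDS]

def put_item(item_id, user_id, payload):
    shard_for(item_id)[item_id] = {"user_id": user_id, **payload}

def get_item(item_id):
    # Single-shard lookup: item_id alone is enough to find the row.
    return shard_for(item_id).get(item_id)

def get_items_for_user(user_id):
    # Scatter-gather: every shard must be queried, then the results are
    # merged at the application level.
    def scan(shard):
        return [row for row in shard.values() if row["user_id"] == user_id]
    with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
        parts = pool.map(scan, shards)
    return [row for part in parts for row in part]

put_item("i1", "u42", {"title": "first"})
put_item("i2", "u42", {"title": "second"})
print(get_item("i1"))              # one shard touched
print(get_items_for_user("u42"))   # all shards touched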

I'm afraid there is no formula that can calculate the answer for all cases. It depends on your data schema and on your system's functional requirements.
If, in your system, an individual item_id has sensible meaning on its own and your users usually work with individual items (like an Instagram-style service where item_ids correspond to user photos), I would suggest sharding by item_id, because this choice has a lot of advantages from a technical point of view:
ensures even load across all shards
ensures graceful degradation of your service: when a shard is down, users lose access to 1/N of their items but can keep working with the others
you do not have to pass user_id to access an item
There are also some disadvantages with this approach. For example, it will be more difficult to backup all items of a given user.
When only a user's complete set of items has sensible meaning, it is more reasonable to shard by user_id.

Related

"Grouping" Data in a MongoDB cluster

For instance, assume I have a MongoDB database that stores a number of schools, and a number of teachers and students in those schools. Instead of having each school be its own collection in the database, I have collections of Schools, Teachers, and Students, and in the documents under Students and Teachers I have a reference to the respective school in the Schools collection. However, is there a way to somehow logically/physically group the data such that Teacher and Student documents are grouped under their respective School documents?
As of now, I have three different collections: Schools, Teachers, and Students. Let's say I want all students that attend StackOverflow Academy; I'd do something like:
Students.find({school: "stackOverFlowAcademy_ID"})
But as the database grows in size, I assume this approach won't stay as efficient and quick as it is with a small database.
Is my current approach enough, or is there a more efficient way to do this?
EDIT:
MongoDB docs state that if you're using MongoDB Atlas (which I am), sharding and other effective "grouping" of data is handled automatically on their end, so there's no need to implement sharding or replica sets yourself if you're using Atlas.
This is a wide topic; I'll put down a few things I'm aware of:
Replica sets: A replica set is a group of mongod instances that host the same data set. When you create a MongoDB deployment through MongoDB Atlas, what you get is a cluster of three nodes, which is nothing but three mongod instances whose primary purpose is high availability. As I said, having a replica set has essentially nothing to do with your data structure. A replica set will usually have 1 primary node and 2 secondaries (which can serve read requests); if the primary goes down, one of its secondaries becomes primary and serves requests until the old primary is back, at which point the data is synced (all of this is taken care of by MongoDB Atlas; the usual median downtime is about 12 seconds).
Sharding: As far as I know, sharding comes into play when your database size grows beyond roughly 2 TB or 4 TB (please verify this); at that point it is the better option, i.e. horizontal scaling rather than adding more RAM and disk to your DB. You add more servers; in a word, sharding is nothing but a bunch of replica sets called shards, plus config servers, managed by mongos, but there is a lot to understand in depth before implementing it.
Going back: yes, having a reference key between multiple collections is also an option, and with the introduction of aggregation, particularly $lookup and $graphLookup, you can do most of your mappings. Remember to maintain good index keys for better querying. All in all, you need to analyze your application's data before you start. Try using the query analyzer (explain) in MongoDB to check stats about each query's performance.
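As a small illustration of the index + $lookup advice above (a hedged pymongo sketch; the local mongod instance, the database name, and the school "name" field are assumptions, while the collection and reference field names come from your question):

from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local instance
db = client["school_db"]                            # hypothetical database name

# An index on the reference field keeps Students.find({"school": ...}) fast
# even as the collection grows.
db.Students.create_index([("school", ASCENDING)])

# $lookup joins each school to its students without duplicating data.
pipeline = [
    {"$match": {"name": "StackOverflow Academy"}},
    {"$lookup": {
        "from": "Students",
        "localField": "_id",
        "foreignField": "school",
        "as": "students",
    }},
]
for school in db.Schools.aggregate(pipeline):
    print(school["name"], len(school["students"]))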
Example:
As MongoDB is denormalized, you can definitely consider having embedded documents, but you need to know when to use them and when not to.
Let's say you're dealing with a social media website. You'd have a users collection where you store a bunch of users with their related information (phone number, height, dob, email), and you can embed a document of addresses (1 or 2), which usually won't change often. The list of friends, however, has to be stored in a different collection, as it needs much more maintenance and can be accessed individually; this also keeps your User JSON leaner, holding less but more important data. It's all about your data relationships (1-to-many or 1-to-n) and your querying needs.
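For example, a hedged sketch of the two shapes just described (the field names are invented for illustration): addresses are embedded because they are small and stable, while friendships live in their own collection and reference the user by id.

# Embedded: small, stable data (addresses) lives inside the user document.
user = {
    "_id": "u1",
    "name": "Alice",
    "email": "alice@example.com",
    "addresses": [
        {"type": "home", "city": "Oslo"},
        {"type": "work", "city": "Bergen"},
    ],
}

# Referenced: a large, frequently changing friend list goes in its own
# collection and points back to the user by id, so it can be queried and
# updated independently of the user document.
friendships = [
    {"user_id": "u1", "friend_id": "u2", "since": "2021-04-01"},
    {"user_id": "u1", "friend_id": "u3", "since": "2023-09-15"},
]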
Check these links :
MongoDB courses are free and a great way to learn; they are offered directly by MongoDB University.
mongoDB Courses
In Mongo what is the difference between sharding and replication?

How does data denormalization work with the Microservice Pattern?

I just read an article on Microservices and PaaS Architecture. In that article, about a third of the way down, the author states (under Denormalize like Crazy):
Refactor database schemas, and de-normalize everything, to allow complete separation and partitioning of data. That is, do not use underlying tables that serve multiple microservices. There should be no sharing of underlying tables that span multiple microservices, and no sharing of data. Instead, if several services need access to the same data, it should be shared via a service API (such as a published REST or a message service interface).
While this sounds great in theory, in practice it has some serious hurdles to overcome, the biggest of which is that databases are often tightly coupled and every table has some foreign key relationship with at least one other table. Because of this, it could be impossible to partition a database into n sub-databases controlled by n microservices.
So I ask: Given a database that consists entirely of related tables, how does one denormalize this into smaller fragments (groups of tables) so that the fragments can be controlled by separate microservices?
For instance, given the following (rather small, but exemplar) database:
[users] table
=============
user_id
user_first_name
user_last_name
user_email
[products] table
================
product_id
product_name
product_description
product_unit_price
[orders] table
==============
order_id
order_datetime
user_id
[products_x_orders] table (for line items in the order)
=======================================================
products_x_orders_id
product_id
order_id
quantity_ordered
Don't spend too much time critiquing my design, I did this on the fly. The point is that, to me, it makes logical sense to split this database into 3 microservices:
UserService - for CRUDding users in the system; should ultimately manage the [users] table; and
ProductService - for CRUDding products in the system; should ultimately manage the [products] table; and
OrderService - for CRUDding orders in the system; should ultimately manage the [orders] and [products_x_orders] tables
However all of these tables have foreign key relationships with each other. If we denormalize them and treat them as monoliths, they lose all their semantic meaning:
[users] table
=============
user_id
user_first_name
user_last_name
user_email
[products] table
================
product_id
product_name
product_description
product_unit_price
[orders] table
==============
order_id
order_datetime
[products_x_orders] table (for line items in the order)
=======================================================
products_x_orders_id
quantity_ordered
Now there's no way to know who ordered what, in which quantity, or when.
So is this article typical academic hullabaloo, or is there a real world practicality to this denormalization approach, and if so, what does it look like (bonus points for using my example in the answer)?
This is subjective but the following solution worked for me, my team, and our DB team.
At the application layer, Microservices are decomposed to semantic function.
e.g. a Contact service might CRUD contacts (metadata about contacts: names, phone numbers, contact info, etc.)
e.g. a User service might CRUD users with login credentials, authorization roles, etc.
e.g. a Payment service might CRUD payments and work under the hood with a 3rd party PCI compliant service like Stripe, etc.
At the DB layer, the tables can be organized however the devs/DBAs/devops people want them organized.
The problem is with cascading and service boundaries: Payments might need a User to know who is making a payment. Instead of modeling your services like this:
interface PaymentService {
    PaymentInfo makePayment(User user, Payment payment);
}
Model it like so:
interface PaymentService {
    PaymentInfo makePayment(Long userId, Payment payment);
}
This way, entities that belong to other microservices are referenced inside a particular service by ID only, not by object reference. This allows DB tables to have foreign keys all over the place, but at the app layer "foreign" entities (that is, entities living in other services) are available via ID. This stops object cascading from growing out of control and cleanly delineates service boundaries.
The problem it does incur is that it requires more network calls. For instance, if I gave each Payment entity a User reference, I could get the user for a particular payment with a single call:
User user = paymentService.getUserForPayment(payment);
But using what I'm suggesting here, you'll need two calls:
Long userId = paymentService.getPayment(payment).getUserId();
User user = userService.getUserById(userId);
This may be a deal breaker. But if you're smart, implement caching, and build well-engineered microservices that respond in 50-100 ms per call, I have no doubt that these extra network calls can be crafted so they don't add noticeable latency to the application.
It is indeed one of the key problems in microservices, and it is quite conveniently omitted in most articles. Fortunately, there are solutions for this. As a basis for discussion, let's take the tables you provided in the question.
The image above shows how the tables look in a monolith: just a few tables with joins.
To refactor this into microservices we can use a few strategies:
API Join
In this strategy, foreign keys between microservices are broken, and each microservice exposes an endpoint that mimics the key. For example, the Product microservice will expose a findProductById endpoint, and the Order microservice can use this endpoint instead of a join.
It has an obvious downside. It is slower.
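A rough sketch of what the API join looks like from the Order service's side (the base URL and route are hypothetical; only the findProductById idea is taken from the description above):

import requests

PRODUCT_SERVICE = "http://product-service/api"   # hypothetical base URL

def find_product_by_id(product_id):
    # This endpoint stands in for the old orders -> products foreign key.
    resp = requests.get(f"{PRODUCT_SERVICE}/products/{product_id}", timeout=2)
    resp.raise_for_status()
    return resp.json()

def enrich_order_lines(order_lines):
    # "API join": instead of JOIN products ON ..., call the Product service
    # once per distinct product_id referenced by the order lines.
    product_ids = {line["product_id"] for line in order_lines}
    products = {pid: find_product_by_id(pid) for pid in product_ids}
    return [dict(line, product=products[line["product_id"]]) for line in order_lines]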
Read-only views
In the second solution you can create a copy of the table in the second database. The copy is read-only. Each microservice can perform mutating operations on its own read/write tables; on the read-only tables copied from other databases it can (obviously) only perform reads.
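One hedged way to keep such a read-only copy up to date is to apply change events from the owning service into a local table. Here is a minimal sketch using SQLite (via Python's sqlite3) as a stand-in for the second database, with the replication mechanism itself left out; the column names follow the question's schema.

import sqlite3

# The Order service's local, read-only copy of product data.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE products_replica (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT,
    product_unit_price REAL)""")

def apply_product_event(event):
    # Called by whatever replication mechanism you choose (message bus,
    # change data capture, scheduled export, ...).
    conn.execute(
        "INSERT INTO products_replica (product_id, product_name, product_unit_price) "
        "VALUES (:product_id, :product_name, :product_unit_price) "
        "ON CONFLICT(product_id) DO UPDATE SET "
        "product_name = excluded.product_name, "
        "product_unit_price = excluded.product_unit_price",
        event,
    )

apply_product_event({"product_id": 1, "product_name": "Widget", "product_unit_price": 9.99})
# The Order service only ever reads from this table; all writes to product
# data happen in the Product service's own database.
print(conn.execute("SELECT product_name FROM products_replica WHERE product_id = 1").fetchone())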
High-performance read
It is possible to achieve high-performance reads by introducing solutions such as Redis/Memcached on top of the read-only view solution. Both sides of the join should be copied into a flat structure optimized for reading. You can introduce a completely new stateless microservice that is used just for reading from this storage. While it seems like a lot of hassle, it is worth noting that it will have higher performance than a monolithic solution on top of a relational database.
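A hedged sketch of that flat, read-optimized structure, here stored in Redis via redis-py (a running Redis instance is assumed and the view layout is invented, reusing the column names from the question):

import json
import redis

r = redis.Redis(host="localhost", port=6379)   # assumes a local Redis instance

def store_order_view(order, user, lines):
    # Pre-join everything the read side needs into one flat document, so a
    # page render becomes a single key lookup instead of joins or REST calls.
    view = {
        "order_id": order["order_id"],
        "ordered_at": order["order_datetime"],
        "user_name": user["user_first_name"] + " " + user["user_last_name"],
        "lines": [
            {"product_name": l["product_name"], "quantity": l["quantity_ordered"]}
            for l in lines
        ],
    }
    r.set("order-view:%s" % order["order_id"], json.dumps(view))

def get_order_view(order_id):
    raw = r.get("order-view:%s" % order_id)
    return json.loads(raw) if raw else None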
There are a few possible solutions. The ones that are simplest to implement have the lowest performance; high-performance solutions will take a few weeks to implement.
I realise this is possibly not a good answer but what the heck. Your question was:
Given a database that consists entirely of related tables, how does
one denormalize this into smaller fragments (groups of tables)
WRT the database design I'd say "you can't without removing foreign keys".
That is, people pushing microservices with the strict no-shared-DB rule are asking database designers to give up foreign keys (and they are doing that implicitly or explicitly). When they don't explicitly state the loss of FKs, it makes you wonder if they actually know and recognise the value of foreign keys (because it is frequently not mentioned at all).
I have seen big systems broken into groups of tables. In these cases there can be either A) no FKs allowed between the groups, or B) one special group that holds "core" tables that can be referenced by FKs from tables in other groups.
... but in these systems "groups of tables" is often 50+ tables so not small enough for strict compliance with microservices.
To me, the other related issue to consider with the microservice approach to splitting the DB is the impact this has on reporting: the question of how all the data is brought together for reporting and/or loading into a data warehouse.
Somewhat related is also the tendency to ignore built-in DB replication features in favor of messaging (and how DB-based replication of the core tables / DDD shared kernel impacts the design).
EDIT: (the cost of JOIN via REST calls)
When we split up the DB as suggested by microservices and remove FK's we not only lose the enforced declarative business rule (of the FK) but we also lose the ability for the DB to perform the join(s) across those boundaries.
In OLTP, FK values are generally not "UX friendly", and we often want to join on them.
In the example, if we fetch the last 100 orders we probably don't want to show the customer id values in the UX; instead we need to make a second call to the customer service to get their names. And if we also want the order lines, we need to make yet another call to the products service to show the product name, SKU, etc. rather than the product id.
In general we can find that when we break up the DB design in this way we need to do a lot of "JOIN via REST" calls. So what is the relative cost of doing this?
Actual Story: Example costs for 'JOIN via REST' vs DB Joins
There are 4 microservices and they involve a lot of "JOIN via REST". A benchmark load for these 4 services comes to ~15 minutes. Those 4 microservices converted into 1 service with 4 modules against a shared DB (that allows joins) executes the same load in ~20 seconds.
This unfortunately is not a direct apples to apples comparison for DB joins vs "JOIN via REST" as in this case we also changed from a NoSQL DB to Postgres.
Is it a surprise that "JOIN via REST" performs relatively poorly when compared to a DB that has a cost-based optimiser, etc.?
To some extent, when we break up the DB like this we are also walking away from the cost-based optimiser and all that it does with query execution planning for us, in favor of writing our own join logic (we are, in effect, writing our own relatively unsophisticated query execution plans).
I would see each microservice as an object, and just as with any ORM, where you use those objects to pull the data and then create joins within your code and query collections, microservices should be handled in a similar manner. The only difference here is that each microservice represents one object at a time rather than a complete object tree. An API layer should consume these services and model the data in whatever way it has to be presented or stored.
Making several calls back to services for each transaction will not have an impact, as each service runs in a separate container and all these calls can be executed in parallel.
#ccit-spence, I liked the approach of intersection services, but how can it be designed and consumed by other services? I believe it will create a kind of dependency for other services.
Any comments please?

A few database design questions relating to a user content site

I'm designing a user content website (kind of similar to Yelp but for a different market and with photo sharing) and had a few database questions:
1. Does each user get their own set of tables, or are we storing all users' data in common tables? Since this is also a social network, databases are usually partitioned off for scalability as the user base grows, and different sets of users are served separately, so what is the best approach? I guess some data like user accounts can live in common tables, but for wall posts, photos, etc. each user would get their own table? If so, and we have 10 million users, does that mean 10 million times however many tables per user? This is currently being designed in MySQL.
2. How do the user tables know what to create each time a user joins the site? I am assuming there may be a system table template from which the fields are pulled?
3. In addition to the above question, if tomorrow we modify tables or add/remove features, how do the changes get rolled down to all the live user accounts/tables? I know that from a page point of view we have the master template, but for the database, how will the user tables be updated? Is that something we do manually, or will each table keep checking, say every 24 hours, against the system tables for updates to its structure?
4. If the above is all true, that means we are maintaining one master set of tables with system default values, and each user gets the same values copied to their own tables? Take a field like the maximum failed login attempts before the system locks an account: say we have a system default of 5 attempts within 30 minutes, but I want to allow users to specify their own number to customize their own security. Does that mean they can override the system default in their own table?
Thanks.
1. Users should not get their own set of tables. It will most likely not perform as well as one table (properly indexed), and schema changes would have to be deployed to every user's tables.
2. You could have default values specified on the table for things that are optional.
3. With difficulty. With one set of tables it will be a lot easier, and probably faster.
4. That sort of data should be stored in a User Preferences table that stores all preferences for all users. Again, don't duplicate the schema for all users (see the small sketch below).
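As a small sketch of how the override works with a single shared preferences store (a dict stands in for the table, the names are made up, and the 5-attempts default comes from the question):

# System-wide defaults live in one place.
SYSTEM_DEFAULTS = {"max_failed_logins": 5, "lockout_window_minutes": 30}

# (user_id, preference_name) -> value; in a real schema this is one
# user_preferences table with a composite key on those two columns.
user_preferences = {
    ("u123", "max_failed_logins"): 3,
}

def get_preference(user_id, name):
    # A user's own setting overrides the system default when present.
    return user_preferences.get((user_id, name), SYSTEM_DEFAULTS[name])

print(get_preference("u123", "max_failed_logins"))   # 3 (user override)
print(get_preference("u999", "max_failed_logins"))   # 5 (system default)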
Generally, creating a separate set of tables for each entity (in this case each user) is not a good idea. If each table is separate, querying becomes cumbersome.
If your table is large you should optimize the table with indexes. If it gets very large, you also may want to look into partitioning tables.
This allows you to see the table as 1 object, though it is logically split up - the DBMS handles most of the work and presents you with 1 object. This way you SELECT, INSERT, UPDATE, ALTER etc as normal, and the DB figures out which partition the SQL refers to and performs the command.
Not splitting up the tables by user, and instead using indexes and partitions, deals with scalability while maintaining performance. If you don't split up the tables manually, this also makes points 2, 3, and 4 moot.
Here's a link to partitioning tables (SQL Server-specific):
http://databases.about.com/od/sqlserver/a/partitioning.htm
It doesn't make any kind of sense to me to create a set of tables for each user. If you have a common set of tables for all users then I think that avoids all the issues you are asking about.
It sounds like you need to locate a primer on relational database design basics. Regardless of the type of application you are designing, you should start there. Learn how joins work, indices, primary and foreign keys, and so on. Learn about basic database normalization.
It's not customary to create new tables on-the-fly in an application; it's usually unnecessary in a properly designed schema. Usually schema changes are done at deployment time. The only time "users" get their own tables is an artifact of a provisioning decision, wherein each "user" is effectively a tenant in a walled-off garden; this only makes sense if each "user" (more likely, a company or organization) never needs access to anything that other users in the system have stored.
There are mechanisms for dealing with loosely structured types of information in databases, but if you find yourself reaching for this often (the most common method is called Entity-Attribute-Value), your problem is either not quite correctly modeled, or you may not actually need a relational database, in which case you might be better off with a document-oriented database like CouchDB/MongoDB.
Adding, based on your updated comments/notes:
Your concerns about the number of records in a particular table are most likely premature. Get something working first. Most modern DBMSes, including newer versions of MySql, support mechanisms beyond indices and clustered indices that can help deal with large numbers of records. To wit, in MS Sql Server you can create a partition function on fields on a table; MySql 5.1+ has a few similar partitioning options based on hash functions, ranges, or other mechanisms. Follow well-established conventions for database design modeling your domain as sensibly as possible, then adjust when you run into problems. First adjust using the tools available within your choice of database, then consider more drastic measures only when you can prove they are needed. There are other kinds of denormalization that are more likely to make sense before you would even want to consider having something as unidiomatic to database systems as a "table per user" model; even if I were to look at that route, I'd probably consider something like materialized views first.
I agree with the comments above that say that a table per user is a bad idea. Also, while it's a good idea to have strategies in mind now for how you can cope when things get really big, I'd concentrate on getting things right for a small number of users first - if no-one wants to / is able to use your service, then unfortunately you won't be faced with the problem of lots of users.
A common approach among very large sites is database sharding. The summary is: you have N instances of your database in parallel (on separate machines), and each holds 1/N of the total data. There's some shared way of knowing which instance holds a given bit of data. To access some data you have 2 steps, rather than the 1 you might expect:
Work out which shard holds the data
Go to that shard for the data
There are problems with this, such as: you set up e.g. 8 shards and they all fill up, so you want to spread the data over e.g. 20 shards, which means migrating data between shards.
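To illustrate both the two-step lookup and that last problem (a hedged sketch of my own, not tied to any particular database): with plain modulo hashing, resizing the shard count forces most keys onto a different shard, which is exactly the migration headache described above.

import hashlib

def shard_index(key, num_shards):
    # Step 1: work out which shard holds the data, from a stable hash of the key.
    # (Step 2, not shown, is connecting to that shard and running the query.)
    digest = hashlib.sha1(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

keys = ["user-%d" % i for i in range(10000)]
moved = sum(1 for k in keys if shard_index(k, 8) != shard_index(k, 20))
print("%.0f%% of keys change shard when going from 8 to 20 shards" % (100.0 * moved / len(keys)))
# Schemes like consistent hashing or a directory/lookup table exist largely
# to shrink that migration cost.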

Database design: Running totals of row counts

I have run into the following situation several times, and was wondering what best practices say about this situation:
Rows are inserted into a table as users complete some action. For example, every time a user visits a specific portion of a website, a row is inserted indicating their IP address, username, and referring URL. Elsewhere, I want to show summary information about those actions. In our example, I'd want to allow administrators to log onto the website and see how many visits there are for a specific user.
The most natural way to do this (IMO) is to insert a row for every visit and, every time the administrator requests totals, count up the number of rows in the appropriate table for that user. However, in situations like these there can be thousands and thousands of rows per user. If administrators frequently request the totals, constantly requesting the counts could create quite a load on the database. So it seems like the right solution is to insert individual rows but simultaneously keep some kind of summary data with running totals as data is inserted (to avoid recalculating those totals over and over).
What are the best practices or most common database schema design for this situation? You can ignore the specific example I made up, my real question is how to handle cases like this dealing with high-volume data and frequently-requested totals or counts of that data.
Here are a couple of practices; the one you select will depend upon your specific situation:
Trust your database engine: Many database engines will automatically cache the query plans (and results) of frequently used queries. Even if the underlying data has changed, the query plan itself will remain the same. Appropriate parts of indexes will be kept in main memory, making rerunning a given query almost free. The most you may need to do in this case is tune the database's parameters.
Denormalize your database: While 3rd Normal Form (3NF) is still considered the appropriate database design, for performance reasons it can become necessary to add additional tables that include summary values that would normally be calculated as necessary via a SELECT ... GROUP BY ... query. Frequently these other tables are kept up to date by the use of triggers, stored procedures, or background processes (a small sketch of the trigger approach follows below). See Wikipedia for more about Denormalization.
Data warehousing: With a data warehouse, the goal is to push copies of live data to secondary databases (warehouses) for query and special reporting purposes. This is usually done with background processes using whatever replication techniques your database supports. These warehouses are frequently indexed more rigorously than may otherwise be needed for your base application, with the intent to support large queries with massive amounts of historical data.
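For the denormalization option, here is a minimal hedged sketch of a trigger-maintained summary table. It uses SQLite through Python's sqlite3 purely so it runs anywhere; the table and column names are invented around the visit example, and your own engine's trigger syntax will differ.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visits (
    visit_id INTEGER PRIMARY KEY,
    username TEXT NOT NULL,
    ip_address TEXT,
    referrer TEXT
);
CREATE TABLE visit_totals (
    username TEXT PRIMARY KEY,
    total INTEGER NOT NULL
);
-- Keep the summary row in step with every insert, so the admin page reads
-- one row instead of counting thousands.
CREATE TRIGGER visits_after_insert AFTER INSERT ON visits
BEGIN
    INSERT OR IGNORE INTO visit_totals (username, total) VALUES (NEW.username, 0);
    UPDATE visit_totals SET total = total + 1 WHERE username = NEW.username;
END;
""")

for _ in range(3):
    conn.execute(
        "INSERT INTO visits (username, ip_address, referrer) VALUES (?, ?, ?)",
        ("alice", "203.0.113.7", "https://example.com"),
    )
conn.commit()
print(conn.execute("SELECT total FROM visit_totals WHERE username = 'alice'").fetchone())  # (3,)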

DB Strategy for inserting into a high read table (Sql Server)

Looking for strategies for a very large table whose data is maintained for reporting and historical purposes; only a very small subset of that data is used in daily operations.
Background:
We have Visitor and Visits tables which are continuously updated by our consumer facing site. These tables contain information on every visit and visitor, including bots and crawlers, direct traffic that does not result in a conversion, etc.
Our back-end site allows management of the visitors (leads) from the front-end site. Most of the management occurs on a small subset of our visitors (visitors that become leads). The vast majority of the data in our visitor and visit tables is maintained only for a much smaller subset of user activity (basically reporting-type functionality). This is NOT an indexing problem; we have done all we can with indexing and keeping our indexes clean, small, and not fragmented.
ps: We do not currently have the budget or expertise for a data warehouse.
The problem:
We would like the system to be more responsive to our end users when they are querying, for instance, the list of their assigned leads. Currently the query is against a huge data set of mostly irrelevant data.
I am pondering a few ideas. One involves new tables and a fairly major re-architecture; I'm not asking for help on that. The other involves creating redundant data (for instance a Visitor_Archive and a Visitor_Small table), where the larger visitor and visit tables exist for inserts and history/reporting, and the smaller Visitor_Small table would exist for managing leads: sending a lead an email, getting a lead's phone number, getting my list of leads, etc.
The reason I am reaching out is that I would love opinions on the best way to keep the Visitor_Archive and the Visitor_Small tables in sync...
Replication? Can I use replication to replicate only data with a certain column value (FooID = x)?
Any other strategies?
It sounds like your table is a perfect candidate for partitioning. Since you didn't mention it, I'll briefly describe it, and give you some links, in case you're not aware of it.
Partitioning divides the rows of a table/index across multiple physical or logical devices, and it is specifically meant to improve performance on data sets where you only need to work with a known subset of the data at any given time. Partitioning a table still allows you to interact with it as one table (you don't need to reference partitions or anything in your queries), but SQL Server is able to perform several optimizations on queries that only involve one partition of the data. In fact, in Designing Partitions to Manage Subsets of Data, the AdventureWorks examples pretty much match your exact scenario.
I would do a bit of research, starting here and working your way down: Partitioned Tables and Indexes.
Simple solution: create a separate, de-normalized table with all the needed fields in it. Create a stored procedure that updates this table on your schedule, and create a SQL Agent job to call the SP.
Index the table as you see how it's queried.
If you need to purge history, create another table to hold it, and another SP to populate it and clean up the main report table.
You may end up with multiple report tables - it's OK - space is cheap these days.
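As a rough illustration of that scheduled refresh (SQLite through Python's sqlite3 stands in for the stored procedure and SQL Agent job; the column names and the is_lead flag are invented):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Visitor (visitor_id INTEGER PRIMARY KEY, email TEXT, is_lead INTEGER);
CREATE TABLE Visitor_Small (visitor_id INTEGER PRIMARY KEY, email TEXT);
""")

def refresh_visitor_small():
    # The body of the scheduled job: rebuild the small operational table from
    # the large one, keeping only the rows back-end users actually work with.
    with conn:
        conn.execute("DELETE FROM Visitor_Small")
        conn.execute(
            "INSERT INTO Visitor_Small (visitor_id, email) "
            "SELECT visitor_id, email FROM Visitor WHERE is_lead = 1"
        )

conn.execute("INSERT INTO Visitor VALUES (1, 'bot@crawler.example', 0), (2, 'lead@example.com', 1)")
refresh_visitor_small()
print(conn.execute("SELECT * FROM Visitor_Small").fetchall())   # [(2, 'lead@example.com')]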
