"Grouping" Data in a MongoDB cluster - database

For instance, assume I have a MongoDB database that stores a number of schools, along with the teachers and students in those schools. Instead of having each school be its own collection in the database, I have three collections (Schools, Teachers, and Students), and the documents under Students and Teachers hold a reference to the respective school in the Schools collection. However, is there a way to logically or physically group the data so that Teacher and Student documents are grouped under their respective School documents?
As of now, I have three different collections (Schools, Teachers, Students), and let's say I want all students that attend StackOverflow Academy; I'd do something like:
Students.find({school: "stackOverFlowAcademy_ID"})
But as the database grows, I assume this wouldn't be as efficient and quick as it is with a small database.
Is my current approach enough, or is there a more efficient way to do this?
EDIT:
MongoDB docs state that if you're using MongoDB Atlas (which I am), sharding and other effective "grouping" of data is handled automatically on their end, so there is no need to implement sharding or replica sets yourself if you're using Atlas.

This is a wide topic, so I'll share a few things I'm aware of:
Replica sets: A replica set is a group of mongod instances that host the same data set. When you create a MongoDB deployment through MongoDB Atlas, you get a cluster with three nodes, which are simply three mongod instances; their primary purpose is high availability. A replica set most likely has nothing to do with your data structure. A replica set usually has one primary node and two secondaries (which can serve read requests). If the primary goes down, one of its secondaries becomes primary and serves requests until the original primary is back on; once it's back, data is synced (MongoDB Atlas takes care of all of this; the usual median downtime is 12 seconds).
Sharding: As far as I know, sharding becomes the better option once your database grows past 2 TB or 4 TB (please check this figure). It is horizontal scaling: rather than increasing the RAM and disk size of your database server, you add more servers. In a word, a sharded cluster is a bunch of replica sets called shards, plus config servers, with mongos routing requests to them; there is a lot more to know in depth before implementing it.
Going back to your question: yes, having a reference key between multiple collections is also an option. With the aggregation framework, particularly $lookup and $graphLookup, you can do most of your mappings. Remember to maintain good indexes for better querying. All in all, you need to analyze your application's data before you start. Try the query analyzer (explain) in MongoDB to check performance stats for each query.
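To make the point about indexes concrete, here is a minimal pure-Python sketch of why a query on the school field stays fast as the collection grows once that field is indexed. It is only an illustration: the sample data and helper names are made up, and MongoDB's real indexes are B-trees rather than the hash map used here.

```python
from collections import defaultdict

# Hypothetical sample data standing in for the Students collection.
students = [
    {"_id": i, "name": f"student{i}", "school": f"school_{i % 100}"}
    for i in range(10_000)
]


def find_by_scan(docs, school_id):
    """Collection scan: every document is examined."""
    examined = 0
    matches = []
    for doc in docs:
        examined += 1
        if doc["school"] == school_id:
            matches.append(doc)
    return matches, examined


def build_index(docs, field):
    """Toy secondary index: field value -> positions of matching documents."""
    index = defaultdict(list)
    for pos, doc in enumerate(docs):
        index[doc[field]].append(pos)
    return index


def find_by_index(docs, index, school_id):
    """Index lookup: only the matching documents are touched."""
    positions = index.get(school_id, [])
    return [docs[p] for p in positions], len(positions)


school_index = build_index(students, "school")
scan_hits, scan_examined = find_by_scan(students, "school_7")
index_hits, index_examined = find_by_index(students, school_index, "school_7")
print(scan_examined, index_examined)
```

Both lookups return the same students, but the scan examines the whole collection while the indexed lookup examines only the matches. In MongoDB itself, explain reports comparable numbers for a real query.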
Example:-
Because MongoDB encourages denormalized data, you can definitely consider embedded documents, but you need to know when to embed and when not to.
Say you're building a social media website with a users collection that stores each user's related information (phone number, height, date of birth, email). Each user document can embed its addresses (one or two), which usually won't change often. The list of friends, however, should be stored in a different collection: it needs much more maintenance, it can be accessed on its own, and keeping it out makes the user document smaller and limited to the important data. It's all about your data requirements (1-many or 1-n) and querying capabilities.
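The two shapes described above might look roughly like this. This is a sketch using plain Python dictionaries in place of MongoDB documents; all field names and values are hypothetical.

```python
# Hypothetical user document: the addresses are embedded because they are
# few, rarely change, and are read together with the user.
user = {
    "_id": "u1",
    "name": "Ada",
    "email": "ada@example.com",
    "addresses": [
        {"type": "home", "city": "London"},
    ],
}

# Friend links live in their own collection, one small document per link,
# so they can grow and be queried without touching the user document.
friendships = [
    {"user_id": "u1", "friend_id": "u2"},
    {"user_id": "u1", "friend_id": "u3"},
    {"user_id": "u2", "friend_id": "u3"},
]


def friends_of(user_id, links):
    """Return the friend ids referenced from the separate collection."""
    return [link["friend_id"] for link in links if link["user_id"] == user_id]


def add_friend(user_id, friend_id, links):
    """Adding a friend inserts one small document; the user is not rewritten."""
    links.append({"user_id": user_id, "friend_id": friend_id})


add_friend("u1", "u4", friendships)
print(friends_of("u1", friendships))
```

The design choice is that frequently changing, unbounded data (friends) is referenced, while small, stable data (addresses) is embedded.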
Check these links :
MongoDB courses are free & best to learn, which are directly offered by mongoDB University.
mongoDB Courses
In Mongo what is the difference between sharding and replication?

Related

AWS DynamoDB and Storing User Data with Transactional Data in one table

I am looking at ways to store user data alongside transactional data like orders and invoices. Normally, I would use a relational database like PostgreSQL, but I wanted to know if it would be a good idea to store the user data along with their transactional data in one NoSQL table like DynamoDB?
I would assume if you did that you would structure your data to either use objects or arrays to store the orders or invoices, but I'm not sure if that is the best way to go about it.
EDIT
So after doing some more research and trying to understand how to fit everything into a single table design, I found this article in the AWS documentation. I decided to organise my data into collections using a combination of the primary key and the sort key. The sort key is used to determine collections (i.e., orders, customer-data, etc). This solution is perfect for my use case because I can keep all the user data (including transactions like orders) in one DynamoDB table.
In short, don't do that. DynamoDB is a great tool, but you need to understand it first. It's not just a NoSQL database, it's also a distributed one. It gives great performance, scalability and pricing, but modeling is trickier. You cannot build requests as you please; your access patterns have to be taken into consideration when you design your model. Read about queries vs scans and global vs local indexes. Once you understand those, try reading about Single Table Design. It should give you an idea of the limitations of DynamoDB.
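As a rough illustration of the query vs scan distinction and of grouping items by sort key, here is a toy in-memory model. It is not DynamoDB's API; the ToyTable class and the key formats (USER#..., ORDER#...) are assumptions made up for this sketch.

```python
class ToyTable:
    """In-memory stand-in for a table with a partition key and a sort key."""

    def __init__(self):
        self.partitions = {}  # partition key -> {sort key: attributes}

    def put(self, pk, sk, attrs):
        self.partitions.setdefault(pk, {})[sk] = attrs

    def query(self, pk, sk_prefix=""):
        """Query: reads one partition, filtered by a sort-key prefix."""
        part = self.partitions.get(pk, {})
        return [(sk, part[sk]) for sk in sorted(part) if sk.startswith(sk_prefix)]

    def scan(self, predicate):
        """Scan: walks every item in every partition."""
        return [
            (pk, sk, attrs)
            for pk, part in self.partitions.items()
            for sk, attrs in part.items()
            if predicate(attrs)
        ]


table = ToyTable()
table.put("USER#42", "PROFILE", {"name": "Ada"})
table.put("USER#42", "ORDER#2024-01-05", {"total": 30})
table.put("USER#42", "ORDER#2024-02-11", {"total": 12})
table.put("USER#42", "INVOICE#2024-01-06", {"total": 30})
table.put("USER#77", "ORDER#2024-01-09", {"total": 99})

orders_for_42 = table.query("USER#42", "ORDER#")
big_orders = table.scan(lambda attrs: attrs.get("total", 0) > 50)
print(orders_for_42)
print(big_orders)
```

The query only works because the access pattern ("all orders of one user") was built into the keys; the "orders over 50" question was not, so it falls back to a scan over everything.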

Realtime Scalable Chat App - which database should I choose?

I am looking to build a scalable real-time chat app (I am just doing this for fun and out of interest so please don't ask why!) and I know that I am going to be handling the realtime messaging part through redis but I am not sure of what database to use for the following information:
User Relationships (friends)
Cold Chat History - this would only be queried in limited amounts (maybe like 50 messages) ordered by timestamp and queried in reverse (just as your messages would load in imessage or whatsapp when scrolling to view older messages)
Chat user relationship
I know for cold chat history an RDBMS or Cassandra is probably my best bet but handling friend relationships, as well as user-to-chat relationships in an RDBMS or cassandra, is ugly. I'm not sure if it's necessary, worth it, or even "right" to have a graph database in my tech stack just for this relationship mapping.
I was thinking that MongoDB or some other document-based storage could be a solution, but querying the data seems like it would be really taxing. My thought was to have a chat document that has a list of users, and then several other documents with a list of message IDs pointing to message documents. These documents would be mapped back to the chatID. I'm sure you can see, though, that the time and resources to query a set of messages would be quite high. Maybe I'm just underestimating the power of MongoDB, as I haven't really used it. I would also be able to handle the chat-user relationship more easily using documents, as well as friendships, by just storing user IDs in a list within the document.
I understand there is no perfect tool for the job but I would like someone's thoughts and inputs on how to design the data storage.
Thank you in advance!
If transaction volume is not high, you can go with PostgreSQL; otherwise Cassandra is a good choice for all the requirements you mention.
In Cassandra you should have multiple denormalized tables for low latency and high availability.
User - Create a master table having all information of any user.
User_Friend_relation - Create another table with a composite primary key of userid and friendid, and is_active (0,1) as a descending clustering key: ((userid, friendid), is_active).
Chat_user_friend - This is your main table, holding all chats. Create it with timestamp as the clustering key and store data in descending order, so that you avoid sorting at query time and get the latest data first.
Cold Chat History - As Cassandra is highly scalable, there is no need for this table.
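The idea behind clustering the chat table by timestamp can be sketched in plain Python: messages are kept ordered per chat, so loading "the previous 50" is a slice rather than a sort. This is a toy model, not Cassandra driver code, and the ChatLog class is hypothetical.

```python
import bisect


class ChatLog:
    """Toy model of one partition per chat, ordered by message timestamp."""

    def __init__(self):
        self._ts = {}    # chat_id -> sorted list of timestamps
        self._msgs = {}  # chat_id -> message texts aligned with _ts

    def append(self, chat_id, ts, text):
        ts_list = self._ts.setdefault(chat_id, [])
        msgs = self._msgs.setdefault(chat_id, [])
        pos = bisect.bisect_right(ts_list, ts)
        ts_list.insert(pos, ts)
        msgs.insert(pos, text)

    def page(self, chat_id, before_ts=None, limit=50):
        """Return up to `limit` messages older than `before_ts`, newest first."""
        ts_list = self._ts.get(chat_id, [])
        msgs = self._msgs.get(chat_id, [])
        if before_ts is None:
            end = len(ts_list)
        else:
            end = bisect.bisect_left(ts_list, before_ts)
        start = max(0, end - limit)
        return [(ts_list[i], msgs[i]) for i in range(end - 1, start - 1, -1)]


log = ChatLog()
for ts in range(1, 121):
    log.append("chat-1", ts, f"message {ts}")

first_page = log.page("chat-1")
# Scrolling up: continue from the oldest timestamp already shown.
second_page = log.page("chat-1", before_ts=first_page[-1][0])
third_page = log.page("chat-1", before_ts=second_page[-1][0])
```

Each page is addressed by the oldest timestamp already shown, which is the same cursor style a chat client uses when scrolling back through history.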
Data modeling is an area that requires a lot of discussion; anyway, I tried to answer as simply as possible.
It's best to keep relationships in a relational database.
I use PostgreSQL for such purposes in my chat applications.
For chat history and other events, Cassandra is a good choice (I also use Cassandra). However, it depends on your database size (number of records). If you don't need to keep tens of thousands of historical messages for thousands of users, then Cassandra will be overkill. In that case you can also use PostgreSQL or another relational database.
In PostgreSQL you can optimize access to history tables using partitioning.

Sharding database by user_id vs by entity_id

My current employer has a huge table of items. Each item has user_id and obviously item_id properties. To improve performance and high availability, my team decided to shard the table.
We are discussing two strategies:
Shard by item_id
In terms of high availability, if a shard is down then all users temporarily lose 1/N of their items. The performance will be even across all shards (random distribution).
Shard by user_id
If a shard is down then 1 of N users won't be able to access their items. Performance might be uneven because we have users with 1000s of items as well as users with just one item. Also, there is a big disadvantage: now we need to pass both item_id and user_id in order to access an item.
So my question is: which one should we choose? Maybe you can guide me with some mathematical formula to decide which one is better in different circumstances.
P.S. We already have replicas but they are becoming useless for our write throughput
UPDATE
We have SERP pages where we need to get items by IDs, as well as pages like the user profile where the user wants to see his/her items. The first pattern is the most frequently used, unlike the second one.
We can give up easily on ACID transactions because we've started to build microservices (so eventually almost all big entities will be encapsulated in specific microservice).
I see a couple of ways to attack this:
How do you intend to shard? Separate master servers, or separate schemas serviced by the same server but by different storage backends?
How do you access this data? Is it basically key/value? Do you need to query all of a user's items at once? How transactional do your CRUD operations need to be?
Do you foresee unbalanced shards being a problem, based on the data you're storing?
Do you need to do relational queries of this data against other data in your system?
TradeOffs
If you split shards across server/database instance boundaries, sharding by item_id means you will not be able to do a single query for info about a single user_id... you will need to query every shard and then aggregate the results at the application level. I find the aggregation has a lot more pitfalls than you'd think... better to keep this in the database.
If you can use a single database instance, sharding by creating tables/schemas that are backed by different storage subsystems would allow you to scale writes while still being able to do relational queries across them. All of your eggs are still in 1 server basket with this method, though.
If you shard by user_id, and you want to rebalance your shards by moving a user to another shard, you will need to atomically move all of the user's rows at once. This can be difficult if there are lots of rows. If you shard by item_id, you can move one item at a time. This allows you to incrementally rebalance your shards, which is awesome.
If you intend to split these into separate servers such that you cannot do relational queries across schemas, it might be better to use a key/value store such as DynamoDB. Then you only have to worry about one endpoint, and the sharding is done at the database layer. No middleware to determine which shard to use!
The key tradeoff seems to be the ability to query about all of a particular user's data (sharding by user_id), vs easier balancing and rebalancing of data across shards (sharding by item_id).
I would focus on the question of how you need to store and access your data. If you truly only need access by item_id, then shard by item_id. Avoid splitting your database in ways counterproductive to how you query it.
If you're still unsure, note that you can shard by item_id and then choose to shard by user_id later (you would do this by rebalancing based on user_id and then enforcing new rows only getting written to the shard their user_id belongs to).
Based on your update, it sounds like your primary concerns are not relational queries, but rather scaling writes to this particular pool of data. If that's the case, sharding by item_id allows you the most flexibility to rebalance your data over time, and is less likely to develop hot spots or become unbalanced in the first place. This comes at the price of having to aggregate queries based on user_id across shards, but as long as those "all items for a given user" queries do not need consistency guarantees, you should be fine.
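The tradeoff can be made tangible with a small sketch. Everything here is an assumption for illustration: four shards, hash-based routing via MD5, and a deliberately skewed data set in which one user owns 1000 items and a hundred others own one each.

```python
import hashlib
from collections import Counter

N_SHARDS = 4  # hypothetical shard count


def shard_for(key, n_shards=N_SHARDS):
    """Route a key to a shard by hashing it (one of several possible schemes)."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


# Hypothetical skewed data as (item_id, user_id) pairs:
# user 1 owns 1000 items, users 2..101 own one item each.
items = [(item_id, 1) for item_id in range(1000)]
items += [(1000 + i, 2 + i) for i in range(100)]

# How many items land on each shard under each strategy.
by_item = Counter(shard_for(item_id) for item_id, _ in items)
by_user = Counter(shard_for(user_id) for _, user_id in items)


def shards_touched_for_user(user_id, strategy):
    """Which shards must be queried to list all items of one user."""
    if strategy == "user_id":
        return {shard_for(user_id)}
    return {shard_for(item_id) for item_id, owner in items if owner == user_id}


print("items per shard, sharded by item_id:", dict(by_item))
print("items per shard, sharded by user_id:", dict(by_user))
print("shards for user 1:",
      len(shards_touched_for_user(1, "item_id")), "vs",
      len(shards_touched_for_user(1, "user_id")))
```

Sharding by user_id keeps a user's items on one shard but lets the heavy user dominate it; sharding by item_id spreads the load but turns "all items for a user" into a query against several shards.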
I'm afraid there is no formula that can calculate the answer for all cases. It depends on your data schema and on your system's functional requirements.
If in your system an individual item_id is meaningful on its own and your users usually work with individual items (as in an Instagram-like service where item_ids correspond to user photos), I would suggest sharding by item_id, because this choice has a lot of advantages from a technical point of view:
ensures even load across all shards
ensures graceful degradation of your service: when shard is down users lose access to 1/N of their items, but they can work with other items
you do not have to pass user_id to access item_id
There are also some disadvantages with this approach. For example, it will be more difficult to back up all items of a given user.
When items only make sense as a user's complete set, it is more reasonable to shard by user_id.

What is a good web application SQL Server data mart implementation in ElasticSearch?

Coming from a RDBMS background and trying to wrap my head around ElasticSearch data storage patterns...
Currently in SQL Server, we have a star schema data mart, RecordData. Rows are organized by user ID, geographic location that pertains to the rest of the searchable record, title and description (which are free text search fields).
I would like to move this over to ElasticSearch, and have read about creating a separate index per user. If I understand this correctly, with this suggestion, I would be creating a RecordData type in each user index, correct? What is a recommended naming convention for user indices that will be simple for Kibana analysis?
One issue I have with this recommendation is, how would you organize multiple web applications on the ES server? You wouldn't want to have all those user indices all over the place?
Is it so bad to have one index per application, and type per SQL Server table?
Since in SQL Server, we have other tables for user configuration, based on user ID's, I take it that I could then create new ES types in user indices for configuration. Is this a recommended pattern? I would rather not have two data base systems for this web application.
Suggestions welcome, thank you.
I went through the same thing, and there are a few things to take into account.
Data Modeling
You say you use a star schema today. Elasticsearch is typically appropriate for denormalized data, where the totality of the information resides in each document, unlike with a star schema. If you can live with denormalized data, that is fine, but I assume that since you already have a star schema, denormalized data is not an option because you don't want to go and update millions of documents each time a location name changes, for example (if I understand the use case). At least in my use case that wasn't an option.
What are Elasticsearch options for normalized data?
This leads us to think about how to put star-schema-like data in a system like Elasticsearch. There are a few options in the documentation; the main ones I focused on were:
Nested Objects - more details at https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-objects.html . With nested objects the entire information is kept in a single document, meaning one location and its related users would be in a single document. That may not be optimal because the document will be huge and, again, a change in the location name will require updating the entire document. So this is better but still not optimal.
Parent - Child Relationship - more details at https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child.html . In this case the location and the user records would be kept as separate documents (within the same index), similar to rows in separate tables of a relational database. This seems to be the right modeling for what we need. The only major issue with this option is that Kibana 4 does not provide ways to manipulate/aggregate documents based on a parent/child relationship as of this writing. So if your main driver for using Elasticsearch is Kibana (this was mine), that more or less eliminates the option. If you want to benefit from Elasticsearch's speed as an engine, this seems to be the desired option for your use case.
In my opinion, once you get the data modeling right, all of your questions will be easier to answer.
Regarding the organization of the servers themselves, we run a separate cluster of 3 Elasticsearch nodes behind a load balancer (all hosted in the cloud) and have all our web applications connect to that cluster through the Elasticsearch API.
Hope that helps.

Few database design questions relating to user content site

I'm designing a user content website (kind of similar to Yelp, but for a different market and with photo sharing) and have a few database questions:
1. Does each user get their own set of tables, or are we storing multiple users' data in common tables? Since this is also a social network, when the number of users grows, databases are usually partitioned off for scalability. Different sets of users are sent separately, so what is the best approach? I guess some data like user accounts can be in common tables, but for wall posts, photos etc. each user will get their own table? If so, then if we have 10 million users, that means 10 million x whatever number of tables per user? This is currently being designed in MySQL.
2. How do the user tables know what to create each time a user joins the site? I am assuming there may be a system table template from which it is pulling in the fields?
3. In addition to the above question, if tomorrow we modify tables or add/remove features, how do we roll the changes down to all the live user accounts/tables? I know from a page point of view we have the master template, but for the database, how will the user tables be updated? Is that something we do manually, or will the table keep checking, say every 24 hrs, with the system tables for updates to its structure?
4. If the above is all true, that means we are maintaining one master set of tables with system default values, and each user gets the same values copied to their tables? Take a field like the maximum failed login attempts before the system locks the account. Say we have a system default of 5 login attempts within 30 minutes, but I want to allow users to specify their own number to customize their own security. Does that mean they can overwrite the system default in their own table?
Thanks.
1. Users should not get their own set of tables. It will most likely not perform as well as one table (properly indexed), and schema changes will have to be deployed to all user tables.
2. You could have default values specified on the table for things that are optional.
3. With difficulty. With one set of tables it will be a lot easier, and probably faster.
4. That sort of data should be stored in a User Preferences table that stores all preferences for all users. Again, don't duplicate the schema for all users.
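One way the shared preferences table could work, sketched with Python's built-in sqlite3 module instead of MySQL to keep it self-contained. The table names, the column names and the max_failed_logins setting are hypothetical; the system default is stored once, and a user row exists only when that user overrides it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE system_defaults (
    name  TEXT PRIMARY KEY,
    value TEXT NOT NULL
);
CREATE TABLE user_preferences (
    user_id INTEGER NOT NULL,
    name    TEXT NOT NULL,
    value   TEXT NOT NULL,
    PRIMARY KEY (user_id, name)
);
INSERT INTO system_defaults VALUES ('max_failed_logins', '5');
-- Only user 2 has customized the setting.
INSERT INTO user_preferences VALUES (2, 'max_failed_logins', '3');
""")


def get_setting(user_id, name):
    """Return the user's override if present, else the system default."""
    row = conn.execute(
        """
        SELECT COALESCE(p.value, d.value)
        FROM system_defaults AS d
        LEFT JOIN user_preferences AS p
               ON p.name = d.name AND p.user_id = ?
        WHERE d.name = ?
        """,
        (user_id, name),
    ).fetchone()
    return row[0] if row else None


print(get_setting(1, "max_failed_logins"))
print(get_setting(2, "max_failed_logins"))
```

Changing the system default is a single UPDATE on system_defaults, and it takes effect for every user who has not overridden it.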
Generally the idea of creating separate tables for each entity (in this case users) is not a good idea. If each table is separate querying may be cumbersome.
If your table is large you should optimize the table with indexes. If it gets very large, you also may want to look into partitioning tables.
This allows you to see the table as 1 object, though it is logically split up - the DBMS handles most of the work and presents you with 1 object. This way you SELECT, INSERT, UPDATE, ALTER etc as normal, and the DB figures out which partition the SQL refers to and performs the command.
Not splitting up the tables by user, and instead using indexes and partitions, would deal with scalability while maintaining performance. If you don't split up the tables manually, this also makes points 2, 3, and 4 moot.
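A small sketch of the "one shared table plus an index" approach, using Python's built-in sqlite3 module as a stand-in for MySQL or SQL Server (an assumption made only so the example is self-contained). The wall_posts table and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wall_posts ("
    "id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, body TEXT)"
)
# One shared table holds the posts of all users (1000 users, 20 posts each).
conn.executemany(
    "INSERT INTO wall_posts (user_id, body) VALUES (?, ?)",
    [(i % 1000, f"post {i}") for i in range(20_000)],
)


def plan(sql, params=()):
    """Return SQLite's query plan description for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params)
    return " ".join(row[3] for row in rows)


query = "SELECT body FROM wall_posts WHERE user_id = ?"
plan_before = plan(query, (42,))

# A single index serves every user; no per-user tables are needed.
conn.execute("CREATE INDEX idx_wall_posts_user ON wall_posts (user_id)")
plan_after = plan(query, (42,))

posts = conn.execute(query, (42,)).fetchall()
print(plan_before)
print(plan_after)
```

Before the index the plan is a scan of the whole table; afterwards it is a search through the index. A schema change is likewise one ALTER TABLE on the shared table rather than one per user.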
Here's a link to partitioning tables (SQL Server-specific):
http://databases.about.com/od/sqlserver/a/partitioning.htm
It doesn't make any kind of sense to me to create a set of tables for each user. If you have a common set of tables for all users then I think that avoids all the issues you are asking about.
It sounds like you need to locate a primer on relational database design basics. Regardless of the type of application you are designing, you should start there. Learn how joins work, indices, primary and foreign keys, and so on. Learn about basic database normalization.
It's not customary to create new tables on-the-fly in an application; it's usually unnecessary in a properly designed schema. Usually schema changes are done at deployment time. The only time "users" get their own tables is an artifact of a provisioning decision, wherein each "user" is effectively a tenant in a walled-off garden; this only makes sense if each "user" (more likely, a company or organization) never needs access to anything that other users in the system have stored.
There are mechanisms for dealing with loosely structured types of information in databases, but if you find yourself reaching for this often (the most common method is called Entity-Attribute-Value), your problem is either not quite correctly modeled, or you may not actually need a relational database, in which case it might be better off with a document-oriented database like CouchDB/MongoDB.
Adding, based on your updated comments/notes:
Your concerns about the number of records in a particular table are most likely premature. Get something working first. Most modern DBMSes, including newer versions of MySQL, support mechanisms beyond indices and clustered indices that can help deal with large numbers of records. To wit, in MS SQL Server you can create a partition function on fields of a table; MySQL 5.1+ has a few similar partitioning options based on hash functions, ranges, or other mechanisms. Follow well-established conventions for database design, modeling your domain as sensibly as possible, then adjust when you run into problems. First adjust using the tools available within your choice of database, then consider more drastic measures only when you can prove they are needed. There are other kinds of denormalization that are more likely to make sense before you would even want to consider having something as unidiomatic to database systems as a "table per user" model; even if I were to look at that route, I'd probably consider something like materialized views first.
I agree with the comments above that say that a table per user is a bad idea. Also, while it's a good idea to have strategies in mind now for how you can cope when things get really big, I'd concentrate on getting things right for a small number of users first - if no-one wants to / is able to use your service, then unfortunately you won't be faced with the problem of lots of users.
A common approach among very large sites is database sharding. The summary is: you have N instances of your database in parallel (on separate machines), and each holds 1/N of the total data. There's some shared way of knowing which instance holds a given bit of data. To access some data you have 2 steps, rather than the 1 you might expect:
Work out which shard holds the data
Go to that shard for the data
There are problems with this, such as: you set up e.g. 8 shards and they all fill up, so you want to share the data over e.g. 20 shards -> migrating data between shards.
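The two-step lookup and the migration problem can be sketched as follows. This assumes integer keys and the simplest possible routing rule (key modulo the number of shards); the ShardedStore class is hypothetical, and real systems often use other schemes such as lookup tables or consistent hashing.

```python
class ShardedStore:
    """Toy sharded key/value store with modulo routing."""

    def __init__(self, n_shards):
        self.shards = [dict() for _ in range(n_shards)]

    def _locate(self, key):
        # Step 1: work out which shard holds the data.
        return key % len(self.shards)

    def put(self, key, value):
        self.shards[self._locate(key)][key] = value

    def get(self, key):
        # Step 2: go to that shard for the data.
        return self.shards[self._locate(key)].get(key)


store = ShardedStore(8)
for key in range(1000):
    store.put(key, f"row {key}")

# Growing from 8 to 20 shards with modulo routing: count how many keys
# would have to be migrated because their shard number changes.
keys = range(100_000)
moved = sum(1 for k in keys if k % 8 != k % 20)
print(store.get(123), moved / len(keys))
```

With plain modulo routing most keys change shards when the shard count changes, which is why growing a sharded setup means a large data migration unless the routing scheme is designed to limit movement.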

Resources