Realtime Scalable Chat App - which database should I choose? - database

I am looking to build a scalable real-time chat app (I am just doing this for fun and out of interest, so please don't ask why!). I know that I am going to handle the real-time messaging part through Redis, but I am not sure what database to use for the following information:
User Relationships (friends)
Cold Chat History - this would only be queried in limited amounts (maybe 50 messages at a time), ordered by timestamp and queried in reverse (just as your messages load in iMessage or WhatsApp when you scroll to view older messages)
Chat user relationship
I know that for cold chat history an RDBMS or Cassandra is probably my best bet, but handling friend relationships, as well as user-to-chat relationships, in an RDBMS or Cassandra is ugly. I'm not sure if it's necessary, worth it, or even "right" to have a graph database in my tech stack just for this relationship mapping.
I was thinking that MongoDB or some other document-based store could be a solution, but querying the data seems like it would be really taxing. My thought was to have a chat document that holds a list of users, and then several other documents, each holding a list of message IDs pointing to message documents. These documents would be mapped back to the chat ID. I'm sure you can see, though, that the time and resources needed to query a set of messages would be quite high. Maybe I'm just underestimating the power of MongoDB, as I haven't really used it. Documents would also make it easier to handle the chat-user relationship and friendships, by just storing user IDs in a list within the document.
I understand there is no perfect tool for the job but I would like someone's thoughts and inputs on how to design the data storage.
Thank you in advance!

If transaction volume is not high, you can go with PostgreSQL; otherwise Cassandra is a good choice for all the requirements you mentioned.
In Cassandra you should have multiple denormalized tables for low latency and high availability.
User - Create a master table having all information of any user.
User_Friend_relation - Create another table whose primary key combines userid and friendid, with is_active (0,1) as a descending clustering key: ((userid, friendid), is_active)
Chat_user_friend - This is your main table, holding all chat messages. Create it with timestamp as the clustering key and store the data in desc order, so that you avoid sorting at query time and get the latest data first.
Cold Chat History - As Cassandra is highly scalable, there is no need for this table.
Data modeling is an area that needs a lot of discussion; anyway, I tried to answer this as simply as possible.
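As a rough illustration of the access pattern behind the Chat_user_friend table, here is an in-memory Python sketch. It is not Cassandra driver code: the ChatStore class, the chat_id partition key, and the "before" paging argument are assumptions made for the example; the answer itself only specifies timestamp as a descending clustering key.

```python
from bisect import insort
from collections import defaultdict


class ChatStore:
    """Toy model of one wide table: partition key -> rows kept in clustering order."""

    def __init__(self):
        # Each partition holds (-timestamp, text) tuples, so the list's
        # ascending sort order is descending order by timestamp.
        self._partitions = defaultdict(list)

    def insert(self, chat_id, timestamp, text):
        # Rows are placed in clustering order at write time.
        insort(self._partitions[chat_id], (-timestamp, text))

    def latest(self, chat_id, limit=50, before=None):
        """Return up to `limit` messages, newest first, strictly older than `before`."""
        out = []
        for neg_ts, text in self._partitions[chat_id]:
            ts = -neg_ts
            if before is not None and ts >= before:
                continue  # skip rows the client has already loaded
            out.append((ts, text))
            if len(out) == limit:
                break
        return out


store = ChatStore()
for ts in range(1, 6):
    store.insert("chat-1", ts, f"message {ts}")

page1 = store.latest("chat-1", limit=2)                        # newest two
page2 = store.latest("chat-1", limit=2, before=page1[-1][0])   # next two, older
```

Because rows are already stored newest first, "scroll up to load older messages" becomes a read of the next slice of the same partition, which is why a separate cold-history table may not be needed.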

It's best to keep relationships in a relational database.
I use PostgreSQL for such purposes in my chat applications.
For chat history and other events, Cassandra is a good choice (I also use Cassandra). However, it depends on your database size (number of records). If you don't need to keep tens of thousands of historical messages for thousands of users, then Cassandra is overkill. In that case you can also use PostgreSQL or another relational database.
In PostgreSQL you can optimize access to history tables using partitioning.
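A minimal sketch of the relational layout, using Python's built-in sqlite3 as a stand-in for PostgreSQL so that it runs without a server. All table and column names are hypothetical, and partitioning is not shown; the point is only that friendships and chat membership are plain join tables and that older messages can be paged with an indexed "sent_at < ?" query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE friendships (
    user_id   INTEGER NOT NULL REFERENCES users(id),
    friend_id INTEGER NOT NULL REFERENCES users(id),
    PRIMARY KEY (user_id, friend_id)
);
CREATE TABLE chat_members (
    chat_id INTEGER NOT NULL,
    user_id INTEGER NOT NULL REFERENCES users(id),
    PRIMARY KEY (chat_id, user_id)
);
CREATE TABLE messages (
    id      INTEGER PRIMARY KEY,
    chat_id INTEGER NOT NULL,
    sent_at INTEGER NOT NULL,
    body    TEXT NOT NULL
);
CREATE INDEX messages_by_chat_time ON messages (chat_id, sent_at DESC);
""")

# Sample data: a friendship is stored as two directed rows.
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob"), (3, "cy")])
conn.executemany("INSERT INTO friendships VALUES (?, ?)", [(1, 2), (2, 1), (1, 3), (3, 1)])
conn.executemany(
    "INSERT INTO messages (chat_id, sent_at, body) VALUES (?, ?, ?)",
    [(10, t, f"m{t}") for t in range(1, 6)],
)


def friends_of(user_id):
    rows = conn.execute(
        "SELECT friend_id FROM friendships WHERE user_id = ? ORDER BY friend_id",
        (user_id,),
    )
    return [r[0] for r in rows]


def older_messages(chat_id, before, limit=50):
    """Page backwards through history: newest first, strictly older than `before`."""
    rows = conn.execute(
        "SELECT sent_at, body FROM messages "
        "WHERE chat_id = ? AND sent_at < ? "
        "ORDER BY sent_at DESC LIMIT ?",
        (chat_id, before, limit),
    )
    return rows.fetchall()
```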

Related

AWS DynamoDB and Storing User Data with Transactional Data in one table

I am looking at ways to store user data alongside transactional data like orders and invoices. Normally, I would use a relational database like postgresql, but I wanted to know if it would be a good idea to store the user data along with their transactional data in one noSQL table like DynamoDB?
I would assume that if you did that, you would structure your data to use either objects or arrays to store the orders or invoices, but I'm not sure if that is the best way to go about it.
EDIT
So after doing some more research and trying to understand how to fit everything into a single-table design, I found this article in the AWS documentation. I decided to organise my data into collections using a combination of the primary key and the sort key. The sort key is used to determine collections (i.e., orders, customer-data, etc). This solution is perfect for my use case because I can keep all the user data (including transactions like orders) in one DynamoDB table.
In short, don't do that. DynamoDB is a great tool, but you need to understand it first. It's not just a NoSQL database, it's also a distributed one. It gives great performance, scalability and pricing, but modeling is trickier. You cannot build requests as you please; your access patterns have to be taken into consideration when you design your model. Read about queries vs scans and global vs local indexes. Once you get that, you might try reading about Single Table Design. It should give you an idea of the limitations of DynamoDB.
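To make the "collections via partition key plus sort key" idea and the query vs scan distinction concrete, here is a small in-memory Python sketch. It does not call DynamoDB; the PK/SK attribute names and the USER#, ORDER#, INVOICE# prefixes are assumptions chosen for illustration.

```python
# One "table" holding several item collections per user.
TABLE = [
    {"PK": "USER#42", "SK": "PROFILE", "name": "Ada"},
    {"PK": "USER#42", "SK": "ORDER#2021-01-05", "total": 30},
    {"PK": "USER#42", "SK": "ORDER#2021-02-11", "total": 12},
    {"PK": "USER#42", "SK": "INVOICE#2021-02-12", "amount": 12},
    {"PK": "USER#7", "SK": "PROFILE", "name": "Bo"},
]


def query(pk, sk_prefix=""):
    """Mimic a query: exact partition key match plus an optional sort-key prefix."""
    hits = [it for it in TABLE if it["PK"] == pk and it["SK"].startswith(sk_prefix)]
    return sorted(hits, key=lambda it: it["SK"])


def scan(predicate):
    """Mimic a scan: every item in the table is examined, whatever its key."""
    return [it for it in TABLE if predicate(it)]


orders = query("USER#42", "ORDER#")              # one user's orders only
everything = query("USER#42")                    # profile, orders and invoices
big_orders = scan(lambda it: it.get("total", 0) > 20)  # no key involved
```

The first two calls only need the key design; the last one is the kind of request that was not planned for in the key schema, so it has to look at the whole table. That is the modeling trade-off the answer is warning about.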

"Grouping" Data in a MongoDB cluster

For instance, assume I have a MongoDB database that stores a number of schools, and a number of teachers, and students in those schools. Instead of having each school be its own collection in the database, I have a collection of Schools, Teachers, and Students, and obviously in the documents under Students and Teachers, I have some reference to the respective school under the Schools collection. However, is there a way to somehow logically/physically group the data such that Teacher, and Student documents, are grouped under their respective School documents.
As of now, I have three different collections, Schools, Teachers, and Students, and let's say I want all students that attend StackOverflow Academy; I'd do something like:
Students.find({school: "stackOverFlowAcademy_ID"})
But as the database grows in size, I assume this way wouldn't be efficient and quick, compared to if it were a small database.
Is my current approach enough, or is there a more efficient way to do this?
EDIT:
MongoDB docs state that if you're using MongoDB Atlas (Which I am), sharding, and other effective "grouping" of data is handled automatically on their end; so no need to do any sharding, or replica sets implementation by yourself if you're using Atlas.
This is a wide topic; I'm listing a few things I'm aware of:
Replica sets: A replica set is a group of mongod instances that host the same data set. When you create a MongoDB deployment through MongoDB Atlas, what you get is a cluster with three nodes, which is nothing but three mongod instances; their primary purpose is high availability. A replica set most likely has nothing to do with your data structure. It usually has 1 primary node and 2 secondaries (which can serve read requests). If the primary goes down, one of its secondaries becomes primary and serves requests; once the failed node is back, its data is synced (MongoDB Atlas takes care of all of this, and the usual median downtime is about 12 sec).
Sharding: As far as I know, sharding becomes worth considering once your database grows past roughly 2TB or 4TB (please check this figure). It is horizontal scaling: rather than increasing the RAM and disk of a single server, you add more servers. In a word, a sharded cluster is a bunch of replica sets called shards, plus config servers, with mongos routing the queries, but there is a lot to learn in depth before implementing it.
Going back to your question: yes, keeping a reference key between multiple collections is a valid option, and with the aggregation framework, particularly $lookup and $graphLookup, you can do most of your mappings. Remember to maintain good indexes for better querying. All in all, you need to analyze your application's data before you start. Try using the query analyzer (explain) in MongoDB to check the stats for each query's performance.
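The point about index keys can be pictured with a toy Python model (plain lists and dicts, no MongoDB involved). The sample documents and function names are made up for the example; a real index is built and maintained by the database itself.

```python
from collections import defaultdict

students = [
    {"_id": 1, "name": "Ann", "school": "academy"},
    {"_id": 2, "name": "Bob", "school": "academy"},
    {"_id": 3, "name": "Cy", "school": "other"},
]


def find_by_scan(school_id):
    # Without an index every document is examined (a collection scan),
    # so the cost grows with the size of the whole collection.
    return [s for s in students if s["school"] == school_id]


# A lookup keyed by the reference field, standing in for an index on `school`.
school_index = defaultdict(list)
for s in students:
    school_index[s["school"]].append(s)


def find_by_index(school_id):
    # With the index only the matching documents are touched.
    return list(school_index.get(school_id, []))
```

Both functions return the same documents; the difference is how much data has to be read to find them, which is what the explain output helps you check on a real deployment.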
Example:-
As MongoDB favors denormalized data, you can definitely consider embedded documents, but you need to know when to embed and when not to.
Let's say you're dealing with a social media website. You would have a users collection storing each user with their related information (phone number, height, date of birth, email), and it can hold an embedded document of addresses (1 or 2), which usually won't change that often. The list of friends, however, should be stored in a different collection: it needs much more maintenance, it can be accessed on its own, and keeping it out makes your User JSON smaller and limited to the important data. It's all about your data requirements (the cardinality of each relationship, such as one-to-many) and your querying needs.
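The embed vs reference split described above might look like this, sketched with plain Python dicts rather than a live MongoDB. Field names and sample values are hypothetical, and user_with_friends only imitates what a $lookup would do on the server.

```python
# users collection: rarely changing addresses are embedded in the user document.
users = {
    1: {"_id": 1, "email": "ann@example.com",
        "addresses": [{"city": "Oslo"}, {"city": "Bergen"}]},
    2: {"_id": 2, "email": "bob@example.com", "addresses": []},
}

# friendships collection: one small document per edge, maintained separately.
friendships = [
    {"user_id": 1, "friend_id": 2},
]


def user_with_friends(user_id):
    """Join the two collections on demand instead of storing friends in the user."""
    user = dict(users[user_id])  # shallow copy, the stored document is untouched
    user["friends"] = [f["friend_id"] for f in friendships if f["user_id"] == user_id]
    return user


def add_friend(user_id, friend_id):
    # Adding a friend never rewrites the (possibly large) user document.
    friendships.append({"user_id": user_id, "friend_id": friend_id})
```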
Check these links :
MongoDB courses are free & best to learn, which are directly offered by mongoDB University.
mongoDB Courses
In Mongo what is the difference between sharding and replication?

What is a good web application SQL Server data mart implementation in ElasticSearch?

Coming from a RDBMS background and trying to wrap my head around ElasticSearch data storage patterns...
Currently in SQL Server, we have a star schema data mart, RecordData. Rows are organized by user ID, geographic location that pertains to the rest of the searchable record, title and description (which are free text search fields).
I would like to move this over to ElasticSearch, and have read about creating a separate index per user. If I understand this correctly, with this suggestion, I would be creating a RecordData type in each user index, correct? What is a recommended naming convention for user indices that will be simple for Kibana analysis?
One issue I have with this recommendation is, how would you organize multiple web applications on the ES server? You wouldn't want to have all those user indices all over the place?
Is it so bad to have one index per application, and type per SQL Server table?
Since in SQL Server we have other tables for user configuration, keyed by user ID, I take it that I could then create new ES types in the user indices for configuration. Is this a recommended pattern? I would rather not have two database systems for this web application.
Suggestions welcome, thank you.
I went through the same thing, and there are a few things to take into account.
Data Modeling
You say you use a star schema today. Elasticsearch is typically appropriate for denormalized data, where the totality of the information resides in each document, unlike with a star schema. If you can live with denormalized data, that is fine, but I assume that since you already have a star schema, denormalized data is not an option, because you don't want to go and update millions of documents each time a location name changes, for example (if I understand the use case). At least in my use case that wasn't an option.
What are Elasticsearch options for normalized data?
This leads us to think about how to put star-schema-like data into a system like Elasticsearch. There are a few options in the documentation; the main ones I focused on were:
Nested Objects - more details at https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-objects.html . With nested objects the entire information is kept in a single document, meaning one location and its related users would be in a single document. That may not be optimal, because the document will be huge and, again, a change in the location name requires updating the entire document. So this is better but still not optimal.
Parent - Child Relationship - more details at https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child.html . In this case the location and the user records would be kept as separate documents within the same index, similarly to related tables in a relational database. This seems to be the right modeling for what we need. The only major issue with this option is that Kibana 4 does not provide ways to manipulate/aggregate documents based on the parent/child relationship as of this writing. So if your main driver for using Elasticsearch is Kibana (this was mine), that kind of eliminates the option. If you want to benefit from the speed of Elasticsearch as an engine, this seems to be the desired option for your use case.
In my opinion, once you get the data modeling right, all of your other questions will be easier to answer.
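The update cost that drives this choice can be shown with a small Python sketch using plain dicts instead of an Elasticsearch cluster. The record and location fields are invented for the example; the functions simply count how many documents a location rename has to rewrite under each model.

```python
# Denormalized model: every record carries its own copy of the location name.
denormalized = [
    {"user": "u1", "location_id": 1, "location_name": "Old Town"},
    {"user": "u2", "location_id": 1, "location_name": "Old Town"},
    {"user": "u3", "location_id": 1, "location_name": "Old Town"},
    {"user": "u4", "location_id": 2, "location_name": "Harbor"},
]

# Parent/child model: the name lives once in the parent, children only point to it.
locations = {1: {"name": "Old Town"}, 2: {"name": "Harbor"}}
children = [
    {"user": "u1", "parent": 1},
    {"user": "u2", "parent": 1},
    {"user": "u3", "parent": 1},
    {"user": "u4", "parent": 2},
]


def rename_denormalized(docs, location_id, new_name):
    """Rewrite every document that embeds the location; return how many changed."""
    touched = 0
    for d in docs:
        if d["location_id"] == location_id:
            d["location_name"] = new_name
            touched += 1
    return touched


def rename_parent(parents, location_id, new_name):
    """Only the single parent document changes."""
    parents[location_id]["name"] = new_name
    return 1


def location_of(child):
    # Reading a child's location now needs a join against the parent.
    return locations[child["parent"]]["name"]
```

With millions of records per location, the first function is the "update millions of documents" case, while the second touches one document at the price of a join at query time.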
Regarding the organization of the servers themselves, we have a separate cluster of 3 Elasticsearch nodes behind a load balancer (all of that is hosted in the cloud), and all the web applications connect to that cluster using the Elasticsearch API.
Hope that helps.

Graph Database Design Methodologies

I want to use a graph database for a web application (involving a web of Users, Posts, Comments, Votes, Answers, Documents and Document-Merges, and some other transitive relationships on Users and Documents). So I started asking myself whether there is something like a design methodology for graph databases, i.e. an analogue of the design principles recommended for relational databases (like the normal forms)?
Example questions (of many questions arising):
Is it a good idea, to create a Top-Node Users, having relationships ("exist") on any User-Node in the Database?
Is it a good idea to build in version management (i.e. create relationships (something like "follows")) pointing to updated versions of a Document / Post in a way that going back this relationship means watching the changes the document went through.
etc...
So, do we need a Graph Database Design Cookbook?
The Gremlin User Group (http://tinkerpop.com/) and Neo4j User Group (https://groups.google.com/forum/?fromgroups#!forum/neo4j) are good places to discuss graph-database modeling.
You can create supernodes such as "Users," but it may be better and more performant to use indexes and create an index entry for each user with a key=element_type, value="user", id=user_node_id.
A "follows" relation is often used for people/friends like on Facebook and Twitter, so I wouldn't use that name for versioning. You can build a versioning system into Neo4j that timestamps each entry and uses a last-write-wins algorithm, and there are other database systems like Datomic that have this built in.
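Both suggestions, an index entry per user instead of a "Users" supernode and timestamped versions with last-write-wins, can be sketched in plain Python. This is a toy in-memory model, not Neo4j or Gremlin code, and every name in it is hypothetical.

```python
from collections import defaultdict

nodes = {}
index = defaultdict(set)      # (key, value) -> ids of matching nodes
versions = defaultdict(list)  # document id -> [(timestamp, content), ...]


def add_node(node_id, element_type, **props):
    nodes[node_id] = {"element_type": element_type, **props}
    # Index entry instead of an edge to a shared supernode.
    index[("element_type", element_type)].add(node_id)


def all_of_type(element_type):
    return sorted(index[("element_type", element_type)])


def write_version(doc_id, timestamp, content):
    # Every write is kept, so the history stays available.
    versions[doc_id].append((timestamp, content))


def current(doc_id):
    # Last write wins: the entry with the highest timestamp is the current one.
    return max(versions[doc_id], key=lambda v: v[0])[1]


add_node(1, "user", name="ann")
add_node(2, "user", name="bob")
add_node(3, "post", title="hello")

write_version("doc-1", 1, "first draft")
write_version("doc-1", 3, "third draft")
write_version("doc-1", 2, "second draft")  # arrives late, does not win
```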
See Lightbulb's model (https://github.com/espeed/lightbulb/blob/master/lightbulb/model.py) for an example blog model in Bulbs/Python (http://bulbflow.com).

What is couchdb, for what and how should I use it?

I hear a lot about couchdb, but after reading some documents about it, I still don't get why to use it and how.
Could you clarify this mystery for me?
It's a non-relational database, open-source, distributed (incremental, bidirectional replication), schema-free. A CouchDB database is a collection of documents; each document is a bunch of string "keys" and corresponding "values" (which can be numbers, strings, lists, dates, ...). You can have indices, queries, views.
If a relational DB feels confining to you (you find schemas too rigid, can't spread the DB engine work around a very large number of servers, etc), CouchDB is worth considering (it's one of the most interesting of the many non-relational DBs that are emerging these days).
But if all of your work happily fits in a relational database, that's what you probably want to continue using for production work (even though "playing around" with some non-relational DB is still well worth your time, just for personal growth and edification, that's quite different from transferring huge production systems over from a relational DB!-).
It sounds like you should be reading Why CouchDB
To quote from Wikipedia:
It is not a relational database management system. Instead of storing data in rows and columns, the database manages a collection of JSON documents. The documents in a collection need not share a schema, but retain query abilities via views.
CouchDB provides a different model for data storage than a traditional relational database in that it does not represent data as rows within tables, instead it stores data as "documents" in JSON format.
This difference in data storage model is what differentiates CouchDB from products like MySQL and SQL Server.
In terms of programmatic access, CouchDB exposes a REST API which you can use by sending HTTP requests from your code.
I hope this has been somewhat helpful, though I acknowledge it may not be, given my minimal familiarity with the product.
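As a sketch of what "sending HTTP requests from your code" looks like, the Python below builds the requests for storing and fetching a document with the standard library. The base URL assumes a CouchDB listening locally on its default port 5984, and the database and document names are made up. The requests are only constructed here, not sent, so the snippet runs without a server.

```python
import json
from urllib.request import Request

BASE = "http://localhost:5984"  # assumption: local CouchDB on the default port


def put_doc(db, doc_id, doc):
    """Build (but do not send) the request that stores a JSON document."""
    return Request(
        f"{BASE}/{db}/{doc_id}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def get_doc(db, doc_id):
    """Build the request that reads the document back."""
    return Request(f"{BASE}/{db}/{doc_id}", method="GET")


create = put_doc("people", "ann", {"name": "Ann", "likes": ["chess"]})
fetch = get_doc("people", "ann")
```

With a server running, passing either object to urllib.request.urlopen would perform the call. Updating an existing document additionally requires sending its current _rev value.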
I'm far from an expert(all I've done is play around with it some...) but here's how I'm thinking of using it:
Usually when I'm designing an app I've got a bunch of app servers behind a load balancer. Oftentimes I've got sticky sessions, so that each user goes back to the same app server during that session. What I'm thinking of doing is having a CouchDB instance tied to each app server.
That way you can use that local couchdb to access user preferences, product data...whatever data you've got that doesn't have to be perfectly up to date.
So...now you've got data on these local CouchDBs. CouchDB allows replication. So, at a fixed interval (every X seconds?), replicate each instance's data to its peers to keep them up to date.
As a whole you shouldn't have to worry about conflicts, because each app server has its own CouchDB and users are attached to the app server, and you've got eventual consistency because you've got replication.
Does that answer your question?
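A hedged sketch of the periodic merge: CouchDB exposes a _replicate endpoint that takes a source and a target, so each app server could push its local database to its peers on a timer. The host names and the database name below are hypothetical, and the requests are only built, not sent.

```python
import json
from urllib.request import Request

LOCAL = "http://localhost:5984"  # assumption: the CouchDB tied to this app server
PEERS = ["http://app2.example:5984", "http://app3.example:5984"]  # hypothetical peers


def replication_request(db, peer):
    """Build the POST that asks the local CouchDB to push `db` to one peer."""
    body = {"source": f"{LOCAL}/{db}", "target": f"{peer}/{db}"}
    return Request(
        f"{LOCAL}/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def replication_round(db):
    # One request per peer; a scheduler would call this every X seconds.
    return [replication_request(db, peer) for peer in PEERS]


requests_to_send = replication_round("prefs")
```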
A good example is when you have to deal with people data in either a website or an application. If you set out to design the data so that each individual's information is kept separate, that makes a good case for CouchDB, which stores data in documents rather than relational tables. In a production deployment, my users may end up adding ad hoc data for about 10% of the people and some other odd details for another 5%. In a relational context this could add up to loads of redundancy, but not in CouchDB.
And it's not just about the fact that CouchDB is non-relational: if you're too focused on that, you're missing the point. CouchDB is plugged into the web: all you need to start with is HTTP for creating documents and making queries (GET/PUT/POST/DELETE...), and it's RESTful. On top of that, it's portable and great for peer-to-peer sharing. It can also serve up web applications, termed 'CouchApps', where CouchDB holds the images, CSS, and markup as data stored in special documents called design documents.
Check out this collection of videos introducing non-relational databases, the one on CouchDB should give you a better idea.
