Users on my website perform operations such as login, logout, profile updates, and password changes. I am trying to design something that stores these user activities and can return the matching records when another system requests them by userId.
From what I can tell, the workload looks write-heavy (users log into the website very often, and I have to record every login). Only once in a while (say, when a user clicks on their history or a reporting team needs it) are the records read and returned.
So my application is write-intensive.
There are various approaches I can think of.
Approach 1: The system that receives these user activities keeps writing them into a queue, and another component keeps fetching from that queue (periodically, or when it fills up) and writes the records into the database, which is sharded (assume by hash of userId). A rough sketch of this flow appears after the list of approaches.
The problem with this approach is that if my activity manager runs on multiple nodes, it has to send records to many different shards, which means a lot of data movement over the network.
Approach 2: The system that receives these user activities keeps writing them into a queue, and another component keeps fetching from that queue (periodically, or when it fills up) and writes them into a read-through/write-through cache, which takes care of writing into the database.
The problem with this approach is that I do not know whether I can control where those records end up (i.e. which shard). Basically, I do not know how the write-through cache behaves: does it map to a single local DB, or can it route data to the shards?
Approach 3: The login operation is by far the most common user activity in my system, so I could keep a separate queue just for logins and flush it to disk periodically.
Approach 4: Use some cloud-based storage that acts as an in-memory queue where data coming from all nodes is stored. This would be a reliable cache that guarantees no data loss. Periodically read from this cache and write the records into the database shards.
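To make Approach 1 concrete, here is a rough sketch in Python (the batch size, the hash function, and the per-shard lists standing in for real database connections are all just illustrative):

```python
import hashlib
import queue
import threading
import time

NUM_SHARDS = 4
activity_queue = queue.Queue()
shards = {i: [] for i in range(NUM_SHARDS)}  # stand-ins for shard connections

def shard_for(user_id: str) -> int:
    # Pick a shard from the hash of the userId.
    digest = hashlib.sha1(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def record_activity(user_id: str, action: str) -> None:
    # Called from the request path: cheap, non-blocking append to the queue.
    activity_queue.put({"userId": user_id, "action": action, "ts": time.time()})

def flush_worker(batch_size: int = 100, interval: float = 5.0) -> None:
    # Drains the queue when a batch fills up or the interval elapses,
    # then routes each record to its shard.
    while True:
        batch = []
        deadline = time.time() + interval
        while len(batch) < batch_size and time.time() < deadline:
            try:
                batch.append(activity_queue.get(timeout=0.5))
            except queue.Empty:
                continue
        for rec in batch:
            shards[shard_for(rec["userId"])].append(rec)  # INSERT into shard

threading.Thread(target=flush_worker, daemon=True).start()
record_activity("user-42", "login")
```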
There are several problems to solve:
1. Ensuring I do not lose the data (what kind of replication to use, i.e. a queue that guarantees reliability).
2. Ensuring my frequent writes do not degrade performance.
3. Avoiding a single point of failure.
4. Achieving effectively unlimited scale.
I need suggestions for the above, based on existing, available solutions.
I have previously built some very basic real-time applications using sockets, and I have been reading more about the topic out of curiosity. One very interesting article I read was about Operational Transformation, and I learned several new things from it. After reading it, I kept wondering when or how this data would actually be saved to the database if I wanted to keep it. I have two assumptions/theories about what might be going on, but I'm not sure whether they are correct and/or the best way to solve this. They are as follows:
(For this example, let's assume it's a real-time collaborative whiteboard.)
For every edit that happens (e.g. drawing a line), the socket sends a message to everyone collaborating, and at the same time I store the data in my database. The problem I see with this solution is the amount of time spent accessing the database: for every line a user draws, I would have to hit the database to store it.
Use polling. In this theory, I save every change in temporary storage on the server, and after 'x' amount of time I take all the data from the temporary storage and save it to the database. The issue with this theory is the possibility of a failure in the temporary storage (e.g. a power failure). If the temporary storage loses its data before it is saved to the database, I would never be able to recover it.
How do similar real-time collaborative applications like Google Docs, Slides, etc. store the data in their databases? Are they following one of the theories I mentioned, or do they have a completely different way of storing the data?
They probably rely on a log of changes + the latest document version + periodic snapshots (if they allow time-traveling through the document history).
It is similar to how most databases' transaction systems work. After validating that a change is legitimate, the database writes it to a very fast on-disk data structure, the log, which only appends the changed values. This log is mirrored in memory with a dedicated data structure to speed up reads.
When a read comes in, the database checks the in-memory data structure and merges the change with what is stored in the cache or on disk.
Periodically, the changes present in memory and in the log are merged into the on-disk data structure.
So, to summarize, in your case (a sketch follows this list):
When an Operational Transformation arrives at the server, two things happen:
It is stored in the database as-is, to avoid any loss (the equivalent of the log).
It updates an in-memory data structure so the change can be replayed quickly when a user requests the latest version (the equivalent of the in-memory structure).
When a user requests the latest document, the server checks the in-memory data structure and replays the changes against the last stored consolidated document, which might be lagging behind because of the following point.
Periodically, the log is applied to the "last stored consolidated document" to reduce the number of OTs that must be replayed to produce the latest document.
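A rough sketch of that flow, assuming the whiteboard state is nothing more than a list of drawn lines (all names here are illustrative, not any real product's API):

```python
import json

LOG_PATH = "ot.log"
snapshot = {"lines": []}   # last consolidated document
pending_ops = []           # ops received since the last consolidation

def apply_op(doc, op):
    # Replays one operation against a document; here ops only add lines.
    if op["type"] == "draw_line":
        doc["lines"].append(op["line"])
    return doc

def receive_op(op):
    # 1. Append to the durable log first, so the op survives a crash.
    with open(LOG_PATH, "a") as log:
        log.write(json.dumps(op) + "\n")
    # 2. Keep it in memory so the latest document can be served quickly.
    pending_ops.append(op)

def latest_document():
    # Snapshot plus replay of whatever has arrived since the snapshot.
    doc = {"lines": list(snapshot["lines"])}
    for op in pending_ops:
        apply_op(doc, op)
    return doc

def consolidate():
    # Run periodically: fold the pending ops into the snapshot so future
    # reads have fewer operations to replay.
    for op in pending_ops:
        apply_op(snapshot, op)
    pending_ops.clear()

receive_op({"type": "draw_line", "line": [(0, 0), (10, 10)]})
print(latest_document())
consolidate()
```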
Anyway, the best way to have a definitive answer is to look at open-source code that does what you are looking for, e.g. etherpad.
Currently I have a monolithic process written in C that performs the following set of operations:
a. Accept data from the user (in a certain payload format).
   IF the user has provided data for the first time, validate that it is
   standard-compliant and store it in the user data-store.
   ELSE compare it with the data the user provided previously (stored in the
   user data-store) and check whether it is a duplicate.
       IF DUPLICATE, skip it.
       ELSE validate the data provided by the user and update the previous
       information in the user data-store.
b. Whenever the user data-store is updated:
   Initialize the analytics app to process the information stored in the user
   data-store and generate the final output.
   Replicate the updates into a backup user data-store for future reference.
Currently I am facing scale issues, especially with the rate at which users keep submitting data. As a result, the current application can only accept a limited number of user data requests with a single monolithic process.
Adding pthreads would be the next step to scale the application, but that adds complexity to the codebase, and proper care has to be taken with locking to preserve consistency.
I want to try a microservices approach where I split the current monolithic process into the services below, so that I can improve each process separately and tune its performance in future updates.
Process-01: Contains the logic to accept data from the user, perform all standard verifications, and update the shared data-store.
Process-02: Whenever the shared data-store is updated, contains the logic to process the data in it and create the final output.
Process-03: Takes care of replicating the shared data-store.
process-01 ==== (shared-datastore) === process-02
||
process-03
It seems I need a shared data-store from which process-01 and process-02 would request the data provided by the user for processing. Would this create a bottleneck, since I would need to copy the requested data into a message and share it with the processes? For process-02 I would need to copy the whole data-store so that it can work on the data independently.
Ideally, process-01 needs both read and write access, while process-02 needs only read access to the data in the shared data-store. A rough sketch of the planned split follows.
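To make the planned split concrete, here is a very rough sketch (Python for brevity; threads and dicts stand in for the separate processes and the shared/backup data-stores, and the same shape would map to C with a message-queue library):

```python
import queue
import threading

shared_store = {}   # user_id -> latest validated payload (written by process-01)
backup_store = {}   # replica maintained by process-03
update_events = queue.Queue()       # "store was updated" notifications for process-02
replication_events = queue.Queue()  # same notifications for process-03

def is_standard_compliant(payload) -> bool:
    return True  # placeholder for the real validation

def process_01_accept(user_id, payload):
    # Accept, validate, dedupe, then update the shared store and notify.
    if shared_store.get(user_id) == payload:
        return  # duplicate, skip
    if not is_standard_compliant(payload):
        return
    shared_store[user_id] = payload
    update_events.put(user_id)
    replication_events.put(user_id)

def process_02_analytics():
    while True:
        user_id = update_events.get()
        payload = shared_store[user_id]   # read-only access
        print("analytics output for", user_id, payload)

def process_03_replicate():
    while True:
        user_id = replication_events.get()
        backup_store[user_id] = shared_store[user_id]

threading.Thread(target=process_02_analytics, daemon=True).start()
threading.Thread(target=process_03_replicate, daemon=True).start()
process_01_accept("user-1", {"field": "value"})
```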
Since we are using C for our application, please point out any further caveats in the planned design. Is there a better design for this?
Any other design suggestions would also be of great help; pointers to similar application designs in C would be helpful too.
I am developing a web app right now where clients frequently (every few seconds) send read/write requests for certain data. As of right now, my server writes to the database immediately when a user changes something and reads from the database immediately when they want to view something. This works fine for me, but I am guessing it would be quite slow if there were thousands of users online.
Would it be more efficient to save write requests in an object on the server side and then do a bulk update at a certain time interval? This would help in situations where the same data is edited multiple times, since it would then require only one DB insert. It would also mean that I would read from the object for any data that hasn't yet been synced, which could improve efficiency by avoiding DB reads. At the same time, though, I feel this would be a liability for two reasons: 1. A server crash would erase all data that hasn't yet been synced. 2. A bulk insert could create sudden spikes of lag due to the mass of database calls.
How should I approach this? Is my current approach OK, or should I queue inserts for a later time?
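Roughly, the buffered approach I have in mind would look like this sketch (names are made up; bulk_save() stands in for a single bulk database call):

```python
import threading
import time

buffer_lock = threading.Lock()
pending_writes = {}   # key -> latest value awaiting flush

def bulk_save(items):
    # Stand-in for one bulk INSERT/UPDATE against the real database.
    print("bulk saving", len(items), "rows")

def write(key, value):
    # Repeated edits to the same key collapse into one pending write.
    with buffer_lock:
        pending_writes[key] = value

def read(key, load_from_db):
    # Serve unsynced data from the buffer first, else fall back to the database.
    with buffer_lock:
        if key in pending_writes:
            return pending_writes[key]
    return load_from_db(key)

def flush_loop(interval=5.0):
    # Background thread: flush whatever has accumulated every few seconds.
    while True:
        time.sleep(interval)
        with buffer_lock:
            items = dict(pending_writes)
            pending_writes.clear()
        if items:
            bulk_save(items)

threading.Thread(target=flush_loop, daemon=True).start()
```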
If a user makes a change to data and takes an action that they expect will save the data, you should do everything you can to ensure the data is actually saved. Example: let's say you delay the write for a while. The user is in a hurry, makes a change, then closes the browser. If you don't save right when they take the action they expect to save the data, you have lost data.
Web stacks generally scale horizontally. Don't start optimizing this kind of thing unless there's evidence that you really have to.
I am designing an application and I have two ideas in mind (below). I have a process that collects approximately 30 KB of data every 5 minutes, and this data needs to reach the clients (web side, about 100 users at any given time). The collected information does not need to be stored for future use.
Options:
1. I can fetch the data and insert it into a database every 5 minutes. The clients then call the DB, retrieve the data, and update the UI.
2. Collect the data and put it into a topic or queue. Multiple clients (consumers) can then go to the queue and obtain the data.
I am leaning towards option 2 as the better solution because it is faster (no DB calls) and avoids redundant storage.
Can anyone suggest which would be the ideal solution, and why?
I don't really understand the difference. The data has to be temporarily stored somewhere until the next update, right?
But all users can see it, not just the first one to get there, right? So a queue is not really an appropriate data structure, from my reading of your system.
Whether the data is written to something persistent like a database, or to something less persistent like part of the web server or application server, may be relevant here.
Also, you have tagged this as real-time, but I don't see how the web clients get real-time updates without some kind of push or long-poll mechanism.
It seems to me that you need a queue and the publisher/subscriber pattern.
This is an article about RabbitMQ and the Publish/Subscribe pattern.
I can fetch the data and insert it into a database every 5 minutes. The clients then call the DB, retrieve the data, and update the UI.
You can program your application to be event-oriented, i.e. raise domain events and publish a message to your subscribers.
When you use a queue, each subscriber dequeues the messages addressed to it, in order (FIFO). In addition, there is a delivery guarantee, unlike with a database, where a record can be deleted before every 'subscriber' has seen the message. A sketch follows the list below.
The pitfalls of using the database to accomplish this are:
Creating indexes makes querying faster, but inserts slower;
You will have to track the delivery guarantee for every subscriber yourself;
You'll need a TTL (time-to-live) strategy to purge old records (while still honoring the delivery guarantee).
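For example, with RabbitMQ's fanout exchange the collector publishes once and every connected client receives its own copy. A rough sketch with pika, RabbitMQ's Python client (the exchange name and functions are just illustrative, and a broker is assumed to be running on localhost):

```python
import json
import pika  # RabbitMQ client library

EXCHANGE = "metrics_updates"

def publish(payload: dict) -> None:
    # The collector calls this every 5 minutes with the ~30 KB payload.
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.exchange_declare(exchange=EXCHANGE, exchange_type="fanout")
    channel.basic_publish(exchange=EXCHANGE, routing_key="", body=json.dumps(payload))
    connection.close()

def subscribe() -> None:
    # Each client gets its own exclusive queue bound to the fanout exchange,
    # so every subscriber receives every update.
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.exchange_declare(exchange=EXCHANGE, exchange_type="fanout")
    result = channel.queue_declare(queue="", exclusive=True)
    channel.queue_bind(exchange=EXCHANGE, queue=result.method.queue)

    def on_message(ch, method, properties, body):
        print("received update:", json.loads(body))

    channel.basic_consume(queue=result.method.queue,
                          on_message_callback=on_message,
                          auto_ack=True)
    channel.start_consuming()
```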
I'm a beginning web developer sitting on an ambitious web application project.
So, after doing some research, I found out about SQL Service Broker. It seems like something I could use, but I'm not sure. Since learning it requires a serious time investment, I want to be sure it fits my needs.
I need to implement a system where website users can submit text to the website. This stream of messages has to be redundant and handled in a FIFO way, with another group of users at the other end of the stream dealing with the messages.
Now, a message that is being read by one of this second group of users should be locked so that no one else can read it at the same time. The user can then decide whether or not to handle the message. Only if he decides to deal with the message can it be deleted from the queue. If he decides he doesn't want to deal with it, the message should be put back in the queue (at the end of the queue, or at least with the highest priority) so that another user can read it and decide.
Is this something I would be able to implement with SQL Service Broker, or am I on the wrong track?
Thank you!
IMO, the best use of Service Broker is connecting independent applications in a loosely coupled way. What I mean by that is that systems tied together this way communicate through a set of mutually agreed message types, in contrast to one application directly manipulating the other's database, for example.
From what you've said, I would implement it as a simple table. For example: create a message table with an identity PK, an Allocation flag, and your custom columns. Whenever an operator wants to fetch the next message, get the lowest PK value for which Allocation = 'N' and update Allocation to 'Y', all in a single transaction.
When/if the operator decides to return the message to the queue, just set its Allocation flag back to 'N' and it's back in the queue.
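A minimal sketch of that pattern, using sqlite3 only so it is self-contained (table and column names are made up; on SQL Server you would wrap the same SELECT + UPDATE in one transaction, typically with locking hints so two operators cannot claim the same row):

```python
import sqlite3

conn = sqlite3.connect("messages.db", isolation_level=None)  # autocommit; we manage transactions
conn.execute("""CREATE TABLE IF NOT EXISTS messages (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    body TEXT NOT NULL,
                    allocated INTEGER NOT NULL DEFAULT 0)""")

def submit(body: str) -> None:
    # Website users append messages to the queue.
    conn.execute("INSERT INTO messages (body) VALUES (?)", (body,))

def claim_next():
    # Pick the oldest unallocated message and mark it allocated, atomically.
    conn.execute("BEGIN IMMEDIATE")
    row = conn.execute("SELECT id, body FROM messages "
                       "WHERE allocated = 0 ORDER BY id LIMIT 1").fetchone()
    if row is None:
        conn.execute("ROLLBACK")
        return None
    conn.execute("UPDATE messages SET allocated = 1 WHERE id = ?", (row[0],))
    conn.execute("COMMIT")
    return row

def release(message_id: int) -> None:
    # Operator declined the message: put it back so someone else can take it.
    conn.execute("UPDATE messages SET allocated = 0 WHERE id = ?", (message_id,))

def delete(message_id: int) -> None:
    # Operator handled the message: remove it from the queue.
    conn.execute("DELETE FROM messages WHERE id = ?", (message_id,))
```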
This is just an example; the database in this case gives you consistency, performance under heavy load, and so on.
Behind the scenes, all the data you submit to SSB is stored and manipulated as tables anyway, so there is no reason for it to be necessarily faster than a plain database solution.