Currently I have a monolithic process written in C that performs the following operations:
a. Accept data from the user (in a certain payload format).
   IF the user is providing data for the first time, validate that it is standard
   compliant and store it in the user data-store.
   ELSE compare it with the data the user provided previously, which is stored in the
   user data-store, and check whether it is a duplicate.
      IF DUPLICATE, skip it.
      ELSE validate the data provided by the user and update the previous information
      in the user data-store.
b. Whenever the user data-store is updated:
   Initialize the analytics app to process the information stored in the user
   data-store and generate the final output.
   Replicate the updates into the backup user data-store for future reference.
Currently I am facing scale issues, especially with the rate at which users keep submitting data. As a result, the current application can accept only a limited number of user data requests with one monolithic process.
Adding pthreads would be the next step to scale the application, but that can add complexity to the codebase, and proper care has to be taken with locking for consistency.
I want to try a microservices approach, dividing the current monolithic process into multiple microservices as described below, so that I can improve each process separately in future updates.
Process-01: Contains the logic to accept data from the user, perform all standard verifications, and update the shared data-store.
Process-02: Whenever the shared data-store is updated, processes the data in it and creates the final output.
Process-03: Takes care of replicating the shared data-store.
process-01 ==== (shared-datastore) === process-02
||
process-03
It seems I need a shared data-store from which process-01 and process-02 would request the user-provided data for processing. Would this create a bottleneck, since I need to create a copy of the requested data as a message and share it with the processes? For process-02, I need to copy the whole data-store so that it can work on the data independently.
Ideally, process-01 needs both read and write permissions, while process-02 needs only read permission for the data in the shared data-store.
Since we are using C for our application, please suggest whether the planned design has any more caveats. Is there a better design for the same problem?
Any other design suggestions would be of great help. Pointers to similar application designs in C would also be helpful.
I have previously built some very basic real-time applications with the help of sockets and have been reading more about the topic out of curiosity. One very interesting article I read was about Operational Transformation, and I learned several new things. After reading it, I kept wondering when and how this data is actually saved to the database if I want to keep it. I have two assumptions/theories about what might be going on, but I'm not sure whether they are correct and/or the best solutions to this issue. They are as follows:
(For this example, let's assume it's a real-time collaborative whiteboard:)
For every edit that happens (e.g. drawing a line), the socket sends a message to everyone collaborating, and at the same time I store the data in my database. The problem I see with this solution is the number of times I would need to access the database: for every line a user draws, I would have to access the database to store it.
Use polling. For this theory, I think of saving all the data in temporary storage on the server, and then, after 'x' amount of time, taking all the data from the temporary storage and saving it in the database. The issue with this theory is the possibility of a failure in the temporary storage (e.g. an electrical failure). If the temporary storage loses its data before it is saved in the database, I would never be able to recover it.
How do similar real-time collaborative applications like Google Docs, Slides, etc. store the data in their databases? Are they following one of the theories I mentioned, or do they have a completely different way to store the data?
They probably rely on a log of changes + the latest document version + periodic snapshots (if they allow time traveling through the document history).
It is similar to how most databases' transaction systems work. After validating that the change is legitimate, the database writes the change to a very fast append-only data structure on disk, a.k.a. the log. This log is mirrored in memory with a dedicated data structure to speed up reads.
When a read comes in, the database checks the in-memory data structure and merges the changes with what is stored in the cache or on disk.
Periodically, the changes that are present in memory and in the log are merged into the on-disk data structure.
So to summarize, in your case:
When an Operational Transformation arrives at the server, two things happen:
It is stored in the database as is, to avoid any loss (the equivalent of the log).
It updates an in-memory data structure so that the change can be replayed quickly in case a user requests the latest version (the equivalent of the in-memory data structure).
When a user requests the latest document, the server checks the in-memory data structure and replays the changes against the last stored consolidated document, which might be lagging behind because of the following point.
Periodically, the log is applied to the "last stored consolidated document" to reduce the number of OTs that must be replayed to produce the latest document.
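The summary above can be sketched roughly as follows. This is only a toy model, not how Google Docs or Etherpad actually do it: `DocumentStore`, `apply_op` and the `('insert', pos, text)` / `('delete', pos, count)` operation format are names made up for illustration, a Python list stands in for the durable database log, and the operations are assumed to have already been transformed against concurrent edits.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentStore:
    """Toy model: durable log + in-memory pending ops + consolidated snapshot."""
    snapshot: str = ""                                # last stored consolidated document
    durable_log: list = field(default_factory=list)   # stands in for the database log
    pending: list = field(default_factory=list)       # in-memory ops not yet merged

    def submit(self, op):
        """Record an operation: append it to the log first, then keep it in memory."""
        self.durable_log.append(op)
        self.pending.append(op)

    def latest(self):
        """Replay pending operations on top of the (possibly lagging) snapshot."""
        text = self.snapshot
        for op in self.pending:
            text = apply_op(text, op)
        return text

    def compact(self):
        """Periodic step: fold the pending operations into the snapshot."""
        self.snapshot = self.latest()
        self.pending.clear()


def apply_op(text, op):
    """Apply a hypothetical ('insert', pos, s) or ('delete', pos, n) operation."""
    kind, pos, arg = op
    if kind == "insert":
        return text[:pos] + arg + text[pos:]
    if kind == "delete":
        return text[:pos] + text[pos + arg:]
    raise ValueError(f"unknown operation: {kind}")
```

Reads only ever replay the operations submitted since the last `compact()`, which is the point of the periodic consolidation step.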
Anyway, the best way to have a definitive answer is to look at open-source code that does what you are looking for, e.g. etherpad.
Here is the context for the problem I am trying to solve.
There are computers A and B, as well as a server S. Server S implements some backend which handles incoming requests in a RESTful manner.
The backend S has a shelf. The goal of users A and B is to make S create and place numbered boxes on that shelf. A unique constraint is that no two boxes can have the same number. Once a box is created, S should return that box (JSON, or xml...) back to A and B with its allocated number.
The problem boils down to concurrency, as A and B's POST ("create-numbered-box") transactions may arrive at the database at exactly the same time and hence get cancelled (?). As a reminder, there is a unique constraint: no two boxes are allowed to have the same number.
What are possible ways to solve this problem? I wouldn't like to lock the database, so I am looking for alternatives to that. You may assume that between the database and the backend layer calling it there is an extra layer of abstraction, e.g. a microservice or a message queue, or nothing at all (the backend executes queries against the database directly). If you think a Postgres database is a worse choice than, say, a graph, document, or key-value database, feel free to substitute it.
The goal, in the end, is that given concurrent writes, users A and B both get responses to their create (POST) requests and each of them has a box on the shared shelf with a unique number, with no "Oops, something went wrong. Please retry" type of server response.
I described a simple world with users A and B but that can in theory go up to 10 000 users writing, not just 2.
As a secondary question, I'd like to ask, is there a way to test conflicting concurrent transactions in postgres?
I will go first.
My idea is, let A and B send requests and fail. Once they fail, have retries with random timeouts in some interval. Let's say up to 3 retries. This way for A and B I will try to separate the requested writes to the db and this would allow for some degree of successful resolution of the scenario. However, I don't think this is a clean solution and I am looking for alternatives you can think of. Just, please keep in mind the constraints and freedoms I mentioned above.
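A minimal sketch of that retry idea, just to make it concrete. `ConflictError` and `retry_with_jitter` are hypothetical names, and the `operation` callable stands for whatever performs the insert and raises when it loses the uniqueness race.

```python
import random
import time


class ConflictError(Exception):
    """Hypothetical error raised when a write loses a uniqueness race."""


def retry_with_jitter(operation, max_retries=3, max_delay=0.05, sleep=time.sleep):
    """Call operation(); on ConflictError wait a random delay and try again."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except ConflictError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure to the caller
            # Random delay so that competing clients are unlikely to collide again.
            sleep(random.uniform(0, max_delay))
```

With many concurrent writers, some requests can still exhaust their retries, which is why this only gives "some degree" of resolution.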
Databases such as Postgres can generate a unique number for you (see PostgreSQL - SERIAL - Generate IDs (Identity, Auto-increment)). So the logic for your backend service S could be:
lookup if user has a record in the database already
return the id if it does
otherwise, create a record and return the newly allocated id
To avoid creating multiple boxes for the same user you need to serialize the lookup/create logic based on user id. Approaches to that vary from merely handling one request at a time in your service S to, for example, having Kafka topics that partition requests to different instances of service S based on user ids -- all depends on the scale.
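As a rough illustration of the lookup/create logic, here is a sketch that uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, with `AUTOINCREMENT` playing the role of `SERIAL`. The `boxes` table, the `get_or_create_box` function and the single process-wide lock are assumptions for the example; the lock is the "one request at a time" end of the spectrum mentioned above.

```python
import sqlite3
import threading

# sqlite3 stands in for PostgreSQL here; AUTOINCREMENT plays the role of SERIAL.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute(
    "CREATE TABLE boxes ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " user_id TEXT NOT NULL UNIQUE)"
)
db_lock = threading.Lock()  # simplest serialization: one request at a time


def get_or_create_box(user_id):
    """Return the user's box number, creating the box if it does not exist."""
    with db_lock:
        row = conn.execute(
            "SELECT id FROM boxes WHERE user_id = ?", (user_id,)
        ).fetchone()
        if row is not None:
            return row[0]
        cursor = conn.execute("INSERT INTO boxes (user_id) VALUES (?)", (user_id,))
        conn.commit()
        return cursor.lastrowid  # number allocated by the database
```

The database allocates the numbers, so two users never receive the same one; the lock only prevents the same user from getting two boxes.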
The users on my website do operations like login, logout, update profile, change passwords etc. I am trying to come up with something that can store these user activities for my users and also return the matching records in case some system asks me for them based on userIds.
As far as I can tell, the problem is write-intensive (since users keep logging into the website very often and I have to record that). Once in a while (say, when a user clicks on history or some reporting team needs it), the data records are read and returned.
So my application is write-intensive.
There are various approaches I can think of.
Approach 1: The system that receives the user activities keeps writing them into a queue, and another one keeps fetching from that queue (periodically or when it is completely full) and writes them into the database (which has been sharded, say on a hash of the userId).
The problem with this approach is that if my activity manager runs on multiple nodes, it has to send those records to various shards, which means a lot of data movement over the network.
Approach 2: The system that receives the user activities keeps writing them into a queue, and another one keeps fetching from that queue (periodically or when it is completely full) and writes them into a read-through/write-through cache, which takes care of writing into the database.
The problem with this approach is that I do not know whether I can control where those records are written (I mean, to which shard). Basically, I do not know how the write-through cache works (does it map to a local DB, or can it send data to shards?).
Approach 3: The login operation is the most common user activity in my system. I can have a separate queue for logins, which must be periodically flushed to disk.
Approach 4: Use some cloud-based storage that acts as an in-memory queue where the data coming from all nodes is stored. This would be a reliable cache that guarantees no data loss. Periodically read from this cache and store the data in the database shards.
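To make Approach 1 more concrete, here is a rough sketch of the consumer side: drain a batch from the queue and group it by shard, so that each flush sends one bulk write per shard. `NUM_SHARDS`, `shard_for`, `drain_batch`, `group_by_shard` and the event dictionary layout are all assumptions for illustration; the actual write to each shard is left out.

```python
import hashlib
import queue
from collections import defaultdict

NUM_SHARDS = 4  # assumed shard count


def shard_for(user_id):
    """Map a user id to a shard with a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def drain_batch(activity_queue, max_items=100):
    """Take up to max_items events off the queue without blocking."""
    batch = []
    while len(batch) < max_items:
        try:
            batch.append(activity_queue.get_nowait())
        except queue.Empty:
            break
    return batch


def group_by_shard(batch):
    """Group events so that each shard receives one bulk write per flush."""
    grouped = defaultdict(list)
    for event in batch:
        grouped[shard_for(event["user_id"])].append(event)
    return dict(grouped)
```

Batching does not remove the network traffic to the shards, but it turns many small writes into a few larger ones per flush interval.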
There are many problems to solve:
1. Ensuring I do not lose data (what kind of data replication to use, i.e. any queue that ensures reliability)
2. Ensuring my frequent writes do not degrade performance
3. Avoiding a single point of failure
4. Achieving infinite scale
I need suggestions for the above based on existing solutions.
I am designing an application and I have two ideas in mind (below). I have a process that collects approx. 30 KB of data every 5 minutes, and this data needs to be updated on the client (web side, 100 users at any given time). The information collected does not need to be stored for future use.
Options:
I can get data and insert into database every 5 minutes. And then client call will be made to DB and retrieve data and update UI.
Collect data and put it into Topic or Queue. Now multiple clients (consumers) can go to Queue and obtain data.
I am leaning toward option 2 as the better solution because it is faster (no DB calls) and there is no redundant storage.
Can anyone suggest which would be the ideal solution, and why?
I don't really understand the difference. The data has to be temporarily stored somewhere until the next update, right?
But all users can see it, not just the first person to get there, right? So a queue is not really an appropriate data structure from my interpretation of your system.
Whether the data is written to something persistent like a database or something less persistent like part of the web server or application server may be relevant here.
Also, you have tagged this as real-time, but I don't see how the web clients are getting updates in real time without some kind of push, long polling, or similar.
It seems to me that you need to use a queue and the publisher/subscriber pattern.
This is an article about RabbitMQ and the Publish/Subscribe pattern.
I can get data and insert into database every 5 minutes. And then client call will be made to DB and retrieve data and update UI.
You can program your application to be event-oriented. For example, raise domain events and publish messages to your subscribers.
When you use a queue, each subscriber dequeues the messages addressed to it, in order (FIFO). In addition, there is a delivery guarantee, unlike with a database, where a record can be deleted before every 'subscriber' has received the message.
The pitfalls of using the database to accomplish this are:
Creating indexes makes querying faster, but inserts slower;
You will have to manage the delivery guarantee for every subscriber yourself;
You'll need a TTL (Time to Live) strategy for purging records (taking the delivery guarantee into account).
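To illustrate the publish/subscribe idea, here is a toy in-process sketch. It is not RabbitMQ's API; the `Broker` class and its methods are hypothetical. The point it shows is that each subscriber gets its own FIFO queue, so every client sees every update, not just the first one to ask.

```python
import queue


class Broker:
    """Toy in-process publish/subscribe broker with one FIFO queue per subscriber."""

    def __init__(self):
        self._queues = {}

    def subscribe(self, name):
        """Register a subscriber and return its private queue."""
        q = queue.Queue()
        self._queues[name] = q
        return q

    def publish(self, message):
        """Fan the message out: every subscriber's queue receives it."""
        for q in self._queues.values():
            q.put(message)
```

A real broker adds what this sketch leaves out: persistence, acknowledgements, and delivery to clients over the network.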
I'm a beginning web developer sitting on an ambitious web application project.
So after having done some research, I found out about SQL Service Broker. It seems like something I could use, but I'm not sure. Since learning it requires someone to put in lots of time, I wanted to be sure that it would fit my needs.
I need to implement a system where website users can submit text to the website. This stream of messages has to be redundant and dealt with in a FIFO way, with on the other end of the stream another group of users dealing with the messages.
Now, a message that is being read by one of this last group of users, should be locked so that no-one else can read it at the same time. The user can then decide to handle the message or not. Only if he decides to deal with the message can it be deleted from the queue. If he decides that he doesn't want to deal with the message, the message should be put back in the queue (at the end of the queue, or at least with the highest priority), so that another user can read it and decide.
Is this something I would be able to implement with SQL Service Broker? Am I on the wrong track?
Thank you!
IMO, the best use of Service Broker is for connecting two independent applications in a loosely coupled way. What I mean by that is that systems tied together in this way can communicate through a set of mutually agreed message types. This is in contrast to one application directly manipulating the other's database, for example.
From what you've said, I would implement it as a simple table, for example: create a message table with an identity PK, an Allocation flag, and your custom columns. Whenever an operator wants to fetch the next message, get the lowest PK value for which Allocation = 'N' and update Allocation to 'Y', all in a single transaction.
When/if the operator decides to return the message to the queue, just set its Allocation flag back to 'N' and it's back.
This is just an example. The database in this case provides you with consistency, performance under heavy load, etc.
Behind the scenes, all data you submit to SSB is stored and manipulated as tables, so there is no reason for it to necessarily be faster than a database solution.
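The table-based approach described above could look roughly like this. It is a sketch using Python's built-in sqlite3 module as a stand-in for SQL Server, so the locking details differ; the `messages` table, its `allocated` column and the function names are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manage transactions by hand
conn.execute(
    "CREATE TABLE messages ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " allocated TEXT NOT NULL DEFAULT 'N',"
    " body TEXT NOT NULL)"
)


def enqueue(body):
    """A website user submits a message."""
    conn.execute("INSERT INTO messages (body) VALUES (?)", (body,))


def claim_next():
    """Allocate the oldest unallocated message to the caller, or return None."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT id, body FROM messages WHERE allocated = 'N' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute("UPDATE messages SET allocated = 'Y' WHERE id = ?", (row[0],))
        conn.execute("COMMIT")
        return row
    except Exception:
        conn.execute("ROLLBACK")
        raise


def release(message_id):
    """Operator declined: make the message available again."""
    conn.execute("UPDATE messages SET allocated = 'N' WHERE id = ?", (message_id,))


def complete(message_id):
    """Operator handled the message: remove it from the table."""
    conn.execute("DELETE FROM messages WHERE id = ?", (message_id,))
```

Because a released message keeps its original PK, it is the first one handed out again, which matches the "highest priority" variant of putting it back in the queue.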