Centralized data access or variables - c

I'm trying to find a way to access a centralized database for both retrieval and update.
The following is what I'm looking for.
Server 1 has this variable for example
int counter;
Server 2 will be interacting with the user, and will increase the counter whenever the user uses the service, until a certain threshold is reached. When this threshold is reached, server 2 will start rejecting the user's access.
Also, the user will be able to use multiple servers (like server 2) from multiple locations, and each time the user accesses any server the counter will be increased.
I tried Google, but it's hard to search for something without a name.

One approach to designing this is sharding by user, i.e. splitting the users between your servers depending on the user's ID. That is, if you have 10 servers numbered 0 to 9, then users whose IDs end in 2 would have all of their data stored on server 2, and so on. This assumes that user IDs are distributed uniformly.
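As a rough illustration of sharding by user ID, here is a minimal Python sketch. The names `CounterShard`, `shard_for_user` and `handle_request` are made up for this example, and the "servers" are just in-process objects standing in for real machines; the threshold of 3 is arbitrary.

```python
from collections import defaultdict


class CounterShard:
    """Stand-in for one server that owns the counters of some users."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.counters = defaultdict(int)  # user_id -> number of uses so far

    def try_use(self, user_id: int) -> bool:
        """Count one use; return False once the user has hit the threshold."""
        if self.counters[user_id] >= self.threshold:
            return False
        self.counters[user_id] += 1
        return True


def shard_for_user(user_id: int, num_servers: int) -> int:
    """Pick the owning server; with 10 servers this is the last digit of the ID."""
    return user_id % num_servers


# Ten hypothetical shards, each rejecting a user after 3 uses.
shards = [CounterShard(threshold=3) for _ in range(10)]


def handle_request(user_id: int) -> bool:
    """Any front-end server routes the request to the shard owning the user."""
    return shards[shard_for_user(user_id, len(shards))].try_use(user_id)
```

Because every request for a given user ends up on the same shard, there is exactly one place where that user's counter lives, no matter which front-end server the user talked to.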
One other approach is to shard the users by location - if you have servers in Asia vs Europe, for example. You'd need a property in the User record that tells you where the user is located; based on that, you'll know which server to route them to.
Ultimately, all of these design options have a concept of "where does the master record for a user reside?" Each of these approaches attempts to definitively answer this question.
A different category of approaches has to do with multi-master replication, which is supported by some database vendors; this approach does not scale as well (i.e. it's hard to get it to scale to 20 servers), but you might want to look into it, too.

Related

Concurrent writes to a shared network resource

Here is the context for the problem I am trying to solve.
There are computers A and B, as well as a server S. Server S implements some backend which handles incoming requests in a RESTful manner.
The backend S has a shelf. The goal of users A and B is to make S create and place numbered boxes on that shelf. A unique constraint is that no two boxes can have the same number. Once a box is created, S should return that box (JSON, or xml...) back to A and B with its allocated number.
The problem boils down to concurrency, as A's and B's POST ("create-numbered-box") transactions may arrive at the database at exactly the same time and hence get cancelled (?). As a reminder, there is a unique constraint: no two boxes are allowed to have the same number.
What are possible ways to solve this problem? I would prefer not to lock the database, so I am looking for alternatives to that. You may assume that between the database and the backend layer calling it there is an extra layer of abstraction, e.g. a microservice or a message queue, or nothing at all, i.e. the backend executes queries directly against the database. If you think a Postgres database is not a good choice compared to, say, a graph, document, or key-value store, feel free to substitute it.
The end goal is that, given concurrent writes, users A and B each get a response to their create (POST) request and each has a box on the shared shelf with a unique number, with no "Oops, something went wrong. Please retry" type of server response.
I described a simple world with users A and B but that can in theory go up to 10 000 users writing, not just 2.
As a secondary question, I'd like to ask: is there a way to test conflicting concurrent transactions in Postgres?
I will go first.
My idea is, let A and B send requests and fail. Once they fail, have retries with random timeouts in some interval. Let's say up to 3 retries. This way for A and B I will try to separate the requested writes to the db and this would allow for some degree of successful resolution of the scenario. However, I don't think this is a clean solution and I am looking for alternatives you can think of. Just, please keep in mind the constraints and freedoms I mentioned above.
Databases such as Postgres can generate unique numbers themselves (see PostgreSQL - SERIAL - Generate IDs (Identity, Auto-increment)). So the logic for your backend service S could be:
look up whether the user already has a record in the database
return the id if it does
otherwise, create a record and return the newly allocated id
To avoid creating multiple boxes for the same user you need to serialize the lookup/create logic based on user id. Approaches to that vary from merely handling one request at a time in your service S to, for example, having Kafka topics that partition requests to different instances of service S based on user ids -- all depends on the scale.
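The lookup/create steps above can be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 with an AUTOINCREMENT column as a stand-in for a Postgres SERIAL column, and a single in-process lock as the simplest form of "one request at a time". The table layout and the function name `get_or_create_box` are assumptions made for this example.

```python
import sqlite3
import threading

# In-memory SQLite stands in for Postgres; AUTOINCREMENT plays the role of SERIAL.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute(
    "CREATE TABLE boxes ("
    " number INTEGER PRIMARY KEY AUTOINCREMENT,"
    " user_id TEXT NOT NULL UNIQUE)"
)

# Serializes the lookup/create logic: the service handles one request at a time.
_lock = threading.Lock()


def get_or_create_box(user_id: str) -> int:
    """Return the user's box number, creating the box if it does not exist yet."""
    with _lock:
        row = conn.execute(
            "SELECT number FROM boxes WHERE user_id = ?", (user_id,)
        ).fetchone()
        if row is not None:
            return row[0]  # the user already has a box
        cur = conn.execute("INSERT INTO boxes (user_id) VALUES (?)", (user_id,))
        conn.commit()
        return cur.lastrowid  # number allocated by the database


box_a = get_or_create_box("A")
box_b = get_or_create_box("B")
```

The database hands out the numbers, so two boxes can never share one, and the lock keeps two concurrent requests for the same user from both taking the "create" branch. With several instances of S, the lock would have to be replaced by something that partitions requests by user id, as described above.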

Writing to many replicas of MongoDB

Let's say I have a distributed application that writes to the database. To decrease latency, one of the instances (app + database) is hosted in Australia and another one is hosted in Europe. Both database instances need to share the same data.
So what we are after here is data locality. The reason for it is obvious: we don't want users in Australia shooting requests to our database in Europe because that would increase latency.
The natural choice would be to deploy both database instances in one replica set. But it seems that with MongoDB you can write to only one instance within a replica set.
What are the strategies with MongoDB for having two database instances that share the same data and that both accept writes? Or is MongoDB just the wrong choice for this requirement?
Huge subject, but I'll try to give you a short and simple answer:
As your two instances must share the same data, you can't use a sharded cluster with zones. But a replica set can be your solution:
Create a replica set with at least the following:
a server in a 'neutral' zone. It will be the primary server (give it a higher priority). This server, as long as it is still primary, will handle your write operations.
your two existing servers, with lower priority.
Set the Read Preference in your application to 'nearest'. This way, your read operations will be handled by the server with the lowest network latency, regardless of whether it is the primary or a secondary.
But I highly recommend that you check the documentation to see how to deploy this architecture correctly. Here's a good start
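To make the layout above a bit more concrete, here is a small Python sketch that only builds the two pieces of configuration involved: a replica set configuration document with member priorities, and a connection string with readPreference=nearest. The host names, the set name "rs0" and the helper names are placeholders invented for this example; nothing here talks to a real MongoDB server.

```python
from urllib.parse import urlencode


def replica_set_config(name: str, primary_host: str, secondary_hosts: list) -> dict:
    """Build a replica set config where the 'neutral' host has the highest priority."""
    members = [{"_id": 0, "host": primary_host, "priority": 2}]
    for i, host in enumerate(secondary_hosts, start=1):
        members.append({"_id": i, "host": host, "priority": 1})
    return {"_id": name, "members": members}


def connection_uri(name: str, hosts: list) -> str:
    """Build a connection string that sends reads to the nearest member."""
    query = urlencode({"replicaSet": name, "readPreference": "nearest"})
    return "mongodb://" + ",".join(hosts) + "/?" + query


# Hypothetical hosts: one in a neutral zone, one in Europe, one in Australia.
hosts = [
    "neutral.example.com:27017",
    "eu.example.com:27017",
    "au.example.com:27017",
]
config = replica_set_config("rs0", hosts[0], hosts[1:])
uri = connection_uri("rs0", hosts)
```

The config document is the kind of thing you would pass when initiating the replica set, and the URI is what the application in each region would connect with. Writes still go to the primary only; just the reads become local.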
EDIT
Some considerations about this solution:
This is one of the rare use cases where it's better to read from secondaries. In general, prefer reading your data from the primary, since replica sets are designed for high availability, not for scalability.
If some of your data can be 'located' so that it is accessed faster, consider sharding collections as a better solution.

NoSQL database for high read concurrency

Hi, I am using LDAP to store user configuration. When I started I had a small amount of data; it has now grown to more than 20 million records.
I now face performance issues. I chose LDAP because user configuration is updated far less often than it is read and searched.
I want to replace LDAP with a NoSQL database that can provide 20,000 read operations per second over more than 50 million records.
The data in LDAP is user info, credentials, and user-specific settings. The issue arises because of all-ids. I have indexed the data on first name and last name. Sun LDAP did well when I had less data, around 500K records. When my data increased to around 5 million records, I started having problems with searches: the search was not indexed. Later I found that the issue is related to all-ids. For example, Chavan is a very common surname in India; when it appears in more entries than the all-id-threshold property allows, my search always fails. I have increased the all-id threshold many times, but that has a performance cost. So I want to get rid of LDAP and use a NoSQL database.
Which database do you want to use? In MongoDB, for example, you can use sharding for some load balancing.
Also, think carefully about whether to switch to NoSQL at all, because that depends heavily on the application logic.

Delphi Solution for data replication between two remote sites loosely connected

I'm using Delphi XE4 Architect (Delphi XE3 is OK as well).
I need to find a smart solution to the following problem
and I would like to use one of these frameworks: kbmMW or RemOjects SDK / DataAbstract or RealThinClient
Currently I have an application using a very simple MSSQL database on a site A that is used by users of a site B through the remote desktop.
The application sometimes needs to show some pictures and also view some PDF, but it is mostly text data entry.
There is no particular reason for me to use MSSQL,
but it is a database that I found already active and populated and I have not built it myself.
And now, it would be complicated to change it.
(Database is not important, not using specific features nor stored procedures nor triggers)
Users of Site B are connected to Site A via a very slow network connection,
and occasionally the connection is unavailable for a few hours, up to one day (this is the major problem).
The situation of the connection, unfortunately, cannot be improved for various reasons.
The database is quite simple. It has many tables that hardly ever change;
about ten, however, are updated daily and may potentially be subject to concurrent changes.
The records of these tables are mainly locked for update
by a single user, who edits some fields and then saves, releasing the lock.
I would like to get something very different to optimize performance.
Users of the A site have higher priority, they are more important, because the A site is the headquarters.
I would like to have a copy of the Site A database at Site B,
so that users of Site B can work locally and much faster, without using a remote desktop connection to Site A.
The RDP protocol is not very optimized, and in any case users cannot work when the connection is down.
Synchronizing updates and record locks between the two databases may not be a big problem.
Basically, when a user of Site B starts editing a record in database B,
a user of Site A obviously should not be able to modify the same record in the Site A database.
This should also work in the opposite direction, of course.
My big problem is figuring out how best to handle the situation that occurs
when the connection between B and A is unavailable for some hours (while transactions/events keep accumulating at Site B).
Events at Site A generally take priority (on collision) over events at Site B.
Users of the Site B must be able to continue working.
When the connection becomes active, the changes should be sent to the database at Site A.
Obviously this can result in conflicts, but the changes made to a record
by Site B users can either be discarded or go through a supervised selective merge,
approved record by record by a user of Site B.
Well, I hope the scenario is explained clearly enough.
Additional info:
The DB schema is very simple: only tables, no triggers or stored procedures. I can build one as an example, but imagine 10 tables that can be updated concurrently.
The DB is used by a desktop app of the sales department, so it contains highly confidential data.
The remote connection is typically 512 Kbit max, but the main problem here is that the connection may sometimes be down
and users at the remote site must be able to work anyway. This is the main focus.
Total data of daily updates could be at max 10 Mb, compressed, only for DB connections. Some other data is synchronized
over the same connection, but it is not part of this job.
I don't want to use MSSQL-specific tools or services (replication and so on), because the DB could change in the future.
Thanks
We do almost exactly this using a Delphi client app, a kbmMW-based Delphi server app, and an MSSQL database (though it used to work quite happily on a DBISAM database too).
We have some tables that only the head office site users are allowed to modify. The smaller tables are transferred in their entirety each time there is a "merge". The larger tables and the transaction type tables all have a date added and/or a date modified field and only those records that have been changed or added in the last 3 weeks or so (configurable) are transferred. This means sites can still update to the latest data even if they have been disconnected for quite some time - we used to have clients in remote places on dubious dial up lines!
We only run the merge routines once or twice a day but it would work equally well on an hourly basis or other time schedule.
At given times of day each site (including head office) "export" their changed/new records to files (eg client dataset tables or similar). These are then zipped up by the application and placed in an "outgoing" folder. The zip file is named based on the location id, date, time etc. The files are transferred by some external means eg via FTP / file share / email etc etc. Each branch office sends/transfers its data files to head office and head office transfers its data to each branch. The files are transferred by whatever means to an "incoming" folder.
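A minimal sketch of such an export step, in Python rather than Delphi just to keep it short. The record layout (dicts with a "modified" timestamp), the JSON payload and the file-name pattern are assumptions made for this illustration, not the format we actually use.

```python
import io
import json
import zipfile
from datetime import datetime, timedelta


def export_changes(records, location_id, now, window_days=21):
    """Zip up the records changed within the window; return (file name, zip bytes)."""
    cutoff = now - timedelta(days=window_days)
    changed = [r for r in records if r["modified"] >= cutoff]

    # The file name encodes the location id plus date and time of the export.
    name = f"{location_id}_{now:%Y%m%d_%H%M%S}.zip"

    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # default=str turns the datetime values into plain strings.
        zf.writestr("records.json", json.dumps(changed, default=str))
    return name, buf.getvalue()
```

The returned bytes would be written into the "outgoing" folder under the returned name and then shipped by whatever transport is available. The 21-day default mirrors the "last 3 weeks or so" window and would be configurable per site.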
On a regular basis (eg hourly) each location does a check on the incoming folder to see if there is anything new for it to import. If so it adds all the new records, branch locations overwrite the head-office data tables with the new ones and edited records are merged in "somehow". This is the tricky bit. The easiest policy is "head office wins" so all edits are accepted unless there is a conflict in which case the head office version wins. Alternatively you could use "last edited wins" - but then you need to make sure clocks are in sync across locations. The other option is to add conflicting records to some form of "suspense" status and let an end user decide at some point in the future. We do this on one data set. Whichever conflict method you choose you need to record each decision in some form of log table and prompt an administrative level user to check occasionally.
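For illustration, the "head office wins" policy could look roughly like this Python sketch. The data shapes (a dict keyed by GUID, a set of keys edited at head office since the last merge) and the function name `merge_branch_edits` are simplifications invented for this example.

```python
def merge_branch_edits(master, head_office_edited, branch_edits, log):
    """Apply branch edits to the master data under a 'head office wins' policy.

    master:             dict of primary key (GUID) -> record value at head office
    head_office_edited: set of keys edited at head office since the last merge
    branch_edits:       dict of key -> record value coming from a branch
    log:                list that receives one (key, decision) entry per record
    """
    for key, branch_value in branch_edits.items():
        conflict = key in head_office_edited and master.get(key) != branch_value
        if conflict:
            # Both sides changed the record: keep the head office version.
            log.append((key, "conflict: head office version kept"))
            continue
        master[key] = branch_value
        log.append((key, "branch edit accepted"))


master = {"guid-1": "hq edit", "guid-2": "old value"}
log = []
merge_branch_edits(
    master,
    head_office_edited={"guid-1"},
    branch_edits={"guid-1": "branch edit", "guid-2": "branch update", "guid-3": "new"},
    log=log,
)
```

Every decision lands in the log, which is what an admin user would review after the fact. Swapping in "last edited wins" or a "suspense" queue would only change the branch taken on a conflict.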
When the head office imports data or when data is added at the head office then a field is set to indicate the data is part of the master data. When branches add data this field is empty to indicate it has yet to reach the master set. This helps when branches export their data as they can include all data that doesn't have this field set.
We have found that you can't run the merge interactively as you'll end up never getting any work done and you won't be able to run the merge at night etc. It needs to be fully automated with the ability for an admin user to make adjustments at some point after the fact.
We've been running this approach for several years now on multi-site operations, and once it settled down it has worked pretty much flawlessly. With 2 export/import schedules per day we have found the branch offices run perfectly well and are never missing more than a day's worth of transactions. It works well in our scenario, where we don't often have conflicts. Exported data is in the region of 5-10MB, which zips up plenty small enough.
Primary keys are vital! We use a GUID and it hasn't let us down yet.
The choice of database server and n-tier framework are, actually, irrelevant. It's the process that matters here.
"Basically, when a user of Site B starts editing a record in database B, a user of Site A obviously should not be able to modify the same record in the Site A database. This should also work in the opposite direction, of course."
I can't see how you're ever going to make this bit work reliably if both sites have their own copy of the database and you're allowing for dropped/non-existent inter-site connections on occasion.

About YouTube view counts

I'm implementing an app that keeps track of how many times a post is viewed. But I'd like to keep a 'smart' way of keeping track. This means, I don't want to increase the view counter just because a user refreshes his browser.
So I decided to only increase the view counter if IP and user agent (browser) are unique. Which is working so far.
But then I thought: if YouTube is doing it this way, and they have many videos with thousands or even millions of views, their views table in the database would be overly populated with IPs and user agents...
Which brings me to the assumption that their video table has a counter cache for views (i.e. views_count). This means that when a user clicks on a video, the IP and user agent are stored, and the counter cache column in the video table is incremented.
Every time a video is clicked, YouTube would need to query the views table and count the number of entries. Won't this affect performance drastically?
Is this how they do it? Or is there a better way?
I would leverage client side browser fingerprinting to uniquely identify view counts. This library seems to be getting significant traction:
https://github.com/Valve/fingerprintJS
I would also recommend using Redis for anything to do with counts. Its atomic increment commands are easy to use and guarantee your counts never get messed up by race conditions.
This would be the command you would want to use for incrementing your counters:
http://redis.io/commands/incr
The key in this case would be the browser fingerprint hash sent to you from the client. You could then have a Redis "set" that would contain a list of all browser fingerprints known to be associated with a given user_id (the key for the set would be the user_id).
Finally, if you really need to, you run a cron job or other async process that dumps the view counts for each user into your counter cache field for your relational database.
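Here is a rough Python sketch of the counting part. To keep it runnable without a Redis server it uses a tiny in-memory stand-in that mimics two commands: SADD (returns 1 only when the member was newly added) and INCR (returns the new value). The class `InMemoryRedis`, the function `record_view` and the key names are assumptions for this example; in particular, keeping one set of fingerprints per post is just one possible key layout.

```python
class InMemoryRedis:
    """Tiny stand-in for the SADD and INCR commands; not a real Redis client."""

    def __init__(self):
        self._sets = {}
        self._counters = {}

    def sadd(self, key, member):
        """Add member to the set at key; return 1 if it was new, else 0."""
        members = self._sets.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1

    def incr(self, key):
        """Increment the counter at key (starting from 0) and return the new value."""
        self._counters[key] = self._counters.get(key, 0) + 1
        return self._counters[key]


def record_view(r, post_id, fingerprint):
    """Count the view only if this fingerprint has not seen the post before."""
    if r.sadd(f"post:{post_id}:viewers", fingerprint) == 1:
        r.incr(f"post:{post_id}:views")
        return True
    return False


r = InMemoryRedis()
first = record_view(r, 42, "fp-abc")    # new fingerprint, view is counted
refresh = record_view(r, 42, "fp-abc")  # same fingerprint again, ignored
```

Against a real Redis server each of the two commands is atomic on its own, which is what makes this safe under concurrent requests.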
You could also take the approach where you store user_id, browser fingerprint, and timestamps in a relational database (mysql?) and counter cache them into your user table periodically (probably via cron).
First of all, AFAIK YouTube uses Bigtable, so don't worry about how they query the count; we don't know the exact structure of their database anyway.
Assuming that you are on a relational model, create a view_count column, but do not update it on every refresh. Record the visits and periodically update the cache.
Also, you can generate a hash from the IP, browser, date, and any other information you use to detect whether this is a unique view, and store that instead of the whole data.
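A minimal sketch of the hashing idea in Python. The choice of SHA-256, the "|" separator, and the in-memory `seen` set (standing in for whatever table or cache you store the hashes in) are assumptions for this example.

```python
import hashlib
from datetime import date


def view_fingerprint(ip: str, user_agent: str, day: date) -> str:
    """Collapse IP, browser and date into one fixed-size hash."""
    raw = f"{ip}|{user_agent}|{day.isoformat()}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


# Stand-in for the stored hashes: (post_id, fingerprint) pairs already counted.
seen = set()


def is_new_view(post_id: int, ip: str, user_agent: str, day: date) -> bool:
    """Return True the first time this visitor views the post on a given day."""
    key = (post_id, view_fingerprint(ip, user_agent, day))
    if key in seen:
        return False
    seen.add(key)
    return True
```

Because the date is part of the hash, the same visitor counts again on the next day. Leave the date out if a visitor should only ever count once per post.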
Also, you can use a session/cookie to record which videos have been viewed. Since it will expire, memory won't be much of a problem. I don't believe anyone views thousands of videos in one session.
If you want to store all the IPs and browsers, then make sure you have enough DB storage space, add an index and that's it.
If not, then you can use the rails session to store the list of videos that a user has visited, and only increment the view_count attribute of a video when he's visiting a new video.

Resources