Can SQL Server database time be used for synchronizing multiple clients? - sql-server

This is an alternative approach to solving this problem, not a duplicate.
I want my Azure role to reprocess data in case of sudden failures. I consider the following option.
For every block of data to process I have a database table row and I could add a column meaning "time of last ping from a processing node". So when a node grabs a data block for processing it sets "processing" state and that time to "current time" and then it's the node responsibility to update that time say every one minute. Then periodically some node will ask for "all blocks that have processing state and ping time larger than ten minutes" and consider those blocks as abandoned and somehow queue them for reprocessing.
I have one very serious concern. The above approach requires that nodes have more or less the same time. Looks like I should not make assumptions about global time.
But all nodes talk to the very same database. What if I use that database time - with functions like GETUTCDATE() in SQL requests I will do exactly the same thing as I planned, but not I seem to not care whether nodes time is in sync - they will all use the database time.
Will this approach work reliably if I use the database time functions?

As a general rule this approach should work. If there is only one source for the current time then time synchronisation issues are not going to affect you.
But you have to consider what happens if your database server goes down. Azure will switch you over to another copy, on another server, in another rack and this database server is not guaranteed to be synchronised at least with regard to time with the original one.
Also this approach will undoubtably get you in trouble if ever you have to scale out your database.
I think i still prefer the Queue based approach.

Related

Concurrent editing of same data

I recently came up with a case that makes me wonder if I'm a newbie or something trivial has escaped to me.
Suppose I have a software to be run by many users, that uses a table. When the user makes login in the app a series of information from the table appears and he has just to add and work or correct some information to save it. Now, if the software he uses is run by many people, how can I guarantee is he is the only one working with that particular record? I mean how can I know the record is not selected and being worked by 2 or more users at the same time? And please I wouldn't like the answer use “SELECT FOR UPDATE... “
because for what I've read it has too negative impact on the database. Thanks to all of you. Keep up the good work.
This is something that is not solved primarily by the database. The database manages isolation and locking of "concurrent transactions". But when the records are sent to the client, you usually (and hopefully) closed the transaction and start a new one when it comes back.
So you have to care yourself.
There are different approaches, the ones that come into my mind are:
optimistic locking strategies (first wins)
pessimistic locking strategies
last wins
Optimistic locking: you check whether a record had been changed in the meanwhile when storing. Usually it does this by having a version counter or timestamp. Some ORMs and frameworks may help a little to implement this.
Pessimistic locking: build a mechanism that stores the information that someone started to edit something and do not allow someone else to edit the same. Especially in web projects it needs a timeout when the lock is released anyway.
Last wins: the second person storing the record just overwrites the first changes.
... makes me wonder if I'm a newbie ...
That's what happens always when we discover that very common stuff is still not solved by the tools and frameworks we use and we have to solve it over and over again.
Now, if the software he uses is runed by many people how can I guarantee is he
is the only one working with that particular record.
Ah...
And please I wouldn't like the answer use “SELECT FOR UPDATE... “ because for
what I've read it has too negative impact on the database.
Who cares? I mean, it is the only way (keep a lock on a row) to guarantee you are the only one who can change it. Yes, this limits throughput, but then this is WHAT YOU WANT.
It is called programming - choosing the right tools for the job. IN this case impact is required because of the requirements.
The alternative - not a guarantee on the database but an application server - is an in memory or in database locking mechanism (like a table indicating what objects belong to what user).
But if you need to guarantee one record is only used by one person on db level, then you MUST keep a lock around and deal with the impact.
But seriously, most programs avoid this. They deal with it either with optimistic locking (second user submitting changes gets error) or other programmer level decisions BECAUSE the cost of such guarantees are ridiculously high.
Oracle is different from SQL server.
In Oracle, when you update a record or data set the old information is still available because your update is still on hold on the database buffer cache until commit.
Therefore who is reading the same record will be able to see the old result.
If the access to this record though is a write access, it will be a lock until commit, then you'll have access to write the same record.
Whenever the lock can't be resolved, a deadlock will pop up.
SQL server though doesn't have the ability to read a record that has been locked to write changes, therefore depending which query you're running, you might lock an entire table
First you need to separate queries and insert/updates using a data-warehouse database. Which means you could solve slow performance in update that causes locks.
The next step is to identify what is causing locks and work out each case separately.
rebuilding indexes during working hours could cause very nasty locks. Push them to after hours.

How often should I have my server sync to the database?

I am developing a web-app right now, where clients will frequently (every few seconds), send read/write requests on certain data. As of right now, I have my server immediately write to the database when a user changes something, and immediately read from the database when they want to view something. This is working fine for me, but I am guessing that it would be quite slow if there were thousands of users online.
Would it be more efficient to save write requests in an object on the server side, then do a bulk update at a certain time interval? This would help in situations where the same data is edited multiple times, since it would now only require one db insert. It would also mean that I would read from the object for any data that hasn't yet been synced, which could mean increased efficiency by avoiding db reads. At the same time though, I feel like this would be a liability for two reasons: 1. A server crash would erase all data that hasn't yet been synced. 2. A bulk insert has the possibility of creating sudden spikes of lag due to mass database calls.
How should I approach this? Is my current approach ok, or should I queue inserts for a later time?
If a user makes a change to data and takes an action that (s)he expects will save the data, you should do everything you can to ensure the data is actually saved. Example: Let's say you delay the write for a while. The user is in a hurry, makes a change then closes the browser. If you don't save right when they take an action that they expect saves the data, there would be a data loss.
Web stacks generally scale horizontally. Don't start to optimize this kind of thing unless there's evidence that you really have to.

Keeping handsets in sync with a database

On the system I am developing, we have a PostgreSQL database that contains set up data which when updated must be transfered to handsets when and while those handsets are "docked". While the handsets are docked, our "service software" can talk to them, but not while they are undocked (they are not wireless).
At the moment, the service software that the handsets talk to loads the set up data from the database on startup and caches it. Thereafter it queries the latest timestamp of the setup data every 5 seconds and reloads parts of the set up if the timestamp queried is higher than the latest cached timestamp.
However, I find this method haphazard. It may be possible to miss an update for instance if an update transaction takes longer than a second, or at least if the period between submitting the transaction and completion of the transaction takes it over a 1-second boundary (the now() function is resolved at the beginning of the transaction by PostgreSQL). The only way I can think of round that is to do a table level lock before querying the latest timestamp. I'm not a fan of table locks but it is the only way I can think of to get round the problem.
Another problem with this approach is, I have to query for new data based on the update timetamp being >= the last latest timestamp, as opposed to just > the last latest timestamp. Why? Because a record may of been committed within the same second, just after my query - so I'd miss the record.
Another approach I've thought of is, storing "last synchronised date-time" data in the database for each logical item of data that must be stored on the handsets. I would do this on a per handset basis. I can then periodically query for all data not currently synchronised on a particular handset, and then mark it as synchronised once the handset is up to date (I have worked out a mechanism for this to be failsafe which takes into account the data being updated during synchronisation).
My only problem with this approach is that it means the database is storing non-business centric data - as in it is storing data to make the system work. I'm not convinced data about what handsets are in sync is "business" data. To me it is more the responsibility of the handset service software / handset software to know how to keep itself up to date, though it is tempting as it describes perfectly what data is and is not on each handset and allows queries to only return the data needed.
The first approach however at least only uses data appropriate to the business - i.e. the timestamp of when the data was last changed.
The ideal way would be to use some kind of notification system, but unfortunately postgres only has a basic NOTIFY / EVENT system and that doesn't seem to work over ODBC (which I foolishly decided to use and do not have time to change just now). If I was using Oracle I could use Streams..
Thoughts?
Note: The database is purely relational - I am not interested in any "object oriented" approaches to this problem or any framework based solutions.
Thanks.
First of all, if you are using PostgreSQL version at least 7.2, the now function returns values with microsecond precision rather that second precision; although the value is ultimately derived from the operating system and will be accurate only up to a few hundredths of a second.
The method that you describe appears to be safe against permanently missing any updates. Just make sure that you reload data every time unless the timestamps prove that you have reloaded long enough after the last update. Alternately, you could update a timestamp upon data update in a separate transaction; in that case, ever seeing such a timestamp is a proof that all updates had finished before the timestamp value.
Another approach I've thought of is, storing "last synchronised
date-time" data in the database for each logical item of data that
must be stored on the handsets. I would do this on a per handset
basis. I can then periodically query for all data not currently
synchronised on a particular handset, and then mark it as synchronised
once the handset is up to date
I can not recommend this for the following reasons:
As synchronization is a state of a handset and not a state of the data, this information should better be stored on the handset.
The database should be scalable to many handsets and it ideally should not have to keep track of them.
If a handset can change its identity, or be wiped or restored (reimaged) to a previous state without changing its identity, the database will get out of sync with the real state of the headset and no mechanism will ensure proper synchronization.
While NOTIFY is certainly preferable to constant polling, it is a problem orthogonal to where you store the synchronization progress. You still need to have a polling capability to be able to deal with a freshly connected devide, and notifications would be just a bandwidth/latency optimization.

What is the recommended way to build functionality similar to Stackoverflow's "Inbox"?

I have an asp.net-mvc website and people manage a list of projects. Based on some algorithm, I can tell if a project is out of date. When a user logs in, i want it to show the number of stale projects (similar to when i see a number of updates in the inbox).
The algorithm to calculate stale projects is kind of slow so if everytime a user logs in, i have to:
Run a query for all project where they are the owner
Run the IsStale() algorithm
Display the count where IsStale = true
My guess is that will be real slow. Also, on everything project write, i would have to recalculate the above to see if changed.
Another idea i had was to create a table and run a job everything minutes to calculate stale projects and store the latest count in this metrics table. Then just query that when users log in. The issue there is I still have to keep that table in sync and if it only recalcs once every minute, if people update projects, it won't change the value until after a minute.
Any idea for a fast, scalable way to support this inbox concept to alert users of number of items to review ??
The first step is always proper requirement analysis. Let's assume I'm a Project Manager. I log in to the system and it displays my only project as on time. A developer comes to my office an tells me there is a delay in his activity. I select the developer's activity and change its duration. The system still displays my project as on time, so I happily leave work.
How do you think I would feel if I receive a phone call at 3:00 AM from the client asking me for an explanation of why the project is no longer on time? Obviously, quite surprised, because the system didn't warn me in any way. Why did that happen? Because I had to wait 30 seconds (why not only 1 second?) for the next run of a scheduled job to update the project status.
That just can't be a solution. A warning must be sent immediately to the user, even if it takes 30 seconds to run the IsStale() process. Show the user a loading... image or anything else, but make sure the user has accurate data.
Now, regarding the implementation, nothing can be done to run away from the previous issue: you will have to run that process when something that affects some due date changes. However, what you can do is not unnecessarily run that process. For example, you mentioned that you could run it whenever the user logs in. What if 2 or more users log in and see the same project and don't change anything? It would be unnecessary to run the process twice.
Whatsmore, if you make sure the process is run when the user updates the project, you won't need to run the process at any other time. In conclusion, this schema has the following advantages and disadvantages compared to the "polling" solution:
Advantages
No scheduled job
No unneeded process runs (this is arguable because you could set a dirty flag on the project and only run it if it is true)
No unneeded queries of the dirty value
The user will always be informed of the current and real state of the project (which is by far, the most important item to address in any solution provided)
Disadvantages
If a user updates a project and then upates it again in a matter of seconds the process would be run twice (in the polling schema the process might not even be run once in that period, depending on the frequency it has been scheduled)
The user who updates the project will have to wait for the process to finish
Changing to how you implement the notification system in a similar way to StackOverflow, that's quite a different question. I guess you have a many-to-many relationship with users and projects. The simplest solution would be adding a single attribute to the relationship between those entities (the middle table):
Cardinalities: A user has many projects. A project has many users
That way when you run the process you should update each user's Has_pending_notifications with the new result. For example, if a user updates a project and it is no longer on time then you should set to true all users Has_pending_notifications field so that they're aware of the situation. Similarly, set it to false when the project is on time (I understand you just want to make sure the notifications are displayed when the project is no longer on time).
Taking StackOverflow's example, when a user reads a notification you should set the flag to false. Make sure you don't use timestamps to guess if a user has read a notification: logging in doesn't mean reading notifications.
Finally, if the notification itself is complex enough, you can move it away from the relationship between users and projects and go for something like this:
Cardinalities: A user has many projects. A project has many users. A user has many notifications. A notifications has one user. A project has many notifications. A notification has one project.
I hope something I've said has made sense, or give you some other better idea :)
You can do as follows:
To each user record add a datetime field sayng the last time the slow computation was done. Call it LastDate.
To each project add a boolean to say if it has to be listed. Call it: Selected
When you run the Slow procedure set you update the Selected fileds
Now when the user logs if LastDate is enough close to now you use the results of the last slow computation and just take all project with Selected true. Otherwise yourun again the slow computation.
The above procedure is optimal, becuase it re-compute the slow procedure ONLY IF ACTUALLY NEEDED, while running a procedure at fixed intervals of time...has the risk of wasting time because maybe the user will neber use the result of a computation.
Make a field "stale".
Run a SQL statement that updates stale=1 with all records where stale=0 AND (that algorithm returns true).
Then run a SQL statement that selects all records where stale=1.
The reason this will work fast is because SQL parsers, like PHP, shouldn't do the second half of the AND statement if the first half returns true, making it a very fast run through the whole list, checking all the records, trying to make them stale IF NOT already stale. If it's already stale, the algorithm won't be executed, saving you time. If it's not, the algorithm will be run to see if it's become stale, and then stale will be set to 1.
The second query then just returns all the stale records where stale=1.
You can do this:
In the database change the timestamp every time a project is accessed by the user.
When the user logs in, pull all their projects. Check the timestamp and compare it with with today's date, if it's older than n-days, add it to the stale list. I don't believe that comparing dates will result in any slow logic.
I think the fundamental questions need to be resolved before you think about databases and code. The primary of these is: "Why is IsStale() slow?"
From comments elsewhere it is clear that the concept that this is slow is non-negotiable. Is this computation out of your hands? Are the results resistant to caching? What level of change triggers the re-computation.
Having written scheduling systems in the past, there are two types of changes: those that can happen within the slack and those that cause cascading schedule changes. Likewise, there are two types of rebuilds: total and local. Total rebuilds are obvious; local rebuilds try to minimize "damage" to other scheduled resources.
Here is the crux of the matter: if you have total rebuild on every update, you could be looking at 30 minute lags from the time of the change to the time that the schedule is stable. (I'm basing this on my experience with an ERP system's rebuild time with a very complex workload).
If the reality of your system is that such tasks take 30 minutes, having a design goal of instant gratification for your users is contrary to the ground truth of the matter. However, you may be able to detect schedule inconsistency far faster than the rebuild. In that case you could show the user "schedule has been overrun, recomputing new end times" or something similar... but I suspect that if you have a lot of schedule changes being entered by different users at the same time the system would degrade into one continuous display of that notice. However, you at least gain the advantage that you could batch changes happening over a period of time for the next rebuild.
It is for this reason that most of the scheduling problems I have seen don't actually do real time re-computations. In the context of the ERP situation there is a schedule master who is responsible for the scheduling of the shop floor and any changes get funneled through them. The "master" schedule was regenerated prior to each shift (shifts were 12 hours, so twice a day) and during the shift delays were worked in via "local" modifications that did not shuffle the master schedule until the next 12 hour block.
In a much simpler situation (software design) the schedule was updated once a day in response to the day's progress reporting. Bad news was delivered during the next morning's scrum, along with the updated schedule.
Making a long story short, I'm thinking that perhaps this is an "unask the question" moment, where the assumption needs to be challenged. If the re-computation is large enough that continuous updates are impractical, then aligning expectations with reality is in order. Either the algorithm needs work (optimizing for local changes), the hardware farm needs expansion or the timing of expectations of "truth" needs to be recalibrated.
A more refined answer would frankly require more details than "just assume an expensive process" because the proper points of attack on that process are impossible to know.

simple Solr deployment with two servers for redundancy

I'm deploying the Apache Solr web app in two redundant Tomcat 6 servers,
to provide redundancy and improved availability. At this point, scalability is not a issue.
I have a load balancer that can dynamically route traffic to one server or the other or both.
I know that Solr supports master/slave configuration, but that requires manual recovery if the slave receives updates during the master outage (which it will in my use case).
I'm considering a simpler approach using the ability to reload a core:
- only one of the two servers is receiving traffic at any time (the "active" instance), but both are running,
- both instances share the same index data and
- before re-routing traffic due to an outage, the now active instance is told to reload the index core(s)
Limited testing of failovers with both index reads and writes has been successful. What implications/issues am I missing?
Your thoughts and opinions welcomed.
The simple approach to redundancy your considering seems reasonable but you will not be able to use it for disaster recovery unless you can share the data/index to/from a different physical location using your NAS/SAN.
Here are some suggestions:-
Make backups for disaster recovery and test those backups work as an index could conceivably have been corrupted as there are no checksums happening internally in SOLR/Lucene. An index could get wiped or some records could get deleted and merged away without you knowing it and backups can be useful for recovering those records/docs at a later time if you need to perform an investigation.
Before you re-route traffic to the second instance I would run some queries to load caches and also to test and confirm the current index works before it goes online.
Isolate the updates to one location and process and thread to ensure transactional integrity in the event of a cutover as it could be difficult to manage consistency as SOLR does not use a vector clock to synchronize updates like some databases. I personally would keep a copy of all updates in order separately from SOLR in some other store just in case a small time window needs to be repeated.
In general, my experience with SOLR has been excellent as long as you are not using cutting edge features and plugins. I have one instance that currently has 40 million docs and an uptime of well over a year with no issues. That doesn't mean you wont have issues but gives you an idea of how stable it could be.
I hardly know anything about Solr, so I don't know the answers to some of the questions that need to be considered with this sort of setup, but I can provide some things for consideration. You will have to consider what sorts of failures you want to protect against and why and make your decision based on that. There is, after all, no perfect system.
Both instances are using the same files. If the files become corrupt or unavailable for some reason (hardware fault, software bug), the second instance is going to fail the same as the first.
On a similar note, are the files stored and accessed in such a way that they are always valid when the inactive instance reads them? Will the inactive instance try to read the files when the active instance is writing them? What would happen if it does? If the active instance is interrupted while writing the index files (power failure, network outage, disk full), what will happen when the inactive instance tries to load them? The same questions apply in reverse if the 'inactive' instance is going to be writing to the files (which isn't particularly unlikely if it wasn't designed with this use in mind; it might for example update some sort of idle statistic).
Also, reloading the indices sounds like it could be a rather time-consuming operation, and service will not be available while it is happening.
If the active instance needs to complete an orderly shutdown before the inactive instance loads the indices (perhaps due to file validity problems mentioned above), this could also be time-consuming and cause unavailability. If the active instance can't complete an orderly shutdown, you're gonna have a bad time.

Resources