Consolidating changes when syncing with a server from offline localStorage

Consolidating changes when syncing with a server from offline localStorage - backbone.js

I am planning to use a localStorage adapter for backbone.us to allow it to sync/fetch to local storage instead of via jqXHR. This is so that my app can work offline.
However, once my app is back online, I'd make an ajax call to sync up the local dataset with the server, or alternatively build in some sort of a "replay" system to send only the changes over.
However, how would I handle when the data set has diverged (changed on both server and client)? Which source has the correct data set?

You should be able to create a simple two-way synchronization protocol by doing the following:
On all tables that should be synchronized, have a "last updated" timestamp field which is updated whenever you modify a row (including inserts and deletes, see further down).
For "id", avoid auto-incrementing integers (which would require talking to the server to figure out the next value), and use GUIDs instead. The localStorage adapter does this by default. Using a GUID, both the server and client can generate new rows without talking to eachother.
You need to use "soft deletes". Instead of actually deleting a row, have a flag that you mark as deleted and filter on this flag whenever you need to list your collection of objects. That way, even deletes gets propagated properly. You can do "housekeeping" (actually deleting the rows) for all rows that have remained deleted for a certain amount of time, where that time needs to be larger than the largest possible "offline" time to make sure deletes have propagated.
Store a local timestamp of the last time you sync'ed with the server on your client. Making sure you store timestamps on both the server and client with the proper timezones is important.
When you sync against the server from the client, send it your "last sync" timestamp, and the server should send you everything that changed since that time. With the new timestamp provided by the server, store that as your new "last sync time" locally on the client. Updated/insert received whatever rows you get depending on the GUID (update if you have them locally, insert if not).
If you have locally stored changes, send all locally modified rows since the original timestamp from your client to the server, preferrably before reading updates from the server (you can also do it after, but regardless, you should probably check the updated timestamp to figure out which update is more recent and handle it according to whatever conflict resolution policy you select, or none if it's not important).
Feel the pain when you realize that for practical cross platform localstorage usage, you are currently limited to about 2.5 megabytes of data. That will hopefully change soon. And before you decide to go IndexedDB instead to work around the limitations, the APIs haven' stabilized yet, and most implementations are half-assed and incompatible.

What about simply ensuring your data has some marker for when it was last edited? Then you can compare, and if the server is newer, overwrite the client, and vice versa.

Related

Browser: How to cache large data yet enable small parts to be updated?

I have a list of 20k employees to display in a React table. When the admin user changes one, I want the change reflected in the table - even if she does a reload - but I don't want to re-fetch all 20k including the unchanged 19 999.
(The table is of course paged and shows max N at once but I still need all 20k to support search and filtering, which is impractical to do server side for various reasons)
The solution I can think of is to set caching headers for /api/employees so that it is cached for e.g. one hour and have another endpoint, /api/employees?changedSince= and somehow ensure that server knows which employees have been changed. But I am sure somebody has already implemented a solution(s) for this...
Thank you!

A timestamp solution would be the best, and simplest, way to implement it. It would only require a small amount of extra data to be stored and would provide the most maintainable and expandable solution.
All you would need to do is update the timestamp when an item in the list is updated. Then, when the page loads for the first time, access /api/employees, then periodically request /api/employees?changedSince to return all of the changed rows in the table, for React to then update.
In terms of caching the main /api/employees endpoint, I’m not sure how much benefit you would gain from doing that, but it depends on how often the data is updated.

As you are saying your a in control of the frontends backend, imho this backend should cache all of the upstream data in its own (SQL or whatever) database. The backend then can expose a proper api (with pagination and search).
The backend can also implement some logic to identify which rows have changed.
If the frontend needs live updates about changes you can use some technology that allows bi-directional communication (SignalR if your backend is .NET based, or something like socket.io if you have a node backend, or even plain websockets)

Database time acces in Heroku with Play Framework

I am having a problem and I need your help.
I am working with Play Framework v1.2.4 in java, and my server is uploaded in the Heroku servers.
All works fine, I can access to my databases and all is ok, but I am experiment troubles when I do a couple of saves to the database.
I have a method who store data many times in the database and return a notification to a mobile phone. My problem is that the notification arrives before the database finish to save the data, because when it arrives I request for the update data to the server, and it returns the data without the last update. After a few seconds I have trying to update again, and the data shows correctly, therefore I think there is a time-access problem.
The idea would be that when the databases end to save the data, the server send the notification.
I dont know if this is caused because I am using the free version of the Heroku Servers, but I want to be sure before purchasing it.

In general all requests to cloud databases are always slower than the same working on your local machine. Even simply query that on your computer needs just 0.0001 sec can be as slow as 0.5 sec in the cloud. Reason is simple clouds providers uses shared databases + (geo) replications, which just... cannot be compared to the database accessed only by one program on the same machine.
Also keep in mind that free Heroku DB plans doesn't offer ANY database cache, which means that every query is fetched from the cloud directly.
As we don't know your application it's hard to say what is the bottleneck anyway almost for sure you have at least 3 ways to solve your problem. They are not an alternatives, probably you will need to use (or at least check) all of them.
You need to risk some basic plan and see how things changed with paid version, maybe it will be good enough for you, maybe not.
Redesign your application to make less queries. For an example instead sending 10 queries to select 10 different rows, you will need to send one query, which selects all 10 records at once.
Use Play's cache API to avoid repeating selecting the same set of data again and again. For an example, if you have some categories, which changes rarely, but you need category tree for each article, you don't need to fetch categories from DB every time, instead you can store a List of categories in cache, so you will need to use only one request to fetch article's content (which can be cached for some short time as well...)

How efficient can Meteor be while sharing a huge collection among many clients?

Imagine the following case:
1,000 clients are connected to a Meteor page displaying the content of the "Somestuff" collection.
"Somestuff" is a collection holding 1,000 items.
Someone inserts a new item into the "Somestuff" collection
What will happen:
All Meteor.Collections on clients will be updated i.e. the insertion forwarded to all of them (which means one insertion message sent to 1,000 clients)
What is the cost in term of CPU for the server to determine which client needs to be updated?
Is it accurate that only the inserted value will be forwarded to the clients, and not the whole list?
How does this work in real life? Are there any benchmarks or experiments of such scale available?

The short answer is that only new data gets sent down the wire. Here's
how it works.
There are three important parts of the Meteor server that manage
subscriptions: the publish function, which defines the logic for what
data the subscription provides; the Mongo driver, which watches the
database for changes; and the merge box, which combines all of a
client's active subscriptions and sends them out over the network to the
client.
Publish functions
Each time a Meteor client subscribes to a collection, the server runs a
publish function. The publish function's job is to figure out the set
of documents that its client should have and send each document property
into the merge box. It runs once for each new subscribing client. You
can put any JavaScript you want in the publish function, such as
arbitrarily complex access control using this.userId. The publish
function sends data into the merge box by calling this.added, this.changed and
this.removed. See the
full publish documentation for
more details.
Most publish functions don't have to muck around with the low-level
added, changed and removed API, though. If a publish function returns a Mongo
cursor, the Meteor server automatically connects the output of the Mongo
driver (insert, update, and removed callbacks) to the input of the
merge box (this.added, this.changed and this.removed). It's pretty neat
that you can do all the permission checks up front in a publish function and
then directly connect the database driver to the merge box without any user
code in the way. And when autopublish is turned on, even this little bit is
hidden: the server automatically sets up a query for all documents in each
collection and pushes them into the merge box.
On the other hand, you aren't limited to publishing database queries.
For example, you can write a publish function that reads a GPS position
from a device inside a Meteor.setInterval, or polls a legacy REST API
from another web service. In those cases, you'd emit changes to the
merge box by calling the low-level added, changed and removed DDP API.
The Mongo driver
The Mongo driver's job is to watch the Mongo database for changes to
live queries. These queries run continuously and return updates as the
results change by calling added, removed, and changed callbacks.
Mongo is not a real time database. So the driver polls. It keeps an
in-memory copy of the last query result for each active live query. On
each polling cycle, it compares the new result with the previous saved
result, computing the minimum set of added, removed, and changed
events that describe the difference. If multiple callers register
callbacks for the same live query, the driver only watches one copy of
the query, calling each registered callback with the same result.
Each time the server updates a collection, the driver recalculates each
live query on that collection (Future versions of Meteor will expose a
scaling API for limiting which live queries recalculate on update.) The
driver also polls each live query on a 10 second timer to catch
out-of-band database updates that bypassed the Meteor server.
The merge box
The job of the merge box is to combine the results (added, changed and removed
calls) of all of a client's active publish functions into a single data
stream. There is one merge box for each connected client. It holds a
complete copy of the client's minimongo cache.
In your example with just a single subscription, the merge box is
essentially a pass-through. But a more complex app can have multiple
subscriptions which might overlap. If two subscriptions both set the
same attribute on the same document, the merge box decides which value
takes priority and only sends that to the client. We haven't exposed
the API for setting subscription priority yet. For now, priority is
determined by the order the client subscribes to data sets. The first
subscription a client makes has the highest priority, the second
subscription is next highest, and so on.
Because the merge box holds the client's state, it can send the minimum
amount of data to keep each client up to date, no matter what a publish
function feeds it.
What happens on an update
So now we've set the stage for your scenario.
We have 1,000 connected clients. Each is subscribed to the same live
Mongo query (Somestuff.find({})). Since the query is the same for each client, the driver is
only running one live query. There are 1,000 active merge boxes. And
each client's publish function registered an added, changed, and
removed on that live query that feeds into one of the merge boxes.
Nothing else is connected to the merge boxes.
First the Mongo driver. When one of the clients inserts a new document
into Somestuff, it triggers a recomputation. The Mongo driver reruns
the query for all documents in Somestuff, compares the result to the
previous result in memory, finds that there is one new document, and
calls each of the 1,000 registered insert callbacks.
Next, the publish functions. There's very little happening here: each
of the 1,000 insert callbacks pushes data into the merge box by
calling added.
Finally, each merge box checks these new attributes against its
in-memory copy of its client's cache. In each case, it finds that the
values aren't yet on the client and don't shadow an existing value. So
the merge box emits a DDP DATA message on the SockJS connection to its
client and updates its server-side in-memory copy.
Total CPU cost is the cost to diff one Mongo query, plus the cost of
1,000 merge boxes checking their clients' state and constructing a new
DDP message payload. The only data that flows over the wire is a single
JSON object sent to each of the 1,000 clients, corresponding to the new
document in the database, plus one RPC message to the server from the
client that made the original insert.
Optimizations
Here's what we definitely have planned.
More efficient Mongo driver. We
optimized the driver
in 0.5.1 to only run a single observer per distinct query.
Not every DB change should trigger a recomputation of a query. We
can make some automated improvements, but the best approach is an API
that lets the developer specify which queries need to rerun. For
example, it's obvious to a developer that inserting a message into
one chatroom should not invalidate a live query for the messages in a
second room.
The Mongo driver, publish function, and merge box don't need to run
in the same process, or even on the same machine. Some applications
run complex live queries and need more CPU to watch the database.
Others have only a few distinct queries (imagine a blog engine), but
possibly many connected clients -- these need more CPU for merge
boxes. Separating these components will let us scale each piece
independently.
Many databases support triggers that fire when a row is updated and
provide the old and new rows. With that feature, a database driver
could register a trigger instead of polling for changes.

From my experience, using many clients with while sharing a huge collection in Meteor is essentially unworkable, as of version 0.7.0.1. I'll try to explain why.
As described in the above post and also in https://github.com/meteor/meteor/issues/1821, the meteor server has to keep a copy of the published data for each client in the merge box. This is what allows the Meteor magic to happen, but also results in any large shared databases being repeatedly kept in the memory of the node process. Even when using a possible optimization for static collections such as in (Is there a way to tell meteor a collection is static (will never change)?), we experienced a huge problem with the CPU and Memory usage of the Node process.
In our case, we were publishing a collection of 15k documents to each client that was completely static. The problem is that copying these documents to a client's merge box (in memory) upon connection basically brought the Node process to 100% CPU for almost a second, and resulted in a large additional usage of memory. This is inherently unscalable, because any connecting client will bring the server to its knees (and simultaneous connections will block each other) and memory usage will go up linearly in the number of clients. In our case, each client caused an additional ~60MB of memory usage, even though the raw data transferred was only about 5MB.
In our case, because the collection was static, we solved this problem by sending all the documents as a .json file, which was gzipped by nginx, and loading them into an anonymous collection, resulting in only a ~1MB transfer of data with no additional CPU or memory in the node process and a much faster load time. All operations over this collection were done by using _ids from much smaller publications on the server, allowing for retaining most of the benefits of Meteor. This allowed the app to scale to many more clients. In addition, because our app is mostly read-only, we further improved the scalability by running multiple Meteor instances behind nginx with load balancing (though with a single Mongo), as each Node instance is single-threaded.
However, the issue of sharing large, writeable collections among multiple clients is an engineering problem that needs to be solved by Meteor. There is probably a better way than keeping a copy of everything for each client, but that requires some serious thought as a distributed systems problem. The current issues of massive CPU and memory usage just won't scale.

The experiment that you can use to answer this question:
Install a test meteor: meteor create --example todos
Run it under Webkit inspector (WKI).
Examine the contents of the XHR messages moving across the wire.
Observe that the entire collection is not moved across the wire.
For tips on how to use WKI check out this article. It's a little out of date, but mostly still valid, especially for this question.

This is still a year old now and therefore I think pre-"Meteor 1.0" knowledge, so things may have changed again? I'm still looking into this.
http://meteorhacks.com/does-meteor-scale.html
leads to a "How to scale Meteor?" article
http://meteorhacks.com/how-to-scale-meteor.html

Keeping handsets in sync with a database

On the system I am developing, we have a PostgreSQL database that contains set up data which when updated must be transfered to handsets when and while those handsets are "docked". While the handsets are docked, our "service software" can talk to them, but not while they are undocked (they are not wireless).
At the moment, the service software that the handsets talk to loads the set up data from the database on startup and caches it. Thereafter it queries the latest timestamp of the setup data every 5 seconds and reloads parts of the set up if the timestamp queried is higher than the latest cached timestamp.
However, I find this method haphazard. It may be possible to miss an update for instance if an update transaction takes longer than a second, or at least if the period between submitting the transaction and completion of the transaction takes it over a 1-second boundary (the now() function is resolved at the beginning of the transaction by PostgreSQL). The only way I can think of round that is to do a table level lock before querying the latest timestamp. I'm not a fan of table locks but it is the only way I can think of to get round the problem.
Another problem with this approach is, I have to query for new data based on the update timetamp being >= the last latest timestamp, as opposed to just > the last latest timestamp. Why? Because a record may of been committed within the same second, just after my query - so I'd miss the record.
Another approach I've thought of is, storing "last synchronised date-time" data in the database for each logical item of data that must be stored on the handsets. I would do this on a per handset basis. I can then periodically query for all data not currently synchronised on a particular handset, and then mark it as synchronised once the handset is up to date (I have worked out a mechanism for this to be failsafe which takes into account the data being updated during synchronisation).
My only problem with this approach is that it means the database is storing non-business centric data - as in it is storing data to make the system work. I'm not convinced data about what handsets are in sync is "business" data. To me it is more the responsibility of the handset service software / handset software to know how to keep itself up to date, though it is tempting as it describes perfectly what data is and is not on each handset and allows queries to only return the data needed.
The first approach however at least only uses data appropriate to the business - i.e. the timestamp of when the data was last changed.
The ideal way would be to use some kind of notification system, but unfortunately postgres only has a basic NOTIFY / EVENT system and that doesn't seem to work over ODBC (which I foolishly decided to use and do not have time to change just now). If I was using Oracle I could use Streams..
Thoughts?
Note: The database is purely relational - I am not interested in any "object oriented" approaches to this problem or any framework based solutions.
Thanks.

First of all, if you are using PostgreSQL version at least 7.2, the now function returns values with microsecond precision rather that second precision; although the value is ultimately derived from the operating system and will be accurate only up to a few hundredths of a second.
The method that you describe appears to be safe against permanently missing any updates. Just make sure that you reload data every time unless the timestamps prove that you have reloaded long enough after the last update. Alternately, you could update a timestamp upon data update in a separate transaction; in that case, ever seeing such a timestamp is a proof that all updates had finished before the timestamp value.
Another approach I've thought of is, storing "last synchronised
date-time" data in the database for each logical item of data that
must be stored on the handsets. I would do this on a per handset
basis. I can then periodically query for all data not currently
synchronised on a particular handset, and then mark it as synchronised
once the handset is up to date
I can not recommend this for the following reasons:
As synchronization is a state of a handset and not a state of the data, this information should better be stored on the handset.
The database should be scalable to many handsets and it ideally should not have to keep track of them.
If a handset can change its identity, or be wiped or restored (reimaged) to a previous state without changing its identity, the database will get out of sync with the real state of the headset and no mechanism will ensure proper synchronization.
While NOTIFY is certainly preferable to constant polling, it is a problem orthogonal to where you store the synchronization progress. You still need to have a polling capability to be able to deal with a freshly connected devide, and notifications would be just a bandwidth/latency optimization.

One database, multiple frontends, maintaing request ordering

Assuming I have one database keeping a simple history with multiple front ends talking to it (one front end per server), I wonder what are the common solutions to deal with time. As soon as I have multiple servers, I cannot assume a global consistent clock, and I was interested in the possible solutions to maintain some kind of ordering between requests.
For a concrete example, let's say I want to record histories of customers, where history is defined as time ordered set of records. The record table would be as simple as (customer_id, time, data), and history would be all the rows where customer_id == requested id. Each request sent by the user would contain one record sent to one customer. Ideally, the time should refer to the "actual" time the request was sent to the front end by the customer (as that's the time as seen from the user POV). To be exact, I only care about preserving the ordering between records for each customer, not about the absolute time.
I am aware of solutions such as vector clocks, etc... but that seems rather complex, and I would expect this to be a rather common issue ?
Solutions which are not acceptable in my case:
Changing the requests arriving at the front end: I unfortunately have to work under the constraint that the requests are passed as is. I have complete control of whatever communication protocol is needed between front ends and database, though.
Server time clocks are synchronized
All request which require being ordered to each other are handled by the same front end server
[EDIT]: the question may sound a bit like red-herring, so here is my rationale for asking it: while this is not my issue right now, I am interested in the possibility to go to a platform like Google App Engine, which explicitly says that their servers are not guaranteed to be time synchronized. The solution to that issue for request ordering does not sound obvious to me - but maybe something like vector clock is actually the only "good" solution ?

When you perform any action that records history data to the database you could record two sets of datetime info:
the datetime as set by the DB when the record was inserted
the datetime passed through with the data as a legitimate piece of metadata.
The former would give you a central view of the world if you ever needed it, and the latter would let you reconstruct datetime from customers perspective.
If you were ultra-keen you could also pass through the datetime from the users browser by filling some sort of parameter/field using JavaScript.

As soon as I have multiple servers, I
cannot assume a global consistent
clock
Well, you can configure servers to sync their clocks to a time server. You could also configure your database server to sync to a time server, and configure the other servers to sync to your database server as often as you need to. (I'm not saying that's a great idea, just saying it's possible. If you have access to all the servers.)
Anyway . . . so the front ends are the only pieces of software you have that actually know when a request arrives. Is that right?
If that's right, then it's the front ends job to record the time of the customer's request, possibly in UTC, and then forward that timestamp to the database.
If you can't synchronize the server's clocks, then I think your only hope is to have every front ends ask just one specific server--maybe your database server, but maybe not--what time it is when a customer request arrives. A front end can do that by asking for daytime on port 13 (DAYTIME protocol, RFC-867), asking for time on port 37 (TIME protocol, RFC-868), or asking a time server on port 123 (either NTP or SNTP protocol, RFC-1305 and RFC-2030).
But after reading your edit, I think what you want is impossible. You seem to be saying that
what the front ends send doesn't
contain enough information to
reconstruct the "true" ordering
what the front ends send cannot be
changed
If the front ends can't send you any other information, vector clocks and interval tree clocks won't help.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight