Synchronizing one or more databases with a master database - Foreign keys

I'm using Google Gears to be able to use an application offline (I know Gears is deprecated). The problem I am facing is the synchronization with the database on the server.
The specific problem is the primary keys, or more exactly, the foreign keys. When sending the information to the server, I could easily ignore the primary keys and generate new ones. But then how would I know what the relations are?
I had one solution in mind, but then I would need to save all the primary keys for every client. What is the best way to synchronize multiple clients with one server DB?
Edit:
I've been thinking about it, and I guess sequential primary keys are not the best solution, but what other possibilities are there? Time-based keys don't seem right because of the collisions that could happen.
A GUID comes to mind; is that an option? It looks like generating a GUID in JavaScript is not that easy.
I could also do something with natural keys or composite keys. The more I think about it, the more that looks like the best solution. Can I expect any problems with that?

This is not quite a full answer, but might at least provide you with some ideas...
The question you're asking (and the problem you're trying to address) is not specific to Google Gears, and will remain valid with other solutions, like HTML 5 or systems based on Flash/AIR.
There was a presentation on that subject at the last ZendCon a few months ago, and the slides are available on SlideShare: Planning for Synchronization with Browser-Local Databases.
Going through those slides, you'll see notes about a couple of possibilities that might come to mind (some did actually come to your mind, or appear in other answers):
Using GUID
Composite Keys
Primary key pool (i.e. reserve a range of keys beforehand)
Of course, each of those has advantages... and drawbacks. I will not copy-paste them: take a look at the slides ;-)
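To make the "primary key pool" idea from that list a bit more concrete, here is a minimal Python sketch; the request_range_from_server callback and the block size are assumptions for illustration, not something taken from the slides:
class KeyPool:
    """Hands out locally-unique ids from a range reserved on the server."""
    def __init__(self, request_range_from_server, block_size=1000):
        # request_range_from_server(block_size) is assumed to atomically
        # reserve [start, start + block_size) on the server and return start.
        self._request = request_range_from_server
        self._block_size = block_size
        self._next = None
        self._end = None

    def next_key(self):
        if self._next is None or self._next >= self._end:
            # Refill the pool while online; offline inserts then keep
            # drawing from the range that was already reserved.
            start = self._request(self._block_size)
            self._next, self._end = start, start + self._block_size
        key = self._next
        self._next += 1
        return key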
Now, in your situation, which solution will be best? Hard to say, actually, and the sooner you think about synchronisation, the better/easier it'll probably be: adding this kind of thing to an application is so much simpler when that application is still in its design stage ^^
First, it might be interesting to determine whether:
Your application is generally connected, and being disconnected only rarely happens,
Or your application is generally disconnected, and only connects once in a while.
Then, what are you going to synchronise?
Data ?
Like "This is the list of all commands made by that user"
With that data replicated on each disconnected device, of course, and each device can modify it
In this case, if one user deletes a line and another one adds a line, how do you know which one has the "true" data?
Or actions made on those data ?
Like "I am adding an entry in the list of commands made by that user"
In this case, if one user deletes a line and another one adds a line, it's easy to synchronise, as you just have to replay those two actions against your central DB
But this is not quite easy to implement, especially for a big application/system: each time an action is made, you have to log it somehow!
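As a rough illustration of what that action log could look like on the client (a hedged sketch; the queue structure and field names are assumptions, not part of the original answer):
import json
import time

pending_actions = []  # in a real app this queue would be persisted locally

def record_action(action, model, payload):
    # Every local change is appended as a command instead of only
    # mutating the local data directly.
    pending_actions.append({
        'action': action,        # e.g. 'add_line' or 'delete_line'
        'model': model,          # e.g. 'command'
        'payload': payload,
        'timestamp': time.time(),
    })

record_action('add_line', 'command', {'text': 'new command for that user'})

# At sync time the queue is serialized and replayed, in order, against the
# central DB.
sync_payload = json.dumps(pending_actions)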
There is also a specific problem that we don't generally think about until it happens: especially if your synchronisation process can take some time (lots of data, infrequent syncs, ...), what if the synchronisation is interrupted before it has finished?
For instance, what if :
A user, in a train, has access to the network, with some 3G card
The synchronisation starts
There is a tunnel, and the connection is lost.
Having half-synchronised data might not be that good, in most situations...
So you have to find a solution to that problem too: in most cases, the synchronisation has to be atomic!
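As a small sketch of what "atomic synchronisation" can mean on the server side, assuming the incoming batch of actions is applied with SQLite (purely illustrative; any database with transactions works the same way):
import sqlite3

def apply_sync_batch(db_path, actions):
    # actions: a list of {'sql': ..., 'params': ...} dicts received from a client
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # one transaction: either every action applies, or none does
            for action in actions:
                conn.execute(action['sql'], action['params'])
    finally:
        conn.close()
    # If the upload is cut off in the tunnel, the batch never arrives complete,
    # nothing is committed, and there is no half-synchronised data.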

I've come up with the following solution:
Every client gets a unique id from the server. Everywhere a primary key is referenced, I use a composite key made of the client id and an auto-increment field.
This way the combination is unique, and it's easy to implement. The only thing left is making sure every client gets a unique id.
I just found out one drawback: SQLite doesn't support autoincrement on composite primary keys, so I would have to handle the ids myself.
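A minimal sketch of that composite-key scheme with the counter handled by hand (Python + sqlite3; the table and column names are just illustrative):
import sqlite3

conn = sqlite3.connect("local.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS todo (
        client_id INTEGER NOT NULL,   -- unique id handed out by the server
        local_id  INTEGER NOT NULL,   -- per-client counter, managed manually
        title     TEXT,
        PRIMARY KEY (client_id, local_id)
    )""")

def insert_todo(client_id, title):
    # SQLite has no AUTOINCREMENT on composite keys, so compute the next
    # local_id ourselves; COALESCE handles the empty-table case.
    cur = conn.execute(
        "SELECT COALESCE(MAX(local_id), 0) + 1 FROM todo WHERE client_id = ?",
        (client_id,))
    next_id = cur.fetchone()[0]
    conn.execute(
        "INSERT INTO todo (client_id, local_id, title) VALUES (?, ?, ?)",
        (client_id, next_id, title))
    conn.commit()
    return (client_id, next_id)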

I would use a similar setup to your latest answer. However, to get around your auto-increment issue, I would use a single auto-increment surrogate key in your master database and then store the client primary key and your client id as well. That way you are not losing or changing any data in the process and you are also tracking which client the data was originally sourced from.
Be sure to also set up a unique index on (client PK, client id) to enable referential integrity from any child tables.
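To illustrate, here is roughly what that master-side table could look like (sketched with SQLite syntax for brevity; the table and column names are assumptions, not from the answer):
import sqlite3

master = sqlite3.connect("master.db")
master.executescript("""
    CREATE TABLE IF NOT EXISTS todo (
        id        INTEGER PRIMARY KEY AUTOINCREMENT, -- surrogate key on the master
        client_id INTEGER NOT NULL,                  -- which client sent the row
        client_pk INTEGER NOT NULL,                  -- the row's PK on that client
        title     TEXT
    );
    -- (client_pk, client_id) stays unique, so child tables can still reference
    -- the row the client knows about
    CREATE UNIQUE INDEX IF NOT EXISTS idx_todo_client
        ON todo (client_pk, client_id);
""")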

Is there a reasonable limit to how many objects the client can create while disconnected?
One possibility I can see is to create a sort of "local sequence".
When your client connects to the central server, it gets a numeric ID, say a 7-digit number (the server generates it as a sequence).
The actual PKs are created as strings like this: 895051|000094 or 895051|005694, where the first part is the 7-digit number sent from the server and the second part is a "local" sequence managed by the client.
As soon as you sync with the central server, you can get a new 7-digit number and restart your local sequence. This is not too different from what you were proposing, all in all. It just makes the actual PK completely independent of the client identity.
Another bonus is that in a scenario where the client has never connected to the server, it can use 000000|000094 locally, request a new number from the server, and update the keys on its side before sending everything back to the server for sync (this is tricky if you have lots of FK constraints though, and might not be feasible).
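A small sketch of that "local sequence" key generation (Python; the class and method names are illustrative, and the formatting simply mirrors the example keys above):
class LocalKeyGenerator:
    def __init__(self, server_prefix=0):
        self.server_prefix = server_prefix  # 0 -> "000000", i.e. never connected
        self.local_counter = 0

    def next_key(self):
        self.local_counter += 1
        return "{:06d}|{:06d}".format(self.server_prefix, self.local_counter)

    def restart_with(self, new_prefix):
        # Called after syncing: take a fresh number from the server and
        # restart the local sequence.
        self.server_prefix = new_prefix
        self.local_counter = 0

keys = LocalKeyGenerator(server_prefix=895051)
print(keys.next_key())   # 895051|000001
print(keys.next_key())   # 895051|000002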

Related

Couchbase: two users registering with the same username but on different datacenters?

Let's say I have two users, Alice in North America and Bob in Europe. Both want to register a new account with the same username, at the same time, on different datacenters. The datacenters are configured to replicate between each other using eventual consistency.
How can I make sure only one of them succeeds at registering the username? Keep in mind that the connection between the datacenters might even be offline at the time (worst case, but a daily occurrence on Spotify's Cassandra setup).
EDIT:
I do realize that key uniqueness is the big problem here. The thing is that I need all usernames to be unique. Imagine using Twitter if you couldn't tag a specific person, but had to tag everyone with the same username.
With any eventual consistency system, and particularly in the presence of a network partition, you essentially have two choices:
Accept collisions, and pick a winner later.
Ensure you never have a collision.
In the case of Couchbase:
For (1), that means letting two users register with the same username in both NA and EU, and then later picking one as the "winner" (when the network link is back); not a very desirable outcome for something like a user account. A slight variation on this would be something like @Robert's suggestion of putting them in a staging area (which means the account cannot be made "active" until the partition is resolved), then telling the "winning" user they have successfully registered, and the "loser" that the name is taken and to try again.
For (2), this means making the users unique even though they pick the same username, for example by adding a NA:: / EU:: prefix to their username document. When they log in, the application would need some logic to try looking up both document variations, likely trying the prefix for the local region first. (This is essentially the same idea as the "realms" or "servers" that many MMO games use.)
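As a rough sketch of option (2) (the prefix format and the helper functions here are assumptions standing in for whatever your Couchbase client provides, not a specific SDK API):
LOCAL_REGION = "EU"
ALL_REGIONS = ["EU", "NA"]

def user_key(region, username):
    # e.g. "EU::user::alice" -- unique per region even for the same username
    return "{}::user::{}".format(region, username)

def register(insert_document, username, profile):
    # insert_document is assumed to fail if the key already exists, so
    # duplicates *within* one region are still rejected atomically.
    insert_document(user_key(LOCAL_REGION, username), profile)

def find_user(get_document, username):
    # Try the local region first, then fall back to the others.
    for region in [LOCAL_REGION] + [r for r in ALL_REGIONS if r != LOCAL_REGION]:
        doc = get_document(user_key(region, username))
        if doc is not None:
            return doc
    return None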
There are variations of both of these, but ultimately, given an AP-type system (which Couchbase across XDCR is), you've essentially chosen Availability & Partition-Tolerance over Consistency, and hence need to reconcile that at the application layer.
Put the user name registrations into a staging table until you can perform a replication to determine if the name already exists in one of the other data centers.
You tagged Couchbase, so I will answer about that.
As long as the key for each object is different, you should be fine with Couchbase. It is the keys that would be unique and work great with XDCR. Another solution would be to have a concatenated key made up of the username and other values (company name, etc) if that suits your use case, again giving you a unique key for the object. Yet another would be to have a key/value in a JSON document that is the username.
It's not clear to me whether you're using Cassandra or Couchbase.
As far as Cassandra is concerned, since version 2.0 you can use Lightweight Transactions, which were created for exactly this goal. A serial consistency level has been created just to achieve what you need. At the link above you can read the following:
For example, suppose that I have an application that allows users to register new accounts. Without linearizable consistency, I have no way to make sure I allow exactly one user to claim a given account — I have a race condition analogous to two threads attempting to insert into a [non-concurrent] Map: even if I check for existence before performing the insert in one thread, I can't guarantee that no other thread inserts it after the check but before I do.
As for a missing connection between two or more clusters, it's your choice how to handle it. If you can't guarantee uniqueness at insert time, you can either refuse the registration or deal with it by accepting and apologizing later.
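A minimal sketch of that lightweight-transaction approach with the Python DataStax driver (the keyspace, table, and column names are illustrative):
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("accounts")

def register(username, email):
    # IF NOT EXISTS turns the insert into a compare-and-set (Paxos under the
    # hood), so exactly one of two concurrent registrations wins.
    result = session.execute(
        "INSERT INTO users (username, email) VALUES (%s, %s) IF NOT EXISTS",
        (username, email))
    row = result.one()
    return row[0]  # the first column is the [applied] flag: True if we won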
HTH, Carlo

When is it appropriate to use UUIDs for a web project?

I'm busy with the database design of a new project, and I'm not sure whether to use UUIDs or normal table-unique auto-increment ids.
Up to now, the sites I've built have all run on a single server, and very heavy traffic has never been too much of a concern. However, this web application will eventually run concurrently on multiple servers, serve an API, and need to process thousands of requests per second, and I want to make sure that the design I choose now doesn't cripple any of those possibilities later.
I have my suspicions, of course, and they should be clear through the way I phrased my question, but I would like to hear from those with more experience what trouble I can run into later if I do or don't have UUIDs, and what I should really be basing my decision on.
So, in short: What are the considerations I should give into deciding whether or not to use UUIDs for all database models, so that any one object can be identified uniquely by one string, and when is it appropriate to use this as the primary key, instead of table-by-table auto-increment?
Note: I've seen this question (When are you truly forced to use UUID as part of the design?), and read all the answers, but they mostly answer "How rarely do UUIDs collide", instead of "When is it appropriate to use them".
One consideration that I've used when deciding on UUIDs vs. auto-increment ids is whether they're going to be user-visible, and if so, whether I want users to know how many I have of that table. For example, if I didn't want to make public the number of registered users my site has, I wouldn't assign auto-increment user ids.
And to address one other specific point you raised, it's still possible to use auto-incrementing ids with multiple servers (in MySQL this can be done with the auto_increment_increment and auto_increment_offset server variables). You just need to start the ids at different offsets and increment accordingly. That is, if you had 3 servers, you could start server A at 1, server B at 2, and server C at 3, and then increment the ids by 10 each time instead of 1. That way, you can guarantee no collisions.
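A quick sketch of that offset-and-stride scheme in application code (the server offsets and stride are just example values):
def id_sequence(server_offset, stride=10):
    """Yield the ids one particular server is allowed to hand out."""
    next_id = server_offset
    while True:
        yield next_id
        next_id += stride

server_a = id_sequence(1)   # 1, 11, 21, ...
server_b = id_sequence(2)   # 2, 12, 22, ...
server_c = id_sequence(3)   # 3, 13, 23, ...
# As long as the stride is at least the number of servers, no two servers
# can ever produce the same id.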
And finally, the last thing I consider is how important performance is to my application. Integers are much more easily indexed than UUIDs that are string-based, so indexes are smaller, more quickly searched, etc.
UUIDs or GUIDs can be very useful, especially for the web. If you use auto-increment values for a UserId, anyone can view the source of your web pages and see how simple the values are. They could then try any integer value to get data they are not supposed to see.
GUIDs are not created in any sequential format, so even if you create them one right after the other, the sequence cannot easily be guessed.
I don't think it's necessary to use GUIDs for simple lookup-type data such as ColorId 1=Blue, 2=Red, 3=Green.
GUIDs are also very useful for session and state management.
That's my $0.02

Enforcing Unique Constraint in GAE

I am trying out Google App Engine Java; however, the absence of a unique constraint is making things difficult.
I have been through this post, and this blog suggests a method to implement something similar. My background is in MySQL. Moving to a datastore without a unique constraint makes me jittery, because I never had to worry about duplicate values before, and checking each value before inserting a new one still leaves room for error.
"No, you still cannot specify unique
during schema creation."
-- David Underhill talks about GAE and the unique constraint (post link)
What are you guys using to implement something similar to a unique or primary key?
I heard about an abstract datastore layer created using the low-level API which worked like a regular RDB; however, it was not free (and I don't remember the name of the software).
Schematic view of my problem
1. sNo = biggest serial_number in the db
2. sNo++
3. Insert new entry with sNo as serial_number value // checkpoint
4. User adds data pertaining to the current serial_number
5. Update entry with data where serial_number is sNo
However, at step 3 (the checkpoint), I feel two users might end up with the same sNo. And that is what is preventing me from moving to App Engine.
This and other similar questions come up often when talking about transitioning from a traditional RDB to a BigTable-like datastore like App Engine's.
It's often useful to discuss why the datastore doesn't support unique keys, since it informs the mindset you should be in when thinking about your data storage schemes. The reason unique constraints are not available is that they greatly limit scalability. As you've said, enforcing the constraint means checking all other entities for that property. Whether you do it manually in your code or the datastore does it automatically behind the scenes, it still needs to happen, and that means lower performance. Some optimizations can be made, but one way or another the check still has to happen.
The answer to your question is: really think about why you need that unique constraint.
Secondly, remember that keys do exist in the datastore, and are a great way of enforcing a simple unique constraint.
my_user = MyUser(key_name=users.get_current_user().email())
my_user.put()
This guarantees that no other MyUser will ever be created with that email, and you can also quickly retrieve the MyUser with that email:
my_user = MyUser.get_by_key_name(users.get_current_user().email())
In the Python runtime you can also do:
my_user = MyUser.get_or_insert(users.get_current_user().email())
Which will insert or retrieve the user with that email.
Anything more complex than that will not be scalable though. So really think about whether you need that property to be globally unique, or if there are ways you can remove the need for that unique constraint. Often times you'll find with some small workarounds you didn't need that property to be unique after all.
You can generate unique serial numbers for your products without needing to enforce unique IDs or querying the entire set of entities to find out what the largest serial number currently is. You can use transactions and a singleton entity to generate the 'next' serial number. Because the operation occurs inside a transaction, you can be sure that no two products will ever get the same serial number.
This approach will, however, be a potential performance chokepoint and limit your application's scalability. If it is the case that the creation of new serial numbers does not happen so often that you get contention, it may work for you.
EDIT:
To clarify, the singleton that holds the current (or next) serial number to be assigned is completely independent of any entities that actually have serial numbers assigned to them. They do not all need to be part of one entity group. You could have entities from multiple models using the same mechanism to get a new, unique serial number.
I don't remember Java well enough to provide sample code, and my Python example might be meaningless to you, but here's pseudo-code to illustrate the idea:
Receive request to create a new inventory item.
Enter transaction.
Retrieve current value of the single entity of the SerialNumber model.
Increment the value and write it back to the datastore.
Return value as you exit transaction.
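Since the original answer stops at pseudo-code, here is a hedged Python sketch of the same idea using the (old) google.appengine.ext.db API; the model and key names are illustrative:
from google.appengine.ext import db

class SerialNumber(db.Model):
    # A single entity (one well-known key_name) holds the next value to assign.
    next_value = db.IntegerProperty(default=1)

def allocate_serial_number():
    def txn():
        counter = SerialNumber.get_by_key_name('the_counter')
        if counter is None:
            counter = SerialNumber(key_name='the_counter')
        value = counter.next_value
        counter.next_value = value + 1
        counter.put()
        return value
    # The transaction guarantees no two callers ever receive the same value.
    return db.run_in_transaction(txn)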
Now, the code that does all the work of actually creating the inventory item and storing it along with its new serial number DOES NOT need to run in a transaction.
Caveat: as I stated above, this could be a major performance bottleneck, as only one serial number can be created at any one time. However, it does provide you with the certainty that the serial number you just generated is unique and not in use.
I encountered this same issue in an application where users needed to reserve a timeslot. I needed to "insert" exactly one unique timeslot entity while expecting users to simultaneously request the same timeslot.
I have isolated an example of how to do this on App Engine, and I blogged about it. The blog posting has canonical code examples using the Datastore, and also Objectify. (BTW, I would advise avoiding JDO.)
I have also deployed a live demonstration where you can advance two users toward reserving the same resource. In this demo you can experience the exact behavior of app engine datastore click by click.
If you are looking for the behavior of a unique constraint, these should prove useful.
-broc
I first thought an alternative to the transaction technique in broc's blog could be to make a singleton class with a synchronized method (say addUserName(String name)) responsible for adding a new entry only if it is unique, or throwing an exception otherwise. Then make a ServletContextListener which instantiates a single instance of this singleton and adds it as an attribute to the ServletContext. Servlets can then call the addUserName() method on the singleton instance, which they obtain through getServletContext().
However, this is NOT a good idea, because GAE is likely to split the app across multiple JVMs, so multiple instances of the singleton could still exist, one in each JVM. See this thread.
A more GAE-like alternative would be to write a GAE module responsible for checking uniqueness and adding new entries, then use manual or basic scaling with...
<max-instances>1</max-instances>
Then you have a single instance running on GAE which acts as a single point of authority, adding users one at a time to the datastore. If you are concerned about this instance being a bottleneck, you could improve the module by adding queuing or an internal master/slave architecture.
This module-based solution would allow many unique usernames to be added to the datastore in a short space of time, without risking entity group contention issues.

How do people handle foreign keys on clients when synchronizing to master db

I'm writing an application with offline support, i.e. browser/mobile clients sync commands to the master DB every so often.
I'm using UUIDs on both the client and server side. When syncing up to the server, the server will return a map of local uuids (luids) to server uuids (suids). Upon receiving this map, clients update their records' suid attributes with the appropriate values.
However, say a client record, e.g. a todo, has an attribute 'list_id' which holds the foreign key to the todo's list record. I use luids in foreign keys on the clients, so when that attribute is sent over to the server, it would dirty the server DB with luids rather than the suids the server is using.
My current solution is for the master server to keep a record of the mappings of luids to suids (per client id) and, for each foreign key in a command, look up the suid for that particular client and use the suid instead.
I'm wondering whether others have come across this problem and, if so, how they have solved it. Is there a more efficient, simpler way?
I took a look at the question "Synchronizing one or more databases with a master database - Foreign keys", and someone seemed to suggest my current solution as one option, composite keys using suids and auto-incrementing sequences as another, and a third option using negative ids on the clients and then updating all negative ids with the suids. Both of those other options seem like a lot more work.
Thanks,
Saimon
In my experience it's easiest to take the composite approach, particularly when it comes to debugging issues and potentially needing to roll back; it's really helpful to know which requests came from which machine and resulted in which changes. Whenever you're effectively dealing with many-to-one, you need a way to isolate each of the "many". It also allows you to do more intelligent conflict management when two of the "many" send updates that are non-complementary (if you want to do that sort of thing).
I've just thought of another possibility:
When assigning luids on the client side, keep a map of everywhere that luid has been assigned, e.g. something like (JSON):
{
  "luid123": [{"model": "list", "attribute": "id"},
              {"model": "todo", "attribute": "list_id"}]
}
When we get the global luid-to-suid map from the server (after syncing up), for each luid we look it up in this local map, and for each entry update the appropriate attribute on the appropriate model with the suid, then remove the entry from the luid mapping.
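A rough sketch of that remapping step (Python; `records` stands in for whatever local storage the client uses, and all of the names are illustrative):
luid_assignments = {
    'luid123': [{'model': 'list', 'attribute': 'id'},
                {'model': 'todo', 'attribute': 'list_id'}],
}

def apply_suid_map(records, luid_assignments, luid_to_suid):
    # records: {model_name: [row_dict, ...]} of locally stored rows
    for luid, suid in luid_to_suid.items():
        for usage in luid_assignments.pop(luid, []):
            for row in records[usage['model']]:
                if row.get(usage['attribute']) == luid:
                    row[usage['attribute']] = suid

records = {'list': [{'id': 'luid123', 'name': 'groceries'}],
           'todo': [{'id': 'luid124', 'list_id': 'luid123'}]}
apply_suid_map(records, luid_assignments, {'luid123': 'suid987'})
# records now references 'suid987' everywhere 'luid123' was used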
What do you think?
This way I avoid having to do expensive lookups in the global luid-to-suid map for all the foreign keys of every command synced. Another benefit is that foreign keys end up as suids on the client too, and I'd only have to look up suids from luids on the server side for records created and modified offline before syncing back to the server.
It's just an idea that popped into my head. I'm still hoping for more feedback on the subject.

Adding relations to an Access Database

I have an MS Access database with plenty of data. It's used by an application my team and I are developing. However, we've never added any foreign keys to this database, because we could control the relations from the code itself. We've never had any problems with this, and probably never will.
However, as development has progressed, I fear there's a risk of losing sight of all the relationships between the 30+ tables, even though we use well-normalized data. So it would be a good idea to get at least the relations between the tables documented.
Altova has created DatabaseSpy, which can show the structure of a database, but without the relations there isn't much to display. I could still use it to add the relations, but I don't want to modify the database itself.
Is there any software that can analyse a database by its structure and data and then make a best guess about its relations? (Just as documentation, not to modify the database.)
This application was created more than 10 years ago and has over 3000 paying customers who all use it. It's actually document-based, using an XML document for its internal storage. The database is just used as storage, and a single import/export routine converts it to and from XML. Unfortunately, the XML structure isn't very practical to use for documentation, and there's a second layer around this XML document to expose it as an object model. This object model is far from perfect too, but that's what 10 years of development can do to an application. We do want to improve it, but this takes time and we can't disappoint the current users by delaying new updates. Basically, we're stuck with its current design, and to improve it we need to make sure things are well-documented. That's what I'm working on now.
Only 30+ tables? It shouldn't take more than half an hour or an hour to create all the relationships required, which I'd urge you to do. Yes, I know you state that your code checks for those. But what if you've missed some? What if there are indeed orphaned records? How are you going to know? Or do you have bullet-proof routines which go through all your tables looking for all these problems?
Use a largish 23" LCD monitor and have at it.
If your database does not have relationships defined somewhere other than code, there is no real way to guess how tables relate to each other.
Worse, you can't know the type of relationship and whether cascading of update and deletion should occur or not.
Having said that, if you followed some strict rules for naming your foreign key fields, then it could be possible to reconstruct the structure of the relationships.
For instance, I use a scheme like this one:
Table Product
- Field ID /* The Unique ID for a Product */
- Field Designation
- Field Cost
Table Order
- Field ID /* the unique ID for an Order */
- Field ProductID
- Field Quantity
The relationship is easy to detect when looking at the Order: Order.ProductID is related to Product.ID, and this can easily be ascertained from code by going through each field.
If you have a similar scheme, then how much you can get out of it depends on how well you follow your own convention, but it could approach 100% accuracy, although you'll probably have some exceptions (which you can build into your code or, better, look up somewhere).
The other option is if each of your tables' unique IDs follows a different numbering scheme.
Say your Order.ID actually follows a scheme like OR001, OR002, etc. and Product.ID follows PD001, PD002, etc.
In that case, going through all fields in all tables, you can search for FK records that match each PK.
If you're following a sane convention for naming your fields and tables, then you can probably automate the discovery of the relations between them, store the result in a table, and manually go through it to make corrections.
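As a hedged sketch of that naming-convention heuristic, here is a naive guesser over a plain {table: [columns]} description of the schema (the schema dict is illustrative; in practice you would feed in the real table and field names read from the database):
schema = {
    'Product': ['ID', 'Designation', 'Cost'],
    'Order':   ['ID', 'ProductID', 'Quantity'],
}

def guess_relations(schema):
    relations = []
    for table, columns in schema.items():
        for column in columns:
            # "ProductID" -> assume a foreign key to Product.ID
            if column.endswith('ID') and column != 'ID':
                referenced = column[:-2]
                if referenced in schema and 'ID' in schema[referenced]:
                    relations.append((table, column, referenced, 'ID'))
    return relations

for table, column, ref_table, ref_column in guess_relations(schema):
    print("%s.%s -> %s.%s" % (table, column, ref_table, ref_column))
    # prints: Order.ProductID -> Product.ID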
Once you're done, use that result table to actually build the relationships from code using the Database.CreateRelation() method (look up the Access documentation, there is sample code for it).
You can build a small piece of VBA code, divided into 2 parts:
Step 1 creates the database relations with the Database.CreateRelation method.
Step 2 deletes all the created relations again (via the Relations collection's Delete method).
As Tony said, 30 tables are not that many, and the script should be easy to set up. Once it is set up, stop the process after step 1, run the Access Documenter (Tools > Analyze > Documenter) to get your documentation ready, then launch step 2. Your database will then be unchanged and your documentation ready.
I advise you to keep this code and run it regularly against your database, to check that your relational model still matches the data.
There might be a tool out there that is able to "guess" the relations, but I doubt it. Frankly, I am scared of databases without proper foreign keys, and of multi-user apps that use Access as the DBMS as well.
I guess the app must be some sort of internal tool; otherwise I would suggest that you move to a proper DBMS (SQL Server Express is free) and add the foreign keys.
