Autoincrement with Grails/Hibernate for different DBs - database

It's not really a problem, but it surprises me:
when I use Grails with different DBs, I get different counter increments:
with the out-of-the-box HSQLDB, every table gets its own counter which is always increased by 1
with an Oracle DB, it seems that all tables use the same global counter
now I am using JavaDB/Derby and the generated ids are huge!
Where can I find more information about this behaviour, and which one is the best?
HSQL seems to keep the counters small.
With Oracle, I get a globally unique id -- also a nice feature.
But what about the Derby behaviour?

It really depends on the default id generation strategy in the specific dialect. Grails allows you to customize the generation strategy with the mapping closure.
The 'safest' generation strategy (i.e. supported by every RDBMS) is TABLE, and this is the preferred choice of many JPA implementations. This is probably what you get in HSQLDB. However, Oracle supports sequences, and these objects are generally better optimized for key generation -- hence the Oracle dialect seems to use one global sequence. I'm not familiar with Derby, but there is probably identity column support there, and what you get is some sort of UUID.
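For illustration, a minimal JPA-annotation sketch (a rough analogue of what the Grails mapping closure configures) of pinning the strategy explicitly instead of relying on the dialect's default; the entity name is hypothetical:

import javax.persistence.*;

@Entity
public class Book {

    @Id
    // TABLE: portable everywhere; keys come from a separate key table.
    // SEQUENCE: uses a database sequence (Oracle, PostgreSQL, ...).
    // IDENTITY: uses an identity/autoincrement column (HSQLDB, Derby, MySQL, ...).
    // AUTO: lets the provider/dialect decide -- the per-database behaviour observed above.
    @GeneratedValue(strategy = GenerationType.TABLE)
    private Long id;

    private String title;

    // getters and setters omitted for brevity
}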

Related

Why does a database have a weird alphanumeric key

I am trying to connect data in two databases, each created automatically by a different UI application. In one, all the keys are in this format: "D8FC23D7-97D6-42F5-A52F-1CE93087B3A4".
Is there any reason this would be done? I also saw keys that look similar in a GIS database. I can't tell if these are supposed to be some computed key, maybe to detect what I am trying to do, or just random with some other intent.
PS: I am using SQL Server. From what I can gather, this is not something that would be auto-generated by SQL Server.
This is a GUID, also called a UUID, a universally unique identifier (see, for example, Wikipedia or RFC 4122). The idea behind a GUID is that applications can generate identifiers that are globally unique without needing a central unit to do any coordination (see the motivation from RFC 4122 below).
Various systems, databases, and programming languages offer functionality for generating UUIDs (e.g. SELECT NEWID() in SQL Server); the benefit is that with UUID generators, applications can generate globally unique identifiers in a completely self-sufficient manner. A minimal code example follows the quoted motivation below.
UUIDs can serve as database keys, though in most cases you will find much lighter-weight and more suitable keys.
One of the main reasons for using UUIDs is that no centralized authority is required to administer them (although one format uses IEEE 802 node identifiers, others do not). As a result, generation on demand can be completely automated, and used for a variety of purposes. The UUID generation algorithm described here supports very high allocation rates of up to 10 million per second per machine if necessary, so that they could even be used as transaction IDs.
UUIDs are of a fixed size (128 bits) which is reasonably small compared to other alternatives. This lends itself well to sorting, ordering, and hashing of all sorts, storing in databases, simple allocation, and ease of programming in general.
Since UUIDs are unique and persistent, they make excellent Uniform Resource Names. The unique ability to generate a new UUID without a registration process allows for UUIDs to be one of the URNs with the lowest minting cost.
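For illustration, a minimal Java sketch of generating such a value with the standard library (the class name here is hypothetical); java.util.UUID.randomUUID() produces a random (version 4) UUID, the same kind of value SQL Server's NEWID() returns:

import java.util.UUID;

public class UuidDemo {
    public static void main(String[] args) {
        // Random (version 4) UUID, e.g. D8FC23D7-97D6-42F5-A52F-1CE93087B3A4
        UUID id = UUID.randomUUID();
        System.out.println(id);            // canonical 36-character form
        System.out.println(id.version());  // prints 4 for randomly generated UUIDs
    }
}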

Is using a (sequential) GUID the only viable alternative to a database generated ID?

We are migrating our MS Access database to SQL Server Compact 4.0 using the Entity Framework 5 Code First approach. We found that using database-generated integer IDs is very slow, and to make it worse, the delay increases exponentially with the size of the database. This makes using an identity column impossible, and it seems there is a bad implementation of this feature in SQL Server Compact 4.0 paired with the Entity Framework.
So we ran some tests and found that using a client-side generated key speeds up insertion by at least 20 times; the exponential increase in insertion time disappears.
Now we are looking for the best way to generate client-side IDs. Using GUIDs seems the most secure option, but I read that this negatively impacts read actions. Is there a strategy for using auto-incremented integers that are generated client-side?
EDIT:
I will investigate the underlying problem that led to this question further. In the meantime, can my real question be answered, please? :-)
EDIT 2:
It is pretty exasperating that nobody seems to believe my assertion that using auto-IDs with EF and SQL Server Compact 4.0 is this slow. I posted a separate question about this with a proof of concept that should be easily reproducible.
If you are moving large amounts of data with EF, you are doing it wrong. Use ADO.NET, and for example a BULK COPY approach instead (with SQL CE use SqlCeUpdateableRecord). You could use my SqlCeBulkCopy library to save some coding effort.
I don't think the way identities are generated is the source of the performance problem.
If you want better performance during the migration, you can disable primary keys, foreign keys, and other constraints on your main tables before the conversion process (this can be done by scripting or manually).
However, data integrity will then be your concern, and your conversion code must be robust so that the constraints can be re-enabled after the conversion process.
Hope this helps.
Solutions as I see them:
a) Try to fix the performance problem. My suggestions: don't use large numbers of entities inside a context; use as few as the business problem will allow. Don't use merge tracking, etc. See the EF performance tips:
http://blogs.msdn.com/b/wriju/archive/2011/03/15/ado-net-entity-framework-performance-tips.aspx
and http://msdn.microsoft.com/en-au/library/cc853327.aspx
b) Use a GUID, allocated externally (it's not sequential, but it is fast).
c) Use a custom integer generator that runs in memory, can allocate many keys at once, and can persist its current state. This technique is used by SAP, which calls it a "number range" (see the sketch below).
It can be very fast, but not as fast as b).
BTW, I use GUIDs rather than DB-generated IDs to make partial DB copies and migrations easier. :-)
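A rough, hypothetical Java sketch of option c): a client-side "number range" style generator that reserves a block of ids up front and hands them out from memory. Persisting the reserved high-water mark is left as a stub, since how you store it (a database row, a local file) depends on your setup:

public class NumberRangeGenerator {

    private final int blockSize;
    private long next;      // next id to hand out
    private long blockEnd;  // last id of the currently reserved block

    public NumberRangeGenerator(int blockSize) {
        this.blockSize = blockSize;
    }

    public synchronized long nextId() {
        if (next == 0 || next > blockEnd) {
            // Reserve a fresh block and persist the new high-water mark.
            blockEnd = reserveBlock(blockSize);
            next = blockEnd - blockSize + 1;
        }
        return next++;
    }

    // Stub: in a real application this would atomically bump and store
    // the last reserved value so that restarts never reuse ids.
    private long reserveBlock(int size) {
        return blockEnd + size;
    }
}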

PIG Latin script for Database access

I am trying to implement a surrogate key generator using PIG.
I need to persist the last generated key in a Database and query the Database for the next available key.
Is there any support in PIG to query the Database using ODBC?
If yes, please provide guidance or some samples.
Sorry for not answering your question directly, but this is not something you want to be doing, for a few reasons:
Your MapReduce job is going to hammer your database as a single performance chokepoint (you are basically defeating the purpose of Hadoop).
With speculative execution, you'll have the same data loaded twice, so some unique identifiers won't exist when one of the tasks gets killed.
I think that if you can afford to hit the database once per record anyway, you can just do this surrogate key enrichment without MapReduce, in a single thread (see the sketch below).
Either way, building surrogate keys or automatic counters is not easy in Hadoop because of its shared-nothing nature.
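As a rough illustration of that single-threaded alternative, a hypothetical Java sketch that prefixes each record with a surrogate key and persists only the last key used; the file names and tab-separated format are assumptions:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SurrogateKeyEnricher {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get("records.tsv");
        Path output = Paths.get("records_with_keys.tsv");
        Path state = Paths.get("last_key.txt");

        // Resume from the last persisted key, or start at 0.
        long lastKey = Files.exists(state)
                ? Long.parseLong(Files.readString(state).trim())
                : 0L;

        try (BufferedReader in = Files.newBufferedReader(input);
             BufferedWriter out = Files.newBufferedWriter(output)) {
            String line;
            while ((line = in.readLine()) != null) {
                lastKey++;
                out.write(lastKey + "\t" + line);
                out.newLine();
            }
        }

        // Persist the high-water mark for the next run.
        Files.writeString(state, Long.toString(lastKey));
    }
}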

What orm.xml features should be avoided to stay database agnostic?

I'm embarking on an adventure in JPA and want to, as much as possible, stay database agnostic. What orm.xml features should I avoid in order to stay database agnostic?
For example, if I use strategy="AUTO" in orm.xml as follows:
<id name="id">
<generated-value strategy="AUTO" />
</id>
...then MySQL shows it as an AUTO_INCREMENT column which may (I'm not yet sure) cause problems if I needed to deploy to Oracle.
JPA features
You can use all JPA features. At worst you will need to change the annotations or orm.xml (e.g. if you want to use a special sequence in Oracle), but all features are supported one way or another without impacting the code. That's what's nice with an ORM -- you have an extra layer of abstraction.
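For instance, a hedged sketch of what "changing the annotations" might look like if you later want a dedicated Oracle sequence; the entity, generator, and sequence names are hypothetical, and the orm.xml form is equivalent:

import javax.persistence.*;

@Entity
public class Invoice {

    // Portable default: @GeneratedValue(strategy = GenerationType.AUTO)
    // lets the provider pick whatever fits the current database.

    // Oracle-tuned alternative: bind the id to a named sequence.
    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "invoice_seq")
    @SequenceGenerator(name = "invoice_seq", sequenceName = "INVOICE_SEQ", allocationSize = 50)
    private Long id;
}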
Reserved keywords
Don't use reserved words to name your tables and columns.
Keep close to SQL-92 standard
The way the queries are translated (especially the native ones) is loose. This is great in some cases, but can sometimes lead to problems:
Don't use AS in native queries.
Never use SELECT * in native queries.
Use = for equality, not ==.
Use only the SQL-92 standard functions.
I am not familiar with JPA, but in general a reasonable ORM should be database agnostic (for the major databases) for all of its mappings.
In particular, an "AUTO" increment strategy should work out of the box.
When switching the database, you have to deal with migration issues for the existing data.
In general, MySQL's "AUTO_INCREMENT" should be used when selecting a value generation of "IDENTITY", SERIAL on Sybase, and on DB2 ... etc. Some RDBMSs don't have an equivalent.
Value generation of "AUTO" lets the implementation choose what is best for that datastore. Yes, on MySQL it may choose AUTO_INCREMENT, on Sybase SERIAL, and on Oracle SEQUENCE, etc., but from the user-code point of view that one will (should) work on any spec-compliant implementation. Obviously you cannot then switch JPA implementations and expect the exact same mechanism, since JPA impl #1 may choose AUTO_INCREMENT on MySQL while JPA impl #2 may choose some internal mechanism, etc.

GUIDs as Primary Keys - Offline OLTP

We are working on designing an application that is typically OLTP (think: purchasing system). However, this one in particular has the need that some users will be offline, so they need to be able to download the DB to their machine, work on it, and then sync back once they're on the LAN.
I would like to note that I know this has been done before, I just don't have experience with this particular model.
One idea I thought about was using GUIDs as table keys. So for example, a Purchase Order would not have a number (auto-numeric) but a GUID instead, so that every offline client can generate those, and I don't have clashes when I connect back to the DB.
Is this a bad idea for some reason?
Will access to these tables through the GUID key be slow?
Have you had experience with these type of systems? How have you solved this problem?
Thanks!
Daniel
Using GUIDs as primary keys is acceptable and is considered fairly standard practice for the same reasons that you are considering them. They can be overused, which can make things a bit tedious to debug and manage, so try to keep them out of code tables and other reference data if at all possible.
The thing that you have to concern yourself with is the human-readable identifier. GUIDs cannot be exchanged by people -- can you imagine trying to confirm your order number over the phone if it is a GUID? So in an offline scenario you may still have to generate something, like a publisher (workstation/user) id plus a sequence number, so the order number may be something like 123-5678 (see the sketch below).
However, this may not satisfy the business requirement of having a sequential number. In fact, regulatory requirements can be an influence: some regulations (SOX, maybe) require that invoice numbers be sequential. In such cases it may be necessary to generate a sort of pro forma number which is fixed up later when the systems synchronise. You may end up with tables having OrderId (Guid), OrderNo (int), ProformaOrderNo (varchar) -- some complexity may creep in.
At least having GUIDs as primary keys means that you don't have to do a whole lot of cascading updates when the sync does eventually happen; you simply update the human-readable number.
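A rough, hypothetical Java sketch of that publisher-id-plus-sequence idea (the names and the persistence of the sequence are assumptions); the GUID remains the primary key, and this value is only the human-facing number:

public class OrderNumberGenerator {

    private final int workstationId; // assigned centrally, e.g. 123
    private long lastSequence;       // persisted locally between sessions

    public OrderNumberGenerator(int workstationId, long lastSequence) {
        this.workstationId = workstationId;
        this.lastSequence = lastSequence;
    }

    // Produces values like "123-5679"; restart from the persisted sequence.
    public synchronized String nextOrderNumber() {
        lastSequence++;
        return workstationId + "-" + lastSequence;
    }
}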
#SqlMenace
There are other problems with GUIDs: you see, GUIDs are not sequential, so inserts will be scattered all over the place, and this causes page splits and index fragmentation.
Not true. Primary key != clustered index.
If the clustered index is another column ("inserted_on" springs to mind) then the inserts will be sequential and no page splits or excessive fragmentation will occur.
This is a perfectly good use of GUIDs. The only drawbacks would be a slight complexity in working with GUIDs over INTs and the slight size difference (16 bytes vs. 4 bytes).
I don't think either of those are a big deal.
Will access to these tables through the GUID key be slow?
There are other problems with GUIDs: you see, GUIDs are not sequential, so inserts will be scattered all over the place, and this causes page splits and index fragmentation.
In SQL Server 2005, MS introduced NEWSEQUENTIALID() to fix this; the only problem for you might be that you can only use NEWSEQUENTIALID() as a default value in a table.
You're correct that this is an old problem, and it has two canonical solutions:
Use unique identifiers as the primary key. Note that if you're concerned about readability you can roll your own unique identifier instead of using a GUID. A unique identifier will use information about the date and the machine to generate a unique value.
Use a composite key of 'Actor' + identifier. Every user gets a numeric actor ID, and the keys of newly inserted rows use the actor ID as well as the next available identifier. So if two actors both insert a new row with ID "100", the primary key constraint will not be violated.
Personally, I prefer the first approach, as I think composite keys are really tedious as foreign keys. I think the human readability complaint is overstated -- end-users shouldn't have to know anything about your keys, anyways!
Make sure to utilize guid.comb -- it takes care of the indexing stuff. If you are dealing with performance issues after that, then you will be, in short order, an expert on scaling.
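For illustration only, a rough Java sketch of the comb idea: overwriting part of a random GUID with a timestamp so values generated close together also sort close together. Real implementations such as NHibernate's guid.comb place the bytes to match SQL Server's GUID ordering, which this sketch does not attempt:

import java.util.UUID;

public class CombGuid {
    // Replace the low 48 bits of a random UUID with the current time in
    // milliseconds, so newly generated values tend to be nearly sequential.
    public static UUID next() {
        UUID random = UUID.randomUUID();
        long millis = System.currentTimeMillis() & 0xFFFFFFFFFFFFL; // 48 bits
        long lsb = (random.getLeastSignificantBits() & 0xFFFF000000000000L) | millis;
        return new UUID(random.getMostSignificantBits(), lsb);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            // The last 12 hex digits now carry the timestamp.
            System.out.println(next());
        }
    }
}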
Another reason to use GUIDs is to enable database refactoring. Say you decide to apply polymorphism or inheritance or whatever to your Customers entity. You now want Customers and Employees to derive from Person and have them share a table. Having really unique identifiers makes data migration simple. There are no sequences or integer identity fields to fight with.
I'm just going to point you to What are the performance improvement of Sequential Guid over standard Guid?, which covers the GUID talk.
For human readability, consider assigning machine IDs and then using sequential numbers from those machines as a possibility. This will require managing the assignment of machine IDs, though. Could be done in one or two columns.
I'm personally fond of the SGUID answer, though.
Guids will certainly be slower (and use more memory) than standard integer keys, but whether or not that is an issue will depend on the type of load your system will see. Depending on your backend DB there may be issues with indexing guid fields.
Using GUIDs simplifies a whole class of problems, but you pay for it partly with performance and also with debuggability -- typing GUIDs into those test queries will get old real fast!
The backend will be SQL Server 2005
Frontend / Application Logic will be .Net
Besides GUIDs, can you think of other ways to resolve the "merge" that happens when the offline computer syncs the new data back into the central database?
I mean, if the keys are INTs, I'll basically have to renumber everything when importing. GUIDs will spare me that.
Using GUIDs saved us a lot of work when we had to merge two databases into one.
If your database is small enough to download to a laptop and work with it offline, you probably don't need to worry too much about the performance differences between ints and Guids. But do not underestimate how useful ints are when developing and troubleshooting a system! You will probably need to come up with some fairly complex import/synch logic regardless of whether or not you are using Guids, so they might not help as much as you think.
#Simon,
You raise very good points. I was already thinking about the "temporary" "human-readable" numbers I'd generate while offline and recreate on sync. But I wanted to avoid doing that with foreign keys, etc.
I would start by looking at SQL Server Compact Edition for this! It helps with all of your issues.
Data Storage Architecture with SQL Server 2005 Compact Edition
It is specifically designed for field force applications (FFAs). FFAs usually share one or more of the following attributes:
They allow the user to perform their job functions while disconnected from the back-end network—on-site at a client location, on the road, in an airport, or from home.
FFAs are usually designed for occasional connectivity, meaning that when users are running the client application, they do not need to have a network connection of any kind. FFAs often involve multiple clients that can concurrently access and use data from the back-end database, both in a connected and disconnected mode.
FFAs must be able to replicate data from the back-end database to the client databases for offline support. They also need to be able to replicate modified, added, or deleted data records from the client to the server when the application is able to connect to the network.
First thought that comes to mind: Hasn't MS designed the DataSet and DataAdapter model to support scenarios like this?
I believe I read that MS changed their ADO recordset model to the current DataSet model so it works great offline too. And there's also this Sync Services for ADO.NET
I believe I have seen code that utilizes the DataSet model, which also uses foreign keys, and they still sync perfectly when using the DataAdapter. I haven't tried out Sync Services, though, but I think you might be able to benefit from that too.
Hope this helps.
#Portman By default, PK == clustered index: creating a primary key constraint will automatically create a clustered index. You need to specify nonclustered if you don't want it clustered.
