Let's say we generate our order numbers in SQL. Normally I get the next number with a
SELECT COUNT(numbers)+1 FROM X
etc.
The problem is, I want to give this number to the user first, then wait for the user to input the contents, and only then do the insert into the table. But since there are multiple users, I also want each of them to get a number, just not the same number as the first user. Is there a way to do this more elegantly?
In short, I want the number to be reserved for a specific user, inserted if they go through with it, and simply released if they don't.
Create a table of Numbers. Pre-populate the table with values and use this table as a queue. A transaction can reserve a Number by dequeuing a row. On rollback, the number becomes available again. Other transactions can concurrently dequeue other numbers thanks to the READPAST semantics of using a table as a queue. Add more numbers (insert more rows) as needed.
If this seems overkill, rest assured: it is not. Naive solutions may not account for concurrency or rollbacks, which are not trivial to solve.
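Here is a minimal sketch of the table-as-queue idea, assuming SQL Server; the table and column names are made up:
-- Hypothetical names; pre-populate OrderNumbers with as many values as you expect to need.
CREATE TABLE dbo.OrderNumbers (Number INT PRIMARY KEY);

-- Inside the user's transaction: dequeue one free number.
BEGIN TRANSACTION;

DECLARE @Reserved TABLE (Number INT);

-- READPAST makes concurrent sessions skip rows that are already locked,
-- so each user gets a different number; a ROLLBACK puts the row back.
DELETE TOP (1)
FROM dbo.OrderNumbers WITH (ROWLOCK, READPAST)
OUTPUT deleted.Number INTO @Reserved;

-- ...show the number to the user and insert their order using it...

COMMIT;      -- keeps the number
-- ROLLBACK; -- would release it for someone else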
You could insert into the table, and then update it when the user commits their data.
To get around your "releasing" of the numbers you could:
Have a flag on the table to say if the row is "free" or not. When you first insert the flag is "not free". If the user commits their data, keep it as "not free".
If they release their number, mark it as "free".
When assigning a number to a user, find the first "free" row; if there aren't any, insert a new one (a rough sketch follows).
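A rough sketch of this flag-based approach, assuming SQL Server and made-up names:
-- Hypothetical table: one row per order number, with a reservation flag.
CREATE TABLE dbo.OrderNumbers
(
    Number INT IDENTITY(1,1) PRIMARY KEY,
    IsFree BIT NOT NULL DEFAULT 0
);

-- Assign a number: reuse the first "free" row, or insert a new one.
DECLARE @Reserved TABLE (Number INT);

UPDATE TOP (1) dbo.OrderNumbers WITH (ROWLOCK, READPAST)
SET IsFree = 0
OUTPUT inserted.Number INTO @Reserved
WHERE IsFree = 1;

IF NOT EXISTS (SELECT 1 FROM @Reserved)
    INSERT dbo.OrderNumbers (IsFree)
    OUTPUT inserted.Number INTO @Reserved
    VALUES (0);

-- If the user cancels, release the number again:
-- UPDATE dbo.OrderNumbers SET IsFree = 1 WHERE Number = @TheNumber;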
If your system does not need all numbers to be sequential without gaps, you can make a simple table containing an identity column. Insert a fake record into it and use @@IDENTITY as the generated number. This solution of course has some of the drawbacks Remus Rusanu mentioned for 'naive solutions'.
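A quick sketch of that idea (SQL Server; the table name is made up, and SCOPE_IDENTITY() is used here instead of @@IDENTITY because it is not affected by triggers):
-- Hypothetical generator table: the identity column is the only thing we care about.
CREATE TABLE dbo.OrderNumberGenerator (Id INT IDENTITY(1,1) PRIMARY KEY);

INSERT dbo.OrderNumberGenerator DEFAULT VALUES;

-- Identity values are not handed back on rollback, so gaps can appear.
SELECT SCOPE_IDENTITY() AS NextOrderNumber;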
If possible, avoid displaying this number to the user before it is really stored in the database. You can generate some abstract number for temporary reference, e.g. from the date and time, and after inserting the data into the database you can display the real number. Almost nothing to code, and 100% reliable.
Related
I have to create a unique ID for invoices. I have a table with an id and another column for this unique number. I use the serializable isolation level. Using
var seq = @"SELECT invoice_serial + 1 FROM invoice WHERE ""type""=@type ORDER BY invoice_serial DESC LIMIT 1";
doesn't help because, even using FOR UPDATE, it won't read the correct value under the serializable isolation level.
The only solution seems to be to add some retry code.
Sequences do not generate gap-free sets of numbers, and there's really no way of making them do that because a rollback or error will "use" the sequence number.
I wrote up an article on this a while ago. It's directed at Oracle but is really about the fundamental principles of gap-free numbers, and I think the same applies here.
Well, it's happened again. Someone has asked how to implement a requirement to generate a gap-free series of numbers, and a swarm of nay-sayers have descended on them to say (and here I paraphrase slightly) that this will kill system performance, that it's rarely a valid requirement, that whoever wrote the requirement is an idiot, blah blah blah.
As I point out on the thread, it is sometimes a genuine legal requirement to generate gap-free series of numbers. Invoice numbers for the 2,000,000+ organisations in the UK that are VAT (sales tax) registered have such a requirement, and the reason for this is rather obvious: it makes it more difficult to hide the generation of revenue from tax authorities. I've seen comments that it is a requirement in Spain and Portugal, and I'd not be surprised if it were a requirement in many other countries as well.
So, if we accept that it is a valid requirement, under what circumstances are gap-free series of numbers a problem? Group-think would often have you believe that it always is, but in fact it is only a potential problem under very particular circumstances.
The series of numbers must have no gaps.
Multiple processes create the entities to which the number is associated (eg. invoices).
The numbers must be generated at the time that the entity is created.
If all of these requirements must be met then you have a point of serialisation in your application, and we’ll discuss that in a moment.
First let’s talk about methods of implementing a series-of-numbers requirement if you can drop any one of those requirements.
If your series of numbers can have gaps (and you have multiple processes requiring instant generation of the number) then use an Oracle Sequence object. They are very high performance and the situations in which gaps can be expected have been very well discussed. It is not too challenging to minimise the number of values skipped by making design efforts to minimise the chance of a process failure between generating the number and committing the transaction, if that is important.
If you do not have multiple processes creating the entities (and you need a gap-free series of numbers that must be instantly generated), as might be the case with the batch generation of invoices, then you already have a point of serialisation. That in itself may not be a problem, and may be an efficient way of performing the required operation. Generating the gap-free numbers is rather trivial in this case. You can read the current maximum value and apply an incrementing value to every entity with a number of techniques. For example if you are inserting a new batch of invoices into your invoice table from a temporary working table you might:
insert into invoices
  (invoice#,
   ...)
with curr as (
  select Coalesce(Max(invoice#), 0) max_invoice#
  from   invoices)
select
  curr.max_invoice# + rownum,
  ...
from
  curr,
  tmp_invoice
...
Of course you would protect your process so that only one instance can run at a time (probably with DBMS_Lock if you're using Oracle), and protect the invoice# with a unique key constraint, and probably check for missing values with separate code if you really, really care.
If you do not need instant generation of the numbers (but you need them gap-free and multiple processes generate the entities) then you can allow the entities to be generated and the transaction commited, and then leave generation of the number to a single batch job. An update on the entity table, or an insert into a separate table.
So what if we need the trifecta: instant generation of a gap-free series of numbers by multiple processes? All we can do is try to minimise the period of serialisation in the process, and I offer the following advice, and welcome any additional advice (or counter-advice, of course).
Store your current values in a dedicated table. DO NOT use a sequence.
Ensure that all processes use the same code to generate new numbers by encapsulating it in a function or procedure.
Serialise access to the number generator with DBMS_Lock, making sure that each series has its own dedicated lock.
Hold the lock in the series generator until your entity creation transaction is complete, by releasing the lock on commit.
Delay the generation of the number until the last possible moment.
Consider the impact of an unexpected error after generating the number and before the commit is completed — will the application rollback gracefully and release the lock, or will it hold the lock on the series generator until the session disconnects later? Whatever method is used, if the transaction fails then the series number(s) must be “returned to the pool”.
Can you encapsulate the whole thing in a trigger on the entity’s table? Can you encapsulate it in a table or other API call that inserts the row and commits the insert automatically?
Original article
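To make the advice above concrete, here is a minimal sketch of a serialised generator in Oracle PL/SQL. The series_current table and all names are my own invention, and the explicit DBMS_LOCK lock the article recommends is left out for brevity; the row lock taken by the UPDATE already serialises access per series and is held until commit or rollback.
-- Assumed supporting table (create it yourself):
--   CREATE TABLE series_current (series_name   VARCHAR2(30) PRIMARY KEY,
--                                current_value NUMBER NOT NULL);
CREATE OR REPLACE FUNCTION next_series_value (p_series IN VARCHAR2)
  RETURN NUMBER
IS
  l_value NUMBER;
BEGIN
  -- The row lock is held until the caller commits or rolls back,
  -- so concurrent callers for the same series queue up here.
  UPDATE series_current
     SET current_value = current_value + 1
   WHERE series_name = p_series
  RETURNING current_value INTO l_value;

  IF SQL%ROWCOUNT = 0 THEN
    RAISE_APPLICATION_ERROR(-20001, 'Unknown series: ' || p_series);
  END IF;

  RETURN l_value;
END next_series_value;
/
-- Call it from PL/SQL as late as possible in the entity-creation transaction,
-- e.g. l_invoice_no := next_series_value('INVOICE');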
You could create a sequence with no cache, then get the next value from the sequence and use that as your counter.
CREATE SEQUENCE invoice_serial_seq START 101 CACHE 1;
SELECT nextval('invoice_serial_seq');
More info here
You either lock the table against inserts, and/or need to have retry code. There's no other option available. If you stop to think about what can happen with:
parallel processes rolling back
locks timing out
you'll see why.
In 2006, someone posted a gapless-sequence solution to the PostgreSQL mailing list: http://www.postgresql.org/message-id/44E376F6.7010802@seaworthysys.com
I am trying to create a cache for a table in Oracle DB. I monitor the changes in the DB using DBMS_CHANGE_NOTIFICATION to automatically update the cache.
This, however, only works satisfactorily as long as the updates I make are rather small -- if I delete a large portion of rows, the ALL_ROWS flag of the notification structure is set to true and the array of ROWIDs is NULL.
By trial and error I found that the threshold for the number of updated rows is about 100, which is really too low. If a table contains several million rows and I delete a thousand, I do not get information on what was updated and have to refresh the cache for the whole table, which is unacceptable.
Can I somehow change this threshold? I could not find a specific answer in the documentation:
If the ALL_ROWS (0x1) bit is set it means that either the entire table
is modified (for example, DELETE * FROM t) or row level granularity of
information is not requested or not available in the notification and
the receiver has to conservatively assume that the entire table has
been invalidated.
This only gives me vague information.
From the docs I found this:
If the ALL_ROWS bit is set in the table operation flag, then it means
that all rows within the table may have been potentially modified. In
addition to operations like TRUNCATE that affect all rows in the
tables, this bit may also be set if individual rowids have been rolled
up into a FULL table invalidation.
This can occur if too many rows were modified on a given table in a
single transaction (more than 80) or the total shared memory
consumption due to rowids on the RDBMS is determined too large
(exceeds 1 % of the dynamic shared pool size). In this case, the
recipient must conservatively assume that the entire table has been
invalidated and the callback/application must be able to handle this
condition.
I rolled my own solution years ago, which gives me control/flexibility, but perhaps someone has a workaround for you (commit in small chunks of 50? but what if your app isn't the only one changing the table?). I think the whole point is to only cache tables that are slowly changing, but this restriction does seem silly to me.
Currently there is a procedure where you can specify the value:
SET_ROWID_THRESHOLD
It would be nice if I could look up the current value with a getter, but I haven't found one.
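For what it's worth, a sketch of calling it (Oracle PL/SQL; the table name and threshold are placeholders, and I haven't tested how large a value it will accept):
-- Raise the per-table rollup threshold for change notification.
BEGIN
  DBMS_CHANGE_NOTIFICATION.SET_ROWID_THRESHOLD('MYSCHEMA.MYTABLE', 1000);
END;
/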
I have a table called employees with 3 columns: FirstName, LastName, and SSN.
Data is fed into this table nightly by a .Net service, something I'm not comfortable updating.
I'd like to have a trigger that says:
Hey, I see you're trying to insert something in the SSN column... let's HASH that before it goes in.
One way is to use an INSTEAD OF TRIGGER:
CREATE TRIGGER dbo.HashSSN
ON dbo.tablename
INSTEAD OF INSERT
AS
BEGIN
  SET NOCOUNT ON;

  -- Intercept the insert and store the hash of the SSN instead of the raw value.
  INSERT dbo.tablename (FirstName, LastName, SSN)
  SELECT FirstName, LastName, HASHBYTES('SHA1', SSN)
  FROM inserted;
END
GO
Business Rule Compliance and Staging Tables
Another way is to not insert to the final table but to use a staging table. The staging table is a sort of permanent temporary table that has no constraints, allows NULLs, is in a schema such as import and is simply a container for an external data source to drop data into. The concept is then that a business process with proper business logic can be set up to operate on the data in the container.
This is a kind of "data scrubbing" layer where the SSN hashing can be done, along with other business rules such as nullability and allowed omissions, capitalization, lengths, naming, duplicate elimination, key lookup, and change notification, before finally performing the insert. The benefit is that a set of bad data, instead of being inserted, forced to roll back, and blowing up the original process, can be detected, preserved intact without loss, and ultimately handled properly (such as being moved to an error queue, notifications sent, and so on).
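As a rough illustration of such a staging layer (SQL Server; the import schema, the table and column names, and the assumption that the destination SSN column holds the hash are all mine, not from the original):
-- Constraint-free landing zone that the external .NET service writes into.
CREATE SCHEMA import;
GO
CREATE TABLE import.Employees
(
    FirstName NVARCHAR(100) NULL,
    LastName  NVARCHAR(100) NULL,
    SSN       VARCHAR(11)   NULL
);
GO
-- Scrubbing step: enforce the business rules, hash the SSN, then load the real table.
INSERT dbo.Employees (FirstName, LastName, SSN)
SELECT LTRIM(RTRIM(FirstName)),
       LTRIM(RTRIM(LastName)),
       HASHBYTES('SHA1', REPLACE(SSN, '-', ''))
FROM import.Employees
WHERE SSN IS NOT NULL;   -- rows that fail the rules stay behind for review or an error queue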
Many people would use SSIS for tasks like this, though I personally find SSIS very hard to work with, since it has problems such as brittleness, difficulty using stored procedures that contain temp tables, deployment challenges, and not being part of database backups.
If such a scheme seems like overkill to you so that you wouldn't even consider it, step back for a second and think about it: you have an external process that is supposed to be inserting proper, exact, scrubbed, and certainly-known data into a table. But, it's not doing that. Instead, it's inserting data that does not conform to business rules. I think that slapping on a trigger could be a way to handle it, but this is also an opportunity for you to think more about the architecture of the system and explore why you're having this problem in the first place.
How do you think untrusted or non-business-rule-compliant data should become trusted and business-rule-compliant? Where do transformation tasks such as hashing an SSN column belong?
Should the inserting process be aware of such business rules? If so, is this consistent across the organization, the architecture, and the kind of process the inserter is? If not, how will you address this so that going forward you're not putting patches on fixes on kludges?
The Insecurity of an SSN Hash
Furthermore, I would like to point something else out. There are only about 889 million SSNs possible (888,931,098) if there are no TINs. How long do you think it would take to run through all of them and compare the hash to those in your table? Hashing certainly reduces quick exposure--you can't just read the SSN out extremely easily. But given it only takes a billion tries, it's a matter of days or even hours to pop all of them, depending on resources and planning.
A rainbow table with all SSNs and their SHA1 hashes would only take on the order of 25-30 GB -- quite achievable even on a relatively inexpensive home computer, where once created it would allow popping any SSN in a split second. Even using a longer or more computationally expensive hash isn't going to help much. In a matter of days or weeks a rainbow table can be built. A few hundred bucks can buy multiple terabytes of storage nowadays.
You could salt the SSN hash, which will mean that if someone runs a brute force crack against your table they will have to do it once for each row rather than be able to get all the rows at once. This is certainly better, but it only delays the inevitable. A serious hacker probably has a bot army backing him up that can crack a simple SSN + salt in a matter of seconds.
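As a rough sketch of salting inside the INSTEAD OF trigger shown earlier (SQL Server; the Salt column and the choice of SHA2_256 are my own assumptions, not part of the original schema):
-- Assumes the table has been extended with a Salt column, e.g.:
--   ALTER TABLE dbo.tablename ADD Salt UNIQUEIDENTIFIER;
INSERT dbo.tablename (FirstName, LastName, Salt, SSN)
SELECT i.FirstName,
       i.LastName,
       s.Salt,
       HASHBYTES('SHA2_256', CONVERT(CHAR(36), s.Salt) + i.SSN)
FROM inserted AS i
CROSS APPLY (VALUES (NEWID())) AS s(Salt);   -- a fresh salt per row, stored alongside the hash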
Further Thoughts
I would be interested in the business rules that on the one hand require you to be able to verify SSNs and use them as a type of password, but on the other hand don't allow you to store the full values. Do you have security concerns about your database? Now that you've updated your question to say that these are employees, my question about the exclusion of non-SSN-holders is moot. However, I'm still curious why you need to hash the values and can't just store them. It's not just fine but required for an employer to have its employees' SSNs so it can report earnings and deductions to the government.
If on the other hand, your concern isn't really about security but more about deniability ("your SSN is never stored on our servers!") then that isn't really true, now, is it? All you've done is transform it in a way that can be reversed through brute-force, and the search space is small enough that brute force is quite reasonable. If someone gives you the number 42, and you multiply it by 2 and save 84, then tell the person that his number was not stored, but you can simply divide 84 by 2 to get the original number, then you're not really being completely straightforward.
Certainly, "one-way" hashing is much harder to reverse than multiplying, but we're not dealing with a problem such as "find the original 200 thousand-character document (or whatever) from its hash" but "find a 9 digit number from its hash". Sure, many different inputs will hash to the same values as one particular SSN, but I doubt that there are very many collisions of exactly 9-character strings consisting exclusively of numeric digits.
Actual SHA-1 SSN Hash Reversal Testing
I just did some testing. I have a table with about 3200 real SSNs in it. I hashed them using SHA1 and put those hashes into a temp table containing just the one column. I was able to pop 1% of the SSNs in about 8 minutes searching upward from 001-01-0001. Based on the speed of processing and the total search space it will be done in less than 3 hours (it's taking ~2 minutes per 10 million SSNs, so 88.89 * 2 minutes). And this is from inside SQL Server, not running a compiled program that could be much, much faster. That's not very secure!
Is there any performance impact, or any other kind of issue?
The reason I am doing this is that we are doing some synchronization between two sets of DBs with similar tables, and we want to avoid duplicate PK errors when synchronizing the data.
Yes, it's okay.
Note: If you have performance concerns you could use the "CACHE" option on "CREATE SEQUENCE":
"Specify how many values of the sequence the database preallocates and keeps in memory for faster access. This integer value can have 28 or fewer digits. The minimum value for this parameter is 2. For sequences that cycle, this value must be less than the number of values in the cycle. You cannot cache more values than will fit in a given cycle of sequence numbers. Therefore, the maximum value allowed for CACHE must be less than the value determined by the following formula:"
(CEIL (MAXVALUE - MINVALUE)) / ABS (INCREMENT)
"If a system failure occurs, all cached sequence values that have not been used in committed DML statements are lost. The potential number of lost values is equal to the value of the CACHE parameter."
Sure. What you plan on doing is actually a rather common practice. Just make sure the variables in your client code that you use to hold IDs are big enough (i.e., use longs instead of ints).
The only problem we recently had with creating tables with really large seeds was when we tried to interface with a system we did not control. That system was apparently reading our IDs as a char(6) field, so when we sent row 10000000 it would fail to write.
Performance-wise we have seen no issues on our side with using large ID numbers.
No performance impact that we've seen. I routinely bump sequences up by a large amount. The gaps come in handy if you need to "backfill" data into the table.
The only time we had a problem was when a really large sequence exceeded MAXINT on a particular client program. The sequence was fine, but the conversion to an integer in the client app started failing! In our case it was easy to refactor the ID column in the table and get things running again, but in retrospect this could have been a messy situation if the tables had been arranged differently!
If you are synching two tables why not change the PK seed/increment amount so that everything takes care of itself when a new PK is added?
Let's say you had to synch the data from 10 patient tables in 10 different databases.
Let's also say that eventually all databases had to be synched into a Patient table at headquarters.
Increment the PK by ten for each row, but ensure the last digit is different for each database.
DB0 10,20,30..
DB1 11,21,31..
.....
DB9 19,29,39..
When everything is merged there is guaranteed to be no conflicts.
This is easily scaled to n database tables. Just make sure your PK type will not overflow; I think BigInt should be big enough for you...
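In SQL Server terms that could look roughly like this (hypothetical Patient table; only two of the ten databases shown):
-- In database DB0: seed 10, increment 10 -> 10, 20, 30, ...
CREATE TABLE dbo.Patient
(
    PatientID BIGINT IDENTITY(10, 10) PRIMARY KEY,
    Name      NVARCHAR(100) NOT NULL
);

-- In database DB1 the same table would use IDENTITY(11, 10) -> 11, 21, 31, ...
-- ... and database DB9 would use IDENTITY(19, 10) -> 19, 29, 39, ...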
I'm trying to find if there is a reliable way (using SQLite) to find the ID of the next row to be inserted, before it gets inserted. I need to use the id for another insert statement, but don't have the option of instantly inserting and getting the next row.
Is predicting the next id as simple as getting the last id and adding one? Is that a guarantee?
Edit: A little more reasoning...
I can't insert immediately because the insert may end up being canceled by the user. User will make some changes, SQL statements will be stored, and from there the user can either save (inserting all the rows at once), or cancel (not changing anything). In the case of a program crash, the desired functionality is that nothing gets changed.
Try SELECT * FROM SQLITE_SEQUENCE WHERE name='TABLE';. The result contains a field called seq, which is the largest ROWID used so far for the selected table. Add 1 to this value to get the next ID.
Also see the SQLite Autoincrement article, which is where the above info came from.
Cheers!
Either scrapping or committing a series of database operations all at once is exactly what transactions are for. Query BEGIN; before the user starts fiddling and COMMIT; once he/she's done. You're guaranteed that either all the changes are applied (if you commit) or everything is scrapped (if you query ROLLBACK;, if the program crashes, power goes out, etc). Once you read from the db, you're also guaranteed that the data is good until the end of the transaction, so you can grab MAX(id) or whatever you want without worrying about race conditions.
http://www.sqlite.org/lang_transaction.html
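A minimal sketch of that pattern in SQLite (the orders and order_items tables are made up):
BEGIN;

-- Parent row; its rowid is immediately visible to this connection.
INSERT INTO orders (customer) VALUES ('alice');

-- The dependent insert can use last_insert_rowid() for the foreign key.
INSERT INTO order_items (order_id, item)
VALUES (last_insert_rowid(), 'widget');

COMMIT;      -- keep everything
-- ROLLBACK; -- or discard everything if the user cancels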
You can probably get away with adding 1 to the value returned by sqlite3_last_insert_rowid under certain conditions; for example, if you use the same database connection and there are no other concurrent writers. Of course, you may refer to the sqlite source code to back up these assumptions.
However, you might also seriously consider using a different approach that doesn't require predicting the next ID. Even if you get it right for the version of sqlite you're using, things could change in the future and it will certainly make moving to a different database more difficult.
Insert the row with an INVALID flag of some kind, get the ID, edit the row as needed, then delete it if necessary or mark it as valid. And don't worry about gaps in the sequence.
BTW, you will need to figure out how to do the invalid part yourself. Marking something as NULL might work depending on the specifics.
Edit: If you can, use Eevee's suggestion of using proper transactions. It's a lot less work.
I realize your application using SQLite is small and SQLite has its own semantics. Other solutions posted here may well have the effect that you want in this specific setting, but in my view every single one I have read so far is fundamentally incorrect and should be avoided.
In a normal environment holding a transaction for user input should be avoided at all costs. The way to handle this, if you need to store intermediate data, is to write the information to a scratch table for this purpose and then attempt to write all of the information in an atomic transaction. Holding transactions invites deadlocks and concurrency nightmares in a multi-user environment.
In most environments you cannot assume data retrieved via SELECT within a transaction is repeatable. For example
SELECT Balance FROM Bank ...
UPDATE Bank SET Balance = valuefromselect + 1.00 WHERE ...
By the time the UPDATE runs, the value of Balance may well have changed. Sometimes you can get around this by updating the row(s) you're interested in in Bank first, within a transaction, as this is guaranteed to lock the row, preventing further updates from changing its value until your transaction has completed.
However, sometimes a better way to ensure consistency in this case is to check your assumptions about the contents of the data in the WHERE clause of the update and check row count in the application. In the example above when you "UPDATE Bank" the WHERE clause should provide the expected current value of balance:
WHERE Balance = valuefromselect
If the expected balance no longer matches, neither does the WHERE condition: the UPDATE does nothing and the row count returns 0. This tells you there was a concurrency issue, and you need to rerun the operation when something else isn't trying to change your data at the same time.
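Put together, the pattern looks roughly like this (the Bank table and Balance column come from the example above; AccountId and the @-parameters are placeholders held by the application):
-- Read the current balance; the application remembers it as @expected_balance.
SELECT Balance FROM Bank WHERE AccountId = @account_id;

-- Write it back only if nobody changed it in the meantime.
UPDATE Bank
   SET Balance = @expected_balance + 1.00
 WHERE AccountId = @account_id
   AND Balance   = @expected_balance;

-- In the application: if the affected row count is 0, another session changed the
-- balance first; re-read it and retry the whole operation.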
select max(id) from particular_table is unreliable, for the reason below:
http://www.sqlite.org/autoinc.html
"The normal ROWID selection algorithm described above will generate monotonically increasing unique ROWIDs as long as you never use the maximum ROWID value and you never delete the entry in the table with the largest ROWID. If you ever delete rows or if you ever create a row with the maximum possible ROWID, then ROWIDs from previously deleted rows might be reused when creating new rows and newly created ROWIDs might not be in strictly ascending order."
I think this can't be done, because there is no way to be sure that nothing will get inserted between you asking and you inserting. (You might be able to lock the table against inserts, but yuck.)
(BTW, I've only used MySQL, but I don't think that will make any difference.)
Most likely you should be able to add 1 to the most recent id. I would look at all (going back a while) of the existing ids in the ordered table. Are they consistent, and is each row's ID one more than the last? If so, you'll probably be fine. I'd leave a comment in the code explaining the assumption, however. Taking a lock will also help guarantee that you're not getting additional rows while you do this.
Select the last_insert_rowid() value.
Most of what needs to be said on this topic already has been... However, be very careful of race conditions when doing this. If two people both open your application/webpage/whatever, and one of them adds a row, the other user will try to insert a row with the same ID and you will have lots of issues.
select max(id) from particular_table;
The next id will be +1 from the maximum id.