How to handle concurrency-related issues on a DB table if multiple applications are reading and writing on it? This case may not be specific to microservices.
OPERATION         STATUS
GET_ORDER         COMPLETE
CALCULATE_PRICE   RUNNING
A very basic use case: multiple applications write to the above table. Before writing, each one checks whether the same operation is already present in RUNNING status. If it is not present, it inserts the entry; otherwise it just skips. Both the read and the write are simple SQL queries.
The problem is that two different applications can read at the same time, both find that no 'CREATE_INVOICE' operation is RUNNING, and both insert it, so the table ends up looking like this:
OPERATION         STATUS
GET_ORDER         COMPLETE
CALCULATE_PRICE   RUNNING
CREATE_INVOICE    RUNNING
CREATE_INVOICE    RUNNING
As a result the table has two duplicate CREATE_INVOICE records. Besides applying a unique constraint on the table, what are the ways to resolve this?
By "2 different applications" do you mean that there are two completely separate applications which create invoices, or just 2 instances of the same application?
If the former, I'd be curious why there are two applications doing the same thing writing to the same DB.
If the latter, those instances will need to coordinate in some way (a uniqueness constraint on the table is an example of such coordination), and it's important to note that this coordination makes the application a little more stateful.
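As an illustration, one such coordination mechanism that avoids a uniqueness constraint is to serialise the check-then-insert under an advisory lock. A minimal sketch in PostgreSQL follows; the operations table and its columns are assumptions, not something from your schema:

BEGIN;
-- map the operation name to a lock key; only one session at a time may hold it,
-- and the lock is released automatically at COMMIT/ROLLBACK
SELECT pg_advisory_xact_lock(hashtext('CREATE_INVOICE'));

INSERT INTO operations (operation, status)
SELECT 'CREATE_INVOICE', 'RUNNING'
WHERE NOT EXISTS (
    SELECT 1 FROM operations
     WHERE operation = 'CREATE_INVOICE' AND status = 'RUNNING');
COMMIT;

Other databases have equivalents (sp_getapplock in SQL Server, GET_LOCK in MySQL, DBMS_LOCK in Oracle); whichever you use, the check and the insert must happen under the same mutual exclusion.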
My preferred way of dealing with this would be to be event driven (e.g. by tapping into database change data capture) and sharding: for instance, when a GET_ORDER record is marked COMPLETE in the DB (resulting in a CDC record being published), based on the order ID, that CDC record is always routed to the same shard in the invoice creation application (or the price calculation application for that matter; your second table seems to imply that invoice creation can be simultaneous with price calculation), thus avoiding the conflict.
Related
I must create a unique serial number for invoices. I have a table ID and another column for this unique number. I use the serializable isolation level. Using
var seq = @"SELECT invoice_serial + 1 FROM invoice WHERE ""type""=@type ORDER BY invoice_serial DESC LIMIT 1";
This doesn't help, because even with FOR UPDATE it won't read the correct value under the serializable isolation level.
The only solution seems to be adding some retry code.
Sequences do not generate gap-free sets of numbers, and there's really no way of making them do that because a rollback or error will "use" the sequence number.
I wrote up an article on this a while ago. It's directed at Oracle but is really about the fundamental principles of gap-free numbers, and I think the same applies here.
Well, it's happened again. Someone has asked how to implement a requirement to generate a gap-free series of numbers, and a swarm of nay-sayers have descended on them to say (and here I paraphrase slightly) that this will kill system performance, that it's rarely a valid requirement, that whoever wrote the requirement is an idiot, blah blah blah.
As I point out on the thread, it is sometimes a genuine legal requirement to generate a gap-free series of numbers. Invoice numbers for the 2,000,000+ organisations in the UK that are VAT (sales tax) registered have such a requirement, and the reason for this is rather obvious: it makes it more difficult to hide the generation of revenue from the tax authorities. I've seen comments that it is a requirement in Spain and Portugal, and I'd not be surprised if it were a requirement in many other countries as well.
So, if we accept that it is a valid requirement, under what circumstances are gap-free series of numbers a problem? Group-think would often have you believe that it always is, but in fact it is only a potential problem under very particular circumstances.
The series of numbers must have no gaps.
Multiple processes create the entities to which the number is associated (eg. invoices).
The numbers must be generated at the time that the entity is created.
If all of these requirements must be met then you have a point of serialisation in your application, and we’ll discuss that in a moment.
First let’s talk about methods of implementing a series-of-numbers requirement if you can drop any one of those requirements.
If your series of numbers can have gaps (and you have multiple processes requiring instant generation of the number), then use an Oracle Sequence object. They are very high performance, and the situations in which gaps can be expected have been very well discussed. It is not too challenging to minimise the number of values skipped by making design efforts to reduce the chance of a process failure between generating the number and committing the transaction, if that is important.
If you do not have multiple processes creating the entities (and you need a gap-free series of numbers that must be instantly generated), as might be the case with the batch generation of invoices, then you already have a point of serialisation. That in itself may not be a problem, and may be an efficient way of performing the required operation. Generating the gap-free numbers is rather trivial in this case. You can read the current maximum value and apply an incrementing value to every entity with a number of techniques. For example if you are inserting a new batch of invoices into your invoice table from a temporary working table you might:
insert into
  invoices
  ( invoice#,
    ... )
with curr as (
  select coalesce(max(invoice#), 0) as max_invoice#
  from   invoices)
select
  curr.max_invoice# + rownum,
  ...
from
  curr,
  tmp_invoice
...
Of course you would protect your process so that only one instance can run at a time (probably with DBMS_Lock if you're using Oracle), protect the invoice# with a unique key constraint, and probably check for missing values with separate code if you really, really care.
If you do not need instant generation of the numbers (but you need them gap-free and multiple processes generate the entities), then you can allow the entities to be generated and the transaction committed, and leave generation of the number to a single batch job, either as an update on the entity table or as an insert into a separate table.
So what if we need the trifecta: instant generation of a gap-free series of numbers by multiple processes? All we can do is try to minimise the period of serialisation in the process. I offer the following advice, and welcome any additional advice (or counter-advice, of course).
Store your current values in a dedicated table. DO NOT use a sequence.
Ensure that all processes use the same code to generate new numbers by encapsulating it in a function or procedure.
Serialise access to the number generator with DBMS_Lock, making sure that each series has its own dedicated lock.
Hold the lock in the series generator until your entity creation transaction is complete, by releasing the lock on commit.
Delay the generation of the number until the last possible moment.
Consider the impact of an unexpected error after generating the number and before the commit is completed — will the application rollback gracefully and release the lock, or will it hold the lock on the series generator until the session disconnects later? Whatever method is used, if the transaction fails then the series number(s) must be “returned to the pool”.
Can you encapsulate the whole thing in a trigger on the entity’s table? Can you encapsulate it in a table or other API call that inserts the row and commits the insert automatically?
Original article
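For illustration, here is a minimal sketch of the dedicated-table approach described in the article, in Oracle syntax. The object names are assumptions, and for brevity a row lock on the series row stands in for DBMS_Lock; like the recommended lock, it is held until the invoice transaction commits or rolls back:

create table series_numbers (
  series     varchar2(30) primary key,
  last_value number       not null
);

declare
  v_invoice_no number;
begin
  -- take the next value as late as possible; the UPDATE locks the series row
  -- until COMMIT/ROLLBACK, serialising concurrent callers on this series only
  update series_numbers
     set last_value = last_value + 1
   where series = 'UK_VAT_INVOICES'
  returning last_value into v_invoice_no;

  -- insert the invoice using v_invoice_no here, then commit;
  -- a rollback releases the lock and returns the number to the pool
  commit;
end;
/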
You could create a sequence with no cache, then get the next value from the sequence and use that as your counter.
CREATE SEQUENCE invoice_serial_seq START 101 CACHE 1;
SELECT nextval('invoice_serial_seq');
More info here
You either lock the table against inserts, and/or need to have retry code. There's no other option available. If you stop to think about what can happen with:
parallel processes rolling back
locks timing out
you'll see why.
In 2006, someone posted a gapless-sequence solution to the PostgreSQL mailing list: http://www.postgresql.org/message-id/44E376F6.7010802@seaworthysys.com
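One way to serialise without retries is a counter table held under a row lock; a minimal PostgreSQL sketch (table and column names are assumptions):

begin;
-- the UPDATE takes a row lock that is held until COMMIT, so concurrent callers
-- serialise per invoice type and a ROLLBACK leaves no gap
update invoice_counters
   set last_value = last_value + 1
 where "type" = 'standard'
returning last_value;   -- use the returned value as the new invoice_serial

-- insert the invoice with that invoice_serial here, then
commit;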
Background
I am designing a Data Warehouse with SQL Server 2012 and SSIS. The source system handles hotel reservations. The reservations are split between two tables, header and header line item. The Fact table would be at the line item level with some data from the header.
The issue
The challenge I have is that the reservation (and its line items) can change over time.
An example would be:
The booking is created.
A room is added to the booking (as a header line item).
The customer arrives and adds food/drinks to their reservation (more line items).
A payment is added to the reservation (as a line item).
A room could be subsequently cancelled and removed from the booking (a line item is deleted).
The number of people in a room can change, affecting that line item.
The booking status changes from "Provisional" to "Confirmed" at a point in its life cycle.
Those last three points are key; I'm not sure how I would keep those lines updated without looking up the record and updating it. The business would like to keep track of the updates and deletions.
I'm resisting updating because:
From what I've read about Fact tables, it's not good practice to revisit rows once they've been written into the table.
I could do this with a look-up component, but with upwards of 45 million rows, is that the best approach?
The questions
What type of Fact table or loading solution should I go for?
Should I be updating the records, if so how can I best do that?
I'm open to any suggestions!
Additional Questions (following answer from ElectricLlama):
The fact does have a 1:1 relationship with the source. You talk about possible constraints on future development. Would you be able to elaborate on the type of constraints I would face?
Each line item will have a modified (and created date). Are you saying that I should delete all records from the fact table which have been modified since the last import and add them again (sounds logical)?
If the answer to 2 is "yes" then for auditing purposes would I write the current fact records to a separate table before deleting them?
In point one, you mention deleting/inserting the last x days bookings based on reservation date. I can understand inserting new bookings. I'm just trying to understand why I would delete?
If you effectively have a 1:1 between the source line and the fact, and you store a source system booking code in the fact (no dimensional modelling rules against that) then I suggest you have a two step load process.
delete/insert the last x days bookings based on reservation date (or whatever you consider to be the primary fact date),
delete/insert based on all source booking codes that have changed (you will of course have to know this beforehand)
You just need to consider what constraints this puts on future development, i.e. when you get additional source systems to add, you'll need to maintain the 1:1 fact to source line relationship to keep your load process consistent.
I've never updated a fact record in a data-load process; I always delete/insert a certain data domain (i.e. that domain might be the trailing 20 days or a source system booking code). This is effectively the same as an update but also takes care of deletes.
With regards to auditing changes in the source, I suggest you write that to a different table altogether, not the main fact, as its purpose will be audit, not analysis.
The requirement to identify changed records in the source (for data loads and auditing) implies you will need to create triggers and log tables in the source or enable native SQL Server CDC if possible.
At all costs avoid using the SSIS lookup component as it is totally ineffective and would certainly be unable to operate on 45 million rows.
Stick with the 'insert/delete a data portion' approach, as it lends itself to SSIS's ability to insert/delete (and its inability to efficiently update or look up).
In answer to the follow up questions:
1:1 relationship in fact
What I'm getting at is that you have no visibility of any future systems that need to be integrated, or of what future upgrades to your existing source system might do. This 1:1 mapping introduces a design constraint (it's not really a constraint, more a framework). Thinking about it, any new system does not need to follow this particular load design, as long as its data arrives in the fact consistently. I think implementing this 1:1 design is a good idea; I'm just trying to consider any downside.
If your source has a reliable modified date then you're in luck as you can do a differential load - only load changed records. I suggest you:
Load all recently modified records (last 5 days?) into a staging table
Do a DELETE/INSERT based on the record key. Do the delete inside SSIS in an execute SQL task, don't mess about with feeding data flows into row-by-row delete statements.
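For example, the Execute SQL step could look like the following, assuming a staging table stg_reservation_line and a fact table fact_reservation_line keyed on booking_code and line_no (all of these names are assumptions):

-- remove every fact line for the bookings present in this load...
DELETE f
  FROM fact_reservation_line AS f
 WHERE EXISTS (SELECT 1
                 FROM stg_reservation_line AS s
                WHERE s.booking_code = f.booking_code);

-- ...then reinsert their current state from staging
INSERT INTO fact_reservation_line (booking_code, line_no, amount, modified_date)
SELECT booking_code, line_no, amount, modified_date
  FROM stg_reservation_line;

Deleting at the booking level and reinserting all of a booking's current lines is what takes care of line items that were removed in the source.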
Audit table:
The simplest and most accurate way to do this is simply implement triggers and logs in the source system and keep it totally separate to your star schema.
If you do want this captured as part of your load process, I suggest you do a comparison between your staging table and the existing audit table and only write new audit rows, i.e. reservation X last modified date in the audit table is 2 Apr, but reservation X last modified date in the staging table is 4 Apr, so write this change as a new record to the audit table. Note that if you do a daily load, any changes in between won't be recorded, that's why I suggest triggers and logs in the source.
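A sketch of that comparison, reusing the assumed staging table and an assumed audit_reservation table keyed on booking_code and modified_date:

-- only write audit rows for versions we have not seen before
INSERT INTO audit_reservation (booking_code, modified_date, booking_status)
SELECT DISTINCT s.booking_code, s.modified_date, s.booking_status
  FROM stg_reservation_line AS s
 WHERE NOT EXISTS (SELECT 1
                     FROM audit_reservation AS a
                    WHERE a.booking_code  = s.booking_code
                      AND a.modified_date = s.modified_date);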
DELETE/INSERT records in Fact
This is more about ensuring you have an overlapping window in your load process, so that if the process fails for a couple of days (as they always do), you have some contingency there, and it will seamlessly pick the process back up once it's working again. This is not so important in your case as you have a modified date to identify differential changes, but normally, for example, I would pick a transaction date and delete, say, 7 trailing days. This means that my load process can be broken for 6 days, and if I fix it by the seventh day everything will reload properly without needing extra intervention to load the intermediate days.
I would suggest having a deleted flag and updating that instead of deleting. Your performance will also be better.
This will enable you to perform an analysis on how the reservations are changing over a period of time. You will need to ensure that this flag is used in all the analysis to ensure that there is no confusion.
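A sketch of that, reusing the assumed fact_reservation_line and stg_reservation_line names from the earlier sketches:

ALTER TABLE fact_reservation_line ADD is_deleted BIT NOT NULL DEFAULT 0;

-- flag lines that have disappeared from the source instead of deleting them
UPDATE f
   SET is_deleted = 1
  FROM fact_reservation_line AS f
 WHERE EXISTS (SELECT 1 FROM stg_reservation_line AS s
                WHERE s.booking_code = f.booking_code)
   AND NOT EXISTS (SELECT 1 FROM stg_reservation_line AS s
                    WHERE s.booking_code = f.booking_code
                      AND s.line_no      = f.line_no);

-- every analysis query then has to filter on the flag
SELECT booking_code, SUM(amount) AS total_amount
  FROM fact_reservation_line
 WHERE is_deleted = 0
 GROUP BY booking_code;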
I have an app (vb.net) which collects data from users and stores the data locally on their laptop until they sync it up with a central SQLServer 2008 database. The sync needs to be in both directions. So right now, I have a timestamp on each record that gets set when that record gets updated. Then I compare times on the records to see which is more recent. If a record on the laptop is more recent than the one on the central DB, the laptop record gets sent up. And if the record on the central DB is more recent than the laptop, that record gets sent down to the laptop.
I have several hundred thousand records spread over about 15 tables. It takes 3 to 4 minutes to run through all of them if you are local on the network. The problem gets much worse for remote users: it takes them 20 to 30 minutes to sync via VPN.
I have about 5 users doing this and they all need to maintain the same information with each other by way of the central database. They all sync to the central DB, not with each other.
Is there a better way to check every record other than comparing timestamps?
Note that only a handful of records (5%) change each time they sync, but I don't know which ones it may be. It could be any of them. So I have to check all of them.
Thanks.
In my opinion timestamps are not the way to go for determining which records to send to the other party.
Although they might be "ok" for conflict resolution, time differences between the synchronizing parties (computers) might cause records to be skipped from sending out, causing real problems.
Myself, I use an identity column (on the server side) on one dedicated table to generate sequence numbers; in every transaction I get a new sequence number and assign it to all updated/inserted rows of the other tables that need synchronization.
Now when a client requests synchronization, it provides the server with the latest sequence number it received during the last synchronization, or 0 if it is the first time.
The server sends only those records that have a greater sequence number, then determines the highest sequence number among the records it actually sent to the client, and gives this number to the client for the next synchronization request.
In my scenario, conflict resolution is done on the client, because all the business logic is there anyway; this means that the client always receives updates first, before it starts to send its own.
Because you use one newly generated sequence number for every transaction, you maintain referential integrity during each synchronization. To make sure that actually holds, you need to determine the current highest sequence number before you start to send synchronization data, and never retrieve any records with a higher number, because otherwise you could break referential integrity: some other thread might have committed inserts of Orders and OrderItems after you had already looked up the Orders but not the OrderItems, leaving OrderItems in your outgoing synchronization package without their Order.
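A minimal sketch of this scheme in SQL Server follows; all table, column, and variable names are assumptions:

-- one dedicated table hands out the per-transaction sequence numbers
CREATE TABLE SyncSequence (
    SeqId     BIGINT IDENTITY(1,1) PRIMARY KEY,
    CreatedAt DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
);

-- inside each server-side write transaction: reserve one number, stamp the touched rows
DECLARE @orderId INT = 42;          -- example value; supplied by the application
DECLARE @seq BIGINT;
INSERT INTO SyncSequence DEFAULT VALUES;
SET @seq = SCOPE_IDENTITY();
UPDATE Orders     SET SyncSeq = @seq WHERE OrderId = @orderId;
UPDATE OrderItems SET SyncSeq = @seq WHERE OrderId = @orderId;

-- when a client syncs: pin the upper bound first, then pull everything in between
DECLARE @lastClientSeq BIGINT = 0;  -- the high-water mark the client sent
DECLARE @maxSeq BIGINT = (SELECT MAX(SeqId) FROM SyncSequence);
SELECT * FROM Orders     WHERE SyncSeq > @lastClientSeq AND SyncSeq <= @maxSeq;
SELECT * FROM OrderItems WHERE SyncSeq > @lastClientSeq AND SyncSeq <= @maxSeq;
-- @maxSeq goes back to the client as its new high-water mark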
For deletions, I use an IsDeleted column, and the server holds records for some period before they really get deleted.
When clients insert data, I give them feedback on which (primary) keys those records were given, etc.
Well, there is so much more to this than I can mention here, but here are some key thoughts that you should watch carefully:
How to prevent:
Missing records
Missing deletes
Double inserts
Unnecessary sending of records (I use a nullable field LastModifierId)
Input validation
Referential integrity
Conflict resolution
Performance costs (choose the right indexes; filtered unique indexes are great for keeping track of temporary client-side insert identities of records, which may also be null; you need these to prevent double inserts)
Well, good luck; I hope this gives you food for thought.
I was going to add 'status', 'date_created', and 'date_updated' columns to every table in the database.
'status' is for soft deletion of rows.
Then I've seen a few people also add 'user_created' and 'user_updated' columns to each table.
If I add those columns too, then I will have at least 5 columns for every table.
Will this be too much overhead?
Do you think it's a good idea to have those five columns?
Also, does the 'user' in 'user_created' mean database user? or application user?
As per the comments above, I would advise adding auditing only to those tables that actually require it.
You generally want to audit the application user - in many instances, applications (such as Web or SOA) may be connecting all users with the same credential, so storing the DB login is pointless.
IMHO, the date created / last date updated / lastupdateby patterns never give the full picture, as you will only be able to see who made the last change and not see what was changed. If you are doing auditing, I would suggest that instead you do a full change audit using patterns such as an audit trigger. You can also avoid using triggers if your inserts / updates / deletes to your tables are encapsulated e.g. via Stored Procedures. True, the audit tables will grow very large, but they will generally not be queried much (generally just in witch-hunts), and can be archived, easily partitioned by date (and can be made readonly). With a separated audit table, you won't need a DateCreated or LastDateUpdated column, as this can be derived. You will generally still need the last change user however, as SQL will not be able to derive the application user.
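As an illustration of the audit-trigger pattern (SQL Server syntax; the Customer table, its columns, and the SESSION_CONTEXT key used to carry the application user are all assumptions):

CREATE TABLE Customer_Audit (
    AuditId    BIGINT IDENTITY(1,1) PRIMARY KEY,
    CustomerId INT           NOT NULL,
    Operation  CHAR(1)       NOT NULL,                    -- 'I', 'U' or 'D'
    ChangedBy  NVARCHAR(128) NOT NULL,
    ChangedAt  DATETIME2     NOT NULL DEFAULT SYSUTCDATETIME(),
    Name       NVARCHAR(200) NULL,
    Email      NVARCHAR(200) NULL
);
GO
CREATE TRIGGER trg_Customer_Audit ON Customer
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- the application user is expected in SESSION_CONTEXT; fall back to the DB login
    DECLARE @user NVARCHAR(128) =
        COALESCE(CONVERT(NVARCHAR(128), SESSION_CONTEXT(N'app_user')), SUSER_SNAME());

    -- inserts and updates: record the new values
    INSERT INTO Customer_Audit (CustomerId, Operation, ChangedBy, Name, Email)
    SELECT i.CustomerId,
           CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'U' ELSE 'I' END,
           @user, i.Name, i.Email
      FROM inserted AS i;

    -- pure deletes: record the removed values
    INSERT INTO Customer_Audit (CustomerId, Operation, ChangedBy, Name, Email)
    SELECT d.CustomerId, 'D', @user, d.Name, d.Email
      FROM deleted AS d
     WHERE NOT EXISTS (SELECT 1 FROM inserted);
END;

The application would set the user once per connection, e.g. EXEC sp_set_session_context @key = N'app_user', @value = @userName;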
If you do decide on logical deletes, I would avoid using 'status' as the field indicating them, as it is likely you have tables which model a genuine process state (e.g. Payment Status). Using a bit or char field such as ActiveYN or IsActive is common for logical deletes.
Logical deletes can be cumbersome, as all your queries will need to filter out Active = 'N' records, and keeping deleted records in your transaction tables can make those tables larger than necessary, especially on many-to-many / junction tables. Performance can also be impacted, as a 2-state field is unlikely to be selective enough to be useful in indexes. In that case, physical deletes with the full audit might make better sense.
I've used all five before, sure. When I want to track who, through a web app, is creating and (last) editing records, and when that happens, I include timestamps and the logged-in user (but not the DB user; that's not how my system is set up; we use one account for all DB interaction).
Likewise, status can also be useful if users are changing a record's, well, status. If it goes from being "Online" to "Offline" to "Archive", that record can reflect that.
However, I don't use these for every table, nor should you. Sometimes I have tables that are meant only to store part of a record (normalized), or that simply get no value from a status or a record of when and by whom they were created.
What you should be considering for every table is a Primary Key field. Unless you are more sophisticated in your approach than you sound, you will almost always want one. Some things don't necessarily need one (a list of states, for instance, could use a unique constraint on the abbreviation instead). But this is more important to most of your tables than a series of timestamp and status fields.
Simple answer - only put it in your database what you need in your database.
I decided to use HBase in a project to store the users' activities in a social network. Although HBase has a simple, column-oriented way to express data, I'm having some difficulty deciding how to represent it.
So, imagine that you have millions of users, and each user generates an activity when they, for example, comment on a thread, publish something, like, vote, etc. I basically thought of two approaches for an Activity HBase table:
The key could be the user reference + the timestamp of activity creation, with the value being all the activity metadata (most of the time fixed size).
The key is the user reference, and then each activity would be stored as a new column inside a column family.
I saw examples for other types of systems (such as blogs) that use the 2nd approach. The first approach (with fixed columns, varying only when you change the schema) is more commonly seen.
What would be the impact on the way I access the data for these two approaches?
In general you are asking whether your table should be wide or tall. HBase works with both, up to a point. Wide tables should never have a row that exceeds the region size (by default 256 MB), so a really prolific user may crash the system if you store large chunks of data for their actions. However, if you are only storing a few bytes per action, then putting all of a user's activity in one row will allow you to get their full history with one get. The downside is that you will be retrieving the full row, which could cause some slowdown for a long history (tens of seconds for rows over 100 MB).
Going with a tall table and an inverse timestamp would allow you to get a user's recent activity very quickly (start a scan with the key = user id).
Using timestamps as the end of a key is a good idea if you want to query by time, but it is a bad idea if you want to optimize writes to your database (writes will always be in the most recent region in the system, causing hot spotting).
You might also want to consider putting more information (such as the activity) in the key so that you can pick up all activity of a particular type more easily.
Another example to look at is OpenTSDB