Understanding Transactions in databases

I'm new to database applications and I'm trying to use DataMapper to build a Ruby web application.
I stumbled across this piece of code which I don't understand:
transaction do |txn|
  link = Link.new(:identifier => custom)
  link.url = Url.create(:original => original)
  link.save
end
I have a few questions: What exactly are transactions? And why was this preferred instead of just doing:
link = Link.new(:identifier => custom)
link.url = Url.create(:original => original)
link.save
When should I consider using transactions? What are the best use cases? Are there any resources available online where I can read more about these concepts?
Thanks!

A transaction is an indivisible unit of work. The idea comes from the database world and is tied to the problems of reading and updating shared data. Consider the following situation:
User A asks for object O in order to change it.
While A is working on it, user B asks for the same object. At this point both users hold identical copies of O.
A then writes the update back to the database, changing property O1 of object O. B's copy does not reflect this change; it is still the version read earlier.
B writes their update back, changing property O2, and overwrites the whole object. A's change to O1 is effectively lost.
Basically, these problems arise from multiple users reading and changing the same data concurrently.
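The scenario above is the classic "lost update". Here is a minimal sketch of it in plain Java/JDBC rather than DataMapper; it assumes an H2 in-memory database with its driver on the classpath, and the table and values are invented for illustration. Both "users" write back their whole stale copy of the row, and whoever writes last silently discards the other's change.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Arrays;

public class LostUpdateDemo {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection("jdbc:h2:mem:demo");
             Statement st = con.createStatement()) {
            st.execute("CREATE TABLE obj (id INT PRIMARY KEY, o1 VARCHAR, o2 VARCHAR)");
            st.execute("INSERT INTO obj VALUES (1, 'old', 'old')");

            String[] copyA = read(st);          // user A reads the object
            String[] copyB = read(st);          // user B reads the same object

            copyA[0] = "changed by A";          // A changes O1 ...
            write(st, copyA);                   // ... and saves the whole row

            copyB[1] = "changed by B";          // B, holding the stale copy, changes O2 ...
            write(st, copyB);                   // ... and overwrites A's change to O1

            System.out.println(Arrays.toString(read(st)));  // prints [old, changed by B]
        }
    }

    static String[] read(Statement st) throws SQLException {
        try (ResultSet rs = st.executeQuery("SELECT o1, o2 FROM obj WHERE id = 1")) {
            rs.next();
            return new String[] { rs.getString(1), rs.getString(2) };
        }
    }

    static void write(Statement st, String[] row) throws SQLException {
        st.executeUpdate("UPDATE obj SET o1 = '" + row[0] + "', o2 = '" + row[1] + "' WHERE id = 1");
    }
}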
Transactions are also used to group different operations into one logical unit of work. For example, you need to delete a User together with all of his/her associated Photos.
The topic is far too vast to cover in one post, so I'd recommend reading the following articles: wiki#1 and wiki#2.

A transaction is a series of instructions which, once executed, is seen as one atomic operation.
This means that all of the instructions must succeed for the transaction to succeed. If even one of them fails, the database returns to the state it was in before the transaction began. This is good for fault tolerance, for example.
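To make that concrete, here is a minimal sketch of the same all-or-nothing behaviour your DataMapper block gives you, written against plain JDBC. It assumes an H2 in-memory database with its driver on the classpath, and the table is invented for illustration: because the second insert fails, the first one is rolled back too and the table stays empty.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class TransactionDemo {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection("jdbc:h2:mem:demo")) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE links (id INT PRIMARY KEY, url VARCHAR NOT NULL)");
            }

            con.setAutoCommit(false);                 // start a transaction
            try (Statement st = con.createStatement()) {
                st.execute("INSERT INTO links VALUES (1, 'http://example.com')");
                st.execute("INSERT INTO links VALUES (2, NULL)");   // violates NOT NULL
                con.commit();                         // only reached if every statement succeeded
            } catch (SQLException e) {
                con.rollback();                       // undo the first insert as well
            }

            try (Statement st = con.createStatement()) {
                ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM links");
                rs.next();
                System.out.println(rs.getInt(1));     // prints 0: all or nothing
            }
        }
    }
}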
Transactions are also useful in concurrent applications: performing your operations inside a transaction prevents other processes from interfering with them.
Hope this helps.

Related

Need to establish Database connectivity in DRL file

I need to establish Oracle database connectivity in Drools to fetch some data as and when required while executing the rules. How do I go about that?
You shouldn't do this. Instead, you should query your data out of the database first, then pass it into the rules as facts in working memory.
I tried to write a detailed answer about all the reasons you shouldn't do this, but it turns out that StackOverflow has a character limit. So I'm going to give you the high level reasons.
Latency
Data consistency
Lack of DB access hardening
Extreme design constraints for rules
High maintenance burden
Potential security issues
Going in order ...
Latency. Database queries aren't free. Regardless of how good your connection management is, you will incur overhead every time you make a database call. If you have a solid understanding of the Drools execution lifecycle and how it executes rules, and you design your rules to explicitly query the database in ways that minimize the number of calls, you could consider this an acceptable risk. A good caching layer wouldn't be amiss. Note that properly designing your rules this way is not trivial, and you'll incur perpetual overhead in making sure all of your rules remain compliant.
(Hint: this means you must never ever call the database from the 'when' clause.)
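As an illustration of the caching-layer suggestion above, here is a minimal sketch that wraps the hypothetical DataService used in the example rules further down, so repeated getStudentById() calls during a single evaluation hit the database only once. Both types here are assumptions, not part of any Drools API.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class Student { /* same hypothetical fact type as in the rules below */ }

interface DataService {                    // stands in for the real data access layer
    Student getStudentById(long id);
    void save(Student student);
}

class CachingDataService implements DataService {
    private final DataService delegate;
    private final Map<Long, Student> byId = new ConcurrentHashMap<>();

    CachingDataService(DataService delegate) { this.delegate = delegate; }

    public Student getStudentById(long id) {
        // Repeated lookups for the same id are served from memory.
        return byId.computeIfAbsent(id, delegate::getStudentById);
    }

    public void save(Student student) {
        delegate.save(student);            // writes still go straight to the database
    }
}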
Data consistency. A database is a shared resource. If you make the same query in two different 'when' clauses, there is no guarantee that you'll get back the same result. Again, you could potentially work around this with a deep understanding of how Drools evaluates and executes rules, and designing your rules appropriately. But the same issues from 'latency' will affect you here -- namely the burden of perpetual maintenance. Further the rule design restrictions -- which are quite strict -- will likely make your other rules and use cases less efficient as well because of the contortions you need to pull to keep your database-dependent rules compatible.
Lack of hardening. The Java code you can write in a DRL function is not the same as the Java code you can write in a Java class. DRL files are parsed as strings and then interpreted and then compiled; many language features are simply not available. (Some examples: try-with-resources, annotations, etc.) This makes properly hardening your database access extremely complicated and in some cases impossible. Libraries which rely on annotations like Spring Data are not available to you for use in your DRL functions. You will need to manage your connection pooling, transaction management, connection management (close everything!), error handling, and so on manually using a subset of the Java language that is roughly equivalent to Java 5.
This is, of course, specific to writing your code to access the database as a function in your DRL. If you instead implement your database access in a service which acts like a database access layer, you can leverage the full JDK and its features and functionality in that external service which you then pass into the rules as an input. But in terms of DRL functions, this point remains a major concern.
Rule design constraints. As I mentioned previously, you need to have an in-depth understanding of how Drools evaluates and executes rules in order to write effective rules that interact with the database. If you're not aware that all left hand sides ("when" clauses) are executed first, then the "matches" ordered by salience, and then the right hand sides ("then" clauses) executed in order sequentially .... well you absolutely should not be trying to do this from the rules. Not only do you as the initial implementor need to understand the rules execution lifecycle, but everyone who comes after you who is going to be maintaining your rules needs to also understand this and continue implementing the rules based on these restrictions. This is your high maintenance burden.
As an example, here are two rules. Let's assume that "DataService" is a properly implemented data access layer with all the necessary connection and transaction management, and it is passed into working memory as a fact.
rule "Record Student Tardiness"
when
$svc: DataService() // data access layer
Tardy( $id: studentId )
$student: Student($tardy: tardyCount) from $svc.getStudentById($id)
then
$student.setTardyCount($tardy + 1)
$svc.save($student)
end
rule "Issue Demerit for Excessive Tardiness"
when
$svc: DataService() // data access layer
Tardy( $id: studentId )
$student: Student(tardyCount > 3) from $svc.getStudentById($id)
then
AdminUtils.issueDemerit($student, "excessive tardiness")
end
If you understand how Drools executes rules, you'll quickly realize the problems with these rules. Namely:
we call getStudentById twice (latency, consistency)
the changes to the student's tardy count are not visible to the second rule
So if our student, Alice, has 3 tardies recorded in the database, and we pass in a new Tardy instance for her, the first rule will hit and her tardy count will increment and be saved (Alice will have 4 tardies in the database.) But the second rule will not hit! Because at the time the matches are calculated, Alice only had 3 tardies, and the "issue demerit" rule only triggers for more than 3. So while she has 4 tardies now, she didn't then.
The solution to the second problem is, of course, to call update to let Drools know to reevaluate all matches with the new data in working memory. This of course exacerbates the first issue -- now we'll be calling getStudentById four times!
Finally, the last problem is potential security issues. This really depends on how you implement your queries, but you'll need to be doubly sure you're not accidentally exposing any connection configuration (URL, credentials) in your DRLs, and that you've properly sanitized all query inputs to protect yourself against SQL injection.
The right way to do this, of course, is not to do it at all. Call the database first, then pass it to your rules.
As an example, let's say we have a set of rules which is designed to determine if a customer purchase is "suspicious" by comparing it to trends from the previous 3 months' worth of purchases.
// Assume this class serves as our data access layer and does proper connection,
// transaction management. It might be something like a Spring Data JPA repository,
// or something from another library; the specifics are not relevant.
private PurchaseService purchaseService;

public boolean isSuspiciousPurchase(Purchase purchase) {
    List<Purchase> previous = purchaseService.getPurchasesForCustomerAfterDate(
        purchase.getCustomerId(),
        LocalDate.now().minusMonths(3));

    KieBase kBase = ...;
    KieSession session = kBase.newKieSession();
    session.insert(purchase);
    session.insert(previous);
    // insert other facts as needed

    session.fireAllRules();
    // ...
}
As you can see, we call the database and pass the result into working memory. Then we can write the rules such that they do work against that existing list, without needing to interact with the database at all.
If our use case requires modifying the database (e.g. saving updates), we can pass those commands back to the caller, and they can be invoked after fireAllRules completes. Not only does that keep us from having to interact with the database inside the rules, it also gives us better control over transaction management (you can probably group the updates into a single transaction, even if they originally came from multiple rules). And since we don't need to understand anything about how Drools evaluates and executes rules, it'll be a little more robust in case a rule with a database "update" is triggered twice.
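One way this could be sketched is below. This is an assumption on my part rather than an established Drools pattern: DbCommand is an invented callback type, Purchase is the same hypothetical type as above, and the rules would need to declare global java.util.List commands; and append their requested writes to it instead of touching the database.

import java.util.ArrayList;
import java.util.List;
import org.kie.api.KieBase;
import org.kie.api.runtime.KieSession;

interface DbCommand {              // hypothetical "deferred write" requested by a rule
    void apply();
}

public class RuleRunner {
    private final KieBase kBase;   // assembled elsewhere

    public RuleRunner(KieBase kBase) {
        this.kBase = kBase;
    }

    public void evaluate(Purchase purchase, List<Purchase> previous) {
        List<DbCommand> commands = new ArrayList<>();
        KieSession session = kBase.newKieSession();
        try {
            session.setGlobal("commands", commands);   // rules append DbCommands to this global
            session.insert(purchase);
            previous.forEach(session::insert);
            session.fireAllRules();
        } finally {
            session.dispose();
        }
        // Apply every write the rules requested, outside the session, ideally
        // grouped into a single database transaction managed by the caller.
        commands.forEach(DbCommand::apply);
    }
}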
You can use a function like the one below to get details from the DB. Here I have written the function in the DRL file, but it is recommended to put such code in a Java class and call the appropriate method from the DRL file.
function String ConnectDB(String connectionClass, String url, String user, String password) {
    try {
        Class.forName(connectionClass);
        java.sql.Connection con = java.sql.DriverManager.getConnection(url, user, password);
        try {
            java.sql.Statement st = con.createStatement();
            java.sql.ResultSet rs = st.executeQuery("select * from Employee where employee_id=199");
            if (rs.next()) {
                return rs.getString("employee_name");
            }
            return null;
        } finally {
            con.close(); // always release the connection
        }
    } catch (Exception e) {
        throw new RuntimeException(e); // wrap checked JDBC exceptions
    }
}

rule "DBConnection"
when
    person : PersonPojo( name == ConnectDB("com.mysql.jdbc.Driver", "jdbc:mysql://localhost:3306/root", "root", "redhat1!") )
    .. ..
then
    . . ..
end

Postgres: How to clear transaction ID for anonymity and data reduction

We're running an evaluation platform where users can comment on certain things. A key feature is that people can comment only once, and every comment is made in anonymity.
We're using Postgres for all our data. We want to save a flag in the database that a user created a comment (so they cannot comment again). In a separate table but within the same transaction, we want to save the comment itself without any link to the user.
However, Postgres stores the ID of the inserting transaction in every tuple (the xmin system column). So now there's a link between the user and their comment, which we have to avoid!
Possible (Non)Solutions
Vacuuming alone does not help as it does not clear the transaction ID. See the "Note" box in the "24.1.5. Preventing Transaction ID Wraparound Failures" section in the postgres docs.
Putting those inserts in different transactions doesn't really solve anything, since transaction IDs are consecutive.
We could aggregate comments from multiple users to one large text in the database with some separators, but since old versions of this large text would be kept by postgres at least until the next vacuum, that doesn't seem like a full solution. Also, we'd still have the order of when the user added their comment, which would be nice to not save as well.
Re-writing all the tuples in those tables periodically (by a dummy UPDATE to all of them), followed by a vacuum would probably erase the "insert history" sufficiently, but that too seems like a crude hack.
Is there any other way within postgres to make it impossible to reconstruct the insertion history of a table?
Perhaps you could use something like dblink or postgres_fdw to write to tables using a remote connection (either to the current database or another database), and thereby separate xmin values, even though you as a user think you are doing it all in the "same transaction."
Regarding the concerns about tracking via reverse-engineering sequential xmin values: since dblink is asynchronous, this issue may become moot at scale, when many users are simultaneously adding comments to the system. This might not work if you need to be able to roll back after encountering an error; it really depends on how important it is for you to confine the operations to one transaction.
I don't think there is a problem.
In your comment you write that you keep a flag with the user (however exactly you store it) that keeps track of which posting the user commented on. To keep that information private, you have to keep that flag private so that nobody except the user itself could read it.
If no other user can see that information, then no other user can see the xmin on the corresponding table entries. Then nobody could make a correlation with the xmin on the comment, so the problem is not there.
The difficult part is how you want to keep the information private which postings a user commented on. I see two ways:
Don't use database techniques to do it, but write the application so that it hides that information from the users.
Use PostgreSQL Row Level Security to do it.
There is no way you can keep the information from a superuser. Don't even try.
You could store the users with their flags and the comments on different database clusters (and use distributed transactions), then the xmins would be unrelated.
Make sure to disable track_commit_timestamp.
To make it impossible to correlate the transactions in the databases, you could issue a random number of
SELECT txid_current();
statements, which do nothing but consume transaction IDs.
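From the application side, that could look something like the following sketch. It assumes a JDBC connection to Postgres with autocommit enabled, so every SELECT txid_current() runs in, and uses up, its own transaction; the class name and the range of burned IDs are made up for illustration.

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.concurrent.ThreadLocalRandom;

public class XidNoise {
    // Consume a random number of transaction IDs so that the transaction counters
    // in the two clusters drift apart and cannot be lined up afterwards.
    public static void burnTransactionIds(Connection con) throws SQLException {
        int n = ThreadLocalRandom.current().nextInt(1, 20);   // arbitrary range
        try (Statement st = con.createStatement()) {
            for (int i = 0; i < n; i++) {
                st.executeQuery("SELECT txid_current()").close();
            }
        }
    }
}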

Concurrency with Objectify in GAE

I created a test web application to test persist-read-delete of Entities. I created a simple loop that persists an Entity, retrieves and modifies it, then deletes it, 100 times.
At some iterations of the loop there's no problem; however, at other iterations there is an error that the Entity already exists and thus can't be persisted (a custom exception handling I added).
Also, at some iterations the Entity can't be modified because it does not exist, and at others the Entity can't be deleted because it does not exist.
I understand that the loop may be so fast that the operation against the App Engine datastore is not yet complete, causing errors like "Entity does not exist" when trying to access it, or the delete operation not being finished yet so an Entity with the same ID can't be created, and so forth.
However, I want to understand how to handle these kinds of situations where concurrent operations are being performed on an Entity.
From what I understand you are doing something like the following:
for i in range(0, 100):
    ent = My_Entity()        # create and save the entity
    db.put(ent)

    ent = db.get(ent.key())  # get, modify and save the entity
    ent.property = 'foo'
    db.put(ent)

    ent = db.get(ent.key())  # get and delete the entity
    db.delete(ent)
with some error checking to make sure you have entities to delete, modify, and you are running into a bunch of errors about finding the entity to delete or modify. As you say, this is because the calls aren't guaranteed to be executed in order.
However, I want to understand how to handle these kind of situation where concurrent operation is being done with a Entity.
Your best bet for this is to batch any modifications you are making to an entity before persisting it. For example, if you are going to be creating/saving/modifying/saving, or modifying/saving/deleting, wherever possible try to combine these steps (i.e. create/modify/save, or modify/delete). Not only will this avoid the errors you're seeing, but it will also cut down on your RPCs. Following this strategy, the above loop would be reduced to...
prop = None
for i in range(0, 100):
    prop = 'foo'
Put another way: for anything that needs to be set and discarded that quickly, just use a local variable; that's GAE's answer for you. After you've figured out all the quick stuff, you can then persist that information in an entity.
Other than that there isn't much you can do. Transactions can help you if you need to make sure a bunch of entities are updated together, but they won't help if you're trying to do multiple things to one entity at once.
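Since the question is about Objectify (the snippets above are Python-style pseudocode), here is a hedged sketch of the same batching advice using the Objectify API. It assumes Objectify 4+ running inside an Objectify context (e.g. the ObjectifyFilter), and MyEntity is an invented, registered entity class; the point is simply to create and modify in memory first, then touch the datastore as few times as possible, using .now() so each write completes before the next operation.

import static com.googlecode.objectify.ObjectifyService.ofy;

import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;

@Entity
class MyEntity {
    @Id Long id;
    String property;
    void setProperty(String value) { this.property = value; }
}

public class BatchedLoop {
    public void run() {
        for (int i = 0; i < 100; i++) {
            MyEntity ent = new MyEntity();
            ent.setProperty("foo");            // create and modify in memory: no RPC yet
            ofy().save().entity(ent).now();    // one synchronous write
            ofy().delete().entity(ent).now();  // one synchronous delete
        }
    }
}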
EDIT: You could also look at the pipelines API.

Is this a functional syncing algorithm?

I'm working on a basic syncing algorithm for a user's notes. I've got most of it figured out, but before I start programming it, I want to run it by here to see if it makes sense. Usually I end up not realizing one huge important thing that someone else easily saw that I couldn't. Here's how it works:
I have a table in my database where I insert objects called SyncOperation. A SyncOperation is a sort of metadata on the nature of what every device needs to perform to be up to date. Say a user has 2 registered devices, firstDevice and secondDevice. firstDevice creates a new note and pushes it to the server. Now, a SyncOperation is created with the note's Id, operation type, and processedDeviceList. I create a SyncOperation with type "NewNote", and I add the originating device ID to that SyncOperation's processedDeviceList. So now secondDevice checks in to the server to see if it needs to make any updates. It makes a query to get all SyncOperations where secondDeviceId is not in the processedDeviceList. It finds out its type is NewNote, so it gets the new note and adds itself to the processedDeviceList. Now this device is in sync.
When I delete a note, I find the already created SyncOperation in the table with type "NewNote". I change the type to Delete, remove all devices from processedDevicesList except for the device that deleted the note. So now when new devices call in to see what they need to update, since their deviceId is not in the processedList, they'll have to process that SyncOperation, which tells their device to delete that respective note.
And that's generally how it'd work. Is my solution too complicated? Can it be simplified? Can anyone think of a situation where this wouldn't work? Will this be inefficient on a large scale?
Sounds very complicated - the central database shouldn't be responsible for determining which devices have received which updates. Here's how I'd do it:
The database keeps a table of SyncOperations, one per change. Each SyncOperation has a change_id numbered in ascending order (that is, change_id INTEGER PRIMARY KEY AUTOINCREMENT).
Each device keeps a current_change_id number representing what change it last saw.
When a device wants to update, it does SELECT * FROM SyncOperations WHERE change_id > current_change_id. This gets it the list of all changes it needs to be up-to-date. Apply each of them in chronological order.
This has the charming feature that, if you wanted to, you could initialise a new device simply by creating a new client with current_change_id = 0. Then it would pull in all updates.
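A rough sketch of the device-side pull under that scheme, assuming a JDBC connection and a SyncOperations table with an auto-incrementing change_id; the other column names and the applyChange helper are invented for illustration.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class SyncClient {
    private long currentChangeId = 0;   // 0 means "brand new device: pull everything"

    public void pullUpdates(Connection con) throws SQLException {
        String sql = "SELECT change_id, note_id, operation FROM SyncOperations "
                   + "WHERE change_id > ? ORDER BY change_id";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setLong(1, currentChangeId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    applyChange(rs.getLong("note_id"), rs.getString("operation"));
                    currentChangeId = rs.getLong("change_id");   // remember the last change applied
                }
            }
        }
    }

    private void applyChange(long noteId, String operation) {
        // Device-specific: create, update, or delete the local copy of the note.
    }
}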
Note that this won't really work if two users can be doing concurrent edits (which edit "wins"?). You can try and merge edits automatically, or you can raise a notification to the user. If you want some inspiration, look at the operation of the git version control system (or Mercurial, or CVS...) for conflicting edits.
You may want to take a look at SyncML for ideas on how to handle sync operations (http://www.openmobilealliance.org/tech/affiliates/syncml/syncml_sync_protocol_v11_20020215.pdf). SyncML has been around for a while, and as a public standard, has had a fair amount of scrutiny and review. There are also open source implementations (Funambol comes to mind) that can also provide some coding clues. You don't have to use the whole spec, but reading it may give you a few "ahah" moments about syncing data - I know it helped to think through what needs to be done.
Mark
P.S. A later version of the protocol - http://www.openmobilealliance.org/technical/release_program/docs/DS/V1_2_1-20070810-A/OMA-TS-DS_Protocol-V1_2_1-20070810-A.pdf
I have seen the basic idea of keeping track of operations in a database elsewhere, so I dare say it can be made to work. You may wish to think about what should happen if different devices are in use at much the same time, and end up submitting conflicting changes - e.g. two different attempts to edit the same note. This may surface as a change to the user interface, to allow them to intervene to resolve such conflicts manually.

Unit of Work - What is the best approach to temporary object storage on a web farm?

I need to design and implement something similar to what Martin Fowler calls the "Unit of Work" pattern. I have heard others refer to it as a "Shopping Cart" pattern, but I'm not convinced the needs are the same.
The specific problem is that users (and our UI team) want to be able to create and assign child objects (with referential integrity constraints in the database) before the parent object is created. I met with another of our designers today and we came up with two alternative approaches.
a) First, create a dummy parent object in the database, and then create dummy children and dummy assignments. We could use negative keys (our normal keys are all positive) to distinguish between the sheep and the goats in the database. Then when the user submits the entire transaction we have to update data and get the real keys added and aligned.
I see several drawbacks to this one.
It causes perturbations to the indexes.
We still need to come up with something to satisfy unique constraints on columns that have them.
We have to modify a lot of existing SQL and code that generates SQL to add yet another predicate to a lot of WHERE clauses.
Altering the primary keys in Oracle can be done, but it's a challenge.
b) Create Transient tables for objects and assignments that need to be able to participate in these reverse transactions. When the user hits Submit, we generate the real entries and purge the old.
I think this is cleaner than the first alternative, but still involves increased levels of database activity.
Both methods require that I have some way to expire transient data if the session is lost before the user executes submit or cancel requests.
Has anyone solved this problem in a different way?
Thanks in advance for your help.
I don't understand why these objects need to be created in the database before the transaction is committed, so you might want to clarify with your UI team before proceeding with a solution. You may find that all they want to do is read information previously saved by the user on another page.
So, assuming that the objects don't need to be stored in the database before the commit, I give you plan C:
Store initialized business objects in the session. You can then create all the children you want, and only touch the database (and set up references) when the transaction needs to be committed. If the session data is going to be large (either individually or collectively), store the session information in the database (you may already be doing this).
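A minimal sketch of plan C, assuming a Java servlet environment (the platform isn't stated in the question); the DraftOrder, DraftItem and OrderRepository types are invented for illustration. The draft parent and its children live only in the session, so no referential-integrity or key juggling happens until the user submits, at which point everything is written in one transaction.

import java.util.ArrayList;
import java.util.List;
import javax.servlet.http.HttpSession;

class DraftItem {                               // hypothetical child object
    String name;
}

class DraftOrder {                              // hypothetical parent object, not yet persisted
    final List<DraftItem> children = new ArrayList<>();
}

interface OrderRepository {                     // hypothetical data access layer
    void saveInOneTransaction(DraftOrder draft);
}

public class DraftOrderController {
    public void addChild(HttpSession session, String name) {
        DraftOrder draft = (DraftOrder) session.getAttribute("draftOrder");
        if (draft == null) {
            draft = new DraftOrder();
            session.setAttribute("draftOrder", draft);
        }
        DraftItem item = new DraftItem();
        item.name = name;
        draft.children.add(item);               // no database keys or constraints involved yet
    }

    public void submit(HttpSession session, OrderRepository repository) {
        DraftOrder draft = (DraftOrder) session.getAttribute("draftOrder");
        if (draft != null) {
            repository.saveInOneTransaction(draft);   // real keys are assigned here
            session.removeAttribute("draftOrder");
        }
    }
}

If the session is lost before submit, the draft simply disappears with it, which also covers the expiry concern from the question.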
