Datastore - Resource contention in a single entity group - google-app-engine

I am getting lost on the following points regarding the Datastore:
It is recommended to denormalize data because the Datastore does not support join queries. This means that the same information is copied into several entities.
Denormalization means that whenever you have to update data, it must be updated in several entities.
But there is a limit of 1 write per second in a single entity group.
The problem I have is therefore the following:
In order to update records, I open a transaction, then update all the required entities. The entities to be updated are within the same entity group but belong to different kinds.
I am getting a "resource contention" exception.
==> It therefore seems that the only way to update denormalized data is outside of a transaction. But doing this is really bad, as some entities could be updated while others would not.
Am I the only one having this problem? How did you solve it?
Thanks,
Hugues
The (simplified version of the) code is as follows:
Objectify ofy = ObjectifyService.beginTransaction();
try {
    Key<Party> partyKey = new Key<Party>(realEstateKey, Party.class, partyDTO.getId());
    //--------------------------------------------------------------------------
    //-- 1 - We update the party
    //--------------------------------------------------------------------------
    Party party = ofy.get(partyKey);
    party.update(partyDTO);
    //---------------------------------------------------------------------------------------------
    //-- 2 - We update the kinds which have Party as an embedded field, all in the same entity group
    //---------------------------------------------------------------------------------------------
    // 2.1 Invoices
    Query<Invoice> q1 = ofy.query(Invoice.class).ancestor(realEstateKey).filter("partyKey", partyKey);
    for (Invoice invoice : q1) {
        invoice.setParty(party);
        ofy.put(invoice);
    }
    // 2.2 Payments
    Query<Payment> q2 = ofy.query(Payment.class).ancestor(realEstateKey).filter("partyKey", partyKey);
    for (Payment payment : q2) {
        payment.setParty(party);
        ofy.put(payment);
    }
    ofy.getTxn().commit();
    return (RPCResults.SUCCESS);
}
catch (Exception e) {
    final Logger log = Logger.getLogger(InternalServiceImpl.class.getName());
    log.severe("Problem while updating party : " + e.getLocalizedMessage());
    return (RPCResults.FAILURE);
}
finally {
    if (ofy.getTxn().isActive()) {
        ofy.getTxn().rollback();
        partyDTO.setCreationResult(RPCResults.FAILURE);
        return (RPCResults.FAILURE);
    }
}

This is happening because multiple requests to update the same entity group are occurring in a short period of time, not because you are updating many entities in the same entity group at once.
Since you have not shown your code, I can only assume that one of two things is happening:
The method you describe above is not actually using a transaction, and you are running put_multi() with many entities of the same entity group. (If I had to guess, it'd be this.)
You have a high-traffic site and many other updates are occurring at the same time.

Just in case someone runs into the same issue.
The problem was in party.update(partyDTO), where under some specific conditions I was initiating another transaction.
What I learned today is that:
--> Inside a transaction, you are allowed to include multiple puts, even if that exceeds 1 entity write per second
--> However, you should take care not to initiate another transaction within your transaction
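One way to catch this kind of accidental nesting early is a small guard around transaction entry. The following is only a sketch in plain Java, not Objectify code: TxnGuard and inTransaction are hypothetical names, and the "transaction" is simulated with a thread-local flag so that the example stays self-contained.

```java
import java.util.function.Supplier;

public class TxnGuard {
    // Tracks whether the current thread is already inside a (simulated) transaction.
    private static final ThreadLocal<Boolean> ACTIVE =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    // Runs the work inside a simulated transaction and rejects nested use.
    public static <T> T inTransaction(Supplier<T> work) {
        if (ACTIVE.get()) {
            throw new IllegalStateException("transaction already active on this thread");
        }
        ACTIVE.set(Boolean.TRUE);
        try {
            // A real implementation would begin, commit, or roll back around this call.
            return work.get();
        } finally {
            ACTIVE.set(Boolean.FALSE);
        }
    }

    public static void main(String[] args) {
        // Several writes inside one transaction are fine.
        System.out.println(inTransaction(() -> "party, invoices and payments saved"));

        // A helper that opens its own transaction trips the guard.
        Supplier<String> helper = () -> inTransaction(() -> "nested");
        try {
            inTransaction(helper);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In the code from the question, the equivalent fix would be for party.update(partyDTO) to reuse the caller's transactional Objectify instance rather than begin its own.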

Related

Common strategy in handling concurrent global 'inventory' updates

To give a simplified example:
I have a database with one table: names, which has 1 million records, each containing a common boy's or girl's name, with more added every day.
I have an application server that takes as input an HTTP request from parents using my website 'Name Chooser'. With each request, I need to pick a name from the DB and return it, and then NOT give that name to another parent. The server is concurrent so it can handle a high volume of requests, yet it has to respect "unique name per request" and still be highly available.
What are the major components and strategies for an architecture of this use case?
From what I understand, you have two operations: Adding a name and Choosing a name.
I have a couple of questions:
Question 1: Do parents only choose names, or do they also add names?
Question 2: If they add names, does that mean that when a name is added it should also be marked as already chosen?
Assuming that you don't want to make all name selection requests wait for one another (by locking or queueing them):
One solution to resolve concurrency when only choosing a name is to use an optimistic offline lock.
The most common implementation of this is to add a version field to your table and increment this version when you mark a name as chosen. You will need DB support for this, but most databases offer a mechanism for it. With MongoDB, some object mappers (Mongoose, for example) add a version field to documents by default. For an RDBMS (an SQL database) you have to add this field yourself.
You haven't specified what technology you are using, so I will give an example using pseudocode for an SQL DB. For MongoDB, you can check how your mapper or driver performs these checks for you.
NameRecord {
    id,
    name,
    parentID,
    version,
    isChosen,
    function chooseForParent(parentID) {
        if (this.isChosen) {
            throw Error/Exception;
        }
        this.parentID = parentID;
        this.isChosen = true;
        this.version++;
    }
}
NameRecordRepository {
    function getByName(name) { ... }
    function save(record) {
        var oldVersion = record.version - 1;
        var query = "UPDATE records SET .....
                     WHERE id = {record.id} AND version = {oldVersion}";
        var rowsCount = db.execute(query);
        if (rowsCount == 0) {
            throw ConcurrencyViolation;
        }
    }
}
// somewhere else in an object or module or whatever...
function chooseName(parentID, name) {
    var record = NameRecordRepository.getByName(name);
    record.chooseForParent(parentID);
    NameRecordRepository.save(record);
}
Before this object is saved to the DB, a version comparison must be performed. SQL provides a way to execute a query based on some condition and return the count of affected rows. In our case we check whether the version in the database is the same as the old one before updating. If it's not, someone else has updated the record.
In this simple case you can even remove the version field and use the isChosen flag in your SQL query like this:
var query = "UPDATE records SET .....
WHERE id = {record.id} AND isChosen = false";
When adding a new name to the database, you will need a unique constraint, which will resolve concurrency issues.
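The conditional-update idea can also be sketched in Java, with an in-memory map standing in for the table. This is only an illustration under stated assumptions: NameInventory, addName and chooseName are hypothetical names, not part of any schema above. ConcurrentHashMap.replace(key, expected, updated) plays the role of "UPDATE ... WHERE isChosen = false" because it succeeds only if the stored value is still the expected one, and putIfAbsent plays the role of the unique constraint on insert.

```java
import java.util.concurrent.ConcurrentHashMap;

public class NameInventory {
    // Sentinel meaning "nobody has chosen this name yet" (a convention of this sketch).
    private static final String UNCLAIMED = "";

    // Maps each name to the parent who chose it, or to UNCLAIMED.
    private final ConcurrentHashMap<String, String> owners = new ConcurrentHashMap<>();

    // Like an INSERT under a unique constraint: returns false if the name already exists.
    public boolean addName(String name) {
        return owners.putIfAbsent(name, UNCLAIMED) == null;
    }

    // Like UPDATE ... WHERE name = ? AND isChosen = false: true only for the first chooser.
    public boolean chooseName(String name, String parentId) {
        if (parentId == null || parentId.isEmpty()) {
            throw new IllegalArgumentException("parentId must be non-empty");
        }
        return owners.replace(name, UNCLAIMED, parentId);
    }

    public static void main(String[] args) {
        NameInventory inventory = new NameInventory();
        inventory.addName("Alice");
        System.out.println(inventory.chooseName("Alice", "parent-1")); // prints true
        System.out.println(inventory.chooseName("Alice", "parent-2")); // prints false
    }
}
```

A caller that gets false back would pick another candidate name and try again, just as the repository above throws ConcurrencyViolation when no row was updated.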

Why are my datastore Writes per Entity Group per Second far over the stated limit?

I'm updating an entity with Objectify inside a transaction. My guess was that I can only write to the same entity group around 1 to 5 times per second. This would conform to the documentation and advice around writing into the datastore. But after running some simple load tests on the following code, I saw:
around 90 writes per second on a single entity
around 50 writes per second on random entities in the same entity group.
Why is this possible? Where is my mistake?
// text => a random text, different for each request
public void update(final Key<SomeEntity> toLoad, String text) {
    final AtomicInteger attempts = new AtomicInteger(0);
    SomeEntity modified = ofy().transact(new Work<SomeEntity>() {
        public SomeEntity run() {
            // count every attempt
            attempts.incrementAndGet();
            SomeEntity toModify = ofy().load().key(toLoad).now();
            if (toModify != null) {
                // modifies the entity
                toModify.setText(text);
                ofy().save().entity(toModify).now();
            }
            return toModify;
        }
    });
    if (attempts.get() > 1) {
        logger.warning(attempts.get() + " attempts for update on " + modified);
    }
}
In the Cloud Console Log Viewer a lot of retries are reported, most warnings had ~ 2 attempts, some transactions had 5 attempts, but were executed and updated the entity. Are there any special strategies for load tests on GAE? Or any general advice on this topic?
Update:
A short description of the entity group structure and test setup. To make it easy to select an entity, the key name reflects the entity's position in its entity group. "001-001-100" is a 2nd-level entity in the entity group with the root entity "100" and has the parent "001-100". So an entity group looks like this:
- 100
- 001-100
- 001-001-100
- 002-001-100
- 003-001-100
- ...
- 002-100
- 003-100
- 004-100
- 005-100
- ...
- 101
- ...
I tried three different versions. Each one uses a different value for the update request in JMeter. All of them update exactly the same entity, "001-001-100".
// Version A: text does not change during load test
vars.put("text", "Foo Bar");
// Version B: text changes every second during load test
var d = new Date();
vars.put("text", [d.getHours(), d.getMinutes(), d.getSeconds()].join("-"));
// Version C: text changes every request
vars.put("text", Math.random());
Version A: ~ 110 requests / second
Version B: ~ 70 requests / second
Version C: ~ 24 requests / second
But still: 24 writes per second on one entity is really high.
So I modified the test slightly. Instead of firing requests at only one entity, I now distribute them over the 2nd level of an entity group: JMeter randomly uses "001-001-100", "002-001-100", "003-001-100", "004-001-100", or "005-001-100". The result is more or less the same as when I choose only one entity.
Version A: ~ 110 requests / second
Version B: ~ 100 requests / second
Version C: ~ 20 requests / second
Update 2:
If you execute the load test with just a single thread, the throughput is around 2.5 updates per second. This is closer to the proposed limit. If I run the test with 80 threads, the throughput goes up to the numbers I posted before. The response times for the samples are not the best, but the throughput stays high: avg = 2100ms, median = 1350ms, 90% = 5400ms, max = 18000ms. Maybe throughput is not a good measure for the datastore limits?
You get the benefits of entity caching (versions A and B). It may be at Objectify's level, or within the Datastore's infrastructure.
5 requests per second is not a hard limit. It's a warning:
Writes to a single entity group are serialized by the App Engine
datastore, and thus there's a limit on how quickly you can update one
entity group. In general, this works out to somewhere between 1 and 5
updates per second; a good guideline is that you should consider
rearchitecting if you expect an entity group to have to sustain more
than one update per second for an extended period.
Note that:
(a) A simple text string has almost no serialization overhead. That will not be the case with a complex entity.
(b) The warning includes the words "extended period".
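The retries logged in the question are what contention on an entity group looks like from the client side: a transaction fails, waits, and runs again. As a rough illustration only, here is a bounded retry with exponential backoff in plain Java. RetryWithBackoff and run are hypothetical names, and ConcurrentModificationException merely stands in for a contention failure. Objectify's transact() already retries on its own, so this is not something you would add on top of the code in the question.

```java
import java.util.ConcurrentModificationException;
import java.util.function.Supplier;

public class RetryWithBackoff {
    // Calls work until it succeeds or maxAttempts is reached, doubling the wait each time.
    public static <T> T run(Supplier<T> work, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return work.get();
            } catch (ConcurrentModificationException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and surface the contention error
                }
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while backing off", interrupted);
                }
                delay *= 2;
            }
        }
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        String result = run(() -> {
            // Simulate two contention failures before the write goes through.
            if (++calls[0] < 3) {
                throw new ConcurrentModificationException("simulated contention");
            }
            return "saved";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

With many threads, each request spends most of its time waiting like this, which is consistent with the high response times reported alongside the high throughput.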

How to handle unique constraint exception to update row after failing to insert?

I am trying to handle near-simultaneous input to my Entity Framework application. Members (users) can rate things, so I have a table for their ratings, where one column is the member's ID, one is the ID of the thing they're rating, one is the rating, and another is the time they rated it. The most recent rating is supposed to override the earlier ratings. When I receive input, I check to see if the member has already rated a thing or not, and if they have, I just update the rating using the existing row, or if they haven't, I add a new row. I noticed that when input comes in from the same user for the same item at nearly the same time, that I end up with two ratings for that user for the same thing.
Earlier I asked this question: How can I avoid duplicate rows from near-simultaneous SQL adds? and I followed the suggestion to add a SQL constraint requiring unique combinations of MemberID and ThingID, which makes sense, but I am having trouble getting this technique to work, probably because I don't know the syntax for doing what I want to do when an exception occurs. The exception comes up saying the constraint was violated, and what I would like to do then is forget the attempted illegal addition of a row with the same MemberID and ThingID, and instead fetch the existing one and simply set the values to this slightly more recent data. However, I have not been able to come up with a syntax that will do that. I have tried a few things and I always get an exception when I try to SaveChanges after getting the exception: either the unique constraint is still coming up, or I get a deadlock exception.
The latest version I tried was like this:
// Get the member's rating for the thing, or create it.
Member_Thing_Rating memPref = (from mip in _myEntities.Member_Thing_Rating
                               where mip.thingID == thingId
                               where mip.MemberID == memberId
                               select mip).FirstOrDefault();
bool RetryGet = false;
if (memPref == null)
{
    using (TransactionScope txScope = new TransactionScope())
    {
        try
        {
            memPref = new Member_Thing_Rating();
            memPref.MemberID = memberId;
            memPref.thingID = thingId;
            memPref.EffectiveDate = DateTime.Now;
            _myEntities.Member_Thing_Rating.AddObject(memPref);
            _myEntities.SaveChanges();
        }
        catch (Exception ex)
        {
            Thread.Sleep(750);
            RetryGet = true;
        }
    }
    if (RetryGet == true)
    {
        // Reuse the existing variable; redeclaring memPref here would not compile.
        memPref = (from mip in _myEntities.Member_Thing_Rating
                   where mip.thingID == thingId
                   where mip.MemberID == memberId
                   select mip).FirstOrDefault();
    }
}
After writing the above, I also tried wrapping the logic in a function call, because it seems like Entity Framework cleans up database transactions when leaving scope from where changes were submitted. So instead of using TransactionScope and managing the exception at the same level as above, I wrapped the whole thing inside a managing function, like this:
bool Succeeded = false;
while (Succeeded == false)
{
    Thread.Sleep(750);
    Exception Problem = AttemptToSaveMemberIngredientPreference(memberId, ingredientId, rating);
    if (Problem == null)
        Succeeded = true;
    else
    {
        Exception BaseEx = Problem.GetBaseException();
    }
}
But this only results in an unending string of exceptions on the unique constraint, being handled forever at the higher-level function. I have a 3/4 second delay between attempts, so I am surprised that there can be a reported conflict yet still there is nothing found when I query for a row. I suppose that indicates that all of the threads are failing because they are running at the same time and Entity Framework notices them all and fails them all before any succeed. So I suppose there should be a way to respond to the exception by looking at all the submissions and adjusting them? I don't know or see the syntax for that. So again, what is the way to handle this?
Update:
Paddy makes three good suggestions below. I expect his Stored Procedure technique would work around the problem, but I am still interested in the answer to the question. That is, surely one should be able to respond to this exception by manipulating the submission, but I haven't yet found the syntax to get it to insert one row and use the latest value.
To quote Eric Lippert, "if it hurts, stop doing it". If you are anticipating very high volumes and you want to do an 'insert or update', then you may want to consider handling this within a stored procedure instead of using the methods outlined above.
Your problem arises because there is a small gap between your call to the DB to check for existence and your insert/update.
The sproc could use a MERGE to do the insert or update in a single pass on the table, guaranteeing that you will only see a single row for a rating and that it will be the most recent update you receive.
Note - you can include the sproc in your EF model and call it using similar EF syntax.
Note 2 - Looking at your code, you don't roll back the transaction scope prior to sleeping your thread in the case of an exception. This is a relatively long time to be holding a transaction open, particularly when you are expecting very high volumes. You may want to update your code to something like this:
try
{
    memPref = new Member_Thing_Rating();
    memPref.MemberID = memberId;
    memPref.thingID = thingId;
    memPref.EffectiveDate = DateTime.Now;
    _myEntities.Member_Thing_Rating.AddObject(memPref);
    _myEntities.SaveChanges();
    txScope.Complete();
}
catch (Exception ex)
{
    txScope.Dispose();
    Thread.Sleep(750);
    RetryGet = true;
}
This may be why you seem to be suffering from deadlocks when you retry, particularly if you are getting rapid concurrent requests.
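To make the "single pass" point concrete, here is a small in-memory sketch of an atomic insert-or-update. It is written in Java rather than C#, it is not Entity Framework or T-SQL code, and RatingStore, Rating and upsert are hypothetical names. ConcurrentHashMap.merge inserts when there is no entry for the key and otherwise combines the old and new values in one atomic step, which is the property a MERGE statement would give you on the database side.

```java
import java.util.concurrent.ConcurrentHashMap;

public class RatingStore {
    // One rating together with the time it was submitted (epoch millis).
    public static final class Rating {
        final int value;
        final long ratedAt;

        Rating(int value, long ratedAt) {
            this.value = value;
            this.ratedAt = ratedAt;
        }
    }

    // The key "memberId:thingId" mirrors the unique (MemberID, ThingID) constraint.
    private final ConcurrentHashMap<String, Rating> ratings = new ConcurrentHashMap<>();

    // Atomic insert-or-update: there is no gap between the existence check and the write.
    public void upsert(long memberId, long thingId, int value, long ratedAt) {
        String key = memberId + ":" + thingId;
        ratings.merge(key, new Rating(value, ratedAt),
                (existing, incoming) -> incoming.ratedAt >= existing.ratedAt ? incoming : existing);
    }

    // Returns the current rating, or null if the member has not rated the thing.
    public Integer currentRating(long memberId, long thingId) {
        Rating r = ratings.get(memberId + ":" + thingId);
        return r == null ? null : r.value;
    }

    public int rowCount() {
        return ratings.size();
    }

    public static void main(String[] args) {
        RatingStore store = new RatingStore();
        store.upsert(1, 2, 3, 100L);
        store.upsert(1, 2, 5, 200L); // newer rating replaces the older one
        store.upsert(1, 2, 4, 150L); // late-arriving older rating is ignored
        System.out.println(store.rowCount() + " row, rating " + store.currentRating(1, 2));
    }
}
```

Because the check and the write happen in one operation, two near-simultaneous submissions for the same member and thing end up as one row holding the more recent value, with no exception to recover from.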

How many objects is "too many" for in a single transaction to Google's DataStore (High Replication)?

I have the following entity (non-relevant fields/methods are removed).
public class HitsStatsTotalDO
{
    @Id
    transient private Long targetId;
    public Key<HitsStatsTotalDO> createKey()
    {
        return new Key<HitsStatsTotalDO>(HitsStatsTotalDO.class, targetId);
    }
}
So... I'm trying to do a batch get for 10 objects, for which I construct keys using HitsStatsTotalDO.createKey(). I'm attempting to fetch them in a transaction like this:
final List<Key<HitsStatsTotalDO>> keys = ....
// This is being called in transaction..
Map<Key<HitsStatsTotalDO>, HitsStatsTotalDO> result = DAOBase.ofy().get(keys);
which throws the following exception:
java.lang.IllegalArgumentException: operating on too many entity groups in a single transaction.
Could you please elaborate on how many is too many and how to fix it? I couldn't find an exact number in the documentation.
Thanks!
The issue is not the number of entities you're retrieving, it's the fact that they're in multiple entity groups. Either do the fetch outside a transaction, or use an XG (Cross Group) transaction.
In a single (non-XG) transaction you can operate only on entities in the same entity group.
See "What Can Be Done In a Transaction" in the Datastore documentation.

Creating logger in CouchDB?

I would like to create a logger using CouchDB. Basically, every time someone accesses the file, I would like to write the username and the time of access to the database. If this was MySQL, I would just add a row for every access, keyed to the user. I am not sure what to do in CouchDB. Would I need to store each access in an array? Then what do I do during an update: is there a way to append to the document? Would each user have his own document?
I couldn't find any documentation on how to append to an existing document or array without retrieving and updating the entire document. So for every event you log, you'll have to retrieve the entire document, update it and save it to the database. So you'll want to keep the documents small for two reasons:
Log files/documents tend to grow big. You don't want to send large documents across the wire for each new log entry you add.
Log files/documents tend to get updated a lot. If all log entries are stored in a single document and you're trying to write a lot of concurrent log entries, you're likely to run into mismatching document revisions on updates.
Your suggestion of user-based documents sounds like a good solution, as it will keep the documents small. Also, a single user is unlikely to generate concurrent log entries, minimizing any race conditions.
Another option would be to store a new document for each log entry. Then you'll never have to update an existing document, eliminating any race conditions and the need to send large documents between your application and the database.
Niels' answer is going down the right path with transactions. As he said, you will want to create a different document for each access - think of them as actions. Here's what one of those documents might look like:
{
    "_id": "32 char hash",
    "_rev": "32 char hash",
    "at": Unix time stamp,
    "by": "some unique identifier"
}
If you were tracking multiple files, then you'd want to add a "file" field and include a unique identifier.
Now the power of Map/Reduce begins to really shine, as it's extremely good at aggregating multiple pieces of data. Here's how to get the total number of views:
Map:
function(doc)
{
    emit(doc.at, 1);
}
Reduce:
function(keys, values, rereduce)
{
    return sum(values);
}
The reason I threw the time stamp (doc.at) into the key is that it allows us to get total views for a range of time. For example, /dbName/_design/designDocName/_view/viewName?startkey=1000&endkey=2000 gives us the total number of views between those two time stamps (adding group=true would instead return one count per distinct time stamp).
Cheers.
Although Sam's answer is an ok pattern to follow I wanted to point out that there is, indeed, a nice way to append to a Couch document. It just isn't very well documented yet.
By defining an update function in your design document and using that to append to an array inside a couch document you may be able to save considerable disk space. Plus, you end up with a 1:1 correlation between the file you're logging accesses on and the couch doc that represents that file. This is how I imagine a doc might look:
{
"_id": "some/file/path/name.txt",
"_rev": "32 char hash",
"accesses": [
{"at": 1282839291, "by": "ben"},
{"at": 1282839305, "by": "kate"},
{"at": 1282839367, "by": "ozone"}
]
}
One caveat: You will need to encode the "/" as %2F when you request it from CouchDB or you'll get an error. Using slashes in document ids is totally ok.
And here is a pair of map/reduce functions:
function(doc)
{
    if (doc.accesses) {
        for (var i = 0; i < doc.accesses.length; i++) {
            var event = doc.accesses[i];
            emit([doc._id, event.by, event.at], 1);
        }
    }
}
function(keys, values, rereduce)
{
    return sum(values);
}
And now we can see another benefit of storing all accesses for a given file in one JSON document: to get a list of all accesses on a document just make a get request for the corresponding document. In this case:
GET http://127.0.0.1:5984/dbname/some%2Ffile%2Fpath%2Fname.txt
If you wanted to count the number of times each file was accessed by each user you'll query the view like so:
GET http://127.0.0.1:5984/test/_design/touch/_view/log?group_level=2
Use group_level=1 if you just want to count total accesses per file.
Finally, here is the update function you can use to append onto that doc.accesses array:
function(doc, req) {
    var whom = req.query.by;
    var when = Math.round(new Date().getTime() / 1000);
    if (!doc.accesses) doc.accesses = [];
    var event = {"at": when, "by": whom};
    doc.accesses.push(event);
    var message = 'Logged ' + event.by + ' accessing ' + doc._id + ' at ' + event.at;
    return [doc, message];
}
Now, whenever you need to log an access to a file, issue a request like the following (depending on how you name your design document and update function):
http://127.0.0.1:5984/my_database/_design/my_designdoc/_update/update_function_name/some%2Ffile%2Fpath%2Fname.txt?by=username
A comment on the last two answers is that they refer to Couchbase, not Apache CouchDB.
It is, however, possible to define update handlers in CouchDB, but I have not used them.
http://wiki.apache.org/couchdb/Document_Update_Handlers

Resources