Our GAE app makes a local copy of another website's relational database in NDB. There are four entity types: User, Table, Row, and Field. Each user has a bunch of tables, each table has a bunch of rows, and each row has a bunch of fields.
SomeUser > SomeTable > ARow > AField
Thus, each User becomes one entity group. I need a feature where I can clear out all the tables (and their rows) for a certain user. What is the right way to delete all the tables and all the rows, while avoiding the contention limit of ~5 operations/second?
The current code is getting TransactionFailedErrors because of contention on the entity group.
(A detail I overlooked: we only want to delete tables whose 'service' attribute is set to a certain value.)
def delete_tables_for_service(user, service):
    tables = Tables.query(Tables.service == service, ancestor=user.key).fetch(keys_only=True)
    for table in tables:
        keys = []
        keys += Fields.query(ancestor=table).fetch(keys_only=True)
        keys += TableRows.query(ancestor=table).fetch(keys_only=True)
        keys.append(table)
        ndb.delete_multi(keys)
If all of the entities you're deleting are in one entity group, try deleting them all in one transaction. Without an explicit transaction, each delete is occurring in its own transaction, and all of the transactions have to line up (via contention and retries) to change the entity group.
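For illustration, a rough sketch of what that could look like, built from the question's models (Tables, TableRows, Fields). This is only an outline, not a drop-in fix: if a user has many tables and rows, the batch may run into per-transaction size limits.

from google.appengine.ext import ndb

@ndb.transactional(retries=5)
def delete_tables_for_service_in_txn(user, service):
    # Collect every key in the user's entity group that should go away...
    keys = []
    table_keys = Tables.query(Tables.service == service,
                              ancestor=user.key).fetch(keys_only=True)
    for table_key in table_keys:
        keys += Fields.query(ancestor=table_key).fetch(keys_only=True)
        keys += TableRows.query(ancestor=table_key).fetch(keys_only=True)
        keys.append(table_key)
    # ...and delete them in one commit, so the entity group is written once
    # instead of once per table.
    ndb.delete_multi(keys)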
Are you sure it's contention-based, or is it perhaps because the code above is already executed within a transaction? A quick fix might be to increase the number of retries and turn on cross-group transactions for this method, by placing this decorator directly above the function definition:
@ndb.transactional(retries=5, xg=True)
You can read more about that here: https://developers.google.com/appengine/docs/python/ndb/transactions. If that's not the culprit, maybe consider deferring or running the deletes asynchronously so they execute over time and in smaller batches. The trick with NDB is to do small bursts of work regularly, versus a large chunk of work infrequently. Here is one way to turn that code into an asynchronous unit of work:
def delete_tables_for_service(user, service):
    tables = Tables.query(Tables.service == service, ancestor=user.key).fetch(keys_only=True)
    futures = []
    for table in tables:
        # Delete fields
        fields_keys = Fields.query(ancestor=table).fetch(keys_only=True)
        futures.extend(ndb.delete_multi_async(fields_keys))
        # Delete table rows
        table_rows_keys = TableRows.query(ancestor=table).fetch(keys_only=True)
        futures.extend(ndb.delete_multi_async(table_rows_keys))
        # Finally delete the table itself (the keys-only query already returns keys)
        futures.append(ndb.delete_async(table))
    # Make sure all the asynchronous deletes finish before the request ends
    ndb.Future.wait_all(futures)
If you want more control over the deletes, retries, and failures, you can either use Task Queues, or simply use the deferred library (https://developers.google.com/appengine/articles/deferred):
1. Turn deferred on in your app.yaml.
2. Replace the ndb.delete_multi call with deferred.defer:
def delete_tables_for_service(user, service):
    tables = Tables.query(Tables.service == service, ancestor=user.key).fetch(keys_only=True)
    for table in tables:
        keys = []
        keys += Fields.query(ancestor=table).fetch(keys_only=True)
        keys += TableRows.query(ancestor=table).fetch(keys_only=True)
        keys.append(table)
        deferred.defer(_deferred_delete_tables_for_keys, keys)

def _deferred_delete_tables_for_keys(keys):
    ndb.delete_multi(keys)
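deferred.defer also accepts task options (parameters prefixed with an underscore) that let you spread the batches out over time or route them to a dedicated queue. A hedged variant of the function above, assuming a 'deletes' queue has been defined in queue.yaml:

from google.appengine.ext import deferred

def delete_tables_for_service(user, service):
    tables = Tables.query(Tables.service == service, ancestor=user.key).fetch(keys_only=True)
    for i, table in enumerate(tables):
        keys = Fields.query(ancestor=table).fetch(keys_only=True)
        keys += TableRows.query(ancestor=table).fetch(keys_only=True)
        keys.append(table)
        deferred.defer(_deferred_delete_tables_for_keys, keys,
                       _queue='deletes',   # assumed: a 'deletes' queue in queue.yaml
                       _countdown=i * 2)   # stagger the batches a couple of seconds apart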
From learn.microsoft.com, "Populating a DataSet from a DataAdapter":
Pulling all of the table to the client also locks all of the rows on the server.
I couldn't find any information (in the System.Data namespace) about any way to put a lock on records (or a group of records) in the DB that were read into a DataSet (DataTable), a lock that would affect all users of the DB, not only those who work with the database through my application.
Also from learn.microsoft.com, "Using UpdatedRowSource to Map Values to a DataSet":
The Update method resolves your changes back to the data source; however other clients may have modified data at the data source since the last time you filled the DataSet. To refresh your DataSet with current data, use the DataAdapter and Fill method. New rows will be added to the table, and updated information will be incorporated into existing rows. The Fill method determines whether a new row will be added or an existing row will be updated by examining the primary key values of the rows in the DataSet and the rows returned by the SelectCommand. If the Fill method encounters a primary key value for a row in the DataSet that matches a primary key value from a row in the results returned by the SelectCommand, it updates the existing row with the information from the row returned by the SelectCommand and sets the RowState of the existing row to Unchanged. If a row returned by the SelectCommand has a primary key value that does not match any of the primary key values of the rows in the DataSet, the Fill method adds a new row with a RowState of Unchanged.
If we have modified the copied records (locally, in memory) in the DataSet and want to propagate these changes to the server, why must we refresh our local records first? That refresh could throw away all of our changes.
In general, I don't understand the strategy for organizing modifications of DB records through a DataSet:
make a local copy of the records (via Adapter.Fill(DataSet));
change the record (or records) locally (over some period of time) and wait until the user clicks "Update", at which point we:
save all modifications to a temp table?
re-read the records from the DB (again via Adapter.Fill(DataSet))?
compare the records from the temp table with the refreshed DataSet?
and, if nothing has changed, quickly update the records in the DB (via Adapter.Update(DataSet))?
But even in that case, it is possible that someone else is quicker than I am and updates "my" records between my re-read and my update.
I re-read all the ADO.NET articles (on learn.microsoft.com) again, found some additional information, and can now answer my own questions:
1) Reading records from the DB into a DataSet (via Adapter.Fill(DataSet)) does not put any lock on the records in the DB (the phrase quoted above from learn.microsoft.com, "Pulling all of the table to …", is not correct).
2) "In a multiuser environment, there are two models for updating data in a database: optimistic concurrency and pessimistic concurrency. The DataSet object is designed to encourage the use of optimistic concurrency (ONLY) for long-running activities, such as remoting data and interacting with data."
A transaction (a DbTransaction-derived type), which consists of a single command or a group of commands executed as one unit, can put different types of locks on rows in DB tables (see the DbTransaction.IsolationLevel property).
Incorrect use of transactions can badly hurt how the DB behaves in a multi-user environment (transactions must be kept as short as possible).
3) When we talk about locking records in the DB, we must clearly understand what the aim of the locking is and how it corresponds to the logic of our application.
For example, suppose we want to build a system for selling cinema tickets (to simplify, for one movie and one showing only).
The application (which will run on multiple devices) must connect to the DB, fill a local DataSet, and show users all available seats (all rows whose SeatStatus field is "FREE").
After selecting a seat, the user presses "Make Reservation" (MakeReservation()). MakeReservation() must use "Testing for Optimistic Concurrency Violations" (see the example from Microsoft; it is more elegant than mine from the first post) to try to change the SeatStatus field to "RESERVED". Optimistic concurrency applies here: whoever presses first gets the seat.
Whoever is second receives the message "Sorry, the seat is reserved" and an updated list of available seats (UpdateSeats()). UpdateSeats() must also run periodically on all active devices (once per second).
The user who pressed the button first then moves to the next screen, enters credit card information, and presses "Pay" (PayTicket()). PayTicket() must connect to the bank, check the payment, and change SeatStatus to "OCCUPIED"; in this case (IMHO) it is more correct to use a transaction with pessimistic concurrency.
If the user does not press "Pay" within the required time (5 min.), they are returned to the previous screen and SeatStatus is changed back to "FREE".
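Outside of ADO.NET, the conditional-update idea behind MakeReservation() can be sketched in a few lines. This is only an illustration (here in Python with sqlite3); the seats table and its columns are made up for the example:

import sqlite3

def make_reservation(conn, seat_id):
    # The UPDATE only succeeds if the seat is still FREE; a zero row count means
    # another client got there first (the optimistic-concurrency violation case).
    cur = conn.execute(
        "UPDATE seats SET SeatStatus = 'RESERVED' "
        "WHERE seat_id = ? AND SeatStatus = 'FREE'",
        (seat_id,))
    conn.commit()
    return cur.rowcount == 1  # True -> this client won the seat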
P.S. Anyone who knows a more correct way to carry out this task is welcome to share it.
My source table, called Event, sits in a different database and has millions of rows. Each event can have an action of DELETE, UPDATE, or NEW.
We have a Java process that goes through these events in the order they were created, applies all sorts of rules, and then inserts the results into multiple tables for lookup, analysis, etc.
I am using JdbcTemplate's batchUpdate to delete and upsert into the Postgres DB in sequential order right now, but I'd like to be able to run it in parallel too. Each batch is 1,000 entities to be inserted/upserted or deleted.
However, even running sequentially, Postgres is blocking queries somehow, and I don't know much about how or why.
Here is some of the code:
entityService.deleteBatch(deletedEntities);
indexingService.deleteBatch(deletedEntities);
...
entityService.updateBatch(allActiveEntities);
indexingService.updateBatch(....);
Each of these services does inserts/deletes into a different table. They are all in one transaction, though.
The following query
SELECT
activity.pid,
activity.usename,
activity.query,
blocking.pid AS blocking_id,
blocking.query AS blocking_query
FROM pg_stat_activity AS activity
JOIN pg_stat_activity AS blocking ON blocking.pid = ANY(pg_blocking_pids(activity.pid));
returns
Query being blocked: "insert INTO ENTITY (reference, seq, data) VALUES($1, $2, $3) ON CONFLICT ON CONSTRAINT ENTITY_c DO UPDATE SET data = $4"
Blocking query: "delete from ENTITY_INDEX where reference = $1"
There are no foreign key constraints between these tables, and we do have indexes so that we can run queries for our processing as part of the process.
Why would a statement on one table block inserts into a completely different table? And how can we go about resolving this?
Your query is misleading.
What it shows as “blocking query” is really the last statement that ran in the blocking transaction.
It was probably an earlier statement in the same transaction that caused the ENTITY table (or rather, a row in it) to be locked.
I tried to delete all datastore entities in two different ways but I get an error:
Try 1:
results = myDS().query().fetch()
for res in results:
res.delete()
Try 2:
results = myDS().query().fetch()
ndb.delete_multi(results)
In both cases it fails and I get the error:
The server encountered an error and could not complete your request.
Any idea why?
In the results obtained from your queries you have actual entities.
In the first try, to delete an entity, you need to call .delete() on the entity's key, not on the entity itself, see also Deleting entities:
res.key.delete()
Similarly, in the 2nd try, you need to pass entity keys, not entities, to ndb.delete_multi(), see also Using batch operations:
ndb.delete_multi([r.key for r in results])
But in both cases it's more efficient to directly obtain just the entity keys from the queries (you don't actually need the entities themselves to delete them). It's also cheaper as you'd be skipping datastore read ops. Your tries would look like this:
keys = myDS().query().fetch(keys_only=True)
for key in keys:
key.delete()
keys = myDS().query().fetch(keys_only=True, limit=500) # up to 500 keys at a time
ndb.delete_multi(keys)
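If there are more entities than a single call should handle, one possible pattern (a sketch, not part of the original answer) is to loop, fetching and deleting keys in batches until the query comes back empty:

from google.appengine.ext import ndb

def delete_all_entities(batch_size=500):
    while True:
        keys = myDS().query().fetch(batch_size, keys_only=True)
        if not keys:
            break  # nothing left to delete
        ndb.delete_multi(keys)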
I am not sure this is the preferred method, but I want to present my solution and see if you Symfony2 wizards out there have enlightening comments on it.
I am registering financial transactions in a table, and each user has their own series of serial numbers (i.e. each user's transaction table starts at 1).
I understand that this must be handled in code, and then I run the risk of duplicate entries for a user if, say, two people are logged in to the same user account registering transactions, or the user triggers multiple transaction writes at the same time and Doctrine performs the SELECTs in both operations before the first write fires...
$em->getConnection()->exec('LOCK TABLES transaction WRITE;'); //lock for write access
$results = $em->createQuery("SELECT MAX(t.serial) FROM ekonomiKassabokBundle:Transaction t WHERE t.user = $userId")->getResult();
$temp = $results[0];
$max_serial = $temp[1];
$new_serial = $max_serial + 1;
$entity->setSerial($new_serial);
$em->persist($entity);
$em->flush();
$em->getConnection()->exec('UNLOCK TABLES;');
The above code gives me...
SQLSTATE[HY000]: General error: 1100 Table 't0_' was not locked with LOCK TABLES
Or is this perhaps even overkill? Should I just skip the table lock?
I eventually managed to find the solution, well... a solution.
Actually, from what I understand, it is pretty stupid: when you lock tables, MySQL expects ALL the tables you will use until the unlock to be locked, and this must happen in one LOCK TABLES statement.
Now, Doctrine will systematically use table aliases for whatever reason, and MySQL apparently can't figure out that the aliases refer to the locked tables... so you have to specifically lock all the aliases that will be used yourself!
Try:
$em->getConnection()->exec('LOCK TABLES transaction as t0_ WRITE;');
And if you have another error after this (it will happen if you do several queries while the table is locked), just keep adding locks to the additional aliases, for instance:
$em->getConnection()->exec('LOCK TABLES transaction as t0_ WRITE, transaction as t0 WRITE, transaction as t1 WRITE;');
Fortunately, it seems Doctrine always uses the same table aliases, so once you have it down it should keep working!
I am looking for a way to quickly compare the state of a database table with the results of a Web service call.
I need to make sure that all records returned by the Web service call exist in the database, and any records in the database that are no longer in the Web service response are removed from the table.
I have two problems to solve:
1. How do I quickly compare a data structure with the results of a database table?
2. When I find a difference, how do I quickly add what's new and remove what's gone?
For number 1, I was thinking of doing an MD5 of a data structure and storing it in the database. If the MD5 is different, then I'd move to step 2. Are there better ways of comparing response data with the state of a database?
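For what it's worth, a minimal sketch of that MD5 idea (in Python, assuming the web service response is JSON-serializable; the canonicalization step is the important part, since key order must not change the hash):

import hashlib
import json

def fingerprint(data):
    # Canonicalize first so that dict key order does not affect the hash.
    canonical = json.dumps(data, sort_keys=True, separators=(',', ':'))
    return hashlib.md5(canonical.encode('utf-8')).hexdigest()

# Store fingerprint(ws_response) in the database and only run the full
# sync when a newly computed fingerprint differs from the stored one.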
I need more guidance on number 2. I can easily retrieve all records from a table (SELECT * FROM users WHERE user_id = 1), then loop through an array adding what's not in the DB and building another array of items to remove in a subsequent call, but I'm hoping for a better (faster) way of doing this. What is the best way to compare and sync a data structure with a subset of a database table?
Thanks for any insight into these issues!
I've recently been caught up in a similar problem. Our--very simple--solution was to load the web service data into a table with the same structure as the DB table. The DB table keeps a hash of its most important columns, and the same hash function is applied to the corresponding columns in the web service table.
The "sync" logic then goes like this:
Delete any rows from the web service table with hashes that do exist in the DB table. This is duplicate data that doesn't need synchronizing.
DELETE FROM ws_table WHERE hash IN (SELECT hash FROM db_table);
Delete any rows from the DB table with hashes not found in the web service table.
DELETE FROM db_table WHERE hash NOT IN (SELECT hash FROM ws_table);
Anything left over in the web service table is new data, and should now be inserted into the DB table.
INSERT INTO db_table SELECT ... FROM ws_table;
It's a pretty brute-force approach, and if done transactionally (even just steps 2 and 3) it locks up the DB table for the duration, but it's very simple.
One refinement would be to deal with changed records using UPDATE statements, but that adds a good deal of complexity, and may not be any faster than a DELETE followed by an INSERT.
Another possible optimization would be to set a flag instead of deleting rows. The rows could then be deleted later on. However, any logic using the DB table would have to ignore rows with a set flag.
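A compact sketch of those three steps run as one unit (illustrated here with Python's sqlite3; it assumes, as in the answer above, that ws_table and db_table share the same structure and both carry the hash column):

import sqlite3

def sync_tables(conn):
    # 'with conn' commits the whole block on success and rolls back on error,
    # so readers never see a half-synced db_table.
    with conn:
        conn.execute("DELETE FROM ws_table WHERE hash IN (SELECT hash FROM db_table)")
        conn.execute("DELETE FROM db_table WHERE hash NOT IN (SELECT hash FROM ws_table)")
        conn.execute("INSERT INTO db_table SELECT * FROM ws_table")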
Don't kill yourself doing premature optimization. Go with the simple approach of inserting each row one at a time. If you find you're having transactional issues, such as the table being locked for too long while looping, you could insert the rows into a temporary table first and then do a single insert into the real destination table.
If you were using SQL Server you could do bulk inserts or package the data into XML, but I'd still highly recommend implementing it the easy way first, then testing it, ideally with production data (or the same quantity of data), and only optimizing if you need to.