Since Django model save() methods are not lazy, and since keeping transactions short is generally good practice, should saves preferably be deferred to the end of transaction blocks?
As an example, would code sample B below hold a transaction open for less time than code sample A?
Code sample A:
from django.db import transaction
from my_app.models import MyModel
@transaction.commit_on_success
def model_altering_method():
    for inst in MyModel.objects.all()[0:5000]:
        inst.name = 'Joel Spolsky'
        # Some model-independent, time-consuming operations...
        inst.save()
Code sample B:
from django.db import transaction
from my_app.models import MyModel
@transaction.commit_on_success
def model_altering_method():
    instances_to_save = []
    for inst in MyModel.objects.all()[0:5000]:
        inst.name = 'Joel Spolsky'
        # Some model-independent, time-consuming operations...
        instances_to_save.append(inst)
    for inst in instances_to_save:
        inst.save()
I'm not sure, but here is my theory - I would think that your commit_manually decorator will begin a new transaction rather than having a new transaction spring into existence when you do your first save. So my theory is that code sample B would keep the transaction open longer, since it has to loop through the list of models twice.
Again, that's just a theory - and it could also depend on which DBMS you're using as to when the actual transaction starts (another theory).
Django's default behavior is to run with an open transaction which it commits automatically when any built-in, data-altering model function is called. In the case of the commit_on_success or commit_manually decorators, Django does not commit on save(), but rather on successful completion of the function or on an explicit transaction.commit() call, respectively.
Therefore, the elegant approach is to separate the transaction-handling code from the other time-consuming code where possible:
from django.db import transaction
from my_app.models import MyModel
@transaction.commit_on_success
def do_transaction(instances_to_save):
    for inst in instances_to_save:
        inst.save()

def model_altering_method():
    instances_to_save = []
    for inst in MyModel.objects.all()[0:5000]:
        inst.name = 'Joel Spolsky'
        # Some model-independent, time-consuming operations...
        instances_to_save.append(inst)
    do_transaction(instances_to_save)
If this is impossible design-wise, e.g. you need instance.id information, which for new instances you can only get after the first save(), try breaking your flow into reasonably sized work units so as not to keep a transaction open for long minutes.
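For illustration, here is a minimal sketch of such chunking under the same commit_on_success decorator as above; the chunk size of 500 and the helper name save_chunk are assumptions, not part of the original code:

from django.db import transaction
from my_app.models import MyModel

@transaction.commit_on_success
def save_chunk(instances):
    # Each call commits its own short transaction.
    for inst in instances:
        inst.save()

def model_altering_method():
    instances_to_save = []
    for inst in MyModel.objects.all()[0:5000]:
        inst.name = 'Joel Spolsky'
        # Some model-independent, time-consuming operations...
        instances_to_save.append(inst)
    chunk_size = 500  # assumed value; tune for your workload
    for i in range(0, len(instances_to_save), chunk_size):
        save_chunk(instances_to_save[i:i + chunk_size])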
Also note that long transactions are not always a bad thing. If your application is the only entity modifying the db, it can actually be OK. You should, however, check your db's specific configuration for the time limit on transactions (or on idle transactions).
While integration testing, I am attempting to test the execution of a stored procedure. To do so, I need to perform the following steps:
Insert some setup data
Execute the stored procedure
Find the data in the downstream repo that is written to by the stored proc
I am able to complete all of this successfully; however, after completion of the test only the rows written by the stored procedure are rolled back. The rows inserted via the JdbcAggregateTemplate are not rolled back. Obviously I can delete them manually at the end of the test declaration, but I feel like I must be missing something here with my configuration (perhaps in the @Transactional or @Rollback annotations).
@SpringBootTest
@Transactional
@Rollback
@TestInstance(TestInstance.Lifecycle.PER_CLASS)
class JobServiceIntegrationTest @Autowired constructor(
    private val repo: JobExecutorService,
    private val template: JdbcAggregateTemplate,
    private val generatedDataRepo: GeneratedDataRepo,
) {
    @Nested
    inner class ExecuteMyStoredProc {
        @Test
        fun `job is executed`() {
            // arrange
            val supportingData = supportingData()

            // act
            // this data does not get rolled back but I would like it to
            val expected = template.insert(supportingData)

            // this data does get rolled back
            repo.doExecuteMyStoredProc()

            val actual = generatedDataRepo.findAll().first()

            assertEquals(expected.supportingDataId, actual.supportingDataId)
        }
    }

    fun supportingData() : SupportingData {
        ...
    }
}
If this were all done as part of a single physical database transaction, I would expect the inner transactions to be rolled back when the outer transaction rolls back. Obviously that is not what is happening here, but it's the behavior I'm hoping to emulate.
I've written plenty of integration tests and all of them roll back as I expect, but typically I'm just applying some business logic and writing to a database, nothing as involved as this. The only thing unique about this test compared to my other tests is that I'm executing a stored proc (and the stored proc contains transactions).
I'm writing this data to a SQL Server DB, and I'm using Spring JDBC with Kotlin.
Making my comment into an answer since it seemed to have solved the problem:
I suspect the transaction in the SP commits the earlier changes. Could you post the code for a simple SP that causes the problems you describe, so I can play around with it?
My Google App Engine application (Python 3, standard environment) serves requests from users: if the wanted record does not exist in the database, the application creates it.
Here is the problem with database overwriting:
When one user (via a browser) sends a request, the running GAE instance may temporarily fail to respond, and GAE then creates a new process to handle the request. The result is that two instances respond to the same request. Both instances query the database at almost the same time, each finds no matching record, and each creates a new one, resulting in two duplicate records.
Another scenario is that, for some reason, the user's browser sends the request twice within less than 0.01 second; the two requests are processed by two instances on the server side, and duplicate records are again created.
I am wondering how one instance can temporarily lock the database to prevent another instance from overwriting it.
I have considered the following schemes but have no idea whether they are efficient:
For Python 2, Google App Engine provides "memcache", which can be used to mark the status of a query for the purpose of database locking. For Python 3, however, it seems one has to set up a Redis server to rapidly exchange database status among instances. How efficient would database locking via Redis be? (A sketch of such a lock appears below.)
Using Flask's session module. The session module can be used to share data (in most cases, the login status of users) among different requests and thus different instances. I am wondering how fast data can be exchanged between instances this way.
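For reference, here is a minimal sketch of the kind of Redis-based lock the first scheme refers to, written with redis-py; the connection settings, key format, timeout, and retry count are assumptions for illustration, not part of the original question:

import time
import redis  # assumed dependency: redis-py

r = redis.Redis(host="localhost", port=6379)  # assumed connection settings

def with_lock(lock_key, work, timeout=5, retries=500):
    # SET with nx=True acquires the lock only if the key does not exist yet;
    # ex=timeout makes it expire so a crashed instance cannot hold it forever.
    for _ in range(retries):
        if r.set(lock_key, "LOCKED", nx=True, ex=timeout):
            try:
                return work()
            finally:
                r.delete(lock_key)
        time.sleep(0.01)
    raise RuntimeError("could not acquire lock %s" % lock_key)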
Appended information (1)
I followed the advice to use a transaction, but it did not work.
Below is the code I used to verify the transaction.
The reason for the failure may be that the transaction only works for the CURRENT client. For multiple simultaneous requests, the GAE server side will create different processes or instances to respond to them, and each process or instance has its own independent client.
@staticmethod
def get_test(test_key_id, unique_user_id, course_key_id, make_new=False):
    client = ndb.Client()
    with client.context():
        from google.cloud import datastore
        from datetime import datetime
        client2 = datastore.Client()
        print("transaction started at: ", datetime.utcnow())
        with client2.transaction():
            print("query started at: ", datetime.utcnow())
            my_test = MyTest.query(MyTest.test_key_id == test_key_id,
                                   MyTest.unique_user_id == unique_user_id).get()
            import time
            time.sleep(5)
            if make_new and not my_test:
                print("data to create started at: ", datetime.utcnow())
                my_test = MyTest(test_key_id=test_key_id,
                                 unique_user_id=unique_user_id,
                                 course_key_id=course_key_id,
                                 status="")
                my_test.put()
                print("data to created at: ", datetime.utcnow())
        print("transaction ended at: ", datetime.utcnow())
        return my_test
Appended information (2)
Here is new information about the usage of memcache (Python 3).
I tried the following code to lock the database using memcache, but it still failed to prevent overwriting.
@user_student.route("/run_test/<test_key_id>/<user_key_id>/")
def run_test(test_key_id, user_key_id=0):
    from google.appengine.api import memcache
    import time

    cache_key_id = test_key_id + "_" + user_key_id
    print("cache_key_id", cache_key_id)
    counter = 0
    client = memcache.Client()
    while True:  # Retry loop
        result = client.gets(cache_key_id)
        if result is None or result == "":
            client.cas(cache_key_id, "LOCKED")
            print("memcache added new value: counter = ", counter)
            break
        time.sleep(0.01)
        counter += 1
        if counter > 500:
            print("failed after 500 tries.")
            break

    my_test = MyTest.get_test(int(test_key_id), current_user.unique_user_id, current_user.course_key_id, make_new=True)

    client.cas(cache_key_id, "")
    memcache.delete(cache_key_id)
If the problem is duplication rather than overwriting, maybe you should specify the data id when creating new entries instead of letting GAE generate a random one for you. Then the application will write to the same entry twice instead of creating two entries. The data id can be anything unique, such as a session id, a timestamp, etc.
The problem with transactions is that they prevent you from modifying the same entry in parallel, but they do not stop you from creating two new entries in parallel.
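As an illustration, here is a minimal sketch of writing with an explicit, deterministic id, using the Cloud NDB client and the MyTest model from the question; the id format and the helper name are assumptions:

from google.cloud import ndb  # the Cloud NDB client already used in the question

def save_test(test_key_id, unique_user_id, course_key_id):
    client = ndb.Client()
    with client.context():
        # Derive the entity id from the request data instead of letting
        # Datastore assign a random one; two concurrent requests then
        # overwrite the same entity rather than creating two entities.
        entity_id = "%s_%s" % (test_key_id, unique_user_id)  # assumed id scheme
        my_test = MyTest(id=entity_id,
                         test_key_id=test_key_id,
                         unique_user_id=unique_user_id,
                         course_key_id=course_key_id,
                         status="")
        my_test.put()
        return my_test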
I used memcache in the following way (using get/set) and succeeded in locking the database writes.
It seems that gets/cas does not work well. In a test, I set the value with cas(), but then failed to read the value back with gets() later.
Memcache API: https://cloud.google.com/appengine/docs/standard/python3/reference/services/bundled/google/appengine/api/memcache
@user_student.route("/run_test/<test_key_id>/<user_key_id>/")
def run_test(test_key_id, user_key_id=0):
    from google.appengine.api import memcache
    import time

    cache_key_id = test_key_id + "_" + user_key_id
    print("cache_key_id", cache_key_id)
    counter = 0
    client = memcache.Client()
    while True:  # Retry loop
        result = client.get(cache_key_id)
        if result is None or result == "":
            client.set(cache_key_id, "LOCKED")
            print("memcache added new value: counter = ", counter)
            break
        time.sleep(0.01)
        counter += 1
        if counter > 500:
            return "failed after 500 tries of memcache checking."

    my_test = MyTest.get_test(int(test_key_id), current_user.unique_user_id, current_user.course_key_id, make_new=True)

    client.delete(cache_key_id)
...
Transactions:
https://developers.google.com/appengine/docs/python/datastore/transactions
When two or more transactions simultaneously attempt to modify entities in one or more common entity groups, only the first transaction to commit its changes can succeed; all the others will fail on commit.
You should be updating your values inside a transaction. App Engine's transactions will prevent two updates from overwriting each other as long as your read and write are within a single transaction. Be sure to pay attention to the discussion about entity groups.
You have two options:
Implement your own logic for transaction failures (how many times to retry, etc.).
Instead of writing to the datastore directly, create a task to modify an entity. Run a transaction inside the task. If it fails, App Engine will retry the task until it succeeds.
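As an illustration of keeping the read and the write inside a single transaction, here is a minimal sketch using ndb's get_or_insert, again assuming the MyTest model from the question; the deterministic id scheme is an assumption:

from google.cloud import ndb

def get_or_create_test(test_key_id, unique_user_id, course_key_id):
    client = ndb.Client()
    with client.context():
        # get_or_insert looks the key up and, if it is missing, creates the
        # entity inside one transaction, so two concurrent requests cannot
        # both create it.
        entity_id = "%s_%s" % (test_key_id, unique_user_id)  # assumed id scheme
        return MyTest.get_or_insert(entity_id,
                                    test_key_id=test_key_id,
                                    unique_user_id=unique_user_id,
                                    course_key_id=course_key_id,
                                    status="")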
I tried to make my entire POST method transactional, but was unable to because it calls other methods and would therefore be nested. So what I've done is create a transactional method just to complete the db.put() of my entity.
def post(self):
    myobj = db.get(key)
    myobj.property = x + 1
    second_method()
    my_txn(myobj)

@db.transactional
def my_txn(obj):
    db.put(obj)
Is this a valid way to create a transaction?
No, I don't think that has any use as far as the transaction (TA) concept is concerned.
Something like this:
def post(self):
    second_method()
    my_txn(key)

@db.transactional
def my_txn(key):
    myobj = db.get(key)
    myobj.property = x + 1
    db.put(myobj)
You should use transactions to ensure that the read (get) and write (put) are consistent.
This way you know that, between getting the entity and writing it, if anything has changed in the meantime the transaction will be aborted and retried (3 times by default).
The way you are using it, the get is outside the transaction, which makes the transaction useless.
I use NDB for my app and use iter() with a limit and starting cursor to iterate through 20,000 query results in a task. A lot of the time I run into a timeout error.
Timeout: The datastore operation timed out, or the data was temporarily unavailable.
The way I make the call is like this:
results = query.iter(limit=20000, start_cursor=cursor, produce_cursors=True)
for item in results:
    process(item)
save_cursor_for_next_time(results.cursor_after().urlsafe())
I can reduce the limit but I thought a task can run as long as 10 mins. 10 mins should be more than enough time to go through 20000 results. In fact, on a good run, the task can complete in just about a minute.
If I switched to fetch() or fetch_page(), would they be more efficient and less likely to run into the timeout error? I suspect there's a lot of overhead in iter() that causes the timeout error.
Thanks.
Fetch is not really any more efficient; they all use the same mechanism. The exception is when you know how many entities you want up front, in which case fetch can be more efficient as you end up with just one round trip.
You can increase the batch size for iter; that can improve things. See https://developers.google.com/appengine/docs/python/ndb/queryclass#kwdargs_options
From the docs, the default batch size is 20, which for 20,000 entities means a lot of batches.
Other things that can help: consider using map and/or map_async for the processing, rather than explicitly calling process(entity); have a read of https://developers.google.com/appengine/docs/python/ndb/queries#map. Introducing async into your processing can also mean improved concurrency.
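A minimal sketch of those two suggestions, assuming the same query, cursor, and process() function as in the question (the batch size of 500 is an arbitrary value):

# Larger batches mean fewer datastore round trips for the 20,000 entities.
results = query.iter(limit=20000, start_cursor=cursor,
                     produce_cursors=True, batch_size=500)
for item in results:
    process(item)

# Or let NDB drive the iteration and overlap the RPCs with the processing.
future = query.map_async(process, limit=20000, batch_size=500)
future.get_result()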
Having said all of that, you should profile so you can understand where the time is spent. For instance, the delays could be in your process(item) due to things you are doing there.
There are other things to consider with ndb, such as context caching, which you need to disable. I also used the iter method for these, and I made an ndb version of the mapper API from the old db.
Here is my ndb mapper API, which should solve the timeout and ndb caching problems and make it easy to create this kind of thing:
http://blog.altlimit.com/2013/05/simple-mapper-class-for-ndb-on-app.html
With this mapper API you can create a job like the one below, or improve on it.
class NameYourJob(Mapper):
    def init(self):
        self.KIND = YourItemModel
        self.FILTERS = [YourItemModel.send_email == True]

    def map(self, item):
        # here is your process(item)
        # process here
        item.send_email = False
        self.update(item)

# Then run it like this
from google.appengine.ext import deferred
deferred.defer(NameYourJob().run, 50,  # <-- this is your batch
               _target='backend_name_if_you_want', _name='a_name_to_avoid_dups')
For potentially long query iterations, we use a time check to ensure slow processing can be handled. Given the disparities in GAE infrastructure performance, you will likely never find an optimal processing number. The code excerpt below is from an on-line maintenance handler we use which generally runs within ten seconds. If not, we get a return code saying it needs to be run again thanks to our timer check. In your case, you would likely break the process after passing the cursor to your next queue task.
One other note: you may choose to break this up into smaller bites and then fan out the smaller tasks by re-enqueueing the task until it completes. Doing 20k things at once seems very aggressive in GAE's highly variable environment. HTH -stevep
import datetime

def over_dt_limit(start, milliseconds):
    dt = datetime.datetime.now() - start
    mt = float(dt.seconds * 1000) + (float(dt.microseconds) / float(1000))
    if mt > float(milliseconds):
        return True
    return False

# set a start time
start = datetime.datetime.now()

# handle a timeout issue inside your query iteration
for item in query.iter():
    # do your loop logic
    if over_dt_limit(start, 9000):
        # your specific time-out logic here
        break
NB: I am using db (not ndb) here. I know ndb has a count_async() but I am hoping for a solution that does not involve migrating over to ndb.
Occasionally I need an accurate count of the number of entities that match a query. With db this is simply:
q = some Query with filters
num_entities = q.count(limit=None)
It costs a small db operation per entity, but it gets me the info I need. The problem is that I often need to do a few of these in the same request, and it would be nice to do them asynchronously, but I don't see support for that in the db library.
I was thinking I could use run(keys_only=True, batch_size=1000), as it runs the query asynchronously and returns an iterator. I could first call run() on each query and then later count the results from each iterator. It costs the same as count(); however, run() has proven slower in testing (perhaps because it actually returns results), and in fact batch_size seems to be capped at 300 regardless of how high I set it, which requires more RPCs to count thousands of entities than the count() method does.
My test code for run() looks like this:
queries = [...]  # list of Queries with filters
iters = []
for q in queries:
    iters.append(q.run(keys_only=True, batch_size=1000))
for iter in iters:
    count_entities_from(iter)
No, there's no equivalent in db. The whole point of ndb is that it adds these sorts of capabilities, which were missing in db.
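For comparison, here is a minimal sketch of what the asynchronous counts look like after migrating to ndb, using the count_async() the question mentions; MyModel and its flag property are hypothetical stand-ins for your own kinds and filters:

from google.appengine.ext import ndb

class MyModel(ndb.Model):  # hypothetical model standing in for your kinds
    flag = ndb.BooleanProperty()

queries = [MyModel.query(MyModel.flag == True),
           MyModel.query(MyModel.flag == False)]

# Kick off all counts in parallel, then collect the results.
futures = [q.count_async(limit=None) for q in queries]
counts = [f.get_result() for f in futures]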