Why is my Django bulk database population so slow and frequently failing?

I decided I'd like to use django's model system rather than coding raw SQL to interface with my database, but I am having a problem that surely is avoidable.
My models.py contains:
class Student(models.Model):
    student_id = models.IntegerField(unique=True)
    form = models.CharField(max_length=10)
    preferred = models.CharField(max_length=70)
    surname = models.CharField(max_length=70)
and I'm populating it by looping through a list as follows:
from models import Student
for id, frm, pref, sname in large_list_of_data:
    s = Student(student_id=id, form=frm, preferred=pref, surname=sname)
    s.save()
I don't really want to be saving this to the database each time but I don't know another way to get django to not forget about it (I'd rather add all the rows and then do a single commit).
There are two problems with the code as it stands.
1. It's slow -- about 20 students get updated each second.
2. It doesn't even make it through large_list_of_data; instead it throws a DatabaseError saying "unable to open database file". (Possibly because I'm using sqlite3.)
My question is: how can I stop these two things from happening? I'm guessing that the root of both problems is the s.save() inside the loop, but I don't see a way of easily batching the students up and then saving them in one commit to the database.

So it seems I should have looked harder before posing the question.
Some solutions are described in this stackoverflow question (the winning answer is to use django.db.transaction.commit_manually) and also in this one on aggregating saves.
Other ideas for speeding up this type of operation are listed in this stackoverflow question.
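For later readers, here is a minimal sketch of that single-commit idea, assuming Django 1.6+ (where transaction.atomic is available; on older versions commit_manually / commit_on_success play the same role) and Django 1.4+ for bulk_create:

from django.db import transaction
from models import Student

# one transaction and one commit for the whole batch, instead of one per save()
with transaction.atomic():
    for id, frm, pref, sname in large_list_of_data:
        Student(student_id=id, form=frm, preferred=pref, surname=sname).save()

# or let the ORM collapse everything into a handful of INSERT statements
Student.objects.bulk_create([
    Student(student_id=id, form=frm, preferred=pref, surname=sname)
    for id, frm, pref, sname in large_list_of_data
])

Either variant avoids opening and committing a separate transaction for every student, which is what makes the row-by-row loop so slow on sqlite3.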

Related

Django Model: Best Approach to update?

I am trying to update hundreds of objects in my job, which is scheduled every two hours.
I have an articles table in my model. All articles are parsed and then different attributes are saved for each article.
First I query to get all unparsed articles, then I parse the URL saved against each article and save the received attributes.
Below is my code
articles = Articles.objects.filter(status=0)  # hundreds of articles
for art in articles:
    try:
        url = art.link
        result = ArticleParser(url)  # custom function which does all the parsing
        art.author = result.articleauthor
        art.description = result.articlecontent[:5000]
        art.imageurl = result.articleImage
        art.status = 1
        art.save()
    except Exception as e:
        art.author = ""
        art.description = ""
        art.imageurl = ""
        art.status = 2
        art.save()
The thing is, when this job is running, both CPU utilization and DB process utilization are very high. I am trying to pinpoint when and where it spikes.
Question: Is this the right way to update multiple objects, or is there a better way to do it? Any suggestions?
Appreciate your help.
Regards
Edit 1: Sorry for the confusion; some explanation is needed. Fields like author, description etc. will be different for every article; they are returned after I parse the URL. The reason I am updating in a loop is that these fields will be different on every iteration, depending on the URL. I have updated the code; I hope it clears up the confusion.
You are doing hundreds of DB operations in a relatively tight loop, so some load on the DB is expected.
If you have a lot of articles, make sure you have an index on the status column to avoid a table scan.
You can try disabling autocommit and wrapping the whole update in one transaction instead.
From my understanding, you do NOT want to set the fields author, description and imageurl to the same value on all articles, so QuerySet.update won't work for you.
Django recommends this way when you want to update or delete multi-objects: https://docs.djangoproject.com/en/1.6/topics/db/optimization/#use-queryset-update-and-delete
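As a rough sketch of the transaction suggestion above (not the poster's code, just an illustration assuming Django 1.6+ where transaction.atomic is available):

from django.db import transaction

# one commit for the whole run instead of one per save() removes a lot of
# per-row commit overhead; the parsing itself still dominates CPU, of course.
# For very long-running parses you may prefer to parse first and only wrap the saves.
with transaction.atomic():
    for art in Articles.objects.filter(status=0):
        try:
            result = ArticleParser(art.link)
            art.author = result.articleauthor
            art.description = result.articlecontent[:5000]
            art.imageurl = result.articleImage
            art.status = 1
        except Exception:
            art.author = art.description = art.imageurl = ""
            art.status = 2
        art.save()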
1. Better not to catch a bare Exception; specify concrete exceptions instead: KeyError, IndexError, etc.
2. The data can be built once. Something like this:
data = dict(
    author=articleauthor,
    description=articlecontent[:5000],
    imageurl=articleImage,
    status=1,
)
Articles.objects.filter(status=0).update(**data)
To Edit 1: You probably want to set up periodic Celery tasks, that is, a separate task for each query. For help, see this documentation.
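A hypothetical sketch of that idea, assuming a configured Celery setup (the task name parse_article is made up for illustration):

from celery import shared_task

@shared_task
def parse_article(article_id):
    # each article gets its own small task, so the load is spread out over time
    art = Articles.objects.get(pk=article_id)
    try:
        result = ArticleParser(art.link)
        art.author = result.articleauthor
        art.description = result.articlecontent[:5000]
        art.imageurl = result.articleImage
        art.status = 1
    except Exception:
        art.status = 2
    art.save()

# the two-hourly job then just enqueues work instead of doing it all inline
for article_id in Articles.objects.filter(status=0).values_list("pk", flat=True):
    parse_article.delay(article_id)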

App Engine Datastore - consistency and the 1 write per second limitation - how will it work in the following scenarios?

I'm trying to wrap my head around eventual consistency and the 1 write per second principle in the GAE datastore. I have a scenario and a few questions:
# python-like pseudo-code
class User:
    user_id = StringProperty
    last_update_time = DateTimeProperty

class Comment:
    user_id = StringProperty
    comment = StringProperty
    ...

def AddCommentAndReturnAllComments(user_id):
    user = db.GqlQuery("SELECT * FROM User WHERE user_id = :1", user_id)
    user.last_update_time = datetime.now()
    user.put()

    comment = Comment(parent=User(user_id))
    comment.put()

    comments = db.GqlQuery("SELECT * FROM Comment WHERE user_id = :1", user_id)
    return comments
Questions:
1. Will I get an exception here because I make two writes into the same entity group within one second (user.put and comment.put)? Is there a simple way around it?
2. If I remove the parent=User(user_id), the two entities will no longer belong to the same entity group. Does that mean the list of comments returned from the function might not contain the last added comment?
3. Am I doing something inherently wrong?
I know that I got the entity referencing part wrong. It doesn't matter for the question (or does it?)
1. This seems to be a soft limit; in practice I see up to 5 writes/s allowed.
2. Yes, and it also happens now, because you are not using an ancestor query.
3. Nothing, except as mentioned in point 2.
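A short sketch of the ancestor-query point, in the question's db/GQL style (the key construction is illustrative and assumes the user entity was stored with user_id as its key name and that comments are written with that key as parent):

# give the comment the user's key as parent so both live in one entity group
user_key = db.Key.from_path('User', user_id)
comment = Comment(parent=user_key, user_id=user_id, comment=text)  # text = the comment body
comment.put()

# ancestor queries are strongly consistent, so the new comment is visible immediately
comments = db.GqlQuery("SELECT * FROM Comment WHERE ANCESTOR IS :1", user_key)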

Trigger Duplicate CSV

I am trying to upload a CSV file / insert a bulk of records using the import wizard. In short, I would like to keep the latest record in case duplicates are found. Duplicate records are identified by the combination of First Name, Last Name and Title.
For example if my CSV file looks like the following:
James,Wistler,34,New York,Married
James,Wistler,34,London,Married
....
....
James,Wistler,34,New York,Divorced
This should only keep in my org: James,Wistler,34,New York,Divorced
I have been trying to write a before insert / before update trigger, but so far no success. Here is my trigger code (the code is not yet finished, it only filters on First Name); I am having a problem deleting the duplicates found in my CSV. Any hints? Thanks for reading!
trigger CheckDuplicateInsert on Customer__c (before insert, before update) {
    Map<String, Customer__c> customerFirstName = new Map<String, Customer__c>();
    List<Customer__c> customerList = Trigger.new;

    for (Customer__c newCustomer : customerList) {
        if ((newCustomer.First_Name__c != null) && System.Trigger.isInsert) {
            if (customerFirstName.containsKey(newCustomer.First_Name__c))
                // remove the duplicate from the map
                customerFirstName.remove(newCustomer.First_Name__c);
            // end of the if clause
            // at this stage we don't have any duplicate, so let's add a new customer
            customerFirstName.put(newCustomer.First_Name__c, newCustomer);
        }
        else if ((System.Trigger.oldMap.get(newCustomer.Id) != null) && newCustomer.First_Name__c != System.Trigger.oldMap.get(newCustomer.Id).First_Name__c) {
            // field is being updated, let's mark it with UPDATED for tracking
            newCustomer.First_Name__c = newCustomer.First_Name__c + 'UPDATED';
            customerFirstName.put(newCustomer.First_Name__c, newCustomer);
        }
    }

    for (Customer__c customer : [SELECT First_Name__c FROM Customer__c WHERE First_Name__c IN :customerFirstName.keySet()]) {
        if (customer.First_Name__c != null) {
            Customer__c newCustomer = customerFirstName.get(customer.First_Name__c);
            newCustomer.First_Name__c = customer.First_Name__c + 'EXIST_DB';
        }
    }
}
A purely non-SF solution would be to sort & deduplicate them in Excel, for example ;)
Good news - you don't need a trigger. Bad news - you might have to ditch the import wizard and start using Data Loader. The solution is pretty long and looks scary, but once you get the hang of it, it should start to make more sense and be easier to maintain in the future than writing code.
You can download the Data Loader in the Setup area of your production org, and here's some basic info about the tool.
Anyway.
I'd make a new text field on your Contact, call it "unique key" or something and mark it as External Id. If you have never used ext. ids - Jeff Douglas has a good post about them.
You might have to populate the field on your existing data before proceeding. Easiest would be to export all Contacts where it's blank (from a report for example), fill it in with some Excel formulas and import back.
If you want, you can even write a workflow rule to handle the generation of the unique key. This might help you when Mrs. Jane Doe gets married and becomes Jane Bloggs, and it will also make the previous point easier (you'd just import Contacts without changes, just "touching" them, and the workflow will fire). Something like
condition: ISBLANK(Unique_key__c) || ISCHANGED(FirstName) || ISCHANGED(LastName) || ISCHANGED(Title)
new value: Title + FirstName + ' ' + LastName
Almost there. Fire up Data Loader and prepare an upsert job (because we want to insert some records and, when a duplicate is found, update it instead).
My only concern is what happens when what's effectively the same row appears more than once in one "batch" of records sent to SF, like in your example. Upsert will not know which value is valid (it's like setting x = 7; and x = 5; in the same save to the DB) and will fail these rows. So you might have to tweak the number of records per batch in Data Loader's settings.

Is there any way to do an Insert or Update / Merge / Upsert in LLBLGen?

I'd like to do an upmerge using LLBLGen without first fetching then saving the entity.
I already found the possibility to update without fetching the entity first, but then I have to know it is already there.
Updating entries would happen about as often as inserting new ones.
Is there a possibility to do this in one step?
Would it make sense to do it in one step?
Facts:
LLBLgen Pro 2.6
SQL Server 2008 R2
.NET 3.5 SP1
I know I'm a little late to this, but as I remember from working with LLBLGen Pro, it is totally possible, and one of its beauties is that everything is possible!
I don't have my samples at hand, but I'm pretty sure there is a method named UpdateEntitiesDirectly that can be used like this:
// suppose we have Product and Order entities
using (var daa = new DataAccessAdapter())
{
    int numberOfUpdatedEntities =
        daa.UpdateEntitiesDirectly(OrderFields.ProductId == 23 && OrderFields.Date > DateTime.Now.AddDays(-2));
}
When using LLBLGen Pro we were able to do pretty much everything that is possible with an ORM framework; it's just great!
It also has a method to do a batch delete, called DeleteEntitiesDirectly, that may be useful in scenarios where you need to delete an entity and replace it with another one.
Hope this is helpful.
I think you can achieve what you're looking for by using an EntityCollection. First fetch the entities you want to update with the FetchEntityCollection method of DataAccessAdapter, then change anything you want in that collection, add new entities to it, and save it using the DataAccessAdapter's SaveEntityCollection method. That way existing entities are updated and new ones are inserted into the database. For example, in a product/order scenario in which you want to manipulate the orders of a specified product, you can use something like this:
int productId = 23;
var orders = new EntityCollection<OrderEntity>();
using (DataAccessAdapter daa = new DataAccessAdapter())
{
    daa.FetchEntityCollection(orders, new RelationPredicateBucket(OrderFields.ProductId == productId));
    foreach (var order in orders)
    {
        order.State = 1;
    }
    OrderEntity newOrder = new OrderEntity();
    newOrder.ProductId = productId;
    newOrder.State = 0;
    orders.Add(newOrder);
    daa.SaveEntityCollection(orders);
}
As far as I know, this is not possible, and arguably it could not be.
If you were to just call adapter.Save(entity) on an entity that was not fetched, the framework would assume it was new. If you think about it, how could the framework know whether to emit an UPDATE or an INSERT statement? No matter what, something somewhere would have to query the database to see if the row exists.
It would not be too difficult to create something that did this more or less automatically for single entity (non-recursive) saves. The steps would be something like:
1. Create a new entity and set its fields.
2. Attempt to fetch an entity of the same type using the PK or a unique constraint (there are other options as well, but none as uniform).
3. If the fetch fails, just save the new entity (INSERT).
4. If the fetch succeeds, map the fields of the created entity to the fields of the fetched entity.
5. Save the fetched entity (UPDATE).

Django "bulk_save" and "bulk_update"

UPDATE: ADDED A BOUNTY. PLEASE PROVIDE AN EXAMPLE AND I WILL ACCEPT THE BEST ANSWER
UPDATE 2: Explicit example now included
Carrying on from the same project, where I asked about bulk_create in a separate thread.
I was wondering if there is a way to essentially "bulk_save" - insert if non-existent or simply update if it already exists.
For example:
class Person(models.Model):
    first_name = models.CharField(max_length=30)
    last_name = models.CharField(max_length=30)
    height = models.DecimalField(blank=True, null=True)
    weight = models.DecimalField(blank=True, null=True)
I have a list of dictionaries with key-value pairs for these fields. I would like to filter by name and then update the height and/or weight, as my players are still growing and conditioning. If there is no easy way to "bulk_save", a bulk update would also be helpful.
Reference: June 8, 2012 - "get_or_create()" patch at django project
Bulk_update reference
I just did a variation of the update_many function listed below and I seem to have improved speeds tremendously already.
http://people.iola.dk/olau/python/bulkops.py
UPDATE - apparently DSE2 is also an option.
https://bitbucket.org/weholt/dse2
Will update with speed tests tomorrow.
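In the meantime, here is a minimal sketch of one manual way to get the "insert or update" effect, assuming Django 1.6+ for transaction.atomic and 1.4+ for bulk_create; the bulk_upsert name and the (first_name, last_name) match key are just illustrative assumptions:

from django.db import transaction

def bulk_upsert(rows):
    # rows is the list of dicts described above, one per player
    existing = {
        (p.first_name, p.last_name): p
        for p in Person.objects.all()
    }
    to_create = []
    with transaction.atomic():
        for row in rows:
            person = existing.get((row["first_name"], row["last_name"]))
            if person is None:
                to_create.append(Person(**row))  # new player: inserted in bulk below
            else:
                person.height = row.get("height", person.height)
                person.weight = row.get("weight", person.weight)
                person.save()  # existing player: plain per-row update
        Person.objects.bulk_create(to_create)

One query loads the existing players, one INSERT batch handles the new ones, and a single transaction wraps the per-row updates; on newer Django versions update_or_create and bulk_update cover much of the same ground.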
