Is there a way to insert foreign keys when using pandas to_sql function?
I am processing uploaded Consultations (n=40k) with pandas in django, before adding them to the database (postgres). I got this working row by row, but that takes 15 to 20 minutes. This is longer than I want my users to wait, so I am looking for a more efficient solution.
I tried pandas to_sql, but I cannot figure out how to add the two foreign key relations as columns to my consultations dataframe before calling the to_sql function. Is there a way to add the Patient and Praktijk foreign keys as a column in the consultations dataframe?
More specifically, when inserting row by row, I use objects of type Patient or Praktijk when creating new consultations in the database. In a dataframe however, I cannot use these types, and therefore don't know how I could add the foreign keys correctly. Is there possibly a value of type object or int (a patient's id?) which can substitute a value of type Patient, and thereby set the foreign key?
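To illustrate, something like this is what I am hoping would work (hypothetical code; it assumes Patient also stores the patient_nr that appears in the upload, and that a plain patient_id column is what Django's patient foreign key maps to):

# hypothetical: look up each consultation's Patient pk by patient_nr and
# store it in a patient_id column before calling to_sql
patient_pks = dict(Patient.objects.values_list('patient_nr', 'pk'))
consultations['patient_id'] = consultations['patient_nr'].map(patient_pks)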
The Consultation model:
class Consultation(models.Model):
# the foreign keys
patient = models.ForeignKey(Patient, on_delete=models.CASCADE, null=True, blank=True)
praktijk = models.ForeignKey(Praktijk, on_delete=models.CASCADE, default='')
# other fields which do not give trouble with to_sql
patient_nr = models.IntegerField(blank=True, null=True)
# etc
The to_sql call:
consultations.to_sql(Consultation._meta.db_table, engine, if_exists='append', index=False, chunksize=10000)
If the above is not possible, any hints towards another, more efficient solution?
I had the same problem and this is how I solved it. My answer isn't as straightforward, but I trust it helps.
Inspect your django project to be sure of two things:
Target table name
Table column names
In my case, I use class Meta when defining django models to set explicit table names (django otherwise names tables automatically). I will use the django tutorial project to illustrate.
class Question(models.Model):
question_text = models.CharField(max_length=200)
pub_date = models.DateTimeField('date published')
class Meta:
db_table = "poll_questions"
class Choice(models.Model):
question = models.ForeignKey(Question, on_delete=models.CASCADE)
choice_text = models.CharField(max_length=200)
votes = models.IntegerField(default=0)
class Meta:
db_table = "question_choices"
Note: Django stores the Question foreign key in the database in a column named question_id, which holds the pk of the related Question object.
Assume I have a Question with pk 1 and a dataframe df of choices that I wish to insert for it. My df must look like the one below to batch insert into the database with pandas!
import pandas as pd
df = pd.DataFrame(
{
"question": [1, 1, 1, 1, 1],
"choice_text": [
"First Question",
"Second Question",
"Third Question",
"Fourth Question",
"Fifth Question"
],
"votes":[5,3,10,1,13]
}
)
I wish I could write the df out as a table; too bad that SO doesn't support the usual markdown for tables.
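For reference, print(df) renders it roughly like this:

   question_id    choice_text  votes
0            1   First Choice      5
1            1  Second Choice      3
2            1   Third Choice     10
3            1  Fourth Choice      1
4            1   Fifth Choice     13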
Nonetheless, we have our df; the next step is to create a database connection for inserting the records.
from django.conf import settings
from sqlalchemy import create_engine
# load database settings from django
user = settings.DATABASES['default']['USER']
passwd = settings.DATABASES['default']['PASSWORD']
dbname = settings.DATABASES['default']['NAME']
# create database connection string
conn = 'postgresql://{user}:{passwd}@localhost:5432/{dbname}'.format(
user=user,
passwd=passwd,
dbname=dbname
)
# actual database connection object.
conn = create_engine(conn, echo=False)
# write df into db
df.to_sql("question_choices", con=conn, if_exists="append", index=False, chunksize=500, method="multi")
Voila!
We are done!
Note:
django also supports bulk_create which, however, isn't what you asked for; a sketch follows below anyway.
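For completeness, a minimal sketch of that alternative, reusing df and the Question/Choice models from above:

# build unsaved Choice objects from the dataframe rows, then insert
# them in batches with a single query per batch
choices = [
    Choice(question_id=row["question_id"],
           choice_text=row["choice_text"],
           votes=row["votes"])
    for row in df.to_dict(orient="records")
]
Choice.objects.bulk_create(choices, batch_size=500)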
I ran into a similar problem using SQLAlchemy, but I found a simple workaround.
What I did is define the database schema the way I wanted with SQLAlchemy (with all the datatypes and foreign keys I needed) and create an empty table; then I simply changed the if_exists parameter to append.
This will append all the data to the (initially empty) table, so the foreign keys live in the schema rather than in the dataframe.
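A minimal sketch of that approach (the table and column names here are made up for illustration):

from sqlalchemy import (Column, ForeignKey, Integer, MetaData, String,
                        Table, create_engine)

engine = create_engine("postgresql://user:pwd@localhost:5432/dbname")
metadata = MetaData()

patient = Table(
    "patient", metadata,
    Column("id", Integer, primary_key=True),
)
consultation = Table(
    "consultation", metadata,
    Column("id", Integer, primary_key=True),
    Column("patient_id", Integer, ForeignKey("patient.id")),
    Column("note", String(200)),
)
metadata.create_all(engine)  # creates the empty tables with their FKs

# df is a dataframe whose columns match the consultation table
df.to_sql("consultation", engine, if_exists="append", index=False)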
I'm using the flask-sqlalchemy ORM in my flask app, which is about smarthome sensors and actors (for the sake of simplicity let's call them Nodes).
Now I want to store an Event which is bound to Nodes: it checks the state of some Nodes and, if their state has reached a threshold, sets other (or the same) Nodes to a given value.
Additionally, the states could be checked or set from/for Groups or Scenes. So I have three different foreign keys to check and another three to set. All of them could be more than one per type, and multiple types per Event.
Here is some example code with the db.Models and pseudocode for what I expect to get stored in an Event:
db = SQLAlchemy()
class Node(db.Model):
id = db.Column(db.Integer, primary_key=True)
value = db.Column(db.String(20))
# columns snipped out
class Group(db.Model):
id = db.Column(db.Integer, primary_key=True)
value = db.Column(db.String(20))
# columns snipped out
class Scene(db.Model):
id = db.Column(db.Integer, primary_key=True)
value = db.Column(db.String(20))
# columns snipped out
class Event(db.Model):
id = db.Column(db.Integer, primary_key=True)
# The following columns may be in an intermediate table
# but I have no clue how to design that under these conditions
constraints = # list of foreignkeys from diffrent tables (Node/Group/Scene)
# with threshold per key
target = # list of foreignkeys from diffrent tables (Node/Group/Scene)
# with target values per key
In the end I want to be able to check if any of my Events are true to set the bound Node/Group/Scene accordingly.
It may be a database design problem (and not sqlalchemy) but I want to make use of the advantages of sqla orm here.
Inspired by this and that answer I tried to dig deeper, but other questions on SO were about more specific problems or one-to-many relationships.
Any hints or design tips are much appreciated. Thanks!
I ended up with a trade-off between usage and lines of code. My first thought here was to save as much code as I can (DRY) and to define as few tables as possible.
As SQLAlchemy itself points out in one of its examples, the "generic foreign key" is only supported because it was often requested, not because it is a good solution. With it, less db functionality is used and the application instead has to take care of the key constraints.
On the other hand, they say that having more tables in your database does not affect db performance.
So I tried some approaches and found one that fits my use case well. Instead of a "normal" intermediate table for the many-to-many relationship, I use another SQLAlchemy class which has a one-to-many relation on each side, connecting the two tables.
class Event(db.Model):
id = db.Column(db.Integer, primary_key=True)
nodes = db.relationship('NodeEvent', back_populates='events')
# columns snipped out
def get_as_dict(self):
return {
"id": self.id,
"nodes": [n.get_as_dict() for n in self.nodes]
}
class Node(db.Model):
id = db.Column(db.Integer, primary_key=True)
value = db.Column(db.String(20))
events = db.relationship('NodeEvent', back_populates='node')
# columns snipped out
class NodeEvent(db.Model):
ev_id = db.Column('ev_id', db.Integer, db.ForeignKey('event.id'), primary_key=True)
n_id = db.Column('n_id', db.Integer, db.ForeignKey('node.id'), primary_key=True)
value = db.Column('value', db.String(200), nullable=False)
compare = db.Column('compare', db.String(20), nullable=True)
node = db.relationship('Node', back_populates="events")
events = db.relationship('Event', back_populates="nodes")
def get_as_dict(self):
return {
"trigger_value": self.value,
"actual_value": self.node.status,
"compare": self.compare
}
The trade-off is that I have to define a new association class every time I bind a new table to that relationship. But with the "generic foreign key" approach I would also have to check where the ForeignKey is coming from; the same amount of work at the end of the day.
With my get_as_dict() function I have very handy access to the related data.
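A short usage sketch of the association object above (names taken from the snippets; app setup and db.create_all() omitted):

# one Node, one Event, and one NodeEvent linking them with a threshold
node = Node(value='21.5')
event = Event()
link = NodeEvent(value='25.0', compare='gt', node=node, events=event)

db.session.add_all([node, event, link])
db.session.commit()

print(event.get_as_dict())
# -> {'id': 1, 'nodes': [{'trigger_value': '25.0',
#                         'actual_value': '21.5', 'compare': 'gt'}]}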
I have a model B with a Many2many field referencing model A.
Now given an id of model A, I try to get the records of B that reference it.
Is this possible with Odoo search domains? Or is it possible with some SQL query?
Example
class A(models.Model):
_name='module.a'
class B(models.Model):
_name='module.b'
a_ids = fields.Many2many('module.a')
I try to do something like
a_id = 5
filtered_b_ids = self.env['module.b'].search([(a_id,'in','a_ids')])
However, this is not a valid search in Odoo. Is there a way to let the database do the search?
So far I fetch all records of B from the database and filter them afterward:
all_b_ids = self.env['module.b'].search([])
filtered_b_ids = [b_id for b_id in all_b_ids if a_id in b_id.a_ids.ids]
However, I want to avoid fetching unneeded records and would like to let the database do the filtering.
You should create the equivalent Many2many field in A.
class A(models.Model):
_name='module.a'
b_ids = fields.Many2many('module.b', 'rel_a_b', 'a_id', 'b_id')
class B(models.Model):
_name='module.b'
a_ids = fields.Many2many('module.a', 'rel_a_b', 'b_id', 'a_id')
In the field definition, the second argument is the name of the association table, and the next two are the names of the columns referencing the records of the two models. It's explained in the official ORM documentation.
Then you just have to do my_a_record.b_ids.
If you prefer doing an SQL request because you don't want to add a python field to A, you can call self.env.cr.execute(...) and then fetchall(). In your request you have to join the association table, so you need to give it and its columns explicit names, as in my code extract above; otherwise they are named automatically by Odoo and I don't know the rule.
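A minimal sketch, assuming the rel_a_b association table named in the field definitions above:

a_id = 5
self.env.cr.execute(
    "SELECT b_id FROM rel_a_b WHERE a_id = %s", (a_id,))
# browse() turns the raw ids back into module.b records
b_ids = [row[0] for row in self.env.cr.fetchall()]
filtered_b = self.env['module.b'].browse(b_ids)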
I think it's still possible to use search domains without the field in A but it's tricky. You can try search([('a_ids','in', [a_id])]) but I'm really not sure.
class A(models.Model):
_name='module.a'
class B(models.Model):
_name='module.b'
a_ids = fields.Many2many('module.a')
Now you want to search for a_id = 5.
To do so, simply use the search or browse ORM methods, i.e.
a_id = 5
filtered_b_ids = self.env['module.b'].search([('a_ids', 'in', [a_id])])
or, to fetch the A record itself,
a_record = self.env['module.a'].browse(a_id)
I have to work with a database containing columns with a dash in their name, for example a-name. When generating models from the table with peewee, the dash is carried over, which makes an illegal identifier, so python complains about a misplaced operator.
For a table with 2 columns, id and a-name, the result would be
from peewee import *
database = MySQLDatabase('databasename', **{'password': 'pwd', 'host': 'ip', 'user': 'username'})
class BaseModel(Model):
class Meta:
database = database
class ATable(BaseModel):
id = PrimaryKeyField()
a-name = CharField()
class Meta:
db_table = 'aTable'
I found a temporary workaround by changing the dash to an underscore and using the optional parameter db_column, like
a_name = CharField(db_column='a-name')
Is there another way to handle this issue, as I do not want to make manual changes every time I download the models from the database server?
I should add that I have no control over the database server; I merely have an account with read-only permissions.
Greetings,
Luc
a_name = CharField(db_column='a-name')
This is the correct way to solve the problem. Python does not allow dashes in identifiers, so if your column uses them, specify the column name explicitly and use a nicer name for the field.
I suppose you could also look into modifying the playhouse.reflection.Introspector.make_column_name method, as sketched below.
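A rough, untested sketch of that idea: monkeypatch the method so the reflection code swaps dashes for underscores before its usual cleanup runs. The generated model should then still carry the explicit db_column='a-name', since the attribute name no longer matches the real column.

from playhouse import reflection

_make_column_name = reflection.Introspector.make_column_name

def make_column_name(self, column_name, *args, **kwargs):
    # replace the dash before peewee's own sanitization runs
    return _make_column_name(self, column_name.replace('-', '_'),
                             *args, **kwargs)

reflection.Introspector.make_column_name = make_column_name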
I followed, more or less, the steps on Django User Profiles - Simple yet powerful.
Not quite the same, because I am in the middle of developing the idea.
From this site I used, in particular, this line:
User.profile = property(lambda u:
UserProfile.objects.get_or_create(user=u)[0])
I was always getting an error message on creating the object, typically
"XX" may not be null. I solved part of the problems by playing with the models
and (in my present case) sqliteman, until I got the same
message on the id: "xxx.id may not be null".
On the net I found a description of a possible solution which involved resetting
the database, which I was not happy to do, in particular because, depending on the
solution, it might have involved resetting the application db.
But because the UserProfile model was kinda new and until now empty,
I played with it on the DB directly: I made a hand-made drop of the table and
asked syncdb to rebuild it (kinda risky, though).
Now this is the diff of the sqlite dump:
294,298c290,294
< CREATE TABLE "myt_userdata" (
< "id" integer NOT NULL PRIMARY KEY,
< "user_id" integer NOT NULL UNIQUE REFERENCES "auth_user" ("id"),
< "url" varchar(200),
< "birthday" datetime
---
> CREATE TABLE myt_userdata (
> "id" INTEGER NOT NULL,
> "user_id" INTEGER NOT NULL,
> "url" VARCHAR(200),
> "birthday" DATETIME
Please note that both versions were generated by django. The ">" version was generated with a simple model definition which had the connection to the user table via:
user = models.ForeignKey(User, unique=True)
The new "<" version has much more information and it is working.
My question:
Why does Django complain that myt_userdata.id may not be null?
The subsidiary question:
Does Django try to relate to the underlying db structure, and how?
(For example, does the not NULL message come from the model or from the DB?)
The additional question:
I have been a bit reluctant to use south: too complicated, an additional module
which I might have to take care of between devel and production, and maybe not that easy
if I want to switch DB engines (I am using sqlite only at the devel stage and plan to move to
mysql).
South would probably have worked in this case. Would it? Would you suggest its use
anyway?
Edited FYI:
This is my last model (the working one):
class UserData(models.Model):
user = models.ForeignKey(User, unique=True)
url = models.URLField("Website", blank=True, null=True)
birthday = models.DateTimeField('Birthday', blank=True, null=True)
def __unicode__(self):
return self.user.username
User.profile = property(lambda u: UserData.objects.get_or_create(user=u,defaults={'birthday': '1970-01-01 00:00:00'})[0])
Why does Django complain that myt_userdata.id may not be null?
Because in that version of the table id is not a primary key, so it is not populated automatically. And since you don't provide it on model creation either, the DB does not know what to put there.
Does Django try to relate to the underlying db structure, and how? (For example, does the not NULL message come from the model or from the DB?)
It's an error from the DB, not from Django.
You can use the sql management command (python manage.py sql <app_label>) to see exactly what is executed on syncdb. The "<" variant above is the correct table definition generated from a correct Django model; I have no idea how you got the ">" variant. Write a correct and clear model, and you'll get a correct and working table scheme after syncdb.
I'm working with a legacy DB in MSSQL. We have a table that has two columns that are causing me problems:
class Emp(models.Model):
empid = models.IntegerField(_("Unique ID"), unique=True, db_column=u'EMPID')
ssn = models.CharField(_("Social security number"), max_length=10, primary_key=True, db_column=u'SSN') # Field name made lowercase.
So the table has the ssn column as primary key and the relevant part of the SQL-update code generated by django is this:
UPDATE [EMP] SET [EMPID] = 399,
.........
WHERE [EMP].[SSN] = 2509882579
The problem is that EMP.EMPID is an identity field in MSSQL and thus pyodbc throws this error whenever I try to save changes to an existing employee:
ProgrammingError: ('42000', "[42000] [Microsoft][SQL Native Client][SQL Server]C
annot update identity column 'EMPID'. (8102) (SQLExecDirectW); [42000] [Microsof
t][SQL Native Client][SQL Server]Statement(s) could not be prepared. (8180)")
Having EMP.EMPID as an identity column is not crucial to anything in the program, so dropping it by creating a temporary column and copying, deleting, and renaming seems like the logical thing to do. This creates one extra step in transferring old customers into Django, so my question is: is there any way to prevent Django from generating the '[EMPID] = XXX' snippet whenever I'm doing an update on this table?
EDIT
I've patched my model up like this:
def save(self, *args, **kwargs):
# on updates, drop empid from the fields Django writes so the
# generated UPDATE no longer touches the identity column
if self.empid:
self._meta.local_fields = [f for f in self._meta.local_fields if f.name != 'empid']
super().save(*args, **kwargs)
This works, taking advantage of the way Django builds its SQL statement in django/db/models/base.py (line 525). If anyone has a better way or can explain why this is bad practice, I'd be happy to hear it!
This question is old and Sindri found a workable solution, but I wanted to provide a solution that I've been using in production for a few years that doesn't require mucking around in _meta.
I had to write a web application that integrated with an existing business database containing many computed fields. These fields, usually computing the status of the record, are used on almost every object access across the entire application, and Django had to be able to work with them.
These types of fields are workable with a model manager that adds the required fields onto the query with an extra(select=...).
ComputedFieldsManager code snippet: https://gist.github.com/manfre/8284698
class Emp(models.Model):
ssn = models.CharField(_("Social security number"), max_length=10, primary_key=True, db_column=u'SSN') # Field name made lowercase.
objects = ComputedFieldsManager(computed_fields=['empid'])
# the empid is added on to the model instance
Emp.objects.all()[0].empid
# you can also search on the computed field
Emp.objects.all().computed_field_in('empid', [1234])