NDB Datastore: Data Modeling for a Job website - google-app-engine

What is the best way to model data for a job website which has the following elements:
Two types of user accounts: JobSeekers and Employers
Employers can create JobPost entities.
Each JobSeeker can create a Resume entity, and many JobApplication entities.
JobSeekers can create a JobApplication entity which is related to a JobPost entity.
A JobPost entity may receive many JobApplication entities.
A JobSeeker may only create one JobApplication entity per JobPost entity.
A Resume contains one or more instances of Education and Experience, using ndb.StructuredProperty(repeated=True).
Each Education contains the following ndb.StringProperty fields: institution, certification, area_of_study, while each Experience contains: workplace, job_title.

Here is a skeleton model that meets your requirements:
from google.appengine.ext import ndb

class Employer(ndb.Model):
    user = ndb.UserProperty()

class JobPost(ndb.Model):
    employer = ndb.KeyProperty(kind=Employer)

class JobSeeker(ndb.Model):
    user = ndb.UserProperty()

    def apply(self, job_post):
        # Reject a second application by this seeker for the same post.
        if JobApplication.query(JobApplication.job_seeker == self.key,
                                JobApplication.job_post == job_post).count(1) == 1:
            raise Exception("Already applied for this position")
        ...

class Resume(ndb.Model):
    job_seeker = ndb.KeyProperty(kind=JobSeeker)
    education = ndb.JsonProperty()
    experience = ndb.JsonProperty()

class JobApplication(ndb.Model):
    job_seeker = ndb.KeyProperty(kind=JobSeeker)
    job_post = ndb.KeyProperty(kind=JobPost)
Notes:
Employer and JobSeeker both carry the built-in UserProperty, which identifies the Google account and allows the user to log in.
Resume uses JsonProperty for education and experience to allow for more fields in the future. You can assign a Python dictionary to this field, for example
resume.education = {'institution': 'name', 'certification': 'certificate', 'area_of_study': 'major', 'year_graduated': 2013, ...}
(I have personally found StructuredProperty to be more pain than gain, and I avoid it now.)
Limiting a JobSeeker to one JobApplication per JobPost is handled by the apply() method, which checks the JobApplication table for an existing application before creating a new one.
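One caveat: a query-then-create check like apply() can race if two requests from the same seeker arrive concurrently. A common NDB way around this is to derive the JobApplication key ID from the two related keys, so the datastore itself enforces uniqueness. A minimal sketch against the models above; apply_once and the ID format are illustrative, not part of the original answer:

@ndb.transactional
def apply_once(seeker_key, post_key):
    # One deterministic ID per (seeker, post) pair.
    app_id = "%s|%s" % (seeker_key.id(), post_key.id())
    if JobApplication.get_by_id(app_id) is not None:
        raise Exception("Already applied for this position")
    application = JobApplication(id=app_id,
                                 job_seeker=seeker_key,
                                 job_post=post_key)
    application.put()  # the get and the put commit atomically
    return application

Because the get and the put run in one transaction, a duplicate application fails cleanly instead of slipping in between the check and the write.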

Related

SQLAlchemy: foreignKeys from multiple Tables (Many-to-Many)

I'm using the flask-sqlalchemy ORM in my Flask app, which is about smart-home sensors and actors (for the sake of simplicity, let's call them Nodes).
Now I want to store an Event which is bound to Nodes: it checks their state and, once the state of the first Nodes has reached a threshold, sets other (or the same) Nodes to a given value.
Additionally, the states could be checked or set from/for Groups or Scenes. So I have three different foreign keys to check and another three to set. There can be more than one key per type, and multiple types per Event.
Here is an example code with the db.Models and pseudocode what I expect to get stored in an Event:
from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy()

class Node(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    value = db.Column(db.String(20))
    # columns snipped out

class Group(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    value = db.Column(db.String(20))
    # columns snipped out

class Scene(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    value = db.Column(db.String(20))
    # columns snipped out

class Event(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    # The following columns may live in an intermediate table,
    # but I have no clue how to design that under these conditions
    constraints = # list of foreign keys from different tables (Node/Group/Scene),
                  # with a threshold per key
    target = # list of foreign keys from different tables (Node/Group/Scene),
             # with a target value per key
In the end I want to be able to check if any of my Events are true to set the bound Node/Group/Scene accordingly.
It may be a database design problem (and not an SQLAlchemy one), but I want to make use of the advantages of the SQLAlchemy ORM here.
Inspired by this and that answer I tried to dig deeper, but other questions on SO were about more specific problems or about one-to-many relationships.
Any hints or design tips are much appreciated. Thanks!
I ended up with a trade-off between usability and lines of code. My first thought was to save as much code as I could (DRY) and to define as few tables as possible.
As SQLAlchemy itself points out in one of its examples, the "generic foreign key" is only supported because it was often requested, not because it is a good solution. With it, less database functionality is used, and the application instead has to take care of the key constraints.
On the other hand, they note that having more tables in your database does not affect performance.
So I tried some approaches and found one that fits my use case. Instead of a "normal" intermediate table for a many-to-many relationship, I use another SQLAlchemy class, with a one-to-many relationship on each side, to connect two tables.
class Event(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    nodes = db.relationship('NodeEvent', back_populates='events')
    # columns snipped out

    def get_as_dict(self):
        return {
            "id": self.id,
            "nodes": [n.get_as_dict() for n in self.nodes]
        }

class Node(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    value = db.Column(db.String(20))
    events = db.relationship('NodeEvent', back_populates='node')
    # columns snipped out

class NodeEvent(db.Model):
    # Composite primary key: at most one row per (event, node) pair.
    ev_id = db.Column('ev_id', db.Integer, db.ForeignKey('event.id'), primary_key=True)
    n_id = db.Column('n_id', db.Integer, db.ForeignKey('node.id'), primary_key=True)
    value = db.Column('value', db.String(200), nullable=False)
    compare = db.Column('compare', db.String(20), nullable=True)
    node = db.relationship('Node', back_populates="events")
    events = db.relationship('Event', back_populates="nodes")

    def get_as_dict(self):
        return {
            "trigger_value": self.value,
            "actual_value": self.node.status,  # status is among the Node columns snipped out above
            "compare": self.compare
        }
The trade-off is that I have to define a new class every time I bind a new table to that kind of relationship. But with the "generic foreign key" approach I would also have to check where each ForeignKey is coming from, so it is the same amount of work at the end of the day.
With my get_as_dict() method I get very handy access to the related data.
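For completeness, wiring a Node to an Event through the association object looks like this (a sketch against the models above; db.session is the standard Flask-SQLAlchemy session):

# The association row carries the per-link payload:
# the trigger value and the comparison operator.
node = Node(value="21.5")
event = Event()
link = NodeEvent(value="25.0", compare=">=")
link.node = node          # fills in NodeEvent.node / Node.events
event.nodes.append(link)  # fills in NodeEvent.events / Event.nodes

db.session.add(event)     # cascades to the link and the node
db.session.commit()

# Walking the association from the Event side:
for link in event.nodes:
    print("%s %s %s" % (link.node.value, link.compare, link.value))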

Tracking item order for storage to and retrieval from a DB

I'm trying to figure out how I'm going to 'CRUD' the order of the items I have in a group that I'm storing in a database. (Pseudo-statement: select * from items where group_id = 1;)
My guess is that I just use a numeric field and increase/decrease the number as items are added to/removed from the group, then update each item's number in this field as items are moved around. However, I've seen this go really badly wrong in an old legacy app, where items would get out of sync and you'd end up with a group whose order looked something like this:
0,1,1,3,4,5
0,1,1,1,4,5
This wasn't handled gracefully by the application either; it broke the app and required manual intervention to reorder the items in the DB.
Is there a way to avoid this pitfall?
EDIT: I may also want the items available in multiple groups, each with its own order.
I think in that case I would need a many-to-many relationship both between group and item and between item and order. /EDIT
I'll be doing this in the Django framework.
I'm not really sure what you are asking, because ordering is one thing and the grouping of related objects is something else entirely.
Databases don't store the order of things; they store the relationships (grouping) between things. The order of things is a user-interface detail, not something a database should be used for.
In Django you can create a ManyToMany relationship. This essentially creates a "box" where you can add and remove items related to a particular model. Here is the example from the documentation:
from django.db import models

class Publication(models.Model):
    title = models.CharField(max_length=30)

    # On Python 3: def __str__(self):
    def __unicode__(self):
        return self.title

    class Meta:
        ordering = ('title',)

class Article(models.Model):
    headline = models.CharField(max_length=100)
    publications = models.ManyToManyField(Publication)

    # On Python 3: def __str__(self):
    def __unicode__(self):
        return self.headline

    class Meta:
        ordering = ('headline',)
Here an Article can belong to many Publications, and Publications have one or more Articles associated with them:
a = Article.objects.create(headline='Hello')
b = Article.objects.create(headline='World')

p = Publication.objects.create(title='My Publication')
p.article_set.add(a)
p.article_set.add(b)

# You can also add an article to a publication from the article object:
c = Article.objects.create(headline='The Answer is 42')
c.publications.add(p)
To know how many articles belong to a publication:
Publication.objects.get(title='My Publication').article_set.count()
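That said, if you do need to persist a per-group order (as the EDIT asks), a common Django pattern is to route the ManyToMany through an explicit intermediate model with a position column. A minimal sketch; the Placement model and the position field are illustrative names, not part of the answer above:

from django.db import models

class Publication(models.Model):
    title = models.CharField(max_length=30)

class Article(models.Model):
    headline = models.CharField(max_length=100)
    publications = models.ManyToManyField(Publication, through='Placement')

class Placement(models.Model):
    article = models.ForeignKey(Article)
    publication = models.ForeignKey(Publication)
    position = models.PositiveIntegerField()

    class Meta:
        ordering = ('position',)
        # Two items can never hold the same slot in one group, which
        # rules out the 0,1,1,3,4,5 corruption at the database level.
        unique_together = ('publication', 'position')

# Placing articles and reading them back in order:
# Placement.objects.create(article=a, publication=p, position=1)
# [pl.article for pl in p.placement_set.all()]

Reordering then becomes an update of Placement rows, and the unique_together constraint makes the duplicate positions from the legacy app impossible to store in the first place.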

Google App Engine ndb performance on repeated property

Do I pay a penalty on query performance if I choose to query a repeated property? For example:
class User(ndb.Model):
    user_name = ndb.StringProperty()
    login_providers = ndb.KeyProperty(repeated=True)

fbkey = ndb.Key("ProviderId", 1, "ProviderName", "FB")
for entry in User.query(User.login_providers == fbkey):
    # Do something with entry.key
vs
class User(ndb.Model):
    user_name = ndb.StringProperty()

class UserProvider(ndb.Model):
    user_key = ndb.KeyProperty(kind=User)
    login_provider = ndb.KeyProperty()

for entry in UserProvider.query(UserProvider.user_key == auserkey,
                                UserProvider.login_provider == fbkey):
    # Do something with entry.user_key
Based on the GAE documentation, it seems the Datastore takes care of the indexing, and the first, less verbose option would simply use that built-in index. However, I failed to find any documentation confirming this.
Edit
The sole purpose of UserProvider in the second example is to create a one-to-many relationship between a user and its login providers. I wanted to understand whether it is worth the trouble of creating a second entity instead of querying the repeated property. Also, assume that all I need is the key of the User.
No, there is no penalty on query performance. But you'll raise your write costs, because each value of a repeated property gets its own index entry, and write costs are based on the number of index entries updated. A User with three login_providers, for example, writes three index entries for that one property.

How to flatten a 'friendship' model within User model in GAE?

I recently came across a number of articles recommending that data be flattened for NoSQL databases. Coming from traditional SQL databases, I realized I was replicating SQL-style behaviour in GAE, so I started to refactor code where possible.
We have e.g. a social media site where users can become friends with each other.
class Friendship(ndb.Model):
    from_friend = ndb.KeyProperty(kind=User)
    to_friend = ndb.KeyProperty(kind=User)
Effectively the app creates a friendship instance between both users.
friendshipA = Friendship(from_friend=userA, to_friend=userB)
friendshipB = Friendship(from_friend=userB, to_friend=userA)
How could I now move this into the actual User model to flatten it? I thought maybe I could use a StructuredProperty. I know it is limited to 5000 entries, but that should be enough for friends.
class User(UserMixin, ndb.Model):
    name = ndb.StringProperty()
    friends = ndb.StructuredProperty(User, repeated=True)
So I came up with this; however, it seems User can't point to itself, because I get a NameError: name 'User' is not defined
Any idea how I could flatten it so that a single User instance would contain all its friends, with all their properties?
You can't create a StructuredProperty that references itself. Also, using a StructuredProperty to store a copy of each User has the additional problem of requiring a manual cascading update whenever a user modifies one of the stored properties.
However, since KeyProperty also accepts a string as the kind, you can easily store the list of Users using KeyProperty, as suggested by @dragonx. You can further optimise reads with ndb.get_multi, which avoids multiple round-trip RPC calls when retrieving friends.
Here is some sample code:
class User(ndb.Model):
    name = ndb.StringProperty()
    friends = ndb.KeyProperty(kind="User", repeated=True)

userB = User(name="User B")
userB_key = userB.put()
userC = User(name="User C")
userC_key = userC.put()
userA = User(name="User A", friends=[userB_key, userC_key])
userA_key = userA.put()

# To retrieve all friends
for user in ndb.get_multi(userA.friends):
    print "user: %s" % user.name
Use a KeyProperty that stores the key for the User instance.

Google App Engine Query (not filter) for children of an entity

Are the children of an entity available in a Query?
Given:
class Factory(db.Model):
    """ Parent kind """
    name = db.StringProperty()

class Product(db.Model):
    """ Child kind; use Product(parent=factory) to make one """

    @property
    def factory(self):
        return self.parent()

    serial = db.IntegerProperty()
Assume 500 factories have each made 500 products, for a total of 250,000 products. Is there a way to form a resource-efficient query that will return just the 500 products made by one particular factory? The ancestor method is a filter, so using e.g. Product.all().ancestor(factory_1) would seemingly require repeated calls to the datastore.
Although ancestor is described as a "filter", it actually just updates the query object to add the ancestor condition. No request is sent to the datastore until you iterate over the query, so what you have will work fine.
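Concretely, building and running the ancestor query is a single round trip (a sketch using the db API from the question):

# Build the query; nothing is sent to the datastore yet.
query = Product.all().ancestor(factory_1)

# The RPC happens here, returning only this factory's products.
products = query.fetch(limit=500)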
One minor point, though: 500 entities with the same parent can hurt scalability, since writes to a single entity group are serialized. If you just want to track the factory that made a product, use a ReferenceProperty instead:
class Product(db.Model):
    factory = db.ReferenceProperty(Factory, collection_name="products")
You can then get all the products by using:
myFactory.products
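Since the back-reference is itself a db Query, it can be refined before it runs; a short sketch:

# myFactory.products is a Query over Product filtered on this factory,
# so the usual query methods still apply.
for product in myFactory.products.order('-serial').fetch(20):
    print product.serial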
