I have a collection called products which contains products like so:
{
  "name" : ...,
  "price" : ...,
  "category" : ...,
  "quality" : ...
}
and I have a user object that contains an array of products like so
{
"products" : []
}
My question: what is the best way to go about storing "product objects" in the products array?
Do I just put the corresponding ObjectIDs inside the "products" array?
Or do I literally put each "dictionary" as an entry in the products array?
The first approach would actually create a pointer, which seems the most logical, but it means that after I fetch the products array I would have to look up each ObjectID it contains.
The second option means I wouldn't have to resolve each ObjectID, but there would be no link between the products in the user's array and the products in the products collection.
Any other way to go about this?
Thanks
In MongoDB we avoid joins and we reduce the number of collections, for performance and simplicity reasons. The document model allows us to do this.
But there are other things to consider:
Are those data accessed together? (if yes, + to embed)
For example, when we need the users, do we also need the products?
How big will the products array be? (if not too big, + to embed)
How often will those products be updated? (if rarely, + to embed)
If we embed, an update means updating the product in many locations.
How much duplication will we have? (if small, + to embed)
For example, a product can be in 5000 customers' arrays; this means that if the product changes, we need to do all those extra updates.
Is an update even needed? (if no, + to embed/duplicate)
For example, if a product's price changes after a customer added it to their products array, it might be correct for them to pay the price it had when they added it, so updating through a reference would cause wrong results.
Do we need both collections? (if no, + to embed)
For example, if products are always accessed through the customers, maybe we can avoid having a second collection.
Is data staleness acceptable? (if yes, + to embed)
For example, if a product's price changes and a user still has the old price in their products array, is that OK?
(The staleness might last only milliseconds, but it can still be a problem.)
If data staleness is a problem, is it fixable? (if yes, + to embed)
For example, say the price changed right as a person goes to pay, and we are so unlucky that those milliseconds/seconds caused a mismatch; if it can be detected at pay time (like a find by productID and price), then it's fixable.
If data staleness is not acceptable and not fixable, you can consider transactions, but in that case I think a reference is better.
I guess there are more criteria to think about, but I think the above are some of the basic things.
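For concreteness, here is a minimal pymongo sketch of the two options; the "shop" database name, the product values and some_user_id are placeholders, not from the question:
from bson import ObjectId
from pymongo import MongoClient

db = MongoClient()["shop"]
some_user_id = ObjectId("5f0000000000000000000000")  # placeholder

# Option 1: reference - the array holds ObjectIDs; resolving them costs a second query.
user = db.users.find_one({"_id": some_user_id})
products = list(db.products.find({"_id": {"$in": user["products"]}}))

# Option 2: embed - the array holds full product documents; one read, but duplicated data.
db.users.update_one(
    {"_id": some_user_id},
    {"$push": {"products": {"name": "Mug", "price": 9.99,
                            "category": "kitchen", "quality": "A"}}},
)
With references you can also let the server resolve them in one round trip using an aggregation $lookup stage.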
I am trying to figure out the best way to store trip itinerary data into DynamoDB. Just for your info, my code is written in Python3 and I am using Boto3 to interact with DynamoDB.
After researching this resource - https://schema.org/Trip - this is what I think the data classes of the objects would be:
from marshmallow_dataclass import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Itinerary:
    id: str
    startTime: int
    endTime: int
    dayTripId: str
    placeName: str
    placeCategory: str
    estimatedCost: float

@dataclass(frozen=True)
class DayTrip:
    id: str
    day: str
    parentTripId: str
    date: Optional[str]
    itinerary: List[Itinerary]

@dataclass(frozen=True)
class UserTrip:
    tripId: str
    userId: str
    tripName: str
    subTrip: List[DayTrip]
Essentially, the structure is as follows:
A person can have many UserTrips
A UserTrip can consist of one or multiple days of DayTrip, e.g. Day 1, Day 2, Day 3
A DayTrip can have one or multiple places to visit (Itinerary)
An Itinerary is the lowest level that describes the place to visit
It wouldn't be good to store the UserTrip as is, with a nested JSON structure consisting of DayTrip, then Itinerary, right? It would mean that the subTrip attribute of a particular UserTrip would be a huge chunk of JSON. So I think everyone here would agree this is a no-no. Is that correct?
Another alternative I could think of was to store only the id of each entity. What I mean is, for example, a UserTrip would have its subTrip attribute containing a list of DayTrip ids. This means there will be another table to store DayTrip items, and we can connect each one to the corresponding UserTrip via the parentTripId attribute. And so on for the list of Itinerary.
Using this approach, I will have three tables as follows:
user-trip-table to store UserTrip, where subTrip will contain the list of DayTrip.ids
user-day-trip-table to store DayTrip, where itinerary will contain the list of Itinerary.ids. The parentTripId enables mapping back to the original UserTrip
user-itinerary-table to store Itinerary, which can be mapped back to the original DayTrip via the dayTripId attribute
I am not sure if this is good practice, as there will be a lot of lookups and the operations cannot run concurrently: to fetch the Itinerary, I need to wait for the GetItem that fetches the UserTrip to complete; only then do I have the DayTrip ids for another GetItem, and finally I need a third GetItem to fetch the Itinerary.
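For illustration, the chain of dependent reads would look roughly like this in Boto3 (the key names tripId and id are assumptions based on the classes above; error handling omitted):
import boto3

dynamodb = boto3.resource("dynamodb")
trips = dynamodb.Table("user-trip-table")
day_trips = dynamodb.Table("user-day-trip-table")
itineraries = dynamodb.Table("user-itinerary-table")

# 1) Fetch the UserTrip; nothing else can start until this returns.
trip = trips.get_item(Key={"tripId": "trip-123"})["Item"]

# 2) One GetItem per DayTrip id listed in the trip...
for day_trip_id in trip["subTrip"]:
    day_trip = day_trips.get_item(Key={"id": day_trip_id})["Item"]
    # 3) ...and one more GetItem per Itinerary id in each day trip.
    for itinerary_id in day_trip["itinerary"]:
        itinerary = itineraries.get_item(Key={"id": itinerary_id})["Item"]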
Could the community here suggest a better, simpler solution?
Thanks!
Regarding the data structure, I don't see an absolute need for DayTrip, as you can get all that data from Itinerary. So in UserTrip I would keep a list of Itineraries instead of a list of DayTrips.
It wouldn't be good to store the UserTrip as is, with a nested JSON structure consisting of DayTrip, then Itinerary, right? It would mean that the subTrip attribute of a particular UserTrip would be a huge chunk of JSON. So I think everyone here would agree this is a no-no. Is that correct?
Actually this is recommended in NoSQL databases, to have all data denormalised/embedded in the object. You use more storage, but avoid joins/processing. But keep in mind DynamoDB's item size limitation (currently 400KB).
In general, in NoSQL, you need to design your schema based on the queries you will need. For example, in your case you want to fetch all Itineraries of a UserTrip. Simply add userTripId to the Itinerary table and create a GSI on Itinerary with userTripId as the hash key so you can query it efficiently. This way you will get all itinerary objects of a user trip.
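A minimal Boto3 sketch of that query; the index name userTripId-index, the table name, and the id "trip-123" are assumptions:
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
itineraries = dynamodb.Table("user-itinerary-table")

# Query the GSI: one request returns every Itinerary of the given user trip.
response = itineraries.query(
    IndexName="userTripId-index",
    KeyConditionExpression=Key("userTripId").eq("trip-123"),
)
items = response["Items"]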
I have a Kind Students which stores the favorite colors of all students. They are allowed to pick their favorite color from a set of three colors: {Red, Blue, Green}
Let us assume there are 100 students; my code is like this for every student:
Entity arya = new Entity("Student","Arya");
arya.setProperty("Color","Red");
Entity robb = new Entity("Student","Robb");
robb.setProperty("Color","Green");
..
..
Entity jon = new Entity("Student","Jon");
jon.setProperty("Color","Blue");
How do I find out how many students liked a particular color (say Red) in this Student Kind? What query should I write to fetch the count?
Thanks in advance
The number you seek would be the number of items in the result of a query with an equality filter on the Color property.
You could use a keys-only query (a special kind of projection query) for this purpose; it is faster and less expensive:
Keys-only queries
A keys-only query (which is a type of projection query) returns just
the keys of the result entities instead of the entities themselves, at
lower latency and cost than retrieving entire entities.
...
A keys-only query is a small operation and counts as only a single
entity read for the query itself.
Something along these lines (but note that I'm not a java user, the snippet is based only on the documentation examples)
Query<Key> query = Query.newKeyQueryBuilder()
    .setKind("Student")
    .setFilter(PropertyFilter.eq("Color", "Red"))
    .build();
I agree with Dan Cornilescu's answer. Here is direct Datastore API usage: I have prepared the request body for your use case, and you can run it by just adding your Project Id. It will return the entities that match the filter, and you can then count them.
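Based on the Datastore v1 REST reference, the request body (POSTed to https://datastore.googleapis.com/v1/projects/{YOUR_PROJECT_ID}:runQuery) would look something like this:
{
  "query": {
    "kind": [{"name": "Student"}],
    "filter": {
      "propertyFilter": {
        "property": {"name": "Color"},
        "op": "EQUAL",
        "value": {"stringValue": "Red"}
      }
    }
  }
}
Adding a projection on __key__ would make it a keys-only query, as in Dan's answer.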
I am currently working on a telecom analytics project and am a newbie at query optimisation. Showing the result in the browser takes a full minute, while only 45,000 records need to be accessed. Could you please suggest ways to reduce the time taken to show results?
I wrote the following query to find the call duration of the people in an age group:
sigma = 0
popn = len(Demo.objects.filter(age_group=age))
card_list = [Demo.objects.filter(age_group=age)[i].card_no
             for i in range(popn)]
for card in card_list:
    dic = Fact_table.objects.filter(card_no=card).aggregate(Sum('duration'))
    sigma += dic['duration__sum']
avgDur = sigma / popn
The above code is inside a for loop that iterates over the age groups.
The models are as follows:
class Demo(models.Model):
    card_no = models.CharField(max_length=20, primary_key=True)
    gender = models.IntegerField()
    age = models.IntegerField()
    age_group = models.IntegerField()

class Fact_table(models.Model):
    pri_key = models.BigIntegerField(primary_key=True)
    card_no = models.CharField(max_length=20)
    duration = models.IntegerField()
    time_8bit = models.CharField(max_length=8)
    time_of_day = models.IntegerField()
    isBusinessHr = models.IntegerField()
    Day_of_week = models.IntegerField()
    Day = models.IntegerField()
Thanks
Try this:
sigma = 0
demo_by_age = Demo.objects.filter(age_group=age)
popn = demo_by_age.count()  # one query
card_list = demo_by_age.values_list('card_no', flat=True)  # two
dic = Fact_table.objects.filter(card_no__in=card_list).aggregate(Sum('duration'))  # three
sigma = dic['duration__sum']
avgDur = sigma / popn
A statement like card_list=[Demo.objects.filter(age_group=age)[i].card_no for i in range(popn)] will generate popn separate queries and database hits. The query in the for loop will also hit the database popn times. As a general rule, you should try to minimize the number of queries you use, and you should only select the records you need.
With a few adjustments to your code this can be done in just one query.
There's generally no need to manually specify a primary_key, and in all but some very specific cases it's even better not to define any. Django automatically adds an indexed, auto-incremental primary key field. If you need the card_no field as a unique field, and you need to find rows based on this field, use this:
class Demo(models.Model):
    card_no = models.SlugField(max_length=20, unique=True)
    ...
SlugField automatically adds a database index to the column, essentially making selections by this field as fast as when it is a primary key. This still allows other ways to access the table, e.g. foreign keys (as I'll explain in my next point), to use the (slightly) faster integer field specified by Django, and will ease the use of the model in Django.
If you need to relate an object to an object in another table, use models.ForeignKey. Django gives you a whole set of new functionality that not only makes it easier to use the models, it also makes a lot of queries faster by using JOIN clauses in the SQL query. So for your example:
class Fact_table(models.Model):
    card = models.ForeignKey(Demo, related_name='facts',
                             on_delete=models.CASCADE)  # on_delete is required in Django 2.0+
    ...
The related_name argument allows you to access all Fact_table objects related to a Demo instance by using instance.facts in Django. (See https://docs.djangoproject.com/en/dev/ref/models/fields/#module-django.db.models.fields.related)
With these two changes, your query (including the loop over the different age_groups) can be changed into a blazing-fast one-hit query giving you the average duration of calls made by each age_group:
from django.db.models import Avg

age_groups = Demo.objects.values('age_group').annotate(duration_avg=Avg('facts__duration'))
for group in age_groups:
    print("Age group: %s - Average duration: %s" % (group['age_group'], group['duration_avg']))
.values('age_group') selects just the age_group field from the Demo table. .annotate(duration_avg=Avg('facts__duration')) takes every unique result from values (thus each unique age_group), fetches all Fact_table objects related to any Demo object within that age_group, and calculates the average of all their duration fields - all in a single query.
I have a Google App Engine datastore that could have several million records in it and I'm trying to figure out the best way to do a query where I need get records back that match several Strings.
For example, say I have the following Model:
String name
String level
Int score
I need to return all the records for a given "level" that also match a list of "names". There might be only 1 or 2 names in the name list, but there could be 100.
It's basically a list of high scores ("score") for players ("name") for a given level ("level"). I want to find all the scores for a given "level" for a list of players by "name" to build a high score list that include just your friends.
I could just loop over the list of "names" and query each one's high scores for that level, but I don't know if this is the best way. In SQL I could construct a single (complex) query to do this.
Given the size of the datastore, I want to make sure I'm not wasting time running Python code on work that should be done by the query, or vice versa.
The "level" needs to be a String, not an Int, since they are not numbered levels but rather level names, but I don't know if that matters.
You can use the IN filter operator to match a property against a list of values (user names):
scores = Scores.all().filter('level =', level).filter('user IN', user_list)
Note that under the hood this performs as many queries as there are users in user_list.
players = Player.all().filter('level =', level).order('score')
names = [name1, name2, name3, ...]
players = [p for p in players if p.name in names]
for player in players:
    print player.name, player.score
Is this what you want?
...or am I simplifying too much?
No, you cannot do that in one pass.
You will have to either query the friends for the level one by one,
or
make a friends-score entity for each level: each time a score changes, check which friends' lists that player belongs to and update all of those lists. Then it's just a matter of retrieving that list.
The first option will be slow and the second costly unless optimized.
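A rough sketch of the second option using the old db API; every name here is hypothetical, and the write fan-out is the costly part:
from google.appengine.ext import db

class FriendScores(db.Model):
    # One entity per (user, level): a cached copy of that user's friends' scores.
    user_id = db.StringProperty()
    level = db.StringProperty()
    friend_names = db.StringListProperty()
    friend_scores = db.ListProperty(int)

def on_score_change(player_name, level, new_score):
    # An equality filter on a list property matches any element, so this finds
    # every user who has this player in their friends list for the level.
    q = FriendScores.all().filter('level =', level).filter('friend_names =', player_name)
    for entry in q:
        i = entry.friend_names.index(player_name)
        entry.friend_scores[i] = new_score
        entry.put()
Reading a friend leaderboard is then a single query on (user_id, level).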
For example, you make a search for a hotel in London and get 250 hotels, of which 25 are shown on the first page. On each page the user has an option to sort the hotels by price, name, user reviews, etc. The intelligent thing to do is to fetch only the first 25 hotels from the database for the first page. When the user moves to page 2, make another database query for the next 25 hotels and keep the previous results in cache.
Now consider this: the user is on page 1 and sees 25 hotels sorted by price, and then sorts them by user rating. In this case we should keep the hotels we already got in cache and only request the additional ones. How is that implemented? Is there something built into any language (preferably PHP), or do we have to implement it from scratch using multiple queries?
This is usually done as follows:
The query is executed with an ORDER BY on the required field, and with a TOP (in some databases, LIMIT) clause set to (page_index + 1) * entries_per_page results. The query returns a random-access rowset (you might also hear this referred to as a resultset or a recordset, depending on the database library you are using) which supports methods such as MoveTo( row_index ) and MoveNext(). So, we execute MoveTo( page_index * entries_per_page ) and then read and display entries_per_page results.
The rowset generally also offers a Count property, which we invoke to get the total number of rows the query would fetch if we let it run to the end (which of course we don't), so that we can compute and show the user how many pages exist.
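A minimal sketch of the same idea expressed with LIMIT/OFFSET, using Python's sqlite3; the hotels table and its columns are hypothetical:
import sqlite3

ALLOWED_SORTS = {"price", "name", "rating"}  # whitelist; never interpolate raw user input

def fetch_page(conn, sort_field, page_index, entries_per_page=25):
    if sort_field not in ALLOWED_SORTS:
        raise ValueError("unsupported sort field: %s" % sort_field)
    rows = conn.execute(
        "SELECT name, price, rating FROM hotels "
        "ORDER BY %s LIMIT ? OFFSET ?" % sort_field,
        (entries_per_page, page_index * entries_per_page),
    ).fetchall()
    # The total row count lets the UI compute how many pages exist.
    total = conn.execute("SELECT COUNT(*) FROM hotels").fetchone()[0]
    return rows, total
Re-sorting (the price-to-rating case) is then just another call with a different sort_field: the database re-sorts, and the application cache decides whether a page is already present.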