I'm trying to create a dating app with Firebase (Cloud Firestore) as my NoSQL backend.
However, as I'm new to NoSQL, I've run into several problems around optimizing document size and querying.
users:
user_id
name
age
gender
location
matches: [user1_id, user2_id, ...]
likes_from_other_users: [user1_id, user2_id, ...]
likes_other_users: [user1_id, user2_id, ...]
dislikes_from_other_users: [user1_id, user2_id, ...]
dislikes_other_users: [user1_id, user2_id, ...]
Suppose a user named UserA gets lots of likes from other users; the likes_from_other_users array will eventually grow until the document exceeds the maximum document size (1 MB). The same applies to the other arrays.
How should I proceed?
I was thinking about extracting these large arrays to new collections.
In the case of all the likes to UserA (suppose UserA has N likes from other users), there would be N documents, each with a 'to' field set to UserA's id and a 'from' field set to the id of the particular user who liked UserA.
This would solve the problem of immense array sizes but may increase the complexity of querying for a particular user.
Is this the right approach? I would have to create 5 additional collections (likes_from_other_users, likes_other_users, dislikes_from_other_users, dislikes_other_users, and matches), and it may be slow.
Additionally, the matches collection would contain a pair (array) of users, so to query the matches for a particular user I would need to check all the documents in this collection and find those that contain the user I'm querying for.
Is this the right approach?
I was thinking about extracting these large arrays to new collections.
This is the only scalable solution. Arrays are not scalable at all; individual documents within a collection are effectively infinitely scalable in Firestore. It's not even a question of whether this is the "right" solution; it's the only scalable one when using Firestore.
Note that Firestore is not always the best database for a given application. If you're concerned about the cost or efficiency of a solution to a given problem, maybe Firestore isn't really the best fit. Each database has its strengths, and NoSQL databases are typically not as good at expressing complex relationships as SQL databases are.
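As a rough sketch of the fan-out approach using the Python Admin SDK (the collection and field names here are just illustrative): note that Firestore's array-contains operator lets you query a matches pair directly, so you don't have to scan the whole collection.

import firebase_admin
from firebase_admin import firestore

firebase_admin.initialize_app()  # assumes default credentials are configured
db = firestore.client()

# one document per like: 'from' liked 'to'
db.collection('likes').add({'from': 'userB_id', 'to': 'userA_id'})

# every like UserA has received, no matter how many there are
likes = db.collection('likes').where('to', '==', 'userA_id').stream()

# matches stored as a two-element 'users' array can be queried directly
matches = db.collection('matches').where('users', 'array_contains', 'userA_id').stream()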
Related
I have two collections: a movies collection and a comments collection. I want users to be able to post comments about a movie.
I can either have each movie contain an array of the ids of its comments, or have each comment contain the id of the movie it belongs to. What are the downsides and advantages of each method?
This is more of a theoretical question, so let's assume that comments are too large to be embedded into the movies collection.
This question is difficult to answer. In a NoSQL DB (like MongoDB, which your "mongodb" tag indicates you are using), the choice between two collections, OR a collection with comment ids embedded in an array, OR one single collection with the comment data embedded, really depends on your use cases.
With a SQL database you can create a movie table and a comment table, with the movie's id on each comment row.
With NoSQL, you have to choose based on your use cases: does your page display a list of movies first, with their associated comments? Do you have a page listing the latest comments regardless of movie? You also have to factor technical requirements and restrictions into your thinking. For example, MongoDB has one main restriction:
BSON Document Size - The maximum BSON document size is 16 megabytes. The maximum document size helps ensure that a single document cannot use excessive amount of RAM or, during transmission, excessive amount of bandwidth. To store documents larger than the maximum size, MongoDB provides the GridFS API. See mongofiles and the documentation for your driver for more information about GridFS.
Check https://docs.mongodb.com/manual/reference/limits/ for more details.
My first thought, given your needs and my overall picture of what you want to do with your app, concerns the following use case:
A page lists all movies (you can optionally filter on various movie flags). So your entry point is a movie, not a comment. A comment relates to exactly one movie; a comment never belongs to more than one movie.
For each movie, a user can display the associated comments and add a new comment.
For this use case, a performant DB organisation is: one single collection for movies, where each movie embeds its comments directly in an array of JSON objects, like:
{
"_id":"m001",
"title":"Movie1",
"synopsis":"A young girl want to learn chess and becomes the best player in the world, his name: Beth harmone",
"comments":[
{
"_id":"c001",
"title":"Good movie",
"commentText":"This is a very good movie"
},
{
"_id":"c002",
"title":"Annoying movie",
"commentText":"This is a very annying movie"
}
]
}
You don't need to create another collection to store comments; you would lose responsiveness, because reading would mean joining each movie with a separate comments collection. BUT this is a good choice only if you expect that no single movie document will grow bigger than 16MB (you could also use the GridFS API as indicated by the MongoDB docs, but that's not the subject here...).
Alternatively, IF you think millions and millions of comments, each carrying a lot of information, could be added to a single movie, you will be blocked by the technical limitation. In that case, it is better to split into two collections so the limitation cannot hurt you: each comment becomes a document in a "comments" collection and will certainly never reach 16MB.
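For illustration, a minimal sketch of the split pattern with the Python driver (the database and field names are hypothetical):

from pymongo import MongoClient

client = MongoClient()  # assumes a local MongoDB instance
db = client["moviedb"]  # hypothetical database name

# split pattern: each comment is its own document, keyed by movie_id
db.comments.insert_one({"movie_id": "m001", "title": "Good movie",
                        "commentText": "This is a very good movie"})

# an index on movie_id keeps the per-movie lookup fast
db.comments.create_index("movie_id")
recent = db.comments.find({"movie_id": "m001"}).sort("_id", -1).limit(20)
for comment in recent:
    print(comment["title"])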
Finally, NoSQL DB performance can be much better than a SQL DB's, but you have to design your DB model around your use cases.
I hope this is clear.
Useful links :
https://www.mongodb.com/basics/embedded-mongodb
https://fosterelli.co/collections-and-embedded-documents-in-mongodb (particularly "Example: comments on a blog" which seems to be your use case)
I have 3 types of entity:
Subjects
Topics
Tasks
Each subject contains topics and tasks. Topics can depend on each other. (Of course, a topic that belongs to subject sj1 can only depend on another topic that also belongs to sj1.)
Between tasks and topics there are connections (which must also stay within the same subject) that express the fact that solving a certain task requires knowing certain topics.
So a task can require several topics, and a topic can be required by several tasks (an N<--->M connection).
What would be the best way to store this?
Solution 1
Have 3 collections, one for each type of entity.
In tasks and topics, have an index on a subject identifier attribute,
and an edge collection for storing the [N]<-->[M] connections between topics and tasks.
Solution 2
Have 1 collection for the subjects.
For each subject, have one topics and one tasks collection. The connection between subjects and tasks/topics is encoded in the collection-name prefix (i.e. for the chemistry subject we have chemistry_tasks and chemistry_topics collections).
For each subject, have an edge collection for connections between tasks and topics, and another edge collection for connections among topics (i.e. chemistry_topics_tasks_connections and chemistry_topics_connections).
This way, if I want to search among the topics or tasks of a subject, I don't need to pre-filter them on a subject identifier index; I immediately get the collection that contains all of my data. Moreover, I avoid the overhead of an index entry for every document in tasks and topics.
On the other hand, this will result in a mess of collections.
Side note: there will be at most 50 subjects, but the number of tasks and topics is unlimited.
In your terms, "awareness" is expressed through the graph, which requires no extra indexing to work at its best. ArangoDB automatically creates special _key and _from/_to indexes, which it uses for graph traversal.
As for indexing, that's what search performance is all about: indexes are added based on the data you want to find. It really comes down to how you want to search:
one collection with multiple entity types or
multiple collections segregated by entity type.
There is no penalty for having large collections, and a graph can link documents within a single collection - it doesn't need them to be segregated. Also, you can have multiple edge collections and/or multiple document collections. These are some of the concepts that challenge those of us who, like me, come from a traditional RDBMS - "schemaless" or "multi-model" databases kinda turn normalization on its ear.
Personally, I choose to build fairly large collections based on the data source (I import data from external sources). Each collection contains documents of multiple object/data schemas, identified by an objType attribute. The benefit here is that you can search all documents in the collection on a single field (or an index with multiple fields, like title + objType), very quickly reducing the set of documents to iterate/traverse - this is usually where the real performance gains are made.
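As a sketch of that pattern with the python-arango driver (the database, collection, and attribute names are assumptions):

from arango import ArangoClient

client = ArangoClient()
db = client.db("school", username="root", password="")  # hypothetical credentials

# one large collection ("items") holds both topics and tasks, tagged by objType;
# an index on (subject, objType) quickly narrows the set to traverse
aql = """
FOR doc IN items
    FILTER doc.subject == @subject AND doc.objType == @objType
    RETURN doc
"""
cursor = db.aql.execute(aql, bind_vars={"subject": "chemistry", "objType": "topic"})
topics = list(cursor)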
So... I guess I recommend solution #3?
I am assessing backends for a location-based dating app similar to Tinder.
The core feature is showing nearby online users (with sex and age filters).
Some database engines I have in mind are Redis, Cassandra, and MySQL Cluster.
The app should scale horizontally by adding nodes at high-traffic times.
After researching, I am very confused about whether there is a common "best practice" data model and algorithm for this.
My approach is using Redis Cluster:
// Store all online users in same location (city) to a Set. In this case, store user:1 to New York set
SADD location:NewYork 1
// Store all users age to Sorted Set. In this case, user:1 has age 30
ZADD age 30 "1"
// Retrieve users in NewYork age from 20 to 40
ZINTERSTORE tmpkey 2 location:NewYork age AGGREGATE MAX
ZRANGEBYSCORE tmpkey 20 40
I am inexperienced and cannot foresee the potential problems if this has to scale to millions of concurrent users.
I hope a veteran can shed some light.
For your use case, MongoDB would be a good choice.
You can store each user in a single document, along with their current location.
Create indexes on the fields you want to query on, e.g. age, gender, location.
MongoDB has built-in support for geospatial queries, so it is easy to find users within a 1 km radius of another user.
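As a rough sketch with the Python driver (the database, collection, and field names are hypothetical), using a 2dsphere index and a $near query:

from pymongo import MongoClient, GEOSPHERE

users = MongoClient()["datingapp"]["users"]    # hypothetical names
users.create_index([("location", GEOSPHERE)])  # 2dsphere index for geo queries

# female users aged 20-40 within 1 km of a point (longitude, latitude)
nearby = users.find({
    "gender": "F",
    "age": {"$gte": 20, "$lte": 40},
    "location": {"$near": {
        "$geometry": {"type": "Point", "coordinates": [-73.97, 40.77]},
        "$maxDistance": 1000,  # meters
    }},
})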
Most noSQL Geo/proximity index features rely on the GeoHash Algorithm
http://www.bigfastblog.com/geohash-intro
It's a good thing to understand how it works, and it's really quite fascinating. This technique can also be used to create highly efficient indexes on a relational database.
Redis does have native support for this, but if you're using ElastiCache, that version of Redis does not, and you'll need to manage this in your API.
Any relational database will give you the most flexibility and the simplest solution. The problem you may face is query time. If you're optimizing for searches on your DB instance (possibly with a 'search' DB kept separate from profile/content data), then it's possible to hold the entire index in memory for fast results.
I can also talk a bit about Redis: the sorted-set operations are blazingly fast, but you need to filter. Either you scan through your nearby results and look up meta information to filter, or you maintain separate sets for every combination of filters you may need. The first has more performance overhead; the second requires you to manage the indexes yourself. E.g.: what if someone removes one of their 'likes'? What if they move around?
It's not flash or fancy, but in most cases where you need to search a range of data, relational databases win due to their simplicity and support. Think of your search as a replica of your master source, and you can always migrate to another solution, or re-shard/scale if you need to in the future.
You may be interested in the Redis Geo API.
The Geo API consists of a set of new commands that add support for storing and querying pairs of longitude/latitude coordinates into Redis keys. GeoSet is the name of the data structure holding a set of (x,y) coordinates. Actually, there isn’t any new data structure under the hood: a GeoSet is simply a Redis SortedSet.
Redis Geo Tutorial
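For completeness, a minimal sketch with the redis-py client (the key and member names are made up; GEOSEARCH requires Redis 6.2+ and a recent redis-py):

import redis

r = redis.Redis()

# index online users by coordinates (longitude, latitude, member)
r.geoadd("geo:online", (2.349014, 48.864716, "user:1"))

# members within 5 km of a query point
nearby = r.geosearch("geo:online", longitude=2.35, latitude=48.86,
                     radius=5, unit="km")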
I also support MongoDB for these requirements. With the development of MongoDB Compass you can also visualize your geospatial data; the MongoDB Compass documentation is at https://docs.mongodb.com/compass/getting-started/.
I need an efficient way to search through my models to find specific Users. Here's a list:
User - list of users, their names, etc.
Events - table of events for all users, on when they're not available
Skills - many-to-many relationship with the User, a User could have a lot of skills
Contracts - many-to-one with User, a User could work on multiple contracts, each with a rating (if completed)
... etc.
So I have a lot of tables linked to the User table. I need to search for a set of users fitting certain criteria; for example, someone who is available from next Thurs through Fri, has x/y/z skills, and has received an average rating of 4 on all his completed contracts.
Is there some way to do this search efficiently while minimizing the # of times I hit the database? Sorry if this is a very newb question.
Thanks!
Not sure if this method will solve your issue for all 4 cases, but it should at least help with the first one - querying user data efficiently.
I usually find the values or values_list queryset methods faster, because they slim down the SELECT part of the actual SQL, so you get results faster. See the Django docs regarding this.
Also worth mentioning: starting with the current dev version, within values and values_list you can query any type of relationship, including many-to-one.
And finally, you might find in_bulk useful. For a complex query, you can first fetch the ids of some models using values or values_list and then use in_bulk to get the model instances faster. See the Django docs about that.
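A sketch tying those together, assuming hypothetical User/Skill/Contract models where User has a skills many-to-many and a contracts reverse foreign key with a rating field:

from django.db.models import Avg
from myapp.models import User  # hypothetical app and model

# one query for the ids only, then one query for the instances
user_ids = (
    User.objects
    .filter(skills__name__in=["x", "y", "z"])
    .annotate(avg_rating=Avg("contracts__rating"))
    .filter(avg_rating__gte=4)
    .values_list("id", flat=True)
    .distinct()
)
users = User.objects.in_bulk(list(user_ids))  # {id: User instance}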
As an example, Google App Engine uses Google Datastore, not a standard database, to store data. Does anybody have any tips for using Google Datastore instead of databases? It seems I've trained my mind to think 100% in object relationships that map directly to table structures, and now it's hard to see anything differently. I can understand some of the benefits of Google Datastore (e.g. performance and the ability to distribute data), but some good database functionality is sacrificed (e.g. joins).
Does anybody who has worked with Google Datastore or BigTable have any good advice to working with them?
There are two main things to get used to about the App Engine datastore compared to 'traditional' relational databases:
The datastore makes no distinction between inserts and updates. When you call put() on an entity, that entity gets stored to the datastore with its unique key, and anything that has that key gets overwritten. Basically, each entity kind in the datastore acts like an enormous map or sorted list.
Querying, as you alluded to, is much more limited. No joins, for a start.
The key thing to realise - and the reason behind both these differences - is that Bigtable basically acts like an enormous ordered dictionary. Thus, a put operation just sets the value for a given key - regardless of any previous value for that key, and fetch operations are limited to fetching single keys or contiguous ranges of keys. More sophisticated queries are made possible with indexes, which are basically just tables of their own, allowing you to implement more complex queries as scans on contiguous ranges.
Once you've absorbed that, you have the basic knowledge needed to understand the capabilities and limitations of the datastore. Restrictions that may have seemed arbitrary probably make more sense.
The key thing here is that although these are restrictions over what you can do in a relational database, these same restrictions are what make it practical to scale up to the sort of magnitude that Bigtable is designed to handle. You simply can't execute the sort of query that looks good on paper but is atrociously slow in an SQL database.
In terms of how to change how you represent data, the most important thing is precalculation. Instead of doing joins at query time, precalculate data and store it in the datastore wherever possible. If you want to pick a random record, generate a random number and store it with each record. There's a whole cookbook of these sorts of tips and tricks here.
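A sketch of that random-record trick with the old db API (the model and property names are just illustrative):

import random
from google.appengine.ext import db

class Record(db.Model):
    rand = db.FloatProperty()  # precomputed at write time

# store a random number alongside each record
Record(rand=random.random()).put()

# pick a random record: the first entity at or above a random point
r = random.random()
winner = Record.all().filter('rand >=', r).order('rand').get()
# if r lands past the largest stored value, retry with 'rand <' descending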
The way I have been going about the mind switch is to forget about the database altogether.
In the relational DB world you always have to worry about data normalization and your table structure. Ditch it all. Just lay out your web pages. Lay them all out. Now look at them. You're already 2/3 of the way there.
If you forget the notion that database size matters and that data shouldn't be duplicated, then you're 3/4 of the way there and you didn't even have to write any code! Let your views dictate your models. You don't have to flatten your objects into two dimensions anymore, as in the relational world. You can store objects with shape now.
Yes, this is a simplified explanation of the ordeal, but it helped me forget about databases and just make an application. I have made 4 App Engine apps so far using this philosophy and there are more to come.
I always chuckle when people come out with "it's not relational". I've written cellectr in Django, and here's a snippet of my model below. As you'll see, I have leagues that are managed or coached by users. From a league I can get all the managers, or from a given user I can return the leagues she coaches or manages.
Just because there's no specific foreign key support doesn't mean you can't have a database model with relationships.
My two pence.
class League(BaseModel):
    name = db.StringProperty()
    managers = db.ListProperty(db.Key)  # all the users who can view/edit this league
    coaches = db.ListProperty(db.Key)   # all the users who are able to view this league

    def get_managers(self):
        # This returns the models themselves, not just the keys that are stored in teams
        return UserPrefs.get(self.managers)

    def get_coaches(self):
        # This returns the models themselves, not just the keys that are stored in teams
        return UserPrefs.get(self.coaches)

    def __str__(self):
        return self.name

    # Need to delete all the associated games, teams and players
    def delete(self):
        for player in self.leagues_players:
            player.delete()
        for game in self.leagues_games:
            game.delete()
        for team in self.leagues_teams:
            team.delete()
        super(League, self).delete()

class UserPrefs(db.Model):
    user = db.UserProperty()
    league_ref = db.ReferenceProperty(reference_class=League,
                                      collection_name='users')  # league the users are managing

    def __str__(self):
        return self.user.nickname()

    # many-to-many relationship: a user can coach many leagues, a league can be
    # coached by many users
    @property
    def managing(self):
        return League.gql('WHERE managers = :1', self.key())

    @property
    def coaching(self):
        return League.gql('WHERE coaches = :1', self.key())

    # remove all references to me when I'm deleted
    def delete(self):
        for league in self.managing:
            league.managers.remove(self.key())
            league.put()
        for league in self.coaching:
            league.coaches.remove(self.key())
            league.put()
        super(UserPrefs, self).delete()
I came from the relational database world and then found this Datastore thing. It took several days to get the hang of it. Here are some of my findings.
You must already know that the Datastore is built to scale, and that is what separates it from an RDBMS. To scale better with large datasets, App Engine has made some changes (some meaning a lot of changes).
RDBMS VS DataStore
Structure
In a database we usually structure our data in tables and rows, which in the Datastore become kinds and entities.
Relations
In an RDBMS, most people follow one-to-one, many-to-one, and many-to-many relationships. The Datastore has "no joins", but we can still achieve normalization using ReferenceProperty, e.g. One-to-One Relationship Example.
Indexes
Usually in an RDBMS we create indexes (primary key, foreign key, unique key, index key) to speed up searches and boost database performance. In the Datastore, you have at least one index per kind (it will be generated automatically whether you like it or not), because the Datastore searches your entities using these indexes - and believe me, that is the best part. In an RDBMS you can search on a non-indexed field; it will take some time, but it will work. In the Datastore you cannot search on a non-indexed property.
Count
In an RDBMS it is easy to count(*), but in the Datastore, please don't even think about it in the normal way (yes, there is a count function): it has a 1000 limit, and it costs one small operation per entity counted, which is not good. But we always have good choices: we can use sharded counters (see the sketch after this list).
Unique Constraints
In an RDBMS, we love this feature, right? But the Datastore has its own way: you cannot define a property as unique :(.
Query
The GAE Datastore provides a query language much LIKE SQL (oh no! the Datastore does not have a LIKE keyword) called GQL.
Data Insert/Update/Delete/Select
This is where we are all interested. In an RDBMS we use one query each for insert, update, delete, and select; the Datastore has put, delete, and get (don't get too excited), and it bills in terms of writes, reads, and small operations (reads cost for Datastore calls). That is where data modeling comes into action: you have to minimize these operations to keep your app running. To reduce read operations you can use Memcache.
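Here is a minimal sharded-counter sketch (referenced under Count above) using the old db API; the shard count and model name are assumptions:

import random
from google.appengine.ext import db

NUM_SHARDS = 20  # assumption: tune to the expected write rate

class CounterShard(db.Model):
    name = db.StringProperty(required=True)
    count = db.IntegerProperty(default=0)

def increment(name):
    # spread writes over NUM_SHARDS entities to avoid contention on one row
    index = random.randint(0, NUM_SHARDS - 1)
    key_name = '%s-%d' % (name, index)
    def txn():
        shard = CounterShard.get_by_key_name(key_name)
        if shard is None:
            shard = CounterShard(key_name=key_name, name=name)
        shard.count += 1
        shard.put()
    db.run_in_transaction(txn)

def get_count(name):
    # reading sums all shards; cache the result in memcache in practice
    return sum(s.count for s in CounterShard.all().filter('name =', name))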
Take a look at the Objectify documentation. The first comment at the bottom of the page says:
"Nice, although you wrote this to describe Objectify, it is also one of the most concise explanation of appengine datastore itself I've ever read. Thank you."
https://github.com/objectify/objectify/wiki/Concepts
If you're used to thinking about ORM-mapped entities then that's basically how an entity-based datastore like Google's App Engine works. For something like joins, you can look at reference properties. You don't really need to be concerned about whether it uses BigTable for the backend or something else since the backend is abstracted by the GQL and Datastore API interfaces.
The way I look at the Datastore is: a kind identifies a table, per se, and an entity is an individual row within that table. If Google were to take out kind, then it would just be one big table with no structure, and you could dump whatever you want in an entity. In other words, if entities are not tied to a kind, you can pretty much give an entity any structure and store everything in one location (a kind of big file with no structure to it, where each line has a structure of its own).
Now back to the original comment: Google Datastore and Bigtable are two different things, so do not confuse Google Datastore with "datastore" in the generic data-storage sense. Bigtable is more expensive than BigQuery (the primary reason we didn't go with it). BigQuery has proper joins and an RDBMS-like SQL language, and it's cheaper, so why not use BigQuery? That said, BigQuery does have some limitations; depending on the size of your data you might or might not encounter them.
Also, rather than "thinking in terms of the datastore", I think the proper statement would be "thinking in terms of NoSQL databases". There are a great many of them available these days, but among Google's products, except for Google Cloud SQL (which is MySQL), everything else is NoSQL.
Being rooted in the database world, a data store to me would be a giant table (hence the name "bigtable"). BigTable is a bad example, though, because it does a lot of other things that a typical database might not do, and yet it is still a database. Chances are, unless you know you need to build something like Google's Bigtable, you will probably be fine with a standard database. They need it because they are handling insane amounts of data and systems together, and no commercially available system can really do the job exactly the way they can demonstrate they need it done.
(bigtable reference: http://en.wikipedia.org/wiki/BigTable)