This question might be relevant for any document-based NoSQL database.
I'm building an interest-specific social network and decided to go with DynamoDB because of its scalability and low administration overhead. There are only two main entities in the database: users and posts.
The requirements for the common queries are very simple:
Home feed (feed of people I'm following)
My/User feed (feed of mine, or specific user feed)
List of users I (or a given user) follow
List of followers
Here is the database schema I have come up with so far (legend: __thisIsHashKey and _thisIsRangeKey):
timeline = { // post
    __username: "totocaster",
    _date: "1245678901345",
    record_type: "collection",
    items: ["2d931510-d99f-494a-8c67-87feb05e1594", "2d931510-d99f-494a-8c67-87feb05e1594", "2d931510-d99f-494a-8c67-87feb05e1594", "2d931510-d99f-494a-8c67-87feb05e1594", "2d931510-d99f-494a-8c67-87feb05e1594"],
    number_of_likes: 123,
    description: "Hello, this is cool"
}
timeline = { // new follower
    __username: "totocaster",
    _date: "1245678901345",
    record_type: "follow",
    follower: "tamuna123"
}
timeline = { // new like
    __username: "totocaster",
    _date: "1245678901345",
    record_type: "like",
    liker: "tamuna123",
    like_date: "123255634567456"
}
users = {
    __username: "totocaster",
    avatar_url: "2d931510-d99f-494a-8c67-87feb05e1594",
    followers: ["don_gio", "tamuna123", "barbie", "mikecsharp", "bassman"],
    following: ["tamuna123", "barbie", "mikecsharp"],
    likes: [
        {
            username: "barbie",
            date: "123255634567456"
        },
        {
            username: "mikecsharp",
            date: "123255634567456"
        }
    ],
    full_name: "Toto Tvalavadze",
    password: "Hashed Key",
    email: "totocaster@myemailprovider.com"
}
As you can see, I ended up storing all my posts directly in the timeline collection. This way I can query for posts using username and date (hash and range keys). Everything seems fine, but here is the problem:
I cannot query for the user timeline in one go. This will be one of the most frequently used queries in the system, and I cannot find an efficient way to do it. Please help. Thanks.
I happen to work with news feeds daily. (I'm the author of Stream-Framework and the founder of getstream.io.)
The most common solutions I see are:
Cassandra (Instagram)
Redis (expensive, but easy)
MongoDB
DynamoDB
RocksDB (Linkedin)
Most people use either fanout on write or fanout on read. This makes it easier to build a working solution, but it can get expensive quickly. Your best bet is to use a combination of those 2 approaches. So do a fanout on write in most cases, but for very popular feeds keep them in memory.
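To make the fanout-on-write idea concrete, here is a minimal sketch in Python. It is not Stream-Framework's real API: the feed_store interface, the follower lookups, and the popularity threshold are all assumptions, just to illustrate pushing a new post into each follower's feed at write time while very popular authors are resolved at read time instead.

# Hypothetical hybrid fanout sketch (not Stream-Framework's actual API).
POPULAR_THRESHOLD = 10_000  # above this, skip fanout and resolve at read time

def publish_post(author_id, post, feed_store, follower_ids):
    """Push a new post into each follower's precomputed feed (fanout on write)."""
    followers = follower_ids(author_id)
    if len(followers) > POPULAR_THRESHOLD:
        # Very popular author: don't write N copies; readers pull instead.
        feed_store.append_author_feed(author_id, post)
        return
    for follower in followers:
        feed_store.append_home_feed(follower, post)

def read_home_feed(user_id, feed_store, popular_following, limit=50):
    """Merge the precomputed feed with posts pulled from popular authors (fanout on read)."""
    posts = feed_store.read_home_feed(user_id, limit)
    for author_id in popular_following(user_id):
        posts.extend(feed_store.read_author_feed(author_id, limit))
    # Posts are assumed to carry a created_at timestamp.
    return sorted(posts, key=lambda p: p["created_at"], reverse=True)[:limit]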
Stream-Framework is open source and supports Cassandra/Redis & Python
getstream.io is a hosted solution built on top of Go & RocksDB.
If you do end up using DynamoDB, be sure to set up the right partition key:
https://shinesolutions.com/2016/06/27/a-deep-dive-into-dynamodb-partitions/
Also note that a Redis or DynamoDB based solution will get expensive pretty quickly. You'll get the lowest cost per user by leveraging Cassandra or RocksDB.
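For reference, here is roughly how the timeline table from the question (username as partition/hash key, date as range key) would be queried with boto3; the table and attribute names follow the question's schema, and everything else is illustrative.

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
timeline = dynamodb.Table("timeline")  # table name taken from the question's schema

# Fetch the 50 most recent timeline records for one user (a single Query, no Scan).
response = timeline.query(
    KeyConditionExpression=Key("username").eq("totocaster"),
    ScanIndexForward=False,  # newest first, since date is the range key
    Limit=50,
)
records = response["Items"]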
I would check out the Titan graph database (http://thinkaurelius.github.com/titan/) and Neo4j (http://www.neo4j.org/).
I know Titan claims to scale pretty well with large data sets.
Ultimately I think your model maps well to a graph. Users and posts would be nodes, and then you can connect them arbitrarily via edges. A user (node) is a friend (edge) of another user (node).
A user (node) has many posts (nodes) in their timeline. Then you can run interesting traversals via the graph.
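As a rough illustration of such a traversal, here is a minimal sketch using the official neo4j Python driver; the labels, relationship types, and connection details are assumptions, not a prescribed schema.

from neo4j import GraphDatabase

# Connection details are placeholders.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def followers_of(username):
    """Return the usernames of everyone following the given user."""
    with driver.session() as session:
        result = session.run(
            "MATCH (follower:User)-[:FOLLOWS]->(u:User {username: $name}) "
            "RETURN follower.username AS username",
            name=username,
        )
        return [record["username"] for record in result]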
You can also use Amazon Neptune (https://aws.amazon.com/neptune/), a graph database that is well suited for social networks. I don't think DynamoDB would be a good choice for your use cases.
Related
I am assessing backends for a location-based dating app similar to Tinder.
The main feature is showing nearby online users (with sex and age filters).
Some database engines I have in mind are Redis, Cassandra, and MySQL Cluster.
The app should scale horizontally by adding nodes at high-traffic times.
After researching, I am still unsure whether there is a common "best practice" data model or algorithm for this.
My approach is using Redis Cluster:
// Store all online users in the same location (city) in a Set. In this case, store user:1 in the New York set
SADD location:NewYork 1
// Store all users' ages in a Sorted Set. In this case, user:1 has age 30
ZADD age 30 "1"
// Retrieve users in New York aged 20 to 40
ZINTERSTORE tmpkey 2 location:NewYork age AGGREGATE MAX
ZRANGEBYSCORE tmpkey 20 40
I am inexperienced and cannot foresee the potential problems if this has to scale to millions of concurrent users.
I hope a veteran can shed some light.
For your use case, MongoDB would be a good choice.
You can store each user in a single document, along with their current location.
Create indexes on the fields you want to query on, e.g. age, gender, location.
MongoDB has built-in support for geospatial queries, so it is easy to find users within a 1 km radius of another user.
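A minimal pymongo sketch of that kind of query; the database, collection, field names, and coordinates are assumptions, not a required schema.

from pymongo import MongoClient, GEOSPHERE

client = MongoClient("mongodb://localhost:27017")
users = client.dating_app.users  # hypothetical database/collection names

# Location stored as GeoJSON, indexed with a 2dsphere index.
users.create_index([("location", GEOSPHERE)])

# Online female users aged 20-40 within 1 km of a given point.
nearby = users.find({
    "location": {
        "$near": {
            "$geometry": {"type": "Point", "coordinates": [-73.9857, 40.7484]},
            "$maxDistance": 1000,  # metres
        }
    },
    "age": {"$gte": 20, "$lte": 40},
    "gender": "female",
    "online": True,
})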
Most NoSQL geo/proximity index features rely on the geohash algorithm:
http://www.bigfastblog.com/geohash-intro
It's a good thing to understand how it works, and it's really quite fascinating. This technique can also be used to create highly efficient indexes on a relational database.
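To give a rough feel for the idea (nearby points tend to share a geohash prefix, which makes them range-scannable in an ordinary sorted index), here is a minimal pure-Python encoder. It is a sketch of the standard algorithm for illustration, not a production library.

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash(lat, lon, precision=9):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    use_lon = True  # geohash interleaves bits, starting with longitude
    while len(bits) < precision * 5:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1)
                lon_lo = mid
            else:
                bits.append(0)
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1)
                lat_lo = mid
            else:
                bits.append(0)
                lat_hi = mid
        use_lon = not use_lon
    chars = []
    for i in range(0, len(bits), 5):
        value = 0
        for bit in bits[i:i + 5]:
            value = (value << 1) | bit
        chars.append(BASE32[value])
    return "".join(chars)

# Two nearby points share a long common prefix; a distant one does not.
print(geohash(40.7484, -73.9857))   # midtown Manhattan
print(geohash(40.7470, -73.9860))   # a few hundred metres away
print(geohash(51.5007, -0.1246))    # London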
Redis does have native support for geo indexing, but if you're using ElastiCache, that version of Redis does not, and you'll need to manage this in your API.
Any relational database will give you the most flexibility and the simplest solution. The problem you may face is query time. If you're optimizing for searches on your DB instance (possibly with a 'search DB' separate from profile/content data), then it's possible to keep the entire index in memory for fast results.
I can also talk a bit about Redis: the sorted set operations are blazingly fast, but you need to filter. Either you scan through your nearby results and look up meta information to filter, or you maintain separate sets for every combination of filters you may need. The first has more performance overhead. The second requires you to manage the indexes yourself. E.g.: what if someone removes one of their 'likes'? What if they move around?
It's not flash or fancy, but in most cases where you need to search a range of data, relational databases win due to their simplicity and support. Think of your search as a replica of your master source, and you can always migrate to another solution, or re-shard/scale if you need to in the future.
You may be interested in the Redis Geo API.
The Geo API consists of a set of new commands that add support for storing and querying pairs of longitude/latitude coordinates into Redis keys. GeoSet is the name of the data structure holding a set of (x,y) coordinates. Actually, there isn’t any new data structure under the hood: a GeoSet is simply a Redis SortedSet.
Redis Geo Tutorial
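A small sketch of those commands from Python, using redis-py's generic execute_command so it works across client versions; the key and member names are made up.

import redis

r = redis.Redis(host="localhost", port=6379)

# Add members with their longitude/latitude to a GeoSet (a sorted set underneath).
r.execute_command("GEOADD", "online_users", -73.9857, 40.7484, "user:1")
r.execute_command("GEOADD", "online_users", -73.9860, 40.7470, "user:2")

# Members within 1 km of a point, nearest first.
nearby = r.execute_command(
    "GEORADIUS", "online_users", -73.9857, 40.7484, 1, "km", "ASC"
)
print(nearby)  # e.g. [b'user:1', b'user:2']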
I will also suggest MongoDB for these requirements. With MongoDB Compass you can also visualize your geospatial data. The Compass documentation is at https://docs.mongodb.com/compass/getting-started/.
I'm using ServiceStack.Redis to implement a demo project. It contains two POCOs, i.e. Albums and their Songs.
Below is the search results measured using a stopwatch instance:
Time elapsed searching 5804 items is 00:00:00.1243984 <-- Albums
Time elapsed searching 138731 items is 00:00:02.0592068 <-- Songs
As you can see, the search over the songs is taking too much time. I'm displaying the results in a WPF application where the search term is also entered. This lag makes Redis a no-go.
Below is the code used for searching:
IEnumerable<int> songsFromRedis =
    songRedis.GetAll()
             .Where(song => song.Title != null
                            && song.Title.ToLowerInvariant().Contains(searchText))
             .OrderBy(song => song.Title)
             .Select(x => x.AlbumId);
If we cannot make it any faster, would Elasticsearch help?
The issue is how you're using Redis, i.e. songRedis.GetAll() downloads the entire dataset, deserializes all entities into C# objects and performs the search on the client.
You should never download and query an entire dataset over the network on the client (with any datastore); even a full server-side table scan would perform much better, since only the filtered results are returned to the client rather than the entire dataset. Ideally even full server-side table scans should be avoided, and queries should be made via an index.
Redis doesn't have built-in support for indexes, but when needed you can use a SET to manually create indexes between entities in Redis.
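For illustration, here is a rough sketch of that manual-index idea, shown in Python with redis-py rather than ServiceStack.Redis; the key names and the word-based tokenization are assumptions. Each title word gets a SET of song ids, so a search intersects small sets instead of scanning every song.

import redis

r = redis.Redis()

def index_song(song_id, title):
    """Maintain one SET of song ids per lowercase title word."""
    r.hset("song:%s" % song_id, mapping={"title": title})
    for word in title.lower().split():
        r.sadd("idx:song:title:%s" % word, song_id)

def search_songs(term):
    """Look up songs whose title contains the given word via the index."""
    ids = r.smembers("idx:song:title:%s" % term.lower())
    return [r.hget("song:%s" % sid.decode(), "title") for sid in ids]

index_song(1, "Blue Moon")
index_song(2, "Moon River")
print(search_songs("moon"))  # both songs, without scanning the whole dataset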
Any search-oriented database would help. Even MySQL full-text search, which is notoriously slow, would be a lot better here.
Elasticsearch is one good alternative, Sphinx another. ES wins on scalability and ease of use; Sphinx wins on performance and covers most of the common features, but it is a bit more work to learn and to scale.
I would like to know whether it is worth using a graph database specifically for the relationships.
I intend to use a relational database for storing entities like "User", "Page", "Comment", "Post", etc.
But for most of a typical social-graph workload, I need deep traversals that relational databases do not handle well and that involve slow joins.
Example: Comment -(made_in)-> Post -(made_in)-> Page etc...
I'm thinking of doing something like this:
Example:
User id: 1
Query: Get all followers of user_id 1
Query Neo4j for all outgoing edges named "follows" for the user node with id 1.
Then, with the resulting list of ids, query them against the Users table:
SELECT *
FROM users
WHERE user_id IN (ids)
Is this slow?
I have seen the question "Is it a good idea to use MySQL and Neo4j together?", but I still cannot understand why the accepted answer says it is not a good idea.
Thanks
Using Neo4j is a great choice of technologies for an application like yours, that requires deep traversals. The reason it's a good choice is two-fold: one is that the Cypher language makes such queries very easy. The second is that deep traversals happen very quickly, because of the way the data is structured in the database.
In order to reap both of these benefits, you will want to have both the relationships and the people (as nodes) in the graph. Then you'll be able to do a friend-of-friends query as follows:
START john=node:node_auto_index(name = 'John')
MATCH john-[:friend]->()-[:friend]->fof
RETURN john, fof
and a friend-of-friend-of-friend query as follows:
START john=node:node_auto_index(name = 'John')
MATCH john-[:friend]->()-[:friend]->()-[:friend]->fofof
RETURN john, fofof
...and so on. (Same idea for posts and comments, just replace the name.)
Using Neo4j alongside MySQL is fine, but I wouldn't do it in this particular way, because the code will be much more complex, and you'll lose too much time hopping between Neo4j and MySQL.
Best of luck!
Philip
In general, the more databases/systems/layers you've got, the more complex the overall setup and operating will be.
Think about all those tasks like synchronization, export/import, backup/archive etc. which become quite expensive if your database(s) grow in size.
People use polyglot persistence only if the benefits of having dedicated, specialized databases outweigh the drawbacks of having to cope with multiple data stores. For example, this can be the case if you have a large number of data items (activity or transaction logs, say), each related to a user. It would probably make no sense to store all the information in a graph database if you're only interested in the connections between the data items. So you would be better off storing only the relations in the graph (with the nodes holding just a pointer into the other database), and the data per item in a K/V store or the like.
For your example use case, I would go only for one database, namely Neo4j, because it's a graph.
As the other answers indicate, using Neo4j as your single data store is preferable. However, in some cases there might not be much choice in the matter, because you already have another database behind your product. I would just like to add that, if this is the case, running Neo4j as your secondary database does work (the product I work on operates in this mode). You do have to work extra hard at figuring out what functionality you expect out of Neo4j, what kind of data you need for it, how to keep the data in sync, and the consequences of not always having real-time results. Most of our use cases can work with near-real-time results, so we are fine, but that may not be the case for your product. Still, to me, using Neo4j in this mode is preferable to running without it.
We are able to produce a lot of graphy-great stuff as a result of it.
I'm developing an application that allows users to tag product purchases (via a web app).
I intend to use the tags to automatically query DBpedia (and possibly other open data sources such as Freebase).
The top N results returned from DBpedia will be displayed to users, and they will select the one that most closely resembles the tag they entered. (I will only extract specific data.)
For example:
A user enters the tag 'iPhone' and a SPARQL query is sent to DBpedia. The results are parsed and some data on each result is shown to the user, who then selects the one that most closely resembles what they bought.
I want to extract some of the data from the user's selected DBpedia result and store it for marketing purposes at a later stage (ideally via some call to an API).
I was thinking of either Bigdata or Protégé OWL, but have no experience with either.
Can anybody suggest the best tool for this task and advantages/disadvantages/learning curve/etc...?
Thanks
It all depends on what you want to do with the data that you've extracted. The simplest option is just to store the reconciled entity URI along with your other data in a relational database or even a NoSQL database. This lets you easily query Freebase and DBpedia for that entity later on.
If you want to pull in "everything there is to know" about an entity from Freebase and DBpedia, then you're probably better off with a triple store. With this approach, you can query all the data locally; but now you have to worry about keeping it updated.
For the kind of thing you have in mind, I don't think you necessarily need a highly scalable triplestore solution. More important, it seems to me, is that you have a toolkit for easy execution of SPARQL queries, result processing, and quick local caching of RDF data.
With those things in mind, I'd recommend having a look at OpenRDF Sesame. It's a Java toolkit and API for working with RDF and SPARQL with support for multiple storage backends. It has a few built-in stores that perform well for what you need (scaling up to about 100 million facts in a single store), and if you do find you need a bigger/better storage solution, stores like BigData or OWLIM are pretty much just drop-in replacements for Sesame's own storage backends, so you get to switch without having to make large changes to your code.
Just to give you an idea: the following lines of code use Sesame to fire a SPARQL query against DBPedia and process the result:
SPARQLRepository dbpediaEndpoint = new SPARQLRepository("http://dbpedia.org/sparql");
dbpediaEndpoint.initialize();

RepositoryConnection conn = dbpediaEndpoint.getConnection();
try {
    String queryString = "SELECT ?x WHERE { ?x a foaf:Person } LIMIT 10";
    TupleQuery query = conn.prepareTupleQuery(QueryLanguage.SPARQL, queryString);
    TupleQueryResult result = query.evaluate();
    while (result.hasNext()) {
        // and so on and so forth, see sesame manual/javadocs
        // for details and examples
    }
}
finally {
    conn.close();
}
(disclosure: I work on Sesame)
As an example, Google App Engine uses Google Datastore, not a standard database, to store data. Does anybody have any tips for using Google Datastore instead of databases? It seems I've trained my mind to think 100% in object relationships that map directly to table structures, and now it's hard to see anything differently. I can understand some of the benefits of Google Datastore (e.g. performance and the ability to distribute data), but some good database functionality is sacrificed (e.g. joins).
Does anybody who has worked with Google Datastore or BigTable have any good advice to working with them?
There are two main things to get used to about the App Engine datastore compared to 'traditional' relational databases:
The datastore makes no distinction between inserts and updates. When you call put() on an entity, that entity gets stored to the datastore with its unique key, and anything that has that key gets overwritten. Basically, each entity kind in the datastore acts like an enormous map or sorted list.
Querying, as you alluded to, is much more limited. No joins, for a start.
The key thing to realise - and the reason behind both these differences - is that Bigtable basically acts like an enormous ordered dictionary. Thus, a put operation just sets the value for a given key - regardless of any previous value for that key, and fetch operations are limited to fetching single keys or contiguous ranges of keys. More sophisticated queries are made possible with indexes, which are basically just tables of their own, allowing you to implement more complex queries as scans on contiguous ranges.
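A toy sketch of that "enormous ordered dictionary" model: keys are kept sorted, a get is a single lookup, and the only efficient query is a scan over a contiguous key range, which is exactly what an index row like kind/property/value gives you. This is purely illustrative and not how Bigtable is actually implemented.

import bisect

class OrderedStore:
    """Toy ordered key-value store: single-key gets and contiguous range scans only."""

    def __init__(self):
        self._keys = []   # kept sorted
        self._data = {}

    def put(self, key, value):
        # No insert/update distinction: the value for the key is simply overwritten.
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def scan(self, start, end):
        """Yield (key, value) pairs for start <= key < end."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        for key in self._keys[lo:hi]:
            yield key, self._data[key]

store = OrderedStore()
store.put(("Person", "age", 30, "key1"), None)  # index row: kind/property/value/entity
store.put(("Person", "age", 34, "key2"), None)
store.put(("Person", "age", 41, "key3"), None)
# "All Persons with 30 <= age < 40" becomes one contiguous range scan:
print([k for k, _ in store.scan(("Person", "age", 30), ("Person", "age", 40))])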
Once you've absorbed that, you have the basic knowledge needed to understand the capabilities and limitations of the datastore. Restrictions that may have seemed arbitrary probably make more sense.
The key thing here is that although these are restrictions over what you can do in a relational database, these same restrictions are what make it practical to scale up to the sort of magnitude that Bigtable is designed to handle. You simply can't execute the sort of query that looks good on paper but is atrociously slow in an SQL database.
In terms of how to change how you represent data, the most important thing is precalculation. Instead of doing joins at query time, precalculate data and store it in the datastore wherever possible. If you want to pick a random record, generate a random number and store it with each record. There's a whole cookbook of this sort of tips and tricks here.
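For instance, the "random record" trick mentioned above might look something like this with the old google.appengine.ext.db API used elsewhere on this page; the model and property names are made up.

import random
from google.appengine.ext import db

class Quote(db.Model):
    text = db.StringProperty()
    # Precalculated at write time so "pick a random record" becomes an indexed query.
    random_token = db.FloatProperty()

def add_quote(text):
    Quote(text=text, random_token=random.random()).put()

def random_quote():
    token = random.random()
    # First record at or above the token; wrap around if we fell past the end.
    quote = Quote.all().filter('random_token >=', token).order('random_token').get()
    if quote is None:
        quote = Quote.all().order('random_token').get()
    return quote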
The way I have been going about the mind switch is to forget about the database altogether.
In the relational DB world you always have to worry about data normalization and your table structure. Ditch it all. Just lay out your web pages. Lay them all out. Now look at them. You're already 2/3 there.
If you forget the notion that database size matters and that data shouldn't be duplicated, then you're 3/4 there and you didn't even have to write any code! Let your views dictate your models. You don't have to flatten your objects into two dimensions anymore as in the relational world. You can store objects with shape now.
Yes, this is a simplified explanation of the ordeal, but it helped me forget about databases and just make an application. I have made 4 App Engine apps so far using this philosophy and there are more to come.
I always chuckle when people come out with "it's not relational". I've written cellectr in Django, and here's a snippet of my model below. As you'll see, I have leagues that are managed or coached by users. From a league I can get all the managers, or from a given user I can return the leagues she coaches or manages.
Just because there's no specific foreign key support doesn't mean you can't have a database model with relationships.
My two pence.
class League(BaseModel):
    name = db.StringProperty()
    managers = db.ListProperty(db.Key)  # all the users who can view/edit this league
    coaches = db.ListProperty(db.Key)   # all the users who are able to view this league

    def get_managers(self):
        # This returns the models themselves, not just the keys that are stored in teams
        return UserPrefs.get(self.managers)

    def get_coaches(self):
        # This returns the models themselves, not just the keys that are stored in teams
        return UserPrefs.get(self.coaches)

    def __str__(self):
        return self.name

    # Need to delete all the associated games, teams and players
    def delete(self):
        for player in self.leagues_players:
            player.delete()
        for game in self.leagues_games:
            game.delete()
        for team in self.leagues_teams:
            team.delete()
        super(League, self).delete()


class UserPrefs(db.Model):
    user = db.UserProperty()
    league_ref = db.ReferenceProperty(reference_class=League,
                                      collection_name='users')  # league the users are managing

    def __str__(self):
        return self.user.nickname()

    # many-to-many relationship, a user can coach many leagues, a league can be
    # coached by many users
    @property
    def managing(self):
        return League.gql('WHERE managers = :1', self.key())

    @property
    def coaching(self):
        return League.gql('WHERE coaches = :1', self.key())

    # remove all references to me when I'm deleted
    def delete(self):
        for manager in self.managing:
            manager.managers.remove(self.key())
            manager.put()
        for coach in self.coaching:
            coach.coaches.remove(self.key())
            coach.put()
        super(UserPrefs, self).delete()
I came from the relational database world and then found this Datastore thing. It took several days to get the hang of it. Here are some of my findings.
You probably already know that the Datastore is built to scale, and that is what separates it from an RDBMS. To scale better with large datasets, App Engine has made some changes (and by "some" I mean a lot of changes).
RDBMS VS DataStore
Structure
In a database, we usually structure our data in tables and rows; in the Datastore, these become kinds and entities.
Relations
In an RDBMS, most people use one-to-one, many-to-one, and many-to-many relationships. The Datastore has no joins, but we can still achieve normalization using ReferenceProperty, e.g. for a one-to-one relationship (see the sketch below).
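A minimal sketch of that, using the same google.appengine.ext.db API as elsewhere on this page; the model names are made up.

from google.appengine.ext import db

class Profile(db.Model):
    bio = db.TextProperty()

class Account(db.Model):
    email = db.StringProperty()
    # One-to-one by convention: each Account points at exactly one Profile.
    profile = db.ReferenceProperty(Profile, collection_name='accounts')

profile = Profile(bio="Hello")
profile.put()
Account(email="toto@example.com", profile=profile).put()

# Following the reference issues a separate get, not a join.
account = Account.all().filter('email =', 'toto@example.com').get()
print(account.profile.bio)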
Indexes
Usually in an RDBMS we create indexes (primary key, foreign key, unique key, index key) to speed up searches and boost database performance. In the Datastore you have at least one index per kind (it is generated automatically, whether you like it or not), because the Datastore searches your entities only through these indexes, and believe me, that is the best part. In an RDBMS you can search on a non-indexed field; it will take some time, but it will work. In the Datastore you cannot query on a non-indexed property.
Count
In an RDBMS it is easy to do a count(*), but in the Datastore, please don't even think about it the normal way. (Yes, there is a count function, but it has a 1000 limit and costs a small operation per entity counted, which is not good.) We always have better choices, though: we can use sharded counters, as sketched below.
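A rough sketch of the well-known sharded-counter pattern with the old db API; the shard count and model names are illustrative.

import random
from google.appengine.ext import db

NUM_SHARDS = 20  # more shards = more write throughput

class CounterShard(db.Model):
    name = db.StringProperty(required=True)   # which counter this shard belongs to
    count = db.IntegerProperty(default=0)

def increment(name):
    """Bump one randomly chosen shard inside a transaction."""
    index = random.randint(0, NUM_SHARDS - 1)
    shard_key_name = '%s-%d' % (name, index)

    def txn():
        shard = CounterShard.get_by_key_name(shard_key_name)
        if shard is None:
            shard = CounterShard(key_name=shard_key_name, name=name)
        shard.count += 1
        shard.put()

    db.run_in_transaction(txn)

def get_count(name):
    """Sum all shards; cheap enough for a small, fixed number of shards."""
    total = 0
    for shard in CounterShard.all().filter('name =', name):
        total += shard.count
    return total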
Unique Constraints
In an RDBMS, we love this feature, right? But the Datastore has its own way: you cannot define a property as unique :(.
Query
The GAE Datastore provides a query language much LIKE (oh no! the Datastore does not have a LIKE keyword) SQL, which is GQL.
Data Insert/Update/Delete/Select
This is what we are all interested in. In an RDBMS we use one query each for insert, update, delete and select. The Datastore has put, delete and get (don't get too excited), but it bills put and get in terms of write, read and small operations (read costs for Datastore calls), and that is where data modeling comes into action: you have to minimize these operations to keep your app running affordably. To reduce read operations you can use Memcache, as in the sketch below.
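For example, a simple read-through cache with the App Engine memcache API might look like this, reusing the League model from the earlier snippet on this page; the key name and expiry are arbitrary choices.

from google.appengine.api import memcache

def get_league_names():
    """Serve from Memcache when possible; fall back to a Datastore query."""
    names = memcache.get('league_names')
    if names is None:
        names = [league.name for league in League.all()]
        memcache.set('league_names', names, time=600)  # cache for 10 minutes
    return names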
Take a look at the Objectify documentation. The first comment at the bottom of the page says:
"Nice, although you wrote this to describe Objectify, it is also one of the most concise explanation of appengine datastore itself I've ever read. Thank you."
https://github.com/objectify/objectify/wiki/Concepts
If you're used to thinking about ORM-mapped entities then that's basically how an entity-based datastore like Google's App Engine works. For something like joins, you can look at reference properties. You don't really need to be concerned about whether it uses BigTable for the backend or something else since the backend is abstracted by the GQL and Datastore API interfaces.
The way I look at the Datastore: a kind identifies a table, per se, and an entity is an individual row within a table. If Google were to take out kinds, it would just be one big table with no structure, and you could dump whatever you want into an entity. In other words, if entities were not tied to a kind, you could pretty much have any structure for an entity and store it all in one location (like a big file with no overall structure, where each line has a structure of its own).
Now back to the original comment: Google Datastore and Bigtable are two different things, so do not confuse the Google Datastore with a data store in the generic storage sense. Bigtable is more expensive than BigQuery (the primary reason we didn't go with it). BigQuery has proper joins and an RDBMS-like SQL language, and it's cheaper, so why not use BigQuery? That said, BigQuery does have some limitations; depending on the size of your data, you might or might not run into them.
Also, rather than "thinking in terms of the Datastore", I think the proper statement would be "thinking in terms of NoSQL databases". There are plenty of them available these days, and when it comes to Google products, except for Google Cloud SQL (which is MySQL), everything else is NoSQL.
Being rooted in the database world, a data store to me would be a giant table (hence the name "Bigtable"). BigTable is a bad example, though, because it does a lot of other things that a typical database might not do, and yet it is still a database. Chances are, unless you know you need to build something like Google's Bigtable, you will probably be fine with a standard database. They need it because they are handling insane amounts of data and systems together, and no commercially available system can do the job in exactly the way they need it done.
(bigtable reference: http://en.wikipedia.org/wiki/BigTable)