Consider I have some users U1, U2, U3, each with a property 'age', such that:
U1.age = 10
U2.age = 30
U3.age = 70
I also have some lists, which are dynamic collections of users based on some criteria, say L1, L2, L3, such that:
L1: where age < 60
L2: where age < 30
L3: where age > 20
Since the lists are dynamic, the relationship between lists and users is established only through the user properties and list criteria. There is no hard mapping to indicate which users belong to which list. When the age of any user changes or when the criteria of any list changes, the users associated with a list may also change.
In this scenario, at any point in time it is very easy to get the users associated with a list by querying for users matching the list criteria.
But getting the lists associated with a user is an expensive operation: it involves first determining the users associated with each list and then picking the lists whose results contain the user in question.
Could this be a candidate for using a graph database, and why? (I'm considering Neo4j.) If so, how should I model the nodes and relationships so that I can easily get the lists given a user?
Since version 2.3, Neo4j supports index range queries. Assume you have an index:
CREATE INDEX on :User(age)
Then this query gives you the users younger than 60 years, and it is served by the index:
MATCH (u:User) WHERE u.age < 60 RETURN u
However, I would not store the age; instead I'd store the date of birth as a long property. Otherwise you would have to update the age over and over again.
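For example, a minimal sketch assuming dob holds epoch milliseconds (the cutoff arithmetic ignores leap years for brevity):
// users younger than 60, given dob as epoch milliseconds
WITH timestamp() - 60 * 365 * 24 * 60 * 60 * 1000 AS cutoff
MATCH (u:User) WHERE u.dob > cutoff
RETURN u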
Update based on comment below
Assume you have a node for each list:
CREATE (:List {name: 'L1', min: 0, max: 59})   // age < 60
CREATE (:List {name: 'L2', min: 0, max: 29})   // age < 30
CREATE (:List {name: 'L3', min: 21, max: 999}) // age > 20
Let's find all the lists a user U1 belongs to:
MATCH (me:User{name:'U1'})
WITH me.age as age
MATCH (l:List) WHERE age >= l.min AND age <= l.max // find lists
WITH l
MATCH (u:User) WHERE u.age >= l.min AND u.age <= l.max // find the members of each list
RETURN l.name, collect(u)
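If you only need the list names for the user, without expanding each list's full membership, the first half of that query already answers it:
MATCH (me:User {name: 'U1'})
MATCH (l:List) WHERE me.age >= l.min AND me.age <= l.max
RETURN l.name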
Update 2
A completely different idea would be to use a time tree: both the users and your list definitions are connected to the time tree.
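Purely as an illustration of the shape such a tree could take (real implementations, e.g. the GraphAware TimeTree, build and manage the hierarchy for you):
CREATE (y:Year {value: 1955})-[:HAS_MONTH]->(m:Month {value: 6})-[:HAS_DAY]->(d:Day {value: 15})
CREATE (:User {name: 'U3'})-[:BORN_ON]->(d)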
Related
I have a graph database with 3 types of nodes and two relationship types:
(p:PERSON)-[:manages]->(c:COMPANY)-[:seeks]->(s:SKILLS)
I want to create a new relationship between the nodes labeled (:PERSON) such as: (p1:PERSON)-[:competes_with]->(p2:PERSON) and
(p2:PERSON)-[:competes_with]->(p1:PERSON), subject to p1.name <> p2.name,
so that I can represent competition for scarce labor in a variety of markets represented by (s:SKILLS).
The condition for establishing the new [:competes_with] relationship is that 2 distinct (:PERSON) nodes manage companies that seek at least 3 (:SKILLS) profiles in common.
Orders of magnitude are:
|(:PERSON)| = 6000
|(:COMPANY)| = 15000
|(:SKILLS)| = 95000
In my plodding way, what I did was:
MATCH (p1:PERSON)-[:manages]->(:COMPANY)-[:seeks]->(s:SKILLS)
WITH p1, collect(DISTINCT s.skill_names) AS p1_skills
MATCH (p2:PERSON)-[:manages]->(:COMPANY)-[:seeks]->(s:SKILLS)
WITH p1,p1_skills, p2, collect(DISTINCT s.skill_names) AS p2_skills
WHERE p1 <> p2
UNWIND p1_skills AS sought_skills
WITH p1,p2, sought_skills, reduce(com_skills=[], sought_skills IN p2_skills | com_skills + sought_skills) AS NCS
WHERE size(NCS) >= 3
MERGE(p1)-[competes_with]->(p2)
MERGE(p2)-[competes_with]->(p1)
Given the size of the problem, this causes a 14GB RAM box to crash after a while with an "out-of-memory" exception.
So, besides the fact that I don't know whether my query actually does what I want (it crashes before completing), the question is: can I streamline this to make it work with smaller memory requirements? What would the improved query look like? Tx.
The standard Neo4j naming convention is to use CamelCase label names and all-upper-case relationship type names (property names should start with a lower-case character). In this answer, I will follow that standard and use names like Person and MANAGES.
You don't need 2 COMPETES_WITH relationships between the same 2 Person nodes if the relationship is inherently bidirectional. Neo4j can navigate incoming and outgoing relationships equally easily, and the MATCH clause allows a relationship pattern to not specify a direction (e.g., MATCH (a)-[:FOO]-(b)). Also, the MERGE clause (but not CREATE) allows you to specify an undirected relationship -- which ensures that only one relationship exists between the 2 endpoints.
It seems that the COMPETES_WITH relationship really belongs between Company nodes, since the companies are the actual source of the competition. Also, if a Person left a company, you would not have to remove any COMPETES_WITH relationships from that Person's node (nor add one to the replacement Person).
In addition, you should consider whether the COMPETES_WITH relationship is really needed in the first place. Every time the skills sought by a Company changes, you'd have to recalculate its COMPETES_WITH relationships. You should determine whether doing that is worth it, or whether your queries should just dynamically determine a company's competitors as needed.
Here is a simplified version of your original query:
MATCH (p1:Person)-[:MANAGES]->(:Company)-[:SEEKS]->(s:Skills)<-[:SEEKS]-(:Company)<-[:MANAGES]-(p2:Person)
WHERE p1 <> p2 // a person managing several companies should not compete with himself
WITH p1, p2, COUNT(DISTINCT s) AS num_skills // DISTINCT avoids double counting a skill sought by several of a person's companies
WHERE num_skills >= 3
MERGE (p1)-[:COMPETES_WITH]-(p2);
To find the Person nodes that compete with a given Person:
MATCH (p1:Person {id: 123})-[:COMPETES_WITH]-(p2:Person)
RETURN p1, COLLECT(p2) AS competing_people;
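As an aside, the anchored lookups above assume Person nodes can be found quickly by id. If that property is meant to be unique, a uniqueness constraint (which also creates an index) is worth adding; a sketch in the constraint syntax of that Neo4j era:
CREATE CONSTRAINT ON (p:Person) ASSERT p.id IS UNIQUE;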
If you changed the data model to have the COMPETES_WITH relationship between Company nodes:
MATCH (c1:Company)-[:SEEKS]->(s:Skills)<-[:SEEKS]-(c2:Company)
WITH c1, c2, COUNT(DISTINCT s) AS num_skills
WHERE num_skills >= 3
MERGE(c1)-[:COMPETES_WITH]-(c2);
With this model, to find the Person nodes that compete with a given Person:
MATCH (p1:Person {id: 123})-[:MANAGES]->(:Company)-[:COMPETES_WITH]-(:Company)<-[:MANAGES]-(p2:Person)
RETURN p1, COLLECT(p2) AS competing_people;
If you did not have COMPETES_WITH relationships at all, to find the Person nodes that compete with a given Person:
MATCH (p1:Person {id: 123})-[:MANAGES]->(:Company)-[:SEEKS]->(s:Skills)<-[:SEEKS]-(:Company)<-[:MANAGES]-(p2:Person)
WHERE p1 <> p2
WITH p1, p2, COUNT(DISTINCT s) AS num_skills
WHERE num_skills >= 3
RETURN p1, COLLECT(p2) AS competing_people;
Need help figuring out a good way to store data effectively and efficiently
I'm using Parse (JavaScript SDK). Here's an example of what I'm trying to store:
Predictions of football (soccer) matches. An example of one match would be:
Team A v Team B
EventID = "abc"
Categories = ["League-1","Sunday-League"]
User123 predicts the score will be Team A 2-0 Team B -> so 2-0
User456 predicts the score will be Team A 1-3 Team B -> so 1-3
Each event has information attached to it, like an eventId, several categories, a start time, an end time, a result, and more.
I need to record a score prediction per user for each event (usually 10 events at a time, so a lot of predictions will be coming in).
I need to store these so I can cross-reference the correct result against each user's prediction and award points based on the prediction, the teams in the match, and the categories of the event. Instead of adding to one total, I need the awarded points stored separately per category and per user, so I can then filter based on predictions between set dates and in certain categories, e.g.:
Team A v Team B
EventID = "abc"
Categories = ["League-1","Sunday-League"]
User123 prediction = 2-0
Actual result = 2-0
So now I need to award X points to User123 for Team A, Team B, "League-1", and "Sunday-League", and record them against the event date too.
I would suggest you create a class (table) for games and a class for users, and then an associative class to handle the many-to-many relationship. This is a pretty standard many-to-many setup.
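A minimal sketch with the Parse JavaScript SDK, treating predictions and awarded points as their own classes; every class and field name below is illustrative, not a prescribed schema:
// illustrative Parse classes: Prediction and PointsAward
var Prediction = Parse.Object.extend("Prediction");
var PointsAward = Parse.Object.extend("PointsAward");
var eventDate = new Date("2015-08-30"); // copied from the Event record

// one Prediction row per (user, event)
var prediction = new Prediction();
prediction.set("userId", "User123");
prediction.set("eventId", "abc");
prediction.set("homeGoals", 2);
prediction.set("awayGoals", 0);
prediction.save();

// once the result is in: one PointsAward row per team/category, so points
// stay separated per user and per category and can be filtered later
["Team A", "Team B", "League-1", "Sunday-League"].forEach(function (bucket) {
  var award = new PointsAward();
  award.set("userId", "User123");
  award.set("eventId", "abc");
  award.set("bucket", bucket);
  award.set("points", 3); // X points, per your scoring rules
  award.set("eventDate", eventDate);
  award.save();
});

// later: User123's points in "League-1" between two dates
var query = new Parse.Query(PointsAward);
query.equalTo("userId", "User123");
query.equalTo("bucket", "League-1");
query.greaterThanOrEqualTo("eventDate", new Date("2015-08-01"));
query.lessThanOrEqualTo("eventDate", new Date("2015-08-31"));
query.find().then(function (awards) {
  // sum award.get("points") over the results here
});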
Suppose that in my app I ask users to input some string. A user can input strings multiple times. Whenever any user inputs a string, I log it in the database along with the day. Many strings can be the same, even when inputted by different users. On the home page, I need to give an interface such that any user can query for the top n (say 50) strings in any time period (say the last 45 days, or 10 Jan 2012 to 30 Jan 2012). If it were SQL, I could have written a query like:
select string, count(*)
from userStrings where day >= d1 and day <= d2
group by string
order by count(*) desc
limit n
For each user query, I can't process the records at query time: there can be millions of them. If the time period constraint were not there, I could have done something like this: create a class UserString and maintain a unique object of it for each distinct user string, retrieve the corresponding object for each inputted string, and increment its count. [Even with that approach, I assume the datastore would have to process all UserString objects (~100,000) to return me the top n, so it could itself be a very heavy query.]
I am using JDO. My obvious goal is to minimize the App Engine cost: CPU + data.
Thanks,
You can use the App Engine Task Queue to process the strings offline. If you need real-time answers, you could use memcache to store a temporary record of how many times each string has been entered during the day, and do the heavier processing in the background.
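A sketch of the memcache half in Python for brevity (you're on JDO/Java, but the App Engine memcache API has the same shape in both runtimes); the key layout here is illustrative:
from google.appengine.api import memcache

def record_string(s, day):
    # one counter per (day, string); initial_value seeds the key if it is missing
    memcache.incr("count:%s:%s" % (day.isoformat(), s), initial_value=0)
A Task Queue task can then periodically flush these counters into per-day datastore aggregates, which keeps the top-n query cheap.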
I want to use redis to store a large set of user_ids and, with each of these ids, a "group id" to which that user was previously assigned:
User_ID | Group_ID
1043 | 2
2403 | 1
The number of user_ids is fairly large (~10 million); the number of unique group ids is about 3 to 5.
My purposes for this LuT (look-up table) are routine:
find the group id for a given user; and
return a list of other users (of a specified length) with the same group id as that given user.
There might be an idiomatic way to do this in redis, or at least a most efficient way. If so, I would like to know what it is. Here's a simplified version of my working implementation (using the Python client):
# assume a redis server is already running
# create some model data:
import numpy as NP
NUM_REG_USERS = 100
user_id = NP.random.randint(1000, 9999, NUM_REG_USERS)
cluster_id = NP.random.randint(1, 4, NUM_REG_USERS)
D = zip(cluster_id, user_id)
from redis import Redis
r = Redis()
# populate the redis LuT:
for group_id, uid in D:
    r.sadd(group_id, uid)
# the queries:
# is user_id 1034 in Group 1?
r.sismember("1", 1034)
# return 10 users in the same Group 1 as user_id 1034:
r.smembers("1")[:10] # assume user_id 1034 is in group 1
So I have implemented this LuT using ordinary redis sets; each set is keyed to a Group ID (1, 2, or 3), so there are three sets in total.
Is this the most efficient way to store this data, given the type of queries I want to run against it?
Using sets is a good basic approach, though there are a couple of things in there you may want to change:
Unless you store the group ID for each user somewhere, you will need up to 5 round trips to get the group for a particular user (one membership check per group). The check itself is O(1), but you still need to consider latency. Usually it is fairly easy to avoid this without too much effort: you already have lots of other properties stored for each user, so it is trivial to add one for the group id.
You probably want SRANDMEMBER rather than SMEMBERS - I think SMEMBERS will return the same 10 items from your million item set every time.
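For example, a minimal redis-py sketch of both suggestions; the user hash key and field names are illustrative:
from redis import Redis

r = Redis()

# store the group id on the user's own hash when assigning groups
r.hset("user:1034", "group_id", 1)
r.sadd("1", 1034)

# one round trip to find a user's group
group_id = r.hget("user:1034", "group_id")

# 10 random members of that group, via SRANDMEMBER with a count argument
peers = r.srandmember(group_id, 10)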
I have many products (product_id). Users (user_id) view the products.
I want to query which users viewed whatever product in the last 24 hours. (In other words, I want to keep a list of user_ids attached to that product_id...and when 24 hours is up for a user, that user pops off that list and the record disappears)
How do I store this in Redis? Can someone give me a high-level schema? I'm new to Redis.
For something similar I use a sorted set, with the values being user ids and the scores being the current time. When updating the set, remove older items with ZREMRANGEBYSCORE as well as updating the time score for the current user.
Update with code:
Whenever a new item is added:
ZREMRANGEBYSCORE recentitems 0 [DateTime.Now.AddMinutes(-10).Ticks]
ZADD recentitems [DateTime.Now.Ticks] [item.id]
To get the ids of items added in the last 10 minutes:
ZREVRANGEBYSCORE recentitems [DateTime.Now.Ticks] [DateTime.Now.AddMinutes(-10).Ticks]
Note that you could also use
ZREVRANGE recentitems 0 -1
if you don't mind that the set could include older items if nothing has been added recently.
That gets you a list of item ids. You then use GET/MGET/HGET/HMGET as appropriate to retrieve the actual items for display.
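Adapting that to the question's 24-hour window, a minimal redis-py sketch (key names are illustrative; the ZADD call uses the redis-py 3.x mapping signature):
import time
from redis import Redis

r = Redis()
DAY = 24 * 60 * 60

def record_view(product_id, user_id):
    key = "viewers:%s" % product_id
    now = time.time()
    r.zremrangebyscore(key, 0, now - DAY)  # drop views older than 24 hours
    r.zadd(key, {user_id: now})            # add or refresh this user's view time

def recent_viewers(product_id):
    now = time.time()
    return r.zrevrangebyscore("viewers:%s" % product_id, now, now - DAY)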
If you want redis keys to drop off automatically, then you'll probably want to use a redis key for every product_id-to-user_id pair. So, you would write by doing something like redis.set "product-to-users:product_id:user_id", timestamp followed by redis.expire "product-to-users:product_id:user_id" 86400 (24 hours, in seconds).
To retrieve the current list of viewers you should be able to do redis.keys "product-to-users:product_id:*"
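In redis-py form, a sketch of that approach (key names are illustrative; note that KEYS is O(N) over the whole keyspace, so prefer SCAN on a large production instance):
import time
from redis import Redis

r = Redis()

def record_view(product_id, user_id):
    key = "product-to-users:%s:%s" % (product_id, user_id)
    r.set(key, int(time.time()))
    r.expire(key, 86400)  # the record disappears after 24 hours

def viewers(product_id):
    # returns keys like "product-to-users:<product_id>:<user_id>"
    return r.keys("product-to-users:%s:*" % product_id)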