Great Grandpa AppEngine ancestor query

Scenario 1:
Suppose I have a great-grandparent ancestor (let's name him A), and my key is for "child1". Is there a way to check that child1's great-grandparent is A? (I hope I can do that without needing to loop.)
Or can I check whether child1's key is on the path "A -> B -> C"?
A -> B -> C -> (child1, child2...)
Scenario 2:
Continuing from the above: Great Grandpa A has another line of descendants through "G", and I would like to retrieve "H"'s children:
A-> B -> C -> (children of C)
...-> G -> H -> (children of H)
I would like to retrieve "H"'s children, on the assumption that Great Grandpa A knows the path from A through G to H. Can I do that? (I hope this can be done in a single query, without looping.)
If you have a Go1 example, that would be awesome...

Scenario 1:
If you want to check that child1's great-grandparent is A, you will have to invoke key.getParent() three times (checking for null parents along the way). There is no API that performs this check for you.
More generally, if you want to check that entity X has A as an ancestor N levels up, you will have to call key.getParent() N times.
Note, however, that the overhead is minimal: calling key.getParent() does not result in any calls to the actual datastore, since the full ancestor path is part of the key itself.
You can of course ensure with an ancestor query that C / entity X is a descendant of A (as your scenario 2 implies), thus avoiding the check on the query results; the datastore performs this check for you when the query executes.
https://developers.google.com/appengine/docs/java/datastore/queries
=> search for Ancestor Queries
https://developers.google.com/appengine/docs/java/javadoc/com/google/appengine/api/datastore/Query#setAncestor(com.google.appengine.api.datastore.Key)
childCQuery.setAncestor(entityA.getKey());
Scenario 2:
Grandpa 'A' can't know the path to 'H' since children can be added and removed at any point in time. There is no limitation on what entities can be descendants of 'A'. So only with a datastore query can you determine the descendants of 'A'.
But as stated in scenario 1 you can specify 'A' as the ancestor in your query so that you filter any results where 'A' is not the ancestor.
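Since you asked for Go: below is a rough, untested sketch of both ideas against the classic Go 1 appengine/datastore package (the "Child" kind and the helper names are placeholders I made up, not part of any existing API):
// Rough, untested sketch using the classic Go 1 appengine/datastore API.
package sample

import (
    "appengine"
    "appengine/datastore"
)

// hasAncestor walks up the key path. Like getParent() in Java, Key.Parent()
// is derived from the key itself and never calls the actual datastore.
func hasAncestor(k, ancestor *datastore.Key) bool {
    for p := k.Parent(); p != nil; p = p.Parent() {
        if p.Equal(ancestor) {
            return true
        }
    }
    return false
}

// childrenOf is the Go counterpart of setAncestor: an ancestor query that
// only returns entities whose key path passes through the given ancestor.
func childrenOf(c appengine.Context, ancestor *datastore.Key, dst interface{}) ([]*datastore.Key, error) {
    q := datastore.NewQuery("Child").Ancestor(ancestor)
    return q.GetAll(c, dst)
}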
Hope this answers your questions.
Note: My responses to your question refer to the java API. I am not yet familiar with the Go API.
Thanks.
Amir

Related

Neo4j query without specifying relation type

I have a graph representation of data with multiple nodes and random relationships between them, which I represent in Neo4j. Using the query
MATCH (a)-[r]->(b)-[r2]->(c)
I get output wherever A is related to B and B is related to C.
But I need to query nodes A and B connected by any relationship (call it RelationA), plus all C following B via that same relationship type as between A and B, like the following image.
If A is connected with more than one B, then the expected graph will look like this:
You can check the equality of the relationships in the WHERE clause.
MATCH path=(a)-[r]->(b)-[r2]->(c)
WHERE type(r)=type(r2)
RETURN path
P.S.: If you are viewing this result in the Neo4j Browser, note that it "connects the result nodes", meaning it draws extra relationships between returned nodes even when those relationships don't match the criteria. Don't forget to uncheck the "Connect result nodes" option in the settings.

Methods to avoid a cross-product APOC query (using a hashmap?)

I currently have a Neo4j database with a simple data structure comprising about 400 million (:Node {id:String, refs:List[String]}) nodes, each with two properties: an id, which is a string, and refs, which is a list of strings.
I need to search all of these nodes to identify relationships between them. A directed relationship exists wherever one node's id appears in another node's refs list. A simple query that accomplishes what I want (but is too slow):
MATCH (a:Node), (b:Node)
WHERE ID(a) < ID(b) AND a.id IN b.refs
CREATE (b)-[:CITES]->(a)
I can use apoc.periodic.iterate, but the query is still much too slow:
CALL apoc.periodic.iterate(
"MATCH (a:Node), (b:Node)
WHERE ID(a) < ID(b)
AND a.id IN b.refs RETURN a, b",
"CREATE (b)-[:CITES]->(a)",
{batchSize:10000, parallel:false, iterateList:true})
Any suggestions as to how I can build this database and its relationships efficiently? I have vague thoughts about creating a hash table as I first add the nodes to the database, but I'm not sure how to implement this, especially in Neo4j.
Thank you.
If you first create an index on :Node(id), like this:
CREATE INDEX ON :Node(id);
then this query should be able to take advantage of the index to quickly find each a node:
MATCH (b:Node)
UNWIND b.refs AS ref
MATCH (a:Node)
WHERE a.id = ref
CREATE (b)-[:CITES]->(a);
Currently, the Cypher execution planner does not support using the index when directly comparing the values of 2 properties. In the above query, the WHERE clause is comparing a property with a variable, so the index can be used.
The ID(a) < ID(b) test was omitted, since your question did not state that ordering the native node IDs in such a way was required.
[UPDATE 1]
If you want to run the creation step in parallel, try this usage of the APOC procedure apoc.periodic.iterate:
CALL apoc.periodic.iterate(
"MATCH (b:Node) UNWIND b.refs AS ref RETURN b, ref",
"MATCH (a:Node {id: ref}) CREATE (b)-[:CITES]->(a)",
{batchSize:10000, parallel:true})
The first Cypher statement passed to the procedure just returns each b/ref pair. The second statement (which is run in parallel) uses the index to find the a node and creates the relationship. This division of effort puts the more expensive processing in the statement running in a parallel thread. The iterateList: true option is omitted, since we (probably) want the second statement to run in parallel for each b/ref pair.
[UPDATE 2]
You can encounter deadlock errors if parallel executions try to add relationships to the same nodes (since each parallel transaction will attempt to write-lock every new relationship's end nodes). To avoid deadlocks involving just the b nodes, you can do something like this to ensure that a b node is not processed in parallel:
CALL apoc.periodic.iterate(
"MATCH (b:Node) RETURN b",
"UNWIND b.refs AS ref MATCH (a:Node {id: ref}) CREATE (b)-[:CITES]->(a)",
{batchSize:10000, parallel:true})
However, this approach is still vulnerable to deadlocks if parallel executions can try to write-lock the same a nodes (or if any b nodes can also be used as a nodes). But at least hopefully this addendum will help you to understand the problem.
[UPDATE 3]
Since these deadlocks are race conditions that depend on multiple parallel executions trying to lock the same nodes at the same time, you might be able to work around this issue by retrying the "inner statement" whenever it fails. And you could also try making the batch size smaller, to reduce the probability that multiple parallel retries will overlap in time. Something like this:
CALL apoc.periodic.iterate(
"MATCH (b:Node) RETURN b",
"UNWIND b.refs AS ref MATCH (a:Node {id: ref}) CREATE (b)-[:CITES]->(a)",
{batchSize: 1000, parallel: true, retries: 100})

What Erlang data structure to use for ordered set with the possibility to do lookups?

I am working on a problem where I need to remember the order of events I receive, but I also need to look up an event based on its id. How can I do this efficiently in Erlang, preferably without a third-party library? Note that I have many potentially ephemeral actors, each with their own events (I already considered mnesia, but it requires atoms for table names, and the tables would stick around if my actor died).
-record(event, {id, timestamp, type, data}).
Based on the details included in the discussion in the comments on Michael's answer, a very simple, workable approach would be to keep two fields in your process state: a log that stores the order of events, separate from the K-V store of the events themselves.
Consider:
%%% Some type definitions so we know exactly what we're dealing with.
-type id() :: term().
-type type() :: atom().
-type data() :: term().
-type ts() :: calendar:datetime().
-type event() :: {id(), ts(), type(), data()}.
-type events() :: dict:dict(id(), {type(), data(), ts()}).
% State record for the process.
% Should include whatever else the process deals with.
-record(s,
        {log :: [id()],
         events :: events()}).
%%% Interface functions we will expose over this module.
-spec lookup(pid(), id()) -> {ok, event()} | error.
lookup(Pid, ID) ->
    gen_server:call(Pid, {lookup, ID}).

-spec latest(pid()) -> {ok, event()} | error.
latest(Pid) ->
    gen_server:call(Pid, get_latest).

-spec notify(pid(), event()) -> ok.
notify(Pid, Event) ->
    gen_server:cast(Pid, {new, Event}).
%%% gen_server handlers
handle_call({lookup, ID}, _From, State = #s{events = Events}) ->
    Result = find(ID, Events),
    {reply, Result, State};
handle_call(get_latest, _From, State = #s{log = [Last | _], events = Events}) ->
    Result = find(Last, Events),
    {reply, Result, State};
% ... and so on...
handle_cast({new, Event}, State) ->
    {ok, NewState} = catalog(Event, State),
    {noreply, NewState};
% ...
%%% Implementation functions
find(ID, Events) ->
    case dict:find(ID, Events) of
        {ok, {Type, Data, Timestamp}} -> {ok, {ID, Timestamp, Type, Data}};
        error -> error
    end.

catalog({ID, Timestamp, Type, Data}, State = #s{log = Log, events = Events}) ->
    NewEvents = dict:store(ID, {Type, Data, Timestamp}, Events),
    NewLog = [ID | Log],
    {ok, State#s{log = NewLog, events = NewEvents}}.
This is a completely straightforward implementation that hides the details of the data structure behind the process's interface. Why did I pick a dict? Just because (it's easy). Without knowing your requirements better, I really have no reason to pick a dict over a map or a gb_tree, etc. If you have relatively small data (hundreds or thousands of things to store), the performance usually isn't noticeably different among these structures.
The important thing is that you clearly identify what messages this process should respond to and then force yourself to stick to it elsewhere in your project code by creating an interface of exposed functions over this module. Behind that you can swap out the dict for something else. If you really only need the latest event ID and won't ever need to pull the Nth event from the sequence log then you could ditch the log and just keep the last event's ID in the record instead of a list.
So get something very simple like this working first, then determine if it actually suits your need. If it doesn't then tweak it. If this works for now, just run with it -- don't obsess over performance or storage (until you are really forced to).
If you find later on that you have a performance problem switch out the dict and list for something else -- maybe gb_tree or orddict or ETS or whatever. The point is to get something working right now so you have a base from which to evaluate the functionality and run benchmarks if necessary. (The vast majority of the time, though, I find that whatever I start out with as a specced prototype turns out to be very close to whatever the final solution will be.)
Your question makes it clear you want to look up by ID, but it's not entirely clear whether you also want to look up or traverse your data by time, and what operations you might want to perform in that regard; you say you want to "remember the order of events", but storing your records with an index on the ID field can accomplish that.
If you only have to lookup by ID then any of the usual suspects will work as a suitable storage engines, so ets, gb_trees and dict for example would be good. Don't use mnesia unless you need the transactions and safety and all those good features; mnesia is good, but there is a high performance price to be paid for all that stuff, and it's not clear you need it, from your question anyway.
If you do want to lookup or traverse your data by or based on time, then consider an ets table of ordered_set. If that can do what you need then it's probably a good choice. In that case you would employ two tables, one set to provide a hash lookup by ID and another ordered_set to lookup or traverse by timestamp.
If you have two different lookup methods like this, there's no getting around the fact that you need two indexes. You could store the whole record in both, or, assuming your IDs are unique, you could store just the ID as the data in the ordered_set. Which you choose is really a trade-off between storage utilisation and read and write performance.

How to query for multiple vertices and counts of their relationships in Gremlin/Tinkerpop 3?

I am using Gremlin/Tinkerpop 3 to query a graph stored in TitanDB.
The graph contains user vertices with properties, for example, "description", and edges denoting relationships between users.
I want to use Gremlin to obtain 1) users by properties and 2) the number of relationships (in this case of any kind) to some other user (e.g., with id = 123). To realize this, I make use of the match operation in Gremlin 3 like so:
g.V().match('user',__.as('user').has('description',new P(CONTAINS,'developer')),
__.as('user').out().hasId(123).values('name').groupCount('a').cap('a').as('relationships'))
.select()
This query works fine, unless there are multiple user vertices returned, for example, because multiple users have the word "developer" in their description. In this case, the count in relationships is the sum of all relationships between all returned users and the user with id 123, and not, as desired, the individual count for every returned user.
Am I doing something wrong or is this maybe an error?
PS: This question is related to one I posted some time ago about a similar query in Tinkerpop 2, where I had another issue: How to select optional graph structures with Gremlin?
Here's the sample data I used:
graph = TinkerGraph.open()
g = graph.traversal()
v123=graph.addVertex(id,123,"description","developer","name","bob")
v124=graph.addVertex(id,124,"description","developer","name","bill")
v125=graph.addVertex(id,125,"description","developer","name","brandy")
v126=graph.addVertex(id,126,"description","developer","name","beatrice")
v124.addEdge('follows',v125)
v124.addEdge('follows',v123)
v124.addEdge('likes',v126)
v125.addEdge('follows',v123)
v125.addEdge('likes',v123)
v126.addEdge('follows',v123)
v126.addEdge('follows',v124)
My first thought was: "Do we really need the match step?" Secondarily, of course, I wanted to write this in TP3 fashion and not use a lambda/closure. I tried all manner of things in the first iteration, and the closest I got was stuff like this from Daniel Kuppitz:
gremlin> g.V().as('user').local(out().hasId(123).values('name')
.groupCount()).as('relationships').select()
==>[relationships:[:]]
==>[relationships:[bob:1]]
==>[relationships:[bob:2]]
==>[relationships:[bob:1]]
So here we used the local step to restrict the traversal inside local() to the current element. This works, but we lost the "user" tag in the select. Why? groupCount is a ReducingBarrierStep, and paths are lost after those steps.
Well, let's go back to match. I figured I could try to make the match step traverse using local:
gremlin> g.V().match('user',__.as('user').has('description','developer'),
gremlin> __.as('user').local(out().hasId(123).values('name').groupCount()).as('relationships')).select()
==>[relationships:[:], user:v[123]]
==>[relationships:[bob:1], user:v[124]]
==>[relationships:[bob:2], user:v[125]]
==>[relationships:[bob:1], user:v[126]]
Ok, success: that's what we wanted, no lambdas and local counts. But it still left me feeling like: "Do we really need the match step?" That's when Mr. Kuppitz closed in on the final answer, which makes copious use of the by step:
gremlin> g.V().has('description','developer').as("user","relationships").select().by()
.by(out().hasId(123).values("name").groupCount())
==>[user:v[123], relationships:[:]]
==>[user:v[124], relationships:[bob:1]]
==>[user:v[125], relationships:[bob:2]]
==>[user:v[126], relationships:[bob:1]]
As you can see, by can be chained (on some steps). The first by groups by vertex and the second by processes the grouped elements with a "local" groupCount.

What is an appropriate data structure and database schema to store logic rules?

Preface: I don't have experience with rules engines, building rules, modeling rules, implementing data structures for rules, or whatnot. Therefore, I don't know what I'm doing or if what I attempted below is way off base.
I'm trying to figure out how to store and process the following hypothetical scenario. To simplify the problem, say I have a type of game where a user purchases an object, out of potentially thousands of possible objects, and objects must be purchased in a specified sequence and only in certain groups. For example, say I'm the user and I want to purchase object F. Before I can purchase F, I must have previously purchased object A OR (B AND C). I cannot buy F and A at the same time, nor F and B,C; they must follow the sequence the rule specifies: A first, then F later; or B,C first, then F later. I'm not concerned right now with the span of time between purchases, or any other characteristics of the user, just that the sequence is correct.
What is the best way to store this information for potentially thousands of objects, such that I can read in the rules for the object being purchased and check them against the user's previous purchase history?
I've attempted this, but I'm stuck at trying to implement the groupings such as A OR (B AND C). I would like to store the rules in a database where I have these tables:
Objects
(ID(int),Description(char))
ObjectPurchRules
(ObjectID(int), RequirementObjectID(int), OperatorRule(char), Sequence(int))
But obviously, as you process through the results without the grouping, you get the wrong answer. I would like to avoid excessive string parsing if possible :). One object could have an unknown number of required prior purchases. SQL or pseudocode snippets for processing the rules would be appreciated. :)
It seems like your problem breaks down to testing whether a particular condition has been satisfied.
You will have compound conditions.
So given a table of items:
ID_Item  Description
--------------------
1        A
2        B
3        C
4        F
and given a table of possible actions:
ID_Action  VerbID  ItemID  ConditionID
--------------------------------------
1          BUY     4       1
We construct a table of conditions:
ID_Condition  VerbA  ObjectA_ID  Boolean  VerbB            ObjectB_ID
----------------------------------------------------------------------
1             OWNS   1           OR       MEETS_CONDITION  2
2             OWNS   2           AND      OWNS             3
So OWNS means the id is a key to the Items table, and MEETS_CONDITION means that the id is a key to the Conditions table.
This isn't meant to restrict you. You can add other tables with quests or whatever, and add extra verbs to tell you where to look. Or, just put quests into your Items table when you complete them, and then interpret a completed quest as owning a particular badge. Then you can handle both items and quests with the same code.
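To make the processing concrete, here is a minimal, hypothetical sketch of how application code might evaluate these condition rows (Go is used purely for illustration; the hard-coded map stands in for lookups against the Conditions table, and all names are invented):
package main

// verb mirrors the VerbA/VerbB column of the Conditions table.
type verb string

const (
    owns           verb = "OWNS"            // id refers to the Items table
    meetsCondition verb = "MEETS_CONDITION" // id refers to the Conditions table
)

// condition mirrors one row of the Conditions table.
type condition struct {
    verbA verb
    idA   int
    op    string // "AND" or "OR"
    verbB verb
    idB   int
}

// Stand-ins for the two example rows; in practice these would be
// fetched from the database by ID_Condition.
var conditions = map[int]condition{
    1: {owns, 1, "OR", meetsCondition, 2}, // owns A, OR condition 2 holds
    2: {owns, 2, "AND", owns, 3},          // owns B AND owns C
}

// satisfied reports whether the user's owned items meet a condition.
func satisfied(condID int, owned map[int]bool) bool {
    c := conditions[condID]
    a := eval(c.verbA, c.idA, owned)
    b := eval(c.verbB, c.idB, owned)
    if c.op == "AND" {
        return a && b
    }
    return a || b
}

// eval resolves one side of a condition row: OWNS checks the user's items,
// while MEETS_CONDITION recurses into another row. The recursion is what
// lets groupings like A OR (B AND C) nest without any string parsing.
func eval(v verb, id int, owned map[int]bool) bool {
    if v == meetsCondition {
        return satisfied(id, owned)
    }
    return owned[id]
}

func main() {
    println(satisfied(1, map[int]bool{1: true}))          // true: owns A
    println(satisfied(1, map[int]bool{2: true, 3: true})) // true: owns B and C
    println(satisfied(1, map[int]bool{2: true}))          // false: owns only B
}
A BUY action then just checks its ConditionID (here, satisfied(1, owned)) before the purchase is allowed, and new grouping shapes need only new rows, not schema changes.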
This is a very complex problem that I'm not qualified to answer, but I've seen lots of references to. The fundamental problem is that for games, quests and items and "stats" for various objects can have non-relational dependencies. This thread may help you a lot.
You might want to pick up a couple books on the topic, and look into using LUA as a rules processor.
Personally, I would do this in code, not in SQL. Each item should be its own class implementing an interface (e.g. IItem). IItem would have a method called OkToPurchase that determines whether it is OK to purchase that item. To do that, it would use one or more of a collection of rules (e.g. HasPreviouslyPurchased(x), CurrentlyOwns(x), etc.) that you can build.
The nice thing is that it is easy to extend this approach with new rules without breaking all the existing logic.
Here's some pseudocode:
bool OkToPurchase()
{
    return HasPreviouslyPurchased('x') && !CurrentlyOwns('y');
}

bool HasPreviouslyPurchased( item )
{
    return purchases.contains( item );
}

bool CurrentlyOwns( item )
{
    return user.Items.contains( item );
}
