Gremlin query involving edge property quantity checks

I have this schema for the graph database.
I need to find users who have picked up calls on all of the 'calls' edges they are connected to, for program X.
I am quite new to gremlin and my vocabulary is limited.
I want to find a way to check whether the number of incoming edges to a user with picked == yes equals the total number of incoming edges to that user for a given program X (that is, whether every 'calls' edge from program X to the user has picked == yes).
I am unable to count twice in the same query and compare those values to decide whether a user satisfies the criteria. My approach is as follows:
g.V().hasLabel("User").filter(inE().count().is(neq(0))).filter(inE().outV().in().has("program","name","X")).filter(inE().has("connected","yes").has("picked","no").count().is(eq(0))).values("name")
Essentially, I am removing all cases that I know do not lead to my expected result (null nodes in User, and users who connected but did not pick up). This does not generate the right output when I test it against the results of manual examination.
Thanks!

From the description alone I would say that this is your query:
g.V().hasLabel('User').
  filter(project('a','b').
           by(inE().has('picked','yes').count()).
           by(out().has('program','X').count()).
         where('a', eq('b')))
If that doesn't return the expected result, then please provide a snippet that creates a small sample graph along with the expected result.


Sort by constant number

I need to randomize Solr (6.6.2) search results, but the order needs to be consistent given a specific seed. This is for a paginated search that returns a limited result set from a much larger one, so I must do the ordering at the query level and not at the application level once the data has been fetched.
Initially I tried this:
https://localhost:8984/solr/some_index/select?q=*:*&sort=random_999+ASC
Where 999 is a constant that is fed in when constructing the query prior to sending it to Solr. The constant value changes for each new search.
This solution works. However, when I run the query a few times, or run it on different Solr instances, the ordering is different.
After doing some reading, I found that random_* generates a number via:
fieldName.hashCode() + context.docBase + (int)top.getVersion()
This means that when the random number is generated, it takes the index version into account. This becomes problematic when using a distributed architecture or when indexes are updated, as is well explained here.
There are various recommended solutions online, but I am trying to avoid writing a custom random override. Is there some trick where I can feed a function or equation into the sort param?
For example:
min(999,random_999)
Though this always results in the same order, even when either of the values changes.
This question is somewhat similar to this other question, but not quite.
I searched for answers on SO containing solr.RandomSortField, and while they point out what the issue is, none of them have a solution. It seems the best way would be to override the solr.RandomSortField logic, but it's not clear how.
Prior Research
https://lucene.472066.n3.nabble.com/Random-sorting-and-result-consistency-across-successive-calls-based-on-seed-td4170508.html
Solr: Random sort order after index version change
https://mail-archives.apache.org/mod_mbox/lucene-dev/201811.mbox/%3CJIRA.13196983.1541639245000.300557.1541639520069#Atlassian.JIRA%3E
Solr - Return random results (Sort by Random)
https://realize.be/blog/random-results-apache-solr-and-drupal
https://lucene.472066.n3.nabble.com/Sorting-with-customized-function-of-score-td3987281.html
Even after implementing a custom random sort field, the results still differed across instances of Solr.
I ended up adding a new field, populated at index time, which is a 32-bit hash of an ID field that already existed in the document.
I then built a "stateless" linear congruential generator to produce a set of acceptably random numbers to use for sorting:
?sort=mod(product(hash_int_id,{seedConstant},982451653), 104395301) asc
Since this function technically passes a new seed for each row, and because it does not store state (like rand.Next() would), this solution is admittedly inferior and it is not a true PRNG; however, it does seem to get me most of the way there. Note that you will have to tune your values depending on the size of your data set and the size of the values in your hash_int_id equivalent field.
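For reference, here is a small Python sketch of the value that sort expression computes per document; it is only an illustration of the arithmetic, not how Solr executes it internally, and hash_int_id and seed_constant stand in for the indexed hash field and the per-search constant (Python's arbitrary-precision integers also ignore any overflow Solr's function queries may apply):
def lcg_sort_value(hash_int_id, seed_constant,
                   multiplier=982451653, modulus=104395301):
    # Mirrors mod(product(hash_int_id, seedConstant, 982451653), 104395301):
    # one multiplicative LCG step, re-seeded per document instead of keeping state.
    return (hash_int_id * seed_constant * multiplier) % modulus

# The same seed always yields the same ordering; a new seed reshuffles it.
doc_hashes = [1083068130, 459378431, 2048559787]
order_a = sorted(doc_hashes, key=lambda h: lcg_sort_value(h, 999))
order_b = sorted(doc_hashes, key=lambda h: lcg_sort_value(h, 1000))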

Predict probabilities of a recommender system?

I have a dataset with a sparse utility matrix (user-product) with binary input: 1 if user 𝑖 bought product 𝑗, and 0 if they haven't.
However, the values have a different meaning in the test set: 0 means we don't know whether the user bought the product, and 1 means we are sure the user bought it.
For each user and each product, I need the probability that user 𝑖 bought product 𝑗 in the test set. For this I would like to use matrix factorization techniques like FunkSVD, NMF, or SVD++, but I'm quite confused:
These techniques would only give me the label (1 or 0) on the test set, but I need the probability of getting a 1, not the label itself.
How can I approach this problem? Or should I treat it as a classification problem and use common classification techniques?
One approach that might help is based on Recurrent-Knowledge-Graph-Embedding.
Try converting the user-item interactions into a knowledge graph and mining the top-n paths in the graph between user_i and item_j. Then build an RNN model, as described in the paper and its code, to get probabilities.
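If you would rather stay with the matrix factorization techniques the question mentions, one common option is a logistic variant: squash the predicted score through a sigmoid and train against the binary matrix, so the model outputs a probability rather than a hard label. Below is a minimal NumPy sketch under that assumption; all names (R, n_factors, the hyperparameters) are illustrative and not from the question:
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_mf(R, n_factors=16, lr=0.01, reg=0.05, epochs=200, seed=0):
    # R: dense 0/1 user-item matrix from the training set.
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = 0.1 * rng.standard_normal((n_users, n_factors))
    Q = 0.1 * rng.standard_normal((n_items, n_factors))
    for _ in range(epochs):
        pred = sigmoid(P @ Q.T)   # predicted purchase probabilities
        err = R - pred            # gradient of the log-likelihood w.r.t. the logits
        P += lr * (err @ Q - reg * P)
        Q += lr * (err.T @ P - reg * Q)
    return P, Q

# Probability that user i bought product j:
# P, Q = logistic_mf(R_train)
# prob_ij = sigmoid(P[i] @ Q[j])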

Is there a way to optimize this Gremlin query?

I have a graph database which looks like this (simplified) diagram:
Each unique ID has many properties, which are represented as edges from the ID to unique values of that property. Basically that means that if two ID nodes have the same email, then their has_email edges will both point to the same node. In the diagram, the two shown IDs share both a first name and a last name.
I'm having difficulty writing an efficient Gremlin query to find matching IDs, for a given set of "matching rules". A matching rule will consist of a set of properties which must all be the same for IDs to be considered to have come from the same person. The query I'm currently using to match people based on their first name, last name, and email looks like:
g.V().match(
      __.as("id").hasId("some_id"),
      __.as("id")
        .out("has_firstName")
        .in("has_firstName")
        .as("firstName"),
      __.as("id")
        .out("has_lastName")
        .in("has_lastName")
        .as("lastName"),
      __.as("id")
        .out("has_email")
        .in("has_email")
        .as("email"),
      where("firstName", eq("lastName")),
      where("firstName", eq("email")),
      where("firstName", neq("id"))
    ).select("firstName")
The query returns a list of IDs which match the input some_id.
When this query tries to match an ID with a particularly common first name, it becomes very, very slow. I suspect that the match step is the problem, but I've had no luck finding an alternative so far.
The performance of this query will depend on the edge degrees in your graph. Since many people share the same first name, you will most likely have a huge number of edges going into a specific firstName vertex.
You can make assumptions like: there are fewer people with the same last name than people with the same first name, and of course there should be even fewer people who share the same email address. With that knowledge you can start by traversing to the vertices with the lowest degree first and then filter from there:
g.V().hasId("some_id").as("id").
out("has_email").in("has_email").where(neq("id")).
filter(out("has_lastName").where(__.in("has_lastName").as("id"))).
filter(out("has_firstName").where(__.in("has_firstName").as("id")))
With that, the performance will mostly depend on the vertex with the lowest edge degree.

How can I find the closest document using Google App Engine Search API?

I have approximately 400,000 documents in a GAE Search index. All documents have a location GeoPoint property and are spread over the entire globe. Some documents might be over 4000km away from any other document, others might be bunched within meters of each other.
I would like to find the closest document to a specific set of coordinates, but find that the following code gives incorrect results:
from google.appengine.api import search
# coords are in the form of a tuple e.g. (50.123, 1.123)
search.Document(
    doc_id='meaningful-unique-id',
    fields=[search.GeoField(name='location',
                            value=search.GeoPoint(coords[0], coords[1]))])

# find_document: radius is in metres
def find_document(coords, radius=1000000):
    sort_expr = search.SortExpression(
        expression='distance(location, geopoint(%.3f, %.3f))' % coords,
        direction=search.SortExpression.ASCENDING,
        default_value=0)
    search_query = search.Query(
        query_string='distance(location, geopoint(%.3f, %.3f)) < %d'
                     % (coords[0], coords[1], radius),
        options=search.QueryOptions(
            limit=1,
            ids_only=True,
            sort_options=search.SortOptions(expressions=[sort_expr])))
    index = search.Index(name='document-index')
    return index.search(search_query)
With this code I will get results that are consistent but incorrect. For example, a search for the nearest document to London indicated the closest one was in Scotland. I have verified that there are thousands of closer documents.
I narrowed the problem down to the radius parameter being too large. I get correct results if the radius is down to around 12km (radius=12000). There are generally no more than 1000 documents in a 12 km radius. (Probably associated with search.SortOptions(limit=1000).)
The problem is that if I am in a sparse area of the globe where there aren't any documents for thousands of miles, my search function will not return anything with radius=12000 (12km). I want it to return the closest document to me wherever I am. How can I accomplish this consistently with one call to the Search API?
I believe the issue is the following.
Your query will select up to 10K documents, then those are sorted according to your distance sort expression and returned. (That is, the sort is in fact not over all 400k documents.)
So I suspect that some of the geographically closer points are not included in this 10k selection.
That's why things work better when you narrow your search radius, as you have fewer total points in that radius.
Essentially, you want to get your query 'hits' down to 10k, in a manner that makes sense for what you are querying on.
You can address this in at least a couple of ways, which you can combine:
Add a ranking, so that the most 'important' docs (by some criteria that makes sense in your domain) are returned in rank order, then these will be sorted by distance.
Filter on one or more document field(s) (e.g., 'business category', if your docs contain information about businesses) to reduce the number of candidate docs.
(I don't believe this 10k threshold is currently in the Search API documentation; I've filed a ticket to get it added).
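As a rough sketch of those two suggestions, using the same Search API calls as the question (the 'category' field, its value, and importance_score are illustrative assumptions, not something your documents necessarily have):
doc = search.Document(
    doc_id='meaningful-unique-id',
    rank=importance_score,  # hypothetical per-document importance, so important docs land in the selected set
    fields=[
        search.TextField(name='category', value='restaurant'),
        search.GeoField(name='location',
                        value=search.GeoPoint(coords[0], coords[1]))])

# Narrow the candidate set with a field filter before sorting by distance.
query_string = ('category:restaurant AND '
                'distance(location, geopoint(%.3f, %.3f)) < %d'
                % (coords[0], coords[1], radius))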
I have the exact same problem, and I don't think it's possible. The problem happens, as you yourself have figured out, when there are more possible results than returned results. The Google algorithm simply stops once it has loaded up to the limit and then sorts those results.
I have seen the same clustering you describe, and it's part of the Search API.
One hack would be to subdivide your search into sub-sectors, make multiple simultaneous calls, and then merge and order the results.
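A simpler variant of that multiple-call idea (a sketch, under the assumption that a few sequential queries are acceptable) is to retry the question's find_document with increasing radii, so dense areas stay within the selection limit while sparse areas eventually match something:
def find_nearest(coords, radii=(12000, 100000, 1000000, 20000000)):
    for radius in radii:
        results = find_document(coords, radius=radius)
        if results.results:        # non-empty: the closest hit within this radius
            return results.results[0]
    return None                    # nothing found anywhere on the globe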
Wild idea: why not keep/record the distance from 3 fixed points and then calculate the position from that?

Need help solving a problem using graphs in C

I'm coding a C project for an algorithms class and I really need some help!
Here's the problem:
I have a set of names like this one, N = (James, John, Robert, Mary, Patricia, Linda, Barbara), which is stored in an RB tree.
Starting from this set of names, a series of couples like the following is formed:
(James,Mary)
(James,Patricia)
(John,Linda)
(John,Barbara)
(Robert,Linda)
(Robert,Barbara)
Now I need to merge the elements so that I form n subgroups, with the constraint that each pairing is respected and each group has the smallest possible cardinality.
The couples in the example form two groups:
(James, Mary, Patricia) and (John, Robert, Barbara, Linda).
The task is to return the maximum number of groups formed and the number of males and females in the group with the maximum cardinality.
In this case it would be 2 2 2
I was thinking about building a graph where every name is represented by a vertex and two vertices are joined by an edge only if they are paired.
I could then use an algorithm (like Kruskal's) to find the minimum spanning tree. Is that right?
The problem is that the graph would not be completely connected.
I also need to find a way to map the names to the edges of the graph and vice versa.
Can the edges be indexed by a string?
Any help is really appreciated :)
Thanks in advance!
You don't need to find the minimum spanning tree. That is really for finding the "best" edges in a graph that will still keep the graph connected. In other words, you don't care how John and Robert are connected, just that they are.
You say that the problem is that the graph would not be completely connected, but I think that is actually the point. If you represent graph edges by using the couples as you suggest, then the vertices that are connected form the groups that you are looking for.
In your example, James is connected to Mary and also James is connected to Patricia. No other person connects to any of those three vertices (if they did, you would have another couple that included them), which is why they form a single group of (James, Mary, Patricia). Similarly all of John, Robert, Barbara, and Linda are connected to each other.
Your task is really to form the graph and find all of the connected subgraphs that are disjoint from each other.
While not a full algorithm, I hope that helps get you started.
I think you can solve this easily with a DFS and connected components, because every person (node) has a relation to another one (edge). So you have an outer loop that runs an explore function on every unvisited node, assigning the same group number to every node reached by that call, e.g.:
void dfs() {
    int group = 0;
    for (int i = 0; i < num_nodes; i++) {
        if (nodes[i].visited == false) {
            /* every node reached by explore() gets this same group number */
            explore(nodes[i], group);
            group++;
        }
    }
}
Then you simply have to sort the nodes by group and you are done. If you want to track the order of traversal, you can use a pre-order number indicating which node was explored first, second, etc.
(Sorry for my bad English!)
The sets of names and pairs of names already form a graph. A data structure with nodes and pointers to other nodes is just another representation, one that you don't necessarily need. Disjoint sets are easier to implement IMO, and their purpose in life is exactly to keep track of sameness as pairs of things are joined together.
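To make the disjoint-set idea concrete, here is a minimal union-find sketch (in Python for brevity; a C version follows the same shape, typically with arrays instead of a dict). The names and couples are the ones from the example above:
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving keeps the trees shallow
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[ra] = rb                 # merge the two groups

couples = [("James", "Mary"), ("James", "Patricia"), ("John", "Linda"),
           ("John", "Barbara"), ("Robert", "Linda"), ("Robert", "Barbara")]
for a, b in couples:
    union(a, b)

groups = {}
for name in parent:
    groups.setdefault(find(name), []).append(name)
# groups now holds (James, Mary, Patricia) and (John, Robert, Barbara, Linda).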
