I have a graph database which looks like this (simplified) diagram:
Each unique ID has many properties, which are represented as edges from the ID to unique values of that property. Basically that means that if two ID nodes have the same email, then their has_email edges will both point to the same node. In the diagram, the two shown IDs share both a first name and a last name.
I'm having difficulty writing an efficient Gremlin query to find matching IDs, for a given set of "matching rules". A matching rule will consist of a set of properties which must all be the same for IDs to be considered to have come from the same person. The query I'm currently using to match people based on their first name, last name, and email looks like:
g.V().match(
    __.as("id").hasId("some_id"),
    __.as("id").out("has_firstName").in("has_firstName").as("firstName"),
    __.as("id").out("has_lastName").in("has_lastName").as("lastName"),
    __.as("id").out("has_email").in("has_email").as("email"),
    where("firstName", eq("lastName")),
    where("firstName", eq("email")),
    where("firstName", neq("id"))
).select("firstName")
The query returns a list of IDs which match the input some_id.
When this query tries to match an ID with a particularly common first name, it becomes very, very slow. I suspect that the match step is the problem, but I've had no luck finding an alternative so far.
The performance of this query will depend on the edge degrees in your graph. Since many people share the same first name, you will most likely have a huge number of edges going into a specific firstName vertex.
You can make assumptions like: there are fewer people with the same last name than people with the same first name. And of course, there should be even fewer people who share the same email address. With that knowledge you can start by traversing to the vertices with the lowest degree first and then filter from there:
g.V().hasId("some_id").as("id").
out("has_email").in("has_email").where(neq("id")).
filter(out("has_lastName").where(__.in("has_lastName").as("id"))).
filter(out("has_firstName").where(__.in("has_firstName").as("id")))
With that, the performance will mostly depend on the vertex with the lowest edge degree.
Situation
I have a Rails application using Postgresql.
Texts are added to the application (ranging in size from a few words to, say, 5,000 words).
The texts get parsed, first automatically and then with some manual revision, to associate each word/position in the text with specific information (verb/noun/etc., base word (running ==> run), definition_id, grammar tags).
Given a lemma (base word, ex. "run"), or a part of speech (verb/noun), or grammar tags, or a definition_id (or a combination), I need to be able to find all the other text positions in the database that contain the same information.
Conflict
I can't do a full-text search because, for example, if I click "left" in "I left Nashville", I don't want "turn left at the light" to appear. I just want "leave" as a verb, as well as other forms of "leave" as a verb.
Also, I might want just "left" with a specific definition_id (e.g. "left" used as "the political party", not as "the opposite of right").
In short, I am looking for some advice on which of the following 3 routes I should take (or if there's a 4th or 5th route that I haven't considered).
Solutions
There are three options I can think of:
Option 1: TextPosition
A TextPosition table to store each word position, with columns for each of the above attributes.
This would make searching very easy, but there would be MANY records (one for each position). But maybe that's not a problem? Is storing this number of records a bad idea for some specific reason?
Option 2: JSON on the Text object
A JSON column on the Text object, to store all word positions in a large array of hashes, or a hash of hashes.
This would add zero records, but, a) Building a query to search all texts with certain information would probably be difficult, b) That query would probably be slow, and c) It could take up more storage space than a separate table (TextPosition).
Option 3: TWO JSON columns: one on the Text object, and one on each dictionary object
A JSON in each text object, as in option 2, but only to render the text (not to search), containing all the information about each position in that same text.
Another JSON in each "dictionary object" (definition, base word, grammar concept, grammar tag), just for searching (not to render the text). This column would track the matches of this particular object across ALL texts. It would be an array of hashes, where each hash would be {text_id: x, text_index: y}.
With this option, the search would be "easier", but it would still not be ideal: to find all the text positions that contain a certain attribute, I would have to do the following:
Find the record for that attribute
Extract the text_ids / indexes from the record
Find the texts with those IDs
Extract the matching position from each text, using the index that comes with each text_id within the JSON.
If it were a combination of attributes I was looking for, I would have to do those four steps for each attribute, and then find the intersection between the sets of matches for each attribute (to end up with only the positions that contain all of them).
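A minimal sketch of that lookup flow, assuming the per-attribute JSON has already been loaded into Python dicts (all names and data here are hypothetical):

# Hypothetical option-3 lookup: each attribute record carries a list of
# {text_id, text_index} matches; a combined search intersects those sets.
matches = {
    "lemma:leave": [{"text_id": 1, "text_index": 4}, {"text_id": 2, "text_index": 9}],
    "pos:verb":    [{"text_id": 1, "text_index": 4}, {"text_id": 3, "text_index": 2}],
}

def positions(attribute):
    # Steps 1-2: find the record and extract (text_id, index) pairs
    return {(m["text_id"], m["text_index"]) for m in matches[attribute]}

def search(*attributes):
    # Repeat the lookup per attribute, then intersect the match sets
    return set.intersection(*(positions(a) for a in attributes))

print(search("lemma:leave", "pos:verb"))  # {(1, 4)}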
Furthermore, when updating a position (for example, if a person indicates that an attribute is wrongly associated and that it should actually be another), I would have to update both JSONs.
Also, will storing 2 JSON columns actually bring any tangible benefit over a TextPosition table? It would probably take up MORE storage space than using a TextPosition table, and for what benefit?
Conclusion
In sum, I am looking for some advice on which of those 3 routes I should follow. I hope the answer is "option 1", but if so, I would love to know what drawbacks/obstacles could come up later when there are a ton of entries.
Thanks, Michael King
Text parsing and searching make my brain hurt. But anytime I have something with the complexity of what you are talking about, ElasticSearch is my tool of choice. You can do some amazingly complex indexing and searching with it.
So my answer is 4) ElasticSearch.
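For what it's worth, here is a hedged sketch of that route in Python, using the official Elasticsearch client; the index name and fields are invented for illustration. The idea is one document per word position, queried with exact term filters rather than full-text matching:

# Hypothetical sketch: one Elasticsearch document per word position,
# filtered on exact attributes instead of full-text search.
from elasticsearch import Elasticsearch  # official client, v8-style API

es = Elasticsearch("http://localhost:9200")

# One document per word position in a text
es.index(index="positions", document={
    "text_id": 42, "index": 7, "surface": "left",
    "lemma": "leave", "pos": "verb", "definition_id": 1001,
})

# Find every position with the same lemma + part of speech
hits = es.search(index="positions", query={
    "bool": {"filter": [
        {"term": {"lemma": "leave"}},
        {"term": {"pos": "verb"}},
    ]}
})
print(hits["hits"]["total"])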
How do you decide the verb-direction of a relation?
E.g. I have a Country falling under a Sub Region which in turn is under a Region.
Which one would be better, and are there any rules of thumb for deciding the direction?
(Region)-[HAS]->(Sub Region)-[HAS]->(Country)
or
(Region)<-[BELONGS_TO]-(Sub Region)<-[BELONGS_TO]-(Country)
Regards
San
I agree with @InverFalcon that directionality is mostly a subjective decision. However, there may be (at least) one situation in which you might want to use a specific direction, especially if that will make an important use case faster.
This is related to the fact that often if you can make a Cypher pattern less specific (without affecting the output), then neo4j would have to do less work and your query would be faster.
For example, suppose your entire data model consists of 2 node labels and 2 relationship types, like below. (I use my own data model, since I don't know what your use cases are.)
(:Person)-[:ACTED_IN]->(:Movie)
(:Person)-[:DIRECTED]->(:Movie)
In order to find the movies that an actor acted in, your query would have to look something like the following. (Notice that we have to specify the ACTED_IN type, because an outgoing relationship could also be of type DIRECTED. This means that neo4j has to explicitly test every outgoing relationship for its type):
MATCH (:Person {id: 123})-[:ACTED_IN]->(m:Movie)
RETURN m;
However, if your data model replaced the DIRECTED type with a DIRECTED_BY type that had opposite directionality, then it would look like this instead:
(:Person)-[:ACTED_IN]->(:Movie)
(:Person)<-[:DIRECTED_BY]-(:Movie)
With that tweak, your query could be simpler and faster (since neo4j would not have to test relationship types):
MATCH (:Person {id: 123})-->(m:Movie)
RETURN m;
And to be complete, notice that in the above 2 MATCH patterns we could actually remove the :Movie label, since in both data models the ACTED_IN end node will always have the Movie label.
How can I do a search based on combinations of around 50 parameters, like filters?
These filters can be price, color, size, brand, etc.
So we can get different pages based on these params.
So one link can have price, brand, size; another one size, brand, color; and so on.
My question is: what would be the best practice for querying the database based on these params?
One idea I have is to encode them into a sequence of 1s and 0s like 101101101 and search by that.
So I have more than 2 million possible combinations, and I want to reduce the query time.
I've heard about B-trees but I don't know how to use them. I have given my table columns the proper indexes, but from this point I don't know which direction I should go in, or what my query is going to look like.
I think that it is a good idea to "encrypt" the params, but don't do it like "10100010", because then you'll have to store these values as strings.
Rather, encode it as a base-10 number. That means 100101 = 1*32 + 0*16 + 0*8 + 1*4 + 0*2 + 1*1 = 37.
Of course, with 50 flags you'd get a number too big to store as an int (which is 32 bits), so try to logically group the parameters and use 2-3 fields for them.
The problem with this approach is querying the data: you would have to write a function that extracts a flag from the number, to be able to query the data by only one parameter and not all of them.
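A minimal sketch of that flag extraction in Python; the flag names and bit positions are assumptions for illustration (in SQL the same test is a bitwise AND, e.g. WHERE flags & 4 = 4):

# Hypothetical bitmask encoding: each filter gets a fixed bit position.
FLAG_BITS = {"price": 0, "color": 1, "size": 2, "brand": 3}  # assumed mapping

def encode(flags):
    """Pack a set of filter names into one integer."""
    value = 0
    for name in flags:
        value |= 1 << FLAG_BITS[name]
    return value

def has_flag(value, name):
    """Extract a single flag without decoding the whole number."""
    return value & (1 << FLAG_BITS[name]) != 0

mask = encode({"price", "size"})  # 0b101 = 5
print(has_flag(mask, "size"))     # True
print(has_flag(mask, "color"))    # False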
I'm coding a C project for an algorithms class and I really need some help!
Here's the problem:
I have a set of names like this one, N = (James, John, Robert, Mary, Patricia, Linda, Barbara), which are stored in an RB tree.
Starting from this set of names, a series of couples like these is formed:
(James,Mary)
(James,Patricia)
(John,Linda)
(John,Barbara)
(Robert,Linda)
(Robert,Barbara)
Now I need to merge the elements so that I can form n subgroups, with the constraint that each pairing is respected and each group has the smallest possible cardinality.
With the couples in the example, they will form two groups:
(James, Mary, Patricia) and (John, Robert, Barbara, Linda).
The task is to return the maximum number of groups formed and the number of males and females in the group with the maximum cardinality.
In this case it would be 2 2 2.
I was thinking about building a graph where every name is represented by a vertex and two vertices are joined by an edge only if they are paired.
I can then use an algorithm (like Kruskal's) to find the minimum spanning tree. Is that right?
The problem is that the graph would not be completely connected.
I also need to find a way to map the names to the edges of the graph and vice versa.
Can the edges be indexed by a string?
Any help is really appreciated :)
Thanks in advance!
You don't need to find the minimum spanning tree. That is really for finding the "best" edges in a graph that will still keep the graph connected. In other words, you don't care how John and Robert are connected, just that they are.
You say that the problem is that the graph would not be completely connected, but I think that is actually the point. If you represent graph edges by using the couples as you suggest, then the vertices that are connected form the groups that you are looking for.
In your example, James is connected to Mary and also James is connected to Patricia. No other person connects to any of those three vertices (if they did, you would have another couple that included them), which is why they form a single group of (James, Mary, Patricia). Similarly all of John, Robert, Barbara, and Linda are connected to each other.
Your task is really to form the graph and find all of the connected subgraphs that are disjoint from each other.
While not a full algorithm, I hope that helps get you started.
I think you can easily solve this with a DFS and connected components, because every person (node) has a relation with another one (edge). So you have an outer loop that runs an explore function for every node which is unvisited, and the explore function assigns the same group number to every node it reaches. For example:
void dfs() {
    int group = 0;
    for (int i = 0; i < num_nodes; i++) {
        if (nodes[i].visited == false) {
            /* explore marks every node reachable from nodes[i] as
               visited and tags it with the current group number */
            explore(nodes[i], group);
            group++;
        }
    }
}
Then you simply have to sort the nodes by group and you are done. If you want to track the path, you can use a preorder number which indicates which node was explored first, second, etc.
(Sorry for my bad English!)
The sets of names and pairs of names already form a graph. A data structure with nodes and pointers to other nodes is just another representation, one that you don't necessarily need. Disjoint sets are easier to implement IMO, and their purpose in life is exactly to keep track of sameness as pairs of things are joined together.
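If you go the disjoint-set route, here is a minimal union-find sketch (in Python rather than C, just to show the idea); names key the structure directly, so no name-to-index mapping is needed, and the couples are the ones from the question:

# Minimal union-find keyed by name: union each couple, then nodes with
# the same root form one group.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:              # walk up to the root
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

couples = [("James", "Mary"), ("James", "Patricia"), ("John", "Linda"),
           ("John", "Barbara"), ("Robert", "Linda"), ("Robert", "Barbara")]
for a, b in couples:
    union(a, b)

groups = {}
for name in parent:
    groups.setdefault(find(name), []).append(name)
print(list(groups.values()))
# [['James', 'Mary', 'Patricia'], ['John', 'Linda', 'Barbara', 'Robert']]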
I have a real question.
I have a database with the schema as follows:
item
    id
    description
    other junk

tag
    id
    name

item2tag
    item_id
    tag_id
    count
Basically, each item is tagged as up to 10 things, with varying counts. There are 50,000 items and 50,000 tags, and about 500,000 entries in items2tag. I'd like to find, given one item, the "most similar" item.
By "most similar" I mean the item that has the most similar combination of tags... if something is "cool" twice as much as it is "funny," I want to find all other things that are almost "cool" twice as much as they are "funny." Of course, this should apply to 10 tags, not just 2.
Any ideas?
Well, you can use linear algebra to give an n-dimensional vector to each item, and then compute the distance between items to find the closest ones, but that's pretty complex even with small data sets.
Which is why Google came up with Map Reduce. This will probably be your best bet, but even then it's non-trivial.
-Adam
Given your representation of the item-tag relationship as vectors, what you have is an instance of nearest-neighbor search. You may find pointers in the field of Collaborative Filtering.
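To illustrate the vector idea from both answers, here is a small hedged sketch in Python that treats each item as a tag -> count vector (the shape the item2tag table provides) and ranks other items by cosine similarity; the data values are invented:

# Represent each item as a tag -> count vector and rank the others by
# cosine similarity; items with the same tag ratios score highest.
import math

vectors = {
    1: {"cool": 2, "funny": 1},
    2: {"cool": 4, "funny": 2},  # same 2:1 ratio as item 1
    3: {"funny": 5, "sad": 1},
}

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar(item_id):
    target = vectors[item_id]
    return max((cosine(target, v), i)
               for i, v in vectors.items() if i != item_id)

print(most_similar(1))  # (1.0, 2): item 2 has an identical tag ratio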