Optimizing in-count searches with neo4j - database

I am building a social-network-like graph in which we periodically want to find all the nodes with an incoming relationship count greater than 2. E.g.
A->B
C->B
C->A
D->A
E->A
Should return node A.
I am new to neo4j and am still a bit confused about whether such a query will search through all the available rows. Is there an index I can use to optimize these queries? My search query looks like this:
EXPLAIN MATCH ()-[r:F]->(b:Person)
WITH b, count(r) AS count
WHERE count > 1
RETURN collect(b.username)
Thank you

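Neo4j has a special relationship count store that holds the number of relationships a node has. It lets you get the count of relationships without having to expand them.
You can access relationship count store values using the size() function. Note that the pattern below counts incoming relationships of every type; use size((b)<-[:F]-()) to count only F relationships, as in the question.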
MATCH (b:Person)
WITH b, size((b)<--()) as size
WHERE size > 2
RETURN collect(b.username)
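To sanity-check what either query should return on the example graph, the in-degree count can be reproduced outside the database. This is a plain-Python sketch, not Neo4j code; the edge list is the example from the question and the function name is made up for illustration.

```python
from collections import Counter

# Edges from the question's example, written as (source, target).
EDGES = [("A", "B"), ("C", "B"), ("C", "A"), ("D", "A"), ("E", "A")]


def nodes_with_in_degree_above(edges, threshold):
    """Return the nodes whose number of incoming edges exceeds threshold."""
    in_degree = Counter(target for _source, target in edges)
    return sorted(node for node, degree in in_degree.items() if degree > threshold)


print(nodes_with_in_degree_above(EDGES, 2))  # ['A']
print(nodes_with_in_degree_above(EDGES, 1))  # ['A', 'B']
```

With a threshold of 1, as in the question's query, B is returned as well, so the threshold has to be 2 to match the stated goal.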

Related

How to calculate index pages scanned and relation pages scanned

I have some homework problems that require that I calculate the total cost of queries. Some of the assumptions made for the queries are:
All costs are in terms of disk pages: the cost of scanning the index, and the cost of reading matching tuples from the relation if needed.
If reading tuples from a relation, use the worst-case scenario: all tuples are in different disk pages.
Assume all indices have 3 levels (root, internal node, and leaf level), and the root of every index is in memory, so scanning the root incurs no cost. Most searches will have one node from the internal level and a number of leaf nodes.
Postgresql returns the size of an index in terms of the number of pages using the query below. For the B-tree index, assume the numbers provided are the number of leaf nodes.
My professor also gives us a sample query to get the cost for, but I don't know exactly how he got to these numbers. The database in question looks like:
create table series (
seriesid int primary key
, title varchar(400)
, yearreleased int
, contentrating varchar(40) -- age group the movie is intended for
, imdbrating float -- imdb rating
, rottentomatoes int -- rotten tomatoes rating
, description text
, seasons int -- how many seasons are available
, date_added date -- date series is added to Netflix
) ;
and the query to calculate the cost for looks like so:
qa: select director from seriesdirectors where seriesid <= 100;
The index referenced is seriesdirectors_pkey which has 2 relation pages, and the query returns 5 tuples in total.
Using these numbers my professor somehow got to the conclusion that the number of index pages scanned is 3 and the number of relation pages scanned is 0.
The reasoning being: "because there are 5 tuples matching this query (1 internal node, and 2 leaf nodes in the worst case), and the index contains all the necessary information (so zero tuples need to be read)." I have been trying to understand the concept of index pages and how exactly my professor got these numbers so I can work on the rest of the problems. If any more context is needed, let me know and I will provide it.
Edit: The definition of the index page in this case is a way to store tuples from a database with a unique identifier (i.e. a standard index for a database, more here: https://en.wikipedia.org/wiki/Database_index).
The big question I have is how can I calculate the total number of index pages queried and the number of relation pages queried. The example query above (qa) was using the index seriesdirectors_pkey which has 2 relation pages and returns a result of 5 rows. I don't know how we get from that to knowing that 3 index pages were queried and 0 relation pages were queried.
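One way to read the professor's reasoning, as a rough sketch: under the stated assumptions the root is free, one internal page is read, and in the worst case every leaf page of the index is scanned; relation pages are only read when the index does not already contain the selected columns. The function and parameter names below are invented for illustration, and treating the 2 pages reported for seriesdirectors_pkey as its leaf count is an assumption based on the problem statement.

```python
def query_cost(leaf_pages_scanned, matching_tuples, index_covers_query):
    """Page-access cost under the homework's assumptions.

    - The root is in memory, so it costs nothing.
    - One internal-level page is read to reach the leaf level.
    - Each matching tuple sits on its own relation page (worst case),
      unless the index alone can answer the query.
    """
    index_pages = 1 + leaf_pages_scanned
    relation_pages = 0 if index_covers_query else matching_tuples
    return index_pages, relation_pages


# qa: 5 matching tuples, at most 2 leaf pages, answered from the index alone.
print(query_cost(leaf_pages_scanned=2, matching_tuples=5, index_covers_query=True))  # (3, 0)
```

If the query selected a column that is not stored in the index, the same call with index_covers_query=False would give (3, 5): the same index pages plus one relation page per matching tuple.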

Alternative 1 of index data entry

One of the three alternatives of what to store as a data entry in an index is a data entry k* which is an actual record with search key value k. My question is, if you're going to store the actual record in the index, then what's the point of creating the index?
This is an extreme case, because it does not really correspond to
having data entries separated by data records (the hashed file is an
example of this case).
(M. Lenzerini, R. Rosati, Database Management Systems: Access file manager and query evaluation, "La Sapienza" University of Rome, 2016)
Alternative 1 is often used for direct indexing, for instance in B-trees and hash indexes (see also Oracle, Building Domain Indexes)
Let's do a concrete example.
We have a relation R(a,b,c) and a clustered B+-tree using alternative 2 on search key a. Since the tree is clustered, the relation R must be sorted by a.
Now, let's suppose that a common query for the relation is:
SELECT *
FROM R
WHERE b > 25
so we want to build another index to efficiently support this kind of query.
Case 1: clustered tree with alt. 2
We know that clustered B+-trees with alternative 2 are efficient for range queries, because they only need to search for the first matching entry (say the one with b=25), then do 1 page access to the relation page that entry points to, and finally scan that page (and possibly some following pages) for as long as the records fall within the given range.
To sum up:
Search for the first matching entry in the tree. Cost: log_ƒ(ℓ)
Use the pointer found to go to a specific page. Cost: 1
Scan that page and any following relevant pages. Cost: num. of relevant pages
The final cost (expressed in terms of page accesses) is
log_ƒ(ℓ) + 1 + #relevant-pages
where ƒ is the fan-out and ℓ the number of leaves.
Unfortunately, in our case a tree on search key b must be unclustered, because the relation is already sorted by a.
Case 2: unclustered tree with alt. 2 (or 3)
We also know that B+-trees are not so efficient for range queries when they are unclustered. In fact, with alternative 2 or 3 the tree stores only pointers to the records, so for each result that falls in the range we have to do a page access to a potentially different page (because the relation is ordered differently from the index).
To sum up:
Search for the first matching entry in the tree. Cost: log_ƒ(ℓ)
Keep scanning the leaf (and maybe other leaves) and do a separate page access for each tuple that falls in the range. Cost: num. of other relevant leaves + num. of relevant tuples
The final cost (expressed in terms of page accesses) is
log_ƒ(ℓ) + #other-relevant-leaves + #relevant-tuples
Notice that the number of tuples is much larger than the number of pages!
Case 3: unclustered tree with alt. 1
Using alternative 1, we have all the data in the tree, so to execute the query we:
Search for the first matching entry in the tree. Cost: log_ƒ(ℓ)
Keep scanning the leaf (and maybe other leaves). Cost: num. of other relevant leaves
The final cost (expressed in terms of page accesses) is
log_ƒ(ℓ) + #other-relevant-leaves
which is even smaller than (or at most equal to) the cost of case 1, and, unlike a second clustered index, this index is allowed.
I hope I was clear enough.
N.B. The cost is expressed in terms of page accesses because I/O operations to and from secondary storage are the most expensive ones in terms of time (we ignore the cost of scanning a whole page in main memory and consider only the cost of accessing it).
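The three cost formulas can be compared side by side with a small calculation. This is only a sketch: the fan-out, leaf count, and page/tuple counts are made-up numbers, the function names are invented, and the same number of leaves is used for all three trees purely to keep the comparison simple.

```python
def tree_descent_cost(fan_out, leaves):
    """Smallest h with fan_out**h >= leaves, i.e. the ceiling of log_f(l)."""
    levels, reach = 0, 1
    while reach < leaves:
        reach *= fan_out
        levels += 1
    return levels


def cost_clustered_alt2(fan_out, leaves, relevant_pages):
    # Case 1: descend the tree, follow one pointer, scan the relevant data pages.
    return tree_descent_cost(fan_out, leaves) + 1 + relevant_pages


def cost_unclustered_alt2(fan_out, leaves, other_relevant_leaves, relevant_tuples):
    # Case 2: descend, scan further leaves, one page access per matching tuple.
    return tree_descent_cost(fan_out, leaves) + other_relevant_leaves + relevant_tuples


def cost_unclustered_alt1(fan_out, leaves, other_relevant_leaves):
    # Case 3: the records live in the leaves, so scanning leaves is enough.
    return tree_descent_cost(fan_out, leaves) + other_relevant_leaves


# Hypothetical workload: fan-out 100, 10,000 leaves, 5,000 matching tuples.
print(cost_clustered_alt2(100, 10_000, relevant_pages=50))  # 53
print(cost_unclustered_alt2(100, 10_000, other_relevant_leaves=20, relevant_tuples=5_000))  # 5022
print(cost_unclustered_alt1(100, 10_000, other_relevant_leaves=49))  # 51
```

With these numbers the per-tuple page accesses dominate case 2, which is the point of the comparison.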

What is a good algorithm to check whether or not a number exist in multiple sets without searching them all?

Scenario
Let's say you have multiple databases in 3 zones: Zone A, B, and C. Each zone is in a different geographical location. At the same time, you have an application that routes username and password lookups based on the geographical location of the user. For example, user A will be directed to the database in Zone A, user B to Zone B, and so on.
Now, let's say user A moves to zone B. The application queries zone B and won't find anything. Querying zone A and zone C might take some time because the zones are far away, and it would have to query all the databases in all zones.
My Question
How can you verify if a string/number exists in multiple sets?
or
How can you verify a row exist in the database before even sending a query?
My Algorithm
This is not perfect, but will give you some idea what I'm trying to do
If we have the database with the following 3 users
foo
bar
foobar
We take the hash of each of the 3 users, and look for the next prime number if the hash is not prime.
sum = hash(foo).nextPrime() * hash(bar).nextPrime() * hash(foobar).nextPrime()
That sum (strictly speaking a product) is shared between all zones. If I want to check foo, I can take the hash of foo, look for the next prime, and then compute gcd(hash(foo).nextPrime(), sum). If it is not equal to one, foo exists in some database. If it is equal to one, foo doesn't exist at all. If I want to add a new username, I can simply do sum = sum * hash(newUserName).nextPrime().
The sum will eventually grow to the point where it is faster to just query all databases.
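The scheme described above can be sketched as follows. The choice of CRC32 as the hash is an arbitrary stand-in (the question does not name one), and all function names are made up for illustration. Two usernames that map to the same prime would produce a false positive.

```python
import math
import zlib


def is_prime(n):
    """Trial division; adequate for 32-bit values in a sketch."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for d in range(3, math.isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True


def key_prime(name):
    """Hash the name, then move up to the next prime if the hash is not prime."""
    candidate = zlib.crc32(name.encode("utf-8"))  # assumed hash function
    while not is_prime(candidate):
        candidate += 1
    return candidate


def build_product(names):
    """The shared value the question calls `sum`: the product of all key primes."""
    product = 1
    for name in names:
        product *= key_prime(name)
    return product


def might_exist(product, name):
    """gcd != 1 means the name's prime divides the shared product."""
    return math.gcd(key_prime(name), product) != 1


shared = build_product(["foo", "bar", "foobar"])
print(might_exist(shared, "foo"))  # True
print(shared.bit_length())  # grows by roughly 32 bits per username
```

The last line illustrates the concern raised in the question: the shared value gets larger with every user added.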
Do you know a similar algorithm to solve this problem?
One data structure suitable for this application is a Bloom filter.
A Bloom filter is a probabilistic data structure which allows you to test whether an item is already in a set. If the test returns false then the item is definitely not in the set (0% false negatives); if it returns true then the item may be in the set, but is not guaranteed to be (false positives are possible).
The filter is implemented as a bit array with m bits and a set of k hash functions. To add an item to the array (e.g. a username), hash the item using each of the hash functions and then take each hash value modulo m to compute the indexes to set in the bit array. To test if an item is in the set, compute all the hashes and indexes and check that all of the corresponding bits in the array are set to 1. If any of them is zero then the item is definitely not in the set. If all are 1 then the item is most likely in the set, but there is a small chance it is not; the percentage of false positives can be reduced by using a larger m.
To implement the k hash functions, it is possible to just use the same hashing algorithm (e.g. CRC32, MD5 etc) but append different salts to the username string for each before passing to the hash function, effectively creating "new" hash functions for each salt. For a given m and n (number of elements being added), the optimal number of hash functions is k = (m / n) ln 2
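A minimal sketch of the structure just described, using salted MD5 as the k hash functions. The class and method names, the salt format, and the sizes m = 64 and n = 3 are assumptions chosen for illustration; a real deployment would pick m from the expected number of users and the acceptable false-positive rate.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, m, k):
        self.m = m  # number of bits
        self.k = k  # number of hash functions
        self.bits = bytearray(m)  # one byte per bit keeps the sketch simple

    def _indexes(self, item):
        # Same hash algorithm, different salt per "hash function".
        for salt in range(self.k):
            digest = hashlib.md5(f"{item}:{salt}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item):
        for i in self._indexes(item):
            self.bits[i] = 1

    def might_contain(self, item):
        # False means definitely absent; True means probably present.
        return all(self.bits[i] for i in self._indexes(item))


def optimal_k(m, n):
    """k = (m / n) ln 2, rounded to a whole number of hash functions."""
    return max(1, round((m / n) * math.log(2)))


bloom = BloomFilter(m=64, k=optimal_k(64, 3))
for username in ["foo", "bar", "foobar"]:
    bloom.add(username)

print(bloom.might_contain("foo"))     # True: added items are always found
print(bloom.might_contain("nobody"))  # False unless this is a false positive
```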
For your application, the Bloom filter bit array would be shared across all zones A B C etc. When a user attempts to login, you could first check in the database of the local zone, and if present then log them in as normal. If not present in the local database, check the Bloom filter and if the result is negative then you know for sure that they don't exist in another zone. If positive, then you still need to check the databases in the other zones (because of the possibility of a false positive), but presumably this isn't a big issue because you would be contacting the other zones in any case to transfer the user's data in the case that it was a true positive.
One down-side of using a Bloom filter is that it is difficult (though not impossible) to remove elements from the set once they have been added.

Need help solving a problem using graphs in C

I'm coding a C project for an algorithms class and I really need some help!
Here's the problem:
I have a set of names like this one, N = (James, John, Robert, Mary, Patricia, Linda, Barbara), which are stored in an RB tree.
Starting from this set of names, a series of couples like these is formed:
(James,Mary)
(James,Patricia)
(John,Linda)
(John,Barbara)
(Robert,Linda)
(Robert,Barbara)
Now I need to merge the elements so that they form n subgroups, with the constraint that each pairing is respected and each group has the smallest possible cardinality.
With the couples in the example they will form two groups
(James,Mary,Patricia) and (John,Robert,Barbara,Linda).
The task is to return the maximum number of groups formed and the number of males and females in the group with the maximum cardinality.
In this case it would be 2 2 2
I was thinking about building a graph where every name is represented by a vertex and two vertices share an edge only if they are paired.
I can then use an algorithm (like Kruskal's) to find the minimum spanning tree. Is that right?
The problem is that the graph would not be completely connected.
I also need to find a way to map the names to the edges of the graph and vice versa.
Can the edges be indexed by a string?
Any help is really appreciated :)
Thanks in advance!
You don't need to find the minimum spanning tree. That is really for finding the "best" edges in a graph that will still keep the graph connected. In other words, you don't care how John and Robert are connected, just that they are.
You say that the problem is that the graph would not be completely connected, but I think that is actually the point. If you represent graph edges by using the couples as you suggest, then the vertices that are connected form the groups that you are looking for.
In your example, James is connected to Mary and also James is connected to Patricia. No other person connects to any of those three vertices (if they did, you would have another couple that included them), which is why they form a single group of (James, Mary, Patricia). Similarly all of John, Robert, Barbara, and Linda are connected to each other.
Your task is really to form the graph and find all of the connected subgraphs that are disjoint from each other.
While not a full algorithm, I hope that helps get you started.
I think that you can easily solve this with a DFS and connected components, because every person (node) has a relation (edge) with another one. You have an outer loop that runs an explore function for every node that is still unvisited, and explore assigns the same group number to every node it reaches.
e.g.
    void dfs(void) {
        int group = 0;
        for (int i = 0; i < num_nodes; i++) {
            if (!nodes[i].visited) {
                /* explore() marks every node reachable from nodes[i] as visited and tags it with this group */
                explore(&nodes[i], group);
                group++;
            }
        }
    }
Then you simply have to sort the nodes by group and you are done. If you want to track the path, you can use a pre-number which indicates which node was explored first, second, etc.
(sorry for my bad english)!
The sets of names and pairs of names already form a graph. A data structure with nodes and pointers to other nodes is just another representation, one that you don't necessarily need. Disjoint sets are easier to implement IMO, and their purpose in life is exactly to keep track of sameness as pairs of things are joined together.
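The disjoint-set idea can be sketched on the question's example like this. It is written in Python for brevity rather than C, the helper names are made up, and splitting the names into MALES and FEMALES is an assumption, since the question does not say how gender is recorded.

```python
from collections import defaultdict

MALES = {"James", "John", "Robert"}  # assumed gender split
FEMALES = {"Mary", "Patricia", "Linda", "Barbara"}
PAIRS = [
    ("James", "Mary"), ("James", "Patricia"),
    ("John", "Linda"), ("John", "Barbara"),
    ("Robert", "Linda"), ("Robert", "Barbara"),
]


def find(parent, x):
    """Follow parent links to the set representative, halving the path on the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the sets containing a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def group_names(names, pairs):
    parent = {name: name for name in names}
    for a, b in pairs:
        union(parent, a, b)
    groups = defaultdict(set)
    for name in names:
        groups[find(parent, name)].add(name)
    return list(groups.values())


groups = group_names(MALES | FEMALES, PAIRS)
largest = max(groups, key=len)
print(len(groups), len(largest & MALES), len(largest & FEMALES))  # 2 2 2
```

In C the same structure is usually an integer parent array indexed by a number assigned to each name, for example the order in which names are inserted into the RB tree.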

Querying for N random records on Appengine datastore

I'm trying to write a GQL query that returns N random records of a specific kind. My current implementation works but requires N calls to the datastore. I'd like to make it 1 call to the datastore if possible.
I currently assign a random number to every entity that I put into the datastore. When I query for a random record I generate another random number and query for records > rand ORDER BY asc LIMIT 1.
This works, however, it only returns 1 record so I need to do N queries. Any ideas on how to make this one query? Thanks.
"Under the hood" a single search query call can only return a set of consecutive rows from some index. This is why some GQL queries, including any use of !=, expand to multiple datastore calls.
N independent uniform random selections are not (in general) consecutive in any index.
QED.
You could probably use memcache to store the entities, and reduce the cost of grabbing N of them. Or if you don't mind the "random" selections being close together in the index, select a randomly-chosen block of (say) 100 in one query, then pick N at random from those. Since you have a field that's already randomised, it won't be immediately obvious to an outsider that the N items are related. At least, not until they look at a lot of samples and notice that items A and Z never appear in the same group, because they're more than 100 apart in the randomised index. And if performance permits, you can re-randomise your entities from time to time.
What kind of tradeoffs are you looking for? If you are willing to put up with a small performance hit on inserting these entities, you can create a solution to get N of them very quickly.
Here's what you need to do:
When you insert your Entities, specify the key. You want to give keys to your entities in order, starting with 1 and going up from there. (This will require some effort, as app engine doesn't have autoincrement() so you'll need to keep track of the last id you used in some other entity, let's call it an IdGenerator)
Now when you need N random entities, generate N random numbers between 1 and whatever the last id you generated was (your IdGenerator will know this). You can then do a batch get by key using the N keys, which will only require one trip to the datastore, and will be faster than a query as well, since key gets are generally faster than queries, AFAIK.
This method does require dealing with a few annoying details:
Your IdGenerator might become a bottleneck if you are inserting lots of these items on the fly (more than a few a second), which would require some kind of sharded IdGenerator implementation. If all this data is preloaded, or is not high volume, you have it easy.
You might find that some Id doesn't actually have an entity associated with it anymore, because you deleted it or because a put() failed somewhere. If this happened you'd have to grab another random entity. (If you wanted to get fancy and reduce the odds of this you could make this Id available to the IdGenerator to reuse to "fill in the holes")
So the question comes down to how fast you need these N items vs how often you will be adding and deleting them, and whether a little extra complexity is worth that performance boost.
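The selection logic, including the retry for ids that no longer have an entity, could look roughly like this. A plain dict stands in for the datastore and `store.get` stands in for the batch get by key; the function name and parameters are hypothetical, and each pass of the outer loop corresponds to one batch get.

```python
import random


def pick_random_entities(store, last_id, n, rng=random):
    """Pick up to n entities with ids in 1..last_id, skipping holes.

    store: dict mapping id -> entity (stand-in for the datastore).
    """
    found = {}
    tried = set()
    while len(found) < n and len(tried) < last_id:
        batch_size = min(n - len(found), last_id - len(tried))
        batch = set()
        while len(batch) < batch_size:
            candidate = rng.randint(1, last_id)
            if candidate not in tried:
                tried.add(candidate)
                batch.add(candidate)
        # One batch get per round; deleted ids simply come back empty.
        for entity_id in batch:
            entity = store.get(entity_id)
            if entity is not None:
                found[entity_id] = entity
    return found


# Ids 3 and 7 were deleted, leaving holes in the 1..10 range.
store = {i: f"entity-{i}" for i in range(1, 11) if i not in (3, 7)}
picked = pick_random_entities(store, last_id=10, n=4)
print(sorted(picked))
```

When there are few holes, a single round is usually enough, which is the one-trip behaviour the answer is aiming for.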
Looks like the only method is by storing the random integer value in each entity's special property and querying on that. This can be done quite automatically if you just add an automatically initialized property.
Unfortunately this will require processing of all entities once if your datastore is already filled in.
It's weird, I know.
I agree with Steve's answer: there is no way to retrieve N random rows in one query.
However, even the method of retrieving one single entity does not usually return every entity with equal probability. The probability of returning a given entity depends on the gap between its randomly assigned number and the neighbouring assigned number. E.g. if random numbers 1, 2, and 10 have been assigned (and none of the numbers 3-9), then with the query from the question (first record > rand, ascending) the entity with "10" is returned 8 times more often than the one with "1"; with the mirrored query (first record <= rand, descending) it is "2" that is returned 8 times more often.
I have fixed this in a slightly more expensive way. If someone is interested, I am happy to share.
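The skew can be checked by enumerating every possible draw. This sketch assumes integer random values and the three assigned numbers 1, 2 and 10 from the example; the function names are invented, and the two functions model the ascending query from the question and its descending mirror image.

```python
from collections import Counter

ASSIGNED = [1, 2, 10]  # random numbers stored on the three entities


def first_above(rand):
    """Models: WHERE number > rand ORDER BY number ASC LIMIT 1."""
    return min(v for v in ASSIGNED if v > rand)


def last_at_or_below(rand):
    """Models: WHERE number <= rand ORDER BY number DESC LIMIT 1."""
    return max(v for v in ASSIGNED if v <= rand)


ascending = Counter(first_above(r) for r in range(0, 10))
descending = Counter(last_at_or_below(r) for r in range(1, 11))

print(dict(sorted(ascending.items())))   # {1: 1, 2: 1, 10: 8}
print(dict(sorted(descending.items())))  # {1: 1, 2: 8, 10: 1}
```

Either way, the entity next to the empty 3-9 range absorbs the draws that fall into that gap.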
I just had the same problem. I decided not to assign IDs to my already existing entries in datastore and did this, as I already had the totalcount from a sharded counter.
This selects "count" entries from "totalcount" entries, sorted by key.
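    import logging
    import random

    from google.appengine.ext import db

    # MyEntries, totalcount and count come from the surrounding application.
    # Select `count` distinct positions out of `totalcount` entries (ordered by key).
    numberlist = random.sample(range(0, totalcount), count)
    numberlist.sort()
    pagesize = 1000

    # Group the positions into buckets of `pagesize`, stored as offsets within each bucket.
    buckets = [[] for _ in range(max(numberlist) // pagesize + 1)]
    for k in numberlist:
        thisb = k // pagesize
        buckets[thisb].append(k - thisb * pagesize)
    logging.debug("Numbers: %s. Buckets %s", numberlist, buckets)

    # Page through the results one bucket at a time.
    result = []
    wq = db.Query(MyEntries, keys_only=True).order("__key__")  # was used before assignment
    for b, offsets in enumerate(buckets):
        # Fetch the selected keys of this bucket, one offset at a time.
        result += [wq.fetch(limit=1, offset=e)[0] for e in offsets]
        if b < len(buckets) - 1:  # not the last bucket
            # Restart the query just after the last key of this page.
            lastkey = wq.fetch(1, pagesize - 1)[0]
            wq = db.Query(MyEntries, keys_only=True).order("__key__").filter("__key__ >", lastkey)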
Beware that this seems somewhat complex to me, and I'm still not convinced that I don't have off-by-one or off-by-x errors.
And beware that if count is close to totalcount this can be very expensive.
And beware that on millions of rows it might not be possible to do within appengine time boundaries.
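If I understand correctly, you need to retrieve N random instances.
It's easy. Run a keys-only query, pick N keys at random from the resulting list, then fetch the entities by key. Note that this loads every key of the kind into memory, so it only suits small datasets.
    import random

    from google.appengine.ext import db

    n = 5  # number of random instances
    all_keys = list(MyModel.all(keys_only=True))
    result_keys = []
    for _ in range(min(n, len(all_keys))):
        key = random.choice(all_keys)
        all_keys.remove(key)  # avoid picking the same key twice
        result_keys.append(key)
    # result_keys now contains up to 5 distinct random keys.
    results = db.get(result_keys)  # one batch get by key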
