How to query a Titan (standard) index directly to retrieve vertices in sorted order

I am using Rexster/TITAN 0.4 over Cassandra.
The vertex keys are indexed using a standard index, as below:
g.makeKey("domain").dataType(String.class).indexed("standard", Vertex.class).make();
I am not using uniqueness, for performance and scalability reasons.
There are around ~10M vertices in graph.
My requirement is to iterate over all vertices, identify any duplicates, and remove them.
Is there a way to get a sorted list of vertices directly from the index that is already present?
That is, a direct query on the index (the standard Titan index), similar to a "Direct Index Query".
That way I could partition the vertices into smaller batches and process them individually.
If that is not possible, what is the best way to achieve this?
I don't want to use Titan-Hadoop or a similar solution just for finding/removing duplicates in the graph.
I want to run the query below to get 1000 vertices in sorted order:
gremlin> g.V.has('domain').domain.order[0..1000]
WARN com.thinkaurelius.titan.graphdb.transaction.StandardTitanTx - Query requires iterating over all vertices [(domain <> null)]. For better performance, use indexes
But this query does not use the standard index created on 'domain', and it fails with an out-of-memory exception on my ~10M vertices.
How can I force Gremlin to use the index in this particular case?

The answer is the same as the one I provided in the comments of your previous question:
Throw more memory at the problem (i.e., increase -Xmx for the console or whatever application is running your query), which would be a short-term solution.
Use titan-hadoop.
Restructure your graph or your queries in some way that allows the use of an index. This could mean giving up some performance on insert and using a uniqueness lock. Maybe you don't have to remove duplicates in your source data at all; perhaps you can dedup them in your Gremlin queries at traversal time (a minimal sketch of that idea follows this list). The point is that you'll need to be creative.
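As a rough illustration of that last option, here is a hedged sketch against Titan 0.4's Blueprints API. It assumes the set of domain values can be supplied from outside the graph (for example, from your source data), so each lookup goes through the standard index on 'domain' instead of a full scan; the class and method names are made up for illustration:

import com.thinkaurelius.titan.core.TitanGraph;
import com.tinkerpop.blueprints.Vertex;
import java.util.Iterator;
import java.util.List;

public class DomainDedup {
    // For each externally supplied domain value, keep the first vertex returned
    // by the standard index on 'domain' and remove the rest.
    public static void dedup(TitanGraph graph, List<String> domainsFromSource) {
        for (String domain : domainsFromSource) {
            Iterator<Vertex> hits = graph.getVertices("domain", domain).iterator(); // index lookup
            if (hits.hasNext()) {
                hits.next();                          // keep the first match
                while (hits.hasNext()) {
                    graph.removeVertex(hits.next());  // drop the duplicates
                }
            }
            graph.commit();                           // commit per domain to keep transactions small
        }
    }
}

The same pattern (take the first hit from the index and ignore the rest) also works purely at read time if you would rather tolerate the duplicate vertices in storage.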
Despite your reluctance to use titan-hadoop and your not wanting to use it "just for finding/removing duplicates in graph", that's exactly the use case it is good at. You have a batch process that must iterate over all vertices, it can't fit in the memory you've allotted, and you don't want to use titan-hadoop. That's a bit like saying: "I have a nail and a hammer, but I don't want to use the hammer to bang in the nail." :)
How can I force gremlin to use index in this particular case?
There is no way in Gremlin to do this. In theory, there might be a way to read from Cassandra directly (bypassing Titan), decode the binary result, and somehow iterate and delete, but it's not known to me. Even if you figured it out, which would mean many hours digging into the depths of Titan to see how to read the index data, it would be a hack that is likely to break any time you upgrade Titan, as the core developers might close that avenue at any point since you would be circumventing Titan in an unexpected way.
The best option is to simply use titan-hadoop to solve your problem. Unless your graph is completely static and no longer growing, you will reach a point where titan-hadoop is inevitable. How will you be sure that your graph is growing correctly when you have 100M+ edges? How will you gather global statistics about your data? How will you repair bad data that got into the database from a bug in your code? All of those things become issues when your graph reaches a certain scale and titan-hadoop is your only friend there at this time.

Related

Deleting vertex with a degree of millions scale from JanusGraph

I am running JanusGraph with Scylla as the storage engine.
The graph has a vertex with a degree of 5M (in + out), i.e. around 5M vertices are connected to it.
I am trying to drop this vertex with the Gremlin query graph.traversal().V(vertexId).drop().iterate(), but it's taking a lot of time (unable to delete it in 20 minutes).
I understand that the above query iterates over all the edges and then does the actual deletion.
I wanted to know if anyone has faced a similar issue and found a workaround for it. Any lead would be really helpful.
My information may be dated and perhaps there are newer ways to do this, but since there have been no responses on this question I figured I'd offer the advice as I know it. In the days before JanusGraph, when this graph was called Titan, I had situations like the one you describe, and I saw results similar to what you are seeing when doing a direct g.V(id).drop(): fully getting rid of a vertex of that size meant having some patience. The strategy I used to get rid of it involved pruning the vertex of its edges so that a delete of the vertex itself became possible.
How you go about pruning the edges depends on your data and how those 5M edges are composed. It could be as simple as doing it by label, or in blocks of 10,000 within each label at a time, or whatever else makes sense to break the process down into chunks:
// keep dropping 'knows' edges in batches of 10,000 until none remain
while (g.V(vertexId).outE('knows').limit(1).hasNext()) {
    g.V(vertexId).outE('knows').limit(10000).drop().iterate();
}
I think I recall being able to run these types of operations in parallel, which sped up the process a bit. In any case, once you get the vertex bare of all edges (or at least down to a significantly smaller degree) you can then g.V(vertexId).drop() and say good-bye to it.
I didn't use ScyllaDB, but I do remember that this many deletes can create tombstone-related issues in Cassandra, so that's potentially worth watching out for. You might also look at increasing the various timeouts that might come into play during this process.
For me, the lesson I learned over the years with respect to this issue was to build OLAP-based monitors that keep track of graph statistics to ensure that you have proper and expected growth within your graph (i.e. degree distribution, label distributions, etc.). This is especially important with graphs being fed from high-volume streams like Kafka, where you can turn your head for a few hours and come back to find your graph in an ugly, unexpected state. I also think it's important to model in ways that work around the possibility of getting into these supernode states. Edge TTLs and unidirectional edges can help with that in many cases.
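As a rough sketch of what such a monitor might compute, the traversal below builds a degree histogram with the TinkerPop Java API; the GraphTraversalSource g is assumed to already exist, and on a graph of any real size you would run this as an OLAP job (e.g. with SparkGraphComputer) rather than as a plain OLTP traversal:

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__;
import java.util.Map;

public class DegreeMonitor {
    // Map of degree -> number of vertices with that degree; a growing tail at very
    // high degrees is an early sign that a supernode is forming.
    public static Map<Object, Long> degreeHistogram(GraphTraversalSource g) {
        return g.V()
                .groupCount()
                .by(__.bothE().count())
                .next();
    }
}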
I would love to hear that this answer is no longer relevant and that there are neat new ways to do these sorts of drops or that there is some ScyllaDB specific way to handle this problem, but, if not, perhaps this will be useful to you and get you past your problem.

Global Indexing in Graph Databases

I am very new to graph databases and am trying to work on a survey of different graph databases. I am not able to understand what exactly global indexing in graph databases is.
Can someone please help me understand what global indexing in graph databases is?
I am not sure whether all graph databases agree on the notion of what a global index is, but generally it means an index that applies to the whole graph. Such an index allows you to efficiently retrieve vertices based on some indexed property, e.g. find all person vertices with the name Manoj. Most graph queries use a global index to find one or a small number of vertices as an entry point into the graph and then traverse the graph from there.
In contrast to global indexes, there are vertex-centric indexes. They apply only to a specific vertex and can be used to make queries involving so-called supernodes more efficient. The idea here is to index a property of the incident edges of the vertex so that the number of neighboring vertices returned can be reduced to those that are really interesting for the query. Such a vertex-centric index could, for example, be used at Twitter to index the followedSince property on follower edges. This would allow you to efficiently query for all followers of Katy Perry who began following her on her birthday. Without such an index you would have to check the property on all of her (currently over 95 million) followers for this query.
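To make the two index types concrete, here is a hedged sketch using the JanusGraph management API (Titan's successor; exact method and enum names vary a little between versions, e.g. Order.desc vs. the older Order.decr). The property keys and edge label are invented for illustration:

import org.apache.tinkerpop.gremlin.process.traversal.Order;
import org.apache.tinkerpop.gremlin.structure.Direction;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.janusgraph.core.EdgeLabel;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.PropertyKey;
import org.janusgraph.core.schema.JanusGraphManagement;

public class IndexSetup {
    public static void defineIndexes(JanusGraph graph) {
        JanusGraphManagement mgmt = graph.openManagement();

        // Global (graph-wide) index: find vertices by 'name' anywhere in the graph.
        PropertyKey name = mgmt.makePropertyKey("name").dataType(String.class).make();
        mgmt.buildIndex("personByName", Vertex.class).addKey(name).buildCompositeIndex();

        // Vertex-centric index: sort/filter the 'follows' edges of a single vertex
        // by 'followedSince', so supernode traversals don't have to touch every edge.
        PropertyKey since = mgmt.makePropertyKey("followedSince").dataType(Long.class).make();
        EdgeLabel follows = mgmt.makeEdgeLabel("follows").make();
        mgmt.buildEdgeIndex(follows, "followsBySince", Direction.BOTH, Order.desc, since);

        mgmt.commit();
    }
}

A traversal that filters a vertex's follower edges by followedSince can then use this vertex-centric index instead of walking all of its edges.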
(Your question didn't mention vertex-centric indexes but I think it helps to understand why global indexes are called that way when you know about vertex-centric indexes, as they are basically local indexes.)
For more information about indexing in graph databases see the respective sections in the documentation of graph databases like Titan or DSE Graph.

Open Street Map enclosing polygons

I am working on an Android application that uses the Overpass API at [1]. My goal is to get all circular ways that enclose a certain lat-long point.
In order to do so I build a request for a rectangle that contains my location, then parse the response XML and run a ray-casting algorithm to filter the ways that enclose the given lat-long position. This is too slow for the purposes of my application because the response is sometimes tens or hundreds of MB.
Is there any OSM API that I can call to get all ways that enclose a certain location? Otherwise, how could I optimize the process?
Thanks!
[1] http://overpass-api.de/
To my knowledge, there is no standard API in OSM to do this (it is indeed a very uncommon use case).
I assume you define "enclose" as: the point representing the current location lies inside the inner area of the polygon. Furthermore, I assume that optimizing the process might include changing the entire concept of the algorithm.
First of all, you need to define the rectangle from which to fetch data. For that, consider that querying too large a rectangle would yield too much data. As far as I know there is no specific API to query circular ways only, and even if there were, querying too large a rectangle would probably be denied by the server, because the server load would be enormous.
Server-side precomputation / prefiltering
Therefore I suggest the first optimization: instead of querying an API that is not specifically suited to your purpose, use an offline database saved on the Android device. OsmAnd and others save the whole database for a country offline, but in your specific use case you only need to save a pre-filtered database of circular ways.
As far as I know, only a small fraction of the ways in OSM are circular. Therefore I suggest writing a script that regularly downloads OSM dumps, e.g. from Geofabrik, and removes non-circular ways (for example, you could check whether the last node ID in a way is equal to the first node ID, but you'd need to verify that this captures every way you would define as circular). How often you run it depends on your use case.
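A minimal sketch of that closed-way check, assuming a hypothetical Way type exposed by whatever OSM parser the script uses:

import java.util.List;

public class CircularWayFilter {
    // Hypothetical minimal representation of a parsed OSM way.
    public static class Way {
        final long id;
        final List<Long> nodeIds;   // node IDs in order, as they appear in the dump
        Way(long id, List<Long> nodeIds) { this.id = id; this.nodeIds = nodeIds; }
    }

    // A way is treated as circular if it is closed, i.e. its first and last node
    // IDs are equal (and it has enough nodes to actually enclose an area).
    public static boolean isCircular(Way way) {
        List<Long> nodes = way.nodeIds;
        return nodes.size() >= 4 && nodes.get(0).equals(nodes.get(nodes.size() - 1));
    }
}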
This optimization solves:
The issue of downloading a large amount of data
The issue of overloading the API with large requests
The issue of not being able to request large chunks of data
If that is not suitable for your use case, I suggest building a simple API for this on your own server.
Re-chunking the data into appropriate grids
However, you would still need to filter a large amount of data. In order to partially solve this, I suggest the second optimization: re-chunk your data. For example, if your current location is in Virginia, you do not need to consider circular ways that lie entirely within Texas. Because filtering by state etc. would be highly country-dependent and difficult (CPU-intensive), I suggest choosing a grid, say 0.05 lat/lon degrees (I'd choose an equirectangular projection because it's easy to compute if you already have lat/lon coordinates).
The script that preprocesses the data should then create one chunk of data (that could be a file, but we don't know enough about your use case to talk about specific data structures) for each grid rectangle in the area you want to cover. A circular way is included in a chunk if and only if it has at least one node inside the chunk's area.
You would then only request / filter the specific chunk your position is currently in. Choose the chunk size appropriately for your application (preferably rather small, but that depends on numerous factors!).
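For illustration, a minimal sketch of mapping a position to such a grid cell; the key format and how chunks are stored (files, a small SQLite table, etc.) are assumptions to adapt to your app:

public class GridChunks {
    static final double CELL_SIZE_DEG = 0.05;

    // Map a lat/lon position to the key of its 0.05-degree grid cell
    // (equirectangular grid, so this is just a floor division per axis).
    public static String chunkKey(double lat, double lon) {
        long row = (long) Math.floor(lat / CELL_SIZE_DEG);
        long col = (long) Math.floor(lon / CELL_SIZE_DEG);
        return row + "_" + col;   // e.g. 37.54, -77.43 -> "750_-1549"
    }
}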
This optimization solves:
Assuming most of the circular ways are quite small in terms of their bounding rectangles, you only need to filter a tiny fraction of the overall ways
IO is minimized, especially if you keep the chunks small
Hysteretic heuristics
If the aforementioned optimizations do not sufficiently reduce your computation time, I'd suggest a third optimization, which depends on how many circular ways you want to find (if you really need to find all of them, it won't help at all): use hysteresis. Save the circular ways you were inside of during the last computation (assuming the new location is near the last one) and check them first. If your location didn't change much, you have a high chance of hitting a way you're inside of during the first few ray casts.
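A hedged sketch of that idea, for the case where you do not need every enclosing way; Way and contains() are placeholders for your own types and your existing ray-casting test:

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class HystereticLookup {
    public interface Way { boolean contains(double lat, double lon); }

    private final Set<Way> lastHits = new LinkedHashSet<>();

    public Way findAnyEnclosingWay(double lat, double lon, List<Way> chunkWays) {
        // Cheap pass over the previous winners; this usually succeeds if the
        // position only moved a little since the last fix.
        for (Way w : lastHits) {
            if (w.contains(lat, lon)) return w;
        }
        // Fall back to the full (chunk-local) candidate list.
        for (Way w : chunkWays) {
            if (w.contains(lat, lon)) {
                lastHits.add(w);
                return w;
            }
        }
        return null;  // no enclosing way in this chunk
    }
}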
Leveraging relations between different circular ways
Also, a fourth optimization is possible: some circular ways will be fully enclosed in another circular way. You could code your program so that it knows about this relation and checks the inner circular way first. If that check succeeds, you automatically know that the current position is also contained in the outer circular way. I think computing this information (server-side) could be incredibly CPU-intensive, and implementing it might also be a hard task, so I'd suggest using this optimization only if it cannot be avoided.
Tuning the parameters of these optimizations should be sufficient to decrease the CPU time needed for your computation significantly. Please feel free to comment/ask if you have further questions regarding these suggestions.

Algorithms for key value pair, where key is string

I have a problem where there is a huge list of strings or phrases; it might scale from 100,000 to 100 million. When I search for a phrase and it is found, it gives me the ID or index into the database for further operations. I know a hash table can be used for this, but I am looking for another algorithm that could generate an index based on the strings and also be useful for other features like autocomplete, etc.
I read about suffix trees/arrays in some SO threads; they serve the purpose but consume more memory than I can afford. Any alternatives to this?
My search is only over a huge list of millions of strings: no docs, no web pages, and I'm not interested in a search engine like Lucene, etc.
I also read about inverted indexes, which sound helpful, but which algorithm do I need to study for that?
If this database index is within MS SQL Server, you may get good results with SQL Full-Text Indexing. Other SQL providers may have similar functionality, but I would not be able to help with those.
Check out: http://www.simple-talk.com/sql/learn-sql-server/understanding-full-text-indexing-in-sql-server/
and
http://msdn.microsoft.com/en-us/library/ms142571.aspx

Determining the Similarity Between Items in a Database

We have a database with hundreds of millions of records of log data. We're attempting to 'group' this log data as being likely to be of the same nature as other entries in the log database. For instance:
Record X may contain a log entry like:
Change Transaction ABC123 Assigned To Server US91
And Record Y may contain a log entry like:
Change Transaction XYZ789 Assigned To Server GB47
To us humans those two log entries are easily recognizable as being likely related in some way. Now, there may be 10 million rows between Record X and Record Y. And there may be thousands of other entries that are similar to X and Y, and some that are totally different but that have other records they are similar to.
What I'm trying to determine is the best way to group the similar items together and say that, with XX% certainty, Record X and Record Y are probably of the same nature. Or perhaps a better way of saying it would be that the system would look at Record Y and say, based on your content, you're most like Record X as opposed to all other records.
I've seen some mentions of Natural Language Processing and other ways to find similarity between strings (like just brute-forcing some Levenshtein calculations) - however for us we have these two additional challenges:
The content is machine generated - not human generated
As opposed to a search engine approach where we determine results for a given query - we're trying to classify a giant repository and group them by how alike they are to one another.
Thanks for your input!
Interesting problem. Obviously, there's a scale issue here because you don't really want to start comparing each record to every other record in the DB. I believe I'd look at growing a list of "known types" and scoring records against the types in that list to see if each record has a match in that list.
The "scoring" part will hopefully draw some good answers here -- your ability to score against known types is key to getting this to work well, and I have a feeling you're in a better position than we are to get that right. Some sort of soundex match, maybe? Or if you can figure out how to "discover" which parts of new records change, you could define your known types as regex expressions.
At that point, for each record, you can hopefully determine that you've got a match (with high confidence) or a match (with lower confidence) or very likely no match at all. In this last case, it's likely that you've found a new "type" that should be added to your "known types" list. If you keep track of the score for each record you matched, you could also go back for low-scoring matches and see if a better match showed up later in your processing.
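To make that idea a bit more concrete, here is a hedged sketch of one way to grow such a list of known types by masking the variable parts of each line; the masking regexes and class names are assumptions to be tuned to your data (or replaced with hand-written regexes per known type, as suggested above):

import java.util.HashMap;
import java.util.Map;

public class KnownTypes {
    private final Map<String, Integer> typeCounts = new HashMap<>();

    // "Change Transaction ABC123 Assigned To Server US91"
    //   -> "Change Transaction <ID> Assigned To Server <ID>"
    static String toTemplate(String line) {
        return line.replaceAll("\\b[A-Z]{2,}\\d+\\b", "<ID>")
                   .replaceAll("\\b\\d+\\b", "<NUM>");
    }

    // A template seen for the first time is effectively a newly discovered "type".
    public void classify(String line) {
        typeCounts.merge(toTemplate(line), 1, Integer::sum);
    }
}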
I would suggest indexing your data using a text search engine like Lucene to split your log entries into terms. Since your data is machine generated, also use word bigrams and trigrams, or even higher-order n-grams. A bigram is just a pair of consecutive words; in your example you would have the following bigrams:
Change_Transaction, Transaction_XYZ789, XYZ789_Assigned, Assigned_To, To_Server, Server_GB47
For each log entry, prepare a query in a similar way; the search engine will give you the most similar results. You may need to tweak the similarity function a bit to obtain the best results, but I believe this is a good start.
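For illustration, a minimal sketch of the bigram extraction described above; whitespace splitting is an assumption, and in a real pipeline a Lucene analyzer (e.g. with a ShingleFilter) would produce these n-grams during indexing:

import java.util.ArrayList;
import java.util.List;

public class Bigrams {
    public static List<String> bigrams(String line) {
        String[] tokens = line.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            result.add(tokens[i] + "_" + tokens[i + 1]);   // join adjacent words
        }
        return result;
    }
    // bigrams("Change Transaction XYZ789 Assigned To Server GB47")
    // -> [Change_Transaction, Transaction_XYZ789, XYZ789_Assigned,
    //     Assigned_To, To_Server, Server_GB47]
}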
Two main strategies come to my mind here:
The ad-hoc one: use an information retrieval approach. Build an index for the log entries, possibly using a specialized tokenizer/parser, by feeding them into a regular text search engine. I've heard of people doing this with Xapian and Lucene. Then you can "search" for a new log record and the text search engine will (hopefully) return some related log entries to compare it with. However, the information retrieval approach is usually only interested in finding, say, the 10 most similar results.
The clustering approach: you will usually need to turn the data into numerical vectors (which may, however, be sparse), e.g. TF-IDF vectors. Then you can apply a clustering algorithm to find groups of closely related lines (such as the example you gave above) and investigate their nature. You might need to tweak this a little so it doesn't, for example, cluster on the server ID.
Both strategies have their ups and downs. The first one is quite fast, but it will always just return you some similar existing log lines, without much quantitative information on how common such a line is. It's mostly useful for human inspection.
The second strategy is more computationally intensive, and depending on your parameters it could fail completely (so maybe test it on a subset first), but it could also give more useful results by actually building large groups of very closely related log entries.
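As a rough illustration of the vector-space idea behind the second strategy, here is a hedged sketch that turns lines into term-count vectors and compares them with cosine similarity; a real pipeline would use TF-IDF weighting and a clustering library such as Mahout instead of pairwise comparisons:

import java.util.HashMap;
import java.util.Map;

public class LineSimilarity {
    // Bag-of-words term counts for a single log line (lowercased, whitespace-split).
    static Map<String, Integer> termCounts(String line) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : line.toLowerCase().split("\\s+")) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two term-count vectors; 1.0 means identical term mix.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) normB += (double) v * v;
        return dot == 0 ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}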
It sounds like you could take the Lucene approach mentioned above, then use that as a source of input vectors for the machine learning library Mahout (http://mahout.apache.org/). Once there, you can train a classifier or just use one of their clustering algorithms.
If your DBMS has it, take a look at SOUNDEX().
