I am running JanusGraph with Scylla as the storage backend.
The graph has a vertex with a degree of 5M (in + out), i.e. around 5M vertices are connected to it.
I am trying to drop this vertex with the Gremlin query graph.traversal().V(vertexId).drop().iterate(), but it is taking a very long time (it has not completed in 20 minutes).
I understand that the above query iterates over all the edges and then performs the actual deletion.
I wanted to know if anyone has faced a similar issue and figured out a workaround for it. Any lead would be really helpful.
My information may be dated and perhaps there are revised ways to do this, but since there have been no responses on this question I figured I'd offer the advice as I know it. In the days before JanusGraph, when this graph was called Titan, I had situations like the one you describe and saw results similar to what you are seeing when doing a direct g.V(id).drop(): fully getting rid of a vertex of that size takes some patience. The strategy I used to get rid of it involved pruning the vertex of its edges so that a delete of the vertex itself became possible.
How you go about pruning the edges depends on your data and how those 5M edges are composed. It could be as simple as doing it by label, or in blocks of 10,000 within each label at a time, or whatever else makes sense to break the process down into chunks.
// drop the edges for one label in batches of 10,000 until none remain
while(g.V(vertexId).outE('knows').limit(1).hasNext()) {
    g.V(vertexId).outE('knows').limit(10000).drop().iterate();
}
I think I recall that I was able to run these types of operations in parallel, which sped up the process a bit. In any case, once you get the vertex bare of all edges (or at least down to a significantly smaller degree), you can then g.V(vertexId).drop() and say good-bye to it.
I didn't use ScyllaDB, but I seem to remember that this many deletes can create tombstone-related issues for Cassandra, so that's potentially worth looking out for. You might also look at increasing the various timeouts that might come into play during this process.
For me, the lesson I learned over the years with respect to this issue was to build OLAP based monitors that keep track of graph statistics to ensure that you have proper and expected growth within your graph (i.e. degree distribution, label distributions, etc). This is especially important with graphs being fed from high-volume streams like Kafka where you can turn your head for a few hours and come back and find your graph in an ugly unexpected state. I think it's also important to model in ways that work around the possibility of getting to these supernode states. Edge TTLs and unidirectional edges can help with that in many cases.
I would love to hear that this answer is no longer relevant and that there are neat new ways to do these sorts of drops or that there is some ScyllaDB specific way to handle this problem, but, if not, perhaps this will be useful to you and get you past your problem.
I am using Rexster/TITAN 0.4 over Cassandra.
The vertex keys are indexed using a standard index, as shown below.
g.makeKey("domain").dataType(String.class).indexed("standard", Vertex.class).make();
I am not using uniqueness, for performance and scalability reasons.
There are around ~10M vertices in graph.
My requirement is to iterate over all vertices, identify any duplicates, and then remove them.
Is there a way to get a sorted list of vertices directly from the index that is already present?
Something like a direct query on the index (a standard Titan index), similar to a "Direct Index Query".
That way I could partition the vertices into smaller batches and process them individually.
If that is not possible, what is the best way to achieve this?
I don't want to use Titan-Hadoop or a similar solution just for finding/removing duplicates in the graph.
I want to run the below query to get 1000 vertices in the sorted order.
gremlin> g.V.has('domain').domain.order[0..1000]
WARN com.thinkaurelius.titan.graphdb.transaction.StandardTitanTx - Query requires iterating over all vertices [(domain <> null)]. For better performance, use indexes
But this query is not using the standard index that was created on 'domain', and it fails with an out-of-memory exception. I have ~10M vertices in the graph.
How can I force gremlin to use index in this particular case?
The answer is the same as the one I provided in the comments of your previous question:
Throw more memory at the problem (i.e. increase -Xmx to the console or whatever application is running your query) - which would be a short-term solution.
Use titan-hadoop.
Restructure your graph or queries in some way to allow a use of an index. This could mean giving up some performance on insert and using a uniqueness lock. Maybe you don't have to remove duplicates in your source data - perhaps you can dedup them in your Gremlin queries at the time of traversal. The point is that you'll need to be creative.
Despite your reluctance to use titan-hadoop and your not wanting to use it "just for finding/removing duplicates in graph", that's exactly the use case it is good at. You have a batch process that must iterate over all vertices, it can't fit in the memory you've allotted, and you don't want to use titan-hadoop. That's a bit like saying: "I have a nail and a hammer, but I don't want to use the hammer to bang in the nail." :)
How can I force gremlin to use index in this particular case?
There is no way in Gremlin to do this. In theory, there might be a way to read from Cassandra directly (bypassing Titan), decode the binary result, and somehow iterate and delete, but it's not known to me. Even if you figured it out, which would mean many hours digging into the depths of Titan to see how to read the index data, it would be a hack that is likely to break whenever you upgrade Titan, as the core developers might close that avenue at any point since you are circumventing Titan in an unexpected way.
The best option is to simply use titan-hadoop to solve your problem. Unless your graph is completely static and no longer growing, you will reach a point where titan-hadoop is inevitable. How will you be sure that your graph is growing correctly when you have 100M+ edges? How will you gather global statistics about your data? How will you repair bad data that got into the database from a bug in your code? All of those things become issues when your graph reaches a certain scale and titan-hadoop is your only friend there at this time.
I am working on an Android application that uses the Overpass API at [1]. My goal is to get all circular ways that enclose a certain lat-long point.
In order to do so I build a request for a rectangle that contains my location, then parse the response XML and run a ray-casting algorithm to filter the ways that enclose the given lat-long position. This is too slow for the purposes of my application, because the response is sometimes tens or hundreds of MB.
Is there any OSM API that I can call to get all ways that enclose a certain location? Otherwise, how could I optimize the process?
Thanks!
[1] http://overpass-api.de/
To my knowledge, there is no standard API in OSM to do this (it is indeed a very uncommon use case).
I assume you define "enclose" as: the point representing the current location lies inside the inner area of the polygon. Furthermore, I assume that optimizing the process might include changing the entire concept of the algorithm.
First of all, you need to define the rectangle to fetch data for. Here you need to consider that querying too large a rectangle would yield too much data. As far as I know there is no specific API to query circular ways only, and even if there were, querying too large a rectangle would probably be denied by the server, because the server load would be enormous.
Server-side precomputation / prefiltering
Therefore I suggest the first optimization: instead of querying an API that is not specifically suited for your purpose, use an offline database saved on the Android device. OsmAnd and others save the whole database for a country offline, but in your specific use case you only need to save a pre-filtered database of circular ways.
As far as I know, only a small fraction of the ways in OSM are circular. Therefore I suggest writing a script that regularly downloads OSM dumps, e.g. from Geofabrik, and removes non-circular ways (for example, you could check whether the last node ID in a way is equal to the first node ID, but you'd need to verify that this captures every way you would define as circular). How often you run it depends on your use case.
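If it helps, here is one possible sketch of that pre-filtering script in Python using the pyosmium library (the input file name is just an example of a Geofabrik extract, and the closed-way check is only the simple first-node-equals-last-node criterion mentioned above):

    # Sketch: collect closed ("circular") ways from an OSM extract.
    # Assumes pyosmium is installed; adjust the file name to your download.
    import osmium

    class ClosedWayFilter(osmium.SimpleHandler):
        def __init__(self):
            super().__init__()
            self.closed_way_ids = []

        def way(self, w):
            # Treat a way as circular if it has at least 4 nodes and its
            # first and last node IDs are identical (a closed ring).
            if len(w.nodes) >= 4 and w.nodes[0].ref == w.nodes[-1].ref:
                self.closed_way_ids.append(w.id)

    handler = ClosedWayFilter()
    handler.apply_file("virginia-latest.osm.pbf")  # example Geofabrik extract
    print("closed ways found:", len(handler.closed_way_ids))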
This optimization solves:
The issue of downloading a large amount of data
The issue of overloading the API with large requests
The issue of not being able to request large chunks of data
If that is not suitable for your usecase, I suggest to build a simple API for that on your server.
Re-chunking the data into appropriate grids
However, you would still need to filter a large amount of data. In order to partially solve this, I suggest the second optimization: re-chunk your data. For example, if your current location is in Virginia, you do not need to check circular ways whose area lies entirely within Texas. Because filtering by state etc. would be highly country-dependent and difficult (CPU-intensive), I suggest choosing a grid, say of 0.05 lat/lon degrees (I'd choose an equirectangular projection because it is easy to compute if you already have lat/lon coordinates).
The script that preprocesses the data should then create one chunk of data (that could be a file, but we don't know enough about your use case to talk about specific data structures) for each rectangle in the area you want to cover. A circular way is included in a chunk if and only if it has at least one node inside the chunk's area.
You would then only request / filter the specific chunk your position is currently in. Choose the chunk size appropriately for your application (preferably rather small, but that depends on numerous factors!).
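To make the grid idea a bit more concrete, here is a small Python sketch of the chunk-key computation (the 0.05-degree cell size is just the example value from above, and the input format for the ways is an assumption):

    import math
    from collections import defaultdict

    CELL_SIZE = 0.05  # degrees, as in the example above

    def cell_key(lat, lon, cell_size=CELL_SIZE):
        # Map a lat/lon coordinate to its grid cell (equirectangular grid).
        return (math.floor(lat / cell_size), math.floor(lon / cell_size))

    def build_chunks(ways):
        # ways: iterable of (way_id, [(lat, lon), ...]) tuples.
        # A way is assigned to every cell containing at least one of its nodes.
        chunks = defaultdict(set)
        for way_id, coords in ways:
            for lat, lon in coords:
                chunks[cell_key(lat, lon)].add(way_id)
        return chunks

    # At query time, only the ways in the current cell need to be ray-cast:
    # candidates = chunks[cell_key(current_lat, current_lon)]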
This optimization solves:
Assuming most of the circular ways are quite small in terms of their bounding rectangles, you only need to filter a tiny fraction of the overall ways
IO is minimized, especially if you
Hysteretic heuristics
If the aforementioned optimizations do not reduce your computation time sufficiently, I'd suggest a third optimization, whose value depends on how many circular ways you want to find (if you really need to find all of them, it won't help at all): use hysteresis. Save the circular ways you were inside of during the last computation (assuming the new current location is near the last location) and check them first. If your location hasn't changed much, you have a high chance of hitting a way you're inside of within the first few ray casts.
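A minimal sketch of that idea (point_in_way stands for your existing ray-casting check, and previous_hits is simply the result of the previous run):

    def enclosing_ways(position, candidate_ways, previous_hits, point_in_way):
        # Yield ways that enclose `position`, testing last-known hits first.
        # point_in_way(position, way) -> bool is the existing ray-casting test.
        ordered = list(previous_hits) + [w for w in candidate_ways if w not in previous_hits]
        for way in ordered:
            if point_in_way(position, way):
                yield way

Callers that only need the first enclosing way (or the first few) benefit most, because the likely hits are tested before the rest of the chunk.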
Leveraging relations between different circular ways
Also, a fourth optimization is possible: there will be some circular ways that are fully enclosed in another circular way. You could code your program so that it knows about that relation and checks the inner circular way first. If this check succeeds, you automatically know that the current position is also contained in the outer circular way. Computing this information (server-side) could be incredibly CPU-intensive, and implementing it might also be a hard task, so I'd suggest using this optimization only if you cannot avoid it.
Tuning the parameters of these optimizations should be sufficient to decrease the CPU time needed for your computation significantly. Please feel free to comment/ask if you have further questions regarding these suggestions.
I want to do pre-clustering for a set of approx. 500,000 points.
I haven't started yet but this is what I had thought I would do:
store all points in a localSOLR index
determine "natural cluster positions" according to some administrative information (big cities for example)
and then calculate a cluster for each city:
for each city
    for each zoom level
        query the index to get the points contained in a radius around the city (the length of the radius depends on the zoom level)
This should be quite efficient because there are only 100 major cities and SOLR queries are very fast. But a little more thinking revealed it was wrong:
there may be clusters of points that are more "near" one another than near a city: they should get their own cluster
at some zoom levels, some points will not be within the acceptable distance of any city, and so they will not be counted
some cities are near one another and therefore, some points will be counted twice (added to both clusters)
There are other approaches:
examine each point and determine to which cluster it belongs; this eliminates problems 2 and 3 above, but not 1, and is also extremely inefficient
make a (rectangular) grid (for each zoom level); this works but will result in crazy / arbitrary clusters that don't "mean" anything
I guess I'm looking for a general purpose geo-clustering algorithm (or idea) and can't seem to find any.
Edit to answer comment from Geert-Jan
I'd like to build "natural" clusters, yes, and yes I'm afraid that if I use an arbitrary grid, it will not reflect the reality of the data. For example if there are many events that occur around a point that is at or near the intersection of two rectangles, I should get just one cluster but will in fact build two (one in each rectangle).
Originally I wanted to use localSOLR for performance reasons (and because I know it, and have better experience indexing a lot of data into SOLR than loading it in a conventional database); but since we're talking of pre-clustering, maybe performance is not that important (although it should not take days to visualize a result of a new clustering experiment). My first approach of querying lots of points according to a predefined set of "big points" is clearly flawed anyway, the first reason I mentioned being the strongest: clusters should reflect the reality of the data, and not some other bureaucratic definition (they will clearly overlap, sure, but data should come first).
There is a great clusterer for live clustering that has been added to the core Google Maps API: Marker Clusterer. I wonder if anyone has tried to run it "offline": run it for whatever amount of time it needs, and then store the results?
Or is there a clusterer that examines each point, point after point, and outputs clusters with their coordinates and number of points included, and which does this in a reasonable amount of time?
You may want to look into advanced clustering algorithms such as OPTICS.
With a good database index, it should be fairly fast.
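For example, a minimal sketch with scikit-learn's OPTICS implementation (the parameter values are placeholders you would have to tune for your data, and the haversine metric expects coordinates in radians):

    import numpy as np
    from sklearn.cluster import OPTICS

    # points: array of shape (n, 2) with (lat, lon) in degrees
    points = np.loadtxt("points.csv", delimiter=",")  # hypothetical input file

    # max_eps is a neighborhood radius in radians (~0.05 degrees here)
    model = OPTICS(min_samples=20, max_eps=np.radians(0.05), metric="haversine")
    labels = model.fit_predict(np.radians(points))

    # label -1 marks noise; every other label is one "natural" cluster
    for label in set(labels) - {-1}:
        members = points[labels == label]
        print(label, len(members), members.mean(axis=0))  # size and centroid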
I'm a grad student in astrophysics. I run big simulations using codes mostly developed by others over a decade or so. For examples of these codes, you can check out gadget http://www.mpa-garching.mpg.de/gadget/ and enzo http://code.google.com/p/enzo/. Those are definitely the two most mature codes (they use different methods).
The outputs from these simulations are huge. Depending on your code, your data is a bit different, but it's always big data. You usually take billions of particles and cells to do anything realistic. The biggest runs are terabytes per snapshot and hundreds of snapshots per simulation.
Currently, it seems that the best way to read and write this kind of data is to use HDF5 http://www.hdfgroup.org/HDF5/, which is basically an organized way of using binary files. It's a huge improvement over unformatted binary files with a custom header block (still give me nightmares), but I can't help but think there could be a better way to do this.
I imagine the sheer data size is the issue here, but is there some sort of datastore that can handle terabytes of binary data efficiently, or are binary files the only way at this point?
If it helps, we typically store data column-wise. That is, you have a block of all particle IDs, a block of all particle positions, a block of all particle velocities, etc. It's not the prettiest, but it is the fastest for doing something like a particle lookup in some volume.
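For what it's worth, a minimal h5py sketch of that column-wise layout (the dataset names and sizes here are made up, and the chunking/compression settings would depend on your access patterns):

    import h5py
    import numpy as np

    n = 1_000_000  # particles in this (made-up) snapshot

    with h5py.File("snapshot_042.h5", "w") as f:
        # One dataset per column: all ids together, all positions together, ...
        f.create_dataset("ids", data=np.arange(n, dtype=np.int64))
        f.create_dataset("positions", data=np.random.rand(n, 3), chunks=True, compression="gzip")
        f.create_dataset("velocities", data=np.random.rand(n, 3), chunks=True, compression="gzip")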
edit: Sorry for being vague about the issues. Steve is right that this might just be an issue of data structure rather than the data storage method. I have to run now, but I will provide more details late tonight or tomorrow.
edit 2: So the more I look into this, the more I realize that this probably isn't a datastore issue anymore. The main issue with unformatted binary was all the headaches reading the data correctly (getting the block sizes and order right and being sure about it). HDF5 pretty much fixed that and there isn't going to be a faster option until the file system limitations are improved (thanks Matt Turk).
The new issues probably come down to data structure. HDF5 is as performant as we can get, even if it is not the nicest interface to query against. Being used to databases, I thought it would be really interesting/powerful to be able to query something like "give me all particles with velocity over x at any time". You can do something like that now, but you have to work at a lower level. Of course, given how big the data is, and depending on what you are doing with it, it might be a good thing to work at a low level for performance's sake.
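To illustrate the kind of lower-level work I mean, a hand-rolled "query" over one snapshot with h5py and a boolean mask (dataset names as in the made-up snapshot above) might look like this:

    import h5py
    import numpy as np

    # "Give me all particles with speed over x" for one snapshot, by hand:
    with h5py.File("snapshot_042.h5", "r") as f:
        vel = f["velocities"][:]              # read the velocity column
        speed = np.linalg.norm(vel, axis=1)
        fast_ids = f["ids"][:][speed > 10.0]  # boolean mask instead of a WHERE clause
    print(len(fast_ids))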
MongoDB: http://www.mongodb.org/
Netezza products: http://www.netezza.com/data-warehouse-appliance-products/skimmer.aspx
Hadoop: http://hadoop.apache.org/
Wikipedia's List of Distributed File Systems: http://en.wikipedia.org/wiki/List_of_file_systems#Distributed_file_systems
EDIT
Rationale for my lack of explanation / etc.:
OP says: "[HDF5]'s a huge improvement over unformatted binary files with a custom header block (still give me nightmares), but I can't help but think there could be a better way to do this."
What does "better" mean? Better structured? He seems to allude to the "unformatted binary files" as being an issue - so maybe that's what he means by better. If so, he'll need something with some structure - hence the first couple suggestions.
OP says: "I imagine the sheer data size is the issue here, but is there some sort of datastore that can handle terabytes of binary data efficiently, or are binary files the only way at this point?"
Yes, there are several. Both structured, and "unstructured" - does he want structure, or is he happy to leave them in some sort of "unformatted binary format"? We still don't know - so I suggest checking out some Distributed File Systems.
OP says: "If it helps, we typically store data columnwise. That is, you have a block of all particle id's, block of all particle positions, block of particle velocites, etc. It's not the prettiest, but it is the fastest for doing something like a particle lookup in some volume."
Again, does the OP want better structure, or doesn't he? Seems like he wants both - better structure AND faster.... maybe scaling OUT will give him this. This further reinforces the first few options I listed.
OP says (in comments): "I don't know if we can take the hit on io though."
Are there IO requirements? Cost restrictions? What are they?
We can't get something for nothing here. There is no "silver-bullet" storage solution. All we have to go on here for requirements is "lots of data" and "I don't know if I like the lack of structure, but I'm not willing to increase my IO to accommodate any additional structure"... so I don't know what kind of answer he's expecting. He hasn't listed a single complaint about the current solution he has other than the lack of structure - and he's already said he's not willing to pay any overhead to do anything about that... so.... ?
This is for http://cssfingerprint.com
I have a system (see about page on site for details) where:
I need to output a ranked list, with confidences, of categories that match a particular feature vector
the binary feature vectors are a list of site IDs & whether this session detected a hit
feature vectors are, for a given categorization, somewhat noisy (sites will decay out of history, and people will visit sites they don't normally visit)
categories are a large, non-closed set (user IDs)
my total feature space is approximately 50 million items (URLs)
for any given test, I can only query approx. 0.2% of that space
I can only make the decision of what to query, based on results so far, ~10-30 times, and must do so in <~100ms (though I can take much longer to do post-processing, relevant aggregation, etc)
getting the AI's probability ranking of categories based on results so far is mildly expensive; ideally the decision will depend mostly on a few cheap sql queries
I have training data that can say authoritatively that any two feature vectors are the same category but not that they are different (people sometimes forget their codes and use new ones, thereby making a new user id)
I need an algorithm to determine what features (sites) are most likely to have a high ROI to query (i.e. to better discriminate between plausible-so-far categories [users], and to increase certainty that it's any given one).
This needs to balance exploitation (testing based on prior test data) and exploration (testing things that haven't been tested enough to find out how they perform).
There's another question that deals with a priori ranking; this one is specifically about a posteriori ranking based on results gathered so far.
Right now, I have little enough data that I can just always test everything that anyone else has ever gotten a hit for, but eventually that won't be the case, at which point this problem will need to be solved.
I imagine that this is a fairly standard problem in AI - having a cheap heuristic for what expensive queries to make - but it wasn't covered in my AI class, so I don't actually know whether there's a standard answer. So, relevant reading that's not too math-heavy would be helpful, as well as suggestions for particular algorithms.
What's a good way to approach this problem?
If you know nothing about the features you have not sampled, then you have little to go on when deciding whether to explore or exploit your data. If you can express your ROI as a single number following every query, then there is an optimal way of making this choice by keeping track of the upper confidence bounds. See the paper "Finite-time Analysis of the Multiarmed Bandit Problem".
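A rough sketch of the UCB1 rule from that paper (here each "arm" would be a candidate site to query, and the reward is whatever ROI signal you can compute after a query; the values below are toy numbers):

    import math
    import random

    def ucb1(pulls, rewards, total_pulls):
        # Pick the arm with the highest upper confidence bound.
        # pulls[i]   -- how often arm i has been tried
        # rewards[i] -- summed reward observed for arm i
        best, best_score = None, float("-inf")
        for i in range(len(pulls)):
            if pulls[i] == 0:
                return i  # try every arm at least once (pure exploration)
            score = rewards[i] / pulls[i] + math.sqrt(2.0 * math.log(total_pulls) / pulls[i])
            if score > best_score:
                best, best_score = i, score
        return best

    # Toy usage with a made-up reward process:
    n_arms = 5
    pulls, rewards = [0] * n_arms, [0.0] * n_arms
    true_rates = [0.1, 0.3, 0.5, 0.2, 0.4]  # hidden payoff probability of each arm
    for t in range(1, 1001):
        arm = ucb1(pulls, rewards, t)
        reward = 1.0 if random.random() < true_rates[arm] else 0.0
        pulls[arm] += 1
        rewards[arm] += reward
    print(max(range(n_arms), key=lambda i: pulls[i]))  # most-pulled arm after 1000 rounds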