I just read about static vertices in Titan 0.5.0 and I was wondering whether you could get any performance improvements when defining them as such?
Static vertices in Titan primarily serve two purposes:
To guard against accidental deletion or modification
To allow TTL of vertices
There aren't any performance improvements for static vertices yet. As we improve Titan's caches, you will see that static vertices can be cached much more effectively.
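For reference, declaring a static vertex label (and attaching a TTL to it) through the management API in Titan 0.5 looks roughly like this; the label name and TTL value are made up, and the setTTL signature is from memory, so check it against the schema documentation:

import java.util.concurrent.TimeUnit

mgmt = g.getManagementSystem()
// vertices with a static label cannot be modified outside the transaction that created them
tweet = mgmt.makeVertexLabel('tweet').setStatic().make()
// a TTL can only be attached to static labels; these vertices expire 7 days after creation
mgmt.setTTL(tweet, 7, TimeUnit.DAYS)
mgmt.commit()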
I am running JanusGraph with Scylla as the storage engine.
The graph has a vertex with a degree of 5M (in + out), i.e. around 5M vertices are connected to it.
I am trying to drop this vertex with the Gremlin query graph.traversal().V(vertexId).drop().iterate(), but it's taking a lot of time (unable to delete it in 20 minutes).
I understand that the above query iterates over all the edges and then does the actual deletion.
I wanted to know if anyone has faced a similar issue and figured out a workaround for it. Any lead would be really helpful.
My information may be dated and perhaps there are revised ways to do this, but since there have been no responses on this question I figured I'd offer the advice as I know it. In the days before JanusGraph, when this graph was called Titan, I had situations like the one you describe and found the same thing you are finding: a direct g.V(id).drop() on a vertex of that size means having some patience to fully get rid of it. The strategy I used was to prune the vertex of its edges so that a delete of the vertex itself became possible.
How you go about pruning the edges depends on your data and how those 5M edges are composed. It could be as simple as doing it by label, or in blocks of 10000 within each label at a time, or whatever else makes sense to break the process down into chunks, for example:
while(g.V(vertexId).outE('knows').limit(1).hasNext()) {
g.V(vertexId).outE('knows').limit(10000).drop().iterate();
}
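If the 5M edges span more than one label, the same pattern can be wrapped in a loop over the labels, committing after each batch so no single transaction grows huge. This is just a sketch; the label names and batch size are placeholders, and graph is assumed to be the JanusGraph instance behind g:

['knows', 'created'].each { edgeLabel ->
    while (g.V(vertexId).outE(edgeLabel).limit(1).hasNext()) {
        g.V(vertexId).outE(edgeLabel).limit(10000).drop().iterate()
        graph.tx().commit()   // keep each batch of deletions in its own small transaction
    }
}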
I think I recall that I was able to run these types of operations in parallel, which sped the process up a bit. In any case, once you get the vertex bare of all edges (or at least down to a significantly smaller degree), you can g.V(vertexId).drop() it and say good-bye to it.
I didn't use ScyllaDB, but I seem to remember that this many deletes can create tombstone-related issues for Cassandra, so that's potentially worth looking out for. You might also look at increasing the various timeouts that might come into play during this process.
For me, the lesson I learned over the years with respect to this issue was to build OLAP based monitors that keep track of graph statistics to ensure that you have proper and expected growth within your graph (i.e. degree distribution, label distributions, etc). This is especially important with graphs being fed from high-volume streams like Kafka where you can turn your head for a few hours and come back and find your graph in an ugly unexpected state. I think it's also important to model in ways that work around the possibility of getting to these supernode states. Edge TTLs and unidirectional edges can help with that in many cases.
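Just to sketch the sort of checks I mean (and on a big graph you would compute these with an OLAP/Spark traversal rather than as live OLTP queries):

// label distribution
g.V().groupCount().by(label)

// degree of one suspect vertex
g.V(vertexId).bothE().count()

// crude supernode check: ids of vertices above a degree threshold
g.V().filter(bothE().count().is(gt(100000))).id()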
I would love to hear that this answer is no longer relevant and that there are neat new ways to do these sorts of drops or that there is some ScyllaDB specific way to handle this problem, but, if not, perhaps this will be useful to you and get you past your problem.
I'm looking for a technique or algorithm that can help with a design idea. Keep in mind that my proposed solution is open for modification, as well.
I have a series of blocks (square and rectangular) that fit into a grid. The grid and the pieces are unique in that some blocks can fit into multiple locations and others can fit only into a limited number of locations. If it helps, think of something like "Battleship" where the pieces have unique connectors which limit their placement.
I imagine this as a multi-dimensional array. Using a sort of collision or storage-contention technique, I would like to enumerate the set of solutions in which all pieces are able to fit. An optimal solution would be the one that allows the most pieces to fit on the board at any one time.
I have considered Interval Scheduling, a variety of 2D collision detection algorithms, and looked into graph theory (e.g. Flow Network). These all seem to be overkill for the design.
I don't have specific terminology for what I am looking for, so searching for a solution is difficult. If I have to brute-force this then OK, but I have to believe there is a more elegant solution.
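For concreteness, the brute force I have in mind would be a small backtracking search: for each block, either skip it or try it in every position it can legally occupy, and keep the best count seen. This is only a rough sketch with made-up data, treating blocks as axis-aligned [width, height] rectangles with no rotation; the connector rules would go into canPlace():

def W = 4, H = 4
def blocks = [[2, 2], [3, 1], [1, 3], [2, 1]]   // [width, height] of each piece

def canPlace = { grid, w, h, x, y ->
    if (x + w > W || y + h > H) return false
    for (i in 0..<w) { for (j in 0..<h) { if (grid[x + i][y + j]) return false } }
    true
}

def mark = { grid, w, h, x, y, value ->
    for (i in 0..<w) { for (j in 0..<h) { grid[x + i][y + j] = value } }
}

int best = 0
def search
search = { grid, remaining, placed ->
    best = Math.max(best, placed)
    if (remaining.isEmpty()) return
    def (w, h) = remaining[0]
    def rest = remaining.drop(1)
    search(grid, rest, placed)              // option 1: leave this block out
    for (x in 0..<W) {                      // option 2: try every position for it
        for (y in 0..<H) {
            if (canPlace(grid, w, h, x, y)) {
                mark(grid, w, h, x, y, true)
                search(grid, rest, placed + 1)
                mark(grid, w, h, x, y, false)   // undo the placement (backtrack)
            }
        }
    }
}

search(new boolean[W][H], blocks, 0)
println "most pieces that fit at once: ${best}"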
I am looking to model caches for multicore processors, including cache coherence. Do such PROMELA implementations already exist? I tried searching for them, but couldn't find any. Secondly, if I have to implement it myself, is it feasible in PROMELA to declare very large arrays, e.g. to represent cache structures?
I personally don't know of any such existing Promela models. Moreover, large array structures sound like a serious state blow-up.
Depending on what properties you want to show, I would suggest abstracting from reality as much as possible. Modeling things with high real-world precision is typically not something one should do in Promela.
Two alternative suggestions:
Model your cache in Java and prove first-order assertions with the KeY proof system
Model your cache in a mathematical fashion using the Coq proof assistant and prove the desired theorems with Coq
[This is the type of question that would be closed… but there aren't many people answering Promela/SPIN questions so you won't get 5 close votes.]
A Google search for 'formal verification cache coherence spin' turns up uses of SPIN a couple of times.
There is a yearly SPIN Workshop; full papers are listed for the last 14 years.
The situation and the goal
Imagine a user search system that provides a proximity search from a user’s own position, which is specified by a decimal latitude/longitude combination. An Atlanta resident’s position, for instance, would be represented by 33.756944,-84.390278, and a perimeter search by this user should yield other users in his area within a radius of 10 mi, 50 mi, and so on.
A table-valued function calculates the distances and returns users accordingly, ordered by ascending distance to the user who started the search. It’s always a live query, and it’s a tough and frequent one. Now, we want to build some sort of caching to reduce the load.
On the way to solutions
So far, all users are grouped by the integer portion of their lat/long. The idea is to create cache files with all users from a grid square, so accessing the relevant cache file would be easy. If a grid square contains more users than a cache file should hold, the square is quartered, or further divided into eight pieces, and so on. To fully utilize a square and its cache file, multiple overlapping squares are contemplated. One deficiency of this approach is that gridding and quartering high-density metropolitan areas and spacious countryside into overlapping cache files may not be optimal.
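To make that more concrete, the cache key for a user's cell could be derived roughly like this (a sketch; the 1-degree cell and the halving step simply mirror the quartering described above):

// Derive a cache key for a lat/long at a given cell size (in degrees).
// cellSize = 1.0 is the "integer portion" grid; halving it quarters each cell.
def cellKey = { double lat, double lon, double cellSize ->
    long row = (long) Math.floor(lat / cellSize)
    long col = (long) Math.floor(lon / cellSize)
    "${cellSize}:${row}:${col}"
}

println cellKey(33.756944, -84.390278, 1.0)    // Atlanta, coarsest level: 1.0:33:-85
println cellKey(33.756944, -84.390278, 0.5)    // one quartering step down: 0.5:67:-169

// When a cell's cache file grows past its limit, re-bucket its users with
// cellSize / 2 and store them under the finer keys instead.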
Reading on, I stumbled upon topics like nearest-neighbor searches, the Manhattan distance, and tree-like space-partitioning techniques such as k-d trees, quadtrees, or binary space partitioning. Also, SQL Server provides its own geographical data types and functions (though I’d guess the pure-mathematical FLOAT approach has adequate performance). And of course, the crux is making user-centric proximity searches cacheable.
Question!
I haven’t found many resources on this, but I’m sure I’m not the first one with this plan. Remember, it’s not about the search, but about caching.
Can I scrap my approach?
Is there an advantageous way to partition users into geographical divisions of equal size?
Is there a best practice for storing spatial user information for efficient proximity searches?
What do you think of the techniques mentioned above (quadtrees, etc.) and how would you pair them with caching?
Do you know an example of successfully caching user-specific proximity search?
Can I scrap my approach?
You can adapt your approach because, as you already noted, a quadtree uses this technique. Or you could use a geospatial extension; one is available for MySQL, too.
Is there an advantageous way to partition users into geographical divisions of equal size?
A simple fixed grid of equal-sized cells is fine when locations are evenly distributed or the area is very small. Geo locations are hardly ever evenly distributed, so usually a geospatial structure is used; see the next answer.
Is there a best practice for storing spatial user information for efficient proximity searches?
A quadtree, k-d tree, or R-tree.
What do you think of the techniques mentioned above (quadtrees, etc.) and how would you pair them with caching?
There is some work by Hanan Samet that describes quadtrees and caching.
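To sketch how a quadtree pairs with caching (my own rough illustration, not taken from Samet's work): each leaf holds at most a fixed number of users and splits into four children when it overflows, and the leaves, rather than fixed grid squares, become the cache units, so dense metropolitan areas end up with small cells and the countryside with large ones. A proximity query then only has to load the few leaves whose bounding boxes intersect the search radius.

class QuadTree {
    static final int CAPACITY = 4             // max users per leaf before it splits
    double minLat, minLon, maxLat, maxLon     // bounding box of this node
    List points = []                          // [lat, lon, userId] triples
    List<QuadTree> children = null            // null while this node is a leaf

    QuadTree(double minLat, double minLon, double maxLat, double maxLon) {
        this.minLat = minLat; this.minLon = minLon
        this.maxLat = maxLat; this.maxLon = maxLon
    }

    boolean contains(double lat, double lon) {
        lat >= minLat && lat < maxLat && lon >= minLon && lon < maxLon
    }

    void insert(double lat, double lon, userId) {
        if (!contains(lat, lon)) return
        if (children == null) {
            points << [lat, lon, userId]
            if (points.size() > CAPACITY) split()
        } else {
            children.each { it.insert(lat, lon, userId) }
        }
    }

    private void split() {
        double midLat = (minLat + maxLat) / 2
        double midLon = (minLon + maxLon) / 2
        children = [
            new QuadTree(minLat, minLon, midLat, midLon),
            new QuadTree(minLat, midLon, midLat, maxLon),
            new QuadTree(midLat, minLon, maxLat, midLon),
            new QuadTree(midLat, midLon, maxLat, maxLon)
        ]
        points.each { p -> children.each { c -> c.insert(p[0], p[1], p[2]) } }
        points = []
    }

    // Each leaf is the cacheable unit; its bounding box gives it a stable cache key.
    String cacheKey() { "${minLat}:${minLon}:${maxLat}:${maxLon}" }

    // Collect the leaves whose boxes intersect a query box: only those caches must be loaded.
    void leavesIntersecting(double qMinLat, double qMinLon, double qMaxLat, double qMaxLon, List result) {
        if (maxLat < qMinLat || minLat > qMaxLat || maxLon < qMinLon || minLon > qMaxLon) return
        if (children == null) result << this
        else children.each { it.leavesIntersecting(qMinLat, qMinLon, qMaxLat, qMaxLon, result) }
    }
}

// A user near Atlanta lands in a small leaf only once the area is dense enough to split;
// the leaf's key identifies the cache file to read or (re)build for that region.
def tree = new QuadTree(-90, -180, 90, 180)
tree.insert(33.756944, -84.390278, 'someUserId')
def leaves = []
tree.leavesIntersecting(33.0, -85.0, 34.5, -83.5, leaves)
leaves.each { println it.cacheKey() }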