How do I look up a Vertex using its ID? - giraph

I have a graph computation that passes 'visited' Vertex IDs around, and I need to output information from those in the output phase. How do I look up a Vertex from its ID? I found Partition.getVertex(), but IIUC there is no guarantee that an arbitrary Vertex will be in a particular partition. Thanks in advance.

AFAIK you can't simply look up an arbitrary vertex by its ID. That's why you use the computation phase to store all the necessary information inside the vertices themselves, so it is already there when they are written out.
Doing it differently would, to my knowledge, go completely against Giraph's programming paradigm.
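As a hedged illustration of that pattern (the class name, writable types, and message semantics are assumptions, not from the question): each vertex appends the IDs it receives to its own value during compute(), so the information is already sitting on the vertex when the output format writes it out.

import java.io.IOException;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

// Sketch only: accumulate received vertex IDs into the vertex value so they
// are available in the output phase. Types and names are illustrative.
public class CollectVisitedIds extends BasicComputation<
        LongWritable, Text, NullWritable, LongWritable> {

    @Override
    public void compute(Vertex<LongWritable, Text, NullWritable> vertex,
                        Iterable<LongWritable> messages) throws IOException {
        StringBuilder visited = new StringBuilder(vertex.getValue().toString());
        for (LongWritable id : messages) {
            visited.append(' ').append(id.get());   // remember which IDs reached us
        }
        vertex.setValue(new Text(visited.toString()));
        vertex.voteToHalt();
    }
}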

Related

How to query a Titan index (standard) directly to retrieve vertices in sorted order

I am using Rexster/TITAN 0.4 over Cassandra.
The vertex keys are indexed using a standard index, as shown below.
g.makeKey("domain").dataType(String.class).indexed("standard", Vertex.class).make();
I am not using Uniqueness for performance and scalability.
There are around ~10M vertices in graph.
My requirement is to iterate over all vertices, identify any duplicates, and then remove them.
Is there a way to get a sorted list of vertices directly from the index that is already present?
Something like a direct query on the standard Titan index, similar to a "Direct Index Query".
That way I could partition the full vertex set into smaller batches and process them individually.
If that is not possible, what is the best way to achieve this?
I don't want to use Titan-Hadoop or similar solution just for finding/removing duplicates in graph.
I want to run the below query to get 1000 vertices in the sorted order.
gremlin> g.V.has('domain').domain.order[0..1000]
WARN com.thinkaurelius.titan.graphdb.transaction.StandardTitanTx - Query requires iterating over all vertices [(domain <> null)]. For better performance, use indexes
But this query does not use the standard index created on 'domain', and it fails with an out-of-memory exception; I have ~10M vertices in the graph.
How can I force gremlin to use index in this particular case?
The answer is the same as the one I provided in the comments of your previous question:
Throw more memory at the problem (i.e. increase -Xmx to the console or whatever application is running your query) - which would be a short-term solution.
Use titan-hadoop.
Restructure your graph or queries in some way to allow a use of an index. This could mean giving up some performance on insert and using a uniqueness lock. Maybe you don't have to remove duplicates in your source data - perhaps you can dedup them in your Gremlin queries at the time of traversal. The point is that you'll need to be creative.
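For the third option, here is a rough sketch of what the key definition could look like with uniqueness enforced, using Titan 0.4's Java API (the configuration file name is an assumption). Uniqueness adds locking overhead on insert, but it prevents duplicate domains from ever entering the graph in the first place:

import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import com.tinkerpop.blueprints.Vertex;

public class DefineUniqueDomainKey {
    public static void main(String[] args) {
        // Config path is an assumption; point it at your Cassandra-backed Titan config.
        TitanGraph g = TitanFactory.open("titan-cassandra.properties");
        // Unique + indexed key: locking slows inserts but rules out duplicate domains.
        g.makeKey("domain")
         .dataType(String.class)
         .indexed("standard", Vertex.class)
         .unique()
         .make();
        g.commit();
        g.shutdown();
    }
}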
Despite your reluctance to use titan-hadoop "just for finding/removing duplicates in the graph", that's exactly the use case it is good at. You have a batch process that must iterate over all vertices, it can't fit in the memory you've allotted, and you don't want to use titan-hadoop. That's a bit like saying: "I have a nail and a hammer, but I don't want to use the hammer to bang in the nail." :)
How can I force gremlin to use index in this particular case?
There is no way in Gremlin to do this. In theory, there might be a way to read from Cassandra directly (bypassing Titan), decode the binary result, and somehow iterate and delete, but it's not known to me. Even if you figured it out, which would mean many hours digging into the depths of Titan to see how the index data is stored, it would be a hack that is likely to break whenever you upgrade Titan, as the core developers might close that avenue at any point because you are circumventing Titan in an unexpected way.
The best option is to simply use titan-hadoop to solve your problem. Unless your graph is completely static and no longer growing, you will reach a point where titan-hadoop is inevitable. How will you be sure that your graph is growing correctly when you have 100M+ edges? How will you gather global statistics about your data? How will you repair bad data that got into the database from a bug in your code? All of those things become issues when your graph reaches a certain scale and titan-hadoop is your only friend there at this time.

How to chain together points in an array

I have a series of points with lengths and rotations.
I need to create separate chains from points whose lines overlap but I’m having real trouble doing this efficiently.
I have an array of simple Point objects, in no particular order, and I can loop through them and test them with a simple "intersect" function. I need to end up with an array of chains, each with an ordered list of points. (Or another way of representing the chains).
At the moment every avenue I explore seems to involve a convoluted hack of arrays, nudges and fudges. Having never studied Computer Science I wonder if there is some sort of data structure or technique that would lend itself well to this sort of thing.
Could anyone point me in the right direction to achieving this? Pseudocode is fine (or indeed any language), although I am coding in Processing/Java if that helps.
Many thanks,
Josh
You can use the union-find algorithm to find the joined sets; a rough sketch follows below.
If your sets are always well defined (no multiple intersections and so on), then some modifications seem possible to build the chains during the join process:
for every set, keep the two extreme segments in addition to the 'representative', and update them when joining.
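Here is a minimal union-find sketch, assuming a Point type with an intersects() test standing in for your own classes. It only groups the points into unordered chains; ordering each chain (e.g. by walking from one extreme segment) would be a separate step:

import java.util.*;

// Minimal disjoint-set (union-find) sketch: any two points whose segments
// intersect end up in the same chain. Point/intersects() are assumptions.
interface Point { boolean intersects(Point other); }

public class ChainFinder {
    int[] parent;

    ChainFinder(int n) {
        parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
    }

    int find(int i) {                          // representative, with path compression
        if (parent[i] != i) parent[i] = find(parent[i]);
        return parent[i];
    }

    void union(int a, int b) {                 // merge the two sets
        parent[find(a)] = find(b);
    }

    static Map<Integer, List<Integer>> chains(Point[] points) {
        ChainFinder uf = new ChainFinder(points.length);
        for (int i = 0; i < points.length; i++)
            for (int j = i + 1; j < points.length; j++)
                if (points[i].intersects(points[j]))   // the poster's own test
                    uf.union(i, j);
        Map<Integer, List<Integer>> result = new HashMap<>();
        for (int i = 0; i < points.length; i++)
            result.computeIfAbsent(uf.find(i), k -> new ArrayList<>()).add(i);
        return result;
    }
}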

child node in MBR (R-Tree Implementation)

I am new to the R-Tree concept, so sorry if I ask a very basic question about it. I have read some literature on R-Trees to get the basic concept, but I could not understand the clustering or grouping step for the MBRs. What's bothering me is:
How many points or objects can fit in each MBR? I can see that the number of objects stored in each MBR varies. Is there any condition, procedure, or formula to determine how many objects will be stored in each MBR?
Thanks for your help! Gracias!
Read the R-tree publication, or a book on index structures.
You fix a page size (because the R-tree is a disk-oriented data structure, this should be something like 8 KB).
If a page gets too empty, it will be removed. If a page is too full, it will be split.
Just like with pretty much any other page-based tree, actually (e.g. B-tree).
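As a rough back-of-the-envelope sketch (all sizes are assumptions): with an 8 KB page, a 2D MBR of four doubles plus a child pointer is about 40 bytes, so one node holds on the order of 200 entries, and the tree keeps every node between a minimum fill factor and that maximum:

// Rough capacity estimate for one R-tree node, assuming 8 KB pages and
// 2D double-precision MBRs; real implementations also reserve header space.
public class RTreeFanout {
    public static void main(String[] args) {
        int pageSize     = 8 * 1024;   // bytes per disk page (assumption)
        int mbrBytes     = 4 * 8;      // xmin, ymin, xmax, ymax as doubles
        int pointerBytes = 8;          // child page id / record id
        int entryBytes   = mbrBytes + pointerBytes;

        int maxEntries = pageSize / entryBytes;       // upper bound per node
        int minEntries = (int) (maxEntries * 0.4);    // typical minimum fill factor

        System.out.printf("max %d, min %d entries per node%n", maxEntries, minEntries);
    }
}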

How should I store a sparse decision tree(move list) in a database?

I have been thinking about making an AI for a board game for a long time, and recently I've started to gather resources and algorithms. The game is non-random; most of the time there are fewer than 3 moves for a player, but sometimes there are more than 20. I would like to store critical or ambiguous moves so that the AI learns from its mistakes and will not make the same mistake next time. Moves that surely win or lose need not be stored. So I effectively have a sparse decision tree for the beginning of games.
I would like to know how I should store this decision tree in a database? The database does not need to be SQL, and I do not know which database is suitable for this particular problem.
EDIT: Please do not tell me to parse the decision tree into memory; just imagine the game is as complicated as chess.
As you will be traversing the tree, neo4j seems like a good solution to me. SQL is not a good choice because of the many joins your queries would need. As I understand the question, you are asking for a way to store a graph in a database, and neo4j is a database built explicitly for graphs. For the sparseness, you can attach arrays of primitives or Strings to the edges of your graph to encode sequences of moves, using PropertyContainers (am I right that by sparseness and skipping of nodes you mean your tree edges are sequences of moves rather than single moves?).
Firstly, what you are trying to do sounds like a case-based reasoning (CBR) problem; see http://en.wikipedia.org/wiki/Case-based_reasoning#Prominent_CBR_systems . CBR keeps a database of decisions, and your system would in theory pick the best outcome available.
Therefore I would suggest using neo4j which is a nosql graph database. http://neo4j.org/
So to represent your game each position is a node in the graph, and each node should contain potential moves from said position. You can track scoring metrics which are learnt as games progress so that the AI is more informed.
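A minimal sketch of that layout with the Neo4j 2.x embedded Java API (the database path, relationship type, and property names are assumptions, not from the answer): each position is a node, each move is a relationship carrying a score that the AI can update as it learns.

import org.neo4j.graphdb.*;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class MoveGraph {
    enum RelTypes implements RelationshipType { MOVE }

    public static void main(String[] args) {
        // Database path is an assumption.
        GraphDatabaseService db =
                new GraphDatabaseFactory().newEmbeddedDatabase("moves.db");

        try (Transaction tx = db.beginTx()) {
            Node start = db.createNode();
            start.setProperty("position", "serialized-start-state");   // illustrative encoding

            Node next = db.createNode();
            next.setProperty("position", "serialized-next-state");

            Relationship move = start.createRelationshipTo(next, RelTypes.MOVE);
            move.setProperty("score", 0.0);    // learnt metric, updated as games progress
            tx.success();
        }
        db.shutdown();
    }
}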
I would use a document database (NOSQL) like RavenDB because you can store any data structure in the database.
Documents aren't flat like in a normal SQL database and that allows you to store hierarchical data like trees directly:
{
  decision: 'Go forward',
  childs: [
    { decision: 'Go backwards' },
    {
      decision: 'Stay there',
      childs: [
        { decision: 'Go backwards' }
      ]
    }
  ]
}
Here you can see an example JSON tree which can be stored in RavenDB.
RavenDB also has a built-in feature to query hierarchical data:
http://ravendb.net/faq/hierarchies
Please look at the documentation to get more information how RavenDB works.
Resources:
What type of NoSQL database is best suited to store hierarchical data?
You can use a memory-mapped file as storage.
First, create a "compiler". This compiler will parse the text file and convert it into a compact binary representation. The main application will then map this optimized binary file into memory, which solves the problem of memory-size limitations.
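A sketch of the mapping step with java.nio (the file name and the fixed record layout are assumptions):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch: map the pre-compiled binary move file into memory and read
// fixed-size records from it without loading the whole file onto the heap.
public class MappedMoveBook {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile("moves.bin", "r");
             FileChannel channel = file.getChannel()) {
            MappedByteBuffer buf =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            // Assumed record layout: 8-byte position hash followed by a 4-byte score.
            while (buf.remaining() >= 12) {
                long hash = buf.getLong();
                int score = buf.getInt();
                // ... look up / process the entry ...
            }
        }
    }
}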
Start with a simple database table design.
Decisions:
CurrentState BINARY(57) | NewState BINARY(57) | Score INT
CurrentState and NewState are serialized versions of the game state. Score is a weight given to the NewState (positive scores are good moves, negative scores are bad moves); your AI can update these scores appropriately.
Renju uses a 15x15 board; each location can be black, white, or empty, so you need Ceiling((2 bits * 15 * 15) / 8) = 57 bytes to serialize the board. In T-SQL that is a BINARY(57).
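A hedged sketch of that serialization (the board representation is an assumption): pack each cell into 2 bits, which gives exactly the 57 bytes above.

// Sketch: pack a 15x15 board (0 = empty, 1 = black, 2 = white) into 2 bits
// per cell, i.e. ceil(15*15*2 / 8) = 57 bytes, matching BINARY(57).
public class BoardCodec {
    static byte[] serialize(int[][] board) {
        byte[] out = new byte[57];
        int bit = 0;
        for (int r = 0; r < 15; r++) {
            for (int c = 0; c < 15; c++) {
                int v = board[r][c] & 0b11;                 // two bits per cell
                out[bit / 8] |= (byte) (v << (bit % 8));
                bit += 2;
            }
        }
        return out;
    }
}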
Your AI would select the current moves it has stored like...
SELECT NewState FROM Decisions WHERE CurrentState = #SerializedState ORDER BY Score DESC
You'll get a list of all the stored next moves from the current game state in order of best score to least score.
Your table structure would have a Composite Unique Index (primary key) on (CurrentState, NewState) to facilitate searching and avoid duplicates.
This isn't the best or most optimal solution, but given your lack of DB knowledge I believe it would be the easiest to implement and would give you a good start.
If I compare with chess engines, they play from memory, apart perhaps from opening libraries. Chess is far too complicated to store a complete decision tree. Chess engines play by assigning heuristic evaluations to potential, transient future positions (not moves). Future positions are found by some kind of depth-limited search and may be cached in memory for a while, but often they are simply recalculated each turn, because the search space is too big to store in a form that is faster to look up than to recompute.
Do you know Chinook — the AI that solves checkers? It does this by compiling a database of every possible endgame. While this is not exactly what you are doing, you might learn from it.
I can't clearly picture either the data structures you handle in your tree or their complexity.
But here are some thoughts which may interest you:
Map your decision tree onto a sparse matrix; a tree is a graph, after all.
Devise a storage/retrieval strategy that takes advantage of sparse-matrix properties.
I would approach this in the traditional way an opening book is handled in chess engines:
Generate all possible moves
For each move:
Make that move
Look the resulting position up in your database
Undo the move
Make the move that had the highest score in your database
Looking up a move
Chess engines usually compute a hash of the current game state via Zobrist hashing, which is a simple way to construct a good hash function for game states.
The big advantage of this approach is that it takes care of transpositions, that is, if the same state can be reached via alternate paths, you don't need to worry about those alternate paths, only about the game states themselves.
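A minimal Zobrist hashing sketch (the board size and piece encoding are assumptions): XOR together one pre-generated random number per (square, piece) pair, so the same position always hashes to the same value no matter which move order reached it.

import java.util.Random;

// Zobrist hashing sketch for a 15x15 board with two piece types.
public class Zobrist {
    static final long[][] TABLE = new long[225][2];   // [square][piece]
    static {
        Random rnd = new Random(42);                   // fixed seed => stable hashes across runs
        for (int sq = 0; sq < 225; sq++)
            for (int piece = 0; piece < 2; piece++)
                TABLE[sq][piece] = rnd.nextLong();
    }

    // board[square] = 0 (empty), 1 (black), 2 (white)
    static long hash(int[] board) {
        long h = 0L;
        for (int sq = 0; sq < board.length; sq++)
            if (board[sq] != 0)
                h ^= TABLE[sq][board[sq] - 1];
        return h;
    }
    // Placing or removing a stone is a single XOR, so the hash can also be
    // updated incrementally during make/undo move.
}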
How chess engines do this
Most chess engines use static opening books that are compiled from recorded games and hence use a simple binary file that maps these hashes to a score; e.g.
struct book_entry {
    uint64_t hash;
    uint32_t score;
};
The entries are then sorted by hash, and thanks to operating system caching, a simple binary search through the file will find the needed entries very quickly.
Updating the scores
However, if you want the engine to learn continuously, you will need a more complicated data structure; at this point it is usually not worth doing it yourself, and you should use an available library. I would probably use LevelDB, but anything that lets you store key-value pairs is fine (Redis, SQLite, GDBM, etc.).
Learning the scores
How exactly you update the scores depends on your game. In games with a lot of data available, a simple approach such as just storing the percentage of games won after the move that resulted in the position works; if you have less data, you can store the result of a game tree search from the position in question as score. Machine learning techniques such as Q learning are also a possibility, although I do not know of a program that actually does this in practice.
I'm assuming your question is asking about how to convert a decision tree into a serial format that can be written to a location and later used to reconstruct the tree.
Try using a pre-order traversal of the tree, using a toString() function (or its equivalent) to convert the data stored at each node of the decision tree to a textual descriptor. By pre-order traversal, I mean implementing an algorithm that first performs the toString() operation on the node, and writes the output to a database or file, and then recursively performs the same operation on its child nodes, in a specified order. Because you are dealing with a sparse tree, your toString() operation should also include information about the existence or non-existence of subtrees.
Reconstructing the tree is simple - the first stored value is the root node, the second is the root member of the left subtree, and so on. The serial data stored for each node should provide information as to which subtree the next inputted node should belong to.
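A small sketch of that idea, assuming binary nodes and a "#" marker for missing subtrees (both are simplifications, not from the answer): write each node, then recurse into the children in a fixed order, so the sparse shape can be rebuilt by reading in the same order.

import java.util.Iterator;

// Pre-order serialization of a sparse decision tree; "#" marks an absent
// subtree so the shape survives the round trip. Node shape is an assumption.
public class TreeSerializer {
    static class Node {
        String decision;
        Node left, right;
        Node(String d) { decision = d; }
    }

    static void serialize(Node node, StringBuilder out) {
        if (node == null) { out.append("#,"); return; }   // explicit "no subtree" marker
        out.append(node.decision).append(',');
        serialize(node.left, out);
        serialize(node.right, out);
    }

    static Node deserialize(Iterator<String> tokens) {
        String tok = tokens.next();
        if (tok.equals("#")) return null;
        Node node = new Node(tok);
        node.left = deserialize(tokens);                  // same fixed order as serialization
        node.right = deserialize(tokens);
        return node;
    }
}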

How should I change my Graph structure (very slow insertion)?

This program I'm writing is about a social network, which means there are users and their profiles. The profile structure is UserProfile.
Now, there are various possible Graph implementations and I don't think I'm using the best one. I have a Graph structure and inside, there's a pointer to a linked list of type Vertex. Each Vertex element has a value, a pointer to the next Vertex and a pointer to a linked list of type Edge. Each Edge element has a value (so I can define weights and whatever it's needed), a pointer to the next Edge and a pointer to the Vertex owner.
I have 2 sample files with data to process (in CSV style) and insert into the Graph. The first one contains the user data (one user per line); the second one contains the user relations (for the graph). The first file is inserted into the graph quickly because I always insert at the head and there are ~18000 users. The second file takes ages, even though I still insert the edges at the head. That file has about ~520000 lines of user relations and takes between 13 and 15 minutes to insert into the Graph. I made a quick test and reading the data is pretty quick, practically instantaneous. The problem is the insertion.
This problem exists because my Graph is implemented with linked lists for the vertices. Every time I need to insert a relation, I have to look up 2 vertices so I can link them together. This is the problem... Doing this for ~520000 relations takes a while.
How should I solve this?
Solution 1) Some people recommended that I implement the Graph (the vertices part) as an array instead of a linked list. This way I have direct access to every vertex and insertion time will probably drop considerably. But I don't like the idea of allocating an array with 18000 elements. How practical is that? My sample data has ~18000 users, but what if I need far fewer or far more? The linked list approach has that flexibility: I can have whatever size I want as long as there's memory for it. The array doesn't, so how am I going to handle such a situation? What are your suggestions?
Using linked lists is good for space complexity but bad for time complexity. And using an array is good for time complexity but bad for space complexity.
Any thoughts about this solution?
Solution 2) This project also demands that I have some sort of data structure that allows quick lookup based on a name index and an ID index. For this I decided to use Hash Tables. My tables are implemented with separate chaining as collision resolution, and when a load factor of 0.70 is reached, I normally recreate the table. I base the next table size on this link.
Currently, both Hash Tables hold a pointer to the UserProfile instead of duplicating the user profile itself. That would be stupid: changing data would require 3 changes, and it's really dumb to do it that way. So I just save the pointer to the UserProfile. The same user profile pointer is also saved as the value in each Graph Vertex.
So, I have 3 data structures, one Graph and two Hash Tables and every single one of them point to the same exact UserProfile. The Graph structure will serve the purpose of finding the shortest path and stuff like that while the Hash Tables serve as quick index by name and ID.
What I'm thinking to solve my Graph problem is to, instead of having the Hash Tables value point to the UserProfile, I point it to the corresponding Vertex. It's still a pointer, no more and no less space is used, I just change what I point to.
Like this, I can easily and quickly lookup for each Vertex I need and link them together. This will insert the ~520000 relations pretty quickly.
I thought of this solution because I already have the Hash Tables and I need to have them, then, why not take advantage of them for indexing the Graph vertices instead of the user profile? It's basically the same thing, I can still access the UserProfile pretty quickly, just go to the Vertex and then to the UserProfile.
But, do you see any cons on this second solution against the first one? Or only pros that overpower the pros and cons on the first solution?
Other Solution) If you have any other solution, I'm all ears. But please explain the pros and cons of that solution over the previous 2. I really don't have much time to waste on this right now; I need to move on with this project, so if I'm going to make such a change, I need to understand exactly what to change and whether that's really the way to go.
Hopefully no one fell asleep reading this and closed the browser, sorry for the big testament. But I really need to decide what to do about this and I really need to make a change.
P.S: When answering my proposed solutions, please enumerate them as I did, so I know exactly what you are talking about and don't get confused more than I already am.
Since the main issue here is speed, I would prefer the array approach (Solution 1).
You should, of course, maintain the hash table for the name-index lookup.
If I understood correctly, you only process the data one time. So there is no dynamic data insertion.
To deal with the space allocation problem, I would recommend:
1 - Read the file once to get the number of vertices.
2 - Allocate that space.
If your data is dynamic, you could implement a simple method to grow the array in steps of 50% (see the sketch after this list).
3 - For the edges, replace your linked list with an array. This array should also be grown dynamically in steps of 50%.
Even with the "extra" space allocated, when you grow the array in steps of 50%, the total size it uses should only be marginally larger than with the linked list.
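A small sketch of that growth strategy (names and initial capacity are illustrative): the array grows by 50% only when it is full, so insertion stays amortized O(1) and the slack space is bounded.

import java.util.Arrays;

// Sketch of a vertex array that grows by 50% when full, as suggested above.
// The element type is a stand-in for the poster's own Vertex struct.
public class VertexArray {
    Object[] vertices = new Object[16];   // small initial capacity (assumption)
    int count = 0;

    void add(Object vertex) {
        if (count == vertices.length) {
            // Grow by 50%: amortized O(1) insertion, at most ~50% slack space.
            vertices = Arrays.copyOf(vertices, vertices.length + vertices.length / 2);
        }
        vertices[count++] = vertex;
    }
}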
I hope I could help.