How should I store a sparse decision tree (move list) in a database?

I have been thinking about writing an AI for a board game for a long time, and recently I've started to gather resources and algorithms. The game is non-random; most of the time there are fewer than 3 moves available to a player, though sometimes there are more than 20. I would like to store critical or ambiguous moves so that the AI learns from its mistakes and will not make the same mistake the next time. Moves that surely win or lose need not be stored. So I effectively have a sparse decision tree for the beginning of games.
I would like to know how I should store this decision tree in a database. The database does not need to be SQL, and I do not know which kind of database is suitable for this particular problem.
EDIT: Please do not tell me to parse the decision tree into memory; imagine the game being as complicated as chess.

As you will be traversing the tree, neo4j seems like a good solution to me. SQL is no good choice because of the many joins your queries would need. As I understand the question, you are asking for a way to store a graph in a database, and neo4j is a database built explicitly for graphs. For the sparseness, you can attach arrays of primitives or Strings to the edges of your graph to encode sequences of moves, using PropertyContainers (am I right that by sparseness and skipping of nodes you mean your tree edges are sequences of moves rather than single moves?).

Firstly, what you are trying to do sounds like a case-based reasoning (CBR) problem; see http://en.wikipedia.org/wiki/Case-based_reasoning#Prominent_CBR_systems . CBR keeps a database of decisions, and your system would in theory pick the best outcome available.
Therefore I would suggest using neo4j, which is a NoSQL graph database. http://neo4j.org/
To represent your game, each position is a node in the graph, and each node should contain the potential moves from that position. You can track scoring metrics which are learnt as games progress, so that the AI becomes better informed.
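For illustration, a minimal sketch of that layout with the official Neo4j Python driver (version 5+); the Position label, MOVE relationship type and score property are names invented for this example, not anything neo4j prescribes:

from neo4j import GraphDatabase  # official Neo4j Python driver

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def record_move(tx, state, next_state, delta):
    # MERGE creates the position nodes and the MOVE relationship only once;
    # the score property accumulates what the AI has learnt so far.
    tx.run("MERGE (a:Position {state: $state}) "
           "MERGE (b:Position {state: $next}) "
           "MERGE (a)-[m:MOVE]->(b) ON CREATE SET m.score = 0 "
           "SET m.score = m.score + $delta",
           state=state, next=next_state, delta=delta)

def best_moves(tx, state):
    # Moves out of a position, best learnt score first.
    result = tx.run("MATCH (:Position {state: $state})-[m:MOVE]->(b:Position) "
                    "RETURN b.state AS next, m.score AS score "
                    "ORDER BY m.score DESC", state=state)
    return [(r["next"], r["score"]) for r in result]

with driver.session() as session:
    session.execute_write(record_move, "pos-a", "pos-b", 1)
    print(session.execute_read(best_moves, "pos-a"))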

I would use a document database (NoSQL) like RavenDB, because you can store any data structure in it.
Documents aren't flat like rows in a normal SQL database, which lets you store hierarchical data such as trees directly:
{
    "decision": "Go forward",
    "children": [
        { "decision": "Go backwards" },
        {
            "decision": "Stay there",
            "children": [
                { "decision": "Go backwards" }
            ]
        }
    ]
}
Here you can see an example JSON tree which can be stored in RavenDB.
RavenDB also has a built-in feature to query hierarchical data:
http://ravendb.net/faq/hierarchies
Please look at the documentation for more information on how RavenDB works.
Resources:
What type of NoSQL database is best suited to store hierarchical data?

You can use a memory-mapped file as storage.
First, create a "compiler". This compiler parses a text file and converts it into a compact binary representation. The main application then maps this optimized binary file into memory. This solves your problem with the memory-size limitation, because the OS only pages in the parts of the file that are actually accessed.
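A minimal sketch of that idea in Python, assuming a text file of whitespace-separated (hash, score) pairs; the record layout is made up for the example:

import mmap, struct

# "Compiler": pack each (state hash, score) pair from the text file into
# fixed-width binary records, sorted by hash so a reader can binary-search.
def compile_book(text_path, bin_path):
    entries = sorted((int(h), int(s))
                     for h, s in (line.split() for line in open(text_path)))
    with open(bin_path, "wb") as out:
        for h, s in entries:
            out.write(struct.pack("<Qi", h, s))  # 8-byte hash, 4-byte score

# Main application: map the file instead of loading it into the heap;
# the OS pages in only the records that are actually touched.
def open_book(bin_path):
    f = open(bin_path, "rb")
    return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)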

Start with a simple database table design.
Decisions:
CurrentState BINARY(57) | NewState BINARY(57) | Score INT
CurrentState and NewState are serialized versions of the game state. Score is a weight given to the NewState (positive scores are good moves, negative scores are bad moves); your AI can update these scores appropriately.
Renju uses a 15x15 board, and each location can be black, white or empty, so you need Ceiling( (2 bits * 15 * 15) / 8 ) = 57 bytes to serialize the board. In T-SQL that would be a BINARY(57).
Your AI would select the stored moves for the current position like...
SELECT NewState FROM Decisions WHERE CurrentState = @SerializedState ORDER BY Score DESC
You'll get a list of all the stored next moves from the current game state in order of best score to least score.
Your table structure would have a Composite Unique Index (primary key) on (CurrentState, NewState) to facilitate searching and avoid duplicates.
This isn't the best or most optimal solution, but given your stated lack of DB experience I believe it would be the easiest to implement and would give you a good start.
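For instance, a sketch of the same table using SQLite as a stand-in for T-SQL (BLOB replaces BINARY(57); names follow the design above):

import sqlite3

con = sqlite3.connect("decisions.db")
con.execute("""CREATE TABLE IF NOT EXISTS Decisions (
    CurrentState BLOB NOT NULL,   -- 57-byte serialized position
    NewState     BLOB NOT NULL,
    Score        INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (CurrentState, NewState)  -- composite key avoids duplicates
)""")

def next_moves(serialized_state):
    # Best learnt continuation first, as in the SELECT above.
    return con.execute("SELECT NewState, Score FROM Decisions "
                       "WHERE CurrentState = ? ORDER BY Score DESC",
                       (serialized_state,)).fetchall()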

If I compare with chess engines: those play from memory, apart perhaps from opening libraries. Chess is too complicated to store a deciding decision tree for. Chess engines play by assigning heuristic evaluations to potential future positions (not moves). Future positions are found by a depth-limited search and may be cached in memory for some time, but often they are simply recalculated each turn, because the search space is too big to store in any form that is faster to look up than to recalculate.

Do you know Chinook, the AI that solved checkers? It did so by compiling a database of every possible endgame. While this is not exactly what you are doing, you might learn from it.

I can't clearly make out either the data structures you handle in your tree or their complexity.
But here are some thoughts which may interest you:
Map your decision tree onto a sparse matrix; a tree is a graph, after all.
Devise a storage/retrieval strategy that takes advantage of sparse-matrix properties (see the sketch below).
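To make the sparse-storage idea concrete, here is a tiny dict-of-keys sketch in Python; the state encoding is left abstract, and only cells that actually occur are stored:

from collections import defaultdict

# One small row-dict per position that actually occurs, instead of a dense
# |states| x |moves| matrix of mostly empty cells.
tree = defaultdict(dict)                # state -> {move: next_state}

tree["start"][3] = "state-after-move-3"
tree["start"][7] = "state-after-move-7"

print(tree["start"])                    # only the two stored moves
print(tree.get("never-seen", {}))       # absent rows cost no memory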

I would approach this the way an opening book is traditionally handled in chess engines:
Generate all possible moves
For each move:
Make that move
Look the resulting position up in your database
Undo the move
Make the move that had the highest score in your database
Looking up a move
Chess engines usually compute a hash of the current game state via Zobrist hashing, which is a simple way to construct a good hash function for game states.
The big advantage of this approach is that it takes care of transpositions, that is, if the same state can be reached via alternate paths, you don't need to worry about those alternate paths, only about the game states themselves.
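A minimal Zobrist-hashing sketch in Python (the board size and piece encoding are just for illustration):

import random

random.seed(42)                    # fixed seed: hashes stay stable across runs
SQUARES, PIECES = 15 * 15, 2       # e.g. a 15x15 board, black and white

# One random 64-bit number per (square, piece) combination.
ZOBRIST = [[random.getrandbits(64) for _ in range(PIECES)]
           for _ in range(SQUARES)]

def zobrist_hash(board):
    # board: list of 225 entries, each None (empty), 0 (black) or 1 (white)
    h = 0
    for square, piece in enumerate(board):
        if piece is not None:
            h ^= ZOBRIST[square][piece]
    return h

# Because XOR is its own inverse, the hash can be updated incrementally
# when a stone is placed, instead of being recomputed from scratch.
def apply_move(h, square, piece):
    return h ^ ZOBRIST[square][piece]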
How chess engines do this
Most chess engines use static opening books that are compiled from recorded games and hence use a simple binary file that maps these hashes to a score; e.g.
struct book_entry {
    uint64_t hash;
    uint32_t score;
};
The entries are then sorted by hash, and thanks to operating system caching, a simple binary search through the file will find the needed entries very quickly.
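A sketch of that lookup in Python over the 12-byte records described above, assuming the file was written sorted by hash:

import struct

RECORD = struct.Struct("<QI")      # uint64_t hash + uint32_t score

def lookup(book, target_hash):
    # book: any bytes-like object, e.g. an mmap of the book file
    lo, hi = 0, len(book) // RECORD.size
    while lo < hi:                 # classic binary search over fixed records
        mid = (lo + hi) // 2
        h, score = RECORD.unpack_from(book, mid * RECORD.size)
        if h == target_hash:
            return score
        if h < target_hash:
            lo = mid + 1
        else:
            hi = mid
    return None                    # position not in the book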
Updating the scores
However, if you want the engine to learn continuously, you will need a more complicated data structure; at that point it is usually not worth rolling your own, and you should use an available library. I would probably use LevelDB, but anything that lets you store key-value pairs is fine (Redis, SQLite, GDBM, etc.).
Learning the scores
How exactly you update the scores depends on your game. In games with a lot of data available, a simple approach works, such as storing the percentage of games won after the move that led to the position; if you have less data, you can store the result of a game-tree search from the position in question as the score. Machine-learning techniques such as Q-learning are also a possibility, although I do not know of a program that actually does this in practice.
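As a sketch, here is the win-percentage bookkeeping with Python's built-in dbm module standing in for LevelDB (any key-value store would do the same job):

import dbm, struct

book = dbm.open("book.db", "c")    # persistent hash -> (wins, games) map

def record_result(position_hash, won):
    key = struct.pack("<Q", position_hash)
    if key in book:
        wins, games = struct.unpack("<II", book[key])
    else:
        wins, games = 0, 0
    book[key] = struct.pack("<II", wins + (1 if won else 0), games + 1)

def score(position_hash):
    key = struct.pack("<Q", position_hash)
    if key not in book:
        return None
    wins, games = struct.unpack("<II", book[key])
    return wins / games            # fraction of games won from this position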

I'm assuming your question asks how to convert a decision tree into a serial format that can be written out and later used to reconstruct the tree.
Try a pre-order traversal of the tree, using a toString() function (or its equivalent) to convert the data stored at each node of the decision tree into a textual descriptor. By pre-order traversal, I mean an algorithm that first performs the toString() operation on a node and writes the output to a database or file, then recursively performs the same operation on its child nodes, in a specified order. Because you are dealing with a sparse tree, your toString() operation should also record the existence or non-existence of subtrees.
Reconstructing the tree is simple: the first stored value is the root node, the second is the root of the left subtree, and so on. The serial data stored for each node tells you which subtree the next node read in belongs to.
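A small Python sketch of this pre-order scheme for a binary decision tree; the "#" sentinel marks a missing subtree (so node labels must never be "#"):

class Node:
    def __init__(self, decision, left=None, right=None):
        self.decision, self.left, self.right = decision, left, right

def serialize(node, out):
    # Pre-order: emit the node first, then its subtrees.
    if node is None:
        out.append("#")            # record the non-existence of a subtree
        return
    out.append(node.decision)
    serialize(node.left, out)
    serialize(node.right, out)

def deserialize(tokens):
    token = next(tokens)
    if token == "#":
        return None
    node = Node(token)
    node.left = deserialize(tokens)   # the value after a node is the root
    node.right = deserialize(tokens)  # of its left subtree, and so on
    return node

root = Node("Go forward", Node("Go backwards"), Node("Stay there"))
out = []
serialize(root, out)
assert deserialize(iter(out)).decision == "Go forward"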

Related

Use a database for artificial intelligence?

I'm working on a toy problem: code that learns to be unbeatable at Tic-Tac-Toe purely by playing itself. I'm a newbie at this sort of thing... I decided to use a random number generator to choose moves for X and for O, and then keep a record in a text file of moves that should never be played again (there is some complexity in that part of the algorithm, but never mind).
My question is: should I use a relational database management system instead of a text file to store the "game states" to avoid? I'm tempted to, because then I could have a unique ID for each game state, and add special tables to keep track of, say, all game states that came before a losing game state (one game state mapped to many previous-move game states, for example). It seems to me that with some smart use of tables, an RDBMS would be far easier to go fishing through than a crude text file. I'm assuming the downside will be the run time needed for the code to train itself (because all the calls to the database could take a while).
Any wisdom would be much appreciated. Is it sane to be thinking about using a database? Is my analysis of the pros and cons accurate? Thanks!
Tic-tac-toe has 9 fields.
Each field can hold one of 3 values (' ', 'X', 'O'). That makes 3^9 = 19,683 possible combinations.
Even a naive encoding of one byte per field gives 19,683 * 9 = 177,147 bytes of data, not counting metadata (and 2 bits per field would suffice).
Additionally, not all combinations are valid (you can't have 3 O's and only 1 X on the board), which makes the number above even smaller.
In other words, unless you have extremely tight time limits, the choice of database is not a performance-driven one.
You should still read about indexing if you use a relational database.
However, I would not recommend storing your data in text files (I am not talking about NoSQL databases here, but actual text files on your hard disk), because that makes access complicated and, if not programmed well, slow.
My final recommendation would be a relational database system, because there you have more restrictions, such as unique IDs and enforced data types, which often leads to fewer mistakes/bugs. You can also give every possible combination a unique key and let your RDBMS enforce it with a unique constraint. Later on you can create a "linking table" that references only the IDs of your unique combinations.
Relational databases (RDBs) are useful when you have to be certain that all data is consistent; however, because of transactions, locks, etc., they are less efficient than key-value stores (Redis) or document databases (MongoDB).
You could use the following approach with MongoDB:
One game is a JSON document that contains:
the ID of the game (e.g. "0001")
the game state (e.g. "0" if 'O' won, "1" if 'X' won, "2" otherwise)
an array that contains every move made during the game
other data like time, date, ...
After that you can simply query the documents, for instance with regular expressions, through a Mongo connector.
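For illustration, with pymongo (the database, collection and field names are made up for this example):

from pymongo import MongoClient

games = MongoClient("mongodb://localhost:27017")["tictactoe"]["games"]

games.insert_one({
    "_id": "0001",
    "state": "1",                  # "0" = O won, "1" = X won, "2" = otherwise
    "moves": [4, 0, 8, 2, 6],      # board squares played, in order
    "date": "2015-06-01",
})

# Every game X won that opened in the centre square.
for game in games.find({"state": "1", "moves.0": 4}):
    print(game["_id"], game["moves"])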
I really love your analysis of the database problem, and it is accurate!
Hope this can be helpful.
Florian made a good start, but note that you have notably fewer than the 3^9 game states listed. If we assume that X moves first, then the count of X entries must be equal to or one more than the count of O entries. For instance, the board XX- | --- | --- is illegal.
There are 7665 unique board positions. This should be small enough to give each one a simple key (such as its integer value in ternary). Use a graph package to build the graph of positions that lead to other positions; label the edges with the choice of move. For instance, if we code the board as 0=blank, 1=X, 2=O, then the opening board (0) transitions to position 1 (aka 000 000 001) on choice 9 (the lower-right corner).
You should be able to hold the entire game tree in memory at once: 7665 nodes with a mean of perhaps 6 edges per node, at 3 words per edge. That's roughly 120K words of memory for the entire tree, before you apply any pruning as you train the game AI.
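For example, here is a short Python sketch (a plain dict rather than a graph package) that enumerates reachable positions breadth-first and labels each edge with the move (1-9) that produced it, matching the coding above:

from collections import deque

LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def won(board, p):
    return any(all(board[i] == p for i in line) for line in LINES)

graph = {}                               # board -> {move: next board}
start = '-' * 9                          # '-' = blank, 'X', 'O'
queue, seen = deque([(start, 'X')]), {start}
while queue:
    board, player = queue.popleft()
    graph[board] = {}
    if won(board, 'X') or won(board, 'O'):
        continue                         # terminal: no outgoing edges
    for i, c in enumerate(board):
        if c == '-':
            nxt = board[:i] + player + board[i+1:]
            graph[board][i + 1] = nxt    # edge labelled with the move 1-9
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, 'O' if player == 'X' else 'X'))

print(len(graph), "positions in the game graph")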

Database Position Index

Does anyone know of any databases (SQL or NoSQL) that have native support for position-based indexes?
To clarify: on many occasions I've had the need to maintain a position-based collection, where the order or position is maintained by an external entity (user, external service, etc.). By maintained I mean the order of the items in the collection will change quite often, but is not based on any data field in the record; the order is completely arbitrary as far as the service maintaining the collection is concerned. The service needs to provide an interface that allows CRUD operations by position (insert after pos X, delete at pos Y, etc.) as well as manipulation of the position itself (move from pos X to pos Y).
I'm aware there are workarounds to achieve this, and I've implemented several myself, but this seems like such a fundamental way to want to index data that I can't help feeling there must be an off-the-shelf solution out there.
The only thing I've seen that comes close is Redis's List data type, which, while ordered by position, is pretty limited (compared to a table with multiple indexes), and Redis is more suited to being a cache than a persistent data store.
Finally, I'm asking because I have a requirement for user-ordered collections that could contain tens of thousands of records.
In case it helps anyone, the best approximation of this I've found so far is to implement a linked-list structure in a graph database (like Neo4j). Maintaining the item links is considerably easier than maintaining a position column (especially if you only need next links, i.e. a singly linked list): there is no need to leave holes, re-index, etc., you only have to move pointers (or relations). The performance is pretty good, but reads slow down linearly as you access items towards the end of the list by position, since you have to scan (SKIP) the list from the start.
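As an in-memory model of the same idea, a singly linked list addressed by position might look like this in Python (class and method names are invented for the example, and it assumes the requested position exists):

class Item:
    def __init__(self, value):
        self.value, self.next = value, None

class PositionList:
    def __init__(self):
        self.head = None

    def node_at(self, pos):
        # The linear scan: reads near the end cost O(pos), which is
        # exactly the SKIP behaviour described above.
        node = self.head
        for _ in range(pos):
            node = node.next
        return node

    def insert_after(self, pos, value):
        # Re-pointing two links; no holes to leave, nothing to re-index.
        item = Item(value)
        prev = self.node_at(pos)
        item.next, prev.next = prev.next, item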

How to persist a graph data structure in a relational database?

I've considered creating a Vertices table and an Edges table, but wouldn't building graphs in memory and traversing sub-graphs require a large number of lookups? I'd like to avoid excessive database reads. Is there any other way of persisting a graph?
Side note: I've heard of Neo4j but my question is really how to conceptually represent a graph in a standard database. I am open to some NoSQL solutions like mongodb though.
The answer is, unfortunately: your consideration is completely right on every point. You have to store nodes (vertices) in one table, and edges referencing a FromNode and a ToNode in another, to convert a graph data structure into a relational data structure. And you are also right that this ends up in a large number of lookups, because you cannot partition the graph into subgraphs that can be queried at once. You have to traverse from node to edge to node to edge to node... and so on (recursively, while SQL works with sets).
The point is...
Relational, graph-oriented, object-oriented and document-based are different kinds of data structures that meet different requirements. That's what it's all about, and it is why so many different NoSQL databases (most of them simple document stores) came up: it simply makes no sense to organize big data relationally.
Alternative 1 - Graph oriented database
But there are also graph-oriented NoSQL databases, which make the graph data model a first-class citizen, like OrientDB, which I am playing around with a little at the moment. The nice thing about it is that although it persists data as a graph, it can still be used in a relational, object-oriented or document-oriented way (i.e. by querying with plain old SQL). Nevertheless, traversing the graph is surely the optimal way to get data out of it.
Alternative 2 - working with graphs in memory
When it comes to fast routing, frameworks like GraphHopper build up the complete graph (billions of nodes) in memory. GraphHopper uses a memory-mapped implementation of its graph store, so it even works on Android devices with only a few MB of memory available. The complete graph is read from the database into memory at startup, and routing is then done there, so there is no need to touch the database at all.
I faced this same issue and decided to finally go with the following structure, which requires 2 database queries; the rest of the work is done in memory:
Store nodes in a table and reference the graph with each node record:
Table Nodes
id | title | graph_id
---------------------
105 | node1 | 2
106 | node2 | 2
Also store edges in another table and again reference the graph these edges belong to with each edge:
Table Edges
id | from_node_id | to_node_id | graph_id
-----------------------------------------
1 | 105 | 106 | 2
2 | 106 | 105 | 2
Get all the nodes with one query, then get all the edges with another.
Now build your preferred way to store the graph (e.g., adjacency list) and proceed with your application flow.
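For instance, with the two tables above in SQLite (any relational database works the same way; the demo at the end assumes the sample rows shown above exist):

import sqlite3
from collections import defaultdict

con = sqlite3.connect("graphs.db")

def load_graph(graph_id):
    # Query 1: all nodes of this graph.
    nodes = dict(con.execute(
        "SELECT id, title FROM Nodes WHERE graph_id = ?", (graph_id,)))
    # Query 2: all edges of this graph.
    adjacency = defaultdict(list)
    for frm, to in con.execute(
            "SELECT from_node_id, to_node_id FROM Edges WHERE graph_id = ?",
            (graph_id,)):
        adjacency[frm].append(to)
    return nodes, adjacency        # traverse in memory from here on

nodes, adjacency = load_graph(2)
print(adjacency[105])              # -> [106], per the sample rows above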
Adding to the previous answers: MS SQL Server gained native support for graph architectures starting with SQL Server 2017.
It follows the described pattern of having Nodes and Edges tables (which are created with the special "AS NODE" and "AS EDGE" keywords).
It also has a new MATCH keyword, introduced "to support pattern matching and traversal through the graph", used like this (friend is the name of an edge table in the example below):
SELECT Person2.name AS FriendName
FROM Person Person1, friend, Person Person2
WHERE MATCH(Person1-(friend)->Person2)
AND Person1.name = 'Alice';
There is also a really good set of articles on SQL Server Graph Databases on redgate Hub.
I am going to disagree with the other posts here. If you have a special class of graphs with restrictions (for example, a limited number of edges per vertex, or graphs that only need to be traversed in one direction), you can often get away with a more specialized design.
However, for storing an arbitrary graph, relational databases are an excellent choice. They're designed with an incredibly good set of tradeoffs that perform well in almost all situations. In addition, data needs tend to change over time, and a relational database lets you painlessly change the storage and lookup without changing the data representation.
Let's review your design:
one table for vertices (id, data)
one table for edges (startId, endId, data)
First observe that the storage is efficient, since it is proportional to the data being stored: if we have 10 vertices and 10 edges, we store 20 pieces of information.
Now let's look at lookup. Assuming we have an index on vertex id, we can look up any data we want in at most O(log n) (maybe better, depending on the index):
Given a node tell me the edges leaving it
Given a node tell me the edges entering it
Given an edge tell me the node it came from or enters
That's all the basic queries you need.
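A sketch of that schema with one index per traversal direction (SQLite syntax here, but the idea is portable to any relational database):

import sqlite3

con = sqlite3.connect("graph.db")
con.executescript("""
    CREATE TABLE IF NOT EXISTS vertices (id INTEGER PRIMARY KEY, data TEXT);
    CREATE TABLE IF NOT EXISTS edges (
        startId INTEGER REFERENCES vertices(id),
        endId   INTEGER REFERENCES vertices(id),
        data    TEXT);
    -- one index per direction keeps all three lookups at index speed
    CREATE INDEX IF NOT EXISTS edges_by_start ON edges(startId);
    CREATE INDEX IF NOT EXISTS edges_by_end   ON edges(endId);
""")

out_edges = con.execute("SELECT endId   FROM edges WHERE startId = ?", (42,))
in_edges  = con.execute("SELECT startId FROM edges WHERE endId   = ?", (42,))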
Now suppose you had a "graph database" that stores a list of edges leaving each vertex. This makes each vertex variable-size, and it is a little easier to traverse in that direction. But what if you want to traverse in the other direction? Now you have to store a list of edges entering each vertex as well.
At that point you have two copies of that information, and the database (or you, the developer) must do a lot of work to make sure they never get out of sync.
O(log(n)) vs O(1)
Relational database indices typically store data in sorted form or, as others have pointed out, can also use a hash table.
Even if you are stuck with sorted order, it is going to perform very well.
First, note that big-O measures scalability, not raw performance: hash lookups can be slower than a binary search on small data sets. Even though hashing is O(1), binary search at O(log n) is pretty darn good; you can search a billion records in about 30 steps! In addition, it is cache- and branch-predictor-friendly.

How should I change my Graph structure (very slow insertion)?

The program I'm writing is about a social network, which means there are users and their profiles. The profile structure is UserProfile.
Now, there are various possible graph implementations, and I don't think I'm using the best one. I have a Graph structure that contains a pointer to a linked list of type Vertex. Each Vertex element has a value, a pointer to the next Vertex, and a pointer to a linked list of type Edge. Each Edge element has a value (so I can define weights and whatever else is needed), a pointer to the next Edge, and a pointer to the owning Vertex.
I have 2 sample files with data to process (in CSV style) and insert into the graph. The first one holds the user data (one user per line); the second holds the user relations (for the graph). The first file is inserted quickly, since I always insert at the head and there are only about 18,000 users. The second file takes ages, even though I also insert the edges at the head: it has about 520,000 lines of user relations and takes 13-15 minutes to insert into the graph. I made a quick test, and reading the data is quick, instantaneous really. The problem is the insertion.
This problem exists because the graph's vertices are stored in a linked list. Every time I need to insert a relation, I have to look up 2 vertices so I can link them together. This is the problem... doing this for about 520,000 relations takes a while.
How should I solve this?
Solution 1) Some people recommended implementing the graph (the vertices part) as an array instead of a linked list. That way I have direct access to every vertex, and the insertion time will probably drop considerably. But I don't like the idea of allocating an array of [18000] elements. How practical is that? My sample data has about 18,000 users, but what if I need far fewer or far more? The linked list has that flexibility: I can have whatever size I want, as long as there's memory for it. The array doesn't; how am I going to handle such a situation? What are your suggestions?
Using linked lists is good for space but bad for lookup time, while using an array is good for lookup time but can waste space.
Any thoughts about this solution?
Solution 2) This project also demands that I have some sort of data structure that allows quick lookups based on a name index and an ID index. For this I decided to use hash tables. My tables are implemented with separate chaining for collision resolution, and when a load factor of 0.70 is reached, I normally recreate the table. I base the next table size on this link.
Currently, both hash tables hold a pointer to the UserProfile instead of duplicating the profile itself. That would be silly: changing data would require 3 changes, and it's really dumb to do it that way. So I just save the pointer to the UserProfile. The same user profile pointer is also saved as the value in each graph Vertex.
So, I have 3 data structures, one Graph and two Hash Tables and every single one of them point to the same exact UserProfile. The Graph structure will serve the purpose of finding the shortest path and stuff like that while the Hash Tables serve as quick index by name and ID.
What I'm thinking of doing to solve my graph problem is, instead of having the hash table values point to the UserProfile, pointing them to the corresponding Vertex. It's still a pointer; no more and no less space is used, I just change what I point to.
Like this, I can easily and quickly look up each Vertex I need and link them together. This would insert the ~520,000 relations pretty quickly.
I thought of this solution because I already have the hash tables and I need them anyway; so why not take advantage of them to index the graph vertices instead of the user profiles? It's basically the same thing: I can still reach the UserProfile quickly, by going to the Vertex and then to the UserProfile.
But do you see any cons to this second solution compared with the first one? Or only pros that outweigh the pros and cons of the first?
Other Solution) If you have any other solution, I'm all ears. But please explain the pros and cons of that solution over the previous 2. I really don't have much time to waste on this right now, I need to move on with this project, so if I'm going to make such a change I need to understand exactly what to change and whether that's really the way to go.
Hopefully no one fell asleep reading this and closed the browser; sorry for the big testament. But I really need to decide what to do about this, and I really need to make a change.
P.S.: When answering my proposed solutions, please number them as I did, so I know exactly which one you are talking about and don't confuse myself more than I already am.
Since the main issue here is speed, I would prefer the array approach.
You should, of course, maintain the hash table for the name-index lookup.
If I understood correctly, you only process the data once, so there is no dynamic insertion afterwards.
To deal with the space-allocation problem, I would recommend:
1 - Read the file once to get the number of vertices.
2 - Allocate that space.
If your data is dynamic, you could implement a simple method that grows the array in steps of 50%.
3 - For the edges, replace your linked lists with arrays, also grown dynamically in steps of 50%.
Even with the "extra" space allocated, when you grow in steps of 50% the total size used by the arrays should only be marginally larger than the size of the linked lists (geometric growth wastes at most half the capacity, while a linked list pays one pointer per element).
I hope I could help.

C: Store and index a HUGE amount of info! School project

I need to do a school project in C (I really don't know C++ that well).
I need a data structure to index each word of about 34k documents. It's a lot of words, and I need to do some ranking on the words. I already did this project about 2 years ago (I took a break from school and came back this year) using a hash table of binary trees, but I got a low grade because my project took about 2 hours to index all the words. I need something a little faster... any suggestions?
Thanks,
Roberto
If you have the option, I'd strongly recommend using a database engine (MSSQL, MySQL, etc.) as that's exactly the sort of datasets and operations these are written for. Best not to reinvent the wheel.
Otherwise, why use a btree at all? From what you've described (and I realise we're probably not getting the full story...), a straight-up hash table with the word as the key and its rank/count of occurrences as the value should do the job.
bogofilter (the spam filter) has to keep word counts. It uses dbm as a backend, since it needs persistent storage of the word -> count map. You might want to look at the code for inspiration. Or not, since you need to implement the db part of it for the school project, not so much the spam filter part.
Minimize the amount of pointer chasing you have to do. Data-dependent memory loads are slow, especially on a large working set where you will get cache misses. So make sure your hash table is big enough that you don't need a big tree in each bucket. And maybe check that your binary trees are dense rather than degenerate linked lists when you do get more than one value in a hash bucket.
If it's slow, profile it, and see if your problem is one slow function, or if it's cache misses, or if it's branch mispredictions.
