Fully Persistent B+ Tree - c

I'm trying to implement B+ tree (in C language) with each key being some data(int/float/string) and corresponding value is a list, whose size is not fixed.
I want to store this tree in a file and access later on, when required. You may consider the implementation as follows:
Each search key corresponds to a page in the file and
Each page contains set of values corresponding to that key
The problem is: I cannot just assign a page to a key, as it may consume very little and waste the entire page. So I need a persistent way of implementing B+ tree in file system, instead of main-memory.

Check this disk-based B-tree implementation out, it might be help.
And this paper titled Fully Persistent B+-Trees

Related

B+ tree - what is the rule regarding non leaf pages?

Is it OK for a non leaf page in a B+ tree to hold a number which is not equal to none of the leaf pages?
Yes, the keys in the internal blocks may be different from the actual keys that are stored in the bottom (data) layer of the B+ tree.
These internal keys only serve to set intervals, not to represent the actual key. For instance, to find a key in the tree, the traversal will always be to the bottom layer of the tree, and only there it is identified (or determined it does not exist). The internal keys only serve to determine the direction of the search.
Like you suggest, this can reduce the update-work needed when a key is deleted. Unless this deletion triggers a merge, there is no need to go upwards in the tree and update keys.
The fact that keys in internal blocks do not need to correspond to actual data keys, is also used for saving space. As Wikipedia mentions:
For internal blocks, space saving can be achieved by either compressing keys or pointers. For string keys, space can be saved by using the following technique: Normally the 𝑖th entry of an internal block contains the first key of block 𝑖 + 1. Instead of storing the full key, we could store the shortest prefix of the first key of block 𝑖 + 1 that is strictly greater (in lexicographic order) than last key of block 𝑖.

design a data structure and algorithm to store key-value in a malloced memory

I am writing a config library, the config is key value set. e.g. ip=192.168.10.1
the config is stored in a config file, and loaded into a shared memory (shmctl/shmget).
the library provide apis for fast load, query, insert, delete, save。
the config is cached to shared memory, so I need design how to store key-value set into a linear memory. and how to insert/query/delete to/from memory.
my bad implmentation is store key value one by one, but that is very inefficient for any operation.
What you are looking for is open addressing. Basically what this means is that you store the data by getting a hash value modulus the hashmap length and putting them in that address. If you got a collision you can chose from different policies to solve it.
An example is linear probing which just means that you find the first open spot after the wished position. Regardless of what policy you chose you have to make changes to how you find, delete, resize. Getting the best is difficult as it depends on your usage behaviour.
If you are lucky you can find a perfect hash which means you get no collisions, but that is a science in itself.
Alternatively you can just use your languages hashmap.

Novel idea for NoSQL node mapping based on time based ids?

I'm coding a new NoSQL database, and had what I thought was a novel idea (for me anyways) regarding the hashing mechanism used to locate nodes for a given key.
I'm using object keys that incorporate a timestamp. A hash will be used to determine the node(s) holding the data. Pretty common so far.
The (possible) twist lies in that a map will record the times at which nodes have been added to the cluster. That way I can determine for any given object which nodes were present in the cluster when that object was added (and therefore which nodes hold the object's data).
I'm thinking that in this way growing the cluster wont require any data to be transferred. Objects always live on the same node...for ever.
Has anyone tried something like this? Any potential problems that anyone can foresee?

How should I store a sparse decision tree(move list) in a database?

I have been thinking of making an AI for a board game for a long time, and recently I've started to gather resources and algorithms. The game is non-random, and most of the time, there < 3 moves for a player, sometimes, there are >20 moves. I would like to store critical moves, or ambiguous moves so that the AI learns from its mistakes and will not make a same mistake the next time. Moves that surely win or lose need not be stored. So I actually have a sparse decision tree for the beginning of games.
I would like to know how I should store this decision tree in a database? The database does not need to be SQL, and I do not know which database is suitable for this particular problem.
EDIT: Please do not tell me to parse the decision tree into memory, just imagine the game as complicated as chess.
As you will be traversing the tree, neo4j seems like a good solution to me. SQL is no good choice because of the many joins you would need for queries. As i understand the question, you are asking for a way to store some graph in a database, and neo4j is a database explicitely for graphs. For the sparseness, you can attach arrays of primitives or Strings to the edges of your graph to encode sequences of moves, using PropertyContainers (am i right that by sparseness and skipping of nodes you mean your tree-edges are sequences of moves rather than single moves?).
Firstly what you are trying to do sounds like a case based reasoning(CBR) problem see: http://en.wikipedia.org/wiki/Case-based_reasoning#Prominent_CBR_systems . CBR will have a database of decisions, your system would in theory pick the best outcomes available.
Therefore I would suggest using neo4j which is a nosql graph database. http://neo4j.org/
So to represent your game each position is a node in the graph, and each node should contain potential moves from said position. You can track scoring metrics which are learnt as games progress so that the AI is more informed.
I would use a document database (NOSQL) like RavenDB because you can store any data structure in the database.
Documents aren't flat like in a normal SQL database and that allows you to store hierarchical data like trees directly:
{
decision: 'Go forward',
childs: [
{ decision: 'Go backwards' },
{
decision: 'Stay there',
childs: [
{ decision: 'Go backwards' }
]
}
]
}
Here you can see an example JSON tree which can be stored in RavenDB.
RavenDB also has a built-in feature to query hierarchical data:
http://ravendb.net/faq/hierarchies
Please look at the documentation to get more information how RavenDB works.
Resources:
What type of NoSQL database is best suited to store hierarchical data?
You can use memory mapped file as storage.
First, create "compiler". This compiler will parse text file and convert it into compact binary representation. Main application will map this binary optimized file into memory. This will solve your problem with memory size limitation
Start with a simple database table design.
Decisions:
CurrentState BINARY(57) | NewState BINARY(57) | Score INT
CurrentState and NewState are a serialized version of the game state. Score is a weight given to the NewState (positive scores are good moves, negative scores are bad moves) your AI can update these scores appropriately.
Renju, uses a 15x15 board, each location can be either black, white or empty so you need Ceiling( (2bits * 15*15) / 8 ) bytes to serialize the board. In SQL that would be a BINARY(57) in T-SQL
Your AI would select the current moves it has stored like...
SELECT NewState FROM Decisions WHERE CurrentState = #SerializedState ORDER BY Score DESC
You'll get a list of all the stored next moves from the current game state in order of best score to least score.
Your table structure would have a Composite Unique Index (primary key) on (CurrentState, NewState) to facilitate searching and avoid duplicates.
This isn't the best/most optimal solution, but because of your lack of DB knowledge I beleive it would be the easiest to implement and give you a good start.
If I compare with chess engines, those play from memory, maybe apart from opening libraries. Chess is too complicated to store a decinding decision tree. Chess engines play by assigning heuristic evaluations to potential and transient future positions (not moves). Future positions are found by some kind of limited depth search, may be cached for some time in memory, but often are plainly recalculated each turn as the search space is just too big to store in a way faster to look up than recalculating is possible.
Do you know Chinook β€” the AI that solves checkers? It does this by compiling a database of every possible endgame. While this is not exactly what you are doing, you might learn from it.
I can't clearly conceive neither the data structures you handle in your tree nor their complexity.
But here are some thoughts which may interest you :
Map your decision tree into sparse matrix, a tree is a graph after all
Devise a storage/retrieval strategy taking advantage of sparse matrix properties.
I would approach this with the traditional way an opening book is handled in chess engines:
Generate all possible moves
For each move:
Make that move
Look the resulting position up in your database
Undo the move
Make the move that had the highest score in your database
Looking up a move
Chess engines usually compute a hash function of the current game state via Zobrist hashing, which is a simple way to construct a good hash function for gamestates.
The big advantage of this approach is that it takes care of transpositions, that is, if the same state can be reached via alternate paths, you don't need to worry about those alternate paths, only about the game states themselves.
How chess engines do this
Most chess engines use static opening books that are compiled from recorded games and hence use a simple binary file that maps these hashes to a score; e.g.
struct book_entry {
uint64_t hash
uint32_t score
}
The entries are then sorted by hash, and thanks to operating system caching, a simple binary search through the file will find the needed entries very quickly.
Updating the scores
However, if you want the engine to learn continously, you will need a more complicated data structure; at this point it is usually not worth doing yourself, and you should use an available library. I would probably use LevelDB, but anything that lets you store key-value pairs is fine (Redis, SQLite, GDBM, etc.)
Learning the scores
How exactly you update the scores depends on your game. In games with a lot of data available, a simple approach such as just storing the percentage of games won after the move that resulted in the position works; if you have less data, you can store the result of a game tree search from the position in question as score. Machine learning techniques such as Q learning are also a possibility, although I do not know of a program that actually does this in practice.
I'm assuming your question is asking about how to convert a decision tree into a serial format that can be written to a location and later used to reconstruct the tree.
Try using a pre-order traversal of the tree, using a toString() function (or its equivalent) to convert the data stored at each node of the decision tree to a textual descriptor. By pre-order traversal, I mean implementing an algorithm that first performs the toString() operation on the node, and writes the output to a database or file, and then recursively performs the same operation on its child nodes, in a specified order. Because you are dealing with a sparse tree, your toString() operation should also include information about the existence or non-existence of subtrees.
Reconstructing the tree is simple - the first stored value is the root node, the second is the root member of the left subtree, and so on. The serial data stored for each node should provide information as to which subtree the next inputted node should belong to.

Index on images to access data in a database

We have the Trie structure to efficiently access data when the key to that data set is a string. What would be the best possible index if key to a data set is an image?
By key, I mean some thing which uniquely distinguishes data. Is this a less frequently used scenario i.e. accessing data by an image? I do feel there are applications where it is used like a finger print database.
Does hashing help in this case? I mean hash the image into a unique number, depending on pixel values.
Please share any pointers on this.
cheers
You could use a hash function to find a item based on an image. But I see little practical use for this scenario.
Application such as finger print recognition, face recognition, or object identification perform a feature extraction process. This means they convert the complex image structure into simpler feature vectors that can be compared against stored patterns.
The real hard work is the feature extraction process that must seperate the important information from the 'noise' in the image.
Just hashing the image will will yield no usable features. The only situation I would think about hashing a image to find some information is to build a image database. But even in this case a common hash function as SHA1 or MD5 will be of little use, because modifying a single pixel or metadata such as the author will change the hash and make it impossible to identify the two images based on a common hash function.
I'm not 100% sure what you're trying to do, but hashing should give you a unique string to identify an image with. You didn't specify your language, but most have a function to hash an entire file's data, so you could just run the image file through that. (For example, PHP has md5_file())
It's unclear what problem you're trying to solve. You can definitely obtain a hash for an entire image and use that as a key in a Trie structure, although I think in this case the Trie structure would give you almost no performance benefit over a regular hash table, because you are performing a (large) hash every time you do a lookup.
If you are implementing something where you want to compare two images or find similar images in the tree quickly, you might consider using the GIF or JPEG header of the image as the beginning of the key. This would cause images with similar type, size, index colors, etc. to be grouped near each other within the Trie structure. You could then compute a hash for the image only if there was a collision (that is, multiple images in the Trie with the exact same header).

Resources