Index on images to access data in a database

We have the Trie structure to efficiently access data when the key to a data set is a string. What would be the best possible index if the key to a data set is an image?
By key, I mean something which uniquely distinguishes the data. Is accessing data by an image a less frequently used scenario? I do feel there are applications where it is used, like a fingerprint database.
Does hashing help in this case? I mean hashing the image into a unique number, depending on pixel values.
Please share any pointers on this.
cheers

You could use a hash function to find an item based on an image, but I see little practical use for this scenario.
Applications such as fingerprint recognition, face recognition, or object identification perform a feature extraction process. This means they convert the complex image structure into simpler feature vectors that can be compared against stored patterns.
The real hard work is the feature extraction process, which must separate the important information from the 'noise' in the image.
Just hashing the image will yield no usable features. The only situation in which I would think about hashing an image to find some information is to build an image database. But even in this case a common hash function such as SHA-1 or MD5 will be of little use, because modifying a single pixel or metadata such as the author will change the hash and make it impossible to identify the two images as the same based on a common hash function.

I'm not 100% sure what you're trying to do, but hashing should give you a unique string to identify an image with. You didn't specify your language, but most have a function to hash an entire file's data, so you could just run the image file through that. (For example, PHP has md5_file().)
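For illustration, a minimal Python sketch of the same idea (the hashlib equivalent of md5_file()); reading in chunks is just to avoid loading very large files at once:

import hashlib

def image_key(path):
    # MD5 of the raw file bytes, analogous to PHP's md5_file().
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()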

It's unclear what problem you're trying to solve. You can definitely obtain a hash for an entire image and use that as a key in a Trie structure, although I think in this case the Trie structure would give you almost no performance benefit over a regular hash table, because you are performing a (large) hash every time you do a lookup.
If you are implementing something where you want to compare two images or find similar images in the tree quickly, you might consider using the GIF or JPEG header of the image as the beginning of the key. This would cause images with similar type, size, index colors, etc. to be grouped near each other within the Trie structure. You could then compute a hash for the image only if there was a collision (that is, multiple images in the Trie with the exact same header).
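As a rough sketch of that idea in Python (the header length and the choice of MD5 are arbitrary assumptions, and for simplicity this always hashes, whereas the scheme above would hash only on a header collision):

import hashlib

def trie_key(path, header_len=64):
    # Prefix: raw file header (format, dimensions, palette, ...),
    # so images of similar type/size cluster near each other in the Trie.
    # Suffix: a content hash to disambiguate exact files.
    with open(path, "rb") as f:
        data = f.read()
    return data[:header_len] + hashlib.md5(data).digest()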

Related

Open Street Map enclosing polygons

I am working on an Android application that uses the Overpass API at [1]. My goal is to get all circular ways that enclose a certain lat-long point.
In order to do so, I build a request for a rectangle that contains my location, then parse the response XML and run a ray-casting algorithm to filter the ways that enclose the given lat-long position. This is too slow for the purpose of my application, because sometimes the response is tens or hundreds of MB.
Is there any OSM API that I can call to get all ways that enclose a certain location? Otherwise, how could I optimize the process?
Thanks!
[1] http://overpass-api.de/
To my knowledge, there is no standard API in OSM to do this (it is indeed a very uncommon use case).
I assume you define 'enclose' as meaning that the point representing the current location is inside the inner area of the polygon. Furthermore, I assume that optimizing the process might include changing the entire concept of the algorithm.
First of all, you need to define the rectangle to fetch data for. Here you need to consider that querying too large a rectangle would yield too much data. As far as I know there is no specific API to query circular ways only, and even if there were, querying too large a rectangle would probably be denied by the server, because the server load would be enormous.
Server-side precomputation / prefiltering
Therefore I suggest the first optimization: instead of querying an API that is not specifically suited for your purpose, use an offline database saved on the Android device. OsmAnd and others save the whole database for a country offline, but in your specific use case you only need to save a pre-filtered database of circular ways.
As far as I know, only a small fraction of the ways in OSM are circular. Therefore I suggest writing a script that regularly downloads OSM dumps, e.g. from Geofabrik, and removes non-circular ways (e.g. you could check whether the last node ID in a way is equal to the first node ID, but you'd need to check whether that captures every way you would define as circular). How often you would run it depends on your use case.
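A minimal sketch of that closed-way test, assuming an OSM XML extract small enough to parse in one go (a real pipeline over Geofabrik dumps would more likely use a streaming tool such as osmosis or osmium; this only illustrates the first-node-equals-last-node check):

import xml.etree.ElementTree as ET

def closed_ways(osm_xml_path):
    # Treat a way as 'circular' if its first and last node references match.
    for way in ET.parse(osm_xml_path).getroot().iter("way"):
        refs = [nd.get("ref") for nd in way.findall("nd")]
        if len(refs) > 2 and refs[0] == refs[-1]:
            yield way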
This optimization solves:
The issue of downloading a large amount of data
The issue of overloading the API with large requests
The issue of not being able to request large chunks of data
If that is not suitable for your use case, I suggest building a simple API for that on your server.
Re-chunking the data into appropriate grids
However, you would still need to filter a large amount of data. In order to partially solve this, I suggest the second optimization: re-chunk your data. For example, if your current location is in Virginia, you do not need to check circular ways located entirely around Texas. Because filtering by state etc. would be highly country-dependent and difficult (CPU-intensive), I suggest choosing a grid of, say, 0.05 lat/lon degrees (I'd choose an equirectangular projection because it is easy to compute if you already have lat/lon coordinates).
The script that preprocesses the data should then create one chunk of data (that could be a file, but we don't know enough about your use case to talk about specific data structures) for each rectangle in the area you want to use. A circular way is included in a chunk if and only if it has at least one node that is inside the chunk area.
You would then only request / filter the specific chunk your position is currently in. Choose the chunk size appropriately for your application (preferably rather small, but that depends on numerous factors!).
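A minimal sketch of the chunking, assuming the 0.05-degree grid suggested above (the function names are made up for illustration):

GRID = 0.05  # grid size in degrees, as suggested above

def chunk_key(lat, lon, grid=GRID):
    # Integer cell coordinates of the chunk containing a point.
    return (int(lat // grid), int(lon // grid))

def chunks_for_way(node_coords, grid=GRID):
    # A way is stored in every chunk that contains at least one of its nodes.
    return {chunk_key(lat, lon, grid) for lat, lon in node_coords}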
This optimization solves:
Assuming most of the circular ways are quite small in terms of their bounding rectangles, you only need to filter a tiny fraction of the overall ways
I/O is minimized, especially if you only load the chunk that contains your current position
Hysteretic heuristics
If the aforementioned optimizations do not sufficiently reduce your computation time, I'd suggest a third optimization that depends on how many circular ways you want to find (if you really need to find all of them, it won't help at all): use hysteresis. Save the circular ways you were inside of during the last computation (assuming the new location is near the last one) and check them first. If your location didn't change too much, you have a high chance of hitting a way you're inside of during the first few ray casts.
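A hedged sketch of that hysteresis; contains() stands for whatever point-in-polygon test (e.g. your ray casting) you already have, and, as noted above, this shortcut is only valid if you do not need every enclosing way:

last_hits = []  # circular ways we were inside of at the previous location

def enclosing_ways(point, chunk_ways, contains):
    # Re-test the previous hits first; if the position has not moved much,
    # this often avoids scanning the whole chunk.
    global last_hits
    hits = [w for w in last_hits if contains(w, point)]
    if not hits:
        hits = [w for w in chunk_ways if contains(w, point)]
    last_hits = hits
    return hits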
Leveraging relations between different circular ways
Also, a fourth optimization is possible: there will be some circular ways that are fully enclosed in another circular way. You could code your program so that it knows about that relation and checks the inner circular way first. If this check succeeds, you automatically know that the current position is also contained in the outer circular way. I think computing this information (server-side) could be incredibly CPU-intensive, and implementing it might also be a hard task, so I'd suggest using this optimization only if unavoidable.
Tuning the parameters of these optimizations should be sufficient to decrease the CPU time needed for your computation significantly. Please feel free to comment/ask if you have further questions regarding these suggestions.

Novel idea for NoSQL node mapping based on time-based IDs?

I'm coding a new NoSQL database and had what I thought was a novel idea (for me, anyway) regarding the hashing mechanism used to locate nodes for a given key.
I'm using object keys that incorporate a timestamp. A hash will be used to determine the node(s) holding the data. Pretty common so far.
The (possible) twist lies in that a map will record the times at which nodes have been added to the cluster. That way I can determine for any given object which nodes were present in the cluster when that object was added (and therefore which nodes hold the object's data).
I'm thinking that in this way, growing the cluster won't require any data to be transferred. Objects always live on the same node... forever.
Has anyone tried something like this? Any potential problems that anyone can foresee?
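A minimal sketch of the scheme described above, with made-up node names; Python's hash() here just stands in for whatever consistent hash function the database actually uses:

import bisect

join_times = [0, 1000, 2500]                 # when each node joined the cluster
node_ids   = ["node-a", "node-b", "node-c"]  # kept in the same order

def node_for(key, key_timestamp):
    # Only nodes already present when the object was written are candidates,
    # so the mapping never changes when the cluster later grows.
    count = bisect.bisect_right(join_times, key_timestamp)
    candidates = node_ids[:count]
    return candidates[hash(key) % len(candidates)]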

Metric for finding similar images in a database

There are a lot of different algorithms for computing the similarity between two images, but I can't find anything on how you would store this information in a database such that you can find similar images quickly.
By "similar" I mean exact duplicates that have been rotated (90 degree increments), color-adjusted, and/or re-saved (lossy jpeg compression).
I'm trying to come up with a "fingerprint" of the images such that I can look them up quickly.
The best I've come up with so far is to generate a grayscale histogram. With 16 bins and 256 shades of gray, I can easily create a 16-byte fingerprint. This works reasonably well, but it's not quite as robust as I'd like.
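For what it's worth, a minimal Pillow sketch of that 16-bin grayscale histogram fingerprint; normalizing by the pixel count so the fingerprint is roughly independent of image size is my own assumption:

from PIL import Image

def histogram_fingerprint(path, bins=16):
    # 16 bytes: one byte per bin, each the fraction of pixels in that
    # grayscale range, quantized to 0..255.
    hist = Image.open(path).convert("L").histogram()   # 256 counts
    total = sum(hist) or 1
    step = 256 // bins
    return bytes(min(255, int(255 * sum(hist[i*step:(i+1)*step]) / total))
                 for i in range(bins))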
Another solution I tried was to resize the images, rotate them so they're all oriented the same way, grayscale them, normalize the histograms, and then shrink them down to about 8x8, and reduce the colors to 16 shades of gray. Although the miniature images were very similar, they were usually off by a pixel or two, which means that exact matching can't work.
Without exact-matching, I don't believe there's any efficient way to group similar photos (without comparing every photo to every other photo, i.e., O(n^2)).
So, (1) how can I create a fingerprint/signature that is invariant to the requirements mentioned above? Or, (2) if that's not possible, what other metric can I use such that, given a single image, I can find its best matches in a database of thousands?
There's one little confusing thing in your question: the "fingerprint" you linked to is explicitly not meant to find similar images (quote):
TinEye does not typically find similar images (i.e. a different image with the same subject matter); it finds exact matches including those that have been cropped, edited or resized.
Now, that said, I'm just going to assume you know what you are asking, and that you actually want to be able to find all similar images, not just edited exact copies.
If you want to get into it in detail, I would suggest looking up the papers by Sivic & Zisserman and by Nister & Stewenius. The idea these two papers (as well as quite a few others lately) have been using is to apply text-searching techniques to image databases, and to search the image database in the same manner Google would search its document (web-page) database.
The first paper I have linked to is a good starting point for this kind of approach, since it addresses the main question: what are the "words" in images? Text-searching techniques all focus on words and base their similarity measures on calculations involving word counts. Successful representation of images as collections of visual words is thus the first step in applying text-searching techniques to image databases.
The second paper then expands on the idea of using text-techniques, presenting a more suitable search structure. With this, they allow for a faster image retrieval and larger image databases. They also propose how to construct an image descriptor based on the underlying search structure.
The features used as visual words in both papers should satisfy your invariance constraints, and the second one definitely should be able to work with your required database size (maybe even the approach from the 1st paper would work).
Finally, I recommend looking up newer papers from the same authors (I'm positive Nister did something new, it's just that the approach from the linked paper has been enough for me until now), looking up some of their references and just generally searching for papers concerning Content based image (indexing and) retrieval (CBIR) - it is a very popular subject right now, so there should be plenty.

Get info from map data file by coordinate

Imagine I have a map shapefile (.shp) or OSM XML. I'm able to see different kinds of data from different layers in GIS-oriented programs, e.g. ArcGIS, QGIS etc. But how can I get this info programmatically? Is there a specific library for that?
What I'm really looking for is some kind of method getMapData(longitude, latitude) to get landscape/terrain info (e.g. forest, river, city, highway) at a specified location.
Thanks in advance for your answers!
It depends on what you want to achieve whether you are better off using raster or vector data.
If you are using your grid to subdivide an area as an array of containers for geographic features, then stick with vector data. To do this, I would create a polygon grid file and intersect it with each of your data layers. You can then add an ID field that represents the cell's location in the array (and hence its relative position to a known lat/long coordinate, let's say the lower left). Alternatively you can use spatial queries to access your data by selecting a polygon in your vector grid file and then finding all the features in your other file that are contained by it.
OTOH, if you want to do some multi-feature analysis based on presence/absence, then you may be better off going down the route of raster analysis. My gut feeling from what you have said is that this is what you are trying to achieve, but I am still not 100% sure. You would handle this by creating a set of boolean rasters of a suitable resolution and then performing maths operations on the set (add, subtract, average etc., depending on what questions you are asking).
Let's say you are looking at animal migration. Let's say your model assumes that streams, hedges and towns are all obstacles to migration but roads only reduce the chance of an area being crossed. So you convert your obstacles to a value of '1' and NoData to '0' in each case, except roads where you decide to set the value to 0.5. You can then add all your rasters together in one big stack and predict migration routes.
OK, that's a simplistic example, but perhaps you can see why we need EVEN more information on what you are trying to do.
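A tiny numpy sketch of that raster-stacking example, with made-up 2x2 rasters; real layers would of course come from rasterizing the vector data at an agreed resolution and alignment:

import numpy as np

# 1 = obstacle present, 0 = NoData/absent (hypothetical aligned rasters)
streams = np.array([[0, 1], [0, 0]], dtype=float)
hedges  = np.array([[1, 0], [0, 0]], dtype=float)
towns   = np.array([[0, 0], [1, 0]], dtype=float)
roads   = np.array([[0, 0], [0, 1]], dtype=float) * 0.5  # only partial obstacle

cost = streams + hedges + towns + roads   # higher = harder to cross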
Shapefiles or an osm xml file are just containers that hold geometric shapes. There are plenty of software libraries out there that let you read these files and extract the data. I would recommend looking at GDAL/OGR as a starting point.
A method like getMapData(longitude, latitude) is essentially a search/query function. You need to be a little more specific, too: do you want geometries that contain the point, that are within a distance of the point, etc.?
You could find the map data using a brute-force algorithm, e.g. (assuming shapely-style geometries read from the file):
for shape in shapefile:
    if shape.contains(query_point):
        return shape
Or you can use more advanced algorithms/data structures such as R-trees, k-d trees, quadtrees, etc. The easiest way to get started with querying map data is to load it into a spatial database. I would recommend investigating PostgreSQL+PostGIS and SpatiaLite.
You may also like to look at SpatiaLite and/or PostGIS, which are two spatially-enabled databases that you could use separately or in conjunction with GDAL/OGR.
I must echo Charles' request that you explain your use case in more detail, because the actual implementation will depend greatly on exactly what you are trying to achieve. My reading of this is that you may want to convert your data into a series of aligned rasters which you can overlay and treat as a 3-dimensional array.

How should I store a sparse decision tree (move list) in a database?

I have been thinking about making an AI for a board game for a long time, and recently I've started to gather resources and algorithms. The game is non-random, and most of the time there are fewer than 3 moves for a player; sometimes there are more than 20. I would like to store critical or ambiguous moves so that the AI learns from its mistakes and will not make the same mistake the next time. Moves that surely win or lose need not be stored. So I actually have a sparse decision tree for the beginning of games.
I would like to know how I should store this decision tree in a database? The database does not need to be SQL, and I do not know which database is suitable for this particular problem.
EDIT: Please do not tell me to parse the decision tree into memory; just imagine a game as complicated as chess.
As you will be traversing the tree, Neo4j seems like a good solution to me. SQL is not a good choice because of the many joins you would need for queries. As I understand the question, you are asking for a way to store a graph in a database, and Neo4j is a database explicitly for graphs. For the sparseness, you can attach arrays of primitives or Strings to the edges of your graph to encode sequences of moves, using PropertyContainers (am I right that by sparseness and skipping of nodes you mean your tree edges are sequences of moves rather than single moves?).
Firstly, what you are trying to do sounds like a case-based reasoning (CBR) problem; see http://en.wikipedia.org/wiki/Case-based_reasoning#Prominent_CBR_systems . A CBR system keeps a database of decisions; your system would in theory pick the best outcomes available.
Therefore I would suggest using Neo4j, which is a NoSQL graph database: http://neo4j.org/
So, to represent your game, each position is a node in the graph, and each node should contain the potential moves from that position. You can track scoring metrics which are learnt as games progress, so that the AI is more informed.
I would use a document database (NOSQL) like RavenDB because you can store any data structure in the database.
Documents aren't flat like in a normal SQL database and that allows you to store hierarchical data like trees directly:
{
    decision: 'Go forward',
    childs: [
        { decision: 'Go backwards' },
        {
            decision: 'Stay there',
            childs: [
                { decision: 'Go backwards' }
            ]
        }
    ]
}
Here you can see an example JSON tree which can be stored in RavenDB.
RavenDB also has a built-in feature to query hierarchical data:
http://ravendb.net/faq/hierarchies
Please look at the documentation for more information on how RavenDB works.
Resources:
What type of NoSQL database is best suited to store hierarchical data?
You can use a memory-mapped file as storage.
First, create a 'compiler'. This compiler will parse a text file and convert it into a compact binary representation. The main application will then map this optimized binary file into memory. This solves your problem with the memory size limitation.
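A minimal Python sketch of the memory-mapping step (the binary layout itself is up to your 'compiler'):

import mmap

def open_compiled_tree(path):
    # The OS pages the file in on demand, so the whole structure
    # never has to fit in RAM at once.
    f = open(path, "rb")
    return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)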
Start with a simple database table design.
Decisions:
CurrentState BINARY(57) | NewState BINARY(57) | Score INT
CurrentState and NewState are serialized versions of the game state. Score is a weight given to the NewState (positive scores are good moves, negative scores are bad moves); your AI can update these scores appropriately.
Renju uses a 15x15 board, and each location can be either black, white or empty, so you need Ceiling((2 bits * 15 * 15) / 8) = 57 bytes to serialize the board. In SQL that would be a BINARY(57) in T-SQL.
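A hedged sketch of that serialization, assuming the board is a list of 15 rows with 0 = empty, 1 = black, 2 = white:

def serialize_board(board):
    # 2 bits per intersection, 225 intersections -> ceil(450 / 8) = 57 bytes.
    bits = 0
    for row in board:
        for cell in row:
            bits = (bits << 2) | cell
    return bits.to_bytes(57, "big")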
Your AI would select the current moves it has stored like...
SELECT NewState FROM Decisions WHERE CurrentState = @SerializedState ORDER BY Score DESC
You'll get a list of all the stored next moves from the current game state in order of best score to least score.
Your table structure would have a Composite Unique Index (primary key) on (CurrentState, NewState) to facilitate searching and avoid duplicates.
This isn't the best or most optimal solution, but given your lack of DB knowledge I believe it would be the easiest to implement and would give you a good start.
If I compare with chess engines, those play from memory, apart perhaps from opening libraries. Chess is too complicated to store a deciding decision tree for. Chess engines play by assigning heuristic evaluations to potential and transient future positions (not moves). Future positions are found by some kind of limited-depth search; they may be cached for some time in memory, but often they are simply recalculated each turn, as the search space is just too big to store in a way that is faster to look up than to recalculate.
Do you know Chinook, the AI that solved checkers? It does this by compiling a database of every possible endgame. While this is not exactly what you are doing, you might learn from it.
I can't clearly conceive of either the data structures you handle in your tree or their complexity.
But here are some thoughts which may interest you :
Map your decision tree onto a sparse matrix; a tree is a graph, after all.
Devise a storage/retrieval strategy taking advantage of sparse matrix properties.
I would approach this the way an opening book is traditionally handled in chess engines:
Generate all possible moves
For each move:
Make that move
Look the resulting position up in your database
Undo the move
Make the move that had the highest score in your database
Looking up a move
Chess engines usually compute a hash of the current game state via Zobrist hashing, which is a simple way to construct a good hash function for game states.
The big advantage of this approach is that it takes care of transpositions, that is, if the same state can be reached via alternate paths, you don't need to worry about those alternate paths, only about the game states themselves.
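A minimal Zobrist-hashing sketch, adapted here to a 15x15 two-colour board rather than chess (the seed and table layout are arbitrary choices for illustration):

import random

random.seed(1)  # fixed seed so the table is identical across runs
ZOBRIST = [[random.getrandbits(64) for _ in range(2)]   # one 64-bit number per
           for _ in range(15 * 15)]                     # (square, colour) pair

def zobrist_hash(board):
    # board[square] in {0: empty, 1: black, 2: white}
    h = 0
    for square, piece in enumerate(board):
        if piece:
            h ^= ZOBRIST[square][piece - 1]
    return h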
How chess engines do this
Most chess engines use static opening books that are compiled from recorded games and hence use a simple binary file that maps these hashes to a score; e.g.
struct book_entry {
    uint64_t hash;
    uint32_t score;
};
The entries are then sorted by hash, and thanks to operating system caching, a simple binary search through the file will find the needed entries very quickly.
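For illustration, a Python version of that lookup, assuming packed little-endian 12-byte entries (uint64 hash + uint32 score) sorted by hash; book can be a bytes object or an mmap of the file:

import struct

ENTRY = struct.Struct("<QI")   # uint64 hash + uint32 score, packed

def lookup(book, position_hash):
    # Plain binary search over the sorted entries.
    lo, hi = 0, len(book) // ENTRY.size
    while lo < hi:
        mid = (lo + hi) // 2
        h, score = ENTRY.unpack_from(book, mid * ENTRY.size)
        if h == position_hash:
            return score
        if h < position_hash:
            lo = mid + 1
        else:
            hi = mid
    return None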
Updating the scores
However, if you want the engine to learn continuously, you will need a more complicated data structure; at this point it is usually not worth doing this yourself, and you should use an available library. I would probably use LevelDB, but anything that lets you store key-value pairs is fine (Redis, SQLite, GDBM, etc.).
Learning the scores
How exactly you update the scores depends on your game. In games with a lot of data available, a simple approach such as storing the percentage of games won after the move that resulted in the position works; if you have less data, you can store the result of a game-tree search from the position in question as the score. Machine learning techniques such as Q-learning are also a possibility, although I do not know of a program that actually does this in practice.
I'm assuming your question is asking about how to convert a decision tree into a serial format that can be written to a location and later used to reconstruct the tree.
Try using a pre-order traversal of the tree, using a toString() function (or its equivalent) to convert the data stored at each node of the decision tree into a textual descriptor. By pre-order traversal, I mean implementing an algorithm that first performs the toString() operation on the node and writes the output to a database or file, and then recursively performs the same operation on each of its child nodes, in a specified order. Because you are dealing with a sparse tree, your toString() operation should also include information about the existence or non-existence of subtrees.
Reconstructing the tree is simple: the first stored value is the root node, the second is the root of its first subtree, and so on. The serialized data stored for each node should indicate which subtree the next node read in belongs to.
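A minimal sketch of such a pre-order serialization; here each node writes its value and its child count, which is enough to reconstruct a sparse tree with a variable number of children (the Node class is assumed for illustration):

class Node:
    def __init__(self, value, children=None):
        self.value = value
        self.children = children or []

def serialize(node, out):
    # Pre-order: the node's textual value, its child count, then each child.
    out.append(str(node.value))
    out.append(str(len(node.children)))
    for child in node.children:
        serialize(child, out)

def deserialize(tokens):
    # tokens: an iterator over the values written by serialize().
    value = next(tokens)
    count = int(next(tokens))
    return Node(value, [deserialize(tokens) for _ in range(count)])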
