Deep Learning: Dataset containing images of varying dimensions and orientations - dataset

This is a repost of the question asked in ai.stackexchange. Since there is not much traction in that forum, I thought I might try my chances here.
I have a dataset of images of varying dimensions of a certain object. A few images of the object are also in varying orientations. The objective is to learn the features of the object (using Autoencoders).
Is it possible to create a network with layers that account for varying dimensions and orientations of the input image, or should I strictly consider a dataset containing images of uniform dimensions? What is the necessary criteria of an eligible dataset to be used for training a Deep Network in general.
The idea is, I want to avoid pre-processing my dataset by normalizing it via scaling, re-orienting operations etc. I would like my network to account for the variability in dimensions and orientations. Please point me to resources for the same.
EDIT:
As an example, consider a dataset consisting of images of bananas. They are of varying sizes, say, 265x525 px, 1200x1200 px, 165x520 px etc. 90% of the images display the banana in one orthogonal orientation (say, front view) and the rest display the banana in varying orientations (say, isometric views).

Almost always people will resize all their images to the same size before sending them to the CNN. Unless you're up for a real challenge this is probably what you should do.
That said, it is possible to build a single CNN that takes input of images as varying dimensions. There are a number of ways you might try to do this, and I'm not aware of any published science analyzing these different choices. The key is that the set of learned parameters needs to be shared between the different inputs sizes. While convolutions can be applied at different images sizes, ultimately they always get converted to a single vector to make predictions with, and the size of that vector will depend on the geometries of the inputs, convolutions and pooling layers. You'd probably want to dynamically change the pooling layers based on the input geometry and leave the convolutions the same, since the convolutional layers have parameters and pooling usually doesn't. So on bigger images you pool more aggressively.
Practically you'd want to group together similarly (identically) sized images together into minibatches for efficient processing. This is common for LSTM type models. This technique is commonly called "bucketing". See for example http://mxnet.io/how_to/bucketing.html for a description of how to do this efficiently.

Is it possible to create a network with layers that account for varying dimensions and orientations of the input image
The usual way to deal with different images is the following:
You take one or multiple crops of the image to make width = height. If you take multiple crops, you pass all of them through the network and average the results.
You scale the crop(s) to the size which is necessary for the network.
However, there is also Global Average Pooling (e.g. Keras docs).
What is the necessary criteria of an eligible dataset to be used for training a Deep Network in general.
That is a difficult question to answer as (1) there are many different approaches in deep learning and the field is quite young (2) I'm pretty sure there is no quantitative answer right now.
Here are two rules of thumb:
You should have at least 50 examples per class
The more parameters your model has, the more data you need
Learning curves and validation curves help to estimate the effect of more training data.

Related

Do you keep the true distribution between classes when you manually create your own dataset, or do you make it balanced?

(I struggled a bit to phrase the title - please feel free to suggest another title).
I have a text-dataset which I need to classify, say there's three classes. I need to create the targets by manually setting the labels based on the text (say the three classes are dog,cat,bird).
When I do so I notice we have, say, 70% dog, 20% cat and 10% bird.
Since a lot of machine learning models struggle with imbalanced data, my first thought would be to force the dataset being balanced simply to ignore some of the dog and cat text (i.e "undersampling") thus ending up with (almost) a balanced dataset, making it more easy to train the model.
My concern is though that if we want to train e.g a neural network and get the probability for each class, not training over the correct distribution of the data would result in over/under-confident predictions?
Indeed if your dataset is imbalanced, there is a risk of affecting the performance of your classifier.
You'll find plenty of libraries to help you deal with this problem (see below) and the bottom line is if classes are equally represented in your dataset, it can only help prevent biais of your classifier:
https://imbalanced-learn.org/stable/auto_examples/index.html#general-examples
https://github.com/ufoym/imbalanced-dataset-sampler
https://github.com/MaxHalford/pytorch-resample etc...
(but you can also do that sampling yourself, shouldn't be too difficult, eg libraries like pandas have such functionality)
As a safeguard, split your dataset into 3:
Training (eg 70% of our data): the bulk of the data used for learning
Validation (eg 20%): what your classifier uses for regularization (ie to prevent over fitting)
Test (eg 10%): this data is NEVER exposed to your classifier for learning purposes, you keep it separate and just use it at the end on your model to evaluate its true performance (you call predict and compare with expected classes).
This should be a good starting point.

Databasing to feed ~9k data points to High Charts

I am working on a freelance project that captures an audio file, runs some fourier analysis, and spits out three charts (x-y plots). Each chart has about ~3000 data points, which I plan to display with High Charts in the browser.
What database techniques do you recommend for storing and accessing this much data? Should I be storing the points in an array or in multiple rows? I'm considering Mongo too. Plan is to use Rails, so I was hoping to use a single database for both data and authentication.
I haven't dealt with queries accessing this much data for a single page, and this may very well be a tiny overall amount of data. In addition this is an MVP for demonstration to investors, so making it scalable to huge levels isn't of immediate concern.
My initial thought is that using Postgres and having one large table of data points, stored per-row, will be fine, and that that a bunch of doubles is not going to be too memory-intensive relative to images and such.
Realistically, I may just pull 100 evenly-spaced data points to make the chart, but the original data must still be stored.
I've done a lot of Mongo work and I can tell you what I would do if I were you.
One of the very nice properties about your data is that the x,y coordinates are of a fixed size generally. In other words it's not like you are storing comments from users, which can vary greatly in size.
With Mongo I would first make a sample document with the 3,000 points. Just a simple array of x,y points. I would see how big that document is and how my front end handled it - in other words can High Charts handle that?
I would also try to stick to the easiest conceptual model to manage, which is one document per chart, each chart having 3k points. This is a natural way to think of the data and I would start there and see if there were any performance hits. Mongo can easily store those documents, so I think the biggest pain would be in the UI with rendering the data.
Mongo would handle authentication well. I think it's a good choice for general data storage for an MVP.

Metric for finding similar images in a database

There's a lot of different algorithms for computing the similarity between two images, but I can't find anything on how you would store this information in a database such that you can find similar images quickly.
By "similar" I mean exact duplicates that have been rotated (90 degree increments), color-adjusted, and/or re-saved (lossy jpeg compression).
I'm trying to come up with a "fingerprint" of the images such that I can look them up quickly.
The best I've come up with so far is to generate a grayscale histogram. With 16 bins and 256 shades of gray, I can easily create a 16-byte fingerprint. This works reasonably well, but it's not quite as robust as I'd like.
Another solution I tried was to resize the images, rotate them so they're all oriented the same way, grayscale them, normalize the histograms, and then shrink them down to about 8x8, and reduce the colors to 16 shades of gray. Although the miniature images were very similar, they were usually off by a pixel or two, which means that exact matching can't work.
Without exact-matching, I don't believe there's any efficient way to group similar photos (without comparing every photo to every other photo, i.e., O(n^2)).
So, (1) How can I create I create a fingerprint/signature that is invariant to the requirements mentioned above? Or, (2) if that's not possible, what other metric can I use such that given a single image, I can find it's best matches in a database of thousands?
There's one little confusing thing in your question: the "fingerprint" you linked to is explicitly not meant to find similar images (quote):
TinEye does not typically find similar images (i.e. a different image with the same subject matter); it finds exact matches including those that have been cropped, edited or resized.
Now, that said, I'm just going to assume you know what you are asking, and that you actually want to be able to find all similar images, not just edited exact copies.
If you want to try and get into it in detail, I would suggest looking up papers by Sivic, Zisserman and Nister, Stewenius. The idea these two papers (as well as quite a bit of others lately) have been using is to try and apply text-searching techniques to image databases, and search the image database in a same manner Google would search it's document (web-page) database.
The first paper I have linked to is a good starting point for this kind of approach, since it addresses mainly the big question: What are the "words" in the images?. Text searching techniques all focus on words, and base their similarity measures on calculations including word counts. Successful representation of images as collections of visual words is thus the first step to applying text-searching techniques to image databases.
The second paper then expands on the idea of using text-techniques, presenting a more suitable search structure. With this, they allow for a faster image retrieval and larger image databases. They also propose how to construct an image descriptor based on the underlying search structure.
The features used as visual words in both papers should satisfy your invariance constraints, and the second one definitely should be able to work with your required database size (maybe even the approach from the 1st paper would work).
Finally, I recommend looking up newer papers from the same authors (I'm positive Nister did something new, it's just that the approach from the linked paper has been enough for me until now), looking up some of their references and just generally searching for papers concerning Content based image (indexing and) retrieval (CBIR) - it is a very popular subject right now, so there should be plenty.

Get info from map data file by coordinate

Imagine I have a map shape file (.shp) or osm xml, I'm able to see different kind of data from different layers in GIS oriented programs, e.g. ArcGIS, QGIS etc. But how can I get this info programmatically? Is there a specific library for that?
What I'm really looking for is a some kind of method getMapData(longitude, latitude) to get landscape/terrain info (e.g. forest, river, city, highway) in specified location
Thanks in advance for your answers!
It still depends what you want to achieve whether you are better off using raster or vector data.
If your are using your grid to subdivide an area as an array of containers for geographic features, then stick with vector data. To do this, I would create a polygon grid file and intersect it with each of your data layers. You can then add an ID field that represents the cell's location in the array (and hence it's relative position to a known lat/long coordinate - let's say lower left). Alternatively you can use spatial queries to access your data by selecting a polygon in your vector grid file and then finding all the features in your other file that are contained by it.
OTOH, if you want to do some multi-feature analysis based on presence/abscence then you may be better going down the route of raster analysis. My gut feeling from what you have said is that this is what you are trying to achieve but I am still not 100% sure. You would handle this by creating a set of boolean rasters of a suitable resolution and then performing maths operations on the set (add, subtract, average etc - depending on what questions your are asking).
Let's say you are looking at animal migration. Let's say your model assumes that streams, hedges and towns are all obstacles to migration but roads only reduce the chance of an area being crossed. So you convert your obstacles to a value of '1' and NoData to '0' in each case, except roads where you decide to set the value to 0.5. You can then add all your rasters together in one big stack and predict migration routes.
Ok that's a simplistic example but perhaps you can see why we need EVEN more information on what you are wanting to do.
Shapefiles or an osm xml file are just containers that hold geometric shapes. There are plenty of software libraries out there that let you read these files and extract the data. I would recommend looking at GDAL/OGR as a starting point.
A method like getMapData(longitude, latitude) is essentially a search/query function. You need to be a little more specific too, do you want geometries that contain the point, are within a distance of a point, etc?
You could find the map data using a brute force algorithm
for shape in shapefile:
if shape.contains(query_point):
return shape
Or you can use more advanced algorithms/data structures such as RTrees, KDTrees, QuadTrees, etc. The easiest way to get start with querying map data is to load it into a spatial database. I would recommending investigating PostgreSQL+PostGIS and SpatiaLite
You may also like to look at Spatialite and/or PostGIS which are two spatial enabled databses that you could use separately or in conjunction with GDAL/OGR.
I must echo Charles' request that you explain your use-case in more detail because the actual implementation will depend greatly on exactly what you are wanting to achieve. My reading of this is that you may want to convert your data into a series of aligned rasters which you can overlay and treat as a 3 dimensional array.

Datasets to test Nonlinear SVM

I'm implementing a nonlinear SVM and I want to test my implementation on a simple not linearly separable data. Google didn't help me find what I want. Can you please advise me where I can find such data. Or at least, how can I generate such data manually ?
Thanks,
Well, SVMs are two-class classifiers--i.e., these classifiers place data on either side of a single decision boundary.
Therefore, i would suggest a data set comprised of just two classes (that's not strictly necessary because of course an SVM can separate more than two classes by passing the Classifier multiple times (in series) over the data, it's cumbersome to do this during initial testing).
So for instance, you can use the iris data set, linked to in Scott's answer; it's comprised of three classes, Class I is linear separable from Class II and III; Class II and III are not linear separable. If you want to use this data set, for convenience-sake you might prefer to remove Class I (approx. the first 50 data rows), so what remains is a two-class system, in which the two remaining classes are not linearly separable.
The iris data set is quite small (150 x 4, or 50 rows/class x four features)--depending where you are with your SVM prototype testing, this might be exactly what you want, or you might want a larger data set.
An interesting family of data sets that are comprised of just two classes and that are definitely non-linearly separable are the the anonymized data sets supplied by the mega-dating site eHarmony (no affiliation of any kind). In addition to the iris data, I like to use these data sets for SVM prototype evaluation because they are large data sets with quite a few features yet still comprised of just two non-linearly separable classes.
I am aware of two places from which you can retrieve this data. The first Site has a single data set (PCI Code downloads, chapter9, matchmaker.csv) comprised of 500 data points (row) and six features (columns). Although this set is simpler to work with, the data is more or less in a 'raw' form and will require some processing before you can use it.
The second source for this data, contains two eHarmony data sets, one of them is comprised of over half million rows and 59 features. In addition, these two data sets have undergone substantial processing such that the only task required before feeding them to your SVM is routine rescaling of the features.
The particular data set you need will depend highly on your choice of kernel function, so It seems the easiest method is simply creating a toy data set yourself.
Some helpful ideas:
Concentric circles
Spiral-shaped classes
Nested banana-shaped classes
If you just want a random data set which is not linearly separable, may I suggest the Iris dataset? It is a multivariate data set where at least a couple of the classes in question are not linearly separable.
Hope this helps!
You can start with simple datasets like Iris or two-moons both of which are linearly non-separable. Once you are satisfied, you can move on to bigger datasets from the UCI ML repository, classification datasets.
Be sure to compare and benchmark against standard SVM solvers like libSVM and SVM-light.
If you program in Python, you can use a few functions in the package of sklearn.datasets.samples_generator to manully generate nested moon-shape data set, concentric circle data set etc. Here is a page of plots of these data sets.
And if you don't want to generate data set manually, you can refer to this website, where in the seciton of "shape sets", you can download these data set and test on them directly.

Resources