Sorry about all of the questions on a particular topic.
But I'm also trying to think through adding a couple of non-ordinal (categorical) dimensions to a set of existing numeric dimensions. Essentially, I am looking to cluster either on matching values within one of these non-ordinal dimensions, or (in a different dimension) on differing values.
For instance, if I added a color dimension with 3 values {Blue, Green, Red}, I would want to add weight towards grouping Greens together in a cluster, but give no extra weighting to which different colors share a cluster, i.e. a cluster with Green and Red would score the same as one with Green and Blue.
I am also interested in how to quantify the opposite for another dimension, that is, encouraging clusters whose points take on many different values in that dimension.
The important thing here is that the extra dimensions have no inherent ordering or magnitude; I only care about clustering on matching values for one dimension and on differing values for the other. This is just to supplement the already existing numeric dimensions.
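To make this concrete, here is a minimal sketch of the kind of dissimilarity I have in mind. The field names ("color" for the match-on-same dimension, "batch" for the match-on-different one) and the weights are placeholders, and any clustering algorithm that accepts a precomputed distance matrix could consume the result.

    # Sketch: numeric Euclidean distance plus two categorical terms.
    # w_same penalizes differing "color" (so same colors cluster together);
    # w_diff penalizes matching "batch" (so clusters prefer varied batches).
    import numpy as np
    from scipy.spatial.distance import squareform
    from scipy.cluster.hierarchy import linkage, fcluster

    def dissimilarity(a, b, w_same=1.0, w_diff=1.0):
        """a, b are dicts with 'num' (numeric vector), 'color', 'batch'."""
        d = np.linalg.norm(a["num"] - b["num"])           # ordinary Euclidean part
        d += w_same if a["color"] != b["color"] else 0.0  # different colors -> farther apart
        d += w_diff if a["batch"] == b["batch"] else 0.0  # same batch -> farther apart
        return d

    points = [
        {"num": np.array([0.1, 0.2]), "color": "Green", "batch": "A"},
        {"num": np.array([0.2, 0.1]), "color": "Green", "batch": "B"},
        {"num": np.array([5.0, 5.1]), "color": "Red",   "batch": "B"},
        {"num": np.array([5.2, 4.9]), "color": "Blue",  "batch": "A"},
    ]

    n = len(points)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = dissimilarity(points[i], points[j])

    # Hierarchical clustering is just a convenient consumer of the matrix.
    labels = fcluster(linkage(squareform(D), method="average"), t=2, criterion="maxclust")
    print(labels)

In practice the numeric features would need scaling so the categorical weights are comparable to the Euclidean part.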
Short Question
I want to store coordinates of multiple polygons (regions) in a Postgres database. What I need is to be able to get all region pairs which have an IOU (Intersection Over Union) > 0.5 (or any other value). If one region has more than one matching region, pick the one with the highest IOU.
It would be very helpful if someone could suggest how the schema should look and what kind of SQL queries would be needed to achieve this.
Long Question
Context: Users and AI models add annotations on files (images) in our platform. Let's say the AI draws 2 boxes on an image: Box1 with label l1 and Box2 with label l2. The user draws 1 box, Box3, on the same image with label l1.
There could be millions of such files and we want to compute various detection and classification metrics from the above information.
Detection metrics would be based on whether the box detected by the AI matches the user's box or not. We rely on IOU to decide whether 2 boxes match.
Classification metrics would be computed on top of those boxes determined to be correct based on IOU, by checking whether the label given by the user is among the labels given by the AI.
I want an approach for what kind of DB schema should be used for this kind of problem statement, and a sense of how complex the SQL queries would be in terms of performance.
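To make the matching step concrete, here is a minimal sketch in Python of the IOU pairing I have in mind, assuming axis-aligned boxes stored as (xmin, ymin, xmax, ymax); the box IDs and the 0.5 threshold are placeholders, and the real question of how to design the schema and express this in SQL remains open.

    # Sketch of the matching logic only, not a schema or query.
    def iou(a, b):
        """Intersection over union of two axis-aligned boxes (xmin, ymin, xmax, ymax)."""
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = ((a[2] - a[0]) * (a[3] - a[1]) +
                 (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    def best_matches(ai_boxes, user_boxes, threshold=0.5):
        """For each user box, return the AI box with the highest IOU above threshold."""
        matches = {}
        for u_id, u in user_boxes.items():
            scored = [(iou(u, a), a_id) for a_id, a in ai_boxes.items()]
            score, a_id = max(scored, default=(0.0, None))
            if score > threshold:
                matches[u_id] = (a_id, score)
        return matches

    ai = {"Box1": (0, 0, 10, 10), "Box2": (20, 20, 30, 30)}
    user = {"Box3": (1, 1, 11, 11)}
    print(best_matches(ai, user))   # Box3 pairs with Box1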
This is a repost of the question asked on ai.stackexchange. Since there is not much traction in that forum, I thought I might try my luck here.
I have a dataset of images of varying dimensions of a certain object. A few images of the object are also in varying orientations. The objective is to learn the features of the object (using Autoencoders).
Is it possible to create a network with layers that account for varying dimensions and orientations of the input image, or should I strictly consider a dataset containing images of uniform dimensions? What are the necessary criteria for a dataset to be eligible for training a deep network in general?
The idea is, I want to avoid pre-processing my dataset by normalizing it via scaling, re-orienting operations, etc. I would like my network to account for the variability in dimensions and orientations. Please point me to resources on this.
EDIT:
As an example, consider a dataset consisting of images of bananas. They are of varying sizes, say, 265x525 px, 1200x1200 px, 165x520 px etc. 90% of the images display the banana in one orthogonal orientation (say, front view) and the rest display the banana in varying orientations (say, isometric views).
Almost always people will resize all their images to the same size before sending them to the CNN. Unless you're up for a real challenge this is probably what you should do.
That said, it is possible to build a single CNN that takes inputs of varying dimensions. There are a number of ways you might try to do this, and I'm not aware of any published science analyzing these different choices. The key is that the set of learned parameters needs to be shared between the different input sizes. While convolutions can be applied at different image sizes, ultimately the activations always get converted to a single vector to make predictions with, and the size of that vector depends on the geometries of the inputs, convolutions and pooling layers. You'd probably want to dynamically change the pooling layers based on the input geometry and leave the convolutions the same, since the convolutional layers have parameters and pooling usually doesn't. So on bigger images you pool more aggressively.
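A minimal sketch of that idea using PyTorch's AdaptiveAvgPool2d (my choice of framework, not something this answer specifies): the pooling window grows with the input, so the flattened vector always has the same length.

    # Shared convolutions + adaptive pooling = fixed-length features for any input size.
    import torch
    import torch.nn as nn

    class VariableSizeCNN(nn.Module):
        def __init__(self, n_classes=10):
            super().__init__()
            self.features = nn.Sequential(             # shared conv parameters
                nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            )
            self.pool = nn.AdaptiveAvgPool2d((4, 4))   # output is 4x4 regardless of input size
            self.classifier = nn.Linear(32 * 4 * 4, n_classes)

        def forward(self, x):
            x = self.pool(self.features(x))
            return self.classifier(x.flatten(1))

    net = VariableSizeCNN()
    print(net(torch.randn(1, 3, 265, 525)).shape)    # torch.Size([1, 10])
    print(net(torch.randn(1, 3, 1200, 1200)).shape)  # torch.Size([1, 10])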
Practically you'd want to group similarly (ideally identically) sized images into minibatches for efficient processing. This is common for LSTM-type models. The technique is commonly called "bucketing"; see for example http://mxnet.io/how_to/bucketing.html for a description of how to do this efficiently.
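A framework-agnostic sketch of bucketing might look like this (batch size and sample shapes are arbitrary):

    # Group images of identical size so each minibatch can be stacked into one tensor.
    from collections import defaultdict
    import numpy as np

    def make_buckets(images, batch_size=32):
        """images: list of HxWxC numpy arrays. Yields same-size minibatches."""
        buckets = defaultdict(list)
        for img in images:
            buckets[img.shape[:2]].append(img)        # key on (height, width)
        for same_size in buckets.values():
            for i in range(0, len(same_size), batch_size):
                yield np.stack(same_size[i:i + batch_size])

    images = [np.zeros((265, 525, 3)), np.zeros((265, 525, 3)), np.zeros((1200, 1200, 3))]
    for batch in make_buckets(images):
        print(batch.shape)   # (2, 265, 525, 3) then (1, 1200, 1200, 3)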
Is it possible to create a network with layers that account for varying dimensions and orientations of the input image
The usual way to deal with different images is the following:
You take one or multiple crops of the image to make width = height. If you take multiple crops, you pass all of them through the network and average the results.
You scale the crop(s) to the size which is necessary for the network.
However, there is also Global Average Pooling (e.g. Keras docs).
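As a sketch of that option, a small Keras model with input_shape=(None, None, 3) and GlobalAveragePooling2D accepts variable-size images and still produces a fixed-length vector (the layer sizes here are arbitrary):

    # Fully convolutional model; spatial dimensions are left unspecified and
    # GlobalAveragePooling2D collapses whatever remains into a length-32 vector.
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Conv2D(16, 3, activation="relu", input_shape=(None, None, 3)),
        layers.Conv2D(32, 3, activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(10, activation="softmax"),
    ])
    model.summary()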
What are the necessary criteria for a dataset to be eligible for training a deep network in general?
That is a difficult question to answer, as (1) there are many different approaches in deep learning and the field is quite young, and (2) I'm pretty sure there is no quantitative answer right now.
Here are two rules of thumb:
You should have at least 50 examples per class
The more parameters your model has, the more data you need
Learning curves and validation curves help to estimate the effect of more training data.
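As a sketch of how a learning curve helps with that judgment, here is a scikit-learn example; the classifier and dataset are just stand-ins.

    # Plot training vs. validation score as the training set grows; if the two
    # curves are still converging, more data is likely to help.
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    X, y = load_digits(return_X_y=True)
    sizes, train_scores, val_scores = learning_curve(
        LogisticRegression(max_iter=1000), X, y, cv=5,
        train_sizes=[0.1, 0.3, 0.5, 0.7, 1.0])

    plt.plot(sizes, train_scores.mean(axis=1), label="train")
    plt.plot(sizes, val_scores.mean(axis=1), label="validation")
    plt.xlabel("training examples"); plt.ylabel("accuracy"); plt.legend()
    plt.show()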
I have what sounds like a typical bin-packing problem: x products of differing sizes need to be packed into y containers of differing capacities, minimizing the number of containers used, as well as minimizing the wasted space.
I can simplify the problem in that product sizes and container capacities can be reduced to standard 1-dimensional units. i.e. this product is 1 unit big while that one is 3 units, this box holds 6 units, that one 12. Think of eggs and cartons, or cases of beer.
But there's an additional constraint: each container has a particular attribute (we'll call it colour), and each product has a set of colours it is compatible with. There is no correlation between colour and product/container sizing; one product may be colour-compatible with the entire palette, while another may only be compatible with the red containers.
Is this problem variant already described in literature? If so, what is its name?
I think there is no special name for this variant. Although the coloring constraint first gives the impression it's graph coloring related, it's not. It's simply a limitation on the values for a variable.
In a typical solver implementation, each product (= item) will have a variable indicating which container it's assigned to. The color constraint just reduces the value range for a specific variable. So instead of specifying that all variables use the same value range, make it variable-specific. (For example, in OptaPlanner this is the difference between a value range provided by the solution generally or by the entity specifically.) So the coloring constraint doesn't even need to be a constraint: it can be part of the model in most solvers.
Any solver that can handle bin packing should be able to handle this variant. Your problem is actually a relaxation of the Roadef 2012 Machine Reassignment problem, which is about assigning processes to computers: simply drop all the constraints except for one resource-usage constraint and the constraint that excludes certain processes from certain machines. That use case is implemented in many solvers. (Although, in practice, it is probably easier to start from a basic bin packing example such as Cloud Balancing.)
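As a concrete illustration of "the coloring constraint is just a reduced value range", here is a minimal sketch using OR-Tools CP-SAT (my own choice of solver, not one this answer prescribes); incompatible product/container pairs simply never get an assignment variable.

    # Bin packing with per-product colour compatibility; sample data is made up.
    from ortools.sat.python import cp_model

    sizes = {"p1": 1, "p2": 3, "p3": 2}
    compatible = {"p1": {"red", "blue"}, "p2": {"red"}, "p3": {"blue"}}
    containers = {"c1": ("red", 6), "c2": ("blue", 6)}     # name: (colour, capacity)

    model = cp_model.CpModel()
    assign = {(p, c): model.NewBoolVar(f"{p}_in_{c}")
              for p in sizes for c, (colour, _) in containers.items()
              if colour in compatible[p]}                  # colour rule = reduced domain

    for p in sizes:                                        # every product goes somewhere
        model.Add(sum(assign[p, c] for c in containers if (p, c) in assign) == 1)

    used = {c: model.NewBoolVar(f"used_{c}") for c in containers}
    for c, (_, cap) in containers.items():                 # capacity per container
        model.Add(sum(sizes[p] * assign[p, c] for p in sizes if (p, c) in assign)
                  <= cap * used[c])

    model.Minimize(sum(used.values()))                     # use as few containers as possible

    solver = cp_model.CpSolver()
    if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
        for (p, c), var in assign.items():
            if solver.Value(var):
                print(p, "->", c)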
Most likely 2d bin-packing or classic knapsack problem.
Due to the size, number, and performance of my polygon queries (polygon in polygon) I would like to pre-process my data and separate the polygons into grids. My data is pretty uniform in my area of interest, so something like 12 even grid cells would work well. I may adjust this number later based on performance. Basically I am going to create 12 tables with associated spatial indexes, or possibly just a single table with my grid as the partition key. This should reduce my total index size by 12x and hopefully increase performance. On the query side I will direct the query to the appropriate table.
The key is for me to be able to figure out how to group polygons into these grids. If the polygon falls within multiple grids then I would likely create a record in each and de-duplicate upon query. I wouldn't expect this to happen very often.
Essentially I will have a "grid" that I want to intersect with my polygons to figure out which grid cells each polygon falls in.
Thanks
My process would be something like this:
Find the MIN/MAX ordinate values for your whole data set (both axes)
Extend those values by a margin that seems appropriate (in case the ordinates when combined don't form a regular rectangular shape)
Write a small loop that generates polygons at a set interval within those MIN/MAX ordinates - i.e. create one polygon per grid square
Use SDO_COVERS to see which of the grid squares covers each polygon. If multiple grid squares cover a polygon, you should see multiple matches, as you describe (see the sketch below).
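Here is a small Python/shapely illustration of the same grid-and-assign mechanics (the process above is about Oracle's SDO_COVERS; the sample polygons, the 4 x 3 grid and the intersects test below are my own placeholders):

    # Build a regular grid over the data extent and record every cell each polygon touches.
    from shapely.geometry import box, Polygon

    polygons = {                                        # made-up sample data
        "a": Polygon([(1, 1), (4, 1), (4, 4), (1, 4)]),
        "b": Polygon([(8, 8), (19, 8), (19, 12), (8, 12)]),
    }

    # Steps 1-2: overall extent plus a small margin
    minx = min(p.bounds[0] for p in polygons.values()) - 1
    miny = min(p.bounds[1] for p in polygons.values()) - 1
    maxx = max(p.bounds[2] for p in polygons.values()) + 1
    maxy = max(p.bounds[3] for p in polygons.values()) + 1

    # Step 3: one rectangle per grid cell (4 x 3 cells here, i.e. ~12 grids)
    nx, ny = 4, 3
    dx, dy = (maxx - minx) / nx, (maxy - miny) / ny
    cells = {(i, j): box(minx + i * dx, miny + j * dy,
                         minx + (i + 1) * dx, miny + (j + 1) * dy)
             for i in range(nx) for j in range(ny)}

    # Step 4: a polygon spanning several cells gets several records, de-duplicated at query time
    for name, poly in polygons.items():
        hits = [cell_id for cell_id, cell in cells.items() if cell.intersects(poly)]
        print(name, "->", hits)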
I also agree with your strategy of partitioning the data within a single table. I have heard positive comments about this, but I have never personally tried it. The overhead of going to multiple tables seems like something you'll want to avoid though.
I am working with radio maps that seem to be too fragmented to query efficiently. The response time is 20-40 seconds when I ask if a single point is within the multipolygon (I have tested "within"/"contains"/"overlaps"). I use PostGIS, with GeoDjango to abstract the queries.
The multi-polygon column has a GiST index, and I have tried VACUUM ANALYZE. I use PostgreSQL 8.3.7 and Django 1.2.
The maps stretch over large geographical areas. They were originally generated by a topography-aware radio tool, and the radio cells/polygons are therefore fragmented.
My goal is to query for points within the multipolygons (i.e. houses that may or may not be covered by the signals).
All the radio maps are made up of between 100,000 and 300,000 vertices (total), with a wildly varying number of polygons. Some of the maps have fewer than 10 polygons; from there it jumps to between 10,000 and 30,000 polygons. The ratio of polygons to vertices does not seem to affect the time the queries take to complete very much.
I use a projected coordinate system, and use the same system for both houses and radio sectors. Qgis shows that the radio sectors and maps are correctly placed in the terrain.
My test queries are with only one house at a time within a single radio map. I have tested queries like "within"/"contains"/"overlaps", and the results are the same:
Sub-second response if the house is "far from" the radio map (I guess this is because it is outside the bounding box that is automatically used in the query).
20-40 seconds response time if the house/point is close to or within the radio map.
Do I have alternative ways to optimize queries, or must I change/simplify the source material in some way? Any advice is appreciated.
Hello
The first thing I would do is to split the multipolygons into single polygons and create a new index. Then the index will work a lot more effectively. Right now the whole multipolygon has one big bounding box, and the index can do nothing more than tell whether the house is inside that bounding box. So, the smaller the polygons are in relation to the whole dataset, the more effective the index use. There are even techniques to split single polygons into smaller ones with a grid to make the index part of the query even more effective. But the first thing would be to split the multipolygons into single ones with ST_Dump(). If you have a lot of attributes in the same table, it would be wise to move them into another table and keep only an ID telling which radio map each polygon belongs to; otherwise you will get a lot of duplicated attribute data.
HTH
Nicklas
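To make the bounding-box argument above concrete, here is a quick Python/shapely sketch (mine, not from the answer): a sparse multipolygon has one huge bounding box that cannot rule a point out, while the per-part boxes you get after splitting (what ST_Dump produces on the database side) reject it immediately.

    # Compare the index-level test for the whole multipolygon vs. its parts.
    from shapely.geometry import MultiPolygon, Point, box

    parts = [box(0, 0, 1, 1), box(100, 100, 101, 101)]   # two tiny, far-apart cells
    radio_map = MultiPolygon(parts)

    house = Point(50, 50)                                 # nowhere near either cell

    whole_bbox = box(*radio_map.bounds)
    print(whole_bbox.contains(house))                        # True  -> index cannot rule it out
    print([box(*p.bounds).contains(house) for p in parts])   # [False, False] -> rejected cheaply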