Let's say I have the following model in Java:
class Shape {
String type;
String color;
String size;
}
And say I have the following data based on the model above.
Triangle, Blue, Small
Triangle, Red, Large
Circle, Blue, Small
Circle, Blue, Medium
Square, Green, Medium
Star, Blue, Large
I would like to answer the following questions (a small in-memory sketch follows the list):
Given the type Circle how many unique colors?
Answer: 1
Given the type Circle how many unique sizes?
Answer: 2
Given the color Blue how many unique shapes?
Answer: 3
Given the color Blue how many unique sizes?
Answer: 3
Given the size Small how many unique shapes?
Answer: 2
Given the size Small how many unique colors?
Answer: 1
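Just to pin down what "unique" means here, a tiny in-memory Java sketch of these look-ups over the sample data above (the field names follow the Shape class; requires Java 16+ for records):

import java.util.List;
import java.util.function.Function;

public class UniqueCounts {
    record Shape(String type, String color, String size) {}

    // Count the distinct values of `select` among shapes whose `where` field equals `value`.
    static long countUnique(List<Shape> shapes,
                            Function<Shape, String> where, String value,
                            Function<Shape, String> select) {
        return shapes.stream()
                     .filter(s -> where.apply(s).equals(value))
                     .map(select)
                     .distinct()
                     .count();
    }

    public static void main(String[] args) {
        List<Shape> data = List.of(
            new Shape("Triangle", "Blue", "Small"), new Shape("Triangle", "Red", "Large"),
            new Shape("Circle", "Blue", "Small"),   new Shape("Circle", "Blue", "Medium"),
            new Shape("Square", "Green", "Medium"), new Shape("Star", "Blue", "Large"));

        System.out.println(countUnique(data, Shape::type, "Circle", Shape::color)); // 1
        System.out.println(countUnique(data, Shape::type, "Circle", Shape::size));  // 2
        System.out.println(countUnique(data, Shape::color, "Blue", Shape::type));   // 3
        System.out.println(countUnique(data, Shape::color, "Blue", Shape::size));   // 3
        System.out.println(countUnique(data, Shape::size, "Small", Shape::type));   // 2
        System.out.println(countUnique(data, Shape::size, "Small", Shape::color));  // 1
    }
}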
I'm wondering if I should model it the following way...
set: shapes -> key: type -> bin(s): list of colors, list of sizes
set: colors -> key: color -> bin(s): list of shapes, list of sizes
set: sizes -> key: size -> bin(s): list of shapes, list of colors
Or is there a better way to do this? If I model it this way, I need roughly three times the storage.
I also expect to have billions of entries for each set. By the way, the model has been redacted to protect the innocent code ;)
Data modeling in NoSQL is always about how you plan to retrieve the data, at what throughput and at what latency.
There are several ways to model this data; the simplest is to mimic the class structure where each field becomes a Bin. You could define Secondary Indexes on each bin and use Aggregation Queries to answer your questions (above).
But this is only one way; you may need to satisfy the factors of latency and throughput with a different data model.
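For the simple "one Bin per field" model, here is a rough sketch using the Aerospike Java client (the set/bin/secondary-index wording suggests Aerospike, so treat the client choice and exact calls as an assumption; a server-side aggregation via a stream UDF could do the distinct count instead of pulling records back to the client):

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.query.Filter;
import com.aerospike.client.query.IndexType;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.Statement;

import java.util.HashSet;
import java.util.Set;

public class ShapesExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // One record per shape, one Bin per field -- the "mimic the class" model.
        String[][] data = {
            {"Triangle", "Blue", "Small"}, {"Triangle", "Red", "Large"},
            {"Circle", "Blue", "Small"},   {"Circle", "Blue", "Medium"},
            {"Square", "Green", "Medium"}, {"Star", "Blue", "Large"}
        };
        for (int i = 0; i < data.length; i++) {
            Key key = new Key("test", "shapes", i);
            client.put(null, key,
                new Bin("type", data[i][0]), new Bin("color", data[i][1]), new Bin("size", data[i][2]));
        }

        // Secondary index on the "type" bin so we can query by type.
        client.createIndex(null, "test", "shapes", "idx_shapes_type", "type", IndexType.STRING)
              .waitTillComplete();

        // "Given the type Circle, how many unique sizes?" -- query by type, count distinct values.
        Statement stmt = new Statement();
        stmt.setNamespace("test");
        stmt.setSetName("shapes");
        stmt.setBinNames("size");
        stmt.setFilter(Filter.equal("type", "Circle"));

        Set<String> sizes = new HashSet<>();
        RecordSet rs = client.query(null, stmt);
        try {
            while (rs.next()) {
                sizes.add(rs.getRecord().getString("size"));
            }
        } finally {
            rs.close();
        }
        System.out.println("unique sizes for Circle: " + sizes.size()); // 2
        client.close();
    }
}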
Suppose we have an M*N maze and there are K dogs in different cells of this maze looking for their houses (their unique houses are also located in cells of the maze). In each step, every dog can either stay at its location or move to an adjacent cell in the maze (the eligible moves are: up, down, right, left, if possible). What could be a good state space for this problem?
Unique houses means that each dog has its own specific house located somewhere in the maze.
Two dogs can also stand on the same cell.
I personally think that the sum of Manhattan distances from each dog to its house could be a good heuristic, but I could not define a good state space myself.
Here is a link to a picture of a sample for K=2 and a 5*5 maze:
Example
Because all of the animals are independent (they don't block each other and they have unique individual goals), you shouldn't model the joint actions between all agents. You are really solving K independent pathfinding problems, where each one can use the Manhattan distance heuristic individually, given 4-connected movement. If you solve them jointly, you make the problem exponentially larger when it doesn't have to be.
There are lots of ways of building better heuristics or re-using search information, but that's a different question.
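To make that concrete, here is a minimal Java sketch of A* for a single dog on a 4-connected grid with the Manhattan heuristic (the maze layout, start and goal are made-up inputs; requires Java 16+ for records). Running it once per dog solves the K independent problems:

import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;

public class GridAStar {

    record Cell(int r, int c) {}
    record Node(Cell cell, int f) {}

    static int manhattan(Cell a, Cell b) {
        return Math.abs(a.r() - b.r()) + Math.abs(a.c() - b.c());
    }

    // Returns the length of a shortest path from start to goal, or -1 if the goal is unreachable.
    static int shortestPath(boolean[][] blocked, Cell start, Cell goal) {
        int rows = blocked.length, cols = blocked[0].length;
        int[][] g = new int[rows][cols];                    // best known cost from start
        for (int[] row : g) Arrays.fill(row, Integer.MAX_VALUE);
        g[start.r()][start.c()] = 0;

        PriorityQueue<Node> open = new PriorityQueue<>(Comparator.comparingInt(Node::f));
        open.add(new Node(start, manhattan(start, goal)));
        int[][] moves = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}}; // down, up, right, left

        while (!open.isEmpty()) {
            Node cur = open.poll();
            Cell c = cur.cell();
            if (cur.f() > g[c.r()][c.c()] + manhattan(c, goal)) continue; // stale queue entry
            if (c.equals(goal)) return g[c.r()][c.c()];
            for (int[] m : moves) {
                int nr = c.r() + m[0], nc = c.c() + m[1];
                if (nr < 0 || nr >= rows || nc < 0 || nc >= cols || blocked[nr][nc]) continue;
                int tentative = g[c.r()][c.c()] + 1;
                if (tentative < g[nr][nc]) {
                    g[nr][nc] = tentative;
                    open.add(new Node(new Cell(nr, nc), tentative + manhattan(new Cell(nr, nc), goal)));
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        boolean[][] maze = new boolean[5][5];               // 5x5 maze; true = wall
        maze[2][1] = maze[2][2] = maze[2][3] = true;        // a short wall segment
        // One call per dog: here, a single dog at (0,0) whose house is at (4,4).
        System.out.println(shortestPath(maze, new Cell(0, 0), new Cell(4, 4))); // prints 8
    }
}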
This question already has answers here: How are super- and subtype relationships in ER diagrams represented as tables? (closed as a duplicate)
I am from an optics company. Imagine that we have a flat rectangular glass plate (length x width x thickness), which is a parent product that can be processed into smaller 2D-shaped child products such as a circular, rectangular, or triangular glass plate.
I would like to find a correct way to store the size of each particular shape. For example, a rectangular part is defined by length x width x thickness only, not by diameter. A circle is defined by diameter and thickness, not by length and width. What is the correct way to keep track of this kind of information?
For example, I could have a table containing the following fields:
ChildProductID
2DShapeCategory = {circle, rectangle, triangle}
DiameterOfCircle
LengthOfRect
WidthOfRect
L1OfTriangle
L2OfTriangle
L3OfTriangle
Then for a circular part, the values of that row would be, for example:
productID=1, shape=circle, diameterOfCircle=5, lengthOfRect=0, widthOfRect=0, L1OfTriangle=0, L2OfTriangle=0, L3OfTriangle=0
And for a rectangular part, the values of that row could be, for example:
productID=2, shape=rectangle, diameterOfCircle=0, lengthOfRect=10, widthOfRect=5, L1OfTriangle=0, L2OfTriangle=0, L3OfTriangle=0
But is this the correct way, or a best practice? I feel it is not, and so I need some help and advice.
You're on the right track. Good practice would not include fields that aren't going to hold valid data. A better way is to use three tables: one for circles, one for triangles, and one for rectangles. Your parent product table would hold productID and any fields that are about the parent product itself. Then you'd have these tables:
Circles_table:
productID child_circleID Diameter
Rectangle_table:
productID child_rectID length width
Triangles_table:
productID child_triID L1 L2 L3
The productID field in each child table is a foreign key linking it to the parent product.
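If you happen to map these tables from Java with JPA/Hibernate, the JOINED inheritance strategy produces exactly this layout (one parent table plus one table per subtype, sharing the key). A minimal sketch with hypothetical entity names:

import jakarta.persistence.*;   // javax.persistence in older JPA versions

// Parent table: one row per child product, holding the fields common to all shapes.
@Entity
@Inheritance(strategy = InheritanceType.JOINED)
public class ChildProduct {
    @Id @GeneratedValue
    Long productID;
    double thickness;           // common to every shape
}

@Entity
class CircleProduct extends ChildProduct {      // its own table; PK is also the FK to ChildProduct
    double diameter;
}

@Entity
class RectangleProduct extends ChildProduct {
    double length;
    double width;
}

@Entity
class TriangleProduct extends ChildProduct {
    double l1, l2, l3;
}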
I made a photo mosaic script (PHP). This script takes one picture and turns it into a photo built up of little pictures. From a distance it looks like the real picture; when you move closer you see that it is all little pictures. I take a square of a fixed number of pixels and determine the average color of that square. Then I compare this with my database, which contains the average colors of a couple of thousand pictures. I determine the color distance to all available images. But running this script fully takes a couple of minutes.
The bottleneck is matching the best picture with a part of the main picture. I have been searching online for ways to reduce this and came across "Antipole Clustering." Of course I tried to find some information on how to use this method myself, but I can't seem to figure out what to do.
There are two steps: 1. database acquisition and 2. photomosaic creation.
Let's start with step one; once that is clear, maybe I will understand step 2 myself.
Step 1:
partition each image of the database into 9 equal rectangles arranged in a 3x3 grid
compute the RGB mean values for each rectangle
construct a vector x composed of 27 components (three RGB components for each rectangle)
x is the feature vector of the image in the data structure
Well, points 1 and 2 are easy, but what should I do at point 3? How do I compose a vector x out of the 27 components (9 x (R mean, G mean, B mean))?
And once I manage to compose the vector, what is the next step I should take with it?
Peter
Here is how I think the feature vector is computed:
You have 3 x 3 = 9 rectangles.
Each pixel is essentially 3 numbers, 1 for each of the Red, Green, and Blue color channels.
For each rectangle you compute the mean for the red, green, and blue colors for all the pixels in that rectangle. This gives you 3 numbers for each rectangle.
In total, you have 9 (rectangles) x 3 (mean for R, G, B) = 27 numbers.
Simply concatenate these 27 numbers into a single 27 by 1 (often written as 27 x 1) vector. That is 27 numbers grouped together. This vector of 27 numbers is the feature vector X that represents the color statistics of your photo. In the code, if you are using C++, this will probably be an array of 27 numbers or perhaps even an instance of the (aptly named) vector class. You can think of this feature vector as some form of "summary" of what the color in the photo is like. Roughly, things look like this: [R1, G1, B1, R2, G2, B2, ..., R9, G9, B9] where R1 is the mean/average of red pixels in the first rectangle and so on.
I believe step 2 involves some form of comparison of these feature vectors so that those with similar feature vectors (and hence similar color) will be placed together. The comparison will likely use the Euclidean distance, or some other metric, to measure how similar the feature vectors (and hence the photos' colors) are to each other.
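For concreteness, here is a rough sketch of steps 1-3 and of the Euclidean comparison. The original script is PHP, but the same steps in Java look roughly like this (method names are made up for the example; pixels that do not divide evenly into the 3x3 grid are simply ignored):

import java.awt.image.BufferedImage;

public class ColorFeatures {

    // 27-component feature vector: mean R, G, B for each cell of a 3x3 grid.
    static double[] featureVector(BufferedImage img) {
        double[] x = new double[27];
        int cellW = img.getWidth() / 3, cellH = img.getHeight() / 3;
        for (int gy = 0; gy < 3; gy++) {
            for (int gx = 0; gx < 3; gx++) {
                long r = 0, g = 0, b = 0, n = 0;
                for (int yy = gy * cellH; yy < (gy + 1) * cellH; yy++) {
                    for (int xx = gx * cellW; xx < (gx + 1) * cellW; xx++) {
                        int rgb = img.getRGB(xx, yy);
                        r += (rgb >> 16) & 0xFF;
                        g += (rgb >> 8) & 0xFF;
                        b += rgb & 0xFF;
                        n++;
                    }
                }
                int i = (gy * 3 + gx) * 3;   // layout: [R1, G1, B1, R2, G2, B2, ..., R9, G9, B9]
                x[i] = (double) r / n;
                x[i + 1] = (double) g / n;
                x[i + 2] = (double) b / n;
            }
        }
        return x;
    }

    // Euclidean distance between two feature vectors; smaller means more similar color.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}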
Lastly, as Anony-Mousse suggested, converting your pixels from RGB to HSB/HSV color would be preferable. If you use OpenCV or have access to it, this is simply a one-liner. Otherwise the Wikipedia article on HSV will give you the math formula to perform the conversion.
Hope this helps.
Instead of using RGB, you might want to use HSB space. It gives better results for a wide variety of use cases. Put more weight on hue to get better color matches for photos, or on brightness when composing high-contrast images (logos etc.).
I have never heard of antipole clustering. But the obvious next step would be to put all the images you have into a large index. Say, an R-Tree. Maybe bulk-load it via STR. Then you can quickly find matches.
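For reference, if any of this ends up on the JVM, the RGB-to-HSB conversion is already in the Java standard library; a minimal sketch:

import java.awt.Color;

public class HsbDemo {
    public static void main(String[] args) {
        // Convert an RGB triple to HSB; hue, saturation and brightness come back in [0, 1].
        float[] hsb = Color.RGBtoHSB(30, 144, 255, null);
        System.out.printf("h=%.3f s=%.3f b=%.3f%n", hsb[0], hsb[1], hsb[2]);
    }
}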
Maybe it means vector quantization (VQ). In VQ the image isn't subdivided into rectangles but into density areas. Then you can take the mean point of each cluster. First you need to take all colors and pixels separately and transfer them to a vector with an XY coordinate. Then you can use a density clustering like Voronoi cells and get the mean point. You can compare this point with other pictures in the database. Read about VQ here: http://www.gamasutra.com/view/feature/3090/image_compression_with_vector_.php.
How to compute a gradient vector from adjacent pixels:
d(x) = I(x+1,y) - I(x,y)
d(y) = I(x,y+1) - I(x,y)
Here's another link: http://www.leptonica.com/color-quantization.html.
Update: Once you have computed the mean color of your thumbnails, you can sort all the mean colors in an RGB map and use the formula I gave you to compute the vector x. Now that you have a vector for all your thumbnails, you can use the antipole tree to search for a thumbnail. This is possible because the antipole tree is something like a kd-tree and subdivides the 2D space. Read about the antipole tree here: http://matt.eifelle.com/2012/01/17/qtmosaic-0-2-faster-mosaics/. Maybe you can ask the author and download the source code?
I noticed that LSH seems like a good way to find similar items with high-dimensional properties.
After reading the paper http://www.slaney.org/malcolm/yahoo/Slaney2008-LSHTutorial.pdf, I'm still confused with those formulas.
Does anyone know a blog or article that explains it in an easier way?
The best tutorial I have seen for LSH is in the book: Mining of Massive Datasets.
Check Chapter 3 - Finding Similar Items
http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf
I also recommend the slides below:
http://www.cs.jhu.edu/%7Evandurme/papers/VanDurmeLallACL10-slides.pdf
The example in the slides helped me a lot in understanding hashing for cosine similarity.
I borrowed two slides from Benjamin Van Durme & Ashwin Lall, ACL 2010, and will try to explain the intuition behind LSH families for cosine distance a bit.
In the figure, there are two circles colored red and yellow, representing two two-dimensional data points. We are trying to find their cosine similarity using LSH.
The gray lines are some uniformly randomly picked planes.
Depending on whether the data point lies above or below a gray line, we mark this relation as 0 or 1.
In the upper-left corner, there are two rows of white/black squares, representing the signatures of the two data points respectively. Each square corresponds to a bit: 0 (white) or 1 (black).
So once you have a pool of planes, you can encode the data points by their location relative to the planes. Imagine that as we add more planes to the pool, the angular difference encoded in the signature gets closer to the actual difference, because only planes that lie between the two points will give the two data points different bit values.
Now we look at the signatures of the two data points. As in the example, we use only 6 bits (squares) to represent each data point. This is the LSH hash for the original data we have.
The Hamming distance between the two hashed values is 1, because their signatures differ by only 1 bit.
Considering the length of the signature, we can calculate their angular similarity as shown in the graph.
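To make the plane/signature idea concrete, here is a minimal Java sketch (random hyperplanes through the origin, signatures as bit arrays, and the angle estimated from the Hamming distance); all names are made up for the example:

import java.util.Random;

public class CosineLsh {

    // One random direction per bit of the signature.
    static double[][] randomPlanes(int numBits, int dim, long seed) {
        Random rnd = new Random(seed);
        double[][] planes = new double[numBits][dim];
        for (double[] p : planes)
            for (int d = 0; d < dim; d++)
                p[d] = rnd.nextGaussian();
        return planes;
    }

    // Signature bit i is 1 if the point lies on the positive side of plane i.
    static boolean[] signature(double[][] planes, double[] x) {
        boolean[] sig = new boolean[planes.length];
        for (int i = 0; i < planes.length; i++) {
            double dot = 0;
            for (int d = 0; d < x.length; d++) dot += planes[i][d] * x[d];
            sig[i] = dot >= 0;
        }
        return sig;
    }

    static int hamming(boolean[] a, boolean[] b) {
        int diff = 0;
        for (int i = 0; i < a.length; i++) if (a[i] != b[i]) diff++;
        return diff;
    }

    public static void main(String[] args) {
        double[][] planes = randomPlanes(64, 2, 42);
        double[] red = {1.0, 0.2}, yellow = {0.8, 0.4};

        boolean[] s1 = signature(planes, red), s2 = signature(planes, yellow);
        // The fraction of differing bits estimates (angle / pi), so:
        double angle = Math.PI * hamming(s1, s2) / s1.length;
        System.out.println("estimated cosine similarity = " + Math.cos(angle));
        // More planes (bits) give an estimate closer to the true cosine of the angle.
    }
}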
I have some sample code (just 50 lines) in Python here which uses cosine similarity.
https://gist.github.com/94a3d425009be0f94751
Tweets in vector space can be a great example of high dimensional data.
Check out my blog post on applying Locality Sensitive Hashing to tweets to find similar ones.
http://micvog.com/2013/09/08/storm-first-story-detection/
And because a picture is worth a thousand words, check the picture below:
http://micvog.files.wordpress.com/2013/08/lsh1.png
Hope it helps.
#mvogiatzis
Here's a presentation from Stanford that explains it. It made a big difference for me. Part two is more about LSH, but part one covers it as well.
A picture of the overview (there is much more in the slides):
Near Neighbor Search in High Dimensional Data - Part1:
http://www.stanford.edu/class/cs345a/slides/04-highdim.pdf
Near Neighbor Search in High Dimensional Data - Part2:
http://www.stanford.edu/class/cs345a/slides/05-LSH.pdf
LSH is a procedure that takes as input a set of documents/images/objects and outputs a kind of Hash Table.
The indexes of this table contain the documents such that documents that are on the same index are considered similar and those on different indexes are "dissimilar".
What counts as similar depends on the metric and also on a similarity threshold s, which acts like a global parameter of LSH.
It is up to you to define what the adequate threshold s is for your problem.
It is important to underline that different similarity measures have different implementations of LSH.
In my blog, I tried to thoroughly explain LSH for the cases of minHashing (Jaccard similarity measure) and simHashing (cosine distance measure). I hope you find it useful:
https://aerodatablog.wordpress.com/2017/11/29/locality-sensitive-hashing-lsh/
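For the minHashing case specifically, the core of the signature computation is short; a rough Java sketch under the usual assumptions (items are sets of integer element IDs, and the hash functions have the form (a*x + b) mod p):

import java.util.*;

public class MinHash {
    private static final long P = 2_147_483_647L;   // a large prime
    private final long[] a, b;

    MinHash(int numHashes, long seed) {
        Random rnd = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1);
            b[i] = rnd.nextInt(Integer.MAX_VALUE);
        }
    }

    // Signature: for each hash function, the minimum hash over the set's elements.
    long[] signature(Set<Integer> set) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int x : set)
            for (int i = 0; i < a.length; i++)
                sig[i] = Math.min(sig[i], (a[i] * x + b[i]) % P);
        return sig;
    }

    // The fraction of matching signature positions estimates the Jaccard similarity.
    static double estimateJaccard(long[] s1, long[] s2) {
        int same = 0;
        for (int i = 0; i < s1.length; i++) if (s1[i] == s2[i]) same++;
        return (double) same / s1.length;
    }

    public static void main(String[] args) {
        MinHash mh = new MinHash(200, 7);
        Set<Integer> docA = new HashSet<>(List.of(1, 2, 3, 4, 5, 6, 7, 8));
        Set<Integer> docB = new HashSet<>(List.of(1, 2, 3, 4, 9, 10));
        System.out.println(estimateJaccard(mh.signature(docA), mh.signature(docB)));
        // True Jaccard = |{1,2,3,4}| / |{1..10}| = 0.4; the estimate should be close.
    }
}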
I am a visual person. Here is what works for me as an intuition.
Say each of the things you want to search for approximately is a physical object such as an apple, a cube, or a chair.
My intuition for LSH is that it is similar to taking the shadows of these objects. If you take the shadow of a 3D cube you get a 2D square-like shape on a piece of paper, and a 3D sphere will give you a circle-like shadow on a piece of paper.
Ultimately, there are many more than three dimensions in a search problem (where each word in a text could be one dimension), but the shadow analogy is still very useful to me.
Now we can efficiently compare strings of bits in software. A fixed length bit string is kinda, more or less, like a line in a single dimension.
So with LSH, I ultimately project the shadows of objects as points (0 or 1) on a single fixed-length line/bit string.
The whole trick is to take the shadows such that they still make sense in the lower dimension, e.g. they resemble the original object well enough to be recognized.
A 2D drawing of a cube in perspective tells me this is a cube. But I cannot easily distinguish a 2D square from the shadow of a 3D cube without perspective: they both look like a square to me.
How I present my object to the light will determine if I get a good recognizable shadow or not. So I think of a "good" LSH as the one that will turn my objects in front of a light such that their shadow is best recognizable as representing my object.
So to recap: I think of things to index with an LSH as physical objects like a cube, a table, or a chair, and I project their shadows into 2D and ultimately along a line (a bit string). And a "good" LSH "function" is how I present my objects in front of a light to get an approximately distinguishable shape in the 2D flatland and later in my bit string.
Finally, when I want to check whether an object I have is similar to some objects that I indexed, I take the shadows of this "query" object using the same way of presenting my object in front of the light (eventually ending up with a bit string too). And now I can compare how similar that bit string is to all my other indexed bit strings, which is a proxy for searching over my whole objects, provided I found a good and recognizable way to present my objects to my light.
As a very short, tldr answer:
An example of locality-sensitive hashing could be to first place planes randomly (with a rotation and offset) in your input space, then drop the points you want to hash into that space, and for each plane record whether the point is above or below it (e.g. 0 or 1); those bits together are the hash. Points that are similar in space (measured with the cosine distance) will then have similar hashes.
You can read through this example, which uses scikit-learn: https://github.com/guillaume-chevalier/SGNN-Self-Governing-Neural-Networks-Projection-Layer
I'm wondering what is the best way to model something like the following.
Let's say my company sells metal bars (parameters/fields are: length, profile_type, quantity, etc.) of different profiles, where profiles may be pipe (pipe_diameter, wall_thickness) or hollow_rectangle (base, height, wall_thickness), or maybe some other profile with different parameters. Let's say the maximum number of profiles would be 12, each profile having between 2 and 5 parameters.
Should everything be in a single table like
table_bars: id, length, quantity, profile_type, pipe_diameter, wall_thickness, base, height, etc., where profile_type would be one of (pipe, rectangle, etc.),
or should every shape have its own table with its own parameters, and table_bars keep only id, length, quantity, profile_type and profile_id?
And are there any Django-specific issues if multiple tables are the best answer?
Thanks
This is a classic issue and comes down to your own judgment. I had a case like this many years ago and went for generalizing the parameters, in one table. Then I made a second table that contained descriptions for the parameters based on the profile, and flags for which were required, optional, or ignored.
Your various parameters might be:
Length: universal for all, no?
WidthOuter1: might be height of square or rectangle, radius of pipe, etc.
WidthOuter2: ignored for pipe and square, required for rectangle
WidthInner1: ignored for solid objects, required for hollow. Treated just like WidthOuter1, i.e., radius for pipes, dimension of square, first dimension for rectangles
WidthInner2: same idea as WidthOuter2 and WidthInner1
...And perhaps your other properties would yield to the same treatment.
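Language aside (the same idea ports to Django models), here is a rough Java sketch of this generalized-parameters-plus-description-table approach; all names are hypothetical and it requires Java 16+ for records:

import java.util.List;

public class BarModel {

    enum Usage { REQUIRED, OPTIONAL, IGNORED }

    // One row of the description table: what a generic column means for a given profile.
    record ParamSpec(String profileType, String genericColumn, String meaning, Usage usage) {}

    // One row of the bars table: generic columns interpreted via the specs above.
    record Bar(long id, double length, int quantity, String profileType,
               Double widthOuter1, Double widthOuter2, Double widthInner1, Double widthInner2) {}

    public static void main(String[] args) {
        List<ParamSpec> specs = List.of(
            new ParamSpec("pipe", "widthOuter1", "pipe_diameter", Usage.REQUIRED),
            new ParamSpec("pipe", "widthInner1", "inner diameter (from wall_thickness)", Usage.REQUIRED),
            new ParamSpec("pipe", "widthOuter2", "", Usage.IGNORED),
            new ParamSpec("hollow_rectangle", "widthOuter1", "base", Usage.REQUIRED),
            new ParamSpec("hollow_rectangle", "widthOuter2", "height", Usage.REQUIRED));

        Bar pipe = new Bar(1, 6000.0, 10, "pipe", 60.0, null, 54.0, null);
        // Validation idea: every REQUIRED generic column for this profile must be non-null.
        System.out.println(pipe + " described by " +
            specs.stream().filter(s -> s.profileType().equals(pipe.profileType())).toList());
    }
}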