Similarity calculation in a database

I have sets, each containing a variable number of elements. They are represented in a database as shown below (this is a very simplistic example).
I have two problems:
How can I efficiently calculate the similarity between the sets?
How should I represent the calculated similarity in the database?
Note that the current algorithm will not scale well because of the n! complexity.
Note: I can change the representation of the database as well as the algorithm used to calculate the similarity.
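The question does not name a similarity measure, but for sets a common choice is the Jaccard index (size of the intersection divided by size of the union). Below is a minimal Python sketch, not from the original question: the two-column (set_id, element) layout, the toy data, and the choice of Jaccard are all assumptions for illustration, and the pairwise loop is only meant to show the calculation, not to solve the scaling problem.

```python
from itertools import combinations

# Hypothetical input: rows of (set_id, element), as they might come out of a
# simple two-column table.  The real schema from the question is not shown.
rows = [
    (1, "a"), (1, "b"), (1, "c"),
    (2, "b"), (2, "c"),
    (3, "x"), (3, "y"), (3, "b"),
]

# Group elements by set id.
sets = {}
for set_id, element in rows:
    sets.setdefault(set_id, set()).add(element)

def jaccard(a, b):
    """Jaccard index: |intersection| / |union|, a common set-similarity measure."""
    return len(a & b) / len(a | b)

# All pairwise similarities; this is quadratic in the number of sets, so a
# real solution still needs some form of indexing or blocking to scale.
similarities = {
    (i, j): jaccard(sets[i], sets[j])
    for i, j in combinations(sorted(sets), 2)
}
print(similarities)
```

One way to store the result would be a three-column table (set_a, set_b, similarity) with one row per pair, but whether that is acceptable depends on how many pairs actually need to be materialized.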

Related

Is there an algorithm to sort or filter a database table by distance to a vector, other than the naive one?

Say I have a large database table whose entries are vectors. I wish to do a search and sort by distance to a vector. The naive way is to compute, on every query, the distance between my vector and each of the vectors in the database, and then sort by that distance.
Is there any other known algorithm for doing this, perhaps involving some type of indexing in advance?
Alternatively, are there known implementations of such algorithms, for say SQL or Elasticsearch?
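No answer is recorded for this question here, but the usual pre-indexing approach for low-dimensional vectors is a spatial index such as a k-d tree or ball tree; for high-dimensional vectors, approximate methods (e.g., locality-sensitive hashing) are more common. The sketch below contrasts the naive scan with a k-d tree query using SciPy; the data, dimensions, and sizes are made up purely for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

# Made-up data: 100,000 database vectors of dimension 8.
rng = np.random.default_rng(0)
db = rng.normal(size=(100_000, 8))
query = rng.normal(size=8)

# Naive approach: compute every distance, then sort.
dists = np.linalg.norm(db - query, axis=1)
naive_top10 = np.argsort(dists)[:10]

# Index-based approach: build the k-d tree once up front, then answer each
# query without touching most of the data (works well in low dimensions).
tree = cKDTree(db)
_, tree_top10 = tree.query(query, k=10)

assert set(naive_top10) == set(tree_top10)
```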

Tf-idf with large or small corpus size

"The gist of using the Tf-Idf method with a large corpus is that the larger the corpus, the more unique the term weights become. This is because increasing the number of documents in the corpus (or the length of the documents) lowers the probability that two terms in the corpus end up with duplicate weight values. In other words, the weights in the Tf-Idf scheme can act as a fingerprint. With a small corpus, Tf-Idf cannot make that distinction, since there is a high chance of finding two terms with the same weights because they share the same source documents with the same frequency in each document. Depending on corpus size, this property can work either for or against the Tf-Idf weighting scheme in the plagiarism-detection field."
This is what I have deduced about the tf-idf technique. Is it true?
Are there any links or documents that can support my conclusion?
After 4 years of waiting for an answer, I can say the answer is yes :)
This can actually be shown quite simply, as in the following picture: we have 4 documents, with the TF and TF-IDF tables for each term below them.
When we have a small corpus (few documents), the probability that some terms share the same distribution is high (here, "air" and "quality"), and because of this their tf-idf values are identical; see the table above.
But when we have a corpus with a huge number of documents, it is much less probable that two terms share the same distribution across the whole corpus.
Note: I used this website to calculate Tf-Idf: https://remykarem.github.io/tfidf-demo/
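The picture and tables from the original answer are not reproduced here, but the effect is easy to recreate. The sketch below uses scikit-learn's TfidfVectorizer rather than the linked web demo, on a made-up four-document corpus in which "air" and "quality" occur in exactly the same documents with the same counts, so their tf-idf columns come out identical; in a larger corpus such coincidences become much rarer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus (not the one from the original picture): "air" and
# "quality" appear in exactly the same documents, with the same counts.
docs = [
    "air quality is poor today",
    "the air quality index dropped",
    "traffic was heavy today",
    "the index dropped again",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()
terms = list(vec.get_feature_names_out())

air = X[:, terms.index("air")]
quality = X[:, terms.index("quality")]

# Same document distribution => identical tf-idf weights in this small corpus.
print(air)
print(quality)
```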

Array partitioning — Minimizing maximum sum vs. minimizing absolute value difference of sums

Are these two problems isomorphic to each other? (1) Finding the position in an array that splits it into two partitions so as to minimize the maximum of the two partitions' sums. (2) Finding the position in an array that splits it into two partitions so as to minimize the absolute difference of the two partitions' sums.
It intuitively seems that both these problems essentially want the sums of the two partitions to be as close to each other as possible.
However, the former seems to have an O(lg N) solution through binary search, while the latter is the NP-complete partition problem, which only has a pseudo-polynomial-time dynamic programming algorithm.
Is there a case where the partition point for these two problems would not be the same?
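No answer is recorded here, but a short identity makes the intuition in the question precise for the single-split-point formulation (it says nothing about the general partition problem, where the two parts need not be contiguous):

```latex
% Let S be the total sum and p the sum of the left partition, so the right
% partition sums to S - p.  Using max(a, b) = (a + b + |a - b|) / 2:
\[
  \max\bigl(p,\; S - p\bigr)
    \;=\; \frac{S + \lvert p - (S - p) \rvert}{2}
    \;=\; \frac{S}{2} + \frac{\lvert 2p - S \rvert}{2}.
\]
% Since S is fixed, minimizing the maximum of the two sums and minimizing the
% absolute difference of the two sums pick out the same split point p.
```

For a single split point the two objectives therefore share the same minimizer; the NP-complete version of the problem only arises when elements may be assigned to either part arbitrarily rather than by a contiguous split.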

Building a "sparse" lookup array minimizing memory footprint

Let's say I want to build an array to perform a lookup when parsing network protocols (like an EtherType). Since such an identifier is two bytes long, I would end up with a 2^16-cell array if I used direct indexing: this is a real waste, because the array is very likely to be sparse, i.e. full of gaps.
In order to reduce memory usage as much as possible, I would use a perfect hash function generator like CMPH, so that I can map my "n" identifiers to an n-sized array without any collisions. The downside of this approach is that I have to rely on an external, rather "esoteric" library.
I am wondering whether, in my case, there are smarter ways to get a constant-time lookup while keeping memory usage at bay; bear in mind that I am interested in indexing 16-bit unsigned numbers and that the set size is quite limited.
Thanks
Since you know for a fact that you're dealing with 16-bit values, any lookup algorithm will be a constant-time algorithm, since there are only O(1) different possible values. Consequently, algorithms that on the surface might be slower (for example, linear search, which runs in O(n) for n elements) might actually be useful here.
Barring a perfect hash function, if you want to guarantee fast lookups, I would suggest looking into cuckoo hashing, which guarantees worst-case O(1) lookup times and has expected O(1)-time insertion (though you have to be a bit clever with your hash functions). It's really easy to generate hash functions for 16-bit values; if you pick two 16-bit multipliers, multiply the high and low bytes of the 16-bit value by those multipliers, and add the results together, I believe that you get a good hash function mod any prime number.
Alternatively, if you don't absolutely have to have O(1) lookup and are okay with good expected lookup times, you could also use a standard hash table with open addressing, such as a linear probing hash table or double hashing hash table. Using a smaller array with this sort of hashing scheme could be extremely fast and should be very simple to implement.
For an entirely different approach, if you're storing sparse data and want fast lookup times, an option that might work well for you is to use a simple balanced binary search tree. For example, the treap data structure is easy to implement and gives expected O(log n) lookups for values. Since you're dealing with 16-bit values, here log n is about 16 (I think the base of the logarithm is actually a bit different), so lookups should be quite fast. This does introduce a bit of overhead per element, but if you have only a few elements it should be simple to implement. For even less overhead, you might want to look into splay trees, which require only two pointers per element.
Hope this helps!
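As a concrete illustration of the open-addressing suggestion above, here is a minimal linear-probing table sized to the number of identifiers rather than to 2^16. It is written in Python purely for readability (the question presumably concerns C), and the multiplier, table size, and example EtherType values are illustrative choices, not recommendations.

```python
EMPTY = None

class SmallLookup:
    """Tiny open-addressing (linear probing) table for 16-bit keys."""

    def __init__(self, capacity=64):
        # Capacity is chosen relative to the number of identifiers, not 2^16.
        self.capacity = capacity
        self.keys = [EMPTY] * capacity
        self.values = [None] * capacity

    def _probe(self, key):
        # Simple multiplicative hash of the 16-bit key, then linear probing.
        i = (key * 40503) % self.capacity
        while self.keys[i] is not EMPTY and self.keys[i] != key:
            i = (i + 1) % self.capacity
        return i

    def insert(self, key, value):
        i = self._probe(key)
        self.keys[i] = key
        self.values[i] = value

    def lookup(self, key):
        i = self._probe(key)
        return self.values[i] if self.keys[i] == key else None

# Example with a few well-known EtherType identifiers.
table = SmallLookup()
table.insert(0x0800, "IPv4")
table.insert(0x86DD, "IPv6")
table.insert(0x0806, "ARP")
print(table.lookup(0x86DD))   # "IPv6"
print(table.lookup(0x1234))   # None (not present)
```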

Combination of percentiles from different data sets: how can this be accomplished?

I need to compute the Nth percentiles of a series of related, but segmented data sets.
The combined data sets are too large to compute all at once due to memory limitations, but the framework to perform piece-wise calculations is already in place. So how might I perform calculations on each data set and then combine those calculations to find the percentile that I need?
Other information about the data:
The data often has outliers.
The individual data sets tend to be roughly the same size, but not always.
The individual data sets are not expected to share the same distribution.
Could I compute the combined median, mean, and standard deviation and then estimate any percentile from there?
The median, mean and standard deviation alone are unlikely to be enough, especially if you have outliers.
If exact percentiles are required, this is a parallel computation problem. Some work has been done in this direction, such as in the parallel mode of the C++ STL library.
If only approximate percentiles are required then Cross Validated has a question -- Estimation of quantile given quantiles of subset -- that suggests a subsampling approach. You would take some (but not all) datapoints from each dataset, make a new combined dataset that is small enough to fit on a single machine and compute the percentiles of that.
Another approximate approach, efficient if percentiles of each segment are already available, would be to approximate the cumulative distribution function of each segment as a step function from the percentiles. Then the overall distribution would be a finite mixture of the segment distributions and the cumulative distribution function a weighted sum of the segment cumulative distribution functions. The quantile function (i.e., the percentiles) could be computed by numerically inverting the cumulative distribution function.
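Here is a minimal sketch of that last approach, under the assumption that each segment only ships its precomputed percentiles and its size: build a step-function CDF per segment, take their size-weighted mixture, and invert it numerically by bisection. The segment data, the percentile grid, and the iteration count are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up segments with different sizes and distributions.
segments = [rng.normal(0, 1, 50_000), rng.lognormal(0, 1, 80_000)]
sizes = np.array([len(s) for s in segments], dtype=float)
weights = sizes / sizes.sum()

# Per-segment summaries: each worker ships only these percentiles and its size.
probs = np.linspace(0, 100, 101)
seg_percentiles = [np.percentile(s, probs) for s in segments]

def mixture_cdf(x):
    """Size-weighted sum of per-segment step-function CDFs evaluated at x."""
    cdf = 0.0
    for w, q in zip(weights, seg_percentiles):
        # Fraction of this segment's percentile grid lying at or below x.
        cdf += w * np.searchsorted(q, x, side="right") / len(q)
    return cdf

def mixture_quantile(p, iters=60):
    """Approximate the p-th quantile by bisecting the mixture CDF."""
    lo = min(q[0] for q in seg_percentiles)
    hi = max(q[-1] for q in seg_percentiles)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mixture_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Compare against the exact combined percentile (feasible here only because
# this toy data fits in memory).
exact = np.percentile(np.concatenate(segments), 90)
approx = mixture_quantile(0.90)
print(exact, approx)
```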
