Is there an algorithm to sort or filter a database table by distance to a vector, other than the naive one?

Say I have a large database table whose entries are vectors. I wish to search it and sort the results by distance to a query vector. The naive way is to compute, on every query, the distance between my vector and each of the vectors in the database, and then sort by that distance.
Is there any other known algorithm for doing this, perhaps involving some type of indexing in advance?
Alternatively, are there known implementations of such algorithms, for, say, SQL or Elasticsearch?
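For concreteness, here is a minimal sketch of the naive approach described above, assuming Euclidean distance and an in-memory copy of the table (C++; the names are only illustrative):
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>
// Squared Euclidean distance; the square root is not needed for ranking.
// Assumes a and b have the same length.
double squaredDistance(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}
// Return the indices of all rows, sorted by distance to the query vector.
// Cost: O(n * d) for the distances plus O(n log n) for the sort.
std::vector<std::size_t> sortByDistance(const std::vector<std::vector<double>>& rows,
                                        const std::vector<double>& query) {
    std::vector<std::pair<double, std::size_t>> keyed(rows.size());
    for (std::size_t i = 0; i < rows.size(); ++i)
        keyed[i] = {squaredDistance(rows[i], query), i};
    std::sort(keyed.begin(), keyed.end());
    std::vector<std::size_t> order(rows.size());
    for (std::size_t i = 0; i < rows.size(); ++i)
        order[i] = keyed[i].second;
    return order;
}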

Related

How to efficiently index vectors of 2000 values in PostgreSQL and find nearest neighbours?

I have feature vectors of more than 2000 values. For example, say I have 10,000 vectors of 2000 decimal values each. I need to index them and find the nearest neighbours for a query vector. Can I index them using R-trees in PostgreSQL? If so, how can I do it? Or is there any other way, or any other DB, for doing this efficiently?
Check this out if Elasticsearch is an option. You can do distributed searches in vector space: https://blog.mimacom.com/elastic-cosine-similarity-word-embeddings/
An R-tree on 2000 dimensions will probably be much worse than a sequential scan. Your best bet might be to store the table data in an index in a format preorganized for computational speed, and then resign yourself to scanning the whole index. This is, in concept, what bloom indexes do (with a full scan of the index; the organization and computation of the data are of course different).
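As a rough illustration of the "keep the data in a computation-friendly layout, then scan all of it" idea, here is a sketch assuming cosine similarity on pre-normalized vectors stored in one contiguous array; the layout is purely illustrative and not what PostgreSQL or Elasticsearch actually do internally:
#include <cstddef>
#include <vector>
// Rows are stored back to back in one contiguous buffer (n * dim doubles),
// already L2-normalized, so cosine similarity reduces to a dot product.
struct FlatIndex {
    std::size_t dim;
    std::vector<double> data;  // size = n * dim
    std::size_t size() const { return data.size() / dim; }
    // Dot product of the query with row i.
    double score(const std::vector<double>& query, std::size_t i) const {
        const double* row = data.data() + i * dim;
        double s = 0.0;
        for (std::size_t k = 0; k < dim; ++k) s += row[k] * query[k];
        return s;
    }
    // Full scan: return the row with the highest cosine similarity.
    // Assumes the index holds at least one row.
    std::size_t mostSimilar(const std::vector<double>& query) const {
        std::size_t best = 0;
        double bestScore = score(query, 0);
        for (std::size_t i = 1; i < size(); ++i) {
            double s = score(query, i);
            if (s > bestScore) { bestScore = s; best = i; }
        }
        return best;
    }
};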

Similarity calculation in a database

I have sets, each containing a non-constant number of elements. They are represented in a database as shown below (this is a very simplistic example).
I have two problems:
How to efficiently calculate the similarity?
How to represent the calculated similarity in a database?
Note that the current algorithm will not scale well because of the n! complexity.
Note: I can change the representation of the database as well as the algorithm used to calculate the similarity.
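Since the example table layout is not reproduced here, this is only a hypothetical sketch of the similarity part: it assumes the sets have already been loaded into memory as std::set<int> and that "similarity" means Jaccard similarity (intersection over union), which is one common choice, not necessarily the one intended:
#include <algorithm>
#include <iterator>
#include <set>
#include <vector>
// Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|.
double jaccard(const std::set<int>& a, const std::set<int>& b) {
    if (a.empty() && b.empty()) return 1.0;
    std::vector<int> intersection;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(intersection));
    double inter = static_cast<double>(intersection.size());
    double uni = static_cast<double>(a.size() + b.size()) - inter;
    return inter / uni;
}
Comparing every pair of sets this way is still quadratic in the number of sets; the sketch only shows the per-pair calculation.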

Performance of vectors in loops in c++

I have a for loop with 100,000 iterations, each involving a simple distance calculation between some objects' positions. This is all part of a sophisticated collision detection mechanism.
Too many unnecessary iterations are inefficient and slow down the program. I would like to decrease the calculation time by using sorted vectors.
So, as an alternative, I thought of reducing the iterations to a minimum by inserting references to the elements into a 2D vector that sorts the positions into a "grid". Instead of 100,000 iterations, I would only have perhaps 1,000 iterations with the same calculation, but now involving only a particular "sector". The downside, however, is that the 2D vector needs to be regularly updated with each object's grid or sector position, using push_back and erase, whenever an object changes its position.
My question is about performance, not the code. Is updating vectors with erase and push_back quicker than the brute-force iteration? I just need a rough estimate of whether this idea is worth pursuing. Thanks.
What you are looking for is binary space partitioning. This way for any given object, finding one colliding object is O(log N), where N is the number of objects. This should cut your collision detection cost from O(N²) to O(N log N).
If you're doing distance calculations between mostly static objects, you could use quadtrees to reduce the number of checks.
If a lot of the objects are moving, then "loose quadtrees" are a better option.
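For a rough idea of the grid ("sector") approach mentioned in the question, here is a minimal sketch assuming 2D positions, a fixed cell size, and integer object ids; the bucket container and key packing are illustrative, not a tuned implementation:
#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <vector>
// Uniform grid for broad-phase collision checks: objects are bucketed by cell,
// so a query only inspects the object's own cell and its 8 neighbours.
struct Grid {
    double cellSize;
    std::unordered_map<std::uint64_t, std::vector<int>> cells;  // cell key -> object ids
    std::uint64_t key(double x, double y) const {
        auto cx = static_cast<std::int32_t>(std::floor(x / cellSize));
        auto cy = static_cast<std::int32_t>(std::floor(y / cellSize));
        // Pack the two 32-bit cell coordinates into one 64-bit key.
        return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(cx)) << 32)
             | static_cast<std::uint32_t>(cy);
    }
    void insert(int id, double x, double y) { cells[key(x, y)].push_back(id); }
    // Candidate ids near (x, y); exact distance checks still happen afterwards.
    std::vector<int> nearby(double x, double y) const {
        std::vector<int> result;
        for (int dx = -1; dx <= 1; ++dx)
            for (int dy = -1; dy <= 1; ++dy) {
                auto it = cells.find(key(x + dx * cellSize, y + dy * cellSize));
                if (it != cells.end())
                    result.insert(result.end(), it->second.begin(), it->second.end());
            }
        return result;
    }
};
When an object moves to a different cell, it has to be removed from its old bucket and inserted into the new one, which is the bookkeeping cost the question asks about.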

Combination of percentiles from different data sets: how can this be accomplished?

I need to compute the Nth percentiles of a series of related, but segmented data sets.
The combined data sets are too large to compute all at once due to memory limitations, but the framework to perform piece-wise calculations is already in place. So how might I perform calculations on each data set and then combine those calculations to find the percentile that I need?
Other information about the data:
The data often has outliers.
The individual data sets tend to be roughly the same size, but not always.
The individual data sets are not expected to share the same distribution.
Could I compute the combined median, mean, and standard deviation and then estimate any percentile from there?
The median, mean and standard deviation alone are unlikely to be enough, especially if you have outliers.
If exact percentiles are required, this is a parallel computation problem. Some work has been done in this direction, such as in the parallel mode of the C++ STL library.
If only approximate percentiles are required then Cross Validated has a question -- Estimation of quantile given quantiles of subset -- that suggests a subsampling approach. You would take some (but not all) datapoints from each dataset, make a new combined dataset that is small enough to fit on a single machine and compute the percentiles of that.
Another approximate approach, efficient if percentiles of each segment are already available, would be to approximate the cumulative distribution function of each segment as a step function from the percentiles. Then the overall distribution would be a finite mixture of the segment distributions and the cumulative distribution function a weighted sum of the segment cumulative distribution functions. The quantile function (i.e., the percentiles) could be computed by numerically inverting the cumulative distribution function.
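A minimal sketch of that last approach, assuming each segment is summarized by a sorted list of its own percentile values plus an observation count (the struct and function names are made up for illustration):
#include <algorithm>
#include <vector>
// A segment summarized by its own percentile values (e.g. the 1st..99th
// percentiles, sorted ascending) and by how many observations it contains.
struct Segment {
    std::vector<double> percentiles;  // sorted, non-empty
    double count;                     // number of observations (mixture weight)
};
// Step-function approximation of a segment's CDF: the fraction of its stored
// percentile values that are <= x.
double segmentCdf(const Segment& s, double x) {
    auto it = std::upper_bound(s.percentiles.begin(), s.percentiles.end(), x);
    return static_cast<double>(it - s.percentiles.begin())
         / static_cast<double>(s.percentiles.size());
}
// Weighted mixture CDF of all segments.
double mixtureCdf(const std::vector<Segment>& segs, double x) {
    double num = 0.0, den = 0.0;
    for (const auto& s : segs) { num += s.count * segmentCdf(s, x); den += s.count; }
    return num / den;
}
// Invert the mixture CDF by bisection to get the value at quantile q (0 < q < 1).
// Assumes segs is non-empty.
double mixtureQuantile(const std::vector<Segment>& segs, double q) {
    double lo = segs[0].percentiles.front(), hi = segs[0].percentiles.back();
    for (const auto& s : segs) {
        lo = std::min(lo, s.percentiles.front());
        hi = std::max(hi, s.percentiles.back());
    }
    for (int i = 0; i < 60; ++i) {
        double mid = 0.5 * (lo + hi);
        if (mixtureCdf(segs, mid) < q) lo = mid; else hi = mid;
    }
    return hi;
}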

Why does searching an index have logarithmic complexity?

Is an index not similar to a dictionary? If you have the key, you can immediately access it?
Apparently indexes are sometimes stored as B-Trees... why is that?
Dictionaries are not implicitly sorted; B-Trees are.
A B-Tree index can be used for ranged access, like this:
WHERE col1 BETWEEN value1 AND value2
or ordering, like this:
ORDER BY col1
You cannot immediately access a page in a B-Tree index: you need to traverse a chain of child pages whose length grows logarithmically with the number of entries.
Some databases support dictionary-type indexes as well, namely, HASH indexes, in which case the search time is constant. But such indexes cannot be used for ranged access or ordering.
Database Indices are usually (almost always) stored as B-Trees. And all balanced tree structures have O(log n) complexity for searching.
'Dictionary' is an 'Abstract Data Type' (ADT), i.e., it is a functional description that does not designate an implementation. Some dictionaries could use a hash table for O(1) lookup, others could use a tree and achieve O(log n).
The main reason a DB uses B-trees (over any other kind of tree) is that B-trees are self-balancing and very 'shallow' (requiring little disk I/O).
One of the only data structures you can access immediately with a key is a vector, which, for a massive amount of data, becomes inefficient when you start inserting and removing elements. It also needs contiguous memory allocation.
A hash can be efficient but needs more space and will end up with collisions.
A B-tree has a good balance between search performance and space.
If your only queries are equality tests then, it's true, dictionaries are a good choice, since they do lookups in amortized O(1) time. However, if you want to extend queries to range checks, e.g. (select * from students where age > 10), then dictionaries suddenly lose their advantage completely. This is where tree-based indexes come in. With a tree-based index you can perform equality and range checks in logarithmic time.
There is one problem with naive tree structures: they get unbalanced. This means that after adding certain values to the index, the tree structure becomes lopsided (e.g. like a long chain) and lookups start to take O(N) time again. This can be resolved by balancing the tree. The B-Tree is one such approach; it also takes advantage of systems capable of doing large blocks of I/O, and so is the most appropriate for databases.
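The same trade-off is easy to see with the standard C++ containers; in this small illustration std::map stands in for a tree-based index (the student data is made up):
#include <iostream>
#include <map>
#include <string>
int main() {
    // std::map is a balanced tree (typically red-black), so it keeps keys
    // sorted and supports both point lookups and range scans in O(log n).
    std::map<int, std::string> byAge = {
        {8, "Ann"}, {11, "Bob"}, {14, "Cem"}, {9, "Dee"}
    };
    // Equality test: O(log n) here, O(1) amortized in a hash-based container.
    auto it = byAge.find(11);
    if (it != byAge.end()) std::cout << it->second << " is 11\n";
    // Range check (age > 10): possible only because the keys are ordered.
    // A hash-based container (std::unordered_map) would have to scan everything.
    for (auto r = byAge.upper_bound(10); r != byAge.end(); ++r)
        std::cout << r->first << " " << r->second << "\n";
}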
You can achieve O(1) if you pre-allocate an array of N entries and hash the key onto these N slots.
But then, if you store more than N entries, there are collisions. So for each slot in the array you keep a list of values, and it's not exactly O(1) anymore: scanning the list itself is O(m), where m is the average number of collisions.
Example with hash = n mod 3
0 --> [0,a] [3,b] ...
1 --> [1,a] [4,b] [7,b] ...
2 --> [2,a] [5,b] ...
At some point it gets so bad that you spend more time traversing the list of values for a potential key than you would with another structure that has O(log n) lookup time, where n is the total number of entries.
You could of course pick N so big that the array/hash would be faster than the B-Tree. But the array has a fixed pre-allocated size. So if N=1000 and you store 3 values, you've wasted 997 slots in the array.
So it's essentially a performance/space trade-off. For a small set of values, an array with hashing is excellent. For a large set of values, a B-Tree is more efficient.
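A compact sketch of that scheme (separate chaining over a pre-allocated array of N buckets; the key type, values, and N are illustrative, and keys are assumed non-negative):
#include <cstddef>
#include <list>
#include <utility>
#include <vector>
// Fixed-size hash table with separate chaining, mirroring the "hash = n mod 3"
// example above: N buckets are pre-allocated, and colliding keys share a bucket list.
struct ChainedHash {
    std::vector<std::list<std::pair<int, char>>> buckets;
    explicit ChainedHash(std::size_t n) : buckets(n) {}
    void insert(int key, char value) {
        buckets[static_cast<std::size_t>(key) % buckets.size()].push_back({key, value});
    }
    // O(1) to find the bucket, plus O(m) to scan its chain (m = collisions per bucket).
    const char* find(int key) const {
        for (const auto& kv : buckets[static_cast<std::size_t>(key) % buckets.size()])
            if (kv.first == key) return &kv.second;
        return nullptr;
    }
};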
Hashes are the fastest lookup data structures, but they have some problems:
a) they are not sorted
b) no matter how good the hash is, it will have collisions, which become problematic with lots of data
c) finding a hash value in a hash-indexed file takes a long time, so most of the time hashes make sense only for in-memory (RAM) data, which makes them unsuitable for databases, which usually cannot fit all their data in RAM
Sorted trees address these issues, and B-tree operations in particular can be implemented efficiently using files. The only drawback is that they have slower lookup times than a hash structure.
No data structure is perfect in all cases; depending on the estimated size of the data and how you use it, one or the other is better.
A hash index (e.g. in MySQL and PostgreSQL) has constant complexity (O(1)) for searches.
CREATE INDEX ... USING HASH
