Related
I have skew when I keyBy on my data. Let's say the key is:
case class MyKey(x: X, y:Y)
To solve this I am thinking of adding an extra field that would make distribution even among the workers by using this field only for partitioning:
case class MyKey(z: evenlyDistributedField, x: X, y:Y) extends MyKey(x, y) {
override def hashCode(): Int = z.hashCode
}
due to this line my records will use the overridden hashCode and be distributed evenly to each worker and use the original equals method (that takes into consideration only the X and Y fields) to find the proper keyed state in later stateful operators.
I know that same (X, Y) pairs will end in different workers, but I can handle that later. (after making the necessary processing with my new key to avoid skew).
My question is where else is the hashCode method of the Key is used?
I suspect for sure when getting keyed state (what is namespace btw?) as I saw extending classes use the key in a hashMap to get the state for this key. I know that retrieving the KeyedState from the map will be slower as as the hashCode will not consider the X, Y fields. But is there any other place in the flink code that uses the hashcode method of the key?
Is there any other way to solve this? I thought of physical partitioning but I cannot use keyBy as well afaik.
SUMMING UP I WANT TO:
partition my data in each worker randomly to produce an even distribution
[EDITED] do a .window().aggregate() in each partition independently from one another (as if the others dont exists). The data in each window aggregate should be keyed on (X,Y)s of this partition ignoring same (X,Y) keys in other partitions.
merge the conflicts due to same (X,Y) pairs appearing in different partition later (This i need not guidance. I just do a new key by on (X, Y))
In this situation I usually create a transient Tuple2<MyKey, Integer>, where I fill in the Tuple.f1 field with whatever I want to use to partition by. The map or flatMap operation following the .keyBy() can emit MyKey. That avoids mucking with MyKey.hashCode().
And note that having a different set of fields for the hashCode() vs. equals() methods leads to pain and suffering. Java has a contract that says "equals consistency: objects that are equal to each other must return the same hashCode".
[updated]
If you can't offload a significant amount of unkeyed work, then what I would do is...
Set the Integer in the Tuple2<MyKey, Integer> to be hashCode(MyKey) % <operator parallelism * factor>. Assuming your parallelism * factor is high enough, you'll only get a few cases of 2 (or more) of the groups going to the same sub-task.
In the operator, use MapState<MyKey, value> to store state. You'll need this since you'll get multiple unique MyKey values going to the same keyed group.
Do your processing and emit a MyKey from this operator.
By using hashCode(MyKey) % some value, you should get a pretty good mix of unique MyKey values going to each sub-task, which should mitigate skew. Of course if one value dominates, then you'll need another approach, but since you haven't mentioned this I'm assuming it's not the case.
I need to randomize Solr (6.6.2) search results, but the order needs to be consistent given a specific seed. This is for a paginated search that returns a limited result set from a much larger one, so I must do the ordering at the query level and not at the application level once the data has been fetched.
Initially I tried this:
https://localhost:8984/solr/some_index/select?q=*:*&sort=random_999+ASC
Where 999 is a constant that is fed in when constructing the query prior to sending it to Solr. The constant value changes for each new search.
This solution works. However, when I run the query a few times, or run it on different Solr instances, the ordering is different.
After doing some reading, random_ generates a number via:
fieldName.hashCode() + context.docBase + (int)top.getVersion()
This means that when the random number is generated, it takes the index version into account. This becomes problematic when using a distributed architecture or when indexes are updated, as is well explained here.
There are various recommended solutions online, but I am trying to avoid writing a custom random override. Is there some type of trick where I can feed in some type of function or equation to the sort param?
For example:
min(999,random_999)
Though this always results in the same order, even when either of the values change.
This question is somewhat similar to this other question, but not quite.
I searched for answers on SO containing solr.RandomSortField, and while they point out what the issue is, none of them have a solution. It seems the best way would be to override the solr.RandomSortField logic, but it's not clear how.
Prior Research
https://lucene.472066.n3.nabble.com/Random-sorting-and-result-consistency-across-successive-calls-based-on-seed-td4170508.html
Solr: Random sort order after index version change
https://mail-archives.apache.org/mod_mbox/lucene-dev/201811.mbox/%3CJIRA.13196983.1541639245000.300557.1541639520069#Atlassian.JIRA%3E
Solr - Return random results (Sort by Random)
https://realize.be/blog/random-results-apache-solr-and-drupal
https://lucene.472066.n3.nabble.com/Sorting-with-customized-function-of-score-td3987281.html
Even after implementing a custom random sort field, the results still differed across instances of Solr.
I ended up adding a new field that is populated at index time which is a 32 bit hash of an ID field that already existed in the document.
I then built a "stateless" linear congruential generator to produce a set of acceptably random numbers to use for sorting:
?sort=mod(product(hash_int_id,{seedConstant},982451653), 104395301) asc
Since this function technically passes a new seed for each row, and because it does not store state (like rand.Next() would), this solution is admittedly inferior and it is not a true PRNG; however, it does seem to get me most of the way there. Note that you will have to tune your values depending on the size of your data set and the size of the values in your hash_int_id equivalent field.
I need a way of storing sets of arbitrary size for fast query later on.
I'll be needing to query the resulting data structure for subsets or sets that are already stored.
===
Later edit: To clarify, an accepted answer to this question would be a link to a study that proposes a solution to this problem. I'm not expecting for people to develop the algorithm themselves.
I've been looking over the tuple clustering algorithm found here, but it's not exactly what I want since from what I understand it 'clusters' the tuples into more simple, discrete/aproximate forms and loses the original tuples.
Now, an even simpler example:
[alpha, beta, gamma, delta] [alpha, epsilon, delta] [gamma, niu, omega] [omega, beta]
Query:
[alpha, delta]
Result:
[alpha, beta, gama, delta] [alpha, epsilon, delta]
So the set elements are just that, unique, unrelated elements. Forget about types and values. The elements can be tested among them for equality and that's it. I'm looking for an established algorithm (which probably has a name and a scientific paper on it) more than just creating one now, on the spot.
==
Original examples:
For example, say the database contains these sets
[A1, B1, C1, D1], [A2, B2, C1], [A3, D3], [A1, D3, C1]
If I use [A1, C1] as a query, these two sets should be returned as a result:
[A1, B1, C1, D1], [A1, D3, C1]
Example 2:
Database:
[Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red]
[Distance to Berlin: 240km, car paint: blue, number of car seats: 2]
[number of car seats: 2, Gasoline amount: 2L]
Query:
[Distance to berlin: 240km]
Result
[Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red]
[Distance to Berlin: 240km, car paint: blue, number of car seats: 2]
There can be an unlimited number of 'fields' such as Gasoline amount. A solution would probably involve the database grouping and linking sets having common states (such as Gasoline amount: 240) in such a way that the query is as efficient as possible.
What algorithms are there for such needs?
I am hoping there is already an established solution to this problem instead of just trying to find my own on the spot, which might not be as efficient as one tested and improved upon by other people over time.
Clarifications:
If it helps answer the question, I'm intending on using them for storing states:
Simple example:
[Has milk, Doesn't have eggs, Has Sugar]
I'm thinking such a requirement might require graphs or multidimensional arrays, but I'm not sure
Conclusion
I've implemented the two algorithms proposed in the answers, that is Set-Trie and Inverted Index and did some rudimentary profiling on them. Illustrated below is the duration of a query for a given set for each algorithm. Both algorithms worked on the same randomly generated data set consisting of sets of integers. The algorithms seem equivalent (or almost) performance wise:
I'm confident that I can now contribute to the solution. One possible quite efficient way is a:
Trie invented by Frankling Mark Liang
Such a special tree is used for example in spell checking or autocompletion and that actually comes close to your desired behavior, especially allowing to search for subsets quite conveniently.
The difference in your case is that you're not interested in the order of your attributes/features. For your case a Set-Trie was invented by Iztok Savnik.
What is a Set-Tree? A tree where each node except the root contains a single attribute value (number) and a marker (bool) if at this node there is a data entry. Each subtree contains only attributes whose values are larger than the attribute value of the parent node. The root of the Set-Tree is empty. The search key is the path from the root to a certain node of the tree. The search result is the set of paths from the root to all nodes containing a marker that you reach when you go down the tree and up the search key simultaneously (see below).
But first a drawing by me:
The attributes are {1,2,3,4,5} which can be anything really but we just enumerate them and therefore naturally obtain an order. The data is {{1,2,4}, {1,3}, {1,4}, {2,3,5}, {2,4}} which in the picture is the set of paths from the root to any circle. The circles are the markers for the data in the picture.
Please note that the right subtree from root does not contain attribute 1 at all. That's the clue.
Searching including subsets Say you want to search for attributes 4 and 1. First you order them, the search key is {1,4}. Now startin from root you go simultaneously up the search key and down the tree. This means you take the first attribute in the key (1) and go through all child nodes whose attribute is smaller or equal to 1. There is only one, namely 1. Inside you take the next attribute in the key (4) and visit all child nodes whose attribute value is smaller than 4, that are all. You continue until there is nothing left to do and collect all circles (data entries) that have the attribute value exactly 4 (or the last attribute in the key). These are {1,2,4} and {1,4} but not {1,3} (no 4) or {2,4} (no 1).
Insertion Is very easy. Go down the tree and store a data entry at the appropriate position. For example data entry {2.5} would be stored as child of {2}.
Add attributes dynamically Is naturally ready, you could immediately insert {1,4,6}. It would come below {1,4} of course.
I hope you understand what I want to say about Set-Tries. In the paper by Iztok Savnik it's explained in much more detail. They probably are very efficient.
I don't know if you still want to store the data in a database. I think this would complicate things further and I don't know what is the best to do then.
How about having an inverse index built of hashes?
Suppose you have your values int A, char B, bool C of different types. With std::hash (or any other hash function) you can create numeric hash values size_t Ah, Bh, Ch.
Then you define a map that maps an index to a vector of pointers to the tuples
std::map<size_t,std::vector<TupleStruct*> > mymap;
or, if you can use global indices, just
std::map<size_t,std::vector<size_t> > mymap;
For retrieval by queries X and Y, you need to
get hash value of the queries Xh and Yh
get the corresponding "sets" out of mymap
intersect the sets mymap[Xh] and mymap[Yh]
If I understand your needs correctly, you need a multi-state storing data structure, with retrievals on combinations of these states.
If the states are binary (as in your examples: Has milk/doesn't have milk, has sugar/doesn't have sugar) or could be converted to binary(by possibly adding more states) then you have a lightning speed algorithm for your purpose: Bitmap Indices
Bitmap indices can do such comparisons in memory and there literally is nothing in comparison on speed with these (ANDing bits is what computers can really do the fastest).
http://en.wikipedia.org/wiki/Bitmap_index
Here's the link to the original work on this simple but amazing data structure: http://www.sciencedirect.com/science/article/pii/0306457385901086
Almost all SQL databases supoort Bitmap Indexing and there are several possible optimizations for it as well(by compression etc.):
MS SQL: http://technet.microsoft.com/en-us/library/bb522541(v=sql.105).aspx
Oracle: http://www.orafaq.com/wiki/Bitmap_index
Edit:
Apparently the original research work on bitmap indices is no longer available for free public access.
Links to recent literature on this subject:
Bitmap Index Design Choices and Their Performance
Implications
Bitmap Index Design and Evaluation
Compressing Bitmap Indexes for Faster Search Operations
This problem is known in the literature as subset query. It is equivalent to the "partial match" problem (e.g.: find all words in a dictionary matching A??PL? where ? is a "don't care" character).
One of the earliest results in this area is from this paper by Ron Rivest from 19761. This2 is a more recent paper from 2002. Hopefully, this will be enough of a starting point to do a more in-depth literature search.
Rivest, Ronald L. "Partial-match retrieval algorithms." SIAM Journal on Computing 5.1 (1976): 19-50.
Charikar, Moses, Piotr Indyk, and Rina Panigrahy. "New algorithms for subset query, partial match, orthogonal range searching, and related problems." Automata, Languages and Programming. Springer Berlin Heidelberg, 2002. 451-462.
This seems like a custom made problem for a graph database. You make a node for each set or subset, and a node for each element of a set, and then you link the nodes with a relationship Contains. E.g.:
Now you put all the elements A,B,C,D,E in an index/hash table, so you can find a node in constant time in the graph. Typical performance for a query [A,B,C] will be the order of the smallest node, multiplied by the size of a typical set. E.g. to find {A,B,C] I find the order of A is one, so I look at all the sets A is in, S1, and then I check that it has all of BC, since the order of S1 is 4, I have to do a total of 4 comparisons.
A prebuilt graph database like Neo4j comes with a query language, and will give good performance. I would imagine, provided that the typical orders of your database is not large, that its performance is far superior to the algorithms based on set representations.
Hashing is usually an efficient technique for storage and retrieval of multidimensional data. Problem is here that the number of attributes is variable and potentially very large, right? I googled it a bit and found Feature Hashing on Wikipedia. The idea is basically the following:
Construct a hash of fixed length from each data entry (aka feature vector)
The length of the hash must be much smaller than the number of available features. The length is important for the performance.
On the wikipedia page there is an implementation in pseudocode (create hash for each feature contained in entry, then increase feature-vector-hash at this index position (modulo length) by one) and links to other implementations.
Also here on SO is a question about feature hashing and amongst others a reference to a scientific paper about Feature Hashing for Large Scale Multitask Learning.
I cannot give a complete solution but you didn't want one. I'm quite convinced this is a good approach. You'll have to play around with the length of the hash as well as with different hashing functions (bloom filter being another keyword) to optimize the speed for your special case. Also there might still be even more efficient approaches if for example retrieval speed is more important than storage (balanced trees maybe?).
I am working in a chemistry/biology project. We are building a web-application for fast matching of the user's experimental data with predicted data in a reference database. The reference database will contain up to a million entries. The data for one entry is a list (vector) of tuples containing a float value between 0.0 and 20.0 and an integer value between 1 and 18. For instance (7.2394 , 2) , (7.4011, 1) , (9.9367, 3) , ... etc.
The user will enter a similar list of tuples and the web-app must then return the - let's say - top 50 best matching database entries.
One thing is crucial: the search algorithm must allow for discrepancies between the query data and the reference data because both can contain small errors in the float values (NOT in the integer values). (The query data can contain errors because it is derived from a real-life experiment and the reference data because it is the result of a prediction.)
Edit - Moved text to answer -
How can we get an efficient ranking of 1 query on 1 million records?
You should add a physicist to the project :-) This is a very common problem to compare functions e.g. look here:
http://en.wikipedia.org/wiki/Autocorrelation
http://en.wikipedia.org/wiki/Correlation_function
In the first link you can read: "The SEQUEST algorithm for analyzing mass spectra makes use of autocorrelation in conjunction with cross-correlation to score the similarity of an observed spectrum to an idealized spectrum representing a peptide."
An efficient linear scan of 1 million records of that type should take a fraction of a second on a modern machine; a compiled loop should be able to do it at about memory bandwidth, which would transfer that in a two or three milliseconds.
But, if you really need to optimise this, you could construct a hash table of the integer values, which would divide the job by the number of integer bins. And, if the data is stored sorted by the floats, that improves the locality of matching by those; you know you can stop once you're out of tolerance. Storing the offsets of each of a number of bins would give you a position to start.
I guess I don't see the need for a fancy algorithm yet... describe the problem a bit more, perhaps (you can assume a fairly high level of chemistry and physics knowledge if you like; I'm a physicist by training)?
Ok, given the extra info, I still see no need for anything better than a direct linear search, if there's only 1 million reference vectors and the algorithm is that simple. I just tried it, and even a pure Python implementation of linear scan took only around three seconds. It took several times longer to make up some random data to test with. This does somewhat depend on the rather lunatic level of optimisation in Python's sorting library, but that's the advantage of high level languages.
from cmath import *
import random
r = [(random.uniform(0,20), random.randint(1,18)) for i in range(1000000)]
# this is a decorate-sort-undecorate pattern
# look for matches to (7,9)
# obviously, you can use whatever distance expression you want
zz=[(abs((7-x)+(9-y)),x,y) for x,y in r]
zz.sort()
# return the 50 best matches
[(x,y) for a,x,y in zz[:50]]
Can't you sort the tuples and perform binary search on the sorted array ?
I assume your database is done once for all, and the positions of the entries is not important. You can sort this array so that the tuples are in a given order. When a tuple is entered by the user, you just look in the middle of the sorted array. If the query value is larger of the center value, you repeat the work on the upper half, otherwise on the lower one.
Worst case is log(n)
If you can "map" your reference data to x-y coordinates on a plane there is a nifty technique which allows you to select all points under a given distance/tolerance (using Hilbert curves).
Here is a detailed example.
One approach we are trying ourselves which allows for the discrepancies between query and reference is by binning the float values. We are testing and want to offer the user the choice of different bin sizes. Bin sizes will be 0.1 , 0.2 , 0.3 or 0.4. So binning leaves us with between 50 and 200 bins, each with a corresponding integer value between 0 and 18, where 0 means there was no value within that bin. The reference data can be pre-binned and stored in the database. We can then take the binned query data and compare it with the reference data. One approach could be for all bins, subtract the query integer value from the reference integer value. By summing up all differences we get the similarity score, with the the most similar reference entries resulting in the lowest scores.
Another (simpler) search option we want to offer is where the user only enters the float values. The integer values in both query as reference list can then be set to 1. We then use Hamming distance to compute the difference between the query and the reference binned values. I have previously asked about an efficient algorithm for that search.
This binning is only one way of achieving our goal. I am open to other suggestions. Perhaps we can use Principal Component Analysis (PCA), as described here
I have a system that stores vectors and allows a user to find the n most similar vectors to the user's query vector. That is, a user submits a vector (I call it a query vector) and my system spits out "here are the n most similar vectors." I generate the similar vectors using a KD-Tree and everything works well, but I want to do more. I want to present a list of the n most similar vectors even if the user doesn't submit a complete vector (a vector with missing values). That is, if a user submits a vector with three dimensions, I still want to find the n nearest vectors (stored vectors are of 11 dimensions) I have stored.
I have a couple of obvious solutions, but I'm not sure either one seem very good:
Create multiple KD-Trees each built using the most popular subset of dimensions a user will search for. That is, if a user submits a query vector of thee dimensions, x, y, z, I match that query to my already built KD-Tree which only contains vectors of three dimensions, x, y, z.
Ignore KD-Trees when a user submits a query vector with missing values and compare the query vector to the vectors (stored in a table in a DB) one by one using something like a dot product.
This has to be a common problem, any suggestions? Thanks for the help.
Your first solution might be fastest for queries (since the tree-building doesn't consider splits in directions that you don't care about), but it would definitely use a lot of memory. And if you have to rebuild the trees repeatedly, it could get slow.
The second option looks very slow unless you only have a few points. And if that's the case, you probably didn't need a kd-tree in the first place :)
I think the best solution involves getting your hands dirty in the code that you're working with. Presumably the nearest-neighbor search computes the distance between the point in the tree leaf and the query vector; you should be able to modify this to handle the case where the point and the query vector are different sizes. E.g. if the points in the tree are given in 3D, but your query vector is only length 2, then the "distance" between the point (p0, p1, p2) and the query vector (x0, x1) would be
sqrt( (p0-x0)^2 + (p1-x1)^2 )
I didn't dig into the java code that you linked to, but I can try to find exactly where the change would need to go if you need help.
-Chris
PS - you might not need the sqrt in the equation above, since distance squared is usually equivalent.
EDIT
Sorry, didn't realize it would be so obvious in the source code. You should use this version of the neighbor function:
nearest(double [] key, int n, Checker<T> checker)
And implement your own Checker class; see their EuclideanDistance.java to see the Euclidean version. You may also need to comment out any KeySizeException that the query code throws, since you know that you can handle differently sized keys.
Your second option looks like a reasonable solution for what you want.
You could also populate the missing dimensions with the most important( or average or whatever you think it should be) values if there are any.
You could try using the existing KD tree -- by taking both branches when the split is for a dimension that is not supplied by the source vector. This should take less time than doing a brute force search, and might be less trouble than trying to maintain a bunch of specialized trees for dimension subsets.
You would need to adapt your N-closest algorithm (without more info I can't advise you on that...), and for distance you would use the sum of the squares of only those elements supplied by the source vector.
Here's what I ended up doing: When a user didn't specify a value (when their query vector lacked a dimension), I I simply adjusted my matching range (in the API) to something huge so that I match any value.