A perfect hash for known values - c

Say I have some known values, against which I want to create a hash table. For example,
For 0x78409 -> 1
For 0x89934 -> 2
For 0x89834 -> 3
etc...
But these values (0x78409, 0x89934, 0x89834) are only known at runtime, so switch/case cannot be used. However, they become known at the beginning of execution, so maybe we can create a hash function that adapts itself to produce a perfect hash table. So my question is: can we create a perfect hash function for such a case?

If the entire domain of inputs is known before the hashmap is created, then this is possible. One route is runtime code generation, either via a VM or a JIT (probably through a scripting language, such as LuaJIT): that would let you use gperf and its ilk to generate a hash at runtime, compile it, then use it to fill and retrieve from the map.
An easier, more viable solution is to use a hash function with an extremely low collision rate for the given set of inputs (e.g. you might only be using lowercase alphabetical characters), or a minimal perfect hash.
Murmur3 and CrapWow are the ones to look at (though I'd be cautious with CrapWow); Google's CityHash and xxHash are also worth a look. Bob Jenkins also has a good minimal-perfect-hash-based map available here, which should do just fine as well.
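For a small key set that only becomes known at startup, a perfect hash can also be found by brute force, without any code generation: try seeds for a simple seeded hash until no two keys share a slot. The following is a minimal sketch of that idea, not the gperf or Bob Jenkins algorithm. The names `mix`, `find_seed` and `lookup` and the mixing constant are illustrative assumptions, and the search is only practical for small sets.

```python
MASK32 = 0xFFFFFFFF

def mix(key, seed, size):
    """Seeded multiplicative hash of an integer key into range(size)."""
    x = ((key ^ seed) * 0x9E3779B1) & MASK32
    x ^= x >> 15
    return x % size

def find_seed(keys, size=None, max_tries=1_000_000):
    """Try seeds until every key lands in its own slot (brute force)."""
    keys = list(keys)
    if len(set(keys)) != len(keys):
        raise ValueError("keys must be distinct")
    size = size or len(keys)  # size == len(keys) makes the hash minimal
    for seed in range(max_tries):
        if len({mix(k, seed, size) for k in keys}) == len(keys):
            return seed, size
    raise RuntimeError("no collision-free seed found; try a larger table size")

# Values from the question, known only at startup.
known = [0x78409, 0x89934, 0x89834]
seed, size = find_seed(known)

# Store the key next to the value so unknown keys can be rejected.
table = [None] * size
for value, key in enumerate(known, start=1):
    table[mix(key, seed, size)] = (key, value)

def lookup(key):
    """Return the value mapped to key, or None if key is not in the set."""
    entry = table[mix(key, seed, size)]
    return entry[1] if entry is not None and entry[0] == key else None

print(lookup(0x89934))
```

Each lookup costs one hash and one comparison. The stored key is what lets the table reject inputs outside the known set, since a perfect hash by itself maps every input to some slot.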

Wikipedia gives this page. But are you sure you want a perfect hash function? Perhaps a good and fast hash function can be enough?

Related

Creating New Matching Logic in Informatica (Ratcliffe - Obershelp)

I am conducting a matching project in Informatica 10.2.1 wherein I need to identify matching strings within product descriptions. Ratcliffe-Obershelp is the matching strategy I need to implement.
I've heard Ratcliffe-Obershelp yields better results than Jaro-Winkler, but I am not sure how to code this into a transformation in Informatica since it is not built in.
No code to show as I don't even know where to start.
I'd expect this to be a transformation/group of transformations that would reproduce the matching score that Ratcliffe-Obershelp creates on a per-line basis.
If I understand correctly, the matching logic performs operations in a loop iterating over the input strings. It is not possible to implement such a "loop over string" in an Expression Transformation using built-in functions. I see two options:
1. Create a DECODE function with multiple conditions for each possible length. This will be ugly, and it is only feasible if we assume the comparison starts at the beginning of each string; implementing full substring comparison would be... so ugly I can't imagine :)
2. Use a Java Transformation. As much as I hate putting Java into mappings, there are some cases where it's justified, and this looks like one of the few. Here's some JS reference
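To make the "loop over the input strings" concrete, here is a minimal sketch of the scoring usually described as Ratcliff/Obershelp: find the longest common substring, recurse on the pieces to its left and right, and return twice the number of matched characters divided by the combined length. It is written in Python purely for illustration; the function names are assumptions, and a Java Transformation would need the same logic ported to Java.

```python
def longest_common_substring(a, b):
    """Return (start_a, start_b, length) of the longest common substring."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i, ca in enumerate(a):
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b):
            if ca == cb:
                # Extend the common run that ended at a[i-1], b[j-1].
                cur[j + 1] = prev[j] + 1
                if cur[j + 1] > best[2]:
                    n = cur[j + 1]
                    best = (i - n + 1, j - n + 1, n)
        prev = cur
    return best

def matching_chars(a, b):
    """Count matched characters: the anchor plus matches on both sides."""
    i, j, n = longest_common_substring(a, b)
    if n == 0:
        return 0
    left = matching_chars(a[:i], b[:j])
    right = matching_chars(a[i + n:], b[j + n:])
    return n + left + right

def ratcliff_obershelp(a, b):
    """Similarity in [0, 1]: 2 * matches / (len(a) + len(b))."""
    if not a and not b:
        return 1.0
    return 2.0 * matching_chars(a, b) / (len(a) + len(b))

print(round(ratcliff_obershelp("WIKIMEDIA", "WIKIMANIA"), 4))  # 0.7778
```

For "WIKIMEDIA" and "WIKIMANIA" the anchor is "WIKIM" (5 characters), the right-hand remainders share "IA" (2 characters), and the score is 2 * 7 / 18.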

how to use pulp to generate variables and constraints of sparse matrix?

Hi there,
I am new to pulp. I learned pulp from some examples I found online. These examples are very helpful, and now I am able to write simple models by myself. But I still find it difficult to build complex models, especially models with sparse matrices.
Could you please post some complex examples with sparse matrices and complex constraints? I want to learn how to create only the necessary variables, instead of the simple approach, such as y = LpVariable.dicts("y", (Factorys, Customers) ,0,1,LpBinary).
I have another question: what happens if I simply use y = LpVariable.dicts("y", (Factorys, Customers) ,0,1,LpBinary) to define the variables, most of which are useless in the model's objective function and constraints, and then add constraints that explicitly set those useless variables to 0? Is pulp able to identify such useless variables and remove them first, and then run the integer programming algorithm (such as B&B or B&C) on the reduced problem? If so, it looks like the "setting useless variables to 0" method will not decrease the solution speed at all. Am I right?
This may help
http://www.stuartmitchell.com/journal/2012/2/3/my-top-n-tips-for-python-coding-in-optimisation-1.html
In particular, first generate a sparse set of factory-customer pairs.
factories_customers = [(f,c) for f in factories for c in customers
                       if <insert your condition here>]
Then use
y = LpVariable.dicts("y", factories_customers ,0,1,LpBinary)
Pulp does not remove "useless" variables and constraints, so the model build time will be long.
However, the solution algorithms (CBC by default) contain pre-solve algorithms that will remove the variables.
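Once the variables are keyed by sparse pairs, the constraints also have to sum over existing pairs only. One way to do that is to build per-factory and per-customer lookups from the same pair list. The sketch below uses plain Python so it runs without pulp installed. The data, the `allowed` set and the names `customers_of` and `factories_of` are made up for illustration, and the pulp calls appear only in comments.

```python
from collections import defaultdict

factories = ["F1", "F2", "F3"]
customers = ["C1", "C2", "C3", "C4"]

# Hypothetical sparsity rule: only these routes exist in the model.
allowed = {("F1", "C1"), ("F1", "C2"), ("F2", "C2"), ("F3", "C3"), ("F3", "C4")}

# Sparse index set: 5 pairs instead of the full 3 * 4 = 12.
factories_customers = [(f, c) for f in factories for c in customers
                       if (f, c) in allowed]

# Lookups so each constraint touches only variables that exist.
customers_of = defaultdict(list)
factories_of = defaultdict(list)
for f, c in factories_customers:
    customers_of[f].append(c)
    factories_of[c].append(f)

# With pulp, the sparse keys would be used along these lines (not run here):
#   y = LpVariable.dicts("y", factories_customers, 0, 1, LpBinary)
#   for c in customers:
#       prob += lpSum(y[(f, c)] for f in factories_of[c]) == 1

print(len(factories_customers), "variables instead of",
      len(factories) * len(customers))
```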

Best way to store redis keys

I am using Redis to store some information and detect changes in that information over time (for example, think users and locations). What is the value of using a longer or shorter key name? A longer key is clearer, but is there much memory or performance cost to using longer key names?
Here are examples:
SET L:123456 "<name> <latitude> <longitude> ..."
HSET U:987654321 loc 123456 time <epoch>
or
SET loc:{123456} "<name> <latitude> <longitude> ..."
HSET user:{U987654321} loc 123456 time <epoch>
It all depends on how you are going to use it.
If every byte counts, for example when you have to pay for each kB transferred to a cloud service, you can calculate the costs. The maths is simple; a byte is a byte 'on the wire'. Inside redis, for larger values it is equally simple. For smaller values, Redis does some memory optimization.
In your HSET example, you split out the members, which only makes sense if you need them separately from each other most of the time. A better approach -might- be: HSET user:data 987654321 '{"loc": "123456", "time": "2014-01-01T13:00:00"}'. Separate keys/members 'cost' a lot more than longer strings, performance-wise. You can even put a whole table or dataset in one member if it's only going to be used as one complete semi-static entity.
Speed and Size: There is a notable difference between keys and values.
Keys:
Shorter keys are generally more efficient in both memory and speed. If you use a Redis Sorted Set you can even use 'numbers' as keys (sorted set 'members' plus 'scores'). I say 'numbers' because a score is technically a float64, but to be used as an ID it has to be between -999999999999999 and 999999999999999 inclusive (that's 15 digits), without any fractional part. This can be really helpful, since Redis does fast and scalable O(log(n)) on-the-fly sorting of Sorted Sets (using skiplists, simplified).
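The 15-digit limit follows from how a float64 stores integers: every integer up to 2^53 (9007199254740992, a 16-digit number) is represented exactly, so any 15-digit ID is safe, while larger values can silently collapse onto a neighbour. A quick check in Python, whose float is a float64 (the helper name `survives_float64` is just for illustration):

```python
def survives_float64(n):
    """True if integer n converts to float64 and back unchanged."""
    return int(float(n)) == n

LIMIT = 2 ** 53  # 9007199254740992

print(survives_float64(999_999_999_999_999))   # largest 15-digit ID
print(survives_float64(-999_999_999_999_999))
print(survives_float64(LIMIT + 1))             # rounds to LIMIT, so it is lost
```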
Values:
The MsgPack format (uncompressed) takes up the least space, especially if you store the definitions once and the values many times. JSON is a bit less memory efficient, but is of course such a common IPC format that it should not be left out. Raw strings, character-separated, fixed-length (ugh), whatever you desire: it's all possible. You can always compress your data before storing it in Redis. So much for memory efficiency. When it comes to speed, it's less simple. If you want to use Lua server-side scripting (which you should), you can't do anything with compressed data. JSON and MsgPack can be deserialized, but only 'as a whole', which is fine in most scenarios. The most flexible option is storing separate values (for example as members of a HSET), but this comes at a price as well (most of the time: too high a price). You can also combine all of these. What we use most: a prefix of two or three delimiter-separated values, followed by a MsgPack payload.
My general advice is: start with using only HSETs and ZSETs, don't split out data that belongs together, use descriptive PascalCased names of 10-25 characters for your keys, use ':' if you need delimiters in your keys (namespaces), serialize as JSON (for simplicity, but code for easy switching to MsgPack), and use Lua scripting (even if you don't know Lua, the subset you use in Redis is tiny).
I wouldn't worry about it too much in the startup phase of your project; you can always change it later on and do some A/B comparisons as soon as you have some interpolatable data.
Hope this helps, TW
Now that Redis v3.2 is almost here, you should consider switching to the new geo hashing functionality: http://redis.io/commands/geoadd

how to do fuzzy search in big data

I'm new to this area and I am mostly wondering what the state of the art is and where I can read about it.
Let's assume that I just have a key/value store and I have some distance(key1,key2) defined somehow (not sure if it must be a metric, i.e. if the triangle inequality must hold always).
What I want is mostly a search(key) function which returns all items with keys up to a certain distance from the search key. Maybe that distance limit is configurable. Maybe this is also just a lazy iterator. Maybe there can also be a count limit, and an item (key,value) is in the returned set with some probability P, where P = 1/distance(key,search-key) or so (i.e., the perfect match would certainly be in the set and close matches at least with high probability).
One example application is fingerprint matching in MusicBrainz. They use the AcoustID fingerprint and have defined this compare function. They use the PostgreSQL GIN index and, I guess (although I haven't fully read or understood the acoustid-server code), the GIN partial match algorithm, but I haven't fully understood whether that is what I asked for and how it works.
For text, what I have found so far is to use some phonetic algorithm to simplify words based on their pronunciation. An example is here. This is mostly to break the search-space down to a smaller space. However, that has several limitations, e.g. it must still be a perfect match in the smaller space.
But anyway, I am also searching for a more generic solution, if that exists.
There is no (fast) generic solution; each application will need a different approach.
Neither of the two examples actually does traditional nearest neighbor search. AcoustID (I'm the author) is just looking for exact matches, but it searches in a very high number of hashes in hope that some of them will match. The phonetic search example uses metaphone to convert words to their phonetic representation and is also only looking for exact matches.
You will find that if you have a lot of data, exact search using huge hash tables is the only thing you can realistically do. The problem then becomes how to convert your fuzzy matching to exact search.
A common approach is to use locality-sensitive hashing (LSH) with a smart hashing method, but as you can see in your two examples, sometimes you can get away with an even simpler approach.
Btw, if you are looking specifically for text search, the simplest way to do it is to split your input into N-grams and index those. Depending on how your distance function is defined, that might give you the right candidate matches without too much work.
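The N-gram suggestion can be sketched as an inverted index from each trigram to the keys that contain it: a fuzzy query then becomes a set of exact lookups, and the keys sharing the most trigrams with the query are the candidates to score with the real distance function. This is a minimal in-memory illustration; the class name `NGramIndex`, the padding scheme and the `min_shared` threshold are assumptions, not something taken from AcoustID.

```python
from collections import Counter, defaultdict

def ngrams(text, n=3):
    """Set of character n-grams of text, padded so word edges count too."""
    pad = " " * (n - 1)
    padded = f"{pad}{text.lower()}{pad}"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

class NGramIndex:
    def __init__(self, n=3):
        self.n = n
        self.postings = defaultdict(set)  # n-gram -> keys containing it

    def add(self, key):
        for gram in ngrams(key, self.n):
            self.postings[gram].add(key)

    def candidates(self, query, min_shared=2):
        """Keys sharing at least min_shared n-grams with the query, best first."""
        counts = Counter()
        for gram in ngrams(query, self.n):
            for key in self.postings.get(gram, ()):
                counts[key] += 1
        return [key for key, shared in counts.most_common() if shared >= min_shared]

index = NGramIndex()
for word in ["levenshtein", "metaphone", "acoustid"]:
    index.add(word)

print(index.candidates("levenstein"))  # misspelled query
```

Every step is an exact hash lookup, which is what lets this scale; the candidate list is then small enough to rank with a more expensive distance.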
I suggest you take a look at FLANN Fast Approximate Nearest Neighbors. Fuzzy search in big data is also known as approximate nearest neighbors.
This library offers you different metrics, e.g. Euclidean and Hamming, and different methods of clustering: LSH or k-means, for instance.
The search always has two phases. First you feed the system with data to train the algorithm; this is potentially time-consuming depending on your data.
I successfully clustered 13 million data points in less than a minute though (using LSH).
Then comes the search phase, which is very fast. You can specify a maximum distance and/or the maximum number of neighbors.
As Lukas said, there is no good generic solution; each domain will have its own tricks to make it faster, or to find a better way using the inner properties of the data you're using.
Shazam uses a special technique with geometrical projections to quickly find your song. In computer vision we often use the BOW: Bag of words, which originally appeared in text retrieval.
If you can see your data as a graph, there are other methods for approximate matching using spectral graph theory for instance.
Let us know.
Depending on what your keys/values are like, the Levenshtein algorithm (also called edit distance) can help. It calculates the least number of edit operations that are necessary to modify one string to obtain another string.
http://en.wikipedia.org/wiki/Levenshtein_distance
http://www.levenshtein.net/
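For reference, the edit distance can be computed with the standard dynamic programming recurrence, keeping only two rows of the table. The sketch below also shows the naive way to turn it into the search(key) function from the question: a linear scan that costs one distance computation per stored key, so it only suits small collections. The function names are illustrative.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions from a to b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def search(keys, query, max_distance):
    """All (key, distance) pairs within max_distance, closest first."""
    hits = [(key, levenshtein(key, query)) for key in keys]
    return sorted((h for h in hits if h[1] <= max_distance), key=lambda h: h[1])

print(levenshtein("kitten", "sitting"))
print(search(["kitten", "sitting", "mitten"], "kitten", max_distance=1))
```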

CSV String vs Arrays: Is this too stringly typed?

I came across some existing code in our production environment given to us by our vendor. They use a string of comma-separated values to store filtered results from a DB. Keep in mind that this is for a proprietary scripting language called PowerOn that interfaces with a database residing on an AIX system, but it's a language that supports strings, integers, and arrays.
For example, we have;
Account
----------------
123
234
3456
28390
The pseudocode might look like;
Define accounts As String
For Each Account
accounts=accounts + CharCast(Account) + ","
End
as opposed to something I would expect to see like
Define accounts As Integer Array(99)
Define index as Integer=0
For Each Account
accounts(index)=Account
index=index+1
End
By the time the loop is done, accounts will look like; 123,234,3456,28390,. The string is later used to test if a specific instance exists like so
If CharSearch("28390", accounts) > 0 Then Call DoSomething
In the example, the statement evaluates to true and DoSomething gets called. Given the option of arrays, why would anyone want to store integer values within a string of comma-separated values? In every language I've come across, it's almost always more expensive to perform string-based operations than integer-based operations.
Considering I haven't seen this technique before and my experience is somewhat limited, is there a name for this? Is this common practice, or is this just another example of being too stringly typed? To extend the existing code, should I continue using the string method? Did we get cruddy code from our vendor?
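Apart from speed, the substring test has a correctness risk: a search for "345" also succeeds, because "3456" contains it. The Python sketch below reproduces the idea with the `in` operator standing in for CharSearch (PowerOn itself is not shown, and the helper names are made up for illustration). It compares the plain substring test with a delimiter-wrapped one and with a set of integers.

```python
accounts = [123, 234, 3456, 28390]

# Same shape as the vendor string: "123,234,3456,28390,"
accounts_csv = "".join(f"{account}," for account in accounts)

def naive_contains(csv_string, account):
    """Plain substring test, like CharSearch("28390", accounts) > 0."""
    return str(account) in csv_string

def delimited_contains(csv_string, account):
    """Wrap both sides in commas so only whole values can match."""
    return f",{account}," in f",{csv_string}"

account_set = set(accounts)  # the typed alternative

print(naive_contains(accounts_csv, 345))      # True: false positive via "3456"
print(delimited_contains(accounts_csv, 345))  # False
print(345 in account_set)                     # False
```

If the string representation has to stay, wrapping both the search term and the list in delimiters keeps the test exact.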
What I put in the comment still holds, but my real answer is: it's probably a design decision with respect to compatibility/portability. In your integer-array case (and at a low enough level of the API) you'd typically find yourself asking questions like: what's a safe guess for the size of an integer on "today's" machines? What about endianness?
The most portable and most flexible of all data formats always has been and always will be the printed representation. It may not be as fast to process, but that's where adapters/converters kick in. I wouldn't be surprised to find a (human-readable) printed representation of something, especially in database APIs like the one you describe.
If you want something fast, just take whatever is given to you, convert it to a more efficient internal format, do your processing and convert it back.
There's nothing inherently wrong with using comma-separated strings instead of arrays. Sure, you can't readily access an arbitrary nth element of such a collection, but if such random access is not needed then there's no penalty for it, right?
As far as I know Oracle DB stores NUMBER values as strings (and if my memory is correct - for DATEs as well) for very practical reasons.
In your specific example, it looks like using strings is overkill when passing data around without crossing process boundaries. But could it be that the choice of the string data type makes more sense when sending data over the wire or storing it on disk?

Resources