Creating New Matching Logic in Informatica (Ratcliff-Obershelp)

I am conducting a matching project in Informatica 10.2.1 where I need to identify matching strings within product descriptions. Ratcliff-Obershelp is the matching strategy I need to implement.
I've heard Ratcliff-Obershelp yields better results than Jaro-Winkler, but I am not sure how to code this into a transformation in Informatica since it is not built in.
No code to show as I don't even know where to start.
I'd expect this to be a transformation (or group of transformations) that reproduces the matching score that Ratcliff-Obershelp produces on a per-row basis.

If I understand correctly, the matching logic performs operations in a loop iterating over the input strings. It is not possible to implement such a "loop over string" in an Expression Transformation using built-in functions. I see two options:
1. Create a DECODE function with multiple conditions for each possible length. This will be ugly, and it is only feasible if we assume the comparison starts at the beginning of each string; implementing full substring comparison this way would be... so ugly I can't imagine :)
2. Use a Java Transformation. As much as I hate putting Java into mappings, there are some cases where it's justified, and this looks like one of the few. Here's some reference material for the Java Transformation.
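
One way to approach it: implement the score itself as plain Java and call it from the Java Transformation (typically from the On Input Row code, with helper methods in the Helper Code tab), assigning the result to an output port. Below is a minimal, unoptimized sketch of the Ratcliff-Obershelp score: find the longest common substring, recurse on the pieces to the left and right of it, and return 2 * matched characters / total length. The class and method names are illustrative, not Informatica APIs.

// Minimal, self-contained sketch of the Ratcliff-Obershelp similarity.
// similarity() returns a score between 0.0 (nothing in common) and 1.0 (identical).
public class RatcliffObershelp {

    public static double similarity(String s1, String s2) {
        if (s1 == null || s2 == null || s1.length() + s2.length() == 0) {
            return 0.0;
        }
        int matches = matchingCharacters(s1, 0, s1.length(), s2, 0, s2.length());
        return 2.0 * matches / (s1.length() + s2.length());
    }

    // Finds the longest common substring of the two ranges, then recurses
    // on the pieces to the left and to the right of that match.
    private static int matchingCharacters(String s1, int lo1, int hi1,
                                          String s2, int lo2, int hi2) {
        int bestLen = 0, best1 = lo1, best2 = lo2;
        for (int i = lo1; i < hi1; i++) {
            for (int j = lo2; j < hi2; j++) {
                int len = 0;
                while (i + len < hi1 && j + len < hi2
                        && s1.charAt(i + len) == s2.charAt(j + len)) {
                    len++;
                }
                if (len > bestLen) {
                    bestLen = len;
                    best1 = i;
                    best2 = j;
                }
            }
        }
        if (bestLen == 0) {
            return 0;
        }
        return bestLen
                + matchingCharacters(s1, lo1, best1, s2, lo2, best2)
                + matchingCharacters(s1, best1 + bestLen, hi1, s2, best2 + bestLen, hi2);
    }

    public static void main(String[] args) {
        // "BOTTLE" (6 chars) is the only matched block, so the score is 2*6/24 = 0.5
        System.out.println(similarity("100ML BOTTLE", "BOTTLE 100ML"));
    }
}

The brute-force longest-common-substring search is quadratic per call, which is usually acceptable for short product descriptions; for very long strings you would want a smarter implementation.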

Related

Solr: how to write a custom aggregation function/facet function

Is it possible to write a new custom facet function in Solr, and if so, how?
I want to write some new functions that execute against a facet bucket and produce a single result, but I couldn't find where to start. One of them is a function that just returns some arbitrary value from the bucket without calculating min/max/avg etc.
A good way to get started in these cases is usually to find one of the existing, more uniquely named functions, then search for those in the code base.
Starting from JSON Facet API: Stat Facet Functions I see a couple of good candidates, such as percentile and sumsq (and a few others).
Searching for those in the code shows that the percentile aggregator is implemented in solr/core/src/java/org/apache/solr/search/facet/PercentileAgg.java, as can be seen on GitHub. That seems a bit complicated, but the directory contains all the facet aggregation functions.
The Variance aggregator should be simpler, and indeed it is. Starting from the VarianceAgg source you should be able to create another version that aggregates all the values for a specific facet into a single value as well.
Start by extending SimpleAggValueSource and work from there.
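
If it helps, here is a rough structural skeleton of what such an aggregation could look like. This is a sketch only: it is not compilable as-is (the abstract methods are deliberately left out), the class name FirstValueAgg and the function name "firstvalue" are made up for the example, and the package and import details should be verified against VarianceAgg.java in the Solr version you are building against.

package org.apache.solr.search.facet;

import org.apache.lucene.queries.function.ValueSource;

// Hypothetical aggregation that returns a single value per facet bucket.
// The structure mirrors the simpler built-in aggregations such as VarianceAgg.
public class FirstValueAgg extends SimpleAggValueSource {

    public FirstValueAgg(ValueSource vs) {
        super("firstvalue", vs);   // the name used in JSON facet requests
    }

    // To make this real, copy and adapt the remaining pieces from VarianceAgg.java:
    //  - the SlotAcc factory method, which builds the per-bucket accumulator that
    //    reads the wrapped ValueSource for each collected document and keeps
    //    whichever single value you want to return
    //  - the merger used to combine per-shard results in distributed (cloud) mode
}

How the new function gets registered so that json.facet can see it is version-dependent, so check how the built-in aggregations are wired up in the code base you are working from.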

What is the lib/algorithm for matching multiple regular expressions against a single target string in C?

I want to match multiple regular expressions against a single string and stop when the first regular expression matches.
I am exploring a few solutions from here:
http://sljit.sourceforge.net/regex_perf.html
but none of them seems to take into consideration matching multiple regular expressions against a single string.
Is there any solution to speed this up?
You could just use alternation. That is, if you're looking for the expressions /a+b/ and /[a-z0-9]+xyz/, you could write a single regular expression with grouping: /(a+b)|([a-z0-9]+xyz)/. The regex engine will return the first match it finds.
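
The question asks about C, but just to make the grouping idea concrete, here is a tiny sketch in Java (the combined pattern is what matters; with POSIX regcomp/regexec you would use the same pattern compiled with REG_EXTENDED). The input string is made up for the example.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Each original expression becomes one capturing group, so the group that is
// non-null after a match tells you which of the alternatives fired.
public class FirstMatch {
    public static void main(String[] args) {
        Pattern combined = Pattern.compile("(a+b)|([a-z0-9]+xyz)");
        Matcher m = combined.matcher("zzz 42xyz aab");
        if (m.find()) {
            int which = (m.group(1) != null) ? 1 : 2;   // which alternative matched
            System.out.println("first match: '" + m.group() + "' (pattern " + which + ")");
            // prints: first match: '42xyz' (pattern 2)
        }
    }
}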
The Unix fgrep tool does what you're looking for. If you give it a list of expressions to find, it will find all occurrences in a single scan of the file. Dr. Dobb's Journal published an article about it, with C source, sometime back in the late '80s. A quick search reveals that the article was called Parallel pattern matching and fgrep, by Ian Ashdown. I didn't find the source, but I didn't look all that hard. Given a little time, you might have more luck.

how to do fuzzy search in big data

I'm new to this area and I am mostly wondering what the state of the art is and where I can read about it.
Let's assume that I just have a key/value store and I have some distance(key1,key2) defined somehow (not sure if it must be a metric, i.e. if the triangle inequality must always hold).
What I want is mostly a search(key) function which returns all items with keys up to a certain distance from the search key. Maybe that distance limit is configurable. Maybe this is also just a lazy iterator. Maybe there can also be a count limit, and an item (key,value) is in the returned set with some probability P, where P = 1/distance(key,search-key) or so (i.e., the perfect match would certainly be in the set, and close matches at least with high probability).
One example application is fingerprint matching in MusicBrainz. They use the AcoustID fingerprint and have defined this compare function. They use the PostgreSQL GIN index and I guess (although I haven't fully understood/read the acoustid-server code) the GIN partial match algorithm, but I haven't fully understood whether that is what I asked for and how it works.
For text, what I have found so far is to use some phonetic algorithm to simplify words based on their pronunciation. An example is here. This is mostly to break the search-space down to a smaller space. However, that has several limitations, e.g. it must still be a perfect match in the smaller space.
But anyway, I am also searching for a more generic solution, if that exists.
There is no (fast) generic solution; each application will need a different approach.
Neither of the two examples actually does traditional nearest neighbor search. AcoustID (I'm the author) is just looking for exact matches, but it searches in a very high number of hashes in hope that some of them will match. The phonetic search example uses metaphone to convert words to their phonetic representation and is also only looking for exact matches.
You will find that if you have a lot of data, exact search using huge hash tables is the only thing you can realistically do. The problem then becomes how to convert your fuzzy matching to exact search.
A common approach is to use locality-sensitive hashing (LSH) with a smart hashing method, but as you can see in your two examples, sometimes you can get away with an even simpler approach.
Btw, if you are looking specifically for text search, the simplest way to do it is to split your input into N-grams and index those. Depending on how your distance function is defined, that might give you the right candidate matches without too much work.
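
As a toy illustration of the N-gram idea (the class name, the trigram size, and the example keys are all arbitrary choices for the sketch, not anything from AcoustID), candidate lookup could be as simple as:

import java.util.*;

// Toy sketch of N-gram indexing for fuzzy text lookup: every key is split into
// trigrams, an inverted index maps each trigram to the keys containing it, and a
// query returns candidates ranked by how many trigrams they share with the search
// key. This only produces candidates; a real distance function would still be
// applied to the shortlist afterwards.
public class TrigramIndex {
    private final Map<String, Set<String>> index = new HashMap<>();

    private static List<String> trigrams(String s) {
        List<String> grams = new ArrayList<>();
        String padded = "  " + s.toLowerCase() + "  ";   // pad so short keys still produce grams
        for (int i = 0; i + 3 <= padded.length(); i++) {
            grams.add(padded.substring(i, i + 3));
        }
        return grams;
    }

    public void add(String key) {
        for (String g : trigrams(key)) {
            index.computeIfAbsent(g, k -> new HashSet<>()).add(key);
        }
    }

    // Returns candidate keys ordered by number of shared trigrams (descending).
    public List<String> candidates(String query) {
        Map<String, Integer> counts = new HashMap<>();
        for (String g : trigrams(query)) {
            for (String key : index.getOrDefault(g, Collections.emptySet())) {
                counts.merge(key, 1, Integer::sum);
            }
        }
        List<String> result = new ArrayList<>(counts.keySet());
        result.sort((a, b) -> counts.get(b) - counts.get(a));
        return result;
    }

    public static void main(String[] args) {
        TrigramIndex idx = new TrigramIndex();
        idx.add("mozart");
        idx.add("morzart");   // misspelling
        idx.add("beethoven");
        System.out.println(idx.candidates("mozzart"));  // prints [mozart, morzart]
    }
}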
I suggest you take a look at FLANN (Fast Library for Approximate Nearest Neighbors). Fuzzy search in big data is also known as approximate nearest neighbor search.
This library offers different metrics, e.g. Euclidean and Hamming, and different indexing/clustering methods, such as LSH or k-means.
The search always happens in two phases. First you feed the system with data to train the algorithm; this is potentially time-consuming depending on your data.
I successfully clustered 13 million data points in less than a minute, though (using LSH).
Then comes the search phase, which is very fast. You can specify a maximum distance and/or a maximum number of neighbors.
As Lukas said, there is no good generic solution; each domain will have its tricks to make it faster, or a better way that uses the inner properties of the data you're using.
Shazam uses a special technique with geometrical projections to quickly find your song. In computer vision we often use the BOW: Bag of words, which originally appeared in text retrieval.
If you can see your data as a graph, there are other methods for approximate matching using spectral graph theory for instance.
Let us know.
Depending on what your keys/values are like, the Levenshtein algorithm (also called edit distance) can help. It calculates the minimum number of edit operations necessary to modify one string to obtain another. A minimal implementation is sketched after the links below.
http://en.wikipedia.org/wiki/Levenshtein_distance
http://www.levenshtein.net/
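
A minimal sketch of the classic dynamic-programming implementation (the standard textbook recurrence, nothing specific to any of the systems discussed above):

// Levenshtein (edit distance) with the usual DP recurrence, using two rolling rows.
// Distance 0 means identical strings; each insertion, deletion or substitution costs 1.
public class Levenshtein {
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;                       // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;                       // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int substitutionCost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,    // insertion
                                            prev[j] + 1),       // deletion
                                   prev[j - 1] + substitutionCost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;          // roll the rows
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting"));  // 3
    }
}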

CSV String vs Arrays: Is this too stringly typed?

I came across some existing code in our production environment given to us by our vendor. They use a string of comma-separated values to store filtered results from a DB. Keep in mind that this is for a proprietary scripting language called PowerOn that interfaces with a database residing on an AIX system, but it's a language that supports strings, integers, and arrays.
For example, we have:
Account
----------------
123
234
3456
28390
The pseudocode might look like:
Define accounts As String
For Each Account
accounts=accounts + CharCast(Account) + ","
End
as opposed to something I would expect to see like
Define accounts As Integer Array(99)
Define index as Integer=0
For Each Account
accounts(index)=Account
index=index+1
End
By the time the loop is done, accounts will look like 123,234,3456,28390, (including the trailing comma). The string is later used to test whether a specific instance exists, like so:
If CharSearch("28390", accounts) > 0 Then Call DoSomething
In the example, the statement evaluates to true and DoSomething gets called. Given the option of arrays, why would one want to store integer values within a string of comma-separated values? In every language I've come across, it's almost always more expensive to perform string-based operations than integer-based operations.
Considering I haven't seen this technique before and my experience is somewhat limited, is there a name for this? Is this common practice, or is this just another example of being too stringly typed? To extend the existing code, should I continue using the string method? Did we get cruddy code from our vendor?
What I put in the comment still holds, but my real answer is: it's probably a design decision with respect to compatibility/portability. In your integer-array case (at a low enough level of the API) you'd typically find yourself asking questions like: what's a safe guess for the size of an integer on today's machines? What about endianness?
The most portable and most flexible of all data formats always has been and always will be printed representation. It may not be as fast to process, but that's where adapters/converters kick in. I wouldn't be surprised to find a (human-readable) printed representation of something, especially in database APIs like the one you describe.
If you want something fast, just take whatever is given to you, convert it to a more efficient internal format, do your processing, and convert it back.
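
To illustrate the "convert to an internal format, process, convert back" point (shown in Java rather than PowerOn, with names made up for the example):

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.stream.Collectors;

// Toy illustration of "parse once, work with a typed structure, serialize back".
// The CSV string is the portable interchange form; the Set is the efficient
// internal form for membership tests.
public class AccountsExample {
    public static void main(String[] args) {
        String accountsCsv = "123,234,3456,28390,";   // trailing comma, as in the question

        // Convert to an internal format once...
        Set<Integer> accounts = new LinkedHashSet<>();
        for (String part : accountsCsv.split(",")) {
            if (!part.isEmpty()) {
                accounts.add(Integer.parseInt(part));
            }
        }

        // ...do the processing against the typed structure...
        if (accounts.contains(28390)) {
            System.out.println("DoSomething");
        }

        // ...and convert back to the portable string form when handing it off.
        String backToCsv = accounts.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(",", "", ","));
        System.out.println(backToCsv);   // 123,234,3456,28390,
    }
}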
There's nothing inherently wrong with using comma-separated strings instead of arrays. Sure, you can't readily access the n-th element of such a collection, but if such random access is not needed then there's no penalty for it, right?
As far as I know Oracle DB stores NUMBER values as strings (and if my memory is correct - for DATEs as well) for very practical reasons.
In your specific example, it looks like using strings is overkill when passing data around without crossing process boundaries. But could it be that the choice of the string data type makes more sense when sending data over the wire or storing it on disk?

A perfect hash for known values

Say I have some known values, against which I want to create a hash table. For example,
For 0x78409 -> 1
For 0x89934 -> 2
For 0x89834 -> 3
etc...
But these values (0x78409, 0x89934, 0x89834) are only known at runtime, so switch/case cannot be used. However, they become known at the beginning of execution, so maybe we can create a hash function which adapts itself to produce a perfect hash table. So my question is: can we create a perfect hash function for such a case?
If the entire domain of inputs is known before the hashmap is created, then this is possible, but requires some form of runtime code generation, either via a VM or JIT (probably through a scripting language, such as LuaJIT), that would allow you to use gperf and its ilk to create a hash at runtime, compile it, then use it to fill and retrieve from the map.
An easier, more viable solution is to use a hash function with extremely low collisions for the given set of input permutations (i.e., you might only be using lowercase alphabetical characters, for instance), or a minimal perfect hash.
Murmur3 and CrapWow are the ones to look out for (though I'd be cautious with CrapWow); Google's CityHash and xxHash are also worth looking at. Bob Jenkins also has a good minimal-perfect-hash-based map available here, which should do just fine as well.
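
For a small, fixed key set like the one in the question there is also a very simple runtime approach that needs no code generation: keep trying seeds for an ordinary mixing function until one happens to map every known key to a distinct slot. This is a toy sketch under that assumption (the class name and mixing constant are arbitrary choices; it is not gperf or any of the hashes named above), and the seed search is only practical for smallish key sets:

import java.util.Arrays;

// Toy runtime "perfect hash": for a fixed set of int keys known at startup,
// search for a seed such that slot(key, seed) is collision-free; lookups are
// then a single multiply/xor/mask with no probing or chaining.
public class RuntimePerfectHash {
    private final long seed;
    private final int[] keys;     // keys[slot] lets get() verify membership
    private final int[] values;

    public RuntimePerfectHash(int[] ks, int[] vs) {
        int size = Integer.highestOneBit(ks.length * 2 - 1) * 2;  // power of two with slack
        int[] k = new int[size], v = new int[size];
        long s = 1;
        outer:
        while (true) {                 // terminates quickly in practice for small key sets
            Arrays.fill(k, -1);        // assumes -1 is never a valid key (fine for this toy)
            for (int i = 0; i < ks.length; i++) {
                int slot = slot(ks[i], s, size);
                if (k[slot] != -1) { s++; continue outer; }   // collision: try next seed
                k[slot] = ks[i];
                v[slot] = vs[i];
            }
            break;
        }
        this.seed = s;
        this.keys = k;
        this.values = v;
    }

    private static int slot(int key, long seed, int size) {
        long h = (key ^ seed) * 0x9E3779B97F4A7C15L;   // arbitrary mixing constant
        h ^= (h >>> 32);
        return (int) (h & (size - 1));                 // size is a power of two
    }

    public int get(int key) {
        int slot = slot(key, seed, keys.length);
        if (keys[slot] != key) throw new IllegalArgumentException("unknown key");
        return values[slot];
    }

    public static void main(String[] args) {
        RuntimePerfectHash h = new RuntimePerfectHash(
                new int[]{0x78409, 0x89934, 0x89834},
                new int[]{1, 2, 3});
        System.out.println(h.get(0x89834));   // 3
    }
}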
Wikipedia gives this page. But are you sure you want a perfect hash function? Perhaps a good and fast hash function can be enough?

Resources