Incorporating fuzzy search in a matcher object

My task is querying medical texts for institute names using a rule as below:
[{'ENT_TYPE': 'institute_name'}, {'TEXT': 'Hospital'}]
The rule will only identify a match if both terms are present. Thus, it will accept "Mount Sinai Hospital", but not "Mount Sinai". I've tried spaczz, which wraps spaCy and works great for a single term or phrase. However, neither spaCy nor spaczz allows a fuzzy multi-word rule with more than one typo, as in "Moung Sinai Mospital."
Therefore, I'm trying to rewrite the Matcher object by incorporating a fuzzy similarity algorithm such as RapidFuzz, but I'm having some difficulty with its Cython component.
The Matcher class's __call__ method finds all token sequences matching the supplied patterns on doclike, the document or Span to match over (type: Doc/Span), and returns a list of (match_id, start, end) tuples describing the matches:
matches = find_matches(&self.patterns[0], self.patterns.size(), doclike, length,
                       extensions=self._extensions, predicates=self._extra_predicates)
for i, (key, start, end) in enumerate(matches):
    on_match = self._callbacks.get(key, None)
    if on_match is not None:
        on_match(self, doclike, i, matches)
return matches
find_matches is a Cython function that returns the matches in a doc, given a compiled array of patterns, as a list of (id, start, end) tuples. It has a main loop that matches the doc against the pre-defined patterns:
# Main loop
cdef int nr_predicate = len(predicates)
for i in range(length):
    for j in range(n):
        states.push_back(PatternStateC(patterns[j], i, 0))
    transition_states(states, matches, predicate_cache,
                      doclike[i], extra_attr_values, predicates)
    extra_attr_values += nr_extra_attr
    predicate_cache += len(predicates)
Can you help me locate the actual matching operation (pattern against string) in the Python/C-level objects and their attributes? I hope to be able to extend this operation with the fuzzy matching algorithm. You can find the code for the Matcher class, the __call__ method and the find_matches function here.
You can follow a more Pythonic effort to achieve this goal by spaczz here.

I think the easiest way would be to add an additional predicate type called something like FUZZY. Look at how the regex, set, and comparison predicates are defined and do something similar for FUZZY with your custom code to match on strings with small edit differences:
https://github.com/explosion/spaCy/blob/master/spacy/matcher/matcher.pyx#L687-L781
The predicate classes are standard python classes, no cython required. You'll also need to add the predicate to the schema in spacy/matcher/_schemas.py.
Remember that like the rest of the Matcher predicates, it matches over tokens, so your definition of fuzziness will have to be at the token level.
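As a rough, untested sketch of that idea (the class shape mirrors spaCy's existing predicate classes, but the exact constructor signature differs between spaCy versions, and the FUZZY key, the threshold parameter, and the use of rapidfuzz here are my own assumptions, not spaCy's API):

from rapidfuzz import fuzz

class _FuzzyPredicate:
    operators = ("FUZZY",)

    def __init__(self, i, attr, value, predicate, is_extension=False, threshold=85):
        self.i = i                    # position of this predicate in the pattern
        self.attr = attr              # token attribute the pattern targets, e.g. ORTH
        self.value = value            # the reference string, e.g. "Hospital"
        self.predicate = predicate    # "FUZZY"
        self.is_extension = is_extension
        self.threshold = threshold    # minimum similarity (0-100) to count as a match
        self.key = (attr, predicate, value)

    def __call__(self, token):
        # Token-level fuzzy match: compare this token's text to the reference
        # string with a normalized edit-distance ratio instead of exact equality.
        # For simplicity this sketch compares the raw token text; spaCy's real
        # predicate classes look the attribute value up via self.attr.
        text = token._.get(self.attr) if self.is_extension else token.text
        return fuzz.ratio(text, self.value) >= self.threshold

Once registered alongside the other predicates (and added to the schema), a pattern could then be written as [{'ENT_TYPE': 'institute_name'}, {'TEXT': {'FUZZY': 'Hospital'}}], so a token like "Mospital" would still satisfy the second slot.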

Related

Compile error when trying to use jooq any operator in kotlin

I have problems using jooq with kotlin and the any clause.
Given the following:
- I have a database field in a PostgreSQL database which is an array
- I have search parameters which are a List of Strings
- I want to use the jOOQ any operator to search in the array

I have the following code, which is not working:
fun findAll(
    someArrayListOfStrings: List<String>?
): List<SomeDTO> {
    val filters = ArrayList<Condition>()
    filters.add(TABLE.SOME_FIELD.eq(DSL.any(someArrayListOfStrings)))
}
Here I want to dynamically create filters (jOOQ Conditions) to be added to some SQL statement. It should check whether SOME_FIELD (a PostgreSQL array type) contains one of the given strings, using the ANY clause (PostgreSQL jOOQ binding). However, I get the following compile-time error:
None of the following functions can be called with the arguments supplied:
public abstract fun eq(p0: Array<(out) String!>!): Condition defined in org.jooq.TableField
public abstract fun eq(p0: Field<Array<(out) String!>!>!): Condition defined in org.jooq.TableField
public abstract fun eq(p0: QuantifiedSelect<out Record1<Array<(out) String!>!>!>!): Condition defined in org.jooq.TableField
public abstract fun eq(p0: Select<out Record1<Array<(out) String!>!>!>!): Condition defined in org.jooq.TableField
But my function call should match the third type where QuantifiedSelect is used.
I looked for hours on the internet but was not able to find any solution. Any site I found told me to try the solution I already have. Does anyone have an idea what I could try and why it does not work?
Thank you!
The method you're calling here is DSL.any(T...), which takes a generic varargs array (in Java). You're passing a List<String>, so this binds T = List<String>, which doesn't satisfy the type constraint on the eq() method.
But even if you changed that to an Array<String>, it wouldn't work because the jOOQ ANY operator doesn't do the exact same thing as the PostgreSQL any(array) operator. So, just resort to either plain SQL templating:
condition("{0} = any({1})", TABLE.SOME_FIELD,
DSL.value(someArrayListOfStrings.toTypedArray()))
Or just use the IN predicate
TABLE.SOME_FIELD.in(someArrayListOfStrings)

Query on TFP Probabilistic Model

In the TFP tutorial, the model output is a Normal distribution. I noted that the output can be replaced by an IndependentNormal layer. In my model, y_true is a binary class. Therefore, I used an IndependentBernoulli layer instead of an IndependentNormal layer.
After building the model, I found that it has two output parameters. That doesn't make sense to me, since a Bernoulli distribution has only one parameter. Do you know what went wrong?
# Define the prior weight distribution as Normal of mean=0 and stddev=1.
# Note that, in this example, the prior distribution is not trainable,
# as we fix its parameters.
def prior(kernel_size, bias_size, dtype=None):
    n = kernel_size + bias_size
    prior_model = Sequential([
        tfpl.DistributionLambda(
            lambda t: tfd.MultivariateNormalDiag(loc=tf.zeros(n), scale_diag=tf.ones(n))
        )
    ])
    return prior_model

# Define variational posterior weight distribution as multivariate Gaussian.
# Note that the learnable parameters for this distribution are the means,
# variances, and covariances.
def posterior(kernel_size, bias_size, dtype=None):
    n = kernel_size + bias_size
    posterior_model = Sequential([
        tfpl.VariableLayer(tfpl.MultivariateNormalTriL.params_size(n), dtype=dtype),
        tfpl.MultivariateNormalTriL(n)
    ])
    return posterior_model

# Create a probabilistic DL model
model = Sequential([
    tfpl.DenseVariational(units=16,
                          input_shape=(6,),
                          make_prior_fn=prior,
                          make_posterior_fn=posterior,
                          kl_weight=1/X_train.shape[0],
                          activation='relu'),
    tfpl.DenseVariational(units=16,
                          make_prior_fn=prior,
                          make_posterior_fn=posterior,
                          kl_weight=1/X_train.shape[0],
                          activation='sigmoid'),
    tfpl.DenseVariational(units=tfpl.IndependentBernoulli.params_size(1),
                          make_prior_fn=prior,
                          make_posterior_fn=posterior,
                          kl_weight=1/X_train.shape[0]),
    tfpl.IndependentBernoulli(1, convert_to_tensor_fn=tfd.Bernoulli.logits)
])
model.summary()
[Screenshot of the model.summary() output, executed on Google Colab]
I agree the summary display is confusing but I think this is an artifact of the way tfp layers are implemented to interact with keras. During normal operation, there will only be one return value from a DistributionLambda layer. But in some contexts (that I don't fully grok) DistributionLambda.call may return both a distribution and a side-result. I think the summary plumbing triggers this for some reason, so it looks like there are 2 outputs, but there will practically only be one. Try calling your model object on X_train, and you'll see you get a single distribution out (its type is actually something called TensorCoercible, which is a wrapper around a distribution that lets you pass it into tf ops that call tf.convert_to_tensor -- the resulting value for that op will be the result of calling your convert_to_tensor_fn on the enclosed distribution).
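For example (a minimal sketch using the model and X_train from your code; the exact wrapper type name printed may differ between TFP versions):

y_dist = model(X_train)                  # a single distribution-like object, not two tensors
print(type(y_dist))                      # a TensorCoercible wrapping the Independent(Bernoulli)
print(y_dist.sample().shape)             # behaves like a distribution: sample(), mean(), log_prob(), ...
logits = tf.convert_to_tensor(y_dist)    # applies your convert_to_tensor_fn, so this yields the logits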
In summary, your distribution layer is fine but the summary is confusing. It could probably be fixed; I'm not keras-knowledgeable enough to opine on how hard it would be.
Side note: you can omit the event_shape=1 parameter -- the default value is (), or "scalar", which will behave the same.
HTH!

How to find which element failed comparison between arrays in Kotlin?

I'm writing automated tests for one site. There's a page with all items added to the cart. The maximum number of items is 58. Instead of verifying each element one by one, I decided to create two arrays of strings: one with the correct names and one with the names I got from the site. Then I compare those two arrays with contentEquals.
If that comparison fails, how do I know exactly which element caused it to fail?
A short sample of what I have now:
@Test
fun verifyNamesOfAddedItems() {
    val getAllElementsNames = arrayOf(materials.text, element2.text,
        element3.text...)
    val correctElementsNames = arrayOf("name1", "name2", "name3"...)
    val areArraysEqual = getAllElementsNames contentEquals correctElementsNames
    if (!areArraysEqual) {
        assert(false)
    } else {
        assert(true)
    }
}
This test fails if the two arrays are not the same, but it doesn't show me any details, so is there a way to see more details of the failure, e.g. which element failed the comparison?
Thanks.
I recommend using a matcher library like Hamcrest or AssertJ in tests. They provide much better error messages for cases like this. In this case with Hamcrest it would be:
import org.hamcrest.Matchers.*
assertThat(getAllElementsNames, contains(*correctElementsNames))
// or just
assertThat(getAllElementsNames, contains("name1", "name2", "name3", ...))
There are also matcher libraries made specifically for Kotlin: https://github.com/kotlintest/kotlintest, https://yobriefca.se/expect.kt/, https://github.com/winterbe/expekt, https://github.com/MarkusAmshove/Kluent, probably more. Tests using them should be even more readable, but I haven't tried any of them. Look at their documentation and examples and pick the one you like.
You need to find the intersection between the two collections; the intersection is the common elements. After that, removing the intersection from the collection you are testing gives you the elements that differ:
val intersection = getAllElementsNames.intersect(correctElementsNames.toSet())
val mismatches = getAllElementsNames.filterNot { it in intersection }

Logical-Indexing for Matlab-object-arrays

Is there any way in Matlab R2011b to apply logical indexing to object arrays? The objects which fulfill specific condition(s) regarding their properties should be returned. Ideally, the solution would also work with object arrays that are a property of another object (aggregation).
In my project there are a lot of entities which have to be identified by their manifold features. Matlab objects with their properties provide a clear data foundation for this purpose. The alternative of using structs (or cells) and arrays of indices seems to be too confusing. Unfortunately the access to the properties of objects is a little bit complicated.
For example, all objects in myElements with val==3 should be returned:
elementsValIsThree = myElements(Element.val==3);
Best solution so far:
find([myElements.val]==3);
But this returns indices rather than the objects themselves, and not the absolute indices if only a subset of myElements is passed in.
Another attempt returns only the first element and requires constant properties:
myElements(Element.val==3);
A minimal example with class definition etc. for clarification:
% Element.m
classdef Element
    properties
        val
    end
    methods
        function obj = Element(value)
            if nargin > 0 % to allow empty construction
                obj.val = value;
            end
        end
    end
end
Create an array of Element objects:
myElements(4) = Element(3)
Now myElements(4) has val=3.
I'm not sure I understood the question, but the logical index can be generated as
arrayfun(@(e) isequal(e.val,3), myElements);
So, to pick the elements of myElements whose val field equals 3:
elementsValIsThree = myElements(arrayfun(@(e) isequal(e.val,3), myElements));

List of polysemous words

I am trying to find a list of polysemous words but did not find anything on the internet. Can someone suggest a source where I can get one? I want to use it in the backend of my word sense disambiguation project for the polysemy detection mechanism.
From http://ixa2.si.ehu.es/signatureak/SENSECORPUS.README.TXT
We say that a word is monosemous if it has a unique sense, that is, if
a word has a unique synset taking into account all its part of speech.
A polysemous word, then, is one which has more than one sense. You can get this information from WordNet itself.
Check out this.
The following will work:
from nltk.corpus import wordnet as wn

def is_polysemous(word):
    if len(wn.synsets(word)) > 1:  # more than 1 sense
        return True
    else:
        return False
You can further qualify the code by adding POS. For example :
from nltk.corpus import wordnet as wn

def is_polysemous(word):
    if len(wn.synsets(word, pos=wn.NOUN)) > 1:  # more than 1 sense
        return True
    else:
        return False
WordNet has become more and more finely grained with each version. Take the example of the noun 'line'. In WordNet1.5, it had 6 senses, while WordNet3.0 lists 30 senses for the same noun.
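For example, you can check the sense count yourself with NLTK (assuming the WordNet corpus is downloaded; the number reflects whichever WordNet version NLTK bundles):

from nltk.corpus import wordnet as wn
# Count the noun senses of 'line' in the WordNet copy NLTK uses.
print(len(wn.synsets('line', pos=wn.NOUN)))   # 30 with WordNet 3.0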
@axiom has given you the correct answer, but if you do not want your application to be so specific, you could change the WordNet version you are using, or you could use the so-called 'sense mappings', which group related senses from a later version (e.g. 3.0) onto the same sense in 1.5.
You can find some sense mappings here: http://www.cse.unt.edu/~rada/downloads.html#wordnet or, if you want different versions, you can make your own mapping.
