Multiple 'instances' of the same labelled field on the same page in Azure Form Recognizer custom model? - azure-form-recognizer

I was wondering if there is something I'm missing on dealing with multiple instances of the same labelled field in an Azure Form Recognizer Custom Model (with labels)? Let's use the following (VERY simplified) document, for example:
Now, if I train a model to detect 'Name', 'DOB', and 'Company', I end up with results that look like:
{
  "fields": {
    "Name": {
      "value_type": "string",
      "label_data": null,
      "value_data": {
        "page_number": 1,
        "text": "John R. Smith Ronald Johnson., Esquire",
        "bounding_box": [
          [0.57, 4.435],
          [1.8, 4.435],
          [1.8, 6.005],
          [0.57, 6.005]
        ],
        "field_elements": null
      },
      "name": "Name",
      "value": "John R. Smith Ronald Johnson., Esquire",
      "confidence": 1
    },
    ...
As you can see, there is no delimiter between each 'instance' of the Name field in the Azure Form Recognizer results JSON. How should I train and/or deal with the Field results in a way that allows me to extract each instance of a given field from the document?
The first thing I tried was marking both the label name and the value for a field in the document and training on that. For example, Name: John R. Smith and Name: Ronald Johnson., Esquire would be what I marked in FOTT as the Name field for this training example. Then I would split the result on Name:. This seems fine in theory, but in practice I ended up with VERY low accuracy compared to selecting JUST the field value and training on those.

Please label these as Name1 and Name2 to extract them as separate fields.
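If the fields are labelled Name1, Name2, and so on as suggested, the numbered fields can be gathered back into a single list after analysis. The following is a minimal Python sketch, assuming the result has already been converted to a plain dict shaped like the JSON in the question. The helper `collect_instances` and the sample data are hypothetical and not part of any SDK.

```python
import re

def collect_instances(fields, base_name):
    """Return the values of base_name1, base_name2, ... in numeric order."""
    pattern = re.compile(rf"^{re.escape(base_name)}(\d+)$")
    numbered = []
    for key, field in fields.items():
        match = pattern.match(key)
        # Skip fields with a different name or with no extracted value.
        if match and field.get("value") is not None:
            numbered.append((int(match.group(1)), field["value"]))
    return [value for _, value in sorted(numbered)]

# Hypothetical result after retraining with Name1 / Name2 labels.
sample_fields = {
    "Name2": {"value": "Ronald Johnson., Esquire", "confidence": 0.98},
    "Name1": {"value": "John R. Smith", "confidence": 0.99},
    "Company": {"value": "Acme", "confidence": 0.97},
}

names = collect_instances(sample_fields, "Name")
print(names)
```

This only helps when the maximum number of instances per page is known in advance, since each numbered label has to exist at training time.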

Related

Date field not recognized with Azure Form Recognizer

I have built a custom neural model using the Form Recognizer Studio. I have marked the date fields when I labeled the data to build the model.
I have problems extracting the exact date value using the following Java SDK:
com.azure:azure-ai-formrecognizer:4.0.0-beta.5
The returned JSON (as previewed in the Form Recognizer Studio) is:
"Start Date": {
  "type": "date",
  "content": "01.05.2022",
  "boundingRegions": [
    {
      "pageNumber": 1,
      "polygon": [
        1.6025, 4.0802,
        2.148, 4.0802,
        2.148, 4.1613,
        1.6025, 4.1613
      ]
    }
  ],
  "confidence": 0.981,
  "spans": [
    {
      "offset": 910,
      "length": 10
    }
  ]
}
With the Java SDK, getValueDate() returns null, while getContent() returns the correct string value.
Most likely the issue occurs because the document I use is not in English and the date format might not be recognized. As per documentation here: https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/language-support only English is supported for custom neural models.
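One possible workaround when the typed date value is null is to parse the raw content string yourself. The sketch below is written in Python for illustration only (the question uses the Java SDK, but the same fallback applies in any language). It assumes the document uses a day.month.year format, which is a guess based on the "01.05.2022" sample; `parse_date_content` is a hypothetical helper.

```python
from datetime import datetime

def parse_date_content(content, fmt="%d.%m.%Y"):
    """Parse the raw field text; return None if it does not match fmt."""
    try:
        return datetime.strptime(content.strip(), fmt).date()
    except ValueError:
        return None

# Assumption: "01.05.2022" means 1 May 2022 (day first), not 5 January.
start_date = parse_date_content("01.05.2022")
print(start_date)
```

If the documents mix several date formats, the format string would need to be chosen per document or tried in a fixed order.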

How do I restrict medication annotations to a specific document section via IBM Watson Annotator for Clinical Data (ACD)

I’m using the IBM Watson Annotator for Clinical Data (ACD) API hosted in IBM Cloud to detect medication mentions within discharge summary clinic notes. I’m using the out-of-the-box medication annotator provided with ACD.
I’m able to detect and extract medication mentions, but I ONLY want medications mentioned within “DISCHARGE MEDICATIONS” or “DISCHARGE INSTRUCTIONS” sections.
Is there a way I can restrict ACD to only return medication mentions that appear within those two sections? I’m only interested in discharge medications.
For example, given the following contrived (non-PHI) text:
“Patient was previously prescribed cisplatin.DISCHARGE MEDICATIONS: 1. Aspirin 81 mg orally once daily.”
I get two medication mentions: one over “cisplatin” and another over “aspirin” - I only want the latter, since it appears within the “DISCHARGE MEDICATIONS” section.
Since the ACD medication annotator captures the section headings as part of the mention annotations that appear within a section, you can define an inclusive filter that checks for (1) the desired normalized section heading as well as (2) a filter that checks for the existence of the section heading fields in general, should a mention appear outside of any section and not include section header fields as part of the annotation. This will filter out any medication mentions from the ACD response that don't appear within a "DISCHARGE MEDICATIONS" section. I added a couple other related normalized section headings so you can see how that's done. Feel free to modify the sample below to meet your needs.
Here's a sample flow you can persist via POST /flows and then reference on the analyze call as POST /analyze/{flow_id} - e.g. POST /analyze/discharge_med_flow
{
  "id": "discharge_med_flow",
  "name": "Discharge Medications Flow",
  "description": "Detect medication mentions within DISCHARGE MEDICATIONS sections",
  "annotatorFlows": [
    {
      "flow": {
        "elements": [
          {
            "annotator": {
              "name": "medication",
              "configurations": [
                {
                  "filter": {
                    "target": "unstructured.data.MedicationInd",
                    "condition": {
                      "type": "all",
                      "conditions": [
                        {
                          "type": "all",
                          "conditions": [
                            {
                              "type": "match",
                              "field": "sectionNormalizedName",
                              "values": [
                                "Discharge medication",
                                "Discharge instructions",
                                "Medications on discharge"
                              ],
                              "not": false,
                              "caseInsensitive": true,
                              "operator": "equals"
                            },
                            {
                              "type": "match",
                              "field": "sectionNormalizedName",
                              "operator": "fieldExists"
                            }
                          ]
                        }
                      ]
                    }
                  }
                }
              ]
            }
          }
        ],
        "async": false
      }
    }
  ]
}
See the IBM Watson Annotator for Clinical Data filtering docs for additional details.
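For comparison, the same selection can be applied on the client after an unfiltered analyze call. This is a rough Python sketch, not the ACD SDK: it assumes the response is a dict whose medication annotations sit under unstructured / data / MedicationInd (mirroring the filter target above) and that each annotation may carry a sectionNormalizedName field. The sample response is hypothetical and heavily trimmed.

```python
ALLOWED_SECTIONS = {
    "discharge medication",
    "discharge instructions",
    "medications on discharge",
}

def discharge_medications(response):
    """Keep medication annotations whose normalized section name is allowed."""
    kept = []
    for doc in response.get("unstructured", []):
        for med in doc.get("data", {}).get("MedicationInd", []):
            section = med.get("sectionNormalizedName")
            # Mentions outside any section carry no section field; drop them.
            if section and section.lower() in ALLOWED_SECTIONS:
                kept.append(med)
    return kept

# Hypothetical, trimmed response for the example text in the question.
sample_response = {
    "unstructured": [{
        "data": {
            "MedicationInd": [
                {"coveredText": "cisplatin"},
                {"coveredText": "Aspirin",
                 "sectionNormalizedName": "Discharge medication"},
            ]
        }
    }]
}

meds = discharge_medications(sample_response)
print([m["coveredText"] for m in meds])
```

The server-side flow is usually preferable because it keeps the response small; the client-side version is mainly useful for checking which section names actually appear in your notes.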
Thanks

How to ensure all items in a collection match a filter in Azure Cognitive Search

I have Azure Cognitive Search running, and my index is working as expected.
We are trying to add a security filter to the search, based on the current user's permissions.
The user's permissions come to me as an IEnumerable, but I currently select just a string[] and pass that into my filter after a string.Join, which looks like this.
permission1, permission2, permission3, permission4
In our SQL database, we have a view that the index gets its data from. There is a column on the view called RequiredPermissions. It is a Collection(Edm.String) in the index, and the data looks like this.
[ 'permission1', 'permission2', 'permission3' ]
The requirement is that for a record to return in the results, a user's permissions must contain all of the RequiredPermissions for that record.
So if we have a user with the following permissions
permission1, permission3, permission5
And we have the following records
Id, SearchText, Type, Permissions
1, abc, User, [ 'permission1', 'permission2' ]
2, abc.pdf, Document, [ 'permission1' ]
3, abc, Thing, [ 'permission1', 'permission3' ]
4, abc, Stuff, [ 'permission3', 'permission4' ]
If the user searches for 'abc', all four records match, so I need a $filter that removes the records the user lacks permissions for. I would expect the following results:
Id, Returned, Reason
1, no, the user does not have permission2
2, yes, the user has permission1 and nothing else is needed
3, yes, the user has both permission1 and permission3
4, no, the user does not have permission4
If I run the following filter, then I get back anything that has permission1 or permission3, which is not acceptable, since the user should not see items Id 1 or 4
RequiredPermissions/any(role: search.in(role, 'permission1, permission3', ','))
If I run this filter, then I get nothing back, everything is rejected, because no records have permission5, and the user has it
RequiredPermissions/all(role: not search.in(role, 'permission1, permission3', ','))
If I try to run the search using 'all' and without the 'not' I get the following error
RequiredPermissions/all(role: search.in(role, 'permission1, permission3', ','))
Invalid expression: Invalid lambda expression. Found a test for equality or inequality where the opposite was expected in a lambda expression that iterates over a field of type Collection(Edm.String). For 'any', please use expressions of the form 'x eq y' or 'search.in(...)'. For 'all', please use expressions of the form 'x ne y', 'not (x eq y)', or 'not search.in(...)'.\r\nParameter name: $filter
So it seems that I cannot use the 'not' with 'any', and I must use the 'not' with 'all'
What I want is a way to say that the user's list contains every permission in the record's RequiredPermissions column.
I am currently just working in Postman using the RestApi to solve this, but I will eventually move this into .Net.
Your scenario can't be implemented with Collection(Edm.String) due to the limitations on how all and any work on such collections (documented here).
Fortunately, there is an alternative. You can model permissions as a collection of complex types, which allows you to use all the way that you need to implement your permissions model. Here is a JSON example of how the field would be defined:
{
  "name": "test",
  "fields": [
    { "name": "Id", "type": "Edm.String", "key": true },
    { "name": "RequiredPermissions", "type": "Collection(Edm.ComplexType)", "fields": [{ "name": "Name", "type": "Edm.String" }] }
  ]
}
Here is a JSON example of what a document would look like with its permissions defined:
{ "@search.action": "upload", "Id": "1", "RequiredPermissions": [{"Name": "permission1"}, {"Name": "permission2"}] }
Here is how you could construct a filter that has the desired effect:
RequiredPermissions/all(perm: search.in(perm/Name, 'permission1,permission3,permission5'))
While this works, you are strongly advised to test the performance of this solution with a realistic set of data. Under the hood, all is executed as a negated any, and negated queries can sometimes perform poorly with the type of inverted indexes used by a search engine.
Also, please be aware that there is currently a limit on the number of elements in all complex collections across a document. This limit is currently 3000. So if RequiredPermissions were the only complex collection in your index, this means you could have at most 3000 permissions defined per document.
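To make the intended semantics concrete, here is a small Python sketch that builds the filter string from a user's permission list and models the expected result locally as a subset check over the four records from the question. The function names are hypothetical, and the local check is only a model of the filter, not a call to the search service.

```python
def build_permission_filter(user_permissions):
    """Build the all(...) filter from the answer for a user's permissions."""
    # OData string literals escape a single quote by doubling it.
    joined = ",".join(p.replace("'", "''") for p in user_permissions)
    return f"RequiredPermissions/all(perm: search.in(perm/Name, '{joined}'))"

def is_visible(required_permissions, user_permissions):
    """Local model of the filter: every required permission must be held."""
    return set(required_permissions) <= set(user_permissions)

user = ["permission1", "permission3", "permission5"]
records = {
    1: ["permission1", "permission2"],
    2: ["permission1"],
    3: ["permission1", "permission3"],
    4: ["permission3", "permission4"],
}

print(build_permission_filter(user))
visible_ids = [rid for rid, req in records.items() if is_visible(req, user)]
print(visible_ids)
```

Under this model only records 2 and 3 are visible, which matches the expected results in the question. Permission names that contain a comma or a space would need a different delimiter passed as the third argument of search.in.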

Elastic Search Interaction of Highlights with Synonym Filter

We have an analyzer which includes the synonym filter which is defined as follows:
synonym_filter :
  type : synonym
  synonyms_path : synonyms.txt
  ignore_case : true
  expand : true
  format : solr
In the synonym file we have a synonym defined as follows:
dawdle,waste time
Then in our data we have an entity with a name field "dawdle company".
Because of the synonym filter this gets analyzed to something like:
1 -dawdle- 2 -company- 3
1 -wasted- 2 -time- 3
With time and company in the same position. Then when performing a search for "wasted time" we get a hit on this entity. We would like the highlight to be "dawdle", since that is the equivalent synonym, but it seems Elasticsearch sees this as two hits, since it matched "wasted" and "time", and it returns two highlights: "dawdle" and "company".
Is there a recommended way to solve this kind of issue, where an unexpected word is returned in the highlights because it occupies the same position as a search term that was inserted because of a synonym?
@SergeyS, the situation both you and @user2430530 have is perfectly described in this section of the documentation.
The suggestion there is to define a single term for each series of synonyms, so that you do not get that mix-up of highlighted terms in the result.
Something like this:
"analysis": {
  "analyzer": {
    "synonym": {
      "tokenizer": "whitespace",
      "filter": [
        "synonym"
      ]
    }
  },
  "filter": {
    "synonym": {
      "type": "synonym",
      "synonyms": [
        "dawdle, waste time=>waste_time"
      ]
    }
  }
}
Then you'll get the desired result from ES:
"highlight": {
  "text": [
    "some <em>dawdle</em> company"
  ]
}
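The reason the single-term approach avoids the stray highlight can be shown with a toy model. The Python sketch below is not Elasticsearch code: it is a simplified simulation that whitespace-tokenizes text, contracts both "dawdle" and "waste time" to the single token waste_time, and highlights the original tokens covered by a matching term. All names in it are hypothetical.

```python
# Contraction rules from the answer: each synonym maps to one token.
RULES = {("dawdle",): "waste_time", ("waste", "time"): "waste_time"}

def analyze(text):
    """Whitespace-tokenize, then contract synonyms.

    Returns the original tokens and (term, start, end) triples, where
    start/end index the original tokens that the term covers.
    """
    tokens = text.split()
    out, i = [], 0
    while i < len(tokens):
        for source, target in RULES.items():
            if tuple(tokens[i:i + len(source)]) == source:
                out.append((target, i, i + len(source)))
                i += len(source)
                break
        else:
            # No rule matched at this position; keep the token as-is.
            out.append((tokens[i], i, i + 1))
            i += 1
    return tokens, out

def highlight(doc, query):
    """Wrap the original doc tokens whose analyzed term matches the query."""
    doc_tokens, doc_terms = analyze(doc)
    query_terms = {term for term, _, _ in analyze(query)[1]}
    pieces = []
    for term, start, end in doc_terms:
        original = " ".join(doc_tokens[start:end])
        pieces.append(f"<em>{original}</em>" if term in query_terms else original)
    return " ".join(pieces)

result = highlight("some dawdle company", "waste time")
print(result)
```

Because "dawdle" becomes exactly one token at one position, nothing else shares a position with a query term, so "company" is never highlighted.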

Solr, adding a record via JSON with a multi-value field and boosted values

I'm pretty new to Solr. I'm trying to add a multi-value field with a boost value defined for each value, all defined via JSON. In other words, I'd like this to work:
[{ "id": "ID1000",
"tag": [
{ "boost": 1, "value": "A test value" },
{ "boost": 2, "value": "A boosted value" } ]
}]
I know how to do that in XML (multiple <field name = 'tag' boost = '...'>), but the JSON code above doesn't work. The server says "Error parsing JSON field value. Unexpected OBJECT_START". Is this a Solr limitation or bug?
PS: I fixed the originally-missing ']' and that's not the problem.
EDIT: It seems the way to go should be payloads (http://wiki.apache.org/solr/Payloads), but I couldn't get them to work on Solr (followed this: http://sujitpal.blogspot.co.uk/2011/01/payloads-with-solr.html). Leaving the question open to see if someone can further help.
I found the following sentence in the Solr Relevancy FAQ, in the Query Elevation Component section:
An Index-time boost on a value of a multiValued field applies to all values for that field.
I do not think adding an individual boost to each value in the multivalued field is going to work. I know that the XML will allow it, but I would guess that it only applies the boost from the last value added to the field.
So based on that I would change the JSON to the following and see if that works.
[
  {
    "id": "ID1000",
    "tag": {
      "boost": 2,
      "value": [ "A test value", "A boosted value" ]
    }
  }
]
The JSON seems to be invalid: it is missing a closing ]
[
  {
    "id": "ID1000",
    "tag": [
      {
        "boost": 1,
        "value": "A test value"
      },
      {
        "boost": 2,
        "value": "A boosted value"
      }
    ]
  }
]
You hit an edge case. You can have the boosts on single values and you can have an array of values. But not one inside another (from my reading of Solr 4.1 source code)
That might be something to create as an enhancement request.
If you are generating that JSON by hand, you can try:
"tag": { "boost": 1, "value": "A test value" },
"tag": { "boost": 2, "value": "A boosted value" }
I believe Solr will merge the values then. But if you are generating it via a framework, it will most likely disallow or override multiple object property names (tag here).
The error has nothing to do with boosting.
I get the same error with a very simple json doc.
No luck solving it.
See: Solr errors when trying to parse a collection: Error parsing JSON field value. Unexpected OBJECT_START
I hit the same error message. The error message was actually misleading. The real underlying error was that two of the fields required by schema.xml in the Solr configuration were missing from the JSON payload.
An error message along the lines of "required parameters are missing in the document" would have been more helpful here. You might want to check whether some required fields are missing from your JSON payload.
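A quick way to rule this cause out is to check the payload against the required fields before posting it. The following Python sketch assumes a hand-maintained list of required field names; the names "id" and "title" are hypothetical and should mirror whatever your schema.xml marks as required.

```python
import json

# Hypothetical list; in practice, mirror the required fields in schema.xml.
REQUIRED_FIELDS = ["id", "title"]

def missing_required(docs, required=REQUIRED_FIELDS):
    """Map each document index to the required fields it lacks."""
    problems = {}
    for index, doc in enumerate(docs):
        missing = [name for name in required if name not in doc]
        if missing:
            problems[index] = missing
    return problems

payload = json.loads(
    '[{"id": "ID1000", "tag": ["A test value", "A boosted value"]}]'
)
problems = missing_required(payload)
print(problems)
```

An empty result means every document carries all the listed fields, so the parsing error must come from somewhere else, such as the nested boost objects discussed above.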
