Extend Azure Search fields with Cosmos DB indexer

I have multiple Cosmos DB document types which are aggregated into a single Azure Search index.
For each document type I have a data source + indexer pair. Each pair is autogenerated and maps Cosmos DB document fields into the index.
I ran into an issue: I generated a new indexer + data source pair for a new document type. I can see that the indexer completed indexing, but my index does not contain the new fields. I can't even manually add a new field to hold the data produced by the indexer.
Based on the docs it should be possible to add a new field without rebuilding the index, but it's not clear how exactly this can be done.

What problem are you seeing when you try to add the new index field manually? You should be able to add a new field to the index definition via the REST API or the portal.
Via portal: Navigate to your search service, select the appropriate index from the list, click on the "Fields" tab, and click "Add field". A new field should show up at the bottom of the index grid, and you should be able to edit its attributes, name, and type.
Via REST: You have the right link to the update operation. You can do a GET on the existing index definition and add your new field to the "fields" array before using PUT to update. Your new index field JSON should be similar to the other fields in the index, for example:
"fields": [
...,
{
"name": "new field",
"type": "Edm.String",
"facetable": false,
"filterable": false,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": false,
"analyzer": "standard.lucene",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
}
]
Here's a doc reference on the index definition JSON.
Be sure to set index field attributes (filterable, sortable, searchable, ...) when you create the field, as not all attributes can be added without rebuilding the index. Only retrievable, searchAnalyzer, and synonymMaps can be modified on an existing field.
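Putting the REST route together, the round trip looks roughly like this (the service name, index name, key, and api-version below are placeholders, not values from your setup):

GET https://[service].search.windows.net/indexes/[index]?api-version=2020-06-30
api-key: [admin key]

Add the new field to the "fields" array of the returned definition, then:

PUT https://[service].search.windows.net/indexes/[index]?api-version=2020-06-30
api-key: [admin key]
Content-Type: application/json

{ ...the full index definition with the new field added... }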

Related

How to index blob content with existing "content" field that is Collection(Edm.String)?

I can successfully index documents like PDFs, etc. from blob storage with Azure Search, and the extracted text goes into a field called content by default.
But what I want to achieve is:
index the blob file content to a field called fileContent (Edm.String)
have a field for other uses called content (Collection(Edm.String))
And I cannot make this work without an error. I've tried several approaches with partial success, but from what I can tell it's not possible to redirect the data to a field other than content while also having a content field defined as Collection(Edm.String).
Here's what I've tried:
Have output field mappings set up so that the content goes into a field called "fileContent". For example:
"outputFieldMappings": [
{
"sourceFieldName": "/document/content",
"targetFieldName": "fileContent"
}
]
This works fine and the content of the file goes into the fileContent field defined as Edm.String. However, if I add a custom field called content to my index defined as Collection(Edm.String), I get an exception during the indexing operation:
The data field 'content' in the document with key '1234' has an invalid value of type 'Edm.String' (String maps to Edm.String). The expected type was 'Collection(Edm.String)'.
Why does it care what my data type for content is when I'm mapping this to a different field?
I have verified that if I make the content field just Edm.String I don't get an error but now I have duplicate entries in the index since both content and fileContent contain the same information.
According to the documentation it's possible to change the field from content to something else (but then it doesn't tell you how):
A content field is common to blob content. It contains the text extracted from blobs. Your definition of this field might look similar to the one above. You aren't required to use this name, but doing so lets you take advantage of implicit field mappings. The blob indexer can send blob contents to a content Edm.String field in the index, with no field mappings required.
I've also tried using normal (non-output) fieldMappings to redirect the input content field to fileContent, but I end up with the same error if content is also defined as Collection(Edm.String):
{
  "sourceFieldName": "content",
  "targetFieldName": "fileContent",
  "mappingFunction": null
}
I've also tried redirecting this content through a skillset but even though I can capture that output in a custom field, as soon as I add the content (Collection(Edm.String)) everything explodes.
Any pointers are much appreciated.
Update: Turns out that the above (non-output) fieldMapping does work as long as the fileContent type is just Edm.String. However, if you want to add a skillset to process this data, that data needs to be redirected to yet another field. It will not allow you to redirect that back to fileContent, and you end up with an error like: "Enrichment target name 'fileContent' collides with existing '/document/fileContent'". So it seems that you end up being required to store the raw blob document data in a field, and if you want to process it, that requires another field, which is quite annoying.
The indexer will try to index as much content as possible by matching index field names, which is why it attempts to put the blob content string into the content index field collection (and fails).
To get around this you need to add a (non-output) field mapping from content to another name that's not an index field name, such as blobContent, to prevent the indexer from being too eager (a sketch of that mapping follows the skill example below). Then in the skillset you can use blobContent by either
replacing all occurrences of /document/content with /document/blobContent, or
setting a value for /document/content which is only accessible within the skillset (and output field mappings), using a conditional skill to minimize other changes to your skillset:
{
  "#odata.type": "#Microsoft.Skills.Util.ConditionalSkill",
  "context": "/document",
  "inputs": [
    { "name": "condition", "source": "= true" },
    { "name": "whenTrue", "source": "/document/blobContent" },
    { "name": "whenFalse", "source": "= null" }
  ],
  "outputs": [ { "name": "output", "targetName": "content" } ]
}
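For completeness, here's a minimal sketch of the (non-output) field mapping described above, reusing the shape from your question (blobContent is just an illustrative name; any name that isn't an index field name works):

"fieldMappings": [
  {
    "sourceFieldName": "content",
    "targetFieldName": "blobContent"
  }
]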

Error Creating Index Using SDK When analyzers or tokenFilters Elements are Present

I have an index definition that includes a token filter definition and a corresponding custom analyzer definition as shown below.
"suggesters": [],
"scoringProfiles": [],
"defaultScoringProfile": "",
"corsOptions": null,
"analyzers": [
{
"#odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"name": "test_soundex",
"tokenizer": "standard_v2",
"tokenFilters": [
"lowercase",
"test_phonetic"
],
"charFilters": []
}
],
"charFilters": [],
"tokenFilters": [
{
"#odata.type": "#Microsoft.Azure.Search.PhoneticTokenFilter",
"name": "test_phonetic",
"encoder": "soundex",
"replace": false
}
],
"tokenizers": [],
When I attempt to create a new index using this definition I get the following error:
Microsoft.Rest.Azure.CloudException: The request is invalid. Details: index : A type named ‘Analyzer’ could not be resolved by the model. When a model is available, each type name must resolve to a valid type.
If I remove the analyzers and tokenFilters elements from the definition the index gets created with no issue. If I remove the custom analyzer definition I get a similar error where "A type named ‘TokenFilter’ could not be resolved by the model".
I’m running with the latest version of the SDK (10.1.0).
For further clarity, here is the code that I'm using to create the index. I'm not instantiating the Analyzer directly; it's being created when the Index object is deserialized below.
// Deserialize the index definition JSON and create the index via the SDK
var text = System.IO.File.ReadAllText(@"index.json");
var indexDefinition = JsonConvert.DeserializeObject<Index>(text);
_searchServiceClient.Indexes.Create(indexDefinition);
I know the index definition is valid as the same JSON works fine for creating the index when submitted using the API via Postman. Any thoughts?
The JSON deserializer needs to instantiate a CustomAnalyzer instead of an Analyzer. The Analyzer class ought to be abstract, but isn’t currently. Unless you use the same JSON serializer settings as the SDK itself, you won't be able to successfully serialize and deserialize SDK model classes on your own. This is not actually a supported scenario -- you might be able to get it to work, but it's not something we test.
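Since you mention the same JSON creates the index fine through the REST API, one workaround is to bypass the SDK's model classes and send the file contents directly; a rough sketch (service name, key, and api-version are placeholders):

POST https://[service].search.windows.net/indexes?api-version=2019-05-06
api-key: [admin key]
Content-Type: application/json

{ ...contents of index.json, including the analyzers and tokenFilters sections... }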
As an aside for folks who might instantiate Analyzer manually by mistake, we're working on a new .NET SDK for Azure Cognitive Search that is currently in preview. In the new library, Analyzer has been renamed to LexicalAnalyzer and its constructor is no longer accessible.

How to ensure all items in a collection match a filter in Azure Cognitive Search

I have Azure Cognitive Search running, and my index is working as expected.
We are trying to add a security filter into the search, based on the current users permissions.
The user's permissions come to me as an IEnumerable, but I am currently selecting just a string[] and passing that into my filter after a string.Join, which looks like this:
permission1, permission2, permission3, permission4
In our SQL database, we have a view that the index gets its data from. There is a column on the view called RequiredPermissions; it is a Collection(Edm.String) in the index, and the data looks like this:
[ 'permission1', 'permission2', 'permission3' ]
The requirement is that for a record to return in the results, a user's permissions must contain all of the RequiredPermissions for that record.
So if we have a user with the following permissions
permission1, permission3, permission5
And we have the following records
Id, SearchText, Type, Permissions
1, abc, User, [ 'permission1', 'permission2' ]
2, abc.pdf, Document, [ 'permission1' ]
3, abc, Thing, [ 'permission1', 'permission3' ]
4, abc, Stuff, [ 'permission3', 'permission4' ]
If the user searched for 'abc', all four of these results would come back, but I need to $filter out the results the user doesn't have the proper permissions for. So I would expect the following results:
Id, Returned, Reason
1, no, the user does not have permission2
2, yes, the user has permission1 and nothing else is needed
3, yes, the user has both permission1 and permission3
4, no, the user does not have permission4
If I run the following filter, then I get back anything that has permission1 or permission3, which is not acceptable, since the user should not see items 1 or 4:
RequiredPermissions/any(role: search.in(role, 'permission1, permission3', ','))
If I run this filter, then I get nothing back; everything is rejected, because no records have permission5 and the user has it:
RequiredPermissions/all(role: not search.in(role, 'permission1, permission3', ','))
If I try to run the search using 'all' and without the 'not', I get the following error:
RequiredPermissions/all(role: search.in(role, 'permission1, permission3', ','))
Invalid expression: Invalid lambda expression. Found a test for equality or inequality where the opposite was expected in a lambda expression that iterates over a field of type Collection(Edm.String). For 'any', please use expressions of the form 'x eq y' or 'search.in(...)'. For 'all', please use expressions of the form 'x ne y', 'not (x eq y)', or 'not search.in(...)'.\r\nParameter name: $filter
So it seems that I cannot use 'not' with 'any', and I must use 'not' with 'all'.
What I want is a way to say that the user's permission list contains all of the permissions in the RequiredPermissions column.
I am currently just working in Postman using the REST API to solve this, but I will eventually move this into .NET.
Your scenario can't be implemented with Collection(Edm.String) due to the limitations on how all and any work on such collections (documented here).
Fortunately, there is an alternative. You can model permissions as a collection of complex types, which allows you to use all in the way you need to implement your permissions model. Here is a JSON example of how the field would be defined:
{
  "name": "test",
  "fields": [
    { "name": "Id", "type": "Edm.String", "key": true },
    { "name": "RequiredPermissions", "type": "Collection(Edm.ComplexType)", "fields": [{ "name": "Name", "type": "Edm.String" }] }
  ]
}
Here is a JSON example of what a document would look like with its permissions defined:
{ "#search.action": "upload", "Id": "1", "RequiredPermissions": [{"Name": "permission1"}, {"Name": "permission2"}] }
Here is how you could construct a filter that has the desired effect:
RequiredPermissions/all(perm: search.in(perm/Name, 'permission1,permission3,permission5'))
While this works, you are strongly advised to test the performance of this solution with a realistic set of data. Under the hood, all is executed as a negated any, and negated queries can sometimes perform poorly with the type of inverted indexes used by a search engine.
Also, please be aware that there is currently a limit on the number of elements in all complex collections across a document. This limit is currently 3000. So if RequiredPermissions were the only complex collection in your index, this means you could have at most 3000 permissions defined per document.
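Since you're working in Postman for now, here's a sketch of what the full search request would look like with this filter (service name, key, and api-version are placeholders; the index name test comes from the definition above):

POST https://[service].search.windows.net/indexes/test/docs/search?api-version=2020-06-30
api-key: [query key]
Content-Type: application/json

{
  "search": "abc",
  "filter": "RequiredPermissions/all(perm: search.in(perm/Name, 'permission1,permission3,permission5'))"
}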

Google Cloud Datastore API update issue

Following the guidance provided here (https://cloud.google.com/datastore/docs/concepts/transactions), I'm creating an entity with the below schema:
data: [
  {
    name: 'created',
    value: new Date().toJSON()
  },
  {
    name: 'name',
    value: templateObj.name,
    excludeFromIndexes: true
  }
]
I see this goes through as expected and notice that the property "name" is not indexed.
Now I use a transaction and update the entity. Below is the payload:
[
  {
    "property": "active",
    "value": false
  },
  {
    "property": "name",
    "value": "new updated template"
  }
]
Code is exactly the same as documented.
While updating the data, why does it update the schema?
There is really no schema when it comes to Cloud Datastore; it is schemaless. Any time you save (insert or update) an entity, regardless of its Kind, you have to specify whether or not the properties in the entity should be indexed. If you do not explicitly set excludeFromIndexes, the property will be indexed. So when you create an entity or update an existing one, make sure to set excludeFromIndexes to true on every write if you want to keep the property unindexed.
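For illustration, a sketch of the update payload with the exclusion carried forward; this assumes the same name/value entity shape as your create call, since the flag must be restated on every save:

[
  {
    name: 'active',
    value: false
  },
  {
    // restate excludeFromIndexes on update, or 'name' becomes indexed again
    name: 'name',
    value: 'new updated template',
    excludeFromIndexes: true
  }
]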

Simple request using googleapis to Freebase

I would like to query freebase, to get some basic information about a celebrity.
For example, I would like to get the place of birth, and the gender of Madonna.
I managed to get Madonna's friends by using:
https://www.googleapis.com/freebase/v1/mqlread?&query={"id":"/en/madonna","name":null,"type":"/base/popstra/celebrity","/base/popstra/celebrity/friendship":[{"participant":[]}]}
But I understand this request pretty poorly. How can I change it to get the information I mentioned above?
I was thinking about :
https://www.googleapis.com/freebase/v1/mqlread?&query={"id":"/en/madonna","name":null,"type":"/base/popstra/celebrity","/base/popstra/celebrity/gender":[{"gender":}], "/base/popstra/celebrity/place_of_birth":[{"place of birth":}]}
but of course it's not working.
Here is how you build MQL queries for Freebase topics:
Find a sample topic that matches your query (ex. Madonna)
Go to the Query Editor and build a simple query for the ID of that topic:
[{
  "id": "/en/madonna",
  "name": null
}]
Go to the sample topic page, click on "Edit this topic" and find the facts that you want to query for (ex. Date of birth)
You can see that Date of birth is part of the Person (/people/person) type.
Click on the little wrench icon to the right of Person to go to the schema page for that type.
From there, you can see that the property that stores the date of birth is called /people/person/date_of_birth.
Add the properties that you want to your query like so:
[{
  "id": "/en/madonna",
  "name": null,
  "/people/person/date_of_birth": null,
  "/people/person/gender": null
}]
Queries can also specify a default type (ex. /people/person) which makes the property paths shorter:
[{
  "id": "/en/madonna",
  "name": null,
  "type": "/people/person",
  "date_of_birth": null,
  "gender": null
}]
But each level of query can only have one default type so if you want to mix in properties from other types you'll need to use the full property IDs:
[{
  "id": "/en/madonna",
  "name": null,
  "type": "/people/person",
  "date_of_birth": null,
  "gender": null,
  "/base/popstra/celebrity/friendship": [{
    "participant": []
  }]
}]
When you've got your query working in the query editor, just click the "Link" menu at the top right to get the MQLRead Link that will make that query to our API.
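To answer the original question directly: place of birth is also a /people/person property, so a query along these lines should return both facts you asked about (the place_of_birth property name is taken from the /people/person schema; verify it in the Query Editor first):

[{
  "id": "/en/madonna",
  "name": null,
  "type": "/people/person",
  "place_of_birth": null,
  "gender": null
}]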
