We are using Azure Cognitive Search to index various documents, e.g. Word or PDF files, which are stored in Azure Blob Storage. We would like to be able to translate the extracted content of non-English documents and store the translation result into a dedicated field in the index.
Currently the built-in Text Translation cognitive skill supports up to 50,000 characters of input. The documents we have can contain up to 1 MB of text. According to the documentation it's possible to split the text into chunks using the built-in Split Skill; however, there is no skill that merges the translated chunks back together. Our goal is to have all the extracted text translated and stored in one index field of type Edm.String, not an array.
Is there any way to translate large text blocks when indexing, other than creating a custom Cognitive Skill via Web API for that purpose?
Yes, the Merge Skill will actually do this. Define the skill in your skillset as shown below. The "text" and "offsets" inputs to this skill are optional, and you can use "itemsToInsert" to specify the text you want to merge together (specify the appropriate source for your translation output). Use insertPreTag and insertPostTag if you want to insert, for example, a space before or after each merged section.
{
"#odata.type": "#Microsoft.Skills.Text.MergeSkill",
"description": "Merge text back together",
"context": "/document",
"insertPreTag": "",
"insertPostTag": "",
"inputs": [
{
"name": "itemsToInsert",
"source": "/document/translation_output/*/text"
}
],
"outputs": [
{
"name": "mergedText",
"targetName" : "merged_text_field_in_your_index"
}
]
}
Below is a snippet in C#, using the Microsoft.Azure.Search classes. It follows the suggestion given by Jennifer in the reply above.
The skillset definition was tested and properly supports translation of text blocks larger than 50,000 characters.
private static IList<Skill> GetSkills()
{
var skills = new List<Skill>();
skills.AddRange(new Skill[] {
// ...some skills in the pipeline before translation
new ConditionalSkill(
name: "05-1-set-language-code-for-split",
description: "Set compatible language code for split skill (e.g. 'ru' is not supported)",
context: "/document",
inputs: new []
{
new InputFieldMappingEntry(name: "condition", source: SplitLanguageExpression),
new InputFieldMappingEntry(name: "whenTrue", source: "/document/language_code"),
new InputFieldMappingEntry(name: "whenFalse", source: "= 'en'")
},
outputs: new [] { new OutputFieldMappingEntry(name: "output", targetName: "language_code_split") }
),
new SplitSkill
(
name: "05-2-split-original-content",
description: "Split original merged content into chunks for translation",
defaultLanguageCode: SplitSkillLanguage.En,
textSplitMode: TextSplitMode.Pages,
maximumPageLength: 50000,
context: "/document/merged_content_original",
inputs: new []
{
new InputFieldMappingEntry(name: "text", source: "/document/merged_content_original"),
new InputFieldMappingEntry(name: "languageCode", source: "/document/language_code_split")
},
outputs: new [] { new OutputFieldMappingEntry(name: "textItems", targetName: "pages") }
),
new TextTranslationSkill
(
name: "05-3-translate-original-content-pages",
description: "Translate original merged content chunks",
defaultToLanguageCode: TextTranslationSkillLanguage.En,
context: "/document/merged_content_original/pages/*",
inputs: new []
{
new InputFieldMappingEntry(name: "text", source: "/document/merged_content_original/pages/*"),
new InputFieldMappingEntry(name: "fromLanguageCode", source: "/document/language_code")
},
outputs: new [] { new OutputFieldMappingEntry(name: "translatedText", targetName: "translated_text") }
),
new MergeSkill
(
name: "05-4-merge-translated-content-pages",
description: "Merge translated content into one text string",
context: "/document",
insertPreTag: " ",
insertPostTag: " ",
inputs: new []
{
new InputFieldMappingEntry(name: "itemsToInsert", source: "/document/merged_content_original/pages/*/translated_text")
},
outputs: new [] { new OutputFieldMappingEntry(name: "mergedText", targetName: "merged_content_translated") }
),
// ... some skills in the pipeline after translation
});
return skills;
}
private static string SplitLanguageExpression
{
get
{
var values = Enum.GetValues(typeof(SplitSkillLanguage)).Cast<SplitSkillLanguage>();
var parts = values.Select(v => "($(/document/language_code) == '" + v.ToString().ToLower() +"')");
return "= " + string.Join(" || ", parts);
}
}
Related
I have a React application that performs CRUD operations on data stored in MongoDB at cloud.mongodb.com.
The schema of the data in my React app looks like this:
const restaurantSchema = new Schema({
"uuid": {
"type": "string"
},
"name": {
"type": "string"
},
"city": {
"type": "string"
}
}, {timestamps: true});
I would like to add a new field called "preference" of type number.
My questions are:
How do I add this new "preference" field?
Can I give it a default value of, say, 1 when I add this new field? (There are 900 entries in the MongoDB collection.)
Can I set the "preference" value based on the order of the "name" field in ascending order?
Thanks.
You can add and remove fields in the schema using the option { strict: false }.
option: strict
The strict option (enabled by default) ensures that values passed to our model constructor that were not specified in our schema do not get saved to the db.
var thingSchema = new Schema({..}, { strict: false });
You can also do this in an update query:
Model.findOneAndUpdate(
query, //filter
update, //data to update
{ //options
returnNewDocument: true,
new: true,
strict: false
}
)
You can check the documentation here
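To your second question (a default of 1 for the 900 existing entries), one approach is to declare a default on the schema so new documents get it automatically, and backfill the existing documents with updateMany. This is a minimal sketch, not tested against your data; the Restaurant model name and the mongoose import are assumptions, so adapt them to your code:
// Assumed setup (not from the question): adjust names to match your app.
const mongoose = require('mongoose');
const { Schema } = mongoose;

const restaurantSchema = new Schema({
  uuid: { type: String },
  name: { type: String },
  city: { type: String },
  // New documents get preference = 1 automatically.
  preference: { type: Number, default: 1 }
}, { timestamps: true });

const Restaurant = mongoose.model('Restaurant', restaurantSchema);

// Backfill the ~900 existing documents that don't have the field yet.
Restaurant.updateMany(
  { preference: { $exists: false } },
  { $set: { preference: 1 } }
).then(result => console.log(result));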
I'm creating a new web application using AngularJS. It consists of a main page with a side menu, whose structure is pre-defined via a JSON file. The JSON would look something like the following (highly simplified!!):
G_Main_Menu = {
"Management":
[{"Command":"DoThis","Label":"Label_DoThis"},
{"Command":"DoThat","Label":"Label_DoThat"}],
"Others":
[{"Command":"DoOther","Label":"Label_DoOther"}]
}
On the other hand, within the HTML page I would be displaying labels extracted from the database (it is a multi-lingual application, and hence the contents of the labels depend on the language selected by the user):
...{{ThisIstheLabelFor_DoThis}}...
...
...{{ThisIstheLabelFor_DoThat}}...
...
...{{ThisIstheLabelFor_DoOther}}...
The JSON as received from the database would look like:
{"Management":
{"Label_DoThis":"This is the explicit contents of label DoThis",
:
"Label_DoThat":"This is the explicit contents of label DoThat",
:
},
"Others":
{"Label_DoOther":"This is the explicit contents of label DoOther"
}
}
So, I have a JSON object that contains a string specifying the name of an element contained in a second JSON object.
How could I implement such an indirect lookup?
Thanks in advance.
You can look up the label in the translation JSON using a function:
{{::translate(ThisIstheLabelFor_DoThis, category)}}
where category could be "Management" etc.
And the translate function could be implemented like this:
$scope.translate = function(label_name, category){
return TranslateJson[category][label_name];
}
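Alternatively, you can resolve all the labels once up front by walking the menu definition and replacing each Label value with the corresponding text from the translation JSON, as in the snippet below: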
let gMainMenu = {
"Management": [{
"Command": "DoThis",
"Label": "Label_DoThis"
},
{
"Command": "DoThat",
"Label": "Label_DoThat"
}
],
"Others": [{
"Command": "DoOther",
"Label": "Label_DoOther"
}]
};
let db = {
"Management": {
"Label_DoThis": "This is the explicit contents of label DoThis",
"Label_DoThat": "This is the explicit contents of label DoThat",
},
"Others": {
"Label_DoOther": "This is the explicit contents of label DoOther"
}
};
Object.values(gMainMenu).forEach(menu => {
menu.forEach(item => {
let explicitContent = null;
for (let d of Object.values(db)) {
let key = Object.keys(d).find(k => k === item.Label);
if(key){
explicitContent = d[key];
}
}
item.Label = explicitContent;
});
});
console.log(gMainMenu);
I wrote a really simple schema using GraphQL, but somehow all the IDs in the edges are the same.
{
"data": {
"imageList": {
"id": "SW1hZ2VMaXN0Og==",
"images": {
"edges": [
{
"node": {
"id": "SW1hZ2U6",
"url": "1.jpg"
}
},
{
"node": {
"id": "SW1hZ2U6",
"url": "2.jpg"
}
},
{
"node": {
"id": "SW1hZ2U6",
"url": "3.jpg"
}
}
]
}
}
}
}
I posted the specific details on GitHub; here's the link.
So, globalIdField expects your object to have a field named 'id'. It then takes the string you pass to globalIdField and adds a ':' and your object's id to create its globally unique id.
If your object doesn't have a field called exactly 'id', then it won't append anything, and each global ID will just be the string you passed in plus ':'. So they won't be unique; they will all be the same.
There is a second parameter you can pass to globalIdField: a function that receives your object and returns an id for globalIdField to use. So let's say your object's id field is actually called '_id' (thanks, Mongo!). You would call globalIdField like so:
id: globalIdField('Image', image => image._id)
There you go. Unique IDs for Relay to enjoy.
Here is the link to the relevant source-code in graphql-relay-js: https://github.com/graphql/graphql-relay-js/blob/master/src/node/node.js#L110
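For completeness, here is a minimal sketch of an ImageType that uses such an idFetcher. It assumes your image records come from MongoDB (so the raw id lives in _id) and that GraphQL and Relay are the graphql and graphql-relay modules, as in the other snippets in this thread:
var GraphQL = require('graphql');
var Relay = require('graphql-relay');

var ImageType = new GraphQL.GraphQLObjectType({
  name: 'Image',
  description: 'A single image',
  fields: () => ({
    // The second argument tells globalIdField where to find the raw id
    // when the object stores it as `_id` instead of `id`.
    id: Relay.globalIdField('Image', image => image._id),
    url: {
      type: GraphQL.GraphQLString,
      description: 'The image URL',
    },
  }),
});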
Paste the following code into the browser console:
atob('SW1hZ2U6')
You will find that the decoded value of the id is "Image:".
That means the id property of every record fetched by (new MyImages()).getAll() is null.
Either return unique ids or, as I suggest, define images as a GraphQLList:
var ImageListType = new GraphQL.GraphQLObjectType({
name: 'ImageList',
description: 'A list of images',
fields: () => ({
id: Relay.globalIdField('ImageList'),
    images: {
      type: new GraphQL.GraphQLList(ImageType),
      description: 'A collection of images',
      // With a plain list (rather than a Relay connection) the resolver
      // simply returns the promised array of images.
      resolve: () => (new MyImages()).getAll(),
    },
}),
interfaces: [nodeDefinition.nodeInterface],
});
I am using OpenLayers 3 to animate the paths of migrating animals tagged by scientists. I load the GeoJSON file like so:
var whaleSource = new ol.source.Vector({
url: 'data/BW2205005.json',
format: new ol.format.GeoJSON()
});
Instead of loading this directly into a layer, I would like to use and reuse the data in the GeoJSON file for different purposes throughout my program. For example, I want to pull the lat/lon coordinates into an array and manipulate them to create interpolated animated tracks. Later I will want to query the GeoJSON properties to restyle the tracks of males and females.
How might I load the GeoJSON data into various arrays at different stages of my program instead of directly into a layer?
Thanks much
When using the url property of ol.source.Vector, the class loads the given url via XHR/AJAX for you:
Setting this option instructs the source to use an XHR loader (see ol.featureloader.xhr) and an ol.loadingstrategy.all for a one-off download of all features from that URL.
You could load the file yourself using XMLHttpRequest or a library like jQuery, which has XHR/AJAX functionality. When you've retrieved the GeoJSON collection, you can loop over the features array it holds, split it up into whatever you need, and put those features into new, separate GeoJSON collections. Here's a very crude example to give you an idea of the concept.
Assume the following GeoJSON collection:
{
"type": "FeatureCollection",
"features": [{
"type": "Feature",
"geometry": {
"type": "Point",
"coordinates": [0, 0]
},
"properties": {
"name": "Free Willy"
}
}, {
"type": "Feature",
"geometry": {
"type": "Point",
"coordinates": [1, 1]
},
"properties": {
"name": "Moby Dick"
}
}, {
// Etc.
}]
}
Here's how to load it (using jQuery's $.getJSON XHR function) and split it up into separate collections:
// Object to store separated collections
var whales = {};
// Load url and execute handler function
$.getJSON('collection.json', function (data) {
// Iterate the features of the collection
data.features.forEach(function (feature) {
        // Check whether there's already a collection for this whale's name
if (!whales.hasOwnProperty(feature.properties.name)) {
            // No collection for this whale yet, so create an empty one
whales[feature.properties.name] = {
"type": "FeatureCollection",
"features": []
};
}
// Add the feature to the collection
whales[feature.properties.name].features.push(feature);
});
});
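If you'd rather avoid the jQuery dependency, the same one-off download can be done with the browser's built-in fetch API (in browsers that support it). A sketch of the equivalent loading step, with the grouping logic unchanged:
fetch('collection.json')
    .then(function (response) { return response.json(); })
    .then(function (data) {
        data.features.forEach(function (feature) {
            if (!whales.hasOwnProperty(feature.properties.name)) {
                whales[feature.properties.name] = {
                    "type": "FeatureCollection",
                    "features": []
                };
            }
            whales[feature.properties.name].features.push(feature);
        });
    });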
Now you can use the separate collections stored in the whales object to create layers. Note this differs somewhat from using the url property:
new ol.layer.Vector({
source: new ol.source.Vector({
features: (new ol.format.GeoJSON()).readFeatures(whales['Free Willy'], {
featureProjection: 'EPSG:3857'
})
})
});
Here's a working example of the concept: http://plnkr.co/edit/rGwhI9vpu8ZYfAWvBZZr?p=preview
Edit after comment:
If you want all the coordinates for Willy:
// Empty array to store coordinates
var willysCoordinates = [];
// Iterate over Willy's features
whales['Free Willy'].features.forEach(function (feature) {
willysCoordinates.push(feature.geometry.coordinates);
});
Now willysCoordinates holds a nested array of coordinates, e.g. [[0, 0], [2, 2]] if the collection contains two features for Willy.
I am building my first app in AngularJS.
Here is the plunkr with what I've done so far. The user should be able to add new websites and group them into groups. Groups are also created by the user. Any time a new group is created it becomes available for new websites. What the app should also do is update the group objects with newly assigned websites... and this is where I fail.
Here is what the JSON should look like:
{
"sites": [
{
"url": "http://sfdg",
"id": 0,
"groups": [
{
"name": "adsf",
"id": 0
}
]
}
],
"groups": [
{
"name": "adsf",
"id": 0,
"sites": [//sites assigned
]
}
]
}
In the plunkr code I used push, but that just adds a new group...
Could you please point me to the right way of achieving this?
Thanks!
To prevent circular references (a website object refers to a group object that refers back to the website object, etc.), I would store the ids of the relevant objects instead.
First, when creating a new group, add an empty sites array to it:
function createGroup(newGroup) {
newGroup.sites = []; // <-- add empty array of sites
$scope.groups.push(newGroup);
newGroup.id = groupIdSeq;
groupMap[newGroup.id] = newGroup;
groupIdSeq++;
return newGroup;
}
Then, when you create a new site, update each group to which the site is added:
function createSite(newSite, groups) {
$scope.sites.push(newSite);
newSite.id = siteIdSeq;
sitesMap[newSite.id] = newSite;
// instead of storing the groups array, only store their id:
newSite.groups = groups.map(function(group) { return group.id });
// and add this new sites id to the groups' sites array.
groups.forEach(function(group) {
group.sites.push(newSite.id);
});
siteIdSeq++;
return newSite;
}
(updated plunker here)
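Since a site now only stores group ids, you need a small lookup whenever you want the full group objects back (for example to show the group names next to a site). A minimal sketch reusing the groupMap that createGroup already maintains; someSite is a placeholder for one of your site objects:
// Resolve a site's stored group ids back to the group objects.
// groupMap is the id -> group lookup populated in createGroup above.
function groupsOfSite(site) {
  return site.groups.map(function (id) {
    return groupMap[id];
  });
}

// Example: list the group names assigned to a site.
var names = groupsOfSite(someSite).map(function (group) {
  return group.name;
});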