How do I get French text FEMMES.COM to index as language variants of FEMMES - azure-cognitive-search

I need FEMMES.COM to get tokenized as singular + plural forms of the base word FEMME.
Custom Analyzer Config
"analyzers": [ { "#odata.type": "#Microsoft.Azure.Search.CustomAnalyzer", "name": "text_language_search_custom_analyzer", "tokenizer": "text_language_search_custom_analyzer_ms_tokenizer", "tokenFilters": [ "lowercase", "asciifolding" ], "charFilters": [ "html_strip" ] } ], "tokenizers": [ { "#odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer", "name": "text_language_search_custom_analyzer_ms_tokenizer", "maxTokenLength": 300, "isSearchTokenizer": false, "language": "english" } ], "tokenFilters": [], "charFilters": []}
Analyze API call for FEMMES
{ "analyzer": "text_language_search_custom_analyzer", "text": "FEMMES" }
Analyze API response for FEMMES
{ "#odata.context": "https://one-adscope-search-eu-stage.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01.AnalyzeResult", "tokens": [ { "token": "femme", "startOffset": 0, "endOffset": 6, "position": 0 }, { "token": "femmes", "startOffset": 0, "endOffset": 6, "position": 0 } ] }
Analyze API response for FEMMES.COM
{ "#odata.context": "https://one-adscope-search-eu-stage.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01.AnalyzeResult", "tokens": [ { "token": "femmes", "startOffset": 0, "endOffset": 6, "position": 0 }, { "token": "com", "startOffset": 7, "endOffset": 10, "position": 1 } ] }
Analyze API response for FEMMES COM
{ "#odata.context": "https://one-adscope-search-eu-stage.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01.AnalyzeResult", "tokens": [ { "token": "femme", "startOffset": 0, "endOffset": 6, "position": 0 }, { "token": "femmes", "startOffset": 0, "endOffset": 6, "position": 0 }, { "token": "com", "startOffset": 7, "endOffset": 10, "position": 1 } ]}

I think I figured this one out myself after some experimentation. I found that a MappingCharFilter could be used to replace the . with a , before the indexer tokenized the text, which allowed the lemmatization/stemming to work as expected on the terms in question. I still need to run more thorough integration tests against our other use cases, but I think this would solve the problem for anybody facing the same type of issue.
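For anyone who wants to try the same thing, the char filter definition would look something like this (the filter name here is my own; the mapping uses the standard .=>, syntax), referenced from the analyzer's charFilters list alongside html_strip:

"charFilters": [
  {
    "#odata.type": "#Microsoft.Azure.Search.MappingCharFilter",
    "name": "dot_to_comma_char_filter",
    "mappings": [ ".=>," ]
  }
]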

My previous answer was not correct. The Azure Search implementation actually applies the language tokenizer BEFORE token filters, which essentially made the WordDelimiter token filter useless in my use case.
What I ended up having to do was pre-process the data BEFORE I uploaded it to Azure for indexing. In my C# code, I added some regex logic that breaks apart text like FEMMES2017 into FEMMES 2017 before sending it to Azure. This way, when the text gets to Azure, the indexer sees FEMMES by itself and properly tokenizes it as FEMME and FEMMES using the language tokenizer.
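Roughly, the pre-processing looked like this (a sketch, not the exact production code; the helper name and exact pattern are mine):

using System.Text.RegularExpressions;

// Insert a space at each letter-to-digit boundary,
// e.g. "FEMMES2017" becomes "FEMMES 2017".
static string SplitLetterDigit(string text) =>
    Regex.Replace(text, @"(?<=\p{L})(?=\p{N})", " ");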

Related

Ingesting and line-breaking JSON data into Splunk

Looking to ingest this REST API data into Splunk, but I'm having issues with LINE_BREAKER and can't seem to find the correct combination for props.conf.
Also, as the data is returned in array format without keys, do I need a script to add the keys to the returned array data, or can this be achieved within Splunk?
N.B.
The keys are returned in the tail of the response.
REST API call:
{{base_url}}accounts/{{account}}/{{siteid}}/report?dimensions=queryName,queryType,responseCode,responseCached,coloName,origin,dayOfWeek,tcp,ipVersion,querySizeBucket,responseSizeBucket&metrics=queryCount,uncachedCount,staleCount,responseTimeAvg&limit=2
Any help appreciated.
{
"result": {
"rows": 100,
"data": [
{
"dimensions": [
"college.edu",
"A",
"REFUSED",
"uncached",
"EWR",
"192.0.0.0",
"1",
"0",
"4",
"48-63",
"48-63"
],
"metrics": [
1,
1,
0,
16
]
},
{
"dimensions": [
"school.edu",
"A",
"REFUSED",
"uncached",
"EWR",
"192.0.0.0",
"1",
"0",
"4",
"32-47",
"32-47"
],
"metrics": [
1,
1,
0,
10
]
}
],
"data_lag": 0,
"min": {},
"max": {},
"totals": {
"queryCount": 12,
"responseTimeAvg": 37.28936572607269,
"staleCount": 0,
"uncachedCount": 2147541
},
"query": {
"dimensions": [
"queryName",
"queryType",
"responseCode",
"responseCached",
"coloName",
"origin",
"dayOfWeek",
"tcp",
"ipVersion",
"querySizeBucket",
"responseSizeBucket"
],
"metrics": [
"queryCount",
"uncachedCount",
"staleCount",
"responseTimeAvg"
],
"since": "2022-10-17T04:37:00Z",
"until": "2022-10-17T10:37:00Z",
"limit": 100
}
},
"success": true,
"errors": [],
"messages": []
}
Assuming you want the JSON object to be a single event, the LINE_BREAKER setting should be }([\r\n]+){.
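A minimal props.conf stanza might look like this (the sourcetype name is a placeholder):

[my_json_api]
LINE_BREAKER = }([\r\n]+){
SHOULD_LINEMERGE = false
KV_MODE = json
TRUNCATE = 0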
Splunk should have no problems parsing the JSON, but I think there will be problems relating metrics to dimensions because there are multiple sets of data and only one set of keys. Creating a script to combine them seems to be the best option.
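If you do go the script route, here is a rough sketch in C# of re-keying each row before forwarding it to Splunk (the file name is a placeholder, and I'm assuming every row has the same number of dimensions and metrics as the key lists in result.query):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.Json;

class RekeyRows
{
    static void Main()
    {
        // Parse the raw API response ("response.json" is a placeholder).
        using var doc = JsonDocument.Parse(File.ReadAllText("response.json"));
        JsonElement result = doc.RootElement.GetProperty("result");

        // The key names live in result.query; the values live in result.data.
        string[] dimNames = result.GetProperty("query").GetProperty("dimensions")
            .EnumerateArray().Select(e => e.GetString()!).ToArray();
        string[] metNames = result.GetProperty("query").GetProperty("metrics")
            .EnumerateArray().Select(e => e.GetString()!).ToArray();

        foreach (JsonElement row in result.GetProperty("data").EnumerateArray())
        {
            var record = new Dictionary<string, object?>();
            JsonElement[] dims = row.GetProperty("dimensions").EnumerateArray().ToArray();
            for (int i = 0; i < dimNames.Length; i++)
                record[dimNames[i]] = dims[i].GetString();
            JsonElement[] mets = row.GetProperty("metrics").EnumerateArray().ToArray();
            for (int i = 0; i < metNames.Length; i++)
                record[metNames[i]] = mets[i].GetDouble();

            // Emit one keyed JSON object per event.
            Console.WriteLine(JsonSerializer.Serialize(record));
        }
    }
}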

useQuery's onCompleted being called with cached value

Hopefully I can articulate this question clearly without too much code as it's difficult to extract the pieces from my codebase.
I was observing odd behavior yesterday with useQuery that I can't seem to understand. I think I understand Apollo's cache pretty well but this particular behavior doesn't make sense to me. I have a query that looks something like this:
query {
reservations {
priceBreakdown {
sections {
id
name
total
}
}
}
}
The schema is something like:
type Query {
reservations: [Reservation]
}
type Reservation {
priceBreakdown: PriceBreakdown
}
type PriceBreakdown {
sections: [Section]
}
type Section {
id: String
name: String
total: Float
}
That id on Section is not a proper ID and, in fact, is not unique. It's just a string and all PriceBreakdowns have a list of Sections that contain the same ID. I've pointed this out to the backend folks and it's being fixed but I realize this causes incorrect caching with Apollo since there will be collisions w.r.t. __typename and id. My confusion comes from how onCompleted is called. I noticed when doing
const { data } = useQuery(myQuery, {
onCompleted: console.log
})
that when the network call returns, all PriceBreakdowns are unique and correct, as they should be. But when onCompleted is called with what I thought would be that same API data, it's different and seems to reflect the cached values. In case that's confusing, here are the two results. First is straight from the API and second is the log from onCompleted:
// api results
"data": [
{
"id": "92267",
"price_breakdown": {
"sections": [
{
"name": "Reservation",
"total": "$60.00",
"id": "RESERVATION"
},
{
"name": "Promotions and Fees",
"total": null,
"id": "PROMOTIONS_AND_FEES"
},
{
"name": "Total",
"total": "$51.00",
"id": "HOST_TOTAL"
}
]
}
},
{
"id": "92266",
"price_breakdown": {
"sections": [
{
"name": "Reservation",
"total": "$30.00",
"id": "RESERVATION"
},
{
"name": "Promotions and Fees",
"total": null,
"id": "PROMOTIONS_AND_FEES"
},
{
"name": "Total",
"total": "$25.50",
"id": "HOST_TOTAL"
}
]
}
}
]
// onCompleted log
"data": [
{
"id": "92267",
"price_breakdown": {
"sections": [
{
"name": "Reservation",
"total": "$60.00",
"id": "RESERVATION"
},
{
"name": "Promotions and Fees",
"total": null,
"id": "PROMOTIONS_AND_FEES"
},
{
"name": "Total",
"total": "$51.00",
"id": "HOST_TOTAL"
}
]
}
},
{
"id": "92266",
"price_breakdown": {
"sections": [
{
"name": "Reservation",
"total": "$60.00",
"id": "RESERVATION"
},
{
"name": "Promotions and Fees",
"total": null,
"id": "PROMOTIONS_AND_FEES"
},
{
"name": "Total",
"total": "$51.00",
"id": "HOST_TOTAL"
}
]
}
}
]
As you can see, in the onCompleted log, the Sections that share an ID with Sections from the previous record are duplicates of the previous record's values, suggesting Apollo is rebuilding the payload from the cache and calling onCompleted with that. Is that what's happening? If I set the fetchPolicy to no-cache, the results are correct, but of course that's just a patch for the problem. I want to better understand Apollo, because I thought I understood it and now I see something unintuitive. I wouldn't have expected onCompleted to be called with something built from the cache. Thanks in advance.
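Edit: for reference, my understanding is that Apollo's default cache ID is just __typename plus id, so every Section across all reservations collapses into three normalized entries, something like this (a sketch of what I believe the cache contains, not an actual dump):

{
  "Section:RESERVATION": { "__typename": "Section", "id": "RESERVATION", "name": "Reservation", "total": "$60.00" },
  "Section:PROMOTIONS_AND_FEES": { "__typename": "Section", "id": "PROMOTIONS_AND_FEES", "name": "Promotions and Fees", "total": null },
  "Section:HOST_TOTAL": { "__typename": "Section", "id": "HOST_TOTAL", "name": "Total", "total": "$51.00" }
}

Both reservations reference these same three entries, which would explain why the second reservation's sections come back with the first one's totals.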

Array within Element within Array in Variant

How can I get the data out of this array stored in a variant column in Snowflake? I don't care if it's a new table, a view, or a query. There is a second column of type varchar(256) that contains a unique ID.
If you can just help me read the "confirmed" data and the "editorIds" data, I can probably take it from there. Many thanks!
Example output would be:
UniqueID ConfirmationID EditorID
u3kd9 xxxx-436a-a2d7 nupd
u3kd9 xxxx-436a-a2d7 9l34c
R3nDo xxxx-436a-a3e4 5rnj
yP48a xxxx-436a-a477 jTpz8
yP48a xxxx-436a-a477 nupd
[
{
"confirmed": {
"Confirmation": "Entry ID=xxxx-436a-a2d7-3525158332f0: Confirmed order submitted.",
"ConfirmationID": "xxxx-436a-a2d7-3525158332f0",
"ConfirmedOrders": 1,
"Received": "8/29/2019 4:31:11 PM Central Time"
},
"editorIds": [
"xxsJYgWDENLoX",
"JR9bWcGwbaymm3a8v",
"JxncJrdpeFJeWsTbT"
] ,
"id": "xxxxx5AvGgeSHy8Ms6Ytyc-1",
"messages": [],
"orderJson": {
"EntryID": "xxxxx5AvGgeSHy8Ms6Ytyc-1",
"Orders": [
{
"DropShipFlag": 1,
"FromAddressValue": 1,
"OrderAttributes": [
{
"AttributeUID": 548
},
{
"AttributeUID": 553
},
{
"AttributeUID": 2418
}
],
"OrderItems": [
{
"EditorId": "aC3f5HsJYgWDENLoX",
"ItemAssets": [
{
"AssetPath": "https://xxxx573043eac521.png",
"DP2NodeID": "10000",
"ImageHash": "000000000000000FFFFFFFFFFFFFFFFF",
"ImageRotation": 0,
"OffsetX": 50,
"OffsetY": 50,
"PrintedFileName": "aC3f5HsJYgWDENLoX-10000",
"X": 50,
"Y": 52.03909266409266,
"ZoomX": 100,
"ZoomY": 93.75
}
],
"ItemAttributes": [
{
"AttributeUID": 2105
},
{
"AttributeUID": 125
}
],
"ItemBookAttribute": null,
"ProductUID": 52,
"Quantity": 1
}
],
"SendNotificationEmailToAccount": true,
"SequenceNumber": 1,
"ShipToAddress": {
"Addr1": "Addr1",
"Addr2": "0",
"City": "City",
"Country": "US",
"Name": "Name",
"State": "ST",
"Zip": "00000"
}
}
]
},
"orderNumber": null,
"status": "order_placed",
"submitted": {
"Account": "350000",
"ConfirmationID": "xxxxx-436a-a2d7-3525158332f0",
"EntryID": "xxxxx-5AvGgeSHy8Ms6Ytyc-1",
"Key": "D83590AFF0CC0000B54B",
"NumberOfOrders": 1,
"Orders": [
{
"LineItems": [],
"Note": "",
"Products": [
{
"Price": "00.30",
"ProductDescription": "xxxxxint 8x10",
"Quantity": 1
},
{
"Price": "00.40",
"ProductDescription": "xxxxxut Black 8x10",
"Quantity": 1
},
{
"Price": "00.50",
"ProductDescription": "xxxxx"
},
{
"Price": "00.50",
"ProductDescription": "xxxscount",
"Quantity": 1
}
],
"SequenceNumber": "1",
"SubTotal": "00.70",
"Tax": "1.01",
"Total": "00.71"
}
],
"Received": "8/29/2019 4:31:10 PM Central Time"
},
"tracking": null,
"updatedOn": 1.598736670503000e+12
}
]
So, this is how I'd query that exact JSON assuming the data is in column var in table x:
SELECT x.var[0]:confirmed:ConfirmationID::varchar as ConfirmationID,
f.value::varchar as EditorID
FROM x,
LATERAL FLATTEN(input => var[0]:editorIds) f
;
Since your sample output doesn't match the JSON that you provided, I will assume that this is what you need.
Also, as a note, your JSON includes outer [ ], which indicates that the entire JSON string is inside an array. This is the reason for var[0] in my query. If you have multiple records inside that array, var[0] will only read the first one, so you would need to flatten the outer array as well. In general, you should exclude the outer brackets and instead load each record into the table separately. I wasn't sure whether you could make that change, so I just wanted to make note of it.
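If you do keep multiple records in the outer array, flattening both the outer array and editorIds would look something like this (u_id stands in for your unique-ID varchar column, whose real name I don't know):

SELECT x.u_id AS UniqueID,
       o.value:confirmed:ConfirmationID::varchar AS ConfirmationID,
       e.value::varchar AS EditorID
FROM x,
LATERAL FLATTEN(input => x.var) o,
LATERAL FLATTEN(input => o.value:editorIds) e
;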

Cannot retrieve attribute values from GeoServer

I am new to GeoServer. I have created a shapefile of my district and added COVID-related attributes such as covid count, covid zone, and district name. I have loaded it into a PostGIS database, and I can see the attributes in the table. But when I try to retrieve the feature using Postman, the attribute values are not returned. Can anyone help?
Below is my request
http://localhost:8080/geoserver/rest/workspaces/DistrictWpc/datastores/district_store/featuretypes/ernakulam.json
Response is
{
"featureType": {
"name": "ernakulam",
"nativeName": "ernakulam",
"namespace": {
"name": "DistrictWpc",
"href": "http://localhost:8080/geoserver/rest/namespaces/DistrictWpc.json"
},
"title": "ernakulam",
"keywords": {
"string": [
"features",
"ernakulam"
]
},
"srs": "EPSG:404000",
"nativeBoundingBox": {
"minx": 76.1618881225586,
"maxx": 76.6080093383789,
"miny": 9.63820648193359,
"maxy": 10.1869020462036
},
"latLonBoundingBox": {
"minx": 76.1618881225586,
"maxx": 76.6080093383789,
"miny": 9.63820648193359,
"maxy": 10.1869020462036,
"crs": "EPSG:4326"
},
"projectionPolicy": "FORCE_DECLARED",
"enabled": true,
"store": {
"#class": "dataStore",
"name": "DistrictWpc:district_store",
"href": "http://localhost:8080/geoserver/rest/workspaces/DistrictWpc/datastores/district_store.json"
},
"serviceConfiguration": false,
"maxFeatures": 0,
"numDecimals": 0,
"padWithZeros": false,
"forcedDecimal": false,
"overridingServiceSRS": false,
"skipNumberMatched": false,
"circularArcPresent": false,
"attributes": {
"attribute": [
{
"name": "id",
"minOccurs": 0,
"maxOccurs": 1,
"nillable": true,
"binding": "java.lang.Long"
},
{
"name": "district",
"minOccurs": 0,
"maxOccurs": 1,
"nillable": true,
"binding": "java.lang.String"
},
{
"name": "count",
"minOccurs": 0,
"maxOccurs": 1,
"nillable": true,
"binding": "java.lang.Long"
},
{
"name": "zone",
"minOccurs": 0,
"maxOccurs": 1,
"nillable": true,
"binding": "java.lang.String"
},
{
"name": "geom",
"minOccurs": 0,
"maxOccurs": 1,
"nillable": true,
"binding": "org.locationtech.jts.geom.MultiPolygon"
}
]
}
}
}
GeoServer's REST API is used for administrative tasks and as such does not provide a way to see the actual data you have stored in the database, just the details of how GeoServer is connecting to the database and some metadata about the store.
To access the actual data you need to use the WFS endpoint, which is defined by the OGC WFS specification and described in the GeoServer manual.
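For example, a WFS GetFeature request for your layer would look something like this (adjust the version and output format as needed):

http://localhost:8080/geoserver/DistrictWpc/ows?service=WFS&version=2.0.0&request=GetFeature&typeNames=DistrictWpc:ernakulam&outputFormat=application/json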
If you must have REST access to the features you could use the experimental OGC Features API module to do this.

Elasticsearch arrays query/filter

I'm looking at Elasticsearch for the first time and have spent around a day on it. We already use Lucene extensively and want to start using ES instead. I'm looking at alternative data structures to what we currently have.
If I run a match_all query, this is what I get at the moment. I am happy with this structure.
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 22,
"max_score": 1,
"hits": [
{
"_index": "integration-test-static",
"_type": "sport",
"_id": "4d38e07b-f3d3-4af2-9221-60450b18264a",
"_score": 1,
"_source": {
"Descriptions": [
{
"FeedSource": "dde58b3b-145b-4864-9f7c-43c64c2fe815",
"Value": "Football"
},
{
"FeedSource": "e4b9ad44-00d7-4216-adf5-3a37eafc4c93",
"Value": "Football"
}
],
"Synonyms": [
"Football"
]
}
}
]
}
}
What I can't figure out is how a query is written to pull back this document by searching for the synonym "Football". Looks like it should be easy!
I got this approach after reading this: http://gibrown.wordpress.com/2013/01/24/elasticsearch-five-things-i-was-doing-wrong/
He mentions storing multiple fields in arrays. I realise my example does not have multiple fields, but we will certainly be looking for a solution which can cater for them.
I've tried various queries with filters, bool queries, term and terms queries, and none of them return the document.
What do your search request and your mappings look like?
If you let Elasticsearch generate the mapping, it'll use the standard analyzer, which lowercases the text (and removes stopwords).
So Football will actually be indexed as football. The term-family of queries/filters does not do text analysis, so a term query for Football will be looking for Football, which is not indexed. The match-family of queries does analyze its input.
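For illustration, a term query with the already-lowercased token should find the document, while term with Football will not:

POST integration-test-static/_search
{
  "query": {
    "term": {
      "Synonyms": "football"
    }
  }
}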
This is a very common problem, and it is covered quite extensively in my article Troubleshooting Elasticsearch searches, for Beginners, which may be worth skimming through. Text analysis is a very important part of working with search, so there are more articles about it as well.
A simple match query would work in this scenario.
POST integration-test-static/_search
{
"query": {
"match": {
"Synonyms": "Football"
}
}
}
Which returns:
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.30685282,
"hits": [
{
"_index": "integration-test-static",
"_type": "sport",
"_id": "4d38e07b-f3d3-4af2-9221-60450b18264a",
"_score": 0.30685282,
"_source": {
"Descriptions": [
{
"FeedSource": "dde58b3b-145b-4864-9f7c-43c64c2fe815",
"Value": "Football"
},
{
"FeedSource": "e4b9ad44-00d7-4216-adf5-3a37eafc4c93",
"Value": "Football"
}
],
"Synonyms": [
"Football"
]
}
}
]
}
}
