Azure Search Define Custom Analyzer - azure-cognitive-search

I'm defining an index schema. One of the fields is "InvoiceNumber", which can contain values such as "459", "00459", or "P00459".
At indexing time, I want the text "00459" to be tokenized into 2 tokens: "459" and the original "00459".
Likewise, the text "P00459" should be tokenized into 3 tokens: "459", "00459", and the original "P00459".
Is there a way to define a custom analyzer for this?

Configuring a pattern_capture token filter with an appropriate regex can produce multiple tokens from the same text while preserving the original text.
https://learn.microsoft.com/en-us/azure/search/index-add-custom-analyzers
https://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/pattern/PatternCaptureGroupTokenFilter.html
This is the example from the latter link:
"(https?://([a-zA-Z-_0-9.]+))" when matched against the string "http://www.foo.com/index" would return the tokens "http://www.foo.com" and "www.foo.com".

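To check candidate regexes for the InvoiceNumber case before putting them in an index definition, the capture behaviour can be approximated in plain JavaScript. This is only a sketch, not the Lucene implementation: the two patterns and the captureTokens helper are assumptions made for illustration, and it assumes the whole field value reaches the filter as a single token. In Azure Search, the corresponding filter options are, as far as I know, patterns and preserveOriginal.

```javascript
// Hypothetical patterns: the first captures the digits without leading zeros,
// the second captures the full trailing run of digits.
const INVOICE_PATTERNS = [/0*([1-9][0-9]*)$/, /([0-9]+)$/];

// Rough approximation of a pattern-capture filter: every capture group
// becomes a token, and the original text is kept when preserveOriginal is true.
function captureTokens(text, patterns = INVOICE_PATTERNS, preserveOriginal = true) {
  const tokens = [];
  for (const pattern of patterns) {
    const match = pattern.exec(text);
    if (match === null) continue;
    for (const group of match.slice(1)) {
      // Skip empty captures and captures that were already emitted.
      if (group && !tokens.includes(group)) tokens.push(group);
    }
  }
  if (preserveOriginal && !tokens.includes(text)) tokens.push(text);
  return tokens;
}

console.log(captureTokens("P00459")); // [ '459', '00459', 'P00459' ]
console.log(captureTokens("00459"));  // [ '459', '00459' ]
console.log(captureTokens("459"));    // [ '459' ]
```

The order of the tokens only reflects the order of the patterns in this sketch; the positions assigned by the real filter may differ.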
Related

Search for list of words in body using simple

Is it possible to specify a variable list of words to check if they exist in body using simple? Something similar to:
simple("${in.header.type} in 'gold,silver'")
But using "contains" to search in ${in.body} (Camel 2.32.2).
Contains works with only one string at a time. To search for a list of strings in the body you can implement a processor to do so.

Regex in C - pattern matches too early and drops trailing character(s)

I'm working on an interpolation module in C. I have access to both POSIX and PCRE2 regex parsers. Essentially, I have a block of text which contains tokens. These tokens are substituted with values from a lookup table. Each token has the ability to specify alternate text if said token is not found in the lookup table.
Nominally, a token is formatted as ?{token}. To handle situations where a token may not be defined and you want alternate text, you suffix the token with a ':' (colon) and the alternate, like so: ?{token:alternate}. The alternate can be plain text or another token.
The module has the ability to process nested tokens; this includes a token as alternate text. Such a definition looks like ?{token:?{alt_token}}. And herein lies the problem. When I parse this, I get ?{token:?{alt_token} as the match. Notice that the closing '}' is missing. When I un-greedy the pattern where it captures the alternate text, the regex over-grabs.
A pathological example would be ?{some:?{deep:?{replacement:?{inside:all this}}}} Ideally, this would grab and resolve (on each pass) the following:
?{inside:all this}
?{replacement:?{inside:all this}}
?{deep:?{replacement:?{inside:all this}}}
?{some:?{deep:?{replacement:?{inside:all this}}}}
... but what I get is each one missing its closing '}'
I'm using several capture groups in my regex to group/grab the token and its optional alternate text. Here's my regex:
`"\\Q?{\\E([\\Q:=\\E]{0,1}[A-Za-z_-][0-9A-Za-z_\\.-]*?)(\\:(.*?)){0,1}\\Q}\\E"`
match[0] is the whole RE; match[1] is the token; match[3] is the alternate text
Any suggestions how I might modify my regex to retain the closing '}' when the token to parse ends with more than one '}' ... ?
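One possible direction, sketched in JavaScript rather than C, is to match only innermost tokens on each pass by forbidding braces inside the alternate text, and to repeat until nothing is left to resolve. The pattern, the interpolate function, and the lookup table below are assumptions for illustration; the sketch also drops the optional leading ':' or '=' that the original pattern allows, and whether the same character classes behave identically in POSIX or PCRE2 would need checking.

```javascript
// Innermost-first pattern: the alternate text may not contain '{' or '}',
// so a match can never end on a nested token's closing brace.
const INNERMOST = /\?\{([A-Za-z_-][0-9A-Za-z_.-]*)(?::([^{}]*))?\}/;

// Resolve one innermost token per pass until no token remains.
function interpolate(text, table, maxPasses = 100) {
  let current = text;
  for (let pass = 0; pass < maxPasses; pass++) {
    const match = INNERMOST.exec(current);
    if (match === null) break;
    const [whole, name, alternate = ""] = match;
    const found = Object.prototype.hasOwnProperty.call(table, name);
    const value = found ? table[name] : alternate;
    // Splice the value in by position to avoid special handling of '$' in replacements.
    current = current.slice(0, match.index) + value + current.slice(match.index + whole.length);
  }
  return current;
}

const nested = "?{some:?{deep:?{replacement:?{inside:all this}}}}";
console.log(interpolate(nested, {}));               // all this
console.log(interpolate(nested, { deep: "found" })); // found
```

Each pass resolves the deepest remaining token first, which is the order listed in the question, just working from the inside out.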

How to access fields with special characters from _source?

I am creating a search application in React using searchkit. Some of the fields in my _source have special characters at the beginning, like _index or #timestamp. I am not able to access those fields using source._index, the way I can access fields without special characters, like source.Body.
My sourceFilter looks like this:
sourceFilter={["#timestamp", "From", "MessageStatus", "SmsStatus", "To", "Message", "Timestamp", "Body", "_index"]}
I am trying to access the fields to set the HTML using dangerouslySetInnerHTML.
How can I either change what the field is called or access a field with a special character?
In the end, I created a new field in the elasticsearch index with a name that doesn't have special characters.
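For reference, plain JavaScript can also read such keys with bracket notation, which accepts any string as a property name. The sketch below uses a made-up hit object; it assumes the usual Elasticsearch layout in which _index sits on the hit itself rather than inside _source, which may not match how your component receives its props.

```javascript
// Sample hit shaped like an Elasticsearch result; all values are invented.
const hit = {
  _index: "sms-messages",
  _source: { "#timestamp": "2019-01-01T00:00:00Z", Body: "hello" },
};

const source = hit._source;

// source.#timestamp is a syntax error, but bracket notation works for any key.
const timestamp = source["#timestamp"];
const body = source.Body;

// _index is a valid identifier; here it is read from the hit, not from _source.
const indexName = hit._index;

console.log(timestamp, body, indexName);
```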

URL Type Hierarchy in Adobe DTM

I'm trying to create a page type hierarchy from the page URL that I can use both as a page hierarchy and in props and eVars. In a nutshell, my URL would look something like this:
http://www.domain.com/BrandHomePage/SuperCategory/ProductCategory/Product
The mindset is to take the URL and use a data element to split the URL, and then capture the values into separate data elements that could also be used in a page hierarchy.
var url = "http://www.domain.com/part1/part2/part3/part4";
// Drop "http:" and the empty string after "//"; parts[0] is then the host name.
var parts = url.split('/').splice(2);
console.log(parts);
var baseUrl = parts[0];
var part1 = parts[1];
var part2 = parts[2];
var part3 = parts[3];
var part4 = parts[4];
My question is, would it even be possible to capture each individual portion of the URL into separate data elements? Or is my approach overkill?
Create a Data Element
The following will create a Data Element that returns an array containing up to 4 elements, depending on how many dir levels there are in the URL.
Go to Rules > Data Elements > Create New Data Element
Name it "hier1" (no quotes).
Choose Type Custom Script and click Open Editor.
Add the following code to the code box:
return location.pathname.split('/').filter(Boolean).slice(0,4);
When you are done, Save Changes.
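As a rough illustration of what that one-line script returns, here is the same expression wrapped in a standalone function so it can be run outside the browser. The function name pathLevels, its maxLevels parameter, and the sample paths are assumptions made for this sketch; in DTM the script reads location.pathname directly.

```javascript
// Hypothetical helper mirroring the Data Element script.
function pathLevels(pathname, maxLevels = 4) {
  return pathname
    .split("/")          // "/a/b/" becomes ["", "a", "b", ""]
    .filter(Boolean)     // drop empty strings from leading/trailing slashes
    .slice(0, maxLevels); // keep at most maxLevels directory levels
}

console.log(pathLevels("/BrandHomePage/SuperCategory/ProductCategory/Product"));
// [ 'BrandHomePage', 'SuperCategory', 'ProductCategory', 'Product' ]
console.log(pathLevels("/BrandHomePage/"));
// [ 'BrandHomePage' ]
```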
Populate the Hierarchy Variable
Here is an example of populating hier1 on page view.
Go to Overview > Adobe Analytics Tool Config > Pageviews & Content
Under Hierarchy, select Hierarchy1 from the dropdown (this is shown by default).
To the right of the dropdown, in the first field, add %hier1%
Leave the other 3 fields blank.
Leave Delimiter as the default comma (,); it doesn't matter what you put here.
Note: DTM stringifies the returned array (String(Array) or Array.toString()) from the Data Element, which is effectively the same as doing Array.join(','). This is why the above shows to only put the Data Element reference in the first field, and the Delimiter is ignored.
If your implementation uses a delimiter other than a comma, see additional notes below.
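The stringification described in the note is standard JavaScript behaviour and can be checked in any console. The sample array below is invented for the sketch.

```javascript
// Sample directory levels; the values are made up.
const levels = ["BrandHomePage", "SuperCategory", "ProductCategory", "Product"];

const viaString = String(levels);
const viaToString = levels.toString();
const viaJoin = levels.join(",");

console.log(viaString); // BrandHomePage,SuperCategory,ProductCategory,Product
console.log(viaString === viaToString && viaToString === viaJoin); // true
```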
Additional Notes
Populating other Variables
You can also reference %hier1% to populate other variable fields in the Global Variables section. Note that the data element will be stringified with default comma delimiter.
Alternatively, you may consider using Dynamic Variable syntax (e.g. D=h1) as the value, to shorten the request URL. If you are using the latest AppMeasurement and Marketing Cloud Service libraries, this isn't a big deal (the libs will automatically use a POST request instead of GET request if the request URL is too long).
Using the Data Element in Custom Code Boxes
You can use _satellite.getVar('hier1') to return the data element. Note that this returns an array, e.g. ['foo','bar'], so you need to use .join() to concatenate to a single delimited string value.
Using a different Delimiter
If your implementation uses a delimiter other than a comma (,) and you use the same alternate delimiter for all your variables, you can update the Data Element as such:
return location.pathname.split('/').filter(Boolean).slice(0,4).join('[d]');
Where [d] is replaced by your delimiter. Note that this will now cause the Data Element to return a single concatenated String value instead of an Array. Using %hier1% syntax in DTM fields remains the same, but you will no longer need to use .join() in Custom Code boxes.
If your implementation uses different delimiters for different variables, implement the Data Element per the original instructions in the first section. You may use %hier1% syntax in DTM fields only if the delimiter is a comma. For all other delimiters, you will need to populate the variable in a custom code box and use a .join('[d]').
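A custom code box that applies a different delimiter per variable might look roughly like the sketch below. The satelliteStub object only stands in for DTM's _satellite so the code runs outside the browser, and joinLevels is a hypothetical helper; in a real custom code box you would pass _satellite itself and assign the result to whichever variable you are populating.

```javascript
// Stand-in for DTM's _satellite object; not needed inside DTM itself.
const satelliteStub = {
  getVar: (name) =>
    name === "hier1" ? ["BrandHomePage", "SuperCategory", "Product"] : undefined,
};

// Hypothetical helper: join the Data Element array with a chosen delimiter.
// Returns an empty string if the Data Element did not return an array.
function joinLevels(satellite, elementName, delimiter) {
  const value = satellite.getVar(elementName);
  return Array.isArray(value) ? value.join(delimiter) : "";
}

console.log(joinLevels(satelliteStub, "hier1", "|")); // BrandHomePage|SuperCategory|Product
console.log(joinLevels(satelliteStub, "hier1", ":")); // BrandHomePage:SuperCategory:Product
```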
Capturing more than Four Directory Levels
Since you are no longer trying to put a value in four hierarchy fields, you may consider pushing more levels to hier1 or other variables.
In the Data Element, change the 4 in .slice(0,4); to whatever max level of dirs you want to capture. Or, if you want to capture all dir levels, remove .slice(0,4) completely.

Issues with searching special characters in Solr

I'm using Solr 6.1.0
When I use defType=edismax with debug mode enabled (debug=True), I found that the search for "r&d" is actually run on just the character "r".
http://localhost:8983/solr/collection1/highlight?q="r&d"&debugQuery=true&defType=edismax
"debug":{
"rawquerystring":"\"r",
"querystring":"\"r",
"parsedquery":"(+DisjunctionMaxQuery((text:r)))/no_coord",
"parsedquery_toString":"+(text:r)"
Even if I search with escape character, it is of no help.
http://localhost:8983/solr/collection1/highlight?q="r\&d"&debugQuery=true&defType=edismax
"debug":{
"rawquerystring":"\"r\\",
"querystring":"\"r\\",
"parsedquery":"(+DisjunctionMaxQuery((text:r)))/no_coord",
"parsedquery_toString":"+(text:r)",
But if I'm using other symbols like "r*d", then the search is ok.
http://localhost:8983/solr/collection1/highlight?q="r*d"&debugQuery=true&defType=edismax
"debug":{
"rawquerystring":"\"r*d\"",
"querystring":"\"r*d\"",
"parsedquery":"(+DisjunctionMaxQuery((text:\"r d\")))/no_coord",
"parsedquery_toString":"+(text:\"r d\")",
What could be the reason behind this?
Regards,
Edwin
First, if you're using the URL as you've pasted it: & is the separator between arguments in a URL, so it has to be properly URL-encoded when it is part of an argument's value rather than an argument separator.
q=text:"foo&bar"&fl=..
is parsed as
q=text:"foo
bar"
fl=..
Your Solr library usually handles this for you transparently. text%3A%22r%26d%22 is the urlencoded version of text:"r&d".
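The effect is easy to reproduce outside Solr. The sketch below uses the standard URLSearchParams and encodeURIComponent functions (available in browsers and in Node.js) on the query string from the question; the variable names are just for illustration.

```javascript
// The query string exactly as pasted in the question.
const rawQuery = 'q="r&d"&debugQuery=true&defType=edismax';
const parsed = new URLSearchParams(rawQuery);

// The unencoded & ends the q argument early.
console.log(parsed.get("q")); // "r

// Encoding the value keeps the & inside the q argument.
const encodedQuery =
  "q=" + encodeURIComponent('"r&d"') + "&debugQuery=true&defType=edismax";
const reparsed = new URLSearchParams(encodedQuery);

console.log(encodedQuery);      // q=%22r%26d%22&debugQuery=true&defType=edismax
console.log(reparsed.get("q")); // "r&d"
console.log(encodeURIComponent('text:"r&d"')); // text%3A%22r%26d%22
```

The first result matches the rawquerystring shown in the debug output, which suggests the truncation happens before Solr's query parser ever sees the text.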
Secondly, any further parsing will depend on the analysis chain and tokenizer for the field you're searching. This determines which characters are kept and how the text is tokenized (split into separate tokens) before the tokens are matched between the querying text and the indexed text.
Which analyzer are you using for your field? Try an analyzer that doesn't tokenize the field much, such as one based on KeywordTokenizerFactory.
