Azure search not behaving as expected for dashes - azure-cognitive-search

I'm having an issue when using azure search for the following example data set: abc-123-456, abc-123-457, abc-123-458, etc
When making the search for abc-123-456, I'd expected to only return one results but instead getting all results containing abc-123-...
Is there some setting or way to change this behavior?
Current search settings:
TheSearchIndex.TokenFilters.Add(new EdgeNGramTokenFilter("frontEdgeNGram")
{
Side = EdgeNGramTokenFilterSide.Front,
MinGram = 3,
MaxGram = 20
});
TheSearchIndex.Analyzers.Add(new CustomAnalyzer("FrontEdgeNGram", LexicalTokenizerName.Whitespace)
{
TokenFilters =
{
TokenFilterName.Lowercase,
new TokenFilterName("frontEdgeNGram"),
TokenFilterName.Classic,
TokenFilterName.AsciiFolding
}
});
SearchOptions UsersSearchOptions = new SearchOptions
{
QueryType = SearchQueryType.Simple,
SearchMode = SearchMode.All,
};
Using azure.search.documents ver 11.1.1
Edit: Search with abc-123-456* with the asterisk gives me the one result as expected. How to get this behavior working as default?
Just to add to this..
The portal version is 2020-06-30
The sdk version we use is azure.search.documents ver 11.1.1
abc-123-456 does NOT work as expected
"abc-123-456" does NOT work as expected
"abc-123-456"* does NOT work
"abc-123-456*" does NOT work
If we append an asterisks to the end of the search text and it is not within a phrase .. it works as expected.
IE:
abc-123-456* works as expected.
(abc-123-456* | abc-123-457* ) works as expected.
Why is the asterisks required? How can we make this work within a phrase?

This is expected behavior when using the EdgeNGramTokenFilter inside the custom analyzer configuration. The text “abc-123-456” is broken into smaller tokens like “abc”, “abc-1”, “abc-12”, “abc-123”….”abc-123-456”. Check out the Analyzer API for the full list of tokens generated by a particular analyzer.
For a query - abc-123, if the default analyzer is being used, the query terms will be abc and 123 and will match all the documents that contain these terms.
The prefix query on the other hand is not analyzed and looks for documents that contain the prefix as is “abc-123”. A prefix search bypasses full-text search and looks for verbatim matches, which is why the correct result is coming back. Full-text search is over tokens in inverted indexes. Everything else (filters, fuzzy, regex, prefix/wildcard, etc.) is over verbatim strings in a separate unprocessed/internal index.
Another way can be to set only the search analyzer on the field to keyword to avoid breaking the input query.

Related

Regex Negative lookbehind in React app breaking app on iOS devices

After a recent build my app has stopped displaying on iOS devices (just shows a blank screen).
After a log of digging, I've been able to narrow down the cause and it's this regex expression:
(?<!#)
Here's a context how I used it:
/\b(?<!#)gmail\b|\b(?<!#)google\b/i
which means I want to capture the words "gmail" and "google", but only if they are not preceded by an "#" symbol.
My question is, what is the correct regex expression that will do the same job, and work on all browsers/devices?
Thank you
You could capture the words "gmail" and "google", but only if they are not preceded by an "#" symbol by matching them using a non capture group #(?:gmail|google)
Use an alternation | and a capture group (gmail|google) for google or gmail.
#(?:gmail|google)\b|\b(gmail|google)\b
See a regex demo
For example, if you are doing a replacement you could check for the existence of group 1.
const regex = /#(?:gmail|google)\b|\b(gmail|google)\b/g;
const str = `gmail
google
#gmail
#google
test#google.com
#agmail`;
let res = Array.from(str.matchAll(regex), m => m[1] ? `[REPLACED]${m[1]}[REPLACED]` : m[0]);
console.log(res)
You could just directly match a non # character:
/[^#](google|gmail)\b/i
If you need to also match the domain (which can only be google or gmail), then you may access the first capture group.
It seems that a combination of your two answers worked for me:
/[^#](?:gmail|google)\b|\b(gmail|google)\b/i
Thanks!

QRadar, parsing Log

I want to parse some application log, I did a lot of regex that works correctly with notepad++ and the website www.regex101.com .
But when I apply them in QRadar they don't match nothing.
For example
12/2/2017 9:53:58,4040007,blablablbla,blablabla --- Abonnement Mobile N° : 0663016666 | balbalbal | 03/06/2006 11:11:22 --- Soldes,10.10.10.10
I did this regex (?<=---)\s+[A-Za-z+ \/\w+0-9._%+-]+(?=(\sN°|\s\sN°|\sID)) to match Abonnement mobile it works correctly , but it doesn't match anything in QRadar.
QRadar does not accept all regex configurations. When you try parsing something you can use extract property field to check. Here is a regex that works fine in my system.
\-\-\-\s(\w+\s\w+)\s
this regex will work if only "Abonnement Mobile" field is includes letters or digits. If you want to catch "Abonnement Mobile N°" you can use this regex and this will work whatever comes in this field.
\-\-\-\s([^\:]+)\:

How to query atom field with unicode value in Google App Engine production search?

I wrote some text search with use Google App Engine search.
In SDK I tested such query on atom field:
u'tag:"wartości"'
In production I run the same query but it not works on same data.
How can I do unicode query on atom field?
Is it possible to use unicode in Google App Engine search?
We are aware of this issue and plan to fix ASAP. The fix that we're currently planning will require that the atom field value include exactly the same accent characters in order to match. Matches will continue to be case-insensitive. We expect that at least initially, values that use combining diacritical marks will be treated as different values than those using precomposed characters. We may revisit that decision depending on feedback, but it's the most straightforward fix on our end.
For more on the precomposed characters vs. combining diacritical marks, see this Wikipedia article:
http://en.wikipedia.org/wiki/Precomposed_character
Chris
It looks that I need translate AtomField values into new string and I need to translate queries too. This workaround will allow only Polish unicode search. I do not know tonkenization rules so I use 'q', 'x' to expand alphabet since not used in Polish.
# coding=utf-8
translate = {
u'ą': u'aq',
u'Ą': u'Aq',
u'ć': u'cq',
u'Ć': u'Cq',
u'ę': u'eq',
u'Ę': u'Eq',
u'ł': u'lq',
u'Ł': u'Lq',
u'ń': u'nq',
u'Ń': u'Nq',
u'ó': u'oq',
u'Ó': u'Oq',
u'ś': u'sq',
u'Ś': u'Sq',
u'ż': u'zx',
u'Ż': u'Zx',
u'ź': u'zq',
u'Ź': u'Zq',
}
import re
reTranslate = re.compile(u'(%s)' % u'|'.join(translate))
print reTranslate.pattern
test = u"""\
Właściwie prowadzona komunikacja wewnętrzna w firmie,\
zwłaszcza dużej czy posiadającej rozproszoną sieć oddziałów,\
może przynieść oszczędność czasu, a co za tym idzie, również pieniędzy."""
print reTranslate.sub(lambda match: translate[match.group(0)], test)

App Engine Search API - Sort Results

I have several entities that I am searching across that include dates, and the Search API works great across all of them except for one thing - sorting.
Here's the data model for one of my entities (simplified of course):
class DepositReceipt(ndb.Expando):
#Sets creation date
creation_date = ndb.DateTimeProperty(auto_now_add=True)
And the code to create the search.Document where de is an instance of the entity:
document = search.Document(doc_id=de.key.urlsafe(),
fields=[search.TextField(name='deposit_key', value=de.key.urlsafe()),
search.DateField(name='created', value=de.creation_date),
search.TextField(name='settings', value=de.settings.urlsafe()),
])
This returns a valid document.
And finally the problem line. I took this snippet from the official GAE Search API tutorial and just changed the direction of the sort to DESCENDING and changed the search expression to created (the date property from the Document above).
expr_list = [search.SortExpression(
expression="created", default_value='',
direction=search.SortExpression.DESCENDING)]
I don't think this is important, but the rest of the search code looks like this:
sort_opts = search.SortOptions(expressions=expr_list)
query_options = search.QueryOptions(
cursor=query_cursor,
limit=_NUM_RESULTS,
sort_options=sort_opts)
query_obj = search.Query(query_string=query, options=query_options)
search_results = search.Index(name=index_name).search(query=query_obj)
In production, I get this error message:
InvalidRequest: Failed to parse search request "settings:ag5zfmdoaWRvbmF0aW9uc3IQCxIIU2V0dGluZ3MYmewDDA"; failed to parse date
Changing the expression="created" to anything else works perfectly fine. This also happens across my other entity types that use dates, so I have no idea what's going on. Advice?
I think default_value needs to be a valid date, rather than '' as you have it.

Google Datastore problem with query on *User* type

On this question I solved the problem of querying Google Datastore to retrieve stuff by user (com.google.appengine.api.users.User) like this:
User user = userService.getCurrentUser();
String select_query = "select from " + Greeting.class.getName();
Query query = pm.newQuery(select_query);
query.setFilter("author == paramAuthor");
query.declareParameters("java.lang.String paramAuthor");
greetings = (List<Greeting>) query.execute(user);
The above works fine - but after a bit of messing around I realized this syntax in not very practical as the need to build more complicated queries arises - so I decided to manually build my filters and now I got for example something like the following (where the filter is usually passed in as a string variable but now is built inline for simplicity):
User user = userService.getCurrentUser();
String select_query = "select from " + Greeting.class.getName();
Query query = pm.newQuery(select_query);
query.setFilter("author == '"+ user.getEmail() +"'");
greetings = (List<Greeting>) query.execute();
Obviously this won't work even if this syntax with field = 'value' is supported by JDOQL and it works fine on other fields (String types and Enums). The other strange thing is that looking at the Data viewer in the app-engine dashboard the 'author' field is stored as type User but the value is 'user#gmail.com', and then again when I set it up as parameter (the case above that works fine) I am declaring the parameter as a String then passing down an instance of User (user) which gets serialized with a simple toString() (I guess).
Anyone any idea?
Using string substitution in query languages is always a bad idea. It's far too easy for a user to break out and mess with your environment, and it introduces a whole collection of encoding issues, etc.
What was wrong with your earlier parameter substitution approach? As far as I'm aware, it supports everything, and it sidesteps any parsing issues. As far as the problem with knowing how many arguments to pass goes, you can use Query.executeWithMap or Query.executeWithArray to execute a query with an unknown number of arguments.

Resources