In AIML, can I give priority to the pattern matching? - artificial-intelligence

In AIML, if I have multiple files with categories that match the same input, how can I give precedence to the match in one file?

You should use AIML's wildcards to control the priority of pattern matching.
AIML 1.0 only has * and _ to match 1 or more words. AIML 2.0 adds ^ and # to match 0 or more words.
Below is the priority rank of AIML 2.0 wildcards, from the highest matching priority to the lowest.
"$" : indicates that the word now has higher matching priority than "_"
"#" : 0 or more words
"_" : 1 or more words
word : exact word match
"^" : 0 or more words
"*" : 1 or more words
Please see the AIML 2.0 working draft for details, specifically chapter 5.A, "Zero or more words wildcards", for the wildcard and priority description.
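As a rough illustration of that ranking (these categories are not from the original answer; they assume an AIML 2.0 interpreter such as Program AB), for the input "WHO ARE YOU" the first category below wins, then the second, then the third:
<category>
  <pattern>$WHO ARE YOU</pattern>
  <template>The $ prefix makes this exact-word pattern outrank the wildcard patterns below.</template>
</category>
<category>
  <pattern># WHO ARE YOU</pattern>
  <template>The # wildcard matches zero or more leading words.</template>
</category>
<category>
  <pattern>_ ARE YOU</pattern>
  <template>The _ wildcard matches one or more leading words.</template>
</category>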

The AIML 1.0 wildcards * and _ are defined so that they match one or more words. AIML 2.0 introduces two new wildcards, ^ and #, defined to match zero or more words. As a shorthand description, we refer to these as “zero+ wildcards”.
Both ^ and # are defined to match 0 or more words. The difference between them is the same as the difference between * and _. The # matching operator has the highest priority in matching, followed by _, followed by an exact word match, followed by ^, and finally * has the lowest matching priority.
When defining a zero+ wildcard it is necessary to consider what the value of <star/> (as well as <thatstar/> and <topicstar/>) should be when the wildcard match has zero length. In AIML 2.0 we leave this up to the botmaster. Each bot can have a global property named nullstar which the botmaster can set to “”, “unknown”, or any other value.
From “What’s new in AIML 2.0?” (AIML 2.0 working draft)
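For instance (an illustrative category, not taken from the draft), a zero+ wildcard can leave <star/> empty:
<category>
  <pattern>HELLO ^</pattern>
  <template>Hi! You said: <star/></template>
</category>
If the input is just "HELLO", the ^ matches zero words and <star/> evaluates to the bot's nullstar property (for example "unknown"), as described above.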

The Alice site has the following notes on how priority is determined:
At every node, the "_" has first priority, an atomic word match second priority, and a "*" match lowest priority.
The patterns need not be ordered alphabetically, only partially ordered so that "_" comes before any word and "*" after any word.
The matching is word-by-word, not category-by-category.
The algorithm combines the input pattern, the <that> pattern, and the <topic> pattern into a single "path" or sentence such as: "PATTERN <that> THAT <topic> TOPIC" and treats the tokens <that> and <topic> like ordinary words. The PATTERN, THAT and TOPIC patterns may contain multiple wildcards.
The matching algorithm is a highly restricted version of depth-first search, also known as backtracking.
You can simplify the algorithm by removing the "_" wildcard, and considering just the second two steps. Also try understanding the simple case of PATTERNs without <that> and <topic>.
From Alicebot.org
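As a rough illustration of the "path" described above (the input, that, and topic values are made up):
Input: HELLO THERE
That:  WHAT IS YOUR NAME
Topic: GREETINGS
Path:  HELLO THERE <that> WHAT IS YOUR NAME <topic> GREETINGS
Matching then walks this path word by word against the stored patterns, applying the wildcard priorities at every node.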
Based on this you could use the '_' to give something precedence. Take the following example:
<category>
  <pattern>_ BAR</pattern>
  <template>Which bar?</template>
</category>
<category>
  <pattern>FOO BAR</pattern>
  <template>Don't you mean FUBAR? That's an old military acronym, that roughly translates to "broken". I can't directly translate it because I don't curse.</template>
</category>
<category>
  <pattern>* BAR</pattern>
  <template>There are a lot of bars. There's a crow bar, the state bar, a bar for drinking, and foo bar.</template>
</category>
The _ takes the highest priority and is matched first, the exact-word pattern FOO BAR is second in priority, and the * is last.

Related

Preserving word order in Vespa in non-English

I am creating a schema for Vespa mainly for English, but with two fields in Wylie transliteration of Tibetan, which looks like this
'jam dpal smra ba'i seng ge la bstod pa ut+pal dmar po'i do shal
Typically users want to match every token and preserve the word order, and preferably in the beginning of the field.
For example, to find the field above, user might enter "'jam dpal smra ba'i seng ge". They would not appreciate results where these tokens would appear in different order, even if that would rank high with BM25. BM25 would still be needed for fallback.
Could you give me an example of the schema field / ranking expression to rank in this order:
exact match in the beginning of field
exact match anywhere
bm25
Naturally, I'll turn off stemming. Also, apostrophes and, less importantly, plus signs should be preserved.
I have read especially the Schema Reference of Vespa docs, but I did not find a solution.
I got the best results with
field wylie type string {
    indexing: index | summary
    index: enable-bm25
    stemming: none
}
rank-profile native_rank_and_wylie {
    first-phase {
        expression: nativeRank(title, body) + fieldMatch(wylie).earliness + fieldMatch(wylie).longestSequence * 0.4
    }
}
Note that longestSequence is not normalized and can affect scores a lot.
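If you want something closer to the strict ordering asked for (exact match at the start of the field, then exact match anywhere, then BM25), a hedged sketch could weight the fieldMatch features accordingly; the profile name and weights below are illustrative assumptions, not tested values:
rank-profile wylie_order_sketch {
    first-phase {
        # Illustrative, untuned weights:
        # queryCompleteness * orderness rewards matching all query terms in order,
        # earliness additionally rewards matches near the start of the field,
        # bm25 stays as a low-weight fallback (requires index: enable-bm25, set above).
        expression: 100 * fieldMatch(wylie).queryCompleteness * fieldMatch(wylie).orderness + 50 * fieldMatch(wylie).earliness + bm25(wylie)
    }
}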

How do you remove a word completely from an Apache Solr index?

I'm running Apache Solr 6.6.5. When a user searches for "ETCS" (a special technical term), all documents that contain the word "etc" are matched. But I only want to match documents that really contain "ETCS". Solr should never even index "etc" since it is such a common word. The stemmer should never reduce "etcs" to "etc" (plural stemming).
I added "etc" to stopwords.txt:
# Contains words which shouldn't be indexed for fulltext fields, e.g., because
# they're too common. For documentation of the format, see
# http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StopFilterFactory
# (Lines starting with a pound character # are ignored.)
etc
I added "etc" to protwords.txt:
#-----------------------------------------------------------------------
# This file blocks words from being operated on by the stemmer and word delimiter.
&
<
>
'
"
etc
That helps to not match documents that contain "etc", but documents containing "etc.", "etc," or similar are still matched.
So I could add even more variants to protwords.txt:
&
<
>
'
"
etc
etc.
etc..
etc...
etc,
But that will always be incomplete. How can I tell the stemmer to consider "etc" as a tokenized word with arbitrary non-word characters around it?
My schema.xml: https://gist.github.com/klausi/f59ee47a9b14b915f5bb44bd6cf1c945
1.)
I added "etc" to protwords.txt:
You should add "etcs" to protwords.txt to protect the term "etcs" from being stemmed.
2.)
So I could add even more variants to protwords.txt:
Add all variations of the words you want to remove from the index to stopwords.txt, not to protwords.txt.
3.) Check which field type you are using. Maybe you can tune the analysis chain there a bit.
// Edit: adding a link to your schema.xml will not help as long as you do not explain which field you are using.
4.) Don't forget to restart Solr and (if needed) reindex.
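For point 3, a hedged sketch of what such a field type could look like (the class names are standard Solr factories, but the field type name and the exact filter chain here are assumptions, not taken from the linked schema.xml):
<fieldType name="text_etcs_sketch" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- StandardTokenizer strips surrounding punctuation, so "etc.", "etc," and "etc..." all become the token "etc" -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- drop the token "etc" from the index entirely -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <!-- protect "etcs" (listed in protwords.txt) from being stemmed down to "etc" -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
With a punctuation-aware tokenizer like this, a single "etc" entry in stopwords.txt covers the punctuated variants, so the growing protwords.txt list should no longer be needed.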

search for any number in SOLR

How can I set up a SOLR index in a way that allows me to search for any number?
I believe that the following works, more or less:
0* OR 1* OR 2* OR 3* OR 4* OR 5* OR 6* OR 7* OR 8* OR 9*
But it really does not seem to be ideal, and cannot be used as part of double-quoted expressions, etc.
If you're looking for all documents that contain a token that just is a number, a regular expression search should work:
q=field:/[0-9]+/
If you have tokens in your text that contain a number embedded in other characters (though those would not have matched your example either), you can add a wildcard before and after the digits:
q=field:/.*[0-9]+.*/
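When sending such a regular-expression query over HTTP, remember that characters like + need URL encoding; a sketch (the core name "mycore" and the field name are placeholders):
curl "http://localhost:8983/solr/mycore/select" --data-urlencode "q=field:/[0-9]+/"
Here --data-urlencode takes care of the escaping, and Solr's /select handler accepts the parameters as a POST body as well.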

Solr wildcard query on multiple words in text field

I'm searching for "foo" followed by "bar" in a text field named "doc".
My query needs to match the text "foo walks into a bar" but not "bar has place for foo"
I've seen a few similar questions, but no concrete answer.
Queries that don't work:
q=doc:foo*bar
q=doc:/.*foo.bar./
It seems that this is because each word in the text field is tokenized separately. Is there a way to get around this? (Note: I can't change the field type)
Have a look at the Surround Query Parser and at the Complex Phrase Query Parser
The SurroundQParser enables the Surround query syntax, which provides
proximity search functionality.
There are two positional operators: w creates an ordered span query
and n creates an unordered one. Both operators take a numeric value
to indicate distance between two terms. The default is 1, and the
maximum is 99.
Note that the query string is not analyzed in any way.
Example:
{!surround} 3w(foo, bar)
This example would find documents where the terms "foo" and "bar" were
no more than 3 terms away from each other (i.e., no more than 2 terms
between them).
Regarding the Complex Phrase Query Parser, pay attention to the inOrder parameter, which lets you require that the matched keywords appear in the specified order.
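For example, a hedged sketch of a complexphrase query for this case (the field name doc comes from the question; the slop value of 10 is an arbitrary assumption):
q={!complexphrase inOrder=true}doc:"foo bar"~10
This should match "foo walks into a bar" (foo before bar, within 10 positions) but not "bar has place for foo", where the terms appear in the opposite order.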

Solr Tokenizer Question

I have what I think is a simple solr exercise, but I'm unsure what to use.
I have a field of names, e.g. Joe Smith and Jack Daniels and Steve. They could each be one name or two names. I want to be able to search this s.t. if you search for "Danie" you get everything that has a first or last name that starts with "Danie". Three example returns would be "Danielle", "Steven Daniels", and "Danier Daniellson".
I would also like it so that the preference is given to the first name.
So two questions would be do I need to use a copyField and break up the names into first and last name? And what would my analyzer look like?
Edit: Two edits on the searching ability.
1. Something like "Joe S" should return all users that look like "Joe S*"
2. If a user searches with an "&" character, that should be included in the search and not used as an operator.
To solve your first part I suggest the following solution:
index your fields twice:
once with solr.KeywordTokenizerFactory - that will index your entire field as it is. It will not be split into tokens. This will be useful for boosting results so that preference is given to the first name.
once with solr.StandardTokenizerFactory, or with a whitespace tokenizer followed by solr.WordDelimiterFilterFactory
You can find more about these tokenizers here: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
After you have indexed them in two fields with different tokenizers, you just use a boost query to boost the results from one field (the one where preference is given to the first name), as explained here: http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_make_.22superman.22_in_the_title_field_score_higher_than_in_the_subject_field
If a user searches with an "&" character, that should be included in the search and not used as an operator.
For this part you can either use the DisMax query parser http://wiki.apache.org/solr/DisMaxQParserPlugin or, when you make the request, URL-encode the "&" (for example as %26) so it is treated as part of the query rather than as a parameter separator.
Also, you need to use a whitespace-based tokenizer like solr.WhitespaceTokenizerFactory so that characters such as "&" are kept inside the tokens.
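A hedged sketch of how the two fields could be declared (all names here are made up for illustration; query-time boosting via edismax qf weights is one option besides a separate boost query):
<fieldType name="name_whole_sketch" class="solr.TextField">
  <analyzer>
    <!-- whole value as a single lowercased token, e.g. "joe smith", so "joe s*" can prefix-match it -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="name_words_sketch" class="solr.TextField">
  <analyzer>
    <!-- split on whitespace only, so "&" and other punctuation stay inside the tokens -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="name_whole" type="name_whole_sketch" indexed="true" stored="true"/>
<field name="name_words" type="name_words_sketch" indexed="true" stored="false"/>
<copyField source="name_whole" dest="name_words"/>
At query time, something like q=danie* with defType=edismax and qf=name_whole^5 name_words would then search both fields and score hits in the untokenized field (which starts with the first name) higher.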
