Cloudant Search Index Analyzers don't sort alphabetically - cloudant

I have tested all the available analyzers on my search index, but none except the Keyword Analyzer gave me results sorted in alphabetical order. However, the Keyword Analyzer doesn't fit my filtering requirements: with it I couldn't search for a substring within a given sentence.
Example: "description": "This is 2 test different Analyzers in a Search Index"
The Whitespace Analyzer gives proper search results, but it doesn't help me with sorting. Does anyone have pointers on how to achieve both sorting and searching with a search index?

Analyzers define how text is broken into words and how those words are normalized (e.g. stemmed) into tokens for indexing. For example, the keyword analyzer keeps the field value intact in its entirety as a single token, which is handy for tags.
Analyzers don't have much to do with sorting. By default, sorting is by "best match first" i.e. the documents that are the closest match to your input string appear first, which is what you might expect from a search engine.
You can override the default sort by supplying a sort parameter, e.g.
q=frank+sinatra&sort=date
See https://console.bluemix.net/docs/services/Cloudant/api/search.html#search for further sorting options.
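For example, to sort alphabetically on the description field from your example, a sketch (assuming your index function emits a field named description; string sorts need the <string> type suffix on the field name, since the sort type defaults to number):

q=description:analyzers&sort="description<string>"

Note that a field can only be sorted on if it is emitted by the search index's index function.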

Related

change Commercetools Product Projection Search sort rules

Good afternoon. The standard sorting in commercetools sorts alphabetically, with special characters first, then numbers, then letters. I need records to be returned in this order:
A-Z
numbers
special characters
and if a record starts with a space, the space should be ignored. The expected order is:
"A"
" B"
"9"
"("
Is it possible to do this with the standard tools of commercetools? The documentation only describes sorting in ascending and descending order, and I need to set a different sorting principle.
I'm trying to use the queries described in the documentation: products-search#sorting
Currently, the sort feature cannot be customized as you describe.
As mentioned in the documentation, if multiple sort expressions are specified via multiple sort parameters, they are combined into a composed sort: results are first ordered by the first expression, and equal values are then ordered by the second expression, and so on. This could be helpful for your use case.
https://docs.commercetools.com/api/general-concepts#sorting
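For example, a composed sort over two expressions might look like this (the field names are illustrative):

sort=name.en asc&sort=createdAt desc

Results are ordered by name.en first, and ties on name.en are then ordered by createdAt.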
Best regards,
Michael

Solr: find the most used word after a given word

I need to find the most used word after a given word. For an example collection,
A B
A C
A B
B C
Here the most used word after word A is B.
How can I find this in Solr?
Create a field with ShingleFilterFactory as one of its filters. This will generate token shingles (adjacent word pairs) when indexing the field, so that A B C is indexed as A B and B C. You will want to use the WhitespaceTokenizer or something similar for the field.
Make a request that searches for field:A\ * (meaning everything starting with the word A) as the query, and add a facet for the field.
facet=true&facet.field=field&facet.limit=10&facet.sort=count
will give you the ten most used sequences that start with the word A.
ShingleFilterFactory defaults to generating shingles with two tokens in each shingle, but you can tune this by altering minShingleSize and maxShingleSize.
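For reference, a minimal field type sketch for schema.xml (the class and attribute names are standard Solr; outputUnigrams="false" keeps only the two-word shingles in the index):

<fieldType name="shingle_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2" outputUnigrams="false"/>
  </analyzer>
</fieldType>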

SQL Server Full Text Search Most Common Word Pairs

I am looking for a way to query for the most common adjacent words and/or most common included words in a document given a set of documents containing a word.
For example, I would like a query that would accept 'windows' and return a list of words that are most commonly found in a document containing 'windows', like 'microsoft' or 'doors'.
I would like to find adjacent words, but I also see a potential need in my application for eventually knowing the most common words also present in the document. An example of that might be 'linux' or 'efficiency'. Those words might not be adjacent to 'windows' but they are likely to be in the same document.
I found this question which helps me part way, but that only gets me the most common words given all the documents, or a specific document, not a set of documents.
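No complete built-in solution exists, but for the "most common words in the same documents" half, a sketch along these lines may work. sys.dm_fts_index_keywords_by_document and CONTAINS are real SQL Server features; the table dbo.Documents and its columns id (assumed to be the full-text key) and body are hypothetical:

-- Count, per keyword, how many documents containing 'windows' also contain it
SELECT TOP (20)
    k.display_term,
    COUNT(DISTINCT k.document_id) AS doc_count
FROM sys.dm_fts_index_keywords_by_document(DB_ID(), OBJECT_ID('dbo.Documents')) AS k
JOIN dbo.Documents AS d
    ON d.id = k.document_id  -- assumes id is the full-text key column
WHERE CONTAINS(d.body, 'windows')
  AND k.display_term <> 'windows'
GROUP BY k.display_term
ORDER BY doc_count DESC;

Adjacency (the "most common adjacent words" half) is not exposed by the full-text DMVs, so that part would need shingle-style preprocessing outside the built-in full-text search.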

In Solr, how to find the number of occurrences of searched words in each document

I want to find out how many times the searched keyword is repeated in each document. For example, take the search phrase "pharmacy related": this phrase may be repeated any number of times in each of the matched documents. How do I find the count per document?
You can do that with Solr's functions:
termfreq(text,'pharmacy related')
The only condition is that the field is indexed appropriately. If you need counts of phrases rather than single words, I would use ShingleFilterFactory.
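For example, assuming the field is named text, the function can be returned as a pseudo-field through fl:

q=text:pharmacy&fl=id,count:termfreq(text,'pharmacy')

Each matched document then carries a count field holding the per-document frequency of the term.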

How to efficiently search large dataset for substrings?

I have a large set of short strings. What are some algorithms and indexing strategies for filtering the list to items that contain a substring? For example, suppose I have a list:
val words = List(
  "pick",
  "prepick",
  "picks",
  "picking",
  "kingly"
  ...
)
How could I find strings that contain the substring "king"? I could brute force the problem like so:
words.filter(_.indexOf("king") != -1) // yields List("picking", "kingly")
This is only practical for small sets; today I need to support 10 million strings, with a future goal in the billions. Obviously I need to build an index. What kind of index?
I have looked at using an ngram index stored in MySQL, but I am not sure if this is the best approach. I'm not sure how to optimally query the index when the search string is longer than the ngram size.
I have also considered using Lucene, but this is optimized around token matching, not substring matching, and does not seem to support the requirement of simple substring matching. Lucene does have a few classes related to ngrams (org.apache.lucene.analysis.ngram.NGramTokenFilter is one example), but these seem to be intended for spell check and autocomplete use cases, not substring matching, and the documentation is thin.
What other algorithms and indexing strategies should I consider? Are there any open source libraries that support this? Can the SQL or Lucene strategies (above) be made to work?
Another way to illustrate the requirement is with SQL:
SELECT word FROM words WHERE word LIKE CONCAT('%', ?, '%');
Where ? is a user provided search string, and the result is a list of words that contain the search string.
How big is the longest word?
If it's around 7-8 characters, you could generate all substrings of each string and insert those substrings into a trie (the one used in Aho-Corasick - http://en.wikipedia.org/wiki/Aho-Corasick).
It will take some time to build the tree, but then finding all occurrences takes O(length of the search string).
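A minimal sketch of this idea in Scala (inserting every suffix is sufficient, since any substring is a prefix of some suffix; memory grows roughly with the square of the word length, which is why this only suits short strings):

import scala.collection.mutable

class SuffixTrie {
  private class Node {
    val children = mutable.Map.empty[Char, Node]
    val ids = mutable.Set.empty[Int] // words whose suffix passes through here
  }
  private val root = new Node

  // Insert every suffix of word, tagging each node with the word's id.
  def add(id: Int, word: String): Unit =
    for (start <- word.indices) {
      var node = root
      for (c <- word.substring(start)) {
        node = node.children.getOrElseUpdate(c, new Node)
        node.ids += id
      }
    }

  // Walk the trie along sub; the ids at the final node are the matches.
  def find(sub: String): Set[Int] = {
    var node = root
    for (c <- sub)
      node.children.get(c) match {
        case Some(n) => node = n
        case None    => return Set.empty[Int]
      }
    node.ids.toSet
  }
}

For example:

val words = List("pick", "prepick", "picks", "picking", "kingly")
val trie = new SuffixTrie
words.zipWithIndex.foreach { case (w, i) => trie.add(i, w) }
trie.find("king").map(words) // Set("picking", "kingly")

Lookup cost is O(length of the search string), as stated above.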
Postgres has a module (pg_trgm) which provides a trigram index.
That seems an interesting idea too: building a trigram index.
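A quick sketch of the Postgres route (pg_trgm and its gin_trgm_ops operator class are the real names; the words table is the one from the SQL example above):

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX words_word_trgm_idx ON words USING gin (word gin_trgm_ops);

-- LIKE with leading and trailing wildcards can now use the trigram index:
SELECT word FROM words WHERE word LIKE '%king%';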
Regarding the comment in your question about how to break down searches longer than the n-gram length, here's one approach that will work.
Say the search string is "abcde" and we have built a trigram index. (Your strings are of smaller lengths - this could hit a sweet spot for you.)
Let the search results be abc = S1, bcd = S2, cde = S3 (where S1, S2, S3 are sets of indexes).
Then the longest common substring of S1, S2, S3 will give the indexes that we want.
We can transform each set of indexes into a single string, separated by a delimiter (say a space), before computing the LCS.
After we find the LCS, we still have to check the candidates against the complete pattern, since we broke the search term down; i.e. we must prune results like "abc-XYZ-bcd-HJI-cde", where the trigrams all occur but not adjacently.
The LCS of a set of strings can be found efficiently using suffix arrays or suffix trees.
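For illustration, here is a sketch in Scala of the simpler set-intersection variant of this idea (intersecting the posting sets rather than the LCS transformation, then verifying; in-memory only, and assuming queries of at least 3 characters):

import scala.collection.mutable

object TrigramIndex {
  // Map each trigram to the set of word indexes that contain it.
  def build(words: Seq[String]): Map[String, Set[Int]] = {
    val index = mutable.Map.empty[String, mutable.Set[Int]]
    for ((w, i) <- words.zipWithIndex; g <- w.sliding(3))
      index.getOrElseUpdate(g, mutable.Set.empty) += i
    index.map { case (g, ids) => (g, ids.toSet) }.toMap
  }

  // Intersect the posting sets of the query's trigrams, then prune false
  // positives (all trigrams present but not adjacent) with a substring check.
  def search(words: Seq[String], index: Map[String, Set[Int]], q: String): Seq[String] = {
    val candidates = q.sliding(3)
      .map(g => index.getOrElse(g, Set.empty[Int]))
      .reduceOption(_ intersect _)
      .getOrElse(Set.empty[Int])
    candidates.toSeq.map(words).filter(_.contains(q))
  }
}

Usage, with the word list from the question:

val index = TrigramIndex.build(words)
TrigramIndex.search(words, index, "king") // strings containing "king"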
