Solr multilingual stemming - solr

I'm using Solr to index documents such as .pdf or .docx files. These documents are in French or in English, and I want to use stemming for both languages.
For example, if I search for "chevaux" I want to find "cheval" (French), and if I search for "raise" I want to find "raising" (English).
Is there a way to do this without creating two cores (one for English and one for French)?

Have two fields, one with the field definition you want for French, and one with the field definition you want for English. Then use the Language Detection feature to submit the content to the correct field.
When searching, query the field that matches the user's language, or if you don't know it, search both - or use language detection to make a better guess.
You can also index the same content into both fields, but my initial guess is that it'll give you weird results down the road, where someone enters a French word but, due to the processing rules for English, you get hits that wouldn't have happened if you had only indexed to the correct field.
By enabling langid.map, you can tell Solr to index the content into fields named fieldname_langcode (where fieldname is picked up from langid.fl).
langid.map: Enables field name mapping. If true, Solr will map field names for all fields listed in langid.fl.
You can use langid.map.replace or langid.map.pattern if you want to change the default fieldname_langcode naming, but I'd leave those alone for now.
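On the query side, searching one or both language fields could look roughly like the following. This is only a sketch: it assumes langid.fl=content with langid.map enabled, so that the mapped fields are named content_en and content_fr, and it assumes the edismax parser is used. The URL and core name are placeholders.

```python
from urllib.parse import urlencode

# Assumed field names: what langid.map would produce for langid.fl=content
# when the detected languages are English and French.
LANGUAGE_FIELDS = {"en": "content_en", "fr": "content_fr"}


def build_query_params(text, user_lang=None):
    """Return edismax parameters that search one or both language fields."""
    if user_lang in LANGUAGE_FIELDS:
        fields = [LANGUAGE_FIELDS[user_lang]]
    else:
        # Language unknown: search every language-specific field.
        fields = sorted(LANGUAGE_FIELDS.values())
    return {"q": text, "defType": "edismax", "qf": " ".join(fields)}


# Placeholder URL; adjust host and core name to your setup.
base_url = "http://localhost:8983/solr/mycore/select"
print(base_url + "?" + urlencode(build_query_params("chevaux", "fr")))
print(base_url + "?" + urlencode(build_query_params("raise")))
```

The first call restricts the search to the French field; the second searches both because the user's language is unknown.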

Related

Apache solr search text search (among multiple fields)

I am getting familiar with Apache Solr.
I created a simple document via the admin UI:
{
"company_name":["Rikotech inc"],
"id":"12345",
"full_title":["ft rikotech marinov"],
"_version_":1681062832169287680
}
Here is the document fetched:
But when I type rikotech in the standard query field, I get no result:
Both full_title and company_name are of type text_general.
I watched a YouTube tutorial, and there it worked.
What am I missing here?
Solr will not search all fields (under any configuration, really) without specifying the fields. However, the tutorial you watched probably had the default copyField rule enabled where everything is copied into a field named _text_, and then that field is configured as the default search field. This effectively means that everything is being copied into a specific field, and then that (single) field is being searched by default.
In your case it's probably better to use the edismax query parser (check the box in front of edismax in the user interface), and then give full_title company_name as the query fields (qf). That will allow you to adjust the weights between the fields as well. full_title company_name^5 will give 5x as much weight to any hits in company_name compared to those in full_title.
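As a rough illustration of the edismax suggestion, the parameters could be assembled like this. The helper name edismax_params is made up for this sketch; only q, defType and qf are actual Solr parameters.

```python
from urllib.parse import urlencode


def edismax_params(text, field_boosts):
    """Build edismax parameters; field_boosts maps field name -> boost or None."""
    qf_parts = []
    for field, boost in field_boosts.items():
        # "field^5" weights hits in that field five times as much.
        qf_parts.append(field if boost is None else f"{field}^{boost}")
    return {"q": text, "defType": "edismax", "qf": " ".join(qf_parts)}


params = edismax_params("rikotech", {"full_title": None, "company_name": 5})
# urlencode escapes the space and the caret in qf for use in a URL.
print("/select?" + urlencode(params))
```

Passing the boosts as a mapping keeps the weighting in one place, so it can be tuned without touching the query-building code.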
I found the problem.
It was that the fields I want to search through by default were copied to some strange fields like full_title_str, instead of _text_. This is the correct schema setting:

Spell checking with Solr

I use Solr to index documents (PDF, Word, .txt, etc.). I need a spell checker (in French) but I don't know how to set it up. I only need it on the field "content", which is of type text_general.
The spellchecker uses the content of your index to build the terms that are used for suggestions - there is no language configuration, since as long as the content that has been indexed is French, the suggestion back to the user will be based on those terms.
The exception is if you're using the FileBasedSpellChecker, where you provide a dictionary of terms with their correct spelling.
# spellcheck.q is only necessary if you want to use a different query than your actual query
&spellcheck=true&spellcheck.q=foo
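A small sketch of how these parameters might be put together from code. It assumes the request handler being queried already has the spellcheck component configured; the helper name and the sample query are invented for illustration.

```python
from urllib.parse import urlencode


def spellcheck_params(query, spellcheck_query=None):
    """Return query parameters with spell checking switched on."""
    params = {"q": query, "spellcheck": "true"}
    if spellcheck_query is not None:
        # Only needed when suggestions should be based on a different
        # string than the main query.
        params["spellcheck.q"] = spellcheck_query
    return params


print(urlencode(spellcheck_params("bonjuor")))
print(urlencode(spellcheck_params("content:bonjuor", "bonjuor")))
```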

Solr Support for wildcard in facet.prefix

I have a system which has a huge number of facet values on country name. So the countries can be USA, United State, Canada, etc.
Now I want the facets to be custom sorted. By default Solr supports either count-based or alphabetical sorting, but I do not want either of these. I want a custom sort such that all USA variations come at the top, then Europe, then Asia and so on.
For this I have written a tokenizer which reads a text file and generates tokens like this:
0001_usa
0002_united state
So basically I prefix my sort sequence and then sort in alphabetical order, removing the prefix when displaying on the UI. So far it works great. Since the number of facets is huge, I also want a search feature with auto-suggest. For example, if a user types "u", I should be able to display all countries starting with "u" in the type-ahead. I was using facet.prefix for this earlier, but with my custom tokens it no longer works, since I prefix 000x to the token. Also, facet.prefix does not seem to support wildcards. So how can I implement this type-ahead? Is there any other way to support custom sorting in Solr? I do not want to fetch all the data on the client and sort it there, since it's huge.
Please help
You can easily achieve this by indexing the country names in an additional field, with the right handlers for autosuggest.
You could have something like country_sort where you put your prefixed values like before (0001_usa, 0002_united state) and a country_autosuggest field where you put the plain values (usa, united state).
Then query on country_autosuggest and sort on country_sort. This way you can also return the value of country_autosuggest, with no need to process the string at display time.
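To make the two-field idea concrete, here is a small in-memory sketch of deriving both values from one prefixed token and of the "match on one field, order by the other" behaviour. It only imitates what Solr would do at query time; the helper names and the 0003_canada token are made up for the example.

```python
def split_sort_prefix(token, separator="_"):
    """Split '0001_usa' into ('0001', 'usa')."""
    prefix, _, name = token.partition(separator)
    return prefix, name


def to_document_fields(token):
    """Derive both field values from one prefixed token."""
    _, name = split_sort_prefix(token)
    return {"country_sort": token, "country_autosuggest": name}


def suggest(tokens, typed):
    """Names starting with `typed`, ordered by the custom sort prefix."""
    docs = [to_document_fields(t) for t in tokens]
    wanted = typed.lower()
    matches = [d for d in docs if d["country_autosuggest"].startswith(wanted)]
    matches.sort(key=lambda d: d["country_sort"])
    return [d["country_autosuggest"] for d in matches]


# Hypothetical tokens in the format produced by the custom tokenizer.
tokens = ["0003_canada", "0002_united state", "0001_usa"]
print(suggest(tokens, "u"))
```

Because the plain name is stored separately, nothing has to be stripped at display time, and the prefix matching never sees the sort prefix.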

Searching for words that are contained in other words

Let's say that one of my fields in the index contains the word entrepreneurial. When I search for the word entrepreneur I don't get that document. But entrepreneur* does.
Is there a mode/parameter in which queries search for document that have words that contain a word token in search text?
Another example would be finding a doc that has Matthew when you're looking for Matt.
Thanks
We don't currently have a mode where all input terms are treated as prefixes. You have a few options depending on what exactly you are looking for:
Set the target searchable field to a language-specific analyzer. This is the nicest option from the linguistics perspective. When you do this, if appropriate for the language we'll do stemming, which helps with things such as "run" versus "running". It won't help with your specific sample of "entrepreneurial", but generally speaking this helps significantly with recall.
Split the search input before sending it to search and add "*" to all terms. Depending on your target language this is relatively easy (i.e. if there are spaces) or very hard. Note that prefixes don't mix well with stemming unless you take them into account and search both (e.g. something like search=aa bb -> (aa | aa*) (bb | bb*))
Lean on suggestions. This is more of a different angle that may or may not match your scenario. Search suggestions are good at partial/prefix matching and they'll help users land on the right terms. You can read more about this here.
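The input-splitting option above could be sketched like this. It assumes whitespace-separated terms and does no escaping of special query characters; the function name is invented for the example.

```python
def expand_with_prefixes(search_text):
    """Turn 'aa bb' into '(aa | aa*) (bb | bb*)'.

    Each term is searched both as-is (so stemming can still match)
    and as a prefix.
    """
    terms = search_text.split()
    return " ".join(f"({term} | {term}*)" for term in terms)


print(expand_with_prefixes("aa bb"))
print(expand_with_prefixes("entrepreneur"))
```

Keeping the plain term next to the prefixed one follows the note about stemming: the prefix form alone would bypass the analyzer's stemmed matches.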
perhaps this page might be of interest..?
https://msdn.microsoft.com/en-us/library/azure/dn798927.aspx
search=[string]
Optional. The text to search for. All searchable fields are searched by default unless searchFields is specified. When searching searchable fields, the search text itself is tokenized, so multiple terms can be separated by white space (e.g.: search=hello world). To match any term, use * (this can be useful for boolean filter queries). Omitting this parameter has the same effect as setting it to *. See Simple query syntax in Azure Search for specifics on the search syntax.

Solr - How do I get the number of documents for each field containing the search term within that field in Solr?

Imagine an index like the following:
id | partno   | name         | description
1  | 1000.001 | Apple iPod   | iPod by Apple
2  | 1000.123 | Apple iPhone | The iPhone
When the user searches for "Apple" both documents would be returned. Now I'd like to give the user the possibility to narrow down the results by limiting the search to one or more fields that have documents containing the term "Apple" within those fields.
So, ideally, the user would see something like this in the filter section of the ui after his first query:
Filter by field
name (2)
description (1)
When the user applies the filter for field "description", only documents which contain the term "Apple" within the field "description" would be returned. So the result set of that second request would be the iPod document only. For that I'd use a query like ?q=Apple&qf=description (I'm using the Extended DisMax Query Parser)
How can I accomplish that with Solr?
I already experimented with faceting, grouping and highlighting components, but did not really come to a decent solution to this.
[Update]
Just to make that clear again: The main problem here is to get the information needed for displaying the "Filter by field" section. This includes the names of the fields and the hits per field. Sending a second request with one of those filters applied already works.
Solr just plain doesn't do this. If you absolutely need it, I'd try the multiple-requests solution and benchmark it -- Solr tends to be a lot faster than what people put in front of it, so a few requests might not be that big of a deal.
You could achieve this with two different search requests/queries:
name:apple -> 2 hits
description:apple -> 1 hit
EDIT:
You also could implement your own SearchComponent that executes multiple queries in the background and put it in the SearchHandler processing chain so you only will need a single query in the frontend.
If you want the term to be searched over the same fields every time, you have 2 options that don't break the "single query" requirement:
1) copyField: at index time you group all the fields that should match together. With just one copyField your problem doesn't exist; if you need more than one, you're back in the same spot.
2) You could filter the query each time by dynamically adding the "fq" parameter at the end:
http://<your_url_and_stuff>/?q=Apple&fq=name:Apple ...
This works if you'll always be searching on the same two fields (or you can set them up before querying); otherwise you'll always need at least a second query.
Since I said "you have 2 options" but you actually have 3 (and I rushed my answer), here's the third:
3) The DisMax plugin, which is described like this:
The DisMaxQParserPlugin is designed to process simple user entered phrases
(without heavy syntax) and search for the individual words across several fields
using different weighting (boosts) based on the significance of each field.
So, if you can use it, you may want to give it a look and start from the qf parameter (that is what option number 2 was meant to be about, but I changed it in favor of fq... don't ask me why...).
SolrFaceting should solve your problem.
Have a look at the Examples.
This can be achieved with Solr faceting, but it's not neat. For example, I can issue this query:
/select?q=*:*&rows=0&facet=true&facet.query=title:donkey&facet.query=text:donkey&wt=json
to find the number of documents containing donkey in the title and text fields. I may get this response:
{
"responseHeader":{"status":0,"QTime":1,"params":{"facet":"true","facet.query":["title:donkey","text:donkey"],"q":"*:*","wt":"json","rows":"0"}},
"response":{"numFound":3365840,"start":0,"docs":[]},
"facet_counts":{
"facet_queries":{
"title:donkey":127,
"text:donkey":4108
},
"facet_fields":{},
"facet_dates":{},
"facet_ranges":{}
}
}
Since you also want the documents back for the field-disjunctive query, something like the following works:
/select?q=donkey&defType=edismax&qf=text+title&rows=10&facet=true&facet.query=title:donkey&facet.query=text:donkey&wt=json
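On the client side, the "Filter by field" section could then be derived from facet_queries roughly as follows. This is a sketch: the helper names are made up, the search term is not escaped, and the sample response is trimmed down from the one shown above.

```python
import json
from urllib.parse import urlencode


def facet_count_params(term, fields, rows=10):
    """Parameters for one request returning documents plus per-field counts."""
    params = [("q", term), ("defType", "edismax"), ("qf", " ".join(fields)),
              ("rows", rows), ("facet", "true"), ("wt", "json")]
    # One facet.query per field: counts documents matching the term there.
    params += [("facet.query", f"{field}:{term}") for field in fields]
    return params


def field_counts(response, term):
    """Map field name -> hit count, skipping fields without hits."""
    counts = {}
    facet_queries = response["facet_counts"]["facet_queries"]
    for facet_query, count in facet_queries.items():
        field, _, value = facet_query.partition(":")
        if value == term and count > 0:
            counts[field] = count
    return counts


# Trimmed-down version of the response shown above.
sample = json.loads(
    '{"facet_counts": {"facet_queries": {"title:donkey": 127, "text:donkey": 4108}}}'
)
print(urlencode(facet_count_params("donkey", ["text", "title"])))
print(field_counts(sample, "donkey"))
```

Fields with a count of zero are dropped so that the filter section only lists fields that would actually return results.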
