I am currently researching the best Solr configuration for my application. This involves a lot of testing, and I was wondering whether I can display what Solr stores in its index, i.e. the tokenized, stemmed, lowercased, etc. version of my documents. Is there any way to get Solr to provide this information?
Thank you
Jan
Have a look at Luke: http://www.getopt.org/luke/
Solr also has a Luke handler built-in: https://wiki.apache.org/solr/LukeRequestHandler
You can use the Solr Analysis page, which is provided in the Solr admin interface: http://wiki.apache.org/solr/SolrAdminGUI
On the analysis page, enter the field type or field name you want to analyze and any field value. Solr Analysis shows what each tokenizer/filter does and exactly how your content looks after each step. It's great for testing and debugging.
You can do the same for a query if you have also configured analyzers (tokenizers/filters) for queries in the schema.
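The same analysis is also available over HTTP through Solr's field analysis handler, which can be handy for scripted tests. Below is a minimal sketch that only builds the request URL (it sends nothing). The host, port, core name `mycore`, and field type `text_general` are assumptions for illustration; substitute your own, and check that the `/analysis/field` handler is enabled on your core.

```python
from urllib.parse import urlencode


def field_analysis_url(base_url, core, field_type, value, query=None):
    """Build a URL for Solr's field analysis handler (no request is sent)."""
    params = {
        "analysis.fieldtype": field_type,   # or use analysis.fieldname instead
        "analysis.fieldvalue": value,       # text to run through the index analyzer
        "wt": "json",
    }
    if query is not None:
        # Optional: text to run through the query-time analyzer as well.
        params["analysis.query"] = query
    return f"{base_url}/{core}/analysis/field?{urlencode(params)}"


# Hypothetical host, core, and field type.
url = field_analysis_url(
    "http://localhost:8983/solr", "mycore", "text_general", "Running Shoes"
)
print(url)
```

Opening the printed URL in a browser (or fetching it with any HTTP client) should return the token stream after each tokenizer/filter step, much like the admin page shows.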
Hope this helps.
With Hybris Commerce 6.7, which uses Solr 7.7, I'm finding it very difficult to configure Solr to meet business expectations for "Did you mean" suggestions. I read many articles on this and found many configuration parameters. Based on those, with sensible changes, I expected something to work for me. Unfortunately, I'm still looking for the configuration or approach that retailers like Flipkart or Amazon use. Below are the points that troubled me most.
To my knowledge, spellcheck works per word across the entire search phrase. If the user searches with a single word that has a recognizable spelling mistake, Solr finds the correct word easily, e.g. Telvison, mobbile.
If the user searches with multiple words (a phrase), Hybris Solr sometimes fails to return any suggestion, and sometimes returns suggestions that make no real-world sense. E.g. for the misspelled aple watch, it suggests apple water. For bode speaker, it suggests body speaker. For water heatre, it suggests skywater theatre. For samsung note 10 lite, it suggests samsung note 10 litre. For red apple wath, it suggests red apple with. For red apple watch, it suggests led apple watch. And there are many more. Isn't it ridiculous?
I tried adding a WordBreakSolrSpellChecker dictionary alongside the existing DirectSolrSpellChecker, but it didn't affect the suggestions. I doubt Hybris allows this.
I also tried a FileBasedSpellChecker dictionary to maintain a separate text file, but Hybris seems to have a hard dependency that doesn't allow such changes.
I changed the dictionary to IndexBasedSpellChecker, but it threw an exception in the Solr admin console.
After playing with the collate parameters, I found that Solr gives me suggestions that have no direct search result (product). FYI, I use phrase search (not free text) in my search implementation.
Standalone Solr offers many parameters. I studied and implemented them, but got nowhere, although conceptually they should work.
Can anyone please advise how I should proceed? If needed, I can share my Solr configuration.
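For reference, the collate parameters mentioned above can be sketched as a set of request parameters. This is only an illustration for standalone Solr: the values are arbitrary, and whether (and how) Hybris lets you pass these through is an assumption that needs checking. Per the Solr documentation, `spellcheck.maxCollationTries` defaults to 0, which means collations are not checked against the index; a positive value makes Solr test candidate collations so that only ones returning hits are suggested.

```python
from urllib.parse import urlencode


def spellcheck_params(phrase, max_tries=10):
    """Request parameters asking Solr to verify collations against the index."""
    return {
        "q": phrase,
        "spellcheck": "true",
        "spellcheck.collate": "true",
        # Number of candidate collations Solr tries against the index (0 = no check).
        "spellcheck.maxCollationTries": str(max_tries),
        "spellcheck.maxCollations": "3",
        # Include hit counts per collation in the response.
        "spellcheck.collateExtendedResults": "true",
        # Require all words to match when a collation is tested.
        "spellcheck.collateParam.q.op": "AND",
    }


params = spellcheck_params("aple watch")
print(urlencode(params))
```

With collation testing enabled, a collation like "apple water" should only come back if that query actually returns documents.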
I would like to add omitNorms=true to the title field.
It is wrongly overboosting some of our titles.
However, I don't know how the title field is indexed. What is its name, just dc.title?
I don't see anything about it in schema.xml. What is the field's type, and what analyzer or other settings are used for it? Is there any way to find out?
Most metadata fields in DSpace are handled via dynamic fields. That's why you don't see each specified individually in the search core's schema.xml file.
I'm not sure where the boosting is happening (or even whether DSpace does any). I don't recall seeing any boost clauses when looking through the Solr log files. I see some extraction parameters being set in SolrServiceImpl#writeDocument, where the document is being indexed. It looks like there is an extraction parameter for boosting individual fields; perhaps you can play with that to get what you'd like.
If you want to see the field type for any Solr field, the easiest option is probably the Schema Browser in the Solr admin user interface, e.g.
http://localhost:8080/solr/#/search/schema-browser?field=title (you may need to use an SSH tunnel or the like to access Solr running on a different host, since the DSpace Solr install is typically IP-limited to access from localhost).
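If you prefer something scriptable over the admin UI, Solr's Luke request handler reports similar per-field information (type, flags, top terms). A minimal sketch that only builds the URL follows; the host and port are taken from the example above, the core name `search` is read off that same URL, and whether the Luke handler is reachable in your DSpace install is an assumption to verify.

```python
from urllib.parse import urlencode


def luke_field_url(base_url, core, field, num_terms=10):
    """Build a Luke request handler URL for one field (no request is sent)."""
    params = {
        "fl": field,            # restrict the report to this field
        "numTerms": num_terms,  # how many top terms to list
        "wt": "json",
    }
    return f"{base_url}/{core}/admin/luke?{urlencode(params)}"


luke_url = luke_field_url("http://localhost:8080/solr", "search", "title")
print(luke_url)
```

The response should include the field's type and schema flags, which is where a norms setting would show up.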
ElasticSearch has a percolator for prospective search. Does SOLR have a similar feature where you define your queries upfront? If not, is there an effective way of implementing this myself on top of the existing SOLR features?
Besides what BunkerMentality said, it is not hard to build your own percolator. What you need:
Are the queries you want to run easy to model in plain Lucene syntax? If so, you are good; if not, you need to convert them to Lucene queries. Build them and keep them in memory as Lucene queries.
When a doc arrives:
build a MemoryIndex containing only that single doc
run all your queries on the index
I have done this for a system ingesting millions of docs a day and it worked fine.
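The flow above can be sketched in a few lines. Note that this is a plain-Python stand-in, not Lucene: the "index" is just the set of terms in one document, and a "query" is a set of terms that must all be present. The class and method names are made up for illustration; a real implementation would hold Lucene Query objects and use Lucene's MemoryIndex as described.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class Percolator:
    """Keeps registered queries in memory and matches each incoming doc."""

    def __init__(self):
        self.queries = {}  # query id -> frozenset of required terms

    def register(self, query_id, required_terms):
        self.queries[query_id] = frozenset(t.lower() for t in required_terms)

    def percolate(self, doc_text):
        # Build a throwaway single-document "index" ...
        index = tokenize(doc_text)
        # ... and run every stored query against it.
        return sorted(
            qid for qid, terms in self.queries.items() if terms <= index
        )


p = Percolator()
p.register("solr-news", ["solr", "release"])
p.register("riak", ["riak"])
matches = p.percolate("Apache Solr 7.7 release announced")
print(matches)  # ['solr-news']
```

The key design point is the same as in the Lucene version: the queries are long-lived and the index is rebuilt per document, which is the reverse of normal search.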
It's listed as an open new feature, SOLR-4587, on Solr JIRA but it doesn't seem like any work has started on it yet.
There is a link in the comments there to a separate project called Luwak that seems to implement some features similar to percolator.
If it is still relevant, you can use this
It's a Solr update processor based on Luwak.
In short, I need to search my Riak buckets via SOLR. The only problem is that, by default, SOLR searches are case-sensitive. After some digging, I see that I need to write a custom SOLR text analyzer schema. Does anyone have good references for writing search analyzer schemas?
And finally, when installing a new schema for an index, is re-indexing all objects in a bucket necessary for prior results to show up in a search (using the new schema)?
RTFM fail.... I swear though, getting to this page was not easy
http://docs.basho.com/riak/latest/dev/advanced/search-schema/#Defining-a-Schema
Hey, so I started researching Solr and have a couple of questions about how it works. I know the schema defines what is stored and indexed in the Solr application. But I'm confused as to how Solr knows that the "content" is the content of the site or that the URL is the URL?
My main goal is I'm trying to extract phone numbers from websites and I want Solr to nicely spit out 1234567890.
You need to define it in Solr's schema.xml by declaring all the fields and their field types. You can then query Solr on any of those fields.
Refer to this: http://wiki.apache.org/solr/SchemaXml
Solr will not automatically index content from a website. You need to tell it how to index your content. Solr only knows the content you tell it to know. Extracting phone numbers sounds pretty simple so writing an update script or finding one online should not be an issue. Good luck!
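As a rough idea of what the extraction step in such an update script could look like, here is a minimal sketch that pulls US-style 10-digit phone numbers out of page text and normalizes them to bare digits before they are sent to Solr. The pattern and the function name are assumptions for illustration; real-world number formats vary a lot, so treat this as a starting point only.

```python
import re

# Optional "+1"/"1" prefix, then 3-3-4 digits with optional
# parentheses and space, dot, or dash separators.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})(?!\d)"
)


def extract_phone_numbers(text):
    """Return every matched phone number as a 10-digit string."""
    return ["".join(m.groups()) for m in PHONE_RE.finditer(text)]


numbers = extract_phone_numbers("Call (123) 456-7890 or 123.456.7890 today")
print(numbers)  # ['1234567890', '1234567890']
```

The normalized values could then go into a dedicated field declared in schema.xml (a hypothetical `phone` field, for example), so a query returns 1234567890 regardless of how the number was written on the page.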