Why don't I get any suggestions when I execute this query against Solr:
q=%2B%28text%3A%28gasal%29%29&suggestField=contentOriginal&ontologySeed=gasal&spellcheck.build=true&spellcheck.q=gasal&spellcheck=true&spellcheck.collate=true&hl=true&hl.snippets=5&hl.fl=text&hl.fl=text&rows=12&start=0&qt=%2Fsuggestprobabilistic
I am searching for gasal and it should suggest gasol.
Thanks in advance
By default, the spellchecker works by taking the indexed content of a source field (in Solr) and storing it in an external Lucene index. That external index is the dictionary. Each word of the source field is stored in the dictionary in a format that allows matching words that are close to each other. When asking for suggestions, Solr will look into that dictionary, NOT into the Solr index.
So in order for the dictionary to be built, you have to specify the source field. It should be defined in your schema using an appropriate analyzer (usually no stemming). That field should contain enough words to build a good dictionary. A good practice is to populate it from your text fields using copyField instructions.
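A minimal schema.xml sketch of that setup (the field and type names here are illustrative, not taken from your configuration):

<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- deliberately no stemming filter -->
  </analyzer>
</fieldType>
<field name="spell" type="textSpell" indexed="true" stored="false" multiValued="true"/>
<!-- feed the dictionary source from your main text field -->
<copyField source="text" dest="spell"/>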
Then the dictionary has to be built. This is the operation where the content of the source field is taken to build the actual dictionary. It can be done automatically at each commit or manually using the "build" parameter.
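And a rough solrconfig.xml sketch of a matching spellcheck component (again illustrative; adjust names to your setup). The component also has to be attached to your request handler, for example through last-components, for spellcheck.q to reach it:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <!-- the source field defined in the schema -->
    <str name="field">spell</str>
    <!-- the external dictionary index -->
    <str name="spellcheckIndexDir">./spellchecker</str>
    <!-- rebuild the dictionary on every commit, or pass spellcheck.build=true manually -->
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>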
I am currently indexing a few documents from an external source into Solr. This external source has a few empty elements that are getting indexed in Solr as well. How can I avoid indexing empty/null values in Solr?
For example, my CSV is name,city,zip. Some rows are:
Jack,Houston, 89812
,Austin,98123
In the second row I do not have a name. However, when Solr indexes this document it adds {"Name":"","City":"Austin","Zip":"98123"}. How can I avoid having "Name" as an empty element in Solr?
Thanks in advance
If you need to do any pre-processing on submitted documents before they hit the schema, Solr has a whole UpdateRequestProcessor subsystem. The specific one you are looking for is RemoveBlankFieldUpdateProcessorFactory, possibly coupled with TrimFieldUpdateProcessorFactory.
Remember that you need to tell Solr that you want to use them, either via a chain (default or explicit) or via individual configuration (explicit), as described in the documentation for update request processors.
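A hedged solrconfig.xml sketch of such a chain (the chain name is made up; making it the default means it runs for every update request):

<updateRequestProcessorChain name="strip-blanks" default="true">
  <processor class="solr.TrimFieldUpdateProcessorFactory"/>
  <processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- RunUpdateProcessorFactory must stay last so the document is actually indexed -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>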
You could convert your CSV to JSON, leaving out the empty name, and then index the JSON file(s).
Solr by itself only indexes what it gets: if it indexes an empty field, it got an empty field. This is what happens with the CSV indexer, I guess; it just is not made to leave empty fields out.
With JSON you are in control.
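For example, the second CSV row could be sent as a JSON document that simply leaves the name field out (collection name and port are placeholders):

curl 'http://localhost:8983/solr/mycollection/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"city": "Austin", "zip": "98123"}]'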
This question kind of makes it clear that I am new to Solr and all of its wonderful features. I apologise for my noobness.
But why does Solr store the original content in addition to the index? It just seems wasteful. I do realise that it stores the original content only if the field has the property stored="true".
Where does it store the original content? Does it reference the actual document somehow?
Also, Is there any way to directly view the index files saved by Solr for each collection?
Links will be appreciated.
If Solr didn't store the text, it wouldn't be able to actually return the text it found, making it impossible to do things like highlighting or to build an application that uses the results from Solr directly. You'd have to look up the actual content somewhere else for each and every result, which might not be what you want (and that content might not be available; for example, if you're building a search engine, it wouldn't really be effective to retrieve each page in a search result just to get the relevant information anyway).
You can read up on the index file format in the API documentation for the Lucene60 codec, the stored fields are stored using the stored fields format. These fields live in the .fdt files in your index directory.
The index files are usually available in the data/index/ directory under the collection / core on disk:
data/index$ ls
_zq.fdt _zr.fdx _zs.si
...
Community version. When contents are added, the Alfresco search engine tokenizes properties (name, description) and stores them in its indexes. I would like to know if there is a way to retrieve a list of the keywords associated with a particular piece of content?
For example: fetch the tokens from the "Name" of the "abc.txt" content.
I see there are APIs exposed by Solr to get the overall status of indexes and to fix transactions, but nothing that meets my needs.
I had a similar experience: I needed to find out what the tokenizer was doing with the indexes because a particular file name was not found during search.
I finally used Luke, the Lucene index toolbox, which is described as:
Luke is a handy development and diagnostic tool, which accesses
already existing Lucene indexes and allows you to display and modify
their content in several ways:
browse by document number, or by term
view documents / copy to clipboard
retrieve a ranked list of most frequent terms
execute a search, and browse the results
analyze search results
selectively delete documents from the index
reconstruct the original document fields, edit them and re-insert to the index
optimize indexes
open indexes consisting of multiple parts, and/or located on Hadoop filesystem
and much more...
Simply open the index files and you will get a peek at how properties and data were tokenized.
As reported in this post, it can easily be used for Solr indexes as well.
I'm not entirely sure on the vocabulary, but what I'd like to do is send a document (or just a string really) and a bunch of keywords to a Solr server (using Solrnet), and get a response that tells me whether the document is a match for the keywords or not, without the document being stored or indexed on the server.
Is this possible, and if so, how do I do it?
If not, any suggestions of a better way? The idea is to check if a document is a match before storing it. Could it work to store it first with just a soft commit, and if it is not a match delete it again? How would this affect the index?
Index a document - send it to Solr to be tokenized and analyzed and the resulting strings stored
Store a document - send it to Solr to be stored as-is, without any modifications
So if you want a document to be searchable you need to index it first.
If you want a document (its fields) to be retrievable in its original form, you need to store it.
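In schema.xml terms the two behaviours map onto the indexed and stored attributes of a field; a small illustrative sketch (field and type names are made up):

<!-- searchable and retrievable -->
<field name="title" type="text_general" indexed="true" stored="true"/>
<!-- searchable, but cannot be returned with the results -->
<field name="body" type="text_general" indexed="true" stored="false"/>
<!-- returned with the results, but cannot be searched -->
<field name="original" type="string" indexed="false" stored="true"/>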
What exactly are you trying to accomplish? Avoid duplicate documents? Can you expand a little bit on your case...
I am doing a POC on content/text search using Solr 3.3.
I have a requirement where documents, along with their content and custom metadata, are indexed initially. After the documents are indexed and made available for searching, a user can change the custom metadata of the documents. However, once a document is added to the index, its content cannot be updated. When the user updates the custom metadata, the document index has to be updated to reflect the metadata changes in search.
But during an index update, even though the content of the file has not changed, it is re-indexed as well, which causes delays in the metadata update.
So I wanted to check if there is a way to avoid re-indexing the content and update just the metadata.
Or do I have to store the content and metadata in separate indexes, i.e. documentId and content in one index and documentId and custom metadata in another? In that case, how can I query these two different indexes and return a combined result?
"if there is a way to avoid content indexing and update just the metadata" This has been covered in solr indexing and reindexing and the answer is no.
Do remember that Solr uses a very loose schema. It's like a database where everything is put into a single table. Think sparse matrices, think Amazon SimpleDB. Two Solr indexes are considered two databases, not two tables, if you had DB-like joins in mind. I just answered this in How to start and Stop SOLR from A user created windows service.
I would enter each file as two documents (a solr document = a DB row). Hence for a file on "watson":
id: docs_contents_watson
type:contents
text: text of the file
and the metadata as
id:docs_metadata_watson
type:metadata
author:A J Crown
year:1984
To search the contents of a document:
http://localhost:8080/app/select?q=type:contents AND text:"on a dark lonely night"
To do metadata searches:
http://localhost:8080/app/select?q=type:metadata AND year:1984
Note the type:xx clause; quotes and spaces need URL encoding when the request is actually sent.
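A sketch of how the two documents could be posted using Solr's XML update format (field names follow the example above):

<add>
  <doc>
    <field name="id">docs_contents_watson</field>
    <field name="type">contents</field>
    <field name="text">text of the file</field>
  </doc>
  <doc>
    <field name="id">docs_metadata_watson</field>
    <field name="type">metadata</field>
    <field name="author">A J Crown</field>
    <field name="year">1984</field>
  </doc>
</add>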
This may be a kludge (an implementation that can cause headaches in the long run). Fellow SO'ers, please critique this.
We did try this and it should work. Take a snapshot of what you have, basically the SolrInputDocument object, before you send it to Lucene. Serialize and compress the object, then assign it to one more field in your schema. Make that field a binary field.
So when you want to update the information in one of the fields, just fetch the binary field, deserialize it, append/update the values of the fields you are interested in, and re-feed it to Lucene.
Do not forget to store, as one of the fields inside the SolrInputDocument, the XML that contains the text extracted by Tika, which is used for search/indexing.
The only negative: Your index size will grow a little bit but you will get what you want without re-feeding the data.
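A rough SolrJ-flavoured sketch of that snapshot idea (assuming SolrInputDocument is Java-serializable in your SolrJ version; the field name raw_doc is hypothetical and would have to be declared as a stored binary field in the schema):

import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import org.apache.solr.common.SolrInputDocument;

class DocumentSnapshot {

    // Serialize and gzip the original document so it can travel in a binary field.
    static byte[] compress(SolrInputDocument doc) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeObject(doc);
        }
        return bytes.toByteArray();
    }

    // Restore the snapshot so metadata fields can be changed and the document re-fed.
    static SolrInputDocument decompress(byte[] blob) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(blob)))) {
            return (SolrInputDocument) in.readObject();
        }
    }
}

Take the snapshot before adding the raw_doc field itself, store compress(doc) in raw_doc, and when metadata changes fetch the blob, decompress it, update the metadata fields, and re-add the whole document.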