I am currently indexing a few documents from an external source into SOLR. This external source has a few empty elements that are getting indexed in SOLR as well. How can I avoid indexing empty/null values in SOLR?
For example, my CSV columns are name,city,zip. Some rows are:
Jack,Houston, 89812
,Austin,98123
In the second row I do not have a name. However, when SOLR indexes this document it adds {"Name":"","City":"Austin","Zip":"98123"}. How can I avoid having "Name" as an empty element in SOLR?
Thanks in advance
If you need to do any pre-processing on submitted documents before they hit the schema, Solr has a whole UpdateRequestProcessor subsystem. The specific one you are looking for is RemoveBlankFieldUpdateProcessorFactory, possibly coupled with TrimFieldUpdateProcessorFactory.
Remember that you need to tell Solr that you want to use them, either via a chain (default or explicit) or via individual configuration (explicit), as described in the UpdateRequestProcessor documentation.
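A rough sketch of such a chain in solrconfig.xml (the chain name is arbitrary and the exact processor list is up to you):

<updateRequestProcessorChain name="remove-blanks" default="true">
  <!-- strip leading/trailing whitespace so "  " becomes "" -->
  <processor class="solr.TrimFieldUpdateProcessorFactory"/>
  <!-- drop any field whose value is now an empty string -->
  <processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>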
You could convert your CSV to JSON, omitting the empty name, and then index the JSON file(s).
Solr by itself only indexes what it gets. If it indexes an empty field, it got an empty field. And this is what happens with the CSV indexer, I guess; it just is not made to leave empty fields out.
With JSON you are in control.
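For example, the two CSV rows above could become a JSON array in which the second document simply has no name key at all (field names assumed to match your schema), so nothing empty ever reaches Solr:

[
  {"name": "Jack", "city": "Houston", "zip": "89812"},
  {"city": "Austin", "zip": "98123"}
]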
So I've got a comma-separated value field (technically a textfield, but all of the values will be formatted as CSV) in Drupal which will be submitted to Apache Solr as part of an indexed document.
The values will be a list of keywords, for example something like this (but not necessarily this):
productid, nameofproduct, randomattribute1, randomattribute2, etc, etc2
How would I best get Solr to process each of these? Do I need to create a separate string field for each of them, or is there any way for Apache Solr to process what is essentially an array of values as a single field?
I'm not seeing any documentation on the dynamic fields that allows this, but it seems like a common enough use case that it would be usable.
So in short, is there any way to use a field of CSV in Solr, or do I have to separate each value into a separate field for indexing?
If you are just looking for arrays, see the 'multiValued' attribute of a field; the Solr wiki has more on field attributes. It is difficult to say what the right schema is from your question. See
/Solr_Directory/example/solr/collection1/conf/schema.xml
The file can be used as a starting point and contains various combinations of fields.
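For reference, a multiValued declaration in schema.xml looks like this (the field name 'attributes' is just a placeholder):

<!-- accepts multiple values for the same field in one document -->
<field name="attributes" type="string" indexed="true" stored="true" multiValued="true"/>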
Also look at this question; the answer shows how to split a string by comma and store the parts.
I am trying to index Wikipedia's dump. In order to provide an abstract for the articles (or, maybe, enable the highlighting feature in the future) I'd like to store their text without WikiMarkup. For the first try, it would be enough for me to keep just alphanumeric symbols. So the question is: is it possible to store the field value that has been filtered at the character level, rather than the original one?
There is no way to do this out of the box. If you want Solr to do this, you can create your own UpdateHandler, but this might be a little tricky. The easiest way to do this would be to pre-process the document before sending it to Solr.
Solr by default stores the original field values before the filters are applied by the index-time analyzers for your fieldType. So by default it is not storing the filtered value. However, you have two options for getting the result that you want.
You can apply the same filters to the field at query time as are being applied at index time to remove the wiki markup. Please see Analyzers, Tokenizers and Token Filters on the Solr Wiki for more details.
You can apply the filters to the data in a separate process prior to loading the data into Solr, then Solr will store the filtered values, since you will be passing them in already in a filtered state.
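To illustrate the first option, a fieldType whose analyzer strips some markup at the character level could look like the sketch below (the pattern is purely illustrative and nowhere near complete for WikiMarkup); remember that this only changes the indexed tokens, the stored value stays exactly as submitted:

<fieldType name="text_nowiki" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- illustrative only: removes [[ ]] {{ }} and '' emphasis markers -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\[\[|\]\]|\{\{|\}\}|'{2,}" replacement=""/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>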
I'm not entirely sure on the vocabulary, but what I'd like to do is send a document (or just a string really) and a bunch of keywords to a Solr server (using Solrnet), and have a response that tells me if the document is a match for the keywords or not, without the document being stored or indexed on the server.
Is this possible, and if so, how do I do it?
If not, any suggestions of a better way? The idea is to check if a document is a match before storing it. Could it work to store it first with just a soft commit, and if it is not a match delete it again? How would this affect the index?
Index a document - send it to Solr to be tokenized and analyzed, and the resulting tokens added to the searchable index
Store a document - send it to Solr to be stored as-is, without any modifications
So if you want a document to be searchable you need to index it first.
If you want a document (fields) to be retrievable in its original form, you need to store a document.
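In schema.xml terms, those two behaviours are controlled per field by the indexed and stored attributes; a sketch with made-up field names:

<!-- searchable, but the original value is not returned in results -->
<field name="body" type="text_general" indexed="true" stored="false"/>
<!-- searchable and returned as-is -->
<field name="title" type="string" indexed="true" stored="true"/>
<!-- returned as-is, but not searchable -->
<field name="raw_source" type="string" indexed="false" stored="true"/>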
What exactly are you trying to accomplish? Avoid duplicate documents? Can you expand a little bit on your case...
I am doing a POC on content/text search using Solr 3.3.
I have a requirement where documents along with their content and custom metadata would be indexed initially. After the documents are indexed and made available for searching, users can change the custom metadata of the documents. However, once a document is added to the index, its content cannot be updated. When the user updates the custom metadata, the document index has to be updated to reflect the metadata changes in the search.
But during the index update, even though the content of the file has not changed, it is re-indexed as well, which causes delays in the metadata update.
So I wanted to check if there is a way to avoid content indexing and update just the metadata?
Or do I have to store the content and metadata in separate indexes, i.e. documentId and content in index1, and documentId and custom metadata in another index? In that case, how can I query these two different indexes and return the result?
"if there is a way to avoid content indexing and update just the metadata" This has been covered in solr indexing and reindexing and the answer is no.
Do remember that Solr uses a very loose schema. It's like a database where everything is put into a single table. Think sparse matrices, think Amazon SimpleDB. Two Solr indexes are considered two databases, not two tables, if you had DB-like joins in mind. I just answered this in How to start and Stop SOLR from A user created windows service.
I would enter each file as two documents (a solr document = a DB row). Hence for a file on "watson":
id: docs_contents_watson
type:contents
text: text of the file
and the metadata as
id:docs_metadata_watson
type:metadata
author:A J Crown
year:1984
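For illustration, those two documents in Solr's XML update format (field names as above) would be posted as:

<add>
  <doc>
    <field name="id">docs_contents_watson</field>
    <field name="type">contents</field>
    <field name="text">text of the file</field>
  </doc>
  <doc>
    <field name="id">docs_metadata_watson</field>
    <field name="type">metadata</field>
    <field name="author">A J Crown</field>
    <field name="year">1984</field>
  </doc>
</add>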
To search the contents of a document:
http://localhost:8080/app/select?q=type:contents AND text:"on a dark lonely night"
To do metadata searches:
http://localhost:8080/app/select?q=type:metadata AND year:1984
Note the type:xx.
This may be a kludge (an implementation that can cause headaches in the long run). Fellow SO'ers, please critique this.
We did try this and it should work. Take a snapshot of what you have, basically the SolrInputDocument object, before you send it to Lucene. Compress and serialize the object and then assign it to one more field in your schema. Make that field a binary field.
So when you want to update this information in one of the fields, just fetch the binary field, deserialize it, append/update the values of the fields you are interested in, and re-feed the document to Lucene.
Don't forget to store the XML that contains the text extracted by Tika (which is used for search/indexing) as one of the fields inside the SolrInputDocument.
The only negative: Your index size will grow a little bit but you will get what you want without re-feeding the data.
Why don't I get any suggestions when I execute this query against Solr:
q=%2B%28text%3A%28gasal%29%29&suggestField=contentOriginal&ontologySeed=gasal&spellcheck.build=true&spellcheck.q=gasal&spellcheck=true&spellcheck.collate=true&hl=true&hl.snippets=5&hl.fl=text&hl.fl=text&rows=12&start=0&qt=%2Fsuggestprobabilistic
I am searching gasal and it should suggest gasol.
Thanks in advance
By default, the spellchecker works by taking the indexed content of a source field (in Solr) and storing it in an external Lucene index. That external index is the dictionary. Each word of the source field is stored in the dictionary in a format that allows matching words that are close to each other. When asked for suggestions, Solr will look into that dictionary, NOT into the Solr index.
So in order for the dictionary to be built, you have to specify the source field. It should be defined in your schema using an appropriate analyzer (usually no stemming). That field should contain enough words to build a good dictionary. A good practice is to populate it from your text fields using copyField instructions.
Then, the dictionary has to be built. This is the operation where the content of the source field is taken to build the actual dictionary. It can be done automatically at each commit or manually using the "build" parameter.
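A minimal sketch of that setup with the standard index-based spellchecker, assuming contentOriginal is your source field (your custom /suggestprobabilistic handler may be wired differently). In schema.xml:

<!-- feed the spellcheck source field from your main text field -->
<copyField source="text" dest="contentOriginal"/>

And in solrconfig.xml:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <!-- the field the dictionary is built from -->
    <str name="field">contentOriginal</str>
    <str name="spellcheckIndexDir">spellchecker</str>
    <!-- rebuild the dictionary automatically on commit -->
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>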