So I've got a comma-separated value field (technically a text field, but all of the values will be formatted as CSV) in Drupal which will be submitted to Apache Solr as part of a document.
The values will be a list of keywords, for example something like this (but not necessarily this):
productid, nameofproduct, randomattribute1, randomattribute2, etc, etc2
How would I best get Solr to process each of these? Do I need to create a separate string field for each of them, or is there any way for Apache Solr to process what is essentially an array of values as a single field?
I'm not seeing anything in the dynamic fields documentation that allows this, but it seems like a common enough use case that it ought to be supported.
So in short, is there any way to use a field of CSV values in Solr, or do I have to separate each value into its own field for indexing?
If you are just looking for arrays, see the 'multiValued' attribute of a field; the Solr documentation on field attributes has more detail. It is difficult to say from your question what the right schema is. See
/Solr_Directory/example/solr/collection1/conf/schema.xml
The file can be used as a starting point and contains various combinations of fields.
Also look at this question; the answer shows how to split a string by comma and store the resulting values.
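As a minimal sketch (the field name "keywords" here is just an illustration; the type and attributes depend on your schema), a multiValued declaration in schema.xml could look like this:

    <!-- one slot per value: send each CSV value as its own field value -->
    <field name="keywords" type="string" indexed="true" stored="true"
           multiValued="true"/>

Your indexing code (e.g. the Drupal module) would then split the CSV string and submit each value separately; a query like keywords:randomattribute1 matches any document whose list of values contains that term.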
I am currently indexing a few documents from an external source into Solr. This external source has a few empty elements that are getting indexed in Solr as well. How can I avoid indexing empty/null values in Solr?
For example, my CSV columns are name,city,zip. Some rows are:
Jack,Houston, 89812
,Austin,98123
In the second row I do not have a name. However, when Solr indexes this document it adds {"Name":"","City":"Austin","Zip":"98123"}. How can I avoid having "Name" as an empty element in Solr?
Thanks in advance
If you need to do any pre-processing on submitted documents before they hit the schema, Solr has a whole UpdateRequestProcessor subsystem. The specific one you are looking for is RemoveBlankFieldUpdateProcessorFactory, possibly coupled with TrimFieldUpdateProcessorFactory.
Remember that you need to tell Solr that you want to use them, either via a chain (default or explicit) or via individual configuration (explicit), as described in the first link above.
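As a rough sketch of what that could look like in solrconfig.xml (the chain name "cleanup" is arbitrary, and this assumes a Solr 4.x-style config):

    <updateRequestProcessorChain name="cleanup">
      <!-- trim first, so whitespace-only values become empty strings -->
      <processor class="solr.TrimFieldUpdateProcessorFactory"/>
      <!-- then drop any field whose value is the empty string -->
      <processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>
      <!-- must come last, or the document never reaches the index -->
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

    <requestHandler name="/update" class="solr.UpdateRequestHandler">
      <lst name="defaults">
        <str name="update.chain">cleanup</str>
      </lst>
    </requestHandler>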
You could convert your CSV to JSON, not providing the empty name, and then index the JSON file(s).
Solr by itself only indexes what it gets: if it indexes an empty field, it got an empty field. This is what happens with the CSV indexer, I suspect; it simply is not built to leave empty fields out.
With JSON you are in control.
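For instance, the two rows from the question could become the following JSON, with the Name key simply left out of the second document:

    [
      {"Name": "Jack", "City": "Houston", "Zip": "89812"},
      {"City": "Austin", "Zip": "98123"}
    ]

Posting that to Solr's JSON update endpoint indexes both documents, and no empty Name field is ever created.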
In my project, I have come to terms with creating relations between documents by storing them in a comma-separated field, e.g. relatedDocIds="2455,4564,7345". Those relations are updated from time to time by a scheduled job that runs through my DB, fetches a record, and updates its Solr document.
I know that instead of using a single comma-separated string field, I could use a multiValued string, where each ID takes one value slot. Due to some limitations of my client API, though, I can only set one value per field at the moment. I have not seen any disadvantages to using it the way I do; queries such as relatedDocIds:2455 resolve exactly the way I want them to. The documentation of multiValued says that it does the same thing.
Am I missing a potential advantage of using multiValued? Is my method OK, and what are its limits? What would be a better and more optimized approach to store those IDs?
You are fine. Under the covers, the indexed form of a multiValued field is converted to a set of tokens, the same as if your tokenizer had split them using that particular tokenizer's rules.
The main difference is that a multiValued field pretends that the end token of one value and the start token of the next value are far apart. That's what positionIncrementGap means (usually 100).
This matters if you want to do a phrase search like "2455,4564". In your case, I believe, it will match, but if you had the IDs as a multiValued field with each value separate, it would not.
And, of course, multiValued fields - if stored - are returned as an array of values. Strings - if stored - are returned exactly as they were given, even if the indexed version has been broken up into tokens.
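To make the gap concrete, here is an illustrative fieldType (the name "text_ids" and the tokenizer choice are assumptions) that splits a comma-separated string into ID tokens; the positionIncrementGap would only come into play between the values of a multiValued field:

    <!-- positions jump by 100 between the values of a multiValued field,
         so a phrase query cannot match across two different values -->
    <fieldType name="text_ids" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- splits "2455,4564,7345" into the tokens 2455, 4564, 7345 -->
        <tokenizer class="solr.PatternTokenizerFactory" pattern=","/>
      </analyzer>
    </fieldType>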
I am trying to index a Wikipedia dump. In order to provide abstracts for the articles (or maybe to enable the highlighting feature in the future) I'd like to store their text without wiki markup. For a first attempt, it would be enough to keep just alphanumeric characters. So the question: is it possible to store the field as filtered at the character level, rather than the original?
There is no way to do this out of the box. If you want Solr to do this, you can create your own UpdateHandler, but this might be a little tricky. The easiest way to do this would be to pre-process the document before sending it to Solr.
Solr by default stores the original field values before the filters are applied by the index-time analyzers for your fieldType. So by default it is not storing the filtered value. However, you have two options for getting the result that you want.
You can apply the same filters to the field at query time as are being applied at index time to remove the wiki markup. Please see Analyzers, Tokenizers and Token Filters on the Solr Wiki for more details.
You can apply the filters to the data in a separate process prior to loading the data into Solr; then Solr will store the filtered values, since you will be passing them in already filtered.
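For option 1, a sketch of a fieldType applying the same character-level cleanup at both index and query time (the name "text_wiki" and the regex are illustrative; the pattern just drops everything non-alphanumeric, as in the question):

    <fieldType name="text_wiki" class="solr.TextField" positionIncrementGap="100">
      <!-- a single <analyzer> applies to both index and query time -->
      <analyzer>
        <!-- charFilters run before the tokenizer; replace anything that
             is not a letter, digit or space with a space -->
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="[^A-Za-z0-9 ]" replacement=" "/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Note that this only affects the indexed tokens; the stored value returned for the field is still the raw markup, which is why option 2 is the one that gives you a clean stored abstract.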
I'm not entirely sure of the vocabulary, but what I'd like to do is send a document (or really just a string) and a bunch of keywords to a Solr server (using SolrNet), and get a response telling me whether the document matches the keywords, without the document being stored or indexed on the server.
Is this possible, and if so, how do I do it?
If not, any suggestions for a better way? The idea is to check whether a document matches before storing it. Could it work to store it first with just a soft commit, and delete it again if it is not a match? How would this affect the index?
Index a document - send it to Solr to be tokenized and analyzed and the resulting strings stored
Store a document - send it to Solr to be stored as-is, without any modifications
So if you want a document to be searchable you need to index it first.
If you want a document (its fields) to be retrievable in its original form, you need to store it.
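In schema.xml terms this is the difference between the indexed and stored attributes on a field (the field names here are only examples):

    <!-- searchable and returned in results -->
    <field name="title" type="text_general" indexed="true" stored="true"/>
    <!-- searchable, but never returned -->
    <field name="body" type="text_general" indexed="true" stored="false"/>
    <!-- returned as-is, but not searchable -->
    <field name="raw_source" type="string" indexed="false" stored="true"/>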
What exactly are you trying to accomplish? Avoid duplicate documents? Can you expand a little bit on your case...
I've set up my first 'installation' of Solr, where each document represents a musical work (with properties like number (int), title (string), version (string), composers (string) and keywords (string)). I've set the field 'title' as the default search field.
However, what do I do when I would like to query all fields? I'd like to give users the opportunity to search in all fields, and as far as I've understood there are at least two options for this:
(1) Specify which fields the query should be made against.
(2) Set up the Solr configuration with copyFields, so that values added to each of the fields are copied to a 'catch-all'-like field which can be used for searching. In this case, however, I am uncertain how things would turn out, considering that the data types are not all the same across the various fields. The fields will, to a lesser or greater degree, go through filters; but since copyField values are taken from their original fields before the values have been run through those fields' filters, I would have to apply one single filter chain to all values in the copy field. This, again, would result in integers being 'filtered' just as strings would.
Is this a case where I should use copyFields? At first glance, it seems a bit more 'flexible' to just search on all fields. However, maybe there's a cost?
All feedback appreciated! Thanks!
When doing a copy field, the data within the destination field will be indexed using the analyzer defined for that field. So if you define the destination field for textual data, it is best to copy only textual data into it. So yes, copying an integer into the same field probably does not make sense. But do you really want the user to be able to search your "number" field in a default search? It makes sense for the title, the composers and the keywords, but maybe not for the integer field, which probably represents an id in your database.
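A sketch of the usual catch-all pattern (field and type names are illustrative): copy only the textual fields into one searchable destination and leave the numeric field out. The destination must be multiValued, since several sources are copied into it:

    <field name="all_text" type="text_general" indexed="true" stored="false"
           multiValued="true"/>
    <copyField source="title" dest="all_text"/>
    <copyField source="composers" dest="all_text"/>
    <copyField source="keywords" dest="all_text"/>
    <!-- "number" is deliberately not copied -->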
Another option to query on all fields is to use DisMax. You can specify exactly which fields you want to query, but also define specific boosts for each of them. You can also define a default sort, add an extra boost for more recent documents, and do many other fancy things.
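As a sketch, a DisMax handler searching the textual fields with per-field boosts might be configured like this in solrconfig.xml (the handler name and boost values are arbitrary):

    <requestHandler name="/search" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="defType">dismax</str>
        <!-- query these fields, weighting title matches highest -->
        <str name="qf">title^2.0 composers^1.5 keywords</str>
      </lst>
    </requestHandler>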