Using a comma-separated string of IDs vs. a multiValued field - Solr

In my project, I have settled on creating relations between documents by storing related IDs in a comma-separated field, e.g.: relatedDocIds="2455,4564,7345". Those relations are updated from time to time by a scheduled job that runs through my DB, fetches a record, and updates its Solr document.
I know that instead of using a single comma-separated string field, I could use a multiValued field, where each ID would occupy its own value slot. Due to some limitations of my client API, though, I can only set one value per field at the moment. I have not seen any disadvantages to doing it the way I do: queries such as relatedDocIds:2455 resolve exactly the way I want them to. The documentation for multiValued says that it does the same thing.
Am I missing a potential advantage of using multiValued? Is my method OK, and what are its limits? What would be a better and more optimized approach to store those IDs?

You are fine. Under the covers, the indexed form of a multiValued field is converted to a stream of tokens, the same as if your tokenizer had split a single string according to its own rules.
The main difference is that a multiValued field pretends that the end token of one value and the start token of the next value are far apart. That is what positionIncrementGap controls (usually 100).
This matters if you want to do a phrase search like "2455,4564". In your case, I believe, it will match; but if you had the IDs in a multiValued field with each value separate, it would not.
And, of course, multiValued fields - if stored - are returned as an array of values. Strings - if stored - are returned exactly as they were given, even if the indexed version has been broken up into tokens.
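To make the positionIncrementGap point concrete, here is a small Python sketch (not Solr code; the position arithmetic is deliberately simplified) of how token positions differ between a single comma-separated string and a multiValued field:

```python
# Simulate token positions for the two indexing approaches.

def positions_single_string(value):
    # One value "2455,4564,7345" tokenized on commas: consecutive positions.
    tokens = value.split(",")
    return {tok: i for i, tok in enumerate(tokens)}

def positions_multivalued(values, gap=100):
    # Each new value starts `gap` positions after the previous one ends,
    # mimicking positionIncrementGap="100".
    positions = {}
    pos = 0
    for value in values:
        positions[value] = pos
        pos += 1 + gap
    return positions

def phrase_matches(positions, a, b):
    # A two-term phrase query matches only if b sits directly after a.
    return positions.get(b) == positions.get(a, -999) + 1

single = positions_single_string("2455,4564,7345")
multi = positions_multivalued(["2455", "4564", "7345"])

print(phrase_matches(single, "2455", "4564"))  # True: tokens are adjacent
print(phrase_matches(multi, "2455", "4564"))   # False: 100-position gap
```

The gap is why a phrase query can bridge two IDs inside one comma-separated string, but not two separate values of a multiValued field.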

Related

How to implement usual and exact match search based on the same fields in index?

In our project we are indexing parties, which have names, alternate names, various identifiers, addresses, and so on. We would like to have STRICT exact-match search functionality, triggered by single/double quotation marks, alongside the usual search functionality (without quotation marks).
To achieve that, we configured two different search handlers and switch between them based on the presence of quotation marks in the user input. We also indexed each of the mentioned party attributes twice: once with KeywordTokenizerFactory (for STRICT exact-match search) and once with StandardTokenizerFactory (for the usual search).
But the problem is that we have doubled the number of fields in the Solr index, and naturally its size.
So the question: is it possible to implement both types of search with just one field per party attribute in the Solr index?
If you implemented the same functionality using a single field, you'd still have more or less the same amount of data in the index. The tokens you're searching against still have to be present and stored somewhere, and you'd end up with a confusing situation where it would be very hard to score and rank hits against the different "types" contained in the same field (which, for all practical purposes, would be two fields that just happen to share a name; so it's still two fields).
Using two fields as you currently do is the way to handle this. But remember, you don't have to store content for all of the fields (use stored="false" for fields whose values are identical to other fields' values). Since the value is identical for both/all fields, just display the value from the first field, but search against them both / just the first / just the second.
Another option to reduce index size is to store only the id of the document and nothing else, then retrieve any values from your primary data store by looking up the id of each hit afterwards.
There are also several options you can disable for specific fields, such as termVectors, which may not be needed depending on how you're using the field.
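As a sketch, the two-field setup with an unstored exact-match copy might look like this in schema.xml (field and type names here are illustrative, not from the question):

```xml
<!-- Hypothetical schema.xml fragment. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="text_exact" class="solr.TextField">
  <analyzer>
    <!-- Keeps the whole value as one token for strict matching. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Usual search: indexed, and stored for display. -->
<field name="name" type="text_general" indexed="true" stored="true"/>
<!-- Strict exact match: indexed only; display the stored value from "name". -->
<field name="name_exact" type="text_exact" indexed="true" stored="false"/>

<!-- Duplicate the incoming value into the exact-match field at index time. -->
<copyField source="name" dest="name_exact"/>
```

Only the indexed tokens of name_exact take up space; its display value comes from the stored name field.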

How are Solr's stored and indexed fields stored internally (in Lucene)

In Solr, when I set a field as 'indexed' but not 'stored', it is still stored in the index. If I go the other way around and set the field as 'stored' but not 'indexed', it is also stored in the index, if I understand correctly.
My question is: how is the document stored internally in Lucene in these cases? What do 'stored' fields look like in Lucene internally, and what do 'indexed' fields look like?
The answer to this question will perhaps help me understand why atomic updates in Solr only work with stored fields and not indexed fields (as explained here: https://wiki.apache.org/solr/Atomic_Updates#Stored_Values).
In Solr/Lucene, indexed and stored are two different concepts.
indexed means the field value will be saved in the inverted index, so you can search on it at query time. But you can't see it in the search result documents.
stored means the value will be saved in the stored-field part of the index, not in the inverted index. It cannot be searched, but it can be displayed when you retrieve the search result documents.
Actually, the way Solr performs an update is: it takes out the whole document (stored fields only), changes the value you want to update, and saves it back (re-indexing it). That's why atomic updates only work with stored fields.
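For illustration, an atomic update request sends only the changed field, and Solr reconstructs the rest of the document from its stored fields before re-indexing. A minimal example in Solr's XML update format (the document id and field name are hypothetical):

```xml
<!-- Atomic update: only "price" is sent. Solr reads the stored fields of
     document 1, applies the "set" operation, and re-indexes the whole
     document. Any field that was indexed but not stored would be lost,
     which is why atomic updates require stored fields. -->
<add>
  <doc>
    <field name="id">1</field>
    <field name="price" update="set">10.0</field>
  </doc>
</add>
```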

CSV field in Solr

So I've got a comma-separated-value field (technically a text field, but all of its values will be formatted as CSV) in Drupal, which will be submitted to an Apache Solr query document.
The values will be a list of keywords, for example something like this (but not necessarily this):
productid, nameofproduct, randomattribute1, randomattribute2, etc, etc2
How would I best get Solr to process each of these? Do I need to create a separate string field for each of them, or is there any way for Apache Solr to process what is essentially an array of values as a single field?
I'm not seeing anything in the documentation on dynamic fields that allows this, but it seems like a common enough use case that it would be supported.
So, in short, is there any way to use a field of CSV in Solr, or do I have to separate each value into its own field for indexing?
If you are just looking for arrays, see the 'multiValued' attribute of field. More on field attributes here. It is difficult to say what the right schema is from your question. See
/Solr_Directory/example/solr/collection1/conf/schema.xml
The file can be used as a starting point and contains various combinations of fields.
Also look at this question. The answer shows how to split a string by comma and store the parts.
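For example, one way to split comma-separated input into individual tokens at index time is a field type built on PatternTokenizerFactory. This is a sketch, and the type and field names are illustrative:

```xml
<!-- Hypothetical field type that splits incoming comma-separated values
     into individual tokens at index time. -->
<fieldType name="csv_keywords" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split on commas, swallowing any whitespace after each comma. -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern=",\s*"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="keywords" type="csv_keywords" indexed="true" stored="true"/>
```

With this, a single submitted string like "productid, nameofproduct, randomattribute1" becomes three searchable tokens, while the stored value is returned as the original string.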

Can Solr be used to match a document against keywords without storing the document?

I'm not entirely sure of the vocabulary, but what I'd like to do is send a document (or really just a string) and a bunch of keywords to a Solr server (using SolrNet), and get back whether or not the document is a match for the keywords, without the document being stored or indexed on the server.
Is this possible, and if so, how do I do it?
If not, any suggestions for a better way? The idea is to check whether a document is a match before storing it. Could it work to store it first with just a soft commit and, if it is not a match, delete it again? How would this affect the index?
Index a document - send it to Solr to be tokenized and analyzed, with the resulting tokens saved in the inverted index.
Store a document - send it to Solr to be stored as-is, without any modifications.
So if you want a document to be searchable, you need to index it first.
If you want a document (its fields) to be retrievable in its original form, you need to store it.
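The distinction above can be sketched as two hypothetical field definitions in schema.xml:

```xml
<!-- Searchable, but never returned in results: -->
<field name="content" type="text_general" indexed="true" stored="false"/>
<!-- Returned verbatim in results, but not directly searchable: -->
<field name="raw_payload" type="string" indexed="false" stored="true"/>
```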
What exactly are you trying to accomplish? Avoiding duplicate documents? Can you expand a little bit on your case?

Solr : how do i index and search several fields?

I've set up my first 'installation' of Solr, where each document represents a musical work (with properties like number (int), title (string), version (string), composers (string), and keywords (string)). I've set the 'title' field as the default search field.
However, what do I do when I would like to query all fields? I'd like to give users the opportunity to search in all fields, and as far as I've understood there are at least two options for this:
(1) Specify which fields the query should be made against.
(2) Set up the Solr configuration with copyFields, so that values added to each of the fields are copied to a 'catch-all'-like field that can be used for searching. In this case, however, I am uncertain how things would turn out given that the data types are not all the same across the various fields. The various fields will, to a lesser or greater degree, go through filters, but since copyField values are taken from their original fields before those values have been run through the original fields' filters, I would have to apply one single filter chain to all values in the copyField destination. This, again, would result in integers being 'filtered' just as strings would.
Is this a case where I should use copyFields? At first glance, it seems a bit more 'flexible' to just search on all fields. However, maybe there's a cost?
All feedback appreciated! Thanks!
When you use a copy field, the data in the destination field is indexed using the analyzer defined for that field. So if you define the destination field to hold textual data, it is best to copy only textual data into it. So yes, copying an integer into the same field probably does not make sense. But do you really want the user to be able to search your "number" field in a default search? It makes sense for the title, the composer and the keywords, but maybe not for an integer field that probably represents an id in your database.
Another option for querying on all fields is to use DisMax. You can specify exactly which fields you want to query, and also define specific boosts for each of them. You can also define a default sort, add an extra boost for more recent documents, and do many other fancy things.
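As a sketch, a DisMax handler with per-field boosts and a recency boost might be configured in solrconfig.xml like this (the handler name, field names, and boost values are illustrative, including the assumed "released" date field):

```xml
<!-- Hypothetical handler using the DisMax query parser. -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- Query these fields, boosting title matches highest. -->
    <str name="qf">title^3 composers^2 keywords version</str>
    <!-- Boost more recent documents via a reciprocal of document age. -->
    <str name="bf">recip(ms(NOW,released),3.16e-11,1,1)</str>
  </lst>
</requestHandler>
```

Note that the integer "number" field is simply left out of qf, so it never participates in the default search.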
