I've set up my first 'installation' of Solr, where each index (document) represents a musical work (with properties like number (int), title (string), version (string), composers (string) and keywords (string)). I've set the field 'title' as the default search field.
However, what do I do when I would like to do a query on all fields? I'd like to give users the opportunity to search in all fields, and as far as I've understood there are at least two options for this:
(1) Specify which fields the query should be made against.
(2) Set up the Solr configuration with copyFields, so that values added to each of the fields will be copied to a 'catch-all'-like field which can be used for searching. However, in this case, I am uncertain how things would turn out, given that the data types are not all the same for the various fields. The various fields will to a lesser or greater degree go through filters, but as copyField values are taken from their original fields before those fields' filters have run, I would have to apply one single filter chain to all values in the catch-all field. This, again, would result in integers being 'filtered' just as strings would.
Is this a case where I should use copyFields? At first glance, it seems a bit more 'flexible' to rather just search on all fields. However, maybe there's a cost?
All feedback appreciated! Thanks!
When doing a copy field, the data within the destination field will be indexed using the analyzer defined for that field. So if you define the destination field to be textual data, it is best to copy only textual data into it. So yes, copying an integer into the same field probably does not make sense. But do you really want the user to be able to search your "number" field in a default search? It makes sense for the title, the composers and the keywords, but maybe not for the integer field, which probably represents the id in your database.
Another option to query on all fields is to use Dismax. You can specify exactly which fields you want to query, and also define specific boosts for each of them. You can also define a default sort, add an extra boost for more recent documents, and many other fancy things.
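As a rough illustration of the Dismax suggestion, here is a minimal Python sketch that builds the request parameters for such a query. The field names come from the question; the boost values and the sample query are made up for the example, and the helper name `build_dismax_params` is hypothetical. `defType` and `qf` are the standard Dismax parameters.

```python
from urllib.parse import urlencode


def build_dismax_params(user_query, field_boosts):
    """Return URL-encoded Solr parameters for a Dismax search.

    field_boosts maps field name -> boost; the values used below are
    illustrative assumptions, not recommendations.
    """
    # qf expects space-separated "field^boost" entries.
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    return urlencode({"defType": "dismax", "q": user_query, "qf": qf})


# Search title, composers and keywords, but leave out the integer "number" field.
params = build_dismax_params(
    "moonlight sonata",
    {"title": 3.0, "composers": 2.0, "keywords": 1.0},
)
print(params)
```

The resulting string would be appended to the select URL of your own Solr core. Leaving the integer field out of `qf` sidesteps the mixed-type problem of a catch-all field.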
We are indexing parties in our project, which have names, alternate names, different identifiers, addresses and so on. We would like to have STRICT exact search functionality using single/double inverted commas, in addition to the usual search functionality (without inverted commas).
In order to achieve that, we configured two different search handlers and switch between them based on the presence of inverted commas in the user input. We also indexed each of the party attributes mentioned above twice: once using KeywordTokenizerFactory (for STRICT exact match search) and once using StandardTokenizerFactory (for the usual search).
But the problem is that we doubled the number of fields in the Solr index and, naturally, its size.
So the question is: is it possible to implement both types of search with only one field in the Solr index per party attribute?
If you had implemented the same functionality using a single field, you'd still have more or less the same amount of data in the index. The tokens you're searching against still have to be present and stored somewhere, and you'd end up with a confusing situation where it'd be very hard to score and rank hits in the different "types" contained in the same field (which, for all purposes, would be two fields, just with the same name.. so .. it's two fields..)
Using two fields as you currently are is the way to do this. But remember, you don't have to store content for all the fields (use stored="false" for fields that have identical values to other fields). The stored value would be identical for both/all fields, so just display the value from the first field, but search against them both / just the first / just the second.
Another option to reduce index size is to store only the id field and not store any other fields. Retrieve any other values from a primary data storage by looking up the id from the hit afterwards.
There are also many options you can disable for specific fields - which may not be needed depending on how you're using the field, such as termVectors, etc.
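The switching between the exact and the tokenized variant described in the question could look roughly like this on the client side. This is only a sketch: the function name `route_query` and the `_exact` / `_text` field suffixes are hypothetical naming conventions, not anything Solr prescribes.

```python
def route_query(user_input):
    """Pick the field variant to search based on surrounding quotes.

    Returns (field_suffix, term). The suffixes "_exact" (keyword-tokenized
    field) and "_text" (standard-tokenized field) are assumed names.
    """
    text = user_input.strip()
    for quote in ('"', "'"):
        # Input wrapped in matching single or double quotes means strict search.
        if len(text) >= 2 and text.startswith(quote) and text.endswith(quote):
            return "_exact", text[1:-1]
    return "_text", text


print(route_query('"ACME Ltd"'))
print(route_query("ACME Ltd"))
```

With this split, only one of the two variants per attribute needs stored="true"; the other can be indexed only.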
So I've got a comma separated value field (technically a textfield, but all of the values will be formatted as CSV) in Drupal which will be submitted to an Apache Solr query document.
The values will be a list of keywords, for example something like this (but not necessarily this):
productid, nameofproduct, randomattribute1, randomattribute2, etc, etc2
How would I best get Solr to process each of these? Do I need to create a separate string field for each of them, or is there any way for Apache Solr to process what is essentially an array of values as a single field?
I'm not seeing any documentation on the dynamic fields that allows this, but it seems like a common enough use case that it would be usable.
So in short, is there any way to use a field of CSV in Solr, or do I have to separate each value into a separate field for indexing?
If you are just looking for arrays, see the 'multiValued' attribute of a field. More on field attributes here. It is difficult to say from your question what the right schema is. See
/Solr_Directory/example/solr/collection1/conf/schema.xml
The file can be used as a starting point and contains various combinations of fields.
Also look at this question. The answer shows how to split string by comma and store.
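If the splitting is done on the client before indexing, a minimal Python sketch could look like this. It assumes a field named `keywords` declared as multiValued in the schema; that field name, the document id and the helper `csv_to_multivalued` are placeholders for the example.

```python
import csv
import json


def csv_to_multivalued(doc_id, csv_text, field="keywords"):
    """Turn one CSV string into a document with a list-valued field.

    "keywords" is an assumed field name; it would have to be declared
    multiValued="true" in the schema.
    """
    # csv.reader also copes with quoted values that contain commas.
    row = next(csv.reader([csv_text], skipinitialspace=True))
    values = [value.strip() for value in row if value.strip()]
    return {"id": doc_id, field: values}


doc = csv_to_multivalued(
    "42",
    "productid, nameofproduct, randomattribute1, randomattribute2, etc, etc2",
)
print(json.dumps(doc))
```

A document shaped like this can be sent to Solr's JSON update handler, where a list is treated as multiple values of the same field.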
Edit
Can Solr do fuzzy field collapsing, i.e. collapsing fields that have similar values, rather than identical ones?
I'd assumed that it could, but now I'm not sure, which makes my original question below invalid.
Original Question
For a large given set of values I need to decide which is the most prevalent. The set of all values will change over time, and so I can expect that the output may change over time too.
I gather Solr can do "field collapsing" to group results by a given field, with a tolerance of similarity. Would it be possible, nay, even appropriate, to use Solr solely to collapse fields, to derive the most common value? We use Solr in other parts of the business, and it would be good to leverage existing code rather than home-brewing a custom solution.
No, Solr does not support fuzzy collapsing (at least not based on what is documented on the wiki).
Solr 4.0 supports group.func which allows you to group results based on the result of a FunctionQuery, so it's possible that at some point in time a function could be created to get you approximately what you want, but none of the existing functions will do what you want.
However, Solr does support result clustering, which may work for your use case. Clustering is done with Carrot2. If you limit the fields used by Carrot2 to a single field, you may get a result similar to "fuzzy collapsing", but you have far less control over what Carrot2 does than you do with field collapsing.
For a normal document you might want all your fields analyzed by carrot, e.g.:
carrot.title=my_title&carrot.snippet=my_title,my_description
But if you have, for example, a manufacturer field with slight variations of spelling or punctuation, it might work to only give carrot a single field for both title and snippet:
carrot.title=manufacturer&carrot.snippet=manufacturer
I am trying to index Wikipedia's dump. In order to provide abstracts for the articles (or, maybe, enable the highlighting feature in the future) I'd like to store their text without WikiMarkup. As a first try, it would be enough for me to keep just the alphanumeric symbols. So the question is: is it possible to store the field as filtered at the character level, rather than the original one?
There is no way to do this out of the box. If you want Solr to do this, you can create your own UpdateHandler, but this might be a little tricky. The easiest way to do this would be to pre-process the document before sending it to Solr.
Solr by default stores the original field values, before the filters are applied by the index-time analyzers for your fieldType. So by default it is not storing the filtered value. However, you have two options for getting the result that you want.
You can apply the same filters to the field at query time as are being applied at index time to remove the wiki markup. Please see Analyzers, Tokenizers and Token Filters on the Solr Wiki for more details.
You can apply the filters to the data in a separate process prior to loading the data into Solr, then Solr will store the filtered values, since you will be passing them in already in a filtered state.
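A pre-processing step of the kind described could be as small as the following Python sketch. It implements only the "keep just alphanumeric symbols" first try from the question; it is not a real WikiMarkup parser, so link targets and template names would survive as plain words. The function name `keep_alphanumeric` is made up for the example.

```python
import re


def keep_alphanumeric(text):
    """Replace every run of non-alphanumeric characters with one space.

    This is a crude filter, not a WikiMarkup parser: it drops the
    punctuation but keeps the words inside links and templates.
    """
    # \W matches non-word characters (Unicode-aware); "_" is added
    # explicitly because \w counts the underscore as a word character.
    return re.sub(r"[\W_]+", " ", text).strip()


print(keep_alphanumeric("'''Solr''' is a [[search engine]]!"))
```

The cleaned string is what you would send to Solr as the stored field value.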
I need to navigate back and forth in a Solr result set ordered by score, viewing documents one by one. To visualise that: first a list of document titles is presented to the user, then he or she can click one of the titles to see more details, and then needs to be able to move to the next document in the original list without going back and clicking another title.
While being viewed, documents get changed: a dynamic field is modified (or created if it does not exist yet) to mark that the document has already been viewed (this is used in another search).
The problem I face is that when the document is altered and re-indexed to keep those changes, sometimes (and not always, which is very disturbing) its place in the result set for the same query changes (in other words, its score changes, as that doesn't happen when browsing results sorted by one of the documents' fields). So, "Previous" / "Next" navigation doesn't work properly.
I'm not using any custom weighting or boosters on fields for score calculation. Also, that dynamic field changed during browsing doesn't participate in the query used to get the record set browsed.
So, the questions are: can the modification of the document's field not included in the query change its relevance score? And if it can, then how can I control that?
UPDATE
I did some tests and can add the following:
Document changes its place in the result set even if no field is amended - just requesting the document and re-indexing it without any changes to its fields makes it take another place next time the same query over the same index is executed.
That happens even if the result set is sorted explicitly ("first_name DESC"), so score (which depends on the update date) is not involved. The document stays the same, the field the result set is sorted by is the same, yet its position changes.
Still have no idea how to avoid that.
In Solr, if your field is "indexed", it will have an effect on the relevancy ranking ("stored" fields show up in search results but are not necessarily searchable). If the fields in question aren't marked as indexed then you are good to go. Note that "indexed" and "stored" are not necessarily the same, hence your confusion about result lists changing even though not all fields are shown (a field can be "indexed" and not "stored" as well).
In this case I think you want your "viewed" field to be "stored" but not "indexed". If you really want to control the query, you can use copyField to copy the relevant results into a single searchable field. You can also boost terms or documents so that certain fields are "less important" to the search query.
If you want to see how the relevancy rankings are calculated, you can add "debugQuery=on" to the end of your Solr Query (see the Relevancy FAQ for more info).
However, all that being said, I would recommend you cache your search result query (at least for the first page for your results), since you will always have results changing (documents added, removed by other users, etc). Your best bet is to design a UI that anticipates this, or at least batches a user's query.
I've found a solution which doesn't eliminate the problem completely but makes it much less likely to happen.
So the problem happens when the documents are sorted by some field and there is a number of them with the same value in this field (e.g. result set is sorted by first name, and there are 100 entries for "John").
This is where the index time gets involved: apparently Solr uses it to order the documents when their main sorting fields are identical. To make this case much less probable, you need to add more sorting fields, e.g. "first_name desc" should become "first_name desc, last_name desc, register_date asc".
Also, adding the document's unique id as the last sorting field should remove the problem completely (the set of sorting field values will never be identical for any two documents in the index).
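A small Python sketch of that idea: build the sort parameter from the fields you care about and always append the unique key as the final tie-breaker. The helper name `build_sort` is made up, and the unique key is assumed to be called `id`; adjust it to whatever your schema declares.

```python
from urllib.parse import urlencode


def build_sort(sort_fields, unique_key="id"):
    """Build a Solr sort string that always ends with a unique tie-breaker.

    sort_fields is a list of (field, direction) pairs. "id" is an assumed
    unique key name.
    """
    clauses = [f"{name} {direction}" for name, direction in sort_fields]
    # Append the unique key only if the caller did not already sort on it.
    if all(name != unique_key for name, _ in sort_fields):
        clauses.append(f"{unique_key} asc")
    return ", ".join(clauses)


sort = build_sort(
    [("first_name", "desc"), ("last_name", "desc"), ("register_date", "asc")]
)
print(urlencode({"q": "*:*", "sort": sort}))
```

Because no two documents share the unique key, the order stays the same between requests as long as the sort field values themselves do not change.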