I'm using Solr v3.6.1 and have successfully indexed data, including binary items via Apache Tika. I'm using SolrNet to pull this data out. However, I have an issue: I want to link two results together.
Now consider the following XML (this is just for illustration purposes):
<doc>
<id>263</id>
<title>This is the title</title>
<summary>This is the summary</summary>
<binary_id>994832</binary_id>
</doc>
<doc>
<id>994832</id>
<title>This is the title</title>
<summary>This is the summary</summary>
<text>this is the contents of the binary</text>
</doc>
Is it possible (through SolrNet) to merge the two above results together so when a user searches for This is the contents of the binary it also returns the data in the first item?
In my example you can see the first item contains the id of the binary (994832) so my initial thoughts are that I need to do 2 queries and somehow merge them?
Not really sure about how to do this so any help would be greatly appreciated.
You can try a join-style query, but beware of the performance impact. Here is my earlier post, where I was trying to do something similar:
solr grouping based on all index values
Alternatively, a better solution, if and only if you can massage the data a bit before it goes in, would be to assign the same ID to all documents that need to be retrieved as a group. In your example, that means adding a binary_id field to the second doc and assigning it the value 994832. You could then use Solr grouping to group the items as one, and group sorting to return only the item that you want.
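If neither a join nor grouping fits, the two-query idea raised in the question can also be done on the client. The following is only a sketch in plain Python (not SolrNet), assuming the first query has already returned the binary documents and the second query is run with the string built here; the function names and the binary_text key are made up for illustration.

```python
def build_parent_query(binary_hits):
    """Build a query string matching documents whose binary_id points
    at any of the binary documents found by the first query."""
    ids = sorted({str(doc["id"]) for doc in binary_hits})
    if not ids:
        return None  # nothing matched, so there is nothing to look up
    return "binary_id:(" + " OR ".join(ids) + ")"


def merge_results(binary_hits, parent_hits):
    """Attach each binary document's text to the documents that reference it."""
    text_by_id = {str(doc["id"]): doc.get("text") for doc in binary_hits}
    merged = []
    for parent in parent_hits:
        combined = dict(parent)  # copy so the input is left untouched
        # "binary_text" is a hypothetical key used only in this sketch
        combined["binary_text"] = text_by_id.get(str(parent.get("binary_id")))
        merged.append(combined)
    return merged
```

The cost is one extra round trip per search, which may or may not be acceptable compared with reshaping the data at index time.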
After standing up a basic jetty Solr example. I've tried to make my own core to represent the data my company will be seeing. I made a directory structure with conf and data directories and copied core.properties, schema.xml, and solrconfig.xml from the collection1 example.
I've edited core.properties to change the core name, and I've added 31 fields (most of type text_general, indexed, stored, not required or multivalued) to the schema.
I'm pretty sure I've set it up correctly as I can see my core in the admin page drop down and interact with it. The problem is, when I feed a document designed for the new fields, I cannot get a successful query for any of the values. I believe the data is fed as I got the same command line response:
"POSTing file incidents.xml...
1 file indexed. ....
COMMITting..."
I thought the indexing process might just need more time, but when I copy a field node out of an example doc (e.g. <field name="name">Apple 60 GB iPod with Video Playback Black</field> from ipod_video.xml) into a copy of my file (incidents2.xml), searches on any of those strings instantly succeed.
The best example of my issue is both files have the field:
<field name="Brand" type="text_general" indexed="true" stored="true" required="false" multiValued="false"/>
<field name="Brand">APPLE</field>
However, only the second document (with the aforementioned name field) is returned with a query for apple.
Thanks for reading this far; my questions are:
1) Is there a way to dump the analysis/tokenization phase of document ingestion? Either I don't understand it or the Analysis tab isn't designed for this. The debugQuery=true parameter gives relevance score data but no explanation of why a document was excluded.
2) Once I solve my overall issue, we would like to have large text fields included in the index. Can I wrap long-form text in CDATA blocks in Solr?
Thanks again.
To debug any query issues in Solr, there are a few useful things to check. You might also want to add the output of your analysis page, and the definition of the field you're having issues with from your schema.xml, to your question. It's also a good idea to have a smaller core to work with (use three or four fields just to get started and get it to work) when trying to debug any indexing issues.
Are the documents actually in the index? - Perform a search for *:* (q=*:*) to make sure that there are documents present in the index. *:* is a shortcut that means "give me all documents regardless of value". If there are no documents returned, there is no content in the index, and any attempt to search it will give zero results.
Check the logs - Make sure that SolrLogging is set up, so that any errors thrown end up in your log. That way you can see whether anything in particular goes wrong while querying or indexing, something which would result in the query never being performed or no documents being added to the index.
Use the Analysis page - If you have documents in the index, but they're not returned for the queries you're making, select the field you're querying at the analysis page and add both the value given when indexing (in the index column) and the value used when querying (in the query field). The page will then generate all the steps taken both when indexing and querying, and show you the token stream at each step. If the tokens match, they will be highlighted with a different background color, and depending on your setting, you might require all tokens present on the query side to be present on the indexing side (i.e. every token AND-ed together). Start with searching for a single token on the query side for that reason.
If you still don't have any hits, but have the documents in the index, be more specific. :-)
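The first check above (is anything in the index at all?) can be sketched as a small helper. The base URL and core name are hypothetical, and the response shape assumes Solr's JSON response writer (wt=json); the helper only builds the URL and reads the count, it does not perform the HTTP request.

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust host, port and core name to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"


def match_all_url(base_url=SOLR_SELECT_URL):
    """Build a select URL that matches every document but returns no rows."""
    params = {"q": "*:*", "rows": 0, "wt": "json"}
    return base_url + "?" + urlencode(params)


def num_found(response):
    """Read the total hit count from an already parsed Solr JSON response."""
    return response["response"]["numFound"]
```

If num_found reports 0 for the match-all query, the problem is on the indexing side, and there is no point in debugging the query yet.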
And yes, you can use CDATA.
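CDATA is handled by the XML parser before the field value reaches the indexing code, so the field simply receives the enclosed text. As a rough illustration using Python's standard XML parser rather than Solr's own, with a made-up update message and a hypothetical description field:

```python
import xml.etree.ElementTree as ET

# A hypothetical update message; the field name "description" is made up.
UPDATE_XML = """<add>
  <doc>
    <field name="id">incident-1</field>
    <field name="description"><![CDATA[Long text with <markup> & ampersands.]]></field>
  </doc>
</add>"""


def field_values(xml_text):
    """Return a {field name: text} mapping for the first <doc> in an <add> message."""
    root = ET.fromstring(xml_text)
    doc = root.find("doc")
    return {field.get("name"): field.text for field in doc.findall("field")}
```

The CDATA wrapper itself does not show up in the parsed value; it only saves you from escaping characters such as < and & by hand.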
I have people indexed into solr based on documents that they have authored. For simplicity's sake, let's say they have three fields - an integer ID, a Text field and a floating point 'SpecialRank' (a value between 0 and 1 to indicate how great the person is). Relevance matching in solr is all done through the Text field. However, I want my final result list to be a combination of relevance to the query as provided by solr and my own SpecialRank. Namely, I need to re-rank the results based on the following formula:
finalScore = (0.8 * solrScore) + (0.2 * SpecialScore)
As far as I'm aware, this is a common task in information retrieval, as we are just combining two different scores in a weighted manner. The trouble is, I need solrScore to be normalized for this to work. What I have been doing is normalizing the solrScore based on the maxScore for a particular query and re-ranking the results client-side. This has been working OK, but means I have to retrieve all the matching documents from solr before I do my re-ranking.
I am looking for the best way to have solr take care of this re-ranking. Are boost functions able to help here? I have read that they can be multiplicative or additive to the solr score, but since the solr score is not normalized and all over the place depending on different queries, this doesn't really seem to solve my problem. Another approach I have tried is to first query solr for a single document just to get the maxScore, and then use the following formula for the sort:
sum(product(0.8,div(score,maxScore)),product(0.2,SpecialRank))+desc
This, of course, doesn't work as you're unable to use the score as a variable in a sort function.
Am I crazy here? Surely this is a common enough task in IR. I've been banging my head against the wall for a while now, any ideas would be much appreciated.
You could try to implement a custom SearchComponent that goes through the results in Solr and calculates your custom score there. Get the results from the ResponseBuilder (rb.getResults().docSet), iterate through them, add the calculated value to your results and re-sort them.
You can then register your SearchComponent as the last one in the RequestHandler chain (replace elevator with the name of your component):
<arr name="last-components">
<str>elevator</str>
</arr>
More info in the Solr wiki:
http://wiki.apache.org/solr/SearchComponent
Sorry, but no better idea for now.
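Wherever the re-ranking ends up running, the scoring step itself is small. Here is a sketch of the formula from the question in plain Python (not Solr's Java API), assuming each result is a dict with score and SpecialRank keys and that scores are normalized by the highest score in the result set, as the question describes:

```python
def rerank(docs, score_weight=0.8, rank_weight=0.2):
    """Return docs sorted by a weighted mix of normalized score and SpecialRank."""
    if not docs:
        return []
    max_score = max(doc["score"] for doc in docs)
    reranked = []
    for doc in docs:
        # Normalize against the best score of this query; guard against 0.
        normalized = doc["score"] / max_score if max_score > 0 else 0.0
        final = score_weight * normalized + rank_weight * doc["SpecialRank"]
        reranked.append(dict(doc, finalScore=final))
    # Highest combined score first
    reranked.sort(key=lambda doc: doc["finalScore"], reverse=True)
    return reranked
```

Note that the normalization needs the maximum score over all matches, which is why this approach has to see the full result set before it can order anything.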
I'm trying to figure out how to store and search data pairs. I have a document similar to that below and my goal is to perform a search that returns all documents with a given specialty and then sort the results by the matching specialty ability:
<doc>
<id>123</id>
<firstName>Joe</firstName>
<lastName>Bloggs</lastName>
<specialties>
<specialty>
<type>Foo</type>
<ability>1</ability>
</specialty>
<specialty>
<type>Bar</type>
<ability>2</ability>
</specialty>
<specialty>
<type>Baz</type>
<ability>2</ability>
</specialty>
</specialties>
</doc>
I'm familiar with indexing, searching and faceting simple documents, but I am struggling to even find a starting point with this =(
Should I simply use two collections and join?
If the number of specialties is finite and known beforehand, you may try the following.
Instead of having two fields storing specialty and ability, just have ONE field containing "ability_of_a_specialty"
For example,
<specialties>
<Foo_ability> 1 </Foo_ability>
<Bar_ability> 2 </Bar_ability>
<Dummy_ability> 0 </Dummy_ability>
...
</specialties>
Now, it should be straightforward to transform the above attributes to a Lucene doc.
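As a rough sketch of that transformation in Python, assuming the nested document has already been parsed into a dict, and assuming a higher ability value should sort first (the question does not say which direction is better). The helper names are made up, and the *_ability fields would need matching definitions in the schema.

```python
def flatten_specialties(doc):
    """Turn nested specialty/ability pairs into flat <type>_ability fields."""
    flat = {key: value for key, value in doc.items() if key != "specialties"}
    for specialty in doc.get("specialties", []):
        flat[specialty["type"] + "_ability"] = int(specialty["ability"])
    return flat


def specialty_query_params(specialty):
    """Query parameters: documents with the given specialty, best ability first."""
    field = specialty + "_ability"
    # [1 TO *] skips documents that store 0 to mean "does not have it"
    return {"q": field + ":[1 TO *]", "sort": field + " desc"}
```

For the document in the question, flatten_specialties yields Foo_ability, Bar_ability and Baz_ability next to id, firstName and lastName.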
I have solr documents that look like this:
<doc>
<str name="some_attribute">some_attribute_value</str>
<!-- ... -->
<arr name="locationCoordinates">
<str>48.117,11.539</str>
<str>23.423,11.342</str>
<!-- ... -->
</arr>
</doc>
My question is whether it's possible to filter the returned fields of a document to only return certain values, for example to only return the locationCoordinates that are within a 50 km range of another point and leave the others out.
I.e. return the above document, but with only the first locationCoordinates.
I don't really know whether this should even be possible in Solr (because of the document-oriented structure), but I can at least ask :).
Maybe I should also elaborate on the way I want to use this feature and alternatives I "found" for this:
change the document design to create one document per location (Pro: works, Cons: need to check for duplicates on the client side, heaps of duplicated data in the Solr-database)
leave it with this structure (Pro: works, don't have to change the current structure, Cons: I have to sort out the correct coordinates myself (on the client) and therefore run into problems with distance calculation (I already filter the documents by distance beforehand, and may lose some data if I compute the distance badly on the client side))
create a new Document "type" for the locations (and their names, etc.) on the Solr-side and use a foreign-key-like structure to add the locations to the articles and in order to compute distances I have to query for reachable locations first and then join on the articles (Pro: everything works on the solr-side, Cons: I will need Solr-Joins for that)
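For the second option, the client-side filtering could look roughly like the sketch below. It uses the haversine formula with a mean Earth radius, which is an assumption and may not agree exactly with whatever distance function Solr applied when it pre-filtered the documents, so borderline points could be classified differently.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def coordinates_within(coords, center, radius_km):
    """Keep only the "lat,lon" strings within radius_km of center (lat, lon)."""
    kept = []
    for value in coords:
        lat, lon = (float(part) for part in value.split(","))
        if haversine_km(lat, lon, center[0], center[1]) <= radius_km:
            kept.append(value)
    return kept
```

Applied to the locationCoordinates array of a returned document, this keeps the values near the reference point and drops the rest, leaving the index structure unchanged.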
change the document design to create one document per location (Pro: works, Cons: need to check for duplicates on the client side, heaps of duplicated data in the Solr-database)
If this is your only concern, you could consider making locationCoordinates the unique key (uniqueKey) of each document.
By doing this, you do not allow duplicate locationCoordinates in your index, which eliminates the need to check for duplicates on the client side.
I implemented the Solr SpellCheckComponent based on the documentation at http://wiki.apache.org/solr/SpellCheckComponent and it works well. But I am trying to filter the spell check results based on some other filter. Consider the following schema:
product_name
product_text
product_category
product_spell -> copied from product_name and product_text, and tokenized using a whitespace analyzer
For the above schema, I am trying to filter the spell check results by a provided category. I tried a query like http://127.0.0.1:8080/solr/colr1/myspellcheck/?q=product_category:160%20appl&spellcheck=true&spellcheck.extendedResults=true&spellcheck.collate=true . The spellcheck results do not take product_category:160 into account.
Is it because the dictionary was built for all the categories? If so, is it a good idea to create a dictionary for every category?
Is it not possible to have another filter condition in the spellcheck component?
I am using Solr 3.5.
I previously understood from the SOLR-2010 issue that filtering through the fq parameter should be possible using collation, but it isn't; I think I misunderstood.
In fact, the SpellCheckComponent most likely has a separate index, except for the DirectSolrSpellChecker implementation. This means the field you select is indexed in a different index, which contains only the information about the specific field you chose for spelling corrections.
If you're curious, you can have a look at what that additional index looks like using Luke, since it is of course a Lucene index. Unfortunately, filtering using other fields isn't an option there, simply because there is only one field in it: the one you use to make spelling corrections.