Extending Solr Tutorial with custom fields/core - solr

After standing up a basic jetty Solr example. I've tried to make my own core to represent the data my company will be seeing. I made a directory structure with conf and data directories and copied core.properties, schema.xml, and solrconfig.xml from the collection1 example.
I've editted core.properties to change the core name, and I've added 31 fields (most of type text_general, indexed, stored, not required or multivalued) to the schema.
I'm pretty sure I've set it up correctly as I can see my core in the admin page drop down and interact with it. The problem is, when I feed a document designed for the new fields, I cannot get a successful query for any of the values. I believe the data is fed as I got the same command line response:
"POSTing file incidents.xml...
1 file indexed. ....
COMMITting..."
I thought, the Indexing process took more time, but when I copy a field node out of an example doc (e.g <field name="name">Apple 60 GB iPod with Video Playback Black</field> from ipod_video.xml) into a copy of my file (incidents2.xml) searches on any of those strings instantly succeed.
The best example of my issue is both files have the field:
<field name="Brand" type="text_general" indexed="true" stored="true" required="false" multiValued="false"/>
<field name="Brand">APPLE</field>
However, only the second document (with the aforementioned name field) is returned with a query for apple.
Thanks for reading this far; my questions are:
1) Is there a way to dump the analysis/tokenization phase of document ingestion? Either I don't understand it or the Analysis tab isn't designed for this. The debugQuery=true parameter gives relevance score data but no explanation of why a document was excluded.
2) Once I solve my overall issue, I we would like to have large text fields included in the index, can I wrap long form text in CDATA blocks in solr?
Thanks again.

To debug any query issues in Solr, there's a few useful things to check. You might also want to add the output of your analysis page and the field you're having issues with from your schema.xml to your question. It's also a good idea to have a smaller core to work with (use three or four fields just to get started and get it to work) when trying to debug any indexing issues.
Are the documents actually in the index? - Perform a search for : (q=*:*) to make sure that there are any documents present in the index. *:* is a shortcut that means "give me all documents regardless of value". If there are no documents returned, there is no content in the index, and any attempt to search it will give zero results.
Check the logs - Make sure that SolrLogging is set up, so you get any errors thrown in your log. That way you can see if there's anything in particular going wrong when the query or indexing is taking place, something which would result in the query never being performed or any documents being added to the index.
Use the Analysis page - If you have documents in the index, but they're not returned for the queries you're making, select the field you're querying at the analysis page and add both the value given when indexing (in the index column) and the value used when querying (in the query field). The page will then generate all the steps taken both when indexing and querying, and show you the token stream at each step. If the tokens match, they will be highlighted with a different background color, and depending on your setting, you might require all tokens present on the query side to be present on the indexing side (i.e. every token AND-ed together). Start with searching for a single token on the query side for that reason.
If you still doesn't have any hits, but have the documents in the index, be more specific. :-)
And yes, you can use CDATA.

Related

Why does solr add new document while updating?

curl 'http://localhost/solr/collection/update?commit=true'
-H 'Content-type:application/json'
-d
'[
{
"id":"11111",
"price":{"set":1000}
}
]'
If id:11111 exists, price value is updated. It's ok.
If id:11111 doesn't exist, new document is created in solr index. This behavior is not desirable. I expect error with some text like: document you tried to update does not exist.
I cannot understand what is wrong.
Solr version: 4.8.0.
Part of schema.xml:
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<uniqueKey>id</uniqueKey>
The /update request handler actually updates the index for new and existing documents and handles deletion as well.
During indexation:
A document is considered new if it has no identifier or if its id does not match any of the indexed documents. If no id is generated during indexing and if the uniqueKey field is required, the document is rejected.
A document that has an identifier matching an indexed document is merged with its stored version : all stored fields are loaded from the index and are overriden by field values from the request parameters, and the resulting document replace the previous one (but in the end it is the same operation).
In other word an update request - if not a delete - always ends up in the same add operation. By the way the XML schema recognized by solr.UpdateRequestHandler contains the elements <add>, <doc> and <field> regardless of the operation (add or replace).
Recent versions of Solr provide more options for updating parts of documents. (see atomic updates and in-place updates.
What you describe is the expected behavior. Since the id field is required, Solr will throw an error for document missing this field. In your situation, the document is indexed in both cases because the id is given in both cases.
With this configuration you would have to ensure that id field is empty for what you consider a new document, either client side when preparing the request or server side using an update processor or by updating the request handler implementation. Maybe it would be even simpler to prevent the indexation of any new docs ?
that is how the current implementation of Atomic updates seems to work. I concur it might be desirable to get an error...You should raise the issue in the user mailing list, and see what commiters think, maybe they agree with you that an error should be raised, they'll ask you to open a jira then.
Oh, just noticed the 4.8 version, that is quite old, can you by any chance test the behaviour in current versions?

Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

I am new to Solr and I need to implement a full-text search of some PDF files. The indexing part works out of the box by using bin/post. I can see search results in the admin UI given some queries, though without the matched texts and the context.
Now I am reading this post for the highlighting part. It is for an older version of Solr when managed schema was not available. Before fully understand what it is doing I have some questions:
He defined two fields:
<field name="content" type="text_general" indexed="false" stored="true" multiValued="false"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
But why are there two fields needed? Can I define a field
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
to capture the full text?
How are the fields filled? I don't see relevant information in TikaEntityProcessor's documentation. The current text extractor should already be Tika (I can see
"x_parsed_by":
["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"]
in the returned JSON of some query). But even I define the fields as he said I cannot see them in the search results as keys in JSON.
The _text_ field seems a concatenation of other fields, does it contain the full text? Though it does not seem to be accessible by default.
To be brief, using The Elements of
Statistical Learning as an example, how to highlight the relevant texts for the query "SVM"? And if changing the file name into "The Elements of Statistical Learning - Trevor Hastie.pdf" and post it, how to highlight "Trevor Hastie" for the query "id:Trevor Hastie"?
Before I get started on the questions let me just give a brief how solr works. Solr in its core uses lucene when simply put is a matching engine. It creates inverted indexes of document with the phrases. What this means is for each phrase it has a list of documents which makes it so fast. Getting to your questions:
Solr does not convert your pdf to text,well its the update processor configured in the handler which does it ,again this can be configured in solrconfig.xml or write your own handler here.
Coming back why are there two fields. To simply put the first one(content) is a stored field which stores the data as it is. And the second one is a copyfield which copies the data for each document as per the configuration in schema.xml.
We do this because we can then choose the indexing strategy such as we add a lowercase filter factory to text field so that everything is indexed in lower case. Then "Sam" and "sam" when searched returns the same results.Or remove certain common occurring words such as "a","the" which will unnecessarily increase your index size. Which uses a lot of memory when you are dealing with millions of records, then you want to be careful which fields to index to better utilise the resources.
The field "text" is a copyfield which copies data from certain fields as mentioned in the schema to text field. Then when searching in general one does not need to fire multiple queries for each field. As everything thing is copied into "text" field and you get the result. This is the reason it's "multivaled". As it can stores an array of data. Content is a stored field and text is not,and opposite for indexed because when you return your result to the end user you show him what ever you saved not the stripped down data that you just did with the text field applying multiple filters(such as removing stop words and applying case filters,stemming etc).
This is the reason you do not see "text" field in the search result as this is used solr.
For highlighting see this.
For more these are some great blog yonik and joel.
Hope this helps. :)

Duplicate SOLR Document Issue While Using Overwrite=True

I am having an issue with temporary duplicate documents in my SOLR collection that are causing my user rankings system to be incorrect.
I am using SOLR version 4.8.1 so it is one of the latest builds. I am using XML to update the SOLR collection like described in this SOLR Documentation:
<add overwrite="true" commitWithin="#COMMIT_WITHIN.GLOBAL_VALUE#">
<doc>
<field name="END_USER_ID">#END_USER_ID#</field>
<field name="TARGET_REGION_ID">#TARGET_REGION_ID#</field>
<field name="POPULARITY_RANK">#POPULARITY_RANK#</field>
<field name="VISIBILITY_SCORE">#VISIBILITY_SCORE#</field>
<field name="POPULARITY_VISIBILITY_SCORES_ID">#POPULARITY_VISIBILITY_SCORES_ID#</field>
<cfif #POP_VIS_SCORES_LAST_MODIFIED_DATETIME# NEQ "">
<field name="POPULARITY_VISIBILITY_SCORES_DATE_MODIFIED">#POP_VIS_SCORES_LAST_MODIFIED_DATETIME#</field>
</cfif>
</doc>
</add>
As you can see from the code above, I am using the overwrite parameter (to have newer documents replace previously added documents with the same uniqueKey) in conjunction with the commitWithin parameter (to add the document within a certain time period). The uniqueKey in this case should be END_USER_ID and the time period should be 15 seconds; I have checked to make sure that the uniqueKey is defined in the appropriate schema.xml file and that multiValued is set to false for END_USER_ID.
So on my rankings page, there are several calls to our local SOLR server. For example:
http://localhost:8983/solr/pop_vis_scores/select/?q=TARGET_REGION_ID:50%20AND%20-POPULARITY_RANK:0&version=4.8&start=0&rows=1&indent=off&stats=true&stats.field=POPULARITY_RANK&sort=POPULARITY_RANK%20ASC&fl=[docid],END_USER_ID,POPULARITY_RANK&timeAllowed=8000
From my observations, when the commitWithin is set to 15000 milliseconds, the updated SOLR document is available right away but a duplicate SOLR document exists that reflects the older data. When the commitWithin is set to 500 milliseconds, it seems like the problem does not exist. Having said that, I would theorize the problem is still there but users cannot act quickly enough to see the duplicate documents. When I have thousands of users playing this game, I theorize that this problem may in fact still exist on a larger scale. In addition, it would be nice to set that commitWithin back to 15 seconds when the player base of the game increases.
Anybody face a similar issue before and if so, how would you go by solving it? Anybody have any recommendations? Thanks in advance!
I assumed that when a SOLR document gets added to the collection within that given 15 second time window that the old document would get deleted at the same time as the new one would be inserted into the collection. It appears that this assumption was incorrect. I was able to exclude the user id from my queries to get more accurate statistical values when it came to rankings. For anybody experiencing a similar situation that I was in, I recommend not assuming that SOLR documents get deleted and updated at the same time.

Solr dynamicField not searched in query without field name

I'm experimenting with the Example database in Solr 4.10 and not understanding how dynamicFields work. The schema defines
dynamicField name="*_s" type="string" indexed="true" stored="true"
If I add a new item with a new field name (say "example_s":"goober" in JSON format), a query like
?q=goober
returns no matches, while
?q=example_s:goober
will find the match. What am I missing?
I would like to see the SearchHandler from solrconfig.xml file that you are using to execute the above mentioned query.
In SearchHandler we generally have Default Query Field i.e. qf parameter.
Check that your dynamic field example_s is present in that query field list of solrconfig file else you can pass it while sending query to search handler.
Hope this will help you in resolving your problem.
If you are using the default schema, here's what's happening:
You are probably using default end-point (/select), so you get the definition of search type and parameters from that. Which means, it is default (lucene) search and the field searched is text.
The text field is an aggregate and is populated by copyField instruction from other fields.
Your dynamic field definition for *_s allows you to index the text with any name ending in _s, such as example_s. It's indexed (so you could search against it directly) and stored (so you can see it when you ask for all fields). It will not however search it as a general text. Notice that (differently from ElasticSearch), Solr strings have to be matched fully and completely. If you have some multi-word text in it, there is barely any point searching it. "goober" is one word so it's not a very good example to understand the difference here.
The easiest solution for you is add another copyField instruction:
<copyField source="*_s" dest="text"/>, then all your *_s dynamic fields would also be searchable. But notice that the search analyzers will not be the ones for *_s definition, but the ones for the text field's definition, which is not string, but text_general, defined elsewhere in the file.
As to Solr vs. ElasticSearch, they both err on the different sides of magic. Solr makes you configure the system and makes it very easy to see the exact current configuration. ElasticSearch hides all of the configuration, but you have to rediscover it the second you want to change away from the default behaviour. In the end, the result is probably similar and meets somewhere in the middle.

How to query a specific document by id

From a previous query I already have the document ID (the uniqueKey in this schema is 'track_id') of the document I'm interested in.
Then I would like to query a sequence of words on that document while highlighting the match.
I can't seem to be able to combine the search parameters in a successful way (all my google searches return purple links :\ ), although I've already tried many combinations these past few days. I also know the field where the matches will be if that's any use in terms of improving match speed.
I'm guessing it should be something like this:
/select?q=track_id:{key_i_already_have} AND/&/{part_I_dont_know} word1 word2 word3
Currently, since I can't combine these two search parameters, I'm only querying the words and thus getting several results from several documents.
Thanks in advance.
From Solr 4 you can use the realtime get, which is much more faster than searching the index by id.
http://localhost:8983/solr/get?ids=id1,id2,id3
For index updates to be visible (searchable), some kind of commit must reopen a searcher to a new point-in-time view of the index. The realtime get feature allows retrieval (by unique-key) of the latest version of any documents without the associated cost of reopening a searcher. This is primarily useful when using Solr as a NoSQL data store and not just a search index.
You may try applying Filter Query for id. So it will filter your search query to that id, and then search in that document for all the keywords, and highlight them.
Your query will look like:
/select?fq=track_id:DOC_ID&q=word1 word2 word3
Just make sure your "id" field in schema.xml is defined of the type string to apply filter queries on it.
<field name="id" type="string" indexed="true" stored="true" required="true" />

Resources