How to query a specific document by id - solr

From a previous query I already have the document ID (the uniqueKey in this schema is 'track_id') of the document I'm interested in.
Then I would like to query a sequence of words on that document while highlighting the match.
I can't seem to be able to combine the search parameters in a successful way (all my google searches return purple links :\ ), although I've already tried many combinations these past few days. I also know the field where the matches will be if that's any use in terms of improving match speed.
I'm guessing it should be something like this:
/select?q=track_id:{key_i_already_have} AND/&/{part_I_dont_know} word1 word2 word3
Currently, since I can't combine these two search parameters, I'm only querying the words and thus getting several results from several documents.
Thanks in advance.

From Solr 4 you can use the realtime get, which is much faster than searching the index by id.
http://localhost:8983/solr/get?ids=id1,id2,id3
For index updates to be visible (searchable), some kind of commit must reopen a searcher to a new point-in-time view of the index. The realtime get feature allows retrieval (by unique-key) of the latest version of any documents without the associated cost of reopening a searcher. This is primarily useful when using Solr as a NoSQL data store and not just a search index.
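As a minimal sketch of building that request from code (the helper name realtime_get_url is mine, not part of Solr, and the base URL is just the one from the example above):

```python
from urllib.parse import urlencode


def realtime_get_url(base_url, doc_ids):
    """Build a realtime-get URL that fetches documents by their uniqueKey."""
    # 'ids' takes a comma-separated list of uniqueKey values.
    # safe="," keeps the commas readable instead of encoding them as %2C.
    query = urlencode({"ids": ",".join(doc_ids)}, safe=",")
    return f"{base_url}/get?{query}"


url = realtime_get_url("http://localhost:8983/solr", ["id1", "id2", "id3"])
print(url)
```

Note that this only retrieves the documents by key. As far as I know it does not run a keyword search inside them, so on its own it will not give you the highlighting you asked for.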

You can try applying a filter query (fq) on the id. It restricts your search to that one document, so the keywords are searched and highlighted only within it.
Your query will look like:
/select?fq=track_id:DOC_ID&q=word1 word2 word3
Just make sure your "id" field (track_id in your case) is defined with the type string in schema.xml so that you can apply filter queries on it.
<field name="id" type="string" indexed="true" stored="true" required="true" />
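Here is a rough sketch of how the full request could be assembled, with highlighting switched on through the standard hl and hl.fl parameters. The field name "lyrics", the document id 12345 and the base URL are placeholders I made up; substitute the field you know the matches are in.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def highlight_in_document_url(base_url, doc_id, words, field):
    """Build a /select URL that searches and highlights words in one document."""
    params = {
        "q": " ".join(words),        # the words to look for
        "fq": f"track_id:{doc_id}",  # restrict the search to a single document
        "df": field,                 # default field for the bare words in q
        "hl": "true",                # turn highlighting on
        "hl.fl": field,              # field to build highlight snippets from
    }
    # Ids containing query-syntax characters (":", spaces, ...) would need
    # escaping or quoting before being placed in fq; not handled here.
    return f"{base_url}/select?{urlencode(params)}"


url = highlight_in_document_url(
    "http://localhost:8983/solr", "12345", ["word1", "word2", "word3"], "lyrics"
)
params = parse_qs(urlsplit(url).query)
print(url)
```

Setting df to the known field also means the words are matched only against that field rather than the default one.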

Related

Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

I am new to Solr and I need to implement a full-text search of some PDF files. The indexing part works out of the box by using bin/post. I can see search results in the admin UI given some queries, though without the matched texts and the context.
Now I am reading this post for the highlighting part. It was written for an older version of Solr, when the managed schema was not available. Before I fully understand what it is doing, I have some questions:
He defined two fields:
<field name="content" type="text_general" indexed="false" stored="true" multiValued="false"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
But why are two fields needed? Can I define a single field
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
to capture the full text?
How are the fields filled? I don't see relevant information in TikaEntityProcessor's documentation. The current text extractor should already be Tika (I can see
"x_parsed_by":
["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"]
in the returned JSON of some query). But even if I define the fields as he said, I cannot see them as keys in the JSON search results.
The _text_ field seems to be a concatenation of other fields. Does it contain the full text? It does not seem to be accessible by default, though.
To be brief, using The Elements of Statistical Learning as an example, how do I highlight the relevant text for the query "SVM"? And if I change the file name to "The Elements of Statistical Learning - Trevor Hastie.pdf" and post it, how do I highlight "Trevor Hastie" for the query "id:Trevor Hastie"?
Before I get to the questions, let me briefly explain how Solr works. At its core Solr uses Lucene, which, simply put, is a matching engine. It builds an inverted index over the documents: for each term it keeps a list of the documents that contain it, which is what makes lookups so fast. Now to your questions:
Solr itself does not convert your PDF to text. That is done by the update processor configured in the request handler, which you can configure in solrconfig.xml, or you can write your own handler.
Now, why are there two fields? Simply put, the first one (content) is a stored field that keeps the data exactly as it is. The second one (text) is a copyField target that receives data for each document according to the configuration in schema.xml.
We do this because we can then choose the indexing strategy. For example, we can add a lowercase filter factory to the text field so that everything is indexed in lower case; then searching for "Sam" and "sam" returns the same results. Or we can remove commonly occurring words such as "a" and "the", which would otherwise increase your index size unnecessarily. The index uses a lot of memory when you are dealing with millions of records, so you want to be careful about which fields you index in order to make better use of resources.
The field "text" is a copyField that copies data from certain fields, as listed in the schema, into the text field. When searching in general, you then do not need to fire a separate query for each field, because everything is copied into the "text" field and you get the result from there. This is the reason it is multiValued: it can store an array of values. Content is stored and text is not, and the opposite holds for indexed, because when you return results to the end user you show what you saved, not the stripped-down data produced by the filters applied to the text field (stop word removal, case filters, stemming, etc.).
This is the reason you do not see the "text" field in the search results: it is not stored and is only used by Solr for searching.
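To make the stored versus indexed split concrete, here is a toy model in Python. It is not how Lucene is implemented; the analyze function, the stop word list and the two sample documents are all made up for illustration.

```python
STOP_WORDS = {"a", "the"}  # illustrative stop word list


def analyze(text):
    """Toy analysis chain: split on whitespace, lowercase, drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def build_index(docs):
    """Inverted index: each token maps to the set of ids of docs containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index


# "Stored" side: the original text, kept untouched for display.
stored = {1: "Sam reads the book", 2: "a book for sam"}
# "Indexed" side: only the analyzed tokens, used for matching.
index = build_index(stored)


def search(query):
    """Analyze the query the same way, then return the stored originals."""
    hits = set()
    for token in analyze(query):
        hits |= index.get(token, set())
    return [stored[doc_id] for doc_id in sorted(hits)]


print(search("Sam"))
print(search("sam"))
```

Both searches return the same two documents, and what comes back is the stored text, not the lowercased, stop-word-stripped tokens that were used for matching.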
For highlighting see this.
For more, there are some great blogs by Yonik and Joel.
Hope this helps. :)

solr query : with the Wildcard Searches Type *

The field is defined in schema.xml as:
<field name="typeDesc" type="text_general" indexed="true" stored="true"/>
The typeDesc field stores values like 公立, 公立,三甲, 公立,二甲.
The problem is that when I query typeDesc:*三甲*, nothing is returned, but when I query typeDesc:*公立* or typeDesc:*三* or typeDesc:*甲* or typeDesc:三甲, they all find results like 公立,三甲. I want to know the reason.
While I'm not too familiar with word breaking rules for kanji, I'm going to guess that the reason is that when you're doing wildcard searches, analysis for the field isn't performed. If 三 and 甲 are split into separate tokens, the wild card match will not find any token matching your search.
You can confirm this by using the analysis tab of the admin page to see which tokens an indexed term is being broken into.
Possible solutions would be to index the terms in a single string field as well and do wildcard matches against that, or to use a KeywordTokenizer for your text field if you need further processing before storing the token (the keyword tokenizer will keep the text as one single token). You could also use an n-gram filter and drop the wildcards.
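The guess above can be illustrated with a small simulation. This is only a sketch: the tokenize function and its one-word dictionary are hypothetical stand-ins for whatever your analyzer actually does (check the analysis page for the real tokens), and fnmatch stands in for Solr's wildcard matching against individual tokens.

```python
from fnmatch import fnmatchcase

KNOWN_WORDS = {"公立"}  # hypothetical dictionary: words the tokenizer keeps whole


def tokenize(text):
    """Hypothetical analyzer: keep known words whole, split the rest per character."""
    tokens = []
    for chunk in text.split(","):
        chunk = chunk.strip()
        if not chunk:
            continue
        if chunk in KNOWN_WORDS:
            tokens.append(chunk)
        else:
            tokens.extend(chunk)  # one token per character
    return tokens


def wildcard_match(pattern, tokens):
    """Wildcard query: the pattern is NOT analyzed and must match a single token."""
    return any(fnmatchcase(token, pattern) for token in tokens)


def term_match(query, tokens):
    """Plain query: the query IS analyzed, then every resulting token must be present."""
    return all(token in tokens for token in tokenize(query))


indexed = tokenize("公立,三甲")
print(indexed)
for pattern in ["*三甲*", "*公立*", "*三*", "*甲*"]:
    print(pattern, wildcard_match(pattern, indexed))
print("三甲", term_match("三甲", indexed))
```

Under these assumptions 三甲 is indexed as two tokens, so no single token can match *三甲*, while the plain query 三甲 is split the same way at query time and finds both tokens.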

Extending Solr Tutorial with custom fields/core

After standing up a basic Jetty Solr example, I tried to make my own core to represent the data my company will be seeing. I made a directory structure with conf and data directories and copied core.properties, schema.xml, and solrconfig.xml from the collection1 example.
I've edited core.properties to change the core name, and I've added 31 fields (most of type text_general, indexed, stored, not required or multivalued) to the schema.
I'm pretty sure I've set it up correctly, as I can see my core in the admin page drop-down and interact with it. The problem is that when I feed a document designed for the new fields, I cannot get a successful query for any of the values. I believe the data is fed, as I got the same command line response:
"POSTing file incidents.xml...
1 file indexed. ....
COMMITting..."
I thought the indexing process just needed more time, but when I copy a field node out of an example doc (e.g. <field name="name">Apple 60 GB iPod with Video Playback Black</field> from ipod_video.xml) into a copy of my file (incidents2.xml), searches on any of those strings instantly succeed.
The best example of my issue is both files have the field:
<field name="Brand" type="text_general" indexed="true" stored="true" required="false" multiValued="false"/>
<field name="Brand">APPLE</field>
However, only the second document (with the aforementioned name field) is returned with a query for apple.
Thanks for reading this far; my questions are:
1) Is there a way to dump the analysis/tokenization phase of document ingestion? Either I don't understand it or the Analysis tab isn't designed for this. The debugQuery=true parameter gives relevance score data but no explanation of why a document was excluded.
2) Once I solve my overall issue, we would like to have large text fields included in the index. Can I wrap long-form text in CDATA blocks in Solr?
Thanks again.
To debug any query issues in Solr, there are a few useful things to check. You might also want to add the output of your analysis page and the field you're having issues with from your schema.xml to your question. It's also a good idea to have a smaller core to work with (use three or four fields just to get started and get it to work) when trying to debug any indexing issues.
Are the documents actually in the index? - Perform a search for *:* (q=*:*) to make sure that there are documents present in the index. *:* is a shortcut that means "give me all documents regardless of value". If there are no documents returned, there is no content in the index, and any attempt to search it will give zero results.
Check the logs - Make sure that SolrLogging is set up, so you get any errors thrown in your log. That way you can see if there's anything in particular going wrong when the query or indexing is taking place, something which would result in the query never being performed or no documents being added to the index.
Use the Analysis page - If you have documents in the index, but they're not returned for the queries you're making, select the field you're querying at the analysis page and add both the value given when indexing (in the index column) and the value used when querying (in the query field). The page will then generate all the steps taken both when indexing and querying, and show you the token stream at each step. If the tokens match, they will be highlighted with a different background color, and depending on your settings, you might require all tokens present on the query side to be present on the indexing side (i.e. every token AND-ed together). Start with searching for a single token on the query side for that reason.
If you still don't have any hits, but have the documents in the index, be more specific. :-)
And yes, you can use CDATA.
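On the CDATA point, a quick way to convince yourself is to run an update document through any XML parser: the CDATA wrapper disappears and only the raw text remains. The sketch below uses Python's standard library parser, not Solr's, and the "description" field is a made-up example of a long text field.

```python
import xml.etree.ElementTree as ET

# Update document in the same <add><doc><field> shape as the example docs.
# "description" is a hypothetical long-text field wrapped in CDATA.
doc_xml = """<add>
  <doc>
    <field name="Brand">APPLE</field>
    <field name="description"><![CDATA[Long text with <b>markup</b> & ampersands]]></field>
  </doc>
</add>"""

root = ET.fromstring(doc_xml)
# Map each field name to the text value an XML parser hands to the application.
fields = {field.get("name"): field.text for field in root.iter("field")}
print(fields["description"])
```

The main benefit is that you do not have to escape < and & by hand in long free-form text.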

Know indexing time for a document in Solr

Is it possible to know the indexing time of a document in Solr? Just as there is an implicit field for "score" which automatically gets added to a document, is there a field that stores the indexing time?
I need it to know the date when a document got indexed.
Thanks
Solr does not automatically add a create date to documents. You could certainly index one with the document though, using Solr's DateField. In earlier versions of Solr ( < 4.2 ), there was a commented-out timestamp field in the example schema.xml, which looked like:
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
Also, I think it bears noting that there is no implicit "score" field. Scores are calculated at query time, rather than being tied to the document. Different queries will generate different scores for the same document. There are norms stored with the document that are factored into scores, but they aren't really fields.
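If you would rather set the value yourself on the client instead of relying on default="NOW", a minimal sketch could look like this. The helper names and the sample document are mine; the only Solr-specific part is the date format, which is ISO 8601 in UTC with a trailing Z.

```python
from datetime import datetime, timezone


def solr_date(dt):
    """Format a timezone-aware datetime the way Solr date fields expect (UTC, 'Z')."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def with_index_time(doc, now=None):
    """Return a copy of the document with a 'timestamp' field added."""
    now = now or datetime.now(timezone.utc)
    return {**doc, "timestamp": solr_date(now)}


# Hypothetical document, stamped with a fixed time so the output is predictable.
fixed = datetime(2013, 5, 1, 12, 30, 0, tzinfo=timezone.utc)
doc = with_index_time({"id": "1234", "name": "example"}, now=fixed)
print(doc)
```

This records the time the client built the document, which can differ slightly from the moment Solr actually indexes it.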
femtoRgon gave you a correct solution, but you must be careful with partial document updates.
If you do not do partial document updates you can stop reading now ;-)
If you partially update your document, Solr will merge the existing values with your partial document and the timestamp will not be updated. The solution is to not store the timestamp; then Solr will not be able to merge this value. The drawback is that you cannot retrieve the timestamp with your search results.

How to view non-stored fields per document?

I have a field like this:
<field name="status" type="string" indexed="true" stored="false" required="false" />
Using LukeRequestHandler I can view only statistics of the indexed terms; I can view indexed terms per document only if stored="true". TermsComponent can show only frequencies of terms, so I cannot view terms per document.
Is it possible to look inside the inverted index without setting stored="true" and reindexing Solr?
In order to view the indexed terms for a single document, you need to use the full Luke application, not the LukeRequestHandler. You would need to copy the index folder from your Solr data directory to another location, then open it in Luke.
There is however a workaround within solr itself - do a search that will return just the one document, and facet on the field you want to examine. Every term in the index for that field on that document will be an entry in the facet output. Here is a full sample URL for this kind of search:
http://localhost:8983/solr/core/select?q=id:1234&facet.field=status&facet.limit=-1&facet.mincount=1&facet=true&facet.method=enum
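If you want to read the result from a script, here is a small sketch for pulling the terms out of the JSON response. It assumes the default JSON layout, where each facet field is a flat list alternating term and count; the sample response below is hand-written for illustration, not real output.

```python
def terms_from_facet(response, field):
    """Turn the flat [term, count, term, count, ...] facet list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    # Even positions hold the terms, odd positions hold their counts.
    return dict(zip(flat[0::2], flat[1::2]))


# Hand-written sample shaped like a Solr JSON response for the query above.
sample = {
    "response": {"numFound": 1},
    "facet_counts": {"facet_fields": {"status": ["active", 1]}},
}

print(terms_from_facet(sample, "status"))
```

Because facet.mincount=1 drops every term with a zero count and the query matches a single document, whatever is left in the dict is exactly the set of indexed terms for that document.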
If you decide to go the Luke route, you can step through your index (or search for an individual document) and view just one document.
The official Luke page is here, but it only supports up through 4.0-ALPHA:
http://code.google.com/p/luke/
You can find Luke for versions beyond 4.0-ALPHA here:
https://java.net/projects/opengrok/downloads
There is an effort underway to absorb Luke into the Lucene/Solr source code as a module, so it will always be current and released at the same time as each Lucene/Solr version.

Resources