Generate highlighted snippets in Solr for PDFs - solr

I'm new to solr. I've set up a solr server and have indexed a few thousand PDFs. I am trying to query solr via the rest API in a PHP page. I am trying to build something similar to the solritas interface included in the tutorial (solrserver/browse), but I don't know how to generate highlighted snippets. I found in the documentation "hl" is a query parameter and is by default set to false.
When I get http://solrserver/?q=search+term&hl=true I get back a response with a hightlighting section, but it only contains the document IDs, no generated snippets.
I am using the tutorial provided schema and config for solr 4.2.1. I believe that the configuration is fine because solritas is able to display highlighted snippets using the same indexed data. I've tried seeing how solritas is built but it's separated out in .vm template files and I haven't been able to find what I'm looking for yet.
I can see the full text of the PDF in the doc->content area, so it is stored. I think I just don't understand the proper way to generate snippets! Can someone please help!
Thanks :)

I would suggest, you should try using hl.fl parameter. So your query should be something like this:
?q=search+term&hl=true&hl.fl=field1,field2,field3
Where field1, field2 and field3 are three source fields you would like to generate highlights.
In your case, if the field name you want to use for highlighting is content, your query can be:
?q=search+term&hl=true&hl.fl=content
More details: http://docs.lucidworks.com/display/solr/Highlighting
With highlighting, you can even specify fragment size, HTML tags around highlighted text etc...

Related

Azure suggester returning all content

I'm trying to implement an Azure suggester feature into our pilot Azure search app and running into issues. The content I'm indexing are PDF files, so my suggester definition is based on the content field itself which can be thousands of lines of text. Following examples online, when I implement the suggester, I'm returned the entire content of the body of text from the PDF file. What I'd really like to do is return just a phrase found in the text.
For instance, suppose I'm indexing a Harry Potter book and I type into my search field "Dum", I'd like to see suggested results back like "Dumbledore", "Dementor", etc VS the whole book. Is this possible?
Tks
If we want to search for words sharing the same prefix, Autocomplete is the right API for this job. https://learn.microsoft.com/en-us/rest/api/searchservice/autocomplete
In contrast, Suggester API helps users find the documents containing words with that prefix. It returns text snippets containing those worlds.
If you still believe suggester api does not behave as expected and autocomplete is not suitable, let me know your source document, query and expected results.

solr highlighting in old and new versions

I am migrating a web site from an old version of solr (1.4.1) to the current release version (5.2.1) on a different machine and noticing some differences.
In the old version, I could get highlighting with a url like this:
http://localhost:8983/solr/select?indent=on&q=text:software/&start=0&rows=10&fl=id,score,title&wt=json&hl=on&hl.fragsize=200
In the new version, one thing that's different is I need to specify a collection. Another difference is that the new version gives an error if I put text: in front of the value of q.
So, taking into account those differences, I end up with a URL like this:
http://localhost:8983/solr/default/select?indent=on&q=software/&start=0&rows=10&fl=id,score,title&wt=json&hl=on&hl.snippets=1&hl.fl=%2a&hl.fragsize=200
That second URL does not give me highlighting fragments/snippets. That is to say, where the old URL would give something like this:
"highlighting":{
"document0_id":{"text":["The <em>software</em> is awesome"]}}
The new URL gives something like this:
"highlighting":{
"document0_id":{}}
What do I need to do to get highlighting fragments returned in solr 5.2.1?
[edited]
In addition, I tried selecting a single document by its id on both machines. On the old machine, a url like
http://localhost:8983/solr/select?wt=json&indent=true&q=id:thedocumentid
returns some JSON that includes a text field containing the full searchable text of the original HTML document. On the new machine a similar url (but one that includes the collection):
http://localhost:8983/solr/default/select?wt=json&indent=true&q=id:thedocumentid
...returns similar JSON that does not include the text field.
I note that searching returns the correct results; the problem is that on the new machine, the results do not include the highlighting fragments.
So it seems like maybe the issue is that I need to specify that these documents have a text field when I index them; how do I do that?
A colleague (not tempted by the bounty) noticed that my text field had stored="false" in my schema.xml and suggsted changing it to true. That did the trick.
In the first query you are specifically searching in the text field and in the second its not.
And in the second you have mentioned hl.fl which means "Specifies a list of fields to highlight. Accepts a comma- or space-delimited list of fields for which Solr should generate highlighted snippets. If left blank, highlights the defaultSearchField"
Try again by making the changes...
http://localhost:8983/solr/default/select?q=text:software&start=0&rows=10&fl=id,score,title&wt=json&hl=on&hl.fragsize=200

How can I edit the data section of an Omindex produced database document by editing the omegaScript?

I have been able to set up and search through some documents from a database using this tutorial:
a)
http://www.ibm.com/developerworks/opensource/library/os-xapianomega/index.html?cmp=dw&cpb=dwope&ct=dwnew&cr=dwnen&ccy=zz&csr=110410
The data field is added to every document in the indexing process started with this bash call:
$ omindex --db info --url information /mnt/data0/Information
The call indexes all the files in the dir at /mnt/data0/Information and saves it at a database named
info. According to the last section in the documentation here:
http://xapian.org/docs/omega/overview.html
According to the above documentation, you can set the fields that go into the data field of a document by editing the OmegaScript Template but I have not been able to find this template anywhere. I am hoping I can get some guidance from someone who is familiar with editing an OmegaScript to set up the data field.
I ultimately want data to have the following fields:
sample
caption
type
The standard ones without the url field.
OmegaScript templates are used by omega to render search results (in its web interface), and are stored in the template_dir as mentioned in the IBM tutorial section on the Omega web interface. omindex will have created the fields you require — that documentation also mentions that the OmegaScript command you want to extract those fields is $field{}, which is documented along with all the OmegaScript commands.
So to just display the three fields you would want a fragment of OmegaScript something like:
$hitlist{
Sample: $field{sample}
Caption: $field{caption}
MIME type: $field{type}
}
(which isn't formatted as HTML, but has the advantage of being hopefully clearer as to what is happening).

Solr search – Tika extracted text from PDF not return highlighting snippet

I have successfully indexed Pdf –using Tika- and pure text –fetched from database- in one single collection. Now I am trying to implement highlighting. When I querying Solr i placing in the url the following: http://myhost:8090/solr/ktm/select/?q=BlahBlah&start=0&rows=120&indent=on&hl=true&wt=json . Everything is OK. The received output has the original (not highlighted text) content under “docs” and the highlighted snippets under “highlighting”. But I had noticed the documents that have been extracted by Tika don’t have “highlighting” snippet. That kind of response, cause me many troubles (zero length rows). Is there any workaround in order to tackle it? I have already tried to copyField (at index time) but the response come out blank ({“highlighting”:{}}). I really need help on this.

Solr configuration

I'm very new with Solr,
And I really want a step by step to have my Solr search result like the google one.
To give you an idea, when you search 'PHP' in http://wiki.apache.org/solr/FindPage , the word 'php' shows up in bold .. This is the same result I want to have.
Showing only a parser even if the pdf is a very huge one.
You can use highlighting to show a matching snippet in the results.
http://wiki.apache.org/solr/HighlightingParameters
By default, it will wrap matching words with <em> tags, but you can change this by setting the hl.simple.pre/hl.simple.post parameters.
You may be looking at the wrong part of the returned data. Try looking at the 'highlighting' component of the returned data structure (i.e. don't look at the response docs). This should give you the snippets you want.

Resources