Solr: display the PDF page number along with the results - solr

My question is a continuation of an earlier one, "Solr open document after searching a keyword": I would now like to display the page number on which the searched word occurs in the input document.
This is my current setup:
1) I use tika-0.9.jar to extract the text of each PDF into an intermediate file.
2) I then generate a second XML file from the extracted output, in the format Solr expects, and post it with the post.jar command.
3) I use the Solritas search UI with Solr 3.2 (http://localhost:8983/solr/browse) to view the results.
I would like to display the page numbers along with the results.
Example:
If I search for the word 'test' in the input PDFs, what I have managed so far is to display all the documents that contain the word, and clicking any document opens the input PDF. I would also like to display the page number(s) on which the word 'test' occurs in each input document.
Please give me some suggestions, for example whether this can be done by somehow storing the page number in the index.
Your suggestions are most welcome.
Thanks and regards.
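The idea of storing the page number in the index could be sketched roughly as follows: emit one Solr document per PDF page, each carrying a page field, so that a hit already tells you which page matched. This is only a sketch under assumptions. It presumes the extraction step can hand over the text one page at a time (how to obtain that from Tika is not shown here), and the field names id, source_id, page and content are hypothetical and would have to exist in schema.xml.

```python
import xml.etree.ElementTree as ET


def build_page_docs(doc_id, pages):
    """Return Solr <add> XML with one <doc> per PDF page.

    `pages` is a list of page texts in page order. The field names used
    here (id, source_id, page, content) are hypothetical.
    """
    add = ET.Element("add")
    for page_no, text in enumerate(pages, start=1):
        doc = ET.SubElement(add, "doc")
        fields = {
            "id": f"{doc_id}_p{page_no}",  # unique key per page
            "source_id": doc_id,           # lets the UI link back to the PDF
            "page": str(page_no),
            "content": text,
        }
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    # Two made-up pages standing in for the extracted text of one PDF.
    print(build_page_docs("report.pdf", ["intro text", "this is a test"]))
```

The resulting XML could be posted the same way as in step 2, and the results page could then print the page field next to each hit.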

Related

PDF to Solr: how to index the paragraphs of a PDF

I'm working with Solr and trying to find out how to index a bunch of PDF files, specifically how to ingest their individual paragraphs.
My PDF contains paragraphs like this:
Test (Some Test) -> Heading of the paragraph
Some text -> Text of the paragraph
What I need is that, when I fire a search at Solr, each result shows the paragraph heading and the text belonging to it.
For example, if I search for "keyword", the result will be:
Hello (Keyword)
Paragraph whole text
I need help with this, as I have no idea how to do it.
I would like to know if I should use some external tool or what modification I need to do in Solr to achieve my results.
You definitely need to do some work outside Solr. If you use just Solr, it will bundle all the text it extracts into the same field, and you don't want that. So you have to use Apache Tika/PDFBox or some other library to extract the text yourself (keeping headings and bodies separate) and index them into different fields.
This has the additional benefit of making the indexing process more resilient, as using the Tika code built into Solr is not recommended for very large indexing jobs.
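As a rough illustration of keeping headings and bodies separate, here is a minimal sketch that splits already-extracted text into one record per paragraph. It assumes, purely for illustration, that paragraphs are separated by blank lines and that the first line of each block is its heading; real PDF output will likely need a smarter rule. The field names id, heading and body are hypothetical.

```python
import json


def split_paragraphs(doc_id, text):
    """Split extracted text into one record per paragraph.

    Assumption: blocks are separated by blank lines and the first line
    of each block is the paragraph heading.
    """
    records = []
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    for n, block in enumerate(blocks, start=1):
        # Everything before the first newline is the heading, the rest is the body.
        heading, _, body = block.partition("\n")
        records.append({
            "id": f"{doc_id}#{n}",
            "heading": heading.strip(),
            "body": body.strip(),
        })
    return records


if __name__ == "__main__":
    sample = "Test (Some Test)\nSome text\n\nHello (Keyword)\nParagraph whole text"
    print(json.dumps(split_paragraphs("doc1", sample), indent=2))
```

Each record can then be sent to Solr as its own document, so a search hit returns the heading together with the text of that paragraph only.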

Suggestions (recommendations) in the Solr search results

I am using Solr to search the contents of PDF files that I have indexed. I index the whole content of each file in a single content field. Now I need suggestions in the results, similar to what online shopping sites show. Along with the Solr results, I need information such as the most searched text, or something like "you can also find: blah blah". I also need the results to be ordered by the most selected answers for the searched text.
Can anyone please brief me on this? I am new to Solr. Thanks in advance.

Generate highlighted snippets in Solr for PDFs

I'm new to Solr. I've set up a Solr server and have indexed a few thousand PDFs. I am trying to query Solr via the REST API from a PHP page. I am trying to build something similar to the Solritas interface included in the tutorial (solrserver/browse), but I don't know how to generate highlighted snippets. I found in the documentation that "hl" is a query parameter and is set to false by default.
When I get http://solrserver/?q=search+term&hl=true I get back a response with a highlighting section, but it only contains the document IDs, no generated snippets.
I am using the tutorial-provided schema and config for Solr 4.2.1. I believe that the configuration is fine because Solritas is able to display highlighted snippets using the same indexed data. I've tried to see how Solritas is built, but it's split across .vm template files and I haven't been able to find what I'm looking for yet.
I can see the full text of the PDF in the doc->content area, so it is stored. I think I just don't understand the proper way to generate snippets! Can someone please help!
Thanks :)
I would suggest using the hl.fl parameter, so your query would look something like this:
?q=search+term&hl=true&hl.fl=field1,field2,field3
where field1, field2 and field3 are the source fields for which you would like to generate highlights.
In your case, if the field name you want to use for highlighting is content, your query can be:
?q=search+term&hl=true&hl.fl=content
More details: http://docs.lucidworks.com/display/solr/Highlighting
With highlighting, you can also specify the fragment size, the HTML tags placed around highlighted text, and so on.
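To make the parameters above concrete, here is a small sketch that assembles such a highlighting query. The base URL http://solrserver/select and the helper name highlight_query_url are assumptions for illustration. hl, hl.fl, hl.fragsize, hl.simple.pre and hl.simple.post are standard Solr highlighting parameters, but the default values chosen here are arbitrary.

```python
from urllib.parse import urlencode


def highlight_query_url(base_url, query, fields, fragsize=100,
                        pre="<b>", post="</b>"):
    """Build a Solr select URL that asks for highlighted snippets.

    `base_url` is a hypothetical select handler URL; `fields` lists the
    stored fields to generate snippets from.
    """
    params = {
        "q": query,
        "wt": "json",
        "hl": "true",
        "hl.fl": ",".join(fields),   # fields to highlight
        "hl.fragsize": fragsize,     # approximate snippet length in characters
        "hl.simple.pre": pre,        # markup inserted before each match
        "hl.simple.post": post,      # markup inserted after each match
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(highlight_query_url("http://solrserver/select", "search term", ["content"]))
```

urlencode takes care of escaping the spaces and angle brackets, which is easy to get wrong when the URL is concatenated by hand.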

Solr search: Tika-extracted text from PDF does not return highlighting snippets

I have successfully indexed PDFs (using Tika) and plain text (fetched from a database) in one single collection. Now I am trying to implement highlighting. When querying Solr, I use the following URL: http://myhost:8090/solr/ktm/select/?q=BlahBlah&start=0&rows=120&indent=on&hl=true&wt=json . Everything is OK: the received output has the original (not highlighted) content under “docs” and the highlighted snippets under “highlighting”. But I noticed that the documents extracted by Tika don't have a “highlighting” snippet. That kind of response causes me a lot of trouble (zero-length rows). Is there any workaround to tackle this? I have already tried copyField (at index time), but the response comes out blank ({“highlighting”:{}}). I really need help on this.

Solr configuration

I'm very new to Solr, and I would really like a step-by-step guide to make my Solr search results look like Google's.
To give you an idea, when you search for 'PHP' at http://wiki.apache.org/solr/FindPage , the word 'php' shows up in bold. This is the result I want to have.
That is, showing only a short excerpt, even if the PDF is very large.
You can use highlighting to show a matching snippet in the results.
http://wiki.apache.org/solr/HighlightingParameters
By default, it will wrap matching words with <em> tags, but you can change this by setting the hl.simple.pre/hl.simple.post parameters.
You may be looking at the wrong part of the returned data. Try looking at the 'highlighting' component of the returned data structure (i.e. don't look at the response docs). This should give you the snippets you want.
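As a sketch of where the snippets live, the following pairs each document in the response with its entry in the 'highlighting' section of a JSON response (wt=json). The sample response is made up for illustration, and the field name content and key field id are assumptions about the schema.

```python
import json

# Hypothetical response, shaped like Solr's JSON output with hl=true.
SAMPLE_RESPONSE = """
{
  "response": {"numFound": 1, "start": 0,
               "docs": [{"id": "doc1", "title": "PHP manual"}]},
  "highlighting": {"doc1": {"content": ["Intro to <em>PHP</em> scripting"]}}
}
"""


def snippets_by_doc(response, field="content", id_field="id"):
    """Return (doc id, list of snippets) pairs in result order.

    The highlighting section is keyed by the document's unique key and
    then by field name. Documents without a snippet get an empty list.
    """
    highlighting = response.get("highlighting", {})
    results = []
    for doc in response["response"]["docs"]:
        per_doc = highlighting.get(str(doc[id_field]), {})
        results.append((doc[id_field], per_doc.get(field, [])))
    return results


if __name__ == "__main__":
    for doc_id, snippets in snippets_by_doc(json.loads(SAMPLE_RESPONSE)):
        print(doc_id, snippets)
```

Falling back to an empty list keeps the display code from breaking when a document has no highlighting entry for the requested field.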
