PDF to Solr: how to index the paragraphs of a PDF

I'm working with Solr and I'm trying to find out how to index a bunch of PDF files and, specifically, how to ingest their paragraphs.
My PDF contains paragraphs like this:
Test (Some Test) -> Heading of the paragraph
Some text -> Text of the paragraph
What I need to achieve is that when I run a search against Solr, the result should show the paragraph heading and the text related to it.
For example, if I search for "keyword", the result for this keyword will be:
Hello (Keyword)
Paragraph whole text
I need help with this, as I have no idea how to do it.
I would like to know if I should use some external tool or what modification I need to do in Solr to achieve my results.

You definitely need to do external work. If you use just Solr, it will bundle all the text it extracts into a single field, and you don't want that. So you have to use Apache Tika/PDFBox or some other library to extract the text yourself (keeping headings and bodies separate) and index them into different fields.
This has the additional benefit of making the indexing process more resilient, as using the built-in Tika code in Solr is not recommended for very large indexing jobs.
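The grouping step depends on how your PDFs mark headings. As a minimal sketch, assuming the text has already been extracted to plain text (with Tika or PDFBox), and assuming headings look like `Test (Some Test)` — a short line ending in a parenthesised term — splitting into heading/body documents could look like:

```python
import re

def split_paragraphs(text):
    """Split extracted plain text into heading/body documents.
    Heading detection here is a placeholder heuristic (a short line
    ending in a parenthesised term); adapt the pattern to your PDFs."""
    heading_re = re.compile(r'^.{0,80}\([^)]+\)\s*$')
    docs, heading, body = [], None, []
    for line in text.splitlines():
        stripped = line.strip()
        if heading_re.match(stripped):
            if heading is not None:
                docs.append({"heading": heading, "body": " ".join(body)})
            heading, body = stripped, []
        elif stripped:
            body.append(stripped)
    if heading is not None:
        docs.append({"heading": heading, "body": " ".join(body)})
    return docs

sample = "Test (Some Test)\nSome text\nHello (Keyword)\nParagraph whole text\n"
for doc in split_paragraphs(sample):
    print(doc)
```

Each resulting dict can then be posted to Solr as its own document, with `heading` and `body` as separate fields in your schema.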

Related

Parsing paragraphs into separate documents in Solr using script

I would like to crawl through a list of sites using Nutch, then break up each document into paragraphs and send them to Solr for indexing.
I have been using the following script to automate the process of crawling/fetching/parsing/indexing:
bin/crawl -i -D solr.server.url=http://localhost:8983/solr/#/nutch -s ./urls/ Crawl 2
My idea is to attach a script in the middle of this workflow (probably the parsing stage of Nutch?) that would break up the paragraphs, like paragraphs.split(). How could I accomplish this?
Additionally, I need to add a field to each paragraph that records its numerical position in the document and which chapter it belongs to. The chapter is an h2 tag in the document.
Currently, there is no easy answer to your question; you need custom code. Specifically, Nutch has two different plugins for parsing HTML: parse-html and parse-tika. These plugins focus on extracting text content rather than structured data from the HTML document.
You would need a custom parser plugin (HtmlParserPlugin) that treats paragraph nodes within your HTML document in a custom way (extracting the content and positional information).
The other component you need is a data model in Solr: since you must keep the position of each paragraph within its document, you also need to send this data in a way that is searchable in Solr, perhaps using nested documents (this really depends on how you plan to use the data).
For instance, you may take a look at this plugin, which implements custom logic for extracting data from HTML using arbitrary XPath expressions.
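As a sketch of the nested-document modeling, the parent/child structure sent to Solr could be built like this (the field names para_text, para_position, and chapter are assumptions — define whatever your schema actually uses; `_childDocuments_` is Solr's JSON key for child documents):

```python
def build_nested_doc(url, blocks):
    """Build one Solr parent document per crawled page, with one child
    document per paragraph carrying its position and current chapter.
    `blocks` is a list of (tag, text) pairs in document order, where an
    'h2' block starts a new chapter."""
    children = []
    chapter = None
    for pos, (tag, text) in enumerate(blocks):
        if tag == "h2":
            chapter = text          # subsequent paragraphs belong here
            continue
        children.append({
            "id": f"{url}#p{pos}",  # unique child id
            "para_text": text,
            "para_position": pos,
            "chapter": chapter,
        })
    return {"id": url, "url": url, "_childDocuments_": children}

doc = build_nested_doc(
    "http://example.com/page",
    [("h2", "Intro"), ("p", "first paragraph"), ("p", "second paragraph")],
)
print(doc)
```

The resulting dict can be posted to Solr's update handler as JSON; querying then uses block-join parsers to relate children to their parent page.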

Azure suggester returning all content

I'm trying to implement an Azure suggester feature in our pilot Azure search app and running into issues. The content I'm indexing consists of PDF files, so my suggester definition is based on the content field itself, which can be thousands of lines of text. Following examples online, when I implement the suggester, I'm returned the entire content of the body of text from the PDF file. What I'd really like is to return just a phrase found in the text.
For instance, suppose I'm indexing a Harry Potter book and I type "Dum" into my search field; I'd like to see suggested results like "Dumbledore", "Dementor", etc., rather than the whole book. Is this possible?
Thanks
If you want to search for words sharing the same prefix, Autocomplete is the right API for the job: https://learn.microsoft.com/en-us/rest/api/searchservice/autocomplete
In contrast, the Suggest API helps users find the documents containing words with that prefix. It returns text snippets containing those words.
If you still believe the Suggest API does not behave as expected and Autocomplete is not suitable, let me know your source document, query, and expected results.
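As a sketch of what an Autocomplete request looks like (the service, index, and suggester names below are placeholders, and you should check the linked REST reference for the api-version your service supports):

```python
from urllib.parse import urlencode

def autocomplete_url(service, index, term, suggester,
                     api_version="2020-06-30"):
    """Build the GET URL for Azure Cognitive Search's Autocomplete API.
    All names here are illustrative; an api-key header is also required
    on the actual request."""
    params = urlencode({
        "api-version": api_version,
        "search": term,              # the partial term typed so far
        "suggesterName": suggester,  # suggester defined on the index
        "autocompleteMode": "oneTerm",
    })
    return (f"https://{service}.search.windows.net"
            f"/indexes/{index}/docs/autocomplete?{params}")

print(autocomplete_url("myservice", "books", "Dum", "sg"))
```

The response contains completed terms ("Dumbledore"-style completions) rather than whole documents, which is the behaviour asked for above.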

Generate highlighted snippets in Solr for PDFs

I'm new to Solr. I've set up a Solr server and have indexed a few thousand PDFs. I am trying to query Solr via the REST API from a PHP page, to build something similar to the Solritas interface included in the tutorial (solrserver/browse), but I don't know how to generate highlighted snippets. I found in the documentation that hl is a query parameter and is set to false by default.
When I GET http://solrserver/?q=search+term&hl=true I get back a response with a highlighting section, but it only contains the document IDs, no generated snippets.
I am using the schema and config provided by the tutorial for Solr 4.2.1. I believe the configuration is fine because Solritas is able to display highlighted snippets from the same indexed data. I've tried looking at how Solritas is built, but it's separated out into .vm template files and I haven't been able to find what I'm looking for yet.
I can see the full text of the PDF in the doc->content area, so it is stored. I think I just don't understand the proper way to generate snippets. Can someone please help?
Thanks :)
I would suggest trying the hl.fl parameter, so your query would look like this:
?q=search+term&hl=true&hl.fl=field1,field2,field3
where field1, field2, and field3 are the source fields you would like to generate highlights for.
In your case, if the field name you want to use for highlighting is content, your query can be:
?q=search+term&hl=true&hl.fl=content
More details: http://docs.lucidworks.com/display/solr/Highlighting
With highlighting, you can even specify the fragment size, the HTML tags placed around highlighted text, and so on.
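A minimal sketch of the full round trip, assuming a stored field named content: build the query parameters, then read snippets out of the highlighting section of the JSON response (the response dict below is a hand-made example of the shape Solr returns, not real output):

```python
def extract_snippets(response, field):
    """Collect highlight snippets per document id from a Solr JSON
    response. Documents with no generated snippet are skipped."""
    out = {}
    for doc_id, fields in response.get("highlighting", {}).items():
        snippets = fields.get(field, [])
        if snippets:
            out[doc_id] = snippets
    return out

# Parameters for the request itself (field name is an assumption):
params = {
    "q": "search term",
    "hl": "true",
    "hl.fl": "content",
    "hl.snippets": "3",    # up to 3 snippets per document
    "hl.fragsize": "100",  # roughly 100-character fragments
    "wt": "json",
}

# Illustrative response shaped like Solr's JSON output:
response = {"highlighting": {
    "doc1": {"content": ["a matching <em>search term</em> snippet"]},
    "doc2": {},  # no snippet was generated for this document
}}

print(extract_snippets(response, "content"))
```

A PHP page would do the equivalent: append the hl.* parameters to the select URL and walk the highlighting object keyed by document ID.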

Solr search: Tika-extracted text from PDF does not return highlighting snippets

I have successfully indexed PDFs (using Tika) and pure text (fetched from a database) in one single collection. Now I am trying to implement highlighting. When querying Solr I place the following in the URL: http://myhost:8090/solr/ktm/select/?q=BlahBlah&start=0&rows=120&indent=on&hl=true&wt=json . Everything is OK: the received output has the original (not highlighted) content under "docs" and the highlighted snippets under "highlighting". But I noticed that the documents extracted by Tika don't have a "highlighting" snippet. That kind of response causes me many troubles (zero-length rows). Is there any workaround to tackle this? I have already tried copyField (at index time), but the response comes out blank ({"highlighting":{}}). I really need help on this.

Solr configuration

I'm very new to Solr, and I really want a step-by-step guide to make my Solr search results look like Google's.
To give you an idea: when you search for 'PHP' on http://wiki.apache.org/solr/FindPage , the word 'php' shows up in bold. This is the same result I want to have, showing only a part of the text even if the PDF is a very huge one.
You can use highlighting to show a matching snippet in the results.
http://wiki.apache.org/solr/HighlightingParameters
By default, it will wrap matching words in <em> tags, but you can change this by setting the hl.simple.pre/hl.simple.post parameters.
You may be looking at the wrong part of the returned data. Try looking at the 'highlighting' component of the returned data structure (i.e. don't look at the response docs). This should give you the snippets you want.
