Parsing paragraphs into separate documents in Solr using script - solr

I would like to crawl a list of sites using Nutch, then break up each document into paragraphs and send them to Solr for indexing.
I have been using the following script to automate the process of crawling/fetching/parsing/indexing:
bin/crawl -i -D solr.server.url=http://localhost:8983/solr/#/nutch -s ./urls/ Crawl 2
My idea is to attach a script in the middle of this workflow (probably the parsing stage of Nutch?) that would break up the paragraphs, like paragraphs.split(). How could I accomplish this?
Additionally, I need to add a field to each paragraph that shows its numerical position in the document and the chapter it belongs to. The chapter is an h2 tag in the document.

Currently, there is no easy answer to your question; accomplishing this requires custom code. Nutch has two different plugins for parsing HTML, parse-html and parse-tika, and both are focused on extracting text content rather than structured data from the HTML document.
You would need a custom parser plugin (HtmlParserPlugin) that treats paragraph nodes within your HTML document in a custom way, extracting the content and positional information.
The other component you would need is for modeling the data in Solr: since you need to keep the position of each paragraph within the same document, you also need to send the data in a way that is searchable in Solr, perhaps using nested documents (this really depends on how you plan to use the data).
For instance, you may take a look at this plugin, which implements custom logic for extracting data from the HTML using arbitrary XPath expressions.
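As a rough illustration of the idea (outside of the Nutch plugin API itself), here is a sketch that splits a fetched page into paragraphs with jsoup, tracks each paragraph's position and its enclosing h2 chapter, and sends everything to Solr as nested child documents via SolrJ. The field names, the core URL and the use of jsoup are assumptions for the sake of the example, not something Nutch gives you out of the box; a real implementation would live inside a custom parse filter, and your Solr schema would need to support nested (parent/child) documents.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParagraphIndexer {

    public static void main(String[] args) throws Exception {
        String url = "http://example.com/page.html";      // hypothetical page
        Document html = Jsoup.connect(url).get();

        // Parent document for the page itself
        SolrInputDocument page = new SolrInputDocument();
        page.addField("id", url);
        page.addField("title", html.title());

        String currentChapter = "";
        int position = 0;

        // Walk h2 (chapter) and p (paragraph) elements in document order
        for (Element el : html.body().select("h2, p")) {
            if (el.tagName().equals("h2")) {
                currentChapter = el.text();
                continue;
            }
            String text = el.text().trim();
            if (text.isEmpty()) {
                continue;
            }
            SolrInputDocument paragraph = new SolrInputDocument();
            paragraph.addField("id", url + "#p" + position);
            paragraph.addField("paragraph_t", text);      // hypothetical field names
            paragraph.addField("position_i", position);
            paragraph.addField("chapter_s", currentChapter);
            page.addChildDocument(paragraph);             // nested child document
            position++;
        }

        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/nutch").build()) {
            solr.add(page);
            solr.commit();
        }
    }
}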

Related

Azure suggester returning all content

I'm trying to implement an Azure suggester feature in our pilot Azure search app and running into issues. The content I'm indexing consists of PDF files, so my suggester definition is based on the content field itself, which can be thousands of lines of text. Following examples online, when I implement the suggester, I'm returned the entire content of the body of text from the PDF file. What I'd really like to do is return just a phrase found in the text.
For instance, suppose I'm indexing a Harry Potter book and I type "Dum" into my search field; I'd like to see suggested results like "Dumbledore", "Dementor", etc., rather than the whole book. Is this possible?
Tks
If we want to search for words sharing the same prefix, Autocomplete is the right API for this job. https://learn.microsoft.com/en-us/rest/api/searchservice/autocomplete
In contrast, the Suggester API helps users find the documents containing words with that prefix; it returns text snippets containing those words.
If you still believe the Suggester API does not behave as expected and Autocomplete is not suitable, let me know your source document, query, and expected results.
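For reference, a minimal call to the Autocomplete REST API looks roughly like the following sketch; the service name, index name, suggester name, and API key are placeholders for your own setup.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AutocompleteDemo {

    public static void main(String[] args) throws Exception {
        // All of these values are placeholders for your own service/index setup
        String service = "my-search-service";
        String index = "books-index";
        String apiKey = "<query-api-key>";

        String url = "https://" + service + ".search.windows.net/indexes/" + index
                + "/docs/autocomplete?api-version=2020-06-30"
                + "&search=dum"                  // the partial term typed by the user
                + "&suggesterName=sg"            // suggester defined in the index
                + "&autocompleteMode=oneTerm";   // return single completed terms

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("api-key", apiKey)
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // Expect a JSON payload with completed terms such as "dumbledore"
        System.out.println(response.body());
    }
}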

Solr - Bringing back snippets from indexed data

I have a Solr/Lucene setup where I have indexed a set of documents (MS Word files) and can happily search the content of these documents. However, I would like to return a snippet from within the content of the document showing the matching line (+/- 5 words from the match term). I have tried to follow a range of Google hits, but my indexing does not seem to give me direct access to the "content".
Can anyone give me some basic and simple pointers to where I might have made any errors on this? I have based all my work so far on the guidance and examples of the Solr Reference Guide, so I am not sure if the issue is in the search parameters or the original index.
I am doing this to create a clear set of user requirements for an end solution rather than building the end solution myself, so I am no expert on the tools and do not need to become one; I just need to show what is possible with this tool set.
As MatsLindh noted above, the issue was that the config was not mapping the actual content of the Tika parse into a specific field, so there was no full text content to display and highlight.
To resolve this I followed the link to the guidance documents (https://lucene.apache.org/solr/guide/7_1/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler) and reviewed the part on fmap, using the example given for Last Modified Date as a guide on what to apply.
I then went to my solrconfig.xml file in the relevant core folder and added in the following line in the code beneath an already present fmap entry:
<str name="fmap.content">testcontent</str>
I had previously set up the testcontent field under the Solr web interface in my core. I then re-ran my indexing line via a command prompt, and that seemed to do the trick in terms of pulling out the basic content and wrapping it with a basic emphasis tag.
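If it helps to see the query side, here is a small SolrJ sketch asking for highlighted snippets from the testcontent field; the core name and query term are placeholders, and the same thing can be done with plain query parameters (hl=true, hl.fl=testcontent).

import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class HighlightDemo {

    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycore").build()) {   // core name is a placeholder

            SolrQuery query = new SolrQuery("testcontent:searchterm");
            query.setHighlight(true);
            query.addHighlightField("testcontent");   // field populated via fmap.content
            query.setHighlightSnippets(1);
            query.setHighlightFragsize(100);          // roughly a few words around the match
            query.setHighlightSimplePre("<em>");
            query.setHighlightSimplePost("</em>");

            QueryResponse response = solr.query(query);

            // Map of document id -> field -> list of highlighted snippets
            Map<String, Map<String, List<String>>> highlighting = response.getHighlighting();
            highlighting.forEach((docId, fields) ->
                    System.out.println(docId + " -> " + fields.get("testcontent")));
        }
    }
}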
Thanks all for the input on this - there is still a lot more I want to test to help develop a clear requirement set, but this really helps prove that some of the basics are not complicated.

PDF to Solr how to index the paragraphs of a PDF

I'm working with Solr and I'm trying to find out how to index a bunch of PDF files and, specifically, how to ingest paragraphs.
My PDF contains paragraphs like this:
Test (Some Test) -> Heading of the paragraph
Some text -> Text of the paragraph
What I need to achieve is that when I run a search against Solr, the result shows the paragraph heading and the text related to it.
For example, if I search for "keyword", the result for this keyword should be:
Hello (Keyword)
Paragraph whole text
I need help with this, as I have no idea how to do it.
I would like to know if I should use some external tool or what modifications I need to make in Solr to achieve this.
You definitely need to do external work. If you use just Solr, it will bundle all the text it extracts into the same field, and you don't want that. So you have to use Apache Tika/PDFBox or some other library to extract the text yourself (keeping headings and bodies separate) and index them into different fields.
This has the additional benefit of making the indexing process more resilient, as using the built-in Tika code in Solr is not recommended for very large indexing jobs.
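To make that concrete, here is a rough sketch of the kind of external work meant above, using PDFBox to pull out the text and SolrJ to index each paragraph with its heading and body in separate fields. The file name, core name, field names, and the heading heuristic (a line ending in something in parentheses, as in the example above) are all assumptions you would adapt to your real documents.

import java.io.File;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class PdfParagraphIndexer {

    public static void main(String[] args) throws Exception {
        File pdf = new File("book.pdf");   // placeholder path

        String text;
        try (PDDocument document = PDDocument.load(pdf)) {
            text = new PDFTextStripper().getText(document);
        }

        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycore").build()) {   // core name is a placeholder

            String heading = null;
            StringBuilder body = new StringBuilder();
            int paragraphId = 0;

            for (String line : text.split("\\r?\\n")) {
                // Naive heuristic: a heading looks like "Test (Some Test)"
                if (line.matches("\\S.*\\(.*\\)\\s*")) {
                    if (heading != null) {
                        solr.add(buildDoc(pdf.getName(), paragraphId++, heading, body.toString()));
                    }
                    heading = line.trim();
                    body.setLength(0);
                } else {
                    body.append(line).append(' ');
                }
            }
            if (heading != null) {
                solr.add(buildDoc(pdf.getName(), paragraphId, heading, body.toString()));
            }
            solr.commit();
        }
    }

    private static SolrInputDocument buildDoc(String file, int id, String heading, String body) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", file + "#" + id);
        doc.addField("heading_t", heading);   // hypothetical field names
        doc.addField("body_t", body.trim());
        return doc;
    }
}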

Interpolating custom data onto a PDF

I am building an Angular test preparation app (with Laravel 5.1 API). One of the requirements is to allow the user to print a certificate of achievement.
The client wants the person's name and credentials interpolated into the document (e.g., highlighted below). Here is a snapshot of the PDF template they sent:
The way I'm handling PDF viewing is simply by storing the file on S3 and giving them a link to that file.
Interpolating information into a PDF doc doesn't seem trivial, and I haven't found much information on doing this programmatically, but there are tools like DocHub that allow you to edit while viewing the PDF.
I'm interested in learning:
is doing this programmatically trivial?
are there 3rd party tools I'm unaware of?
would I even be able to send this information along to the S3 link to interpolate in the first place?
Using PDF as a format for editing is usually a bad choice. If you have a form with fixed fields, then it's easy: create a PDF template with an interactive form. In this form, based on AcroForm technology, you'll define fields with fixed coordinates and a fixed size. You can then add content to these fields.
One major disadvantage with this approach is the lack of flexibility. Did you notice that I used the word "fixed" three times in the previous paragraph? If text doesn't fit the predefined field, you're out of luck. If the field is overdimensioned, you'll end up with plenty of white space. This approach is great if you can predict what the data will be like. A typical use case is a ticket or a voucher. For instance: the empty form is a really nice page, with only a couple of fields where an automated system can put a name, a date, a time, and a seat number.
This isn't the best approach for the example you show in your screen shot. The position of every line of text, every word, every character is known in advance. If you want to replace a short word with a long word (or vice-versa), then all those positions (of each line, of the complete page, possibly of the complete document) need to be recalculated. That's madness. Only people with very poor design skills come up with such an idea.
A better idea is to store the template as HTML. See for instance chapter 5 of iText's pdfHTML tutorial, where we have this snippet of HTML:
<html>
<head>
<title>Invitation to SXSW 2018</title>
</head>
<body>
<u><b>Re: Invitation</b></u>
<br>
<p>Dear <name>SXSW visitor</name>,
we hope you had a great SXSW film festival experience last year.
And we would like to invite you to the next edition of SXSW Film
that takes place from March 9 until March 17, 2018.</p>
<p>Sincerely,<br>
The SXSW crew<br>
<date>August 4, 2017</date></p>
</body>
</html>
Actually, it's not really HTML, because the <name> tag and the <date> tag don't exist in HTML. All HTML processors (browsers as well as pdfHTML) ignore those tags and treat their content as if the tag was a <span>:
It doesn't make much sense to have such tags in the context of pure HTML, but it does make a lot of sense in the case of pdfHTML. With pdfHTML, you can configure custom tags and get a result like the PDFs shown below:
Look at the document for "John Doe" and compare it with the document for "Bruno Lowagie". The name "John Doe" is much shorter than my name, hence more words fit on that first line. The text flows nicely (we could also have chosen to justify the text on both sides). This "flow" is impossible to achieve with your approach, because you will never get a PDF template to reflow nicely.
OK, I get it, you probably say, but what about the practical aspects? You talk about a Java / .Net library, but I am working with Laravel and Angular.js. First, let me tell you that I don't think you'll find any good PDF tools for Laravel or Angular.js, because of the nature of PDF and those development environments (in my opinion, those technologies don't play well together). Regardless of my opinion, this shouldn't be much of a problem for you because you work in an Amazon environment. AWS supports Java, and the Java code needed to get pdfHTML working is minimal. Most of the code samples I wrote for the pdfHTML tutorial are shorter than 15 lines. So why not try Java and pdfHTML?
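To give an idea of how little code that is, here is a minimal sketch with iText 7's pdfHTML module. It simply swaps the placeholder content of the <name> and <date> tags in the HTML template using string replacement before converting; the file names and replacement strings are illustrative, and the pdfHTML tutorial itself shows the more robust route of registering custom tag workers for those tags.

import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.LocalDate;

import com.itextpdf.html2pdf.HtmlConverter;

public class CertificateGenerator {

    public static void main(String[] args) throws Exception {
        // certificate.html is a placeholder template containing <name> and <date> tags
        String template = new String(
                Files.readAllBytes(Paths.get("certificate.html")), StandardCharsets.UTF_8);

        // Simplest possible substitution: replace the placeholder content before conversion
        String html = template
                .replace("<name>SXSW visitor</name>", "<name>John Doe</name>")
                .replace("<date>August 4, 2017</date>", "<date>" + LocalDate.now() + "</date>");

        try (FileOutputStream pdf = new FileOutputStream("certificate.pdf")) {
            HtmlConverter.convertToPdf(html, pdf);
        }
    }
}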
If you're already using Amazon services, why not use an AWS Lambda function, in combination with iText 7 (Java), to generate the PDF on demand?
That way, you are guaranteed that the PDF is correct and has a nice layout every time.
Generating the PDF can be done by:
converting HTML,
programmatically creating your entire document,
filling and flattening an XFA form.
I think for your use case, either option 1 or 2 is the most sustainable.

How can I edit the data section of an omindex-produced database document by editing the OmegaScript?

I have been able to set up and search through some documents from a database using this tutorial:
http://www.ibm.com/developerworks/opensource/library/os-xapianomega/index.html?cmp=dw&cpb=dwope&ct=dwnew&cr=dwnen&ccy=zz&csr=110410
The data field is added to every document in the indexing process started with this bash call:
$ omindex --db info --url information /mnt/data0/Information
The call indexes all the files in the directory at /mnt/data0/Information and saves them in a database named info. According to the last section of the documentation here:
http://xapian.org/docs/omega/overview.html
you can set the fields that go into the data field of a document by editing the OmegaScript template, but I have not been able to find this template anywhere. I am hoping I can get some guidance from someone who is familiar with editing an OmegaScript template to set up the data field.
I ultimately want data to have the following fields:
sample
caption
type
The standard ones without the url field.
OmegaScript templates are used by omega to render search results (in its web interface), and are stored in the template_dir as mentioned in the IBM tutorial section on the Omega web interface. omindex will have created the fields you require — that documentation also mentions that the OmegaScript command you want to extract those fields is $field{}, which is documented along with all the OmegaScript commands.
So to just display the three fields you would want a fragment of OmegaScript something like:
$hitlist{
Sample: $field{sample}
Caption: $field{caption}
MIME type: $field{type}
}
(which isn't formatted as HTML, but has the advantage of being hopefully clearer as to what is happening).
