Indexing URL pointing to pdf using TIKA in SOLR

I have a requirement where the incoming update request carries metadata such as "link":"http://example.pdf" (along with some other metadata), and I have to parse the PDF document and index its contents in another field, such as "link_value":"PDF extracted contents". Is this possible in SOLR using Tika?
NOTE: I cannot use the Data Import Handler, since the incoming requests do not come from a single source and are sent by an external system.

So, if I understand correctly:
you are getting some /update call to add some doc
the doc contains a 'link' field, which you want to retrieve, extract text with Tika, and index into another field
Yes you can do this in Solr, but you need to do some work:
set up an UpdateRequestProcessor; you could start from TikaLanguageIdentifierUpdateProcessorFactory, as it uses Tika too and you may be able to reuse parts of it
you wire your URP so it is used by the /update handler
that URP will kick in every time a doc is added
in the URP code, you retrieve the PDF, programmatically extract the text with Tika, and add it to the target field
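
The per-document step in the last point can be sketched independently of Solr. A real UpdateRequestProcessor is a Java class running inside Solr, so the Python below only illustrates the flow; `fetch` and `extract_text` are hypothetical callables standing in for the HTTP download and the Tika call, and the field names are taken from the question.

```python
from typing import Callable, Dict


def enrich_document(
    doc: Dict[str, str],
    fetch: Callable[[str], bytes],
    extract_text: Callable[[bytes], str],
    source_field: str = "link",
    target_field: str = "link_value",
) -> Dict[str, str]:
    """Return a copy of doc with the text extracted from doc[source_field]."""
    enriched = dict(doc)
    url = doc.get(source_field)
    if not url:
        # No link on this document: pass it through unchanged.
        return enriched
    raw = fetch(url)  # hypothetical: download the PDF bytes
    enriched[target_field] = extract_text(raw)  # hypothetical: Tika extraction
    return enriched
```

Passing the download and extraction steps in as arguments keeps the sketch testable without a network or a Tika install; in an actual URP both would be done inside the processor for each added document.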

You can map content to a specific field and supply specific field values when you're using the ExtractingRequestHandler (if you're using Tika yourself, you'll include the content as a regular document field).
To map the content to a different field, use fmap: fmap.content=link_value, and to include a literal value (i.e. the URL of the document you're indexing), use literal: literal.link=http://example.com/test.pdf (apply URL escaping as necessary).
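
As a rough sketch of what such a request could look like, the snippet below only builds the URL for the extract handler. The core URL, the document id, and the use of `stream.url` are assumptions: `stream.url` only works if remote streaming is enabled on the Solr side, otherwise the PDF bytes would have to be posted in the request body instead.

```python
from urllib.parse import urlencode


def build_extract_url(solr_core_url: str, pdf_url: str, doc_id: str) -> str:
    """Build an /update/extract URL that maps Tika's content to link_value."""
    params = {
        "literal.id": doc_id,          # assumed unique key field
        "literal.link": pdf_url,       # store the source URL as a literal value
        "fmap.content": "link_value",  # map extracted content to the target field
        "stream.url": pdf_url,         # assumption: remote streaming is enabled
        "commit": "true",
    }
    # urlencode applies the URL escaping mentioned above.
    return f"{solr_core_url}/update/extract?{urlencode(params)}"


url = build_extract_url(
    "http://localhost:8983/solr/mycore",  # hypothetical core URL
    "http://example.com/test.pdf",
    "doc-1",
)
```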

Related

How to remove highlight tags in Azure Cognitive Search documents before searching

By default, Azure Search highlights search results with the <em> tag. I've run into a situation where a user uploads a document that already contains that tag:
<em>Today</em> topic will be...
When I search for "topic", I get:
<em>Today</em> <em>topic</em> will be...
And I am not able to tell which highlight is the real one.
I know that I can modify highlight_pre_tag and highlight_post_tag to avoid this in this particular situation. But is there another way to encode these tags before applying highlights?
EDIT:
By encoding I mean getting something like this:
&lt;em&gt;Today&lt;/em&gt; <em>topic</em> will be...
So I can send it to the frontend, display the encoded &lt;em&gt; around "Today" as the literal text <em>, and use the real <em> around "topic" to highlight it in yellow.

Azure Search doesn't provide any built-in mechanism to modify the "raw" content of a document if you are using the Index API directly. However, if you are using one of our built-in indexers, you can look into the field mapper functions (such as the UrlEncode function), or create your own custom skill (if you want to apply only very specific rules), to transform the documents in transit from your data source to the search index.
Alternatively, we've seen customers use custom highlight pre and post tags that are easily recognizable (and unlikely to be mistaken for original content) and then use a simple search and replace function in their client application to transform those back into the desired tag.
For example, using
pre-tag : "!HIGHLIGHT_START!" and post-tag :"!HIGHLIGHT_END!"
and then using
String.Replace("!HIGHLIGHT_START!", "<em>")
before displaying the results in their application. That way, any client-side logic that requires finding the actual highlights can use the custom tags, while still showing the desired tag in the UX.
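
A minimal client-side sketch of that idea, which also covers the encoding asked about in the question: escape whatever HTML the document itself contained, then turn the custom markers into real tags. The function name is made up for illustration, and the marker strings are the ones from the example above.

```python
import html

PRE_TAG = "!HIGHLIGHT_START!"
POST_TAG = "!HIGHLIGHT_END!"


def render_highlight(fragment: str) -> str:
    """Escape the document's own markup, then insert real highlight tags."""
    # Escape first, so an <em> that was part of the document becomes &lt;em&gt;.
    escaped = html.escape(fragment, quote=False)
    # The markers contain no HTML-special characters, so they survive escaping.
    return escaped.replace(PRE_TAG, "<em>").replace(POST_TAG, "</em>")


rendered = render_highlight(
    "<em>Today</em> !HIGHLIGHT_START!topic!HIGHLIGHT_END! will be..."
)
# rendered == "&lt;em&gt;Today&lt;/em&gt; <em>topic</em> will be..."
```

The order matters: replacing the markers before escaping would escape the real highlight tags as well.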

How to transform/update and store incoming JSON property by applying PatternReplaceFilterFactory in SOLR 7.1.0

I am pretty new to SOLR, and we have a requirement where one of the JSON property values in an incoming request has to be modified before it is stored. Something like the one below.
e.g.
{
"name":"Google,CA,94043"
}
When I add this JSON via add/update documents in the SOLR admin UI, I want this name to be stored as just Google. So when I run a search (query) from the SOLR admin UI, it should list name as Google, not "Google,CA,94043".
I have added a FieldType with PatternReplaceFilterFactory and referenced it from the name field. The search result does not show the updated value, but when I analyze the field value (index/query) using the admin tool, the values come out correctly. I am not sure how to achieve this.
Let me know if anyone has steps on how to achieve this.

Document Conversion for PDF forms (e.g. w2/1040/etc.) as key/values instead of a single string based on font information

I am trying to use the Document Conversion service to capture JSON key/value pairs from PDF documents such as w2/1040/etc. forms.
The content of such forms comes back in the JSON response as part of the "text" under "content". The form data is missing, and mostly the form labels are rendered as a single string.
I would like to know if there is any way to capture the form data of the PDF (w2/1040/etc.) as key/values in JSON instead of a single string.
Thanks.

Unfortunately, the Document Conversion Service currently does not support forms in PDFs. At most, it may recognize some of the forms as tables, but not as key/value pairs.
If it recognizes a form as a table, you still would need to do some non-trivial post-processing to map it to key/value pairs.

Set Title and Id With Retrieve and Rank Web Interface

I have used IBM Watson Retrieve and Rank Web Interface to create a collection of html articles. Via the web interface I was able to upload my html articles. The problem is when I query the collection the data for id and title are not usable. Here is the query I made in the browser:
https://MY-USER-NAME:MY-PASSWORD@gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/MY-CLUSTER/solr/MY-COLLECTION/select?q=what is the basic mechanism of the transonic aileron buzz&wt=json&fl=id,title
The response I get is:
{"responseHeader":{"status":0,"QTime":106,"params":{"q":"what is the basic mechanism of the transonic aileron buzz","fl":"id,title","wt":"json"}},"response":{"numFound":12,"start":0,"docs":[{"id":"6a06f47c-cb3f-4791-9914-c84772eb9415","title":"no-title"}.....
The id and title values in that response are the problem. When using the web interface, is there a way to set the title and id when uploading documents? Or, better yet, is there another way I can query my collection to get the file name of the document I uploaded and/or the text from the document?

When using the web interface is there a way to set the title and id when uploading documents?
No, sorry.
However, if you upload the documents yourself from outside of the web interface, you can specify the title and ID (and the documents will be shown in the web interface when you come back to it).
is there another way I query my collection to get the file name of the document I uploaded
Yes
In the query you posted above, the last parameters you have are the fields you want to retrieve
&fl=id,title
You're retrieving the ID and the title.
If you want the name of the file that the content came from, add fileName. For example:
https://MY-USER-NAME:MY-PASSWORD@gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/MY-CLUSTER/solr/MY-COLLECTION/select?q=what is the basic mechanism of the transonic aileron buzz&wt=json&fl=id,title,fileName
is there another way I query my collection to get text from the document
Yes.
Similar to above, you just need to update the list of fields that you retrieve. The contents of the doc are put in a field called body.
So to get the ID, title, and the body, you could use:
https://MY-USER-NAME:MY-PASSWORD@gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/MY-CLUSTER/solr/MY-COLLECTION/select?q=what is the basic mechanism of the transonic aileron buzz&wt=json&fl=id,title,body
That gets you a plain text version of the contents. If you want the HTML, use contentHtml instead.
https://MY-USER-NAME:MY-PASSWORD@gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/MY-CLUSTER/solr/MY-COLLECTION/select?q=what is the basic mechanism of the transonic aileron buzz&wt=json&fl=id,title,contentHtml

Returning web page abstract with Solr

I've crawled a site with Nutch successfully and am trying to return a highlighted abstract using Solr as the indexer/searcher. So, if I query "ocean" then I want to return a 20-30 word abstract from just the text of the web page (not the title or url) containing that query term.
I've copied the Nutch schema.xml as my Solr schema.xml.
So I have two questions:
1. Is the "content" field in the Nutch schema.xml the field for body elements of a web page?
2. If this field is not stored, is there a way to have Solr retrieve that field at search time so that it can be highlighted?

I haven't used Nutch in a long time, but I think it's pretty safe to assume that "content" is the field you want to highlight.
You need to store the field to be able to use highlighting, and if you want to use the FastVectorHighlighter you also need to enable the following attributes for that field: termVectors, termPositions and termOffsets.
If you use FVH, you can also use boundaryScanner in Solr 3.5 and up.
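
For the query side, here is a hedged sketch of the highlighting parameters such a request might carry. The function name and the fragment size are assumptions. Note that `hl.fragsize` counts characters rather than words, so a 20-30 word abstract is only approximated. `hl.useFastVectorHighlighter` is the switch from the Solr releases of that era; newer releases select the highlighter with `hl.method` instead.

```python
from urllib.parse import urlencode


def build_highlight_query(term: str, field: str = "content") -> str:
    """Build a select query string that asks Solr for one highlighted snippet."""
    params = {
        "q": f"{field}:{term}",
        "hl": "true",
        "hl.fl": field,       # highlight only the page body, not title or url
        "hl.snippets": 1,
        "hl.fragsize": 200,   # characters, not words (assumed size)
        "hl.useFastVectorHighlighter": "true",  # needs the term vector attributes
    }
    return "select?" + urlencode(params)


query = build_highlight_query("ocean")
```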