I've crawled a site with Nutch successfully and am trying to return a highlighted abstract using Solr as the indexer/searcher. So, if I query "ocean" then I want to return a 20-30 word abstract from just the text of the web page (not the title or url) containing that query term.
I've copied the Nutch schema.xml as my Solr schema.xml.
So I have two questions:
1. Is the "content" field in the Nutch schema.xml the field that holds the body text of a web page?
2. If this field is not stored, is there a way to have Solr retrieve that field at search time so that it can be highlighted?
I haven't used Nutch in a long time, but I think it's pretty safe to assume that "content" is the field you want to highlight.
The field must be stored to be highlighted, and if you want to use FastVectorHighlighter you need to enable the following attributes for that field: termVectors, termPositions, and termOffsets.
If you use FVH, you can also use the boundaryScanner parameter in Solr 3.5 and up.
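As a sketch, the field definition in schema.xml might look like this (the field name comes from the question; the field type and the request parameters below are assumptions):

```xml
<!-- Stored so the highlighter can return snippets; the term vector
     attributes enable FastVectorHighlighter. -->
<field name="content" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```

A query could then request snippets with parameters along the lines of hl=true&hl.fl=content&hl.useFastVectorHighlighter=true&hl.fragsize=200, which returns highlighted fragments from the content field only, leaving title and url untouched.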
I have a collection of thousands of documents/PDFs with a lot of fields: url, title, date, etc. But there is no content field, which seems like something that must exist for you to be able to search by keywords across the entire document, not just the title. I see some people saying that the content field is usually generated automatically when you index.
How do I go about adding a content field that contains all the text of the PDFs/DOCs? I am on Solr 6, so I know I need to use the Schema API to create a new field in the managed-schema. But after that, how do I re-index my collection? And if I just name the new field "content", will Solr know that it should contain all the text of my PDFs/DOCs when reindexing?
Creating a "content" field did not work! Instead, I set stored=true for my _text_ field and everything worked.
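For reference, the relevant pieces in the default managed-schema look roughly like this (a sketch; the catch-all field ships with stored="false", which is what the change above flips to true):

```xml
<!-- Catch-all field; setting stored="true" lets its contents be returned -->
<field name="_text_" type="text_general" indexed="true" stored="true" multiValued="true"/>
<!-- Copy every field into the catch-all so keyword search covers the whole document -->
<copyField source="*" dest="_text_"/>
```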
I am pretty new to Solr, and we have a requirement where one of the JSON property values from the incoming request has to be modified before it is stored. Something like the one below:
e.g.
{
"name":"Google,CA,94043"
}
When I add this JSON via the add/update documents screen in the Solr admin UI, I want the name to be stored as just "Google", so that when I run a query from the Solr admin UI it lists the name as "Google", not "Google,CA,94043".
I have added a field type with PatternReplaceFilterFactory and referenced it from the name field, but the results do not show the updated value. When I analyze the field value (index/query) using the admin tool, the values come out correctly. I am not sure how to achieve this.
Let me know if anyone has steps on how to achieve this.
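For reference, a field type along the lines described might look like this (a sketch; the type name and the pattern, which keeps only the part before the first comma, are assumptions). Note that analysis-chain filters change only the indexed tokens, not the stored value that Solr returns in results, which is consistent with the Analysis screen showing the transformed values while results show the original:

```xml
<fieldType name="name_trimmed" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Strip everything from the first comma onward:
         "Google,CA,94043" -> "Google" (indexed tokens only) -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern=",.*" replacement="" replace="first"/>
  </analyzer>
</fieldType>
```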
I am using the Solr (6.5.1) suggester to return autocomplete results.
I am trying to display a price and a thumbnail with the autocomplete results but can't find a way to do this.
Is there a way to return more fields?
I see these two questions from two years ago that seem to be trying to accomplish what I want, and both say that at the time it was not doable.
Solr Suggestion with multiple payloads
Returning an entire Document on Solr Suggestion
Has anything changed since two years ago?
Is there a different way that this can be accomplished?
Just put all the info you need into a field and use that field as the payload. For example, you could:
append the pieces as strings, separated by |: payload:"17|/path/to/thumbnail"
or use a Solr BinaryField and store a serialized Java POJO containing the info you need
I would go the simple route, the first one.
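A minimal client-side sketch of the first approach (the field contents and the | separator are just the example values from above, not anything Solr requires):

```java
// Sketch: pack extra suggestion data (price, thumbnail path) into one
// payload string with "|" as a separator, and split it back apart after
// a suggest request returns.
public class PayloadDemo {

    // Build the payload string stored alongside the suggestion.
    static String buildPayload(String price, String thumbnailPath) {
        return price + "|" + thumbnailPath;
    }

    // Split a payload returned by the suggester back into its parts.
    // The limit of 2 keeps any "|" inside the path intact.
    static String[] parsePayload(String payload) {
        return payload.split("\\|", 2);
    }

    public static void main(String[] args) {
        String payload = buildPayload("17", "/path/to/thumbnail");
        String[] parts = parsePayload(payload);
        System.out.println(parts[0]); // "17"
        System.out.println(parts[1]); // "/path/to/thumbnail"
    }
}
```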
I have a requirement where the incoming update request has metadata like "link":"http://example.pdf" (along with some other metadata), and I have to parse the PDF document and index its extracted contents in another field, like "link_value":"PDF extracted contents". Is this possible in Solr using Tika?
NOTE: I cannot use the Data Import Handler, since the incoming requests do not come from a single source and are sent by external sources.
So, if I understand correctly:
you are getting some /update call to add some doc
the doc contains a 'link' field whose target you want to retrieve, extract text from with Tika, and index into another field
Yes you can do this in Solr, but you need to do some work:
set up an UpdateRequestProcessor; you could start from TikaLanguageIdentifierUpdateProcessorFactory, as it uses Tika too and you may be able to reuse some of its code
you wire your URP so it is used by the /update handler
that URP will kick in every time a doc is added
in the URP code, you retrieve the PDF, programmatically extract the text with Tika, and add it to the target field
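The wiring from the steps above might look roughly like this in solrconfig.xml (a sketch; the chain name and the com.example.LinkExtractProcessorFactory class are hypothetical placeholders for your own URP):

```xml
<updateRequestProcessorChain name="extract-link">
  <!-- Custom URP: fetches the URL in "link", extracts text with Tika,
       and writes it into "link_value" (hypothetical class) -->
  <processor class="com.example.LinkExtractProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- Make the /update handler use the chain by default -->
<initParams path="/update/**">
  <lst name="defaults">
    <str name="update.chain">extract-link</str>
  </lst>
</initParams>
```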
You can map content to a specific field and supply specific field values when you're using the ExtractingRequestHandler (if you're using Tika yourself, you'll include the content as a regular document field).
To map the content to a different field, use fmap: fmap.content=link_value, and to include a literal value (i.e. the URL of the document you're indexing), use literal: literal.link=http://example.com/test.pdf (apply URL escaping as necessary).
I'm trying to write a plugin for Nutch, based on http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html, to get a custom title finder.
This works well, and storing the extracted titles in a new field is no problem. But I want to use it in Solr instead of the default title. The problem is that Solr then needs a multiValued field, as I have two title fields.
metadata.remove("title");
didn't work.
I really want to use the new title instead of the default one created by Nutch. Any suggestions?
Why don't you put your title in a different field, so it will be handled properly?