Solr No "Content" Field in Collection after Indexing PDFs/DOCs - solr

I have a collection of thousands of documents/PDFs with a lot of fields such as url, title, date, etc. But there is no content field, which seems like it must exist in order for you to be able to search by keywords across the entire document, not just the title. I see some people saying that the content field is usually generated automatically when you index.
How do I go about adding a content field that contains all the text in the PDFs/DOCs? I am on Solr 6, so I know I need to use the Schema API to create a new field, since the collection uses managed-schema. But after that, how do I re-index my collection? And if I just name the new field "content", will Solr know that the "content" field should contain all the text in my PDFs/DOCs when it's reindexing?

Creating a "content" field did not work! Instead, I set stored=true for my _text_ field and everything worked.
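
For anyone wanting to script the stored=true change described above, here is a minimal sketch that builds a Schema API "replace-field" request for the _text_ field. The collection name, the base URL, the text_general type and the multiValued setting are assumptions based on a default configset, so check them against your own schema first. The sketch only builds the request and does not send it.

```python
import json


def build_replace_field_request(collection, base_url="http://localhost:8983/solr"):
    """Return (url, body) for a Schema API call that makes _text_ stored.

    base_url and collection are placeholders for your own setup.
    """
    url = f"{base_url}/{collection}/schema"
    payload = {
        # replace-field needs the full field definition, not just the changed attribute
        "replace-field": {
            "name": "_text_",
            "type": "text_general",  # assumption: type used by the default configset
            "indexed": True,
            "stored": True,          # the change reported to work above
            "multiValued": True,     # assumption: _text_ is usually a copyField target
        }
    }
    return url, json.dumps(payload)


url, body = build_replace_field_request("mycollection")
print(url)
print(body)
```

The body would be POSTed to that URL with Content-Type application/json. Documents indexed before the change keep no stored value for the field, so they have to be sent to Solr again afterwards.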

Related

How can I use RelationshipFilter in 2sxc Visual Query with text field instead of entity field?

I'm trying to do the same as this example does under the title "Attribute-On-Relationship to Query other Fields".
I'm editing the Blog application's visual query.
So I have a RelationshipFilter, which takes entities of type Category via the Default in point. I want to filter them by the field Name. I can get the list of names either from params or from the list of posts and their categories. That's not a problem as far as I understand.
It looks like Name has to be of entity type. I'm struggling with this filter right now, since I want to filter Category by the field Name, which is a simple text type. That means I have nothing to specify in Relationship Attribute. Neither EntityTitle nor an empty Relationship Attribute field works; both cause a Bad Request error. So is there a way to make it work?
P.S. ValueFilter is not an option, since it doesn't support returning nothing when no items satisfy the condition. It also only supports filtering on an item's Attribute that contains Value, with no option for Value to contain any of the entries in Attribute split by a separator.
The RelationshipFilter is only meant for relationships (item with item) - and you seem to want to do a string-compare.
I'm not really sure what you should do because I don't have context, but if things get really special, best use LINQ instead. Check out the tutorials for LINQ here: https://2sxc.org/dnn-tutorials/en/razor/linq/home

How to transform/update and store an incoming JSON property by applying PatternReplaceFilterFactory in Solr 7.1.0

I am pretty new to Solr, and we have a requirement where I have to modify one of the JSON property values in an incoming request before it gets updated and stored. Something like the one below.
e.g.
{
"name":"Google,CA,94043"
}
When I add this JSON via add/update documents in the Solr admin UI, I want this name to be stored as just Google. So when I run a search (query) from the Solr admin, it should list the name as Google, not "Google,CA,94043".
I have added a FieldType with PatternReplaceFilterFactory and referenced it from the name field. The query result does not show the updated value. But when I analyze the field value (index/query) using the admin tool, the values come out correctly. Not sure how to achieve this.
Let me know if anyone has steps on how to achieve this.
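
The behaviour described here matches how Solr handles analysis: filters in a field type's analyzer change the indexed tokens, while the stored value returned by a query is the original input. One possible workaround, sketched as an assumption and not taken from this thread, is to trim the value on the client before the document is sent to Solr. The function names below are hypothetical.

```python
def clean_name(raw_name):
    """Keep only the text before the first comma."""
    return raw_name.split(",", 1)[0].strip()


def prepare_document(doc):
    """Return a copy of the document with the name field cleaned (hypothetical helper)."""
    cleaned = dict(doc)
    if "name" in cleaned:
        cleaned["name"] = clean_name(cleaned["name"])
    return cleaned


print(prepare_document({"name": "Google,CA,94043"}))  # prints {'name': 'Google'}
```

If the change has to happen on the Solr side, an update request processor such as RegexReplaceProcessorFactory in the update chain is another option to look into, since it modifies the field value before it is stored.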

Solr Highlighting - Display Snippet

I have successfully set up highlighting in Solr 4. I am mainly indexing docx, xlsx and PDF files, so I just have fields like url, title and content.
I have Solr highlighting the content field and it displays a small snippet of text, but sometimes the matched word is in the title rather than the content, and then it does not return a snippet of text.
Is there any way of returning even just the first line or two from the content field so that it is not left blank?
I guess your query URL looks like q=(title:ABC OR content:ABC)&hl=true&hl.fl=title,content
Try adding hl.alternateField=content to the query.
Use the fl=content parameter with your query. If no highlighted content is returned, generate the snippet from the content field (fl=content) returned with each document in the result set.
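
Both suggestions above can be sketched together as follows. The query shape is the one guessed in the answer; hl.maxAlternateFieldLength is added so the fallback text stays short. The 200-character length and the helper names are assumptions for illustration, and the sketch only builds the query string without contacting Solr.

```python
from urllib.parse import urlencode


def build_highlight_query(term, snippet_length=200):
    """Build the query string for a highlighted search with a fallback field."""
    params = {
        "q": f"title:{term} OR content:{term}",
        "fl": "url,title,content",
        "hl": "true",
        "hl.fl": "title,content",
        # used when no snippet can be generated for a highlighted field
        "hl.alternateField": "content",
        "hl.maxAlternateFieldLength": str(snippet_length),
    }
    return urlencode(params)


def snippet_for(doc_id, doc, highlighting, length=200):
    """Prefer the highlighted content snippet, else cut one from the stored content."""
    fragments = highlighting.get(doc_id, {}).get("content", [])
    if fragments:
        return fragments[0]
    content = doc.get("content", "")
    if isinstance(content, list):  # content may be a multi-valued field
        content = " ".join(content)
    return content[:length]


print(build_highlight_query("ABC"))
```

Here snippet_for expects the "highlighting" section of the Solr response, which is keyed by document id and then by field name.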

Returning web page abstract with Solr

I've crawled a site with Nutch successfully and am trying to return a highlighted abstract using Solr as the indexer/searcher. So, if I query "ocean" then I want to return a 20-30 word abstract from just the text of the web page (not the title or url) containing that query term.
I've copied the Nutch schema.xml as my Solr schema.xml.
So I have two questions:
1. Is the "content" field in the Nutch schema.xml the field for body elements of a web page?
2. If this field is not stored, is there a way to have Solr retrieve that field at search time so that it can be highlighted?
I haven't used Nutch in a long time, but I think it's pretty safe to assume that "content" is the field you want to highlight.
You need to store the field to be able to use highlighting, and if you want to use the FastVectorHighlighter you need to enable the following attributes for that field: termVectors, termPositions and termOffsets.
If you use FVH, you can also use boundaryScanner in Solr 3.5 and up.
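
As an illustration of the attributes listed above, this small sketch generates the schema.xml field element with them switched on. The field type name "text" is an assumption; use whichever type the Nutch schema.xml already assigns to content.

```python
import xml.etree.ElementTree as ET


def content_field_xml(name="content", field_type="text"):
    """Return a schema.xml <field> element prepared for FastVectorHighlighter."""
    field = ET.Element("field", {
        "name": name,
        "type": field_type,       # assumption: replace with the type from your schema
        "indexed": "true",
        "stored": "true",         # required for any highlighting
        "termVectors": "true",    # the three attributes the FVH needs
        "termPositions": "true",
        "termOffsets": "true",
    })
    return ET.tostring(field, encoding="unicode")


print(content_field_xml())
```

After changing the field definition the documents have to be re-indexed, and the request has to ask for the FVH explicitly, in Solr versions of that era with hl.useFastVectorHighlighter=true.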

Need plugin to overwrite default title

I'm trying to write a plugin for Nutch based on http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html to get a custom title finder.
This works well, and storing the extracted titles in a new field is no problem. But I want to use it in Solr instead of the default title. The problem is that Solr then needs a multivalued field, as I have 2 title fields.
metadata.remove("title");
didn't work.
I really want to use the new title instead of the default one created by Nutch. Any suggestions?
Why don't you put your title in a different field, so that it is handled properly?
