In Solr 4 - How do I include file names in the index?


I am building a search engine with Solr 4.8.1, and I am trying to display the file name of each indexed document in my GUI search results.
I can successfully display any field that is in Solr's schema.xml file (title, author, id, resourcename, last_modified, etc.). I cannot, however, find a field in schema.xml that holds the name of the file (such as "Test" for the file Test.pdf, or "Example" for Example.docx).
The closest field I can find is "resourcename", which holds the entire file path on my system (e.g. C:\Users\myusername\Documents\solr-4.8.1\example\exampledocs\filename.docx, when all I want to display is filename.docx).
(1) How do I tell Solr to index the name of a file?
or
(2) Is there a field that covers the file name that I am just missing?
Sincerest thanks!
---Research Update---
It seems this question is asking for the same thing - Solr return file name - however, I do not believe that simply adding a field called "filename" will cause Solr to index the file name! I know I need to add a field to the schema.xml file - now how do I point that field to the name of a file?

This is not so much a question about Solr functionality as it is about the tool you use to publish documents to Solr. Adding a new field called fileName to the schema resolves only part of the issue; you also need to modify the publishing tool so that it adds the file name value (e.g. testPDF.pdf) to each document it sends. I would look at Tika (http://tika.apache.org/), seeing how you mention both PDF and DOC files.
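As a rough illustration of what the publishing tool could do before sending a document, here is a minimal Java sketch that derives the bare file name from a full path such as the one stored in resourcename. The class and method names (FileNames, baseName, stripExtension) are made up for this example and are not part of Solr or Tika. It uses plain string handling on the assumption that paths may use either separator, since java.nio.file on a non-Windows JVM does not split on backslashes.

```java
final class FileNames {
    // Returns the part of a path after the last '/' or '\' separator.
    static String baseName(String path) {
        int cut = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        return path.substring(cut + 1);
    }

    // Drops the final ".ext" suffix, if there is one (Test.pdf -> Test).
    static String stripExtension(String name) {
        int dot = name.lastIndexOf('.');
        return dot > 0 ? name.substring(0, dot) : name;
    }

    public static void main(String[] args) {
        // Path taken from the question; the value would go into a field you define yourself.
        String resourceName =
                "C:\\Users\\myusername\\Documents\\solr-4.8.1\\example\\exampledocs\\filename.docx";
        String fileName = baseName(resourceName);
        System.out.println(fileName);                 // filename.docx
        System.out.println(stripExtension(fileName)); // filename
    }
}
```

The resulting string is what you would put into your own file name field when publishing; which field name you use is up to your schema.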

Related

solr query to return matched text to regex with default schema

I want to search Solr for server names in a set of Microsoft Word documents, PDF, and image files like jpg, gif.
Server names are given by the regular expression (regex):
INFP[a-zA-Z0-9]{3,9}
TRKP[a-zA-Z0-9]{3,9}
PLCP[a-zA-Z0-9]{3,9}
SQRP[a-zA-Z0-9]{3,9}
....
Problem
I want to get the text in the documents matching the regex. eg. INFPWSV01, PLCPLDB01.
I've indexed the files using Solr/Tika/Tesseract with the default schema.
I've used the highlight search tool
hl ticked
hl.usePhraseHighlighter ticked
Solr only returns the metadata (presumably) like filename for the file containing the pattern(s).
Questions
Would I have to modify the managed schema?
If so would I have to save the file content in the schema?
If so is this the way to do it:
a. solrconfig.xml <- inside my "core"
b. Remove line
as I want meta data
c. Change this in the managed schema
stored=false to stored=true
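One possible client-side approach, sketched here under the assumption that the extracted text has been stored and is returned to your application, is to run the regex over the returned content yourself. The class name ServerNames is hypothetical, the prefix list only covers the four patterns shown above, and the character class is assumed to be meant as [a-zA-Z0-9].

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ServerNames {
    // Four prefixes from the question, followed by 3 to 9 alphanumeric characters.
    // The word boundaries keep the pattern from matching inside longer tokens.
    private static final Pattern SERVER_NAME =
            Pattern.compile("\\b(?:INFP|TRKP|PLCP|SQRP)[a-zA-Z0-9]{3,9}\\b");

    // Returns every server name found in the text, in order of appearance.
    static List<String> find(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = SERVER_NAME.matcher(text);
        while (m.find()) {
            names.add(m.group());
        }
        return names;
    }

    public static void main(String[] args) {
        // The sample sentence is invented; the two names are the examples from the question.
        System.out.println(find("Web tier runs on INFPWSV01, database on PLCPLDB01."));
    }
}
```

This does not answer whether Solr itself can return the matched text; it only shows the matching step once the document content is available to the client.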

Where is the schema definition of PDF index in SOLR

All, I have succeeded in indexing a PDF file into Solr with post.jar.
I can see the indexed file when I run a query.
But I was wondering where fields like id, stream_content_type, pdf_pdfversion, etc. come from. I searched for them in schema.xml but have not found them yet. Where are they defined? Did I miss something? Thanks.

This is the metadata stored by Apache Tika.
In addition to Tika's metadata, Solr adds the following metadata
(defined in ExtractingMetadataConstants):
https://wiki.apache.org/solr/ExtractingRequestHandler#Metadata
Documentation
Metadata
As has been implied up to now, Tika produces Metadata about the
document. Metadata often contains things like the author of the file
or the number of pages, etc. The Metadata produced depends on the type
of document submitted. For instance, PDFs have different metadata from
Word docs.
In addition to Tika's metadata, Solr adds the following metadata
(defined in ExtractingMetadataConstants):
"stream_name" - The name of the ContentStream as uploaded to Solr. Depending on how the file is uploaded, this may or may not be set.
"stream_source_info" - Any source info about the stream. See ContentStream.
"stream_size" - The size of the stream in bytes(?)
"stream_content_type" - The content type of the stream, if available.
It is highly recommended that you try using the extract only option to see what values actually get set for these.

SOLR Search results with associated file

I am using Solr search (Solr 4.x) and everything is working as expected. I now have a requirement to show the associated file along with the search results.
I am getting the search results but not the files. How do I get them? At a minimum, I expect the file name along with the search results.
Thanks for the help.

Solr is a generic enterprise search server. It does not know anything about files or where the data it indexes comes from. You will have to do this on your own.
The schema (schema.xml) defines which fields get indexed. When you design your schema, you have to decide what is stored and in what way.
If you want the file names back, you will have to add them to your index yourself, by first providing a field in your schema and then filling that field every time you add something to your index.
You probably do not want to tokenize your file name, unless you want to search on it, too. If your file name includes a full path, it can be considered unique and you could use it as your id, too.
If you add it via XML, all you need is a new field in your doc list, e.g.
<doc>
...
<field name="filename">/some/path/basename.extension</field>
...
</doc>
If you are using solrj, it will look something like this:
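// host is the base URL of your Solr core; document stands for your own file object
HttpSolrServer server = new HttpSolrServer(host);
SolrInputDocument doc = new SolrInputDocument();
doc.addField("filename", document.getFilename());
Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
docs.add(doc);
server.add(docs);
server.commit(); // make the added documents visible to searches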

Get text snippet from search index generated by solr and nutch

I have just configured Nutch and Solr to successfully crawl and index text on a web site by following the getting started tutorials. Now I am trying to make a search page by modifying the example Velocity templates.
Now to my question: how can I tell Solr to provide a relevant text snippet of the content of the hits? I only get the following fields associated with each hit:
score, boost, digest, id, segment, title, date, tstamp and url.
The content really is indexed, because I can search for words that I know are only in the fulltext, but I still don't get the fulltext back with the hit.

Don't forget: indexed is not the same as stored.
You can search for words in a document if its fields are indexed, even if no field is stored.
To get the content of a specific field back, that field must also have stored=true in schema.xml.
If your fulltext field is stored, then the default field list settings probably do not include it.
You can add it by using the fl parameter:
http://<solr-url>:port/select/?......&fl=mytext,*
(this example assumes your fulltext is stored in a field called mytext)
Finally, if you would like only a snippet of the text containing the searched words (not the whole text), look at the highlighting component of Solr/Lucene.
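To make the two options concrete, here is a small Java sketch that builds a select URL asking for the stored field via fl and for highlighted snippets via hl and hl.fl. The class name SnippetQuery, the base URL http://localhost:8983/solr and the sample query are assumptions for illustration; mytext is the example field name used above. Highlighting is only expected to return snippets if the field is stored.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class SnippetQuery {
    // URL-encodes one parameter value as UTF-8.
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Builds a select URL that returns textField plus all other stored fields (fl)
    // and asks Solr to highlight matches in textField (hl, hl.fl).
    static String build(String solrBase, String userQuery, String textField) {
        return solrBase + "/select?q=" + enc(userQuery)
                + "&fl=" + enc(textField + ",*")
                + "&hl=true"
                + "&hl.fl=" + enc(textField);
    }

    public static void main(String[] args) {
        // Base URL and query text are placeholders for your own setup.
        System.out.println(build("http://localhost:8983/solr", "full text", "mytext"));
    }
}
```

In the response, the highlighted snippets normally come back in a separate highlighting section keyed by document id rather than inside each hit, so the search page has to look them up per result.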

SOLR Tika: add text of file to existing record (ExtractingRequestHandler)

I am indexing posts in SOLR with "name", "title", and "description" fields. I'd like to later be able to add a file (like a Word doc or a PDF) using Tika / the ExtractingRequestHandler.
I know I can add documents like so: (or through other interfaces)
curl 'http://localhost:8983/solr/update/extract?literal.id=post1&commit=true' -F "myfile=@tutorial.html"
But this replaces the correct post (post1 above) -- is there a parameter I can pass to have it only add to the record?

In Solr (ver < 4.0) you can't modify fields in a document. You can only delete or add/replace whole documents. Therefore, when "appending" a file to the Solr document you have to rebuild your document from its current values (using literal), i.e. query for the document and then:
http://localhost:8983/solr/update/extract?literal.id=post1&literal.name=myName&literal.title=myTitle&literal.description=myDescription&commit=true
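The rebuild step can be sketched in Java as follows: take the field values you fetched for the existing document and turn each one into a literal.<field> parameter on the extract URL. The class name ExtractUrl is hypothetical, the field values are the placeholders from the URL above, and the sketch assumes single-valued fields only. The file itself still has to be sent as the request body, as in the curl example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class ExtractUrl {
    // URL-encodes one parameter name or value as UTF-8.
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Turns the current field values of a document into literal.<field> parameters.
    static String build(String solrBase, Map<String, String> currentFields) {
        StringBuilder url = new StringBuilder(solrBase).append("/update/extract?");
        for (Map.Entry<String, String> field : currentFields.entrySet()) {
            url.append("literal.").append(enc(field.getKey()))
               .append('=').append(enc(field.getValue())).append('&');
        }
        return url.append("commit=true").toString();
    }

    public static void main(String[] args) {
        // In practice these values would come from querying Solr for the existing document.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", "post1");
        fields.put("name", "myName");
        fields.put("title", "myTitle");
        fields.put("description", "myDescription");
        System.out.println(build("http://localhost:8983/solr", fields));
    }
}
```

A LinkedHashMap is used only so the parameters appear in insertion order; Solr does not depend on the order.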
