I am new to Apache Solr and am working on a package. The index created by Solr has only .cfs, .gen, insegmentparents, and .del files.

I know it contains the header and file data in raw format, but does this mean that every time I query the index, the raw data is processed to find the frequency of terms, since I cannot see a .frq file? Is there any way to find out how the data is stored in the .cfs file?

The index file format is compound, hence the .cfs file, which has all the individual index files (including the term frequency data) combined into one.
Check File Formats, which gives the details of the Lucene index file formats.
You can use Luke to explore your Lucene index files.
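If you want to check this yourself, here is a minimal Lucene sketch (Lucene 5+ APIs assumed; the index path and the field name "text" are placeholders) that reads term statistics straight from the postings packed inside the .cfs, with no re-parsing of raw documents:

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class InspectCfsIndex {
    public static void main(String[] args) throws Exception {
        // Open the Solr data directory that contains the .cfs/.gen/.del files.
        try (DirectoryReader reader =
                 DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/solr/data/index")))) {
            for (LeafReaderContext leaf : reader.leaves()) {
                Terms terms = leaf.reader().terms("text");   // hypothetical field name
                if (terms == null) continue;
                TermsEnum te = terms.iterator();
                BytesRef term;
                while ((term = te.next()) != null) {
                    // docFreq/totalTermFreq are read from the postings stored in the
                    // compound file, not recomputed from the original documents.
                    System.out.println(term.utf8ToString()
                        + " docFreq=" + te.docFreq()
                        + " totalTermFreq=" + te.totalTermFreq());
                }
            }
        }
    }
}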

Related

Where is the schema definition of PDF index in SOLR

All, I succeeded in indexing a PDF file into Solr with post.jar.
I can see the file was indexed when I query it.
But I was wondering where fields like id, stream_content_type, pdf_pdfversion, etc. come from. I tried to search for them in schema.xml but have not found them yet. Where are they defined? Did I miss something? Thanks.
This is the metadata stored by Apache Tika. In addition to Tika's metadata, Solr adds its own fields (defined in ExtractingMetadataConstants); see the ExtractingRequestHandler documentation:
https://wiki.apache.org/solr/ExtractingRequestHandler#Metadata
Documentation
Metadata
As has been implied up to now, Tika produces Metadata about the
document. Metadata often contains things like the author of the file
or the number of pages, etc. The Metadata produced depends on the type
of document submitted. For instance, PDFs have different metadata from
Word docs.
In addition to Tika's metadata, Solr adds the following metadata
(defined in ExtractingMetadataConstants):
"stream_name" - The name of the ContentStream as uploaded to Solr.
Depending on how the file is uploaded, this may or may not be set.
"stream_source_info" - Any source info about the stream. See
ContentStream. "stream_size" - The size of the stream in bytes(?)
"stream_content_type" - The content type of the stream, if available.
It is highly recommended that you try using the extract-only option to see what values actually get set for these.
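To illustrate that extract-only suggestion, here is a minimal SolrJ sketch (Solr 4.x SolrJ assumed; the core URL and file name are placeholders) that runs an extract-only request and prints whatever content and metadata come back, without indexing anything:

import java.io.File;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;

public class ExtractOnlyDemo {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("sample.pdf"), "application/pdf");
        req.setParam("extractOnly", "true");   // return extracted content and metadata, do not index

        // The response includes the Tika metadata plus the stream_* fields described above.
        NamedList<Object> response = server.request(req);
        System.out.println(response);

        server.shutdown();
    }
}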

Index binary file as part of a document

Is it possible to provide an XML file for Solr to index and have PDF, Word, ... files alongside it that need to be indexed together with the XML?
I know that Tika can index PDF/Word, but my binary files are actually "attachments" to a file I'm indexing.

In Solr 4 - How do I include file names in the index?

I am building a search engine with Solr 4.8.1. In doing so, I am attempting to display the file name of each indexed document in my GUI search results.
I can successfully display any field that is in Solr's schema.xml file (title, author, id, resourcename, last_modified, etc.). I cannot, however, find a field in schema.xml that holds the name of the file (such as, for the file Test.pdf, the name "Test", or for Example.docx, the word "Example").
The closest field I can find is "resourcename", which displays the entire file path on my system (e.g. C:\Users\myusername\Documents\solr-4.8.1\example\exampledocs\filename.docx), when all I want to display is filename.docx.
(1) How do I tell solr to index the name of a file?
or
(2) Is there a field that cover the file name that I am just missing?
Sincerest thanks!
---Research Update---
It seems this question is asking for the same thing - Solr return file name - however, I do not believe that simply adding a field called "filename" will cause Solr to index the file name! I know I need to add a field to schema.xml - now how do I point that field to the name of a file?
This is not so much a question about Solr functionality as about the tools you use to publish to Solr. Adding a new field called fileName to the Solr schema resolves part of the issue; the other part is modifying the publishing tool to set that field (e.g. to testPDF.pdf) on each document it sends. I guess I'd point my eyes at Tika: http://tika.apache.org/ , seeing how you mention both PDF and DOC files.
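As a sketch of that second part (Solr 4.x SolrJ assumed; "fileName" is a hypothetical field you would add to schema.xml yourself), the publishing tool can pass the bare file name through the extracting handler's literal.* parameters:

import java.io.File;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class IndexWithFileName {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        File file = new File("C:/Users/myusername/Documents/solr-4.8.1/example/exampledocs/Test.pdf");

        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(file, "application/pdf");
        // literal.* parameters set plain field values alongside the Tika-extracted content.
        req.setParam("literal.id", file.getName());
        req.setParam("literal.fileName", file.getName());   // just "Test.pdf", not the full path
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        server.request(req);
        server.shutdown();
    }
}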

Is it possible to parse text documents while loading in Solr?

I have a text file which contains some data on each line. Each line can be thought of as a database record, with the fields in that record separated by semicolons. We will consider each line in the file as a separate document for indexing purposes. For example, consider the following couple of lines from the file:
1.0.5.32;1.0.5.47;aus;vic;richmond;broadband;-1;-37.8186;144.999;3121;36;28389;43552;3;au;21;0;100;100;100;100;+1100;y;
1.0.5.48;1.0.5.63;aus;vic;melbourne;broadband;-1;-37.8143;144.963;3000;36;28389;5601;3;au;5;0;100;100;100;100;+1100;y;
In the example above, we have 2 documents that are to be indexed and each document has 22 fields.
Is it possible to load this text file into Solr and index each line as a separate document, with Solr parsing each document based on the delimiter (a semicolon in this case) and extracting the fields?
If not, is there any way to preprocess the document to convert it into a form that Solr understands?
Look into the Solr wiki; your case is clearly described here:
http://wiki.apache.org/solr/UpdateCSV
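As a rough sketch of what that wiki page describes (Solr 4.x SolrJ and the /update/csv handler assumed; the field names are invented placeholders you would replace with your own), loading the file could look something like this:

import java.io.File;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class LoadSemicolonFile {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/csv");
        req.addFile(new File("records.txt"), "application/csv");
        req.setParam("separator", ";");    // split each line on semicolons
        req.setParam("header", "false");   // the file has no header row
        // One name per column; these names are placeholders, and the trailing ';'
        // on each line may add an empty column you need to account for.
        req.setParam("fieldnames",
            "ip_from,ip_to,country,region,city,conn_type,col7,lat,lon,postcode,col11,col12,"
            + "col13,col14,country_code,col16,col17,col18,col19,col20,col21,timezone,flag");
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        server.request(req);
        server.shutdown();
    }
}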

SOLRJ and indexing files

I'm trying to index an email message completely, with the subject, body, and all attachments. For indexing I'm using the common SolrInputDocument. How can I add attachments to the document to be indexed? I have found a similar post here, SolrJ keeps indexed files open, but it only shows how to index files separately from the document data. How can I index files as part of the other email message data, like subject, body, sender, etc.?
Do you also want the text inside the attachments to be searchable? If yes, then take a look at Tika, which helps with reading files in RTF, PDF, etc. formats.
If not, you can just store the path and file name of the attachments in your index and keep the attachments locally at some path.
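If you go the Tika route, a rough SolrJ sketch might look like the following (field names such as subject, attachment_name, and attachment_text are hypothetical and would need to exist in your schema, the repeated ones as multiValued fields):

import java.io.File;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;

public class IndexEmailWithAttachments {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        Tika tika = new Tika();

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "message-0001");
        doc.addField("subject", "Quarterly report");
        doc.addField("body", "Please find the report attached.");
        doc.addField("sender", "alice@example.com");

        // Extract plain text from each attachment and fold it into the same document.
        for (File attachment : new File[] { new File("report.pdf"), new File("notes.docx") }) {
            doc.addField("attachment_name", attachment.getName());
            doc.addField("attachment_text", tika.parseToString(attachment));
        }

        server.add(doc);
        server.commit();
        server.shutdown();
    }
}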

Resources