Where is the schema definition of PDF index in SOLR - solr

I have successfully indexed a PDF file into Solr with post.jar.
I can see the indexed file in the query results.
But I was wondering where fields like id, stream_content_type, pdf_pdfversion, etc. come from. I searched for them in schema.xml but could not find them. Where are they defined? Did I miss something? Thanks.

This is the metadata stored by Apache Tika, plus a few fields that Solr adds itself. From the ExtractingRequestHandler documentation:
https://wiki.apache.org/solr/ExtractingRequestHandler#Metadata
Metadata
As has been implied up to now, Tika produces Metadata about the
document. Metadata often contains things like the author of the file
or the number of pages, etc. The Metadata produced depends on the type
of document submitted. For instance, PDFs have different metadata from
Word docs.
In addition to Tika's metadata, Solr adds the following metadata
(defined in ExtractingMetadataConstants):
"stream_name" - The name of the ContentStream as uploaded to Solr. Depending on how the file is uploaded, this may or may not be set.
"stream_source_info" - Any source info about the stream. See ContentStream.
"stream_size" - The size of the stream in bytes(?)
"stream_content_type" - The content type of the stream, if available.
It is highly recommended that you try using the extract only option to see what values actually get set for these.
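As a minimal sketch of the "extract only" check, the following Python builds a request to the ExtractingRequestHandler with extractOnly=true, which asks Solr to run Tika and return the extracted content and metadata without indexing anything. The host, port, core name (collection1) and the helper names are assumptions for illustration; adjust them to your installation.

```python
import json
import urllib.request
from urllib.parse import urlencode


def extract_only_url(base_url, core, extract_format="text"):
    """Build an /update/extract URL that extracts without indexing."""
    params = {
        "extractOnly": "true",            # run Tika, do not add a document
        "extractFormat": extract_format,  # "text" or "xml"
        "wt": "json",                     # ask for a JSON response
    }
    return f"{base_url}/{core}/update/extract?{urlencode(params)}"


def extract_metadata(pdf_path, url):
    """POST the file to Solr and return the parsed JSON response.

    Needs a running Solr instance; the host and core are assumptions.
    """
    with open(pdf_path, "rb") as fh:
        body = fh.read()
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling extract_metadata("test.pdf", extract_only_url("http://localhost:8983/solr", "collection1")) against a running instance should return the values that Tika and Solr would set for that file, which makes it easy to see which field names to expect.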

Related

solr query to return matched text to regex with default schema

I want to search Solr for server names in a set of Microsoft Word documents, PDF, and image files like jpg, gif.
Server names are given by the regular expression (regex):
INFP[a-zA-Z0-9]{3,9}
TRKP[a-zA-Z0-9]{3,9}
PLCP[a-zA-Z0-9]{3,9}
SQRP[a-zA-Z0-9]{3,9}
....
Problem
I want to get the text in the documents matching the regex. eg. INFPWSV01, PLCPLDB01.
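To make the goal concrete, here is a small Python sketch of the extraction itself, applied to text that has already been retrieved (for example a stored content field or a highlight snippet). It is client-side post-processing, not a Solr feature. The function name is hypothetical, only the four prefixes listed above are covered, and the suffix is assumed to be purely alphanumeric.

```python
import re

# Four of the server-name prefixes from the question, followed by
# 3 to 9 alphanumeric characters (assumed to be the intended class).
SERVER_NAME = re.compile(r"\b(?:INFP|TRKP|PLCP|SQRP)[a-zA-Z0-9]{3,9}\b")


def find_server_names(text):
    """Return every server name found in the text, in order of appearance."""
    return SERVER_NAME.findall(text)


sample = "Backups run on INFPWSV01 and are copied to PLCPLDB01."
print(find_server_names(sample))  # ['INFPWSV01', 'PLCPLDB01']
```

This only works if Solr returns the document text, which is why the question of storing the content field matters.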
I've indexed the files using Solr/Tika/Tesseract with the default schema.
I've used the highlight search tool
hl ticked
hl.usePhraseHighlighter ticked
Solr only returns the metadata (presumably) like filename for the file containing the pattern(s).
Questions
Would I have to modify the managed schema?
If so would I have to save the file content in the schema?
If so is this the way to do it:
a. solrconfig.xml <- inside my "core"
b. Remove line
as I want meta data
c. Change this in the managed schema
stored=false to stored=true

How can I view actually stored transformed Solr text field values?

When Solr returns a document, the field values match those that were passed to the Solr indexer.
However, especially for TextFields, Solr typically indexes a modified value where (depending on the definition in the schema.xml) various filters are applied, typically:
conversion to lower case
replacing of synonyms
removal of stopwords
application of stemming
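The effect of such a filter chain can be illustrated with a toy Python sketch. This is not Solr's implementation: the stopword list, the synonym map and the naive suffix-stripping stemmer are made up for illustration, and real analyzers behave differently in detail.

```python
# Toy stand-ins for the filters listed above (all values are hypothetical).
STOPWORDS = {"the", "a", "an", "of", "and"}
SYNONYMS = {"pc": "computer", "laptop": "computer"}


def naive_stem(token):
    """Strip a few common English suffixes; far cruder than a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, map synonyms, drop stopwords, then stem each token."""
    tokens = text.lower().split()
    tokens = [SYNONYMS.get(t, t) for t in tokens]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [naive_stem(t) for t in tokens]


print(analyze("A PC indexed the files"))  # ['computer', 'index', 'file']
```

The terms produced by the chain are what gets indexed, while the stored value remains the original text. A query only matches if its own terms, after query-time analysis, equal the indexed terms.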
One can see the result of the conversion for specific texts by using Solr Admin > Some core > Analysis. There is a tool called Luke and the LukeRequestHandler, but it seems I can only view the values passed to Solr and not the transformed variant. One can also take a look at the index data on the disk, but it seems to be stored in a binary format.
However, none of these seem to enable me to see the actual value as stored.
The reason for asking is that I've created a text field based on a certain filter chain which, according to Solr Admin > Analysis, transforms the text correctly. However, when I search for a specific word in the transformed text, Solr does not find it.

Solr fields mapping?

I am indexing documents into Solr from a source. At the source, each document has some associated properties which I am indexing and fetching into Solr.
I am mapping some fields from the source properties to Solr schema fields. But I can see a couple of extra fields in the Solr logs which I am not mapping. When querying in the Solr admin UI, I can see only the mapped fields.
E.g. in the logs below, I am using only content_name & content_modifier, but I can see the Template field as well.
INFO - 2014-09-18 12:07:47.185; org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp=/solr path=/update/extract params={literal.content_name=1_.000&literal.content_modifier=System&literal.Template={8ad4d8f0-93a7-4941-9657-cf3706f00409} {add=[1_.000 (1479581071766978560)]} 0 0
So what's happening here? Will Solr index only the mapped fields and skip the unmapped ones? Or will Solr index all fields, mapped and unmapped, but show only the mapped fields in the admin UI?
Please suggest.
The answer depends on what your solrconfig and schema say, because you can configure this any way you want. Here is how it works with the example schema for Solr 4.10:
1) In solrconfig.xml, the handler uses the "uprefix" parameter to map all fields NOT in the schema to a dynamic field ignored_*
2) In schema.xml, that dynamic field has type ignored
3) Type ignored (in the same file) is defined as stored=false and indexed=false. This means: do not complain when a field matching the pattern arrives, but do nothing with it; literally ignore it.
So, if you don't like that, you can modify any part of that pipeline. The easiest test is to change the dynamic field to use type string and reindex. You should then see the rest of the fields.
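The three steps can be modelled with a short Python sketch. It only simulates the behaviour described above and is not Solr code: the function names are hypothetical, and the schema is reduced to a set of known field names taken from the log line in the question.

```python
def apply_uprefix(incoming, schema_fields, uprefix="ignored_"):
    """Step 1: prefix every field that is not defined in the schema."""
    mapped = {}
    for name, value in incoming.items():
        target = name if name in schema_fields else uprefix + name
        mapped[target] = value
    return mapped


def kept_fields(mapped, uprefix="ignored_"):
    """Steps 2 and 3: fields landing in ignored_* are neither stored nor indexed."""
    return {k: v for k, v in mapped.items() if not k.startswith(uprefix)}


schema_fields = {"content_name", "content_modifier"}
incoming = {
    "content_name": "1_.000",
    "content_modifier": "System",
    "Template": "{8ad4d8f0-93a7-4941-9657-cf3706f00409}",
}

mapped = apply_uprefix(incoming, schema_fields)
print(sorted(mapped))               # ['content_modifier', 'content_name', 'ignored_Template']
print(sorted(kept_fields(mapped)))  # ['content_modifier', 'content_name']
```

In this model, Template is accepted without an error but never becomes visible, which matches what the admin UI shows.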

In Solr 4 - How do I include file names in the index?

I am building a search engine with Solr 4.8.1 - in doing so, I am attempting to display the file names of each indexed document in my GUI search results.
I can successfully display any field that is in Solr's Schema.xml file (title, author, id, resourcename, last_modified etc.). I cannot, however, find a field in the schema.xml that holds the name of the file (such as the name "Test" for the file Test.pdf, or the word "Example" for Example.docx).
The closest field I can find is "resourcename" which displays the entire file path in my system (ex. C:\Users\myusername\Documents\solr-4.8.1\example\exampledocs\filename.docx when all I want to display is filename.docx)
(1) How do I tell solr to index the name of a file?
or
(2) Is there a field that cover the file name that I am just missing?
Sincerest thanks!
---Research Update---
It seems this question is asking for the same thing - Solr return file name - however, I do not believe that simply adding a field called "filename" will cause Solr to index the file name! I know I need to add a field to the Schema.xml file - now how do I point that field to the name of a file?
This is not so much a question about Solr functionality as it is about the tools you use to publish to Solr. Adding a new field called fileName to Solr resolves only part of the issue; you also need to modify the publishing tool so that it adds the file name (e.g. testPDF.pdf) to each document. I'd look at Tika: http://tika.apache.org/ , seeing how you mention both pdf and doc files.
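One way the publishing side could supply the name, sketched in Python: derive the bare file name from the full path and pass it as a literal.* parameter of the extract request, the same mechanism that sets literal.content_name in Solr Cell requests. The fileName field is assumed to have been added to the schema, and the helper names are hypothetical.

```python
from pathlib import PureWindowsPath
from urllib.parse import urlencode


def file_name(resource_path):
    """Return the last path component, e.g. 'filename.docx'."""
    return PureWindowsPath(resource_path).name


def literal_params(resource_path, field="fileName"):
    """Build the query-string fragment that sets the field for this upload."""
    return urlencode({f"literal.{field}": file_name(resource_path)})


path = r"C:\Users\myusername\Documents\solr-4.8.1\example\exampledocs\filename.docx"
print(file_name(path))       # filename.docx
print(literal_params(path))  # literal.fileName=filename.docx
```

PureWindowsPath is used because the example path is a Windows path; it parses such paths correctly on any operating system. Its .stem attribute gives the name without the extension ("filename") if that is what should be displayed.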

formatting of files before indexing into solr server

I'm using the Solr server to provide search capability for a tool. I wanted to know if Solr provides a facility that will allow me to format some files before they are indexed. More specifically, I have a plain text file with a lot of data, and I want to convert it to an XML format before I index the XML file, e.g.
some data! some more data : more values
I want to convert this sample line to something like
<field 1>sample data </field 1>
<field 2> some more data </field 2>
<field 3> more values </field 3>
Does Solr provide a facility for this type of transformation before indexing a file using Solr Cell? Does it provide any classes or interfaces that I can implement in my Java application?
Thanks in advance!
Are you pushing data into Solr or can you pull it from the source by Solr?
If you are pushing into Solr, then you have to use an Update Request Processor. However, I am not aware of any that will split data into multiple fields. You may need to write one yourself.
If you are pulling from the source using DataImportHandler, it has built-in support for splitting content into multiple fields using RegexTransformer.
Both Update Request Processors and DIH support JavaScript transformers (and possibly other JVM scripting languages), so you can also write your own script to split the data in whatever way you want.
Some of this is only available starting with Solr 4, though. That's a requirement to keep in mind.
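If the data is pushed into Solr, the split can also be done on the client before sending. The Python sketch below does this for the sample line from the question and emits a document in Solr's XML update format. The field names field_1 to field_3 are placeholders (the question's `<field 1>` is not a valid XML element name), and the separators "!" and ":" are taken from the single sample line, so real data may need a different pattern.

```python
import re
import xml.etree.ElementTree as ET

# One line looks like: "some data! some more data : more values"
LINE = re.compile(r"^([^!]*)!([^:]*):(.*)$")


def split_line(line):
    """Split one line into three named values, or return None if it does not match."""
    match = LINE.match(line)
    if match is None:
        return None
    names = ("field_1", "field_2", "field_3")  # placeholder field names
    return {name: value.strip() for name, value in zip(names, match.groups())}


def to_solr_xml(fields):
    """Wrap the values in an <add><doc>...</doc></add> update message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        ET.SubElement(doc, "field", name=name).text = value
    return ET.tostring(add, encoding="unicode")


fields = split_line("some data! some more data : more values")
print(fields)
# {'field_1': 'some data', 'field_2': 'some more data', 'field_3': 'more values'}
print(to_solr_xml(fields))
```

The resulting XML can be posted to the regular /update handler, provided the schema defines the corresponding fields.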
You'll need a custom Index Handler or a SolrRequestHandler.

Resources