Solr fields mapping? - solr

I am indexing documents into Solr from a source. At the source, each document has some associated properties, which I index and fetch into Solr.
I map some of the source properties to Solr schema fields. However, the Solr logs show a couple of extra fields that I am not mapping, while queries in the Solr admin UI return only the mapped fields.
E.g. in the log below, I map only content_name and content_modifier, but the Template field appears as well.
INFO - 2014-09-18 12:07:47.185; org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp=/solr path=/update/extract params={literal.content_name=1_.000&literal.content_modifier=System&literal.Template={8ad4d8f0-93a7-4941-9657-cf3706f00409} {add=[1_.000 (1479581071766978560)]} 0 0
So what is happening here? Does Solr index only the mapped fields and skip the unmapped ones? Or does it index all fields, mapped and unmapped, and show only the mapped ones in the admin UI?
Please suggest.

The answer depends on what your solrconfig.xml and schema.xml say, because you can configure this any way you want. Here is how it works in the example schema for Solr 4.10:
1) In solrconfig.xml, the handler uses the "uprefix" parameter to map all fields NOT in the schema to the dynamic field ignored_*
2) In schema.xml, that dynamic field has the type ignored
3) The type ignored (in the same file) is defined with stored=false and indexed=false. That means Solr does not complain when it receives a field matching the pattern, but it does nothing with it: the field is literally ignored.
So, if you don't like that, you can modify any part of that pipeline. The easiest test would be to change the dynamic field to use the type string and reindex. Then you should see the rest of the fields.
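To make the three steps concrete, here is a small Python model of that pipeline. It is only a sketch of the behaviour described above, not Solr's actual code: the function names, the SCHEMA_FIELDS set, and the DYNAMIC_FIELDS layout are made up for illustration, and other handler options (such as lowernames) are left out.

```python
from fnmatch import fnmatchcase

# Hypothetical stand-in for the fields declared explicitly in schema.xml.
SCHEMA_FIELDS = {"content_name", "content_modifier"}

# Hypothetical stand-in for dynamic fields: pattern -> properties of its type.
# The "ignored" type in the example schema is stored=false, indexed=false.
DYNAMIC_FIELDS = {"ignored_*": {"stored": False, "indexed": False}}


def apply_uprefix(name, uprefix="ignored_"):
    """Step 1: prefix every field that the schema does not declare."""
    return name if name in SCHEMA_FIELDS else uprefix + name


def is_kept(name):
    """Steps 2-3: explicit fields are kept; a dynamic-field match is kept
    only if its type is stored or indexed."""
    if name in SCHEMA_FIELDS:
        return True
    for pattern, props in DYNAMIC_FIELDS.items():
        if fnmatchcase(name, pattern):
            return props["stored"] or props["indexed"]
    # No explicit or dynamic field matches: the document would be rejected.
    raise ValueError("unknown field: " + name)


def visible_fields(incoming):
    """Return the fields that would actually end up in the index."""
    kept = {}
    for name, value in incoming.items():
        mapped = apply_uprefix(name)
        if is_kept(mapped):
            kept[mapped] = value
    return kept


doc = {
    "content_name": "1_.000",
    "content_modifier": "System",
    "Template": "{8ad4d8f0-93a7-4941-9657-cf3706f00409}",
}
print(visible_fields(doc))
```

In this model the Template value is accepted without error but dropped. Setting the ignored_* entry to stored or indexed (the equivalent of switching the dynamic field to type string) makes it show up as ignored_Template.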

Related

How can I view actually stored transformed Solr text field values?

When Solr returns a document, the field values match those that were passed to the Solr indexer.
However, especially for TextFields, Solr typically indexes a modified value to which (depending on the definition in schema.xml) various filters have been applied, typically:
conversion to lower case
replacing of synonyms
removal of stopwords
application of stemming
One can see the result of the conversion for specific texts by using Solr Admin > Some core > Analysis. There is a tool called Luke, and there is the LukeRequestHandler, but it seems I can only view the values passed to Solr and not the transformed variant. One can also look at the index data on disk, but it seems to be stored in a binary format.
However, none of these seems to let me see the actual value as stored.
The reason for asking is that I've created a text field based on a certain filter chain which, according to Solr Admin > Analysis, transforms the text correctly. However, when I search for a specific word in the transformed text, Solr does not find it.

Solr indexing fails over media_black_point

Up front, I want to say that I don't have much experience with Solr.
Problem 1: we only want to index the content of files and do not want to add dynamic fields. Is this possible, and if so, how?
Problem 2: if the answer to Problem 1 is no, how would we exclude media_black_point and media_white_point from indexing?
Error code where Solr trips:
{"responseHeader":{"status":400,"QTime":149},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"incompatible dimension (2) and values (313/1000 329/1000). Only 0 values specified","code":400}}
Dynamic fields and schemaless mode both exist to catch fields you did not declare explicitly. If neither is used, the assumption is that every field you send to Solr (including the output of the extract handler, which generates a Solr document internally) needs to be explicitly mapped. This helps to avoid spelling errors and other unexpected edge cases.
If you want to ignore all the fields you did not define explicitly, you can use a dynamic field with stored, indexed, and docValues all set to false. Solr ships with one example out of the box; you just need to uncomment it.
The other option is to ignore specific fields. You can do that by defining a custom UpdateRequestProcessor chain (or an individual URP in the latest Solr) and using IgnoreFieldUpdateProcessorFactory with your specific field name or a name pattern.

Solr dismax Query Over Multiple Fields

I am trying to do a solr dismax query over multiple fields, and am a little confused with the syntax.
My core contains a whole load of podcast episodes. The fields in the index are EPISODE_ID, EPISODE_TITLE, EPISODE_DESC, and EPISODE_KEYWORDS.
Now, when I do a query I would like to search for the query term in the EPISODE_TITLE, EPISODE_DESC, and EPISODE_KEYWORDS fields, with different boosts for the different fields.
So when I search for 'jedi', the query I've built looks like this:
http://localhost:8983/solr/episode_core/select?
&defType=dismax&q=jedi&fl=EPISODE_ID,EPISODE_TITLE,EPISODE_DESC,EPISODE_KEYWORDS
&qf=EPISODE_TITLE^3.0+EPISODE_DESC^2.0+EPISODE_KEYWORDS
However, this doesn't seem to work - it returns zero records.
When I put a default field like below, it now works, but this is kind of crap because it means I'm not getting results from searching all of the 3 fields:
http://localhost:8983/solr/episode_core/select?&df=EPISODE_DESC
&defType=dismax&q=jedi&fl=EPISODE_ID,EPISODE_TITLE,EPISODE_DESC,EPISODE_KEYWORDS
&qf=EPISODE_TITLE^3.0+EPISODE_DESC^2.0+EPISODE_KEYWORDS
Is there something I am missing here? I thought that you could search over multiple fields, and I thought that the 'qf' parameter would mean you didn't need to supply the default field parameter?
All help much appreciated...
Your idea is correct. If you've defined qf (query fields) for Dismax, there shouldn't be any need to specify a df (default field).
Can you be more specific about what isn't working?
Also, read up on Configuration Invariants in solrconfig.xml as it is possible your configuration could be sending some different parameters than you've specified in the URL.
(E.g. if you're seeing a specific error message asking you to provide a df)
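One way to rule out a hand-encoding mistake in the URL is to build it programmatically. The following is a minimal Python sketch, assuming the host, core name, and field names from the question; the helper name build_dismax_url is made up for illustration. It does not explain the zero results by itself, it only shows a request that relies on qf alone, with no df.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Host, port, and core name as they appear in the question's URL.
BASE_URL = "http://localhost:8983/solr/episode_core/select"


def build_dismax_url(term, boosted_fields, returned_fields):
    """Build a select URL that searches `term` across several boosted fields.

    boosted_fields is a list of (field_name, boost) pairs; a boost of None
    means the field is listed without an explicit boost.
    """
    # qf is a space-separated list of field or field^boost entries.
    qf = " ".join(
        name if boost is None else f"{name}^{boost}"
        for name, boost in boosted_fields
    )
    params = {
        "defType": "dismax",
        "q": term,
        "qf": qf,
        "fl": ",".join(returned_fields),
    }
    # urlencode percent-escapes "^" and turns the spaces in qf into "+".
    return BASE_URL + "?" + urlencode(params)


url = build_dismax_url(
    "jedi",
    [("EPISODE_TITLE", 3.0), ("EPISODE_DESC", 2.0), ("EPISODE_KEYWORDS", None)],
    ["EPISODE_ID", "EPISODE_TITLE", "EPISODE_DESC", "EPISODE_KEYWORDS"],
)
print(url)
# Decode the query string again to confirm what the server would receive.
print(parse_qs(urlsplit(url).query)["qf"])
```

If a URL built this way still returns nothing, the defaults or invariants configured for the handler in solrconfig.xml are the next thing to check, as suggested above.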

Where is the schema definition of PDF index in SOLR

I succeeded in indexing a PDF file into Solr with post.jar.
I can see the indexed file in the query results.
But I was wondering where fields like id, stream_content_type, pdf_pdfversion, etc. come from. I searched for them in schema.xml but could not find them. Where are they defined? Did I miss something? Thanks.
This is the metadata stored by Apache Tika. The ExtractingRequestHandler documentation describes it in its Metadata section:
https://wiki.apache.org/solr/ExtractingRequestHandler#Metadata
As has been implied up to now, Tika produces Metadata about the
document. Metadata often contains things like the author of the file
or the number of pages, etc. The Metadata produced depends on the type
of document submitted. For instance, PDFs have different metadata from
Word docs.
In addition to Tika's metadata, Solr adds the following metadata
(defined in ExtractingMetadataConstants):
"stream_name" - The name of the ContentStream as uploaded to Solr. Depending on how the file is uploaded, this may or may not be set.
"stream_source_info" - Any source info about the stream. See ContentStream.
"stream_size" - The size of the stream in bytes(?)
"stream_content_type" - The content type of the stream, if available.
It is highly recommended that you try using the extract only option to see what values actually get set for these.
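As a rough illustration of the extract-only option mentioned above, the Python sketch below prepares a request with extractOnly=true, which asks the handler to return what Tika extracted instead of indexing the file. The host and the core name collection1 are assumptions, the helper name is made up, and the request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed location of the extract handler; adjust host and core name.
EXTRACT_URL = "http://localhost:8983/solr/collection1/update/extract"


def build_extract_only_request(pdf_bytes):
    """Prepare (but do not send) a POST that returns the extracted content
    and metadata rather than adding the document to the index."""
    query = urlencode({"extractOnly": "true", "wt": "json"})
    return Request(
        EXTRACT_URL + "?" + query,
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


# Placeholder bytes; in practice, read them from the PDF file.
request = build_extract_only_request(b"%PDF-1.4 placeholder bytes")
print(request.get_method(), request.full_url)
```

Passing the prepared request to urllib.request.urlopen against a running Solr instance should return the extracted text together with the metadata names, which can then be compared with the field names seen in the query results.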

how to implement solr index partitioning

I want Solr to create indexes based on a specific field. For example, I have a field in schema.xml, createDate (with values such as 2012, 2013, etc.). While indexing, if the value of that field is 2013, the document should be indexed in the /data/2013/index folder (or some logically separated folder). I tried adding the following to my solrconfig.xml just before the closing </config> tag:
<partition>
<partitionField name="creationYear">
<value>2004</value>
<value>2005</value>
<value>2006</value>
<value>2007</value>
<value>2008</value>
<value>2009</value>
<value>2010</value>
<value>2011</value>
<value>2012</value>
<value>2013</value>
</partitionField>
</partition>
This does not work while indexing, and it seems that this was only ever an idea and was never actually implemented in Solr. Am I assuming correctly? Or is there a way to have Solr create index folders dynamically based on the year (as in this example)?
Any help would be appreciated!!
