Solr BlockJoinQuery returns false positives

We are trying to query indexed nested child documents in Solr. When we query, for example, for the parent of a child where event_id:order-1, the result contains a parent whose child has event_id:order-5.
We set up a fresh Solr instance using Solr's example data, and queries against it returned correct results. Our first idea was that something in solrconfig.xml was responsible, but after removing settings or resetting them to their defaults, the results were still incorrect.
We are currently checking schema.xml to see whether we can correct the results that way.
Our current solrconfig.xml:
<config>
<luceneMatchVersion>8.11.2</luceneMatchVersion>
<directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}" />
<schemaFactory class="ClassicIndexSchemaFactory"/>
<indexConfig>
<lockType>single</lockType>
<ramBufferSizeMB>256</ramBufferSizeMB>
<mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
<str name="sort">id asc</str>
<str name="wrapped.prefix">inner</str>
<str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
<int name="inner.maxMergeAtOnce">10</int>
<int name="inner.segmentsPerTier">10</int>
<int name="inner.deletesPctAllowed">20</int>
</mergePolicyFactory>
</indexConfig>
<updateHandler class="solr.DirectUpdateHandler2">
<autoCommit>
<maxDocs>1000000</maxDocs>
<maxSize>2g</maxSize>
<openSearcher>false</openSearcher>
</autoCommit>
<updateLog>
<str name="dir">${solr.data.dir:}</str>
</updateLog>
</updateHandler>
<query>
<maxBooleanClauses>102400</maxBooleanClauses>
<filterCache class="solr.CaffeineCache" maxRamMB="750" initialSize="0" autowarmCount="0" />
<queryResultCache class="solr.CaffeineCache" size="512" initialSize="0" autowarmCount="0" />
<fieldValueCache class="solr.CaffeineCache" size="1" initialSize="0" autowarmCount="0" />
<enableLazyFieldLoading>true</enableLazyFieldLoading>
<queryResultWindowSize>0</queryResultWindowSize>
<queryResultMaxDocsCached>200</queryResultMaxDocsCached>
<useColdSearcher>false</useColdSearcher>
<maxWarmingSearchers>2</maxWarmingSearchers>
</query>
<requestDispatcher handleSelect="false">
<requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048000" />
<httpCaching never304="true" />
</requestDispatcher>
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="df">text</str>
</lst>
</requestHandler>
<requestHandler name="/update" class="solr.UpdateRequestHandler"></requestHandler>
</config>
Our current schema.xml:
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="default-config" version="1.6">
<fieldType name="_nest_path_" class="solr.NestPathField" />
<!-- The StrField type is not analyzed, but indexed/stored verbatim. -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true" />
<fieldType name="strings" class="solr.StrField" sortMissingLast="true" multiValued="true" docValues="true" />
<!-- boolean type: "true" or "false" -->
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" />
<fieldType name="booleans" class="solr.BoolField" sortMissingLast="true" multiValued="true" />
<!-- Numeric field types that index values using KD-trees. Point fields don't support FieldCache, so they must have docValues="true"
if needed for sorting, faceting, functions, etc. -->
<fieldType name="pint" class="solr.IntPointField" docValues="true" />
<fieldType name="pfloat" class="solr.FloatPointField" docValues="true" />
<fieldType name="plong" class="solr.LongPointField" docValues="true" />
<fieldType name="pdouble" class="solr.DoublePointField" docValues="true" />
<fieldType name="pints" class="solr.IntPointField" docValues="true" multiValued="true" />
<fieldType name="pfloats" class="solr.FloatPointField" docValues="true" multiValued="true" />
<fieldType name="plongs" class="solr.LongPointField" docValues="true" multiValued="true" />
<fieldType name="pdoubles" class="solr.DoublePointField" docValues="true" multiValued="true" />
<!-- KD-tree versions of date fields -->
<fieldType name="pdate" class="solr.DatePointField" docValues="true" />
<fieldType name="pdates" class="solr.DatePointField" docValues="true" multiValued="true" />
<uniqueKey>id</uniqueKey>
<!-- Solr automatically populates this with the value of the top/parent ID. E.g. the profile ID. It is required. -->
<field name="_root_" type="string" indexed="true" stored="false" docValues="false" />
<!-- Is populated by Solr automatically with the path of the document in the hierarchy for non-root documents. -->
<field name="_nest_path_" type="_nest_path_" />
<!-- Is populated by Solr automatically to store the ID of each document’s parent document (if there is one). -->
<field name="_nest_parent_" type="string" indexed="true" stored="true"/>
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<!-- docValues are enabled by default for long type so we don't need to index the version field -->
<field name="_version_" type="plong" indexed="false" stored="false" />
<field name="_indexversion_" type="pint" indexed="true" stored="false" multiValued="false" required="true"
default="4" />
<field name="timestamp" type="pdate" indexed="true" stored="false" default="NOW" />
<field name="content_type" type="string" indexed="true" stored="false" />
<!-- define system values, which are known to be single valued -->
<field name="creationdate_l" type="plong" indexed="true" stored="false" />
<field name="lastmodifieddate_l" type="plong" indexed="true" stored="false" />
<field name="firstvisit_l" type="plong" indexed="true" stored="false" />
<field name="lastvisit_l" type="plong" indexed="true" stored="false" />
<!-- behavioral properties -->
<field name="frequency_bp" type="pint" indexed="true" stored="false" />
<field name="intensity_bp" type="pint" indexed="true" stored="false" />
<field name="recent_intensity_bp" type="pfloat" indexed="true" stored="false" />
<field name="firstvisit_behavior_bp" type="pint" indexed="true" stored="false" />
<field name="lastvisit_behavior_bp" type="pint" indexed="true" stored="false" />
<!-- Profile meta data fields only have one value -->
<field name="propertycount_i" type="pint" indexed="true" stored="false" />
<field name="totalpropertycount_i" type="pint" indexed="true" stored="false" />
<field name="totalpropertysize_i" type="pint" indexed="true" stored="false" />
<field name="maxproperty_s" type="string" indexed="true" stored="false" />
<field name="maxpropertyvalues_i" type="pint" indexed="true" stored="false" />
<field name="system_has_property_s" type="strings" indexed="true" stored="false" />
<field name="sample_id_i" type="pint" indexed="true" stored="false" />
<field name="event_id" type="string" indexed="true" multiValued="false" stored="true" />
<field name="event_type_id" type="string" indexed="true" multiValued="false" stored="true" />
<field name="event_date" type="plong" indexed="true" multiValued="false" stored="true" />
<field name="event_profile_id" type="string" indexed="true" multiValued="false" stored="true" />
<dynamicField name="*_ordinal_i" type="pint" indexed="true" stored="false" />
<dynamicField name="*_i" type="pints" indexed="true" stored="false" />
<dynamicField name="*_l" type="plongs" indexed="true" stored="false" />
<dynamicField name="*_f" type="pfloats" indexed="true" stored="false" />
<dynamicField name="*_s" type="strings" indexed="true" stored="false" />
<dynamicField name="*_b" type="boolean" indexed="true" stored="false" />
<dynamicField name="momentum_bp_*" type="pint" indexed="true" stored="false" />
<dynamicField name="threshold_*" type="plong" indexed="true" multiValued="false" stored="false" />
<dynamicField name="firsttouch_*" type="plong" indexed="true" multiValued="false" stored="false" />
<dynamicField name="reentryrestricted_*" type="string" indexed="true" multiValued="false" stored="false"/>
<dynamicField name="exitentrancerestricted_*" type="string" indexed="true" multiValued="false" stored="false"/>
</schema>
Indexed documents:
{
"id":"99c75c9a-b083-428d-baa1-6a9662c6eb72",
"name_s":"Profile 1",
"description_t":"test description",
"age_is":[28,
34],
"creationdate_l":1658990989645,
"content_type":"profile",
"_version_":1739600934763233280,
"_root_":"99c75c9a-b083-428d-baa1-6a9662c6eb72",
"timeline_events":
{
"id":"dcde9bfd-97ee-4d76-97d8-5297c1b2e87d",
"event_id":"order-0",
"event_type_id":"order",
"event_date":1658990989644,
"total_revenue_f":865.0,
"_nest_path_":"/timeline_events#",
"_nest_parent_":"99c75c9a-b083-428d-baa1-6a9662c6eb72",
"content_type":"timeline_event",
"_version_":1739600934763233280,
"_root_":"99c75c9a-b083-428d-baa1-6a9662c6eb72",
"product":[
{
"id":"9dabaac8-7651-4c56-9fb4-66d56b7175c3",
"name_s":"product-0",
"promotion_s":"NO",
"listprice_f":477.0,
"quantity_i":22,
"variant_ss":["handbags",
"men"],
"pages_i":1,
"_nest_path_":"/timeline_events#/product#0",
"_nest_parent_":"dcde9bfd-97ee-4d76-97d8-5297c1b2e87d",
"content_type":"order_product",
"_version_":1739600934763233280,
"_root_":"99c75c9a-b083-428d-baa1-6a9662c6eb72"}]}},
{
"id":"c19483e2-f940-403f-bb24-03adce1bcb02",
"name_s":"Profile 2",
"description_t":"test description for profile 2",
"age_is":[25,
40],
"creationdate_l":1658990989653,
"content_type":"profile",
"_version_":1739600934766379008,
"_root_":"c19483e2-f940-403f-bb24-03adce1bcb02",
"timeline_events":
{
"id":"dcde9bfd-97ee-4d76-97d8-5297c1b2e87d",
"event_id":"order-4",
"event_type_id":"order",
"event_date":1658990989649,
"total_revenue_f":952.0,
"_nest_path_":"/timeline_events#",
"_nest_parent_":"c19483e2-f940-403f-bb24-03adce1bcb02",
"content_type":"timeline_event",
"_version_":1739600934766379008,
"_root_":"c19483e2-f940-403f-bb24-03adce1bcb02",
"product":[
{
"id":"7a143554-b5f9-4487-b182-9938b91f76b4",
"name_s":"product-4",
"promotion_s":"YES",
"listprice_f":487.0,
"quantity_i":25,
"variant_ss":["junior",
"watches"],
"pages_i":1,
"_nest_path_":"/timeline_events#/product#0",
"_nest_parent_":"dcde9bfd-97ee-4d76-97d8-5297c1b2e87d",
"content_type":"order_product",
"_version_":1739600934766379008,
"_root_":"c19483e2-f940-403f-bb24-03adce1bcb02"}]}},
{
"id":"da88463c-fcca-4405-8656-0371809ccb28",
"name_s":"Profile 3",
"description_t":"test description for profile 3",
"age_is":[34,
39],
"creationdate_l":1658990989648,
"content_type":"profile",
"_version_":1739600934768476160,
"_root_":"da88463c-fcca-4405-8656-0371809ccb28",
"timeline_events":
{
"id":"61f47b18-15f4-4a4d-bb93-a4232dd22043",
"event_id":"order-2",
"event_type_id":"order",
"event_date":1658990989647,
"total_revenue_f":838.0,
"_nest_path_":"/timeline_events#",
"_nest_parent_":"da88463c-fcca-4405-8656-0371809ccb28",
"content_type":"timeline_event",
"_version_":1739600934768476160,
"_root_":"da88463c-fcca-4405-8656-0371809ccb28",
"product":[
{
"id":"1fc4616b-2629-4cc4-8a60-7238f97c9aae",
"name_s":"product-2",
"promotion_s":"YES",
"listprice_f":403.0,
"quantity_i":26,
"variant_ss":["pants",
"women"],
"pages_i":1,
"_nest_path_":"/timeline_events#/product#0",
"_nest_parent_":"61f47b18-15f4-4a4d-bb93-a4232dd22043",
"content_type":"order_product",
"_version_":1739600934768476160,
"_root_":"da88463c-fcca-4405-8656-0371809ccb28"}]}}]
}
}
When we execute either of the queries below:
{!parent which="*:* -_nest_path_:*"}event_id:order-0
or
{!parent which="content_type:profile"}event_id:order-0
both return the same incorrect result (for this example, the two queries are equivalent):
{
"id":"da88463c-fcca-4405-8656-0371809ccb28",
"name_s":"Profile 3",
"description_t":"test description for profile 3",
"age_is":[34,
39],
"creationdate_l":1658990989648,
"content_type":"profile",
"_version_":1739600934768476160,
"_root_":"da88463c-fcca-4405-8656-0371809ccb28"
}
That is not correct. The correct response would be:
{
"id":"99c75c9a-b083-428d-baa1-6a9662c6eb72",
"name_s":"Profile 1",
"description_t":"test description",
"age_is":[28,
34],
"creationdate_l":1658990989645,
"content_type":"profile",
"_version_":1739600934763233280,
"_root_":"99c75c9a-b083-428d-baa1-6a9662c6eb72"
}

After some more trial and error, we discovered that the issue lies in this part of solrconfig.xml:
<mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
<str name="sort">id asc</str>
<str name="wrapped.prefix">inner</str>
<str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
<int name="inner.maxMergeAtOnce">10</int>
<int name="inner.segmentsPerTier">10</int>
<int name="inner.deletesPctAllowed">20</int>
</mergePolicyFactory>
If this part is removed, the results are correct.
We are still investigating what exactly goes wrong and will keep updating the thread as we find more details.
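One plausible explanation, which the thread itself does not confirm: block join queries such as {!parent} rely on each parent being stored in the index directly after its children, as one contiguous block. A merge policy that re-sorts documents by id during merges may separate children from their parent. Under that assumption, a minimal sketch of the indexConfig section keeps the original TieredMergePolicyFactory settings but drops the sorting wrapper:

```xml
<indexConfig>
  <lockType>single</lockType>
  <ramBufferSizeMB>256</ramBufferSizeMB>
  <!-- No SortingMergePolicyFactory wrapper: merges are not re-sorted by id,
       so (by assumption) nested parent/child blocks stay contiguous. -->
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
    <int name="deletesPctAllowed">20</int>
  </mergePolicyFactory>
</indexConfig>
```

Documents that already went through a sorted merge would presumably need to be reindexed after such a change.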

Related

Missing 2 fields after applying the schema file puzzler

I am using Solr 7.4 and creating a core from the 3 files in the gist (download the files and save them in the directory <dir>/test/conf):
solr create -c test -d <dir>/test
The schema has 14 fields, but only 12 end up in the Schema Browser in the Admin UI.
The schema file looks like:
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="collection" version="1.6"
xmlns:inc="http://www.w3.org/2001/XInclude">
<types>
<!-- The StrField type is not analyzed, but indexed/stored verbatim. -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<!-- boolean type: "true" or "false" -->
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" />
<fieldType name="int" class="solr.IntPointField" sortMissingLast="true"/>
<fieldType name="long" class="solr.LongPointField" sortMissingLast="true"/>
</types>
<fields>
<field name="childCode" type="string" indexed="true" stored="true" multiValued="false" />
<field name="parentCode" type="string" indexed="true" stored="true" multiValued="false" />
<field name="id" type="string" indexed="true" stored="true" multiValued="false" />
<filed name="sortOrder" type="int" indexed="true" stored="true" multiValued="false" />
<filed name="locked" type="boolean" indexed="true" stored="true" multiValued="false" />
<field name="status" type="string" indexed="true" stored="true" multiValued="false" />
<field name="filename" type="string" indexed="false" stored="true" multiValued="false" />
<field name="url" type="string" indexed="false" stored="true" multiValued="false" />
<field name="previewUrl" type="string" indexed="false" stored="true" multiValued="false" />
<field name="shape" type="string" indexed="true" stored="true" multiValued="false" />
<field name="originalHeight" type="int" indexed="true" stored="true" multiValued="false" />
<field name="originalWidth" type="int" indexed="true" stored="true" multiValued="false" />
<field name="sizes" type="string" indexed="true" stored="true" multiValued="true" />
<field name="_version_" type="long" indexed="true" stored="true"/>
</fields>
<uniqueKey>id</uniqueKey>
</schema>
The missing fields are 'sortOrder' and 'locked'. Based on the documentation those are valid field names:
The name of the field. Field names should consist of alphanumeric or underscore characters only and not start with a digit. This is not currently strictly enforced, but other field names will not have first class support from all components and back compatibility is not guaranteed. Names with both leading and trailing underscores (e.g., _version_) are reserved. Every field must have a name.
Other camel-case int fields, such as 'originalHeight' and 'originalWidth', are created. I can also go into the Admin UI and add the missing fields manually with the name and type from the file.
I am puzzled and would appreciate any clue to this disappearing-fields mystery.
The element name is misspelled:
<filed name="sortOrder" ..
<filed name="locked" ..
Change <filed> to <field> and they will work like the other fields.
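For reference, the two corrected declarations would look like this, assuming the attributes from the question are otherwise what was intended:

```xml
<!-- Element name corrected from <filed> to <field>; attributes unchanged -->
<field name="sortOrder" type="int" indexed="true" stored="true" multiValued="false" />
<field name="locked" type="boolean" indexed="true" stored="true" multiValued="false" />
```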

Solr exception due to schema

I have the following solr schema
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="testthing" version="1.5">
<fields>
<field name="_version_" type="long" indexed="true" stored="true" required="true"/>
<field name="doc_id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="title" type="string" indexed="true" stored="true" required="false" multiValued="false"/>
<field name="doc_type" type="string" indexed="false" stored="true" required="true" multiValued="false"/>
<field name="description" type="string" indexed="true" stored="true" required="false" multiValued="false"/>
<field name="allText" type="fs_text" indexed="true" stored="false" required="true" multiValued="true"/>
</fields>
<uniqueKey>doc_id</uniqueKey>
<copyField source="title" dest="allText" />
<copyField source="description" dest="allText" />
<dynamicField name="*" type="ignored" multiValued="true" />
<types>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="fs_text" class="solr.TextField" positionIncrementGap="100"/>
</types>
</schema>
Solr complains about an undefined field "text" while resolving a dynamic field type:
1898 [main] INFO org.apache.solr.servlet.SolrDispatchFilter ? SolrDispatchFilter.init() done
1918 [searcherExecutor-4-thread-1] ERROR org.apache.solr.core.SolrCore ? org.apache.solr.common.SolrException: undefined field text at org.apache.solr.schema.IndexSchema.getDynamicFieldType(IndexSchema.java:1235)
However, my one and only dynamic field (ignore everything that is not matched) doesn't use the text type (it has type="ignored").
What am I missing here?
So far, renaming allText to text has pretty much fixed the issue, but I can't figure out why. Is there something special or predefined about text in Solr 4.1?
It is not about the field type "text"; it is about a field named "text".
<defaultSearchField>text</defaultSearchField>
You may have changed or removed the default search field in the config. If renaming fixes the issue, then somewhere in the configuration you are referring to a "text" field, possibly in solrconfig.xml as suggested in
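
As a hypothetical illustration of where such a reference can live (the poster's solrconfig.xml is not shown, so this is an assumption): the example solrconfig.xml shipped with Solr 4.x sets the default search field of the /select handler through the df parameter. Pointing it at a field that exists in the schema, here allText, would be one way to remove the dependency on a field named "text":

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <!-- df = default search field; it must name a field defined in schema.xml -->
    <str name="df">allText</str>
  </lst>
</requestHandler>
```

Other handlers or warming queries in solrconfig.xml may carry their own reference to text and would need the same change.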

langid UpdateRequestProcessor only mapping first field

I am trying to use solr's langid UpdateRequestProcessor. Here is the config:
<updateRequestProcessorChain name="languages">
<processor class="solr.LangDetectLanguageIdentifierUpdateProcessorFactory">
<lst name="invariants">
<str name="langid.fl">focus, expertise, platforms, partners, participation, additional</str>
<str name="langid.whitelist">en,fr</str>
<str name="langid.fallback">en</str>
<str name="langid.langField">detectedlang</str>
<bool name="langid.map">true</bool>
<bool name="langid.map.keepOrig">false</bool>
</lst>
</processor>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
My fields look like this:
<fields>
<field name="_root_" type="string" indexed="true" stored="false"/>
<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
<field name="id" type="string" indexed="true" stored="true" required="true" />
<!-- raw fields from sql db -->
<field name="expertise_id" type="int" indexed="true" stored="true" />
<field name="person_id" type="int" indexed="true" stored="true" />
<field name="mod_date" type="date" indexed="true" stored="true" />
<field name="lang" type="string" indexed="true" stored="true" />
<field name="focus" type="text_general" indexed="true" stored="true" />
<field name="expertise" type="text_general" indexed="true" stored="true" />
<field name="platforms" type="text_general" indexed="true" stored="true" />
<field name="partners" type="text_general" indexed="true" stored="true" />
<field name="participation" type="text_general" indexed="true" stored="true" />
<field name="additional" type="text_general" indexed="true" stored="true" />
<field name="tag" type="text_general" termVectors="true" multiValued="true" />
<field name="facet_tag" type="string" stored="false" indexed="false" docValues="true" multiValued="true" default=""/>
<!-- language detected by solr -->
<field name="detectedlang" type="string" indexed="true" stored="true" />
<!-- defined locale fields -->
<dynamicField name="*_en" type="text_en" indexed="true" stored="true" />
<dynamicField name="*_fr" type="text_fr" indexed="true" stored="true" />
<copyField source="tag" target="facet_tag"/>
</fields>
When I run an update or a dataimport I know that the "languages" update chain is used because focus is mapped to focus_en and detectedlang is set. However, none of the other fields in langid.fl are mapped. Why?
An example update query:
{
"additional": "here is some other information about me.",
"expertise_id": "10000",
"id": "foo_10000",
"focus": "this is my new focus. It is very exciting. When I am done I expect to be super experienced."
}
And here is the result of a query for expertise_id=10000. Note that additional has not been moved to additional_en:
"response":{"numFound":1,"start":0,"docs":[
{
"additional":"here is some other information about me.",
"expertise_id":10000,
"id":"foo_10000",
"detectedlang":"en",
"focus_en":"this is my new focus. It is very exciting. When I am done I expect to be super experienced.",
"_version_":1447088846110982144}]
}
Turns out that the problem is a syntax error. This line:
<str name="langid.fl">focus, expertise, platforms, partners, participation, additional</str>
must be
<str name="langid.fl">focus,expertise,platforms,partners,participation,additional</str>
The docs state that the field list should be comma- or space-separated. Evidently, a comma followed by a space breaks things (though it works fine in other Solr contexts, such as fl in a requestHandler, on which langid.fl is supposedly modelled). I tried the space-separated syntax as well, but it did not fix my issue.
I hope this helps someone.

SOLR 4.0 alphabetical sorting trouble

I'm having a hard time getting my head around an issue with my Solr address database.
I built it from the example files; I'm basically running the example configuration with a modified schema.
schema.xml:
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="_version_" type="long" indexed="true" stored="true" required="false" multiValued="false" />
<field name="givenname_s" type="text_de" indexed="true" stored="true" required="true" multiValued="false" />
<field name="middleinitial_s" type="text_de" indexed="false" stored="true" required="false" multiValued="false" />
<field name="surname_s" type="text_de" indexed="true" stored="true" required="true" multiValued="false" />
<field name="gender_s" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="pictureuri_s" type="string" indexed="false" stored="true" required="false" multiValued="false" />
<field name="function_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="organizationalunit_s" type="text_general" indexed="true" stored="true" required="false" multiValued="false" />
<field name="organizationalunitdescription_s" type="text_de" indexed="false" stored="true" required="false" multiValued="false" />
<field name="company_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="street_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="streetnumber_s" type="int" indexed="true" stored="true" required="false" multiValued="false" />
<field name="postcode_s" type="int" indexed="true" stored="true" required="false" multiValued="false" />
<field name="city_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="building_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="roomnumber_s" type="int" indexed="true" stored="true" required="false" multiValued="false" />
<field name="country_s" type="text_en" indexed="true" stored="true" required="true" multiValued="false" />
<field name="countrycode_s" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="emailaddress_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="phone1_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="phone2_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="mobile_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="fax_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
I am populating the database by pushing about 20,000 random test datasets like the following to post.jar:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<add>
<doc>
<field name="id">1352498443_1</field>
<field name="givenname_s">Aynur</field>
<field name="middleinitial_s"/>
<field name="surname_s">Lehnen</field>
<field name="gender_s">F</field>
<field name="pictureuri_s">dummy_assets/female.jpg</field>
<field name="function_s">Zugschaffner/in</field>
<field name="organizationalunit_s">P 07</field>
<field name="organizationalunitdescription_s">Lorem Ipsum sadipscing voluptua ipsum invidunt dolor et dolore invidunt sed consetetur accusam dolore Lorem tempor.</field>
<field name="company_s">Lorem Lagna Epsum Emet</field>
<field name="street_s">Erlenweg</field>
<field name="streetnumber_s">82</field>
<field name="postcode_s">76297</field>
<field name="city_s">Lübeck</field>
<field name="building_s"/>
<field name="roomnumber_s">242</field>
<field name="country_s">GERMANY</field>
<field name="countrycode_s">DE</field>
<field name="emailaddress_s">aynur.lehnen@lorem-lagna-epsum-emet.de</field>
<field name="phone1_s">0392984823</field>
<field name="phone2_s">0124111417</field>
<field name="mobile_s">0325117132</field>
<field name="fax_s">0171459177</field>
</doc>
</add>
However, when retrieving data I seem to have problems with alphabetical sorting. Consider the following query:
{
"responseHeader": {
"status": 0,
"QTime": 5,
"params": {
"sort": "surname_s asc",
"fl": "surname_s",
"indent": "true",
"wt": "json",
"q": "city_s:berlin"
}
},
"response": {
"numFound": 1094,
"start": 0,
"docs": [{
"surname_s": "Weil"
}, {
"surname_s": "Abel"
}, {
"surname_s": "Adam"
}, {
"surname_s": "Ade"
}, {
"surname_s": "Adrian"
}, {
"surname_s": "Aigner"
}, {
"surname_s": "Aigner"
}, {
"surname_s": "Alber"
}, {
"surname_s": "Alber"
}, {
"surname_s": "Albers"
}]
}
}
Why is "Weil" on position one, while the rest of the data appears to be sorted correctly?
I believe that some of the additional analyzers applied by the text_de field type are the cause of this sorting behavior. In my experience, the best results when sorting strings come from using the alphaOnlySort fieldType from the example schema.xml, shown below.
<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<!-- KeywordTokenizer does no actual tokenizing, so the entire
input string is preserved as a single token
-->
<tokenizer class="solr.KeywordTokenizerFactory"/>
<!-- The LowerCase TokenFilter does what you expect, which can be
when you want your sorting to be case insensitive
-->
<filter class="solr.LowerCaseFilterFactory" />
<!-- The TrimFilter removes any leading or trailing whitespace -->
<filter class="solr.TrimFilterFactory" />
<!-- The PatternReplaceFilter gives you the flexibility to use
Java Regular expression to replace any sequence of characters
matching a pattern with an arbitrary replacement string,
which may include back references to portions of the original
string matched by the pattern.
See the Java Regular Expression documentation for more
information on pattern and replacement string syntax.
http://java.sun.com/j2se/1.6.0/docs/api/java/util/regex/package-summary.html
-->
<filter class="solr.PatternReplaceFilterFactory"
pattern="([^a-z])" replacement="" replace="all"
/>
</analyzer>
</fieldType>
I would recommend creating a new field and then copying the value from surname_s via copyField, something like the following:
<field name="surname_s_sort" type="alphaOnlySort" indexed="true" stored="false" required="false" multiValued="false" />
<copyField source="surname_s" dest="surname_s_sort"/>
Note: there is no need to store the value in the surname_s_sort field, hence the stored="false" attribute, unless you expect to display it to users.
Then you can just change your query to sort on the surname_s_sort instead.
Sorting doesn't work well on multivalued and tokenized fields.
From the documentation:
Sorting can be done on the "score" of the document, or on any multiValued="false" indexed="true" field provided that field is either non-tokenized (ie: has no Analyzer) or uses an Analyzer that only produces a single Term (ie: uses the KeywordTokenizer)
Use string as the field type and copy the surname_s field into the new field.
<field name="surname_s_sort" type="string" indexed="true" stored="false"/>
<copyField source="surname_s" dest="surname_s_sort" />
As @Paige answered, you can also use a keyword tokenizer with a lowercase filter, which does not split the field into multiple tokens.
I had similar issues and tried alphaOnlySort. It works in part, but it starts messing up the sort order when the field contains characters such as -, / or spaces.
So the result was something like:
/ abc
aa
/ abc2
So I ended up using the field type lowercase. It was already in the schema, so I assume it is a default type. I did use the copyField construction, so my final config was:
<schema>
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
<fields>
<field name="job_name_sort" type="lowercase" indexed="true" stored="false" required="false"/>
</fields>
<copyField source="job_name" dest="job_name_sort"/>
</schema>

Tika Solr Metadata mapping ignore document title

I have the following config file for solr:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<!-- All the main content goes into "text"... if you need to return
the extracted text or do highlighting, use a stored field. -->
<str name="lowernames">true</str>
<str name="fmap.content">content</str>
<str name="fmap.application_name">type</str>
<str name="fmap.content_type">mime</str>
<str name="fmap.stream_size">size</str>
<str name="uprefix">ignored_</str>
<str name="captureAttr">false</str>
</lst>
</requestHandler>
and this is my schema:
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="access_type" type="string" indexed="true" stored="false"/>
<field name="access_restriction" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="title" type="string" indexed="true" stored="true" multiValued="true" />
<field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="content" type="text_en_splitting" indexed="true" stored="true"/>
<field name="created" type="date" indexed="true" stored="true"/>
<field name="createdby" type="string" indexed="true" stored="true"/>
<field name="modified" type="date" indexed="true" stored="true"/>
<field name="modifiedby" type="string" indexed="true" stored="true"/>
<field name="source" type="string" indexed="true" stored="true" />
<field name="version" type="string" indexed="true" stored="true" />
<field name="resourcelink" type="string" indexed="true" stored="true" />
<field name="downloadlink" type="string" indexed="true" stored="true" />
<field name="type" type="string" indexed="true" stored="true" />
<field name="mime" type="string" indexed="true" stored="true" />
<field name="size" type="string" indexed="true" stored="true" />
I want to set the title myself, but Tika keeps setting its own title (that's why I temporarily set multiValued="true"), which I find strange because I have to manually map things like stream_size and content_type.
What solutions are possible for this issue?
I'd like Tika to override the title I assign, like this:
I have 3 documents. For one of them Tika doesn't extract a title, so the title I pass via literal.title should be used. When Tika does extract a title, I want it to override the one I passed in literal.title. Is this possible?
I was working on the same issue some time ago, but I hit a wall as well :(
I let Tika take "title" and use literal.other_title_like_field to store the proper title.
This is not the best solution, but it worked for me.
For those who are still struggling with this problem, I solved it by adding
<str name="fmap.title">ignored_</str>
in my ExtractingRequestHandler defaults.
