langid UpdateRequestProcessor only mapping first field - solr

I am trying to use solr's langid UpdateRequestProcessor. Here is the config:
<updateRequestProcessorChain name="languages">
<processor class="solr.LangDetectLanguageIdentifierUpdateProcessorFactory">
<lst name="invariants">
<str name="langid.fl">focus, expertise, platforms, partners, participation, additional</str>
<str name="langid.whitelist">en,fr</str>
<str name="langid.fallback">en</str>
<str name="langid.langField">detectedlang</str>
<bool name="langid.map">true</bool>
<bool name="langid.map.keepOrig">false</bool>
</lst>
</processor>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
My fields look like this:
<fields>
<field name="_root_" type="string" indexed="true" stored="false"/>
<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
<field name="id" type="string" indexed="true" stored="true" required="true" />
<!-- raw fields from sql db -->
<field name="expertise_id" type="int" indexed="true" stored="true" />
<field name="person_id" type="int" indexed="true" stored="true" />
<field name="mod_date" type="date" indexed="true" stored="true" />
<field name="lang" type="string" indexed="true" stored="true" />
<field name="focus" type="text_general" indexed="true" stored="true" />
<field name="expertise" type="text_general" indexed="true" stored="true" />
<field name="platforms" type="text_general" indexed="true" stored="true" />
<field name="partners" type="text_general" indexed="true" stored="true" />
<field name="participation" type="text_general" indexed="true" stored="true" />
<field name="additional" type="text_general" indexed="true" stored="true" />
<field name="tag" type="text_general" termVectors="true" multiValued="true" />
<field name="facet_tag" type="string" stored="false" indexed="false" docValues="true" multiValued="true" default=""/>
<!-- language detected by solr -->
<field name="detectedlang" type="string" indexed="true" stored="true" />
<!-- defined locale fields -->
<dynamicField name="*_en" type="text_en" indexed="true" stored="true" />
<dynamicField name="*_fr" type="text_fr" indexed="true" stored="true" />
<copyField source="tag" target="facet_tag"/>
</fields>
When I run an update or a dataimport I know that the "languages" update chain is used because focus is mapped to focus_en and detectedlang is set. However, none of the other fields in langid.fl are mapped. Why?
An example update query:
{
"additional": "here is some other information about me.",
"expertise_id": "10000",
"id": "foo_10000",
"focus": "this is my new focus. It is very exciting. When I am done I expect to be super experienced."
}
And here is the result of a query for expertise_id=10000. Note that additional has not been moved to additional_en:
"response":{"numFound":1,"start":0,"docs":[
{
"additional":"here is some other information about me.",
"expertise_id":10000,
"id":"foo_10000",
"detectedlang":"en",
"focus_en":"this is my new focus. It is very exciting. When I am done I expect to be super experienced.",
"_version_":1447088846110982144}]
}

Turns out that the problem is a syntax error. This line:
<str name="langid.fl">focus, expertise, platforms, partners, participation, additional</str>
must be
<str name="langid.fl">focus,expertise,platforms,partners,participation,additional</str>
The docs state that the field list should be comma- or space-separated. Evidently, combining a comma with a space breaks the parsing (though that works fine in other Solr contexts, such as fl in a requestHandler, which langid.fl is supposedly modelled on). I also tried the space-separated syntax, but it did not fix my issue.
I hope this helps someone.
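One plausible explanation, not verified against the Solr source in this thread: if the processor splits langid.fl on commas only and does not trim the pieces, every name after the first keeps a leading space and no longer matches a document field. The Python sketch below illustrates that assumption and shows a small helper (both function names are hypothetical) that cleans a field list before it goes into the config.

```python
import re

def naive_split(field_list):
    # Assumed behaviour, for illustration only: split on commas, no trimming.
    return field_list.split(",")

def normalize_field_list(field_list):
    # Accept commas and/or whitespace as separators and emit a comma-only list.
    names = [name for name in re.split(r"[,\s]+", field_list.strip()) if name]
    return ",".join(names)

raw = "focus, expertise, platforms, partners, participation, additional"

# The second name keeps its leading space under the naive split.
print(naive_split(raw)[:2])

# The normalized value is safe to paste into <str name="langid.fl">.
print(normalize_field_list(raw))
```

Under this assumption only "focus" would be recognised, which matches the behaviour described in the question.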


Solr BlockJoinQuery returns false positives

We are trying to query indexed nested child documents in Solr, but the results are wrong: for example, when we ask for the parent of a child where event_id: order-1, the result is a parent whose child has event_id: order-5.
We set up a fresh Solr instance using Solr's example data, and queries against it returned correct results. We suspected something in solrconfig.xml, but after removing settings or resetting them to their defaults, the results were still incorrect.
We are currently checking schema.xml to see whether we can correct the results that way.
Our current solrconfig.xml
<config>
<luceneMatchVersion>8.11.2</luceneMatchVersion>
<directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}" />
<schemaFactory class="ClassicIndexSchemaFactory"/>
<indexConfig>
<lockType>single</lockType>
<ramBufferSizeMB>256</ramBufferSizeMB>
<mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
<str name="sort">id asc</str>
<str name="wrapped.prefix">inner</str>
<str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
<int name="inner.maxMergeAtOnce">10</int>
<int name="inner.segmentsPerTier">10</int>
<int name="inner.deletesPctAllowed">20</int>
</mergePolicyFactory>
</indexConfig>
<updateHandler class="solr.DirectUpdateHandler2">
<autoCommit>
<maxDocs>1000000</maxDocs>
<maxSize>2g</maxSize>
<openSearcher>false</openSearcher>
</autoCommit>
<updateLog>
<str name="dir">${solr.data.dir:}</str>
</updateLog>
</updateHandler>
<query>
<maxBooleanClauses>102400</maxBooleanClauses>
<filterCache class="solr.CaffeineCache" maxRamMB="750" initialSize="0" autowarmCount="0" />
<queryResultCache class="solr.CaffeineCache" size="512" initialSize="0" autowarmCount="0" />
<fieldValueCache class="solr.CaffeineCache" size="1" initialSize="0" autowarmCount="0" />
<enableLazyFieldLoading>true</enableLazyFieldLoading>
<queryResultWindowSize>0</queryResultWindowSize>
<queryResultMaxDocsCached>200</queryResultMaxDocsCached>
<useColdSearcher>false</useColdSearcher>
<maxWarmingSearchers>2</maxWarmingSearchers>
</query>
<requestDispatcher handleSelect="false">
<requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048000" />
<httpCaching never304="true" />
</requestDispatcher>
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="df">text</str>
</lst>
</requestHandler>
<requestHandler name="/update" class="solr.UpdateRequestHandler"></requestHandler>
</config>
Our current schema.xml:
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="default-config" version="1.6">
<fieldType name="_nest_path_" class="solr.NestPathField" />
<!-- The StrField type is not analyzed, but indexed/stored verbatim. -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true" />
<fieldType name="strings" class="solr.StrField" sortMissingLast="true" multiValued="true" docValues="true" />
<!-- boolean type: "true" or "false" -->
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" />
<fieldType name="booleans" class="solr.BoolField" sortMissingLast="true" multiValued="true" />
<!-- Numeric field types that index values using KD-trees. Point fields don't support FieldCache, so they must have docValues="true"
if needed for sorting, faceting, functions, etc. -->
<fieldType name="pint" class="solr.IntPointField" docValues="true" />
<fieldType name="pfloat" class="solr.FloatPointField" docValues="true" />
<fieldType name="plong" class="solr.LongPointField" docValues="true" />
<fieldType name="pdouble" class="solr.DoublePointField" docValues="true" />
<fieldType name="pints" class="solr.IntPointField" docValues="true" multiValued="true" />
<fieldType name="pfloats" class="solr.FloatPointField" docValues="true" multiValued="true" />
<fieldType name="plongs" class="solr.LongPointField" docValues="true" multiValued="true" />
<fieldType name="pdoubles" class="solr.DoublePointField" docValues="true" multiValued="true" />
<!-- KD-tree versions of date fields -->
<fieldType name="pdate" class="solr.DatePointField" docValues="true" />
<fieldType name="pdates" class="solr.DatePointField" docValues="true" multiValued="true" />
<uniqueKey>id</uniqueKey>
<!-- Solr automatically populates this with the value of the top/parent ID. E.g. the profile ID. It is required. -->
<field name="_root_" type="string" indexed="true" stored="false" docValues="false" />
<!-- Is populated by Solr automatically with the path of the document in the hierarchy for non-root documents. -->
<field name="_nest_path_" type="_nest_path_" />
<!-- Is populated by Solr automatically to store the ID of each document’s parent document (if there is one). -->
<field name="_nest_parent_" type="string" indexed="true" stored="true"/>
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<!-- docValues are enabled by default for long type so we don't need to index the version field -->
<field name="_version_" type="plong" indexed="false" stored="false" />
<field name="_indexversion_" type="pint" indexed="true" stored="false" multiValued="false" required="true"
default="4" />
<field name="timestamp" type="pdate" indexed="true" stored="false" default="NOW" />
<field name="content_type" type="string" indexed="true" stored="false" />
<!-- define system values, which are known to be single valued -->
<field name="creationdate_l" type="plong" indexed="true" stored="false" />
<field name="lastmodifieddate_l" type="plong" indexed="true" stored="false" />
<field name="firstvisit_l" type="plong" indexed="true" stored="false" />
<field name="lastvisit_l" type="plong" indexed="true" stored="false" />
<!-- behavioral properties -->
<field name="frequency_bp" type="pint" indexed="true" stored="false" />
<field name="intensity_bp" type="pint" indexed="true" stored="false" />
<field name="recent_intensity_bp" type="pfloat" indexed="true" stored="false" />
<field name="firstvisit_behavior_bp" type="pint" indexed="true" stored="false" />
<field name="lastvisit_behavior_bp" type="pint" indexed="true" stored="false" />
<!-- Profile meta data fields only have one value -->
<field name="propertycount_i" type="pint" indexed="true" stored="false" />
<field name="totalpropertycount_i" type="pint" indexed="true" stored="false" />
<field name="totalpropertysize_i" type="pint" indexed="true" stored="false" />
<field name="maxproperty_s" type="string" indexed="true" stored="false" />
<field name="maxpropertyvalues_i" type="pint" indexed="true" stored="false" />
<field name="system_has_property_s" type="strings" indexed="true" stored="false" />
<field name="sample_id_i" type="pint" indexed="true" stored="false" />
<field name="event_id" type="string" indexed="true" multiValued="false" stored="true" />
<field name="event_type_id" type="string" indexed="true" multiValued="false" stored="true" />
<field name="event_date" type="plong" indexed="true" multiValued="false" stored="true" />
<field name="event_profile_id" type="string" indexed="true" multiValued="false" stored="true" />
<dynamicField name="*_ordinal_i" type="pint" indexed="true" stored="false" />
<dynamicField name="*_i" type="pints" indexed="true" stored="false" />
<dynamicField name="*_l" type="plongs" indexed="true" stored="false" />
<dynamicField name="*_f" type="pfloats" indexed="true" stored="false" />
<dynamicField name="*_s" type="strings" indexed="true" stored="false" />
<dynamicField name="*_b" type="boolean" indexed="true" stored="false" />
<dynamicField name="momentum_bp_*" type="pint" indexed="true" stored="false" />
<dynamicField name="threshold_*" type="plong" indexed="true" multiValued="false" stored="false" />
<dynamicField name="firsttouch_*" type="plong" indexed="true" multiValued="false" stored="false" />
<dynamicField name="reentryrestricted_*" type="string" indexed="true" multiValued="false" stored="false"/>
<dynamicField name="exitentrancerestricted_*" type="string" indexed="true" multiValued="false" stored="false"/>
</schema>
Indexed documents:
{
"id":"99c75c9a-b083-428d-baa1-6a9662c6eb72",
"name_s":"Profile 1",
"description_t":"test description",
"age_is":[28,
34],
"creationdate_l":1658990989645,
"content_type":"profile",
"_version_":1739600934763233280,
"_root_":"99c75c9a-b083-428d-baa1-6a9662c6eb72",
"timeline_events":
{
"id":"dcde9bfd-97ee-4d76-97d8-5297c1b2e87d",
"event_id":"order-0",
"event_type_id":"order",
"event_date":1658990989644,
"total_revenue_f":865.0,
"_nest_path_":"/timeline_events#",
"_nest_parent_":"99c75c9a-b083-428d-baa1-6a9662c6eb72",
"content_type":"timeline_event",
"_version_":1739600934763233280,
"_root_":"99c75c9a-b083-428d-baa1-6a9662c6eb72",
"product":[
{
"id":"9dabaac8-7651-4c56-9fb4-66d56b7175c3",
"name_s":"product-0",
"promotion_s":"NO",
"listprice_f":477.0,
"quantity_i":22,
"variant_ss":["handbags",
"men"],
"pages_i":1,
"_nest_path_":"/timeline_events#/product#0",
"_nest_parent_":"dcde9bfd-97ee-4d76-97d8-5297c1b2e87d",
"content_type":"order_product",
"_version_":1739600934763233280,
"_root_":"99c75c9a-b083-428d-baa1-6a9662c6eb72"}]}},
{
"id":"c19483e2-f940-403f-bb24-03adce1bcb02",
"name_s":"Profile 2",
"description_t":"test description for profile 2",
"age_is":[25,
40],
"creationdate_l":1658990989653,
"content_type":"profile",
"_version_":1739600934766379008,
"_root_":"c19483e2-f940-403f-bb24-03adce1bcb02",
"timeline_events":
{
"id":"dcde9bfd-97ee-4d76-97d8-5297c1b2e87d",
"event_id":"order-4",
"event_type_id":"order",
"event_date":1658990989649,
"total_revenue_f":952.0,
"_nest_path_":"/timeline_events#",
"_nest_parent_":"c19483e2-f940-403f-bb24-03adce1bcb02",
"content_type":"timeline_event",
"_version_":1739600934766379008,
"_root_":"c19483e2-f940-403f-bb24-03adce1bcb02",
"product":[
{
"id":"7a143554-b5f9-4487-b182-9938b91f76b4",
"name_s":"product-4",
"promotion_s":"YES",
"listprice_f":487.0,
"quantity_i":25,
"variant_ss":["junior",
"watches"],
"pages_i":1,
"_nest_path_":"/timeline_events#/product#0",
"_nest_parent_":"dcde9bfd-97ee-4d76-97d8-5297c1b2e87d",
"content_type":"order_product",
"_version_":1739600934766379008,
"_root_":"c19483e2-f940-403f-bb24-03adce1bcb02"}]}},
{
"id":"da88463c-fcca-4405-8656-0371809ccb28",
"name_s":"Profile 3",
"description_t":"test description for profile 3",
"age_is":[34,
39],
"creationdate_l":1658990989648,
"content_type":"profile",
"_version_":1739600934768476160,
"_root_":"da88463c-fcca-4405-8656-0371809ccb28",
"timeline_events":
{
"id":"61f47b18-15f4-4a4d-bb93-a4232dd22043",
"event_id":"order-2",
"event_type_id":"order",
"event_date":1658990989647,
"total_revenue_f":838.0,
"_nest_path_":"/timeline_events#",
"_nest_parent_":"da88463c-fcca-4405-8656-0371809ccb28",
"content_type":"timeline_event",
"_version_":1739600934768476160,
"_root_":"da88463c-fcca-4405-8656-0371809ccb28",
"product":[
{
"id":"1fc4616b-2629-4cc4-8a60-7238f97c9aae",
"name_s":"product-2",
"promotion_s":"YES",
"listprice_f":403.0,
"quantity_i":26,
"variant_ss":["pants",
"women"],
"pages_i":1,
"_nest_path_":"/timeline_events#/product#0",
"_nest_parent_":"61f47b18-15f4-4a4d-bb93-a4232dd22043",
"content_type":"order_product",
"_version_":1739600934768476160,
"_root_":"da88463c-fcca-4405-8656-0371809ccb28"}]}}]
}
}
When we execute either of the following queries:
{!parent which="*:* -_nest_path_:*"}event_id:order-0
OR
{!parent which="content_type:profile"}event_id:order-0
For this example, the queries do the same thing and both return the same incorrect result.
{
"id":"da88463c-fcca-4405-8656-0371809ccb28",
"name_s":"Profile 3",
"description_t":"test description for profile 3",
"age_is":[34,
39],
"creationdate_l":1658990989648,
"content_type":"profile",
"_version_":1739600934768476160,
"_root_":"da88463c-fcca-4405-8656-0371809ccb28"
}
This is not correct; the correct response would be:
{
"id":"99c75c9a-b083-428d-baa1-6a9662c6eb72",
"name_s":"Profile 1",
"description_t":"test description",
"age_is":[28,
34],
"creationdate_l":1658990989645,
"content_type":"profile",
"_version_":1739600934763233280,
"_root_":"99c75c9a-b083-428d-baa1-6a9662c6eb72"
}
After some more trial and error we discovered that the issue lies in
<mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
<str name="sort">id asc</str>
<str name="wrapped.prefix">inner</str>
<str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
<int name="inner.maxMergeAtOnce">10</int>
<int name="inner.segmentsPerTier">10</int>
<int name="inner.deletesPctAllowed">20</int>
</mergePolicyFactory>
If this part is removed the results are correct.
We are still doing further investigation to identify what exactly is going wrong. Will keep updating the thread as we find more details.
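One plausible explanation, not confirmed in this thread: block join queries rely on each child document being stored immediately before its parent in the index, and a merge policy that re-sorts documents by id can break that adjacency. The Python sketch below is a toy model of that idea, not Lucene's implementation; the ids and the helper name are hypothetical.

```python
def block_join_parents(index, child_matches):
    """Toy model: map each matching child to the next parent in index order."""
    parents = []
    for pos, doc in enumerate(index):
        if doc["is_parent"] or not child_matches(doc):
            continue
        following = (d for d in index[pos + 1:] if d["is_parent"])
        parent = next(following, None)
        if parent is not None:
            parents.append(parent["id"])
    return parents

# Hypothetical ids; each block is written child first, parent last.
indexed = [
    {"id": "k7", "is_parent": False, "event_id": "order-0"},
    {"id": "a1", "is_parent": True},
    {"id": "b2", "is_parent": False, "event_id": "order-5"},
    {"id": "m9", "is_parent": True},
]

def is_order_0(doc):
    return doc.get("event_id") == "order-0"

# Original block order: the child resolves to its own parent.
print(block_join_parents(indexed, is_order_0))

# After sorting by id, the same child sits in front of a different parent.
resorted = sorted(indexed, key=lambda d: d["id"])
print(block_join_parents(resorted, is_order_0))
```

In this model the first call yields a1 (the real parent) and the second yields m9 (the parent of the order-5 child), which has the same shape as the false positive described above.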

Solr More Like This (MLT) not returning results

I'm currently looking to implement "more like this" functionality based on a number of fields in my index.
My current configuration is as follows:
Haystack | PySolr | Solr
For this piece I'm using PySolr and passing the parameters to the more_like_this function. The response finds the document but not any related results. Why is that?
Here is the URL I hit:
http://localhost:8080/solr/mlt?q=django_id:12123412&mlt.fl=industry_ids,loc_state,amount,sector_id&mlt.interestingTerms=details
Here is my response from Solr:
<response>
<object type="{XXXXXX-0F1D-4F28-AAA2-XXXXXXXXXXX}" cotype="cs" id="cosymantecbfw" style="width: 0px; height: 0px; display: block;"/>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">24</int>
</lst>
<result name="match" numFound="1" start="0">
<doc>...</doc>
</result>
<result name="response" numFound="0" start="0"/>
<lst name="interestingTerms"/>
</response>
solrconfig.xml
<!-- More Like This -->
<requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
</requestHandler>
schema.xml
<field name="award_amount" type="sfloat" indexed="true" stored="true" multiValued="false" termVectors="true" />
<field name="estatus" type="slong" indexed="true" stored="true" multiValued="false" termVectors="true"/>
<field name="loc_state" type="string" indexed="true" stored="true" multiValued="false" termVectors="true"/>
<field name="orgtype_id" type="string" indexed="true" stored="true" multiValued="false" termVectors="true" />
<field name="sector_id" type="string" indexed="true" stored="true" multiValued="false" termVectors="true"/>
<field name="industry_ids" type="string" indexed="true" stored="true" multiValued="true" termVectors="true" />
<field name="award_amount_exact" type="sfloat" indexed="true" stored="true" multiValued="false" termVectors="true" />
<field name="sector_id_exact" type="string" indexed="true" stored="true" multiValued="false" termVectors="true"/>
<field name="amount_exact" type="sfloat" indexed="true" stored="true" multiValued="false" termVectors="true"/>
Any help would be appreciated!
Your text fields must have type text, which analyzes them to make them searchable. The string fields are indexed verbatim as a single token, so they only match exactly, which makes them of little use for MLT.
Refer to copy fields if you ever want to store the same data as both text and string (for example, for faceting).
I see you also intend to find numbers closest to your query. MLT is not right for that; you want to compose a function query instead. See: SolR : More Like This on number fields
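As a rough sketch of what such a function query could look like, the Python helper below builds a string from Solr's recip, abs and sub functions so that the value peaks when the field equals a target. The helper names and the target value 1000 are assumptions for illustration, and the target is assumed to be positive; only the field name amount comes from the question's URL.

```python
def closeness_boost(field, target):
    """Build a Solr function-query string that peaks when field == target."""
    # recip(x, m, a, b) evaluates to a / (m * x + b) in Solr.
    return f"recip(abs(sub({field},{target})),1,{target},{target})"

def closeness_score(value, target):
    """The same formula evaluated locally, to show how the score decays."""
    return target / (abs(value - target) + target)

# String that could be passed, for example, as a boost or sort expression.
print(closeness_boost("amount", 1000))

# An exact match scores 1.0; values further away score lower.
print(closeness_score(1000, 1000), closeness_score(1500, 1000))
```

How the expression is wired into the request (as a boost, a sort, or a filter) depends on the query parser in use and is left open here.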

The simplest Solr DIH indexing

I'm trying to index data from a database in Solr using the DIH.
So I have modified the two config files as follows:
solrconfig.xml :
<requestHandler name="/dataimport"
class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
data-config.xml :
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/test" user="root" password="****"/>
<document>
<entity name="source_scellee" query="select * from source_scellee">
</entity>
</document>
</dataConfig>
source_scellee being the name of my table on my test database. It contains many fields.
Obviously, I'm only trying to run a simple test. When running http://localhost:8983/solr/dataimport?command=full-import&clean=false&commit=true I get the following result:
<str name="Full Dump Started">2012-01-27 12:27:01</str><str name="">Indexing completed. Added/Updated: 4 documents. Deleted 0 documents.</str><str name="Committed">2012-01-27 12:27:02</str>
<str name="**Total Documents Failed**">4</str>
Besides, there is no warning or error in the server logs. The table "source_scellee" contains 4 records, yet the output says all documents failed.
If I run a query from http://localhost:8983/solr/admin/ no results appear at all! How can I solve this?
(*:* shows no results)
Thank you for your help!!!
----edit---
I have added these lines to my schema.xml :
<field name="ID" type="int" indexed="true" stored="true" />
<field name="reference_catalogue" type="string" indexed="true" stored="true"/>
<field name="reference_capsule" type="string" indexed="true" stored="true"/>
<field name="organisme_certificateur" type="string" indexed="true" stored="true" />
<field name="reference_certificat" type="string" indexed="true" stored="true" />
<field name="duree_d_utilisation" type="string" indexed="true" stored="true" />
<field name="activite_nominale" type="string" indexed="true" stored="true"/>
<field name="activite_minimale" type="string" indexed="true" stored="true"/>
<field name="activite_maximale" type="string" indexed="true" stored="true"/>
<field name="coffret" type="boolean" indexed="true" stored="true"/>
<field name="dispositif_medical" type="boolean" indexed="true" stored="true"/>
<field name="forme_speciale" type="boolean" indexed="true" stored="true" />
<field name="exemption_cpa" type="boolean" indexed="true" stored="true"/>
<field name="marquage_ce" type="boolean" indexed="true" stored="true"/>
<field name="element_cible" type="boolean" indexed="true" stored="true"/>
However, the result is still the same: no results when querying (I also tried restarting Solr and re-indexing everything).
------second edit---
I have tried the dynamic import
Now my data-config.xml looks like this :
<document>
<entity name="source_scellee" query="select * from source_scellee">
<field column="ID" name="ID_i" />
<field column="reference_catalogue" name="reference_catalogue_s" />
<field column="reference_capsule" name="reference_capsule_s" />
<field column="organisme_certificateur" name="organisme_certificateur_s" />
<field column="reference_certificat" name="reference_certificat_s" />
<field column="duree_d_utilisation" name="duree_d_utilisation_s" />
<field column="activite_nominale" name="activite_nominale_s" />
<field column="activite_minimale" name="activite_minimale_s" />
<field column="activite_maximale" name="activite_maximale_s" />
<field column="coffret" name="coffret_b" />
<field column="dispositif_medical" name="dispositif_medical_b" />
<field column="forme_speciale" name="forme_speciale_b" />
<field column="exemption_cpa" name="exemption_cpa_b" />
<field column="marquage_ce" name="marquage_ce_b" />
<field column="element_cible" name="element_cible_b" />
</entity>
</document>
1.) You can take a look at the statistics page to see how many docs are indexed right now:
http://localhost:8983/solr/admin/stats.jsp
2.) The result of your search depends on your schema.xml, because that is where it is defined how docs are indexed and stored, which fields are processed, and how searches are handled at query time.
Please take a look at this file, or post the field definitions from schema.xml along with the schema/design of your table source_scellee.
Do the columns and the fields have the same names?
//Edit: This should work if the column names and field names are the same:
<document>
<entity name="source_scellee"
pk="ID"
query="select * from source_scellee">
</entity>
</document>
Is having NULL values in the data an issue?
That depends on the destination field.
Are you running Solr in Tomcat or something like that?
Take a look at the Java EE container output, such as catalina.out.
I am pretty sure the issue lies in how the DIH is trying to map fields. Thanks for adding the information from your schema file... However, I believe that what you have done is added configuration that needs to be added separately to both the schema.xml and the data-config.xml for the DIH.
Based on the Full Import Example from the Solr Wiki, I would try the following.
schema.xml
<field name="ID" type="int" indexed="true" stored="true" />
<field name="reference_catalogue" type="string" indexed="true" stored="true"/>
<field name="reference_capsule" type="string" indexed="true" stored="true"/>
<field name="date_de_creation" type="date" indexed="true" stored="true"/>
<field name="organisme_certificateur" type="string" indexed="true" stored="true" />
<field name="reference_certificat" type="string" indexed="true" stored="true" />
<field name="duree_d_utilisation" type="string" indexed="true" stored="true" />
<field name="activite_nominale" type="string" indexed="true" stored="true"/>
<field name="activite_minimale" type="string" indexed="true" stored="true"/>
<field name="activite_maximale" type="string" indexed="true" stored="true"/>
<field name="coffret" type="int" indexed="true" stored="true"/>
<field name="dispositif_medical" type="int" indexed="true" stored="true"/>
<field name="forme_speciale" type="int" indexed="true" stored="true" />
<field name="exemption_cpa" type="int" indexed="true" stored="true"/>
<field name="marquage_ce" type="int" indexed="true" stored="true"/>
<field name="element_cible" type="int" indexed="true" stored="true"/>
data-config.xml
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/test" user="root" password="****"/>
<document>
<entity name="source_scellee" query="select * from source_scellee">
<field column="ID" name="ID"/>
<field column="reference_catalogue" name="reference_catalogue"/>
<field column="reference_capsule" name="reference_capsule"/>
<field column="date_de_creation" name="date_de_creation"/>
<field column="organisme_certificateur" name="organisme_certificateur"/>
<field column="reference_certificat" name="reference_certificat"/>
<field column="duree_d_utilisation" name="duree_d_utilisation"/>
<field column="activite_nominale" name="activite_nominale"/>
<field column="activite_minimale" name="activite_minimale"/>
<field column="activite_maximale" name="activite_maximale"/>
<field column="coffret" name="coffret"/>
<field column="dispositif_medical" name="dispositif_medical"/>
<field column="forme_speciale" name="forme_speciale"/>
<field column="exemption_cpa" name="exemption_cpa"/>
<field column="marquage_ce" name="marquage_ce"/>
<field column="element_cible" name="element_cible"/>
</entity>
</document>
</dataConfig>
There is a way to setup the schema.xml to dynamically add fields that it encounters by using some naming conventions. Please see the Dynamic Fields details in the Solr Wiki for more details and some examples of how this can be done.
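To check the naming question raised in this thread before re-running the import, a small script can list the columns that match neither an explicit schema field nor a dynamic field pattern. The sketch below is in Python and uses an exact, case-sensitive comparison, which is a simplifying assumption about how names are matched; the column list and the helper name are hypothetical.

```python
from fnmatch import fnmatchcase

def unmapped_columns(columns, fields, dynamic_patterns=()):
    """Return the columns that match neither an explicit field nor a dynamic pattern."""
    missing = []
    for column in columns:
        if column in fields:
            continue
        if any(fnmatchcase(column, pattern) for pattern in dynamic_patterns):
            continue
        missing.append(column)
    return missing

# A few explicit field names from the schema in the question.
fields = {"ID", "reference_catalogue", "coffret"}

# Hypothetical column list for the source_scellee table.
columns = ["ID", "reference_catalogue", "coffret", "date_de_creation"]

# Columns without a matching explicit field are reported.
print(unmapped_columns(columns, fields))

# Suffixed names are covered once matching dynamic patterns exist.
print(unmapped_columns(["coffret_b", "ID_i"], fields, ["*_b", "*_i"]))
```

Any column the script reports needs either its own field definition, a matching dynamic field, or an explicit field mapping in data-config.xml.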

Tika Solr Metadata mapping ignore document title

I have the following config file for solr:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<!-- All the main content goes into "text"... if you need to return
the extracted text or do highlighting, use a stored field. -->
<str name="lowernames">true</str>
<str name="fmap.content">content</str>
<str name="fmap.application_name">type</str>
<str name="fmap.content_type">mime</str>
<str name="fmap.stream_size">size</str>
<str name="uprefix">ignored_</str>
<str name="captureAttr">false</str>
</lst>
</requestHandler>
and this is my schema:
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="access_type" type="string" indexed="true" stored="false"/>
<field name="access_restriction" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="title" type="string" indexed="true" stored="true" multiValued="true" />
<field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="content" type="text_en_splitting" indexed="true" stored="true"/>
<field name="created" type="date" indexed="true" stored="true"/>
<field name="createdby" type="string" indexed="true" stored="true"/>
<field name="modified" type="date" indexed="true" stored="true"/>
<field name="modifiedby" type="string" indexed="true" stored="true"/>
<field name="source" type="string" indexed="true" stored="true" />
<field name="version" type="string" indexed="true" stored="true" />
<field name="resourcelink" type="string" indexed="true" stored="true" />
<field name="downloadlink" type="string" indexed="true" stored="true" />
<field name="type" type="string" indexed="true" stored="true" />
<field name="mime" type="string" indexed="true" stored="true" />
<field name="size" type="string" indexed="true" stored="true" />
I want to set the title myself. But Tika keeps setting its own title (that's why I temporarily set multiValued="true"), which I find strange because I have to map things like stream_size and content_type manually.
What solution is possible for this issue?
I'd like Tika to override the title I assign, like this:
I have 3 documents. For one of them Tika doesn't extract a title, so I pass my own title via literal.title. When Tika does extract a title, I want it to override the one I passed in literal.title. Is this possible?
I was working on the same issue some time ago, but I hit a wall as well :(
I let Tika take "title", and use literal.other_title_like_field to store proper title.
This is not the best solution, but it worked for me.
For those who are still struggling with this problem, I solved it by adding
<str name="fmap.title">ignored_</str>
in my ExtractingRequestHandler defaults.

Indexing office formats with a custom field type schema

We have the following Solr (3.4) schema for indexing html/text documents:
<fields>
<field name="text" type="text" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="title" type="text" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="created" type="date" indexed="true"
stored="true" required="true" multiValued="false"
omitNorms="false"/>
<field name="modified" type="date" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="filesize" type="integer" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="mimetype" type="string" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="id" type="string" indexed="true"
stored="true" required="true" multiValued="false"
omitNorms="false"/>
<field name="tag" type="string" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="relpath" type="string" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<dynamicField name="tika_*" type="ignored" />
</fields>
The configurations are auto-generated from templates from the solrinstance recipe for zc.buildout.
Now we need to import/index PDF/Office files etc. into Solr for fulltext indexing.
The generated requestHandler for the extraction is:
<requestHandler name="/update/extract"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="fmap.text">tika_content</str>
<str name="lowernames">false</str>
<str name="uprefix">tika_</str>
</lst>
</requestHandler>
But after uploading a PDF file through curl, I cannot find any indication that it has been indexed (no changes in the document stats, etc.).
What is the trick here?
[Update]
I am using
curl "http://localhost:8983/solr/update/extract?literal.id=2&commit=true&fmap.content=text" -F "myfile=@1.pdf"
to upload a PDF file. Adding fmap.content=text seems to do the desired mapping (overriding the generated configuration).
This seems to have solved the problem.
fmap is basically a field mapping for the content generated by Tika.
The Tika handler extracts the content of the uploaded document and assigns it to a field named content.
<str name="fmap.content">text</str> maps the content field to the text field defined in the schema.
Since you have a text field defined in the schema, this works.
However, for <str name="fmap.text">tika_content</str>, there is no tika_content field defined, nor, I think, does Tika generate a field named text, so the mapping would not match anything.
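The mapping described above can be sketched as a toy model in Python. It assumes the documented processing order (fmap rules are applied first, then uprefix is added to names the schema does not define) and ignores lowernames, which is false in this configuration; the function name is hypothetical and this is not Solr's actual code.

```python
def resolve_field(name, fmap, schema_fields, uprefix=None):
    """Toy model: apply fmap first, then prefix names the schema does not define."""
    name = fmap.get(name, name)
    if name not in schema_fields and uprefix:
        name = uprefix + name
    return name

# Explicit fields from the schema in the question.
schema_fields = {"text", "title", "created", "modified", "filesize",
                 "mimetype", "id", "tag", "relpath"}

# fmap rules from the generated config, and with the URL override added.
generated = {"text": "tika_content"}
overridden = {"text": "tika_content", "content": "text"}

# Generated config: the extracted body ends up under the tika_ prefix.
print(resolve_field("content", generated, schema_fields, uprefix="tika_"))

# With fmap.content=text the body lands in the schema's text field.
print(resolve_field("content", overridden, schema_fields, uprefix="tika_"))
```

In this model the generated config sends the body to tika_content, which the tika_* dynamic field of type ignored would then discard, while the override routes it to text.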
