Solr: how to index file content into multiple fields?

Solr version:
7.3.0
I want to index a file and store the extracted text in multiple fields (a word-split field and a bi-gram field) for search flexibility.
I wrote the configset below, but it does not work: Solr indexes the text into only one of content_text or content_text_bi, whichever fmap.content entry is defined first.
solrconfig.xml
...
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<str name="fmap.meta">ignored_</str>
<str name="fmap.content">content_text</str>
<str name="fmap.content">content_text_bi</str>
<str name="captureAttr">true</str>
</lst>
</requestHandler>
...
schema.xml
...
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<!-- docValues are enabled by default for long type so we don't need to index the version field -->
<field name="_version_" type="plong" indexed="false" stored="false"/>
<field name="_root_" type="string" indexed="true" stored="false" docValues="false" />
<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="content_text" type="text_ja" indexed="true" stored="true" storeOffsetsWithPositions="false"/>
<field name="content_text_bi" type="text_ja_bi" indexed="true" stored="true" storeOffsetsWithPositions="false"/>
<field name="filepath" type="string" indexed="true" stored="true" />
<field name="filename" type="string" indexed="true" stored="true" />
<field name="storage_id" type="pint" indexed="true" stored="true" />
...
How can I make it work as I want?

I solved this by using copyField in schema.xml.
1. Add this line to schema.xml:
<copyField source="content_text" dest="content_text_bi" />
2. Remove this line from solrconfig.xml:
<str name="fmap.content">content_text_bi</str>
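Put together, the relevant parts of the two files might look like the sketch below. It only rearranges lines already shown in the question. A single fmap.content entry feeds content_text, and copyField then passes the same raw (pre-analysis) text to content_text_bi, where the text_ja_bi analyzer is applied.

```xml
<!-- solrconfig.xml: only one fmap.content mapping remains -->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <str name="fmap.meta">ignored_</str>
    <str name="fmap.content">content_text</str>
    <str name="captureAttr">true</str>
  </lst>
</requestHandler>

<!-- schema.xml: both fields stay as defined in the question -->
<field name="content_text" type="text_ja" indexed="true" stored="true" storeOffsetsWithPositions="false"/>
<field name="content_text_bi" type="text_ja_bi" indexed="true" stored="true" storeOffsetsWithPositions="false"/>
<!-- copy the extracted text into the bi-gram field at index time -->
<copyField source="content_text" dest="content_text_bi"/>
```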

Related

Solr Boosting on custom fields

Solr experts,
I am trying to change the score for a few of my custom fields through index-time field boosting, but it doesn't seem to change the score. Please help.
These are my custom fields in schema.xml
<field name="doc_id" type="string" indexed="true" stored="true" omitNorms="false"/>
<field name="doc_name" type="text_autocomplete" indexed="true" stored="true" multiValued="true" omitNorms="false"/>
<field name="doc_author" type="text_autocomplete" indexed="true" stored="true" multiValued="true" />
<field name="modifieddate" type="text_autocomplete" indexed="true" stored="true" multiValued="true"/>
<field name="doc_content" type="text_autocomplete" indexed="true" stored="true" multiValued="true" omitNorms="false"/>
<field name="doc_title" type="text_autocomplete" indexed="true" stored="true" multiValued="true"/>
<field name="doc_description" type="text_autocomplete" indexed="true" stored="true" multiValued="true" omitNorms="false"/>
I posted 3 test docs as below using SolrJ
<?xml version="1.0" encoding="UTF-8"?>
<add>
<doc>
<field name="doc_id">7781</field>
<field name="doc_name" boost="1.5">Cat</field>
<field name="doc_author">Nsd80</field>
<field name="modifieddate">11 30</field>
<field name="doc_content" boost="1.5">Cat life history. Cat life cycle. Cat Foods</field>
<field name="doc_title">Titled</field>
<field name="doc_description" boost="1.5">Cat related details</field>
</doc>
<doc>
<field name="doc_id">7782</field>
<field name="doc_name" boost="2.5">Dog</field>
<field name="doc_author">Nsd80</field>
<field name="modifieddate">11 30</field>
<field name="doc_content" boost="2.5">Dog life history. Dog life cycle. Dog Foods</field>
<field name="doc_title">Titled</field>
<field name="doc_description" boost="2.5">Dog details</field>
</doc>
<doc>
<field name="doc_id">7783</field>
<field name="doc_name" boost="2.7">Cow</field>
<field name="doc_author">Nsd80</field>
<field name="modifieddate">11 30</field>
<field name="doc_content" boost="2.7">Cow life history. Cow life cycle. Cow Foods</field>
<field name="doc_title">Titled</field>
<field name="doc_description" boost="2.7">Cow lifecycle</field>
</doc>
</add>
When I query to find the scores as below,
localhost:8983/solr/select/?q=doc_id:*&fl=*,score
it shows 1.0 as score for all three docs
I was trying to boost them as
localhost:8983/solr/select?defType=edismax&q=doc_description:*Cow*^195
but that doesn't seem to work either:
<arr name="doc_description">
<str>Dog details</str>
</arr>
<long name="_version_">1479948142366425088</long>
<float name="score">1.0</float>
</doc>
<doc>
I also tried elevation with
localhost:8983/solr/elevate?q=doc_id:7781&enableElevation=true&fl=doc_id,score,[elevated]
but the document was not elevated:
<result name="response" numFound="1" start="0" maxScore="7.23343">
<doc>
<str name="doc_id">7781</str>
<float name="score">7.23343</float>
<bool name="[elevated]">false</bool>
</doc>
</result>
</response>
My requirement is just to boost the docs so that they get higher scores and I can retrieve them by score. As my XML docs show, I tried field boosting at index time, then boosting with edismax, and also elevation.
Can anyone help me with a detailed example?

automatic language detect in solr 4.5.1 during indexing time

I need your help.
I want to detect Korean and English at indexing time in Solr.
My Solr directory structure is
/opt/tmocat7/webapps/solr (solr webapp)
/usr/share/solr/collection1 (solr core)
/usr/share/solr/lib/langid (lib for langid)
First, I copied some libraries (jsonic-1.2.7.jar, langdetect-1.1-20120112.jar, solr-langid-4.5.1.jar) into a specific directory (/usr/share/solr/lib/langid) under my Solr home.
My solrconfig.xml is
<lib dir="../lib/langid/" regex=".*\.jar" />
<requestHandler name="/update" class="solr.UpdateRequestHandler">
<lst name="defaults">
<str name="update.chain">dedupe</str>
<str name="update.chain">uuid</str>
<str name="update.chain">langid</str>
</lst>
</requestHandler>
<updateRequestProcessorChain name="langid">
<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
<bool name="langid">true</bool>
<str name="langid.fl">title,content,comment</str>
<str name="langid.langField">lang</str>
<str name="langid.langsField">langs</str>
<str name="langid.lcmap">ko:ko kor:ko en_GB:en en_US:en</str>
<str name="langid.whitelist">ko,en</str>
<bool name="langid.map">true</bool>
<str name="langid.map.fl">title,content,comment</str>
<bool name="langid.map.keepOrig">true</bool>
<bool name="langid.map.individual">true</bool>
<str name="langid.fallback">ko</str>
<str name="langid.map.lcmap">ko:ko kor:ko en_GB:en en_US:en</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
and schema.xml is
<field name="lang" type="string" indexed="true" stored="true" multiValued="false" />
<field name="langs" type="string" indexed="true" stored="true" multiValued="true" />
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="title" type="text_ko" indexed="true" stored="true" multiValued="true"/>
<field name="content" type="text_ko" indexed="true" stored="true" multiValued="true"/>
<field name="comment" type="text_ko" indexed="true" stored="true" multiValued="true" />
<field name="site" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="page" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="fileloc" type="text_general" indexed="true" stored="true"
multiValued="false"/>
<field name="filename" type="text_general" indexed="true" stored="true"
multiValued="false" />
<field name="storeddate" type="date" indexed="true" stored="true" multiValued="false"/>
<!-- for english web data-->
<field name="title_en" type="text_en" indexed="true" stored="true" multiValued="true" />
<field name="content_en" type="text_en" indexed="true" stored="true" multiValued="true" />
<field name="comment_en" type="text_en" indexed="true" stored="true" multiValued="true" />
<field name="title_ko" type="text_ko" indexed="true" stored="true" multiValued="true"/>
<field name="content_ko" type="text_ko" indexed="true" stored="true" multiValued="true"/>
<field name="comment_ko" type="text_ko" indexed="true" stored="true" multiValued="true" />
<copyField source="title" dest="title_en"/>
<copyField source="content" dest="content_en"/>
<copyField source="comment" dest="comment_en"/>
<copyField source="title" dest="title_ko"/>
<copyField source="content" dest="content_ko"/>
<copyField source="comment" dest="comment_ko"/>
I read some books and searched the web for information about detecting language in Solr, but I can't get language detection to work.
What am I doing wrong?
For more information, here are my post.sh and log.
This is post.sh:
#!/bin/sh
FILES=$*
URL=http://localhost:port/solr/collection1/update
for f in $FILES; do
echo Posting file $f to $URL
curl $URL --data-binary @$f -H 'Content-type:application/xml'
echo
done
#send the commit command to make sure all the changes are flushed and visible
curl $URL --data-binary '<commit/>' -H 'Content-type:application/xml'
echo
Part of the Tomcat logs during indexing:
70634079 [http-bio-7070-exec-38] TRACE org.apache.solr.handler.UpdateRequestHandler – body
70634079 [http-bio-7070-exec-38] DEBUG org.apache.solr.update.processor.LogUpdateProcessor – PRE_UPDATE add{,id=2f2323f4f7966e0d} {{params({params(),defaults(update.chain=dedupe&update.chain=uuid&update.chain=langid)}),defaults(wt=xml)}}
70634125 [http-bio-7070-exec-38] TRACE org.apache.solr.update.UpdateLog – TLOG: added id 2f2323f4f7966e0d to tlog{file=/usr/share/solr/collection1/data/tlog/tlog.0000000000000000129 refcount=1} LogPtr(29407) map=614254179
70634125 [http-bio-7070-exec-38] DEBUG org.apache.solr.update.processor.LogUpdateProcessor – PRE_UPDATE FINISH {{params({params(),defaults(update.chain=dedupe&update.chain=uuid&update.chain=langid)}),defaults(wt=xml)}}
70634126 [http-bio-7070-exec-38] INFO org.apache.solr.update.processor.LogUpdateProcessor – [collection1] webapp=/solr path=/update params={} {add=[2f2323f4f7966e0d (1473490520171872256)]} 0 68
70634146 [http-bio-7070-exec-33] TRACE org.apache.solr.handler.UpdateRequestHandler – body
70634146 [http-bio-7070-exec-33] DEBUG org.apache.solr.update.processor.LogUpdateProcessor – PRE_UPDATE add{,id=329ee20831e1a0c7} {{params({params(),defaults(update.chain=dedupe&update.chain=uuid&update.chain=langid)}),defaults(wt=xml)}}
70634148 [http-bio-7070-exec-33] TRACE org.apache.solr.update.UpdateLog – TLOG: added id 329ee20831e1a0c7 to tlog{file=/usr/share/solr/collection1/data/tlog/tlog.0000000000000000129 refcount=1} LogPtr(46005) map=614254179
70634148 [http-bio-7070-exec-33] DEBUG org.apache.solr.update.processor.LogUpdateProcessor – PRE_UPDATE FINISH {{params({params(),defaults(update.chain=dedupe&update.chain=uuid&update.chain=langid)}),defaults(wt=xml)}}
70634148 [http-bio-7070-exec-33] INFO org.apache.solr.update.processor.LogUpdateProcessor – [collection1] webapp=/solr path=/update params={} {add=[329ee20831e1a0c7 (1473490520241078272)]} 0 2
I can't find any other warnings or errors.
I need your advice.
Thanks, all.

I think you should use /update/extract instead of /update.
In Solr 5.3.1, it works fine for me with /update/extract.
Here's the full config:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<!-- capture link hrefs but ignore div attributes -->
<str name="captureAttr">true</str>
<str name="fmap.a">links</str>
<str name="fmap.div">ignored_</str>
<str name="update.chain">langid</str>
</lst>
</requestHandler>
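If you want to stay on plain /update, one thing worth checking in the question's config is the three update.chain defaults. As far as I know, update.chain takes a single chain name and only the first value given is used, so dedupe would win and the langid chain would never run. A minimal sketch, assuming language detection is the only custom processing needed on this handler:

```xml
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <!-- a single chain name; "langid" is the chain defined in the question -->
    <str name="update.chain">langid</str>
  </lst>
</requestHandler>
```

If the dedupe and uuid processing are also needed, their processors would presumably have to be added to the same chain, ahead of solr.RunUpdateProcessorFactory.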

Thanks for the question and the great answers; they helped me configure my system appropriately. I don't know how the JAR file solr-langdetect.*.*.*.jar ended up in my lib directory, but each time I started Solr it showed the following error:
org.apache.solr.common.SolrException: com.cybozu.labs.langdetect.DetectorFactory.loadProfile(Ljava/util/List;)V
After removing that JAR file everything worked fine. The other three JAR files mentioned in the question (jsonic-*.*.*.jar, langdetect-*.*.jar, solr-langid-*.*.*.jar) are however required.

langid UpdateRequestProcessor only mapping first field

I am trying to use solr's langid UpdateRequestProcessor. Here is the config:
<updateRequestProcessorChain name="languages">
<processor class="solr.LangDetectLanguageIdentifierUpdateProcessorFactory">
<lst name="invariants">
<str name="langid.fl">focus, expertise, platforms, partners, participation, additional</str>
<str name="langid.whitelist">en,fr</str>
<str name="langid.fallback">en</str>
<str name="langid.langField">detectedlang</str>
<bool name="langid.map">true</bool>
<bool name="langid.map.keepOrig">false</bool>
</lst>
</processor>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
My fields look like this:
<fields>
<field name="_root_" type="string" indexed="true" stored="false"/>
<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
<field name="id" type="string" indexed="true" stored="true" required="true" />
<!-- raw fields from sql db -->
<field name="expertise_id" type="int" indexed="true" stored="true" />
<field name="person_id" type="int" indexed="true" stored="true" />
<field name="mod_date" type="date" indexed="true" stored="true" />
<field name="lang" type="string" indexed="true" stored="true" />
<field name="focus" type="text_general" indexed="true" stored="true" />
<field name="expertise" type="text_general" indexed="true" stored="true" />
<field name="platforms" type="text_general" indexed="true" stored="true" />
<field name="partners" type="text_general" indexed="true" stored="true" />
<field name="participation" type="text_general" indexed="true" stored="true" />
<field name="additional" type="text_general" indexed="true" stored="true" />
<field name="tag" type="text_general" termVectors="true" multiValued="true" />
<field name="facet_tag" type="string" stored="false" indexed="false" docValues="true" multiValued="true" default=""/>
<!-- language detected by solr -->
<field name="detectedlang" type="string" indexed="true" stored="true" />
<!-- defined locale fields -->
<dynamicField name="*_en" type="text_en" indexed="true" stored="true" />
<dynamicField name="*_fr" type="text_fr" indexed="true" stored="true" />
<copyField source="tag" target="facet_tag"/>
</fields>
When I run an update or a dataimport I know that the "languages" update chain is used because focus is mapped to focus_en and detectedlang is set. However, none of the other fields in langid.fl are mapped. Why?
An example update query:
{
"additional": "here is some other information about me.",
"expertise_id": "10000",
"id": "foo_10000",
"focus": "this is my new focus. It is very exciting. When I am done I expect to be super experienced."
}
And here is the result of a query for expertise_id=10000. Note that additional has not been moved to additional_en:
"response":{"numFound":1,"start":0,"docs":[
{
"additional":"here is some other information about me.",
"expertise_id":10000,
"id":"foo_10000",
"detectedlang":"en",
"focus_en":"this is my new focus. It is very exciting. When I am done I expect to be super experienced.",
"_version_":1447088846110982144}]
}

Turns out that the problem is a syntax error. This line:
<str name="langid.fl">focus, expertise, platforms, partners, participation, additional</str>
must be
<str name="langid.fl">focus,expertise,platforms,partners,participation,additional</str>
The docs state that the field list should be comma- or space-separated values. Evidently, combining a comma and a space breaks things (though that works fine in other Solr contexts, such as fl in a requestHandler, which langid.fl is supposedly modelled on). I tried the space-separated syntax as well, but it did not fix my issue.
I hope this helps someone.

Solr More Like This (MLT) not returning results

I'm currently looking to implement More Like This functionality based on a number of fields in my index.
My current configuration is as follows:
Haystack | PySolr | Solr
For this piece I'm using PySolr and passing the parameters to the more_like_this function. The response finds the document but not any related results. Why is that?
Here is the URL I hit:
http://localhost:8080/solr/mlt?q=django_id:12123412&mlt.fl=industry_ids,loc_state,amount,sector_id&mlt.interestingTerms=details
Here is my response from Solr:
<response>
<object type="{XXXXXX-0F1D-4F28-AAA2-XXXXXXXXXXX}" cotype="cs" id="cosymantecbfw" style="width: 0px; height: 0px; display: block;"/>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">24</int>
</lst>
<result name="match" numFound="1" start="0">
<doc>...</doc>
</result>
<result name="response" numFound="0" start="0"/>
<lst name="interestingTerms"/>
</response>
solrconfig.xml
<!-- More Like This -->
<requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
</requestHandler>
schema.xml
<field name="award_amount" type="sfloat" indexed="true" stored="true" multiValued="false" termVectors="true" />
<field name="estatus" type="slong" indexed="true" stored="true" multiValued="false" termVectors="true"/>
<field name="loc_state" type="string" indexed="true" stored="true" multiValued="false" termVectors="true"/>
<field name="orgtype_id" type="string" indexed="true" stored="true" multiValued="false" termVectors="true" />
<field name="sector_id" type="string" indexed="true" stored="true" multiValued="false" termVectors="true"/>
<field name="industry_ids" type="string" indexed="true" stored="true" multiValued="true" termVectors="true" />
<field name="award_amount_exact" type="sfloat" indexed="true" stored="true" multiValued="false" termVectors="true" />
<field name="sector_id_exact" type="string" indexed="true" stored="true" multiValued="false" termVectors="true"/>
<field name="amount_exact" type="sfloat" indexed="true" stored="true" multiValued="false" termVectors="true"/>
Any help would be appreciated!

Your text fields must have the type text, which analyzes them into searchable terms. string fields are indexed and matched verbatim as a single value, not broken into terms, which makes them of little use for MLT.
See copy fields if you ever want to store the same data as both text and string (for example, for faceting).
I see you also intend to find numbers closest to your query. MLT is not the right tool for that; you want to compose a function query instead. See "SolR : More Like This on number fields".
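The copy-field suggestion could look roughly like the sketch below. The field name loc_state_text is an assumption for illustration and is not in the question's schema; the type text is the one named in the answer.

```xml
<!-- schema.xml: keep the string field for exact matching and faceting -->
<field name="loc_state" type="string" indexed="true" stored="true" multiValued="false" termVectors="true"/>
<!-- hypothetical analyzed twin of loc_state for MLT to draw terms from -->
<field name="loc_state_text" type="text" indexed="true" stored="true" multiValued="false" termVectors="true"/>
<!-- fill the analyzed field from the string field at index time -->
<copyField source="loc_state" dest="loc_state_text"/>
```

mlt.fl would then name loc_state_text rather than loc_state. With a small index, it may also be worth lowering mlt.mintf and mlt.mindf on the request; their defaults (2 and 5, as far as I know) can filter out every term from short fields.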

Tika Solr Metadata mapping ignore document title

I have the following config file for solr:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<!-- All the main content goes into "text"... if you need to return
the extracted text or do highlighting, use a stored field. -->
<str name="lowernames">true</str>
<str name="fmap.content">content</str>
<str name="fmap.application_name">type</str>
<str name="fmap.content_type">mime</str>
<str name="fmap.stream_size">size</str>
<str name="uprefix">ignored_</str>
<str name="captureAttr">false</str>
</lst>
</requestHandler>
and this is my schema:
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="access_type" type="string" indexed="true" stored="false"/>
<field name="access_restriction" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="title" type="string" indexed="true" stored="true" multiValued="true" />
<field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="content" type="text_en_splitting" indexed="true" stored="true"/>
<field name="created" type="date" indexed="true" stored="true"/>
<field name="createdby" type="string" indexed="true" stored="true"/>
<field name="modified" type="date" indexed="true" stored="true"/>
<field name="modifiedby" type="string" indexed="true" stored="true"/>
<field name="source" type="string" indexed="true" stored="true" />
<field name="version" type="string" indexed="true" stored="true" />
<field name="resourcelink" type="string" indexed="true" stored="true" />
<field name="downloadlink" type="string" indexed="true" stored="true" />
<field name="type" type="string" indexed="true" stored="true" />
<field name="mime" type="string" indexed="true" stored="true" />
<field name="size" type="string" indexed="true" stored="true" />
I want to set the title myself. But Tika keeps setting its own title (that's why I set multiValued="true" temporarily), which I find strange because I have to manually map things like stream_size and content_type.
Is there a solution to this issue?
I'd like Tika to override the title I assign, like this:
I have 3 documents. For one of them, Tika doesn't extract a title, so in that case I set my own title by passing literal.title. When Tika does extract a title, I want it to override the one I passed in literal.title. Is this possible?

I was working on the same issue some time ago, but I hit a wall as well :(
I let Tika take "title" and use literal.other_title_like_field to store the proper title.
This is not the best solution, but it worked for me.

For those who are still struggling with this problem, I solved it by adding
<str name="fmap.title">ignored_</str>
in my ExtractingRequestHandler defaults.
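In context, the handler from the question would then look something like this sketch. Only the fmap.title line is new. It assumes the schema has a catch-all dynamic field such as ignored_* (the question's uprefix setting suggests it does, but the schema excerpt does not show it).

```xml
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.content">content</str>
    <str name="fmap.application_name">type</str>
    <str name="fmap.content_type">mime</str>
    <str name="fmap.stream_size">size</str>
    <!-- route the title Tika extracts to the ignored_ catch-all instead of "title" -->
    <str name="fmap.title">ignored_</str>
    <str name="uprefix">ignored_</str>
    <str name="captureAttr">false</str>
  </lst>
</requestHandler>
```

Whether a title passed as literal.title is also caught by this mapping may depend on the Solr version, so it is worth verifying with a test document.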
