Automatic language detection in Solr 4.5.1 at indexing time

I need your help.
I want to detect Korean and English at indexing time in Solr.
My Solr directory structure is:
/opt/tmocat7/webapps/solr (solr webapp)
/usr/share/solr/collection1 (solr core)
/usr/share/solr/lib/langid (lib for langid)
First, I copied the required libraries (jsonic-1.2.7.jar, langdetect-1.1-20120112.jar, solr-langid-4.5.1.jar) into /usr/share/solr/lib/langid, under the directory where my Solr home is located.
My solrconfig.xml is:
<lib dir="../lib/langid/" regex=".*\.jar" />
<requestHandler name="/update" class="solr.UpdateRequestHandler">
<lst name="defaults">
<str name="update.chain">dedupe</str>
<str name="update.chain">uuid</str>
<str name="update.chain">langid</str>
</lst>
</requestHandler>
<updateRequestProcessorChain name="langid">
<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
<bool name="langid">true</bool>
<str name="langid.fl">title,content,comment</str>
<str name="langid.langField">lang</str>
<str name="langid.langsField">langs</str>
<str name="langid.lcmap">ko:ko kor:ko en_GB:en en_US:en</str>
<str name="langid.whitelist">ko,en</str>
<bool name="langid.map">true</bool>
<str name="langid.map.fl">title,content,comment</str>
<bool name="langid.map.keepOrig">true</bool>
<bool name="langid.map.individual">true</bool>
<str name="langid.fallback">ko</str>
<str name="langid.map.lcmap">ko:ko kor:ko en_GB:en en_US:en</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
and schema.xml is
<field name="lang" type="string" indexed="true" stored="true" multiValued="false" />
<field name="langs" type="string" indexed="true" stored="true" multiValued="true" />
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="title" type="text_ko" indexed="true" stored="true" multiValued="true"/>
<field name="content" type="text_ko" indexed="true" stored="true" multiValued="true"/>
<field name="comment" type="text_ko" indexed="true" stored="true" multiValued="true" />
<field name="site" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="page" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="fileloc" type="text_general" indexed="true" stored="true"
multiValued="false"/>
<field name="filename" type="text_general" indexed="true" stored="true"
multiValued="false" />
<field name="storeddate" type="date" indexed="true" stored="true" multiValued="false"/>
<!-- for english web data-->
<field name="title_en" type="text_en" indexed="true" stored="true" multiValued="true" />
<field name="content_en" type="text_en" indexed="true" stored="true" multiValued="true" />
<field name="comment_en" type="text_en" indexed="true" stored="true" multiValued="true" />
<field name="title_ko" type="text_ko" indexed="true" stored="true" multiValued="true"/>
<field name="content_ko" type="text_ko" indexed="true" stored="true" multiValued="true"/>
<field name="comment_ko" type="text_ko" indexed="true" stored="true" multiValued="true" />
<copyField source="title" dest="title_en"/>
<copyField source="content" dest="content_en"/>
<copyField source="comment" dest="comment_en"/>
<copyField source="title" dest="title_ko"/>
<copyField source="content" dest="content_ko"/>
<copyField source="comment" dest="comment_ko"/>
I have read some books and searched the web for information about language detection in Solr, but the language is still not detected.
What am I doing wrong?
For more information, here are my post.sh and the log.
post.sh:
#!/bin/sh
FILES=$*
URL=http://localhost:port/solr/collection1/update
for f in $FILES; do
  echo Posting file $f to $URL
  # @file tells curl to send the file contents as the request body
  curl $URL --data-binary @$f -H 'Content-type:application/xml'
  echo
done
# send the commit command to make sure all the changes are flushed and visible
curl $URL --data-binary '<commit/>' -H 'Content-type:application/xml'
echo
Part of the Tomcat log during indexing:
70634079 [http-bio-7070-exec-38] TRACE org.apache.solr.handler.UpdateRequestHandler – body
70634079 [http-bio-7070-exec-38] DEBUG org.apache.solr.update.processor.LogUpdateProcessor – PRE_UPDATE add{,id=2f2323f4f7966e0d} {{params({params(),defaults(update.chain=dedupe&update.chain=uuid&update.chain=langid)}),defaults(wt=xml)}}
70634125 [http-bio-7070-exec-38] TRACE org.apache.solr.update.UpdateLog – TLOG: added id 2f2323f4f7966e0d to tlog{file=/usr/share/solr/collection1/data/tlog/tlog.0000000000000000129 refcount=1} LogPtr(29407) map=614254179
70634125 [http-bio-7070-exec-38] DEBUG org.apache.solr.update.processor.LogUpdateProcessor – PRE_UPDATE FINISH {{params({params(),defaults(update.chain=dedupe&update.chain=uuid&update.chain=langid)}),defaults(wt=xml)}}
70634126 [http-bio-7070-exec-38] INFO org.apache.solr.update.processor.LogUpdateProcessor – [collection1] webapp=/solr path=/update params={} {add=[2f2323f4f7966e0d (1473490520171872256)]} 0 68
70634146 [http-bio-7070-exec-33] TRACE org.apache.solr.handler.UpdateRequestHandler – body
70634146 [http-bio-7070-exec-33] DEBUG org.apache.solr.update.processor.LogUpdateProcessor – PRE_UPDATE add{,id=329ee20831e1a0c7} {{params({params(),defaults(update.chain=dedupe&update.chain=uuid&update.chain=langid)}),defaults(wt=xml)}}
70634148 [http-bio-7070-exec-33] TRACE org.apache.solr.update.UpdateLog – TLOG: added id 329ee20831e1a0c7 to tlog{file=/usr/share/solr/collection1/data/tlog/tlog.0000000000000000129 refcount=1} LogPtr(46005) map=614254179
70634148 [http-bio-7070-exec-33] DEBUG org.apache.solr.update.processor.LogUpdateProcessor – PRE_UPDATE FINISH {{params({params(),defaults(update.chain=dedupe&update.chain=uuid&update.chain=langid)}),defaults(wt=xml)}}
70634148 [http-bio-7070-exec-33] INFO org.apache.solr.update.processor.LogUpdateProcessor – [collection1] webapp=/solr path=/update params={} {add=[329ee20831e1a0c7 (1473490520241078272)]} 0 2
I can't find any other warnings or errors.
I would appreciate your advice.
Thanks, all.

I think you should use /update/extract instead of /update.
In Solr 5.3.1, it works fine when I use /update/extract.
Here's the config:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<!-- capture link hrefs but ignore div attributes -->
<str name="captureAttr">true</str>
<str name="fmap.a">links</str>
<str name="fmap.div">ignored_</str>
<str name="update.chain">langid</str>
</lst>
</requestHandler>
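The config above sets a single update.chain default, langid. The same idea could be tried on the plain /update handler from the question. This is only a sketch, assuming that Solr picks one update.chain value per request: the question lists three, and the log echoes all three as defaults, so langid may never be the chain that runs. The dedupe and uuid chains are not shown in the question, so how to fold them in is left open here.

```xml
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <!-- exactly one chain name; any dedupe/uuid processors that are still needed
         would have to be added as <processor> entries inside the langid chain -->
    <str name="update.chain">langid</str>
  </lst>
</requestHandler>
```

If this is the cause, the detected language should appear in the lang field of newly indexed documents after a reload and re-post.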

Thanks for the question and the great answers; they helped me configure my system appropriately. I don't know how the JAR file solr-langdetect.*.*.*.jar ended up in my lib directory, but each time I started Solr it showed the following error:
org.apache.solr.common.SolrException: com.cybozu.labs.langdetect.DetectorFactory.loadProfile(Ljava/util/List;)V
After removing that JAR file everything worked fine. The other three JAR files mentioned in the question (jsonic-*.*.*.jar, langdetect-*.*.jar, solr-langid-*.*.*.jar) are however required.

Related

Solr Deduplication (dedupe) is not working, getting error while updating document

I followed the example in the documentation below:
https://solr.apache.org/guide/8_4/de-duplication.html
My requirement is to ignore duplicate records, but after implementing dedupe I cannot add any document (even a unique one) and always get the same error:
Exception in thread "main" org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/my_core: Document contains multiple values for uniqueKey field: id=[0011, affa84b255f98fd800dd0056b7040855]
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:681)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:266)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:214)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:177)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:138)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:156)
solrconfig.xml :
<updateRequestProcessorChain name="dedupe">
<processor class="solr.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<str name="signatureField">id</str>
<str name="fields">first_name,last_name,phone_no</str>
<bool name="overwriteDupes">false</bool>
<str name="signatureClass">solr.processor.TextProfileSignature</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<requestHandler name="/update" class="solr.UpdateRequestHandler" >
<lst name="defaults">
<str name="update.chain">dedupe</str>
</lst>
</requestHandler>
schema.xml :
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="dummydata" version="1.5">
<field name="first_name" type="string" indexed="true" stored="true" multiValued="false" />
<field name="last_name" type="string" indexed="true" stored="true" multiValued="false" />
<field name="location" type="string" indexed="true" stored="true" multiValued="false" />
<field name="phone_no" type="string" indexed="true" stored="true" multiValued="false" />
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<uniqueKey>id</uniqueKey>
</schema>
Java code used :
{
String urlString = "http://localhost:8983/solr/my_core";
SolrClient Solr = new HttpSolrClient.Builder(urlString).build();
UpdateResponse response;
SolrInputDocument myDocumentInstantlycommited = new SolrInputDocument();
myDocumentInstantlycommited.addField("id", "0011");
myDocumentInstantlycommited.addField("first_name", "T11");
myDocumentInstantlycommited.addField("last_name","L11");
myDocumentInstantlycommited.addField("phone_no","9912121312");
myDocumentInstantlycommited.addField("location","TESt211");
response=Solr.add( myDocumentInstantlycommited);
Solr.commit();
Solr.close();
System.out.println("Documents Updated");
}

Solr: how to index file content into multiple fields?

Solr version:
7.3.0
I want to index files and store the extracted text in multiple fields (a word-split field and a bi-gram field) for search flexibility.
I wrote the configset below, but it does not work: Solr indexes only into content_text or content_text_bi (whichever fmap.content entry is defined first).
solrconfig.xml
...
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<str name="fmap.meta">ignored_</str>
<str name="fmap.content">content_text</str>
<str name="fmap.content">content_text_bi</str>
<str name="captureAttr">true</str>
</lst>
</requestHandler>
...
schema.xml
...
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<!-- docValues are enabled by default for long type so we don't need to index the version field -->
<field name="_version_" type="plong" indexed="false" stored="false"/>
<field name="_root_" type="string" indexed="true" stored="false" docValues="false" />
<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="content_text" type="text_ja" indexed="true" stored="true" storeOffsetsWithPositions="false"/>
<field name="content_text_bi" type="text_ja_bi" indexed="true" stored="true" storeOffsetsWithPositions="false"/>
<field name="filepath" type="string" indexed="true" stored="true" />
<field name="filename" type="string" indexed="true" stored="true" />
<field name="storage_id" type="pint" indexed="true" stored="true" />
...
How can I make it work as I want?
I solved it by using copyField in schema.xml.
1. Add this line to schema.xml:
<copyField source="content_text" dest="content_text_bi" />
2. Remove this line from solrconfig.xml:
<str name="fmap.content">content_text_bi</str>
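Put together, the two steps might leave the relevant parts of the configuration looking roughly like this. It is a sketch assembled only from the snippets in the question; nothing else is assumed to change.

```xml
<!-- solrconfig.xml: map the extracted content to a single field only -->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <str name="fmap.meta">ignored_</str>
    <str name="fmap.content">content_text</str>
    <str name="captureAttr">true</str>
  </lst>
</requestHandler>

<!-- schema.xml: both target fields, plus the copy from one to the other -->
<field name="content_text" type="text_ja" indexed="true" stored="true" storeOffsetsWithPositions="false"/>
<field name="content_text_bi" type="text_ja_bi" indexed="true" stored="true" storeOffsetsWithPositions="false"/>
<copyField source="content_text" dest="content_text_bi"/>
```

copyField copies the raw field value before analysis, so each field applies its own analyzer (text_ja or text_ja_bi) to the same extracted text.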

Migrating from Solr to Solr yields no results when trying to fill a new field in the source Solr core

I have a target Solr 5.5.0 (local) and a source Solr 4.10.2 (local). After the migration process there are no rows in the target Solr. Does anyone know what I am doing wrong?
There are 2 cores, core1 (source) and core2 (target). All fields in both are identical, but I have a new field (stored, indexed) in the source core that must be filled by copyField.
Here's data-config.xml (stored fields only):
<dataConfig>
<document>
<entity name="oldRow" processor="SolrEntityProcessor" url="http://localhost:8984/solr/core1" query="*:*">
<field column="_version_" name="_version_" indexed="true" stored="true"/>
<field column="id" name="id" type="text_general" indexed="true" stored="true"/>
<field column="type" name="type" type="text_general" indexed="true" stored="true"/>
<field column="field2" name="field2" type="text_general" indexed="true" stored="true"/>
<field column="field3" name="field3" type="text_general" indexed="true" stored="true"/>
...
</entity>
</document>
</dataConfig>
Here are the schemas (the same for both cores):
...
<fields>
<field name="_version_" type="long" indexed="true" stored="true" required="true"/>
<field name="field1" type="text_general" indexed="true" stored="true"/>
<field name="field2" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="field3" type="text_general" indexed="true" stored="true" />
...
</fields>
<!-- unique key-->
<uniqueKey>uid</uniqueKey>
...
and unique key consists of 2 fields:
<updateRequestProcessorChain name="dedupe">
<processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<bool name="overwriteDupes">true</bool>
<str name="signatureField">uid</str>
<str name="fields">id,type</str>
<str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
Import Handler:
<updateRequestProcessorChain name="skip-fields">
<processor class="solr.IgnoreFieldUpdateProcessorFactory">
<str name="fieldRegex">_version_</str>
</processor>
</updateRequestProcessorChain>
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
<str name="update.chain">skip-fields</str>
</lst>
</requestHandler>
And the log after the import (trying to import 1 row):
2016-03-10 05:47:41.552 INFO (qtp859417998-21) [ x:core2] o.a.s.h.d.DataImporter Loading DIH Configuration: data-config.xml
2016-03-10 05:47:41.580 INFO (qtp859417998-21) [ x:core2] o.a.s.h.d.DataImporter Data Configuration loaded successfully
2016-03-10 05:47:41.581 INFO (qtp859417998-21) [ x:core2] o.a.s.h.d.DataImporter Starting Full Import
2016-03-10 05:47:41.598 INFO (qtp859417998-21) [ x:core2] o.a.s.h.d.SimplePropertiesWriter Read dataimport.properties
2016-03-10 05:47:41.683 INFO (qtp859417998-21) [ x:core2] o.a.s.h.d.SolrEntityProcessor using BinaryResponseParser
2016-03-10 05:47:41.821 INFO (qtp859417998-21) [ x:core2] o.a.s.h.d.DocBuilder Indexing stopped at docCount = 1
2016-03-10 05:47:41.822 INFO (qtp859417998-21) [ x:core2] o.a.s.h.d.DocBuilder Import completed successfully
2016-03-10 05:47:41.823 INFO (qtp859417998-21) [ x:core2] o.a.s.h.d.SimplePropertiesWriter Read dataimport.properties
2016-03-10 05:47:41.827 INFO (qtp859417998-21) [ x:core2] o.a.s.h.d.SimplePropertiesWriter Wrote last indexed time to dataimport.properties
2016-03-10 05:47:41.828 INFO (qtp859417998-21) [ x:core2] o.a.s.h.d.DocBuilder Time taken = 0:0:0.229
Please help!
I had the same problem. You have to rename the field "_version_" to something else so that it is ignored. "_version_" is an internal field used for concurrency control (locking/updating). You can't update a document with a "_version_" value that differs from the one already stored. So drop the field by renaming it in the field tag of the data config:
<field column="_version_" name="_old_version_" />
Peter
<dataConfig>
<document>
<entity name="oldRow" processor="SolrEntityProcessor" url="http://localhost:8984/solr/core1" query="*:*">
<field column="_version_" name="_old_version_" />
...
</entity>
</document>
</dataConfig>

Solr MoreLikeThis and using Boost Functions (Boost recent Items)

I have a question similar to "Boost recent item in MoreLikeThis Solr request handler".
I would like to boost recent items returned from the MoreLikeThis handler or component.
I found out that bf isn't supported by MoreLikeThisHandler, as it is a DisMax parameter.
Therefore I tried the following (in my solrconfig.xml):
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="df">id</str>
<str name="mlt">true</str>
<str name="mlt.count">10</str>
<str name="mlt.fl">project,type,summary,description,environment,fixfor,component</str>
<str name="mlt.mintf">1</str>
<str name="mlt.mindf">2</str>
<str name="mlt.boost">true</str>
<str name="rows">20</str>
<str name="fl">id,key,project,summary,reporter,assignee,updated,score</str>
<str name="bf">ms(NOW/HOUR,updated)</str>
</lst>
<!--<arr name="components">
<str>mlt</str>
</arr>-->
with
<field name="id" type="long" indexed="true" stored="true" required="true" multiValued="false" termVectors="true"/><!-- is termVector by long needed? -->
...
<field name="key" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
...
<field name="description" type="text_general" indexed="true" stored="false" required="true" multiValued="false" termVectors="true"/>
...
<field name="updated" type="date" indexed="true" stored="true" required="false" multiValued="false"/>
MLT boost does not seem to be supported.
You could check the MLT sort patch, SOLR-1545.

Solr More Like This (MLT) not returning results

I'm currently looking to implement More Like This functionality based on a number of fields in my index.
My current configuration is as follows:
Haystack | PySolr | Solr
For this piece I'm using PySolr and passing the parameters to the more_like_this function. The response finds the document but not any related results. Why is that?
Here is the URL I hit:
http://localhost:8080/solr/mlt?q=django_id:12123412&mlt.fl=industry_ids,loc_state,amount,sector_id&mlt.interestingTerms=details
Here is my response from Solr:
<response>
<object type="{XXXXXX-0F1D-4F28-AAA2-XXXXXXXXXXX}" cotype="cs" id="cosymantecbfw" style="width: 0px; height: 0px; display: block;"/>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">24</int>
</lst>
<result name="match" numFound="1" start="0">
<doc>...</doc>
</result>
<result name="response" numFound="0" start="0"/>
<lst name="interestingTerms"/>
</response>
solrconfig.xml
<!-- More Like This -->
<requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
</requestHandler>
schema.xml
<field name="award_amount" type="sfloat" indexed="true" stored="true" multiValued="false" termVectors="true" />
<field name="estatus" type="slong" indexed="true" stored="true" multiValued="false" termVectors="true"/>
<field name="loc_state" type="string" indexed="true" stored="true" multiValued="false" termVectors="true"/>
<field name="orgtype_id" type="string" indexed="true" stored="true" multiValued="false" termVectors="true" />
<field name="sector_id" type="string" indexed="true" stored="true" multiValued="false" termVectors="true"/>
<field name="industry_ids" type="string" indexed="true" stored="true" multiValued="true" termVectors="true" />
<field name="award_amount_exact" type="sfloat" indexed="true" stored="true" multiValued="false" termVectors="true" />
<field name="sector_id_exact" type="string" indexed="true" stored="true" multiValued="false" termVectors="true"/>
<field name="amount_exact" type="sfloat" indexed="true" stored="true" multiValued="false" termVectors="true"/>
Any help would be appreciated!
Your text fields must have a text type, which analyzes them to make them searchable term by term. String fields are indexed and queried verbatim, as a single value with no analysis, which makes them of little use for MLT.
See copy fields if you ever want to store the same data as both text and string (for example, for faceting).
I see you also intend to find numbers closest to your query. MLT is not the right tool for that; you want to compose a function query instead. See "SolR : More Like This on number fields".
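The copy-field advice could be sketched like this in schema.xml. The field name loc_state_text is hypothetical, and the sketch assumes a text_general field type is defined in the schema (the question's excerpt does not show one).

```xml
<!-- original string field from the question, kept as-is (e.g. for faceting) -->
<field name="loc_state" type="string" indexed="true" stored="true" multiValued="false" termVectors="true"/>

<!-- hypothetical analyzed copy, intended for use in mlt.fl -->
<field name="loc_state_text" type="text_general" indexed="true" stored="true" multiValued="false" termVectors="true"/>

<!-- fill the text copy from the string field at index time -->
<copyField source="loc_state" dest="loc_state_text"/>
```

After reindexing, the MLT request would list loc_state_text in mlt.fl instead of loc_state.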

Resources