Can not use ICUTokenizerFactory in Solr - solr

I am trying to use ICUTokenizerFactory in Solr schema. This is how I have defined field and fieldType.
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.ICUTokenizerFactory"/>
</analyzer>
</fieldType>
<field name="fld_icu" type="text_icu" indexed="true" stored="true"/>
And, when I start Solr, I am get this error
Plugin init failure for [schema.xml] fieldType "text_icu": Plugin init failure for [schema.xml] analyzer/tokenizer: Error loading class 'solr.ICUTokenizerFactory'
I have searched in for that with no success. I don't know if I am missing something or there is some problem in schema.
If someone has tried ICUTokenizerFactory then please suggest what could be the problem.

Add this at the top of your solrconfig.xml:
<config>
<lib dir="${user.dir}/../contrib/analysis-extras/lucene-libs/" />
<lib dir="${user.dir}/../contrib/analysis-extras/lib/" />
This assumes that you are running from example directory with solr.solr.home set to your instance. Otherwise, just use absolute path to your Solr installation.
You can also copy all those jars into lib directory (under your core, not solr home). But the above is an easier way.

From the Wiki:
Lucene provides support for segmenting these languages into syllables with solr.ICUTokenizerFactory in the analysis-extras contrib module. To use this tokenizer, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib

Related

Can not apply patch LUCENE-2899.patch to SOLR on Windows

I am trying to apply patch LUCENE-2899.patch to Solr.
I have done this:
Cloned solr from official repo (I am on master branch)
Downloaded and installed ant and GNU patch, i get it here http://gnuwin32.sourceforge.net/packages/patch.htm
Put Ant and GNU patch to PATH env var.
And I got this...
```
D:\utils\solr_master\lucene-solr>patch -p1 -i LUCENE-2899.patch --dry-run
patching file dev-tools/idea/.idea/ant.xml
Assertion failed: hunk, file ../patch-2.5.9-src/patch.c, line 354
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
```
UPDATE 1
I am trying to compile, but build failed.
D:\utils\solr_master\lucene-solr>ant compile
Buildfile: D:\utils\solr_master\lucene-solr\build.xml
BUILD FAILED
D:\utils\solr_master\lucene-solr\build.xml:21: The following error occurred while executing this line:
D:\utils\solr_master\lucene-solr\lucene\common-build.xml:623: java.lang.NullPointerException
at java.util.Arrays.stream(Arrays.java:5004)
at java.util.stream.Stream.of(Stream.java:1000)
at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:545)
at java.util.stream.AbstractPipeline.evaluateToArrayNode(AbstractPipeline.java:260)
at java.util.stream.ReferencePipeline.toArray(ReferencePipeline.java:438)
at org.apache.tools.ant.util.ChainedMapper.lambda$mapFileName$1(ChainedMapper.java:36)
at java.util.stream.ReduceOps$1ReducingSink.accept(ReduceOps.java:80)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.reduce(ReferencePipeline.java:484)
at org.apache.tools.ant.util.ChainedMapper.mapFileName(ChainedMapper.java:35)
at org.apache.tools.ant.util.CompositeMapper.lambda$mapFileName$0(CompositeMapper.java:32)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:545)
at java.util.stream.AbstractPipeline.evaluateToArrayNode(AbstractPipeline.java:260)
at java.util.stream.ReferencePipeline.toArray(ReferencePipeline.java:438)
at org.apache.tools.ant.util.CompositeMapper.mapFileName(CompositeMapper.java:33)
at org.apache.tools.ant.taskdefs.PathConvert.execute(PathConvert.java:363)
at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:292)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
at org.apache.tools.ant.Task.perform(Task.java:346)
at org.apache.tools.ant.Target.execute(Target.java:448)
at org.apache.tools.ant.helper.ProjectHelper2.parse(ProjectHelper2.java:172)
at org.apache.tools.ant.taskdefs.ImportTask.importResource(ImportTask.java:221)
at org.apache.tools.ant.taskdefs.ImportTask.execute(ImportTask.java:165)
at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:292)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
at org.apache.tools.ant.Task.perform(Task.java:346)
at org.apache.tools.ant.Target.execute(Target.java:448)
at org.apache.tools.ant.helper.ProjectHelper2.parse(ProjectHelper2.java:183)
at org.apache.tools.ant.ProjectHelper.configureProject(ProjectHelper.java:93)
at org.apache.tools.ant.Main.runBuild(Main.java:824)
at org.apache.tools.ant.Main.startAnt(Main.java:228)
at org.apache.tools.ant.launch.Launcher.run(Launcher.java:283)
at org.apache.tools.ant.launch.Launcher.main(Launcher.java:101)
Total time: 0 seconds
UPDATE 2
I have downloaded Solr from
https://builds.apache.org/job/Solr-Artifacts-7.3/lastSuccessfulBuild/artifact/solr/package/ and https://builds.apache.org/job/Solr-Artifacts-master/lastSuccessfulBuild/artifact/solr/package/
but neither for 7.3 version nor for 8.0(master) version I don't see opennlp dir in contrib dir. Where can I find it?
UPDATE 3
I have run version from master branch witch I have downloaded here https://builds.apache.org/job/Solr-Artifacts-master/lastSuccessfulBuild/artifact/solr/package/ and I have trying to run OpenNLP like gentleman in this post:
Exception while integrating openNLP with Solr
But I have the same error as he.
numberplate_shard1_replica_n1:
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: >Could not load conf for core numberplate_shard1_replica_n1: Can't load schema >managed-schema: Plugin init failure for [schema.xml] fieldType >"text_opennlp_nvf": Plugin init failure for [schema.xml] analyzer/tokenizer: >Error instantiating class: 'org.apache.lucene.analysis.opennlp.OpenNLPTokenizerFactory'
If patch LUCENE-2899 is merged into master why I have this error?
UPDATE 5
I have restarted solr and errors were gone. But...
I was trying to add fields ( to managed-schema ) to form example ( https://wiki.apache.org/solr/OpenNLP ) :
<fieldType name="text_opennlp" class="solr.TextField">
<analyzer>
<tokenizer class="solr.OpenNLPTokenizerFactory"
sentenceModel="opennlp/en-sent.bin"
tokenizerModel="opennlp/en-token.bin"
/>
</analyzer>
</fieldType>
<field name="content" type="text_opennlp" indexed="true" termOffsets="true" stored="true" termPayloads="true" termPositions="true" docValues="false" termVectors="true" multiValued="true" required="true"/>
But when I try to run Solr in Cloud mode I got this:
D:\utils\solr-7.3.0-7\solr-7.3.0-7\bin>solr -e cloud
Welcome to the SolrCloud example!
This interactive session will help you launch a SolrCloud cluster on your local workstation.
To begin, how many Solr nodes would you like to run in your local cluster? (specify 1-4 nodes) [2]:
1
Ok, let's start up 1 Solr nodes for your example SolrCloud cluster.
Please enter the port for node1 [8983]:
Solr home directory D:\utils\solr-7.3.0-7\solr-7.3.0-7\example\cloud\node1\solr already exists.
Starting up Solr on port 8983 using command:
"D:\utils\solr-7.3.0-7\solr-7.3.0-7\bin\solr.cmd" start -cloud -p 8983 -s "D:\utils\solr-7.3.0-7\solr-7.3.0-7\example\cloud\node1\solr"
Waiting up to 30 to see Solr running on port 8983
Started Solr server on port 8983. Happy searching!
INFO - 2018-03-26 14:42:26.961; org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider; Cluster at localhost:9983 ready
Now let's create a new collection for indexing documents in your 1-node cluster.
Please provide a name for your new collection: [gettingstarted]
numberplate
Collection 'numberplate' already exists!
Do you want to re-use the existing collection or create a new one? Enter 1 to reuse, 2 to create new [1]:
1
Enabling auto soft-commits with maxTime 3 secs using the Config API
POSTing request to Config API: http://localhost:8983/solr/numberplate/config
{"set-property":{"updateHandler.autoSoftCommit.maxTime":"3000"}}
ERROR: Error from server at http://localhost:8983/solr: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/numberplate/config. Reason:
<pre> Not Found</pre></p>
</body>
</html>
SolrCloud example running, please visit: http://localhost:8983/solr
D:\utils\solr-7.3.0-7\solr-7.3.0-7\bin>
UPDATE 6
I have created new collection and I get more precise error:
test_collection_shard1_replica_n1: > org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: > Could not load conf for core test_collection_shard1_replica_n1: Can't load > schema managed-schema: org.apache.solr.core.SolrResourceNotFoundException: > Can't find resource 'opennlp/en-sent.bin' in classpath or '/configs/_default', > cwd=D:\utils\solr-7.3.0-7\solr-7.3.0-7\server
Please check your logs for more information
Maybe I need to copy somewhere OpenNLP models http://opennlp.sourceforge.net/models-1.5/
But where can I put this models?
Can you help me? What I do wrong?
As you can see on LUCENE-2899, the patch is already applied to 8.0 (master), as well as 7.3.
You can find pre-built nightlies at Solr-Artifacts-master for (currently) 8.0 and at Solr-Artifacts-7.3 for 7.3.
The opennlp libraries are bundled inside the artifacts:
solr-8.0.0-3304 find . -name '*nlp*'
[...]
./contrib/langid/lib/opennlp-tools-1.8.3.jar
./contrib/analysis-extras/lib/opennlp-maxent-3.0.3.jar
./contrib/analysis-extras/lib/opennlp-tools-1.8.3.jar
./contrib/analysis-extras/lucene-libs/lucene-analyzers-opennlp-8.0.0-3304.jar
You then have to tell Solr to load these jars, which you can do through solrconfig.xml.
<lib dir="../../../contrib/analysis-extras/lib/" regex="opennlp-.*\.jar" />
<lib dir="../../../contrib/analysis-extras/lucene-libs/lucene-analyzers-opennlp-.*\.jar" regex=".*\.jar" />
Confirm that the jars are loaded as you expect in Solr's log file.

Errors while trying to configure Solr 5.3.1 on Windows 10

I'm trying to setup a very basic configuration of Solr, to read some text from a mysql table and index it. I'm following the steps in DIH Quick Start document.
The document doesn't tell you where to place solrconfig.xml.
At first I tried placing it under the solr5.3.1 folder (next to bin). That failed. Then I noticed the "add core" button was looking for it in server\solr\new_core. So I put it there, but then got this other error:
My data import handler looks like this:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
And here's data-config.xml:
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost/ctcrets"
user="root"
password="xxxx"/>
<document>
<entity name="id"
query="select RETS_STAGE1_QUEUE_ID as id, LN_LIST_NUMBER as name, xmlText as desc from RETS_STAGE1_QUEUE">
</entity>
</document>
</dataConfig>
What could be the problem?
The document assumes you already know the solr.home [1] directory structure. On top of that, I think it assumes you started the sample Solr instance (e.g. ./solr start -p 8984) where everything should be already set.
Once started you can see on the dashboard where the configuration is exactly located. Go there, change the files as suggested and RELOAD the core through the admin console (CoreAdmin). If you want you can also do a stop / restart.
As side notes:
the DIH is not part of the Solr core, so you should put some "lib" directive within the solrconfig.xml, as far as I remember, the sample config already has those directives so you don't need to "import" the DIH lib
the JDBC driver that allows the connection with the database is not included so your classpath (i.e. JVM or Solr classpath - through the same lib directive) must include this additional lib(s).
[1] http://www.solrtutorial.com/configuring-solr.html

SOLR Field not reflected in schema browser

I created a solr core using bin/solr -c core1 and then copied the schema.xml file from basic config set to core1/conf folder and added a field
<field name="title" type="text" indexed="true" stored="true"/>.
But this field is not reflected in schema browser.
What configurations should I make to get the new fields reflected in schema browser in solr admin ui?
I am using solr 5.3.1
By default when you create a solr core it will use managed schema. You will see the following configuration in solrconfig.xml after core is created.
<schemaFactory class="ManagedIndexSchemaFactory">
<bool name="mutable">true</bool>
<str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>
Above this configuration you will find the comments on how use managed-schema. Comment this out and uncomment the following to use schema.xml
<schemaFactory class="ClassicIndexSchemaFactory"/>
You need to reload the core: go to http://yourhost:8983/solr/#/~cores/core1 and press "Reload" button.

I am not able analyze WhitespaceAnalyzer in schema.xml

When I add analyzer in schema.xml
<fieldType name="nametext" class="solr.TextField">
<analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
</fieldType>
And reload the core in solr 5.0.0, I get following error:
testcore: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
Could not load conf for core testcore: Plugin init failure for [schema.xml]
fieldType "nametext": Cannot load analyzer:
org.apache.lucene.analysis.WhitespaceAnalyzer. Schema file is E:\files\future\solr-5.0.0\server\solr\testcore\conf\schema.xml
What am I missing?
In Solr 5.0.0 this class is in the org.apache.lucene.analysis.core package, so the class should be org.apache.lucene.analysis.core.WhitespaceAnalyzer - see documentation: http://lucene.apache.org/core/5_0_0/analyzers-common/index.html

Configure DataImportHandler in SolrCloud with ZooKeeper

I have a SolrCloud configured like this: exploration of SolrCloud, the difference is that I use Solr 4.0.0 Beta. Shortly the configuration:
ZooKeeper on default port 2181
3 instances of Solr running on different ports
This is just for testing purpose. The desired configuration is with 3 ZooKeeper instances (one for every Solr instance). I manage to index some XML files with curl command.
Questions:
How can I configure DIH/collection? I managed to change the solrconfig.xml (config for dataimport-handler), add in lib the proper driver for DB connection, but in solr admin I get "sorry, no dataimport-handler defined!" The changes can be watched in zookeeper (I see the data_config.xml) and in solr admin panel I can see the updated version of solrconfig.xml.
Any good tutorial for a production deploy of solrcloud (with somthink like the desired configuration mentioned before) on single or multiple machine for Ubuntu 12.04 LTS?
Any advice would be appreciated! Thanks in advance!
Normally DIH config has nothing to do with wether you're using a single Solr instance or multiple instances in a solrCloud config. DIH will write data in the current instance's Lucene index, and then it's up to zooKeeper to speread it around on the other instances.
Make sure your DIH is propertly configured:
In solrconfig.xml, all necessary libraries are loaded. This means the two DIH jars:
<lib dir="../../../dist/" regex="solr-dataimporthandler-4.3.0.jar" />
<lib dir="../../../dist/" regex="solr-dataimporthandler-extras-4.3.0.jar" />
as well as others jars you may need (like Database JDBC driver, etc).
Still in solrconfig.xml make sure the DIH handler is declared, something like this:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
Finally, the config file you declared in the DIH handler (data-config.xml) should be in the same "conf" dir as solrconfig.xml and should have proper content, something like:
<dataConfig>
<dataSource type="JdbcDataSource" name="myDataSource" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:#someHost:1521:someDb" user="someUser" password="somePassword" batchSize="5000"/>
<document name="myDoc" >
<entity name="myDoc" dataSource="myDatasource" transformer="my.custom.Transformer" query="select col1, col2, col3 from table1 where whatever" />
</document>
</dataConfig>

Resources