Solr 8.4.1 cloud: bin/post - File not Found problem

I am new to Solr and have been working through the 8.4.0 tutorial. Having successfully followed the techproducts example using SolrCloud, I'm now trying to use a schemaless approach to index some PDF files. For that, I used the following commands, again from the tutorial, to index several files stored in the ~/Documents/pdf folder:
bin/solr create -c localpdf -s 2 -rf 2
bin/post -c localpdf ~/Documents/pdf
When executing the above, I get the following error:
SimplePostTool: WARNING: Response: <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/localpdf/update/extract. Reason:
<pre> Not Found</pre></p>
</body>
</html>
SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException: http://localhost:8983/solr/localpdf/update/extract?resource.name=%2Fhome%2Fuser%2FDocuments%2Fpdf%2Ftest234.pdf&literal.id=%2Fhome%2Fuser%2FDocuments%2Fpdf%2Ftest234.pdf
Running the same command with techproducts, i.e. running:
bin/post -c techproducts ~/Documents/pdf
at least finds the files (it gives me some other errors related to PDFBox and some fonts, but that's another matter).
I can add other files to localpdf, for instance XML from the example/exampledocs folder, but not the PDFs.
What am I missing here?

You must configure your core or collection to load the extracting request handler; otherwise it is not available. The techproducts configuration does this by default. First add the Solr Cell jars to the list of jars to load in solrconfig.xml:
<lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />
Then add the request handler definition (taken from the Solr Reference Guide):
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.Last-Modified">last_modified</str>
    <str name="uprefix">ignored_</str>
  </lst>
  <!-- Optional. Specify a path to a Tika configuration file. See the Tika docs for details. -->
  <str name="tika.config">/my/path/to/tika.config</str>
  <!-- Optional. Specify one or more date formats to parse. See DateUtil.DEFAULT_DATE_FORMATS
       for default date formats. -->
  <lst name="date.formats">
    <str>yyyy-MM-dd</str>
  </lst>
  <!-- Optional. Specify an external file containing parser-specific properties.
       This file is located in the same directory as solrconfig.xml by default. -->
  <str name="parseContext.config">parseContext.xml</str>
</requestHandler>
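Before re-running bin/post, it can help to confirm that the configuration really declares the handler. The following is a minimal Python sketch, not part of Solr: the function name has_extract_handler and the two trimmed sample configs are illustrative assumptions. It only inspects the XML text and does not verify that the Solr Cell jars actually load.

```python
import xml.etree.ElementTree as ET


def has_extract_handler(solrconfig_xml: str, path: str = "/update/extract") -> bool:
    """Return True if the config text declares a requestHandler with the given name."""
    root = ET.fromstring(solrconfig_xml)
    # iter() finds <requestHandler> elements at any depth below the root.
    return any(h.get("name") == path for h in root.iter("requestHandler"))


# Hypothetical, heavily trimmed configs used only for illustration.
WITHOUT_HANDLER = """<config>
  <requestHandler name="/select" class="solr.SearchHandler"/>
</config>"""

WITH_HANDLER = """<config>
  <requestHandler name="/select" class="solr.SearchHandler"/>
  <requestHandler name="/update/extract"
                  class="org.apache.solr.handler.extraction.ExtractingRequestHandler"/>
</config>"""

if __name__ == "__main__":
    print(has_extract_handler(WITHOUT_HANDLER))
    print(has_extract_handler(WITH_HANDLER))
```

In SolrCloud the active configuration is stored in ZooKeeper, so the text to check would have to come from the uploaded configset rather than from a local file.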

Related

Failed to create solr core in Solr 8.9.0 using Solr API

I created a Solr WAR file by following the process described in
https://gist.github.com/fschiettecatte/836d13be0c95f1fd159e45d3af861952
because I want to run Solr as a standalone application on my own
version of the Jetty server.
After creating the WAR file, I started Solr through Jetty successfully by
running the command below:
$ java -Djetty.home=/var/solr -Djetty.base=/var/solr \
  -Dsolr.solr.home=/var/solr/solr -Dsolr.log.dir=/var/solr/solr \
  -Dbootstrap_confdir=/var/solr/solr/conf -Dcollection.configName=conf \
  -DzkRun -Djava.util.logging.config.file=/var/solr/solr/solr-log.properties \
  -jar /var/solr/start.jar
2021-08-20 07:49:40.869:INFO::main: Logging initialized @155ms to
org.eclipse.jetty.util.log.StdErrLog
2021-08-20 07:49:41.021:INFO:oejs.Server:main: jetty-9.4.18.v20190429;
built: 2019-05-10T18:03:12.512Z; git:
7ef7435fd940d3eb73c256b765d93aff5849c6e8; jvm 11.0.5+10
2021-08-20 07:49:41.029:INFO:oejdp.ScanningAppProvider:main:
Deployment monitor [file:///data/git/runtime/solr/webapps/] at
interval 1
2021-08-20 07:49:41.568:INFO:oejw.StandardDescriptorProcessor:main: NO
JSP Support for /solr, did not find
org.apache.jasper.servlet.JspServlet
2021-08-20 07:49:41.572:INFO:oejs.session:main:
DefaultSessionIdManager workerName=node0
2021-08-20 07:49:41.573:INFO:oejs.session:main: No SessionScavenger
set, using defaults
2021-08-20 07:49:41.573:INFO:oejs.session:main: node0 Scavenging every 600000ms
2021-08-20 07:49:41.575:WARN:oejs.SecurityHandler:main:
ServletContext@o.e.j.w.WebAppContext@294425a7{solr,/solr,file:///tmp/jetty-0.0.0.0-8983-solr.war-_solr-any-17126331786779836602.dir/webapp/,STARTING}{/solr.war}
has uncovered http methods for path: /
ERROR StatusLogger No Log4j 2 configuration file found. Using default
configuration (logging only errors to the console), or user
programmatically provided configurations. Set system property
'log4j2.debug' to show Log4j 2 internal initialization logging. See
https://logging.apache.org/log4j/2.x/manual/configuration.html for
instructions on how to configure Log4j 2
2021-08-20 07:49:43.647:INFO:oejsh.ContextHandler:main: Started
o.e.j.w.WebAppContext@294425a7{solr,/solr,file:///tmp/jetty-0.0.0.0-8983-solr.war-_solr-any-17126331786779836602.dir/webapp/,AVAILABLE}{/solr.war}
2021-08-20 07:49:43.652:INFO:oejs.AbstractConnector:main: Started
ServerConnector@6da9dc6{HTTP/1.1,[http/1.1]}{0.0.0.0:8983}
2021-08-20 07:49:43.653:INFO:oejs.Server:main: Started @2939ms
Solr is working fine. I ran a command to verify:
$ curl "http://0.0.0.0:8983/solr/admin/collections?action=clusterstatus&wt=xml"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">19</int>
</lst>
<lst name="cluster">
<lst name="collections"/>
<arr name="live_nodes">
<str>192.168.1.2:8983_solr</str>
</arr>
</lst>
</response>
When I tried to create a core, it failed with the error below. Before
running this command I created a folder named "a10" under the Solr home
directory, in "/var/solr/solr/cores".
$ curl "http://0.0.0.0:8983/solr/admin/cores?action=CREATE&name=a10&instanceDir=cores/a10&shard=shard10&collection=conf1&coreNodeName=a10&wt=xml"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">400</int>
<int name="QTime">10067</int>
</lst>
<lst name="error">
<lst name="metadata">
<str name="error-class">org.apache.solr.common.SolrException</str>
<str name="root-error-class">org.apache.solr.cloud.ZkController$NotInClusterStateException</str>
</lst>
<str name="msg">Error CREATEing SolrCore 'a10': coreNodeName a10 does
not exist in shard shard10, ignore the exception if the replica was
deleted</str>
<int name="code">400</int>
</lst>
</response>
The Jetty console shows this stack trace for the error:
07:54:36.047 [qtp466505482-21] ERROR org.apache.solr.handler.RequestHandlerBase - org.apache.solr.common.SolrException: Error CREATEing SolrCore 'a1': coreNodeName a1 does not exist in shard shard1, ignore the exception if the replica was deleted
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1136)
at org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:92)
at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:360)
at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:396)
at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:180)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
at org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:758)
at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:739)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:511)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:395)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:341)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1700)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1667)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:152)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.Server.handle(Server.java:505)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:132)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:724)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:830)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.solr.cloud.ZkController$NotInClusterStateException: coreNodeName a1 does not exist in shard shard1, ignore the exception if the replica was deleted
at org.apache.solr.cloud.ZkController.checkStateInZk(ZkController.java:1874)
at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1773)
at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1180)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1097)
... 41 more
My issue was resolved when I started Solr in standalone mode. I had started Solr in cloud mode, which is why the create-core command was failing.
The Core Admin API should be used only when Solr is started in standalone mode, and the Collections API should be used when Solr is started in cloud mode.
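The split described above can be sketched as follows. This is a hedged illustration in Python, not Solr client code: create_url is a hypothetical helper, the instanceDir layout mirrors the cores/a10 path from the question, and the numShards and replicationFactor values of 1 are placeholder assumptions.

```python
from urllib.parse import urlencode


def create_url(base: str, name: str, cloud_mode: bool) -> str:
    """Build a CREATE request URL for the API that matches the run mode."""
    if cloud_mode:
        # SolrCloud: create a collection through the Collections API.
        params = {"action": "CREATE", "name": name,
                  "numShards": 1, "replicationFactor": 1}
        return f"{base}/admin/collections?{urlencode(params)}"
    # Standalone: create a core through the Core Admin API.
    params = {"action": "CREATE", "name": name, "instanceDir": f"cores/{name}"}
    return f"{base}/admin/cores?{urlencode(params)}"


if __name__ == "__main__":
    base = "http://localhost:8983/solr"
    print(create_url(base, "a10", cloud_mode=False))
    print(create_url(base, "a10", cloud_mode=True))
```

The cluster status request shown earlier answers only in cloud mode, so a successful response to it is a reasonable hint that the Collections API is the one to use.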

HTML sample file not indexing in Solr 8.8

I am trying to index the exampledocs in the example folder with the SimplePostTool on Windows 10 using Solr 8.8. All the documents are indexed except sample.html. For that file I get the following error:
PS C:\solr-8.8.0> java -jar -Dc=gettingstarted -Dauto example\exampledocs\post.jar example\exampledocs\sample.html
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file sample.html (text/html) to [base]/extract
SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url: http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html&literal.id=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html
SimplePostTool: WARNING: Response: <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404 Not Found</h2>
<table>
<tr><th>URI:</th><td>/solr/gettingstarted/update/extract</td></tr>
<tr><th>STATUS:</th><td>404</td></tr>
<tr><th>MESSAGE:</th><td>Not Found</td></tr>
<tr><th>SERVLET:</th><td>default</td></tr>
</table>
</body>
</html>
SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException: http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html&literal.id=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update...
Time spent: 0:00:00.086
However the json and all other file types index with no problem. For example:
PS C:\solr-8.8.0> java -jar -Dc=gettingstarted -Dauto example\exampledocs\post.jar example\exampledocs\books.json
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file books.json (application/json) to [base]/json/docs
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update...
I am just following this tutorial: https://lucene.apache.org/solr/guide/8_8/post-tool.html#post-tool-windows-support
The extracting request handler that allows indexing of rich documents has to be enabled before it can be used. If you look at the paths in both of your requests, you can see that the first request goes to /update/extract and returns a 404, while the second request goes to /update/json/docs and works.
You can find a description of how to enable and configure the endpoint in the Solr documentation:
If you are not working with an example configset, the jars required to use Solr Cell will not be loaded automatically. You will need to configure your solrconfig.xml to find the ExtractingRequestHandler and its dependencies:
<lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />
You can then configure the ExtractingRequestHandler in solrconfig.xml. The following is the default configuration found in Solr’s _default configset, which you can modify as needed:
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.content">_text_</str>
  </lst>
</requestHandler>
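The routing visible in the two post.jar logs above can be summarized with a small Python sketch. It is an approximation inferred from that output, not the tool's actual code: endpoint_for and the NATIVE table are hypothetical names, only the html and json rows are confirmed by the logs, and the xml and csv rows are an assumption.

```python
from pathlib import PurePath

# File endings the sketch treats as Solr's own update formats, mapped to the
# suffix appended to the base /update URL. The xml and csv rows are assumptions.
NATIVE = {"xml": "", "csv": "", "json": "/json/docs"}


def endpoint_for(filename: str, base: str = "/update") -> str:
    """Return the handler path a file would be posted to in this sketch."""
    ext = PurePath(filename).suffix.lstrip(".").lower()
    # Anything that is not a native format is sent to the extracting handler.
    return base + NATIVE.get(ext, "/extract")


if __name__ == "__main__":
    print(endpoint_for("sample.html"))
    print(endpoint_for("books.json"))
```

This is why only the rich-document types fail on a collection without the extracting handler: they are the only ones routed to /update/extract.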

apache solr provisioning - how to keep config in VCS but data on machine

I need to make some adaptations in a project that uses Apache Solr for full-text search. Someone configured everything on the production machine, and I want to prepare everything locally and deploy the whole new version at once.
I have already created a working Vagrant setup for everything else, and it works well.
My problem is that I am not very experienced with configuring Apache Solr and can't manage to get it working.
Here is my installation script:
apt-get install -q -y openjdk-8-jdk
# install apache solr
if [[ ! -e "/etc/default/solr.in.sh" ]]
then
    wget http://www-eu.apache.org/dist/lucene/solr/7.7.1/solr-7.7.1.tgz
    tar xzf solr-7.7.1.tgz solr-7.7.1/bin/install_solr_service.sh --strip-components=2
    chmod u+x ./install_solr_service.sh
    ./install_solr_service.sh solr-7.7.1.tgz
    cat /vagrant/config/solr/solr.in.sh >> /etc/default/solr.in.sh
    rm -f /opt/solr-7.7.1/server/solr/solr.xml
    ln -s /vagrant/config/solr/solr.xml /opt/solr-7.7.1/server/solr/solr.xml
fi
Contents of /vagrant/config/solr/solr.in.sh (taken from the production config; I don't really understand its purpose):
# this is just a partial file - we append its contents to the original
SOLR_RECOMMENDED_OPEN_FILES=65000
Contents of the linked solr.xml:
<?xml version="1.0" encoding="UTF-8" ?>
<solr>
<str name="coreRootDirectory">${coreRootDirectory:/vagrant/config/solr/cores}</str>
<solrcloud>
<str name="host">${host:}</str>
<int name="hostPort">${jetty.port:8983}</int>
<str name="hostContext">${hostContext:solr}</str>
<bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
<int name="zkClientTimeout">${zkClientTimeout:30000}</int>
<int name="distribUpdateSoTimeout">${distribUpdateSoTimeout:600000}</int>
<int name="distribUpdateConnTimeout">${distribUpdateConnTimeout:60000}</int>
<str name="zkCredentialsProvider">${zkCredentialsProvider:org.apache.solr.common.cloud.DefaultZkCredentialsProvider}</str>
<str name="zkACLProvider">${zkACLProvider:org.apache.solr.common.cloud.DefaultZkACLProvider}</str>
</solrcloud>
<shardHandlerFactory name="shardHandlerFactory"
class="HttpShardHandlerFactory">
<int name="socketTimeout">${socketTimeout:600000}</int>
<int name="connTimeout">${connTimeout:60000}</int>
<str name="shardsWhitelist">${solr.shardsWhitelist:}</str>
</shardHandlerFactory>
</solr>
The cores directory contains all the information from the production machine. I just added the following value to the core.properties file within each core:
dataDir=/var/solr/data/NAME_OF_CORE
I figured that this way the data would live on my machine while the config stays in my repository.
But when I browse to localhost:8983 (which works perfectly) I don't see any cores. I also can't create a new core: when I try to create one called "new_core", it says:
new_core: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load conf for core new_core: Error loading solr config from /var/solr/data/new_core/conf/solrconfig.xml
So, how do I provision Solr correctly so that all my config stays in git but the data stays on the machine?
The company that set everything up is not helpful; they provide ZERO information.
Kind regards,
Philipp

Issue while creating Solr core on HDFS

I am trying to create a Solr core on HDFS in a standalone instance (solr-5.3.0 and Hadoop 2.7). I started the service as follows:
$ bin/solr start -Dsolr.directoryFactory=HdfsDirectoryFactory -Dsolr.lock.type=hdfs -Dsolr.data.dir=hdfs://localhost:9000/tmp -Dsolr.updatelog=hdfs://localhost:9000/tmp -s solr-cores/core1
Waiting up to 30 seconds to see Solr running on port 8983 [/]
Started Solr server on port 8983 (pid=42277). Happy searching!
Then I tried to create the core as follows:
bin/solr create -c hdfsstarted -d /home/admin/HadoopTools/solr-5.3.0/server/solr/configsets/data_driven_schema_configs_hdfs/conf -n hdfsstarted
But I get the error below:
Setup new core instance directory:
/home/admin/HadoopTools/solr-5.3.0/solr-cores/core1/hdfsstarted
Creating new core 'hdfsstarted' using command:
http://localhost:8983/solr/admin/cores?action=CREATE&name=hdfsstarted&instanceDir=hdfsstarted
ERROR: Error CREATEing SolrCore 'hdfsstarted': Unable to create core [hdfsstarted] Caused by: Protocol message end-group tag did not match expected tag.
I have modified solrconfig.xml as follows:
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
<str name="solr.hdfs.home">hdfs://10.67.5.244:50070/tmp</str>
<bool name="solr.hdfs.blockcache.enabled">true</bool>
<int name="solr.hdfs.blockcache.slab.count">1</int>
<bool name="solr.hdfs.blockcache.direct.memory.allocation">false</bool>
<int name="solr.hdfs.blockcache.blocksperbank">16384</int>
<bool name="solr.hdfs.blockcache.read.enabled">true</bool>
<bool name="solr.hdfs.blockcache.write.enabled">false</bool>
<bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
<int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
<int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
</directoryFactory>
<lockType>hdfs</lockType>
Please let me know how to create a core on HDFS correctly.
Caused by: Protocol message end-group tag did not match expected tag.
This error happens because you are using an incorrect HDFS port. See the related question "hdfs - ls: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException".
You need to change the port from 50070 (which looks like the NameNode web UI port) to 8020, or whatever port you are using as the NameNode RPC port, in this line of your solrconfig.xml:
<str name="solr.hdfs.home">hdfs://10.67.5.244:50070/tmp</str>
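As a hedged illustration of that change, the Python sketch below swaps only the port of the solr.hdfs.home URI and leaves host and path alone. The helper name with_port is hypothetical, and 8020 is just the example value from the answer. The start command in the question points at hdfs://localhost:9000, which suggests that 9000 may be the RPC port in that particular setup, but that is an inference, not something the source confirms.

```python
from urllib.parse import urlsplit, urlunsplit


def with_port(hdfs_uri: str, rpc_port: int) -> str:
    """Return the same URI with its port replaced by the NameNode RPC port."""
    parts = urlsplit(hdfs_uri)
    # Rebuild the authority from the original host and the new port.
    netloc = f"{parts.hostname}:{rpc_port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(with_port("hdfs://10.67.5.244:50070/tmp", 8020))
```

The resulting value is what would go inside the solr.hdfs.home element; the authoritative port is whatever the cluster's own NameNode configuration specifies.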

Errors while trying to configure Solr 5.3.1 on Windows 10

I'm trying to set up a very basic configuration of Solr to read some text from a MySQL table and index it. I'm following the steps in the DIH Quick Start document.
The document doesn't tell you where to place solrconfig.xml.
At first I tried placing it under the solr5.3.1 folder (next to bin). That failed. Then I noticed the "add core" button was looking for it in server\solr\new_core. So I put it there, but then got this other error:
My data import handler looks like this:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
And here's data-config.xml:
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost/ctcrets"
user="root"
password="xxxx"/>
<document>
<entity name="id"
query="select RETS_STAGE1_QUEUE_ID as id, LN_LIST_NUMBER as name, xmlText as desc from RETS_STAGE1_QUEUE">
</entity>
</document>
</dataConfig>
What could be the problem?
The document assumes you already know the solr.home [1] directory structure. On top of that, I think it assumes you started the sample Solr instance (e.g. ./solr start -p 8984), where everything should already be set up.
Once it has started, the dashboard shows exactly where the configuration is located. Go there, change the files as suggested, and RELOAD the core through the admin console (CoreAdmin). If you prefer, you can also stop and restart Solr.
Side notes:
The DIH is not part of the Solr core, so solrconfig.xml needs a "lib" directive to load it. As far as I remember, the sample config already has those directives, so you don't need to "import" the DIH lib yourself.
The JDBC driver that allows the connection to the database is not included, so your classpath (i.e. the JVM or Solr classpath, through the same lib directive) must include this additional library.
[1] http://www.solrtutorial.com/configuring-solr.html
