Solr - Indexing error with UTF-8 characters - solr

I am 100% new to Solr. I installed solr-5.1 for Windows and followed the tutorial.
I need some direction as to what may have caused the error below, e.g. need to add config to core xml file, UTF-8 encoding problem, etc...
start solr with :] solr.cmd -start
create a core :] solr create -c myExample
index pdf files :] jar -Dc=myexample -Dfiletypes=pdf -jar ../example/exampledocs/post.jar E:\solr_docs\*.pdf
Errors:
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/myExample/update using content-type application/xml...
POSTing file Intrusion detection by machine learning.pdf to [base]
SimplePostTool: WARNING: Solr returned an error \#400 (Bad Request) for url: http://localhost:8983/solr/myExample/update
SimplePostTool: WARNING: Response: <?xml version="1.0" encoding="UTF-8"?>
<response><lst name="responseHeader">
<intname="status">400</int><intname="QTime">0</int>
</lst><lst name="error"><str name="msg">Invalid UTF-8 middle byte 0xe3 (at char
\#10, byte \#-1)</str><int name="code">400</int></lst>
</response>

You are feeding Solr a PDF file is if it were a text file. You need to configure and use a suitable URP chain to have Solr work with PDF files.

Related

HTML sample file not indexing in Solr 8.8

I am trying out indexing the exampledocs in the examples folder with the SimplePostTool on windows 10 using solr 8.8. All the documents index except sample.html. For that file I get the following error:
PS C:\solr-8.8.0> java -jar -Dc=gettingstarted -Dauto example\exampledocs\post.jar example\exampledocs\sample.html
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file sample.html (text/html) to [base]/extract
SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url: http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html&literal.id=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html
SimplePostTool: WARNING: Response: <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404 Not Found</h2>
<table>
<tr><th>URI:</th><td>/solr/gettingstarted/update/extract</td></tr>
<tr><th>STATUS:</th><td>404</td></tr>
<tr><th>MESSAGE:</th><td>Not Found</td></tr>
<tr><th>SERVLET:</th><td>default</td></tr>
</table>
</body>
</html>
SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException: http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html&literal.id=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update...
Time spent: 0:00:00.086
However the json and all other file types index with no problem. For example:
PS C:\solr-8.8.0> java -jar -Dc=gettingstarted -Dauto example\exampledocs\post.jar example\exampledocs\books.json
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file books.json (application/json) to [base]/json/docs
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update...
Just following this tutorial:https://lucene.apache.org/solr/guide/8_8/post-tool.html#post-tool-windows-support
The extracting request handler that allows indexing of rich documents has to be enabled before it can be used. If you look at the paths in both your request, you can see that your first request goes to /extract and it gives a 404, while your second request goes to /update and works.
You can find a description of how to enable and configure the endpoint in the Solr documentation:
If you are not working with an example configset, the jars required to use Solr Cell will not be loaded automatically. You will need to configure your solrconfig.xml to find the ExtractingRequestHandler and its dependencies:
<lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />
You can then configure the ExtractingRequestHandler in solrconfig.xml. The following is the default configuration found in Solr’s _default configset, which you can modify as needed:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.content">_text_</str>
</lst>
</requestHandler>

Error deploying configuration descriptor Solr

I have done the below steps for Solr Integration to tomcat on windows machine.Can you please clarify what am I doing wrong here.
1) Download Solr and unzipped Solr 5.2.1 to the below directory C:\downloads\solr-5.2.1\solr-5.2.1.
2)Download Tomcat 7 zipped version and unzipped it to below location C:\downloads\apache-tomcat-7.0.62\apache-tomcat-7.0.62
3)Copy Jar files from C:\downloads\solr-5.2.1\solr-5.2.1\dist\solrj-lib directory to C:\downloads\apache-tomcat-7.0.62\apache-tomcat-7.0.62\lib directory.
4) Create a solr.xml in the C:\downloads\apache-tomcat-7.0.62\apache-tomcat-7.0.62\conf\Catalina\localhost folder.
<?xml version='1.0' encoding='UTF-8'?>
<context docBase="C:/downloads/apache-tomcat-7.0.62/apache-tomcat-7.0.62/webapps/solr.war" debug="0" crossContext="true" >
<environment name="solr" type="java.lang.String" value="/apache-tomcat-7.0.62/webapps/" override="true"></environment>
</context>
5)Copy solr.war file from C:\downloads\solr-5.2.1\solr-5.2.1\server\webapps to
C:\downloads\apache-tomcat-7.0.62\apache-tomcat-7.0.62\webapps folder.
6)Start the tomcat using startup.bat command in bin folder
7)Edit web.xml to
<env-entry>
<env-entry-name>solr/home</env-entry-name>
<env-entry-value>C:/downloads/solr-5.2.1/solr-5.2.1</env-entry-value>
<env-entry-type>java.lang.String</env-entry-type>
</env-entry>
8)Restart the tomcat and hit the url http://localhost:8080/solr I get 404 Not found Error.The error in the console is
SEVERE: Error deploying configuration descriptor C:\downloads\apache-tomcat-7.0.
62\apache-tomcat-7.0.62\conf\Catalina\localhost\solr.xml
java.lang.NullPointerException
at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.ja
va:645)
The Solr wiki states that running 5.x versions on Tomcat is no longer supported:
Internally, Solr is still implemented via Servlet APIs and is powered by Jetty -- but this is simply an implementation detail. Deployment as a "webapp" to other Servlet Containers (or other instances of Jetty) is not supported, and may not work in future 5.x versions of Solr when additional changes are likely to be made to Solr internally to leverage custom networking stack features.

Solr/ SimplePostTool cannot index any file 503 error

I am new to Solr and trying to figure out the basics of indexing one file. I've started with this tutorial http://lucene.apache.org/solr/quickstart.html, but being on windows I am hitting a wall when it comes to running the command to index.
This is what my output looks like:
SolrCloud example is running, please visit http://localhost:8983/solr"
C:\Projects\solr-5.1.0>java -Dc=gettingstarted -Dtype=text/csv -Dfiletypes=cs -
jar example/exampledocs/post.jar Z:/Indexer/tfs/ShippingOptionsPerRecipient.aspx
.cs
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update usi
ng content-type text/csv...
POSTing file ShippingOptionsPerRecipient.aspx.cs to [base]
SimplePostTool: WARNING: Solr returned an error #503 (Service Unavailable) for u
rl: http://localhost:8983/solr/gettingstarted/update
SimplePostTool: WARNING: Response: <?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">503</int><int name="QTime">4057</i
nt></lst><lst name="error"><str name="msg">No registered leader was found after
waiting for 4000ms , collection: gettingstarted slice: shard1</str><int name="co
de">503</int></lst>
</response>
SimplePostTool: WARNING: IOException while reading response: java.io.IOException
: Server returned HTTP response code: 503 for URL: http://localhost:8983/solr/ge
ttingstarted/update
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/updat
e...
Time spent: 0:00:04.255
Unfortunately, I can't find much documentation on any of these errors.
Any help would be appreciated.

Error while indexing .xml files in solr

I am trying to index xml files in solr search engine using following command:
java -Durl=http://10.1.11.143:8080/solr/#/ -jar post.jar solr.xml
But I am getting following error:
SimplePostTool version 1.5
Posting files to base url http://10.1.11.143:8080/solr/#/ using content-type application/xml..
POSTing file solr.xml
SimplePostTool: WARNING: Solr returned an error #500 Internal Server Error
SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 500 for URL: http://10.1.11.143:8080/solr/#/
1 files indexed.
COMMITting Solr index changes to http://10.1.11.143:8080/solr/#/..
SimplePostTool: WARNING: Solr returned an error #500 Internal Server Error for url http://10.1.11.143:8080/solr/#/?commit=true
Time spent: 0:00:00.017
Please help me to come out of this error.
Content of solr.xml is as shown in the picture:
The issue is because of the URL. You didn't mention any requestHandler while updating. Use the following command. It'll work.
java -Durl=http://10.1.11.143:8080/solr/update?commit=true -jar post.jar solr.xml
/update is the requestHandler to index data into Solr.

I'm following the Nutch tutorial, and getting a "No URLs to fetch" error

Following the Apache Nutch tutorial here:
As indicated in the tutorial, I've set the last line of my regex-urlfilter.txt to:
+^http://([a-z0-9]*\.)*nutch.apache.org/
My nutch-site.xml file contains only the lines
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
And my seed.txt file is:
http://nutch.apache.org/
However, when I crawl with
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
I get a "No URLs to fetch" error. Anyone know why?
Configuration looks fine to me. You have made these changes in runtime/local folder right?
seed.txt will be in NUTCH_HOME/runtime/local/urls folder and
regex-urlfilter.txt and nutch-site.xml will be in NUTCH_HOME/runtime/local/conf folder
NUTCH_HOME is installation directory

Resources