HTML sample file not indexing in Solr 8.8 - solr

I am trying out indexing the exampledocs in the example folder with the SimplePostTool on Windows 10 using Solr 8.8. All the documents index except sample.html. For that file I get the following error:
PS C:\solr-8.8.0> java -jar -Dc=gettingstarted -Dauto example\exampledocs\post.jar example\exampledocs\sample.html
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file sample.html (text/html) to [base]/extract
SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url: http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html&literal.id=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html
SimplePostTool: WARNING: Response: <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404 Not Found</h2>
<table>
<tr><th>URI:</th><td>/solr/gettingstarted/update/extract</td></tr>
<tr><th>STATUS:</th><td>404</td></tr>
<tr><th>MESSAGE:</th><td>Not Found</td></tr>
<tr><th>SERVLET:</th><td>default</td></tr>
</table>
</body>
</html>
SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException: http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html&literal.id=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update...
Time spent: 0:00:00.086
However, the JSON file and all other file types index with no problem. For example:
PS C:\solr-8.8.0> java -jar -Dc=gettingstarted -Dauto example\exampledocs\post.jar example\exampledocs\books.json
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file books.json (application/json) to [base]/json/docs
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update...
I am just following this tutorial: https://lucene.apache.org/solr/guide/8_8/post-tool.html#post-tool-windows-support

The extracting request handler, which allows indexing of rich documents such as HTML and PDF, has to be enabled before it can be used. If you look at the paths in both of your requests, you can see that the first goes to /update/extract and returns a 404, while the second goes to /update/json/docs and works.
You can find a description of how to enable and configure the endpoint in the Solr documentation:
If you are not working with an example configset, the jars required to use Solr Cell will not be loaded automatically. You will need to configure your solrconfig.xml to find the ExtractingRequestHandler and its dependencies:
<lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />
You can then configure the ExtractingRequestHandler in solrconfig.xml. The following is the default configuration found in Solr’s _default configset, which you can modify as needed:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.content">_text_</str>
</lst>
</requestHandler>

Related

Solr 8.4.1 cloud : bin/post - File not Found problem

I am new to Solr and have been working through the tutorial of 8.4.0. Having successfully followed the techproducts example using SolrCloud, I'm now trying to use a schemaless approach to index some PDF files. For that, I used the following, again from the tutorial, to index several files which are stored in the ~/Documents/pdf folder:
bin/solr create -c localpdf -s 2 -rf 2
bin/post -c localpdf ~/Documents/pdf
When executing the above, I get the following error:
SimplePostTool: WARNING: Response: <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/localpdf/update/extract. Reason:
<pre> Not Found</pre></p>
</body>
</html>
SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException: http://localhost:8983/solr/localpdf/update/extract?resource.name=%2Fhome%2Fuser%2FDocuments%2Fpdf%2Ftest234.pdf&literal.id=%2Fhome%2Fuser%2FDocuments%2Fpdf%2Ftest234.pdf
Running the same command with techproducts, i.e. running:
bin/post -c techproducts ~/Documents/pdf
at least finds the files (it gives me some other errors related to PDFBox and some fonts, but that's another matter)
I can add other files, for instance XML to localpdf from the example/exampledocs folder, but not the pdfs.
What am I missing here?
You must configure your core or collection to load the extracting request handler; otherwise it is not available. The techproducts core does this by default. Add the jars to the list of jars to load:
<lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />
And add the request handler definition (from the guide linked above):
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="fmap.Last-Modified">last_modified</str>
<str name="uprefix">ignored_</str>
</lst>
<!--Optional. Specify a path to a tika configuration file. See the Tika docs for details.-->
<str name="tika.config">/my/path/to/tika.config</str>
<!-- Optional. Specify one or more date formats to parse. See DateUtil.DEFAULT_DATE_FORMATS
for default date formats -->
<lst name="date.formats">
<str>yyyy-MM-dd</str>
</lst>
<!-- Optional. Specify an external file containing parser-specific properties.
This file is located in the same directory as solrconfig.xml by default.-->
<str name="parseContext.config">parseContext.xml</str>
</requestHandler>

Errors while trying to configure Solr 5.3.1 on Windows 10

I'm trying to set up a very basic configuration of Solr, to read some text from a MySQL table and index it. I'm following the steps in the DIH Quick Start document.
The document doesn't tell you where to place solrconfig.xml.
At first I tried placing it under the solr5.3.1 folder (next to bin). That failed. Then I noticed the "add core" button was looking for it in server\solr\new_core. So I put it there, but then got this other error:
My data import handler looks like this:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
And here's data-config.xml:
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost/ctcrets"
user="root"
password="xxxx"/>
<document>
<entity name="id"
query="select RETS_STAGE1_QUEUE_ID as id, LN_LIST_NUMBER as name, xmlText as desc from RETS_STAGE1_QUEUE">
</entity>
</document>
</dataConfig>
What could be the problem?
The document assumes you already know the solr.home [1] directory structure. On top of that, I think it assumes you started the sample Solr instance (e.g. ./solr start -p 8984) where everything should be already set.
Once Solr is started, the dashboard shows exactly where the configuration is located. Go there, change the files as suggested and RELOAD the core through the admin console (CoreAdmin). If you prefer, you can also stop and restart Solr.
As side notes:
The DIH is not part of the Solr core, so solrconfig.xml needs a "lib" directive that loads it. As far as I remember, the sample config already has those directives, so there you don't need to "import" the DIH lib yourself.
The JDBC driver that allows the connection to the database is not included, so your classpath (i.e. the JVM or Solr classpath, through the same lib directive) must include this additional lib.
[1] http://www.solrtutorial.com/configuring-solr.html
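The two side notes above translate into lib directives in solrconfig.xml. The following is only a minimal sketch under stated assumptions, not the content of any shipped config: the first directive assumes the DIH jars sit in the dist/ folder of the Solr installation, and the second uses a hypothetical directory and jar name pattern for the MySQL driver, which you would replace with your own.

```xml
<!-- Sketch for solrconfig.xml; adjust both paths to your installation. -->

<!-- Load the DataImportHandler jars from Solr's dist/ directory.
     The relative default after the colon is an assumption: it depends on
     how deep the core directory sits below the install root. -->
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />

<!-- Load the MySQL JDBC driver. The directory and the jar name pattern are
     hypothetical: point them at wherever you saved the driver jar. -->
<lib dir="/path/to/jdbc-driver" regex="mysql-connector-java-.*\.jar" />
```

After adding the directives, reload the core as described above so the jars are picked up.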

Solr/ SimplePostTool cannot index any file 503 error

I am new to Solr and trying to figure out the basics of indexing one file. I've started with this tutorial http://lucene.apache.org/solr/quickstart.html, but being on Windows I am hitting a wall when it comes to running the command to index.
This is what my output looks like:
SolrCloud example is running, please visit http://localhost:8983/solr"
C:\Projects\solr-5.1.0>java -Dc=gettingstarted -Dtype=text/csv -Dfiletypes=cs -jar example/exampledocs/post.jar Z:/Indexer/tfs/ShippingOptionsPerRecipient.aspx.cs
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update using content-type text/csv...
POSTing file ShippingOptionsPerRecipient.aspx.cs to [base]
SimplePostTool: WARNING: Solr returned an error #503 (Service Unavailable) for url: http://localhost:8983/solr/gettingstarted/update
SimplePostTool: WARNING: Response: <?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">503</int><int name="QTime">4057</int></lst><lst name="error"><str name="msg">No registered leader was found after waiting for 4000ms , collection: gettingstarted slice: shard1</str><int name="code">503</int></lst>
</response>
SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 503 for URL: http://localhost:8983/solr/gettingstarted/update
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update...
Time spent: 0:00:04.255
Unfortunately, I can't find much documentation on any of these errors.
Any help would be appreciated.

Solr - Indexing error with UTF-8 characters

I am 100% new to Solr. I installed solr-5.1 for Windows and followed the tutorial.
I need some direction as to what may have caused the error below, e.g. need to add config to core xml file, UTF-8 encoding problem, etc...
start solr with :] solr.cmd -start
create a core :] solr create -c myExample
index pdf files :] java -Dc=myexample -Dfiletypes=pdf -jar ../example/exampledocs/post.jar E:\solr_docs\*.pdf
Errors:
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/myExample/update using content-type application/xml...
POSTing file Intrusion detection by machine learning.pdf to [base]
SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for url: http://localhost:8983/solr/myExample/update
SimplePostTool: WARNING: Response: <?xml version="1.0" encoding="UTF-8"?>
<response><lst name="responseHeader">
<int name="status">400</int><int name="QTime">0</int>
</lst><lst name="error"><str name="msg">Invalid UTF-8 middle byte 0xe3 (at char #10, byte #-1)</str><int name="code">400</int></lst>
</response>
</response>
You are feeding Solr a PDF file as if it were a text file: the output shows it being posted to /update with content-type application/xml. You need to configure and use a suitable URP chain to have Solr work with PDF files.

Error while indexing .xml files in solr

I am trying to index XML files in Solr using the following command:
java -Durl=http://10.1.11.143:8080/solr/#/ -jar post.jar solr.xml
But I am getting the following error:
SimplePostTool version 1.5
Posting files to base url http://10.1.11.143:8080/solr/#/ using content-type application/xml..
POSTing file solr.xml
SimplePostTool: WARNING: Solr returned an error #500 Internal Server Error
SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 500 for URL: http://10.1.11.143:8080/solr/#/
1 files indexed.
COMMITting Solr index changes to http://10.1.11.143:8080/solr/#/..
SimplePostTool: WARNING: Solr returned an error #500 Internal Server Error for url http://10.1.11.143:8080/solr/#/?commit=true
Time spent: 0:00:00.017
Please help me resolve this error.
Content of solr.xml is as shown in the picture:
The issue is the URL. http://10.1.11.143:8080/solr/#/ is the address of the admin UI, and it does not name any requestHandler to send the update to. Use the following command instead:
java -Durl=http://10.1.11.143:8080/solr/update?commit=true -jar post.jar solr.xml
/update is the requestHandler to index data into Solr.
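Note that /update expects Solr's XML update format rather than arbitrary XML. The content of the asker's solr.xml is not visible here, so the following is only a minimal sketch of a file /update would accept, assuming the schema defines fields called id and name (both field names and their values are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Each <doc> element becomes one indexed document. -->
<add>
  <doc>
    <!-- Field names must match fields declared in the schema;
         "id" and "name" are assumptions made for this example. -->
    <field name="id">doc-1</field>
    <field name="name">Example document</field>
  </doc>
</add>
```

A file in this shape can be posted with the corrected command above.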
