django haystack solr: index file not being created - solr

Solr is not able to create the index. I get the following error:
All documents removed.
Indexing 100 notes.
Failed to add documents to Solr: [Reason: None]
<html><head><meta content="text/html; charset=ISO-8859-1" http-equiv="Content-Type"/>
<title>Error 404 NOT_FOUND</title></head><body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/boatsite/update/. Reason:<pre> NOT_FOUND</pre></p>
<hr /><i><small>Powered by Jetty://</small></i>
Any suggestions?

The answer is in the Haystack 2.0.0 beta docs under Troubleshooting. I was using a bad URL in settings.py for HAYSTACK_CONNECTIONS. The docs suggest testing the URL in a browser if you get this error. I was trying http://127.0.0.1:8983/solr/mysite but should have been using http://127.0.0.1:8983/solr
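A minimal sketch of the corrected setting, assuming the standard Haystack 2.x Solr engine path. The helper update_url is hypothetical and only illustrates which path ends up being requested: the client appends the update path to whatever base URL is configured, so a wrong base URL produces a 404 like the one in the question.

```python
# Sketch of a Haystack 2.x settings.py entry; the URL is the one from the answer above.
HAYSTACK_CONNECTIONS = {
    "default": {
        "ENGINE": "haystack.backends.solr_backend.SolrEngine",
        "URL": "http://127.0.0.1:8983/solr",
    },
}


def update_url(base_url):
    """Hypothetical helper: the URL the indexer would POST documents to."""
    return base_url.rstrip("/") + "/update/"


if __name__ == "__main__":
    # Paste the printed URL's base into a browser to confirm Solr answers there.
    print(update_url(HAYSTACK_CONNECTIONS["default"]["URL"]))
```

If the base URL does not open in a browser, the update path built from it will not work either.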

Related

Solr index custom file types

I am a Solr newbie with no experience of it, as our Solr expert left the company. We are receiving a proprietary file from a client, and I don't have access to the application that generated it.
When uploading it to Solr we receive the following error.
SOLR Log
solr-cloud.log: {"msg":"2022-01-19 08:10:06.915 ERROR (qtp349420578-3516) [c:<collection> s:shard2 r:core_node5 x:<redacted>] o.a.s.s.HttpSolrCall null:java.lang.RuntimeException: java.lang.NoClassDefFoundError: ucar/nc2/NetcdfFile"}
Our App logging
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/<collection>: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 500 Server Error</title>
</head>
<body><h2>HTTP ERROR 500</h2>
<p>Problem accessing /solr/<collection>/update/extract. Reason:
<pre> Server Error</pre></p><h3>Caused by:</h3><pre>java.lang.NoClassDefFoundError: ucar/nc2/NetcdfFile
at org.apache.tika.parser.hdf.HDFParser.parse(HDFParser.java:88)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
Other normal file types work (e.g. doc, pdf, zip).
I cannot open or edit the file to see what fields are in there to index, so is there a way to index this?
If not, is there anything else I can do to handle this file type?
TIA
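The file is being parsed by Solr/Tika using an HDF parser, which in turn depends on the NetCDF library. The NoClassDefFoundError for ucar/nc2/NetcdfFile means that library is not on Solr's classpath:
https://www.unidata.ucar.edu/downloads/netcdf-java/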
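To check whether any jar that Solr loads actually provides the missing class, a small scan like the following may help. It is a sketch only: the function name and the directory you pass in are assumptions (use whatever lib directory your Solr installation loads extraction jars from); the class name is the one from the stack trace.

```python
import zipfile
from pathlib import Path


def jars_containing(lib_dir, class_name="ucar/nc2/NetcdfFile"):
    """Return the names of .jar files under lib_dir that contain the given class."""
    entry = class_name + ".class"
    hits = []
    for jar in sorted(Path(lib_dir).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    hits.append(jar.name)
        except zipfile.BadZipFile:
            continue  # skip files that are not valid zip archives
    return hits


if __name__ == "__main__":
    import sys
    # Usage: python find_class.py <path-to-solr-lib-dir>
    print(jars_containing(sys.argv[1]))
```

An empty result would be consistent with the NoClassDefFoundError above; a non-empty result suggests the jar exists but is not on the classpath Solr uses for this core.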

Solarium return Solr HTTP error : OK (404)

I use Solarium to access Solr with Symfony. It works without problems on my computer and on the dev machine, but not on the prod server.
On the prod server, Solr is running with the same configuration, same port, and same logins.
Do you have any idea what the problem could be?
Here is the error:
Solr HTTP error: OK (404)
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN""http://www.w3.org/TR/html4/strict.dtd">
<HTML><HEAD><TITLE>Not Found</TITLE>
<META HTTP-EQUIV="Content-Type" Content="text/html; charset=us-ascii"></HEAD>
<BODY><h2>Not Found</h2>
<hr><p>HTTP Error 404. The requested resource is not found.</p>
</BODY></HTML>
Problem solved: a wrong proxy was installed on the Windows server.
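One hint that something other than Solr answered is the error page itself: the Jetty pages quoted elsewhere in this thread contain "Problem accessing" or "Powered by Jetty", while the page above does not. The sketch below encodes that rough heuristic; the function name and the marker strings are assumptions drawn only from the pages quoted here, not a general rule.

```python
def error_page_source(body):
    """Guess which server produced an HTTP error page from markers in its body."""
    text = body.lower()
    if "powered by jetty" in text or "problem accessing" in text:
        return "jetty"  # Solr's embedded Jetty answered the request
    if "the requested resource is not found" in text:
        return "not-jetty"  # wording of the page quoted above
    return "unknown"


if __name__ == "__main__":
    sample = "<BODY><h2>Not Found</h2><p>HTTP Error 404. The requested resource is not found.</p></BODY>"
    print(error_page_source(sample))
```

If the page did not come from Jetty, the request probably never reached Solr, which points to a proxy, firewall, or wrong host rather than the Solr configuration.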

Why does Nutch index to the wrong Solr collection even though the solr.server.url parameter is set?

I integrated Nutch 1.15 with Solr 8.0, but when I use the following command
nutch/bin/crawl -i -D solr.server.url=http://192.168.199.109:8983/solr/csdn -s ./csdn-seed/ ./data/csdn 1
to index crawled data from Nutch to Solr, it throws this exception in hadoop.log:
2019-03-23 02:03:07,491 WARN mapred.LocalJobRunner - job_local1877827743_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/nutch: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/nutch/update. Reason:
<pre> Not Found</pre></p>
</body>
</html>
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/nutch: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/nutch/update. Reason:
<pre> Not Found</pre></p>
</body>
</html>
But I set solr.server.url to /solr/csdn, didn't I? Why does it say it is indexing to /solr/nutch?
The way indexer plugins are configured changed with Nutch 1.15: all indexer plugins are now configured in a single XML file (conf/index-writers.xml), and setting or overwriting configuration parameters via Nutch properties is no longer possible. That is why the -D solr.server.url option is ignored and the URL from the file is used instead.
See https://wiki.apache.org/nutch/IndexWriters for how to configure the Solr server URL. This breaking change was necessary to allow multiple indexers of the same type, e.g. multiple Solr instances.
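To see which Solr URL Nutch will actually use, one option is to read it out of conf/index-writers.xml. The sketch below assumes the file contains writer elements with an id attribute and nested param elements of the form name="url" value="..."; check your own file, since the layout is an assumption here. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix such as '{http://...}writer' down to 'writer'."""
    return tag.rsplit("}", 1)[-1]


def writer_urls(xml_text):
    """Map each writer id to the value of its 'url' param, if it has one."""
    root = ET.fromstring(xml_text)
    urls = {}
    for writer in root.iter():
        if local(writer.tag) != "writer":
            continue
        for param in writer.iter():
            if local(param.tag) == "param" and param.get("name") == "url":
                urls[writer.get("id")] = param.get("value")
    return urls


if __name__ == "__main__":
    with open("conf/index-writers.xml", encoding="utf-8") as fh:
        print(writer_urls(fh.read()))
```

If the output still shows a URL ending in /solr/nutch, edit that value in the file to point at the intended collection.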

What SOLR configuration is required to fetch an html page and parse it?

I've been consulting one tutorial after another and have spent oodles of time searching.
I installed SOLR from scratch and started it up.
bin/solr start
I successfully navigate to the SOLR admin. Then I create a new core.
bin/solr create -c core_wiki -d basic_configs
I look at the help for the bin/post command.
bin/post -h
...
* Web crawl: bin/post -c gettingstarted http://lucene.apache.org/solr -recursive 1 -delay 1
...
So I try to make a similar call... but I keep getting a FileNotFound error.
bin/post -c core_wiki http://localhost:8983/solr/ -recursive 1 -delay 10
/usr/lib/jvm/java-7-openjdk-amd64/jre//bin/java -classpath /home/ubuntu/src/solr-5.4.0/dist/solr-core-5.4.0.jar -Dauto=yes -Drecursive=1 -Ddelay=10 -Dc=core_wiki -Ddata=web org.apache.solr.util.SimplePostTool http://localhost:8983/solr/
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/core_wiki/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file endings xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
Entering recursive mode, depth=1, delay=10s
Entering crawl at level 0 (1 links total, 1 new)
SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url: http://localhost:8983/solr/core_wiki/update/extract?literal.id=http%3A%2F%2Flocalhost%3A8983%2Fsolr&literal.url=http%3A%2F%2Flocalhost%3A8983%2Fsolr
SimplePostTool: WARNING: Response: <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/core_wiki/update/extract. Reason:
<pre> Not Found</pre></p><hr><i><small>Powered by Jetty://</small></i><hr/>
</body>
</html>
SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException: http://localhost:8983/solr/core_wiki/update/extract?literal.id=http%3A%2F%2Flocalhost%3A8983%2Fsolr&literal.url=http%3A%2F%2Flocalhost%3A8983%2Fsolr
SimplePostTool: WARNING: An error occurred while posting http://localhost:8983/solr
0 web pages indexed.
COMMITting Solr index changes to http://localhost:8983/solr/core_wiki/update/extract...
SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url: http://localhost:8983/solr/core_wiki/update/extract?commit=true
SimplePostTool: WARNING: Response: <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/core_wiki/update/extract. Reason:
<pre> Not Found</pre></p><hr><i><small>Powered by Jetty://</small></i><hr/>
</body>
</html>
Time spent: 0:00:00.041
I'm still fairly new to SOLR indexing. Any hints that could point me in the right direction would be appreciated.
It seems that the request handler named /update/extract is missing from your configuration.
The ExtractingRequestHandler is not incorporated into the solr war file, it is provided as a SolrPlugins, and you have to load it (and its dependencies) explicitly. (Apache Solr Wiki)
It should be defined in solrconfig.xml, like this:
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" />
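A quick way to confirm whether a core's solrconfig.xml declares the handler is to parse the file and look for it. This is a sketch with a hypothetical function name; the path to solrconfig.xml depends on your installation and is left for you to supply.

```python
import xml.etree.ElementTree as ET


def has_request_handler(solrconfig_xml, name="/update/extract"):
    """Return True if the config text declares a requestHandler with this name."""
    root = ET.fromstring(solrconfig_xml)
    return any(h.get("name") == name for h in root.iter("requestHandler"))


if __name__ == "__main__":
    import sys
    # Usage: python check_handler.py <path-to-solrconfig.xml>
    with open(sys.argv[1], encoding="utf-8") as fh:
        print(has_request_handler(fh.read()))
```

A result of False matches the 404 on /update/extract seen above. A result of True with the 404 still occurring would suggest the handler failed to load, for example because its jars are missing.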

Indexing .tar.gz files in Solr 5.3.1: HTTP Error 405 POST not supported

On a Solr 5.3.1 installation with /update working as expected, I tried to index a .tar.gz file with the update/extract request handler:
curl "http://localhost:8983/solr/#/myfirstcore/update/extract?literal.id=adocument&commit=true" -H 'Content-type:application/octet-stream' --data-binary "#encapsulate.tar.gz"
But I receive the following:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>Error 405 HTTP method POST is not supported by this URL</title>
</head>
<body><h2>HTTP ERROR 405</h2>
<p>Problem accessing /solr/admin.html. Reason:
<pre> HTTP method POST is not supported by this URL</pre></p><hr><i><small>Powered by Jetty://</small></i><hr/>
</body>
</html>
Under the admin panel, the update/extract specification is
/update/extract
class:org.apache.solr.handler.extraction.ExtractingRequestHandler
version:5.3.1
description:Add/Update Rich document
src:null
And solr was generally installed according to these directions: Digital Ocean: Installing Solr 5.2.1 on Ubuntu 14.4
Given the above error message, how can I configure Solr to index zipped files (including .tar.gz)? The use case is to associate content with taxonomy metadata stored in JSON format by zipping them together. This way Solr will index both the documents and the associated taxonomy metadata together, and follow-on partial update commands are not needed.
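Solution: the #/ in the URL belongs to the admin UI address, not to the request handler path, so the request never reached /update/extract. Change: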
curl "http://localhost:8983/solr/#/myfirstcore/update/extract?literal.id=adocument&commit=true" -H 'Content-type:application/octet-stream' --data-binary "#encapsulate.tar.gz"
to
curl "http://localhost:8983/solr/myfirstcore/update/extract?literal.id=adocument&commit=true" -H 'Content-type:application/octet-stream' --data-binary "#encapsulate.tar.gz"
And a query for id=adocument returns 1 hit. That it didn't pick up the fields is a separate issue.
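The reason the first URL fails can be seen by splitting it: everything after # is a fragment, which the client keeps to itself and never sends to the server. The sketch below uses Python's standard urllib.parse to show what path each URL actually requests; server_side_path is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def server_side_path(url):
    """Return the path and query sent to the server; the fragment is dropped."""
    parts = urlsplit(url)
    return parts.path + ("?" + parts.query if parts.query else "")


broken = "http://localhost:8983/solr/#/myfirstcore/update/extract?literal.id=adocument&commit=true"
fixed = "http://localhost:8983/solr/myfirstcore/update/extract?literal.id=adocument&commit=true"

if __name__ == "__main__":
    print(server_side_path(broken))  # only the admin UI root is requested
    print(server_side_path(fixed))   # the extract handler is requested
```

With the broken URL only /solr/ is requested, which is presumably why the 405 response mentions /solr/admin.html rather than the extract handler.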
