solr-cell search works for some pdfs not others - solr

I have been searching for two days and have not been able to find an answer.
I have solr installed from the repos on an Ubuntu server running on tomcat 6. I have added the solr-cell jar and tika libraries.
I can run a curl command that works for some PDF files and indexes them fine, but it does not work for others. At first I thought that some files were corrupted, but that does not appear to be the case. I cannot see any major difference between the ones that work and those that don't.
The error I get is a 500 error - see example here
The curl request I make is:
$ curl 'http://mysolrserver.com:port/solr/update/extract?map.content=text&map.stream_name=id&extractOnly=true&commit=true' -F "file=@/absolute/path/to/file.pdf"
This does work for some PDFs fine, just not others.
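To narrow down which files trigger the 500, one option is to post every PDF in a directory and record the HTTP status of each request. The following is a minimal sketch, assuming bash and curl; `SOLR_EXTRACT_URL`, `is_failure`, `extract_status`, and `report_failures` are hypothetical names, and the URL is the placeholder from the question, not a real endpoint.

```shell
#!/usr/bin/env bash
# Sketch only: placeholder URL copied from the question; adjust host and port.
SOLR_EXTRACT_URL='http://mysolrserver.com:port/solr/update/extract?map.content=text&map.stream_name=id&extractOnly=true&commit=true'

# Succeeds (exit 0) when the status is anything other than a 2xx code.
is_failure() {
  case "$1" in
    2??) return 1 ;;
    *)   return 0 ;;
  esac
}

# Posts one file to the extract handler and prints only the HTTP status code.
extract_status() {
  curl -s -o /dev/null -w '%{http_code}' "$SOLR_EXTRACT_URL" -F "file=@$1"
}

# Prints "<status> <path>" for every PDF in the given directory that fails.
report_failures() {
  local f status
  for f in "$1"/*.pdf; do
    [ -e "$f" ] || continue   # skip the literal pattern when no PDF matches
    status=$(extract_status "$f")
    if is_failure "$status"; then
      printf '%s %s\n' "$status" "$f"
    fi
  done
}
```

Calling `report_failures /absolute/path/to` would then list only the files that need a closer look, which makes it easier to compare them against the ones that succeed.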
I believe I have solr 1.4.0 installed.
Any help would be appreciated - thank you
--EDIT--
I am using Ubuntu 10.04.1 if that helps at all.

A NullPointerException is probably a bug. Report it to PDFBox and/or Tika.

OK, the nightly snapshot of Solr uses PDFBox 1.3.1, whereas the current stable release uses 0.7.*, so a fair number of revisions separate the two.
I can index all the PDFs using this snapshot version of Solr, so this looks like something that will be fixed in the next stable version.

Related

Mint 18.1, Ckan, SOLR schema version not supported

I'm a front-end developer and I need to build a CKAN theme. To do so, I need a working source install of CKAN on my system. I'm using Mint 18.1 and installing CKAN 2.6.2.
Following the installation steps in CKAN's docs, I got a warning and an error at step 6, as shown in the image.
As you can see, the last line says SOLR schema version not supported: 2.7. Supported versions are [2.3] and I can't proceed with the installation. Searching the Internet, I found people having the same problem, but they were using Docker (I have no idea what that is) and their solutions didn't work for me.
Because I have very little time to build this theme, I gave up on CKAN 2.6.2 and installed 2.5.2, and everything worked fine.
The SOLR schema that comes with CKAN 2.6.2 is version 2.3, so somehow you have got 2.7, which is provided with later versions of CKAN. Maybe you installed CKAN master and the schema is lingering from then.
Here are some steps so that you can find out where the problem is:
You can check the version of the schema in the CKAN source repo on your disk:
grep 'name="ckan" version=' /usr/lib/ckan/default/src/ckan/ckan/config/solr/schema.xml
You would have then installed this file into Solr (in Step 5, using the 'ln' command). You can check the version in Solr:
grep 'name="ckan" version=' /etc/solr/conf/schema.xml
(When this file is changed, you need to restart SOLR (i.e. jetty) for it to take effect - see the docs again).
You can see what schema SOLR is actually using:
curl -s 'http://localhost:8983/solr/admin/file/?contentType=text/xml;charset=utf-8&file=schema.xml'|grep 'name="ckan" version='
Please do feed back on these.
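The three checks above can be reduced to a small helper that prints only the version number, which makes the values easy to compare side by side. This is a sketch, assuming bash and sed; `schema_version` is a hypothetical helper name, and the paths and URL in the comments are the ones used in this answer.

```shell
# Prints the version attribute of the <schema name="ckan" ...> element.
# Reads the files given as arguments, or standard input when none are given.
schema_version() {
  sed -n 's/.*name="ckan" version="\([^"]*\)".*/\1/p' "$@" | head -n 1
}

# Example calls (not run here):
#   schema_version /usr/lib/ckan/default/src/ckan/ckan/config/solr/schema.xml
#   schema_version /etc/solr/conf/schema.xml
#   curl -s 'http://localhost:8983/solr/admin/file/?contentType=text/xml;charset=utf-8&file=schema.xml' | schema_version
```

If the three calls print different numbers, the mismatch shows at which step the wrong schema came in.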
It sounds like your Docker container for SOLR is a newer version that is not compatible with CKAN 2.6.2.

How to upgrade Solr 5 version already in production on Linux (installed as a service)?

What is the best way to update a Solr 5 version in production (in other words installed as a service) on Linux? I have an already installed Solr 5.0 (via the Service Installation Script) and now need to upgrade it to Solr 5.2.1. Realizing some of the config files will need to be changed to take advantage of recent changes, after stopping the current instance, is the best way to simply run the new Solr 5.2.1 Service Installation Script or just untar the 5.2.1 solr-5.2.1.tgz to /opt or something else? Fortunately, I have a very simple set up (not SolrCloud).
After actually looking into the /opt folder it is fairly obvious I just need to untar solr into that folder and change the solr symbolic link to point to the new version. This should work most of the time keeping in mind that occasionally, as Jay pointed out, there could be changes to the solr files that could possibly require more than this.
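The symlink switch described above might look like the following. It is a sketch under stated assumptions: the Solr service is stopped before the call and started again afterwards, the tarball unpacks to a directory such as `solr-5.2.1`, and `switch_solr_link` is a hypothetical helper name rather than anything shipped with Solr.

```shell
# Unpacks a Solr tarball into the install directory and repoints the
# "solr" symlink in that directory at the newly unpacked version.
# Usage: switch_solr_link <tarball> <install_dir> <version_dir_name>
switch_solr_link() {
  local tarball=$1 install_dir=$2 version_dir=$3
  tar -xzf "$tarball" -C "$install_dir" || return 1
  [ -d "$install_dir/$version_dir" ] || return 1
  # -n replaces the link itself instead of creating a link inside the old target.
  ln -sfn "$install_dir/$version_dir" "$install_dir/solr"
}

# Example call (not run here):
#   switch_solr_link solr-5.2.1.tgz /opt solr-5.2.1
```

The old version directory is left in place, so rolling back is a matter of pointing the link at it again.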

can't index rich documents on both solr 3.6 and solr 4.0 using update/extract getting "#500 lazy loading error"

I've just started learning Solr, and I've been stuck for the last 3 days: I cannot index rich documents on Solr 3.6 or 4.0. I am using Windows 7 64-bit.
Here is what I tried:
First I installed Solr 3.6 with tomcat-jetty, using BitNami Apache.
1. Tried the -Durl command. What I got:
error #500 lazy loading error
2. Downloaded curl for my Windows machine and tried curl. I got: error #500 lazy loading error
3. Copied a program from the Solr tutorial that uploads a file using SolrJ, ran it in the NetBeans IDE, and tried to index a PDF file using update/extract. Then I got:
org.apache.solr.common.SolrException: Server at "myServer:port/solr" returned non ok status:500, message:Internal Server Error
4. Changed solrconfig.xml to remove startup=lazy from the update/extract request handler, and got the same thing.
I re-installed Solr 3.6 but still couldn't get it to work. 4.0 gives the same error.
The same problem also occurs with some other request handlers, such as /browse.
Should I switch to Linux?
It looks like the packager (Bitnami) did not include the extraction library, even though they left Solr configured to use it. You can ask them to resolve it, or you can deploy it yourself.
Here's how to deploy Solr on Tomcat. It's equally easy to install on Windows, and it starts as a Windows service. Once installed, to enable rich document support, copy the contents of contrib/extraction/lib/ to a directory and point the sharedLib in solr.xml to that directory. If you have used that guide, you will understand those new terms :-)

error when using solr and Integrating nutch and solr(HTTP ERROR 500)

I have Linux Ubuntu 12.04 installed and I'm trying to install Nutch 1.5.1 and Solr 3.6.1 and integrate them to crawl seed URLs.
I'm using this tutorial to get this working.
I followed the steps before 3.2, skipped to step 4, and I can access
localhost:8983/solr/admin/
without error.
But after going to step 6 and copying schema.xml from the conf folder of Nutch to the example/solr/conf folder of Solr, the solr/admin page shows a Java error, below:
How can I handle that?
One more thing to ask.
I have another tutorial for this that looks good, but its first step says to add some code to the nutch-site.xml file in the /conf/ and /runtime/local/conf/ folders, and there is no runtime folder in my nutch folder. Step 4 mentions this folder too.
Any suggestions?
Thanks in advance.
This is a bit of a red herring. The line that specifies the version number, something like:
<schema name="nutch" version="1.5.1">
is causing it, because the value of version is parsed as a float. Remove the extra dot: change it to 1.5 or 1.51 to make it a valid float, then restart your Solr instance. The exception should disappear.
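A candidate version string can be checked before editing schema.xml and restarting. The following is a sketch, assuming bash and grep; `is_valid_float` is a hypothetical helper, and its pattern (digits with at most one decimal point) is a deliberately simplified stand-in for what a float parser accepts.

```shell
# Succeeds (exit 0) when the argument is digits with at most one decimal point.
is_valid_float() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+(\.[0-9]+)?$'
}

# Example calls (not run here):
#   is_valid_float 1.5   && echo "ok"        # accepted
#   is_valid_float 1.5.1 || echo "rejected"  # two dots, not a float
```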
Please check whether Nutch 1.5.1 and Solr 3.6.1 are compatible (that is, whether they ship the same versions of the lucene-core and solr-solrj jars). I have had problems with incompatible versions, but not with 1.5/3.6.

Solr + Jetty Gives HTTP 503 on Debian

(This is a cross-post from serverfault. I'm posting it here because no one answered my post there, and I feel that this sits in an awkward space halfway between stackoverflow and serverfault.)
I have modified the example project included with Solr for my needs (removing things like the example stopwords and defining my own schema). Running this project on my mac, everything works fine: I can start Jetty and run search queries. But when I push the project out to a Debian system, I get this error when I try to do search queries:
HTTP ERROR: 503
SERVICE_UNAVAILABLE RequestURI=/solr
Powered by jetty://
The request log shows that a request was made:
10.10.124.14 - - [22/06/2010:22:34:52 +0000] "GET /solr HTTP/1.1" 503 1311
No error log is produced (at least not in the ./logs directory).
I have tried to run this project both on openjdk and the Sun JRE. Both started jetty fine, but produced the same error when searching. I am running Debian 9.0.4.
The issue is probably that the data store on Debian is /var/lib/solr/data, and you need to set that in your version of solrconfig.xml instead of the default, which is in the base directory /usr/share/solr/ and could be on a read-only file system.
I've packaged the latest Solr version in Debian Testing. It seems that there is some error in the Solr configuration, so that Jetty starts but can't start the Solr servlet. You must look in the Jetty error log to find the reason.
There's a lack of manpower in Java packaging for Debian, so it may well be that there is an error in the solr-jetty package.
The solr-jetty package in Debian stable doesn't work as I recall. Please try from Debian testing!
If you indeed find an error, please don't use random forums but post a bug on bugs.debian.org!
