parsering (using Tika) on remote glassfish - solr

I'm using Tika parser to index my files into Solr. I created my own parser (which extends XMLParser). It uses my own mimetype.
I created a jar file which inside looks like this:
src
|-main
|-some_packages
|-MyParser.java
|resources
|-META-INF
|-services
|-org.apache.tika.parser.Parser (which contains a line:some_packages.MyParser.java)
|_org
|-apache
|-tika
|-mime
|-custom-mimetypes.xml
In custom-mimetypes I put the definition of new mimetype becouse my xml files have some special tags.
Now where is the problem: I've been testing parsing and indexing with Solr on glassfish installed on my local machine. It worked just fine. Then I wanted to install it on some remote server. There is the same version of glassfish installed (3.1.1). I copied-pasted Solr application, it's home directory with all libraries (including tika jars and the jar with my custom parser). Unfortunately it doesn't work. After posting files to Solr I can see in content-type field that it detected my custom mime type. But there are no fields that suppose to be there like if MyParser class was never runned. The only fields I get are the ones from Dublin Core. I checked (by simply adding some printlines) that Tika is only using XMLParser.
Have anyone had similar problem? How to handle this?

Problem was that I was using Java 7 to compile my parser but Apache Tika was compiled with Java 5...

Related

uploaded data is not visible in solr

I am using solr4.7. I have created a new core by copying "collection1(default example provided by solr)" to different name say "wiki" and updated core.properties with new name. Hence new core is visible at solr admin panel.
After starting solr, I am trying to import the data to new core like below.
$ java -jar post.jar ../../../enwiki-20150602-pages-articles1.xml -Durl='http://localhost:8983/solr/#/wiki/update'
SimplePostTool version 1.5
Posting files to base url http://localhost:8983/solr/update using content-type application/xml..
POSTing file enwiki-20150602-pages-articles1.xml
SimplePostTool: WARNING: No files or directories matching -Durl=http:/localhost:8983/solr/#/wiki/update
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/update..
Time spent: 0:00:03.671
I also tried
$ java -jar post.jar ../../../enwiki-20150602-pages-articles1.xml
But still while querying at solr admin panel I am not getting any data.So my question is if data has been indexed then why I can't see it. Where exactly I am doing wrong.
Not sure this has been resolved.
I had the exact same problem. You need to specify the path/location of where your file to be ingested is located.
C:\test-solr>java -Durl=http://localhost:8983/solr/testdemo/update -Dtype=text/csv -jar C:/test-solr/exampledocs/post.jar "C:\test-solr\exampledocs\ingestMeFile.csv"

Solr (4.4+) solrconfig.xml location when creating cores

I'm Trying to setup a multi core solr server for our webapplication but i'm having trouble creating new core through the coreadmin service.
I'm using Solr-4.4 because 4.3 ran into problems persisting the cores in solr.xml (datadir wasn't preserved) So i'm using the new Solr.xml configuration 4.4 and beyond
My solr.xml currently looks like:
<solr>
<str name="coreRootDirectory">default-instance/cores/</str>
</solr>
solrconfig.xml is located at (solrhome)/default-instance/conf/solrconfig.xml
When trying to create a core with the url
http:/example.org/solr/admin/cores?action=CREATE&name=test-name&schema=schema-test.xml&loadOnStartup=false
gives me the error:
Error CREATEing SolrCore 'test-name': Unable to create core: test-name
Caused by: Can't find resource 'solrconfig.xml' in classpath or
'default-instance/cores/test-name/conf/', cwd=/var/lib/tomcat7
The following seems to work:
http:/example.org/solr/admin/cores?action=CREATE&name=test-name&schema=schema-test.xml&loadOnStartup=false&config=/absolute/file/path/to/solrconfig.xml
The problem is this only seems to work with a absolute path (or possibly a relative path from /var/lib/tomcat7) which is not a workable solution.
What i'm looking for is a way to place solrconfig.xml so it can be used to create new cores with that config (or a way the create those cores with the current location).
More or less the same will be needed for schemas
This worked. Ran on command line and was viewable in admin console:
solr create -c (name for core or collection)
See README.txt for more info.
In my case I took advantage of the Core Discovery feature in 4.4+, rather than creating the core using the management web interface.
This simply involved copying the example collection1 folder from the examples directory (which I usually use as a starting point).
Then I had to make sure that there is core.properties in the root of my new core with name=<new core name> inside. Solr automatically detected the new core and allowed me to use it without any fuss.
This avoided the trouble of having to copying solrconfig.xml and schema.xml into any special location.
I had the same problem: solrconfig.xml was not in the classpath. I solved it by copying my configuration file templates into the classpath.
So I took a look at http://localhost:8983/solr/#/~java-properties to see solrs classpath definition and then i copied the template solrconfig.xml and schema.xml into the folder C:\servers\solr-4.4.0\example\resources. Furthermore i copied all the stopwords stuff there...
This solution is not a fully satisfying, but it works. Adding another path to the classpath should work, too. I'm slightly astonished that no default configuration for new cores can be declared within solr.xml
I recommend the new Config Sets for this use case.
If you place your schema.xml and solrconfig.xml (and other config files like stopwords etc.) in a directory $SOLR_HOME/configsets/myConfig/conf, you can create a new core with this config by calling:
http://solr/admin/cores?action=CREATE&name=mycore&instanceDir=my_instance&configSet=myConfig
See https://cwiki.apache.org/confluence/display/solr/Config+Sets
But they are not available until Solr 4.8, see https://issues.apache.org/jira/browse/SOLR-4478

How to install JTS in Solr 4?

Using the Solr 4 spatial field types seems to require an external library, the Java Topology Suite. How does one install this suite for use with Solr 4.1.0 on Ubuntu Server 12.04 with Java 1.6.0_24?
Thank you.
If you are running Solr in Tomcat on your Ubuntu Server and have deployed the Solr WAR into your <path to Tomcat>/webapps folder. Then according to the Lucene / Solr 4 Spatial documentation on the Solr Wiki, you just need to copy the all the jar files from the JTS distribution /lib folder to the WEB-INF/lib folder where Solr is running.
Update
Since you are using Jetty to run Solr, you will need to include the location of the JTS jar files as a classpath. Based on the Classloading Jetty documentation, something like the following should work:
java -Dsolr.solr.home=/mnt/SolrFiles/solr
-Djetty.class.path=<insert path to JTS here> -jar /opt/solr-4.1.0/example/start.jar
The JTS JAR file needs to be placed in the Solr web application's WEB-INF/lib folder. Otherwise you may encounter a NoClassDefFoundError: com/vividsolutions/jts/geom/Geometry when starting Solr.

can't index rich documents on both solr 3.6 and solr 4.0 using update/extract getting "#500 lazy loading error"

I'v just started to learn solr. From last 3 days I'm in trouble. I can not
index rich documents on solr 3.6 and 4.0. I am using windows7 64bit.
what i tried is as:
First I installed solr 3.6 with tomcat-jetty.using BitNami Apache
1.tried -Durl command what i got :
error #500 lazy loading error
2.Download curl for my window machine and tried curl i got: error #500 lazy loading error
3.copied a program from solr tutorial to upload a file using solrJ for
SolrJ in NetBeans IDE and tried a pdf files to indexed using
update/extract
then i got:
org.apache.solr.common.SolrException: Server at
"myServer:port/solr" returned non ok status:500, message:Internal
Server Error
4.changed solconfig.xml so removed startup=lazy from update/extract
request handler and got the same thing
I re-installed solr 3.6 again but can't succeed. 4.0 gives the same error.
Same problem with some other request handler also like /browse says
etc.
Should i switch to Linux?
Looks like the packager (Bitnami) did not include that library, even though they left Solr configured to use that library. You may ask them to resolve it. Or you can deploy it yourself.
Here's how to deploy Solr on Tomcat. Its equally easy to install on Windows; and it starts as a Windows service. Once installed, to enable the rich document support, copy the contents of contrib/extraction/lib/ to a directory and point the sharedLib in solr.xml to that directory. If you have used that guide, you will understand those new terms :-)

NoClassDefFoundError MimeTypeException with PDF extraction

I am getting an exception trying to use update/extract with PDF files
My Set up is:-
Ubuntu Server 11.10
Tomcat 6
Solr 3.5.0.2011.11.22.15.54.38
I can browse to solr/admin OK
I have put all the contrib/extract and apache-solr-cell3.5.0.jar libraries into the tomcat folder webapps/solr/WEB-INF/lib
I am calling extract using:-
curl "http://localhost:8080/solr/update/extract?uprefix=attr_&fmap.content=attr_content&commit=true" -F "file=/path/to/my.pdf"
error is
java.lang.NoClassDefFoundError: org/apache/tika/mime/MimeTypeException
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:383)
at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:425)
at org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:461)
at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:248)
at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:239)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
Would appreciate any pointers - the only time this error seems to come up elsewhere is with Nutch and cached results.
I have tried sending the mimetype in the querystring and also a *.doc file but got the same error.
According to the error message it is not a MimeTypeException exception you get: The problem is a NoClassDefFoundError, because Solr cannot load the class MimeTypeException.
Normally this class is present in tika-core.jar.
Make sure you actually have that file and also check if you have a lib statement in your solrconfig.xml pointing to the right directory.
This was due to the basic error of copying the necessary tika libraries (to tomcat6/webapps/solr/WEB-INF/lib) but leaving ownership of the jar files as ROOT instead of chown-ing them to TOMCAT6. After setting the right permission and restarting Tomcat it started working OK
Found the solution of this problem, I was using SolrJ to update my pdf indexing.
after deploy solr to tomcat, I didn't include the following libraries into the tomcat/webapp
and I get all the lazy loading problem, etc etc
I even try to get apache tika...
until I do this...
shutdown tomcat
\apache-solr-3.5.0\contrib\extraction
copy the libraries above to below
\apache-tomcat-7.0.26\webapps\solr\WEB-INF\lib
startup tomcat
cheers

Resources