Integrating Grobid with Tika and Solr

I'm using Solr to index journal articles. Using the out-of-the-box configuration, it indexed the text of the documents, but I'm looking to use Grobid to pull out the authors, title, affiliations, etc. I got Grobid up and running as a service.
I added
<str name="tika.config">/path/to/tika-config.xml</str>
to the requestHandler for /update/extract in solrconfig.xml
The tika-config looks like:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<properties>
<parsers>
<parser class="org.apache.tika.parser.journal.JournalParser">
<mime>application/pdf</mime>
</parser>
</parsers>
</properties>
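The import attempt itself looks something like this (the core name and file path are just placeholders for whatever your setup uses):
# placeholder core "articles"; adjust the URL to your own core
curl "http://localhost:8983/solr/articles/update/extract?literal.id=article1&commit=true" -F "myfile=@/path/to/article.pdf"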
I'm getting a ClassNotFoundException when I try to import a document, but I can't figure out where to set the classpath to fix it.

As mentioned on the Solr users' list, the latest version of Solr (6.0.0) ships a version of Tika (1.7) that predates the addition of Grobid support (which arrived in Tika 1.11). To follow the upgrade to Tika 1.13, see SOLR-8981.
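Until Solr picks that up, one workaround is to bypass Tika's JournalParser and call the already-running Grobid service directly, then index the extracted header yourself. A minimal sketch, assuming a recent Grobid with its REST API on the default port (port and endpoint path may differ for your install):
# ask Grobid for the header metadata (authors, title, affiliations) as TEI XML
curl -F "input=@/path/to/article.pdf" "http://localhost:8070/api/processHeaderDocument" > article-header.tei.xml
The TEI result can then be mapped to Solr fields and posted with a normal update request.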

Related

Errors while trying to configure Solr 5.3.1 on Windows 10

I'm trying to set up a very basic configuration of Solr, to read some text from a MySQL table and index it. I'm following the steps in the DIH Quick Start document.
The document doesn't tell you where to place solrconfig.xml.
At first I tried placing it under the solr-5.3.1 folder (next to bin). That failed. Then I noticed the "Add Core" button was looking for it in server\solr\new_core, so I put it there, but then got this other error:
My data import handler looks like this:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
And here's data-config.xml:
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost/ctcrets"
user="root"
password="xxxx"/>
<document>
<entity name="id"
query="select RETS_STAGE1_QUEUE_ID as id, LN_LIST_NUMBER as name, xmlText as desc from RETS_STAGE1_QUEUE">
</entity>
</document>
</dataConfig>
What could be the problem?
The document assumes you already know the solr.home [1] directory structure. On top of that, I think it assumes you started the sample Solr instance (e.g. ./solr start -p 8984), where everything is already set up.
Once started, you can see on the dashboard where the configuration is located. Go there, change the files as suggested, and RELOAD the core through the admin console (CoreAdmin). If you want, you can also do a stop / restart.
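The reload can also be done over HTTP; a sketch, assuming the core is named new_core and the port from the start command above:
# reload just this core without restarting Solr
curl "http://localhost:8984/solr/admin/cores?action=RELOAD&core=new_core"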
As side notes:
the DIH is not part of the Solr core, so you need a "lib" directive in solrconfig.xml; as far as I remember, the sample config already has those directives, so you don't need to "import" the DIH lib
the JDBC driver that allows the connection to the database is not included, so your classpath (i.e. the JVM or Solr classpath, through the same lib directive) must include this additional lib (see the sketch below).
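For the second note, a minimal sketch: drop the MySQL connector jar into the core's lib directory, which Solr picks up automatically (the connector version and paths here are assumptions):
mkdir -p server/solr/new_core/lib
cp mysql-connector-java-5.1.38-bin.jar server/solr/new_core/lib/
Then reload the core as above so the jar is actually loaded.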
[1] http://www.solrtutorial.com/configuring-solr.html

Default core name in Solr 1.4.1

I cannot set up a default core in Solr 1.4.1:
<cores adminPath="/admin/cores" defaultCoreName="core0">
It doesn't work. The server starts as usual and works, but doesn't allow making requests without a core name. I went through the release notes and couldn't find when they started supporting this parameter. Does Solr 1.4.1 support it? What are the other options?
UPD: The whole solr.xml config looks like this:
<solr persistent="true">
<cores adminPath="/admin/cores" defaultCoreName="core0">
<core name="core0" instanceDir="./core0" />
<core name="core1" instanceDir="./core1" />
</cores>
</solr>
How I check:
1) Check without core (now returns HTTP 400, "missing solr core name in path"):
http://127.0.0.1:8080/solr/select?q=test&version=2.2&start=0&rows=10&indent=on
2) Check with core (response is correct): http://127.0.0.1:8080/solr/core0/select?q=test&version=2.2&start=0&rows=10&indent=on
The answer is no: defaultCoreName is not supported in Solr 1.4.1.
I installed Solr 3.5 and set up a multicore environment; after specifying defaultCoreName I was able to perform the following two requests, and the response was the same:
http://127.0.0.1:8080/solr/select/?q=solr&version=2.2&start=0&rows=10&indent=on
and
http://127.0.0.1:8080/solr/core0/select/?q=solr&version=2.2&start=0&rows=10&indent=on
Yes, it is supported in 1.4.
You can set persistent="true" if you want to add more cores later through the Solr core-management API.
If you don't want the default core, you can remove it from the XML.
You can have your solr.xml as below and try:
<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
<cores adminPath="/admin/cores" defaultCoreName="collection1">
<core name="collection1" instanceDir="./"/>
</cores>
</solr>
Another option would be to add your own core, specifying the data dir and instance dir:
<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
<cores adminPath="/admin/cores">
<core name="Test" instanceDir="/home/abhijit/Downloads/Solr/" dataDir="/home/abhijit/Downloads/Solr/Test/data"/>
</cores>
</solr>

Getting clear content (without markup) with Nutch 1.9

Using Nutch 1.9, how do I get the clear content (without HTML markup) of crawled pages and save the .content in readable form? Is Solr the way to do that, or can it be done without it, and how?
And a subquestion: how do I control the crawling depth with the bin/crawl script? There was an option for that (and topN) in the bin/nutch crawl command, but it is deprecated now and won't execute.
Add this to nutch-site.xml:
<!-- tika properties to use BoilerPipe, according to Marcus Jelsma -->
<property>
<name>tika.use_boilerpipe</name>
<value>true</value>
</property>
<property>
<name>tika.boilerpipe.extractor</name>
<value>ArticleExtractor</value>
</property>
// This is for Nutch 1.7; I'm not sure about 1.9.
Then use jsoup to get plain text.
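If you don't want to go through Solr, you can dump the parsed (boilerpipe-stripped) text straight from a segment. A sketch, where the segment path is an example and the flags are the standard Nutch 1.x readseg options:
# keep only the parsed text; suppress raw content and crawl bookkeeping
bin/nutch readseg -dump crawl/segments/20150101000000 dump_text -nocontent -nofetch -nogenerate -noparse -noparsedata
As for the depth subquestion: in the 1.9 bin/crawl script the last argument is the number of rounds, which effectively plays the old depth role; topN is no longer a flag but a variable (sizeFetchlist, if I recall correctly) inside the script:
# usage: bin/crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
bin/crawl urls/ myCrawl http://localhost:8983/solr/ 3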

My Solr core1 (multicore) is not working

I am trying to set up multiple cores; somehow I am able to run core0, but core1 is not found and I get a 404 error. Can anybody tell me what the right configuration in solr.xml is?
I am referring to the Solr wiki CoreAdmin help.
Thanks!
As described on the Solr Wiki, solr.xml should look like:
<solr persistent="true" sharedLib="lib">
<cores adminPath="/admin/cores">
<core name="core0" instanceDir="core0" />
<core name="core1" instanceDir="core1" />
</cores>
</solr>
And your solr directory should look like:
-solr
-core0
+conf
+data
-core1
+conf
+data
+lib
solr.xml
Also run the create-core command when adding new cores. The command for creating a core is:
http://localhost:8983/solr/admin/cores?action=CREATE&name=coreX&instanceDir=path_to_instance_directory&config=config_file_name.xml&schema=schema_file_name.xml&dataDir=data
You should give more details about your problem in order to get a more detailed answer.

Configure DataImportHandler in SolrCloud with ZooKeeper

I have a SolrCloud set up as described in "exploration of SolrCloud"; the difference is that I use Solr 4.0.0 Beta. Shortly, the configuration:
ZooKeeper on default port 2181
3 instances of Solr running on different ports
This is just for testing purposes. The desired configuration is with 3 ZooKeeper instances (one for every Solr instance). I managed to index some XML files with the curl command.
Questions:
How can I configure the DIH/collection? I managed to change solrconfig.xml (config for the dataimport handler) and add the proper driver for the DB connection in lib, but in the Solr admin I get "sorry, no dataimport-handler defined!". The changes can be seen in ZooKeeper (I see data_config.xml), and in the Solr admin panel I can see the updated version of solrconfig.xml.
Any good tutorial for a production deployment of SolrCloud (with something like the desired configuration mentioned before) on a single machine or multiple machines with Ubuntu 12.04 LTS?
Any advice would be appreciated! Thanks in advance!
Normally, DIH config has nothing to do with whether you're using a single Solr instance or multiple instances in a SolrCloud config. DIH writes data to the current instance's Lucene index, and SolrCloud's distributed update handling (coordinated through ZooKeeper) spreads it around to the other instances.
Make sure your DIH is properly configured:
In solrconfig.xml, all necessary libraries are loaded. This means the two DIH jars:
<lib dir="../../../dist/" regex="solr-dataimporthandler-4.3.0.jar" />
<lib dir="../../../dist/" regex="solr-dataimporthandler-extras-4.3.0.jar" />
as well as other jars you may need (like the database JDBC driver, etc.).
Still in solrconfig.xml, make sure the DIH handler is declared, something like this:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
Finally, the config file you declared in the DIH handler (data-config.xml) should be in the same "conf" dir as solrconfig.xml and should have proper content, something like:
<dataConfig>
<dataSource type="JdbcDataSource" name="myDataSource" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:#someHost:1521:someDb" user="someUser" password="somePassword" batchSize="5000"/>
<document name="myDoc" >
<entity name="myDoc" dataSource="myDatasource" transformer="my.custom.Transformer" query="select col1, col2, col3 from table1 where whatever" />
</document>
</dataConfig>
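In a SolrCloud setup, remember that after editing these files you need to push the config set back to ZooKeeper and reload the collection before testing; then you can trigger and watch the import over HTTP. A sketch, where the script path, confname, and collection name are assumptions for a 4.x layout:
# re-upload the edited config set to ZooKeeper
cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd upconfig -confdir /path/to/conf -confname myconf
# reload the collection so all nodes pick up the change
curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=collection1"
# kick off the import and poll its status
curl "http://localhost:8983/solr/collection1/dataimport?command=full-import"
curl "http://localhost:8983/solr/collection1/dataimport?command=status"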
