Configure DataImportHandler in SolrCloud with ZooKeeper - solr

I have a SolrCloud set up as described in "exploration of SolrCloud"; the difference is that I use Solr 4.0.0 Beta. In short, the configuration is:
ZooKeeper on default port 2181
3 instances of Solr running on different ports
This is just for testing purposes. The desired configuration has 3 ZooKeeper instances (one for every Solr instance). I managed to index some XML files with a curl command.
Questions:
How can I configure DIH/collection? I managed to change solrconfig.xml (the config for the dataimport handler) and added the proper driver for the DB connection in lib, but in the Solr admin I get "sorry, no dataimport-handler defined!" The changes are visible in ZooKeeper (I can see data_config.xml), and in the Solr admin panel I can see the updated version of solrconfig.xml.
Is there any good tutorial for a production deployment of SolrCloud (with something like the desired configuration mentioned above) on a single machine or multiple machines running Ubuntu 12.04 LTS?
Any advice would be appreciated! Thanks in advance!

Normally the DIH config has nothing to do with whether you're using a single Solr instance or multiple instances in a SolrCloud setup. DIH sends the documents through the instance it runs on, and it is then up to SolrCloud (coordinated through ZooKeeper) to spread them to the other instances.
Make sure your DIH is properly configured:
In solrconfig.xml, all necessary libraries are loaded. This means the two DIH jars:
<lib dir="../../../dist/" regex="solr-dataimporthandler-4.3.0.jar" />
<lib dir="../../../dist/" regex="solr-dataimporthandler-extras-4.3.0.jar" />
as well as any other jars you may need (such as a database JDBC driver).
Still in solrconfig.xml, make sure the DIH handler is declared, something like this:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
Finally, the config file you declared in the DIH handler (data-config.xml) should be in the same "conf" dir as solrconfig.xml and should have proper content, something like:
<dataConfig>
<dataSource type="JdbcDataSource" name="myDataSource" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@someHost:1521:someDb" user="someUser" password="somePassword" batchSize="5000"/>
<document name="myDoc" >
<entity name="myDoc" dataSource="myDataSource" transformer="my.custom.Transformer" query="select col1, col2, col3 from table1 where whatever" />
</document>
</dataConfig>
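Because the entity refers to its data source by name, a mismatch between the two names is easy to overlook. Below is a minimal sketch, not part of DIH itself, that parses a data-config.xml with Python's standard library and lists entity dataSource references that match no declared dataSource name. The function name undeclared_data_sources is made up for illustration, and the check assumes the names are compared case-sensitively.

```python
import xml.etree.ElementTree as ET
from typing import List


def undeclared_data_sources(data_config_xml: str) -> List[str]:
    """Return entity dataSource references that match no declared <dataSource name=...>.

    Hypothetical helper for illustration; assumes a case-sensitive name match.
    """
    root = ET.fromstring(data_config_xml)
    # Names declared on <dataSource> elements (None if an element has no name)
    declared = {ds.get("name") for ds in root.iter("dataSource")}
    missing = []
    for entity in root.iter("entity"):
        ref = entity.get("dataSource")
        # Entities without an explicit dataSource attribute are not checked here
        if ref is not None and ref not in declared:
            missing.append(ref)
    return missing
```

Reading the file with open("data-config.xml").read() and passing the text in should return an empty list when every reference resolves.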


Solr 8.10 sample core files and how to add new cores

I'm using regular Solr 8.10.1 (no Solr Cloud)
I start it like C:\solr-8.10.1\bin\solr start -p 8983
My folder structure:
- solr-8.10.1
  - server
    - solr
      - configsets
        - sample_techproducts_configs
          - conf
      - mytest
        - conf
          - lang
          data-config.xml
          managed-schema
          protwords.txt
          solrconfig.xml
          stopwords.txt
          synonyms.txt
        - data
      - samplecatalog
        - conf
          data-config.xml
          schema.xml
          solrconfig.xml
      solr.xml
I also copied the files of the samplecatalog core from my Solr 4.3.2 instance to a new folder in 8.10.1.
But when I go to http://localhost:8983/solr/#/~cores
I see no cores.
solr.xml
<?xml version="1.0" encoding="UTF-8" ?>
<solr>
<int name="maxBooleanClauses">${solr.max.booleanClauses:1024}</int>
<str name="sharedLib">${solr.sharedLib:}</str>
<solrcloud>
<str name="host">${host:}</str>
<int name="hostPort">${jetty.port:8983}</int>
<str name="hostContext">${hostContext:solr}</str>
<bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
<int name="zkClientTimeout">${zkClientTimeout:30000}</int>
<int name="distribUpdateSoTimeout">${distribUpdateSoTimeout:600000}</int>
<int name="distribUpdateConnTimeout">${distribUpdateConnTimeout:60000}</int>
<str name="zkCredentialsProvider">${zkCredentialsProvider:org.apache.solr.common.cloud.DefaultZkCredentialsProvider}</str>
<str name="zkACLProvider">${zkACLProvider:org.apache.solr.common.cloud.DefaultZkACLProvider}</str>
</solrcloud>
<shardHandlerFactory name="shardHandlerFactory"
class="HttpShardHandlerFactory">
<int name="socketTimeout">${socketTimeout:600000}</int>
<int name="connTimeout">${connTimeout:60000}</int>
<str name="shardsWhitelist">${solr.shardsWhitelist:}</str>
</shardHandlerFactory>
</solr>
I just want to have a sample core folder with a schema.xml, handlers, and a data-config.xml for my entities, so I can start and expand from that foundation.
I checked the tutorials, but I can't find any samples or see where I can define cores via my config files.
I also checked here, but that's for a very old version.
Short answer: cd into the Solr bin directory and run solr create -c "mytest"
(see the solr create command).
Basically, you can follow these few steps to define a configuration set and create the corresponding core.
Define SOLR_HOME (where to put the Solr core config and data) in Solr's bin/solr.in.sh, or bin\solr.in.cmd on Windows. It's recommended that you keep it separate from the Solr sources and binaries.
Create or move your configuration set in the SOLR_HOME directory and ensure the solr user has ownership.
Run the solr create command.
Here is a bash script, based on one I often use, that does the job (I noticed you are on a Windows machine, but the principle remains the same):
#!/bin/bash
# Stop at the first failing command (e.g. if SOLR_HOME does not exist)
set -e
SOLR_SRC="/opt/solr" # symlink to your solr-<version> directory
SOLR_ROOT="/var/solr"
SOLR_HOME="${SOLR_ROOT}/data"
CORE="mytest"
# Create core config set in SOLR_HOME
cd "${SOLR_HOME}"
mkdir -p "${CORE}/data"
# cp -R "${SOLR_SRC}/server/solr/configsets/_default/conf/" "./${CORE}/" # from default conf
cp -R "${SOLR_SRC}/server/solr/${CORE}/conf/" "./${CORE}/"
# Set ownership
chown -R solr:solr "${SOLR_HOME}"
# Create core
su - solr -c "${SOLR_SRC}/bin/solr create -c ${CORE}"
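As far as I know, a standalone Solr (5.0 and later) discovers cores by looking for a core.properties file in each directory under the core root, and solr create writes that file for you; a conf directory alone is not enough for a core to show up in the admin UI. A minimal sketch for listing which directories would be picked up, assuming that discovery rule (the helper name discoverable_cores is made up for illustration):

```python
import os
from typing import List


def discoverable_cores(solr_home: str) -> List[str]:
    """Return directories under solr_home (relative paths) that hold a core.properties file.

    Hypothetical helper; it mirrors the discovery rule only roughly.
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(solr_home):
        if "core.properties" in filenames:
            found.append(os.path.relpath(dirpath, solr_home))
            # A core directory is not searched further for nested cores
            dirnames[:] = []
    return sorted(found)
```

Calling discoverable_cores on your SOLR_HOME before and after running solr create should show the new core appearing in the list.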

apache solr provisioning - how to keep config in VCS but data on machine

I need to make some adaptations in a project that uses Apache Solr for full-text search. Someone configured everything on the production machine, and I want to prepare everything locally and deploy the whole new version at once.
I already created a working Vagrant setup for everything, and it works well.
But my problem is that I am not very experienced with configuring Apache Solr and can't manage to get it working.
Here is my installation script:
apt-get install -q -y openjdk-8-jdk
# install apache solr
if [[ ! -e "/etc/default/solr.in.sh" ]]
then
wget http://www-eu.apache.org/dist/lucene/solr/7.7.1/solr-7.7.1.tgz
tar xzf solr-7.7.1.tgz solr-7.7.1/bin/install_solr_service.sh --strip-components=2
chmod u+x ./install_solr_service.sh
./install_solr_service.sh solr-7.7.1.tgz
cat /vagrant/config/solr/solr.in.sh >> /etc/default/solr.in.sh
rm -f /opt/solr-7.7.1/server/solr/solr.xml
ln -s /vagrant/config/solr/solr.xml /opt/solr-7.7.1/server/solr/solr.xml
fi
Contents of /vagrant/config/solr/solr.in.sh
(taken from the production config; I don't really understand the purpose):
# this is just a partial file - we append its contents to the original
SOLR_RECOMMENDED_OPEN_FILES=65000
Content of the linked solr.xml:
<?xml version="1.0" encoding="UTF-8" ?>
<solr>
<str name="coreRootDirectory">${coreRootDirectory:/vagrant/config/solr/cores}</str>
<solrcloud>
<str name="host">${host:}</str>
<int name="hostPort">${jetty.port:8983}</int>
<str name="hostContext">${hostContext:solr}</str>
<bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
<int name="zkClientTimeout">${zkClientTimeout:30000}</int>
<int name="distribUpdateSoTimeout">${distribUpdateSoTimeout:600000}</int>
<int name="distribUpdateConnTimeout">${distribUpdateConnTimeout:60000}</int>
<str name="zkCredentialsProvider">${zkCredentialsProvider:org.apache.solr.common.cloud.DefaultZkCredentialsProvider}</str>
<str name="zkACLProvider">${zkACLProvider:org.apache.solr.common.cloud.DefaultZkACLProvider}</str>
</solrcloud>
<shardHandlerFactory name="shardHandlerFactory"
class="HttpShardHandlerFactory">
<int name="socketTimeout">${socketTimeout:600000}</int>
<int name="connTimeout">${connTimeout:60000}</int>
<str name="shardsWhitelist">${solr.shardsWhitelist:}</str>
</shardHandlerFactory>
</solr>
The cores directory contains all the information from the production machine; I just added the following value to the core.properties file within each core:
dataDir=/var/solr/data/NAME_OF_CORE
I figured this way the data would be part of my machine but the config part of my repository.
But when I browse to localhost:8983 (which works perfectly) I don't see any cores. Nor can I create a new core; when creating a new core called "new_core" it says:
new_core: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load conf for core new_core: Error loading solr config from /var/solr/data/new_core/conf/solrconfig.xml
So how would I provision Solr correctly to keep all my config in git but the data on the machine?
The company that set up everything is not helpful, they provide ZERO information.
Kind regards,
Philipp

Integrating grobid with tika and solr

I'm using Solr to index journal articles. Using the out-of-the-box configuration, it indexed the text of the documents, but I'm looking to use Grobid to pull out the authors, title, affiliations, etc. I got grobid up and running as a service.
I added
<str name="tika.config">/path/to/tika-config.xml</str>
to the requestHandler for /update/extract in solrconfig.xml
The tika-config looks like:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<properties>
<parsers>
<parser class="org.apache.tika.parser.journal.JournalParser">
<mime>application/pdf</mime>
</parser>
</parsers>
</properties>
I'm getting a ClassNotFound exception when I try to import a document, but can't figure out where to set the classpath to fix it.
As mentioned on the Solr users' list, the latest version of Solr (6.0.0) uses a version of Tika (1.7) that predates the addition of Grobid (which came in with Tika 1.11). To follow the upgrade to Tika 1.13, see SOLR-8981.

Errors while trying to configure Solr 5.3.1 on Windows 10

I'm trying to set up a very basic configuration of Solr, to read some text from a MySQL table and index it. I'm following the steps in the DIH Quick Start document.
The document doesn't tell you where to place solrconfig.xml.
At first I tried placing it under the solr5.3.1 folder (next to bin). That failed. Then I noticed the "add core" button was looking for it in server\solr\new_core. So I put it there, but then got this other error:
My data import handler looks like this:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
And here's data-config.xml:
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost/ctcrets"
user="root"
password="xxxx"/>
<document>
<entity name="id"
query="select RETS_STAGE1_QUEUE_ID as id, LN_LIST_NUMBER as name, xmlText as desc from RETS_STAGE1_QUEUE">
</entity>
</document>
</dataConfig>
What could be the problem?
The document assumes you already know the solr.home [1] directory structure. On top of that, I think it assumes you started the sample Solr instance (e.g. ./solr start -p 8984) where everything should be already set.
Once started you can see on the dashboard where the configuration is exactly located. Go there, change the files as suggested and RELOAD the core through the admin console (CoreAdmin). If you want you can also do a stop / restart.
As side notes:
the DIH is not part of the Solr core, so solrconfig.xml needs "lib" directives that load it. As far as I remember, the sample config already has those directives, so you don't need to "import" the DIH lib yourself.
the JDBC driver that allows the connection to the database is not included, so your classpath (i.e. the JVM or Solr classpath, through the same lib directive) must include this additional lib.
[1] http://www.solrtutorial.com/configuring-solr.html
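The RELOAD mentioned above can also be triggered over HTTP through the CoreAdmin API instead of clicking through the admin console. Here is a minimal sketch using only the Python standard library; the base URL (port 8984, as in the start command above) and the core name new_core are assumptions taken from this thread, and the function names are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def reload_url(base_url: str, core: str) -> str:
    """Build the CoreAdmin RELOAD URL for a core, e.g. base_url='http://localhost:8984/solr'."""
    query = urlencode({"action": "RELOAD", "core": core, "wt": "json"})
    return f"{base_url.rstrip('/')}/admin/cores?{query}"


def reload_core(base_url: str, core: str) -> str:
    """Call the RELOAD URL and return the raw response body (requires a running Solr)."""
    with urlopen(reload_url(base_url, core), timeout=30) as response:
        return response.read().decode("utf-8")
```

For example, reload_core("http://localhost:8984/solr", "new_core") would ask a locally running instance to re-read that core's solrconfig.xml and data-config.xml.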

SOLR Field not reflected in schema browser

I created a Solr core using bin/solr create -c core1, then copied the schema.xml file from the basic config set to the core1/conf folder and added a field
<field name="title" type="text" indexed="true" stored="true"/>.
But this field is not reflected in schema browser.
What configurations should I make to get the new fields reflected in schema browser in solr admin ui?
I am using solr 5.3.1
By default, when you create a Solr core it uses the managed schema. You will see the following configuration in solrconfig.xml after the core is created.
<schemaFactory class="ManagedIndexSchemaFactory">
<bool name="mutable">true</bool>
<str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>
Above this configuration you will find comments on how to use the managed schema. Comment this block out and uncomment the following to use schema.xml:
<schemaFactory class="ClassicIndexSchemaFactory"/>
You need to reload the core: go to http://yourhost:8983/solr/#/~cores/core1 and press "Reload" button.
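To double-check which schema factory a core is actually configured with before reloading, something like the following might help. It is only a sketch using Python's standard library: it reads the solrconfig.xml text and reports the class attribute of the active (not commented-out) schemaFactory element. The function name schema_factory_class is made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def schema_factory_class(solrconfig_xml: str) -> Optional[str]:
    """Return the class of the top-level <schemaFactory> element, or None if none is declared.

    XML comments are skipped by the parser, so a commented-out factory is ignored.
    """
    root = ET.fromstring(solrconfig_xml)
    factory = root.find("schemaFactory")
    if factory is None:
        return None
    return factory.get("class")
```

If this reports ManagedIndexSchemaFactory, edits to schema.xml would not be expected to show up in the schema browser.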
