Need some information about Crawl Anywhere and Solr

I went through the Crawl Anywhere documentation, but I am confused about its installation steps.
What I understood is that Apache is optional. But do I need an independent Tomcat instance for crawling? From what I saw in the folder structure, a tomcat folder is already present and the WAR file is there too.
Do we also need an independent instance of Apache Solr?
If we want to add a PostgreSQL database to the crawl, how can we do that?
Please also provide some links so that I can read further and clear up any remaining doubts.

Apache is needed to use the admin interface. Tomcat is needed for some interactive features. You can crawl without either of them.
No.
MySQL and MongoDB are supported. The code is open source, so you can add PostgreSQL support yourself.
Try Google Groups for other questions.

Related

Solr luceneMatchVersion syntax

I have Solr 4.10 with a collection whose solrconfig.xml sets <luceneMatchVersion> as follows:
<luceneMatchVersion>4.7</luceneMatchVersion>
Is this correct? I saw other examples that use values such as LUCENE_35. I also need to know how to express LUCENE_xx for my current Solr version.
You should use:
<luceneMatchVersion>4.10.4</luceneMatchVersion>
I recommend checking your current Solr version; in my case it was 4.10.4.
If you are going to reindex, then both numbers should match. The only reason you might want them to differ is if you had an index created with, say, Lucene 4.7. In that case you would have
<luceneMatchVersion>4.7</luceneMatchVersion>
Then you upgrade Lucene to 4.10.
Now, if the changes between 4.7 and 4.10 include things that behave differently during analysis (the same sentence analysed in both versions produces different output), you might want to keep the version number at 4.7. Otherwise some queries that contain affected terms might not match, because they were analysed differently at index time than at query time. You have to assess how critical that issue might be.
That is why the recommendation is to upgrade, change the setting to the current version number, and reindex. This way you are sure to avoid any issue.
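To make the comparison above concrete, here is a minimal Python sketch that reads <luceneMatchVersion> from a solrconfig.xml string and flags a mismatch with the running Solr version. The function names and the sample XML are hypothetical, and the check only handles dotted values such as 4.7, not the legacy LUCENE_xx form.

```python
import xml.etree.ElementTree as ET


def read_lucene_match_version(solrconfig_xml: str) -> str:
    """Return the text of <luceneMatchVersion> from a solrconfig.xml string."""
    root = ET.fromstring(solrconfig_xml)
    element = root.find("luceneMatchVersion")
    if element is None or element.text is None:
        raise ValueError("no <luceneMatchVersion> element found")
    return element.text.strip()


def versions_differ(configured: str, running: str) -> bool:
    """True when the configured major.minor differs from the running version's."""
    return configured.split(".")[:2] != running.split(".")[:2]


# Hypothetical, heavily trimmed solrconfig.xml used only for illustration.
SAMPLE = """<config>
  <luceneMatchVersion>4.7</luceneMatchVersion>
</config>"""

configured = read_lucene_match_version(SAMPLE)
if versions_differ(configured, "4.10.4"):
    print(f"configured {configured} differs from running 4.10.4: "
          "either keep it for the old index, or update it and reindex")
```

A mismatch is not an error in itself; it is only a prompt to decide between keeping the old analysis behaviour and reindexing.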
If anyone is using Drupal, the Search API Solr (search_api_solr) module has config templates by version in /sites/all/modules/search_api_solr/solr-conf/.
The template README.md states the following:
The solr-conf-templates directory contains config-set templates for different Solr versions. These are templates and are not to be used as config-sets! To get a functional config-set you need to generate it via the Drupal admin UI or with drush solr-gsc. See README.md in the module directory for details.
The module's README.md lists these instructions:
1. Make sure you have Apache Solr started and accessible (i.e. via port 8983). You can start it without having a core configured at this stage.
2. Visit Drupal configuration (/admin/config/search/search-api) and create a new Search API Server according to the search_api documentation using "Solr" as Backend and the connector that matches your setup. Input the correct core name (which you will create at step 4, below).
3. Download the config.zip from the server's details page or by using drush solr-gsc with proper options, for example for a server named "my_solr_server": drush solr-gsc my_solr_server config.zip 8.4.
4. Copy the config.zip to the Solr server and extract.
I generated a config file for 8.x, and it uses this:
<luceneMatchVersion>${solr.luceneMatchVersion:LUCENE_80}</luceneMatchVersion>
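The value above uses the ${property:default} placeholder form: if the solr.luceneMatchVersion property is set, its value is used, otherwise the text after the colon applies. The following Python sketch only illustrates that fallback rule. It is not Solr's implementation, and the resolve function and the sample property value are hypothetical.

```python
import re

# Matches ${name} or ${name:default}.
_PLACEHOLDER = re.compile(r"\$\{([^:}]+)(?::([^}]*))?\}")


def resolve(text: str, properties: dict) -> str:
    """Replace ${name:default} placeholders using `properties`, else the default."""
    def substitute(match):
        name, default = match.group(1), match.group(2)
        if name in properties:
            return properties[name]
        if default is not None:
            return default
        raise KeyError(f"property {name!r} is not set and has no default")
    return _PLACEHOLDER.sub(substitute, text)


line = "<luceneMatchVersion>${solr.luceneMatchVersion:LUCENE_80}</luceneMatchVersion>"

# No property set: the default after the colon is used.
print(resolve(line, {}))
# Property set (hypothetical value): it overrides the default.
print(resolve(line, {"solr.luceneMatchVersion": "8.4.0"}))
```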

Hoster requirements for running Solr

I am planning to add a full-text search engine to a website, for searching a MySQL database. Most of the recommendations I found for a nice, user-friendly implementation mentioned Apache Solr.
With this in mind, I started searching for what a hoster must offer to run Solr, but I didn't find any useful information except for "it should support Java". So I picked a random host that states it has a Java JRE installed (http://wiki.dreamhost.com/What_We_Support) and asked if they supported Solr. Unfortunately, the answer was "no".
So, what should I be looking for? Do I need a dedicated server, a VPS, or are there shared hosting solutions where it is possible to run Solr?
What are the system requirements?
I hope there is someone out there, who knows a bit about this. Thanks!
The Solr requirements can be found here: https://wiki.apache.org/solr/SolrInstall
So an installed JRE is needed, and also a servlet container, which itself needs a JRE.
If I were in your situation, I would rent a virtual server.
Another option is a hosting service that specializes in Solr hosting: search the web for "apache solr hosting". There are both free and paid offers.
I've been running two Drupal websites plus Apache Solr on an SSD VPS from RoseHosting, using 2 CPU cores and 1 GB of RAM. I wasn't able to set up Apache Solr and Java myself, which is why I rented a managed VPS service.
If you're not that technical, I suggest you add "managed" to the keywords mentioned by @The Bndr, and check with your host that they will support Apache Solr and Java on your VPS.

Embedded Solr on Amazon AWS

I have developed a web application that uses an embedded Solr server for indexing. I deployed it on Tomcat 6 on Windows XP and everything worked. Next, I tried to deploy the application on Amazon AWS, where my platform is Linux + MySQL. On deployment, I got the following exception related to embedded Solr.
[ WARN] 19:50:55 SolrCore - [] Solr index directory 'solrhome/./data/index' doesn't exist. Creating new index...
[ERROR] 19:50:55 CoreContainer - java.lang.RuntimeException: java.io.IOException: Cannot create directory: /usr/share/tomcat6/solrhome/./data/index
at org.apache.solr.core.SolrCore.initIndex(SolrCore.java:403)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:552)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:480)
How can I fix this problem? I am new to Linux.
My guess is that the user you are running Solr under does not have permission to access that directory.
Also, which version of Solr are you using? It looks like 3+. The latest version is 4, so it may make sense to use that from the start. It will probably take a bit more troubleshooting at first, but the payoff is much better than starting with a legacy configuration.
I found the solution. It was a permission problem on Amazon Linux with ec2-user, so I changed the permissions as follows.
sudo chmod -R ugo+rw /usr/share/tomcat6
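Before changing permissions, it can help to confirm that the failure really is a write-permission problem for the user running Tomcat. The Python sketch below is one way to check, assuming it is run as that same user; the function name is hypothetical and the demo uses a temporary directory rather than the real /usr/share/tomcat6 path.

```python
import os
import tempfile


def can_create_index_dir(path: str) -> bool:
    """Return True if the current user could create `path` (or write into it)."""
    probe = os.path.abspath(path)
    # Walk up to the nearest ancestor that already exists.
    while not os.path.exists(probe):
        parent = os.path.dirname(probe)
        if parent == probe:
            return False
        probe = parent
    # Creating entries needs write and search (execute) permission on a directory.
    return os.path.isdir(probe) and os.access(probe, os.W_OK | os.X_OK)


# Demo against a throwaway directory standing in for the Solr home.
with tempfile.TemporaryDirectory() as home:
    index_dir = os.path.join(home, "solrhome", "data", "index")
    print(can_create_index_dir(index_dir))
```

Note that the chmod above grants read and write access to every user on the whole Tomcat directory. If the check fails only for the Solr home, limiting the change to that directory may be enough.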
http://wiki.apache.org/solr/SolrOnAmazonEC2
It should allow access to ports 22 and 8983 for the IP you're working from, with routing prefix /32 (e.g., 4.2.2.1/32). This will limit access to your current machine. If you want wider access to the instance available to collaborate with others, you can specify that, but make sure you only allow as much access as needed. A Solr instance should not be exposed to general Internet traffic. If you need help figuring out what your IP is, you can always use whatismyip.com. Please note that production security on AWS is a wide ranging topic and is beyond the scope of this tutorial.
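As a small illustration of what the /32 prefix means, the Python sketch below builds a single-host CIDR block and checks which source addresses fall inside it. The helper names are hypothetical, the addresses are the example values from the text, and the /32 suffix assumes IPv4.

```python
import ipaddress


def single_host_cidr(ip: str) -> str:
    """Return the /32 CIDR block that covers exactly one IPv4 address."""
    return str(ipaddress.ip_network(f"{ip}/32"))


def is_allowed(source_ip: str, cidr: str) -> bool:
    """True if `source_ip` falls inside the allowed CIDR block."""
    return ipaddress.ip_address(source_ip) in ipaddress.ip_network(cidr)


rule = single_host_cidr("4.2.2.1")
print(rule)
print(is_allowed("4.2.2.1", rule))  # the machine you are working from
print(is_allowed("4.2.2.2", rule))  # a neighbouring address
```

A wider prefix such as /24 would admit 256 addresses, which is why the rule should only be loosened as far as collaboration requires.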

Install Jetty or run embedded for Solr install

I am about to install Solr on a production box. It will be the only Java application running and will be on the same box as the web server (nginx).
It seems there are two options.
1. Install Jetty separately and configure it for use with Solr
2. Set Solr's embedded Jetty server to start as a service and just use that
Is there any performance benefit in having them separate?
I am a big fan of KISS, the less setup the better.
Thanks
If you want KISS, there is no question: option 2. Stick to the vanilla Solr distribution with the included Jetty.
Installing an external servlet engine would make sense if you needed Tomcat, for example, but just to use the same thing (Jetty) that Solr already includes... no way.
Solr is still using Jetty 6, so there would be some benefit if you can get the Solr application to run in a recent Jetty distribution. For example, you could use Jetty 9 and features like SPDY to improve the response times of your application.
However, I have no experience with whether it is possible to run the Solr application standalone in a servlet engine.
Another option for running Solr and keeping it simple is Solr-Undertow, a high-performance, small-footprint server for Solr. It is easy to use on local machines for development and also in production. It supports simple config files for running instances with different data directories, ports and more. It can also run just by pointing it at a distribution .zip file, without needing to unpack it.
(note, I am the author of Solr-Undertow)
Link here: https://github.com/bremeld/solr-undertow with releases under the "Releases" tab.

Which are the necessary files to execute Solr in a remote host and how to configure the security (permissions, etc)?

I have a MySQL database on localhost. I use Solr to index it and run queries, and it works fine.
Now I want to put the Solr index on a remote host. I know that I need permission to run Java and SSH access. But I must admit that, even though I can make Solr work, I don't understand very well what each of the files that make up Solr does.
So what I want to know is: which files are strictly necessary for the Solr index to serve queries? The data files, yes, but what else? And how should I configure the permissions? Read only? Execute? I guess there are other security items I must pay attention to as well.
Usually you have to install at least a web container (e.g. Tomcat or Jetty) and deploy the solr-[version].war file onto it.
For security reasons, you can restrict access to the server's web interface in the configuration of this web container.
Besides that, Solr needs a home directory in which it stores its index and configuration. I think this must have rw permissions, since Solr changes the index on import and maintains index usage information.

Resources