Setting up Solr and Querying it - solr

I am new to Solr.
I can't find good documentation that explains what I need to add to solrconfig.xml and what can be removed.
My SolrDocument would contain id, field1, and field2. I want to update only one of the two fields. How do I do that? I tried a few things, but each of them overwrites the entire document.
/update is not working.
I have to add documents and retrieve them from inside a Java class.

You can refer to the Solr Wiki page on solrconfig.xml; it is a good starting point for understanding the configuration options.
Solr does not really have an update concept: it always deletes the existing document and replaces it with the new one. A feature request to address this, SOLR-139, was opened years ago, and as of today its fix version is 4.1. However, Solr 4.0 has a new feature, atomic updates, that you could try if this is critical for you. Note: Solr 4.0 is still a Beta.
'/update' not working -> do you mean it is not working because it replaces the old document with the new one, or do you get an error/exception?
To add and retrieve documents from Java, you can use SolrJ. SolrJ is the Java client for accessing Solr programmatically. See SolrJ - Solr Wiki.
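For illustration, here is a minimal sketch of adding a document and querying it back from a Java class. It deliberately uses plain HTTP with the JDK's `java.net.http` classes (Java 11+) rather than SolrJ, so that it has no dependencies; SolrJ wraps the same two endpoints. The base URL, the core name `collection1`, and the class and method names are assumptions for the example, and posting JSON to `/update` assumes Solr 4.0 or later.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class SolrHttpSketch {
    // Assumed location of a local Solr core; adjust host, port, and core name.
    static final String BASE = "http://localhost:8983/solr/collection1";

    // Escapes backslashes and double quotes so a value can sit inside a JSON string.
    static String jsonEscape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // JSON body for adding (or fully replacing) one document with id, field1, field2.
    static String addBody(String id, String field1, String field2) {
        return "[{\"id\":\"" + jsonEscape(id) + "\","
             + "\"field1\":\"" + jsonEscape(field1) + "\","
             + "\"field2\":\"" + jsonEscape(field2) + "\"}]";
    }

    // POST to /update; commit=true makes the document searchable right away.
    static HttpRequest addRequest(String body) {
        return HttpRequest.newBuilder(URI.create(BASE + "/update?commit=true"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    // GET /select?q=id:"<id>"; assumes the id itself contains no double quotes.
    static HttpRequest queryRequest(String id) {
        String q = URLEncoder.encode("id:\"" + id + "\"", StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(BASE + "/select?q=" + q + "&wt=json"))
                .GET()
                .build();
    }

    // Sends a request to a running Solr instance and returns the raw response body.
    static String send(HttpRequest request) throws IOException, InterruptedException {
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }

    public static void main(String[] args) {
        // Only prints what would be sent; call send(...) against a live Solr to execute.
        System.out.println(addBody("doc1", "first value", "second value"));
        System.out.println(queryRequest("doc1").uri());
    }
}
```

With a Solr instance running, `send(addRequest(addBody(...)))` followed by `send(queryRequest("doc1"))` would add the document and fetch it back as JSON.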

Related

Solr - Migrate Documents from one Collection to another existing one

I need to move all Solr Documents from one collection to another (already existing collection) - there are 500,000 documents.
I have tried the solr migrate but cannot get the routing key correct. I have tried:
curl 'http://localhost:8983/solr/admin/collections?action=MIGRATE&collection=oldCollection&target.collection=newCollection&split.key=!'
I have solr 4.10.3 installed in a cloudera installation.
Copy your existing oldCollection and rename the copy to newCollection.
After that, you may need to update some of its config files.
Or create a new collection using the API below:
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api1
The question and the answer above are quite old. Starting with Solr 8.1, there is a feature built for exactly this purpose: the REINDEXCOLLECTION API, which reindexes documents from a source collection into a target collection and offers many configurable options. Here is the link to the official documentation: https://lucene.apache.org/solr/guide/8_1/collections-api.html#reindexcollection
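As a rough sketch of what such a call looks like, the snippet below builds the Collections API URL for a REINDEXCOLLECTION request using only the JDK. The `name` and `target` parameters come from the 8.1 reference guide; the class name, method name, and base URL are assumptions for the example. Note that this only applies to Solr 8.1 or later, so it would not work on the 4.10.3 installation from the question.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class ReindexUrlSketch {
    // URL-encodes a single query parameter value.
    static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Builds the Collections API URL that reindexes `source` into `target`.
    // solrBase is assumed to look like http://localhost:8983/solr
    static URI reindexUri(String solrBase, String source, String target) {
        String query = "action=REINDEXCOLLECTION"
                + "&name=" + enc(source)      // collection to read documents from
                + "&target=" + enc(target);   // collection to write documents to
        return URI.create(solrBase + "/admin/collections?" + query);
    }

    public static void main(String[] args) {
        System.out.println(
                reindexUri("http://localhost:8983/solr", "oldCollection", "newCollection"));
    }
}
```

The resulting URL can be requested with curl or any HTTP client, the same way as the MIGRATE call in the question.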

Solr Upgrade from 3.4 to 4

In order to use the pivot feature available in Solr 4, I upgraded from 3.4.
Should I do a full reindex of the content after this upgrade, or are the indexes compatible somehow?
And what about my client applications that currently access my Solr 3.4 server: will they have problems after the upgrade? (In my preliminary tests they keep running; it seems the XML schema of a query response does not change as long as you don't use the new features.)
You need to do a full reindex if you want to use the Solr 4 index structure. Otherwise, you need to change the Lucene version (luceneMatchVersion) in solrconfig.xml to keep using the old index.
The schema will need a new field called _version_ if you want to use the Real Time Get functionality.
Other than that, most things are pretty much the same for the client.

Updating Solr Field Value

Is there any way to update the value of a Solr field without reindexing the whole document?
Nope.
You need to index the document again with all the fields.
Solr will delete and insert the document again.
There is a nice talk about it that you may want to listen to.
This functionality is available in the Solr version 4.0. That version is still in Beta, but will most likely be released before the end of the year. Please see the post - Solr 4.0: Partial documents update for more details on how this works.
When you want to change a single field of a document, you have to reindex the whole document, because Solr (before 4.0) does not support updating a single field on its own.
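For readers on Solr 4.0 or later, here is a minimal sketch of what a partial (atomic) update looks like on the wire. Instead of sending the whole document, the request sends the id plus a `set` modifier for the one field to change. The code uses only the JDK (Java 11+); the base URL, the core name `collection1`, and the class and method names are assumptions for the example. Atomic updates also require the update log to be enabled in solrconfig.xml, a `_version_` field in the schema, and the document's fields to be stored, because Solr still rebuilds and reindexes the full document internally.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class AtomicUpdateSketch {
    // Assumed location of a local Solr core; adjust host, port, and core name.
    static final String BASE = "http://localhost:8983/solr/collection1";

    // Escapes backslashes and double quotes so a value can sit inside a JSON string.
    static String esc(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // JSON body that replaces the value of one field and leaves the others untouched.
    static String setFieldBody(String id, String field, String newValue) {
        return "[{\"id\":\"" + esc(id) + "\","
             + "\"" + esc(field) + "\":{\"set\":\"" + esc(newValue) + "\"}}]";
    }

    // POST to /update; commit=true makes the change visible to searches right away.
    static HttpRequest updateRequest(String body) {
        return HttpRequest.newBuilder(URI.create(BASE + "/update?commit=true"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        // Only prints the payload; send updateRequest(...) with java.net.http.HttpClient
        // against a running Solr instance to apply it.
        System.out.println(setFieldBody("doc1", "field1", "new value"));
    }
}
```

Sending the same JSON without the `set` wrapper would replace the whole document, which is the overwrite behavior described in the answers above.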

Making one of Liferay communities (called sites) not indexed in solr

We are using Liferay (6.1.20 EE) with Solr search engine.
Currently Solr indexes everything. Can we somehow set up Solr (or Liferay) to prevent one Site from being indexed?
That is, none of the articles and documents on that Site would be indexed or present in Solr.
1) Should this be done with Solr configuration/schema filters before indexing starts?
OR
2) Should it be customized in the Liferay Indexer classes (with the help of Hooks or EXT) to skip the content during indexing?
Thanks for your thoughts and suggestions.
Regards,
Kris
You could create a custom version of the solr-web WAR file that you need to install to make the Liferay/SOLR integration work. In the WAR file you'll find SolrIndexWriterImpl. This is the place that everything passes through that will be indexed in SOLR. You could create your own custom implementation of this class that uses the information in the SearchContext parameter, that's passed into each method, to decide if something should be indexed or not.
The latest code for solr-web can be found here: http://svn.liferay.com/repos/public/plugins/trunk/webs/solr-web/
Based on this code I was also able to create a solr-web.war that works on the more recent SOLR versions instead of the ancient 1.4.1 version Liferay uses by default.

Identifying strings in documents, with nutch+solr?

I'm looking into a search solution that will identify strings (company names) and use these strings for search and facets in Solr.
I'm new to Nutch and Solr so I wonder if this is best done in Nutch or in Solr. One solution would be to generate a Parser in Nutch that identifies the strings in question and then index the name of the company, later mapped to a Solr value. I'm not sure on how, but I guess this could also be done inside Solr directly from the text?
Does it make sense to do this string identification in Nutch or in Solr and is there some functionality in Solr or Nutch that could help me here?
Thanks.
You could embed an NER library (see OpenNLP, LingPipe, GATE) into a custom parser, generate new fields, and create an IndexingFilter accordingly. This is not particularly difficult, and the advantage compared to doing this on the Solr side is that you'd gain from the scalability of MapReduce (NLP tasks are often CPU-hungry).
See Behemoth for an example of how to embed GATE in MapReduce.
Nutch works with Solr by sending the crawled data to Solr via the Solr HTTP API. You trigger the indexing by calling the solrindex command. See this page for details on how to set this up.
To extract the company names, I would add the necessary code in Solr, using an UpdateRequestProcessor. It lets you add an extra step to the indexing process that adds extra fields to the document being indexed. Your UpdateRequestProcessor would examine the document sent to Solr by Nutch, extract the company names from the text, and add them as new fields in the document. Solr would then index the document plus the fields that you added.
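To make the extraction step concrete, here is a small, self-contained sketch of the logic such a processor (or a Nutch parser) might run. It is a plain dictionary lookup, not a real NER library: it checks the text against a list of known company names. The class name, method name, and the sample company names are all made up for the example. Inside an UpdateRequestProcessor, the returned names would typically be added to the incoming document as an extra multi-valued field in `processAdd`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class CompanyNameSketch {
    // Returns the known company names that occur in the text, in dictionary order.
    // Matching is a case-insensitive substring test, which is crude: it ignores
    // word boundaries and spelling variants that a real NER library would handle.
    static List<String> findCompanies(String text, List<String> knownCompanies) {
        String haystack = text.toLowerCase(Locale.ROOT);
        List<String> found = new ArrayList<>();
        for (String name : knownCompanies) {
            boolean present = haystack.contains(name.toLowerCase(Locale.ROOT));
            if (present && !found.contains(name)) {
                found.add(name);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> dictionary = List.of("Globex", "Acme Corp", "Initech");
        // The matches would become the values of a facet field on the document.
        System.out.println(findCompanies("Acme Corp acquired Globex last year.", dictionary));
    }
}
```

Because the result is just a list of strings, the same method could be called from the Nutch side or the Solr side; only the code that attaches the values to the document differs.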
