Solr - Combining automated crawl values and other manually entered data - solr

new to Solr and have implemented a crawl for specific URLs with Nutch/Solr. If I add new fields to the Solr schema, can I still use it to crawl with Nutch and import into Solr without problems, or will the manually entered fields cause the crawl to not work properly? Thank you for your time.

Adding some fields to the Solr schema should not cause any trouble at all, keep in mind that this will be true unless you specify one of the new fields as required in which case an exception will be thrown. Otherwise you're fine. Of course since Nutch is unaware of these fields no value will be set during the Nutch crawl.

Related

How to reindex solr after schema change?

We need to modify the schema fields dynamically, which requires reindexing.The solr ref guide recommands deleting all the documents and then re-run the original index process,but that does fit us.Does anyone have other ideas?
Thanks for any help.
Reindexing is required. Best option is using Data Import Handler. If you don't want to run the queries from database or don't want to pass documents, then also solr provides way to index the solr core through other solr core.
You can do something like this :
For more details, read solr DIH and SolrEntityProcessor : https://solr.apache.org/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html#solrentityprocessor

Migrate solr standalone index to solrcloud

There are indexes of some solr cores which I convert them from solr4 to solr6 but in solr standalone mode. so they don't have the "version" field that solrcolud require.
Here now I want to migrate to solrcloud 6 and I need to put them under cluster. Because the version field dose not exist there in these indexes when I put them Under a solrcloud leader core on the data directory the replicas in the shard didn't update as I saw. so I decided to read them by lucene, get each doc fields, add them to a solrdoc and then put them doc by doc in solrcloud. But cause there are fields that not stored in these indexes so all fields that exist here in these indexes don't move there.
At the end it seems there is no way for me than re-indexing.
I appreciate if there is any better idea or solutions that can help me migrate more easily.
If there is any chance to reindex, just do so, it's going to be the best in the end (you have to deal with two separate issues: a) migrate from 4.X to 6.0 and b)from standalone to SolrCloud...it's going to be messy).
If you cannot reindex:
are all your fields stored OR have docValues=true? If so, you can get the original contents of your docs. Read them and index them with solrj or with some script.
if not, and you have a version field: try to manually put the index in Solrcloud. Not straighforward, but possible.
if you don't have a version field, I think it is impossible to put the index as is in Solrcloud (although some post on the net make you think it is). You could try to write some lucene code to add version field to all docs (with values that make sense), but this should be the very last resort.

How to Insert a document to a specific shard in Apache Solr Cloud Mode

Is there a way we can add documents into a specific shard?
For example, documents type A will always get inserted into shard1 and document type B always go to shard2.
I have tried using custom router but it does not guaranty that different prefix will route to different shard.
PS. I am on Solr 5 using cloud mode.
A caveat: I'm using SolrNet to access SolrCloud, and it doesn't integrate with ZooKeeper yet. For Java clients, this might be far easier.
Despite what I read here and here with regard to the CompositeId Router, I could never get it to work. What #jay helped me figure out is a way to use "implicit" routing to achieve this. If you create your collection like this (leave out the numShards parameter):
http://localhost:8983/solr/admin/collections?action=CREATE&name=myCol&maxShardsPerNode=2&router.name=implicit&shards=shard1,shard2&router.field=shard
then add a field to your schema.xml named "shard" (matching the router.field parameter), you can index to a specific shard simply by adding the shard field to the document being indexed and specifying the shard name. At query time, you can specify shards to search -- more here (I was able to simply specify the shard name w/o a specific address).
I haven't tested this in production yet, but have verified using multiple VirtualBox instances, with ZooKeeper, HAProxy, and several Solr nodes, and it's doing exactly what I expected. Corrections and comments welcome.

Setting up Solr and Querying it

I am new to Solr.
I am not able to find out a proper document which could help me understand what all do I need to add in the solrconfig.xml and what is to be removed.
My SolrDocument would contain id, field1, field2. Out of the 2 fields, I want to update 1 of them. How do I do? I tried a few things but it overwrites the entire document.
/update is not working.
I have to add documents and retrieve them from inside a Java class.
You can refer to Solr Wiki for Solr Config.xml it is a good starting point to understand the configuration options.
Solr does not really have an update concept, it always deletes the existing document and replaces it with new document. There is a feature request open years back JIRA-139 to address this problem, but as of today it shows the fix version to be 4.1. But Solr 4.0 has a new feature Atomic update that you could try, if this is something very critical for you. Note: Solr 4.0 is still a Beta.
'/update' not working -> do you mean not working since it is replacing the old document with new document or do you get error/exception ?
To add & retrieve documents from Java, you can use SolrJ. SolrJ is Java client to access Solr programmatically. SolrJ - Solr Wiki.

Identifying strings in documents, with nutch+solr?

I'm looking into a search solution that will identify strings (company names) and use these strings for search and facets in Solr.
I'm new to Nutch and Solr so I wonder if this is best done in Nutch or in Solr. One solution would be to generate a Parser in Nutch that identifies the strings in question and then index the name of the company, later mapped to a Solr value. I'm not sure on how, but I guess this could also be done inside Solr directly from the text?
Does it make sense to do this string identification in Nutch or in Solr and is there some functionality in Solr or Nutch that could help me here?
Thanks.
You could embed a NER library (see opennlp, lingpipe, gate) in to a custom parser, generate new fields and create an indexingfilter accordingly. This is not particularly difficult and the advantage compared to doing this on the SOLR side is that you'd gain from the scalability of mapreduce (NLP tasks are often CPU-hungry).
See Behemoth for an example of how to embed GATE in mapreduce
Nutch works with Solr by indexing the crawled data to Solr via the Solr HTTP API. You trigger the indexation by calling the solrindex command. See this page for details on how to setup this.
To be able to extract the company names, I would add the necessary code in Solr. I would use a UpdateRequestProcessor. It allows to add an extra step in the indexing process to add extra fields in the document being indexed. Your UpdateRequestProcessor would be used to examine to document sent to Solr by Nutch, extract the company names from the text and add them as new fields in the document. Solr would them index the document + the fields that you add.

Resources