Adding and Updating Solr and Lucene fields

I am new to Solr. Can someone answer the questions below?
1. Currently I have an index with 1.5 million records. I need to update the value of a field to a new value. How do I do that? Will it require re-indexing? Sample code would be helpful.
2. I have another need where I want to add an index field but don't want to reindex the entire content. I have the document ids with me. For this requirement I can use Lucene if that helps.

Currently I have an index with 1.5 million records. I need to update the value of a field to a new value. How do I do that? Will it require re-indexing? Sample code would be helpful.
Well, the good news is that recent versions of Solr (starting with 4.0) allow you to do what they call Atomic Updates. See here:
http://wiki.apache.org/solr/Atomic_Updates
From the coding point of view, it is as if you were only updating the desired field. Using the Java SolrJ API it's something like this:
Let's say you have a document with a multivalued field called "stuffedAnimals". The field already contains "teddy bear" and "stuffed turtle" as values. You want to update it and add a new value like "pink fluffy flamingo". What you can do is:
import java.util.Collections;
import org.apache.solr.common.SolrInputDocument;

SolrInputDocument updateDocument = new SolrInputDocument();
// Here you must add the id field with the value that identifies the doc you want to update:
updateDocument.addField("id", 2312312);
// "add" tells Solr to append the new value to the existing ones rather than replace them:
updateDocument.addField("stuffedAnimals", Collections.singletonMap("add", "pink fluffy flamingo"));
The problem with this is performance: what actually happens when you do this is that the document is removed and re-added entirely (not just the field). This is something you need to take into consideration if you plan on doing a lot of such operations.
I have another need where I want to add an index field but don't want to reindex the entire content. I have the document ids with me. For this requirement I can use Lucene if that helps.
Well, as I was saying above: when you update a field, the document is actually re-written entirely, so that means it's re-indexed with the new field as well. If you're using Solr 4.4 or earlier you need to declare the new fields in the schema.xml file. If you're using Solr 4.5 or newer you don't need to worry about the schema.xml any more.
Finally, as a remark for both questions: if you want to update a Solr document, make sure all its fields are marked as "stored" (stored=true in schema.xml). Since a partial update on a field translates into Solr removing and re-adding the document (with the update applied), if certain fields are not stored, Solr won't know what value to put in them after the update.
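For reference, the same atomic "add" operation can also be sent as a JSON body to Solr's update handler instead of going through SolrJ. Below is a minimal sketch that only builds that body; the class and method names (AtomicUpdateJson, addValue) are made up for illustration, the escaping is deliberately simplistic, and sending the string to your core's /update endpoint is left out.

```java
// Hypothetical helper, not part of SolrJ: builds the JSON body for an atomic "add" update.
final class AtomicUpdateJson {

    // Wraps a value in double quotes, escaping backslashes and quotes only.
    // Assumption: values contain no control characters; use a JSON library for real data.
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Produces [{"id":"<id>","<field>":{"add":"<value>"}}]
    static String addValue(String id, String field, String value) {
        return "[{" + quote("id") + ":" + quote(id) + ","
                + quote(field) + ":{" + quote("add") + ":" + quote(value) + "}}]";
    }

    public static void main(String[] args) {
        // Same document and field as in the SolrJ example above.
        System.out.println(addValue("2312312", "stuffedAnimals", "pink fluffy flamingo"));
    }
}
```

Replacing "add" with "set" in the inner object would overwrite the field's values instead of appending to them, which is closer to what the first question asks for.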

Take a look at the atomic update feature added in Solr 4.0.
It allows you to change the value of a particular field without reindexing the whole document yourself.
Remember that all fields in your schema have to be stored (except copyField destinations). If you need further assistance, please write a more detailed description.

Related

Solr indexing fails over media_black_point

First of all, I should say that I don't have much experience with Solr.
The problem we are facing: we only want to index the content of files and do not want to add dynamic fields. Is this possible, and if so, how?
Problem 2: If the answer to problem one is no, how would we exclude media_black_point and media_white_point from indexing?
Error response where Solr trips:
{"responseHeader":{"status":400,"QTime":149},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"incompatible dimension (2) and values (313/1000 329/1000). Only 0 values specified","code":400}}

Dynamic fields and schemaless mode are both there to catch fields you did not declare explicitly. If neither is used, the assumption is that every field you send to Solr (including output from the extract handler, which generates a Solr document internally) needs to be explicitly mapped. This helps to avoid spelling errors and other unexpected edge cases.
If you want to ignore all the fields you did not define explicitly, you can use a dynamic field with stored/indexed/docValues all set to false. Solr ships with one example out of the box; you just need to uncomment it.
The other option is to ignore specific fields. You can do that by defining a custom UpdateRequestProcessor chain (or an individual URP in the latest Solr) and using IgnoreFieldUpdateProcessorFactory with your specific field name or a name pattern.

How to intercept a document in Solr

I want to manipulate a document and change the token values of some fields by prepending a value to each token. I am doing bulk updates through DIH and also posting documents through SolrJ. I have a replication factor of 2, so replication should also work. The value that I want to prepend is in the document as a separate field. I would like to know where I can intercept the document before indexing so that I can manipulate it. One option I can think of is overriding DirectUpdateHandler2. Is this the right place?
I could do it by processing the document externally and then passing it to Solr, but I want to do it inside Solr.
Document fields are :
city:mumbai
RestaurantName:Talk About
Keywords:Cofee, Chines, South Indian, Bar
I want to index keywords as
mumbai_cofee
mumbai_Chines
mumbai_South Indian
mumbai_Bar

The right place is an Update Request Processor. Make sure you plug it in solrconfig.xml into all the update handlers you are using (including DIH), and that single URP will cover all updates.
In your Java code in the URP you can easily get the value of one field and then prepend it to the values of another field, and so on. This happens before the doc is indexed.
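The transformation itself is small. Here is a minimal sketch of just the prefixing logic, using the field values from the question; KeywordPrefixer and prefixKeywords are hypothetical names, and the sketch assumes the keywords arrive as one comma-separated string. In a real URP this logic would be called from processAdd, reading the city and keyword values from the incoming SolrInputDocument and writing the result back before passing the command on to the next processor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: the core transformation an Update Request Processor could apply.
final class KeywordPrefixer {

    // Splits a comma-separated keyword string and prefixes each keyword with "<city>_".
    // Keyword case is left untouched; blank entries are skipped.
    static List<String> prefixKeywords(String city, String keywords) {
        List<String> result = new ArrayList<>();
        for (String keyword : keywords.split(",")) {
            String trimmed = keyword.trim();
            if (!trimmed.isEmpty()) {
                result.add(city + "_" + trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Values taken from the example document in the question.
        System.out.println(prefixKeywords("mumbai", "Cofee, Chines, South Indian, Bar"));
    }
}
```

Keeping the logic in a plain static method like this makes it easy to unit test without a running Solr instance.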

How to check that a document is really indexed?

Does someone know how to check whether a document was correctly indexed after an update with Solr?
I've tried to read the response after calling the add() method of SolrServer as below, but it doesn't seem to work:
SolrInputDocument doc = new SolrInputDocument();
/*
 * Processing on document to add fields ...
 */
UpdateResponse response = server.add(doc);
if (response.getStatus() == 0) {
    System.out.println("File Added");
} else {
    System.out.println("Error when Adding File");
}
The Javadoc does not say what the add() method returns. Does it always return 0?
If so, what is the best way to check that a file was correctly indexed after an update?
Thanks
Corentin

You need to perform a commit to be able to see the added documents.
Add simply adds the document to the index.
However, the document is not returned in search results until you commit.
When you index documents into Solr, none of the changes you make (add/delete/update) appear until you run the commit command.
A commit operation makes index changes visible to new search requests.
Also look into soft commits, which make changes visible at a lower cost than hard commits.

To add to Jayendra's answer, there might be situations where you are trying to index an existing document again, e.g. to test a different index-time chain of analyzers.
In these cases, you might not be able to tell whether the document was indexed again if its content did not change.
In such cases, the _version_ field might come to the rescue. The _version_ field always changes its value when the document is indexed again. Please refer to my answer here to learn more about the _version_ field.

Solr Spell Check result based filter query

I implemented the Solr SpellCheck Component based on the document at http://wiki.apache.org/solr/SpellCheckComponent and it works well. But I am trying to filter the spell check results based on some other filter. Consider the following schema:
product_name
product_text
product_category
product_spell -> copies the strings from product_name and product_text, tokenized using a whitespace analyzer
For the above schema, I am trying to filter the spell check results based on a provided category. I tried querying like http://127.0.0.1:8080/solr/colr1/myspellcheck/?q=product_category:160%20appl&spellcheck=true&spellcheck.extendedResults=true&spellcheck.collate=true . The spellcheck results do not take product_category:160 into account.
Is it because the dictionary was built for all the categories? If so, is it a good idea to create a dictionary for every category?
Is it not possible to have another filter condition in the spellcheck component?
I am using Solr 3.5.

I previously understood from the SOLR-2010 issue that filtering through the fq parameter should be possible using collation, but it isn't; I think I misunderstood.
In fact, the SpellCheckComponent most likely has a separate index, except for the DirectSolrSpellChecker implementation. This means the field you select is indexed in a different index, which contains only the information about the specific field you chose for spelling corrections.
If you're curious, you can also have a look at what that additional index looks like using Luke, since it is of course a Lucene index. Unfortunately, filtering using other fields isn't an option there, simply because there is only one field there: the one you use to make spelling corrections.

How to sort by tag considering the tag weights related to every document?

I'm building a Solr search engine to search a collection of 300k documents. Among the many indexed fields, an important one is tags.
My idea is to assign to every document a vector of tags, each one with a given weight (basically depending on the number of users who chose that tag for that document). For instance:
Doc1 = {tag1:0.3, tag2:0.7, tag3:0.8, tag4:1}
Doc2 = {tag2:0.5, tag3:0.8, tag4:0.8, tag5:0.9}
Using this example, when someone asks for documents tagged with tag4, I would of course give back both documents, but Doc1 with a higher score since it has tag4 weighted higher.
Ideally, the way to implement this in Solr would be something like creating a multiValued field called "tags" and assigning, at indexing time, a weight to each tag contained in that field. So, first question:
Is it possible to assign a term frequency (as a tag weight) manually at indexing time?
From what I found, it seems not! OK, a workaround is to copy, for instance, tag4 10 times into the tags field of Doc1 and just 8 times into the tags field of Doc2. Of course this has some drawbacks and limitations.
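The repetition workaround can be sketched as follows. This is only an illustration of the idea, not a recommended design: WeightedTags, expand and the scale factor of 10 are assumptions chosen to match the "10 times" and "8 times" figures above, and the resulting list is what would be added as the values of the multiValued tags field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns tag weights into repeated tag values so that
// the term frequency of each tag approximates its weight.
final class WeightedTags {

    // Repeats each tag round(weight * scale) times, keeping the map's iteration order.
    static List<String> expand(Map<String, Double> weights, int scale) {
        List<String> values = new ArrayList<>();
        for (Map.Entry<String, Double> entry : weights.entrySet()) {
            long copies = Math.round(entry.getValue() * scale);
            for (long i = 0; i < copies; i++) {
                values.add(entry.getKey());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Doc1 from the example above.
        Map<String, Double> doc1 = new LinkedHashMap<>();
        doc1.put("tag1", 0.3);
        doc1.put("tag2", 0.7);
        doc1.put("tag3", 0.8);
        doc1.put("tag4", 1.0);
        System.out.println(expand(doc1, 10));
    }
}
```

The scale factor trades index size against weight resolution: with a scale of 10, weights of 0.81 and 0.84 both collapse to 8 copies.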
However, here comes the bigger problem, which I cannot solve even with a workaround. I would like to define my own score. The one that would best fit my specific case would be something like sort=tf(tags,tag4). In fact, TF is in this case much more important than IDF! Unfortunately this feature (Relevance Functions) will only be released in Solr 4: http://wiki.apache.org/solr/FunctionQuery#tf
Do you have any idea how to change the scoring function in Solr 3.5 to give more importance to TF and less to IDF?
Is there any hack to do it simply, or would you change the Lucene source code (if so, what and where?), or would you use the Solr 4 nightly build?
Thanks in advance for your advice!