Solr Indexing duplicate documents - solr

I am using Solr to store file paths as my 'id' (uniqueKey) and to index each file's content. When I change a file's contents and re-index it, Solr replaces the file's contents in the index. Is there any way I can retain the old version of the file under the same id? I tried adding the overwrite=false parameter with no luck. I am using Solr 6.1.0.

I don't think you can do that under the same id, because id is the uniqueKey.
It isn't possible in an RDBMS either.
You can achieve it by giving the changed document a new id (treat the document with changed content as a new document) and maintaining the relation between the new id and the old id.
The same concept works in Solr, but every document needs one more field besides id, for example older_id.
In older_id you store the id of the document that is the older version and holds the old content.
This way your older documents are not deleted from Solr: each new version is stored under a new id, and its older_id holds the id of the previous document.
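The id / older_id idea above can be sketched in a few lines of Python. This is only an illustration under assumptions: the field names path and content and the helper make_new_version are hypothetical, and sending the resulting documents to Solr is left out.

```python
import uuid


def make_new_version(old_doc, new_content):
    """Build a new document that points back at the version it supersedes."""
    return {
        "id": str(uuid.uuid4()),    # fresh unique key for the new version
        "older_id": old_doc["id"],  # link to the previous version's id
        "path": old_doc["path"],    # hypothetical field: the file path is no longer the id
        "content": new_content,     # hypothetical field holding the file text
    }


# The first version has no predecessor, so it carries no older_id.
v1 = {"id": str(uuid.uuid4()), "path": "/docs/report.txt", "content": "first draft"}
v2 = make_new_version(v1, "second draft")
```

Because v2 has its own id, indexing it does not overwrite v1, and the history of a file can be walked by following older_id from one document to the next.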

Related

solr don't send the id field with the result

I am pretty new to Solr and I don't know the best practice for the id column.
Currently I want to exclude the internal "id" field from Solr search results (I am using my custom user_id field).
I know I can use fl=field1,field2, but this means specifying all my fields there. I don't have deep knowledge of Solr, and I fear this will hurt performance.
Another question: is it recommended to add another field, user_id, or to overwrite the default id field?
Thank you very much.
If the value you have in your user_id field is unique, index that into your id column or define the user_id field as your unique key instead and don't use the id field.
The important thing is that there's a unique field in your document so that Solr knows when a document should be updated compared to when a new document should be added instead.
If the id field is not relevant / secret, I'm not sure why you'd be worried about including it.

reindexing json object into solr by adding only unique elements

I have indexed a JSON object into Solr using HttpClient,
and when I tried to index it again, duplicate records were indexed.
How do I update the records in Solr instead? Every time I index, I want the existing records to be updated.
Thanks in advance.
Include an id field in your JSON object and make sure it is unique, for example a number like 65746. When you index this document again, Solr checks the id. If a document with the same id already exists, Solr replaces it with the new version instead of adding a second copy. To declare the unique field, open the schema.xml or managed-schema file in your core's configuration and define it like this: <uniqueKey>id</uniqueKey>. Solr will then treat the id coming from your JSON as unique and will not add a second copy of a document that is already indexed, so there will be no duplicate records. Let me know if that helped you :)
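If the records have no natural id, one option is to derive a stable one, so that re-indexing the same record always produces the same id and therefore replaces the earlier copy. The sketch below is an assumption-laden illustration, not part of the original answer: the helper with_stable_id and the field names sku and vendor are hypothetical.

```python
import hashlib
import json


def with_stable_id(record, key_fields):
    """Copy `record` and add an `id` derived from the chosen key fields."""
    key = json.dumps([record[f] for f in key_fields])
    doc = dict(record)
    doc["id"] = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return doc


record = {"sku": "A-100", "vendor": "acme", "price": 9.5}
first = with_stable_id(record, ["sku", "vendor"])

# Same key fields, changed price: the id is identical, so Solr would replace the old copy.
second = with_stable_id({"sku": "A-100", "vendor": "acme", "price": 10.0}, ["sku", "vendor"])

# JSON body (a list of documents) that could be POSTed to the core's /update handler.
payload = json.dumps([second])
```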

Elasticsearch Unique field

I want to store urls in an index but I want unique url.
I'm making POST request to store my documents but I want to avoid duplicate document based on the url field.
Is there a way to specify a unique constraint on the url field ?
I have around 5 million documents, so I don't want to use the url as the document ID, because I think it will slow down my search queries.
No, the _id is the only field that can have a uniqueness restriction. You probably know this, but a new document with an existing id overwrites the existing document with the same id. You can use op_type=create or /my_index/my_type/ID/_create to get back an error if a document with the same id already exists.
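Since only _id can be unique, one possible workaround is to derive the _id from the url, for example as a fixed-length hash, and index through the _create endpoint mentioned above. This is a sketch under assumptions: the helper create_path_for_url is hypothetical, the index and type names are the placeholders from the answer, and the HTTP request itself is left out.

```python
import hashlib


def create_path_for_url(url, index="my_index", doc_type="my_type"):
    """Return the _create path whose document id is a SHA-1 hash of the url."""
    doc_id = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "/{}/{}/{}/_create".format(index, doc_type, doc_id)


path_a = create_path_for_url("http://example.com/page")
path_b = create_path_for_url("http://example.com/page")
path_c = create_path_for_url("http://example.com/other")
```

Indexing the same url twice targets the same path, so the second request is rejected instead of creating a duplicate. The url itself can still be stored as a normal field for searching.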

changing solr id from string to uuid

I am very new to solr.
Initially the "id" in my solr schema was of type string.
I have 30,000 documents, but now I want to use uuid instead of a string.
I tried simply changing the type of id to uuid, following the instructions at http://wiki.apache.org/solr/UniqueKey
It did not work, because Solr tried to parse the existing string ids as uuids and failed.
My question is how do i change my id to uuid without deleting any data ?
Any info on this will be helpful.
I assume your id field is declared as the uniqueKey in schema.xml. That means every document in your Solr instance must contain the id field. When you change the type of a field in the schema, the index data already written for that field no longer matches the new type. You can no longer query on that field, even though the documents are still present in your Solr instance.
What good is data that you indexed in order to query if you cannot query it? There is no point in keeping the old documents in Solr if you can't query them. And in this case the field you modified is the uniqueKey, so you must re-index. If you had changed the type of a field other than the uniqueKey, an atomic (partial) update might have been a solution.
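When re-indexing from the original source, the old string ids can be mapped to uuids deterministically, so the same source record always gets the same new id. The following is a minimal sketch under assumptions: the helper to_uuid_doc and the legacy_id field are hypothetical (legacy_id would have to exist in the schema), and the choice of namespace is arbitrary.

```python
import uuid

# Assumption: any fixed namespace works, as long as it never changes between runs.
ID_NAMESPACE = uuid.NAMESPACE_URL


def to_uuid_doc(old_doc):
    """Copy a document, replacing its string id with a name-based (version 5) uuid."""
    new_doc = dict(old_doc)
    # Drop Solr's internal version field so it is not sent back with the new id.
    new_doc.pop("_version_", None)
    new_doc["legacy_id"] = old_doc["id"]  # hypothetical field keeping the old string id
    new_doc["id"] = str(uuid.uuid5(ID_NAMESPACE, old_doc["id"]))
    return new_doc


old = {"id": "doc-00042", "title": "Quarterly report", "_version_": 1234567890}
new = to_uuid_doc(old)
```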

Solr schema modifications that do not affect existing Documents

I am trying to figure out whether I need to re-index a [very large] document base in Solr in the following scenarios:
I want to add a few new fields to the schema: none of the old Documents need to be updated to add values for these fields, only new documents that I will be adding after the schema update will have these fields. Do I still need to re-index Solr?
I want to remove a couple of unused fields from the schema (they were added prematurely ...): none of the existing documents has any of these fields. Do I still need to re-index Solr after the schema update?
I saw many recommendations for updating existing documents when adding/modifying fields, but this is not the case for me - I only want to update the schema, not the existing documents.
Thanks!
Marina
Answer 1: You are correct. You can add a new field, and you do not need to re-index if only new documents going forward will have a value for it.
Answer 2: Yes, you can remove a field without rebuilding the index if none of the documents has a value for that field. You can check by looking at that field under:
http://localhost:8080/admin/schema.jsp
If any document has a value for the field you want to remove, you have to rebuild the index; otherwise Solr will report an error.
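Another way to check whether any document still has a value for a field is a range query of the form field:[* TO *] with rows=0, then reading numFound from the response. The sketch below only builds the query parameters. The helper field_usage_params and the field name old_field are hypothetical, and the check assumes the field is indexed.

```python
from urllib.parse import urlencode


def field_usage_params(field):
    """Query parameters that match every document with any value in `field`."""
    return {
        "q": "{}:[* TO *]".format(field),  # open-ended range: any value at all
        "rows": 0,                          # only the hit count is needed
        "wt": "json",
    }


params = field_usage_params("old_field")
query_string = urlencode(params)  # append to the core's /select URL
```

If numFound in the response is 0, no document has an indexed value for the field.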
