Solrcloud duplicate documents with id field - solr

I am using SolrCloud 4.3.0 and ZooKeeper 3.4.5 on a Windows machine. I have a collection whose unique field is "id". I observed duplicate documents in the index with the same unique id value. As I understand it, this should not happen, because the purpose of the unique field is to prevent exactly this. Can anyone help me understand what causes this problem?

In the "/conf/schema.xml" file there is an XML element called "<uniqueKey>", which seems to be "id" by default. That is supposed to be your "key".
However, according to the Solr documentation (http://wiki.apache.org/solr/UniqueKey#Use_cases_which_do_not_require_a_unique_key), you do not always need a unique key if you do not need to incrementally add new documents to an existing index. Maybe that is what is happening in your situation, although I also had the impression that you always needed a unique ID.

It is probably too late to add an answer to this question, but it is also possible to end up with duplicate documents for a unique key by merging indexes that contain the same documents.
Apparently, when indexes are merged via either the Lucene IndexMergeTool or the Solr CoreAdminHandler, any duplicate documents are happily appended to the index (as of Lucene and Solr 4.6.0).
De-duplication seems to happen at retrieval time.
https://cwiki.apache.org/confluence/display/solr/Merging+Indexes
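If merged indexes may contain the same id more than once, one way to check is to list the ids and count them. The following is only a minimal sketch in Python: the list `merged_ids` is hypothetical sample data standing in for ids exported from the index, and `find_duplicate_ids` is an illustrative helper, not part of Solr or Lucene.

```python
from collections import Counter


def find_duplicate_ids(doc_ids):
    """Return a dict mapping each id seen more than once to its count."""
    counts = Counter(doc_ids)
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


# Hypothetical ids, as they might be exported from a merged index.
merged_ids = ["a1", "b2", "a1", "c3", "b2", "a1"]
duplicates = find_duplicate_ids(merged_ids)
print(duplicates)  # {'a1': 3, 'b2': 2}
```

Re-indexing the affected documents through the normal update path should then leave one copy per id, since that path is where the unique key is enforced.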

Related

Unique key field in solr

There is a field named "id" which is used as the unique key in Solr. Although it is not directly used for faceting or sorting queries, it still shows up in the field cache and occupies a lot of memory.
Please help me understand how this id field ended up in the field cache, and whether there is a way to keep it out of the field cache.

How to ensure that a document field is unique in ArangoDB Cluster

I have a 3-node Arango cluster (Community edition).
I created a database with writeConcern=3 and replicationFactor=3, and a collection with shards=3 and replicationFactor=3.
I have a hash index on a field of that collection with the unique property set to true. However, I am still able to create different documents with the same field value.
I would like to know if there are any strategies to ensure uniqueness of a collection field in the cluster.
The section Indexes On Shards in the Arango docs says the following:
Unique indexes (hash, skiplist, persistent) on sharded collections are only allowed if the fields used to determine the shard key are also included in the list of attribute paths for the index
The reason behind this is simple - it would be very expensive to ensure uniqueness of an attribute x if it is not guaranteed that all documents with identical values of x are stored on the same node.
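The quoted rule boils down to a subset check: every shard key attribute must appear among the unique index's fields. Here is a minimal Python sketch of that check. The function name and the field name `email` are made up for illustration, and this is not ArangoDB's own validation code; it only assumes the default shard key `_key`.

```python
def unique_index_allowed(shard_keys, index_fields):
    """True if every shard key attribute is also one of the index fields."""
    return set(shard_keys) <= set(index_fields)


# Collection sharded by the default key "_key", unique index on a
# hypothetical "email" field: the rule is not satisfied.
print(unique_index_allowed(["_key"], ["email"]))   # False

# Collection sharded by "email" itself: the rule is satisfied.
print(unique_index_allowed(["email"], ["email"]))  # True
```

Under this rule, making a field unique cluster-wide means sharding the collection by that field, so that all documents with the same value land on the same shard.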

Missing records from solr index?

Is there any way to find the records that are missing from a Solr index?
I am crawling a SQL database. My primary key is "id".
A few records are missing from the index. Is there a specific way to find all of them?
Does it make any difference whether the primary key is a long value or a string if we use a range query?
Thanks in advance!
If you mean that those records went "missing" during indexing, you can write them to a file during indexing, because at that point you will know more or less which records did not make it through.
If you are talking about comparing the database with Solr, the only way is to crawl the whole database and search for each record in Solr.
If your ids are numeric, for example, you can do this with a range query on a group of ids, and if the result counts do not match you can narrow down the search.
The easiest way, though, is to just compare the ids one by one, but it is also the slowest way. It depends on your database.
Primary keys in Solr are string only, but nobody says you can't have a numeric unique key alongside.
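The narrowing-down idea can be sketched as a recursive split over the id range: compare the counts for a range on both sides, and split the range only where they differ. This is just an illustration in Python under stated assumptions: ids are integers, every id in Solr also exists in the database, and the two in-memory sets stand in for what would really be a COUNT query on the database and a range query on Solr. The function names are hypothetical.

```python
def count_in_range(ids, lo, hi):
    """Count ids with lo <= id < hi (stand-in for a real count query)."""
    return sum(1 for i in ids if lo <= i < hi)


def find_missing(db_ids, solr_ids, lo, hi):
    """Return ids in [lo, hi) that are in db_ids but not in solr_ids."""
    if hi <= lo:
        return []
    db_count = count_in_range(db_ids, lo, hi)
    if db_count == count_in_range(solr_ids, lo, hi):
        return []            # counts agree, nothing missing in this range
    if hi - lo == 1:
        return [lo]          # single id, present in the DB but not in Solr
    mid = (lo + hi) // 2     # counts differ, so narrow down both halves
    return (find_missing(db_ids, solr_ids, lo, mid)
            + find_missing(db_ids, solr_ids, mid, hi))


db_ids = set(range(1, 11))       # hypothetical ids 1..10 in the database
solr_ids = db_ids - {3, 8}       # two of them never reached the index
print(find_missing(db_ids, solr_ids, 1, 11))  # [3, 8]
```

Ranges whose counts agree are skipped in one step, so when only a few records are missing this needs far fewer lookups than comparing the ids one by one.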

Does SOLR remove rows from index if they are not returned by deltaImportQuery?

I have a SOLR instance that is updated using deltaQuery/deltaImportQuery.
There is a row in SOLR that was changed in the source database table since last SOLR update.
During the next update, deltaQuery returns the primary key of this row (because it was changed recently). deltaImportQuery should then select the data for that primary key. This query contains an additional filter on a field, such as IsSearchableItem=1 (I don't want some rows to be searchable).
So deltaImportQuery does not return any data for the row (for this particular row IsSearchableItem=0). Will this row be removed from the SOLR index in this case?
I believe that if DIH does not generate a replacement document (which I think is what you call a row), the existing document will not be deleted. Instead, you could look at emitting $deleteDocById when IsSearchableItem is 0. Check the $skipDoc usage in the Wikipedia dump example.
Or use deletedPkQuery.

database design asking for advice

I need to store entries with a schema like (ID:int, description:varchar, updatetime:DateTime). ID is the unique primary key. The usage scenario is that I will frequently insert new entries, frequently query entries by ID, and less frequently remove expired entries (by the updatetime field, using a separate SQL job that runs daily so the database does not grow forever). Each entry is 0.5 KB in size.
My question is how to optimize the database schema design (e.g. tricks for adding indexes, transaction/lock levels, or other options) to improve performance in my scenario. Currently I plan to store all the information in a single table, but I am not sure whether that is the best option.
BTW: I am using SQL Server 2005/2008.
thanks in advance,
George
In addition to your primary key, just add an index on updatetime.
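As a rough illustration, here is that layout in Python using the standard-library sqlite3 module as a stand-in for SQL Server, so the DDL syntax and types are SQLite's, not T-SQL. The table name `entries`, the index name, and the sample rows are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Single table; the primary key on id is indexed automatically.
conn.execute(
    "CREATE TABLE entries ("
    " id INTEGER PRIMARY KEY,"
    " description TEXT,"
    " updatetime TEXT)"
)
# Extra index so the daily cleanup can find expired rows without a full scan.
conn.execute("CREATE INDEX idx_entries_updatetime ON entries (updatetime)")

conn.executemany(
    "INSERT INTO entries VALUES (?, ?, ?)",
    [
        (1, "old entry", "2009-01-01 00:00:00"),
        (2, "recent entry", "2009-06-01 00:00:00"),
    ],
)

# Daily cleanup job: remove everything older than the cutoff.
cutoff = "2009-03-01 00:00:00"
deleted = conn.execute(
    "DELETE FROM entries WHERE updatetime < ?", (cutoff,)
).rowcount
remaining = [row[0] for row in conn.execute("SELECT id FROM entries ORDER BY id")]
print(deleted, remaining)  # 1 [2]
```

Lookups by id use the primary key, and the cleanup uses the updatetime index, at the cost of maintaining one extra index on every insert.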
Your decision to store everything in a single table needs to be reviewed. There are very few subject matters that can really be well modeled by just one table.
The problems that arise from using just one table are usually less obvious than the problems that arise from not creating the right indexes and things like that.
I'm interested in the "description" column (field). Do all descriptions describe the same kind of thing? Do you ever retrieve sets of descriptions, rather than just one description at a time? How do you group descriptions into sets?
How do you know the ID for the description you are trying to retrieve? Do you store copies of the ID in some other place, in order to reference which ones you want?
Do you know what a "foreign key" is? Was your choice not to include any foreign keys in this table deliberate?
These are some of the questions that need to be answered before you can know whether a single table design really suits your case.
Your ID is your primary key, so it automatically has an index.
You can add another index for the expiration date. Indexes help with searching but reduce performance when inserting, deleting and updating. Anyway, one index is not an issue.
It sounds somewhat strange to me (I am not saying that it is an error) that you have ALL the information in one table. Rethink that point.
See if you can refactor something.
It sounds as simple as it gets, except for possibly adding an index on updatetime as OMax suggested (which I recommend).
If you would also like to fetch items by description, you should also consider a text index or full-text index on that column.
Other than that - you're ready to go :)
