How to ensure that a document field is unique in ArangoDB Cluster - database

I have a 3-node ArangoDB cluster (Community edition).
I created a database with writeConcern=3 and replicationFactor=3 and a collection with shards=3 and replicationFactor=3.
I have a hash index on a field of that collection with the unique property set to true. However, I am still able to create different documents with the same field value.
I would like to know whether there are strategies to ensure uniqueness of a collection field in the cluster.

The section Indexes On Shards in the Arango docs says the following:
Unique indexes (hash, skiplist, persistent) on sharded collections are only allowed if the fields used to determine the shard key are also included in the list of attribute paths for the index
The reason behind this is simple - it would be very expensive to ensure uniqueness of an attribute x if it is not guaranteed that all documents with identical values of x are stored on the same node.
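To make the quoted reasoning concrete, here is a minimal toy sketch in Python. It is not ArangoDB code: the class ToyShardedCollection, the helper count_accepted and the crc32-based routing are all assumptions made up for illustration. Each shard only checks uniqueness against the values it holds itself, which is cheap, but is only sufficient when the unique field is also the shard key.

```python
# Toy model only: NOT ArangoDB's implementation, just an illustration of the rule.
from zlib import crc32


class ToyShardedCollection:
    """Routes documents to shards by a shard key; each shard checks uniqueness locally."""

    def __init__(self, shard_key, unique_field, num_shards=3):
        self.shard_key = shard_key
        self.unique_field = unique_field
        # One local "unique index" (a set of seen values) per shard.
        self.shards = [set() for _ in range(num_shards)]

    def insert(self, doc):
        # The target shard depends on the shard key value only.
        shard_id = crc32(str(doc[self.shard_key]).encode()) % len(self.shards)
        seen = self.shards[shard_id]
        value = doc[self.unique_field]
        if value in seen:
            raise ValueError(f"unique constraint violated: {value!r}")
        seen.add(value)
        return shard_id


def count_accepted(collection, docs):
    """Insert all docs and return how many passed the local uniqueness check."""
    accepted = 0
    for doc in docs:
        try:
            collection.insert(doc)
            accepted += 1
        except ValueError:
            pass
    return accepted


# Six documents that all share the same email value.
docs = [{"id": i, "email": "a@example.com"} for i in range(6)]
print(count_accepted(ToyShardedCollection("email", "email"), docs))
print(count_accepted(ToyShardedCollection("id", "email"), docs))
```

When the collection is sharded by email, all duplicates reach the same shard and only the first insert is accepted. When it is sharded by id, a duplicate can be accepted once per shard it happens to land on, which is why a cluster-wide guarantee would need cross-shard coordination.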

Related

Why do entities need keys in Appengine datastore

What is the usage of keys in the appengine datastore: I am new to Appengine, any info on it would be great.
Comparison
To keep things simple, let's assume MySQL stores all the rows of a table in a single file. That way, it can find all the rows by scanning that file.
App Engine's datastore (BigTable) does not have a concept of tables. Each entity (~row in MySQL) is stored separately. [It can also have an individual structure (~columns).] Because entities are not connected in any way, there is no "default" method to go through all of them. Each entity needs an ID and must be indexed.
Key Structure
A key consists of:
App ID (the closest thing in MySQL is a database).
Kind (the closest thing in MySQL is a table).
ID or name (the closest thing in MySQL is a primary key).
(Optionally) Parent key (all the above of another entity). (Details omitted for the sake of simplicity.)
Please note that what is meant by the closest thing is conceptual similarity. Technically, these things are not related. In MySQL, databases and tables represent actual storage structures. In BigTable they are just IDs, and the storage is actually flat, i.e. every entity is essentially a file.
In other words, identity-wise, a key is to an entity as the database + table + primary key are to a row in a MySQL table.
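As a rough illustration of the key structure described above, here is a small Python sketch. ToyKey is a made-up class for this answer, not the App Engine Key API; it only models the four components (app ID, kind, ID or name, optional parent).

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class ToyKey:
    """Hypothetical stand-in for a datastore key; not the real App Engine class."""

    app_id: str                        # ~ database in MySQL
    kind: str                          # ~ table in MySQL
    id_or_name: Union[int, str]        # ~ primary key in MySQL
    parent: Optional["ToyKey"] = None  # optional parent key

    def path(self):
        """Return the ancestor path as (kind, id_or_name) pairs, root first."""
        prefix = self.parent.path() if self.parent else ()
        return prefix + ((self.kind, self.id_or_name),)


user = ToyKey("my-app", "User", "alice")
post = ToyKey("my-app", "Post", 42, parent=user)
print(post.path())
```

Two keys are equal only when every component matches, including the parent, so a Post with ID 42 under one user is a different entity from a Post with ID 42 that has no parent.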
Key's Responsibilities
An entity's key:
States what application the entity belongs to.
What kind (class, table) it is of.
By the means of the above and either a numeric key ID or a textual key name, identifies the entity uniquely.
(Optionally) What the parent entity of the entity is. (Details omitted for the sake of simplicity.)
Usage
So that you can retrieve all entities of a kind, App Engine automatically builds indexes. That means App Engine maintains a list of all your entities. More specifically, it maintains a list of your entities' keys.
Complex indexes may be defined to run queries on multiple properties (~columns).
In contrast to MySQL, every BigTable query requires an index. Whenever a query is run, the corresponding index is scanned to find the entities that meet the query's conditions, and then the individual entities are retrieved by key.
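The two-step lookup (scan the index, then fetch each entity by key) can be sketched with plain Python dictionaries. This is only a toy model under the assumption that entities live in a flat key-value store; build_index and query_equal are hypothetical helpers, not datastore functions.

```python
# Flat storage: every entity is stored on its own, addressed by its key.
entities = {
    ("User", 1): {"name": "Ann", "age": 31},
    ("User", 2): {"name": "Bob", "age": 25},
    ("User", 3): {"name": "Cy", "age": 31},
}


def build_index(store, kind, prop):
    """Return a sorted list of (property value, key) pairs for one kind and property."""
    return sorted(
        (entity[prop], key)
        for key, entity in store.items()
        if key[0] == kind and prop in entity
    )


def query_equal(store, index, value):
    """Scan the index for matching values, then fetch each entity by its key."""
    return [store[key] for indexed_value, key in index if indexed_value == value]


age_index = build_index(entities, "User", "age")
print([e["name"] for e in query_equal(entities, age_index, 31)])
```

The query never walks over the entities themselves; it only reads the index and then performs key lookups, which is why a query without a matching index cannot be run.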
A common high-level use is to identify an entity in a URL, as every key can be represented as a URL-safe string. When an entity's key is passed in the URL, the entity can be retrieved unambiguously, as the key identifies it uniquely.
Moreover, retrieving an entity by its key is strongly consistent, as opposed to queries on indexes, which means that when an entity is retrieved by its key, it's guaranteed to be the latest version.
Tips
Every entity stored in BigTable has a key. Such a key may be programmatically created in your application and given an arbitrary key name. If it's not, a numeric ID will be allocated transparently, as the entity is being stored.
Once an entity is stored, its key may not be changed.
The optional parent component might be used to define a hierarchy of entities, but what it's really important for is transactions and strong consistency.
Entities that share a parent are said to belong to the same entity group.
Queries within a group are strongly consistent.
Just to reiterate, retrieving an entity by its key or querying an index by a parent key are strongly consistent. Retrieving entities in other ways (e.g. by a query on a property) is eventually consistent.
Glossary
Entity - a single key-value document.
Eventual consistency - retrieving an entity (often a number of them) without the guarantee that the replication has completed, which may result in some entities being an old version and some being missing, as they have not yet been brought from the server they were stored on.
Key - an entity's ID.
Kind - arbitrary textual name of a class of entities, such as User or Article.
Key ID - a numeric identifier of a key. Usually automatically allocated.
Key name - a textual identifier of a key.
Strong consistency - retrieving an entity in such a way that its latest version is retrieved.
(I intentionally used MySQL in the examples, as I'm much more familiar with it than with any other relational database.)
Please read https://developers.google.com/appengine/docs/java/datastore/#Java_Entities ... you may want to delete your question and ask again after you have studied this documentation section.
(This is meant to help you, not complain.)

Reducing Google App Engine Built-in Index Size

I've got a Google App Engine entity kind with over 2 million entries, which takes up about 2GB.
According to the datastore statistics, the built-in indexes are 13GB (75 million entries), and the composite indexes are 1GB (4 million entries).
I understand that the size of my composite indexes is related to how many indexes I have defined in my index.yaml file.
However, why are my built-in indexes so much larger than the data itself, and what can I do to reduce the built-in indexes?
Most of the model properties are indexed by default, refer here:
https://developers.google.com/appengine/docs/python/ndb/properties
Pickle, JSON, LocalStructured and Blob properties are not indexed by default. This means that any other property gets a built-in index unless you specify indexed=False on it.
from google.appengine.ext import ndb

class User(ndb.Model):
    display_name = ndb.StringProperty(indexed=False)  # will not be indexed
    modified = ndb.DateTimeProperty(indexed=False)  # will not be indexed
Most of the time you have a lot of properties like these that you never query on. Note that right now you can't have a composite index on an unindexed property; this is an already reported feature request: https://code.google.com/p/googleappengine/issues/detail?id=4231
Once you have added indexed=False, removing the existing built-in index entries requires re-running entity.put() on all of the existing entities.
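To get a feel for the numbers in the question, here is a back-of-the-envelope sketch in Python. The function names are made up, and the figure of two built-in index entries per indexed property value (one ascending, one descending) is an assumption to check against the datastore documentation, not something stated in this thread.

```python
def index_entries_per_entity(total_index_entries, entity_count):
    """Average number of built-in index entries written per entity."""
    return total_index_entries / entity_count


def estimated_builtin_entries(entity_count, indexed_values_per_entity, entries_per_value=2):
    """Rough estimate of built-in index entries.

    Assumption: each indexed property value costs `entries_per_value`
    entries (ascending + descending). Repeated properties count once per value.
    """
    return entity_count * indexed_values_per_entity * entries_per_value


# Figures from the question: 75 million built-in entries for 2 million entities.
print(index_entries_per_entity(75_000_000, 2_000_000))

# Hypothetical: the same 2 million entities with only 5 indexed values each.
print(estimated_builtin_entries(2_000_000, 5))
```

The first figure works out to 37.5 index entries per entity, which shows why the built-in indexes can dwarf the data itself. Every property switched to indexed=False removes its share of those entries once the entities have been re-put.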

Solrcloud duplicate documents with id field

I am using solrcloud-4.3.0 and zookeeper-3.4.5 on a Windows machine. I have a collection whose index has a unique field "id". I observed that there were duplicate documents in the index with the same unique id value. As I understand it, this should not happen, because the purpose of the unique field is to avoid such situations. Can anyone help me find out what causes this problem?
In the "/conf/schema.xml" file there is an XML element called "<uniqueKey>", which seems to be "id" by default; that is supposed to be your "key".
However, according to the Solr documentation (http://wiki.apache.org/solr/UniqueKey#Use_cases_which_do_not_require_a_unique_key), you do not always need a "unique key" if you do not need to incrementally add new documents to an existing index. Maybe that is what is happening in your situation, although I also had the impression that you always needed a unique ID.
Probably too late to add an answer to this question, but it is also possible to duplicate documents with unique keys/fields by merging indexes with duplicate documents/fields.
Apparently, when indexes are merged via either the Lucene IndexMergeTool or the Solr CoreAdminHandler, any duplicate documents are happily appended to the index (as of Lucene and Solr 4.6.0).
De-duplication seems to happen at retrieval time.
https://cwiki.apache.org/confluence/display/solr/Merging+Indexes

When doing a non ancestor query with a sort by key, will my result be ordered by entity groups?

I need to change a fair amount of entities belonging to different entity groups.
If I do a non-ancestor query, sorted by key, like:
Query query = new Query("Kind")
    .setFilter(...)
    .addSort(Entity.KEY_RESERVED_PROPERTY, Query.SortDirection.ASCENDING); // or Query.SortDirection.DESCENDING
Will I always have a result ordered by entity group? I am planning to iterate through the result until the parent (or grandparent) key changes, and create a single transaction for all the entities in the same group, to avoid contention.
Will this work as expected? Any other suggestion?
Thank you.
Yes. Sorting by key orders entities by each element of the ancestor path in turn, e.g. first by root entity, then by its children, and so forth.
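The "iterate until the root changes" plan can be sketched in Python with itertools.groupby. This is a toy model, not datastore code: keys are represented as hypothetical tuples of (kind, id) pairs, root first, and plain tuple ordering stands in for the datastore's key ordering.

```python
from itertools import groupby


def group_by_entity_group(key_paths):
    """Sort key paths and batch them by root key (the first path element).

    key_paths: iterable of paths, each a tuple of (kind, id) pairs, root first.
    Returns a list of (root, [paths]) pairs, one per entity group.
    """
    ordered = sorted(key_paths)
    return [
        (root, list(members))
        for root, members in groupby(ordered, key=lambda path: path[0])
    ]


paths = [
    (("User", 2), ("Post", 1)),
    (("User", 1), ("Post", 9)),
    (("User", 2),),
    (("User", 1), ("Post", 3)),
]
for root, members in group_by_entity_group(paths):
    # In the real application, one transaction would be opened per group here.
    print(root, len(members))
```

Because the paths are sorted first, all members of one entity group are adjacent, so a single pass is enough to cut the result into per-group batches.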
Kindless queries or ancestor queries can only be sorted by key.
You are sorting by key, and that is OK.
The key is built from PARENT + KIND + ID.
The kind is part of the key, so all your results will be sorted by kind, and then by key.
From GAE KEYS
Every model instance has an identifying key, which includes the
instance's entity kind along with a unique identifier. The identifier
may be either a key name string, assigned explicitly by the
application when the instance is created, or an integer numeric ID,
assigned automatically by App Engine when the instance is written
(put) to the Datastore.

Reading directly from the Doctrine Searchable index table

I've got a Doctrine table with the Searchable behavior enabled.
Whenever a record is created, an index is made in another table. I have a model called Entry and the behavior automatically created the table entry_index.
My question now is: how can I use the data from this table without using the search(...) methods of my model?
I want to create a tag cloud of the words most used, and the data in the index table is exactly what I need.
Doctrine generates an EntryIndex table, which should be available via Doctrine::getTable('EntryIndex').
Additionally, Entry has an EntryIndex relation that refers to the index table, and EntryIndex has an Entry relation. It is a standard one-to-many (1-n) relation between Entry and EntryIndex.
