I don't have much experience with MongoDB, so I think it's best to explain my request through an example.
I have these two entities. A Tree has a list of at least one Leaf, and a Leaf can't exist without a Tree.
data class Tree(
    val id: UUID,
    val name: String,
    val leaves: List<Leaf>,
)

data class Leaf(
    val id: UUID,
    val names: List<String>,
)
I would like the id of the Leaf to be unique per Tree.
For example, the first Tree document can have a Leaf with the id 250bb131-2134-5667-0000-000000000000, and another Tree can have a Leaf with that same id: 250bb131-2134-5667-0000-000000000000.
I am assuming you want to store the Trees and Leaves in MongoDB, given the title of your question.
First, know that MongoDB has no joins in the way a relational (SQL) database does, so you have two choices:
One collection. You have one collection with all the Trees in it, each Tree embedding its Leaves.
Tree: if you change the type of Tree.id to BigInteger and leave the value null, Mongo will take care of allocating a unique key*.
Leaf: you need to allocate unique keys yourself for the Leaf objects, perhaps by using a Set to ensure you get a unique one.
Two collections. You have two collections: one for Leaf and one for Tree.
Tree: if you change the type of Tree.id to BigInteger and leave the value null, Mongo will take care of allocating a unique key*.
Leaf: if you change the type of Leaf.id to BigInteger and leave the value null, Mongo will take care of allocating a unique key*.
Then you have to handle the reference from Tree.leaves. This is all manual: you need to change the leaves list to contain references to the Leaf object keys and handle the fetch yourself, e.g. val leaves: List<BigInteger>. (Mongo does have a DBRef, which is no more than a formal reference to database/collection/id.)
*Mongo's default key is an ObjectId that will serialize into a BigInteger, or you could (pollute?) your code with the org.bson.types.ObjectId type. This class has some useful properties, such as sorting in datetime-ascending order. But more than that, you should read how such keys are generated. Unlike a central RDBMS, which is responsible for acquiring a lock, issuing a key, and releasing the lock, Mongo's strategy is to create the primary key on the client. How can this work safely? Well: it uses a combination of a timestamp, a client (random) value, and an increment (see the spec). You could use this idea yourself in option 1, where you need to guarantee a unique Leaf identifier, by using the ObjectId constructor.
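To make the client-side key generation in option 1 concrete, here is a minimal sketch, written in Go with the official mongo-go-driver purely for illustration (on the JVM you would construct org.bson.types.ObjectId directly); the database/collection names and connection string are made up, and ids become ObjectIds rather than UUIDs, as suggested above:

package main

import (
    "context"
    "log"

    "go.mongodb.org/mongo-driver/bson/primitive"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

// Leaf gets its id generated on the client, so it is unique across all Trees.
type Leaf struct {
    ID    primitive.ObjectID `bson:"_id"`
    Names []string           `bson:"names"`
}

// Tree leaves its own _id empty so the driver/server allocates one.
type Tree struct {
    ID     primitive.ObjectID `bson:"_id,omitempty"`
    Name   string             `bson:"name"`
    Leaves []Leaf             `bson:"leaves"`
}

func main() {
    ctx := context.Background()
    client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
    if err != nil {
        log.Fatal(err)
    }
    defer client.Disconnect(ctx)

    trees := client.Database("forest").Collection("trees") // assumed names

    tree := Tree{
        Name: "oak",
        Leaves: []Leaf{
            {ID: primitive.NewObjectID(), Names: []string{"first"}}, // client-side unique id
        },
    }
    if _, err := trees.InsertOne(ctx, tree); err != nil {
        log.Fatal(err)
    }
}

Because each ObjectId embeds a timestamp, a machine/random value, and a counter, the embedded Leaf ids come out unique without any coordination with the server.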
Let's say I have a data structure like
type User struct {
    UUid      string
    Username  string
    Email     string
    Password  string
    FirstName string
    LastName  string
}
I am storing Users []User in a key/value database (LevelDB). The unique key will be UUid, and the user struct will be encoded and stored against this UUID.
var network bytes.Buffer // Stand-in for a network connection
enc := gob.NewEncoder(&network)
err := enc.Encode(user)
if err != nil {
    log.Println("Error in encoding gob")
    return "", err
}
err = dbSession.DBSession.Put([]byte(user.UUid), network.Bytes(), nil)
Since the key for all the entries is the unique UUID, I want to make a secondary index on email so that I don't necessarily have to scan all the entries present in the database to find the one corresponding to a given email.
What I have done:
I have created a key called SIndex and stored a map[string]string data structure in it, where each key is an email and the value is the UUID. Every time a new entry comes in, this SIndex is updated to accommodate the new UUID and email.
It's a bad approach:
Because as the data grows, the whole map corresponding to SIndex needs to be fetched and decoded; if the email doesn't exist, a new key is added to SIndex, which is then encoded and stored back again.
A B-tree would be a better fit.
My question: is it right to store secondary index data in the database itself? If not, what strategies should I use to implement a secondary index? I know the choice of secondary index is greatly influenced by the data, but are there any good out-of-the-box indexing algorithms other than B-trees and hash maps?
Is it right to store secondary index data in the Database itself
Yes, this is okay. But as pointed out by Jonas in the comment, you should put the email as key and UUID as value. Another option is to use email as the key for your database instead of using UUID. This way you don't need to use a secondary index.
As another strategy for better performance, you can use an in-memory database such as Redis (or perhaps LevelDB itself, configured to keep its data in memory) to store the secondary index (email as key and UUID as value).
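For example, here is a rough sketch of that idea against the same goleveldb API used in the question; the idx:email: key prefix and the helper names are my own choices, not anything LevelDB requires:

package main

import (
    "bytes"
    "encoding/gob"
    "log"

    "github.com/syndtr/goleveldb/leveldb"
)

type User struct {
    UUid     string
    Username string
    Email    string
}

// putUser stores the encoded user under its UUID and writes one small
// secondary-index entry per user instead of one big SIndex map.
func putUser(db *leveldb.DB, user User) error {
    var buf bytes.Buffer
    if err := gob.NewEncoder(&buf).Encode(user); err != nil {
        return err
    }
    if err := db.Put([]byte(user.UUid), buf.Bytes(), nil); err != nil {
        return err
    }
    // Index entry: "idx:email:<email>" -> UUID (the prefix is an arbitrary choice).
    return db.Put([]byte("idx:email:"+user.Email), []byte(user.UUid), nil)
}

// getUserByEmail does two point lookups: the index entry, then the user itself.
func getUserByEmail(db *leveldb.DB, email string) (User, error) {
    var user User
    uuid, err := db.Get([]byte("idx:email:"+email), nil)
    if err != nil {
        return user, err
    }
    raw, err := db.Get(uuid, nil)
    if err != nil {
        return user, err
    }
    err = gob.NewDecoder(bytes.NewReader(raw)).Decode(&user)
    return user, err
}

func main() {
    db, err := leveldb.OpenFile("users.db", nil)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    if err := putUser(db, User{UUid: "1234", Username: "fred", Email: "fred@example.com"}); err != nil {
        log.Fatal(err)
    }
    u, err := getUserByEmail(db, "fred@example.com")
    if err != nil {
        log.Fatal(err)
    }
    log.Println(u.UUid)
}

Each user now costs one extra small Put, and a lookup by email is two point reads instead of fetching and decoding one ever-growing map.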
Are there any good out of box indexing algorithms other than B-Tree, HashMaps
Anyway, B-trees and hash maps are data structures, not algorithms. And what you did is actually not indexing with a hash map; it's just storing a hash map as the value of one key. Indexing usually depends on the DBMS implementation (we can only choose from the options it provides).
As for the data structures used for indexing, whether one is good or not really depends on the use case. For example, if you need to do range searches you can use a B-tree (the default in most DBMSs), a B+ tree (the default in MySQL InnoDB), or a skip list (Redis uses this data structure for its Sorted Sets). You can read more about secondary indexing with Redis Sorted Sets here.
For your case, you only need to store email as key and UUID as value; a hash table is commonly used for this kind of exact-match lookup, and many DBMSs use it to get O(1) primary-key access. Note, however, that LevelDB itself is not hash-based: it is an LSM-tree that keeps keys sorted (a skip list in memory plus sorted tables on disk), so point lookups are fast and prefix/range scans over keys are cheap as well.
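Because the keys are sorted, you can also walk all the index entries with a prefix iterator; here is a small sketch reusing the idx:email: prefix from the snippet above (again my own naming):

// Requires "fmt", "log", "github.com/syndtr/goleveldb/leveldb" and
// "github.com/syndtr/goleveldb/leveldb/util".
func listEmailIndex(db *leveldb.DB) {
    iter := db.NewIterator(util.BytesPrefix([]byte("idx:email:")), nil)
    defer iter.Release()
    for iter.Next() {
        // Key is "idx:email:<email>", value is the UUID it points to.
        fmt.Printf("%s -> %s\n", iter.Key(), iter.Value())
    }
    if err := iter.Error(); err != nil {
        log.Fatal(err)
    }
}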
I need to store a list of strings as a field along with the id: listId, <list>.
Now I need the following operations in O(1) time:
Removing a given string from an existing listId.
Adding a new string to an existing listId.
Is there any DB that could support the above operations? Having a HashSet as one of its data types would help. Note that I need a highly scalable solution where the lists could have 10M keys across 1000+ listIds.
I understand that such a data type, if it existed in any database, would have considerable indexing overhead. I believe the chances are really slim that something similar exists. If not, then I will implement something myself.
What you describe sounds like a textbook case for normalization.
You'd have two tables: one that contains the lists, and another that contains the list elements.
They are linked through the list ID:
Lists table:
id, name (+ whatever else you need)
List elements table:
id, listId (references an id in the Lists table), value (the string itself, + whatever else you need)
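If it helps, here is one way those two tables and the two operations could look; the sketch uses Go's database/sql with SQLite purely as an example backend, and all table/column names are mine:

package main

import (
    "database/sql"
    "log"

    _ "github.com/mattn/go-sqlite3" // any SQL driver works; SQLite is just for the sketch
)

func main() {
    db, err := sql.Open("sqlite3", "lists.db")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Two normalized tables; the unique index on (list_id, value) lets a
    // single string be added or removed without touching the rest of the list.
    schema := `
    CREATE TABLE IF NOT EXISTS lists (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE IF NOT EXISTS list_elements (
        id      INTEGER PRIMARY KEY,
        list_id INTEGER NOT NULL REFERENCES lists(id),
        value   TEXT NOT NULL,
        UNIQUE (list_id, value)
    );`
    if _, err := db.Exec(schema); err != nil {
        log.Fatal(err)
    }

    // Adding a new string to an existing list.
    if _, err := db.Exec(`INSERT INTO list_elements (list_id, value) VALUES (?, ?)`, 1, "hello"); err != nil {
        log.Fatal(err)
    }
    // Removing a given string from an existing list.
    if _, err := db.Exec(`DELETE FROM list_elements WHERE list_id = ? AND value = ?`, 1, "hello"); err != nil {
        log.Fatal(err)
    }
}

With the unique index in place, both operations touch a single index entry rather than reading the whole list back.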
I have to implement a database using B-trees for a school project. The database is for storing audio files (songs), and a number of different queries can be made, like asking for all the songs of a given artist or a specific album.
The intuitive idea is to use one B-tree for each field (songs, albums, artists, ...). The problem is that one can be asked to delete any member of any field, and in case you delete an artist you have to delete all of his albums and songs from the other B-trees, keeping in mind that, for example, all the songs of a given artist don't have to be near each other in the B-tree that corresponds to songs.
My question is: is there a way to do this (delete the songs after an artist has been deleted) without having to iterate over all the elements of the other B-trees? I'm not looking for code, just ideas, because all the ones I've come up with are brute-force ones.
This is my understanding and may not be entirely right.
Typically in a database implementation B-trees are used for indexes, so unless you want to force your user to index every column, defaulting to creating a B-tree for each field is unnecessary. Although that many indexes will lead to fast reads in virtually every case (with an index on everything, you won't have to do a full table scan), it will also cause extremely slow inserts/updates/deletes, as the corresponding data has to be updated in each tree. As I'm sure you know, modern databases require you to have at least one index (the primary key), so you will have at least one B-tree keyed on the primary key, with a pointer to the corresponding record.
Every node in a B-tree index should have a pointer/reference to the full object it represents.
Any further indexes you create would include the attributes you specify in the index, such as song name, artist, etc., but would still contain the pointer/reference to the corresponding record. Thus, when you modify, let's say, the song title, you will want to modify the referenced record, which all the indexes point to. If any index has the modified attribute as one of its keys, you will have to modify the values in that index itself.
Unfortunately I believe you are correct: you will have to brute-force your way through the other B-trees when deleting/updating, and this is one of the downsides of using a lot of indexes (slower update/delete times). If you just delete the referenced records, you will likely end up with pointers to deleted objects, which will (depending on your language) give you some form of NullPointerException. To prevent this, the references have to be removed from all the trees.
Keep in mind though that doing a full scan of your indexes will still be much better than doing full table scans.
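To make the bookkeeping concrete, here is a tiny in-memory stand-in, with plain Go maps playing the role of the B-trees and made-up Song fields; it shows how deleting an artist forces you to visit every other index that references the affected songs:

package main

import "fmt"

type Song struct {
    ID     int
    Title  string
    Artist string
    Album  string
}

// Primary "table": primary key -> full record.
var songs = map[int]Song{}

// Secondary indexes: attribute value -> primary keys (stand-ins for the B-trees).
var byArtist = map[string][]int{}
var byAlbum = map[string][]int{}

func insert(s Song) {
    songs[s.ID] = s
    byArtist[s.Artist] = append(byArtist[s.Artist], s.ID)
    byAlbum[s.Album] = append(byAlbum[s.Album], s.ID)
}

// deleteArtist removes the artist's songs and the entries that point to them
// in every other index, so no index is left holding dangling references.
func deleteArtist(artist string) {
    for _, id := range byArtist[artist] {
        s := songs[id]
        byAlbum[s.Album] = removeID(byAlbum[s.Album], id) // repeat for every other index
        delete(songs, id)
    }
    delete(byArtist, artist)
}

func removeID(ids []int, id int) []int {
    out := ids[:0]
    for _, v := range ids {
        if v != id {
            out = append(out, v)
        }
    }
    return out
}

func main() {
    insert(Song{ID: 1, Title: "One", Artist: "A", Album: "X"})
    insert(Song{ID: 2, Title: "Two", Artist: "A", Album: "Y"})
    deleteArtist("A")
    fmt.Println(len(songs), len(byAlbum["X"]), len(byAlbum["Y"])) // 0 0 0
}

Every index still has to be visited for every deleted song; the best you can do is use each song's own attribute values (s.Album here) to locate the entries to remove rather than scanning each tree end to end.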
I have some entities of a kind, and I need to keep them within a limited amount by discarding old ones, just like log-entry maintenance. Is there any good approach on GAE to do this?
Options in my mind:
Option 1. Add a Date property to each of these entities. Create a cron job that checks the datastore statistics daily. If the size exceeds the limit, query entities of that kind sorted by date, oldest first, and delete them until the size is less than, for example, 0.9 * max_limit.
Option 2. Option 1 requires an additional indexed property. I observed that the entity key ids seem to be increasing, so I'd like to query only keys, sort them in ascending order, and delete the ones with smaller ids. This does not require an additional property (date) or index. But I'm seriously worried about whether the key id is guaranteed to keep increasing.
I think this is a common data-maintenance task. Is there any mature way to do it?
By the way, a tiny ad for my app, free and purely for coder's fun! http://robotypo.appspot.com
You cannot assume that the IDs are always increasing. The docs about ID gen only guarantee that:
IDs allocated in this manner will not be used by the Datastore's
automatic ID sequence generator and can be used in entity keys without
conflict.
The default sort order is also not guaranteed to be sorted by ID number:
If no sort orders are specified, the results are returned in the order
they are retrieved from the Datastore.
which is vague and doesn't say that the default order is by ID.
One solution may be to use a rotating counter that keeps track of the first element. When you want to add new entities: fetch the counter, increment it, mod it by the limit, and add a new element whose ID is the value of the counter. This must all be done in a transaction to guarantee that the counter isn't being incremented by another request. The new element with the same key will overwrite the one that was there, if any.
When you want to fetch them all, you can manually generate the keys (since they are all known), do a bulk fetch, sort by ID, then split them into two parts at the value of the counter and swap those.
If you want the IDs to be unique, you could maintain a separate counter (using transactions to modify and read it, so you're safe) and create entities whose IDs are its value, then delete old IDs when the limit is reached.
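A rough sketch of that rotating counter as a transaction, using the Go App Engine datastore package; the kind names, the limit, and the LogEntry type are all assumptions:

package logring

import (
    "context"

    "google.golang.org/appengine/datastore"
)

const limit = 1000 // assumed maximum number of entities to keep

type Counter struct {
    Next int64
}

type LogEntry struct {
    Message string
}

// addEntry claims the next slot in the ring inside a transaction, then writes
// the entry under that slot's ID, overwriting whatever was there before.
func addEntry(ctx context.Context, e *LogEntry) error {
    return datastore.RunInTransaction(ctx, func(tc context.Context) error {
        counterKey := datastore.NewKey(tc, "Counter", "ring", 0, nil)

        var c Counter
        if err := datastore.Get(tc, counterKey, &c); err != nil && err != datastore.ErrNoSuchEntity {
            return err
        }
        slot := c.Next % limit // the slot to (over)write
        c.Next = slot + 1

        if _, err := datastore.Put(tc, counterKey, &c); err != nil {
            return err
        }
        entryKey := datastore.NewKey(tc, "LogEntry", "", slot+1, nil) // numeric IDs must be > 0
        _, err := datastore.Put(tc, entryKey, e)
        return err
    }, &datastore.TransactionOptions{XG: true}) // counter and entry live in different entity groups
}

Reads can then regenerate the keys 1..limit, bulk-fetch them, and rotate the result at the counter value, as described above.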
You can create a second entity (let's call it A) that keeps a list of the keys of the entities you want to limit, like this (pseudo-code):
class A:
    List<Key> limitedEntities;
When you add a new entity, you add its key to the list in A. If the length of the list exceeds the limit, you take the first element of the list and remove the corresponding entity.
Notice that when you add or delete an entity, you should modify the list in entity A within a transaction. Since these entities belong to different entity groups, you should consider using cross-group transactions.
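Here is roughly what that could look like with the Go App Engine datastore package; the kind name "A", the singleton key, and the helper are my own:

// Uses "context" and "google.golang.org/appengine/datastore", as in the sketch above.
type A struct {
    LimitedEntities []*datastore.Key
}

// addLimited records the new entity's key in A and, when over the limit,
// evicts the oldest entity. Both updates happen in one cross-group transaction.
func addLimited(ctx context.Context, newKey *datastore.Key, limit int) error {
    return datastore.RunInTransaction(ctx, func(tc context.Context) error {
        aKey := datastore.NewKey(tc, "A", "singleton", 0, nil)
        var a A
        if err := datastore.Get(tc, aKey, &a); err != nil && err != datastore.ErrNoSuchEntity {
            return err
        }
        a.LimitedEntities = append(a.LimitedEntities, newKey)
        if len(a.LimitedEntities) > limit {
            oldest := a.LimitedEntities[0]
            a.LimitedEntities = a.LimitedEntities[1:]
            if err := datastore.Delete(tc, oldest); err != nil {
                return err
            }
        }
        _, err := datastore.Put(tc, aKey, &a)
        return err
    }, &datastore.TransactionOptions{XG: true})
}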
Hope this helps!
What is a document data store? What is a key-value data store?
Please describe, in very simple and general words, the mechanisms behind each of them.
In a document data store each record has multiple fields, similar to a relational database. It also has secondary indexes.
Example record:
"id" => 12345,
"name" => "Fred",
"age" => 20,
"email" => "fred@example.com"
Then you could query by id, name, age, or email.
A key/value store is more like a big hash table than a traditional database: each key corresponds with a value and looking things up by that one key is the only way to access a record. This means it's much simpler and often faster, but it's difficult to use for complex data.
Example record:
12345 => "Fred,fred@example.com,20"
You can only use 12345 for your query criteria. You can't query for name, email, or age.
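A toy illustration of that limitation in Go, with a plain map standing in for the key/value store:

package main

import (
    "fmt"
    "strings"
)

// The key/value store: the only access path is the key.
var store = map[int]string{
    12345: "Fred,fred@example.com,20",
}

func main() {
    // Lookup by key: one cheap operation.
    fmt.Println(store[12345])

    // "Query" by email: nothing to do but walk every record and parse it,
    // which is exactly what a secondary index in a document store avoids.
    for id, record := range store {
        if strings.Contains(record, "fred@example.com") {
            fmt.Println("found", id)
        }
    }
}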
Here's a description of a few common data models:
Relational systems are the databases we've been using for a while now. RDBMSs and systems that support ACID transactions and joins are considered relational.
Key-value systems basically support get, put, and delete operations based on a primary key.
Column-oriented systems still use tables but have no joins (joins must be handled within your application). Obviously, they store data by column as opposed to traditional row-oriented databases. This makes aggregations much easier.
Document-oriented systems store structured "documents" such as JSON or XML but have no joins (joins must be handled within your application). It's very easy to map data from object-oriented software to these systems.
From this blog post I wrote: Visual Guide to NoSQL Systems.
From Wikipedia:
Document data store: As opposed to relational databases, document-based databases do not store data in tables with uniform sized fields for each record. Instead, each record is stored as a document that has certain characteristics. Any number of fields of any length can be added to a document. Fields can also contain multiple pieces of data.
Key Value: An associative array (also associative container, map, mapping, dictionary, finite map, and in query-processing an index or index file) is an abstract data type composed of a collection of unique keys and a collection of values, where each key is associated with one value (or set of values). The operation of finding the value associated with a key is called a lookup or indexing, and this is the most important operation supported by an associative array. The relationship between a key and its value is sometimes called a mapping or binding. For example, if the value associated with the key "bob" is 7, we say that our array maps "bob" to 7.
More examples at NoSQL.