Let's say I have a data structure like:
type User struct {
    UUid      string
    Username  string
    Email     string
    Password  string
    FirstName string
    LastName  string
}
I am storing a Users []User collection in a key/value database (LevelDB). The unique key will be the UUid, and the gob-encoded user struct will be stored against this UUID.
var network bytes.Buffer // Buffer that will hold the gob-encoded user
enc := gob.NewEncoder(&network)
err := enc.Encode(user)
if err != nil {
    log.Println("error gob-encoding user:", err)
    return "", err
}
err = dbSession.DBSession.Put([]byte(user.UUid), network.Bytes(), nil)
Since the key for all the entries is the unique UUID, I want to build a secondary index on email so that I don't have to scan every entry in the database to find the one corresponding to a given email.
What I have done:
I have created a key called SIndex and stored a map[string]string under it, where each key is an email and the value is the corresponding UUID. Every time a new entry comes in, this SIndex is updated to accommodate the new email/UUID pair.
Why it's a bad approach:
As the data grows, the whole map stored under SIndex has to be fetched and decoded; if the email doesn't exist, a new key is added to the map, which is then re-encoded and stored again.
A B-tree would be a better fit.
My question: Is it right to store secondary index data in the database itself? If not, what strategies should I use to implement a secondary index? I know the choice of secondary index is greatly influenced by the data, but are there any good out-of-the-box indexing algorithms other than B-trees and hash maps?
Is it right to store secondary index data in the Database itself
Yes, this is okay. But as pointed out by Jonas in the comment, you should put the email as key and UUID as value. Another option is to use email as the key for your database instead of using UUID. This way you don't need to use a secondary index.
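For illustration, here is a rough sketch of that per-email index entry, assuming dbSession.DBSession is a *leveldb.DB from github.com/syndtr/goleveldb/leveldb; the user:/email: key prefixes and helper names are just my own convention, not anything from the question:

package userstore

import (
    "bytes"
    "encoding/gob"

    "github.com/syndtr/goleveldb/leveldb"
)

type User struct {
    UUid      string
    Username  string
    Email     string
    Password  string
    FirstName string
    LastName  string
}

// SaveUser writes the gob-encoded user under "user:<uuid>" and a small
// secondary-index entry under "email:<email>" whose value is just the UUID.
func SaveUser(db *leveldb.DB, u User) error {
    var buf bytes.Buffer
    if err := gob.NewEncoder(&buf).Encode(u); err != nil {
        return err
    }
    if err := db.Put([]byte("user:"+u.UUid), buf.Bytes(), nil); err != nil {
        return err
    }
    // One tiny key per email: nothing to fetch, decode and rewrite as data grows.
    return db.Put([]byte("email:"+u.Email), []byte(u.UUid), nil)
}

// FindByEmail resolves the email to a UUID via the index, then loads the user.
func FindByEmail(db *leveldb.DB, email string) (User, error) {
    uuid, err := db.Get([]byte("email:"+email), nil)
    if err != nil {
        return User{}, err
    }
    raw, err := db.Get([]byte("user:"+string(uuid)), nil)
    if err != nil {
        return User{}, err
    }
    var u User
    err = gob.NewDecoder(bytes.NewReader(raw)).Decode(&u)
    return u, err
}

If you want the record and its index entry to stay consistent, both Puts can go into a single leveldb.Batch and be committed with db.Write.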
Another strategy for better performance is to use an in-memory database such as Redis (or perhaps LevelDB itself configured to keep its data in memory) to store the secondary index (email as key, UUID as value).
Are there any good out-of-the-box indexing algorithms other than B-Tree, HashMaps?
Anyway, B-trees and hash maps are data structures, not algorithms. And what you actually did is not indexing with a hash map; it's just storing a whole map as the value of a single key. Indexing usually depends on the DBMS implementation (we can only choose from the options it provides).
So, whether a data structure used for indexing is good or not really depends on the use case. For example, if you need to do range searches you can use a B-Tree (the default in most DBMSs), a B+ tree (the default in MySQL InnoDB), or a Skip List (Redis uses this data structure for its Sorted Set). You can read more about secondary indexing with Redis Sorted Sets here.
And for your case, you only need to store the email as the key and the UUID as the value. A hash table is commonly used for this kind of exact-match lookup; many DBMSs use it to provide primary key access in O(1) time. (LevelDB itself, though, is built on a log-structured merge-tree that keeps keys sorted, which is why prefix and range scans over keys are cheap there.)
Related
I need to write some metadata to an index in Lucene. This metadata describes the relationships between indexes, which helps me do cross-index queries.
The data structure of the metadata is a key-value pair. The key may be an Integer or a String, and the value is a list of Integers or Strings.
In the beginning, I tried to extend the Codec; obviously this key-value pair does not belong to any of the existing Formats. Then I tried writing it by adding a field, but it does not belong in the index either, and a field is hard to change.
How should I store this metadata? Thank you.
I'm new to DynamoDB but not to NoSQL (I've already done some projects using Firebase).
Having read that a DynamoDB best practice is one table per application, I've been having a hard time figuring out how to design my 1-to-N relationship.
I have this entity (pseudo-json):
{
    machineId: 'HASH_ID',
    machineConfig: /* a lot of fields */
}
A machineConfig is unique to each machine and changes rarely, and only by an administrator (no consistency issue here).
The issue is that I have to manage a log of data from the sensors of each machine. The log is described as:
{
    machineId: 'HASH_ID',
    sensorsData: [
        /* Huge list of: */
        { timestamp: ..., data: /* lot of fields */ },
        ...
    ]
}
I want to keep my machineConfig in one place. The log list can't be inserted into the machine entity because it's a continuous stream of data collected over time.
Furthermore, I don't understand what the composite key could be: the partition key is obviously the machineId, but what about the sort key?
How should I design this relationship, taking into account the potential size of the data?
You could do this with 1 table. The primary key could be (machineId, sortKey) where machineId is the partition key and sortKey is a string attribute that is going to be used to cover the 2 cases. You could probably come up with a better name.
To store the machineConfig you would insert an item with primary key (machineId, "CONFIG"). The sortKey attribute would have the constant value CONFIG.
To store the sensorsData you could use the timestamp as the sortKey value. You would insert a new item for each piece of sensor data. You would store the timestamp as a string (time since the epoch, ISO 8601, etc.).
Then to query everything about a machine you would run a Dynamo query specifying just the machineId partition key - this would return many items including the machineConfig and the sensor data.
To query just the machineConfig you would run a Dynamo query specifying the machineId partition key and the constant CONFIG as the sortKey value.
To query the sensor data you could specify an exact timestamp or a timestamp range for the sortKey. If you need to query the sensor data by other values then this design might not work as well.
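To make the layout concrete, here is a rough sketch in Go using aws-sdk-go (v1); the table name Machines, the machine id value, and the ISO 8601 timestamps are assumptions of mine, not part of the question:

package main

import (
    "fmt"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/dynamodb"
)

func main() {
    svc := dynamodb.New(session.Must(session.NewSession()))
    table := aws.String("Machines") // assumed table name
    // "#sk" is a placeholder for the sortKey attribute name, in case it ever
    // collides with a DynamoDB reserved word.
    names := map[string]*string{"#sk": aws.String("sortKey")}

    // 1) Everything about one machine: the CONFIG item plus all sensor items.
    all, err := svc.Query(&dynamodb.QueryInput{
        TableName:              table,
        KeyConditionExpression: aws.String("machineId = :m"),
        ExpressionAttributeValues: map[string]*dynamodb.AttributeValue{
            ":m": {S: aws.String("machine-123")},
        },
    })
    fmt.Println(all, err)

    // 2) Just the config: exact match on the constant sort key value.
    cfg, err := svc.Query(&dynamodb.QueryInput{
        TableName:                table,
        KeyConditionExpression:   aws.String("machineId = :m AND #sk = :cfg"),
        ExpressionAttributeNames: names,
        ExpressionAttributeValues: map[string]*dynamodb.AttributeValue{
            ":m":   {S: aws.String("machine-123")},
            ":cfg": {S: aws.String("CONFIG")},
        },
    })
    fmt.Println(cfg, err)

    // 3) Sensor data in a time range: ISO 8601 strings sort lexicographically,
    //    so BETWEEN on the sort key returns a chronological slice.
    window, err := svc.Query(&dynamodb.QueryInput{
        TableName:                table,
        KeyConditionExpression:   aws.String("machineId = :m AND #sk BETWEEN :from AND :to"),
        ExpressionAttributeNames: names,
        ExpressionAttributeValues: map[string]*dynamodb.AttributeValue{
            ":m":    {S: aws.String("machine-123")},
            ":from": {S: aws.String("2019-01-01T00:00:00Z")},
            ":to":   {S: aws.String("2019-02-01T00:00:00Z")},
        },
    })
    fmt.Println(window, err)
}

Note that the first query returns the CONFIG item mixed in with the sensor items, so the application has to tell them apart by looking at the sort key value.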
Editing to answer the follow-up question:
You would have to resort to a scan with a filter to return all machines with their machineId and machineConfig. If you end up inserting a lot of sensor data then this will be a very expensive operation to perform as Dynamo will look at every item in the table. If you need to do this you have a couple of options.
If there are not a lot of machines you could insert an item with a primary key like ("MACHINES", "ALL") and a list of all the machineIds. You would query on that key to get the list of machineIds, then you would do a bunch of queries (or a batch get) to retrieve all the related machineConfigs. However since the max Dynamo item size is 400KB you might not be able to fit them all.
If there are too many machines to fit in one item you could alter the above approach a bit and have ("MACHINES", $machineIdSubstring) as a primary key and store chunks of machineIds under each sort key. For example, all machineIds that start with 0 go in ("MACHINES", "0"). Then you would query by each primary key 0-9, build a list of all machineIds and query each machine as above.
Alternatively, you don't have to put everything in 1 table - it is just a guideline that fits a lot of use cases. If there are too many machines to fit in less than 400KB but there aren't tens of thousands and you aren't trying to query all of them all the time, you could have a separate table of machineId and machineConfig that you resort to scanning when necessary.
I do not want to create an autogenerated key for my entities so I specify my own:
Entity employee = Entity.newBuilder().setKey(makeKey("Employee", "bobby"))
        .addProperty(makeProperty("fname", makeValue("fname").setIndexed(false)))
        .addProperty(makeProperty("lname", makeValue("lname").setIndexed(false)))
        .build();

CommitRequest request = CommitRequest.newBuilder()
        .setMode(CommitRequest.Mode.NON_TRANSACTIONAL)
        .setMutation(Mutation.newBuilder().addInsert(employee))
        .build();

datastore.commit(request);
When I check to see what the entity looks like, it looks like this:
Why was this auto-generated key created if I specified my own key (bobby)? It seems bobby was also created, but now I have both bobby and this auto-generated key. What is the difference between the key and the id/name?
You can't specify your own key; keys actually contain information necessary for the Datastore operation. This note in the documentation gives you an idea:
Note: The URL-safe string looks cryptic, but it is not encrypted! It
can easily be decoded to recover the original entity's kind and
identifier:
key = Key(urlsafe=url_string)
kind_string = key.kind()
ident = key.id()
If you use such URL-safe keys, don't use sensitive data such as email
addresses as entity identifiers. (A possible solution would be to use
the MD5 hash of the sensitive data as the identifier. This stops third
parties, who can see the encrypted keys, from using them to harvest
email addresses, though it doesn't stop them from independently
generating their own hash of a known email address and using it to
check whether that address is present in the Datastore.)
What you can specify is the ID portion of the key, either as a number or as a string:
A key is a series of kind-ID pairs. You want to make sure each entity
has a key that is unique within its application and namespace. An
application can create an entity without specifying an ID; the
Datastore automatically generates a numeric ID. If an application
picks some IDs "by hand" and they're numeric and the application lets
the Datastore generate some IDs automatically, the Datastore might
choose some IDs that the application already used. To avoid this, the
application should "reserve" the range of numbers it will use to
choose IDs (or use string IDs to avoid this issue entirely).
This is the url-safe version of your key, suitable for use in links. Use KeyFactory.stringToKey to convert it to an actual key, and you'll see that it contains your string name.
What you create with makeKey("Employee", "bobby") is a key for an Entity of kind Employee with the name bobby. What you see as Key in the Datastore viewer is a representation of exactly that.
Generally speaking, a key always consists of:
optional parent key (with entity type and name/id)
entity type
entity name/id
Maybe someone here can tell you how to decode the key into its components, but rest assured that you're doing everything right and the behavior is as expected.
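For what it's worth, here is a minimal sketch of that decoding, using the Go client library cloud.google.com/go/datastore rather than the Java proto API from the question (so treat the exact mapping as an assumption); it simply shows that the URL-safe string encodes the kind and the name you chose, not a second auto-generated ID:

package main

import (
    "fmt"

    "cloud.google.com/go/datastore"
)

func main() {
    // A key with kind "Employee" and the string name "bobby", no parent.
    key := datastore.NameKey("Employee", "bobby", nil)

    // Encode produces the URL-safe string shown in the viewer.
    encoded := key.Encode()
    fmt.Println(encoded)

    // Decoding it recovers the kind and the name.
    decoded, err := datastore.DecodeKey(encoded)
    if err != nil {
        panic(err)
    }
    fmt.Println(decoded.Kind, decoded.Name) // Employee bobby
}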
I've just started looking into Amazon's DynamoDB. Obviously the scalability appeals, but I'm trying to get my head out of SQL mode and into NoSQL mode. Can this be done (with all the scalability advantages of DynamoDB):
Have a load of entries (say 5-10 million) indexed by some number. One of the fields in each entry will be a creation date. Is there an effective way for DynamoDB to give my web app all the entries created between two dates?
A simpler question: can DynamoDB give me all entries in which a field matches a certain number? That is, there'll be another field that is a number, for argument's sake let's say between 0 and 10. Can I ask DynamoDB to give me all the entries which have the value, e.g., 6?
Do both of these queries need a scan of the entire dataset (which I assume is a problem given the dataset size)?
many thanks
Is there an effective way for DynamoDB to give my web app all the
entries created between two dates?
Yup, please have a look at the Primary Key concept within the Amazon DynamoDB Data Model, specifically the Hash and Range Type Primary Key:
In this case, the primary key is made of two attributes. The first
attribute is the hash attribute and the second one is the range
attribute. Amazon DynamoDB builds an unordered hash index on the hash
primary key attribute and a sorted range index on the range primary
key attribute. [...]
The listed samples feature your use case exactly: the Reply (Id, ReplyDateTime, ...) table uses a primary key of type Hash and Range, with hash attribute Id and range attribute ReplyDateTime.
You'll use this via the Query API, see RangeKeyCondition for details and Querying Tables in Amazon DynamoDB for respective examples.
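As a hedged sketch only: current SDKs express the RangeKeyCondition as a key condition expression, and the Reply, Id and ReplyDateTime names follow the documentation sample above, while everything else (table contents, values, error handling) is my assumption. In Go with aws-sdk-go (v1) such a date-range query could look like:

package main

import (
    "fmt"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/dynamodb"
)

func main() {
    svc := dynamodb.New(session.Must(session.NewSession()))

    // All replies under one hash key whose ReplyDateTime falls between two dates.
    out, err := svc.Query(&dynamodb.QueryInput{
        TableName:                aws.String("Reply"),
        KeyConditionExpression:   aws.String("#id = :id AND ReplyDateTime BETWEEN :from AND :to"),
        ExpressionAttributeNames: map[string]*string{"#id": aws.String("Id")},
        ExpressionAttributeValues: map[string]*dynamodb.AttributeValue{
            ":id":   {S: aws.String("Amazon DynamoDB#DynamoDB Thread 1")},
            ":from": {S: aws.String("2012-01-01T00:00:00Z")},
            ":to":   {S: aws.String("2012-12-31T23:59:59Z")},
        },
    })
    if err != nil {
        fmt.Println("query failed:", err)
        return
    }
    fmt.Println("replies in range:", len(out.Items))
}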
can DynamoDB give me all entries in which a field matches a certain
number. [...] Can I ask DynamoDB to give
me all the entries which have value e.g. 6?
This is possible as well, albeit by means of the Scan API only (i.e. it does require reading every item in the table); see ScanFilter for details and Scanning Tables in Amazon DynamoDB for respective examples.
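Again purely as a sketch under assumptions (the table name Entries and the attribute name rating are made up; FilterExpression is the expression-based counterpart of the ScanFilter parameter), a Go version with aws-sdk-go (v1) might look like:

package main

import (
    "fmt"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/dynamodb"
)

func main() {
    svc := dynamodb.New(session.Must(session.NewSession()))

    // A Scan reads every item in the table and only then applies the filter,
    // so you pay for the full read even though only matching items come back.
    out, err := svc.Scan(&dynamodb.ScanInput{
        TableName:        aws.String("Entries"), // assumed table name
        FilterExpression: aws.String("#r = :v"),
        ExpressionAttributeNames: map[string]*string{
            "#r": aws.String("rating"), // assumed name of the 0-10 field
        },
        ExpressionAttributeValues: map[string]*dynamodb.AttributeValue{
            ":v": {N: aws.String("6")}, // DynamoDB numbers travel as strings
        },
    })
    if err != nil {
        fmt.Println("scan failed:", err)
        return
    }
    fmt.Println("items with value 6:", len(out.Items))
}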
Do both of these queries need a scan of the entire dataset (which I
assume is a problem given the dataset size?)
As mentioned, the first approach works with a Query while the second requires a Scan. Generally, a query operation is more efficient than a scan operation - this is good advice to get started with, though the details are more complex and depend on your use case; see the section Scan and Query Performance within the Query and Scan in Amazon DynamoDB overview:
For quicker response times, design your tables in a way that can use
the Query, Get, or BatchGetItem APIs, instead. Or, design your
application to use scan operations in a way that minimizes the impact
on your table's request rate. For more information, see Provisioned Throughput Guidelines in Amazon DynamoDB.
So, as usual when applying NoSQL solutions, you might need to adjust your architecture to accommodate these constraints.
What is a document data store? What is a key-value data store?
Please describe, in very simple and general words, the mechanisms behind each of them.
In a document data store, each record has multiple fields, similar to a relational database. It also has secondary indexes.
Example record:
"id" => 12345,
"name" => "Fred",
"age" => 20,
"email" => "fred#example.com"
Then you could query by id, name, age, or email.
A key/value store is more like a big hash table than a traditional database: each key corresponds with a value and looking things up by that one key is the only way to access a record. This means it's much simpler and often faster, but it's difficult to use for complex data.
Example record:
12345 => "Fred,fred@example.com,20"
You can only use 12345 for your query criteria. You can't query for name, email, or age.
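To make the difference concrete, here is a tiny illustrative sketch in Go, with plain maps standing in for the two kinds of store; it is an analogy only, not a real database API:

package main

import "fmt"

// A key/value store is essentially one big map: the key is the only handle.
var kvStore = map[string]string{
    "12345": "Fred,fred@example.com,20",
}

// A document store keeps named fields per record, so secondary indexes
// (e.g. by email) can be maintained and queried.
type doc map[string]interface{}

var docStore = map[string]doc{
    "12345": {"name": "Fred", "age": 20, "email": "fred@example.com"},
}

// emailIndex stands in for a secondary index the document store maintains.
var emailIndex = map[string]string{"fred@example.com": "12345"}

func main() {
    // Key/value: the only query is "give me the value for this key".
    fmt.Println(kvStore["12345"])

    // Document store: look up by a field other than the primary key.
    id := emailIndex["fred@example.com"]
    fmt.Println(docStore[id]["name"])
}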
Here's a description of a few common data models:
Relational systems are the databases we've been using for a while now. RDBMSs and systems that support ACIDity and joins are considered relational.
Key-value systems basically support get, put, and delete operations based on a primary key.
Column-oriented systems still use tables but have no joins (joins must be handled within your application). Obviously, they store data by column as opposed to traditional row-oriented databases. This makes aggregations much easier.
Document-oriented systems store structured "documents" such as JSON or XML but have no joins (joins must be handled within your application). It's very easy to map data from object-oriented software to these systems.
From this blog post I wrote: Visual Guide to NoSQL Systems.
From Wikipedia:
Document data store: As opposed to relational databases, document-based databases do not store data in tables with uniform sized fields for each record. Instead, each record is stored as a document that has certain characteristics. Any number of fields of any length can be added to a document. Fields can also contain multiple pieces of data.
Key Value: An associative array (also associative container, map, mapping, dictionary, finite map, and in query-processing an index or index file) is an abstract data type composed of a collection of unique keys and a collection of values, where each key is associated with one value (or set of values). The operation of finding the value associated with a key is called a lookup or indexing, and this is the most important operation supported by an associative array. The relationship between a key and its value is sometimes called a mapping or binding. For example, if the value associated with the key "bob" is 7, we say that our array maps "bob" to 7.
More examples at NoSQL.