Reverse Indexing and Data Modeling in a Key-Value Store

I am new to key-value stores. My objective is to use an embedded key-value store to keep the persistent data model. The data model would comprise a few related tables if designed with a conventional RDBMS. I was reading a Medium article on modeling a table for a key-value store. Although the article uses LevelDB with Java, I am planning to use RocksDB or FASTER with C++ for my work.
It uses a scheme where one key is used for every attribute of each row, like the following example.
$table_name:$primary_key_value:$attribute_name = $value
The above is fine for point lookups, when the user code knows exactly which key to get. But there are scenarios like searching for users with the same email address, searching for users above a certain age, or searching for users of one specific gender. In these search scenarios the article performs a linear scan through all keys. In each iteration it checks the pattern of the key and, once a key with a matching pattern is found, applies the business logic (checking the value for a match).
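The scan-and-filter search described above can be sketched like this, with an ordered dict standing in for the embedded store (all names and data are illustrative, not the article's code):

```python
# In-memory stand-in for an embedded ordered key-value store
# (a real implementation would walk a RocksDB iterator instead).
store = {
    "users:1:email": "alice@example.com",
    "users:1:age": "34",
    "users:2:email": "bob@example.com",
    "users:2:age": "27",
}

def find_users_by_email(store, email):
    """Linear scan: check every key's pattern, then match the value."""
    matches = []
    for key, value in sorted(store.items()):
        table, pk, attr = key.split(":", 2)
        if table == "users" and attr == "email" and value == email:
            matches.append(pk)
    return matches

print(find_users_by_email(store, "bob@example.com"))  # ['2']
```

Note that every lookup visits every key in the table, which is exactly the inefficiency discussed next.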
It seems that such searching is inefficient, and in the worst case it needs to traverse the entire store. To solve that, a reverse lookup table is required. My question is:
How to model the reverse lookup table? Is it some sort of reinventing the wheel? Is there any alternative way?
One solution that readily comes to mind is to have a separate store for each indexable property, like the following.
$table_name:$attribute_name:$value_1 = $primary_key_value
With this approach the immediate question is
How to handle collisions in this reverse lookup table? Multiple $primary_keys may be associated with the same value.
As an immediate solution, instead of storing a single value, an array of primary keys can be stored, as shown below.
$table_name:$attribute_name:$value_1 = [$primary_key_value_1, ... , $primary_key_value_N]
But such modeling requires the user code to parse the array from a string and serialize it back to a string after each manipulation (assuming the underlying key-value store is not aware of array values).
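The read-modify-write round-trip this implies might look as follows, assuming JSON is used for the array encoding (a sketch; the key names are illustrative):

```python
import json

store = {}  # stand-in for the key-value store

def add_to_index(store, table, attr, value, pk):
    """Read-modify-write: parse the stored array, append, serialize back."""
    key = f"{table}:{attr}:{value}"
    pks = json.loads(store.get(key, "[]"))
    if pk not in pks:
        pks.append(pk)
    store[key] = json.dumps(pks)

add_to_index(store, "users", "email", "alice@example.com", "1")
add_to_index(store, "users", "email", "alice@example.com", "7")
print(store["users:email:alice@example.com"])  # ["1", "7"]
```

Every insert re-reads and re-writes the whole list, which becomes expensive once many primary keys share one value.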
Is it efficient to store multiple keys as an array value? Or is there a vendor-provided, more efficient way?
Assuming the stringified-array design works, there has to be such an index for each indexable property. This gives fine-grained control over what to index and what not to index. The next design decision is where these indexes will be stored.
Should the indexes be stored in a separate store/file, or in the same store/file the actual data belongs to? Should there be a different store for each property?
For this question I don't have a clue, because both approaches require more or less the same amount of I/O. However, a single large data file keeps more data on disk and less in memory (so more I/O), whereas with multiple files more data fits in memory, so there are fewer page faults. This assumption could be totally wrong depending on the architecture of the specific key-value store. At the same time, having too many files becomes a problem of managing a complicated file structure. Also, maintaining indexes requires transactions for insert, update and delete operations. With multiple files a single logical update touches multiple trees, whereas with a single file it becomes multiple updates within a single tree.
Are transactions, specifically transactions spanning multiple stores/files, supported?
Besides the indices, some meta information about the table also needs to be kept along with the table data. To generate a new (auto-incremented) primary key, prior knowledge of the last row number or last generated primary key is required, because something like COUNT(*) won't work. Additionally, since not all properties are indexed, the meta information may include which properties are indexed and which are not.
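One common way to keep such a counter is a dedicated meta key that is read, incremented and written back. A minimal sketch, assuming a `$table:meta:last_pk` key layout (the key name is made up for illustration):

```python
store = {}  # stand-in for the key-value store

def next_primary_key(store, table):
    # The $table:meta:last_pk key name is made up for illustration.
    # In a real store this read-increment-write must run inside a
    # transaction (or use an atomic merge/increment operator).
    meta_key = f"{table}:meta:last_pk"
    new_pk = int(store.get(meta_key, "0")) + 1
    store[meta_key] = str(new_pk)
    return new_pk

print(next_primary_key(store, "users"))  # 1
print(next_primary_key(store, "users"))  # 2
```

Without atomicity, two concurrent writers could read the same counter and hand out duplicate primary keys.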
How to store the meta information of each table ?
The same set of questions applies to the meta table as well, e.g. should the meta be a separate store/file? Additionally, since not all properties are indexed, we may even decide to store each row as a JSON-encoded value in the data store and keep it alongside the index stores. The underlying key-value store will treat that JSON as a plain string value, like the following.
$table_name:data:$primary_key_value = {$attr_1_name: $attr_1_value, ..., $attr_N_name: $attr_N_value}
...
$table_name:index:$attribute_name = [$primary1, ..., $primaryN]
However reverse lookups are still possible through the indexes pointing towards the primary key.
Are there any drawbacks of using JSON-encoded values instead of storing all properties as separate keys?
So far I could not find any drawbacks with this method, other than forcing the user to use JSON encoding, and some heap allocation for JSON encoding/decoding.
The problems mentioned above are not specific to any particular application. They are generic enough to apply to all development using a key-value store. So it is essential to know whether any wheel is being reinvented here.
Is there any de facto standard solution to all the problems mentioned in the question? Do the solutions differ from the ones stated in the question?

How to model the reverse lookup table? Is it some sort of reinventing the wheel? Is there any alternative way?
All the ways you describe are valid ways to create an index.
It is not reinventing the wheel in RocksDB's case, because RocksDB does not support indices.
It really depends on the data; in general you will need to copy the index value and the primary key into another keyspace to create the index.
How to handle collisions in this reverse lookup table? Multiple $primary_keys may be associated with the same value.
You can serialize the pks using JSON (or something else). The problem with that approach is when the list of pks grows very large (which might or might not be an issue).
Is it efficient to store multiple keys as an array value? Or is there a vendor-provided, more efficient way?
With RocksDB, there is nothing that will make it "easier".
You did not mention the following approach:
$table_name:$attribute_name:$value_1:$primary_key_value_1 = ""
$table_name:$attribute_name:$value_1:$primary_key_value_2 = ""
...
$table_name:$attribute_name:$value_1:$primary_key_value_n = ""
Here the value is empty, and the indexed pk is part of the key.
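With an ordered store like RocksDB, this key layout turns an equality search into a prefix scan over the sorted keyspace. A rough sketch, using a plain dict plus `sorted()` in place of a real RocksDB iterator (all names and data are illustrative):

```python
# Index entries where the pk is the key's tail and the value is empty
# (sorted() mimics the ordered iteration of an LSM/B-tree store).
store = {
    "users:email:alice@example.com:1": "",
    "users:email:alice@example.com:7": "",
    "users:email:bob@example.com:2": "",
}

def pks_for(store, table, attr, value):
    """Seek to the prefix, then read pks out of the key tails."""
    prefix = f"{table}:{attr}:{value}:"
    return [key[len(prefix):] for key in sorted(store)
            if key.startswith(prefix)]

print(pks_for(store, "users", "email", "alice@example.com"))  # ['1', '7']
```

Because each pk is its own entry, adding or removing one pk is a single put/delete rather than a read-modify-write of a whole list, and pagination just resumes iteration after the last key returned.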
Should the indexes be stored in a separate store/file, or in the same store/file the actual data belongs to? Should there be a different store for each property?
It depends on the key-value store. With RocksDB, if you need transactions, you must stick to one db file.
Is transaction more specifically transaction involving multiple store/files supported ?
Only Oracle Berkeley DB and WiredTiger support that feature.
How to store the meta information of each table ?
Metadata can live in the database or in the code.
Are there any drawbacks of using JSON-encoded values instead of storing all properties as separate keys?
Yes, as I said above: if you encode all pks into a single value, it might lead to problems downstream when the number of pks is large. For instance, you need to read the whole list to do pagination.
Is there any de facto standard solution to all the problems mentioned in the question? Do the solutions differ from the ones stated in the question?
To summarize:
With RocksDB, use a single database file.
In the index, encode the primary key inside the key and leave the value empty, so that you can paginate.

Related

Database performance: Using one entity/table with the max. possible properties or split to different entities/tables?

I need to design some database tables, but I'm not sure about the performance impact. In my case it's more about read performance than about saving the data.
The situation
With the help of pattern recognition I'm finding out how many values of a certain object need to be saved in my PostgreSQL database.
Among other, let's say fixed, properties, the only difference is whether 1, 2 or 3 values of the same type need to be saved.
Currently I have 3 entities/tables which differ only in having 1, 2 or 3 non-nullable properties of the same type.
For example:
EntityTestOne/TableOne {
... other (same) properties
String optionOne;
}
EntityTestTwo/TableTwo {
... other (same) properties
String optionOne;
String optionTwo;
}
EntityTestThree/TableThree {
... other (same) properties
String optionOne;
String optionTwo;
String optionThree;
}
I expect to have several million records in production, and I'm wondering what the performance impact of this variant could be and what the alternatives are.
Alternatives
Other options which come into my mind:
Use only one entity class or table with 3 options (optionTwo and optionThree would then be nullable). Talking of millions of expected records plus caching, I'm asking myself: isn't it a kind of 'waste' to save millions of null values in at least two (caching) layers (the database itself and Hibernate)? In another answer I read yesterday, saving a null value in PostgreSQL needs only 1 bit, which I think isn't that much if we talk about several millions of records that can contain some nullable properties (link).
Create another entity/table and use a collection (list or set) relationship instead
For example:
EntityOption {
String value;
}
EntityTest {
... other (same) properties
List<EntityOption> options;
}
If I use this relationship: what would give better performance when creating new records: creating new EntityOptions for every new EntityTest, or doing a lookup first and referencing an existing EntityOption if it exists? What about the read performance when fetching them later, and the joins that will be needed then?
Compared to the variant with one plain entity with three options, I can imagine it could be slightly slower...
As I'm not that strong in database design or working with Hibernate, I'm interested in the pros and cons of these approaches, and whether there are even more alternatives.
I would even like to ask whether PostgreSQL is the right choice for this, or whether I should think about using another (free) database.
Thanks!
The case is pretty clear in my opinion: If you have an upper limit of three properties per object, use a single table with nullable attributes.
A NULL value does not take up any space in the database. For every row, PostgreSQL stores a bitmap indicating which attributes are NULL. This bitmap is always stored, except when all attributes are not nullable. See the documentation for details.
So don't worry about storage space in this case.
Using three different tables or storing the attributes in a separate table will probably lead to UNIONs or JOINs in your queries, which will make the queries more complicated and slow.
There are many inheritance strategies for creating entity classes. I think you should go with the single-table strategy, where there is a discriminator column (managed by Hibernate itself); all common fields are used by every entity, while entity-specific fields are used by that specific entity and remain null for the others.
This gives improved read performance.
For your ref. :
http://www.thejavageek.com/2014/05/14/jpa-single-table-inheritance-example/

Datastore efficiency, low level API

Every Cloud Datastore query computes its results using one or more indexes, which contain entity keys in a sequence specified by the index's properties and, optionally, the entity's ancestors. The indexes are updated incrementally to reflect any changes the application makes to its entities, so that the correct results of all queries are available with no further computation needed.
Generally, I would like to know if
datastore.get(List<Key> listOfKeys);
is faster or slower than a query with the index file prepared (with the same results).
Query q = new Query("Kind").setFilter(someFilter);
My current problem:
My data consists of Layers and Points. Points belong to only one unique layer and have unique ids within a layer. I could load the points in several ways:
1) Have points with a "layer name" property and query with a filter.
- Here I am not sure whether the datastore would have the results prepared, because the layer name changes dynamically.
2) Use only keys. The layer would have to store point ids.
KeyFactory.createKey("Layer", "layer name");
KeyFactory.createKey("Point", "layer name"+"x"+"point id");
3) Use queries without filters: I don't actually need the general kind "Point" and could be more specific: the kind would be ("layer name"+"point id")
- What are the costs of creating more kinds? Could this be the fastest way?
Can you actually find out how the datastore works in detail?
faster or slower than a query with the index file prepared (with the same results).
Fundamentally a query and a get by key are not guaranteed to have the same results.
Queries are eventually consistent, while getting data by key is strongly consistent.
Your first challenge, before optimizing for speed, is probably ensuring that you're showing the correct data.
The docs are good at explaining eventual vs strong consistency. It sounds like you have the option of using an ancestor query, which can be strongly consistent. I would also strongly recommend avoiding using the 'name' - which is dynamic - as the entity name; this will cause you an excessive amount of grief.
Edit:
In the interests of being specifically helpful, one option for a working solution based on your description would be:
Give a unique id (a uuid probably) to each layer, store the name as a property
Include the layer key as the parent key for each point entity
Use an ancestor query when fetching points for a layer (which is strongly consistent)
An alternative option is to store points as embedded entities and only have one entity for the whole layer - depends on what you're trying to achieve.

Does JSONB make PostgreSQL arrays useless?

Suppose that you want to store "tags" on your object (say, a post). With release 9.4 you have 3 main choices:
tags as text[]
tags as jsonb
tags as text (and you store a JSON string as text)
In many cases the 3rd would be out of the question, since it wouldn't allow queries conditional on the 'tags' value. In my current development I don't need such queries; tags are only there to be shown in the post list, not to filter posts.
So the choice is mostly between text[] and jsonb. Both can be queried.
What would you use? And why?
In most cases I would use a normalized schema with a table option_tag implementing the many-to-many relationship between the tables option and tag. Reference implementation here:
How to implement a many-to-many relationship in PostgreSQL?
It may not be the fastest option in every respect, but it offers the full range of DB functionality, including referential integrity, constraints, the full range of data types, all index options and cheap updates.
For completeness, add to your list of options:
hstore (good option)
xml more verbose and more complex than either hstore or jsonb, so I would only use it when operating with XML.
"string of comma-separated values" (very simple, mostly bad option)
EAV (Entity-Attribute-Value) or "name-value pairs" (mostly bad option)
Details under this related question on dba.SE:
Is there a name for this database structure?
If the list is just for display and rarely updated, I would consider a plain array, which is typically smaller and performs better for this than the rest.
Read the blog entry by Josh Berkus that @a_horse linked to in his comment. But be aware that it focuses on selected read cases. Josh concedes:
I realize that I did not test comparative write speeds.
And that's where the normalized approach wins big, especially when you change single tags a lot under concurrent load.
jsonb is a good option if you are going to operate with JSON anyway, and can store and retrieve JSON "as is".
I have used both a normalized schema and just a plain text field with CSV-separated values instead of custom data types (instead of CSV you can use JSON or any other encoding, like www-urlencoding or even XML attribute encoding). This is because many ORMs and database libraries are not very good at supporting custom data types (hstore, jsonb, array etc.).
@ErwinBrandstetter missed a couple of other benefits of the normalized approach, one being that it is much quicker to query for all previously used tags in a normalized schema than with the array option. This is a very common scenario in many tag systems.
That being said, I would recommend using Solr (or Elasticsearch) for querying tags, as it handles tag counts and general tag-prefix searching far better than what I could get Postgres to do, if you're willing to deal with the consistency aspects of synchronizing with a search engine. The storage of the tags then becomes less important.

What is a good way to manage keys in a key-value store?

Trying to define some policy for keys in a key-value store (we are using Redis). The keyspace should be:
Shardable (can introduce more servers and spread out the keyspace between them)
Namespaced (there should be some mechanism to "group" keys together logically, for example by domain or associated concepts)
Efficient (use as little space as possible in the DB for keys, to allow for as much data as possible)
As collision-less as possible (avoid keys for two different objects to be equal)
Two alternatives that I have considered are these:
Use prefixes for namespaces, separated by some character (like human_resources:person:<some_id>). The upside of this is that it is pretty scalable and easy to understand. The downside would be possible conflicts depending on the separator (what if an id has the character : in it?), and possibly size efficiency (too many nested namespaces might create very long keys).
Use some data structure (like a Sorted Set or Hash) to store namespaces. The main drawback of this would be the loss of "shardability", since the structure storing the namespaces would need to live in a single database.
Question: What would be a good way to manage a keyspace in a sharded setup? Should we use one of these alternatives, or is there some other, better pattern that we have not considered?
Thanks very much!
The generally accepted convention in the Redis world is option 1, i.e. namespaces separated by a character such as a colon. That said, the namespaces are almost always one level deep. For example: person:12321 instead of human_resources:person:12321.
How does this work with the 4 guidelines you set?
Shardable - This approach is shardable. Each key can go to a different shard or the same shard, depending on how you set it up.
Namespaced - Namespaces as a way to avoid collisions work with this approach. However, namespaces as a way to group keys don't work out. In general, using keys to group data is a bad idea. For example, what if a person moves from one department to another? If you change the key, you will have to update all references, and that gets tricky.
It's best to ensure the key never changes for an object. Grouping can then be handled externally by creating a separate index.
For example, let's say you want to group people by department, by salary range, and by location. Here's how you'd do it:
Individual people go into separate hashes with keys like persons:12321
Create a set for each grouping - for example persons_by:department - and store only the numeric identifiers of each person in this set, e.g. [12321, 43432]. This way, you get the advantages of Redis' integer set encoding.
Efficient - The method explained above is pretty memory-efficient. To save some more memory, you can compress the keys further on the application side. For example, you can store p:12321 instead of persons:12321. You should do this only if you have determined via profiling that you need such memory savings. In general, it isn't worth the cost.
Collision-free - This depends on your application. Each user or person should have a primary key that never changes. Use this in your Redis key, and you won't have collisions.
You mentioned two problems with this approach, and I will try to address them
What if the id has a colon?
It is of course possible, but your application's design should prevent it. It's best not to allow special characters in identifiers, because they will be used across multiple systems. For example, the identifier will very likely be part of a URL, and the colon is a reserved character even for URLs.
If you really must allow special characters in your identifier, you would have to write a small wrapper in your code that encodes the special characters. URL encoding is perfectly capable of handling this.
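For example, with Python's standard library the encoding is a one-liner, and the colon inside the raw id no longer clashes with the key separator (the key layout is illustrative):

```python
from urllib.parse import quote, unquote

raw_id = "dept:42"  # an id that happens to contain the separator
key = "person:" + quote(raw_id, safe="")
print(key)  # person:dept%3A42

# The original id is recovered losslessly:
assert unquote(key.split(":", 1)[1]) == raw_id
```

Passing `safe=""` forces even characters that are normally left alone in URL paths to be percent-encoded.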
Size Efficiency
There is a cost to long keys, but it isn't too much. In general, you should worry about the data size of your values rather than your keys. If you think keys are consuming too much memory, profile the database using a tool like redis-rdb-tools.
If you do determine that key size is a problem and want to save the memory, you can write a small wrapper that rewrites the keys using an alias.

Database design - should I use 30 columns or 1 column with all data in form of JSON/XML?

I am doing a project which needs to store 30 distinct fields for a piece of business logic, which will later be used to generate a report for each
The 30 distinct fields are not written at one time; the business logic has many transactions, so it goes like:
Transaction 1, update field 1-4
Transaction 2, update field 3,5,9
Transaction 3, update field 8,12, 20-30
...
...
N.B. Each transaction (all belonging to one business logic) updates an arbitrary number of fields, not in any particular order.
I am wondering which database design would be best:
1) Have 30 columns in the Postgres database representing those 30 distinct fields.
2) Have the 30 fields stored as XML or JSON in just one column of Postgres.
Which one is better, 1) or 2)?
If I choose 1):
I know it's easier from a programming perspective, because this way I don't need to read the whole XML/JSON, update a few fields and write it back to the database; I can update just the few columns I need for each transaction.
If I choose 2):
I can potentially reuse the table generically for something else, since what's inside the blob column is only XML. But is it wrong to use a generic table to store something totally unrelated in business logic, just because it has a blob column storing XML? This does have the potential to save the effort of creating a few new tables. But is this kind of generic table reuse wrong in an RDBMS?
Also, by choosing 2) it seems I could handle potential changes, like changing certain fields or adding more fields; at least it seems I wouldn't need to change the database table. But I would still need to change the C++ and C# code to handle the change internally, so I'm not sure this is much of an advantage.
I am not experienced enough in database design, so I cannot decide which one to choose. Any input is appreciated.
N.B. There is a good chance I won't need to index or search on those 30 columns for now; a primary key will be created on an extra column if I choose 2). But I am not sure whether I will later be required to search based on any of those columns/fields.
Basically all my fields are predefined in the requirement documents; they generally look like simple fields:
field1: value(max len 10)
field2: value(max len 20)
...
field20: value (max len 2)
No nested fields. Is it worth creating 20 columns for each of those fields (some are strings like date/time, some are plain strings, some are integers, etc.)?
2)
Is putting different business logic in a shared table a bad design idea, if it is only put in a shared table because the data shares the same structure? E.g. they all have a datetime column, a primary key and an XML column with different business logic inside. This way we save some effort of creating new tables... Is this saving worth it?
Always store your XML/JSON fields as separate fields in a relational database. Doing so keeps your database normalized, allowing the database to do its thing with queries, indices etc. And you will save other developers the headache of deciphering your XML/JSON field.
It will be more work up front to extract the fields from the XML/JSON, and perhaps to maintain it if fields need to be added, but once you create a class or classes to do so, that hurdle is eliminated and it will more than make up for the cryptic blob field.
In general it's wise to split the JSON or XML document out and store it as individual columns. This gives you the ability to set up constraints on the columns for validation and checking, to index columns, to use appropriate data types for each field, and generally use the power of the database.
Mapping it to/from objects isn't generally too hard, as there are numerous tools for this. For example, Java offers JAXB and JPA.
The main case where splitting it out isn't such a great idea is when you don't know in advance what the fields of the JSON or XML document will be, or how many there will be. In that case you really only have two choices: use an EAV-like data model, or store the document directly as a database field.
In this case (and this case only) I would consider storing the document in the database directly. PostgreSQL's SQL/XML support means you can still create expression indexes on xpath expressions, and you can use triggers for some validation.
This isn't a good option; it's just that EAV is usually an even worse option.
If the document is "flat" - i.e. a single level of keys and values, with no nesting - then consider storing it as hstore instead, as the hstore data type is a lot more powerful.
(1) is more standard, for good reasons. For one thing, it enables the database to do the heavy lifting on things like search and indexing.
