Is there an existing, production-ready in-memory KV store that allows me to retrieve a single value via any of multiple keys?
Let's say I have millions of immutable entities, each with an associated primary key. Any of these entities can have multiple aliases, and the most common scenario is to retrieve the entity by such an alias (90% of all requests). The second most common scenario is to retrieve the entity via the primary key and then add a new alias record (the remaining 10%). One special thing about this step: it is always preceded by an alias search and happens only if that alias search was unsuccessful.
The entire dataset fits into RAM, but probably wouldn't if the full record data were duplicated across all aliases.
I'm highly concerned about data retrieval latency and less concerned about write speed.
This can be done with Redis in two sequential lookups, or via any SQL database or MongoDB. I think both ways are suboptimal: the first obviously because of two round trips for every search attempt, and the second because of latency concerns.
Any suggestions?
Can you do two hashmaps, one that goes pk -> record data and another that goes alias -> pk?
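A minimal in-process sketch of that two-map idea, with plain Python dicts standing in for whatever store ends up holding the data (names are made up):

```python
# Hypothetical sketch of the two-map layout:
# one map for pk -> record, one for alias -> pk.
records = {}        # pk -> record data (stored once, never duplicated)
alias_to_pk = {}    # alias -> pk (small entries, one per alias)

def put_record(pk, record, aliases=()):
    records[pk] = record
    for alias in aliases:
        alias_to_pk[alias] = pk

def get_by_alias(alias):
    pk = alias_to_pk.get(alias)                 # first hop: alias -> pk
    return records.get(pk) if pk is not None else None   # second hop: pk -> record

def add_alias(alias, pk):
    # the 10% path: only called after get_by_alias() came back empty
    alias_to_pk.setdefault(alias, pk)
```

In Redis the same shape would map to two key spaces (for example alias:{alias} -> pk and entity:{pk} -> blob, names assumed); since the second GET depends on the first result, the pair can be wrapped in a small server-side (Lua) script so the client still pays only one round trip.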
Another option is to have some sort of deterministic alias so that you can go from the alias to the primary key directly in code, without doing a lookup in a datastore.
I am new to Snowflake and want to know: can we use hash codes for joining tables, finding unique records, or deleting duplicate records in Snowflake (or in any other database in general)?
I am designing an ETL flow. What are the advantages or disadvantages of using hash codes, and why are they generally not used often in most data warehousing designs?
If you mean hashing with something like md5_binary or sha1_binary, then yes, absolutely.
Binary values are half the byte length of the equivalent varchar, so you should use them. The benefit of using hash keys is (effectively) that you only need a single join column when, for instance, the natural key of a table is a composite key. You could instead use a numeric/integer sequence key, but that imposes a load order: for example, only after the related dimension tables have loaded should you build the related fact table --- if you are doing that.
Data Vault prefers durable hash keys because they do not impose any load ordering; every table can be loaded in any order, independently.
Anyway, I digress: yes, hash keys have great advantages; just make sure they're binary data types when loaded.
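For illustration, a rough Python sketch of building such a binary hash key from a composite natural key in the ETL layer. The normalization rules and separator here are assumptions; whatever you pick has to match the MD5_BINARY/SHA1_BINARY expression used on the warehouse side, or the keys won't line up.

```python
import hashlib

def hash_key(*natural_key_parts, sep="||"):
    """Build a 16-byte binary hash key from a composite natural key.

    Assumes the same concatenation, trimming, and casing rules are applied
    in the warehouse's MD5_BINARY(...) expression so both sides agree.
    """
    canonical = sep.join(
        "" if p is None else str(p).strip().upper() for p in natural_key_parts
    )
    return hashlib.md5(canonical.encode("utf-8")).digest()  # 16 raw bytes, not 32 hex chars

# Example: composite natural key (customer number + source system)
key = hash_key("C-10042", "CRM")
print(len(key))  # 16 -> half the storage of the 32-character hex varchar
```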
Seeking thoughts and advice on addressing a customer request. We're working in an existing database that contains primary keys but no foreign keys, and that won't change.
The request is for us to integrate data from outside sources, which may contain duplicate values in the data.
The data is from a records-management system, so key id values will increase with each record collected.
The key fields, for the most part, are decimal(22,0).
This is a daily ETL load of a relatively small amount of data, so performance isn't the most important concern.
Is it better to prefix, suffix, or use some other strategy to create unique key values that can be traced back to the source?
For instance, if the max existing value were 123456789, is it a good idea to prefix 100000 for external site A, 110000 for external site B, etc.? We've batted a few ideas around as a team and there seem to be pros and cons to everything we can think of. Not seeing much guidance on the web.
Thanks in advance for any ideas!
Generally it's better to have a separate field tracking whether the data is from A or B; that field is either part of your primary key, or part of a supplementary unique index/constraint while a surrogate identity/auto-increment field serves as the primary key.
Among other things, this makes it easier to later do things like table partitioning or conditional indexes.
IMHO that's a design anti-pattern. You're ascribing special meaning to internal database keys, which is a bad idea. Also, if you ever did hit the 10,000 limit you're completely stuffed. I know you think you won't... but you might, and it will be a catastrophe.
The normal pattern (again, in my experience and opinion) is to add SRC_System and SRC_Key fields with a unique index/constraint on them. These are used to track where the record came from.
In addition to these two new fields, you also have your original primary key in this table which is an internal key used by the system.
This design allows any number of additional systems with any range of keys to be added to your system.
The only challenges then are working out:
How to merge - i.e. if two systems have records representing the same thing
How to add additional reference data if required
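A tiny Python sketch of the SRC_System/SRC_Key pattern described above; the in-memory dict and counter stand in for an identity primary key plus a unique constraint on the pair (all names here are made up for illustration):

```python
from itertools import count

# Hypothetical key-mapping table: (SRC_System, SRC_Key) -> internal surrogate id.
# In the database this is an identity/auto-increment primary key plus a
# unique constraint on (SRC_System, SRC_Key).
_next_id = count(1)
key_map = {}  # (src_system, src_key) -> surrogate_id

def surrogate_id(src_system, src_key):
    """Return the internal id for a source record, creating one if it's new."""
    pair = (src_system, src_key)
    if pair not in key_map:
        key_map[pair] = next(_next_id)
    return key_map[pair]

# Records from any number of systems can collide on their native keys safely:
print(surrogate_id("SiteA", 123456789))  # 1
print(surrogate_id("SiteB", 123456789))  # 2  (same native key, different source)
print(surrogate_id("SiteA", 123456789))  # 1  (idempotent)
```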
Given a distributed system which persists records with 'url' as the primary key. Given that multiple servers are collecting data, the 'url' is a handy/convenient and accurate means of guaranteeing uniqueness. Our system currently queries documents as frequently as 10,000 times per minute.
We would like to add another unique key, a 'uuid', so that we can refer to resources as:
http://example.com/fju98hfhsiu
Rather than, for example:
http://example.com/?u=http%3A%2F%2Fthis.is.a.long.url.com%2Fthis_is%2Fa%2Fpagewitha%2Flong-url.html
It seems that creating a secondary index on UUIDs is not ideal in Cassandra. Is there any way to avoid creating a secondary index of UUIDs in Cassandra?
Let's start with the fact that the best practice and main pattern of Cassandra is to create tables for queries, not queries for tables; if you need to create an index on a table, that is almost automatically an anti-pattern. Based on this, the simplest solution is just to use two tables with two different keys.
In your case, the "uuid" is not a UUID; it is some concatenation of the domain and a hash of the rest of the URL, I believe. If your application can generate this key at request time, you can just use it as the partition key, and the full URL as the clustering key.
Also, if there are no hot domains (for example http://example.com), you can use the domain as the partition key, and the hash and long URL as clustering keys, creating materialized views to support different queries.
In the end, just add the secondary index and measure the performance impact in your specific case. If it works for you, and you don't want to deal with two tables, materialized views, etc., just use it.
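For the two-table variant mentioned at the start of this answer, a rough sketch with the Python driver might look like this (keyspace, table, and column names are made up):

```python
from cassandra.cluster import Cluster

# Assumed keyspace/table names - adjust to your schema.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("shortener")

# One table per query: look up by short id, and look up by full URL.
session.execute("""
    CREATE TABLE IF NOT EXISTS doc_by_short_id (
        short_id text PRIMARY KEY,
        url      text
    )""")
session.execute("""
    CREATE TABLE IF NOT EXISTS doc_by_url (
        url      text PRIMARY KEY,
        short_id text
    )""")

def save(url, short_id):
    # Both writes carry the same data; Cassandra favors duplication over indexes.
    session.execute("INSERT INTO doc_by_short_id (short_id, url) VALUES (%s, %s)",
                    (short_id, url))
    session.execute("INSERT INTO doc_by_url (url, short_id) VALUES (%s, %s)",
                    (url, short_id))

def resolve(short_id):
    row = session.execute("SELECT url FROM doc_by_short_id WHERE short_id = %s",
                          (short_id,)).one()
    return row.url if row else None
```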
I don't know much about the behind-the-scenes work of a database, but this is what I believe it does: when you use a primary key to retrieve a record, the database goes through all the records to find the one with that key, then retrieves the data.
However, in my scenario a record can contain a lot of references to other records. So every time I want to access these referenced records, I have to first retrieve a list of primary keys and then use the keys to retrieve the referenced records. The way databases do this is to go through the records many, many times, only to find the ones with those keys, reducing performance, which cannot be compromised in my case.
What I want is a database that makes the primary keys (or whatever) refer to the physical location on disk. That way, as soon as I have read the list of keys of the referenced objects, I can directly retrieve those objects without looking through the whole database.
The "Primary Keys" don’t necessarily need to have any "meanings" - just like the pointers used in most programming languages.
I tried writing my data in binary myself, but it complicates things too much when it comes to resizing and caching, and I doubt it would be more efficient than the existing databases.
Your basic assumption is false:
When you use a primary key to retrieve a record, the database goes through all the records to find the one with that key, then retrieves the data.
Database management systems find a record through a primary key very efficiently by using indexes. Their efficiency is such that each record can be accessed either with a single access or, on average, little more. So their efficiency is comparable to that of a direct physical pointer, but they offer two great advantages over physical pointers:
They make access to the record independent of its physical position, which allows records to be moved around so memory can be managed efficiently.
They allow searching for records across a range of values of the attribute on which they are defined.
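An easy way to see this in practice is to ask a query planner how it resolves a primary-key lookup. A small SQLite sketch (any RDBMS behaves similarly):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO records VALUES (?, ?)",
                 [(i, f"payload-{i}") for i in range(100_000)])

# Ask the planner how it would find one record by primary key.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT payload FROM records WHERE id = ?", (42_000,)
).fetchall()
print(plan)
# The plan reports a search using the primary key index
# (e.g. "SEARCH records USING INTEGER PRIMARY KEY (rowid=?)"),
# i.e. a B-tree descent of a handful of pages - not a scan of all rows.
```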
I'm wondering what the best way is to set up the keys for a table holding activity stream data. Each activity type will have different attributes (with some common ones). Here is an example of what some items will consist of:
A follow activity:
type
user_id
timestamp
follower_user_id
followee_user_id
A comment activity
type
user_id
timestamp
comment_id
commenter_user_id
commented_user_id
For displaying the stream I will be querying against the user_id and ordering by timestamp. There will also be other types of queries - for example I will occasionally need to query user_id AND type as well as stuff like comment_id, follower_user_id etc.
So my questions are:
Should my primary key be a hash and range key using user_id and timestamp?
Do I need secondary indexes for every other item - e.g. comment_id - or will results return quickly enough without the index? Secondary indexes are limited to 5, which wouldn't be enough for all the types of queries I will need to perform.
I'd consider whether you could segment the data into two (or more) tables, allowing better use of your queries. Combine the two as (and if) needed, i.e. your type becomes your table rather than a discriminator column like you would use in SQL.
If you don't separate the tables, then my answers would be
Yes - I think that would be the best bet, given that it seems like most of the time that will be the way you are using it.
No. But you do need to consider what the most frequent queries are and the performance considerations around them. Which ones need to be performant - and for which ones is "good enough" good enough?
A combination of caching and asynchronous processing can allow a slow performing scan to be good enough - but it doesn't eliminate the requirement to have some local secondary indexes.
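For what it's worth, a sketch of the user_id/timestamp key schema suggested in point 1, plus one secondary index for comment_id lookups, using boto3 (table and index names are illustrative; a global secondary index is used here because comment_id queries don't include user_id, but whether a GSI or LSI fits depends on your actual query mix):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Illustrative table: user_id as hash key, timestamp as range key,
# plus one GSI so that comment_id lookups don't require a scan.
dynamodb.create_table(
    TableName="activity_stream",
    AttributeDefinitions=[
        {"AttributeName": "user_id", "AttributeType": "S"},
        {"AttributeName": "timestamp", "AttributeType": "N"},
        {"AttributeName": "comment_id", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "user_id", "KeyType": "HASH"},
        {"AttributeName": "timestamp", "KeyType": "RANGE"},
    ],
    GlobalSecondaryIndexes=[
        {
            "IndexName": "by_comment_id",
            "KeySchema": [{"AttributeName": "comment_id", "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    BillingMode="PAY_PER_REQUEST",
)

# Newest-first page of one user's stream (the "display the stream" query):
resp = dynamodb.query(
    TableName="activity_stream",
    KeyConditionExpression="user_id = :u",
    ExpressionAttributeValues={":u": {"S": "user-123"}},
    ScanIndexForward=False,  # descending timestamp
    Limit=50,
)
```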