How to load de-normalised data using Hazelcast for analytics?

We are using Hazelcast as an in-memory data grid and want to extend it for analytics using in-memory computation. I have a few questions regarding this:
Which data structure should we use? (There is no primary key, since the table is de-normalised, and the data set is huge.)
If IMap is the only option, can we use a composite key or a dummy key, and will it still support indexes and predicates?
Or is this simply not the right use case, i.e. Hazelcast cannot be used for analytics?

You can generate random keys using UUID::randomUUID, or you can create composite keys. Indexes can be created over values and keys (for keys, use the magic keyword __key# followed by the property of the key you're interested in).
Predicates use the same keyword if you want to run them against a composite-key property; otherwise just query as you would any other data.
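To make that concrete, here is a minimal Java sketch of a composite key with indexes and a predicate query. It assumes a recent Hazelcast release (4.x/5.x); the RowKey/Row classes, map name, and attribute names are invented for the example, and the key-attribute path (__key.region here) should be adjusted to whatever form your Hazelcast version expects (the answer's __key# form in older docs).

    import com.hazelcast.config.IndexType;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.map.IMap;
    import com.hazelcast.query.Predicates;

    import java.io.Serializable;
    import java.util.Collection;
    import java.util.UUID;

    public class DenormalizedRows {

        // Composite key: no natural primary key, so combine a column with a random id.
        public static class RowKey implements Serializable {
            public final String region;
            public final String rowId;
            public RowKey(String region, String rowId) { this.region = region; this.rowId = rowId; }
            @Override public boolean equals(Object o) {
                return o instanceof RowKey && ((RowKey) o).region.equals(region)
                        && ((RowKey) o).rowId.equals(rowId);
            }
            @Override public int hashCode() { return 31 * region.hashCode() + rowId.hashCode(); }
        }

        public static class Row implements Serializable {
            public final String product;
            public final double amount;
            public Row(String product, double amount) { this.product = product; this.amount = amount; }
        }

        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            IMap<RowKey, Row> rows = hz.getMap("denormalized-rows");

            rows.addIndex(IndexType.HASH, "product");      // index on a value attribute
            rows.addIndex(IndexType.HASH, "__key.region"); // index on a key attribute

            rows.put(new RowKey("EU", UUID.randomUUID().toString()), new Row("widget", 12.5));

            // Predicate over a key attribute and a value attribute.
            Collection<Row> euWidgets = rows.values(Predicates.and(
                    Predicates.equal("__key.region", "EU"),
                    Predicates.equal("product", "widget")));
            euWidgets.forEach(r -> System.out.println(r.amount));
        }
    }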

Related

Key-Value database with key aliasing or searching by value

Is there an existing, production-ready, in-memory KV store that allows me to retrieve a single value via any of multiple keys?
Let's say I have millions of immutable entities, each with an associated primary key. Any of these entities can have multiple aliases, and the most common scenario is retrieving the entity by such an alias (90% of all requests). The second common scenario is retrieving the entity via the primary key and then adding a new alias record (the remaining 10%). One special thing about this step: it is always preceded by an alias search and happens only if that search was unsuccessful.
The entire dataset fits into RAM, but probably wouldn't if the full record data were duplicated across all aliases.
I'm highly concerned about data-retrieval latency and less concerned about write speed.
This can be done with Redis in two sequential lookups, or via any SQL database/MongoDB. I think both approaches are suboptimal: the first obviously because of two round trips for every search attempt, and the second because of latency concerns.
Any suggestions?
Can you do two hashmaps, one that goes pk -> record data and the other that goes alias -> pk?
Another option is to have some sort of deterministic alias, so that you can go from the alias to the primary key directly in code without doing a lookup in a datastore.
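A minimal in-process sketch of the two-map idea in Java; the class and method names are made up for illustration:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class AliasedStore<K, V> {
        private final Map<K, V> byPrimaryKey = new ConcurrentHashMap<>();
        private final Map<String, K> aliasToPk = new ConcurrentHashMap<>();

        public void put(K pk, V value) {
            byPrimaryKey.put(pk, value);
        }

        public void addAlias(String alias, K pk) {
            aliasToPk.put(alias, pk);
        }

        // The 90% case: alias lookup, two in-process map reads, no extra network hop.
        public V getByAlias(String alias) {
            K pk = aliasToPk.get(alias);
            return pk == null ? null : byPrimaryKey.get(pk);
        }

        // The 10% case: pk lookup after a failed alias search, then register the missing alias.
        public V getByPkAndAlias(K pk, String alias) {
            V value = byPrimaryKey.get(pk);
            if (value != null) {
                aliasToPk.putIfAbsent(alias, pk);
            }
            return value;
        }
    }

Note that this only sidesteps the two-round-trip concern if both maps live in the same process (or the same server-side call) as the caller.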

Do databases use foreign keys transparently?

Do database engines utilize foreign keys transparently, or does a query have to use them explicitly?
Based on my experience, there is no explicit notion of foreign keys on a table, other than a constraint that maintains uniqueness of the key and the fact that the key (a single field or a group of fields) is a key, which makes searches efficient.
To clarify why this matters, here is an example: I have a middleware (ArcGIS in my case) for which I can control the back-end database (so I can create keys, indices, etc.), and I usually use the front end (a RESTful API here). The middleware itself is a black box that is supposed to provide effective tools to take advantage of the underlying DBMS's capabilities. So what I want to understand is: if I build foreign key constraints and run queries that, implemented normally, would make use of those foreign keys, should I see performance improvements?
Is that generally the case, or do various engines handle it differently? (I am using PostgreSQL.)
Foreign keys aren't there to improve performance. They're there to enforce data integrity. They will decrease performance for inserts/updates/deletes, but they make no difference to queries.
Some DBMSs will automatically add an index to the foreign key field, which may be where the confusion is coming from. Postgres does not do this; you'll need to create the index yourself. (And yes, the database will use this index transparently.)
As far as I know, database engines need specific queries in order to use foreign keys. You have to write some sort of join query to get data from related tables.
However, some data-access frameworks hide the complexity of following foreign keys by providing a transparent way of accessing data from related tables, but I am not sure that provides much improvement in performance.
This completely depends on the database engine.
In PostgreSQL, constraints won't cause performance improvements directly; only indexes will do that.
CREATE INDEX is a PostgreSQL language extension. There are no provisions for indexes in the SQL standard.
However, adding certain constraints will automatically create an index on the affected column(s) -- e.g. UNIQUE and PRIMARY KEY constraints create a btree index on the affected column(s).
The FOREIGN KEY constraint won't create indexes on the referencing column(s), but:
A foreign key must reference columns that either are a primary key or form a unique constraint. This means that the referenced columns always have an index (the one underlying the primary key or unique constraint); so checks on whether a referencing row has a match will be efficient. Since a DELETE of a row from the referenced table or an UPDATE of a referenced column will require a scan of the referencing table for rows matching the old value, it is often a good idea to index the referencing columns too. Because this is not always needed, and there are many choices available on how to index, declaration of a foreign key constraint does not automatically create an index on the referencing columns.
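If you want to follow that advice, here is a minimal JDBC sketch of adding such an index yourself; the connection details and the orders/customer_id names are placeholders, and the PostgreSQL JDBC driver is assumed to be on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class IndexReferencingColumn {
        public static void main(String[] args) throws Exception {
            // Connection details are placeholders.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/mydb", "user", "secret");
                 Statement st = conn.createStatement()) {
                // The FOREIGN KEY constraint enforces integrity but creates no index;
                // index the referencing column so joins and FK checks on deletes/updates stay fast.
                st.execute("CREATE INDEX IF NOT EXISTS idx_orders_customer_id"
                        + " ON orders (customer_id)");
            }
        }
    }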

About database keys

I'm trying to figure out a way to design my SQL Azure database. There is a lot of information to be found about what your primary key should be (int versus guid) and advantages/disadvantages of both approaches, so I'm aware of the war going on there :)
But I was thinking of getting the best of both worlds by adding the following three columns to my tables:
InternalID
IDENTITY of type int
defined as clustered index
used in joins
default value generated by database
unique in the table only
used internally only
can never change
ExternalId
a Guid
default value generated by the client (or domain)
globally unique
used internally and externally.
can never change
UrlTitle
a string
generated by the domain
unique in the table only
user-friendly representation of the entity used in public url's
can change (but preferably does not change)
By doing so I would get the performance of integer primary keys thanks to the InternalID, while still being flexible enough thanks to the ExternalId.
I'm not a database specialist, far from it, so I would like to hear from you whether this is a feasible thing to do or maybe just plain ridiculous?
There's nothing unusual about using several different keys in a table for different purposes. Make sure you enforce all the keys with uniqueness constraints and create appropriate indexes. Make sure your developers understand what each key is for and that they use them in a consistent way.
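For illustration, here is a rough sketch of what such a table could look like, issued over JDBC against SQL Server / SQL Azure. The table name, column sizes, and constraint names are invented, only the three proposed columns are shown, and the connection details are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CreateArticleTable {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:sqlserver://myserver.database.windows.net;databaseName=mydb",
                    "user", "secret");
                 Statement st = conn.createStatement()) {
                st.execute(
                      "CREATE TABLE dbo.Article ("
                    + "  InternalID int IDENTITY(1,1) NOT NULL,"   // internal surrogate key
                    + "  ExternalId uniqueidentifier NOT NULL,"    // client/domain-generated GUID
                    + "  UrlTitle nvarchar(200) NOT NULL,"         // user-friendly slug for URLs
                    + "  CONSTRAINT PK_Article PRIMARY KEY CLUSTERED (InternalID),"
                    + "  CONSTRAINT UQ_Article_ExternalId UNIQUE (ExternalId),"
                    + "  CONSTRAINT UQ_Article_UrlTitle UNIQUE (UrlTitle)"
                    + ")");
            }
        }
    }

The unique constraints on ExternalId and UrlTitle give each key its own index, which is what enforcing every key as the answer suggests looks like in practice.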

How do you query DynamoDB?

I'm looking at Amazon's DynamoDB as it looks like it takes away all of the hassle of maintaining and scaling your database server. I'm currently using MySQL, and maintaining and scaling the database is a complete headache.
I've gone through the documentation and I'm having a hard time trying to wrap my head around how you would structure your data so it could be easily retrieved.
I'm totally new to NoSQL and non-relational databases.
From the Dynamo documentation it sounds like you can only query a table on the primary hash key, and the primary range key with a limited number of comparison operators.
Or you can run a full table scan and apply a filter to it. The catch is that it will only scan 1Mb at a time, so you'd likely have to repeat your scan to find X number of results.
I realize these limitations allow them to provide predictable performance, but it seems like it makes it really difficult to get your data out. And performing full table scans seems like it would be really inefficient, and would only become less efficient over time as your table grows.
For instance, say I have a Flickr clone. My Images table might look something like:
Image ID (Number, Primary Hash Key)
Date Added (Number, Primary Range Key)
User ID (String)
Tags (String Set)
etc
So using query I would be able to list all images from the last 7 days and limit it to X number of results pretty easily.
But if I wanted to list all images from a particular user I would need to do a full table scan and filter by username. Same would go for tags.
And because you can only scan 1Mb at a time you may need to do multiple scans to find X number of images. I also don't see a way to easily stop at X number of images. If you're trying to grab 30 images, your first scan might find 5, and your second may find 40.
Do I have this right? Is it basically a trade-off? You get really fast predictable database performance that is virtually maintenance free. But the trade-off is that you need to build way more logic to deal with the results?
Or am I totally off base here?
Yes, you are correct about the trade-off between performance and query flexibility.
But there are a few tricks to reduce the pain - secondary indexes/denormalising probably being the most important.
You would have another table keyed on user ID, listing all their images, for example. When you add an image, you update this table as well as adding a row to the table keyed on image ID.
You have to decide what queries you need, then design the data model around them.
I think you need to create your own secondary index, using another table.
This table "schema" could be:
User ID (String, Primary Key)
Date Added (Number, Range Key)
Image ID (Number)
That way you can query by User ID and filter by Date as well.
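Under those assumptions (a hand-maintained index table, here called UserImages, with UserID as hash key and DateAdded as range key), a query might look roughly like this with the AWS SDK for Java v2; the table name, attribute names, and values are illustrative only.

    import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
    import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
    import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
    import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

    import java.util.Map;

    public class QueryUserImages {
        public static void main(String[] args) {
            DynamoDbClient ddb = DynamoDbClient.create();

            long sevenDaysAgo = System.currentTimeMillis() / 1000 - 7 * 24 * 3600;

            // Query the hand-maintained "index" table: hash key UserID, range key DateAdded.
            QueryRequest request = QueryRequest.builder()
                    .tableName("UserImages")
                    .keyConditionExpression("UserID = :uid AND DateAdded > :since")
                    .expressionAttributeValues(Map.of(
                            ":uid", AttributeValue.builder().s("user-123").build(),
                            ":since", AttributeValue.builder().n(Long.toString(sevenDaysAgo)).build()))
                    .limit(30)
                    .build();

            QueryResponse response = ddb.query(request);
            response.items().forEach(item ->
                    System.out.println(item.get("ImageID").n()));
        }
    }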
You can use a composite hash-range key as the primary index.
From the DynamoDB Page:
A primary key can either be a single-attribute hash key or a composite
hash-range key. A single attribute hash primary key could be, for
example, “UserID”. This would allow you to quickly read and write data
for an item associated with a given user ID.
A composite hash-range key is indexed as a hash key element and a
range key element. This multi-part key maintains a hierarchy between
the first and second element values. For example, a composite
hash-range key could be a combination of “UserID” (hash) and
“Timestamp” (range). Holding the hash key element constant, you can
search across the range key element to retrieve items. This would
allow you to use the Query API to, for example, retrieve all items for
a single UserID across a range of timestamps.
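As a sketch of what defining such a composite hash-range key looks like with the AWS SDK for Java v2; the attribute names follow the quoted example, and the table name "Images" is arbitrary.

    import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
    import software.amazon.awssdk.services.dynamodb.model.AttributeDefinition;
    import software.amazon.awssdk.services.dynamodb.model.BillingMode;
    import software.amazon.awssdk.services.dynamodb.model.CreateTableRequest;
    import software.amazon.awssdk.services.dynamodb.model.KeySchemaElement;
    import software.amazon.awssdk.services.dynamodb.model.KeyType;
    import software.amazon.awssdk.services.dynamodb.model.ScalarAttributeType;

    public class CreateImagesTable {
        public static void main(String[] args) {
            DynamoDbClient ddb = DynamoDbClient.create();

            ddb.createTable(CreateTableRequest.builder()
                    .tableName("Images")
                    .attributeDefinitions(
                            AttributeDefinition.builder()
                                    .attributeName("UserID").attributeType(ScalarAttributeType.S).build(),
                            AttributeDefinition.builder()
                                    .attributeName("Timestamp").attributeType(ScalarAttributeType.N).build())
                    .keySchema(
                            KeySchemaElement.builder()
                                    .attributeName("UserID").keyType(KeyType.HASH).build(),     // hash element
                            KeySchemaElement.builder()
                                    .attributeName("Timestamp").keyType(KeyType.RANGE).build()) // range element
                    .billingMode(BillingMode.PAY_PER_REQUEST)
                    .build());
        }
    }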

What is the easiest way to simulate a database table with an index in a key value store?

What is the easiest way to simulate a database table with an index in a key value store? The key value store has NO ranged queries and NO ordered keys.
The things I want to simulate (in order of priority):
Create tables
Add columns
Create indexes
Query based on primary key
Query based on arbitrary columns
If you use Redis (an advanced key-value store that supports strings, lists, sets, etc.), then this is quite easy. I have already developed a C# Redis client that has native support for storing POCO data models. These exact same POCOs can be used by OrmLite to store them in an RDBMS.
By the way, Redis is fast: I have a benchmark that stores and retrieves the entire Northwind database (3,202 records) in under 1.2 seconds (running inside a unit test on a 3-year-old iMac).
I store entities in two ways:
As distinct entities, where I combine the class type name and primary key to create a unique key, e.g. urn:user:1
I then maintain a separate set of primary keys (in a Redis Set) to keep track of all my entities, using a key like: ids:user
In a Redis server-side list, which acts very much like a table with support for paging, using a key like: lists:user
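The answer refers to the ServiceStack C# client; a rough Java equivalent of the same key layout, using the Jedis client, might look like this. The key names follow the answer, and the JSON string is a stand-in for a serialized entity.

    import redis.clients.jedis.Jedis;

    public class RedisEntityStore {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                String id = "1";
                String json = "{\"id\":1,\"name\":\"Alice\"}"; // serialized entity

                jedis.set("urn:user:" + id, json); // distinct entity under a type-qualified key
                jedis.sadd("ids:user", id);        // set of all known user ids
                jedis.rpush("lists:user", json);   // server-side list, usable like a pageable table

                // Paged read of the "table": first 10 entries.
                jedis.lrange("lists:user", 0, 9).forEach(System.out::println);
            }
        }
    }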
Use a hashtable or dictionary. If you want unique key values you could use a GUID or hashcode.
The key-value store should support ordering the keys and ranged access to the keys.
Then you should create two dictionaries:
id -> payload
(col1, id) -> NULL
where payload should contain all the data the database table would contain, and the keys of the second dictionary should contain the values of (col1, id) from each entry of the first dictionary.
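A small Java sketch of that two-dictionary layout, using an ordered map so the (col1, id) keys support the range scan an index needs; the names and the NUL-separator encoding are illustrative (it assumes col1 values never contain a NUL character).

    import java.util.Map;
    import java.util.SortedMap;
    import java.util.TreeMap;

    public class SimulatedIndex {
        private static final char SEP = '\0'; // separator that sorts before any printable character

        // Primary "table": id -> payload (payload stands in for the whole row).
        private final Map<Integer, String> rows = new TreeMap<>();

        // Secondary "index": (col1, id) -> nothing; the composite key itself is the data.
        private final TreeMap<String, Boolean> byCol1 = new TreeMap<>();

        public void insert(int id, String col1, String payload) {
            rows.put(id, payload);
            byCol1.put(col1 + SEP + id, Boolean.TRUE);
        }

        // Equivalent of "SELECT payload WHERE col1 = ?": a range scan over the index.
        public void printByCol1(String col1) {
            // Upper bound: same col1 followed by the character just above the separator.
            SortedMap<String, Boolean> range =
                    byCol1.subMap(col1 + SEP, col1 + (char) (SEP + 1));
            for (String key : range.keySet()) {
                int id = Integer.parseInt(key.substring(key.indexOf(SEP) + 1));
                System.out.println(rows.get(id));
            }
        }

        public static void main(String[] args) {
            SimulatedIndex idx = new SimulatedIndex();
            idx.insert(1, "blue", "row one");
            idx.insert(2, "red", "row two");
            idx.insert(3, "blue", "row three");
            idx.printByCol1("blue"); // prints "row one" and "row three"
        }
    }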
