DynamoDB - Querying a Lookup Table

I'm just starting to build a social site on DynamoDB.
I will have a fair amount of data that relates to a user, and I'm planning on putting this all into one table - e.g.:
userid
date of birth
hair
photos urls
specifics
etc - there could potentially be a few hundred attributes.
Question:
is there anything wrong with putting this amount of data into one table?
how can I query that data (could I do a query like "all members between these ages, with this hair color, in this location, and logged on at this time") - assuming all this data is contained in the table?
if the rows in the table are long and I'm running queries like the above, would the read I/O cost be high? There might be a lot of entries in the table in the long run...
Thanks

No. You can't query DynamoDB this way. You can only query on the hash key (optionally together with a single range key). Scanning the tables in DynamoDB is slow and costly and will cause your other queries to hang.
If you have a small number of attributes, you can easily create index tables for these attributes. But if you have more than a few, it becomes too complex.
Main Table:
Primary Key (Type: Hash) - userid
Attributes - the rest of the attributes
Index Table for "hair":
Primary Key (Type: Hash and Range) - hair and userid
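As a minimal sketch of how you would read such an index table (Python with boto3; the table name "users_by_hair" is a made-up example):

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb")

    # Hypothetical index table keyed on hair (hash) and userid (range), as above.
    hair_index = dynamodb.Table("users_by_hair")

    # All userids with red hair; any further filtering (age, location, ...)
    # has to happen in your application or through additional index tables.
    response = hair_index.query(KeyConditionExpression=Key("hair").eq("red"))
    user_ids = [item["userid"] for item in response["Items"]]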
You can check out Amazon SimpleDB, which indexes the other attributes as well and therefore allows the kind of queries you want. But it is limited in its scale and its ability to support low latency.
You might also consider a combination of several data stores and tables, since your real-time and reporting requirements are different:
DynamoDB for the quick real-time user lookup
SimpleDB/RDBMS (such as MySQL or Amazon RDS) for additional attribute filters and queries
In-memory DB (such as Redis or Cassandra) for counters and tables such as leaderboards or cohorts
Activity logs that you can analyze to discover patterns and trends

Related

Partitioning the table

This query is regarding partitioning in hive/delta tables.
Which column should we pick for partitioning the table if the table is always joined on a key that has only unique values?
Ex: we have a table Customer(id, name, otherDetails)
Which field would be suitable to partition this table?
Thanks,
Deepak
Good question. Below are factors you need to consider while partitioning -
Requirement - partition when you have lots of data, a heavily used table with data frequently added to it, and you want to manage it better.
Distribution of data - choose a field or fields on which data is evenly distributed. The most common choice is date, month, or year, since transactional data is normally somewhat evenly distributed on these fields. You can also partition on something like country or region when the data is evenly distributed across them.
Loading strategy - you can load/insert/delete each partition separately, so choose columns that support your loading strategy. For example, if you delete old data based on date every time you load, choose load date as the partition column (see the sketch after this list).
Reasonable number of partitions - make sure you do not have thousands of partitions; fewer than 500 is good (check your system's performance).
Do not choose a unique key/composite key as the partition key, because Hive creates a folder with data files for each partition and it will be very difficult to manage thousands of partitions.
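As a rough illustration of the date-based loading strategy above, here is a minimal PySpark sketch; the source table raw_sales, the loaded_at column, and the target table name are all hypothetical:

    from pyspark.sql import SparkSession, functions as F

    # Assumes a SparkSession with Hive (or Delta) support is already configured.
    spark = SparkSession.builder.appName("partitioning-sketch").enableHiveSupport().getOrCreate()

    # Derive a load_date column to partition on, rather than the unique id.
    df = spark.table("raw_sales").withColumn("load_date", F.to_date("loaded_at"))

    (df.write
        .mode("overwrite")
        .partitionBy("load_date")   # evenly distributed; old partitions can be dropped by date
        .format("parquet")          # swap for "delta" if Delta Lake is available
        .saveAsTable("sales_partitioned"))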

Big data indexing advice on SQL Server

I am about to import around 500 million rows of telemetry data into SQL Server 2008 R2, and I want to make sure I get the indexing/schema right to allow for fast searches of the data. I've been working with databases for a while but nothing on this scale. I'm hoping I can describe my data and the application, and someone can advise me on a good strategy for indexing it.
The data is instrument readings from a data collection system, and has 3 columns: SentTime (datetime2(3)), Topic (nvarchar(255)), and Value (float). The SentTime precision is to the millisecond, and is NOT unique. There are around 400 distinct Topics (e.g. "Voltage1", "PumpPressure", etc.) in the data, and my plan was to break out the data into about 30 tables, each with 10-15 columns, grouped into logical groupings like Voltages, Pressures, Temperatures, etc., each with their own SentTime column.
A typical search will be to retrieve various Values (could be across several tables) for a given time range. Another possible search will be to retrieve all times/values for a given value range and topic. The user interface will show coarse graphs of the data, to allow the user to find the interesting data and export it to Excel or CSV.
My main question is, if I add an index based on SentTime alone, will that speed searches for a given time range? Would it be better to make a composite index on time and value, since the time is not unique? Any point in adding a unique primary key? Is there any other overall strategy or schema I should be looking at for this application?
Another note, I will not be inserting any data once the import is done, so no need to worry about the insertion overhead of indexes.
It seems that you'll be doing a lot of range searches over the SentTime column. In that case, I would create a clustered index on SentTime; with a nonclustered index there would be the overhead of lookups (to retrieve the additional columns). It does not matter that SentTime is not unique; the engine will add a uniquifier to it.
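A sketch of that suggestion in Python with pyodbc; the connection string is a placeholder and the Voltages table is just one of the grouped tables you described:

    import pyodbc

    # Placeholder connection details for the SQL Server instance.
    conn = pyodbc.connect(
        "DRIVER={SQL Server};SERVER=myserver;DATABASE=telemetry;Trusted_Connection=yes"
    )
    cursor = conn.cursor()

    # Clustered index on SentTime: rows are physically ordered by time,
    # so range predicates on SentTime become mostly sequential reads.
    cursor.execute("CREATE CLUSTERED INDEX IX_Voltages_SentTime ON dbo.Voltages (SentTime)")
    conn.commit()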
Does the Topic column have to be nvarchar; why not a varchar?
My relational self will punish me for this, but it seems that you don't need an additional PK. The data is read-only, right?
One more thought: check the sparse columns feature, it seems that it would be a perfect fit in your scenario. There could be a large number of sparse columns (up to 10,000 if I'm not mistaken), they can be grouped and manipulated as XML, and the main point is that NULLs are almost free storage-wise.

How should I setup my DynamoDB keys for an activity stream table

I'm wondering what the best way is to set up the keys for a table holding activity stream data. Each activity type will have different attributes (with some common ones). Here is an example of what some items will consist of:
A follow activity:
type
user_id
timestamp
follower_user_id
followee_user_id
A comment activity
type
user_id
timestamp
comment_id
commenter_user_id
commented_user_id
For displaying the stream I will be querying against the user_id and ordering by timestamp. There will also be other types of queries - for example I will occasionally need to query user_id AND type as well as stuff like comment_id, follower_user_id etc.
So my questions are:
Should my primary key be a hash and range key using user_id and timestamp?
Do I need secondary indexes for every other item - e.g. comment_id - or will results return quickly enough without the index? Secondary indexes are limited to 5, which wouldn't be enough for all the types of queries I will need to perform.
I'd consider whether you could segment the data into two (or more) tables, allowing better use of your queries. Combine the two as (and if) needed, i.e. your type becomes your table rather than a discriminator column like you would use in SQL.
If you don't separate the tables, then my answers would be
Yes - I think that would be the best bet, given that it seems like that will be the way you are using it most of the time.
No. But you do need to consider what the most frequent queries are and the performance considerations around them. Which ones need to be performant, and for which ones is "good enough" good enough?
A combination of caching and asynchronous processing can allow a slow performing scan to be good enough - but it doesn't eliminate the requirement to have some local secondary indexes.
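If you keep a single table with user_id as the hash key and timestamp as the range key, the stream query might look roughly like this in Python with boto3 (the table and attribute names here are assumptions based on the question):

    import boto3
    from boto3.dynamodb.conditions import Key, Attr

    dynamodb = boto3.resource("dynamodb")
    activities = dynamodb.Table("activities")   # assumed table name

    # Latest 50 activities for a user, newest first.
    stream = activities.query(
        KeyConditionExpression=Key("user_id").eq("user-123"),
        ScanIndexForward=False,
        Limit=50,
    )

    # Occasional "user_id AND type" query: the key condition narrows by user,
    # and the filter drops non-matching items after the read (it does not reduce read cost).
    comments = activities.query(
        KeyConditionExpression=Key("user_id").eq("user-123"),
        FilterExpression=Attr("type").eq("comment"),
    )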

Is a snowflake schema better than indexing?

Here is the problem: I have a sales information table with columns like (Primary Key ID, Product Name, Product ID, Store Name, Store ID, Sales Date). I want to do analysis like drill-up and drill-down on store/product/sales date.
There are two design options I am thinking about,
Create individual indexes on columns like product name, product ID, Store Name, Store ID, Sales Date;
Using the data warehouse snowflake model, treating the current sales information table as the fact table, and creating product, store, and sales date dimension tables.
In order to have better analysis performance, I heard the snowflake model is better. But why is it better than indexes on the related columns, from a database design perspective?
thanks in advance,
Lin
Knowing your app usage patterns and what you want to optimize for are important. Here are a few reasons (among many) to choose one over the other.
Normalized Snowflake PROs:
Faster queries and lower disk and memory requirements. Due to each normalized row having only short keys rather than longer text fields, your primary fact table becomes much smaller. Even when an index is used (unless the query can be answered directly by the index itself), partial table scans are often required, and smaller data means fewer disk reads and faster access.
Easier modifications and better data integrity. Say a store changes its name. In snowflake, you change one row, whereas in a large denormalized table, you have to change it every time it comes up, and you will often end up with spelling errors and multiple variations of the same name.
Denormalized Wide Table PROs:
Faster single record loads. When you most often load just a single record or small number of records, having all your data together in one row will incur only a single cache miss or disk read, whereas in the snowflake the DB might have to read from multiple tables in different disk locations. This is more like how NoSQL databases store their "objects" associated with a key.
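To make the "easier modifications" point above concrete, here is a small sketch using Python's sqlite3 with made-up table and column names; renaming a store touches one dimension row in the snowflake layout but every matching row in the wide table:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Snowflake-style: store attributes live once in a dimension table,
    # and the fact table only carries the short store_id key.
    cur.execute("CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, store_name TEXT)")
    cur.execute("CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, store_id INTEGER, amount REAL)")

    # Denormalized wide table: the store name is repeated on every sales row.
    cur.execute("CREATE TABLE wide_sales (sale_id INTEGER PRIMARY KEY, store_name TEXT, amount REAL)")

    # Renaming a store: one row in the dimension table...
    cur.execute("UPDATE dim_store SET store_name = 'Main St Store' WHERE store_id = 1")

    # ...versus every occurrence in the denormalized table.
    cur.execute("UPDATE wide_sales SET store_name = 'Main St Store' WHERE store_name = 'Main Street Store'")

    conn.commit()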

How do you query DynamoDB?

I'm looking at Amazon's DynamoDB as it looks like it takes away all of the hassle of maintaining and scaling your database server. I'm currently using MySQL, and maintaining and scaling the database is a complete headache.
I've gone through the documentation and I'm having a hard time trying to wrap my head around how you would structure your data so it could be easily retrieved.
I'm totally new to NoSQL and non-relational databases.
From the Dynamo documentation it sounds like you can only query a table on the primary hash key, and the primary range key with a limited number of comparison operators.
Or you can run a full table scan and apply a filter to it. The catch is that it will only scan 1 MB at a time, so you'd likely have to repeat your scan to find X number of results.
I realize these limitations allow them to provide predictable performance, but it seems like it makes it really difficult to get your data out. And performing full table scans seems like it would be really inefficient, and would only become less efficient over time as your table grows.
For Instance, say I have a Flickr clone. My Images table might look something like:
Image ID (Number, Primary Hash Key)
Date Added (Number, Primary Range Key)
User ID (String)
Tags (String Set)
etc
So using query I would be able to list all images from the last 7 days and limit it to X number of results pretty easily.
But if I wanted to list all images from a particular user I would need to do a full table scan and filter by username. Same would go for tags.
And because you can only scan 1 MB at a time you may need to do multiple scans to find X number of images. I also don't see a way to easily stop at X number of images. If you're trying to grab 30 images, your first scan might find 5, and your second may find 40.
Do I have this right? Is it basically a trade-off? You get really fast predictable database performance that is virtually maintenance free. But the trade-off is that you need to build way more logic to deal with the results?
Or am I totally off base here?
Yes, you are correct about the trade-off between performance and query flexibility.
But there are a few tricks to reduce the pain - secondary indexes/denormalising probably being the most important.
You would have another table keyed on user ID, listing all their images, for example. When you add an image, you update this table as well as adding a row to the table keyed on image ID.
You have to decide what queries you need, then design the data model around them.
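A rough sketch of that dual write in Python with boto3 (the images and user_images table names are illustrative, not from the question):

    import boto3

    dynamodb = boto3.resource("dynamodb")
    images = dynamodb.Table("images")            # keyed on image_id
    user_images = dynamodb.Table("user_images")  # hand-rolled index table keyed on user_id

    def add_image(image_id, user_id, date_added, tags):
        # Full item goes into the main table...
        images.put_item(Item={
            "image_id": image_id,
            "date_added": date_added,
            "user_id": user_id,
            "tags": tags,
        })
        # ...and a slim row into the index table, so "images by user"
        # becomes a Query instead of a full table Scan.
        user_images.put_item(Item={
            "user_id": user_id,
            "date_added": date_added,
            "image_id": image_id,
        })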
I think you need to create your own secondary index, using another table.
This table "schema" could be:
User ID (String, Primary Key)
Date Added (Number, Range Key)
Image ID (Number)
--
That way you can query by User ID and filter by Date as well
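Querying that index table then becomes straightforward; for example (Python/boto3, with the table and attribute names assumed from the schema above):

    import time

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb")
    user_images = dynamodb.Table("user_images")  # the index table sketched above

    # All images for one user from the last 7 days, newest first.
    week_ago = int(time.time()) - 7 * 24 * 3600
    response = user_images.query(
        KeyConditionExpression=Key("user_id").eq("user-123") & Key("date_added").gte(week_ago),
        ScanIndexForward=False,
    )
    image_ids = [item["image_id"] for item in response["Items"]]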
You can use composite hash-range key as primary index.
From the DynamoDB Page:
A primary key can either be a single-attribute hash key or a composite
hash-range key. A single attribute hash primary key could be, for
example, “UserID”. This would allow you to quickly read and write data
for an item associated with a given user ID.
A composite hash-range key is indexed as a hash key element and a
range key element. This multi-part key maintains a hierarchy between
the first and second element values. For example, a composite
hash-range key could be a combination of “UserID” (hash) and
“Timestamp” (range). Holding the hash key element constant, you can
search across the range key element to retrieve items. This would
allow you to use the Query API to, for example, retrieve all items for
a single UserID across a range of timestamps.
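Creating a table with that composite hash-range key might look like this (Python/boto3; the table name and billing mode are just example choices):

    import boto3

    client = boto3.client("dynamodb")

    # UserID as the hash element, Timestamp as the range element.
    client.create_table(
        TableName="user_events",
        KeySchema=[
            {"AttributeName": "UserID", "KeyType": "HASH"},
            {"AttributeName": "Timestamp", "KeyType": "RANGE"},
        ],
        AttributeDefinitions=[
            {"AttributeName": "UserID", "AttributeType": "S"},
            {"AttributeName": "Timestamp", "AttributeType": "N"},
        ],
        BillingMode="PAY_PER_REQUEST",
    )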
