How to efficiently store and retrieve aggregated data in DynamoDB? [duplicate] - database

How is aggregation achieved with DynamoDB? MongoDB and Couchbase have map-reduce support.
Let's say we are building a tech blog where users can post articles, and articles can be tagged.
user
{
  id: 1235,
  name: "John",
  ...
}
article
{
  id: 789,
  title: "dynamodb use cases",
  author: 12345, // userid
  tags: ["dynamodb", "aws", "nosql", "document database"]
}
In the user interface we want to show, for the current user, the tags and their respective counts.
How can we achieve the following aggregation?
{
  userid: 12,
  tag_stats: {
    "dynamodb": 3,
    "nosql": 8
  }
}
We will provide this data through a REST API, and it will be called frequently because this information is shown on the app's main page.
I can think of extracting all documents and doing the aggregation at the application level, but I am afraid my read capacity units will be exhausted.
I could use tools like EMR, Redshift, BigQuery or AWS Lambda, but I think those are meant for data warehousing.
I would like to know other, better ways of achieving the same result.
How are people achieving simple dynamic queries like these, having chosen DynamoDB as their primary data store, considering cost and response time?

Long story short: DynamoDB does not support this. It is not built for this use case. It is intended for quick data access with low latency and simply does not offer any aggregation functionality.
You have three main options:
Export DynamoDB data to Redshift or EMR Hive. Then you can execute SQL queries on stale data. The benefit of this approach is that it consumes RCUs just once, but you will be working with outdated data.
Use the DynamoDB connector for Hive and query DynamoDB directly. Again you can write arbitrary SQL queries, but in this case the queries access data in DynamoDB directly. The downside is that every query consumes read capacity.
Maintain aggregated data in a separate table using DynamoDB Streams. For example, you can have a table with UserId as the partition key and a nested map of tags and counts as an attribute. On every update to your original data, DynamoDB Streams will trigger a Lambda function (or some code on your hosts) that updates the aggregate table. This is the most cost-efficient method, but you will need to implement additional code for each new query.
Of course you can extract data at the application level and aggregate it there, but I would not recommend doing it. Unless you have a small table, you will need to think about throttling, using only part of the provisioned capacity (you want to consume, say, 20% of your RCUs for aggregation, not 100%), and how to distribute the work among multiple workers.
Both Redshift and Hive already know how to do this. Redshift relies on multiple worker nodes when it executes a query, while Hive is built on top of MapReduce. Also, both Redshift and Hive can use a predefined percentage of your RCU throughput.

DynamoDB is a pure key/value store and does not support aggregation out of the box.
If you really want to do aggregation using DynamoDB, here are some hints.
For your particular case, let's have a table named articles.
To do the aggregation we need an extra table user-stats holding userId and tag_stats.
Enable DynamoDB Streams on the articles table.
Create a new Lambda function user-stats-aggregate that is subscribed to the articles DynamoDB stream and receives NEW_AND_OLD_IMAGES on every create/update/delete operation on the articles table.
The Lambda will perform the following logic (a sketch follows at the end of this answer):
If there is no old image, take the current tags and increase every occurrence by 1 for this user. (Keep in mind there may be no initial record in user-stats for this user.)
If there is an old image, check which tags were added or removed and apply a +1 or -1 change for each affected tag of the given user.
Stand up an API service that retrieves these user stats.
In general, aggregation in DynamoDB can be done using DynamoDB Streams, Lambdas that perform the aggregation, and extra tables keeping the aggregated results at different granularities (minutes, hours, days, years, ...).
This gives you near-real-time aggregation without having to compute it on the fly for every request; you query the pre-aggregated data instead.
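A minimal sketch of such a handler in Python with boto3, under these assumptions: the stream is configured with NEW_AND_OLD_IMAGES, tags is stored as a string set, author is a numeric user id, and the user-stats table keeps one numeric counter attribute per tag (rather than a nested map, so that ADD can create missing counters). All table, attribute and function names are illustrative:
import boto3

dynamodb = boto3.resource("dynamodb")
stats_table = dynamodb.Table("user-stats")  # hypothetical aggregate table

def handler(event, context):
    """Triggered by the articles table stream (NEW_AND_OLD_IMAGES view type)."""
    for record in event["Records"]:
        images = record["dynamodb"]
        old = images.get("OldImage")
        new = images.get("NewImage")
        # Assumes tags is a string set and author a numeric user id.
        old_tags = set(old["tags"]["SS"]) if old and "tags" in old else set()
        new_tags = set(new["tags"]["SS"]) if new and "tags" in new else set()
        user_id = int((new or old)["author"]["N"])
        # +1 for tags that appear only in the new image, -1 for removed ones.
        deltas = {tag: 1 for tag in new_tags - old_tags}
        deltas.update({tag: -1 for tag in old_tags - new_tags})
        for tag, delta in deltas.items():
            # ADD creates the user-stats item and the counter if they do not exist yet.
            stats_table.update_item(
                Key={"userId": user_id},
                UpdateExpression="ADD #tag :delta",
                ExpressionAttributeNames={"#tag": tag},
                ExpressionAttributeValues={":delta": delta},
            )
Because update_item with ADD is an upsert, the first article by a user creates its user-stats item automatically.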

Basic aggregation can also be done using scan() and query() inside a Lambda function, as sketched below.
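For completeness, a hedged sketch of that approach in Python with boto3: it filters a scan by author and counts tags client-side. The table and attribute names are assumptions, and note that a filtered scan still consumes read capacity for every item scanned:
from collections import Counter
import boto3
from boto3.dynamodb.conditions import Attr

articles = boto3.resource("dynamodb").Table("articles")  # hypothetical table name

def tag_stats_for_user(user_id):
    # A filtered scan still reads (and bills) every item in the table;
    # a query on a GSI keyed by author would be cheaper if this runs often.
    counts = Counter()
    kwargs = {"FilterExpression": Attr("author").eq(user_id)}
    while True:
        page = articles.scan(**kwargs)
        for item in page["Items"]:
            counts.update(item.get("tags", []))
        if "LastEvaluatedKey" not in page:
            return dict(counts)
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]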

Related

Can we use Elastic Search as Backend Data Store by replacing Postgres?

We are using Postgres to store and work with app data. The app data mainly involves the following:
We need to store the incoming request JSON after processing it.
We need to search a particular JSON document using the Identifier field, for which we are creating a separate column for each row in the table.
Clients may require searching the JSON column, i.e. a client wants to find one JSON document based on a certain key value in the JSON.
All these things work fine at present with Postgres. While reading a blog article I saw it mentioned that Elasticsearch can also be used as a backend data store, not just as a search server. If that is the case, can we replace Postgres with Elasticsearch? What advantages would I get by doing this, what are the pros of Postgres compared with Elasticsearch for my case, and what are the cons?
Can anyone give some advice, please?
Responding to the questions one by one:
We need to store the incoming request JSON after processing it.
Yes and no. Elasticsearch allows you to store JSON objects. This works if the JSON structure is known beforehand and/or is stable (i.e. the same keys in the JSON always have the same type).
By default the mapping (i.e. the schema of the collection) is dynamic, meaning it is inferred from the values inserted. Say we insert this document:
{"amount": 1.5} <-- insert succeeds
And immediately after try to insert this one:
{"amount": {"value" 1.5, "currency": "EUR"]} <-- insert fails
ES will reply with an error message:
Current token (START_OBJECT) not numeric, can not use numeric value accessors\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper#757a68a8; line: 1, column: 13]
If you have JSON objects of unknown structure you can still store them in ES by using the type object and setting the property enabled: false; this will not allow you to run any kind of query on the content of such a field, though.
We need to search a particular JSON document using the Identifier field, for which we are creating a separate column for each row in the table.
Yes. This can be done using a field of type keyword if the identifier is an arbitrary string, or integer if it is an integer.
Clients may require searching the JSON column, i.e. a client wants to find one JSON document based on a certain key value in the JSON.
As per 1), yes and no. If the JSON schema is known and strict, it can be done. If the JSON structure is arbitrary, it can be stored but will not be queryable.
Though I would say Elasticsearch is not suitable for your case, there are vendors that make JDBC and ODBC drivers for Elasticsearch, so apparently in some cases Elasticsearch can be used as a relational database.
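To make the keyword / enabled: false points above concrete, here is a sketch of an index mapping that keeps a searchable identifier alongside an unindexed raw payload, using the official Python client. It assumes a 7.x-style Elasticsearch and matching client; the index and field names are made up:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.indices.create(
    index="requests",  # hypothetical index name
    body={
        "mappings": {
            "properties": {
                "identifier": {"type": "keyword"},  # exact-match lookups
                "payload": {"type": "object", "enabled": False},  # stored but not indexed
            }
        }
    },
)
# Look a document up by its identifier (a term query works on keyword fields).
es.search(index="requests", body={"query": {"term": {"identifier": "abc-123"}}})
The term query works because identifier is a keyword field; nothing inside payload can be searched.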
Elasticsearch is an HTTP wrapper around Apache Lucene. Apache Lucene stores objects in a columnar fashion in order to speed up search (Lucene segments).
Let me complement Nikolay's very good answer:
The good:
Both Lucene and Elasticsearch are solid projects
Elasticsearch is (in my opinion) the best and easiest software for clustering (sharding and replication)
Supports version conflict detection (https://www.elastic.co/guide/en/elasticsearch/guide/current/concurrency-solutions.html)
The bad:
Not realtime (https://www.elastic.co/guide/en/elasticsearch/guide/current/near-real-time.html)
No support for ACID transactions (changes to individual documents are ACIDic, but not changes involving multiple documents)
Slow to retrieve large amounts of data (you must use search scroll, which is very slow compared to a SQL database fetch; see the sketch at the end of this answer)
No built-in authentication or access control
My opinion is to use Elasticsearch as a kind of read-only view of your database.
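Regarding the scroll point above, the official Python client at least wraps the scroll API conveniently. A hedged sketch, with the index name and query made up:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("http://localhost:9200")
# scan() drives the scroll API for you and yields every matching document;
# it is still much slower than a bulk fetch from a SQL table, as noted above.
for hit in scan(es, index="requests", query={"query": {"match_all": {}}}):
    print(hit["_source"])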

NoSQL write-once, automatic timestamp indexed

I'm looking for the least-effort solution to store data in a database. Here are the requirements:
this will be the storage backend for a test automation tool
data will be messages captured from queues: they can be JSON, XML, binary... but they could be converted to a uniform representation
data will be written once, whatever is written will not change
there will be multiple indexes necessary; however, the base index should be the timestamp of the messages inserted into the database. It would be nice if the database of choice provided this automatically (e.g. querying messages inserted between two timestamps should work out of the box)
ease of query is important (SQL would be best, however the structure of the messages is not always known in advance)
performance is not important
fault tolerance, partition tolerance, reliability etc are not important
ease of access (eg. REST API, API from multiple platforms - JVM, JS, etc) is important.
I was looking at MongoDB, CouchDB, maybe Riak... All of these could work; I just don't know which offers the least resistance for the requirements above. I am familiar with Riak, but its strengths are not really what I'm after...
@geraldss has addressed the INSERT question. Let me add an example for indexing.
Indexing: you can create indices on one or more fields and the query will use them automatically.
create index idx_ins_time on my_bucket(insert_time);
select my_message from my_bucket
where insert_time
between "2016-04-03T10:46:33.857-07:00" and "2016-04-05T10:46:33.857-07:00";
Use EXPLAIN to see the plan, just like SQL.
You can create multiple indices with one or more keys each.
Couchbase N1QL supports REST API, JDBC/ODBC and SDKs for most popular languages.
It seems that Couchbase is the best alternative, simply because of N1QL:
http://developer.couchbase.com/documentation/server/current/n1ql/n1ql-intro/data-access-using-n1ql.html
It ticks all the other boxes (except for the automatic timestamp index, but adding one and doing range queries is straightforward thanks to the query language).
If you use Couchbase, you can use N1QL's INSERT statement to automatically add the timestamp:
INSERT INTO my_bucket (KEY, VALUE)
VALUES ($my_key, {
    "insert_time": NOW_STR(),
    __my other data fields__
}
)
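Since REST access is one of the requirements, here is a sketch of running an INSERT like the one above through Couchbase's query REST endpoint from Python. The host, bucket name, credentials and payload fields are placeholders, and it assumes named parameters can be passed as $name fields in the request body:
import json
import uuid
import requests

QUERY_URL = "http://localhost:8093/query/service"  # Couchbase query service endpoint

statement = (
    "INSERT INTO my_bucket (KEY, VALUE) "
    'VALUES ($doc_key, {"insert_time": NOW_STR(), "payload": $payload})'
)

resp = requests.post(
    QUERY_URL,
    auth=("Administrator", "password"),  # placeholder credentials
    headers={"Content-Type": "application/json"},
    data=json.dumps({
        "statement": statement,
        "$doc_key": str(uuid.uuid4()),  # named parameters sent as $name fields
        "$payload": {"source": "queue-1", "body": "..."},  # placeholder message fields
    }),
)
print(resp.json().get("status"))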

SimpleDB Select VS DynamoDB Scan

I am making a mobile iOS app. A user can create an account, and upload strings. It will be like twitter, you can follow people, have profile pictures etc. I cannot estimate the user base, but if the app takes off, the total dataset may be fairly large.
I am storing the actual objects on Amazon S3 and the keys in a database, since listing Amazon S3 keys is slow. So which would be better for storing the keys?
This is my knowledge of SimpleDB and DynamoDB:
SimpleDB:
Cheap
Performs well
Designed for small/medium datasets
Can query using select expressions
DynamoDB:
Costly
Extremely scalable
Performs great; millisecond response
Cannot query
These points are correct to my understanding: DynamoDB is more about killer speed and scalability, SimpleDB is more about querying and price (while still delivering good performance). But if you look at it this way, which will be faster: downloading ALL keys from DynamoDB (and then matching them ourselves), or doing a select query with SimpleDB? Hard, right? One option uses a blazing fast database to download a lot and match locally; the other uses a reasonably good-performance database to query and download only the few correct objects. So, which is faster:
DynamoDB downloading everything and matching, OR SimpleDB querying and downloading just that?
(NOTE: Matching just means using -rangeOfString and string comparison, nothing power-consuming, time-inefficient, or server-side.)
My S3 keys will use this format for every type of object
accountUsername:typeOfObject:randomGeneratedKey
E.g. if you are referring to an account object:
Rohan:Account:shd83SHD93028rF
Or a profile picture:
Rohan:ProfilePic:Nck83S348DD93028rF37849SNDh
I include the randomly generated key for uniqueness; it does not refer to anything, it is simply there so that keys are never repeated, which would make two objects overlap.
In my app, I can either choose SimpleDB or DynamoDB, so here are the two options:
Use SimpleDB: store keys in that format but do not use the format for any lookups; instead use attributes stored with SimpleDB. So I store the key with attributes like username, type, and perhaps others I would otherwise have to encode in the key format. If I want to get the account object for user 'Rohan', I just use SimpleDB Select to query on the 'username' attribute and the 'type' attribute (matching for 'account').
Use DynamoDB: store keys, each in the illustrated format. I scan the whole database, returning every single key, then take advantage of the key format: I can use -rangeOfString to match the ones I want and then download them from S3.
Also, SimpleDB is apparently geographically distributed; how can I enable that, though?
So which is quicker and more reliable? Using SimpleDB to query keys with attributes. Or using DynamoDB to store all keys, scan (download all keys) and match using e.g. -rangeOfString? Mind the fact that these are just short keys that are pointers to S3 objects.
Here is my last question, and the amount of objects in the database will vary on the decided answer, should I:
Create a separate key/object for every single object a user has
Create an account key/object and store all information inside there
There would obviously be different advantages and disadvantages between these two options. For example, retrieval would be quicker if everything is stored separately, but storing it all in one account object is more organized and keeps the dataset smaller.
So what do you think?
Thanks for the help! I have put a bounty on this, really need an answer ASAP.
Wow! What a Question :)
OK, let's discuss some aspects:
S3
S3 listing performance is most likely low because you are not adding a prefix when listing keys.
If you shard by storing the objects like type/owner/id, listing all the ids for a given owner (prefix type/owner/) will be fast, or at least faster than listing everything at once.
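For example, a prefixed listing with boto3 might look like this (the bucket name and prefix layout are assumptions following the type/owner/id idea above):
import boto3

s3 = boto3.client("s3")
# List only the keys that belong to one owner instead of every key in the bucket.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-bucket", Prefix="ProfilePic/Rohan/"):
    for obj in page.get("Contents", []):
        print(obj["Key"])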
Dynamo Versus SimpleDB
In general, this is my advice:
Use SimpleDB when:
Your entity storage isn't going to grow past 10 GB
You need to apply complex queries involving multiple fields
Your queries aren't well defined
You can take advantage of multi-valued data types
Use DynamoDB when:
Your entity storage will exceed 10 GB
You want to scale demand / throughput as it goes
Your queries and model is well-defined, and unlikely to change.
Your model is dynamic, involving a loose schema
You can cache on your client-side your queries (so you can save on throughput by querying the cache prior to Dynamo)
You want to do aggregate/rollup summaries, by using Atomic Updates
Given your current description, it seems SimpleDB is actually better, since:
- Your model isn't completely defined
- You can defer some decision aspects, since it takes a while to hit the (10GiB) limits
Geographical SimpleDB
It isn't supported; SimpleDB works only from us-east-1, AFAIK.
Key Naming
This applies mostly to Dynamo: whenever you can, use a hash + range key. But you could also create keys using only a hash and apply queries like:
List all my records in table T whose key starts with accountid:
List all my records in table T whose key starts with accountid:image
However, those are all Scans. Bear that in mind.
(See this for an overview: http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/API_Scan.html)
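With a hash + range key, those prefix-style lookups become Query operations instead of Scans. A sketch with boto3, assuming a table keyed by accountId (hash) and itemKey (range); every name here is hypothetical:
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("T")  # hypothetical table name

# "All records for this account whose range key starts with 'image:'"
response = table.query(
    KeyConditionExpression=(
        Key("accountId").eq("Rohan") & Key("itemKey").begins_with("image:")
    )
)
for item in response["Items"]:
    print(item["itemKey"])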
Bonus Track
If you're using Java, cloudy-data on Maven Central includes SimpleJPA with some extensions to Map Blob Fields to S3. So give it a look:
http://bitbucket.org/ingenieux/cloudy
Thank you

Google App Engine query optimization

I am trying to do my reads and writes for GAE as efficiently as possible and I was wondering which is the best of the following two options.
I have a website where users are able to post different things, and right now, whenever I want to show all posts by a user, I query for all posts with that user's user ID and then display them. Would it be better to store all of the post IDs in the user entity and do a get_by_id(post_ID_list) to return all of the posts? Or would the extra space being used up not be worth it?
Is there anywhere I can find more information like this to optimize my web app?
Thanks!
The main reason you would want to store the list of IDs would be so that you can get each entity separately for better consistency: gets by key are strongly consistent with the latest version in the datastore, while queries are only eventually consistent.
Check datastore costs and optimize for cost:
https://developers.google.com/appengine/docs/billing
Getting entities by key wouldn't be any cheaper than querying all the posts. The query makes use of an index.
If you use projection queries, you can reduce your costs quite a bit.
There are several cases.
First, you keep track of all the ids of a user's posts. You must use an entity group for consistency, which means the write rate to the datastore would be roughly 1 entity per second. The cost is 1 read for the object holding the ids plus 1 read per entity.
Second, you just use a query. This does not need consistency. The cost is 1 read + 1 read per entity retrieved.
Third, you query only for keys and fetch the entities afterwards. The cost is 1 read + 1 small operation per key retrieved; see Keys-Only Queries. In cost this is equal to projection queries.
If you have many results and use pagination, then you need to use Query Cursors. That prevents wasteful use of the datastore.
The most economical solution is the third case. See Batch Operations.
In case you have a list of ids because they are stored with your entity, a call to ndb.get_multi (if you are using NDB, but it would be similar with any other framework that uses memcache to cache single entities) would save you further datastore calls whenever all (or most) of the entities corresponding to those keys are already cached.
So in the best possible case (everything is in memcache), the datastore wouldn't be touched at all, while a query always would.
See this issue for a discussion and caveats: http://code.google.com/p/appengine-ndb-experiment/issues/detail?id=118.
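A sketch of the keys-only query + get_multi combination in Python NDB (the model and property names are made up); the keys-only fetch is billed as small operations, and get_multi serves cached entities from memcache when possible:
from google.appengine.ext import ndb

class Post(ndb.Model):
    author = ndb.KeyProperty(kind="User")  # hypothetical model
    body = ndb.TextProperty()

def posts_for_user(user_key):
    # One keys-only query (billed as small operations per key)...
    keys = Post.query(Post.author == user_key).fetch(keys_only=True)
    # ...then a batched get that is answered from memcache when possible.
    return ndb.get_multi(keys)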

Possible storage option for data extracted from DBPedia

I'm developing an application that allows users to tag product purchases (via a web app).
I intend to use the tags to automatically query DBpedia (and possibly other open data sources such as Freebase).
The top N results returned from DBpedia will be displayed to users and they will select the one that most closely resembles the tag they entered. (I will only extract specific data.)
For example:
A user enters the tag 'iPhone' and a SPARQL query is sent to DBpedia. The results are parsed and some data on each result is shown to the user, who then selects the one that most closely resembles what they bought.
I want to extract some of the data from the user's selected DBpedia result and store it for marketing purposes at a later stage (ideally via some call to an API).
I was thinking of either Bigdata or Protégé OWL but have no experience with either.
Can anybody suggest the best tool for this task and the advantages/disadvantages/learning curve/etc.?
Thanks
It all depends on what you want to do with the data that you've extracted. The simplest option is just to store the reconciled entity URI along with your other data in a relational database or even a NoSQL database. This lets you easily query Freebase and DBpedia for that entity later on.
If you want to pull in "everything there is to know" about an entity from Freebase and DBpedia, then you're probably better off with a triple store. With this approach, you can query all the data locally; but now you have to worry about keeping it updated.
For the kind of thing you have in mind, I don't think you necessarily need a highly scalable triplestore solution. More important, it seems to me, is that you have a toolkit for easy execution of SPARQL queries, result processing, and quick local caching of RDF data.
With those things in mind, I'd recommend having a look at OpenRDF Sesame. It's a Java toolkit and API for working with RDF and SPARQL with support for multiple storage backends. It has a few built-in stores that perform well for what you need (scaling up to about 100 million facts in a single store), and if you do find you need a bigger/better storage solution, stores like BigData or OWLIM are pretty much just drop-in replacements for Sesame's own storage backends, so you get to switch without having to make large changes to your code.
Just to give you an idea: the following lines of code use Sesame to fire a SPARQL query against DBPedia and process the result:
import org.openrdf.query.BindingSet;
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQuery;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.sparql.SPARQLRepository;

SPARQLRepository dbpediaEndpoint = new SPARQLRepository("http://dbpedia.org/sparql");
dbpediaEndpoint.initialize();
RepositoryConnection conn = dbpediaEndpoint.getConnection();
try {
    String queryString = "PREFIX foaf: <http://xmlns.com/foaf/0.1/> "
            + "SELECT ?x WHERE { ?x a foaf:Person } LIMIT 10";
    TupleQuery query = conn.prepareTupleQuery(QueryLanguage.SPARQL, queryString);
    TupleQueryResult result = query.evaluate();
    while (result.hasNext()) {
        BindingSet bindings = result.next();
        // process each result row; see the Sesame manual/javadocs
        // for details and examples
    }
    result.close();
}
finally {
    conn.close();
}
(disclosure: I work on Sesame)
