What is considered to be a single API call to Cloudant?
To my understanding, each of the following counts as a single API call:
Get a single document
Insert/update a document
Use the getAllDocuments function to retrieve all documents
Get all documents by using a view.
Insert documents by sending all at the same time (Bulk update)
Perform a search query with a search index
Download an attachment from a Cloudant document.
Could you say that whichever function/REST request you make to Cloudant counts as a single API call, no matter how much data or how many documents are transferred in the response?
You are correct. Each of the above actions can be performed with a single API call. Let's deal with each in turn:
Get a single document - GET /db/:id
Insert/update a document - PUT /db/:id
Retrieve all documents - GET /db/_all_docs
Using a view - GET /db/_design/mydesigndoc/_view/myview - Although views can be used to return a selection of documents (with startkey/endkey parameters) or to aggregate the data (by using a 'reduce' operation and optionally grouping by keys)
Bulk insert/update/delete - POST /db/_bulk_docs
Search with a search index - GET /db/_design/mydesigndoc/_search/mysearchindex?q=... (Cloudant Query, POST /db/_find, is likewise a single call)
Get attachment - GET /db/:id/:attachmentname
As a rule of thumb, limit your calls to _bulk_docs to batches of around 500 documents. You can retrieve lots of data from views or _all_docs: Cloudant will happily spool you all the data it has. More commonly, views (or the primary index that powers _all_docs) are used to retrieve subsets of the data by passing startkey/endkey parameters, or by supplying skip/limit parameters.
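For illustration, a minimal Python sketch of a few of these calls over plain HTTP with the requests library; the account URL, database name, credentials and document IDs are placeholders:

```python
import requests

# Placeholder values -- substitute your own Cloudant account, database and credentials.
BASE = "https://ACCOUNT.cloudantnosqldb.appdomain.cloud"
DB = "mydb"
AUTH = ("apikey-user", "apikey-password")

# One API call: fetch a single document by id.
doc = requests.get(f"{BASE}/{DB}/mydocid", auth=AUTH).json()

# One API call: bulk-insert a batch of documents (keep batches to around 500).
batch = [{"name": f"doc{i}"} for i in range(500)]
requests.post(f"{BASE}/{DB}/_bulk_docs", json={"docs": batch}, auth=AUTH)

# One API call: read a slice of a view; key parameters are JSON-encoded.
params = {"startkey": '"2023-01-01"', "endkey": '"2023-12-31"', "limit": 100}
rows = requests.get(
    f"{BASE}/{DB}/_design/mydesigndoc/_view/myview", params=params, auth=AUTH
).json()["rows"]
```

Each of the three requests above counts as one call, regardless of how many documents come back.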
Related
How is aggregation achieved with DynamoDB? MongoDB and Couchbase have map-reduce support.
Let's say we are building a tech blog where users can post articles, and say articles can be tagged.
user
{
id : 1235,
name : "John",
...
}
article
{
id : 789,
title: "dynamodb use cases",
author : 1235, //userid
tags : ["dynamodb","aws","nosql","document database"]
}
In the user interface, we want to show the current user's tags and their respective counts.
How to achieve the following aggregation?
{
userid : 1235,
tag_stats:{
"dynamodb" : 3,
"nosql" : 8
}
}
We will provide this data through a REST API and it will be called frequently, since this information is shown on the app's main page.
I can think of extracting all documents and doing the aggregation at the application level, but I feel my read capacity units would be exhausted.
We could use tools like EMR, Redshift, BigQuery or AWS Lambda, but I think those are meant for data warehousing.
I would like to know of other, better ways of achieving the same thing.
How are people achieving simple dynamic queries like these, having chosen DynamoDB as their primary data store, considering cost and response time?
Long story short: DynamoDB does not support this. It's not built for this use case. It's intended for quick, low-latency data access. It simply does not have any aggregation functionality.
You have three main options:
Export the DynamoDB data to Redshift or EMR Hive. Then you can execute SQL queries on stale data. The benefit of this approach is that it consumes RCUs just once, but you will be working with outdated data.
Use the DynamoDB connector for Hive and query DynamoDB directly. Again, you can write arbitrary SQL queries, but in this case the queries access the data in DynamoDB directly, so they consume read capacity on every query you run.
Maintain aggregated data in a separate table using DynamoDB Streams. For example, you can have a table with UserId as the partition key and a nested map of tags and counts as an attribute. On every update to your original data, DynamoDB Streams will invoke a Lambda function (or some code on your hosts) to update the aggregate table. This is the most cost-efficient method, but you will need to implement additional code for each new query (a sketch of the aggregate-table update follows).
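For option 3, a minimal sketch of the aggregate-table update, assuming a hypothetical user-tag-stats table with UserId as the partition key and a tag_stats map attribute (boto3; names are illustrative):

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
# Hypothetical aggregate table: partition key UserId, map attribute tag_stats.
stats = dynamodb.Table("user-tag-stats")

def bump_tag_count(user_id, tag, delta):
    """Atomically add `delta` (+1 or -1) to one user's counter for one tag."""
    try:
        stats.update_item(
            Key={"UserId": user_id},
            # Create the counter on first use, then apply the delta atomically.
            UpdateExpression="SET tag_stats.#t = if_not_exists(tag_stats.#t, :zero) + :d",
            ExpressionAttributeNames={"#t": tag},
            ExpressionAttributeValues={":zero": 0, ":d": delta},
        )
    except ClientError as err:
        if err.response["Error"]["Code"] != "ValidationException":
            raise
        # First write for this user: the tag_stats map does not exist yet.
        stats.update_item(
            Key={"UserId": user_id},
            UpdateExpression="SET tag_stats = :m",
            ExpressionAttributeValues={":m": {tag: delta}},
        )
```

The REST API can then serve the pre-aggregated counts with a single GetItem on UserId instead of scanning the articles table.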
Of course you can extract data at the application level and aggregate it there, but I would not recommend doing it. Unless you have a small table, you will need to think about throttling, about using just part of the provisioned capacity (you want to consume, say, 20% of your RCUs for aggregation, not 100%), and about how to distribute your work among multiple workers.
Both Redshift and Hive already know how to do this. Redshift relies on multiple worker nodes when it executes a query, while Hive is based on top of MapReduce. Also, both Redshift and Hive can use a predefined percentage of your RCU throughput.
DynamoDB is a pure key/value store and does not support aggregation out of the box.
If you really want to do aggregation using DynamoDB, here are some hints.
For your particular case, let's have a table named articles.
To do the aggregation we need an extra table, user-stats, holding userId and tag_stats.
Enable DynamoDB Streams on the articles table.
Create a new Lambda function, user-stats-aggregate, which is subscribed to the articles DynamoDB stream and receives NEW_AND_OLD_IMAGES on every create/update/delete operation on the articles table.
The Lambda will perform the following logic:
If there is no old image, take the current tags and increment each one's count by 1 for this user (keep in mind there may be no initial record in user-stats for this user yet).
If there is an old image, see which tags were added or removed and apply +1 or -1 accordingly for each affected tag for that user.
Stand up an API service that retrieves these user stats. (A sketch of the Lambda diff logic follows.)
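A minimal sketch of that diff logic, assuming the stream is configured with NEW_AND_OLD_IMAGES and that articles carry a tags list and an author number attribute as in the question (attribute names are illustrative):

```python
def tag_deltas(record):
    """Compute {tag: +1/-1} changes implied by one DynamoDB stream record."""
    images = record.get("dynamodb", {})
    old_tags = {t["S"] for t in images.get("OldImage", {}).get("tags", {}).get("L", [])}
    new_tags = {t["S"] for t in images.get("NewImage", {}).get("tags", {}).get("L", [])}
    deltas = {}
    for tag in new_tags - old_tags:   # tags added by this write
        deltas[tag] = deltas.get(tag, 0) + 1
    for tag in old_tags - new_tags:   # tags removed, or the whole article deleted
        deltas[tag] = deltas.get(tag, 0) - 1
    return deltas

def lambda_handler(event, context):
    """Entry point for the user-stats-aggregate Lambda (stream trigger)."""
    for record in event.get("Records", []):
        images = record.get("dynamodb", {})
        source = images.get("NewImage") or images.get("OldImage") or {}
        user_id = int(source.get("author", {}).get("N", 0))
        for tag, delta in tag_deltas(record).items():
            # Apply each delta to the user-stats item here, e.g. with an
            # UpdateExpression that uses if_not_exists() to create counters.
            print(user_id, tag, delta)   # placeholder for the update call
    return {"ok": True}
```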
In general, aggregation in DynamoDB can be done using DynamoDB Streams, Lambdas that perform the aggregation, and extra tables keeping the aggregated results at different granularities (minutes, hours, days, years, ...).
This gives near-real-time aggregation without having to compute it on the fly for every request; you query the pre-aggregated data instead.
Basic aggregation can also be done using scan() and query() in a Lambda.
We are using Postgres to store and work with app data, the app data contains mainly:
We need to store the incoming request json after processing that.
We need to search for a particular JSON document using its identifier field, for which we are creating a separate column in the table.
Clients may require searching within the JSON column; I mean, a client wants to find a JSON document based on a certain key value inside the JSON.
All of this works fine at present with Postgres. However, I read a blog article which mentioned that Elasticsearch can also be used as a backend data store, not just as a search server. If that's possible, can we replace Postgres with Elasticsearch? What advantages would I get by doing this, and what are the pros and cons of Postgres compared with Elasticsearch for my case?
Can anyone give some advice, please?
Responding to the questions one by one:
We need to store the incoming request json after processing that.
Yes and no. Elasticsearch allows you to store JSON objects. This works if the JSON structure is known beforehand and/or is stable (i.e. the same keys in the JSON always have the same type).
By default the mapping (i.e. the schema of the collection) is dynamic, meaning the schema is inferred from the values inserted. Say we insert this document:
{"amount": 1.5} <-- insert succeeds
And immediately after try to insert this one:
{"amount": {"value" 1.5, "currency": "EUR"]} <-- insert fails
ES will reply with an error message:
Current token (START_OBJECT) not numeric, can not use numeric value accessors\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper#757a68a8; line: 1, column: 13]
If you have JSON objects of unknown structure you can still store them in ES by using the type object and setting the property enabled: false; however, this will not allow you to run any kind of query against the content of such a field.
We need to search the particular JSON using the identifier field, for which we are creating a separate column for each row in the table
Yes. This can be done using a field of type keyword if the identifier is an arbitrary string, or integer if it is a number.
For Clients, they may require searching the JSON column, I mean the client wants to find one JSON document based on a certain key value in the JSON.
As per 1), yes and no. If the JSON schema is known and strict, it can be done. If the JSON structure is arbitrary, it can be stored but will not be queryable.
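To illustrate points 1-3 together, a minimal sketch of an explicit mapping: a keyword field for the identifier, a strictly typed amount object, and a payload field that is stored but not indexed via enabled: false. The index name, field names and local node URL are placeholders:

```python
import requests

ES = "http://localhost:9200"          # placeholder node
INDEX = "requests_log"                # placeholder index name

mapping = {
    "mappings": {
        "properties": {
            # Point 2: exact-match lookups on the identifier.
            "identifier": {"type": "keyword"},
            # Point 1: a known, stable structure can be typed explicitly.
            "amount": {
                "properties": {
                    "value": {"type": "double"},
                    "currency": {"type": "keyword"},
                }
            },
            # Point 3: arbitrary JSON can be stored but not queried.
            "payload": {"type": "object", "enabled": False},
        }
    }
}

requests.put(f"{ES}/{INDEX}", json=mapping).raise_for_status()

# Exact-match search on the identifier field.
query = {"query": {"term": {"identifier": "req-12345"}}}
hits = requests.post(f"{ES}/{INDEX}/_search", json=query).json()
```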
Though I would say Elasticsearch is not suitable for your case, there are people who make JDBC and ODBC drivers for Elasticsearch, so apparently in some cases Elasticsearch can be used as a relational database.
Elasticsearch is an HTTP wrapper around Apache Lucene. Lucene stores objects in a columnar fashion in order to speed up search (Lucene segments).
I am complementing Nikolay's very good answer:
The good:
Both Lucene and Elasticsearch are solid projects
Elasticsearch is (in my opinion) the best and easiest software for clustering (sharding and replication)
Supports version conflicts / optimistic concurrency control (https://www.elastic.co/guide/en/elasticsearch/guide/current/concurrency-solutions.html)
The bad:
Not real-time, only near-real-time search (https://www.elastic.co/guide/en/elasticsearch/guide/current/near-real-time.html)
No support for ACID transactions (changes to individual documents are ACIDic, but not changes involving multiple documents)
Slow at retrieving large amounts of data (you must use the search scroll API, which is very slow compared to a SQL database fetch)
No authentication or access control
My opinion: use Elasticsearch as a kind of read-only view of your database.
Currently, I have two databases that share only one field. I need to append the data from one database to the documents generated by the other, but the mapping is one to many, such that multiple documents will have the new data appended to them. Is this possible in Solr? I've read about nested documents; however, in this case the "child" documents would be shared by many "parent" documents.
Thank you.
I see two main options:
You can write some client code (for example with SolrJ) that reads all the data needed for a given doc from all data sources (doing a SQL join, looking up the separate DB, whatever), and then writes the doc to Solr. Of course, you can (and should) do this in batches if you can.
You can index the first DB into Solr (using DIH if that's doable, so it's quick to develop). It is important that you store all fields (or use docValues) so you can get all your data back later. Then you write some client code that:
a) retrieves all data about a doc,
b) gets all the data that must be added from the other DB,
c) builds a new representation of the doc (with child docs if needed),
d) updates the doc in Solr, overwriting it.
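The question doesn't name a client language; as an illustration of the first option (merge on the client, index in batches), a rough Python sketch posting to Solr's JSON update handler. The core name, field names and the join itself are placeholders:

```python
import requests

SOLR_UPDATE = "http://localhost:8983/solr/mycore/update?commit=true"  # placeholder core
BATCH_SIZE = 500

def merged_docs():
    """Yield Solr documents built by joining the two databases on the shared field.
    Placeholder data: replace with real queries against DB1 and DB2."""
    parents = [{"id": "p1", "join_key": "k1"}, {"id": "p2", "join_key": "k1"}]
    extras = {"k1": {"extra_field": "shared value"}}   # child data keyed by the join field
    for parent in parents:
        doc = dict(parent)
        doc.update(extras.get(parent["join_key"], {}))  # one-to-many: same data on many docs
        yield doc

batch = []
for doc in merged_docs():
    batch.append(doc)
    if len(batch) >= BATCH_SIZE:
        requests.post(SOLR_UPDATE, json=batch).raise_for_status()
        batch = []
if batch:
    requests.post(SOLR_UPDATE, json=batch).raise_for_status()
```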
I've created an API on Google Cloud Endpoints that gets all the data of a single entity kind in Datastore. The NoSQL request (a really simple one: SELECT * FROM Entity) is performed with Objectify.
This Datastore kind is populated with 200 rows (entities), and each row (entity) has a list of child entities of the same kind:
MEAL:
String title
int preparationTime
List< Ingredient > listOfIngredients (child entities...)
...
So when I call the API, a JSON response is returned. Its size is about 641 KB and it has 17K lines.
When I look at the API Explorer, it tells me that the request takes 4 seconds to execute.
I would like to decrease that time, because it's really high. I've already:
Increased the GAE instance class to F2
Enabled Memcache
It helps a little, but I don't think this is the most efficient way...
Should I use BigQuery to generate the JSON faster? Or maybe there is another solution?
Do you need all the entities in a single request?
If not, then you can batch-fetch entities using cursor queries and display them as needed, e.g. fetch 20 or 30 entities at a time depending on your needs.
If yes:
Does your Meal entity change often?
If not, you can generate a JSON file and store it in GCS; whenever your entity changes you regenerate the file, so fetching on the client side will be a lot faster, and by using the ETag header new content can be pulled easily.
If it does change often,
then I think batch fetching is the only effective way to pull that many entities (a cursor-query sketch follows).
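The question uses Objectify on App Engine; purely as an illustration of the cursor pattern, a sketch with the Python Datastore client (the kind name and page size are placeholders):

```python
from google.cloud import datastore

client = datastore.Client()

def fetch_page(cursor=None, page_size=30):
    """Return one page of Meal entities plus a cursor for the next page."""
    query = client.query(kind="Meal")            # placeholder kind name
    result_iter = query.fetch(limit=page_size, start_cursor=cursor)
    page = next(result_iter.pages)               # runs the query for one page
    entities = list(page)
    return entities, result_iter.next_page_token

# The first request returns 30 entities and an opaque cursor that the client
# sends back to get the next page, instead of pulling all 200 parents (and
# their child lists) in a single 641 KB response.
meals, cursor = fetch_page()
more_meals, cursor = fetch_page(cursor)
```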
Here is the scenario:
I am handling a SQL Server database with a stored procedure which takes care of returning headers for web feed items (RSS/Atom) that I am serving through a web application.
This stored procedure should, when called by the Service Broker task running at a given interval, verify whether there has been a significant change in the underlying data; if so, it triggers a resource-intensive activity of formatting the feed item header through a call to the web application, which will retrieve the data, format it and return it to the SQL database.
There the header would be stored, ready for an RSS feed update request from the client.
Now, trying to design this to be as efficient as possible, I still have a couple of decision points I'd like to get your suggestions about.
My tentative approach in the stored procedure would be:
gather the data in an in-memory table,
create a subquery with the signature columns which change with the information,
convert them to XML with a FOR XML AUTO
hash the result with MD5 (with HASHBYTES or fn_repl_hash_binary depending on the size of the result)
check whether the hash matches the one stored in the table where I keep the HTML waiting for feed requests,
if the hash matches, do nothing; otherwise, proceed with the updates.
My first doubt is about the best way to check whether the base data has changed.
Converting to XML significantly inflates the data (which slows hashing), and I am not using the XML result for anything other than hashing: is there a better way to perform the check, or to pack all the data together for hashing (something CSV-like)?
The query merges and aggregates data from multiple tables, so I would not rely on table timestamps, as a change in them is not necessarily related to a change in the result set.
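To illustrate the "pack compactly instead of XML" idea, here is a sketch of the same check done from application code: join the signature columns with a delimiter and hash that, comparing against the stored hash. The connection string, query and table names are placeholders; inside the stored procedure the analogue would be concatenating the signature columns with a delimiter before passing the result to HASHBYTES.

```python
import hashlib
import pyodbc

# Placeholder connection string and signature query.
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes")
cur = conn.cursor()
cur.execute("SELECT col_a, col_b, col_c FROM dbo.FeedSignature ORDER BY col_a")

# Pack the signature rows into a compact canonical text form (unit separator
# between columns, newline between rows) and hash that instead of an XML blob.
digest = hashlib.md5()
for row in cur:
    digest.update("\x1f".join("" if v is None else str(v) for v in row).encode("utf-8"))
    digest.update(b"\n")

new_hash = digest.hexdigest()
# Compare new_hash with the hash stored alongside the cached feed header and
# regenerate the header only when the two differ.
```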
The second point is: what is the best way to serve the data to the webapp for reformatting?
I might push the data through a CLR function to the web application to get it formatted (but this is synchronous, and for multiple feed items it would create an unsustainable delay),
or
I might instead save the result set and trigger multiple asynchronous calls through Service Broker. The web app would then retrieve the stored data instead of re-running the expensive query that produced it.
Since I have different formats depending on the feed item category, I cannot use the same table format - so storing to a table is going to be hard.
I might serialize to XML instead.
But is this going to provide any significant gain compared to re-running the query?
For the efficient caching bit, have a look at query notifications. The tricky bit in implementing this in your case is you've stated "significant change" whereas query notifications will trigger on any change. But the basic idea is that your application subscribes to a query. When the results of that query change, a message is sent to the application and it does whatever it is programmed to do (typically refreshing cached data).
As for serving the data to your app, there's a saying in the business: "don't go borrowing trouble". Which is to say if the default method of serving data (i.e. a result set w/o fancy formatting) isn't causing you a problem, don't change it. Change it only if and when it's causing you a significant enough headache that your time is best spent there.