Expire documents in Cloudant - cloudant

Is there any way to expire documents from the database? For example, set a time-to-live of 10 hours so that all documents aged 10 hours or more are deleted from the database.
Our data only makes sense during the day; anything older is useless and can be deleted.

In short, no. What you can do, however, is build it into your application logic, but that adds overhead which may be detrimental to your application's performance.
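For illustration, here is a minimal sketch of that application-side approach, using the CouchDB-style HTTP API that Cloudant exposes (a Mango `_find` query followed by a `_bulk_docs` deletion). The database URL, credentials, and the `created_at` field are assumptions, and the selector needs a Mango index on `created_at` to be efficient:

```python
# Sketch: delete documents older than 10 hours via Cloudant's CouchDB-style
# HTTP API. DB_URL, credentials, and the created_at field are assumptions.
import time
import requests

DB_URL = 'https://ACCOUNT.cloudant.com/mydb'  # hypothetical account/database
AUTH = ('user', 'password')                   # hypothetical credentials

cutoff = time.time() - 10 * 3600  # epoch seconds, 10 hours ago

# Find expired documents (needs a Mango index on created_at to be efficient).
resp = requests.post(DB_URL + '/_find', auth=AUTH, json={
    'selector': {'created_at': {'$lt': cutoff}},
    'fields': ['_id', '_rev'],
    'limit': 1000,
})
docs = resp.json()['docs']

# Mark them deleted in a single bulk request.
if docs:
    deletions = [{'_id': d['_id'], '_rev': d['_rev'], '_deleted': True}
                 for d in docs]
    requests.post(DB_URL + '/_bulk_docs', auth=AUTH, json={'docs': deletions})
```

You would run something like this periodically (e.g. from cron), accepting the extra requests it adds to your workload.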

Related

Problems and solutions when using a secondary datastore alongside the main database?

I am in the middle of an interview simulation and I got stuck on one question. Can someone provide the answer for me, please?
The question:
We use a secondary datastore (we use Elasticsearch alongside our main database) for real-time analytics and reporting. What problems might you anticipate with this sort of approach? Explain how you would go about solving or mitigating them.
Thank you
There are several problems:
No transactional cover: if your main database is transactional (which it usually is), you either commit or you don't. After a record is inserted into your main database, there is no guarantee that it will be committed to ES. In fact, if you commit several records to your primary DB, you may have a situation where some of them are committed to ES and a few others are not. This is a MAJOR issue.
Refresh interval: Elasticsearch refreshes every second by default. That means "real-time" is generally 1 second later, or at least whenever the data is next queried. If you commit a record into your primary DB and immediately search for it via ES, it may not be found. The only way around this is to GET the record using its ID (see the sketch after this list).
Data duplication: Elasticsearch cannot do joins. You need to denormalize all data that is coming from an RDBMS. If one user has many posts, you cannot "join" to search; you have to add the user ID and any other user-specific details to every post object.
Hardware: Elasticsearch needs RAM (a bare minimum of 1 GB) to work properly, and that's assuming you don't use anything else from the ELK stack. This is an important cost consideration.
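To make the refresh-interval point concrete, here is a hedged sketch using the elasticsearch-py client (8.x-style keyword arguments; the index name, ID, and document are illustrative): a search issued immediately after indexing may miss the document, while a GET by ID is real-time.

```python
# Sketch of the refresh-interval workaround: GET by ID is real-time,
# while search only sees documents after the next refresh (~1s by default).
from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')

es.index(index='posts', id='42', document={'title': 'hello'})

# A search issued right away may return no hits yet...
hits = es.search(index='posts', query={'match': {'title': 'hello'}})

# ...but a GET by ID always sees the document.
doc = es.get(index='posts', id='42')
print(doc['_source'])
```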
One problem might be synchronization issues, where the Elasticsearch store gets out of sync and starts serving stale data. To avoid this, you will have to implement monitoring on your data pipeline, Elasticsearch, and the primary database, detecting problems by checking update times, delay, the number of records (within some margin of error) in each of them, and overall system status (up / down).
Another is disconnection and recovery: what happens if your data pipeline or Elasticsearch loses its connection to the rest of the system? You will need an automatic way to reconnect when the network is restored and start synchronising data again.
You also have to account for a sudden influx of data: how do you scale Elasticsearch ingestion or your data processor (data pipeline) when there is a large volume of updates and inserts during peak hours, or after reconnecting following a network outage?

Scale database that receives streaming data with small resources

My use case is the following: I run about 60 websockets from 7 data sources in parallel that record stock tickers (so time-series data). Currently, I'm writing the data into a mongodb that is hosted on a Google Cloud VM such that every data source has its own collection and all collections are hosted inside the same database.
However, the database has grown to 0.6 GB and ~10 million rows after only five days of data. I'm pretty new to such questions, but I have a feeling this is not a viable long-term solution. I will never need all of the data at once, but I do need to be able to query all of it by date / currency. However, as I understand it, those queries might become impossible once the dataset is bigger than my RAM; is that true?
Moreover, this is a research project, but unfortunately I'm currently not able to use a university cluster, so I'm hosting the data on a private VM. This is subject to a budget constraint, however, and highly performant machines quickly become very expensive. That's why I'm questioning my design choice. Currently I'm thinking of either switching to another kind of database (but I fear I'll run into the same issues again) or exporting the database to CSV once per week / month / whatever and wiping it. That would be quite a hassle, though, and I'm also scared of losing data.
So my question is: how can I design this database so that I can subset the data by one of the keys (either datetime or ticker_id) even when the database grows larger than my machine's RAM? Disk space is not an issue.
On top of what Alex Blex already commented about storage and performance:
Query response time (you are already close to 10M rows after 5 days) will worsen as the data set grows. You can look at sharding to break the collection down into reasonable chunks while still having access to all the data for query purposes.
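Before sharding, note that queries on a dataset larger than RAM remain practical as long as they are index-bound. A minimal sketch with pymongo, where the database, collection, and field names are assumptions based on the question:

```python
# Sketch: a compound index keeps ticker/date queries index-bound even when
# the collection outgrows RAM. Names (marketdata, ticks, ticker_id,
# datetime) are assumptions based on the question.
from datetime import datetime, timedelta
from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient('mongodb://localhost:27017')
ticks = client['marketdata']['ticks']

# One-time setup: queries filtered by ticker and date range walk this
# index instead of scanning (and paging in) the whole collection.
ticks.create_index([('ticker_id', ASCENDING), ('datetime', DESCENDING)])

end = datetime.utcnow()
start = end - timedelta(hours=24)
cursor = ticks.find(
    {'ticker_id': 'BTC-USD', 'datetime': {'$gte': start, '$lt': end}}
).sort('datetime', DESCENDING)
```

Only the index, not the full data set, needs to fit comfortably in RAM for such queries to stay fast.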

Google appengine, least expensive way to run heavy datastore write cron job?

I have a Google appengine application, written in Go, that has a cron process which runs once a day at 3am. This process looks at all of the changes that have happened to my data during the day and stores some meta data about what happened. My users can run reports on this meta data to see trends that have happened over several months. The process does around 10-20 million datastore writes every night. It all works just fine, but since I have started running it I have noticed a significant increase in my monthly bill from Google (from around $50/month to around $400/month).
I have just set up a very basic taskqueue that this runs in; I have not changed the default settings at all. Is there a better way I could be running this process at night that would save me money? I have never messed around with the backends (which are now deprecated) or the modules API, and I know they've changed a lot of this stuff recently, so I'm not sure where to start looking. Any advice would be greatly appreciated.
Look at your instances at 3am. It might be that GAE spins up a lot of them to handle the job. You could configure your job to run less in parallel, so it will take longer but perhaps need only one instance.
However, if your database writes are indeed the biggest factor this won't make a big impact.
You can try looking at your data models and indexes. Remember that each indexed field costs 2 extra writes, so see if you can remove indexes from fields you don't need.
One improvement you can make is to batch your write operations; you can use memcache for this (pay for the dedicated one, since it's more reliable). Write the updates to memcache and, once it reaches about 900K, flush it to the datastore. This reduces the number of datastore writes A LOT, especially if your metadata's size is small.
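As a rough illustration of both points, here is a hedged sketch in GAE's Python ndb API (the Go datastore API has equivalent batching). The `DailyMeta` model and `iterate_changes` generator are hypothetical:

```python
# Sketch: unindexed properties avoid the per-index write cost, and
# put_multi batches many entities into one RPC. DailyMeta and
# iterate_changes are hypothetical stand-ins for the real job.
from google.appengine.ext import ndb

class DailyMeta(ndb.Model):
    # indexed=False: no extra index writes for this property.
    payload = ndb.StringProperty(indexed=False)

buffer = []
for change in iterate_changes():  # hypothetical: the day's change records
    buffer.append(DailyMeta(payload=change))
    if len(buffer) >= 500:        # datastore batch limit per call
        ndb.put_multi(buffer)     # one RPC instead of 500 individual puts
        buffer = []
if buffer:
    ndb.put_multi(buffer)
```

Batching cuts RPC overhead and instance time; dropping unneeded indexes is what actually reduces the billed write operations.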

Improving database record retrieval throughput with appengine

Using App Engine with Python and the HRD, retrieving records sequentially (via an indexed field that is an incrementing integer timestamp), we get 15,000 records returned in 30-45 seconds. (Batching and limiting are used.) I did experiment with running queries on two instances in parallel but still achieved the same overall throughput.
Is there a way to improve this overall number without changing any code? I'm hoping we can just pay some more and get better database throughput. (You can pay more for bigger frontends but that didn't affect database throughput.)
We will be changing our code to store multiple underlying data items in one database record, but hopefully there is a short term workaround.
Edit: These are log records being downloaded to another system. We will fix it in the future and know how to do so, but I'd rather work on more important things first.
Try splitting the records across different entity groups. That might force them onto different physical servers. Read the entity groups in parallel from multiple threads or instances.
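A hedged sketch of the parallel-read idea in Python ndb, splitting the indexed timestamp range into disjoint shards and fetching them concurrently (the `LogRecord` model and the shard bounds are assumptions):

```python
# Sketch: fetch disjoint timestamp ranges in parallel threads. The
# LogRecord model and the shard bounds are hypothetical.
import threading
from google.appengine.ext import ndb

class LogRecord(ndb.Model):
    ts = ndb.IntegerProperty()  # indexed incrementing timestamp

def fetch_range(lo, hi, out):
    q = LogRecord.query(LogRecord.ts >= lo, LogRecord.ts < hi).order(LogRecord.ts)
    out.extend(q.fetch(limit=None, batch_size=1000))

bounds = [0, 5000, 10000, 15000]   # split the range into 3 shards
threads, results = [], []
for lo, hi in zip(bounds, bounds[1:]):
    shard = []
    t = threading.Thread(target=fetch_range, args=(lo, hi, shard))
    threads.append((t, shard))
    t.start()
for t, shard in threads:
    t.join()
    results.extend(shard)
```

Whether this helps depends on where the bottleneck is; if a single instance's RPC pipeline is saturated, spreading the shards across instances instead of threads is the next step.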
Using a cache might not work well for large tables.
Maybe you can cache your records, e.g. using Memcache:
https://developers.google.com/appengine/docs/python/memcache/
This could definitely speed up your application's access. I don't think the App Engine Datastore is designed for speed, but for scalability. Memcache, however, is.
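For what it's worth, a minimal read-through-cache sketch with the App Engine memcache API (the `Record` model and the key scheme are assumptions):

```python
# Sketch: read-through cache in front of the datastore. Record and the
# 'rec:<id>' key scheme are hypothetical.
from google.appengine.api import memcache
from google.appengine.ext import ndb

class Record(ndb.Model):
    data = ndb.TextProperty()

def get_record(record_id):
    key = 'rec:%d' % record_id
    rec = memcache.get(key)
    if rec is None:                          # cache miss
        rec = Record.get_by_id(record_id)    # datastore round-trip
        memcache.set(key, rec, time=3600)    # cache for an hour
    return rec
```

As noted above, this mainly helps for hot records that are read repeatedly; it won't speed up a one-off sequential scan of 15,000 rows.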
BTW, if you are conscious of the performance GAE gives you for what you pay, then maybe you can try setting up your own App Engine cloud with:
AppScale
JBoss CapeDwarf
Both have active community support. I'm using CapeDwarf in my local environment; it is still in beta, but it works.
Move to an in-memory database. If you have Oracle Database, using TimesTen will improve the throughput many times over.

How stable is Cassandra?

I've been running a Cassandra instance without a restart for a few days, for a simple task: storing tweets, at 1-2 saves per second. Then Cassandra got really slow and I had to kill and restart it.
Is this Cassandra's expected stability now? Would it be a good solution to write a daemon to kill/restart it every day or two?
No. Cassandra is widely expected to be more stable than that. If it is not stable, there is a substantial chance you have configured it wrong. It may be attempting to use more memory than you expect, for instance. If you have encountered a bug or defect in Cassandra, it is not one which is afflicting the majority of users.
As for your "restart daemon" plan, I'm going to go with "that's a horrible solution for pretty much everything, and especially so for something that you're trusting with any data you actually care about."
From https://cassandra.apache.org/
Cassandra is in use at Netflix, Twitter, Urban Airship, Constant
Contact, Reddit, Cisco, OpenX, Digg, CloudKick, Ooyala, and more
companies that have large, active data sets. The largest known
Cassandra cluster has over 300 TB of data in over 400 machines.
It was (is?) largely used at Facebook too. I would say it's stable. :)
And btw, I don't think it's meant to be restarted every 1-2 days at all: you use it if you have huge data sets with high availability (HA) requirements, and going down every 2 days is not HA.
