I am searching for an easy way to stream CloudWatch logs (just one specific log group) to a PostgreSQL database. Streaming does not have to be in real time (inserting logs into the DB once an hour is enough).
Options that I have found:
Use subscription filters with AWS Lambda (https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/SubscriptionFilters.html#LambdaFunctionExample), but I am not sure whether this solution is robust enough to handle such a load, or how well it would scale in the future.
Write my own solution using, for example, the AWS SDK for Python (Boto3), but then I would need to maintain that code, which seems like unnecessary work. (A rough sketch of this approach is shown below.)
I would like to know if there is already a tool built for this, or what the easiest way to achieve it would be.
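For reference, a minimal sketch of option two (Boto3 plus psycopg2, run once an hour from cron or a scheduled Lambda) might look like the following. The log group name, connection string and table layout are my own placeholders, not anything prescribed by AWS:

```python
# Hypothetical hourly job: pull the last hour of events from one CloudWatch
# Logs group and insert them into a PostgreSQL table.
# Assumes a table:  CREATE TABLE cloudwatch_logs (event_id TEXT PRIMARY KEY,
#                   log_stream TEXT, ts TIMESTAMPTZ, message TEXT);
import time

import boto3
import psycopg2

LOG_GROUP = "/my/app/log-group"           # assumption: your log group name
DSN = "dbname=mydb user=myuser host=..."  # assumption: your Postgres DSN


def copy_last_hour():
    logs = boto3.client("logs")
    now_ms = int(time.time() * 1000)
    start_ms = now_ms - 3600 * 1000

    conn = psycopg2.connect(DSN)
    with conn, conn.cursor() as cur:
        paginator = logs.get_paginator("filter_log_events")
        for page in paginator.paginate(
            logGroupName=LOG_GROUP, startTime=start_ms, endTime=now_ms
        ):
            for event in page["events"]:
                cur.execute(
                    """
                    INSERT INTO cloudwatch_logs (event_id, log_stream, ts, message)
                    VALUES (%s, %s, to_timestamp(%s / 1000.0), %s)
                    ON CONFLICT (event_id) DO NOTHING
                    """,
                    (
                        event["eventId"],
                        event["logStreamName"],
                        event["timestamp"],
                        event["message"],
                    ),
                )
    conn.close()


if __name__ == "__main__":
    copy_last_hour()
```

Run from any hourly scheduler, this gives the "once an hour is enough" behaviour without a subscription filter, at the cost of maintaining the script yourself.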
I am developing various services which talk directly to BigQuery by streaming rows into it. Right now I am updating the schema directly from the Google Cloud UI, which, as you can imagine, has been causing issues due to forgetfulness!
I would like to understand how best to keep code and schemas aligned for what are still fast-evolving services and schemas.
My current ideas are:
use something like Terraform, but I am unsure how this works with live tables which need updating or migrating
add code to the service to check/set the schema, which would at least throw errors, if not automate the process
Thanks in advance!
EDIT:
To give more clarity, as requested in the comments: we are using Cloud Run microservices to stream rows into BigQuery; the services are written in Python/Node. Their primary goal is to do a light transform on the data and store it in BQ.
Not really sure what more to add; my ideal scenario is that we have something in the code which also defines, or at least checks, the schema, to keep the code and the DB in sync.
As with any database, you have to follow some rules or best practices to avoid errors and conflicts. For example, avoid updating the schema manually; choose to always do it with code instead (which is better because you can follow schema changes through your Git history).
Then you can have a side process that checks the schema and updates it with the newest changes, or do this at startup (only if you aren't on serverless and the app start duration isn't a concern for you).
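As a rough illustration of such a check, here is a minimal sketch using the google-cloud-bigquery Python client; the table ID and expected schema are made-up placeholders for whatever your service defines:

```python
# Sketch: compare the schema defined in code with the live BigQuery table,
# and append any missing columns. The table ID and field list are placeholders.
from google.cloud import bigquery

TABLE_ID = "my-project.my_dataset.events"  # assumption: your table

# The schema the service expects, kept in code (and therefore in Git).
EXPECTED_SCHEMA = [
    bigquery.SchemaField("event_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("created_at", "TIMESTAMP", mode="REQUIRED"),
    bigquery.SchemaField("payload", "STRING", mode="NULLABLE"),
]


def sync_schema():
    client = bigquery.Client()
    table = client.get_table(TABLE_ID)

    existing = {field.name for field in table.schema}
    missing = [f for f in EXPECTED_SCHEMA if f.name not in existing]

    if missing:
        # BigQuery only allows additive changes this way (new NULLABLE/REPEATED
        # columns); anything else needs a real migration.
        table.schema = list(table.schema) + [
            bigquery.SchemaField(f.name, f.field_type, mode="NULLABLE")
            for f in missing
        ]
        client.update_table(table, ["schema"])
        print(f"Added columns: {[f.name for f in missing]}")
    else:
        print("Schema already up to date.")


if __name__ == "__main__":
    sync_schema()
```

Running something like this at deploy time (or as a startup check that only warns) keeps the schema definition in Git alongside the service code.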
Terraform is perfect for deploying infrastructure, but it is much more limited when it comes to updating/patching existing components. I don't recommend it for this.
I'm currently doing real-time synchronization with Elasticsearch (upon save in the database, I also save to Elasticsearch).
The problem that I have is synchronization of all entities through some tool (probably Logstash) - though I'm not sure about best practices. I would like to be able to synchronize a specific entity (or all entities), which is not a problem since I have a DB view for each entity, but I'm not sure about the performance of whole-DB synchronization, and whether there are any limitations in Logstash or other tools.
Basically the idea is to run a full synchronization on initial project setup, and then just run a synchronization if something goes wrong, or if the model changes and needs an Elasticsearch update. I don't have too many records for now (<1M overall, I'd say).
Any suggestions would be much appreciated!
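If it helps, here is a minimal sketch of what the one-off full synchronization could look like if done by hand in Python with the official client's bulk helper. I'm assuming a PostgreSQL source purely for illustration, and the view/index names and connection details are placeholders:

```python
# Sketch: one-off full reindex of a single entity from a DB view into
# Elasticsearch, using the official Python client's bulk helper.
import psycopg2
import psycopg2.extras
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

DSN = "dbname=mydb user=myuser host=localhost"   # assumption
ES_URL = "http://localhost:9200"                 # assumption
VIEW = "entity_product_view"                     # assumption: one view per entity
INDEX = "products"                               # assumption


def generate_actions(cursor):
    """Turn each row of the DB view into a bulk index action."""
    for row in cursor:
        doc = dict(row)
        yield {
            "_op_type": "index",
            "_index": INDEX,
            "_id": doc.pop("id"),   # assumes the view exposes a primary key 'id'
            "_source": doc,
        }


def full_sync():
    es = Elasticsearch(ES_URL)
    conn = psycopg2.connect(DSN)
    # Server-side (named) cursor so the rows are streamed, not loaded at once.
    with conn, conn.cursor(
        name="sync", cursor_factory=psycopg2.extras.RealDictCursor
    ) as cur:
        cur.execute(f"SELECT * FROM {VIEW}")
        ok, errors = bulk(es, generate_actions(cur), chunk_size=1000,
                          raise_on_error=False)
        print(f"indexed={ok}, errors={len(errors)}")
    conn.close()


if __name__ == "__main__":
    full_sync()
```

With fewer than 1M rows overall, a full reindex like this is usually quick, and the same script can be pointed at a single entity's view when only one model changes.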
You can use the JDBC input plugin for Logstash. You can even use the cron-style scheduler that is built into the plugin.
The documentation is here: JDBC input plugin.
But you have to understand that Elasticsearch is a search engine, so true real time is not possible, but near real time is.
I've got an RDS database with a table containing a ton of data in several columns (some with geospatial data) that I want to search across. SQL queries, even with good covering indexes on this data, are still far too slow to use for something like an AJAX type-ahead suggestion field.
As such, I'm investigating options for search and came across Amazon CloudSearch (now powered by Apache Solr), and it seems to fit my needs. The problem is, I can't seem to find a way via the AWS console to import or provide data from RDS. Am I missing something? Other solutions like Elasticsearch have plugins like rivers to connect and transform MySQL data.
I know there are command-line tools for uploading CSV and XML data into CloudSearch. So far the easiest thing I can find is to mysqldump the table into CSV or XML format and manually load it with the CLI tools. Is this, with some recurring cron job, the best way to get the data in?
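For reference, that export-and-upload loop can also be scripted; a rough sketch using today's boto3 SDK and PyMySQL might look like this (the document endpoint, table, query and field names are all placeholders):

```python
# Sketch: periodically export rows from MySQL (RDS) and push them to an
# Amazon CloudSearch domain as a JSON document batch.
import json

import boto3
import pymysql

# assumption: your CloudSearch domain's document endpoint
DOC_ENDPOINT = "https://doc-mydomain-xxxxxxxx.us-east-1.cloudsearch.amazonaws.com"
# assumption: your RDS connection details
DB = dict(host="my-rds-host", user="app", password="secret", database="mydb")


def build_batch(rows):
    """Convert DB rows into a CloudSearch 'add' document batch."""
    return [
        {
            "type": "add",
            "id": str(row["id"]),
            "fields": {"title": row["title"], "location": row["location"]},
        }
        for row in rows
    ]


def sync():
    conn = pymysql.connect(cursorclass=pymysql.cursors.DictCursor, **DB)
    with conn.cursor() as cur:
        cur.execute("SELECT id, title, location FROM places")  # placeholder query
        rows = cur.fetchall()
    conn.close()

    client = boto3.client("cloudsearchdomain", endpoint_url=DOC_ENDPOINT)
    # CloudSearch document batches are limited to 5 MB; chunking is omitted here.
    client.upload_documents(
        documents=json.dumps(build_batch(rows)).encode("utf-8"),
        contentType="application/json",
    )


if __name__ == "__main__":
    sync()   # e.g. run from a recurring cron job
```

Whether a cron job doing this beats exporting CSV and using the CLI tools mostly comes down to how often the data changes.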
As of 2014-06-17 this feature is not available on Amazon CloudSearch.
I think AWS Data Pipeline can help. It works like cron, and you can program recurring jobs easily using it.
Ran into the same thing; it is only possible to pull in data directly if you are using NoSQL and AWS's DynamoDB, not RDS.
Looking into Elasticsearch after finding this out.
I have an AppEngine application that currently has about 15GB of data, and it seems to me that it is impractical to use the current AppEngine bulk loader tools to back up datasets of this size. Therefore, I am starting to investigate other ways of backing up, and would be interested in hearing about practical solutions that people may have used for backing up their AppEngine Data.
As an aside, I am starting to think that the Google Cloud Storage might be a good choice. I am curious to know if anyone has experience using the Google Cloud Storage as a backup for their AppEngine data, and what their experience has been, and if there are any pointers or things that I should be aware of before going down this path.
No matter which solution I end up with, I would like a backup solution to meet the following requirements:
1) Reasonably fast to back up, and reasonably fast to restore (i.e. if a serious error/data deletion/malicious attack hits my website, I don't want to have to bring it down for multiple days while restoring the database - by fast I mean hours, as opposed to days).
2) A separate location and account from my AppEngine data - i.e. I don't want someone with admin access to my AppEngine data to necessarily have write/delete access to the backup data location - for example, if my AppEngine account is compromised by a hacker, or if a disgruntled employee were to decide to delete all my data, I would like to have backups that are separate from the AppEngine administrator accounts.
To summarize, given that getting the data out of the cloud seems slow/painful, what I would like is a cloud-based backup solution that emulates the role that tape backups would have served in the past - if I were to have a backup tape, nobody else could modify the contents of that tape - but since I can't get a tape, can I store a secure copy of my data somewhere that only I have access to?
Kind Regards
Alexander
There are a few options here, though none are (currently) quite what you're looking for.
With the latest release of version 1.5.5 of the SDK, we now support interfacing with Google Storage directly - you can see how here. With this you can write data to Google Storage, but to the best of my knowledge there's no way to write a file that the app will then be unable to delete.
To actually gather the data, you could use the App Engine mapreduce API. It has built in support for writing to the App Engine blobstore; writing to Google Storage would require you to implement your own output writer, currently.
Another option, as WoLpH suggests, is to use the Datastore Admin tool to back up data to another app. With a little extra effort you could modify the remote_api stub to prohibit deletes to the target (backup) app.
One thing you should definitely do regardless is to enable two-factor authentication for your Google account; this makes it a lot harder for anyone to get control of your account, even if they discover your password.
The bulkloader is probably one of the fastest ways to back up/restore your data.
The problem with App Engine is that you have to do everything through views, so you have the restrictions that views have... the result is that a fast backup/restore still has to use the same APIs as the rest of your app. So the bulkloader (possibly with a few modifications) is definitely your best option here.
Perhaps, though (I haven't tried it yet), you can use the new Datastore Admin to copy the data to another app - one which only you control. That way you can copy it back from the other app when needed.
I have a YouTube-style site, but it revolves around pictures.
On the homepage I want to show the latest pictures that have been uploaded, above the most popular pictures of all time.
Is it a good idea to do a database/cache query for every user when they hit the page, in order to check what the latest images are and display them, or should I do this another way to ensure the database isn't constantly flooded with requests for the latest posted pictures?
Maybe some sort of batch job?
Any ideas?
The most basic proactive thing you can do here is to cache the results of the DB query - either in your app code (less preferable) or in an existing piece of infrastructure that integrates into your web "stack", for example something like Memcached:
http://memcached.org/
This has helped many a DB-backed site achieve some minimal level of scalability/performance.
Depending on your DB, you can also cache such queries as part of DB functionality itself, but it's better if you can intercept such things before they even get to the DB.
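To make the idea concrete, here is a minimal cache-aside sketch in Python using pymemcache and a hypothetical PostgreSQL table; the key name, TTL, query and connection details are all placeholders:

```python
# Sketch: cache-aside pattern for the "latest pictures" query using Memcached.
# Key name, TTL, DSN and query are placeholders, not prescribed by memcached.
import json

import psycopg2
import psycopg2.extras
from pymemcache.client.base import Client

CACHE = Client(("localhost", 11211))
DSN = "dbname=pictures user=web host=localhost"   # assumption
CACHE_KEY = "homepage:latest_pictures"
TTL_SECONDS = 60                                  # refresh at most once a minute


def latest_pictures(limit=20):
    cached = CACHE.get(CACHE_KEY)
    if cached is not None:
        return json.loads(cached)                 # cache hit: no DB work at all

    # Cache miss: run the real query once, then store the result for TTL_SECONDS.
    conn = psycopg2.connect(DSN)
    with conn, conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
        cur.execute(
            "SELECT id, title, url FROM pictures ORDER BY uploaded_at DESC LIMIT %s",
            (limit,),
        )
        rows = [dict(r) for r in cur.fetchall()]
    conn.close()

    CACHE.set(CACHE_KEY, json.dumps(rows), expire=TTL_SECONDS)
    return rows
```

With a short TTL like this, the database sees at most one "latest pictures" query per minute, regardless of how many users hit the homepage.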