I made a mistake and accidentally started an export of a large Cloud SQL database (20 TB) to a SQL file in a Google Cloud Storage bucket. The export is currently running. I am afraid this bulky file will cost me a huge amount of money on Google Cloud, and I wish to stop the export, but I can't find any way to stop it. Please help.
This is the current status message:
Exporting data to gs://.../bak/Cloud_SQL_Export_2023-02-17
(15:55:15).sql. This may take a few minutes. While this operation is
running, you may continue to view information about the instance.
I searched all the online tutorials and tried to contact Google Cloud, but found that I hadn't purchased a support plan.
Related
For a small personal project, I've been scraping some data every 5 minutes and saving it in a SQL database. So far I've been using a tiny AWS EC2 instance in combination with 100GB of EBS storage. This has been working great for the scraping, but is becoming unusable for analysing the resulting data, as the EC2 instance doesn't have enough memory.
The data analysis only happens irregularly, so it would feel like a waste to pay 24/7 for a bigger EC2 instance. I'm looking for something more flexible. From reading around I've learned:
You can't connect EBS to two EC2 instances at the same time, so spinning up a second, temporary big instance whenever analysis is needed isn't an option.
AWS EFS seems like a solution, but it is quite a lot more expensive, and given my limited knowledge I'm not 100% sure it is the ideal one.
Serverless options like Amazon Athena look great, but they are based on S3, which is a no-go for data that needs continuous updating (?).
I assume this is quite a common use case for AWS, so I'm hoping to get some pointers in the right direction. Are there options I'm overlooking that fit my problem? Is EFS the right way to go?
Thanks!
The answers from previous users are great. Let's break them down into options. It sounds to me like your initial stack is a custom SQL database you installed on EC2.
Option 1 - RDS Read Replicas
Move your DB to RDS. This gives you a lot of goodies, but the main one we are looking for is read replicas: if your reads per second grow, you can create additional read replicas and put them behind a load balancer. This setup is the lowest-hanging fruit and needs very few code changes.
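As a rough sketch of what that looks like with boto3 (the instance identifiers and class below are made up, and AWS credentials/region are assumed to already be configured), creating a read replica of an existing RDS instance is a single API call:

    import boto3

    rds = boto3.client("rds")

    # Create a read replica of an existing (hypothetical) primary instance.
    response = rds.create_db_instance_read_replica(
        DBInstanceIdentifier="scraper-db-replica-1",    # name for the new replica (assumed)
        SourceDBInstanceIdentifier="scraper-db",        # existing primary instance (assumed)
        DBInstanceClass="db.t3.micro",                  # the replica can be sized independently
    )
    print(response["DBInstance"]["DBInstanceStatus"])

Point your analysis workload at the replica's endpoint while the scraper keeps writing to the primary.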
Option 2 - EFS to Share Data between EC2 Instances
Using EFS is not straightforward, through no fault of EFS. Some databases save unique IDs to the filesystem, meaning you can't share the underlying storage between instances. EFS is a network service and adds some latency to every read/write operation. Depending on which database distribution you installed and how, it might not even be possible.
Option 3 - Athena and S3
Having the workers save to S3 instead of SQL is also doable, but it means rewriting your web scraping tool. You can call S3 PutObject on the same key multiple times, and it will overwrite the previous object. You would then need to rewrite your analytics tool to query S3. This option is excellent, and it's likely the cheapest in operating cost, but it means you have to get acquainted with S3 and, more importantly, Athena. You would also need to figure out how you will save new data and the best file format for your application. You can start with regular JSON or CSV blobs and later move to Apache Parquet to lower cost. (For more on why that saves money, see here: https://aws.amazon.com/athena/pricing/)
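As a hedged sketch of the two halves of that rewrite (the bucket, key, Glue database, and table names are all hypothetical): the worker overwrites a key with PutObject, and the analytics side runs an Athena query against the same data in S3:

    import json
    import boto3

    s3 = boto3.client("s3")
    athena = boto3.client("athena")

    # Worker side: writing the same key again simply overwrites the old object.
    record = {"scraped_at": "2023-02-17T15:55:15Z", "value": 42}
    s3.put_object(
        Bucket="my-scraper-bucket",              # hypothetical bucket
        Key="latest/scrape.json",                # same key every run -> overwrite
        Body=json.dumps(record).encode("utf-8"),
    )

    # Analytics side: Athena queries the data where it sits in S3.
    athena.start_query_execution(
        QueryString="SELECT * FROM scrapes LIMIT 10",                     # hypothetical table
        QueryExecutionContext={"Database": "scraper_db"},                 # hypothetical Glue database
        ResultConfiguration={"OutputLocation": "s3://my-scraper-bucket/athena-results/"},
    )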
Option 4 - RedShift
Redshift is for big data. I would wait until querying regular SQL becomes a problem (multiple seconds per query) and only then start looking into it. Sure, it would let you query very cheaply, but you would probably have to set up a pipeline that listens to SQL (or is triggered by it) and then updates Redshift. The reason is that Redshift scales with your querying needs, and you can easily spin up multiple machines to make querying faster.
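If you ever get there, the load step of such a pipeline is typically a Redshift COPY from S3. A minimal sketch, assuming an existing cluster, table, and a suitable IAM role (all names and credentials below are made up):

    import psycopg2

    # Hypothetical connection details for an existing Redshift cluster.
    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="admin",
        password="...",
    )
    with conn, conn.cursor() as cur:
        # Bulk-load the latest batch exported from the SQL database into Redshift.
        cur.execute("""
            COPY scrapes
            FROM 's3://my-scraper-bucket/exports/latest/'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
            FORMAT AS CSV;
        """)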
As far as I can see, S3 and Athena are a good option for this. I'm not sure about your concern about NOT using S3; once you save the scraped data in S3, you can analyse it with Athena (a pay-per-query model).
Alternatively, you can use Redshift to store and analyse the data; it has an on-demand service with a pricing model similar to EC2 on-demand.
Also, you may use Kinesis Data Firehose, which can be used to analyse data in near real time as you ingest it.
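For the Firehose route, a minimal sketch (the delivery stream name is hypothetical and must already exist, configured to deliver into S3 or Redshift) would be to push each scraped record into the stream as it is produced:

    import json
    import boto3

    firehose = boto3.client("firehose")

    # Firehose batches the records and delivers them to the configured destination.
    record = {"scraped_at": "2023-02-17T16:00:00Z", "value": 42}
    firehose.put_record(
        DeliveryStreamName="scraper-stream",        # hypothetical, pre-created stream
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )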
Your scraping workers should store data in Amazon S3. That way, worker instances can be scaled (and even turned off) without having to worry about data storage. Keep process data (eg what has been scraped, where to scrape next) in a database such as DynamoDB.
When you need to query the data saved to Amazon S3, Amazon Athena is ideal if it is stored in a readable format (CSV, ORC, etc).
However, if you need to read unstructured data, your application can access the files in S3 directly, either by downloading and using them or by reading them as streams. For this type of processing, you could launch a large EC2 instance with plenty of resources, then turn it off when it is not being used. Better yet, launch it as a Spot instance to save money. (This means your system will need to cope with potentially being stopped mid-way.)
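A minimal sketch of that split between S3 (raw data) and DynamoDB (process state); the bucket and table names are hypothetical, and the DynamoDB table is assumed to have a string partition key named url:

    import json
    import boto3

    s3 = boto3.client("s3")
    dynamodb = boto3.resource("dynamodb")
    state = dynamodb.Table("scrape-state")            # hypothetical table, partition key "url"

    def save_result(url, payload):
        # Raw scraped data goes to S3, keyed by the page it came from.
        s3.put_object(
            Bucket="my-scraper-bucket",               # hypothetical bucket
            Key="raw/%s.json" % url.replace("/", "_"),
            Body=json.dumps(payload).encode("utf-8"),
        )
        # Process state (what was scraped, and when) goes to DynamoDB.
        state.put_item(Item={"url": url, "last_scraped": payload["scraped_at"]})

    save_result("https://example.com/page", {"scraped_at": "2023-02-17T16:05:00Z", "value": 42})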
Where can I find the total front-end instance hours that I have used in my previous days? It seems I can only see today's total.
AFAIK the actual instance-hours numbers are not directly available, at least not in the developer console.
What you might find helpful is the historical graph of instance usage for the last (up to) 30 days, which you'll find in the Dashboard after selecting the Instances display mode, the desired timescale, and the billed instance estimate graph:
Hover the mouse cursor over the graph and the actual billed instance estimate value will be displayed below the graph, with the corresponding date and time shown in the top right corner. Note that there is some noticeable delay in updating the values, though.
The graph detail level depends on the selected timescale, the coarsest being ~2h (on the 30-day timescale), so if you need more than a general idea of the usage trend you'd need to take care of the following (a rough sketch of the arithmetic follows the list):
averaging and/or integrating on a daily basis
accounting for the free daily quotas
accounting for instance class (I'm unsure if that's already taken into account or not)
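For example, if you read off a series of billed-instance values from the graph at a known interval, the daily instance-hours are roughly the sum of value × interval. A rough sketch of that arithmetic, with made-up samples; the 28 free frontend instance-hours per day and the class multiplier are assumptions on my part:

    # Hypothetical billed-instance-estimate samples read off the dashboard graph,
    # taken every 2 hours (the coarsest detail level on the 30-day timescale).
    samples = [1, 1, 2, 3, 3, 2, 2, 1, 1, 1, 2, 2]   # 12 samples x 2h = one day
    SAMPLE_INTERVAL_H = 2
    FREE_QUOTA_H = 28          # assumed free daily frontend instance-hours
    CLASS_MULTIPLIER = 1.0     # e.g. a larger instance class may count as a multiple of this

    raw_hours = sum(samples) * SAMPLE_INTERVAL_H * CLASS_MULTIPLIER
    billable_hours = max(0.0, raw_hours - FREE_QUOTA_H)
    print("estimated %s instance-hours, %s billable" % (raw_hours, billable_hours))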
You could also try the billing data export feature. From
Export billing data to a file:
You can export your daily usage and cost estimates automatically to a CSV or JSON file stored in a Google Cloud Storage bucket you specify. You can then access the data via the Cloud Storage API, CLI tool, or Google Cloud Platform console
...
Alternatively, you can export detailed data to a Google BigQuery dataset. For more information, see Export billing data to BigQuery.
Note: I haven't actually tried this export feature, so I can't tell if the instance-hours values are in there. There's also the still-open GAE issue 10716, which suggests GAE stats might not be included.
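If the BigQuery export does work for you, a sketch of summing daily cost per SKU from the export table would look roughly like this; the project, dataset, and table names, and the exact export schema, are assumptions on my part:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical billing-export table; the real name is set when you enable the export.
    query = """
        SELECT DATE(usage_start_time) AS day,
               sku.description AS sku,
               SUM(cost) AS total_cost
        FROM `my-project.billing.gcp_billing_export`
        WHERE service.description = 'App Engine'
        GROUP BY day, sku
        ORDER BY day
    """
    for row in client.query(query).result():
        print(row.day, row.sku, row.total_cost)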
It seems like something that should take place in a flash, at least locally. Is there some time emulation of the production server that causes it?
It depends on how much data is saved in your local datastore and how fast your disk is. I've noticed that if you have a lot of deleted data in your datastore, the file still ends up being big.
If you want to clear the database, it's better to delete the whole file than to delete all the individual entities.
Does anyone have any advice on making the logging in Google App Engine better? I am currently trying to use Splunk Storm, but they are finicky regarding input and go down often. Has anyone else encountered this and solved it in some capacity?
Currently I have a process that runs in a backend, reads from the LogService, and pipes the logs into Splunk Storm via its REST API. This often fails, or Storm goes down, or the backend IP changes.
My issue is with the logging provided within App Engine, as the logs disappear when new versions are pushed and querying the logs with the provided dashboard is almost unusable. Splunk was a potential solution, but the cloud solution leaves a lot to be desired.
Anything that would provide a better interface into my logs would be appreciated.
You can export logs from GAE to BigQuery, which has a quite capable query language. You can use Mache, an open-source project that already does this. You should write your own exporter to expose (and make queryable) the fields (columns) you are interested in.
Since you've decided to use Splunk (or another external service) as permanent storage, it sounds like you need a location to buffer logs between the times when they're written to App Engine's log service and when Splunk is available to accept the logs. To avoid losing logs before version churn causes them to fall out of App Engine, this buffer needs to be fast and highly available.
One reasonable choice is the AE datastore. There's no unreliable hop to a 3rd party, it has an availability SLA, and it can be scaled arbitrarily by sharding writes. The downside would be the cost of R/W operations and the storage footprint of in-flight logs, but you'll incur a comparable cost for another backing store.
Whatever service you choose, have one batch process (e.g. a backend or cron job) write to the buffer from the logs reader API. As long as it runs more often than you update the app, logs will always exist in durable storage. Then have another batch process wait for Splunk to be available, upload to it from the buffer, and delete entries as Splunk confirms receipt.
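A rough sketch of the first batch process, using the old Python logservice API with an ndb entity as the buffer; the entity model and the fixed time window are simplified assumptions, not a complete implementation:

    import time
    from google.appengine.api import logservice
    from google.appengine.ext import ndb

    class BufferedLog(ndb.Model):
        # Minimal buffer entity; add whatever fields Splunk needs.
        request_id = ndb.StringProperty()
        payload = ndb.TextProperty()
        end_time = ndb.FloatProperty()

    def buffer_recent_logs(window_seconds=600):
        # Pull request logs (with their app logs) for the last window and persist them.
        cutoff = time.time() - window_seconds
        entities = []
        for request_log in logservice.fetch(start_time=cutoff, include_app_logs=True):
            lines = [app_log.message for app_log in request_log.app_logs]
            entities.append(BufferedLog(
                request_id=request_log.request_id,
                payload="\n".join(lines),
                end_time=request_log.end_time,
            ))
        ndb.put_multi(entities)

The second process would query BufferedLog entities, POST them to Splunk, and delete them once Splunk acknowledges them.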
I have never used Amazon EC2 or the RDS service. I am trying to calculate my cost using http://calculator.s3.amazonaws.com/calc5.html
I searched a little but couldn't locate answers to some basic things. Can you help me out with these:
What does "DB instance" mean? Is 1 database = 1 instance, or 1 connection = 1 instance?
How do I calculate hours/month of usage? Does it depend on transfer rates or processing time? Is there a way I can get a rough idea of it?
What if I already have my DB ready and want to upload it directly (it would be a few GBs)? How will that be charged?
I am new to Amazon EC2 and searched Stack Overflow and Server Fault before posting this question. I got some idea, but nothing specific to what I am looking for. Can someone help me out here?
In general, one database = one instance. You spin up instances and do what you like with them. It is definitely possible to have many connections to one instance.
Hours per month is just that: how many hours per month you have the instance active. If you plan to have the instance active 24/7, you may find more cost-effective alternatives with other cloud providers. If you run it less often than that, you save money whenever it's not active. It's billed hourly to your account at the specified rate.
Uploaded data is counted at the standard transfer rates. A few GBs don't cost much, but you will be paying for the service from the moment you spin up the instance.
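As a back-of-the-envelope example (the rates below are placeholders, not current AWS prices; check the pricing pages for the real numbers), the monthly estimate is simply hours × hourly rate plus GB transferred × per-GB rate:

    # Placeholder rates -- look up the real ones on the AWS pricing pages.
    HOURLY_RATE_USD = 0.10           # assumed on-demand rate for the chosen DB instance class
    TRANSFER_RATE_USD_PER_GB = 0.09  # assumed data transfer rate

    hours_active = 24 * 30           # instance running 24/7 for a 30-day month = 720 hours
    upload_gb = 5                    # the "few GBs" initial database upload

    monthly_cost = hours_active * HOURLY_RATE_USD + upload_gb * TRANSFER_RATE_USD_PER_GB
    print("estimated monthly cost: $%.2f" % monthly_cost)   # $72.45 with these made-up numbers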