How to access terabytes of data sitting in cloud quickly? - database

We have terabytes of data sitting on Google Cloud disks. Initially we did our development work on Google Cloud VMs, so we could access the data directly.
Now we have bought our own servers, where our application runs, and we are bringing the data to local disks for the application to read. The problem is that transferring terabytes over the network with scp is quite slow. Can anyone suggest a way to fix this?
What I am thinking of: could we keep a script running on the Google Cloud instance that waits for requests and sends the requested data over HTTP, so that local_server can request the data it needs, one piece at a time?
I know this still happens over the network, but I think this approach could scale, though I could be wrong. It is the same 1:1 client-server layout used between a frontend and a backend. Any suggestions?
Would that be slow? Slower than bringing the data over with scp?

You could download the full VM disk and mount it on your servers, or download the disk, copy the data off it, and then delete the VM disk. In either case, follow these steps:
Create a snapshot of your VM which will have all the data.
Build and export the VM image to your servers.
Run the image on your servers according to GCE requirements.
It would take a lot less time, since you are then working on premises and avoiding repeated network traffic.
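Whichever route is chosen, a rough estimate of the one-off transfer time helps when comparing options. The following is only a back-of-envelope sketch: the function name, the link speeds, and the 0.8 efficiency factor are illustrative assumptions, not measurements.

```python
def transfer_hours(size_tb: float, link_mbit_s: float, efficiency: float = 0.8) -> float:
    """Estimate the hours needed to move size_tb terabytes over a network link.

    efficiency is an assumed fraction of the nominal bandwidth that is
    actually achieved (protocol overhead, contention, and so on).
    """
    size_bits = size_tb * 1e12 * 8                      # decimal terabytes -> bits
    effective_bits_per_s = link_mbit_s * 1e6 * efficiency
    return size_bits / effective_bits_per_s / 3600


if __name__ == "__main__":
    # Illustrative link speeds in Mbit/s; substitute your own numbers.
    for link in (100, 1000, 10000):
        print(f"5 TB over {link} Mbit/s: {transfer_hours(5, link):.1f} h")
```

If the estimate for your link is already measured in days, changing the protocol alone is unlikely to help much; the bandwidth is the limit.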

Related

Continuously updated database shared between multiple AWS EC2 instances

For a small personal project, I've been scraping some data every 5 minutes and saving it in a SQL database. So far I've been using a tiny EC2 AWS instance in combination with a 100GB EBS storage. This has been working great for the scraping, but is becoming unusable for analysing the resulting data, as the EC2 instance doesn't have enough memory.
The data analysis only happens irregularly, so it would feel like a waste to pay 24/7 for a bigger EC2 instance. I'm looking for something more flexible. From reading around I've learned:
You can't connect EBS to two EC2 instances at the same time, so spinning up a second, temporary big instance whenever analysis is needed isn't an option.
AWS EFS seems like a solution, but it is quite a lot more expensive and, considering my limited knowledge, I'm not 100% sure it is the ideal one.
The serverless options like Amazon Athena look great, but this is based on S3 which is a no-go for data that needs continuous updating (?).
I assume this is quite a common use case for AWS, so I'm hoping to get some pointers in the right direction. Are there options I'm overlooking that fit my problem? Is EFS the right way to go?
Thanks!
Answers by previous users are great. Let's break them down into options. It sounds to me like your initial stack is a custom SQL database that you installed on EC2.
Option 1 - RDS Read Replicas
Move your DB to RDS. This gives you a lot of goodies, but the main one we are after is Read Replicas: if your read load grows, you can create additional read replicas and put them behind a load balancer. This setup is the lowest-hanging fruit and needs few code changes.
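If you would rather split traffic in the application than behind a load balancer, the idea can be sketched roughly as below. This is only an illustration: the class name and the endpoint strings are hypothetical, and the "starts with SELECT" check is a deliberately naive way to tell reads from writes.

```python
import itertools


class ReadWriteRouter:
    """Send reads to replicas in round-robin order and writes to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary when no replicas are configured.
        self._replicas = itertools.cycle(replicas or [primary])

    def endpoint_for(self, sql):
        # Naive classification: anything that is not a SELECT counts as a write.
        is_read = sql.lstrip().lower().startswith("select")
        return next(self._replicas) if is_read else self.primary


if __name__ == "__main__":
    # Hypothetical endpoint names, for illustration only.
    router = ReadWriteRouter("primary-db", ["replica-1", "replica-2"])
    for statement in ("SELECT * FROM prices", "INSERT INTO prices VALUES (1)",
                      "SELECT count(*) FROM prices"):
        print(router.endpoint_for(statement), "<-", statement)
```

Keep in mind that replicas lag slightly behind the primary, so a read issued right after a write may not see it yet.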
Option 2 - EFS to Share Data between EC2 Instances
Using EFS is not straightforward, through no fault of EFS. Some databases save unique IDs to the filesystem, meaning you can't share the disk between instances. EFS is a network service and adds some latency to every read/write operation. Depending on how your database distribution is installed, sharing it this way might not even be possible.
Option 3 - Athena and S3
Having the workers save to S3 instead of SQL is also doable, but it means rewriting your web scraping tool. You can call S3 -> PutObject on the same key multiple times, and it will overwrite the previous object. Then you would need to rewrite your analytics tool to query S3. This option is excellent, and it's likely the cheapest in operating cost, but it means that you have to be acquainted with S3 and, more importantly, Athena. You would also need to figure out how you will save new data and the best file format for your application. You can start with regular JSON or CSV blobs and later move to Apache Parquet for lower cost. (For more on why that saves money, see https://aws.amazon.com/athena/pricing/)
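One way to approach the "how will you save new data" question is to write each scrape to its own time-partitioned key rather than overwriting a single object. The sketch below only builds the key string; the function name, the "scrapes" prefix, and the year=/month=/day= layout are assumptions chosen for illustration (Athena can work with Hive-style partition folders like these).

```python
from datetime import datetime, timezone


def scrape_key(ts: datetime, prefix: str = "scrapes") -> str:
    """Build an object key for one scrape, partitioned by UTC date.

    ts should be a timezone-aware datetime; it is normalised to UTC so that
    keys sort chronologically regardless of where the scraper runs.
    """
    ts = ts.astimezone(timezone.utc)
    return (f"{prefix}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/"
            f"{ts:%H%M%S}.json")


if __name__ == "__main__":
    print(scrape_key(datetime.now(timezone.utc)))
```

With one object per 5-minute scrape, nothing is ever rewritten, and a query that filters on the date partitions only has to scan the matching folders.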
Option 4 - RedShift
RedShift is for big data. I would wait until querying regular SQL becomes a problem (multiple seconds per query) and only then start looking into it. Sure, it would let you query very cheaply, but you would probably have to set up a pipeline that listens to SQL (or is triggered by it) and then updates RedShift. The reason is that RedShift scales with your querying needs, and you can easily spin up multiple machines to make querying faster.
As far as I can see, S3 and Athena are a good option for this. I am not sure why you are reluctant to use S3, but once you save the scraped data in S3, you can analyse it with Athena (pay-per-query model).
Alternatively, you can use RedShift to store and analyse the data; it has an on-demand pricing model similar to EC2 on-demand.
Also, you may use Kinesis Firehose, which can be used to analyse data in real time as you ingest it.
Your scraping workers should store data in Amazon S3. That way, worker instances can be scaled (and even turned off) without having to worry about data storage. Keep process data (eg what has been scraped, where to scrape next) in a database such as DynamoDB.
When you need to query the data saved to Amazon S3, Amazon Athena is ideal if it is stored in a readable format (CSV, ORC, etc).
However, if you need to read unstructured data, your application can access the files directly from S3, either by downloading them or by reading them as streams. For this type of processing, you could launch a large EC2 instance with plenty of resources, then turn it off when it is not being used. Better yet, launch it as a Spot instance to save money. (This means your system will need to cope with potentially being stopped mid-way.)
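Coping with being stopped mid-way usually comes down to recording progress somewhere durable and resuming from it. The following is a minimal sketch under stated assumptions: the function name and the JSON checkpoint format are made up for illustration, and the checkpoint is a local file here, whereas on a Spot instance you would want it somewhere that outlives the instance.

```python
import json
import os


def process_with_checkpoint(items, handle, checkpoint_path):
    """Process items in order, recording progress so a restart can resume.

    Returns the number of items handled in this run.
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]

    for i in range(start, len(items)):
        handle(items[i])
        # Write to a temporary file and rename it, so an interruption
        # never leaves a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_index": i + 1}, f)
        os.replace(tmp_path, checkpoint_path)

    return len(items) - start
```

Note that an item interrupted halfway through handle() is processed again on restart, so handle() should be safe to repeat.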

Google Cloud Bigtable Python Client Performance Issue

I'm running into a performance issue with Google Cloud Bigtable Python Client. I'm working on a flask API that writes to and reads from a GCP Bigtable instance. The API uses the python client to communicate with Bigtable, and was deployed to GCP App Engine flexible environment.
Under low traffic, the API works fine. However, during a load test, the endpoints that read from and write to Bigtable suffer a huge performance decrease compared to a similar endpoint that doesn't communicate with Bigtable. Also, a large percentage of the requests sent to those endpoints receive a 502 Bad Gateway, even when the health check is turned off in App Engine.
I'm aware that the client is currently in Alpha. I wonder whether this performance issue is known, or whether anyone else has run into the same problem.
Update
I found documentation from Google stating:
There are issues with the network connection. Network issues can
reduce throughput and cause reads and writes to take longer than
usual. In particular, you'll see issues if your clients are not
running in the same zone as your Cloud Bigtable cluster.
In my case the client was in a different region; moving it to the same region gave a huge increase in performance. However, the performance issue still exists, and the documentation recommends putting the client in the same zone as Bigtable.
I also considered using Container Engine or Compute Engine, where it is easier to specify the zone, but I want to stay with App Engine for its autoscaling and managed services.
The Bigtable client takes somewhere between 3 ms and 20 ms to complete each request, and because our Python worker is single-threaded, it simply waits during that time until the response comes back. The best solution we found was, for any write, to publish the request to Pub/Sub and then use Dataflow to write to Bigtable. This is significantly faster because publishing a message from Python takes well under 1 ms, and because Dataflow can run in exactly the same region as Bigtable and is easy to parallelize, it can write much faster.
This does not, however, solve the case where you need frequent reads, or where writes need to be instantaneous.
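The idea behind the Pub/Sub approach is that the request handler only enqueues the write and returns, while something else performs the slow write. The sketch below is an in-process analogue using only the standard library, not the Pub/Sub or Dataflow API: the class name is hypothetical, and write_fn stands in for whatever actually writes to Bigtable.

```python
import queue
import threading


class BufferedWriter:
    """Accept writes immediately and apply them on a background thread."""

    def __init__(self, write_fn):
        self._write_fn = write_fn          # stand-in for the slow storage write
        self._queue = queue.Queue()
        self.errors = []                   # failed records, kept for inspection
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def publish(self, record):
        # Returns at once; the caller never waits on the slow write.
        self._queue.put(record)

    def _drain(self):
        while True:
            record = self._queue.get()
            try:
                self._write_fn(record)
            except Exception as exc:       # a real system would retry or dead-letter
                self.errors.append((record, exc))
            finally:
                self._queue.task_done()

    def flush(self):
        """Block until every published record has been handed to write_fn."""
        self._queue.join()
```

Unlike Pub/Sub, an in-memory queue loses its contents if the process dies, which is one reason to prefer a durable message service for this role.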

Web scraping vs Cloud Storage with AWS

My team has run into a design conflict. We are working on a project that involves scraping the last year of historical data for all stocks from Yahoo so we can run some ML analysis on it. The scraping is unbearably slow, and we are not sure whether the cause is the network or the web scraper. I proposed that we use AWS RDS to store the data so we can access it more quickly. However, a team member said that storing the data in the cloud would not solve our latency issue. I countered that the data would be organized and stored in a way that makes it significantly faster to access. He came back with something else, and this went on. Is it true that a cloud DB won't offer any additional speed compared to a scraper? If so, does AWS have a service that lets us access the stored data faster through another service, almost as if the database were on our own server?
I am not all that familiar with cloud services, but I do understand databases pretty well. So please dumb down the AWS stuff if you wish, and feel free to point me to any duplicates or links that may help me understand this better.
Lots of good reasons to use RDS as a database, but speeding up your scraping isn't one of them - it likely isn't your bottleneck.
I have written lots of scrapers over the years, and by far the biggest performance boost will be to have a fast network connection between the scraper machine(s) and the host you are scraping, and even then, using a multi-threaded scraper for each scraping machine will give you another HUGE speed improvement.
Most of the time spent scraping goes to waiting on the host to return results to you, not to parsing the page or saving the data to a database.
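Because the time goes to waiting on the remote host, running several requests at once is what pays off. Here is a minimal sketch of a multi-threaded scraper, assuming a fetch function of your own: the fetch below is a stand-in that just sleeps instead of making an HTTP request, and the URLs and worker count are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fetch(url):
    """Stand-in for an HTTP request: sleeps to mimic waiting on the remote host."""
    time.sleep(0.05)
    return f"content of {url}"


def scrape_all(urls, max_workers=8):
    """Fetch all URLs concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


if __name__ == "__main__":
    urls = [f"https://example.com/stock/{i}" for i in range(16)]
    started = time.perf_counter()
    pages = scrape_all(urls)
    print(f"fetched {len(pages)} pages in {time.perf_counter() - started:.2f}s")
```

Threads work well here because each one spends nearly all its time blocked on I/O; be careful not to set max_workers so high that the target host throttles or blocks you.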
A MySQL DB on AWS RDS would be the same as the one that you'd install yourself on some machine. So, it isn't going to be different or slower just because it is in the cloud.
If you scrape some data and process it only once, then there is no point in introducing a DB in between. But if your scraper is slow and you process the scraped data multiple times, then storing it in a DB should improve latencies. That is because the latency of a DB read will be much lower than that of scraping (assuming you design your DB schema properly, your hosts are in the same availability zone, or at least region, as your DB, etc.).
For example, if scraping a webpage takes ~10s and you process the scraped data twice, it would take ~20s without a DB. With a DB whose read latency is ~500ms, you would scrape once (~10s) and read twice (~1s), for ~11s in total.
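The arithmetic above can be written out as a small function. This is a simplified model: the function name is made up, and it ignores the one-time cost of writing the scraped data into the DB.

```python
def total_seconds(scrape_s, passes, db_read_s=None):
    """Time to process one page `passes` times.

    Without a DB (db_read_s is None) the page is re-scraped for every pass.
    With a DB it is scraped once and then read from the DB for every pass.
    """
    if db_read_s is None:
        return scrape_s * passes
    return scrape_s + db_read_s * passes


if __name__ == "__main__":
    print(total_seconds(10, 2))        # no DB: 20
    print(total_seconds(10, 2, 0.5))   # with DB: 11.0
```

The gap widens with every additional pass: each one costs another full scrape without the DB, but only one more DB read with it.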

Fastest Open Source Content Management System for Cloud/Cluster deployment

Currently clouds are mushrooming like crazy and people start to deploy everything to the cloud including CMS systems, but so far I have not seen people that have succeeded in deploying popular CMS systems to a load balanced cluster in the cloud. Some performance hurdles seem to prevent standard open-source CMS systems to be deployed to the cloud like this.
CLOUD: A cloud, better load-balanced cluster, has at least one frontend-server, one network-connected(!) database-server and one cloud-storage server. This fits well to Amazon Beanstalk and Google Appengine. (This specifically excludes CMS on a single computer or Linux server with MySQL on the same "CPU".)
To deploy a standard CMS in such a load balanced cluster needs a cloud-ready CMS with the following characteristics:
The CMS must deal with the latency of queries to still be responsive and render pages in less than a second to be cached (or use a precaching strategy)
The filesystem probably must be connected to a remote storage (Amazon S3, Google cloudstorage, etc.)
Currently I know of python/django and Wordpress having middleware modules or plugins that can connect to cloud storages instead of a filesystem, but there might be other cloud-ready CMS implementations (Java, PHP, ?) and systems.
I myself have failed to deploy django-CMS to the cloud, ultimately because of the query latency of the remote DB. So here is my question:
Did you deploy an open-source CMS that still performs well in rendering pages and backend admin? Please post your average page rendering access stats in microseconds for uncached pages.
IMPORTANT: Please describe your configuration, the problems you have encountered, which modules had to be optimized in the CMS to make it work, don't post simple "this works", contribute your experience and knowledge.
Such a CMS probably has to make fewer than 10 queries per page; if it needs more, the queries must be made in parallel. It must also cope with filesystem access times of 100ms for a stat and query delays of 40ms.
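To see where the "fewer than 10 queries" figure comes from, here is a rough latency budget. It is a simplified model under stated assumptions: the function name is hypothetical, every query is assumed to cost one full roundtrip, and queries run in equal-sized parallel batches.

```python
import math


def page_db_ms(queries, roundtrip_ms=40, parallelism=1):
    """Estimate DB wait per page when queries run in batches of `parallelism`."""
    batches = math.ceil(queries / parallelism)
    return batches * roundtrip_ms


if __name__ == "__main__":
    print(page_db_ms(10))                  # 10 sequential queries
    print(page_db_ms(100))                 # 100 sequential queries
    print(page_db_ms(100, parallelism=10)) # 100 queries, 10 at a time
```

At 40ms per roundtrip, 10 sequential queries already cost 400ms, which leaves little of a one-second budget for rendering and storage access.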
Related:
Slow MySQL Remote Connection
Have you tried Umbraco?
It relies on a database, but it keeps layers of cache so you aren't doing selects on every request.
http://umbraco.com/azure
It works great on azure too!
I have found an excellent performance test of Wordpress on Appengine. It appears that Google has spent some time to optimize this system for load-balanced cluster and remote DB deployment:
http://www.syseleven.de/blog/4118/google-app-engine-php/
Scaling test from the report:
parallel hits   GAE    1&1    Sys11
1               1,5    2,6    8,5
10              9,8    8,5    69,4
100             14,9   -      146,1
Conclusion from the report: the system is slower than on traditional hosting but scales much better.
http://developers.google.com/appengine/articles/wordpress
We have managed to deploy python django-CMS (www.django-cms.org) on GoogleAppEngine with CloudSQL as DB and CloudStore as Filesystem. Cloud store was attached by forking and fixing a django.storage module by Christos Kopanos http://github.com/locandy/django-google-cloud-storage
After that, the second set of problems came up as we discovered we had access times of up to 17s for a single page access. We have investigated this and found that easy-thumbnails 1.4 accessed the normal file system for mod_time requests while writing results to the store (rendering all thumb images on every request). We switched to the development version where that was already fixed.
Then we worked with SmileyChris to fix unnecessary access of mod_times (stat the file) on every request for every image by tracing and posting issues to http://github.com/SmileyChris/easy-thumbnails
This reduced access times from 12-17s to 4-6s per public page on the CMS basically eliminating all storage/"file"-system access. Once that was fixed, easy-thumbnails replaced (per design) file-system accesses with queries to the DB to check on every request if a thumbnail's source image has changed.
One thing for the web designer: using an image.width statement in the template forces an ugly, slow read on the "filesystem", because image widths are not cached.
Further investigation led to the conclusion that DB accesses are very costly too, taking about 40ms per roundtrip.
Up to now the deployment is unsuccessful mostly due to DB access times in the cloud leading to 4-5s delays on rendering a page before caching it.

Will using a Cloud PaaS automatically solve scalability issues?

I'm currently looking for a Cloud PaaS that will allow me to scale an application to handle anything between 1 user and 10 million+ users. I've never worked on anything this big, and the big question I can't seem to get a clear answer to is this: if you develop, let's say, a standard application with a relational database and SOAP web services, will it scale automatically when deployed on a PaaS, or do you still need to build the application with failover, redundancy and all those things in mind?
Let's say I deploy a Spring Hibernate application to Amazon EC2 and create a single Ubuntu Server instance with Tomcat installed. Will this application just scale indefinitely, or do I need more Ubuntu instances? If more than one Ubuntu instance is needed, does Amazon take care of running the application across the instances, or is this the developer's responsibility? What about database storage: can I install a database on EC2 that will scale as the database grows, or do I need to use one of their APIs instead if I want it to scale indefinitely?
CloudFoundry allows you to build locally and deploy straight to their PaaS, but since it's in beta, there's a limit on the amount of resources you can use, and databases are limited to 128MB if I remember correctly, so this is a no-go for now. Some have suggested installing CloudFoundry on Amazon EC2. How does it scale, and how is the database layer handled then?
GAE (Google App Engine): will this allow me to just deploy an app and not have to worry about how it scales and implements redundancy? There appear to be some limitations on what you can and can't run on GAE, and their recent price increase upset quite a large number of developers. Is it really that expensive compared to other providers?
So basically, will it scale and what needs to be done to make it scale?
That's a lot of questions for one post. Anyway:
Amazon EC2 does not scale automatically with load. EC2 is basically just a virtual machine. You can achieve scaling of EC2 instances with Auto Scaling and Elastic Load Balancing.
SQL databases scale poorly. That's why people started using NoSQL databases in the first place. It's best to see which database your cloud provider offers as a managed service: Datastore on GAE and DynamoDB on Amazon.
Installing your own database on EC2 instance storage is very impractical, as that storage is ephemeral (it loses all data on "disk" when the instance is stopped or terminated); you would need EBS volumes to keep the data.
GAE Datastore is actually one big database for all applications running on it. So it's pretty scalable, and your millions of users should not be a problem for it.
http://highscalability.com/blog/2011/1/11/google-megastore-3-billion-writes-and-20-billion-read-transa.html
Yes, App Engine scales automatically, both the frontend instances and the database. There is nothing special you need to do to make it scale; just use their API.
There are limitations on what you can do with App Engine:
A. No local storage (filesystem) - you need to use Datastore or Blobstore.
B. Comet is only supported via their proprietary Channels API
C. Datastore is a NoSQL database: no JOINs, limited queries, limited transactions.
The cost of GAE is not bad. We do 1M requests a day for about 5 dollars a day. The biggest saving comes from the fact that you do not need a system admin on GAE (but you do need one for EC2). Compared to the cost of manpower, GAE is incredibly cheap.
Some hints to save money on (and speed up) GAE:
A. Use get instead of query in Datastore (requires carefully crafting natural keys).
B. Use memcache to cache data you got from Datastore. This can be done automatically with Objectify and its @Cached annotation.
C. Denormalize data. Meaning you write data redundantly in various places in order to get to it in as few operations as possible.
D. If you have a lot of REST requests from devices, where you do not use cookies, then switch off session support (or roll your own, as we did). Sessions use Datastore under the hood, and every request does a get and a put.
E. Read about adjusting app settings. Try different settings (depending how tolerant your app is to request delay and your traffic patterns/spikes). We were able to cut down frontend instances by 70%.
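Hint B is the usual cache-aside pattern: check the cache first and only go to the datastore on a miss. The sketch below shows the pattern in plain Python; it does not use the memcache or Objectify APIs. The class name is hypothetical, a dict stands in for memcache, and backend_get stands in for a Datastore get by key.

```python
class CachedStore:
    """Cache-aside reads: serve from the cache, fall back to the backend on a miss."""

    def __init__(self, backend_get):
        self._backend_get = backend_get   # stand-in for a Datastore get by key
        self._cache = {}                  # stand-in for memcache
        self.backend_calls = 0            # counts the (billable) backend reads

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        self.backend_calls += 1
        value = self._backend_get(key)
        self._cache[key] = value
        return value

    def invalidate(self, key):
        """Call after writing a key so the next read fetches fresh data."""
        self._cache.pop(key, None)


if __name__ == "__main__":
    data = {"user:1": "Ann"}
    store = CachedStore(data.__getitem__)
    store.get("user:1")
    store.get("user:1")
    print("backend reads:", store.backend_calls)
```

Repeated reads of the same key then cost a single backend call, which is where both the money and the latency savings come from.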
