Web scraping vs Cloud Storage with AWS - database

My team has run into a design conflict. We are working on a project that involves scraping a year of historical data from Yahoo for all stocks so we can run some ML analysis on it. The process is unbearably slow, and we're not sure whether the bottleneck is the network or the web scraper. I proposed we use AWS RDS to store the data so we can access it more quickly. However, a team member said that storing the data in the cloud would not solve our latency issue. I rebutted that the data would be organized and stored in a way that makes accessing it significantly faster. He came back with something else and this went on. Is it true that a cloud DB won't offer any additional speed compared to a scraper? If so, does AWS have a service that lets us access the data we store faster through another service, almost as if the database were on our own server?
I am not all that familiar with cloud services, but I do understand databases pretty well. So please dumb down the AWS stuff if you wish, and feel free to point me to any duplicates or links that may help me understand this more.

There are lots of good reasons to use RDS as a database, but speeding up your scraping isn't one of them - the database likely isn't your bottleneck.
I have written lots of scrapers over the years, and by far the biggest performance boost comes from having a fast network connection between the scraper machine(s) and the host you are scraping; beyond that, using a multi-threaded scraper on each scraping machine will give you another HUGE speed improvement.
Most time spent scraping is waiting on the host to return the results to you, not parsing the page and not saving the data to a database.
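To make that concrete, here is a minimal sketch of the multi-threaded approach in Python using requests and a thread pool. The ticker list, URL, and parse function are placeholders, not Yahoo's actual endpoints:

    # Minimal sketch of a multi-threaded scraper: most of the wall-clock time is
    # spent waiting on the remote host, so overlapping many requests helps a lot.
    # The ticker list, URL, and parse_page() are placeholders for your own logic.
    from concurrent.futures import ThreadPoolExecutor

    import requests

    TICKERS = ["AAPL", "MSFT", "GOOG"]  # placeholder universe

    def fetch(ticker: str) -> str:
        # One HTTP round trip per ticker; the thread blocks here, not the CPU.
        url = f"https://example.com/history/{ticker}"  # hypothetical endpoint
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        return resp.text

    def parse_page(html: str) -> dict:
        # Placeholder: extract whatever fields you actually need from the page.
        return {"length": len(html)}

    with ThreadPoolExecutor(max_workers=20) as pool:
        for ticker, html in zip(TICKERS, pool.map(fetch, TICKERS)):
            print(ticker, parse_page(html))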

A MySQL DB on AWS RDS would be the same as the one that you'd install yourself on some machine. So, it isn't going to be different or slower just because it is in the cloud.
If you scrape some data and process it only once, then there is no point in introducing a DB in between. But if your scraper is slow and you process the scraped data multiple times, then storing it in a DB should improve latencies. That is because the latency of a DB read will be much lower than that of scraping (assuming you design your DB schema properly, your hosts are in the same Availability Zone, or at least the same Region, as your DB, etc.).
For example, if scraping a webpage takes ~10s and you process the scraped data twice, it'd take you ~20s if you don't have a DB. If you have a DB with latencies of ~500ms, you'd only take ~11s.
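Here's a rough illustration of that scrape-once, read-many pattern. I'm using sqlite3 just to keep the sketch self-contained; an RDS MySQL/Postgres table would play exactly the same role:

    # Sketch of "scrape once, read many times": later passes hit the database
    # (milliseconds) instead of re-scraping the page (seconds).
    import sqlite3

    import requests

    conn = sqlite3.connect("pages.db")
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)")

    def get_page(url: str) -> str:
        row = conn.execute("SELECT html FROM pages WHERE url = ?", (url,)).fetchone()
        if row:                                          # cheap DB read (~ms)
            return row[0]
        html = requests.get(url, timeout=30).text        # expensive scrape (~s)
        conn.execute("INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)", (url, html))
        conn.commit()
        return html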

Related

What is the smallest AWS EC2 instance I can run a postgres db on?

There is the free tier on AWS, and I can get a micro EC2 instance essentially for free, or close to it. I'm sure setting up Elastic IPs, load balancers, etc. is extra.
Would it be feasible for me to run a Postgres DB on it for a small API? Roughly 50 inserts + 50 reads per second, say about 6,000 operations per minute at most.
I can't seem to find anything online, which makes me think that this might be a silly idea.
For this not to be an "open question", it's simply: is it possible and realistic to expect usable performance on an EC2 instance running my Postgres DB?
The best way to determine whether the database can handle a particular workload is to test it at that capacity. Launch the database, simulate traffic and monitor its performance. Please note that every application uses a database differently, so nobody can provide "general advice" as to whether a particular-sized database would meet the needs of your particular application.
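For example, a throwaway load generator like the sketch below (Python with psycopg2; the host, credentials, and table are placeholders) can push roughly the 50 inserts + 50 reads per second you mention while you watch the instance's CPU and IOPS in CloudWatch:

    # Rough load generator: ~50 inserts + 50 reads per second against a test
    # Postgres instance. Connection details and the table are placeholders.
    import time

    import psycopg2

    conn = psycopg2.connect(host="your-ec2-host", dbname="test", user="test", password="test")
    conn.autocommit = True
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS load_test (id SERIAL PRIMARY KEY, payload TEXT)")

    start = time.time()
    ops = 0
    while time.time() - start < 60:                      # run for one minute
        cur.execute("INSERT INTO load_test (payload) VALUES (%s)", ("x" * 100,))
        cur.execute("SELECT payload FROM load_test ORDER BY id DESC LIMIT 1")
        cur.fetchone()
        ops += 2
        # Pace the loop to roughly 100 operations per second.
        time.sleep(max(0.0, ops / 100.0 - (time.time() - start)))

    print(f"sustained {ops / (time.time() - start):.0f} ops/sec")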
If you are going to run 'production' workloads, try to avoid using the Burstable performance instances (T2, T3) since they can hit limits under heavy workloads unless the 'Unlimited' option is selected. T2/T3 is great for bursty workloads, but not for sustained workloads.
Comparing m5.xlarge between EC2 and RDS:
Amazon EC2: 19.2c/hr ($4.61/day)
Amazon RDS: 35.6c/hr ($8.54/day)
For the additional price, Amazon RDS provides a fully-managed database, automated backups, CloudWatch metrics, etc. This is probably worth much more than $4 of your time every day.
Alternatively, if you can modify your application to use NoSQL instead of SQL, you could use Amazon DynamoDB where the capacity you mention would cost 4c/hour ($1/day) plus request and storage costs.
Don't underspend on your database — it powers everything you do. Instead, try to save money by turning off non-production systems when they aren't being used (eg weekends and evenings). That will hopefully give you enough savings to afford an appropriately-powered database.
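If you do go the DynamoDB route, your read/write pattern maps onto a couple of boto3 calls. A minimal sketch, assuming a table named api_items with partition key pk already exists:

    # Sketch of the same insert/read workload against DynamoDB via boto3.
    # Assumes a table named "api_items" with partition key "pk" already exists.
    import boto3

    table = boto3.resource("dynamodb").Table("api_items")

    # ~one write per request...
    table.put_item(Item={"pk": "item#123", "payload": "x" * 100})

    # ...and one read; capacity is provisioned (or on-demand) per table.
    resp = table.get_item(Key={"pk": "item#123"})
    print(resp.get("Item"))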

Continuously updated database shared between multiple AWS EC2 instances

For a small personal project, I've been scraping some data every 5 minutes and saving it in a SQL database. So far I've been using a tiny EC2 AWS instance in combination with a 100GB EBS storage. This has been working great for the scraping, but is becoming unusable for analysing the resulting data, as the EC2 instance doesn't have enough memory.
The data analysis only happens irregularly, so it would feel a waste to pay 24/7 to have a bigger EC2 instance, so I'm looking for something more flexible. From reading around I've learned:
You can't connect EBS to two EC2 instances at the same time, so spinning up a second temporary big instance whenever analysis needed isn't an option.
AWS EFS seems like a solution, but it is quite a lot more expensive and, considering my limited knowledge, I'm not 100% sure it's the ideal solution.
The serverless options like Amazon Athena look great, but they are based on S3, which seems like a no-go for data that needs continuous updating (?).
I assume this is quite a common usecase for AWS, so I'm hoping to try to get some pointers in the right direction. Are there options I'm overlooking that fit my problem? Is EFS the right way to go?
Thanks!
The answers by previous users are great. Let's break them down into options. It sounds like your initial stack is a custom SQL database you installed on EC2.
Option 1 - RDS Read Replicas
Move your DB to RDS. This would give you a lot of goodies, but the main one we are looking for is Read Replicas: if your reads per second grow, you can create additional read replicas and put them behind a load balancer. This setup is the lowest-hanging fruit and requires few code changes.
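In practice the "few code changes" usually amount to pointing reads at a different endpoint. A rough sketch (Python with pymysql; the endpoints and credentials are placeholders):

    # Sketch of routing with RDS read replicas: writes go to the primary
    # endpoint, reads go to a replica endpoint (or a DNS name balancing several
    # replicas). Endpoints/credentials are placeholders.
    import pymysql

    writer = pymysql.connect(host="mydb.primary.rds.amazonaws.com",
                             user="app", password="secret", database="scraper")
    reader = pymysql.connect(host="mydb.replica-1.rds.amazonaws.com",
                             user="app", password="secret", database="scraper")

    with writer.cursor() as cur:          # the scraper writes to the primary
        cur.execute("INSERT INTO pages (url, html) VALUES (%s, %s)", ("http://x", "<html>"))
    writer.commit()

    with reader.cursor() as cur:          # the analysis job reads from a replica
        cur.execute("SELECT COUNT(*) FROM pages")
        print(cur.fetchone())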
Option 2 - EFS to Share Data between EC2 Instances
Using EFS is not straightforward, through no fault of EFS. Some databases save unique IDs to the filesystem, meaning you can't share the hard drive. EFS is a network service and will add some lag to every read/write operation. Depending on which database distribution you installed, it might not even be possible.
Option 3 - Athena and S3
Having the workers save to S3 instead of SQL is also doable, but it means rewriting your web scraping tool. You can call S3 PutObject on the same key multiple times, and it will overwrite the previous object. Then you would need to rewrite your analytics tool to query S3. This option is excellent, and it's likely the cheapest in operating cost, but it means that you have to become acquainted with S3 and, more importantly, Athena. You would also need to figure out how you will save new data and the best file format for your application. You can start with regular JSON or CSV blobs and later move to Apache Parquet for lower cost. (For more info on why that saves money, see https://aws.amazon.com/athena/pricing/)
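The write path is genuinely small. A minimal sketch with boto3; the bucket and key names are placeholders:

    # Sketch of the S3 write path: PutObject on the same key simply overwrites
    # the previous object, so "update" is just another put.
    import json

    import boto3

    s3 = boto3.client("s3")
    record = {"ticker": "AAPL", "close": 123.45}

    s3.put_object(
        Bucket="my-scraper-bucket",
        Key="prices/AAPL.json",                 # same key each run -> overwrite
        Body=json.dumps(record).encode("utf-8"),
    )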
Option 4 - RedShift
Redshift is for big data. I would wait until querying regular SQL is a problem (multiple seconds per query), and only then start looking into it. Sure, it would let you query very cheaply, but you would probably have to set up a pipeline that listens to SQL (or is triggered by it) and then updates Redshift. That's because Redshift scales depending on your querying needs, and you can spin up multiple machines easily to make querying faster.
As far as I can see, S3 and Athena are a good option for this. I'm not sure about your concern about NOT using S3, but once you save the scraped data in S3 you can analyse it with Athena (pay-per-query model).
Alternatively, you can use Redshift to store and analyse the data; it has an on-demand option with a pricing model similar to EC2 on-demand.
Also, you could use Kinesis Data Firehose, which can be used to analyse data in near real time as you ingest it.
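The ingestion side of Firehose is a single API call per record. A minimal sketch with boto3, assuming a delivery stream named scraped-data-stream already exists with an S3 (or Redshift) destination configured:

    # Sketch of pushing scraped records into Kinesis Data Firehose, which then
    # delivers batches to S3 (or Redshift). The stream name is a placeholder.
    import json

    import boto3

    firehose = boto3.client("firehose")

    firehose.put_record(
        DeliveryStreamName="scraped-data-stream",
        Record={"Data": (json.dumps({"ticker": "AAPL", "close": 123.45}) + "\n").encode("utf-8")},
    )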
Your scraping workers should store data in Amazon S3. That way, worker instances can be scaled (and even turned off) without having to worry about data storage. Keep process data (eg what has been scraped, where to scrape next) in a database such as DynamoDB.
When you need to query the data saved to Amazon S3, Amazon Athena is ideal if it is stored in a readable format (CSV, ORC, etc).
However, if you need to read unstructured data, your application can access the files directly from S3, either by downloading and using them or by reading them as streams. For this type of processing, you could launch a large EC2 instance with plenty of resources, then turn it off when not being used. Better yet, launch it as a Spot instance to save money. (It means your system will need to cope with potentially being stopped mid-way.)
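For the streaming case, something like the sketch below works; the bucket and key are placeholders, and it relies on boto3's streaming response body:

    # Sketch of reading a large S3 object as a stream on the big (or Spot)
    # analysis instance, instead of loading the whole file into memory.
    import boto3

    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="my-scraper-bucket", Key="prices/2023.csv")

    row_count = 0
    for line in obj["Body"].iter_lines():   # StreamingBody: bytes, line by line
        row_count += 1

    print("rows:", row_count)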

Will my server cluster/DB solution work or is there a better alternative?

Brief
I'm working on a project where an app will communicate with a database via an API, however my experience with balancing server loads of this scale is limited.
I'm under NDA so I'll try to explain the setup as best as possible, please let me know if there's any details that you need to help understand!
The database will hold information that is likely to change on a 30-second basis.
The difficulty comes with scalability - we could have thousands of concurrent users, so making sure that the server stack can handle the load (and is scalable) is a priority.
I've simplified the lifespans of back- and front-end for explanation purposes:
Back-end
External source sends an XML file with latest data
CRON job runs every 30 seconds to see if files have been updated, then parses and inserts updated data into database A
CRON job runs every 30 seconds to (a) pull data from database A (b) use algorithms to calculate data based on the data from database A (c) input this new data into database B
Front-end
User runs and signs into app
App periodically makes calls to API to retrieve/push data to database B
Cluster
After preliminary discussions, this is the server cluster so far:
CDN
Load Balancer (to distribute requests to the most available Web Head)
Web Head* (to handle the API request)
Session Server (to handle app authentication only)
Redis* (to store cached queries and reduce load on Database)
Database (to store the data)
*This server will have as many clones as necessary
Illustrated as a lifespan of the request:
App Request
|
CDN
|
Load Balancer
|
Web head 1 (-- Web head 2 -- Web head 3 -- ...)
| \
| \
| \
| Session Server
|
Redis (-- Redis 2 -- Redis 3 -- ...)
|
Database
Questions
Is this a feasible and effective way to lay out my servers for scalability? Is there something missing in my steps, or do I have surplus?
For parsing the XML data into database A every 30 seconds, is PHP competent or should I use another language (Python for example)?
For the data that is read, modified and then inputted into database B, is PHP the best solution or should I be using another language for this too?
I've looked into multiple servers/NoDB solutions and database caching solutions (Elasticsearch, Redis, Memcached etc). What would be most efficient for this setup?
Again, if you require any more information please ask. If there is a better StackExchange site (I've had a look and couldn't find one) or a better forum in which I should post, let me know.
I'm writing this as an answer, rather than a comment - but really, your question is almost impossible to answer.
Your scalability depends on caching - which in turn depends on cache hit ratios. If every user in your front-end flow ("User runs and signs into app; App periodically makes calls to API to retrieve/push data to database B") runs the same query, your cache hit ratio will be amazing, and you can do what you want, because your CDN should be able to deal with it all. If they all run different queries, your CDN isn't doing you many favours (how often will the same user re-run the same query?), and you have a potential scalability nightmare.
PHP is perfectly fine at parsing XML - most interpreted languages delegate XML parsing to a native library under the hood, so you won't get much benefit from switching languages in most cases.
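For a sense of scale, the whole 30-second ingest job is only a handful of lines in any language. A sketch in Python with the standard-library parser (the feed structure, file name, and table are made up; a PHP SimpleXML version would look much the same):

    # Illustrative 30-second ingest job: parse the latest XML feed and upsert
    # into database A. The feed structure and table are made up.
    import sqlite3
    import xml.etree.ElementTree as ET

    conn = sqlite3.connect("database_a.db")
    conn.execute("CREATE TABLE IF NOT EXISTS readings (id TEXT PRIMARY KEY, value REAL)")

    tree = ET.parse("latest_feed.xml")            # file dropped by the external source
    for item in tree.getroot().findall("item"):   # hypothetical <item id=".."><value>..</value></item>
        conn.execute(
            "INSERT OR REPLACE INTO readings (id, value) VALUES (?, ?)",
            (item.get("id"), float(item.findtext("value"))),
        )
    conn.commit()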
Question 3: it depends on what the transformation is. If it's just format translation, PHP is fine; if you're doing complex mathematical work, you might benefit from a compiled language (like C/C++) - but the complexity increases fairly dramatically.
Question 4: it all depends on cache hit ratios. Your CDN should bear most of the burden if you design the application right, and your queries are indeed cacheable. If they're not cacheable, no architecture will help.
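If the queries are cacheable, the Redis layer in your diagram typically does cache-aside keyed on the query parameters, with a TTL matched to your 30-second refresh cycle. A rough sketch (connection details and the database call are placeholders):

    # Sketch of query-result caching in the Redis layer: key on the query
    # parameters, short TTL so 30-second-fresh data stays acceptable.
    import json

    import redis

    cache = redis.Redis(host="redis.internal", port=6379)

    def query_database_b(region: str) -> dict:
        # Placeholder for the real database B query.
        return {"region": region, "value": 42}

    def get_latest(region: str) -> dict:
        key = f"latest:{region}"
        hit = cache.get(key)
        if hit is not None:                        # served from cache, DB untouched
            return json.loads(hit)
        result = query_database_b(region)
        cache.setex(key, 30, json.dumps(result))   # expire with the 30s refresh cycle
        return result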

Central data management for custom desktop applications

I have a background in web programming where both the data and the code live on the server. Web hosts with MySQL or the like are plentiful and cheap, so using the application from multiple PCs was never a problem.
However, I'm considering switching to building desktop applications, but the one factor that annoys me is syncing data across the many PCs I use. I was thinking of perhaps setting up a light Amazon EC2 instance with PostgreSQL on it and having my desktop applications use that.
I have a few questions:
I'm curious as to what latency I might expect by running the database on EC2 instead of the local network; any experience or insight is appreciated.
Are there better/more obvious/cheaper solutions?
I've looked at the pricing and it seems to come down to $24.48 per month with a yearly contract. While not really expensive, it is not exactly cheap either. At what point does it become more interesting to run a local server?
I'm obviously not using my applications for large parts of the day (sleep, work,...). I was wondering if I can have the amazon server go into a sort of "sleep" mode and wake up when poked. An initial delay for the first desktop application is acceptable. The reason behind this behavior would be to save money on the instance if it is only actually needed for 10% of the day.
I welcome any feedback at all on how this problem is best tackled.
This could get ugly. Every single query you do will have latency associated with it. If you have a lot of queries, this can add up very fast. So keep your query count low, and try to pre-fetch and cache data when possible.
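To make the round-trip cost concrete: over a ~100 ms link, N single-row queries cost roughly N round trips, while one query returning the same rows costs one. A sketch with psycopg2 (host, credentials, and table are placeholders):

    # Round trips dominate over a WAN: fetch many rows in one query rather than
    # one query per item. Connection details and the table are placeholders.
    import psycopg2

    conn = psycopg2.connect(host="ec2-host.example.com", dbname="app", user="app", password="secret")
    cur = conn.cursor()

    item_ids = [1, 2, 3, 4, 5]

    # Slow over a ~100 ms link: ~5 round trips.
    for item_id in item_ids:
        cur.execute("SELECT name FROM items WHERE id = %s", (item_id,))
        cur.fetchone()

    # Fast: one round trip for all five rows.
    cur.execute("SELECT id, name FROM items WHERE id = ANY(%s)", (item_ids,))
    rows = cur.fetchall()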
Not enough information to answer that question.
Depends on the cost of your local server. Keep in mind that you will need to pay for electricity to keep it on.
You can stop your instance when you are not using it; with the exception of high-utilization reservations, you won't get billed for the instance while it is in the stopped state (though you still pay for its EBS storage). With high-utilization reservations you will still pay the full cost.
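The "poke" can be as simple as the first desktop app calling StartInstances and waiting for the instance to come up. A minimal boto3 sketch (instance ID and region are placeholders):

    # Sketch of "wake on poke": the desktop app starts the instance if it is
    # stopped, waits for it to come up, then connects to the database.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    INSTANCE_ID = "i-0123456789abcdef0"   # placeholder

    ec2.start_instances(InstanceIds=[INSTANCE_ID])
    ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])
    # ...use the database, then later, when the last app disconnects:
    ec2.stop_instances(InstanceIds=[INSTANCE_ID])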

Scaling out SQL Server for the web (Single Writer Multiple Readers)

Has anyone had any experience scaling out SQL Server in a multi-reader, single-writer fashion? If not, can anyone suggest a suitable alternative that they have experience with for a read-intensive web application?
It probably depends on 2 things:
How big is each single write?
Do readers need real time data?
A write will block readers when writing, but if each write is small and fast then readers won't notice.
If you offload, say, end-of-day reporting, then you batch your load onto a separate server because those readers do not require real-time data. This makes sense.
A write on your primary server must be synced to your offload secondary server... which will block readers there as part of the sync process anyway, plus you add overhead to manage the sync.
Most apps are 95%+ read anyway all the time. For example, an update or delete is a read followed by a write.
My choice would probably be (given the low write volume and the fact that it's a web app) to scale up and stuff as much RAM as I could into the DB server, with separate disk paths for the data and log files of the database.
I don't have any experience with scaling out SQL Server for your scenario.
However, for a read-intensive application, I would look at reducing the load on the database and employing a cache strategy using something like Memcached or MS Velocity.
There are two approaches that I'm aware of:
Have the entire database loaded into the Cache and manage Adding and Updating of items in the cache.
Add items to the cache only when they are requested and remove them when a write operation is performed.
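A rough sketch of the second approach: cache on read, invalidate on write. Redis is used here purely for illustration (the same pattern applies with Memcached or Velocity), and the database calls are placeholders:

    # Cache-aside with write invalidation: fill the cache on demand, delete the
    # cached copy whenever the single writer updates the row.
    import json

    import redis

    cache = redis.Redis(host="cache.internal", port=6379)

    def load_product(product_id):
        # Placeholder for the real SQL Server read.
        return {"id": product_id, "name": "widget"}

    def save_product(product):
        # Placeholder for the real SQL Server write.
        pass

    def get_product(product_id):
        key = f"product:{product_id}"
        hit = cache.get(key)
        if hit is not None:                 # cache filled only when requested
            return json.loads(hit)
        product = load_product(product_id)
        cache.set(key, json.dumps(product))
        return product

    def update_product(product):
        save_product(product)                        # write goes to the single writer
        cache.delete(f"product:{product['id']}")     # ...and invalidates the cached copy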
Some kind of replication would do the trick.
http://msdn.microsoft.com/en-us/library/ms151827.aspx
You of course need to change your app code.
Some people use partitioned tables, with different row ranges stored on different servers, united with views. This would be invisible to the app. I think the term for this practice is federation.
By designing your database, application and server configuration (SQL particulars - location of data/log/system/sql binaries/tempdb), you should be able to handle a pretty good load. Try not to complicate things if you don't have to.
