Running C simulations on EC2

I've been running parallel, independent simulations on an SGE cluster and would like to transition to using EC2. I've been looking at the documentation for StarCluster but, due to my inexperience, I'm still missing a few things.
1) My code is written in C and uses GSL -- do I need to install GSL on the virtual machines and compile there, or can I precompile the code? Are there any tutorials that cover this exact usage of EC2?
2) I need to run maybe 10,000 CPU hours of code, but I could easily set this up as many short instances or fewer, longer jobs. Given these requirements, is EC2 really the best choice? If so, is StarCluster the best interface for my needs?
Thanks very much.

You can create an AMI (basically an image of your virtual machine) with all your dependencies installed. Then all you need to do is configure job-specific parameters at launch.
You can run as many instances as you want on EC2. You may be able to take advantage of spot instances to save money (all you need to be able to tolerate is the instance being shut down if the price exceeds your bid).
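If you go the spot route, the request itself can be scripted. Below is a minimal sketch using the boto3 SDK; the AMI ID, key pair name, instance type, and bid price are placeholders, not values from this thread.

# Minimal sketch: request spot instances with boto3.
# The AMI ID, key name, instance type, and price below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.request_spot_instances(
    SpotPrice="0.10",                 # the most you are willing to pay per instance-hour
    InstanceCount=4,                  # launch several workers at once
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",   # your prebuilt AMI with GSL and your binary baked in
        "InstanceType": "c5.large",
        "KeyName": "my-keypair",
    },
)

for req in response["SpotInstanceRequests"]:
    print(req["SpotInstanceRequestId"], req["State"])

The same AMI also works for on-demand instances via run_instances, so you can mix the two if spot capacity dries up.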

Related

TorchServe on SageMaker scaling and latency issues

We're using TorchServe on SageMaker with the deep learning containers (https://github.com/aws/deep-learning-containers/blob/master/available_images.md).
Everything seems to be running well: CPU and memory usage are minimal, and our request latency is about 3 seconds (which is quite long, but okay). Once we start to ramp up usage, our model latency increases dramatically, taking up to 1 minute to respond, yet CPU and memory usage remain minimal. We suspect TorchServe is not scaling properly on the instance and is only deploying one worker when it should be running at least 4 on a 4-vCPU system (ml.m5.xlarge instance). We have been following the official AWS guides.
How can we resolve this?
We work with large files (1-20 MB images). However, SageMaker has a limit of 5 MB on payload size. To get around this we are using S3 to store our intermediate files (which potentially causes long latency). Our images go through preprocessing, enhancing, segmentation, and postprocessing, which are currently all individual SageMaker endpoints controlled by a Lambda function. We want to use SageMaker Pipelines, but with the payload limit we see no incentive to change our system.
Is there a better way of doing this?
We have been experimenting with multi-model endpoints. Ideally, we would like to have an ml.g4dn.xlarge run all our PyTorch models to benefit from the low latency and high throughput. Still, when we deploy a TorchServe multi-model endpoint with the deep learning containers, we get an error when invoking it: "model version not selected". I have looked everywhere for anyone having similar issues but can't find anything.
Are there any guides for TorchServe multi-model endpoints? Are we making an obvious mistake?
We have also experimented with Elastic Inference and AWS Inf instances. However, we have seen no latency decrease compared to the standard CPU instance. Again, are we making a mistake here? The benefits of these instances seem well suited to our workload, and we would love to use them.
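One thing worth checking, outside the original thread: the SageMaker PyTorch serving containers can typically be told how many model-server workers to start through an environment variable, so the worker count need not be left to the default. A rough sketch with the SageMaker Python SDK; the S3 path, role ARN, framework versions, and the assumption that your container honors SAGEMAKER_MODEL_SERVER_WORKERS are all placeholders/assumptions to verify against your container version.

# Rough sketch (assumption: the container reads SAGEMAKER_MODEL_SERVER_WORKERS).
from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data="s3://my-bucket/model.tar.gz",                 # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerRole",      # placeholder
    framework_version="1.12",
    py_version="py38",
    entry_point="inference.py",
    env={"SAGEMAKER_MODEL_SERVER_WORKERS": "4"},              # roughly one worker per vCPU on ml.m5.xlarge
)

predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")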

VMware SQL Server Integration Services multi-core performance?

We run a SQL Server using SSIS. My ops department won't raise the number of cores on the machine because it's running on top of VMware, and they argue that adding more cores slows it down due to the overhead of having to find the cores to run on.
My use case is an SSIS job consisting of multiple flows, sources, destinations, packages, etc. that runs for anything up to 8 hours. I would like to throw some more hardware at it, so I asked them if we could up the core count to something like 32.
Price is not an issue. They simply claim that performance would not increase because VMware does not work well with multiple cores.
I don't trust them, but I can't really counter them since I don't know anything about VMware, other than that it sounds wrong. Sure, multithreading has overhead, but this is a task that runs for 8 hours, can utilize multithreading, and is not I/O- or RAM-capped.
I do know a lot of factors play into performance in a virtual environment, and I can't really tell you much more than the specs I know, which are the following:
SQL Server 2012
Windows Server 2012 with 8 cores
VMware, don't know much else
Runs in a big datacenter; neither I nor operations have access to the underlying hardware or the VMware layer.
They're not wrong, but... there are a ton of variables that go into properly sizing a VM, so they may not be right either.
There are lots of metrics they can look at to check whether the system is starved for resources. Also, if they have access to something like vRealize Operations Manager, there are pre-made dashboards they can use to easily see what vROps recommends the VM's size should be. There are several other operations products out there that do similar things too.
Here's some reading that can help you understand their position a little more as well as give you some things to follow up and ask them about:
General Rightsizing Overview: http://virtual-red-dot.info/right-sizing-virtual-machines/
SQL Server Performance Whitepaper: https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/solutions/sql-server-on-vmware-best-practices-guide.pdf
Your ops group needs a lot of help. VMware is a very, very mature product that has been around for many years and is the most widely used commercial product for virtualizing machines. VMware knows how to use additional cores. It's like saying you're better off with a four-cylinder engine instead of an eight-cylinder one because the extra cylinders will just confuse the engine.
Depending on what is causing the package to run for 8 hours, more cores might not help much, but it is very unlikely that it will run slower. It is easy to add the cores, run the package, and see if it runs faster. Please post the results. I would bet good money it will run faster, but unless we see the package we won't know. Remember that you can create multiple paths (containers) that execute in parallel (where appropriate) to take advantage of the additional cores.

How to gear towards scalability for a start-up e-commerce portal?

I want to scale an e-commerce portal based on LAMP. Recently we've seen a huge traffic surge.
What would be steps (please mention in order) in scaling it:
Should I consider moving onto Amazon EC2 or similar? What could be the potential problems in switching servers?
Do we need to redesign the database? I read that Facebook switched from MySQL to Cassandra. What kind of code changes would be required if we switched to Cassandra? Would Cassandra be a better option than MySQL?
Is Hadoop a possibility? I'm not even sure.
Any other things that need to be thought of?
Found this post helpful. This blog has nice articles as well. What I want to know is a list of steps I should consider in scaling this app.
First, I would suggest making sure every resource served by your server sets appropriate cache control headers. The goal is to make sure truly dynamic content gets served fresh every time and any stable or static content gets served from somebody else's cache as much as possible. Why deliver a product image to every AOL customer when you can deliver it to the first and let AOL deliver it to all the others?
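To make that concrete: the portal is LAMP, but the header split looks the same in any stack. A minimal sketch in Python (Flask is used here purely so the example is self-contained, not because it is part of the stack in question):

# Minimal illustration of the caching split: long-lived headers for static
# assets, no caching for dynamic pages. Flask is used only for brevity.
from flask import Flask, request

app = Flask(__name__, static_folder="static")

@app.after_request
def set_cache_headers(response):
    if request.path.startswith("/static/"):
        # Product images, CSS, JS: let browsers and upstream caches keep them.
        response.headers["Cache-Control"] = "public, max-age=86400"
    else:
        # Cart, checkout, account pages: always served fresh.
        response.headers["Cache-Control"] = "no-cache, no-store, must-revalidate"
    return response

@app.route("/")
def home():
    return "storefront"

if __name__ == "__main__":
    app.run()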
If you currently run your webserver and dbms on the same box, you can look into moving the dbms onto a dedicated database server.
Once you have done the above, you need to start measuring the specifics. What resource will hit its capacity first?
For example, if the webserver is running at or near capacity while the database server sits mostly idle, it makes no sense to switch databases or to implement replication etc.
If the webserver sits mostly idle while the dbms chugs away constantly, it makes no sense to look into switching to a cluster of load-balanced webservers.
Take care of the simple things first.
If the dbms is the likely bottleneck, make sure your database has the right indexes so that it gets fast access times during lookups and doesn't waste unnecessary time during updates. Make sure the dbms logs to a different physical medium from the tables themselves. Make sure the application isn't issuing any wasteful queries, etc. Make sure you do not run any expensive analytical queries against your transactional database.
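As a tiny illustration of the index point (SQLite is used here only so the snippet runs anywhere; the idea is the same on MySQL):

# The same lookup without and with an index: the query plan changes from a
# full table scan to an index search.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                 [(i % 1000, i * 1.5) for i in range(100000)])

print(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42").fetchall())

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
print(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42").fetchall())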
If the webserver is the likely bottleneck, profile it to see where it spends most of its time, and reduce the work by changing your application or implementing new caching strategies, etc. Make sure you are not doing anything that will prevent you from moving from a single server to multiple servers with a load balancer.
If you have taken care of the above, you will be much better prepared for making the move to multiple webservers or database servers. You will be much better informed for deciding whether to scale your database with replication or to switch to a completely different data model etc.
1) First thing - measure how many requests per second your most-visited pages can serve. For well-written PHP sites on average hardware this should be in the 200-400 requests per second range. If you are not there, you have to optimize the code by reducing the number of database requests, caching rarely changed data in memcached/shared memory, and using a PHP accelerator (a small caching sketch follows below). If you are at some 10-20 requests per second, you need to get rid of your bulky framework.
2) Second - if you are still on Apache2, you have to switch to lighttpd or nginx+apache2. Personally, I like the second option.
3) Then move all your static data to a separate server or CDN. Make sure it is served with "expires" headers, at least 24 hours.
4) Only after all these things should you start thinking about going to EC2/Hadoop, building multiple servers, and balancing the load (nginx would also help you there).
After steps 1-3 you should be able to serve some 10'000'000 hits per day easily.
If you need just 1.5-3 times more, I would go for a single more powerful server (8-16 cores, lots of RAM for caching & the database).
With step 4 and multiple servers you are on your way to 0.1-1 billion hits per day (but with significantly larger hardware & support expenses).
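Here is the small caching sketch mentioned in step 1: cache the result of an expensive, rarely changing query in memcached. The pymemcache client and the get_categories_from_db() stub are illustrative choices, not part of the original answer (the site itself is PHP).

# Cache rarely changed data in memcached instead of hitting the database on
# every request. The client library and the stub query are illustrative.
from pymemcache.client.base import Client

cache = Client(("127.0.0.1", 11211))

def get_categories_from_db():
    # Stand-in for the expensive database query.
    return "books,music,electronics"

def get_categories():
    cached = cache.get("categories")
    if cached is not None:
        return cached.decode("utf-8")
    value = get_categories_from_db()
    cache.set("categories", value, expire=300)   # refresh at most every 5 minutes
    return value

print(get_categories())   # first call hits the "database"
print(get_categories())   # subsequent calls come from memcached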
Find out where issues are happening (or are likely to happen if you don't have them now). Knowing what your biggest resource usage is matters when evaluating any solution. Stick to solutions that will give you the biggest improvement.
Consider:
- Higher-than-needed bandwidth use per user is something you want to address regardless of moving to EC2. It will cost you money either way, so it's worth looking at things like this: http://developer.yahoo.com/yslow/
- Don't invest in changing databases if that's a non-issue. Find out first if that's really the problem; even if you are having issues with the database, it might be a code issue, i.e. hitting the database lots of times per request.
- Unless we are talking about very big numbers, you shouldn't have high CPU usage issues. If you do, find out where they are happening; optimization is worth it where specific code has a high impact on your overall resource usage.
- After making sure the above is reasonable, you might get big improvements with caching: in bandwidth (making sure browsers/proxies can play their part in caching) and in local resource usage (avoiding re-processing/re-retrieving the same info all the time).
I'm not saying you should go all out with the above, just enough to make sure you won't hit the same issues elsewhere in a few months, and enough to find out where your biggest gains are and whether you will get enough value from any scaling option. This will also allow you to come back and ask questions about specific problems, and about how these scaling options relate to them.
You should prepare by choosing a flexible framework and be aware that things are going to change along the way. In some situations it's difficult to predict your users' behavior.
If you have seen an explosion of traffic recently, analyze which pages are the slowest.
You can move to the cloud, but EC2 is not the best-performing option. Again, be sure there's no other optimization you can do first.
The database might need to be redesigned, but I doubt all of it does. Again, look at the problem points.
Both Hadoop and Cassandra are pretty nifty, but they might be overkill.

Is GAE a viable platform for my application? (if not, what would be a better option?)

Here's the requirement at a very high level.
We are going to distribute desktop agents (or browser plugins) to collect certain information from tons of users (in the thousands, possibly millions down the road).
These agents collect data and periodically upload it to a server app.
The server app will allow for analyzing the collected data (filter, sort, etc. based on 4-5 attributes) and summarizing it in the form of charts, etc.
We should also be able to export some of the collected data (CSV or PDF).
We are looking for a platform to host the server app. GAE seems attractive because of low administrative cost and scalability (as the user base increases, the platform will handle the scale... hopefully!).
Is GAE a viable option for us?
One important consideration is that sometimes the volume of uploads from the agents can exceed 50 MB per upload cycle. We will have users in places where Internet connections could be very slow, too. Apparently GAE has a limit on how long a request can last. The upload volume may cause the request (transferring data from an agent to the server) to last longer than 30 seconds. How would one handle such a situation?
Thanks!
The time of the upload is not considered part of the script execution time, so no worries there.
Google App Engine is very good at performing a vast number of small jobs, but not so good at complex, long-running background jobs (because of the 30-second request limit plus an even smaller database connection time limit). So GAE would probably be a very good platform to GATHER the data, but not for actually ANALYZING it. You probably want to separate these two.
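A sketch of that gather/analyze split on the old (Python 2) App Engine runtime: the upload handler only stores the payload and enqueues a task, and the heavy work runs later in a task-queue request with its own deadline. The handler URLs and the Upload model are illustrative, not from this thread.

# Gather quickly in the request, analyze later in a background task.
from google.appengine.api import taskqueue
from google.appengine.ext import db, webapp

class Upload(db.Model):
    payload = db.BlobProperty()

class CollectHandler(webapp.RequestHandler):
    def post(self):
        entity = Upload(payload=db.Blob(self.request.body))
        entity.put()
        # Defer the expensive analysis to a task-queue request.
        taskqueue.add(url="/tasks/analyze", params={"key": str(entity.key())})

class AnalyzeHandler(webapp.RequestHandler):
    def post(self):
        entity = Upload.get(db.Key(self.request.get("key")))
        # ... summarize/aggregate entity.payload here ...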
We went ahead and implemented the first version on GAE anyway. The experience has been very much what is described here: http://www.carlosble.com/?p=719
For a proof-of-concept prototype, what we have built so far is acceptable. However, we have decided not to go with GAE (at least in its current shape) for the production version. The pains somewhat outweigh the benefits in our case.
The problems we faced were numerous. Unlike my experience with J2EE stacks, when you run into an issue it is often a dead end. Workarounds, if you can find one at all, are very complicated and ugly.
By writing good prototypes one could figure out whether GAE is right for the solution being built; the hype, however, is a problem. Many newcomers will get overly excited about GAE because of the hype and end up failing badly, because they will choose GAE for all kinds of purposes it is not suitable for.

How is MapReduce a good method to analyse HTTP server logs?

I've been looking at MapReduce for a while, and it seems to be a very good way to implement fault-tolerant distributed computing. I read a lot of papers and articles on that topic, installed Hadoop on an array of virtual machines, and did some very interesting tests. I really think I understand the Map and Reduce steps.
But here is my problem: I can't figure out how it can help with HTTP server log analysis.
My understanding is that big companies (Facebook for instance) use MapReduce to process their HTTP logs in order to speed up the extraction of audience statistics from them. The company I work for, while smaller than Facebook, has a big volume of web logs to process every day (100 GB, growing between 5 and 10 percent every month). Right now we process these logs on a single server, and it works just fine. But distributing the computing jobs immediately comes to mind as a soon-to-be-useful optimization.
Here are the questions I can't answer right now; any help would be greatly appreciated:
Can the MapReduce concept really be applied to web log analysis?
Is MapReduce the most clever way of doing it?
How would you split the web log files between the various computing instances?
Thank you.
Nicolas
Can the MapReduce concept really be applied to web log analysis?
Yes.
You can split your huge logfile into chunks of, say, 10,000 or 1,000,000 lines (whatever is a good chunk size for your type of logfile - for Apache logfiles I'd go for a larger number), feed them to some mappers that extract something specific (like browser, IP address, ..., username, ...) from each log line, then reduce by counting the number of times each one appeared (simplified; a minimal mapper/reducer sketch in Python follows the example):
192.168.1.1,FireFox x.x,username1
192.168.1.1,FireFox x.x,username1
192.168.1.2,FireFox y.y,username1
192.168.1.7,IE 7.0,username1
You can extract browsers, ignoring version, using a map operation to get this list:
FireFox
FireFox
FireFox
IE
Then reduce to get this:
FireFox,3
IE,1
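A minimal Hadoop Streaming-style version of that mapper/reducer pair in Python; the comma-separated field layout matches the toy example above and would need adjusting for real log formats.

# mapper.py - emits "browser<TAB>1" for each log line.
import sys

for line in sys.stdin:
    fields = line.strip().split(",")
    if len(fields) < 2:
        continue
    browser = fields[1].split()[0]    # "FireFox x.x" -> "FireFox"
    print("%s\t1" % browser)

# reducer.py - sums the counts per browser; Hadoop Streaming delivers the
# mapper output sorted by key, so equal keys arrive together.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.strip().split("\t")
    if key != current_key:
        if current_key is not None:
            print("%s,%d" % (current_key, count))
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    print("%s,%d" % (current_key, count))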
Is MapReduce the most clever way of doing it?
It's clever, but you would need to be very big in order to gain any benefit... Splitting PETABYTES of logs.
To do this kind of thing, I would prefer to use message queues and a consistent storage engine (like a database), with processing clients that pull work from the queues, perform the job, and push results to another queue; jobs not completed within some timeframe are made available again for others to process. These clients would be small programs that do something specific.
You could start with 1 client and expand to 1,000... You could even have a client that runs as a screensaver on all the PCs on a LAN, and run 8 clients on your 8-core servers, 2 on your dual-core PCs...
With pull: you could have 10 or 100 clients working, multicore machines could have multiple clients running, and whatever a client finishes would be available for the next step. You don't need to do any hashing or assignment for the work to be done. It's 100% dynamic.
http://img355.imageshack.us/img355/7355/mqlogs.png
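A tiny sketch of that pull model in Python; the in-process queue.Queue stands in for a real message broker purely to keep the example self-contained.

# Workers pull chunks of log lines from a work queue, count browsers, and push
# partial results to a results queue. queue.Queue is a stand-in for a broker.
import queue
import threading
from collections import Counter

work_q, result_q = queue.Queue(), queue.Queue()

def worker():
    while True:
        chunk = work_q.get()              # pull the next unit of work
        if chunk is None:                 # sentinel: no more work
            work_q.task_done()
            break
        counts = Counter(line.split(",")[1].split()[0] for line in chunk)
        result_q.put(counts)              # push the partial result downstream
        work_q.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

work_q.put(["192.168.1.1,FireFox x.x,username1",
            "192.168.1.2,FireFox y.y,username1",
            "192.168.1.7,IE 7.0,username1"])
for _ in threads:
    work_q.put(None)
for t in threads:
    t.join()

total = Counter()
while not result_q.empty():
    total += result_q.get()
print(total)                              # Counter({'FireFox': 2, 'IE': 1})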
How would you split the web log files between the various computing instances?
By number of elements or lines if it's a text-based logfile.
In order to test MapReduce, I'd like to suggest that you play with Hadoop.
Can the MapReduce concept really be applied to web log analysis?
Sure. What sort of data are you storing?
Is MapReduce the most clever way of doing it?
It would allow you to query across many commodity machines at once, so yes it can be useful. Alternatively, you could try Sharding.
How would you split the web log files between the various computing instances?
Generally you would distribute your data using a consistent hashing algorithm, so you can easily add more instances later. You should hash by whatever would be your primary key in an ordinary database: a user ID, an IP address, the referrer, the page, the advert; whatever is the topic of your logging.
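A bare-bones consistent hashing sketch in Python, just to illustrate how keys keep mapping to the same instance as nodes are added (real implementations add virtual nodes for better balance):

# Keys map to the first node clockwise from their hash on a ring, so adding a
# node only remaps the keys that fall just before it.
import bisect
import hashlib

def _hash(value):
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class HashRing(object):
    def __init__(self, nodes):
        self.ring = sorted((_hash(n), n) for n in nodes)

    def node_for(self, key):
        h = _hash(key)
        idx = bisect.bisect(self.ring, (h, ""))
        return self.ring[idx % len(self.ring)][1]

ring = HashRing(["node-1", "node-2", "node-3"])
print(ring.node_for("192.168.1.1"))   # the same IP always lands on the same node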
