I have to say I'm not an administrator of any sort and have never needed to distribute load on a server before, but now I'm in a situation where I can see that I might have a problem.
This is the scenario and my problem:
I have IIS running on a server with MSSQL. A client can send a request that retrieves a data package with a single request to the MSSQL database; that data is then sent back to the client.
This package of data can vary in length, but is generally <10 MB.
This is all working fine, but I'm now facing a what-if: if I have 10,000 clients pounding on the server simultaneously, I can see my bandwidth probably getting smashed, and I also imagine that both IIS and MSSQL will be dying of exhaustion.
So my question is: I guess the bandwidth issue is only about hosting? But how can I distribute this so that IIS and MSSQL can perform without being exhausted?
I'd really appreciate an explanation of how this can be achieved. It's probably standard knowledge, but for me it's a bit of a mystery. I know it can be done when I look at Dropbox and the like; the big question is just how I can do it.
Thanks a lot.
You will need to consider some form of load balancing. Since you are using IIS, I'm assuming that you are hosting on Windows Server, which provides a software-based Network Load Balancer. See Network Load Balancing Overview.
You need to identify the performance bottlenecks, then plan to reduce them. A sledgehammer approach here might not be the best idea.
Set up performance counters and record a day or two's worth of data. See this link on how to do SQL Server performance troubleshooting.
The bandwidth might be just one of the problems. By setting up performance counters and doing an analysis of what is actually happening, you will be able to plan a better solution with the right data.
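As a rough illustration of "measure first", here is a minimal Python sketch that times the stages of a request inside the application and reports which one dominates. It is not a replacement for Windows performance counters; the stage names `db_query` and `send_response`, the `handle_request` function, and the sleeps standing in for real work are all hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Collected durations in seconds, keyed by stage name.
timings = defaultdict(list)

@contextmanager
def timed(stage):
    """Record how long the enclosed block takes under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage].append(time.perf_counter() - start)

def slowest_stage():
    """Return the stage with the largest total recorded time."""
    totals = {stage: sum(samples) for stage, samples in timings.items()}
    return max(totals, key=totals.get)

def handle_request():
    """Hypothetical request handler; the sleeps stand in for real work."""
    with timed("db_query"):
        time.sleep(0.05)   # placeholder for the database query
    with timed("send_response"):
        time.sleep(0.005)  # placeholder for sending the data to the client
```

After a representative number of requests, `slowest_stage()` points at the stage worth optimizing first.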
Related
My team has run into a design conflict. We are working on a project that involves scraping historical data from Yahoo for all stocks for the last year to run some ML analysis on it. The scraping is unbearably slow, and we're not sure whether it's the network or the web scraper. I proposed we use AWS RDS to store the data so we can access it more quickly. However, a team member said that storing the data in the cloud would not solve our latency issue. I countered that the data would be organized and stored in a way that makes it significantly faster to access. He came back with something else, and this went on. Is it true that a cloud DB won't offer any additional speed compared to a scraper? If so, does AWS have a service that allows us to access the data we store faster through another service, almost as if the database were on our own server?
I am not all that familiar with cloud services, but I do understand databases pretty well. So please dumb down the AWS stuff if you wish, and feel free to point me to any duplicates or links that may help me understand this more.
Lots of good reasons to use RDS as a database, but speeding up your scraping isn't one of them - it likely isn't your bottleneck.
I have written lots of scrapers over the years, and by far the biggest performance boost will be to have a fast network connection between the scraper machine(s) and the host you are scraping, and even then, using a multi-threaded scraper for each scraping machine will give you another HUGE speed improvement.
Most of the time spent scraping is waiting on the host to return the results to you, not parsing the page and not saving the data to a database.
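Because the time goes to waiting on the host, overlapping those waits is where threads help. A minimal sketch using Python's standard `concurrent.futures`, where `fetch` is a hypothetical placeholder that sleeps instead of making a real HTTP request and the URLs are made up:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Placeholder for a real HTTP request; sleeps to mimic waiting on the host."""
    time.sleep(0.1)
    return url, f"<html>{url}</html>"

def scrape_all(urls, max_workers=8, fetch_fn=fetch):
    """Fetch all URLs concurrently and return a {url: page} mapping."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs fetch_fn in worker threads, so the waits overlap.
        return dict(pool.map(fetch_fn, urls))
```

With eight workers, eight of these simulated requests finish in roughly the time of one rather than eight in a row. A real scraper would also need to respect the host's rate limits.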
A MySQL DB on AWS RDS would be the same as the one that you'd install yourself on some machine. So, it isn't going to be different or slower just because it is in the cloud.
If you scrape some data and process it only once, then there is no point in introducing a DB in between. But if your scraper is slow and you process scraped data multiple times, then storing it in a DB should improve latencies. That is because the latency of a DB read will be much lower than that of scraping (assuming you design your DB schema properly, your hosts are in the same availability zones, or at least regions, as your DB, etc.).
For example, if scraping a webpage takes ~10s and you process the scraped data twice, it'd take you ~20s if you don't have a DB. If you have a DB with read latencies of ~500ms, you'd only take ~11s (one ~10s scrape plus two ~500ms reads).
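The arithmetic above can be written as a small Python helper. This is only a sketch of the same back-of-the-envelope estimate; it ignores the time to write the scraped data into the DB, and the function and parameter names are made up for illustration.

```python
def total_time(scrape_s, passes, db_read_s=None):
    """Estimate the seconds needed to process one page `passes` times.

    Without a DB, every pass re-scrapes the page.
    With a DB, the page is scraped once and each pass does one DB read.
    """
    if db_read_s is None:
        return scrape_s * passes
    return scrape_s + db_read_s * passes
```

`total_time(10, 2)` gives 20 and `total_time(10, 2, db_read_s=0.5)` gives 11.0, matching the figures above; the gap widens with every additional pass over the data.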
I have a background in web programming, where both the data and the code live on the server. Web hosts with MySQL or the like are plentiful and cheap, so using the application from multiple PCs was never a problem.
However, I'm considering switching to building desktop applications, but the only factor that annoys me is the syncing of data across the many PCs I use. I was thinking of perhaps setting up a light Amazon EC2 instance with PostgreSQL on it and having my desktop applications use that.
I have a few questions:
I'm curious as to what latency I might expect from running the database on EC2 instead of on the local network; any experience or insight is appreciated.
Are there better/more obvious/cheaper solutions?
I've looked at the pricing and it seems to come down to $24.48 per month for a yearly contract. Whilst not really expensive, it is not exactly cheap either. At what point does it become more interesting to run a local server?
I'm obviously not using my applications for large parts of the day (sleep, work, ...). I was wondering if I can have the Amazon server go into a sort of "sleep" mode and wake up when poked. An initial delay for the first desktop application is acceptable. The reason behind this behavior would be to save money on the instance if it is only actually needed for 10% of the day.
I welcome any feedback at all on how this problem is best tackled.
This could get ugly. Every single query you do will have latency associated with it. If you have a lot of queries, this can add up very fast. So keep your query count low, and try to pre-fetch and cache data when possible.
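To make "keep your query count low, pre-fetch and cache" concrete, here is a minimal Python sketch. `CachedLookup` is a hypothetical helper, and `fetch_many` stands in for whatever function runs one batched query against the remote database; the sketch never invalidates the cache, which a real application syncing several PCs would need to handle.

```python
class CachedLookup:
    """Serve repeated reads from memory and batch the misses into one round trip."""

    def __init__(self, fetch_many):
        # fetch_many(keys) -> {key: value}; assumed to run ONE remote query.
        self._fetch_many = fetch_many
        self._cache = {}
        self.round_trips = 0

    def prefetch(self, keys):
        """Load all keys that are not cached yet with a single remote call."""
        missing = [k for k in keys if k not in self._cache]
        if missing:
            self.round_trips += 1
            self._cache.update(self._fetch_many(missing))

    def get(self, key):
        """Return a value, going to the remote database only on a cache miss."""
        if key not in self._cache:
            self.prefetch([key])
        return self._cache[key]


def fake_fetch_many(keys):
    """Stand-in for a batched remote query, used only for demonstration."""
    return {k: k * 2 for k in keys}
```

Calling `prefetch([1, 2, 3])` once at startup costs one round trip; the following `get` calls for those keys cost none.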
Not enough information to answer that question.
Depends on the cost of your local server. Keep in mind that you will need to pay for electricity to keep it on.
You can stop your instance when you don't need it. With the exception of high-utilization reservations, you won't get billed while it's in the stopped state. With high-utilization reservations you will still pay the full cost.
One of our problems is that our outbound email server sucks sometimes. Users will trigger an email in our application, and the application can take on the order of 30 seconds to actually send it. Let's make it even worse and admit that we're not even doing this on a background thread, so the user is completely blocked during this time. SQL Server Database Mail has been proposed as a solution to this problem, since it basically implements a message queue and is physically closer and far more responsive than our third party email host. It's also admittedly really easy to implement for us, since it's just replacing one call to SmtpClient.Send with the execution of a stored procedure. Most of our application email contains PDFs, XLSs, and so forth, and I've seen the size of these attachments reach as high as 20MB.
Using Database Mail to handle all of our application email smells bad to me, but I'm having a hard time talking anyone out of it given the extremely low cost of implementation. Our production database server is way too powerful, so I'm not sure that it couldn't handle the load, either. Any ideas or safer alternatives?
All you have to do is run it through an SMTP server. If you're planning on sending large amounts of mail, you'll have to not only load balance the servers (and DNS servers, if you're planning on sending out 100K+ mails at a time) but also make sure your outbound email servers have the proper A records registered in DNS to prevent bounce-backs.
It's a cheap solution (minus the load balancer costs).
Yes, dual-home the server for your internal LAN and the internet, and make sure it's an outbound-only server. Start out with one SMTP server, and if you get bottlenecks right off the bat, look to see if it's memory, disk, network, or load related. If it's load related, then it may be time to look at load balancing. If it's memory related, throw more memory at it. If it's disk related, throw a RAID 0+1 array at it. If it's network related, use a bigger pipe.
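Whichever server ends up sending the mail, the blocking described in the question can be taken off the user's thread with a queue and a background worker. The following is a minimal Python sketch of that idea only; `BackgroundMailer` and `send_fn` are hypothetical names, and the in-memory queue loses pending messages if the process dies, which is exactly what a persistent queue avoids.

```python
import queue
import threading

class BackgroundMailer:
    """Queue messages and send them on a worker thread so callers never block."""

    def __init__(self, send_fn):
        # send_fn(message) is assumed to be the slow, blocking send call.
        self._send_fn = send_fn
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def enqueue(self, message):
        """Return immediately; the worker thread sends the message later."""
        self._queue.put(message)

    def _run(self):
        while True:
            message = self._queue.get()
            try:
                self._send_fn(message)
            except Exception as exc:
                # A real system would log this and retry or dead-letter it.
                print(f"send failed: {exc!r}")
            finally:
                self._queue.task_done()

    def wait_until_sent(self):
        """Block until every queued message has been processed."""
        self._queue.join()
```

The user-facing code calls `enqueue` and returns at once; only the worker thread waits on the slow mail host.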
I have very limited experience with database programming, and my applications that access databases are simple ones :). Until now :(. I need to create a medium-size desktop application (is it called a rich client?) that will use a database on the network to share data between multiple users. Most probably I will use C# and MSSQL/MySQL/SQLite.
I have performed a few test runs and discovered that on low-quality networks database access is not so smooth. On one company's LAN, a lot of data is transferred over the network and the servers are under constant load, so it's common for a simple INSERT or SELECT SQL query to take 1-2 minutes or even fail with a timeout / network error.
Are there any best practices for handling such situations? Of course I can split my app into a GUI thread and a DB thread so network problems will not lead to a frozen GUI. But what should I do about lots of network errors? Displaying them to the user too often would not be very good :(. I'm thinking about automatically creating a local copy of the database on each computer my app runs on: update the local database first and synchronize it in the background, simply retrying on network errors. This would allow the app to function even if the network has serious lag or problems.
Any hints or buzzwords I can look into? Maybe there are best practices already available that I don't know about :)
Sorry, this is probably not the answer you are looking for, but you mention that a simple insert / update could take 1-2 minutes or even fail with a timeout / network error.
To me this sounds like there may be a problem other than the network itself. If you're working on a corporate network, there would have to be insane levels of traffic for this sort of behavior. I would do everything in your power to look at improving the network before proceeding. Can you post the result of a ping to the DB box?
If you're going to architect your application around this type of network, it will significantly alter the end product and possibly even result in a poor-quality product for other clients.
Depending upon the nature of the application maybe look at implementing an async persistence queue and caching data on startup or even embedding a copy of the db into your application.
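For the "simple retrying on network errors" part of the question, a minimal Python sketch of retry with exponential backoff might look like this. The `retry` helper, its defaults, and the choice of `ConnectionError` and `TimeoutError` as retryable errors are assumptions for illustration; a real database driver raises its own exception types.

```python
import time

def retry(operation, attempts=4, base_delay=0.5,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(); on a retryable error wait and try again.

    The wait doubles after each failure (base_delay, 2x, 4x, ...).
    The last failure is re-raised so the caller can surface it once.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Run on the background thread, this keeps transient failures away from the user; only an error that survives all attempts needs to be shown.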
Even though async behaviour/queues/caching/copying the database to each local instance etc will help solve the symptoms, the problem will still remain. If the network really is that bad then I'd address it with their I.T. department, or the project manager and build some performance requirement from their side of things into the contract.
With a distributed application, where you have lots of clients and one main server, should you:
Make the clients dumb and the server smart: clients are fast and non-invasive. Business rules are needed in only 1 place
Make the clients smart and the server dumb: take as much load as possible off of the server
Additional info:
Clients collect tons of data about the computer they are on. The server must analyze all of this info to determine the health of these computers
The owners of the client computers are temperamental and will shut down the clients if the client starts to consume too many resources (thus negating the purpose of the distributed app in helping diagnose problems)
You should do as much client-side processing as possible. This will enable your application to scale better than doing processing server-side. To solve your temperamental user problem, you could look into making your client processes run at a very low priority so there's no noticeable decrease in performance on the part of the user.
In a client-server setting, if you care about security, you should always program on the assumption that the client may have been compromised. Even if it hasn't, there is always the risk of somebody using an old version of the client, using a competing or modified version of the client, or just of the net connection being a bit screwy.
So while you do as much work on the client as possible, processing and marshalling information into the right form, the server then needs to do a thorough sanity check on anything the client gives it.
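A server-side sanity check of that kind could look roughly like the following Python sketch. The report fields (`host`, `cpu_percent`, `client_version`) and the list of supported versions are invented for illustration, since the question does not say what the clients actually send.

```python
# Hypothetical list of client versions the server still accepts.
SUPPORTED_VERSIONS = ("1.0", "1.1")

def validate_report(report):
    """Return a list of problems found in a client report (empty if it looks sane)."""
    if not isinstance(report, dict):
        return ["report must be an object"]
    errors = []
    host = report.get("host")
    if not isinstance(host, str) or not host or len(host) > 255:
        errors.append("host must be a non-empty string of at most 255 characters")
    cpu = report.get("cpu_percent")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(cpu, bool) or not isinstance(cpu, (int, float)) or not 0 <= cpu <= 100:
        errors.append("cpu_percent must be a number between 0 and 100")
    if report.get("client_version") not in SUPPORTED_VERSIONS:
        errors.append("unsupported client_version")
    return errors
```

The client can still do the heavy processing; the server only accepts the result once `validate_report` returns an empty list.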
So the answer I guess is "both".
"The server must analyze all of this info to determine the health of these computers"
That is probably the biggest clue so far explaining what your application is kinda about. Are you able to provide a more elaborate briefing on what this application is seeking to achieve in this distributed environment? We do not even know if the client-side processing is disk I/O or processor intensive. How you design the solution depends on the nature of what needs to be done to help the users/business accomplish their jobs and objectives.