General question. Which option would have less network traffic.
Option 1: A single database connection to a WebAPI, with multiple clients communicating via the API using the standard request / return data.
Option 2: Each client having direct read only access to the database with their own connections, and reading the data directly.
My expectation is that with only a single user, the direct database approach would have less traffic, but for each additional user the API would have a smaller incremental increase in traffic over the direct database.
However I have no evidence to back this up. Just hoping somebody may know of a resource which has this data already, or has done the experiment themselves (as my google fu is failing me)
Related
I'm currently trying to wrap my head around outbound transfer as a project of mine makes it a concern.
I discovered that if I try to play music directly off of my server, it counts towards my outbound transfer. This is understandable and I can understand the logic of that.
The idea I have is if I happen to host the file elsewhere, would the outbound transfer be counted towards my initial server, my 3rd party server, or both? I'm considering putting the music on dropbox for example and stream it from there through the server.
Is what I want even possible?
"outbound transfer" in this case most likely refers to the amount of bytes sent from that server. If you proxy the 3rd party server, you still send that data through your own server, so it won't net you any benefit, other than storage space. In fact, the latency will probably increase.
What you want to do is of course possible if you let the client connect directly to the streaming service. Just make sure that service allows you to stream data that way through their TOS. Also make sure that the service is actually designed for live streaming of data, or your user experience will be horrible.
I'm fairly experienced with web crawlers, however, this question is in regards to performance and scale. I'm needing to request and crawl 150,000 urls over an interval(most urls are every 15 minutes which makes it about 10,000 requests per minute). These pages have a decent amount of data(around 200kb per page). Each of the 150,000 urls exist in our database(MSSQL) with a timestamp of the last crawl date, and an interval for so we know when to crawl again.
This is where we get an extra layer of complexity. They do have an API which allows for up to 10 items per call. The information we need exists partially only in the API, and partially only on the web page. The owner is allowing us to make web calls and their servers can handle it, however, they can not update their API or provide direct data access.
So the flow should be something like: Get 10 records from the database that intervals have passed and need to be crawled, then hit the API. Then each item in the batch of 10 needs their own separate web-requests. Once the request returns the HTML we parse it and update records in our database.
I am interested in getting some advice on the correct way to handle the infrastructure. Assuming a multi-server environment some business requirements:
Once a URL record is ready to be crawled, we want to ensure it is only grabbed and ran by a single server. If two servers check it out simultaneously and run, it can corrupt our data.
The workload can vary, currently, it is 150,000 url records, but that can go much lower or much higher. While I don't expect more than a 10% change per day, having some sort of auto-scale would be nice.
After each request returns the HTML we need to parse it and update records in our database with the individual data pieces. Some host providers allow free incoming data but charge for outgoing. So ideally the code base that requests the webpage and then parses the data also has direct SQL access. (As opposed to a micro-service approach)
Something like a multi-server blocking collection(Azure queue?), autoscaling VMs that poll the queue, single database host server which is also queried by MVC app that displays data to users.
Any advice or critique is greatly appreciated.
Messaging
I echo Evandro's comment and would explore Service Bus Message Queues of Event Hubs for loading a queue to be processed by your compute nodes. Message Queues support record locking which based on your write up might be attractive.
Compute Options
I also agree that Azure Functions would provide a good platform for scaling your compute/processing operations (calling the API & scraping HTML). In addition Azure Functions can be triggered by Message Queues, Event Hubs OR Event Grid. [Note: Event Grid allows you to connect various Azure services (pub/sub) with durable messaging. So it might play a helpful middle-man role in your scenario.]
Another option for compute could be Azure Container Instances (ACI) as you could spin up containers on demand to process your records. This does not have the same auto-scaling capability that Functions does though and also does not support the direct binding operations.
Data Processing Concern (Ingress/Egress)
Indeed Azure does not charge for data ingress but any data leaving Azure will have an egress charge after the initial 5 GB each month. [https://azure.microsoft.com/en-us/pricing/details/bandwidth/]
You should be able to have the Azure Functions handle calling the API, scraping the HTML and writing to the database. You might have to break those up into separated Functions but you can chain Functions together easily either directly or with LogicApps.
One of our problems is that our outbound email server sucks sometimes. Users will trigger an email in our application, and the application can take on the order of 30 seconds to actually send it. Let's make it even worse and admit that we're not even doing this on a background thread, so the user is completely blocked during this time. SQL Server Database Mail has been proposed as a solution to this problem, since it basically implements a message queue and is physically closer and far more responsive than our third party email host. It's also admittedly really easy to implement for us, since it's just replacing one call to SmtpClient.Send with the execution of a stored procedure. Most of our application email contains PDFs, XLSs, and so forth, and I've seen the size of these attachments reach as high as 20MB.
Using Database Mail to handle all of our application email smells bad to me, but I'm having a hard time talking anyone out of it given the extremely low cost of implementation. Our production database server is way too powerful, so I'm not sure that it couldn't handle the load, either. Any ideas or safer alternatives?
All you have to do is run it through an SMTP server and if you're planning on sending large amounts of mail out then you'll have to not only load balance the servers (and DNS servers if you're planning on sending out 100K + mails at a time) but make sure your outbound Email servers have the proper A records registered in DNS to prevent bounce backs.
It's a cheap solution (minus the load balancer costs).
Yes, dual home the server for your internal lan and the internet and make sure it's an outbound only server. Start out with one SMTP server and if you get bottle necks right off the bat, look to see if it's memory, disk, network, or load related. If its load related then it may be time to look at load balancing. If it's memory related, throw more memory at it. If it's disk related throw a raid 0+1 array at it. If it's network related use a bigger pipe.
I have a j2ee webapp that's being used internally by ~20-30 people.
There is no chance of significant growth in the number of users.
From what I understood there's a trade-off between opening a new DB connection for each request made to the webapp (expensive, but doesn't block other users when the DB is in use), to using the singleton pattern (doesn't open new connections but only allows one user at a time).
I thought that since I know that only 30 users will ever use my webapp at the same time, maybe the simplest and best solution would be to store the connection as a session attribute, thus reducing to a minimum the amount of openings made, while still allocating one connection per user.
What do you think?
From what I understood there's a
trade-off between opening a new DB
connection for each request made to
the webapp
That is what connection pools are for. If you use a connection pool in your application, the pool once initialized, is in charge of providing connections for use in the application as and when needed. In a properly tuned connection pool, there are going to be enough connections created on reserve that can be provided to the application, mitigating the need to create and open a connection only when the application requests for it.
I thought that since I know that only
30 users will ever use my webapp at
the same time, maybe the simplest and
best solution would be to store the
connection as a session attribute
Per-user connections are not a good idea, primarily when a web application is concerned. In a web application, it is perfectly possible for users to initiate multiple requests to the server (think multi-tabbed browsing). In such a case, the use of a single connection per user will result in weird application behavior, unless you synchronize access to the connection.
One must also consider the side-effect of putting transient attributes into the session - Connection objects are not serializable and hence must be marked transient. If the session is deserialized at some point, one has to account for the fact that the Connection object will not be available, and must be re-initialized.
I think you're getting into premature optimization especially given the scale of the application. Opening a new connection is not that expensive and like Makach says, most modern RDBMSs handle connection pooling and will hold connections open for subsequent requests. You'd be trying to write better code than the compiler, so to speak.
No. Don't do that. It's perfectly ok to reconnect to the database every time you need to. Any database management system will do their own connection pool caching I think.
If you want to try to keep open connections you'll make it incredible hard for yourself to manage this in a secure, bug-free, safe etc way.
I'm building a mobile application in VB.NET (compact framework), and I'm wondering what the best way to approach the potential offline interactions on the device. Basically, the devices have cellular and 802.11, but may still be offline (where there's poor reception, etc). A driver will scan boxes as they leave his truck, and I want to update the new location - immediately if there's network signal, or queued if it's offline and handled later. It made me think, though, about how to handle offline-ness in general.
Do I cache as much data to the device as I can so that I use it if it's offline - Essentially, each device would have a copy of the (relevant) production data on it? Or is it better to disable certain functionality when it's offline, so as to avoid the headache of synchronization later? I know this is a pretty specific question that depends on my app, but I'm curious to see if others have taken this route.
Do I build the application itself to act as though it's always offline, submitting everything to a local queue of sorts that's owned by a local class (essentially abstracting away the online/offline thing), and then have the class submit things to the server as it can? What about data lookups - how can those be handled in a "Semi-live" fashion?
Or should I have the application attempt to submit requests to the server directly, in real-time, and handle it if it itself request fails? I can see a potential problem of making the user wait for the timeout, but is this the most reliable way to do it?
I'm not looking for a specific solution, but really just stories of how developers accomplish this with the smoothest user experience possible, with a link to a how-to or heres-what-to-consider or something like that. Thanks for your pointers on this!
We can't give you a definitive answer because there is no "right" answer that fits all usage scenarios. For example if you're using SQL Server on the back end and SQL CE locally, you could always set up merge replication and have the data engine handle all of this for you. That's pretty clean. Using the offline application block might solve it. Using store and forward might be an option.
You could store locally and then roll your own synchronization with a direct connection, web service of WCF service used when a network is detected. You could use MSMQ for delivery.
What you have to think about is not what the "right" way is, but how your implementation will affect application usability. If you disable features due to lack of connectivity, is the app still usable? If you have stale data, is that a problem? Maybe some critical data needs to be transferred when you have GSM/GPRS (which typically isn't free) and more would be done when you have 802.11. Maybe you can run all day with lookup tables pulled down in the morning and upload only transactions, with the device tracking what changes it's made.
Basically it really depends on how it's used, the nature of the data, the importance of data transactions between fielded devices, the effect of data latency, and probably other factors I can't think of offhand.
So the first step is to determine how the app needs to be used, then determine the infrastructure and architecture to provide the connectivity and data access required.
I haven't used it myself, but have you looked into the "store and forward" capabilities of the CF? It may suit your needs. I believe it uses an Exchange mailbox as a message queue to send SOAP packets to and from the device.
The best way to approach this is to always work offline, then use message queues to handle sending changes to and from the device. When the driver marks something as delivered, for example, update the item as delivered in your local store and also place a message in an outgoing queue to tell the server it's been delivered. When the connection is up, send any queued items back to the server and get any messages that have been queued up from the server.