Does ScraperWiki rate limit sites it is scraping? - screen-scraping

Does ScraperWiki somehow automatically rate limit scraping, or should I add something like sleep(1 * random.random()) to the loop?

There is no automatic rate limiting. You can add a sleep call in your scraper's language to rate limit it yourself.
Very few servers enforce rate limits, and servers hosting public data usually don't.
It is, however, good practice to make sure you don't overload the remote server. By default, scrapers run in a single thread, so there is a built-in limit to the load you can produce.
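A minimal sketch of the sleep-based approach, assuming a Python scraper. The names `jittered_delay` and `fetch_all` and the `fetch` callback are made up for illustration; they are not part of ScraperWiki.

```python
import random
import time


def jittered_delay(base=1.0, jitter=0.5, rng=random.random):
    """Return a pause in seconds: `base` plus up to `jitter` of random extra."""
    return base + jitter * rng()


def fetch_all(urls, fetch, sleep=time.sleep, base=1.0, jitter=0.5):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Pause before every request except the first.
            sleep(jittered_delay(base, jitter))
        results.append(fetch(url))
    return results
```

The `sleep` parameter is only there so the pause can be replaced in tests; in a real scraper you would leave it at `time.sleep` and pass your own download function as `fetch`.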

Related

Pubsub pull at a steady rate?

Is there a way to enforce a steady poll rate using the google-cloud-pubsub client? I want to avoid scenarios where a spike in the publish rate also causes the pull request rate to increase.
The client provides FlowControl settings, such as the maxOutstanding messages. From my understanding, this sets the maximum batch size of a pull operation.
I want to understand how to create a constant pull rate, say 1000 RPS.
Message Flow Control can be used to set the maximum number of messages being processed at a given time (i.e., setting max_messages in the case of the python client), which indirectly sets the maximum rate at which messages are received.
While it doesn’t allow you to directly set the exact number of messages received per second (that depends on the time it takes to process a message and the number of messages being processed), it should keep a spike in the publish rate from turning into a spike in the receive rate.
If you really need to set a rate in messages received per second, AFAIK that is not available directly in the client libraries, so you’d have to implement it yourself, using an asynchronous pull and timers to acknowledge the messages at your desired rate.
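As a rough illustration of the "implement it yourself" part, here is a sketch of a pacer that spaces out message handling to a fixed rate. It deliberately does not touch the Pub/Sub client; `Pacer` and `handle_at_steady_rate` are hypothetical names, and `handle` stands in for whatever processes and acknowledges a message in your subscriber.

```python
import time


class Pacer:
    """Space out calls so they happen at most `rate` times per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        self.next_slot = clock()

    def wait(self):
        """Block until the next slot is due, then reserve the following one."""
        now = self.clock()
        if now < self.next_slot:
            self.sleep(self.next_slot - now)
        # Schedule from the later of "now" and the planned slot, so that an
        # idle period does not turn into a burst afterwards.
        self.next_slot = max(now, self.next_slot) + self.interval


def handle_at_steady_rate(messages, handle, pacer):
    """Process messages one by one, never faster than the pacer allows."""
    for message in messages:
        pacer.wait()
        handle(message)  # in a real subscriber, the ack would happen here
```

With `Pacer(1000)` this caps handling at roughly 1000 messages per second. It only sets an upper bound: if messages arrive more slowly, the rate drops accordingly.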

how to make 1 million inserts in cassandra

I am parsing thousands of CSV files from my application, and for each parsed row I am making an insert into Cassandra. After it has run for a while, it stops at 2048 inserts and throws the BusyConnection error.
What's the best way for me to make about 1 million inserts?
Should I export the inserts as strings into a file, then run that file directly from CQL to make these massive inserts, so that I don't actually do it over the network?
We solve such issues using script(s).
The script goes through the input data and...
1. Each time, it takes a specific amount of data from the input.
2. It waits for a specific amount of time.
3. It continues reading and inserting data.
ad 1. For our configuration and data (max 10 columns with mostly numbers and short texts), we found that 500 to 1000 rows is optimal.
ad 2. We define the wait time as n * t, where n is the number of rows processed in a single run of the script and t is a time constant in milliseconds. The value of t strongly depends on your configuration; however, for us t = 70 ms is enough to make the process smooth.
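The batch-and-wait steps above could look roughly like this in Python. This is only a sketch: `throttled_load`, `chunks` and the `insert_row` callback are hypothetical names, and the defaults simply reuse the 500 rows and t = 70 ms mentioned above.

```python
import time
from itertools import islice


def chunks(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def throttled_load(rows, insert_row, batch_size=500, t_ms=70, sleep=time.sleep):
    """Insert rows in batches, waiting n * t after each batch of n rows."""
    total = 0
    for batch in chunks(rows, batch_size):
        for row in batch:
            insert_row(row)  # hypothetical callback that performs one insert
        total += len(batch)
        # Wait time is n * t, with t given in milliseconds.
        sleep(len(batch) * t_ms / 1000.0)
    return total
```

Both `batch_size` and `t_ms` are values you would tune for your own cluster.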
1 million requests is not really a big number; you can load it from cqlsh using the COPY FROM command. But you can load this data via your Java code as well.
From the error message, it looks like you're using the asynchronous API. You can use it for high-performance inserts, but you need to control how many requests are processed at the same time (so-called in-flight requests).
There are several aspects here:
Starting with version 3 of the protocol, you may have up to 32k in-flight requests per connection, instead of the 1024 that is used by default. You can configure this when creating the Cluster object.
You need to control how many requests are in flight by wrapping session.executeAsync with some counter, for example, like in this example (not the best, because it limits the total requests per session, not the requests per connection to individual hosts; that would require much more logic, especially around token-aware requests).
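To illustrate the counter idea in a language-neutral way, here is a Python sketch that uses a semaphore as the counter. `InFlightLimiter` is a hypothetical name, and `execute_async` is assumed to return an object with `add_done_callback`, as `concurrent.futures.Future` does; a real driver's future type may expose a different callback API. Like the linked example, it limits the total per session, not per host.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class InFlightLimiter:
    """Allow at most `limit` asynchronous requests to be outstanding at once."""

    def __init__(self, execute_async, limit):
        self._execute_async = execute_async  # must return a future-like object
        self._slots = threading.BoundedSemaphore(limit)

    def submit(self, *args):
        self._slots.acquire()  # blocks the producer while the limit is reached
        try:
            future = self._execute_async(*args)
        except Exception:
            self._slots.release()
            raise
        # Free the slot when the request finishes, whether it succeeded or not.
        future.add_done_callback(lambda _f: self._slots.release())
        return future
```

For a quick local experiment you can pass `lambda row: pool.submit(insert, row)` as `execute_async`, where `pool` is a `ThreadPoolExecutor` and `insert` is your own function.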

Library design methodology

I want to make a "TRAP AGENT" library. The trap agent library keeps track of various parameters of the client system. If a parameter of the client system rises above its threshold, the trap agent library on the client side notifies the server about that parameter. For example, if CPU usage exceeds the threshold, it notifies the server that CPU usage has been exceeded. I have to measure 50-100 parameters (like memory usage, network usage, etc.) on the client side.
Now I have the basic idea of the design, but I am stuck on the overall library design.
I have thought of the solutions below:
1. I can create a thread for each parameter (i.e. each thread will monitor a single parameter).
2. I can create a process for each parameter (i.e. each process will monitor a single parameter).
3. I can classify the parameters into groups (data usage parameters fall into the network group, CPU and memory usage parameters fall into the system group, and so on) and then create a thread for each group.
The 1st solution looks good compared to the 2nd. But if I adopt the 1st solution, it may fail when I want to upgrade my library from 100 to 1000 parameters, because I would have to create 1000 threads, which is not good design (I think so; correct me if I am wrong).
The 3rd solution is good, but response time will be high since many parameters will be monitored in a single thread.
Is there any better approach?
In general, it's a bad idea to spawn threads 1-to-1 for any logical mapping in your code. You can quickly exhaust the available threads of the system.
In .NET this is very elegantly handled using thread pools:
Thread vs ThreadPool
Here is a C++ discussion, but the concept is the same:
Thread pooling in C++11
Processes are also high overhead on Windows. Both designs sound like they would ironically be quite taxing on the very resources you are trying to monitor.
Threads (and processes) give you parallelism where you need it. For example, letting the GUI be responsive while some background task is running. But if you are just monitoring in the background and reporting to a server, why require so much parallelism?
You could just run each check, one after the other, in a tight event loop in a single thread. If you are worried about not sampling the values as often, I'd say that's actually a benefit. It doesn't help to consume 50% CPU to monitor your CPU. If you are spot-checking values once every few seconds, that is probably fine resolution.
In fact, high resolution is of no help if you are reporting to a server. You don't want to denial-of-service-attack your server by making an HTTP call to it multiple times a second once some value triggers.
NOTE: this doesn't mean you can't have a pluggable architecture. You could create some base class that represents checking a resource and then create subclasses for each specific type. Your event loop could iterate over an array or list of objects, calling each one successively and aggregating the results. At the end of the loop you report back to the server if any are out of range.
You may want to add logic to stop checking (or at least stop reporting back to the server) for some "cool down period" once a trap hits. You don't want to tax your server or spam your logs.
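Here is a small Python sketch of that single-threaded, pluggable design with a cool-down. All class and function names (`Check`, `CallableCheck`, `run_once`, `monitor`) are invented for illustration, and the 5-second interval and 60-second cool-down are arbitrary assumptions.

```python
import time


class Check:
    """Base class for one monitored parameter (hypothetical design)."""

    def __init__(self, name, threshold):
        self.name = name
        self.threshold = threshold

    def read(self):
        raise NotImplementedError

    def exceeded(self):
        return self.read() > self.threshold


class CallableCheck(Check):
    """A check whose value comes from any zero-argument function."""

    def __init__(self, name, threshold, reader):
        super().__init__(name, threshold)
        self.reader = reader

    def read(self):
        return self.reader()


def run_once(checks, last_report, now, cool_down=60.0):
    """Run every check once; return names to report, honouring the cool-down."""
    to_report = []
    for check in checks:
        if not check.exceeded():
            continue
        previous = last_report.get(check.name)
        if previous is None or now - previous >= cool_down:
            last_report[check.name] = now
            to_report.append(check.name)
    return to_report


def monitor(checks, report, interval=5.0, cool_down=60.0):
    """Loop forever in one thread, sending at most one report per pass."""
    last_report = {}
    while True:
        names = run_once(checks, last_report, time.monotonic(), cool_down)
        if names:
            report(names)  # e.g. a single call to the server for the batch
        time.sleep(interval)
```

Adding a new parameter then means adding one more `Check` object to the list, not another thread.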
You can follow the methodology below:
1. You can have two threads: one thread is dedicated to measuring emergency parameters and the second thread monitors non-emergency parameters.
Hence response time for emergency parameters will be lower.
2. You can define 3 threads. The first thread monitors the high-priority (emergency) parameters, the second thread monitors the intermediate-priority parameters, and the last thread monitors the lowest-priority parameters.
So overall response time will be improved compared to the first solution.
3. If response time is not a concern, you can monitor all the parameters in a single thread. But in this case response time becomes worse when you upgrade your library to monitor 100 to 1000 parameters.
So in the 1st case there will be a longer response time for non-emergency parameters, while in the 3rd case there will definitely be a very high response time.
So solution 2 is better.

Limit open connections at any one time, maintain rate of X users per second?

I'm writing a Gatling load test which simply bombards a given endpoint over HTTP for a given period of time. I have it gradually ramp up connections per second, and then hold it there for the duration of the test. My setup looks like this:
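setUp(
  scn.inject(
    // Ramp from 10 to 70 new users per second over one minute, then hold 70/s.
    rampUsersPerSec(10) to 70 during (1 minute),
    constantUsersPerSec(70) during (9 minutes)
  // Cap throughput at 70 requests per second for the whole run.
  ).protocols(httpConf).throttle(jumpToRps(70), holdFor(10 minutes))
)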
This works, but the problem is that our requests take a long time, sometimes much longer than a second.
What ends up happening is that the server slows down and requests start taking longer and longer, and instead of maintaining 70 connections to the server at a time, this quickly grows linearly and I'll have something like 1000 open connections at any given time.
Is there a way to "limit the pool" of Gatling users to maintain X open connections at a given time? I've so far been unsuccessful in trying to throttle it.
What you want is a closed injection model.
In order to do that with Gatling, you have to wrap your scenario content in a loop, and possibly flush the HTTP caches and cookie jars. Search the docs.
Note that this model is nowhere near realistic, unless your system indeed limits the number of users that it lets in, with an upfront queue. A typical use case is a call center.
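To make the closed model concrete, here is a tiny Python simulation. It is not Gatling code; `run_closed_model` and `do_request` are hypothetical names. A fixed number of virtual users each loop over the request, so the number of open requests can never exceed the number of users, however slow the server gets.

```python
import threading
import time


def run_closed_model(users, duration, do_request):
    """Run `users` looping workers for `duration` seconds; return request count."""
    deadline = time.monotonic() + duration
    count_lock = threading.Lock()
    completed = [0]

    def virtual_user():
        # Each user sends its next request only after the previous one returns,
        # so at most `users` requests are ever open at the same time.
        while time.monotonic() < deadline:
            do_request()
            with count_lock:
                completed[0] += 1

    threads = [threading.Thread(target=virtual_user) for _ in range(users)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return completed[0]
```

The trade-off is the one noted above: throughput now depends on response time, so when the server slows down, the request rate drops and no queue of open connections builds up.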

Handling multiple calls to BeginExecuteNonQuery in SQL Server 2008

I have an application that is receiving a high volume of data that I want to store in a database. My current strategy is to fire off an asynchronous call (BeginExecuteNonQuery) with each record when it's ready. I'm using the asynchronous call to ensure that the rest of the application runs smoothly.
The problem I have is that as the volume of data increases, eventually I get to the point where I'm trying to fire a command down the connection while it's still in use. I can see two possible options:
1. Buffer the pending data myself until the existing command is finished.
2. Open multiple connections as needed.
I'm not sure which of these options is best, or if in fact there is a better way. Option 1 will probably lead to my buffer getting bigger and bigger, while option 2 may be very bad form - I just don't know.
Any help would be appreciated.
Depending on your locking strategy, it may be worth using several connections, but certainly not a number "without upper bounds". So a good strategy/pattern to use here is a "thread pool", with each of N dedicated threads holding a connection and picking up a write request as soon as it finishes the previous one. The number of threads in the pool for best performance is best determined empirically, by benchmarking various possibilities in a realistic experimental/prototype setting.
If the "buffer" queue (in which your main thread queues write requests and from which the dedicated threads in the pool pick them up) grows beyond a certain threshold, it means you're getting data faster than you can possibly write it out, so, unless you can get more resources, you'll simply have to drop some of the incoming data, maybe by a random-sampling strategy to avoid biasing future statistical analysis. Just count how much you're writing and how much you're having to drop due to the resource shortage in each period of time (say every minute or so), so you can use "stratified sampling" techniques in future data-mining explorations.
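A sketch of that pattern in Python, with assumptions spelled out: `BoundedWriter` is a hypothetical class, `write` stands in for a function that writes one record (in the real application each worker would hold its own database connection), and overflow is handled by simply dropping the newest record instead of by random sampling.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a worker to exit


class BoundedWriter:
    """N worker threads drain a bounded queue; overflow is dropped and counted."""

    def __init__(self, write, workers=4, max_pending=1000):
        self._write = write  # hypothetical function that writes one record
        self._queue = queue.Queue(maxsize=max_pending)
        self._lock = threading.Lock()
        self.written = 0
        self.dropped = 0
        self._threads = [
            threading.Thread(target=self._worker, daemon=True)
            for _ in range(workers)
        ]
        for thread in self._threads:
            thread.start()

    def _worker(self):
        while True:
            record = self._queue.get()
            if record is _STOP:
                return
            self._write(record)
            with self._lock:
                self.written += 1

    def submit(self, record):
        """Queue a record without blocking; return False if it was dropped."""
        try:
            self._queue.put_nowait(record)
            return True
        except queue.Full:
            with self._lock:
                self.dropped += 1
            return False

    def close(self):
        """Let workers finish the queued records, then stop them."""
        for _ in self._threads:
            self._queue.put(_STOP)
        for thread in self._threads:
            thread.join()
```

The `written` and `dropped` counters are the per-period counts mentioned above; you would read and reset them every minute or so.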
Thanks Alex - so you'd suggest a hybrid method then, assuming that I'll still need to buffer updates if all connections are in use?
(I'm the original poster, I've just managed to get two accounts without realizing)

Resources