Why is my webcrawler throughput so low when parallel connections are over 10K? - c

I have implemented a webcrawler with libcurl and libev. My intention was to make a high performance crawler that uses all available bandwidth. I have succeeded in making a crawler that can sustain over 10,000 parallel connections. However, the stats for bandwidth usage are not all that impressive. Here is some example output from vnstat:
                                   rx      |         tx
  --------------------------------------+------------------
    bytes                   32.86 GiB   |       3.12 GiB
  --------------------------------------+------------------
      max                747.99 Mbit/s  |    25.73 Mbit/s
  average                 15.69 Mbit/s  |     1.49 Mbit/s
      min                  2.62 kbit/s  |    12.29 kbit/s
  --------------------------------------+------------------
  packets                    33015363   |      23137442
  --------------------------------------+------------------
      max                   68804 p/s   |     28998 p/s
  average                    1834 p/s   |      1285 p/s
      min                       5 p/s   |         5 p/s
  --------------------------------------+------------------
     time                299.95 minutes
As you can see, my average download speed is only 15.69 Mbit/s, while the network can support much more. I do not understand why the application is downloading so slowly while still maintaining over 10K connections in parallel. Is this something to do with the URLs that are being downloaded? If I repeatedly download www.google.com, www.yahoo.com and www.bing.com, I can achieve speeds of up to 7 Gbps. With general crawling, though, the speed is as shown above.
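For reference, here is a minimal sketch of how I pull per-transfer statistics out of the multi handle when each transfer completes (the function and the logging are illustrative, not my exact crawler code); the idea is to check CURLINFO_SPEED_DOWNLOAD and the name-lookup/connect times to see whether slow remote hosts, DNS, or small responses are dragging the average down:

#include <stdio.h>
#include <curl/curl.h>

/* Call this after curl_multi_socket_action() / curl_multi_perform()
 * to drain completed transfers and log why each one was fast or slow. */
static void check_finished_transfers(CURLM *multi)
{
    CURLMsg *msg;
    int msgs_left;

    while ((msg = curl_multi_info_read(multi, &msgs_left)) != NULL) {
        if (msg->msg != CURLMSG_DONE)
            continue;

        CURL *easy = msg->easy_handle;
        char *url = NULL;
        double speed = 0, size = 0, namelookup = 0, connect = 0, total = 0;

        curl_easy_getinfo(easy, CURLINFO_EFFECTIVE_URL, &url);
        curl_easy_getinfo(easy, CURLINFO_SPEED_DOWNLOAD, &speed);       /* bytes/s */
        curl_easy_getinfo(easy, CURLINFO_SIZE_DOWNLOAD, &size);         /* bytes   */
        curl_easy_getinfo(easy, CURLINFO_NAMELOOKUP_TIME, &namelookup); /* seconds */
        curl_easy_getinfo(easy, CURLINFO_CONNECT_TIME, &connect);
        curl_easy_getinfo(easy, CURLINFO_TOTAL_TIME, &total);

        fprintf(stderr,
                "%s: %.0f bytes, %.0f B/s, dns %.3fs, connect %.3fs, total %.3fs, result %d\n",
                url ? url : "?", size, speed, namelookup, connect, total,
                (int)msg->data.result);

        curl_multi_remove_handle(multi, easy);
        curl_easy_cleanup(easy);
    }
}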
Any thoughts or ideas?

Related

How to do parallel memory transfers from CPU to GPU with OpenCL?

I have just one GPU device, an Nvidia GTX 750. I ran a test that copies data from CPU to GPU in a single thread using clEnqueueWriteBuffer, and then did the same using multiple threads. The result is that multiple threads seem slower.
When using multiple threads, every thread has its own kernel/command queue/context, each created from the same device. So my question is: does the clEnqueueWriteBuffer call hold some lock for a single device? How can I reduce this effect?
Edit: if workloads are too light for the hardware, multiple concurrent command queues can achieve better total bandwidth.
Like OpenGL, OpenCL needs to batch multiple buffers into a single one to get faster; even using a single OpenCL kernel parameter instead of multiple parameters is faster, because there is operating system/API overhead per operation. Moving bigger but fewer chunks is better.
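To make this concrete, here is a rough, self-contained sketch (no error checking; the first GPU platform/device is assumed, and all names are made up for the example) that times many small clEnqueueWriteBuffer calls against one large call moving the same total amount of data:

/* build: gcc -O2 batch_write.c -lOpenCL -o batch_write */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <CL/cl.h>

#define CHUNK    (4 * 1024)   /* 4 KiB per small write        */
#define NCHUNKS  1024         /* 1024 chunks = 4 MiB in total */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    cl_platform_id plat;  cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, dev, 0, NULL);

    size_t total = (size_t)CHUNK * NCHUNKS;
    char *host = malloc(total);
    memset(host, 0, total);
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, total, NULL, NULL);

    /* Many small writes: one driver/OS round trip per chunk. */
    double t0 = now_sec();
    for (int i = 0; i < NCHUNKS; ++i)
        clEnqueueWriteBuffer(queue, buf, CL_FALSE, (size_t)i * CHUNK, CHUNK,
                             host + (size_t)i * CHUNK, 0, NULL, NULL);
    clFinish(queue);
    printf("%d x %d B writes: %.3f ms\n", NCHUNKS, CHUNK, (now_sec() - t0) * 1e3);

    /* One big write moving the same data: the overhead is paid once. */
    t0 = now_sec();
    clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, total, host, 0, NULL, NULL);
    clFinish(queue);
    printf("1 x %zu B write: %.3f ms\n", total, (now_sec() - t0) * 1e3);

    clReleaseMemObject(buf); clReleaseCommandQueue(queue);
    clReleaseContext(ctx); free(host);
    return 0;
}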
You could have bought two graphics cards that are equivalent to a GTX 750 when combined, to use the bandwidth of multiple PCIe links (if your mainboard can provide two x16 lanes separately).
PCIe lanes are two-way, so you can try to parallelize writes and reads, or parallelize visualization and computation, or parallelize compute and writes, or compute and reads, or compute + write + read (of course only if they do not depend on each other, as in figure 1-a), provided your algorithm has such independent work and your graphics card can do it.
Once I tried divide and conquer on a big array, computing and sending each part to the GPU separately; it took seconds. Now I compute with just a single call for writes and a single call for computes, and it takes only milliseconds.
Figure 1-a:
write iteration | compute iteration | read iteration | parallels
       1        |         -         |       -        |     1
       2        |         1         |       -        |     2
       3        |         2         |       1        |     3
       4        |         3         |       2        |     3
       5        |         4         |       3        |     3
       6        |         5         |       4        |     3
This is if there is no dependency between iterations. If there is a dependency, then:
Figure 1-b:
 write iteration | compute iteration | read iteration  | parallels
 half of 1       |         -         |        -        |     1
 other half of 1 |     half of 1     |        -        |     2
 half of 2       |  other half of 1  |    half of 1    |     3
 other half of 2 |     half of 2     | other half of 1 |     3
 half of 3       |  other half of 2  |    half of 2    |     3
 other half of 3 |     half of 3     | other half of 2 |     3
If you need parallelization between batches of images with non-constant sizes:
 cpu to gpu  | gpu to gpu | compute | gpu to cpu
 1,2,3,4,5   |     -      |    -    |     -
     -       |   1,2,3    |    -    |     -
     -       |    4,5     |  1,2,3  |     -
     -       |     -      |   4,5   |   1,2,3
 6,7,8,9     |     -      |    -    |    4,5
 10,11,12    |   6,7,8    |    -    |     -
 13,14       |  9,10,11   |   6,7   |     -
 15,16,17,18 |  12,13,14  | 8,9,10  |     6
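As a hedged sketch of the kind of overlap shown in figure 1-a, assuming the kernel, the two input/output buffer pairs and two in-order command queues (one for transfers, one for compute) have already been created elsewhere; the function name and parameters are made up for the example:

#include <CL/cl.h>

/* Double-buffer `nbatches` equally sized batches so the upload of batch i+1
 * and the readback of batch i-1 can overlap the compute of batch i.
 * Events express the dependencies between the two queues. */
static void pipeline(cl_command_queue q_xfer, cl_command_queue q_comp,
                     cl_kernel kernel, cl_mem in[2], cl_mem out[2],
                     const char *host_in, char *host_out,
                     size_t batch_bytes, size_t global_work, int nbatches)
{
    cl_event wrote, computed, readback[2] = {0, 0};

    for (int i = 0; i < nbatches; ++i) {
        int b = i & 1;                                /* which buffer pair   */
        const char *src = host_in  + (size_t)i * batch_bytes;
        char       *dst = host_out + (size_t)i * batch_bytes;

        /* Upload batch i (non-blocking, transfer queue). */
        clEnqueueWriteBuffer(q_xfer, in[b], CL_FALSE, 0, batch_bytes,
                             src, 0, NULL, &wrote);

        /* Kernel for batch i waits for its upload, and for the readback of
         * the batch that used the same out[] buffer two iterations ago. */
        cl_event deps[2]; cl_uint ndeps = 0;
        deps[ndeps++] = wrote;
        if (readback[b]) deps[ndeps++] = readback[b];
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &in[b]);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &out[b]);
        clEnqueueNDRangeKernel(q_comp, kernel, 1, NULL, &global_work, NULL,
                               ndeps, deps, &computed);

        /* Download batch i once its compute is done (transfer queue). */
        if (readback[b]) clReleaseEvent(readback[b]);
        clEnqueueReadBuffer(q_xfer, out[b], CL_FALSE, 0, batch_bytes,
                            dst, 1, &computed, &readback[b]);
        clReleaseEvent(wrote);
        clReleaseEvent(computed);
    }
    clFinish(q_xfer);
    clFinish(q_comp);
    if (readback[0]) clReleaseEvent(readback[0]);
    if (readback[1]) clReleaseEvent(readback[1]);
}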

Increase Beaglebone Black ADC sampling rate?

I'm working on a project that requires the use of a microcontroller, and for this reason, I decided to use the Beaglebone Black. I'm still new to the Beaglebone world and I'm facing some problems that I hope you guys can help me with.
In my project I will have to continuously read from all 7 analog input pins and do some processing accordingly. My question is: what would be the fastest programming language to do so (I must read as many samples as possible in a very short time!), and how can I increase the sampling rate from kHz to MHz?
I tried the following codes:
Javascript Code:
var b = require('bonescript'); // refers to my BeagleBone
var time = new Date();
b.analogRead("P9_39");
console.log(new Date() - time); // elapsed milliseconds for one read
This code simply performs one analog read and prints out the time needed to perform the read. Surprisingly, the result was 111 ms!! Which means that my sampling rate is about 10 samples per second, if I'm not wrong.
An alternative was to use Python:
import Adafruit_BBIO.ADC as ADC
import time
ADC.setup()
millis = int(round(time.time() * 1000))
ADC.read_raw("P9_39")
millis = int(round(time.time() * 1000)) - millis
print millis
This code took less time (4 ms), but still, if I wanted to read from the 7 analog input pins, I would only be able to read around 35 samples per second from each.
Using the terminal:
echo cape-bone-iio > /sys/devices/bone_capemgr.*/slots
time cat /sys/devices/ocp.3/helper.15/AIN0
############OR############
time cat /sys/devices/ocp.3/44e0d000.tscadc/tiadc/iio\:device0/in_voltage0_raw
and this took 50ms.
I want my sampling rate to be something in MHz. How can I do so? I know that the Beaglebone Black is capable of that but I could not find a clear way to do so. Any help is appreciated.
Thanks in advance.
The sampling rate of the AM335x ADC is 200 kSPS (link). This means you won't get into the MHz range with the stock BeagleBone Black ADC.
Getting something to work with a latency of 5 µs in a non-real-time OS like Linux is impossible. You will be at the mercy of the OS to schedule your execution thread. Other kernel threads will take priority and will preempt your thread, even if you assign it the highest scheduling priority.
From my experience with digital I/O on the BeagleBone Black, I started seeing missed frames at around 1K samples per second. It will depend on your level of tolerance for missing samples -- if you only need it to work semi-reliably you can probably squeeze out 10K samples per second by switching to C/C++ and increasing the priority of your process with a nice --10 ... command. However, if you cannot tolerate missed frames, you have to do one of these:
Bypass OS entirely and write C program for naked AM335x processor (no OS).
Use another hardware -- an ADC with a buffer to accumulate samples while your program is preempted.
Use the PRUSS processors on the BBB. They run at 200 MHz, so if you have a tight loop with e.g. 20 assembly instructions you would get a reliable sampling rate of 10 MHz -- that is, if you had a faster ADC in the first place; they would of course handle the stock 200 kHz ADC easily.
I personally went with option #3 and was happy to see my device perform sub-millisecond GPIO operations extremely reliably.
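To illustrate the C/C++ route from the answer above, here is a minimal sketch that times repeated sysfs reads of a single ADC channel (the sysfs path is copied from the question and may differ with your kernel and device tree; compile with gcc -O2 adc_bench.c -o adc_bench, adding -lrt on older glibc):

#include <stdio.h>
#include <time.h>
#include <fcntl.h>
#include <unistd.h>

#define ADC_PATH "/sys/devices/ocp.3/44e0d000.tscadc/tiadc/iio:device0/in_voltage0_raw"
#define NSAMPLES 1000

int main(void)
{
    char buf[16];
    int fd = open(ADC_PATH, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (int i = 0; i < NSAMPLES; ++i) {
        /* Rewind and re-read the same sysfs file to trigger a new conversion. */
        if (lseek(fd, 0, SEEK_SET) < 0 || read(fd, buf, sizeof(buf) - 1) < 0) {
            perror("read");
            return 1;
        }
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%d reads in %.3f s -> %.0f samples/s\n", NSAMPLES, secs, NSAMPLES / secs);

    close(fd);
    return 0;
}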
Use 127 BeagleBone Blacks plugged into 127 USB hub ports, break out Visual Basic, and write a USB program to automatically and sequentially fire the 127 BeagleBones one after the other and read the data into a textbox... You would get around 16 MHz / MSPS of consecutive ADC samples per fast CPU with, say, Windows 10... lyj2021
You may have overlapping data... but you can track this with each firing of each BeagleBone Black... consecutively...

Solr data import memory usage

I am running Solr 3.2 with 4 GB of memory. Whenever I start Solr, it does a full import of all the cores, and after that a delta import happens every 30 minutes. Among the 5 cores, 2 cores have around 1.6M records each. Full import for those 2 cores takes more than 20 hours and consumes all the memory. Because of the lack of memory, delta import does not happen for the other cores. This forces a restart of Solr whenever data is updated in the DB.
Since the memory is not released until a commit happens, I set the autocommit interval to 5 minutes for those 2 cores. Even so, memory usage is not reduced.
Is there any other configuration I can check?
Edit 1: My autocommit settings
<autoCommit>
    <maxDocs>25000</maxDocs>
    <maxTime>300000</maxTime>
</autoCommit>
Edit 2: jconsole
System information & values from jconsole
Operating System: Windows 7 6.1
Architecture: amd64
Number of processors: 1
Committed virtual memory: 2,618,508 kbytes
Total physical memory: 4,193,848 kbytes
Free physical memory: 669,472 kbytes
Total swap space: 9,317,672 kbytes
Free swap space: 2,074,072 kbytes
threads details from jconsole
Live threads: 201
Peak: 207
Daemon threads: 182
Total threads started: 2,770
Check https://wiki.apache.org/solr/SolrPerformanceProblems, especially the "Slow startup" part. I think that's your case.
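For reference, the "Slow startup" part is about hard commits not happening often enough, so the transaction log keeps growing. On Solr 4.x the usual remedy looks roughly like the snippet below, with openSearcher set to false so the frequent hard commits stay cheap; I am not sure how much of this carries over to 3.2 as in your setup, so treat it as a sketch rather than a drop-in config:

<autoCommit>
    <maxTime>300000</maxTime>
    <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
    <maxTime>60000</maxTime>
</autoSoftCommit>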

UDP sendto performance over loopback

Background
I have a very high throughput / low latency network app (goal is << 5 usec per packet) and I wanted to add some monitoring/metrics to it. I have heard about the statsd craze, and it seems a simple way to collect metrics and feed them into our time-series database. Sending metrics is done via a small UDP packet written to a daemon (typically running on the same server).
I wanted to characterize the effects of sending ~5-10 udp packets in my data path to understand how much latency it would add and was surprised at how bad it is. I know this is a very obscure micro-benchmark but just wanted to get a rough idea on where it lands.
The question I have
I am trying to understand why it takes so long (relatively speaking) to send a UDP packet to localhost versus a remote host. Are there any tweaks I can make to reduce the latency of sending a UDP packet? I am thinking the solution for me is to push metric collection to an auxiliary core, or to actually run the statsd daemon on a separate host.
My setup/benchmarks
CentOS 6.5 with some beefy server hardware.
The client test program I have been using is available here: https://gist.github.com/rishid/9178261
Compiled with gcc 4.7.3: gcc -O3 -std=gnu99 -mtune=native udp_send_bm.c -lrt -o udp_send_bm
The receiver side is running nc -ulk 127.0.0.1 12000 > /dev/null (ip change per IF)
I have run this micro-benchmark on the following devices.
Some benchmark results:
loopback
Packet Size 500 // Time per sendto() 2159 nanosec // Total time 2.159518
integrated 1 Gb mobo controller
Packet Size 500 // Time per sendto() 397 nanosec // Total time 0.397234
intel ixgbe 10 Gb
Packet Size 500 // Time per sendto() 449 nanosec // Total time 0.449355
solarflare 10 Gb with userspace stack (onload)
Packet Size 500 // Time per sendto() 317 nanosec // Total time 0.317229
Writing to loopback will not be an efficient way to communicate inter-process for profiling. Generally the buffer will be copied multiple times before it's processed, and you run the risk of dropping packets since you're using UDP. You're also making additional calls into the operating system, so you add the risk of context switching (~2 µs).
goal is << 5 usec per packet
Is this a hard real-time requirement, or a soft requirement? Generally when you're handling things in microseconds, profiling should be zero overhead. You're using Solarflare, so I think you're serious. The best way I know to do this is to tap into the physical line and sniff traffic for metrics. A number of products do this.
I/O to disk or to the network is very slow if you are incorporating it in a very tight (real-time) processing loop. A solution might be to offload the I/O to a separate, lower-priority task. Let the real-time loop pass the messages to the I/O task through a (preferably lock-free) queue.
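A rough sketch of that pattern in C11, assuming a single producer and a single consumer; the queue size, the metric format and the statsd address/port are made up for the example (compile with gcc -std=c11 -pthread):

#include <stdatomic.h>
#include <stdio.h>
#include <string.h>
#include <pthread.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

#define QSIZE 4096                    /* must be a power of two */
#define MSGLEN 64

static char ring[QSIZE][MSGLEN];
static atomic_uint head, tail;        /* head: producer, tail: consumer */

/* Fast path: called from the latency-sensitive loop, never blocks. */
static int metric_push(const char *msg)
{
    unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&tail, memory_order_acquire);
    if (h - t == QSIZE)
        return -1;                    /* queue full: drop the metric */
    strncpy(ring[h & (QSIZE - 1)], msg, MSGLEN - 1);
    ring[h & (QSIZE - 1)][MSGLEN - 1] = '\0';
    atomic_store_explicit(&head, h + 1, memory_order_release);
    return 0;
}

/* Slow path: lower-priority thread that owns the UDP socket. */
static void *metric_sender(void *arg)
{
    (void)arg;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in dst = {0};
    dst.sin_family = AF_INET;
    dst.sin_port = htons(8125);                       /* statsd default port */
    inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);

    for (;;) {
        unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
        if (t == atomic_load_explicit(&head, memory_order_acquire)) {
            usleep(100);                              /* nothing to send */
            continue;
        }
        const char *msg = ring[t & (QSIZE - 1)];
        sendto(fd, msg, strlen(msg), 0, (struct sockaddr *)&dst, sizeof(dst));
        atomic_store_explicit(&tail, t + 1, memory_order_release);
    }
    return NULL;
}

int main(void)
{
    pthread_t th;
    pthread_create(&th, NULL, metric_sender, NULL);
    for (int i = 0; i < 10; ++i)
        metric_push("requests:1|c");                  /* a statsd counter */
    sleep(1);                                         /* let the sender drain */
    return 0;
}

The hot path only copies a short string into the ring and never touches the socket; the sendto() calls, and the kernel work they imply, all happen on the lower-priority thread.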

Controlling Virtualized CPU's Clock Speed?

I'm currently building a small virtual machine in C modelling an old 16-bit CPU, which runs at a very slow clock speed (a few hundred kHz). How would I throttle the virtual machine's opcode processing speed, etc.? Or would I even want to?
As I said in the comments, I suggest using some sort of timer mechanism. If you would like to match a certain speed, here is how I would do it:
1 kHz   1000 Hz    1/s
----- * ------- * ------     therefore 1 kHz = 1000/s
  1      1 kHz     1 Hz

which means 1000 operations occur every second, so take the reciprocal to find the time between operations: 1/1000 s, i.e. 1 ms.
So let's say you want to match 125 kHz:

125 kHz   1000 Hz    1/s
------- * ------- * ------   therefore 125 kHz = 125000/s
   1       1 kHz     1 Hz

so the time between operations is 1/125000 s, i.e. 0.008 ms or 8000 ns.
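A minimal sketch of how that period could be applied in the emulator's main loop, using an absolute-deadline sleep so the error does not accumulate; step_cpu() is a stand-in for executing one opcode, and the batch size is an assumption:

#include <time.h>

#define CPU_HZ        125000    /* emulated clock: 125 kHz              */
#define BATCH_CYCLES  125       /* sleep once per 125 cycles, i.e. 1 ms */

static void step_cpu(void) { /* fetch, decode and execute one opcode here */ }

static void timespec_add_ns(struct timespec *t, long ns)
{
    t->tv_nsec += ns;
    while (t->tv_nsec >= 1000000000L) { t->tv_nsec -= 1000000000L; t->tv_sec++; }
}

int main(void)
{
    const long batch_ns = (long)BATCH_CYCLES * 1000000000L / CPU_HZ;  /* 1 ms */
    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);

    for (;;) {
        for (int i = 0; i < BATCH_CYCLES; ++i)
            step_cpu();

        /* Sleep until the absolute deadline for this batch; an absolute
         * deadline keeps the long-run rate at CPU_HZ even if individual
         * sleeps over- or undershoot. */
        timespec_add_ns(&next, batch_ns);
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
    return 0;
}

Sleeping once per batch of cycles rather than after every opcode keeps the timer overhead manageable: the OS can honor a 1 ms sleep far more easily than an 8 µs one.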
Hope this helps!

Resources