Cortex-A53: 256-bit vector - arm

Following is the info for the CPU in a Cortex-A53 embedded target.
How can I tell whether this CPU supports 256-bit vectors (e.g. float32x8)?
Thank you,
Zvika
sidekiq@z3u:~$ cat /proc/cpuinfo
processor : 0
BogoMIPS : 200.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4
sidekiq@z3u:~$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Vendor ID: ARM
Model: 4
Model name: Cortex-A53
Stepping: r0p4
CPU max MHz: 1199.9990
CPU min MHz: 299.9990
BogoMIPS: 200.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
sidekiq@z3u:~$ cpufreq-info
cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009
Report errors and bugs to cpufreq@vger.kernel.org, please.
analyzing CPU 0:
driver: cpufreq-dt
CPUs which run at the same hardware frequency: 0 1 2 3
CPUs which need to have their frequency coordinated by software: 0 1 2 3
maximum transition latency: 500 us.
hardware limits: 300 MHz - 1.20 GHz
available frequency steps: 300 MHz, 400 MHz, 600 MHz, 1.20 GHz
available cpufreq governors: performance
current policy: frequency should be within 300 MHz and 1.20 GHz.
The governor "performance" may decide which speed to use
within this range.
current CPU frequency is 1.20 GHz.
cpufreq stats: 300 MHz:0.00%, 400 MHz:0.00%, 600 MHz:0.00%, 1.20 GHz:100.00%
analyzing CPU 1:
driver: cpufreq-dt
CPUs which run at the same hardware frequency: 0 1 2 3
CPUs which need to have their frequency coordinated by software: 0 1 2 3
maximum transition latency: 500 us.
hardware limits: 300 MHz - 1.20 GHz
available frequency steps: 300 MHz, 400 MHz, 600 MHz, 1.20 GHz
available cpufreq governors: performance
current policy: frequency should be within 300 MHz and 1.20 GHz.
The governor "performance" may decide which speed to use
within this range.
current CPU frequency is 1.20 GHz.
cpufreq stats: 300 MHz:0.00%, 400 MHz:0.00%, 600 MHz:0.00%, 1.20 GHz:100.00%
analyzing CPU 2:
driver: cpufreq-dt
CPUs which run at the same hardware frequency: 0 1 2 3
CPUs which need to have their frequency coordinated by software: 0 1 2 3
maximum transition latency: 500 us.
hardware limits: 300 MHz - 1.20 GHz
available frequency steps: 300 MHz, 400 MHz, 600 MHz, 1.20 GHz
available cpufreq governors: performance
current policy: frequency should be within 300 MHz and 1.20 GHz.
The governor "performance" may decide which speed to use
within this range.
current CPU frequency is 1.20 GHz.
cpufreq stats: 300 MHz:0.00%, 400 MHz:0.00%, 600 MHz:0.00%, 1.20 GHz:100.00%
analyzing CPU 3:
driver: cpufreq-dt
CPUs which run at the same hardware frequency: 0 1 2 3
CPUs which need to have their frequency coordinated by software: 0 1 2 3
maximum transition latency: 500 us.
hardware limits: 300 MHz - 1.20 GHz
available frequency steps: 300 MHz, 400 MHz, 600 MHz, 1.20 GHz
available cpufreq governors: performance
current policy: frequency should be within 300 MHz and 1.20 GHz.
The governor "performance" may decide which speed to use
within this range.
current CPU frequency is 1.20 GHz.
cpufreq stats: 300 MHz:0.00%, 400 MHz:0.00%, 600 MHz:0.00%, 1.20 GHz:100.00%

How can I know if this CPU supports 256-bit vectors?
It doesn't.
It supports NEON (the asimd entry in the Features list), which is 128-bit only.

Related

Why is my webcrawler throughput so low when parallel connections are over 10K?

I have implemented a webcrawler with libcurl and libev. My intention was to make a high performance crawler that uses all available bandwidth. I have succeeded in making a crawler that can sustain over 10,000 parallel connections. However, the stats for bandwidth usage are not all that impressive. Here is some example output from vnstat:
rx | tx
--------------------------------------+------------------
bytes 32.86 GiB | 3.12 GiB
--------------------------------------+------------------
max 747.99 Mbit/s | 25.73 Mbit/s
average 15.69 Mbit/s | 1.49 Mbit/s
min 2.62 kbit/s | 12.29 kbit/s
--------------------------------------+------------------
packets 33015363 | 23137442
--------------------------------------+------------------
max 68804 p/s | 28998 p/s
average 1834 p/s | 1285 p/s
min 5 p/s | 5 p/s
--------------------------------------+------------------
time 299.95 minutes
As you can see my average download speed is only 15.69 Mbps while the network bandwidth can support much more. I do not understand why the application is downloading so slowly and yet still maintaining over 10K connections in parallel. Is this something to do with the URLs that are being downloaded? If I repeatedly download www.google.com, www.yahoo.com and www.bing.com I can achieve speeds of up to 7 Gbps. With general crawling though the speed is as shown above.
Any thoughts or ideas?

clock_gettime precision is only ms

I'm running QNX on a microcontroller board with a 25 MHz clock speed. Calling clock_gettime(CLOCK_REALTIME, &tp); gives me only ms precision. I'm aware that ns precision is not possible because of the low clock speed, but us should be possible. Do I have to set some configuration flags to get better precision?

How can I explain a slower execution when perf stat does not give a clue?

My program measures the time that it takes to execute a function 500 times (this time is on the order of 14 seconds) and reports the average time per execution. Since precise time measurement is important, I took much care to get rid of all possible sources of noise in time measurement:
The program is run under Ubuntu 14.04, with root privileges, nice -n -20, on a shielded CPU (with -k on option to cset shield).
Hyper-threading is disabled.
Memory is allocated only once to avoid context switches due to malloc.
A large memset and 500 warm-up function executions are performed before the timing starts in an attempt to obtain similar state of the data caches before the timer starts. The program is pretty small, so I am not that worried about the instruction cache.
The time is measured as the difference of two values returned by clock_gettime(CLOCK_MONOTONIC).
Here are the reported times from 5 consecutive runs (in ms, but the measured time is 500 times this time, so it's on the order of 14 sec): 28.77, 29.35, 28.74, 28.74, 29.79. Note that there are three very consistent timing results here (#1, #3 and #4). I am seeking help in understanding and eliminating the source of the outliers. Here is the report of perf stat for the first and the last runs:
First run:
29176.113027 task-clock (msec) # 0.999 CPUs utilized
596 context-switches # 0.020 K/sec
0 cpu-migrations # 0.000 K/sec
5,061 page-faults # 0.173 K/sec
104,825,303,791 cycles # 3.593 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
104,316,069,689 instructions # 1.00 insns per cycle
19,672,922,035 branches # 674.282 M/sec
346,005,888 branch-misses # 1.76% of all branches
Last run:
30003.678991 task-clock (msec) # 0.999 CPUs utilized
417 context-switches # 0.014 K/sec
0 cpu-migrations # 0.000 K/sec
4,945 page-faults # 0.165 K/sec
107,799,951,303 cycles # 3.593 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
104,310,601,483 instructions # 0.97 insns per cycle
19,671,642,624 branches # 655.641 M/sec
345,885,486 branch-misses # 1.76% of all branches
30.023661486 seconds time elapsed
I do not notice anything in these stats that would give a clue as to why the last run is slower than the first one. I will very much appreciate any help in this.
EDIT: it was noticed that the Instructions per Cycle number is different. It would be great to get to the root cause of this difference.

Solr data import memory usage

I am running Solr 3.2 with 4 GB of memory. Whenever I start Solr, it does a full import of all the cores, and after that a delta import happens every 30 minutes. Among the 5 cores, 2 cores have around 1.6M documents each. Full import for those 2 cores takes more than 20 hours and consumes all the memory. Because of the low memory, delta imports are not happening for the other cores. This leads to a restart of Solr whenever data is updated in the DB.
Since memory is not released until a commit happens, I have set the autocommit interval to 5 minutes for those 2 cores. Even so, memory usage is not reduced.
Is there any other configuration I can check?
Edit 1 My autocommit settings
<autoCommit>
<maxDocs>25000</maxDocs>
<maxTime>300000</maxTime>
</autoCommit>
Edit 2 jconsole
System information & values from jconsole
Operating System: Windows 7 6.1
Architecture: amd64
Number of processors: 1
Committed virtual memory: 2,618,508 kbytes
Total physical memory: 4,193,848 kbytes
Free physical memory: 669,472 kbytes
Total swap space: 9,317,672 kbytes
Free swap space: 2,074,072 kbytes
threads details from jconsole
Live threads: 201
Peak: 207
Daemon threads: 182
Total threads started: 2,770
Check https://wiki.apache.org/solr/SolrPerformanceProblems, especially the "Slow startup" section. I think that's your case.

Controlling Virtualized CPU's Clock Speed?

I'm currently building a small virtual machine in C modelling an old 16-bit CPU, which runs at a super slow clock speed (a few hundred kHz). How would I throttle the virtual machine's processing speed of opcodes, etc.? Or would I even want to?
As I said in the comments, I suggest using some sort of timer mechanism.
If you would like to match a certain speed, here is how I would do it:
1 kHz   1000 Hz    1/s
----- * ------- * -----   therefore 1 kHz = 1000/s
  1      1 kHz     1 Hz
which means 1000 operations occur every second, so take the reciprocal to find the amount of time between operations: 1/1000 s, or 1 ms.
So let's say you want to match 125 kHz:
125 kHz   1000 Hz    1/s
------- * ------- * -----   therefore 125 kHz = 125000/s
   1       1 kHz     1 Hz
so the time between operations is 1/125000 s, i.e. 0.008 ms or 8000 ns.
Hope this helps!
