I'm currently building a small virtual machine in C that models an old 16-bit CPU, which ran at a very slow clock speed (a few hundred kHz). How would I throttle the speed at which the virtual machine processes opcodes and so on? Or would I even want to?
As I said in the comments, I suggest using some sort of timer mechanism.
If you would like to match a certain speed, here is how I would do it:
1 kHz * (1000 Hz / 1 kHz) * ((1/s) / 1 Hz) = 1000/s, therefore 1 kHz = 1000/s
which means 1000 operations occur every second, so take the reciprocal to find the amount of time between operations: 1/1000 s, or 1 ms.
So let's say you want to match 125 kHz:
125 kHz * (1000 Hz / 1 kHz) * ((1/s) / 1 Hz) = 125000/s, therefore 125 kHz = 125000/s
so the time between operations is 1/125000 s, or 0.008 ms, or 8000 ns.
Hope this helps!
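A minimal sketch of the arithmetic above in C, assuming the VM keeps a running count of emulated cycles. The function names are made up for illustration and are not from any library.

```c
#include <assert.h>
#include <stdint.h>

#define NS_PER_SECOND UINT64_C(1000000000)

/* Nanoseconds per emulated clock cycle. Integer division truncates, so a
   rate that does not divide 1e9 evenly loses a fraction of a nanosecond. */
uint64_t cycle_period_ns(uint64_t clock_hz)
{
    assert(clock_hz > 0);
    return NS_PER_SECOND / clock_hz;
}

/* Host time, in ns since the VM started, by which `cycles` emulated cycles
   should have completed. Working from the total cycle count keeps the
   rounding error from accumulating instruction by instruction. */
uint64_t deadline_ns(uint64_t cycles, uint64_t clock_hz)
{
    assert(clock_hz > 0);
    uint64_t whole_seconds = cycles / clock_hz;
    uint64_t remainder = cycles % clock_hz;
    return whole_seconds * NS_PER_SECOND + remainder * NS_PER_SECOND / clock_hz;
}
```

One way to use this would be to run a batch of instructions, add their cycle counts to a running total, and then sleep (for example with POSIX clock_nanosleep) until deadline_ns(total, 125000) has passed on the host clock. Sleeping after every single instruction is likely to be too coarse, since host sleeps are rarely accurate to a few microseconds.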
Following is the CPU info for a Cortex-A53 embedded target.
How can I tell whether this CPU supports 256-bit vectors (e.g. float32x8)?
Thank you,
Zvika
sidekiq#z3u:~$ cat /proc/cpuinfo
processor : 0
BogoMIPS : 200.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4
sidekiq#z3u:~$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Vendor ID: ARM
Model: 4
Model name: Cortex-A53
Stepping: r0p4
CPU max MHz: 1199.9990
CPU min MHz: 299.9990
BogoMIPS: 200.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
sidekiq#z3u:~$ cpufreq-info
cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009
Report errors and bugs to cpufreq#vger.kernel.org, please.
analyzing CPU 0:
driver: cpufreq-dt
CPUs which run at the same hardware frequency: 0 1 2 3
CPUs which need to have their frequency coordinated by software: 0 1 2 3
maximum transition latency: 500 us.
hardware limits: 300 MHz - 1.20 GHz
available frequency steps: 300 MHz, 400 MHz, 600 MHz, 1.20 GHz
available cpufreq governors: performance
current policy: frequency should be within 300 MHz and 1.20 GHz.
The governor "performance" may decide which speed to use
within this range.
current CPU frequency is 1.20 GHz.
cpufreq stats: 300 MHz:0.00%, 400 MHz:0.00%, 600 MHz:0.00%, 1.20 GHz:100.00%
analyzing CPU 1:
driver: cpufreq-dt
CPUs which run at the same hardware frequency: 0 1 2 3
CPUs which need to have their frequency coordinated by software: 0 1 2 3
maximum transition latency: 500 us.
hardware limits: 300 MHz - 1.20 GHz
available frequency steps: 300 MHz, 400 MHz, 600 MHz, 1.20 GHz
available cpufreq governors: performance
current policy: frequency should be within 300 MHz and 1.20 GHz.
The governor "performance" may decide which speed to use
within this range.
current CPU frequency is 1.20 GHz.
cpufreq stats: 300 MHz:0.00%, 400 MHz:0.00%, 600 MHz:0.00%, 1.20 GHz:100.00%
analyzing CPU 2:
driver: cpufreq-dt
CPUs which run at the same hardware frequency: 0 1 2 3
CPUs which need to have their frequency coordinated by software: 0 1 2 3
maximum transition latency: 500 us.
hardware limits: 300 MHz - 1.20 GHz
available frequency steps: 300 MHz, 400 MHz, 600 MHz, 1.20 GHz
available cpufreq governors: performance
current policy: frequency should be within 300 MHz and 1.20 GHz.
The governor "performance" may decide which speed to use
within this range.
current CPU frequency is 1.20 GHz.
cpufreq stats: 300 MHz:0.00%, 400 MHz:0.00%, 600 MHz:0.00%, 1.20 GHz:100.00%
analyzing CPU 3:
driver: cpufreq-dt
CPUs which run at the same hardware frequency: 0 1 2 3
CPUs which need to have their frequency coordinated by software: 0 1 2 3
maximum transition latency: 500 us.
hardware limits: 300 MHz - 1.20 GHz
available frequency steps: 300 MHz, 400 MHz, 600 MHz, 1.20 GHz
available cpufreq governors: performance
current policy: frequency should be within 300 MHz and 1.20 GHz.
The governor "performance" may decide which speed to use
within this range.
current CPU frequency is 1.20 GHz.
cpufreq stats: 300 MHz:0.00%, 400 MHz:0.00%, 600 MHz:0.00%, 1.20 GHz:100.00%
How can I know if this CPU supports 256-bit vectors?
It doesn't.
It supports NEON (the asimd entry in the Features list) which is 128-bit only.
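If you want to make the same check from a program, here is a small sketch in C that looks for a whole-word entry in a Features line such as the one shown in the question. The function name is hypothetical, and the example assumes that a kernel reporting SVE (the AArch64 extension that can provide vectors wider than 128 bits) would list it as `sve`. Even then, the presence of `sve` alone would not tell you the vector length.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if `name` appears as a complete space-separated entry in
   `features`, e.g. the text after "Features :" in /proc/cpuinfo.
   A partial match such as "sha" inside "sha1" does not count. */
bool has_cpu_feature(const char *features, const char *name)
{
    assert(features != NULL && name != NULL);
    size_t len = strlen(name);
    if (len == 0)
        return false;

    const char *p = features;
    while ((p = strstr(p, name)) != NULL) {
        bool starts_entry = (p == features) || p[-1] == ' ' || p[-1] == '\t';
        bool ends_entry = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';
        if (starts_entry && ends_entry)
            return true;
        p++; /* keep searching after this partial match */
    }
    return false;
}
```

With the Features line from the question, has_cpu_feature(line, "asimd") is true and has_cpu_feature(line, "sve") is false.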
I have implemented a webcrawler with libcurl and libev. My intention was to make a high performance crawler that uses all available bandwidth. I have succeeded in making a crawler that can sustain over 10,000 parallel connections. However, the stats for bandwidth usage are not all that impressive. Here is some example output from vnstat:
rx | tx
--------------------------------------+------------------
bytes 32.86 GiB | 3.12 GiB
--------------------------------------+------------------
max 747.99 Mbit/s | 25.73 Mbit/s
average 15.69 Mbit/s | 1.49 Mbit/s
min 2.62 kbit/s | 12.29 kbit/s
--------------------------------------+------------------
packets 33015363 | 23137442
--------------------------------------+------------------
max 68804 p/s | 28998 p/s
average 1834 p/s | 1285 p/s
min 5 p/s | 5 p/s
--------------------------------------+------------------
time 299.95 minutes
As you can see my average download speed is only 15.69 Mbps while the network bandwidth can support much more. I do not understand why the application is downloading so slowly and yet still maintaining over 10K connections in parallel. Is this something to do with the URLs that are being downloaded? If I repeatedly download www.google.com, www.yahoo.com and www.bing.com I can achieve speeds of up to 7 Gbps. With general crawling though the speed is as shown above.
Any thoughts or ideas?
I am working with a 28C16 2 KB parallel EEPROM. It has 11 address pins to select which of the 2048 bytes we want to work with, and 8 I/O pins for reading or writing that byte. There is an OE (output enable) pin which, when grounded, outputs the selected byte on the 8 I/O pins. Similarly, there is a WE (write enable) pin which, when given a low pulse less than 1 microsecond wide, writes the data on the I/O pins to the selected byte. The datasheet for this chip says that the width of the pulse on the WE pin must be between 100 and 1000 nanoseconds. The problem is that I want to use an Arduino to program this chip. How can I generate a 100-1000 nanosecond pulse with an Arduino? The shortest delay on an Arduino is 1 microsecond (1000 ns), plus the time taken by the digitalWrite and digitalRead functions (working with the ports directly still takes more than 120 ns on top of that), so it exceeds 1 microsecond. Is there any way to generate a pulse less than one microsecond wide?
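A rough way to reason about the timing window is to convert it into CPU clock cycles. The sketch below assumes a 16 MHz clock (common on Arduino boards, but check yours), where one cycle is 62.5 ns. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest whole number of CPU cycles that lasts at least min_ns. */
uint32_t min_cycles_for_pulse(uint32_t cpu_hz, uint32_t min_ns)
{
    assert(cpu_hz > 0);
    uint64_t cycle_ns_product = (uint64_t)min_ns * cpu_hz;
    /* ceil(min_ns * cpu_hz / 1e9), done in 64-bit integer math */
    return (uint32_t)((cycle_ns_product + 999999999u) / 1000000000u);
}

/* Largest whole number of CPU cycles that lasts at most max_ns. */
uint32_t max_cycles_for_pulse(uint32_t cpu_hz, uint32_t max_ns)
{
    assert(cpu_hz > 0);
    uint64_t cycle_ns_product = (uint64_t)max_ns * cpu_hz;
    return (uint32_t)(cycle_ns_product / 1000000000u);
}
```

Under that 16 MHz assumption, a pulse between 2 and 16 cycles long falls inside the 100-1000 ns window, so the question becomes whether WE can be driven low and high again within that many cycles.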
I have just one GPU device, an Nvidia GTX 750. I ran a test that copies data from the CPU to the GPU in a single thread using clEnqueueWriteBuffer, and then repeated it using multiple threads. The result is that the multi-threaded version seems slower.
When using multiple threads, every thread has its own kernel, command queue and context, all created from the same device. So my question is: does the clEnqueueWriteBuffer call take some kind of lock per device? How can I reduce this effect?
Edit: if the workloads are too light for the hardware, multiple concurrent command queues can achieve better total bandwidth.
Like OpenGL, OpenCL gets faster when you batch multiple buffers into a single one. Even using a single OpenCL kernel parameter instead of multiple parameters is faster, because there is operating system/API overhead per operation. Moving bigger but fewer chunks is better.
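As an illustration of the batching idea, here is a host-side sketch in plain C that packs several small buffers into one staging buffer and records where each one starts. The function name and signature are hypothetical; no OpenCL calls are made here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies n source buffers back to back into dst and stores the starting
   offset of each one in offsets[]. Returns the total number of bytes
   packed, or 0 if dst (capacity dst_cap bytes) is too small. */
size_t pack_buffers(const void *const *srcs, const size_t *sizes, size_t n,
                    unsigned char *dst, size_t dst_cap, size_t *offsets)
{
    assert(srcs != NULL && sizes != NULL && dst != NULL && offsets != NULL);
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        if (sizes[i] > dst_cap - total)
            return 0; /* not enough room left for this buffer */
        offsets[i] = total;
        memcpy(dst + total, srcs[i], sizes[i]);
        total += sizes[i];
    }
    return total;
}
```

The packed region could then be sent with a single clEnqueueWriteBuffer call, with the offsets passed along so the kernel can find each piece.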
You could have bought two graphics cards that together are equivalent to a GTX 750, to use the bandwidth of multiple PCIe links (if your mainboard can provide two separate x16 slots).
PCIe lanes are two-way, so you can try to parallelize writes and reads, visualization and computation, compute and writes, compute and reads, or compute + write + read (of course only if they do not depend on each other, as in figure 1-a), if your algorithm has such stages and your graphics card can do it.
Once I tried divide and conquer on a big array, computing and sending each part to the GPU separately, and it took seconds. Now I compute with just a single call for writes and a single call for compute, and it takes only milliseconds.
Figure 1-a:
write iteration --- compute iteration ---- read iteration --- parallels
1 - - 1
2 1 - 2
3 2 1 3
4 3 2 3
5 4 3 3
6 5 4 3
if there is no dependency between iterations. If there is a dependency, then:
figure 1-b:
write iteration --- compute iteration ---- read iteration --- parallels
half of 1 - - 1
other half of 1 half of 1 - 2
half of 2 other half of 1 half of 1 3
other half of 2 half of 2 other half of 1 3
half of 3 other half of 2 half of 2 3
other half of 3 half of 3 other half of 2 3
If you need parallelization between batches of images with non-constant sizes:
cpu to gpu -------- gpu to gpu ----- compute ----- gpu to cpu
1,2,3,4,5 - - -
- 1,2,3 - -
- 4,5 1,2,3 -
- - 4,5 1,2,3
6,7,8,9 - - 4,5
10,11,12 6,7,8 - -
13,14 9,10,11 6,7 -
15,16,17,18 12,13,14 8,9,10 6
So I have my exam tomorrow. I missed a lecture, but I have a recording of the professor's lecture. During the lecture, the professor mentioned that we will need to know how timers work within embedded processors.
I have a basic understanding, but I am confused about the math. The professor said that, for example, he will give us a 12-bit timer running at some rate X. We will have to set the timer's initial value and wait for it to overflow. If we want the timer to wait for 3 milliseconds, what do we set the timer to?
In addition, the professor said that "it is a simple math, giving us clocks equal to 1,000,000 and the time will be 1000 so we can do divisions easily."
Can somebody please explain how exactly timers work and what I would need to do to get the math right?
Most embedded timers work as follows:
set timer to some initial value, T
start timer
timer decrements once per timer clock
when timer reaches 0 (or underflows beyond 0) an interrupt is generated
So the elapsed time will just be T / timer_clock_rate, where T is the initial timer value and timer_clock_rate will depend on how you've configured the timer.
So for example if you want a 3 ms delay and your timer clock rate is 1 MHz (i.e. the timer decrements once every 1 µs) then you need an initial timer value of 3000 (3000 x 1 µs = 3 ms).
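The same calculation as a small C sketch (the function name is made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Initial value T for a count-down timer: delay multiplied by the timer
   clock rate. The delay is given in microseconds, hence the division. */
uint32_t timer_ticks_for_delay(uint32_t timer_clock_hz, uint32_t delay_us)
{
    uint64_t ticks = (uint64_t)timer_clock_hz * delay_us / 1000000u;
    assert(ticks <= UINT32_MAX); /* result must fit the return type */
    return (uint32_t)ticks;
}
```

For a 1 MHz timer clock and a 3000 µs delay this returns 3000, matching the example above.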
EDIT: see also @Rev1.0's answer. Apparently AVR timers count up rather than down, while some other microcontroller families use count-down timers. The same general principle applies to both, but the initial constant that you load will differ depending on whether you are counting up or down.
While Paul's example may be sufficient for you to get the idea of how it works, it addresses the problem somewhat differently from what your question suggested.
The overflow you mention occurs when the timer counts past its maximum value. For a 12-bit timer the maximum value is 4095, so the overflow happens at 4096 (2^12). Given a clock of 1 MHz (1 µs per tick), you have to count to 3000 to get 3 ms, as Paul already pointed out.
So you would set the initial value of the timer to 4096 - 3000 = 1096 to get an overflow after 3 ms.
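A short C sketch of that start-value calculation for an up-counting timer that overflows after its all-ones value (the function name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Start count for an up counter of the given width so that it overflows
   after exactly `ticks` ticks: 2^bits - ticks. */
uint32_t overflow_preload(unsigned bits, uint32_t ticks)
{
    assert(bits >= 1 && bits <= 31);
    uint32_t modulus = (uint32_t)1 << bits; /* 4096 for a 12-bit timer */
    assert(ticks >= 1 && ticks <= modulus);
    return modulus - ticks;
}
```

overflow_preload(12, 3000) gives 1096, the value worked out above.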
Just follow the dimensions. X is cycles per second; assume that one tick is one cycle (the timer counts in ticks, so if there is a prescaler there is another adjustment here). And one second is 1000 milliseconds. So just arrange all of these dimensions so that they cancel out, leaving only the one you want:
(1 tick / 1 cycle) * (X cycles / 1 second) * (1 second / 1000 milliseconds) * (3 milliseconds)
Cancel out all the units that are divided by themselves. From grade school math, (almost) anything divided by itself is one; just apply that to the units, leaving:
(1 tick / 1) * (X / 1) * (1 / 1000) * 3 = (X * 3) / 1000 ticks
So whatever your timer clock source frequency is in cycles per second (Hz) is multiplied by 3 and divided by 1000. If that clock frequency is 1000000 then (1000000*3)/1000 = 3000.
That is how you can easily figure out what is multiplied and what is divided, works for every flavor of conversion. Miles per hour to kilometers per second, whatever.
Then just follow Rev1.0's answer or Paul R's.
Sometimes there is an N-1 thing you have to be aware of, which is either documented or something you can test for. For example, if you have a down counter, the docs will/should say whether the timer rolls over or interrupts WHEN it reaches zero or AFTER. Usually it is AFTER, when the actual rollover happens, so 3000 to 0 inclusive is 3001 counts and you are off by a little; for a system like that you need to program the timer with 2999 to get 3000 ticks. For an up counter it is usually one of two ways. Either it counts from zero to the programmed value, which is the same deal: count from 0 to 2999 to get 3000 counts, so you would probably program 2999 into the register, not 3000. Or the value you program is the start count, as Rev1.0 showed, and the rollover is after the all-ones value for whatever the register size is. In this case they tell you 12 bits, which is 0xFFF. Don't make the 0xFFF - 3000 mistake and get 1095; it is easy to do it the way Rev1.0 shows: (0xFFF+1) - 3000 = 4096 - 3000 = 1096, and that is your start count.
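To see the N-1 effect concretely, here is a tiny simulation in C of a down counter that fires on the tick AFTER it reads zero. This models only the behavior described above, not any particular chip, and the function name is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Counts how many ticks pass between loading `reload` and the rollover,
   for a down counter whose event happens on the tick after it reads 0. */
uint32_t ticks_until_rollover(uint32_t reload)
{
    assert(reload < UINT32_MAX); /* keep the tick count from wrapping */
    uint32_t count = reload;
    uint32_t ticks = 0;
    for (;;) {
        ticks++;
        if (count == 0)
            break;  /* this tick rolls the counter over from zero */
        count--;
    }
    return ticks;
}
```

Loading 3000 takes 3001 ticks to roll over, while loading 2999 takes the intended 3000.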
Same goes for prescalers. You have to read very carefully what the docs are saying: do you program a 2 to divide by 2, or program a 1 to divide by 2? What happens if you program a 0: is that a divide by 1, a divide by the max value, or an invalid divisor setting? Where does that fit into the dimensional analysis? In ticks / cycles. A prescaler that divides by 8 means that for every 8 cycles you get one tick, so that is 1 tick / 8 cycles.
Now sometimes you will have a system clock that is divided down for the peripheral clock, and the peripheral clock is what feeds the timer; then you may have a prescaler there as well. So it is n system clocks / 1 peripheral clock, then 1 peripheral clock / 1 timer clock, then 1 tick is M timer clocks. Get everything lined up so all but one unit cancels out and there you go.
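A sketch of the same chain in C, assuming one peripheral divider followed by one timer prescaler. Real parts differ, and the names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* ticks = sysclk_hz * (1 / periph_div) * (1 / prescaler) * (delay_ms / 1000)
   Each divider contributes one "1 / n" factor to the dimensional analysis. */
uint32_t ticks_with_dividers(uint32_t sysclk_hz, uint32_t periph_div,
                             uint32_t prescaler, uint32_t delay_ms)
{
    assert(periph_div > 0 && prescaler > 0);
    uint64_t numerator = (uint64_t)sysclk_hz * delay_ms;
    uint64_t denominator = (uint64_t)1000 * periph_div * prescaler;
    uint64_t ticks = numerator / denominator;
    assert(ticks <= UINT32_MAX); /* result must fit the return type */
    return (uint32_t)ticks;
}
```

For example, a 16 MHz system clock divided by 2 for the peripheral clock and then by 8 in the prescaler gives a 1 MHz tick rate, so 3 ms is again 3000 ticks.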
You can do this in reverse as well and will and should. From the numbers we have right now 3000 timer ticks is 3ms or 0.003 seconds. 3000/0.003 is ticks per second or cycles per second of 1000000.
But what if we had a different controller, or one with a clock we didn't know (or we have a crystal we know, but we suspect there is a prescaler somewhere that we can't find in the docs)? Let the timer roll over every 4096 ticks, for example, and measure that with a stopwatch or oscilloscope or something. Neither is that accurate, but it might give us a rough enough idea to figure out whether there is a prescaler, or what clock we are actually running on if there is a PLL multiplying the clock, etc. Say it was 0.0005 seconds for every 4096 timer ticks: 4096/0.0005 = 8192000 Hz. Now if the crystal/oscillator in the schematic, or the marking we can read off the part, says 16 MHz, that would make some sense: 4096/8000000 = 0.000512 seconds, and 8 MHz is half of 16 MHz, so your measurement is probably off by a smidge, and the clock has some tolerance as well and may be off by some amount. So you check the docs to see if there is an internal oscillator. If not, then you are probably running off the 16 MHz clock, but there is a divide by 2 that either is documented and you have not found it, or isn't documented (it happens sometimes), and your timer is running off system clock/2. Now you can use that number as X and figure out whatever you need.
Why not have the timer count to 1000 or some other number that is easier to compute? That is fine too, but it can take more work and more experiments. Sometimes you have a free-running timer that you cannot reset at some max or min count, and instead you can just say
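/* read_timer() returns the free-running count; bit 12 (0x1000) flips every 0x1000 ticks */
while(1)
{
    while((read_timer()&0x1000)==0) continue;   /* wait for bit 12 to go high */
    turn_gpio_on();
    while((read_timer()&0x1000)!=0) continue;   /* wait for bit 12 to go low */
    turn_gpio_off();
}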
Measure the on time or the off time and that is 0x1000 ticks. Hex is not as pretty a number as 1000 decimal when doing decimal math on your calculator, but it is pretty when you can simply use an AND operation with one bit to toggle the GPIO/LED.
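The reverse calculation from such a measurement can be sketched in C like this. The function names are made up, and the divider guess simply rounds the ratio to the nearest whole number.

```c
#include <assert.h>

/* Timer tick rate implied by a measurement: ticks counted / seconds taken. */
double measured_timer_hz(double ticks, double seconds)
{
    assert(seconds > 0.0);
    return ticks / seconds;
}

/* Nearest whole-number divider between a known source clock and the
   measured timer rate, to absorb small measurement errors. */
unsigned nearest_divider(double source_hz, double timer_hz)
{
    assert(timer_hz > 0.0);
    return (unsigned)(source_hz / timer_hz + 0.5);
}
```

With 4096 ticks measured in 0.0005 seconds and a 16 MHz crystal, the measured rate is about 8192000 Hz and the nearest divider is 2, which matches the reasoning above.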
The last point being: you can read the docs and schematics and think you know how it is working, but you should test your results. If you are off by some whole multiple, especially a power of two, then either your math was wrong by some whole number or there is a clock divisor somewhere in the chip that you don't know about or have not looked hard enough to find. It could also mean that you thought you had boosted the crystal speed to some other speed using the PLL, but there is a mistake there and the chip is not running at the desired speed.
(if this answer is useful to you then upvote Paul and Rev1.0 before upvoting me, thanks, just expanding on their answers).