What are the principles of low latency in trading applications? (C)

It seems that all the major investment banks use C++ on Unix (Linux, Solaris) for their low-latency/high-frequency server applications. How do people achieve low latency when trading high-frequency equities? Are there any books that teach how to achieve this?

http://g-wan.com/
http://www.zeromq.org
http://www.quantnet.com/forum/threads/low-latency-trading-system.3163/
Check these websites; you may get some ideas about low-latency programming.
But be aware that some of the code may not be standards-conformant, which means it may not last long
and may introduce bugs.

Related

Is there a difference between a real time system and one that is just deterministic?

At work we're discussing the design of a new platform and one of the upper management types said it needed to run our current code base (C on Linux) but be real time because it needed to respond in less than a second to various inputs. I pointed out that:
1. That requirement alone doesn't mean it needs to be "real time", just that it needs a faster clock and more streamlining in its interrupt handling.
2. One of the key points to consider is the OS being used. They wanted to stick with embedded Linux; I pointed out we would need an RTOS. Using stock Linux works against "real time" because of the kernel/user-space memory split: I/O is done via files and sockets, which introduce delay.
3. What we really need to determine is whether it needs to be deterministic (for example: needs to respond to input in <200 ms, 90% of the time).
Really, in my mind, if point 3 is true then it needs to be a real-time system, and then point 2 is the biggest consideration.
I felt confident answering, but then I was thinking about it later... What do others think? Am I on the right track here or am I missing something?
Is there any difference that I'm missing between a "real time" system and one that is just "deterministic"? And besides a RTC and a RTOS, am I missing anything major that is required to execute a true real time system?
Look forward to some great responses!
EDIT:
Got some good responses so far, looks like there's a little curiosity about my system and requirements so I'll add a few notes for those who are interested:
My company sells units in the tens of thousands, so I don't want to go overboard on price
Typically we sell a main processor board and an independent display. There's also an attached network of other CAN devices.
The board (currently) runs the devices and also acts as a web server, sending basic XML docs to the display for end users
This is where the requirements come in: management wants the display to be updated "quickly" (<1 s), but the true constraints IMO come from the devices that can be attached over CAN. These are frequently motor-controlled devices with requirements such as "must respond in less than 200 ms".
You need to distinguish between:
Hard realtime: there is an absolute limit on response time that must not be breached (a breach counts as a failure), e.g. when you are controlling robotic motors or medical devices, where missing a deadline could be catastrophic
Soft realtime: there is a requirement to respond quickly most of the time (perhaps 99.99%+), but it is acceptable for the time limit to be occasionally breached provided the response on average is very fast, e.g. realtime animation in a computer game - missing a deadline might cause a skipped frame but won't fundamentally ruin the gaming experience
Soft realtime is readily achievable in most systems as long as you have adequate hardware and pay sufficient attention to identifying and optimising the bottlenecks. With some tuning, it's even possible to achieve in systems that have non-deterministic pauses (e.g. the garbage collection in Java).
Hard realtime requires dedicated OS support (to guarantee scheduling) and deterministic algorithms (so that once scheduled, a task is guaranteed to complete within the deadline). Getting this right is hard and requires careful design over the entire hardware/software stack.
It is important to note that most business apps don't require either: in particular I think that targeting a <1sec response time is far away from what most people would consider a "realtime" requirement. Having said that, if a response time is explicitly specified in the requirements then you can regard it as soft realtime with a fairly loose deadline.
From the definition of the real-time tag:
A task is real-time when the timeliness of the activities' completion is a functional requirement and correctness condition, rather than merely a performance metric. A real-time system is one where some (though perhaps not all) of the tasks are real-time tasks.
In other words, if something bad will happen if your system responds too slowly to meet a deadline, the system needs to be real-time and you will need a RTOS.
A real-time system does not need to be deterministic: if the response time randomly varies between 50ms and 150ms but the response time never exceeds 150ms then the system is non-deterministic but it is still real-time.
You could try RTLinux or RTAI if you have enough time to experiment. With these, you keep the non-realtime applications on the Linux side, while the realtime applications are moved to the RTOS part. In that case, you might achieve a <1 second response time.
The advantages are:
A large amount of code can be re-used.
You can manually partition realtime and non-realtime tasks and try to achieve the <1 s response you desire.
Migration time should not be very high, since most of the code will stay on Linux.
Just as a side note, be careful about the hardware drivers that you might need to run on the realtime part.
The following architecture diagram of RTLinux might help you understand how this is possible.
It sounds like you're on the right track with the RTOS. Different RTOSes prioritize different things: robustness, speed, and so on. You will need to figure out whether you need a hard or soft RTOS and, based on that, how your scheduler is going to be driven. One thing is for sure: there is a serious difference between using a regular OS and an RTOS.
Note: for the truest real-time system, you will probably need hard event-based resolution, so that you can guarantee your processes will execute when you expect them to.
An RTOS, or real-time operating system, is designed for embedded applications. In a multitasking system that handles critical applications, the operating system must:
1. be deterministic in memory allocation,
2. allot CPU time to the different threads, tasks and processes,
3. have a preemptive kernel, so that a higher-priority task can interrupt a lower-priority one as soon as it becomes ready.
So normal Windows or Linux cannot be used.
Examples of RTOSes in embedded systems: satellites, Formula 1 cars, car navigation systems.
Embedded system: a system designed to perform a single function or a few dedicated functions.
A system with an RTOS can also be an embedded system, but an RTOS will naturally be used in real-time systems that need to perform many functions.
Real-time system: a system that can provide its output within a definite/predictable amount of time. This does not mean real-time systems are faster.
The difference between the two:
1. Normal embedded systems are not real-time systems.
2. Systems with an RTOS are real-time systems.

Raspberry Pi cluster, neural networks and brain simulation

Since the RBPI (Raspberry Pi) has very low power consumption and a very low production price, one could build a very big cluster from them. I'm not sure, but a cluster of 100,000 RBPIs would take little power and little room.
Now, I think it might not be as powerful as existing supercomputers in terms of FLOPS or other computing measurements, but could it allow better neural-network simulation?
I'm not sure if saying "1 CPU = 1 neuron" is a reasonable statement, but it seems valid enough.
So would such a cluster be more efficient for neural-network simulation, since it's far more parallel than other classical clusters?
Using Raspberry Pi itself doesn't solve the whole problem of building a massively parallel supercomputer: how to connect all your compute cores together efficiently is a really big problem, which is why supercomputers are specially designed, not just made of commodity parts. That said, research units are really beginning to look at ARM cores as a power-efficient way to bring compute power to bear on exactly this problem: for example, this project that aims to simulate the human brain with a million ARM cores.
http://www.zdnet.co.uk/news/emerging-tech/2011/07/08/million-core-arm-machine-aims-to-simulate-brain-40093356/ "Million-core ARM machine aims to simulate brain"
http://www.eetimes.com/electronics-news/4217840/Million-ARM-cores-brain-simulator "A million ARM cores to host brain simulator"
It's very specialist, bespoke hardware, but conceptually, it's not far from the network of Raspberry Pis you suggest. Don't forget that ARM cores have all the features that JohnB mentioned the Xeon has (Advanced SIMD instead of SSE, can do 64-bit calculations, overlap instructions, etc.), but sit at a very different MIPS-per-Watt sweet-spot: and you have different options for what features are included (if you don't want floating-point, just buy a chip without floating-point), so I can see why it's an appealing option, especially when you consider that power use is the biggest ongoing cost for a supercomputer.
Seems unlikely to be a good/cheap system to me.
Consider a modern Xeon CPU. It has 8 cores running at 5 times the clock speed, so just on that basis it can do 40 times as much work. Plus it has SSE, which seems suited to this application and will let it calculate 4 things in parallel. So we're up to maybe 160 times as much work. Then it has multithreading, can do 64-bit calculations, overlap instructions, etc. I would guess it would be at least 200 times faster for this kind of work.
Then finally, the results of at least 200 local "neurons" would be in local memory but on the raspberry pi network you'd have to communicate between 200 of them... Which would be very much slower.
I think the raspberry pi is great and certainly plan to get at least one :P But you're not going to build a cheap and fast network of them that will compete with a network of "real" computers :P
Anyway, the fastest hardware for this kind of thing is likely to be a graphics card GPU, as it's designed to run many copies of a small program in parallel. Or just program an FPGA with a few hundred copies of a "hardware" neuron.
GPUs and FPGAs do this kind of thing much better than a CPU. The Nvidia GPUs that support CUDA programming have in effect hundreds of separate processing units. Or at least they can use the evolution of the pixel pipelines (where the card could render multiple pixels in parallel) to produce huge increases in speed. A CPU offers a few cores that can carry out relatively complex steps; a GPU offers hundreds of threads that can carry out simple steps.
So for tasks where you have many simple threads, a single GPU will outperform a cluster of beefy CPUs (or a stack of Raspberry Pis).
However, for a cluster running something like Condor, which can be used for things like disease-outbreak modelling - where you run the same mathematical model millions of times with variable starting points (size of outbreak, wind direction, how infectious the disease is, etc.) - something like the Pi would be ideal, as you are generally looking for a full-blown CPU that can run standard code. http://research.cs.wisc.edu/condor/
Some well-known uses of this approach are SETI and Folding@home (the search for aliens and cancer research).
A lot of universities have a cluster like this, so I can see some of them trying the approach with multiple Raspberry Pis.
But for simulating neurons in a brain you need very low latency between the nodes; there are special OSes and applications that make multiple systems act as one. You also need special networks to link them together, giving latency between nodes of under 1 millisecond.
http://en.wikipedia.org/wiki/InfiniBand
The Raspberry Pi just will not manage this in any way.
So yes, I think people will make clusters out of them, and I think they will be very pleased - but mostly universities and small organisations. They aren't going to compete with the top supercomputers.
That said, we are going to get a few and test them against the current nodes in our cluster, to see how they compare with a desktop that has a dual-core 3.2 GHz CPU and costs £650! I reckon for that we could get 25 Raspberry Pis, and they will use much less power, so it will be interesting to compare. This will be for disease-outbreak modelling.
I am undertaking a large amount of neural network research in the area of chaotic time series prediction (with echo state networks).
Although I think using Raspberry Pis this way will offer little to no benefit over, say, a strong CPU or a GPU, I have been using a Raspberry Pi to manage the distribution of simulation jobs to multiple machines. The processing power of a big core will beat anything possible on the Raspberry Pi; on top of that, running multiple Pis in this configuration will generate large overheads of waiting for them to sync, data transfers, etc.
Due to the low cost and robustness of the Pi, I have it hosting the source of the network data, as well as mediating the jobs to the agent machines. It can also hard-reset and restart a machine if a simulation fails and takes the machine down with it, allowing for optimal uptime.
Neural networks are expensive to train, but very cheap to run. While I would not recommend using these (even clustered) to iterate over a learning set for endless epochs, once you have the weights, you can transfer the learning effort into them.
Used in this way, one raspberry pi should be useful for much more than a single neuron. Given the ratio of memory to cpu, it will likely be memory bound in its scale. Assuming about 300 megs of free memory to work with (which will vary according to OS/drivers/etc.) and assuming you are working with 8 byte double precision weights, you will have an upper limit on the order of 5000 "neurons" (before becoming storage bound), although so many other factors can change this and it is like asking: "How long is a piece of string?"
Some engineers at Southampton University built a Raspberry Pi supercomputer:
I have ported a spiking network (see http://www.raspberrypi.org/phpBB3/viewtopic.php?f=37&t=57385&e=0 for details) to the Raspberry Pi and it runs about 24 times slower than on my old Pentium-M notebook from 2005 with SSE and prefetch optimizations.
It all depends on the type of computing you want to do. If you are doing very numerically intensive algorithms without much memory movement between the processor caches and RAM, then a GPU solution is indicated. The middle ground is an Intel PC chip using the SIMD assembly-language instructions - you can still easily end up limited by the rate at which you can transfer data to and from RAM. For nearly the same cost you can get 50 ARM boards with, say, 4 cores and 2 GB RAM per board. That's 200 cores and 100 GB of RAM. The amount of data that can be shuffled between the CPUs and RAM per second is very high. It could be a good option for neural nets that use large weight vectors. Also, the latest ARM GPUs and the new Nvidia ARM-based chip (used in the slate tablet) have GPU compute as well.

OpenMP debug newbie questions

I am starting to learn OpenMP, running examples (with gcc 4.3) from https://computing.llnl.gov/tutorials/openMP/exercise.html in a cluster. All the examples work fine, but I have some questions:
How do I know on which nodes (or cores of each node) the different threads have been run?
In the case of nodes, what is the average transfer time, in microseconds or nanoseconds, for sending the info and getting it back?
What are the best tools for debugging OpenMP programs?
Best advice for speeding up real programs?
Typically your OpenMP program does not know, nor does it care, on which cores it is running. If you have a job management system that may provide the information you want in its log files. Failing that, you could probably insert calls to the environment inside your threads and check the value of some environment variable. What that is called and how you do this is platform dependent, I'll leave figuring it out up to you.
How the heck should I (or any other SOer) know? For an educated guess you'd have to tell us a lot more about your hardware, OS, run-time system, etc. The best answer to the question is the one you determine from your own measurements. I fear you may also be mistaken in thinking that information is sent around the computer -- in shared-memory programming, variables usually stay in one place (or at least you should think of them as staying in one place; the reality may be a lot messier, but also impossible to discern) and are not sent or received.
Parallel debuggers such as TotalView or DDT are probably the best tools. I haven't yet used Intel's debugger's parallel capabilities but they look promising. I'll leave it to less well-funded programmers than me to recommend FOSS options, but they are out there.
i) Select the fastest parallel algorithm for your problem. This is not necessarily the fastest serial algorithm made parallel.
ii) Test and measure. You can't optimise without data so you have to profile the program and understand where the performance bottlenecks are. Don't believe any advice along the lines that 'X is faster than Y'. Such statements are usually based on very narrow, and often out-dated, cases and have become, in the minds of their promoters, 'truths'. It's almost always possible to find counter-examples. It's YOUR code YOU want to make faster, there's no substitute for YOUR investigations.
iii) Know your compiler inside out. The rate of return (measured in code speed improvements) on the time you spent adjusting compilation options is far higher than the rate of return from modifying the code 'by hand'.
iv) One of the 'truths' that I cling to is that compilers are not terrifically good at optimising for use of the memory hierarchy on current processor architectures. This is one area where code modification may well be worthwhile, but you won't know this until you've profiled your code.
You cannot know: the partitioning of threads onto different cores is handled entirely by the OS. You mention nodes, but OpenMP is multi-thread (not multi-process) parallelization, which allows parallelization within one machine containing several cores. If you need parallelization across different machines, you have to use a multi-process system such as MPI (e.g. Open MPI).
The orders of magnitude of communication speed are:
between cores inside the same CPU: so fast it can be considered instantaneous
~10 GB/s for communications between two CPUs across a motherboard
~100-1000 MB/s for network communications between nodes, depending on the hardware
All the theoretical speeds should be specified in your hardware documentation. You should also run small benchmarks to learn what you will really get.
For OpenMP, gdb does the job well, even with many threads.
I work on extreme physics simulations on supercomputers; here are our daily aims:
use as little communication as possible between the threads/processes; 99% of the time it is communication that kills performance in parallel jobs
split the tasks optimally; machine load should be as close as possible to 100% all the time
test, tune, re-test, re-tune... parallelization is not at all a generic "miracle solution"; it generally needs some practical work to be efficient.

Is it possible to turn off the computer fan + the processor fan with programming?

I am trying to see whether software can damage hardware today or not. For that, I chose shutting off the computer's fans as a test (otherwise I would have preferred crashing the hard disk). I need to know the following:
Is it possible to do this on anyone's computer?
Can this be dangerous to the hardware?
Is C a good choice, or should I go for assembly language?
Thanks in advance.
First of all, the fans spin for a reason: complex electronics overheat rather quickly. It would be easier to look into a specialized tool that does this, such as SpeedFan for Windows.
On the other hand, if you really want to do it, there are ways - however, they are mostly vendor- and product-specific. For Acer laptops on Linux, see e.g. this - note that it's very low-level (involves BIOS calls) and if it breaks, you get to keep both parts.
The processor fan is usually controlled by hardware and cannot be controlled from software.
Some specialized fans may provide an API to do this, but it would be specific to that hardware only.
If you do so, you will likely trigger thermal alarms within a short time and the PC will shut down moments later. Still, there is the danger of irreversibly damaging the hardware.

How do I figure out which SOC or SDK board to use?

Basically, I'm working on a model of an automated vacuum cleaner. I have currently made a software simulation of it. How do I figure out which SoC or SDK board to use for the hardware implementation? My code is mostly written in C. Will this be compatible with the SDK provided by board manufacturers? How do I know what clock speed, memory, etc. the hardware will need?
I'm a software guy and have only basic knowledge about practical hardware implementations. Have some experience in programming the 8086 to carry out basic tasks.
You need to perform some kind of analysis of the required performance of your application. I'm certainly no expert in this, but questions that come to mind include:
How much performance do you need? Profile your application, and try to come up with some estimate of its minimum performance requirements, in e.g. MIPS.
Is your application code and/or data going to be large? Do you need a controller with 8 KB of code space and 100 bytes of RAM, or one with 1 MB of code and 128 KB of RAM? Somewhere in between? Where?
Do you need lots (tens) of I/O channels? With what characteristics? Is basic digital I/O on a handful of pins enough, or do you need 20 channels of 10-bit A/D conversion? PWM? Communications peripherals?
Followups:
Manufacturers will of course make sure that their customers can build and run software on their boards. They will provide either free compilers or (since embedded development is an industry and a very large market, after all) sell them as tools.
There are free development environments, often based around GNU's gcc compiler, for many low-end (and of course many medium and high-end, too) architectures.
You could, for instance, look through Atmel's range of AVR 8-bit controllers, they're very popular in the hobbyist world and easy to port C code to. Free compilers are available, and basic development boards are cheap.
