Quantitative analysts, or "quants", predict the behavior of markets to maximize profits. I am interested in the software they use to accomplish this. Are there development platforms, libraries, languages, or data-mining suites specifically tailored to financial modeling?
Statistical Modeling:
First, there are statistical computing languages such as R, which is powerful, open source, and comes with lots of packages for analysis and plotting.
Several R packages relate specifically to finance:
http://www.quantmod.com/
https://www.rmetrics.org/
https://www.rmetrics.org/ebooks-tseries
Machine Learning and AI to train the system on past data:
Weka data mining suite: http://www.cs.waikato.ac.nz/ml/weka/
libsvm (a library for support vector machine classifiers): http://www.csie.ntu.edu.tw/~cjlin/libsvm/
The book "Artificial Intelligence: A Modern Approach" (code: http://aima.cs.berkeley.edu/code.html)
Backtesting the trading system on past data:
More often than not, broker trading platforms provide facilities for trading automation, in the form of scripts and languages with which you can program the logic of the trading "strategy" (some use common languages like Java, some use proprietary ones). They also provide at least minimal support for testing the strategy on past data and getting a detailed report on the trades taken and their outcomes.
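Independent of any particular broker platform, the core of such a "strategy" script is usually just a loop over historical bars with an entry/exit rule. Below is a minimal sketch in plain Java of a moving-average crossover backtest; the price series, window sizes, and starting cash are made up for illustration, and a real backtest would replay actual historical data and account for commissions and slippage.

    // Minimal backtest sketch: buy when the fast moving average crosses
    // above the slow one, sell on the reverse cross. Illustrative only.
    public class CrossoverBacktest {
        public static void main(String[] args) {
            double[] closes = {100, 101, 103, 102, 105, 107, 106, 108, 110, 109};
            int fast = 3, slow = 5;            // window lengths in bars
            double cash = 10_000, position = 0;

            for (int i = slow; i < closes.length; i++) {
                double fastMa = avg(closes, i - fast, i);
                double slowMa = avg(closes, i - slow, i);
                if (fastMa > slowMa && position == 0) {        // cross up: buy
                    position = cash / closes[i];
                    cash = 0;
                } else if (fastMa < slowMa && position > 0) {  // cross down: sell
                    cash = position * closes[i];
                    position = 0;
                }
            }
            double finalValue = cash + position * closes[closes.length - 1];
            System.out.printf("Final portfolio value: %.2f%n", finalValue);
        }

        // Average of xs[from] .. xs[to - 1] (a trailing window of closes).
        private static double avg(double[] xs, int from, int to) {
            double sum = 0;
            for (int i = from; i < to; i++) sum += xs[i];
            return sum / (to - from);
        }
    }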
Connection to broker and System Testing:
Either you use a broker-proprietary trading API, or you go with the more standardized FIX protocol.
Building a FIX server that plays back quotation ticks to your trading system (which in this case acts as a FIX client) is also a very good way to validate the system. Most reputable ECNs provide FIX access, so this is more portable than any other interface.
QuickFIX/J is a full-featured messaging engine for the FIX protocol. It is a 100% Java open source implementation of the popular C++ QuickFIX engine.
http://www.quickfixj.org/
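To make that concrete, here is a minimal QuickFIX/J initiator skeleton; the same engine can be run as an acceptor to build the tick-playback FIX server mentioned above. This is a hedged sketch, not a complete trading client: the configuration file name is a placeholder, and the session details (counterparty, FIX version, host, port) would live in that file.

    import quickfix.*;

    // Skeleton QuickFIX/J client: connects to the counterparty configured
    // in fixclient.cfg (a placeholder name) and logs what happens.
    public class FixClient implements Application {
        public void onCreate(SessionID id) {}
        public void onLogon(SessionID id)  { System.out.println("Logged on: " + id); }
        public void onLogout(SessionID id) { System.out.println("Logged out: " + id); }
        public void toAdmin(Message msg, SessionID id) {}
        public void fromAdmin(Message msg, SessionID id) {}
        public void toApp(Message msg, SessionID id) {}
        public void fromApp(Message msg, SessionID id) {
            // Application-level messages (executions, market data, ...) arrive here.
            System.out.println("Received: " + msg);
        }

        public static void main(String[] args) throws Exception {
            SessionSettings settings = new SessionSettings("fixclient.cfg");
            Initiator initiator = new SocketInitiator(
                    new FixClient(),
                    new FileStoreFactory(settings),
                    settings,
                    new ScreenLogFactory(settings),
                    new DefaultMessageFactory());
            initiator.start();          // sessions connect and log on per the config
            Thread.sleep(60_000);       // keep the engine running for a minute
            initiator.stop();
        }
    }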
There aren't any full-blown platforms/applications per se, since pretty much all software in this field is developed in-house and usually kept behind the firewall (obviously for competitive advantage, in a fiercely competitive industry).
A well-known library that includes a lot of algorithms and pricing models, and makes a suitable starting point for a framework or application, is QuantLib.
The Strata project from OpenGamma provides a comprehensive open source Java library for market risk, including all the basic elements a quant would need to manage things like holidays, trades, valuation, and risk measures. Disclaimer: I am an author.
I was trying to get a fix on when using Apache Camel would be appropriate and inappropriate by reading this article: https://dzone.com/articles/when-use-apache-camel . The article mentions that when the number of services is low, using an integration framework like Camel might be overkill, which makes sense. But I was confused by this passage:
Although FuseSource offers commercial support, I would not use Apache Camel for very large integration projects. An ESB is the right tool for this job in most cases. It offers many additional features such as BPM or BAM. Of course, you could also use several single frameworks or products and "create" your own ESB, but this is a waste of time and money (in my opinion).
Is this because the integration framework lacks components that an ESB provides? If so, what are those?
At a functional level, Apache Camel does everything that all the other ESBs do and is a fine choice for pretty much any integration work. It has connectors (components, in Camel-speak) for every transport you can think of, deals with clustering, can provide a JMS broker, and whatever else you need to integrate.
It doesn't have the nice UIs and IDEs that other tools (Tibco, webMethods, Boomi, etc.) offer, which is arguably a big advantage: the developers might actually write unit tests if you use Camel :) I'm joking of course, integration devs never write unit tests.
In terms of "weight" Camel itself is not too bloated. It can be used as a standalone runtime or you can simply leverage the integration capabilities as a library in another app. It heavily utilises spring and can run a large number of threads, so requires a reasonable amount of memory (~512Mb JVM Heap would be the lower bound) but is not difficult to use. It wont fight you.
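As an illustration of leveraging Camel "as a library in another app", here is a minimal standalone route. The endpoint URIs are illustrative assumptions: the jms: endpoint in particular only works once a JMS component with a connection factory has been registered with the context.

    import org.apache.camel.CamelContext;
    import org.apache.camel.builder.RouteBuilder;
    import org.apache.camel.impl.DefaultCamelContext;

    // A single embedded Camel route: poll a directory, log each file,
    // and hand it off to a JMS queue.
    public class EmbeddedCamel {
        public static void main(String[] args) throws Exception {
            CamelContext context = new DefaultCamelContext();
            context.addRoutes(new RouteBuilder() {
                @Override
                public void configure() {
                    from("file:data/inbox?noop=true")   // poll a directory
                        .to("log:incoming")             // log each exchange
                        .to("jms:queue:orders");        // assumes a configured JMS component
                }
            });
            context.start();
            Thread.sleep(10_000);   // let the route run briefly, then shut down
            context.stop();
        }
    }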
JBoss Fuse is the full-blown, Red Hat-supported enterprise ESB. It is based around Camel, but uses Apache Karaf, an OSGi container, as the runtime. This is heavyweight, but it gives you a very powerful ESB runtime, deployment model, and management interface, and you can buy commercial support for Fuse and ActiveMQ from Red Hat. This is the more traditional "ESB" platform, but deep down inside, all the integration functionality comes from Camel.
I have been working with various products used to implement ESBs over the past 10 years: Apache Camel, Mule, Oracle Service Bus, IBM Integration Bus (WebSphere Message Broker), Spring Integration, Fuse, etc. I can tell you what the differences are from various perspectives, but most importantly, what you have to understand is that an ESB is an abstract concept: basically a "bus", in terms of integration patterns, that connects various services in an enterprise. For this reason, you can build an ESB with a product that targets this specific need (e.g. IIB) or with something much more generic (e.g. Spring Integration).
Management point of view:
The top-end commercial products (the ones you refer to as "ESBs"), like the ones from Oracle or IBM, are well suited for this job. They have nice user interfaces made for implementing a bus pattern. On the job market you will need to find certified specialists to operate these products (which may or may not be easy). Behind these products you get the support of big companies with which the enterprise may already have existing contracts. These products often have very specific connectors that may (or may not, see the developer point of view) ease your life. To give you an example, IBM's product has very efficient connectors for IBM MQ and mainframe CICS. They also offer out-of-the-box integrations with other products like BPM. When building bigger ESB implementations, the projects are structured by the fact that not everything can be done in such a product; this means you have smaller risk but also less flexibility. In terms of problem management, you are relatively safe because you can blame the big company that supports you.
With the more "open" products/frameworks like Fuse, Mule, Camel, etc., you get very flexible software that may also be cheaper as a starting price. To work with it you will need more generic profiles, but you will also need software architects to design your product, because nothing forces you to build the message flows in a particular way. At a certain point you will need at least some very skilled developers, because you may not get efficient support, depending on the product you choose. In terms of problem management, you are fully responsible for this choice.
Developer point of view:
Top-end commercial products come with built-in connectors and other operators, so it is quick to build a POC. However, any customization will be a pain (for instance, you may have an HTTP connector but need to switch to some new SHA algorithm that was not standard at the time you installed it). Introducing specific custom connectors is possible, but much more complicated than with the "open" products. You get nice user interfaces that hide the code. This brings a series of drawbacks: for instance, code repository usage will not be efficient (you can't diff two versions), unit tests are difficult to write, etc. Integrations with other products like BPM force you to use a pre-chosen product; anything else will be doable but expensive.
The open products/frameworks are flexible, and everybody on the team will understand the product quickly (provided, e.g., everyone knows Spring). Changing some core functionality might be very cheap. However, things can quickly degenerate without the supervision of experienced members, because you can code pretty much anything.
Architect point of view:
That said, there is something about the ESB that you should know in advance. You should be very clear about the capabilities you need (let's say: routing, service discovery, protocol conversion, etc.). If you need to implement something complex in an ESB, for instance transformation of a business message, the business analyst will need to interact with your ESB. So, for instance, the expert in banking payments will need to write an XSLT for you. This XSLT might be simple, but it may also cause big trouble, for which you will be responsible. Now if this person also needs to know the product of your choice, that complicates things. Therefore, the added value of integrating an ESB with BPM, BAM, etc. is very questionable.
Another thing to note is that for most enterprises nowadays, the ESB is not something that creates competitive advantage the way, for instance, a web page might. So investing in a super-complex development or an out-of-the-box product needs to be challenged.
Conclusion
Not sure whether that was overwhelming or understandable, but in the end it boils down to the resources you have and the complexity of the task.
Just as a comment, contrary to another answer on this question, if you are building something complex:
Pay attention to the Architect's point of view above.
You may seriously consider the open end of the product range, because I can tell you from experience that no matter what kind of support you have, when things get nasty you had better have very skilled people in-house; no official support will really save you (apart from saving your face).
A recent article by codecentric states the following:
Apache Camel is a framework full of tools for routing data within an application. It is a framework you use when a full-blown Enterprise Service Bus is not (yet) needed.
You can find the full article here.
Don't get too stuck on words like "ESB" and "framework". What kind of platform you need should be decided by:
Your current and future system landscape
Your current and future requirements (technical and non-functional)
Your in house competence
Your budget
After going through those steps, you can evaluate and come to a conclusion on what you need. For some, Apache Camel is the best fit; for others, another platform may be a better fit. Don't get stuck on words like ESB.
The goal is to have occasionally connected clients belonging to one user, running on an array of different platforms, able to share a common dataset. The server will be GNU/Linux. Platforms and devices initially targeted include GNU/Linux, Android, and Windows on x86_64, i686, ARMv7, and ARMv8, with possible consideration given to OS X and iOS devices. The UI toolkit will likely be Qt5. The non-interface logic will be in C++. My code will be licensed under the GPL or AGPL.
Many applications like Google's Maps, for instance, require a reliable connection to work effectively. My design decision is that each client should be fully functional without a network connection, assuming that the user has not used a different device/platform to change the data. In such cases, only a synchronization with network access should be necessary.
Some applications are written with hand-rolled data persistence and synchronization layers customized for each platform or subset of platforms. While this is quite effective, it is beyond the skill and manpower available (just me :) ).
Also, though I'm somewhat conditioned by past experience to think of things from a relational database perspective, the data I'm expecting is probably better suited to a JSON/NoSQL database. Though there is UnQLite, it has no more synchronization capabilities than SQLite. And things like MongoDB, CouchDB, etc. are, as far as I have found, solely targeting the data center or cloud at the moment. Really, the same seems to be true of the SQL solutions. Are there truly cross-platform databases that have native synchronization/replication capabilities?
I've spent the past weeks looking at a variety of options but have found little that is truly useful. The closest I've seen to something useful is Couchbase; however, it seems to be rather unstable (from an API standpoint, not relating to crashes), has a huge set of dependencies, seems to be nothing but a SQLite wrapper on mobile platforms, and has limited developer docs at the moment.
If there is an applicable library or toolkit, please point it out. If not, or if this is excessively "discussion oriented", please kindly point me to an appropriate resource to find help before closing this question.
Notes:
How would you design & implement a Cross Platform synchronization mechanism? talks about hand-rolling solutions. If this is the only way to go, I would appreciate references specifically applicable to SQLite3, since I know that it is stable and available on just about everything but the kitchen sink (see the sketch after these notes).
Cross platform data synchronization does not really address anything significant. One of my main concerns is synchronizing data effectively while minimizing network bandwidth requirements. Even in the US, there are vast areas where high-speed data is simply not available. To me, usability in the widest range of circumstances is important.
How to Synchronize Android Database with an online SQL Server? is somewhat helpful, but I am a total, absolute novice at developing for anything but *NIX on servers and workstations. Again, if SQL-based persistence is necessary for this set of requirements, more pointers to guidance would be helpful.
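For what it's worth, here is a minimal sketch of the hand-rolled SQLite3 approach mentioned in the first note: every row carries a last-modified timestamp, and a client uploads only rows changed since its last successful sync, which keeps bandwidth use low. The schema, names, and use of the xerial sqlite-jdbc driver are assumptions for illustration; conflict resolution (e.g. last-writer-wins versus merge) is the genuinely hard part and is not shown.

    import java.sql.*;

    // Hand-rolled sync sketch for SQLite: track per-row modification times
    // and remember the time of the last successful sync with each peer.
    public class SqliteSyncSketch {
        public static void main(String[] args) throws SQLException {
            try (Connection db = DriverManager.getConnection("jdbc:sqlite:local.db")) {
                try (Statement s = db.createStatement()) {
                    s.execute("CREATE TABLE IF NOT EXISTS notes (" +
                              " id TEXT PRIMARY KEY," +      // UUIDs avoid id clashes across devices
                              " body TEXT," +
                              " last_modified INTEGER)");    // unix millis
                    s.execute("CREATE TABLE IF NOT EXISTS sync_state (" +
                              " peer TEXT PRIMARY KEY," +
                              " last_sync INTEGER)");
                }
                long lastSync = 0; // in a real client, read from sync_state for this peer
                // Only rows changed since the last sync go over the wire.
                try (PreparedStatement ps = db.prepareStatement(
                        "SELECT id, body, last_modified FROM notes WHERE last_modified > ?")) {
                    ps.setLong(1, lastSync);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            System.out.println("would upload: " + rs.getString("id"));
                        }
                    }
                }
            }
        }
    }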
I've touched Teradata. I've never touched Hadoop, but since yesterday I have been doing some research on it. From the descriptions of both, they seem quite interchangeable, but some papers say that they serve different purposes. All I have found is vague, and I am confused.
Does anybody have experience with both of them? What is the serious difference between them?
Simple example: I want to build an ETL that will transform billions of rows of raw data and organize them into a DWH. Then do some resource-expensive analysis on them. Why use TD? Why Hadoop? Or why not?
I think the article titled 'MapReduce and Parallel DBMSs: Friends or Foes?' does quite a good job of describing the situations in which each technology works best. In a nutshell, Hadoop is excellent for storing unstructured data and running parallel transformations to 'sanitize' incoming data, whereas DBMSs excel at executing complex queries quickly.
Hadoop, Hadoop with Extensions, RDBMS Feature/Property Comparison
I am not an expert in this area, but in the Coursera course Introduction to Data Science there is a lecture titled 'Comparing MapReduce and Databases', as well as a lecture on parallel databases, within the MapReduce section of the course.
Here is a summary from these lectures comparing MapReduce with RDBMSs (not necessarily parallel RDBMSs).
One point to remember is that the comparison is different if you include extensions to Hadoop like Pig, Hive, etc. I will note in parentheses the MapReduce extensions that add some of this functionality.
Some functionality/properties that RDBMSs have but native MapReduce does not:
Declarative query languages (Pig, Hive)
Schemas (Hive, Pig, DryadLINQ, Hadapt)
Logical Data Independence
Indexing (HBase)
Algebraic optimization (Pig, Dryad, Hive)
Caching/Materialized Views
ACID/Transactions
Some functionality/properties that MapReduce has (relative to a regular RDBMS, not necessarily a parallel RDBMS):
High Scalability
Fault-tolerance
“One-person deployment”
I've been asked this question several times; the answer I usually give is a car analogy (which is pretty silly, because I'm not a car person, but it seems to work).
Teradata is the car/DBMS for the masses: it is reliable, mature, works well, and is there when you need it. It is difficult (compared to Hadoop) to customise and add functionality to the base product.
Hadoop is the car/DBMS for the enthusiast: it isn't as reliable or mature, but it works well so long as you attend to it. It is easy (compared to Teradata) to customise and add functionality to the base product.
Put another way, Teradata is the reliable workhorse where you put your mission critical process (operational reporting, enterprise reporting, decision support etc).
Hadoop is the place where you can do a lot of this stuff, but don't be surprised if you come in one morning and find that your regulatory reports can't be produced because someone applied a patch, or you've suddenly got a "too many small files" problem.
To loop back into the analogy, if you don't want to be too techy and the manufacturers product (dbms and/or car) works for you out of the box, Teradata is a good option.
On the other hand, if you like to tinker under the hood, swap out the carburettor (or whatever), adjust the gear ratios, tweak the fuel-air mixture depending on whether you are doing country or city driving, bolt on a turbocharger, and/or your family complains about how long you spend in the garage on weekends, then Hadoop is the place for you.
IMHO, most, if not all, organisations need both.
I hope this helps :-)
To begin with, vanilla Apache Hadoop is 100% open source. But if you need commercial support along with consultancy, there are companies like Cloudera, MapR, Hortonworks, etc.
Hadoop is backed by a growing community that fixes bugs and makes improvements on a consistent basis. Hadoop's storage model, HDFS, is based on Google's GFS architecture, which has been proven to handle large quantities of data. Furthermore, Hadoop's analysis model, MapReduce, is based on Google's MapReduce model.
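To make the MapReduce model concrete, here is the canonical word-count job written against the standard Hadoop API: the mapper emits a (word, 1) pair per token, the framework groups the pairs by key across the cluster, and the reducer sums the counts. Input and output paths are supplied on the command line.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map: one (word, 1) pair per whitespace-separated token.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        ctx.write(word, ONE);
                    }
                }
            }
        }

        // Reduce: sum the counts for each word.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }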
Hadoop is used by tech giants like Facebook, Yahoo, Twitter, eBay, etc. to store and analyze their high volumes of data, both in real time and in batch.
For your question about ETL systems, have a look at these slides.
OK, now why Hadoop?
Open source
A proven storage and analysis model for large quantities of data
Minimal hardware requirements to set up and run
OK, now why TD?
Commercial support
I've been working on a project, which is a combination of an application server and an object database, and is currently running on a single machine only. Some time ago I read a paper describing a distributed relational database, and I got some ideas on how to apply that paper's ideas to my project, so that I could make a high-availability version of it running on a cluster using a shared-nothing architecture.
My problem is that I don't have experience in designing distributed systems and their protocols; I did not take the advanced CS courses on distributed systems at university. So I'm worried about being able to design a protocol that does not cause deadlock, starvation, split brain, and other problems.
Question: Where can I find good material about designing distributed systems? What methods are there for verifying that a distributed protocol works correctly? Recommendations of books, academic articles, and other resources are welcome.
I learned a lot by looking at what is published about really huge web-based platforms, and especially at how their systems evolved over time to meet their growth.
Here are some examples I found enlightening:
eBay Architecture: Nice history of their architecture and the issues they had. Obviously they can't use a lot of caching for the auctions and bids, so their story is different in that point from many others. As of 2006, they deployed 100,000 new lines of code every two weeks - and are able to roll back an ongoing deployment if issues arise.
Paper on Google File System: A nice analysis of what they needed, how they implemented it, and how it performs in production use. After reading this, I found it less scary to build parts of the infrastructure myself to meet exactly my needs, if necessary, and saw that such a solution can and probably should be quite simple and straightforward. There is also a lot of interesting stuff on the net (including YouTube videos) on BigTable and MapReduce, other important parts of Google's architecture.
Inside MySpace: One of the few really huge sites built on the Microsoft stack. You can learn a lot about what not to do with your data layer.
A great start for finding many more resources on this topic is the Real Life Architectures section of the "High Scalability" website. For example, they have a good summary of Amazon's architecture.
Learning distributed computing isn't easy. It's really a very vast field, covering areas such as communication, security, reliability, and concurrency, each of which would take years to master. Understanding will eventually come through a lot of reading and practical experience. You seem to have a challenging project to start with, so here's your chance :)
The two most popular books on distributed computing are, I believe:
1) Distributed Systems: Concepts and Design - George Coulouris et al.
2) Distributed Systems: Principles and Paradigms - A. S. Tanenbaum and M. Van Steen
Both of these books give a very good introduction to the current approaches (including communication protocols) used to build successful distributed systems. I've personally used the latter mostly, and I've found it to be an excellent text. If you think the reviews on Amazon aren't very good, it's because most readers compare this book to the other books written by A. S. Tanenbaum (who IMO is one of the best authors in the field of computer science), which are quite frankly better written.
PS: I really question your need to design and verify a new protocol. If you are working with application servers and databases, what you need is probably already available.
I liked the book Distributed Systems: Principles and Paradigms by Andrew S. Tanenbaum and Maarten van Steen.
At a more abstract and formal level, Communicating and Mobile Systems: The Pi-Calculus by Robin Milner gives a calculus for verifying systems. There are variants of the pi-calculus for verifying protocols, such as the spi calculus (the Wikipedia page for which has disappeared since I last looked), and implementations, some of which are also verification tools.
Where can I find good material about designing distributed systems?
I have never been able to finish the famous book by Nancy Lynch. However, I find the book by Sukumar Ghosh, Distributed Systems: An Algorithmic Approach, much easier to read, and it points to the original papers when needed.
I haven't read the books by Gerard Tel and Nicola Santoro, though. Perhaps they are even easier to read...
What methods are there for verifying that a distributed protocol works correctly?
To survey the possibilities (and also to understand the question), I think it is useful to get an overview of the available tools from the book Software Specification Methods.
My final decision was to learn TLA+. Why? Even if other languages and tools might seem better, I really decided to try TLA+ because the man behind it is Leslie Lamport. That is, not just a prominent figure in distributed systems, but also the author of LaTeX!
You can get the TLA+ book and several examples for free.
There are many classic papers written by Leslie Lamport (http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html) and by Edsger Dijkstra (http://www.cs.utexas.edu/users/EWD/).
For the database side, a mainstream trend is the NoSQL movement; many projects are appearing in the market, including CouchDB (couchdb.apache.org), MongoDB, and Cassandra. These all promise scalability and manageability (replication, fault tolerance, high availability).
One good book is Birman's Reliable Distributed Systems, although it has its detractors.
If you want to formally verify your protocol you could look at some of the techniques in Lynch's Distributed Algorithms.
It is likely that whatever protocol you are trying to implement has been designed and analysed before. I'll just plug my own blog, which covers e.g. consensus algorithms.
Our company has a point-of-sale system with many extras, such as ordering and receiving functionality, sales and order history, etc. Our main issue is that the system was not designed properly from the ground up, so it takes too long to make fixes and handle requests from our customers. Also, the current technology we are using (a Progress database, with Progress 4GL as the language) incurs quite a bit of licensing expense for our customers, due to multi-user license fees for database connections, etc.
After a lot of discussion it is looking like we will probably start over from scratch (while maintaining the current product at least for the time being). We are looking for a couple of things:
Create the system with a nice GUI front end (it is currently CHUI, and the application was not built in a way that allows us to redesign the front end... no layering or separation of business logic and GUI... shudder).
Create the system with the ability to modularize different functionality so the product doesn't have to include all features. This would keep the cost down for our current customers that want basic functionality and a lower price tag. The bells and whistles would be available for those that would want them.
Use proper design patterns to make the product easy to add to or change at any time (i.e. change the database or change the front end without needing to rewrite the application or most of it). This is a problem today because the Progress 4GL code is compiled directly against the database; small changes in the database require lots of code recompiling. (A minimal sketch of this kind of layering follows this list.)
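As a sketch of the layering asked for in the last point: if the business logic talks to persistence only through an interface, swapping the database (or the front end) means writing a new implementation rather than recompiling the application. All names here are illustrative, not a recommendation of a specific design.

    // Sale.java - minimal domain object (illustrative)
    class Sale {
        final String id;
        final double total;
        Sale(String id, double total) { this.id = id; this.total = total; }
    }

    // SaleRepository.java - business logic depends only on this interface
    interface SaleRepository {
        void save(Sale sale);
        Sale findById(String id);
    }

    // JdbcSaleRepository.java - one implementation per storage backend;
    // switching databases means a new implementation, not a rewrite.
    class JdbcSaleRepository implements SaleRepository {
        private final java.sql.Connection connection;
        JdbcSaleRepository(java.sql.Connection connection) { this.connection = connection; }
        public void save(Sale sale) { /* INSERT ... via JDBC */ }
        public Sale findById(String id) { /* SELECT ... via JDBC */ return null; }
    }

The same idea applies to the front end: a CHUI and a GUI can both drive one service layer, which also helps keep the "bells and whistles" modules optional.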
Our new system will be Linux-based, with the possibility of a client application providing functionality from one or more Windows boxes.
So what I'm looking for is any suggestions on which database and/or framework or programming language(s) someone might recommend for this sort of product. Anyone who has experience in this field might be able to point us in the right direction or even have some ideas of what to avoid. We have considered .NET and SQL Express (we don't need an enterprise-level DB), but that would limit us to Windows (as far as I know, anyway). I have heard of Mono for writing .NET code in a Linux environment, but I don't know much about it yet. We've also considered a Java and MySQL based implementation.
To summarize we are looking to do the following:
Keep licensing costs down on the technology we will use to develop the product (Oracle, yikes! MySQL, nice.)
Deliver a solution that is easily maintainable and supportable.
A solution that has a component capable of running on "old" hardware through a CHUI front end (some of our customers have 40+ terminals, which would cost a ton of cash to convert over to PCs).
Suggestions would be appreciated.
Thanks
[UPDATE]
I should note that we are currently performing a total cost analysis. This question is intended to give us a couple of "educated" options to look into to include in or analysis. Anyone who could share experiences/suggestions about client/server setups would be appreciated (not just those who have experience with point of sale systems... that would just be a bonus).
[UPDATE]
For anyone who is interested, we ended up going with Microsoft Dynamics NAV and LS Retail (a plugin for the point of sale and various other things), and then did some customization work on top of that (with more currently in progress). This setup gave us the added benefit of having a fully integrated G/L (general ledger) system, which our current system lacked.
Java for the language (or Scala if you want to be "bleeding edge"; depending on how you plan to support it and what your developers are like, it might be better, but also worse)
H2 for database
Swing for GUI
Reason: free, portable, and pretty standard. (A minimal sketch of the embedded-H2 piece follows.)
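A minimal sketch of the embedded-H2 piece, assuming the H2 jar is on the classpath: the database runs inside the same JVM as the client and stores its data in a local file, so there is no database server and no per-connection license fee. The file name and schema are made up for illustration.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Embedded H2: "jdbc:h2:./posdata" keeps the whole database in a local file.
    public class EmbeddedH2Demo {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:h2:./posdata", "sa", "")) {
                try (Statement st = conn.createStatement()) {
                    st.execute("CREATE TABLE IF NOT EXISTS item(sku VARCHAR PRIMARY KEY, price DECIMAL)");
                    st.execute("MERGE INTO item KEY(sku) VALUES('A-100', 9.99)");  // upsert
                    try (ResultSet rs = st.executeQuery("SELECT sku, price FROM item")) {
                        while (rs.next()) {
                            System.out.println(rs.getString(1) + " -> " + rs.getBigDecimal(2));
                        }
                    }
                }
            }
        }
    }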
Update: I missed the part where the system should be a client-server setup. My assumption was that the database and client would run on the same machine.
I suggest you first research your constraints a bit more. You made a passing reference to customers using a particular type of terminal; this may limit your options unless those customers agree to upgrade.
You need to do a lot more legwork on this. It's great to get opinions from web forums, but we can't possibly know your environment as well as you do.
My broad strokes advice would be to aim for technology that is widely used. This way, expertise on the platform is cheaper than "niche" technologies, and it will be easier to get help if you hit a brick wall. Of course, following this advice may not be possible if you have non-negotiable technology already in place at customers.
My second suggestion would be to complete a full project plan, with detailed specs and proper cost estimates, before going with the "rewrite from scratch" option. Right now, you're saying that it would be cheaper to rewrite the system than maintain it, but you don't really know how much it would cost to rewrite.
I suggest you use the browser for the UI and organize your application as a web application.
There are tons of options for the back end. You can use Java + MySQL. A Java back end will save you from the Windows/Linux debate, as it runs on both platforms, and you won't have any licensing costs for either Java or MySQL. (Edit: there are definitely lots of other languages with runtimes for both Linux and Windows, including PHP, Ruby, Python, etc.)
If you go this route, you may also want to consider Google Web Toolkit (GWT) for creating the browser based front-end in a modular fashion.
One word of caution, though: browsers can be pesky when it comes to memory management. In our experience, this was the most significant challenge in building a browser-based POS. You may want to check out Adobe Flex, which runs in the browser but might be more civil in its memory management.
What is CHUI? Character UI, as in VT terminals? Or even 3270-style?
It sounds like you need a 3-tier system: the database back end, a middle layer that runs the bulk of the back-end business processes, and a front-end layer for the CHUI / GUI / data gateway.
All three layers can reside on one machine; or you can distribute the tiers out to various servers. The front-end layer would control the actual terminals, whether they are VT-terminals, or a web-browser, or a custom-written 'client' application.
Make sure you have considered the hardware needs here. Are you going to have barcode scanners, cash drawers, POS debit/credit terminals, et cetera? If you are using a standard browser, it might be hard to reliably integrate those items. (At the very least, you're likely going to have to write special applets to handle them.)
Finally, consider the possibility of thin-client technology on Windows. It greatly simplifies system management, since you only have to upgrade the software centrally. Thin-client PCs are cheap: sub-$200.
Golden Code Development (see www.goldencode.com) has a technology that does automated conversion of Progress 4GL (the schema and code... the entire application) to a Java application with a relational database backend (e.g. PostgreSQL). They currently support a very complete CHUI environment and they do refactor the code. For example, the conversion separates the UI, the data model and the business logic into separate Java classes. The entire result is a drop-in replacement that is compatible with the original (users don't need retraining, processes don't need to be modified, the data is migrated too). This is possible because they provide an application server and a set of runtime classes that provide that compatibility. The result of the automated conversion is not something that needs further editing before you can compile and run it. True terminal support is included so hardware terminals still work (it requires a small JNI library to access NCURSES from Java). All the rest of the code in the runtime is pure Java. No Progress Software Corp technology is used in the resulting system and it runs on Linux.
At least one converted system is already in production, running in a 24x7 mission-critical environment. It is a converted ERP system that their mid-sized pilot customer uses to run their entire business.