I recently came across a blog post that introduced the term "Bayesian Spam Filtering" and talked about how this was the approach behind spam filtering for emails.
I also remember a paper (perhaps it was this?) discussing how Game Theory is involved in packet routing, or how it is used for Resource Allocation in Cloud Computing. Also, I recall a university course on Formal Methods, and how they're used in Software Engineering.
I am looking for books which talk about how concepts from Mathematics or CS Theory are actually applied in everyday technology.
Any good textbook will at least mention and ideally discuss everyday applications of the subject matter whenever possible. For example, the excellent Artificial Intelligence: A Modern Approach by Russell and Norvig does this throughout.
More specialized books will of course tend to have deeper discussions of the applications. What is one 30-page chapter in Russell and Norvig is a whole book by Sutton and Barto: Reinforcement Learning: An Introduction (free pdf at the link, courtesy of the authors). Have a look at chapter 16 for some fascinating applications.
There are also books that start with an application, i.e. a practical problem, and then develop all the theory necessary to solve it. One of my favorites in this category is In Pursuit of the Traveling Salesman: Mathematics at the Limits of Computation by William Cook.
You might be interested in Math for Programmers, which has some interesting approaches to programming through pure math. Conversely, A Programmer's Introduction to Mathematics teaches math using programming, which might be helpful if you want to learn it the other way around.
Can anyone tell me what algorithm is used to classify intents and understand entities in Watson assistant? Have they published any papers or articles regarding this?
Yes, they published this paper explaining, to some extent, how Watson works. For more information you should learn about Cognitive Systems. To be clear up front, it's not just one algorithm: many approaches are combined to get the desired result.
Another area worth studying, if this is your interest, is the computer science field of "Information Retrieval", which draws on many subjects to work out what the user wants and return the needed information. The book Modern Information Retrieval is a good starting point.
According to IBM Developer Answers:
"Intents are classified using an SVM, with some pre training by IBM. entities use a fuzzy matching algorithm."
https://developer.ibm.com/answers/questions/387916/watson-conversation-algorithm/
Support Vector Machine (SVM) is a supervised machine learning algorithm.
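To make "fuzzy matching" for entities a bit more concrete, here is a minimal sketch using Python's standard difflib module. It is only an illustration of the general idea, not IBM's implementation: the entity dictionary, the match_entity helper and the 0.8 similarity cutoff are all assumptions made for this example.

```python
import difflib

# Hypothetical entity dictionary: entity value -> known synonyms.
ENTITIES = {
    "pizza": ["pizza", "pie"],
    "hamburger": ["hamburger", "burger"],
}


def match_entity(token, cutoff=0.8):
    """Return the entity whose synonym is closest to token, or None."""
    # Map every synonym back to the entity it belongs to.
    synonym_to_entity = {
        syn: entity for entity, syns in ENTITIES.items() for syn in syns
    }
    # get_close_matches ranks candidates by SequenceMatcher similarity
    # and drops anything scoring below the cutoff.
    close = difflib.get_close_matches(
        token.lower(), list(synonym_to_entity), n=1, cutoff=cutoff
    )
    return synonym_to_entity[close[0]] if close else None
```

With this sketch, a misspelled token such as "piza" still resolves to the "pizza" entity, while an unrelated token returns None.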
Is there any documentation about which optimizations are performed (and how) when transforming the Application DAG into the physical DAG?
This is the "official" page of Flink Optimizer but it is awfully incomplete.
I've also found other information directly in the source code on the official GitHub page, but it is very difficult to read and understand.
Other information can be found here (section 6) (Stratosphere was the previous name of Flink before it was incubated by Apache) and here (section 5.1) for a more theoretical documentation.
I hope it will be documented better in the future :-)
I'm looking for information about how the IMAP protocol works. Google yields only high level information, but not enough to understand the details. I'd like to know enough to be able to create my own implementation. I found a C library which does it, but it is poorly documented.
Some basic questions are: what are the IMAP UIDs and what are their guarantees? For example, will an id ever change? Will it be reused if deleted?
This looks like a good starting point:
http://www.imapwiki.org/ImapRFCList
In general, the keyword you want when searching for details on an internet protocol is "RFC". Add that to your search along with the name of the protocol and you should get off to a good start.
Google yields only high level information, but not enough to understand the details.
Google is a general search engine, and its results will only be as good as the search terms you supplied. If you want to get detailed and definitive technical information about a protocol or standard or programming language, you should start by searching for the specification; i.e. use "specification" as one of your search terms.
I'd like to know enough to be able to create my own implementation. I found a C library which does it, but it is poorly documented.
If you've already found an implementation, why would you want to create another? Or even know enough to (hypothetically) create another?
I'm sure there are other open source implementations of IMAP around in various languages.
It is a bit much to expect an implementation of IMAP to be sufficiently well documented as to serve as a specification.
Some basic questions are: what are the IMAP UIDs and what are their guarantees? For example, will an id ever change? Will it be reused if deleted?
I expect that these questions can be answered by reading the IMAP specification; see RFC 3501.
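As a rough illustration of what the specification says on this point: in RFC 3501 a UID is only meaningful together with the mailbox's UIDVALIDITY value. While UIDVALIDITY stays the same, a UID keeps referring to the same message and is not handed out again. If UIDVALIDITY changes, a client has to throw away everything it remembered about UIDs in that mailbox. The sketch below shows how a client cache might act on that rule; MailboxCache and reconcile are hypothetical names invented for this example, and talking to a real server is left out.

```python
from dataclasses import dataclass, field


@dataclass
class MailboxCache:
    """Hypothetical client-side cache of one mailbox, keyed by IMAP UID."""
    uidvalidity: int
    messages: dict = field(default_factory=dict)  # uid -> cached data


def reconcile(cache, server_uidvalidity, server_uids):
    """Bring the cache in line with what the server reports.

    Returns the sorted list of UIDs whose contents still need fetching.
    """
    if server_uidvalidity != cache.uidvalidity:
        # UIDVALIDITY changed: previously cached UIDs can no longer be trusted.
        cache.messages.clear()
        cache.uidvalidity = server_uidvalidity
    current = set(server_uids)
    # Drop entries for messages that no longer exist on the server.
    for uid in list(cache.messages):
        if uid not in current:
            del cache.messages[uid]
    # Same UIDVALIDITY and same UID means the same message,
    # so only UIDs not yet in the cache have to be fetched.
    return sorted(current - set(cache.messages))
```

The design choice here is that the cache never tries to "repair" itself after a UIDVALIDITY change; it simply starts over, which is the conservative reading of the specification.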
I've been working on a project, which is a combination of an application server and an object database, and is currently running on a single machine only. Some time ago I read a paper which describes a distributed relational database, and got some ideas on how to apply the ideas in that paper to my project, so that I could make a high-availability version of it running on a cluster using a shared-nothing architecture.
My problem is that I don't have experience in designing distributed systems and their protocols - I did not take the advanced CS courses about distributed systems at university. So I'm worried about being able to design a protocol which does not cause deadlock, starvation, split brain and other problems.
Question: Where can I find good material about designing distributed systems? What methods are there for verifying that a distributed protocol works right? Recommendations of books, academic articles and others are welcome.
I learned a lot by looking at what is published about really huge web-based platforms, and especially how their systems evolved over time to meet their growth.
Here are some examples I found enlightening:
eBay Architecture: Nice history of their architecture and the issues they had. Obviously they can't use a lot of caching for the auctions and bids, so their story is different in that point from many others. As of 2006, they deployed 100,000 new lines of code every two weeks - and are able to roll back an ongoing deployment if issues arise.
Paper on Google File System: Nice analysis of what they needed, how they implemented it and how it performs in production use. After reading this, I found it less scary to build parts of the infrastructure myself to meet exactly my needs, if necessary, and that such a solution can and probably should be quite simple and straight-forward. There is also a lot of interesting stuff on the net (including YouTube videos) on BigTable and MapReduce, other important parts of Google's architecture.
Inside MySpace: One of the few really huge sites built on the Microsoft stack. You can learn a lot about what not to do with your data layer.
A great start for finding many more resources on this topic is the Real Life Architectures section on the "High Scalability" web site. For example, they have a good summary of Amazon's architecture.
Learning distributed computing isn't easy. It's really a very vast field covering areas such as communication, security, reliability, concurrency etc., each of which would take years to master. Understanding will eventually come through a lot of reading and practical experience. You seem to have a challenging project to start with, so here's your chance :)
The two most popular books on distributed computing are, I believe:
1) Distributed Systems: Concepts and Design - George Coulouris et al.
2) Distributed Systems: Principles and Paradigms - A. S. Tanenbaum and M. Van Steen
Both these books give a very good introduction to current approaches (including communication protocols) that are being used to build successful distributed systems. I've personally used the latter mostly and I've found it to be an excellent text. If you think the reviews on Amazon aren't very good, it's because most readers compare this book to other books written by A.S. Tanenbaum (who IMO is one of the best authors in the field of Computer Science), which are quite frankly better written.
PS: I really question your need to design and verify a new protocol. If you are working with application servers and databases, what you need is probably already available.
I liked the book Distributed Systems: Principles and Paradigms by Andrew S. Tanenbaum and Maarten van Steen.
At a more abstract and formal level, Communicating and Mobile Systems: The Pi-Calculus by Robin Milner gives a calculus for verifying systems. There are variants of pi-calculus for verifying protocols, such as SPI-calculus (the Wikipedia page for which has disappeared since I last looked), and implementations, some of which are also verification tools.
Where can I find good material about designing distributed systems?
I have never been able to finish the famous book by Nancy Lynch. However, I find that the book by Sukumar Ghosh, Distributed Systems: An Algorithmic Approach, is much easier to read, and it points to the original papers if needed.
That said, I haven't read the books by Gerard Tel and Nicola Santoro. Perhaps they are even easier to read...
What methods are there for verifying that a distributed protocol works right?
In order to survey the possibilities (and also in order to understand the question), I think that it is useful to get an overview of the possible tools from the book Software Specification Methods.
My final decision was to learn TLA+. Why? Although the language and tools do seem better, what really made me try TLA+ is that the guy behind it is Leslie Lamport. That is, not just a prominent figure in distributed systems, but also the author of LaTeX!
You can get the TLA+ book and several examples for free.
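To give a feel for what "verifying that a protocol works right" can mean in practice, here is a toy sketch of exhaustive state exploration, the basic idea behind explicit-state model checkers. It is plain Python, not TLA+, and everything in it is an assumption made for illustration: a hypothetical two-process lock protocol, the next_states and check_mutex helpers, and an atomic switch that chooses between a correct test-and-set and a broken variant where the test and the set are separate steps.

```python
from collections import deque


def next_states(state, atomic):
    """Yield every state reachable in one step by either process."""
    pcs, lock = state
    for i in (0, 1):
        pc = pcs[i]

        def step(new_pc, new_lock):
            # Build the successor state in which process i moved to new_pc.
            return (pcs[:i] + (new_pc,) + pcs[i + 1:], new_lock)

        if pc == "idle":
            yield step("want", lock)
        elif pc == "want" and not lock:
            # Atomic: test and set in one step. Otherwise: only the test.
            yield step("crit", True) if atomic else step("ready", lock)
        elif pc == "ready":
            # The set, performed after a possibly stale test.
            yield step("crit", True)
        elif pc == "crit":
            yield step("idle", False)


def check_mutex(atomic):
    """Breadth-first search over all reachable states.

    Returns a shortest trace to a state where both processes are in the
    critical section, or None if no such state is reachable.
    """
    init = (("idle", "idle"), False)
    parent = {init: None}
    queue = deque([init])
    while queue:
        state = queue.popleft()
        if state[0] == ("crit", "crit"):
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for nxt in next_states(state, atomic):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None
```

Calling check_mutex(True) explores the whole state space without finding a violation, while check_mutex(False) returns a step-by-step counterexample in which both processes pass the test before either sets the lock. Real tools handle far larger state spaces and richer properties, but the counterexample trace is the kind of feedback that makes them useful for protocol design.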
There are many classic papers written by Leslie Lamport (http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html) and Edsger Dijkstra (http://www.cs.utexas.edu/users/EWD/).
On the database side, a major trend is the NoSQL movement. Many projects are appearing, including CouchDB (couchdb.apache.org), MongoDB and Cassandra. These all promise scalability and manageability (replication, fault tolerance, high availability).
One good book is Birman's Reliable Distributed Systems, although it has its detractors.
If you want to formally verify your protocol you could look at some of the techniques in Lynch's Distributed Algorithms.
It is likely that whatever protocol you are trying to implement has been designed and analysed before. I'll just plug my own blog, which covers e.g. consensus algorithms.