I am reading https://flink.apache.org/news/2020/04/15/flink-serialization-tuning-vol-1.html. It is very helpful material about the Flink type system. At the end, it says:
The next article in this series will use this finding as a starting point to look into a few common pitfalls and obstacles of avoiding Kryo, how to get the most out of the PojoSerializer, and a few more tuning techniques with respect to serialization. Stay tuned for more.
But I can't find the next article... it looks like the author never published it?
(So far) there is no part 2, but you might enjoy A Journey to Beating Flink's SQL Performance from the same author.
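In the meantime, if you want to experiment with the topics the follow-up promised (avoiding Kryo, getting the most out of the PojoSerializer), here is a minimal sketch, assuming the Java DataStream API; the class name is made up for illustration. Flink stays on the fast PojoSerializer path, rather than falling back to Kryo, when a class is public, has a public no-argument constructor, and all non-static, non-transient fields are public or reachable via getters/setters:

```java
// Illustrative sketch: a class that meets Flink's POJO rules, so it is
// handled by the PojoSerializer instead of the Kryo fallback.
public class SensorReading {
    public String sensorId;  // public fields of supported types
    public long timestamp;
    public double value;

    public SensorReading() {} // public no-arg constructor is required
}
```

A handy check (mentioned in the article itself, if I recall correctly) is env.getConfig().disableGenericTypes(), which makes the job fail fast with an exception instead of silently falling back to Kryo.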
Are there any real benchmarks comparing Apache Flink and Apache Storm on real-time processing performance?
Also, if I want to implement this performance comparison myself, is there any streaming API (like the Twitter API) that offers higher throughput than Twitter and is open source?
Thank you!
There are some benchmarks for stream processing in general, but they are not as broadly applicable or accessible as the ones you can find for RDBMSs.
The main question you should answer for yourself first is: what exactly do you mean by performance? There are different metrics by which to benchmark such a system; the sketch below shows two common ones.
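This is my own illustrative sketch, not taken from any of the works listed below: throughput and tail latency, both computed from timestamped output events of a run.

```java
import java.util.Arrays;

// Illustrative sketch: compute throughput (events/s) and p99 latency from a
// run of a streaming job, given per-event latencies and the wall-clock time.
public class StreamMetrics {
    // latenciesMillis[i] = output timestamp - event timestamp for event i
    static void report(long[] latenciesMillis, long wallClockMillis) {
        double throughput = latenciesMillis.length / (wallClockMillis / 1000.0);
        long[] sorted = latenciesMillis.clone();
        Arrays.sort(sorted);
        long p99 = sorted[(int) (0.99 * (sorted.length - 1))];
        System.out.printf("throughput: %.1f events/s, p99 latency: %d ms%n",
                throughput, p99);
    }
}
```

Note that throughput and latency often trade off against each other (e.g. via buffering), which is exactly why benchmarks that report only one of them can be misleading.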
That said, I will try to list some benchmarking works that helped me:
A recent benchmarking framework implemented for both Storm and Flink is the Yahoo Streaming Benchmark. It has a fixed internal architecture using Kafka and Redis and a predefined query/topology. Still, it is a good starting point.
Karimov et al. have a nice paper on benchmarking these systems. It is worth a read, since it really helps in understanding the possible metrics. Unfortunately, I cannot find any implementation or further information on the workload (data and queries) they use, so I would say it is more helpful for understanding than for reproduction.
van Dongen et al. do a more in-depth analysis of several stream processing systems and offer their source code on GitHub. Unfortunately, there is no implementation for Storm, but there are still some interesting ideas and contributions on how to build such a framework.
As you can see, there is a lot of diversity in how you can set up and benchmark stream processing systems...
I am interested in learning how Flink works internally, but I am struggling to find documentation on the internal code (like where the entry point of a job is), so I am unable to understand the codebase. Is there documentation or a walkthrough for those who want to contribute to Flink itself?
I find that once you understand how some part of Flink is supposed to behave, its source code is generally understandable. The initial challenge, then, is to build a correct mental model of the expected behavior. To that end, here are some helpful resources:
The best starting point is Stream Processing with Apache Flink by Fabian Hueske and Vasiliki Kalavri.
Any significant development work done on Flink in recent years has been preceded by a Flink Improvement Proposal. These are probably the best available resources for getting a deeper understanding of specific topics and areas of the code.
The documentation has a section on "Internals" that covers some topics.
And there have been some excellent Flink Forward talks describing how some of the internals work, such as Aljoscha Krettek's talk on the ongoing work to unify batch and streaming, Nico Kruber's talk on the network stack, Stefan Richter's talks on state and checkpointing, Piotr Nowojski's talk on two phase commit sinks, and Addison Higham's talk on operators, among many others.
I want to build a demonstration of automatic story generation, and the approach I am taking is AI planning. I have been reading several relevant papers and have figured out that PDDL is perhaps the most widely used language for specifying planning problems. I have been looking at the syntax and several example programs to learn how to use it.
The part where I am stuck is how to get a planner to work. I have found some popular planners (Fast-Forward, MBP, IPP) but have not been able to make them work, even using the instructions from the sources themselves.
I am using Gnome Terminal on Ubuntu 13.04.
I am very new to planning, and this may be a very naive question, but I assure you that I have been searching for 3-4 days without any luck. Also, suggestions on using some other planning system are welcome.
If you are using Linux, then I strongly suggest using Fast Downward (it has its own web page; just google it). First of all, it is currently one of the best-known planning systems in the AI planning community, and, further, it is really easy to get running. You still need half an hour or so, but there is an easy-to-follow, step-by-step description telling you where to check out the code and which commands you need to run.
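To give an idea of what running it looks like once it is built (a sketch only; the file names are placeholders, and the search configuration is one of many), the driver script takes your PDDL domain and problem files plus a search configuration, here A* with the LM-Cut heuristic. You can run the same command directly in a terminal; wrapping it in code just makes it easy to call from your story generator:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Sketch: invoke Fast Downward's driver script on a PDDL domain/problem pair
// using A* search with the LM-Cut heuristic (an optimal configuration).
// Assumes fast-downward.py is in the working directory and already built;
// domain.pddl and problem.pddl are placeholder file names.
public class RunPlanner {
    public static void main(String[] args) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(
                "./fast-downward.py", "domain.pddl", "problem.pddl",
                "--search", "astar(lmcut())");
        pb.redirectErrorStream(true);
        Process p = pb.start();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(line); // the plan itself is written to sas_plan
            }
        }
        p.waitFor();
    }
}
```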
It also implements most of the known planning heuristics required to solve problems fast or even optimally. Planning requires search, and heuristics make the search "goal-oriented" rather than blind; if the heuristic is admissible and/or monotone (depending on the kind of search algorithm chosen; see "fast forward and pddl: is the computed solution the best?"), the planner is guaranteed to find optimal solutions.
Concerning literature, I suggest reading (or at least skimming) the following two journal articles:
Porteous, J.; Cavazza, M.; and Charles, F. 2010. Applying planning to interactive storytelling: Narrative control using state constraints. ACM Trans. Intell. Syst. Tech. 10:1-10:21.
http://dl.acm.org/citation.cfm?id=1869399
Patrik Haslum. "Narrative Planning: Compilations to Classical Planning". Journal of AI Research, vol. 44, p. 383-395, 2012
http://www.jair.org/papers/paper3602.html
Well, both MBP and IPP are really, really old systems. If you're just looking for a ready-made planner to use off the shelf, I'd suggest following the pointers to the authors (and software) that took part in the last International Planning Competition (2011):
http://www.plg.inf.uc3m.es/ipc2011-deterministic/ParticipatingPlanners.html
I've been working on a project that is a combination of an application server and an object database, and it currently runs on a single machine only. Some time ago I read a paper describing a distributed relational database, and I got some ideas on how to apply them to my project, so that I could make a high-availability version of it running on a cluster using a shared-nothing architecture.
My problem is that I don't have experience designing distributed systems and their protocols; I did not take the advanced CS courses on distributed systems at university. So I'm worried about being able to design a protocol that does not cause deadlock, starvation, split brain, and other problems.
Question: Where can I find good material about designing distributed systems? What methods are there for verifying that a distributed protocol works correctly? Recommendations of books, academic articles, and other resources are welcome.
I learned a lot by looking at what is published about really huge web-based platforms, and especially at how their systems evolved over time to meet their growth.
Here are some examples I found enlightening:
eBay Architecture: A nice history of their architecture and the issues they had. Obviously they can't use much caching for the auctions and bids, so their story differs in that respect from many others. As of 2006, they deployed 100,000 new lines of code every two weeks, and were able to roll back an ongoing deployment if issues arose.
Paper on Google File System: A nice analysis of what they needed, how they implemented it, and how it performs in production use. After reading this, I found it less scary to build parts of the infrastructure myself to meet exactly my needs, if necessary, and saw that such a solution can and probably should be quite simple and straightforward. There is also a lot of interesting material on the net (including YouTube videos) on BigTable and MapReduce, other important parts of Google's architecture.
Inside MySpace: One of the few really huge sites built on the Microsoft stack. You can learn a lot about what not to do with your data layer.
A great place to find many more resources on this topic is the Real Life Architectures section on the "High Scalability" website. For example, they have a good summary of Amazon's architecture.
Learning distributed computing isn't easy. It's really a vast field covering communication, security, reliability, concurrency, etc., each of which would take years to master. Understanding will eventually come through a lot of reading and practical experience. You seem to have a challenging project to start with, so here's your chance :)
The two most popular books on distributed computing are, I believe:
1) Distributed Systems: Concepts and Design - George Coulouris et al.
2) Distributed Systems: Principles and Paradigms - A. S. Tanenbaum and M. Van Steen
Both these books give a very good introduction to current approaches (including communication protocols) used to build successful distributed systems. I've personally used mostly the latter, and I've found it to be an excellent text. If you think the reviews on Amazon aren't very good, it's because most readers compare this book to other books written by A. S. Tanenbaum (who is, IMO, one of the best authors in the field of computer science), which are quite frankly better written.
PS: I really question your need to design and verify a new protocol. If you are working with application servers and databases, what you need is probably already available.
I liked the book Distributed Systems: Principles and Paradigms by Andrew S. Tanenbaum and Maarten van Steen.
At a more abstract and formal level, Communicating and Mobile Systems: The Pi-Calculus by Robin Milner gives a calculus for verifying systems. There are variants of pi-calculus for verifying protocols, such as SPI-calculus (the wikipedia page for which has disappeared since I last looked), and implementations, some of which are also verification tools.
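To give a flavour of what the calculus looks like (reproduced from memory; see Milner's book for the authoritative definitions), here are the core syntax of the monadic pi-calculus and its communication rule:

```latex
% Core syntax: inaction, input, output, parallel composition,
% restriction, and replication.
\[
P ::= 0 \;\mid\; x(y).P \;\mid\; \overline{x}\langle y \rangle.P
    \;\mid\; P_1 \mid P_2 \;\mid\; (\nu x)\,P \;\mid\; !P
\]
% Communication: an output on channel x meets an input on x, and the
% transmitted name z is substituted for y in the receiver's continuation.
\[
\overline{x}\langle z \rangle.P \;\mid\; x(y).Q
    \;\longrightarrow\; P \;\mid\; Q\{z/y\}
\]
```

The ability to pass channel names themselves as messages is what makes the pi-calculus a natural fit for modelling protocols whose communication topology changes at runtime.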
Where can I find good material about designing distributed systems?
I have never been able to finish the famous book by Nancy Lynch. However, I find the book by Sukumar Ghosh, Distributed Systems: An Algorithmic Approach, much easier to read, and it points to the original papers where needed.
Admittedly, I haven't read the books by Gerard Tel and Nicola Santoro. Perhaps they are even easier to read...
What methods are there for verifying that a distributed protocol works correctly?
To survey the possibilities (and also to understand the question itself), I think it is useful to get an overview of the available tools from the book Software Specification Methods.
My final decision was to learn TLA+. Why? The language and tools seem better than the alternatives, but what really made me try TLA+ is the person behind it: Leslie Lamport. That is, not just a prominent figure in distributed systems, but also the author of LaTeX!
You can get the TLA+ book and several examples for free.
There are many classic papers written by Leslie Lamport (http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html) and Edsger Dijkstra (http://www.cs.utexas.edu/users/EWD/).
On the database side, a major trend is the NoSQL movement; many projects are appearing in the market, including CouchDB (couchdb.apache.org), MongoDB, and Cassandra. These all promise scalability and manageability (replication, fault tolerance, high availability).
One good book is Birman's Reliable Distributed Systems, although it has its detractors.
If you want to formally verify your protocol you could look at some of the techniques in Lynch's Distributed Algorithms.
It is likely that whatever protocol you are trying to implement has been designed and analysed before. I'll just plug my own blog, which covers e.g. consensus algorithms.
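To make the split-brain worry from the question concrete: the basic trick consensus protocols use is the majority quorum, since two disjoint partitions can never both hold a strict majority. A minimal sketch (illustrative only, not any particular protocol):

```java
// Illustrative only: a strict-majority quorum check. At most one side of a
// network partition can reach a strict majority of the cluster, so at most
// one side keeps making decisions -- no split brain.
public class Quorum {
    static boolean hasQuorum(int reachableNodes, int clusterSize) {
        return reachableNodes > clusterSize / 2;
    }

    public static void main(String[] args) {
        // A 5-node cluster split 3/2: only the 3-node side may proceed.
        System.out.println(hasQuorum(3, 5)); // true
        System.out.println(hasQuorum(2, 5)); // false
    }
}
```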
Long-time listener, first-time caller.
I'm a full-time SE during the day and a full-time data mining student at night. I've taken the courses and heard what our professors think. Now I come to you, the Stack Overflowers, to bring out the real truth.
What is your favorite data mining algorithm and why? Are there any special techniques you've used that have helped you to be successful in the past?
Thanks!
Most of my professional experience has involved last-minute feature additions like, "Hey, we should add a recommendation system to this e-commerce site." The solution was usually a quick-and-dirty nearest neighbor search (brute force, Euclidean distance, something like the sketch below), doomed to fail if the site ever became popular. But hey, premature optimization and all that...
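For concreteness, here is roughly what that quick-and-dirty approach looks like (a sketch with made-up names): a linear scan over all items with squared Euclidean distance, O(n·d) per query, which is exactly why it stops scaling once the catalogue grows.

```java
// Sketch of brute-force nearest neighbor with (squared) Euclidean distance.
// Squaring preserves the distance ordering, so we can skip the square root.
public class NearestNeighbor {
    static int nearest(double[][] items, double[] query) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < items.length; i++) {
            double d = 0.0;
            for (int j = 0; j < query.length; j++) {
                double diff = items[i][j] - query[j];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best; // index of the closest item, or -1 if items is empty
    }
}
```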
I do enjoy the idea that data mining can be elegant and wonderful. I've followed the Netflix Prize and played with its dataset. In particular, I like the fact that imagination and experimentation have played such a large part in developing the top ten entries:
Acmehill blog
Acmehill New York Times article
Just a guy in a garage blog
Just a guy in a garage Wired article
So mostly, like a lot of software dev, I think the best algorithm is an open mind and some creativity.
There are a lot of data mining algorithms for different tasks, so I found it a little hard to choose.
I would say that my favorite data mining algorithm is Apriori, because it has inspired hundreds of other algorithms and has several applications. The Apriori algorithm itself is quite simple, but it laid the basis for many other algorithms (FP-Growth, PrefixSpan, etc.) that exploit the so-called "Apriori property" (see the sketch below).
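Here is a minimal sketch of the Apriori property in action (the data and names are made up): any superset of an infrequent itemset must itself be infrequent, so items that fail the support threshold on their own are pruned before pairs are ever counted.

```java
import java.util.*;

// Sketch: one level of Apriori. Count 1-itemsets, prune infrequent items,
// then count only those pairs whose members are both frequent.
public class AprioriSketch {
    public static void main(String[] args) {
        List<Set<String>> transactions = List.of(
                Set.of("bread", "milk"),
                Set.of("bread", "butter", "milk"),
                Set.of("butter", "milk"),
                Set.of("bread", "butter"));
        int minSupport = 2;

        // Pass 1: support counts for single items.
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> t : transactions)
            for (String item : t)
                counts.merge(item, 1, Integer::sum);

        // Apriori property: only frequent items can appear in frequent pairs.
        SortedSet<String> frequent = new TreeSet<>();
        counts.forEach((item, c) -> { if (c >= minSupport) frequent.add(item); });

        // Pass 2: count candidate pairs built from frequent items only.
        Map<String, Integer> pairCounts = new TreeMap<>();
        for (Set<String> t : transactions) {
            List<String> present = new ArrayList<>();
            for (String item : frequent)
                if (t.contains(item)) present.add(item);
            for (int i = 0; i < present.size(); i++)
                for (int j = i + 1; j < present.size(); j++)
                    pairCounts.merge(present.get(i) + "," + present.get(j),
                            1, Integer::sum);
        }
        pairCounts.forEach((pair, c) -> {
            if (c >= minSupport)
                System.out.println("{" + pair + "} support=" + c);
        });
    }
}
```

The same pruning idea repeats level by level (triples from frequent pairs, and so on), which is what keeps the candidate space manageable on real data.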
If you can be more specific about the task you want the data mining algorithm to perform (classification, clustering, association rule mining, etc.), we can surely help you further.