Understanding NFS - filesystems

HI
I would like to understand the NFS filesystem in some detail. I came across the book NFS illustrated, unfortunately it is only available as google books so some pages are missing.
Has someone maybe another good ressource which would be a good start to understand NFS at a low level? I am thinking of implementing a Client/Server Framework
Thanks

This is documentation how NFS implemented in the Linux kernel
http://nfs.sourceforge.net/
Other resources are RFC documents
http://tools.ietf.org/

NFS Illustrated is available in dead-tree form - amazon have it in stock (as of a minute ago). The 'Illustrated series are generally pretty good and while I haven't read the NFS one I've got good info from others in the series. If you're considering implementing a full client/server framework for it then it might be worth the money.

As my current field of work is on NFS, I really think you may build some knowledge first on:
Red Hat NFS related Documents:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/ch-nfs.html
Microsoft NFS related Documents:
https://technet.microsoft.com/en-us/library/cc753302(v=ws.10).aspx
After you understand the unmapped structure among AD and Unix(It really took me a long period of time), you may get to build some knowledge on sec=krb /krb5i /krb5p by the MIT`s Original Documents ,these are the core of the NFS.

Related

Apache Flink Dev documentation?

I am interested in learning about how Flink works internally, but I am struggling to find documentation on the internal code (like where is a start point of a job) so I am unable to understand the codebase. Is there documentation or some walkthrough for those who want to contribute to Flink itself?
I find that if you understand how some part of Flink works, the source code is generally understandable. The initial challenge then is to have a correct understanding of the expected behavior of the code. To that end, here are some helpful resources:
The best starting point is Stream Processing with Apache Flink by Fabian Hueske and Vasiliki Kalavri.
Any significant development work done on Flink in recent years has been preceded by a Flink Improvement Proposal. These are probably the best available resource for getting a deeper understanding of specific topics and areas of the code.
The documentation has a section on "Internals" that covers some topics.
And there have been some excellent Flink Forward talks describing how some of the internals work, such as Aljoscha Krettek's talk on the ongoing work to unify batch and streaming, Nico Kruber's talk on the network stack, Stefan Richter's talks on state and checkpointing, Piotr Nowojski's talk on two phase commit sinks, and Addison Higham's talk on operators, among many others.

Flink's Internal Workings & Architecture Sources of Knowledge

What are the best ways to learn how Flink Architecture (both physical and runtime) is organized and to understand its internal workings (distribution, parallelism, etc.), except from directly reading the code?
How much reliable the papers on Stratosphere (Nephele, PACT, etc.) have to be considered for the current state of the art?
Thank you!
The documentation on their home page explains in detail about the core architecture of Flink. Here's link to it. Also as the company data Artisans is actively supporting the flink project, you can have a look at their training sessions as well.
Since Apache Flink has originated from the Stratosphere project itself, there are similarities but things have changed at the implementation level.

dBASE VS Python/Postgres

A manager at work wants my team to use dBASE instead of Python/Postgres for an upcoming web application project. I know dBASE is obsolete, but that by itself isn't convincing enough, since another (unrelated) department has been running it for decades.
The database comparison charts that I found are slightly out of date, or don't mention dBASE at all.
Wikipedia, blogs, and the official marketing page say that dBASE supports SQL, ODBC, that the language is object oriented, and that it can be used to develop web applications.
Can anyone offer any sort of apples to apples comparison of features, performance, reliability, X, between dBASE and Postgres (or even MySQL)?
Can anyone offer a factual explanation of why dBASE isn't a viable solution for modern web applications?
Thanks in advance.
Although dBase is over, and so too (for the most part) is Visual FoxPro (VFP), there are still people out there using it. It is good in that it does not require a "SERVER" based solution as it has it's own much like others that run based on a dll (such as SQLite, Advantage Database and others).
To help you in your argument, and not knowing the facts...
Consider that the file limits are probably still that of a maximum of 2 gig per single file, not entire database, but single file as that is the max of 32-bit systems when dBase (and VFP) were created.
Yes, it can be a fast engine, but what sort of volume / activity are you going to be hit with. As Frank mentioned, where are you going to find someone with strong knowledge with the language, let alone implementing it to the web expertise. So, unless you have it in-house it might be a chore.
What happens for any database table corruptions, which is going to be more reliable and recoverable if there IS an issue.
So, just a few things I would consider to pass on to mgmt.
I've read your question. One of my first projects was in VFP, it was an accounting system (now defunct), and the absolute paramount critical thing you ought to ask is "Why" - Here's why I stopped using it. (1) not scalable, (2) data integrity, (3) ease of hacking, (4) ancient technology, (5) limited "live" resources to help with technology if run into any problems (and you will), (6) no longer supported by the vendor...

FlockDB - What is it? And best cases for it uses

Just came across FlockDB graph database. Details at github /flockDB. Twitter claims it uses FlockDB for the following:
Twitter runs FlockDB on a large cluster of machines.
we use it to store social graphs (who
follows whom, who blocks whom) and
secondary indices at twitter.
At first glance, setup and trying it doesn't look straight forward. Have anyone already used it / setup this? If so, please answer the following general queries.
What kind of applications is it
better suited for? (Twitter claims it
is simple and very rough, it remains
to see what it meant though)
How is FlockDB better than other graph db /
noSQL db. Have you setup FlockDB,
used it for a application?
Early advices any?
Note: I am evaluating the FlockDB and other graph databases mainly for learning them. Perhaps, I will build an application for that.
Flockdb is still Yet to be released by Twitter, which means the current version you are seeing won't run properly. Going by the history of commits i guess within a couple of days you can see a stable version which you can build and test.
Compared to something like Neo4J you can say Flockdb is not even a graph database. The toughest part of a graph database is how many levels of depth it can handle. From the little documentation of Flockdb it seems like it can't handle more than 1 level of depth. Where FlockDb wins compared to DBs like Neo4J is it's low latency, high throughput and inherent distributed nature.
Regarding Applications - i guess it will be a great fit whenever you need social networking or twitter like behavior. I don't think many will find such use cases though (who gets 20k friend requests per sec ?).
I Just started looking into Flockdb. Right now i am planning to use it in my forum software. Instead of user1 follows user2 relationship, i am planning to use it for user1 read post1, user1 favorites post1 etc. Being one of the highly active online communities we get a lot of such traffic(read/favorite). Can't think of any other use cases now.
Don't miss OrientDB. It's a document-graph dbms with special operator for traversing relationships: http://code.google.com/p/orient/wiki/GraphDatabase

How to design and verify distributed systems?

I've been working on a project, which is a combination of an application server and an object database, and is currently running on a single machine only. Some time ago I read a paper which describes a distributed relational database, and got some ideas on how to apply the ideas in that paper to my project, so that I could make a high-availability version of it running on a cluster using a shared-nothing architecture.
My problem is, that I don't have experience on designing distributed systems and their protocols - I did not take the advanced CS courses about distributed systems at university. So I'm worried about being able to design a protocol, which does not cause deadlock, starvation, split brain and other problems.
Question: Where can I find good material about designing distributed systems? What methods there are for verifying that a distributed protocol works right? Recommendations of books, academic articles and others are welcome.
I learned a lot by looking at what is published about really huge web-based plattforms, and especially how their systems evolved over time to meet their growth.
Here a some examples I found enlightening:
eBay Architecture: Nice history of their architecture and the issues they had. Obviously they can't use a lot of caching for the auctions and bids, so their story is different in that point from many others. As of 2006, they deployed 100,000 new lines of code every two weeks - and are able to roll back an ongoing deployment if issues arise.
Paper on Google File System: Nice analysis of what they needed, how they implemented it and how it performs in production use. After reading this, I found it less scary to build parts of the infrastructure myself to meet exactly my needs, if necessary, and that such a solution can and probably should be quite simple and straight-forward. There is also a lot of interesting stuff on the net (including YouTube videos) on BigTable and MapReduce, other important parts of Google's architecture.
Inside MySpace: One of the few really huge sites build on the Microsoft stack. You can learn a lot of what not to do with your data layer.
A great start for finding much more resources on this topic is the Real Life Architectures section on the "High Scalability" web site. For example they a good summary on Amazons architecture.
Learning distributed computing isn't easy. Its really a very vast field covering areas on communication, security, reliability, concurrency etc., each of which would take years to master. Understanding will eventually come through a lot of reading and practical experience. You seem to have a challenging project to start with, so heres your chance :)
The two most popular books on distributed computing are, I believe:
1) Distributed Systems: Concepts and Design - George Coulouris et al.
2) Distributed Systems: Principles and Paradigms - A. S. Tanenbaum and M. Van Steen
Both these books give a very good introduction to current approaches (including communication protocols) that are being used to build successful distributed systems. I've personally used the latter mostly and I've found it to be an excellent text. If you think the reviews on Amazon aren't very good, its because most readers compare this book to other books written by A.S. Tanenbaum (who IMO is one of the best authors in the field of Computer Science) which are quite frankly better written.
PS: I really question your need to design and verify a new protocol. If you are working with application servers and databases, what you need is probably already available.
I liked the book Distributed Systems: Principles and Paradigms by Andrew S. Tanenbaum and Maarten van Steen.
At a more abstract and formal level, Communicating and Mobile Systems: The Pi-Calculus by Robin Milner gives a calculus for verifying systems. There are variants of pi-calculus for verifying protocols, such as SPI-calculus (the wikipedia page for which has disappeared since I last looked), and implementations, some of which are also verification tools.
Where can I find good material about designing distributed systems?
I have never been able to finish the famous book from Nancy Lynch. However, I find that the book from Sukumar Ghosh Distributed Systems: An Algorithmic Approach is much easier to read, and it points to the original papers if needed.
It is nevertheless true that I didn't read the books from Gerard Tel and Nicola Santoro. Perhaps they are still easier to read...
What methods there are for verifying that a distributed protocol works right?
In order to survey the possibilities (and also in order to understand the question), I think that it is useful to get an overview of the possible tools from the book Software Specification Methods.
My final decision was to learn TLA+. Why? Even if the language and tools seem better, I really decided to try TLA+ because the guy behind it is Leslie Lamport. That is, not just a prominent figure on distributed systems, but also the author of Latex!
You can get the TLA+ book and several examples for free.
There are many classic papers written by Leslie Lamport :
(http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html) and Edsger Dijkstra
(http://www.cs.utexas.edu/users/EWD/)
for the database side.
A main stream is NoSQL movement,many project are appearing in the market including CouchDb( couchdb.apache.org) , MongoDB ,Cassandra. These all have the promise of scalability and managability (replication, fault tolerance, high-availability).
One good book is Birman's Reliable Distributed Systems, although it has its detractors.
If you want to formally verify your protocol you could look at some of the techniques in Lynch's Distributed Algorithms.
It is likely that whatever protocol you are trying to implement has been designed and analysed before. I'll just plug my own blog, which covers e.g. consensus algorithms.

Resources