How to use a Bayesian Network to compute conditional probability queries - artificial-intelligence

I am studying about Bayesian Network of my AI courses.
Does anyone know how to calculate causal inference and diagnostic inference in the attached picture?
Bayesian Network Example

There are lots of ways to perform inference from a Bayesian network, the most naive of which is just enumeration.
Enumeration works for both causal inference and diagnostic inference. The difference is finding out how likely the effect is based on evidence of the cause (causal inference) vs finding out how likely the cause is based on evidence of the effect (diagnostic inference).

The answer from Nick Larsen is a good one. I'll elaborate to give a worked solution to your problem since you might be looking for something a little more specific.
Problem 1: P(C|E). What is the probability of having a promising career (C=1) GIVEN the economic environment is positive (E=1)?
We use the factored structure of the Bayes net to write the full joint probability in terms of the factored variables.
Notice that you have just used the law of total probability to introduce the latent variables (S and J) and then marginalise (sum) them out. I have used the 'hat' to refer to not (~ in your question above). Notice too that once you have applied the rule of total probability, the Bayes net does a lot of the hard work for you by allowing you to factor the joint probability into a number of smaller conditional probabilities.
Problem 2: P(E|C). What is the probability that the economic environment is positive (E=1) GIVEN we observe that you have a promising career (C=1)?
Here we actually need to apply Bayes rule in the first line. Notice that you have an annoying normalising constant P(C) that is carried throughout. This term can be solved in much the same way as you solved Problem 1:
The computation of P(C=1|E=1) is solved in problem 1. I have left out the computation for P(C=0|E=1) = 0.5425 out but it is the same process as Problem 1.
Now you are in a position to solve for P(E|C) = .38/.65125 = .583

Related

In a MAX-MIN ant system (MMAS), how does the initial pheromone depend on the best solution if it's not yet been found?

I'm learning how add the max-min ant system into my current ant system. From what I've read the trial pheromone is initialized tMax, tMax is calculated by,
tMax = 1 / best tour length
But how exactly would it be possible to initialize the trail pheromone to tMax if it depends on a tour which doesn't yet exist?
tMin also depends on tMax which also makes it impossible to initialize without a best solution.
In MMAS, all edges are initialised to tauMax, but the definition of tauMax is slightly different from what you've stated above:
tauMax <-- 1 / (rho * bestTourSoFarLength),
where rho is the evaporation rate (typically set to 0.5), and where focus on the tour length is best so far (see below). tauMax is repeatedly updated during algorithm execution, each time the incumbent best tour (so far) is updated.
For initialisation, an initial feasible tour is constructed heuristically. Typically, the nearest neighbour tour for a random starting city is used.
Recall that in the context of stochastic optimisation methods such as Ant Colony Optimization (ACO)/MMAS, we can generally not prove optimality of the best incumbent solution (tour) at algorithm termination (we know from practice, however, that ACO/MMAS does perform well on some set of problems, most notably variations of the travelling salesman problem (TSP)). Hence, in contexts such as these, the term "best solution" non-rigorously denote varied meanings from varied authors; "best so far", "best at algorithm termination" and so on, so be aware of that when reading literature on the subject.
Finally, as a note, tauMin depends---as you've noted---on tauMax, but is typically not updated after initialisation. When imposing pheromone limits, the important "dynamic" part is the monotonically decreasing tauMax, whereas pheromones for most edges will land at the constant tauMin eventually, due to evaporation. A suitable value of tauMin is given by the hideous expression (based on empirical data)
tauMin = tauMax*(1-(0.05)^(1/n))/((n/2-1)*(0.05)^(1/n)).

Neural Network Architecture Design

I'm playing around with Neural Networks trying to understand the best practices for designing their architecture based on the kind of problem you need to solve.
I generated a very simple data set composed of a single convex region as you can see below:
Everything works fine when I use an architecture with L = 1, or L = 2 hidden layers (plus the output layer), but as soon as I add a third hidden layer (L = 3) my performance drops down to slightly better than chance.
I know that the more complexity you add to a network (number of weights and parameters to learn) the more you tend to go towards over-fitting your data, but I believe this is not the nature of my problem for two reasons:
my performance on the Training set is also around 60% (whereas over-fitting typically means you have a very low training error and high test error),
and I have a very large amount of data examples (don't look at the figure that's only a toy figure I uplaoded).
Can anybody help me understand why adding an extra hidden layer gives
me this drop in performances on such a simple task?
Here is an image of my performance as a function of the number of layers used:
ADDED PART DUE TO COMMENTS:
I am using a sigmoid functions assuming values between 0 and 1, L(s) = 1 / 1 + exp(-s)
I am using early stopping (after 40000 iterations of backprop) as a criteria to stop the learning. I know it is not the best way to stop but I thought that it would ok for such a simple classification task, if you believe this is the main reason I'm not converging I I might implement some better criteria.
At least on the surface of it, this appears to be a case of the so-called "vanishing gradient" problem.
Activation functions
Your neurons activate according to the logistic sigmoid function, f(x) = 1 / (1 + e^-x) :
This activation function is used frequently because it has several nice properties. One of these nice properties is that the derivative of f(x) is expressible computationally using the value of the function itself, as f'(x) = f(x)(1 - f(x)). This function has a nonzero value for x near zero, but quickly goes to zero as |x| gets large :
Gradient descent
In a feedforward neural network with logistic activations, the error is typically propagated backwards through the network using the first derivative as a learning signal. The usual update for a weight in your network is proportional to the error attributable to that weight times the current weight value times the derivative of the logistic function.
delta_w(w) ~= w * f'(err(w)) * err(w)
As the product of three potentially very small values, the first derivative in such networks can become small very rapidly if the weights in the network fall outside the "middle" regime of the logistic function's derivative. In addition, this rapidly vanishing derivative becomes exacerbated by adding more layers, because the error in a layer gets "split up" and partitioned out to each unit in the layer. This, in turn, further reduces the gradient in layers below that.
In networks with more than, say, two hidden layers, this can become a serious problem for training the network, since the first-order gradient information will lead you to believe that the weights cannot usefully change.
However, there are some solutions that can help ! The ones I can think of involve changing your learning method to use something more sophisticated than first-order gradient descent, generally incorporating some second-order derivative information.
Momentum
The simplest solution to approximate using some second-order information is to include a momentum term in your network parameter updates. Instead of updating parameters using :
w_new = w_old - learning_rate * delta_w(w_old)
incorporate a momentum term :
w_dir_new = mu * w_dir_old - learning_rate * delta_w(w_old)
w_new = w_old + w_dir_new
Intuitively, you want to use information from past derivatives to help determine whether you want to follow the new derivative entirely (which you can do by setting mu = 0), or to keep going in the direction you were heading on the previous update, tempered by the new gradient information (by setting mu > 0).
You can actually get even better than this by using "Nesterov's Accelerated Gradient" :
w_dir_new = mu * w_dir_old - learning_rate * delta_w(w_old + mu * w_dir_old)
w_new = w_old + w_dir_new
I think the idea here is that instead of computing the derivative at the "old" parameter value w, compute it at what would be the "new" setting for w if you went ahead and moved there according to a standard momentum term. Read more in a neural-networks context here (PDF).
Hessian-Free
The textbook way to incorporate second-order gradient information into your neural network training algorithm is to use Newton's Method to compute the first and second order derivatives of your objective function with respect to the parameters. However, the second order derivative, called the Hessian matrix, is often extremely large and prohibitively expensive to compute.
Instead of computing the entire Hessian, some clever research in the past few years has indicated a way to compute just the values of the Hessian in a particular search direction. You can then use this process to identify a better parameter update than just the first-order gradient.
You can learn more about this by reading through a research paper (PDF) or looking at a sample implementation.
Others
There are many other optimization methods that could be useful for this task -- conjugate gradient (PDF -- definitely worth a read), Levenberg-Marquardt (PDF), L-BFGS -- but from what I've seen in the research literature, momentum and Hessian-free methods seem to be the most common ones.
Because the number of iterations of training required for convergence increases as you add complexity to a neural network, holding the length of training constant while adding layers to a neural network will certainly result in you eventually observing a drop like this. To figure out whether that is the explanation for this particular observation, try increasing the number of iterations of training that you're using and see if it improves. Using a more intelligent stopping criterion is also a good option, but a simple increase in the cut-off will give you answers faster.

Crossover function for genetic

I am writing a Time table generator in java, using AI approaches to satisfy the hard constraints and help find an optimal solution. So far I have implemented and Iterative construction (a most-constrained first heuristic) and Simulated Annealing, and I'm in the process of implementing a genetic algorithm.
Some info on the problem, and how I represent it then :
I have a set of events, rooms , features (that events require and rooms satisfy), students and slots
The problem consists in assigning to each event a slot and a room, such that no student is required to attend two events in one slot, all the rooms assigned fulfill the necessary requirements.
I have a grading function that for each set if assignments grades the soft constraint violations, thus the point is to minimize this.
The way I am implementing the GA is I start with a population generated by the iterative construction (which can leave events unassigned) and then do the normal steps: evaluate, select, cross, mutate and keep the best. Rinse and repeat.
My problem is that my solution appears to improve too little. No matter what I do, the populations tends to a random fitness and is stuck there. Note that this fitness always differ, but nevertheless a lower limit will appear.
I suspect that the problem is in my crossover function, and here is the logic behind it:
Two assignments are randomly chosen to be crossed. Lets call them assignments A and B. For all of B's events do the following procedure (the order B's events are selected is random):
Get the corresponding event in A and compare the assignment. 3 different situations might happen.
If only one of them is unassigned and if it is possible to replicate
the other assignment on the child, this assignment is chosen.
If both of them are assigned, but only one of them creates no
conflicts when assigning to the child, that one is chosen.
If both of them are assigned and none create conflict, on of
them is randomly chosen.
In any other case, the event is left unassigned.
This creates a child with some of the parent's assignments, some of the mother's, so it seems to me it is a valid function. Moreover, it does not break any hard constraints.
As for mutation, I am using the neighboring function of my SA to give me another assignment based on on of the children, and then replacing that child.
So again. With this setup, initial population of 100, the GA runs and always tends to stabilize at some random (high) fitness value. Can someone give me a pointer as to what could I possibly be doing wrong?
Thanks
Edit: Formatting and clear some things
I think GA only makes sense if part of the solution (part of the vector) has a significance as a stand alone part of the solution, so that the crossover function integrates valid parts of a solution between two solution vectors. Much like a certain part of a DNA sequence controls or affects a specific aspect of the individual - eye color is one gene for example. In this problem however the different parts of the solution vector affect each other making the crossover almost meaningless. This results (my guess) in the algorithm converging on a single solution rather quickly with the different crossovers and mutations having only a negative affect on the fitness.
I dont believe GA is the right tool for this problem.
If you could please provide the original problem statement, I will be able to give you a better solution. Here is my answer for the present moment.
A genetic algorithm is not the best tool to satisfy hard constraints. This is an assigment problem that can be solved using integer program, a special case of a linear program.
Linear programs allow users to minimize or maximize some goal modeled by an objective function (grading function). The objective function is defined by the sum of individual decisions (or decision variables) and the value or contribution to the objective function. Linear programs allow for your decision variables to be decimal values, but integer programs force the decision variables to be integer values.
So, what are your decisions? Your decisions are to assign students to slots. And these slots have features which events require and rooms satisfy.
In your case, you want to maximize the number of students that are assigned to a slot.
You also have constraints. In your case, a student may only attend at most one event.
The website below provides a good tutorial on how to model integer programs.
http://people.brunel.ac.uk/~mastjjb/jeb/or/moreip.html
For a java specific implementation, use the link below.
http://javailp.sourceforge.net/
SolverFactory factory = new SolverFactoryLpSolve(); // use lp_solve
factory.setParameter(Solver.VERBOSE, 0);
factory.setParameter(Solver.TIMEOUT, 100); // set timeout to 100 seconds
/**
* Constructing a Problem:
* Maximize: 143x+60y
* Subject to:
* 120x+210y <= 15000
* 110x+30y <= 4000
* x+y <= 75
*
* With x,y being integers
*
*/
Problem problem = new Problem();
Linear linear = new Linear();
linear.add(143, "x");
linear.add(60, "y");
problem.setObjective(linear, OptType.MAX);
linear = new Linear();
linear.add(120, "x");
linear.add(210, "y");
problem.add(linear, "<=", 15000);
linear = new Linear();
linear.add(110, "x");
linear.add(30, "y");
problem.add(linear, "<=", 4000);
linear = new Linear();
linear.add(1, "x");
linear.add(1, "y");
problem.add(linear, "<=", 75);
problem.setVarType("x", Integer.class);
problem.setVarType("y", Integer.class);
Solver solver = factory.get(); // you should use this solver only once for one problem
Result result = solver.solve(problem);
System.out.println(result);
/**
* Extend the problem with x <= 16 and solve it again
*/
problem.setVarUpperBound("x", 16);
solver = factory.get();
result = solver.solve(problem);
System.out.println(result);
// Results in the following output:
// Objective: 6266.0 {y=52, x=22}
// Objective: 5828.0 {y=59, x=16}
I would start by measuring what's going on directly. For example, what fraction of the assignments are falling under your "any other case" catch-all and therefore doing nothing?
Also, while we can't really tell from the information given, it doesn't seem any of your moves can do a "swap", which may be a problem. If a schedule is tightly constrained, then once you find something feasible, it's likely that you won't be able to just move a class from room A to room B, as room B will be in use. You'd need to consider ways of moving a class from A to B along with moving a class from B to A.
You can also sometimes improve things by allowing constraints to be violated. Instead of forbidding crossover from ever violating a constraint, you can allow it, but penalize the fitness in proportion to the "badness" of the violation.
Finally, it's possible that your other operators are the problem as well. If your selection and replacement operators are too aggressive, you can converge very quickly to something that's only slightly better than where you started. Once you converge, it's very difficult for mutations alone to kick you back out into a productive search.
I think there is nothing wrong with GA for this problem, some people just hate Genetic Algorithms no matter what.
Here is what I would check:
First you mention that your GA stabilizes at a random "High" fitness value, but isn't this a good thing? Does "high" fitness correspond to good or bad in your case? It is possible you are favoring "High" fitness in one part of your code and "Low" fitness in another thus causing the seemingly random result.
I think you want to be a bit more careful about the logic behind your crossover operation. Basically there are many situations for all 3 cases where making any of those choices would not cause an increase in fitness at all of the crossed-over individual, but you are still using a "resource" (an assignment that could potentially be used for another class/student/etc.) I realize that a GA traditionally will make assignments via crossover that cause worse behavior, but you are already performing a bit of computation in the crossover phase anyway, why not choose one that actually will improve fitness or maybe don't cross at all?
Optional Comment to Consider : Although your iterative construction approach is quite interesting, this may cause you to have an overly complex Gene representation that could be causing problems with your crossover. Is it possible to model a single individual solution as an array (or 2D array) of bits or integers? Even if the array turns out to be very long, it may be worth it use a more simple crossover procedure. I recommend Googling "ga gene representation time tabling" you may find an approach that you like more and can more easily scale to many individuals (100 is a rather small population size for a GA, but I understand you are still testing, also how many generations?).
One final note, I am not sure what language you are working in but if it is Java and you don't NEED to code the GA by hand I would recommend taking a look at ECJ. Maybe even if you have to code by hand, it could help you develop your representation or breeding pipeline.
Newcomers to GA can make any of a number of standard mistakes:
In general, when doing crossover, make sure that the child has some chance of inheriting that which made the parent or parents winner(s) in the first place. In other words, choose a genome representation where the "gene" fragments of the genome have meaningful mappings to the problem statement. A common mistake is to encode everything as a bitvector and then, in crossover, to split the bitvector at random places, splitting up the good thing the bitvector represented and thereby destroying the thing that made the individual float to the top as a good candidate. A vector of (limited) integers is likely to be a better choice, where integers can be replaced by mutation but not by crossover. Not preserving something (doesn't have to be 100%, but it has to be some aspect) of what made parents winners means you are essentially doing random search, which will perform no better than linear search.
In general, use much less mutation than you might think. Mutation is there mainly to keep some diversity in the population. If your initial population doesn't contain anything with a fractional advantage, then your population is too small for the problem at hand and a high mutation rate will, in general, not help.
In this specific case, your crossover function is too complicated. Do not ever put constraints aimed at keeping all solutions valid into the crossover. Instead the crossover function should be free to generate invalid solutions and it is the job of the goal function to somewhat (not totally) penalize the invalid solutions. If your GA works, then the final answers will not contain any invalid assignments, provided 100% valid assignments are at all possible. Insisting on validity in the crossover prevents valid solutions from taking shortcuts through invalid solutions to other and better valid solutions.
I would recommend anyone who thinks they have written a poorly performing GA to conduct the following test: Run the GA a few times, and note the number of generations it took to reach an acceptable result. Then replace the winner selection step and goal function (whatever you use - tournament, ranking, etc) with a random choice, and run it again. If you still converge roughly at the same speed as with the real evaluator/goal function then you didn't actually have a functioning GA. Many people who say GAs don't work have made some mistake in their code which means the GA converges as slowly as random search which is enough to turn anyone off from the technique.

To which weight is the correction added in a perceptron?

I'm experimenting with single-layer perceptrons, and I think I understand (mostly) everything. However, what I don't understand is to which weights the correction (learning rate*error) should be added. In the examples I've seen it seems arbitrary.
Well, it looks like you half answered your own question: its true you correct all of the non-zero weights, you don't correct all by the same amount.
Instead, you correct the weights in proportion to their incoming activation, so if unit X activated really strongly and unit Y activated just a lil bit, and there was a large error, then the weight going from unit X to the output would be corrected far more than unit Y's weights-to-output.
The technical term for this process is called the delta rule, and its details can be found in its wiki article. Additionally, if you ever want to upgrade using to multilayer perceptrons (single layer perceptrons are very limited in computational power, see a discussion of Minsky and Papert's argument against using them here), an analogous learning algorithm called back propogation is discussed here.
Answered my own question.
According to http://intsys.mgt.qub.ac.uk/notes/perceptr.html, "add this correction to any weight for which there was an input". In other words, do not add the correction to weights whose neurons had a value of 0.

What is fuzzy logic?

I'm working with a couple of AI algorithms at school and I find people use the words Fuzzy Logic to explain any situation that they can solve with a couple of cases. When I go back to the books I just read about how instead of a state going from On to Off it's a diagonal line and something can be in both states but in different "levels".
I've read the wikipedia entry and a couple of tutorials and even programmed stuff that "uses fuzzy logic" (an edge detector and a 1-wheel self-controlled robot) and still I find it very confusing going from Theory to Code... for you, in the less complicated definition, what is fuzzy logic?
Fuzzy logic is logic where state membership is, essentially, a float with range 0..1 instead of an int 0 or 1. The mileage you get out of it is that things like, for example, the changes you make in a control system are somewhat naturally more fine-tuned than what you'd get with naive binary logic.
An example might be logic that throttles back system activity based on active TCP connections. Say you define "a little bit too many" TCP connections on your machine as 1000 and "a lot too many" as 2000. At any given time, your system has a "too many TCP connections" state from 0 (<= 1000) to 1 (>= 2000), which you can use as a coefficient in applying whatever throttling mechanisms you have available. This is much more forgiving and responsive to system behavior than naive binary logic that only knows how to determine "too many", and throttle completely, or "not too many", and not throttle at all.
I'd like to add to the answers (that have been modded up) that, a good way to visualize fuzzy logic is follows:
Traditionally, with binary logic you would have a graph whose membership function is true or false whereas in a fuzzy logic system, the membership function is not.
1|
| /\
| / \
| / \
0|/ \
------------
a b c d
Assume for a second that the function is "likes peanuts"
a. kinda likes peanuts
b. really likes peanuts
c. kinda likes peanuts
d. doesn't like peanuts
The function itself doesn't have to be triangular and often isn't (it's just easier with ascii art).
A fuzzy system will likely have many of these, some even overlapping (even opposites) like so:
1| A B
| /\ /\ A = Likes Peanuts
| / \/ \ B = Doesn't Like Peanuts
| / /\ \
0|/ / \ \
------------
a b c d
so now c is "kind likes peanuts, kinda doesn't like peanuts" and d is "really doesn't like peanuts"
And you can program accordingly based on that info.
Hope this helps for the visual learners out there.
The best definition of fuzzy logic is given by its inventor Lotfi Zadeh:
“Fuzzy logic means of representing problems to computers in a way akin to the way human solve them and the essence of fuzzy logic is that everything is a matter of degree.”
The meaning of solving problems with computers akin to the way human solve can easily be explained with a simple example from a basketball game; if a player wants to guard another player firstly he should consider how tall he is and how his playing skills are. Simply if the player that he wants to guard is tall and plays very slow relative to him then he will use his instinct to determine to consider if he should guard that player as there is an uncertainty for him. In this example the important point is the properties are relative to the player and there is a degree for the height and playing skill for the rival player. Fuzzy logic provides a deterministic way for this uncertain situation.
There are some steps to process the fuzzy logic (Figure-1). These steps are; firstly fuzzification where crisp inputs get converted to fuzzy inputs secondly these inputs get processed with fuzzy rules to create fuzzy output and lastly defuzzification which results with degree of result as in fuzzy logic there can be more than one result with different degrees.
Figure 1 – Fuzzy Process Steps (David M. Bourg P.192)
To exemplify the fuzzy process steps, the previous basketball game situation could be used. As mentioned in the example the rival player is tall with 1.87 meters which is quite tall relative to our player and can dribble with 3 m/s which is slow relative to our player. Addition to these data some rules are needed to consider which are called fuzzy rules such as;
if player is short but not fast then guard,
if player is fast but not short then don’t guard
If player is tall then don’t guard
If player is average tall and average fast guard
Figure 2 – how tall
Figure 3- how fast
According to the rules and the input data an output will be created by fuzzy system such as; the degree for guard is 0.7, degree for sometimes guard is 0.4 and never guard is 0.2.
Figure 4-output fuzzy sets
On the last step, defuzzication, is using for creating a crisp output which is a number which may determine the energy that we should use to guard the player during game. The centre of mass is a common method to create the output. On this phase the weights to calculate the mean point is totally depends on the implementation. On this application it is considered to give high weight to guard or not guard but low weight given to sometimes guard. (David M. Bourg, 2004)
Figure 5- fuzzy output (David M. Bourg P.204)
Output = [0.7 * (-10) + 0.4 * 1 + 0.2 * 10] / (0.7 + 0.4 + 0.2) ≈ -3.5
As a result fuzzy logic is using under uncertainty to make a decision and to find out the degree of decision. The problem of fuzzy logic is as the number of inputs increase the number of rules increase exponential.
For more information and its possible application in a game I wrote a little article check this out
To build off of chaos' answer, a formal logic is nothing but an inductively defined set that maps sentences to a valuation. At least, that's how a model theorist thinks of logic. In the case of a sentential boolean logic:
(basis clause) For all A, v(A) in {0,1}
(iterative) For the following connectives,
v(!A) = 1 - v(A)
v(A & B) = min{v(A), v(B)}
v(A | B) = max{v(A), v(B)}
(closure) All sentences in a boolean sentential logic are evaluated per above.
A fuzzy logic changes would be inductively defined:
(basis clause) For all A, v(A) between [0,1]
(iterative) For the following connectives,
v(!A) = 1 - v(A)
v(A & B) = min{v(A), v(B)}
v(A | B) = max{v(A), v(B)}
(closure) All sentences in a fuzzy sentential logic are evaluated per above.
Notice the only difference in the underlying logic is the permission to evaluate a sentence as having the "truth value" of 0.5. An important question for a fuzzy logic model is the threshold that counts for truth satisfaction. This is to ask: for a valuation v(A), for what value D it is the case the v(A) > D means that A is satisfied.
If you really want to found out more about non-classical logics like fuzzy logic, I would recommend either An Introduction to Non-Classical Logic: From If to Is or Possibilities and Paradox
Putting my coder hat back on, I would be careful with the use of fuzzy logic in real world programming, because of the tendency for a fuzzy logic to be undecidable. Maybe it's too much complexity for little gain. For instance a supervaluational logic may do just fine to help a program model vagueness. Or maybe probability would be good enough. In short, I need to be convinced that the domain model dovetails with a fuzzy logic.
Maybe an example clears up what the benefits can be:
Let's say you want to make a thermostat and you want it to be 24 degrees.
This is how you'd implement it using boolean logic:
Rule1: heat up at full power when
it's colder than 21 degrees.
Rule2:
cool down at full power when it's
warmer than 27 degrees.
Such a system will only once and a while be 24 degrees, and it will be very inefficient.
Now, using fuzzy logic, it would be like something like this:
Rule1: For each degree that it's colder than 24 degrees, turn up the heater one notch (0 at 24).
Rule2: For each degree that it's warmer than 24 degress, turn up the cooler one notch (0 at 24).
This system will always be somewhere around 24 degrees, and it only once and will only once and a while make a tiny adjustment. It will also be more energy-efficient.
Well, you could read the works of Bart Kosko, one of the 'founding fathers'. 'Fuzzy Thinking: The New Science of Fuzzy Logic' from 1994 is readable (and available quite cheaply secondhand via Amazon). Apparently, he has a newer book 'Noise' from 2006 which is also quite approachable.
Basically though (in my paraphrase - not having read the first of those books for several years now), fuzzy logic is about how to deal with the world where something is perhaps 10% cool, 50% warm, and 10% hot, where different decisions may be made on the degree to which the different states are true (and no, it wasn't entirely an accident that those percentages don't add up to 100% - though I'd accept correction if needed).
A very good explanation, with a help of Fuzzy Logic Washing Machines.
I know what you mean about it being difficult to go from concept to code. I'm writing a scoring system that looks at the values of sysinfo and /proc on Linux systems and comes up with a number between 0 and 10, 10 being the absolute worst. A simple example:
You have 3 load averages (1, 5, 15 minute) with (at least) three possible states, good, getting bad, bad. Expanding that, you could have six possible states per average, adding 'about to' to the three that I just noted. Yet, the result of all 18 possibilities can only deduct 1 from the score. Repeat that with swap consumed, actual VM allocated (committed) memory and other stuff .. and you have one big bowl of conditional spaghetti :)
Its as much a definition as it is an art, how you implement the decision making process is always more interesting than the paradigm itself .. whereas in a boolean world, its rather cut and dry.
It would be very easy for me to say if load1 < 2 deduct 1, but not very accurate at all.
If you can teach a program to do what you would do when evaluating some set of circumstances and keep the code readable, you have implemented a good example of fuzzy logic.
Fuzzy Logic is a problem-solving methodology that lends itself to implementation in systems ranging from simple, small, embedded micro-controllers to large, networked, multi-channel PC or workstation-based data acquisition and control systems. It can be implemented in hardware, software, or a combination of both. Fuzzy Logic provides a simple way to arrive at a definite conclusion based upon vague, ambiguous, imprecise, noisy, or missing input information. Fuzzy Logic approach to control problems mimics how a person would make decisions, only much faster.
Fuzzy logic has proved to be particularly useful in expert system and other artificial intelligence applications. It is also used in some spell checkers to suggest a list of probable words to replace a misspelled one.
To learn more, just check out: http://en.wikipedia.org/wiki/Fuzzy_logic.
The following is sort of an empirical answer.
A simple (possibly simplistic answer) is that "fuzzy logic" is any logic that returns values other than straight true / false, or 1 / 0. There are a lot of variations on this and they tend to be highly domain specific.
For example, in my previous life I did search engines that used "content similarity searching" as opposed to then common "boolean search". Our similarity system used the Cosine Coefficient of weighted-attribute vectors representing the query and the documents and produced values in the range 0..1. Users would supply "relevance feedback" which was used to shift the query vector in the direction of desirable documents. This is somewhat related to the training done in certain AI systems where the logic gets "rewarded" or "punished" for results of trial runs.
Right now Netflix is running a competition to find a better suggestion algorithm for their company. See http://www.netflixprize.com/. Effectively all of the algorithms could be characterized as "fuzzy logic"
Fuzzy logic is calculating algorithm based on human like way of thinking. It is particularly useful when there is a large number of input variables. One online fuzzy logic calculator for two variables input is given:
http://www.cirvirlab.com/simulation/fuzzy_logic_calculator.php

Resources