What is one good for that the other's not in practice? I understand the theory of what they do, but what are their limitations and capabilities in practical use? I'm considering Drools vs a Java Prolog implementation for a new AI project, but I'm open to other suggestions. What are some popular approaches, or alternatives, for inferencing over a complicated relational data set?
Backward chaining (a la Prolog) is more like finding what initial conditions form a path to your goal. At a very basic level it is a backward search from your goal to find conditions that will fulfil it.
Backward chaining is used for interrogative applications (finding items that fulfil certain criteria) - one commercial example of a backward chaining application might be finding which insurance policies are covered by a particular reinsurance contract.
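To make that concrete, here is a minimal sketch of the backward-chaining control flow in plain Python (not Prolog or Drools syntax; the reinsurance rules and facts are invented):

```python
# Minimal backward-chaining sketch (hypothetical rules, plain Python).
# A rule maps a goal to alternative lists of conditions that establish it.
RULES = {
    "covered_by_reinsurance": [["policy_is_property", "policy_in_region_EU"]],
    "policy_is_property":     [["policy_class_fire"], ["policy_class_flood"]],
}

FACTS = {"policy_class_fire", "policy_in_region_EU"}

def prove(goal, facts=FACTS, rules=RULES):
    """Work backwards from the goal to the facts that support it."""
    if goal in facts:                       # goal is already a known fact
        return True
    for conditions in rules.get(goal, []):  # try each rule that concludes the goal
        if all(prove(c, facts, rules) for c in conditions):
            return True
    return False

print(prove("covered_by_reinsurance"))      # True: fire -> property, plus EU region
```

(A real engine would also handle variables, cycles and multiple answers; this only shows the goal-directed search.)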
Forward chaining (a la CLIPS) matches conditions and then generates inferences from those conditions. These conditions can in turn match other rules. Basically, this takes a set of initial conditions and then draws all inferences it can from those conditions.
The inferences (if asserted) can also be actions or events that can trigger external actions. This is useful in event driven systems, as the rule sets can be configured to (for example) initiate a workflow or some other action. This type of rule engine is the most commonly used in commercial applications.
Event driven systems are a common application of forward chaining rule engines. One example of a forward chaining application might be a telecoms plan provisioning engine (typically used for administering mobile phone plans). Entering a particular user with a particular plan will trigger a range of items to be set up in various phone switches, billing systems, financials, CRM systems etc.
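And a matching minimal sketch of forward chaining in plain Python (the provisioning rules are invented; in a real engine each inference could also trigger an external action such as a workflow step):

```python
# Minimal forward-chaining sketch (hypothetical provisioning rules, plain Python).
# Each rule is (conditions, conclusion); keep firing rules until nothing new is asserted.
RULES = [
    ({"user_created", "plan_gold"}, "provision_switch"),
    ({"plan_gold"},                 "enable_roaming"),
    ({"provision_switch"},          "create_billing_account"),
]

facts = {"user_created", "plan_gold"}

changed = True
while changed:
    changed = False
    for conditions, conclusion in RULES:
        if conditions <= facts and conclusion not in facts:
            facts.add(conclusion)   # draw the inference (could also fire an external action)
            changed = True

print(sorted(facts))                # the two initial facts plus the three inferred ones
```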
Concerned's answer is very good. When asked to boil the difference down to a sound bite, I usually say something like:
Lots of Output Hypotheses + Lots of Data Up Front => Use Forward Chaining
Fewer Output Hypotheses + Must Query for Data => Use Backward Chaining
But it's just a rule of thumb, not a commandment.
In the old old old old expert systems days they used to say forward chaining was good for looking around (checking for what could be) while backward chaining was good for confirming (checking if "it" really is).
Think configuration (forward chaining, e.g. XCON [1]) and medical diagnosis (backward chaining, e.g. MYCIN [2]).
[1] http://www.aaai.org/Papers/AAAI/1980/AAAI80-076.pdf
[2] https://www.amazon.com/Rule-Based-Expert-Systems-Addison-Wesley/dp/0201101726
Forward chaining is concerned with the question "what will happen next?", while backward chaining looks at the question "why did this happen?".
An example of forward chaining is predicting whether the share market's status has an effect on changes in interest rates.
An example of backward chaining is the diagnosing of blood cancer in humans.
Simply put, forward chaining is mainly used for predicting future outcomes while backward chaining is mainly used for analyzing historical data.
Related
I'm working on a flight data analysis project. The flight data is represented in a tabular format: every quarter of a second we have the status of different parameters, including turboreactor parameters and avionic parameters. I intend to use an expert system to analyse the flight data in order to detect anomalies during the flight; for example, T4 (temperature) shouldn't exceed 750 °C for more than 30 seconds. Is the expert system architecture appropriate for such a task?
Every expert system consists of a knowledge base and an inference engine.
If you are going to use the expert system architecture:
you have to make sure that you can gather that knowledge from factual and heuristic knowledge. It takes the form of rules, mostly consisting of an IF part and a THEN part.
how you apply these rules is defined by the inference engine - the problem-solving model, where the common paradigm is chaining of IF-THEN rules (e.g. forward chaining and backward chaining).
Now, answering your question: to me your example looks like the specification of a discrete cyber-physical system (depending on the other specifications it could be considered hybrid too). A cyber-physical system can also be viewed as a state machine: a system that exists in a limited number of conditions, has forbidden states, and progresses from one state to the next according to a fixed set of rules. In addition, if your example included possible input and output events, you could design Moore or Mealy machines, Petri nets, or statecharts of your state machine from the specifications and then use formal verification techniques to verify it.
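For illustration, here is a minimal sketch of the T4 rule from your question expressed as a check over the quarter-second samples; the threshold and duration come from your example, everything else (names, sample data) is made up:

```python
# Hypothetical sketch: checking "T4 must not exceed 750 °C for more than 30 s"
# over quarter-second samples. Parameter and variable names are made up.
SAMPLE_PERIOD_S = 0.25
LIMIT_C = 750.0
MAX_DURATION_S = 30.0

def t4_violations(t4_samples):
    """Yield (start_index, end_index) of runs where T4 stayed above the limit too long."""
    run_start = None
    for i, t4 in enumerate(t4_samples):
        if t4 > LIMIT_C:
            if run_start is None:
                run_start = i
            if (i - run_start + 1) * SAMPLE_PERIOD_S > MAX_DURATION_S:
                yield run_start, i          # rule fired: anomaly detected
                run_start = None            # reset and keep scanning
        else:
            run_start = None

# 125 s of invented data: 40 s above the limit starting at t = 10 s
samples = [700.0] * 40 + [760.0] * 160 + [700.0] * 300
print(list(t4_violations(samples)))
```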
At first this problem seems trivial: given two ontologies, which term in ontology A best corresponds to a term in ontology B?
But its simplicity is deceptive: this problem is extremely hard and has so far led to thousands of academic publications, without any consensus on how to solve it.
Naively, one would expect that simply looking at the term "Heart Attack" in both ontologies would suffice.
However, ontologies almost never encode the same phrase.
In simple cases "Heart Attack" might be coded as "Heart Attacks", or "Heart attack (non-fatal)", but in more complicated cases it might only be coded as "Myocardial infarction".
In other cases it is even more complicated, for example dealing with compound (composed) terms.
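To illustrate how far naive string similarity gets you, here is a small sketch using only the Python standard library (difflib); it handles the trivial variants but not the synonym case:

```python
# Illustration: plain string similarity only covers the easy cases.
from difflib import SequenceMatcher

def sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(sim("Heart Attack", "Heart Attacks"))             # very high: trivial variant
print(sim("Heart Attack", "Heart attack (non-fatal)"))  # smaller overlap
print(sim("Heart Attack", "Myocardial infarction"))     # much lower: same concept, different words
```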
More importantly, simply matching the term (or string) ignores the "ontological structure".
What if "Heart Attack" in ontology A is coded as caused-by high blood pressure, whereas in ontology B it might be coded as withdrawl-from-trial-non-fatal.
In this case it might be valid to match the two terms, but not trivially so.
And this assumes the equivalent term exists at all.
It's a classical problem called Semantic/Ontology Matching, Alignment, or Harmonization. The research out there involves lexical similarity, term usage in free text, graph homomorphisms, curated mappings (like MeSH/WordNet), topic modeling, and logical inference (first- or higher-order logic). But which is the most user-friendly and production-ready solution that can be integrated into a Java (or Clojure) or Python app? I've looked at "Ontology matching: A literature review" but they don't seem to recommend anything ... any suggestions or experiences?
Have a look at http://oaei.ontologymatching.org/2014/results/ . There were several tracks open for matchers to be sent in and be evaluated. Not every matcher participates in every track. So you might want to read the track descriptions and pick one that seems to be the most similar to your problem. For example if you don't have to deal with multiple languages you probably don't have to check the MultiFarm track. After that check the results by having a look at Recall, Precision and F-Measure and decide for yourself. You also might want to check out some earlier years.
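If you end up scoring a matcher against your own reference alignment, the metrics themselves are simple to compute; a minimal sketch in Python (the term pairs are invented):

```python
# Sketch: scoring a produced alignment against a reference alignment with
# Precision, Recall and F-Measure. The term pairs below are invented.
reference = {("Heart Attack", "Myocardial infarction"),
             ("Hypertension", "High blood pressure")}
produced  = {("Heart Attack", "Myocardial infarction"),
             ("Heart Attack", "Heart Attacks")}

true_positives = len(produced & reference)
precision = true_positives / len(produced)    # how many produced matches are correct
recall    = true_positives / len(reference)   # how many reference matches were found
f_measure = 2 * precision * recall / (precision + recall)

print(precision, recall, f_measure)           # 0.5 0.5 0.5
```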
I have experience with neural networks, specifically back-propagating ones, and I know that when a hidden layer is introduced, dependencies between the inputs passed to the trainer become part of the resulting model's knowledge.
Is the same true for decision networks?
I have found information about these algorithms (ID3, etc.) somewhat hard to find. I have been able to find the actual algorithms, but information such as expected/optimal dataset formats and other overviews is rare.
Thanks.
Decision trees are actually very easy to provide data to, because all they need is a table of data and an indication of which column (feature) you want to predict. That data can be discrete or continuous for any feature. There are several flavors of decision trees with different support for continuous and discrete values, and they work differently, so understanding how each one works can be challenging.
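For example, with scikit-learn's DecisionTreeClassifier (a CART-style tree) the whole input really is just a table plus the target column; note that scikit-learn wants numeric values, so categorical columns would need encoding first. The data below is invented:

```python
# Sketch: a decision tree just needs a table plus the target column.
from sklearn.tree import DecisionTreeClassifier

# Table: [age, income, owns_home (0/1)] -> target: bought_product (0/1)
X = [[25,  30000, 0],
     [40,  80000, 1],
     [35,  52000, 1],
     [23,  20000, 0],
     [52, 110000, 1]]
y = [0, 1, 1, 0, 1]

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(clf.predict([[30, 60000, 1]]))   # predicted class for a new row
```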
See also: Different decision tree algorithms with comparison of complexity or performance.
Depending on the type of algorithm you are interested in, it can be hard to find information without reading the actual papers if you want to try and implement it. I've implemented the CART algorithm, and the only option for that was to find the original 200-page book about it. Most other treatments only discuss ideas like splitting in enough detail, but fail to discuss any other aspect at more than a high level.
As for whether they take the dependencies between features into account: I believe they only assume dependence between each input feature and the prediction feature. If an input were independent of the prediction feature you couldn't use it as a split criterion. But between the input features themselves, I believe they are assumed to be independent of each other. I'd have to check the book to be sure, but off the top of my head I think that's true.
Questions
I want to classify/categorize/cluster/group together a set of several thousand websites. There's data that we can train on, so we can do supervised learning, but it's not data that we've gathered and we're not adamant about using it -- so we're also considering unsupervised learning.
What features can I use in a machine learning algorithm to deal with multilingual data? Note that some of these languages might not have been dealt with in the Natural Language Processing field.
If I were to use an unsupervised learning algorithm, should I just partition the data by language and deal with each language differently? Different languages might have different relevant categories (or not, depending on your psycholinguistic theoretical tendencies), which might affect the decision to partition.
I was thinking of using decision trees, or maybe Support Vector Machines (SVMs) to allow for more features (from my understanding of them). This post suggests random forests instead of SVMs. Any thoughts?
Pragmatic approaches are welcome! (Theoretical ones, too, but those might be saved for later fun.)
Some context
We are trying to classify a corpus of many thousands of websites in 3 to 5 languages (maybe up to 10, but we're not sure).
We have training data in the form of hundreds of websites already classified. However, we may choose to use that data set or not -- if other categories make more sense, we're open to not using the training data that we have, since it is not something we gathered in the first place. We are in the final stages of scraping data/text from websites.
Now we must decide on the issues above. I have done some work with the Brown Corpus and the Brill tagger, but this will not work because of the multiple-languages issue.
We intend to use the Orange machine learning package.
According to the context you have provided, this is a supervised learning problem.
Therefore, you are doing classification, not clustering. If I misunderstood, please update your question to say so.
I would start with the simplest features: tokenize the Unicode text of the pages, use a dictionary to translate every new token to a number, and simply consider the existence of a token as a feature.
Next, I would use the simplest algorithm I can - I tend to go with Naive Bayes, but if you have an easy way to run SVM this is also nice.
Compare your results with some baseline - say assigning the most frequent class to all the pages.
Is the simplest approach good enough? If not, start iterating over algorithms and features.
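A sketch of that recipe with scikit-learn, if you can use it; the CountVectorizer plays the role of the token-to-number dictionary, binary=True gives token-existence features, and the DummyClassifier is the most-frequent-class baseline (page texts and labels are invented):

```python
# Sketch: token-existence features + Naive Bayes, compared against a majority-class baseline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.dummy import DummyClassifier

pages  = ["cheap flights and hotel deals", "latest football scores and results",
          "book your holiday apartment",   "match report and league table",
          "compare hotel prices for your holiday"]
labels = ["travel", "sport", "travel", "sport", "travel"]

vec = CountVectorizer(binary=True)        # builds the token -> column dictionary
X = vec.fit_transform(pages)

model = BernoulliNB().fit(X, labels)
baseline = DummyClassifier(strategy="most_frequent").fit(X, labels)

test = vec.transform(["cheap holiday flights"])
print(model.predict(test))                # expected: ['travel']
print(baseline.predict(test))             # always the majority class in the training labels
```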
If you go the supervised route, then the fact that the web pages are in multiple languages shouldn't make a difference. If you go with, say, lexical features (bag-o'-words style) then each language will end up yielding disjoint sets of features, but that's okay. All of the standard algorithms will likely give comparable results, so just pick one and go with it. I agree with Yuval that Naive Bayes is a good place to start, and only if that doesn't meet your needs should you try something like SVMs or random forests.
If you go the unsupervised route, though, the fact that the texts aren't all in the same language might be a big problem. Any reasonable clustering algorithm will first group the texts by language, and then within each language cluster by something like topic (if you're using content words as features). Whether that's a bug or a feature will depend entirely on why you want to classify these texts. If the point is to group documents by topic, irrespective of language, then it's no good. But if you're okay with having different categories for each language, then yeah, you've just got as many separate classification problems as you have languages.
If you do want a unified set of classes, then you'll need some way to link similar documents across languages. Are there any documents in more than one language? If so, you could use them as a kind of statistical Rosetta Stone, to link words in different languages. Then, using something like Latent Semantic Analysis, you could extend that to second-order relations: words in different languages that don't ever occur in the same document, but which tend to co-occur with words which do. Or maybe you could use something like anchor text or properties of the URLs to assign a rough classification to documents in a language-independent manner and use that as a way to get started.
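For what it's worth, here is a toy sketch of that Rosetta Stone idea using LSA (TruncatedSVD in scikit-learn): parallel texts are concatenated into single documents so words from both languages co-occur, and related words then end up close in the latent space. The snippets are invented and far too small to be meaningful; they only show the mechanics:

```python
# Toy sketch of LSA over concatenated parallel documents.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

parallel_docs = [
    "the dog barks loudly " + "le chien aboie fort",            # same document, two languages
    "the cat sleeps all day " + "le chat dort toute la journee",
    "the dog chases the cat " + "le chien poursuit le chat",
]

vec = CountVectorizer()
X = vec.fit_transform(parallel_docs)
svd = TruncatedSVD(n_components=2, random_state=0).fit(X)

terms = vec.vocabulary_

def word_vec(w):
    return svd.components_[:, terms[w]]    # latent-space vector for a word

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(word_vec("dog"), word_vec("chien")))   # high: they always co-occur
print(cosine(word_vec("dog"), word_vec("chat")))    # lower
```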
But, honestly, it seems strange to go into a classification problem without a clear idea of what the classes are (or at least what would count as a good classification). Coming up with the classes is the hard part, and it's the part that'll determine whether the project is a success or failure. The actual algorithmic part is fairly rote.
The main answer is: try different approaches. Without actual testing it's very hard to predict which method will give the best results. So I'll just suggest some methods that I would try first and describe their pros and cons.
First of all, I would recommend supervised learning. Even if the data classification is not very accurate, it may still give better results than unsupervised clustering. One of the reasons is the number of random factors used during clustering. For example, the k-means algorithm relies on randomly selected points when starting the process, which can lead to very different results on different runs (though the x-means modification seems to normalize this behavior). Clustering will give good results only if the underlying elements produce well-separated areas in the feature space.
One approach to handling multilingual data is to use multilingual resources as support points. For example, you can index some Wikipedia articles and create "bridges" between the same topics in different languages. Alternatively, you can create a multilingual association dictionary as this paper describes.
As for methods, the first thing that comes to mind is instance-based semantic methods like LSI. It uses the vector space model to calculate distances between words and/or documents. In contrast to other methods, it can efficiently handle synonymy and polysemy. The disadvantages of this method are its computational inefficiency and the lack of implementations. One of the phases of LSI uses a very big co-occurrence matrix, which for a large corpus of documents will require distributed computing and other special treatment. There's a modification of LSA called Random Indexing which does not construct the full co-occurrence matrix, but you'll hardly find an appropriate implementation of it. Some time ago I created a library in Clojure for this method, but it is pre-alpha now, so I can't recommend using it. Nevertheless, if you decide to give it a try, you can find the project 'Clinch' by the user 'faithlessfriend' on GitHub (I won't post a direct link to avoid unnecessary advertisement).
Beyond specialized semantic methods, the rule "simplicity first" applies. From this point of view, Naive Bayes is the right place to start. The only note here is that the multinomial version of Naive Bayes is preferable: my experience is that the count of words really does matter.
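To illustrate the multinomial point, a small variation of the earlier bag-of-words sketch that keeps raw word counts instead of binary presence (documents and labels are invented):

```python
# Sketch: multinomial Naive Bayes over raw token counts, so repeated words carry weight.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs   = ["goal goal goal penalty referee", "flight hotel beach hotel",
          "referee booked the striker",      "cheap hotel near the beach"]
labels = ["sport", "travel", "sport", "travel"]

vec = CountVectorizer()                   # keeps counts, not just presence
counts = vec.fit_transform(docs)
model = MultinomialNB().fit(counts, labels)

print(model.predict(vec.transform(["hotel hotel hotel goal"])))  # counts tip it to 'travel'
```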
SVM is a technique for classifying linearly separable data, and text data is almost always not linearly separable (at least several common words appear in any pair of documents). That doesn't mean SVM cannot be used for text classification - you should still try it, but the results may be much lower than for other machine learning tasks.
I don't have enough experience with decision trees, but using them for efficient text classification seems strange to me. I have seen some examples where they gave excellent results, but when I tried to use the C4.5 algorithm for this task, the results were terrible. I believe you should get some software where decision trees are implemented and test them yourself. It is always better to know than to suggest.
There's much more to say on every topic, so feel free to ask more questions on specific topic.
Do you have an example or an explanation of ANFIS (Adaptive Neuro-Fuzzy Inference System)? I have read that it could be applied to classify some diseases. What do you think about it?
Usually, in order to develop a fuzzy system you have to determine the if-then rules, suitable membership functions, and their parameters. This is not always a trivial task; in particular, developing correct if-then rules can be time-consuming, as we first have to "extract" the expert knowledge somehow.
This is where ANFIS comes into play: under certain circumstances it can automatically determine suitable parameters for the membership functions. This is the case in particular when we already have a set of input and related output variables and values. Like an artificial neural network, the ANFIS system is able to adapt its nodes and the connections between them "automatically".
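To make the structure concrete, here is a rough NumPy sketch of the forward pass of a tiny first-order Sugeno ANFIS with one input and two rules; all parameters here are made up by hand, whereas a real ANFIS would tune them from the input/output data (e.g. by backpropagation and least squares):

```python
# Rough sketch of a first-order Sugeno ANFIS forward pass (one input, two rules).
import numpy as np

def gauss(x, c, sigma):
    return np.exp(-((x - c) ** 2) / (2 * sigma ** 2))

def anfis_forward(x):
    # Layer 1: fuzzify the input ("low" and "high" membership functions, made-up parameters)
    mu_low, mu_high = gauss(x, c=0.0, sigma=1.0), gauss(x, c=3.0, sigma=1.0)
    # Layers 2-3: rule firing strengths, normalized
    w = np.array([mu_low, mu_high])
    w_bar = w / w.sum()
    # Layer 4: first-order (linear) rule consequents, made-up parameters
    f = np.array([0.5 * x + 1.0,     # rule 1: IF x is low  THEN f1 = 0.5x + 1
                  2.0 * x - 1.0])    # rule 2: IF x is high THEN f2 = 2x - 1
    # Layer 5: weighted sum of the consequents is the crisp output
    return float(np.dot(w_bar, f))

for x in (0.0, 1.5, 3.0):
    print(x, anfis_forward(x))
```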
To your question: you could of course create an ANFIS system for your disease classification, as long as you already have input and output data available for training. But it's not necessarily tied to such systems; you can see ANFIS more as an approach usable under the circumstances mentioned above than as a tool for a specific problem. It all depends on the requirements for the system you want to create, as well as the known (external) preconditions...
Hope that helps!
As Matthias said, ANFIS is not tied to a particular problem; you can use it on the basis of the problem's requirements. But where should you use ANFIS? You can use it with any problem where something is ambiguous.
Actually this is the property of a FIS (Fuzzy Inference System); the adaptive part comes into play as Matthias explained.
For example, take the famous classification problem: assigning an input to a class is not always perfectly determined; it is somewhat ambiguous. So using ANFIS there may give better results than other classification algorithms, depending on whether or not you are able to model the system correctly with ANFIS.
But using ANFIS is computationally expensive compared to other, non-fuzzy approaches. To make the FIS model your problem well, you add the "AN" (adaptive network) part to it, but this only makes the membership function selection adaptive. What about the if-then rules? For those you have to do unsupervised rule selection from the complete set of possible rules (this is basically a kind of unsupervised clustering problem, where you are trying to group all the rules whose effect would be the same).
So far I have found material from Monash University that explains ANFIS (based on the guide to Matlab's Fuzzy Logic Toolbox):
The fuzzy inference system that we have considered is a model that maps:
input characteristics to input membership functions,
input membership functions to rules,
rules to a set of output characteristics,
output characteristics to output membership functions, and
the output membership function to a single-valued output, or a decision associated with the output.
Yes, it can be used for disease classification.
The idea of ANFIS is to combine a fuzzy system with the architecture of an ANN. Because of this, ANFIS has two main benefits.
First, you can use fuzzy variables, which support linguistic variables and are a good fit for the disease symptoms commonly used as system inputs (example input: pain level = low, mid, high; see the sketch at the end of this answer).
Second, since the architecture is mapped to ANN layers, ANFIS can run a training process that aims to produce more accurate results (e.g. using the backpropagation method).
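To illustrate the first point about linguistic variables, here is a rough sketch of "pain level" as a fuzzy input with made-up triangular membership functions:

```python
# Sketch: a linguistic variable "pain level" on a 0-10 scale with three made-up
# triangular membership functions; this is the kind of input an ANFIS would fuzzify.
def tri(x, a, b, c):
    """Triangular membership with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def pain_memberships(score):
    return {"low":  tri(score, -1, 0, 5),
            "mid":  tri(score,  2, 5, 8),
            "high": tri(score,  5, 10, 11)}

print(pain_memberships(6))   # partly 'mid', partly 'high'
```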