In what sequence is cluster analysis done? - database

First, find the minimum frequent patterns in the database.
Then divide them into data types such as interval-based, binary, and ordinal variables, and define a distance measure for each variable type.
Finally, apply a cluster analysis method.
Is this sequence right, or am I missing something?

Whether you're right or not depends on what you want to do. The general approach you describe seems to go in the right direction, but you'll never know if you're on target until you answer the following questions:
What is your data?
What are you trying to find, and which clustering method do you want to use?
From what you describe, it seems to me that you want to do 'preprocessing' steps like feature selection and vectorization. Unfortunately, this by itself can be quite challenging. For example, one of the biggest practical problems is the design of a distance function (there's a tremendous amount of research available on this).
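To make the distance-function point concrete, here is a minimal sketch of a Gower-style distance over mixed variable types. This is one common choice that I am swapping in for illustration, not something the question or this answer prescribes. The `schema` layout, the variable names, and the value ranges are all made up.

```python
def mixed_distance(a, b, schema):
    """Average per-variable distance in [0, 1] between records a and b.

    schema maps a variable name to ("interval", (lo, hi)),
    ("binary", None) or ("ordinal", [levels in order]).
    """
    total = 0.0
    for name, (kind, info) in schema.items():
        if kind == "interval":
            lo, hi = info
            # Scale the absolute difference by the variable's range.
            total += abs(a[name] - b[name]) / (hi - lo)
        elif kind == "binary":
            # Simple mismatch: 0 if equal, 1 otherwise.
            total += 0.0 if a[name] == b[name] else 1.0
        elif kind == "ordinal":
            # Replace each level by its rank, then scale like an interval variable.
            ranks = {level: i for i, level in enumerate(info)}
            total += abs(ranks[a[name]] - ranks[b[name]]) / (len(info) - 1)
        else:
            raise ValueError(f"unknown variable type: {kind}")
    return total / len(schema)


# Hypothetical schema and records, for illustration only.
schema = {
    "age": ("interval", (0, 100)),
    "smoker": ("binary", None),
    "education": ("ordinal", ["primary", "secondary", "tertiary"]),
}
p = {"age": 30, "smoker": True, "education": "primary"}
q = {"age": 50, "smoker": False, "education": "tertiary"}
d = mixed_distance(p, q, schema)
```

Whether an unweighted average like this is appropriate depends entirely on the data, which is why the questions above matter.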
So, please give us more information on your specific target application.

How can I automate this task? (automatic contouring for radiotherapy)

I've just started my residency as a radiation oncologist. I have a little background in programming (Python, VBA).
I'd like your insights on an issue I have at work.
The issue: For each patient, the radiation oncologist needs to do a contouring. Basically, he contours the main structures (like the aorta, the heart, the lungs, and so on) on a CT scan. This is essential for computing the spatial distribution of the radiation (because you want to avoid those structures). The contouring is done within third-party software (called Isogray). The CT scans come from the hospital database, and the radiation distribution is computed in another program.
It takes at least one hour to do a complete contouring. Multiply that by the number of patients (maybe a dozen per week) and by the number of oncologists (we are a team of 15) and you can see that it represents hundreds (maybe even thousands) of man-hours every year.
Software exists that does this automatically, but the hospital doesn't want to rent or buy it. But, seriously, how hard can it be to do a little automation? Can't I do this myself?
My plan of action: Here I'd like your insights. How can I automate this task? The first thing is that I can't change anything within Isogray, so I need to do the automation externally. What I think I should do:
Create a database of the historical contourings: this means I need to be able to read the files Isogray produces as output
Design an automatic model: I'm thinking of deep learning models here. I don't know if there's anything better than training a deep learning model on the contoured CT scans I already have
Create a little program: based on the automatic model, it will take a 'not contoured' Isogray file and turn it into a 'contoured' file. The oncologist only needs to load the new file into Isogray and validate the contouring
What do you think? Do you see an easier way to do that? I don't know anything about Isogray (I just know how to use it). Do you think this is doable? What information do I need before I start this project?
Any insights will be welcomed :)
From what I have understood it is a problem of semantic segmentation.
You have an input image with N channels (or a single one, for black and white) and you use the neural network to indicate which regions correspond to a specific organ.
You can use an architecture like the U-Net for this task: https://medium.com/#keremturgutlu/semantic-segmentation-u-net-part-1-d8d6f6005066
What I do not know is whether the degree of reliability would be very high; that depends on many factors.
Neural networks look for differentiating patterns to discriminate zones, and the most important cues are shape and color. That is why it is more difficult when both the color and the shape vary widely.
You will also need a lot of images, but you can use a process called data augmentation to generate additional (artificial) ones.
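As a rough illustration of data augmentation for segmentation, the sketch below applies the same flip to an image and to its mask, so the labels stay aligned with the pixels. The tiny nested-list "image" and the helper names are invented for the example. Real CT data would be arrays read from the scanner files, and whether a flip is anatomically plausible is something you would have to judge yourself.

```python
def horizontal_flip(grid):
    """Mirror each row left to right."""
    return [list(reversed(row)) for row in grid]


def vertical_flip(grid):
    """Reverse the order of the rows."""
    return [list(row) for row in reversed(grid)]


def augment(image, mask):
    """Return the original pair plus flipped copies.

    The image and its mask always receive the same transform;
    otherwise the contour would no longer match the pixels.
    """
    pairs = [(image, mask)]
    for transform in (horizontal_flip, vertical_flip):
        pairs.append((transform(image), transform(mask)))
    return pairs


# Toy 2x3 "scan" and its organ mask (1 = organ, 0 = background).
image = [[0, 10, 200],
         [0, 12, 210]]
mask = [[0, 0, 1],
        [0, 0, 1]]
pairs = augment(image, mask)
```

Small rotations, shifts, and intensity changes are usually added in the same way, always transforming the image and mask together.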
Another method currently in use is to work in reverse: we know that image segmentation is a difficult problem, but you can design a program that simulates realistic images for which the segmentation is known perfectly.
These are only some key points; I hope I have helped you.
EDIT:
Semantic segmentation in a biomedical context: https://towardsdatascience.com/review-u-net-biomedical-image-segmentation-d02bf06ca760
You need to provide more background on the specifics of the contouring, especially given that this is for medical diagnosis. Truthfully, I wouldn't try to automate this, for liability reasons.
If you make an error, it could cause a misdiagnosis, which as you already know can lead to numerous problems, including lawsuits and death. The nice thing about third-party products is that they have already been tested robustly against numerous scenarios and approved for medical use.
I'm pretty sure you could make a master's thesis out of something like this.
That being said, there is a nice GitHub repo for problems like this that I think you could use to start generating ideas.

How to find similarity for large number of features

I'm not sure if I am asking this question in the right place, as I'm new to Stack Overflow; please move it if required.
I'm trying to solve a link prediction problem for the Flickr dataset. My dataset has 5K nodes and each node has around 27K features; it is sparse.
I want to find the similarity between nodes so that I can predict a link between them if the similarity value is greater than some threshold that I decide. The problem is the number of features. I cannot load the file in Weka (to try to reduce the features by information gain or something similar, and then try clustering or check a cosine similarity measure).
One more problem is how to define this as a classification problem. I wanted to find overlapping tags for two nodes, so the table contains the nodes and some of their features (there will be thousands), and all rows will be in the positive class only, as I know that there is a link between them.
I want to create a test data set with some of the nodes, create a similar table, and label the rows as positive or negative class. But my problem is that all the data I have is positive, so I think the classifier would never be able to label anything as negative. How do I turn this into a classification problem correctly?
Any pointers or help is very much appreciated.
Weka can deal with 27K features; it shouldn't be a problem... However, I would approach this not as a classification problem but as a link-discovery one, which in this case can be seen as a matching problem.
My approach would be:
1. new node appears
2. search for the most similar elements
3. assume they are related (there is a link) if the similarity is greater than your threshold.
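The three steps above can be sketched in a few lines. This is only an illustration under assumptions of mine: each node is stored as a sparse `{feature: weight}` dict, cosine similarity is the measure, and the node names, tags, and the 0.5 threshold are invented.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_links(new_node, known_nodes, threshold):
    """Names of known nodes whose similarity to new_node exceeds threshold,
    most similar first."""
    scored = [(cosine(new_node, feats), name) for name, feats in known_nodes.items()]
    scored.sort(reverse=True)
    return [name for score, name in scored if score > threshold]


# Hypothetical nodes described by tag features.
known = {
    "a": {"beach": 1, "sunset": 1},
    "b": {"cat": 1},
    "c": {"beach": 1, "sunset": 1, "sea": 1},
}
new = {"beach": 1, "sunset": 1, "sea": 1}
links = predict_links(new, known, threshold=0.5)
```

Only the non-zero features are stored, so 27K sparse features per node cost little memory here; comparing a new node against all 5K nodes is a linear scan.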
The main problem would be to tune the threshold based on some quality measure.
For this approach, Lucene would probably be the best option.
I hope this helps.

Does a Decision Network / Decision Forest take into account relationships between inputs

I have experience dealing with neural networks, specifically back-propagating ones, and I know that when a hidden layer is introduced, dependencies between the inputs passed to the trainer become part of the resulting model's knowledge.
Is the same true for decision networks?
I have found information about these algorithms (ID3, etc.) somewhat hard to find. I have been able to find the actual algorithms, but information such as expected/optimal dataset formats and other overviews is rare.
Thanks.
Decision trees are actually very easy to provide data to, because all they need is a table of data and an indication of which feature (column) in that table you want to predict. The data can be discrete or continuous for any feature. There are several flavors of decision trees with different support for continuous and discrete values, and they work differently, so understanding how each one works can be challenging.
Different decision tree algorithms with comparison of complexity or performance
Depending on the type of algorithm you are interested in, it can be hard to find information without reading the actual papers if you want to implement it. I've implemented the CART algorithm, and the only option for that was to find the original 200-page book about it. Most other treatments discuss only ideas like splitting in enough detail, and fail to discuss any other aspect at more than a high level.
As for whether they take into account the dependencies between things: I believe a tree only assumes dependence between each input feature and the prediction feature. If an input were independent of the prediction feature, you couldn't use it as a split criterion. But I believe the input features themselves must be independent of each other. I'd have to check the book to be sure, but off the top of my head I think that's true.
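The split-criterion remark can be illustrated with the information gain measure that ID3 uses. The sketch below is not taken from the CART book; the table, column names, and labels are invented. It shows that a column independent of the target has zero gain, while a column that determines the target has maximal gain.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy of the labels minus the weighted entropy after splitting on feature."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy table: "windy" determines the label, "coin" is independent of it.
rows = [
    {"windy": True, "coin": "H"},
    {"windy": True, "coin": "T"},
    {"windy": False, "coin": "H"},
    {"windy": False, "coin": "T"},
]
labels = ["stay", "stay", "play", "play"]

gain_windy = information_gain(rows, labels, "windy")
gain_coin = information_gain(rows, labels, "coin")
```

Here `gain_windy` comes out as one full bit and `gain_coin` as zero, so a greedy tree builder would split on `windy` and never pick `coin` at the root.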

Feature selection and unsupervised learning for multilingual data + machine learning algorithm selection

Questions
I want to classify/categorize/cluster/group together a set of several thousand websites. There's data that we can train on, so we can do supervised learning, but it's not data that we've gathered and we're not adamant about using it -- so we're also considering unsupervised learning.
What features can I use in a machine learning algorithm to deal with multilingual data? Note that some of these languages might not have been dealt with in the Natural Language Processing field.
If I were to use an unsupervised learning algorithm, should I just partition the data by language and deal with each language differently? Different languages might have different relevant categories (or not, depending on your psycholinguistic theoretical tendencies), which might affect the decision to partition.
I was thinking of using decision trees, or maybe Support Vector Machines (SVMs) to allow for more features (from my understanding of them). This post suggests random forests instead of SVMs. Any thoughts?
Pragmatic approaches are welcome! (Theoretical ones, too, but those might be saved for later fun.)
Some context
We are trying to classify a corpus of many thousands of websites in 3 to 5 languages (maybe up to 10, but we're not sure).
We have training data in the form of hundreds of websites already classified. However, we may choose to use that data set or not -- if other categories make more sense, we're open to not using the training data that we have, since it is not something we gathered in the first place. We are on the final stages of scraping data/text from websites.
Now we must decide on the issues above. I have done some work with the Brown Corpus and the Brill tagger, but this will not work because of the multiple-languages issue.
We intend to use the Orange machine learning package.
According to the context you have provided, this is a supervised learning problem.
Therefore, you are doing classification, not clustering. If I misunderstood, please update your question to say so.
I would start with the simplest features: tokenize the Unicode text of the pages, use a dictionary to translate every new token to a number, and simply consider the existence of a token as a feature.
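A minimal sketch of that feature extraction, assuming whitespace tokenization is good enough (it will not be for languages that do not separate words with spaces). The function names and the three sample pages are made up.

```python
def tokenize(text):
    # str.split() with no argument splits on any Unicode whitespace.
    return text.lower().split()


def build_vocabulary(documents):
    """Map every token to the next free integer, in order of first appearance."""
    vocabulary = {}
    for text in documents:
        for token in tokenize(text):
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary


def presence_features(text, vocabulary):
    """Set of token ids present in text; tokens outside the vocabulary are ignored."""
    return {vocabulary[t] for t in tokenize(text) if t in vocabulary}


docs = ["el gato negro", "the black cat", "the cat"]
vocab = build_vocabulary(docs)
features = [sorted(presence_features(d, vocab)) for d in docs]
```

Note that the Spanish and English pages end up with disjoint feature ids, which is fine for a supervised classifier trained on labelled pages in each language.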
Next, I would use the simplest algorithm I can - I tend to go with Naive Bayes, but if you have an easy way to run SVM this is also nice.
Compare your results with some baseline - say assigning the most frequent class to all the pages.
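The majority-class baseline takes only a few lines. The labels below are invented; in practice they would be the categories of your training and held-out pages.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Predict the most frequent training label for every test item."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, hits / len(test_labels)


train = ["news", "shop", "news", "blog", "news"]
test = ["news", "shop", "news", "news"]
label, accuracy = majority_baseline_accuracy(train, test)
```

Any classifier that cannot beat this number on the same test pages is not learning anything useful from the features.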
Is the simplest approach good enough? If not, start iterating over algorithms and features.
If you go the supervised route, then the fact that the web pages are in multiple languages shouldn't make a difference. If you go with, say, lexical features (bag-o'-words style), then each language will end up yielding disjoint sets of features, but that's okay. All of the standard algorithms will likely give comparable results, so just pick one and go with it. I agree with Yuval that Naive Bayes is a good place to start; only if that doesn't meet your needs should you try something like SVMs or random forests.
If you go the unsupervised route, though, the fact that the texts aren't all in the same language might be a big problem. Any reasonable clustering algorithm will first group the texts by language, and then within each language cluster by something like topic (if you're using content words as features). Whether that's a bug or a feature will depend entirely on why you want to classify these texts. If the point is to group documents by topic, irrespective of language, then it's no good. But if you're okay with having different categories for each language, then yeah, you've just got as many separate classification problems as you have languages.
If you do want a unified set of classes, then you'll need some way to link similar documents across languages. Are there any documents in more than one language? If so, you could use them as a kind of statistical Rosetta Stone, to link words in different languages. Then, using something like Latent Semantic Analysis, you could extend that to second-order relations: words in different languages that don't ever occur in the same document, but which tend to co-occur with words which do. Or maybe you could use something like anchor text or properties of the URLs to assign a rough classification to documents in a language-independent manner and use that as a way to get started.
But, honestly, it seems strange to go into a classification problem without a clear idea of what the classes are (or at least what would count as a good classification). Coming up with the classes is the hard part, and it's the part that'll determine whether the project is a success or failure. The actual algorithmic part is fairly rote.
The main answer is: try different approaches. Without actual testing it's very hard to predict which method will give the best results. So I'll just suggest some methods that I would try first and describe their pros and cons.
First of all, I would recommend supervised learning. Even if the data classification is not very accurate, it may still give better results than unsupervised clustering. One reason is the number of random factors involved in clustering. For example, the k-means algorithm starts from randomly selected points, which can lead to very different results between runs (though the x-means modification seems to normalize this behavior). Clustering will give good results only if the underlying elements form well-separated regions in the feature space.
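To see the dependence on the random start, here is a small sketch of k-means on one-dimensional points, run from several seeds. Keeping the run with the lowest within-cluster sum of squares is a common mitigation that I am adding here; it is not part of the original suggestion. The points, the seeds, and the function name are made up.

```python
import random


def kmeans_1d(points, k, seed, iterations=50):
    """Plain Lloyd iterations on 1-D data; returns (sorted centers, inertia)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random start: the source of run-to-run variation
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    inertia = sum(min((p - c) ** 2 for c in centers) for p in points)
    return sorted(centers), inertia


points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
runs = [kmeans_1d(points, k=3, seed=s) for s in range(10)]
best_centers, best_inertia = min(runs, key=lambda run: run[1])
```

Printing `runs` shows whether different seeds settled on different centers; on real, less well-separated data the spread is usually larger.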
One approach to treating multilingual data is to use multilingual resources as support points. For example, you can index some Wikipedia articles and create "bridges" between the same topics in different languages. Alternatively, you can create a multilingual association dictionary, as this paper describes.
As for methods, the first thing that comes to mind is instance-based semantic methods like LSI. It uses the vector space model to calculate the distance between words and/or documents. In contrast to other methods, it can treat synonymy and polysemy efficiently. The disadvantages of this method are computational inefficiency and a lack of implementations. One of the phases of LSI makes use of a very big co-occurrence matrix, which for a large corpus of documents will require distributed computing and other special treatment. There's a modification of LSA called Random Indexing which does not construct the full co-occurrence matrix, but you'll hardly find an appropriate implementation of it. Some time ago I created a library in Clojure for this method, but it is pre-alpha now, so I can't recommend using it. Nevertheless, if you decide to give it a try, you can find the project 'Clinch' by user 'faithlessfriend' on GitHub (I won't post a direct link, to avoid unnecessary advertisement).
Beyond special semantic methods, the rule "simplicity first" must be used. From this point of view, Naive Bayes is the right place to start. The only note here is that the multinomial version of Naive Bayes is preferable: my experience is that word counts really do matter.
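For reference, a multinomial Naive Bayes classifier is short enough to sketch directly. This version uses word counts (not just presence) and add-one smoothing, with whitespace tokenization as a simplifying assumption; the training sentences and labels are invented.

```python
from collections import Counter, defaultdict
from math import log


def train_multinomial_nb(documents, labels):
    """Collect per-class document counts and per-class word counts."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(documents, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)  # counts, not just presence
        vocabulary.update(tokens)
    return class_docs, word_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log-probability for text."""
    class_docs, word_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, float("-inf")
    for label in class_docs:
        score = log(class_docs[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # skip words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen class/word pairs.
            score += log((word_counts[label][token] + 1)
                         / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


docs = ["cheap cheap offer buy", "buy now cheap",
        "match score goal", "goal goal team"]
labels = ["shop", "shop", "sport", "sport"]
model = train_multinomial_nb(docs, labels)
prediction = classify("cheap offer", model)
```

A repeated word in the input contributes its log-probability once per occurrence, which is exactly where the multinomial variant differs from the presence-only one.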
SVM is a technique for classifying linearly separable data, and text data is almost never linearly separable (at least several common words appear in any pair of documents). That doesn't mean SVM cannot be used for text classification (you should still try it), but results may be much worse than for other machine learning tasks.
I don't have enough experience with decision trees, but using them for efficient text classification seems strange to me. I have seen some examples where they gave excellent results, but when I tried to use the C4.5 algorithm for this task, the results were terrible. I believe you should get some software in which decision trees are implemented and test them yourself. It is always better to know than to guess.
There's much more to say on every topic, so feel free to ask more questions on a specific topic.

Artificial Intelligence - Intelligent Agent that cleans and paints

I remember when I was in college we went over some problem where there was a smart agent that was on a grid of squares and it had to clean the squares. It was awarded points for cleaning. It also was deducted points for moving. It had to refuel every now and then and at the end it got a final score based on how many squares on the grid were dirty or clean.
I'm trying to study that problem since it was very interesting when I saw it in college; however, I cannot find anything on Wikipedia or anywhere else online. Is there a specific name for that problem that you know about? Or maybe it was just something my teacher came up with for the class.
I'm searching for AI cleaning agent and similar things, but I don't find anything. I don't know, I'm thinking maybe it has some other name.
If you know where I can find more information about this problem I would appreciate it. Thanks.
Perhaps a "stigmergy" approach is closely related to your problem. There is a starting point here, and you can find something by searching for "dead ants" and "robots" on Google Scholar.
Basically: instead of modelling a precise strategy, you work toward a probabilistic approach. Ants (probably) collect their dead by piling them up according to a simple rule such as "if there is a pile of dead ants there, I bring this corpse hither; otherwise, I'll make a new pile". You can start by simplifying your 'cleaning' situation with that, and see where you go.
Also, I think (another?) suitable approach could be modelled with a Genetic Algorithm using a carefully chosen combination of fitness functions such as:
the end number of 'clean' tiles
the number of steps made by the robot
of course, if the robot 'dies' of starvation it automatically removes itself from the gene pool, à la Darwin Awards :)
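The criteria listed above could be folded into one score along these lines. The linear form and the `step_cost` weight are assumptions of mine and would need tuning against your actual simulation.

```python
def fitness(clean_tiles, steps, ran_out_of_fuel, step_cost=0.1):
    """Score one simulated run of a robot.

    step_cost is a hypothetical weight: how many clean tiles one step "costs".
    """
    if ran_out_of_fuel:
        # The robot "died": it contributes nothing to the next generation.
        return 0.0
    return max(0.0, clean_tiles - step_cost * steps)


# Two made-up runs: same number of clean tiles, different path lengths.
efficient = fitness(clean_tiles=10, steps=20, ran_out_of_fuel=False)
wasteful = fitness(clean_tiles=10, steps=60, ran_out_of_fuel=False)
stranded = fitness(clean_tiles=25, steps=40, ran_out_of_fuel=True)
```

With these weights the efficient run scores higher than the wasteful one, and the stranded robot scores zero however many tiles it cleaned.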
You could start by modelling a very, very simple genotype that will be 'computed' into a behaviour. Consider using a simple GA such as this one by Inman Harvey, then to each gene assign either a part of the strategy, or a complete behaviour. E.g.: if gene A is turned to 1 then the robot will try to wander randomly; if gene B is also turned to 1, then it will give priority to self-charging unless there are dirty tiles at distance X. Or use floats and model probability. Your mileage may vary but I can assure you it will be fun :)
The problem is reminiscent of Shakey, although there's cleaning involved (which is like the Roomba -- a device that can also be programmed to perform these very tasks).
If the "problem space" (or room) is small enough, you can solve for an optimal solution using a simple A*-based search, but likely it won't be, since that wouldn't make for a very interesting problem.
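For a small room, the A* idea might look like the sketch below. It makes several simplifying assumptions that are mine, not the original problem's: the grid has no obstacles, every move costs 1, a square is cleaned as soon as the robot stands on it, and fuel is ignored. The heuristic is the number of squares still dirty, which never overestimates because one move cleans at most one square.

```python
import heapq
from itertools import count


def plan_cleaning(width, height, start, dirty):
    """Shortest sequence of squares that visits every dirty square."""
    tie = count()  # tie-breaker so the heap never has to compare states
    start_state = (start, frozenset(dirty) - {start})
    # Heap entries: (cost + heuristic, tie-breaker, cost, state, path)
    frontier = [(len(start_state[1]), next(tie), 0, start_state, [start])]
    best_cost = {start_state: 0}
    while frontier:
        _, _, cost, state, path = heapq.heappop(frontier)
        pos, remaining = state
        if not remaining:
            return path  # nothing left to clean
        if cost > best_cost[state]:
            continue  # stale entry; a cheaper route to this state was found
        x, y = pos
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not (0 <= nxt[0] < width and 0 <= nxt[1] < height):
                continue
            new_state = (nxt, remaining - {nxt})
            new_cost = cost + 1
            if new_cost < best_cost.get(new_state, float("inf")):
                best_cost[new_state] = new_cost
                priority = new_cost + len(new_state[1])
                heapq.heappush(frontier,
                               (priority, next(tie), new_cost, new_state, path + [nxt]))
    return None


# 2x2 room, everything dirty, robot starts in a corner.
route = plan_cleaning(2, 2, (0, 0), [(0, 0), (1, 0), (0, 1), (1, 1)])
```

The state includes the set of dirty squares, so the search space grows exponentially with the number of dirty squares, which is why this only works for small rooms.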
The machine learning approach suggested here using genetic algorithms is an interesting approach. Given the problem domain you would only have one "rule" (a move-to action, since clean could be eliminated by implicitly cleaning any square you move to that is dirty) so your learner would essentially be learning how to move around an environment. The problem there would be to build a learner that would be adaptable to any given floor plan, instead of just becoming proficient at cleaning a very specific space.
Whatever approach you have, I'd also consider doing a further meta-reasoning step if the problem sets are big enough, and use a partition approach to divide the floor up into separate areas and then conquering them one at a time.
Can you use techniques to create data to use "offline"? In that case, I'd even consider creating a "database" of optimal routes to take to clean certain floor spaces (1x1 up to, say, 5x5) that include all possible start and end squares. This is similar to the "endgame databases" that game AIs use to effectively "solve" games once they reach a certain depth (cf. Chinook).
This problem reminds me of this. A similar problem is briefly mentioned in the book Complexity as an example of a genetic algorithm. These versions are simplified, though: they don't take fuel consumption into account.