Apriori algorithm - finding associations in production data

I have problem with finding "correct" associations within production data.
The data looks like this
A;B;C;D;E;F;G
1;0;1;0;0;0;0
0;1;0;0;0;0;0
0;0;0;1;0;0;0
0;0;1;0;1;0;0
1;0;0;0;0;0;0
0;0;0;0;0;1;0
0;0;0;0;0;0;1
1;0;1;0;0;0;0
(Of course I have a lot more steps and rows)
Where A, B, C, etc. are production steps. 0 means that a worker did not perform this production step and 1 means that this step was performed by a worker. For example, the first row - 1;0;1;0;0;0;0 - means that steps A & C were performed at the same time by a worker. And the second row - 0;1;0;0;0;0;0 - means that (perhaps another worker) performed only production step B.
So it happens that some of the production steps are usually performed simultaneously by the same worker, just like steps A & C in the example above (2 out of 3 times they occur together). In order to find which steps tend to be performed together, I applied the Apriori algorithm.
I hoped to receive an answer like "If there is a 1 in column A, it is likely that a 1 will appear in column C". But instead, the Apriori algorithm found "cool" rules which basically say that there are a lot of 0s in the table. The rules found were like this: "If there is a 0 in columns A and G, it is likely that there is a 0 in column E" - thanks, Sherlock.
I need this algorithm to focus on rules about where the 1s are in the table, not the 0s. Basically, any rule that looks at 0s can be ignored. I just want rules that look at 1s, because I want to know which production steps tend to be performed together; I don't care which production steps are not performed together (the 0s), because obviously the majority of steps are not performed simultaneously.
Does anybody have some idea how to find associations between 1s instead of 0s?
I use Weka software to do the data mining.

Apriori has no notion of what the labels represent; they are just strings.
Have you tried the -Z option, which treats the first label of each attribute as missing?
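If switching tools is an option, here is a minimal sketch using the mlxtend Python library (my suggestion, not part of the original Weka setup). With a boolean one-hot encoding, its apriori implementation only treats the 1s (True values) as items, so the "lots of 0s" rules never show up; the data below is the example table from the question.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# the example table from the question, one-hot encoded as booleans
rows = [[1,0,1,0,0,0,0],
        [0,1,0,0,0,0,0],
        [0,0,0,1,0,0,0],
        [0,0,1,0,1,0,0],
        [1,0,0,0,0,0,0],
        [0,0,0,0,0,1,0],
        [0,0,0,0,0,0,1],
        [1,0,1,0,0,0,0]]
df = pd.DataFrame(rows, columns=list("ABCDEFG")).astype(bool)

# only item *presence* (the 1s) forms itemsets here
itemsets = apriori(df, min_support=0.2, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence"]])
# should surface A -> C and C -> A (support 0.25, confidence ~0.67) for this data
Within Weka itself, the -Z route above is the way to get the same effect.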

I've got a pipe that consists of 5 pieces, each including 5 properties

Inlet -> front -> middle -> rear -> outlet
Those five properties each have a value between 4 and 40. Now I want to find combinations where each property, summed across one piece from every position, comes out to an even multiple of 10 or 5. There might be hundreds of different pipe pieces, all with different properties.
So if I have all 5 pieces and, when summed, their properties come to 54, 51, 23, 71, 37 - that is not good and not what I'm looking for.
Instead, 55, 50, 25, 70, 40 - that would be perfect.
My trouble is that there are so many pieces that it would be insane to do the matching manually, and new ones come up frequently.
I have already manually inserted about 100 of these into SQLite, but it should be easy to convert them into Excel or other database formats, so the answer can relate to anything like MySQL or Google Sheets.
I need a calculation that takes every piece into account and either results in "no match" or tells me the id of each piece required for a match; if multiple matches are available, it separates them.
Edit: Even just the math needed to do this kind of calculation would be a lot of help here; I'm not much of a math guy myself. I guess there should be a reference piece I need to use, and then that gets checked against every possible scenario.
If the value you want to verify is in A1, use: =ROUND(A1/5,0)*5
If the pipes may not be shorter than the given values, use =CEILING(A1,5)
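Beyond rounding single values, the full piece-matching could be done with a brute-force search. Below is a rough Python sketch under my own assumptions (the piece IDs and property values are hypothetical): it tries one piece per position and keeps every combination whose per-property sums are all multiples of 5.
from itertools import product

# pieces[position] = list of (piece_id, [p1, p2, p3, p4, p5]) - hypothetical data
pieces = {
    "inlet":  [(1, [11, 10, 5, 14, 8]), (2, [12, 9, 6, 15, 7])],
    "front":  [(3, [11, 10, 5, 14, 8])],
    "middle": [(4, [11, 10, 5, 14, 8])],
    "rear":   [(5, [11, 10, 5, 14, 8])],
    "outlet": [(6, [11, 10, 5, 14, 8])],
}

matches = []
for combo in product(*pieces.values()):          # one piece per position
    totals = [sum(vals) for vals in zip(*(props for _, props in combo))]
    if all(t % 5 == 0 for t in totals):          # every property sums to a multiple of 5
        matches.append(([pid for pid, _ in combo], totals))

print(matches if matches else "no match")
With hundreds of pieces per position the number of combinations grows quickly, so a full catalogue would need pruning or a dynamic-programming formulation, but the acceptance test itself is just the modulo check above.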

what is the serializability graph of this?

I am trying to figure out a question, but I do not know how to solve it; I am unfamiliar with most of the terms in it. Here is the question:
Three transactions; T1, T2 and T3 and schedule program s1 are given
below. Please draw the precedence or serializability graph of the s1
and specify the serializability of the schedule S1. If possible, write
at least one serial schedule. r ==> read, w ==> write
T1: r1(X);r1(Z);w1(X);
T2: r2(Z);r2(Y);w2(Z);w2(Y);
T3: r3(X);r3(Y);w3(Y);
S1: r1(X);r2(Z);r1(Z);r3(Y);r3(Y);w1(X);w3(Y);r2(Y);w2(Z);w2(Y);
I do not have any idea how to solve this question, so I need a detailed description. Which resource should I look at? Thanks in advance.
There are various ways to test for serializability. The Objective of serializability is to find nonserial schedules that allow transactions to execute concurrently without interfering with one another.
First we do a Conflict-Equivalent Test. This will tell us whether the schedule is serializable.
To do this, we must define some rules (i & j are 2 transactions, R=Read, W=Write).
We cannot swap the order of two actions if they match one of these patterns:
1. Ri(x), Wi(y) - Conflicts (same transaction, so its internal order must be preserved)
2. Wi(x), Wj(x) - Conflicts
3. Ri(x), Wj(x) - Conflicts
4. Wi(x), Rj(x) - Conflicts
But these are perfectly valid:
Ri(x), Rj(y) - No conflict (2 reads never conflict)
Ri(x), Wj(y) - No conflict (working on different items)
Wi(x), Rj(y) - No conflict (same as above)
Wi(x), Wj(y) - No conflict (same as above)
So, applying the rules above (I worked this out in Excel for simplicity), we can clearly see that we managed to derive a serial relation, i.e. the schedule you have above can be split into S(T1, T3, T2).
Now that we have a serializable schedule and we have the serial schedule, we do the conflict-serializability test:
The simplest way to do this, using the same rules as the conflict-equivalence test, is to look for any combinations which would conflict.
r1(x); r2(z); r1(z); r3(y); r3(y); w1(x); w3(y); r2(y); w2(z); w2(y);
----------------------------------------------------------------------
r1(z) w2(z)
r3(y) w2(y)
w3(y) r2(y)
w3(y) w2(y)
Using the rules above, we end up with the table above (e.g. we know that reading z in one transaction and then writing z in another transaction will cause a conflict - see rule 3).
Given the table, going from left to right, we can create a precedence graph with these edges:
T1 -> T2
T3 -> T2 (only 1 arrow per combination)
Thus we end up with a graph containing just those two edges: T1 -> T2 and T3 -> T2.
Since the graph is acyclic (no cycle), we can conclude that the schedule is conflict-serializable. Furthermore, it is also view-serializable, since every schedule that is conflict-serializable is also view-serializable. We could run the view-serializability test to prove this, but it's rather complicated.
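For what it's worth, here is a small Python sketch (my own, not the tool mentioned in the update below) that derives the same precedence graph from S1 by applying the conflict rules, then checks it for cycles:
# S1 as (operation, transaction, item) triples
schedule = [("r",1,"X"), ("r",2,"Z"), ("r",1,"Z"), ("r",3,"Y"), ("r",3,"Y"),
            ("w",1,"X"), ("w",3,"Y"), ("r",2,"Y"), ("w",2,"Z"), ("w",2,"Y")]

edges = set()
for i in range(len(schedule)):
    for j in range(i + 1, len(schedule)):
        op1, t1, x1 = schedule[i]
        op2, t2, x2 = schedule[j]
        # conflict: different transactions, same item, at least one write
        if t1 != t2 and x1 == x2 and "w" in (op1, op2):
            edges.add((t1, t2))

def has_cycle(edges, nodes):
    graph = {n: [b for a, b in edges if a == n] for n in nodes}
    def visit(n, seen):
        return n in seen or any(visit(m, seen | {n}) for m in graph[n])
    return any(visit(n, set()) for n in nodes)

nodes = {t for _, t, _ in schedule}
print("edges:", edges)                          # {(1, 2), (3, 2)}
print("acyclic:", not has_cycle(edges, nodes))  # True -> conflict-serializable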
Regarding sources to learn this material, I recommend:
"Database Systems: A practical Approach To design, implementation and management: International Edition" by Thomas Connolly; Carolyn Begg - (It is rather expensive so I suggest looking for a cheaper, pdf copy)
Good luck!
Update
I've developed a little tool which will do all of the above for you (including the graph). It's pretty simple to use, and I've added some examples.

What is a good approach to check if an item is in a very big hashset?

I have a hashset that cannot be entirely loaded into memory. Let's say it has parts A, B and C; each one can be loaded into memory, but not all of them at the same time.
I also have random entries coming in from time to time, and I can barely tell which part a given entry might belong to. So one approach could be: load A first and check, then B, then C. But the next entry could belong to B, so I have to unload C, then load A, then B... Hopefully this makes sense.
This would clearly be very slow, so I wonder: is there a better way to do this? (Using a DB is not an option.)
I take it that you don't currently use any criterion to decide whether a data entry goes to A, B or C; in other words, A, B, C are just the result of dividing the whole data set into 3 equal parts. Am I right? If so, I recommend you add a criterion when you add a new entry to your set. For example, if your entries are numbers, put those that start with 0-3 into A, those that start with 4-6 into B, and those that start with 7-9 into C. When you search for something, you then know a priori whether you have to look in A, in B, or in C. If your entries are words, the same solution applies, but now the criterion is the first letter; maybe here it is better to use not 3 sets but 26 - the size of the English alphabet.
Note that you still have to keep one of the sets in memory at a time. The advantage is that you do at most 1 load/unload operation and you don't need to check all the sets - you know which of them can actually contain your value. This idea is widely used in databases - it is called partitioning. If you store neither numbers nor words in your sets but some complex objects, you can still invent some simple criterion.
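A rough Python sketch of that partitioning idea, under my own assumptions (one pickled set per partition, keyed by first letter; the file names are hypothetical):
import pickle
import string

_cache = {}          # holds at most one partition in memory at a time

def partition_of(entry):
    c = entry[0].lower()
    return c if c in string.ascii_lowercase else "other"

def load_partition(name):
    if name not in _cache:
        _cache.clear()                                  # unload the previous partition
        with open(f"hashset_{name}.pkl", "rb") as f:    # hypothetical file per partition
            _cache[name] = pickle.load(f)               # a pickled Python set
    return _cache[name]

def contains(entry):
    # only the partition the entry can belong to is ever loaded
    return entry in load_partition(partition_of(entry))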

Document classification with incomplete training set

Advice please. I have a collection of documents that all share a common attribute (e.g. the word "French" appears). Some of these documents have been marked as not pertinent to this collection (e.g. "French kiss" appears), but not all such documents are guaranteed to have been identified. What is the best method to figure out which other documents don't belong?
Assumptions
Given your example "French", I will work under the assumption that the feature is a word that appears in the document. Also, since you mention that "French kiss" is not relevant, I will further assume that in your case, a feature is a word used in a particular sense. For example, if "pool" is a feature, you may say that documents mentioning swimming pools are relevant, but those talking about pool (the sport, like snooker or billiards) are not relevant.
Note: Although word sense disambiguation (WSD) methods would work, they require too much effort and are overkill for this purpose.
Suggestion: localized language model + bootstrapping
Think of it this way: You don't have an incomplete training set, but a smaller training set. The idea is to use this small training data to build bigger training data. This is bootstrapping.
For each occurrence of your feature in the training data, build a language model based only on the words surrounding it. You don't need to build a model for the entire document. Ideally, just the sentences containing the feature should suffice. This is what I am calling a localized language model (LLM).
Build two such LLMs from your training data (let's call it T_0): one for pertinent documents, say M1, and another for irrelevant documents, say M0. Now, to build a bigger training data, classify documents based on M1 and M0. For every new document d, if d does not contain the feature-word, it will automatically be added as a "bad" document. If d contains the feature-word, then consider a local window around this word in d (the same window size that you used to build the LLMs), and compute the perplexity of this sequence of words with M0 and M1. Classify the document as belonging to the class which gives lower perplexity.
To formalize, the pseudo-code is:
T_0 := initial training set (consisting of relevant/irrelevant documents)
D0  := additional data to be bootstrapped
N   := number of bootstrapping iterations

for i = 0 to N-1
    T_i+1 := empty training set
    Build M0 and M1 from T_i as discussed above, using window size w
    for d in D0
        if feature-word not in d
            add d to the irrelevant documents of T_i+1
        else
            compute perplexity scores P0 and P1 corresponding to M0 and M1,
            using window size w around the feature-word in d
            if P0 < P1 - delta
                add d to the irrelevant documents of T_i+1
            else if P1 < P0 - delta
                add d to the relevant documents of T_i+1
            else
                do not use d in T_i+1
            end
        end
    end
    Select a small random sample from the relevant and irrelevant documents
    in T_i+1, and (re)classify them manually if required.
end
T_N is your final training set. In the above bootstrapping procedure, the parameter delta needs to be determined with experiments on some held-out data (also called development data).
The manual reclassification of a small sample is done so that noise does not accumulate over all N bootstrapping iterations.
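To make the localized language model step concrete, here is a toy Python sketch under a simplifying assumption of mine: the LLMs are just add-one-smoothed unigram models over the words in a window around the feature word, and a document is classified by whichever model gives the lower perplexity.
import math
from collections import Counter

def window(tokens, feature, w):
    # words within w positions of each occurrence of the feature word
    out = []
    for i, t in enumerate(tokens):
        if t == feature:
            out.extend(tokens[max(0, i - w):i + w + 1])
    return out

class UnigramLM:
    def __init__(self, tokens):
        self.counts, self.total = Counter(tokens), len(tokens)
        self.vocab = len(self.counts) + 1          # +1 for unseen words
    def perplexity(self, tokens):
        logp = sum(math.log((self.counts[t] + 1) / (self.total + self.vocab))
                   for t in tokens)
        return math.exp(-logp / max(len(tokens), 1))

# M1 from windows of relevant docs, M0 from windows of irrelevant docs (toy data)
M1 = UnigramLM(window("the french language is spoken in france".split(), "french", 3))
M0 = UnigramLM(window("a quick french kiss on the cheek".split(), "french", 3))

d = "lessons in the french language".split()
P1 = M1.perplexity(window(d, "french", 3))
P0 = M0.perplexity(window(d, "french", 3))
print("relevant" if P1 < P0 else "irrelevant")   # "relevant" for this toy document
In the bootstrapping loop above you would additionally require the margin delta before trusting either label.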
Firstly, you should take care of how you extract features from the sample docs. Counting every word is not a good way; you may need a technique like TF-IDF to teach the classifier which words are important for classification and which are not.
Build the right dictionary. In your case, "French kiss" should be a single token, instead of the sequence "French" + "kiss". Using the right technique to build the dictionary is important.
The remaining errors in the samples are normal; we call this "not linearly separable". There is a huge amount of advanced research on how to solve this problem. For example, an SVM (support vector machine) may be what you want to use. Please note that a single-layer Rosenblatt perceptron usually shows very bad performance on data sets that are not linearly separable.
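A small scikit-learn sketch of those first three points (my choice of library, with toy documents): TF-IDF features, word bigrams so that "french kiss" becomes a feature of its own, and a linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

docs = ["lessons in the french language",
        "french grammar and vocabulary",
        "a quick french kiss on the cheek",
        "french kiss scene in the movie"]
labels = [1, 1, 0, 0]   # 1 = pertinent, 0 = not pertinent (toy labels)

# unigrams + bigrams, weighted by TF-IDF, fed to a linear SVM
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(docs, labels)
print(model.predict(["learning the french language", "a french kiss goodbye"]))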
Some kinds of neural networks (like the Rosenblatt perceptron) can be trained on an erroneous data set and can show better performance than the trainer has. Moreover, in many cases you should allow some errors to avoid over-training.
You can label all unlabeled documents randomly, train several nets, and estimate their performance on the test set (of course, you should not include unlabeled documents in the test set). After that, you can, in a loop, recalculate the weights of the unlabeled documents as w_i = sum over j of quality(j) * w_ij, then repeat the training, then recalculate the weights, and so on. Because this procedure is equivalent to introducing a new hidden layer and recalculating its weights by a Hebbian procedure, the overall procedure should converge if your positive and negative sets are linearly separable in some network feature space.

Help--100% accuracy with LibSVM?

Nominally a good problem to have, but I'm pretty sure it is because something funny is going on...
As context, I'm working on a problem in the facial expression/recognition space, so getting 100% accuracy seems incredibly implausible (not that it would be plausible in most applications...). I'm guessing there is either some consistent bias in the data set that is making it overly easy for an SVM to pull out the answer, =or=, more likely, I've done something wrong on the SVM side.
I'm looking for suggestions to help understand what is going on--is it me (=my usage of LibSVM)? Or is it the data?
The details:
About ~2500 labeled data vectors/instances (transformed video frames of individuals--<20 individual persons total), binary classification problem. ~900 features/instance. Unbalanced data set at about a 1:4 ratio.
Ran subset.py to separate the data into test (500 instances) and train (remaining).
Ran "svm-train -t 0 ". (Note: apparently no need for '-w1 1 -w-1 4'...)
Ran svm-predict on the test file. Accuracy=100%!
Things tried:
Checked about 10 times over that I'm not training & testing on the same data files, through some inadvertent command-line argument error
re-ran subset.py (even with -s 1) multiple times and did train/test on multiple different data splits (in case I randomly hit upon the most magical train/test pair)
ran a simple diff-like check to confirm that the test file is not a subset of the training data
svm-scale on the data has no effect on accuracy (accuracy=100%). (Although the number of support vectors does drop from nSV=127, nBSV=64 to nSV=72, nBSV=0.)
((weird)) using the default RBF kernel (instead of linear, i.e. removing '-t 0') results in accuracy going to garbage(?!)
(sanity check) running svm-predict using a model trained on a scaled data set against an unscaled data set results in accuracy = 80% (i.e., it always guesses the dominant class). This is strictly a sanity check to make sure that somehow svm-predict is nominally acting right on my machine.
Tentative conclusion?:
Something with the data is wacked--somehow, within the data set, there is a subtle, experimenter-driven effect that the SVM is picking up on.
(This doesn't, on first pass, explain why the RBF kernel gives garbage results, however.)
Would greatly appreciate any suggestions on a) how to fix my usage of LibSVM (if that is actually the problem) or b) determine what subtle experimenter-bias in the data LibSVM is picking up on.
Two other ideas:
Make sure you're not training and testing on the same data. This sounds kind of dumb, but in computer vision applications you should take care to ensure that you're not repeating data (say, two frames of the same video falling into different folds), that you're not training and testing on the same individual, etc. It is more subtle than it sounds.
Make sure you search for the gamma and C parameters of the RBF kernel. There are good theoretical (asymptotic) results showing that a linear classifier is just a degenerate RBF classifier, so you should just look for a good (C, gamma) pair.
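If you want to script that search, here is a hedged sketch with scikit-learn, whose SVC wraps libsvm (libsvm's own grid.py tool is the native alternative); X and y below are placeholders for your data.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# placeholders: replace with your ~2500 x ~900 feature matrix and binary labels
X, y = np.random.rand(500, 900), np.random.randint(0, 2, 500)

param_grid = {"C":     [2.0 ** k for k in (-3, 0, 3, 6, 9)],
              "gamma": [2.0 ** k for k in (-9, -6, -3, 0)]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)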
Notwithstanding that the devil is in the details, here are three simple tests you could try:
Quickie (~2 minutes): Run the data through a decision tree algorithm. This is available in Matlab via classregtree, or you can load the data into R and use rpart (a Python sketch of this check follows after this list). This could tell you if one or just a few features happen to give a perfect separation.
Not-so-quickie (~10-60 minutes, depending on your infrastructure): Iteratively split the features (i.e. from 900 to 2 sets of 450), train, and test. If one of the subsets gives you perfect classification, split it again. It would take fewer than 10 such splits to find out where the problem variables are. If it happens to "break" with many variables remaining (or even in the first split), select a different random subset of features, shave off fewer variables at a time, etc. It can't possibly need all 900 to split the data.
Deeper analysis (minutes to several hours): try permutations of labels. If you can permute all of them and still get perfect separation, you have some problem in your train/test setup. If you select increasingly larger subsets to permute (or, if going in the other direction, to leave static), you can see where you begin to lose separability. Alternatively, consider decreasing your training set size and if you get separability even with a very small training set, then something is weird.
Method #1 is fast & should be insightful. There are some other methods I could recommend, but #1 and #2 are easy and it would be odd if they don't give any insights.
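For test #1, a hedged Python/scikit-learn sketch (instead of Matlab's classregtree or R's rpart); X and y are placeholders for your data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# placeholders: replace with your ~2500 x ~900 feature matrix and binary labels
X, y = np.random.rand(2500, 900), np.random.randint(0, 2, 2500)

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print("training accuracy:", tree.score(X, y))
# if one or two features give near-perfect separation, they are prime leakage suspects
print("features the tree split on:", np.nonzero(tree.feature_importances_)[0])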

Resources