How to efficiently compute disjointness with OWLAPI?

I'm profiling my OWLAPI-based application, and the only bottleneck I found is computing disjointness.
I have to check whether each class is disjoint from the other classes and, if so, whether this is asserted or inferred.
This seems to be heavy to compute: unlike equivalence, which is based on the Node data structure (so the data is efficient to retrieve), disjointness is based on a NodeSet, so I'm forced to perform more loops.
This is the procedure I use:
private void computeDisjointness(OWLClass clazz) {
    NodeSet<OWLClass> disjointSetsFromCls = reasoner.getDisjointClasses(clazz);
    for (Node<OWLClass> singleDisjointSet : disjointSetsFromCls) {
        for (OWLClass item : singleDisjointSet) {
            // scan every asserted disjointness axiom in the ontology
            for (OWLDisjointClassesAxiom disjAxiom : ontology.getAxioms(AxiomType.DISJOINT_CLASSES)) {
                if (disjAxiom.containsEntityInSignature(item)) {
                    //asserted
                } else {
                    //derived
                }
            }
        }
    }
}
As you can see, the bottleneck comes from the three nested for loops that slow down the application; moreover, computeDisjointness is executed for every class in the ontology.
Is there a more efficient way to get the disjointness and check if the axioms are asserted or derived?

One simple optimization is to move ontology.getAxioms(AxiomType.DISJOINT_CLASSES) to the calling method, then pass it in as a parameter. This method returns a new set on each call, with the same contents every time, since you're not modifying the ontology. So if you have N classes you are creating at least N identical sets; more if many classes are actually disjoint.
Optimization number two: check the size of the disjoint node set. Size 1 means no disjoints, so you can skip the rest of the method.
Optimization 3: keep track of the classes you've already visited. E.g., if you have
A disjointWith B
your code will be called on A and cycle over A and B, then be called on B and repeat the computation.
Keep a set of visited classes, to which you add all elements in the disjoint node set; when it's B's turn you'll be able to skip the reasoner call as well. Speaking of which, I would assume the reasoner call is actually the most expensive call in this method. Do you have profiling data that says otherwise?
Optimization 4: I'm not convinced this code reliably tells you which disjoint axioms are inferred and which ones are asserted. You could have:
A disjointWith B
B disjointWith C
The reasoner would return {A, B, C} in response to asking for disjoints of A. You would find all three elements in the signature of a disjoint axiom, and find out that the reasoner has done no inferences. But the axioms in input are not the same as the axioms in output (many reasoners would in fact run absorption on the input axioms and transform them to an internal representation that is an axiom with three operands).
So, my criterion for asserted versus inferred would be: the disjointness is asserted if the set of classes returned by the reasoner is the same as the set of operands of one disjointness axiom, and inferred otherwise. To verify this condition, I would take all disjointness axioms, extract the set of operands of each, and keep those sets in a set of sets. Then,
for each class (and keeping in place the optimization to visit each class only once),
obtain the set of disjoint nodes (optimization here to not bother if there's only one class in the set),
transform it to a set of entities,
and check whether the set of sets created earlier contains this new set.
If yes, this is an asserted axiom; if not, it's inferred.
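A rough OWLAPI sketch of this approach (folding in the optimizations above) might look like the following. It is only a sketch under some assumptions of mine: it uses the same ontology and reasoner fields as in the question, considers only named classes as axiom operands (anonymous class expressions are skipped), filters owl:Nothing out of the reasoner's answer since many reasoners include the bottom node there, and adds the queried class back so the flattened node set can be compared with an axiom's operand set. Imports are omitted.

// Build the set of asserted operand sets once, outside the per-class loop.
Set<Set<OWLClass>> assertedDisjointSets = new HashSet<>();
for (OWLDisjointClassesAxiom ax : ontology.getAxioms(AxiomType.DISJOINT_CLASSES)) {
    Set<OWLClass> operands = new HashSet<>();
    for (OWLClassExpression ce : ax.getClassExpressions()) {
        if (!ce.isAnonymous()) {
            operands.add(ce.asOWLClass());
        }
    }
    assertedDisjointSets.add(operands);
}

Set<OWLClass> visited = new HashSet<>();
for (OWLClass clazz : ontology.getClassesInSignature()) {
    if (!visited.add(clazz)) {
        continue; // already seen as a member of another class's disjoint set
    }
    NodeSet<OWLClass> disjoint = reasoner.getDisjointClasses(clazz);
    if (disjoint.getNodes().size() <= 1) {
        continue; // only the trivial node (e.g. owl:Nothing): no real disjoints
    }
    Set<OWLClass> entities = new HashSet<>();
    for (OWLClass c : disjoint.getFlattened()) {
        if (!c.isOWLNothing()) { // owl:Nothing is trivially disjoint with everything
            entities.add(c);
        }
    }
    visited.addAll(entities);
    entities.add(clazz); // an axiom's operand set includes the class itself
    if (assertedDisjointSets.contains(entities)) {
        // asserted
    } else {
        // inferred
    }
}

Since HashSet equality is element-wise, the contains check compares whole operand sets at once, so no per-axiom loop is needed inside the per-class loop.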

Related

Venn diagram notation for all other sets except one

I'm trying to find a Venn diagram notation that can illustrate data that is only in a single set.
If I can select the data from all the other sets, without knowing how many there are, then I can take the intersection of their complements to select the data that is only in the target set.
My current solution looks like this, but it assumes the existence of sets B and C.
The diagram is eventually expected to look like this:
One way to do it would be by using a system based on regions rather than sets. In your case, it would be the region that belongs to set A but does not belong to any other set. You can find the rationale to do that here. The idea is to express the region as a binary chain where 1 means "belongs to set n" and 0 means "does not belong to set n", where n is determined by the ordering of the sets.
In your example, you might define A as the last set, and therefore as the last bit. With three sets CBA, your region would be 001. The nice thing about this is that the leading zeroes can be naturally disregarded. Your region would be 1b, no matter how many sets there are (the b is for "binary").
You might even extend the idea by translating the number to another base. For instance, say that you want to express the region of elements belonging to set B only. With the same ordering as before, it would be 010 or 10b. But you can also express it as a decimal number and say "region 2". This expression would be valid if sets A and B exist, independently of the presence of any other set.
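To make the encoding concrete, here is a small illustrative sketch (the regionOf helper and the ordering convention are assumptions of mine, not part of the notation itself) that computes the region code of an element from an ordered list of sets, with the last set in the ordering mapped to the least significant bit:

// Encode the region an element belongs to as a bitmask over an ordered list of
// sets; the last set in the ordering corresponds to bit 0, as described above.
static <T> int regionOf(T element, List<Set<T>> orderedSets) {
    int region = 0;
    int n = orderedSets.size();
    for (int i = 0; i < n; i++) {
        if (orderedSets.get(i).contains(element)) {
            region |= 1 << (n - 1 - i);
        }
    }
    return region;
}

With the ordering C, B, A, "only in A" is region 001b = 1 and "only in B" is region 010b = 2, regardless of how many additional sets are put in front of the ordering.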

Flink: handle skew by partitioning by a field of the key

I have skew when I keyBy on my data. Let's say the key is:
case class MyKey(x: X, y:Y)
To solve this I am thinking of adding an extra field that would make distribution even among the workers by using this field only for partitioning:
case class MyKey(z: evenlyDistributedField, x: X, y:Y) extends MyKey(x, y) {
override def hashCode(): Int = z.hashCode
}
Due to this line, my records will use the overridden hashCode and be distributed evenly across the workers, while the original equals method (which takes into consideration only the X and Y fields) is used to find the proper keyed state in later stateful operators.
I know that same (X, Y) pairs will end in different workers, but I can handle that later. (after making the necessary processing with my new key to avoid skew).
My question is: where else is the hashCode method of the key used?
I suspect it is used when getting keyed state (what is a namespace, by the way?), as I saw that the implementing classes use the key in a hashMap to get the state for this key. I know that retrieving the KeyedState from the map will be slower, as the hashCode will not consider the X and Y fields. But is there any other place in the Flink code that uses the hashCode method of the key?
Is there any other way to solve this? I thought of physical partitioning, but then I cannot use keyBy as well, as far as I know.
SUMMING UP I WANT TO:
partition my data randomly so that it is evenly distributed among the workers
[EDITED] do a .window().aggregate() in each partition independently of the others (as if the other partitions don't exist). The data in each window aggregate should be keyed on the (X, Y) pairs of this partition, ignoring the same (X, Y) keys in other partitions.
merge the conflicts caused by the same (X, Y) pairs appearing in different partitions later (for this I don't need guidance; I just do a new keyBy on (X, Y))
In this situation I usually create a transient Tuple2<MyKey, Integer>, where I fill in the Tuple.f1 field with whatever I want to use to partition by. The map or flatMap operation following the .keyBy() can emit MyKey. That avoids mucking with MyKey.hashCode().
And note that having a different set of fields for the hashCode() vs. equals() methods leads to pain and suffering. Java has a contract that says "equals consistency: objects that are equal to each other must return the same hashCode".
[updated]
If you can't offload a significant amount of unkeyed work, then what I would do is...
Set the Integer in the Tuple2<MyKey, Integer> to be hashCode(MyKey) % <operator parallelism * factor>. Assuming your parallelism * factor is high enough, you'll only get a few cases of 2 (or more) of the groups going to the same sub-task.
In the operator, use MapState<MyKey, value> to store state. You'll need this since you'll get multiple unique MyKey values going to the same keyed group.
Do your processing and emit a MyKey from this operator.
By using hashCode(MyKey) % some value, you should get a pretty good mix of unique MyKey values going to each sub-task, which should mitigate skew. Of course if one value dominates, then you'll need another approach, but since you haven't mentioned this I'm assuming it's not the case.
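A rough sketch of this in the Flink Java API might look like the following (imports omitted). The events stream, the NUM_SALTS constant (parallelism * factor), and the counting logic inside the process function are placeholders of mine to show the shape of the pipeline, not the actual job:

// Wrap each record in a Tuple2 whose f1 field carries the salt used for partitioning.
DataStream<Tuple2<MyKey, Integer>> salted = events.map(
    new MapFunction<MyKey, Tuple2<MyKey, Integer>>() {
        @Override
        public Tuple2<MyKey, Integer> map(MyKey k) {
            // non-negative salt, spreading keys over parallelism * factor groups
            return Tuple2.of(k, (k.hashCode() & Integer.MAX_VALUE) % NUM_SALTS);
        }
    });

DataStream<MyKey> preAggregated = salted
    .keyBy(t -> t.f1) // partition on the salt only
    .process(new KeyedProcessFunction<Integer, Tuple2<MyKey, Integer>, MyKey>() {
        // several distinct MyKey values land in the same key group,
        // so per-MyKey state has to live in MapState, not ValueState
        private transient MapState<MyKey, Long> perKeyState;

        @Override
        public void open(Configuration parameters) {
            perKeyState = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("perKey", MyKey.class, Long.class));
        }

        @Override
        public void processElement(Tuple2<MyKey, Integer> value, Context ctx,
                                   Collector<MyKey> out) throws Exception {
            Long count = perKeyState.get(value.f0);
            perKeyState.put(value.f0, count == null ? 1L : count + 1);
            out.collect(value.f0); // emit MyKey for the later keyBy on (X, Y)
        }
    });

After this operator, a plain keyBy on (X, Y) can merge the partial results coming from different salt groups, as described in the question.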

OWLAPI : Performance impact while dealing with update/delete of axioms

I want to update/delete the axioms of an OWL class (e.g. SubClassOf axioms).
I have the following two approaches:
1) Delete all old axioms, then create all new axioms.
2) Delete selected axioms by comparing them with the new axioms.
Note: due to some limitations I have to treat updates as delete + create.
Q. Which strategy is better in terms of OWLAPI performance?
E.g.
I have the following SubClassOf axioms for class X:
1) A or B
2) name exactly 1 xsd:string
3) P and not Q
and I want to update/delete these axioms as follows:
1) A [Update]
2) name min 1 xsd:string [Update]
3) Axiom is deleted [Delete]
The performance of axiom removal is equivalent to that of axiom addition. The main actions are searches through maps to find existing elements or add new ones.
The operations on the structures involved are roughly constant time with respect to the input, so the total complexity is mostly independent of the ontology size (this might not hold for very large ontologies, but it's accurate for most).
In short, there is no performance issue with your proposed solution (2).
I would not recreate axioms unnecessarily, though: this is likely to be expensive in terms of memory use, and since axioms are immutable, the new and old objects behave exactly the same.
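As a rough illustration of option (2) in OWLAPI (imports omitted; the manager, the ontology, the class x, and a newAxioms set built elsewhere with the data factory are assumed to exist), one could compute the difference between the old and new SubClassOf axioms and apply only the actual changes:

// Current SubClassOf axioms in which X is the subclass.
Set<OWLSubClassOfAxiom> oldAxioms = ontology.getSubClassAxiomsForSubClass(x);

// Remove only what is no longer wanted, add only what is genuinely new;
// unchanged axioms are left untouched.
Set<OWLAxiom> toRemove = new HashSet<OWLAxiom>(oldAxioms);
toRemove.removeAll(newAxioms);
Set<OWLAxiom> toAdd = new HashSet<OWLAxiom>(newAxioms);
toAdd.removeAll(oldAxioms);

manager.removeAxioms(ontology, toRemove);
manager.addAxioms(ontology, toAdd);

Since axioms are immutable and compared by value, the set difference reliably identifies the axioms that did not change.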

Can a trained decision tree always return a prediction for some data input?

If I have a well trained decision tree, is it likely that there are still some combinations of attributes for which the tree has no prediction? What I mean to say is, is it possible to have a decision tree that responds to all possible combinations of inputs from a dataset that it was not trained on? I am not concerned with the accuracy of the tree, instead I wonder if a good decision tree would be expected to have a prediction for all possible combinations of inputs.
Thank you for your help!
It depends on whether by "combination of attributes" you are referring to the set of attributes for which values are provided, or you mean the combination of particular values (for all attributes). For example, suppose you have attributes A, B, C, and D, and attribute A can have values {A_1, A_2, ..., A_n} (and similarly for attributes B, C, and D).
If by "combinations of attributes", you mean that sometimes values will be provided for all attributes but other times only a subset (e.g., only values for A, C, and D), then it depends on the particular decision tree implementation. For example ID3 requires that each sample have values for all of the attributes, whereas C4.5 does not (i.e., it handles missing attributes).
If by "combinations of attributes" you mean that all attributes are always present but not all combinations of attribute values were encountered during training (e.g., there was no training sample with the combination (A_2, B_5, C_1, D_4)), then yes, the trained decision tree should be able to handle those cases. More specifically, the trained tree should be able to classify all combinations of values for the attributes on which it was trained.
If a node corresponding to a particular attribute did not have training samples with a particular value of the attribute, then the prediction is made based on the value of the parent node's attribute (the next node closer to the root). For example, suppose you have a new observation (A_2, B_5, C_1, D_4). You could have a trained tree whose root node branches on attribute C. Based on the given attribute value C=C_1, the tree may then branch on attribute B and, based on B=B_5, make its prediction. It is possible that there are no training samples with the combination (*, B_5, C_1, *). In that case, the prediction is based solely on the value C=C_1.
Or maybe there are training examples with C=C_1 and B=B_5 but that combination is already sufficient to make the prediction. In that case, the values of A and D for new observations are irrelevant for this combination of B and C. Since all new observations matching (*, B_5, C_1, *) have the same prediction, it is not necessary that associated values of A and D are also present in the training data.
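To make this concrete, here is a minimal, made-up sketch of classification with a trained tree (not tied to any particular library): a prediction depends only on the attribute values tested along one root-to-leaf path, and a value never seen at a node simply falls back to that node's majority label.

// A node tests one attribute and branches on its value; leaves store a label.
class TreeNode {
    String attribute;                  // attribute tested here; null for a leaf
    Map<String, TreeNode> children;    // one branch per value seen in training
    String majorityLabel;              // leaf prediction / fallback label

    String classify(Map<String, String> sample) {
        if (attribute == null) {
            return majorityLabel;                      // leaf: return its label
        }
        TreeNode child = children.get(sample.get(attribute));
        if (child == null) {
            return majorityLabel;                      // unseen value: fall back here
        }
        return child.classify(sample);                 // follow the matching branch
    }
}

A sample like (A_2, B_5, C_1, D_4) therefore always receives a prediction: only the attributes tested on the path from the root are ever consulted.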

Document classification with incomplete training set

Advice please. I have a collection of documents that all share a common attribute (e.g. the word "French" appears). Some of these documents have been marked as not pertinent to this collection (e.g. "French kiss" appears), but not all such documents are guaranteed to have been identified. What is the best method to figure out which other documents don't belong?
Assumptions
Given your example "French", I will work under the assumption that the feature is a word that appears in the document. Also, since you mention that "French kiss" is not relevant, I will further assume that in your case, a feature is a word used in a particular sense. For example, if "pool" is a feature, you may say that documents mentioning swimming pools are relevant, but those talking about pool (the sport, like snooker or billiards) are not relevant.
Note: Although word sense disambiguation (WSD) methods would work, they require too much effort and are overkill for this purpose.
Suggestion: localized language model + bootstrapping
Think of it this way: You don't have an incomplete training set, but a smaller training set. The idea is to use this small training data to build bigger training data. This is bootstrapping.
For each occurrence of your feature in the training data, build a language model based only on the words surrounding it. You don't need to build a model for the entire document. Ideally, just the sentences containing the feature should suffice. This is what I am calling a localized language model (LLM).
Build two such LLMs from your training data (let's call it T_0): one for pertinent documents, say M1, and another for irrelevant documents, say M0. Now, to build a bigger training data, classify documents based on M1 and M0. For every new document d, if d does not contain the feature-word, it will automatically be added as a "bad" document. If d contains the feature-word, then consider a local window around this word in d (the same window size that you used to build the LLMs), and compute the perplexity of this sequence of words with M0 and M1. Classify the document as belonging to the class which gives lower perplexity.
To formalize, the pseudo-code is:
T_0 := initial training set (consisting of relevant/irrelevant documents)
D0  := additional data to be bootstrapped
N   := iterations for bootstrapping
for i = 0 to N-1
    T_i+1 := empty training set
    Build M0 and M1 as discussed above using a window-size w
    for d in D0
        if feature-word not in d
            then add d to irrelevant documents of T_i+1
        else
            compute perplexity scores P0 and P1 corresponding to M0 and M1
            using window size w around the feature-word in d
            if P0 < P1 - delta
                add d to irrelevant documents of T_i+1
            else if P1 < P0 - delta
                add d to relevant documents of T_i+1
            else
                do not use d in T_i+1
            end
        end
    end
    Select a small random sample from relevant and irrelevant documents in
    T_i+1, and (re)classify them manually if required
end
T_N is your final training set. In the bootstrapping above, the parameter delta needs to be determined through experiments on some held-out data (also called development data).
The manual reclassification on a small sample is done so that the noise during this bootstrapping is not accumulated through all the N iterations.
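As a toy illustration of the perplexity step (my own sketch, not part of the answer above: it uses a simple add-one-smoothed unigram model, and all names are made up), the score of a window under one of the localized language models could be computed like this:

// Perplexity of a window of words under a unigram model with add-one smoothing.
// counts: word frequencies of the model, total: sum of counts, vocab: vocabulary size.
static double perplexity(List<String> window, Map<String, Integer> counts,
                         int total, int vocab) {
    double logProbSum = 0.0;
    for (String w : window) {
        double p = (counts.getOrDefault(w, 0) + 1.0) / (total + vocab);
        logProbSum += Math.log(p);
    }
    return Math.exp(-logProbSum / window.size()); // lower = better fit to the model
}

P0 would be computed with M0's counts and P1 with M1's counts, and the two scores compared against the margin delta exactly as in the pseudo-code above.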
First, you should take care of how you extract features from the sample documents. Counting every word is not a good way; you may need a technique like TF-IDF to teach the classifier which words are important for classification and which are not.
Build the right dictionary. In your case, "French kiss" should be a single token, instead of the sequence French + kiss. Using the right technique to build the dictionary is important.
The remaining errors in the samples are normal; we call this "not linearly separable". There is a huge amount of advanced research on how to handle this. For example, an SVM (support vector machine) may be what you want to use. Note that a single-layer Rosenblatt perceptron usually performs very badly on data sets that are not linearly separable.
Some kinds of neural networks (like the Rosenblatt perceptron) can be trained on an erroneous data set and still perform better than the trainer that produced the labels. Moreover, in many cases a certain amount of error is useful to avoid over-training.
You can label all the unlabeled documents randomly, train several nets, and estimate their performance on the test set (of course, you should not include unlabeled documents in the test set). After that you can, in a cycle, recalculate the weights of the unlabeled documents as w_i = sum over j of quality(j) * w_ij, then repeat the training, recalculate the weights again, and so on. Because this procedure is equivalent to introducing a new hidden layer and recalculating its weights by a Hebbian procedure, the overall procedure should converge if your positive and negative sets are linearly separable in some network feature space.
