Is there an R function for finding the most frequent items (LHS) in the rules of a given RHS? - apriori

I would like to identify the most frequent items (LHS) in the rules of a given consequent (RHS).
From the command below I get the LHS frequency, and I would like to add the RHS info to the same output.
rowSums(as(lhs(r), "ngCMatrix"))
Get the rules:
rules <- apriori(d, parameter = list(minlen=2, supp=0.05, conf=0.1), appearance = list(default="lhs", rhs=c("Causa=CA", "Causa=CB", "Causa=CE", "Causa=OT", "Causa=AB", "Causa=AT")))
Any suggestions?
Thanks
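For what it's worth, one way to get there with arules (a sketch, assuming the rules object mined above; not tested against your data) is to subset the rules by each consequent and count the LHS items within each subset:

library(arules)

# For each consequent of interest, count how often each item
# appears in the LHS of the rules having that consequent.
for (consequent in c("Causa=CA", "Causa=CB")) {
  sub <- subset(rules, subset = rhs %in% consequent)
  if (length(sub) > 0) {
    counts <- rowSums(as(lhs(sub), "ngCMatrix"))
    cat(consequent, "\n")
    print(sort(counts[counts > 0], decreasing = TRUE))
  }
}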

Clingo program expected to be satisfiable

I am testing some programs involving arithmetic in Clingo 5.0.0, and I don't understand why the program below is unsatisfiable:
#const v = 1.
a(object1).
a(object2).
b(object3).
value(object1,object2,object3) = "1.5".
value(X,Y,Z) > v, a(X), a(Y), b(Z), X!=Y :- go(X,Y,Z).
I expected an answer containing: a(object1) a(object2) b(object3) go(object1,object2,object3).
There is probably something I am missing regarding arithmetic in Clingo.
I fear there are quite a few misunderstandings about ASP here.
You cannot assign values to predicates (value(a,b,c)=1.5). Predicates form atoms, which can be true or false (contained in an answer set or not).
I assume that your last rule is meant to derive the atom go(X,Y,Z). Rules work the other way around: what is derived is on the left-hand side.
No floating-point arithmetic is possible; you would have to scale your values up to integers.
Your problem might look like this, but this is just groping in the dark:
#const v = 1.
a(object1).
a(object2).
b(object3).
value(object1,object2,object3,2).
go(X,Y,Z) :- value(X,Y,Z,Value), Value > v, a(X), a(Y), b(Z), X!=Y.
The last rule states:
Derive go(object1,object2,object3) if value(object1,object2,object3,2) is true and 2 > 1 and a(object1) is true and a(object2) is true and b(object3) is true and object1 != object2.

Perl: Length of an anonymous list

How to get the length of an anonymous list?
perl -E 'say scalar ("a", "b");' # => b
I expected scalar to return the list evaluated in scalar context, i.e. its length.
Why does it return the second (last) element?
It works for an array:
perl -E 'my @lst = ("a", "b"); say scalar @lst;' # => 2
You can use
my $n = () = f();
As applied to your case, it's
say scalar( () = ("a", "b") );
or
say 0+( () = ("a", "b") );
First, let's clear up a misconception.
You appear to believe that some operators evaluate to some kind of data structure called a list regardless of context, and that this list returns its length when coerced into scalar context.
All of that is incorrect.
An operator must evaluate to exactly one scalar in scalar context, and a sub must return exactly one scalar in scalar context. In list context, operators can evaluate to any number of scalars, and subs can return any number of scalars. So when we say an operator evaluates to a list, and when we say a sub returns a list, we aren't referring to some data structure; we are simply using "list" as a shorthand for "zero or more scalars".
Since there's no such thing as a list data structure, it can't be coerced into a scalar. Context isn't a coercion; context is something operators check to determine to what they evaluate in the first place. They literally let context determine their behaviour and what they return. It's up to each operator to decide what they return in scalar and list context, and there's a lot of variance.
As you've noted,
The @a operator in scalar context evaluates to a single scalar: the length of the array.
The comma operator in scalar context evaluates to a single scalar: the same value as its last operand.
The qw operator in scalar context evaluates to a single scalar: the last value it would normally return.
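A quick demonstration of those three behaviours in one short script (a sketch):

use feature 'say';

my @a = ("x", "y", "z");
say scalar @a;            # 3 -- @a in scalar context is the array's length
say scalar( ("a", "b") ); # b -- the comma operator returns its last operand
                          #     (and warns "Useless use of a constant" under -w)
my $n = () = ("a", "b");  # a list assignment, itself evaluated in scalar
say $n;                   # context, yields 2: the number of elements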
On to your question.
To determine to how many scalars an operator would evaluate when evaluated in list context, we need to evaluate the operator in list context. An operator always evaluates to a single scalar in scalar context, so your attempts to impose a scalar context are ill-founded (unless the operator happens to evaluate to the length of what it would have returned in list context, as is the case for @a, but not for many other operators).
The solution is to use
my $n = () = f();
The explanation is complicated.
One way
perl -wE'$len = () = qw(a b c); say $len' #--> 3
The = () = "operator" is a play on context. It forces list context on its right side, and the scalar assignment on the left receives the length of that list. See this post about list vs scalar assignments and this page for some thoughts on all this.
If this needs to be used in a list context, then scalar context can also be forced with scalar, like
say scalar( () = qw(a b c) );
Or in yet other ways (0+...), but scalar is in this case actually suitable, and clearest.
In your honest attempt, scalar imposes scalar context on its operand -- here an expression, which is thus evaluated by the comma operator: one term after another is discarded, until the last one, which is returned.
You would have been alerted to that with warnings enabled, as it would emit
Useless use of a constant ("a") in void context at -e line 1
Warnings can always be enabled in one-liners as well, with the -w flag. I recommend that.
I'd like to also comment on the notion of a "list" in Perl, often misunderstood.
In program text, a "list" is merely a syntactic device that code can use: a number of scalars, perhaps submitted to a function, or assigned to an array variable, or so. It is often identified by parentheses, but those really only decide precedence and don't "make" anything, nor give a "list" any sort of individuality the way a variable has one; a list is just a grouping of scalars.
Internally that's how data is moved around; a "list" is a fleeting bunch of scalars on a stack, returned somewhere and gone.
A list is not -- not -- any kind of a data structure or a data type; that would be an array. See for instance a perlfaq4 item and this related page.
Perl references are hard. I'm not sending you to read perlref, since it's not something that anyone can just read and start using.
Long story short, use this pattern to get an anonymous array size: 0+@{[ <...your array expression...> ]}
Example:
print 0+@{[ ("a", "b") ]};

Swift predicate only matches first value in array of values

I have a class Download that serves as a wrapper for CKQueryOperation. One of the inits allows me to build my predicate with an array of values:
init(type: String, queryField: String, queryValues: [CKRecordValue], to rec: RecievesRecordable, from database: CKDatabase? = nil) {
let predicate = NSPredicate(format: "\(queryField) = %@", argumentArray: queryValues)
query = CKQuery(recordType: type, predicate: predicate)
reciever = rec
db = database
super.init()
}
When I test it, query only matches the first value in the array. So if queryValues = [testValue0, testValue1], and I have one record whose field matches testValue0 and a second record that matches testValue1, only the first record will be discovered. If I switch the order, then the other record gets recognized.
It seems weird that I can create a predicate with an array but only the first value gets matched. The documentation says that values are substituted in the order they appear, but shouldn't it still be moving on to the second value?
For more context, each record is stored in a separate database (private vs public), and my Download class launches two separate CKQueryOperations that both rely on query if the database param is left nil. Whichever op fails ends up finding no results that match the first value, then gives up before checking the second value.
I can include the full code for 'Download' and my failing unit test if needed.
You are using the wrong predicate. You want to use the IN operation. And instead of using string substitution for the field name, use the %K format specifier:
let predicate = NSPredicate(format: "%K IN %@", queryField, queryValues)
Note that the NSPredicate(format:argumentArray:) expects there to be one format specifier in the format string for each value in the argument array. Since your format only had one format specifier, only the first value was taken from the array. And since you used =, the field was simply compared against that one value.
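Applied to the init from the question, the fix would look roughly like this (a sketch reusing the question's own names; argumentArray is used here to avoid any variadic-bridging concerns):

init(type: String, queryField: String, queryValues: [CKRecordValue],
     to rec: RecievesRecordable, from database: CKDatabase? = nil) {
    // %K substitutes a key path; %@ substitutes an object value.
    // With IN, a record matches when the field equals ANY element
    // of queryValues, not just the first one.
    let predicate = NSPredicate(format: "%K IN %@",
                                argumentArray: [queryField, queryValues])
    query = CKQuery(recordType: type, predicate: predicate)
    reciever = rec
    db = database
    super.init()
}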
Basic Comparisons
=, == The left-hand expression is equal to the right-hand expression.
>=, => The left-hand expression is greater than or equal to the right-hand expression.
<=, =< The left-hand expression is less than or equal to the right-hand expression.
> The left-hand expression is greater than the right-hand expression.
< The left-hand expression is less than the right-hand expression.
!=, <> The left-hand expression is not equal to the right-hand expression.
String Comparisons
BEGINSWITH The left-hand expression begins with the right-hand expression.
CONTAINS The left-hand expression contains the right-hand expression.
ENDSWITH The left-hand expression ends with the right-hand expression.
LIKE The left-hand expression equals the right-hand expression: ? and * are allowed as wildcard characters, where ? matches 1 character and * matches 0 or more characters.
MATCHES The left-hand expression equals the right-hand expression using a regex-style comparison according to ICU v3 (for more details see the ICU User Guide for Regular Expressions).
I used Basic Comparisons
let resultPredicate = NSPredicate(format: "%K = %@", "key", "value")
let result_filtered = MYARRAY.filtered(using: resultPredicate) as NSArray
More information: Predicate Programming Guide

How do I prevent a Datalog rule from pruning nulls?

I have the following facts and rules:
% frequents(D,P) % D=drinker, P=pub
% serves(P,B) % B=beer
% likes(D,B)
frequents(janus, godthaab).
frequents(janus, goldenekrone).
frequents(yanai, goldenekrone).
frequents(dimi, schlosskeller).
serves(godthaab, tuborg).
serves(godthaab, carlsberg).
serves(goldenekrone, pfungstaedter).
serves(schlosskeller, fix).
likes(janus, tuborg).
likes(janus, carlsberg).
count_good_beers_for_at(D,P,F) :- group_by((frequents(D,P), serves(P,B), likes(D,B)),[D,P],(F = count)).
possible_beers_served_for_at(D,P,B) :- lj(serves(P,B), frequents(D,R), P=R).
Now I would like to construct a rule that works like a predicate, returning "true" when the number of available "liked" beers at each pub that a "drinker" "frequents" is greater than 0.
I would consider the predicate true when the rule returns no tuples. If the predicate is false, I was planning to make it return the bars not having a single "liked" beer.
As you can see, I already have a rule counting the good beers for a given drinker at a given pub. I also have a rule giving me the number of servable beers.
DES> count_good_beers_for_at(A,B,C)
{
count_good_beers_for_at(janus,godthaab,2)
}
Info: 1 tuple computed.
As you can see, the counter doesn't return the pubs frequented but having 0 liked beers. I was planning to work around this by using a left outer join.
DES> is_happy_at(D,P,Z) :- lj(serves(P,B), count_good_beers_for_at(D,Y,Z), (Y=P))
Info: Processing:
is_happy_at(D,P,Z) :-
lj(serves(P,B),count_good_beers_for_at(D,Y,Z),Y = P).
{
is_happy_at(janus,godthaab,2),
is_happy_at(null,goldenekrone,null),
is_happy_at(null,schlosskeller,null)
}
Info: 3 tuples computed.
This is almost right, except it is also giving me the pubs not frequented. I try adding an extra condition:
DES> is_happy_at(D,P,Z) :- lj(serves(P,B), count_good_beers_for_at(D,Y,Z), (Y=P)), frequents(D,P)
Info: Processing:
is_happy_at(D,P,Z) :-
lj(serves(P,B),count_good_beers_for_at(D,Y,Z),Y = P),
frequents(D,P).
{
is_happy_at(janus,godthaab,2)
}
Info: 1 tuple computed.
Now I have somehow filtered away everything containing nulls! I suspect this is due to the null-value logic in DES.
I recognize that I might be approaching this whole problem in a wrong way. Any help is appreciated.
EDIT: Assignment is "very_happy(D) ist wahr, genau dann wenn jede Bar, die Trinker D besucht, wenigstens ein Bier ausschenkt, das er mag." which translates to "very_happy(D) is true, iff each bar drinker D visits, serves at least 1 beer, that he likes". Since this assignment is about Datalog, I would think it is definitely possible to solve without using Prolog.
I think that for your assignment you should use basic Datalog, without abusing aggregates. The point of the question is how to express universally quantified conditions. I googled for 'universal quantification datalog', and the first result was deductnotes.pdf, which asserts:
A universally quantified condition can only be expressed by an equivalent condition with existential quantification and negation.
In that PDF you will also find a useful example (pages 9 & 10).
Thus we must rephrase our question. I ended up with this code:
not_happy(D) :-
frequents(D, P),
likes(D, B),
not(serves(P, B)).
very_happy(D) :-
likes(D, _),
not(not_happy(D)).
which seems to be what's required:
DES> very_happy(D)
{
}
Info: 0 tuple computed.
Note the likes(D, _), which is required to avoid yanai and dimi being listed as very_happy without an explicit assertion of what they like.
EDIT: I'm sorry, but the above solution doesn't work. I've rewritten it this way:
likes_pub(D, P) :-
likes(D, B),
serves(P, B).
unhappy(D) :-
frequents(D, P),
not(likes_pub(D, P)).
very_happy(D) :-
likes(D, _),
not(unhappy(D)).
test:
DES> unhappy(D)
{
unhappy(dimi),
unhappy(janus),
unhappy(yanai)
}
Info: 3 tuples computed.
DES> very_happy(D)
{
}
Info: 0 tuples computed.
Now we add a fact:
serves(goldenekrone, tuborg).
and we can see the corrected code outcome:
DES> unhappy(D)
{
unhappy(dimi),
unhappy(yanai)
}
Info: 2 tuples computed.
DES> very_happy(D)
{
very_happy(janus)
}
Info: 1 tuple computed.
Maybe not the answer you are expecting. But you can use ordinary Prolog and easily do group-by queries with the bagof/3 or setof/3 built-in predicates.
?- bagof(B,(frequents(D,P), serves(P,B), likes(D,B)),L), length(L,N).
D = janus,
P = godthaab,
L = [tuborg,carlsberg],
N = 2
The semantics of bagof/3 is such that it does not compute an outer join for the given query. The query is executed by Prolog in the usual way; the results are accumulated and key-sorted, and finally returned by backtracking. If your Datalog cannot do without nulls, then yes, you have to filter them out.
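If you do want the failing side explicitly (the pubs without a single liked beer), findall/3 is the usual tool, since unlike bagof/3 it yields the empty list instead of failing (a sketch):

% pubs a drinker frequents where no served beer is liked
no_good_beers(D, P) :-
    frequents(D, P),
    findall(B, (serves(P, B), likes(D, B)), []).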
But you don't need to resort to aggregates when you only want to know whether a liked beer exists. You can do it directly via a query without any aggregates:
is_happy_at(D,P) :- frequents(D,P), once((serves(P,B), likes(D,B))).
?- is_happy_at(D,P).
D = janus,
P = godthaab ;
Nein
The once/1 prevents unnecessary backtracking. A Datalog system might avoid such backtracking automatically when it sees the projection in is_happy_at/2 (B is projected away), or you might need to explicitly use what corresponds to SQL DISTINCT. Or your Datalog may provide something corresponding to SQL EXISTS, which most closely matches once/1.
Bye

Determine regular expression's specificity

Given the following regular expressions:
- alice@[a-z]+\.[a-z]+
- [a-z]+@[a-z]+\.[a-z]+
- .*
The string alice@myprovider.com will obviously match all three regular expressions. In the application I am developing, we are only interested in the 'most specific' match. In this case this is obviously the first one.
Unfortunately, there seems to be no way to do this. We are using PCRE, and I did not find a way to do this; a search on the Internet was not fruitful either.
A possible way would be to keep the regular expressions sorted by descending specificity and then simply take the first match. Of course, the next question is how to sort the array of regular expressions. It is not an option to give the end user the responsibility of ensuring that the array is sorted.
So I hope you guys could help me out here...
Thanks !!
Paul
The following is the solution to this problem I developed based on Donald Miner's research paper, implemented in Python, for rules applied to MAC addresses.
Basically, the most specific match is from the pattern that is not a superset of any other matching pattern. For a particular problem domain, you create a series of tests (functions) which compare two REs and return which is the superset, or if they are orthogonal. This lets you build a tree of matches. For a particular input string, you go through the root patterns and find any matches. Then go through their subpatterns. If at any point, orthogonal patterns match, an error is raised.
Setup
import re
class RegexElement:
    def __init__(self, string, index):
        self.string = string
        # The relation containers are sets: the graph code below
        # calls .add() and .copy() on them.
        self.supersets = set()
        self.subsets = set()
        self.disjoints = set()
        self.intersects = set()
        self.maybes = []
        self.precompilation = {}
        self.compiled = re.compile(string, re.IGNORECASE)
        self.index = index
# Relation labels are strings so that SubsetGraph can dispatch
# via getattr(self, 'add_' + relationship).
SUPERSET = 'SUPERSET'
SUBSET = 'SUBSET'
INTERSECT = 'INTERSECT'
DISJOINT = 'DISJOINT'
EQUAL = 'EQUAL'
The Tests
Each test takes 2 strings (a and b) and tries to determine how they are related. If the test cannot determine the relation, None is returned.
SUPERSET means a is a superset of b. All matches of b will match a.
SUBSET means b is a superset of a.
INTERSECT means some matches of a will match b, but some won't and some matches of b won't match a.
DISJOINT means no matches of a will match b.
EQUAL means all matches of a will match b and all matches of b will match a.
def equal_test(a, b):
    if a == b:
        return EQUAL
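Only equal_test is shown in the answer; the remaining tests are domain-specific. As an illustration of the shape they take, here is a hypothetical extra test (my addition, not part of the original answer):

def dot_star_test(a, b):
    # Hypothetical: ".*" matches (almost) anything, so treat it as a
    # superset of every other pattern (ignoring newlines and flags).
    if a == ".*" and b != ".*":
        return SUPERSET
    if b == ".*" and a != ".*":
        return SUBSET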
The graph
class SubsetGraph(object):
    def __init__(self, tests):
        self.regexps = []
        self.tests = tests
        self._dirty = True
        self._roots = None

    @property
    def roots(self):
        if self._dirty:
            r = self._roots = [i for i in self.regexps if not i.supersets]
            return r
        return self._roots

    def add_regex(self, new_regex):
        roots = self.roots
        # index = position in insertion order
        new_re = RegexElement(new_regex, len(self.regexps))
        for element in roots:
            self.process(new_re, element)
        self.regexps.append(new_re)

    def process(self, new_re, element):
        relationship = self.compare(new_re, element)
        if relationship:
            getattr(self, 'add_' + relationship)(new_re, element)

    def add_SUPERSET(self, new_re, element):
        for i in element.subsets:
            i.supersets.add(new_re)
            new_re.subsets.add(i)
        element.supersets.add(new_re)
        new_re.subsets.add(element)

    def add_SUBSET(self, new_re, element):
        for i in element.subsets:
            self.process(new_re, i)
        element.subsets.add(new_re)
        new_re.supersets.add(element)

    def add_DISJOINT(self, new_re, element):
        for i in element.subsets:
            i.disjoints.add(new_re)
            new_re.disjoints.add(i)
        new_re.disjoints.add(element)
        element.disjoints.add(new_re)

    def add_INTERSECT(self, new_re, element):
        for i in element.subsets:
            self.process(new_re, i)
        new_re.intersects.add(element)
        element.intersects.add(new_re)

    def add_EQUAL(self, new_re, element):
        new_re.supersets = element.supersets.copy()
        new_re.subsets = element.subsets.copy()
        new_re.disjoints = element.disjoints.copy()
        new_re.intersects = element.intersects.copy()

    def compare(self, a, b):
        for test in self.tests:
            result = test(a.string, b.string)
            if result:
                return result

    def match(self, text, strict=True):
        matches = set()
        self._match(text, self.roots, matches)
        out = []
        for e in matches:
            # keep only the most specific matches: drop any element
            # that has a matching subset
            for s in e.subsets:
                if s in matches:
                    break
            else:
                out.append(e)
        if strict and len(out) > 1:
            for i in out:
                print(i.string)
            raise Exception("Multiple equally specific matches found for " + text)
        return out

    def _match(self, text, elements, matches):
        new_elements = []
        for element in elements:
            m = element.compiled.match(text)
            if m:
                matches.add(element)
                new_elements.extend(element.subsets)
        if new_elements:
            self._match(text, new_elements, matches)
Usage
graph = SubsetGraph([equal_test, test_2, test_3, ...])
graph.add_regex("00:11:22:..:..:..")
graph.add_regex("..(:..){5,5}"
graph.match("00:de:ad:be:ef:00")
A complete usable version is here.
My gut instinct says that this is a hard problem, both in terms of computational cost and implementation difficulty, and that it may be unsolvable in any realistic fashion. Consider the following two regular expressions, which accept the string alice@myprovider.com:
alice@[a-z]+\.[a-z]+
[a-z]+@myprovider.com
Which one of these is more specific?
This is a bit of a hack, but it could provide a practical solution to this question asked nearly 10 years ago.
As pointed out by @torak, there are difficulties in defining what it means for one regular expression to be more specific than another.
My suggestion is to look at how stable the regular expression is with respect to a string that matches it. The usual way to investigate stability is to make minor changes to the inputs, and see if you still get the same result.
For example, the string alice@myprovider.com matches the regex /alice@myprovider\.com/, but if you make any change to the string, it will not match. So this regex is very unstable. But the regex /.*/ is very stable, because you can make any change to the string, and it still matches.
So, in looking for the most specific regex, we are looking for the least stable one with respect to a string that matches it.
In order to implement this test for stability, we need to define how we choose a minor change to the string that matches the regex. This is another can of worms. We could, for example, choose to change each character of the string to something random and test that against the regex, or any number of other possible choices. For simplicity, I suggest deleting one character at a time from the string and testing that.
So, if the string that matches is N characters long, we have N tests to make. Let's look at deleting one character at a time from the string alice@foo.com, which matches all of the regular expressions in the table below. Each deletion gives one test. In the table below,
0 means the regex does not match (unstable),
1 means it matches (stable)
                /alice@[a-z]+\.[a-z]+/  /[a-z]+@[a-z]+\.[a-z]+/  /.*/
lice@foo.com             0                        1               1
aice@foo.com             0                        1               1
alce@foo.com             0                        1               1
alie@foo.com             0                        1               1
alic@foo.com             0                        1               1
alicefoo.com             0                        0               1
alice@oo.com             1                        1               1
alice@fo.com             1                        1               1
alice@fo.com             1                        1               1
alice@foocom             0                        0               1
alice@foo.om             1                        1               1
alice@foo.cm             1                        1               1
                       ---                      ---             ---
total score:             5                       10              12
The regex with the lowest score is the most specific. Of course, in general, there may be more than one regex with the same score, which reflects the fact that there are regular expressions which, by any reasonable way of measuring specificity, are as specific as one another. Although it may also yield the same score for regular expressions that one can easily argue are not as specific as each other (if you can think of an example, please comment).
But coming back to the question asked by @torak: which of these is more specific?
alice@[a-z]+\.[a-z]+
[a-z]+@myprovider.com
We could argue that the second is more specific because it constrains more characters, and the above test will agree with that view.
As I said, the way we choose to make minor changes to the string that matches more than one regex is a can of worms, and the answer that the above method yields may depend on that choice. But as I said, this is an easily implementable hack -- it is not rigorous.
And, of course, the method breaks if the string that matches is empty. The usefulness of the test will increase as the length of the string increases. With very short strings, it is more likely to produce equal scores for regular expressions that are clearly different in their specificity.
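For concreteness, here is a minimal Python sketch of the deletion-based scoring described above (my illustration, not rigorous either; fullmatch is used so a pattern must cover the whole string):

import re

def stability_score(pattern, text):
    """Count how many single-character deletions of text still match
    pattern (higher = more stable = less specific)."""
    regex = re.compile(pattern)
    variants = (text[:i] + text[i + 1:] for i in range(len(text)))
    return sum(1 for v in variants if regex.fullmatch(v))

def most_specific(patterns, text):
    """Among the patterns matching text, return the least stable one."""
    matching = [p for p in patterns if re.fullmatch(p, text)]
    return min(matching, key=lambda p: stability_score(p, text))

patterns = [r"alice@[a-z]+\.[a-z]+", r"[a-z]+@[a-z]+\.[a-z]+", r".*"]
print(most_specific(patterns, "alice@foo.com"))  # alice@[a-z]+\.[a-z]+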
I'm thinking about a similar problem for a PHP project's route parser. After reading the other answers and comments here, and also thinking about the cost involved, I might go in another direction altogether.
A solution, however, would be to simply sort the list of regular expressions in order of string length.
It's not perfect, but simply by removing the []-groups it would be much closer. On the first example in the question, it would turn this list:
- alice@[a-z]+\.[a-z]+
- [a-z]+@[a-z]+\.[a-z]+
- .*
into this, after removing the content of any []-group:
- alice@+\.+
- +@+\.+
- .*
The same goes for the second example in another answer; with the []-groups completely removed and sorted by length, this:
alice@[a-z]+\.[a-z]+
[a-z]+@myprovider.com
would become sorted as:
+@myprovider.com
alice@+\.+
This is a good enough solution, at least for me, if I choose to use it. The downside would be the overhead of removing all the []-groups before sorting and then applying the sort to the unmodified list of regexes, but hey - you can't get everything.
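A rough sketch of that heuristic in Python (my illustration; the []-groups are stripped with a small regex before measuring length):

import re

def specificity_key(pattern):
    # Strip []-groups, then use the remaining length as the sort key.
    return len(re.sub(r"\[[^\]]*\]", "", pattern))

patterns = [r".*", r"[a-z]+@[a-z]+\.[a-z]+", r"alice@[a-z]+\.[a-z]+"]
patterns.sort(key=specificity_key, reverse=True)
# -> ['alice@[a-z]+\\.[a-z]+', '[a-z]+@[a-z]+\\.[a-z]+', '.*']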
