I'm writing a pyDatalog program to analyse weather data from Weather Underground (just as a demo for myself and others in the company at the moment). I have written a custom predicate resolver which returns readings between a start and end time:
# class for the reading table
class Reading(Base):
    __table__ = Table('reading', Base.metadata, autoload=True, autoload_with=engine)

    def __repr__(self):
        return str(self.Time)
# predicate to resolve 'timeBetween(X, Y, Z)' statements
# matches items as X where the time of day is between Y and Z (inclusive).
# if Y is later than Z, it returns the items not between Z and Y (exclusive).
# TODO - make it work where t1 and t2 are not bound.
# somehow needs to tell the engine to try somewhere else first.
@classmethod
def _pyD_timeBetween3(cls, dt, t1, t2):
    if dt.is_const():
        # dt is already known
        if t1.is_const() and t2.is_const():
            if (dt.id.Time.time() >= makeTime(t1.id)) and (dt.id.Time.time() <= makeTime(t2.id)):
                yield (dt.id, t1.id, t2.id)
    else:
        # dt is an unbound variable
        if t1.is_const() and t2.is_const():
            if makeTime(t2.id) > makeTime(t1.id):
                op = 'and'
            else:
                op = 'or'
            sqlWhere = "time(Time) >= '%s' %s time(Time) <= '%s'" % (t1.id, op, t2.id)
            for instance in cls.session.query(cls).filter(sqlWhere):
                yield (instance, t1.id, t2.id)
This works fine in the case where t1 and t2 are bound to specific values:
:> easterly(X) <= (Reading.WindDirection[X] == 'East')
:> + rideAfter('11:00:00')
:> + rideBefore('15:00:00')
:> goodTime(X) <= rideAfter(Y) & rideBefore(Z) & Reading.timeBetween(X, Y, Z)
:> goodTime(X)
[(2013-02-19 11:25:00,), (2013-02-19 12:45:00,), (2013-02-19 12:50:00,), (2013-02-19 13:25:00,), (2013-02-19 14:30:00,), (2013-02-19 15:00:00,), (2013-02-19 13:35:00,), (2013-02-19 13:50:00,), (2013-02-19 12:20:00,), (2013-02-19 12:35:00,), (2013-02-19 14:05:00,), (2013-02-19 11:20:00,), (2013-02-19 11:50:00,), (2013-02-19 13:15:00,), (2013-02-19 14:55:00,), (2013-02-19 12:00:00,), (2013-02-19 13:00:00,), (2013-02-19 14:20:00,), (2013-02-19 14:15:00,), (2013-02-19 13:10:00,), (2013-02-19 12:10:00,), (2013-02-19 14:45:00,), (2013-02-19 14:35:00,), (2013-02-19 13:20:00,), (2013-02-19 11:10:00,), (2013-02-19 13:05:00,), (2013-02-19 12:55:00,), (2013-02-19 14:10:00,), (2013-02-19 13:45:00,), (2013-02-19 13:55:00,), (2013-02-19 11:05:00,), (2013-02-19 12:25:00,), (2013-02-19 14:00:00,), (2013-02-19 12:05:00,), (2013-02-19 12:40:00,), (2013-02-19 14:40:00,), (2013-02-19 11:00:00,), (2013-02-19 11:15:00,), (2013-02-19 11:30:00,), (2013-02-19 11:45:00,), (2013-02-19 13:40:00,), (2013-02-19 11:55:00,), (2013-02-19 14:25:00,), (2013-02-19 13:30:00,), (2013-02-19 12:30:00,), (2013-02-19 12:15:00,), (2013-02-19 11:40:00,), (2013-02-19 14:50:00,), (2013-02-19 11:35:00,)]
However, if I declare the same rule (named niceTime below) with the conditions in the other order (i.e. where Y and Z are still unbound at the point it tries to resolve timeBetween), it returns an empty set:
:> atoms('niceTime')
:> niceTime(X) <= Reading.timeBetween(X, Y, Z) & rideAfter(Y) & rideBefore(Z)
<pyDatalog.pyEngine.Clause object at 0x0adfa510>
:> niceTime(X)
[]
This seems wrong - the two queries should return the same set of results.
My question is whether there is a way of handling this situation in pyDatalog. I think the timeBetween predicate needs some way to tell the engine to back off and resolve the other literals first before trying this one, but I can't see any reference to this in the docs.
The pyDatalog reference says: "although the order of pyDatalog statements is indifferent, the order of literals within a body is significant". pyDatalog does resolve the literals in a body in the order they are stated.
Having said that, it would be possible to improve pyDatalog to resolve predicates with bound variables first, but I'm not sure why this would be important.
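In the meantime, one way to make the ordering requirement visible is to have the custom resolver fail loudly when t1 and t2 arrive unbound, instead of silently yielding nothing. A minimal sketch of that guard (whether raising an exception is the right signal for the pyDatalog engine is an assumption, but it at least surfaces the misordering during development):

@classmethod
def _pyD_timeBetween3(cls, dt, t1, t2):
    if not (t1.is_const() and t2.is_const()):
        # t1/t2 are unbound: this resolver cannot enumerate every possible
        # pair of times, so complain instead of quietly yielding nothing
        raise RuntimeError("timeBetween: bind Y and Z (e.g. via rideAfter/rideBefore) "
                           "earlier in the rule body")
    # ... otherwise proceed exactly as in the version above ...

Until the resolver can genuinely handle unbound t1 and t2, the practical workaround remains stating the literals that bind Y and Z before Reading.timeBetween(X, Y, Z), as in the goodTime rule.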
Related
I use a netCDF file which stores one variable and has the following dimensions: lon, lat, time.
Generally speaking, I wish to compare it against other data that I already have in R as a data frame: the first two columns are coordinates in WGS84, and the remaining columns are values for specific times.
So I wrote the following code.
# since ncFile$dim$time$units says: [1] "days since 1900-1-1"
daysFromDate <- function(data1, data2 = "1900-01-01")
{
  round(as.numeric(difftime(data1, data2, units = "days")))
}
#study area:
lon <- c(40.25, 48)
lat <- c(16, 24.25)
myTime <- c(daysFromDate("2008-01-16"), daysFromDate("2011-12-31"))
varName <- "spei"
require(ncdf4)
require(RCurl)
x <- getBinaryURL("http://digital.csic.es/bitstream/10261/104742/3/SPEI_01.nc")
ncFile <- nc_open(x)
LonIdx <- which( ncFile$dim$lon$vals >= lon[1] | ncFile$dim$lon$vals <= lon[2])
LatIdx <- which( ncFile$dim$lat$vals >= lat[1] & ncFile$dim$lat$vals <= lat[2])
TimeIdx <- which( ncFile$dim$time$vals >= myTime[1] & ncFile$dim$time$vals <= myTime[2])
MyVariable <- ncvar_get( ncFile, varName)[ LonIdx, LatIdx, TimeIdx]
I thought that a data frame would be returned, so that I could easily manipulate the data (for example, check correlations or create plots).
Unfortunately a 3-dimensional array is returned instead.
How can I reformat this into a data frame with the following columns: X-Y-Time1-Time2-...?
So, example data would look as follows:
X  Y  2014-01-01 2014-01-02 2014-01-03
50 17        0.5        0.4        0.3
where 0.5, 0.4 and 0.3 are example variable values.
Or maybe there is a different solution?
OK, try the following code, but note that it assumes the ranges are densely filled. I also changed the lon test from | (or) to & (and).
require(ncdf4)
nc <- nc_open("SPEI_01.nc")
print(nc)
lon <- ncvar_get(nc, "lon")
lat <- ncvar_get(nc, "lat")
time <- ncvar_get(nc, "time")
lonIdx <- which( lon >= 40.25 & lon <= 48.00)
latIdx <- which( lat >= 16.00 & lat <= 24.25)
myTime <- c(daysFromDate("2008-01-16"), daysFromDate("2011-12-31"))
timeIdx <- which(time >= myTime[1] & time <= myTime[2])
data <- ncvar_get(nc, "spei")[lonIdx, latIdx, timeIdx]
indices <- expand.grid(lon[lonIdx], lat[latIdx], time[timeIdx])
print(length(indices))
class(indices)
summary(indices)
str(indices)
df <- data.frame(cbind(indices, as.vector(data)))
summary(df)
str(df)
UPDATE
OK, it looks like I've understood what you want, but I don't have a direct solution yet. What I have so far: split the data frame using either the split() function or the data.table package. After splitting by X & Y, you'll get a list of small data frames in which X & Y are constant within each frame. It is probably possible to transpose and recombine them, but I don't know how; it might be a better idea to keep working with the data in long (column) form. The lists are nested, but they can be flattened. Here is a link on splitting in R: http://www.uni-kiel.de/psychologie/rexrepos/posts/dfSplitMerge.html
Code, continued from the previous example:
require(data.table)
colnames(df) <- c("X","Y","Time","spei")
df$Time <- as.Date(df$Time, origin="1900-01-01")
dt <- as.data.table(df)
summary(dt)
# Taken from https://github.com/Rdatatable/data.table/issues/1389
# x data.table
# f use `by` argument instead - unlike data.frame
# drop logical default FALSE will include `by` columns in resulting data.tables - unlike data.frame
# by character column names on which split into lists
# flatten logical default FALSE will result in recursive nested list having data.table as leafs
# ... ignored
split.data.table <- function(x, f, drop = FALSE, by, flatten = FALSE, ...){
  if(missing(by) && !missing(f)) by = f
  stopifnot(!missing(by), is.character(by), is.logical(drop), is.logical(flatten), !".ll" %in% names(x), by %in% names(x), !"nm" %in% by)
  if(!flatten){
    .by = by[1L]
    tmp = x[, list(.ll=list(.SD)), by = .by, .SDcols = if(drop) setdiff(names(x), .by) else names(x)]
    setattr(ll <- tmp$.ll, "names", tmp[[.by]])
    if(length(by) > 1L) return(lapply(ll, split.data.table, drop = drop, by = by[-1L])) else return(ll)
  } else {
    tmp = x[, list(.ll=list(.SD)), by = by, .SDcols = if(drop) setdiff(names(x), by) else names(x)]
    setattr(ll <- tmp$.ll, 'names', tmp[, .(nm = paste(.SD, collapse = ".")), by = by, .SDcols = by]$nm)
    return(ll)
  }
}
# here is data.table split
q <- split.data.table(dt, by = c("X","Y"), drop=FALSE)
str(q)
# here is data frame split
qq <- split(df, list(df$X, df$Y))
str(qq)
I've been trying to extend the xor-swap to more than two variables, say n variables, but I haven't found anything better than 3*(n-1) XOR operations.
For two integer variables x1 and x2 you can swap them like this:
swap(x1, x2) {
    x1 = x1 ^ x2;
    x2 = x1 ^ x2;
    x1 = x1 ^ x2;
}
So, assume you have x1 ... xn with values v1 ... vn. Clearly you can "rotate" the values by successively applying swap:
swap(x1,x2);
swap(x2,x3);
swap(x3,x4);
...
swap(xm,xn); // with m = n-1
You will end up with x1 = v2, x2 = v3, ..., xn = v1.
Which costs n-1 swaps, each costing 3 xors, leaving us with (n-1)*3 xors.
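For concreteness, here is a small Python sketch of this rotation (the helper name xor_rotate is just illustrative); it performs exactly 3*(n-1) xor-assignments:

def xor_rotate(xs):
    # rotate the values left by one place using only ^= (3 xors per adjacent swap)
    for i in range(len(xs) - 1):
        xs[i] ^= xs[i + 1]
        xs[i + 1] ^= xs[i]
        xs[i] ^= xs[i + 1]
    return xs

values = [10, 20, 30, 40]
xor_rotate(values)
# values is now [20, 30, 40, 10]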
Is a faster algorithm using xor and assignment only and no additional variables known?
As a partial result I tried a brute force search for N=3,4,5 and all of these agree with your formula.
Python code:
from collections import *
D = defaultdict(int)  # Map from tuple of bitmasks to number of steps to get there
N = 5
Q = deque()
Q.append( (tuple(1<<n for n in range(N)), 0) )
goal = tuple(1<<( (n+1)%N ) for n in range(N))
while Q:
    masks, ops = Q.popleft()
    if len(D) % 10000 == 0:
        print len(D), len(Q), ops
    ops += 1
    # Choose two to swap
    for a in range(N):
        for b in range(N):
            if a == b:
                continue
            masks2 = list(masks)
            masks2[a] = masks2[a] ^ masks2[b]
            masks2 = tuple(masks2)
            if masks2 in D:
                continue
            D[masks2] = ops
            if masks2 == goal:
                print 'found goal in', ops
                raise ValueError
            Q.append( (masks2, ops) )
I am trying to create two data sets. The first summarizes the data by 2 groups, which I have done using the following code:
x = rnorm(1:100)
g1 = sample(LETTERS[1:3], 100, replace = TRUE)
g2 = sample(LETTERS[24:26], 100, replace = TRUE)
aggregate(x, list(g1, g2), mean)
The second needs to summarize the data by the first group and NOT the second group.
If we consider the possible pairs from the previous example:
A - X B - X C - X
A - Y B - Y C - Y
A - Z B - Z C - Z
The second dataset should summarize the data as the average of the outgroup:
A - not X
A - not Y
A - not Z etc.
Is there a way to manipulate aggregate functions in R to achieve this?
Alternatively, I thought a dummy variable could represent the data in this way, although I am unsure how it would look.
I have found this answer here:
R using aggregate to find a function (mean) for "all other"
I think this indicates that a dummy variable for each pairing is necessary. However, if anyone can offer a better or more efficient way, that would be appreciated, as there are many pairings in the true data set.
Thanks in advance
First let us generate the data reproducibly (using set.seed):
# same as question but added set.seed for reproducibility
set.seed(123)
x = rnorm(1:100)
g1 = sample(LETTERS[1:3], 100, replace = TRUE)
g2 = sample(LETTERS[24:26], 100, replace = TRUE)
Now we have two solutions both of which use aggregate:
1) ave
# x equals the sums over the groups and n equals the counts
ag = cbind(aggregate(x, list(g1, g2), sum),
           n = aggregate(x, list(g1, g2), length)[, 3])

ave.not <- function(x, g) ave(x, g, FUN = sum) - x

transform(ag,
          x = NULL,   # don't need x any more
          n = NULL,   # don't need n any more
          mean = x/n,
          mean.not = ave.not(x, Group.1) / ave.not(n, Group.1)
)
This gives:
Group.1 Group.2 mean mean.not
1 A X 0.3155084 -0.091898832
2 B X -0.1789730 0.332544353
3 C X 0.1976471 0.014282465
4 A Y -0.3644116 0.236706489
5 B Y 0.2452157 0.099240545
6 C Y -0.1630036 0.179833987
7 A Z 0.1579046 -0.009670734
8 B Z 0.4392794 0.033121335
9 C Z 0.1620209 0.033714943
To double check the first value under mean and under mean.not:
> mean(x[g1 == "A" & g2 == "X"])
[1] 0.3155084
> mean(x[g1 == "A" & g2 != "X"])
[1] -0.09189883
2) sapply
Here is a second approach which gives the same answer:
ag <- aggregate(list(mean = x), list(g1, g2), mean)
f <- function(i) mean(x[g1 == ag$Group.1[i] & g2 != ag$Group.2[i]])
ag$mean.not = sapply(1:nrow(ag), f)
ag
REVISED: revised based on comments by the poster; added a second approach and made some minor improvements.
I've set up a Django project in which I create random points. These random points are stored in a database (SQLite); I can see them via the admin site and change the values, so that part works.
If I then write a script I can access the points and plot them. See the code below.
But if I then want to mine these points, to sort them or plot only a selection of the dataset, I run into trouble: if I read out the values separately they are no longer connected, and sorting x on its own would mix up the point set.
Is there a way to sort the data set by, in this case, X and then print the sorted set, keeping the x, y, z and name values of each point intact? (See the answer below: for point in Point3D.objects.all().order_by('x'):)
And if I now only want the points with x between 12 and 30, how can I add this extra filter?
My code is as follows:
models.py:
class Point3D(models.Model):
    name = models.CharField(max_length=10)
    x = models.DecimalField(max_digits=5, decimal_places=2)
    y = models.DecimalField(max_digits=5, decimal_places=2)
    z = models.DecimalField(max_digits=5, decimal_places=2)
generate the points:
import random
from books.models import Point3D

def points():
    for i in range(20):
        x = random.randint(0, 100)
        y = random.randint(0, 100)
        z = random.randint(0, 100)
        p = Point3D(name=x, x=x, y=y, z=z)
        p.save()

points()
in views.py:
def ThreeGraphs(request):
    fig = Figure()
    fig.suptitle('2D-punten')
    ax = fig.add_subplot(111)
    for point in Point3D.objects.all():
        print point
        name = int(point.name)
        xs = int(point.x)
        ys = int(point.y)
        zs = int(point.z)
        print (xs, ys, zs)
        ax.plot(xs, ys, 'bo')
    HttpResponse(mimetype="image/png")
    FigureCanvas(fig)
    fig.savefig('template/images/testing.png')
    picture = "testing.png"
    return render_to_response('Test.html', {'picture': picture}, RequestContext(request))
I hope someone knows how to solve my trouble.
Thanks a lot!
Tijl
You need to do this:
for point in Point3D.objects.all().order_by('x'):
This will return the points in sorted order by the 'x' field. You can say order_by('-x') to reverse the sort order.
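For the second part of the question (only points with x between 12 and 30), the queryset can be narrowed with filter() before ordering. A minimal sketch using Django's built-in range lookup:

# only the points whose x value lies between 12 and 30 (inclusive), sorted by x
for point in Point3D.objects.filter(x__range=(12, 30)).order_by('x'):
    print point

Equivalently, filter(x__gte=12, x__lte=30) gives the same inclusive range.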
I am studying for a test, and this is on the study guide sheet. This is not homework, and will not be graded.
Relation Schema R = (A,B,C,D,E)
Functional Dependencies = (AB->E, C->AD, D->B, E->C)
Is r1 = (A,C,D), r2 = (B,C,E) OR
x1 = (A,C,D), x2 = (A,B,E) a lossless-join decomposition, and why?
My relational algebra is horribly rusty, but here is how I remember it goes:
If r1 ∩ r2 -> r1 - r2 or r1 ∩ r2 -> r2 - r1 follows from the FDs, then you have a lossless decomposition.
r1 ∩ r2 = C
r1 - r2 = AD
C->AD is in functional dependencies => lossless
for x1 and x2
x1 ∩ x2 = A
x1 - x2 = CD
A->CD does not follow from the FDs
now check x2 - x1
x2 - x1 = BE
A->BE does not follow from the FDs either, therefore the decomposition is lossy
References here; please check for horrible mistakes that I might have committed.
Here is my understanding: basically, you look at your decomposition and determine whether the attributes common to the two relations form a key of at least one of them.
So with R1 and R2, the only attribute common to both is C. C is a key of R1, since you are given C -> AD. So it's lossless.
For X1 and X2, the only common attribute is A, which by itself is not a key of either X1 or X2 under the functional dependencies you are given.
Functional Dependencies = (AB->E, C->AD, D->B, E->C)
r1 = (A,C,D), r2 = (B,C,E) is lossless, as you can verify by performing the chase algorithm.
It can be seen that both tables agree on C, and the dependency C->AD is available in the table ACD.
x1 = (A,C,D), x2 = (A,B,E) is lossy, as you will conclude after performing the chase algorithm.
Alternatively: the two tables agree only on A, and nothing else is fully functionally dependent on A.
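The chase test itself is mechanical. Here is a minimal Python sketch of the tableau chase (the function and symbol names are just illustrative), with FDs given as (lhs, rhs) pairs of attribute sets:

def chase_lossless(R, decomposition, fds):
    # One tableau row per relation: the distinguished symbol ('a', A) if
    # attribute A belongs to that relation, otherwise a unique symbol ('b', i, A).
    rows = [{A: ('a', A) if A in Ri else ('b', i, A) for A in R}
            for i, Ri in enumerate(decomposition)]

    def equate(old, new):
        # replace every occurrence of symbol `old` with `new` in the tableau
        for row in rows:
            for A in row:
                if row[A] == old:
                    row[A] = new

    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            for r1 in rows:
                for r2 in rows:
                    if r1 is not r2 and all(r1[A] == r2[A] for A in lhs):
                        for A in rhs:
                            if r1[A] != r2[A]:
                                # prefer the distinguished symbol when equating
                                if r1[A][0] == 'a':
                                    equate(r2[A], r1[A])
                                else:
                                    equate(r1[A], r2[A])
                                changed = True
    # lossless iff some row consists entirely of distinguished symbols
    return any(all(row[A] == ('a', A) for A in R) for row in rows)

R = {'A', 'B', 'C', 'D', 'E'}
fds = [({'A', 'B'}, {'E'}), ({'C'}, {'A', 'D'}), ({'D'}, {'B'}), ({'E'}, {'C'})]
print(chase_lossless(R, [{'A', 'C', 'D'}, {'B', 'C', 'E'}], fds))  # True  -> lossless
print(chase_lossless(R, [{'A', 'C', 'D'}, {'A', 'B', 'E'}], fds))  # False -> lossy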
As described here, decomposition of R into R1 and R2 is lossless if
(1) Attributes(R1) U Attributes(R2) = Attributes(R)
(2) Attributes(R1) ∩ Attributes(R2) ≠ Φ
(3) The common attributes must form a key for at least one relation (R1 or R2)
EDIT
with the assumption that only the non-trivial cases are considered here, which I think the OP intended too (so that (2) holds under this non-trivial assumption):
e.g., we are not considering the trivial corner case where every tuple of R1 / R2 is unique, i.e., where the empty set {} is a key (as @philipxy pointed out), in which case any decomposition is lossless and hence not interesting (since spurious tuples can't be created upon joining). The corner cases for which a decomposition can be lossless despite Attributes(R1) ∩ Attributes(R2) = Φ are therefore ruled out.
We can check the above conditions with the following Python code snippet:
def closure(s, fds):
    c = s
    for f in fds:
        l, r = f[0], f[1]
        if l.issubset(c):
            c = c.union(r)
    if s != c:
        c = closure(c, fds)
    return c

def is_superkey(s, rel, fds):
    c = closure(s, fds)
    print(f'({"".join(sorted(s))})+ = {"".join(sorted(c))}')
    return rel.issubset(c)  # c == rel

def is_lossless_decomp(r1, r2, r, fds):
    c = r1.intersection(r2)
    if r1.union(r2) != r:
        print('not lossless: R1 U R2 ≠ R!')
        return False
    if len(c) == 0:
        print('not lossless: no common attribute between R1 and R2!')
        return False
    if not is_superkey(c, r1, fds) and not is_superkey(c, r2, fds):
        print(f'not lossless: common attribute {"".join(c)} not a key in R1 or R2!')
        return False
    print('lossless decomposition!')
    return True
To convert the given FDs (written in the standard form) into a suitable data structure, we can use the following function:
import re

def process_fds(fds):
    pfds = []
    for fd in fds:
        fd = re.sub(r'\s+', '', fd)
        l, r = fd.split('->')
        pfds.append([set(list(l)), set(list(r))])
    return pfds
Now let's test with the above decompositions given:
R = {'A','B','C','D','E'}
fds = process_fds(['AB->E', 'C->AD', 'D->B', 'E->C'])
R1, R2 = {'A', 'C', 'D'}, {'B', 'C', 'E'}
is_lossless_decomp(R1, R2, R, fds)
# (C)+ = ACD
# lossless decomposition!
R1, R2 = {'A', 'C', 'D'}, {'A', 'B', 'E'}
is_lossless_decomp(R1, R2, R, fds)
# (A)+ = A
# not lossless: common attribute A not a key in R1 or R2!