Determine if a measure is aggregable or not in a data warehouse (database)

Can someone please tell me a general, step-by-step method to find out which fields of a data mart are non-aggregable? Here is an example I found:
Notes: italics mark a 'key', bold marks an 'abbreviation', and 'COLUMN:X' means the column references X.
Relational schema:
CALL(COD,DATE,FROM:S,TO:S,LEN)
SIM(SIM, USER:USER, TARIFF:TAR, BONUS)
TARIFF(TARIFF, CARRIER:CAR)
USER(USER, TOWN:TOW, LAST_TARIFF:TAR)
ROAMING_CALL(COD:CAL, FOREIGN_CARRIER:CAR)
PROMO_CALL(COD:CAL, PROMO_TARIFF:P_TA)
PROMO_TARIFF(TARIFF:TAR)
TOWN(TOWN,NATION)
CARRIER(CARRIER, NATION)
REQUESTS:
Build a fact schema for 'CALL' with the following
dimensions: DATE, SIM_FROM, CALLED_CARRIER, FOREIGN_CARRIER, PROMO_TARIFF
and measures: AVG_CALL_LENGTH, NUM_OUTGOING_SIM (as count distinct FROM),
NUM_INCOMING_SIM (as count distinct TO).
Now I can draw the fact schema, but I'm having trouble working out which measures are aggregable along which dimensions.
EDIT:
This is the PDF for the fact schema I have (sorry for not using a strict syntax, but reading notes are included).
Measures:
Standard [obtained from the operational schema]:
NUM_INCOMING_CALLS = COUNT DISTINCT (TO)
UN-AGGREGABILITIES -> *THIS IS MY ISSUE*
Calculated [obtained from the operational schema; partial data is needed to aggregate correctly, see the sketch after this list]:
AVG_CALL_LENGTH = CL_SUM / CL_COUNT
where
CL_SUM = SUM(LENGTH), CL_COUNT = COUNT(LENGTH)
UN-AGGREGABILITIES -> *THIS IS MY ISSUE*
Derived [can be found as a dimension]:
NUM_OUTGOING_CALLS = COUNT DISTINCT (FROM)
UN-AGGREGABILITIES -> *THIS IS MY ISSUE*
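As a side note on the 'partial data' remark above (my own illustration, not part of the original question): a calculated measure like AVG_CALL_LENGTH has to be carried around as its partial aggregates CL_SUM and CL_COUNT, because averaging pre-computed averages gives the wrong result. A minimal Python sketch:

# Illustration only: averages of averages are wrong; sums and counts aggregate correctly.
groups = [
    {"cl_sum": 60.0,  "cl_count": 3},   # three calls with average length 20
    {"cl_sum": 100.0, "cl_count": 1},   # one call with length 100
]

avg_of_avgs = sum(g["cl_sum"] / g["cl_count"] for g in groups) / len(groups)
true_avg = sum(g["cl_sum"] for g in groups) / sum(g["cl_count"] for g in groups)

print(avg_of_avgs)  # 60.0 -> wrong
print(true_avg)     # 40.0 -> correct AVG_CALL_LENGTH over all four calls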

OK, I went and asked my teacher: he gave me an easy algorithm.
Given a dimension schema D = {D1, D2, D3, ..., Dn} and a measure M = COUNT DISTINCT A, check for each dimension Di whether a non-trivial functional dependency X U A -> Di holds, with X a subset of D. For example:
X U A -> D1 (true)
X U A -> D2 (false)
X U A -> D3 (true)
...
X U A -> Dn-1 (false)
Then I have NA = {D2, Dn-1},
where NA is the set of non-aggregabilities: the dimensions for which no such dependency holds.
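A minimal sketch of that check in Python (my own illustration; it assumes the functional dependencies have already been worked out from the keys of the relational schema):

# Sketch of the teacher's algorithm, assuming the dependencies X U {A} -> Di
# are already known from the relational schema.
def non_aggregable_dimensions(dimensions, fd_holds):
    # dimensions: list of dimension names D1..Dn
    # fd_holds:   dict mapping each Di to True/False, i.e. whether some
    #             non-trivial X U {A} -> Di holds (X a subset of the dimensions)
    return {d for d in dimensions if not fd_holds[d]}

# Toy usage mirroring the example above:
dims = ["D1", "D2", "D3", "Dn-1"]
holds = {"D1": True, "D2": False, "D3": True, "Dn-1": False}
print(non_aggregable_dimensions(dims, holds))  # {'D2', 'Dn-1'}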


baseline fitting using Numpy poly1d

I have the following baseline (plot not included), and as can be seen it has an almost sinusoidal shape. I am trying to use polyfit on it. What I actually have are two arrays of data, one called x and the other y, so what I am using is:
porder = 2
coefs = np.polyfit(x, y, porder)
baseline = np.poly1d(coefs)
cleanspec = y - baseline(x)
My goal is to obtain a clean spectrum in the end, one that has a straight baseline with no undulation.
However, the fitting is not working. Any suggestions on using another, more effective method?
I have tried changing porder to 3, but I get this warning and it doesn't change anything:
Polyfit may be poorly conditioned
My data for x:
[1.10192816e+11 1.10192893e+11 1.10192969e+11 1.10193045e+11
1.10193122e+11 1.10193198e+11 1.10193274e+11 1.10193350e+11
1.10193427e+11 1.10193503e+11 1.10193579e+11 1.10193656e+11
1.10193732e+11 1.10193808e+11 1.10193885e+11 1.10193961e+11
1.10194037e+11 1.10194113e+11 1.10194190e+11 1.10194266e+11
1.10194342e+11 1.10194419e+11 1.10194495e+11 1.10194571e+11
1.10194647e+11 1.10194724e+11 1.10194800e+11 1.10194876e+11
1.10194953e+11 1.10195029e+11 1.10195105e+11 1.10195182e+11
1.10195258e+11 1.10195334e+11 1.10195410e+11 1.10195487e+11
1.10195563e+11 1.10195639e+11 1.10195716e+11 1.10195792e+11
1.10195868e+11 1.10195944e+11 1.10196021e+11 1.10196097e+11
1.10196173e+11 1.10196250e+11 1.10196326e+11 1.10196402e+11
1.10196479e+11 1.10196555e+11 1.10196631e+11 1.10196707e+11
1.10196784e+11 1.10196860e+11 1.10196936e+11 1.10197013e+11
1.10197089e+11 1.10197165e+11 1.10197241e+11 1.10197318e+11
1.10197394e+11 1.10197470e+11 1.10197547e+11 1.10197623e+11
1.10197699e+11 1.10197776e+11 1.10197852e+11 1.10197928e+11
1.10198004e+11 1.10198081e+11 1.10198157e+11 1.10198233e+11
1.10198310e+11 1.10198386e+11 1.10198462e+11 1.10198538e+11
1.10198615e+11 1.10198691e+11 1.10198767e+11 1.10198844e+11
1.10198920e+11 1.10198996e+11 1.10199073e+11 1.10199149e+11
1.10199225e+11 1.10199301e+11 1.10199378e+11 1.10199454e+11
1.10199530e+11 1.10199607e+11 1.10199683e+11 1.10199759e+11
1.10199835e+11 1.10199912e+11 1.10199988e+11 1.10200064e+11
1.10200141e+11 1.10202582e+11 1.10202658e+11 1.10202735e+11
1.10202811e+11 1.10202887e+11 1.10202963e+11 1.10203040e+11
1.10203116e+11 1.10203192e+11 1.10203269e+11 1.10203345e+11
1.10203421e+11 1.10203498e+11 1.10203574e+11 1.10203650e+11
1.10203726e+11 1.10203803e+11 1.10203879e+11 1.10203955e+11
1.10204032e+11 1.10204108e+11 1.10204184e+11 1.10204260e+11
1.10204337e+11 1.10204413e+11 1.10204489e+11 1.10204566e+11
1.10204642e+11 1.10204718e+11 1.10204795e+11 1.10204871e+11
1.10204947e+11 1.10205023e+11 1.10205100e+11 1.10205176e+11
1.10205252e+11 1.10205329e+11 1.10205405e+11 1.10205481e+11
1.10205557e+11 1.10205634e+11 1.10205710e+11 1.10205786e+11
1.10205863e+11 1.10205939e+11 1.10206015e+11 1.10206092e+11
1.10206168e+11 1.10206244e+11 1.10206320e+11 1.10206397e+11
1.10206473e+11 1.10206549e+11 1.10206626e+11 1.10206702e+11
1.10206778e+11 1.10206854e+11 1.10206931e+11 1.10207007e+11
1.10207083e+11 1.10207160e+11 1.10207236e+11 1.10207312e+11
1.10207389e+11 1.10207465e+11 1.10207541e+11 1.10207617e+11
1.10207694e+11 1.10207770e+11 1.10207846e+11 1.10207923e+11
1.10207999e+11 1.10208075e+11 1.10208151e+11 1.10208228e+11
1.10208304e+11 1.10208380e+11 1.10208457e+11 1.10208533e+11
1.10208609e+11 1.10208686e+11 1.10208762e+11 1.10208838e+11
1.10208914e+11 1.10208991e+11 1.10209067e+11 1.10209143e+11
1.10209220e+11 1.10209296e+11 1.10209372e+11 1.10209448e+11
1.10209525e+11 1.10209601e+11 1.10209677e+11 1.10209754e+11
1.10209830e+11]
and for y:
[ 0.00143858 0.05495827 0.07481739 0.03287334 -0.06275658 0.03744501
-0.04392341 0.02849104 0.03173781 0.09748282 0.02854265 0.06573162
0.08215295 0.0240697 0.00931477 0.17572605 0.06783381 0.04853354
-0.00226023 0.03722596 0.09687121 0.10767829 0.04922701 0.08036865
0.02371989 0.13885361 0.13903188 0.09910567 0.08793601 0.06048823
0.03932097 0.04061129 0.03706228 0.13764936 0.14150589 0.12226208
0.09041878 0.13638676 0.11107155 0.12261369 0.11765545 0.07425344
0.06643712 0.1449991 0.14256909 0.0924173 0.09291525 0.12216271
0.11272059 0.07618891 0.16787807 0.07832849 0.10786856 0.12381844
0.14182937 0.08078092 0.11932429 0.06383649 0.02923562 0.0864741
0.07806758 0.04514088 0.12929371 0.11769577 0.03619867 0.02811366
0.06401639 0.06883735 0.01162673 0.0956252 0.11206549 0.0485106
0.07269545 0.01662149 0.01287365 0.13401546 0.06300487 0.01994627
0.00721926 0.04863274 -0.01578364 0.0235379 0.03102316 0.00392559
0.05662182 0.04643381 -0.00665026 0.05532307 -0.01533339 0.04838893
0.02097954 0.02551123 0.03727188 -0.04001189 -0.04294883 0.02837669
-0.06062512 -0.0743994 -0.04665618 -0.03553261 -0.07057554 -0.07028277
-0.07502298 -0.07247965 -0.03540266 -0.03226398 -0.08014487 -0.11907543
-0.18521053 -0.1117617 -0.14377897 -0.07113503 -0.02480966 -0.07459746
-0.07994097 -0.02648713 -0.10288478 -0.13328137 -0.08121377 -0.13742166
-0.024583 -0.11391389 -0.02717251 -0.08876166 -0.04369363 -0.0790144
-0.09589054 -0.12058701 0.00041344 -0.06646403 -0.06368366 -0.10335613
-0.04508286 -0.18360729 -0.0551775 -0.06476622 -0.0834523 -0.01276785
-0.04145486 -0.14549992 -0.11186823 -0.07663398 -0.11920359 -0.0539315
-0.10507118 -0.09112374 -0.09751319 -0.06848278 -0.09031172 -0.07218853
-0.03129234 -0.04543539 -0.00942861 -0.06711099 -0.00712202 -0.11696418
-0.06344093 0.03624227 -0.04798777 0.01174394 -0.08326314 -0.06761215
-0.12063419 -0.05236908 -0.03914692 -0.05370061 -0.01620056 0.06731788
-0.06600111 -0.04601257 -0.02144361 0.00256863 -0.00093034 0.00629604
-0.0252835 -0.00907992 0.03583489 -0.03761906 0.10325763 0.08016437
-0.04900467 0.0110328 0.05019604 -0.04428984 -0.03208058 0.05095359
-0.01807463 0.0691733 0.07472691 0.00659871 0.00947692 0.0014422
0.05227057]
Having this huge offset in x is probably not helping. It definitely works when removing it for the fitting process. That looks like this:
import matplotlib.pyplot as plt
import numpy as np

# rescale x to remove the huge offset before fitting
scaledx = xdata * 1e-8 - 1100
coefs = np.polyfit( scaledx, ydata, 7 )
base = np.poly1d( coefs )
xt = np.linspace( 1.9, 2.1, 150 )
yt = base( xt )

fig = plt.figure()
ax = fig.add_subplot( 2, 1, 1 )
bx = fig.add_subplot( 2, 1, 2 )
ax.scatter( scaledx, ydata )                  # data
ax.plot( xt, yt )                             # fitted baseline
bx.plot( scaledx, ydata - base( scaledx ) )   # baseline-corrected spectrum
plt.show()
with xdata and ydata being numpy arrays of the OP data lists.
This produces a two-panel figure (image not included): the top panel shows the data with the fitted baseline, the bottom panel the baseline-corrected spectrum.
Addon
Concerning the 'poorly conditioned' warning, one should remember how simple linear least-squares fitting works. In the case of a polynomial, one builds the matrix:
A = [
[1, x1, x1**2, ...],
[1, x2, x2**2, ...],
...
[1, xn, xn**2, ...]
]
and one needs B^(-1), the inverse of B, with B = A^T A and A^T being the transpose of A. Now, with x values on the order of 1e11, B will have entries of order 1 at one end of the diagonal and, for a second-order polynomial, of order 1e44 at the other. For a third-order polynomial this gets correspondingly worse, so computing the inverse becomes numerically unstable. Luckily, as used above, this can be solved easily by simply rescaling the problem at hand.
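For illustration only (not part of the original answer), the conditioning argument can be checked directly with numpy; the sample x range below just mimics the OP's data:

import numpy as np

# Vandermonde-style design matrix A for a 2nd-order polynomial,
# once at the raw x values (~1e11) and once at the rescaled ones.
x_raw = np.linspace(1.10192816e11, 1.10209830e11, 150)
x_scaled = x_raw * 1e-8 - 1100

A_raw = np.vander(x_raw, 3)       # columns: x**2, x, 1
A_scaled = np.vander(x_scaled, 3)

# Condition number of B = A^T A: astronomically large for the raw data,
# modest after rescaling, which is why polyfit complains before rescaling.
print(np.linalg.cond(A_raw.T @ A_raw))
print(np.linalg.cond(A_scaled.T @ A_scaled))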

How to interpret oglmx function in R programming?

I am currently working on a project in which I am supposed to model public acceptance of pricing schemes.
The independent variables used in the model (Age, Gender, Income, etc.) are categorical in nature, so I converted them into factor variables using the as.factor() function.
Age Gender Income
0 1 2
0 0 0
0 0 1
I have certain other variables, like transit satisfaction and environment improvement, which are ordered factors on a scale of 1 to 5, 1 being extremely dissatisfied and 5 being very satisfied.
My model is as follows:
mdl = oglmx( prcing ~Ann_In1+Edu+Env_imp+rs_imp,data=cpdat, link = "logit", constantMEAN = F, constantSD = F, delta = 0, threshparam = NULL)
summary(mdl)
Estimate Std. error t value Pr(>|t|)
Ann_In11 0.1605540 0.3021613 0.5314 0.5951749
Ann_In12 -0.9556992 0.4218504 -2.2655 0.0234824 *
Edu1 0.0710699 0.2678081 0.2654 0.7907196
Edu2 1.0732587 0.7112519 1.5090 0.1313061
Env_imp.L -0.8524288 0.4899275 -1.7399 0.0818752 .
Env_imp.Q 0.0784353 0.3936332 0.1993 0.8420595
Env_imp.C 0.4589036 0.4498676 1.0201 0.3076878
Env_imp^4 -0.2219108 0.4423486 -0.5017 0.6159032
rd_sft.L 2.6335035 0.7362206 3.5771 0.0003475 ***
rd_sft.Q -0.7064391 0.5773880 -1.2235 0.2211377
rd_sft.C 0.0130127 0.4408486 0.0295 0.9764519
rd_sft^4 -0.2886550 0.3582014 -0.8058 0.4203318
I obtained the results shown above, but I am unable to interpret them. Any leads would be very helpful.
In the case of rd_sft (road safety), since rd_sft.L (linear) is significant while the other terms are not, can we neglect the other terms, i.e. .Q, .C and ^4, when formulating the model?
Please throw some light on the model formulation and its interpretation, as I am new to R.

How do I delete entries from two two-column tables such that their second columns match within a certain error

Say I have two arrays in MATLAB; let's call them locations1 and locations2:
locations1
1123.44977625437 890.824688325172
1290.31273560851 5065.65794385883
1718.10632735926 2563.44895531365
1734.55379433782 4408.20631924691
2050.70084480064 1214.45353443990
2299.46239346717 3781.34694047196
4186.02801290113 4386.67818566045
5676.10649593031 4529.23023993815
locations2
7474.22619378039 3166.41503120846
8604.40241305284 5069.40744277799
9048.25231808890 2563.58997620248
9059.71923042408 4381.75034710351
9643.05902166767 3796.42822996919
11460.8617087264 4392.85930695209
I want it so that any two entries of the second columns that match each other within 100.0 remain, while any entry that has no match gets removed. So I want the output to look like:
locations1
1290.31273560851 5065.65794385883
1718.10632735926 2563.44895531365
1734.55379433782 4408.20631924691
2299.46239346717 3781.34694047196
4186.02801290113 4386.67818566045
locations2
8604.40241305284 5069.40744277799
9048.25231808890 2563.58997620248
9059.71923042408 4381.75034710351
9643.05902166767 3796.42822996919
11460.8617087264 4392.85930695209
How would I do this, preferably without loops? Here is what I've done, but it uses loops:
locround1=round(locations1/50)*50;
locround2=round(locations2/50)*50;
for i=1:size(locations1,1)
nodel1(i)=sum(locround1(i,2)== locround2(:,2))
end
nodel1=repmat(nodel1>0,[2,1]);
nodel1=nodel1';
locations1=nodel1.*locations1;
locations1( ~any(locations1,2), : ) = [];
for i=1:size(locations2,1)
nodel2(i)=sum(locround2(i,2)== locround1(:,2))
end
nodel2=repmat(nodel2>0,[2,1]);
nodel2=nodel2';
locations2=nodel2.*locations2;
locations2( ~any(locations2,2), : ) = [];
This is what I got. If your MATLAB version has ismembertol, you can do it with the following code:
Li1 = ismembertol(locations1(:,2),locations2(:,2),100, 'DataScale', 1);
locations1_new = locations1 (Li1,:);
Li2 = ismembertol(locations2(:,2),locations1(:,2),100, 'DataScale', 1);
locations2_new = locations2 (Li2,:);
I tested it, and it works.
Let the data be defined as
locations1 = [
1123.44977625437 890.824688325172
1290.31273560851 5065.65794385883
1718.10632735926 2563.44895531365
1734.55379433782 4408.20631924691
2050.70084480064 1214.45353443990
2299.46239346717 3781.34694047196
4186.02801290113 4386.67818566045
5676.10649593031 4529.23023993815
];
locations2 = [
7474.22619378039 3166.41503120846
8604.40241305284 5069.40744277799
9048.25231808890 2563.58997620248
9059.71923042408 4381.75034710351
9643.05902166767 3796.42822996919
11460.8617087264 4392.85930695209
];
threshold = 100;
Then:
m = abs(locations1(:,2)-locations2(:,2).')<=threshold;
result1 = locations1(any(m,2),:);
result2 = locations2(any(m,1),:);
How this works:
The first line computes a matrix with the distance between each value from the second column of locations1 and each value from the second column of locations2. The distances are then compared with threshold, so that the matrix entries become true or false.
This makes use of implicit expansion, introduced in R2016b. For MATLAB versions before that, use bsxfun as follows:
m = abs(bsxfun(@minus, locations1(:,2), locations2(:,2).')) <= threshold;
Each row of the computed matrix, m, corresponds to a value from locations1; and each column corresponds to a value from locations2.
The second line uses logical indexing to select the rows of locations1 that satisfy the criterion for some value of locations2.
Similarly, the third line selects the rows of locations2 that satisfy the criterion for some value of locations1.
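For what it's worth, the same pairwise-comparison idea can be written with broadcasting in NumPy (added purely as an illustration; it is not part of the original MATLAB answers, and the function name is mine):

import numpy as np

def keep_matching_rows(locations1, locations2, threshold=100.0):
    # Pairwise |difference| of the second columns, shape (n, m), via broadcasting
    d = np.abs(locations1[:, 1][:, None] - locations2[:, 1][None, :])
    m = d <= threshold
    result1 = locations1[m.any(axis=1)]   # rows of locations1 with at least one match
    result2 = locations2[m.any(axis=0)]   # rows of locations2 with at least one match
    return result1, result2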

using lookup tables to plot a ggplot and table

I'm creating a Shiny app and I'm letting the user choose what data should be displayed in a plot and a table. This choice is made through three different input variables that contain 14, 4 and 2 choices respectively.
ui <- dashboardPage(
  dashboardHeader(),
  dashboardSidebar(
    selectInput(inputId = "DataSource", label = "Data source",
                choices = c("Restoration plots", "all semi natural grasslands")),
    # choicesVariables definition is omitted here because it's very long,
    # but it contains 14 string values
    selectInput(inputId = "Variabel", label = "Variable",
                choices = choicesVariables),
    selectInput(inputId = "Factor", label = "Factor",
                choices = c("Company type", "Region and type of application",
                            "Approved or not approved applications", "Age group"))
  ),
  dashboardBody(
    plotOutput("thePlot"),
    tableOutput("theTable")
  )
)
This adds up to 73 valid choices (yes, I know the math doesn't add up there, but some combinations are invalid). I would like to do this using a lookup table, so I created one with every valid combination of choices, like this:
rad1 <- c(rep("Company type", 20), rep("Region and type of application", 20),
          rep("Approved or not approved applications", 13), rep("Age group", 20))
rad2 <- choicesVariables[c(1:14, 1, 4, 5, 9, 10, 11,  1:14, 1, 4, 5, 9, 10, 11,
                           1:7, 9:14,  1:14, 1, 4, 5, 9, 10, 11)]
rad3 <- c(rep("Restoration plots", 14), rep("all semi natural grasslands", 6),
          rep("Restoration plots", 14), rep("all semi natural grasslands", 6),
          rep("Restoration plots", 27), rep("all semi natural grasslands", 6))
rad4 <- 1:73
letaLista <- data.frame(rad1, rad2, rad3, rad4)
colnames(letaLista) <- c("Factor", "Variabel", "rest_alla", "id")
Now it's easy to use subset to get only the choice that the user made. But how do I use this information to produce the plot and table without a 73-line-long ifelse statement?
I tried to create some sort of multidimensional array that could hold all the tables (and one for the plots), but I couldn't make it work. My experience with this kind of array is limited and this might be a simple issue, but any hints would be helpful!
The dataset that is the foundation for the plots and tables is a data frame with 23 variables, both factors and numerical. The plots and tables are then created using the following code for all 73 combinations:
s_A1 <- summarySE(Samlad_info, measurevar = "Dist_brukcentrum",
                  groupvars = "Companytype")
s_A1 <- s_A1[2:6, ]
p_A1 <- ggplot(s_A1, aes(x = Companytype, y = Dist_brukcentrum)) +
  geom_bar(position = position_dodge(), stat = "identity") +
  geom_errorbar(aes(ymin = Dist_brukcentrum - se, ymax = Dist_brukcentrum + se),
                width = .2, position = position_dodge(.9)) +
  scale_y_continuous(name = "") + scale_x_discrete(name = "")
where summarySE is the following function, borrowed from Cookbook for R:
# requires the plyr package for ddply() and rename()
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=TRUE,
                      conf.interval=.95, .drop=TRUE) {
# New version of length which can handle NA's: if na.rm==T, don't count them
length2 <- function (x, na.rm=FALSE) {
if (na.rm) sum(!is.na(x))
else length(x)
}
# This does the summary. For each group's data frame, return a vector with
# N, mean, and sd
datac <- ddply(data, groupvars, .drop=.drop,
.fun = function(xx, col) {
c(N = length2(xx[[col]], na.rm=na.rm),
mean = mean (xx[[col]], na.rm=na.rm),
sd = sd (xx[[col]], na.rm=na.rm)
)
},
measurevar
)
# Rename the "mean" column
datac <- rename(datac, c("mean" = measurevar))
datac$se <- datac$sd / sqrt(datac$N) # Calculate standard error of the mean
# Confidence interval multiplier for standard error
# Calculate t-statistic for confidence interval:
# e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
ciMult <- qt(conf.interval/2 + .5, datac$N-1)
datac$ci <- datac$se * ciMult
return(datac)
}
The code in its entirety is a bit too large, but I hope this clarifies what I'm trying to do.
Well, thanks to Florian's comment I think I might have found a solution myself. I'll present it here, but I'll leave the question open as there are probably far neater ways of doing it.
I gathered the plots (which are created as list objects by ggplot) into a list, and the summary tables into another:
plotList <- list(p_A1, p_A2, p_A3...)
tableList <- list(s_A1, s_A2, s_A3...)
I then used subset on my lookup table to get the matching id, which selects the right plot and table from the lists:
output$thePlot <-renderPlot({
plotValue<-subset(letaLista, letaLista$Factor==input$Factor &
letaLista$Variabel== input$Variabel & letaLista$rest_alla==input$DataSource)
plotList[as.integer(plotValue[1,4])]
})
output$theTable <-renderTable({
plotValue<-subset(letaLista, letaLista$Factor==input$Factor &
letaLista$Variabel== input$Variabel & letaLista$rest_alla==input$DataSource)
skriva <- tableList[as.integer(plotValue[4])]
print(skriva)
})

Text Mining Clustering Analysis in R - Error: Two dimensional array

I'm trying to follow a document that has some code on text mining clustering analysis.
I'm fairly new to R and the concepts of text mining and clustering, so please bear with me if I sound illiterate.
I create a simple matrix called dtm and then run kmeans to produce 3 clusters. The code I'm having issues with is where a function has been defined to get the "five most common words of the documents in the cluster":
dtm0.75 = as.matrix(dt0.75)
dim(dtm0.75)
kmeans.result = kmeans(dtm0.75, 3)
perClusterCounts = function(df, clusters, n)
{
v = sort(colSums(df[clusters == n, ]),
decreasing = TRUE)
d = data.frame(word = names(v), freq = v)
d[1:5, ]
}
perClusterCounts(dtm0.75, kmeans.result$cluster, 1)
Upon running this code I get the following error:
Error in colSums(df[clusters == n, ]) :
'x' must be an array of at least two dimensions
Could someone help me fix this please?
Thank you.
I can't reproduce your error; it works fine for me. Update your question with a reproducible example and you might get a more useful answer. Perhaps your input data object is empty: what do you get with dim(dtm0.75)?
Here it is working fine on the data that comes with the tm package:
library(tm)
data(crude)
dt0.75 <- DocumentTermMatrix(crude)
dtm0.75 = as.matrix(dt0.75)
dim(dtm0.75)
kmeans.result = kmeans(dtm0.75, 3)
perClusterCounts = function(df, clusters, n)
{
v = sort(colSums(df[clusters == n, ]),
decreasing = TRUE)
d = data.frame(word = names(v), freq = v)
d[1:5, ]
}
perClusterCounts(dtm0.75, kmeans.result$cluster, 1)
word freq
the the 69
and and 25
for for 12
government government 11
oil oil 10
