So I have several tables (in Python, but I am open to any language/program that could do this):
timestamp = {123234 , 2342343, 2342345 , ...}
good_bad = {1 , -1 , -1 , ...}
number_of_t = {4 , 3 , 5 ...}
user = {23 , 45 , 23 ...}
....
I want to know if it is possible to compare these tables to a table of a traded stock:
time_stock = {123123, 123123, 123123, ....}
price_stock = {123, 122 , 121, ...}
and find correlations or logical patterns, preferably expressed as a function, something like
good_bad(x)+2*number_of_t(x-4) - good_bad(x)^2 = time_stock(x+10)
x being the position in the table.
Of course, for a stock I would only want to know patterns that can predict the "future", but that does not change the main point of the question:
How can I find such patterns in a data set?
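One common starting point for "find patterns that predict the future" is lagged cross-correlation: correlate a candidate signal at position x with the price at position x + k for several lags k, and look for lags with a strong relationship. A minimal pure-Python sketch (the names good_bad and price_stock come from the question; pearson, lagged_correlation, and the toy data are hypothetical illustrations):

```python
def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def lagged_correlation(signal, prices, lag):
    """Correlate signal[x] with prices[x + lag]; a strong correlation at a
    positive lag suggests the signal may lead the price."""
    if lag:
        return pearson(signal[:-lag], prices[lag:])
    return pearson(signal, prices)

# toy data, just to show the mechanics
good_bad = [1, -1, -1, 1, 1, -1, 1, -1]
price_stock = [123, 122, 121, 123, 124, 122, 125, 121]

# scan a few forward-looking lags and keep the strongest relationship
best = max(range(1, 4),
           key=lambda k: abs(lagged_correlation(good_bad, price_stock, k)))
```

Searching over arbitrary functional forms like the one in the question is symbolic regression (genetic-programming tools exist for that); lagged correlation is just the simplest sanity check to run first.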
For example, I create a class 'student':
classdef student
    properties
        name
        sex
        age
    end
    methods
        function obj = student(name,sex,age)
            obj.name = name;
            obj.sex = sex;
            obj.age = age;
        end
    end
end
and then create some objects in an array 'school':
school(1) = student('A','boy',19)
school(2) = student('B','girl',18)
school(3) = student('C','boy',20)
school(4) = student('D','girl',19)
My question is: how do I find the indices of the objects with certain properties in the array 'school'?
For example, if I want to find students with age 19, the result should be the indices [1,4].
If I want to find students with age 19 and sex 'boy', the result should be the index [1].
Further question 1: how do I find the row and column index? The object with sex 'girl' and age 19 lies in row 1, column 4.
Further question 2: if school is a cell array, how do I solve the above problems?
This seems like a homework question, but here are the answers:
% find students with age 19
find( [school(:).age] == 19 )
% find students with age 19 and sex 'boy'
find( [school(:).age] == 19 & strcmp( {school(:).sex}, 'boy' ) )
% find the row and column index
[row, col] = ind2sub( size(school), find( [school(:).age] == 19 & strcmp( {school(:).sex}, 'girl' ) ) )
Considering the last question, I would convert the cell array of school objects back into a regular array and proceed as shown above.
If school is a cell array, so that you have
school = cell(4,1);
school{1} = student('A','boy',19);
school{2} = student('B','girl',18);
school{3} = student('C','boy',20);
school{4} = student('D','girl',19);
Then you can loop through them to evaluate your conditions. A concise way to do this is with cellfun:
boolAge19 = cellfun( @(x) x.age == 19, school );
idxAge19 = find( boolAge19 );
boolBoy = cellfun( @(x) strcmp( x.sex, 'boy' ), school );
idxBoy = find( boolBoy );
boolBoyAnd19 = boolAge19 & boolBoy;
idxBoyAnd19 = find( boolBoyAnd19 );
You can of course skip the intermediate steps; the lines just get dense:
idxBoyAnd19 = find( cellfun( @(x) x.age == 19, school ) & ...
    cellfun( @(x) strcmp( x.sex, 'boy' ), school ) );
I have a post-join dataset where the columns are identical, except that the right side has new and corrected data and a .TODROP suffix appended to each column name.
So the dataset looks something like this:
df = spark.createDataFrame(
[
(1, "Mary", 133, "Pizza", "Mary", 13, "Pizza"),
(2, "Jimmy", 8, "Hamburger", None, None, None),
(3, None, None, None, "Carl", 6, "Cake")
],
["guid", "name", "age", "fav_food", "name.TODROP", "age.TODROP", "fav_food.TODROP"]
)
I'm trying to copy the right-side columns over to the left-side columns wherever they have a value:
if df['name.TODROP'].isNotNull():
    df['name'] = df['name.TODROP']
if df['age.TODROP'].isNotNull():
    df['age'] = df['age.TODROP']
if df['fav_food.TODROP'].isNotNull():
    df['fav_food'] = df['fav_food.TODROP']
However, the problem is that this brute-force approach will take much longer on my real dataset, because it has many more columns than this example. I'm also getting this error, so it wasn't working out anyway:
"pyspark.sql.utils.AnalysisException: Can't extract value from
name#1527: need struct type but got string;"
Another attempt, where I try to do it in a loop:
col_list = []
suffix = ".TODROP"
for x in df.columns:
    if not x.endswith(suffix):
        col_list.append(x)
for x in col_list:
    df[x] = df[x + suffix]
Same error as above.
Goal: fill the original columns from their .TODROP counterparts where present, then drop the suffixed columns.
Can someone point me in the right direction? Thank you.
First of all, the dots in your column names are misleading: Spark interprets a dotted name as a field access on a struct-type column, which is where your AnalysisException comes from. Wrapping the column name in backticks prevents that misinterpretation.
import pyspark.sql.functions as f

suffix = '.TODROP'
cols1 = [c for c in df.columns if not c.endswith(suffix)]
cols2 = [c for c in df.columns if c.endswith(suffix)]
for c in cols1[1:]:  # skip the guid key column
    df = df.withColumn(c, f.coalesce(f.col(c), f.col('`' + c + suffix + '`')))
df = df.drop(*cols2)
df.show()
+----+-----+---+---------+
|guid| name|age| fav_food|
+----+-----+---+---------+
| 1| Mary|133| Pizza|
| 2|Jimmy| 8|Hamburger|
| 3| Carl| 6| Cake|
+----+-----+---+---------+
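For comparison, this is all the coalesce step does per row, sketched in plain Python over dict rows (no Spark; coalesce_rows is a hypothetical helper, purely illustrative):

```python
SUFFIX = ".TODROP"

def coalesce_rows(rows, suffix=SUFFIX):
    """For each row, fill a base column from its suffixed twin when the
    base value is None, then drop the suffixed keys."""
    out = []
    for row in rows:
        merged = {}
        for key, value in row.items():
            if key.endswith(suffix):
                continue  # suffixed columns are dropped from the result
            twin = row.get(key + suffix)
            merged[key] = value if value is not None else twin
        out.append(merged)
    return out

print(coalesce_rows([{"guid": 3, "name": None, "name.TODROP": "Carl"}]))
# → [{'guid': 3, 'name': 'Carl'}]
```

The Spark version does exactly this, but column-wise and distributed, which is why the withColumn/coalesce loop scales to many columns without per-row Python code.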
I am currently working on a project in which I am supposed to model public acceptance of pricing schemes.
The independent variables used for the model (age, gender, income, etc.) are categorical in nature, so I converted them into factor variables using the as.factor() function.
Age Gender Income
0 1 2
0 0 0
0 0 1
I have certain other variables, like transit satisfaction and environment improvement, which are ordered factors on a scale of 1 to 5, 1 being extremely dissatisfied and 5 being very satisfied.
My model is as follows:
mdl = oglmx(prcing ~ Ann_In1 + Edu + Env_imp + rs_imp, data = cpdat, link = "logit",
            constantMEAN = FALSE, constantSD = FALSE, delta = 0, threshparam = NULL)
summary(mdl)
Estimate Std. error t value Pr(>|t|)
Ann_In11 0.1605540 0.3021613 0.5314 0.5951749
Ann_In12 -0.9556992 0.4218504 -2.2655 0.0234824 *
Edu1 0.0710699 0.2678081 0.2654 0.7907196
Edu2 1.0732587 0.7112519 1.5090 0.1313061
Env_imp.L -0.8524288 0.4899275 -1.7399 0.0818752 .
Env_imp.Q 0.0784353 0.3936332 0.1993 0.8420595
Env_imp.C 0.4589036 0.4498676 1.0201 0.3076878
Env_imp^4 -0.2219108 0.4423486 -0.5017 0.6159032
rd_sft.L 2.6335035 0.7362206 3.5771 0.0003475 ***
rd_sft.Q -0.7064391 0.5773880 -1.2235 0.2211377
rd_sft.C 0.0130127 0.4408486 0.0295 0.9764519
rd_sft^4 -0.2886550 0.3582014 -0.8058 0.4203318
I obtained the results above, but I am unable to interpret them. Any leads would be very helpful.
In the case of rd_sft (road safety): since rd_sft.L (the linear contrast) is significant while the other polynomial contrasts are not, can we drop the other terms (Q, C, ^4) from the model?
Please throw some light on the model formulation and its interpretation, as I am new to R.
I'm creating a Shiny app and I'm letting the user choose what data should be displayed in a plot and a table. This choice is made through three different input variables, which have 14, 4, and 2 choices respectively.
ui <- dashboardPage(
  dashboardHeader(),
  dashboardSidebar(
    selectInput(inputId = "DataSource", label = "Data source",
                choices = c("Restoration plots", "all semi natural grasslands")),
    # the choicesVariables definition is omitted here because it is very long,
    # but it contains 14 string values
    selectInput(inputId = "Variabel", label = "Variable",
                choices = choicesVariables),
    selectInput(inputId = "Factor", label = "Factor",
                choices = c("Company type", "Region and type of application",
                            "Approved or not approved applications", "Age group"))
  ),
  dashboardBody(
    plotOutput("thePlot"),
    tableOutput("theTable")
  )
)
This adds up to 73 choices (yes, I know the math doesn't add up there; some combinations are invalid). I would like to do this using a lookup table, so I created one with every valid combination of choices, like this:
rad1 <- c(rep("Company type", 20), rep("Region and type of application", 20),
          rep("Approved or not approved applications", 13), rep("Age group", 20))
rad2 <- choicesVariables[c(1:14, 1, 4, 5, 9, 10, 11, 1:14, 1, 4, 5, 9, 10, 11,
                           1:7, 9:14, 1:14, 1, 4, 5, 9, 10, 11)]
rad3 <- c(rep("Restoration plots", 14), rep("all semi natural grasslands", 6),
          rep("Restoration plots", 14), rep("all semi natural grasslands", 6),
          rep("Restoration plots", 27), rep("all semi natural grasslands", 6))
rad4 <- 1:73
letaLista <- data.frame(rad1, rad2, rad3, rad4)
colnames(letaLista) <- c("Factor", "Variabel", "rest_alla", "id")
Now it's easy to use subset to get only the combination the user chose. But how do I use this information to render the plot and table without a 73-branch if/else statement?
I tried to create some sort of multidimensional array that could hold all the tables (and one for the plots), but I couldn't make it work. My experience with this kind of array is limited, and this might be a simple issue, but any hints would be helpful!
The dataset that is the foundation for the plots and table is a data frame with 23 variables, both factors and numerical. The plots and tables are then created using the following code for all 73 combinations:
s_A1 <- summarySE(Samlad_info, measurevar = "Dist_brukcentrum",
                  groupvars = "Companytype")
s_A1 <- s_A1[2:6, ]
p_A1 <- ggplot(s_A1, aes(x = Companytype, y = Dist_brukcentrum)) +
  geom_bar(position = position_dodge(), stat = "identity") +
  geom_errorbar(aes(ymin = Dist_brukcentrum - se, ymax = Dist_brukcentrum + se),
                width = .2, position = position_dodge(.9)) +
  scale_y_continuous(name = "") +
  scale_x_discrete(name = "")
where summarySE is the following function, borrowed from Cookbook for R:
summarySE <- function(data = NULL, measurevar, groupvars = NULL, na.rm = TRUE,
                      conf.interval = .95, .drop = TRUE) {
  # New version of length which can handle NAs: if na.rm == TRUE, don't count them
  length2 <- function(x, na.rm = FALSE) {
    if (na.rm) sum(!is.na(x))
    else length(x)
  }
  # This does the summary. For each group's data frame, return a vector with
  # N, mean, and sd
  datac <- ddply(data, groupvars, .drop = .drop,
                 .fun = function(xx, col) {
                   c(N    = length2(xx[[col]], na.rm = na.rm),
                     mean = mean(xx[[col]], na.rm = na.rm),
                     sd   = sd(xx[[col]], na.rm = na.rm))
                 },
                 measurevar)
  # Rename the "mean" column
  datac <- rename(datac, c("mean" = measurevar))
  # Calculate standard error of the mean
  datac$se <- datac$sd / sqrt(datac$N)
  # Confidence interval multiplier for standard error
  # e.g., if conf.interval is .95, use .975 (above/below), and use df = N - 1
  ciMult <- qt(conf.interval / 2 + .5, datac$N - 1)
  datac$ci <- datac$se * ciMult
  return(datac)
}
The code in its entirety is a bit too large to include, but I hope this clarifies what I'm trying to do.
Well, thanks to florian's comment, I think I might have found a solution myself. I'll present it here, but leave the question open, as there are probably far neater ways of doing it.
I collected the plots (which ggplot creates as list objects) into a list, and likewise the tables:
plotList <- list(p_A1, p_A2, p_A3...)
tableList <- list(s_A1, s_A2, s_A3...)
I then used subset on my lookup table to get the matching id, which selects the right plot and table from the lists:
output$thePlot <- renderPlot({
  plotValue <- subset(letaLista, letaLista$Factor == input$Factor &
                        letaLista$Variabel == input$Variabel &
                        letaLista$rest_alla == input$DataSource)
  plotList[[as.integer(plotValue[1, 4])]]
})
output$theTable <- renderTable({
  plotValue <- subset(letaLista, letaLista$Factor == input$Factor &
                        letaLista$Variabel == input$Variabel &
                        letaLista$rest_alla == input$DataSource)
  tableList[[as.integer(plotValue[1, 4])]]
})
Note the double brackets: plotList[[i]] extracts the object itself, whereas plotList[i] would return a one-element sublist.
I have a large grouped chart and the y-axis will not format properly. I have tried getting rid of zeros and double-checking for syntax typos, and I cannot seem to figure it out. Basically, the y-axis ticks are just 0 0 1 1.
<script>
window.onload = function ()
{
var data = [ ['18','47','11'] , ['10','4','1'] , ['0','0','1'] , ['0','2','0'] , ['8','9','0'] , ['6','6','0'] , ['5','3','1'] , ['2','7','0'] , ['9','5','1'] , ['5','6','0'] , ['6','5','0'] , ['4','5','0'] , ['3','2','2'] , ['3','2','0'] , ['0','1','0'] , ['1','0','0'] ] ;
var bar = new RGraph.Bar('cvs', data)
.Set('labels', ['JH', '166', 'JC', 'DR', 'KL', '206', '499', '181', '127', '01', '211', 'RK', '111', '46', '485', '65'])
.Set('colors', ['Gradient(#99f:#27afe9:#058DC7:#058DC7)', 'Gradient(#94f776:#50B332:#B1E59F)', 'Gradient(#fe783e:#EC561B:#F59F7D)'])
.Set('hmargin', 8)
.Set('strokestyle', 'white')
.Set('linewidth', 1)
.Set('shadow', true)
.Set('shadow.color', '#ccc')
.Set('shadow.offsetx', 0)
.Set('shadow.offsety', 0)
.Set('shadow.blur', 10)
.Draw();
}
</script>
That's because putting single quotes around your numbers turns them into strings, which the chart effectively treats as zeros, so you end up charting an array of zeros. RGraph then generates what it considers an appropriate scale, whose maximum is 1: 0.2, 0.4, 0.6, 0.8, 1. By default no decimals are shown, so the tick labels get rounded, producing 0, 0, 1, 1, 1. Remove the quotes (or convert the values with Number() or parseInt()) so that the data are actual numbers.