I need to apply the Mann-Kendall trend test in R to a large number (about 1 million) of time series of different lengths. I've already created a script that takes the time series (practically a list of numbers) from all the files in a certain directory and then outputs the results to a .txt file.
The problem is that with about 1 million time series, creating 1 million files isn't exactly nice. So I thought that putting all the time series in a single .txt file (separated by some symbol, like "#" for example) could be more manageable. So I have a file like this:
1
2
4
5
4
#
2
13
34
#
...
I'm wondering: is it possible to extract each such series (between two "#" markers) in R and then apply the analysis?
EDIT
Following @acesnap's hints, I'm using this code:
library(Kendall)
a <- read.table("to_r.txt")
numData <- 1017135
for (i in 1:numData){
  s1 <- subset(a, a$V1 == i)   # pull out the i-th series
  m <- MannKendall(s1$V2)      # run the test on its values
  cat(m[[1]], " ", m[[2]], " ", m[[3]], " ", m[[4]], " ", m[[5]], "\n",
      file = "monotonic_trend_checking.txt", append = TRUE)
}
This approach works, but the problem is that the computation takes ages. Can you suggest a faster approach?
If you were to number the datasets as they go into the larger file, it would make things easier. You could then use a for loop and subsetting.
setNum data
1 1
1 2
1 4
1 5
1 4
2 2
2 13
2 34
... ...
Then do something like:
answers1 <- c()
numOfDataSets <- 1000000
for (i in 1:numOfDataSets){
  ss1 <- subset(bigData, bigData$setNum == i)  ## creates a subset of each data set
  ans1 <- mannKendallTrendTest(ss1$data)       ## gets the answer from the test (placeholder name, e.g. Kendall::MannKendall)
  answers1 <- c(answers1, ans1)                ## appends the answer to the results vector
  print(paste(i, " | ", ans1, "", sep = ""))   ## prints which data set is in use
  flush.console()                              ## prints to console now instead of waiting
}
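If the loop above still proves too slow (as the question's EDIT reports), the usual culprits are subset() scanning the whole table once per series and answers1 growing one element at a time, plus reopening the output file on every iteration. A minimal sketch of a faster variant, assuming the two-column setNum/data layout above and the Kendall package's MannKendall in place of the placeholder function:
library(Kendall)

# split the big table once into a list with one element per series
series <- split(bigData$data, bigData$setNum)

# run the test on every series; unlist() flattens each result to its five statistics
res <- t(vapply(series, function(s) unlist(MannKendall(s)), numeric(5)))

# write all results in a single call instead of appending line by line
write.table(res, "monotonic_trend_checking.txt",
            col.names = FALSE, row.names = FALSE)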
Here is perhaps a more elegant solution:
# Read in your data
x <- c('1','2','3','4','5','#','4','5','5','6','#','3','6','23','#')
# Build a list of the indices you want to split at:
ind <- c(0, which(x == '#'))
# Use those indices to split the vector into a list
lapply(seq_len(length(ind) - 1),
       function(y) as.numeric(x[(ind[y] + 1):(ind[y + 1] - 1)]))
Note that for this code to work, you must have a '#' character at the very end of the file.
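From there, applying the test to each extracted series is just a matter of mapping over the list. A small sketch, assuming the Kendall package and the x and ind objects built above:
library(Kendall)

# name the list produced above, then run the test on each series
series  <- lapply(seq_len(length(ind) - 1),
                  function(y) as.numeric(x[(ind[y] + 1):(ind[y + 1] - 1)]))
results <- lapply(series, MannKendall)
results[[1]]  # tau and two-sided p-value for the first series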
Since I'm new to Julia, I sometimes have problems that may be obvious to you.
This time I do not know how to read a certain piece of data from a file, i.e.:
...
stencil: half/bin/3d/newton
bin: intel
Per MPI rank memory allocation (min/avg/max) = 12.41 | 12.5 | 12.6 Mbytes
Step TotEng PotEng Temp Press Pxx Pyy Pzz Density Lx Ly Lz c_Tr
200000 261360.25 261349.16 413.63193 2032.9855 -8486.073 4108.1669
200010 261360.45 261349.36 413.53903 22.925126 -29.762605 132.03134
200020 261360.25 261349.17 413.46495 20.373081 -30.088775 129.6742
Loop
What I want is to read this file from the third row after "Step" (the one which starts with 200010; this can be a different number, since I have many files that start at the same place but with a different integer) until the program reaches "Loop". Could you help me, please? I'm stuck and don't know how to combine the different options Julia offers to do this.
Here is one solution. It uses eachline to iterate over the lines. The loop skips the header and any empty lines, and breaks when the Loop line is found. The lines to keep are returned in a vector. You may have to modify the detection of the header and/or the end token depending on the exact file format you have.
julia> function f(file)
           result = String[]
           for line in eachline(file)
               if startswith(line, "Step") || isempty(line)
                   continue  # skip the header and any empty lines
               elseif startswith(line, "Loop")
                   break     # stop reading completely
               end
               push!(result, line)
           end
           return result
       end
f (generic function with 2 methods)
julia> f("file.txt")
3-element Array{String,1}:
"200000 261360.25 261349.16 413.63193 2032.9855 -8486.073 4108.1669"
"200010 261360.45 261349.36 413.53903 22.925126 -29.762605 132.03134"
"200020 261360.25 261349.17 413.46495 20.373081 -30.088775 129.6742"
In one of the columns of my df, the values in each cell are reported as an array (e.g. [1,2,3,4,8]) as opposed to being single numbers. This is because the question was a "select all that apply" question.
However, when I try to count how many of each number occurs, I am not able to do so because these numbers are nested within a list. How can I extract the numbers so that I am able to count them?
For example:
row 1: [1,2,3,4,8]
row 2: [3]
row 3: [1,2,3,4]
I want to be able to run a statement such as: nrow(df[df$column == 1,]) that will count all of the occurrences of the number 1. So, in this case, the output would be 2, but right now it says 0.
Here is a method using base R:
# set up data (stringsAsFactors = FALSE so substr/nchar see characters, not factors)
df <- data.frame(data = c('[1,2,3,4,8]', '[3]', '[1,2,3,4]'),
                 stringsAsFactors = FALSE)
# strip off starting and ending brackets
stripped <- substr(df$data, 2, nchar(df$data)-1)
# split each row by comma
split <- strsplit(stripped, ',')
# flatten the list of numbers to a vector
numbers <- unlist(split)
# view table of frequency of each number
table(numbers)
output:
numbers
1 2 3 4 8
2 2 3 2 1
To get the count of a single number:
# view count of a single number
length(which(numbers == '8'))
output:
[1] 1
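Note that table() and the length(which(...)) call above count total occurrences across all rows. If you instead want the number of rows containing a given value (the nrow(df[df$column == 1,]) behaviour the question asks for), here is a small sketch building on the split list from above; countRows is just a hypothetical helper name:
# count how many rows contain the value at least once
countRows <- function(split_list, value) {
  sum(vapply(split_list, function(r) value %in% r, logical(1)))
}

countRows(split, '1')  # 2: rows 1 and 3 both contain the number 1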
I have two variables used in matplotlib: one of them holds measured data, and the second is a time scale from 0 to 300 s. What I need to do is make a vertical list of them (both together, next to each other), to see at what time a certain measurement took place.
Use zip (see the documentation). The shorter of the two lists wins; the items of the longer one that have no partner get discarded:
l1 = ["1", "2", "3", "4", "5"]
l2 = ["a", "aa", "aaa", "aaaa", "aaaa", "discard", "discard2"]
l3 = list(zip(l1, l2))  # pairs up same indexes in both lists as tuples (l1[i], l2[i]);
                        # wrapped in list() so it can be iterated more than once in Python 3
for tup in l3:
    print(tup[0], " ", tup[1])
output:
1 a
2 aa
3 aaa
4 aaaa
5 aaaa
The "vertical list" could already be what I called l3 here - its a list of 2-tuples containing (in your case: (time, value) )
Save to file:
with open("demodata.txt", "w") as f:
    for tup in l3:
        f.write(tup[0] + " " + tup[1] + "\n")  # write() takes a single string, so concatenate
I am new to the programming world and need help with loading files into R and creating a matrix from them. I can import an individual file and create an individual matrix out of it. How do I do this for multiple files? I have 21 files that each contain 100 rows and 100 columns, and I need to import each file and put everything into a single array.
I would use list.files to list my files by pattern,
lapply to loop through the list of files and create a list of data.frames with read.csv, and
rbindlist to bind them all into one big data.table (which can then be converted to a matrix).
library(data.table)
temp <- list.files(pattern = "*.csv")
named.list <- lapply(temp, read.csv)
files.matrix <- rbindlist(named.list)
It's not exactly clear what structure you want. You can choose between a 2100x100 matrix, a 2100x100 dataframe, a 100x100x21 array, or a list with 21 entries, each of which is 100x100. (In R, an array is the term one would use for a regular 3-dimensional structure with columns all of the same type. And then, of course, there is agstudy's suggestion that you use a data.table.)
In a sense, agstudy's code already gives you the 21-item list of dataframes, each of dimension 100x100:
temp = list.files(pattern="*.csv")
named.list <- lapply(temp, read.csv)
To get the 100 x 100 x 21 array, continue with this (note along = 3; abind's default is to bind matrices along their last existing dimension, which would give a 100 x 2100 matrix instead):
require(abind)
arr <- abind(named.list, along = 3)
To get the 2100 x 100 dataframe, continue instead with:
longdf <- do.call(rbind, named.list)
To get the 2100 x 100 matrix, continue from the last line with:
longmtx <- data.matrix(longdf)
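If you would rather avoid the extra abind dependency, a minimal base R sketch (assuming all 21 files really are numeric and 100 x 100) builds the same 3-dimensional array directly from named.list:
# stack the 21 matrices into a 100 x 100 x 21 array;
# unlist() concatenates them column-major, which array() fills in the same order
arr2 <- array(unlist(lapply(named.list, data.matrix)),
              dim = c(100, 100, 21))

dim(arr2)    # 100 100 21
arr2[, , 1]  # the first file as a 100 x 100 matrix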
I have the following data:
a=[3 1 6]';
b=[2 5 2]';
c={'ab' 'bc' 'cd'}';
I now want to make a file which looks like this (the delimiter is tab):
ab 3 2
bc 1 5
cd 6 2
My solution (with a loop) is:
a = [3 1 6]';
b = [2 5 2]';
c = {'ab' 'bc' 'cd'}';
c = cell2mat(c);
fid = fopen('filename', 'w');
for i = 1:numel(b)
    fprintf(fid, '%s\t%u\t%u\n', c(i,:), a(i), b(i));
end
fclose(fid);
Is there a way to do this without a loop, and/or a possibility to write cell arrays directly to a file?
Thanks.
How about this:
% A cell array holding all data (note the transpose)
data = cat(2, c, num2cell(a), num2cell(b))';

% Write data to a file
fid = fopen('example.txt', 'w');
fprintf(fid, '%s\t%u\t%u\n', data{:});
fclose(fid);
This will be wasteful of memory if your datasets get large (it is probably better to leave them as separate variables and loop), but it seems to work.