I'm reading some (apparently) large grib files using xarray. I say 'apparently' because they're ~100MB each, which doesn't seem too big to me. However, running
import xarray as xr
ds = xr.open_dataset("gribfile.grib", engine="cfgrib")
takes a good 5-10 minutes. Worse, reading one of these takes up almost 4GB of RAM - something that surprises me given the lazy loading that xarray is supposed to do, not least because this is 40-odd times the size of the original file!
This reading time and RAM usage seem excessive, and they don't scale to the 24 files I have to read.
I've tried using dask and xr.open_mfdataset, but this doesn't seem to help when the individual files are so large. Any suggestions?
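For concreteness, a sketch of the kind of call described (the file pattern and chunk sizes here are illustrative, not the actual ones):
import xarray as xr

# open all files lazily, letting dask chunk each one (pattern and chunk size are illustrative)
ds = xr.open_mfdataset("*.grib", engine="cfgrib", chunks={"time": 10})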
Addendum:
The dataset looks like this once opened:
<xarray.Dataset>
Dimensions: (latitude: 10, longitude: 10, number: 50, step: 53, time: 45)
Coordinates:
* number (number) int64 1 2 3 4 5 6 7 8 9 ... 42 43 44 45 46 47 48 49 50
* time (time) datetime64[ns] 2011-01-02 2011-01-04 ... 2011-03-31
* step (step) timedelta64[ns] 0 days 00:00:00 ... 7 days 00:00:00
surface int64 0
* latitude (latitude) float64 56.0 55.0 54.0 53.0 ... 50.0 49.0 48.0 47.0
* longitude (longitude) float64 6.0 7.0 8.0 9.0 10.0 ... 12.0 13.0 14.0 15.0
valid_time (time, step) datetime64[ns] 2011-01-02 ... 2011-04-07
Data variables:
u100 (number, time, step, latitude, longitude) float32 6.389208 ... 1.9880934
v100 (number, time, step, latitude, longitude) float32 -13.548858 ... -3.5112982
Attributes:
GRIB_edition: 1
GRIB_centre: ecmf
GRIB_centreDescription: European Centre for Medium-Range Weather Forecasts
GRIB_subCentre: 0
history: GRIB to CDM+CF via cfgrib-0.9.4.2/ecCodes-2.9.2 ...
I've temporarily got around the issue by reading in the grib files one by one and writing them to disk as netCDF; xarray then handles the netCDF files as expected. Obviously it would be nice not to have to do this because it takes ages - I've only done 4 so far.
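For concreteness, a sketch of that one-by-one conversion (file names are illustrative, and it assumes a netCDF backend is installed):
import glob
import xarray as xr

# convert each grib file to netcdf once, so later reads are fast
for path in glob.glob("*.grib"):
    ds = xr.open_dataset(path, engine="cfgrib")
    ds.to_netcdf(path.replace(".grib", ".nc"))
    ds.close()

# the converted files then open lazily as expected
ds = xr.open_mfdataset("*.nc")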
I'm looking to improve my code efficiency by turning my code into arrays and loops. The data I'm working with starts off like this:
ID Mapping Asset Fixed Performing Payment2017 Payment2018 Payment2019 Payment2020
1 Loan1 1 1 1 90 30 30 30
2 Loan1 1 1 0 80 20 40 20
3 Loan1 1 0 1 60 40 10 10
4 Loan1 1 0 0 120 60 30 30
5 Loan2 ... ... ... ... ... ... ...
So, for each ID (essentially the data sorted by Mapping, Asset, Fixed and then Performing) I'm looking to build a profile for the Payment Scheme.
The Payment Vector for the first ID looks like this:
PaymentVector1 PaymentVector2 PaymentVector3 PaymentVector4
1 0.33 0.33 0.33
It is represented by the formula
PaymentVector(i) = Payment(i) / Payment(1)
The above is fine to create in an array; example code can be given if you wish.
Next, work under the assumption that every payment made is replaced, i.e. when 30 is paid in 2018 it must be replaced, and so on.
I'm looking to make a profile that shows the outflows (and, for illustration but not required in the code, the inflows in brackets) for the movement of the payments. For ID=1:
Payment2017 Payment2018 Payment2019 Payment2020
17 (+90) -30 -30 -30
18 N/A (+30) -10 -10
19 N/A N/A (+40) -13.3
20 N/A N/A N/A (+53.3)
So, looking forwards, the rows can be thought of as the current year and the columns as the years still to come.
Hence, in year 2019, what is to be paid in 2017 and 2018 is N/A because those payments are in the past and cannot be paid now.
As for year 2018: looking at what has to be paid in 2019, you have to pay one-third of the money you have now, so -10.
I've been working through this dataset row by row to build the array, but there surely has to be a quicker way using arrays:
The code I've used so far looks like this:
Data Want;
Set Have;
Array Vintage(2017:2020) Vintage2017-Vintage2020;
Array PaymentSchedule(2017:2020) PaymentSchedule2017-PaymentSchedule2020;
Array PaymentVector(2017:2020) PaymentVector2017-PaymentVector2020;
Array PaymentVolume(2017:2020) PaymentVolume2017-PaymentVolume2020;
do i=1 to 4;
PaymentVector(i)=PaymentSchedule(i)/PaymentSchedule(1);
end;
I'll add code tomorrow... but the code doesn't work regardless.
data have;
input
ID Mapping $ Asset Fixed Performing Payment2017 Payment2018 Payment2019 Payment2020; datalines;
1 Loan1 1 1 1 90 30 30 30
2 Loan1 1 1 0 80 20 40 20
3 Loan1 1 0 1 60 40 10 10
4 Loan1 1 0 0 120 60 30 30
;
run;
data want(keep=id payment: fraction:);
set have;
array p payment:;
array fraction(4); * track constant fraction determined at start of profile;
array out(4); * track outlay for ith iteration;
* compute constant (over iterations) fraction for row;
do i = dim(p) to 1 by -1;
fraction(i) = p(i) / p(1);
end;
* reset to missing to allow for sum statement, which is <variable> + <expression>;
call missing(of out(*));
out(1) = p(1);
do iter = 1 to 4;
p(iter) = out(iter);
do i = iter+1 to dim(p);
p(i) = -fraction(i) * p(iter);
out(i) + (-p(i)); * <--- compute next iteration outlay with ye olde sum statement ;
end;
output;
p(iter) = .;
end;
format fract: best4. payment: 7.2;
run;
You've indexed your arrays with 2017:2020 but then try to use them with a 1 to 4 index. That won't work; you need to be consistent.
Array PaymentSchedule(2017:2020) PaymentSchedule2017-PaymentSchedule2020;
Array PaymentVector(2017:2020) PaymentVector2017-PaymentVector2020;
do i=2017 to 2020;
PaymentVector(i)=PaymentSchedule(i)/PaymentSchedule(2017);
end;
I have the following example dataset, which consists of the number of fish caught per check of a net. The nets are not checked at uniform intervals. The day of the check is denoted in Julian days, as well as the number of days the net had been fishing since last checked (or since its deployment, in the case of the first check).
http://textuploader.com/9ybp
Site_Number Check_Day_Julian Set_Duration_Days Fish_Caught
2 5 3 100
2 10 5 70
2 12 2 65
2 15 3 22
100 4 3 45
100 10 6 20
100 18 8 8
450 10 10 10
450 14 4 4
In any case, I would like to turn the raw data above into the following format:
http://textuploader.com/9y3t
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
2 0 0 100 100 100 70 70 70 70 70 65 65 22 22 22 0 0 0
100 0 45 45 45 20 20 20 20 20 20 8 8 8 8 8 8 8 8
450 10 10 10 10 10 10 10 10 10 10 4 4 4 4 0 0 0 0
This is a matrix which assigns the # of fish caught during the period to EACH of the days that were within that period. The columns of the matrix are Julian days, the rows are site numbers.
I have tried to do this with some matrix functions, but I have had much difficulty trying to populate all the days that fall within a time period but do not necessarily have a row of data of their own.
I had posted my small bit of code here, but upon reflection, my approach is quite archaic and a bit off point. Can anyone suggest a method to convert the data into the matrix provided? I've been scratching my head and googling all day but now I am stumped.
Cheers,
C
Two answers, the second one is faster but a bit low level.
Solution #1:
library(IRanges)
with(d, {
ir <- IRanges(end=Check_Day_Julian, width=Set_Duration_Days)
cov <- coverage(split(ir, Site_Number),
weight=split(Fish_Caught, Site_Number),
width=max(end(ir)))
do.call(rbind, lapply(cov, as.vector))
})
Solution #2:
with(d, {
ir <- IRanges(end=Check_Day_Julian, width=Set_Duration_Days)
site <- factor(Site_Number, unique(Site_Number))
m <- matrix(0, length(levels(site)), max(end(ir)))
ind <- cbind(rep(site, width(ir)), as.integer(ir))
m[ind] <- rep(Fish_Caught, width(ir))
m
})
I don't see a super obvious matrix transformation here. This is all I've got, assuming the raw data is in a data.frame called dd:
dd$Site_Number<-factor(dd$Site_Number)
mm<-matrix(0, nrow=nlevels(dd$Site_Number), ncol=18)
for(i in 1:nrow(dd)) {
mm[as.numeric(dd[i,1]), (dd[i,2]-dd[i,3]+1):dd[i,2]] <- dd[i,4]  # days covered run from (check day - duration + 1) to the check day
}
mm
I have a very large dataset array with over a million values that looks like this:
Month Day Year Hour Min Second Line1 Line2 Power Dt
7 8 2013 0 1 54 1.91 4.98 826.8 0
7 8 2013 0 0 9 1.93 3.71 676.8 0
7 8 2013 0 1 15 1.92 5.02 832.8 0
7 8 2013 0 1 21 1.91 5.01 830.4 0
and so on.
When the measurement of seconds got to 60 it would start over again at 0, hence why the first number is bigger. I need to fill the delta-t column (Dt) by taking the current row's seconds and subtracting the previous row's seconds, correcting for negative values. This operation cannot be performed in a loop, as that would take ages to complete; it needs to be a simple, one-shot, vectorised subtraction.
You can try the diff command to generate such results. It's very fast and should work without any for loop.
HTH
% datenum expects [year month day hour minute second], so reorder the columns first
Dt=diff(datenum(A(:,[3 1 2 4 5 6])))*60*60*24;
This gives the delta in seconds, but I'm not sure what you want your correction for negative differences to be. Could you give an example of the expected output?
Note that Dt will be one entry shorter than A, so you may have to pad it.
You can remove the negative values (I think) with the command
Dt(Dt<0)=Dt(Dt<0)+60;
If you need to pad the Dt vector so that it is the same length as the data set, try
Dt=[Dt;0];
I am new to R and I am trying to read in a data set. The data set is here:
http://petitlien.fr/myfiles
(The above link expands to a GMX file-storage folder; click on Guest access to retrieve the file.)
The file named mydata.log has 32 entries with no header and it consists of 2 columns which are delimited by spaces.
I am trying the powerful scan command:
test.frame<-scan(file="mydata.log",sep= "", nlines=32,blank.lines.skip=TRUE)
The above just read the first 3 rows:
head(test.frame)
[1] 0.0000 0.0000 144.3210 0.3400 159.4070 0.8925
I have also tried read.table:
test.frame<-read.table(file="mydata.log",sep= "", nrows=32,blank.lines.skip=TRUE)
This one reads the first 6 lines only as shown below:
names(test.frame)
[1] "V1" "V2"
> head(test.frame)
V1 V2
1 0.000 0.0000
2 144.321 0.3400
3 159.407 0.8925
4 198.413 0.9450
5 222.557 0.9975
6 235.464 1.0500
Does someone know how to read this data set properly?
A related question: Can I control the number of significant digits or perhaps decimal places in the data being read in?
Thanks a lot...
This line of your code works perfectly:
test.frame<-read.table(file="mydata.log",sep= "", nrows=32,blank.lines.skip=TRUE)
The reason you only get 6 lines in your output is that you are using head. To view all lines, just enter the name of your object:
> test.frame
V1 V2
1 0.000 0.0000
2 144.321 0.3400
3 159.407 0.8925
4 198.413 0.9450
5 222.557 0.9975
6 235.464 1.0500
7 296.918 1.1025
8 346.773 1.1550
9 442.955 1.2075
10 694.879 1.2600
11 892.436 1.3125
12 1492.970 1.3650
13 2916.960 1.4175
14 3596.060 1.4700
15 5278.950 1.5225
16 7480.730 1.5750
17 12259.800 1.6275
18 14032.600 1.6800
19 19565.600 1.7325
20 31427.700 1.7850
21 58221.400 1.8375
22 92283.900 1.9900
23 165601.000 1.9425
24 165703.000 1.9950
25 213925.000 2.8750
26 260381.000 2.1000
27 312701.000 2.1525
28 370853.000 2.2050
29 479303.000 2.2575
30 487265.000 2.3100
31 545225.000 2.3625
32 703186.000 2.4150
Here is an easy way to see how many rows you have (useful when you have many observations):
nrow(test.frame)
[1] 32
As for the number of digits, see the round command. To look at the documentation for a command, enter a ? and then the command, in this case a function: ?round
#note that you do not have to put "digits=2", you can just put "2", but this way is clearer
> rounded_test.frame <- round(test.frame, digits=2)
> rounded_test.frame
V1 V2
1 0.00 0.00
2 144.32 0.34
3 159.41 0.89
4 198.41 0.94
5 222.56 1.00
6 235.46 1.05
7 296.92 1.10
8 346.77 1.16
9 442.95 1.21
10 694.88 1.26
11 892.44 1.31
12 1492.97 1.36
13 2916.96 1.42
14 3596.06 1.47
15 5278.95 1.52
16 7480.73 1.57
17 12259.80 1.63
18 14032.60 1.68
19 19565.60 1.73
20 31427.70 1.78
21 58221.40 1.84
22 92283.90 1.99
23 165601.00 1.94
24 165703.00 2.00
25 213925.00 2.88
26 260381.00 2.10
27 312701.00 2.15
28 370853.00 2.21
29 479303.00 2.26
30 487265.00 2.31
31 545225.00 2.36
32 703186.00 2.42
Note in the above I created a new object instead of replacing the current one. If you want to replace the current one and lose the data forever (until you reload the dataset of course!), then you can use this line instead:
test.frame <- round(test.frame, digits=2)
If you don't really want to compress your numbers, you might just be interested in viewing the rounded numbers. You can do this with the following command:
print(test.frame,digits=2)
Instead of nrow() as suggested, I would recommend str() ("structure"), which gives you more useful information about your data set (class of variables, etc.). It's also a bit less cryptic... :)
I want to blur my image using the native Gaussian blur formula. I read the Wikipedia article, but I am not sure how to implement this.
How do I use the formula to decide weights?
I do not want to use any built-in functions like those MATLAB has.
Writing a naive Gaussian blur is actually pretty easy. It is done in exactly the same way as any other convolution filter. The only difference between a box filter and a Gaussian filter is the matrix you use.
Imagine you have an image defined as follows:
0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
50 51 52 53 54 55 56 57 58 59
60 61 62 63 64 65 66 67 68 69
70 71 72 73 74 75 76 77 78 79
80 81 82 83 84 85 86 87 88 89
90 91 92 93 94 95 96 97 98 99
A 3x3 box filter matrix is defined as follows:
0.111 0.111 0.111
0.111 0.111 0.111
0.111 0.111 0.111
To apply the filter you would do the following:
For pixel 11 you would need to load pixels 0, 1, 2, 10, 11, 12, 20, 21, 22.
You would then multiply pixel 0 by the upper-left entry of the 3x3 filter, pixel 1 by the top middle, pixel 2 by the top right, pixel 10 by the middle left, and so on.
Then add them all together and write the result to pixel 11. As you can see, pixel 11 is now the average of itself and the surrounding pixels.
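A small numpy sketch of that multiply-and-sum step for pixel 11, using the example image and box filter above (numpy is only for illustration; the idea is language-agnostic):
import numpy as np

image = np.arange(100, dtype=float).reshape(10, 10)   # the 0..99 example image
box = np.full((3, 3), 1.0 / 9.0)                      # the 3x3 box filter (each weight ~0.111)

neighbourhood = image[0:3, 0:3]             # pixels 0, 1, 2, 10, 11, 12, 20, 21, 22
new_pixel_11 = np.sum(neighbourhood * box)  # multiply element-wise, then add all together
print(new_pixel_11)                         # 11.0 - the average of pixel 11 and its neighbours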
Edge cases do get a bit more complex. What values do you use for the values of the edge of the texture? One way can be to wrap round to the other side. This looks good for an image that is later tiled. Another way is to push the pixel into the surrounding places.
So for upper left you might place the samples as follows:
0 0 1
0 0 1
10 10 11
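If it helps, pushing the edge pixels into the surrounding places is just edge replication, which a small numpy sketch can illustrate (purely illustrative):
import numpy as np

image = np.arange(100, dtype=float).reshape(10, 10)
padded = np.pad(image, 1, mode="edge")   # replicate the border pixels outwards by one
print(padded[0:3, 0:3])                  # the samples used for the upper-left pixel:
                                         # 0 0 1 / 0 0 1 / 10 10 11, as laid out above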
I hope you can see how this can easily be extended to large filter kernels (ie 5x5 or 9x9 etc).
The difference between a Gaussian filter and a box filter is the numbers that go in the matrix. A Gaussian filter uses a Gaussian distribution across the row and column.
e.g. for a row defined arbitrarily as (this isn't a Gaussian, but probably not far off):
0.1 0.8 0.1
the first column of the kernel would be that same row with each value multiplied by its first item, 0.1:
0.01
0.08
0.01
The second column would be the same but the values would be multiplied by the 0.8 in the row above (and so on).
0.01 0.08 0.01
0.08 0.64 0.08
0.01 0.08 0.01
The result of adding all of the above together should equal 1. The difference between the above filter and the original box filter would be that the end pixel written would have a much heavier weighting towards the central pixel (ie the one that is in that position already). The blur occurs because the surrounding pixels do blur into that pixel, though not as much. Using this sort of filter you get a blur but one that doesn't destroy as much of the high frequency (ie rapid changing of colour from pixel to pixel) information.
These sort of filters can do lots of interesting things. You can do an edge detect using this sort of filter by subtracting the surrounding pixels from the current pixel. This will leave only the really big changes in colour (high frequencies) behind.
Edit: A 5x5 filter kernel is defined in exactly the same way as above.
e.g. if your row is 0.1 0.2 0.4 0.2 0.1, then multiplying each value by the first item gives the first column, multiplying each by the second item gives the second column, and so on; you'll end up with a filter of
0.01 0.02 0.04 0.02 0.01
0.02 0.04 0.08 0.04 0.02
0.04 0.08 0.16 0.08 0.04
0.02 0.04 0.08 0.04 0.02
0.01 0.02 0.04 0.02 0.01
Taking some arbitrary positions, you can see that position 0, 0 is simply 0.1 * 0.1. Position 0, 2 is 0.1 * 0.4, position 2, 2 is 0.4 * 0.4 and position 1, 2 is 0.2 * 0.4.
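As a quick check, here is the same outer-product construction in numpy (illustrative only, using the arbitrary row from above):
import numpy as np

row = np.array([0.1, 0.2, 0.4, 0.2, 0.1])  # the arbitrary 1D row (sums to 1)
kernel = np.outer(row, row)                # kernel[i, j] = row[i] * row[j]
print(kernel)                              # matches the 5x5 matrix above
print(kernel.sum())                        # 1.0, since the 1D row sums to 1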
I hope that gives you a good enough explanation.
Here's the pseudo-code for the code I used in C# to calculate the kernel. I do not dare say that I treat the end-conditions correctly, though:
double[] kernel = new double[radius * 2 + 1];
double twoRadiusSquaredRecip = 1.0 / (2.0 * radius * radius);
double sqrtTwoPiTimesRadiusRecip = 1.0 / (Math.Sqrt(2.0 * Math.PI) * radius);
double radiusModifier = 1.0;
int r = -radius;
for (int i = 0; i < kernel.Length; i++)
{
double x = r * radiusModifier;
x *= x;
kernel[i] = sqrtTwoPiTimesRadiusRecip * Math.Exp(-x * twoRadiusSquaredRecip);
r++;
}
double div = kernel.Sum(); // normalise so the weights sum to 1 (kernel.Sum() assumes using System.Linq)
for (int i = 0; i < kernel.Length; i++)
{
kernel[i] /= div;
}
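For reference, the loop above evaluates the 1D Gaussian with the radius doing double duty as the standard deviation \(\sigma\) (this snippet's choice, not a requirement), before the final normalisation:
\[ G(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-x^2 / (2\sigma^2)}, \qquad x = -r, \dots, r, \quad \sigma = r \]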
Hope this helps.
To use the filter kernel discussed in the Wikipedia article you need to implement (discrete) convolution. The idea is that you have a small matrix of values (the kernel), you move this kernel from pixel to pixel in the image (i.e. so that the center of the matrix is on the pixel), multiply the matrix elements with the overlapped image elements, sum all the values in the result and replace the old pixel value with this sum.
Gaussian blur can be separated into two 1D convolutions (one vertical and one horizontal) instead of a 2D convolution, which also speeds things up a bit.
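A small numpy/scipy sketch of that separability claim (scipy.ndimage is an assumption here; any convolution routine would do):
import numpy as np
from scipy.ndimage import convolve

rng = np.random.default_rng(0)
image = rng.random((32, 32))

row = np.array([0.25, 0.5, 0.25])  # a small symmetric 1D kernel that sums to 1
kernel_2d = np.outer(row, row)     # the equivalent 2D kernel

# full 2D convolution vs. two 1D passes (horizontal, then vertical)
full_2d = convolve(image, kernel_2d, mode="nearest")
two_passes = convolve(convolve(image, row[None, :], mode="nearest"),
                      row[:, None], mode="nearest")

print(np.allclose(full_2d, two_passes))  # True - same result, fewer multiplications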
I am not clear whether you want to restrict this to certain technologies, but if not, SVG (Scalable Vector Graphics) has an implementation of Gaussian blur. I believe it applies to all primitives, including pixels. SVG has the advantage of being an open standard that is widely implemented.
Well, the Gaussian kernel is a separable kernel.
Hence all you need is a function which supports separable 2D convolution, like ImageConvolutionSeparableKernel().
Once you have it, all that's needed is a wrapper to generate the 1D Gaussian kernel and send it to the function, as done in ImageConvolutionGaussianKernel().
The code is a straightforward C implementation of 2D image convolution, accelerated by SIMD (SSE) and multi-threading (OpenMP).
The whole project is given by Image Convolution - GitHub.