I'm pretty new at coding with Matlab and I'm struggling with an issue I can't fix.
Basically, I have half-hourly data (48 samples per day) covering 17 days (17 x 48 = 816 elements).
I have all my data in one big matrix (816 x 31), and I need to separate "daytime data" from "nighttime data".
The elements of the column array (816 elements) I need to process are the following (for the first day):
night_data= bigmatrix([1:8,46:48],27);
day_data= bigmatrix([22:32],27)
but I have to make the same "selection" for each day, i.e. the next day would be
night_data_2 = bigmatrix([49:56,94:96],27)
day_data_2 = bigmatrix([70:80],27)
and so on...
How can I do this? Should I use a loop? Is there an indexing function I don't know about that could help me?
Thank you in advance.
You can reshape your data so that each column represents one day. That would give you a 48 x 17 x 31 matrix:
dailymatrix = reshape(bigmatrix, 48, 17, 31);
Now, to access the data you've got one new subscript. Your first night/day data would change to
night_data = dailymatrix([1:8, 46:48], 1, 27);   % second subscript 1 = 1st day
day_data = dailymatrix([22:32], 1, 27);
The second day's data would be:
night_data = dailymatrix([1:8, 46:48], 2, 27);   % second subscript 2 = 2nd day
day_data = dailymatrix([22:32], 2, 27);
To get all 17 days' worth of data,
night_data = dailymatrix([1:8, 46:48], :, 27);
day_data = dailymatrix([22:32], :, 27);
Since the data is in the same timeslots each day, you never have to change the first subscript.
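For readers working in Python rather than MATLAB, the same reshape idea can be sketched with NumPy. This is only an illustration under assumptions: bigmatrix is filled with placeholder values, and because NumPy is row-major and zero-based, the day axis comes first here and every index is shifted down by one relative to the MATLAB code.

```python
import numpy as np

SLOTS_PER_DAY = 48
N_DAYS = 17
N_COLS = 31

# Placeholder data standing in for the real 816 x 31 matrix (assumption).
bigmatrix = np.arange(N_DAYS * SLOTS_PER_DAY * N_COLS, dtype=float)
bigmatrix = bigmatrix.reshape(N_DAYS * SLOTS_PER_DAY, N_COLS)

# Row r of bigmatrix is (day, slot) = divmod(r, 48), so in row-major order
# the day axis comes first: daily[day, slot, column].
daily = bigmatrix.reshape(N_DAYS, SLOTS_PER_DAY, N_COLS)

night_slots = np.r_[0:8, 45:48]   # MATLAB [1:8, 46:48]
day_slots = np.arange(21, 32)     # MATLAB 22:32
col = 26                          # MATLAB column 27

night_data = daily[:, night_slots, col]   # one row per day
day_data = daily[:, day_slots, col]       # one row per day
```

Each row of night_data and day_data then holds one day's selection, so no loop over days is needed.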
You can use variables in your indices and wrap this in a loop over n = 1:17 with dynamic field names:
night_data.(strcat('night',int2str(n))) = bigmatrix([1+(n-1)*48:8+(n-1)*48, 46+(n-1)*48:48+(n-1)*48], 27);
This builds a structure with fields called night1, night2, and so on up to night17. The same can be done for the day data.
However, you should really be using date indexing with table variables in MATLAB. Once you convert your date column to datetime,
bigmatrix.Date = datetime(bigmatrix.Date)
you can do something like the following:
night_data = bigmatrix(hour(bigmatrix.Date) >= 22 | hour(bigmatrix.Date) < 8, 27)
which selects all data points between 10 PM and 8 AM (or whatever your day-night cutoff is). Note that the two conditions must be combined with | (or), since no hour is both at least 22 and below 8.
Let me start by saying I'm new to Python/PySpark.
I've got a dataframe of 100 items. I'm slicing it into batches of 25, and then for each batch I need to do work on each row. I'm getting duplicate values in the last "do work" step. I've verified that my original list does not contain duplicates and that my slice step generates 4 distinct lists.
batchsize = 25
sliced = []
emailLog = []
for i in range(1,bc_df.count(),batchsize):
    sliced.append({"slice":bc_df.filter(bc_df.Index >= i).limit(batchsize).rdd.collect()})
for s in sliced:
    for r in s['slice']:
        emailLog.append({"email":r['emailAddress']})
re = sc.parallelize(emailLog)
re_df = sqlContext.createDataFrame(re)
re_df.createOrReplaceTempView('email_logView')
%sql
select count(distinct(email)) from email_logView
My expectation is to have 100 distinct email addresses, but I sometimes get 75, 52, 96, or 100.
Your issue is caused by this line because it is not deterministic and allows duplicates:
sliced.append({"slice":bc_df.filter(bc_df.Index >= i).limit(batchsize).rdd.collect()})
Let's take a closer look at what is happening (I assume that the index column ranges from 1 to 100).
Your range function generates four values for i (1,26,51 and 76).
During the first iteration you request all rows whose index is 1 or greater (i.e. [1,100]) and take 25 of them.
During the second iteration you request all rows whose index is 26 or greater (i.e. [26,100]) and take 25 of them.
During the third iteration you request all rows whose index is 51 or greater (i.e. [51,100]) and take 25 of them.
During the fourth iteration you request all rows whose index is 76 or greater (i.e. [76,100]) and take 25 of them.
You can already see that the intervals overlap. Since limit() without an explicit ordering may return any 25 rows from the interval, the email addresses taken in one iteration may already have been taken in a previous one.
You can fix this by simply extending your filter with an upper limit. For example:
sliced.append({"slice":bc_df.filter((bc_df.Index >= i) & (bc_df.Index < i + batchsize)).rdd.collect()})
That is just a quick fix to solve your problem. As general advice, I recommend avoiding .collect() wherever possible because it does not scale horizontally.
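The difference between the two filters can be sketched in plain Python, without Spark. This is a simplified model, not Spark's actual behaviour: the hypothetical unbounded_batch helper uses random sampling to stand in for a limit() that is free to return any matching rows, and the rows and email addresses are made up for illustration.

```python
import random

BATCH_SIZE = 25

# Made-up rows standing in for the dataframe (assumption: Index runs 1..100).
rows = [{"Index": i, "emailAddress": f"user{i}@example.com"} for i in range(1, 101)]


def unbounded_batch(rows, start, size, rng):
    """Model of filter(Index >= start).limit(size) with no ordering guarantee."""
    candidates = [r for r in rows if r["Index"] >= start]
    return rng.sample(candidates, min(size, len(candidates)))


def bounded_batch(rows, start, size):
    """Model of filter((Index >= start) & (Index < start + size))."""
    return [r for r in rows if start <= r["Index"] < start + size]


rng = random.Random(0)
starts = range(1, len(rows) + 1, BATCH_SIZE)  # 1, 26, 51, 76

unbounded_emails = {
    r["emailAddress"] for s in starts for r in unbounded_batch(rows, s, BATCH_SIZE, rng)
}
bounded_emails = {
    r["emailAddress"] for s in starts for r in bounded_batch(rows, s, BATCH_SIZE)
}

# The bounded batches are disjoint and cover every row; the unbounded ones
# can repeat rows, so their distinct count may fall short of 100.
print(len(unbounded_emails), len(bounded_emails))
```

With the upper bound in place, each row belongs to exactly one batch regardless of the order in which rows are returned.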
I would like to find the indices at which several input values are matched in corresponding arrays. As an example, consider a time series for which a dataset contains multiple arrays: years, months, days, and hours. The arrays are filled chronologically. Since the dataset is collected over the span of a few years, the years array will be sorted, but the remaining arrays will not be (the values in hours, for instance, are only sorted from 0 to 23 within each day of each month of each year). The dataset is also not necessarily continuous: the gap between consecutive observations (that is, between consecutive indices) can be more than one hour or one day.
import numpy as np
years = np.array([2017, 2017, 2018, 2018, 2018, 2018])
months = np.array([12, 12, 1, 1, 1, 2]) # 1-12 months in the year
days = np.array([31, 31, 1, 2, 18, 1]) # 28 (or 29), 30, or 31 days per month
hours = np.array([4, 2, 17, 12, 3, 15]) # 0-23 hours per day
def get_matching_time_index(yy, mm, dd, hh):
    """ This function returns an array of indices at which all values are matched in their corresponding arrays. """
    res, = np.where((years == yy) & (months == mm) & (days == dd) & (hours == hh))
    return res
idx_one = get_matching_time_index(2018, 1, 1, 17)
# >> [2]
idx_two = get_matching_time_index(2018, 2, 2, 0)
# >> []
idx_one = [2] since the element at index 2 of years is 2018, of months is 1, of days is 1, and of hours is 17. Since idx_two came up empty, I would like to expand my search to find the index that corresponds to the nearest time. Since the last element of each array is nearest to the input datetime parameters, I would like the last index of these arrays to be returned (5 in this case).
One might be inclined to think that it's impossible to find the nearest group of values in multiple arrays. But in this case, the hours take precedence over the days, which take precedence over the months, etc. (since an observation 3 hours off from the input time is nearer in time than an observation 3 days off from the input time).
I found a lot of nifty solutions that will work on one array via this post on StackOverflow, but not for a condition that works on multiple arrays. Furthermore, the most efficient solutions posted assume that the array is sorted, whereas the only sorted array in the case of my example is the years.
I suppose I can repeat the operations suggested in that post to repeat the same procedure on each of the multiple arrays - this way, I can find the indices that are common for each of the arrays. Then, one can take the difference of input time-parameters and the time-parameters that are found at the common indices. Starting from the arrays of smaller units (hours in this case), one can pick the index that corresponds to the smallest difference. BUT, I feel that there is a simpler approach that may also be more efficient.
How can I better approach this problem to find the index that corresponds to the nearest grouping of data points via multiple arrays? Is this where a multi-dimensional array becomes handy?
EDIT:
On second thought, one can convert all time parameters into elapsed hours. Then, one can find the index corresponding to the observation that is nearest in elapsed hours. Regardless, I am still curious about other ways of approaching this problem.
Your edit probably contains the right idea.
A fast and safe way to do that (after import datetime) is:
In [93]: dates=np.vectorize(datetime.datetime)(years,months,days,hours)
In [94]: np.abs(datetime.datetime(2018, 1, 1, 0)-dates).argmin()
Out[94]: 2
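The elapsed-hours idea from the question's edit can also be sketched with NumPy's native datetime64 type, which avoids an object array. This is one possible approach, not the only one; the helper names to_datetime64 and nearest_time_index are made up for illustration, and the sample arrays are copied from the question.

```python
import numpy as np

years = np.array([2017, 2017, 2018, 2018, 2018, 2018])
months = np.array([12, 12, 1, 1, 1, 2])
days = np.array([31, 31, 1, 2, 18, 1])
hours = np.array([4, 2, 17, 12, 3, 15])


def to_datetime64(yy, mm, dd, hh):
    """Combine calendar components into datetime64[h] values."""
    # Years are counted from the 1970 epoch; months and days are offsets from 1.
    stamps = (np.asarray(yy) - 1970).astype("datetime64[Y]").astype("datetime64[M]")
    stamps = stamps + (np.asarray(mm) - 1).astype("timedelta64[M]")
    stamps = stamps.astype("datetime64[D]") + (np.asarray(dd) - 1).astype("timedelta64[D]")
    return stamps.astype("datetime64[h]") + np.asarray(hh).astype("timedelta64[h]")


timestamps = to_datetime64(years, months, days, hours)


def nearest_time_index(yy, mm, dd, hh):
    """Return the index of the observation closest in time to the input."""
    target = to_datetime64(yy, mm, dd, hh)
    return int(np.abs(timestamps - target).argmin())


print(nearest_time_index(2018, 1, 1, 17))  # exact match at index 2
print(nearest_time_index(2018, 2, 2, 0))   # no exact match; nearest is index 5
```

Because the comparison is made on a single combined timestamp, the precedence of hours over days over months is handled automatically.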
I have a set of data: say, daily temperature on an n x m grid (n: latitude, m: longitude) covering the whole world for one month. However, the temperature at my location of interest is not correct, so I need to update it. In other words, I have to change the data at certain grid points for every time step (daily). I attach here a simple example. Let's say each 1x2 matrix on the left is the correct data, while each 6x4 matrix on the right contains some incorrect data (6: latitude, 4: longitude). What I need is to copy the correct data on the left into the matrix on the right, at the positions marked in the same color, for every time step.
Could anyone help me?
Many thanks
For example this data:
A=rand(4,2)
B=rand(6,4,4)
You would want these values to be replaced by A:
B(3,2:3,:)
Just make sure the size is the same
size(B(3,2:3,:))
> 1 2 4
A=reshape(A',[1 2 4])
And you can put it there
B(3,2:3,:)=A
[edit] Sorry, I probably just don't see the problem.
T = randi(255,[1E3,1E3,31],'uint8'); %1000 longitude, 1000 latitude, 31 days
C = repmat([50,100],[31,1,1]); %correction for 31 days and two locations. must become 50 and 100.
%location 20,10 and 20,11 must change.
T(20,10:11,:)=reshape(C',[1 2 31]);
T(20,10,3) %test for third day.
>> 50
T(20,11,10) %test for tenth day.
>> 100
The replacement takes 0.000365 seconds on my PC.
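The same slice assignment can be sketched in NumPy for comparison. The grid size, the random placeholder temperatures, and the corrected location (row 2, columns 1 and 2, zero-based) are assumptions chosen for illustration.

```python
import numpy as np

N_DAYS = 31

# Placeholder temperature grid: 6 latitudes x 4 longitudes x 31 days (assumption).
rng = np.random.default_rng(0)
T = rng.integers(0, 256, size=(6, 4, N_DAYS), dtype=np.uint8)

# One row of corrections per day: the two grid points must become 50 and 100.
C = np.tile(np.array([50, 100], dtype=np.uint8), (N_DAYS, 1))  # shape (31, 2)

# T[2, 1:3, :] has shape (2, 31), so the corrections are transposed to match.
T[2, 1:3, :] = C.T

print(T[2, 1, 2])   # first corrected point, third day
print(T[2, 2, 9])   # second corrected point, tenth day
```

As in the MATLAB version, the only requirement is that the shape of the values matches the shape of the slice being overwritten.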
I am working with a datetime array s constructed as follows:
ds = datetime(2010,01,01,'TimeZone','Europe/Berlin');
de = datetime(2030,01,01,'TimeZone','Europe/Berlin');
s = ds:hours(1):de;
I am using the ismember function to find the first occurrence of a specific date in that array.
ind = ismember(s,specificDate);
startPlace = find(ind,1);
The two lines from above are called many times in my application and consume quite some time. It is clear to me that Matlab compares ALL dates from s with specificDate, even though I need only the first occurrence of specificDate in s. So to speed up the application it would be good if Matlab would stop comparing specificDate to s once the first match is found.
One solution would be to use a while loop, but with the while loop the application becomes even slower (I tried it).
Any idea how to work around this problem?
I'm not sure what your specific use-case is here, but with the step size between elements of s being one hour, your index is simply going to be the difference in hours between your specific date and the start date, plus one. No need to create or search through s in the first place:
startPlace = hours(specificDate-ds)+1;
And an example to test each solution:
specificDate = datetime(2017, 1, 1, 'TimeZone', 'Europe/Berlin'); % Sample date
ind = ismember(s, specificDate); % Compare to the whole vector
startPlace = find(ind, 1); % Find the index
isequal(startPlace, hours(specificDate-ds)+1) % Check equality of solutions
ans =
logical
1 % Same!
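The index arithmetic above carries over to other languages. Below is a minimal Python sketch using only the standard library; it assumes a regular hourly grid in UTC (rather than Europe/Berlin) and zero-based indices, and hourly_index is a made-up helper name.

```python
from datetime import datetime, timedelta, timezone

START = datetime(2010, 1, 1, tzinfo=timezone.utc)  # assumption: UTC grid
STEP = timedelta(hours=1)


def hourly_index(specific, start=START, step=STEP):
    """Zero-based position of `specific` on the regular grid, without searching."""
    offset, remainder = divmod(specific - start, step)
    if offset < 0 or remainder:
        raise ValueError("date does not lie on the hourly grid")
    return offset


# Check against an explicit search on a short series.
series = [START + k * STEP for k in range(100)]
specific = START + 42 * STEP
print(series.index(specific) == hourly_index(specific))
```

The lookup cost is constant regardless of how long the series is, which is the point of computing the index instead of searching for it.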
You can save yourself some time by converting the datetime values to datenum. In that case you will be comparing plain numbers rather than datetime objects, which may speed up processing, like this:
s_new = datenum(s);
ind = ismember(s_new,datenum(specificDate));
startPlace = find(ind,1);
Ok. I have a simple question although I'm still fairly new to Matlab (taught myself). So I was wanting a 1x6 matrix to look like this below:
0
0
1
0
321, 12 <--- needs to be in one box in 1x6 matrices
4,30,17,19 <--- needs to be in one box in 1x6 matrices
Is there a possible way to do this or am I going to just have to write them all in separate boxes thus making it a 1x10 matrix?
My code:
event_marker = 0;
event_count = 0;
block_number = 1;
date = [321,12] % (its corresponding variables = 321 and 12)
time = [4,30,17,19] % (its corresponding variable = 4 and 30 and 17 and 19)
So if I understand you correctly, you want an array that contains 6 elements, of which one element equals 1, another element is the array [321,12], and the last element is the array [4,30,17,19].
I'll suggest two things to accomplish this: matrices, and cell-arrays.
Cell arrays
In Matlab, a cell array is a container for arbitrary types of data. You define it using curly braces (as opposed to square brackets for matrices). So, for example,
C = {'test', rand(4), {@cos,@sin}}
is something that contains a string (C{1}), a normal matrix (C{2}), and another cell which contains function handles (C{3}).
For your case, you can do this:
C = {0,0,1,0, [321,12], [4,30,17,19]};
or of course,
C = {0, event_marker, event_count, block_number, date, time};
Matrices
Depending on where you use it, a normal matrix might suffice as well:
M = [0 0 0 0
event_marker 0 0 0
event_count 0 0 0
block_number 0 0 0
321 12 0 0
4 30 17 19];
Note that you'll need some padding (meaning, you'll have to add those zeros in the top-right somehow). There's tonnes of ways to do that, but I'll "leave that as an exercise" :)
Again, it all depends on the context which one will be easier.
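For comparison, the two options can be sketched in Python, where a list plays the role of the cell array and a zero-padded NumPy array plays the role of the matrix. This is an analogy rather than a translation of the MATLAB code; the variable values are taken from the question, and the padding loop is just one of many ways to do it.

```python
import numpy as np

event_marker = 0
event_count = 0
block_number = 1
date = [321, 12]
time = [4, 30, 17, 19]

# Option 1: a heterogeneous container, one entry per "box".
record = [0, event_marker, event_count, block_number, date, time]

# Option 2: a rectangular matrix, with shorter rows padded by zeros.
rows = [np.atleast_1d(item) for item in record]
width = max(len(row) for row in rows)
padded = np.zeros((len(rows), width), dtype=int)
for i, row in enumerate(rows):
    padded[i, :len(row)] = row

print(record[4])   # the date stays together as one element
print(padded[4])   # the date row, padded on the right
```

The container keeps each entry at its natural length, while the padded matrix trades that for a uniform shape.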
Consider using cell arrays rather than matrices for your task.
data = cell(6,1); % allocate cell
data{1} = event_marker; % note the curly braces here!
...
data{6} = date; % all elements of date fit into a single cell.
If your date and time variables actually represent a date (day, month, year) and a time (hours, minutes, seconds), they can be packed into one or two numbers.
Look into the DATENUM function. If you have a vector such as [2013, 4, 10], representing April 10th, 2013, you can convert it into a serial date:
daten = datenum([2013, 4, 10]);
It also works if you only have the day of the year and no month: datenum([2013, 0, 300]) is accepted as well.
The time can be packed together with date or separately:
timen = datenum([0, 0, 0, 4, 30, 17.19]);
or
datetimen = datenum([2013, 4, 10, 4, 30, 17.19]);
Once you have this serial date you can just keep it in one vector with other numbers.
You can convert this number back into either a date vector or a date string with the DATEVEC and DATESTR functions.