I'm working on a semi-supervised algorithm. My dataset is STL-10, which has 5000 labeled and 100000 unlabeled training images. I store all of the labeled and unlabeled training images in the same folder.
I used torchvision.datasets.ImageFolder to load the images and transform them into the tensor input I need.
The problem is that I want to shuffle the data after batching, because every batch fed to my model must contain a single target label. Here is part of my code:
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder
from SpykeTorch import utils
trainsetfolder = utils.CacheDataset(ImageFolder("/img", my_transform))
testsetfolder = utils.CacheDataset(ImageFolder("/img_test", my_transform))
trainset = DataLoader(trainsetfolder, batch_size = 100, shuffle = False)
testset = DataLoader(testsetfolder, batch_size = 100, shuffle = False)
for data, target in trainset:
    print(data, target, len(data), len(target))
The output of the above code looks something like this:
#print:
tensor[img1 img2 ... img100], [0 0 ... 0] ,100, 100
tensor[img101 img102 ... img200], [0 0 ... 0], 100, 100
...
tensor[img401 img402 ... img500],[0 0 ... 0], 100, 100
tensor[img1 img2 ... img100], [1 1 ... 1] ,100, 100
tensor[img101 img102 ... img200], [1 1 ... 1], 100, 100
...
tensor[img401 img402 ... img500],[1 1 ... 1], 100, 100
...
tensor[img1 img2 ... img100], [9 9 ... 9] ,100, 100
tensor[img101 img102 ... img200], [9 9 ... 9], 100, 100
...
tensor[img401 img402 ... img500],[9 9 ... 9], 100, 100
tensor[img1 img2 ... img100], [10 10 ... 10] ,100, 100
tensor[img99901 img99902 ... img100000], [10 10 10 ... 10], 100, 100
As you can see, the batches are fed to the model target by target. That is:
5 batches of 100 (500 images) for each of targets 0 to 9
1000 batches of 100 (100000 images) for target 10
Now I want the batches themselves to be shuffled, something like this:
#print:
tensor[img201 img202 ... img300], [1 1 ... 1] ,100, 100
tensor[img1101 img1102 ... img1200], [10 10 ... 10], 100, 100
tensor[img401 img402 ... img500],[0 0 ... 0], 100, 100
tensor[img801 img802 ... img900], [8 8 ... 8] ,100, 100
tensor[img501 img502 ... img600], [10 10 ... 10], 100, 100
tensor[img401 img402 ... img500],[1 1 ... 1], 100, 100
tensor[img1 img2 ... img100], [9 9 ... 9] ,100, 100
tensor[img701 img702 ... img800], [4 4 ... 4], 100, 100
...
tensor[img401 img402 ... img500],[0 0 ... 0], 100, 100
tensor[img101 img102 ... img200], [2 2 ... 2] ,100, 100
tensor[img96901 img96902 ... img97000], [10 10 10 ... 10], 100, 100
That is 105000 images in total.
Can anyone please help me with that?
Thanks.
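One possible approach, sketched here with the standard library only: group the sample indices by label, cut each group into batches, then shuffle the list of batches. The helper name same_label_batches and the toy targets are made up for illustration; for the real data the labels could probably be taken from the targets attribute of the ImageFolder. PyTorch's DataLoader accepts such a list of index batches through its batch_sampler argument, which is used instead of batch_size and shuffle.

```python
import random
from collections import defaultdict


def same_label_batches(targets, batch_size, seed=None):
    """Return a shuffled list of index batches; each batch holds one label only.

    Hypothetical helper for illustration, not part of PyTorch or SpykeTorch.
    """
    # Collect the dataset indices that belong to each label.
    by_label = defaultdict(list)
    for idx, label in enumerate(targets):
        by_label[label].append(idx)

    # Cut every label's index list into consecutive batches.
    batches = []
    for indices in by_label.values():
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])

    # Shuffle the order of the batches, not the samples inside them.
    random.Random(seed).shuffle(batches)
    return batches


# Toy labels standing in for the real dataset's targets.
targets = [0] * 6 + [1] * 6 + [2] * 4
batches = same_label_batches(targets, batch_size=2, seed=0)

# Assumed wiring (not run here): pass the list as batch_sampler.
# trainset = DataLoader(trainsetfolder, batch_sampler=batches)
```

The list is fixed once built, so the same batch order repeats every epoch unless the helper is called again before each epoch.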
I am trying to write a general utility to update indices in a jax array that may have a different number of dimensions depending on the instance.
I know that I have to use the .at[].set() methods, and this is what I have so far:
import jax.numpy as np  # assumed: np must be jax.numpy here, since .at[] only exists on JAX arrays
b = np.arange(16).reshape([4,4])
print(b)
update_indices = np.array([[1,1], [3,2], [0,3]])
update_indices = np.moveaxis(update_indices, -1, 0)
b = b.at[update_indices[0], update_indices[1]].set([333, 444, 555])
print(b)
This transforms:
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
into
[[ 0 1 2 555]
[ 4 333 6 7]
[ 8 9 10 11]
[ 12 13 444 15]]
My problem is that I have had to hard code the argument to at as update_indices[0], update_indices[1]. However, in general b could have an arbitrary number of dimensions so this will not work. (e.g. for a 3D array I would have to replace it with update_indices[0], update_indices[1], update_indices[2]).
It would be nice if I could write something like b.at[*update_indices] but this does not work.
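This should work, because indexing with a tuple of index arrays is equivalent to writing the arrays out separated by commas:
b.at[tuple(update_indices)]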
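To show that the tuple form does not depend on the number of dimensions, here is a small sketch. It uses plain NumPy with in-place assignment on a copy as a stand-in for JAX's .at[...].set(...); the tuple construction is the same in both. The helper name set_at and the 3D example are made up for illustration.

```python
import numpy as np


def set_at(arr, index_rows, values):
    """Return a copy of arr with one value written at each coordinate row.

    index_rows has shape (n_updates, arr.ndim); hypothetical helper.
    """
    out = arr.copy()
    # Move the coordinate axis first, then unpack into one index array per
    # dimension. The tuple has arr.ndim entries, whatever arr.ndim is.
    per_dim = tuple(np.moveaxis(np.asarray(index_rows), -1, 0))
    out[per_dim] = values
    return out


# 2D case from the question.
b2 = set_at(np.arange(16).reshape(4, 4), [[1, 1], [3, 2], [0, 3]], [333, 444, 555])

# 3D case: same call, no hard-coded number of indices.
b3 = set_at(np.zeros((2, 3, 4), dtype=int), [[0, 1, 2], [1, 2, 3]], [7, 8])
```

With a JAX array, the assignment line would become out = arr.at[per_dim].set(values) and no copy would be needed.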
I have an array of length n. The array has braking energy values, and the index number represents time in seconds.
The structure of array is as follows:
Index 1 to 140, array has zero values. (Vehicle not braking)
Index 141 to 200, array has random energy values. (Vehicle was braking and regenerating energy)
Index 201 to 325, array has zero values. (Vehicle not braking)
Index 326 to 405, array has random energy values. (Vehicle was braking and regenerating energy)
...and so on for an array of length n.
What I want to do is to get starting and ending index number of each set of energy values.
For example the above sequence gives this result:
141 - 200
326 - 405
...
Can someone please suggest what method or technique can I use to get this result?
Using diff is a quick way to do this.
Here is a demo (see the comments for details):
% Junk data for demo. Indices shown above for reference
%    1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17
x = [0, 0, 0, 2, 3, 4, 0, 0, 1, 1, 7, 9, 3, 4, 0, 0, 0];
% Logical converts all non-zero values to 1
% diff is x(2:end)-x(1:end-1), so picks up on changes to/from zeros
% Instead of 'logical', you could have a condition here,
% e.g. bChange = diff( x > 0.5 );
bChange = diff( logical( x ) );
% bChange is one of the following for each consecutive pair:
% 1 for [0 1] pairs
% 0 for [0 0] or [1 1] pairs
% -1 for [1 0] pairs
% We inflate startIdx by 1 to index the non-zero value
startIdx = find( bChange > 0 ) + 1; % Indices of [0 1] pairs
endIdx = find( bChange < 0 ); % Indices of [1 0] pairs
I'll leave it as an exercise to capture the edge cases where you add a start or end index if the array starts or ends with a non-zero value. Hint: you could handle each case separately or pad the initial x with additional end values.
Output of the above:
startIdx
>> [4, 9]
endIdx
>> [6, 14]
So you can format this however you like to get the spans 4-6, 9-14.
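For readers who want the edge cases spelled out, here is a sketch of the padding idea from the hint, written in Python rather than MATLAB. It pads the flags with a zero at each end so that runs touching either edge of the array are still detected. The function name nonzero_runs is made up for illustration, and the indices are 1-based to match the MATLAB output.

```python
def nonzero_runs(values):
    """Return 1-based (start, end) pairs for each run of non-zero values."""
    # Pad with a zero at both ends so runs touching either edge are detected.
    # After padding, flags[i] describes the value at 1-based position i.
    flags = [0] + [1 if v != 0 else 0 for v in values] + [0]
    runs = []
    start = None
    for i in range(1, len(flags)):
        change = flags[i] - flags[i - 1]
        if change == 1:        # a [0 1] pair: a run starts at position i
            start = i
        elif change == -1:     # a [1 0] pair: the run ended at position i - 1
            runs.append((start, i - 1))
    return runs


x = [0, 0, 0, 2, 3, 4, 0, 0, 1, 1, 7, 9, 3, 4, 0, 0, 0]
print(nonzero_runs(x))          # [(4, 6), (9, 14)]
print(nonzero_runs([5, 0, 3]))  # [(1, 1), (3, 3)]
```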
This task can be done with two methods; both work perfectly.
Wolfie's method:
bChange = diff( EnergyB > 0 );
startIdx = find( bChange > 0 ) + 1; % Indices of [0 1] pairs
endIdx = find( bChange < 0 ); % Indices of [1 0] pairs
Result:
startIdx =
141
370
608
843
endIdx =
212
426
642
912
Second Method:
startends = find(diff([0; EnergyB > 0; 0]));
startends = reshape(startends, 2, [])';
startends(:, 2) = startends(:, 2) - 1
Result:
startends =
141 212
370 426
608 642
843 912
What I am trying to do is select the 1st element of each cell regardless of the number of columns or rows (they may change based on user defined criteria) and make a new pandas dataframe from the data. My actual data structure is similar to what I have listed below.
0 1 2
0 [1, 2] [2, 3] [3, 6]
1 [4, 2] [1, 4] [4, 6]
2 [1, 2] [2, 3] [3, 6]
3 [4, 2] [1, 4] [4, 6]
I want the new dataframe to look like:
0 1 2
0 1 2 3
1 4 1 4
2 1 2 3
3 4 1 4
The code below generates a data set similar to mine. It attempts to do what I want without success (d), and it mimics what I have seen in a similar question with success (c), although only for one column. The similar but different question is: Python Pandas: selecting element in array column
import pandas as pd
zz = pd.DataFrame([[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]],
[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]]])
print(zz)
x= zz.dtypes
print(x)
a = pd.DataFrame((zz.columns.values))
b = pd.DataFrame.transpose(a)
c =zz[0].str[0] # this will give the 1st value for each cell in columns 0
d= zz[[b[0]].values].str[0] #attempt to get 1st value for each cell in all columns
You can use apply, and select the first value of each list by indexing with str:
print (zz.apply(lambda x: x.str[0]))
0 1 2
0 1 2 3
1 4 1 4
2 1 2 3
3 4 1 4
Another solution with stack and unstack:
print (zz.stack().str[0].unstack())
0 1 2
0 1 2 3
1 4 1 4
2 1 2 3
3 4 1 4
I would use applymap, which applies the same function to each individual cell in your DataFrame (in pandas 2.1 and later it is deprecated in favor of DataFrame.map):
zz.applymap(lambda x: x[0])
0 1 2
0 1 2 3
1 4 1 4
2 1 2 3
3 4 1 4
I'm a big fan of stack + unstack.
However, @jezrael already put that answer down, so +1 from me.
That said, here is a quicker way: slice a numpy array.
import numpy as np
pd.DataFrame(
    np.array(zz.values.tolist())[:, :, 0],
    zz.index, zz.columns
)
0 1 2
0 1 2 3
1 4 1 4
2 1 2 3
3 4 1 4
I have a dataframe with three columns: X, Y, and counts, where counts is the number of occurrences where X and Y appear together. My goal is to transform this dataframe into a two-dimensional array where the X values are the row names, the Y values are the column names, and the counts fill the cells of the table.
Is this possible? I can elaborate if needed.
To get the same result as a pivot table, you can also perform a groupby operation and then unstack one of the columns:
import numpy as np
import pandas as pd
df = pd.DataFrame({'color': ['red', 'blue', 'black'] * 2,
'vehicle': ['car', 'truck'] * 3,
'value': np.arange(1, 7)})
>>> df
color value vehicle
0 red 1 car
1 blue 2 truck
2 black 3 car
3 red 4 truck
4 blue 5 car
5 black 6 truck
>>> df.groupby(['color', 'vehicle']).sum().unstack('vehicle')
         value
vehicle    car truck
color
black        3     6
blue         5     2
red          1     4
Here is an IPython session that may be a good simulation of what you are trying to do:
In [17]: import pandas as pd
In [18]: from random import randint
In [19]: x = ['a', 'b', 'c'] * 4
In [20]: y = ['i', 'j', 'k', 'l'] * 3
In [21]: counts = [randint(10, 20) for i in range(12)]
In [22]: df = pd.DataFrame(dict(x=x, y=y, counts=counts))
In [23]: df.head()
Out[23]:
counts x y
0 16 a i
1 10 b j
2 16 c k
3 15 a l
4 19 b i
In [24]: df.pivot(index='x', columns='y', values='counts')
Out[24]:
y   i   j   k   l
x
a  16  14  18  15
b  19  10  15  20
c  10  18  16  16
In [25]: df.pivot(index='x', columns='y', values='counts').values
Out[25]:
array([[16, 14, 18, 15],
       [19, 10, 15, 20],
       [10, 18, 16, 16]], dtype=int64)
I have a large array in R and would like to subset it using points I obtained from a different matrix.
For example, given
,,1
34 1 1 3 4
32 1 3 4 5
23 1 1 3 4
35 1 3 4 4
23 1 2 3 4
,,2
234 1 1 3 4
32 1 3 4 5
324 1 1 3 4
23 1 3 4 4
232 1 2 3 4
and would like it to return
34 1 1 3 4
23 1 1 3 4
23 1 2 3 4
234 1 1 3 4
324 1 1 3 4
232 1 2 3 4
in some format.
These particular rows are returned because I am selecting on the last 3 columns
(i.e. I want all the rows whose last 3 values are 1,3,4 or 2,3,4).
One way is
m1 <- apply(ar1, 2, `[`)
m1[m1[,2]%in% 1:2 & m1[,3]==3 & m1[,4]==4,]
# [,1] [,2] [,3] [,4]
#[1,] 1 1 3 4
#[2,] 1 1 3 4
#[3,] 1 2 3 4
#[4,] 1 1 3 4
#[5,] 1 1 3 4
#[6,] 1 2 3 4
Or
res <- do.call(rbind,lapply(seq(dim(ar1)[3]), function(i) {
  x1 <- ar1[,,i]
  x2 <- t(x1[,-1])
  x1[colSums(x2==c(1,3,4)|x2==c(2,3,4))==3,]
}))
res
# [,1] [,2] [,3] [,4]
#[1,] 1 1 3 4
#[2,] 1 1 3 4
#[3,] 1 2 3 4
#[4,] 1 1 3 4
#[5,] 1 1 3 4
#[6,] 1 2 3 4
Update
Suppose the values to match are in a matrix, with each row being one vector to match.
toMatch <- rbind(c(1,3,4), c(2,3,4), c(4,3,2), c(1,9,4))
indx1 <- apply(toMatch, 1, paste, collapse="")
res <- do.call(rbind,lapply(seq(dim(ar1)[3]), function(i) {
  x1 <- ar1[,,i]
  x1[apply(x1[,-1], 1, paste, collapse='') %in% indx1,]
}))
data
ar1 <- structure(c(1, 1, 1, 1, 1, 1, 3, 1, 3, 2, 3, 4, 3, 4, 3, 4, 5,
4, 4, 4, 1, 1, 1, 1, 1, 1, 3, 1, 3, 2, 3, 4, 3, 4, 3, 4, 5, 4,
4, 4), .Dim = c(5L, 4L, 2L))