ValueError: setting an array element with a sequence - passing a list in a dictionary to DataGenerator

I am working on a Keras multilabel problem. To handle a large amount of data without running into memory issues, I implemented a custom data generator.
So far I work with a CSV file of IDs, filenames, and their corresponding labels (21 in total), which looks like this:
Filename label1 label2 label3 label4 ... ID
abc1.jpg 1 0 0 1 ... id-1
def2.jpg 1 0 0 1 ... id-2
ghi3.jpg 1 0 0 1 ... id-3
...
I put the IDs and the labels into dictionaries, which look like this:
partition: {'train': ['id-1','id-2','id-3',...], 'validation': ['id-7','id-14','id-21',...]}
labels: {'id-0': [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
         'id-1': [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
         'id-2': [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
         ...}
All my images are converted to arrays and saved as individual .npy files: id-1.npy, id-2.npy, ...
Then I execute my code:
import numpy as np
import keras
from keras.layers import *
from keras.models import Sequential

class DataGenerator(keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, list_IDs, labels, batch_size=32, dim=(224,224), n_channels=3,
                 n_classes=21, shuffle=True):
        'Initialization'
        self.dim = dim
        self.batch_size = batch_size
        self.labels = labels
        self.list_IDs = list_IDs
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(len(self.list_IDs) / self.batch_size))

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in indexes]
        # Generate data
        X, y = self.__data_generation(list_IDs_temp)
        return X, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle == True:
            np.random.shuffle(self.indexes)

    def __data_generation(self, list_IDs_temp):
        'Generates data containing batch_size samples'  # X : (n_samples, *dim, n_channels)
        # Initialization
        X = np.empty((self.batch_size, *self.dim, self.n_channels))
        y = np.empty((self.batch_size), dtype=int)
        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Store sample
            X[i,] = np.load('Folder with npy files/' + ID + '.npy')
            # Store class
            y[i] = self.labels[ID]
        return X, keras.utils.to_categorical(y, num_classes=self.n_classes)

# Parameters
params = {'dim': (224, 224),
          'batch_size': 32,
          'n_classes': 21,
          'n_channels': 3,
          'shuffle': True}

# Datasets
partition = partition
labels = labels

# Generators
training_generator = DataGenerator(partition['train'], labels, **params)
validation_generator = DataGenerator(partition['validation'], labels, **params)

# Design model
model = Sequential()
model.add(Conv2D(32, (3,3), input_shape=(224, 224, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
...
model.add(Flatten())
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dense(21))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

# Train model on dataset
model.fit_generator(generator=training_generator,
                    validation_data=validation_generator)
and the following error is raised:
ValueError: setting an array element with a sequence
The following part of the traceback seems to be crucial:
<ipython-input-58-fedc63607310> in __getitem__(self, index)
31
32 # Generate data
---> 33 X, y = self.__data_generation(list_IDs_temp)
34
35 return X, y
<ipython-input-58-fedc63607310> in __data_generation(self, list_IDs_temp)
53
54 # Store class
---> 55 y[i] = self.labels[ID]
56
57 return X, keras.utils.to_categorical(y, num_classes=self.n_classes)
As soon as I replace the labels dictionary from the beginning with the following, the code runs:
labels = {'id-0': 0,
          'id-1': 2,
          'id-2': 1,
          ...}
I still want to pass multiple labels to the DataGenerator, which is why I put a list in the dictionary as shown at the beginning, but that gives me the ValueError. How can I pass multiple values for a single ID to the DataGenerator as suggested? What do I have to adjust?
A hint or a snippet of code would be much appreciated.

If I understand your code correctly, here is the problem:
y = np.empty((self.batch_size), dtype=int)
You are creating an empty 1D array, but here:
y[i] = self.labels[ID]
you are filling each of its elements with a whole sequence:
'id-0': [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
For this to work, you need to create your label array with the batch size and the length of your label vector (here self.n_classes, i.e. 21):
y = np.empty((self.batch_size, self.n_classes), dtype=int)
EDIT
to_categorical encodes a categorical feature as a one-hot array like [0, 0, 0, 1], [0, 0, 1, 0], etc. But you are feeding whole label vectors, not single categorical features.
Since you are feeding those vectors to your network directly, you don't want to one-hot encode them again, so replace:
return X, keras.utils.to_categorical(y, num_classes=self.n_classes)
with:
return X, y
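Putting both fixes together, __data_generation could look like this (a sketch using the question's own attribute names, where self.n_classes is the length of each label vector, i.e. 21):
def __data_generation(self, list_IDs_temp):
    'Generates data containing batch_size samples'
    X = np.empty((self.batch_size, *self.dim, self.n_channels))
    y = np.empty((self.batch_size, self.n_classes), dtype=int)  # one row per sample
    for i, ID in enumerate(list_IDs_temp):
        X[i,] = np.load('Folder with npy files/' + ID + '.npy')
        y[i] = self.labels[ID]  # the full 21-element vector fits the row now
    return X, y  # y is already a multi-hot matrix, no to_categorical needed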
Recommendation from the last comment
The problem is that your softmax activation will try to give the best score to the one correct class, but here you pass a label array that softmax interprets as having multiple "correct" classes:
For example: if you have 3 labels [1, 2, 3], one-hot encoding them gives [1, 0, 0], [0, 1, 0], [0, 0, 1]; there is only one "1" per encoded label array, i.e. one correct class, and softmax will try to make that class's score as big as possible.
But in your case you are giving arrays with multiple 1's:
with [1, 0, 1], softmax doesn't know which class should get the best score.
So I would recommend that you start with your 21 labels [0, 1, 2, 3, ...], one-hot encode that array, and feed it to your network.
If you really need the multi-label sequences, you have to find another solution!
Hope I'm clear!
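For illustration, the single-label setup recommended above would look something like this (a sketch with made-up class indices):
from keras.utils import to_categorical

class_indices = [0, 5, 20, 3]  # one integer class per sample (hypothetical values)
one_hot = to_categorical(class_indices, num_classes=21)
print(one_hot.shape)  # (4, 21) -- exactly one 1 per row, which softmax can handle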

Related

Most efficient way to forward fill a bit array

Imagine you have a bit array (any data type is okay, e.g. a list, np.array, bitarray, bitmap, etc. of booleans) that is randomly filled. What is the fastest way to "forward fill" (left to right, or 0th index to nth index) that array in Python, such that n bits get set to 1 following each bit already set to 1?
For example, take the array below:
[01000100000]
Given n=2 the forward filled array would be:
[01110111000]
edit
Assume that the input is a bit array of 10,000 elements, of which a random 20% are true, and n=25. This can be represented as a python list with 10,000 boolean elements, of which 20% are True. This could also be represented as a set with 2,000 int elements between 0 and 10,000.
edit 2
To get things started, here are some examples using the parameters above:
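(In the snippets below, existing is the set representation described in the edit above; for concreteness, it can be built like this:)
from random import sample
existing = set(sample(range(10_000), 2_000))  # a random 20% of 10,000 indices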
new = set()
new.update(*[range(i, i+25) for i in existing])
# 2.34 ms ± 56.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new = BitMap()  # This is a pyroaring BitMap
for e in existing:
    new.add_range(e, e+25)
# 461 µs ± 6.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I have addressed several data types below. No timings are given; you might want to time the statement that sets ans, or refactor into functions, to time at the granularity that makes sense to you.
# -*- coding: utf-8 -*-
"""
Created on Sun Dec 19 09:08:56 2021
for: https://stackoverflow.com/questions/70397220/most-efficient-way-to-forward-fill-a-bit-array

@author: paddy
"""
from random import sample

n = 2              # bits to the right of set bits to also set
elements = 17
true_percent = 20.0

#%% Using arbitrary precision int
print("\nUsing arbitrary precision int.\n".upper())
from operator import or_
from functools import reduce

# Set some random bits True
bits = sum(1 << r
           for r in sample(range(elements), int(true_percent/100 * elements)))
# Set n right-adjacent bits.
ans = reduce(or_, (bits >> x for x in range(n+1)), 0)
# Print
print(f"Random bits = {bits:0{elements}b}")
if 1:
    print()
    for x in range(n+1):
        print(f"              {bits >> x:0{elements}b}")
    print()
print(f"Answer      = {ans:0{elements}b}\n")

#%% Using list.
print("\nUsing list.\n".upper())
from operator import or_
from functools import reduce

bits = [0] * elements
# Set some random bits to 1
for r in sample(range(elements), int(true_percent/100 * elements)):
    bits[r] = 1
# Set n right-adjacent bits.
# [0]*x is padding bits on the left.
# zip(*(list1, list2,..)) returns the n'th elements of list1, list2,...
# int(any(...)) or's them.
ans = [int(any(shifts))
       for shifts in zip(*([0]*x + bits for x in range(n+1)))]
# Print
print(f"Random bits = {bits}")
if 1:
    print()
    for x in range(n+1):
        print(f"              {[0]*x + bits}")
    print()
print(f"Answer      = {ans}\n")

#%% Using numpy.
# Adapt the list solution to use numpy operators on numpy arrays

#%% Using other ordered collections such as str.
# Convert to and from the int solution.
Sample Output:
USING ARBITRARY PRECISION INT.

Random bits = 01000000010000010

              01000000010000010
              00100000001000001
              00010000000100000

Answer      = 01110000011100011

USING LIST.

Random bits = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]

              [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
              [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
              [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]

Answer      = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0]
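The script's numpy section is left as an exercise; a possible adaptation of the list solution to numpy (my own sketch, not part of the original answer) ORs the array with right-shifted copies of itself:
import numpy as np

def forward_fill(bits: np.ndarray, n: int) -> np.ndarray:
    'Set the n bits following every 1 by OR-ing in right-shifted copies.'
    out = bits.astype(bool).copy()
    for x in range(1, n + 1):
        out[x:] |= bits[:-x].astype(bool)  # shift right by x positions and OR in
    return out.astype(bits.dtype)

bits = np.array([0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0])
print(forward_fill(bits, 2))  # [0 1 1 1 0 1 1 1 0 0 0], matching the example above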

how to store 1-D array in a Matrix by adding a new row in python

I am applying Dijkstra's algorithm to each node in Python, using Spyder. I am getting the correct results, but I am unable to store the results (1-D arrays) in an n x n matrix by adding a row at a time. I store the data obtained from Dijkstra in the D_path variable, but in the Variable Explorer its type is shown as NoneType. I append it to a row and append the row to the matrix.
import sys
import numpy as np

class Graph():

    def __init__(self, vertices):
        self.V = vertices
        self.graph = [[0 for column in range(vertices)]
                      for row in range(vertices)]

    def printSolution(self, dist):
        print("Vertex \tDistance from Source")
        for node in range(self.V):
            print(node, "\t", dist[node])

    # A utility function to find the vertex with
    # minimum distance value, from the set of vertices
    # not yet included in the shortest path tree
    def minDistance(self, dist, sptSet):
        # Initialize minimum distance for next node
        min = sys.maxsize
        # Search for the nearest vertex not in the
        # shortest path tree
        for v in range(self.V):
            if dist[v] < min and sptSet[v] == False:
                min = dist[v]
                min_index = v
        return min_index

    # Function that implements Dijkstra's single-source
    # shortest path algorithm for a graph represented
    # using an adjacency matrix
    def dijkstra(self, src):
        dist = [sys.maxsize] * self.V
        dist[src] = 0
        sptSet = [False] * self.V
        for cout in range(self.V):
            # Pick the minimum distance vertex from
            # the set of vertices not yet processed.
            # u is always equal to src in the first iteration
            u = self.minDistance(dist, sptSet)
            # Put the minimum distance vertex in the
            # shortest path tree
            sptSet[u] = True
            # Update dist value of the adjacent vertices
            # of the picked vertex, only if the current
            # distance is greater than the new distance and
            # the vertex is not in the shortest path tree
            for v in range(self.V):
                if self.graph[u][v] > 0 and sptSet[v] == False and dist[v] > dist[u] + self.graph[u][v]:
                    dist[v] = dist[u] + self.graph[u][v]
        self.printSolution(dist)

# Driver program
g = Graph(9)
g.graph = [[0, 4, 0, 0, 0, 0, 0, 8, 0],
           [4, 0, 8, 0, 0, 0, 0, 11, 0],
           [0, 8, 0, 7, 0, 4, 0, 0, 2],
           [0, 0, 7, 0, 9, 14, 0, 0, 0],
           [0, 0, 0, 9, 0, 10, 0, 0, 0],
           [0, 0, 4, 14, 10, 0, 2, 0, 0],
           [0, 0, 0, 0, 0, 2, 0, 1, 6],
           [8, 11, 0, 0, 0, 0, 1, 0, 7],
           [0, 0, 2, 0, 0, 0, 6, 7, 0]
           ]

#D_path = list()
matrix = []                 # define empty matrix
for i in range(9):          # one row per source node
    row = []
    D_path = g.dijkstra(i)
    row.append(D_path)      # add the result for this row
    matrix.append(row)      # add the row to the matrix
print(matrix)
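The reason D_path shows up as NoneType is that dijkstra() only prints the distances and never returns them. A minimal fix (a sketch, leaving the rest of the class unchanged) is to return dist at the end of dijkstra() and collect the rows directly:
# at the end of Graph.dijkstra(), after the loop:
        self.printSolution(dist)
        return dist  # return the distances so callers can store them

# the driver loop then becomes:
matrix = [g.dijkstra(i) for i in range(9)]  # 9x9 matrix of shortest distances
print(matrix)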

convert numpy array of size n into a scalar with sub-item index numbers as scalar values

I want to convert a numpy array
a = array([[1, 0, 0, 0, 0, 0],
           [1, 0, 0, 0, 0, 0],
           [0, 0, 1, 0, 0, 0],
           [0, 0, 1, 0, 0, 0],
           [0, 0, 0, 0, 0, 1],
           [0, 0, 0, 1, 0, 0]])
into an array that holds, for each row, the position of the 1 in that row.
Desired output:
a1 = array([[1],[1],[3],[3],[6],[4]])
I tried this method:
a1 = []
for item in a:
    a1.append(np.where(item==1))
and I get this output:
a1 = [(array([0]),),
      (array([0]),),
      (array([2]),),
      (array([2]),),
      (array([6]),),
      (array([4]),)]
Is there a more pythonic way to achieve it?
After reading the question more carefully, I think this one-liner should solve your issue in a more pythonic way:
np.where(a==1)[1].reshape((a.shape[0],1))
If you prefer indexing that starts with 1 instead of 0 (as your desired output suggests), you just have to add 1 to the line above, i.e.
np.where(a==1)[1].reshape((a.shape[0],1)) + 1
Note that there seems to be an error in your example output above for the last two elements.
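As a side note (my own addition, assuming exactly one 1 per row), argmax gives the same result in one step:
a1 = a.argmax(axis=1).reshape(-1, 1) + 1  # 1-based column index of the single 1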

how to feed DataGenerator for KERAS multilabel issue?

I am working on a multilabel classification problem with Keras.
When I execute the code as shown below, I get the following error:
ValueError: Error when checking target: expected activation_19 to have 2 dimensions, but got array with shape (32, 6, 6)
This is because of the lists full of 0's and 1's in my labels dictionary, which don't fit keras.utils.to_categorical in the return statement, as I recently learned. Softmax also can't handle more than one "1".
I guess I first need a LabelEncoder and afterwards one-hot encoding for the labels, to avoid multiple 1's, which don't go together with softmax.
I hope someone can give me a hint on how to preprocess or transform the labels data to get the code fixed. I would appreciate it a lot.
Even a code snippet would be awesome.
The CSV looks like this:
Filename label1 label2 label3 label4 ... ID
abc1.jpg 1 0 0 1 ... id-1
def2.jpg 0 1 0 1 ... id-2
ghi3.jpg 0 0 0 1 ... id-3
...
import numpy as np
import keras
from keras.layers import *
from keras.models import Sequential

class DataGenerator(keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, list_IDs, labels, batch_size=32, dim=(224,224), n_channels=3,
                 n_classes=21, shuffle=True):
        'Initialization'
        self.dim = dim
        self.batch_size = batch_size
        self.labels = labels
        self.list_IDs = list_IDs
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.shuffle = shuffle
        self.on_epoch_end()

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in indexes]
        # Generate data
        X, y = self.__data_generation(list_IDs_temp)
        return X, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle == True:
            np.random.shuffle(self.indexes)

    def __data_generation(self, list_IDs_temp):
        'Generates data containing batch_size samples'  # X : (n_samples, *dim, n_channels)
        # Initialization
        X = np.empty((self.batch_size, *self.dim, self.n_channels))
        y = np.empty((self.batch_size, self.n_classes), dtype=int)
        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Store sample
            X[i,] = np.load('Folder with npy files/' + ID + '.npy')
            # Store class
            y[i] = self.labels[ID]
        return X, keras.utils.to_categorical(y, num_classes=self.n_classes)
-----------------------
# Parameters
params = {'dim': (224, 224),
          'batch_size': 32,
          'n_classes': 21,
          'n_channels': 3,
          'shuffle': True}

# Datasets
partition = partition
labels = labels

# Generators
training_generator = DataGenerator(partition['train'], labels, **params)
validation_generator = DataGenerator(partition['validation'], labels, **params)

# Design model
model = Sequential()
model.add(Conv2D(32, (3,3), input_shape=(224, 224, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
...
model.add(Flatten())
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dense(21))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

# Train model on dataset
model.fit_generator(generator=training_generator,
                    validation_data=validation_generator)
Since you already have the labels as vectors of 21 elements of 0 and 1, you shouldn't use keras.utils.to_categorical in __data_generation(self, list_IDs_temp). Just return X and y.
OK, I have a solution, but I'm not sure it's the best one:
from sklearn import preprocessing  # for LabelEncoder
from keras.utils import to_categorical

labels_list = [x[1] for x in labels.items()]  # get the list of all label sequences

def convert(seq):
    'Collapse a sequence into a single int, e.g. [1, 2, 3] becomes 123'
    return int("".join(map(str, seq)))

label_int = [convert(i) for i in labels_list]  # convert each sequence to an int
print(label_int)

le = preprocessing.LabelEncoder()
le.fit(label_int)
labels = le.classes_  # encode each int to get only the uniques
print(labels)

d = dict([(y, x) for x, y in enumerate(labels)])  # map each unique sequence to a label like 0, 1, 2, 3, ...
print(d)

labels_encoded = [d[i] for i in label_int]  # take every sequence and encode it with the label obtained
print(labels_encoded)

labels_encoded = to_categorical(labels_encoded)  # one-hot encode
print(labels_encoded)
This is not really clean, I think, but it works.
You need to change your last Dense layer to have a number of neurons equal to the length of the labels_encoded vectors (i.e. the number of unique sequences).
For the predictions, you will have the dict d that maps each predicted value back to your original sequence style.
Tell me if you need clarifications!
For a few test sequences, it gives you this:
labels = {'id-0': [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1],
          'id-1': [0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
          'id-2': [0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
          'id-3': [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1],
          'id-4': [0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]}
[100100001100000001011, 10100001100000000001, 100001100010000001, 100100001100000001011, 10100001100000000001]
[100001100010000001 10100001100000000001 100100001100000001011]
{100001100010000001: 0, 10100001100000000001: 1, 100100001100000001011: 2}
[2, 1, 0, 2, 1]
[[0. 0. 1.]
[0. 1. 0.]
[1. 0. 0.]
[0. 0. 1.]
[0. 1. 0.]]
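In this toy example there are 3 unique encoded sequences, so the last layers of the question's model would become (a sketch):
model.add(Dense(3))  # one neuron per unique encoded label
model.add(Activation('softmax'))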
EDIT after clarification
OK, I read a little more about the subject; once more, the problem with softmax is that it tries to maximize one class while minimizing the others.
So I would suggest keeping your arrays of 21 ones and zeros, but instead of using softmax, use sigmoid (to predict a probability between 0 and 1 for each class) together with binary_crossentropy.
And use a threshold for your predictions:
preds = model.predict(X_test)
preds[preds >= 0.5] = 1
preds[preds < 0.5] = 0
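Concretely, that suggestion means changing the final activation and the loss in the question's model (a sketch of just the affected lines):
model.add(Dense(21))
model.add(Activation('sigmoid'))  # independent probability per label
model.compile(loss='binary_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])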
Keep me posted on the results!

Loop through 2d array within a dictionary

I am looping through all values in a 2D array, which is held in a dictionary under the key Band_1:
{'Band_1': array([[0, 0, 0, ..., 0, 0, 0],
                  [0, 0, 0, ..., 0, 0, 0],
                  [0, 0, 0, ..., 0, 0, 0],
                  ...,
                  [0, 0, 0, ..., 0, 0, 0],
                  [0, 0, 0, ..., 0, 0, 0],
                  [0, 0, 0, ..., 0, 0, 0]], dtype=uint16)}
The code runs, but the array is 2650 x 2650 and I have 150+ dictionaries to process at each step, so it's very slow.
Note: for this example, there is only one key per dictionary, but that will not always be the case.
I have tried four different methods of looping through the array:
Method 1:
for key, bands in img.iteritems():
    for pixel in bands:
        for x in pixel:
            if x != noDataVal:
                x = x - dark_val
            else:
                x = noDataVal
Method 2:
for key, bands in img.iteritems():
    for pixel in bands.flat:
        if pixel != noDataVal:
            pixel = pixel - dark_val
        else:
            pixel = noDataVal
Method 3:
img = {k: [[i - dark_val for i in line if i != noDataVal] for line in data] for k, data in img.items()}
Method 4:
for key, band in image.iteritems():
    band = image[key]
    band_m = np.ma.masked_array(band, mask=noDataVal)
    band_m = band_m - dark_val
where:
dark_val = 75
noDataVal = 0
with timeit values over 5 loops as follows:
This is the first method: 18.446967368
This is the second method: 18.6967543083
This is the third method: 19.0136934398
This is the fourth method: 0.6860613911
Any improvement on these methods in terms of speed/efficiency?
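For comparison (not from the original post), a fully vectorized variant using np.where; note the cast to a signed type first, since subtracting dark_val from a uint16 pixel smaller than 75 would wrap around:
import numpy as np

for key, band in img.items():
    band = band.astype(np.int32)  # avoid uint16 underflow when subtracting
    img[key] = np.where(band != noDataVal, band - dark_val, noDataVal)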
