How can I create and shuffle a dataset for triplet mining in TensorFlow 2?

I'm working on a network trained with triplet mining. For it to work properly, my batches need to contain several images of the same class. The problem I'm currently facing is that I have 751 classes, for a total of 12,937 pictures, and a batch size of 48 pictures. When shuffling the dataset with the command below, the odds of getting several pictures of the same class in one batch are really low, which makes the triplet mining inefficient.
dataset = dataset.shuffle(12937)
What I need instead is a way of generating batches that contain a fixed number of pictures for each class represented in the batch. For example, with 12 classes per batch, there would be 4 pictures for each of them.
The other problem I'm facing is how to reshuffle this dataset at the end of every epoch, so that each epoch yields different batches that still satisfy the condition above: 12 classes per batch, 4 pictures for each of them.
Is there a proper way to do this? I can't really find one. Please let me know if I'm unclear or if you need further details.
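To make the requirement concrete, here is roughly what I mean, sketched with NumPy indexing (illustration only: images and labels are placeholder in-memory arrays, whereas my real data comes from TFRecord files):
import numpy as np
import tensorflow as tf

def epoch_indices(labels, classes_per_batch=12, samples_per_class=4, batches=100, seed=None):
    # Build, for one epoch, a flat list of sample indices grouped as
    # `classes_per_batch` classes x `samples_per_class` pictures per batch.
    rng = np.random.default_rng(seed)
    per_class = {c: np.flatnonzero(labels == c) for c in np.unique(labels)}
    batch_indices = []
    for _ in range(batches):
        chosen = rng.choice(list(per_class), size=classes_per_batch, replace=False)
        batch_indices.append(np.concatenate([
            rng.choice(per_class[c], size=samples_per_class,
                       replace=len(per_class[c]) < samples_per_class)
            for c in chosen]))
    return np.concatenate(batch_indices)

idx = epoch_indices(labels)  # recompute every epoch for a fresh shuffle
dataset = tf.data.Dataset.from_tensor_slices((images[idx], labels[idx])).batch(48)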
================ EDIT ================
I've been trying a few things, and came up with something that would do what I want. The function would be the following:
counter = 0.
# Assuming a format such as (data, label)
def predicate(data, label):
    global counter
    # counter is baked in as a constant when .filter() traces this function,
    # so each filtered dataset below keeps a single class
    allowed_labels = tf.constant([counter])
    isallowed = tf.equal(allowed_labels, tf.cast(label, tf.float32))
    reduced = tf.reduce_sum(tf.cast(isallowed, tf.float32))
    counter += 1
    return tf.greater(reduced, tf.constant(0.))
# @tf.function
def custom_shuffle(train_dataset, batch_size, samples_per_class=4, iterations_in_epoch=100, database='market'):
    assert batch_size % samples_per_class == 0, f'batch size must be a multiple of {samples_per_class}.'
    if database == 'market':
        class_nbr = 751
    else:
        raise Exception('Unsupported database yet')
    all_datasets = [train_dataset.filter(predicate) for _ in range(class_nbr)]  # Every element of this list is a dataset of one class
    for i in range(iterations_in_epoch):
        choice = tf.random.uniform(
            shape=(batch_size // samples_per_class,),
            minval=0,
            maxval=class_nbr,
            dtype=tf.dtypes.int64,
        )  # Which classes will be in the batch
        choice = tf.data.Dataset.from_tensor_slices(
            tf.concat([choice for _ in range(samples_per_class)], axis=0))  # samples_per_class pictures from each chosen class
        batch = tf.data.experimental.choose_from_datasets(all_datasets, choice)
        if i == 0:
            all_batches = batch
        else:
            all_batches = all_batches.concatenate(batch)
    all_batches = all_batches.batch(batch_size)
    return all_batches
It does what I want, however the returned dataset is extremely slow to iterate, making model training impossible. As per this thread, I understood that I needed to decorate custom_shuffle with @tf.function, like the one commented out above. However, when doing so, it raises the following error:
Traceback (most recent call last):
File "training.py", line 137, in <module>
main()
File "training.py", line 80, in main
train_dataset = get_dataset(TRAINING_FILENAMES, IMG_SIZE, BATCH_SIZE, database=database, func_type='train')
File "E:\Morgan\TransReID_TF\tfr_to_dataset.py", line 260, in get_dataset
dataset = custom_shuffle(dataset, batch_size)
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\def_function.py", line 780, in __call__
result = self._call(*args, **kwds)
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\def_function.py", line 846, in _call
return self._concrete_stateful_fn._filtered_call(canon_args, canon_kwds) # pylint: disable=protected-access
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\function.py", line 1843, in _filtered_call
return self._call_flat(
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\function.py", line 1923, in _call_flat
return self._build_call_outputs(self._inference_function.call(
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\function.py", line 545, in call
outputs = execute.execute(
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\execute.py", line 59, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: No unary variant device copy function found for direction: 1 and Variant type_index: class tensorflow::data::`anonymous namespace'::DatasetVariantWrapper
[[{{node BatchDatasetV2/_206}}]] [Op:__inference_custom_shuffle_11485]
Function call stack:
custom_shuffle
I don't understand this error and don't see how to fix it. Is there something I'm doing wrong?
PS: I'm aware the lack of minimal code to reproduce this behavior makes it hard to debug, I'll try to provide some as soon as possible.
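In the meantime, here is the direction I'm considering to avoid both the slow iteration and the @tf.function wrapping, in case it clarifies the question (an untested sketch, not my working code): build the per-class datasets once, cache and repeat them, and draw the whole epoch's class choices up front, so everything stays inside the tf.data graph.
class_nbr = 751
samples_per_class = 4
classes_per_batch = 12   # 48 / 4
batches_per_epoch = 100

def make_predicate(class_id):
    # One predicate per class, instead of the global-counter trick above
    def predicate(data, label):
        return tf.equal(tf.cast(label, tf.int64), class_id)
    return predicate

# cache() so each class is filtered out of the source only once,
# repeat() so choose_from_datasets never exhausts a class
per_class = [train_dataset.filter(make_predicate(c)).cache().repeat().shuffle(64)
             for c in range(class_nbr)]

# Draw the classes for every batch of the epoch in one go,
# then repeat each class id samples_per_class times
choice = tf.random.uniform((batches_per_epoch, classes_per_batch),
                           maxval=class_nbr, dtype=tf.int64)
choice = tf.reshape(tf.repeat(choice, samples_per_class, axis=1), [-1])
choice_ds = tf.data.Dataset.from_tensor_slices(choice)

dataset = tf.data.experimental.choose_from_datasets(per_class, choice_ds)
dataset = dataset.batch(classes_per_batch * samples_per_class)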

Related

lapply calling .csv for changes in a parameter

Good afternoon
I am currently trying to pull some data from Pushshift, but I am maxing out at 100 posts. Below is the code for pulling one day, which works great.
testdata1<-getPushshiftData(postType = "submission", size = 1000, before = "1546300800", after= "1546200800", subreddit = "mysubreddit", nest_level = 1)
I have a list of timestamps (UTCs) for the beginning and end of each day of a month. What I would like is the syntax to replace the "after" and "before" values for each day, with each day's results appended to the end of the pulled data. Even if it placed the data in a bunch of separate smaller datasets, I could work with that.
Here is my (feeble) attempt. "links" is the data frame with the UTCs
mydata<- lapply(1:30, function(x) getPushshiftData(postType = "submission", size = 1000, after= links$utcstart[,x],before = links$utcendstart[,x], subreddit = "mysubreddit", nest_level = 1))
Here is the error message I get: Error in links$utcstart[, x] : incorrect number of dimensions
I've also tried without the "function (x)" argument and get the following message:
Error in ifelse(is.null(after), "", sprintf("&after=%s", after)) :
object 'x' not found
Can anyone help with this?

Python3 - IndexError when trying to save a text file

I'm trying to follow this tutorial with my own local data files:
CNTK tutorial
I have the following function to save my data array into a txt file feedable to CNTK:
# Save the data files into a format compatible with CNTK text reader
def savetxt(filename, ndarray):
    dir = os.path.dirname(filename)
    if not os.path.exists(dir):
        os.makedirs(dir)
    if not os.path.isfile(filename):
        print("Saving", filename)
        with open(filename, 'w') as f:
            labels = list(map(' '.join, np.eye(11, dtype=np.uint).astype(str)))
            for row in ndarray:
                row_str = row.astype(str)
                label_str = labels[row[-1]]
                feature_str = ' '.join(row_str[:-1])
                f.write('|labels {} |features {}\n'.format(label_str, feature_str))
    else:
        print("File already exists", filename)
I have 2 ndarrays of the following shapes that I want to feed to the model:
train.shape
(1976L, 15104L)
test.shape
(1976L, 15104L)
Then I try to call the function like this:
# Save the train and test files (prefer our default path for the data)
data_dir = os.path.join("C:/Users", 'myself', "OneDrive", "IA Project", 'data', 'train')
if not os.path.exists(data_dir):
    data_dir = os.path.join("data", "IA Project")
print('Writing train text file...')
savetxt(os.path.join(data_dir, "Train-128x118_cntk_text.txt"), train)
print('Writing test text file...')
savetxt(os.path.join(data_dir, "Test-128x118_cntk_text.txt"), test)
print('Done')
and then I get the following error:
Writing train text file...
Saving C:/Users\A702628\OneDrive - Atos\Microsoft Capstone IA\Capstone data\train\Train-128x118_cntk_text.txt
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-24-b53d3c69b8d2> in <module>()
6
7 print ('Writing train text file...')
----> 8 savetxt(os.path.join(data_dir, "Train-128x118_cntk_text.txt"), train)
9
10 print ('Writing test text file...')
<ipython-input-23-610c077db694> in savetxt(filename, ndarray)
12 for row in ndarray:
13 row_str = row.astype(str)
---> 14 label_str = labels[row[-1]]
15 feature_str = ' '.join(row_str[:-1])
16 f.write('|labels {} |features {}\n'.format(label_str, feature_str))
IndexError: list index out of range
Can somebody please tell me what's going wrong with this part of the code, and how I could fix it? Thank you very much in advance.
Since you're using your own input data: are the labels in range for the one-hot table? The labels list built from np.eye(11) only has 11 entries (indices 0 to 10), so a label value outside that range would cause exactly this out-of-range error.
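A quick way to check that (a hedged sketch; train is assumed to be the ndarray passed to savetxt, with the label in its last column):
import numpy as np

label_col = train[:, -1].astype(int)
print("labels span", label_col.min(), "to", label_col.max())
# np.eye(11) above only covers labels 0..10; anything larger raises IndexError
assert label_col.min() >= 0 and label_col.max() <= 10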

Reading and Writing Arrays from Multiple HDF Files in IDL

I am fairly new to IDL, and I am trying to write a code that will take a MODIS HDF file (level three data MOD14A1 and MYD14A1 to be specific), read the array, and then write the data from the array preferably to a csv file, but ASCII would work, too. I have code that will allow me to do this for one file, but I want to be able to do this for multiple files. Essentially, I want it to read one HDF array, write it to a csv, move to the next HDF file, and then write that array to the same csv file in the next row. Any help here would be greatly appreciated. I've supplied the code I have so far to do this with one file.
filename = dialog_pickfile(filter = filter, path = path, title = title)
csv_file = 'Data.csv'
sd_id = HDF_SD_START(filename, /READ)
; read "FirePix", "MaxT21"
attr_index = HDF_SD_ATTRFIND(sd_id, 'FirePix')
HDF_SD_ATTRINFO, sd_id, attr_index, DATA = FirePix
attr_index = HDF_SD_ATTRFIND(sd_id, 'MaxT21')
HDF_SD_ATTRINFO, sd_id, attr_index, DATA = MaxT21
index = HDF_SD_NAMETOINDEX(sd_id, 'FireMask')
sds_id = HDF_SD_SELECT(sd_id, index)
HDF_SD_GETDATA, sds_id, FireMask
HDF_SD_ENDACCESS, sds_id
index = HDF_SD_NAMETOINDEX(sd_id, 'MaxFRP')
sds_id = HDF_SD_SELECT(sd_id, index)
HDF_SD_GETDATA, sds_id, MaxFRP
HDF_SD_ENDACCESS, sds_id
HDF_SD_END, sd_id
help, FirePix
print, FirePix, format = '(8I8)'
print, MaxT21, format = '("MaxT21:", F6.1, " K")'
help, FireMask, MaxFRP
WRITE_CSV, csv_file, FirePix
After I run this, and choose the correct file, this is the output I am getting:
FIREPIX LONG = Array[8]
0 4 0 0 3 12 3 0
MaxT21: 402.1 K
FIREMASK BYTE = Array[1200, 1200, 8]
MAXFRP LONG = Array[1200, 1200, 8]
The "FIREPIX" array is the one I want stored into a csv.
Thanks in advance for any help!
Instead of using WRITE_CSV, it is fairly simple to use the primitive IO routines to write a comma-separated array, i.e.:
openw, lun, csv_file, /get_lun
; the following line produces a single line the output CSV file
printf, lun, strjoin(strtrim(firepix, 2), ', ')
; TODO: do the above line as many times as necessary
free_lun, lun

Trying to find area using SHAPE@AREA

I am trying to get the area of each polygon in the .shp file and then append it to a new column in the attribute table.
import os
import arcpy
import math
folderpath = 'C:\Users\Michaelf\Desktop\GEOG M173'
arcpy.env.workspace = folderpath
arcpy.env.overwriteOutput = True
input_shp = folderpath + r'\lower48_county_2012_election.shp'
equal_shape = folderpath + r'\project_lower48.shp'
out_point = folderpath + r'\lower_48_centroid.shp'
out_shp = folderpath + r'\48_State_Centroids.shp'
totarea = []
arcpy.AddField_management(input_shp, "totarea")
geometryField = arcpy.Describe(totarea).shapeFieldName
cursor = arcpy.UpdateCursor(totarea)
for row in cursor:
    AreaValue = row.getValue(geometryField).area
    row.setValue("total_area", AreaValue)
    cursor.updateRow(row)
del row, cursor
print AreaValue
Below is one suggestion I've received, but I don't quite understand it:
with arcpy.da.SearchCursor(input_shp, ("OID@", "SHAPE@AREA")) as cursor:
    for row in cursor:
        print("Feature {0} has an area of {1}".format(row[0], row[1]))
Here is my error message:
Traceback (most recent call last):
File "C:/Users/Michaelf/Desktop/GEOG M173/test2.py", line 16, in <module>
geometryField = arcpy.Describe(totarea).shapeFieldName
File "C:\Program Files (x86)\ArcGIS\Desktop10.3\ArcPy\arcpy\__init__.py", line 1246, in Describe
return gp.describe(value)
File "C:\Program Files (x86)\ArcGIS\Desktop10.3\ArcPy\arcpy\geoprocessing\_base.py", line 374, in describe
self._gp.Describe(*gp_fixargs(args, True)))
RuntimeError: Object: Describe input value is not valid type
Since it's a geometry-centric database, Arc automatically stores the total area of a polygon as an attribute. There's no need to calculate it, which is why (I assume) that code snippet was suggested to you:
with arcpy.da.SearchCursor(input_shp, ("OID@", "SHAPE@AREA")) as cursor:
    for row in cursor:
        print("Feature {0} has an area of {1}".format(row[0], row[1]))
This does the following:
Create a cursor to go through the input shapefile input_shp, pulling just two attributes (the object ID and the total area).
Loop through the shapefile one row at a time.
For each row, print out row[0] (the object ID) and row[1] (the total area).
For further reading I'd take a look at this excellent answer on GIS.SE. It explores some of the options for accessing and calculating geometry in detail.
(Final note: the SHAPE@AREA attribute will be in the units of the data's projection. Be aware of this if you're getting unexpected numbers.)
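If you still want the value stored in a new column, something like this should work (a sketch only, assuming ArcGIS 10.1+ for arcpy.da; the field name "totarea" is just an example):
arcpy.AddField_management(input_shp, "totarea", "DOUBLE")
with arcpy.da.UpdateCursor(input_shp, ("SHAPE@AREA", "totarea")) as cursor:
    for row in cursor:
        row[1] = row[0]  # copy the geometry's area into the new field
        cursor.updateRow(row)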

Query or Expression for excluding certain values from DAL selection

I'm trying to exclude posts which have a tag named meta from my selection, by:
meta_id = db(db.tags.name == "meta").select().first().id
not_meta = ~db.posts.tags.contains(meta_id)
posts=db(db.posts).select(not_meta)
But those posts still show up in my selection.
What is the right way to write that expression?
My tables look like:
db.define_table('tags',
    db.Field('name', 'string'),
    db.Field('desc', 'text', default="")
)
db.define_table('posts',
    db.Field('title', 'string'),
    db.Field('message', 'text'),
    db.Field('tags', 'list:reference tags'),
    db.Field('time', 'datetime', default=datetime.utcnow())
)
I'm using Web2Py 1.99.7 on GAE with High Replication DataStore on Python 2.7.2
UPDATE:
I just tried posts=db(not_meta).select() as suggested by @Anthony, but it gives me a Ticket with the following Traceback:
Traceback (most recent call last):
File "E:\Programming\Python\web2py\gluon\restricted.py", line 205, in restricted
exec ccode in environment
File "E:/Programming/Python/web2py/applications/vote_up/controllers/default.py", line 391, in <module>
File "E:\Programming\Python\web2py\gluon\globals.py", line 173, in <lambda>
self._caller = lambda f: f()
File "E:/Programming/Python/web2py/applications/vote_up/controllers/default.py", line 8, in index
posts=db(not_meta).select()#orderby=settings.sel.posts, limitby=(0, settings.delta)
File "E:\Programming\Python\web2py\gluon\dal.py", line 7578, in select
return adapter.select(self.query,fields,attributes)
File "E:\Programming\Python\web2py\gluon\dal.py", line 3752, in select
(items, tablename, fields) = self.select_raw(query,fields,attributes)
File "E:\Programming\Python\web2py\gluon\dal.py", line 3709, in select_raw
filters = self.expand(query)
File "E:\Programming\Python\web2py\gluon\dal.py", line 3589, in expand
return expression.op(expression.first)
File "E:\Programming\Python\web2py\gluon\dal.py", line 3678, in NOT
raise SyntaxError, "Not suported %s" % first.op.__name__
SyntaxError: Not suported CONTAINS
UPDATE 2:
As ~ isn't currently working on GAE with Datastore, I'm using the following as a temporary work-around:
meta = db.posts.tags.contains(settings.meta_id)
all = db(db.posts).select()  # limitby=(0, settings.delta)
meta = db(meta).select()
posts = []
i = 0
for post in all:
    if i == settings.delta: break
    if post in meta: continue
    else:
        posts.append(post)
        i += 1
# settings.delta is a long integer to be used with limitby
Try:
meta_id = db(db.tags.name == "meta").select().first().id
not_meta = ~db.posts.tags.contains(meta_id)
posts = db(not_meta).select()
First, your initial query returns a complete Row object, so you need to pull out just the "id" field. Second, not_meta is a Query object, so it goes inside db(not_meta) to create a Set object defining the set of records to select (the select() method takes a list of fields to return for each record, as well as a few other arguments, such as orderby, groupby, etc.).
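For example, on a backend where ~ combined with contains() is supported (per UPDATE 2, GAE's Datastore is not one of them), the whole thing condenses to something like this sketch, using the fields defined above:
meta_id = db(db.tags.name == "meta").select().first().id
posts = db(~db.posts.tags.contains(meta_id)).select(orderby=~db.posts.time, limitby=(0, 10))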
