How to set max sequence length with a Hugging Face SageMaker estimator?

I'd like to increase the max sequence length from 128 to 512 (the maximum distilbert can handle.) I believe it's only using 128 tokens right now, because the training samples it prints out have an attention_mask with 128 values. This is my code:
import sagemaker
from sagemaker.huggingface import HuggingFace
# gets role for executing training job
role = sagemaker.get_execution_role()
hyperparameters = {
    'model_name_or_path': 'distilbert-base-uncased',
    'output_dir': '/opt/ml/model',
    'do_predict': True,
    'do_eval': True,
    'do_train': True,
    "train_file": "/opt/ml/input/data/train/train.csv",
    "validation_file": "/opt/ml/input/data/val/val.csv",
    "test_file": "/opt/ml/input/data/test/test.csv",
    "num_train_epochs": 50,
    "per_device_train_batch_size": 32
}
# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.6.1'}
# creates Hugging Face estimator
huggingface_estimator = HuggingFace(
    entry_point='run_glue.py',
    source_dir='./examples/pytorch/text-classification',
    instance_type='ml.p3.8xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    #use_spot_instances = True,
    #max_wait = 24*60*60+1,
    hyperparameters=hyperparameters
)
# starting the train job
huggingface_estimator.fit({'train': s3_input + "/train.csv",
                           'val': s3_input + "/val.csv",
                           'test': s3_input + "/test.csv"})
Inspecting run_glue.py, input arguments are taken here
model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
but the hyperparameters we can set seem to affect only training_args, while max_seq_length is set later in this file from data_args. I don't see an option in the Hugging Face estimator to pass anything other than hyperparameters. I could fork v4.6.1 and set this value manually, but that seems like overkill. Is there a proper way to pass just this value?

It turns out max_seq_length : 512 can just be plugged into the hyperparams. I likely typo'd this before as I was getting messages that the param wasn't being used.
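As a rough illustration of why this works: SageMaker hands the hyperparameters to the entry point as command-line flags, and the script's parser then distributes them across its argument dataclasses by field name, so a data argument such as max_seq_length can sit in the same dict as the training arguments. The sketch below imitates that routing with the standard library only. DataArgs, TrainArgs and parse_into_dataclasses are hypothetical stand-ins for illustration, not the actual transformers classes.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class DataArgs:  # hypothetical stand-in for the script's data arguments
    max_seq_length: int = 128


@dataclass
class TrainArgs:  # hypothetical stand-in for the training arguments
    num_train_epochs: int = 3
    per_device_train_batch_size: int = 8


def parse_into_dataclasses(argv, *classes):
    """Parse one flat list of flags and split the values by dataclass field."""
    parser = argparse.ArgumentParser()
    for cls in classes:
        for f in fields(cls):
            parser.add_argument(f"--{f.name}", type=f.type, default=f.default)
    ns = parser.parse_args(argv)
    return tuple(
        cls(**{f.name: getattr(ns, f.name) for f in fields(cls)}) for cls in classes
    )


hyperparameters = {
    "num_train_epochs": 50,
    "per_device_train_batch_size": 32,
    "max_seq_length": 512,  # data argument, passed like any other hyperparameter
}

# Flatten the dict into "--key value" flags, as a launcher would.
argv = [token for k, v in hyperparameters.items() for token in (f"--{k}", str(v))]
data_args, train_args = parse_into_dataclasses(argv, DataArgs, TrainArgs)
print(data_args, train_args)
```

A misspelled key would make this toy parser exit with an "unrecognized arguments" error, which is consistent with the typo explanation above.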

Related

Lasso via GridSearchCV: ConvergenceWarning: Objective did not converge

I am trying to find the optimal parameter of a Lasso regression:
alpha_tune = {'alpha': np.linspace(start=0.000005, stop=0.02, num=200)}
model_tuner = Lasso(fit_intercept=True)
cross_validation = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
model = GridSearchCV(estimator=model_tuner, param_grid=alpha_tune, cv=cross_validation, scoring='neg_mean_squared_error', n_jobs=-1).fit(features_train_std, labels_train)
print(model.best_params_['alpha'])
My variables are demeaned and standardized. But I get the following warning:
ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.279e+02, tolerance: 6.395e-01
I know this warning has been reported several times, but none of the previous posts explain how to solve it. In my case, the warning appears because the lower bound 0.000005 is very small, but this is a reasonable value, as indicated by solving the tuning problem via information criteria:
lasso_aic = LassoLarsIC(criterion='aic', fit_intercept=True, eps=1e-16, normalize=False)
lasso_aic.fit(X_train_std, y_train)
print('Lambda: {:.8f}'.format(lasso_aic.alpha_))
lasso_bic = LassoLarsIC(criterion='bic', fit_intercept=True, eps=1e-16, normalize=False)
lasso_bic.fit(X_train_std, y_train)
print('Lambda: {:.8f}'.format(lasso_bic.alpha_))
AIC and BIC give values of around 0.000008. How can this warning be solved?
Increasing max_iter in Lasso beyond its default of 1000 will do the job:
alpha_tune = {'alpha': np.linspace(start=0.000005, stop=0.02, num=200)}
model_tuner = Lasso(fit_intercept=True, max_iter=5000)
cross_validation = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
model = GridSearchCV(estimator=model_tuner, param_grid=alpha_tune, cv=cross_validation, scoring='neg_mean_squared_error', n_jobs=-1).fit(features_train_std, labels_train)
print(model.best_params_['alpha'])

How can I create and shuffle a dataset for triplet mining in TensorFlow 2?

I'm working on a network that uses triplet mining for training. For it to work properly, my batches need to contain several images of the same class. The problem I'm facing is that I have 751 classes, for a total of 12,937 pictures, and a batch size of 48 pictures. When I shuffle the dataset using the command below, the odds of getting pictures from the same class in one batch are really low, which makes the triplet mining inefficient.
dataset = dataset.shuffle(12937)
What I would need instead is a way of generating batches that contain a specific number of pictures for every class represented in this batch. As an example, let's say here that I want 12 classes per batch, there would be 4 pictures for each of them.
Another problem I'm facing is how would I shuffle this dataset at the end of every epoch so that I can have different batches that still follow the condition fixed above, that is 12 classes, 4 pictures for each one of them?
Is there any proper way to do it? I can't really find one. Please let me know if I'm unclear, and if you need further details.
================ EDIT ================
I've been trying a few things, and came up with something that would do what I want. The function would be the following:
counter = 0.
# Assuming a format such as (data, label)
def predicate(data, label):
    global counter
    allowed_labels = tf.constant([counter])
    isallowed = tf.equal(allowed_labels, tf.cast(label, tf.float32))
    reduced = tf.reduce_sum(tf.cast(isallowed, tf.float32))
    counter += 1
    return tf.greater(reduced, tf.constant(0.))
# @tf.function
def custom_shuffle(train_dataset, batch_size, samples_per_class=4, iterations_in_epoch=100, database='market'):
    assert batch_size % samples_per_class == 0, f'batch size must be a multiple of {samples_per_class}.'
    if database == 'market':
        class_nbr = 751
    else:
        raise Exception('Unsupported database yet')
    all_datasets = [train_dataset.filter(predicate) for _ in range(class_nbr)]  # Every element of this array is a dataset of one class
    for i in range(iterations_in_epoch):
        choice = tf.random.uniform(
            shape=(batch_size // samples_per_class,),
            minval=0,
            maxval=class_nbr,
            dtype=tf.dtypes.int64,
        )  # Which classes will be in batch
        # Exactly samples_per_class (4 by default) pictures from each class in the batch
        choice = tf.data.Dataset.from_tensor_slices(tf.concat([choice for _ in range(samples_per_class)], axis=0))
        batch = tf.data.experimental.choose_from_datasets(all_datasets, choice)
        if i == 0:
            all_batches = batch
        else:
            all_batches = all_batches.concatenate(batch)
    all_batches = all_batches.batch(batch_size)
    return all_batches
It does what I want; however, the returned dataset is extremely slow to iterate over, which makes training the model impossible. As per this thread, I understood that I needed to decorate custom_shuffle with @tf.function, like the one commented out above. However, when I do so, it raises the following error:
Traceback (most recent call last):
File "training.py", line 137, in <module>
main()
File "training.py", line 80, in main
train_dataset = get_dataset(TRAINING_FILENAMES, IMG_SIZE, BATCH_SIZE, database=database, func_type='train')
File "E:\Morgan\TransReID_TF\tfr_to_dataset.py", line 260, in get_dataset
dataset = custom_shuffle(dataset, batch_size)
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\def_function.py", line 780, in __call__
result = self._call(*args, **kwds)
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\def_function.py", line 846, in _call
return self._concrete_stateful_fn._filtered_call(canon_args, canon_kwds) # pylint: disable=protected-access
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\function.py", line 1843, in _filtered_call
return self._call_flat(
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\function.py", line 1923, in _call_flat
return self._build_call_outputs(self._inference_function.call(
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\function.py", line 545, in call
outputs = execute.execute(
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\execute.py", line 59, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: No unary variant device copy function found for direction: 1 and Variant type_index: class tensorflow::data::`anonymous namespace'::DatasetVariantWrapper
[[{{node BatchDatasetV2/_206}}]] [Op:__inference_custom_shuffle_11485]
Function call stack:
custom_shuffle
Which I don't understand, and don't see how to fix.
Is there something I'm doing wrong?
PS: I'm aware the lack of minimal code to reproduce this behavior makes it hard to debug, I'll try to provide some as soon as possible.
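The batch constraint described above (12 classes, 4 pictures each, re-drawn every epoch) can be sketched independently of TensorFlow as a plain index sampler. This is only a minimal sketch under stated assumptions: pk_batches and its parameters are hypothetical names, labels is assumed to be an in-memory list with one class id per picture, and classes with fewer pictures than samples_per_class are simply skipped.

```python
import random
from collections import defaultdict


def pk_batches(labels, classes_per_batch=12, samples_per_class=4, seed=None):
    """Yield lists of sample indices with exactly P classes x K samples each."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Only classes with at least K samples can fill their slot without repeats.
    eligible = [c for c, idx in by_class.items() if len(idx) >= samples_per_class]
    rng.shuffle(eligible)

    # Walk the shuffled class list in groups of P; drop the incomplete tail.
    for start in range(0, len(eligible) - classes_per_batch + 1, classes_per_batch):
        batch = []
        for c in eligible[start:start + classes_per_batch]:
            batch.extend(rng.sample(by_class[c], samples_per_class))
        yield batch


# Toy data (made up): 30 classes with 6 pictures each.
labels = [c for c in range(30) for _ in range(6)]
epoch_1 = list(pk_batches(labels, seed=1))
epoch_2 = list(pk_batches(labels, seed=2))
print(len(epoch_1), len(epoch_1[0]))
```

Calling the generator again at the start of each epoch re-draws both the class grouping and the pictures picked per class. One option might be to feed the resulting indices into the input pipeline, for example through tf.data.Dataset.from_generator, but that wiring is not shown here.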

Unable to extract/list all event logs in a Watson Assistant workspace

Please help. I was trying to call the Watson Assistant endpoint
https://gateway.watsonplatform.net/assistant/api/v1/workspaces/myworkspace/logs?version=2018-09-20 to get the list of all events
and filter by date range using these params:
var param =
{ workspace_id: '{myworkspace}',
page_limit: 100000,
filter: 'response_timestamp%3C2018-17-12,response_timestamp%3E2019-01-01'}
but I got the empty response below.
{
"logs": [],
"pagination": {}
}
Couple of things to check.
1. You have 2018-17-12, which is not a valid date: it reads as "the 12th day of the 17th month of 2018".
2. Assuming the date should be a valid one, your search says "Documents that are Before 17th Dec 2018 and after 1st Jan 2019". Which would return no documents.
3. Logs are only generated when you call the message() method through the API. So check your logging page in the tooling to see if you even have logs.
4. If you have a lite account logs are only stored for 7 days and then deleted. To keep logs longer you need to upgrade to a standard account.
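Points 1 and 2 can be guarded against by building the filter string from real date objects instead of typing it by hand. The helper below is a hypothetical sketch, not part of any Watson SDK; it reuses the field name and the < and > operators from the question, and it leaves the string unencoded on the assumption that the HTTP client or SDK encodes query parameters itself.

```python
from datetime import date


def timestamp_range_filter(start, end):
    """Build a filter for start < response_timestamp < end."""
    if start >= end:
        raise ValueError("start must be earlier than end")
    # isoformat() always yields YYYY-MM-DD, so month and day cannot be swapped.
    return (
        f"response_timestamp>{start.isoformat()},"
        f"response_timestamp<{end.isoformat()}"
    )


log_filter = timestamp_range_filter(date(2018, 12, 17), date(2019, 1, 1))
print(log_filter)
```

Constructing date(2018, 17, 12) raises a ValueError, so the swapped month and day from the question would be caught before any request is sent.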
Although not directly related to your issue, be aware that page_limit has an upper hard coded limit (IIRC 200-300?). So you may ask for 100,000 records, but it won't give it to you.
This is sample python code (unsupported) that is using pagination to read the logs:
from urllib.parse import urlparse, parse_qs
from watson_developer_cloud import AssistantV1
username = '...'
password = '...'
workspace_id = '....'
url = '...'
version = '2018-09-20'
c = AssistantV1(url=url, version=version, username=username, password=password)
totalpages = 999
pagelimit = 200
logs = []
page_count = 1
cursor = None
count = 0
x = { 'pagination': 'DUMMY' }
# Keep requesting pages until the response no longer carries pagination data
while x['pagination']:
    if page_count > totalpages:
        break
    print('Reading page {}. '.format(page_count), end='')
    x = c.list_logs(workspace_id=workspace_id, cursor=cursor, page_limit=pagelimit)
    if x is None: break
    print('Status: {}'.format(x.get_status_code()))
    x = x.get_result()
    logs.append(x['logs'])
    count = count + len(x['logs'])
    page_count = page_count + 1
    # The cursor for the next page is a query parameter of next_url
    if 'pagination' in x and 'next_url' in x['pagination']:
        p = x['pagination']['next_url']
        u = urlparse(p)
        query = parse_qs(u.query)
        cursor = query['cursor'][0]
Your logs object should contain the logs.
I believe the limit is 500, and then we return a pagination URL so you can get the next 500. I don't think this is the issue, but once you start getting logs back it's good to know.

Exception: decimal.InvalidOperation raised when saving a Django data model

I am storing crypto-currency data into a Django data model (using Postgres database). The vast majority of the records are saved successfully. But, on one record in particular I am getting an exception decimal.InvalidOperation.
The weird thing is, I can't see anything different about the values being saved in the problematic record from any of the others that save successfully. I have included a full stack trace on paste bin. Before the data is saved, I have outputted raw values to the debug log. The following is the data model I'm saving the data to. And the code that saves the data to the data model.
I'm stumped! Anyone know what the problem is?
Data Model
class OHLCV(m.Model):
""" Candles-stick data (open, high, low, close, volume) """
# class variables
_field_names = None
timeframes = ['1m', '1h', '1d']
# database fields
timestamp = m.DateTimeField(default=timezone.now)
market = m.ForeignKey('bc.Market', on_delete=m.SET_NULL, null=True, related_query_name='ohlcv_markets', related_name='ohlcv_market')
timeframe = m.DurationField() # 1 minute, 5 minute, 1 hour, 1 day, or the like
open = m.DecimalField(max_digits=20, decimal_places=10)
high = m.DecimalField(max_digits=20, decimal_places=10)
low = m.DecimalField(max_digits=20, decimal_places=10)
close = m.DecimalField(max_digits=20, decimal_places=10)
volume = m.DecimalField(max_digits=20, decimal_places=10)
Code Which Saves the Data Model
@classmethod
def fetch_ohlcv(cls, market:Market, timeframe:str, since=None, limit=None):
    """
    Fetch OHLCV data and store it in the database
    :param market:
    :type market: bc.models.Market
    :param timeframe: '1m', '5m', '1h', '1d', or the like
    :type timeframe: str
    :param since:
    :type since: datetime
    :param limit:
    :type limit: int
    """
    global log
    if since:
        since = since.timestamp()*1000
    exchange = cls.get_exchange()
    data = exchange.fetch_ohlcv(market.symbol, timeframe, since, limit)
    timeframe = cls.parse_timeframe_string(timeframe)
    for d in data:
        try:
            timestamp = datetime.fromtimestamp(d[0] / 1000, tz=timezone.utc)
            log.debug(f'timestamp={timestamp}, market={market}, timeframe={timeframe}, open={d[1]}, high={d[2]}, low={d[3]}, close={d[4]}, volume={d[5]}')
            cls.objects.create(
                timestamp=timestamp,
                market=market,
                timeframe=timeframe,
                open=d[1],
                high=d[2],
                low=d[3],
                close=d[4],
                volume=d[5],
            )
        except IntegrityError:
            pass
        except decimal.InvalidOperation as e:
            error_log_stack(e)
Have a look at your data and check whether it fits within the field limitations:
The total number of digits must fit in max_digits;
The number of decimal places must not exceed decimal_places;
And according to the DecimalValidator: the number of whole digits must not be greater than max_digits - decimal_places.
I'm not sure how your fetch_ohlcv function fills the data array, but if there is a division involved, the number of decimal places may be greater than 10.
The problem I had, that brought me here, was too many digits in the integer part therefore failing the last requirement.
Check this answer for more information on a similar issue.
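The limits listed above can be checked on the raw values before saving. The function below is a rough sketch modeled on those rules rather than Django's own validator code; decimal_field_problems is a hypothetical name, and the defaults match the max_digits=20, decimal_places=10 fields from the question. The second part reproduces the exception with the decimal module alone, on the assumption that the field value is quantized in a similar way when it is prepared for the database.

```python
from decimal import Context, Decimal, InvalidOperation


def decimal_field_problems(value, max_digits=20, decimal_places=10):
    """Return the reasons why value would not fit the field (empty if it fits)."""
    number = Decimal(str(value))
    if not number.is_finite():
        return ["not a finite number"]
    _sign, digits, exponent = number.as_tuple()
    decimals = max(-exponent, 0)
    whole_digits = max(len(digits) + exponent, 0)
    problems = []
    if decimals > decimal_places:
        problems.append(f"{decimals} decimal places > {decimal_places}")
    if whole_digits > max_digits - decimal_places:
        problems.append(f"{whole_digits} whole digits > {max_digits - decimal_places}")
    return problems


value = Decimal("12345678901.5")  # made-up value with 11 whole digits
print(decimal_field_problems(value))

# Rounding to 10 decimal places at a precision of 20 digits would need
# 21 digits here, which the decimal module reports as InvalidOperation.
try:
    value.quantize(Decimal(1).scaleb(-10), context=Context(prec=20))
except InvalidOperation:
    print("InvalidOperation raised")
```

Logging the output of such a check next to the debug line in fetch_ohlcv would show which of the five fields exceeds the limits for the failing record.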

In Django how can I run a custom clean function on fixture data during import and validation?

In a ModelForm I can write a clean_<field_name> member function to automatically validate and clean up data entered by a user, but what can I do about dirty json or csv files (fixtures) during a manage.py loaddata?
Fixtures loaded with loaddata are assumed to contain clean data that doesn't need validation (usually as an inverse operation to a prior dumpdata), so the short answer is that loaddata isn't the approach you want if you need to clean your inputs.
However, you can probably use some of the underpinnings of loaddata while implementing your custom data cleaning code. I'm sure you can easily script something that uses the Django serialization libs to read your existing data files in and then saves the resulting objects normally after the data has been cleaned up.
In case others want to do something similar, I defined a model method to do the cleaning (so it can also be called from ModelForms):
# Defined on the model class; requires `import re` at module level
MAX_ZIPCODE_DIGITS = 9
MIN_ZIPCODE_DIGITS = 5
def clean_zip_code(self, s=None):
    #s = str(s or self.zip_code)
    if not s: return None
    s = re.sub(r"\D", "", s)  # keep digits only
    if len(s) > self.MAX_ZIPCODE_DIGITS:
        s = s[:self.MAX_ZIPCODE_DIGITS]
    if len(s) in (self.MIN_ZIPCODE_DIGITS-1, self.MAX_ZIPCODE_DIGITS-1):
        s = '0' + s # FIXME: deal with other intermediate lengths
    if len(s) >= self.MAX_ZIPCODE_DIGITS:
        s = s[:self.MIN_ZIPCODE_DIGITS] + '-' + s[self.MIN_ZIPCODE_DIGITS:]
    return s
Then I wrote a standalone Python script to clean up my legacy json files using any clean_ methods found among the models.
import os, json
def clean_json(app='XYZapp', model='Entity', fields='zip_code', cleaner_prefix='clean_'):
    # Set the DJANGO_SETTINGS_MODULE environment variable.
    # On Django 1.7+, django.setup() must also be called before importing models.
    os.environ['DJANGO_SETTINGS_MODULE'] = app + '.settings'
    settings = __import__(app + '.settings').settings
    models = __import__(app + '.models').models
    fpath = os.path.join(settings.SITE_PROJECT_PATH, 'fixtures', model + '.json')
    if isinstance(fields, str):
        fields = [fields]
    Ns = []
    for field in fields:
        try:
            instance = getattr(models, model)()
        except AttributeError:
            print('No model named %s could be found' % (model,))
            continue
        try:
            cleaner = getattr(instance, cleaner_prefix + field)
        except AttributeError:
            print('No cleaner method named %s.%s could be found' % (model, cleaner_prefix + field))
            continue
        print('Cleaning %s using %s.%s...' % (fpath, model, cleaner.__name__))
        with open(fpath, 'r') as fin:
            l = json.load(fin)
        before = len(l)
        cleans = 0
        for i in range(len(l)):
            if 'fields' in l[i] and field in l[i]['fields']:
                l[i]['fields'][field] = cleaner(l[i]['fields'][field]) # cleaner returns None to delete records
                cleans += 1
        after = len(l)
        assert after > .5*before
        Ns += [(before, after, cleans)]
        print('Writing %d/%d (new/old) records after %d cleanups...' % Ns[-1])
        with open(fpath, 'w') as fout:
            fout.write(json.dumps(l, indent=2, sort_keys=True))
    return Ns
if __name__ == '__main__':
    clean_json()
