Evaluating a test dataset using eval() in LightGBM

I have trained a ranking model with LightGBM with the objective 'lambdarank'.
I want to evaluate my model to get the nDCG score for my test dataset using the best iteration, but I have never been able to use either the lightgbm.Booster.eval() or the lightgbm.Booster.eval_train() function.
First, I have created 3 dataset instances, namely the train set, valid set and test set:
lgb_train = lgb.Dataset(x_train, y_train, group=query_train, free_raw_data=False)
lgb_valid = lgb.Dataset(x_valid, y_valid, reference=lgb_train, group=query_valid, free_raw_data=False)
lgb_test = lgb.Dataset(x_test, y_test, group=query_test)
I then train my model using lgb_train and lgb_valid:
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=1500,
                categorical_feature=chosen_cate_features,
                valid_sets=[lgb_train, lgb_valid],
                evals_result=evals_result,
                early_stopping_rounds=150)
When I call the eval() or the eval_train() function after training, I get an error:
gbm.eval(data=lgb_test,name='test')
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-122-7ff5ef5136b8> in <module>()
----> 1 gbm.eval(data=lgb_test,name='test')
/usr/local/lib/python3.6/dist-packages/lightgbm/basic.py in eval(self, data,
name, feval)
1925 raise TypeError("Can only eval for Dataset instance")
1926 data_idx = -1
-> 1927 if data is self.train_set:
1928 data_idx = 0
1929 else:
AttributeError: 'Booster' object has no attribute 'train_set'
gbm.eval_train()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-123-0ce5fa3139f5> in <module>()
----> 1 gbm.eval_train()
/usr/local/lib/python3.6/dist-packages/lightgbm/basic.py in eval_train(self,
feval)
1956 List with evaluation results.
1957 """
-> 1958 return self.__inner_eval(self.__train_data_name, 0, feval)
1959
1960 def eval_valid(self, feval=None):
/usr/local/lib/python3.6/dist-packages/lightgbm/basic.py in
__inner_eval(self, data_name, data_idx, feval)
2352 """Evaluate training or validation data."""
2353 if data_idx >= self.__num_dataset:
-> 2354 raise ValueError("Data_idx should be smaller than number
of dataset")
2355 self.__get_eval_info()
2356 ret = []
ValueError: Data_idx should be smaller than number of dataset
And when I call the eval_valid() function, it returns an empty list.
Can anyone tell me how to properly evaluate a LightGBM model and get the nDCG score on the test set? Thanks.

If you add keep_training_booster=True as an argument to your lgb.train call, the returned booster object will be able to execute eval and eval_train (though eval_valid would still return an empty list for some reason, even when valid_sets is provided in lgb.train).
Documentation says:
keep_training_booster (bool, optional (default=False)) – Whether the returned Booster will be used to keep training. If False, the returned value will be converted into _InnerPredictor before returning.
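For illustration, here is a minimal sketch of my own (not from the original answer), reusing the variable names from the question, of how that could be wired up:

gbm = lgb.train(params,
                lgb_train,
                num_boost_round=1500,
                categorical_feature=chosen_cate_features,
                valid_sets=[lgb_train, lgb_valid],
                evals_result=evals_result,
                early_stopping_rounds=150,
                keep_training_booster=True)  # keep the full Booster so eval()/eval_train() stay usable

# nDCG on the held-out test Dataset ('test' is just the label used in the returned results);
# note: lgb_test may need to be built with reference=lgb_train so it shares the same bin mappers
print(gbm.eval(data=lgb_test, name='test'))

# the same metrics computed on the training Dataset
print(gbm.eval_train())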

Related

How can I create and shuffle a dataset for triplet mining in TensorFlow 2?

I'm working on a network using triplet mining for training. In order to make it work properly, I need my batches to contain several images of the same class. The problem I'm currently facing is that I have 751 classes, for a total of 12,937 pictures, and a batch size of 48 pictures. When shuffling the dataset using the command below, the odds of getting pictures from the same class are really low, making the triplet mining inefficient.
dataset = dataset.shuffle(12937)
What I would need instead is a way of generating batches that contain a specific number of pictures for every class represented in the batch. As an example, if I want 12 classes per batch, there would be 4 pictures for each of them.
Another problem is how I would shuffle this dataset at the end of every epoch so that I get different batches that still satisfy the condition above: 12 classes, with 4 pictures for each of them.
Is there a proper way to do it? I can't really find one. Please let me know if I'm unclear or if you need further details.
================ EDIT ================
I've been trying a few things, and came up with something that would do what I want. The function would be the following:
counter = 0.

# Assuming a format such as (data, label)
def predicate(data, label):
    global counter
    allowed_labels = tf.constant([counter])
    isallowed = tf.equal(allowed_labels, tf.cast(label, tf.float32))
    reduced = tf.reduce_sum(tf.cast(isallowed, tf.float32))
    counter += 1
    return tf.greater(reduced, tf.constant(0.))

# @tf.function
def custom_shuffle(train_dataset, batch_size, samples_per_class=4, iterations_in_epoch=100, database='market'):
    assert batch_size % samples_per_class == 0, f'batch size must be a multiple of {samples_per_class}.'
    if database == 'market':
        class_nbr = 751
    else:
        raise Exception('Unsupported database yet')
    # Every element of this list is a dataset holding the images of one class
    all_datasets = [train_dataset.filter(predicate) for _ in range(class_nbr)]
    for i in range(iterations_in_epoch):
        choice = tf.random.uniform(
            shape=(batch_size // samples_per_class,),
            minval=0,
            maxval=class_nbr,
            dtype=tf.dtypes.int64,
        )  # Which classes will be in the batch
        # Exactly 4 pictures from each chosen class in the batch
        choice = tf.data.Dataset.from_tensor_slices(tf.concat([choice for _ in range(4)], axis=0))
        batch = tf.data.experimental.choose_from_datasets(all_datasets, choice)
        if i == 0:
            all_batches = batch
        else:
            all_batches = all_batches.concatenate(batch)
    all_batches = all_batches.batch(batch_size)
    return all_batches
It does what I want. However, the returned dataset is extremely slow to iterate, making model training impossible. As per this thread, I understood that I needed to decorate custom_shuffle with @tf.function, as in the commented-out line above. However, when doing so, it raises the following error:
Traceback (most recent call last):
File "training.py", line 137, in <module>
main()
File "training.py", line 80, in main
train_dataset = get_dataset(TRAINING_FILENAMES, IMG_SIZE, BATCH_SIZE, database=database, func_type='train')
File "E:\Morgan\TransReID_TF\tfr_to_dataset.py", line 260, in get_dataset
dataset = custom_shuffle(dataset, batch_size)
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\def_function.py", line 780, in __call__
result = self._call(*args, **kwds)
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\def_function.py", line 846, in _call
return self._concrete_stateful_fn._filtered_call(canon_args, canon_kwds) # pylint: disable=protected-access
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\function.py", line 1843, in _filtered_call
return self._call_flat(
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\function.py", line 1923, in _call_flat
return self._build_call_outputs(self._inference_function.call(
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\function.py", line 545, in call
outputs = execute.execute(
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\execute.py", line 59, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: No unary variant device copy function found for direction: 1 and Variant type_index: class tensorflow::data::`anonymous namespace'::DatasetVariantWrapper
[[{{node BatchDatasetV2/_206}}]] [Op:__inference_custom_shuffle_11485]
Function call stack:
custom_shuffle
I don't understand this error and don't see how to fix it. Is there something I'm doing wrong?
PS: I'm aware that the lack of minimal code to reproduce this behavior makes it hard to debug; I'll try to provide some as soon as possible.
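For reference, a hypothetical usage of custom_shuffle could look like the sketch below (my own illustration; images and labels are assumed to be pre-existing arrays of pictures and integer class ids, which the post does not show):

import tensorflow as tf

# images: array of shape (12937, H, W, 3), labels: int array of shape (12937,)
train_dataset = tf.data.Dataset.from_tensor_slices((images, labels))
batched = custom_shuffle(train_dataset, batch_size=48, samples_per_class=4)

for data, label in batched.take(1):
    print(data.shape, label)  # one class-balanced batch of 48 pictures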

Tensorflow: convert PrefetchDataset to BatchDataset

With the latest TensorFlow version 2.3.1 I am trying to follow the basic text classification example at: https://www.tensorflow.org/tutorials/keras/text_classification. Instead of creating the dataset from a directory as the example does, I am using a csv file:
SELECT_COLUMNS = ['SentimentText','Sentiment']
LABEL_COLUMN = 'Sentiment'
LABELS = [0, 1]
def get_dataset(file_path, **kwargs):
    dataset = tf.data.experimental.make_csv_dataset(
        file_path,
        batch_size=3,  # Artificially small to make examples easier to show.
        label_name=LABEL_COLUMN,
        na_value="?",
        num_epochs=1,
        ignore_errors=True,
        **kwargs)
    return dataset
all_data = get_dataset(data_path, select_columns=SELECT_COLUMNS)
As a result I get:
type(all_data)
tensorflow.python.data.ops.dataset_ops.PrefetchDataset
The example loads data from a directory with:
batch_size = 32
seed = 42
raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)
And gets a dataset of another type:
type(raw_train_ds)
tensorflow.python.data.ops.dataset_ops.BatchDataset
Now when I try to standardise and vectorise the data with the functions from the example:
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
    return tf.strings.regex_replace(stripped_html,
                                    '[%s]' % re.escape(string.punctuation),
                                    '')

max_features = 10000
sequence_length = 250

vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length)
And apply them to my dataset, I get an error:
# Make a text-only dataset (without labels), then call adapt
train_text = all_data.map(lambda x, y: x)
vectorize_layer.adapt(train_text)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-20-1f1fc445912d> in <module>
1 # Make a text-only dataset (without labels), then call adapt
2 train_text = all_data.map(lambda x, y: x)
----> 3 vectorize_layer.adapt(train_text)
/opt/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/layers/preprocessing/text_vectorization.py in adapt(self, data, reset_state)
378 shape = dataset_ops.get_legacy_output_shapes(data)
379 if not isinstance(shape, tensor_shape.TensorShape):
--> 380 raise ValueError("The dataset passed to 'adapt' must contain a single "
381 "tensor value.")
382 if shape.rank == 0:
ValueError: The dataset passed to 'adapt' must contain a single tensor value.
How do I convert a PrefetchDataset to a BatchDataset?
You could use the tf.stack method to pack the features into a single array. The function below is from the Custom training: walkthrough in TensorFlow.
def pack_features_vector(features, labels):
    features = tf.stack(list(features.values()), axis=1)
    return features, labels
all_data = get_dataset(data_path, select_columns=SELECT_COLUMNS)
train_dataset = all_data.map(pack_features_vector)
train_text = train_dataset.map(lambda x, y: x)
vectorize_layer.adapt(train_text)
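As a side note of my own (not part of the answer above): since SELECT_COLUMNS contains a single text column, another option is to pull that column straight out of the feature dictionary that make_csv_dataset yields, so adapt receives a plain string tensor:

# Hypothetical alternative, assuming the same column names as in the question.
# make_csv_dataset yields (OrderedDict of features, label) pairs.
train_text = all_data.map(lambda features, label: features['SentimentText'])
vectorize_layer.adapt(train_text)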

Python3 - IndexError when trying to save a text file

I'm trying to follow this tutorial with my own local data files:
CNTK tutorial
I have the following function to save my data array into a txt file feedable to CNTK:
# Save the data files into a format compatible with CNTK text reader
def savetxt(filename, ndarray):
    dir = os.path.dirname(filename)
    if not os.path.exists(dir):
        os.makedirs(dir)
    if not os.path.isfile(filename):
        print("Saving", filename)
        with open(filename, 'w') as f:
            labels = list(map(' '.join, np.eye(11, dtype=np.uint).astype(str)))
            for row in ndarray:
                row_str = row.astype(str)
                label_str = labels[row[-1]]
                feature_str = ' '.join(row_str[:-1])
                f.write('|labels {} |features {}\n'.format(label_str, feature_str))
    else:
        print("File already exists", filename)
I have 2 ndarrays of the following shapes that I want to feed to the model:
train.shape
(1976L, 15104L)
test.shape
(1976L, 15104L)
Then I try to call the function like this:
# Save the train and test files (prefer our default path for the data)
data_dir = os.path.join("C:/Users", 'myself', "OneDrive", "IA Project", 'data', 'train')
if not os.path.exists(data_dir):
    data_dir = os.path.join("data", "IA Project")
print ('Writing train text file...')
savetxt(os.path.join(data_dir, "Train-128x118_cntk_text.txt"), train)
print ('Writing test text file...')
savetxt(os.path.join(data_dir, "Test-128x118_cntk_text.txt"), test)
print('Done')
And then I get the following error:
Writing train text file...
Saving C:/Users\A702628\OneDrive - Atos\Microsoft Capstone IA\Capstone data\train\Train-128x118_cntk_text.txt
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-24-b53d3c69b8d2> in <module>()
6
7 print ('Writing train text file...')
----> 8 savetxt(os.path.join(data_dir, "Train-128x118_cntk_text.txt"), train)
9
10 print ('Writing test text file...')
<ipython-input-23-610c077db694> in savetxt(filename, ndarray)
12 for row in ndarray:
13 row_str = row.astype(str)
---> 14 label_str = labels[row[-1]]
15 feature_str = ' '.join(row_str[:-1])
16 f.write('|labels {} |features {}\n'.format(label_str, feature_str))
IndexError: list index out of range
Can somebody please tell me what's going wrong with this part of the code, and how I could fix it? Thank you very much in advance.
Since you're using your own input data: are the labels in the last column within the range covered by the labels list? With np.eye(11) that list has 11 entries (valid indices 0 to 10), so any label outside that range will cause an out-of-range error.
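A quick way to check that (my own suggestion, reusing the train and test arrays from the question): the last column of each row is used as an index into labels, so its values must stay within range(len(labels)):

# labels built from np.eye(11, ...) has valid indices 0..10; any value in the
# last column outside that range triggers exactly this IndexError.
print(train[:, -1].min(), train[:, -1].max())
print(test[:, -1].min(), test[:, -1].max())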

context is unavailable in python behave

I am starting with Python behave and got stuck when trying to access context - it's not available. Here is my code:
Here is the Feature file:
Feature: Company staff inventory

  Scenario: count company staff
    Given a set of employees:
      | name | dept |
      | Paul | IT   |
      | Mary | IT   |
      | Pete | Dev  |
    When we count the number of employees in each department
    Then we will find two people in IT
    And we will find one employee in Dev
Here is the Steps file:
from behave import *

@given('a set of employees')
def step_impl(context):
    assert context is True

@when('we count the number of employees in each department')
def step_impl(context):
    context.res = dict()
    for row in context.table:
        for k, v in row:
            if k not in context.res:
                context.res[k] = 1
            else:
                context.res[k] += 1

@then('we will find two people in IT')
def step_impl(context):
    assert context.res['IT'] == 2

@then('we will find one employee in Dev')
def step_impl(context):
    assert context.res['Dev'] == 1
Here's the Traceback:
Traceback (most recent call last):
File "/home/kseniyab/Documents/broadsign_code/spikes/BDD_Gherkin/behave/src/py3behave/lib/python3.4/site-packages/behave/model.py", line 1456, in run
match.run(runner.context)
File "/home/kseniyab/Documents/broadsign_code/spikes/BDD_Gherkin/behave/src/py3behave/lib/python3.4/site-packages/behave/model.py", line 1903, in run
self.func(context, *args, **kwargs)
File "steps/count_staff.py", line 5, in step_impl
assert context is True
AssertionError
The context object is there. It is your assertion that is problematic:
assert context is True
fails because context is not True. The only thing that satisfies x is True is an x that has literally been set to the value True; context has not been, it is an object. Note also that testing whether context is True is not the same as testing which branch if context: would take; the latter is equivalent to testing whether bool(context) is True.
There's no point in testing whether context is available. It is always there.
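A tiny illustration of that distinction (my own example, not from the original answer):

class Context:
    pass

ctx = Context()

print(bool(ctx))    # True: a plain object is truthy
print(ctx is True)  # False: it is not the singleton True
# assert ctx is True   # would fail, just like the step implementation above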

Loading data into R with rsqlserver package

I've just installed rsqlserver like so (no errors)
install_github('rsqlserver', 'agstudy',args = '--no-multiarch')
And created a connection to my database:
> library(rClr)
> library(rsqlserver)
Warning message:
multiple methods tables found for ‘dbCallProc’
> drv <- dbDriver("SqlServer")
> conn <- dbConnect(drv, url = "Server=MyServer;Database=MyDB;Trusted_Connection=True;")
>
Now when I try to get data using dbGetQuery, I get this error:
> df <- dbGetQuery(conn, "select top 100 * from public2013.dim_Date")
Error in clrCall(sqlDataHelper, "GetConnectionProperty", conn, prop) :
Type: System.MissingMethodException
Message: Method not found: 'System.Object System.Reflection.PropertyInfo.GetValue(System.Object)'.
Method: System.Object GetConnectionProperty(System.Data.SqlClient.SqlConnection, System.String)
Stack trace:
at rsqlserver.net.SqlDataHelper.GetConnectionProperty(SqlConnection _conn, String prop)
>
When I try to fetch results using dbSendQuery, I also get an error.
> res <- dbSendQuery(conn, "select top 100 * from public2013.dim_Date")
> df <- fetch(res, n = -1)
Error in clrCall(sqlDataHelper, "Fetch", stride) :
Type: System.InvalidCastException
Message: Object cannot be stored in an array of this type.
Method: Void InternalSetValue(Void*, System.Object)
Stack trace:
at System.Array.InternalSetValue(Void* target, Object value)
at System.Array.SetValue(Object value, Int32 index)
at rsqlserver.net.SqlDataHelper.Fetch(Int32 capacity) in c:\projects\R\rsqlserver\src\rsqlserver.net\src\SqlDataHelper.cs:line 116
Strangely, the file c:\projects\R\rsqlserver\src\rsqlserver.net\src\SqlDataHelper.cs doesn't actually exist on my computer.
Am I doing something wrong?
I am agstudy, the creator of the rsqlserver package. Sorry for the late reply, but I finally got some time to fix this bug (actually it was a not-yet-implemented feature). I demonstrate here how you can read/write a data.frame with missing values in SQL Server.
First I create a data.frame with missing values. It is important here to distinguish between numeric and character variables.
library(rsqlserver)
url = "Server=localhost;Database=TEST_RSQLSERVER;Trusted_Connection=True;"
conn <- dbConnect('SqlServer',url=url)
## create a table with some missing value
dat <- data.frame(txt=c('a',NA,'b',NA),
value =c(1L,NA,NA,2))
My input looks like this:
# txt value
# 1 a 1
# 2 <NA> NA
# 3 b NA
# 4 <NA> 2
I insert dat into my database with the handy function dbWriteTable:
dbWriteTable(conn,name='T_TABLE_WITH_MISSINGS',
dat,row.names=FALSE,overwrite=TRUE)
Then I read it back using two methods:
dbSendQuery
res = dbSendQuery(conn,'SELECT *
FROM T_TABLE_WITH_MISSINGS')
fetch(res,n=-1)
dbDisconnect(conn)
txt value
1 a 1
2 <NA> NaN
3 b NaN
4 <NA> 2
dbReadTable:
rsqlserver is DBI-compliant and implements many convenient functions so that you have to deal with SQL as little as possible.
conn <- dbConnect('SqlServer',url=url)
dbReadTable(conn,name='T_TABLE_WITH_MISSINGS')
dbDisconnect(conn)
txt value
1 a 1
2 <NA> NaN
3 b NaN
4 <NA> 2
(EDIT: I had missed something in your post (call to fetch). I can now reproduce the issue too.)
The short story is: do you have NULL values in your database? This may be the cause.
Longer story, for a full repro:
I've used a sample DB reproducible by following the instructions at http://www.codeproject.com/Tips/326527/Create-a-Sample-SQL-Database-in-Less-Than-2-Minute
EDIT:
I can reproduce your issue with:
library(rClr)
library(rsqlserver)
drv <- dbDriver("SqlServer")
conn <- dbConnect(drv, url = "Server=Localhost\\somename;Database=Fabrics;Trusted_Connection=True;")
res <- dbSendQuery(conn, "SELECT TOP 100 * FROM [Fabrics].[dbo].[Client]")
str(res)
## Formal class 'SqlServerResult' [package "rsqlserver"] with 1 slots
..# Id:<externalptr>
> df <- fetch(res, n = -1)
Error in clrCall(sqlDataHelper, "Fetch", stride) :
Type: System.InvalidCastException
Message: Object cannot be stored in an array of this type.
Method: Void InternalSetValue(Void*, System.Object)
Stack trace:
at System.Array.InternalSetValue(Void* target, Object value)
at System.Array.SetValue(Object value, Int32 index)
at rsqlserver.net.SqlDataHelper.Fetch(Int32 capacity) in c:\projects\R\rsqlserver\src\rsqlserver.net\src\SqlDataHelper.cs:line 116
The following commands suggest that things otherwise work as expected:
> dbExistsTable(conn, name='Client')
Error in sqlServerExecScalar(conn, statement, ...) :
Message: There is already an open DataReader associated with this Command which must be closed first.
> dbClearResult(res)
[1] TRUE
> dbExistsTable(conn, name='Client')
[1] TRUE
> dbExistsTable(conn, name='SomeIncorrectColumn')
[1] FALSE
Note that I cannot reproduce the very odd MissingMethodException; for me dbGetQuery fails with the same InvalidCastException:
df <- dbGetQuery(conn, "SELECT TOP 100 * FROM [Fabrics].[dbo].[Client]")
Error in clrCall(sqlDataHelper, "Fetch", stride) :
Type: System.InvalidCastException
Message: Object cannot be stored in an array of this type.
Method: Void InternalSetValue(Void*, System.Object)
Stack trace:
at System.Array.InternalSetValue(Void* target, Object value)
at System.Array.SetValue(Object value, Int32 index)
at rsqlserver.net.SqlDataHelper.Fetch(Int32 capacity) in c:\projects\R\rsqlserver\src\rsqlserver.net\src\SqlDataHelper.cs:line 116
Since the debug symbols seem to be present, I can debug it further through Visual Studio. It bombs in SqlDataHelper.Fetch at
_resultSet[_cnames[i]].SetValue(_reader.GetValue(i), cnt);
and the variable watch gives me:
i 11 int
_cnames[i] "Street2" string
_reader.GetValue(i) {} object {System.DBNull}
_reader.GetValue(i-1) "806 West Sir Francis Drake St" object {string}
_reader.GetValue(i+1) "Spokane" object {string}
The entry for Street2 is indeed a NULL:
ClientId FirstName MiddleName LastName Gender DateOfBirth CreditRating XCode OccupationId TelephoneNumber Street1 Street2 City ZipCode Longitude Latitude Notes
1 Nicholas Pat Kane M 1975-10-07 00:00:00.000 3 ZU8 5ML 4 (279) 459 - 2707 2870 North Cherry Blvd. NULL Carlsbad 64906 32.7608137325835 117.112738329071
For information, sessionInfo() output includes:
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)
other attached packages:
[1] rsqlserver_1.0 rClr_0.5-2
loaded via a namespace (and not attached):
[1] DBI_0.2-7 tools_3.0.2
Hope this helps.
