How to use yolo training coco dataset + custom data together - dataset

I want to train yolov5 by combining the coco dataset and the custom dataset created with roboflow. How do I merge datasets?

can I ask why you’re looking to combine the two?
Are you just wanting to do Transfer Learning to accelerate your model training and inference performance? If that’s the case, you can just use Train From Checkpoint, with Roboflow Train, and use the COCO checkpoint - https://docs.roboflow.com/train
Otherwise, is your goal to detect your custom classes alongside all of the classes in COCO?

Create a data configuration file combined_datasets.yaml that combines multiple datasets like this:
path: ../../yolov5_datasets # realative data root dir
train: # train images (relative to 'path')
- coco_dataset/train/images # use both coco
- custom_dataset/train/images # and you custom dataset for train
val: # val images
- coco_dataset/val/images # use both coco
- custom_dataset/val/images # and you custom dataset for eval
# Classes
nc: N # number of classes
names: [ 'name_0', 'name_1', '...', 'name_N-1' ] # class names
Specify it for training:
python train.py --data combined_datasets.yaml --cfg yolov5s.yaml --weights yolov5s.pt --device 2 --img 320

Related

Train an already trained model in Sagemaker and Huggingface without re-initialising

Let's say I have successfully trained a model on some training data for 10 epochs. How can I then access the very same model and train for a further 10 epochs?
In the docs it suggests "you need to specify a checkpoint output path through hyperparameters" --> how?
# define my estimator the standard way
huggingface_estimator = HuggingFace(
entry_point='train.py',
source_dir='./scripts',
instance_type='ml.p3.2xlarge',
instance_count=1,
role=role,
transformers_version='4.10',
pytorch_version='1.9',
py_version='py38',
hyperparameters = hyperparameters,
metric_definitions=metric_definitions
)
# train the model
huggingface_estimator.fit(
{'train': training_input_path, 'test': test_input_path}
)
If I run huggingface_estimator.fit again it will just start the whole thing over again and overwrite my previous training.
You can find the relevant checkpoint save/load code in Spot Instances - Amazon SageMaker x Hugging Face Transformers.
(The example enables Spot instances, but you can use on-demand).
In hyperparameters you set: 'output_dir':'/opt/ml/checkpoints'.
You define a checkpoint_s3_uri in the Estimator (which is unique to the series of jobs you'll run).
You add code for train.py to support checkpointing:
from transformers.trainer_utils import get_last_checkpoint
# check if checkpoint existing if so continue training
if get_last_checkpoint(args.output_dir) is not None:
logger.info("***** continue training *****")
last_checkpoint = get_last_checkpoint(args.output_dir)
trainer.train(resume_from_checkpoint=last_checkpoint)
else:
trainer.train()

How do I get a dataframe or database write from TFX BulkInferrer?

I'm very new to TFX, but have an apparently-working ML Pipeline which is to be used via BulkInferrer. That seems to produce output exclusively in Protobuf format, but since I'm running bulk inference I want to pipe the results to a database instead. (DB output seems like it should be the default for bulk inference, since both Bulk Inference & DB access take advantage of parallelization... but Protobuf is a per-record, serialized format.)
I assume I could use something like Parquet-Avro-Protobuf to do the conversion (though that's in Java and the rest of the pipeline's in Python), or I could write something myself to consume all the protobuf messages one-by-one, convert them into JSON, deserialize the JSON into a list of dicts, and load the dict into a Pandas DataFrame, or store it as a bunch of key-value pairs which I treat like a single-use DB... but that sounds like a lot of work and pain involving parallelization and optimization for a very common use case. The top-level Protobuf message definition is Tensorflow's PredictionLog.
This must be a common use case, because TensorFlowModelAnalytics functions like this one consume Pandas DataFrames. I'd rather be able to write directly to a DB (preferably Google BigQuery), or a Parquet file (since Parquet / Spark seems to parallelize better than Pandas), and again, those seem like they should be common use cases, but I haven't found any examples. Maybe I'm using the wrong search terms?
I also looked at the PredictExtractor, since "extracting predictions" sounds close to what I want... but the official documentation appears silent on how that class is supposed to be used. I thought TFTransformOutput sounded like a promising verb, but instead it's a noun.
I'm clearly missing something fundamental here. Is there a reason no one wants to store BulkInferrer results in a database? Is there a configuration option that allows me to write the results to a DB? Maybe I want to add a ParquetIO or BigQueryIO instance to the TFX pipeline? (TFX docs say it uses Beam "under the hood" but that doesn't say much about how I should use them together.) But the syntax in those documents looks sufficiently different from my TFX code that I'm not sure if they're compatible?
Help?
(Copied from the related issue for greater visibility)
After some digging, here is an alternative approach, which assumes no knowledge of the feature_spec before-hand. Do the following:
Set the BulkInferrer to write to output_examples rather than inference_result by adding a output_example_spec to the component construction.
Add a StatisticsGen and a SchemaGen component in the main pipeline right after the BulkInferrer to generate a schema for the aforementioned output_examples
Use the artifacts from SchemaGen and BulkInferrer to read the TFRecords and do whatever is neccessary.
bulk_inferrer = BulkInferrer(
....
output_example_spec=bulk_inferrer_pb2.OutputExampleSpec(
output_columns_spec=[bulk_inferrer_pb2.OutputColumnsSpec(
predict_output=bulk_inferrer_pb2.PredictOutput(
output_columns=[bulk_inferrer_pb2.PredictOutputCol(
output_key='original_label_name',
output_column='output_label_column_name', )]))]
))
statistics = StatisticsGen(
examples=bulk_inferrer.outputs.output_examples
)
schema = SchemaGen(
statistics=statistics.outputs.output,
)
After that, one can do the following:
import tensorflow as tf
from tfx.utils import io_utils
from tensorflow_transform.tf_metadata import schema_utils
# read schema from SchemaGen
schema_path = '/path/to/schemagen/schema.pbtxt'
schema_proto = io_utils.SchemaReader().read(schema_path)
spec = schema_utils.schema_as_feature_spec(schema_proto).feature_spec
# read inferred results
data_files = ['/path/to/bulkinferrer/output_examples/examples/examples-00000-of-00001.gz']
dataset = tf.data.TFRecordDataset(data_files, compression_type='GZIP')
# parse dataset with spec
def parse(raw_record):
return tf.io.parse_example(raw_record, spec)
dataset = dataset.map(parse)
At this point, the dataset is like any other parsed dataset, so its trivial to write a CSV, or to a BigQuery table or whatever from there. It certainly helped us in ZenML with our BatchInferencePipeline.
Answering my own question here to document what we did, even though I think #Hamza Tahir's answer below is objectively better. This may provide an option for other situations where it's necessary to change the operation of an out-of-the-box TFX component. It's hacky though:
We copied and edited the file tfx/components/bulk_inferrer/executor.py, replacing this transform in the _run_model_inference() method's internal pipeline:
| 'WritePredictionLogs' >> beam.io.WriteToTFRecord(
os.path.join(inference_result.uri, _PREDICTION_LOGS_FILE_NAME),
file_name_suffix='.gz',
coder=beam.coders.ProtoCoder(prediction_log_pb2.PredictionLog)))
with this one:
| 'WritePredictionLogsBigquery' >> beam.io.WriteToBigQuery(
'our_project:namespace.TableName',
schema='SCHEMA_AUTODETECT',
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
custom_gcs_temp_location='gs://our-storage-bucket/tmp',
temp_file_format='NEWLINE_DELIMITED_JSON',
ignore_insert_ids=True,
)
(This works because when you import the BulkInferrer component, the per-node work gets farmed out to these executors running on the worker nodes, and TFX copies its own library onto those nodes. It doesn't copy everything from user-space libaries, though, which is why we couldn't just subclass BulkInferrer and import our custom version.)
We had to make sure the table at 'our_project:namespace.TableName' had a schema compatible with the model's output, but didn't have to translate that schema into JSON / AVRO.
In theory, my group would like to make a pull request with TFX built around this, but for now we're hard-coding a couple key parameters, and don't have the time to get this to a real public / production state.
I'm a little late to this party but this is some code I use for this task:
import tensorflow as tf
from tensorflow_serving.apis import prediction_log_pb2
import pandas as pd
def parse_prediction_logs(inference_filenames: List[Text]): -> pd.DataFrame
"""
Args:
inference files: tf.io.gfile.glob(Inferrer artifact uri)
Returns:
a dataframe of userids, predictions, and features
"""
def parse_log(pbuf):
# parse the protobuf
message = prediction_log_pb2.PredictionLog()
message.ParseFromString(pbuf)
# my model produces scores and classes and I extract the topK classes
predictions = [x.decode() for x in (message
.predict_log
.response
.outputs['output_2']
.string_val
)[:10]]
# here I parse the input tf.train.Example proto
inputs = tf.train.Example()
inputs.ParseFromString(message
.predict_log
.request
.inputs['input_1'].string_val[0]
)
# you can pull out individual features like this
uid = inputs.features.feature["userId"].bytes_list.value[0].decode()
feature1 = [
x.decode() for x in inputs.features.feature["feature1"].bytes_list.value
]
feature2 = [
x.decode() for x in inputs.features.feature["feature2"].bytes_list.value
]
return (uid, predictions, feature1, feature2)
return pd.DataFrame(
[parse_log(x) for x in
tf.data.TFRecordDataset(inference_filenames, compression_type="GZIP").as_numpy_iterator()
], columns = ["userId", "predictions", "feature1", "feature2"]
)

using CDO to extract dataset only for a specific region

I want to use 'cdo' to extract data from a precipitation NetCDF dataset using another NetCDF over South America.
I have tried multiple procedures, but I always get some error (such as grid size not same, Unsupported generic co-ordinates, etc).
The codes I have tried:
cdo mul chirps_2000-2015_annual_SA.nc Extract feature.nc output.nc
# Got grid error
cdo -f nc4 setctomiss,0 -gtc,0 -remapcon,r1440x720 Chirps_2000-2015_annual_SA.nc CHIRPS_era5_pev_2000-2015_annual_SA_masked.nc
# Got unsupported generic error
I am sure you could make/find more elegant solution, but I just combined Python and shell executable cdo to fulfill the task (calling subprocess can be considered as a bad habit sometimes/somewhere).
#!/usr/bin/env ipython
import numpy as np
from netCDF4 import Dataset
import subprocess
# -------------------------------------------------
def nc_varget(filein,varname):
ncin=Dataset(filein);
vardata=ncin.variables[varname][:];
ncin.close()
return vardata
# -------------------------------------------------
gridfile='extract_feature.nc'
inputfile='precipitation_2000-2015_annual_SA.nc'
outputfile='selected_region.nc'
# -------------------------------------------------
# Detect the start/end based on gridfile:
poutlon=nc_varget(gridfile,'lon')
poutlat=nc_varget(gridfile,'lat')
pinlon=nc_varget(inputfile,'lon')
pinlat=nc_varget(inputfile,'lat')
kkx=np.where((pinlon>=np.min(poutlon)) & (pinlon<=np.max(poutlon)))
kky=np.where((pinlat>=np.min(poutlat)) & (pinlat<=np.max(poutlat)))
# -------------------------------------------------
# -------------------------------------------------
commandstr='cdo selindexbox,'+str(np.min(kkx))+','+str(np.max(kkx))+','+str(np.min(kky))+','+str(np.max(kky))+' '+inputfile+' '+outputfile
subprocess.call(commandstr,shell=True)
The problem in your data is that the file "precipitation_2000-2015_annual_SA.nc" does not specify the grid at the moment - variables lon, lat are generic and hence the grid is generic. Otherwise you could use other operators instead of selindexbox. File extract_feature.nc are more closer to the standard as the variables lon, lat have also the name and unit attributes.

Loading of mobilenet v2 works, but pretrained mobilenet v2 fails

I retrain a mobilenet v2 modell using my own images and i can label new images with the output in python (https://www.tensorflow.org/hub/tutorials/image_retraining). Loading the file works, but during prediction it fails with (concole.log of Firefox and Chromium):
The dict provided in model.execute(dict) has keys: [images] not part of model graph.
I retrain a modell using the provided retrain.py
python retrain.py --image_dir flower_photos/ --tfhub_module https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/classification/2 --random_brightness 10 --how_many_training_steps 100
inside flower_photos there are folders with the name of the images and inside the appropriate images.
flower_photos
--- Huflattich
------- 1.jpg
------- 2.jpg
....
--- Buschwindröschen
------- 1.jpg
------- 2.jpg
I can convert this model using
tensorflowjs_converter --input_format=tf_frozen_model --output_node_names='module_apply_default/MobilenetV2/Logits/output' /tmp/output_graph.pb Mobilenetv2/web_model
but this isn't working inside the provided example from https://github.com/tensorflow/tfjs-examples/tree/master/mobilenet
If i convert the original mobilenet v2 using
tensorflowjs_converter --input_format=tf_hub 'https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/classification/2' mobilenetv2/web_model
i can load inside the provided example.
In the end, the programme should detect different early bloomer flowers shown by the webcam and classify. This should be a PWA for students and motivate them to experience nature.
Tensorflow.js currently has two types of models,
Layers model which allows training, you can load them with tf.loadModel(...)
Models that are converted from TensorFlow generated models, which does not allow training. This is what you have, you should use tf.loadFrozenModel(...)
here is an example for loading the frozen model and performing a prediction on an image. https://github.com/tensorflow/tfjs-converter/tree/master/demo/mobilenet

Using Autoencoder on numerical dataset in Keras

I am trying to develop an Intrusion Detection System based on deep learning using Keras.
We simulated a NORMAL network traffic and I prepared it in CSV file (numerical dataset of network packets fields (IP source, port,etc..)). But I do not have the ABNORMAL (malicious) packets to train the neural network on.
I searched for problems like that and I found that Autoencoder is a good approach in unsupervised learning, but the thing is that I am new to deep learning, and I only found this example https://blog.keras.io/building-autoencoders-in-keras.html where they are using Autoencoder on images dataset.
I want to use Autoencoder (or any thing useful in my case) with numerical CSV dataset in order to predict if the incoming packet is normal or malicious.
Any recommendation?
I have found the answer:
You can load the numerical dataset into python using e.g. numpy load text. Then, specify the encoder and decoder networks (basically just use the Keras Layers modules to design neural networks). Make sure the input layer of the encoder accepts your data, and the output layer of the decoder has the same dimension. Then, specify appropriate loss function (least squares, cross entropy, etc’) again with Keras losses. Finally, specify your optimizer with (surprise!) Keras optimizers.
Thats it, your done! Hit ‘run’, and watch your autoencoder autoencode (because that is how the autoencoders do). If you want a great tutorial about how to construct this.
from keras.layers import Input,Dense
from keras.models import Model
# number of neurons in the encoding hidden layer
encoding_dim = 5
# input placeholder
input_data = Input(shape=(6,)) # 6 is the number of features/columns
# encoder is the encoded representation of the input
encoded = Dense(encoding_dim, activation ='relu')(input_data)
# decoder is the lossy reconstruction of the input
decoded = Dense(6, activation ='sigmoid')(encoded) # 6 again number of features and should match input_data
# this model maps an input to its reconstruction
autoencoder = Model(input_data, decoded)
# this model maps an input to its encoded representation
encoder = Model(input_data, encoded)
# model optimizer and loss
autoencoder = Model(input_data, decoded)
# loss function and optimizer
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')
# train test split
from sklearn.model_selection import train_test_split
x_train, x_test, = train_test_split(data, test_size=0.1, random_state=42)
# train the model
autoencoder.fit(x_train,
x_train,
epochs=50,
batch_size=256,
shuffle=True)
autoencoder.summary()
# predict after training
# note that we take them from the *test* set
encoded_data = encoder.predict(x_test)

Resources