Using Autoencoder on numerical dataset in Keras

I am trying to develop an Intrusion Detection System based on deep learning using Keras.
We simulated NORMAL network traffic and I prepared it as a CSV file (a numerical dataset of network packet fields: source IP, port, etc.). But I do not have ABNORMAL (malicious) packets to train the neural network on.
I searched for problems like this and found that an Autoencoder is a good approach for unsupervised learning, but the thing is that I am new to deep learning, and I only found this example https://blog.keras.io/building-autoencoders-in-keras.html where they use an Autoencoder on an image dataset.
I want to use an Autoencoder (or anything else useful in my case) with a numerical CSV dataset in order to predict whether an incoming packet is normal or malicious.
Any recommendation?

I have found the answer:
You can load the numerical dataset into Python using e.g. numpy.loadtxt. Then, specify the encoder and decoder networks (basically, just use the Keras layers modules to design neural networks). Make sure the input layer of the encoder accepts your data, and the output layer of the decoder has the same dimension. Then, specify an appropriate loss function (least squares, cross entropy, etc.), again with Keras losses. Finally, specify your optimizer with (surprise!) Keras optimizers.
That's it, you're done! Hit 'run', and watch your autoencoder autoencode (because that is how autoencoders do). If you want a great tutorial on how to construct this, the Keras blog post linked in the question is a good place to start.
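Before the model code below, here is a minimal sketch of the data-loading step. The file name traffic.csv, the comma delimiter, and the min-max scaling are my own assumptions for illustration, not part of the original answer:

import numpy as np

# load the numeric CSV into a 2-D float array (assumed file name and layout)
data = np.loadtxt('traffic.csv', delimiter=',')

# simple min-max scaling to [0, 1] so a sigmoid output layer can reconstruct the values;
# constant columns would need special handling to avoid division by zero
data = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))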

from keras.layers import Input, Dense
from keras.models import Model
from sklearn.model_selection import train_test_split

# number of neurons in the encoding hidden layer
encoding_dim = 5
# input placeholder
input_data = Input(shape=(6,))  # 6 is the number of features/columns
# "encoded" is the encoded representation of the input
encoded = Dense(encoding_dim, activation='relu')(input_data)
# "decoded" is the lossy reconstruction of the input
decoded = Dense(6, activation='sigmoid')(encoded)  # 6 again, the number of features; must match input_data
# this model maps an input to its reconstruction
autoencoder = Model(input_data, decoded)
# this model maps an input to its encoded representation
encoder = Model(input_data, encoded)
# loss function and optimizer
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')
# train/test split
x_train, x_test = train_test_split(data, test_size=0.1, random_state=42)
# train the model
autoencoder.fit(x_train, x_train,
                epochs=50,
                batch_size=256,
                shuffle=True)
autoencoder.summary()
# predict after training
# note that we take them from the *test* set
encoded_data = encoder.predict(x_test)
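Since the goal is to flag malicious packets, a common follow-up is to score each packet by its reconstruction error and flag anything above a threshold estimated from the normal training data. This is a minimal sketch on top of the code above, not part of the original answer; the 99th-percentile threshold is an arbitrary choice you would tune:

import numpy as np

# reconstruction error per packet (mean squared error between input and reconstruction)
reconstructions = autoencoder.predict(x_test)
errors = np.mean(np.square(x_test - reconstructions), axis=1)

# pick a threshold from the errors on normal training data, e.g. the 99th percentile
train_errors = np.mean(np.square(x_train - autoencoder.predict(x_train)), axis=1)
threshold = np.percentile(train_errors, 99)

# packets whose error exceeds the threshold are flagged as anomalous/malicious
is_malicious = errors > threshold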

Related

Train an already trained model in Sagemaker and Huggingface without re-initialising

Let's say I have successfully trained a model on some training data for 10 epochs. How can I then access the very same model and train it for a further 10 epochs?
In the docs it suggests "you need to specify a checkpoint output path through hyperparameters", but how do I do that?
# define my estimator the standard way
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.10',
    pytorch_version='1.9',
    py_version='py38',
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions
)

# train the model
huggingface_estimator.fit(
    {'train': training_input_path, 'test': test_input_path}
)
If I run huggingface_estimator.fit again it will just start the whole thing over again and overwrite my previous training.
You can find the relevant checkpoint save/load code in Spot Instances - Amazon SageMaker x Hugging Face Transformers.
(The example enables Spot instances, but you can use on-demand instances.)
In hyperparameters you set: 'output_dir':'/opt/ml/checkpoints'.
You define a checkpoint_s3_uri in the Estimator (which is unique to the series of jobs you'll run).
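For steps 1 and 2, a minimal sketch of what the estimator setup might look like; the S3 path and the extra hyperparameter values are placeholders, not from the original answer:

# assumed values; the S3 path and the epochs hyperparameter are placeholders
hyperparameters = {
    'output_dir': '/opt/ml/checkpoints',  # step 1: checkpoint output path inside the container
    'epochs': 10,
}

huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.10',
    pytorch_version='1.9',
    py_version='py38',
    hyperparameters=hyperparameters,
    # step 2: S3 location where SageMaker syncs /opt/ml/checkpoints between jobs
    checkpoint_s3_uri='s3://your-bucket/your-checkpoint-prefix',
)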
You add code to train.py to support checkpointing:
from transformers.trainer_utils import get_last_checkpoint

# check if a checkpoint already exists; if so, continue training from it
if get_last_checkpoint(args.output_dir) is not None:
    logger.info("***** continue training *****")
    last_checkpoint = get_last_checkpoint(args.output_dir)
    trainer.train(resume_from_checkpoint=last_checkpoint)
else:
    trainer.train()
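For this to resume anything, the Trainer also has to be writing checkpoints into args.output_dir in the first place. A hedged sketch of the relevant train.py wiring; the args names, model, and dataset variables are assumed to match your script:

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir=args.output_dir,     # '/opt/ml/checkpoints' from the hyperparameters
    num_train_epochs=args.epochs,
    save_strategy='epoch',          # write a checkpoint every epoch so later jobs can resume
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)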

Query on TFP Probabilistic Model

In the TFP tutorial, the model output is a Normal distribution. I noted that the output can be replaced by an IndependentNormal layer. In my model, y_true is a binary class. Therefore, I used an IndependentBernoulli layer instead of an IndependentNormal layer.
After building the model, I found that it has two output parameters. It doesn't make sense to me, since the Bernoulli distribution has only one parameter. Do you know what went wrong?
import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow.keras import Sequential

tfd = tfp.distributions
tfpl = tfp.layers

# Define the prior weight distribution as Normal of mean=0 and stddev=1.
# Note that, in this example, the prior distribution is not trainable,
# as we fix its parameters.
def prior(kernel_size, bias_size, dtype=None):
    n = kernel_size + bias_size
    prior_model = Sequential([
        tfpl.DistributionLambda(
            lambda t: tfd.MultivariateNormalDiag(loc=tf.zeros(n), scale_diag=tf.ones(n))
        )
    ])
    return prior_model

# Define the variational posterior weight distribution as a multivariate Gaussian.
# Note that the learnable parameters for this distribution are the means,
# variances, and covariances.
def posterior(kernel_size, bias_size, dtype=None):
    n = kernel_size + bias_size
    posterior_model = Sequential([
        tfpl.VariableLayer(tfpl.MultivariateNormalTriL.params_size(n), dtype=dtype),
        tfpl.MultivariateNormalTriL(n)
    ])
    return posterior_model

# Create a probabilistic DL model
model = Sequential([
    tfpl.DenseVariational(units=16,
                          input_shape=(6,),
                          make_prior_fn=prior,
                          make_posterior_fn=posterior,
                          kl_weight=1/X_train.shape[0],
                          activation='relu'),
    tfpl.DenseVariational(units=16,
                          make_prior_fn=prior,
                          make_posterior_fn=posterior,
                          kl_weight=1/X_train.shape[0],
                          activation='sigmoid'),
    tfpl.DenseVariational(units=tfpl.IndependentBernoulli.params_size(1),
                          make_prior_fn=prior,
                          make_posterior_fn=posterior,
                          kl_weight=1/X_train.shape[0]),
    tfpl.IndependentBernoulli(1, convert_to_tensor_fn=tfd.Bernoulli.logits)
])
model.summary()
(Screenshot of the model.summary() output from running the code on Google Colab.)
I agree the summary display is confusing but I think this is an artifact of the way tfp layers are implemented to interact with keras. During normal operation, there will only be one return value from a DistributionLambda layer. But in some contexts (that I don't fully grok) DistributionLambda.call may return both a distribution and a side-result. I think the summary plumbing triggers this for some reason, so it looks like there are 2 outputs, but there will practically only be one. Try calling your model object on X_train, and you'll see you get a single distribution out (its type is actually something called TensorCoercible, which is a wrapper around a distribution that lets you pass it into tf ops that call tf.convert_to_tensor -- the resulting value for that op will be the result of calling your convert_to_tensor_fn on the enclosed distribution).
In summary, your distribution layer is fine but the summary is confusing. It could probably be fixed; I'm not keras-knowledgeable enough to opine on how hard it would be.
Side note: you can omit the event_shape=1 parameter -- the default value is (), or "scalar", which will behave the same.
HTH!
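A quick way to see this in practice (a small sketch, assuming the X_train array used to build the model above):

dist = model(X_train)   # a single distribution-like object, not two tensors
print(type(dist))       # a TensorCoercible wrapper around the Bernoulli distribution
print(dist.mean())      # it behaves like a distribution, e.g. per-example probabilities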

How do I get a dataframe or database write from TFX BulkInferrer?

I'm very new to TFX, but have an apparently-working ML Pipeline which is to be used via BulkInferrer. That seems to produce output exclusively in Protobuf format, but since I'm running bulk inference I want to pipe the results to a database instead. (DB output seems like it should be the default for bulk inference, since both Bulk Inference & DB access take advantage of parallelization... but Protobuf is a per-record, serialized format.)
I assume I could use something like Parquet-Avro-Protobuf to do the conversion (though that's in Java and the rest of the pipeline's in Python), or I could write something myself to consume all the protobuf messages one-by-one, convert them into JSON, deserialize the JSON into a list of dicts, and load the dict into a Pandas DataFrame, or store it as a bunch of key-value pairs which I treat like a single-use DB... but that sounds like a lot of work and pain involving parallelization and optimization for a very common use case. The top-level Protobuf message definition is Tensorflow's PredictionLog.
This must be a common use case, because TensorFlow Model Analysis functions like this one consume Pandas DataFrames. I'd rather be able to write directly to a DB (preferably Google BigQuery) or a Parquet file (since Parquet / Spark seems to parallelize better than Pandas), and again, those seem like they should be common use cases, but I haven't found any examples. Maybe I'm using the wrong search terms?
I also looked at the PredictExtractor, since "extracting predictions" sounds close to what I want... but the official documentation appears silent on how that class is supposed to be used. I thought TFTransformOutput sounded like a promising verb, but instead it's a noun.
I'm clearly missing something fundamental here. Is there a reason no one wants to store BulkInferrer results in a database? Is there a configuration option that allows me to write the results to a DB? Maybe I want to add a ParquetIO or BigQueryIO instance to the TFX pipeline? (TFX docs say it uses Beam "under the hood" but that doesn't say much about how I should use them together.) But the syntax in those documents looks sufficiently different from my TFX code that I'm not sure if they're compatible?
Help?
(Copied from the related issue for greater visibility)
After some digging, here is an alternative approach, which assumes no knowledge of the feature_spec beforehand. Do the following:
Set the BulkInferrer to write to output_examples rather than inference_result by adding an output_example_spec to the component construction.
Add a StatisticsGen and a SchemaGen component in the main pipeline right after the BulkInferrer to generate a schema for the aforementioned output_examples.
Use the artifacts from SchemaGen and BulkInferrer to read the TFRecords and do whatever is necessary.
bulk_inferrer = BulkInferrer(
    ....
    output_example_spec=bulk_inferrer_pb2.OutputExampleSpec(
        output_columns_spec=[bulk_inferrer_pb2.OutputColumnsSpec(
            predict_output=bulk_inferrer_pb2.PredictOutput(
                output_columns=[bulk_inferrer_pb2.PredictOutputCol(
                    output_key='original_label_name',
                    output_column='output_label_column_name', )]))]
    ))

statistics = StatisticsGen(
    examples=bulk_inferrer.outputs.output_examples
)

schema = SchemaGen(
    statistics=statistics.outputs.output,
)
After that, one can do the following:
import tensorflow as tf
from tfx.utils import io_utils
from tensorflow_transform.tf_metadata import schema_utils

# read schema from SchemaGen
schema_path = '/path/to/schemagen/schema.pbtxt'
schema_proto = io_utils.SchemaReader().read(schema_path)
spec = schema_utils.schema_as_feature_spec(schema_proto).feature_spec

# read inferred results
data_files = ['/path/to/bulkinferrer/output_examples/examples/examples-00000-of-00001.gz']
dataset = tf.data.TFRecordDataset(data_files, compression_type='GZIP')

# parse dataset with spec
def parse(raw_record):
    return tf.io.parse_example(raw_record, spec)

dataset = dataset.map(parse)
At this point, the dataset is like any other parsed dataset, so it's trivial to write a CSV, or to a BigQuery table, or whatever from there. It certainly helped us in ZenML with our BatchInferencePipeline.
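For example, a minimal sketch of dumping the parsed dataset to a CSV with pandas; this is my own illustration and assumes the feature_spec contains only dense (FixedLenFeature) features:

import pandas as pd

rows = []
for record in dataset.as_numpy_iterator():
    # each element is a dict of feature name -> numpy value, per the feature_spec
    rows.append(dict(record))

pd.DataFrame(rows).to_csv('inference_results.csv', index=False)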
Answering my own question here to document what we did, even though I think Hamza Tahir's answer above is objectively better. This may provide an option for other situations where it's necessary to change the operation of an out-of-the-box TFX component. It's hacky though:
We copied and edited the file tfx/components/bulk_inferrer/executor.py, replacing this transform in the _run_model_inference() method's internal pipeline:
| 'WritePredictionLogs' >> beam.io.WriteToTFRecord(
    os.path.join(inference_result.uri, _PREDICTION_LOGS_FILE_NAME),
    file_name_suffix='.gz',
    coder=beam.coders.ProtoCoder(prediction_log_pb2.PredictionLog)))
with this one:
| 'WritePredictionLogsBigquery' >> beam.io.WriteToBigQuery(
    'our_project:namespace.TableName',
    schema='SCHEMA_AUTODETECT',
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    custom_gcs_temp_location='gs://our-storage-bucket/tmp',
    temp_file_format='NEWLINE_DELIMITED_JSON',
    ignore_insert_ids=True,
)
(This works because when you import the BulkInferrer component, the per-node work gets farmed out to these executors running on the worker nodes, and TFX copies its own library onto those nodes. It doesn't copy everything from user-space libraries, though, which is why we couldn't just subclass BulkInferrer and import our custom version.)
We had to make sure the table at 'our_project:namespace.TableName' had a schema compatible with the model's output, but didn't have to translate that schema into JSON / AVRO.
In theory, my group would like to make a pull request with TFX built around this, but for now we're hard-coding a couple key parameters, and don't have the time to get this to a real public / production state.
I'm a little late to this party but this is some code I use for this task:
from typing import List, Text

import tensorflow as tf
from tensorflow_serving.apis import prediction_log_pb2
import pandas as pd

def parse_prediction_logs(inference_filenames: List[Text]) -> pd.DataFrame:
    """
    Args:
        inference_filenames: tf.io.gfile.glob(Inferrer artifact uri)
    Returns:
        a dataframe of userids, predictions, and features
    """
    def parse_log(pbuf):
        # parse the protobuf
        message = prediction_log_pb2.PredictionLog()
        message.ParseFromString(pbuf)
        # my model produces scores and classes and I extract the top-K classes
        predictions = [x.decode() for x in (message
                                            .predict_log
                                            .response
                                            .outputs['output_2']
                                            .string_val
                                            )[:10]]
        # here I parse the input tf.train.Example proto
        inputs = tf.train.Example()
        inputs.ParseFromString(message
                               .predict_log
                               .request
                               .inputs['input_1'].string_val[0]
                               )
        # you can pull out individual features like this
        uid = inputs.features.feature["userId"].bytes_list.value[0].decode()
        feature1 = [
            x.decode() for x in inputs.features.feature["feature1"].bytes_list.value
        ]
        feature2 = [
            x.decode() for x in inputs.features.feature["feature2"].bytes_list.value
        ]
        return (uid, predictions, feature1, feature2)

    return pd.DataFrame(
        [parse_log(x) for x in
         tf.data.TFRecordDataset(inference_filenames, compression_type="GZIP").as_numpy_iterator()
         ], columns=["userId", "predictions", "feature1", "feature2"]
    )

Data output from the device

I am working on a project in which I need to extract data from the device: InertialUnit.
I get a single value in real time, but I need data for the first 10 s in 1 ms increments, or all the data for the entire cycle of the device. Please help me implement this if possible.
Webots controllers are like any other programs, so you can easily get the values of the inertial unit and save them in a file at each step.
Here is a very simple example in Python:
from controller import Robot

robot = Robot()
inertial_unit = robot.getInertialUnit('inertial unit')
inertial_unit.enable(10)

with open('values.txt', 'a') as f:
    while robot.step(10) != -1:
        values = inertial_unit.getValues()
        # getValues() returns floats, so convert them to strings before writing
        f.write(' '.join(str(v) for v in values) + '\n')
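To get specifically the first 10 s in 1 ms increments, the controller step has to be 1 ms, which requires WorldInfo.basicTimeStep to be 1 in the world file. A sketch of that variant, under that assumption (the output file name is a placeholder):

from controller import Robot

robot = Robot()
timestep = 1  # ms; requires WorldInfo.basicTimeStep == 1 in the world file
inertial_unit = robot.getInertialUnit('inertial unit')
inertial_unit.enable(timestep)

samples = []
while robot.step(timestep) != -1 and robot.getTime() <= 10.0:
    samples.append(inertial_unit.getValues())

# samples now holds roll/pitch/yaw readings for the first 10 s at 1 ms resolution
with open('first_10s.txt', 'w') as f:
    for values in samples:
        f.write(' '.join(str(v) for v in values) + '\n')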

In Python: Why does a Numpy array returned in a method have to be stored in the instance?

The issue is a bit complicated so I need to give you the full story:
I'm trying to create basic sound using the package pyaudio.
The pyaudio module asks for a callback function that reads audio data.
It basically looks like this:
def callback():
    return sound.readframes()
I have a class that provides audio data. It creates a signal by indexing a numpy array that contains a wave length of a basic wave form. It basically looks like this:
class Wave(object):
    data = []  # one wavelength of a basic waveform like a sine wave will be stored here

    def readframes(self):
        indices = ...  # the indices for the next frames,
                       # based on a frequency and the current position
        return self.data[indices]

class SineWave(Wave):
    data = ...  # points of a sine wave in a numpy array

sound = SineWave()
# pyaudio now reads audio data in non-blocking callback mode
Now to the problem:
There was noise between each call to readframes or each block of audio data. The sound itself was still intact, but there was an additional clicking. Increasing the size of the returned data block also increased the time between each click. I still don't quite know what caused the noise, but I fixed it by storing the returned numpy array in the instance instead of returning it directly.
This causes noise:
return self.data[indices]
This does not cause noise:
self.result = self.data[indices]
return self.result
It can't be an alteration of the data that causes the noise or else this wouldn't fix it. Why does the result have to be stored in the instance?
Update:
I also have to mention that the noise doesn't occur every time I run the script. It occurs randomly in about 50% of the runs. Either the noise appears right at the start and won't go away, or it doesn't appear at all.
It looks like an issue with the initialization of the sound device, but how that could be related to my fix is a mystery.
Here is code that reproduces the noise problem:
import pyaudio
import time
from numpy import sin, linspace, arange, pi, float32, uint32

SR = 44100   # sampling rate
FPB = 2048   # frames per buffer

class SineWave(object):
    """A sine wave"""
    data_size = 2**10
    data = sin(linspace(0, 2*pi, data_size, endpoint=False, dtype=float32))

    def __init__(self, frequency):
        self.pos = 0
        self.fq = frequency
        self.ppf = self.data_size * self.fq / SR  # points per frame

    def readframes(self):
        size = FPB
        pos = self.pos
        indices = (arange(pos, size+pos)*self.ppf).astype(uint32) & (self.data_size-1)
        self.pos = (self.pos + size) % SR
        self.result = self.data[indices]
        return self.data[indices]

def callback(in_data, frame_count, time_info, status):
    return sound.readframes(), pyaudio.paContinue

p = pyaudio.PyAudio()
stream = p.open(format=p.get_format_from_width(4),
                channels=1,
                rate=SR,
                output=True,
                stream_callback=callback,
                frames_per_buffer=FPB)
sound = SineWave(440)
stream.start_stream()
time.sleep(2)
stream.stop_stream()
stream.close()
p.terminate()
On my system the noise appears in about 90% of the runs.
Changing
return self.data[indices]
to
return self.result
fixes it.
There shouldn't be a difference whether you store the intermediate array in the instance or not. I suspect there is some other problem going on that causes the error you're experiencing. Can you provide code that actually runs and exhibits the problem?
If you want to see a (hopefully!) working implementation, have a look at the play() function of the sounddevice module, especially at the loop argument. With this, you should be able to play your waveform repeatedly.
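For the simple looping case, that might look like this (a sketch, assuming a one-dimensional NumPy array named waveform and the SR constant from the question):

import sounddevice as sd
import time

# play the wavetable repeatedly; loop=True restarts it when it reaches the end
sd.play(waveform, SR, loop=True)
time.sleep(2)
sd.stop()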
The only thing you need to store in the instance is the frame index where you left off, just to continue with the following frame index in the next callback (and of course wrap around at the end). You should also consider the case where your waveform is shorter than an audio block. It might even be much shorter. I solved that by using a recursive function call, see the code.
UPDATE:
It's indeed quite strange that assigning to self "solves" your problem. I think this has something to do with how PyAudio interprets the returned array as Python buffer.
You should rewrite your code using the sounddevice module, which properly supports NumPy arrays and has a sane API for the callback function.
It's easy, just use this callback function:
def callback(outdata, frames, time, status):
    if status:
        print(status)
    outdata[:, 0] = sound.readframes()
And start the stream with:
import sounddevice as sd

stream = sd.OutputStream(
    channels=1, samplerate=SR, callback=callback, blocksize=FPB)
Apart from that, your code is flawed since it doesn't do any interpolation when reading from the wavetable.
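For reference, here is a minimal sketch of wavetable reading with linear interpolation. This is my own illustration, not from the original answer; it assumes the data array, frequency, SR, and FPB names from the question:

import numpy as np

def read_interpolated(data, pos, fq, frames=FPB, sr=SR):
    """Read `frames` samples from the wavetable `data` at frequency `fq`,
    starting at fractional position `pos`, using linear interpolation."""
    step = len(data) * fq / sr                   # fractional wavetable step per output sample
    phase = (pos + np.arange(frames) * step) % len(data)
    i0 = phase.astype(np.int64)                  # lower neighbouring sample
    i1 = (i0 + 1) % len(data)                    # upper neighbour, wrapping around
    frac = phase - i0                            # fractional part between the two samples
    out = (1.0 - frac) * data[i0] + frac * data[i1]
    new_pos = (pos + frames * step) % len(data)  # carry the fractional position forward
    return out.astype(np.float32), new_pos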
