How to turn a TensorFlow record into a variable?

So let's say I have converted 1000 dog JPEG images into a TensorFlow record file (Python version, not C++), and that this TFRecord file has the following path:
path = "/Users/Bill/Desktop/work/tensorflowproject1"
data = path + '/train.tfrecords'
Now the filepath to this TFRecord file is stored in the string "data".
This is where it gets sort of tricky.
for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
This is what the sample code on the TensorFlow/Google website runs on the MNIST example they provide.
But how would I get a variable that represents the training set directly, NOT just the filepath? I want a dataset within TensorFlow on which I can then run something like this:
train_df = df.sample(frac=0.6, random_state=0).sort_index()
test_df = df.drop(train_df.index)
to split up the data. I have done this a lot with arrays and dataframes, but never with images or a TensorFlow record file.
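For what it's worth, here is a minimal sketch of one way to do this with the tf.data API (recent TF 1.x or TF 2.x). The feature keys 'image_raw' and 'label' are assumptions: they have to match whatever keys were used when the TFRecord file was written.
import tensorflow as tf

data = "/Users/Bill/Desktop/work/tensorflowproject1/train.tfrecords"

# Assumed feature keys; replace them with the keys used when the file was written.
feature_description = {
    'image_raw': tf.io.FixedLenFeature([], tf.string),
    'label': tf.io.FixedLenFeature([], tf.int64),
}

def _parse(example_proto):
    parsed = tf.io.parse_single_example(example_proto, feature_description)
    image = tf.io.decode_jpeg(parsed['image_raw'])
    return image, parsed['label']

dataset = tf.data.TFRecordDataset(data).map(_parse)

# Rough analogue of df.sample(frac=0.6, random_state=0): shuffle once with a
# fixed seed, then take the first 60% of the 1000 examples for training and
# skip them for testing.
dataset = dataset.shuffle(1000, seed=0, reshuffle_each_iteration=False)
train_ds = dataset.take(600)
test_ds = dataset.skip(600)
Each element of train_ds and test_ds is then an (image, label) tensor pair that can be batched and fed to a model, instead of working with the filepath string.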

Related

Snowflake - creating an external table by using a pattern

I am trying to create an external table (xyz) in Snowflake, using a pattern to load historical files from a stage. There are multiple files, and I am using the following pattern to load files whose names start like the one below:
201802242300_5d80272d1abcd32cc7a981da083ed498.gz (Feb 24th 2018 file)
create external table xyz
(
    samplecol1 varchar as (value:samplecol1::varchar),
    samplecol2 varchar as (value:samplecol2::varchar),
    date date as to_date(substr(metadata$filename, 1, 8), 'yyyymmdd')
)
partition by (date)
location = @snowflakestage.largetable
pattern = '.*/20180224.*[_].*.gz'
file_format = (type = 'JSON');
It executes successfully but does not load any data. Is my pattern right for picking up the file name listed above?
A good way to test patterns is via the LIST command, as it takes the same PATTERN option.
Thus, for you:
LIST @snowflakestage.largetable pattern='.*/20180224.*[_].*.gz'
For example, using the Citi Bike example data, the stage contains a mix of file types (Parquet, JSON, and CSV), so if you try to load all the files at once you get errors.
create stage citibike.public.citibike_trips
url = 's3://snowflake-workshop-lab/citibike-trips';
list @citibike_trips;
name
s3://snowflake-workshop-lab/citibike-trips-parquet/2022/01/08/data_01a19496-0601-8b21-003d-9b03003c624a_3106_4_0.snappy.parquet
s3://snowflake-workshop-lab/citibike-trips-parquet/2022/01/09/data_01a19496-0601-8b21-003d-9b03003c624a_1906_6_0.snappy.parquet
s3://snowflake-workshop-lab/citibike-trips-parquet/2022/01/10/data_01a19496-0601-8b21-003d-9b03003c624a_2206_6_0.snappy.parquet
s3://snowflake-workshop-lab/citibike-trips/json/2013-06-01/data_01a304b5-0601-4bbe-0045-e8030021523e_005_7_0.json.gz
s3://snowflake-workshop-lab/citibike-trips/json/2013-06-01/data_01a304b5-0601-4bbe-0045-e8030021523e_005_7_1.json.gz
s3://snowflake-workshop-lab/citibike-trips/json/2013-06-01/data_01a304b5-0601-4bbe-0045-e8030021523e_005_7_2.json.gz
So I played around until I found a pattern that worked for the files I wanted:
list @citibike_trips pattern = '.*trips_.*csv.gz';
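If it helps to sanity-check the regex itself before running LIST, the PATTERN option is an ordinary regular expression applied to the staged file's path, so you can also try a candidate path against it locally. A small Python sketch (the path shown is hypothetical; substitute one taken from your own LIST output):
import re

pattern = r'.*/20180224.*[_].*.gz'
# Hypothetical full path of one staged file; replace it with a real path from LIST output.
candidate = 'snowflakestage/largetable/201802242300_5d80272d1abcd32cc7a981da083ed498.gz'

# Snowflake matches PATTERN against the whole path, so fullmatch is the closer
# analogue. If this prints False, check whether the leading '.*/' assumes a
# directory prefix that your files do not actually have.
print(bool(re.fullmatch(pattern, candidate)))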

How can I extract an image and a caption dataset from an h5 file?

I want to use the FashionGen dataset, which has two h5-format files for the training and validation data. The h5 file's list of datasets looks like this:
index
index_2
input_brand
input_category
input_composition
input_concat_description
input_department
input_description
input_gender
input_image
input_msrpUSD
input_name
input_pose
input_productID
input_season
input_subcategory
And I just need the "Input_image" and "Input_description" datasets. Would you mind helping me, please?
Details depend on the datasets' dtype and shape and on the Python objects to be created, but this code will get you started; review the h5py Quick Start Guide for details. Note: dataset and group names are case sensitive, so be sure to verify whether they are "Input_image" or "input_image" (your listing shows "input_image").
import h5py

with h5py.File(filename, 'r') as h5f:
    # create a NumPy array from the image dataset:
    image_arr = h5f['input_image'][:]
    # create a NumPy array from the description dataset:
    descr_arr = h5f['input_description'][:]
Note: if the datasets are too large to fit into memory, you can keep them as h5py dataset objects and reference them as if they were NumPy arrays. The code is very similar. See below:
with h5py.File(filename, 'r') as h5f:
    # create an h5py dataset object for the images:
    image_ds = h5f['input_image']
    # create an h5py dataset object for the descriptions:
    descr_ds = h5f['input_description']
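If you go the lazy route, one common pattern is to read the dataset objects in slices so that only part of the file is in memory at a time. A sketch under the same assumptions as above (dataset names as in your listing; verify the actual shapes and dtypes with .shape and .dtype):
import h5py

with h5py.File(filename, 'r') as h5f:
    image_ds = h5f['input_image']
    descr_ds = h5f['input_description']

    step = 1000  # slice size; tune it to your memory budget
    for start in range(0, image_ds.shape[0], step):
        images = image_ds[start:start + step]     # only this slice is read into memory
        captions = descr_ds[start:start + step]
        # ... hand images and captions to your training pipeline here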

Read a Struct JSON with AWS Glue that is on a single line

I have a JSON file in a bucket that has been crawled with a classifier that splits arrays into records, using the JSON path $[*].
I noticed that the JSON is on a single line - nothing wrong with the syntax - but this results in the created table having a single column of type array, containing a struct that holds the actual fields I need.
In Athena I wasn't able to access the data, and Glue was not able to read the columns as array.field, so I manually changed the structure of the table to a single struct type with the other fields inside. This I am able to query in Athena, and the Glue wizard recognises the individual columns as part of the struct.
When I create the job and map the fields accordingly (this is what is automatically generated; note the array.field notation):
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("array.col1", "long", "col1", "long"), ("array.col2", "string", "col2", "string"), ("array.col3", "string", "col3", "string")], transformation_ctx = "applymapping1")
I test the output on a table in an S3 bucket. The job does not fail at all, BUT it creates files in the bucket that are empty!
Another thing I've tried is to modify the source JSON and add line breaks:
this is before:
[{"col1":322,"col2":299,"col3":1613552400000,"col4":"TEST","col5":"TEST"},{"col1":2,"col2":0,"col3":1613552400000,"col4":"TEST","col5":"TEST"}]
this is after:
[
{"col1":322,"col2":299,"col3":1613552400000,"col4":"TEST","col5":"TEST"},
{"col1":2,"col2":0,"col3":1613552400000,"col4":"TEST","col5":"TEST"}
]
Having the file modified as shown above lets me read and write the data correctly, which led me to believe that the problem is the single-line JSON at the source. Before asking for the JSON to be changed, is there something I can implement in my Glue job (Spark 2.4, Python 3) to handle a JSON file that sits on a single line? I've searched everywhere but found nothing.
The end goal is to load the data into Redshift; we're working S3 to S3 for now to check why the data isn't being read.
Thanks in advance for your time and consideration.
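One thing that may be worth trying, as a sketch rather than a verified fix: by default Spark's JSON reader expects one JSON document per line, which is why a whole array on a single line ends up as one array-typed column; the multiLine option makes it parse the file as a single JSON document, and a top-level array then becomes one row per element. Reading the file with the Spark session inside the Glue job and converting back to a DynamicFrame could look like this (the S3 path is a placeholder):
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# multiLine tells Spark to parse the whole file as one JSON document instead of
# one document per line; each element of the top-level array becomes a row.
df = spark.read.option("multiLine", "true").json("s3://my-bucket/path/to/json/")  # placeholder path

datasource0 = DynamicFrame.fromDF(df, glue_context, "datasource0")
With that, the ApplyMapping call can reference col1, col2, ... directly instead of array.col1, array.col2, ...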

How to generate training dataset from huge binary data using TF 2.0?

I have a binary dataset of over 15 GB. I want to extract the data for model training using TF 2.0. Currently, here is what I am doing:
import numpy as np
import tensorflow as tf
data1 = np.fromfile('binary_file1', dtype='uint8')
data2 = np.fromfile('binary_file2', dtype='uint8')
dataset = tf.data.Dataset.from_tensor_slices((data1, data2))
# then do something like batch, shuffle, prefetch, etc.
for sample in dataset:
    pass
but this consumes all my memory, and I don't think it's a good way to handle such big files. What should I do instead?
You have to make use of FixedLengthRecordDataset to read binary files. Although this dataset accepts several files together, it extracts data from only one file at a time. Since you want data to be read from the two binary files simultaneously, you will have to create two FixedLengthRecordDatasets and then zip them.
Another thing to note is that you have to specify the record_bytes parameter, which states how many bytes you would like to read per iteration. The total size of each binary file has to be an exact multiple of record_bytes.
ds1 = tf.data.FixedLengthRecordDataset('file1.bin', record_bytes=1000)
ds2 = tf.data.FixedLengthRecordDataset('file2.bin', record_bytes=1000)
ds = tf.data.Dataset.zip((ds1, ds2))
ds = ds.batch(32)
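Since each record comes back as a raw byte string, a possible follow-up step (a sketch, assuming uint8 data as in the question and a record_bytes value that evenly divides each file) is to map a decode over the zipped dataset before shuffling and batching:
import tensorflow as tf

ds1 = tf.data.FixedLengthRecordDataset('binary_file1', record_bytes=1000)
ds2 = tf.data.FixedLengthRecordDataset('binary_file2', record_bytes=1000)

def _decode(rec1, rec2):
    # each element is a scalar string of record_bytes raw bytes
    return tf.io.decode_raw(rec1, tf.uint8), tf.io.decode_raw(rec2, tf.uint8)

ds = tf.data.Dataset.zip((ds1, ds2)).map(_decode)
# tf.data.AUTOTUNE exists from TF 2.4; use tf.data.experimental.AUTOTUNE on earlier 2.x
ds = ds.shuffle(1000).batch(32).prefetch(tf.data.AUTOTUNE)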

Why use an external database with Matlab?

Why should you use an external database (e.g. MySQL) when working with (large/growing) data?
I know of some projects which use SQL databases, but I can't see the advantage you get from doing this compared to just storing everything in .mat files (as described, for example, here: http://www.matlabtips.com/how-to-store-large-datasets/).
Where is this necessary? Where does this approach simplify things?
Regarding growing data, let's take an example where, on a production line, you would measure different sources with different sensors:
Experiment.Date = '2014-07-18 # 07h28';
Experiment.SensorType = 'A';
Experiment.SensorSerial = 'SENSOR-00012-A';
Experiment.SourceType = 'C';
Experiment.SourceSerial = 'SOURCE-00143-C';
Experiment.SensorPositions = 180 * linspace(0, 359, 360) / pi;
Experiment.SensorResponse = rand(1, 360);
And store these experiments on disk using .mat files:
experiment.2013-01-02.0001.mat
experiment.2013-01-02.0002.mat
experiment.2013-01-02.0003.mat
experiment.2013-01-03.0004.mat
...
experiment.2014-07-18.0001.mat
experiment.2014-07-18.0002.mat
So now, if I ask you:
"what is the typical response of sensors of type B when the source is of type E" ?
Or:
"Which sensor has best performances to measure sources of type C ? Sensors A or sensors B ?"
"How performances of these sensors degrade with time ?"
"Did modification we made last july to production line improved lifetime of sensors A ?"
Loading all these .mat files into memory to check whether the date, sensor and source type are correct, and then calculating min, mean, max responses and other statistics, is going to be very painful and time consuming, plus you have to write custom code for file selection!
Building a database on top of these .mat files can be very useful to "SELECT/JOIN/..." the elements of interest and then perform further statistics or operations.
NB: The database does not replace the .mat files (i.e. the information); it is just a practical and standard way to quickly select some of them based on conditions, without having to load and parse everything.
