How to generate training dataset from huge binary data using TF 2.0?

I have a binary dataset of over 15G. I want to extract the data for model training using TF 2.0. Currently here is what I am doing:
import numpy as np
import tensorflow as tf
data1 = np.fromfile('binary_file1', dtype='uint8')
data2 = np.fromfile('binary_file2', dtype='uint8')
dataset = tf.data.Dataset.from_tensor_slices((data1, data2))
# then do something like batch, shuffle, prefetch, etc.
for sample in dataset:
    pass
but this consumes all my memory, and I don't think it's a good way to deal with such big files. What should I do to handle this problem?

You can use FixedLengthRecordDataset to read binary files. Although this dataset accepts several files at once, it extracts data from only one file at a time. Since you want data to be read from two binary files simultaneously, you have to create two FixedLengthRecordDataset objects and then zip them.
Another thing to note is that you have to specify the record_bytes parameter, which states how many bytes you want to read per record. Your binary file's total size has to be an exact multiple of this value.
ds1 = tf.data.FixedLengthRecordDataset('file1.bin', record_bytes=1000)
ds2 = tf.data.FixedLengthRecordDataset('file2.bin', record_bytes=1000)
ds = tf.data.Dataset.zip((ds1, ds2))
ds = ds.batch(32)
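Each element produced by FixedLengthRecordDataset is a raw byte string, so you normally decode it back into numbers before training. A minimal sketch of one way to do this (assuming, as in the question, that each 1000-byte record holds 1000 uint8 values; adjust the dtype and record size to your actual format):
import tensorflow as tf

ds1 = tf.data.FixedLengthRecordDataset('file1.bin', record_bytes=1000)
ds2 = tf.data.FixedLengthRecordDataset('file2.bin', record_bytes=1000)

def decode(rec1, rec2):
    # Each record is a scalar string tensor of raw bytes;
    # decode_raw turns it back into a fixed-length uint8 vector.
    return tf.io.decode_raw(rec1, tf.uint8), tf.io.decode_raw(rec2, tf.uint8)

ds = tf.data.Dataset.zip((ds1, ds2))
ds = ds.map(decode, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds = ds.shuffle(10000).batch(32).prefetch(tf.data.experimental.AUTOTUNE)
Because the records are streamed from disk, memory use stays bounded regardless of file size.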

Related

How to Create a Relationship between two Different Columns in Neo4j

I am trying to create a relationship between two columns in Neo4j. My dataset is a CSV file whose two columns refer to co-authorship, and I want to construct a network from it. I have already loaded the data, returned it, and matched it.
Loading
load csv from 'file:///conet1.csv' as rec
return the data
create (:Guys {source: rec[0], target: rec[1]})
Now I need to construct the collaboration network by creating a relationship between the source and target columns. What do you propose for this purpose?
I was able to create a relationship between the mentioned columns with the NetworkX graph library in Python, like this:
import pandas as pd
import networkx as nx

# read_csv (not read_excel) for a CSV file; usecols selects the two columns
df = pd.read_csv('Colab.csv', usecols=['source', 'target'])
g = nx.from_pandas_edgelist(df, 'source', 'target')
If I understand your use case, I do not believe you should be creating Guys nodes just to store relationship info. Instead, the graph-oriented approach would be to create an Author node for each author and a relationship (say, of type COLLABORATED_WITH) between the co-authors.
This might work for you, or at least give you a clue:
LOAD CSV FROM 'file:///conet1.csv' AS rec
MERGE (source:Author {id: rec[0]})
MERGE (target:Author {id: rec[1]})
CREATE (source)-[:COLLABORATED_WITH]->(target)
If it is possible that the same relationship could be re-created, you should replace the CREATE with a more expensive MERGE. Also, a work can have any number of co-authors, so having a relationship between every pair may be sub-optimal depending on what you are trying to do; but that is a separate issue.

How to turn tensorflow record into a variable?

So let's say I have converted 1000 dog JPEG images into a TensorFlow record file (Python version, not C++).
Let's say this tensorflow record file has the following path
path = "/Users/Bill/Desktop/work/tensorflowproject1"
data = path + '/train.tfrecords'
Now the file path to this TFRecord file is stored in the string "data".
This is where it gets sort of tricky.
for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
This is what the sample code from the internet/Google website runs on the example they provide, "mnist".
But how would I get a variable that represents the training set directly? NOT just the filepath?
I want a dataset within tensorflow that I will then run something like this:
train_df = df.sample(frac=0.6, random_state=0).sort_index()
test_df = df.drop(train_df.index)
to split up the data. I have done this a lot with arrays or dataframes, but never with images or a TensorFlow record file.
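For reference (this is not part of the original question), one common pattern is to read the TFRecord file with the tf.data API and split it by record count. The feature names below ('image', 'label') are hypothetical and have to match whatever was written into the TFRecords:
import tensorflow as tf

def parse_example(serialized):
    # Hypothetical feature spec -- adjust names/types to how the records were written.
    features = tf.io.parse_single_example(
        serialized,
        {'image': tf.io.FixedLenFeature([], tf.string),
         'label': tf.io.FixedLenFeature([], tf.int64)})
    image = tf.io.decode_jpeg(features['image'])
    return image, features['label']

# 'data' is the TFRecord path defined above.
dataset = tf.data.TFRecordDataset(data).map(parse_example)

# Simple 60/40 train/test split by record count (1000 images in total).
train_ds = dataset.take(600)
test_ds = dataset.skip(600)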

Why use an external database with Matlab?

Why should you use an external database (e.g. MySQL) when working with (large/growing) data?
I know of some projects which use SQL databases, but I can't see the advantage you get from doing this in contrast to just storing everything in .mat files (as for example stated here: http://www.matlabtips.com/how-to-store-large-datasets/)
Where is this necessary? Where does this approach simplify things?
Regarding growing data, let's take an example where, on a production line, you would measure different sources with different sensors:
Experiment.Date = '2014-07-18 # 07h28';
Experiment.SensorType = 'A';
Experiment.SensorSerial = 'SENSOR-00012-A';
Experiment.SourceType = 'C';
Experiment.SourceSerial = 'SOURCE-00143-C';
Experiment.SensorPositions = 180 * linspace(0, 359, 360) / pi;
Experiment.SensorResponse = rand(1, 360);
And store these experiments on disk using .mat files:
experiment.2013-01-02.0001.mat
experiment.2013-01-02.0002.mat
experiment.2013-01-02.0003.mat
experiment.2013-01-03.0004.mat
...
experiment.2014-07-18.0001.mat
experiment.2014-07-18.0002.mat
So now, if I ask you:
"what is the typical response of sensors of type B when the source is of type E" ?
Or:
"Which sensor has best performances to measure sources of type C ? Sensors A or sensors B ?"
"How performances of these sensors degrade with time ?"
"Did modification we made last july to production line improved lifetime of sensors A ?"
Loading all these .mat files into memory to check whether the date, sensor type, and source type are correct, and then computing min, mean, max responses and other statistics, is going to be very painful and time-consuming, in addition to requiring custom code for file selection!
Building a database on top of these .mat files can be very useful for "SELECT/JOIN/..." queries on the elements of interest, before performing further statistics or operations, as sketched below.
NB: The database does not replace the .mat files (i.e. the information); it is just a practical and standard way to quickly select some of them based on conditions, without having to load and parse everything.
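As a minimal sketch of the idea (Python with sqlite3 here purely for illustration; the table and column names are made up, and any database would do), you index each .mat file's metadata once and then select only the relevant files to load:
import sqlite3

conn = sqlite3.connect("experiments.db")
conn.execute("""CREATE TABLE IF NOT EXISTS experiment (
    date        TEXT,
    sensor_type TEXT,
    source_type TEXT,
    mat_file    TEXT)""")

# Register each .mat file's metadata once, when the experiment is stored.
conn.execute("INSERT INTO experiment VALUES (?, ?, ?, ?)",
             ("2014-07-18", "A", "C", "experiment.2014-07-18.0001.mat"))
conn.commit()

# Later: select only the files relevant to a question, then load just those .mat files.
rows = conn.execute(
    "SELECT mat_file FROM experiment WHERE sensor_type = ? AND source_type = ?",
    ("B", "E")).fetchall()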

Storing R Objects in a relational database

I frequently create nonparametric statistics (loess, kernel densities, etc) on data I pull out of a relational database. To make data management easier I would like to store R output back inside my DB. This is easy with simple data frames of numbers or text, but I have not figured out how to store R objects back in my relational database. So is there a way to store a vector of kernel densities, for example, back into a relational database?
Right now I work around this by saving the R objects to a network drive space so others can load the objects as needed.
Use the serialization feature to turn any R object into a (raw or character) string, then store that string. See help(serialize).
Reverse this for retrieval: get the string, then unserialize() it back into an R object.
An example R variable that's fairly complex:
library(nlme)
model <- lme(uptake ~ conc + Treatment, CO2, random = ~ 1 | Plant / Type)
The best database storage method for R variables depends upon how you want to use them.
I need to do in-database analytics on the values
In this case, you need to break the object down into values that the database can handle natively. This usually means converting it into one or more data frames. The easiest way to do this is to use the broom package.
library(broom)
coefficients_etc <- tidy(model)
model_level_stats <- glance(model)
row_level_stats <- augment(model)
I just want storage
In this case you want to serialize your R variables, that is, convert them to a string or a binary blob. There are several methods for this.
My data has to be accessible by programs other than R, and needs to be human-readable
You should store your data in a cross-platform text format; probably JSON or YAML. JSON doesn't support some important concepts like Inf; YAML is more general but the support in R isn't as mature. XML is also possible, but is too verbose to be useful for storing large arrays.
library(RJSONIO)
model_as_json <- toJSON(model)
nchar(model_as_json) # 17916
library(yaml)
# yaml package doesn't yet support conversion of language objects,
# so preprocessing is needed
model2 <- within(
model,
{
call <- as.character(call)
terms <- as.character(terms)
}
)
model_as_yaml <- as.yaml(model2)
nchar(model_as_yaml) # 14493
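To illustrate the cross-language point (this is not part of the original answer): once the JSON string has been stored, any language can parse it. For example, assuming the string was written to a hypothetical file model.json (or fetched from a database column), in Python:
import json

# model.json is a hypothetical file holding the JSON string produced above
with open("model.json") as f:
    model_dict = json.load(f)   # plain nested dicts/lists, no R required

print(model_dict.keys())        # coefficients, fitted values, etc.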
My data has to be accessible by programs other than R, and doesn't need to be human-readable
You could write your data to an open, cross-platform binary format like HDF5. Currently support for HDF5 files (via rhdf5) is limited, so complex objects are not supported. (You'll probably need to unclass everything.)
library(rhdf5)
h5save(rapply(model2, unclass, how = "replace"), file = "model.h5")
bin_h5 <- readBin("model.h5", "raw", 1e6)
length(bin_h5) # 88291 not very efficient in this case
The feather package lets you save data frames in a format readable by both R and Python. To use this, you would first have to convert the model object into data frames, as described in the broom section earlier in the answer.
library(feather)
library(broom)
write_feather(augment(model), "co2_row.feather") # 5474 bytes
write_feather(tidy(model), "co2_coeff.feather") # 2093 bytes
write_feather(glance(model), "co2_model.feather") # 562 bytes
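As a quick illustration of the R/Python interoperability claim (again, not part of the original answer), the same files can be read back from Python with pandas:
import pandas as pd

coeffs = pd.read_feather("co2_coeff.feather")  # the file written by write_feather() above
print(coeffs.head())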
Another alternative is to save a text version of the variable (see previous section) to a zipped file and store its bytes in the database.
writeLines(model_as_json, "model.txt")
tar("model.tar.bz", "model.txt", compression = "bzip2")
bin_bzip <- readBin("model.tar.bz", "raw", 1e6)
length(bin_bzip) # only 42 bytes!
My data only needs to be accessible by R, and needs to be human-readable
There are two options for turning a variable into a string: serialize and deparse.
p <- function(x)
{
  paste0(x, collapse = "\n")
}
serialize needs to be sent to a text connection, and rather than writing to file, you can write to the console and capture it.
model_serialized <- p(capture.output(serialize(model, stdout())))
nchar(model_serialized) # 23830
Use deparse with control = "all" to maximise the reversibility when re-parsing later.
model_deparsed <- p(deparse(model, control = "all"))
nchar(model_deparsed) # 22036
My data only needs to be accessible by R, and doesn't need to be human-readable
The same sorts of techniques shown in the previous sections can be applied here. You can zip a serialized or deparsed variable and re-read it as a raw vector.
serialize can also write variables in a binary format. In this case, it is most easily used with its wrapper saveRDS.
saveRDS(model, "model.rds")
bin_rds <- readBin("model.rds", "raw", 1e6)
length(bin_rds) # 6350
For SQLite (and possibly others):
CREATE TABLE data (blob BLOB);
Now in R:
RSQLite::dbGetQuery(db.conn, 'INSERT INTO data VALUES (:blob)', params = list(blob = list(serialize(some_object, NULL))))
Note the list wrapper around the serialized object. The output of serialize is a raw vector. Without list, the INSERT statement would be executed once for each vector element. Wrapping it in a list allows RSQLite::dbGetQuery to see it as one element.
To get the object back from the database:
some_object <- unserialize(RSQLite::dbGetQuery(db.conn, 'SELECT blob FROM data LIMIT 1')$blob[[1]])
What happens here is you take the field blob (which is a list since RSQLite doesn't know how many rows will be returned by the query). Since LIMIT 1 assures only 1 row is returned, we take it with [[1]], which is the original raw vector. Then you need to unserialize the raw vector to get your object.
Using textConnection / saveRDS / readRDS is perhaps the most versatile and high-level approach:
zz<-textConnection('tempConnection', 'wb')
saveRDS(myData, zz, ascii = T)
TEXT<-paste(textConnectionValue(zz), collapse='\n')
#write TEXT into SQL
...
closeAllConnections() #if the connection persists, new data will be appended
#reading back:
#1. pull from SQL into queryResult
...
#2. recover the object
recoveredData <- readRDS(textConnection(queryResult$TEXT))
[100% WORKING - 27 Feb 2020]
Description:
Here are the steps if you want to store your model in a Postgres table, then query it and load it back. An important part is ascii = TRUE, without which serializing would produce errors.
db <- pgsql_connect #connection to your database
serialized_model <- rawToChar(serialize(model_fit, NULL, ascii=TRUE))
insert_query <-'INSERT INTO table (model) VALUES ($1)'
rs <- dbSendQuery(db, insert_query, list(serialized_model))
dbClearResult(rs)
serialized_model <- dbGetQuery(db, "select model from table order by created_at desc limit 1")
model_fit2 <- unserialize(charToRaw(as.character(serialized_model[,c('model')])))
model_fit2

key-value store for time series data?

I've been using SQL Server to store historical time series data for a couple hundred thousand objects, observed about 100 times per day. I'm finding that queries (give me all values for object XYZ between time t1 and time t2) are too slow (for my needs, slow is more than a second). I'm indexing by timestamp and object ID.
I've entertained the thought of using something like a key-value store such as MongoDB instead, but I'm not sure if this is an "appropriate" use of this sort of thing, and I couldn't find any mention of using such a database for time series data. Ideally, I'd be able to do the following queries:
retrieve all the data for object XYZ between time t1 and time t2
do the above, but return one data point per day (first, last, closest to time t...)
retrieve all data for all objects for a particular timestamp
The data should be ordered, and ideally it should be fast to write new data as well as to update existing data.
It seems like my desire to query by object ID as well as by timestamp might necessitate having two copies of the database indexed in different ways to get optimal performance... Does anyone have any experience building a system like this, with a key-value store, HDF5, or something else? Or is this totally doable in SQL Server and I'm just not doing it right?
It sounds like MongoDB would be a very good fit. Updates and inserts are super fast, so you might want to create a document for every event, such as:
{
    object: XYZ,
    ts: new Date()
}
Then you can index the ts field and queries will also be fast. (By the way, you can create multiple indexes on a single collection.)
How to do your three queries:
retrieve all the data for object XYZ between time t1 and time t2
db.data.find({object : XYZ, ts : {$gt : t1, $lt : t2}})
do the above, but return one data point per day (first, last, closest to time t...)
// first
db.data.find({object : XYZ, ts : {$gt : new Date(/* start of day */)}}).sort({ts : 1}).limit(1)
// last
db.data.find({object : XYZ, ts : {$lt : new Date(/* end of day */)}}).sort({ts : -1}).limit(1)
For closest to some time, you'd probably need a custom JavaScript function, but it's doable.
retrieve all data for all objects for a particular timestamp
db.data.find({ts : timestamp})
Feel free to ask on the user list if you have any questions, someone else might be able to think of an easier way of getting closest-to-a-time events.
This is why databases specific to time series data exist - relational databases simply aren't fast enough for large time series.
I've used Fame quite a lot at investment banks. It's very fast, but I imagine very expensive. However, if your application requires the speed, it might be worth looking into.
There is an open-source time series database under active development (.NET only for now) that I wrote. It can store massive amounts (terabytes) of uniform data in a "binary flat file" fashion. All usage is stream-oriented (forward or reverse). We actively use it for stock tick storage and analysis at our company.
I am not sure this will be exactly what you need, but it will allow you to get the first two points - get values from t1 to t2 for any series (one series per file) or just take one data point.
https://code.google.com/p/timeseriesdb/
// Create a new file for MyStruct data.
// Use BinCompressedFile<,> for compressed storage of deltas.
using (var file = new BinSeriesFile<UtcDateTime, MyStruct>("data.bts"))
{
    file.UniqueIndexes = true;   // enforces index uniqueness
    file.InitializeNewFile();    // create the file and write the header
    file.AppendData(data);       // append data (stream of ArraySegment<>)
}

// Read the needed data.
using (var file = (IEnumerableFeed<UtcDateTime, MyStruct>) BinaryFile.Open("data.bts", false))
{
    // Enumerate one item at a time, at most 10 items, starting at 2011-1-1
    // (can also get one segment at a time with StreamSegments).
    foreach (var val in file.Stream(new UtcDateTime(2011, 1, 1), maxItemCount: 10))
        Console.WriteLine(val);
}
I recently tried something similar in F#. I started with the 1-minute bar format for the symbol in question, in a space-delimited file which has roughly 80,000 1-minute bar readings. The code to load and parse from disk was under 1 ms. The code to calculate a 100-minute SMA for every period in the file was 530 ms. Once calculated, I can pull any slice I want from the SMA sequence in under 1 ms. I am just learning F#, so there are probably ways to optimize. Note this was after multiple test runs, so the data was already in the Windows file cache, but even when loaded from disk it never adds more than 15 ms to the load.
date,time,open,high,low,close,volume
01/03/2011,08:00:00,94.38,94.38,93.66,93.66,3800
To reduce the recalculation time, I save the entire calculated indicator sequence to disk in a single \n-delimited file, and it generally takes less than 0.5 ms to load and parse when it is in the Windows file cache. Simple iteration across the full time series data to return the set of records inside a date range is a sub-3 ms operation with a full year of 1-minute bars. I also keep the daily bars in a separate file, which loads even faster because of the lower data volume.
I use the .NET 4 System.Runtime.Caching layer to cache the serialized representation of the pre-calculated series, and with a couple of gigs of RAM dedicated to the cache I get nearly a 100% cache hit rate, so access to any pre-computed indicator set for any symbol generally runs under 1 ms.
Pulling any slice of data I want from the indicator typically takes less than 1 ms, so advanced queries simply do not make sense. Using this strategy I could easily load 10 years of 1-minute bars in less than 20 ms.
// Parse a \n-delimited file into RAM, then
// split each line on spaces into an
// array of tokens. Return the entire array
// as string[][].
let readSpaceDelimFile fname =
    System.IO.File.ReadAllLines(fname)
    |> Array.map (fun line -> line.Split [|' '|])

// From the two-dimensional array,
// pull out the single column for the bar
// close, convert every value
// in every row to a float,
// and return the array of floats.
let GetArrClose(tarr : string[][]) =
    [| for aLine in tarr do
           //printfn "aLine=%A" aLine
           let closep = float(aLine.[5])
           yield closep
    |]
I use HDF5 as my time series repository. It has a number of effective and fast compression styles which can be mixed and matched. It can be used with a number of different programming languages.
I use boost::date_time for the timestamp field.
In the financial realm, I then create specific data structures for each of bars, ticks, trades, quotes, ...
I created a number of custom iterators and used standard template library features to be able to efficiently search for specific values or ranges of time-based records.
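For comparison only (this is not the answerer's C++ setup), here is a minimal sketch of the same idea in Python with h5py; the dataset layout and field names are invented for illustration:
import numpy as np
import h5py

bar = np.dtype([("ts", "i8"), ("open", "f8"), ("high", "f8"),
                ("low", "f8"), ("close", "f8"), ("volume", "i8")])

new = np.zeros(100, dtype=bar)                # pretend these are freshly collected bars

with h5py.File("bars.h5", "w") as f:
    # One chunked, compressed, extendable dataset per symbol.
    ds = f.create_dataset("bars/AAPL", shape=(0,), maxshape=(None,),
                          dtype=bar, chunks=True, compression="gzip")
    start = ds.shape[0]
    ds.resize(start + len(new), axis=0)       # grow the dataset, then append at the end
    ds[start:] = new

with h5py.File("bars.h5", "r") as f:
    ds = f["bars/AAPL"]
    ts = ds["ts"]                             # read just the timestamp column
    t1, t2 = 0, 10**18                        # query bounds, same units as ts
    lo, hi = np.searchsorted(ts, [t1, t2])    # assumes ts is stored in sorted order
    window = ds[lo:hi]                        # all bars between t1 and t2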
