Julia Concatenating dataframes in loop

Julia Concatenating dataframes in loop - loops

I have files in folder that I want to read all at once, bind and make new dataframe.
Code:
using CSV, DataFrames
path="somepath"
files=readdir(path)
d=DataFrame()
for file=files
x=CSV.read(file)
d=vcat(d,x)
end
Produces:
Error: UndefVarError: d not defined

You can use append! in such cases (change allowing this approach is on master but is not released yet):
d=DataFrame()
for file=files
append!(d, CSV.read(file))
end
or, if you want to use vcat (this option will use a bit more memory):
reduce(vcat, [CSV.read(file) for file in files])
The original code should be rewritten as:
d=DataFrame()
for file=files
x=CSV.read(file)
global d=vcat(d,x)
end
(note global in front of d) but this is not a recommended way to perform this operation.

Related

How do I get a dataframe or database write from TFX BulkInferrer?

I'm very new to TFX, but have an apparently-working ML Pipeline which is to be used via BulkInferrer. That seems to produce output exclusively in Protobuf format, but since I'm running bulk inference I want to pipe the results to a database instead. (DB output seems like it should be the default for bulk inference, since both Bulk Inference & DB access take advantage of parallelization... but Protobuf is a per-record, serialized format.)
I assume I could use something like Parquet-Avro-Protobuf to do the conversion (though that's in Java and the rest of the pipeline's in Python), or I could write something myself to consume all the protobuf messages one-by-one, convert them into JSON, deserialize the JSON into a list of dicts, and load the dict into a Pandas DataFrame, or store it as a bunch of key-value pairs which I treat like a single-use DB... but that sounds like a lot of work and pain involving parallelization and optimization for a very common use case. The top-level Protobuf message definition is Tensorflow's PredictionLog.
This must be a common use case, because TensorFlowModelAnalytics functions like this one consume Pandas DataFrames. I'd rather be able to write directly to a DB (preferably Google BigQuery), or a Parquet file (since Parquet / Spark seems to parallelize better than Pandas), and again, those seem like they should be common use cases, but I haven't found any examples. Maybe I'm using the wrong search terms?
I also looked at the PredictExtractor, since "extracting predictions" sounds close to what I want... but the official documentation appears silent on how that class is supposed to be used. I thought TFTransformOutput sounded like a promising verb, but instead it's a noun.
I'm clearly missing something fundamental here. Is there a reason no one wants to store BulkInferrer results in a database? Is there a configuration option that allows me to write the results to a DB? Maybe I want to add a ParquetIO or BigQueryIO instance to the TFX pipeline? (TFX docs say it uses Beam "under the hood" but that doesn't say much about how I should use them together.) But the syntax in those documents looks sufficiently different from my TFX code that I'm not sure if they're compatible?
Help?

(Copied from the related issue for greater visibility)
After some digging, here is an alternative approach, which assumes no knowledge of the feature_spec before-hand. Do the following:
Set the BulkInferrer to write to output_examples rather than inference_result by adding a output_example_spec to the component construction.
Add a StatisticsGen and a SchemaGen component in the main pipeline right after the BulkInferrer to generate a schema for the aforementioned output_examples
Use the artifacts from SchemaGen and BulkInferrer to read the TFRecords and do whatever is neccessary.
bulk_inferrer = BulkInferrer(
....
output_example_spec=bulk_inferrer_pb2.OutputExampleSpec(
output_columns_spec=[bulk_inferrer_pb2.OutputColumnsSpec(
predict_output=bulk_inferrer_pb2.PredictOutput(
output_columns=[bulk_inferrer_pb2.PredictOutputCol(
output_key='original_label_name',
output_column='output_label_column_name', )]))]
))
statistics = StatisticsGen(
examples=bulk_inferrer.outputs.output_examples
)
schema = SchemaGen(
statistics=statistics.outputs.output,
)
After that, one can do the following:
import tensorflow as tf
from tfx.utils import io_utils
from tensorflow_transform.tf_metadata import schema_utils
# read schema from SchemaGen
schema_path = '/path/to/schemagen/schema.pbtxt'
schema_proto = io_utils.SchemaReader().read(schema_path)
spec = schema_utils.schema_as_feature_spec(schema_proto).feature_spec
# read inferred results
data_files = ['/path/to/bulkinferrer/output_examples/examples/examples-00000-of-00001.gz']
dataset = tf.data.TFRecordDataset(data_files, compression_type='GZIP')
# parse dataset with spec
def parse(raw_record):
return tf.io.parse_example(raw_record, spec)
dataset = dataset.map(parse)
At this point, the dataset is like any other parsed dataset, so its trivial to write a CSV, or to a BigQuery table or whatever from there. It certainly helped us in ZenML with our BatchInferencePipeline.

Answering my own question here to document what we did, even though I think #Hamza Tahir's answer below is objectively better. This may provide an option for other situations where it's necessary to change the operation of an out-of-the-box TFX component. It's hacky though:
We copied and edited the file tfx/components/bulk_inferrer/executor.py, replacing this transform in the _run_model_inference() method's internal pipeline:
| 'WritePredictionLogs' >> beam.io.WriteToTFRecord(
os.path.join(inference_result.uri, _PREDICTION_LOGS_FILE_NAME),
file_name_suffix='.gz',
coder=beam.coders.ProtoCoder(prediction_log_pb2.PredictionLog)))
with this one:
| 'WritePredictionLogsBigquery' >> beam.io.WriteToBigQuery(
'our_project:namespace.TableName',
schema='SCHEMA_AUTODETECT',
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
custom_gcs_temp_location='gs://our-storage-bucket/tmp',
temp_file_format='NEWLINE_DELIMITED_JSON',
ignore_insert_ids=True,
)
(This works because when you import the BulkInferrer component, the per-node work gets farmed out to these executors running on the worker nodes, and TFX copies its own library onto those nodes. It doesn't copy everything from user-space libaries, though, which is why we couldn't just subclass BulkInferrer and import our custom version.)
We had to make sure the table at 'our_project:namespace.TableName' had a schema compatible with the model's output, but didn't have to translate that schema into JSON / AVRO.
In theory, my group would like to make a pull request with TFX built around this, but for now we're hard-coding a couple key parameters, and don't have the time to get this to a real public / production state.

I'm a little late to this party but this is some code I use for this task:
import tensorflow as tf
from tensorflow_serving.apis import prediction_log_pb2
import pandas as pd
def parse_prediction_logs(inference_filenames: List[Text]): -> pd.DataFrame
"""
Args:
inference files: tf.io.gfile.glob(Inferrer artifact uri)
Returns:
a dataframe of userids, predictions, and features
"""
def parse_log(pbuf):
# parse the protobuf
message = prediction_log_pb2.PredictionLog()
message.ParseFromString(pbuf)
# my model produces scores and classes and I extract the topK classes
predictions = [x.decode() for x in (message
.predict_log
.response
.outputs['output_2']
.string_val
)[:10]]
# here I parse the input tf.train.Example proto
inputs = tf.train.Example()
inputs.ParseFromString(message
.predict_log
.request
.inputs['input_1'].string_val[0]
)
# you can pull out individual features like this
uid = inputs.features.feature["userId"].bytes_list.value[0].decode()
feature1 = [
x.decode() for x in inputs.features.feature["feature1"].bytes_list.value
]
feature2 = [
x.decode() for x in inputs.features.feature["feature2"].bytes_list.value
]
return (uid, predictions, feature1, feature2)
return pd.DataFrame(
[parse_log(x) for x in
tf.data.TFRecordDataset(inference_filenames, compression_type="GZIP").as_numpy_iterator()
], columns = ["userId", "predictions", "feature1", "feature2"]
)

Check if each string from array is contained by another string array

Sorry I'am new on Ruby (just a Java programmer), I have two string arrays:
Array with file paths.
Array with patterns (can be a path or a file)
I need to check each patter over each "file path". I do with this way:
#flag = false
["aa/bb/cc/file1.txt","aa/bb/cc/file2.txt","aa/bb/dd/file3.txt"].each do |source|
["bb/cc/","zz/xx/ee"].each do |to_check|
if source.include?(to_check)
#flag = true
end
end
end
puts #flag
This code is ok, prints "true" because "bb/cc" is in source.
I have seen several posts but can not find a better way. I'm sure there should be functions that allow me to do this in fewer lines.
Is this is possible?

As mentioned by #dodecaphonic use Enumerable#any?. Something like this:
paths.any? { |s| patterns.any? { |p| s[p] } }
where paths and patterns are arrays as defined by the OP.

While that will work, that's going to have geometric scaling problems, that is it has to do N*M tests for a list of N files versus M patterns. You can optimize this a little:
files = ["aa/bb/cc/file1.txt","aa/bb/cc/file2.txt","aa/bb/dd/file3.txt"]
# Create a pattern that matches all desired substrings
pattern = Regexp.union(["bb/cc/","zz/xx/ee"])
# Test until one of them hits, returns true if any matches, false otherwise
files.any? do |file|
file.match(pattern)
end
You can wrap that up in a method if you want. Keep in mind that if the pattern list doesn't change you might want to create that once and keep it around instead of constantly re-generating it.

Populating 2D-Arrays from CSV (without m*n-Loops)

while migrating an Excel-VBA project to Visual Basic 2010, I came across a problem when populating arrays.
In Excel-VBA I would do something like
Function mtxCorrel() As Variant
mtxCorrel = wsCorr.UsedRange
End Function
to read an m*n-matrix (in this case n*n), that is conveniently stored in a worksheet, into an array for further use.
In VB2010 I obviously won't use an Excel-Worksheet as storage. csv-Files (see below) seem like a decent alternative.
I want to populate an 2d-array with the csv-contents without looping n*n-times. Let's assume I already know n=4 for demonstration purposes.
This suggests that what I want to do cant be done.
Nevertheless I still hope something like the following could work:
Function mtxCorrel() As Object
Dim array1(4, 4) As String
Using ioReader As New Microsoft.VisualBasic.FileIO.TextFieldParser("C:\cm_KoMa.csv")
With ioReader
.TextFieldType = FileIO.FieldType.Delimited
.SetDelimiters(";")
' Here I want to...
' A) ...either populate the whole 2d-array with something like
array1 = .ReadToEnd()
' B) ... or populate the array by looping its 1d-"rows"
While Not .EndOfData
array1(.LineNumber, 0)= .ReadFields()
End While
End With
End Using
return array1
End Function
Notes:
I'm mainly interested in populating the array.
I'm less interested in potential errors with determining which csv-line belongs into which 1d-"row", and also not interested in checking n.
Appendix: sample csv-File:
1;0.5;0.9;0.3
0.5;1;0.6;0.2
0.9;0.6;1;0.1
0.3;0.2;0.1;1

Python: Can you extend an array on each iteration using glob (or similar) to read in files from a directory

Is there a way to extend an array that stores data from a file on each iteration of a for-loop and with command combo, using glob. Currently, I have something like
import glob
from myfnc import func
for filename in glob.glob('*.dta'):
with open(filename,'rb') as thefile:
fileHead, data = func(thefile)
where func is defined in another script myfnc. What this does is on each iteration in the directory, stores the data from each file in fileHead and data (as arrays), erasing whatever was there on the previous iteration. What I need is something that will extend each array on each pass. Is there a nice way to do this? It doesn't need to be a for-loop, with combo. That is just how I am reading in all files from the directory.
I thought of initializing the arrays beforehand and then try extending them after the with is done on one pass, but it was giving me some kind of error with the extend command. With the error, the code would look like
import glob
from myfnc import func
fileHead, data = [0]*2
for filename in glob.glob('*.dta'):
with open(filename,'rb') as thefile:
fileHeadExtend, dataExtend = func(thefile)
fileHead.extend(fileHeadExtend)
data.extend(dataExtend)
So, the issue that it has is fileHead and data are both initialized but as int's. However, I don't want want to initialize the arrays to so many zeros. There should not be any arbitrary values in there to begin with. So, that is where issue is lying for this.

You want:
import glob
from myfnc import func
fileHead = list()
data = list()
for filename in glob.glob('*.dta'):
with open(filename,'rb') as thefile:
fileHeadExtend, dataExtend = func(thefile)
fileHead.extend(fileHeadExtend)
data.extend(dataExtend)

How do I make RGeo::Feature::Geometry methods available to RGeo::Geographic::SphericalMultiPolygonImpl?

I am using Rails 4.2 with PostGIS, rgeo and the activerecord-postgis-adapter gem on Ubuntu. I have also installed the following libraries: libgeos++-dev libgeos-3.4.2 libgeos-c1 libgeos-dbg libgeos-dev libgeos-doc libgeos-ruby1.8 ruby-geos. An RGeo::Error::UnsupportedOperation is being raised when I call contains? on an RGeo::Geographic::SphericalMultiPolygonImpl. How do I make the Feature::Geometry methods available to my RGeo::Geographic::SphericalMultiPolygonImpl?

You probably need to break up that multipolygon into pieces and run a "contains" call on each of them. I'm guessing the #contains method must be run on one polygon at a time. Here's what that operation might look like:
responses = {}
n = this_shape.num_geometries
(0..n).to_a.each do |i|
responses[i] = this_shape.geometry_n(i).contains?(other_shape)
end
Alternatively, you could break up those multipolygons into individual polygons and then run the loop on the array...

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight