Saving type information of data structures to a file

I am building a framework for my day-to-day tasks. I am programming in Scala and using a lot of type parameters.
My goal is to save data structures to files (e.g. XML files), but I realized that this is not possible with plain XML. As I am new to this kind of problem, I am asking:
Is there a way to store the types of my data structures in a file? Is there a way in Scala?

Okay, you did a great job by naming the thing I was searching for:
it's serialization.
With this in mind I searched the web and was completely astonished by this feature of Java.
Now I do something like:
object Serialize {
  // Serialize any (java.io.Serializable) object into a byte array
  // using standard Java serialization.
  def write[A](o: A): Array[Byte] = {
    val ba  = new java.io.ByteArrayOutputStream(512)
    val out = new java.io.ObjectOutputStream(ba)
    out.writeObject(o)
    out.close()
    ba.toByteArray()
  }

  // Read an object back from a byte array and cast it to the expected type.
  def read[A](buffer: Array[Byte]): A = {
    val in = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(buffer))
    in.readObject().asInstanceOf[A]
  }
}
The resulting byte arrays can be written to a file and everything works well.
And I am totally fine with the fact that this solution is not human readable. If I change my mind someday, there are JSON parsers all over the web.

Related

How do I get a dataframe or database write from TFX BulkInferrer?

I'm very new to TFX, but have an apparently-working ML Pipeline which is to be used via BulkInferrer. That seems to produce output exclusively in Protobuf format, but since I'm running bulk inference I want to pipe the results to a database instead. (DB output seems like it should be the default for bulk inference, since both Bulk Inference & DB access take advantage of parallelization... but Protobuf is a per-record, serialized format.)
I assume I could use something like Parquet-Avro-Protobuf to do the conversion (though that's in Java and the rest of the pipeline is in Python), or I could write something myself to consume all the protobuf messages one by one, convert them into JSON, deserialize the JSON into a list of dicts, and load the dicts into a Pandas DataFrame, or store them as a bunch of key-value pairs which I treat like a single-use DB... but that sounds like a lot of work and pain involving parallelization and optimization for a very common use case. The top-level Protobuf message definition is TensorFlow's PredictionLog.
This must be a common use case, because TensorFlowModelAnalytics functions like this one consume Pandas DataFrames. I'd rather be able to write directly to a DB (preferably Google BigQuery), or a Parquet file (since Parquet / Spark seems to parallelize better than Pandas), and again, those seem like they should be common use cases, but I haven't found any examples. Maybe I'm using the wrong search terms?
I also looked at the PredictExtractor, since "extracting predictions" sounds close to what I want... but the official documentation appears silent on how that class is supposed to be used. I thought TFTransformOutput sounded like a promising verb, but instead it's a noun.
I'm clearly missing something fundamental here. Is there a reason no one wants to store BulkInferrer results in a database? Is there a configuration option that allows me to write the results to a DB? Maybe I want to add a ParquetIO or BigQueryIO instance to the TFX pipeline? (The TFX docs say it uses Beam "under the hood", but that doesn't say much about how I should use them together.) The syntax in those documents also looks sufficiently different from my TFX code that I'm not sure whether they're compatible.
Help?
(Copied from the related issue for greater visibility)
After some digging, here is an alternative approach, which assumes no knowledge of the feature_spec beforehand. Do the following:
Set the BulkInferrer to write to output_examples rather than inference_result by adding an output_example_spec to the component construction.
Add a StatisticsGen and a SchemaGen component in the main pipeline right after the BulkInferrer to generate a schema for the aforementioned output_examples.
Use the artifacts from SchemaGen and BulkInferrer to read the TFRecords and do whatever is necessary.
bulk_inferrer = BulkInferrer(
    ....
    output_example_spec=bulk_inferrer_pb2.OutputExampleSpec(
        output_columns_spec=[bulk_inferrer_pb2.OutputColumnsSpec(
            predict_output=bulk_inferrer_pb2.PredictOutput(
                output_columns=[bulk_inferrer_pb2.PredictOutputCol(
                    output_key='original_label_name',
                    output_column='output_label_column_name')]))]
    ))

statistics = StatisticsGen(
    examples=bulk_inferrer.outputs.output_examples
)

schema = SchemaGen(
    statistics=statistics.outputs.output,
)
After that, one can do the following:
import tensorflow as tf
from tfx.utils import io_utils
from tensorflow_transform.tf_metadata import schema_utils

# read schema from SchemaGen
schema_path = '/path/to/schemagen/schema.pbtxt'
schema_proto = io_utils.SchemaReader().read(schema_path)
spec = schema_utils.schema_as_feature_spec(schema_proto).feature_spec

# read inferred results
data_files = ['/path/to/bulkinferrer/output_examples/examples/examples-00000-of-00001.gz']
dataset = tf.data.TFRecordDataset(data_files, compression_type='GZIP')

# parse dataset with spec
def parse(raw_record):
    return tf.io.parse_example(raw_record, spec)

dataset = dataset.map(parse)
At this point, the dataset is like any other parsed dataset, so it's trivial to write it out as a CSV, to a BigQuery table, or whatever else from there. It certainly helped us in ZenML with our BatchInferencePipeline.
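For example, here is a minimal sketch of dumping the parsed dataset to a CSV with pandas; the output path is a placeholder, and it assumes the parsed features are dense (FixedLenFeature) and small enough to hold in memory:
import pandas as pd

# Collect the parsed records into plain Python dicts of numpy values.
rows = [dict(record) for record in dataset.as_numpy_iterator()]

# Build a DataFrame and write it wherever is convenient (CSV shown here;
# a BigQuery load job over the same DataFrame would work similarly).
df = pd.DataFrame(rows)
df.to_csv('/path/to/inference_results.csv', index=False)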
Answering my own question here to document what we did, even though I think @Hamza Tahir's answer below is objectively better. This may provide an option for other situations where it's necessary to change the operation of an out-of-the-box TFX component. It's hacky, though:
We copied and edited the file tfx/components/bulk_inferrer/executor.py, replacing this transform in the _run_model_inference() method's internal pipeline:
| 'WritePredictionLogs' >> beam.io.WriteToTFRecord(
      os.path.join(inference_result.uri, _PREDICTION_LOGS_FILE_NAME),
      file_name_suffix='.gz',
      coder=beam.coders.ProtoCoder(prediction_log_pb2.PredictionLog)))
with this one:
| 'WritePredictionLogsBigquery' >> beam.io.WriteToBigQuery(
      'our_project:namespace.TableName',
      schema='SCHEMA_AUTODETECT',
      write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
      create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
      custom_gcs_temp_location='gs://our-storage-bucket/tmp',
      temp_file_format='NEWLINE_DELIMITED_JSON',
      ignore_insert_ids=True,
  )
(This works because when you import the BulkInferrer component, the per-node work gets farmed out to these executors running on the worker nodes, and TFX copies its own library onto those nodes. It doesn't copy everything from user-space libraries, though, which is why we couldn't just subclass BulkInferrer and import our custom version.)
We had to make sure the table at 'our_project:namespace.TableName' had a schema compatible with the model's output, but didn't have to translate that schema into JSON / AVRO.
In theory, my group would like to make a pull request to TFX built around this, but for now we're hard-coding a couple of key parameters and don't have the time to get this to a real public / production state.
I'm a little late to this party, but this is some code I use for this task:
import tensorflow as tf
from tensorflow_serving.apis import prediction_log_pb2
import pandas as pd
from typing import List, Text


def parse_prediction_logs(inference_filenames: List[Text]) -> pd.DataFrame:
    """
    Args:
        inference_filenames: tf.io.gfile.glob(Inferrer artifact uri)
    Returns:
        a dataframe of userids, predictions, and features
    """
    def parse_log(pbuf):
        # parse the protobuf
        message = prediction_log_pb2.PredictionLog()
        message.ParseFromString(pbuf)
        # my model produces scores and classes and I extract the top-K classes
        predictions = [x.decode() for x in (message
                                            .predict_log
                                            .response
                                            .outputs['output_2']
                                            .string_val
                                            )[:10]]
        # here I parse the input tf.train.Example proto
        inputs = tf.train.Example()
        inputs.ParseFromString(message
                               .predict_log
                               .request
                               .inputs['input_1'].string_val[0]
                               )
        # you can pull out individual features like this
        uid = inputs.features.feature["userId"].bytes_list.value[0].decode()
        feature1 = [
            x.decode() for x in inputs.features.feature["feature1"].bytes_list.value
        ]
        feature2 = [
            x.decode() for x in inputs.features.feature["feature2"].bytes_list.value
        ]
        return (uid, predictions, feature1, feature2)

    return pd.DataFrame(
        [parse_log(x) for x in
         tf.data.TFRecordDataset(inference_filenames, compression_type="GZIP").as_numpy_iterator()
         ], columns=["userId", "predictions", "feature1", "feature2"]
    )
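A minimal usage sketch (the artifact path below is a placeholder for wherever your BulkInferrer wrote its prediction logs):
# Glob the gzipped prediction-log shards and turn them into a DataFrame.
files = tf.io.gfile.glob('/path/to/bulkinferrer/inference_result/*.gz')
df = parse_prediction_logs(files)
print(df.head())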

How to get an array from an XML file in Swift

I am new to Swift, but I have made an Android app where a string array is selected from an XML file. This is a large XML file that contains a lot of string arrays, and the app gets the relevant string array based on a user selection.
I am now trying to develop the same app for iOS using Swift. I would like to use the same XML file, but I cannot see an easy way to get the correct array. For example, part of the XML looks like this:
<string-array name="OCR_Businessstudies_A_Topics">
    <item>1. Business objectives and strategic decisions</item>
    <item>2. External influences facing businesses</item>
    <item>3. Marketing and marketing strategies</item>
    <item>4. Operational strategy</item>
    <item>5. Human resources</item>
    <item>6. Accounting and financial considerations</item>
    <item>7. The global environment of business</item>
</string-array>
<string-array name="OCR_Businessstudies_AS_Topics">
    <item>1. Business objectives and strategic decisions</item>
    <item>2. External influences facing businesses</item>
    <item>3. Marketing and marketing strategies</item>
    <item>4. Operational strategy</item>
    <item>5. Human resources</item>
    <item>6. Accounting and financial considerations</item>
</string-array>
If I have the string "OCR_Businessstudies_A_Topics", how do I get the corresponding array from the XML file?
This is very straightforward in Android, and although I have followed online tutorials for Swift, it seems like I have to parse the XML file, and I do not seem to be getting anywhere.
Is there a better approach than trying to parse the whole XML file?
Thanks,
Barry
You can write your own XML parser using NSXMLParser (with a delegate conforming to NSXMLParserDelegate), or use a library like HTMLReader:
let fileURL = NSBundle.mainBundle().URLForResource("data", withExtension: "xml")!
let xmlData = NSData(contentsOfURL: fileURL)!
let topic = "OCR_Businessstudies_A_Topics"

let document = HTMLDocument(data: xmlData, contentTypeHeader: "text/xml")
for item in document.nodesMatchingSelector("string-array[name='\(topic)'] item") {
    print(item.textContent)
}

Declaring new variables on the fly?

Has anyone found a way to declare variables on the fly?
I have some variables that can be limitless...
val1
val2
val3
val4...
Is there a way to create these on the fly, something like...
for (loop through) {
    val + '1' = dosomething('1')
}
I know it won't look anything like the above, but hopefully you get the gist of it.
Apex is a compiled language, not a parsed / dynamic one. You could pull such a thing off in, say, JavaScript or PHP, but (as far as I know) not in Java. And when you get to the bottom of it, Salesforce is built on an Oracle database and Java, so Apex is kind of a thin wrapper on the Java calls they deemed most necessary.
Just out of curiosity - why would you need that?
Use Jeremy's idea with loops if you need sequential access. Or...
If you need "unique key -> some value", use Maps.
Map<String, Double> myMap = new Map<String, Double>();
for (Integer i = 1; i < 10; ++i) {
    myMap.put('someKey' + String.valueOf(i), Math.floor(Math.random() * 1000));
}
System.debug(myMap);
System.debug(myMap.get('someKey7'));
The "double" in this example can be equally Integer, Id, Account, MyCustomClass... whatever your function returns.
There's one more trick I can think of that might be helpful if your data comes from external source. It's to use JSON / XML parsers that could cast any kind of data (held in a String) to a collection of your choice. It kind of goes back to List/Map idea but it's totally up to you how would you build / retrieve this string beforehand. Read about JSON methods for a start although if you don't have a structure that follows predictable pattern you might want to check out JSON/XML parsers (click here and scroll down to the examples).
This isn't possible in Apex. You could, however, use a list:
List<String> values = new List<String>();
for (Integer i = 0; i < aList.size(); i++) {
    values.add(dosomething(i));
}

Octave select a file?

Does Octave have a good way to let the user select an input file? I've seen code like this for Matlab, but it doesn't work in Octave.
A GUI-based method would be preferred, but some sort of command-line choice would work also. It would be great if there were some way to do this that would work in both Matlab and Octave.
I found this for Matlab, but it does not work in Octave, even when you install the Octave Forge Java package for the listdlg function. In Octave, dir() gives you:
647x1 struct array containing the fields:
name
date
bytes
isdir
datenum
statinfo
but I don't know how to convert this to the array of strings that listdlg expects.
You already have the Octave Forge Java package installed, so you can create instances of any Java class and call any Java method.
For example, to create a JFileChooser and call the JFileChooser.showOpenDialog(Component parent) method:
frame = javaObject("javax.swing.JFrame");
frame.setBounds(0,0,100,100);
frame.setVisible(true);
fc = javaObject ("javax.swing.JFileChooser")
returnVal = fc.showOpenDialog(frame);
file = fc.getSelectedFile();
file.getName()
Btw, I had some trouble installing the package. Here is a fix for Ubuntu that also worked for my Debian Testing.
EDIT
@NoBugs In reply to your comment:
If you need to use listdlg, you can do the following:
d = dir;
str = {d.name};
[sel, ok] = listdlg('PromptString', 'Select a file:', ...
                    'SelectionMode', 'single', ...
                    'ListString', str);
if ok == 1
    disp(str{sel(1)});
end
This should be compatible with Matlab, but I cannot test it right now.
If you want to select multiple files, use this:
d = dir;
str = {d.name};
[sel, ok] = listdlg('PromptString', 'Select a file:', ...
                    'SelectionMode', 'multiple', ...
                    'ListString', str);
if ok == 1
    imax = length(sel);
    for i = 1:1:imax
        disp(str{sel(i)});
    end
end
I never came across an open-file dialog in Octave.
If you are looking for a GUI-based method, maybe guioctave can help you. I have never used it, because it appears to only be available for Windows machines.
A possible solution would be to write a little script in Octave that lets the user browse through the directories and select a file that way.
Thought I'd provide an updated answer to this old question, since it is appearing in the 'related questions' field for other questions.
Octave provides the uigetdir and uigetfile functions, which do what you expect.

Simple / Smart, Pythonic database solution, can use Python types + syntax? (Key / Value Dict, Array, maybe Ordered Dict)

Looking for solutions that push the envelope and:
Avoid
Manually writing SQL queries (Python can be more OO, not passing DSL strings)
Using non-Python datatypes for a supposedly required model definition
Using a new class of types rather than perfectly good native Python types
Boast
Using Python objects
Using object-oriented and key-based retrieval and creation
Quick prototyping
No SQL table to make
Model / type inference, or no model at all
Fewer lines and characters to type
Easy output to and from JSON, maybe XML or even Protocol Buffers
I do web, desktop, and mobile software development, so the more portable the better.
python
>> from someAmazingDB import *
>> db.taskList = []
>> db['taskList'].append({'title': 'Beat old sql interfaces', 'done': False})
>> db.taskList.append({'title': 'Illustrate different syntax modes', 'done': True})
# at this point it should autosave
# we should be able to reload the console and access like:
python
>> from someAmazingDB import *
>> print 'Done tasks'
>> for task in db.taskList:
>>     if task.done:
>>         print task
'Illustrate different syntax modes'
Here is the challenge: the above code should work with very little modification or thinking required, like a different import statement and maybe a little more, but Django models and SQLAlchemy DO NOT CUT IT.
I'm looking for more interesting library suggestions than just "try shelve" or "use pickle".
I'm not opposed to Python classes being used for models, but they should be really straightforward, unlike the stuff you see with Django and similar.
I was actually working on something like this earlier today. There is no readme or sufficient tests yet, but... http://github.com/mikeboers/LiteMap/blob/master/litemap.py
The LiteMap class behaves much like the builtin dict, but it persists into a SQLite database. You did not indicate which particular database you were interested in, but this could be almost trivially modified to any back end.
It also does not track changes to mutable values (e.g. appending to the list in your example), but the API is really simple.
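A hypothetical usage sketch, assuming the module is importable as litemap and the constructor takes a SQLite path (check the linked source for the actual signature):
from litemap import LiteMap

# Opens (or creates) the backing SQLite file; behaves like a dict afterwards.
store = LiteMap('tasks.sqlite')
store['task:1'] = 'Beat old sql interfaces'
store['task:2'] = 'Illustrate different syntax modes'

# Entries survive reopening the file, since every write is persisted.
print('task:1' in store)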
Database access doesn't get better than SQLAlchemy.
Care to explain what about Django's models you don't find straightforward? Here's how I'd do what you have in Django:
from django.db import models

class Task(models.Model):
    title = models.CharField(max_length=...)
    is_done = models.BooleanField()

    def __unicode__(self):
        return self.title

----

from mysite.tasks.models import Task

t = Task(title='Beat old sql interfaces', is_done=True)
t.save()

----

from mysite.tasks.models import Task

print 'Done tasks'
for task in Task.objects.filter(is_done=True):
    print task
Seems pretty straightforward to me! It also results in a slightly cleaner table/object naming scheme, IMO. The trickier part is using Django's DB module separately from the rest of Django, if that's what you're after, but it can be done (see the sketch below).
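For reference, a minimal sketch of using the ORM outside a full Django project, assuming a modern Django; the app label, model module, and database name below are placeholders:
import django
from django.conf import settings

# Configure just enough settings for the ORM, then initialise the app registry.
settings.configure(
    INSTALLED_APPS=['tasks'],  # hypothetical app that holds the Task model above
    DATABASES={'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': 'tasks.db',
    }},
)
django.setup()

from tasks.models import Task

Task.objects.create(title='Beat old sql interfaces', is_done=False)
print(Task.objects.filter(is_done=False).count())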
Using web2py:
>>> from gluon.sql import DAL, Field
>>> db = DAL('sqlite://storage.db')
>>> db.define_table('taskList', Field('title'), Field('done', 'boolean'))  # creates the table
>>> db['taskList'].insert(title='Beat old sql interfaces', done=False)
>>> db.taskList.insert(title='Beat old sql interfaces', done=False)  # same insert, attribute syntax
>>> for task in db(db.taskList.done == True).select():
...     print task.title
It supports 10 different database back-ends, plus Google App Engine.
The question looks strikingly similar to http://api.mongodb.org/python/1.9%2B/tutorial.html
So the answer is pymongo, what else ;)
from pymongo import Connection

connection = Connection('localhost', 27017)
db = connection['test-database']
tasklist = db['test-tasklist']

tasklist.insert({'title': 'Beat old sql interfaces', 'done': False})
tasklist.insert({'title': 'Illustrate different syntax modes', 'done': True})

for task in tasklist.find({'done': True}):
    print task['title']
I haven't tested the code, but it won't be very different from this.
BTW, Redis is also interesting and fun.
