(def db-sample
  {:person [{:person/id 9
             :name "rich"
             :surname "hickey"
             :join-date "04.04.2016"
             :experience :experience/lead
             :loyality-level :loyality-level/more-than-seven-years
             :work-type :work-type/tenure
             :work-time :work-time/part-time}]
   :employees/developer-team [{:frontend [[:person/id 1] [:person/id 2] [:person/id 3] [:person/id 4]]}]})
Hello everyone, I'm practicing with assoc functions, so I wanted to create a sample database using only assoc to get that practice in.
I checked its quick docs, but there is no explanation of how to create a vector and put data into it. I left an example on top; my question is: how can I create the db-sample data by using assoc functions? (Or maybe there are easier/better options?)
In practice, if you wanted to simulate a DB, it would first have to be an atom so you can update it in place. And your "adding a person" would be something like:
(def db-sample (atom {:person [] :employees/developer-team []}))
(swap! db-sample update :person #(conj % new-person))
Things can get tricky when your database is deeply nested; there are libraries for this, such as Specter. But keeping databases relatively flat is also good practice, IMHO.
I'm very new to TFX, but have an apparently-working ML Pipeline which is to be used via BulkInferrer. That seems to produce output exclusively in Protobuf format, but since I'm running bulk inference I want to pipe the results to a database instead. (DB output seems like it should be the default for bulk inference, since both Bulk Inference & DB access take advantage of parallelization... but Protobuf is a per-record, serialized format.)
I assume I could use something like Parquet-Avro-Protobuf to do the conversion (though that's in Java and the rest of the pipeline's in Python), or I could write something myself to consume all the protobuf messages one-by-one, convert them into JSON, deserialize the JSON into a list of dicts, and load the dict into a Pandas DataFrame, or store it as a bunch of key-value pairs which I treat like a single-use DB... but that sounds like a lot of work and pain involving parallelization and optimization for a very common use case. The top-level Protobuf message definition is Tensorflow's PredictionLog.
This must be a common use case, because TensorFlow Model Analysis functions like this one consume Pandas DataFrames. I'd rather be able to write directly to a DB (preferably Google BigQuery) or a Parquet file (since Parquet / Spark seems to parallelize better than Pandas), and again, those seem like they should be common use cases, but I haven't found any examples. Maybe I'm using the wrong search terms?
I also looked at the PredictExtractor, since "extracting predictions" sounds close to what I want... but the official documentation appears silent on how that class is supposed to be used. I thought TFTransformOutput sounded like a promising verb, but instead it's a noun.
I'm clearly missing something fundamental here. Is there a reason no one wants to store BulkInferrer results in a database? Is there a configuration option that allows me to write the results to a DB? Maybe I want to add a ParquetIO or BigQueryIO instance to the TFX pipeline? (The TFX docs say it uses Beam "under the hood", but that doesn't say much about how I should use them together.) The syntax in those documents looks sufficiently different from my TFX code that I'm not sure whether they're compatible.
Help?
(Copied from the related issue for greater visibility)
After some digging, here is an alternative approach, which assumes no knowledge of the feature_spec beforehand. Do the following:
Set the BulkInferrer to write to output_examples rather than inference_result by adding an output_example_spec to the component construction.
Add a StatisticsGen and a SchemaGen component in the main pipeline right after the BulkInferrer to generate a schema for the aforementioned output_examples.
Use the artifacts from SchemaGen and BulkInferrer to read the TFRecords and do whatever is necessary.
bulk_inferrer = BulkInferrer(
    ....
    output_example_spec=bulk_inferrer_pb2.OutputExampleSpec(
        output_columns_spec=[bulk_inferrer_pb2.OutputColumnsSpec(
            predict_output=bulk_inferrer_pb2.PredictOutput(
                output_columns=[bulk_inferrer_pb2.PredictOutputCol(
                    output_key='original_label_name',
                    output_column='output_label_column_name')]))]
    ))

statistics = StatisticsGen(
    examples=bulk_inferrer.outputs.output_examples
)

schema = SchemaGen(
    statistics=statistics.outputs.output,
)
After that, one can do the following:
import tensorflow as tf
from tfx.utils import io_utils
from tensorflow_transform.tf_metadata import schema_utils

# read schema from SchemaGen
schema_path = '/path/to/schemagen/schema.pbtxt'
schema_proto = io_utils.SchemaReader().read(schema_path)
spec = schema_utils.schema_as_feature_spec(schema_proto).feature_spec

# read inferred results
data_files = ['/path/to/bulkinferrer/output_examples/examples/examples-00000-of-00001.gz']
dataset = tf.data.TFRecordDataset(data_files, compression_type='GZIP')

# parse dataset with spec
def parse(raw_record):
    return tf.io.parse_example(raw_record, spec)

dataset = dataset.map(parse)
At this point, the dataset is like any other parsed dataset, so it's trivial to write it out to a CSV file, a BigQuery table, or whatever else from there. It certainly helped us in ZenML with our BatchInferencePipeline.
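For example, dumping the parsed dataset to a CSV via pandas could look roughly like this (a minimal sketch only; the output path is illustrative, and it assumes the parsed features are dense scalars):

import pandas as pd

rows = []
for record in dataset.as_numpy_iterator():
    # each record is a dict of feature name -> numpy value, per the feature_spec
    rows.append({name: value.squeeze() for name, value in record.items()})

pd.DataFrame(rows).to_csv('/path/to/output/predictions.csv', index=False)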
Answering my own question here to document what we did, even though I think @Hamza Tahir's answer below is objectively better. This may provide an option for other situations where it's necessary to change the operation of an out-of-the-box TFX component. It's hacky, though:
We copied and edited the file tfx/components/bulk_inferrer/executor.py, replacing this transform in the _run_model_inference() method's internal pipeline:
| 'WritePredictionLogs' >> beam.io.WriteToTFRecord(
    os.path.join(inference_result.uri, _PREDICTION_LOGS_FILE_NAME),
    file_name_suffix='.gz',
    coder=beam.coders.ProtoCoder(prediction_log_pb2.PredictionLog)))
with this one:
| 'WritePredictionLogsBigquery' >> beam.io.WriteToBigQuery(
    'our_project:namespace.TableName',
    schema='SCHEMA_AUTODETECT',
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    custom_gcs_temp_location='gs://our-storage-bucket/tmp',
    temp_file_format='NEWLINE_DELIMITED_JSON',
    ignore_insert_ids=True,
)
(This works because when you import the BulkInferrer component, the per-node work gets farmed out to these executors running on the worker nodes, and TFX copies its own library onto those nodes. It doesn't copy everything from user-space libraries, though, which is why we couldn't just subclass BulkInferrer and import our custom version.)
We had to make sure the table at 'our_project:namespace.TableName' had a schema compatible with the model's output, but didn't have to translate that schema into JSON / AVRO.
In theory, my group would like to make a pull request with TFX built around this, but for now we're hard-coding a couple key parameters, and don't have the time to get this to a real public / production state.
I'm a little late to this party but this is some code I use for this task:
import tensorflow as tf
from tensorflow_serving.apis import prediction_log_pb2
import pandas as pd
from typing import List, Text


def parse_prediction_logs(inference_filenames: List[Text]) -> pd.DataFrame:
    """
    Args:
        inference_filenames: tf.io.gfile.glob(Inferrer artifact uri)
    Returns:
        a dataframe of userIds, predictions, and features
    """

    def parse_log(pbuf):
        # parse the protobuf
        message = prediction_log_pb2.PredictionLog()
        message.ParseFromString(pbuf)
        # my model produces scores and classes and I extract the top-K classes
        predictions = [x.decode() for x in (message
                                            .predict_log
                                            .response
                                            .outputs['output_2']
                                            .string_val
                                            )[:10]]
        # here I parse the input tf.train.Example proto
        inputs = tf.train.Example()
        inputs.ParseFromString(message
                               .predict_log
                               .request
                               .inputs['input_1'].string_val[0]
                               )
        # you can pull out individual features like this
        uid = inputs.features.feature["userId"].bytes_list.value[0].decode()
        feature1 = [
            x.decode() for x in inputs.features.feature["feature1"].bytes_list.value
        ]
        feature2 = [
            x.decode() for x in inputs.features.feature["feature2"].bytes_list.value
        ]
        return (uid, predictions, feature1, feature2)

    return pd.DataFrame(
        [parse_log(x) for x in
         tf.data.TFRecordDataset(inference_filenames, compression_type="GZIP").as_numpy_iterator()
         ],
        columns=["userId", "predictions", "feature1", "feature2"]
    )
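Calling it might look something like this (a sketch; the artifact path is illustrative, and the function expects the gzipped prediction-log files written by the BulkInferrer):

import os

inference_uri = '/path/to/bulkinferrer/inference_result'  # BulkInferrer artifact uri
df = parse_prediction_logs(tf.io.gfile.glob(os.path.join(inference_uri, '*.gz')))
df.to_csv('predictions.csv', index=False)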
In the Hash documentation, the section on Object keys seems to imply that you can use any type as a Hash key as long as you indicate the key type, but I am having trouble when trying to use an array as the key:
> my %h{Array};
{}
> %h{[1,2]} = [3,4];
Type check failed in binding to parameter 'key'; expected Array but got Int (1)
in block <unit> at <unknown file> line 1
Is it possible to do this?
The [1,2] inside the %h{[1,2]} = [3,4] is interpreted as a slice, so it tries to assign %h{1} and %h{2}. And since the key must be an Array, that does not typecheck well, which is what the error message is telling you.
If you itemize the array, it "does" work:
my %h{Array};
%h{ $[1,2] } = [3,4];
say %h.perl; # (my Any %{Array} = ([1, 2]) => $[3, 4])
However, that probably does not get what you want, because:
say %h{ $[1,2] }; # (Any)
That's because object hashes use the value of the .WHICH method as the key in the underlying array.
say [1,2].WHICH; say [1,2].WHICH;
# Array|140324137953800
# Array|140324137962312
Note that the .WHICH values for those seemingly identical arrays are different.
That's because Arrays are mutable, as Lists can be, so that's not really going to work.
So what are you trying to achieve? If the order of the values in the array is not important, you can probably use Sets as keys:
say [1,2].Set.WHICH; say [1,2].Set.WHICH
# Set|AEA2F4CA275C3FE01D5709F416F895F283302FA2
# Set|AEA2F4CA275C3FE01D5709F416F895F283302FA2
Note that these two .WHICHes are the same. So you could maybe write this as:
my %h{Set};
dd %h{ (1,2).Set } = (3,4); # $(3, 4)
dd %h; # (my Any %{Set} = ((2,1).Set) => $(3, 4))
Hope this clarifies things. More info at: https://docs.raku.org/routine/WHICH
If you are really only interested in use of an Object Hash for some reason, refer to Liz's answer here and especially the answers to, and comments on, a similar earlier question.
The (final1) focus of this answer is a simple way to use an Array like [1,'abc',[3/4,Mu,["more",5e6],9.9],"It's {<sunny rainy>.pick} today"] as a regular string hash key.
The basic principle is use of .perl to approximate an immutable "value type" array until such time as there is a canonical immutable Positional type with a more robust value type .WHICH.
A simple way to use an array as a hash key
my %hash;
%hash{ [1,2,3].perl } = 'foo';
say %hash{ [1,2,3].perl }; # displays 'foo'
.perl converts its argument to a string of Perl 6 code that's a literal version of that argument.
say [1,2,3].perl; # displays '[1, 2, 3]'
Note how spaces have been added but that doesn't matter.
This isn't a perfect solution. You'll obviously get broken results if you mutate the array between key accesses. Less obviously you'll get broken results corresponding to any limitations or bugs in .perl:
say [my %foo{Array},42].perl; # displays '[(my Any %{Array}), 42]'
1 This is, hopefully, the end of my final final answer to your question. See my earlier 10th (!!) version of this answer for discussion of the alternative of using prefix ~ to achieve a more limited but similar effect, and/or to try to make some sense of my exchange with Liz in the comments below.
There is an attribute :organisation/ord. This is how I'm getting the data structure to pass to d/transact:
(assoc (d/pull db [:db/id] (:db/id organisation)) :organisation/ord new-org-ord)
;; => {:db/id 17592186045432, :organisation/ord 4198}
Here organisation is of type datomic.query.EntityMap and new-org-ord is an integer. This works fine but seems unwieldy. Is there simpler code that does the same job?
Thinking all I needed to do was turn the EntityMap into a real map, I tried this:
(assoc (into {} organisation) :organisation/last-invoice-ordinal new-org-ord)
But got:
:db.error/not-an-entity Unable to resolve entity: #:db{:id 17592186045433} in datom [-9223301668109597772 :organisation/timespan #:db{:id 17592186045433}]
This is simpler:
{:db/id (:db/id organisation), :organisation/ord new-org-ord}
And here's another alternative that also works:
(assoc (select-keys organisation [:db/id]) :organisation/ord new-org-ord)
It doesn't really make sense to be transacting with a map that has anything in it apart from a map-entry to identify the entity id you want to assert some new facts against, together with map-entries that represent those facts.
My app passes a json_element to different methods; its keys vary and are sometimes empty.
To handle it, I have been hard-coding the extraction with the following sample code:
def act_on_ruby_tag(json_element)
  begin
    # logger.progname = __method__
    logger.debug json_element
    code = json_element['CODE']['$'] unless json_element['CODE'].nil?
    predicate = json_element['PREDICATE']['$'] unless json_element['PREDICATE'].nil?
    replace = json_element['REPLACE-KEY']['$'] unless json_element['REPLACE-KEY'].nil?
    hash = json_element['HASH']['$'] unless json_element['HASH'].nil?
I would like to eliminate hard-coding the values, but I'm not quite sure how.
I started to think through it as follows:
keys = json_element.keys
keys.each do |k|
  set_key = k.downcase
  instance_variable_set("@" + set_key, json_element[k]['$']) unless json_element[k].nil?
end
And then use @code, for example, in the rest of the method.
I was going to try to turn this into a method and then replace all the hard-coded code.
But I wasn't entirely sure if this was a good path.
It's almost always better to return a hash structure from a method where you have things like { code: ... } rather than setting arbitrary instance variables. If you return them in a consistent container, it's easier for callers to deal with delivering that to the right location, storing it for later, or picking out what they want and discarding the rest.
It's also a good idea to try to break up one big, clunky step into a series of smaller, lighter operations. This makes the code a lot easier to follow:
def extract(json)
  json.reject do |k, v|
    v.nil?
  end.map do |k, v|
    [ k.downcase, v['$'] ]
  end.to_h
end
Then you get this:
extract(
  'TEST' => { '$' => 'value' },
  'CODE' => { '$' => 'code' },
  'NULL' => nil
)
# => {"test"=>"value", "code"=>"code"}
If you want to persist this whole thing as an instance variable, that's a fairly typical pattern, but it will have a predictable name that's not at the mercy of whatever arbitrary JSON document you're consuming.
An alternative is to hard-code the keys in a constant like:
KEYS = %w[ CODE PREDICATE ... ]
Then use that instead, or go one step further and define it in a YAML or JSON file you can read in for configuration purposes. It really depends on how often these will change, and what sort of expectations you have about the irregularity of the input.
This is a slightly more terse way to do what your original code does:
code, predicate, replace, hash = json_element.values_at(*%w{
  CODE PREDICATE REPLACE-KEY HASH
}).map { |x| x.fetch("$", nil) if x }
Looking for solutions that push the envelope and:
Avoid
Manually writing SQL queries (Python can be more OO by not passing DSL strings)
Using non-Python datatypes for a supposedly required model definition
Using a new class of types rather than perfectly good native Python types
Boast
Using Python objects
Using object-oriented and key-based retrieval and creation
Quick prototyping
No SQL table to make
Model / type inference, or no model
Fewer lines and characters to type
Easy output to and from JSON, maybe XML or even Protocol Buffers
I do web, desktop and mobile software development so the more portable the better.
python
>>> from someAmazingDB import *
>>> db.taskList = []
>>> db['taskList'].append({'title': 'Beat old sql interfaces', 'done': False})
>>> db.taskList.append({'title': 'Illustrate different syntax modes', 'done': True})
# at this point it should autosave
# we should be able to reload the console and access like:
python
>>> from someAmazingDB import *
>>> print 'Done tasks'
>>> for task in db.taskList:
...     if task.done:
...         print task
'Illustrate different syntax modes'
Here is the challenge: the above code should work with very little modification or thinking required, like a different import statement and maybe a little more, but Django Models and SQLAlchemy DO NOT CUT IT.
I'm looking for more interesting library suggestions than just "try Shelve" or "use pickle".
I'm not opposed to Python classes being used for models but they should be really straight forward, unlike the stuff you see with Django and similar.
I was actually working on something like this earlier today. There is no readme or sufficient tests yet, but... http://github.com/mikeboers/LiteMap/blob/master/litemap.py
The LiteMap class behaves much like the builtin dict, but it persists into a SQLite database. You did not indicate what particular database you were interested in, but this could be almost trivially modified to any back end.
It also does not track changes to mutable values (e.g. appending to the list in your example), but the API is really simple.
Database access doesn't get better than SQLAlchemy.
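To make the comparison with the other answers concrete, here is roughly what the asker's task list could look like with SQLAlchemy's declarative ORM (a sketch only, written against the 1.x API; the table name, column size, and SQLite URL are illustrative choices):

from sqlalchemy import create_engine, Column, Integer, String, Boolean
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Task(Base):
    __tablename__ = 'tasks'
    id = Column(Integer, primary_key=True)
    title = Column(String(200))
    done = Column(Boolean, default=False)

# create the table once and open a session
engine = create_engine('sqlite:///tasks.db')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

session.add_all([
    Task(title='Beat old sql interfaces', done=False),
    Task(title='Illustrate different syntax modes', done=True),
])
session.commit()

for task in session.query(Task).filter_by(done=True):
    print(task.title)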
Care to explain what about Django's models you don't find straightforward? Here's how I'd do what you have in Django:
from django.db import models

class Task(models.Model):
    title = models.CharField(max_length=...)
    is_done = models.BooleanField()

    def __unicode__(self):
        return self.title
----
from mysite.tasks.models import Task

t = Task(title='Beat old sql interfaces', is_done=True)
t.save()
----
from mysite.tasks.models import Task

print 'Done tasks'
for task in Task.objects.filter(is_done=True):
    print task
Seems pretty straightforward to me! Also, results in a slightly cleaner table/object naming scheme IMO. The trickier part is using Django's DB module separate from the rest of Django, if that's what you're after, but it can be done.
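For what it's worth, using the ORM outside a full Django project can be as simple as the following rough sketch (with a recent Django version; the app name 'tasks' and the SQLite file are illustrative assumptions):

import django
from django.conf import settings

settings.configure(
    DATABASES={'default': {'ENGINE': 'django.db.backends.sqlite3',
                           'NAME': 'tasks.db'}},
    INSTALLED_APPS=['tasks'],  # the app that defines the Task model above
)
django.setup()

# from here on, models from the 'tasks' app can be imported and used as usual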
Using web2py:
>>> from gluon.sql import DAL, Field
>>> db = DAL('sqlite://storage.db')
>>> db.define_table('taskList', Field('title'), Field('done', 'boolean'))  # creates the table
>>> db['taskList'].insert(title='Beat old sql interfaces', done=False)
>>> db.taskList.insert(title='Illustrate different syntax modes', done=True)
>>> for task in db(db.taskList.done == True).select():
...     print task.title
Supports 10 different database back-ends, plus Google App Engine.
The question looks strikingly similar to http://api.mongodb.org/python/1.9%2B/tutorial.html
So the answer is pymongo, what else? ;)
from pymongo import Connection

connection = Connection('localhost', 27017)
db = connection['test-database']
tasklist = db['test-tasklist']

tasklist.insert({'title': 'Beat old sql interfaces', 'done': False})
tasklist.insert({'title': 'Illustrate different syntax modes', 'done': True})

for task in tasklist.find({'done': True}):
    print task['title']
I haven't tested the code, but it won't be very different from this.
BTW, Redis is also interesting and fun.