I have some data (orderbook data for a crypto token) it looks like this:
'exchange': 'BYBIT'
'symbol': 'BTC-USDT-PERP'
'book': {'bid': [Decimal], 'ask': [Decimal]}
'timestamp': 1664773747.197
The bid and ask in the book are long lists of Decimal values, and this is the part I am unsure about.
I have tried to build a payload and send it to InfluxDB with client.write_api().write(bucket=data['exchange'], org='mydb', record=json_payload).
How do I write this to InfluxDB in an efficient way?
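For reference, here is a minimal sketch of one way such a payload could be built with the influxdb-client Python package; the measurement name 'orderbook', the choice to store only the top-of-book levels as fields, and the SYNCHRONOUS write mode are all assumptions for illustration, not something the question specifies:
from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url='http://localhost:8086', token='my-token', org='mydb')
write_api = client.write_api(write_options=SYNCHRONOUS)

def orderbook_to_point(data):
    # Hypothetical mapping: tags for exchange/symbol, fields for the best bid/ask.
    # Full depth could instead be written as one field per level (bid_0, bid_1, ...).
    return (
        Point('orderbook')
        .tag('exchange', data['exchange'])
        .tag('symbol', data['symbol'])
        .field('best_bid', float(data['book']['bid'][0]))
        .field('best_ask', float(data['book']['ask'][0]))
        .time(int(data['timestamp'] * 1e9), WritePrecision.NS)
    )

write_api.write(bucket=data['exchange'], org='mydb', record=orderbook_to_point(data))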
I'm very new to TFX, but have an apparently-working ML Pipeline which is to be used via BulkInferrer. That seems to produce output exclusively in Protobuf format, but since I'm running bulk inference I want to pipe the results to a database instead. (DB output seems like it should be the default for bulk inference, since both Bulk Inference & DB access take advantage of parallelization... but Protobuf is a per-record, serialized format.)
I assume I could use something like Parquet-Avro-Protobuf to do the conversion (though that's in Java and the rest of the pipeline's in Python), or I could write something myself to consume all the protobuf messages one-by-one, convert them into JSON, deserialize the JSON into a list of dicts, and load the dict into a Pandas DataFrame, or store it as a bunch of key-value pairs which I treat like a single-use DB... but that sounds like a lot of work and pain involving parallelization and optimization for a very common use case. The top-level Protobuf message definition is Tensorflow's PredictionLog.
This must be a common use case, because TensorFlowModelAnalytics functions like this one consume Pandas DataFrames. I'd rather be able to write directly to a DB (preferably Google BigQuery), or a Parquet file (since Parquet / Spark seems to parallelize better than Pandas), and again, those seem like they should be common use cases, but I haven't found any examples. Maybe I'm using the wrong search terms?
I also looked at the PredictExtractor, since "extracting predictions" sounds close to what I want... but the official documentation appears silent on how that class is supposed to be used. I thought TFTransformOutput sounded like a promising verb, but instead it's a noun.
I'm clearly missing something fundamental here. Is there a reason no one wants to store BulkInferrer results in a database? Is there a configuration option that allows me to write the results to a DB? Maybe I want to add a ParquetIO or BigQueryIO instance to the TFX pipeline? (TFX docs say it uses Beam "under the hood" but that doesn't say much about how I should use them together.) But the syntax in those documents looks sufficiently different from my TFX code that I'm not sure if they're compatible?
Help?
(Copied from the related issue for greater visibility)
After some digging, here is an alternative approach, which assumes no knowledge of the feature_spec beforehand. Do the following:
Set the BulkInferrer to write to output_examples rather than inference_result by adding an output_example_spec to the component construction.
Add a StatisticsGen and a SchemaGen component in the main pipeline right after the BulkInferrer to generate a schema for the aforementioned output_examples.
Use the artifacts from SchemaGen and BulkInferrer to read the TFRecords and do whatever is necessary.
bulk_inferrer = BulkInferrer(
    ....
    output_example_spec=bulk_inferrer_pb2.OutputExampleSpec(
        output_columns_spec=[bulk_inferrer_pb2.OutputColumnsSpec(
            predict_output=bulk_inferrer_pb2.PredictOutput(
                output_columns=[bulk_inferrer_pb2.PredictOutputCol(
                    output_key='original_label_name',
                    output_column='output_label_column_name', )]))]
    ))
statistics = StatisticsGen(
    examples=bulk_inferrer.outputs.output_examples
)
schema = SchemaGen(
    statistics=statistics.outputs.output,
)
After that, one can do the following:
import tensorflow as tf
from tfx.utils import io_utils
from tensorflow_transform.tf_metadata import schema_utils
# read schema from SchemaGen
schema_path = '/path/to/schemagen/schema.pbtxt'
schema_proto = io_utils.SchemaReader().read(schema_path)
spec = schema_utils.schema_as_feature_spec(schema_proto).feature_spec
# read inferred results
data_files = ['/path/to/bulkinferrer/output_examples/examples/examples-00000-of-00001.gz']
dataset = tf.data.TFRecordDataset(data_files, compression_type='GZIP')
# parse dataset with spec
def parse(raw_record):
    return tf.io.parse_example(raw_record, spec)
dataset = dataset.map(parse)
At this point, the dataset is like any other parsed dataset, so it's trivial to write it out as a CSV, to a BigQuery table, or to whatever sink you need from there. It certainly helped us in ZenML with our BatchInferencePipeline.
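For illustration, a minimal sketch of dumping the parsed dataset to a CSV with pandas, assuming the features in spec are fixed-length (dense) so .numpy() works directly; variable-length features would need tf.sparse.to_dense first, and the output path is a placeholder:
import pandas as pd

rows = []
for example in dataset:
    # each element after parsing is a dict of feature name -> tensor
    rows.append({name: tensor.numpy() for name, tensor in example.items()})

pd.DataFrame(rows).to_csv('/path/to/parsed_results.csv', index=False)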
Answering my own question here to document what we did, even though I think @Hamza Tahir's answer below is objectively better. This may provide an option for other situations where it's necessary to change the operation of an out-of-the-box TFX component. It's hacky, though:
We copied and edited the file tfx/components/bulk_inferrer/executor.py, replacing this transform in the _run_model_inference() method's internal pipeline:
| 'WritePredictionLogs' >> beam.io.WriteToTFRecord(
    os.path.join(inference_result.uri, _PREDICTION_LOGS_FILE_NAME),
    file_name_suffix='.gz',
    coder=beam.coders.ProtoCoder(prediction_log_pb2.PredictionLog)))
with this one:
| 'WritePredictionLogsBigquery' >> beam.io.WriteToBigQuery(
    'our_project:namespace.TableName',
    schema='SCHEMA_AUTODETECT',
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    custom_gcs_temp_location='gs://our-storage-bucket/tmp',
    temp_file_format='NEWLINE_DELIMITED_JSON',
    ignore_insert_ids=True,
)
(This works because when you import the BulkInferrer component, the per-node work gets farmed out to these executors running on the worker nodes, and TFX copies its own library onto those nodes. It doesn't copy everything from user-space libraries, though, which is why we couldn't just subclass BulkInferrer and import our custom version.)
We had to make sure the table at 'our_project:namespace.TableName' had a schema compatible with the model's output, but didn't have to translate that schema into JSON / AVRO.
In theory, my group would like to make a pull request to TFX built around this, but for now we're hard-coding a couple of key parameters and don't have the time to get it to a real public / production state.
I'm a little late to this party but this is some code I use for this task:
import tensorflow as tf
from tensorflow_serving.apis import prediction_log_pb2
import pandas as pd
from typing import List, Text

def parse_prediction_logs(inference_filenames: List[Text]) -> pd.DataFrame:
    """
    Args:
        inference_filenames: tf.io.gfile.glob(Inferrer artifact uri)
    Returns:
        a dataframe of userIds, predictions, and features
    """
    def parse_log(pbuf):
        # parse the protobuf
        message = prediction_log_pb2.PredictionLog()
        message.ParseFromString(pbuf)
        # my model produces scores and classes and I extract the top-K classes
        predictions = [x.decode() for x in (message
                                            .predict_log
                                            .response
                                            .outputs['output_2']
                                            .string_val)[:10]]
        # here I parse the input tf.train.Example proto
        inputs = tf.train.Example()
        inputs.ParseFromString(message
                               .predict_log
                               .request
                               .inputs['input_1'].string_val[0])
        # you can pull out individual features like this
        uid = inputs.features.feature["userId"].bytes_list.value[0].decode()
        feature1 = [
            x.decode() for x in inputs.features.feature["feature1"].bytes_list.value
        ]
        feature2 = [
            x.decode() for x in inputs.features.feature["feature2"].bytes_list.value
        ]
        return (uid, predictions, feature1, feature2)

    return pd.DataFrame(
        [parse_log(x) for x in
         tf.data.TFRecordDataset(inference_filenames, compression_type="GZIP").as_numpy_iterator()],
        columns=["userId", "predictions", "feature1", "feature2"]
    )
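A short usage sketch for the function above; the artifact path and output filename are placeholders:
# glob the BulkInferrer output files and build the DataFrame
files = tf.io.gfile.glob('/path/to/bulkinferrer/inference_result/*.gz')
df = parse_prediction_logs(files)
df.to_csv('/path/to/predictions.csv', index=False)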
I am posting the following object
{
    skillName : "Professional Skills",
    _id : {$oid: "5adf23946ab671bf6cb36aff"}
}
to the Django service given below:
@csrf_exempt
@api_view(['GET', 'POST'])
def saveSubjectView(request):  # this service will add & update Subject
    if request.method == 'POST':
        try:
            stream = StringIO(request.body)
            subject = JSONParser().parse(stream)
            print("The subject is ")
            pp.pprint(subject)
            serializedsubject = json.loads(json_util.dumps(subject))
            print("serializedsubject")
            pp.pprint(serializedsubject)
The output that I am getting is
'skillType': { u'_id': { }, u'skillName': u'Professional Skills'}
The ObjectId posted from the front end (AngularJS) is not printed in the service. I know that I can fix it by removing the $oid while posting from the AngularJS application, but I would like to know why the ObjectId is not coming through. I have searched the documentation and couldn't find a proper answer; maybe the keywords I used are wrong. Keywords used: "JSON serialisation of ObjectId", "$oid json serialization using Django".
Exactly. $oid, or anything prefixed with $, is an internal, reserved format, so you cannot post such field names as-is. The convention comes from MongoDB Extended JSON, where these prefixes identify the BSON type for proper conversion and serve as a serializable transport, since those "types" are not supported in plain JSON.
So the solution is to actually use bson.json_util to "deserialize" the JSON string right from the start:
from bson import json_util
# serializedsubject = json.loads(json_util.dumps(subject))
serializedsubject = json_util.loads(request.body) # correct usage
Or more succinctly self contained:
input = '{ "skillName" : "Professional Skills" ,"_id" : { "$oid": "5adf23946ab671bf6cb36aff"} }'
json_util.loads(input)
Returns
{u'skillName': u'Professional Skills', u'_id': ObjectId('5adf23946ab671bf6cb36aff')}
This correctly casts objects from any keys notated with the Extended JSON syntax to their correct BSON types, as also supported by the driver functions. Naturally, the driver will then convert back to BSON when sending to MongoDB.
If for some reason your request.body contains anything other than a "string" that is valid input for the function, then it is up to your code to get it to that point. But there should be no need to "parse to JSON" and then "stringify" again just to feed the function.
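For completeness, a small round-trip sketch with the same document, showing that json_util also serializes back to Extended JSON (the expected output is shown in the comments):
from bson import json_util

# Extended JSON string -> Python dict containing a real ObjectId
doc = json_util.loads('{ "skillName": "Professional Skills", "_id": { "$oid": "5adf23946ab671bf6cb36aff" } }')
print(doc['_id'])            # ObjectId('5adf23946ab671bf6cb36aff')

# ... and back to Extended JSON for transport to the client
print(json_util.dumps(doc))  # {"skillName": "Professional Skills", "_id": {"$oid": "5adf23946ab671bf6cb36aff"}}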
NOTE: If you have not already done so on the JavaScript client side of the application, there is also a bson package available. It lets you translate Extended JSON received from the server into BSON types as JavaScript objects, and of course serialize such objects back into the Extended JSON format.
This is in fact recommended wherever "type" information needs to be maintained in the data transmitted between client and server.
I am trying to get the rates from this website.
So I connect with (website = Faraday.get('https://bitpay.com/api/rates')).status == 200 and then try to parse the response.
A segment of the response I get is:
#<Faraday::Response:0x007fcf1ce25688
#env=
#<struct Faraday::Env
method=:get,
body=
"[{\"code\":\"BTC\",\"name\":\"Bitcoin\",\"rate\":1}, {\"code\":\"USD\",\"name\":\"US Dollar\",\"rate\":586.66},{\"code\":\"EUR\",\"name\":\"Eurozone Euro\",\"rate\":528.991322},{\"code\":\"GBP\",\"name\":\"Pound Sterling\",\"rate\":449.441986},{\"code\":\"JPY\",\"name\":\"Japanese Yen\",\"rate\":59907.95922},{\"code\":\"CAD\",\"name\"
When I call website.body I get a String containing all these values from that site. I want to parse them (as JSON?) so that I can get each rate as a float.
I tried something like JSON.parse(website.body)["GBP"]["rate"].to_f, but again it doesn't work on the parsed data.
The error I get is TypeError: no implicit conversion of String into Integer.
I was handling a similar (but not identical) format from a different rates website this way. Do I need to change the format first, or is there a different way around it?
You're trying to access the parsed JSON with the key "GBP", but what you have is an array. It's as if you did
a = [1,2,3,4,5]
a['foo']
Try out
currencies = JSON.parse(website.body)
currencies.each { |currency| puts currency['rate'] }
and adapt it as you need.
Heyhou
I'm trying to add some 3D models (buildings) in JSON format to Firebase from a Node.js server, reading the data from a JSON file. I'm using push() to add a key under buildings.
var buildingsRef = new Firebase('https://test.firebaseIO-demo.com/buildings');
buildingsRef.push({ ... });
Unfortunately, the push is very slow and it takes approx. 30 seconds to insert the JSON into Firebase. The JSON object looks like the following:
{
    geometries: {
        vertices: [...],
        faces: [...],
        normals: [....]
    },
    materials: {
    },
    object: {
    },
    metadata: {
    }
}
The geometries object contains arrays for vertices, faces, normals, UVs, etc. These arrays may have 10'000 entries or more, depending on the complexity of the 3D model.
I'm not sure whether the large JSON itself (the file on disk is about 2 MB) or the array representation is the cause of the slow insertion into Firebase. I suspect it has something to do with how Firebase internally represents arrays.
Is there any way to optimize this? I'd like to store the whole building under one key, so I can check in the front end of my application whether the building has been replaced (usually by the server side). I don't need to modify the individual arrays; I'd just like to be able to swap out a building for a different one and let my front end update accordingly.
Thanks for your help!
In JMeter I want to check the number of objects in a JSON array, which I receive from the server.
For example, on a certain request I expect an array with 5 objects.
[{...},{...},{...},{...},{...}]
After reading this: count members with jsonpath?, I tried using the following JSON Path Assertion:
JSON Path: $
Expected value: hasSize(5)
Validate against expected value = checked
However, this doesn't seem to work properly. When I actually do receive 5 objects in the array, the response assertion says it doesn't match.
What am I doing wrong?
Or how else can I do this?
Although the JSONPath Extractor doesn't provide a hasSize function, this can still be done.
Given the example JSON from the answer by PMD UBIK-INGENIERIE, you can get the number of matches on the book array in at least two ways:
1. Easiest (but fragile) way - using Regular Expression Extractor.
As you can see, there are 4 entries for category, like:
{"category": "reference",
{"category": "fiction",
...
If you add a Regular Expression Extractor configured as follows:
It'll capture all the category entries and return the number of matches as below:
So you will be able to use this ${matches_matchNr} variable wherever required.
This approach is straightforward and easy to implement, but it's very vulnerable to any changes in the response format. If you expect the JSON data to change in the foreseeable future, continue with the next option.
2. Harder (but more stable) way - calling JsonPath methods from a Beanshell PostProcessor
JMeter has a Beanshell scripting extension mechanism which has access to all variables/properties in scope, as well as to the underlying JMeter and 3rd-party APIs. In this case you can call the JsonPath library (which is under the hood of the JsonPath Extractor) directly from a Beanshell PostProcessor.
import com.jayway.jsonpath.Criteria;
import com.jayway.jsonpath.Filter;
import com.jayway.jsonpath.JsonPath;
Object json = new String(data); // "data" is the parent sampler's response, exposed to the Beanshell PostProcessor as a byte array
List categories = new ArrayList();
categories.add("fiction");
categories.add("reference");
Filter filter = Filter.filter(Criteria.where("category").in(categories));
List books = JsonPath.read(json, "$.store.book[?]", new Filter[] {filter});
vars.put("JSON_ARRAY_SIZE", String.valueOf(books.size()));
The code above evaluates the JSONPath expression $.store.book[?] against the parent sampler's response, counts the number of matches, and stores it in the ${JSON_ARRAY_SIZE} JMeter variable,
which can later be reused in an if clause or an assertion.
References:
JMeter – Working with JSON – Extract JSON response
JMeter's User Manual Regular Expressions entry
JSON Path Documentation and Examples
How to use BeanShell: JMeter's favorite built-in component
This is not possible with the plugin you are using (JMeter-plugins).
But it can be done with the JSON Extractor available since JMeter 3.0; this plugin was donated by UbikLoadPack (http://jmeter.apache.org/changes_history.html).
Example:
Say you have this JSON that contains an array of books:
{ "store": {"book": [
{ "category": "reference","author": "Nigel Rees","title": "Sayings of the Century","price": 8.95},
{ "category": "fiction","author": "Evelyn Waugh","title": "Sword of Honour","price": 12.99},
{ "category": "fiction","author": "Herman Melville","title": "Moby Dick","isbn": "0-553-21311-3","price": 8.99},
{ "category": "fiction","author": "J. R. R. Tolkien","title": "The Lord of the Rings","isbn": "0-395-19395-8","price": 22.99}
],
"bicycle": {"color": "red","price": 19.95}} }
To get this count:
1/ Add a JSON Extractor:
The count will then be available in bookTitle_matchNr, which you can access through:
${bookTitle_matchNr}
Running this Test Plan would display this:
As you can see, Debug Sampler-${bookTitle_matchNr} shows Debug Sampler-4