How does one call external datasets into scikit-learn?

For example consider this dataset:
(1)
https://archive.ics.uci.edu/ml/machine-learning-databases/annealing/anneal.data
Or
(2)
http://data.worldbank.org/topic
How does one call such external datasets into scikit-learn to do anything with them?
The only kind of dataset loading I have seen in scikit-learn is through a command like:
from sklearn.datasets import load_digits
digits = load_digits()

You need to learn a little pandas, which is a data-frame implementation in Python. Then you can do:
import pandas
my_data_frame = pandas.read_csv("/path/to/my/data")
To create model matrices from your data frame, I recommend the patsy library, which implements a model-specification language similar to R formulas:
import patsy
y, X = patsy.dmatrices("my_response ~ my_model_formula", my_data_frame)
The resulting design matrix X (and response y) can then be passed to the various sklearn models.
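To make that concrete, here is a minimal, hedged sketch for the UCI annealing file from the question. It assumes the file is plain comma-separated text with no header row and '?' for missing values, and the choice of model is purely illustrative:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# anneal.data is plain CSV with no header row; '?' marks missing values
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/annealing/anneal.data"
df = pd.read_csv(url, header=None, na_values="?")

# assumption: the last column is the class label, everything else is a feature;
# keep it simple by using only the numeric columns and filling missing values
X = df.iloc[:, :-1].select_dtypes("number").fillna(0)
y = df.iloc[:, -1]

clf = RandomForestClassifier(n_estimators=100).fit(X, y)
print(clf.score(X, y))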

Simply run the following command and replace the name 'EXTERNALDATASETNAME' with the name of your dataset
import sklearn.datasets
data = sklearn.datasets.fetch_EXTERNALDATASETNAME()
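For what it's worth, the built-in fetchers all have fixed names, so there is no generic fetch_EXTERNALDATASETNAME function; but fetch_openml can pull many external datasets from openml.org by name. A small sketch (the dataset name and version are just the example used in the scikit-learn docs):
from sklearn.datasets import fetch_openml

# downloads a named dataset from openml.org (requires network access)
mice = fetch_openml(name="miceprotein", version=4, as_frame=True)
print(mice.data.shape)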


How do I get a dataframe or database write from TFX BulkInferrer?

I'm very new to TFX, but have an apparently-working ML Pipeline which is to be used via BulkInferrer. That seems to produce output exclusively in Protobuf format, but since I'm running bulk inference I want to pipe the results to a database instead. (DB output seems like it should be the default for bulk inference, since both Bulk Inference & DB access take advantage of parallelization... but Protobuf is a per-record, serialized format.)
I assume I could use something like Parquet-Avro-Protobuf to do the conversion (though that's in Java and the rest of the pipeline is in Python), or I could write something myself to consume all the protobuf messages one by one, convert them into JSON, deserialize the JSON into a list of dicts, and load the dicts into a Pandas DataFrame, or store them as a bunch of key-value pairs which I treat like a single-use DB... but that sounds like a lot of work and pain involving parallelization and optimization for a very common use case. The top-level Protobuf message definition is Tensorflow's PredictionLog.
This must be a common use case, because TensorFlowModelAnalytics functions like this one consume Pandas DataFrames. I'd rather be able to write directly to a DB (preferably Google BigQuery), or a Parquet file (since Parquet / Spark seems to parallelize better than Pandas), and again, those seem like they should be common use cases, but I haven't found any examples. Maybe I'm using the wrong search terms?
I also looked at the PredictExtractor, since "extracting predictions" sounds close to what I want... but the official documentation appears silent on how that class is supposed to be used. I thought TFTransformOutput sounded like a promising verb, but instead it's a noun.
I'm clearly missing something fundamental here. Is there a reason no one wants to store BulkInferrer results in a database? Is there a configuration option that allows me to write the results to a DB? Maybe I want to add a ParquetIO or BigQueryIO instance to the TFX pipeline? (TFX docs say it uses Beam "under the hood" but that doesn't say much about how I should use them together.) But the syntax in those documents looks sufficiently different from my TFX code that I'm not sure if they're compatible?
Help?
(Copied from the related issue for greater visibility)
After some digging, here is an alternative approach, which assumes no knowledge of the feature_spec beforehand. Do the following:
1. Set the BulkInferrer to write to output_examples rather than inference_result by adding an output_example_spec to the component construction.
2. Add a StatisticsGen and a SchemaGen component in the main pipeline right after the BulkInferrer to generate a schema for the aforementioned output_examples.
3. Use the artifacts from SchemaGen and BulkInferrer to read the TFRecords and do whatever is necessary.
bulk_inferrer = BulkInferrer(
    ....
    output_example_spec=bulk_inferrer_pb2.OutputExampleSpec(
        output_columns_spec=[bulk_inferrer_pb2.OutputColumnsSpec(
            predict_output=bulk_inferrer_pb2.PredictOutput(
                output_columns=[bulk_inferrer_pb2.PredictOutputCol(
                    output_key='original_label_name',
                    output_column='output_label_column_name', )]))]
    ))
statistics = StatisticsGen(
    examples=bulk_inferrer.outputs.output_examples
)
schema = SchemaGen(
    statistics=statistics.outputs.output,
)
After that, one can do the following:
import tensorflow as tf
from tfx.utils import io_utils
from tensorflow_transform.tf_metadata import schema_utils
# read schema from SchemaGen
schema_path = '/path/to/schemagen/schema.pbtxt'
schema_proto = io_utils.SchemaReader().read(schema_path)
spec = schema_utils.schema_as_feature_spec(schema_proto).feature_spec
# read inferred results
data_files = ['/path/to/bulkinferrer/output_examples/examples/examples-00000-of-00001.gz']
dataset = tf.data.TFRecordDataset(data_files, compression_type='GZIP')
# parse dataset with spec
def parse(raw_record):
    return tf.io.parse_example(raw_record, spec)

dataset = dataset.map(parse)
At this point, the dataset is like any other parsed dataset, so it's trivial to write it out to a CSV, to a BigQuery table, or wherever else from there. It certainly helped us in ZenML with our BatchInferencePipeline.
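For example, a minimal sketch of the CSV route, assuming the parsed features come out as dense, size-1 tensors (sparse features from the schema would need an extra densification step):
import pandas as pd

rows = []
for example in dataset.as_numpy_iterator():
    # after map(parse) above, each element is a dict of feature name -> numpy value
    rows.append({name: value.item() for name, value in example.items()})
pd.DataFrame(rows).to_csv("bulk_inferrer_output.csv", index=False)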
Answering my own question here to document what we did, even though I think @Hamza Tahir's answer below is objectively better. This may provide an option for other situations where it's necessary to change the operation of an out-of-the-box TFX component. It's hacky though:
We copied and edited the file tfx/components/bulk_inferrer/executor.py, replacing this transform in the _run_model_inference() method's internal pipeline:
| 'WritePredictionLogs' >> beam.io.WriteToTFRecord(
    os.path.join(inference_result.uri, _PREDICTION_LOGS_FILE_NAME),
    file_name_suffix='.gz',
    coder=beam.coders.ProtoCoder(prediction_log_pb2.PredictionLog)))
with this one:
| 'WritePredictionLogsBigquery' >> beam.io.WriteToBigQuery(
    'our_project:namespace.TableName',
    schema='SCHEMA_AUTODETECT',
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    custom_gcs_temp_location='gs://our-storage-bucket/tmp',
    temp_file_format='NEWLINE_DELIMITED_JSON',
    ignore_insert_ids=True,
)
(This works because when you import the BulkInferrer component, the per-node work gets farmed out to these executors running on the worker nodes, and TFX copies its own library onto those nodes. It doesn't copy everything from user-space libraries, though, which is why we couldn't just subclass BulkInferrer and import our custom version.)
We had to make sure the table at 'our_project:namespace.TableName' had a schema compatible with the model's output, but didn't have to translate that schema into JSON / AVRO.
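If autodetection ever misbehaves, WriteToBigQuery also accepts an explicit schema string; a hypothetical fragment in the same style as above (the field names are made up and must match the model's actual output columns):
| 'WritePredictionLogsBigquery' >> beam.io.WriteToBigQuery(
    'our_project:namespace.TableName',
    # hypothetical 'name:TYPE' pairs instead of SCHEMA_AUTODETECT
    schema='user_id:STRING,prediction:FLOAT,inference_time:TIMESTAMP',
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    custom_gcs_temp_location='gs://our-storage-bucket/tmp',
)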
In theory, my group would like to make a pull request to TFX built around this, but for now we're hard-coding a couple of key parameters and don't have the time to get this to a real public / production state.
I'm a little late to this party but this is some code I use for this task:
import tensorflow as tf
from tensorflow_serving.apis import prediction_log_pb2
import pandas as pd
from typing import List, Text  # needed for the type hints below


def parse_prediction_logs(inference_filenames: List[Text]) -> pd.DataFrame:
    """
    Args:
        inference_filenames: tf.io.gfile.glob(Inferrer artifact uri)
    Returns:
        a dataframe of userIds, predictions, and features
    """
    def parse_log(pbuf):
        # parse the protobuf
        message = prediction_log_pb2.PredictionLog()
        message.ParseFromString(pbuf)
        # my model produces scores and classes and I extract the topK classes
        predictions = [x.decode() for x in (message
                                            .predict_log
                                            .response
                                            .outputs['output_2']
                                            .string_val
                                            )[:10]]
        # here I parse the input tf.train.Example proto
        inputs = tf.train.Example()
        inputs.ParseFromString(message
                               .predict_log
                               .request
                               .inputs['input_1'].string_val[0]
                               )
        # you can pull out individual features like this
        uid = inputs.features.feature["userId"].bytes_list.value[0].decode()
        feature1 = [
            x.decode() for x in inputs.features.feature["feature1"].bytes_list.value
        ]
        feature2 = [
            x.decode() for x in inputs.features.feature["feature2"].bytes_list.value
        ]
        return (uid, predictions, feature1, feature2)

    return pd.DataFrame(
        [parse_log(x) for x in
         tf.data.TFRecordDataset(inference_filenames, compression_type="GZIP").as_numpy_iterator()
         ], columns=["userId", "predictions", "feature1", "feature2"]
    )
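A hedged usage sketch, assuming the imports above and a hypothetical artifact location:
# hypothetical path to the BulkInferrer inference_result artifact; adjust to your pipeline root
files = tf.io.gfile.glob("/path/to/bulkinferrer/inference_result/*.gz")
predictions_df = parse_prediction_logs(files)
predictions_df.to_csv("predictions.csv", index=False)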

Use the X from for-loop as part of variable name (Python)

I have a problem that may be obvious to Pythonistas, but I just can't google it out.
In short:
I want to use the x from for x in my_data_header as part of my variable name. For example, instead of hardcoding my_data.selected_column, use my_data.x to loop through all the columns.
In more detail:
I want to make boxplots from scientific data imported from a spreadsheet. One column holds the treatment designations by which I subset the dataset; the others are measurements I want to draw boxplots from. I need to loop through the measurement columns and export the boxplots, so the x of the for loop has to be used in:
- selecting the column (within each treatment), titling the boxplot, naming the exported .png file, ...
I could perform the steps separately, but couldn't compose the loop.
What is the recommended approach for looping through spreadsheet columns in a complex task where you have to refer to the column titles? (I will add information if needed.)
I am trying to switch from RStudio/Markdown/Knit to Python.
Thank you in advance!
It was a pd.DataFrame issue: the data-read method I used previously imported the data in a non-pandas-compatible form. I paste my working code below in case someone finds it useful; if the SO community finds it irrelevant, just delete it altogether.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt  # needed for plt.clf() below

data = pd.read_excel('/media/Data/my_file.xlsx', 0)
h = data.columns  # read the header line
d = pd.DataFrame(data)
print(type(d))  # check that it is <class 'pandas.core.frame.DataFrame'>
for x in h:
    yy = d[x]  # reference to the column
    bp = sns.boxplot(x='Column_with_treatments', y=yy, data=data)  # make the graph
    fig = bp.get_figure()  # put the created graph into the object fig
    nn = yy.name + '_name_you_want.png'  # create the file-name string
    print(nn)  # logs the column name of the current graph
    fig.savefig(nn)  # save the graph image
    # plt.show()  # would show each image, but you need to close it to continue
    plt.clf()  # clear the current graph from memory
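For reference, a slightly more idiomatic, hedged version of the same loop; it assumes the measurement columns are the numeric ones and keeps the placeholder names from above:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_excel('/media/Data/my_file.xlsx', 0)  # read_excel already returns a DataFrame

# loop over the measurement columns only; 'Column_with_treatments' is the grouping column
for column in data.select_dtypes('number').columns:
    ax = sns.boxplot(x='Column_with_treatments', y=column, data=data)
    ax.figure.savefig(f'{column}_name_you_want.png')
    plt.clf()  # start the next figure from a clean slate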

JSON array KeyError in Python

I'm fairly new to Python programming and am attempting to extract data from a JSON array. The code below results in an error on
js[jstring][jkeys]['5. volume'])
Any help would be much appreciated.
import urllib.request, urllib.parse, urllib.error
import json

def DailyData(symb):
    url = "https://www.alphavantage.co/queryfunction=TIME_SERIES_DAILY&symbol=MSFT&apikey=demo"
    stockdata = urllib.request.urlopen(url)
    data = stockdata.read().decode()
    try:
        js = json.loads(data)
    except:
        js = None
    jstring = 'Time Series (Daily)'
    for entry in js:
        i = js[jstring].keys()
        for jkeys in i:
            return (jkeys,
                    js[jstring][jkeys]['1. open'],
                    js[jstring][jkeys]['2. high'],
                    js[jstring][jkeys]['3. low'],
                    js[jstring][jkeys]['4. close'],
                    js[jstring][jkeys]['5. volume'])

print('volume', DailyData(symbol)[5])
Looks like the reason for the error is that the returned data from the URL is a bit more hierarchical than you may realize. To see that, print out js (I recommend using a Jupyter notebook):
import urllib.request, urllib.parse, urllib.error
import ssl
import json
import sqlite3
url = "https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=MSFT&apikey=demo"
stockdata = urllib.request.urlopen(url)
data = stockdata.read().decode()
js = json.loads(data)
js
You can see that js (now a python dict) has a "Meta Data" key before the actual time series begins. You need to start operating on the dict at that key.
Having said that, to get the data into a table-like structure (for plotting, time-series analysis, etc.), you can use the pandas package to read the dict key directly into a DataFrame. The pandas DataFrame constructor accepts a dict as input. In this case, the data is transposed, so the T at the end rotates it (try with and without the T and you will see it).
import pandas as pd
df=pd.DataFrame(js['Time Series (Daily)']).T
df
Added edit... You could get the data into a dataframe with a single line of code:
import requests
import pandas as pd
url = "https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=MSFT&apikey=demo"
data = pd.DataFrame(requests.get(url).json()['Time Series (Daily)']).T
DataFrame: the constructor from pandas to make the data into a table-like structure
requests.get(): method from the requests library to fetch data.
.json(): directly converts from JSON to a dict
['Time Series (Daily)']: pulls out the key from the dict that is the time series
.T: transposes the rows and columns.
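One hedged follow-up, since Alpha Vantage returns every value as a quoted string: you will usually want numeric columns and a datetime index before doing any analysis (continuing from the data frame built above):
data = data.apply(pd.to_numeric)          # the JSON values arrive as strings
data.index = pd.to_datetime(data.index)   # the keys are date strings
data = data.sort_index()
print(data.dtypes)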
Good luck!
The following code worked for me:
import urllib.request, urllib.parse, urllib.error
import json

def DailyData(symb):
    # Your code was missing the ? after query
    url = "https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol={}&apikey=demo".format(symb)
    stockdata = urllib.request.urlopen(url)
    data = stockdata.read().decode()
    js = json.loads(data)
    jstring = 'Time Series (Daily)'
    for entry in js:
        i = js[jstring].keys()
        for jkeys in i:
            return (jkeys,
                    js[jstring][jkeys]['1. open'],
                    js[jstring][jkeys]['2. high'],
                    js[jstring][jkeys]['3. low'],
                    js[jstring][jkeys]['4. close'],
                    js[jstring][jkeys]['5. volume'])

# query multiple times, just to print one item?
print('open', DailyData('MSFT')[1])
print('high', DailyData('MSFT')[2])
print('low', DailyData('MSFT')[3])
print('close', DailyData('MSFT')[4])
print('volume', DailyData('MSFT')[5])
Output:
open 99.8850
high 101.4300
low 99.6700
close 101.1600
volume 19234627
Without seeing the error, it's hard to know what exact problem you were having.

How to transform a normalVectorRDD into a Labeled Point data?

I'm working with pyspark and I'd like to perform a linear regression from the MLlib package, so I want to generate my own (big) data to compare my cluster's performance against a single-node Python interpreter.
from pyspark.mllib.random import RandomRDDs
u=RandomRDDs.normalVectorRDD(sc, 1000000000, 500)
u.take(5)
I got this:
array([ -1.13787491e+00, 3.68202613e-01, 9.59762136e-01,
6.33172122e-01, -1.91278957e+00, -1.17794680e+00,
-7.77179759e-01, -1.48368585e+00, 2.32369644e+00,...]
And I'd like to parse it into LabeledPoint data so it can be recognized by the LinearRegressionWithSGD algorithm. Each row should look like this:
LabeledPoint(0.469112,[-0.282863,-1.509059,-1.135632,1.212112,-0.173215,0.119209,-1.044236,-0.861849,-2.104569,-0.494929,1.071804,0.721555,-0.706771,-1.039575,0.27186,-0.424972,0.56702,0.276232,-1.087401,-0.67369,0.113648,-1.478427,0.524988,0.404705])
The first value is the target (label) and the rest are the features.
Try this:
from pyspark.mllib.regression import LabeledPoint
u.map(lambda x:LabeledPoint(x[0],x[1:]))
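To close the loop, a minimal hedged sketch of feeding those labeled points into the regressor (the training parameters are purely illustrative):
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

# first column as the label, the rest as features
labeled = u.map(lambda x: LabeledPoint(x[0], x[1:]))

# illustrative parameters; tune iterations/step for real data
model = LinearRegressionWithSGD.train(labeled, iterations=100, step=0.01)
print(model.intercept, model.weights)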

Simple / Smart, Pythonic database solution, can use Python types + syntax? (Key / Value Dict, Array, maybe Ordered Dict)

Looking for solutions that push the envelope and:
Avoid:
- Manually writing SQL queries (Python can be more OO, not passing DSL strings)
- Using non-Python datatypes for a supposedly required model definition
- Using a new class of types rather than perfectly good native Python types
Boast:
- Using Python objects
- Object-oriented and key-based retrieval and creation
- Quick prototyping
- No SQL table to make
- Model/type inference or no model
- Fewer lines and characters to type
- Easy output to and from JSON, maybe XML or even Protocol Buffers
I do web, desktop and mobile software development so the more portable the better.
>>> from someAmazingDB import *
>>> db.taskList = []
>>> db['taskList'].append({'title': 'Beat old sql interfaces', 'done': False})
>>> db.taskList.append({'title': 'Illustrate different syntax modes', 'done': True})
# at this point it should autosave
# we should be able to reload the console and access it like:
>>> from someAmazingDB import *
>>> print 'Done tasks'
>>> for task in db.taskList:
...     if task.done:
...         print task
'Illustrate different syntax modes'
Here is the challenge: the above code should work with very little modification or thinking required, like a different import statement and maybe a little more, but Django models and SQLAlchemy DO NOT CUT IT.
I'm looking for more interesting library suggestions than just "Try Shelve" or "use pickle"
I'm not opposed to Python classes being used for models, but they should be really straightforward, unlike the stuff you see with Django and similar.
I was actually working on something like this earlier today. There is no readme or sufficient tests yet, but... http://github.com/mikeboers/LiteMap/blob/master/litemap.py
The LiteMap class behaves much like the builtin dict, but it persists into a SQLite database. You did not indicate what particular database you were interested in, but this could be almost trivially modified to any back end.
It also does not track changes to mutable values (e.g. appending to the list in your example), but the API is really simple.
Database access doesn't get better than SQLAlchemy.
Care to explain what about Django's models you don't find straightforward? Here's how I'd do what you have in Django:
from django.db import models

class Task(models.Model):
    title = models.CharField(max_length=...)
    is_done = models.BooleanField()

    def __unicode__(self):
        return self.title
----
from mysite.tasks.models import Task
t = Task(title='Beat old sql interfaces', is_done=True)
t.save()
----
from mysite.tasks.models import Task
print 'Done tasks'
for task in Task.objects.filter(is_done=True):
    print task
Seems pretty straightforward to me! Also, results in a slightly cleaner table/object naming scheme IMO. The trickier part is using Django's DB module separate from the rest of Django, if that's what you're after, but it can be done.
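For completeness, a rough, hedged sketch of that standalone setup (exact steps vary by Django version; it assumes the hypothetical mysite.tasks app from above is importable):
import django
from django.conf import settings

# configure Django by hand instead of pointing at a settings module;
# the database path and app name are illustrative
settings.configure(
    DATABASES={'default': {'ENGINE': 'django.db.backends.sqlite3', 'NAME': 'tasks.db'}},
    INSTALLED_APPS=['mysite.tasks'],
)
django.setup()

# after this, the ORM calls shown above work as usual
from mysite.tasks.models import Task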
Using web2py:
>>> from gluon.sql import DAL, Field
>>> db = DAL('sqlite://storage.db')
>>> db.define_table('taskList', Field('title'), Field('done', 'boolean'))  # creates the table
>>> db['taskList'].insert(title='Beat old sql interfaces', done=False)
>>> db.taskList.insert(title='Beat old sql interfaces', done=False)
>>> for task in db(db.taskList.done == True).select():
...     print task.title
Supports 10 different database back-ends, plus Google App Engine.
Question looks strikingly similar to http://api.mongodb.org/python/1.9%2B/tutorial.html
So the answer is pymongo, what else ;)
from pymongo import Connection

connection = Connection('localhost', 27017)
db = connection['test-database']
tasklist = db['test-tasklist']
tasklist.insert({'title': 'Beat old sql interfaces', 'done': False})
tasklist.insert({'title': 'Illustrate different syntax modes', 'done': True})
for task in tasklist.find({'done': True}):
    print task['title']
I haven't tested the code, but it won't be very different from this.
BTW, Redis is also interesting and fun.
