In the SageMaker SDK, is there any way to combine / interpolate a ParameterString in a ProcessingInput / Processor / ProcessingStep? - amazon-sagemaker

When using the SageMaker SDK, I would like to use a pipeline parameter (ParameterString) to construct an S3 path, so I need to interpolate the ParameterString somehow; Python's str.format() and f-strings do not work properly with ParameterString.
import sagemaker
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.parameters import ParameterString

date_parameter = ParameterString(name="date")
p_input = ProcessingInput(
    source=f"s3://my-bucket/date={date_parameter}",  # does not interpolate as intended
    destination="/opt/ml/processing/input",
)
What can be used to compose / combine / interpolate pipeline parameters?

The closest equivalent to string interpolation that you can use in a SageMaker Pipeline is sagemaker.workflow.functions.Join
from sagemaker.workflow.execution_variables import ExecutionVariables
from sagemaker.workflow.functions import Join
from sagemaker.workflow.parameters import ParameterString

date_parameter = ParameterString(name="date")
source_variable = Join(on='', values=['s3://bucket-name/date=', date_parameter])

p_input = ProcessingInput(
    input_name="xxx",
    source=source_variable,
    destination="/opt/ml/processing/input",
)
The source_variable in the code above translates to {'Std:Join': {'On': '', 'Values': ['s3://bucket-name/date=', {'Get': 'Parameters.date'}]}} in the pipeline definition (CreatePipeline > PipelineDefinition).
When the SageMaker pipeline is actually started, that expression is evaluated by SageMaker into a literal string.
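For example, the parameter value can be supplied when an execution is started; a minimal sketch (the pipeline name and the step_process variable are assumptions, not part of the original snippet):
from sagemaker.workflow.pipeline import Pipeline

# step_process is assumed to be a ProcessingStep that uses p_input from above.
pipeline = Pipeline(
    name="my-pipeline",
    parameters=[date_parameter],
    steps=[step_process],
)

# After pipeline.upsert(role_arn=...), start an execution.
# With date="2024-01-01", the Join above resolves to
# "s3://bucket-name/date=2024-01-01" at execution time.
execution = pipeline.start(parameters={"date": "2024-01-01"})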

Related

Metrics for any step of Sagemaker pipeline (not just TrainingStep)

My understanding is that in order to compare different trials of a pipeline (see image), the metrics can only be obtained from the TrainingStep, using the metric_definitions argument for an Estimator.
In my pipeline, I extract metrics in the evaluation step that follows the training. Is it possible to record metrics there that are then tracked for each trial?
SageMaker suggests using Property Files and JsonGet for each step that needs them. This approach is intended for conditional steps within the pipeline, but it also works for simply persisting results somewhere.
from sagemaker.workflow.properties import PropertyFile
from sagemaker.workflow.steps import ProcessingStep

evaluation_report = PropertyFile(
    name="EvaluationReport",
    output_name="evaluation",
    path="evaluation.json",
)

step_eval = ProcessingStep(
    # ...
    property_files=[evaluation_report],
)
and in your processor script:
import json

report_dict = {}  # your report
evaluation_path = "/opt/ml/processing/evaluation/evaluation.json"
with open(evaluation_path, "w") as f:
    f.write(json.dumps(report_dict))
You can read this file in the pipeline steps directly with JsonGet.
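As a minimal sketch (the JSON path is an assumption about the structure of evaluation.json), a value from the property file can be read in a later step or condition like this:
from sagemaker.workflow.functions import JsonGet

# Hypothetical JSON path; adapt it to the layout of your evaluation.json.
accuracy = JsonGet(
    step_name=step_eval.name,
    property_file=evaluation_report,
    json_path="metrics.accuracy.value",
)
# accuracy can then be used in a ConditionStep or passed as an argument to later steps.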

Why is Table Schema inference from Scala case classes not working in this official example

I am having trouble with Schema inference from Scala case classes during conversion from DataStreams to Tables in Flink. I've tried reproducing the examples given in the documentation but cannot get them to work. I'm wondering whether this might be a bug?
I have commented on a somewhat related issue in the past. My workaround is to avoid case classes and instead, somewhat laboriously, define a DataStream[Row] with return type annotations.
Still, I would like to learn whether it is somehow possible to get schema inference from case classes working.
I'm using Flink 1.15.2 with Scala 2.12.7. I use the Java libraries but add flink-scala as a separate dependency.
This is my implementation of Example 1 as a quick sanity check:
import org.apache.flink.runtime.testutils.MiniClusterResourceConfiguration
import org.apache.flink.test.util.MiniClusterWithClientResource
import org.scalatest.BeforeAndAfter
import org.scalatest.funsuite.AnyFunSuite
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment

import java.time.Instant

class SanitySuite extends AnyFunSuite with BeforeAndAfter {
  val flinkCluster = new MiniClusterWithClientResource(
    new MiniClusterResourceConfiguration.Builder()
      .setNumberSlotsPerTaskManager(2)
      .setNumberTaskManagers(1)
      .build
  )

  before {
    flinkCluster.before()
  }

  after {
    flinkCluster.after()
  }

  test("Verify that table conversion works as expected") {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val tableEnv = StreamTableEnvironment.create(env)

    case class User(name: String, score: java.lang.Integer, event_time: java.time.Instant)

    // create a DataStream
    val dataStream = env.fromElements(
      User("Alice", 4, Instant.ofEpochMilli(1000)),
      User("Bob", 6, Instant.ofEpochMilli(1001)),
      User("Alice", 10, Instant.ofEpochMilli(1002))
    )

    val table =
      tableEnv.fromDataStream(
        dataStream
      )

    table.printSchema()
  }
}
According to the documentation, this should result in:
(
  `name` STRING,
  `score` INT,
  `event_time` TIMESTAMP_LTZ(9)
)
What I get:
(
  `f0` RAW('SanitySuite$User$1', '...')
)
If I instead modify my code in line with Example 5, that is, explicitly define a Schema that mirrors the case class, I get an error which very much looks like it results from an inability to extract the case class fields:
Unable to find a field named 'event_time' in the physical data type derived from the given type information for schema declaration. Make sure that the type information is not a generic raw type. Currently available fields are: [f0]
The issue is with the imports: you are importing the Java classes while using Scala case classes as POJOs.
Using the following imports works:
import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.configuration.Configuration
import org.apache.flink.connector.kafka.source.KafkaSource
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment
import org.apache.flink.streaming.api.scala._

JSON array keyerror in Python

I'm fairly new to Python programming and am attempting to extract data from a JSON array. The code below results in an error for
js[jstring][jkeys]['5. volume'])
Any help would be much appreciated.
import urllib.request, urllib.parse, urllib.error
import json

def DailyData(symb):
    url = "https://www.alphavantage.co/queryfunction=TIME_SERIES_DAILY&symbol=MSFT&apikey=demo"
    stockdata = urllib.request.urlopen(url)
    data = stockdata.read().decode()
    try:
        js = json.loads(data)
    except:
        js = None
    jstring = 'Time Series (Daily)'
    for entry in js:
        i = js[jstring].keys()
        for jkeys in i:
            return (jkeys,
                    js[jstring][jkeys]['1. open'],
                    js[jstring][jkeys]['2. high'],
                    js[jstring][jkeys]['3. low'],
                    js[jstring][jkeys]['4. close'],
                    js[jstring][jkeys]['5. volume'])

print('volume', DailyData(symbol)[5])
It looks like the error occurs because the data returned from the URL is a bit more hierarchical than you may realize. To see that, print out js (I recommend using a Jupyter notebook):
import urllib.request, urllib.parse, urllib.error
import ssl
import json
import sqlite3
url = "https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=MSFT&apikey=demo"
stockdata = urllib.request.urlopen(url)
data = stockdata.read().decode()
js = json.loads(data)
js
You can see that js (now a Python dict) has a "Meta Data" key before the actual time series begins. You need to start operating on the dict at that key.
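As a minimal sketch of operating on the nested dict directly (printing the volume is just an example):
# The daily entries live under the 'Time Series (Daily)' key,
# keyed by date strings that come from the API response itself.
series = js['Time Series (Daily)']
for date, values in series.items():
    print(date, values['5. volume'])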
Having said that, to get the data into a table-like structure (for plotting, time series analysis, etc.), you can use the pandas package to read the dict key directly into a DataFrame. The pandas DataFrame constructor accepts a dict as input. In this case the data comes in transposed, so the T at the end rotates it (try with and without the T and you will see the difference).
import pandas as pd
df=pd.DataFrame(js['Time Series (Daily)']).T
df
Edit: you could get the data into a DataFrame with a single line of code:
import requests
import pandas as pd
url = "https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=MSFT&apikey=demo"
data = pd.DataFrame(requests.get(url).json()['Time Series (Daily)']).T
DataFrame: the constructor from pandas that turns the data into a table-like structure
requests.get(): method from the requests library to fetch data
.json(): directly converts the JSON response to a dict
['Time Series (Daily)']: pulls out the key from the dict that holds the time series
.T: transposes the rows and columns.
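As a follow-up sketch (assuming you want numbers rather than strings; the API returns every value as a string):
# Convert the string values to numbers and the index to datetimes
# before plotting or doing time series analysis.
data = data.apply(pd.to_numeric)
data.index = pd.to_datetime(data.index)
print(data.head())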
Good luck!
The following code worked for me:
import urllib.request, urllib.parse, urllib.error
import json

def DailyData(symb):
    # Your code was missing the ? after query
    url = "https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol={}&apikey=demo".format(symb)
    stockdata = urllib.request.urlopen(url)
    data = stockdata.read().decode()
    js = json.loads(data)
    jstring = 'Time Series (Daily)'
    for entry in js:
        i = js[jstring].keys()
        for jkeys in i:
            return (jkeys,
                    js[jstring][jkeys]['1. open'],
                    js[jstring][jkeys]['2. high'],
                    js[jstring][jkeys]['3. low'],
                    js[jstring][jkeys]['4. close'],
                    js[jstring][jkeys]['5. volume'])

# query multiple times, just to print one item?
print('open', DailyData('MSFT')[1])
print('high', DailyData('MSFT')[2])
print('low', DailyData('MSFT')[3])
print('close', DailyData('MSFT')[4])
print('volume', DailyData('MSFT')[5])
Output:
open 99.8850
high 101.4300
low 99.6700
close 101.1600
volume 19234627
Without seeing the error, it's hard to know what exact problem you were having.

How does one call external datasets into scikit-learn?

For example consider this dataset:
(1)
https://archive.ics.uci.edu/ml/machine-learning-databases/annealing/anneal.data
Or
(2)
http://data.worldbank.org/topic
How does one call such external datasets into scikit-learn to do anything with them?
The only kind of dataset loading that I have seen in scikit-learn is through a command like:
from sklearn.datasets import load_digits
digits = load_digits()
You need to learn a little pandas, which is a data frame implementation in Python. Then you can do:
import pandas
my_data_frame = pandas.read_csv("/path/to/my/data")
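For the UCI file from the question, for example, a minimal sketch (treating '?' as missing and assuming no header row, which is how anneal.data appears to be distributed):
import pandas as pd

# anneal.data is plain comma-separated text; header=None keeps the first
# data row from being used as column names, and "?" marks missing values.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/annealing/anneal.data"
anneal = pd.read_csv(url, header=None, na_values="?")
print(anneal.shape)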
To create model matrices from your data frame, I recommend the patsy library, which implements a model specification language similar to R formulas:
import patsy
# dmatrices (rather than dmatrix) is needed because the formula has a response
# on the left-hand side; it returns the response vector y and the design matrix X
y, X = patsy.dmatrices("my_response ~ my_model_formula", my_data_frame)
The design matrix X (and the response y) can then be passed into the various sklearn models.
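For instance, a short sketch (assuming a numeric response):
from sklearn.linear_model import LinearRegression

# X and y come from patsy.dmatrices above; patsy design matrices behave
# like NumPy arrays, so sklearn accepts them directly.
model = LinearRegression()
model.fit(X, y)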
Simply run the following command, replacing 'EXTERNALDATASETNAME' with the name of your dataset (this only works for datasets that scikit-learn ships a fetch_* helper for):
import sklearn.datasets
data = sklearn.datasets.fetch_EXTERNALDATASETNAME()
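For instance, the UCI annealing data appears to be mirrored on OpenML, so a sketch like the following should work (the dataset name and version are assumptions):
from sklearn.datasets import fetch_openml

# 'anneal' is assumed to be the OpenML name of the UCI annealing dataset.
data = fetch_openml(name="anneal", version=1, as_frame=True)
print(data.frame.head())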

Scala Slick-Extensions SQLServerDriver 2.1.0 usage - can't get it to compile

I am trying to use Slick-Extensions to connect to an SQL Server Database from Scala. I use slick 2.1.0 and slick-extensions 2.1.0.
I can't seem to get the code I wrote to compile. I followed the examples from slick's website and this compiled fine when the driver was H2. Please see below:
package com.example

import com.typesafe.slick.driver.ms.SQLServerDriver.simple._
import scala.slick.direct.AnnotationMapper.column
import scala.slick.lifted.TableQuery
import scala.slick.model.Table

class DestinationMappingsTable(tag: Tag) extends Table[(Long, Int, Int)](tag, "DestinationMappings_tbl") {
  def id = column[Long]("id", O.PrimaryKey, O.AutoInc)
  def mltDestinationType = column[Int]("mltDestinationType")
  def mltDestinationId = column[Int]("mltDestinationId")
  def * = (id, mltDestinationType, mltDestinationId)
}
I am getting a wide range of errors: scala.slick.model.Table does not take type parameters, column does not take type parameters, and O not found.
If the SQLServerDriver does not use the same syntax as slick, where do I find its documentation?
Thank you!
I think your import of scala.slick.model.Table shadows your import of com.typesafe.slick.driver.ms.SQLServerDriver.simple.Table.
Try just removing:
import scala.slick.model.Table
