AWS BlazingText supervised hyperparameter tuning job not logging objective metric

I am running a hyperparameter tuning job using SageMaker's built-in training image for BlazingText (blazingtext:latest). However, when my jobs complete, they only log #train_accuracy:
...
06:00:36 ##### Alpha: 0.0000 Progress: 100.00% Million Words/sec: 0.00 #####
06:13:19 Training finished.
06:13:19 Average throughput in Million words/sec: 0.00
06:13:19 Total training time in seconds: 1888.88
06:13:19 #train_accuracy: 0.4103
06:13:19 Number of train examples: 55783
The hyperparameter tuning job does not let me pick #train_accuracy as an objective metric; only "validation:accuracy" and "train:mean_rho" appear in the dropdown.
After the job completes under "Best training job" tab I see:
Best training job summary data is available when you have completed training jobs that are emitting metrics.
Am I missing something obvious?

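Ensure there is a "validation" channel in addition to the "train" channel. In supervised mode the validation:accuracy objective metric is computed on the validation data, so without that channel the tuning job has no objective metric to read.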
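As a rough sketch of what a two-channel input looks like, the snippet below builds the InputDataConfig list in the shape used by the SageMaker CreateTrainingJob API. The bucket name and prefixes are hypothetical placeholders, and the helper function is only for illustration; adapt it to however you actually launch the job.

```python
def make_channel(name, s3_uri):
    """Build one input channel entry in the CreateTrainingJob shape."""
    return {
        "ChannelName": name,
        "ContentType": "text/plain",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }

# Hypothetical S3 locations; replace with your own data.
input_data_config = [
    make_channel("train", "s3://my-example-bucket/blazingtext/train/"),
    make_channel("validation", "s3://my-example-bucket/blazingtext/validation/"),
]

channel_names = [c["ChannelName"] for c in input_data_config]
print(channel_names)
```

With both channels present, the supervised job has validation data to score, which is what the "validation:accuracy" objective refers to.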

Solr Metrics Number confusion

We want to start using the Solr metrics; an example web API call is below:
http://localhost:8983/solr/admin/metrics?type=counter&group=core.bookings
We get results back like this from our master
We get metrics like this from our slave that does all the serving
The question is: what are the totalTime values measured in? If we presume they are milliseconds, that would mean that each request was taking 0.8 of a day, and that is just not right! Can someone help describe what unit of measure totalTime is in?
After tracing the source, this is based on the value returned from the metrics timing object, which is given in nanoseconds:
long elapsed = timer.stop();
metrics.totalTime.inc(elapsed);
.. so it's nanoseconds, not milliseconds. This seems to make sense if we take the numbers and divide them:
>>> total = 718781574846646
>>> requests = 6789073
>>> total / requests
105873301.82583779
>>> time_per_request = total / requests
>>> time_per_request / 1_000_000_000
0.10587330182583779
.. so about 0.1s per request on average.

Time series forecast with TensorFlow Probability - seasonality parameters

I was using SARIMAX for time series forecasting and getting decent results. I am currently exploring TensorFlow Probability and came across "tfp.sts.forecast".
About data: I have sample timeseries data with hourly data. This data can have hourly as well as weekly seasonality.
For building the model, as mentioned in the samples, I tried setting the effects as below
import tensorflow_probability as tfp

hour_of_day_effect = tfp.sts.Seasonal(
    num_seasons=24,
    num_steps_per_season=1,
    observed_time_series=observed_time_series,
    allow_drift=True,
    name='hour_of_day_effect')
day_of_week_effect = tfp.sts.Seasonal(
    num_seasons=168,
    num_steps_per_season=1,
    observed_time_series=observed_time_series,
    allow_drift=True,
    name='day_of_week_effect')
autoregressive = tfp.sts.Autoregressive(
    order=1,
    observed_time_series=observed_time_series,
    name='autoregressive')
model = tfp.sts.Sum([hour_of_day_effect,
                     day_of_week_effect,
                     autoregressive],
                    observed_time_series=observed_time_series)
My question is about "day_of_week_effect". For my hourly data, if I set it up as
num_seasons=7
num_steps_per_season=24
I do not get good results.
For example, if I see a spike every Friday, it is not captured correctly when the values are set as above.
But if I set them up as
num_seasons=168,
num_steps_per_season=1,
it is captured correctly. (I arrived at 168 as 24 * 7)
Could you please explain this behavior?
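One way to see the difference between the two parameterizations, as a plain-Python sketch rather than TFP code: a seasonal component has one effect per season, and that effect stays constant for num_steps_per_season steps. The helper below (a hypothetical function, not part of tfp.sts) maps an hourly time step to the season that is active under each setup.

```python
def season_index(t, num_seasons, num_steps_per_season):
    """Index of the season active at 0-based time step t."""
    return (t // num_steps_per_season) % num_seasons

# The 24 hourly steps of the fifth day of the first week (0-based day 4).
fifth_day_hours = range(4 * 24, 5 * 24)

# num_seasons=7, num_steps_per_season=24: one shared effect for the whole day.
daily = {season_index(t, 7, 24) for t in fifth_day_hours}

# num_seasons=168, num_steps_per_season=1: a separate effect for each hour of the week.
hour_of_week = {season_index(t, 168, 1) for t in fifth_day_hours}

print(len(daily), len(hour_of_week))
```

Under the 7 x 24 setup every hour of a given day shares a single value, so a spike limited to certain hours of Friday can only be expressed if another component carries the within-day shape. The 168 x 1 setup gives each hour of the week its own value, at the cost of many more parameters.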

constantUsersPerSecond in Gatling

I'm using Gatling to run a load test.
I've put all the queries (3 million in total) in a log file and loaded it into a feeder
val feeder = tsv(logFileLocation).circular
val search =
  feed(feeder)
    .exec(
      http("/")
        .get("${params}")
        .headers(sentHeaders)
    ).pause(1)
As for the simulation, I want 50 concurrent users and a peak rate of 50 requests/second, so I set it up this way
setUp(myservicetest.inject(atOnceUsers(50)))
  .throttle(
    reachRps(50) in (40 minutes),
    jumpToRps(20),
    holdFor(20 minutes))
  .maxDuration(60 minutes)
  .protocols(httpProtocol)
My assumption was that each of these 50 users loads the queries and runs them from first to last. Because there are enough queries to execute, these 50 users would stay online for the whole duration (60 minutes).
But when I ran it, I saw user1 run query1, user2 run query2, ... user50 run query50. Every user ran just one query and then quit, so exactly 50 queries were executed in the load test and it finished quickly.
So my question is: say I have 3 million queries in tsv(logFileLocation).circular and multiple users. Will each user start from query1 and try to execute all 3 million queries? Or is each user scheduled to run part of the 3 million queries so that, if enough time is allocated, all 3 million queries have been executed exactly once by the end of the test?
Thanks
Disclaimer: Gatling's author here
The latter: "at the end of the test all 3m queries are executed for just once".
Each virtual user performs its scenario.
Injection profiles only control when new virtual users are started.
If you want virtual users to perform more than one request, you have to include that in your scenario, e.g. with loops and pauses.
Those are Gatling basics, I recommend you have a look at the new Gatling Academy courses we've just launched.
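To make the feeder behaviour concrete, here is a small Python model of it. This is not Gatling code: it is a sequential simulation, assuming only what the answer states, namely that the circular feeder is shared and each virtual user pulls the next record whenever it runs the feed step. The function name and the 1000-record list are made up for the illustration.

```python
from itertools import cycle

def run(records, users, requests_per_user):
    """Simulate virtual users pulling from one shared circular feeder."""
    feeder = cycle(records)  # shared by all users, wraps around at the end
    executed = []
    for user in range(users):
        for _ in range(requests_per_user):
            executed.append((user, next(feeder)))
    return executed

records = [f"query{i}" for i in range(1, 1001)]

# Scenario without a loop: each user feeds once, sends one request, and ends.
once = run(records, users=50, requests_per_user=1)

# Scenario with a loop of 10 iterations per user.
looped = run(records, users=50, requests_per_user=10)

print(len(once), len(looped))
```

The first run matches what the question observed: 50 users, 50 queries, then the test ends. Real virtual users run concurrently, so the assignment of records to users will interleave differently than in this sequential model, but the total number of records consumed is the same.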

Can Snowflake work as an operational data store against which I can write REST APIs?

I am researching the Snowflake database and have a data aggregation use case where I need to expose the aggregated data via a REST API. While data ingestion and aggregation seem to be well defined, is Snowflake a system that can be used as an operational data store for serving high-throughput APIs?
Or is this an anti-pattern for this system?
Updating based on your recent comment.
Here are some quick test results from large tables we have in production. (Table names changed for display.)
vLookupView records = 175,760,316
vMainView records = 179,035,026
SELECT
    LP.REGIONCODE
  , SUM(L.VALUE)
FROM DBO.vLookupView AS LP
INNER JOIN DBO.vMainView AS M
    ON LP.PK = M.PK
GROUP BY LP.REGIONCODE;
Results:
SQL Server
Production box - 2:04 minutes
Snowflake, by warehouse (compute) size
XS - 17.1 seconds
Small - 9.9 seconds
Medium - 7.1 seconds
Large - 5.4 seconds
Extra Large - 5.4 seconds
When I added a WHERE condition
WHERE L.ENTEREDDATE BETWEEN '1/1/2018' AND '6/1/2018'
the results were:
SQL Server
Production box - 5 seconds
Snowflake, by warehouse (compute) size
XS - 12.1 seconds
Small - 3.9 seconds
Medium - 3.1 seconds
Large - 3.1 seconds
Extra Large - 3.1 seconds

get_by_key_name() in GAE taking as long as 750ms. Is this expected?

My program fetches ~100 entries in a loop. All entries are fetched using get_by_key_name(). Appstats show that some get_by_key_name() requests are taking as much as 750ms! (other big values are 355ms, 260ms, 230ms). Average for other fetches ranges from 30ms to 100ms. These times are in real_time and hence contribute towards 'ms' and not 'cpu_ms'.
Due to the above, total time taken to return the webpage is very high ms=5754, where cpu_ms=1472. (above times are seen repeatedly for back to back requests.)
Environment: Python 2.7, webapp2, jinja2, High Replication, No other concurrent requests to the server, Frontend Instance Class is F1, No memcache set yet, max idle instances is automatic, min pending latency is automatic, using db (NOT NDB).
Any help will be greatly appreciated as I based whole database design on fetching entries from the datastore using only get_by_key_name()!!
Update:
I tried profiling using time.clock() before and immediately after every get_by_key_name() method call. The difference I get from time.clock() for every single call is 10ms! (Just want to clarify that the get_by_key_name() is called on different Kinds).
According to time.clock() the total execution time (in wall-clock time) is 660ms. But the real-time is 5754 (=ms), and cpu_ms is 1472 per GAE logs.
Summary of Questions:
*[Update: This was addressed by passing list of keys] Why get_by_key_name() is taking that long?*
Why is ms of 5754 so much higher than cpu_ms of 1472? Is task execution in a halted/waiting state for about 75% (1 - 1472/5754) of the time, which is why the real (wall-clock) time is so long as far as the end user is concerned?
If the above is true, then why does time.clock() show that only 660ms (wall-clock time) elapsed between the start of the first get_by_key_name() request and the last (~100th) get_by_key_name() request, although GAE shows this time as 5754ms?
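One possible reading of the last two questions, offered as an assumption rather than a confirmed diagnosis of this app: on Unix-like systems Python 2's time.clock() reports processor time, not wall-clock time, so time spent blocked on a datastore RPC would not show up in it. The sketch below uses the Python 3 equivalents (time.process_time() for CPU time, time.perf_counter() for wall time) and a sleep as a stand-in for the remote call; fake_datastore_get is a hypothetical placeholder.

```python
import time

def timed(fn):
    """Return (wall_seconds, cpu_seconds) spent running fn()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu

def fake_datastore_get():
    # Stand-in for a blocking remote call: the process waits, the CPU is idle.
    time.sleep(0.05)

wall, cpu = timed(fake_datastore_get)
print(f"wall={wall * 1000:.1f}ms cpu={cpu * 1000:.1f}ms")
```

If that is what is happening here, measuring with time.time() around each get_by_key_name() call would give numbers comparable to the real-time values Appstats reports.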
