I'm trying to visualize multiple metrics, while using only one as the objective. I see how you can define 'custom' metrics using 'MetricDefinitions' under 'AlgorithmSpecification', but what if we just want to see more of the following and record them in CloudWatch as our HyperParameter tuning job progresses:
validation:accuracy
validation:auc
validation:error
validation:logloss
validation:mse
There are more, of course, and the exact metrics I realize might vary based on whether it's a classification or regression problem.
The larger question is just how do we specify the 'recording/logging' of more of these metrics using a standard container like the one for XGBoost?
You can monitor a set of metrics for each problem type by using SageMaker Model Monitor[ Reference ]
If you need to monitor any additional metrics apart from the default ones, you can use the BYOC approach to monitor them.
Some of the examples for building BYOC can be found here
Related
I can't understand the difference between SageMaker instance count and Data parallelism. As we already have a feature that can specify how many instances we train model when we write a training script using sagemaker-sdk.
However, in 2021 re:Invent, SageMaker team launched and demonstrated SageMaker managed Data Parallelism and this feature also provides distributed training.
I've searched a lot of sites for letting me know about that, but I can't find really clear demonstration. I share some stuffs explaining the concept I mentioned closely. Link : https://godatadriven.com/blog/distributed-training-a-diy-aws-sagemaker-model/
Increasing the instance count will enable SageMaker to launch those many instances and copy data to the instances. This will only enable parallelization at the infrastructure level. To really carry out distributed training we need support at framework/code level where the code should know how to aggregate/send gradients across all the GPU's/instances within the cluster. In some case how to distribute data as well usually when using DataLoaders. To achieve this SageMaker has Distributed Data Parallelism feature built into it. This is similar to other alternatives like Horovod, Pytorch DDP etc...
Reaching out to the community to pressure test our internal thinking.
We are building a simplified business intelligence platform that will aggregate metrics (i.e. traffic, backlinks) and text list (i.e search keywords, used technologies) from several data providers.
The data will be somewhat loosely structured and may change over time with vendors potentially changing their response formats.
Data volume may be long term 100,000 rows x 25 input vectors.
Data would be updated and read continuously but not at massive concurrent volume.
We'd expect to need to do some ETL transformations on the gathered data from partners along the way to the UI (e.g show trending information over the past five captured data points).
We'd want to archive every single data snapshot (i.e. version it) vs just storing the most current data point.
The persistence technology should be readily available through AWS.
Our assumption is our requirements lend themselves best towards DynamoDB (vs Amazon Neptune or Redshift or Aurora).
Is that fair to assume? Are there any other questions / information I can provide to elicit input from this community?
Because of your requirement to have a schema-less structure, and to version each item, DynamoDB is a great choice. You will likely want to build the table as a composite Partition/Sort key structure, with the Sort key being the Version, and there are several techniques you can use to help you locate the 'latest' version etc. This is a very common pattern, and with DDB Autoscaling you can ensure that you only provision the amount of capacity that you actually need.
I want to use, objectify for spatial search. I have entities that have longitude and latitude associated with them. Latitude and longitude information is dynamic e.g. service providers (like electrician, carpenter) in a city. I want to implement a query that gives me service providers providing some specific service in 1 Km radius. Searching on google reveals following options
Use Objectify with geohashes - Not sure, how accurate and scalable this solution is
Use Google Search - It will need entities(or part of it) duplicated in the form of documents and Will it be able to support dynamically updated locations.
Use other database like mongodb
Assuming few millions entities and latitude/longitude dynamically updated, please suggest me an appropriate option.
thanks
Ittium
I've used geohashes. It works, although you end up selecting more data than the exact bounds you are looking for and then filtering out the extra. This might or might not be a good solution depending on your specific application. It requires writing more code but has fewer moving parts (all in the datastore).
Google search and "other database" are basically the same architectural pattern - use the task queue to replicate updates to an external index. If you want a quick solution, the search service is probably is the easiest to wrap your head around.
Just pick one solution and run with it for a while. You can always reindex the data into a different solution.
It really depends on your query rate but I usually prefer to use google search. Building and maintaining docs is pretty simple and you get a different quota to handle this queries.
I build a web analytics tool and consider to use Graphite. This is a very basic tool with just a few interesting dimentions but there are multiple dimensions associated with a measurement. For example when a user hits the site I want to keep track of geography, browser etc. The metric name would probably be:
usa.chrome.windows8.organic...
I can then use wildcards to do interesting queries.
Is this abuse of metric names (and Graphite in general), or is it a good approach as long as I only care about a small amount of metrics.
I think the approach will be fine, although there are some important considerations when naming the metrics. Because Graphite will store a .wsp file for every metric name you'll have a difficult time re-sizing or adjusting the storage settings should you decide to change your configuration. Additionally, the Graphite UI will have a "folder" for every metric name so you can easily make the UI unusable.
Graphite recommends that "Volatile path components should be kept as deep into the hierarchy as possible". This essentially means that if you can push the parts of the metrics that are frequently unique to the end of the "bucket" without impacting your grouping queries you should try to do so.
Here is a great post on using Graphite that includes naming recommendations. And here is another one with additional info from Jason Dixon (an excellent source for Graphite stuff in general).
This is basically a forwarded answer of mine from another question...
I did encounter a nice guide (also referenced in the accepted answer) someone put together on this topic though. From the guide:
<namespace>.<instrumented section>.<target (noun)>.<action (past tense verb)>
Example:
accounts.authentication.password.attempted
You've already considered the needs you will have, but try and foresee a bit and not restrict yourself from expanding your capabilities.
Unless you have totals at every level, it will be difficult/tedious for you to compare metrics. Perhaps consider some metrics that you typically would like to compare and start by separating those out.
We are considering moving a planning and budgeting app to the Salesforce platform. The existing app is built on a dimensional data model, and has extensive ad-hoc query capability implemented through star joins.
We see how the platform will allow us to put together the data entry screens quickly, but the underlying datamodel and query languages do not seem suitable for our reporting requirements.
Is it possible to have fast and flexible reporting with this platform? If not, how cumbersome is it to extract the data on a regular basis to bring it into an analytical application?
Hmm - I guess I answer my own question? The relative silence on this (even with bounty- who wants to have anything to do with something that is ignored on stackoverflow?) is a kind of answer.
So - No, this platform is not well suited for applications that have any kind of ROLAP requirements. I guess shame on me for asking a silly question, but I welcome any responses...
Doing native, fast, OLAP-like queries: possible, but somewhat cumbersome since SFDC is basically a traditional-style RDBMS with somewhat limited joining capability within its native reporting. You can do OLAP-like things with custom code but it can get cumbersome if you are used to using established high-end OLAP solutions.
Extracting data from SFDC to use in other applications: really easy and supported across a number of technologies, the most common is extracting CSV files or using the data web service. There are tools like the SFDC data loader which also let you extract/load data via command line or UI. That's probably what I would recommend to a client who has pre-existing expertise in a given analysis tool.
I would not attempt to build an OLAP data model in salesforce. The limitations in both the joins and roll-up of data from child to parent make it difficult to implement a star schema with aggregations.
There are some products such as IQ 20/20 that can integrate with salesforce and provide near real time business intelligence functionality.
Analytical snapshots can also help as they provide a way to build aggregate tables. The snapshots pull data from a report and can be scheduled to run periodically. The different salesforce editions give different features regarding the scheduling so it is best to check the limits for your edition before going too far into the design.