Configuring Fairness Monitoring for categorical features - ibm-watson

When configuring fairness monitoring for a model, the prediction column only allows an integer numerical value even though the prediction labels are categorical. Is this an intended feature? How do I configure this for a categorical feature (one that is not an integer)? Is a manual conversion required?

The training data could have class labels such as "Loan Denied" and "Loan Granted". The prediction value returned by the WML scoring endpoint has values such as 0.0, 1.0, etc. The scoring endpoint also has an optional column which contains the text representation of the prediction; e.g., if prediction=1.0, the predictionLabel column could have the value "Loan Granted". If such a column is available, then while configuring the favourable and unfavourable outcomes for the model you can specify the string values "Loan Granted" and "Loan Denied". If such a column is not available, you need to specify the integer/double values 1.0 and 0.0 for the favourable/unfavourable classes.
WML has a concept of an output schema, which defines the schema of the output of the WML scoring endpoint and the roles of the different columns. The roles are used to identify which column contains the prediction value, which column contains the prediction probability, which contains the class label value, and so on. The output schema is set automatically for models created using the model builder. It can also be set using the WML Python client. You can use the output schema to define a column which contains the string representation of the prediction by setting the modeling_role for that column to 'decoded-target'. The documentation for the WML Python client is available at http://wml-api-pyclient-dev.mybluemix.net/#repository; search for "OUTPUT_DATA_SCHEMA" to understand the output schema, and the API to use is the store_model API, which accepts OUTPUT_DATA_SCHEMA as a parameter.
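As a rough illustration, here is a minimal sketch of storing a model with such an output schema using the older watson_machine_learning_client V3 API. The column names, types and overall schema layout below are illustrative assumptions, not taken from the question, and the wml_credentials and trained_model objects are assumed to already exist.

from watson_machine_learning_client import WatsonMachineLearningAPIClient

client = WatsonMachineLearningAPIClient(wml_credentials)  # credentials dict assumed to exist

# Output schema sketch: the extra string column carries the decoded class label,
# marked with modeling_role 'decoded-target' so fairness monitoring can use
# "Loan Granted" / "Loan Denied" as the favourable / unfavourable outcomes.
output_data_schema = {
    "id": "output_schema",
    "fields": [
        {"name": "prediction", "type": "double",
         "metadata": {"modeling_role": "prediction"}},
        {"name": "probability", "type": "array",
         "metadata": {"modeling_role": "probability"}},
        {"name": "predictionLabel", "type": "string",
         "metadata": {"modeling_role": "decoded-target"}},
    ],
}

meta_props = {
    client.repository.ModelMetaNames.NAME: "loan-risk-model",
    client.repository.ModelMetaNames.OUTPUT_DATA_SCHEMA: output_data_schema,
}

stored_model = client.repository.store_model(model=trained_model, meta_props=meta_props)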

Related

Some use cases of seeds in dbt (data build tool)

Suppose we have a CSV seed file in dbt with fields like Id and First_name, but the column names are in lower case.
While loading the data, how can we convert them to upper case without using any extra space or creating macros for this?
You are looking for the quote_columns config for seeds. From the docs:
An optional seed configuration, used to determine whether column names in the seed file should be quoted when the table is created.
The default is to not quote on Snowflake; Snowflake then converts your field names to SHOUT_CASE. If you set quote_columns: true (see the docs on how to do this for a specific seed or for all seeds), Snowflake will respect the case of the quoted names that get passed to it.
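For example, in dbt_project.yml this might look roughly as follows (my_project is a placeholder project name; check the dbt docs for the exact scoping you need):

seeds:
  my_project:
    +quote_columns: true   # quote seed column names so Snowflake preserves their case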

How to visualize geodata (points) in the Databricks built-in map?

I simply have a dataframe of "lat" and "lon" coordinates that I would like to visualize on a map.
I am trying to view my data on the map, but nothing is being displayed.
In my case, the mmsi column is the id.
Please note that I have around 2M points to display. Is that possible in Databricks?
If there is no way around plotting 2,000,000 points in Databricks, what tool can handle such a large amount of data?
Any help is much appreciated!
The built-in map does not support latitude/longitude data points.
Delivering 2 million data points directly to a browser is problematic anyway: Databricks has a protective limit of 20 MB for HTML display, and your viewers won't be able to take in and visually process that level of detail in any case.
I recommend implementing a filter/summarization strategy to enable the display within Databricks notebooks. Some ideas can be taken from this Databricks post: https://databricks.com/blog/2019/12/05/processing-geospatial-data-at-scale-with-databricks.html
This Anaconda post also gives an excellent survey of data visualization packages, some of which (like Datashader and Vaex) implement a visual summarization strategy before rendering images: https://www.anaconda.com/blog/python-data-visualization-2018-why-so-many-libraries
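One simple pre-aggregation sketch in PySpark (the column names lat and lon come from the question; the dataframe name df and the rounding precision are assumptions): bin the coordinates to a coarse grid and count points per cell, so only the aggregate ever reaches the browser.

from pyspark.sql import functions as F

# Round coordinates to 2 decimal places (roughly 1 km cells) and count
# the points per cell; the aggregate is small enough to display or export.
binned = (df
          .withColumn("lat_bin", F.round("lat", 2))
          .withColumn("lon_bin", F.round("lon", 2))
          .groupBy("lat_bin", "lon_bin")
          .count())

display(binned)  # Databricks display(); feed this to a plotting library for the actual map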
The error message says: "Unrecognizable values in the first column. The values should be either country codes in ISO 3166-1 alpha-3 format (e.g. 'GBR') or US state abbreviations (e.g. 'TX')."
Note: To plot a graph of the world, use country codes in ISO 3166-1 alpha-3 format as the key.
A Map Graph is a way to visualize your data on a map.
Use Plot Options... to configure the graph.
Keys should contain the field with the location.
Series groupings is always ignored for World Map graphs.
Values should contain exactly one field with a numerical value.
Since there can be multiple rows with the same location key, choose "Sum", "Avg", "Min", "Max", or "Count" as the way to combine the values for a single key.
Different values are denoted by color on the map, and ranges are always spaced evenly.
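For instance, the data can be reduced to one row per ISO alpha-3 key with a single numeric value before choosing the Map plot. The column name country_iso3 and the dataframe df below are hypothetical.

from pyspark.sql import functions as F

# One row per ISO 3166-1 alpha-3 country code with a single numeric value,
# which is the shape the World Map chart expects.
per_country = df.groupBy("country_iso3").agg(F.count("*").alias("events"))
display(per_country)  # then pick the Map plot with Keys=country_iso3, Values=events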
Reference: Databricks - Charts and Graphs
Hope this helps.

Possible record collisions due to the size / definition of the TIMESTAMP field in the TIME SERIES type containers of GridDB

I am working with GridDB and I have observed loss of records during insertion, which I attribute to the limited precision of the timestamp field.
I tried to give the input field more precision, but on saving it gets truncated. The logs do not indicate any data loss or erroneous writes.
A query against the DB returns:
[{
"columns":[
  {"name":"original_timestamp","type":"TIMESTAMP"},
  {"name":"FIELD_A","type":"STRING"},
  ...
  {"name":"FIELD_Z","type":"STRING"},
  {"name":"code_timestamp","type":"STRING"}],
"results":[
  "2019-07-19T11:28:42.328Z",
  "SOME String Value for A",
  ...
  "SOME String Value for Z",
  "2019-07-19 11:28:59.239922"]
}]
The number of records ingested is lower than expected.
We're working on a model based on two indexes. Any other ideas and/or helpful experience?
Thanks in advance!
GridDB stores TIMESTAMP values with millisecond resolution; inserting records with finer resolution, such as micro- or nanoseconds, will result in the timestamp value being truncated to milliseconds. In a time series container the timestamp is the row key, so two records that truncate to the same millisecond collide and only one of them is kept.
There are three ways to get around the timestamp collisions:
Use a Collection with a long as your first index. In that long, store a Unix epoch timestamp in micro- or nanoseconds as required. You will obviously lose some time series functions and will have to manually convert the values used in comparison operators to a Unix epoch at your chosen resolution.
Use a Collection and disable the row key (no @RowKey annotation in Java, or set the last Boolean in ContainerInfo to False in other languages). This allows multiple records to have the same "row key" value. You can create a secondary index on this column to ensure queries are still fast. The TIMESTAMP and TO_TIMESTAMP_MS functions still work, but I'm fairly certain none of the other special timestamp functions will. When I've had to deal with timestamp collisions in GridDB, this is the solution I chose; a sketch of this approach follows the list.
Detect the collisions before you insert and if there is going to be a collision, write the colliding record into a separate container. Use multi-get/query to query all of the containers.
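A rough sketch of the second option using the GridDB Python client (griddb_python). The container name, column names and connection parameters are made up for illustration, and the exact signatures may vary between client versions.

import datetime
import griddb_python as griddb

factory = griddb.StoreFactory.get_instance()
store = factory.get_store(host="239.0.0.1", port=31999,
                          cluster_name="myCluster",
                          username="admin", password="admin")

# A Collection (not a TimeSeries) with the row key disabled (last argument False),
# so two rows may carry the same millisecond timestamp without overwriting each other.
conInfo = griddb.ContainerInfo("events_no_rowkey",
                               [["original_timestamp", griddb.Type.TIMESTAMP],
                                ["field_a", griddb.Type.STRING]],
                               griddb.ContainerType.COLLECTION, False)
col = store.put_container(conInfo)
col.create_index("original_timestamp")  # secondary index keeps time-range queries fast

now = datetime.datetime.utcnow()
col.put([now, "SOME String Value for A"])
col.put([now, "SOME String Value for A"])  # the same timestamp is now allowed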

How to increase the sample size used during schema discovery to 'unlimited'?

I have encountered some errors with the SDP where one of the potential fixes is to increase the sample size used during schema discovery to 'unlimited'.
For more information on these errors, see:
No matched schema for {"_id":"...","doc":{...}
The value type for json field XXXX was presented as YYYY but the discovered data type of the table's column was ZZZZ
XXXX does not exist in the discovered schema. Document has not been imported
Question:
How can I set the sample size? After I have set the sample size, do I need to trigger a rescan?
These are the steps you can follow to change the sample size. Be aware that a larger sample size will increase the runtime of the algorithm, and there is no indication of this in the dashboard other than the job remaining in the 'triggered' state for a while.
Verify the specific load has been stopped and the dashboard status shows it as stopped (with or without error)
Find a document https://<account>.cloudant.com/_warehouser/<source> where <source> matches the name of the Cloudant database you have issues with
Note: Check https://<account>.cloudant.com/_warehouser/_all_docs if the document id is not obvious
Substitute "sample_size": null (which scans a sample of 10,000 random documents) with "sample_size": -1 (to scan all documents in your database) or "sample_size": X (to scan X documents in your database where X is a positive integer)
Save the document and trigger a rescan in the dashboard; a new schema discovery run will execute using the defined sample size. (If you prefer to make the edit over HTTP rather than in the Cloudant dashboard, see the sketch below.)
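A rough sketch of editing the warehouser document with Python and requests, using Cloudant's standard CouchDB-style document API. The account, source database and credential placeholders mirror the steps above.

import requests

base = "https://<account>.cloudant.com"   # placeholder account
auth = ("<username>", "<password>")       # placeholder credentials

# Fetch the schema discovery document for the source database
doc_url = base + "/_warehouser/<source>"  # <source> = name of the Cloudant database
doc = requests.get(doc_url, auth=auth).json()

# -1 scans all documents; a positive integer scans that many; null samples 10,000
doc["sample_size"] = -1

# Save it back; the update needs the current _rev, which is already present in doc
requests.put(doc_url, json=doc, auth=auth).raise_for_status()

After saving, trigger the rescan from the dashboard as in the last step above.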

Any way to control number of decimal places while browsing SSAS cube?

When I browse the cube and pivot Sales by Month (for example), I get something like 12345.678901.
Is there a way to make it so that when a user browses they get values rounded to two decimal places, i.e. 12345.68, instead?
Thanks,
-teddy
You can enter a format string in the properties for your measure or calculation, and if your OLAP client supports it the formatting will be used; e.g. for 1 decimal place you'd use something like "#,0.0;(#,0.0)". Excel supports format strings by default, and you can configure Reporting Services to use them.
Also if you're dealing with money you should configure the measure to use the Currency data type. By default Analysis Services will use Double if the source data type in the database is Money. This can introduce rounding issues and is not as efficient as using Currency. See this article for more info: The many benefits of money data type. One side benefit of using Currency is you will never see more than 4 decimal places.
Either edit the display properties in the cube itself, so it always returns 2 decimal places whenever anyone browses the cube.
Or you can add in a format string when running MDX:
WITH MEMBER [Measures].[NewMeasure] AS '[Measures].[OldMeasure]', FORMAT_STRING='##0.00'
You can change the format string property of your measure. There are two possible ways:
If the measure is a regular (non-calculated) measure -
Go to the measure's properties and update 'Format String'
If the measure is a calculated measure -
Go to Calculations and update 'Format String'
