ValueError: Classification metrics can't handle a mix of binary and continuous targets? I use logistic regression here - logistic-regression

# test_data
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_pred = ridge_model.predict(x_test)
acc_score = accuracy_score(y_test, y_pred)
print('accuracy:', acc_score)
cnf_mat = confusion_matrix(y_test, y_pred)
print('confusion matrix:\n', cnf_mat)
class_score = classification_report(y_test, y_pred)
print('classification report:', class_score)
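For context: the error is raised because ridge_model.predict() returns continuous values, while accuracy_score, confusion_matrix and classification_report expect discrete class labels. Below is a minimal sketch (with synthetic data and made-up variable names) of fitting an actual LogisticRegression, whose predict() returns class labels; if a regressor must be kept, its continuous outputs would first have to be thresholded into classes.
# Minimal sketch (synthetic data): classification metrics need discrete labels,
# which LogisticRegression.predict() returns, unlike Ridge.predict().
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

rng = np.random.default_rng(0)
x_train = rng.normal(size=(100, 3))
y_train = (x_train[:, 0] > 0).astype(int)
x_test = rng.normal(size=(20, 3))
y_test = (x_test[:, 0] > 0).astype(int)

log_model = LogisticRegression().fit(x_train, y_train)
y_pred = log_model.predict(x_test)  # discrete 0/1 labels, not continuous scores

print('accuracy:', accuracy_score(y_test, y_pred))
print('confusion matrix:\n', confusion_matrix(y_test, y_pred))
print('classification report:\n', classification_report(y_test, y_pred))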

How to use iteration in R to simplify my code for GLM?

I've just started using R and am having some issues when trying to simplify my code. I can't share my real data, but have used an open dataset to ask my question (Breed to represent my IV and Age to represent a DV).
In my dataset, I have all factor variables - my independent variable has 3 levels and my dependent variables all have 2 levels (0/1). Out of a larger dataset, I have six dependent variables and would like to run some descriptive stats and GLM for each. I have figured out working code for running each DV independently, see below. However, I am currently just copying & pasting this code and replacing the DV variables each time. I would like to instead create a function that I can apply to simplify my code.
I have attempted to do this using the purrr package (map) but have had no luck. If someone could provide an example of how to do this using the sample data below, it would help me out a lot (though I know the sample data below only includes one DV).
install.packages("GLMsData")
library(GLMsData)
data(butterfat)
library(tidyverse)
library(dplyr)

# Descriptive summaries
butterfat %>%
  group_by(Breed, Age) %>%
  summarise(n())
prop.table(table(butterfat$Breed, butterfat$Age), 1)

# Model
Age_model1 <- glm(Age ~ Breed, family = binomial, data = butterfat, na.action = na.omit)

# Get summary, including coefficients and p-values
summary(Age_model1)

# See coefficients, get odds ratio and confidence intervals
Age_model1$coefficients
exp(Age_model1$coefficients)
exp(confint(Age_model1))
This pattern is covered in the "Many Models" chapter of R for Data Science:
library(tidyverse)

iris %>%
  group_by(Species) %>%
  nest() %>%
  mutate(lm = map(data, ~ glm(Sepal.Length ~ Sepal.Width, data = .))) %>%
  mutate(tidy = map(lm, broom::tidy)) %>%
  unnest(tidy)

Flink dynamic partitioning in S3 with DataStream API

I am writing a Flink DataStream pipeline in Java where the sink is configured to write output to S3. However, I am trying to understand whether there is any way to dynamically partition the S3 output into directories based on values from the streaming data itself. For example:
Let's say we have two types of departments for class 10, i.e. science and maths. The input datastream has the fields
class, department, student_name, marks
10, science, abc, 65
10, maths, abc, 71
10, science, bcd, 59
So the pipeline should produce data in the following directory structure:
s3://<bucket_name>/class=10/department=science/part-xxx
s3://<bucket_name>/class=10/department=maths/part-xxx
Please note that I know this is possible with the Table API, but I am looking for an alternative with the DataStream API. The closest option seems to be DateTimeBucketAssigner, but that will not work for my use case. Any thoughts?

How to generate a training dataset from huge binary data using TF 2.0?

I have a binary dataset of over 15G. I want to extract the data for model training using TF 2.0. Currently here is what I am doing:
import numpy as np
import tensorflow as tf

data1 = np.fromfile('binary_file1', dtype='uint8')
data2 = np.fromfile('binary_file2', dtype='uint8')
dataset = tf.data.Dataset.from_tensor_slices((data1, data2))
# then do something like batch, shuffle, prefetch, etc.
for sample in dataset:
    pass
but this loads everything into memory and I don't think it's a good way to deal with such big files. What should I do to handle this problem?
You have to make use of FixedLengthRecordDataset to read binary files. Although this dataset accepts several files together, it extracts data from only one file at a time. Since you want data to be read from two binary files simultaneously, you will have to create two FixedLengthRecordDataset objects and then zip them.
Another thing to note is that you have to specify the record_bytes parameter, which states how many bytes you would like to read per record. The total size of each binary file has to be an exact multiple of record_bytes.
ds1 = tf.data.FixedLengthRecordDataset('file1.bin', record_bytes=1000)
ds2 = tf.data.FixedLengthRecordDataset('file2.bin', record_bytes=1000)
ds = tf.data.Dataset.zip((ds1, ds2))
ds = ds.batch(32)
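As a follow-up, here is a minimal sketch (the file names and record size are placeholders) of how the zipped byte records might be decoded into tensors before batching, using tf.io.decode_raw:
# Minimal sketch (hypothetical files/record size): decode each fixed-length
# byte record into a uint8 tensor, then batch and prefetch lazily.
import tensorflow as tf

RECORD_BYTES = 1000  # each file's total size must be an exact multiple of this

ds1 = tf.data.FixedLengthRecordDataset('file1.bin', record_bytes=RECORD_BYTES)
ds2 = tf.data.FixedLengthRecordDataset('file2.bin', record_bytes=RECORD_BYTES)

def decode_pair(raw1, raw2):
    # each element is a scalar string of RECORD_BYTES bytes
    return tf.io.decode_raw(raw1, tf.uint8), tf.io.decode_raw(raw2, tf.uint8)

ds = (tf.data.Dataset.zip((ds1, ds2))
      .map(decode_pair, num_parallel_calls=tf.data.experimental.AUTOTUNE)
      .batch(32)
      .prefetch(tf.data.experimental.AUTOTUNE))

for x1, x2 in ds.take(1):
    print(x1.shape, x2.shape)  # (32, RECORD_BYTES) each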

Vespa - Proton: Custom bucketing & Query

References:
id scheme
Format: id:<namespace>:<document-type>:<key/value-pairs>:<user-specified>
http://docs.vespa.ai/documentation/content/buckets.html
http://docs.vespa.ai/documentation/content/idealstate.html
It's possible to structure data with user-defined bucketing logic by using the 32 LSBs in the document-id format (the n / g selections).
However, it isn't very clear how to route queries to a specific bucket range based on a decision taken in advance.
E.g., it is possible to split data by time range (start-time/end-time) if I define n (a number) encoding the range. All documents tagged this way will end up in the same bucket (which will then follow its own course of splitting on the number of documents / size, as configured).
However, how do I write a search query on data indexed in this manner?
Is it possible to tell the processor to choose a specific bucket, or a range of buckets (in case the distribution algorithm has moved buckets)?
You can choose one bucket in a query by specifying the streaming.groupname query property.
Either in the HTTP request by adding
&streaming.groupname=[group]
or in a Searcher by
query.properties().set("streaming.groupname","[group]").
If you want multiple buckets, use the parameter streaming.selection instead, which accepts any document selection expression: http://docs.vespa.ai/documentation/reference/document-select-language.html
To specify e.g. two buckets, set streaming.selection (in the HTTP request or a Searcher) to
id.group=="[group1]" or id.group=="[group2]"
See http://docs.vespa.ai/documentation/streaming-search.html
Note that streaming search should only be used when each query only needs to search one or a few buckets. It avoids building inverted indexes, which is cheaper in that special case (only).
The &streaming.* parameters are described here: http://docs.vespa.ai/documentation/reference/search-api-reference.html#streaming.groupname
This only applies to document types configured with mode=streaming; for the default mode, index, you cannot control the query routing: http://docs.vespa.ai/documentation/reference/services-content.html#document
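For illustration, a minimal sketch (the endpoint, document type, and group names are hypothetical) of passing these parameters from a Python client through the HTTP search API:
# Minimal sketch (hypothetical endpoint/doc type/groups): streaming.groupname and
# streaming.selection are just plain HTTP query parameters on the search API.
import requests

resp = requests.get(
    "http://localhost:8080/search/",
    params={
        "query": "sddocname:mydoc",            # hypothetical document type
        "streaming.groupname": "2014-07",       # restrict the query to one group/bucket
        # or, for several groups:
        # "streaming.selection": 'id.group=="2014-07" or id.group=="2014-08"',
    },
)
print(resp.json())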

Why use an external database with Matlab?

Why should you use an external database (e.g. MySQL) when working with (large/growing) data?
I know of some projects which use SQL databases, but I can't see the advantage you get from doing this in contrast to just storing everything in .mat files (as for example stated here: http://www.matlabtips.com/how-to-store-large-datasets/)
Where is this necessary? Where does this approach simplify things?
Regarding growing data, let's take an example where, on a production line, you would measure different sources with different sensors:
Experiment.Date = '2014-07-18 # 07h28';
Experiment.SensorType = 'A';
Experiment.SensorSerial = 'SENSOR-00012-A';
Experiment.SourceType = 'C';
Experiment.SourceSerial = 'SOURCE-00143-C';
Experiment.SensorPositions = 180 * linspace(0, 359, 360) / pi;
Experiment.SensorResponse = rand(1, 360);
And store these experiments on disk using .mat files:
experiment.2013-01-02.0001.mat
experiment.2013-01-02.0002.mat
experiment.2013-01-02.0003.mat
experiment.2013-01-03.0004.mat
...
experiment.2014-07-18.0001.mat
experiment.2014-07-18.0002.mat
So now, if I ask you:
"what is the typical response of sensors of type B when the source is of type E" ?
Or:
"Which sensor has best performances to measure sources of type C ? Sensors A or sensors B ?"
"How performances of these sensors degrade with time ?"
"Did modification we made last july to production line improved lifetime of sensors A ?"
Loading all these .mat files into memory to check whether the date, sensor and source type are correct, and then calculating min/mean/max responses and other statistics, is going to be very painful and time consuming, plus it requires writing custom code for file selection!
Building a database on top of these .mat files can be very useful to "SELECT/JOIN/..." the elements of interest and then perform further statistics or operations.
NB: The database does not replace the .mat files (i.e. the information); it is just a practical and standard way to quickly select some of them based on conditions, without having to load and parse everything.
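To make the idea concrete, here is a minimal sketch of such a metadata index (in Python with sqlite3 rather than MATLAB; the schema, file names and values are made up, and the same pattern works from MATLAB via the Database Toolbox): queries select only the relevant .mat files, which are then loaded on demand.
# Minimal sketch (made-up schema/paths): index each .mat file's metadata in SQLite
# so a SELECT returns only the files worth loading.
import sqlite3

con = sqlite3.connect("experiments.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS experiment (
        mat_file      TEXT PRIMARY KEY,   -- path to the .mat file on disk
        date          TEXT,
        sensor_type   TEXT,
        sensor_serial TEXT,
        source_type   TEXT,
        source_serial TEXT
    )
""")

# Register one experiment (normally done once, when the .mat file is written).
con.execute(
    "INSERT OR REPLACE INTO experiment VALUES (?, ?, ?, ?, ?, ?)",
    ("experiment.2014-07-18.0001.mat", "2014-07-18 07:28",
     "A", "SENSOR-00012-A", "C", "SOURCE-00143-C"),
)
con.commit()

# "Typical response of sensors of type B when the source is of type E":
# fetch only the matching file names, then load just those .mat files.
rows = con.execute(
    "SELECT mat_file FROM experiment WHERE sensor_type = ? AND source_type = ?",
    ("B", "E"),
).fetchall()
print([r[0] for r in rows])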
