How to apply a Logistic Regression model to test data

I have a Titanic dataset that comes in two parts: training data and test data.
I have developed a model on the training set after missing-value and outlier treatment.
Now I have to apply the model to the test set. Do I need to do the missing-value and outlier treatment on that data as well before applying the model?
And will it be the same process whenever I have to predict?

Everything depends on the scenario you are trying to solve. Usually I apply the same data pre-processing to the test data as well, because I want to compare accuracy on the two sets, and that comparison is only fair if both are treated the same way. But if you also want to see how your regression performs in the presence of outliers, you can try leaving them in.
For prediction data, I think you should do the missing-value treatment, and the outlier treatment if possible. If you can tag something as an outlier, you can avoid running it through the model and getting weird results. But in most production scenarios it's difficult to know in advance that something is an outlier, so I usually don't do outlier clean-ups.
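To make "the same process at prediction time" automatic, one common pattern is to wrap the pre-processing and the model in a single pipeline, fit it on the training data only, and reuse it on the test data. A minimal scikit-learn sketch (the train_df/test_df frames and the column names are placeholders, not from the question):

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

X_train = train_df[["Age", "Fare", "Pclass"]]    # hypothetical feature columns
y_train = train_df["Survived"]
X_test = test_df[["Age", "Fare", "Pclass"]]

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # missing-value treatment
    ("clf", LogisticRegression(max_iter=1000)),
])

model.fit(X_train, y_train)           # imputation statistics are learned from train only
predictions = model.predict(X_test)   # the same treatment is applied to the test set automatically
```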
Hope this helps!

Related

How should a training dataset be distributed?

Apologies if this is a beginner question. I’m building a text-to-speech model. I was wondering if my training dataset should be “realistically” distributed (i.e. same distribution as the data it will be used on), or should it be uniformly distributed to make sure it performs well on all kinds of sentences. Thanks.
I'd say this depends on the dataset size. If you have a really, really small dataset (which is common in some domains and rare in others), then you'd want to ensure that all the "important kinds of data" (whatever that means for your task) are represented even if they're relatively rare. A realistic distribution is better if your dataset is large enough that all the key scenarios would be adequately represented anyway.
Also, if mistakes on certain data items matter more than others (which is likely in some domains), then it may make sense to overrepresent them, since you're no longer optimizing for the average case of the real distribution.
There's also the case of targeted annotation, where you look at the errors your model is making and specifically annotate extra data to overrepresent those cases - some types of data happen to be both very common and trivial to solve, so adding extra training data for them takes effort but doesn't change the results in any way.

Accuracy Document Embedding in Apache Solr

I used BERT document embeddings to perform information retrieval on the CACM dataset and got a very low accuracy score of around 6%. However, when I used the traditional BM25 method, the result was much closer to 40%, which is near the average accuracy reported in the literature for this dataset. This is all being performed within Apache Solr.
I also attempted information retrieval with Doc2Vec and achieved similarly poor results as with BERT. Is it not advisable to use document embeddings for IR tasks such as this one?
Many people find document embeddings work really well for their purposes!
If they're not working for you, possible reasons include:
insufficiency of training data
problems in your unshown process
different end-goals than others have - what's your idea of 'accuracy'?
It's impossible to say what's affecting your process, or your perception of its usefulness, without far more detail on what you're aiming to achieve and how you're doing it.
Most notably, if there's other published work that uses the same dataset and a similar definition of 'accuracy', and that claims a far better result with the same methods that give worse results for you, then it's more likely that there are errors in your implementation.
You'd have to name the results you're trying to match (ideally with links to the exact write-ups), and show the details of what your code does, for others to have any chance of guessing what's happening for you.

Do you keep the true distribution between classes when you manually create your own dataset, or do you make it balanced?

(I struggled a bit to phrase the title - please feel free to suggest another title).
I have a text dataset which I need to classify; say there are three classes. I need to create the targets by manually setting the labels based on the text (say the three classes are dog, cat, and bird).
When I do so I notice we have, say, 70% dog, 20% cat and 10% bird.
Since a lot of machine learning models struggle with imbalanced data, my first thought was to force the dataset to be balanced by simply ignoring some of the dog and cat texts (i.e. "undersampling"), thus ending up with an (almost) balanced dataset that is easier to train the model on.
My concern, though, is that if we want to train e.g. a neural network and get the probability for each class, wouldn't training on something other than the correct distribution of the data result in over- or under-confident predictions?
Indeed, if your dataset is imbalanced, there is a risk of hurting the performance of your classifier.
You'll find plenty of libraries to help you deal with this problem (see below), and the bottom line is that if the classes are equally represented in your dataset, it can only help prevent bias in your classifier:
https://imbalanced-learn.org/stable/auto_examples/index.html#general-examples
https://github.com/ufoym/imbalanced-dataset-sampler
https://github.com/MaxHalford/pytorch-resample (among others)
(But you can also do that sampling yourself; it shouldn't be too difficult, e.g. libraries like pandas have such functionality - see the sketch below.)
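A rough sketch of doing that undersampling yourself with pandas (assuming a DataFrame df with a label column; all names are placeholders):

```python
import pandas as pd

n_min = df["label"].value_counts().min()   # size of the rarest class (here "bird")

# Keep n_min randomly sampled rows from each class -> an (almost) balanced dataset
balanced = (
    df.groupby("label", group_keys=False)
      .apply(lambda g: g.sample(n=n_min, random_state=42))
)
```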
As a safeguard, split your dataset into 3:
Training (e.g. 70% of your data): the bulk of the data, used for learning.
Validation (e.g. 20%): used to tune your classifier and guard against overfitting.
Test (e.g. 10%): this data is NEVER exposed to your classifier for learning purposes; you keep it separate and only use it at the end, on your final model, to evaluate its true performance (you call predict and compare with the expected classes).
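One way to get that 70/20/10 split with scikit-learn, done as two consecutive splits (X and y are placeholders; stratify keeps the class proportions in each part):

```python
from sklearn.model_selection import train_test_split

# 70% train, 30% temporary hold-out
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Split the hold-out 2:1 -> 20% validation, 10% test
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=1/3, stratify=y_tmp, random_state=42)
```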
This should be a good starting point.

Regression Model for categorical data

I have a very large dataset in a CSV file (1,700,000 rows and 300 sparse features).
- It has a lot of missing values.
- The data is a mix of numeric and categorical values.
- The dependent variable (the class) is binary (either 1 or 0).
- The data is highly skewed; the number of positive responses is low.
Now what is required from me is to apply a regression model and any other machine learning algorithm to this data.
I'm new to this and I need help.
- How do I deal with categorical data in the case of a regression model? And do the missing values affect it much?
- What is the best prediction model I can try for large, sparse, skewed data like this?
- What program do you advise me to work with? I tried Weka but it can't even open that much data (memory failure). I know that MATLAB can open either a numeric CSV or a categorical CSV, but not mixed; besides, the missing values have to be imputed before it will open the file. I know a little bit of R.
I'm trying to manipulate the data using Excel, Access and Perl scripts, and that's really hard with this amount of data: Excel can't open much more than 1M records and Access can't open more than 255 columns. Any suggestions?
Thank you in advance for your help.
First of all, you are talking about classification, not regression: classification predicts a value from a fixed set (e.g. 0 or 1), while regression produces a real numeric output (e.g. 0, 0.5, 10.1543, etc.). Also, don't be confused by so-called logistic regression - it is a classifier too, and its name just reflects that it is based on linear regression.
To process such a large amount of data you need an inductive (updatable) model. In particular, Weka has a number of such algorithms under its classification section (e.g. Naive Bayes Updatable, Neural Networks Updatable and others). With an inductive model you can load the data portion by portion and update the model as you go (for Weka, see the Knowledge Flow interface for details on how to do this more easily).
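The same idea outside Weka, sketched (as an assumption on my part, not something prescribed above) with scikit-learn's partial_fit: read the CSV in chunks and update the model on each portion. The file and column names are placeholders, and the features are assumed to already be numeric and imputed (see the indicator-attribute step below).

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="log_loss")   # logistic-regression-like model, trained incrementally
classes = [0, 1]                       # all classes must be declared for partial_fit

# Read the 1.7M-row CSV chunk by chunk and update the model on each portion
for chunk in pd.read_csv("data.csv", chunksize=100_000):
    y = chunk["target"]                 # hypothetical label column
    X = chunk.drop(columns=["target"])  # assumes numeric, imputed features
    clf.partial_fit(X, y, classes=classes)
```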
Some classifiers can work with categorical data directly, but I can't recall any updatable ones among them, so most probably you will still need to transform the categorical data to numeric. The standard solution here is to use indicator attributes, i.e. substitute every categorical attribute with several binary indicators. E.g. if you have an attribute day-of-week with 7 possible values, you may substitute it with 7 binary attributes - Sunday, Monday, etc. Of course, in each particular instance only one of the 7 attributes may hold the value 1 and all the others have to be 0.
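A small sketch of that indicator-attribute encoding, here with pandas get_dummies (the day-of-week column is the example from above):

```python
import pandas as pd

df = pd.DataFrame({"day_of_week": ["Sunday", "Monday", "Tuesday", "Sunday"]})

# One binary column per distinct value; exactly one of them is 1 on each row
indicators = pd.get_dummies(df["day_of_week"], prefix="dow", dtype=int)
df = pd.concat([df.drop(columns=["day_of_week"]), indicators], axis=1)
```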
The importance of missing values depends on the nature of your data. Sometimes it is worth replacing them with some neutral value beforehand; sometimes the classifier implementation handles them itself (check the manual of the algorithm for details).
And, finally, for highly skewed data use the F1 measure (or just precision/recall) instead of accuracy.
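For example, with scikit-learn (the labels below are made up just to show the skew):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # highly skewed toward class 0
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))    # 0.5 - of the predicted positives, how many are real
print(recall_score(y_true, y_pred))       # 0.5 - of the real positives, how many were found
print(f1_score(y_true, y_pred))           # 0.5 - harmonic mean of the two
```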

How do you verify the correct data is in a data mart?

I'm working on a data warehouse and I'm trying to figure out how best to verify that data from our data cleansing (normalized) database makes it into our data marts correctly. I've done some searches, but the results so far talk more about ensuring that constraints are in place and that you need to do data validation during the ETL process (e.g. dates are valid, etc.). The dimensions were pretty easy, as I could either leverage the primary key or write a very simple and verifiable query to get the data. The fact tables are more complex.
Any thoughts? We're trying to make this very easy for a subject matter expert to run a couple of queries, see some data from both the data cleansing database and the data marts, and visually compare the two to ensure they are correct.
You test your fact table loads by implementing a simplified, pared-down subset of the same data manipulation elsewhere, and comparing the results.
You calculate the same totals, counts, or other figures at least twice: once from the fact table itself after it has finished loading, and once from some other source, such as:
the source data directly, controlling for all the scrubbing steps in between source and fact
a source system report that is known to be correct
etc.
If you are doing this in the database, you could write each test as a query that returns no records if everything is correct. Any records that get returned are exceptions: "count of x by (y, z) does not match".
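For illustration only, here is a rough Python/SQLite sketch of that pattern; the table and column names (staging_sales, fact_sales, region, sale_date, amount) are invented for the example, not taken from the question:

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")

# Compare totals by (region, sale_date); any returned row is an exception.
# This checks the staging -> fact direction; a symmetric query would cover the reverse.
exceptions = conn.execute("""
    SELECT s.region, s.sale_date, s.src_total, f.fact_total
    FROM   (SELECT region, sale_date, SUM(amount) AS src_total
            FROM   staging_sales GROUP BY region, sale_date) AS s
    LEFT JOIN
           (SELECT region, sale_date, SUM(amount) AS fact_total
            FROM   fact_sales GROUP BY region, sale_date) AS f
           USING (region, sale_date)
    WHERE  f.fact_total IS NULL OR s.src_total <> f.fact_total
""").fetchall()

assert not exceptions, f"{len(exceptions)} (region, date) totals do not match"
```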
See this excellent post by ConcernedOfTunbridgeWells for more recommendations.
Although it has some drawbacks and potential problems if you do a lot of cleansing or transforming, I've found you can round-trip an input file by re-generating it from the star schema(s) and then simply comparing the input file to the output file. It might require some massaging to make them match (e.g. one is left-padded, the other right-padded).
Typically, I had a program which used the same layout the ETL used and did a compare, ignoring alignment within a field. The files might also have to be sorted - there is a command-line sort I used for that.
If your ETL does a transform incorrectly and your reverse transform repeats the same mistake, this method won't show it, so I wouldn't claim it has complete coverage; but it's a pretty good first whack at a regression/unit test for each load.
