Why does Naive Bayes require balanced training data?

I created a word sentiment app using the Naive Bayes algorithm.
There are two classes in this classifier's training data: positive training data and negative training data. I take the unique words from the training data in each group, so I have the set of unique words for each class. Then I calculate the probability of occurrence of each unique word.
The problem appears when I use uneven training data. For example, if I use 60% negative training data and 40% positive training data, then the test results lean toward negative, and vice versa.
Other than using balanced data, what can I do to solve this problem? Is there an additional method I should add?

Naive Bayes is sensitive to imbalanced training data because the posterior for each class is the product of the word likelihoods and the class prior, and that prior is estimated from the proportion of training documents in each class.
With a 60/40 split, the negative class starts every prediction with a built-in advantage, which is exactly the bias you are seeing.
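To make this concrete, here is a minimal sketch (the tiny corpus and labels are invented) of how the class prior shows up in scikit-learn's MultinomialNB, and one common mitigation: forcing a uniform prior so only the word likelihoods drive the prediction. Other options are resampling the training data toward balance or adjusting the decision threshold.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Toy, imbalanced training set: 40% positive, 60% negative.
    train_texts = ["great movie", "loved it", "awful plot", "terrible acting", "bad ending"]
    train_labels = ["pos", "pos", "neg", "neg", "neg"]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(train_texts)

    # Default: the prior P(class) is estimated from class frequencies, so "neg"
    # starts every prediction with a built-in advantage.
    nb_default = MultinomialNB().fit(X, train_labels)

    # Mitigation: force a uniform prior so only the word likelihoods decide.
    # (Equivalently: MultinomialNB(class_prior=[0.5, 0.5]).)
    nb_uniform = MultinomialNB(fit_prior=False).fit(X, train_labels)

    test = vectorizer.transform(["great acting"])
    print(nb_default.predict_proba(test))   # pulled toward "neg" by the 60/40 prior
    print(nb_uniform.predict_proba(test))   # prior contribution neutralized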

Related

ANN Performance Using Separate Namespaces

I am trying to perform ANN, but my data is split into partitions or "tenants." Searches are always restricted to a single tenant, which represents a small percentage of the total documents.
I first tried implementing this using a filter on a tenant string attribute. However, I came across this piece of documentation, which suggests the performance will be poor:
There is a small problem here however. If the eligibility list is small in relation to the number of items in the graph, skipping occurs with a high probability. This means that the algorithm needs to consider an exponentially increasing number of candidates, slowing down the search significantly. To solve this, Vespa.ai switches over to a brute-force search when this occurs. The result is an efficient ANN search when combined with filters.
What's the best way to solve my problem? Will partitioning my data into separate namespaces trigger the creation of a separate HNSW graph per namespace?
Performance will be fine; the query planner will simply choose not to use the ANN index for these queries. You'll find lots of details on this topic, including how to tune this, in this blog post: https://blog.vespa.ai/constrained-approximate-nearest-neighbor-search/
If all your queries target a single tenant that is a small percentage of the total documents, I don't think you necessarily need to create an HNSW index at all, but this depends on the absolute numbers and on how large the largest "small percentage" is.
(Namespaces are not relevant here - their only purpose is to safely add a string to ids so that you can have multiple sources of ids and still be guaranteed global uniqueness.)
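For a rough picture of what such a query looks like, here is a sketch against Vespa's HTTP search API; the field names, tenant value, rank profile, and vector are all placeholders, and the annotation names should be double-checked against the documentation. The tenant filter is combined with nearestNeighbor, and `approximate: false` makes the exact (non-HNSW) path explicit rather than relying on the automatic fallback.

    import requests

    # Hypothetical schema: a dense "embedding" tensor field and a "tenant" string field.
    yql = (
        'select * from sources * where '
        '{targetHits: 100, approximate: false}nearestNeighbor(embedding, q_vec) '
        'and tenant contains "tenant-42"'
    )

    response = requests.post("http://localhost:8080/search/", json={
        "yql": yql,
        "input.query(q_vec)": [0.12, -0.08, 0.53],  # must match the embedding dimensionality
        "ranking": "semantic",                      # placeholder rank profile
        "hits": 10,
    })
    print(response.json())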

Efficiently search large DB for similar records

Suppose the following: as input, one gets a record consisting of N numbers and booleans. This vector has to be compared to a database of vectors, each of which includes M additional "result" elements. That means the database holds P vectors of size N+M.
Each vector in the database holds a boolean as its last element. The aim of the exercise is to find, as fast as possible, the record(s) that are the closest match to the input vector AND whose result vector ends with a TRUE boolean.
To make the above a bit more comprehensible, consider the following example:
A database with personal health information, consisting of records holding:
age
gender
weight
length
heart issues (boolean)
lung issues (boolean)
residence
alternative plan chosen (if any)
accepted offer
The program would then get an input like
36 Male 185pound 68in FALSE FALSE NYC
It would then find out which plan would be the best to offer the client, based on what's in the database.
I know of a few methods which would help to do this, e.g. the Levenshtein distance. However, most methods would involve searching the entire database for the best matches.
Are there any algorithms or methods which would cut back on the processing power/time required? I can't imagine that, e.g., insurance agencies don't use more efficient methods to search their databases...
Any insights into this area would be greatly appreciated!
Assumption: this is a relational database. If instead it is NoSQL, then please provide more info on which DB.
Do you have the option to create bitmap indexes? They can cut down the number of records returned. That is useful for almost all of the columns, since the cardinalities are low.
After that, the only one left is residence, and you should use a geo distance for that.
If you are unable to create bitmap indexes, then what are your filtering options? If there are none, you have to do a full table scan.
For each of the components, e.g. age, gender, etc., you need to
(a) determine a distance metric, and
(b) determine how to compute that metric and how the per-field distances combine into an overall distance between records.
I'm not sure Levenshtein distance would work here; you need to take each field separately to find its contribution to the whole distance measure.
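As an illustration of that per-field approach (field names, weights, and scaling constants are all made up, and the residence/geo part is left as a comment), a brute-force baseline in Python might look like this:

    def record_distance(a, b, weights):
        """a and b are dicts such as {'age': 36, 'gender': 'M', 'weight_lb': 185,
        'height_in': 68, 'heart_issues': False, 'lung_issues': False, ...}."""
        d = 0.0
        # Numeric fields: absolute difference, roughly rescaled to comparable ranges.
        d += weights['age']       * abs(a['age'] - b['age']) / 50.0
        d += weights['weight_lb'] * abs(a['weight_lb'] - b['weight_lb']) / 100.0
        d += weights['height_in'] * abs(a['height_in'] - b['height_in']) / 20.0
        # Categorical/boolean fields: 0 if equal, 1 otherwise.
        d += weights['gender']       * (a['gender'] != b['gender'])
        d += weights['heart_issues'] * (a['heart_issues'] != b['heart_issues'])
        d += weights['lung_issues']  * (a['lung_issues'] != b['lung_issues'])
        # Residence would need a geo distance (e.g. haversine on lat/lon); omitted here.
        return d

    def closest_accepted(query, records, weights, k=5):
        # Filter on the required trailing boolean first, then rank by distance.
        candidates = [r for r in records if r['accepted_offer']]
        return sorted(candidates, key=lambda r: record_distance(query, r, weights))[:k]

Any pre-filtering (bitmap indexes, geo buckets) only shrinks the candidate list that this ranking step has to scan.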

Most appropriate AI for parameter weighting?

I have data of this form:
[(v1, A1, B1), (v2, A2, B2), (v3, A3, B3), ...]
The vs correspond to the data elements and the As and Bs to numerical values characterizing the vs.
A human can look at this data and see which tuple seems the best "match" according to the A and B values. I want a form of AI that I could train by picking one of these tuples as the best, and that would adjust the weights given to A and B.
Basically, each tuple represents an approximation to a value. A represents an error and B represents the complexity of each approximation. I want some compromise between error and complexity by assigning them different weightings. I want to run several trials with approximations to different values, and choose the one I think looks the best, and have the AI adjust the weightings correspondingly.
What you described is also known as a model selection problem, something often encountered in machine learning and statistics. You basically have some models that fit your data by some measure of goodness (typically measured as error or log likelihood) and those models have some complexity measure (typically the number of parameters in the model). You want to pick the best fitting model and penalize its complexity because that can be a sign of overfitting.
Typically, the degree to which overfitting can affect you is driven by the size of your data. But there are some measures that explicitly allow you to trade off model fitness and complexity:
Akaike information criterion
Bayesian information criterion
Regularization
Choosing a model based on your data as above can bias the model choice toward that data. Thus, this is typically done using a validation set and then evaluated on a test set.
I don't know if your approach of having an algorithm solve this problem is a good one. Typically it depends on your data and some degree of intuition. The meta-machine-learning technique you described probably won't be too reliable, in my opinion. Better to start with some more principled and simpler ideas first.
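For a concrete feel of those information criteria, here is a small sketch (the candidate models, log-likelihoods, and parameter counts are invented numbers) of ranking models by AIC and BIC; in your setup the log-likelihood plays the role of A (fit) and the parameter count the role of B (complexity):

    import math

    def aic(log_likelihood, n_params):
        return 2 * n_params - 2 * log_likelihood

    def bic(log_likelihood, n_params, n_samples):
        return n_params * math.log(n_samples) - 2 * log_likelihood

    # (name, log-likelihood, number of parameters) - illustrative values only.
    candidates = [
        ("poly_deg2", -120.4, 3),
        ("poly_deg5", -112.9, 6),
        ("poly_deg9", -110.1, 10),
    ]
    n_samples = 200

    for name, ll, k in candidates:
        print(name, round(aic(ll, k), 1), round(bic(ll, k, n_samples), 1))
    # Lower is better for both criteria; BIC penalizes extra parameters more
    # heavily as the sample size grows.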

Regression Model for categorical data

I have a very large dataset in a CSV file (1,700,000 rows and 300 sparse features).
- It has a lot of missing values.
- The data is a mix of numeric and categorical values.
- The dependent variable (the class) is binary (either 1 or 0).
- The data is highly skewed; the number of positive responses is low.
Now what is required from me is to apply a regression model and other machine learning algorithms to this data.
I'm new to this and I need help.
- How do I deal with categorical data in a regression model, and do the missing values affect it much?
- What is the best prediction model I can try for large, sparse, skewed data like this?
- What program do you advise me to work with? I tried Weka but it can't even open that much data (memory failure). I know that MATLAB can open either a numeric CSV or a categorical CSV but not a mixed one, and the missing values have to be imputed before it will open the file. I know a little bit of R.
I'm trying to manipulate the data using Excel, Access, and Perl scripts, and that's really hard with this amount of data. Excel can't open more than about 1M records and Access can't handle more than 255 columns. Any suggestions?
Thank you for your help in advance.
First of all, you are talking about classification, not regression: classification predicts a value from a fixed set (e.g. 0 or 1), while regression produces a real-valued output (e.g. 0, 0.5, 10.1543, etc.). Also, don't be confused by so-called logistic regression - it is a classifier too, and its name just reflects that it is based on linear regression.
To process such a large amount of data you need an incremental (updatable) model. In particular, Weka has a number of such algorithms under the classification section (e.g. Naive Bayes Updateable, updatable neural networks, and others). With an incremental model you will be able to load the data portion by portion and update the model appropriately (for Weka, see the Knowledge Flow interface for an easier way to do this).
Some classifiers can work with categorical data directly, but I can't remember any updatable ones among them, so most probably you will still need to transform categorical data to numeric. The standard solution here is to use indicator attributes, i.e. substitute every categorical attribute with several binary indicators. E.g. if you have an attribute day-of-week with 7 possible values, you may substitute it with 7 binary attributes - Sunday, Monday, etc. Of course, in each particular instance only one of the 7 attributes may hold the value 1 and all others have to be 0.
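As a small illustration of the indicator-attribute idea outside Weka (column names and values are invented), pandas can generate the binary columns directly:

    import pandas as pd

    df = pd.DataFrame({
        "day_of_week": ["Sun", "Mon", "Tue", "Sun"],
        "amount": [10.0, 3.5, 7.2, 1.0],
        "label": [1, 0, 0, 1],
    })

    encoded = pd.get_dummies(df, columns=["day_of_week"])
    print(encoded.columns.tolist())
    # ['amount', 'label', 'day_of_week_Mon', 'day_of_week_Sun', 'day_of_week_Tue']
    # Exactly one day_of_week_* column is set per row; the others are 0.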
The importance of missing values depends on the nature of your data. Sometimes it is worth replacing them with some neutral value beforehand; sometimes the classifier implementation handles them itself (check the manual for your algorithm for details).
And, finally, for highly skewed data use the F1 measure (or just precision/recall) instead of accuracy.
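If you end up in Python rather than Weka, the same ideas (incremental updates on chunks that fit in memory, indicator attributes, and precision/recall/F1 for the skewed class) could be sketched roughly like this; the file name, column names, and chunk size are placeholders:

    import pandas as pd
    from sklearn.linear_model import SGDClassifier

    DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]  # fixed categories keep columns stable across chunks
    clf = SGDClassifier(loss="log_loss", class_weight="balanced")  # logistic regression, skew-aware

    for chunk in pd.read_csv("data.csv", chunksize=100_000):
        chunk["day_of_week"] = pd.Categorical(chunk["day_of_week"], categories=DAYS)
        chunk = pd.get_dummies(chunk, columns=["day_of_week"], dtype=float)  # indicator attributes
        chunk = chunk.fillna(0)                                              # crude missing-value handling
        y = chunk.pop("label")
        clf.partial_fit(chunk.values, y.values, classes=[0, 1])              # updatable model, chunk by chunk

    # On a held-out set loaded the same way, report precision/recall/F1 instead of accuracy:
    # from sklearn.metrics import precision_recall_fscore_support
    # precision, recall, f1, _ = precision_recall_fscore_support(
    #     y_test, clf.predict(X_test), average="binary")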

Missing values for the data to be used in a Neural Network model for prediction

I currently have a lot of data that will be used to train a prediction neural network (gigabytes of weather data for major airports around the US). I have data for almost every day, but some airports have missing values in their data. For example, an airport might not have existed before 1995, so I have no data before then for that specific location. Also, some are missing whole years (one might span from 1990 to 2011, missing 2003).
What can I do to train with these missing values without misguiding my neural network? I thought about filling the empty data with 0s or -1s, but I feel like this would cause the network to predict these values for some outputs.
I'm not an expert, but surely this would depend on the type of neural network you have?
The whole point of neural networks is they can deal with missing information and so forth.
I agree though, setting empty data to 1s and 0s can't be a good thing.
Perhaps you could give some info on your neural network?
I use NNs a lot for forecasting, and I can tell you that you can simply leave those "holes" in your data. NNs learn relationships inside the observed data, so if you don't have a specific period it doesn't matter... If you set the empty data to a constant value, you will be giving your training algorithm misleading information. NNs don't need "continuous" data; in fact, it's good practice to shuffle the data sets before training so that the backpropagation phase runs on non-contiguous samples...
A type of neural network called an autoencoder is suitable for your task. Autoencoders can be used to reconstruct their input: an autoencoder is trained to learn the underlying data manifold/distribution. They are mostly used for signal reconstruction tasks such as images and sound, but you could use one to fill in the missing features.
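For instance, here is a rough sketch of autoencoder-based imputation (assuming TensorFlow/Keras is available; the data is synthetic and the layer sizes are arbitrary): train the autoencoder on complete rows, then replace the missing cells of incomplete rows with the reconstructed values.

    import numpy as np
    from tensorflow.keras import layers, models

    # Synthetic stand-in for the weather matrix: rows = days, columns = features,
    # with ~10% of the entries knocked out as missing (NaN).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 12)).astype("float32")
    X[rng.random(X.shape) < 0.1] = np.nan

    complete_rows = X[~np.isnan(X).any(axis=1)]
    n_features = X.shape[1]

    autoencoder = models.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(32, activation="relu"),
        layers.Dense(8, activation="relu"),   # bottleneck
        layers.Dense(32, activation="relu"),
        layers.Dense(n_features),
    ])
    autoencoder.compile(optimizer="adam", loss="mse")
    autoencoder.fit(complete_rows, complete_rows, epochs=50, batch_size=64, verbose=0)

    # Impute: seed missing cells with column means, reconstruct, keep observed values.
    col_means = np.nanmean(X, axis=0)
    X_seeded = np.where(np.isnan(X), col_means, X)
    X_imputed = np.where(np.isnan(X), autoencoder.predict(X_seeded, verbose=0), X)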
There is also another technique known as matrix factorization, which is used in many recommendation systems. People use matrix factorization to fill huge matrices that have a lot of missing values. For instance, suppose there are 1 million movies on IMDb. Almost no one has watched even 1/10 of those movies, but each user has voted on some of them. The matrix is N by M, where N is the number of users and M is the number of movies. Matrix factorization is among the techniques used to fill in the missing values and suggest movies to users based on their previous votes for other movies.
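A bare-bones sketch of that idea (sizes, learning rate, and regularization are arbitrary illustration values): factor the observed entries of a ratings matrix into two low-rank factors with gradient descent, then read predictions for the missing cells off their product.

    import numpy as np

    rng = np.random.default_rng(0)
    n_users, n_items, rank = 50, 40, 5

    R = rng.integers(1, 6, size=(n_users, n_items)).astype(float)
    observed = rng.random(R.shape) < 0.2          # pretend only 20% of the ratings are known
    R[~observed] = np.nan

    U = 0.1 * rng.normal(size=(n_users, rank))
    V = 0.1 * rng.normal(size=(n_items, rank))
    lr, reg = 0.01, 0.05

    for epoch in range(200):
        pred = U @ V.T
        err = np.where(observed, R - pred, 0.0)   # only observed cells contribute to the gradient
        U += lr * (err @ V - reg * U)
        V += lr * (err.T @ U - reg * V)

    R_filled = U @ V.T                            # predictions for the missing entries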
