Decision boundary difference between Logistic Regression with Lasso and Ridge regularization when the data is linearly separable - logistic-regression

As in the title: is the decision boundary different between logistic regression with Lasso (L1) regularization and logistic regression with Ridge (L2) regularization when the data is linearly separable?
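One way to explore this empirically is to fit both penalties on a separable toy dataset and compare the fitted boundaries. The sketch below uses scikit-learn; the cluster positions and the regularization strength C are arbitrary choices, not part of the original question.

```python
# Sketch: compare L1 (Lasso) vs. L2 (Ridge) regularized logistic regression
# on a linearly separable toy dataset; all settings here are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(100, 2)),   # class 0 cluster
               rng.normal(+2.0, 0.5, size=(100, 2))])   # class 1 cluster
y = np.array([0] * 100 + [1] * 100)                     # linearly separable

lasso = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
ridge = LogisticRegression(penalty="l2", solver="liblinear", C=1.0).fit(X, y)

# Each boundary is the line w·x + b = 0; compare the fitted parameters.
print("L1:", lasso.coef_, lasso.intercept_)
print("L2:", ridge.coef_, ridge.intercept_)
```

Because the two penalties shrink the weights differently (L1 can zero out coordinates entirely), the fitted boundaries generally do not coincide even when both classifiers separate the data perfectly.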

Related

Is there an algorithm to sort or filter a database table by distance to a vector, other than the naive one?

Say I have a large database table whose entries are vectors. I want to search and sort by distance to a given vector. The naive way is to compute, for every query, the distance between my vector and each vector in the database, and then sort by that distance.
Is there any other known algorithm for doing this, perhaps involving some type of indexing in advance?
Alternatively, are there known implementations of such algorithms, for say SQL or Elasticsearch?
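As a point of reference, the naive scan and one possible index-based alternative are sketched below; the 128-dimensional random vectors and the top-10 cutoff are assumptions made purely for illustration.

```python
# Sketch: naive linear scan vs. a prebuilt spatial index for vector search.
# The vector dimensionality and k are illustrative assumptions.
import numpy as np
from sklearn.neighbors import BallTree

database = np.random.rand(100_000, 128)   # stand-in for the table of vectors
query = np.random.rand(128)

# Naive approach: compute every distance, then sort.
distances = np.linalg.norm(database - query, axis=1)
naive_top10 = np.argsort(distances)[:10]

# Indexed approach: build the structure once up front, reuse it per query.
tree = BallTree(database)
dist, idx = tree.query(query.reshape(1, -1), k=10)
```

Tree indexes degrade in high dimensions, which is why approximate nearest-neighbour indexes (e.g. HNSW, available through Elasticsearch's dense_vector kNN search or the pgvector extension for PostgreSQL) are the usual answer for large vector tables.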

Tf-idf with large or small corpus size

"An essence of using Tf-Idf method with large corpuses is, the larger size of corpuses used is, the more unique weights terms have. This is because of the increasing of documents size in corpus or documents length gives a lower probability of duplicating a weight value for two terms in corpus. That is, the weights in Tf-Idf scheme can present a fingerprint for weights. Where in low size corpus, Tf-Idf can’t make that difference since there is huge potential of finding two terms having the same weights since they share the same source documents with the same frequency in each document. This feature can be an adversary and supporter by using Tf-Idf weighting scheme in plagiarism detection field, depending on the corpus size."
This is what I have deduced from tf-idf technique .. is it true?
Are there any link or documents can prove my conclusion؟
After 4 years of waiting for an answer, I can say the answer is yes :)
This can actually be demonstrated quite simply, as in the following picture: we have 4 documents, and below them the TF and TF-IDF tables for each term.
When we have a small corpus (few documents), the probability that some terms share the same distribution is high (e.g. "air" and "quality"), and because of this their tf-idf values are identical; see the table above.
But when the corpus contains a huge number of documents, it is much less likely that two terms will have the same distribution across the whole corpus.
Note: I used this website to calculate Tf-Idf: https://remykarem.github.io/tfidf-demo/
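The effect is easy to reproduce on a toy corpus: two terms that appear in exactly the same documents with the same counts get identical tf-idf columns. The four documents below are made up for illustration.

```python
# Sketch: in a tiny corpus, terms with the same document distribution
# ("air" and "quality" here) receive identical tf-idf weights.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "air quality report for the city",
    "air quality sensors installed downtown",
    "traffic congestion downtown",
    "city budget report",
]
vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs).toarray()
terms = list(vec.get_feature_names_out())

print(tfidf[:, terms.index("air")])       # identical to the next column:
print(tfidf[:, terms.index("quality")])   # the weights cannot tell them apart
```

With a larger and more varied corpus, such exact collisions become increasingly unlikely, which is the point made above.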

Need BIG DATA SETS FOR Multiple Linear Regression computing

I need big data sets for multiple linear regression for my thesis experiments, please (up to 3 million examples).
You can find huge regression datasets in research journals; datasets of the size you mention are not readily available online. You can also create data of your own by generating imaginary variables and analyzing those ;)
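If synthetic data is acceptable, a dataset of that size can be generated directly; in the sketch below, the feature count and noise level are arbitrary choices.

```python
# Sketch: generate a synthetic multiple-linear-regression dataset with
# 3 million examples; the feature count and noise level are arbitrary.
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=3_000_000, n_features=20,
                       n_informative=15, noise=10.0, random_state=42)
print(X.shape, y.shape)   # (3000000, 20) (3000000,)
```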

Similarity calculation in a database

I have sets, each containing a varying number of elements. They are represented in a database as shown below (this is a very simplistic example).
I have two problems:
How to efficiently calculate the similarity?
How to represent the calculated similarity in a database?
Note that the current algorithm will not scale well because of the n! complexity.
Note: I can change the representation of the database as well as the algorithm used to calculate the similarity.
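Since neither the table nor the similarity measure is shown here, the sketch below is only an illustration: it assumes the sets are keyed by an id and that the similarity in question is something like Jaccard similarity, computed over the O(n²) pairs rather than all n! orderings.

```python
# Sketch only: assumes sets of element ids keyed by set id and uses
# Jaccard similarity; the real table layout and measure may differ.
from itertools import combinations

sets = {
    "A": {1, 2, 3},
    "B": {2, 3, 4},
    "C": {5},
}

def jaccard(a, b):
    """Intersection size divided by union size."""
    return len(a & b) / len(a | b)

# Pairwise similarities, which could be stored back in a table with
# columns (set_id_1, set_id_2, similarity).
similarities = {(p, q): jaccard(sets[p], sets[q])
                for p, q in combinations(sets, 2)}
print(similarities)
```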

Combination of percentiles from different data sets: how can this be accomplished?

I need to compute the Nth percentiles of a series of related, but segmented data sets.
The combined data sets are too large to compute all at once due to memory limitations, but the framework to perform piece-wise calculations is already in place. So how might I perform calculations on each data set and then combine those calculations to find the percentile that I need?
Other information about the data:
The data often has outliers.
The individual data sets tend to be roughly the same size, but not always.
The individual data sets are not expected to share the same distribution.
Could I compute the combined median, mean, and standard deviation and then estimate any percentile from there?
The median, mean and standard deviation alone are unlikely to be enough, especially if you have outliers.
If exact percentiles are required, this is a parallel computation problem. Some work has been done in this direction, such as in the parallel mode of the C++ STL library.
If only approximate percentiles are required then Cross Validated has a question -- Estimation of quantile given quantiles of subset -- that suggests a subsampling approach. You would take some (but not all) datapoints from each dataset, make a new combined dataset that is small enough to fit on a single machine and compute the percentiles of that.
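A minimal sketch of that subsampling idea, with a made-up total sample size and sampling proportional to segment size:

```python
# Sketch of the subsampling approach: draw a random sample from each
# segment (proportional to its size), pool them, take the percentile.
import numpy as np

def approximate_percentile(segments, q, total_sample=100_000, seed=0):
    rng = np.random.default_rng(seed)
    total = sum(len(s) for s in segments)
    pooled = np.concatenate([
        rng.choice(s, size=min(len(s), max(1, total_sample * len(s) // total)),
                   replace=False)
        for s in segments
    ])
    return np.percentile(pooled, q)
```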
Another approximate approach, efficient if percentiles of each segment are already available, would be to approximate the cumulative distribution function of each segment as a step function from the percentiles. Then the overall distribution would be a finite mixture of the segment distributions and the cumulative distribution function a weighted sum of the segment cumulative distribution functions. The quantile function (i.e., the percentiles) could be computed by numerically inverting the cumulative distribution function.
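A sketch of that last approach, assuming each segment contributes its size plus the values at a known grid of percentile levels (the function names and example numbers below are made up):

```python
# Sketch: combine per-segment percentiles by treating the combined data as a
# size-weighted mixture of the segment distributions, then inverting its CDF.
import numpy as np

def mixture_cdf(x, segments):
    """Size-weighted average of segment CDFs, each approximated by linear
    interpolation through its known (level, value) percentile pairs."""
    total = sum(size for size, _, _ in segments)
    return sum(size * np.interp(x, values, levels / 100.0)
               for size, levels, values in segments) / total

def combined_percentile(segments, q, tol=1e-6):
    """Invert the mixture CDF by bisection to get the q-th percentile."""
    lo = min(values[0] for _, _, values in segments)
    hi = max(values[-1] for _, _, values in segments)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, segments) < q / 100.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative use: two segments summarized by their deciles (made-up data).
levels = np.linspace(0, 100, 11)
seg_a = (1_000, levels, np.percentile(np.random.normal(0, 1, 1_000), levels))
seg_b = (2_000, levels, np.percentile(np.random.normal(5, 2, 2_000), levels))
print(combined_percentile([seg_a, seg_b], 90))
```

The accuracy depends on how finely each segment's percentiles are tabulated; more levels give a closer approximation of each step CDF.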
