Combination of percentiles from different data sets: how can this be accomplished? - database

I need to compute the Nth percentiles of a series of related, but segmented data sets.
The combined data sets are too large to compute all at once due to memory limitations, but the framework to perform piece-wise calculations is already in place. So how might I perform calculations on each data set and then combine those calculations to find the percentile that I need?
Other information about the data:
- The data often has outliers.
- The individual data sets tend to be roughly the same size, but not always.
- The individual data sets are not expected to share the same distribution.
Could I compute the combined median, mean, and standard deviation and then estimate any percentile from there?

The median, mean and standard deviation alone are unlikely to be enough, especially if you have outliers.
If exact percentiles are required, this is a parallel computation problem. Some work has been done in this direction, such as in the parallel mode of the C++ STL library.
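For instance, one general exact technique that fits a piece-wise framework (this is not the STL algorithm, just a distributed-selection idea) is to binary-search on the value and ask each segment only to count how many of its elements fall at or below the candidate. A rough Python sketch, where the function name and the nearest-rank convention are my own assumptions:

```python
import numpy as np

def exact_percentile(segments, p, tol=1e-9):
    """p-th percentile of the combined data, accurate to within tol.

    segments: list of 1-D arrays, each small enough to process on its own.
    p: percentile in [0, 100]; nearest-rank convention (an assumption).
    """
    n_total = sum(len(s) for s in segments)
    # Rank of the desired order statistic under the nearest-rank definition.
    k = max(1, int(np.ceil(p / 100.0 * n_total)))

    lo = min(float(np.min(s)) for s in segments)
    hi = max(float(np.max(s)) for s in segments)
    # Binary-search on the value; each iteration only needs per-segment counts,
    # so the full data never has to be in memory at once.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        count = sum(int(np.count_nonzero(np.asarray(s) <= mid)) for s in segments)
        if count >= k:
            hi = mid   # the k-th smallest value is <= mid
        else:
            lo = mid   # the k-th smallest value is > mid
    return hi
```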
If only approximate percentiles are required then Cross Validated has a question -- Estimation of quantile given quantiles of subset -- that suggests a subsampling approach. You would take some (but not all) datapoints from each dataset, make a new combined dataset that is small enough to fit on a single machine and compute the percentiles of that.
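A minimal sketch of that subsampling idea (the function name, total sample size, and proportional-to-size sampling are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def approx_percentiles_by_subsampling(datasets, percentiles, total_sample=100_000):
    """Approximate percentiles of the combined data from per-segment subsamples."""
    sizes = np.array([len(d) for d in datasets], dtype=float)
    fractions = sizes / sizes.sum()
    samples = []
    for data, frac in zip(datasets, fractions):
        data = np.asarray(data)
        # Sample proportionally to segment size so the pooled sample
        # reflects the combined distribution even for unequal segments.
        k = min(len(data), max(1, int(round(frac * total_sample))))
        samples.append(rng.choice(data, size=k, replace=False))
    pooled = np.concatenate(samples)
    return np.percentile(pooled, percentiles)
```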
Another approximate approach, efficient if percentiles of each segment are already available, would be to approximate the cumulative distribution function of each segment as a step function from the percentiles. Then the overall distribution would be a finite mixture of the segment distributions and the cumulative distribution function a weighted sum of the segment cumulative distribution functions. The quantile function (i.e., the percentiles) could be computed by numerically inverting the cumulative distribution function.
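A minimal sketch of that mixture approach, assuming each segment's 1st-99th percentiles are already available (the helper name, grid resolution, and the use of interpolation between the known percentile points rather than a hard step are my own choices):

```python
import numpy as np

def combined_quantile(q, segment_percentile_values, segment_sizes,
                      percentile_levels=np.arange(1, 100)):
    """Approximate the q-th quantile (0 < q < 1) of the combined data.

    segment_percentile_values: one array per segment, holding that segment's
        values at percentile_levels (assumed already computed piece-wise).
    segment_sizes: row counts per segment, used as the mixture weights.
    """
    weights = np.asarray(segment_sizes, dtype=float)
    weights /= weights.sum()
    levels = np.asarray(percentile_levels, dtype=float) / 100.0

    # Evaluate the weighted-sum CDF on a value grid covering all segments.
    lo = min(float(v[0]) for v in segment_percentile_values)
    hi = max(float(v[-1]) for v in segment_percentile_values)
    grid = np.linspace(lo, hi, 10_000)

    cdf = np.zeros_like(grid)
    for w, vals in zip(weights, segment_percentile_values):
        # Per-segment CDF rebuilt from its percentiles.
        cdf += w * np.interp(grid, np.asarray(vals, dtype=float), levels,
                             left=0.0, right=1.0)

    # Numerically invert: smallest grid value whose CDF reaches q.
    idx = int(np.searchsorted(cdf, q))
    return grid[min(idx, len(grid) - 1)]
```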

Related

Is there an algorithm to sort or filter a database table by distance to a vector, other than the naive one?

Say I have a large database table whose entries are vectors. I wish to do a search and sort by distance to a given vector. The naive way is to compute, on every query, the distance between my vector and each one in the database, and then sort by that distance.
Is there any other known algorithm for doing this, perhaps involving some type of indexing in advance?
Alternatively, are there known implementations of such algorithms, for say SQL or Elasticsearch?
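For reference, the naive approach described above looks roughly like this outside the database (a sketch only; the function name and the use of Euclidean distance are my assumptions):

```python
import numpy as np

def naive_nearest(vectors, query, k=10):
    """vectors: (n, d) array of stored vectors; query: (d,) array.

    Computes the distance from the query to every stored vector and sorts,
    which is exactly the full scan the question wants to avoid; an index
    (spatial or approximate-nearest-neighbour) would replace this scan.
    """
    dists = np.linalg.norm(vectors - query, axis=1)  # Euclidean distances
    order = np.argsort(dists)[:k]                    # k closest, in order
    return order, dists[order]
```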

Tf-idf with large or small corpus size

"An essence of using Tf-Idf method with large corpuses is, the larger size of corpuses used is, the more unique weights terms have. This is because of the increasing of documents size in corpus or documents length gives a lower probability of duplicating a weight value for two terms in corpus. That is, the weights in Tf-Idf scheme can present a fingerprint for weights. Where in low size corpus, Tf-Idf can’t make that difference since there is huge potential of finding two terms having the same weights since they share the same source documents with the same frequency in each document. This feature can be an adversary and supporter by using Tf-Idf weighting scheme in plagiarism detection field, depending on the corpus size."
This is what I have deduced from tf-idf technique .. is it true?
Are there any link or documents can prove my conclusion؟
After 4 years of waiting for an answer, I can say the answer is yes :)
This can actually be demonstrated quite simply, as in the following picture. We have 4 documents, and below them the TF and TF-IDF tables for each term.
When we have a small corpus (few documents), the probability that some terms share the same distribution is high (e.g., air and quality), and because of this their tf-idf values are identical. See the table above.
But when we have a corpus with a huge number of documents, it is much less likely that two terms share the same distribution across the whole corpus.
Note: I used this website to calculate Tf-Idf: https://remykarem.github.io/tfidf-demo/
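Here is a small self-contained sketch of the same effect (the documents are made up, and I use a basic tf-idf definition -- term frequency normalised by document length, idf = log(N/df) -- which may differ slightly from the linked calculator):

```python
import math

# Toy corpus: "air" and "quality" appear in the same documents with the same
# counts, so in this small corpus their tf-idf weights are identical.
docs = [
    "air quality in the city is poor",
    "air quality monitoring is important",
    "the city is large",
    "monitoring is important",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term):
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(n_docs / df)

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

for term in ("air", "quality"):
    print(term, [round(tfidf(term, doc), 4) for doc in tokenized])
# Both lines print the same vector: with so few documents the two terms are
# indistinguishable by weight alone. Adding documents in which the two terms
# occur differently breaks the tie.
```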

Linear Probing vs Chaining

In Algorithm Design: Foundations, Analysis, and Internet Examples by Michael T. Goodrich and Roberto Tamassia, section 2.5.5 (Collision-Handling Schemes), the last paragraph says
These open addressing schemes save some space over the separate
chaining method, but they are not necessarily faster. In experimental
and theoretical analysis, the chaining method is either competitive or
faster than the other methods, depending upon the load factor of the
methods.
But regarding speed, a previous SO answer says the exact opposite.
Linear probing wins when the load factor α = n/m is small, that is, when the number of elements is small compared to the number of slots. But exactly the reverse happens as the load factor tends to 1: the table becomes saturated, and an operation may have to travel across nearly the whole table, so the expected cost blows up (roughly proportional to 1/(1 - α)² for unsuccessful searches). Chaining, on the other hand, still degrades only linearly (expected cost about 1 + α).
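A rough simulation of both schemes (insertion-probe counts only, arbitrary table size; not a substitute for the experiments either source refers to) illustrates the trend:

```python
import random

M = 10_007  # number of slots (prime); chosen arbitrarily for the simulation

def avg_probes_linear(keys):
    """Average probes per insertion with linear probing."""
    table = [None] * M
    total = 0
    for k in keys:
        i, steps = k % M, 1
        while table[i] is not None:
            i = (i + 1) % M
            steps += 1
        table[i] = k
        total += steps
    return total / len(keys)

def avg_probes_chaining(keys):
    """Average probes per insertion with separate chaining."""
    table = [[] for _ in range(M)]
    total = 0
    for k in keys:
        bucket = table[k % M]
        total += len(bucket) + 1   # walk the chain, then append
        bucket.append(k)
    return total / len(keys)

random.seed(1)
for load in (0.25, 0.5, 0.75, 0.9, 0.99):
    n = int(load * M)
    keys = random.sample(range(10**9), n)
    print(f"load={load:<5} linear={avg_probes_linear(keys):7.2f} "
          f"chaining={avg_probes_chaining(keys):5.2f}")
```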

Similarity calculation in a database

I have sets, each containing a varying number of elements. They are represented in a database as shown below (this is a very simplified example).
I have two problems
How to efficiently calculate the similarity?
How to represent the calculated similarity in a database?
Note that the current algorithm will not scale well because of the n! complexity.
Note: I can change the representation of the database as well as the algorithm used to calculate the similarity.

example of equi-depth histograms in databases?

I am unable to understand the role equi-depth histograms play in query optimization. Can someone please point me to good resources, or explain? I have read a few research papers, but I still could not convince myself of the need for, and use of, equi-depth histograms. So, can someone please explain equi-depth histograms with an example?
Also, can we merge the buckets of the histogram so that it becomes small enough to fit in one page on disk?
Also, what are bucket boundaries in equi-depth histograms?
Caveat: I'm not an expert on database internals, so this is a general, not a specific answer.
Query compilers convert the query, usually given in SQL, to a plan for obtaining the result. Plans consist of low level "instructions" to the database engine: scan table T looking for value V in column C; use index X on table T to locate value V; etc.
Query optimization is about the compiler deciding which of a (potentially huge) set of alternative query plans have minimum cost. Costs include wall clock time, IO bandwidth, intermediate result storage space, CPU time, etc. Conceptually, the optimizer is searching the alternative plan space, evaluating the cost of each to guide the search, ultimately choosing the cheapest it can find.
The costs mentioned above depend on estimates of how many records will be read and/or written, whether the records can be located by indexes, what columns of those records will be used, and the size of the data and/or how many disk pages they occupy.
These quantities in turn often depend on the exact data values stored in the tables. Consider, for example, select * from data where pay > 100, where pay is an indexed column. If the pay column has no values over 100, then the query is extremely cheap: a single probe of the index answers it. Conversely, the result set could contain the entire table.
This is where histograms help. (Equi-depth histograms are just one way of maintaining histograms.) In the preceding query, a histogram will, in O(1) time, provide an estimate of the fraction of rows the query will produce, without knowing exactly what those rows will contain.
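As a toy illustration of that kind of estimate (the column data, bucket count, and uniform-within-bucket assumption are mine, not from any particular engine):

```python
import numpy as np

def build_equidepth(values, n_buckets=4):
    """Bucket boundaries chosen so each bucket holds ~the same number of rows:
    simply the 0%, 25%, 50%, 75%, 100% quantiles when n_buckets = 4."""
    return np.percentile(values, np.linspace(0, 100, n_buckets + 1))

def estimate_gt_selectivity(boundaries, x):
    """Estimated fraction of rows with value > x, assuming values are spread
    uniformly inside each bucket. Cost is O(number of buckets), not O(rows)."""
    n_buckets = len(boundaries) - 1
    frac = 0.0
    for lo, hi in zip(boundaries[:-1], boundaries[1:]):
        if x <= lo:
            frac += 1.0 / n_buckets                    # whole bucket qualifies
        elif x < hi:
            frac += (hi - x) / (hi - lo) / n_buckets   # part of the boundary bucket
    return frac

# Example with a skewed, made-up "pay" column.
pay = np.random.default_rng(0).exponential(scale=40.0, size=10_000)
bounds = build_equidepth(pay)
print("bucket boundaries:", np.round(bounds, 1))
print("estimated selectivity of pay > 100:",
      round(estimate_gt_selectivity(bounds, 100), 3))
print("actual fraction:", round(float(np.mean(pay > 100)), 3))
```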
In effect, the optimizer is "executing" the query on an abstraction of the data. The histogram is that abstraction. (Others are possible.) The histogram is useful for estimating costs and result sizes for query plan operations: join result size and page hits during mass insertions and deletions (which may lead to the generation of a temporary index), for example.
For a simple inner join example, suppose we know how integer-valued join columns of two tables are distributed:
Bins (25% each)

    Table A      Table B
    0-100        151-300
    101-150      301-500
    151-175      601-700
    176-300      1001-1100
It's easy to see that 50% of Table A and 25% of Table B could possibly participate in the join: only values in the range 151-300 can appear on both sides. If these are unique-valued columns, then a useful join size estimate is min(.5 * |A|, .25 * |B|). This is a very simple example; in many (most?) cases the analysis requires much more mathematical sophistication. For joins, it's usual to compute an estimated histogram of the result by "joining" the histograms of the operands. This is what makes the literature so diverse, complicated, and interesting.
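The arithmetic above can be reproduced directly from the bin boundaries alone, without touching the data (the table sizes below are hypothetical):

```python
# Bins from the example above; each holds 25% of its table's rows.
bins_a = [(0, 100), (101, 150), (151, 175), (176, 300)]
bins_b = [(151, 300), (301, 500), (601, 700), (1001, 1100)]

def overlapping_fraction(bins, other_bins):
    """Fraction of a table (counted in whole bins) whose values fall inside
    the other table's overall value range and so could find a join partner."""
    other_lo = min(lo for lo, _ in other_bins)
    other_hi = max(hi for _, hi in other_bins)
    hits = sum(1 for lo, hi in bins if hi >= other_lo and lo <= other_hi)
    return hits / len(bins)

frac_a = overlapping_fraction(bins_a, bins_b)   # 0.5  (bins 151-175 and 176-300)
frac_b = overlapping_fraction(bins_b, bins_a)   # 0.25 (bin 151-300 only)

size_a, size_b = 100_000, 80_000                # hypothetical table sizes
estimate = min(frac_a * size_a, frac_b * size_b)
print(f"join size estimate <= min({frac_a} * |A|, {frac_b} * |B|) = {estimate:,.0f} rows")
```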
PhD dissertations often have surveys that cover big bodies of technical literature like this in a concise form that isn't too difficult to read. (After all, the candidate is trying to convince a committee he/she knows how to do a literature search.) Here is one such example.
