Tf-idf with large or small corpus size

"An essence of using Tf-Idf method with large corpuses is, the larger size of corpuses used is, the more unique weights terms have. This is because of the increasing of documents size in corpus or documents length gives a lower probability of duplicating a weight value for two terms in corpus. That is, the weights in Tf-Idf scheme can present a fingerprint for weights. Where in low size corpus, Tf-Idf can’t make that difference since there is huge potential of finding two terms having the same weights since they share the same source documents with the same frequency in each document. This feature can be an adversary and supporter by using Tf-Idf weighting scheme in plagiarism detection field, depending on the corpus size."
This is what I have deduced about the tf-idf technique. Is it true?
Are there any links or documents that can support my conclusion?

After 4 years of waiting for an answer, I can say the answer is yes :)
This can actually be demonstrated with a simple example: take 4 documents and compute the TF and TF-IDF values for each term.
With a small corpus (few documents), the probability that some terms share the same distribution is high. In the example, "air" and "quality" appear in the same documents with the same counts, and because of this their tf-idf values are identical.
But with a corpus containing a huge number of documents, it is far less probable that two terms share the same distribution across the whole corpus.
Note: I used this website to calculate Tf-Idf: https://remykarem.github.io/tfidf-demo/
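For a concrete illustration of that collision effect, here is a minimal Python sketch (the four-document corpus and the use of scikit-learn's TfidfVectorizer are assumptions for illustration, not taken from the linked demo):

```python
# Hypothetical four-document corpus; in such a small corpus, terms that
# appear in the same documents with the same counts (here "air"/"quality"
# and "weather"/"today") end up with identical tf-idf weight vectors.
from sklearn.feature_extraction.text import TfidfVectorizer

small_corpus = [
    "air quality report",
    "air quality index",
    "weather report today",
    "weather index today",
]

vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(small_corpus).toarray()  # rows: docs, columns: terms
terms = vectorizer.get_feature_names_out()

# Report every pair of terms whose weight columns coincide exactly.
for i in range(len(terms)):
    for j in range(i + 1, len(terms)):
        if (weights[:, i] == weights[:, j]).all():
            print(f"'{terms[i]}' and '{terms[j]}' have identical tf-idf weights")
```

Adding more documents in which these terms occur with different frequencies breaks such ties, which is exactly the large-corpus effect described above.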

Related

Why is using lucene's DateRangeField substantially increasing index size?

Recently, I added three DateRangeFields to my Solr schema. This quadrupled the size of the index, specifically the .doc file. The fields are used heavily, in that they are present in all the documents and the ranges can be quite large, but this still seems like a surprising increase. I checked the index with Luke, and I see a large number of terms that might correspond to the size of the ranges. This seems supported by the fact that the term counts remain consistent even when I index a very small number of documents.
Is this expected behavior? Is there some way to improve the index size and still use DateRangeField?
This is using Solr 6.6.
Many thanks.

How can I distribute databases across servers as evenly as possible while minimizing the movements required?

The Problem:
I'm attempting to create a program to redistribute ~1200 databases of various sizes across a finite number of servers to create as even a distribution as possible. The databases are currently on those servers, but the distribution is uneven.
I have looked at quite a few articles and stackoverflow/stackexchange posts but can't seem to find something that solves the whole issue.
Constraints:
Databases are of drastically different sizes
The number of movements required to reach the even distribution needs to be minimized
The "weight" (how full) of each server needs to be as close as possible to the others, say within 1% of each other
What I have:
To create these even distributions I have the database identifier, the current server it lives on, the database size in GB, and the database size in a nominal unit that the company treats as the operational database size. I have clustered the databases into three groups based on that nominal operational size, notated as "Small", "Medium", and "Large". I also have a rank on each row that orders the databases from smallest to largest as a function of the two sizes available to me.
What I have looked into (with my understanding of why it wouldn't work):
Bin packing - minimizes the number of bins under a size constraint per bin. I need the algorithm to distribute evenly by size across a fixed number of bins while retaining knowledge of where each database came from, so we can count the movements required to reach the balanced distribution.
Knapsack - assumes a single bin and packs it according to the size of the items as well as their value. I need more bins than that, and I don't want to fill one bin to the maximum before moving on to the next; we need an even distribution.
K-partition problem - I don't see a way for this to count the number of movements.
Multiprocessor scheduling - we have no time dimension, and I don't see a way to track where a job is moving from, so we can't count the movements required to reach the final distribution.
What I need:
Direction on an algorithm (or an r package) that could help me solve this issue.
I have looked at:
Bin Packing: Set amount on bins, want to minimize the max bin weight
What is the proper problem name / algorithm for this problem description in computer science theory?
The greedy algorithm and implementation
Filling bins with an equal size
Spread objects evenly over multiple collections
Any direction would be much appreciated, and a link to further documentation on whatever you suggest would be lovely.
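One possible direction, sketched below purely as a heuristic rather than a vetted algorithm (the rebalance function, its inputs, and the way the 1% tolerance is handled are illustrative assumptions): treat it as iterative load balancing, repeatedly moving the largest database that reduces the spread from the most loaded server to the least loaded one, and record each move.

```python
from collections import defaultdict

def rebalance(assignment, sizes, servers=None, tolerance=0.01):
    """Greedy heuristic sketch.
    assignment: {db_id: current_server}, sizes: {db_id: size_in_gb},
    servers: optional iterable of all servers (needed if some hold no databases).
    Returns a list of (db_id, from_server, to_server) moves."""
    load = defaultdict(float)
    for server in (servers or set(assignment.values())):
        load[server] = 0.0
    for db, server in assignment.items():
        load[server] += sizes[db]
    target = sum(sizes.values()) / len(load)

    moves = []
    while True:
        heavy = max(load, key=load.get)
        light = min(load, key=load.get)
        if load[heavy] - load[light] <= tolerance * target:
            break  # servers are within the desired spread
        # Candidates: databases on the heavy server whose move strictly
        # reduces the imbalance (the light server stays below the heavy one).
        candidates = [db for db, s in assignment.items()
                      if s == heavy and load[light] + sizes[db] < load[heavy]]
        if not candidates:
            break  # no single move improves the spread further
        db = max(candidates, key=lambda d: sizes[d])
        assignment[db] = light
        load[heavy] -= sizes[db]
        load[light] += sizes[db]
        moves.append((db, heavy, light))
    return moves

# Illustrative usage with invented identifiers:
assignment = {"db1": "srvA", "db2": "srvA", "db3": "srvA", "db4": "srvB"}
sizes = {"db1": 500.0, "db2": 120.0, "db3": 80.0, "db4": 300.0}
print(rebalance(assignment, sizes))  # moves db2 and db3 from srvA to srvB
```

An exact answer would come from formulating it as an integer program (balanced partitioning with the number of movements in the objective), but a greedy pass like this at least yields both an assignment and the list of movements needed to get there.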

Index performance for large # documents in Lucene

I have been using PostgreSQL for full-text search, matching a list of articles against documents containing a particular word. Its built-in full-text search support made queries faster, but searches still became slower over time as the number of articles grew.
I am just starting to implement search with Solr. Going through various resources on the net, I came across the fact that it can do much more than searching and gives me finer control over my results.
Solr seems to use an inverted index. Wouldn't the performance degrade over time if many documents (over 1 million) contain the search term being queried by the user? Also, if I am limiting the results via pagination for the searched term, wouldn't it need to load all of the 1 million+ documents first while calculating their scores and only then limit the results, which would hurt performance when many documents share the same word?
Is there a way to sort the index by the score itself in the first place which would avoid loading of the documents later?
Lucene has been designed to solve all the problems you mention. Apart from the inverted index, there are also postings lists, doc values, the separation of indexed and stored values, and so on.
And then you have Solr on top of that to add even more goodies.
And 1 million documents is an introductory-level problem for Lucene/Solr; it is routinely tested by indexing a Wikipedia dump.
If you feel you actually need to understand how it works, rather than just be reassured about this, check books on Lucene, including the old ones. Also check Lucene Javadocs - they often have additional information.
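To make the pagination point concrete, here is a hedged sketch using the pysolr client (the Solr URL, core name "articles", and field "body" are assumptions for illustration): Solr scores the matches against the index and returns only the requested slice, so the client never pulls the full million-document result set.

```python
import pysolr

# Illustrative only: the core name and field are assumptions.
solr = pysolr.Solr("http://localhost:8983/solr/articles", timeout=10)

# Ask for page 3 of results, 10 per page, sorted by relevance score.
# Solr computes scores server-side and ships back only this slice.
results = solr.search("body:search_term", start=20, rows=10, sort="score desc")

print("total matches:", results.hits)
for doc in results:
    print(doc.get("id"))
```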

example of equi-depth histograms in databases?

I am unable to understand the role equi-depth histograms play in query optimization. Can someone please give me some pointers to good resources, or explain it here? I have read a few research papers, but I still could not convince myself of the need for and use of equi-depth histograms. So, can someone please explain equi-depth histograms with an example?
Also, can we merge the buckets of a histogram so that it becomes small enough to fit in one page on disk?
And what are bucket boundaries in equi-depth histograms?
Caveat: I'm not an expert on database internals, so this is a general, not a specific answer.
Query compilers convert the query, usually given in SQL, to a plan for obtaining the result. Plans consist of low level "instructions" to the database engine: scan table T looking for value V in column C; use index X on table T to locate value V; etc.
Query optimization is about the compiler deciding which of a (potentially huge) set of alternative query plans has the minimum cost. Costs include wall clock time, IO bandwidth, intermediate result storage space, CPU time, etc. Conceptually, the optimizer searches the space of alternative plans, evaluating the cost of each to guide the search, and ultimately chooses the cheapest it can find.
The costs mentioned above depend on estimates of how many records will be read and/or written, whether the records can be located by indexes, what columns of those records will be used, and the size of the data and/or how many disk pages they occupy.
These quantities in turn often depend on the exact data values stored in the tables. Consider for example select * from data where pay > 100 where pay is an indexed column. If the pay column has no values over 100, then the query is extremely cheap. A single probe of the index answers it. Conversely the result set could contain the entire table.
This is where histograms help. (Equi-depth histograms are just one way of maintaining histograms.) For the preceding query, a histogram provides, in O(1) time, an estimate of the fraction of rows the query will produce, without knowing exactly what those rows will contain.
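As a small illustration of that O(1) estimate (the bucket boundaries and table size below are invented): in an equi-depth histogram every bucket holds the same number of rows, so the estimated selectivity of pay > 100 falls out of counting the buckets whose range lies above 100.

```python
import bisect

# Hypothetical equi-depth histogram on the "pay" column: each bucket holds
# the same number of rows, and only the bucket upper boundaries are stored.
bucket_upper_bounds = [40, 75, 100, 160, 900]   # 5 buckets, 20% of rows each
rows_in_table = 1_000_000

def estimated_rows_greater_than(value):
    # Buckets whose upper bound is above `value` may contain qualifying rows.
    # (A real optimizer would also interpolate inside the straddling bucket.)
    first_qualifying = bisect.bisect_right(bucket_upper_bounds, value)
    fraction = (len(bucket_upper_bounds) - first_qualifying) / len(bucket_upper_bounds)
    return fraction * rows_in_table

print(estimated_rows_greater_than(100))  # -> 400000.0, an estimated 40% of the rows
```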
In effect, the optimizer is "executing" the query on an abstraction of the data. The histogram is that abstraction. (Others are possible.) The histogram is useful for estimating costs and result sizes for query plan operations: join result size and page hits during mass insertions and deletions (which may lead to the generation of a temporary index), for example.
For a simple inner join example, suppose we know how integer-valued join columns of two tables are distributed:
Bins (25% each):
    Table A     Table B
    0-100       151-300
    101-150     301-500
    151-175     601-700
    176-300     1001-1100
It's easy to see that 50% of Table A and 25% of Table B reflect the possible participation. If these are unique-valued columns, then a useful join size estimate is max(.5 * |A|, .25 * |B|). This is a very simple example. In many (most?) cases, the analysis requires much more mathematical sophistication. For joins, it's usual to compute an estimated histogram of the results by "joining" the histograms of the operands. This is what makes the literature so diverse, complicated, and interesting.
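The 50%/25% participation figures can be computed mechanically from the bins. Here is a short sketch (the table sizes at the end are invented, and a real optimizer would join the histograms bin by bin rather than using each table's overall range):

```python
# Equi-depth bins from the example above (each bin holds 25% of its table's rows).
table_a_bins = [(0, 100), (101, 150), (151, 175), (176, 300)]
table_b_bins = [(151, 300), (301, 500), (601, 700), (1001, 1100)]

def participating_fraction(bins, other_bins):
    """Fraction of a table's rows whose bin overlaps the other table's value range."""
    other_lo = min(lo for lo, _ in other_bins)
    other_hi = max(hi for _, hi in other_bins)
    overlapping = [b for b in bins if b[1] >= other_lo and b[0] <= other_hi]
    return len(overlapping) / len(bins)

frac_a = participating_fraction(table_a_bins, table_b_bins)  # 0.5
frac_b = participating_fraction(table_b_bins, table_a_bins)  # 0.25

# With unique-valued join columns, a rough join-size estimate is:
rows_a, rows_b = 10_000, 8_000          # invented table sizes
estimate = max(frac_a * rows_a, frac_b * rows_b)
print(frac_a, frac_b, estimate)
```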
PhD dissertations often have surveys that cover big bodies of technical literature like this in a concise form that isn't too difficult to read. (After all, the candidate is trying to convince a committee he/she knows how to do a literature search.) Here is one such example.

Combination of percentiles from different data sets: how can this be accomplished?

I need to compute the Nth percentiles of a series of related, but segmented data sets.
The combined data sets are too large to compute all at once due to memory limitations, but the framework to perform piece-wise calculations is already in place. So how might I perform calculations on each data set and then combine those calculations to find the percentile that I need?
Other information about the data:
The data often has outliers.
The individual data sets tend to be roughly the same size, but not always
The individual data sets are not expected to share the same distribution
Could I compute the combined median, means, and standard deviations and then estimate any percentile from there?
The median, mean and standard deviation alone are unlikely to be enough, especially if you have outliers.
If exact percentiles are required, this is a parallel computation problem. Some work has been done in this direction, such as in the parallel mode of the C++ STL library.
If only approximate percentiles are required then Cross Validated has a question -- Estimation of quantile given quantiles of subset -- that suggests a subsampling approach. You would take some (but not all) datapoints from each dataset, make a new combined dataset that is small enough to fit on a single machine and compute the percentiles of that.
Another approximate approach, efficient if percentiles of each segment are already available, would be to approximate the cumulative distribution function of each segment as a step function from the percentiles. Then the overall distribution would be a finite mixture of the segment distributions and the cumulative distribution function a weighted sum of the segment cumulative distribution functions. The quantile function (i.e., the percentiles) could be computed by numerically inverting the cumulative distribution function.
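A minimal sketch of that last approach (the function name, the 101-percentile summary format, and the linear interpolation between stored percentiles are all assumptions): approximate each segment's CDF from its stored percentiles, form the size-weighted mixture CDF, and invert it numerically with bisection.

```python
import numpy as np

def mixture_quantile(segments, q):
    """segments: list of (n_rows, percentile_values), where percentile_values
    are the segment's 0th..100th percentiles (101 numbers, non-decreasing).
    Returns an approximation of the q-th percentile (0 <= q <= 100) of the
    combined data, by inverting the size-weighted mixture CDF."""
    total = sum(n for n, _ in segments)
    levels = np.linspace(0, 1, 101)          # CDF values at the stored percentiles

    def mixture_cdf(x):
        # Weighted sum of per-segment CDFs, each approximated by linear
        # interpolation between its stored percentile values.
        return sum(n / total * np.interp(x, pcts, levels) for n, pcts in segments)

    lo = min(p[0] for _, p in segments)
    hi = max(p[-1] for _, p in segments)
    target = q / 100.0
    for _ in range(60):                      # bisection on the monotone CDF
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example with invented data: two segments summarized only by their percentiles.
rng = np.random.default_rng(0)
a = rng.normal(0, 1, 50_000)
b = rng.exponential(2.0, 30_000)
segments = [(a.size, np.percentile(a, range(101))),
            (b.size, np.percentile(b, range(101)))]
print(mixture_quantile(segments, 90))               # approximate combined P90
print(np.percentile(np.concatenate([a, b]), 90))    # exact value, for comparison
```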
