I am taking an Information Retrieval course, where we have started with "Boolean retrieval".
I have come across the following question (taken from the Stanford book on Information retrieval):
For a conjunctive query, is processing postings lists in order of size
guaranteed to be optimal? Explain why / why not.
The explanation given is as follows:
The order is not guaranteed to be optimal. Consider three terms with
postings list sizes s1=100, s2=105 and s3=110. Suppose the
intersection of s1 and s2 has length 100 and the intersection of s1
and s3 length 0. The ordering s1, s2, s3 requires 100+105+100+110=315
steps through the postings lists. The ordering s1, s3, s2 requires
100+110+0+0=210 steps through the postings lists.
Could anyone please explain the above?
For instance: In "100+105+100+110"; what does 100 stand for? Is it the size of s1 or the intersection between s1 and s2? (105 and 110 are fairly obvious).
According to the question, you should process the postings lists in ascending order of size. With s1 = 100, s2 = 105 and s3 = 110, you process s1 and s2 first. Say r1 = s1 AND s2; you then process r1 and s3.
From the intersection algorithm you can estimate the cost of each order.
r1 = s1 AND s2 has length 100. The first merge costs O(s1+s2) = 100 + 105 = 205, and merging r1 with s3 costs O(r1+s3) = 100 + 110 = 210, so the total is 205 + 210 = 415. In "100+105+100+110" the first 100 is the size of s1 and the second 100 is the size of r1. (The 315 in the quoted solution looks like an arithmetic slip, since that sum is 415.)
But we already know that s1 AND s3 is empty, so we should process s1 and s3 first, at a cost of O(s1+s3) = 100 + 110 = 210. Call the result r2. Finally, process r2 and s2: O(r2+s2) = 0, since r2 has length 0 and the merge stops immediately. The total is 210 + 0 = 210, which is smaller than 415.
The key idea is that we don't know the intermediate results (r1 and r2 here) in advance. So processing postings lists in order of size is not guaranteed to be optimal.
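The two totals can be checked with a small Python sketch. It assumes the same accounting as the answer: a merge is charged len(a) + len(b) steps, or 0 when either list is empty. The list contents and the function names are made up for illustration; only the sizes and overlaps come from the example.

```python
def intersect(a, b):
    """Intersect two postings lists; return (result, charged_steps)."""
    if not a or not b:
        return [], 0                      # nothing to walk through
    result = sorted(set(a) & set(b))
    return result, len(a) + len(b)        # worst-case merge cost

def total_cost(lists):
    """Total charged steps when intersecting the lists in the given order."""
    result, cost = lists[0], 0
    for nxt in lists[1:]:
        result, steps = intersect(result, nxt)
        cost += steps
    return cost

s1 = list(range(100))          # 100 postings
s2 = list(range(105))          # 105 postings, contains all of s1
s3 = list(range(1000, 1110))   # 110 postings, disjoint from s1

print(total_cost([s1, s2, s3]))  # 415
print(total_cost([s1, s3, s2]))  # 210
```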
How do sites like this store tens of thousands of words "containing c", or like this, "words with d and c", or even further, "unscrambling" the word like CAUDK and finding that the database has duck. Curious from an algorithms/efficiency perspective how they would accomplish this:
Would a database be used, or would the words simply be stored in memory and quickly traversed? If a database was used (and each word was a record), how would you make these sorts of queries (with PostgreSQL for example, contains, starts_with, ends_with, and unscrambles)?
I guess the easiest thing to do would be to store all words in memory (sorted?), and just traverse the whole million or less word list to find the matches? But how about the unscramble one?
Basically wondering the efficient way this would be done.
"Containing C" amounts to count(C) > 0. Unscrambling CAUDC amounts to count(C) <= 2 && count(A) <= 1 && count(U) <= 1 && count(D) <= 1, with a count of 0 for every other letter. So both queries can be answered efficiently by a database with 26 indexed columns, one holding the count of each letter of the alphabet.
Here is a quick and dirty python sqlite3 demo:
from collections import defaultdict, Counter
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

alphabet = [chr(ord('A') + i) for i in range(26)]
alphabet_set = set(alphabet)

# One row per word, with one indexed count column per letter.
columns = ['word TEXT'] + [f'{c}_count TINYINT DEFAULT 0' for c in alphabet]
create_cmd = f'CREATE TABLE abc ({", ".join(columns)})'
cur.execute(create_cmd)
for c in alphabet:
    cur.execute(f'CREATE INDEX {c}_index ON abc ({c}_count)')

def insert(word):
    counts = Counter(word)
    columns = ['word'] + [f'{c}_count' for c in counts.keys()]
    placeholders = ', '.join('?' for _ in columns)
    insert_cmd = f'INSERT INTO abc ({", ".join(columns)}) VALUES ({placeholders})'
    # Bind the values as parameters instead of pasting them into the SQL.
    cur.execute(insert_cmd, [word] + list(counts.values()))

def unscramble(text):
    counts = {a: 0 for a in alphabet}
    for c in text:
        counts[c] += 1
    # A word matches if it uses no letter more often than the text offers.
    where_clauses = [f'{c}_count <= {n}' for (c, n) in counts.items()]
    select_cmd = f'SELECT word FROM abc WHERE {" AND ".join(where_clauses)}'
    cur.execute(select_cmd)
    return sorted(tup[0] for tup in cur.fetchall())

print('Building sqlite table...')
with open('/usr/share/dict/words') as f:
    word_set = set(line.strip().upper() for line in f)
for word in word_set:
    # Skip blank lines and words with characters outside A-Z.
    if word and all(c in alphabet_set for c in word):
        insert(word)
print('Table built!')

# Group the matches by length for display.
d = defaultdict(list)
for word in unscramble('CAUDK'):
    d[len(word)].append(word)
print("unscramble('CAUDK'):")
for n in sorted(d):
    print(' '.join(d[n]))
Output:
Building sqlite table...
Table built!
unscramble('CAUDK'):
A C D K U
AC AD AK AU CA CD CU DA DC KC UK
AUK CAD CUD
DUCK
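For comparison, the same letter-count test can be run over a plain in-memory word list, which is the other option the question raises. This is only a sketch: the `sample` list is a hypothetical stand-in for a real dictionary, and it scans every word rather than using an index.

```python
from collections import Counter

def unscramble(letters, words):
    """Return the words that can be spelled using only the given letters."""
    available = Counter(letters.upper())
    # Counter subtraction drops non-positive counts, so an empty result
    # means the word needs no letter more often than it is available.
    return sorted(w for w in words if not (Counter(w.upper()) - available))

# Hypothetical sample list standing in for /usr/share/dict/words.
sample = ["DUCK", "CAD", "AUK", "ADD", "DUCKS"]
print(unscramble("CAUDK", sample))  # ['AUK', 'CAD', 'DUCK']
```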
I don't know for sure what they're doing, but I suggest this algorithm for contains and unscramble (I think it can be trivially extended to starts-with or ends-with):
1. The user submits a set of letters as a string, say bdsfa.
2. The algorithm sorts the string from (1), so the query becomes abdfs.
3. To find all words containing those letters, the algorithm simply opens the directory database/a/b/d/f/s/ and lists the words stored there. If that directory is empty, it goes one level up to database/a/b/d/f/ and shows the results there.
So now the question is how to index a database of millions of words as needed in step (3). The database/ directory will contain 26 directories, a to z, each of which contains 26-1 directories for every letter except its parent's. E.g.:
database/a/{b,c,...,z}
database/b/{a,c,...,z}
...
database/z/{a,b,...,y}
This tree is at most 26 levels deep, and no branch has more than 26 children, so browsing the directory structure scales.
Words will be stored in the leaves of this tree. So, the word apple will be stored in database/a/e/l/p/leaf_apple. In that place, you will also find other words such as leap. More specifically:
database/
  a/
    e/
      l/
        p/
          leaf_apple
          leaf_leap
          leaf_peal
          ...
This way, you can efficiently reach the subset of target words in O(log n) time, where n is the total number of words in your database.
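The directory layout can be mimicked in memory to see how the lookup behaves. The sketch below is an assumption-laden stand-in, not a file system implementation: a dictionary keyed by path strings plays the role of the directories, and `path_for`, `add_word` and `lookup` are hypothetical names.

```python
from collections import defaultdict

index = defaultdict(list)   # path string -> words stored in that "directory"

def path_for(letters):
    """Sorted distinct letters joined like a directory path, e.g. 'a/e/l/p'."""
    return "/".join(sorted(set(letters.lower())))

def add_word(word):
    index[path_for(word)].append(word)

def lookup(letters):
    """Return words at the query's path, walking up one level while empty."""
    parts = sorted(set(letters.lower()))
    while parts:
        words = index.get("/".join(parts))
        if words:
            return sorted(words)
        parts.pop()          # go one level up, as in step (3)
    return []

for w in ["apple", "leap", "peal", "duck"]:
    add_word(w)

print(lookup("pale"))    # ['apple', 'leap', 'peal']
print(lookup("kcud"))    # ['duck']
print(lookup("aelpz"))   # ['apple', 'leap', 'peal']
```

Note that apple lands next to leap and peal because the path uses distinct letters only, so a stricter unscramble would still need to compare letter counts within the leaf directory.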
You can optimise this further by adding extra indices. For example, there are too many words containing a, and the website won't display them all (at least not on the first page). Instead, it may say there are 500,000 words containing 'a' in total and show 100 examples. To obtain the 500,000 figure efficiently, the number of children at every level can be recorded during indexing. E.g.:
database/
  {...}/
    {num_children,...}/
      {num_children,...}/
        ...
Here, num_children is just a leaf node, like leaf_WORD. All leaves are files.
Depending on the website's load, it may not need to load this database into memory at all. It can leave it to the operating system to decide which parts of the file system to cache in memory as a read-time optimisation.
Personally, as a criticism of applications in general, I think developers reach for RAM too quickly, even when a simple file system trick can do the job with no noticeable difference to the end user.
Say I have a list of 3 symbols :
l:`s1`s2`s3
What is the q-way to generate the following list of n*(n+1)/2 permutations?
(`s1;`s1),(`s1;`s2),(`s1;`s3),(`s2;`s2),(`s2;`s3),(`s3;`s3)
This can be seen in the context of a correlation matrix, where I want the whole upper triangular part of the matrix, including the diagonal.
Of course the size of my initial list will exceed 3, so I would like a generic function to perform this operation.
I know how to generate the diagonal elements:
q) {(x,y)}'[l;l]
(`s1`s1;`s2`s2;`s3`s3)
But I don't know how to generate the non-diagonal elements.
Another solution you might find useful:
q)l
`s1`s2`s3
q){raze x,/:'-1_{1_x}\[x]}l
s1 s1
s1 s2
s1 s3
s2 s2
s2 s3
s3 s3
This uses the scan accumulator to create a list of lists of symbols, with each dropping the first element:
q)-1_{1_x}\[l]
`s1`s2`s3
`s2`s3
,`s3
The extra -1_ is needed since the scan will also return an empty list at the end. Then join each element of the list onto this result using an each-right and an each:
{x,/:'-1_{1_x}\[x]}l
(`s1`s1;`s1`s2;`s1`s3)
(`s2`s2;`s2`s3)
,`s3`s3
Finally use a raze to get the distinct permutations.
EDIT: could also use
q){raze x,/:'til[count x]_\:x}l
s1 s1
s1 s2
s1 s3
s2 s2
s2 s3
s3 s3
which doesn't need the scan at all and performs very similarly to the scan solution!
I would try the code below:
{distinct asc each x cross x}`s1`s2`s3
How it works:
cross generates all (s_i, s_j) pairs
asc each sorts every pair, so `s3`s1 becomes `s1`s3
distinct removes duplicates
Not the most efficient way, but a very short one.
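For readers less familiar with q, the same cross / sort / distinct idea can be sketched in Python. This is an illustration of the logic only, not a q solution; the variable names are made up, and `itertools.combinations_with_replacement` is shown as the direct way to get the same n*(n+1)/2 pairs.

```python
from itertools import combinations_with_replacement, product

symbols = ["s1", "s2", "s3"]

crossed = product(symbols, repeat=2)             # like: x cross x
ordered = [tuple(sorted(p)) for p in crossed]    # like: asc each
pairs = list(dict.fromkeys(ordered))             # like: distinct (keeps order)

for pair in pairs:
    print(pair)

# The standard library produces the same pairs without building all n*n first.
direct = list(combinations_with_replacement(symbols, 2))
```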
If I am understanding the question correctly (apologies if I have missed something), the code below should give you what you are looking for:
q)test:`s1`s2`s3`s4`s5
q)(til cnt) _' raze (-1+cnt:count test)cut test,'/:test
(`s1`s1;`s2`s1;`s3`s1;`s4`s1;`s5`s1)
(`s2`s2;`s3`s2;`s4`s2;`s5`s2)
(`s3`s3;`s4`s3;`s5`s3)
(`s4`s4;`s5`s4)
,`s5`s5
I'm trying to understand some examples from a textbook regarding join trees, their cardinality, selectivity, and cost.
The cost function is given as follows:
The statistics for the example are
R1 = 10
R2 = 100
R3 = 1000
f_(1,2) = 0.1
f_(2,3) = 0.2
What trips me up is that they then say: assume f_ij=1 for all other combinations.
What does this say about other combinations? Does this mean that joining R_2 and R_3 won't produce any results because they don't share any attributes? If they don't share any attributes, wouldn't that make the result an empty set?
I appreciate the help!
I am trying to analyze 2 billion rows (of text files in HDFS). Each file's lines contain an array of sorted integers:
[1,2,3,4]
The integer values range from 0 to 100,000. Within each array I want to generate all possible pairs of integers (one-way only, so given 1,2 the pair 2,1 is not necessary), then reduce and sum the counts of those pairs across all rows. For example:
File:
[1,2,3,4]
[2,3,4]
Final Output:
(1,2) - 1
(1,3) - 1
(1,4) - 1
(2,3) - 2
(2,4) - 2
(3,4) - 2
The approach I have tried uses Apache Spark to run a simple job that parallelizes the processing and reducing of blocks of data. However, memory can't hold a hash of ((100,000)^2)/2 possible keys, so I have had to resort to a traditional map-reduce flow: map, sort, shuffle, reduce locally, then sort, shuffle, reduce globally. I know that generating the combinations is a double for loop, so O(n^2) per array, but what is the most efficient way to do this programmatically so that I write as little as possible to disk? I am trying to finish this task in under 2 hours on a cluster of 100 nodes (64 GB RAM / 2 cores each). I would also welcome any recommended technologies or frameworks. Below is what I have been using in Apache Spark and Pydoop. I tried more memory-optimized hash maps, but they still used too much memory.
import collection.mutable.HashMap
import collection.mutable.ListBuffer

def getArray(line: String): List[Int] = {
  var a = line.split("\\x01")(1).split("\\x02")
  var ids = new ListBuffer[Int]
  for (x <- 0 to a.length - 1) {
    ids += Integer.parseInt(a(x).split("\\x03")(0))
  }
  return ids.toList
}

var textFile = sc.textFile("hdfs://data/")
val counts = textFile.mapPartitions(lines => {
  val hashmap = new HashMap[(Int, Int), Int]()
  lines.foreach(line => {
    val array = getArray(line)
    for ((x, i) <- array.view.zipWithIndex) {
      for (j <- (i + 1) to array.length - 1) {
        hashmap((x, array(j))) = hashmap.getOrElse((x, array(j)), 0) + 1
      }
    }
  })
  hashmap.toIterator
}).reduceByKey(_ + _)
Also Tried PyDoop:
def mapper(_, text, writer):
    columns = text.split("\x01")
    slices = columns[1].split("\x02")
    slice_array = []
    for slice_obj in slices:
        slice_id = slice_obj.split("\x03")[0]
        slice_array.append(int(slice_id))
    # Emit each unordered pair (x, y) once, with y after x in the row.
    for (i, x) in enumerate(slice_array):
        for j in range(i + 1, len(slice_array)):
            writer.emit((x, slice_array[j]), 1)

def reducer(key, vals, writer):
    writer.emit(key, sum(map(int, vals)))

def combiner(key, vals, writer):
    writer.count('combiner calls', 1)
    reducer(key, vals, writer)
I think your problem can be reduced to word count where the corpus contains at most 5 billion distinct words.
In both of your code examples, you're trying to pre-count all of the items appearing in each partition and sum the per-partition counts during the reduce phase.
Consider the worst-case memory requirements for this, which occur when every partition contains all of the 5 billion keys. The hashtable requires at least 8 bytes to represent each key (as two 32-bit integers) and 8 bytes for the count if we represent it as a 64-bit integer. Ignoring the additional overheads of Java/Scala hashtables (which aren't insignificant), you may need at least 74 gigabytes of RAM to hold the map-side hashtable:
num_keys = 100000**2 / 2
bytes_per_key = 4 + 4 + 8
bytes_per_gigabyte = 1024 **3
hashtable_size_gb = (num_keys * bytes_per_key) / (1.0 * bytes_per_gigabyte)
The problem here is that the keyspace at any particular mapper is huge. Things are better at the reducers, though: assuming a good hash partitioning, each reducer processes an even share of the keyspace, so the reducers only require roughly (74 gigabytes / 100 machines) ~= 740 MB per machine to hold their hashtables.
Performing a full shuffle of the dataset with no pre-aggregation is probably a bad idea, since the 2 billion row dataset probably becomes much bigger once you expand it into pairs.
I'd explore partial pre-aggregation, where you pick a fixed size for your map-side hashtable and spill records to reducers once the hashtable becomes full. You can employ different policies, such as LRU or randomized eviction, to pick elements to evict from the hashtable. The best technique might depend on the distribution of keys in your dataset (if the distribution exhibits significant skew, you may see larger benefits from partial pre-aggregation).
This gives you the benefit of reducing the amount of data transfer for frequent keys while using a fixed amount of memory.
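A minimal single-process sketch of partial pre-aggregation with LRU eviction is shown below, run on the two-row example from the question. It assumes nothing about Spark or Pydoop: `partial_aggregate`, `reduce_counts` and the `capacity` parameter are hypothetical names, and the "spill" is simply a yielded record that a real job would send to a reducer.

```python
from collections import Counter, OrderedDict
from itertools import combinations

def pairs_from_rows(rows):
    """Emit every unordered pair (x, y), x before y, from each sorted row."""
    for row in rows:
        yield from combinations(row, 2)

def partial_aggregate(keys, capacity):
    """Pre-aggregate counts in a table holding at most `capacity` keys.

    When the table is full, the least recently used entry is spilled
    (yielded) as a partial count instead of growing the table.
    """
    table = OrderedDict()
    for key in keys:
        if key in table:
            table[key] += 1
            table.move_to_end(key)                # mark as most recently used
        else:
            if len(table) >= capacity:
                yield table.popitem(last=False)   # spill the LRU entry
            table[key] = 1
    yield from table.items()                      # flush what is left

def reduce_counts(partials):
    """Reducer side: sum the partial counts for each key."""
    totals = Counter()
    for key, count in partials:
        totals[key] += count
    return totals

rows = [[1, 2, 3, 4], [2, 3, 4]]
totals = reduce_counts(partial_aggregate(pairs_from_rows(rows), capacity=2))
for pair in sorted(totals):
    print(pair, "-", totals[pair])
```

With a capacity this small almost every record is spilled, so the benefit only shows up when frequent keys stay resident long enough to absorb repeated hits.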
You could also consider using a disk-backed hashtable that can spill blocks to disk in order to limit its memory requirements.
In simple terms what are pts and dts values?
Why are they important while transcoding [decode-encode] videos ?
What does this code bit do in ffmpeg.c , what is its purpose?
ist->next_pts = ist->pts = picture.best_effort_timestamp;
if (ist->st->codec->time_base.num != 0) {
    int ticks = ist->st->parser ? ist->st->parser->repeat_pict + 1 : ist->st->codec->ticks_per_frame;
    ist->next_pts += ((int64_t)AV_TIME_BASE *
                      ist->st->codec->time_base.num * ticks) /
                      ist->st->codec->time_base.den;
}
Those are the decoding time stamp (DTS) and presentation time stamp (PTS). You can find an explanation here inside a tutorial.
So let's say we had a movie, and the frames were displayed like: I B B P. Now, we need to know the information in P before we can display either B frame. Because of this, the frames might be stored like this: I P B B. This is why we have a decoding timestamp and a presentation timestamp on each frame. The decoding timestamp tells us when we need to decode something, and the presentation time stamp tells us when we need to display something. So, in this case, our stream might look like this:
   PTS: 1 4 2 3
   DTS: 1 2 3 4
Stream: I P B B
Generally the PTS and DTS will only differ when the stream we are playing has B frames in it.
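The relationship between the two timestamps can be sketched in a few lines of Python. The `Frame` tuple is a hypothetical stand-in for a packet, using only the timestamps from the table above: sorting by DTS gives the order frames are decoded in, and sorting by PTS gives the order they are shown in.

```python
from collections import namedtuple

Frame = namedtuple("Frame", "kind pts dts")

# Frames in stored (stream) order, with the timestamps from the table above.
stream = [Frame("I", 1, 1), Frame("P", 4, 2), Frame("B", 2, 3), Frame("B", 3, 4)]

decode_order = [f.kind for f in sorted(stream, key=lambda f: f.dts)]
display_order = [f.kind for f in sorted(stream, key=lambda f: f.pts)]

print(decode_order)   # ['I', 'P', 'B', 'B']
print(display_order)  # ['I', 'B', 'B', 'P']
```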
B frames are predicted from I and P frames. B frames usually have more errors compared to I and P and hence are not recommended for prediction, though they might be closer in time. There are algorithms in which B is used for prediction but it is from a past B frame and not future B frames.
So in a sequence of I P B1 B2, Decode order is I P B1 B2 and Display order is I B1 B2 P. P is predicted from I, B1 from both I and P, B2 again from I and P.