I have a table which can contain up to billions of rows:
CREATE TABLE "Log4DataUsb" (
    "Time" integer primary key not null,
    "Microseconds" integer,
    "Current" integer,
    "Voltage" integer
)
Usually a user will want to query the data within a specific range, for example Time <= 123456789 AND Time >= 0. Because this may return billions of rows, I want to segment the results and only return a batch at a time, like LIMIT 10000, then LIMIT 10000 OFFSET X, and so on until it reaches the end of the time-range query.
I notice that as the number of rows goes up this query becomes quite slow; executing the queries below takes seconds even though I just want to move to the next batch.
SELECT * FROM Log4DataUsb WHERE Time <= 123456789 AND Time >= 0 LIMIT 10000
SELECT * FROM Log4DataUsb WHERE Time <= 123456789 AND Time >= 0 LIMIT 10000 OFFSET 10000
If the database ends up holding 2 billion rows in total, is there any way to largely increase the query performance?
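One well-known alternative to OFFSET-based paging is keyset pagination (the "seek" method): since Time is the INTEGER PRIMARY KEY, the next batch can start right after the last Time value already returned, so the skipped rows never have to be scanned and discarded. A minimal sketch using the sqlite3 C API; the file name and the batch size of 10000 are illustrative assumptions, not from the question:

```c
#include <stdio.h>
#include <sqlite3.h>

int main(void) {
    sqlite3 *db;
    if (sqlite3_open("log.db", &db) != SQLITE_OK) return 1;

    /* Keyset pagination: resume from the last Time value seen, never use OFFSET. */
    const char *sql =
        "SELECT * FROM Log4DataUsb "
        "WHERE Time > ?1 AND Time <= ?2 ORDER BY Time LIMIT 10000";
    sqlite3_stmt *stmt;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) != SQLITE_OK) return 1;

    sqlite3_int64 lastTime = -1;                 /* just below the range start of 0 */
    const sqlite3_int64 rangeEnd = 123456789;
    int rows;
    do {
        rows = 0;
        sqlite3_bind_int64(stmt, 1, lastTime);
        sqlite3_bind_int64(stmt, 2, rangeEnd);
        while (sqlite3_step(stmt) == SQLITE_ROW) {
            lastTime = sqlite3_column_int64(stmt, 0);   /* Time is the first column */
            /* ... process the other columns of this row here ... */
            rows++;
        }
        sqlite3_reset(stmt);                     /* reuse the prepared statement for the next batch */
    } while (rows == 10000);                     /* a short batch means the range is exhausted */

    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}
```

Because the statement is prepared once and Time is the rowid, each iteration costs only an index seek plus the 10,000 rows it actually returns, regardless of how deep into the range it is.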
I came across the following programming interview problem:
Challenge 1: N-grams
An N-gram is a sequence of N consecutive characters from a given word. For the word "pilot" there are three 3-grams: "pil", "ilo" and "lot".
For a given set of words and an n-gram length, your task is to:
• write a function that finds the n-gram that is the most frequent one among all the words
• print the result to the standard output (stdout)
• if there are multiple n-grams having the same maximum frequency please print the one that is the smallest lexicographically (the first one according to the dictionary sorting order)
Note that your function will receive the following arguments:
• text
○ which is a string containing words separated by whitespaces
• ngramLength
○ which is an integer value giving the length of the n-gram
Data constraints
• the length of the text string will not exceed 250,000 characters
• all words are alphanumeric (they contain only English letters a-z, A-Z and numbers 0-9)
Efficiency constraints
• your function is expected to print the result in less than 2 seconds
Example
Input
text: "aaaab a0a baaab c"
ngramLength: 3
Output
aaa
Explanation
For the input presented above the 3-grams sorted by frequency are:
• "aaa" with a frequency of 3
• "aab" with a frequency of 2
• "a0a" with a frequency of 1
• "baa" with a frequency of 1
If I have only one hour to solve the problem and I choose to use the C language: is it a good idea to implement a hash table to count the frequency of the n-grams in that amount of time? After all, the C standard library has no hash table implementation...
If so, I was thinking of implementing a hash table using separate chaining with ordered linked lists, but writing that kind of implementation eats into the time you have to solve the problem....
Is that the fastest option possible?
Thank you!!!
If implementation efficiency is what matters and you are using C, I would initialize an array of pointers to the starts of n-grams in the string, use qsort to sort the pointers according to the n-gram that they are part of, and then loop over that sorted array and figure out counts.
This should execute fast enough, and there is no need to code any fancy data structures.
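A minimal sketch of that approach, using the OP's sample input and n = 3; the variable and function names here are illustrative, not part of the original answer:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

static int NGRAM_LEN = 3;   /* n-gram length; 3 for the example input */

/* Compare two n-grams given pointers to their starting characters. */
static int cmp_ngram(const void *a, const void *b) {
    const char *x = *(const char *const *)a;
    const char *y = *(const char *const *)b;
    return strncmp(x, y, NGRAM_LEN);
}

int main(void) {
    const char *text = "aaaab a0a baaab c";      /* sample input */
    size_t len = strlen(text);

    /* Collect a pointer to the start of every n-gram that lies entirely inside a word. */
    const char **starts = malloc(len * sizeof *starts);
    size_t count = 0;
    for (size_t i = 0; i + NGRAM_LEN <= len; i++) {
        int ok = 1;
        for (int j = 0; j < NGRAM_LEN; j++)
            if (!isalnum((unsigned char)text[i + j])) { ok = 0; break; }
        if (ok) starts[count++] = text + i;
    }

    /* Sort the pointers by the n-gram they point at, then scan for the longest run. */
    qsort(starts, count, sizeof *starts, cmp_ngram);

    const char *best = NULL;
    size_t bestRun = 0, run = 0;
    for (size_t i = 0; i < count; i++) {
        if (i > 0 && strncmp(starts[i], starts[i - 1], NGRAM_LEN) == 0)
            run++;
        else
            run = 1;
        if (run > bestRun) { bestRun = run; best = starts[i]; }
    }

    if (best) printf("%.*s\n", NGRAM_LEN, best); /* prints "aaa" */
    free(starts);
    return 0;
}
```

Sorting 250,000 pointers with qsort is well within the 2-second budget, and the lexicographic tie-break falls out of the sort order for free: among equally frequent runs, the first one encountered is the smallest.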
Sorry for posting Python, but this is what I would do:
You might get some ideas for the algorithm from it. Note that this program handles an order of magnitude more words than required.
from itertools import groupby

someText = "thibbbs is a test and aaa it may haaave some abbba reptetitions "
someText *= 40000
print len(someText)

n = 3

ngrams = []
for word in filter(lambda x: len(x) >= n, someText.split(" ")):
    for i in range(len(word) - n + 1):
        ngrams.append(word[i:i+n])
        # you could inline all the logic here
        # add to an ordered list for which the frequency is the key for ordering and the payload the actual word

ngrams_freq = list([[len(list(group)), key] for key, group in groupby(sorted(ngrams, key=str.lower))])
ngrams_freq_sorted = sorted(ngrams_freq, reverse=True)

popular_ngrams = []
for freq in ngrams_freq_sorted:
    if freq[0] == ngrams_freq_sorted[0][0]:
        popular_ngrams.append(freq[1])
    else:
        break

print "Most popular ngram: " + sorted(popular_ngrams, key=str.lower)[0]

# > 2560000
# > Most popular ngram: aaa
# > [Finished in 1.3s]
So the basic recipe for this problem would be:
Find all n-grams in string
Map all duplicate entries into a new structure that has the n-gram and the number of times it occurs
You can find my C++ solution here: http://ideone.com/MNFSis
Given:
const unsigned int MAX_STR_LEN = 250000;
const unsigned short NGRAM = 3;
const unsigned int NGRAMS = MAX_STR_LEN-NGRAM;
//we will need a maximum of "the length of our string" - "the length of our n-gram"
//places to store our n-grams, and each ngram is specified by NGRAM+1 for '\0'
char ngrams[NGRAMS][NGRAM+1] = { 0 };
Then, for the first step - this is the code:
const char *ptr = str;
int idx = 0;

//notTerminated checks ptr[0] to ptr[NGRAM-1] are not '\0'
while (notTerminated(ptr)) {
    //noSpace checks ptr[0] to ptr[NGRAM-1] are isalpha()
    if (noSpace(ptr)) {
        //safely copy our current n-gram over to the ngrams array
        //we're iterating over ptr and because we're here we know ptr and the next NGRAM spaces
        //are valid letters
        for (int i = 0; i < NGRAM; i++) {
            ngrams[idx][i] = ptr[i];
        }
        ngrams[idx][NGRAM] = '\0'; //important to zero-terminate
        idx++;
    }
    ptr++;
}
At this point, we have a list of all n-grams. Let's find the most popular one:
FreqNode head = { "HEAD", 0, 0, 0 }; //the start of our list

for (int i = 0; i < NGRAMS; i++) {
    if (ngrams[i][0] == '\0') break;

    //insertFreqNode takes a start node; this is where we will start to search for duplicates
    //the simplest description is like this:
    //  1. we search from head down each child; if we find a node whose text equals
    //     ngrams[i] then we update its frequency count
    //  2. if the freq is >= to the current winner we place this as head.next
    //  3. after the program is complete, our most popular nodes will be the first nodes
    //  I have not implemented sorting of these - it's an exercise for the reader ;)
    insertFreqNode(&head, ngrams[i]);
}

//as the list is ordered, head.next will always be the most popular n-gram
cout << "Winner is: " << head.next->str << " with " << head.next->freq << " occurrences" << endl;
Good luck to you!
Just for fun, I wrote a SQL version (SQL Server 2012):
if object_id('dbo.MaxNgram','IF') is not null
drop function dbo.MaxNgram;
go
create function dbo.MaxNgram(
@text varchar(max)
,@length int
) returns table with schemabinding as
return
with
Delimiter(c) as ( select ' '),
E1(N) as (
select 1 from (values
(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
)T(N)
),
E2(N) as (
select 1 from E1 a cross join E1 b
),
E6(N) as (
select 1 from E2 a cross join E2 b cross join E2 c
),
tally(N) as (
select top(isnull(datalength(@text),0))
ROW_NUMBER() over (order by (select NULL))
from E6
),
cteStart(N1) as (
select 1 union all
select t.N+1 from tally t cross join delimiter
where substring(@text,t.N,1) = delimiter.c
),
cteLen(N1,L1) as (
select s.N1,
isnull(nullif(charindex(delimiter.c,@text,s.N1),0) - s.N1,8000)
from cteStart s
cross join delimiter
),
cteWords as (
select ItemNumber = row_number() over (order by l.N1),
Item = substring(@text, l.N1, l.L1)
from cteLen l
),
mask(N) as (
select top(@length) row_Number() over (order by (select NULL))
from E6
),
topItem as (
select top 1
substring(Item,m.N,@length) as Ngram
,count(*) as Length
from cteWords w
cross join mask m
where m.N <= datalength(w.Item) + 1 - @length
and @length <= datalength(w.Item)
group by
substring(Item,m.N,@length)
order by 2 desc, 1
)
select d.s
from (
select top 1 NGram,Length
from topItem
) t
cross apply (values (cast(NGram as varchar)),(cast(Length as varchar))) d(s)
;
go
which, when invoked with the OP's sample input (extended here with one extra word),
set nocount on;
select s as [ ] from MaxNgram(
'aaaab a0a baaab c aab'
,3
);
go
yields as desired
------------------------------
aaa
3
If you're not bound to C, I wrote the Python script below in about 10 minutes. It processes a 1.5 MB file containing more than 265,000 words, looking for 3-grams, in 0.4 s (apart from printing the values on the screen).
The text used for the test is Ulysses by James Joyce; you can find it for free here: https://www.gutenberg.org/ebooks/4300
Word separators here are both the space and the newline character \n.
import sys

text = open(sys.argv[1], 'r').read()
ngram_len = int(sys.argv[2])
text = text.replace('\n', ' ')
words = [word.lower() for word in text.split(' ')]

ngrams = {}
for word in words:
    word_len = len(word)
    if word_len < ngram_len:
        continue
    for i in range(0, (word_len - ngram_len) + 1):
        ngram = word[i:i+ngram_len]
        if ngram in ngrams:
            ngrams[ngram] += 1
        else:
            ngrams[ngram] = 1

ngrams_by_freq = {}
for key, val in ngrams.items():
    if val not in ngrams_by_freq:
        ngrams_by_freq[val] = [key]
    else:
        ngrams_by_freq[val].append(key)

ngrams_by_freq = sorted(ngrams_by_freq.items())
for key in ngrams_by_freq:
    print('{} with frequency of {}'.format(key[1:], key[0]))
You can convert each trigram into a RADIX-50 code.
See http://en.wikipedia.org/wiki/DEC_Radix-50
In Radix-50, the value of a trigram fits into a 16-bit unsigned int.
Thereafter, you can use the radix-encoded trigram as an index into an array.
So your code would be something like:
uint16_t counters[1 << 16]; // 64K counters
bzero(counters, sizeof(counters));
for (const char *p = txt; p[2] != 0; p++)
    counters[radix50(p)]++;
Then just search for the maximum value in the array and decode that index back into a trigram.
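The radix50() helper itself is left to the reader. Below is a minimal sketch of the idea, under the assumption that matching is case-insensitive and the alphabet is reduced to the 36 characters 0-9 and a-z (36^3 = 46656, so the index still fits in 16 bits); the helper names and 32-bit counters are my own choices, not from the linked source:

```c
#include <stdint.h>
#include <stdio.h>
#include <ctype.h>

/* Map one character to 0..35, or -1 for anything that is not alphanumeric. */
static int char_code(char c) {
    if (isdigit((unsigned char)c)) return c - '0';
    if (isalpha((unsigned char)c)) return 10 + (tolower((unsigned char)c) - 'a');
    return -1;
}

/* Pack a trigram into a single base-36 value; returns -1 if the trigram
 * spans a word boundary (contains a non-alphanumeric character). */
static int radix50(const char *p) {
    int value = 0;
    for (int i = 0; i < 3; i++) {
        int c = char_code(p[i]);
        if (c < 0) return -1;
        value = value * 36 + c;
    }
    return value;
}

/* Decode an index back into its three characters. */
static void decode(int value, char out[4]) {
    static const char alphabet[] = "0123456789abcdefghijklmnopqrstuvwxyz";
    out[2] = alphabet[value % 36]; value /= 36;
    out[1] = alphabet[value % 36]; value /= 36;
    out[0] = alphabet[value % 36];
    out[3] = '\0';
}

int main(void) {
    const char *txt = "aaaab a0a baaab c";       /* sample input */
    static uint32_t counters[36 * 36 * 36] = { 0 };  /* 32-bit counters, safe for long inputs */

    for (const char *p = txt; p[0] && p[1] && p[2]; p++) {
        int code = radix50(p);
        if (code >= 0) counters[code]++;
    }

    /* The smallest index with the maximal count is also the lexicographically
     * smallest winner, because the encoding preserves the alphabet order. */
    int best = 0;
    for (int i = 1; i < 36 * 36 * 36; i++)
        if (counters[i] > counters[best]) best = i;

    char ngram[4];
    decode(best, ngram);
    printf("%s with a frequency of %u\n", ngram, (unsigned)counters[best]);  /* aaa with a frequency of 3 */
    return 0;
}
```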
I used this trick when I implemented the Wilbur-Khovayko algorithm for fuzzy search about 10 years ago.
You can download the source here: http://itman.narod.ru/source/jwilbur1.tar.gz
You can solve this problem in O(nk) time where n is the number of words and k is the average number of n-grams per word.
You're correct in thinking that a hash table is a good solution to the problem.
However, since you have limited time to code a solution, I'd suggest using open addressing instead of linked lists. The implementation may be simpler: if you reach a collision, you just probe farther along the table.
Also, be sure to allocate enough memory for your hash table: something around twice the expected number of n-grams should be fine. Since the expected number of n-grams is <= 250,000, a hash table of 500,000 entries should be more than sufficient.
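A minimal sketch of such an open-addressing (linear-probing) counter for fixed-length n-grams; the table size, hash function, and sample input below are illustrative choices, not part of the original answer:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NGRAM_LEN   3
#define TABLE_SIZE  524288          /* power of two, > 2 * 250,000 expected n-grams */

typedef struct {
    char ngram[NGRAM_LEN + 1];      /* empty string marks an unused slot */
    unsigned count;
} Slot;

static Slot table[TABLE_SIZE];

/* FNV-1a style hash over the n-gram's characters. */
static unsigned hash_ngram(const char *s) {
    unsigned h = 2166136261u;
    for (int i = 0; i < NGRAM_LEN; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;
    }
    return h & (TABLE_SIZE - 1);
}

/* Increment the count for one n-gram, probing linearly on collisions. */
static void count_ngram(const char *s) {
    unsigned idx = hash_ngram(s);
    while (table[idx].ngram[0] != '\0' &&
           strncmp(table[idx].ngram, s, NGRAM_LEN) != 0)
        idx = (idx + 1) & (TABLE_SIZE - 1);     /* open addressing: walk to the next slot */
    if (table[idx].ngram[0] == '\0')
        memcpy(table[idx].ngram, s, NGRAM_LEN); /* claim the empty slot */
    table[idx].count++;
}

int main(void) {
    const char *words[] = { "aaaab", "a0a", "baaab", "c" };  /* sample input, already split */
    for (size_t w = 0; w < sizeof words / sizeof words[0]; w++) {
        size_t len = strlen(words[w]);
        for (size_t i = 0; i + NGRAM_LEN <= len; i++)
            count_ngram(words[w] + i);
    }

    /* Pick the winner: highest count, ties broken lexicographically. */
    const Slot *best = NULL;
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        if (table[i].count == 0) continue;
        if (!best || table[i].count > best->count ||
            (table[i].count == best->count && strcmp(table[i].ngram, best->ngram) < 0))
            best = &table[i];
    }
    if (best) printf("%s\n", best->ngram);      /* prints "aaa" */
    return 0;
}
```

Keeping the table size a power of two lets the probe step use a bitwise AND instead of a modulo, and the generous sizing keeps probe chains short even without deletion or resizing logic.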
In terms of coding speed, the small input length (250,000 characters) also makes sorting and counting a feasible option. The quickest way is probably to generate an array of pointers to each n-gram, sort the array using an appropriate comparator, and then walk along it keeping track of which n-gram appears the most.
One simple Python solution for this question:
your_str = "aaaab a0a baaab c"
str_list = your_str.split(" ")
str_hash = {}
ngram_len = 3

for str in str_list:
    start = 0
    end = ngram_len
    len_word = len(str)
    for i in range(0, len_word):
        if end <= len_word:
            if str_hash.get(str[start:end]):
                str_hash[str[start:end]] = str_hash.get(str[start:end]) + 1
            else:
                str_hash[str[start:end]] = 1
            start = start + 1
            end = end + 1
        else:
            break

keys_sorted = sorted(str_hash.items())
for ngram in sorted(keys_sorted, key=lambda x: x[1], reverse=True):
    print "\"%s\" with a frequency of %s" % (ngram[0], ngram[1])
Let's assume I have a column 'Rate' and rows have to be fetched based on the criteria below.
The characters below are nothing but column names.
if X>Y then fetch Rate
if A>B then
    if C>D then fetch Rate
    if E>F then
        if G>H then fetch Rate
        if I>J then fetch Rate
        if K>L then
            if M>N then fetch Rate
I am stuck on how to write these multilevel conditions in the WHERE clause.
I'm not at all sure I've gotten the right end of the stick, but it sounds like you just want to combine multiple conditions in the WHERE clause:
WHERE
X>Y OR
(
A>B AND
(
C>D OR
(
E>F AND
(
G>H OR
I>J OR
(K>L AND M>N)
)
)
)
)
Not all of the brackets are strictly necessary, but I usually prefer to use them rather than remember whether AND or OR has higher precedence.
To translate the conditions into SQL one may use a simple set of two rules:
Change "then fetch Rate" to "OR (...)"
Change "then" to "AND (...)"
so the resulting SQL is something like this:
select Rate
from ...
where (X > Y) OR
(A > B) AND
((C > D) OR
(E > F) AND
((G > H) OR
(I > J) OR
(K > L) AND
(M > N)))
Below is my understanding of how the Hadoop framework processes text files. Please correct me if I am going wrong somewhere.
Each mapper acts on an input split which contains some records.
For each input split a record reader is getting created which starts reading records from the input split.
If there are n records in an input split, the map method of the mapper is called n times, each time reading a key-value pair via the record reader.
Now coming to the database perspective:
I have a database on a single remote node and I want to fetch some data from a table in this database. I would configure the parameters using DBConfiguration and specify the input table using DBInputFormat. Now say my table has 100 records in total, and I execute an SQL query which generates 70 records in the output.
I would like to know :
How are the InputSplits getting created in the above case (database) ?
What does the input split creation depend on, the number of records which my sql query generates or the total number of records in the table (database) ?
How many DBRecordReaders are getting created in the above case (database) ?
How are the InputSplits getting created in the above case (database)?
// Split the rows into n-number of chunks and adjust the last chunk
// accordingly
for (int i = 0; i < chunks; i++) {
    DBInputSplit split;

    if ((i + 1) == chunks)
        split = new DBInputSplit(i * chunkSize, count);
    else
        split = new DBInputSplit(i * chunkSize, (i * chunkSize) + chunkSize);

    splits.add(split);
}
There is the how, but to understand what it depends on let's take a look at chunkSize:
statement = connection.createStatement();
results = statement.executeQuery(getCountQuery());
results.next();
long count = results.getLong(1);
int chunks = job.getConfiguration().getInt("mapred.map.tasks", 1);
long chunkSize = (count / chunks);
So chunkSize takes count = SELECT COUNT(*) FROM tableName and divides it by chunks = mapred.map.tasks, or 1 if that is not defined in the configuration. For example, with count = 70 and mapred.map.tasks = 3, chunkSize is 23 and the splits cover rows 0-22, 23-45 and 46-69 (the last split is adjusted up to the full count).
Then finally, each input split will have a RecordReader created to handle the type of database you are reading from, for instance a MySQLDBRecordReader for a MySQL database.
For more info check out the source
It appears @Engineiro explained it well by looking at the actual Hadoop source. Just to answer: the number of DBRecordReaders is equal to the number of map tasks.
To explain further, the Hadoop map-side framework creates an instance of DBRecordReader for each map task (in the case where the child JVM is not reused for further map tasks). In other words, the number of input splits is equal to the value of mapred.map.tasks in the case of DBInputFormat. So each map task's record reader has the meta information needed to construct the query that fetches its subset of data from the table. Each record reader executes a pagination type of SQL similar to the one below.
SELECT * FROM (SELECT a.*,ROWNUM dbif_rno FROM ( select * from emp ) a WHERE rownum <= 6 + 7 ) WHERE dbif_rno >= 6
The above SQL is for the second map task, and returns the rows between 6 and 13.
To generalize for any type of input format, the number of record readers is equal to the number of map tasks.
This post talks about all that you want: http://blog.cloudera.com/blog/2009/03/database-access-with-hadoop/
When I perform a simple query like this:
select * from nodeType
Calling skip(N) on the range iterator is slow.
What am I doing wrong?
Found out why (self-answering): it was using document order by default.
Try adding a sensible "order by" to the query; it goes from minutes for 10,000 nodes to under a second.
Sadly, the RangeIterator skip() method in the Jackrabbit implementation (RangeIterator is just an interface) traverses the nodes linearly. You might as well just write:
int counter = 0;
while (counter < offset && iter.hasNext()) {
    iter.next();
    counter++;
}