I want to be able to randomly select 5 rows in C

Let's think aloud a bit.
If you just need to select 5 arbitrary numbers that happen to sum to a number below a given N, you can cheat and select just the 5 smallest numbers; if they sum to a number larger than N, choosing any other numbers won't help either, and you report an error.
If you want your numbers to sum up quite close to N (the user asked 20 minutes, you try to offer something like 19 minutes and not 5), it becomes a knapsack problem, which is hard, but maybe various approximate ways to solve it could help.
If you just want to choose 5 random numbers that sum up to N, you can keep choosing 5 numbers (songs) randomly and check. You'll have to limit the number of tries done and/or time spent, and be ready to report a failure.
A somewhat more efficient algorithm would keep a list of songs chosen so far, and the sum of their lengths s. It would try to add to it a random song with length ≤ N - s. If it failed after a few attempts, it would remove the longest song from the list and repeat. It must be ready to admit failure, too, based on the total number of attempts made and/or time spent.
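The add-or-back-off idea above can be sketched in a few lines of Python. This is only a sketch under assumptions: the name `pick_songs` and its parameters are made up for illustration, songs are represented by a plain list of lengths, and instead of retrying random picks it filters the candidates that still fit before choosing one.

```python
import random

def pick_songs(lengths, limit, max_songs=5, max_attempts=1000, seed=None):
    """Try to pick at most max_songs indices whose lengths sum to exactly limit.

    Returns a list of indices into lengths, or None if it gives up.
    (Hypothetical helper, not taken from the original answer.)
    """
    rng = random.Random(seed)
    chosen = []   # indices picked so far
    total = 0     # sum of their lengths (s in the text)
    for _ in range(max_attempts):
        if total == limit:
            return chosen
        # songs not yet chosen that still fit into the remaining time
        candidates = [i for i, length in enumerate(lengths)
                      if i not in chosen and length <= limit - total]
        if candidates and len(chosen) < max_songs:
            pick = rng.choice(candidates)
            chosen.append(pick)
            total += lengths[pick]
        elif chosen:
            # stuck: drop the longest song and try again
            longest = max(chosen, key=lambda i: lengths[i])
            chosen.remove(longest)
            total -= lengths[longest]
        else:
            return None   # nothing fits at all
    # attempt budget used up: admit failure unless the last step hit the target
    return chosen if total == limit else None
```

The attempt budget is what lets the function admit failure instead of looping forever when no combination exists.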
I don't think a simple SQL query could efficiently solve this problem. You can approximately encode the algorithm above as a very complex SQL query, though. I'd rather encode it in Python, because local SQLite lookups are pretty fast, provided that your songs are indexed by length.

A possible solution is to select only songs whose individual length is < 500. Then you keep as many of them as you can. If you have fewer than 5 songs and the total time is still < 500, you iterate or recurse to find more songs for the unused time.
def createtimeplay(timee, tot = None, tot_time = 0):
    if tot is None: tot = []                # at first call initialize the result list
    # exclude previously selected songs from the search
    qry = "SELECT * FROM songs WHERE length <= ?"
    if len(tot) > 0:
        qry += " and name not in (" + ','.join(['?'] * len(tot)) + ')'
    qry += " ORDER BY RANDOM() LIMIT 5 "
    curs = c.execute(qry, [timee] + [song[0] for song in tot])
    cur = (curs.fetchall())
    if len(cur) == 0: return tot            # no songs were found: we can return
    # keep songs that fit in allowed time
    cur_time = 0
    for song in cur:
        if cur_time + song[1] <= timee:
            cur_time += song[1]
            tot.append(song)                # add the song to the shared result list
            if len(tot) == 5: return tot    # never more than 5 songs
    tot_time += cur_time                    # total songs time
    if len(tot) != 5 and cur_time != timee: # if not all recurse
        # timee is already the remaining time, so subtract only this round's songs
        createtimeplay(timee - cur_time, tot, tot_time)
    return tot
The trick is that we pass a list which is a modifiable object, so all recursive calls add songs to the same list.
You can then use:
>>> print(createtimeplay(500))
[('Song 18', 350, 'Country', 'z'), ('Song 4', 150, 'Pop', 'x')]
>>> print(createtimeplay(500))
[('Song 12', 200, 'Country', 'z'), ('Song 3', 100, 'Country', 'z'), ('Song 14', 200, 'Rap', 'y')]
>>> print(createtimeplay(500))
[('Song 5', 300, 'Rap', 'y'), ('Song 7', 200, 'Pop', 'x')]
But the previous solution is very inefficient: it requires more than one query, each of which is a full table scan because of the ORDER BY RANDOM(), and it uses recursion where it could easily be avoided. It would be both simpler and more efficient to do a single full table scan at the SQLite level, shuffle the result in SQLite or Python, and then scan the randomized list of songs once, keeping at most 5 songs subject to the constraint on the total length.
Code is now much simpler:
def createtimeplay(tim, n, con):
    # con is a cursor or a connection to the sqlite database
    songs = con.execute("""SELECT name, length, genre, artist
                           FROM songs
                           WHERE length < ? ORDER BY RANDOM()""", (tim,)).fetchall()
    result = []
    for song in songs:
        if song[1] <= tim:          # the song still fits in the remaining time
            tim -= song[1]
            result.append(song)
            if len(result) == n or tim == 0: break
    return result
In this code, I chose to pass the maximum number of songs and a cursor or a connection to the sqlite database as parameters.


Summation based on unique entries of two arrays | Speed Issue

I have 3 arrays of size 803500*1 with the following details:
Rid: It can contain any number
RidID: It contains elements from 1 to 184 in random order. Each element appears multiple times.
r: It contains elements 0,1,2,...12. All elements (except zero) appear nearly 3400 to 3700 times at random indices in this array.
Following may be useful for generating sample data:
Rid = rand(803500,1);
RidID = randi(184,803500,1);
r = randi(13,803500,1)-1; %This may not be a good sample for r as per previously mentioned details?
What I want to do?
I want to calculate the sum of those entries of Rid which correspond to each positive unique entry of r and each unique entry of RidID.
This may be clearer with the code which I wrote for this problem:
RNum = numel(unique(RidID));
RSum = ones(RNum,12); %Preallocating for better speed
for i=1:12
    RperM = r ==i;
    for j = 1:RNum
        RSum(j,i) = sum(Rid(RperM & (RidID==j)));
    end
end
My code works, but it takes 5 seconds on average on my computer and I have to do this calculation nearly a thousand times. If this time could be reduced from 5 seconds to half of that or less, I'll be very happy. But how do I optimize this? I don't mind whether it is improved with vectorization or with a better written loop.
I am using MATLAB R2017b.
You can use accumarray :
u = unique(RidID);
A = accumarray([RidID r+1], Rid);
RSum = A(u, 2:13);
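For readers who do not know accumarray, the grouping it performs here can be sketched in plain Python. This is only an illustration of the semantics, not a replacement for the MATLAB code; the function name `grouped_sums` and the tiny sample lists are made up.

```python
from collections import defaultdict

def grouped_sums(rid, rid_id, r):
    """Sum the rid values for every (rid_id, r) pair with r > 0."""
    sums = defaultdict(float)
    for value, group, month in zip(rid, rid_id, r):
        if month > 0:                      # entries with r == 0 are ignored
            sums[(group, month)] += value
    return dict(sums)

# Tiny made-up sample: the two entries with RidID 1 and r 1 are added together.
print(grouped_sums([1.0, 2.0, 3.0, 4.0], [1, 1, 2, 1], [1, 1, 0, 2]))
# {(1, 1): 3.0, (1, 2): 4.0}
```

In the MATLAB version the pair (RidID, r+1) plays the role of the dictionary key, and dropping the first column of A plays the role of the `month > 0` test.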
This is slower than accumarray as suggested by rahnema, but using findgroups and splitapply may save memory.
In your example, there may be thousands of zero-valued elements in the resulting matrix, where a combination of RidID and r does not occur. In this case a stacked result would be more memory efficient, like so:
RidID | r | Rid_sum
1 | 1 | 100
2 | 1 | 200
4 | 2 | 85
This can be achieved with the following code:
[ID, rn, RidIDn] = findgroups(r,RidID); % Get unique combo ID for 'r' and 'RidID'
RSum = splitapply( @sum, Rid, ID ); % Sum for each ID
output = table( RidIDn, rn, RSum ); % Nicely formatted table output
% Get rid of elements where r == 0
output( output.rn == 0, : ) = [];
You could convert this to the same output as the accumarray method, but it's already a slower method...
% Convert to 'unstacked' 2D matrix (optional)
RSum = full( sparse( output.RidIDn, output.rn, output.RSum ) ); % rows: RidID, columns: r

lag over columns/ variables SPSS

I want to do something I thought was really simple.
My (mock) data looks like this:
data list free/totalscore.1 to totalscore.5.
begin data.
1 2 6 7 10 1 4 9 11 12 0 2 4 6 9
end data.
These are total scores accumulating over a number of trials (in this mock data, from 1 to 5). Now I want to know the number of scores earned in each trial. In other words, I want to subtract the value in the n trial from the n+1 trial.
The simplest syntax would look like this:
COMPUTE trialscore.1 = totalscore.2 - totalscore.1.
COMPUTE trialscore.2 = totalscore.3 - totalscore.2.
COMPUTE trialscore.3 = totalscore.4 - totalscore.3.
And so on...
So that the result would look like this:
But of course it is not possible and not fun to do this for 200+ variables.
I attempted to write a syntax using VECTOR and DO REPEAT as follows:
COMPUTE #y = 1.
VECTOR totalscore = totalscore.1 to totalscore.5.
DO REPEAT trialscore = trialscore.1 to trialscore.5.
COMPUTE #y = #x + 1.
COMPUTE trialscore(#i) = totalscore(#y) - totalscore(#i).
But it doesn't work.
Any help is appreciated.
Ps. I've looked into using LAG but that loops over rows while I need it to go over 1 column at a time.
I am assuming respid is your original (unique) record identifier.
If you do not have a record identifier, you can very easily create a dummy one:
compute respid=$casenum.
end of EDIT
You could try re-structuring the data, so that each score is a distinct record:
varstocases
/make totalscore from totalscore.1 to totalscore.5
/index=scorenumber.
then sort your cases so that scores are in descending order (in order to be able to use the lag function):
sort cases by respid (a) scorenumber (d).
Then actually do the lag-based computations
do if respid=lag(respid).
compute trialscore=totalscore-lag(totalscore).
end if.
In the end, un-do the restructuring:
You should end up with a set of totalscore variables (the last one will be empty), which will hold what you need.
You can use do repeat this way:
do repeat
before=totalscore.1 to totalscore.4
/after=totalscore.2 to totalscore.5
/diff=trialscore.1 to trialscore.4 .
compute diff=after-before.
end repeat.
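The pairing that do repeat sets up (each score next to its successor) is the same idea as zipping a list with itself shifted by one. A small Python sketch on the mock data from the question, for illustration only; the helper name `trial_scores` is made up.

```python
def trial_scores(totals):
    """Differences between consecutive cumulative scores."""
    # zip pairs totals[0] with totals[1], totals[1] with totals[2], ...
    return [after - before for before, after in zip(totals, totals[1:])]

rows = [[1, 2, 6, 7, 10], [1, 4, 9, 11, 12], [0, 2, 4, 6, 9]]
for row in rows:
    print(trial_scores(row))
# [1, 4, 1, 3]
# [3, 5, 2, 1]
# [2, 2, 2, 3]
```

As in the SPSS syntax, five cumulative scores yield four differences.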

Python loop to print salaries within range of the average

I'm an absolute beginner to Python and am tasked with creating a program that does a few things:
Inputs employee names into a list.
Inputs that employee's salary after inputting their name.
Totals the salaries in a list, (2 lists: names[] and salaries[]).
Finds the average salary after totaling.
Prints employees that earn within $5,000 of the average salary (Where I'm stuck).
Please see my code below:
# function to total the salaries entered into the "newSalary" variable and "salaries[]".
def totalSalaries(salaries):
    total = 0
    for i in salaries:
        total += i
    return total

# Finds the average salary after adding and dividing salaries in "salaries[]".
def averageSalaries(salaries):
    l = len(salaries)
    t = totalSalaries(salaries)
    ave = t / l
    return ave

# Start main
def main():
    # Empty names list for "name" variable.
    names = []
    # Empty salaries list for "salary" and "newSalary" variables.
    salaries = []
    # Starts the loop to input names and salaries.
    done = False
    while not done:
        name = input("Please enter the employee name or * to finish: ")
        salary = float(input("Please enter the salary in thousands for " + name + ": "))
        # Try/except to catch exceptions if a float isn't entered.
        # The float entered then gets converted to thousands if it is a float.
        try:
            s = float(salary)
        # Message to user if a float isn't entered.
        except:
            print("Please enter a valid float number.")
            done = False
        newSalary = salary * 1000
        # Break in the loop, use * to finish inputting Names and Salaries.
        if name == "*":
            done = True
        # Appends the names into name[] and salaries into salaries[] if * isn't entered.
        # Restarts loop afterwards if * is not entered.
        else:
            names.append(name)
            salaries.append(newSalary)
    # STUCK HERE. Need to output Names + their salaries if it's $5,000 +- the total average salary.
    for i in range(len(salaries)):
        if newSalary is 5000 > ave < 5000:
            print(name + ", " + str(newSalary))
    # Quick prints just to check my numbers after finishing with *.
Any info is greatly appreciated. I hope the rest of the functions and logic in this program makes sense.
You haven't written your loop correctly. With a list, you can just use for element in array: and the loop will iterate over array, putting each item in element. So your for loop becomes for salary in salaries.
Also, you need to split your condition in two and use addition and subtraction. Your code should check that the salary is greater than or equal to the average minus 5000 and less than or equal to the average plus 5000. Formalized mathematically, that is:
Salary >= Average - 5000
Salary <= Average + 5000
So the condition becomes if salary >= (average - 5000) and salary <= (average + 5000)
Finally, you do not call averageSalaries before entering the loop, so the average salary hasn't been computed yet. You should call the function and store the result in a variable before your for loop.
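Putting the three points together, the final step could look roughly like the sketch below. The function name `within_range` and the `margin` parameter are made up for illustration; it assumes `names` and `salaries` are the two parallel lists from the question.

```python
def within_range(names, salaries, margin=5000):
    """Return (name, salary) pairs whose salary is within margin of the average."""
    average = sum(salaries) / len(salaries)   # compute the average before looping
    return [(name, salary)
            for name, salary in zip(names, salaries)
            if average - margin <= salary <= average + margin]

# Example with made-up data: the average is 50000, so only Bob qualifies.
for name, salary in within_range(["Ann", "Bob", "Cy"], [40000.0, 50000.0, 60000.0]):
    print(name + ", " + str(salary))
```

Using zip keeps each name attached to its own salary, which the index-free loop would otherwise lose.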

The n-gram that is the most frequent one among all the words

I came across the following programming interview problem:
Challenge 1: N-grams
An N-gram is a sequence of N consecutive characters from a given word. For the word "pilot" there are three 3-grams: "pil", "ilo" and "lot".
For a given set of words and an n-gram length
Your task is to
• write a function that finds the n-gram that is the most frequent one among all the words
• print the result to the standard output (stdout)
• if there are multiple n-grams having the same maximum frequency please print the one that is the smallest lexicographically (the first one according to the dictionary sorting order)
Note that your function will receive the following arguments:
• text
○ which is a string containing words separated by whitespaces
• ngramLength
○ which is an integer value giving the length of the n-gram
Data constraints
• the length of the text string will not exceed 250,000 characters
• all words are alphanumeric (they contain only English letters a-z, A-Z and numbers 0-9)
Efficiency constraints
• your function is expected to print the result in less than 2 seconds
text: “aaaab a0a baaab c”
ngramLength: 3
Output: aaa
For the input presented above the 3-grams sorted by frequency are:
• "aaa" with a frequency of 3
• "aab" with a frequency of 2
• "a0a" with a frequency of 1
• "baa" with a frequency of 1
If I have only one hour to solve the problem and I chose to use the C language to solve it: is it a good idea to implement a Hash Table to count the frequency of the N-grams with that amount of time? because in the C library there is no implementation of a Hash Table...
If yes, I was thinking to implement a Hash Table using separate chaining with ordered linked lists. Those implementations reduce the time that you have to solve the problem....
Is that the fastest option possible?
Thank you!!!
If implementation efficiency is what matters and you are using C, I would initialize an array of pointers to the starts of n-grams in the string, use qsort to sort the pointers according to the n-gram that they are part of, and then loop over that sorted array and figure out counts.
This should execute fast enough, and there is no need to code any fancy data structures.
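To show the shape of the sort-then-count approach, here is a short sketch in Python rather than C (Python is used only for brevity; the function name `most_frequent_ngram` is made up). In C the sorted list would be the qsort-ed array of pointers described above.

```python
def most_frequent_ngram(text, n):
    """Return (ngram, count) for the most frequent n-gram, smallest one on ties."""
    # Collect every n-gram of every word, then sort so equal n-grams are adjacent.
    grams = sorted(word[i:i + n]
                   for word in text.split()
                   for i in range(len(word) - n + 1))
    best, best_count = None, 0
    run_start = 0
    for i in range(1, len(grams) + 1):
        # A run of equal n-grams ends at the end of the list or when the value changes.
        if i == len(grams) or grams[i] != grams[run_start]:
            count = i - run_start
            if count > best_count:   # strict '>' keeps the smallest n-gram on ties
                best, best_count = grams[run_start], count
            run_start = i
    return best, best_count

print(most_frequent_ngram("aaaab a0a baaab c", 3))   # ('aaa', 3)
```

Because the list is sorted, the lexicographic tie-break comes for free: the first run that reaches the maximum count is the smallest one.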
Sorry for posting python but this is what I would do:
You might get some ideas for the algorithm. Notice this program solves an order of magnitude more words.
from itertools import groupby

someText = "thibbbs is a test and aaa it may haaave some abbba reptetitions "
someText *= 40000
print(len(someText))
n = 3

ngrams = []
for word in filter(lambda x: len(x) >= n, someText.split(" ")):
    for i in range(len(word)-n+1):
        ngrams.append(word[i:i+n])
        # you could inline all logic here
        # add to an ordered list for which the frequency is the key for ordering and the payload the actual word

# count each n-gram by grouping the sorted list
ngrams_freq = list([[len(list(group)), key] for key, group in groupby(sorted(ngrams, key=str.lower))])
ngrams_freq_sorted = sorted(ngrams_freq, reverse=True)

# keep every n-gram that shares the highest frequency
popular_ngrams = []
for freq in ngrams_freq_sorted:
    if freq[0] == ngrams_freq_sorted[0][0]:
        popular_ngrams.append(freq[1])

print("Most popular ngram: " + sorted(popular_ngrams, key=str.lower)[0])
# > 2560000
# > Most popular ngram: aaa
# > [Finished in 1.3s]
So the basic recipe for this problem would be:
1. Find all n-grams in string
2. Map all duplicate entries into a new structure that has the n-gram and the number of times it occurs
You can find my c++ solution here: http://ideone.com/MNFSis
const unsigned int MAX_STR_LEN = 250000;
const unsigned short NGRAM = 3;
const unsigned int NGRAMS = MAX_STR_LEN-NGRAM;
//we will need a maximum of "the length of our string" - "the length of our n-gram"
//places to store our n-grams, and each ngram is specified by NGRAM+1 for '\0'
char ngrams[NGRAMS][NGRAM+1] = { 0 };
Then, for the first step - this is the code:
const char *ptr = str;
int idx = 0;
//notTerminated checks ptr[0] to ptr[NGRAM-1] are not '\0'
while (notTerminated(ptr)) {
    //noSpace checks ptr[0] to ptr[NGRAM-1] are isalpha()
    if (noSpace(ptr)) {
        //safely copy our current n-gram over to the ngrams array
        //we're iterating over ptr and because we're here we know ptr and the next NGRAM spaces
        //are valid letters
        for (int i=0; i<NGRAM; i++) {
            ngrams[idx][i] = ptr[i];
        }
        ngrams[idx][NGRAM] = '\0'; //important to zero-terminate
        idx++; //move on to the next free slot
    }
    ptr++; //advance one character in the input
}
At this point, we have a list of all n-grams. Lets find the most popular one:
FreqNode head = { "HEAD", 0, 0, 0 }; //the start of our list
for (int i=0; i<NGRAMS; i++) {
    if (ngrams[i][0] == '\0') break;
    //insertFreqNode takes a start node, this is where we will start to search for duplicates
    //the simplest description is like this:
    // 1 we search from head down each child, if we find a node that has text equal to
    //   ngrams[i] then we update its frequency count
    // 2 if the freq is >= to the current winner we place this as head.next
    // 3 after program is complete, our most popular nodes will be the first nodes
    // I have not implemented sorting of these - it's an exercise for the reader ;)
    insertFreqNode(&head, ngrams[i]);
}
//as the list is ordered, head.next will always be the most popular n-gram
cout << "Winner is: " << head.next->str << " " << " with " << head.next->freq << " occurrences" << endl;
Good luck to you!
Just for fun, I wrote a SQL version (SQL Server 2012):
if object_id('dbo.MaxNgram','IF') is not null
    drop function dbo.MaxNgram;
go
create function dbo.MaxNgram(
     @text varchar(max)
    ,@length int
) returns table with schemabinding as
return
    with
    Delimiter(c) as ( select ' '),
    E1(N) as (
        select 1 from (values
            (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
        ) T(N)
    ),
    E2(N) as (
        select 1 from E1 a cross join E1 b
    ),
    E6(N) as (
        select 1 from E2 a cross join E2 b cross join E2 c
    ),
    tally(N) as (
        select top(isnull(datalength(@text),0))
            ROW_NUMBER() over (order by (select NULL))
        from E6
    ),
    cteStart(N1) as (
        select 1 union all
        select t.N+1 from tally t cross join delimiter
        where substring(@text,t.N,1) = delimiter.c
    ),
    cteLen(N1,L1) as (
        select s.N1,
            isnull(nullif(charindex(delimiter.c,@text,s.N1),0) - s.N1,8000)
        from cteStart s
        cross join delimiter
    ),
    cteWords as (
        select ItemNumber = row_number() over (order by l.N1),
            Item = substring(@text, l.N1, l.L1)
        from cteLen l
    ),
    mask(N) as (
        select top(@length) row_Number() over (order by (select NULL))
        from E6
    ),
    topItem as (
        select top 1
             substring(Item,m.N,@length) as Ngram
            ,count(*) as Length
        from cteWords w
        cross join mask m
        where m.N <= datalength(w.Item) + 1 - @length
          and @length <= datalength(w.Item)
        group by
            substring(Item,m.N,@length)
        order by 2 desc, 1
    )
    select d.s
    from (
        select top 1 NGram,Length
        from topItem
    ) t
    cross apply (values (cast(NGram as varchar)),(cast(Length as varchar))) d(s)
;
go
which when invoked with the sample input provided by OP
set nocount on;
select s as [ ] from MaxNgram(
    'aaaab a0a baaab c aab'
    ,3
);
yields as desired
If you're not bound to C, I've written this Python script in about 10 minutes. It processes a 1.5 MB file containing more than 265,000 words, looking for 3-grams, in 0.4 s (not counting printing the values to the screen).
The text used for the test is James Joyce's Ulysses, which you can find for free here: https://www.gutenberg.org/ebooks/4300
Word separators here are both space and newline (\n).
import sys

text = open(sys.argv[1], 'r').read()
ngram_len = int(sys.argv[2])
text = text.replace('\n', ' ')
words = [word.lower() for word in text.split(' ')]

# count every n-gram of every word that is long enough
ngrams = {}
for word in words:
    word_len = len(word)
    if word_len < ngram_len:
        continue
    for i in range(0, (word_len - ngram_len) + 1):
        ngram = word[i:i+ngram_len]
        if ngram in ngrams:
            ngrams[ngram] += 1
        else:
            ngrams[ngram] = 1

# invert the mapping: frequency -> list of n-grams with that frequency
ngrams_by_freq = {}
for key, val in ngrams.items():
    if val not in ngrams_by_freq:
        ngrams_by_freq[val] = [key]
    else:
        ngrams_by_freq[val].append(key)

ngrams_by_freq = sorted(ngrams_by_freq.items())
for key in ngrams_by_freq:
    print('{} with frequency of {}'.format(key[1:], key[0]))
You can convert trigram into RADIX50 code.
See http://en.wikipedia.org/wiki/DEC_Radix-50
In RADIX-50, the encoded value of a trigram fits into a 16-bit unsigned integer.
You can therefore use the radix-encoded trigram as an index into an array.
So, your code would be like:
uint16_t counters[1 << 16]; // 64K counters
bzero(counters, sizeof(counters));
for(const char *p = txt; p[2] != 0; p++)
    counters[radix50(p)]++; // radix50() is an assumed helper that encodes p[0], p[1], p[2]
Thereafter, just search for the max value in the array and decode its index back into a trigram.
I used this trick when I implemented the Wilbur-Khovayko algorithm for fuzzy search ~10 years ago.
You can download source here: http://itman.narod.ru/source/jwilbur1.tar.gz.
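A rough Python sketch of the same indexing trick follows. Everything in it is an assumption made for illustration: the 40-symbol `ALPHABET` ordering is a RADIX-50-style table rather than a verified copy of the DEC one, the helper names are made up, and, like RADIX-50, it folds upper and lower case together, which may not be acceptable if the task treats them as distinct.

```python
# 40 symbols: space, A-Z, three punctuation marks, 0-9 (ordering is an assumption)
ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ$.%0123456789"
CODE = {ch: i for i, ch in enumerate(ALPHABET)}

def encode(trigram):
    """Map a 3-character string to an integer in [0, 40**3)."""
    a, b, c = (CODE[ch] for ch in trigram.upper())
    return (a * 40 + b) * 40 + c

def decode(index):
    """Inverse of encode (returns the upper-cased trigram)."""
    index, c = divmod(index, 40)
    a, b = divmod(index, 40)
    return ALPHABET[a] + ALPHABET[b] + ALPHABET[c]

def most_frequent_trigram(text):
    counters = [0] * (40 ** 3)          # 64000 counters, one per possible trigram
    for word in text.split():
        for i in range(len(word) - 2):
            counters[encode(word[i:i + 3])] += 1
    # highest count wins; ties go to the smallest index
    best = max(range(len(counters)), key=lambda k: (counters[k], -k))
    return decode(best), counters[best]

print(most_frequent_trigram("aaaab a0a baaab c"))   # ('AAA', 3)
```

Note that ties are broken by index order, which follows `ALPHABET` (letters before digits) and therefore differs from plain ASCII ordering.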
You can solve this problem in O(nk) time where n is the number of words and k is the average number of n-grams per word.
You're correct in thinking that a hash table is a good solution to the problem.
However, since you have limited time to code a solution, I'd suggest using open addressing instead of a linked list. The implementation may be simpler: if you reach a collision you just walk farther along the list.
Also, be sure to allocate enough memory to your hash table: something about twice the size of the expected number of n-grams should be fine. Since the expected number of n-grams is <=250,000 a hash table of 500,000 should be more than sufficient.
In terms of coding speed, the small input length (250,000) makes sorting and counting a feasible option. The quickest way is probably to generate an array of pointers to each n-gram, sort the array using an appropriate comparator, and then walk along it keeping track of which n-gram appeared the most.
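To make the open-addressing suggestion concrete, here is a minimal sketch of a fixed-size table with linear probing, written in Python only to show the structure (in Python itself a dict would do the job). The class and function names are made up, and the sketch assumes the capacity is larger than the number of distinct n-grams, otherwise the probe loop would never find a free slot.

```python
class NgramCounter:
    """Fixed-size hash table with linear probing (open addressing)."""

    def __init__(self, capacity=500000):
        self.capacity = capacity
        self.keys = [None] * capacity    # None marks a free slot
        self.counts = [0] * capacity

    def add(self, gram):
        slot = hash(gram) % self.capacity
        # On a collision, walk forward until we find the key or a free slot.
        while self.keys[slot] is not None and self.keys[slot] != gram:
            slot = (slot + 1) % self.capacity
        self.keys[slot] = gram
        self.counts[slot] += 1

    def best(self):
        """Most frequent key, smallest one on ties, as (key, count)."""
        pairs = [(k, c) for k, c in zip(self.keys, self.counts) if k is not None]
        if not pairs:
            return None
        return min(pairs, key=lambda kc: (-kc[1], kc[0]))


def most_frequent_ngram_hashed(text, n, capacity=500000):
    table = NgramCounter(capacity)
    for word in text.split():
        for i in range(len(word) - n + 1):
            table.add(word[i:i + n])
    return table.best()

print(most_frequent_ngram_hashed("aaaab a0a baaab c", 3, capacity=8))   # ('aaa', 3)
```

The two parallel arrays map directly onto what a C version would allocate up front, which is why this layout tends to be quicker to write than chained lists.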
One simple Python solution for this question:
your_str = "aaaab a0a baaab c"
str_list = your_str.split(" ")
str_hash = {}
ngram_len = 3
for word in str_list:
    start = 0
    end = ngram_len
    len_word = len(word)
    for i in range(0,len_word):
        if end <= len_word :
            # count the n-gram word[start:end]
            if str_hash.get(word[start:end]):
                str_hash[word[start:end]] = str_hash.get(word[start:end]) + 1
            else:
                str_hash[word[start:end]] = 1
            start = start +1
            end = end +1
# sort by n-gram first so that ties in frequency stay in lexicographic order
keys_sorted =sorted(str_hash.items())
for ngram in sorted(keys_sorted,key= lambda x : x[1],reverse = True):
    print("\"%s\" with a frequency of %s" % (ngram[0],ngram[1]))

how to calculate rolling volatility

I am trying to design a function that will calculate 30 day rolling volatility.
I have a file with 3 columns: date, and daily returns for 2 stocks.
How can I do this? I have a problem in summing the first 30 entries to get my vol.
So it will read an excel file, with 3 columns: a date, and daily returns.
daily.ret = read.csv("abc.csv")
e.g. date stock1 stock2
01/01/2000 0.01 0.02
etc etc, with years of data. I want to calculate rolling 30 day annualised vol.
This is my function:
calc_30day_vol = function()
{
    stock1 = abc$stock1^2
    stock2 = abc$stock1^2
    j = 30
    approx_days_in_year = length(abc$stock1)/10
    vol_1 = 1: length(a1)
    vol_2 = 1: length(a2)
    for (i in 1 : length(a1))
    {
        vol_1[j] = sqrt( (approx_days_in_year / 30 ) * rowSums(a1[i:j]))
        vol_2[j] = sqrt( (approx_days_in_year / 30 ) * rowSums(a2[i:j]))
        j = j + 1
    }
}
So stock1, and stock 2 are the squared daily returns from the excel file, needed to calculate vol. Entries 1-30 for vol_1 and vol_2 are empty since we are calculating 30 day vol. I am trying to use the rowSums function to sum the squared daily returns for the first 30 entries, and then move down the index for each iteration.
So from day 1-30, day 2-31, day 3-32, etc, hence why I have defined "j".
I'm new at R, so apologies if this sounds rather silly.
This should get you started.
First I have to create some data that look like you describe
library(quantmod) # provides getSymbols, Ad and ROC; also loads xts, zoo and TTR
getSymbols(c("SPY", "DIA"), src='yahoo')
m <- merge(ROC(Ad(SPY)), ROC(Ad(DIA)), all=FALSE)[-1, ]
dat <- data.frame(date=format(index(m), "%m/%d/%Y"), coredata(m))
tmpfile <- tempfile()
write.csv(dat, file=tmpfile, row.names=FALSE)
Now I have a csv with data in your very specific format.
Use read.zoo to read csv and then convert to an xts object (there are lots of ways to read data into R. See R Data Import/Export)
r <- as.xts(read.zoo(tmpfile, sep=",", header=TRUE, format="%m/%d/%Y"))
# each column of r has daily log returns for a stock price series
# use `apply` to apply a function to each column.
vols.mat <- apply(r, 2, function(x) {
    #use rolling 30 day window to calculate standard deviation.
    #annualize by multiplying by square root of time
    runSD(x, n=30) * sqrt(252)
})
#`apply` returns a `matrix`; `reclass` to `xts`
vols.xts <- reclass(vols.mat, r) #class as `xts` using attributes of `r`
tail(vols.xts)
# SPY.Adjusted DIA.Adjusted
#2012-06-22 0.1775730 0.1608266
#2012-06-25 0.1832145 0.1640912
#2012-06-26 0.1813581 0.1621459
#2012-06-27 0.1825636 0.1629997
#2012-06-28 0.1824120 0.1630481
#2012-06-29 0.1898351 0.1689990
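To spell out what the rolling calculation does, here is a plain-Python sketch of the same idea for a single return series: the sample standard deviation over each 30-observation window, scaled by the square root of 252 trading days. The function name and parameters are made up for illustration, and the sketch assumes that runSD uses the sample (n-1) standard deviation, as `statistics.stdev` does.

```python
import math
import statistics

def rolling_annualised_vol(returns, window=30, periods_per_year=252):
    """Rolling sample standard deviation of returns, annualised.

    The first window-1 entries are None because no full window exists yet.
    """
    vols = [None] * len(returns)
    for end in range(window, len(returns) + 1):
        sample = returns[end - window:end]   # observations end-window .. end-1
        vols[end - 1] = statistics.stdev(sample) * math.sqrt(periods_per_year)
    return vols

# Made-up returns alternating between +1% and -1% per day.
vols = rolling_annualised_vol([0.01, -0.01] * 20)
print(vols[28], vols[29])
```

The window slides one observation at a time (days 1-30, 2-31, 3-32, and so on), which is the indexing the question was trying to build by hand with i and j.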
