Is the tf-idf of scikit-learn in this example correct? The most frequent words have high score - tf-idf

from sklearn.feature_extraction.text import TfidfVectorizer
documents = ["The car is driven on the road", "The truck is driven on the highway", "the lorry is"]
{'the': 7, 'car': 0, 'on': 5, 'driven': 1, 'is': 3, 'road': 6, 'lorry': 4, 'truck': 8, 'highway': 2}
[[0.45171082 0.34353772 0. 0.26678769 0. 0.34353772 0.45171082 0.53357537 0. ]
[0. 0.34353772 0.45171082 0.26678769 0. 0.34353772 0. 0.53357537 0.45171082]
[0. 0. 0. 0.45329466 0.76749457 0. 0. 0.45329466 0. ]]
The word "the" should have a low score in the three documents

tfidf = term frequency (tf)* inverse doc frequency (idf)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
print (X.toarray())
print ("---")
t = TfidfTransformer(use_idf=True, norm=None, smooth_idf=False)
a = t.fit_transform(X)
print (a.toarray())
print ("---")
print (t.idf_)
idf(the) is low, but tf(the, doc1) = 2 is high, which pushes its score above those of the other words.
From the above example code:
With no normalisation and unsmoothed idf, idf(is) == idf(the) == 1.
However, tf(the, doc1) = 2 and tf(is, doc1) = 1, which bumps up the value of tfidf(the, doc1) to 2.
Similarly, idf(car) = 2.09861229 but tf(car, doc1) = 1, so tfidf(car, doc1) = 2.09861229, which is very close to tfidf(the, doc1). With the default smoothing, idf(car) drops to ln(4/2) + 1 = 1.693, so tfidf(the, doc1) = 2 overtakes it, which is what your output shows.
On a large corpus the differences become more prominent.
Try running your code with smoothing and normalisation disabled to see the effect on a small corpus.
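If it helps, the unsmoothed, unnormalised scores can also be reproduced by hand. This is only a sketch: it assumes lower-cased whitespace tokenisation (which yields the same tokens as CountVectorizer for these three sentences) and the unsmoothed formula idf(t) = ln(n / df(t)) + 1. The helper names idf and tfidf are made up for illustration.

```python
import math

documents = ["The car is driven on the road",
             "The truck is driven on the highway",
             "the lorry is"]

# Assumption: simple lower-cased whitespace tokenisation.
tokenized = [doc.lower().split() for doc in documents]
n_docs = len(tokenized)


def idf(term):
    """Unsmoothed idf: ln(n / df) + 1."""
    df = sum(term in doc for doc in tokenized)
    return math.log(n_docs / df) + 1.0


def tfidf(term, doc_index):
    """Raw term count in the document times the idf (no normalisation)."""
    return tokenized[doc_index].count(term) * idf(term)


print(tfidf("the", 0))  # 2.0: tf = 2, idf = 1
print(tfidf("car", 0))  # about 2.0986: tf = 1, idf = ln(3) + 1
```

The two scores differ by less than 0.1, so a single extra occurrence of "the" is enough to cancel most of the idf advantage of "car" on a corpus this small.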


How to calculate the median (without np.median) from frequency data?

I am trying to create a function which receives an ordered array of values and associated frequencies as input and produces the median of the observations as output. My idea was to recreate the original data set by repeatedly adding each value, in order, to a new variable according to its frequency of occurrence. After that, I would just call a function I've already created for calculating the median of a set of raw observations.
For example, we have:
severities = np.arange(7)
with_helmet = np.array([248, 58, 11, 3, 2, 8, 1])
Then I want my function to add zero 248 times, one 58 times, and so on. I'm new to numpy, and I'm embarrassed to say I'm not sure how to do this. A helpful function I found was
np.repeat(array, repeats)
but that duplicates each element a set number of times, whereas I want to duplicate each element in values the number of times it occurs (i.e. according to the corresponding frequency value).
Can anyone provide suggestions (in base Python and NumPy only)?
With regard to just calculating the median given frequencies:
import numpy as np
severities = np.arange(7)
with_helmet = np.array([248, 58, 11, 3, 2, 8, 1])
np.median(np.repeat(severities, with_helmet))
will work fine for simple cases.
Then you asked:
You're right about the scaling problem. This isn't a problem for my
data sets, but I wonder how you would approach calculating the median
(without np.median) without recreating the original data set?
Here is a way that will scale better:
Given your data is basically a frequency table:
You can pin the median point (the sum of the frequencies divided by two) to a number in the left column. np.searchsorted provides such functionality, but requires a monotonically increasing array as input (which the frequency column isn't). To make this possible I use np.cumsum over the frequencies to get another representation of them that can be used with np.searchsorted.
Assuming the number column is already sorted, we get an algorithm that is linear in time and space with respect to the length of the table:
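import numpy as np
import unittest
from numpy.typing import ArrayLike

def median_from_frequencies(numbers: ArrayLike, frequencies: ArrayLike):
    """
    numbers: ArrayLike, assumed sorted numbers
    frequencies: ArrayLike, frequencies of said numbers
    """
    bins = np.cumsum(frequencies)
    mid = bins[-1] / 2
    # Index of the bin holding the lower middle observation.
    idx = np.searchsorted(bins, mid)
    result = numbers[idx]
    if mid.is_integer():
        # Even count: average the two middle observations. The upper one sits
        # in a later bin only when the lower one is the last item of its bin.
        upper = np.searchsorted(bins, mid, side="right")
        result = (result + numbers[upper]) / 2
    return result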
class TestMedian(unittest.TestCase):
    def test_simple_length_1(self):
        a = np.array([0])
        numbers = np.array([0])
        frequencies = np.array([1])
        median1 = np.median(a)
        median2 = median_from_frequencies(numbers, frequencies)
        self.assertEqual(median1, median2)

    def test_simple_length_2(self):
        a = np.array([0,1])
        numbers = np.array([0,1])
        frequencies = np.array([1,1])
        median1 = np.median(a)
        median2 = median_from_frequencies(numbers, frequencies)
        self.assertEqual(median1, median2)

    def test_simple_length_4(self):
        a = np.array([1,1,2,2])
        numbers = np.array([1,2])
        frequencies = np.array([2,2])
        median1 = np.median(a)
        median2 = median_from_frequencies(numbers, frequencies)
        self.assertEqual(median1, median2)

    def test_length_5(self):
        a = np.array([10,10,20,30,30])
        numbers = np.array([10,20,30])
        frequencies = np.array([2,1,2])
        median1 = np.median(a)
        median2 = median_from_frequencies(numbers, frequencies)
        self.assertEqual(median1, median2)

    def test_length_7(self):
        a = np.array([1,1,2,2,7,7,7])
        numbers = np.array([1,2,7])
        frequencies = np.array([2,2,3])
        median1 = np.median(a)
        median2 = median_from_frequencies(numbers, frequencies)
        self.assertEqual(median1, median2)

    def test_your_numbers(self):
        severities = np.arange(7)
        with_helmet = np.array([248, 58, 11, 3, 2, 8, 1])
        a = np.repeat(severities, with_helmet)
        numbers = severities
        frequencies = with_helmet
        median1 = np.median(a)
        median2 = median_from_frequencies(numbers, frequencies)
        self.assertEqual(median1, median2)

if __name__ == '__main__':
    unittest.main()

Ran 6 tests in 0.002s
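Since the question asked for base Python as well, the same cumulative-count idea can be sketched without NumPy. This is only an illustration: the function name median_from_counts is made up, and it assumes the values are sorted and the total count is at least 1.

```python
from bisect import bisect_left
from itertools import accumulate


def median_from_counts(values, counts):
    """Median of the data described by sorted values and their counts."""
    cumulative = list(accumulate(counts))  # running totals of the counts
    total = cumulative[-1]

    def value_at(position):
        # 1-based position in the expanded, sorted data set.
        return values[bisect_left(cumulative, position)]

    if total % 2 == 1:
        return value_at((total + 1) // 2)
    # Even total: average the two middle observations.
    return (value_at(total // 2) + value_at(total // 2 + 1)) / 2


severities = list(range(7))
with_helmet = [248, 58, 11, 3, 2, 8, 1]
print(median_from_counts(severities, with_helmet))  # 0
```

The expanded data set is never built, so memory use depends only on the number of distinct values.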
If I understand, this is it:
import numpy as np
import collections
severities = np.arange(7)
with_helmet = np.array([248, 58, 11, 3, 2, 8, 1])
ans = np.repeat(severities, with_helmet)
counter = collections.Counter(ans)

Tensorflow JS Probabilities

I have multiple feature columns and a result column, and I want to predict whether something happens or not.
So I'm training my model, and finally I do
const predictions = model.predict(xTest).argMax(-1);
This returns a tensor, and when getting the data with:
predictions.dataSync()
I get values like [0, 1, 1, 1, 0, ...]
Is there any way to get probabilities like in Python? [0.121, 0.421, 0.8621, ...]
I only found one result:
Is this still the case? Are there no probabilities in JavaScript?
tf.argMax returns the indices of the maximum value along the axis. If you would rather have the maximum value itself, you can use tf.max instead:
const x = tf.tensor2d([[1, 2, 3],[ 4, 8, 4]]);
x.max(-1).print() // [3, 8]
x.argMax(-1).print() // [2, 1]

You have an array of integers, and for each index you want to find the product of every integer except the integer at that index

I was going over some interview questions and came across this one at a website. I have come up with a solution in Ruby, and I wish to know if it is efficient and an acceptable solution. I have been coding for a while now, but never concentrated on complexity of a solution before. Now I am trying to learn to minimize the time and space complexity.
You have an array of integers, and for each index you want to find the product of every integer except the integer at that index.
arr = [1,2,4,5]
result = [40, 20, 10, 8]
# result = [2*4*5, 1*4*5, 1*2*5, 1*2*4]
With that in mind, I came up with this solution.
def find_products(input_array)
  product_array = []
  input_array.length.times do |iteration|
    a = input_array.shift
    product_array << input_array.inject(:*)
    input_array << a
  end
  product_array
end

arr = find_products([1,7,98,4])
From what I understand, I am accessing the array as many times as its length, which is considered to be terrible in terms of efficiency and speed. I am still unsure on what is the complexity of my solution.
Any help in making it more efficient is appreciated and if you can also tell the complexity of my solution and how to calculate that, it will be even better.
def product_of_others(arr)
  case arr.count(0)
  when 0
    total = arr.reduce(1,:*)
    arr.map { |n| total/n }
  when 1
    ndx_of_0 = arr.index(0)
    arr.each_with_index.map do |n,i|
      if i==ndx_of_0
        arr[0,ndx_of_0].reduce(1,:*) * arr[ndx_of_0+1..-1].reduce(1,:*)
      else
        0
      end
    end
  else
    arr.map { 0 }
  end
end

product_of_others [1,2,4,5] #=> [40, 20, 10, 8]
product_of_others [1,-2,0,5] #=> [0, 0, -10, 0]
product_of_others [0,-2,4,5] #=> [-40, 0, 0, 0]
product_of_others [1,-2,4,0] #=> [0, 0, 0, -8]
product_of_others [1,0,4,0] #=> [0, 0, 0, 0]
product_of_others [] #=> []
For the case where arr contains no zeroes I used arr.reduce(1,:*) rather than arr.reduce(:*) in case the array is empty. Similarly, if arr contains one zero, I used .reduce(1,:*) in case the zero was at the beginning or end of the array.
For inputs not containing zeros (for others, see below)
Easiest (and relatively efficient) to me seems to first get the total product:
total_product = array.inject(1){|product, number| product * number}
And then map each array element to the total_product divided by the element:
result = array.map {|number| total_product / number}
After initial calculation of total_product = 1*2*4*5 this will calculate
result = [40/1, 40/2, 40/4, 40/5]
As far as I remember, this sums up to O(n) [creating the total product: touch each number once] + O(n) [creating one result per number: touch each number once], which is O(n) overall. (Correct me if I am wrong.)
As @hirolau and @CarySwoveland pointed out, there is a problem if you have (exactly 1) 0 in the input, thus:
For inputs containing zeros (workaroundish, but keeps the performance benefit and complexity class)
zero_count = array.count{|number| number == 0}
if zero_count == 0
  # as before
elsif zero_count == 1
  # one zero in input, result will only have 1 non-zero
  nonzero_array = array.reject{|n| n == 0}
  total_product = nonzero_array.inject(1){|product, number| product * number}
  result = array.map do |number|
    (number == 0) ? total_product : 0
  end
else
  # more than one zero? All products will be zero!
  result = array.map{|_| 0}
end
Sorry that this answer by now basically equals @CarySwoveland's, but I think my code is more explicit.
Look at the comments about further performance considerations.
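For comparison, here is a sketch of a different, division-free technique (running prefix and suffix products). It is not what the answers above do, but it is also O(n) and needs no special-casing of zeros. The function name is reused only for illustration.

```python
def product_of_others(values):
    """For each index, the product of every element except the one there."""
    n = len(values)
    result = [1] * n

    # First pass: result[i] = product of everything to the left of i.
    prefix = 1
    for i in range(n):
        result[i] = prefix
        prefix *= values[i]

    # Second pass: multiply in the product of everything to the right of i.
    suffix = 1
    for i in range(n - 1, -1, -1):
        result[i] *= suffix
        suffix *= values[i]

    return result


print(product_of_others([1, 2, 4, 5]))   # [40, 20, 10, 8]
print(product_of_others([1, -2, 0, 5]))  # [0, 0, -10, 0]
```

Each element is touched twice, so the work grows linearly with the array length.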
Here is how I would do it:
arr = [1,2,4,5]
result = arr.map do |x|
  new_array = arr.dup # Create a copy of original array
  new_array.delete_at(arr.index(x)) # Remove an instance of the current value
  new_array.inject(:*) # Return the product.
end

p result # => [40, 20, 10, 8]
I don't know Ruby, but accessing an array element is O(1), i.e. constant time. Your loop, however, runs n times and each iteration multiplies the remaining n-1 elements, so the overall complexity of your algorithm is O(n^2). The total-product approach in the other answers brings this down to O(n). Real-world speed is another issue, but for small inputs your solution is fine.

When / How to stop Ruby loop that creates nested arrays within nested arrays

I'm unsure when to end the loop that runs the map statement; the times call is only a placeholder showing where the loop should be and what code it should contain. I would like to run it until the first value of the created multidimensional array is 0 (because it will consistently be the largest value until it becomes 0 itself and creates the last nested array), but I'm completely stumped on how to do so. Any help would be greatly appreciated!
def wonky_coins(n)
  coins = [n]
  if n == 0
    return 1
  end
  i = 1
  n.times do
    coins.map! do |x|
      if x != 0
        i+= 2
        o = []
        o << x/2
        o << x/3
        o << x/4
        x = o
        puts x
      end
    end
  end
  return i
end
# Catsylvanian money is a strange thing: they have a coin for every
# denomination (including zero!). A wonky change machine in
# Catsylvania takes any coin of value N and returns 3 new coins,
# valued at N/2, N/3 and N/4 (rounding down).
# Write a method `wonky_coins(n)` that returns the number of coins you
# are left with if you take all non-zero coins and keep feeding them
# back into the machine until you are left with only zero-value coins.
# Difficulty: 3/5
describe "#wonky_coins" do
  it "handles a coin of value 1" do
    wonky_coins(1).should == 3
  end

  it "handles a coin of value 5" do
    wonky_coins(5).should == 11
    # 11
    # => [2, 1, 1]
    # => [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
    # => [[[0, 0, 0], 0, 0], [0, 0, 0], [0, 0, 0]]
  end

  it "handles a coin of value 6" do
    wonky_coins(6).should == 15
  end

  it "handles being given the zero coin" do
    wonky_coins(0).should == 1
  end
end
First of all, you should not have nested arrays. You want to flatten the array after each pass, so you just have coins; even better, use flat_map to do it in one step. 0 produces just itself: [0]; don't forget it or your code will lose all of your target coins!
Next, there is no logic to doing it n times that I can see. No fixed number of times will do. You want to do it until all coins are zero. You can set a flag at the top (all_zero = true), and flip it when you find a non-zero coin, that should tell your loop that further iterations are needed.
Also, you don't need to track the number of coins, since the number will be the final size of the array.
And finally, and unrelated to the problem, do get into the habit of using correct indentation. For one thing, bad indentation makes it harder for you to debug and maintain the code; for another, it makes many SO people not want to bother reading your question.
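The advice above (keep a flat list, loop until every coin is zero, return the list size) could be sketched like this. It is written in Python purely as an illustration of the logic, not as the asker's final Ruby code.

```python
def wonky_coins(n):
    """Number of coins left once only zero-value coins remain."""
    coins = [n]
    # any(coins) is True while at least one coin is non-zero.
    while any(coins):
        next_coins = []
        for coin in coins:
            if coin == 0:
                next_coins.append(0)  # a zero coin just stays as it is
            else:
                # The machine swaps one coin for three (rounding down).
                next_coins.extend([coin // 2, coin // 3, coin // 4])
        coins = next_coins  # already flat, so no nested arrays to unpack
    return len(coins)


print(wonky_coins(5))  # 11
```

Building a fresh flat list on each pass plays the same role as flat_map in Ruby.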
Went back a bit later now knowing about how to use .flatten and I got it! Thanks to @Amadan for the helpful tips. Feel free to leave any comments concerning my syntax as I am just starting out and can use all the constructive feedback I can get!
def wonky_coins(n)
  coins = [n]
  return 1 if n == 0
  until coins[0] == 0
    coins.map! { |x|
      next 0 if x == 0 # keep zero coins as 0 rather than nil
      x = [x/2, x/3, x/4]
    }
    coins.flatten!
  end
  return coins.length
end

Weighted random selection from array

I would like to randomly select one element from an array, but each element has a known probability of selection.
All chances together (within the array) sum to 1.
What algorithm would you suggest as the fastest and most suitable for huge calculations?
id => chance
0 => 0.8
1 => 0.2
For this example, the algorithm in question should, over many calls, statistically return id 0 four times for every one time it returns id 1.
Compute the discrete cumulative distribution function (CDF) of your list, or in simple terms the array of cumulative sums of the weights. Then generate a random number in the range between 0 and the sum of all weights (which would be 1 in your case), do a binary search to find this random number in your discrete CDF array, and get the value corresponding to this entry. This is your weighted random number.
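A minimal sketch of this CDF-plus-binary-search idea, using only the Python standard library. The function names build_cdf, pick_index and weighted_choice are invented for this example.

```python
import bisect
import itertools
import random


def build_cdf(weights):
    """Cumulative sums of the weights; done once, O(n)."""
    return list(itertools.accumulate(weights))


def pick_index(cdf, u):
    """Index of the first cumulative sum strictly greater than u; O(log n)."""
    # min() guards against u landing exactly on the total through rounding.
    return min(bisect.bisect_right(cdf, u), len(cdf) - 1)


def weighted_choice(items, cdf, rng=random):
    """Pick one item with probability proportional to its weight."""
    return items[pick_index(cdf, rng.random() * cdf[-1])]


cdf = build_cdf([0.8, 0.2])
picks = [weighted_choice([0, 1], cdf) for _ in range(10000)]
print(picks.count(0), picks.count(1))  # roughly 8000 and 2000
```

The cumulative array is built once and reused for every draw, so each selection costs only the binary search.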
The algorithm is straightforward:
rand_num = rand(0,1)
for each element in array
    if (rand_num < element.probability)
        select and break
    rand_num = rand_num - element.probability
I have found this article to be the most useful at understanding this problem fully.
This stackoverflow question may also be what you're looking for.
I believe the optimal solution is to use the Alias Method (wikipedia).
It requires O(n) time to initialize, O(1) time to make a selection, and O(n) memory.
Here is the algorithm for generating the result of rolling a weighted n-sided die (from here it is trivial to select an element from a length-n array), as taken from this article.
The author assumes you have functions for rolling a fair die (floor(random() * n)) and flipping a biased coin (random() < p).
Algorithm: Vose's Alias Method
Initialization:
1. Create arrays Alias and Prob, each of size n.
2. Create two worklists, Small and Large.
3. Multiply each probability by n.
4. For each scaled probability p_i:
   a. If p_i < 1, add i to Small.
   b. Otherwise (p_i ≥ 1), add i to Large.
5. While Small and Large are not empty (Large might be emptied first):
   a. Remove the first element from Small; call it l.
   b. Remove the first element from Large; call it g.
   c. Set Prob[l] = p_l.
   d. Set Alias[l] = g.
   e. Set p_g := (p_g + p_l) − 1. (This is a more numerically stable option.)
   f. If p_g < 1, add g to Small.
   g. Otherwise (p_g ≥ 1), add g to Large.
6. While Large is not empty:
   a. Remove the first element from Large; call it g.
   b. Set Prob[g] = 1.
7. While Small is not empty (this is only possible due to numerical instability):
   a. Remove the first element from Small; call it l.
   b. Set Prob[l] = 1.
Generation:
1. Generate a fair die roll from an n-sided die; call the side i.
2. Flip a biased coin that comes up heads with probability Prob[i].
3. If the coin comes up "heads," return i.
4. Otherwise, return Alias[i].
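The steps above could be sketched in Python roughly as follows. This is an illustration rather than a reference implementation: the function names are invented, the probabilities are assumed to sum to 1, and the worklists pop from the end instead of the front, which does not affect correctness.

```python
import random


def build_alias_table(probabilities):
    """Initialization: returns the Prob and Alias arrays in O(n)."""
    n = len(probabilities)
    scaled = [p * n for p in probabilities]
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]

    while small and large:
        l = small.pop()
        g = large.pop()
        prob[l] = scaled[l]
        alias[l] = g
        # g donates the part of its mass that fills up l's slot.
        scaled[g] = (scaled[g] + scaled[l]) - 1.0
        (small if scaled[g] < 1.0 else large).append(g)

    # Leftovers (from either list) fill their own slot completely.
    for i in large + small:
        prob[i] = 1.0
    return prob, alias


def alias_sample(prob, alias, rng=random):
    """Generation: one fair die roll plus one biased coin flip, O(1)."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


prob, alias = build_alias_table([0.8, 0.2])
print(alias_sample(prob, alias))  # 0 about 80% of the time
```

The table is built once; after that every draw costs the same regardless of n.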
Here is an implementation in Ruby:
def weighted_rand(weights = {})
  # Compare with a tolerance: floating-point sums rarely hit 1.0 exactly
  raise 'Probabilities must sum up to 1' unless (weights.values.inject(0.0, :+) - 1.0).abs < 1e-9
  raise 'Probabilities must not be negative' unless weights.values.all? { |p| p >= 0 }
  # Do more sanity checks depending on the amount of trust in the software component using this method,
  # e.g. don't allow duplicates, don't allow non-numeric values, etc.

  # Ignore elements with probability 0
  weights = weights.reject { |k, v| v == 0.0 } # e.g. => {"a"=>0.4, "b"=>0.4, "c"=>0.2}

  # Accumulate probabilities and map them to a value
  u = 0.0
  ranges = weights.map { |v, p| [u += p, v] } # e.g. => [[0.4, "a"], [0.8, "b"], [1.0, "c"]]

  # Generate a (pseudo-)random floating point number between 0.0(included) and 1.0(excluded)
  u = rand # e.g. => 0.4651073966724186

  # Find the first value that has an accumulated probability greater than the random number u
  ranges.find { |p, v| p > u }.last # e.g. => "b"
end
How to use:
weights = {'a' => 0.4, 'b' => 0.4, 'c' => 0.2, 'd' => 0.0}
weighted_rand weights
What to expect roughly:
sample = 1000.times.map { weighted_rand weights }
sample.count('a') # 396
sample.count('b') # 406
sample.count('c') # 198
sample.count('d') # 0
An example in ruby
#each element is associated with its probability
a = {1 => 0.25 ,2 => 0.5 ,3 => 0.2, 4 => 0.05}
#at some point, convert to cumulative probability
acc = 0
a.each { |e,w| a[e] = acc+=w }
#to select an element, pick a random number between 0 and 1 and find the first
#cumulative probability that's greater than the random number
r = rand
selected = a.find{ |e,w| w>r }
p selected[0]
This can be done in O(1) expected time per sample as follows.
Compute the CDF F(i) for each element i as the sum of the probabilities of all elements with index less than or equal to i.
Define the range r(i) of an element i to be the interval [F(i - 1), F(i)], with F(0) = 0.
For each interval [(i - 1)/n, i/n], create a bucket consisting of the list of the elements whose range overlaps the interval. This takes O(n) time in total for the full array as long as you are reasonably careful.
When you randomly sample the array, you simply compute which bucket the random number is in, and compare with each element of the list until you find the interval that contains it.
The cost of a sample is O(the expected length of a randomly chosen list) <= 2.
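A rough sketch of how the bucket construction and lookup might look. The names build_buckets and bucket_lookup are invented, elements are indexed from 0, and the probabilities are assumed to sum to 1 (the last CDF entry is forced to 1.0 only to absorb rounding).

```python
from itertools import accumulate


def build_buckets(probabilities):
    """Return (cdf, buckets); buckets[b] lists elements overlapping [b/n, (b+1)/n]."""
    n = len(probabilities)
    cdf = list(accumulate(probabilities))
    cdf[-1] = 1.0  # assumption: inputs sum to 1
    buckets = [[] for _ in range(n)]
    i = 0
    for b in range(n):
        upper = (b + 1) / n
        buckets[b].append(i)
        # Keep adding elements until one reaches the bucket's upper edge.
        while cdf[i] < upper and i < n - 1:
            i += 1
            buckets[b].append(i)
    return cdf, buckets


def bucket_lookup(cdf, buckets, u):
    """Map a uniform number u in [0, 1) to an element index."""
    b = min(int(u * len(buckets)), len(buckets) - 1)
    for i in buckets[b]:
        if u < cdf[i]:
            return i
    return buckets[b][-1]  # fallback for rounding at a bucket edge


cdf, buckets = build_buckets([0.25, 0.5, 0.2, 0.05])
print(buckets)  # [[0], [0, 1], [1], [1, 2, 3]]
```

The index i only ever moves forward during construction, so the buckets hold fewer than 2n entries in total, which is where the bound on the expected list length comes from.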
This is PHP code I used in production:
/**
 * @return \App\Models\CdnServer
 */
protected function selectWeightedServer(Collection $servers)
{
    if ($servers->count() == 1) {
        return $servers->first();
    }

    $totalWeight = 0;
    foreach ($servers as $server) {
        $totalWeight += $server->getWeight();
    }

    // Select a random server using weighted choice
    $randWeight = mt_rand(1, $totalWeight);
    $accWeight = 0;
    foreach ($servers as $server) {
        $accWeight += $server->getWeight();
        if ($accWeight >= $randWeight) {
            return $server;
        }
    }
}
Ruby solution using the pickup gem:
require 'pickup'
chances = {0=>80, 1=>20}
picker = Pickup.new(chances)
5.times.collect {
  picker.pick(5)
}
gave output:
[[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 1, 1],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 1]]
If the array is small, I would give the array a length of, in this case, five and assign the values as appropriate:
0 => 0
1 => 0
2 => 0
3 => 0
4 => 1
"Wheel of Fortune" O(n), use for small arrays only:
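function pickRandomWeighted(array, weights) {
    var sum = 0;
    for (var i = 0; i < weights.length; i++) sum += weights[i];
    var pick = Math.random() * sum;
    for (var i = 0; i < weights.length; i++) {
        // Subtract the current weight before moving on to the next index
        pick -= weights[i];
        if (pick < 0) return array[i];
    }
    return array[array.length - 1]; // guard against floating-point rounding
}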
The trick could be to sample an auxiliary array in which each element is repeated in proportion to its probability.
Given the elements associated with their probability, as a fraction of 1:
h = {1 => 0.5, 2 => 0.3, 3 => 0.05, 4 => 0.05 }
auxiliary_array = h.inject([]){|memo,(k,v)| memo += Array.new((100*v).to_i,k) }
ruby-1.9.3-p194 > auxiliary_array
=> [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4]
If you want to be as generic as possible, you need to calculate the multiplier based on the maximum number of fractional digits, and use it in place of 100:
m = 10**h.values.collect{|e| e.to_s.split(".").last.size }.max
Another possibility is to associate, with each element of the array, a random number drawn from an exponential distribution with parameter given by the weight for that element. Then pick the element with the lowest such ‘ordering number’. In this case, the probability that a particular element has the lowest ordering number of the array is proportional to the array element's weight.
This is O(n), doesn't involve any reordering or extra storage, and the selection can be done in the course of a single pass through the array. The weights must be greater than zero, but don't have to sum to any particular value.
This has the further advantage that, if you store the ordering number with each array element, you have the option to sort the array by increasing ordering number, to get a random ordering of the array in which elements with higher weights have a higher probability of coming early (I've found this useful when deciding which DNS SRV record to pick, to decide which machine to query).
Repeated random sampling with replacement requires a new pass through the array each time; for random selection without replacement, the array can be sorted in order of increasing ordering number, and k elements can be read out in that order.
See the Wikipedia page about the exponential distribution (in particular the remarks about the distribution of the minima of an ensemble of such variates) for the proof that the above is true, and also for the pointer towards the technique of generating such variates: if T has a uniform random distribution in [0,1), then Z=-log(1-T)/w (where w is the parameter of the distribution; here the weight of the associated element) has an exponential distribution.
That is:
For each element i in the array, calculate zi = -log(T)/wi (or zi = -log(1-T)/wi), where T is drawn from a uniform distribution in [0,1), and wi is the weight of the i-th element.
Select the element which has the lowest zi.
The element i will be selected with probability wi/(w1+w2+...+wn).
See below for an illustration of this in Python, which takes a single pass through the array of weights, for each of 10000 trials.
import math, random

weights = [10, 20, 50, 20]
nw = len(weights)
results = [0 for i in range(nw)]

n = 10000
while n > 0: # do n trials
    smallest_i = 0
    smallest_z = -math.log(1-random.random())/weights[0]
    for i in range(1, nw):
        z = -math.log(1-random.random())/weights[i]
        if z < smallest_z:
            smallest_i = i
            smallest_z = z
    results[smallest_i] += 1 # accumulate our choices
    n -= 1

for i in range(nw):
    print("{} -> {}".format(weights[i], results[i]))
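The without-replacement variant mentioned earlier (sort by ordering number and read off the first k elements) might be sketched like this. The function name is made up, and the weights are assumed to be strictly positive.

```python
import math
import random


def weighted_sample_without_replacement(weights, k, rng=random):
    """Return k distinct indices; heavier weights tend to come first."""
    # One exponential "ordering number" per element: z = -log(1 - T) / w.
    keyed = [(-math.log(1.0 - rng.random()) / w, i)
             for i, w in enumerate(weights)]
    keyed.sort()  # increasing ordering number
    return [i for _, i in keyed[:k]]


print(weighted_sample_without_replacement([10, 20, 50, 20], 2))
```

Sorting makes this O(n log n); for a single pick the one-pass minimum search above is enough.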
Edit (for history): after posting this, I felt sure I couldn't be the first to have thought of it, and another search with this solution in mind shows that this is indeed the case.
In an answer to a similar question, Joe K suggested this algorithm (and also noted that someone else must have thought of it before).
Another answer to that question, meanwhile, pointed to Efraimidis and Spirakis (preprint), which describes a similar method.
I'm pretty sure, looking at it, that the Efraimidis and Spirakis algorithm is in fact the same exponential-distribution algorithm in disguise, and this is corroborated by a passing remark in the Wikipedia page about Reservoir sampling that ‘[e]quivalently, a more numerically stable formulation of this algorithm’ is the exponential-distribution algorithm above. The reference there is to a sequence of lecture notes by Richard Arratia; the relevant property of the exponential distribution is mentioned in Sect.1.3 (which mentions that something similar to this is a ‘familiar fact’ in some circles), but not its relationship to the Efraimidis and Spirakis algorithm.
I would imagine that numbers greater than or equal to 0.8 but less than 1.0 select the third element.
In other terms:
x is a random number between 0 and 1
if 0.0 <= x < 0.2 : Item 1
if 0.2 <= x < 0.8 : Item 2
if 0.8 <= x < 1.0 : Item 3
I am going to improve on a previous answer.
Basically you make one big array where the number of times an element shows up is proportional to the weight.
It has some drawbacks.
The weight might not be an integer. Imagine element 1 has a probability of pi and element 2 has a probability of 1-pi. How do you divide that? Or imagine if there are hundreds of such elements.
The array created can be very big. Imagine the least common multiple is 1 million; then we will need an array of 1 million elements to pick from.
To counter that, this is what you do.
Create such an array, but only insert each element randomly. The probability that an element is inserted is proportional to the weight.
Then select a random element from that array as usual.
So if there are 3 elements with various weights, you simply pick an element from an array of 1-3 elements.
Problems may arise if the constructed array is empty. That is, it just happens that no elements show up in the array because their dice rolled differently.
In which case, I propose that the probability an element is inserted is p(inserted)=wi/wmax.
That way, one element, namely the one that has the highest probability, will always be inserted. The other elements will be inserted according to their relative probability.
Say we have 2 objects.
Element 1 shows up 20% of the time.
Element 2 shows up 40% of the time and has the highest probability.
In the array, element 2 will show up all the time. Element 1 will show up half the time.
So element 2 will be picked 2 times as often as element 1. In general, all other elements will be picked in proportion to their weight. Also, the probabilities sum to 1 because the array will always have at least 1 element.
I wrote an implementation in C#:
O(1) gets (fast!), O(n) recalculates, O(n) memory use.
