How to search efficiently given 1mil data in array of integers - ruby - arrays

I have a mtd to search through 1mil or more records (stored as an arraylist of integers in asc order) to check if the pass in empID belongs to one of the records stored.
Currently, i uses sequential search through for loop. How to make it more efficient/faster?
def exist?(id)
for i in 0...$employee_list.length
if $employee_list[i] == id # match!
return true
elsif $employee_list[i] > id # have already gone beyond the point where id should've been found
return false
end
end
return false # cannot find id in the list
end
I also tried using hash as follows but still not fast enough.
hash = $employee_list.map{ |i | i}
if hash.include? id
return true
else
return false
end

Use a Set, unless you have proven that you cannot afford the memory:
# Do this just once
require 'set'
$employee_ids = Set.new $employee_list
# Do this each time you need to check
def exist?(id)
$employee_ids.include?(id)
end
It will be nearly instantaneous, regardless of the number of ids you have.

If you can't use a Set instead of an Array (for space reasons), and if your Array is sorted, you can use Array#bsearch with a block that returns an integer (like <=>).

Try this
array.bsearch {|x| number <=> x }
This does a binary search on the array. The array MUST be sorted.
Note that the element x is to the right side of the spaceship operator!
Use the ri command to read more documentation on the bsearch method. The time complexity of binary search is O(log n). Which is 20 steps only for an array of length 1 million.

Related

How to find a number that was repeated (n/3) times an array of size n, in O(n) time and O(n) space?

I have this question that I just can't figure it out! Any hints would mean a lot. Thank you in advance.
I have an array, A. It's size is n, and I want to find an algorithm that will find x that appears in this array at least n/3 times. If there is no such x in the array then we will print that we didn't find one!
I need to find an algorithm that does this in O(n) time and takes O(n) space.
For example:
A=[1 1 2 2 1 1 1 5 6 7]
For the above array, the algorithm should return 1.
If I was you, I write an algorithm that:
Instantiates a map (i.e. key/value pairs) in whatever language you're using. The key will be the integer you find, the value will be the number of times it has been seen so far.
Iterate over the array. For the current integer, check whether the number exists as a key in your map. If it exists, increment the map's value. If it doesn't exist, insert a new element with a count of 1.
After the iteration is complete, iterate over your map. If any elements have counts of greater than n/3, print it out. Handle the case where none are found, etc.
Here is my solution in pseudocode; note that it is possible to have two solutions as well as one or none:
func anna(A, n) # array and length
ht := {} # create empty hash table
for k in [0,n) # iterate over array
if A[k] in ht # previously seen
ht{k} := ht{k} + 1 # increment count
else # previously seen
ht{k} := 1 # initialize count
solved := False # flag if solution found
for k in keys(ht) # iterate over hash table
if ht{k} > n / 3 # found solution
solved := True # update flag
print k # write it
if not solved # no solution found
print "No solution" # report failure
The first for loop takes O(n) time. The second for loop potentially takes O(n) time if all items in the array are distinct, though most often the second for loop will take much less time. The hash table takes O(n) space if all items in the array are distinct, though most often it takes much less space.
It is possible to optimize the solution so it stops early and reports failure if there are no possible solutions. To do that, keep a variable max in the first for loop, increment it every time it is exceeded by a new hash table count, and check after each element is added to the hash table if max + n - k < n / 3.

algorithm which finds the numbers in a sequence which appear 3 times or more, and prints their indexes

Suppose I input a sequence of numbers which ends with -1.
I want to print all the values of the sequence that occur in it 3 times or more, and also print their indexes in the sequence.
For example , if the input is : 2 3 4 2 2 5 2 4 3 4 2 -1
so the expected output in that case is :
2: 0 3 4 6 10
4: 2 7 9
First I thought of using quick-sort , but then I realized that as a result I will lose the original indexes of the sequence. I also have been thinking of using count, but that sequence has no given range of numbers - so maybe count will be no good in that case.
Now I wonder if I might use an array of pointers (but how?)
Do you have any suggestions or tips for an algorithm with time complexity O(nlogn) for that ? It would be very appreciated.
Keep it simple!
The easiest way would be to scan the sequence and count the number of occurrence of each element, put the elements that match the condition in an auxiliary array.
Then, for each element in the auxiliary array, scan the sequence again and print out the indices.
First of all, sorry for my bad english (It's not my language) I'll try my best.
So similar to what #vvigilante told, here is an algorithm implemented in python (it is in python because is more similar to pseudo code, so you can translate it to any language you want, and moreover I add a lot of comment... hope you get it!)
from typing import Dict, List
def three_or_more( input_arr:int ) -> None:
indexes: Dict[int, List[int]] = {}
#scan the array
i:int
for i in range(0, len(input_arr)-1):
#create list for the number in position i
# (if it doesn't exist)
#and append the number
indexes.setdefault(input_arr[i],[]).append(i)
#for each key in the dictionary
n:int
for n in indexes.keys():
#if the number of element for that key is >= 3
if len(indexes[n]) >= 3:
#print the key
print("%d: "%(n), end='')
#print each element int the current key
el:int
for el in indexes[n]:
print("%d,"%(el), end='')
#new line
print("\n", end='')
#call the function
three_or_more([2, 3, 4, 2, 2, 5, 2, 4, 3, 4, 2, -1])
Complexity:
The first loop scan the input array = O(N).
The second one check for any number (digit) in the array,
since they are <= N (you can not have more number than element), so it is O(numbers) the complexity is O(N).
The loop inside the loop go through all indexes corresponding to the current number...
the complexity seem to be O(N) int the worst case (but it is not)
So the complexity would be O(N) + O(N)*O(N) = O(N^2)
but remember that the two nest loop can at least print all N indexes, and since the indexes are not repeated the complexity of them is O(N)...
So O(N)+O(N) ~= O(N)
Speaking about memory it is O(N) for the input array + O(N) for the dictionary (because it contain all N indexes) ~= O(N).
Well if you do it in c++ remember that maps are way slower than array, so if N is small, you should use an array of array (or std::vector> ), else you can also try an unordered map that use hashes
P.S. Remember that get the size of a vector is O(1) time because it is a difference of pointers!
Starting with a sorted list is a good idea.
You could create a second array of original indices and duplicate all of the memory moves for the sort on the indices array. Then checking for triplicates is trivial and only requires sort + 1 traversal.

How do only add item to an array if it doesn't exist already (in a case insensitive way)? [duplicate]

I want to know what's the best way to make the String.include? methods ignore case. Currently I'm doing the following. Any suggestions? Thanks!
a = "abcDE"
b = "CD"
result = a.downcase.include? b.downcase
Edit:
How about Array.include?. All elements of the array are strings.
Summary
If you are only going to test a single word against an array, or if the contents of your array changes frequently, the fastest answer is Aaron's:
array.any?{ |s| s.casecmp(mystr)==0 }
If you are going to test many words against a static array, it's far better to use a variation of farnoy's answer: create a copy of your array that has all-lowercase versions of your words, and use include?. (This assumes that you can spare the memory to create a mutated copy of your array.)
# Do this once, or each time the array changes
downcased = array.map(&:downcase)
# Test lowercase words against that array
downcased.include?( mystr.downcase )
Even better, create a Set from your array.
# Do this once, or each time the array changes
downcased = Set.new array.map(&:downcase)
# Test lowercase words against that array
downcased.include?( mystr.downcase )
My original answer below is a very poor performer and generally not appropriate.
Benchmarks
Following are benchmarks for looking for 1,000 words with random casing in an array of slightly over 100,000 words, where 500 of the words will be found and 500 will not.
The 'regex' text is my answer here, using any?.
The 'casecmp' test is Arron's answer, using any? from my comment.
The 'downarray' test is farnoy's answer, re-creating a new downcased array for each of the 1,000 tests.
The 'downonce' test is farnoy's answer, but pre-creating the lookup array once only.
The 'set_once' test is creating a Set from the array of downcased strings, once before testing.
user system total real
regex 18.710000 0.020000 18.730000 ( 18.725266)
casecmp 5.160000 0.000000 5.160000 ( 5.155496)
downarray 16.760000 0.030000 16.790000 ( 16.809063)
downonce 0.650000 0.000000 0.650000 ( 0.643165)
set_once 0.040000 0.000000 0.040000 ( 0.038955)
If you can create a single downcased copy of your array once to perform many lookups against, farnoy's answer is the best (assuming you must use an array). If you can create a Set, though, do that.
If you like, examine the benchmarking code.
Original Answer
I (originally said that I) would personally create a case-insensitive regex (for a string literal) and use that:
re = /\A#{Regexp.escape(str)}\z/i # Match exactly this string, no substrings
all = array.grep(re) # Find all matching strings…
any = array.any?{ |s| s =~ re } # …or see if any matching string is present
Using any? can be slightly faster than grep as it can exit the loop as soon as it finds a single match.
For an array, use:
array.map(&:downcase).include?(string)
Regexps are very slow and should be avoided.
You can use casecmp to do your comparison, ignoring case.
"abcdef".casecmp("abcde") #=> 1
"aBcDeF".casecmp("abcdef") #=> 0
"abcdef".casecmp("abcdefg") #=> -1
"abcdef".casecmp("ABCDEF") #=> 0
class String
def caseinclude?(x)
a.downcase.include?(x.downcase)
end
end
my_array.map!{|c| c.downcase.strip}
where map! changes my_array, map instead returns a new array.
To farnoy in my case your example doesn't work for me. I'm actually looking to do this with a "substring" of any.
Here's my test case.
x = "<TD>", "<tr>", "<BODY>"
y = "td"
x.collect { |r| r.downcase }.include? y
=> false
x[0].include? y
=> false
x[0].downcase.include? y
=> true
Your case works with an exact case-insensitive match.
a = "TD", "tr", "BODY"
b = "td"
a.collect { |r| r.downcase }.include? b
=> true
I'm still experimenting with the other suggestions here.
---EDIT INSERT AFTER HERE---
I found the answer. Thanks to Drew Olsen
var1 = "<TD>", "<tr>","<BODY>"
=> ["<TD>", "<tr>", "<BODY>"]
var2 = "td"
=> "td"
var1.find_all{|item| item.downcase.include?(var2)}
=> ["<TD>"]
var1[0] = "<html>"
=> "<html>"
var1.find_all{|item| item.downcase.include?(var2)}
=> []

Compare hash values with corresponding array values in Ruby

I've scripted my way into a corner where I now need to compare the values of a hash to the corresponding element in an array.
I have two "lists" of the same values, sorted in different ways, and they should be identical. The interesting part is when they don't, so I need to identify those cases. So basically I need to check whether the first value of the first key-value pair in the hash is identical to the first element of the array, and likewise the second value checked towards the second element and so on for the entire set of values in the hash.
I'm sort of new to Ruby scripting, but though this should be easy enough, but alas....
Sounds like all you need is something simple like:
hash.keys == array
The keys should come out in the same order as they are in the Hash so this is comparing the first key of the hash with the first element of the array, the second key with the second array element, ...
You could also transliterate what you're saying into Ruby like this:
hash.each_with_index.all? { |(k, _), i| k == array[i] }
Or you could say:
hash.zip(array).all? { |(k, _), e| k == e }
The zip version is pretty much the each_with_index version with the array indexing essentially folded into the zip.
Technically speaking, hash is not guaranteed to ordered, so your assumption of matching the value at 'each' index of hash may not always hold true. However, for sake of answering your question, assuming h is your hash, and a is your array:
list_matches = true
h.values.each_with_index {|v, i| list_matches = list_matches && a[i] == v}
if list_matches is not equal to true than that is where the the items in the two collections don't match.

Lua table.getn() returns 0?

I've embedded Lua into my C application, and am trying to figure out why a table created in my C code via:
lua_createtable(L, 0, numObjects);
and returned to Lua, will produce a result of zero when I call the following:
print("Num entries", table.getn(data))
(Where "data" is the table created by lua_createtable above)
There's clearly data in the table, as I can walk over each entry (string : userdata) pair via:
for key, val in pairs(data) do
...
end
But why does table.getn(data) return zero? Do I need to insert something into the meta of the table when I create it with lua_createtable? I've been looking at examples of lua_createtable use, and I haven't seen this done anywhere....
table.getn (which you shouldn't be using in Lua 5.1+. Use the length operator #) returns the number of elements in the array part of the table.
The array part is every key that starts with the number 1 and increases up until the first value that is nil (not present). If all of your keys are strings, then the size of the array part of your table is 0.
Although it's a costly (O(n) vs O(1) for simple lists), you can also add a method to count the elements of your map :
>> function table.map_length(t)
local c = 0
for k,v in pairs(t) do
c = c+1
end
return c
end
>> a = {spam="data1",egg='data2'}
>> table.map_length(a)
2
If you have such requirements, and if your environment allows you to do so think about using penlight that provides that kind of features and much more.
the # operator (and table.getn) effectivly return the size of the array section (though when you have a holey table the semantics are more complex)
It does not count anything in the hash part of the table (eg, string keys)
for k,v in pairs(tbl) do count = count + 1 end

Resources