I am writing a C extension for Ruby that really needs to merge two hashes; however, the rb_hash_merge() function is declared static in Ruby 1.8.6. I have tried instead to use:
rb_funcall(hash1, rb_intern("merge"), 1, hash2);
but this is much too slow, and performance is very critical in this application.
Does anyone know how to go about performing this merge with efficiency and speed in mind?
(Note: I have tried simply looking at the source for rb_hash_merge() and replicating it, but it is RIDDLED with other static functions, which are themselves riddled with yet more static functions, so it seems almost impossible to disentangle... I need another way.)
OK, it looks like it might not be possible to optimize this within the published API.
Test code:
#extconf.rb
require 'mkmf'
dir_config("hello")
create_makefile("hello")
// hello.c
#include "ruby.h"
static VALUE rb_mHello;
static VALUE rb_cMyCalc;
static void calc_mark(void *f) { }
static void calc_free(void *f) { }
static VALUE calc_alloc(VALUE klass) { return Data_Wrap_Struct(klass, calc_mark, calc_free, NULL); }
static VALUE calc_init(VALUE obj) { return Qnil; }
static VALUE calc_merge(VALUE obj, VALUE h1, VALUE h2) {
    return rb_funcall(h1, rb_intern("merge"), 1, h2);
}
static VALUE
calc_merge2(VALUE obj, VALUE h1, VALUE h2)
{
    VALUE h3 = rb_hash_new();
    VALUE keys;
    VALUE akey;
    long i;

    keys = rb_funcall(h1, rb_intern("keys"), 0);
    for (i = 0; i < RARRAY(keys)->len; i++) {
        akey = rb_ary_entry(keys, i);
        rb_hash_aset(h3, akey, rb_hash_aref(h1, akey));
    }
    keys = rb_funcall(h2, rb_intern("keys"), 0);
    for (i = 0; i < RARRAY(keys)->len; i++) {
        akey = rb_ary_entry(keys, i);
        rb_hash_aset(h3, akey, rb_hash_aref(h2, akey));
    }
    return h3;
}
static VALUE
calc_merge3(VALUE obj, VALUE h1, VALUE h2)
{
    VALUE keys;
    VALUE akey;
    long i;

    keys = rb_funcall(h1, rb_intern("keys"), 0);
    for (i = 0; i < RARRAY(keys)->len; i++) {
        akey = rb_ary_entry(keys, i);
        rb_hash_aset(h2, akey, rb_hash_aref(h1, akey));
    }
    return h2;
}
void
Init_hello()
{
    rb_mHello = rb_define_module("Hello");
    rb_cMyCalc = rb_define_class_under(rb_mHello, "Calculator", rb_cObject);
    rb_define_alloc_func(rb_cMyCalc, calc_alloc);
    rb_define_method(rb_cMyCalc, "initialize", calc_init, 0);
    rb_define_method(rb_cMyCalc, "merge", calc_merge, 2);
    rb_define_method(rb_cMyCalc, "merge2", calc_merge2, 2);
    rb_define_method(rb_cMyCalc, "merge3", calc_merge3, 2);
}
# test.rb
require "hello"
h1 = Hash.new()
h2 = Hash.new()
1.upto(100000) { |x| h1[x] = x+1; }
1.upto(100000) { |x| h2["#{x}-12"] = x+1; }
c = Hello::Calculator.new()
puts c.merge(h1, h2).keys.length if ARGV[0] == "1"
puts c.merge2(h1, h2).keys.length if ARGV[0] == "2"
puts c.merge3(h1, h2).keys.length if ARGV[0] == "3"
Now the test results:
$ time ruby test.rb
real 0m1.021s
user 0m0.940s
sys 0m0.080s
$ time ruby test.rb 1
200000
real 0m1.224s
user 0m1.148s
sys 0m0.076s
$ time ruby test.rb 2
200000
real 0m1.219s
user 0m1.132s
sys 0m0.084s
$ time ruby test.rb 3
200000
real 0m1.220s
user 0m1.128s
sys 0m0.092s
So it looks like we might shave off at most ~0.004s on a ~0.2s operation.
Given that there's probably not much work beyond setting the values, there may not be much room for further optimization. You could try hacking the Ruby source itself, but at that point you're no longer really developing an "extension" but rather changing the language, so it probably won't work.
If merging hashes is something you need to do many times in the C part, then working with the internal data structures and only exporting them into a Ruby hash in a final pass would probably be the only way of optimizing things.
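For what it's worth, rb_hash_foreach() is declared in intern.h (pulled in by ruby.h) even on 1.8.6, so one thing worth trying is a merge loop that skips Ruby method dispatch and the intermediate keys array. Whether it buys much beyond the ~0.004s above is something you'd have to measure; a rough sketch (untested, and the function names are mine):
/* Callback for rb_hash_foreach: copy one key/value pair into dest. */
static int
merge_pair(VALUE key, VALUE value, VALUE dest)
{
    rb_hash_aset(dest, key, value);
    return ST_CONTINUE;
}

/* Behaves like h1.merge(h2): h2's values win on duplicate keys. */
static VALUE
calc_merge4(VALUE obj, VALUE h1, VALUE h2)
{
    VALUE h3 = rb_hash_new();
    rb_hash_foreach(h1, merge_pair, h3);
    rb_hash_foreach(h2, merge_pair, h3);
    return h3;
}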
P.S. The initial skeleton for the code was borrowed from this excellent tutorial.
I have a large contract and I am in the process of splitting it into two. The goal is to separate out the functions that are common (and will be used by many other contracts) for efficiency.
One of these functions compares items in the arrays "ownedSymbols" and "targetAssets". It produces a list, "sellSymbols", containing any item in "ownedSymbols" that is not in "targetAssets".
The code below works fine while "sellSymbols" is kept in storage. As this function will become common, I need it to run in memory so the results aren't confused by calls from different contracts.
pragma solidity >0.8.0;
contract compareArrays {
    string[] public ownedSymbols = ["A","B","C"];
    string[] public targetAssets = ["A","B"];
    string[] sellSymbols;

    event sellListEvent(string[]);

    function sellList(string[] memory _ownedSymbols, string[] memory _targetAssetsList) internal {
        sellSymbols = _ownedSymbols;
        for (uint256 i = 0; i < _targetAssetsList.length; i++) {
            for (uint256 x = 0; x < sellSymbols.length; x++) {
                if (
                    keccak256(abi.encodePacked((sellSymbols[x]))) ==
                    keccak256(abi.encodePacked((_targetAssetsList[i])))
                ) {
                    if (x < sellSymbols.length) {
                        sellSymbols[x] = sellSymbols[sellSymbols.length - 1];
                        sellSymbols.pop();
                    } else {
                        delete sellSymbols;
                    }
                }
            }
        }
        emit sellListEvent(sellSymbols);
    }

    function runSellList() public {
        sellList(ownedSymbols, targetAssets);
    }
}
Ideally the function would run with "string[] memory sellSymbols"; however, this kicks back an error.
pragma solidity >0.8.0;
contract compareArrays {
    string[] public ownedSymbols = ["A","B","C"];
    string[] public targetAssets = ["A","B"];

    event sellListEvent(string[]);

    function sellList(string[] memory _ownedSymbols, string[] memory _targetAssetsList) internal {
        string[] memory sellSymbols = _ownedSymbols;
        for (uint256 i = 0; i < _targetAssetsList.length; i++) {
            for (uint256 x = 0; x < sellSymbols.length; x++) {
                if (
                    keccak256(abi.encodePacked((sellSymbols[x]))) ==
                    keccak256(abi.encodePacked((_targetAssetsList[i])))
                ) {
                    if (x < sellSymbols.length) {
                        sellSymbols[x] = sellSymbols[sellSymbols.length - 1];
                        sellSymbols.pop();
                    } else {
                        delete sellSymbols;
                    }
                }
            }
        }
        emit sellListEvent(sellSymbols);
    }

    function runSellList() public {
        sellList(ownedSymbols, targetAssets);
    }
}
The error:
TypeError: Member "pop" is not available in string memory[] memory outside of storage.
--> contracts/sellSymbols.sol:20:25:
|
20 | sellSymbols.pop();
| ^^^^^^^^^^^^^^^
Two questions from me:
Is there a way to do this in memory so that the function can be common (i.e. used by multiple contracts at the same time)?
Is there a better way? The code below is expensive to run, but it is the only way I have been able to achieve this.
One final comment: I know this would be much easier/cheaper to run off-chain. That is not something I am willing to consider, as I want this project to be decentralized.
If you want to keep the existing system, the best solution is described here: https://stackoverflow.com/a/49054593/11628256
if (x < sellSymbols.length) {
    sellSymbols[x] = sellSymbols[sellSymbols.length - 1];
    delete sellSymbols[sellSymbols.length - 1];
    sellSymbols.length--;
} else {
    delete sellSymbols;
}
If all you care about is the presence or absence of a particular asset (and not enumerating through them), then what you want in order to really reduce gas costs is something called "lazy evaluation". Lazy evaluation means that instead of computing all results at once (like increasing all balances by 50% by iterating over an array), you modify the getters so that their return values reflect the operation (such as bumping an internal scale factor up by 50% and multiplying the original result of getBalance by that factor).
So, if this is the case, what you want to do is use the following function instead:
function except(
    string memory _item,
    mapping(string => bool) storage _ownedSymbols,
    mapping(string => bool) storage _targetAssets
) internal view returns (bool) {
    return _ownedSymbols[_item] && !_targetAssets[_item];
}
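To show how that might hang together, here is an illustrative contract built around the idea; the contract name, the setter functions, and shouldSell are my own inventions, not from the question:
pragma solidity ^0.8.0;

contract LazySymbols {
    // Presence flags instead of arrays: O(1) membership checks, no loops.
    mapping(string => bool) private ownedSymbols;
    mapping(string => bool) private targetAssets;

    function setOwned(string memory symbol, bool flag) public {
        ownedSymbols[symbol] = flag;
    }

    function setTarget(string memory symbol, bool flag) public {
        targetAssets[symbol] = flag;
    }

    function except(
        string memory _item,
        mapping(string => bool) storage _ownedSymbols,
        mapping(string => bool) storage _targetAssets
    ) internal view returns (bool) {
        return _ownedSymbols[_item] && !_targetAssets[_item];
    }

    // Owned but not a target, i.e. a candidate for selling.
    function shouldSell(string memory symbol) public view returns (bool) {
        return except(symbol, ownedSymbols, targetAssets);
    }
}
The trade-off is that mappings cannot be enumerated, so this only works if callers ask about specific symbols rather than needing the full sellSymbols list.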
<pet peeve>
Finally, I know you say you want this to be decentralized, but I really do feel the urge to say this. If this is a system that doesn't need to be decentralized, don't decentralize it! Decentralization is great for projects that other people rely on - for example, DNS or any sort of token.
From your variable names, it seems that this is probably some sort of system similar to a trading bot. Therefore, the incentive is on you to keep it running, as you are the one that gets the benefits. None of the problems that decentralization solves (censorship, conflict of interest, etc...) apply to your program, as the person running it is incentivized to keep it running and keep a copy of the program. It's cheaper for the user running it to not have security they don't need. You don't need a bank-grade vault to store a $1 bill!
</pet peeve>
I was playing around with binary serialization and deserialization in Rust and noticed that binary deserialization is dramatically slower than in Java. To rule out overhead due to, for example, allocations, each program simply reads a binary stream. Each program reads from a binary file on disk which contains a 4-byte integer giving the number of input values, followed by a contiguous chunk of 8-byte big-endian IEEE 754-encoded floating-point numbers. Here's the Java implementation:
import java.io.*;
public class ReadBinary {
    public static void main(String[] args) throws Exception {
        DataInputStream input = new DataInputStream(new BufferedInputStream(new FileInputStream(args[0])));
        int inputLength = input.readInt();
        System.out.println("input length: " + inputLength);
        try {
            for (int i = 0; i < inputLength; i++) {
                double d = input.readDouble();
                if (i == inputLength - 1) {
                    System.out.println(d);
                }
            }
        } finally {
            input.close();
        }
    }
}
Here's the Rust implementation:
use std::fs::File;
use std::io::{BufReader, Read};
use std::path::Path;
fn main() {
    let args = std::env::args_os();
    let fname = args.skip(1).next().unwrap();
    let path = Path::new(&fname);
    let mut file = BufReader::new(File::open(&path).unwrap());
    let input_length: i32 = read_int(&mut file);
    for i in 0..input_length {
        let d = read_double_slow(&mut file);
        if i == input_length - 1 {
            println!("{}", d);
        }
    }
}

fn read_int<R: Read>(input: &mut R) -> i32 {
    let mut bytes = [0; std::mem::size_of::<i32>()];
    input.read_exact(&mut bytes).unwrap();
    i32::from_be_bytes(bytes)
}

fn read_double_slow<R: Read>(input: &mut R) -> f64 {
    let mut bytes = [0; std::mem::size_of::<f64>()];
    input.read_exact(&mut bytes).unwrap();
    f64::from_be_bytes(bytes)
}
I'm outputting the last value to make sure that all of the input is actually being read. On my machine, when the file contains (the same) 30 million randomly-generated doubles, the Java version runs in 0.8 seconds, while the Rust version runs in 40.8 seconds.
Suspicious of inefficiencies in Rust's byte interpretation itself, I retried it with a custom floating point deserialization implementation. The internals are almost exactly the same as what's being done in Rust's Reader, without the IoResult wrappers:
fn read_double<R: Reader>(input: &mut R, buffer: &mut [u8]) -> f64 {
    use std::mem::transmute;
    match input.read_at_least(8, buffer) {
        Ok(n) => if n > 8 { fail!("n > 8") },
        Err(e) => fail!(e)
    };
    let mut val = 0u64;
    let mut i = 8;
    while i > 0 {
        i -= 1;
        val += (buffer[7 - i] as u64) << (i * 8);
    }
    unsafe {
        transmute::<u64, f64>(val)
    }
}
The only change I made to the earlier Rust code in order to make this work was to create an 8-byte slice to be passed in and (re)used as a buffer in the read_double function. This yielded a significant performance gain, running in about 5.6 seconds on average. Unfortunately, this is still noticeably slower (and more verbose!) than the Java version, making it difficult to scale up to larger input sets. Is there something that can be done to make this run faster in Rust? More importantly, is it possible to make these changes in such a way that they can be merged into the default Reader implementation itself to make binary I/O less painful?
For reference, here's the code I'm using to generate the input file:
import java.io.*;
import java.util.Random;
public class MakeBinary {
    public static void main(String[] args) throws Exception {
        DataOutputStream output = new DataOutputStream(new BufferedOutputStream(System.out));
        int outputLength = Integer.parseInt(args[0]);
        output.writeInt(outputLength);
        Random rand = new Random();
        for (int i = 0; i < outputLength; i++) {
            output.writeDouble(rand.nextDouble() * 10 + 1);
        }
        output.flush();
    }
}
(Note that generating the random numbers and writing them to disk only takes 3.8 seconds on my test machine.)
When you build without optimisations, it will often be slower than it would be in Java. But build it with optimisations (rustc -O or cargo build --release) and it should be very much faster. If the standard version still ends up slower, it's something that should be examined carefully to figure out where the slowness is: perhaps something is being inlined that shouldn't be, or something that should be isn't, or perhaps some optimisation that was expected is not occurring.
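Concretely, that means timing the optimised binary rather than a debug build; the file and binary names below are just placeholders:
$ rustc -O read_binary.rs && time ./read_binary data.bin
$ # or, with Cargo:
$ cargo build --release && time ./target/release/read_binary data.bin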
In the constructor of an Array, is there a guarantee that the init function will be called for the indices in increasing order?
It would make sense, but I did not find any such information in the docs:
https://kotlinlang.org/api/latest/jvm/stdlib/kotlin/-array/-init-.html#kotlin.Array%24%28kotlin.Int%2C+kotlin.Function1%28%28kotlin.Int%2C+kotlin.Array.T%29%29%29%2Finit
There is no guarantee for this in the API.
TL;DR: If you need sequential execution because you have some state that changes, see the bottom.
First, let's have a look at the implementations of the initializer:
Native: It is implemented in increasing order for Kotlin Native.
@InlineConstructor
public constructor(size: Int, init: (Int) -> Char): this(size) {
    for (i in 0..size - 1) {
        this[i] = init(i)
    }
}
JVM: Decompiling the Kotlin bytecode for
class test {
    val intArray = IntArray(100) { it * 2 }
}
to Java in Android Studio yields:
public final class test {
    @NotNull
    private final int[] intArray;

    @NotNull
    public final int[] getIntArray() {
        return this.intArray;
    }

    public test() {
        int size$iv = 100;
        int[] result$iv = new int[size$iv];
        int i$iv = 0;

        for(int var4 = result$iv.length; i$iv < var4; ++i$iv) {
            int var6 = false;
            int var11 = i$iv * 2;
            result$iv[i$iv] = var11;
        }

        this.intArray = result$iv;
    }
}
which supports the claim that it is initialized in ascending order.
Conclusion: It is commonly implemented to execute in ascending order.
BUT: You cannot rely on the execution order, as it is not guaranteed by the API. It can change, and it can be different on different platforms (although both are unlikely).
Solution: You can initialize the array manually in a loop; then you have control over the execution order.
The following example outlines a possible implementation that has a stable initialisation with random values, e.g. for tests.
val intArray = IntArray(100).also {
    val random = Random(0)
    for (index in it.indices) {
        it[index] = index * random.nextInt()
    }
}
Starting from version 1.3.50, Kotlin guarantees sequential array initialization order in its API documentation: https://kotlinlang.org/api/latest/jvm/stdlib/kotlin/-array/-init-.html
The function init is called for each array element sequentially starting from the first one. It should return the value for an array element given its index.
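With that guarantee, an initializer that carries state from one element to the next is safe. A small illustrative example (the running-sum idea is mine, not from the documentation):
// Correct only because init runs sequentially from index 0 upward:
// each element depends on the elements initialized before it.
var running = 0
val prefixSums = IntArray(10) { i ->
    running += i
    running
}
// prefixSums[k] == 0 + 1 + ... + k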
I have this C++11 code:
auto gen = []() -> double { /* do stuff */ };
std::generate(myArray.begin(), myArray.end(), gen);
How would I do the same with D's array? std.algorithm.fill doesn't take a function object, and I don't know how to pass a function to recurrence.
Here's a version that seems to work:
import std.algorithm, std.array, std.range, std.stdio;
void main() {
    writefln("%s", __VERSION__);
    int i;
    auto dg = delegate float(int) { return i++; };
    float[] res = array(map!dg(iota(0, 10)));
    float[] res2 = new float[10];
    fill(res2, map!dg(iota(0, res2.length)));
    writefln("meep");
    writefln("%s", res);
    writefln("%s", res2);
}
[edit] Added fill-based version (res2).
I tested it in Ideone (http://www.ideone.com/DFK5A) but it crashes... a friend with a current version of DMD says it works though, so I assume Ideone's DMD is just outdated by about ten to twenty versions.
You could do something like
auto arr = {/* generate an array and return that array */}();
If it's assigned to a global it should be evaluated at compile-time.
You can also use string mixins to generate code for an array literal.
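For instance, the immediately-invoked literal idea above might be fleshed out like this (a sketch with placeholder values; I have not compiled it against any particular DMD version):
import std.stdio;

void main() {
    // An anonymous function literal, called on the spot; the array it
    // returns plays the role of std::generate's output range.
    auto arr = {
        auto a = new double[10];
        foreach (i, ref e; a)
            e = i * 0.5; // "do stuff"
        return a;
    }();
    writeln(arr);
}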
Relative Performance of Symbol#to_proc in Popular Ruby Implementations states that in MRI Ruby 1.8.7, Symbol#to_proc is slower than the alternative in their benchmark by 30% to 130%, but that this isn't the case in YARV Ruby 1.9.2.
Why is this the case? The creators of 1.8.7 didn't write Symbol#to_proc in pure Ruby.
Also, are there any gems that provide faster Symbol#to_proc performance for 1.8?
(Symbol#to_proc is starting to appear when I use ruby-prof, so I don't think I'm guilty of premature optimization)
The to_proc implementation in 1.8.7 looks like this (see object.c):
static VALUE
sym_to_proc(VALUE sym)
{
return rb_proc_new(sym_call, (VALUE)SYM2ID(sym));
}
Whereas the 1.9.2 implementation (see string.c) looks like this:
static VALUE
sym_to_proc(VALUE sym)
{
    static VALUE sym_proc_cache = Qfalse;
    enum {SYM_PROC_CACHE_SIZE = 67};
    VALUE proc;
    long id, index;
    VALUE *aryp;

    if (!sym_proc_cache) {
        sym_proc_cache = rb_ary_tmp_new(SYM_PROC_CACHE_SIZE * 2);
        rb_gc_register_mark_object(sym_proc_cache);
        rb_ary_store(sym_proc_cache, SYM_PROC_CACHE_SIZE*2 - 1, Qnil);
    }

    id = SYM2ID(sym);
    index = (id % SYM_PROC_CACHE_SIZE) << 1;

    aryp = RARRAY_PTR(sym_proc_cache);
    if (aryp[index] == sym) {
        return aryp[index + 1];
    }
    else {
        proc = rb_proc_new(sym_call, (VALUE)id);
        aryp[index] = sym;
        aryp[index + 1] = proc;
        return proc;
    }
}
If you strip away all the busy work of initializing sym_proc_cache, then you're left with (more or less) this:
aryp = RARRAY_PTR(sym_proc_cache);
if (aryp[index] == sym) {
    return aryp[index + 1];
}
else {
    proc = rb_proc_new(sym_call, (VALUE)id);
    aryp[index] = sym;
    aryp[index + 1] = proc;
    return proc;
}
So the real difference is that 1.9.2's to_proc caches the generated Procs, while 1.8.7 generates a brand-new one every single time you call to_proc. The performance difference between these two will be magnified by any benchmarking you do unless each iteration is done in a separate process; however, one iteration per process would mask what you're trying to benchmark with the start-up cost.
The guts of rb_proc_new look pretty much the same (see eval.c for 1.8.7 or proc.c for 1.9.2) but 1.9.2 might benefit slightly from any performance improvements in rb_iterate. The caching is probably the big performance difference.
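A quick way to observe the cache from Ruby itself, based on the 1.9.2 code above (on 1.8.7 I would expect a fresh Proc, and therefore false, each time):
# On 1.9.2 both calls should hand back the very same cached Proc object;
# on 1.8.7 a new Proc is built per call, so equal? returns false.
a = :upcase.to_proc
b = :upcase.to_proc
p a.equal?(b)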
It is worth noting that the symbol-to-proc cache is a fixed size (67 entries, but I'm not sure where 67 comes from; probably related to the number of operators and such that are commonly used for symbol-to-proc conversions):
id = SYM2ID(sym);
index = (id % SYM_PROC_CACHE_SIZE) << 1;
/* ... */
if (aryp[index] == sym) {
If you use more than 67 symbols as procs or if your symbol IDs overlap (mod 67) then you won't get the full benefit of the caching.
The Rails and 1.9 programming style involves a lot of shorthands like:
ints = strings.collect(&:to_i)
sum = ints.inject(0, &:+)
rather than the longer explicit block forms:
ints = strings.collect { |s| s.to_i }
sum = ints.inject(0) { |s,i| s += i }
Given that (popular) programming style, it makes sense to trade memory for speed by caching the lookup.
You're not likely to get a faster implementation from a gem as the gem would have to replace a chunk of the core Ruby functionality. You could patch the 1.9.2 caching into your 1.8.7 source though.
The following ordinary Ruby code:
if defined?(RUBY_ENGINE).nil? # No RUBY_ENGINE means it's MRI 1.8.7
  class Symbol
    alias_method :old_to_proc, :to_proc

    # Class variables are considered harmful, but I don't think
    # anyone will subclass Symbol
    @@proc_cache = {}

    def to_proc
      @@proc_cache[self] ||= old_to_proc
    end
  end
end
Will make Ruby MRI 1.8.7 Symbol#to_proc slightly less slow than before, but not as fast as an ordinary block or a pre-existing proc.
However, it'll make YARV, Rubinius and JRuby slower, hence the if around the monkeypatch.
The slowness of using Symbol#to_proc isn't solely due to MRI 1.8.7 creating a proc each time - even if you re-use an existing one, it's still slower than using a block.
Using Ruby 1.8 head
Size    Block    Pre-existing proc    New Symbol#to_proc    Old Symbol#to_proc
   0     0.36                 0.39                  0.62                  1.49
   1     0.50                 0.60                  0.87                  1.73
  10     1.65                 2.47                  2.76                  3.52
 100    13.28                21.12                 21.53                 22.29
For the full benchmark and code, see https://gist.github.com/1053502
In addition to not caching procs, 1.8.7 also creates (approximately) one array each time a proc is called. I suspect it's because the generated proc creates an array to accept the arguments - this happens even with an empty proc that takes no arguments.
Here's a script to demonstrate the 1.8.7 behavior. Only the :diff value is significant here, which shows the increase in array count.
# this should really be called count_arrays
def count_objects(&block)
  GC.disable
  ct1 = ct2 = 0
  ObjectSpace.each_object(Array) { ct1 += 1 }
  yield
  ObjectSpace.each_object(Array) { ct2 += 1 }
  {:count1 => ct1, :count2 => ct2, :diff => ct2 - ct1}
ensure
  GC.enable
end
to_i = :to_i.to_proc
range = 1..1000
puts "map(&to_i)"
p count_objects {
range.map(&to_i)
}
puts "map {|e| to_i[e] }"
p count_objects {
range.map {|e| to_i[e] }
}
puts "map {|e| e.to_i }"
p count_objects {
range.map {|e| e.to_i }
}
Sample output:
map(&to_i)
{:count1=>6, :count2=>1007, :diff=>1001}
map {|e| to_i[e] }
{:count1=>1008, :count2=>2009, :diff=>1001}
map {|e| e.to_i }
{:count1=>2009, :count2=>2010, :diff=>1}
It seems that merely calling a proc will create the array for every iteration, but a literal block only seems to create an array once.
But multi-arg blocks may still suffer from the problem:
plus = :+.to_proc
puts "inject(&plus)"
p count_objects {
range.inject(&plus)
}
puts "inject{|sum, e| plus.call(sum, e) }"
p count_objects {
range.inject{|sum, e| plus.call(sum, e) }
}
puts "inject{|sum, e| sum + e }"
p count_objects {
range.inject{|sum, e| sum + e }
}
Sample output. Note how we incur a double penalty in case #2, because we use a multi-arg block, and also call the proc.
inject(&plus)
{:count1=>2010, :count2=>3009, :diff=>999}
inject{|sum, e| plus.call(sum, e) }
{:count1=>3009, :count2=>5007, :diff=>1998}
inject{|sum, e| sum + e }
{:count1=>5007, :count2=>6006, :diff=>999}