I have a third-party C library that I am using to write an R extension. I need to create and initialize a few structs defined in the library and keep them alive as part of an S4 object. (Think of these structs as defining the state of a computation: destroying them would destroy all remaining computation and the results of everything computed so far.)
I am thinking of creating an S4 object that holds pointers to these structs as void* pointers, but it is not at all clear how to do so. What would be the type of the slot?

As pointed out by @hrbrmstr, you can use the externalptr type to keep such objects "alive", which is touched on in this section of Writing R Extensions, although I don't see any reason why you would need to store anything as void*. If you don't mind using a little C++, the Rcpp class XPtr can eliminate a fair amount of the boilerplate involved in managing EXTPTRSXPs. As an example, assume the following simplified code represents your third-party library's API:
#include <Rcpp.h>
#include <stdlib.h>

typedef struct {
    unsigned int count;
    double total;
} CStruct;

CStruct* init_CStruct() {
    return (CStruct*)::malloc(sizeof(CStruct));
}

// Finalizer: releases the malloc'd memory with free, not delete.
void free_CStruct(CStruct* ptr) {
    ::free(ptr);
    ::printf("free_CStruct called.\n");
}

typedef Rcpp::XPtr<CStruct, Rcpp::PreserveStorage, free_CStruct> xptr_t;
When working with pointers created via new it is generally sufficient to use Rcpp::XPtr<SomeClass>, because the default finalizer simply calls delete on the held object. However, since you are dealing with a C API, we have to supply the (default) template parameter Rcpp::PreserveStorage, and more importantly, the appropriate finalizer (free_CStruct in this example) so that the XPtr does not call delete on memory allocated via malloc, etc., when the corresponding R object is garbage collected.
Continuing with the example, assume you write the following functions to interact with your CStruct:
// [[Rcpp::export]]
xptr_t MakeCStruct() {
    CStruct* ptr = init_CStruct();
    ptr->count = 0;
    ptr->total = 0;
    return xptr_t(ptr, true);
}

// [[Rcpp::export]]
void UpdateCStruct(xptr_t ptr, SEXP x) {
    if (TYPEOF(x) == REALSXP) {
        R_xlen_t i = 0, sz = XLENGTH(x);
        for ( ; i < sz; i++) {
            if (!ISNA(REAL(x)[i])) {
                ptr->count++;
                ptr->total += REAL(x)[i];
            }
        }
        return;
    }
    if (TYPEOF(x) == INTSXP) {
        R_xlen_t i = 0, sz = XLENGTH(x);
        for ( ; i < sz; i++) {
            // ISNA() only detects NA in doubles; integers use NA_INTEGER.
            if (INTEGER(x)[i] != NA_INTEGER) {
                ptr->count++;
                ptr->total += INTEGER(x)[i];
            }
        }
        return;
    }
    Rf_warning("Invalid SEXPTYPE.\n");
}

// [[Rcpp::export]]
void SummarizeCStruct(xptr_t ptr) {
    Rprintf(
        "count: %d\ntotal: %f\naverage: %f\n",
        ptr->count, ptr->total,
        ptr->count > 0 ? ptr->total / ptr->count : 0
    );
}

// [[Rcpp::export]]
int GetCStructCount(xptr_t ptr) {
    return ptr->count;
}

// [[Rcpp::export]]
double GetCStructTotal(xptr_t ptr) {
    return ptr->total;
}

// [[Rcpp::export]]
void ResetCStruct(xptr_t ptr) {
    ptr->count = 0;
    ptr->total = 0.0;
}
At this point, you have done enough to start handling CStructs from R:
ptr <- MakeCStruct() will initialize a CStruct and store it as an externalptr in R
UpdateCStruct(ptr, x) will modify the data stored in the CStruct, SummarizeCStruct(ptr) will print a summary, etc.
rm(ptr); gc() will remove the ptr object and force the garbage collector to run, thus calling free_CStruct(ptr) and destroying the object on the C side of things as well
You mentioned the use of S4 classes, which is one option for containing all of these functions in a single place. Here's one possibility:
setClass(
    "CStruct",
    slots = c(
        ptr = "externalptr",
        update = "function",
        summarize = "function",
        get_count = "function",
        get_total = "function",
        reset = "function"
    )
)

setMethod(
    "initialize",
    "CStruct",
    function(.Object) {
        .Object@ptr <- MakeCStruct()
        .Object@update <- function(x) UpdateCStruct(.Object@ptr, x)
        .Object@summarize <- function() SummarizeCStruct(.Object@ptr)
        .Object@get_count <- function() GetCStructCount(.Object@ptr)
        .Object@get_total <- function() GetCStructTotal(.Object@ptr)
        .Object@reset <- function() ResetCStruct(.Object@ptr)
        .Object
    }
)
Then, we can work with the CStructs like this:
ptr <- new("CStruct")
ptr@summarize()
# count: 0
# total: 0.000000
# average: 0.000000
set.seed(123)
ptr@update(rnorm(100))
ptr@summarize()
# count: 100
# total: 9.040591
# average: 0.090406
ptr@update(rnorm(100))
ptr@summarize()
# count: 200
# total: -1.714089
# average: -0.008570
ptr@reset()
ptr@summarize()
# count: 0
# total: 0.000000
# average: 0.000000
rm(ptr); gc()
# free_CStruct called.
#          used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 484713 25.9     940480 50.3   601634 32.2
# Vcells 934299  7.2    1650153 12.6  1308457 10.0
Of course, another option is to use Rcpp Modules, which more or less take care of the class definition boilerplate on the R side (using reference classes rather than S4 classes, however).
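The lifetime rule this answer relies on (the handle's finalizer must call the library's own free function, exactly once, when the handle goes away) is not specific to R. As a rough illustration only, here is a sketch in Rust in which Drop plays the role of the XPtr finalizer. Everything in it is a hypothetical stand-in: CStruct mirrors the example struct, init_cstruct and free_cstruct imitate the library's allocator pair using std::alloc, and the FREED counter exists only so that the finalizer call can be observed.

```rust
use std::alloc::{alloc, dealloc, Layout};
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical stand-in for the library's struct (mirrors CStruct above).
#[repr(C)]
pub struct CStruct {
    pub count: u32,
    pub total: f64,
}

// Counts finalizer calls so the behaviour can be observed (illustration only).
pub static FREED: AtomicUsize = AtomicUsize::new(0);

// Stand-in for the library's allocation function.
pub fn init_cstruct() -> *mut CStruct {
    let ptr = unsafe { alloc(Layout::new::<CStruct>()) as *mut CStruct };
    assert!(!ptr.is_null(), "allocation failed");
    unsafe { ptr.write(CStruct { count: 0, total: 0.0 }) };
    ptr
}

// Stand-in for the library's free function; it must match init_cstruct.
pub unsafe fn free_cstruct(ptr: *mut CStruct) {
    unsafe { dealloc(ptr as *mut u8, Layout::new::<CStruct>()) };
    FREED.fetch_add(1, Ordering::SeqCst);
}

// Owning handle: the analogue of an XPtr with a custom finalizer.
pub struct Handle {
    ptr: *mut CStruct,
}

impl Handle {
    pub fn new() -> Self {
        Handle { ptr: init_cstruct() }
    }

    // Adds every non-NaN value to the running count and total.
    pub fn update(&mut self, xs: &[f64]) {
        let state = unsafe { &mut *self.ptr };
        for &x in xs {
            if !x.is_nan() {
                state.count += 1;
                state.total += x;
            }
        }
    }

    pub fn count(&self) -> u32 {
        unsafe { (*self.ptr).count }
    }

    pub fn total(&self) -> f64 {
        unsafe { (*self.ptr).total }
    }
}

impl Drop for Handle {
    // Runs once when the handle goes out of scope, like the R finalizer at gc time.
    fn drop(&mut self) {
        unsafe { free_cstruct(self.ptr) };
    }
}
```

The point of the sketch is only the pairing: whatever allocated the struct decides how it is released, and the owning handle is the single place where that release is triggered.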


Serious performance regression upon porting bubble sort from C to Rust [duplicate]

I was playing around with binary serialization and deserialization in Rust and noticed that binary deserialization is several orders of magnitude slower than with Java. To eliminate the possibility of overhead due to, for example, allocations, I'm simply reading a binary stream in each program. Each program reads a binary file on disk which contains a 4-byte integer giving the number of input values, followed by a contiguous chunk of 8-byte big-endian IEEE 754-encoded floating point numbers. Here's the Java implementation:
import java.io.*;

public class ReadBinary {
    public static void main(String[] args) throws Exception {
        DataInputStream input = new DataInputStream(new BufferedInputStream(new FileInputStream(args[0])));
        int inputLength = input.readInt();
        System.out.println("input length: " + inputLength);
        try {
            for (int i = 0; i < inputLength; i++) {
                double d = input.readDouble();
                // Print only the last value, to prove everything was read.
                if (i == inputLength - 1) {
                    System.out.println(d);
                }
            }
        } finally {
            input.close();
        }
    }
}
Here's the Rust implementation:
use std::fs::File;
use std::io::{BufReader, Read};
use std::path::Path;

fn main() {
    let args = std::env::args_os();
    let fname = args.skip(1).next().unwrap();
    let path = Path::new(&fname);
    let mut file = BufReader::new(File::open(&path).unwrap());
    let input_length: i32 = read_int(&mut file);
    for i in 0..input_length {
        let d = read_double_slow(&mut file);
        if i == input_length - 1 {
            println!("{}", d);
        }
    }
}

fn read_int<R: Read>(input: &mut R) -> i32 {
    let mut bytes = [0; std::mem::size_of::<i32>()];
    input.read_exact(&mut bytes).unwrap();
    // The file stores values in big-endian order.
    i32::from_be_bytes(bytes)
}

fn read_double_slow<R: Read>(input: &mut R) -> f64 {
    let mut bytes = [0; std::mem::size_of::<f64>()];
    input.read_exact(&mut bytes).unwrap();
    f64::from_be_bytes(bytes)
}
I'm outputting the last value to make sure that all of the input is actually being read. On my machine, when the file contains (the same) 30 million randomly-generated doubles, the Java version runs in 0.8 seconds, while the Rust version runs in 40.8 seconds.
Suspicious of inefficiencies in Rust's byte interpretation itself, I retried it with a custom floating point deserialization implementation. The internals are almost exactly the same as what's being done in Rust's Reader, without the IoResult wrappers:
fn read_double<R : Reader>(input: &mut R, buffer: &mut [u8]) -> f64 {
    use std::mem::transmute;
    match input.read_at_least(8, buffer) {
        Ok(n) => if n > 8 { fail!("n > 8") },
        Err(e) => fail!(e)
    };
    // Assemble the 8 big-endian bytes into a u64, most significant byte first.
    let mut val = 0u64;
    let mut i = 8;
    while i > 0 {
        i -= 1;
        val += buffer[7-i] as u64 << i * 8;
    }
    unsafe {
        transmute::<u64, f64>(val)
    }
}
The only change I made to the earlier Rust code in order to make this work was create an 8-byte slice to be passed in and (re)used as a buffer in the read_double function. This yielded a significant performance gain, running in about 5.6 seconds on average. Unfortunately, this is still noticeably slower (and more verbose!) than the Java version, making it difficult to scale up to larger input sets. Is there something that can be done to make this run faster in Rust? More importantly, is it possible to make these changes in such a way that they can be merged into the default Reader implementation itself to make binary I/O less painful?
For reference, here's the code I'm using to generate the input file:
import java.io.*;
import java.util.Random;

public class MakeBinary {
    public static void main(String[] args) throws Exception {
        DataOutputStream output = new DataOutputStream(new BufferedOutputStream(System.out));
        int outputLength = Integer.parseInt(args[0]);
        // 4-byte header holding the number of values that follow.
        output.writeInt(outputLength);
        Random rand = new Random();
        for (int i = 0; i < outputLength; i++) {
            output.writeDouble(rand.nextDouble() * 10 + 1);
        }
        output.flush();
    }
}
(Note that generating the random numbers and writing them to disk only takes 3.8 seconds on my test machine.)
When you build without optimisations, Rust will often be slower than Java. But build with optimisations (rustc -O or cargo build --release) and it should be very much faster. If the standard version still ends up slower, that is something to examine carefully to figure out where the slowness is: perhaps something is being inlined that shouldn't be, or something isn't being inlined that should be, or perhaps some optimisation that was expected is not occurring.
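For comparing builds, it can help to have a small, self-contained reader for the file layout described in the question (a 4-byte big-endian count followed by 8-byte big-endian doubles). The following is a minimal sketch using only the standard library; read_i32_be, read_f64_be, read_all and encode are names chosen here for illustration and do not appear in the original posts.

```rust
use std::io::{self, BufReader, Read};

// Reads the 4-byte big-endian length header.
pub fn read_i32_be<R: Read>(input: &mut R) -> io::Result<i32> {
    let mut bytes = [0u8; 4];
    input.read_exact(&mut bytes)?;
    Ok(i32::from_be_bytes(bytes))
}

// Reads one 8-byte big-endian IEEE 754 double.
pub fn read_f64_be<R: Read>(input: &mut R) -> io::Result<f64> {
    let mut bytes = [0u8; 8];
    input.read_exact(&mut bytes)?;
    Ok(f64::from_be_bytes(bytes))
}

// Wraps the source in a BufReader, then reads the header and every value.
pub fn read_all<R: Read>(input: R) -> io::Result<Vec<f64>> {
    let mut reader = BufReader::new(input);
    let n = read_i32_be(&mut reader)?;
    if n < 0 {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "negative length"));
    }
    let mut values = Vec::with_capacity(n as usize);
    for _ in 0..n {
        values.push(read_f64_be(&mut reader)?);
    }
    Ok(values)
}

// Hypothetical helper that produces the same layout in memory,
// which is handy for trying the reader without a file on disk.
pub fn encode(values: &[f64]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + 8 * values.len());
    out.extend_from_slice(&(values.len() as i32).to_be_bytes());
    for v in values {
        out.extend_from_slice(&v.to_be_bytes());
    }
    out
}
```

Passing a std::fs::File to read_all and timing the program once without optimisations and once with cargo build --release is one way to see how much of a gap comes from the build mode alone.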

Integer vs Boolean array Swift Performance

I tried executing Sieve Of Eratosthenes algorithm using a large Integer array and a large Bool array.
The integer version seems to execute MUCH faster than the boolean one. What is the possible reason for this?
import Foundation

var n : Int = 100000000;
var prime = [Bool](repeating: true, count: n+1)
var p = 2
let start = DispatchTime.now()
while (p*p <= n) {
    if(prime[p] == true) {
        // Mark every multiple of p as composite.
        var i = p*2
        while (i<=n) {
            prime[i] = false
            i = i + p
        }
    }
    p = p+1
}
let stop = DispatchTime.now()
let time = (Double)(stop.uptimeNanoseconds - start.uptimeNanoseconds) / 1000000.0
print("Time = \(time) ms")
Boolean array execution time : 78223.342295 ms
import Foundation

var n : Int = 100000000;
var prime = [Int](repeating: 1, count: n+1)
var p = 2
let start = DispatchTime.now()
while (p*p <= n) {
    if(prime[p] == 1) {
        // Mark every multiple of p as composite.
        var i = p*2
        while (i<=n) {
            prime[i] = 0
            i = i + p
        }
    }
    p = p+1
}
let stop = DispatchTime.now()
let time = (Double)(stop.uptimeNanoseconds - start.uptimeNanoseconds) / 1000000.0
print("Time = \(time) ms")
Integer array execution time : 8535.54546 ms
Do not attempt to optimize your code in a Debug build. Always run it through the Profiler. Int was faster than Bool in Debug, but the opposite was true when run through the Profiler.
Heap allocation is expensive. Use your memory judiciously. (This question discusses the complications in C, but it is also applicable to Swift.)
Long answer
First, let's refactor your code for easier execution:
func useBoolArray(n: Int) {
    var prime = [Bool](repeating: true, count: n+1)
    var p = 2
    while (p*p <= n) {
        if(prime[p] == true) {
            var i = p*2
            while (i<=n) {
                prime[i] = false
                i = i + p
            }
        }
        p = p+1
    }
}

func useIntArray(n: Int) {
    var prime = [Int](repeating: 1, count: n+1)
    var p = 2
    while (p*p <= n) {
        if(prime[p] == 1) {
            var i = p*2
            while (i<=n) {
                prime[i] = 0
                i = i + p
            }
        }
        p = p+1
    }
}
Now, run it in the Debug build:
let count = 100_000_000
let start = DispatchTime.now()
useBoolArray(n: count)
let boolStop = DispatchTime.now()
useIntArray(n: count)
let intStop = DispatchTime.now()
print("Bool array:", Double(boolStop.uptimeNanoseconds - start.uptimeNanoseconds) / Double(NSEC_PER_SEC))
print("Int array:", Double(intStop.uptimeNanoseconds - boolStop.uptimeNanoseconds) / Double(NSEC_PER_SEC))
// Bool array: 70.097249517
// Int array: 8.439799614
So Bool is a lot slower than Int, right? Let's run it through the Profiler by pressing Cmd + I and choosing the Time Profiler template. (Somehow the Profiler wasn't able to separate these functions, probably because they were inlined, so I had to run only one function per attempt):
let count = 100_000_000
useBoolArray(n: count)
// useIntArray(n: count)
// Bool: 1.15ms
// Int: 2.36ms
Not only are they an order of magnitude faster than in Debug, but the results are reversed: Bool is now faster than Int! The Profiler doesn't tell us why, so we must go on a witch hunt. Let's check the memory allocation by adding an Allocations instrument:
Ha! Now the differences are laid bare. The Bool array uses only one-eighth as much memory as the Int array. A Swift array keeps its elements in heap-allocated storage, and heap allocation is slow.
When you think about it some more: a Bool value needs only 1 bit of information, while an Int takes 64 bits on a 64-bit machine. Swift appears to represent a Bool with a single byte, while an Int takes 8 bytes, hence the memory ratio. In Debug, this may account for all of the difference, as the runtime must do all kinds of checks to ensure that it's actually dealing with a Bool value, so the Bool array method takes significantly longer.
Moral of the lesson: don't optimize your code in Debug mode. It can be misleading!
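The footprint argument above (one byte per flag versus eight) is easy to check in any language with fixed-size element types. As a hedged illustration, here is the same sieve sketched in Rust with a one-byte bool array and an eight-byte i64 array; count_primes_bool, count_primes_int and flag_bytes are hypothetical names, and this says nothing about how Swift's Debug runtime behaves.

```rust
use std::mem::size_of;

// Sieve of Eratosthenes over one-byte flags; returns the number of primes <= n.
pub fn count_primes_bool(n: usize) -> usize {
    if n < 2 {
        return 0;
    }
    let mut prime = vec![true; n + 1];
    prime[0] = false;
    prime[1] = false;
    let mut p = 2;
    while p * p <= n {
        if prime[p] {
            let mut i = p * 2;
            while i <= n {
                prime[i] = false;
                i += p;
            }
        }
        p += 1;
    }
    prime.iter().filter(|&&is_prime| is_prime).count()
}

// The same sieve over eight-byte integer flags (1 = prime, 0 = composite).
pub fn count_primes_int(n: usize) -> usize {
    if n < 2 {
        return 0;
    }
    let mut prime = vec![1i64; n + 1];
    prime[0] = 0;
    prime[1] = 0;
    let mut p = 2;
    while p * p <= n {
        if prime[p] == 1 {
            let mut i = p * 2;
            while i <= n {
                prime[i] = 0;
                i += p;
            }
        }
        p += 1;
    }
    prime.iter().filter(|&&flag| flag == 1).count()
}

// Bytes needed for the n + 1 flags when each flag has type T.
pub fn flag_bytes<T>(n: usize) -> usize {
    (n + 1) * size_of::<T>()
}
```

For n = 100_000_000, flag_bytes::<bool>(n) is about 100 MB and flag_bytes::<i64>(n) is about 800 MB, which is the same 1:8 ratio the Allocations instrument showed.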
(A partial answer ...)
As @MartinR mentions in his comments to the question, there is no such major difference between the two cases if you build in release mode (with optimizations); the Bool case is slightly faster due to its smaller memory footprint (but equally fast as e.g. UInt8, which has the same footprint).
Running Instruments to profile the (non-optimized) debug build, we clearly see that array element access and assignment is the culprit in the Bool case (and, as far as my brief testing has shown, for all types except the integer ones: Int, UInt16, and so on).
We can further ascertain that it's not the writing part in particular that causes the overhead, but rather the repeated access to the i:th element.
The same explicit read-access tests for an array of integer elements show no such large overhead.
It would almost seem as if the random element access is, for some reason, not working as it should (for non-integer types) when compiling with debug build config.

Simplifying enum definition?

I like to use some enums in Swift 3. They are also used as index to an array. So they are Int. I defined them as:
enum TypeOfArray: Int {
    case src = 0, dst, srcCache, n
    static var Start: Int { return 0 }
    static var End : Int { return n.rawValue - 1 }
    static let allValues = [src, srcCache, dst]
    init() {
        self = .n
    }
}
So using .Start and .End I can use them as loop limits. But whenever I use the names "src" or "dst" themselves, I have to add ".rawValue" to get the numeric value to use as an index.
Is there any way to make this more convenient and shorter? (It looks very complicated to me for such a simple task.)
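One general way to avoid repeating the raw-value conversion at every use is to wrap the array in a small type that is indexed by the enum directly, so the conversion lives in one place. The sketch below shows the idea in Rust rather than Swift, purely as an illustration; Buffers, KIND_COUNT and ALL are hypothetical names, and the sentinel case n from the question is replaced here by an explicit list of cases.

```rust
use std::ops::{Index, IndexMut};

// Number of real cases (the question's sentinel `n` is not needed here).
pub const KIND_COUNT: usize = 3;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum TypeOfArray {
    Src = 0,
    Dst,
    SrcCache,
}

impl TypeOfArray {
    // All cases in index order, usable as loop limits.
    pub const ALL: [TypeOfArray; KIND_COUNT] =
        [TypeOfArray::Src, TypeOfArray::Dst, TypeOfArray::SrcCache];
}

// Thin wrapper around a fixed-size array that accepts the enum as its index.
pub struct Buffers<T> {
    items: [T; KIND_COUNT],
}

impl<T> Buffers<T> {
    pub fn new(items: [T; KIND_COUNT]) -> Self {
        Buffers { items }
    }
}

impl<T> Index<TypeOfArray> for Buffers<T> {
    type Output = T;
    // The enum-to-integer conversion happens only here.
    fn index(&self, kind: TypeOfArray) -> &T {
        &self.items[kind as usize]
    }
}

impl<T> IndexMut<TypeOfArray> for Buffers<T> {
    fn index_mut(&mut self, kind: TypeOfArray) -> &mut T {
        &mut self.items[kind as usize]
    }
}
```

With this in place, call sites read buffers[TypeOfArray::Dst] and loops iterate over TypeOfArray::ALL. Swift allows a comparable design through a custom subscript on a wrapper type or an extension, though the details differ.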

Ruby native C big integer segfault

I'm working on a Ruby native C method: power mod. Here's what I got:
#define TO_BIGNUM(x) (FIXNUM_P(x) ? rb_int2big(FIX2LONG(x)) : x)
#define CONST2BIGNUM(x) (rb_int2big(x))

VALUE method_big_power_mod(VALUE self, VALUE base, VALUE exp, VALUE mod){
    VALUE res = CONST2BIGNUM(1);
    base = TO_BIGNUM(base);
    exp = TO_BIGNUM(exp);
    mod = TO_BIGNUM(mod);
    while (rb_big_cmp(exp, CONST2BIGNUM(0))) {
        if (rb_big_modulo(exp, CONST2BIGNUM(2))) {
            VALUE mul = rb_big_mul(res, base);
            res = rb_big_modulo(mul, mod);
        }
        base = rb_big_modulo(rb_big_pow(base, CONST2BIGNUM(2)), mod);
        exp = rb_big_div(exp, CONST2BIGNUM(2));
    }
    return res;
}
It segfaults every time. I isolated the problem to rb_big_modulo calls. gdb stacktrace says that it crashes in the bigdivrem method after calling rb_big_modulo. I tried to look through the source of bignum.c, but I can't figure out what's causing the crash. Am I doing something wrong?
There are two problems that are causing the segfault:
1 - The rb_big_* functions don't always return a Bignum object, but when you call them the first argument must be a Bignum object. For example:
if (rb_big_modulo(exp, CONST2BIGNUM(2))) {
    VALUE mul = rb_big_mul(res, base); // This may return a Fixnum
    res = rb_big_modulo(mul, mod); // This will cause a segfault :(
}
2 - When you call rb_big_pow with two Bignum arguments, it warns and returns a Float object, which you can't easily convert to a Bignum. So you should replace the line where you call it with:
VALUE x = TO_BIGNUM(rb_big_pow(base, INT2NUM(2))); // Raise to a Fixnum power instead of a Bignum
base = TO_BIGNUM(rb_big_modulo(x , mod));
The final implementation will be:
#define TO_BIGNUM(x) (FIXNUM_P(x) ? rb_int2big(FIX2LONG(x)) : x)
#define CONST2BIGNUM(x) (rb_int2big(x))

VALUE method_big_power_mod(VALUE self, VALUE base, VALUE exp, VALUE mod){
    VALUE res = CONST2BIGNUM(1);
    base = TO_BIGNUM(base);
    exp = TO_BIGNUM(exp);
    mod = TO_BIGNUM(mod);
    while (rb_big_cmp(exp, CONST2BIGNUM(0))) {
        if (rb_big_modulo(exp, CONST2BIGNUM(2))) {
            VALUE mul = TO_BIGNUM(rb_big_mul(res, base));
            res = TO_BIGNUM(rb_big_modulo(mul, mod));
        }
        VALUE x = TO_BIGNUM(rb_big_pow(base, INT2NUM(2)));
        base = TO_BIGNUM(rb_big_modulo(x , mod));
        exp = TO_BIGNUM(rb_big_div(exp, CONST2BIGNUM(2)));
    }
    return res;
}
I don't know the performance impact of all these conversions. Maybe you should test whether each value is a Fixnum or a Bignum and calculate it using the proper function, or benchmark both approaches.
When I ran it, I ran into an infinite loop, but I don't know whether I called it with the correct values.
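For checking results independently of the Ruby C API, the underlying algorithm (square-and-multiply modular exponentiation) can be written with plain machine integers, where the loop condition and the parity test are ordinary integer comparisons. The following is a minimal sketch in Rust, limited to 64-bit operands (an assumption made here; the original works on arbitrary-size Bignums). mod_pow is a name chosen for illustration.

```rust
// Computes (base ^ exp) mod modulus by repeated squaring.
// Intermediate products are held in u128 so they cannot overflow.
pub fn mod_pow(base: u64, exp: u64, modulus: u64) -> u64 {
    assert!(modulus != 0, "modulus must be non-zero");
    if modulus == 1 {
        return 0;
    }
    let m = modulus as u128;
    let mut result: u128 = 1;
    let mut b = (base as u128) % m;
    let mut e = exp;
    while e > 0 {
        // Multiply the current power into the result when the low bit is set.
        if e & 1 == 1 {
            result = result * b % m;
        }
        b = b * b % m;
        e >>= 1;
    }
    result as u64
}
```

Small cases such as mod_pow(4, 13, 497) can then be compared against what the extension method returns for the same arguments.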

How to efficiently merge two hashes in Ruby C API?

I am writing a C extension for Ruby that really needs to merge two hashes, however the rb_hash_merge() function is STATIC in Ruby 1.8.6. I have tried instead to use:
rb_funcall(hash1, rb_intern("merge"), 1, hash2);
but this is much too slow, and performance is very critical in this application.
Does anyone know how to go about performing this merge with efficiency and speed in mind?
(Note: I have tried simply looking at the source for rb_hash_merge() and replicating it, but it is RIDDLED with other static functions, which are themselves riddled with yet more static functions, so it seems almost impossible to disentangle... I need another way.)
OK, it looks like it might not be possible to optimize this within the published API.
Test code:
# extconf.rb
require 'mkmf'
create_makefile("hello")

// hello.c
#include "ruby.h"

static VALUE rb_mHello;
static VALUE rb_cMyCalc;

static void calc_mark(void *f) { }
static void calc_free(void *f) { }
static VALUE calc_alloc(VALUE klass) { return Data_Wrap_Struct(klass, calc_mark, calc_free, NULL); }
static VALUE calc_init(VALUE obj) { return Qnil; }

/* Variant 1: call Hash#merge through the method dispatcher. */
static VALUE calc_merge(VALUE obj, VALUE h1, VALUE h2) {
    return rb_funcall(h1, rb_intern("merge"), 1, h2);
}

/* Variant 2: copy both hashes into a new one, key by key.
   Note: rb_each() calls #each on its argument; it does not return the
   next element, so these loops only sketch the intended per-key copy. */
static VALUE
calc_merge2(VALUE obj, VALUE h1, VALUE h2)
{
    VALUE h3 = rb_hash_new();
    VALUE keys;
    VALUE akey;
    keys = rb_funcall(h1, rb_intern("keys"), 0);
    while (akey = rb_each(keys)) {
        rb_hash_aset(h3, akey, rb_hash_aref(h1, akey));
    }
    keys = rb_funcall(h2, rb_intern("keys"), 0);
    while (akey = rb_each(keys)) {
        rb_hash_aset(h3, akey, rb_hash_aref(h2, akey));
    }
    return h3;
}

/* Variant 3: copy h1 into h2 in place and return h2. */
static VALUE
calc_merge3(VALUE obj, VALUE h1, VALUE h2)
{
    VALUE keys;
    VALUE akey;
    keys = rb_funcall(h1, rb_intern("keys"), 0);
    while (akey = rb_each(keys)) {
        rb_hash_aset(h2, akey, rb_hash_aref(h1, akey));
    }
    return h2;
}

void Init_hello()
{
    rb_mHello = rb_define_module("Hello");
    rb_cMyCalc = rb_define_class_under(rb_mHello, "Calculator", rb_cObject);
    rb_define_alloc_func(rb_cMyCalc, calc_alloc);
    rb_define_method(rb_cMyCalc, "initialize", calc_init, 0);
    /* Note: as written, merge2 and merge3 are also bound to calc_merge,
       so all three Ruby methods run the same C function. */
    rb_define_method(rb_cMyCalc, "merge", calc_merge, 2);
    rb_define_method(rb_cMyCalc, "merge2", calc_merge, 2);
    rb_define_method(rb_cMyCalc, "merge3", calc_merge, 2);
}
# test.rb
require "hello"
h1 = Hash.new()
h2 = Hash.new()
1.upto(100000) { |x| h1[x] = x+1; }
1.upto(100000) { |x| h2["#{x}-12"] = x+1; }
c = Hello::Calculator.new()
puts c.merge(h1, h2).keys.length if ARGV[0] == "1"
puts c.merge2(h1, h2).keys.length if ARGV[0] == "2"
puts c.merge3(h1, h2).keys.length if ARGV[0] == "3"
Now the test results:
$ time ruby test.rb
real 0m1.021s
user 0m0.940s
sys 0m0.080s
$ time ruby test.rb 1
real 0m1.224s
user 0m1.148s
sys 0m0.076s
$ time ruby test.rb 2
real 0m1.219s
user 0m1.132s
sys 0m0.084s
$ time ruby test.rb 3
real 0m1.220s
user 0m1.128s
sys 0m0.092s
So it looks like we might shave off at maximum ~0.004s on a 0.2s operation.
Given that there's probably not much going on besides setting the values, there might not be much room for further optimization. You could try to hack the Ruby source itself, but at that point you are no longer developing an "extension" but rather changing the language, so it probably won't work.
If merging hashes is something that you need to do many times in the C part, then using internal data structures and only exporting them into a Ruby hash in the final pass would probably be the only way of optimizing things.
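The last suggestion (keep the data in a native structure, merge there, and convert to a Ruby hash only once at the end) can be pictured with any native map type. Below is a minimal sketch in Rust using std::collections::HashMap as a stand-in for such an internal structure; merge and merge_into are hypothetical names, and nothing here touches the Ruby C API.

```rust
use std::collections::HashMap;

// Returns a new map with the entries of both inputs.
// On a key collision the value from `b` wins, as with Ruby's Hash#merge.
pub fn merge(a: &HashMap<String, i64>, b: &HashMap<String, i64>) -> HashMap<String, i64> {
    let mut out = HashMap::with_capacity(a.len() + b.len());
    out.extend(a.iter().map(|(k, v)| (k.clone(), *v)));
    out.extend(b.iter().map(|(k, v)| (k.clone(), *v)));
    out
}

// In-place variant: copies every entry of `source` into `target`,
// overwriting existing keys and avoiding a third map.
pub fn merge_into(target: &mut HashMap<String, i64>, source: &HashMap<String, i64>) {
    target.extend(source.iter().map(|(k, v)| (k.clone(), *v)));
}
```

The in-place variant mirrors the idea behind calc_merge3: when one of the inputs can be sacrificed, the merge costs one pass over the other input instead of a pass over both.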
P.S. The initial skeleton for the code was borrowed from this excellent tutorial.
