Estimating the memory footprint of Erlang data structures (coming from C)

As a former C programmer and current Erlang hacker, one question keeps coming up: how do I estimate the memory footprint of my Erlang data structures?
Say I had an array of 1k integers in C. Estimating its memory demand is easy: the size of the array times the size of an integer, so 1k 32-bit integers take up 4 kB of memory, plus a constant amount of pointers and indexes.
In Erlang, however, estimating memory usage is more complicated. How much memory does an entry in Erlang's array structure take up? How do I estimate the size of a dynamically sized integer?
I have noticed that scanning over integers in an array is fairly slow in Erlang: scanning an array of about 1M integers takes almost a second, whereas a simple piece of C code does it in around 2 ms. This is most likely due to the amount of memory taken up by the data structure.
I'm asking not because I'm a speed freak, but because estimating memory has, at least in my experience, been a good way of determining the scalability of software.
My test code:
First, the C code:
#include <cstdio>
#include <cstdlib>
#include <cstring> // for memcpy
#include <ctime>
#include <iostream>
class DynamicArray {
protected:
    int* array;
    unsigned int size;
    unsigned int max_size;
public:
    DynamicArray() {
        array = new int[1];
        size = 0;
        max_size = 1;
    }
    ~DynamicArray() {
        delete[] array;
    }
    void insert(int value) {
        if (size == max_size) {
            int* old_array = array;
            array = new int[max_size * 2];
            memcpy(array, old_array, sizeof(int) * size); // copy existing elements
            max_size *= 2;
            delete[] old_array;
        }
        array[size] = value;
        size++;
    }
    inline int read(unsigned idx) const {
        return array[idx];
    }
    void print_array() {
        for (unsigned int i = 0; i != size; i++)
            printf("%d ", array[i]);
        printf("\n");
    }
    int size_of() const {
        return max_size * sizeof(int);
    }
};
void test_array(int test) {
    printf(" %d ", test);
    clock_t t1, t2;
    t1 = clock();
    DynamicArray arr;
    for (int i = 0; i != test; i++) {
        //arr.print_array();
        arr.insert(i);
    }
    int val = 0;
    for (int i = 0; i != test; i++)
        val += arr.read(i);
    printf(" size %g MB ", (arr.size_of() / (1024 * 1024.0)));
    t2 = clock();
    float diff((float)t2 - (float)t1);
    std::cout << diff / 1000 << " ms";
    printf(" %d \n", val == ((1 + test) * test) / 2);
}

int main(int argc, char** argv) {
    int size = atoi(argv[1]);
    printf(" -- STARTING --\n");
    test_array(size);
    return 0;
}
And the Erlang code:
-module(test).
-export([go/1]).

construct_list(Arr, Idx, Idx) ->
    Arr;
construct_list(Arr, Idx, Max) ->
    construct_list(array:set(Idx, Idx, Arr), Idx + 1, Max).

sum_list(_Arr, Idx, Idx, Sum) ->
    Sum;
sum_list(Arr, Idx, Max, Sum) ->
    sum_list(Arr, Idx + 1, Max, array:get(Idx, Arr) + Sum).

go(Size) ->
    A0 = array:new(Size),
    A1 = construct_list(A0, 0, Size),
    sum_list(A1, 0, Size, 0).
Timing the C code:
bash-3.2$ g++ -O3 test.cc -o test
bash-3.2$ ./test 1000000
-- STARTING --
1000000 size 4 MB 5.511 ms 0
And the Erlang code:
1> f(Time), {Time, _} =timer:tc(test, go, [1000000]), Time/1000.0.
2189.418

First, an Erlang variable is always just a single word (32 or 64 bits, depending on your machine). Two or more bits of the word are used as a type tag. The remainder can hold an "immediate" value, such as a "fixnum" integer, an atom, an empty list ([]), or a Pid; or it can hold a pointer to data stored on the heap (tuple, list, "bignum" integer, float, etc.). A tuple has a header word specifying its type and length, followed by one word per element. A list cell uses only 2 words (its pointer already encodes the type): the head and tail elements.
For example: if A={foo,1,[]}, then A is a word pointing to a word on the heap saying "I'm a 3-tuple" followed by 3 words containing the atom foo, the fixnum 1, and the empty list, respectively. If A=[1,2], then A is a word saying "I'm a list cell pointer" pointing to the head word (containing the fixnum 1) of the first cell; and the following tail word of the cell is yet another list cell pointer, pointing to a head word containing the 2 and followed by a tail word containing the empty list. A float is represented by a header word and 8 bytes of double precision floating-point data. A bignum or a binary is a header word plus as many words as needed to hold the data. And so on. See e.g. http://stenmans.org/happi_blog/?p=176 for some more info.
To estimate size, you need to know how your data is structured in terms of tuples and lists, and you need to know the size of your integers (if too large, they will use a bignum instead of a fixnum; the limit is 28 bits incl. sign on a 32-bit machine, and 60 bits on a 64-bit machine).
Edit: https://github.com/happi/theBeamBook is a newer good resource on the internals of the BEAM Erlang virtual machine.

Is this what you want?
1> erts_debug:size([1,2]).
4
With it you can at least figure out how big a term is. The size returned is in words.

Erlang stores large integers as arbitrary-precision "bignums" (essentially arrays of digits), so you cannot estimate their size the same way as in C; you can only predict how large your integers will get and calculate the average number of bytes needed to store them.
Check http://www.erlang.org/doc/efficiency_guide/advanced.html, and you can use the erlang:memory() function to determine the actual amount.

Related

Do arrays in C have a maximum index size of 2048?

I've written a piece of code that uses a static array of size 3000.
Ordinarily, I would just use a for loop to scan in 3000 values, but it appears that I can only ever scan in a maximum of 2048 numbers. To me that seems like an issue with memory allocation, but I'm not sure.
The problem arises because I do not want the user to state in advance how many numbers they intend to input. They should be able to input however many numbers they want and terminate the scan by inputting 0, after which the program does its work. (Otherwise I would just use malloc.)
The code is a fairly simple number occurrence counter, found below:
#include <stdio.h>

int main(int argc, char **argv)
{
    int c;
    int d;
    int j = 0;
    int temp;
    int array[3000];
    int i;

    // scanning in elements (3000 used because no explicit length is given)
    for (i = 0; i < 3000; i++)
    {
        scanf("%d", &array[i]);
        if (array[i] == 0)
        {
            break;
        }
    }

    // sorting
    for (c = 0; c < i - 1; c++) {
        for (d = 0; d < i - c - 1; d++) {
            if (array[d] > array[d + 1]) {
                temp = array[d]; // swaps
                array[d] = array[d + 1];
                array[d + 1] = temp;
            }
        }
    }

    int arrayLength = i + 1; // saving current 'i' value to use as 'n' value before reset
    for (i = 0; i < arrayLength; i = j)
    {
        int numToCount = array[i];
        int occurrence = 1; // if a number has been found, the occurrence is at least 1
        for (j = i + 1; j < arrayLength; j++) // new loop starts at current position + 1 to check for duplicates
        {
            if (array[j] != numToCount) // prints once it knows how many occurrences there are
            {
                printf("%d: %d\n", numToCount, occurrence);
                break; // keeps 'j' at whatever value is NOT numToCount, so 'i = j' restarts at the right number
            } else {
                occurrence++;
            }
        }
    }
    return 0;
}
This code works perfectly for any number of inputs below 2048. An example of it not working would be inputting: 1000 1s, 1000 2s, and 1000 3s, after which the program would output:
1: 1000
2: 1000
3: 48
My question is whether there is any way to fix this so that the program will output the right amount of occurrences.
To answer your title question: The size of an array in C is limited (in theory) only by the maximum value that can be represented by a size_t variable. This is typically a 32- or 64-bit unsigned integer, so you can have (for the 32-bit case) over 4 billion elements (or much, much more in 64-bit systems).
However, what you are probably encountering in your code is a limit on the memory available to the program, where the line int array[3000]; declares an automatic variable. Space for these is generally allocated on the stack - which is a chunk of memory of limited size made available when the function (or main) is called. This memory has limited size and, in your case (assuming 32-bit, 4-byte integers), you are taking 12,000 bytes from the stack, which may cause problems.
There are two (maybe more) ways to fix the problem. First, you could declare the array static - this would make the compiler pre-allocate the memory, so it would not need to be taken from the stack at run-time:
static int array[3000];
A second, probably better, approach would be to call malloc to allocate memory for the array; this assigns memory from the heap - which has (on almost all systems) considerably more space than the stack. It is often limited only by the available virtual memory of the operating system (many gigabytes on most modern PCs):
int *array = malloc(3000 * sizeof(int));
Also, the advantage of using malloc is that if, for some reason, there isn't enough memory available, the function will return NULL, and you can test for this.
You can access the elements of the array in the same way, using array[i] for example. Of course, you should be sure to release the memory when you've done with it, at the end of your function:
free(array);
(This will be done automatically in your case, when the program exits, but it's good coding style to get used to doing it explicitly!)

identify bits set in bitmap and print them in string

Given an unsigned 64-bit integer with multiple bits set, I want to process the bitmap, identify the positions of the set bits, and return a string naming those positions.
Example: the unsigned integer 12 is 1100 in binary, which means the third and fourth bits are set. This should print THREE FOUR.
The function takes an unsigned int and returns a string.
I looked some pieces of code and i don't see this as a dup of some other question.
char* unsigned_int_to_string(unsigned long int n)
{
    unsigned int count = 0;
    while (n)
    {
        count += n & 1;
        n >>= 1;
    }
    /*** Need help to fill this block ***/
    /*** should return the string THREE FOUR ***/
}
#include <stdio.h>

int main()
{
    unsigned long int i = 12;
    printf("%s", unsigned_int_to_string(i));
    return 0;
}
You could brute force it by having a lookup table which has a word representation for each bit you're interested in.
char* bit_to_word[10] = { "ONE","TWO","THREE","FOUR","FIVE","SIX","SEVEN","EIGHT","NINE","TEN" }; // and so forth...
Then in your function check every bit and, if it is set, concatenate the corresponding word from your bit_to_word array. You can do this safely with strcat_s (note that strcat_s comes from the optional C11 Annex K and is mainly available with MSVC; plain strcat with a large enough buffer works elsewhere).
strcat_s(number_string, BUF_SIZE, bit_to_word[i]);
One gotcha. After the first word you will want to add a space as well so you might want to keep track of that.
This code checks the first 10 bits of the number and prints out THREE FOUR for the test case. Be aware though that it doesn't do any memory cleanup.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define BUF_SIZE 2048

char* bit_to_word[10] = { "ONE","TWO","THREE","FOUR","FIVE","SIX","SEVEN","EIGHT","NINE","TEN" };

char* unsigned_int_to_string(unsigned long int n)
{
    char* number_string = (char*)malloc(BUF_SIZE);
    memset(number_string, 0, BUF_SIZE);
    int first_word = 1;
    unsigned long int tester = 1;
    int err;

    for (unsigned long int i = 0; i < 10; i++)
    {
        if (tester & n)
        {
            if (!first_word)
            {
                strcat_s(number_string, BUF_SIZE, " ");
            }
            err = strcat_s(number_string, BUF_SIZE, bit_to_word[i]);
            if (err)
            {
                printf("Something went wrong...\n");
            }
            first_word = 0;
        }
        tester <<= 1;
    }
    return number_string;
}

int main(int argc, char** argv)
{
    char* res = unsigned_int_to_string(0b1100);
    printf("%s\n", res);
}
Without writing the actual code, here is a description of a simple algorithm based on a 64-element lookup table of strings: 0 = ZERO, 1 = ONE, 2 = TWO, ..., 63 = SIXTY THREE. The table is a 64-element array of strings. In C you could make it a static 2D array with char[256] rows to hold your strings (or optimize by using the length of the longest string + 1), or you could allocate it dynamically with malloc in a for loop.
You then define your output string.
You then write a For Loop, iterating through all the bits using a bit mask (using left shift) if the Bit is set you can concatenate your output string (using strcat) with a space and the contents of your lookup table for that bit position.
Here is a brief code snippet showing how to do the concatenation. (Make sure your output string has enough memory to hold the largest result. If you want to be more sophisticated and optimize memory usage, you could use malloc and realloc, but then you have to deal with freeing the memory when it is no longer needed.)
#include <stdio.h>
#include <string.h>

int main()
{
    char str[80];
    strcpy(str, "these ");
    strcat(str, "strings ");
    strcat(str, "are ");
    strcat(str, "concatenated.");
    puts(str);
    return 0;
}
In your case, bit 3 will be encountered as the first set bit and the output string will then contain "THREE", then on the next iteration bit 4 will be detected as set and the output will be appended as "THREE FOUR".
Note: Because this appears to be an academic problem, I would like to point out the classic complexity vs. space trade-off here. My description above minimizes complexity at the expense of space, meaning you will have 64 strings with redundancy among many of them. For example, TWENTY TWO, THIRTY TWO, FORTY TWO, FIFTY TWO, and SIXTY TWO all contain the string "TWO". Space could be optimized by using far fewer strings: ZERO through NINETEEN, then TWENTY, THIRTY, FORTY, FIFTY, SIXTY. However, your indexing logic becomes more complicated for bits of TWENTY and above: for bit 21 you need to concatenate TWENTY and ONE.

How can this combination algorithm be modified to run in parallel on a cuda enabled gpu?

I currently have a c program that generates all possible combinations of a character string. Please note combinations, not permutations. This is the program:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
//Constants
static const char set[] = "abcd";
static const int setSize = sizeof(set) - 1;
void brute(char* temp, int index, int max) {
    //Declarations
    int i;
    for (i = 0; i < setSize; i++) {
        temp[index] = set[i];
        if (index == max - 1) {
            printf("%s\n", temp);
        }
        else {
            brute(temp, index + 1, max);
        }
    }
}

void length(int max_len) {
    //Declarations
    char* temp = (char *) malloc(max_len + 1);
    int i;
    //Execute
    for (i = 1; i <= max_len; i++) {
        memset(temp, 0, max_len + 1);
        brute(temp, 0, i);
    }
    free(temp);
}

int main(void) {
    //Execute
    length(2);
    getchar();
    return 0;
}
The maximum length of the combinations can be modified; it is currently set to 2 for demonstration purposes. Given how it's currently configured, the program outputs
a, b, c, d, aa, ab, ac, ad, ba, bb, bc, bd, ca, cb, cc...
I've managed to translate this program into cuda:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
//On-Device Memory
__constant__ char set_d[] = "abcd";
__constant__ int setSize_d = 4;

__device__ void brute(char* temp, int index, int max) {
    //Declarations
    int i;
    for (i = 0; i < setSize_d; i++) {
        temp[index] = set_d[i];
        if (index == max - 1) {
            printf("%s\n", temp);
        }
        else {
            brute(temp, index + 1, max);
        }
    }
}

__global__ void length_d(int max_len) {
    //Declarations
    char* temp = (char *) malloc(max_len + 1);
    int i;
    //Execute
    for (i = 1; i <= max_len; i++) {
        memset(temp, 0, max_len + 1);
        brute(temp, 0, i);
    }
    free(temp);
}

int main()
{
    //Execute
    cudaSetDevice(0);
    //Launch Kernel
    length_d<<<1, 1>>>(2);
    cudaDeviceSynchronize();
    getchar(); //Keep this console open...
    return 0;
}
The cuda version of the original program is basically an exact copy of the c program (note that it is being compiled with -arch=sm_20. Therefore, printf and other host functions work in the cuda environment).
My goal is to compute combinations of a-z, A-Z, 0-9, and other characters of maximum lengths up to 10. That being the case, I want this program to run on my gpu. As it is now, it does not take advantage of parallel processing - which obviously defeats the whole purpose of writing the program in cuda. However, I'm not sure how to remove the recursive nature of the program in addition to delegating the threads to a specific index or starting point.
Any constructive input is appreciated.
Also, I get an occasional warning message on successive compiles (meaning it sporadically appears): warning : Stack size for entry function '_Z8length_di' cannot be statically determined.
I haven't pinpointed the problem yet, but I figured I would post it in case anyone identified the cause before I can. It is being compiled in visual studio 2012.
Note: I found this to be fairly interesting. As the cuda program is now, its output to the console is periodic - meaning that it prints a few dozen combinations, pauses, prints a few dozen combinations, pauses, and so forth. I also observe this behavior in its reported gpu usage - it periodically swings from 5% to 100%.
I don't think you need to use recursion for this. (I wouldn't).
Using printf from the kernel is problematic for large amounts of output; it's not really designed for that purpose. And printf from the kernel eliminates any speed benefit the GPU might have. And I assume if you're testing a large vector space like this, your goal is not to print out every combination. Even if that were your goal, printf from the kernel is not the way to go.
Another issue you will run into is storage for the entire vector space you have in mind. If you have some processing you intend to do on each vector and can then discard it, storage is not an issue. But storage for a vector space of length n=10 with "digits" (elements) that have k=62 or more possible values (a..z, A..Z, 0..9, etc.) will be huge. It's given by k^n, so in this example that would be 62^10 different vectors. If each digit required a byte to store it, that would be roughly 8.4 * 10^17 bytes - on the order of 800 million gigabytes. So this pretty much dictates that storage of the entire vector space is out of the question. Whatever work you're going to do, you're going to have to do it on the fly.
Given the above discussion, this answer should have pretty much everything that you need. The vector digits are handled as unsigned int, you can create whatever mapping you want between unsigned int and your "digits" (i.e. a..z, A..Z, 0..9, etc.) In that example, the function that was performed on each vector was testing if the sum of the digits matched a certain value, so you could replace this line in the kernel:
if (vec_sum(vec, n) == sum) atomicAdd(count, 1UL);
with whatever function and processing you wanted to apply to each vector generated. You could even put a printf here, but for larger spaces the output will be fragmented and incomplete.

Listing prime numbers up to 2 billion using sieve's method in C

I tried listing the prime numbers up to 2 billion using the Sieve of Eratosthenes. Here is what I used.
The problem I am facing is that I am not able to go beyond 10 million numbers. When I tried, it says 'Segmentation fault'. I searched the Internet for the cause; some sites say it is a memory allocation limitation of the compiler itself, others say it is a hardware limitation. I am using a 64-bit processor with 4 GB of RAM installed. Please suggest a way to list them.
#include <stdio.h>
#include <stdlib.h>

#define MAX 1000000

long int mark[MAX] = {0};

int isone() {
    long int i;
    long int cnt = 0;
    for (i = 0; i < MAX; i++) {
        if (mark[i] == 1)
            cnt++;
    }
    if (cnt == MAX)
        return 1;
    else
        return 0;
}

int main(int argc, char* argv[]) {
    long int list[MAX];
    long int s = 2;
    long int i;
    for (i = 0; i < MAX; i++) {
        list[i] = s;
        s++;
    }
    s = 2;
    printf("\n");
    while (isone() == 0) {
        for (i = 0; i < MAX; i++) {
            if ((list[i] % s) == 0)
                mark[i] = 1;
        }
        printf(" %ld ", s);
        while (mark[++s - 2] != 0);
    }
    return 1;
}
long int list[MAX] inside main does stack allocation, which is not what you want for that amount of memory (the global mark array is fine, since it has static storage duration). Try long int *list = malloc(sizeof(long int) * 1000000) instead to allocate heap memory. This will get you beyond ~1 million array elements.
Remember to free the memory if you don't use it anymore. If you don't know malloc or free, read the man pages (manuals) for the functions, available via man 3 malloc and man 3 free on any Linux terminal. (Alternatively you could just google "man malloc".)
EDIT: make that calloc(1000000, sizeof(long int)) to have a 0-initialized array, which is probably better.
Additionally, you can use every element of your array as a bitmask, to be able to store one mark per bit, and not per sizeof(long int) bytes. I'd recommend using a fixed-width integer type, like uint32_t for the array elements and then setting the (n % 32)'th bit in the (n / 32)'th element of the array to 1 instead of just setting the nth element to 1.
You can set the nth bit of an integer i by using:
uint32_t i = 0;
i |= ((uint32_t)1) << n;
assuming you start counting at 0.
That makes your set operation on the uint32_t bitmask array for a number n:
mask[n / 32] |= ((uint32_t)1) << (n % 32);
That saves you about 97% of the memory (1 bit per mark instead of a whole 32-bit word, and even more compared to long int). Have fun :D
Another, more advanced approach to use here is prime wheel factorization, which basically means that you declare 2,3 and 5 (and possibly even more) as prime beforehand, and use only numbers that are not divisible by one of these in your mask array. But that's a really advanced concept.
However, I have written a prime sieve with wheel factorization for 2 and 3 in C in about ~15 lines of code (also for Project Euler), so it is possible to implement this stuff efficiently ;)
The most immediate improvement is to switch to bits representing the odd numbers. Thus to cover the M = 2 billion numbers, i.e. 1 billion odds, you need 1,000,000,000 / 8 = 125 million bytes =~ 120 MB of memory (allocate them on the heap, still, with the calloc function).
The bit at position i will represent the number 2*i+1. Thus when marking the multiples of a prime p, i.e. p^2, p^2+2p, ..., M, we have p^2=(2i+1)^2=4i^2+4i+1 represented by a bit at the position j=(p^2-1)/2=2i(i+1), and next multiples of p above it at position increments of p=2i+1,
for (i = 1; ; ++i)
    if (bit_not_set(i))
    {
        p = i + i + 1;
        k = (p - 1) * (i + 1);
        if (k > 1000000000) break;
        for (; k < 1000000000; k += p)
            set_bit(k); // mark as composite
    }
// all bits i > 0 where bit_not_set(i) holds
// represent prime numbers 2i+1
// all bits i>0 where bit_not_set(i) holds,
// represent prime numbers 2i+1
The next step is to switch to working in smaller segments that fit in your cache. This should speed things up. You will only need to reserve a memory region for the primes under the square root of 2 billion, i.e. 44721.
First, sieve this smaller region to find the primes there; then write these primes into a separate int array; then use this array of primes to sieve each segment, possibly printing the found primes to stdout or whatever.

storing known sequences in c

I'm working on Project Euler #14 in C and have figured out the basic algorithm; however, it runs insufferably slowly for large inputs, e.g. the required 2,000,000. I presume this is because it generates each sequence over and over again, even though there should be a way to store known sequences (e.g., once we reach 16, we know from previous experience that the next numbers are 8, 4, 2, then 1).
I'm not exactly sure how to do this with C's fixed-length arrays, but there must be a good way (that's amazingly efficient, I'm sure). Thanks in advance.
Here's what I currently have, if it helps.
#include <stdio.h>

#define UPTO 2000000

int collatzlen(int n);

int main() {
    int i, l = -1, li = -1, c = 0;
    for (i = 1; i <= UPTO; i++) {
        if ((c = collatzlen(i)) > l) l = c, li = i;
    }
    printf("Greatest length:\t\t%7d\nGreatest starting point:\t%7d\n", l, li);
    return 1;
}

/* n != 0 */
int collatzlen(int n) {
    int len = 0;
    while (n > 1) n = (n % 2 == 0 ? n / 2 : 3 * n + 1), len += 1;
    return len;
}
Your original program needs 3.5 seconds on my machine. Is it insufferably slow for you?
My dirty and ugly version needs 0.3 seconds. It uses a global array to store the values already calculated. And use them in future calculations.
int collatzlen2(unsigned long n);

static unsigned long array[2000000 + 1]; // to store those already calculated

int main()
{
    int i, l = -1, li = -1, c = 0;
    int x;
    for (x = 0; x < 2000000 + 1; x++) {
        array[x] = -1; // use -1 to denote not-calculated-yet
    }
    for (i = 1; i <= UPTO; i++) {
        if ((c = collatzlen2(i)) > l) l = c, li = i;
    }
    printf("Greatest length:\t\t%7d\nGreatest starting point:\t%7d\n", l, li);
    return 1;
}

int collatzlen2(unsigned long n) {
    unsigned long len = 0;
    unsigned long m = n;
    while (n > 1) {
        if (n > 2000000 || array[n] == -1) { // outside range or not calculated yet
            n = (n % 2 == 0 ? n / 2 : 3 * n + 1);
            len += 1;
        }
        else { // if already calculated, use the value
            len += array[n];
            n = 1; // to get out of the while loop
        }
    }
    array[m] = len;
    return len;
}
Given that this is essentially a throw-away program (i.e. once you've run it and got the answer, you're not going to be supporting it for years :), I would suggest having a global array to hold the lengths of sequences already calculated:
int lengthfrom[UPTO] = {0};
If your maximum size is a few million, then we're talking megabytes of memory, which should easily fit in RAM at once.
The above will initialise the array to zeros at startup. In your program - for each iteration, check whether the array contains zero. If it does - you'll have to keep going with the computation. If not - then you know that carrying on would go on for that many more iterations, so just add that to the number you've done so far and you're done. And then store the new result in the array, of course.
Don't be tempted to use a local variable for an array of this size: that will try to allocate it on the stack, which won't be big enough and will likely crash.
Also - remember that with this sequence the values go up as well as down, so you'll need to cope with that in your program (probably by having the array longer than UPTO values, and using an assert() to guard against indices greater than the size of the array).
If I recall correctly, your problem isn't a slow algorithm: the algorithm you have now is fast enough for what PE asks you to do. The problem is overflow: you sometimes end up multiplying your number by 3 so many times that it will eventually exceed the maximum value that can be stored in a signed int. Use unsigned ints, and if that still doesn't work (but I'm pretty sure it does), use 64 bit ints (long long).
This should run very fast, but if you want to do it even faster, the other answers already addressed that.
