Array manipulation in OCaml - arrays

I am manipulating 2-dimensional arrays in OCaml. I have some questions:
How to declare an array whose length is of type int64, instead of int? For instance, Array.make : int -> 'a -> 'a array, what if I need a bigger array whose index is of type int64?
May I write something like the following:
let array = Array.make_matrix 10 10 0 in
array.(1).(2) <- 5; array.(3).(4) <- 20; (* where I modify a part of values in array)
f array ...
...
The code above seems to me unnatural, because we modify the value of array inside the let, do I have to this, or is there a more natural way to do this?
Could anyone help? thank you very much!

On 64-bit systems, the size of OCaml arrays from the Array module is limited to 2^54 - 1 and on 32-bit systems the limit is 4,194,303. For arrays of float, the limit is 2 times smaller. In both cases the index is easily represented as an int, so there's no advantage in using int64 as an index.
The value for 32-bit systems is way too small for some problems, so there is another module named Bigarray that can represent larger arrays. It supports much larger arrays, but the indices are still int. If you really need to have large indices, you are possibly on a 64-bit system where this isn't such a limitation. If not, you're going to run out of address space anyway, I would think. Maybe what you really want is a hash table?
I'm not sure what you're saying about "let". The purpose of let is to give something a name. It's not unreasonable to give the array a name before you start storing values into it. If you want to define the values at the time you create the array you can use Array.init and write an arbitrary function for setting the array values.
Array code in OCaml is inherently imperative, so you will usually end up with code that has that look to it. I often use begin and end and just embrace the Algolic quality of it.

Related

Indexing array by tuples in Julia?

I would like to create (in Julia) a 2 dimensional array Y storing the spherical harmonics Y_lm(x) evaluated at some fixed x, indexed by an integer l>=0 and -l<=m<=l.
How can I create the array Y such that I may access elements via tuples, e.g, to access Y_20(x) I would call Y[(2,0)]?
More generally does Julia allow arrays indexed by tuples (x1,...xn) if we don't know anything about the possible range of the xi (like a dictionary, but indexed by tuples of integers instead of strings)?
Short answer, that isn't what arrays are "for" in Julia, this is what Dict is for. In Julia (and many languages) what is generally meant by an array is something that is indexed by a series of contiguous integer values. (That said, you can implement your own object implementing the Array interface that might work differently...).
A Dict allows for any arbitrary set of indices, that can be any type you want, not just strings. For example:
Y = Dict()
Y[(2,0)] = "Hello, World"
println(Y[(2,0)])
For your particular problem there may be a more efficient solution, but I don't know enough about spherical harmonics to know what it would be. It would be worth looking at the package mentioned in the comments. It probably has a more idiomatic approach.

Why [1:2] != Array[1:2]

I am learning Julia following the Wikibook, but I don't understand why the following two commands give different results:
julia> [1:2]
1-element Array{UnitRange{Int64},1}:
1:2
julia> Array[1:2]
1-element Array{Array,1}:
[1,2]
Apologies if there is an explanation I haven't seen in the Wikibook, I have looked briefly but didn't find one.
Type[a] runs convert on the elements, and there is a simple conversion between a Range to an Array (collect). So Array[1:2] converts 1:2 to an array, and then makes an array of objects like that. This is the same thing as why Float64[1;2;3] is an array of Float64.
These previous parts answer answered the wrong thing. Oops...
a:b is not an array, it's a UnitRange. Why would you create an array for A = a:b? It only takes two numbers to store it, and you can calculate A[i] basically for free for any i. Using an array would take an amount of memory which is proportional to the b-a, and thus for larger arrays would take a lot of time to allocate, whereas allocation for UnitRange is essentially free.
These kinds of types in Julia are known as lazy iterators. LinSpace is another. Another interesting set of types are the special matrix types: why use more than an array to store a Diagonal? The UniformScaling operator acts as the identity matrix while only storing one value (it's scale) to make A-kI efficient.
Since Julia has a robust type system, there is no reason to make all of these things arrays. Instead, you can make them a specialized type which will act (*, +, etc.) and index like an array, but actually aren't. This will make them take less memory and be faster. If you ever need the array, just call collect(A) or full(A).
I realized that you posted something a little more specific. The reason here is that Array[1:2] calls the getindex function for an array. This getindex function has a special dispatch on a Range so that way it "acts like it's indexed by an array" (see the discussion from earlier). So that's "special-cased", but in actuality it just has dispatches to act like an array just like it does with every other function. [A] gives an array of typeof(A) no matter what A is, so there's no magic here.

Why have arrays in Go?

I understand the difference between arrays and slices in Go. But what I don't understand is why it is helpful to have arrays at all. Why is it helpful that an array type definition specifies a length and an element type? Why can't every "array" that we use be a slice?
There is more to arrays than just the fixed length: they are comparable, and they are values (not reference or pointer types).
There are countless advantages of arrays over slices in certain situations, all of which together more than justify the existence of arrays (along with slices). Let's see them. (I'm not even counting arrays being the building blocks of slices.)
1. Being comparable means you can use arrays as keys in maps, but not slices. Yes, you could say now that why not make slices comparable then, so that this alone wouldn't justify the existence of both. Equality is not well defined on slices. FAQ: Why don't maps allow slices as keys?
They don't implement equality because equality is not well defined on such types; there are multiple considerations involving shallow vs. deep comparison, pointer vs. value comparison, how to deal with recursive types, and so on.
2. Arrays can also give you higher compile-time safety, as the index bounds can be checked at compile time (array length must evaluate to a non-negative constant representable by a value of type int):
s := make([]int, 3)
s[3] = 3 // "Only" a runtime panic: runtime error: index out of range
a := [3]int{}
a[3] = 3 // Compile-time error: invalid array index 3 (out of bounds for 3-element array)
3. Also passing around or assigning array values will implicitly make a copy of the entire array, so it will be "detached" from the original value. If you pass a slice, it will still make a copy but just of the slice header, but the slice value (the header) will point to the same backing array. This may or may not be what you want. If you want to "detach" a slice from the "original" one, you have to explicitly copy the content e.g. with the builtin copy() function to a new slice.
a := [2]int{1, 2}
b := a
b[0] = 10 // This only affects b, a will remain {1, 2}
sa := []int{1, 2}
sb := sa
sb[0] = 10 // Affects both sb and sa
4. Also since the array length is part of the array type, arrays with different length are distinct types. On one hand this may be a "pain in the ass" (e.g. you write a function which takes a parameter of type [4]int, you can't use that function to take and process an array of type [5]int), but this may also be an advantage: this may be used to explicitly specify the length of the array that is expected. E.g. you want to write a function which takes an IPv4 address, it can be modeled with the type [4]byte. Now you have a compile-time guarantee that the value passed to your function will have exactly 4 bytes, no more and no less (which would be an invalid IPv4 address anyway).
5. Related to the previous, the array length may also serve a documentation purpose. A type [4]byte properly documents that IPv4 has 4 bytes. An rgb variable of type [3]byte tells there are 1 byte for each color components. In some cases it is even taken out and is available, documented separately; for example in the crypto/md5 package: md5.Sum() returns a value of type [Size]byte where md5.Size is a constant being 16: the length of an MD5 checksum.
6. They are also very useful when planning memory layout of struct types, see JimB's answer here, and this answer in greater detail and real-life example.
7. Also since slices are headers and they are (almost) always passed around as-is (without pointers), the language spec is more restrictive regarding pointers to slices than pointers to arrays. For example the spec provides multiple shorthands for operating with pointers to arrays, while the same gives compile-time error in case of slices (because it's rare to use pointers to slices, if you still want / have to do it, you have to be explicit about handling it; read more in this answer).
Such examples are:
Slicing a p pointer to array: p[low:high] is a shorthand for (*p)[low:high]. If p is a pointer to slice, this is compile-time error (spec: Slice expressions).
Indexing a p pointer to array: p[i] is a shorthand for (*p)[i]. If p is pointer to a slice, this is a compile time error (spec: Index expressions).
Example:
pa := &[2]int{1, 2}
fmt.Println(pa[1:1]) // OK
fmt.Println(pa[1]) // OK
ps := &[]int{3, 4}
println(ps[1:1]) // Error: cannot slice ps (type *[]int)
println(ps[1]) // Error: invalid operation: ps[1] (type *[]int does not support indexing)
8. Accessing (single) array elements is more efficient than accessing slice elements; as in case of slices the runtime has to go through an implicit pointer dereference. Also "the expressions len(s) and cap(s) are constants if the type of s is an array or pointer to an array".
May be suprising, but you can even write:
type IP [4]byte
const x = len(IP{}) // x will be 4
It's valid, and is evaluated and compile-time even though IP{} is not a constant expression so e.g. const i = IP{} would be a compile-time error! After this, it's not even surprising that the following also works:
const x2 = len((*IP)(nil)) // x2 will also be 4
Note: When ranging over a complete array vs a complete slice, there may be no performance difference at all as obviously it may be optimized so that the pointer in the slice header is only dereferenced once. For details / example, see Array vs Slice: accessing speed.
See related questions where an array can be used / makes more sense than a slice:
Why use arrays instead of slices?
Why can't Go slice be used as keys in Go maps pretty much the same way arrays can be used as keys?
Hash with key as an array type
How do I check the equality of three values elegantly?
Slicing a slice pointer passed as argument
And this is just for curiosity: a slice can contain itself while an array can't. (Actually this property makes comparison easier as you don't have to deal with recursive data structures).
Must-read blogs:
Go Slices: usage and internals
Arrays, slices (and strings): The mechanics of 'append'
Arrays are values, and it is often useful to have a value instead of a pointer.
Values can be compared, hence you can use arrays as map keys.
Values are always initialized, so there's you don't need to initialize, or make them like you do with a slice.
Arrays give you better control of memory layout, where as you can't allocate space directly in a struct with a slice, you can with an array:
type Foo struct {
buf [64]byte
}
Here, a Foo value will contains a 64 byte value, rather than a slice header which needs to be separately initialized. Arrays are also used to pad structs to match alignment when interoperating with C code and to prevent false sharing for better cache performance.
Another aspect for improved performance is that you can better define memory layout than with slices, because data locality can have a very big impact on memory intensive calculations. Dereferencing a pointer can take considerable time compared to the operations being performed on the data, and copying values smaller than a cache line incurs very little cost, so performance critical code often uses arrays for that reason alone.
Arrays are more efficient in saving space. If you never update the size of the slice (i.e. start with a predefined size and never go past it) there really is not much of a performance difference. But there is extra overhead in space, as a slice is simply a wrapper containing the array at its core. Contextually, it also improves clarity as it makes the intended use of the variable more apparent.
Every array could be a slice but not every slice could be an array. If you have a fixed collection size you can get a minor performance improvement from using an array. At the very least you'll save the space occupied by the slice header.

What's the purpose of arrays starting with nonzero index?

I tried to find answers, but all I got was answers on how to realize arrays starting with nonzero indexes. Some languages, such as pascal, provide this by default, e.g., you can create an array such as
var foobar: array[1..10] of string;
I've always been wondering: Why would you want to have the array index not to start with 0?
I guess it may be more familiar for beginners to have arrays starting with 1 and the last index being the size of the array, but on a long-term basis, programmers should get used to values starting with 0.
Another purpose I could think of: In some cases, the index could actually represent something thats contained in the respective array-entry. e.g., you want to get all capital letters in an array, it may be handy to have an index being the ASCII-Code of the respective letter. But its pretty easy just to subtract a constant value. In this example, you could (in C) simply do something like this do get all capital letters and access the letter with ascii-code 67:
#define ASCII_SHIFT 65
main()
{
int capital_letters[26];
int i;
for (i=0; i<26; i++){
capital_letters[i] = i+ASCII_SHIFT;
}
printf("%c\n", capital_letters[67-ASCII_SHIFT]);
}
Also, I think you should use hash tables if you want to access entries by some sort of key.
Someone might retort: Why should the index always start with 0? Well, it's a hell of a lot simpler this way. You'll be faster when you just have to type one index when declaring an array. Also, you can always be sure that the first entry is array[0] and the last one is array[length_of_array-1]. It is also common that other data structures start with 0. e.g., if you read a binary file, you start with the 0th byte, not the first.
Now, why do some programming languages have this "feature" and why do some people ask how to achieve this in languages such as C/C++?, is there any situation where an array starting with a nonzero index is way more useful, or even, something simply cannot be done with an array starting at 0?
If your index means something, e.g. an id from a database or some such, then it's useful.
Oh, and you can't use hashes because you want to use it with some other piece of code that expects arrays.
For example, Rails checkboxes. They're passed from the web form as arrays but in my code I want to access the udnerlying database object. The array index is the id, et voila!
Non-zero based arrays are a natural extension of arrays with ordinal indexes that are not integers. In Pascal you can have arrays like:
var
letter_count : array['a'..'z'] of integer;
Or:
type
flags = (GREEN, YELOW, RED);
var
flags_seen = array[flags] of boolean;
A classic is an array with negative indexes:
zero_centered_grid = array[-N..N,-N..N] of sometype;
The idea is that:
Many indexing errors can be detected at compile time if the declaration of indexes is more specific.
Some algorithms (heaps come to mind) have cleaner implementations when the minimum index is something different from zero.
Languages with only zero-based arrays use well defined idioms for the latter, and have efficient implementations of dictionaries/maps for the rest.

How can I efficiently copy 2-dimensional arrays of bytes into a larger 2D array?

I have a structure called Patch that represents a 2D array of data.
newtype Size = (Int, Int)
data Patch = Patch Size Strict.ByteString
I want to construct a larger Patch from a set of smaller Patches and their assigned positions. (The Patches do not overlap.) The function looks like this:
newtype Position = (Int, Int)
combinePatches :: [(Position, Patch)] -> Patch
combinePatches plan = undefined
I see two sub-problems. First, I must define a function to translate 2D array copies into a set of 1D array copies. Second, I must construct the final Patch from all those copies.
Note that the final Patch will be around 4 MB of data. This is why I want to avoid a naive approach.
I'm fairly confident that I could do this horribly inefficiently, but I would like some advice on how to efficiently manipulate large 2D arrays in Haskell. I have been looking at the "vector" library, but I have never used it before.
Thanks for your time.
If the spec is really just a one-time creation of a new Patch from a set of previous ones and their positions, then this is a straightforward single-pass algorithm. Conceptually, I'd think of it as two steps -- first, combine the existing patches into a data structure with reasonable lookup for any give position. Next, write your new structure lazily by querying the compound structure. This should be roughly O(n log(m)) -- n being the size of the new array you're writing, and m being the number of patches.
This is conceptually much simpler if you use the Vector library instead of a raw ByteString. But it is simpler still if you simply use Data.Array.Unboxed. If you need arrays that can interop with C, then use Data.Array.Storable instead.
If you ditch purity, at least locally, and work with an ST array, you should be able to trivially do this in O(n) time. Of course, the constant factors will still be worse than using fast copying of chunks of memory at a time, but there's no way to keep that code from looking low-level.

Resources