Scala: array.toList vs array.to[List] - arrays

I am wondering what the difference is between .toList and .to[List] on arrays. I ran this test in the spark-shell and there is no difference in the result, but I don't know which is better to use. Any comments?
scala> val l = Array(1, 2, 3).toList
l: List[Int] = List(1, 2, 3)
scala> val l = Array(1, 2, 3).to[List]
l: List[Int] = List(1, 2, 3)

Adding to Luis' comment: to[List] does (as Luis mentioned) use a factory parameter to construct the list. However, the linked source is only valid from Scala 2.13 onwards, after the collections overhaul. The syntax you used would not work in Scala 2.13; instead you would have to write .to(List), explicitly passing the factory argument. In previous Scala versions the method looked like this, where CanBuildFrom is essentially a factory passed as an implicit parameter.
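For reference, a short sketch contrasting the two APIs (signatures paraphrased from the 2.12 and 2.13 standard libraries; the runnable line targets 2.13):
// Scala 2.13+: the factory is an explicit value argument.
//   def to[C1](factory: Factory[A, C1]): C1
val asList: List[Int] = Array(1, 2, 3).to(List)

// Scala 2.12 and earlier: the factory is an implicit CanBuildFrom, so the target
// collection appears only as a type argument (the form used in the question;
// it no longer compiles on 2.13):
//   def to[Col[_]](implicit cbf: CanBuildFrom[Nothing, A, Col[A]]): Col[A]
// val asList: List[Int] = Array(1, 2, 3).to[List]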
The reason this method exists is that it is generic beyond the collections in the standard library (so you don't have to define every single possible transformation as a separate method). You can use another collections library (e.g. Breeze), construct a factory (or use an included one), and call the to(factory) method. You can also use it in generic functions where you take the factory as a parameter and simply pass it on to the conversion method, for example something like:
def mapAndConvert[A, B, C](list: List[A], f: A => B, factory: Factory[B, C]): C =
  list.map(f).to(factory)
In this example I don't need to know what collection C is, yet I can still return one with ease.
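A quick usage sketch (Scala 2.13; the type arguments are spelled out for clarity, and companion objects such as Vector and Set are adapted to Factory by a built-in implicit conversion):
import scala.collection.Factory

def mapAndConvert[A, B, C](list: List[A], f: A => B, factory: Factory[B, C]): C =
  list.map(f).to(factory)

val squares: Vector[Int] = mapAndConvert[Int, Int, Vector[Int]](List(1, 2, 3), n => n * n, Vector)
val labels: Set[String] = mapAndConvert[Int, String, Set[String]](List(1, 2, 2), n => "#" + n, Set)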

Related

Type hinting numpy arrays and batches

I'm trying to create a few array types for a scientific python project. So far, I have created generic types for 1D, 2D and ND numpy arrays:
from typing import Any, Generic, Protocol, Tuple, TypeVar
import numpy as np
from numpy.typing import _DType, _GenericAlias
Vector = _GenericAlias(np.ndarray, (Tuple[int], _DType))
Matrix = _GenericAlias(np.ndarray, (Tuple[int, int], _DType))
Tensor = _GenericAlias(np.ndarray, (Tuple[int, ...], _DType))
The first issue is that mypy says that Vector, Matrix and Tensor are not valid types (e.g. when I try myvar: Vector[int] = np.array([1, 2, 3]))
The second issue is that I'd like to create a generic type Batch that I'd like to use like so: Batch[Vector[complex]] should be like Matrix[complex], Batch[Matrix[float]] should be like Tensor[float], and Batch[Tensor[int]] should be like Tensor[int]. I am not sure exactly what I mean by "should be like"; I guess I mean that mypy should not complain.
How do I go about this?
You should not be using protected members (names starting with an underscore) from the outside. They are typically marked this way to indicate implementation details that may change in the future, which is exactly what happened here between versions of numpy. For example, in numpy 1.24 your import line from numpy.typing fails at runtime because the members you try to import are no longer there.
There is no need to use internal alias constructors because numpy.ndarray is already generic in terms of the array shape and its dtype. You can construct your own type aliases fairly easily. You just need to ensure you parameterize the dtype correctly. Here is a working example:
from typing import Tuple, TypeVar
import numpy as np
T = TypeVar("T", bound=np.generic, covariant=True)
Vector = np.ndarray[Tuple[int], np.dtype[T]]
Matrix = np.ndarray[Tuple[int, int], np.dtype[T]]
Tensor = np.ndarray[Tuple[int, ...], np.dtype[T]]
Usage:
def f(v: Vector[np.complex64]) -> None:
    print(v[0])

def g(m: Matrix[np.float_]) -> None:
    print(m[0])

def h(t: Tensor[np.int32]) -> None:
    print(t.reshape((1, 4)))

f(np.array([0j+1]))                   # prints (1+0j)
g(np.array([[3.14, 0.], [1., -1.]]))  # prints [3.14 0.  ]
h(np.array([[3.14, 0.], [1., -1.]]))  # prints [[ 3.14  0.    1.   -1.  ]]
The issue currently is that shapes have almost no typing support, but work is underway to implement that using the new TypeVarTuple capabilities provided by PEP 646. Until then, there is little practical use in discriminating the types by shape.
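A minimal, numpy-agnostic sketch of what PEP 646 variadic generics allow (Python 3.11+, or typing_extensions on older versions); how numpy will expose shape typing on top of this is not settled, so the class below is purely a hypothetical illustration, not a numpy API:
from typing import Generic, TypeVarTuple, Unpack

Shape = TypeVarTuple("Shape")

class ShapedArray(Generic[Unpack[Shape]]):
    # A toy container whose static type carries its shape, e.g. ShapedArray[int, int].
    def __init__(self, *dims: int) -> None:
        self.dims = dims

def prepend_dim(a: ShapedArray[Unpack[Shape]]) -> ShapedArray[int, Unpack[Shape]]:
    # Add a leading dimension at the type level as well as at runtime.
    return ShapedArray(1, *a.dims)

m: ShapedArray[int, int] = ShapedArray(3, 4)
t: ShapedArray[int, int, int] = prepend_dim(m)  # now typed as a 3-D "array"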
The batch issue should be a separate question. Try and ask one question at a time.

Does blockwise allow iteration over out-of-core arrays?

The blockwise docs mention that with concatenate=False:
In the case of a contraction the passed function should expect an iterable of blocks on any array that holds that index.
My question then is whether or not there is a fundamental limitation that would prohibit this "iterable of blocks" from loading the blocks one at a time rather than keeping them all in a list (i.e. in memory). Is this possible? It does not look like blockwise works this way now, but I am wondering if it could:
import dask.array as da
import numpy as np
import operator
# Create an array and write to disk
x = da.random.random(size=(10, 6), chunks=(5, 3))
da.to_zarr(x, '/tmp/x.zarr', overwrite=True)
x = da.from_zarr('/tmp/x.zarr')
y = x.T
def fn(x, y):
    print(type(x), type(x[0]))
    x = np.concatenate(x, axis=1)
    y = np.concatenate(y, axis=0)
    return np.matmul(x, y)
da.blockwise(fn, 'ik', x, 'ij', y, 'jk', concatenate=False, dtype='float').compute(scheduler='single-threaded')
# <class 'list'> <class 'numpy.ndarray'>
Is it possible for these lists to be generators instead?
This was true very early on in Dask, but we switched to concrete lists eventually. Today a task does not start until all of its dependency tasks are available in memory.
Given the context of your question I'm guessing that you're running up against memory issues with tensordot style applications. The memory use of tensordot style applications depends heavily on chunk structure. I encourage you to look at this issue, and especially at the talk referenced in the first post: https://github.com/dask/dask/issues/2225
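For intuition, here is a small sketch (mine, not from the linked issue) of how chunk structure changes what each task has to hold for a contraction like the matmul above: with a single chunk along the contracted axis, every output block depends on just one pair of input blocks, so no long list of blocks needs to be materialized per task.
import dask.array as da

# Same shape as above, but rechunked so axis 1 (the contracted axis) is one chunk.
x = da.random.random(size=(10, 6), chunks=(5, 3))
x_single = x.rechunk({1: 6})

# Each block of the result now needs exactly one block of x_single and one block
# of its transpose, instead of a list of block pairs to sum over.
result = (x_single @ x_single.T).compute(scheduler='single-threaded')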

Modify a member variable outside the class object and have changes in the class object

I have a class in Python:
class A:
    def __init__(self):
        self.obj = None

    def setObj(self, npArray):
        self.obj = npArray

    def getObj(self):
        return self.obj
In another Python script I instantiate an object of class A, set "obj", and then get it somewhere else:
objOfA.setObj(npArray)
''' Some operations '''
objOut = objOfA.getObj()
''' More operations '''
np.append(objOut,[0.25]) ## Here np.append is used just as an example. There can be many other algebraic operations.
''' operations using objOfA '''
In the operations above using objOfA, I want to see the modified array (with 0.25 appended).
In C++ this is quite possible using pointers or references, but I am having a hard time doing it in Python. I understand how and when Python uses references to objects, but my problem is that as soon as I do
objOut = objOfA.getObj()
I seem to get a copy of the array in objOut rather than a reference.
Is there a way in which I can do this?
Thank you in advance.
As per the documentation, np.append returns a new array (emphasis mine):
A copy of arr with values appended to axis. Note that append does not occur in-place: a new array is allocated and filled. If axis is None, out is a flattened array.
So you would have to assign the return value back to your object:
objOfA.setObj(np.append(objOfA.getObj(), [0.25]))
Note that the use of getter and setter methods is discouraged in Python; you should just access the attribute directly:
objOfA.obj = np.append(objOfA.obj, [0.25])
If you are not dependent on numpy arrays, you could just use lists, which are mutable:
objOfA.obj = [1, 2, 3]
objOfA.obj.append(0.25)
objOfA.obj.extend([4, 5, 6])
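As a side note, objOfA.getObj() already hands back a reference to the same array, not a copy; only operations that allocate a new array (like np.append) need to be assigned back. A short sketch (mine, not part of the original answer) of the distinction:
import numpy as np

class A:
    def __init__(self):
        self.obj = None

objOfA = A()
objOfA.obj = np.array([1.0, 2.0, 3.0])

objOut = objOfA.obj   # same array object, not a copy
objOut[0] = 42.0      # in-place modification, visible through objOfA
print(objOfA.obj)     # [42.  2.  3.]

objOfA.obj = np.append(objOfA.obj, [0.25])   # new array: rebind the attribute
print(objOfA.obj)     # [42.    2.    3.    0.25]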

How do I create a Flow with a different input and output types for use inside of a graph?

I am making a custom sink by building a graph on the inside. Here is a broad simplification of my code to demonstrate my question:
def mySink: Sink[Int, Unit] = Sink() { implicit builder =>
  val entrance = builder.add(Flow[Int].buffer(500, OverflowStrategy.backpressure))
  val toString = builder.add(Flow[Int, String, Unit].map(_.toString))
  val printSink = builder.add(Sink.foreach(elem => println(elem)))

  builder.addEdge(entrance.out, toString.in)
  builder.addEdge(toString.out, printSink.in)

  entrance.in
}
The problem I am having is that, while it is valid to create a Flow with the same input and output type using only a single type argument and no value arguments, like Flow[Int] (which is all over the documentation), it is not valid to supply only two type parameters and zero value parameters.
According to the reference documentation for the Flow object the apply method I am looking for is defined as
def apply[I, O]()(block: (Builder[Unit]) ⇒ (Inlet[I], Outlet[O])): Flow[I, O, Unit]
and says
Creates a Flow by passing a FlowGraph.Builder to the given create function.
The create function is expected to return a pair of Inlet and Outlet which correspond to the created Flows input and output ports.
It seems like I need to deal with another level of graph builders when I am trying to make what I think is a very simple flow. Is there an easier and more concise way to create a Flow that changes the type of its input and output without messing with its inner ports? If this is the right way to approach the problem, what would a solution look like?
BONUS: Why is it easy to make a Flow that doesn't change the type between its input and its output?
If you want to specify both the input and the output type of a flow, you indeed need to use the apply method you found in the documentation. Using it, though, works pretty much exactly the same way as what you already did.
Flow[String, Message]() { implicit b =>
  import FlowGraph.Implicits._

  val reverseString = b.add(Flow[String].map[String] { msg => msg.reverse })
  val mapStringToMsg = b.add(Flow[String].map[Message](x => TextMessage.Strict(x)))

  // connect the graph
  reverseString ~> mapStringToMsg

  // expose ports
  (reverseString.inlet, mapStringToMsg.outlet)
}
Instead of just returning the inlet, you return a tuple with the inlet and the outlet. This flow can now be used (for instance inside another builder, or directly with runWith) with a specific Source or Sink.
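To attach it to a concrete stream, something like the following should work (a sketch against the same akka-stream API version as above; stringToMessage is a hypothetical name for the flow built in the snippet, and an implicit materializer is assumed to be in scope):
Source(List("hello", "world"))
  .via(stringToMessage)
  .runWith(Sink.foreach(msg => println(msg)))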

Flink Scala API functions on generic parameters

This is a follow-up question to Flink Scala API "not enough arguments".
I'd like to be able to pass Flink's DataSets around and do something with them, but the parameters of the DataSet are generic.
Here's the problem I have now:
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import scala.reflect.ClassTag
object TestFlink {
  def main(args: Array[String]) {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val text = env.fromElements(
      "Who's there?",
      "I think I hear them. Stand, ho! Who's there?")

    val split = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }

    id(split).print()

    env.execute()
  }

  def id[K: ClassTag](ds: DataSet[K]): DataSet[K] = ds.map(r => r)
}
I have this error for ds.map(r => r):
Multiple markers at this line
- not enough arguments for method map: (implicit evidence$256: org.apache.flink.api.common.typeinfo.TypeInformation[K], implicit evidence$257: scala.reflect.ClassTag[K])org.apache.flink.api.scala.DataSet[K]. Unspecified value parameters evidence$256, evidence$257.
- not enough arguments for method map: (implicit evidence$4: org.apache.flink.api.common.typeinfo.TypeInformation[K], implicit evidence$5: scala.reflect.ClassTag[K])org.apache.flink.api.scala.DataSet[K]. Unspecified value parameters evidence$4, evidence$5.
- could not find implicit value for evidence parameter of type org.apache.flink.api.common.typeinfo.TypeInformation[K]
Of course, the id function here is just an example, and I'd like to be able to do something more complex with it.
How can it be solved?
You also need to have TypeInformation as a context bound on the K parameter, so:
def id[K: ClassTag: TypeInformation](ds: DataSet[K]): DataSet[K] = ds.map(r => r)
The reason is that Flink analyzes the types you use in your program and creates a TypeInformation instance for each of them. If you want to create generic operations, you need to make sure a TypeInformation for that type is available by adding a context bound. This way, the Scala compiler will make sure an instance is available at the call site of the generic function.
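Putting it together, a short sketch of the signature with the imports it needs (the wildcard import matters, since it brings Flink's implicit TypeInformation generation into scope at concrete call sites):
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.scala._
import scala.reflect.ClassTag

def id[K: ClassTag: TypeInformation](ds: DataSet[K]): DataSet[K] = ds.map(r => r)

// At a call site such as id(split), where split is a DataSet[String], the compiler
// materializes a TypeInformation[String] and passes it to the context bound.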
