I have a bunch of files that I want to read in parallel using Python's multiprocessing and collect all the data in a single NumPy array. For this purpose, I want to define a shared-memory NumPy array and pass slices of it to different processes to read in parallel. A toy illustration of what I am trying to do is given in the following code, where I try to modify a NumPy array using multiprocessing.
Example 1:
import numpy as np
import multiprocessing

def do_stuff(i, arr):
    arr[:] = i
    return

def print_error(err):
    print(err)

if __name__ == '__main__':
    idx = [0, 1, 2, 3]
    # Need to fill this array in parallel
    arr = np.zeros(4)
    p = multiprocessing.Pool(4)
    # Passing slices of arr to modify using multiprocessing
    for i in idx:
        p.apply(do_stuff, args=(i, arr[i:i+1]))
    p.close()
    p.join()
    print(arr)
In this code, I want arr to be filled with 0, 1, 2, 3. However, this prints arr as all zeros. After reading the answers here, I used multiprocessing.Array to define the shared-memory variable and modified my code as follows:
Example 2:
import numpy as np
import multiprocessing

def do_stuff(i, arr):
    arr[:] = i
    return

def print_error(err):
    print(err)

if __name__ == '__main__':
    idx = [0, 1, 2, 3]
    p = multiprocessing.Pool(4)
    # Shared memory Array
    shared = multiprocessing.Array('d', 4)
    arr = np.ctypeslib.as_array(shared.get_obj())
    for i in idx:
        p.apply(do_stuff, args=(i, arr[i:i+1]))
    p.close()
    p.join()
    print(arr)
This also prints all zeros for arr. However, when I define the array outside main and use pool.map, the code works. For example, the following code works:
Example 3:
import numpy as np
import multiprocessing

shared = multiprocessing.Array('d', 4)
arr = np.ctypeslib.as_array(shared.get_obj())

def do_stuff(i):
    arr[i] = i
    return

def print_error(err):
    print(err)

if __name__ == '__main__':
    idx = [0, 1, 2, 3]
    p = multiprocessing.Pool(4)
    shared = multiprocessing.Array('d', 4)
    p.map(do_stuff, idx)
    p.close()
    p.join()
    print(arr)
This prints [0,1,2,3].
I am very confused by all this. My questions are:
When I define arr = np.zeros(4), which process owns this variable? When I then send a slice of this array to different processes, what is being sent if this variable is not defined on those processes?
Why doesn't example 2 work while example 3 does?
I am working on Linux with Python 3.7.4.
When I define arr = np.zeros(4), which process owns this variable?
Only the main process should have access to this. If you use "fork" for the start method, everything will be accessible to the child process, but as soon as something tries to modify it, it will be copied to its own private memory space before being modified (copy-on-write). This reduces overhead if you have large read-only arrays, but doesn't help you much for writing data back to those arrays.
what is being sent if this variable is not defined on those processes?
A new array is created within the child process when the arguments are re-constructed after being sent from the main process via a pipe and pickle. The data is serialized (pickled) and re-constructed, so no information other than the value of the data in the slice remains. It's a totally new object.
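To make that concrete, here is a minimal sketch (my addition, not part of the original question or answer): the child mutates its re-constructed copy, and only the value it returns carries the change back.

import numpy as np
from multiprocessing import Pool

def mutate(chunk):
    # 'chunk' is a re-constructed copy inside the child process;
    # writing to it does not touch the parent's array
    chunk[:] = 99
    return chunk  # the modified copy has to be sent back explicitly

if __name__ == '__main__':
    arr = np.zeros(4)
    with Pool(2) as p:
        returned = p.apply(mutate, args=(arr[0:2],))
    print(arr)       # still [0. 0. 0. 0.]
    print(returned)  # [99. 99.] -- the copy that travelled back through pickle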
Why doesn't example 2 work while example 3 does?
Example 3 works because at the time of the "fork" (the moment you call Pool), arr has already been created and will be shared. It's also important that you used an Array to create it, so when you attempt to modify the data, the data is shared (the exact mechanics of this are complicated).
Example 2 fails for the same reason example 1 fails: you pass a slice of an array as an argument, which gets converted into a totally new object, so arr inside your do_stuff function is just a copy of arr[i:i+1] from the main process. It is still important to create anything that will be shared between processes before calling Pool (if you're relying on "fork" to share the data), but that's not why this example doesn't work.
You should know: example 3 only works because you're on Linux, where the default start method is "fork". This is not the preferred start method due to the possibility of deadlocks from copying lock objects in a locked state. It will not work on Windows at all, and won't work on macOS by default on Python 3.8 and above.
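As a quick check (my addition), you can print or override the start method to see this for yourself:

import multiprocessing as mp

if __name__ == '__main__':
    print(mp.get_start_method())    # 'fork' on Linux by default
    # mp.set_start_method('spawn')  # with "spawn", each child re-imports the module and gets
    #                               # its own Array, so the parent's arr in example 3 stays zero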
The best (most portable) solution to all this is to pass the Array itself as the argument and re-construct the numpy array inside the child process. This has the complication that "shared objects" can only be passed as arguments at the creation of the child process. This isn't as big a deal if you use Process, but with Pool you basically have to pass any shared objects as arguments to an initialization function and get the re-constructed array as a global variable of the child's scope. In the example below, for instance, you will get an error if you try to pass buf as an argument with p.map or p.apply, but not when passing it as initargs=(buf,) to Pool().
import numpy as np
from multiprocessing import Pool, Array

def init_child(buf):
    global arr  # use global context (for each process) to pass arr to do_stuff
    arr = np.frombuffer(buf.get_obj(), dtype='d')

def do_stuff(i):
    global arr
    arr[i] = i

if __name__ == '__main__':
    idx = [0, 1, 2, 3]
    buf = Array('d', 4)
    arr = np.frombuffer(buf.get_obj(), dtype='d')
    arr[:] = 0
    # "with" context is easier than writing "close" and "join" all the time
    with Pool(4, initializer=init_child, initargs=(buf,)) as p:
        for i in idx:
            p.apply(do_stuff, args=(i,))  # you could pass more args to get slice indices too
    print(arr)
With Python 3.8 and above there's a new module, shared_memory, which is better than Array or any of the other sharedctypes classes. It is a bit more complicated to use and has some additional OS-dependent nastiness, but it's theoretically lower overhead and faster. If you want to go down the rabbit hole, I've written a few answers on the topic of shared_memory, and have recently been answering lots of questions on concurrency in general if you want to take a gander at my answers from the last month or two.
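For completeness, here is a rough sketch of the same example re-written with shared_memory (my addition, assuming Python 3.8+); treat it as a starting point rather than a drop-in replacement:

import numpy as np
from multiprocessing import Pool, shared_memory

def init_child(shm_name, shape, dtype):
    global arr, _shm
    # attach to the existing block by name; keep a reference alive
    # so the mapping isn't closed while the worker still uses it
    _shm = shared_memory.SharedMemory(name=shm_name)
    arr = np.ndarray(shape, dtype=dtype, buffer=_shm.buf)

def do_stuff(i):
    arr[i] = i

if __name__ == '__main__':
    shm = shared_memory.SharedMemory(create=True, size=4 * 8)  # room for 4 float64 values
    arr = np.ndarray((4,), dtype='d', buffer=shm.buf)
    arr[:] = 0
    with Pool(4, initializer=init_child, initargs=(shm.name, (4,), 'd')) as p:
        for i in range(4):
            p.apply(do_stuff, args=(i,))
    print(arr)
    shm.close()
    shm.unlink()  # release the block once nobody needs it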
I want to code a script that has several different functions in which an array filled with serial data has to be received.
The serial data comes from an Arduino every second. (Don't worry, I changed the code to a reproducible example by using a random array.)
What I've succeeded in so far is that the code sends the array into the example function ONCE, the first time, and displays it as I want it to.
What it does not do yet is update the information inside the function as new data comes in from the Arduino. When you see the code, you're going to say, well, the data is only sent once. BUT when I randomize the array every second inside a loop, the loop obviously blocks the rest of the code and the GUI won't build. The point is that the serial read updates the array WITHOUT a loop, which is highly appreciated.
The question is: how do I get this updating into the function? Remember: the natural solution for the code below would be to simply insert the serial read stuff INSIDE the function, BUT this is just the code that boils the issue down. The real code has several widgets invoked inside several functions, and I ended up copy-and-pasting THE ENTIRE serial data signal conditioning block into EVERY function that needs the data. This significantly increased the lag of the code, and thus is no solution.
The example script contains commented-out sections to make it easier to follow what I've been trying so far to solve it:
import numpy as np
#import rhinoscriptsyntax as rs
import time
import tkinter as tk
import serial

"""
ser = serial.Serial(
    port='/dev/ttyUSB0',
    baudrate=500000,
    #parity=serial.PARITY_NONE,
    #stopbits=serial.STOPBITS_ONE,
    #bytesize=serial.EIGHTBITS,
    timeout=1
)
ser.flushInput()
ser.flushOutput()

#I've been trying to embed the serial read stuff inside a function itself, which I'd LOVE to implement,
#but it has the same problem: either it is called just once as it is written here, or a loop blocks the code
def serial_data():
    #serialData = ser.readline()
    #serialData = serialData.decode()
    #floats = [float(value) for value in serialData.split(',')]
    #arr = np.array(floats)
    arr = np.random.rand(100)
    time.sleep(1)
    return arr
"""

#above commented out and replaced with random array below.
#serialData = ser.readline()
#serialData = serialData.decode()
#floats = [float(value) for value in serialData.split(',')]
#arr = np.array(floats)
arr = np.random.rand(100)
time.sleep(1)
print(np.round(arr, 3))

def nextWindow(root):
    frame1 = tk.Frame(root, width=800, height=500)
    frame1.pack()
    text = tk.Text(frame1, width=80, height=12, bd=0)
    text.pack()
    text.delete("1.0", tk.END)
    #serialData = serial_data()
    #text.insert(tk.INSERT, serialData)
    text.insert(tk.INSERT, np.round(arr, 3))
    text.update()

root = tk.Tk()
root.title('PythonGuides')
root.geometry('300x200')
root.config(bg='#4a7a8c')
nextWindow(root)
root.mainloop()
A minimal example of how to use .after "loops" (explanation in the code comments):
import tkinter as tk

# for the example
counter = 1

# your serial_data function, renamed to
# be more self-explanatory
def get_serial_data():
    # put the serial reading stuff here
    # the counter is just an example
    global counter
    counter += 1
    return str(counter)

def update_text(txt):
    # get data, clear text widget, insert new data
    serial_data = get_serial_data()
    txt.delete('0.0', 'end')
    txt.insert('end', serial_data)
    # schedule this function to run again in 100 ms,
    # so it will repeat all of this, effectively
    # updating the text widget; you don't need to call `.update`
    # note that this is not recursive
    root.after(100, update_text, txt)  # pass txt along so the next call gets the widget

def create_text():
    text = tk.Text(root)
    text.pack()
    update_text(text)

root = tk.Tk()
create_text()
root.mainloop()
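And here is a sketch (my addition, not part of the answer above) of how the question's serial reading could be plugged into that .after() loop, with ser assumed to be opened exactly as in the question:

import numpy as np
import tkinter as tk
#import serial

#ser = serial.Serial(port='/dev/ttyUSB0', baudrate=500000, timeout=1)

def get_serial_data():
    #serialData = ser.readline().decode()
    #return np.array([float(value) for value in serialData.split(',')])
    return np.random.rand(100)  # stand-in, as in the question

def update_text(txt):
    txt.delete('1.0', 'end')
    txt.insert('end', np.round(get_serial_data(), 3))
    # the Arduino sends once per second, so re-schedule every 1000 ms
    root.after(1000, update_text, txt)

root = tk.Tk()
text = tk.Text(root, width=80, height=12)
text.pack()
update_text(text)
root.mainloop()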
The blockwise docs mention that with concatenate=False:
In the case of a contraction the passed function should expect an iterable of blocks on any array that holds that index.
My question then is whether or not there is a fundamental limitation that would prohibit this "iterable of blocks" from loading the blocks one at a time rather than keeping them all in a list (i.e. in memory). Is this possible? It does not look like blockwise works this way now, but I am wondering if it could:
import dask.array as da
import numpy as np
import operator

# Create an array and write to disk
x = da.random.random(size=(10, 6), chunks=(5, 3))
da.to_zarr(x, '/tmp/x.zarr', overwrite=True)
x = da.from_zarr('/tmp/x.zarr')
y = x.T

def fn(x, y):
    print(type(x), type(x[0]))
    x = np.concatenate(x, axis=1)
    y = np.concatenate(y, axis=0)
    return np.matmul(x, y)

da.blockwise(fn, 'ik', x, 'ij', y, 'jk', concatenate=False, dtype='float').compute(scheduler='single-threaded')
# <class 'list'> <class 'numpy.ndarray'>
Is it possible for these lists to be generators instead?
This was true very early on in Dask, but we switched to concrete lists eventually. Today a task does not start until all of its dependency tasks are available in memory.
Given the context of your question, I'm guessing that you're running up against memory issues with tensordot-style applications. The memory use of tensordot-style applications depends heavily on chunk structure. I encourage you to look at this issue, and especially at the talk referenced in the first post: https://github.com/dask/dask/issues/2225
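As a rough illustration of that chunk-structure point (my sketch, not from the linked issue): rechunking so there is a single chunk along the contracted axis changes how many input blocks each output task has to hold at once.

import dask.array as da

x = da.from_zarr('/tmp/x.zarr')   # (10, 6) with chunks (5, 3), as in the question above
y = x.T

many = x @ y                                     # two input blocks per output block along the contracted axis
fewer = x.rechunk((5, 6)) @ y.rechunk((6, 5))    # one input block per output block along the contracted axis

print(many.chunks, fewer.chunks)
print(fewer.compute(scheduler='single-threaded').shape)  # (10, 10)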
I know how to find the file size in Scala. But how do I find the size of an RDD/DataFrame in Spark?
Scala:
object Main extends App {
  val file = new java.io.File("hdfs://localhost:9000/samplefile.txt").toString()
  println(file.length)
}
Spark:
val distFile = sc.textFile(file)
println(distFile.length)
But when I process it, I don't get the file size. How do I find the RDD size?
If you are simply looking to count the number of rows in the RDD, do:
val distFile = sc.textFile(file)
println(distFile.count)
If you are interested in the bytes, you can use the SizeEstimator:
import org.apache.spark.util.SizeEstimator
println(SizeEstimator.estimate(distFile))
https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/SizeEstimator.html
Yes, finally I got the solution.
Include these libraries.
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
import org.apache.spark.rdd
How to find the RDD Size:
def calcRDDSize(rdd: RDD[String]): Long = {
  rdd.map(_.getBytes("UTF-8").length.toLong)
     .reduce(_ + _) // add the sizes together
}
Function to find the DataFrame size:
(This function just converts the DataFrame to an RDD internally.)
val dataFrame = sc.textFile(args(1)).toDF() // you can replace args(1) with any path
val rddOfDataframe = dataFrame.rdd.map(_.toString())
val size = calcRDDSize(rddOfDataframe)
Below is one way, apart from SizeEstimator, that I use frequently.
It lets you find out from code whether an RDD is cached, and more precisely how many of its partitions are cached in memory and how many are cached on disk, get the storage level, check the current actual caching status, and see the memory consumption.
SparkContext has the developer API method getRDDStorageInfo().
Occasionally you can use this.
Return information about what RDDs are cached, if they are in mem or
on disk, how much space they take, etc.
For example:
scala> sc.getRDDStorageInfo
res3: Array[org.apache.spark.storage.RDDInfo] =
Array(RDD "HiveTableScan [name#0], (MetastoreRelation sparkdb,
firsttable, None), None " (3) StorageLevel: StorageLevel(false, true, false, true, 1); CachedPartitions: 1;
TotalPartitions: 1;
MemorySize: 256.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B)
It seems the Spark UI also uses the same information from this code.
See this source issue, SPARK-17019, which describes...
Description
With SPARK-13992, Spark supports persisting data into
off-heap memory, but the usage of off-heap is not exposed currently,
it is not so convenient for user to monitor and profile, so here
propose to expose off-heap memory as well as on-heap memory usage in
various places:
Spark UI's executor page will display both on-heap and off-heap memory usage.
REST request returns both on-heap and off-heap memory.
Also these two memory usage can be obtained programmatically from SparkListener.
Is there a way to extend an array that stores data from a file on each iteration of a for-loop/with combo, using glob? Currently, I have something like
import glob
from myfnc import func

for filename in glob.glob('*.dta'):
    with open(filename, 'rb') as thefile:
        fileHead, data = func(thefile)
where func is defined in another script, myfnc. What this does is, on each iteration over the directory, store the data from each file in fileHead and data (as arrays), erasing whatever was there on the previous iteration. What I need is something that will extend each array on each pass. Is there a nice way to do this? It doesn't need to be a for-loop/with combo; that is just how I am reading in all the files from the directory.
I thought of initializing the arrays beforehand and then trying to extend them after the with block is done on each pass, but it was giving me some kind of error with the extend command. The code producing the error would look like
import glob
from myfnc import func

fileHead, data = [0]*2
for filename in glob.glob('*.dta'):
    with open(filename, 'rb') as thefile:
        fileHeadExtend, dataExtend = func(thefile)
        fileHead.extend(fileHeadExtend)
        data.extend(dataExtend)
So the issue is that fileHead and data are both initialized, but as ints. However, I don't want to initialize the arrays to so many zeros; there should not be any arbitrary values in there to begin with. So that is where the issue lies.
You want:
import glob
from myfnc import func

fileHead = list()
data = list()

for filename in glob.glob('*.dta'):
    with open(filename, 'rb') as thefile:
        fileHeadExtend, dataExtend = func(thefile)
        fileHead.extend(fileHeadExtend)
        data.extend(dataExtend)
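If fileHead and data are ultimately wanted as NumPy arrays rather than lists, a variation (my sketch, assuming func returns array-like sequences per file) is to collect per-file pieces and concatenate once at the end:

import glob
import numpy as np
from myfnc import func  # from the question

fileHead_parts = []
data_parts = []

for filename in glob.glob('*.dta'):
    with open(filename, 'rb') as thefile:
        h, d = func(thefile)
        fileHead_parts.append(h)
        data_parts.append(d)

# a single concatenation at the end is cheaper than growing an array file by file
fileHead = np.concatenate(fileHead_parts)
data = np.concatenate(data_parts)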
A fixed-length array of a native type (or of a type that implements the Copy trait) can be cloned in Rust up to the length of 32. That is, this compiles:
fn main() {
    let source: [i32; 32] = [0; 32]; // length 32
    let _cloned = source.clone();
}
But this doesn't:
fn main() {
    let source: [i32; 33] = [0; 33]; // length 33
    let _cloned = source.clone(); // <-- compile error
}
In fact, the standard library only implements Clone for arrays of length 0 through 32.
What is an efficient and idiomatic way to clone a generic array of length, say, 33?
You can't add the impl Clone in your own code. This problem will be fixed at some point; in the meantime, you can mostly work around it with varying amounts of effort:
If you just have a local variable of a concrete type and the type is Copy (as in your example), you can simply copy rather than cloning, i.e., let _cloned = source;.
If the array is a field of a struct you want to implement Clone for (and derive won't work), you can still manually implement Clone and using the above trick in the implementation.
Cloning an array of non-Copy types is trickier, because Clone can fail. You could write out [x[0].clone(), x[1].clone(), ...] for as many times as you need, it's a lot of work but at least it's certain to be correct.
If all else fails, you can still create a newtype wrapper. This requires quite a bit of boilerplate to delegate all the other traits you need, but then you can (again, manually) implement Clone.
You can clone arbitrary-length arrays since Rust 1.21.0. The "Libraries" section of the CHANGELOG says:
Generate builtin impls for Clone for all arrays and tuples that are T: Clone