Background
I have a list with the paths of thousands of image stacks (3D numpy arrays), preprocessed and saved as .npy binaries.
Case Study
I would like to calculate the mean of all the images, and to speed up the analysis I thought of parallelising the processing.
Approach using dask.delayed
# List with the file names
flist_img_to_filter
# I chunk the list of paths into sublists. The number of chunks corresponds
# to the number of cores used for the analysis
chunked_list
# Scatter the images sublists to be able to process in parallel
futures = client.scatter(chunked_list)
# Create dask processing graph
output = []
for future in futures:
    ImgMean = delayed(partial_image_mean)(future)
    output.append(ImgMean)
ImgMean_all = delayed(sum)(output)
ImgMean_all = ImgMean_all / len(futures)
# Compute the graph
ImgMean = ImgMean_all.compute()
Approach using dask.arrays
Modified from Matthew Rocklin's blog.
imread = delayed(np.load, pure=True)  # Lazy version of np.load
# Lazily evaluate imread on each path
lazy_values = [imread(img_path) for img_path in flist_img_to_filter]
arrays = [da.from_delayed(lazy_value, dtype=np.uint16, shape=shape)
          for lazy_value in lazy_values]
# Stack all small Dask arrays into one
stack = da.stack(arrays, axis=0)
ImgMean = stack.mean(axis=0).compute()
Questions
1. In the dask.delayed approach is it necessary to pre-chunk the list? If I scatter the original list I obtain a future for each element. Is there a way to tell a worker to process the futures it has access to?
2. The dask.arrays approach is significantly slower and with higher memory usage. Is this a 'bad way' to use dask.arrays?
3. Is there a better way to approach the issue?
Thanks!
In the dask.delayed approach is it necessary to pre-chunk the list? If I scatter the original list I obtain a future for each element. Is there a way to tell a worker to process the futures it has access to?
The simple answer is no: as of Dask version 0.15.4 there is no robust way to submit a computation on "all of the tasks of a certain type currently present on this worker".
However, you can easily ask the scheduler which keys are present on each worker using the who_has or has_what client methods.
import dask
from dask.distributed import wait

futures = dask.persist(futures)
wait(futures)
client.who_has(futures)
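The has_what method gives the inverse view, keyed by worker:

# Mapping of worker address -> keys currently held on that worker
client.has_what()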
The dask.arrays approach is significantly slower and with higher memory usage. Is this a 'bad way' to use dask.arrays?
You might want to play with the split_every= keyword of the mean function, or else rechunk your array to group images together (probably similar to what you do above) before calling mean, to trade off parallelism against memory.
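For example, a minimal sketch reusing arrays and shape from the dask.arrays approach above (the chunk factor of 10 and split_every=4 are illustrative values to tune, not recommendations):

import dask.array as da

# Group several images per chunk before reducing
stack = da.stack(arrays, axis=0)
stack = stack.rechunk((10,) + shape)

# split_every controls how many chunks each intermediate reduction
# combines, trading parallelism against peak memory
ImgMean = stack.mean(axis=0, split_every=4).compute()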
Is there a better way to approach the issue?
You might also try as_completed and compute running means as data completes. You would have to switch from delayed to futures for this.
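A minimal sketch of that pattern, assuming the paths in flist_img_to_filter are readable from the workers:

import numpy as np
from dask.distributed import as_completed

# Submit one load task per file; each returns a future
futures = [client.submit(np.load, path) for path in flist_img_to_filter]

# Accumulate a running sum as results arrive, in completion order
running_sum, n = None, 0
for future in as_completed(futures):
    img = future.result()
    running_sum = img.astype(np.float64) if running_sum is None else running_sum + img
    n += 1

ImgMean = running_sum / n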
The N2 diagram for my full problem is below.
The N2 diagram for the coupled portion of the problem is below.
I have a DirectSolver handling the coupling between LLTForces and ImplicitLiftingLine, and an LNBGS solver handling the coupling between LiftingLineGroup and TestCL.
The gist for the problem is here: https://gist.github.com/eufren/31c0e569ed703b2aea3e2ef5360610f7
I have implemented guess_nonlinear() on ImplicitLiftingLine, which should use various outputs from LLTGeometry to give a good initial guess for the vortex strengths based on a linearised form of the governing equations.
def guess_nonlinear(self, inputs, outputs, resids):
    freestream_unit_vector = inputs['freestream_unit_vector']
    freestream_velocity = inputs['freestream_velocity']
    n = inputs['normal_vectors']
    A = inputs['surface_areas']
    l = inputs['bound_vortices']
    ic_tot = inputs['influence_coefficients_total']

    v_inf = freestream_velocity
    v_inf_vec = v_inf * freestream_unit_vector

    lin_numerator = np.pi * v_inf * A * np.sum(n * v_inf_vec, axis=1)
    lin_denominator = (np.linalg.norm(np.cross(v_inf_vec, l), axis=1)
                       - np.pi * v_inf * A * np.sum(np.sum(n * ic_tot, axis=2), axis=1))
    lin_vtx_str = lin_numerator / lin_denominator

    outputs['vortex_strengths'] = lin_vtx_str
However, when the problem is run for the first time, any inputs not explicitly set with p.set_val() are all 1s. This causes guess_nonlinear() to give a bad output, and so the system fails to converge.
As far as I can tell, the execution order for the LLT group is correct, and the geometry components should be executed before the implicit component. I'm confused as to why this doesn't seem to actually happen when the code is run, and why these inputs instead take their default values.
What do I need to change to get this to work properly? Additionally, I've found it difficult to get LNBGS to converge (hence adding guess_nonlinear()) during optimisation; only DirectSolver gets all the way through the optimisation without issues, but it's very slow for large numbers of LLT nodes. How can I improve the linear and nonlinear solver selection, and improve the reliability of the iterative solver?
Note: Thanks for providing a testable example. It made figuring out the answer to your question a lot simpler. Your problem was a bit subtle, and I would not have been able to give a good answer without runnable code.
Your first question: "Why are all the inputs 1?"
"Short" Answer
You have put the nonlinear solver too high in the model hierarchy, which then included a key precursor component that computed your input values. By moving the solver down to a lower level of the model, I was able to ensure that the precursor component (LTTGeometry) ran and had valid outputs before you got to the guess_nonlinear of the implicit component.
Here is what you had (notice the implicit solver included LTTGeometry even though the data cycle does not require that component):
I moved both the nonlinear solver and the linear solver down into the LTTCycle group, which then allows the LTTGeometry component to execute before getting to the nonlinear solver and guess_nonlinear step:
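Roughly, the change looks like this (the group pathname and the NewtonSolver/DirectSolver pairing are my assumptions, not taken from your code):

import openmdao.api as om

def move_solvers_into_cycle(model):
    # 'ltt_group.ltt_cycle' is a hypothetical pathname; use the actual
    # location of LTTCycle from your N2 diagram
    cycle = model.ltt_group.ltt_cycle
    # Solvers attached here start iterating only after LTTGeometry has
    # executed, so guess_nonlinear sees valid upstream outputs
    cycle.nonlinear_solver = om.NewtonSolver(solve_subsystems=False)
    cycle.linear_solver = om.DirectSolver()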
My fix is only partially correct, since there is a secondary cycle from the TestCL component that also needs a solver and does not have one. However, that cycle still does not involve the LTTGeometry group. So the fully correct fix is to restructure your model to run geometry first, and then put the LTTCycle and TestCL groups together so you can run a solver over just them. That was a bit more hacking than I wanted to do on your test problem, but you can see the general idea from the adjusted N2 above.
Long Answer
The guess_nonlinear sequence in OpenMDAO does NOT run the compute method of explicit components or of groups. It follows the execution hierarchy, and calls any guess_nonlinear that it finds. So that means that any explicit components you have in your model will NOT get executed, their outputs will not get updated with computed values, and those computed values will not get passed to the inputs of downstream components.
Things get a little tricky when you have deep model hierarchies. The guess_nonlinear method is called as the first step in the nonlinear solver process. If you have a NonlinearRunOnce solver at the top level, it will follow the compute chain down the line, calling compute or solve_nonlinear on each child and doing a data transfer after each one. If one of those children happens to be a group with a nonlinear solver, then that solver will call guess_nonlinear on its children (grandchildren of the top group with the NonlinearRunOnce solver) as the first step. So any outputs that were computed by the siblings of this group will be valid, but none of the outputs from the grandchild level will have been computed yet.
You may be wondering: why not just have the guess_nonlinear method call the compute of any explicit components? There is a difficult-to-balance trade-off here. If you assume that all explicit components are very cheap to run, then it might make sense to run the compute methods, or it might not. A lot depends on the cyclic data structure. If some early component in the group needs guesses from the later one, then running its compute isn't going to help you much at all. Perhaps more importantly, not all explicit components are cheap to run. You might have a very expensive computation, and calling compute as part of the guess process would be way too costly.
The compromise here, if you need some kind of top level guess process, is that you can implement guess_nonlinear at the group level. It's less common to do, but it gives you total control over what happens. You can call whatever you need to call in whatever sequence.
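A minimal sketch of that pattern (the class and variable names are illustrative, not from your model):

import openmdao.api as om

class CycleWithGuess(om.Group):
    # A group-level guess_nonlinear runs before this group's nonlinear
    # solver iterates, and you choose the sequence of calls yourself
    def guess_nonlinear(self, inputs, outputs, residuals):
        # Only data computed before this group's solver started is valid.
        # Set the implicit state directly; a cheap linearised estimate,
        # as in the question, could be computed here instead
        outputs['implicit_comp.state'][:] = 1.0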
So the absolute key thing to remember is that the only data available to you when guess_nonlinear is called is data that was computed before your containing solver was executed. That means anything computed before you got to the model scope of the containing solver (not the scope of the component with the guess_nonlinear method itself).
Your second question: "How can I speed this up when the number of nodes gets large?"
It's not possible to give a generic answer to this one. I noticed that you have already specified sparse partial derivatives. That is a great start, but if it's still not fast enough for you, then it means you're reaching the limits of what you can do with a DirectSolver. You note that this solver is the only one that gets you through the optimization without issues, which I will take to mean that ScipyKrylov and PETScKrylov are not converging the linear system well for you, at least not by themselves. That's not surprising, as Krylov solvers almost always require some kind of preconditioner... and this is why I can't offer a generic answer. Setting up efficient linear solvers for larger-scale compute is a tricky subject. If you look into the literature, you'll find some good suggestions. You can also study open source implementations like VSPAero for some tips.
Effectively, you've reached the limit of what simple linear solvers can offer you. From this point forward, OpenMDAO can help a bit by making it easier to implement some preconditioning, but you'll have to work through the math side yourself.
I am working on a port of some IDL code to Python (3.7). I have a working translation which uses direct Python alternatives wherever they are available, supplemented where I can with idlwrap. In an effort to eliminate legacy IDL functions from the code, I am looking for an alternative to ARRAY_INDICES(). Right now, I have simply translated the entire function directly and import it on its own. I've spent a good deal of time trying to understand exactly what it does, and even after translating it verbatim, it is still unclear to me, which makes coming up with a simple Python solution challenging.
The good news is I only need it to work with one specific set of arrays whose shapes won't change. An example of the code that will be run follows:
temp = np.sum(arr, axis=0)
goodval = idlwrap.where(temp > -10)
ngood = goodval.size
arr2 = np.zeros_like(arr)
for i in range(ngood):  # note: range(0, ngood - 1) would skip the last index
    indices = array_indices(arr2, goodval[i])
    # use indices for computation
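For what it's worth, NumPy's np.unravel_index covers the usual role of ARRAY_INDICES(), converting flat indices into per-dimension subscripts; a minimal sketch, assuming goodval holds flat indices valid for arr2's shape (note that IDL is column-major while NumPy defaults to row-major, so the axis order is reversed):

import numpy as np

# Equivalent of ARRAY_INDICES for a single flat index
indices = np.unravel_index(goodval[0], arr2.shape)

# Or vectorised over all flat indices at once, returning one array
# of subscripts per dimension
all_indices = np.unravel_index(goodval, arr2.shape)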
I have looked at the PyTorch source code for the MNIST dataset, but it seems to read the numpy array directly from binaries.
How can I just create train_data and train_labels like it does? I have already prepared the images and a txt file with the labels.
I have learned how to read images and labels and how to write __getitem__ and __len__; what really confuses me is how to make train_data and train_labels, which are torch.Tensors. I tried to arrange them into Python lists and convert them to torch.Tensor, but failed:
train_data = []
for index in range(0, len(self.files)):
    fn, label = self.files[index]
    img = self.loader(fn)
    if self.transform is not None:
        img = self.transform(img)
    train_data.append(img)
self.train_data = torch.tensor(train_data)
ValueError: only one element tensors can be converted to Python scalars
There are two ways to go. First, the manual way. torchvision.datasets states the following:
datasets are subclasses of torch.utils.data.Dataset, i.e. they have __getitem__ and __len__ methods implemented. Hence, they can all be passed to a torch.utils.data.DataLoader which can load multiple samples in parallel using torch.multiprocessing workers.
So you can just implement your own class which scans for all the images and labels, keeps a list of their paths (so that you don't have to keep them in RAM), and has a __getitem__ method which, given index i, reads the i-th file and its label and returns them. This minimal interface is enough to work with the parallel dataloader in torch.utils.data.
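A minimal sketch of such a class (the file list and loader are assumed to match the question's setup):

import torch
from torch.utils.data import Dataset, DataLoader

class MyImageDataset(Dataset):
    # 'files' is a list of (path, label) pairs, as in the question;
    # 'loader' reads a single image from disk
    def __init__(self, files, loader, transform=None):
        self.files = files
        self.loader = loader
        self.transform = transform

    def __len__(self):
        return len(self.files)

    def __getitem__(self, index):
        fn, label = self.files[index]
        img = self.loader(fn)  # the i-th file is only read here, lazily
        if self.transform is not None:
            img = self.transform(img)
        return img, label

# The parallel DataLoader then handles batching and worker processes:
# dl = DataLoader(MyImageDataset(files, loader), batch_size=32,
#                 shuffle=True, num_workers=4)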
Secondly, if your data directory can be rearranged into either structure, you can use the DatasetFolder and ImageFolder pre-built loaders. This will save you some coding and automatically provide support for data augmentation routines from torchvision.transforms.
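For example (the root/class_x/xxx.png directory layout is assumed):

from torchvision import datasets, transforms

# ImageFolder infers labels from the subdirectory names
dataset = datasets.ImageFolder('path/to/train',
                               transform=transforms.ToTensor())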
I want to use Faster R-CNN Inception V2 to do object detection in tensorflow.js, but I can't find methods in tfjs like get_tensor_by_name and session.run for prediction.
In TensorFlow (Python), the code is as follows:
Define input and output node:
# Define input Tensors for detection_graph
self.image_tensor = self.detection_graph.get_tensor_by_name('image_tensor:0')
# Define output Tensors for detection_graph
self.detection_boxes = self.detection_graph.get_tensor_by_name('detection_boxes:0')
self.detection_scores = self.detection_graph.get_tensor_by_name('detection_scores:0')
self.detection_classes = self.detection_graph.get_tensor_by_name('detection_classes:0')
self.num_detections = self.detection_graph.get_tensor_by_name('num_detections:0')
Predict:
(boxes, scores, classes, num) = self.sess.run(
    [self.detection_boxes, self.detection_scores,
     self.detection_classes, self.num_detections],
    feed_dict={self.image_tensor: image_np_expanded})
Does anyone know how to implement those two parts of the code in tfjs?
Please help. Thank you!
You don't have a session.run function in TensorFlow.js as there is in Python. In Python, you start by defining a graph, and in the run function you execute the graph. Tensors and variables are assigned values in the graph, but the graph defines only the flow of computation; it does not hold any values. The real computation occurs when you run the session. One can create many sessions where each session can assign different values to the variables, which is why the graph has a get_tensor_by_name method that outputs the tensor whose name is given as a parameter.
You don't have the same mechanism in Js. You can use variables as soon as you define them. That means whenever you define a new tensor or variable, you can print it on the following line, whereas in Python you can only print a tensor or variable during a session. get_tensor_by_name does not really make sense in Js.
As for the predict function, you do have one in Js as well. If you create a model using tf.model or tf.sequential, you can invoke predict to make a prediction.
I tried using the different caching methods in section 4.2.1 of the PHPExcel manual.
I did a benchmark with 100k rows and here are the results:
gzip:   time=50, memory used=177734904
ser:    time=34, memory used=291654272
phptmp: time=41, memory used=325973456
isamm:  time=39, memory used=325972824
The manual says that the phpTemp and discISAM methods use disk instead of memory, so they should use the least memory, but the results show the opposite.
Here is the code I used to test:
$cacheMethod = PHPExcel_CachedObjectStorageFactory::cache_in_memory_gzip;
// $cacheMethod = PHPExcel_CachedObjectStorageFactory::cache_in_memory_serialized;
// $cacheMethod = PHPExcel_CachedObjectStorageFactory::cache_to_phpTemp;
// $cacheSettings = array('memoryCacheSize' => '8MB');
// $cacheMethod = PHPExcel_CachedObjectStorageFactory::cache_to_discISAM;
$cacheSettings = array();  // populate when a method needs settings, e.g. phpTemp
PHPExcel_Settings::setCacheStorageMethod($cacheMethod, $cacheSettings);

$xlsReader = PHPExcel_IOFactory::createReader($fileType);
$xlsReader->setReadDataOnly(true);
Can anyone shed light on this mystery?
This depends on many factors, including the PHP version, the content of the cells (numeric, string, rich text, etc.), and the extensions enabled for PHP; so it is impossible for anybody other than yourself to actually answer, because it is unique to your situation.
However, all methods retain some information about each cell in memory, with the exception of SQLite, so using an SQLite database is the most memory-efficient option.
EDIT
I ran some tests with different caching methods against different versions of PHP some months ago, and the following summarises the results.
These are still fairly arbitrary results; disk speed and other factors will affect performance for some caching methods like discISAM and phpTemp, and any configuration settings for options like phpTemp will also have some effect. But it should give a relative guideline for working out which options are better for memory, and which are better for speed of execution.