Python: imputing with BayesianRidge() via sklearn.impute.IterativeImputer raises a ValueError about array dimensions

PROBLEM
Use IterativeImputer from sklearn.impute with a BayesianRidge() regression model to impute the missing values in the 'Frontage' variable.
After interative_imputer_fit = interative_imputer.fit(data) runs, calling interative_imputer_fit.transform(X) inside the function imputer_bay_ridge(data) raises a ValueError. Two variables, Frontage and Area, were passed to the function, but only Frontage ended up in the numpy array handed to transform().
Python CODE using sklearn
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import BayesianRidge
def imputer_bay_ridge(data):
    data_array = data.to_numpy()
    data_array.reshape(1, -1)  # note: reshape returns a new array, so this result is discarded
    interative_imputer = IterativeImputer(BayesianRidge())
    interative_imputer_fit = interative_imputer.fit(data_array)
    X = data['LotFrontage']
    data_imputed = interative_imputer_fit.transform(X)  # this call raises the ValueError below
INVOKE FUNCTION
fit_tranformed_imputed = imputer_bay_ridge(train_data[['Frontage', 'Area']])
DATA EXAMPLE
train_data[['Frontage', 'Area']]
Frontage Area
0 65.0 8450
1 80.0 9600
2 68.0 11250
3 60.0 9550
4 84.0 14260
... ... ...
1455 62.0 7917
1456 85.0 13175
1457 66.0 9042
1458 68.0 9717
1459 75.0 9937
1460 rows × 2 columns
ERROR
ValueError Traceback (most recent call last)
Cell In[243], line 1
----> 1 fit_tranformed_imputed = imputer_bay_ridge(train_data[['LotFrontage', 'LotArea']])
Cell In[242], line 12, in imputer_bay_ridge(data)
10 interative_imputer_fit = interative_imputer.fit(data_array)
11 X = data['LotFrontage']
---> 12 data_imputed = interative_imputer_fit.transform(X)
File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/impute/_iterative.py:724, in IterativeImputer.transform(self, X)
707 """Impute all missing values in `X`.
708
709 Note that this is stochastic, and that if `random_state` is not fixed,
(...)
720 The imputed input data.
721 """
722 check_is_fitted(self)
--> 724 X, Xt, mask_missing_values, complete_mask = self._initial_imputation(X)
726 X_indicator = super()._transform_indicator(complete_mask)
728 if self.n_iter_ == 0 or np.all(mask_missing_values):
File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/impute/_iterative.py:514, in IterativeImputer._initial_imputation(self, X, in_fit)
511 else:
512 force_all_finite = True
--> 514 X = self._validate_data(
515 X,
516 dtype=FLOAT_DTYPES,
517 order="F",
518 reset=in_fit,
519 force_all_finite=force_all_finite,
520 )
521 _check_inputs_dtype(X, self.missing_values)
523 X_missing_mask = _get_mask(X, self.missing_values)
File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py:566, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
564 raise ValueError("Validation should be done on X, y or both.")
565 elif not no_val_X and no_val_y:
--> 566 X = check_array(X, **check_params)
567 out = X
568 elif no_val_X and not no_val_y:
File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py:769, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
767 # If input is 1D raise error
768 if array.ndim == 1:
--> 769 raise ValueError(
770 "Expected 2D array, got 1D array instead:\narray={}.\n"
771 "Reshape your data either using array.reshape(-1, 1) if "
772 "your data has a single feature or array.reshape(1, -1) "
773 "if it contains a single sample.".format(array)
774 )
776 # make sure we actually converted to numeric:
777 if dtype_numeric and array.dtype.kind in "OUSV":
ValueError: Expected 2D array, got 1D array instead:
array=[65. 80. 68. ... 66. 68. 75.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
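POSSIBLE FIX
The traceback shows the real problem: transform() receives data['LotFrontage'], a 1-D pandas Series, while the imputer was fitted on a 2-D array with two columns. IterativeImputer expects the same 2-D shape (n_samples, n_features) at fit time and at transform time. Below is a minimal sketch, assuming the goal is to impute 'Frontage' using both 'Frontage' and 'Area'; the column names and train_data come from the question, everything else is an assumption, not the asker's code.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

def imputer_bay_ridge(data):
    # Keep the input 2-D: one row per sample, one column per feature.
    imputer = IterativeImputer(BayesianRidge())
    # fit_transform fits and imputes on the same 2-D array,
    # so the shapes at fit time and transform time always agree.
    return imputer.fit_transform(data.to_numpy())

# Hypothetical usage with the frame from the question:
imputed = imputer_bay_ridge(train_data[['Frontage', 'Area']])
train_data['Frontage'] = imputed[:, 0]  # first column holds the imputed Frontage values
If a single feature really must be passed on its own, keep it 2-D, e.g. data[['LotFrontage']] or X.to_numpy().reshape(-1, 1), exactly as the error message suggests.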

Related

How to fix the 'The columns in the computed data do not match the columns in the provided metadata' error?

I have a table of price updates in the format (timestamp, price, amount).
The timestamp is a datetime, price categorical and amount float64. The timestamp column is set as an index.
My goal is to get the amount available at each price level at each point in time.
First, I use pivot_table to spread the prices into columns, and then forward fill.
pivot = price_table.pivot_table(index = 'timestamp',
columns = 'price', values = 'amount')
pivot_ffill = pivot.fillna(method = 'ffill')
I can call compute or head on pivot_ffill and it works fine.
Clearly, there are still NAs at the beginning of the table where there have been no updates yet.
When I apply
pivot_nullfill = pivot_ffill.fillna(0)
pivot_nullfill.head()
I get the error
The columns in the computed data do not match the columns in the provided metadata. I tried replacing the zero with 0.0 or float(0), but to no avail. As the previous steps work, I strongly suspect it has something to do with the fillna, but because of the delayed (lazy) calculations that does not have to be true.
Does someone know what causes this? Thank you!
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-180-f8ab344c7939> in <module>
----> 1 pivot_ffill.fillna(0).head()
C:\ProgramData\Anaconda3\envs\python36\lib\site-packages\dask\dataframe\core.py in head(self, n, npartitions, compute)
896
897 if compute:
--> 898 result = result.compute()
899 return result
900
C:\ProgramData\Anaconda3\envs\python36\lib\site-packages\dask\base.py in compute(self, **kwargs)
154 dask.base.compute
155 """
--> 156 (result,) = compute(self, traverse=False, **kwargs)
157 return result
158
C:\ProgramData\Anaconda3\envs\python36\lib\site-packages\dask\base.py in compute(*args, **kwargs)
396 keys = [x.__dask_keys__() for x in collections]
397 postcomputes = [x.__dask_postcompute__() for x in collections]
--> 398 results = schedule(dsk, keys, **kwargs)
399 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
400
C:\ProgramData\Anaconda3\envs\python36\lib\site-packages\dask\threaded.py in get(dsk, result, cache, num_workers, pool, **kwargs)
74 results = get_async(pool.apply_async, len(pool._pool), dsk, result,
75 cache=cache, get_id=_thread_get_id,
---> 76 pack_exception=pack_exception, **kwargs)
77
78 # Cleanup pools associated to dead threads
C:\ProgramData\Anaconda3\envs\python36\lib\site-packages\dask\local.py in get_async(apply_async, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, **kwargs)
460 _execute_task(task, data) # Re-execute locally
461 else:
--> 462 raise_exception(exc, tb)
463 res, worker_id = loads(res_info)
464 state['cache'][key] = res
C:\ProgramData\Anaconda3\envs\python36\lib\site-packages\dask\compatibility.py in reraise(exc, tb)
110 if exc.__traceback__ is not tb:
111 raise exc.with_traceback(tb)
--> 112 raise exc
113
114 import pickle as cPickle
C:\ProgramData\Anaconda3\envs\python36\lib\site-packages\dask\local.py in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
228 try:
229 task, data = loads(task_info)
--> 230 result = _execute_task(task, data)
231 id = get_id()
232 result = dumps((result, id))
C:\ProgramData\Anaconda3\envs\python36\lib\site-packages\dask\core.py in _execute_task(arg, cache, dsk)
116 elif istask(arg):
117 func, args = arg[0], arg[1:]
--> 118 args2 = [_execute_task(a, cache) for a in args]
119 return func(*args2)
120 elif not ishashable(arg):
C:\ProgramData\Anaconda3\envs\python36\lib\site-packages\dask\core.py in <listcomp>(.0)
116 elif istask(arg):
117 func, args = arg[0], arg[1:]
--> 118 args2 = [_execute_task(a, cache) for a in args]
119 return func(*args2)
120 elif not ishashable(arg):
C:\ProgramData\Anaconda3\envs\python36\lib\site-packages\dask\core.py in _execute_task(arg, cache, dsk)
117 func, args = arg[0], arg[1:]
118 args2 = [_execute_task(a, cache) for a in args]
--> 119 return func(*args2)
120 elif not ishashable(arg):
121 return arg
C:\ProgramData\Anaconda3\envs\python36\lib\site-packages\dask\optimization.py in __call__(self, *args)
940 % (len(self.inkeys), len(args)))
941 return core.get(self.dsk, self.outkey,
--> 942 dict(zip(self.inkeys, args)))
943
944 def __reduce__(self):
C:\ProgramData\Anaconda3\envs\python36\lib\site-packages\dask\core.py in get(dsk, out, cache)
147 for key in toposort(dsk):
148 task = dsk[key]
--> 149 result = _execute_task(task, cache)
150 cache[key] = result
151 result = _execute_task(out, cache)
C:\ProgramData\Anaconda3\envs\python36\lib\site-packages\dask\core.py in _execute_task(arg, cache, dsk)
117 func, args = arg[0], arg[1:]
118 args2 = [_execute_task(a, cache) for a in args]
--> 119 return func(*args2)
120 elif not ishashable(arg):
121 return arg
C:\ProgramData\Anaconda3\envs\python36\lib\site-packages\dask\compatibility.py in apply(func, args, kwargs)
91 def apply(func, args, kwargs=None):
92 if kwargs:
---> 93 return func(*args, **kwargs)
94 else:
95 return func(*args)
C:\ProgramData\Anaconda3\envs\python36\lib\site-packages\dask\dataframe\core.py in apply_and_enforce(*args, **kwargs)
3800 if not np.array_equal(np.nan_to_num(meta.columns),
3801 np.nan_to_num(df.columns)):
-> 3802 raise ValueError("The columns in the computed data do not match"
3803 " the columns in the provided metadata")
3804 else:
ValueError: The columns in the computed data do not match the columns in the provided metadata
The error message should have given you a suggestion of how to fix the situation. Assuming you are loading from CSV (the question doesn't say), you would probably end up with a line like
df = dd.read_csv(..., dtype={...})
which instructs the pandas reader on the dtypes you want to enforce, since you know more information than pandas does. That ensures that all partitions have the same types for all columns - see the notes part of the docs.
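As a minimal sketch, assuming price_table is read from a CSV file (the file name and the exact dtypes below are assumptions, not taken from the question):
import dask.dataframe as dd

# Hypothetical file name and schema; substitute the real ones.
price_table = dd.read_csv(
    'price_updates.csv',
    dtype={'price': 'object', 'amount': 'float64'},
    parse_dates=['timestamp'],
)
Declaring the dtypes up front keeps every partition's metadata consistent, so a later fillna(0) no longer produces columns that disagree with what dask inferred from the first partition.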

Impossible to invoke endpoint with SageMaker

I am using AWS SageMaker and trying to invoke the endpoint:
payload = pd.read_csv('payload.csv', header=None)
>> payload
0 1 2 3 4
0 setosa 5.1 3.5 1.4 0.2
1 setosa 5.1 3.5 1.4 0.2
with this code:
response = runtime.invoke_endpoint(EndpointName=r_endpoint,
ContentType='text/csv',
Body=payload)
But I get this error:
ParamValidationError Traceback (most recent call last)
<ipython-input-304-f79f5cf7e0e0> in <module>()
1 response = runtime.invoke_endpoint(EndpointName=r_endpoint,
2 ContentType='text/csv',
----> 3 Body=payload)
4
5 result = json.loads(response['Body'].read().decode())
~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
312 "%s() only accepts keyword arguments." % py_operation_name)
313 # The "self" in this scope is referring to the BaseClient.
--> 314 return self._make_api_call(operation_name, kwargs)
315
316 _api_call.__name__ = str(py_operation_name)
~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
584 }
585 request_dict = self._convert_to_request_dict(
--> 586 api_params, operation_model, context=request_context)
587
588 handler, event_response = self.meta.events.emit_until_response(
~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/client.py in _convert_to_request_dict(self, api_params, operation_model, context)
619 api_params, operation_model, context)
620 request_dict = self._serializer.serialize_to_request(
--> 621 api_params, operation_model)
622 prepare_request_dict(request_dict, endpoint_url=self._endpoint.host,
623 user_agent=self._client_config.user_agent,
~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/validate.py in serialize_to_request(self, parameters, operation_model)
289 operation_model.input_shape)
290 if report.has_errors():
--> 291 raise ParamValidationError(report=report.generate_report())
292 return self._serializer.serialize_to_request(parameters,
293 operation_model)
ParamValidationError: Parameter validation failed:
Invalid type for parameter Body, value: 0 1 2 3 4
0 setosa 5.1 3.5 1.4 0.2
1 setosa 5.1 3.5 1.4 0.2, type: <class 'pandas.core.frame.DataFrame'>, valid types: <class 'bytes'>, <class 'bytearray'>, file-like object
I am just following the same code/steps as in the AWS tutorial.
Can you help me resolve this problem, please?
Thank you.
The payload variable is a pandas DataFrame, while invoke_endpoint() expects Body to be bytes, a bytearray, or a file-like object.
Try something like this (coding blind):
response = runtime.invoke_endpoint(EndpointName=r_endpoint,
ContentType='text/csv',
Body=open('payload.csv'))
More on the expected formats here.
Make sure the file doesn't include a header.
Alternatively, convert your DataFrame to bytes, like in this example, and pass those bytes instead of passing a DataFrame.
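A minimal sketch of that second option, assuming the endpoint accepts raw CSV rows without a header; the client setup and the choice to keep the label column are assumptions, while payload and r_endpoint mirror the question:
import json
import boto3
import pandas as pd

runtime = boto3.client('sagemaker-runtime')        # assumed client; reuse your existing one
payload = pd.read_csv('payload.csv', header=None)  # same file as in the question

# Serialize the DataFrame to CSV text, then encode it to bytes for Body=.
csv_bytes = payload.to_csv(header=False, index=False).encode('utf-8')

response = runtime.invoke_endpoint(
    EndpointName=r_endpoint,
    ContentType='text/csv',
    Body=csv_bytes,
)
result = json.loads(response['Body'].read().decode())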

Using apply in building a tensor of features from a matrix

Consider the following code:
EmbedFeatures <- function(x,w) {
c_rev <- seq(from=w,to=1,by=-1)
em <- embed(x,w)
em <- em[,c_rev]
return(em)
}
library(abind)  # abind() comes from the abind package
m=matrix(1:1400,100,14)
X.tr<-c()
F<-dim(m)[2]
W=16
for(i in 1:F){ X.tr<-abind(list(X.tr,EmbedFeatures(m[,i],W)),along=3)}
This builds an array of features; each row has W=16 timesteps.
The dimensions are:
> dim(X.tr)
[1] 85 16 14
The following are the first samples:
> X.tr[1,,1]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
> X.tr[1,,2]
[1] 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116
> X.tr[1,,3]
[1] 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
I would like to use apply to build this array, but the following code does not work:
X.tr <- apply(m,2,EmbedFeatures, w=W)
since it gives me the following dimensions:
> dim(X.tr)
[1] 1360 14
How can I do it?
Firstly, thanks for providing a great reproducible example!
Now, as far as I know, you can't do this with apply. You can, however, do it with a combination of plyr::aaply, which allows you to return multidimensional arrays, and base::aperm, which allows you to transpose multidimensional arrays.
See here for aaply function details and here for aperm function details.
After running your code above, you can do:
library(plyr)
Y.tr <- plyr::aaply(m, 2, EmbedFeatures, w=W)
Z.tr <- aperm(Y.tr, c(2,3,1))
dim(Y.tr)
[1] 14 85 16
dim(Z.tr)
[1] 85 16 14
I turned those two lines of code into a function.
using_aaply <- function(m = m) {
Y.tr <- aaply(m, 2, EmbedFeatures, w=W)
Z.tr <- aperm(Y.tr, c(2,3,1))
return(Z.tr)
}
Then I did some microbenchmarking.
library(microbenchmark)
microbenchmark(for(i in 1:F){ X.tr<-abind(list(X.tr,EmbedFeatures(m[,i],W)),along=3)}, times=100)
Unit: milliseconds
expr
for (i in 1:F) { X.tr <- abind(list(X.tr, EmbedFeatures(m[, i], W)), along = 3) }
min lq mean median uq max neval
405.0095 574.9824 706.0845 684.8531 802.4413 1189.845 100
microbenchmark(using_aaply(m=m), times=100)
Unit: milliseconds
expr min lq mean median uq max
using_aaply(m = m) 4.873627 5.670474 7.797129 7.083925 9.674041 19.74449
neval
100
It seems to be much faster to use aaply and aperm than abind in a for-loop.

Cython: ctypedef char array throws obscure error

Why does this work:
In [17]: %%cython -f
...: from libc.string cimport memcpy
...:
...: DEF KLEN = 5
...: DEF TRP_KLEN = KLEN * 3
...:
...: cdef:
...: unsigned char k[KLEN]
...: unsigned char kk[TRP_KLEN]
...: kk = bytearray(b'12345abcde!##$%')
...: memcpy(&k, &kk[5], KLEN)
...: print(k)
b'abcde'
While this:
In [16]: %%cython -f
...: from libc.string cimport memcpy
...:
...: DEF KLEN = 5
...: DEF TRP_KLEN = KLEN * 3
...: ctypedef unsigned char SingleKey[KLEN]
...: ctypedef unsigned char TripleKey[TRP_KLEN]
...:
...: cdef:
...: SingleKey k
...: TripleKey kk
...: kk = bytearray(b'12345abcde!##$%')
...: memcpy(&k, &kk[5], KLEN)
...: print(k)
throws an obscure error, which doesn't directly mention my code:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-16-a4bb608248b0> in <module>()
----> 1 get_ipython().run_cell_magic('cython', '-f', "from libc.string cimport memcpy\n\nDEF KLEN = 5\nDEF TRP_KLEN = KLEN * 3\nctypedef unsigned char SingleKey[KLEN]\nctypedef unsigned char DoubleKey[TRP_KLEN]\n\ncdef:\n SingleKey k[KLEN]\n DoubleKey kk[TRP_KLEN]\nkk = bytearray(b'12345abcde!##$%')\nmemcpy(&k, &kk[5], KLEN)\nprint(k)")
/usr/lib/python3.6/site-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
2165 magic_arg_s = self.var_expand(line, stack_depth)
2166 with self.builtin_trap:
-> 2167 result = fn(magic_arg_s, cell)
2168 return result
2169
<decorator-gen-118> in cython(self, line, cell)
/usr/lib/python3.6/site-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
185 # but it's overkill for just that one bit of state.
186 def magic_deco(arg):
--> 187 call = lambda f, *a, **k: f(*a, **k)
188
189 if callable(arg):
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Build/IpythonMagic.py in cython(self, line, cell)
318 extension = None
319 if need_cythonize:
--> 320 extensions = self._cythonize(module_name, code, lib_dir, args, quiet=args.quiet)
321 assert len(extensions) == 1
322 extension = extensions[0]
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Build/IpythonMagic.py in _cythonize(self, module_name, code, lib_dir, args, quiet)
426 elif sys.version_info[0] >= 3:
427 opts['language_level'] = 3
--> 428 return cythonize([extension], **opts)
429 except CompileError:
430 return None
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Build/Dependencies.py in cythonize(module_list, exclude, nthreads, aliases, quiet, force, language, exclude_failures, **options)
1024 if not nthreads:
1025 for args in to_compile:
-> 1026 cythonize_one(*args)
1027
1028 if exclude_failures:
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Build/Dependencies.py in cythonize_one(pyx_file, c_file, fingerprint, quiet, options, raise_on_failure, embedded_metadata, full_module_name, progress)
1127 any_failures = 0
1128 try:
-> 1129 result = compile_single(pyx_file, options, full_module_name=full_module_name)
1130 if result.num_errors > 0:
1131 any_failures = 1
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Compiler/Main.py in compile_single(source, options, full_module_name)
647 recursion.
648 """
--> 649 return run_pipeline(source, options, full_module_name)
650
651
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Compiler/Main.py in run_pipeline(source, options, full_module_name, context)
497
498 context.setup_errors(options, result)
--> 499 err, enddata = Pipeline.run_pipeline(pipeline, source)
500 context.teardown_errors(err, options, result)
501 return result
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Compiler/Pipeline.py in run_pipeline(pipeline, source, printtree)
352 exec("def %s(phase, data): return phase(data)" % phase_name, exec_ns)
353 run = _pipeline_entry_points[phase_name] = exec_ns[phase_name]
--> 354 data = run(phase, data)
355 if DebugFlags.debug_verbose_pipeline:
356 print(" %.3f seconds" % (time() - t))
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Compiler/Pipeline.py in run(phase, data)
332
333 def run(phase, data):
--> 334 return phase(data)
335
336 error = None
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Compiler/Pipeline.py in generate_pyx_code_stage(module_node)
50 def generate_pyx_code_stage_factory(options, result):
51 def generate_pyx_code_stage(module_node):
---> 52 module_node.process_implementation(options, result)
53 result.compilation_source = module_node.compilation_source
54 return result
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Compiler/ModuleNode.py in process_implementation(self, options, result)
140 self.find_referenced_modules(env, self.referenced_modules, {})
141 self.sort_cdef_classes(env)
--> 142 self.generate_c_code(env, options, result)
143 self.generate_h_code(env, options, result)
144 self.generate_api_code(env, options, result)
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Compiler/ModuleNode.py in generate_c_code(self, env, options, result)
376 # generate normal variable and function definitions
377 self.generate_variable_definitions(env, code)
--> 378 self.body.generate_function_definitions(env, code)
379 code.mark_pos(None)
380 self.generate_typeobj_definitions(env, code)
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Compiler/Nodes.py in generate_function_definitions(self, env, code)
436 #print "StatListNode.generate_function_definitions" ###
437 for stat in self.stats:
--> 438 stat.generate_function_definitions(env, code)
439
440 def generate_execution_code(self, code):
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Compiler/Nodes.py in generate_function_definitions(self, env, code)
9242 entry.cname = cname
9243
-> 9244 self.node.generate_function_definitions(env, code)
9245
9246 def generate_execution_code(self, code):
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Compiler/Nodes.py in generate_function_definitions(self, env, code)
1974 # ----- Function body -----
1975 # -------------------------
-> 1976 self.generate_function_body(env, code)
1977
1978 code.mark_pos(self.pos, trace=False)
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Compiler/Nodes.py in generate_function_body(self, env, code)
1736
1737 def generate_function_body(self, env, code):
-> 1738 self.body.generate_execution_code(code)
1739
1740 def generate_function_definitions(self, env, code):
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Compiler/Nodes.py in generate_execution_code(self, code)
442 for stat in self.stats:
443 code.mark_pos(stat.pos)
--> 444 stat.generate_execution_code(code)
445
446 def annotate(self, code):
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Compiler/Nodes.py in generate_execution_code(self, code)
6185 for i, if_clause in enumerate(self.if_clauses):
6186 self._set_branch_hint(if_clause, if_clause.body)
-> 6187 if_clause.generate_execution_code(code, end_label, is_last=i == last)
6188 if self.else_clause:
6189 code.mark_pos(self.else_clause.pos)
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Compiler/Nodes.py in generate_execution_code(self, code, end_label, is_last)
6247 self.condition.generate_disposal_code(code)
6248 self.condition.free_temps(code)
-> 6249 self.body.generate_execution_code(code)
6250 code.mark_pos(self.pos, trace=False)
6251 if not (is_last or self.body.is_terminator):
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Compiler/Nodes.py in generate_execution_code(self, code)
442 for stat in self.stats:
443 code.mark_pos(stat.pos)
--> 444 stat.generate_execution_code(code)
445
446 def annotate(self, code):
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Compiler/UtilNodes.py in generate_execution_code(self, code)
324 def generate_execution_code(self, code):
325 self.setup_temp_expr(code)
--> 326 self.body.generate_execution_code(code)
327 self.teardown_temp_expr(code)
328
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Compiler/Nodes.py in generate_execution_code(self, code)
6605 self.item.generate_evaluation_code(code)
6606 self.target.generate_assignment_code(self.item, code)
-> 6607 self.body.generate_execution_code(code)
6608 code.mark_pos(self.pos)
6609 code.put_label(code.continue_label)
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Compiler/Nodes.py in generate_execution_code(self, code)
442 for stat in self.stats:
443 code.mark_pos(stat.pos)
--> 444 stat.generate_execution_code(code)
445
446 def annotate(self, code):
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Compiler/Nodes.py in generate_execution_code(self, code)
5100 def generate_execution_code(self, code):
5101 code.mark_pos(self.pos)
-> 5102 self.generate_rhs_evaluation_code(code)
5103 self.generate_assignment_code(code)
5104
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Compiler/Nodes.py in generate_rhs_evaluation_code(self, code)
5387
5388 def generate_rhs_evaluation_code(self, code):
-> 5389 self.rhs.generate_evaluation_code(code)
5390
5391 def generate_assignment_code(self, code, overloaded_assignment=False):
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Compiler/ExprNodes.py in generate_evaluation_code(self, code)
718 self.allocate_temp_result(code)
719
--> 720 self.generate_result_code(code)
721 if self.is_temp and not (self.type.is_string or self.type.is_pyunicode_ptr):
722 # If we are temp we do not need to wait until this node is disposed
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Compiler/ExprNodes.py in generate_result_code(self, code)
13135
13136 code.putln(self.type.from_py_call_code(
> 13137 self.arg.py_result(), self.result(), self.pos, code, from_py_function=from_py_function))
13138 if self.type.is_pyobject:
13139 code.put_gotref(self.py_result())
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Compiler/PyrexTypes.py in from_py_call_code(self, source_code, result_code, error_pos, code, from_py_function, error_condition)
512 source_code, result_code, error_pos, code,
513 from_py_function or self.from_py_function,
--> 514 error_condition or self.error_condition(result_code)
515 )
516
~/code/lsup/virtualenv/lib/python3.6/site-packages/Cython/Compiler/PyrexTypes.py in from_py_call_code(self, source_code, result_code, error_pos, code, from_py_function, error_condition)
2481 def from_py_call_code(self, source_code, result_code, error_pos, code,
2482 from_py_function=None, error_condition=None):
-> 2483 assert not error_condition, '%s: %s' % (error_pos, error_condition)
2484 call_code = "%s(%s, %s, %s)" % (
2485 from_py_function or self.from_py_function,
AssertionError: (<StringSourceDescriptor:carray.from_py>, 87, 19): (!__pyx_t_11) && PyErr_Occurred()
This is especially problematic in my 1000+-line program, where I had to find out what was causing the problem. Removing all ctypedefs for char arrays that were being assigned removed the error, but I'd like to know what is causing it.
Thanks.
UPDATE
Note: This is not a NUL-terminated string, but a byte sequence that may have NULs anywhere. That's why I am using memcpy rather than string functions or regular = assignments.

How to transform CSV file data in Apache Camel

I want to transform some fields' data in specific rows of a CSV file. I tried the following.
1) Using CSV marshaling and unmarshaling I achieved it, but the output CSV does not come out in the proper order even though I sent a list of maps (i.e. List<Map<String,Object>>).
Following is my program:
from("file:E://camelinput//?noop=true")
.unmarshal(csv)
.convertBodyTo(List.class)
.process(new Processor() {
#Override
public void process(Exchange msg) throws Exception {
List<List<String>> data = (List<List<String>>) msg.getIn().getBody();
List<Map<String,Object>> newdata=new ArrayList<Map<String,Object>>();
Map<String,Object> map=null;
for (List<String> line : data) {
System.out.println(line.size());
map=new HashMap<String,Object>();
if("1502873".equals(line.get(3))){
line.set(18, "Y");
}
// newdata.add(line);
int count=0;
for(Object field:line){
// System.out.println("line.get(count) "+line.get(count));
map.put(String.valueOf(count),field);
count++;
}
newdata.add(map);
}
msg.getIn().setBody(newdata);
}
})
.marshal().csv().convertBodyTo(List.class)
.to("file:E://camelout").end();
2) I also tried using .split(body()) and processing each row (i.e. without marshaling), but it takes a very long time and gets terminated with an InterruptedException.
Following is the code:
from("file:E://camelinput//?noop=true")
.unmarshal(csv)
.convertBodyTo(List.class)
.split(body())
.process(new Processor() {
#Override
public void process(Exchange msg) throws Exception {
List<String> rec= new ArrayList<String>();
if("1502873".equals(rec.get(3))){
rec.set(18, "Y");
}
String dt=rec.toString().trim().replace("[","").replace("]", "");
msg.getIn().setBody(dt, String.class);
}
})
.to("file:E://camelout").end();
Following is my sample CSV:
25 STANDARD N 1435635 415 1087 15904 7 null 36 Cross Mechanical Pencil, Lead and Eraser, 0.5 mm 2 23162 116599 7/7/2015 15:45 N 828
25 STANDARD N 1435635 415 1087 15905 8 null 36 Jumbo Ballpoint and Selectip Refill, Medium, Black 4 23163 116599 7/7/2015 15:45 N 829
25 STANDARD N 1435635 415 1087 15906 1 3487 null 598 Copier Toner, Cannon 220 23164 116599 7/7/2015 15:45 N 830
25 STANDARD N 1435635 415 1087 15907 2 3495 null 823 Envelopes 27 23165 116599 7/7/2015 15:45 N 831
25 STANDARD N 1435635 415 1087 15908 3 3513 null 789 Legal Pads, 8 1/2 x 11 3/4" White" 30 23166 116599 N 832
25 STANDARD N 1435635 415 1087 15909 4 3577 null 791 Paper Clips 5 23167 116599 7/7/2015 15:45 N 833
31 STANDARD N 1574437 415 1087 15910 5 null 36 Clic Stic Pen, Fine, Black 0.72 23168 116599 7/7/2015 15:45 N 834
31 STANDARD N 1574437 557 1233 15911 6 null 36 Laser Cards, 50 Note Cards/Envelopes, 4-1/4 inch x 5-1/2 inch, White 21.58 23169 116599 7/7/2015 15:45 N 835
31 STANDARD N 1574437 578 1275 15912 1 201 null 32 Keyboard - 101 Key 20.82 23170 116599 7/7/2015 15:45 N 836
25 STANDARD N 1574437 147 2033 15913 1 225 null 30 Monitor - 19" 225.39 23171 116599 7/7/2015 15:45 N 837
1314 STANDARD N 1502873 22 2199 16287 1 628 null 1 Envoy Deluxe Laptop 822.87 23545 116599 7/7/2015 15:45 N 838
1314 STANDARD N 1502873 22 2199 16288 1 151 null 91 Envoy Standard Laptop 1283.44 23546 116599 7/7/2015 15:45 N 839
7653 STANDARD N 1502873 22 2199 16289 2 606 null 1 Battery - Extended Life 28 23547 116599 7/7/2015 15:45 N 840
7652 STANDARD N 1502873 21 459 16290 1 2157 null 1 Envoy Laptop - Rugged 1525.02 23548 116599 7/7/2015 15:45 N 841
1314 STANDARD N 1502873 3 1594 16291 1 251 null 32 RAM - 256MB 51.25 23549 116599 7/7/2015 15:45 N 842
7654 STANDARD N 1502873 22 2199 16292 1 606 null 1 Battery - Extended Life 28 23550 116599 7/7/2015 15:45 N 843
7652 STANDARD N 1502873 21 459 16293 1 247 null 30 Monitor - 17" 225.39 23551 116599 7/7/2015 15:45 N 844
1704 STANDARD N 1502873 41 2200 16294 2 225 null 30 Monitor - 19" 225.39 23552 116599 7/7/2015 15:45 N 845
7658 STANDARD N 1502873 21 460 16295 1 201 null 32 Keyboard - 101 Key 20.82 23553 116599 7/7/2015 15:45 N 846
I have large CSV files which contain hundreds of thousands of rows.
I think your solution 1 might be overly complex if you only want to alter values in the CSV and output it back in the same order. Just edit the fields in the original List and marshal it back to a file.
I've made the assumption here that your data is actually delimited by tabs rather than the variable number of spaces shown in your example, and I've included the CsvDataFormat that I used. The code uses camel-core and camel-csv version 2.15.3.
public void configure() {
    CsvDataFormat csv = new CsvDataFormat();
    csv.setDelimiter('\t');      // Tabs
    csv.setQuoteDisabled(true);  // Otherwise single quotes will be doubled.

    from("file://src/data?fileName=data.csv&noop=true&delay=15m")
        .unmarshal(csv)
        .convertBodyTo(List.class)
        .process(new Processor() {
            @Override
            public void process(Exchange msg) throws Exception {
                List<List<String>> data = (List<List<String>>) msg.getIn().getBody();
                for (List<String> line : data) {
                    // Checks if column two contains text STANDARD
                    // and alters its value to DELUXE.
                    if ("STANDARD".equals(line.get(1))) {
                        System.out.println("Original:" + line);
                        line.set(1, "DELUXE");
                        System.out.println("After: " + line);
                    }
                }
            }
        }).marshal(csv).to("file://src/data?fileName=out.csv")
        .log("done.").end();
}
The problem is that you are processing a single line in a single thread. If parallel processing is acceptable for you, try using a thread pool.
<camel:camelContext id="camelContext">
    .....
    <camel:threadPoolProfile id="customThreadPoolProfile"
                             defaultProfile="true"
                             poolSize="{{split.thread.pool.size}}"
                             maxPoolSize="{{split.thread.max.pool.size}}"
                             maxQueueSize="{{split.thread.max.queue.size}}">
    </camel:threadPoolProfile>
</camel:camelContext>
And update the split:
.split(body().tokenize("\n"))
.streaming()
.parallelProcessing()
.executorServiceRef("customThreadPoolProfile")
.....
.end()
