I want to transform some fields' data in specific rows of a CSV file. I tried the following.
1) Using CSV unmarshaling and marshaling I achieved it, but the output CSV does not come out in the proper order, even though I sent a list of maps (i.e. List<Map<String,Object>>).
Following is my program:
from("file:E://camelinput//?noop=true")
.unmarshal(csv)
.convertBodyTo(List.class)
.process(new Processor() {
#Override
public void process(Exchange msg) throws Exception {
List<List<String>> data = (List<List<String>>) msg.getIn().getBody();
List<Map<String,Object>> newdata=new ArrayList<Map<String,Object>>();
Map<String,Object> map=null;
for (List<String> line : data) {
System.out.println(line.size());
map=new HashMap<String,Object>();
if("1502873".equals(line.get(3))){
line.set(18, "Y");
}
// newdata.add(line);
int count=0;
for(Object field:line){
// System.out.println("line.get(count) "+line.get(count));
map.put(String.valueOf(count),field);
count++;
}
newdata.add(map);
}
msg.getIn().setBody(newdata);
}
})
.marshal().csv().convertBodyTo(List.class)
.to("file:E://camelout").end();
2) I also tried using .split(body()) and processing each row individually (i.e. without marshaling), but it takes a very long time and gets terminated with an InterruptedException.
Following is the code:
from("file:E://camelinput//?noop=true")
.unmarshal(csv)
.convertBodyTo(List.class)
.split(body())
.process(new Processor() {
#Override
public void process(Exchange msg) throws Exception {
List<String> rec= new ArrayList<String>();
if("1502873".equals(rec.get(3))){
rec.set(18, "Y");
}
String dt=rec.toString().trim().replace("[","").replace("]", "");
msg.getIn().setBody(dt, String.class);
}
})
.to("file:E://camelout").end();
Following is my sample CSV:
25 STANDARD N 1435635 415 1087 15904 7 null 36 Cross Mechanical Pencil, Lead and Eraser, 0.5 mm 2 23162 116599 7/7/2015 15:45 N 828
25 STANDARD N 1435635 415 1087 15905 8 null 36 Jumbo Ballpoint and Selectip Refill, Medium, Black 4 23163 116599 7/7/2015 15:45 N 829
25 STANDARD N 1435635 415 1087 15906 1 3487 null 598 Copier Toner, Cannon 220 23164 116599 7/7/2015 15:45 N 830
25 STANDARD N 1435635 415 1087 15907 2 3495 null 823 Envelopes 27 23165 116599 7/7/2015 15:45 N 831
25 STANDARD N 1435635 415 1087 15908 3 3513 null 789 Legal Pads, 8 1/2 x 11 3/4" White" 30 23166 116599 N 832
25 STANDARD N 1435635 415 1087 15909 4 3577 null 791 Paper Clips 5 23167 116599 7/7/2015 15:45 N 833
31 STANDARD N 1574437 415 1087 15910 5 null 36 Clic Stic Pen, Fine, Black 0.72 23168 116599 7/7/2015 15:45 N 834
31 STANDARD N 1574437 557 1233 15911 6 null 36 Laser Cards, 50 Note Cards/Envelopes, 4-1/4 inch x 5-1/2 inch, White 21.58 23169 116599 7/7/2015 15:45 N 835
31 STANDARD N 1574437 578 1275 15912 1 201 null 32 Keyboard - 101 Key 20.82 23170 116599 7/7/2015 15:45 N 836
25 STANDARD N 1574437 147 2033 15913 1 225 null 30 Monitor - 19" 225.39 23171 116599 7/7/2015 15:45 N 837
1314 STANDARD N 1502873 22 2199 16287 1 628 null 1 Envoy Deluxe Laptop 822.87 23545 116599 7/7/2015 15:45 N 838
1314 STANDARD N 1502873 22 2199 16288 1 151 null 91 Envoy Standard Laptop 1283.44 23546 116599 7/7/2015 15:45 N 839
7653 STANDARD N 1502873 22 2199 16289 2 606 null 1 Battery - Extended Life 28 23547 116599 7/7/2015 15:45 N 840
7652 STANDARD N 1502873 21 459 16290 1 2157 null 1 Envoy Laptop - Rugged 1525.02 23548 116599 7/7/2015 15:45 N 841
1314 STANDARD N 1502873 3 1594 16291 1 251 null 32 RAM - 256MB 51.25 23549 116599 7/7/2015 15:45 N 842
7654 STANDARD N 1502873 22 2199 16292 1 606 null 1 Battery - Extended Life 28 23550 116599 7/7/2015 15:45 N 843
7652 STANDARD N 1502873 21 459 16293 1 247 null 30 Monitor - 17" 225.39 23551 116599 7/7/2015 15:45 N 844
1704 STANDARD N 1502873 41 2200 16294 2 225 null 30 Monitor - 19" 225.39 23552 116599 7/7/2015 15:45 N 845
7658 STANDARD N 1502873 21 460 16295 1 201 null 32 Keyboard - 101 Key 20.82 23553 116599 7/7/2015 15:45 N 846
I have large CSV files which contain hundreds of thousands of rows.
I think your solution 1 might be overly complex if you only want to alter values in the CSV and output it back in the same order. Just edit the fields in the original List and marshal it back to a file.
I've made the assumption here that your data is actually delimited by tabs rather than a random number of spaces, as it appears in your example, and I've included the CsvDataFormat that I used. The code uses camel-core and camel-csv version 2.15.3.
public void configure() {
    CsvDataFormat csv = new CsvDataFormat();
    csv.setDelimiter('\t'); // Tabs
    csv.setQuoteDisabled(true); // Otherwise single quotes will be doubled.

    from("file://src/data?fileName=data.csv&noop=true&delay=15m")
        .unmarshal(csv)
        .convertBodyTo(List.class)
        .process(new Processor() {
            @Override
            public void process(Exchange msg) throws Exception {
                List<List<String>> data = (List<List<String>>) msg.getIn().getBody();
                for (List<String> line : data) {
                    // Checks if column two contains text STANDARD
                    // and alters its value to DELUXE.
                    if ("STANDARD".equals(line.get(1))) {
                        System.out.println("Original:" + line);
                        line.set(1, "DELUXE");
                        System.out.println("After: " + line);
                    }
                }
            }
        }).marshal(csv).to("file://src/data?fileName=out.csv")
        .log("done.").end();
}
The problem is that you are processing each line in a single thread. If parallel processing is acceptable for you, try using a thread pool.
<camel:camelContext id="camelContext">
.....
<camel:threadPoolProfile id="customThreadPoolProfile"
defaultProfile="true"
poolSize="{{split.thread.pool.size}}"
maxPoolSize="{{split.thread.max.pool.size}}"
maxQueueSize="{{split.thread.max.queue.size}}">
</camel:threadPoolProfile>
</camel:camelContext>
And upgrade the split:
.split(body().tokenize("\n"))
.streaming()
.parallelProcessing()
.executorServiceRef("customThreadPoolProfile")
.....
.end()
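If you configure routes in Java rather than Spring XML, a roughly equivalent profile can be registered programmatically. A sketch, assuming Camel 2.15.x; the pool sizes are example values standing in for the property placeholders above:

import org.apache.camel.builder.ThreadPoolProfileBuilder;
import org.apache.camel.spi.ThreadPoolProfile;

// Inside your RouteBuilder.configure():
ThreadPoolProfile profile = new ThreadPoolProfileBuilder("customThreadPoolProfile")
        .poolSize(10)       // example value; tune for your workload
        .maxPoolSize(20)    // example value
        .maxQueueSize(500)  // example value
        .build();
getContext().getExecutorServiceManager().registerThreadPoolProfile(profile);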
PROBLEM
Use IterativeImputer from sklearn.impute to fit a BayesianRidge() regression model for imputing missing data in the variable 'Frontage'.
After interative_imputer_fit = interative_imputer.fit(data) runs, the call interative_imputer_fit.transform(X) inside the function imputer_bay_ridge(data) raises a ValueError. I passed in two variables, Frontage and Area, but only Frontage was inside the numpy array.
Python CODE using sklearn
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import BayesianRidge

def imputer_bay_ridge(data):
    data_array = data.to_numpy()
    data_array.reshape(1, -1)
    interative_imputer = IterativeImputer(BayesianRidge())
    interative_imputer_fit = interative_imputer.fit(data_array)
    X = data['LotFrontage']
    data_imputed = interative_imputer_fit.transform(X)
INVOKE FUNCTION
fit_tranformed_imputed = imputer_bay_ridge(train_data[['Frontage', 'Area']])
DATA EXAMPLE
train_data[['Frontage', 'Area']]
Frontage Area
0 65.0 8450
1 80.0 9600
2 68.0 11250
3 60.0 9550
4 84.0 14260
... ... ...
1455 62.0 7917
1456 85.0 13175
1457 66.0 9042
1458 68.0 9717
1459 75.0 9937
1460 rows × 2 columns
ERROR
ValueError Traceback (most recent call last)
Cell In[243], line 1
----> 1 fit_tranformed_imputed = imputer_bay_ridge(train_data[['LotFrontage', 'LotArea']])
Cell In[242], line 12, in imputer_bay_ridge(data)
10 interative_imputer_fit = interative_imputer.fit(data_array)
11 X = data['LotFrontage']
---> 12 data_imputed = interative_imputer_fit.transform(X)
File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/impute/_iterative.py:724, in IterativeImputer.transform(self, X)
707 """Impute all missing values in `X`.
708
709 Note that this is stochastic, and that if `random_state` is not fixed,
(...)
720 The imputed input data.
721 """
722 check_is_fitted(self)
--> 724 X, Xt, mask_missing_values, complete_mask = self._initial_imputation(X)
726 X_indicator = super()._transform_indicator(complete_mask)
728 if self.n_iter_ == 0 or np.all(mask_missing_values):
File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/impute/_iterative.py:514, in IterativeImputer._initial_imputation(self, X, in_fit)
511 else:
512 force_all_finite = True
--> 514 X = self._validate_data(
515 X,
516 dtype=FLOAT_DTYPES,
517 order="F",
518 reset=in_fit,
519 force_all_finite=force_all_finite,
520 )
521 _check_inputs_dtype(X, self.missing_values)
523 X_missing_mask = _get_mask(X, self.missing_values)
File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py:566, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
564 raise ValueError("Validation should be done on X, y or both.")
565 elif not no_val_X and no_val_y:
--> 566 X = check_array(X, **check_params)
567 out = X
568 elif no_val_X and not no_val_y:
File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py:769, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
767 # If input is 1D raise error
768 if array.ndim == 1:
--> 769 raise ValueError(
770 "Expected 2D array, got 1D array instead:\narray={}.\n"
771 "Reshape your data either using array.reshape(-1, 1) if "
772 "your data has a single feature or array.reshape(1, -1) "
773 "if it contains a single sample.".format(array)
774 )
776 # make sure we actually converted to numeric:
777 if dtype_numeric and array.dtype.kind in "OUSV":
ValueError: Expected 2D array, got 1D array instead:
array=[65. 80. 68. ... 66. 68. 75.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
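The traceback shows the root cause: the imputer was fitted on a 2-D array with two columns, but transform() received the 1-D Series data['LotFrontage'], and transform() must get the same number of features that fit() saw. A minimal sketch of a fix, assuming the column names from the question:

from sklearn.experimental import enable_iterative_imputer  # required to enable IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

def imputer_bay_ridge(data):
    # Fit and transform the same 2-D array so the feature count matches.
    iterative_imputer = IterativeImputer(BayesianRidge())
    return iterative_imputer.fit_transform(data.to_numpy())

fit_transformed_imputed = imputer_bay_ridge(train_data[['Frontage', 'Area']])
# Column 0 of the result is the imputed Frontage, column 1 is Area.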
I have a data source with daily sales per product.
I want to create a field that calculates the average daily sales for the 7 last days, for each product and day (e.g. on day 10 for product A, it will give me the average sales for product A on days 3 - 9; on Day 15 for product B, I'll see the average sales of B on days 8 - 14).
Is this possible?
Example data (I have the first 3 columns and need to generate the fourth):
Date Product Sales 7-Day Average
1/11 A 983 201
2/11 A 650 983
3/11 A 328 817
4/11 A 728 654
5/11 A 246 672
6/11 A 613 587
7/11 A 575 591
8/11 A 601 589
9/11 A 462 534
10/11 A 979 508
11/11 A 148 601
12/11 A 238 518
13/11 A 53 517
14/11 A 500 437
15/11 A 684 426
16/11 A 261 438
17/11 A 69 409
18/11 A 159 279
19/11 A 964 281
20/11 A 429 384
21/11 A 731 438
1/11 B 790 471
2/11 B 265 486
3/11 B 94 487
4/11 B 66 490
5/11 B 124 477
6/11 B 555 357
7/11 B 190 375
8/11 B 232 298
9/11 B 747 218
10/11 B 557 287
11/11 B 432 353
12/11 B 526 405
13/11 B 690 463
14/11 B 350 482
15/11 B 512 505
16/11 B 273 545
17/11 B 679 477
18/11 B 164 495
19/11 B 799 456
20/11 B 749 495
21/11 B 391 504
Haven't really tried anything; couldn't figure out how to get started with this.
This may not be the perfect solution, but it does give your expected result, in a crude way.
Cross-join the data source with itself first, as shown in the screenshot.
Use this calculated field to get the last-7-day average:
(CASE WHEN Date (Table 2) BETWEEN DATETIME_SUB(Date (Table 1), INTERVAL 7 DAY) AND DATETIME_SUB(Date (Table 1), INTERVAL 1 DAY) THEN Sales (Table 2) ELSE 0 END)/7
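If you can preprocess the data outside the reporting tool instead, the same "previous 7 days" average is straightforward in pandas. A sketch, assuming a DataFrame df with the Date, Product and Sales columns from the example, and Date parsed as a datetime:

import pandas as pd

# Sort by date within each product, then average the 7 rows strictly
# before the current one (shift(1) excludes the current day).
df = df.sort_values(['Product', 'Date'])
df['7-Day Average'] = (df.groupby('Product')['Sales']
                         .transform(lambda s: s.shift(1).rolling(7).mean()))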
I have created a list of pandas Series, with each Series indexed by numbers between 1 and 100, e.g.:
Index Value
1 62.99
4 64.39
37 75.225
65 88.12
74 89.89
79 93.30
88 94.30
92 95.83
100 100.00
What I want to do, either while it is a Series, or as an array after calling .to_numpy() on it, is to fill it out so that my Series has 100 values (1 to 100), with any new entries taking the previous existing value, i.e.:
Index Value
1 62.99
2 62.99
3 62.99
4 64.39
5 64.39
6 64.39
...
...
36 64.39
37 75.225
38 75.225
and so on.
I can do this programmatically the long-winded way by iterating through each series and checking for a change in value; my question is, is there a version of Series.repeat() which could do this in one hit, or a numpy function which can 'pad out' my array in this manner with my 100 values?
Thanks in advance for reading, and for any suggestions. This isn't homework; it's a genuine question so please don't attack me if my style of asking isn't as you expect.
What you need to do is forward-fill the values in the series.
This code
import pandas as pd

series = pd.Series([33.2, 36, 39, 55], index=[3, 6, 12, 14], name='series')
indices = range(100)
df = pd.DataFrame(indices)
series = df.join(series).ffill()['series']
produces
0 NaN
1 NaN
2 NaN
3 33.2
4 33.2
...
95 55.0
96 55.0
97 55.0
98 55.0
99 55.0
The first values are NaN because there are no earlier values in the series to fill them with.
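A more direct alternative, assuming the index is sorted: reindex the Series onto the full 1-100 range and let forward-fill carry each previous value forward:

import pandas as pd

s = pd.Series([62.99, 64.39, 75.225], index=[1, 4, 37])
# method='ffill' propagates the last valid value into the new positions.
full = s.reindex(range(1, 101), method='ffill')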
So here's the solution I went with: an ffill() with fillna(0), joining to range(1,101). I had to iterate through a larger dataset which needed grouping by ID first, taking the maximum 'Pct' per 'Bucket':
import pandas as pd

j = df[['ID', 'Bucket', 'Pct']].groupby(['ID', 'Bucket']).max()
for i in df['ID'].unique():
    index = pd.DataFrame(range(1, 101))
    index.columns = ['Bucket']
    k = pd.merge(index, j.loc[i], how='left', on='Bucket').ffill().fillna(0)
In:
Bucket Pct
3 0.03
3 0.1
3 0.26
3 0.42
3 0.45
3 0.59
3 0.69
3 0.83
3 0.86
3 0.91
3 0.94
3 0.98
4 1.1
... ...
91 98.89
93 99.08
94 99.17
94 99.26
94 99.43
94 99.48
94 99.63
100 100.0
Out:
Bucket Pct
1 0.00
2 0.00
3 0.98
4 1.83
5 22.83
... ...
91 98.89
92 98.89
93 99.08
94 99.63
95 99.63
96 99.63
97 99.63
98 99.63
99 99.63
100 100.00
Many, many thanks once again to you both!
The function iscntrl is standardized. Unfortunately, in C99 all we have is:
The iscntrl function tests for any control character
Considering the prototype, which is int iscntrl(int c);, I expect something like true for 0..31 and perhaps 127 too. However, with the following:
#include <stdio.h>
#include <ctype.h>

int main()
{
    int i;
    printf("The ASCII value of all control characters are ");
    for (i = 0; i <= 1024; ++i)
    {
        if (iscntrl(i) != 0)
            printf("%d ", i);
    }
    return 0;
}
I get this output:
The ASCII value of all control characters are 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127 264 288 308 310 320 334
336 346 348 372 374 390 398 404 406 412 420 428 436 444 452 458 460 466 468 474
476 484 492 500 506 512 518 530 536 542 638 644 656 662 668 682 688 694 700 706
708 714 716 718 760 774 780 782 788 798 826 834 836 846 854 856 864 866 874 876
882 888 890 892 898 900 908 962 968 970 988 994 1000
So I am wondering how this function is implemented behind the scenes. I tried searching the standard library sources, but the answer is not obvious.
https://github.com/bminor/glibc/search?q=iscntrl&unscoped_q=iscntrl
Any ideas?
You are invoking undefined behavior by passing improper values to iscntrl().
Per 7.4 Character handling <ctype.h>, paragraph 1:
The header <ctype.h> declares several functions useful for classifying and mapping characters. In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.
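In other words, only the values 0..UCHAR_MAX and EOF are valid arguments. (In glibc these classification functions are essentially table lookups, so an out-of-range argument simply indexes past the table, which is why values above 127 appeared to be "control characters".) A corrected version of your loop might look like this:

#include <stdio.h>
#include <ctype.h>
#include <limits.h>

int main(void)
{
    int i;
    printf("Control characters: ");
    /* Only 0..UCHAR_MAX and EOF may legally be passed to iscntrl(). */
    for (i = 0; i <= UCHAR_MAX; ++i)
    {
        if (iscntrl(i))
            printf("%d ", i);
    }
    printf("\n");
    return 0;
}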
I am new to R and I am trying to read in a data set. The data set is here:
http://petitlien.fr/myfiles
(The above link expands to a GMX file storage folder; click on Guest access to retrieve the file.)
The file named mydata.log has 32 entries with no header and it consists of 2 columns which are delimited by spaces.
I am trying the powerful scan command:
test.frame<-scan(file="mydata.log",sep= "", nlines=32,blank.lines.skip=TRUE)
The above just read the first 3 rows:
head(test.frame)
[1] 0.0000 0.0000 144.3210 0.3400 159.4070 0.8925
I have also tried read.table:
test.frame<-read.table(file="mydata.log",sep= "", nrows=32,blank.lines.skip=TRUE)
This one reads the first 6 lines only as shown below:
names(test.frame)
[1] "V1" "V2"
> head(test.frame)
V1 V2
1 0.000 0.0000
2 144.321 0.3400
3 159.407 0.8925
4 198.413 0.9450
5 222.557 0.9975
6 235.464 1.0500
Does someone know how to read this data set properly?
A related question: Can I control the number of significant digits or perhaps decimal places in the data being read in?
Thanks a lot...
This line of your code works perfectly:
test.frame<-read.table(file="mydata.log",sep= "", nrows=32,blank.lines.skip=TRUE)
The reason you only see 6 lines in your output is that you are using head(), which shows only the first few rows. To view all lines, just enter the name of your object:
> test.frame
V1 V2
1 0.000 0.0000
2 144.321 0.3400
3 159.407 0.8925
4 198.413 0.9450
5 222.557 0.9975
6 235.464 1.0500
7 296.918 1.1025
8 346.773 1.1550
9 442.955 1.2075
10 694.879 1.2600
11 892.436 1.3125
12 1492.970 1.3650
13 2916.960 1.4175
14 3596.060 1.4700
15 5278.950 1.5225
16 7480.730 1.5750
17 12259.800 1.6275
18 14032.600 1.6800
19 19565.600 1.7325
20 31427.700 1.7850
21 58221.400 1.8375
22 92283.900 1.9900
23 165601.000 1.9425
24 165703.000 1.9950
25 213925.000 2.8750
26 260381.000 2.1000
27 312701.000 2.1525
28 370853.000 2.2050
29 479303.000 2.2575
30 487265.000 2.3100
31 545225.000 2.3625
32 703186.000 2.4150
Here is an easy way to see how many rows you have (useful when you have many observations):
nrow(test.frame)
[1] 32
As for the number of digits, see the round command. To look at the documentation for a command, enter a ? and then the command, in this case a function: ?round
#note that you do not have to put "digits=2", you can just put "2", but this way is clearer
> rounded_test.frame <- round(test.frame, digits=2)
> rounded_test.frame
V1 V2
1 0.00 0.00
2 144.32 0.34
3 159.41 0.89
4 198.41 0.94
5 222.56 1.00
6 235.46 1.05
7 296.92 1.10
8 346.77 1.16
9 442.95 1.21
10 694.88 1.26
11 892.44 1.31
12 1492.97 1.36
13 2916.96 1.42
14 3596.06 1.47
15 5278.95 1.52
16 7480.73 1.57
17 12259.80 1.63
18 14032.60 1.68
19 19565.60 1.73
20 31427.70 1.78
21 58221.40 1.84
22 92283.90 1.99
23 165601.00 1.94
24 165703.00 2.00
25 213925.00 2.88
26 260381.00 2.10
27 312701.00 2.15
28 370853.00 2.21
29 479303.00 2.26
30 487265.00 2.31
31 545225.00 2.36
32 703186.00 2.42
Note in the above I created a new object instead of replacing the current one. If you want to replace the current one and lose the data forever (until you reload the dataset of course!), then you can use this line instead:
test.frame <- round(test.frame, digits=2)
If you don't really want to compress your numbers, you might just be interested in viewing the rounded numbers. You can do this with the following command:
print(test.frame,digits=2)
Instead of nrow() as suggested, I would recommend str() ("structure"), which gives you more useful information about your data set (class of variables, etc.). It's also a bit less cryptic... :)
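For example (the exact formatting of str() output can vary slightly between R versions):

str(test.frame)
# 'data.frame':   32 obs. of  2 variables:
#  $ V1: num  0 144 159 198 223 ...
#  $ V2: num  0 0.34 0.893 0.945 0.998 ...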