Read/Write/Find/Replace huge csv file

I have a huge (4.5 GB) CSV file. I need to perform basic cut-and-paste and replace operations on some columns. The data is pretty well organized; the only problem is that I cannot work with it in Excel because of its size (2000 rows, 550000 columns).
Here is part of the data:
ID,Affection,Sex,DRB1_1,DRB1_2,SENum,SEStatus,AntiCCP,RFUW,rs3094315,rs12562034,rs3934834,rs9442372,rs3737728
D0024949,0,F,0101,0401,SS,yes,?,?,A_A,A_A,G_G,G_G
D0024302,0,F,0101,7,SN,yes,?,?,A_A,G_G,A_G,?_?
D0023151,0,F,0101,11,SN,yes,?,?,A_A,G_G,G_G,G_G
I need to remove 4th, 5th, 6th, 7th, 8th and 9th columns;
I need to find every _ character from column 10 onwards and replace it with a space ( ) character;
I need to replace every ? with zero (0);
I need to replace every comma with a tab;
I need to remove the first row (the one that has the column names);
I need to replace every 0 with 1, every 1 with 2 and every ? with 0 in 2nd column;
I need to replace F with 2, M with 1 and ? with 0 in 3rd column;
so that in the resulting file the output reads:
D0024949 1 2 A A A A G G G G
D0024302 1 2 A A G G A G 0 0
D0023151 1 2 A A G G G G G G
(both input and output should have one line per row, with no extra blank rows)
Is there a memory-efficient way of doing this with Java (I would need the code to do it), or a usable tool for working with data this large so that I can easily apply Excel-like functionality?

You need two things:
- Knowledge of Regular Expressions (aka Regex, Regexes)
- PowerGrep
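Because each output row depends only on the matching input row, the steps listed in the question can also be applied one line at a time, so only a single row is held in memory. Below is a minimal sketch of that idea in Python rather than the Java the question asks for. The function names, the lookup tables, and the choice of a tab between columns with a space inside each genotype are assumptions read off the sample output, not something the answer above specifies.

```python
import io

# Recoding tables taken from the rules in the question.
AFFECTION = {"0": "1", "1": "2", "?": "0"}   # 2nd column
SEX = {"F": "2", "M": "1", "?": "0"}         # 3rd column


def convert_line(line):
    """Turn one CSV data row into the tab-separated output row."""
    fields = line.rstrip("\r\n").split(",")
    # Keep columns 1-3, drop columns 4-9, keep everything from column 10 on.
    genotypes = [g.replace("?", "0").replace("_", " ") for g in fields[9:]]
    row = [fields[0],
           AFFECTION.get(fields[1], fields[1]),
           SEX.get(fields[2], fields[2])] + genotypes
    return "\t".join(row)


def convert(src, dst):
    """Stream rows from src to dst, skipping the header and blank lines."""
    next(src, None)  # discard the row of column names
    for line in src:
        if line.strip():
            dst.write(convert_line(line) + "\n")


# Small in-memory demonstration using two rows from the question.
sample = io.StringIO(
    "ID,Affection,Sex,DRB1_1,DRB1_2,SENum,SEStatus,AntiCCP,RFUW,"
    "rs3094315,rs12562034,rs3934834,rs9442372,rs3737728\n"
    "D0024949,0,F,0101,0401,SS,yes,?,?,A_A,A_A,G_G,G_G\n"
    "D0024302,0,F,0101,7,SN,yes,?,?,A_A,G_G,A_G,?_?\n"
)
result = io.StringIO()
convert(sample, result)
print(result.getvalue(), end="")
```

For the real file, `convert` could be called with two file objects from `open(...)` instead of the in-memory buffers. The same read-a-line, write-a-line loop should carry over to Java with a buffered reader and writer.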

Related

Is there a more effective way to combine 24 columns into a single column as an array in R

I have the code below, which takes 24 columns (hours) of data and combines them into a single column array for each row in a dataframe:
# Adds all of the values into column twentyfourhours with "," as the separator.
agg_bluetooth_data$twentyfourhours <- paste(agg_bluetooth_data[,1],
    agg_bluetooth_data[,2], agg_bluetooth_data[,3], agg_bluetooth_data[,4],
    agg_bluetooth_data[,5], agg_bluetooth_data[,6], agg_bluetooth_data[,7],
    agg_bluetooth_data[,8], agg_bluetooth_data[,9], agg_bluetooth_data[,10],
    agg_bluetooth_data[,11], agg_bluetooth_data[,12], agg_bluetooth_data[,13],
    agg_bluetooth_data[,14], agg_bluetooth_data[,15], agg_bluetooth_data[,16],
    agg_bluetooth_data[,17], agg_bluetooth_data[,18], agg_bluetooth_data[,19],
    agg_bluetooth_data[,20], agg_bluetooth_data[,21], agg_bluetooth_data[,22],
    agg_bluetooth_data[,23], agg_bluetooth_data[,24], sep=",")
However, after this I still have to write more lines of code to remove spaces, add brackets around it, and delete the columns. None of this is difficult to do, but I feel like there should be a shorter/cleaner code to use to get the results I am looking for. Does anyone have any suggestions?
There is a built-in function to do rowSums. It looks like you want an analogous rowPaste function. We can do this with apply:
# create example dataset
df <- data.frame(
    v=1:10,
    x=letters[1:10],
    y=letters[6:15],
    z=letters[11:20],
    stringsAsFactors = FALSE
)
# rowPaste columns 2 through 4
apply(df[, 2:4], 1, paste, collapse=",")
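For comparison, the same row-wise paste can be sketched in Python. This is only an illustration of the idea, not part of the R answer: the `row_paste` helper is hypothetical, and the brackets and whitespace stripping are added because the question mentions wanting them.

```python
def row_paste(row, sep=",", brackets=True):
    """Join every value in a row into one string, optionally wrapped in []."""
    joined = sep.join(str(value).strip() for value in row)
    return f"[{joined}]" if brackets else joined


# First three rows of the example data frame above (columns x, y, z).
rows = [("a", "f", "k"), ("b", "g", "l"), ("c", "h", "m")]
pasted = [row_paste(r) for r in rows]
print(pasted)
```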
Another option, using @Dan Y's data (it might be helpful if you posted a subset of your data using dput, though).
library(tidyr)
library(dplyr)
df %>%
unite('new_col', v, x, y, z, sep = ',')
new_col
1 1,a,f,k
2 2,b,g,l
3 3,c,h,m
4 4,d,i,n
5 5,e,j,o
6 6,f,k,p
7 7,g,l,q
8 8,h,m,r
9 9,i,n,s
10 10,j,o,t
You can then perform the necessary edits with mutate. There's also a fair amount of flexibility in the column selections within the unite call. Check out the "Useful Functions" section of the select documentation.

T-SQL: How to break a column with concatenated string into multiple rows?

I'm working with a dataset where most columns are normal, but one has one or more concatenated values jammed into a single string, using a '|' as a delimiter between values. I need to reshape it so that there's one row per existing row, per concatenated value. There are 60 potential values--that I know of-- in the concatenated string, and most rows have between 0 and 10 values smashed into the string. It's also going to be necessary to repeat this process over the next few months, and it's possible the list will change/ add new members.
I'm going to have to do this on an unknown number of future tables (at least 4 more), so if there's an approach I can easily repurpose, it will be MUCH better. Also, I'm using T-SQL, but I could probably bring in R or something if that would help. Any ideas?
If you have a table containing the 60 possible values, you could join to it with T-SQL, something like this:
select table1.id, potentialvalues.value
from table1
inner join potentialvalues
on charindex('|'+potentialvalues.value+'|', '|'+table1.concatField+'|')>0
Note: Added the pipes to beginning and end of the concatfield so that it can match the first and last values in the field. So, if your concatfield is something like '1|2|10' on a record it would be able to match '|1|', '|2|' and '|10|'.
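The reason for wrapping both sides in the delimiter can be illustrated outside SQL. The sketch below is in Python and only mirrors the string test in the join condition; `contains_value` is a hypothetical helper, not something from the answer.

```python
def contains_value(concat_field, value, delimiter="|"):
    """True if value is one of the delimited items in concat_field."""
    wrapped_field = delimiter + concat_field + delimiter
    return (delimiter + value + delimiter) in wrapped_field


field = "1|2|10"
print("0" in field)                  # True: naive test matches inside "10"
print(contains_value(field, "0"))    # False: "|0|" does not occur
print(contains_value(field, "10"))   # True: last item is matched
print(contains_value(field, "1"))    # True: first item is matched
```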
In R, you could use dplyr and tidyr functions to expand your rows by separating each combined string at the pipe symbol. This has the advantage that it can be applied to your table without knowing what the piped combinations are in advance.
library(dplyr)
library(tidyr)
separate_rows(df, string, sep = "[|]") %>%
mutate(string = trimws(string))
The trimws function from base R is used to remove any extra whitespace that may be between your piped string components. Toy test data and results shown below.
Test data
df = data.frame(key = c("A", "B", "C", "D"),
string = c("Simple", "Piped 1 | Piped 2", "Simple 2", "Piped A1 | Piped A2 | Piped A3"), stringsAsFactors = FALSE)
> df
key string
1 A Simple
2 B Piped 1 | Piped 2
3 C Simple 2
4 D Piped A1 | Piped A2 | Piped A3
Result
key string
1 A Simple
2 B Piped 1
3 B Piped 2
4 C Simple 2
5 D Piped A1
6 D Piped A2
7 D Piped A3
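The same split-and-trim logic can be sketched in plain Python, which may help when checking results outside R. The generator name simply echoes the tidyr function and the tuple layout is an assumption made for this illustration.

```python
def separate_rows(records, sep="|"):
    """Yield one (key, value) pair per delimited item, trimming whitespace."""
    for key, string in records:
        for part in string.split(sep):
            yield key, part.strip()


# Toy test data from above.
records = [
    ("A", "Simple"),
    ("B", "Piped 1 | Piped 2"),
    ("C", "Simple 2"),
    ("D", "Piped A1 | Piped A2 | Piped A3"),
]
expanded = list(separate_rows(records))
for key, value in expanded:
    print(key, value)
```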

Setting Up a Dynamic Stopping Point for a Loop

The data is set up with a bunch of information corresponding to an ID, which can show up more than once.
ID Data
1 X
1 Y
2 A
2 B
2 Z
3 X
I want a loop that signifies which instance of the ID I am looking at. Is it the first time, second time, etc? I want it as a string in the form _# so I have to go beyond the simple _n function in Stata, to my knowledge. If someone knows a way to do what I want without the loop let me know, but I would still like the answer.
I have the following loop in Stata
by ID: gen count_one = _n
gen count_two = ""
quietly forval j = 1/3 {
    replace count_two = "_`j'" if count_one == `j'
}
The output now looks like this:
ID Data count_one count_two
1 X 1 _1
1 Y 2 _2
2 A 1 _1
2 B 2 _2
2 Z 3 _3
3 X 1 _1
The question is how can I replace the 16 above with to tell Stata to take the max of the count_one column because I need to run this weekly and that max will change and I want to reduce errors.
It's hard to understand why you want this, but it is one line whether you want numeric or string:
bysort ID : gen nummax = _N
bysort ID : gen strmax = "_" + string(_N)
Note that the sort order within ID is irrelevant to the number of observations for each.
Some parts of your question aren't clear ("...replace the 16 above with to tell Stata...") but:
Why don't you just use _n with tostring?
gsort +ID +Data
bys ID: g count_one=_n
tostring count_one, gen(count_two)
replace count_two="_"+count_two
Then to generate the max (answering the partial question at the end there) -- although note this value will be repeated across instances of each ID value:
bys ID: egen maxcount1=max(count_one)
or more elegantly:
bys ID: g maxcount2=_N
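To make the point about not needing a hard-coded loop bound concrete, here is a rough Python sketch of the same bookkeeping on the IDs from the question. It is an illustration only, not Stata code: `seen` plays the role of _n within each ID and `totals` the role of _N, and both names are made up for this example.

```python
from collections import Counter, defaultdict

ids = [1, 1, 2, 2, 2, 3]     # the ID column from the question

seen = defaultdict(int)      # how many times each ID has appeared so far
totals = Counter(ids)        # total occurrences per ID

rows = []
for id_ in ids:
    seen[id_] += 1           # running index within the ID
    rows.append((id_, f"_{seen[id_]}", f"_{totals[id_]}"))

for row in rows:
    print(*row)
```

No maximum is needed anywhere: the running index is converted to a string directly, so the code does not change when the largest group grows.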

Changing indices and order in arrays

I have a struct mpc with the following structure:
              num  type  col3  col4  ...
mpc.bus =       1     2   ...   ...
                2     2   ...   ...
                3     1   ...   ...
                4     3   ...   ...
                5     1   ...   ...
               10     2   ...   ...
               99     1   ...   ...
               to  from  col3  col4  ...
mpc.branch =    1     2   ...   ...
                1     3   ...   ...
                2     4   ...   ...
               10     5   ...   ...
               10    99   ...   ...
What I need to do is:
1: Re-order the rows of mpc.bus, such that all rows of type 1 come first, followed by type 2 and, last, type 3. There is only one element of type 3, and no other types (4 / 5 etc.).
2: Make the numbering (column 1 of mpc.bus) consecutive, starting at 1.
3: Change the numbers in the to-from columns of mpc.branch to correspond to the new numbering in mpc.bus.
4: After running simulations, reverse the steps above to end up with the same order and numbering as above.
It is easy to update mpc.bus using find.
type_1 = find(mpc.bus(:,2) == 1);
type_2 = find(mpc.bus(:,2) == 2);
type_3 = find(mpc.bus(:,2) == 3);
mpc.bus(:,:) = mpc.bus([type_1; type_2; type_3],:);
mpc.bus(:,1) = 1:nb % Where nb is the number of rows of mpc.bus
The numbers in the to/from columns of mpc.branch correspond to the numbers in column 1 of mpc.bus.
It's OK to update the numbers on the to, from columns of mpc.branch as well.
However, I'm not able to find a non-messy way of retracing my steps. Can I update the numbering using some simple commands?
For the record: I have deliberately not included my code for re-numbering mpc.branch, since I'm sure someone has a smarter, simpler solution (that will make it easier to redo when the simulations are finished).
Edit: It might be easier to create normal arrays (to avoid working with structs):
bus = mpc.bus;
branch = mpc.branch;
Edit #2: The order of things:
Re-order and re-number.
Columns (3:end) of bus and branch are changed. (Not part of this question)
Restore original order and indices.
Thanks!
I'm proposing this solution. It generates an n x 2 matrix (PAIRS), where n is the number of rows in mpc.bus, and a temporary copy of mpc.branch:
function [mpc_1, mpc_2, mpc_3] = minimal_example
    mpc.bus = [ 1 2;...
                2 2;...
                3 1;...
                4 3;...
                5 1;...
                10 2;...
                99 1];
    mpc.branch = [ 1 2;...
                   1 3;...
                   2 4;...
                   10 5;...
                   10 99];
    mpc.bus = sortrows(mpc.bus,2);
    mpc_1 = mpc;
    mpc_tmp = mpc.branch;
    for I=1:size(mpc.bus,1)
        PAIRS(I,1) = I;
        PAIRS(I,2) = mpc.bus(I,1);
        mpc.branch(mpc_tmp(:,1:2)==mpc.bus(I,1)) = I;
        mpc.bus(I,1) = I;
    end
    mpc_2 = mpc;
    % (a) the following mpc_tmp is only needed if you want to truly reverse the operation
    mpc_tmp = mpc.branch;
    %
    % do some stuff
    %
    for I=1:size(mpc.bus,1)
        % (b) you can decide not to use the following line, then comment the line below (a)
        mpc.branch(mpc_tmp(:,1:2)==mpc.bus(I,1)) = PAIRS(I,2);
        mpc.bus(I,1) = PAIRS(I,2);
    end
    % uncomment the following line, if you commented (a) and (b) above:
    % mpc.branch = mpc_tmp;
    mpc.bus = sortrows(mpc.bus,1);
    mpc_3 = mpc;
The minimal example above can be executed as is. The three outputs (mpc_1, mpc_2 & mpc_3) are just in place to demonstrate the workings of the code but are otherwise not necessary.
1.) mpc.bus is ordered using sortrows, simplifying the approach and not using find three times. It targets the second column of mpc.bus and sorts the remaining matrix accordingly.
2.) The original contents of mpc.branch are stored.
3.) A loop is used to replace the entries in the first column of mpc.bus with ascending numbers while at the same time replacing them correspondingly in mpc.branch. Here, the reference to mpc_tmp is necessary to ensure a correct replacement of the elements.
4.) Afterwards, mpc.branch can be reverted analogously to (3.) - here, one might argue, that if the original mpc.branch was stored earlier on, one could just copy the matrix. Also, the original values of mpc.bus are re-assigned.
5.) Now, sortrows is applied to mpc.bus again, this time with the first column as reference to restore the original format.
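The five steps can also be expressed with an explicit old-to-new lookup table and its inverse, which makes the reversal a plain dictionary lookup. The following is a minimal sketch in Python rather than MATLAB, using the same example data. The variable names are assumptions for illustration, and only the first two columns of each array are modelled.

```python
bus = [(1, 2), (2, 2), (3, 1), (4, 3), (5, 1), (10, 2), (99, 1)]  # (num, type)
branch = [(1, 2), (1, 3), (2, 4), (10, 5), (10, 99)]             # (to, from)

# Step 1: stable sort by type; remember the original numbers in that order.
sorted_bus = sorted(bus, key=lambda row: row[1])
old_numbers = [num for num, _ in sorted_bus]

# Step 2: forward map old -> new (consecutive, starting at 1) and its inverse.
old_to_new = {old: new for new, old in enumerate(old_numbers, start=1)}
new_to_old = {new: old for old, new in old_to_new.items()}

# Step 3: renumber both arrays through the forward map.
new_bus = [(old_to_new[num], kind) for num, kind in sorted_bus]
new_branch = [(old_to_new[a], old_to_new[b]) for a, b in branch]

# ... simulations would run here on new_bus / new_branch ...

# Steps 4-5: undo the renumbering, then sort by number to restore row order.
restored_bus = sorted((new_to_old[num], kind) for num, kind in new_bus)
restored_branch = [(new_to_old[a], new_to_old[b]) for a, b in new_branch]

print(new_bus)
print(new_branch)
print(restored_bus == bus, restored_branch == branch)
```

Because every entry is translated through the map in one pass, there is no risk of a freshly written number being replaced again, which is the problem the temporary copy mpc_tmp guards against in the loop version.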

Array formula not working in Excel

I have the following table in Excel (blank spaces are empty):
      A    B    C    D
1     1
2     3
3     4
4    -2
5     4
6     9
7     8
8
9
10
I would like to return the minimum of column A from A1 to A1000000, using the QUARTILE function, while excluding all negative values. The reason I want it from A1 to A1000000 and not A1 to A7 is because I want to update the table (adding new rows starting from A8) and have the formula also automatically update. The reason I want the QUARTILE and not MIN function is because I will be extending it to calculate other statistics like 1st and 3rd quartile.
This function works correctly and returns 1 (pressing ctrl+shift+enter):
QUARTILE(IF(A1:A7 > -1, A1:A7), 0)
However, when I tried the following, it returned 0 when it should still return 1 (pressing ctrl+shift+enter):
QUARTILE(IF(A1:A1000000 > -1, A1:A1000000), 0)
I also tried the following and it returned 0 (pressing ctrl+shift+enter):
QUARTILE(IF(AND(NOT(ISBLANK(A1:A1000000)), A1:A1000000 > -1), A1:A1000000), 0)
Anybody have a solution to my problem?
Create a dynamic named range called, for example, rng, defined by =OFFSET($A$1,0,0,COUNT($A$1:$A$10000),1)
Then modify your array formula to refer to rng, via =QUARTILE(IF(rng >-1,rng), 0)
Actually what you have works. Try doing:
=QUARTILE(IF(A:A > 0,A:A ),0)
The reason it returns 0 is that a blank cell is treated as having the value 0 when this formula is run. For example, erase one of the values in the A1:A7 range and your original formula will return 0. Also, I would run the formula on the entire A column if possible (for readability, etc.)
Or do you need to return a "0" if that number is in the list?
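The blank-cell behaviour described above can be imitated with a toy model. The Python sketch below is a simplified illustration of that explanation, not a description of how Excel evaluates array formulas internally: blank cells are represented as None and coerced to 0, and `quartile_min` is a hypothetical stand-in for QUARTILE(..., 0).

```python
def quartile_min(cells, threshold):
    """Minimum of the cells that pass `value > threshold`, blanks counted as 0."""
    coerced = [0 if cell is None else cell for cell in cells]  # blank -> 0
    kept = [value for value in coerced if value > threshold]
    return min(kept)


column_a = [1, 3, 4, -2, 4, 9, 8, None, None, None]  # A1:A10 from the question

print(quartile_min(column_a[:7], -1))   # 1: A1:A7 contains no blanks
print(quartile_min(column_a, -1))       # 0: blanks pass "> -1" as zeros
print(quartile_min(column_a, 0))        # 1: "> 0" filters the blanks out
```

Note that the "> 0" test also drops genuine zeros, which is what the closing question above is asking about.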
