Changing a file - c

My program was writing two values to a file side by side (as two columns). By some stupid programming mistake I managed to do the following:
fprintf(network_energy_delta_E_BM, "%f\n\t %f\n", delta_network_energy_BM,
network_energy_initial);
"%f\n\t %f\n"
Means that my data ended up looking a bit like this:
1234
56
24
99
and so on and so forth.
This is causing some problems for me and for what I need to do. Is there any way to amend the file so that the values are side by side? I tried
paste network_energy_delta_E_BM.dat foo.dat | awk '{ print $1 }' > new.dat
where network_energy_delta_E_BM.dat is my file and foo.dat is just an empty file, but it takes all the entries and just puts them into one column. Can anyone help me fix this, please?
There are over 2000000 entries so I can't fix this by hand.
Or is there any way of taking this new file new.dat and writing every second entry to a second column?
Thank you

What about using this awk expression?
awk '!(NR%2){print p,$0*1; next}{p=$0}' file
It joins every two lines into one, where NR is the record number (the line number, in this case). Note that $0*1 forces a numeric conversion, which strips the leading tab and space from the second line.
Test
$ cat a
1234
56
24
99
$ awk '!(NR%2){print p,$0*1; next}{p=$0}' a
1234 56
24 99
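If you prefer to stay close to your original paste attempt, this variant should also work (just a sketch: it assumes every record is exactly two lines, and the awk step normalizes the stray tab and space):
$ paste - - < network_energy_delta_E_BM.dat | awk '{print $1, $2}' > new.dat
paste - - reads two lines at a time from standard input and joins them; the awk step reprints the two fields with a single space between them. And for future runs, writing both values on one line in the C code, e.g. fprintf(network_energy_delta_E_BM, "%f\t%f\n", delta_network_energy_BM, network_energy_initial);, avoids the problem entirely.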

Related

Why is awk seemingly failing to completely store a very large array?

I'm using the following code to identify the 99th percentile of a list of numbers:
sort -n file.txt | awk '{all[NR] = $0} END{print all[int(NR*0.99 - 0.01)]}'
where file.txt looks like:
50
150
75
313
etc.
This works fine for small (20,000-line) files, but for bigger ones (140 to 790 million lines) it gives a smaller number than when you just sort the text file into a new file and go down to the appropriate line. As far as I can tell awk isn't reporting that it has run out of memory; it just prints out a number at around the 40th percentile for a 140M-line file. I can implement this step a different way, but I was just wondering if anyone knew why this might be happening. From looking at other posts it doesn't seem like awk has a maximum array size, so I'm not sure what's happening.
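One way to sidestep the array entirely (not an explanation of the awk behaviour, just a workaround sketch) is to compute the target line number first and then extract only that line, so nothing has to be held in memory. Note the integer arithmetic below only roughly matches the original int(NR*0.99 - 0.01) index:
n=$(wc -l < file.txt)
target=$(( n * 99 / 100 ))
sort -n file.txt | sed -n "${target}p;${target}q"
The trailing q makes sed stop as soon as the target line has been printed instead of draining the rest of the sort output.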

Vlookup-like function using awk in ksh

Disclaimers:
1) English is my second language, so please forgive any grammatical horrors you may find. I am pretty confident you will be able to understand what I need despite them.
2) I have found several examples in this site that address questions/problems similar to mine, though I was unfortunately not able to figure out the modifications that would need to be introduced to fit my needs.
3) You will find some text in capital letters here and there. It is of course not me "shouting" at you, but only a way to make portions of text stand out. Please do not consider this an act of impoliteness.
4) For those of you who make it to the bottom of this novella alive, THANKS IN ADVANCE for your patience, even if you do not end up being able to help (or feeling like helping) me. My disclaimer here is that, after surfing the site for a while, I noticed that the most common "complaint" from people willing to help seems to be the lack (and/or poor quality) of information provided by those seeking help. So I preferred to risk being accused of overwording if need be... It would, at least, not be a common offense...
The "Problem":
I have 2 files (a and b for simplification). File a has 7 columns separated by commas. File b has 2 columns separated by commas.
What I need: Whenever the data in the 7th column of file a matches -EXACT MATCHES ONLY- the data in the 1st column of file b, a new line, containing the whole line of file a plus column 2 of file b, is to be appended to a new file "c".
--- MORE INFO IN THE NOTES AT THE BOTTOM ---
file a:
Server Name,File System,Path,File,Date,Type,ID
horror,/tmp,foldera/folder/b/folderc,binaryfile.bin,2014-01-21 22:21:59.000000,typet,aaaaaaaa
host1,/,somefolder,test1.txt,2016-08-18 00:00:20.000000,typez,11111111
host20,/,somefolder/somesubfolder,usr.cfg,2015-12-288 05:00:20.000000,typen,22222222
hoster,/lol,foolie,anotherfile.sad,2014-01-21 22:21:59.000000,typelol,66666666
hostie,/,someotherfolder,somefile.txt,2016-06-17 18:43:12.000000,typea,33333333
hostile,/sad,folder22,higefile.hug,2016-06-17 18:43:12.000000,typeasd,77777777
hostin,/var,folder30,someotherfile.cfg,2014-01-21 22:21:59.000000,typo,44444444
hostn,/usr,foldie,tinyfile.lol,2016-08-18 00:00:20.000000,typewhatever,55555555
server10,/usr,foldern,tempfile.tmp,2016-06-17 18:43:12.000000,tipesad,99999999
file b:
ID,Size
11111111,215915
22222222,1716
33333333,212856
44444444,1729
55555555,215927
66666666,1728
88888888,1729
99999999,213876
bbbbbbbb,26669080
Expected file c:
Server Name,File System,Path,File,Date,Type,ID,Size
host1,/,somefolder,test1.txt,2016-08-18 00:00:20.000000,typez,11111111,215915
host20,/,somefolder/somesubfolder,usr.cfg,2015-12-288 05:00:20.000000,typen,22222222,1716
hoster,/lol,foolie,anotherfile.sad,2014-01-21 22:21:59.000000,typelol,66666666,1728
hostie,/,someotherfolder,somefile.txt,2016-06-17 18:43:12.000000,typea,33333333,212856
hostin,/var,folder30,someotherfile.cfg,2014-01-21 22:21:59.000000,typo,44444444,1729
hostn,/usr,foldie,tinyfile.lol,2016-08-18 00:00:20.000000,typewhatever,55555555,215927
server10,/usr,foldern,tempfile.tmp,2016-06-17 18:43:12.000000,tipesad,99999999,213876
Additional notes:
0) Notice how the line with ID "aaaaaaaa" in file a does not make it into file c, since ID "aaaaaaaa" is not present in file b. Likewise, the line with ID "bbbbbbbb" in file b does not make it into file c, since ID "bbbbbbbb" is not present in file a and is therefore never looked for in the first place.
1) The data is obviously completely made up due to confidentiality issues, though the examples provided fairly resemble what the real files look like.
2) I added headers just to provide a better idea of the nature of the data. The real files don't have them, so there is no need to skip them in the source files nor to create them in the destination file.
3) Both files come sorted by default, meaning that IDs will be properly sorted in file b, while they will most likely be scrambled in file a. File c should preferably follow the order of file a (though I can manipulate it later to fit my needs anyway, so no worries there, as long as the code does what I need and doesn't mess up the data by combining the wrong lines).
4) VERY VERY VERY IMPORTANT:
4.a) I already have a "working" ksh code (attached below) that uses "cat", "grep", "while" and "if" to do the job. It worked like a charm (well, acceptably) with 160K-line sample files (it was able to output approximately 60K lines an hour, which, in projection, would yield an acceptable "20 days" to produce 30 million lines [KEEP ON READING]), but somehow (I have plenty of processor and memory capacity) cat and/or grep seem to be struggling to process a real-life 5-million-line file (both file a and file b can have up to 30 million lines each, so that's the maximum probable amount of lines in the resulting file, even assuming 100% of the lines in file a find their match in file b), and file c is now only being fed a couple hundred lines every 24 hours.
4.b) I was told that awk, being stronger, should succeed where the weaker commands I worked with seem to fail. I was also told that working with arrays might be the solution to my performance problem, since all data is loaded into memory at once and worked from there, instead of having to cat | grep file b as many times as there are lines in file a, as I am currently doing.
4.c) I am working on AIX, so I only have sh and ksh, no bash, therefore I cannot use the array tools provided by the latter; that's why I thought of awk, that and the fact that I think awk is probably "stronger", though I might be (probably?) wrong.
Now, I present to you the magnificent piece of ksh code (obvious sarcasm here, though I like the idea of you picturing for a brief moment in your mind the image of the monkey holding up and showing all other jungle-crawlers their future lion king) I have managed to develop (feel free to laugh as hard as you need while reading this code, I will not be able to hear you anyway, so no feelings harmed :P ):
cat "${file_a}" | while read -r line_file_a; do
server_name_file_a=`echo "${line_file_a}" | awk -F"," '{print $1}'`
filespace_name_file_a=`echo "${line_file_a}" | awk -F"," '{print $2}'`
folder_name_file_a=`echo "${line_file_a}" | awk -F"," '{print $3}'`
file_name_file_a=`echo "${line_file_a}" | awk -F"," '{print $4}'`
file_date_file_a=`echo "${line_file_a}" | awk -F"," '{print $5}'`
file_type_file_a=`echo "${line_file_a}" | awk -F"," '{print $6}'`
file_id_file_a=`echo "${line_file_a}" | awk -F"," '{print $7}'`
cat "${file_b}" | grep ${object_id_file_a} | while read -r line_file_b; do
file_id_file_b=`echo "${line_file_b}" | awk -F"," '{print $1}'`
file_size_file_b=`echo "${line_file_b}" | awk -F"," '{print $2}'`
if [ "${file_id_file_a}" = "${file_id_file_b}" ]; then
echo "${server_name_file_a},${filespace_name_file_a},${folder_name_file_a},${file_name_file_a},${file_date_file_a},${file_type_file_a},${file_id_file_a},${file_size_file_b}" >> ${file_c}.csv
fi
done
done
One last additional note, just in case you wonder:
The "if" section was not only built as a mean to articulate the output line, but it servers a double purpose, while safe-proofing any false positives that may derive from grep, IE 100 matching 1000 (Bear in mind that, as I mentioned earlier, I am working on AIX, so my grep does not have the -m switch the GNU one has, and I need matches to be exact/absolute).
You have reached the end. CONGRATULATIONS! You've been awarded the medal to patience.
$ cat stuff.awk
BEGIN { FS=OFS="," }              # read and write comma-separated fields
NR == FNR { a[$1] = $2; next }    # first file (b): remember Size keyed by ID
$7 in a { print $0, a[$7] }       # second file (a): if the ID is known, append its Size
Note the order for providing the files to the awk command, b first, followed by a:
$ awk -f stuff.awk b.txt a.txt
host1,/,somefolder,test1.txt,2016-08-18 00:00:20.000000,typez,11111111,215915
host20,/,somefolder/somesubfolder,usr.cfg,2015-12-288 05:00:20.000000,typen,22222222,1716
hoster,/lol,foolie,anotherfile.sad,2014-01-21 22:21:59.000000,typelol,66666666,1728
hostie,/,someotherfolder,somefile.txt,2016-06-17 18:43:12.000000,typea,33333333,212856
hostin,/var,folder30,someotherfile.cfg,2014-01-21 22:21:59.000000,typo,44444444,1729
hostn,/usr,foldie,tinyfile.lol,2016-08-18 00:00:20.000000,typewhatever,55555555,215927
server10,/usr,foldern,tempfile.tmp,2016-06-17 18:43:12.000000,tipesad,99999999,213876
EDIT: Updated calculation
You can try to estimate how many times you are calling another program:
At least 7 awk's + 1 cat + 1 grep for each line in file a: that is 9 * 160,000 = 1,440,000 process launches.
For file b: 2 awk's, plus one file open and one file close, for each hit; with 60K lines of output that adds roughly 4 * 60,000 operations.
A small change in the code can reduce this to "only" 160,000 grep calls:
cat "${file_a}" | while IFS=, read -r server_name_file_a \
filespace_name_file_a folder_name_file_a file_name_file_a \
file_date_file_a file_type_file_a file_id_file_a; do
grep "${object_id_file_a}" "${file_b}" | while IFS="," read -r line_file_b; do
if [ "${file_id_file_a}" = "${file_id_file_b}" ]; then
echo "${server_name_file_a},${filespace_name_file_a},${folder_name_file_a},${file_name_file_a},${file_date_file_a},${file_type_file_a},${file_id_file_a},${file_size_file_b}"
fi
done
done >> ${file_c}.csv
Well, try this with your 160K files and see how much faster it is.
Before I explain why this is still the wrong way, I will make another small improvement: I will replace the cat that feeds the while loop with a redirect at the end (after done).
while IFS=, read -r server_name_file_a \
    filespace_name_file_a folder_name_file_a file_name_file_a \
    file_date_file_a file_type_file_a file_id_file_a; do
    grep "${file_id_file_a}" "${file_b}" | while IFS="," read -r file_id_file_b file_size_file_b; do
        if [ "${file_id_file_a}" = "${file_id_file_b}" ]; then
            echo "${server_name_file_a},${filespace_name_file_a},${folder_name_file_a},${file_name_file_a},${file_date_file_a},${file_type_file_a},${file_id_file_a},${file_size_file_b}"
        fi
    done
done < "${file_a}" >> ${file_c}.csv
The main drawback of these solutions is that the complete file_b is read again and again by grep, once for each line in file a.
This version is a nice improvement in performance, but there is still a lot of overhead from grep. Another huge improvement can be found with awk.
The best solution is using awk as explained in What is "NR==FNR" in awk? and shown in the answer by @jas above.
Only one awk process is started and both files are read just once.
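For reference, the same join can also be run without a separate script file; this is just the stuff.awk logic inlined, with the file names and the output name borrowed from the ksh snippet above:
awk -F, -v OFS=, 'NR==FNR { size[$1] = $2; next } $7 in size { print $0, size[$7] }' "${file_b}" "${file_a}" > "${file_c}.csv"
Only file b is held in memory (one array entry per ID), so even the 30-million-line case avoids the per-line grep cost entirely.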

printing part of file

Is there a magic unix command for printing part of a file? I have a file that has several million lines and I would like to skip the first million or so lines and print the next million lines of the file.
Thank you in advance.
To extract data, sed is your friend.
Assuming a one-off task that you can enter at your command line:
sed -n '200000,300000p' file | enscript
"number comma (,) number" is one form of a range cmd in sed. This one starts at line 2,000,000 and *p*rints until you get to 3,000,000.
If you want the output to go to your screen remove the | enscript
enscript is a utility that manages the process of sending data to Postscript compatible printers. My Linux distro doesn't have that, so its not necessarily a std utility. Hopefully you know what command you need to redirect to to get output printed to paper.
If you want to "print" to another file, use
sed -n '200000,300000p' file > smallerFile
IHTH
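Applied to the numbers in the question (skip roughly the first million lines, print the next million), a sketch would be:
sed -n '1000001,2000000p;2000000q' file > smallerFile
The extra q command makes sed quit at the end of the range instead of reading the rest of the file, which is worth having on a multi-million-line input.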
I would suggest awk as it is a little easier and more flexible than sed:
awk 'FNR>12 && FNR<23' file
where FNR is the record (line) number within the current file. So the above prints the lines after line 12 and before line 23.
And you can make it more specific like this:
awk 'FNR<100 || FNR >990' file
which prints lines if the record number is less than 100 or over 990. Or, to print all lines after line 100 plus any line containing "fred":
awk 'FNR >100 || /fred/' file
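For the sizes mentioned in the question, the awk equivalent would be something like this; the exit clause stops reading once the range is done, which again matters on a file of several million lines:
awk 'FNR>2000000{exit} FNR>1000000' file > smallerFile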

Find the most frequent entry in a file using unix

I have a file containing around 2,000,000 entries: just one column with that many entries, all numbers. I would like to quickly find out what the most frequent number in the file is. Is there a way to do this using unix?
I know how to do it using gnuplot, but that is a slightly tedious way and I was wondering if there was a simpler way using just some unix commands?
Like if my file is
1
1
1
2
3
4
Then I want it to read the file and give me the answer 1, because that's the most frequent.
You can do it like this:
$ sort -n file | uniq -c | sort -n | tail -n 1 | awk '{print $2}'
sort test.txt | uniq -c | sort -rn | head -n 1 should also help. It prints the number of occurrences followed by the number that occurs most often, so for your example file the output would be: 3 1
My first answer to that would be to build a histogram. It helps if the range of possible values is small enough.
Once the histogram is built, just look for the highest count in it.
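A minimal sketch of that histogram idea in awk (counts go into an associative array keyed by the value, so the range does not even need to be small; the file name is just a placeholder):
awk '{ count[$1]++ }
     END { for (v in count) if (count[v] > best) { best = count[v]; mode = v }; print mode }' file
For the example file above this prints 1.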

Linux: search and remove in file, new line when it is between two lines of digits

I have a big text file that has this format:
80708730272
598305807640 45097682220
598305807660 87992655320
598305807890
598305808720
598305809030
598305809280
598305809620 564999067
598305809980
33723830870
As you can see there is a column of digits, and on some occasions there is a second number next to it.
In the text file (on Solaris) the second number sits under the first one; I don't know why they show up side by side here.
I want to put a comma whenever there is a number in that second column, so instead of
598305809620
564999067
make it like:
598305809620, 564999067
And if I could also put a semicolon ';' at the end of each line, it would be perfect.
Could you please help?
What could I use and basically how could I do that?
My first instinct was sed rather than awk. They are both excellent tools to have.
I couldn't find an easy way to do it all in a single regex ("regular expression"), though. No doubt someone else will.
sed -i.bak -r "s/([0-9]+)(\s+[0-9]+)/\1,\2/g" filename.txt
sed -i -r "s/[0-9]+$/&;/" filename.txt
The first line takes care of the lines with two groups of digits, editing filename.txt in place; the -i.bak option keeps an untouched copy of the original as filename.txt.bak, just to be paranoid (aka 'good practice') and not risk losing your original data if you made a mistake.
The second line appends the semicolon to all lines that end in a digit, so it skips blank lines, for example. It edits the already comma-separated filename.txt in place again.
Once you have verified that the result is satisfactory, you can simply delete the .bak backup, which still holds your untouched original.
Let me know if you want a detailed explanation of exactly what's going on here.
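If you do want it in a single pass after all, something along these lines should work (a sketch, assuming GNU sed; the output file name is just an example, and it writes to a new file instead of editing in place):
sed -r 's/([0-9]+)\s+([0-9]+)/\1, \2/; s/[0-9]$/&;/' filename.txt > fixed.txt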
In this situation, awk is your friend. Give this a whirl:
awk '{if (NF==2) printf "%s, %s;\n", $1, $2; else if (NF==1) printf "%s;\n", $1}' big_text.txt > txt_file.txt
This should result in the following output:
80708730272;
598305807640, 45097682220;
598305807660, 87992655320;
598305807890;
598305808720;
598305809030;
598305809280;
598305809620, 564999067;
598305809980;
33723830870;
Hope that works for you!
