Printing part of a file

Is there a magic unix command for printing part of a file? I have a file that has several million lines and I would like to skip the first million or so lines and print the next million lines of the file.
Thank you in advance.

To extract data, sed is your friend.
Assuming a one-off task that you can enter at your command line:
sed -n '200000,300000p' file | enscript
"number comma (,) number" is one form of a range cmd in sed. This one starts at line 2,000,000 and *p*rints until you get to 3,000,000.
If you want the output to go to your screen remove the | enscript
enscript is a utility that manages the process of sending data to PostScript-compatible printers. My Linux distro doesn't have it, so it's not necessarily a standard utility. Hopefully you know what command you need to redirect to in order to get output printed on paper.
If you want to "print" to another file, use
sed -n '200000,300000p' file > smallerFile
IHTH
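Scaled down so the result is easy to eyeball (demo file generated with seq; the filename is made up for the demo), the same range form looks like this:

```shell
# Build a 30-line demo file, then print lines 11 through 20 of it.
seq 30 > demo.txt
sed -n '11,20p' demo.txt
# -> prints 11 through 20, ten lines in all
```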

I would suggest awk as it is a little easier and more flexible than sed:
awk 'FNR>12 && FNR<23' file
where FNR is the current record (line) number. So the above prints lines 13 through 22.
And you can make it more specific like this:
awk 'FNR<100 || FNR >990' file
which prints lines whose record number is less than 100 or greater than 990. Or, to print lines after line 100 as well as any lines containing "fred":
awk 'FNR >100 || /fred/' file
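A scaled-down check of the combined form (demo file via seq, pattern chosen arbitrarily for the demo):

```shell
# 20-line demo file: print lines after line 15, or any line ending in 3.
seq 20 > demo.txt
awk 'FNR>15 || /3$/' demo.txt
# -> 3, 13, 16, 17, 18, 19, 20 (one per line)
```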

How to edit a file with shell scripting

I have a file containing thousands of lines like this:
0x7f29139ec6b3: W 0x7fff06bbf0a8
0x7f29139f0010: W 0x7fff06bbf0a0
0x7f29139f0014: W 0x7fff06bbf098
0x7f29139f0016: W 0x7fff06bbf090
0x7f29139f0036: R 0x7f2913c0db80
I want to make a new file which contains only the second hex number on each line (the address after the W or R).
I have to put all these hex numbers in an array in a C program. So I am trying to make a file with only the hex numbers on the right hand side, so that my C program can use the fscanf function to read these numbers from the modified file.
I guess we can use some shell script to make a file containing those hex numbers? grep or something?
You can use sed and edit in place. To strip everything up to and including the W or R marker (whatever the character is), use
sed -i 's/.*: . //' file
cat file
0x7fff06bbf0a8
0x7fff06bbf0a0
0x7fff06bbf098
0x7fff06bbf090
0x7f2913c0db80
You can use grep -oP command:
grep -oP ' \K0x[a-fA-F0-9]*' file
0x7fff06bbf0a8
0x7fff06bbf0a0
0x7fff06bbf098
0x7fff06bbf090
0x7f2913c0db80
You can run a command on the file that will create a new file in the format you want: somecommand <oldfile >newfile. That will leave the original file intact and create a new one for you to feed to your C program.
As to what somecommand should be, you have multiple options. The easiest is probably awk:
awk '{print $NF}'
But you can also do it with sed or grep or perl or cut ... see other answers for an embarrassment of choices.
Since it seems that you always want to select the third field, the simplest approach is to use cut:
cut -d ' ' -f 3 file
or awk:
awk '{print $3}' file
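Both forms are easy to verify side by side on a couple of the sample lines (trace.txt is a made-up filename for the demo):

```shell
# Recreate two of the sample lines, then extract the address two ways.
printf '0x7f29139ec6b3: W 0x7fff06bbf0a8\n0x7f29139f0036: R 0x7f2913c0db80\n' > trace.txt
awk '{print $NF}' trace.txt    # last field, however many fields there are
cut -d ' ' -f 3 trace.txt      # third space-separated field
# Both print: 0x7fff06bbf0a8 then 0x7f2913c0db80
```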

Linux: search and remove in file, new line when it is between two lines of digits

I have a big text file that has this format:
80708730272
598305807640 45097682220
598305807660 87992655320
598305807890
598305808720
598305809030
598305809280
598305809620 564999067
598305809980
33723830870
As you can see there is a row of digits and then in some occasions there is a second row.
In the text file (on solaris) the second row is under the first one.
I don't know why they are here side by side.
I want to put a comma whenever there is a number in the second row.
598305809620
564999067
make it like:
598305809620, 564999067
And if I could put also a semicolon ';' at the end of each line it would be perfect.
Could you please help?
What could I use and basically how could I do that?
My first instinct was sed rather than awk. They are both excellent tools to have.
I couldn't find an easy way to do it all in a single regex ("regular expression"), though. No doubt someone else will.
sed -i.bak -r "s/([0-9]+)(\s+[0-9]+)/\1,\2/g" filename.txt
sed -i -r "s/[0-9]+$/&;/" filename.txt
The first line takes care of the lines with two groups of digits, inserting the comma. The -i.bak edits filename.txt in place but first saves an untouched copy as filename.txt.bak, just to be paranoid (aka 'good practice') and not risk losing your original file if you made a mistake.
The second line appends the semi-colon to all lines that end in a digit - so, skipping blank lines, for example. It also edits filename.txt in place.
Once you have verified that the result is satisfactory, you can delete the .bak backup.
Let me know if you want a detailed explanation of exactly what's going on here.
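For what it's worth, the two substitutions can also be run in a single sed invocation (still two s commands rather than one regex; GNU sed shown, with the input inlined for the demo):

```shell
# Insert the comma on two-number lines, then append ; to any line ending in a digit.
printf '80708730272\n598305809620 564999067\n' |
  sed -r 's/([0-9]+)\s+([0-9]+)/\1, \2/; s/[0-9]$/&;/'
# 80708730272;
# 598305809620, 564999067;
```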
In this situation, awk is your friend. Give this a whirl:
awk '{if (NF==2) printf "%s, %s;\n", $1, $2; else if (NF==1) printf "%s;\n", $1}' big_text.txt > txt_file.txt
This should result in the following output:
80708730272;
598305807640, 45097682220;
598305807660, 87992655320;
598305807890;
598305808720;
598305809030;
598305809280;
598305809620, 564999067;
598305809980;
33723830870;
Hope that works for you!

Speed for getting lines between specific line numbers

I used the following command for getting lines between specific line numbers in a file:
sed -n '100000,200000p' file1.xml > file2.xml
It took quite a while. Is there a faster way?
If your file has a lot more records than the limit you set, 200000, then you spend time reading the records you do not want.
You can quit out of sed with the q command, and avoid reading many lines you don't want.
sed -n '100000,200000p; 200001q' file1.xml > file2.xml
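Scaled down, it's easy to confirm that the early quit changes nothing but the amount of input read (demo file generated with seq; filenames are made up):

```shell
# Same range with and without the quit command - the output is identical.
seq 100000 > big.txt
sed -n '100,200p' big.txt > a.txt
sed -n '100,200p; 201q' big.txt > b.txt
cmp a.txt b.txt && echo identical
# -> identical
```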
You might try the split command.
split -l 100000 file1.xml file2
Then you will get multiple files with the suffixes aa, ab, and so on. You will be interested in the one suffixed ab, which holds lines 100001 through 200000.
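A quick scaled-down check of which chunk lands where (input generated with seq; file2 is the output prefix from the command above):

```shell
# 250000-line input, split into 100000-line chunks named file2aa, file2ab, file2ac.
seq 250000 > file1.xml
split -l 100000 file1.xml file2
head -1 file2ab; tail -1 file2ab
# -> 100001 and 200000: the "ab" chunk holds the second block of 100000 lines
```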

Shell Script to remove duplicate entries from file

I would like to remove duplicate entries from a file. The file looks like this:
xyabcd1:5!b4RlH/IgYzI:cvsabc
xyabcd2:JXfFZCZrL.6HY:cvsabc
xyabcd3:mE7YHNejLCviM:cvsabc
xyabcd1:5!b4RlH/IgYzI:cvsabc
xyabcd4:kQiRgQTU20Y0I:cvsabc
xyabcd2:JXfFZCZrL.6HY:cvsabc
xyabcd1:5!b4RlH/IgYzI:cvsabc
xyabcd2:JXfFZCZrL.6HY:cvsabc
xyabcd4:kQiRgQTU20Y0I:cvsabc
xyabcd2:JXfFZCZrL.6HY:cvsabc
How can I remove the duplicates from this file by using shell script?
From the sort manpage:
-u, --unique
with -c, check for strict ordering; without -c, output only the first of an equal run
sort -u yourFile
should do.
If you do not want to change the order of the input file, you can do:
$ awk '!v[$0]{ print; v[$0]=1 }' input-file
or, if the file is small enough (less than 4 billion lines, to ensure that no line is repeated 4 billion times), you can do:
$ awk '!v[$0]++' input-file
Depending on the implementation of awk, you may not need to worry about the file being less than 2^32 lines long. The concern is that if you see the same line 2^32 times, you may overflow an integer in the array value, and the 2^32nd instance (or 2^31st) of the duplicate line will be output a second time. In reality, this is highly unlikely to be an issue!
@shadyabhi's answer is correct; if the output needs to be redirected to a different file, use:
sort -u inFile -o outFile
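The difference between the two approaches is ordering, which a tiny sample (made up for the demo) makes obvious:

```shell
# Four lines, two distinct values.
printf 'b\na\nb\na\n' > dup.txt
awk '!v[$0]++' dup.txt   # -> b, a  (first-seen order preserved)
sort -u dup.txt          # -> a, b  (sorted order)
```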

How can I make 'grep' show a single line five lines above the grepped line?

I've seen some examples of grepping lines before and after, but I'd like to ignore the middle lines.
So, I'd like the line five lines before, but nothing else.
Can this be done?
OK, I think this will do what you're looking for. It will look for a pattern, and extract the 5th line before each match.
grep -B5 "pattern" filename | awk -F '\n' 'ln ~ /^$/ { ln = "matched"; print $1 } $1 ~ /^--$/ { ln = "" }'
Basically, how this works: awk takes the first line of grep's output, prints it, and then waits until it sees ^--$ (the match separator grep inserts between groups), and starts again.
If you only want to have the 5th line before the match you can do this:
grep -B 5 pattern file | head -1
Edit:
If you can have more than one match, you could try this (exchange pattern with your actual pattern):
sed -n '/pattern/!{H;x;s/^.*\n\(.*\n.*\n.*\n.*\n.*\)$/\1/;x};/pattern/{x;s/^\([^\n]*\).*$/\1/;p}' file
I took this from a Sed tutorial, section: Keeping more than one line in the hold buffer, example 2 and adapted it a bit.
This is grep's -B option:
-B NUM, --before-context=NUM
Print NUM lines of leading context before matching lines.
Places a line containing -- between contiguous groups of
matches.
This way is easier for me:
grep --no-group-separator -B5 "pattern" file | sed -n 1~6p
This greps the 5 lines before and including each match, turns off the -- group separator, then prints every 6th line of the output - i.e. the first line of each 6-line group (GNU grep and GNU sed).
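Scaled down, this is easy to sanity-check; note that -B5 plus the match itself makes each group 6 lines, so the stepping that picks off the first line of each group is 1~6 (GNU sed syntax, demo file via seq):

```shell
seq 20 > demo.txt
# Pretend "12" is the pattern; the line five above it is "7".
grep --no-group-separator -B5 '^12$' demo.txt | sed -n 1~6p
# -> 7
```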
