Speed of getting lines between specific line numbers from a file

I used the following command for getting lines between specific line numbers in a file:
sed -n '100000,200000p' file1.xml > file2.xml
It took quite a while. Is there a faster way?

If your file has many more records than the upper limit you asked for, 200000, then you spend time reading records you do not want.
You can quit out of sed with the q command and avoid reading the many lines you don't need beyond that point.
sed -n '100000,200000p; 200001q' file1.xml > file2.xml
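If sed itself is the bottleneck, a head/tail pipeline is another common approach and also stops reading after line 200000; a rough sketch using the same file names:
head -n 200000 file1.xml | tail -n +100000 > file2.xml
head emits only the first 200000 lines, and tail -n +100000 then drops everything before line 100000, so the result matches the 100000,200000 range.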

You might try the split command.
split -l 100000 file1.xml file2
Then you will get multiple files named file2aa, file2ab, and so on. You will be interested in the one suffixed ab, which holds lines 100001 through 200000.

Related

sed addressing for each of multiple input files

I would like to print from line 10 until the end of the file for each of several files in a folder. For a single file I would do this with sed -n '10,$p', but when providing multiple input files to sed, the addresses apply to the concatenation of all the files. How can I use sed and have the addresses refer to each file's own line numbers? This website says that the $ addressing character refers to each file's end if the -s option is used, but this does not work for me on my MacBook Pro.
Ideally I would like the whole procedure to be done with a single tool, without writing a loop. I'm OK with the output being concatenated, and I'm open to tools other than sed. tail might work for this, like so: tail -n +10 filenames, but it is very slow, so I imagine sed is better to use.
This will do it:
awk 'FNR>9{print $0}' file1 file2
FNR is the line number within the current input file, so it resets for each file and lines 10 onward of every file get printed.
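If GNU sed is available (on macOS it can be installed as gsed, for example via Homebrew's gnu-sed package), the -s option mentioned in the question does treat each input file separately, so $ refers to the end of each file; a sketch assuming GNU sed:
gsed -s -n '10,$p' file1 file2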

printing part of file

Is there a magic unix command for printing part of a file? I have a file that has several million lines and I would like to skip the first million or so lines and print the next million lines of the file.
Thank you in advance.
To extract data, sed is your friend.
Assuming a one-off task that you can enter at your command line:
sed -n '200000,300000p' file | enscript
"number comma (,) number" is one form of a range cmd in sed. This one starts at line 2,000,000 and *p*rints until you get to 3,000,000.
If you want the output to go to your screen remove the | enscript
enscript is a utility that manages the process of sending data to Postscript compatible printers. My Linux distro doesn't have that, so its not necessarily a std utility. Hopefully you know what command you need to redirect to to get output printed to paper.
If you want to "print" to another file, use
sed -n '200000,300000p' file > smallerFile
IHTH
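To match the numbers in the question (skip roughly the first million lines and print the next million), and to stop sed from reading the rest of a very large file once it is done, something along these lines should work; an untested sketch:
sed -n '1000001,2000000p;2000000q' file > smallerFile
The q command quits right after the last wanted line is printed, so the remaining lines are never read.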
I would suggest awk as it is a little easier and more flexible than sed:
awk 'FNR>12 && FNR<23' file
where FNR is the record (line) number within the current file. So the above prints the lines above 12 and below 23, i.e. lines 13 through 22.
And you can make it more specific like this:
awk 'FNR<100 || FNR >990' file
which prints lines if the record number is less than 100 or over 990. Or, to print lines after line 100 or lines containing "fred":
awk 'FNR >100 || /fred/' file
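At the scale mentioned in the question it also helps to exit awk once the last wanted line has been printed, so the remaining lines are never read; a sketch with hypothetical line numbers:
awk 'FNR>1000000; FNR==2000000{exit}' file
The print rule comes before the exit rule, so line 2000000 is still printed before awk stops.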

Linux: search and remove in file, new line when it is between two lines of digits

I have a big text file that has this format:
80708730272
598305807640 45097682220
598305807660 87992655320
598305807890
598305808720
598305809030
598305809280
598305809620 564999067
598305809980
33723830870
As you can see there is a row of digits, and on some occasions there is a second row.
In the text file (on Solaris) the second row is under the first one.
I don't know why they are shown here side by side.
I want to put a comma whenever there is a number in the second row.
598305809620
564999067
make it like:
598305809620, 564999067
And if I could also put a semicolon ';' at the end of each line, it would be perfect.
Could you please help?
What could I use and basically how could I do that?
My first instinct was sed rather than awk. They are both excellent tools to have.
I couldn't find an easy way to do it all in a single regex ("regular expression"), though. No doubt someone else will.
sed -i.bak -r "s/([0-9]+)(\s+[0-9]+)/\1,\2/g" filename.txt
sed -i -r "s/[0-9]+$/&;/g" filename.txt
The first line takes care of the lines with two groups of digits, editing filename.txt in place and keeping a copy of the original with an extra '.bak' extension, just to be paranoid (aka 'good practice') and not risk losing your original file if you made a mistake.
The second line appends the semicolon to every line that ends in a digit (so it skips blank lines, for example), again editing the file in place.
Once you have verified that the result is satisfactory, you can delete the '.bak' backup.
Let me know if you want a detailed explanation of exactly what's going on here.
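For what it's worth, the two substitutions can also be combined into a single sed invocation (two expressions rather than one regex); an untested sketch using the same hypothetical file name:
sed -i.bak -r "s/([0-9]+)(\s+[0-9]+)/\1,\2/g; s/[0-9]+$/&;/" filename.txt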
In this situation, awk is your friend. Give this a whirl:
awk '{if (NF==2) printf "%s, %s;\n", $1, $2; else if (NF==1) printf "%s;\n", $1}' big_text.txt > txt_file.txt
This should result in the following output:
80708730272;
598305807640, 45097682220;
598305807660, 87992655320;
598305807890;
598305808720;
598305809030;
598305809280;
598305809620, 564999067;
598305809980;
33723830870;
Hope that works for you!

How to find duplicate lines across 2 different files? Unix

From the unix terminal, we can use diff file1 file2 to find the difference between two files. Is there a similar command to show the similarity across two files? (Many pipes allowed if necessary.)
Each line of each file contains a string sentence; the files are sorted and duplicate lines removed with sort file1 | uniq.
file1: http://pastebin.com/taRcegVn
file2: http://pastebin.com/2fXeMrHQ
And the output should be the lines that appear in both files.
output: http://pastebin.com/FnjXFshs
I am able to do it in Python as below, but I think it's a little too much to put into the terminal:
x = set([i.strip() for i in open('wn-rb.dic')])
y = set([i.strip() for i in open('wn-s.dic')])
z = x.intersection(y)
outfile = open('reverse-diff.out', 'w')
for i in z:
    print>>outfile, i
If you want to get a list of repeated lines without resorting to awk, you can use the -d flag of uniq:
sort file1 file2 | uniq -d
As @tjameson mentioned, it may be solved in another thread.
I would just like to post another solution:
sort file1 file2 | awk 'dup[$0]++ == 1'
Refer to an awk guide to get some awk basics: when the pattern of a line evaluates to true, that line is printed.
dup[$0] is a hash table in which each key is a line of the input; its value starts at 0 and is incremented each time that line occurs. When the line occurs a second time, the value before incrementing is 1, so dup[$0]++ == 1 is true and the line is printed.
Note that this only works when there are no duplicates within either file, as was specified in the question.
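Since the question says both files are already sorted with duplicates removed, comm should also do the job; a sketch:
comm -12 file1 file2
The -1 and -2 flags suppress the lines unique to file1 and file2 respectively, leaving only the lines that appear in both.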

Shell Script to remove duplicate entries from file

I would like to remove duplicate entries from a file. The file looks like this:
xyabcd1:5!b4RlH/IgYzI:cvsabc
xyabcd2:JXfFZCZrL.6HY:cvsabc
xyabcd3:mE7YHNejLCviM:cvsabc
xyabcd1:5!b4RlH/IgYzI:cvsabc
xyabcd4:kQiRgQTU20Y0I:cvsabc
xyabcd2:JXfFZCZrL.6HY:cvsabc
xyabcd1:5!b4RlH/IgYzI:cvsabc
xyabcd2:JXfFZCZrL.6HY:cvsabc
xyabcd4:kQiRgQTU20Y0I:cvsabc
xyabcd2:JXfFZCZrL.6HY:cvsabc
How can I remove the duplicates from this file by using shell script?
From the sort manpage:
-u, --unique
with -c, check for strict ordering; without -c, output only the first of an equal run
sort -u yourFile
should do.
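Applied to the sample data above, that would give something like this (note that the order changes, since the output is sorted):
xyabcd1:5!b4RlH/IgYzI:cvsabc
xyabcd2:JXfFZCZrL.6HY:cvsabc
xyabcd3:mE7YHNejLCviM:cvsabc
xyabcd4:kQiRgQTU20Y0I:cvsabc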
If you do not want to change the order of the input file, you can do:
$ awk '!v[$0]{ print; v[$0]=1 }' input-file
or, if the file is small enough (less than 4 billion lines, to ensure that no line is repeated 4 billion times), you can do:
$ awk '!v[$0]++' input-file
Depending on the implementation of awk, you may not need to worry about the file being less than 2^32 lines long. The concern is that if you see the same line 2^32 times, you may overflow an integer in the array value, and the 2^32nd instance (or 2^31st) of the duplicate line will be output a second time. In reality, this is highly unlikely to be an issue!
@shadyabhi's answer is correct; if the output needs to be redirected to a different file, use:
sort -u inFile -o outFile
