I would like to remove duplicate entries from a file. The file looks like this:
xyabcd1:5!b4RlH/IgYzI:cvsabc
xyabcd2:JXfFZCZrL.6HY:cvsabc
xyabcd3:mE7YHNejLCviM:cvsabc
xyabcd1:5!b4RlH/IgYzI:cvsabc
xyabcd4:kQiRgQTU20Y0I:cvsabc
xyabcd2:JXfFZCZrL.6HY:cvsabc
xyabcd1:5!b4RlH/IgYzI:cvsabc
xyabcd2:JXfFZCZrL.6HY:cvsabc
xyabcd4:kQiRgQTU20Y0I:cvsabc
xyabcd2:JXfFZCZrL.6HY:cvsabc
How can I remove the duplicates from this file by using shell script?
From the sort manpage:
-u, --unique
with -c, check for strict ordering; without -c, output only the first of an equal run
sort -u yourFile
should do.
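For example, assuming the sample data above is saved in a file called yourFile, that gives:
$ sort -u yourFile
xyabcd1:5!b4RlH/IgYzI:cvsabc
xyabcd2:JXfFZCZrL.6HY:cvsabc
xyabcd3:mE7YHNejLCviM:cvsabc
xyabcd4:kQiRgQTU20Y0I:cvsabc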
If you do not want to change the order of the input file, you can do:
$ awk '!v[$0]{ print; v[$0]=1 }' input-file
or, if the file is small enough (less than 4 billion lines, to ensure that no line is repeated 4 billion times), you can do:
$ awk '!v[$0]++' input-file
Depending on the implementation of awk, you may not need to worry about the file being less than 2^32 lines long. The concern is that if you see the same line 2^32 times, you may overflow an integer in the array value, and the 2^32nd instance (or 2^31st) of the duplicate line will be output a second time. In reality, this is highly unlikely to be an issue!
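On the sample input above (again assuming it is in yourFile), the order-preserving version keeps each line the first time it is seen:
$ awk '!v[$0]++' yourFile
xyabcd1:5!b4RlH/IgYzI:cvsabc
xyabcd2:JXfFZCZrL.6HY:cvsabc
xyabcd3:mE7YHNejLCviM:cvsabc
xyabcd4:kQiRgQTU20Y0I:cvsabc
Here that happens to match the sorted output, because the first occurrences already appear in ascending order; on other inputs the two approaches can give different orderings.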
@shadyabhi's answer is correct; if the output needs to be redirected to a different file, use:
sort -u inFile -o outFile
I've got an array of filenames I'd like to be ignored by stow, for example
IGNORES=('post_install\.sh' 'dummy')
(actually, that list isn't fixed but read from a file and will not always have the same length, so hardcoding like below won't work).
I form commandline flags out of the array like so
IGNORES=("${IGNORES[#]/#/--ignore=\'}")
IGNORES=("${IGNORES[#]/%/\'}")
When I do
stow -v "${IGNORES[#]}" -t $home $pkg
however, the ignores are not respected by stow, but it doesn't complain about invalid arguments either. Directly writing
stow -v --ignore='post_install\.sh' --ignore='ignore' -t $home $pkg
does work though.
What is the difference between these two ways of passing the --ignore flags, and any ideas how to fix the issue? To my understanding, "${IGNORES[@]}" should evaluate to one word per array element and have the intended effect (I tried removing the quotes and indexing the array with * too, but to no avail).
Thanks!
So while writing the post, I came across the solution: The single quotes I added here
IGNORES=("${IGNORES[#]/#/--ignore=\'}")
IGNORES=("${IGNORES[#]/%/\'}")
became part of the file names to ignore, and indeed a file named 'ignore' would be skipped; doing only
IGNORES=("${IGNORES[#]/#/--ignore=}")
has the desired effect. I still need to check how this copes with spaces in the array elements, but my guess is that it works just fine, since the need to quote words containing spaces only arises when a complete command line is split into words, as in
stow -v --ignore='the file' -t $home $pkg
vs
stow -v --ignore=the file -t $home $pkg
which is not a problem for the above, and "${IGNORES[@]}" gets the words just right.
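A quick way to convince yourself of this is to print the expanded array one element per line (the entries here are just the ones from the example above):
IGNORES=('post_install\.sh' 'dummy')
IGNORES=("${IGNORES[@]/#/--ignore=}")
printf '<%s>\n' "${IGNORES[@]}"
# <--ignore=post_install\.sh>
# <--ignore=dummy>
With the quoted variant you would instead see <--ignore='post_install\.sh'>, i.e. the single quotes literally become part of the argument that stow receives.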
Is there a magic unix command for printing part of a file? I have a file that has several million lines and I would like to skip the first million or so lines and print the next million lines of the file.
Thank you in advance.
To extract data, sed is your friend.
Assuming a one-off task that you can enter at your command line:
sed -n '200000,300000p' file | enscript
"number comma (,) number" is one form of a range cmd in sed. This one starts at line 2,000,000 and *p*rints until you get to 3,000,000.
If you want the output to go to your screen remove the | enscript
enscript is a utility that manages the process of sending data to Postscript compatible printers. My Linux distro doesn't have that, so its not necessarily a std utility. Hopefully you know what command you need to redirect to to get output printed to paper.
If you want to "print" to another file, use
sed -n '200000,300000p' file > smallerFile
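If you want to sanity-check the range syntax before running it on the big file, you can try it on a small generated one first, e.g.:
$ seq 10 > smallFile
$ sed -n '3,5p' smallFile
3
4
5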
IHTH
I would suggest awk as it is a little easier and more flexible than sed:
awk 'FNR>12 && FNR<23' file
where FNR is the record (line) number. So the above prints the lines whose number is greater than 12 and less than 23, i.e. lines 13 through 22.
And you can make it more specific like this:
awk 'FNR<100 || FNR >990' file
which prints lines if the record number is less than 100 or over 990. Or, to print lines after line 100 as well as any lines containing "fred":
awk 'FNR >100 || /fred/' file
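To see how the combined condition behaves, here is a quick test on a throwaway file (the contents are made up); a line is printed when either side matches:
$ printf 'one\ntwo\nfred was here\nfour\n' > testFile
$ awk 'FNR>3 || /fred/' testFile
fred was here
four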
I have a big text file that has this format:
80708730272
598305807640 45097682220
598305807660 87992655320
598305807890
598305808720
598305809030
598305809280
598305809620 564999067
598305809980
33723830870
As you can see, there is a row of digits and, on some occasions, there is a second one.
In the text file (on Solaris) the second row is under the first one.
I don't know why they appear side by side here.
I want to put a comma whenever there is a number in the second row.
598305809620
564999067
make it like:
598305809620, 564999067
And if I could also put a semicolon ';' at the end of each line, it would be perfect.
Could you please help?
What could I use and basically how could I do that?
My first instinct was sed rather than awk. They are both excellent tools to have.
I couldn't find an easy way to do it all in a single regex ("regular expression"), though. No doubt someone else will.
sed -i.bak -r "s/([0-9]+)(\s+[0-9]+)/\1,\2/g" filename.txt
sed -i -r "s/[0-9]+$/&;/g" filename.txt
The first line takes care of the lines with two groups of digits, editing filename.txt in place; the -i.bak makes sed keep the untouched original as filename.txt.bak, just to be paranoid (aka 'good practice') and not risk losing your original file if you made a mistake.
The second line appends the semicolon to every line that ends in a digit - so skipping blank lines, for example - again editing the file in place.
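For instance, on two of the lines from your sample (and assuming the two numbers really are on one line, which is what the expressions above expect), the result would be:
598305807640 45097682220   becomes   598305807640, 45097682220;
598305807890               becomes   598305807890;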
Once you have verified that the result is satisfactory, you can delete the .bak backup.
Let me know if you want a detailed explanation of exactly what's going on here.
In this situation, awk is your friend. Give this a whirl:
awk '{if (NF==2) printf "%s, %s;\n", $1, $2; else if (NF==1) printf "%s;\n", $1}' big_text.txt > txt_file.txt
This should result in the following output:
80708730272;
598305807640, 45097682220;
598305807660, 87992655320;
598305807890;
598305808720;
598305809030;
598305809280;
598305809620, 564999067;
598305809980;
33723830870;
Hope that works for you!
I used the following command for getting lines between specific line numbers in a file:
sed -n '100000,200000p' file1.xml > file2.xml
It took quite a while. Is there a faster way?
If your file has a lot more records than the limit you set, 200000, then you spend time reading the records you do not want.
You can quit out of sed with the q command, and avoid reading many lines you don't want.
sed -n '100000,200000p; 200001q' file1.xml > file2.xml
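You can see the effect of the q command on a small scale: even though the input below has a million lines, sed stops reading at line 9.
$ seq 1000000 | sed -n '5,8p; 9q'
5
6
7
8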
You might try the split command.
split -l 100000 file1.xml file2
Then you will get multiple files named with the prefix file2 and the suffixes aa, ab, and so on. You will mostly be interested in the one ending in ab: with -l 100000 it holds lines 100001 through 200000 (line 100000 itself ends up as the last line of the aa file).
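To see the naming scheme on a small scale (shorter file and chunk size, purely for illustration):
$ seq 10 > nums
$ split -l 3 nums chunk
$ ls chunk*
chunkaa  chunkab  chunkac  chunkad
$ cat chunkab
4
5
6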
Edit: I think this has been answered successfully, but I can't check 'til later. I've reformatted it as suggested though.
The question: I have a series of files, each with a name of the form XXXXNAME, where XXXX is some number. I want to move them all to separate folders called XXXX and have them called NAME. I can do this manually, but I was hoping that by naming them XXXXNAME there'd be some way I could tell Terminal (I think that's the right name, but not really sure) to move them there. Something like
mv *NAME */NAME
but where it takes whatever * was in the first case and regurgitates it to the path.
This is on some form of Linux, with a bash shell.
In the real life case, the files are 0000GNUmakefile, with sequential numbering. I'm having to make lots of similar-but-slightly-altered versions of a program to compile and run on a cluster as part of my research. It would probably have been quicker to write a program to edit all the files and put in the right place in the first place, but I didn't.
This is probably extremely simple, and I should be able to find an answer myself, if I knew the right words. Thing is, I have no formal training in programming, so I don't know what to call things to search for them. So hopefully this will result in me getting an answer, and maybe knowing how to find out the answer for similar things myself next time. With the basic programming I've picked up, I'm sure I could write a program to do this for me, but I'm hoping there's a simple way to do it just using functionality already in Terminal. I probably shouldn't be allowed to play with these things.
Thanks for any help! I can actually program in C and Python a fair amount, but that's through trial and error largely, and I still don't know what I can do and can't do in Terminal.
SO many ways to achieve this.
I find that the old standbys sed and awk are often the most powerful.
ls | sed -rne 's:^([0-9]{4})(NAME)$:mv -iv & \1/\2:p'
If you're satisfied that the commands look right, pipe the command line through a shell:
ls | sed -rne 's:^([0-9]{4})(NAME)$:mv -iv & \1/\2:p' | sh
I put NAME in parentheses (its own capture group) and used \2 so that if it varies more than your example indicates, you can come up with a regular expression to handle your filenames better.
To do the same thing in gawk (GNU awk, the variant found in most GNU/Linux distros):
ls | gawk '/^[0-9]{4}NAME$/ {printf("mv -iv %s %s/%s\n", $1, substr($0,1,4), substr($0,5))}'
As with the first sample, this produces commands which, if they make sense to you, can be piped through a shell by appending | sh to the end of the line.
Note that with all these mv commands, I've added the -i and -v options. This is for your protection. Read the man page for mv (by typing man mv in your Linux terminal) to see if you should be comfortable leaving them out.
Also, I'm assuming with these lines that all your directories already exist. You didn't mention if they do. If they don't, here's a one-liner to create the directories.
ls | sed -rne 's:^([0-9]{4})(NAME)$:mkdir -p \1:p' | sort -u
As with the others, append | sh to run the commands.
I should mention that it is generally recommended to use constructs like for (in Tim's answer) or find instead of parsing the output of ls. That said, when your filename format is as simple as /[0-9]{4}word/, I find the quick sed one-liner to be the way to go.
Lastly, if by NAME you actually mean "any string of characters" rather than the literal string "NAME", then in all my examples above, replace NAME with .*.
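As a concrete illustration with the filenames from the question (0000GNUmakefile, 0001GNUmakefile, ...), replacing NAME with GNUmakefile, the first pipeline prints commands like:
$ ls | sed -rne 's:^([0-9]{4})(GNUmakefile)$:mv -iv & \1/\2:p'
mv -iv 0000GNUmakefile 0000/GNUmakefile
mv -iv 0001GNUmakefile 0001/GNUmakefile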
The following script will do this for you. Copy the script into a file on the remote machine (we'll call it sortfiles.sh).
#!/bin/bash
# Get all files in the current directory having names XXXXsomename, where X is a digit
files=$(find . -name '[0-9][0-9][0-9][0-9]*')
# Build a list of the XXXX prefixes found in the list of files
# (characters 3-6, because find prefixes each name with "./")
dirs=
for name in ${files}; do
    dirs="${dirs} $(echo ${name} | cut -c 3-6)"
done
# Remove redundant entries from the list of XXXX prefixes
# (put one prefix per line first, because sort -u works line by line)
dirs=$(echo ${dirs} | tr ' ' '\n' | sort -u)
# Create any XXXX directories that are not already present
for name in ${dirs}; do
    if [[ ! -d ${name} ]]; then
        mkdir ${name}
    fi
done
# Move each of the XXXXsomename files to the appropriate directory
for name in ${files}; do
    mv ${name} $(echo ${name} | cut -c 3-6)
done
# Return from the script with normal status
exit 0
From the command line, do chmod +x sortfiles.sh
Execute the script with ./sortfiles.sh
Just open the Terminal application, cd into the directory that contains the files you want moved/renamed, and copy and paste these commands into the command line.
shopt -s extglob    # the *(...) patterns below require extended globbing
for file in [0-9][0-9][0-9][0-9]*; do
    dirName="${file%%*([^0-9])}"
    mkdir -p "$dirName"
    mv "$file" "$dirName/${file##*([0-9])}"
done
This assumes all the files that you want to rename and move are in the same directory. The file globbing also assumes that there are at least four digits at the start of the filename. If there are more than four digits, the file will still be caught, but not if there are fewer than four; in that case, take the appropriate number of [0-9]s off the for line.
It does not handle the case where "NAME" (i.e. the name of the new file you want) starts with a number.
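To see what the two expansions do, you can try them on one of the filenames from the question (with extglob enabled, as in the loop above):
$ shopt -s extglob
$ file=0000GNUmakefile
$ echo "${file%%*([^0-9])}"
0000
$ echo "${file##*([0-9])}"
GNUmakefile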
See this site for more information about string manipulation in bash.