Bash help tallying/parsing substrings - arrays
I have a shell script I wrote a while back that reads a word list (HITLIST) and recursively searches a directory for all occurrences of those words. Each line containing a "hit" is appended to a file (HITOUTPUT).
I have used this script a couple of times over the last year or so, and have noticed that we often get hits from frequent offenders, and that it would be nice to keep a count of each "super-string" that is triggered and to automatically remove repeat offenders.
For instance, if my word list contains "for", I might get a hundred hits or so for "foreign" or "form" or "force". Instead of validating each of these lines, it would be nice to simply wipe them all with one "yes/no" dialog per super-string.
I was thinking the best way to do this would be to start with a word from the hitlist, record each unique occurrence of the super-string for that word (extend until you are book-ended by white space), and go from there.
So on to the questions ...
What would be a good and efficient way to do this? My current idea was to read the file in as a string, perform my counts, remove repeat offenders from the input string, and write the result back out, but this is proving to be a little more painful than I first suspected.
Would any specific data type/structure be preferred for this type of work?
I have also thought about building the super-string count as I create the HitOutput file, but I could not figure out a clean way of doing that either. Any thoughts or suggestions?
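For what it's worth, the rough kind of tally I have in mind for a single hitlist word looks something like this (an untested sketch using a Bash 4 associative array; the word "for" is hard-coded here just for illustration):
declare -A count
# Pull every whole word containing "for" out of the hit output and tally it
while IFS= read -r word; do
    count[$word]=$(( ${count[$word]:-0} + 1 ))
done < <(grep -oiE '[[:alpha:]]*for[[:alpha:]]*' HITOUTPUT)
# Report each super-string with its count
for w in "${!count[@]}"; do
    printf '%7d %s\n' "${count[$w]}" "$w"
done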
My code for reading in and traversing the hitlist and creating the HitOutput file, followed by a sample of the output it produces, is below:
# Loop through the hitlist
while read -re hitlist || [[ -n "$hitlist" ]]
do
    # If the first character is "#" it's a comment, or the line is blank, so skip it
    if [ "$(echo "$hitlist" | head -c 1)" != "#" ]; then
        if [ -n "$hitlist" ]; then
            # Parse the comma-delimited hitlist line
            IFS=',' read -ra categoryWords <<< "$hitlist"
            # Search for occurrences/hits for each category word
            for categoryWord in "${categoryWords[@]}"; do
                # Append results to the hit output file
                eval 'find "$DIR" -type f -print0 | xargs -0 grep -HniI "$categoryWord"' >> HITOUTPUT
            done
        fi
    fi
done < "$HITLIST"
src/fakescript.sh:1:Never going to win the war you mother!
src/open_source_licenses.txt:6147:May you share freely, never taking more than you give.
src/open_source_licenses.txt:8764:May you share freely, never taking more than you give.
src/open_source_licenses.txt:21711:No Third Party Beneficiaries. You agree that, except as otherwise expressly provided in this TOS, there shall be no third party beneficiaries to this Agreement. Waiver and Severability of Terms. The failure of UBM LLC to exercise or enforce any right or provision of the TOS shall
not constitute a waiver of such right or provision. If any provision of the TOS is found by a court of competent jurisdiction to be invalid, the parties nevertheless agree that the court should endeavor to give effect to the parties' intentions as reflected in the provision, and the other provisions of the TOS remain in full force and effect.
src/fakescript.sh:1:Never going to win the war you mother!
An example of my hitlist file:
# Comment out any category word lines that you do not want processed (the comma delimited lines)
# -----------------
# MEH
never,going,to give,you up
# ----------------
# blah
word to,your,mother
Let's divide this problem into two parts. First, we will update the hitlist interactively as required by your customer. Second, we will find all matches to the updated hitlist.
1. Updating the hitlist
This searches the files under directory dir for all whole words that contain a hitlist word plus at least one extra letter (the "super-strings"):
#!/bin/bash
# Turn each hitlist word into a regex that only matches it with at least one
# extra letter attached, collect all such whole words under dir, and tally them.
grep -Erowhf <(sed -E 's/.*/([[:alpha:]]+&[[:alpha:]]*|[[:alpha:]]*&[[:alpha:]]+)/' hitlist) dir |
sort |
uniq -c |
while read n word
do
    # stdin is busy with the pipe, so read the answer from fd 2 (normally the terminal)
    read -u 2 -p "$word occurs $n times. Include (y/n)? " a
    [ "$a" = y ] && echo "$word" >>hitlist
done
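To see what the sed stage actually feeds grep, the substitution can be run on a single hitlist entry by itself; for example, the entry for (used in the example hitlist below) becomes a pattern that matches for only when at least one extra letter is attached, i.e. only a proper super-string:
$ echo for | sed -E 's/.*/([[:alpha:]]+&[[:alpha:]]*|[[:alpha:]]*&[[:alpha:]]+)/'
([[:alpha:]]+for[[:alpha:]]*|[[:alpha:]]*for[[:alpha:]]+)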
This script runs interactively. As an example, suppose that dir contains these two files:
$ cat dir/file1.txt
for all foreign or catapult also cat.
The catapult hit the catermaran.
The form of a foreign formula
$ cat dir/file2.txt
dog and cat and formula, formula, formula
And hitlist contains two words:
$ cat hitlist
for
cat
If we then run our script, it looks like:
$ bash script.sh
catapult occurs 2 times. Include (y/n)? y
catermaran occurs 1 times. Include (y/n)? n
foreign occurs 2 times. Include (y/n)? y
form occurs 1 times. Include (y/n)? n
formula occurs 4 times. Include (y/n)? n
After the script is run, the file hitlist is updated with all the words that you want to include. We are now ready to proceed to the next step:
2. Finding matches to the updated hitlist
To read each word from a "hitlist" and search recursively for matches while ignoring foreign even if the hitlist contains for, try:
grep -wrFf ./hitlist dir
-w tells grep to look only for full-words. Thus foreign will be ignored.
-r tells grep to search recursively.
-F tells grep to treat the hitlist entries as fixed strings, not regular expressions. (optional)
-f ./hitlist tells grep to read the words from the file ./hitlist.
Following on with the example above, we would have:
$ grep -wrFf ./hitlist dir
dir/file2.txt:dog and cat and formula, formula, formula
dir/file1.txt:for all foreign or catapult also cat.
dir/file1.txt:The catapult hit the catermaran.
dir/file1.txt:The form of a foreign formula
If we don't want the file names displayed, use the -h option:
$ grep -hwrFf ./hitlist dir
dog and cat and formula, formula, formula
for all foreign or catapult also cat.
The catapult hit the catermaran.
The form of a foreign formula
Automatic update for counts of 10 or less
If words that occur 10 times or fewer should go onto the hitlist without a prompt, the loop can default the answer to y and only ask about the larger counts:
#!/bin/bash
grep -Erowhf <(sed -E 's/.*/([[:alpha:]]+&[[:alpha:]]*|[[:alpha:]]*&[[:alpha:]]+)/' hitlist) dir |
sort |
uniq -c |
while read n word
do
    a=y
    # Only prompt for words seen more than 10 times; the rest are included automatically
    [ "$n" -gt 10 ] && read -u 2 -p "$word occurs $n times. Include (y/n)? " a
    [ "$a" = y ] && echo "$word" >>hitlist
done
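If the cutoff needs to vary between runs, the threshold could also be taken from the command line. This variant is my own sketch; the argument handling is not part of the original script:
#!/bin/bash
threshold=${1:-10}    # default to 10 when no argument is given
grep -Erowhf <(sed -E 's/.*/([[:alpha:]]+&[[:alpha:]]*|[[:alpha:]]*&[[:alpha:]]+)/' hitlist) dir |
sort |
uniq -c |
while read n word
do
    a=y
    [ "$n" -gt "$threshold" ] && read -u 2 -p "$word occurs $n times. Include (y/n)? " a
    [ "$a" = y ] && echo "$word" >>hitlist
done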
Reformatting the customer's hitlist
I see that your customer's hitlist has extra formatting, including comments, empty lines, and duplicated words. For example:
$ cat hitlist.source
# MEH
never,going,to give,you up
# ----------------
# blah
word to,your,mother
To convert that to the format used here, try:
$ sed -E 's/#.*//; s/[[:space:],]+/\n/g; s/\n\n+/\n/g; /^$/d' hitlist.source | grep . | sort -u >hitlist
$ cat hitlist
give
going
mother
never
to
up
word
you
your
Related
Bash Array Script Exclude Duplicates
So I have written a bash script (named music.sh) for a Raspberry Pi to perform the following functions:
When executed, look into one single directory (the Music folder) and select a random folder to look into. (Note: none of these folders here have subdirectories.)
Once a folder within "Music" has been selected, play all mp3 files IN ORDER until the last mp3 file has been reached.
At this point, the script would go back to the folders in the "Music" directory and select another random folder.
Then it would again play all mp3 files in that folder in order.
Loop indefinitely until input from user.
I have this code which does all of the above EXCEPT for the following items:
I would like to NOT play any other "album" that has been played before.
Once all albums have been played once, then shut down the system.
Here is my code so far that is working (WITH duplicates allowed):
#!/bin/bash
folderarray=($(ls -d /home/alphekka/Music/*/))
for i in "${folderarray[@]}"; do
    folderitems=(${folderarray[RANDOM % ${#folderarray[@]}]})
    for j in "${folderitems[@]}"; do
        echo `ls $j`
        cvlc --play-and-exit "${j[@]}"
    done
done
exit 0
Please note that there isn't a single folder or file that has a space in the name. If there is a space, then I face some issues with this code working. Anyways, I'm getting close, but I'm not quite there with the entire functionality I'm looking for. Any help would be greatly appreciated! Thank you kindly! :)
Use an associative array as a set. Note that this will work for all valid folder and file names.
#!/bin/bash
declare -A folderarray
# Each folder name is a key mapped to an empty string
for d in /home/alphekka/Music/*/; do
    folderarray["$d"]=
done
while [[ "${!folderarray[*]}" ]]; do
    # Get a list of the remaining folder names
    foldernames=( "${!folderarray[@]}" )
    # Pick a folder at random
    folder=${foldernames[RANDOM%${#foldernames[@]}]}
    # Remove the folder from the set
    # Must use single quotes; see below
    unset folderarray['$folder']
    for j in "$folder"/*; do
        cvlc --play-and-exit "$j"
    done
done
Dealing with keys that contain spaces (and possibly other special characters) is tricky. The quotes shown in the call to unset above are not syntactic quotes in the usual sense. They do not prevent $folder from being expanded, but they do appear to be used by unset itself to quote the resulting string.
Here's another solution: randomize the list of directories first, save the result in an array, and then play (my script just prints) the files from each element of the array:
MUSIC=/home/alphekka/Music
OLDIFS=$IFS
IFS=$'\n'
folderarray=($(ls -d $MUSIC/*/ | while read line; do echo $RANDOM $line; done | sort -n | cut -f2- -d' '))
for folder in ${folderarray[*]}; do
    printf "Folder: %s\n" $folder
    fileArray=($(find $folder -type f))
    for j in ${fileArray[@]}; do
        printf "play %s\n" $j
    done
done
For the random shuffling I used this answer.
One-liner solution with mpv, rl (randomlines), xargs, and find:
find /home/alphekka/Music/ -maxdepth 1 -type d -print0 | rl -d \0 | xargs -0 -l1 mpv
Vlookup-like function using awk in ksh
Disclaimers:
1) English is my second language, so please forgive any grammatical horrors you may find. I am pretty confident you will be able to understand what I need despite them.
2) I have found several examples on this site that address questions/problems similar to mine, though I was unfortunately not able to figure out the modifications that would need to be introduced to fit my needs.
3) You will find some text in capital letters here and there. It is of course not me "shouting" at you, but only a way to make portions of text stand out. Please do not consider this an act of impoliteness.
4) For those of you who get to the bottom of this novella alive, THANKS IN ADVANCE for your patience, even if you do not get to be able to/feel like helping me. My disclaimer here would be the fact that, after surfing the site for a while, I noticed that the most common "complaint" from people willing to help seems to be the lack of information (and/or its quality) provided by the ones seeking help. I then preferred to be accused of overwording if need be... It would, at least, not be a common offense...
The "Problem": I have 2 files (a and b for simplification). File a has 7 columns separated by commas. File b has 2 columns separated by commas.
What I need: Whenever the data in the 7th column of file a matches -EXACT MATCHES ONLY- the data in the 1st column of file b, a new line, containing the whole line of file a plus column 2 of file b, is to be appended into a new file "c". --- MORE INFO IN THE NOTES AT THE BOTTOM ---
file a:
Server Name,File System,Path,File,Date,Type,ID
horror,/tmp,foldera/folder/b/folderc,binaryfile.bin,2014-01-21 22:21:59.000000,typet,aaaaaaaa
host1,/,somefolder,test1.txt,2016-08-18 00:00:20.000000,typez,11111111
host20,/,somefolder/somesubfolder,usr.cfg,2015-12-288 05:00:20.000000,typen,22222222
hoster,/lol,foolie,anotherfile.sad,2014-01-21 22:21:59.000000,typelol,66666666
hostie,/,someotherfolder,somefile.txt,2016-06-17 18:43:12.000000,typea,33333333
hostile,/sad,folder22,higefile.hug,2016-06-17 18:43:12.000000,typeasd,77777777
hostin,/var,folder30,someotherfile.cfg,2014-01-21 22:21:59.000000,typo,44444444
hostn,/usr,foldie,tinyfile.lol,2016-08-18 00:00:20.000000,typewhatever,55555555
server10,/usr,foldern,tempfile.tmp,2016-06-17 18:43:12.000000,tipesad,99999999
file b:
ID,Size
11111111,215915
22222222,1716
33333333,212856
44444444,1729
55555555,215927
66666666,1728
88888888,1729
99999999,213876
bbbbbbbb,26669080
Expected file c:
Server Name,File System,Path,File,Date,Type,ID,Size
host1,/,somefolder,test1.txt,2016-08-18 00:00:20.000000,typez,11111111,215915
host20,/,somefolder/somesubfolder,usr.cfg,2015-12-288 05:00:20.000000,typen,22222222,1716
hoster,/lol,foolie,anotherfile.sad,2014-01-21 22:21:59.000000,typelol,66666666,1728
hostie,/,someotherfolder,somefile.txt,2016-06-17 18:43:12.000000,typea,33333333,212856
hostin,/var,folder30,someotherfile.cfg,2014-01-21 22:21:59.000000,typo,44444444,1729
hostn,/usr,foldie,tinyfile.lol,2016-08-18 00:00:20.000000,typewhatever,55555555,215927
server10,/usr,foldern,tempfile.tmp,2016-06-17 18:43:12.000000,tipesad,99999999,213876
Additional notes:
0) Notice how the line with ID "aaaaaaaa" in file a does not make it into file c, since ID "aaaaaaaa" is not present in file b. Likewise, the line with ID "bbbbbbbb" in file b does not make it into file c, since ID "bbbbbbbb" is not present in file a and is therefore never looked for in the first place.
1) Data is completely made up due to confidentiality issues, though the examples provided fairly resemble what the real files look like.
2) I added headers just to give a better idea of the nature of the data. The real files don't have them, so there is no need to skip them in the source files nor to create them in the destination file.
3) Both files come sorted by default, meaning that IDs will be properly sorted in file b, while they will most likely be scrambled in file a. File c should preferably follow the order of file a (though I can manipulate it later to fit my needs anyway, so no worries there, as long as the code does what I need and doesn't mess up the data by combining the wrong lines).
4) VERY VERY VERY IMPORTANT:
4.a) I already have a "working" ksh script (attached below) that uses "cat", "grep", "while" and "if" to do the job. It worked like a charm (well, acceptably) with 160K-line sample files (it was able to output roughly 60K lines an hour, which, in projection, would yield an acceptable "20 days" to produce 30 million lines [KEEP ON READING]), but somehow (I have plenty of processor and memory capacity) cat and/or grep seem to be struggling to process a real-life 5-million-line file (both file a and file b can have up to 30 million lines each, so that's the maximum probable number of lines in the resulting file, even assuming 100% of the lines in file a find their match in file b), and file c is now only being fed a couple hundred lines every 24 hours.
4.b) I was told that awk, being stronger, should succeed where the weaker commands I worked with seem to fail. I was also told that working with arrays might be the solution to my performance problem, since all data is uploaded to memory at once and worked from there, instead of having to cat | grep file b as many times as there are lines in file a, as I am currently doing.
4.c) I am working on AIX, so I only have sh and ksh, no bash, therefore I cannot use the array tools provided by the latter; that's why I thought of awk, that and the fact that I think awk is probably "stronger", though I might (probably?) be wrong.
Now, I present to you the magnificent piece of ksh code (obvious sarcasm here, though I like the idea of you picturing for a brief moment in your mind the image of the monkey holding up and showing all the other jungle-crawlers their future lion king) I have managed to develop (feel free to laugh as hard as you need while reading this code; I will not be able to hear you anyway, so no feelings harmed :P ):
cat "${file_a}" | while read -r line_file_a; do
    server_name_file_a=`echo "${line_file_a}" | awk -F"," '{print $1}'`
    filespace_name_file_a=`echo "${line_file_a}" | awk -F"," '{print $2}'`
    folder_name_file_a=`echo "${line_file_a}" | awk -F"," '{print $3}'`
    file_name_file_a=`echo "${line_file_a}" | awk -F"," '{print $4}'`
    file_date_file_a=`echo "${line_file_a}" | awk -F"," '{print $5}'`
    file_type_file_a=`echo "${line_file_a}" | awk -F"," '{print $6}'`
    file_id_file_a=`echo "${line_file_a}" | awk -F"," '{print $7}'`
    cat "${file_b}" | grep ${file_id_file_a} | while read -r line_file_b; do
        file_id_file_b=`echo "${line_file_b}" | awk -F"," '{print $1}'`
        file_size_file_b=`echo "${line_file_b}" | awk -F"," '{print $2}'`
        if [ "${file_id_file_a}" = "${file_id_file_b}" ]; then
            echo "${server_name_file_a},${filespace_name_file_a},${folder_name_file_a},${file_name_file_a},${file_date_file_a},${file_type_file_a},${file_id_file_a},${file_size_file_b}" >> ${file_c}.csv
        fi
    done
done
One last additional note, just in case you wonder: the "if" section was not only built as a means to articulate the output line, it serves a double purpose: it safe-proofs against any false positives that may derive from grep, i.e. 100 matching 1000 (bear in mind that, as I mentioned earlier, I am working on AIX, so my grep does not have the -m switch the GNU one has, and I need matches to be exact/absolute).
You have reached the end. CONGRATULATIONS! You've been awarded the medal of patience.
$ cat stuff.awk
BEGIN { FS=OFS="," }
NR == FNR { a[$1] = $2; next }
$7 in a { print $0, a[$7] }
NR == FNR is true only while the first file on the command line is being read, so that rule stores each Size from file b in array a, keyed by ID; every line of file a whose 7th field is a stored ID is then printed with the matching size appended. Note the order for providing the files to the awk command, b first, followed by a:
$ awk -f stuff.awk b.txt a.txt
host1,/,somefolder,test1.txt,2016-08-18 00:00:20.000000,typez,11111111,215915
host20,/,somefolder/somesubfolder,usr.cfg,2015-12-288 05:00:20.000000,typen,22222222,1716
hoster,/lol,foolie,anotherfile.sad,2014-01-21 22:21:59.000000,typelol,66666666,1728
hostie,/,someotherfolder,somefile.txt,2016-06-17 18:43:12.000000,typea,33333333,212856
hostin,/var,folder30,someotherfile.cfg,2014-01-21 22:21:59.000000,typo,44444444,1729
hostn,/usr,foldie,tinyfile.lol,2016-08-18 00:00:20.000000,typewhatever,55555555,215927
server10,/usr,foldern,tempfile.tmp,2016-06-17 18:43:12.000000,tipesad,99999999,213876
EDIT: Updated calculation. You can try to predict how often you are calling another program: at least 7 awks + 1 cat + 1 grep for each line in file a, multiplied by 2 awks for each line in file b (9 * 160,000). For file b: 2 awks, one file open and one file close for each hit. With 60K output, that would be 4 * 60,000.
A small change in the code can turn this into "only" 160,000 calls to grep:
cat "${file_a}" | while IFS=, read -r server_name_file_a \
    filespace_name_file_a folder_name_file_a file_name_file_a \
    file_date_file_a file_type_file_a file_id_file_a; do
    grep "${file_id_file_a}" "${file_b}" | while IFS="," read -r file_id_file_b file_size_file_b; do
        if [ "${file_id_file_a}" = "${file_id_file_b}" ]; then
            echo "${server_name_file_a},${filespace_name_file_a},${folder_name_file_a},${file_name_file_a},${file_date_file_a},${file_type_file_a},${file_id_file_a},${file_size_file_b}"
        fi
    done
done >> ${file_c}.csv
Well, try this with your 160K files and see how much faster it is. Before I explain why this is still the wrong way, I will make another small improvement: I will move the cat for the while loop to the end (after done).
while IFS=, read -r server_name_file_a \
    filespace_name_file_a folder_name_file_a file_name_file_a \
    file_date_file_a file_type_file_a file_id_file_a; do
    grep "${file_id_file_a}" "${file_b}" | while IFS="," read -r file_id_file_b file_size_file_b; do
        if [ "${file_id_file_a}" = "${file_id_file_b}" ]; then
            echo "${server_name_file_a},${filespace_name_file_a},${folder_name_file_a},${file_name_file_a},${file_date_file_a},${file_type_file_a},${file_id_file_a},${file_size_file_b}"
        fi
    done
done < "${file_a}" >> ${file_c}.csv
The main drawback of these solutions is that you are reading the complete file_b again and again with your grep for each line in file a. They are a nice improvement in performance, but there is still a lot of overhead from grep. Another huge improvement can be found with awk. The best solution is using awk as explained in What is "NR==FNR" in awk? and found in the answer of @jas. It is only one system call and both files are only read once.
How can I use sed (or awk or maybe a perl one-liner) to get values from specific columns in file A and use it to find lines in file B?
OK, sedAwkPerl-fu-gurus. Here's one similar to these (Extract specific strings...) and (Using awk to...), except that I need to use the number extracted from columns 4-10 in each line of File A (a PO number from a sales order line item) and use it to locate all related lines from File B and print them to a new file.
File A (purchase order details) lines look like this:
xxx01234560000000000000000000 yyy zzzz000000
File B (vendor codes associated with POs) lines look like this:
00xxxxx01234567890123456789001234567890
Columns 4-10 in File A have a 7-digit PO number, which is found in columns 7-13 of File B. What I need to do is parse File A to get a PO number, and then create a new sub-file from File B containing only those lines of File B which have the POs found in File A. The sub-file created is essentially the subset of vendors from File B who have orders found in File A.
I have tried a couple of things, but I'm really spinning my wheels trying to make a one-liner for this. I could work it out in a script by defining variables, etc., but I'm curious whether someone knows a slick one-liner to do a task like this. The two referenced methods put together ought to do it, but I'm not quite getting it.
Here's a one-liner:
egrep -f <(cut -c4-10 A | sed -e 's/^/^.{6}/') B
It looks like the POs in file B actually start at column 8, not 7, but I made my regex start at column 7 as you asked in the question. And in case there's the possibility of duplicates in A, you could increase efficiency by weeding those out before scanning file B:
egrep -f <(cut -c4-10 A | sort -u | sed -e 's/^/^.{6}/') B
sed 's_^...\([0-9]\{7\}\).*_/^.\{6\}\1/p_' FIRSTFILE > FILTERLIST
sed -n -f FILTERLIST SECONDFILE > FILTEREDFILE
The first line generates a sed script from FIRSTFILE, then the second line uses that script to filter the second file. The two can be combined into one line, too...
If the files are not that big you can do something like:
awk 'BEGIN {
         # read all the PO numbers from FIRSTFILE into an array
         while ((getline line < "FIRSTFILE") > 0)
             po[substr(line, 4, 7)] = 1
     }
     substr($0, 7, 7) in po { print $0 }' SECONDFILE > FILTERED
You can also do it like this (but it will find the PO numbers anywhere on a line):
fgrep -f <(cut -b 4-10 FIRSTFILE) SECONDFILE
Another way using only grep:
grep -f <(grep -Po '^.{3}\K.{7}' fileA) fileB
Explanation:
-P for Perl regex
-o to select only the match
\K keeps the first three characters out of the reported match, acting like a positive lookbehind
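For example, assuming fileA contains the sample line shown in the question, the inner grep extracts just the seven-digit PO number:
$ grep -Po '^.{3}\K.{7}' fileA
0123456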
script for getting extensions of a file
I need to get all the file extension types in a folder. For instance, if the directory's ls gives the following:
a.t
b.t.pg
c.bin
d.bin
e.old
f.txt
g.txt
I should get this by running the script:
.t
.t.pg
.bin
.old
.txt
I have a bash shell. Thanks a lot!
See the BashFAQ entry on ParsingLS for a description of why many of these answers are evil. The following approach avoids that pitfall (and, by the way, completely ignores files with no extension):
shopt -s nullglob
for f in *.*; do
    printf '%s\n' ".${f#*.}"
done | sort -u
Among the advantages:
Correctness: ls behaves inconsistently and can produce inappropriate results. See the link at the top.
Efficiency: minimizes the number of subprocesses invoked (only one, sort -u, and even that could be removed if we wanted to use Bash 4's associative arrays to store results).
Things that still could be improved:
Correctness: this will correctly discard newlines in filenames before the first . (which some other answers won't) -- but filenames with newlines after the first . will be treated as separate entries by sort. This could be fixed by using NULs as the delimiter, or by the aforementioned Bash 4 associative-array storage approach.
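For reference, here is a sketch of the Bash 4 associative-array variant mentioned above; it drops the sort -u subprocess entirely (the output order then becomes unspecified):
shopt -s nullglob
declare -A seen
for f in *.*; do
    seen[".${f#*.}"]=1    # the extension is the key, so duplicates collapse automatically
done
for ext in "${!seen[@]}"; do
    printf '%s\n' "$ext"
done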
try this:
ls -1 | sed 's/^[^.]*\(\..*\)$/\1/' | sort -u
ls lists files in your folder, one file per line
sed magic extracts extensions
sort -u sorts extensions and removes duplicates
the sed magic reads as:
s/ / /: substitutes whatever is between the first and second / by whatever is between the second and third /
^: match the beginning of the line
[^.]: match any character that is not a dot
*: match it as many times as possible
\( and \): remember whatever is matched between these two parentheses
\.: match a dot
.: match any character
*: match it as many times as possible
$: match the end of the line
\1: this is what has been matched between the parentheses
People are really over-complicating this - particularly the regex:
ls | grep -o "\..*" | uniq
ls - get all the files
grep -o "\..*" - -o only shows the match; "\..*" matches the first "." and everything after it
uniq - don't print duplicates but keep the same order
you can also sort if you like, but sorting doesn't match the example
This is what happens when you run it:
> ls -1
a.t
a.t.pg
c.bin
d.bin
e.old
f.txt
g.txt
> ls | grep -o "\..*" | uniq
.t
.t.pg
.bin
.old
.txt
How can I make 'grep' show a single line five lines above the grepped line?
I've seen some examples of grepping lines before and after, but I'd like to ignore the middle lines. So, I'd like the line five lines before, but nothing else. Can this be done?
OK, I think this will do what you're looking for. It will look for a pattern, and extract the 5th line before each match.
grep -B5 "pattern" filename | awk -F '\n' 'ln ~ /^$/ { ln = "matched"; print $1 } $1 ~ /^--$/ { ln = "" }'
Basically, how this works is: it takes the first line, prints it, and then waits until it sees ^--$ (the match separator used by grep), and starts again.
If you only want to have the 5th line before the match you can do this:
grep -B 5 pattern file | head -1
Edit: If you can have more than one match, you could try this (exchange pattern with your actual pattern):
sed -n '/pattern/!{H;x;s/^.*\n\(.*\n.*\n.*\n.*\n.*\)$/\1/;x};/pattern/{x;s/^\([^\n]*\).*$/\1/;p}' file
I took this from a sed tutorial, section: Keeping more than one line in the hold buffer, example 2, and adapted it a bit.
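Another option, not taken from the answers above, is a short awk sketch that keeps a rolling buffer of the last five lines and prints the line five back at every match ("pattern" and file are placeholders):
awk '/pattern/ && NR > 5 { print buf[NR % 5] }   # buf[NR % 5] still holds the line from 5 back
     { buf[NR % 5] = $0 }                        # now remember the current line
' file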
This is the -B option:
-B NUM, --before-context=NUM
    Print NUM lines of leading context before matching lines. Places a line containing -- between contiguous groups of matches.
This way is easier for me:
grep --no-group-separator -B5 "pattern" file | sed -n 1~6p
This greps the 5 lines before each match plus the matching line itself (6 lines per group), turns off the --- group separator, and then prints the first line of each group, i.e. the line 5 above each match.