Bash help tallying/parsing substrings - arrays

I have a shell script I wrote a while back, that reads a word list (HITLIST), and recursively searches a directory for all occurrences of those words. Each line containing a "hit" is appended to file (HITOUTPUT).
I have used this script a couple of times over the last year or so, and have noticed that we often get hits from frequent offenders, so it would be nice if we kept a count of each "super-string" that is triggered and automatically removed repeat offenders.
For instance, if my word list contains "for" I might get a hundred hits or so for "foreign" or "form" or "force". Instead of validating each of these lines, it would be nice to simply wipe them all with one "yes/no" dialog per super-string.
I was thinking the best way to do this would be to start with a word from the hitlist, record each unique occurrence of the super-string for that word (go until you are book-ended by white space), and go from there.
So on to the questions ...
What would be a good and efficient way to do this? My current idea was to read in the file as a string, perform my counts, remove repeat offenders from the file input string, and output, but this is proving to be a little more painful than I first suspected.
Would any specific data type/structure be preferred for this type of
work?
I have also thought about building the super-string count as I
create the HitOutput file, but I could not figure out a clean way of
doing this either. Any thoughts or suggestions?
Below are my code for reading and traversing the hitlist and creating the HitOutput file, a sample of the HitOutput file that I would then read back in, and an example of my hitlist file:
# Loop through the hitlist
while read -r hitlist || [[ -n "$hitlist" ]]
do
    # If the first character is "#" it's a comment; skip blank lines too
    if [ "$(echo "$hitlist" | head -c 1)" != "#" ]; then
        if [ -n "$hitlist" ]; then
            # Parse the comma-delimited hitlist line
            IFS=',' read -ra categoryWords <<< "$hitlist"
            # Search for occurrences/hits of each category word
            for categoryWord in "${categoryWords[@]}"; do
                # Append results to the hit output file
                find "$DIR" -type f -print0 | xargs -0 grep -HniI "$categoryWord" >> HITOUTPUT
            done
        fi
    fi
done < "$HITLIST"
src/fakescript.sh:1:Never going to win the war you mother!
src/open_source_licenses.txt:6147:May you share freely, never taking more than you give.
src/open_source_licenses.txt:8764:May you share freely, never taking more than you give.
src/open_source_licenses.txt:21711:No Third Party Beneficiaries. You agree that, except as otherwise expressly provided in this TOS, there shall be no third party beneficiaries to this Agreement. Waiver and Severability of Terms. The failure of UBM LLC to exercise or enforce any right or provision of the TOS shall not constitute a waiver of such right or provision. If any provision of the TOS is found by a court of competent jurisdiction to be invalid, the parties nevertheless agree that the court should endeavor to give effect to the parties' intentions as reflected in the provision, and the other provisions of the TOS remain in full force and effect.
src/fakescript.sh:1:Never going to win the war you mother!
An example of my hitlist file:
# Comment out any category word lines that you do not want processed (the comma delimited lines)
# -----------------
# MEH
never,going,to give,you up
# ----------------
# blah
word to,your,mother

Let's divide this problem into two parts. First, we will update the hitlist interactively as required by your customer. Second, we will find all matches to the updated hitlist.
1. Updating the hitlist
This finds every word in the files under directory dir that contains a word from the hitlist as a proper substring:
#!/bin/bash
# For every hitlist word, build a regex that requires at least one extra letter
# on one side, find all such "super-strings" in dir, count them, and review each one.
grep -Erowhf <(sed -E 's/.*/([[:alpha:]]+&[[:alpha:]]*|[[:alpha:]]*&[[:alpha:]]+)/' hitlist) dir |
    sort |
    uniq -c |
    while read n word
    do
        # stdin is the pipeline here, so read the reply from fd 2 (the terminal)
        read -u 2 -p "$word occurs $n times. Include (y/n)? " a
        [ "$a" = y ] && echo "$word" >>hitlist
    done
This script runs interactively. As an example, suppose that dir contains these two files:
$ cat dir/file1.txt
for all foreign or catapult also cat.
The catapult hit the catermaran.
The form of a foreign formula
$ cat dir/file2.txt
dog and cat and formula, formula, formula
And hitlist contains two words:
$ cat hitlist
for
cat
If we then run our script, it looks like:
$ bash script.sh
catapult occurs 2 times. Include (y/n)? y
catermaran occurs 1 times. Include (y/n)? n
foreign occurs 2 times. Include (y/n)? y
form occurs 1 times. Include (y/n)? n
formula occurs 4 times. Include (y/n)? n
After the script is run, the file hitlist is updated with all the words that you want to include. We are now ready to proceed to the next step:
2. Finding matches to the updated hitlist
To read each word from a "hitlist" and search recursively for matches while ignoring foreign even if the hitlist contains for, try:
grep -wrFf ./hitlist dir
-w tells grep to look only for full words. Thus foreign will be ignored.
-r tells grep to search recursively.
-F tells grep to treat the hitlist entries as fixed strings, not regular expressions. (optional)
-f ./hitlist tells grep to read words from the file ./hitlist.
Following on with the example above, we would have:
$ grep -wrFf ./hitlist dir
dir/file2.txt:dog and cat and formula, formula, formula
dir/file1.txt:for all foreign or catapult also cat.
dir/file1.txt:The catapult hit the catermaran.
dir/file1.txt:The form of a foreign formula
If we don't want the file names displayed, use the -h option:
$ grep -hwrFf ./hitlist dir
dog and cat and formula, formula, formula
for all foreign or catapult also cat.
The catapult hit the catermaran.
The form of a foreign formula
Automatic update for counts of 10 or fewer
#!/bin/bash
grep -Erowhf <(sed -E 's/.*/([[:alpha:]]+&[[:alpha:]]*|[[:alpha:]]*&[[:alpha:]]+)/' hitlist) dir |
    sort |
    uniq -c |
    while read n word
    do
        a=y
        # Only ask about super-strings that occur more than 10 times; accept the rest automatically
        [ "$n" -gt 10 ] && read -u 2 -p "$word occurs $n times. Include (y/n)? " a
        [ "$a" = y ] && echo "$word" >>hitlist
    done
Reformatting the customer's hitlist
I see that your customer's hitlist has extra formatting, including comments, empty lines, and duplicated words. For example:
$ cat hitlist.source
# MEH
never,going,to give,you up
# ----------------
# blah
word to,your,mother
To convert that to the format used here, try:
$ sed -E 's/#.*//; s/[[:space:],]+/\n/g; s/\n\n+/\n/g; /^$/d' hitlist.source | grep . | sort -u >hitlist
$ cat hitlist
give
going
mother
never
to
up
word
you
your
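Putting the pieces together, one possible end-to-end run (script.sh is the interactive updater from step 1, and HITOUTPUT is the output file name from your original script) could look like:
$ sed -E 's/#.*//; s/[[:space:],]+/\n/g; s/\n\n+/\n/g; /^$/d' hitlist.source | grep . | sort -u >hitlist
$ bash script.sh
$ grep -wrFf ./hitlist dir >HITOUTPUT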

Related

Bash Array Script Exclude Duplicates

So I have written a bash script (named music.sh) for a Raspberry Pi to perform the following functions:
When executed, look into one single directory (Music folder) and select a random folder to look into. (Note: none of these folders here have subdirectories)
Once a folder within "Music" has been selected, then play all mp3 files IN ORDER until the last mp3 file has been reached
At this point, the script would go back to the folders in the "Music" directory and select another random folder
Then it would again play all mp3 files in that folder in order
Loop indefinitely until input from user
I have this code which does all of the above EXCEPT for the following items:
I would like to NOT play any other "album" that has been played before
Once all albums played once, then shutdown the system
Here is my code so far that is working (WITH duplicates allowed):
#!/bin/bash
folderarray=($(ls -d /home/alphekka/Music/*/))
for i in "${folderarray[#]}";
do
folderitems=(${folderarray[RANDOM % ${#folderarray[#]}]})
for j in "${folderitems[#]}";
do
echo `ls $j`
cvlc --play-and-exit "${j[#]}"
done
done
exit 0
Please note that there isn't a single folder or file that has a space in the name. If there is a space, then I face some issues with this code working.
Anyways, I'm getting close, but I'm not quite there with the entire functionality I'm looking for. Any help would be greatly appreciated! Thank you kindly! :)
Use an associative array as a set. Note that this will work for all valid folder and file names.
#!/bin/bash
declare -A folderarray
# Each folder name is a key mapped to an empty string
for d in /home/alphekka/Music/*/; do
    folderarray["$d"]=
done
while [[ "${!folderarray[*]}" ]]; do
    # Get a list of the remaining folder names
    foldernames=( "${!folderarray[@]}" )
    # Pick a folder at random
    folder=${foldernames[RANDOM%${#foldernames[@]}]}
    # Remove the folder from the set
    # Must use single quotes; see below
    unset folderarray['$folder']
    for j in "$folder"/*; do
        cvlc --play-and-exit "$j"
    done
done
Dealing with keys that contain spaces (and possibly other special characters) is tricky. The single quotes shown in the call to unset above are not ordinary syntactic quotes around the key: they stop the shell from expanding $folder when the command line is parsed, so unset receives the literal string folderarray[$folder] and evaluates the subscript itself, removing the right key even when it contains spaces or glob characters.
Here's another solution: randomize the list of directories first, save the result in an array, and then play (my script just prints) the files from each element of the array.
MUSIC=/home/alphekka/Music
OLDIFS=$IFS
IFS=$'\n'
folderarray=($(ls -d $MUSIC/*/ | while read line; do echo $RANDOM $line; done | sort -n | cut -f2- -d' '))
for folder in ${folderarray[*]}; do
    printf "Folder: %s\n" $folder
    fileArray=($(find $folder -type f))
    for j in ${fileArray[@]}; do
        printf "play %s\n" $j
    done
done
IFS=$OLDIFS   # restore the original IFS
For the random shuffling I used this answer.
One liner solution with mpv, rl (randomlines), xargs, find:
find /home/alphekka/Music/ -maxdepth 1 -type d -print0 | rl -d \0 | xargs -0 -l1 mpv
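If rl (from the randomlines package) is not installed, GNU shuf can shuffle the NUL-delimited list the same way; a sketch, with -mindepth 1 added so the Music directory itself is not queued, and mpv assumed as above:
find /home/alphekka/Music/ -mindepth 1 -maxdepth 1 -type d -print0 | shuf -z | xargs -0 -n1 mpv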

Vlookup-like function using awk in ksh

Disclaimers:
1) English is my second language, so please forgive any grammatical horrors you may find. I am pretty confident you will be able to understand what I need despite these.
2) I have found several examples on this site that address questions/problems similar to mine, though I was unfortunately not able to figure out the modifications that would need to be introduced to fit my needs.
3) You will find some text in capital letters here and there. It is of course not me "shouting" at you, but only a way to make portions of text stand out. Please do not consider this an act of impoliteness.
4) For those of you who get to the bottom of this novella alive, THANKS IN ADVANCE for your patience, even if you do not get to be able to/feel like help/ing me. My disclaimer here would be the fact that, after surfing the site for a while, I noticed that the most common "complaint" from people willing to help seems to be the lack of information (and/or its quality) provided by those seeking help. I therefore preferred to be accused of overwording if need be... It would be, at least, not a common offense...
The "Problem":
I have 2 files (a and b for simplification). File a has 7 columns separated by commas. File b has 2 columns separated by commas.
What I need: Whenever the data in the 7th column of file a matches -EXACT MATCHES ONLY- the data on the 1st column of file b, a new line, containing the whole line of file a plus column 2 of file b is to be appended into a new file "c".
--- MORE INFO IN THE NOTES AT THE BOTTOM ---
file a:
Server Name,File System,Path,File,Date,Type,ID
horror,/tmp,foldera/folder/b/folderc,binaryfile.bin,2014-01-21 22:21:59.000000,typet,aaaaaaaa
host1,/,somefolder,test1.txt,2016-08-18 00:00:20.000000,typez,11111111
host20,/,somefolder/somesubfolder,usr.cfg,2015-12-288 05:00:20.000000,typen,22222222
hoster,/lol,foolie,anotherfile.sad,2014-01-21 22:21:59.000000,typelol,66666666
hostie,/,someotherfolder,somefile.txt,2016-06-17 18:43:12.000000,typea,33333333
hostile,/sad,folder22,higefile.hug,2016-06-17 18:43:12.000000,typeasd,77777777
hostin,/var,folder30,someotherfile.cfg,2014-01-21 22:21:59.000000,typo,44444444
hostn,/usr,foldie,tinyfile.lol,2016-08-18 00:00:20.000000,typewhatever,55555555
server10,/usr,foldern,tempfile.tmp,2016-06-17 18:43:12.000000,tipesad,99999999
file b:
ID,Size
11111111,215915
22222222,1716
33333333,212856
44444444,1729
55555555,215927
66666666,1728
88888888,1729
99999999,213876
bbbbbbbb,26669080
Expected file c:
Server Name,File System,Path,File,Date,Type,ID,Size
host1,/,somefolder,test1.txt,2016-08-18 00:00:20.000000,typez,11111111,215915
host20,/,somefolder/somesubfolder,usr.cfg,2015-12-288 05:00:20.000000,typen,22222222,1716
hoster,/lol,foolie,anotherfile.sad,2014-01-21 22:21:59.000000,typelol,66666666,1728
hostie,/,someotherfolder,somefile.txt,2016-06-17 18:43:12.000000,typea,33333333,212856
hostin,/var,folder30,someotherfile.cfg,2014-01-21 22:21:59.000000,typo,44444444,1729
hostn,/usr,foldie,tinyfile.lol,2016-08-18 00:00:20.000000,typewhatever,55555555,215927
server10,/usr,foldern,tempfile.tmp,2016-06-17 18:43:12.000000,tipesad,99999999,213876
Additional notes:
0) Notice how the line with ID "aaaaaaaa" in file a does not make it into file c, since ID "aaaaaaaa" is not present in file b. Likewise, the line with ID "bbbbbbbb" in file b does not make it into file c, since ID "bbbbbbbb" is not present in file a and is therefore never looked for in the first place.
1) The data is completely made up due to confidentiality issues, though the examples provided fairly resemble what the real files look like.
2) I added headers just to give a better idea of the nature of the data. The real files don't have them, so there is no need to skip them in the source files nor create them in the destination file.
3) Both files come sorted by default, meaning that IDs will be properly sorted in file b, while they will most likely be scrambled in file a. File c should preferably follow the order of file a (though I can manipulate it later to fit my needs anyway, so no worries there, as long as the code does what I need and doesn't mix up the data by combining the wrong lines).
4) VERY VERY VERY IMPORTANT:
4.a) I already have a "working" ksh code (attached below) that uses "cat", "grep", "while" and "if" to do the job. It worked like a charm (well, acceptably) with 160K-line sample files (it was able to output roughly 60K lines an hour, which, in projection, would yield an acceptable "20 days" to produce 30 million lines [KEEP ON READING]), but somehow (I have plenty of processor and memory capacity) cat and/or grep seem to be struggling to process a real-life 5-million-line file (both file a and file b can have up to 30 million lines each, so that's the maximum probable amount of lines in the resulting file, even assuming 100% of lines in file a find their match in file b), and the c file is now only being fed a couple hundred lines every 24 hours.
4.b) I was told that awk, being stronger, should succeed where the weaker commands I worked with seem to fail. I was also told that working with arrays might be the solution to my performance problem, since all data is loaded into memory at once and worked from there, instead of having to cat | grep file b as many times as there are lines in file a, as I am currently doing.
4.c) I am working on AIX, so I only have sh and ksh, no bash, therefore I cannot use the array tools provided by the latter; that's why I thought of awk, that and the fact that I think awk is probably "stronger", though I might be (probably?) wrong.
Now, I present to you the magnificent piece of ksh code (obvious sarcasm here, though I like the idea of you picturing for a brief moment in your mind the image of the monkey holding up and showing all other jungle-crawlers their future lion king) I have managed to develop (feel free to laugh as hard as you need while reading this code, I will not be able to hear you anyway, so no feelings harmed :P ):
cat "${file_a}" | while read -r line_file_a; do
server_name_file_a=`echo "${line_file_a}" | awk -F"," '{print $1}'`
filespace_name_file_a=`echo "${line_file_a}" | awk -F"," '{print $2}'`
folder_name_file_a=`echo "${line_file_a}" | awk -F"," '{print $3}'`
file_name_file_a=`echo "${line_file_a}" | awk -F"," '{print $4}'`
file_date_file_a=`echo "${line_file_a}" | awk -F"," '{print $5}'`
file_type_file_a=`echo "${line_file_a}" | awk -F"," '{print $6}'`
file_id_file_a=`echo "${line_file_a}" | awk -F"," '{print $7}'`
cat "${file_b}" | grep ${object_id_file_a} | while read -r line_file_b; do
file_id_file_b=`echo "${line_file_b}" | awk -F"," '{print $1}'`
file_size_file_b=`echo "${line_file_b}" | awk -F"," '{print $2}'`
if [ "${file_id_file_a}" = "${file_id_file_b}" ]; then
echo "${server_name_file_a},${filespace_name_file_a},${folder_name_file_a},${file_name_file_a},${file_date_file_a},${file_type_file_a},${file_id_file_a},${file_size_file_b}" >> ${file_c}.csv
fi
done
done
One last additional note, just in case you wonder:
The "if" section was not only built as a mean to articulate the output line, but it servers a double purpose, while safe-proofing any false positives that may derive from grep, IE 100 matching 1000 (Bear in mind that, as I mentioned earlier, I am working on AIX, so my grep does not have the -m switch the GNU one has, and I need matches to be exact/absolute).
You have reached the end. CONGRATULATIONS! You've been awarded the medal to patience.
$ cat stuff.awk
BEGIN { FS = OFS = "," }
# First pass (file b): remember the size for each ID
NR == FNR { a[$1] = $2; next }
# Second pass (file a): if the ID in column 7 was seen in file b, print the line plus its size
$7 in a { print $0, a[$7] }
Note the order for providing the files to the awk command, b first, followed by a:
$ awk -f stuff.awk b.txt a.txt
host1,/,somefolder,test1.txt,2016-08-18 00:00:20.000000,typez,11111111,215915
host20,/,somefolder/somesubfolder,usr.cfg,2015-12-288 05:00:20.000000,typen,22222222,1716
hoster,/lol,foolie,anotherfile.sad,2014-01-21 22:21:59.000000,typelol,66666666,1728
hostie,/,someotherfolder,somefile.txt,2016-06-17 18:43:12.000000,typea,33333333,212856
hostin,/var,folder30,someotherfile.cfg,2014-01-21 22:21:59.000000,typo,44444444,1729
hostn,/usr,foldie,tinyfile.lol,2016-08-18 00:00:20.000000,typewhatever,55555555,215927
server10,/usr,foldern,tempfile.tmp,2016-06-17 18:43:12.000000,tipesad,99999999,213876
EDIT: Updated calculation
You can estimate how often another program is started:
For each line in file a: at least 7 awks + 1 cat + 1 grep, i.e. roughly 9 * 160,000 process invocations.
For each hit that grep lets through from file b: 2 more awks, plus one open and one close of the output file for the >> redirect; with 60K output lines that is roughly 4 * 60,000 more.
A small change in the code reduces this to "only" 160,000 greps:
cat "${file_a}" | while IFS=, read -r server_name_file_a \
filespace_name_file_a folder_name_file_a file_name_file_a \
file_date_file_a file_type_file_a file_id_file_a; do
grep "${object_id_file_a}" "${file_b}" | while IFS="," read -r line_file_b; do
if [ "${file_id_file_a}" = "${file_id_file_b}" ]; then
echo "${server_name_file_a},${filespace_name_file_a},${folder_name_file_a},${file_name_file_a},${file_date_file_a},${file_type_file_a},${file_id_file_a},${file_size_file_b}"
fi
done
done >> ${file_c}.csv
Well, try this with your 160K files and see how much faster it is.
Before I explain why this is still the wrong approach, I will make another small improvement: replace the cat feeding the while loop with an input redirection at the end (after done).
while IFS=, read -r server_name_file_a \
        filespace_name_file_a folder_name_file_a file_name_file_a \
        file_date_file_a file_type_file_a file_id_file_a; do
    grep "${file_id_file_a}" "${file_b}" | while IFS=, read -r file_id_file_b file_size_file_b; do
        if [ "${file_id_file_a}" = "${file_id_file_b}" ]; then
            echo "${server_name_file_a},${filespace_name_file_a},${folder_name_file_a},${file_name_file_a},${file_date_file_a},${file_type_file_a},${file_id_file_a},${file_size_file_b}"
        fi
    done
done < "${file_a}" >> ${file_c}.csv
The main drawback of these solutions is that grep re-reads the complete file_b for every line in file a. They are a nice improvement in performance, but there is still a lot of overhead from grep. A much bigger improvement can be had with awk.
The best solution is using awk as explained in What is "NR==FNR" in awk? and shown in the answer of @jas.
It is a single awk invocation, and both files are read only once.
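For reference, the same join can be written inline, without the separate stuff.awk file (this is simply the script above pasted onto the command line; c.csv is a hypothetical name for the output file):
$ awk 'BEGIN { FS=OFS="," } NR==FNR { a[$1]=$2; next } $7 in a { print $0, a[$7] }' b.txt a.txt > c.csv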

How can I use sed (or awk or maybe a perl one-liner) to get values from specific columns in file A and use it to find lines in file B?

OK, sed/awk/Perl-fu gurus. Here's one similar to these (Extract specific strings...) and (Using awk to...), except that I need to use the number extracted from columns 4-10 in each line of File A (a PO number from a sales order line item) and use it to locate all related lines in File B and print them to a new file.
File A (purchase order details) lines look like this:
xxx01234560000000000000000000 yyy zzzz000000
File B (vendor codes associated with POs) lines look like this:
00xxxxx01234567890123456789001234567890
Columns 4-10 in File A have a 7-digit PO number, which is found in columns 7-13 of file B. What I need to do is parse File A to get a PO number, and then create a new sub-file from File B containing only those lines in File B which have the POs found in File A. The sub-file created is essentially the sub-set of vendors from File B who have orders found in File A.
I have tried a couple of things, but I'm really spinning my wheels on trying to make a one-liner for this. I could work it out in a script by defining variables, etc., but I'm curious whether someone knows a slick one-liner to do a task like this. The two referenced methods put together ought to do it, but I'm not quite getting it.
Here's a one-liner:
egrep -f <(cut -c4-10 A | sed -e 's/^/^.{6}/') B
It looks like the POs in file B actually start at column 8, not 7, but I made my regex start at column 7 as you asked in the question.
And in case there's the possibility of duplicates in A, you could increase efficiency by weeding those out before scanning file B:
egrep -f <(cut -c4-10 A | sort -u | sed -e 's/^/^.{6}/') B
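To see what patterns the process substitution feeds to egrep, you can run it on its own; with the sample File A line above it produces:
$ cut -c4-10 A | sed -e 's/^/^.{6}/'
^.{6}0123456
Each pattern requires six arbitrary characters followed by the PO number, i.e. the PO is expected to start at column 7; change {6} to {7} if, as noted above, the POs in file B really start at column 8.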
sed 's_^...\([0-9]\{7\}\).*_/^.\{6\}\1/p_' FIRSTFILE > FILTERLIST
sed -n -f FILTERLIST SECONDFILE > FILTEREDFILE
The first line generates a sed script from FIRSTFILE, then the second line uses that script to filter SECONDFILE. The two commands can be combined into one line too...
If the files are not that big you can do something like
awk 'NR == FNR { po[substr($0,4,7)]; next }   # build a set of FIRSTFILE PO numbers (columns 4-10)
     substr($0,7,7) in po' FIRSTFILE SECONDFILE > FILTERED
You can do it like this (but it will find the PO numbers anywhere on a line):
fgrep -f <(cut -b 4-10 FIRSTFILE) SECONDFILE
Another way using only grep:
grep -f <(grep -Po '^.{3}\K.{7}' fileA) fileB
Explanation:
-P for perl regex
-o to select only the match
\K resets the start of the reported match, so the first three characters are required but not printed (similar in effect to a positive lookbehind)

script for getting extensions of a file

I need to get all the file extension types in a folder. For instance, if the directory's ls gives the following:
a.t
b.t.pg
c.bin
d.bin
e.old
f.txt
g.txt
I should get this by running the script
.t
.t.pg
.bin
.old
.txt
I have a bash shell.
Thanks a lot!
See the BashFAQ entry on ParsingLS for a description of why many of these answers are evil.
The following approach avoids this pitfall (and, by the way, completely ignores files with no extension):
shopt -s nullglob
for f in *.*; do
    printf '%s\n' ".${f#*.}"
done | sort -u
Among the advantages:
Correctness: ls behaves inconsistently across filenames and can produce misleading results. See the link at the top.
Efficiency: Minimizes the number of subprocesses invoked (only one, sort -u, and even that could be removed if we used Bash 4's associative arrays to store results)
Things that still could be improved:
Correctness: this will correctly discard newlines in filenames before the first . (which some other answers won't) -- but filenames with newlines after the first . will be treated as separate entries by sort. This could be fixed by using nulls as the delimiter, or by the aforementioned bash 4 associative-array storage approach.
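For what it's worth, the Bash 4 associative-array variant mentioned above might look like the following sketch: it drops the sort -u subprocess, and because each extension is a single array key, embedded newlines are no longer split into separate entries (the output order of the keys is unspecified).
shopt -s nullglob
declare -A seen
for f in *.*; do
    seen[".${f#*.}"]=1   # the key is the extension; the value is unused
done
for ext in "${!seen[@]}"; do
    printf '%s\n' "$ext"
done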
try this:
ls -1 | sed 's/^[^.]*\(\..*\)$/\1/' | sort -u
ls lists files in your folder, one file per line
sed magic extracts extensions
sort -u sorts extensions and removes duplicates
sed magic reads as:
s/ / /: substitutes whatever is between first and second / by whatever is between second and third /
^: match beginning of line
[^.]: match any character that is not a dot
*: match it as many times as possible
\( and \): remember whatever is matched between these two parentheses
\.: match a dot
.: match any character
*: match it as many times as possible
$: match end of line
\1: this is what has been matched between parentheses
People are really over-complicating this - particularly the regex:
ls | grep -o "\..*" | uniq
ls - get all the files
grep -o "\..*" - -o only show the match; "\..*" match at the first "." & everything after it
uniq - don't print adjacent duplicates (ls output is already sorted, so any duplicates are adjacent) but keep the same order
You can also sort if you like, but then the order no longer matches the example
This is what happens when you run it:
> ls -1
a.t
a.t.pg
c.bin
d.bin
e.old
f.txt
g.txt
> ls | grep -o "\..*" | uniq
.t
.t.pg
.bin
.old
.txt

How can I make 'grep' show a single line five lines above the grepped line?

I've seen some examples of grepping lines before and after, but I'd like to ignore the middle lines.
So, I'd like the line five lines before, but nothing else.
Can this be done?
OK, I think this will do what you're looking for. It will look for a pattern, and extract the 5th line before each match.
grep -B5 "pattern" filename | awk -F '\n' 'ln ~ /^$/ { ln = "matched"; print $1 } $1 ~ /^--$/ { ln = "" }'
Basically, this takes the first line of each context block, prints it, and then waits until it sees ^--$ (the group separator used by grep) before starting again.
If you only want to have the 5th line before the match you can do this:
grep -B 5 pattern file | head -1
Edit:
If you can have more than one match, you could try this (exchange pattern with your actual pattern):
sed -n '/pattern/!{H;x;s/^.*\n\(.*\n.*\n.*\n.*\n.*\)$/\1/;x};/pattern/{x;s/^\([^\n]*\).*$/\1/;p}' file
I took this from a Sed tutorial, section: Keeping more than one line in the hold buffer, example 2 and adapted it a bit.
This is the -B option:
-B NUM, --before-context=NUM
Print NUM lines of leading context before matching lines.
Places a line containing -- between contiguous groups of
matches.
This way is easier for me:
grep --no-group-separator -B5 "pattern" file | sed -n 1~6p
This greps the 5 lines before and including each match, turns off the -- group separator, then prints every 6th line, i.e. the first line of each 6-line block.
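If you would rather not rely on a fixed block size (the stride breaks when matches fall within five lines of each other or of the start of the file), here is a small awk sketch that prints exactly the line five above each match (pattern is a placeholder):
awk '/pattern/ && NR > 5 { print buf[NR % 5] } { buf[NR % 5] = $0 }' file
A five-slot circular buffer always holds the previous five lines, so buf[NR % 5] is the line that was read five lines before the current match.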
