flock: correct usage to prevent reading a file while it is being written

*/10 * * * * /usr/bin/flock -x -w 10 /tmp/craigslist.lock /usr/bin/lynx -width=120 -dump "http://sfbay.craigslist.org/search/roo/sfc?query=&srchType=A&minAsk=&maxAsk=1100&nh=6&nh=8&nh=16&nh=24&nh=17&nh=21&nh=22&nh=23&nh=27" | grep "sort by most recent" -A 53 > /home/winchell/apartments.txt
*/10 * * * * /usr/bin/flock -x -w 10 /tmp/craigslist.lock /usr/bin/php /home/winchell/apartments.php
This is a cron job. The PHP command on the second line seems to execute even while lynx is writing to apartments.txt, and I don't see why. Is this correct usage, assuming I'm trying to prevent reads from apartments.txt while lynx/grep are writing to it? Thanks!

Your usage is not correct. Notice how your first cron job is a pipeline consisting of two commands:
/usr/bin/flock -x -w 10 /tmp/craigslist.lock /usr/bin/lynx -width=120 -dump
"http://sfbay.craigslist.org/search/roo/sfc?query=&srchType=A&minAsk=&maxAsk=1100&nh=6&nh=8&nh=16&nh=24&nh=17&nh=21&nh=22&nh=23&nh=27"
which is then piped to:
grep "sort by most recent" -A 53 > /home/winchell/apartments.txt
So the first command holds the lock, but it is the second command that writes to apartments.txt. Since grep and its output redirection run outside flock, they happily execute without waiting for the lock.
One way to fix this would be to write the file while holding the lock:
lynx etc... | grep etc... |
flock -x -w 10 /tmp/craigslist.lock tee /home/winchell/apartments.txt
The disadvantage of this approach is that lynx and grep run even if the file is locked. To prevent this, you will have to run the whole thing under the lock:
flock -x -w 10 /tmp/craigslist.lock sh -c "lynx etc... | grep etc... >thefile"
With this approach you will have to pay careful attention to quoting, since the URL argument of lynx needs its own quotes inside the already-quoted sh -c command string.
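Putting it together, the crontab entries might look like this (a sketch, not tested against your crontab; single quotes around the sh -c body avoid clashing with the double-quoted URL, and the second entry stays as it was):

*/10 * * * * /usr/bin/flock -x -w 10 /tmp/craigslist.lock /bin/sh -c '/usr/bin/lynx -width=120 -dump "http://sfbay.craigslist.org/search/roo/sfc?query=&srchType=A&minAsk=&maxAsk=1100&nh=6&nh=8&nh=16&nh=24&nh=17&nh=21&nh=22&nh=23&nh=27" | grep "sort by most recent" -A 53 > /home/winchell/apartments.txt'
*/10 * * * * /usr/bin/flock -x -w 10 /tmp/craigslist.lock /usr/bin/php /home/winchell/apartments.php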
Finally: consider using curl or wget instead of lynx. lynx is meant for interactive use!

Related

Save one line of `top`, `htop` or `intel_gpu_top` outputs into a Bash array

I want to save 1 line from the output of top into a Bash array to later access its components:
$ timeout 1 top -d 2 | awk 'NR==8'
2436 USER 20 0 1040580 155268 91100 S 6.2 1.0 56:38.94 Xorg
Terminated
I tried:
$ gpu=($(timeout 1s top -d 2 | awk 'NR==8'))
$ mapfile -t gpu < <($(timeout 1s top -d 2 | awk 'NR==8'))
and, departing from the array requisite, even:
$ read -r gpu < <(timeout 1s top -d 2 | awk 'NR==8')
all returned a blank for either ${gpu[@]} (first two) or $gpu (last).
Edit:
As pointed out by @Cyrus and others, gpu=($(top -n 1 -d 2 | awk 'NR==8')) is the obvious solution. However, I want to build the cmd dynamically, so top -d 2 may be replaced by other cmds such as htop -d 20 or intel_gpu_top -s 1. Only top can limit its maximum number of iterations, so that is not an option in general, and for that reason I resort to timeout 1s to kill the process in all shown attempts...
End edit
Using a shell other than Bash is not an option. Why did the above attempts fail, and how can I achieve that?
Why did the above attempts fail
Because a pipe does not have terminal capabilities, the top process receives a SIGTTOU signal when it tries to take the terminal "back" from the shell and write to it. That signal causes top to terminate.
how can I achieve that?
Use top -n 1. Generally, use the tool-specific options to keep the tool from using terminal facilities.
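For example, a minimal sketch that also adds -b (batch mode), so top never touches the terminal and works fine in a pipe:

gpu=($(top -b -n 1 -d 2 | awk 'NR==8'))
echo "${gpu[@]}"   # fields of line 8, word-split into array elements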
However I want to build the cmd dynamically so top -d 2 may be replaced by other cmds such as htop -d 20 or intel_gpu_top -s 1
Write your own terminal emulation and extract the desired line from the buffer of whatever the command first displays. See the GNU screen and tmux source code for inspiration.
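If a full terminal emulator is overkill, one possible shortcut is to let the util-linux script(1) utility allocate the pseudo-terminal for you. This is a sketch of the idea, not a drop-in answer: screen-oriented tools such as htop repaint the screen with escape sequences, so after stripping them the line you want may not sit at NR==8:

line=$(timeout 1s script -qec "htop -d 20" /dev/null | sed 's/\x1b\[[0-9;]*[a-zA-Z]//g' | awk 'NR==8')
gpu=($line)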
I don't think you need timeout there if it's only intended to quit top. You can use the -n and -b flags instead, but feel free to add it back if you need it.
#!/bin/bash
arr=()
arr[0]=$(top -n 1 -b -d 2 | awk 'NR==8')
arr[1]=random-value
arr[2]=$(top -n 1 -b -d 2 | awk 'NR==8')
echo ${arr[0]}
echo ${arr[1]}
echo ${arr[2]}
output:
1 root 20 0 99868 10412 7980 S 0.0 0.5 0:00.99 systemd
random-value
1 root 20 0 99868 10412 7980 S 0.0 0.5 0:00.99 systemd
from top man page:
-b :Batch-mode operation
Starts top in Batch mode, which could be useful for sending output from top to other programs or to a
file. In this mode, top will not accept input and runs until the iterations limit you've set with the
`-n' command-line option or until killed.
-n :Number-of-iterations limit as: -n number
Specifies the maximum number of iterations, or frames, top should produce before ending.
-d :Delay-time interval as: -d ss.t (secs.tenths)
Specifies the delay between screen updates, and overrides the corresponding value in one's personal
configuration file or the startup default. Later this can be changed with the `d' or `s' interactive
commands.

assign two bash arrays with a single command

youtube-dl can take some time parsing remote sites when called multiple times.
EDIT0: I want to fetch multiple properties (here fileNames and remoteFileSizes) output by youtube-dl without having to run it multiple times.
I use those two properties to compare the local file size with ${remoteFileSizes[$i]} to tell whether the file has finished downloading.
$ youtube-dl --restrict-filenames -o "%(title)s__%(format_id)s__%(id)s.%(ext)s" -f m4a,18,webm,251 -s -j https://www.youtube.com/watch?v=UnZbjvyzteo 2>errors_youtube-dl.log | jq -r ._filename,.filesize | paste - - > input_data.txt
$ cat input_data.txt
Alan_Jackson_-_I_Want_To_Stroll_Over_Heaven_With_You_Live__18__UnZbjvyzteo__youtube_com.mp4 8419513
Alan_Jackson_-_I_Want_To_Stroll_Over_Heaven_With_You_Live__250__UnZbjvyzteo__youtube_com.webm 1528955
Alan_Jackson_-_I_Want_To_Stroll_Over_Heaven_With_You_Live__140__UnZbjvyzteo__youtube_com.m4a 2797366
Alan_Jackson_-_I_Want_To_Stroll_Over_Heaven_With_You_Live__244__UnZbjvyzteo__youtube_com.webm 8171725
I want the first column in the fileNames array and the second column in the remoteFileSizes.
For the time being I use a while read loop, but when the loop finishes my two arrays are lost:
$ fileNames=()
$ remoteFileSizes=()
$ cat input_data.txt | while read fileName remoteFileSize; do \
fileNames+=($fileName); \
remoteFileSizes+=($remoteFileSize); \
done
$ for fileNames in "${fileNames[@]}"; do \
echo PROCESSING....; \
done
$ echo "=> fileNames[0] = ${fileNames[0]}"
=> fileNames[0] =
$ echo "=> remoteFileSizes[0] = ${remoteFileSizes[0]}"
=> remoteFileSizes[0] =
$
Is it possible to assign two bash arrays with a single command?
You assign the variables in a subshell, so they are not visible in the parent shell. Read https://mywiki.wooledge.org/BashFAQ/024. Remove the cat and use a redirection instead to solve your problem:
while IFS=$'\t' read -r fileName remoteFileSize; do
fileNames+=("$fileName")
remoteFileSizes+=("$remoteFileSize")
done < input_data.txt
You might also want to read https://mywiki.wooledge.org/BashFAQ/001.
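If you want to keep the pipeline itself, another option is Bash's lastpipe shell option, which runs the last stage of a pipeline in the current shell instead of a subshell (a sketch; lastpipe only takes effect when job control is off, which is the default in non-interactive scripts):

#!/bin/bash
shopt -s lastpipe
fileNames=()
remoteFileSizes=()
cat input_data.txt | while IFS=$'\t' read -r fileName remoteFileSize; do
    fileNames+=("$fileName")
    remoteFileSizes+=("$remoteFileSize")
done
echo "${fileNames[0]} => ${remoteFileSizes[0]}"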
For what it's worth, if you're looking for specific/bespoke functionality from youtube-dl, I recommend creating your own python scripts using the 'embedded' approach: https://github.com/ytdl-org/youtube-dl/blob/master/README.md#embedding-youtube-dl
You can set your own signal for when a download is finished (text/chime/mail/whatever) and track downloads without having to compare file sizes.

Bash script for running parallel commands using input from a file

I am trying to make a shell script which reads a configuration file and executes commands for each line in parallel.
For example, I have IPs.cfg, which can contain a variable number of IPs. It could be one or several.
IPs.cfg
145.x.x.x
176.x.x.x
192.x.x.x
I want to read the file and then execute a command for each line at the same time... for instance:
scp test.iso root@$IP1:/tmp &
scp test.iso root@$IP2:/tmp &
scp test.iso root@$IP3:/tmp &
wait
The way I'm thinking this is that I store the IPs into an array
IFS=$'\n' read -d '' -r -a array < IPs.cfg
Then I extract the number of lines from the file and decrease it by 1 since the array starts at 0.
NUMLINES=`cat IPs.cfg | wc -l`
NUMLINES=$((NUMLINES-1))
Now I want to execute the commands all at the same time. It's a variable number of parameters, so I can't just manually use scp test.iso root@${array[0]}:/tmp & scp test.iso root@${array[1]}:/tmp & and so on. I could use a while loop, but that would mean doing the commands one at a time. I'm also thinking about using recursion, but I have never done that in a bash script.
It might be a silly question, but what are my options here?
You could make use of GNU parallel.
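For instance (a sketch, assuming GNU parallel is installed; :::: reads the arguments from a file and -j caps the number of simultaneous jobs):

parallel -j 10 scp test.iso "root@{}:/tmp" :::: IPs.cfg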
The loop should look like this:
while read -r ip ; do
scp test.iso "root#$ip:/tmp" &
done < IPs.cfg
wait
With this trick, you can control the number of simultaneous processes running at the same time:
cat IPs.cfg | xargs -n1 -I{} -P10 scp test.iso "root@{}:/tmp"
Note the -P10, which means run up to 10 processes at the same time.

Save curl download statistics / log

I'm reading a stream with curl and grep some highlights.
curl url | grep desired_key_word
I've noticed that curl is providing me some nice download statistics such as:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 10.9M 0 10.9M 0 0 1008k 0 --:--:-- 0:00:11 --:--:-- 1092k
How can I save those statistics e.g. every second in a file?
I found this: http://curl.haxx.se/mail/archive-2002-11/0115.html, however I was not able to adapt it to my problem.
curl -n agent.mtconnect.org/sample\?interval=0 -o xml_stream.log 2>> dl.log
The dl.log should have the statistics included; however, it does not work.
Here is the unabstracted version.
curl -s -S -n http://speedtest.fremont.linode.com/100MB-fremont.bin -o /dev/null -w "%{time_total},%{size_download},%{speed_download}\n" >> stats.log
Only stdout gets redirected by the -o flag.
For the -o flag the man page states:
-o/--output <file>
Write output to <file> instead of stdout...
If you want stderr, you need something like this:
curl -n agent.mtconnect.org/sample\?interval=0 >> xml_stream.log 2>> dl.log
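Combining that with -w gives you machine-readable statistics instead of the raw progress meter (a sketch built from the two commands above; -w prints its report to stdout once the transfer completes):

curl -s -S -n agent.mtconnect.org/sample\?interval=0 -o xml_stream.log -w "%{time_total},%{size_download},%{speed_download}\n" >> stats.log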

Faster grep function for big (27GB) files

I have a small file (5 MB) containing specific strings, and I have to grep those same strings (and other information) out of a big file (27 GB).
To speed up the analysis I split the 27 GB file into 1 GB files and then applied the following script (with the help of some people here). However, it is not very efficient: producing a 180 KB file takes 30 hours!
Here's the script. Is there a more appropriate tool than grep? Or a more efficient way to use grep?
#!/bin/bash
NR_CPUS=4
count=0
for z in `echo {a..z}` ;
do
for x in `echo {a..z}` ;
do
for y in `echo {a..z}` ;
do
for ids in $(cat input.sam|awk '{print $1}');
do
grep $ids sample_"$z""$x""$y"|awk '{print $1" "$10" "$11}' >> output.txt &
let count+=1
[[ $((count%NR_CPUS)) -eq 0 ]] && wait
done
done #&
done
done
A few things you can try:
1) You are reading input.sam multiple times. It only needs to be read once before your first loop starts. Save the ids to a temporary file which will be read by grep.
2) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8. This will speed up grep.
3) Use fgrep because you're searching for a fixed string, not a regular expression.
4) Use -f to make grep read patterns from a file, rather than using a loop.
5) Don't write to the output file from multiple processes as you may end up with lines interleaving and a corrupt file.
After making those changes, this is what your script would become:
awk '{print $1}' input.sam > idsFile.txt
for z in {a..z}
do
for x in {a..z}
do
for y in {a..z}
do
LC_ALL=C fgrep -f idsFile.txt sample_"$z""$x""$y" | awk '{print $1,$10,$11}'
done
done
done >> output.txt
Also, check out GNU Parallel which is designed to help you run jobs in parallel.
My initial thought is that you're repeatedly spawning grep. Spawning processes is very expensive (relatively speaking), and I think you'd be better off with some sort of scripted solution (e.g. Perl) that doesn't require continual process creation.
e.g. for each inner loop you're kicking off cat and awk (you won't need cat, since awk can read files itself; in fact, doesn't this cat/awk combination return the same thing each time?) and then grep. Then you wait for four greps to finish and go around again.
If you have to use grep, you can use
grep -f filename
to specify the set of patterns to match in filename, rather than a single pattern on the command line. I suspect from the above that you can pre-generate such a list.
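For example (a sketch reusing the idsFile.txt idea from the other answer; -F treats the patterns as fixed strings, which is considerably faster than regex matching, and sample_aaa stands in for any one of the split files):

awk '{print $1}' input.sam > idsFile.txt
LC_ALL=C grep -F -f idsFile.txt sample_aaa | awk '{print $1, $10, $11}'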
OK, I have a test file containing 4-character strings, i.e. aaaa, aaab, aaac, etc.
ls -lh test.txt
-rw-r--r-- 1 root pete 1.9G Jan 30 11:55 test.txt
time grep -e aaa -e bbb test.txt
<output>
real 0m19.250s
user 0m8.578s
sys 0m1.254s
time grep --mmap -e aaa -e bbb test.txt
<output>
real 0m18.087s
user 0m8.709s
sys 0m1.198s
So using the --mmap option shows a clear improvement on a 2 GB file with two search patterns. If you take @BrianAgnew's advice and use a single invocation of grep, try the --mmap option.
Though it should be noted that mmap can be a bit quirky if the source file changes during the search.
From man grep:
--mmap
If possible, use the mmap(2) system call to read input, instead of the default read(2) system call. In some situations, --mmap yields better performance. However, --mmap can cause undefined behavior (including core dumps) if an input file shrinks while grep is operating, or if an I/O error occurs.
Using GNU Parallel it would look like this:
awk '{print $1}' input.sam > idsFile.txt
doit() {
LC_ALL=C fgrep -f idsFile.txt sample_"$1" | awk '{print $1,$10,$11}'
}
export -f doit
parallel doit {1}{2}{3} ::: {a..z} ::: {a..z} ::: {a..z} > output.txt
If the order of the lines is not important this will be a bit faster:
parallel --line-buffer doit {1}{2}{3} ::: {a..z} ::: {a..z} ::: {a..z} > output.txt
