Shell script vs C performance

I was wondering how bad the performance impact would be for a program migrated from C to a shell script.
It does intensive I/O operations.
For example, in C, I have a loop reading from one file and writing into another. I'm taking parts of each line without any consistent relation between them, using pointers. It's a really simple program.
In the shell script, to move through a line, I'm using ${var:offset:length}. After I finish processing each line, I just append it to another file:
"$out" >> "$filename"
The program does something like:
while read line; do
    out="${line:10:16}.${line:45:2}"
    out="$out${line:106:61}"
    out="$out${line:189:3}"
    out="$out${line:215:15}"
    ...
    echo "$out" >> "$outFileName"
done < "$fileName"
The problem is, C takes about half a minute to process a 400 MB file while the shell script takes 15 minutes.
I don't know if I'm doing something wrong or not using the right operator in the shell script.
Edit: I cannot use awk since there is no pattern to process the lines with.
I tried commenting out the echo "$out" >> "$outFileName" line, but it doesn't get much better. I think the problem is the ${line:106:61} operation. Any suggestions?
Thanks for your help.

I suspect, based on your description, that you're spawning off new processes in your shell script. If that's the case, then that's where your time is going. It takes a lot of OS resource to fork/exec a new process.
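For illustration only (hypothetical; the loop above actually uses only bash builtins): extracting a field with an external command inside the loop forks extra processes on every iteration, while parameter expansion does not.
# builtin substring expansion: no new process per line
field=${line:10:16}
# external command per iteration: a subshell plus cut are spawned for every single line
field=$(echo "$line" | cut -c 11-26)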

As donitor and Dietrich suggested, I did a little research into the AWK language and, again, as they said, it was a total success. Here is a little example of the AWK program:
#!/bin/awk -f
{
    option = substr($0, 5, 9);
    if (option == "SOMETHING") {
        type = substr($0, 80, 1);
        if (type == "A") {
            type = "01";
        } else if (type == "B") {
            type = "02";
        } else if (type == "C") {
            type = "03";
        }
        print substr($0, 7, 3) substr($0, 49, 8) substr($0, 86, 8) type \
            substr($0, 568, 30) >> ARGV[2]
    }
}
And it works like a charm. It takes barely a minute to process a 500 MB file.

What's wrong with the C program? Is it broken? Too hard to maintain? Too inflexible? Are you more of a shell expert than a C expert?
If it ain't broke, don't fix it.
A look at Perl might be an option, too: easier than C to modify, still speedy I/O, and it's much harder to create useless forks in Perl than in the shell.
If you told us exactly what the C program does, maybe there's a simple and faster-than-light solution with sed, grep, awk or other gizmos in the Unix tool box. In other words, tell us what you actually want to achieve, don't ask us to solve some random problem you ran into while pursuing what you think is a step towards your actual goal.
Alright, one problem with your shell script is the repeated open in echo "$out" >> "$outFileName". Use this instead:
while read line; do
    echo "${line:10:16}.${line:45:2}${line:106:61}${line:189:3}${line:215:15}..."
done < "$fileName" > "$outFileName"
As an alternative, simply use the cut utility (note that cut counts characters from 1, and that this doesn't insert the dot after the first part):
cut -c 11-26,46-47,107-167 "$fileName" > "$outFileName"
You get the idea?
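(If you do need the dot, a rough follow-up sketch: pipe cut's output through sed; the trailing 16 tells sed to apply the substitution to the 16th character, appending a dot right after the first field.)
cut -c 11-26,46-47,107-167 "$fileName" | sed 's/./&./16' > "$outFileName"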

Related

Estimating the execution time of code on local linux system

I spend most of my time solving problems on Topcoder/SPOJ, so naturally I think about the performance (execution time) of my code on my system before submitting it.
On searching, I found the time command in Linux. But the problem is that it also includes the time spent entering the input values for the test cases, in addition to the processing time. So I thought of making an input file and sending its contents to my program.
Something like
cat input.txt > ./myprogram
But this doesn't work (I am not good at Linux piping). Can anyone point out the mistake, or suggest a better approach to measure my code's execution time?
EDIT
All of my programs read from stdin
You need this:
./myprogram < input.txt
Or if you insist on the Useless Use of Cat:
cat input.txt | ./myprogram
You can put time in front of ./myprogram in either case.
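For example, assuming your binary is ./myprogram and it reads its test cases from stdin:
time ./myprogram < input.txt
# "real" is wall-clock time; "user" + "sys" is the CPU time your program actually used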
You might want to look at xargs.
Something along the lines of
cat input.txt | xargs ./myprogram
You can also read the input through a dedicated file descriptor in your script. First assign a file descriptor to the input file (fd 3 here):
exec 3< input.txt
Then use the read command in a while loop to process the file line by line:
while read -u 3 -r a
do
    # process "$a" here
done

Trouble storing the output of mediainfo video times into an array

For the life of me, I cannot figure out why I can't store the output of the mediainfo --Inform command into an array. I've done for loops in Bash before without issue, perhaps I'm missing something really obvious here. Or, perhaps I'm going about it the completely wrong way.
#!/bin/bash
for file in /mnt/sda1/*.mp4
do vidtime=($(mediainfo --Inform="Video;%Duration%" $file))
done
echo ${vidtime[@]}
The output is always the time of the last file processed in the loop and the rest of the elements of the array are null.
I'm working on a script to endlessly play videos on a Raspberry Pi, but I'm finding that omxplayer isn't always exiting at the end of a video, it's really hard to reproduce so I've given up on troubleshooting the root cause. I'm trying to build some logic to kill off any omxplayer processes that are running longer than they should be.
Give this a shot. Note the += operator. You might also want to add quotes around $file if your filenames contain spaces:
#!/bin/bash
for file in /mnt/sda1/*.mp4
do vidtime+=($(mediainfo --Inform="Video;%Duration%" "$file"))
done
echo ${vidtime[@]}
It's more efficient to do it this way:
read -ra vidtime < <(exec mediainfo --Inform='Video;%Duration% ' -- /mnt/sda1/*.mp4)
No need to use a for loop and repeatedly call mediainfo.
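A quick way to check what ended up in the array (just a sanity check):
declare -p vidtime    # prints every element together with its index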

Potential Dangers of Running Code in Parallel

I am working in OSX and using bash for my shell. I have a script which calls an executable hundreds of times, and each call is independent of the other. Therefore I am going to run this code in parallel. However, each call to the executable appends output to a community text file on a new line.
The ordering of the text file is not important (it would be nice, but it's totally not worth over-complicating things since I can just use the Unix sort command); what is important is that every call of the executable gets properly printed to the file. My concern is that if I run the script in parallel, by some freak accident two processes will check out the text file, print to it, and then save different copies back to the original location of the text file, thus nullifying one of the writes.
Does this actually happen, or is my understanding of printing to a file flawed? I don't know whether this is a case-by-case thing, so I'll provide some mock code of what my program does below.
Script:
#!/bin/sh
abs=$1
input=$(echo "$abs" | awk '{print 0.004 + 0.005*$1 }')
./program "$input"
"./program":
~~Normal .c file stuff here~~
~~VALUE magically calculated here~~
~~run number is pulled out of input and assigned to index for sorting~~
FILE *fpp;
fpp = fopen("Doc.txt","a");
fprintf(fpp,"%d, %.3f\n", index, VALUE);
fclose(fpp);
~~Closing events of program.c~~
Commands to run script in parallel in bash:
printf "%s\n" {0..199} | xargs -P 8 -n 1 ./program
Thanks for any help you guys can offer.
A write() call (such as the one fwrite() eventually makes) on a file opened with the append flag (as fopen() with mode "a" does) is guaranteed to avoid the race condition you describe.
O_APPEND
If set, the file offset shall be set to the end of the file prior to each write.
From the POSIX specification for open() (opengroup.org).
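As a quick sanity check (a sketch, not a proof of atomicity), you can simulate many concurrent appenders from the shell, since >> also opens the file with O_APPEND, and verify that no lines were lost:
: > Doc.txt      # start with an empty file
for i in 1 2 3 4; do
    ( for j in $(seq 1 1000); do echo "$i $j" >> Doc.txt; done ) &
done
wait
wc -l Doc.txt    # expect 4000 lines if no write was lost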
Race conditions are what you are thinking of.
Not 100% sure, but if you simply append to the end of the file rather than opening and editing it, you should be all right.
If you have the option, make your program write to standard output instead of directly to a file. Then you can let the shell merge the output of your programs:
printf "%s\n" {0..199} | parallel -P 8 -n 1 ./program > merged_output.txt
Yeah, that looks like a recipe for disaster. If those processes both hit opening the file at roughly the same time, only one will "take".
I suggest either (easier) writing to separate files then catting them together when the processing is done, or (harder) sending all results to a consumer process that will write the file for everyone.
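A rough sketch of the first option, assuming the program is changed to write to stdout as suggested above (file names here are made up):
# each run gets its own output file; -P 8 keeps 8 jobs running at a time
printf "%s\n" {0..199} | xargs -P 8 -I{} sh -c './program "$1" > "out.$1.txt"' _ {}
# merge and clean up once everything has finished
cat out.*.txt > Doc.txt
rm out.*.txt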

Fast way to add line/row number to text file

I have a file which has about 12 million lines; each line looks like this:
0701648016480002020000002030300000200907242058CRLF
What I'm trying to accomplish is adding a row number before the data; the numbers should have a fixed length.
The idea behind this is to be able to do a bulk insert of this file into a SQL Server table, and then perform certain operations with it that require each line to have a unique identifier. I've tried doing this on the database side but I haven't been able to get good performance (under 4 minutes at least; under 1 minute would be ideal).
Right now I'm trying a solution in python that looks something like this.
file=open('file.cas', 'r')
lines=file.readlines()
file.close()
text = ['%d %s' % (i, line) for i, line in enumerate(lines)]
output = open("output.cas","w")
output.writelines(str("".join(text)))
output.close()
I don't know if this will work, but it will give me an idea of how it performs and what the side effects are before I keep trying new things. I also thought of doing it in C so I have better control over memory.
Will doing it in a low-level language help? Does anyone know a better way to do this? I'm pretty sure it has been done before, but I haven't been able to find anything.
thanks
oh god no, don't read all 12 million lines in at once! If you're going to use Python, at least do it this way:
file = open('file.cas', 'r')
try:
    output = open('output.cas', 'w')
    try:
        output.writelines('%d %s' % tpl for tpl in enumerate(file))
    finally:
        output.close()
finally:
    file.close()
That uses a generator expression which runs through the file processing one line at a time.
Why don't you try cat -n ?
Stefano is right:
$ time cat -n file.cas > output.cas
Use time just so you can see how fast it is. It'll be faster than the Python version since cat is pure C code.
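One caveat: the question asked for fixed-length numbers, and cat -n pads with blanks (to width 6) and uses a tab separator. If you need zero-padded, fixed-width numbers, nl or awk can do it; the width of 10 below is just an example:
nl -b a -w 10 -n rz -s ' ' file.cas > output.cas
# or the same idea with awk:
awk '{ printf "%010d %s\n", NR, $0 }' file.cas > output.cas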

C wrapper to remove users on command "ps"

I have a question; maybe someone here can help me. If I do "ps aux --sort user" on a Linux console, I get a list of users and their processes running on the machine. My question is: how do I remove the user name column and print that list like this from a C program:
for example:
(…)
--------------------------------------------------------------------------
user: APACHE
--------------------------------------------------------------------------
3169 0.0 1.2 39752 12352 ? S 04:10 0:00 /usr/sbin/httpd
--------------------------------------------------------------------------
user: VASCO
--------------------------------------------------------------------------
23030 0.0 0.1 4648 1536 pts/1 Ss 20:02 0:00 -bash
(…)
I print the user name and then I print that user's processes... any ideas?
thx
ps aux --sort user | perl -npe 's/^(\w+)//g; if ($user ne $1) {$user=$1; print "user: " . uc($user) . "\n";}'
You have a number of options depending on how much of it you want to do in C.
The simplest is to use system() to run a shell command (such as the one I posted earlier) to do the whole lot. system() will actually spawn a shell, so things like redirection will all work just as they do from the command line.
If you want to avoid using system() you could do it yourself, spawning two processes and linking them together. Look up pipe() and dup2(). Probably a waste of time.
You can run the ps program and parse its output in C. Again pipe() and dup2() are relevant. For the actual parsing, I'd just do it using the normal C string handling routines, as it's really quite straightforward. Obviously you could use a regex library instead but I think in this case it would result in more complicated code than without it.
Of course, you could do the whole lot in C by looking at files in /proc.
Not really an answer to your question, but user names are case-sensitive in unix, so capitalising them all probably isn't a good idea. If you want to make them stand out visually then "USER: apache" would be better.
Apart from that bmdhacks' answer is good (but not quite right). You could do something similar in awk, but it would be rather more complicated.
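For what it's worth, a rough awk equivalent of the grouping (an untested sketch) might look like this:
ps haux --sort user | awk '{ u = $1; sub(/^[^ ]+ +/, "")
    if (u != prev) { print "user: " toupper(u); prev = u }
    print }'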
This should work:
ps haux --sort user | perl -npe 's/^(\S+)\s+//; if ($user ne $1) {$user=$1; print "user: " . uc($user) . "\n";}'
Based on bmdhacks's answer, but with the following fixes:
a) it counts any non-whitespace as part of the user name,
b) it deletes any whitespace after the user name, like your example output had, otherwise things wouldn't line up
c) I had to remove the g to get it to work. I think because with the g it can potentially match lots of times, so perl doesn't set $1 as it could be ambiguous.
d) Added h to the ps command so that it doesn't output a header.
That's a Linux command line that does what I described... but that's not what I want. I want to do it from a C program; I have to write a C program that does this. So I use fork() to create a process that executes ps aux --sort user, and then I want another process to control the printing of the processes and users. Sorry if I explained my problem wrong.
The command I want to run is something like: ps aux --sort user | sort_by_user... but this sort_by_user option doesn't exist. Making a process in C that runs that command is simple with fork() and execlp(), but I have no idea how to create that option for the command in C.
Use popen() and manipulate the redirected stdout in your C program.
I solved my problem by redirecting the stdout output and editing it.
