Finding "main" functions' names in a C file via Bash script

I have a large number of C files that are structured according to the following principles:
All functions are defined in the C file and have return type int, double or void.
All functions start with "ksz_". Only functions use this prefix; nothing else has "ksz_" in its name.
Each file contains "main" functions. All supporting functions build their names from their "main" function's name.
Because the files were written by different people, they are quite messy and have spaces in random places.
A rough illustration (note the spaces):
int ksz_Print(...)
{
...
}
void ksz_Print_Helper1 (... ){
...
}
void ksz_Print_Helper2(...) {
...
}
int ksz_Input(...){
...
}
double ksz_Input_Helper1 ( ...){
...
}
I need to find the "main" function names in each individual C file in order to use them in another search algorithm.
Since these files are huge (some of them have over a dozen thousand lines) and there are hundreds of them, I need a Bash script for this.
Ideally this script would extract only the "main" functions:
ksz_Print
ksz_Input
What stops me is that I can't work out the regex for my grep to extract the function lines. I think its logic should look like this:
(spaces)(int/float/double)(spaces)(ksz_)(other characters without space)(spaces)(open bracket)
After that I guess I'll extract the word containing "ksz_" from each line with cut (after trimming and removing duplicate spaces).
And last, I'll need to find a way to filter out the supporting functions.
But what would be my initial grep in this script?

If I understand your specification correctly, this should do it:
root@local [~]# awk '/^[ \t]*(int|float|double|void)[ \t]+ksz_/ { sub(/\(.*/, "", $2); print $2 }' sample.txt
One thing I did not understand was whether there should be only one "_" after ksz, i.e. whether "double ksz_Input_Helper1" is something you do not want to match. The regex above does match it.
I chose awk rather than grep because you said you want only the name: the awk command above prints only the second whitespace-delimited field, with anything from the opening parenthesis onward removed. If you still want to use grep, this one selects the same lines but prints them in full:
root@local [~]# grep -E '^\s*(int|float|double|void)\s+ksz_' sample.txt
Here is a breakdown (note that in awk I use [ \t] in place of \s, as I could not get it to recognize \s):
^ - match start of line
\s* - match if there are 0 or more white spaces
(int|float|double|void) - match int, float, double, OR void (void is included because the question lists it as a return type)
\s+ - match at least one whitespace
ksz_ - match literal string "ksz_"

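Try using a regex that matches only the portion you want, and print only that:
grep -oRE "(ksz_[a-zA-Z0-9_]*\b)" *
-o - output only the matched text
-R - recursive
-E - use extended regular expressions
[a-zA-Z0-9_] - upper and lower case letters, digits (needed for names such as ksz_Print_Helper1), underscore
\b - ending at a word boundary
Note that this prints every occurrence of a ksz_ name, including calls and supporting functions, so the output still needs to be de-duplicated and filtered.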
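The question's last step, filtering out the supporting functions, can be sketched as follows. This is only a minimal sketch: it assumes (as the question states) that every supporting function's name is its "main" function's name followed by "_" and a suffix, and that definitions start with int, double or void. The helper name list_main_functions is hypothetical.

```shell
#!/usr/bin/env bash
# list_main_functions FILE (hypothetical helper)
# Prints the ksz_ function names in FILE that are not "<other name>_<suffix>".
list_main_functions() {
    # Step 1: collect every name that follows a return type and precedes "(".
    sed -nE 's/^[[:space:]]*(int|double|void)[[:space:]]+(ksz_[A-Za-z0-9_]*)[[:space:]]*\(.*/\2/p' "$1" |
    sort -u |
    # Step 2: drop a name if some other collected name plus "_" is a prefix of it.
    awk '
        { names[NR] = $0 }
        END {
            for (i = 1; i <= NR; i++) {
                is_helper = 0
                for (j = 1; j <= NR; j++)
                    if (i != j && index(names[i], names[j] "_") == 1)
                        is_helper = 1
                if (!is_helper) print names[i]
            }
        }'
}
```

For the sample in the question this prints ksz_Input and ksz_Print. To process many files, it can be called in a loop such as: for f in *.c; do list_main_functions "$f"; done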

Related

How to portably use "${@:2}"?

In "Allow for ${@:2} syntax in variable assignment" they say I should not use "${@:2}" because it breaks things across different shells, and that I should use "${*:2}" instead.
But using "${*:2}" instead of "${@:2}" makes no sense, because "${@:2}" is not equivalent to "${*:2}", as the following example shows:
#!/bin/bash
check_args() {
    echo "\$#=$#"
    local counter=0
    for var in "$@"
    do
        counter=$((counter+1));
        printf "$counter. '$var', ";
    done
    printf "\\n\\n"
}
# setting arguments
set -- "space1 notspace" "space2 notspace" "lastargument"; counter=1
echo $counter': ---------------- "$*"'; counter=$((counter+1))
check_args "$*"
echo $counter': ---------------- "${*:2}"'; counter=$((counter+1))
check_args "${*:2}"
echo $counter': ---------------- "${@:2}"'; counter=$((counter+1))
check_args "${@:2}"
-->
GNU bash, version 4.4.12(1)-release (x86_64-pc-linux-gnu)
1: ---------------- "$*"
$#=1
1. 'space1 notspace space2 notspace lastargument',
2: ---------------- "${*:2}"
$#=1
1. 'space2 notspace lastargument',
3: ---------------- "${@:2}"
$#=2
1. 'space2 notspace', 2. 'lastargument',
If I cannot use "${@:2}" (as they say), what equivalent can I use instead?
The original question is "Process all arguments except the first one (in a bash script)", and the only answer there that keeps arguments with spaces together is to use "${@:2}".

There's context that's not clear in the question unless you follow the links. It's concerning the following recommendation from shellcheck.net:
local _help_text="${@:2}"
^––SC2124 Assigning an array to a string! Assign as array, or use * instead of @ to concatenate.
Short answer: Don't assign lists of things (like arguments) to plain variables; use an array instead.
Long answer: Generally, "${@:2}" will get all but the first argument, with each treated as a separate item ("word"). "${*:2}", on the other hand, produces a single item consisting of all but the first argument stuck together, separated by a space (or whatever the first character of $IFS is).
But in the specific case where you're assigning to a plain variable, the variable is only capable of storing a single item, so var="${@:2}" also collapses the arguments down to a single item, but it does it in a less consistent way than "${*:2}". In order to avoid this, use something that is capable of storing multiple items: an array. So:
Really bad: var="${@:2}"
Slightly less bad: var="${*:2}"
Much better: arrayvar=("${@:2}") (the parentheses make this an array)
Note: to get the elements of the array back, with each one treated properly as a separate item, use "${arrayvar[@]}". Also, arrays are not supported by all shells (notably, dash doesn't support them), so if you use them you should be sure to use a bash shebang (#!/bin/bash or #!/usr/bin/env bash). If you really need portability to other shells, things get much more complicated.
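The difference between storing the arguments in a string and in an array can be seen by counting how many words each one expands to. A small sketch for bash only; count_args and demo are hypothetical names used purely for illustration:

```shell
#!/usr/bin/env bash
# count_args: hypothetical helper that prints how many arguments it received.
count_args() { echo "$#"; }

demo() {
    local joined="${*:2}"       # one string: arguments 2..n joined with the first character of IFS
    local -a rest=("${@:2}")    # an array: arguments 2..n, each kept as a separate element
    count_args "$joined"        # the string is passed as a single word
    count_args "${rest[@]}"     # the array expands to one word per element
}

demo "first" "space2 notspace" "lastargument"
```

Run as written, this prints 1 and then 2: the string holds everything after the first argument as one item, while the array keeps "space2 notspace" and "lastargument" apart.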

Neither ${@:2} nor ${*:2} is portable, and many shells will reject both as invalid syntax. If you want to process all arguments except the first, you should get rid of the first with a shift.
first="${1}"
shift
echo The arguments after the first are:
for x; do echo "$x"; done
At this point, the first argument is in "$first" and the positional parameters are shifted down one.

This demonstrates how to combine all of the "${@}" arguments into a single variable without the "${@:1}" or "${@:2}" hack (live example):
#!/bin/bash
function get_all_arguments_as_single_one_unquoted() {
    single_argument="$(printf "%s " "${@}")";
    printf "unquoted arguments %s: '%s'\\n" "${#}" "${single_argument}";
}
function get_all_arguments_as_single_one_quoted() {
    single_argument="${1}";
    printf "quoted arguments %s: '%s'\\n" "${#}" "${single_argument}";
}
function escape_arguments() {
    escaped_arguments="$(printf '%q ' "${@}")";
    get_all_arguments_as_single_one_quoted "${escaped_arguments}";
    get_all_arguments_as_single_one_unquoted ${escaped_arguments};
}
set -- "first argument" "last argument";
escape_arguments "${@}";
-->
GNU bash, version 4.4.12(1)-release (x86_64-pc-linux-gnu)
quoted arguments 1: 'first\ argument last\ argument '
unquoted arguments 4: 'first\ argument last\ argument '
As @William Pursell's answer points out, if you want only the "${@:2}" arguments, you can add a shift call before using "${@}":
function escape_arguments() {
    shift;
    escaped_arguments="$(printf '%q ' "${@}")";
    get_all_arguments_as_single_one_quoted "${escaped_arguments}";
    get_all_arguments_as_single_one_unquoted ${escaped_arguments};
}
-->
GNU bash, version 4.4.12(1)-release (x86_64-pc-linux-gnu)
quoted arguments 1: 'last\ argument '
unquoted arguments 2: 'last\ argument '

How to insert lines of text after any C function begin and before end of function?

I have hundreds of C functions like
void test()
{
<content of function>
}
(Functions may have a return value UBYTE, BOOL, WORD, ...)
Now I would like to add a text to all functions as follows:
void test()
{
LABEL_BEGIN
<blank line>
<content of function>
<blank line>
LABEL_END
}
So I need to insert LABEL_BEGIN and a blank line at the start of the function and a blank line and LABEL_END at end of the function.
I assume that this might be possible with some tricky regex? Or is there a text editor available which has such a feature? Currently, I have MS Studio 2013 IDE, Notepad++, Textpad, PSpad available and also the GNU grep 2.4.5 command line tool.

The following regex will select each void function: the return type, its name, any parameters and its first opening curly brace. If you match this against your input, you can take its match index and match length to find the index where the function's inner content begins. You can then add your additional content at that index.
(void [A-Za-z_]\w*\(.*\)\r?\n?\{)
Feel free to play around with it here: https://regexr.com/3jarc
To get the end of the function, you could use the regex above as a positive lookbehind and then match the closing '}'. Be aware that many regex engines do not support variable-length lookbehind, and that a regex alone cannot reliably pair an opening brace with its matching closing brace when the body contains nested braces.

What about a brace counter?
You open your source file and read it character by character.
Each time you see '{', you increment brace_counter, and each time you see '}', you decrement it.
If you find a '{' and the counter is 0 before the increment, you have the beginning of your function.
If you find a '}' and the counter is 0 after the decrement, you have the end of your function.
Could this work? Or is there tricky C syntax that can break it?
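The brace counter described above can be sketched with awk, reading each line character by character. This is a rough sketch under loud assumptions: braces inside strings, character literals, comments and preprocessor lines are not recognized, top-level struct, enum and initializer blocks are treated like function bodies, and the outermost '{' and '}' each sit on a line of their own. The helper name add_labels is hypothetical.

```shell
#!/usr/bin/env bash
# add_labels FILE (hypothetical helper)
# Prints FILE with LABEL_BEGIN / LABEL_END added inside every top-level brace block.
add_labels() {
    awk '
    {
        begins = 0; ends = 0
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c == "{") {
                if (depth == 0) begins = 1      # counter was 0 before the increment
                depth++
            } else if (c == "}") {
                depth--
                if (depth == 0) ends = 1        # counter is 0 after the decrement
            }
        }
        # Lines that both open and close a top-level block are left untouched.
        if (ends && !begins) { print ""; print "LABEL_END" }
        print
        if (begins && !ends) { print "LABEL_BEGIN"; print "" }
    }' "$1"
}
```

Usage would be along the lines of add_labels orig.c > edited.c. Nested blocks such as an if body are left alone because the counter is above zero when their braces are seen.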

Splitting a file into multiple files with a delimiter using awk

I am trying to split a file evenly into a number of chunks. This is my code:
awk '/*/ { delim++ } { file = sprintf("splits/audio%s.txt", int(delim /2)); print >> file; }' < input_file
My file looks like this:
"*/audio1.lab"
0 6200000 a
6200000 7600000 b
7600000 8200000 c
.
"*/audio2.lab"
0 6300000 a
6300000 8300000 w
8300000 8600000 e
8600000 10600000 d
.
It is giving me an error: awk: line 1: syntax error at or near *
I do not know enough about awk to understand this error. I tried escaping characters but still haven't been able to figure it out. I could write a script in python but I would like to learn how to do this in awk. Any awkers know what I am doing wrong?
Edit: I have 14021 files. I gave the first two as an example.

For one thing, your regular expression is illegal; '*' says to match the previous character 0 or more times, but there is no previous character.
It's not entirely clear what you're trying to do, but it looks like when you encounter a line with an asterisk you want to bump the file number. To match an asterisk, you'll need to escape it:
awk '/\*/ { close(file); delim++ } { file = sprintf("splits/audio%d.txt", int(delim /2)); print >> file; }' < input_file
Also note %d is the correct format character for decimal output from an int.

I don't know what all the other stuff around this question is about, but to just split your input file into separate output files, all you need is:
awk '/\*/{close(out); out="splits/audio" (++c) ".txt"} {print > out}' file
"Repetition" metacharacters like *, ? or + can take on a literal meaning when they are the first character in a regexp, so the regexp /*/ will work just fine in some awks (e.g. gawk) but not all. Since you apparently have a problem with having too many files open, you must not be using gawk (which manages files for you), so you probably need to escape the * and close() each output file when you're done writing to it. There is no harm in doing that, and it makes the script portable to all awks.
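If the goal is a fixed number of records per output file, as the int(delim / 2) in the question suggests, the two ideas (escape the asterisk, close each finished file) can be combined. A sketch only: the helper name split_records, its three parameters and the audioN.txt naming inside the output directory are assumptions made for illustration, and every record is assumed to start with a line containing an asterisk.

```shell
#!/usr/bin/env bash
# split_records INPUT OUTDIR PER_FILE (hypothetical helper)
# Writes PER_FILE records to OUTDIR/audio0.txt, the next PER_FILE to audio1.txt, ...
split_records() {
    mkdir -p "$2" &&
    awk -v dir="$2" -v per="$3" '
        /\*/ {
            n++                                   # a new record starts on this line
            next_file = sprintf("%s/audio%d.txt", dir, int((n - 1) / per))
            if (next_file != file) {              # this record opens a new chunk
                if (file != "") close(file)       # done with the previous chunk
                file = next_file
            }
        }
        file != "" { print > file }               # lines before the first record are skipped
    ' "$1"
}
```

For example, split_records input_file splits 2 would put the first two records in splits/audio0.txt, the next two in splits/audio1.txt, and so on. Using (n - 1) rather than n keeps the first chunk the same size as the others.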

Using sed, How to Insert a line at the beginning of a C function - closing paren, newline, opening curly brace

I want to insert a line at the beginning of several C functions that are all formatted the same way. I suspect sed is the way to do this, but I have limited sed knowledge. Thanks.
Before:
void func (any arbitrary list of parameters)
{
After:
void func (any arbitrary list of parameters)
{
myNewInsertedLineHere

If the opening braces for functions begin on the first column and if they are the only braces that are in the first column (i.e. if you place opening braces for structs and enums at the end of a line), you can use:
sed -e 's/^{/{\n MYNEWLINE;/g' orig.c > edited.c
This seems to work in a quick test, but usual warnings and disclaimers apply.
Edit: As pointed out in the comments, not only functions have curly braces in the first column, so some context is needed. We can use another tool from the 70s, awk:
awk 'BEGIN {split("typedef union struct enum", a); \
for (i in a) skip[a[i]] = 1;}; \
{print; if (/^{/ && !(last in skip)) print " MYFIRSTLINE();"; \
if (NF > 0) last = $1; }' orig.c > edited.c
That's a one-liner in theory, but it might be better in a separate file, say first.awk:
#!/usr/bin/awk -f
BEGIN {
    split("typedef union struct enum", a);
    for (i in a) skip[a[i]] = 1;
}
{
    print;
    if (/^{/ && !(last in skip))
        print " MYFIRSTLINE();";
    if (NF > 0) last = $1;
}
Then you can call the script with
awk -f first.awk orig.c > edited.c
or, after making it executable with chmod +x first.awk, as
./first.awk orig.c > edited.c
Of course, the same strategy:
print every line;
when there is a brace in the first column and the context isn't a type or variable definition, print the additional content;
save the first word to determine the context for the next line
can be implemented in any other scripting language, too.

A program is almost always too complex for such simple rules. You said in the title: closing paren, newline, opening curly brace. What do you want to do with:
if testfunction(val)
{
It follows the criteria but is not a function definition.
That being said, the following sed script should do the trick; it even allows for optional tabs or spaces around the {:
/)[ \t\r]*$/ {
    n
    /^[ \t]*{[ \t\r]*$/ a\
myNewInsertedLineHere
}
In English, it reads:
look for a line ending with a right paren
move on to the next line
if that line contains only an opening curly brace (apart from white space), append the new line after it
Note that \t and \r inside brackets are understood by GNU sed but are not guaranteed by POSIX.
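The same rule (a line ending in a closing paren, followed by a line holding only an opening brace) can also be sketched with awk wrapped in a shell function, which avoids sed's branching commands. The helper name insert_after_open_brace and its parameters are hypothetical, and the sketch shares the limitation mentioned earlier in this answer: an if (...) followed by { on its own line is treated like a function.

```shell
#!/usr/bin/env bash
# insert_after_open_brace FILE TEXT (hypothetical helper)
# Prints FILE, adding TEXT after every brace-only line that follows a line ending in ")".
insert_after_open_brace() {
    awk -v text="$2" '
        { print }
        # Previous line ended with ")" and this line is only "{" plus white space.
        prev_closes_paren && /^[ \t]*\{[ \t\r]*$/ { print text }
        # Remember for the next line whether this one ends with ")".
        { prev_closes_paren = /\)[ \t\r]*$/ }
    ' "$1"
}
```

It would be used as insert_after_open_brace orig.c 'myNewInsertedLineHere' > edited.c. Because the text is passed with -v, awk interprets backslash escapes in it.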

Regex to detect beginning of the C function body

I am working on a Perl script that prints the required function body from a C source file. I have written a regex to get to the start of the function body:
(/(void|int)\s*($function_name)\s*\(.*?\)\s*{/s
But this works only for functions returning void or int (basic types).
How can I change this regex to handle user-defined data types (structs or pointers)?

Try this one (untested!), although it does expect the function to start at the beginning of a line:
/
    ^                           # Start of line
    \s*(?:struct\s+)?[a-z0-9_]+ # Return type, optionally a struct
    \s*\**                      # Return type can be a pointer
    \s*([a-z0-9_]+)             # Function name
    \s*\(                       # Opening parenthesis
    (
        (?:struct\s+)?          # Maybe we accept a struct?
        \s*[a-z0-9_]+\**        # Argument type
        \s*(?:[a-z0-9_]+)       # Argument name
        \s*,?                   # Comma to separate the arguments
    )*
    \s*\)                       # Closing parenthesis
    \s*\{?                      # Maybe a {
    \s*$                        # End of the line
/mix # Close our regex; m = multi-line, i = case insensitive, x = allow whitespace and comments
You can squeeze all of this into a single line by removing the whitespace and comments (and the x flag).
Parsing code with a regex is generally hard, though, and this regex is far from perfect.
