Splitting a file into multiple files with a delimiter in awk

I am trying to split a file evenly into a number of chunks. This is my code:
awk '/*/ { delim++ } { file = sprintf("splits/audio%s.txt", int(delim /2)); print >> file; }' < input_file
My files look like this:
"*/audio1.lab"
0 6200000 a
6200000 7600000 b
7600000 8200000 c
.
"*/audio2.lab"
0 6300000 a
6300000 8300000 w
8300000 8600000 e
8600000 10600000 d
.
It is giving me an error: awk: line 1: syntax error at or near *
I do not know enough about awk to understand this error. I tried escaping characters but still haven't been able to figure it out. I could write a script in python but I would like to learn how to do this in awk. Any awkers know what I am doing wrong?
Edit: I have 14021 files. I gave the first two as an example.

For one thing, your regular expression is illegal; '*' says to match the previous character 0 or more times, but there is no previous character.
It's not entirely clear what you're trying to do, but it looks like when you encounter a line with an asterisk you want to bump the file number. To match an asterisk, you'll need to escape it:
awk '/\*/ { close(file); delim++ } { file = sprintf("splits/audio%d.txt", int(delim /2)); print >> file; }' < input_file
Also note %d is the correct format character for decimal output from an int.

idk what all the other stuff around this question is about but to just split your input file into separate output files all you need is:
awk '/\*/{close(out); out="splits/audio"++c".txt"} {print > out}' file
Since "repetition" metacharacters like * or ? or + can take on a literal meaning when they are the first character in a regexp, the regexp /*/ will work just fine in some (e.g. gawk) but not all awks and since you apparently have a problem with having too many files open you must not be using gawk (which manages files for you) so you probably need to escape the * and close() each output file when you're done writing to it. No harm doing that and it makes the script portable to all awks.

How can I replace two characters in a 40GB file in Unix?

I have two huge json files (20GB each) and I need to join them. The files have the following content:
file_1.json = [{"key": "value"}, {...}]
file_2.json = [{"key": "value"}, {...}]
The main problem, however, is that I need all the dicts to be in the same list. I tried to do this in Python, but unfortunately I don't have the memory for this operation.
So I thought maybe I could tackle this with Unix commands, by replacing, in the first file, the ] with , (note that there is a space after the comma) and erasing the [ from the second file. Then I would join the two files with the cat Unix command.
Is there a way for me to edit only the last 10 characters in Unix?
I tried to use echo and tr but I might be doing something wrong with the syntax.
You can very easily append to a file in place, i.e. add characters at the end without rewriting the data that's already there. With the right tools (truncate if your system has it), you can truncate a file in place, i.e. remove characters at the end without rewriting the data that's staying. With the right tools (dd, if you're feeling adventurous), you can replace a part of a file by a string of the same length, without rewriting the unchanged parts. On the other hand, you can't remove characters from the beginning or middle of a file without rewriting the file (with a few exceptions that aren't relevant here).
In any case, rewriting both files in place wouldn't help you that much: you will still need to rewrite the content of the second file in order to append it to the first file.
If you don't need to keep the split files around, you can append the second file to the first file in place, after taking care of the middle punctuation. Remove the last ] character from the first file, as well as any following spaces and line breaks. Assuming that the first file ends in ] and a newline and you have GNU core utilities (e.g. non-embedded Linux):
truncate -s -2 file_1.json
Now you can add a comma and optionally a line break to the first file, and append the data from the second file without its first character.
echo , >>file_1.json
tail -c +2 file_2.json >>file_1.json
If you want to keep the original files unmodified, you can make a copy of the first file and truncate it. Or you can directly make a truncated copy of the first file (still assuming GNU coreutils):
head -c -2 file_1.json >concatenated.json
echo , >>concatenated.json
tail -c +2 file_2.json >>concatenated.json
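If you prefer a single command, the same three steps can be grouped so the output file is written in one stream (a sketch, with the same GNU coreutils assumption):
{ head -c -2 file_1.json; echo ,; tail -c +2 file_2.json; } > concatenated.json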
If you're more comfortable with Python, you can do all of this in Python. Just don't read the whole file in one go, i.e. don't call read(), and don't use readlines() to pull in all the lines at once. Instead, read and process a single line at a time (if the lines are short) or a single block of data. Untested code:
import io
import re

with open('concatenated.json', 'wb') as out:
    with open('file_1.json', 'rb') as inp:
        # Locate the closing ']' (and any trailing whitespace) by reading
        # only the last block of the file.
        buf = bytearray(1024)                    # readinto() needs a writable buffer
        size = inp.seek(-len(buf), io.SEEK_END)  # offset where the last block starts
        n = inp.readinto(buf)
        m = re.search(rb']\s*\Z', buf[:n])
        stop_at = size + m.start()               # absolute offset of the final ']'
        # Copy everything before that offset, one block at a time.
        inp.seek(0, io.SEEK_SET)
        total = 0
        n = inp.readinto(buf)
        while n > 0:
            if total + n >= stop_at:
                out.write(buf[:stop_at - total]) # last, possibly partial, block
                break
            out.write(buf[:n])
            total += n
            n = inp.readinto(buf)
    out.write(b',')
    with open('file_2.json', 'rb') as inp:
        buf = bytearray(1024)
        n = inp.readinto(buf)
        assert buf[0:1] == b'['                  # buf[0] is an int, so compare a slice
        buf[0:1] = b'\n'                         # overwrite the leading '[' in place
        while n > 0:
            out.write(buf[:n])
            n = inp.readinto(buf)
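A quick shell sanity check of the result afterwards (just illustrative):
head -c 40 concatenated.json    # should still open with the '[' from file_1.json
tail -c 40 concatenated.json    # should still end with the ']' from file_2.json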

How to use C code variable inside system()

I am using C code with sed. I want to read lines in the intervals 1-10, 11-20, etc. to perform some calculation.
int i, j, m, n;
for (i = 0; i < 10; i++) {
    j = i + 1;
    // correction: m, n are computed here; an earlier version was incorrect
    m = i * 10;
    n = j * 10;
    system("sed -n 'm,n p' oldfile > newfile");
}
Output:
m,n p
It looks like the variables are not passed into system(). Is there any way to do that?
Use sprintf to build the command line:
char cmdline[100];
sprintf(cmdline, "sed -n '%d,%dp' oldfile.txt > newfile.txt", 10*i+1, 10*(i+1));
puts(cmdline); // optionally, verify manually it's going to do the right thing
system(cmdline);
(This is vulnerable to buffer overflow, but if your command-line arguments are not too flexible, 100 bytes should be enough.)
You cannot replace part of a string literal in C. What you need is to
Form a string with patterns
Replace those patterns with proper values with formatted I/O functions.
sprintf()/snprintf() will be your friend in this. You can do something like (copying from pmg's comment)
char cmd[100];
snprintf(cmd, 100, "sed -n '%d,%dp' oldfile > newfile", 10*i+1, 10*(i+1));
system(cmd);
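As an aside, if the C program does nothing but drive sed, the same 1-10, 11-20 chunking can be done directly in the shell (a sketch; the newfile.$i names are made up):
for i in 0 1 2 3 4 5 6 7 8 9; do
    sed -n "$((10*i+1)),$((10*(i+1)))p" oldfile > "newfile.$i"
done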

Finding "main" functions' names in a C file via Bash script

I have a large number of C files that are structured according to the following principles:
All functions are declared in the C file and are with return type int, double or void.
All functions start with "ksz_". Only functions use this - nothing else uses "ksz_" in their names.
Each file contains "main" functions. All supporting functions form their names from their "main" function's name.
Because they were written by different people, they are quite messily made, with spaces placed in random places.
A rough visualization would be (note the spaces):
int ksz_Print(...)
{
...
}
void ksz_Print_Helper1 (... ){
...
}
void ksz_Print_Helper2(...) {
...
}
int ksz_Input(...){
...
}
double ksz_Input_Helper1 ( ...){
...
}
I need to find the "main" function names of each individual C file in order to use them for another search algorithm.
Since these files are huge (some of them have well over ten thousand lines) and there are hundreds of them, I need a Bash script for this.
Ideally this script would extract only the "main" functions:
ksz_Print
ksz_Input
What stops me is that I can't work out the regex for my grep to extract the function lines. I think its logic should look like this:
(spaces)(int/float/double)(spaces)(ksz_)(other characters without spaces)(spaces)(open bracket)
After that I guess I'll extract the word containing "ksz_" from each line with cut (after trimming and removing duplicate spaces).
And last, I'll need to find a way to filter out the supporting functions.
But what would be my initial grep in this script?
If I understand your specifications correctly this should do it:
root@local [~]# awk '/^[ \t]*(int|float|double)[ \t]+ksz_/ {print $2}' sample.txt
One thing I did not understand is whether there should be only one "_" after ksz_, i.e. whether something like "double ksz_Input_Helper1" is a line you do not want to match. The regex above does match it.
I also chose to go with awk rather than grep: since you said you want only the name, the awk above prints just the second field, using whitespace as the delimiter. If you still want to use grep, this one does the same task:
root@local [~]# egrep '^\s*(int|float|double)\s+ksz_' sample.txt
Here is a breakdown (note that in awk I use [ \t] in place of \s, as I could not get awk to recognize \s):
^ - match start of line
\s* - match if there are 0 or more white spaces
(int|float|double) - match int, float, OR double
\s+ - match at least one whitespace
ksz_ - match literal string "ksz_"
Try using a regex that matches only the portion you want, and print only that:
grep -oRE "(ksz_[a-zA-Z_]*\b)" *
-o - output only the match
-R - recursive
-E - extended regex syntax
[a-zA-Z_] - upper and lower case letters, plus underscore
\b - ending at a word boundary
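Neither regex above filters out the supporting functions yet. A sketch of a complete pipeline, assuming a "main" function never has a second underscore after ksz_ while every helper does, and using the int/double/void return types stated in the question:
grep -hoE '^[[:space:]]*(int|double|void)[[:space:]]+ksz_[[:alnum:]]+[[:space:]]*\(' *.c |
    grep -oE 'ksz_[[:alnum:]]+' | sort -u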

Using sed, How to Insert a line at the beginning of a C function - closing paren, newline, opening curly brace

I want to insert a line at the beginning of several C functions that are formatted the same. I suspect sed is the way to do this but I have limited sed knowledge. Thanks.
Before:
void func (any arbitrary list of parameters)
{
After:
void func (any arbitrary list of parameters)
{
myNewInsertedLineHere
If the opening braces for functions begin on the first column and if they are the only braces that are in the first column (i.e. if you place opening braces for structs and enums at the end of a line), you can use:
sed -e 's/^{/{\n MYNEWLINE;/g' orig.c > edited.c
This seems to work in a quick test, but usual warnings and disclaimers apply.
Edit: As pointed out in the comments, not only functions have curly braces in the first column, so some context is needed. We can use another tool from the 70s, awk:
awk 'BEGIN {split("typedef union struct enum", a); \
for (i in a) skip[a[i]] = 1;}; \
{print; if (/^{/ && !(last in skip)) print " MYFIRSTLINE();"; \
if (NF > 0) last = $1; }' orig.c > edited.c
That's a one-liner in theory, but it might be better in a separate file, say first.awk:
#!/usr/bin/awk -f
BEGIN {
split("typedef union struct enum", a);
for (i in a) skip[a[i]] = 1;
};
{
print;
if (/^{/ && !(last in skip))
print " MYFIRSTLINE();";
if (NF > 0) last = $1;
}
Then you can call the script with
awk -f first.awk orig.c > edited.c
or, after making it executable with chmod, as
./first.awk orig.c > edited.c
Of course, the same strategy:
print every line;
when there is a brace in the first column and the context isn't a type or variable definition, print the additional content;
save the first word to determine the context for the next line
can be implemented in any other scripting language, too.
A program is almost always too complex for such simple rules. You said in the title: closing paren, newline, opening curly brace. What do you want to do with:
if (testfunction(val))
{
It follows the criteria but is not a function definition.
That being said, the following sed script should do the trick; it even allows for optional tabs or spaces around the {:
/)[ \t\r]*$/ {
n
s/^[ \t]*{[ \t\r]*$/&/
t add
b end
:add
a\
myNewInsertedLineHere
:end
}
In English, it reads:
look for a line ending with a right paren
look at the next line
try to replace a line containing only an opening curly brace (apart from whitespace) with itself
if the substitution matched, append the inserted line
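Assuming the script above is saved as, say, insert.sed (a made-up name), you would run it as:
sed -f insert.sed orig.c > edited.c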

Replacing string within a file using bash + environmental variables

Let's say, for simplicity's sake, I have a file (please forgive my useless pseudocode):
file.txt
std::string filename = "filename.txt"
double v_no = 2.0;
const int v_minor = 0; // < --- Target
std::string random_var1 = "Hello"
std::string random_var2 = "Hello 2"
int main()
{
// ..
}
And I have a bash file in the same directory - set_version.sh
I want to replace a string in this file with this script: specifically "v_minor = 0" with "v_minor = $VARIABLE", where the variable will be an environment variable set on a server.
So let's say it has been successfully run a couple of times, and the string now reads "v_minor = 2". I still want the same set_version.sh script to change the 2 to whatever the variable is.
In the Windows build of my software I have a batch file that changes "v_minor = %d" to "v_minor = %VERSION%".
My question is: how do I do something similar in bash, i.e. ignore whatever number is in the string and change it to the variable?
What I've got so far:
set_version.sh
#!/bin/bash
VERSION=75
sed -i '' 's/v_minor = %d/v_minor = $VERSION/g/' file.txt
Version var being set is just for testing purposes.
This returns error
sed: 1: "s/v_minor = %d/v_minor ...": bad flag in substitute command: '/'
I'm running Mac OS X Yosemite for this test.
Again, essentially %d can be any integer.
Thank you
This will work for you:
sed -i '' "s/v_minor = .*$/v_minor = $VERSION/g" file.txt
.*$ matches everything up to the end of the line.
Don't forget to use " " when operating with variables.
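Note that .*$ also swallows anything after the number, including the ; // < --- Target comment. A sketch of a variant that replaces only the digits and keeps the rest of the line intact (-E enables extended regular expressions in the BSD sed that ships with OS X):
sed -i '' -E "s/(v_minor = )[0-9]+/\1$VERSION/" file.txt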
sed -i '' 's/v_minor = %d/v_minor = $VERSION/g/' file.txt
#                                              ^
# remove this trailing slash ------------------+
(Even with that slash removed, the literal %d won't match an actual number, and the single quotes keep $VERSION from expanding; see the answer above for both fixes.)
According to your description, I would suggest another, easier way, as follows (and simplicity means fewer bugs...):
First, change your target to:
const int v_minor = V_MINOR; // < --- Target
Second, add an include line, anywhere before the target statement:
#include "version.h"
Third, write a script to generate version.h, similar to the following:
#ifndef _VERSION_H_
#define _VERSION_H_
#define V_MINOR 0 // <== this 0 is what you want to change.
#endif
Writing a script to output said version.h is simple: just some fixed prints plus the target number.
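A minimal sketch of such a generator, assuming the new number arrives in a $VERSION environment variable:
#!/bin/bash
# regenerate version.h, substituting the current $VERSION as the minor number
cat > version.h <<EOF
#ifndef _VERSION_H_
#define _VERSION_H_
#define V_MINOR $VERSION
#endif
EOF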
Compared to those possibly error-prone sed/awk/perl solutions, I prefer this simple one.
