Keep only last line of a repeated pattern - c

I would like to know if it is possible to delete all lines matching a selected pattern except the last one. It is not so easy to explain, so I will give an example.
I have a text file with content similar to this:
A sent (1)
A received (1)
B sent (1)
B sent (2)
B sent (3)
B received (1)
I would like to have an alternation between "sent" and "received" messages, where the "sent" line is the last of the consecutive sent messages with the same letter. So I need an output like:
A sent (1)
A received (1)
B sent (3)
B received (1)
Is there some program that can do something like that? I can use either Ubuntu or Windows, or build a simple C/C++ application, if necessary.

Here's a simple way:
tac FILE | uniq -w 6 | tac
We:
Reverse-print the file using tac (necessary for uniq to work right here).
Weed out duplicate lines, basing uniqueness on only the first 6 characters (thereby ignoring the incrementing number in parentheses). Only the first line of a set of duplicate lines is kept, which is why we have used tac.
Then reverse-print the file again so it's in the order you want.

Under Linux, this can be a one-liner, for example in awk:
awk '$1 $2 != prev {if (buf) print buf} {prev = $1 $2; buf = $0} END {print buf}' mylog.txt
The exact syntax depends on your pattern. Here, I just use the first two words ($1 $2) of the line to determine whether a line should be skipped. Each line ($0) is buffered in a temporary variable, which is printed when the pattern changes or at the END.
If it is okay to print the first line of a similar block instead of the last line, the script reduces to:
awk '$1 $2 != prev; {prev = $1 $2}' mylog.txt
or you can use the even more succinct alternative:
uniq -w 6
which weeds out duplicate lines, considering only the first 6 characters.

In C, something like this will do:
bool isFirstSent = false;  // needs <stdbool.h>, <stdio.h>, <string.h>, <stdlib.h>
bool isSecondSent = false;
char *firstLine = getLine(file); // getLine returns a malloc'd char array
                                 // if a line exists and NULL otherwise
if (strstr(firstLine, "sent"))
    isFirstSent = true;
char *secondLine = getLine(file);
while (secondLine)
{
    if (strstr(secondLine, "sent"))
        isSecondSent = true;
    else
        isSecondSent = false;
    // keep a line only when the sent/received state flips on the next line
    if (isFirstSent != isSecondSent)
        printf("%s", firstLine);
    free(firstLine);
    isFirstSent = isSecondSent;
    firstLine = secondLine;
    secondLine = getLine(file);
}
printf("%s", firstLine); // the last line of the file is always kept
free(firstLine);

How can I replace two characters in a 40GB file in Unix?

I have two huge json files (20GB each) and I need to join them. The files have the following content:
file_1.json = [{"key": "value"}, {...}]
file_2.json = [{"key": "value"}, {...}]
The main problem, however, is that I need all the dicts to be in the same list. I tried to do this in Python, but unfortunately, I don't have the memory for this operation.
So, I thought maybe I could tackle this with Unix commands, by replacing, in the first file, the ] with ", " (note that there is a space after the comma) and erasing the [ from the second file. Then, I would join the two files with the cat Unix command.
Is there a way for me to edit only the last 10 characters in Unix?
I tried to use echo and tr but I might be doing something wrong with the syntax.
You can very easily append to a file in place, i.e. add characters at the end without rewriting the data that's already there. With the right tools (truncate if your system has it), you can truncate a file in place, i.e. remove characters at the end without rewriting the data that's staying. With the right tools (dd, if you're feeling adventurous), you can replace a part of a file by a string of the same length, without rewriting the unchanged parts. On the other hand, you can't remove characters from the beginning or middle of a file without rewriting the file (with a few exceptions that aren't relevant here).
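For instance, here is a minimal C sketch of the truncate-in-place and append-in-place operations just described (the helper name and arguments are illustrative, not from the question):
/* Drop the last `chop` bytes of a file in place, then append `tail`,
 * without rewriting the data that stays (POSIX truncate + stdio append). */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int chop_and_append(const char *path, off_t chop, const char *tail)
{
    struct stat st;
    if (stat(path, &st) != 0 || st.st_size < chop)
        return -1;
    if (truncate(path, st.st_size - chop) != 0) /* truncate in place */
        return -1;
    FILE *f = fopen(path, "ab");                /* append in place */
    if (!f)
        return -1;
    fputs(tail, f);
    return fclose(f);
}
With chop = 2 and tail = ",\n" this is the same idea as the truncate and echo commands below.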
But anyway rewriting both files in place wouldn't help you that much. You will need to at least rewrite the content of the second file to append it to the first file.
If you don't need to keep the split files around, you can append the second file to the first file in place, after taking care of the middle punctuation. Remove the last ] character from the first file, as well as any following spaces and line breaks. Assuming that the first file ends in ] and a newline and you have GNU core utilities (e.g. non-embedded Linux):
truncate -s -2 file_1.json
Now you can add a comma and optionally a line break to the first file, and append the data from the second file without its first character.
echo , >>file_1.json
tail -c +2 file_2.json >>file_1.json
If you want to keep the original files unmodified, you can make a copy of the first file and truncate it. Or you can directly make a truncated copy of the first file (still assuming GNU coreutils):
head -c -2 file_1.json >concatenated.json
echo , >>concatenated.json
tail -c +2 file_2.json >>concatenated.json
If you're more comfortable with Python, you can do all of this in Python. Just don't read the whole file in one go, i.e. don't call read() or use readline() in a way that reads all the lines at once. Instead, read and process a single line at a time (if the lines are short) or a single block of data. Untested code:
import io
import re

with open('concatenated.json', 'wb') as out:
    with open('file_1.json', 'rb') as inp:
        # Find the absolute offset of the final ] (assumes the file
        # is larger than one block).
        buf = bytearray(1024)
        size = inp.seek(-len(buf), io.SEEK_END)
        n = inp.readinto(buf)
        m = re.search(rb']\s*\Z', buf[:n])
        stop_at = size + m.start()
        inp.seek(0, io.SEEK_SET)
        # Copy everything before that offset, one block at a time.
        remaining = stop_at
        while remaining > 0:
            n = inp.readinto(buf)
            if n == 0:
                break
            out.write(buf[:min(n, remaining)])
            remaining -= n
    out.write(b',')
    with open('file_2.json', 'rb') as inp:
        buf = bytearray(1024)
        n = inp.readinto(buf)
        assert buf[:1] == b'['
        buf[0:1] = b'\n'  # overwrite the leading [ of the second file
        while n > 0:
            out.write(buf[:n])
            n = inp.readinto(buf)

Finding "main" functions' names in a C file via Bash script

I have a large number of C files that are structured according to the following principles:
All functions are declared in the C file and have return type int, double or void.
All functions start with "ksz_". Only functions use this - nothing else uses "ksz_" in their names.
Each file contains "main" functions. All supporting functions form their names from their "main" function's name.
Because they were made by different people, they are quite messily made and have spaces placed in random places.
A rough visualization would be (note the spaces):
int ksz_Print(...)
{
...
}
void ksz_Print_Helper1 (... ){
...
}
void ksz_Print_Helper2(...) {
...
}
int ksz_Input(...){
...
}
double ksz_Input_Helper1 ( ...){
...
}
I need to find the "main" function names in each individual C file in order to use them for another search algorithm.
Since these files are huge (some of them have over a dozen thousand lines) and there are hundreds of them, I need a Bash script for this.
Ideally this script would extract only the "main" functions:
ksz_Print
ksz_Input
What stops me is that I can't work out the regex for my grep to extract the function lines. I think its logic should look like this:
(spaces)(int/float/double)(spaces)(ksz_)(other characters without spaces)(spaces)(open bracket)
After that I guess I'll extract the word containing "ksz_" from each line with cut (after trimming and removing duplicate spaces).
And last I'll need to find a way to filter out the supporting functions.
But what would be my initial grep in this script?
If I understand your specifications correctly this should do it:
root@local [~]# awk '/^[ \t]*(int|float|double)[ \t]+ksz_/ {print $2}' sample.txt
One thing I did not understand was whether there should be only one "_" after ksz_, i.e. whether "double ksz_Input_Helper1" is something you do not want to match. The regex above does match it.
I also chose to go with awk rather than grep, as you said you want only the name; the awk above prints only the second field, using whitespace as the delimiter. If you still want to use grep, this one does the same task:
root@local [~]# egrep '^\s*(int|float|double)\s+ksz_' sample.txt
Here is a breakdown (note that in awk I use [ \t] in place of \s, as I could not get it to recognize \s):
^ - match start of line
\s* - match if there are 0 or more white spaces
(int|float|double) - match int, float, OR double
\s+ - match at least one whitespace
ksz_ - match literal string "ksz_"
Try using a regex that only matches the portion you want and only print that:
grep -oRE "(ksz_[a-zA-Z_]*\b)" *
-o - output only match
-R - recursive
-E - extended regex
[a-zA-Z_] - upper and lower case letters, underscore
\b - ending at word boundary

splitting a file into multiple files with a delimiter in awk

I am trying to split files evenly into a number of chunks. This is my code:
awk '/*/ { delim++ } { file = sprintf("splits/audio%s.txt", int(delim /2)); print >> file; }' < input_file
my files look like this:
"*/audio1.lab"
0 6200000 a
6200000 7600000 b
7600000 8200000 c
.
"*/audio2.lab"
0 6300000 a
6300000 8300000 w
8300000 8600000 e
8600000 10600000 d
.
It is giving me an error: awk: line 1: syntax error at or near *
I do not know enough about awk to understand this error. I tried escaping characters but still haven't been able to figure it out. I could write a script in Python, but I would like to learn how to do this in awk. Any awkers know what I am doing wrong?
Edit: I have 14021 files. I gave the first two as an example.
For one thing, your regular expression is illegal; '*' says to match the previous character 0 or more times, but there is no previous character.
It's not entirely clear what you're trying to do, but it looks like when you encounter a line with an asterisk you want to bump the file number. To match an asterisk, you'll need to escape it:
awk '/\*/ { close(file); delim++ } { file = sprintf("splits/audio%d.txt", int(delim /2)); print >> file; }' < input_file
Also note %d is the correct format character for decimal output from an int.
idk what all the other stuff around this question is about but to just split your input file into separate output files all you need is:
awk '/\*/{close(out); out="splits/audio"++c".txt"} {print > out}' file
Since "repetition" metacharacters like * or ? or + can take on a literal meaning when they are the first character in a regexp, the regexp /*/ will work just fine in some (e.g. gawk) but not all awks and since you apparently have a problem with having too many files open you must not be using gawk (which manages files for you) so you probably need to escape the * and close() each output file when you're done writing to it. No harm doing that and it makes the script portable to all awks.

Using sed, How to Insert a line at the beginning of a C function - closing paren, newline, opening curly brace

I want to insert a line at the beginning of several C functions that are formatted the same. I suspect sed is the way to do this but I have limited sed knowledge. Thanks.
Before:
void func (any arbitrary list of parameters)
{
After:
void func (any arbitrary list of parameters)
{
myNewInsertedLineHere
If the opening braces for functions begin on the first column and if they are the only braces that are in the first column (i.e. if you place opening braces for structs and enums at the end of a line), you can use:
sed -e 's/^{/{\n MYNEWLINE;/g' orig.c > edited.c
This seems to work in a quick test, but usual warnings and disclaimers apply.
Edit: As pointed out in the comments, not only functions have curly braces in the first column, so some context is needed. We can use another tool from the 70s, awk:
awk 'BEGIN {split("typedef union struct enum", a); \
for (i in a) skip[a[i]] = 1;}; \
{print; if (/^{/ && !(last in skip)) print " MYFIRSTLINE();"; \
if (NF > 0) last = $1; }' orig.c > edited.c
That's a one-liner in theory, but it might be better in a separate file, say first.awk:
#!/usr/bin/awk -f
BEGIN {
    split("typedef union struct enum", a);
    for (i in a) skip[a[i]] = 1;
};
{
    print;
    if (/^{/ && !(last in skip))
        print " MYFIRSTLINE();";
    if (NF > 0) last = $1;
}
Then you can call the script with
awk -f first.awk orig.c > edited.c
or, after adding execute permission with chmod, as
first.awk orig.c > edited.c
Of course, the same strategy:
print every line;
when there is a brace in the first column and the context isn't a type or variable definition, print the additional content;
save the first word to determine the context for the next line
can be implemented in any other scripting language, too.
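For instance, a rough C sketch of that same strategy (reading from stdin; a sketch, not a drop-in tool, and MYFIRSTLINE is the same placeholder as above):
/* Print every line; after a "{" in column one, insert MYFIRSTLINE();
 * unless the previous non-blank line started with typedef/union/struct/enum. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[4096], first[64], last[64] = "";
    const char *skip[] = { "typedef", "union", "struct", "enum" };
    size_t i;

    while (fgets(line, sizeof line, stdin)) {
        fputs(line, stdout);
        if (line[0] == '{') {
            int skipped = 0;
            for (i = 0; i < sizeof skip / sizeof skip[0]; i++)
                if (strcmp(last, skip[i]) == 0)
                    skipped = 1;
            if (!skipped)
                puts("    MYFIRSTLINE();");
        }
        if (sscanf(line, "%63s", first) == 1) /* like awk's: if (NF > 0) last = $1 */
            strcpy(last, first);
    }
    return 0;
}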
A program is almost always too complex for such simple rules. You said in the title: closing paren, newline, opening curly brace. What do you want to do with:
if (testfunction(val))
{
It follows the criteria but is not a function definition.
That being said, the following sed script should do the trick; it even caters for optional tabs or spaces around the {
/)[ \t\r]*$/ {
n
s/^[ \t]*{[ \t\r]*$/&/
t add
b end
:add
a\
\

:end
}
In English, it reads:
look for a line ending with right paren
look at next line
try to replace a line containing only an opening curly brace (apart from white spaces) by itself
if the substitution matched, add an empty line

Filter text from huge .csv files, in C

I have the raw and unfiltered records in a CSV file (more than 1000000 records), and I am supposed to filter out those records from a list of files (each weighing more than 282MB; approx. more than 2000000 records). I tried using strstr in C. This is my code:
while (!feof(rawfh)) // loop to read records from raw file
{
    j = 0; // counter
    while ((c = fgetc(rawfh)) != '\n' && !feof(rawfh)) // read a line from raw file
    {
        line[j] = c; line[j+1] = '\0'; j++;
    }
    // function to extract the element in the specified column, in the CSV
    extractcol(line, relcolraw, entry);
    printf("\nWorking on : %s", entry);
    found = 0;
    // read a set of 4000 bytes; this is the target file
    while (fgets(buffer, 4000, dncfh) != NULL && !found)
    {
        if (strstr(buffer, entry) != NULL) // compare it
            found++;
    }
    rewind(dncfh); // put the file pointer back to the start
    // if the record was not found in the target list, write it into another file
    if (!found)
    {
        fprintf(out, "%s,\n", entry); printf(" *** written to filtered ***");
    }
    else
    {
        found = 0; printf(" *** Found ***");
    }
    // I hope this is the right way to null out a string
    entry[0] = '\0'; line[0] = '\0';
    // just to display a # on the screen, to let the user know that the program
    // is still alive and running.
    rawreccntr++;
    if (rawreccntr >= 10)
    {
        printf("#"); rawreccntr = 0;
    }
}
This program takes approximately 7 to 10 seconds, on average, to search for one entry in the target file (282 MB). So, 10*1000000 = 10000000 seconds (roughly 115 days) :( God knows how long that is going to take if I decide to search in 25 files.
I was thinking of writing a program myself, not going for spoon-fed solutions (grep, sed, etc.). Oh, sorry, but I am using Windows 8 (64-bit, 4 GB RAM, AMD Radeon 2-core processor at 1000 MHz). I used DevC++ (gcc) to compile this.
Please enlighten me with your ideas.
Thanks in advance, and sorry if I sound stupid.
Update by Ali, the key information extracted from a comment:
I have a raw CSV file with details of customers' phone numbers and addresses. I have the target file(s) in CSV format: the Do Not Call list. I am supposed to write a program to filter out the phone numbers that are not present in the Do Not Call list. The phone numbers (in both files) are in the 2nd column. I, however, don't know of any other method. I searched for the Boyer-Moore algorithm, but could not implement it in C. Any suggestions about how I should go about searching for records?
EDITED
I would recommend you try the ready-made tools in any Unix/Linux system, grep and awk. You'll probably find they are just as fast and much more easily maintained. I haven't seen your data format, but you say the phone numbers are in the second column, so you can get the phone numbers on their own like this:
awk '{print $2}' DontCallFile.csv
If your phone numbers are in double quotes, you can remove those like this:
awk '{print $2}' DontCallFile.csv | tr -d '"'
Then you can use fgrep with the -f option, to search whether strings listed in one file are present in a second file, like this:
fgrep -f file1.csv file2.csv
or you can invert the search and search for strings NOT present in another file, by adding the -v switch to fgrep.
So, your final command would probably end up like this:
fgrep -v -f <(awk '{print $2}' DontCallFile.csv | tr -d '"') file2.csv
That says... search, in file2.csv for all strings not present (-v option) in column 2 of file "DontCallFile.csv". If you want to understand the bit in <() it is called process substitution and it basically makes a pseudo-file out of the result of running the command inside the brackets. And we need a pseudo-file because fgrep -f expects a file.
ORIGINAL ANSWER
Why are you using fgetc() anyway? Surely you would use getline() (the POSIX one), like this:
char *line = NULL;
size_t len = 0;
while (getline(&line, &len, myfile) != -1)
{
    ...
}
Are you really reading the whole "target" file from the start for every single line in your main file? That will kill you! And why are you doing it in chunks of 4,000 bytes? And what if one of your strings straddles the 4,000 bytes you compare it with, i.e. the first 8 bytes are in one 4k chunk and the remaining bytes are in the next 4k chunk?
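If you do stay with block reads, one way around the straddling problem is to carry the tail of each chunk over into the next read. A sketch, assuming no entry is longer than OVERLAP bytes and the data is text (no embedded NUL bytes):
/* Chunked search that cannot miss a match split across two reads. */
#include <stdio.h>
#include <string.h>

#define CHUNK   4000
#define OVERLAP 64 /* must be >= longest search string minus 1 */

int found_in_file(FILE *f, const char *needle)
{
    char buf[OVERLAP + CHUNK + 1];
    size_t carry = 0, n;

    rewind(f);
    while ((n = fread(buf + carry, 1, CHUNK, f)) > 0) {
        size_t total = carry + n;
        buf[total] = '\0';
        if (strstr(buf, needle))
            return 1;
        carry = total < OVERLAP ? total : OVERLAP;
        memmove(buf, buf + total - carry, carry); /* keep the tail */
    }
    return 0;
}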
I think you will get better help on here if you take the time to explain properly what you are trying to do - and maybe do it with awk or grep (at least figuratively) so we can see what you are actually trying to achieve. Your description doesn't mention the "target" file you use in the code, for example.
You can do this with awk, like this:
awk -F, '
FNR==NR {gsub(/"/,"",$2);dcn[$2]++;next}
{gsub(/ /,"",$2);if(!dcn[$2])print}
' DontCallFile.csv x.csv
That says... the field separator is a comma (-F,). Now read the first file (DontCallFile.csv) and process according to the part in curly braces after FNR==NR. Remove the double quotes from around the phone number in field 2, using gsub (global substitution). Then increment the element in the associative array (i.e. hash) as indexed by unquoted field 2 and then move to next record. So basically, after file "DontCallFile.csv" is processed, the array dcn[] will hold a hash of all the numbers not to call (dcn=dontcallnumbers). Then, the code in the second set of curly braces is executed for each line of the second file ("x.csv"). That says... remove all spaces from around the phone number in field 2. Then, if that phone number is not present in the array dcn[] that we built earlier, print the line.
Here is one idea for improvement...
In the code below, what's the point in setting line[j+1] = '\0' at every iteration?
while ((c = fgetc(rawfh)) != '\n' && !feof(rawfh))
{
    line[j] = c; line[j+1] = '\0'; j++;
}
You might as well do it outside the loop:
while ((c = fgetc(rawfh)) != '\n' && !feof(rawfh))
    line[j++] = c;
line[j] = '\0';
My advice is the following.
Put all don't call phone numbers into an array.
Sort this array.
Use binary search to check if a given phone number is among the sorted don't call numbers.
In the code below, I just hard-coded the numbers. In your application, you will have to replace that with the corresponding code.
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int compare(const void* a, const void* b) {
    return strcmp(*(char **)a, *(char **)b);
}

int binary_search(const char** first, const char** last, const char* val) {
    ptrdiff_t len = last - first;
    while (len > 0) {
        ptrdiff_t half = len >> 1;
        const char** middle = first;
        middle += half;
        if (compare(&*middle, &val) < 0) {
            first = middle;
            ++first;
            len = len - half - 1;
        }
        else
            len = half;
    }
    return first != last && !compare(&val, &*first);
}

int main(int argc, char** argv) {
    size_t i;
    /* Read _all_ of your don't call phone numbers into an array. */
    /* For the sake of the example, I just hard-coded it. */
    char* dont_call[] = { "908-444-555", "800-200-400", "987-654-321" };
    /* In your program, change length to the number of dont_call numbers actually read. */
    size_t length = sizeof dont_call / sizeof dont_call[0];
    qsort(dont_call, length, sizeof(char *), compare);
    printf("The don't call numbers sorted\n");
    for (i = 0; i < length; ++i)
        printf("%zu %s\n", i, dont_call[i]);
    /* For each phone number, check if it is in the sorted dont_call list. */
    /* Use binary search to check it. */
    char* numbers[] = { "999-000-111", "333-444-555", "987-654-321" };
    size_t n = sizeof numbers / sizeof numbers[0];
    printf("Now checking if we should call a given number\n");
    for (i = 0; i < n; ++i) {
        int should_call = binary_search((const char **)dont_call, (const char **)dont_call + length, numbers[i]);
        char* as_text = should_call ? "no" : "yes";
        printf("Should we call %s? %s\n", numbers[i], as_text);
    }
    return 0;
}
This prints:
The don't call numbers sorted
0 800-200-400
1 908-444-555
2 987-654-321
Now checking if we should call a given number
Should we call 999-000-111? yes
Should we call 333-444-555? yes
Should we call 987-654-321? no
The code is definitely not perfect but it is sufficient to get you started.
The problem with your algorithm is complexity. Your approach is O(n*m), where n is the number of customers and m is the number of do-not-call records (or the size of the file, in your case). You need to reduce this complexity. (The Boyer-Moore algorithm which Ali mentioned would not help there: it would improve only the constant, not the asymptotic complexity.) Even binary search, as Ali suggests in his answer, is not the best: it would be O((n+m)*log m). We can do better. Nice solutions are the fgrep and awk ones suggested by Mark Setchell in his answers. (I would choose the fgrep one, which I would guess performs better, but it is only a guess.) I can provide a similar solution in Perl which gives more robust CSV parsing and should handle your data sizes easily on decent hardware. This type of solution has complexity O(n+m).
#!/usr/bin/env perl

use strict;
use warnings;
use autodie;
use Text::CSV_XS;

use constant PHN_COL_DNC => 1;
use constant PHN_COL_CUSTOMERS => 1;

die "Usage: $0 dnc_file [customers]" unless @ARGV > 0;
my $dncfile = shift @ARGV;
my $csv = Text::CSV_XS->new({eol => "\n", allow_whitespace => 1, binary => 1});
my %dnc;

open my $dnc, '<', $dncfile;
while (my $row = $csv->getline($dnc)) {
    $dnc{$row->[PHN_COL_DNC]} = undef;
}
close $dnc;

while (my $row = $csv->getline(*ARGV)) {
    $csv->print(*STDOUT, $row) unless exists $dnc{$row->[PHN_COL_CUSTOMERS]};
}
If that does not meet your performance expectations, you can go down the C road, but I would definitely recommend using good CSV-parsing and hashmap libraries. I would try libcsv and khash.h.
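To give a rough idea of the khash.h half of that suggestion, here is a minimal sketch (untested; it assumes one phone number per line on both inputs, where a real program would use a CSV parser such as libcsv, and error handling is omitted):
/* Build a hash set of don't-call numbers with khash.h (from klib), then
 * print only the stdin lines whose number is not in the set: O(n+m). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "khash.h"

KHASH_SET_INIT_STR(dnc) /* a string-keyed hash set type named "dnc" */

int main(int argc, char **argv)
{
    khash_t(dnc) *set = kh_init(dnc);
    char line[256], key[256];
    int ret;

    FILE *f = fopen(argv[1], "r");            /* the don't-call list */
    while (fgets(line, sizeof line, f)) {
        line[strcspn(line, "\r\n")] = '\0';
        kh_put(dnc, set, strdup(line), &ret); /* average O(1) insert */
    }
    fclose(f);

    while (fgets(line, sizeof line, stdin)) { /* the customer records */
        strcpy(key, line);
        key[strcspn(key, "\r\n")] = '\0';
        if (kh_get(dnc, set, key) == kh_end(set)) /* not in the list */
            fputs(line, stdout);
    }
    return 0;
}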
