How to bypass dotted abbreviations with awk?

I have a line I want to separate into sentences using awk. I've set the field separator to '.' with -F. and used a loop to print the captured sentences. But, as expected, it also splits the dotted abbreviations.
For example, I have this line:
I was born in 1990. Specifically Aug. 13, 1990. Etc etc etc.
What it does is it will output:
I was born in 1990
Specifically Aug
13, 1990
Etc etc etc
Even though what I want is:
I was born in 1990
Specifically Aug. 13, 1990
Etc etc etc
What is the simplest method to skip such abbreviations? Is a plain . for -F not enough?
EDIT
The abbreviated words are month names.

$ awk -v RS='.' '{gsub(/^ +/,"")} /(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$/{printf "%s. ",$0; next} /[^[:space:]]/{print $0 "."}' input.txt
I was born in 1990.
Specifically Aug. 13, 1990.
Etc etc etc.
How it works
-v RS='.'
Use the period as a record separator.
gsub(/^ +/,"")
Remove any leading spaces from records.
/(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$/{printf "%s. ",$0; next}
If a record ends with a month abbreviation, print the record followed by a period and a space but no newline. Skip the remaining commands and jump to the next record.
/[^[:space:]]/{print $0 "."}
If the record contains any non-blanks, print it followed by a period.
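If you'd rather not rely on awk's record-separator handling, the same idea can be sketched with sed: mask the period after a month abbreviation with a placeholder, split on the remaining periods, then restore it. This is only a sketch; it assumes GNU sed (for \n in the replacement) and that the placeholder string @P@ never occurs in the text.

```shell
printf 'I was born in 1990. Specifically Aug. 13, 1990. Etc etc etc.\n' |
sed -E 's/(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\./\1@P@/g' |  # mask month periods
sed 's/\. */.\n/g' |                                                       # split into sentences
sed 's/@P@/./g'                                                            # restore the masked periods
```

This prints one sentence per line, with the "Aug." abbreviation left intact.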

Related

How can I avoid newlines after array elements when using readarray?

I've got a text file
Today, 12:34
Today, 21:43
Today, 12:43
https://123
https://456
https://789
and wanted to print each line into an array. Therefore I used:
readarray array <'file.txt'
Now I'd like to create a new array mixing date and the corresponding link, so in this case, line 1 corresponds with line 4 and so on.
I wrote
declare -a array2
array2[0]=${array[0]}${array[3]}
array2[1]=${array[1]}${array[4]}
...
printing the whole array2 using "echo ${array2[*]}" gets the following:
Today, 12:34
https://123
Today, 21:43
https://456
Today, 12:43
https://789
Why are there newlines between the elements, so e.g. between array2[0] and array2[1] ? How could I get rid of them?
And why is there an empty space before T in the second and the following lines?
And is there a possibility to write the code above in a loop?
Use the -t argument to prevent the newlines from being included in the data stored in the individual array elements:
readarray -t array <file.txt
BTW, you can always strip your newlines after the fact, even if you don't prevent them from being read in the first place, by using the ${var%suffix} parameter expansion with the $'\n' syntax to refer to a literal newline:
array2[0]=${array[0]%$'\n'}${array[3]%$'\n'}
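To answer the loop question as well, here is a minimal sketch of the pairing done in a bash loop. It assumes, as in your example, that the first half of the file holds the dates and the second half holds the matching links (the sample file.txt below is reconstructed from your question):

```shell
cat >file.txt <<'EOF'
Today, 12:34
Today, 21:43
Today, 12:43
https://123
https://456
https://789
EOF

readarray -t array <file.txt          # -t strips the trailing newlines
half=$(( ${#array[@]} / 2 ))          # dates in the first half, links in the second
declare -a array2
for ((i = 0; i < half; i++)); do
    array2[i]="${array[i]} ${array[i + half]}"
done
printf '%s\n' "${array2[@]}"
```

Because -t already removed the newlines, the elements concatenate cleanly with a single space between date and link.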
An awk solution would be much simpler (and much faster). Simply read all lines containing "Today" into an array in awk. Then, beginning with the first line that does not contain "Today", print the current line followed by the associated line from the array, e.g.
awk '/Today/{a[++n] = $0; next} {printf "%s\t%s\n", $0, a[++m]}' file.txt
Example Use/Output
With your example lines in file.txt, you would receive:
$ awk '/Today/{a[++n] = $0; next} {printf "%s\t%s\n", $0, a[++m]}' file.txt
https://123 Today, 12:34
https://456 Today, 21:43
https://789 Today, 12:43
Or if you wanted to change the order:
$ awk '/Today/{a[++n] = $0; next} {printf "%s\t%s\n", a[++m], $0}' file.txt
Today, 12:34 https://123
Today, 21:43 https://456
Today, 12:43 https://789
Addition Per-Comment
If you are seeing whitespace at the start of the awk output, that is due to whitespace before the first field in the input. To eliminate it, you can force awk to recalculate each field, removing the whitespace, simply by setting a field equal to itself, e.g.
awk '{$1 = $1} /Today/{a[++n] = $0; next} {printf "%s\t%s\n", a[++m], $0}' file.txt
By setting the first field equal to itself ($1 = $1), you force awk to recalculate each field which would eliminate leading whitespace. Take for example your data with leading whitespace (each line is preceded by 3-spaces):
Today, 12:34
Today, 21:43
Today, 12:43
https://123
https://456
https://789
Using the updated command gives the answers shown above with the whitespace removed.
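A quick way to see the re-split in isolation: pipe two indented sample lines through the updated command. This is just a sketch with made-up input:

```shell
# The $1 = $1 assignment forces awk to rebuild $0 from its fields,
# which drops the leading (and squeezes internal) whitespace.
printf '   Today, 12:34\n   https://123\n' |
awk '{$1 = $1} /Today/{a[++n] = $0; next} {printf "%s\t%s\n", a[++m], $0}'
```

Without the `{$1 = $1}` block, the three leading spaces would survive into both stored and printed lines.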
Using paste
You can use the paste command as another option, along with wc -l (count lines). Simply determine the number of lines, then use process substitution to output the first half of the lines followed by the last half, and combine them with paste, e.g.
$ lc=$(wc -l <file.txt); paste <(head -n $((lc/2)) file.txt) <(tail -n $((lc/2)) file.txt)
Today, 12:34 https://123
Today, 21:43 https://456
Today, 12:43 https://789
(above, lc holds the line-count and then head and tail are used to split the file)
Let me know if you have questions or if this isn't what you were attempting to do.

Change Git Log Plus/Minus Signs to Anything Custom?

Git command-line noob here: how do I change the default plus/minus (+/-) signs to something more unique, such as (>>>/<<<) or (|/~)? Or any other symbol not as common as (+/-)!
Reason: I am trying to automate a report that collects all the changes to our schema.sql files. I have the line below that does an adequate job:
git log -p --since="14 days ago" -- *Schema*.sql
My only real issue with the output is the plus/minus (+/-) signs which are used to show what has been added or removed:
+ This line was added
- This line was removed
Comments in SQL (t-SQL) are two minus signs (--), so when a comment is removed, I end up with this:
--- This comment was removed
If I can substitute the (+/-) with a unique value I can format the results and make a nice, pretty report for the people that want to see things like that. Thanks in advance!
--output-indicator-new=<char>
--output-indicator-old=<char>
--output-indicator-context=<char>
Specify the character used to indicate new, old, or context lines in the generated patch. Normally they are +, - and ' ' respectively.
https://git-scm.com/docs/git-log#_common_diff_options
I don't know if git can do this natively, but you can certainly achieve what you want by piping the output of git log into sed. For example, to change the plus to '$' and the minus to '%' in your report (the ^ anchor matches at most once per line, so no /g flag is needed):
git log -p --since="14 days ago" -- *Schema*.sql | sed -e 's/^+/$/' -e 's/^-/%/'
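To see the substitution in isolation, without a real repository, you can feed sed some sample diff lines; the >>>/<<< markers here are just the examples from the question. Note that a removed SQL comment ("--- This comment was removed" in the diff) stays distinguishable after the rewrite:

```shell
printf '%s\n' \
    '+ This line was added' \
    '- This line was removed' \
    '--- This comment was removed' |
sed -e 's/^+/>>>/' -e 's/^-/<<</'   # only the first character of each line is rewritten
```

Keep in mind that an anchored ^+ or ^- also rewrites the +++/--- file-header lines of each diff, which may or may not matter for your report.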

How to get local (regional) date format in linux?

Is it possible (and if so, how?) to get the local (regional) date format? Ideally in a cross-platform way; otherwise, at least Linux would be enough for a start.
What I am talking about: for example, this line, when executed in a terminal, returns the date (and time) formatted in the local (regional) manner:
date +"%c"
What I would like to have, instead of the numbers, is the form in which the date is displayed. For example, if I set my regional settings to Lithuanian ones I get:
2016 m. birželio 27 d. 19:06:11
So I would like to get this instead of the above:
YYYY MM DD
If I set regional settings to US ones:
Mon 27 Jun 2016 07:09:24 PM EEST
In this case instead of the above I would like to get:
DD MM YYYY
Meaning: not the actual numbers, but how the local (regional) date is formatted.
I later want to use this information for user-facing input/output operations.
While Joachim's hint is correct, here is a solution for your original question.
Just enter in bash:
locale -k LC_TIME | grep ^d_fmt | cut -d= -f2
If you need the time format instead of the date format, use t_fmt instead of d_fmt; for the combined date/time format, use d_t_fmt.
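Putting it together, you can feed the retrieved format straight back into date(1). A small sketch (the tr -d '"' strips the quotes that locale -k puts around the value):

```shell
# Fetch the locale's date format, e.g. "%m/%d/%y" in the C locale,
# then print today's date using that format.
fmt=$(locale -k LC_TIME | grep '^d_fmt' | cut -d= -f2 | tr -d '"')
date +"$fmt"
```

This way the script follows whatever LC_TIME the user has set, which is exactly the user-facing behavior the question asks for.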

Bash display files by date

I am creating a bash script to print a long listing of the files in the login directory for a specific month. The user is prompted to enter the first 3 letters of a month name, starting with a capital, and the program will display a long listing of all files that were last modified in that month.
For example, if user entered “Jul”, all the files that were last modified in July will be listed.
Is it possible to sort files by date and then limit them? or can it be done differently?
Take a look at this answer: https://stackoverflow.com/a/5289636/851273
It covers both month and year, though you can remove the match against the year.
read mon
ls -la | grep "$mon"
You can use grep -i for a case-insensitive match, so the user's input can be in any case.
Note: This is crude because it returns any line containing the month name; for example, it will also match files that are named after a month. To refine this, you would have to look only at the date column.
Here is the script that should do it
Month=Dec
ls -ltr |awk '$6 ~ /'$Month'/ {print $9}'
This has awk look at the date field of the ls output ($6); ls -ltr sorts the listing by date. The shell expands the variable $Month into the awk program, which uses it to match against the sixth field and prints the file name (the ninth field, $9).
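Parsing ls output is fragile, since its date column varies between systems and with file age. A sketch that instead asks date for each file's modification month, assuming GNU date with its -r FILE option (the sample files and the fixed mon=Jul are illustrative only):

```shell
mon=Jul        # in the real script this would come from: read mon
for f in *; do
    # date -r FILE prints the file's modification time;
    # %b gives the abbreviated month name (locale-dependent)
    if [ "$(date -r "$f" +%b)" = "$mon" ]; then
        ls -ld "$f"
    fi
done
```

This matches on the actual modification time rather than on text anywhere in the listing, so files merely named after a month are not picked up.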

Sorting by unique values of multiple fields in UNIX shell script

I am new to unix and would like to be able to do the following but am unsure how.
Take a text file with lines like:
TR=P567;dir=o;day=su;TI=12:10;stn=westborough;Line=worcester
TR=P567;dir=o;day=su;TI=12:10;stn=westborough;Line=lowell
TR=P567;dir=o;day=su;TI=12:10;stn=westborough;Line=worcester
TR=P234;dir=o;day=su;TI=12:10;stn=westborough;Line=lowell
TR=P234;dir=o;day=su;TI=12:10;stn=westborough;Line=lowell
TR=P234;dir=o;day=su;TI=12:10;stn=westborough;Line=worcester
And output this:
TR=P567;dir=o;day=su;TI=12:10;stn=westborough;Line=worcester
TR=P567;dir=o;day=su;TI=12:10;stn=westborough;Line=lowell
TR=P234;dir=o;day=su;TI=12:10;stn=westborough;Line=lowell
TR=P234;dir=o;day=su;TI=12:10;stn=westborough;Line=worcester
I would like the script to find, for each TR value, all the lines that have a unique Line value.
Thanks
Since you are apparently O.K. with randomly choosing among the values for dir, day, TI, and stn, you can write:
sort -u -t ';' -k 1,1 -k 6,6 -s < input_file > output_file
Explanation:
The sort utility, "sort lines of text files", lets you sort/compare/merge lines from files. (See the GNU Coreutils documentation.)
The -u or --unique option, "output only the first of an equal run", tells sort that if two input-lines are equal, then you only want one of them.
The -k POS[,POS2] or --key=POS1[,POS2] option, "start a key at POS1 (origin 1), end it at POS2 (default end of line)", tells sort where the "keys" are that we want to sort by. In our case, -k 1,1 means that one key consists of the first field (from field 1 through field 1), and -k 6,6 means that one key consists of the sixth field (from field 6 through field 6).
The -t SEP or --field-separator=SEP option tells sort that we want to use SEP — in our case, ';' — to separate and count fields. (Otherwise, it would think that fields are separated by whitespace, and in our case, it would treat the entire line as a single field.)
The -s or --stable option, "stabilize sort by disabling last-resort comparison", tells sort that we only want to compare lines in the way we've specified; if two lines have the same above-defined "keys", they're considered equivalent, even if they differ in other respects. Since we're using -u, that means one of them will be discarded. (If we weren't using -u, it would just mean that sort wouldn't reorder them with respect to each other.)
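If you want to keep the input order (so the P567 lines stay first, exactly as in your expected output), an awk one-liner that prints only the first line for each (field 1, field 6) combination is an alternative sketch; the input_file below is your sample data:

```shell
cat >input_file <<'EOF'
TR=P567;dir=o;day=su;TI=12:10;stn=westborough;Line=worcester
TR=P567;dir=o;day=su;TI=12:10;stn=westborough;Line=lowell
TR=P567;dir=o;day=su;TI=12:10;stn=westborough;Line=worcester
TR=P234;dir=o;day=su;TI=12:10;stn=westborough;Line=lowell
TR=P234;dir=o;day=su;TI=12:10;stn=westborough;Line=lowell
TR=P234;dir=o;day=su;TI=12:10;stn=westborough;Line=worcester
EOF

# !seen[$1 FS $6]++ is true only the first time a (TR, Line) pair
# appears, so each pair is printed once, in the original order.
awk -F';' '!seen[$1 FS $6]++' input_file
```

Unlike the sort -u solution, this preserves the order in which the TR groups appear in the file, at the cost of holding the seen keys in memory.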
