Character encoding problem


I was recently editing a Unicode-encoded text file that also includes Thai characters (alongside "normal" characters). For some reason, after each sequence of Thai characters, a new line appeared.
After some mucking around with C, trying to remove all newline characters, I fired up vim to inspect the file. Apparently, after each Thai character sequence, there appears a "^M" string (without quotes).
Why is this happening, and what's that "^M"? I've found that I can fix the problem by removing the last three characters from the Thai string, but there surely must be a more elegant way to fix this ...

This has nothing to do with the Thai characters in the file. The ^M ("caret M", i.e. Ctrl-M) is how vim displays a carriage return (CR, 0x0D), the first half of the CR LF line ending used by DOS and Windows. Run dos2unix on the file to get rid of these before editing it in vim.
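To see why a carriage return shows up as ^M, here is a minimal sketch of caret notation in C. The function name caret_notation is made up for this illustration; the rule itself (control byte XOR 0x40 gives the letter shown after the caret) is the usual convention editors such as vim follow.

```c
#include <assert.h>
#include <stddef.h>

/* Write the caret notation of byte c into out (room for 3 chars).
   Control characters 0x00-0x1F become '^' followed by (c XOR 0x40),
   DEL (0x7F) becomes "^?", and any other byte is copied unchanged.
   Returns the number of characters written, not counting the NUL.
   caret_notation is a hypothetical helper name. */
size_t caret_notation(unsigned char c, char out[3])
{
    assert(out != NULL);
    if (c < 0x20 || c == 0x7F) {
        out[0] = '^';
        out[1] = (char)(c ^ 0x40);   /* 0x0D ^ 0x40 == 0x4D == 'M' */
        out[2] = '\0';
        return 2;
    }
    out[0] = (char)c;
    out[1] = '\0';
    return 1;
}
```

Calling caret_notation('\r', out) fills out with the two characters ^ and M, which is exactly what appears at the end of each line of a CRLF file opened in Unix mode.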

Related

Exercise of indentation of C code

I am trying to improve the indentation of this C code. I am not picky as to the exact rules used for indentation, I just care that it looks like something more or less standard!
Interestingly, the code looks good on my browser but once downloaded it is clear that the indentation is caused by a mixture of "tabs" and "spaces" and it looks ugly on both SublimeText2 and TextWrangler.
I tried
indent file.c
but it returns a long series of errors that seem related to the presence of a backslash followed by a newline ("\\n" or "\\\n" if the backslash needs to be escaped) within strings. So I tried to remove those with
tr "\\\n" " " < file.c > newfile.c
but I did not quite reach my goal.

When I download the file, it has CRLF (DOS or Windows style) line endings. When you run that through Unix indent, it doesn't like the backslashes followed by CR instead of NL (aka LF). If you replace the CRLF line endings, it should be formattable by indent.
The problem is that indent expects the backslashes to be followed by a newline, but CR is not a newline.
If you have a dos2unix or dtou command, use that. If not, use tr:
tr -d '\015' < sfs_code.c > x31; mv x31 sfs_code.c
You can also improve things by expanding the tabs to spaces, with tab stops every 8 columns. Again, there may be other tools that can do the job, but the classic tool is pr:
pr -e8 -l1 -t sfs_code.c > x31; mv x31 sfs_code.c
The backslash-newline sequences for splitting strings over multiple lines are very 1980s-style coding. Since the C89/C90 standard introduced string concatenation, they aren't necessary (though they are still recognized as legitimate C).
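As a small illustration of the two spellings, the sketch below defines the same string once with a backslash-newline continuation and once with adjacent string literals. The string contents and the name check_same_text are invented for the example and are not taken from sfs_code.c.

```c
#include <assert.h>
#include <string.h>

/* 1980s style: a backslash immediately before the newline splices the lines.
   The continuation has to start in column 1, otherwise its leading spaces
   would become part of the string. */
static const char continued[] = "usage: sfs [options] \
file";

/* C89 and later: adjacent string literals are concatenated by the compiler,
   so the second line can be indented freely. */
static const char concatenated[] = "usage: sfs [options] "
                                   "file";

/* Aborts if the two spellings do not produce the same string.
   check_same_text is a hypothetical name for this demonstration. */
void check_same_text(void)
{
    assert(strcmp(continued, concatenated) == 0);
}
```

Note that the first form only works if the backslash is really the last character before the newline, which is why a stray CR in a CRLF file breaks it.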
At 6781 lines, that's a pretty big file. It includes a number of other source files (.c extensions) and some headers. I poked at it with uncrustify which had no problem with it; after removing the CR characters, indent was fine with it too.
(I used my own ule — Uniform Line Endings — program to do the conversion. It's one of a lot of different possible techniques.)
Note that your tr command failed because it mapped both backslash and newline to blank (and left the CR characters unchanged), which is not what you want. The tr command maps single characters; it is not appropriate for mapping combinations of characters as you wanted and needed.
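If neither dos2unix nor tr is at hand, the same clean-up is easy to sketch in C. The function below is an assumption-laden illustration, not the ule program mentioned above: strip_cr is a made-up name, and like tr -d '\015' it removes every CR in the buffer, not only those that precede a newline.

```c
#include <assert.h>
#include <stddef.h>

/* Remove every CR (0x0D) from buf[0..len) in place, like tr -d '\015'.
   Returns the new length. The buffer is treated as raw bytes and is not
   NUL-terminated by this function. strip_cr is a hypothetical helper. */
size_t strip_cr(char *buf, size_t len)
{
    size_t out = 0;

    assert(buf != NULL || len == 0);
    for (size_t in = 0; in < len; in++) {
        if (buf[in] != '\r')
            buf[out++] = buf[in];   /* keep everything except CR */
    }
    return out;
}
```

After this pass a backslash that used to be followed by CR LF is followed directly by LF, which is the form indent expects.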

Difference between linefeed and newline [duplicate]

What is the meaning of the following control characters:
Carriage return
Line feed
Form feed

Carriage return means to return to the beginning of the current line without advancing downward. The name comes from a printer's carriage, as monitors were rare when the name was coined. This is commonly escaped as "\r", abbreviated CR, and has ASCII value 13 or 0xD.
Linefeed means to advance downward to the next line; however, it has been repurposed and renamed. Used as "newline", it terminates lines (commonly confused with separating lines). This is commonly escaped as "\n", abbreviated LF or NL, and has ASCII value 10 or 0xA. CRLF (but not CRNL) is used for the pair "\r\n".
Form feed means advance downward to the next "page". It was commonly used as page separators, but now is also used as section separators. Text editors can use this character when you "insert a page break". This is commonly escaped as "\f", abbreviated FF, and has ASCII value 12 or 0xC.
As control characters, they may be interpreted in various ways.
The most important interpretation is how these characters delimit lines. Lines end with NL on Unix (including OS X), CRLF on Windows, and CR on older Macs. Note the shift in meaning from LF to NL, for the exact same character, gives the differences between Windows and Unix, which is also why many Windows programs use CRLF to separate instead of terminate lines. Many text editors can read files in any of these three formats and convert between them, but not all utilities can.
Form feed is much less commonly used. As page separator, it can only come between lines or at the start or end of the file.
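The three conventions above can be told apart by counting which terminators occur in a file's bytes. The following is a rough sketch under that assumption; the enum and the name detect_eol are invented for the example, and a real tool would probably want to report counts rather than a single verdict.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of the line endings found in a buffer. */
enum eol_style { EOL_NONE, EOL_LF, EOL_CRLF, EOL_CR, EOL_MIXED };

/* Scan buf[0..len) and report which line-ending convention it uses. */
enum eol_style detect_eol(const char *buf, size_t len)
{
    size_t lf = 0, crlf = 0, cr = 0;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\r') {
            if (i + 1 < len && buf[i + 1] == '\n') {
                crlf++;
                i++;              /* skip the LF that belongs to this CRLF */
            } else {
                cr++;             /* lone CR, classic Mac OS style */
            }
        } else if (buf[i] == '\n') {
            lf++;                 /* lone LF, Unix style */
        }
    }
    if (lf + crlf + cr == 0) return EOL_NONE;
    if (crlf == 0 && cr == 0) return EOL_LF;
    if (lf == 0 && cr == 0)   return EOL_CRLF;
    if (lf == 0 && crlf == 0) return EOL_CR;
    return EOL_MIXED;
}
```

For instance, detect_eol("a\r\nb\r\n", 6) classifies the buffer as EOL_CRLF, while a buffer that mixes "\r\n" and bare "\n" comes back as EOL_MIXED.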

\r is carriage return and moves the cursor back to the start of the current line. For example,
printf("stackoverflow\rnine");
prints
ninekoverflow
because the cursor returns to the beginning of "stackoverflow" and "nine", being four characters long, overwrites the first four characters.
\n is the newline character, which ends the line and takes the cursor to the beginning of a new line:
printf("stackoverflow\nnine");
stackoverflow
nine
\f is form feed. Its use is largely obsolete, but on many terminals it moves the cursor down one line without returning to the left margin, which gives an indentation effect:
printf("stackoverflow\fnine");
stackoverflow
             nine
With two form feeds,
printf("stackoverflow\fnine\fgreat");
stackoverflow
             nine
                 great

In short:
Carriage return (\r or 0xD): moves the cursor to the start of the current line.
Line feed (\n or 0xA): moves the cursor to the start of the next line.
Form feed (\f or 0xC): moves the cursor to the start of the next page.
More details and more control characters can be found on the following page: Control character

Have a look at Wikipedia:
Systems based on ASCII or a compatible character set use either LF (Line feed, '\n', 0x0A, 10 in decimal) or CR (Carriage return, '\r', 0x0D, 13 in decimal) individually, or CR followed by LF (CR+LF, 0x0D 0x0A). These characters are based on printer commands: The line feed indicated that one line of paper should feed out of the printer, and a carriage return indicated that the printer carriage should return to the beginning of the current line.

\f is used for page break.
You cannot see any effect in the console, but when the character ends up in a file you can see the difference.
You also don't need any file-handling code to get it there: it is enough to redirect the program's output to a file.
For example, write this code in C++:
#include <iostream>

int main()
{
    std::cout << "helloooooo";
    std::cout << "\f";   // form feed: requests a page break
    std::cout << "hiiiii";
    return 0;
}
When you compile this, it generates an executable (for example abc.exe).
You can then redirect its output to a file:
abc > xyz.doc
Open xyz.doc and you can see the actual page break between "helloooooo" and "hiiiii".

On old paper-printer terminals, advancing to the next line involved two actions: moving the print head back to the beginning of the horizontal scan range (carriage return) and advancing the roll of paper being printed on (line feed).
Since we no longer use paper-printer terminals, those actions aren't really relevant anymore, but the characters used to signal them have stuck around in various incarnations.

Apart from above information, there is still an interesting history of LF (\n) and CR (\r).
[Original author : 阮一峰 Source : http://www.ruanyifeng.com/blog/2006/04/post_213.html]
Before computers came along, there was a teleprinter called the Teletype Model 33. It could print 10 characters per second, but there was one problem: after finishing a line, it took 0.2 seconds to move to the next line, which is the time needed to print 2 characters. If a new character arrived during those 0.2 seconds, it was lost.
So people found a way around the problem: they added two terminating characters after each line. One is "carriage return", which tells the printer to bring the print head back to the left; the other is "line feed", which tells the printer to move the paper up one line.
Later, when computers became popular, these two concepts carried over to them. Storage was very expensive at the time, so some argued that two characters at the end of each line were wasteful and one was enough, which led to disagreement about which one to use.
On Unix/Mac and Linux, '\n' is put at the end of each line; on Windows, '\r\n' is. As a consequence, Unix/Mac files may be displayed as a single line when opened on Windows, while Windows files show a ^M at the end of each line when opened on Unix or Mac.

Consider an IBM 1403 impact printer. CR moved the print head to the start of the line, but did NOT advance the paper. This allowed for "overprinting", placing multiple lines of output on one line. Things like underlining were achieved this way, as was BOLD print. LF advanced the paper one line. If there was no CR, the next line would print as a staggered-step because LF didn't move the print head. FF advanced the paper to the next page. It typically also moved the print head to the start of the first line on the new page, but you might need CR for that. To be sure, most programmers coded CRFF instead of CRLF at the end of the last line on a page because an extra CR created by FF wouldn't matter.

As a supplement,
1. Carriage return: a printer term meaning to move the print position to the beginning of the current line. In the computer world it usually means returning to the beginning of the current line, and only rarely stands for a new line.
2. Line feed: a printer term meaning to advance the paper one line. Carriage return and line feed are therefore used together to start printing at the beginning of a new line. In the computer world it generally has the same meaning as newline.
3. Form feed: a printer term. I like the explanation in this thread:
If you were programming for a 1980s-style printer, it would eject the
paper and start a new page. You are virtually certain to never need
it.
http://en.wikipedia.org/wiki/Form_feed
It's almost obsolete; see "Escape sequence \f - form feed - what exactly is it?" for a detailed explanation.
Note that CR, LF and CRLF each stand for newline on some platforms but not on others. Refer to the Wikipedia article on Newline for details:
LF: Multics, Unix and Unix-like systems (Linux, OS X, FreeBSD, AIX,
Xenix, etc.), BeOS, Amiga, RISC OS, and others
CR: Commodore 8-bit machines, Acorn BBC, ZX Spectrum, TRS-80, Apple
II family, Oberon, the classic Mac OS up to version 9, MIT Lisp
Machine and OS-9
RS: QNX pre-POSIX implementation
0x9B: Atari 8-bit machines using ATASCII variant of ASCII (155 in
decimal)
CR+LF: Microsoft Windows, DOS (MS-DOS, PC DOS, etc.), DEC TOPS-10,
RT-11, CP/M, MP/M, Atari TOS, OS/2, Symbian OS, Palm OS, Amstrad CPC,
and most other early non-Unix and non-IBM OSes
LF+CR: Acorn BBC and RISC OS spooled text output.

"\n" is the linefeed character. It means end the present line and go to a new line for anyone who is reading it.

Carriage return and line feed are also references to typewriters, in that with a small push on the handle on the left side of the carriage (the place where the paper goes), the paper would rotate a small amount around the cylinder, advancing the document one line. If you had finished typing one line and wanted to continue on to the next, you pushed harder, both advancing a line and sliding the carriage all the way to the right, then resumed typing left to right as the carriage traveled with each keystroke. Needless to say, word-wrap was the default setting for all word processing of the era. P:D

Those are non-printing characters relating to the concept of a "new line". \n is line feed and \r is carriage return. Which of them forms a valid new line depends on the platform: on Windows a new line is \r\n, on Linux \n, and on the classic Mac OS \r.
In practice, you put them in any string, and it will have effect on the print-out of the string.

When I was an apprentice in the Royal Signals many (50) years ago, teletypes and typewriters had a "Carriage" with the printing head on it. When you pressed RETURN the Carriage would fly to the left. Hence Carriage Return (CR). You could just return the Carriage, but on mechanical typewriters you'd use the Lever (much like a tremolo lever on an electric guitar), which would also do the Line Feed. Your next question is why would you not want the line feed? Heh heh, well, in those days to delete characters we'd do a CR, then put a Tipp-Ex-like paper in between the hammerheads and the paper and type the same keys to overwrite with white ink. Some fancy typewriters had a key you could press. So there you go.

Leading Zeroes Getting Trimmed while loading data into excel using unix

Leading zeroes are getting trimmed while loading data into Excel using Unix. My platform is Mac. Is there any way to handle this in Unix without manual effort?
Thanks

The leading zeroes are removed because strings containing only digits are converted into numbers. Nothing stops you from changing the number format to show leading zeroes, or from converting the column to text.

Convert the numbers into strings with something like
cat inputfile | while IFS= read -r line
do
   quoteLine=$(printf '%s\n' "${line}" | sed 's/,/","/g')
   # do not forget the first and last quote, some ugly backslashes here
   echo "\"${quoteLine}\""
done > outputfile.csv
In your case (TAB-separated), you can replace the first , in the sed command with a TAB. That is difficult to see when you copy-paste, so you can do it after copying.

C How do i specify a POSIX regex that begins in a blank line and ends in a blank line?

I am trying to write code to scan a file and produce a "match!" message when the tool reads a certain line of code preceded and followed by blank lines. The line I am interested in matching is:
Appliance Version 3.1.2
Using regex.h, I have a simple tool that compiles my regex pattern then executes it against every line in the file to search for a match. The basic functionality of the tool is fine: I am able to get it to successfully search for various regex matches. Trouble arises when I try to match a regex containing a blank line before and after the above line of text. Here is my precompiled regex:
[[:space:]]+\n^Appliance Version [[:alnum:]]$\n
I have tried a series of different combinations similar to this, and nothing seems to work. I think it might have to do with \n in which case I would need to figure out a new way to specify the two blank lines. Any insight of POSIX regex would be greatly appreciated!

Looking at your regex, it looks like it is trying to match
Appliance Version [[:alnum:]]
at the end of a line ($). That would be matched by
Appliance Version 3
(3 is an instance of [:alnum:]), but not by
Appliance Version 33
([[:alnum:]] only matches one character), and much less by
Appliance Version 3.1.2
(the above problem, and also . is not an instance of [:alnum:])
So at a minimum you need to change [[:alnum:]] to [.[:alnum:]]* (or some such).
In addition, your use of ^ and $ is redundant with the explicit \n, but nothing in the regex requires the match to be preceded or followed by a blank line. For example, [[:space:]]\n would happily be matched with the line:
Not a blank line, but with a blank at the end: \n
(where I've written the \n explicitly to show the blank character at the end of the line.)
Matching blank lines
A single blank line is matched with ^[[:space:]]*$. That does not match the newlines at either end. If you want to match a blank line before something, use: ^[[:space:]]*\nSOMETHING. To match a blank line after something: SOMETHING\n[[:space:]]*$. Or, if you really want a blank line before and after: ^[[:space:]]*\nSOMETHING\n[[:space:]]*$. (But that won't match if SOMETHING happens to be the first line of the input, for example. Or the last line.)
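Since the tool in the question runs regexec against one line at a time, no single-line pattern can ever see the neighbouring blank lines. One way around that, sketched below, is to test three consecutive lines: blank, target, blank. This is only an illustration; the name find_isolated_line and the array-of-lines interface are assumptions, and it uses the pattern ^[[:space:]]*$ from above for the blank lines.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return the index of the first line that matches target_pattern and has a
   blank line directly before and directly after it, or -1 if there is none
   (or if a pattern fails to compile). lines[] holds the file's lines without
   their trailing newlines. find_isolated_line is a hypothetical helper. */
int find_isolated_line(const char *const lines[], size_t count,
                       const char *target_pattern)
{
    regex_t blank, target;
    int found = -1;

    assert(lines != NULL && target_pattern != NULL);
    if (regcomp(&blank, "^[[:space:]]*$", REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    if (regcomp(&target, target_pattern, REG_EXTENDED | REG_NOSUB) != 0) {
        regfree(&blank);
        return -1;
    }
    /* The first and last lines can never qualify: they lack a neighbour. */
    for (size_t i = 1; i + 1 < count; i++) {
        if (regexec(&blank, lines[i - 1], 0, NULL, 0) == 0 &&
            regexec(&target, lines[i], 0, NULL, 0) == 0 &&
            regexec(&blank, lines[i + 1], 0, NULL, 0) == 0) {
            found = (int)i;
            break;
        }
    }
    regfree(&blank);
    regfree(&target);
    return found;
}
```

With the target pattern "^Appliance Version [[:alnum:].]+$", a file whose lines are "header", "", "Appliance Version 3.1.2", "  ", "footer" yields index 2, and the same version line without a blank line above it yields -1.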

As @rici notes, you cannot combine \n^ to match two blank lines: the markers ^ and $ match a position, not a literal \n character.
To match a blank line, use \n\n. In regex flavors that support lookbehind (PCRE, for example; POSIX regex.h does not), (?<=\n)\n at the start is better, because you probably don't want to do anything with the hard return that ends the line above. You can leave the \n\n at the end, though.

\n characters before EOF in file causing problems

I have written some data to a file manually i.e. not by my application.
My code is reading the data char by char and storing them in different arrays but my program gets stuck when I insert the condition EOF.
After some investigation I found out that in my file before EOF there are three to four \n characters. I have not inserted them. I don't understand why they are in my file.

Want to remove those pesky extra characters? First, see how many of them there are at the end of your file:
od -c <filename> | tail
Then, remove however many characters you don't like. If it's 3:
truncate -s -3 <filename>
But overall, if it were me, I'd change my program to discard undesired newline characters, unless they're truly invalid according to the input file format specification.

It is very easy to add additional newlines to the end of a file in every text editor. You have to push the cursor around to see them. Open your file in your editor and see what happens when you navigate to the end, you'll see the extra newlines.
There is no such thing as an EOF character in general. Windows treats control-Z as EOF in some cases. Perhaps you are talking about the return value from some API that indicates that it has reached the end of file?
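Both points can be combined in a short C sketch: EOF is a value returned by getc, not a byte stored in the file, and trailing newlines can simply be discarded after reading. The function name read_trimmed and its buffer-based interface are assumptions made for this example; it also drops trailing CR characters, which goes slightly beyond what the question asks.

```c
#include <assert.h>
#include <stdio.h>

/* Read fp into buf (capacity cap, including the NUL terminator), then drop
   any newline or carriage-return characters at the very end.
   Returns the length of the trimmed text.
   read_trimmed is a hypothetical helper name. */
size_t read_trimmed(FILE *fp, char *buf, size_t cap)
{
    size_t len = 0;
    int c;   /* int, not char: EOF is a return value outside the char range */

    assert(fp != NULL && buf != NULL && cap > 0);
    /* Check for space first so no character is read and then thrown away. */
    while (len + 1 < cap && (c = getc(fp)) != EOF)
        buf[len++] = (char)c;
    /* Discard the extra line terminators an editor may have left behind. */
    while (len > 0 && (buf[len - 1] == '\n' || buf[len - 1] == '\r'))
        len--;
    buf[len] = '\0';
    return len;
}
```

Storing the result of getc in a char instead of an int is a common reason for loops that never see EOF or stop too early, so that is worth checking in the original program as well.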
