Processing files in C under different operating systems


I am working with files in C. They are very, very huge files.
I read and write to these files (they are just formatted text files).
I give no extension to them;
I just say fopen("filename","r").
Now a file has contents like
1 line1 100
2 line2 200
.
.
20000 line 20000 14567
The problem is that I run the program on Mac OS (Leopard, to be particular), and when I open the same file (no extension) on another OS (Windows, to be particular), the program fails to read the contents of the file.
I suppose it's because the file format differs between the two.
Is there a standard file format or extension that does not conflict across operating systems?

If you're working with plaintext files, remember that in Unix and Unix-like OSes lines end with \n and in Windows they end with \r\n.
If you do a file transfer as plaintext between operating systems, your client may change the line endings to be compatible with your target OS. You can set the transfer to binary so that you get an exact byte representation of the original file.
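
If the file may arrive with either kind of line ending, one option is to strip the terminator after reading each line, so the parsing code never sees a stray \r. The following is a minimal sketch, assuming each record looks like "<id> <name> <value>" as in the question; the helper names strip_eol and parse_record are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Cut the line at the first '\r' or '\n', which removes a trailing
 * "\n", "\r\n", or "\r" from a line obtained with fgets(). */
static void strip_eol(char *line)
{
    line[strcspn(line, "\r\n")] = '\0';
}

/* Parse one record of the assumed form "<id> <name> <value>".
 * Returns 1 on success, 0 if the line does not match that layout. */
static int parse_record(char *line, long *id, char name[32], long *value)
{
    strip_eol(line);
    return sscanf(line, "%ld %31s %ld", id, name, value) == 3;
}
```

A caller would typically read each line with fgets() into a buffer and pass that buffer to parse_record().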

Related

creating .mid file: writing a '\n' causes '\r\n' in Windows

I have created a .mid file by writing bytes to a file and saving it as .midi. I can play it and it works, but there are some special cases where it does not.
If I write a byte containing \n (ASCII 10), 2 bytes (\r\n) are written instead, which makes the .mid file unplayable. (This is normal behavior on a Windows machine, but not desirable in my case.) An example of writing \n is when the chosen key happens to be represented by the byte \n.
Is there a workaround to write \n and not \r\n, or another way to make sure that the byte written is ASCII 10 on a Windows machine?
Thanks!

On linux/unix, it doesn't matter whether you specify "wb" or "w" to create a file.
But creating a file with fopen in text mode on Windows means that every \n is converted to \r\n, so if you use this to create binary files, they will be "corrupt" wherever they contain a byte with value 10 (linefeed).
Simple solution: always use fopen("file.bin","wb") when creating a binary file, on all platforms so your code is portable.
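
A minimal sketch of that advice follows. The helper names write_bytes and file_size are hypothetical; the second one exists only so the result can be checked by counting the bytes that actually landed on disk.

```c
#include <stdio.h>

/* Write n raw bytes to path. Returns 0 on success, -1 on failure.
 * The "b" in the mode string disables newline translation on Windows
 * and has no effect on Unix-like systems. */
static int write_bytes(const char *path, const unsigned char *data, size_t n)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;
    size_t written = fwrite(data, 1, n, fp);
    if (fclose(fp) != 0 || written != n)
        return -1;
    return 0;
}

/* Count the bytes in a file, reading it in binary mode.
 * Returns -1 if the file cannot be opened. */
static long file_size(const char *path)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    long count = 0;
    while (fgetc(fp) != EOF)
        count++;
    fclose(fp);
    return count;
}
```

With "wb", a buffer of 6 bytes that includes a byte of value 10 should produce a 6-byte file on every platform; with "w" on Windows the same buffer would produce 7 bytes.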

Reading in the output of a program, saving it as a string, and using that string in the original program

I am trying to get specific information from a group of MP3 files. Currently I am in the main cygwin64 folder, which holds the MP3 files and a .C file that simply contains
FILE * fp;
It contains that single line of code because, when that line is in place and I type and run "thing.c" in the Cygwin command line, it outputs what seems to be information about the contents of the folder. For example, it outputs
home: sticky, directory
lib: directory
sbin: directory
setup-x86_64.exe: PE32+ executable (GUI) x86-64 (stripped to external PDB), for MS Windows
song.mp3: Audio file with ID3 version 2.3.0, contains: MPEG ADTS, layer III, v1, 128 kbps, 44.1 kHz, JntStereo
song1.mp3: Audio file with ID3 version 2.3.0, contains: MPEG ADTS, layer III, v1, 128 kbps, 44.1 kHz, JntStereo
thing.c: ASCII text, with CRLF line terminators
thing.txt: empty
What I want to do is pull that output into a string that I can use in my C file, alter it, and then print out the altered information. However, I'm not sure where the output is really coming from, or how I might capture it, save it as a .txt file, or get it back into a C file.
Any advice is appreciated Thanks!

This file is not really a C file at all. Because you're in Cygwin, you're likely operating on a case-insensitive filesystem (NTFS). As such, Cygwin's file command is running when you run the .c file. The way you've attempted to declare a variable (apparently) just so happens to be doing a 'file * fp' command. I'm sure you're getting fp: Cannot open "fp" or something similar after the rest of your output.
This is not anything C-related at all but is just being interpreted as a script by your shell.
It sounds like you have a lot to learn if you want to do this in C. More likely, you can probably write a shell script to accomplish what you want. While I've never used it, mp3info (https://github.com/jaalto/cygwin-package--mp3info) exists for pulling tag information from MP3 files. You could possibly get the exact information you want from that, or pipe the output into sed, awk, or a number of other tools.
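
If the goal really is to capture a command's output as a string inside a compiled C program, one possible approach on POSIX-like systems (which Cygwin provides) is popen(). This is only a sketch under that assumption; the function name capture_output is hypothetical, and output beyond the buffer size is simply dropped.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* asks for popen/pclose declarations */
#endif
#include <stdio.h>
#include <string.h>

/* Run cmd through the shell and copy its standard output into buf
 * (at most cap - 1 bytes, always NUL-terminated).
 * Returns the number of bytes stored, or -1 if the command
 * could not be started or the buffer has no room at all. */
static long capture_output(const char *cmd, char *buf, size_t cap)
{
    if (cap == 0)
        return -1;
    buf[0] = '\0';
    FILE *pipe = popen(cmd, "r");
    if (pipe == NULL)
        return -1;
    size_t used = fread(buf, 1, cap - 1, pipe);
    buf[used] = '\0';
    pclose(pipe);
    return (long)used;
}
```

For instance, calling capture_output("file *", buf, sizeof buf) from a program compiled with gcc would place the listing shown in the question into buf, where it can be edited and printed again.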

Does file type rely on file extension?

As a general question: What's the role of file extension when determining file types?
For example, I can change .jpeg file to .png extension and even .txt. Of course, in the case of changing to .txt, it will neither be opened as picture, nor readable.
To determine file type, it seems the safe way is to parse the first few bytes of the file. If extension is not trustable, extension is no more than file name.

As a general rule, you should ALWAYS parse the COMPLETE file in order to be sure that the file is what the extension says. As you can easily imagine, it is pretty simple to create a binary file that resembles, e.g., a BMP (with a correct header) but then contains something different.
You should never trust the extension or the header alone, because otherwise a malicious user could exploit your code to trigger, e.g., a buffer overflow. This is absolutely paramount if you are writing programs that must run with root/admin privileges.
Having said the obvious, the file extension nowadays is mainly used so that the OS can associate a program to that particular file (usually calling the program and passing the selected file as first parameter), and then it's up to the program to determine the file content.
It is a little bit different when talking about executable files. Under Unix, a file has to have the "x" flag set in order to be executable; otherwise it will not run, regardless of the extension. Under Windows, there is no such flag, and the OS relies on only a few extensions (EXE, COM, BAT, etc.) to determine which files can be executed.
The EXE file, for example, has to start with "MZ" followed by some information about its allocation and size (http://www.delorie.com/djgpp/doc/exe/), and the OS surely checks its internal headers. Other formats (e.g. the COM executable format of the MS-DOS era) are just "pure" machine code, so the OS does no check. The file is simply loaded and its opcodes executed, in the hope that everything will be fine.
So, to summarize:
The file extension is mainly used so that the OS can call the appropriate program to open the file (passing the filename as the first parameter, argc/argv in the C language for example).
Windows relies on a few file extensions to know whether a file is executable, while Unix/Mac relies on a particular flag (x) associated with the file.
Two things that are not well known about file extensions: directory names can have extensions too, and an extension can be much longer than the usual 3 characters.
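
As a rough illustration of looking at content rather than the extension, the sketch below compares the first bytes of a buffer against a few well-known signatures (PNG, JPEG, and the "MZ" marker mentioned above). The function name sniff_type is hypothetical, and, as this answer stresses, a matching header is only a first hint: it does not prove the rest of the file is valid.

```c
#include <stddef.h>
#include <string.h>

/* Guess a file type from its leading bytes.
 * Returns a short static string; "unknown" if nothing matches. */
static const char *sniff_type(const unsigned char *buf, size_t n)
{
    static const unsigned char png[] =
        {0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    static const unsigned char jpeg[] = {0xFF, 0xD8, 0xFF};

    if (n >= sizeof png && memcmp(buf, png, sizeof png) == 0)
        return "png";
    if (n >= sizeof jpeg && memcmp(buf, jpeg, sizeof jpeg) == 0)
        return "jpeg";
    if (n >= 2 && buf[0] == 'M' && buf[1] == 'Z')
        return "exe";
    return "unknown";
}
```

A caller would read the first few bytes of the file in binary mode ("rb") and pass them in; a JPEG renamed to .png would still be reported as "jpeg".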

With the help of file extension, you know how to read the first few and all the rest of the bytes. You also know what program to use to read the file. Or if it is an executable, you know that it is to be executed and not shown as a picture.
Yes, you can change the file extension, but what does that mean? The OS (or any program that tries to read the file) is still working correctly; it is you who are providing bad data to it.
A file extension is not something that the bytes of data inherently have. An extension is assigned to those bytes according to the format (protocol) used to write them. After you have encoded letters in binary form, you give that binary form a .txt extension so that a text reader knows these bytes convert to letters. That is the role of the file extension. With a wrong extension, this role is not fulfilled, and the data you saved can no longer be interpreted correctly.

As a general question: What's the role of file extension when determining file types?
The file extension usually identifies the application that opens a file.
If you rename a .JPG to a .PNG, and JPG and PNG files are opened by the same application (usually an image viewer), that application can read the image stream and process it correctly despite the incorrect file extension.
The problem arises if you rename the file in such a way that the file gets routed to an application that cannot handle the file's content.
If you rename a .DOCX (Word) file to an AutoCAD extension (.DWG), opening the Word file in AutoCAD is likely to produce errors (unless by chance AutoCAD can read Word files).

Auto detect OS in C and handle with their specific line breaks

Is there a way to detect the OS on which the C code is compiled, so that it can handle that OS's specific line break characters in text files? For example, if I compile my code on a Windows machine, it should use \r\n as the line break in text files; on Linux it should just use \n.
I need this for a program that reads text files in binary mode and matches substrings of the file against other strings. This should work on Windows and Linux.
Thanks for your help!

You don't need to know the native storage format. When reading a file, you cannot know whether it was created on Windows, Linux, or another system; it may have been created on a different system from the one you are working on. When writing in text mode, your program will use the native libraries for your OS and output whatever they deem appropriate for \n.
Reading a text file line-ending agnostically comes down to this:
use a binary mode rather than "text mode" (you seem to already do this).
read text until you encounter either an \r or \n.
if you get an \r, skip all next \n;
if you get an \n, skip all next \r.
This will work for line endings of \n (Linux and other Unix-like OSes such as Mac OS X), Windows-like \r\n and older Mac OS files ending with \r only. That covers about 99.99% of all "normal" text files you are likely to encounter. There used to be a very rare one that used \r\n\n (or possibly \n\r\r) but even that will be handled correctly.
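
The steps above can be sketched as follows. This is an illustrative helper (the name read_line_any_eol is made up), assuming the file was opened in binary mode and that truncating over-long lines is acceptable. Like the description, it skips every complementary character after a terminator, so a sequence such as \r\n\n counts as a single line ending.

```c
#include <stdio.h>

/* Read one line from fp (opened in binary mode) into buf, without its
 * terminator. Returns the stored length, or -1 at end of file when
 * nothing was read. Lines longer than cap - 1 bytes are truncated. */
static long read_line_any_eol(FILE *fp, char *buf, size_t cap)
{
    if (cap == 0)
        return -1;

    int c = fgetc(fp);
    if (c == EOF)
        return -1;

    size_t len = 0;
    while (c != EOF && c != '\r' && c != '\n') {
        if (len + 1 < cap)
            buf[len++] = (char)c;
        c = fgetc(fp);
    }

    if (c != EOF) {
        /* After '\r' skip all following '\n'; after '\n' skip all '\r'. */
        int skip = (c == '\r') ? '\n' : '\r';
        int next;
        while ((next = fgetc(fp)) == skip)
            ;
        if (next != EOF)
            ungetc(next, fp);  /* first byte of the next line */
    }

    buf[len] = '\0';
    return (long)len;
}
```

Calling it in a loop until it returns -1 yields the same sequence of lines whether the file uses \n, \r\n, or \r.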

The best way would be to check for a predefined macro and #ifdef on it.
You can print all the predefined macros using the command
gcc -dM -E - < /dev/null
and grep (case-insensitively) for "linux" or "win32".
I'd expect to find __linux__ defined on Linux machines and _WIN32 defined on Windows machines.
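
A minimal sketch of that compile-time check is shown below. The names NATIVE_EOL, PLATFORM_NAME, and native_eol are hypothetical; the fallback branch simply assumes a Unix-like system. Note that the resulting string is only meant for files opened in binary mode, since text mode on Windows already expands \n on its own.

```c
/* Pick the platform's line terminator at compile time. */
#if defined(_WIN32)
#  define NATIVE_EOL    "\r\n"
#  define PLATFORM_NAME "Windows"
#elif defined(__linux__)
#  define NATIVE_EOL    "\n"
#  define PLATFORM_NAME "Linux"
#else
#  define NATIVE_EOL    "\n"      /* assumption: some other Unix-like OS */
#  define PLATFORM_NAME "other"
#endif

/* Return the line terminator chosen for the compiling platform. */
static const char *native_eol(void)
{
    return NATIVE_EOL;
}

/* Return a human-readable name for the compiling platform. */
static const char *platform_name(void)
{
    return PLATFORM_NAME;
}
```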

Copying files to the server via FTP from the DOS prompt

I would like to know why the size of a file changes when we copy it to the server via FTP. What is the reason behind this? Can the change in file size corrupt the files and make the FTP process fail?

Most likely, you are copying between Windows and Unix, and the difference in size is due to the difference between CRLF and just LF for line endings.
If it is crucial to preserve the line endings, use BIN (binary) mode to transfer the data. The alternative is ASC (ASCII) mode, where the systems map line endings.
