The project I'm working on takes XML files and input streams and converts them to PDFs and text. In the unit tests I compare this generated text with a .txt file that contains the expected output.
I'm now facing the issue that these .txt files are not encoded in UTF-8 and were written without persisting that encoding information, so characters such as umlauts get corrupted.
I have read a few articles on the topic of persisting and encoding .txt files, including correcting the encoding, saving and opening files in Visual Studio with a specific encoding, and some more.
I was wondering whether there is a text file format that supports meta information about its encoding, the way XML or HTML does, for example.
I'm looking for a solution that is:
Easily adoptable by any coworker on the same team
Persistent, not depending on me choosing an encoding in an editor
Does not require any additional exotic programs
Can be read with no or only little modification to the C# File class input reading
Supports at least UTF-8 encoding
A Unicode Byte Order Mark (BOM) is sometimes used for this purpose. Systems that process Unicode are required to strip off this metadata when passing on the text; File.ReadAllText etc. do this. A BOM should exist only at the beginning of files and streams.
A BOM is sometimes conflated with the encoding itself because both affect the file format, and a BOM applies only to Unicode encodings. In Visual Studio, for UTF-8, the option is called "Unicode (UTF-8 with signature) - Codepage 65001".
Some C# code that demonstrates these concepts:
var path = Path.GetTempFileName() + ".txt";
// UTF8Encoding(true, true): emit a BOM, throw on invalid bytes.
File.WriteAllText(path, "Test", new UTF8Encoding(true, true));
Debug.Assert(File.ReadAllBytes(path).Length == 7); // 3-byte BOM + 4 bytes of "Test"
Debug.Assert(File.ReadAllText(path).Length == 4);  // the BOM is detected and stripped -- slightly mushy encoding detection
However, this doesn't get anyone past the agreement that using text files requires. The fundamental rule is that a text file must be read with the same encoding it was written with. A BOM on its own is not a communication that suffices as a complete agreement for text files in general.
Text editors almost universally adopt the principle that they should guess a file's character encoding first and, for the most part, allow users to correct that guess later. Some IDEs with project systems allow recording which encoding a file actually uses.
A reasonable text editor would preserve both the encoding and the presence of a Unicode BOM for existing files.
It seems that you're after a universal strategy. Unfortunately, the history of the concept of a text file doesn't allow one.
Related
When writing to a binary file, when should I use .bin vs .dat? If I'm just trying to store information not meant to be read by humans, like item description/serial number pair, does it matter which one I pick if I'm just trying to make it unreadable from a text editor?
Let me give you some brief details about these files:
.BIN file: The BIN file type is primarily associated with 'Binary File'. Binary files are used for a wide variety of content and can be associated with a great many different programs. In general, a .BIN file will look like garbage when viewed in a text editor.
.DAT file: The DAT file type is primarily associated with 'Data'. It can be just about anything: text, graphics, or general binary data; a data file in a special format or ASCII.
Reference: Abhijit Banerjee's answer to that question on Quora.
.dat is the more frequently used suffix for binary data. It doesn't matter which extension you pick, as long as you are on a Unix or Linux based system.
Suffixes can mean whatever you want them to mean... those rules are more like guidelines than actual rules...
However, BIN looks like short for binary, so a BIN file will likely hold data in binary form. DAT looks like short for data, so a DAT file will contain information in whatever format the developer of the program that reads the file sees fit (ASCII, binary, a mix of them, or something else entirely).
On a UNIX system, there is no difference. The extensions are interchangeable.
If you do not put any extension on the file, it makes it rather hard for someone who doesn't know what the file's extension should be to open it. Additionally, on Unix or Linux, if you place a dot (period) before the file name, the file is hidden.
I have data stored in .ddm, .pnt, .fdt and .bin files.
How can I export (or extract or transform) data from those file formats into .csv?
I think it's an ADABAS database.
Yes, the file extensions look like an Adabas database.
You need an Adabas/Natural environment to run the database; then you can write a simple program in Natural that reads the database content and writes it to a text "work file" with ";" delimiters and a .csv extension. I don't know of any tool for manually unpacking the database files.
As peterozgood pointed out, you would normally use Natural for that.
If you're using Natural on Windows or Unix you can code the following
DEFINE WORK FILE nn TYPE 'CSV'
...where nn is a number between 1 and 32, identifying the desired work file.
(This may also be specified by your admin in the so-called Natparm, along with the codepage and delimiter.)
Then you can output data to the file by coding
WRITE WORK FILE nn operand1 ... operandN
Natural will automatically create the CSV.
Fields will be separated by the delimiter and quoted and escaped as necessary.
(the delimiter may be specified in the Natparm or as a startup parameter)
Unfortunately this functionality is not available with Mainframe Natural.
(The CSV type, that is. Work files in general are of course available.)
I want to write a program in C (only C, not C++ or Java) that will read .doc, .docx and .pdf files, and I want to make it available on GitHub for anyone who needs that code. So I started with the .doc format. I found that if I open a .doc file with plain Notepad, it shows all the text, just with some extra content that can easily be trimmed. So I wrote a simple C program to read a .doc file in both 'r' and 'rb' mode, but both times it gives me only 5-9 characters from the file, and even those are not readable. I don't know why this is happening. Any comment or discussion will be very helpful for me.
Here is the link to the GitHub source code. Please help me to complete all three formats.
To answer your specific question, the reason your little application stops reading is that it mistakenly thinks there is an EOF character in your file.
Look at your code:
char ch;
int nol=0, not=0, nob=0, noc=0;
FILE *fp;
fp = fopen("file.doc","rb");
while(1)
{
ch = fgetc(fp);
if(ch==EOF)
{
break;
}
You store the result of fgetc(fp) in a variable of type char, which is a single-byte type. However, the return type of fgetc is very purposefully int, not char.
fgetc returns a non-negative value in the range 0 to 255, except when you reach the end of the file, in which case it returns EOF, which is typically implemented as the value -1.
If you read a byte of value 255 and store it in an int, everything is OK: it is stored as the value 255 and your loop continues. If you store it in a char (which is signed on most platforms), the value 255 becomes -1, which compares equal to EOF, and your loop stops.
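A minimal corrected sketch of the read loop, with "file.doc" standing in for the real file name - the essential change is that ch is declared as int so the EOF return value can be distinguished from a byte with value 255:
#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("file.doc", "rb");
    if (fp == NULL) {
        perror("fopen");
        return 1;
    }

    int ch;        /* int, not char: fgetc() returns an int */
    long noc = 0;  /* byte counter, as in the original code */
    while ((ch = fgetc(fp)) != EOF) {
        noc++;     /* process the byte in ch here as needed */
    }
    printf("%ld bytes read\n", noc);

    fclose(fp);
    return 0;
}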
Don't expect to get anywhere with this idea, though: .doc is a huge binary file format that is inhumanly complicated to parse. That said, Cubia mentioned the offset where the text section of the document starts. I'm not familiar with the details of the format, but if the raw text is contained in one location, use fseek to get at it and stop when you reach the end. This won't be the case for the other formats, because they are very different.
.docx and .pdf should be easier to parse because they are more modern formats. If you want to read anything from a .docx, you need to read from a zip file with a ton of XML in it and use a parser to figure out which text you want.
.pdf should be the easiest of the three, because you might be able to find a library out there that can almost do what you want.
As for why you are getting strange output from your program, remember that .doc is a binary format and the vast majority of the data is garbage from your perspective. Dumping it to the terminal will yield some readable text, but also a bunch of control characters that will mess with your terminal.
As a last note: don't try to read .docx files directly using fread - they are compressed, so you likely won't recover the text unaltered. Take a look at libarchive. Also, expect to have to read the document specifications: .docx seems to be a Microsoft extension to the OpenOffice format. See this and some PDF specification documents (there are multiple versions).
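For the .docx case, here is a rough libarchive-based sketch of my own (not part of the answer above, and only an illustration): it treats the .docx as a zip archive and dumps the raw XML of its main document part, word/document.xml. You would still need an XML parser to pull the plain text out of that. Link with -larchive.
#include <stdio.h>
#include <string.h>
#include <archive.h>
#include <archive_entry.h>

int main(void)
{
    struct archive *a = archive_read_new();
    struct archive_entry *entry;
    char buf[8192];

    archive_read_support_format_zip(a);
    if (archive_read_open_filename(a, "file.docx", 10240) != ARCHIVE_OK) {
        fprintf(stderr, "cannot open archive: %s\n", archive_error_string(a));
        return 1;
    }

    while (archive_read_next_header(a, &entry) == ARCHIVE_OK) {
        /* word/document.xml holds the main body text of a .docx */
        if (strcmp(archive_entry_pathname(entry), "word/document.xml") == 0) {
            la_ssize_t n;
            while ((n = archive_read_data(a, buf, sizeof buf)) > 0)
                fwrite(buf, 1, (size_t)n, stdout);  /* still XML, not plain text */
            break;
        }
        archive_read_data_skip(a);  /* not the entry we are looking for */
    }

    archive_read_free(a);
    return 0;
}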
Think of the .doc file type as a .txt file but with extra non-printable characters before, in the middle of, and after your content. These non-printable characters are used for defining special formatting, metadata and other info.
With this said, all .doc files follow a certain structure.
If you open two different .doc files in a hex editor, you will notice that the text content of both files starts at an offset of 0xA00 (2560 bytes) from the beginning of the file. This means that when you open your file, you can ignore the first 2560 bytes (take a look at the fseek() function).
From this point on, you can read the contents of your file until you reach '\0'.
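A small sketch of that approach - note that the 0xA00 offset is the empirical observation from this answer, not a documented property of the .doc format:
#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("file.doc", "rb");
    if (fp == NULL) {
        perror("fopen");
        return 1;
    }

    fseek(fp, 0xA00, SEEK_SET);   /* skip the leading non-printable header bytes */

    int ch;                       /* int so that EOF can be detected reliably */
    while ((ch = fgetc(fp)) != EOF && ch != '\0')
        putchar(ch);

    fclose(fp);
    return 0;
}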
I have not seen the internals of a .pdf or a .docx file, but you can open up both in a hex editor and figure out what pattern you can use to isolate the important contents of the files.
Hope this helps.
EDIT: You can always find documentation on the different file formats you want to manipulate. Here are the specifications of the PDF file type:
http://www.adobe.com/devnet/pdf/pdf_reference.html
http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
I am currently taking a course in C programming, and for our final project we need to read some text from a PDF into a string so we can manipulate the string.
In essence, what I am looking for is something similar to this, only with a .pdf instead of a .txt file:
FILE *fp = fopen("myfile.txt", "r");
char line[1024];
fscanf(fp, " %1023[^\n]", line);
I have no experience with Ghostscript, so I have no idea whether this is even possible, although we were told that we should use Ghostscript.
The current version of Ghostscript includes the 'txtwrite' device, which will extract text from any supported input (PostScript, PDF, XPS, PCL) and will emit it in a variety of forms.
The UTF-8 output would probably be most useful to you.
Caveat! Many things which appear to be text in PDF files are not text, and no attempt is made to deal with these.
ps2ascii is deprecated with the release of the txtwrite device, but in any case it's perfectly capable (despite the name) of dealing with PDF as input.
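Since this is for a C project, one way to drive the txtwrite device is to let Ghostscript write to stdout and read that through a pipe. This is only a sketch under a few assumptions: a POSIX system (for popen), gs on the PATH, and "input.pdf" as a placeholder file name:
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Ask Ghostscript to extract text with the txtwrite device and
       send it to stdout ("-o -"), then read it line by line. */
    FILE *gs = popen("gs -q -sDEVICE=txtwrite -o - input.pdf", "r");
    if (gs == NULL) {
        perror("popen");
        return EXIT_FAILURE;
    }

    char line[1024];
    while (fgets(line, sizeof line, gs) != NULL)
        fputs(line, stdout);      /* or append to a string you manipulate later */

    pclose(gs);
    return EXIT_SUCCESS;
}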
I can't think why anyone assigned you this project: PDF files are not text files and cannot be treated as such. In addition to the fact that PDF files are generally compressed, identifying the content stream and all the other streams it relies on (which may themselves include text) is non-trivial. Plus, the text is often encoded in a way which can be difficult to understand (this is particularly true of CIDFonts and TrueType fonts).
Perhaps your tutor expected you to first become an expert in the PDF format, but that seems excessive for a C course.
You can convert your PDF to PostScript using pdf2ps, and then to ASCII using ps2ascii. You already know how to read ASCII.
Both utilities mentioned are in the ghostscript package.
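A rough sketch of that workflow from C, assuming the Ghostscript utilities are on the PATH and "input.pdf" is a placeholder name:
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* PDF -> PostScript -> plain text, using the two helper scripts,
       then read the resulting text file the usual way. */
    if (system("pdf2ps input.pdf output.ps") != 0 ||
        system("ps2ascii output.ps output.txt") != 0) {
        fprintf(stderr, "conversion failed\n");
        return EXIT_FAILURE;
    }

    FILE *fp = fopen("output.txt", "r");
    if (fp == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }

    char line[1024];
    while (fgets(line, sizeof line, fp) != NULL)
        fputs(line, stdout);   /* or collect into a string for manipulation */

    fclose(fp);
    return EXIT_SUCCESS;
}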
For some reason, every file that I bake with CakePHP's console is regarded as ISO-8859-1 encoded by my IDE, Dreamweaver. This works fine up to the point where I end up typing a special character, which is then displayed incorrectly by the browser, since its encoding (as set by the editor) differs from the encoding used to render the page.
How can I force the console to produce UTF-8 files, with a BOM if necessary?
I've already tried converting the template files that are used to bake the standard scaffolding pages, but with no luck.
I have the same problem - baked files are NOT UTF-8 but ASCII. (I use the Notepad++ editor, which makes it easy to convert and save files in another format.)
Once bake generates the files, I have to convert them to UTF-8 one by one to be able to work with Polish characters.
I tried changing the template files to UTF-8, but somehow this does not help. This may have something to do with the fact that the default files do not contain any non-ASCII characters; therefore, even if saved as UTF-8, they stay ASCII.
The simplest way I found to overcome this is to modify a template file, e.g.
cake\console\templates\default\classes\model.ctp
to include a UTF-8 character somewhere, e.g.:
//'message' => 'Your custom message here ł',
(notice the non-ASCII character at the end of the line).
Then converting and saving the file as UTF-8 makes sure the template file really is UTF-8.
Now the model files are generated as UTF-8.
The baked files are UTF-8 - or rather, they only contain basic ASCII characters, which are identical in the basic UTF-8 range, so they can be regarded as either. It's Dreamweaver's problem, not a problem with bake. Check the Dreamweaver settings (or code in a decent editor ;-P).
You do not want to include a BOM, it'll screw you over later.
Use the Bake_UTF8 plugin =]
http://www.github.com/pedroelsner/bake_utf8
I hope this is helpful.
Pedro Elsner
Another way to achieve this is to open the PHP files that produce the UTF-8 content (without BOM) and then save them as UTF-8 with BOM using Notepad++ (Encoding -> Encode in UTF-8).
In my case I had an Excel CSV file:
/patients/exportFirstReport/atskaite1-25-10-2013.csv
Then I had to convert the encoding of the PHP files down the stack:
\index.php
\app\Controller\PatientsController.php
\app\View\Patients\csv\export_first_report.ctp
\app\View\Layouts\csv\default.ctp
After converting the encoding of these files, it produces readable UTF-8 Excel files.