Reading Microsoft Outlook MSG Content in pure C code

I need to read Microsoft Outlook MSG files in pure C code. What I need is a library that doesn't depend on any particular framework (.NET, Java, etc.), that is, a library, class, or set of functions written entirely in C.

Well, if you're okay with MFC or ATL (which would be C++), you should take a look at this article. Also, the format is described in detail here.
What you'll essentially be doing is reading the nodes which have information in ASCII text format, which boils down to the __substg1.0_xxxxxxxx nodes ending with 001E.
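As a rough illustration of the naming rule above, the sketch below splits a stream name of the form __substg1.0_PPPPTTTT into its property ID (first four hex digits) and property type (last four), and checks for type 001E. The function names are made up for this example, and it assumes whatever compound-file reader you use hands you the stream names as plain char strings; it does not read the MSG container itself.

```c
#include <stddef.h>
#include <string.h>

#define PT_STRING8 0x001Eu  /* 8-bit string property type (the "001E" suffix) */

/* Hypothetical helper: parse "__substg1.0_PPPPTTTT" into ID and type.
 * Returns 1 on success, 0 if the name does not have exactly this shape.
 * Names with extra suffixes are rejected by this sketch. */
int parse_substg_name(const char *name, unsigned *prop_id, unsigned *prop_type)
{
    static const char prefix[] = "__substg1.0_";
    const size_t plen = sizeof prefix - 1;
    unsigned long tag = 0;
    size_t i;

    if (strncmp(name, prefix, plen) != 0)
        return 0;
    name += plen;
    if (strlen(name) != 8)
        return 0;

    for (i = 0; i < 8; i++) {
        char c = name[i];
        unsigned digit;
        if (c >= '0' && c <= '9')      digit = (unsigned)(c - '0');
        else if (c >= 'A' && c <= 'F') digit = (unsigned)(c - 'A') + 10u;
        else if (c >= 'a' && c <= 'f') digit = (unsigned)(c - 'a') + 10u;
        else return 0;
        tag = (tag << 4) | digit;
    }

    *prop_id = (unsigned)(tag >> 16);       /* high 16 bits: property ID */
    *prop_type = (unsigned)(tag & 0xFFFFu); /* low 16 bits: property type */
    return 1;
}

/* Hypothetical helper: 1 if the stream holds an 8-bit text property. */
int is_ascii_text_stream(const char *name)
{
    unsigned id, type;
    return parse_substg_name(name, &id, &type) && type == PT_STRING8;
}
```

For a name such as __substg1.0_0037001E this yields ID 0x0037 and type 0x001E, so the stream would be treated as text; a name ending in 001F (the Unicode string type) would not.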

Related

How would I read the NTFS master file table in C (*not* C++)?

I need a simple, lightweight way to read the NTFS MFT on a Windows server, using only C. My goal is to return a collection of directories and their permissions in a programmatic way for an application my company is building.
Every other answer I've found on StackOverflow and elsewhere has involved C++ or other languages, and they are typically very bloated. I'm pretty sure that what I want can be done in just a few lines of code, using the Windows API to call CreateFile (to get a handle to the root volume) and DeviceIoControl (to read the MFT). But I can't find a simple C solution for doing this.
Note that, although I've been a C#/.NET developer for many years (and also know other languages including Java and Python), I am fairly new to low-level C programming and Windows API calls. I also realize that there is a free tool, Mft2Csv, that does exactly this. But the actual source code isn't available for me to reverse-engineer (GitHub has only the executable and supporting files).
I also realize I could just parse the directory tree using C# and the .NET namespaces System.IO and System.Security.AccessControl. But this is way too slow for my purposes.

Auto-detect language of file

Is there a way to auto-detect the language that a file is written in, or a way to say "this file is 20% C, 30% Python, 50% shell"? There must be some way, because GitHub's remote server seems to auto-detect languages. Also, if the file is a hybrid of languages, what is the de facto way to set the file extension so that it represents the languages in the file? Maybe files all have to be homogeneous with regard to language; I am still learning. Additionally, is there a way to auto-detect the byte count of a codebase on a remote site like GitHub? Basically, something like GitHub's language bar, except that the bar shows how many bytes the project takes up.
The file command on Linux does a reasonable job of guessing the language of a file, but basically it's just looking at the first characters of a file and comparing them to known situations: "if the file starts with blah-blah-blah it is probably thus-and-so".
As far as the file containing "20% C, 30% Python, etc" -- what would you do with such a file if you had one? Neither the C compiler nor the Python compiler would be happy with it.
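The "if the file starts with blah-blah-blah" idea can be sketched in a few lines of C. This is only an illustration of first-bytes guessing, not how the file command is actually implemented; the function name and the small rule set are assumptions made up for this example, and the substring checks are deliberately crude.

```c
#include <string.h>

/* Hypothetical helper: guess a language from the first line of a file.
 * The rules are a tiny, hand-picked sample; real tools use large rule
 * databases and look at more than one line. */
const char *guess_from_first_line(const char *first_line)
{
    if (strncmp(first_line, "#!", 2) == 0) {
        /* Shebang line: look for a known interpreter name (crude match). */
        if (strstr(first_line, "python"))
            return "Python";
        if (strstr(first_line, "bash") || strstr(first_line, "/sh"))
            return "Shell";
        if (strstr(first_line, "perl"))
            return "Perl";
        return "script (unknown interpreter)";
    }
    if (strncmp(first_line, "#include", 8) == 0)
        return "C or C++";
    if (strncmp(first_line, "<?xml", 5) == 0)
        return "XML";
    return "unknown";
}
```

Note that a guess like this says nothing about a file that mixes languages; it only classifies the file as a whole from its opening bytes.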
I think GitHub uses file extensions to decide what language code is written in.
As for auto-detecting the language (and hence the file extension) from the code itself, I suppose you could create a classification model.
You would have to create a large dataset of files in different languages with their corresponding labels (language names), then feed that training data to a neural network (maybe an RNN-LSTM) to train the model. You could then use that model on new files to predict the language from the code.
I have never done something like this. But it would be a fun project.

HANDLE - File Handles and Directory Handles Structures

Language: C
OS: Windows
My application is built on NT-level APIs and has to manipulate file and directory handles.
On a ZwOpenFile or ZwCreateFile call, I get a HANDLE as a result. The HANDLE values are usually like 0x00000024, 0x28, 0x2c, etc.
When I cast it to LPBYTE to view the contents, Visual Studio shows "Expression could not be evaluated". I understand from this that the HANDLE returned from the create/open file APIs is not a pointer to a memory location. However, Windows uses the value to perform file operations.
NtQueryDirectoryObject supplies me with information about handles. However, how Windows has implemented this functionality is unknown to me.
Can anyone shed light on it?
That's a so-called "opaque value", which means it's completely up to Windows how it is done inside. For example, it could be an index in some global table that is not accessible directly to your program - Windows just knows how to get there, and you shouldn't even think of doing it.
Handles are stored in a table accessible only from kernel code. If you are interested in how the Windows kernel works, you may find Mark Russinovich's blog or driver development interesting.
The last book I know of that was a good reference for this kind of stuff was Inside Windows 2000 by Mark E. Russinovich and David A. Solomon. While clearly out of date, a lot of that book is still relevant. Google for "Inside Windows 7" for links to videos of talks by Russinovich and some other books that I can't vouch for, but seem on topic.
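To make the "index in some table" idea concrete, here is a toy handle table in portable C. It is not how Windows implements handles; every name in it (my_handle, my_open, and so on) is invented for this sketch, and the multiple-of-4 encoding is chosen only to resemble the values in the question. The point is that callers only ever see a small integer, while the objects themselves stay private to the module.

```c
#include <stddef.h>

typedef size_t my_handle;            /* hypothetical opaque handle type */
#define MY_INVALID_HANDLE ((my_handle)0)
#define TABLE_SIZE 16

struct file_object {                 /* the "real" object, hidden from callers */
    int in_use;
    long position;
};

static struct file_object table[TABLE_SIZE];  /* private to this module */

/* Translate a handle back to its table entry, or NULL if it is not valid. */
static struct file_object *lookup(my_handle h)
{
    size_t i;
    if (h == 0 || h % 4 != 0)
        return NULL;
    i = h / 4 - 1;
    if (i >= TABLE_SIZE || !table[i].in_use)
        return NULL;
    return &table[i];
}

/* Allocate a free slot and return its handle (4, 8, 12, ...). */
my_handle my_open(void)
{
    size_t i;
    for (i = 0; i < TABLE_SIZE; i++) {
        if (!table[i].in_use) {
            table[i].in_use = 1;
            table[i].position = 0;
            return (i + 1) * 4;      /* made-up encoding of the slot index */
        }
    }
    return MY_INVALID_HANDLE;
}

int my_seek(my_handle h, long pos)
{
    struct file_object *obj = lookup(h);
    if (obj == NULL)
        return -1;
    obj->position = pos;
    return 0;
}

long my_tell(my_handle h)
{
    struct file_object *obj = lookup(h);
    return obj ? obj->position : -1;
}

int my_close(my_handle h)
{
    struct file_object *obj = lookup(h);
    if (obj == NULL)
        return -1;
    obj->in_use = 0;
    return 0;
}
```

A caller that casts such a handle to a pointer and dereferences it gets nothing useful, which mirrors the debugger message described in the question: the value only has meaning to the code that owns the table.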
HANDLE is declared as a pointer type (void *), but the value you get back for a file is not an address you can dereference; it identifies an entry in the process's handle table, and that entry refers to the kernel object. Handles are commonly used in C APIs to get a notion of object-oriented programming.
When debugging with WinDbg you have an extension called !handle that can display various information about a given handle.
The book Windows Internals (by Mark Russinovich) goes into great detail about this and many other Windows' mechanisms.
Perhaps you will find this discussion useful: What is a Windows Handle?
Also check out this blog post by Mark: http://blogs.technet.com/b/markrussinovich/archive/2009/09/29/3283844.aspx. It contains a lot of information that could help you answer your question.

Where can I get started with Unicode-friendly programming in C?

So, I’m working on a plain-C (ISO/IEC 9899:1999) project, and am trying to figure out where to get started re: Unicode, UTF-8, and all that jazz.
Specifically, it’s a language interpreter project, and I have two primary places where I’ll need to handle Unicode: reading in source files (the language ostensibly supports Unicode identifiers and such), and in ‘string’ objects.
I’m familiar with all the obvious basics about Unicode, UTF-7/8/16/32 & UCS-2/4, so on and so forth… I’m mostly looking for useful, C-specific (that is, please no C++ or C#, which is all that’s been documented here on SO previously) resources as to my ‘next steps’ to implement Unicode-friendly stuff… in C.
Any links, manpages, Wikipedia articles, example code, is all extremely welcome. I’ll also try to maintain a list of such resources here in the original question, for anybody who happens across it later.
A must read before considering anything else, if you’re unfamiliar with Unicode, and what an encoding actually is: http://www.joelonsoftware.com/articles/Unicode.html
The UTF-8 home-page: http://www.utf-8.com/
man 3 iconv (as well as iconv_open and iconvctl)
International Components for Unicode (via Geoff Reedy)
libbasekit, which seems to include light Unicode-handling tools
Glib has some Unicode functions
A basic UTF-8 detector function, by Christoph
International Components for Unicode provides a portable C library for handling Unicode. Here's their elevator pitch for ICU4C:
The C and C++ languages and many operating system environments do not provide full support for Unicode and standards-compliant text handling services. Even though some platforms do provide good Unicode text handling services, portable application code can not make use of them. The ICU4C libraries fills in this gap. ICU4C provides an open, flexible, portable foundation for applications to use for their software globalization requirements. ICU4C closely tracks industry standards, including Unicode and CLDR (Common Locale Data Repository).
GLib has some Unicode functions and is a pretty lightweight library. It's nowhere near the level of functionality that ICU provides, but it might be good enough for some applications. The other features of GLib are good to have for portable C programs too.
GTK+ is built on top of GLib. GLib provides the fundamental algorithmic language constructs commonly duplicated in applications. This library has features such as (this list is not a comprehensive list):
Object and type system
Main loop
Dynamic loading of modules (i.e. plug-ins)
Thread support
Timer support
Memory allocator
Threaded Queues (synchronous and asynchronous)
Lists (singly linked, doubly linked, double ended)
Hash tables
Arrays
Trees (N-ary and binary balanced)
String utilities and charset handling
Lexical scanner and XML parser
Base64 (encoding & decoding)
I think one of the interesting questions is: what should your canonical internal format for strings be? The two obvious choices (to me at least) are
a) UTF-8 in vanilla C strings
b) UTF-16 in unsigned short arrays
In previous projects I have always chosen UTF-8. Why? Because it's the path of least resistance in the C world. Everything you are interfacing with (stdio, string.h, etc.) will work fine, as long as you remember that those functions count bytes, not characters.
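One thing to keep in mind with UTF-8 in ordinary C strings is that the standard functions see bytes, so strlen reports the byte count rather than the number of code points. A minimal sketch of counting code points follows; the function name is made up for this example, and it assumes the input is already valid UTF-8 (it does no validation).

```c
#include <stddef.h>

/* Hypothetical helper: count code points in a NUL-terminated UTF-8 string.
 * Every byte that is not a continuation byte (bit pattern 10xxxxxx) starts
 * a new code point. Assumes valid UTF-8; malformed input is not detected. */
size_t utf8_length(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "h\xC3\xA9llo" (the word "héllo" encoded as UTF-8), strlen gives 6 bytes while utf8_length gives 5 code points. Code points are still not the same as user-perceived characters once combining marks are involved, which is where a library such as ICU comes in.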
Next comes the question of file format. The problem here is that it's visible to your users (unless you provide the only editor for your language). Here I guess you have to take what they give you and try to guess the encoding by peeking at the start of the file (byte order marks help).
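Peeking for a byte order mark can be sketched as below. The enum and function names are invented for this example. A missing BOM proves nothing (plain UTF-8 files usually have none), so this is only the first step of a guess, and the byte sequence FF FE 00 00 is inherently ambiguous between a UTF-32 LE BOM and a UTF-16 LE BOM followed by U+0000; this sketch picks UTF-32 LE.

```c
#include <stddef.h>

enum bom_kind {                      /* hypothetical result type */
    BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE, BOM_UTF32_LE, BOM_UTF32_BE
};

/* Hypothetical helper: inspect the first bytes of a buffer for a BOM.
 * Stores the BOM length in *bom_len (0 if none) so the caller can skip it. */
enum bom_kind detect_bom(const unsigned char *buf, size_t len, size_t *bom_len)
{
    *bom_len = 0;

    /* Check the 4-byte marks first: FF FE 00 00 also begins with FF FE. */
    if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE &&
        buf[2] == 0x00 && buf[3] == 0x00) {
        *bom_len = 4;
        return BOM_UTF32_LE;
    }
    if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00 &&
        buf[2] == 0xFE && buf[3] == 0xFF) {
        *bom_len = 4;
        return BOM_UTF32_BE;
    }
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *bom_len = 3;
        return BOM_UTF8;
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *bom_len = 2;
        return BOM_UTF16_LE;
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *bom_len = 2;
        return BOM_UTF16_BE;
    }
    return BOM_NONE;
}
```

A source-file reader would typically read the first four bytes, call this once, skip bom_len bytes, and fall back to assuming UTF-8 when no mark is found.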

Is there a C header parser tool for wrapper generation like gccxml?

I need to write a few C header wrappers for a new programming language and would like something like gccxml, but without the full dependency on gcc and the problems it causes on a Windows system.
It just needs to read C, not C++. Output in any format is okay as long as it is fully documented.
I need it for Curl, SQLite, GTK2, SDL, OpenGL, the Win32 API and the C POSIX APIs on Linux/Solaris/FreeBSD/MacOSX.
VivaCore is very cool. Have you tried SWIG? The Wikipedia page on FFI has some good links too. I think there is an MSVC CodeDOM example that does C also.
See our SD C Front End for DMS. Full C parsing, symbol table construction, post parsing dump of any information you like. Can dump code and symbol tables in XML format.
You may like pycparser in Python. Used in CFFI and other awesome projects.
