Auto-detect language of file - file

Is there a way to auto-detect the language a file is written in, or to say "this file is 20% C, 30% Python, 50% shell"? There must be some way, because GitHub seems to auto-detect languages on its servers. Also, if a file is a hybrid of languages, what is the de facto way to set the file extension so that it represents the languages in the file? Maybe files all have to be homogeneous with regard to language; I am still learning. Additionally, is there a way to auto-detect the size in bytes of a codebase on a remote site like GitHub? Basically, something like GitHub's language bar, except that the bar shows how many bytes the project takes up.

The file command on Linux does a reasonable job of guessing the type of a file, but basically it just looks at the first bytes of the file and compares them to known patterns: "if the file starts with blah-blah-blah it is probably thus-and-so".
As for a file containing "20% C, 30% Python, etc.": what would you do with such a file if you had one? Neither the C compiler nor the Python interpreter would be happy with it.

I think GitHub uses file extensions to decide what language a piece of code is written in.
As for auto-detecting the language from the file contents, I suppose you could build a classification model.
You would have to create a large dataset of files in different languages with their corresponding labels (language names), then feed that training data to a neural network (maybe an RNN/LSTM) to train the model. You could then use that model to predict the language of new code.
I have never done something like this. But it would be a fun project.
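For the simpler extension-based approach, a per-language byte tally like the bar the question describes could be sketched as follows. All names here (lang_from_extension, struct tally, and so on) and the three-entry extension table are hypothetical; this is not how GitHub does it, just one way to get "X% C, Y% Python" from a list of file names and sizes.

```c
#include <stddef.h>
#include <string.h>

enum lang { LANG_C, LANG_PYTHON, LANG_SHELL, LANG_OTHER, LANG_COUNT };

/* Map a path to a language using only its extension (illustrative table). */
enum lang lang_from_extension(const char *path)
{
    const char *dot = strrchr(path, '.');
    const char *slash = strrchr(path, '/');

    /* No dot, or the last dot belongs to a directory name. */
    if (dot == NULL || (slash != NULL && dot < slash))
        return LANG_OTHER;
    if (strcmp(dot, ".c") == 0 || strcmp(dot, ".h") == 0) return LANG_C;
    if (strcmp(dot, ".py") == 0) return LANG_PYTHON;
    if (strcmp(dot, ".sh") == 0) return LANG_SHELL;
    return LANG_OTHER;
}

/* Running total of bytes seen per language. */
struct tally { unsigned long long bytes[LANG_COUNT]; };

void tally_add(struct tally *t, const char *path, unsigned long long size)
{
    t->bytes[lang_from_extension(path)] += size;
}

/* Share of one language as a percentage of all counted bytes (0 if empty). */
double tally_percent(const struct tally *t, enum lang l)
{
    unsigned long long total = 0;
    int i;

    for (i = 0; i < LANG_COUNT; i++)
        total += t->bytes[i];
    return total ? 100.0 * (double)t->bytes[l] / (double)total : 0.0;
}
```

You would call tally_add once per file while walking the checkout (for example with the size reported by stat), then print tally_percent for each language. Note that this counts whole files, so it cannot express a single file that mixes languages.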

Related

Is there a way to prevent a file from being completely loaded by a software?

Is there a way to keep a hard drive from reading a certain file? For example: Program A is told to open a .txt file. Program B floods the .txt file with open requests hundreds of times a second. Program A is unable to open the .txt file.
I'm trying to stress test a game engine that relies on extracting all used textures from a single file at once. I think this extraction method is causing some core problems for the overall game development experience with the engine. My theory is that the problem is caused by the slow read times of some hard drives. But I'm not sure whether I'm right about this, and I need a way to test it.
Most operating systems support file locking and file sharing so that you can establish rules for processes that share access to a file.
.NET, for example (which runs on Windows, Linux, and macOS), provides the facility to open a file in a variety of sharing modes.
For very rapid access like you describe, you may want to consider a memory-mapped file. They are supported on many operating systems and via various programming languages. .NET also provides support.
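As a rough illustration of file locking outside .NET, here is a minimal C sketch that assumes a POSIX system (Linux or macOS) and uses flock. The helper names open_locked and close_locked are made up for this example. Keep in mind that flock locks are advisory: they only block other programs that also ask for the lock, whereas Windows sharing modes are enforced by the OS.

```c
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Open path and take a non-blocking exclusive advisory lock on it.
 * Returns the descriptor on success, or -1 if the file cannot be opened
 * or a conflicting lock is already held through another descriptor. */
int open_locked(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);

    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
        close(fd);          /* someone else holds the lock */
        return -1;
    }
    return fd;
}

/* Release the lock and close the descriptor. */
void close_locked(int fd)
{
    flock(fd, LOCK_UN);
    close(fd);
}
```

For the stress test in the question, "Program B" could hold the lock for a while and "Program A" could call open_locked in a loop to observe how it behaves when the file is unavailable.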

How would I read the NTFS master file table in C (*not* C++)?

I need a simple, lightweight way to read the NTFS MFT on a Windows server, using only C. My goal is to return a collection of directories and their permissions in a programmatic way for an application my company is building.
Every other answer I've found on StackOverflow and elsewhere has involved C++ or other languages, and they are typically very bloated. I'm pretty sure that what I want can be done in just a few lines of code, using the Windows API to call CreateFile (to get a handle to the root volume) and DeviceIoControl (to read the MFT). But I can't find a simple C solution for doing this.
Note that, although I've been a C#/.NET developer for many years (and also know other languages including Java and Python), I am fairly new to low-level C programming and Windows API calls. I also realize that there is a free tool, Mft2Csv, that does exactly this. But the actual source code isn't available for me to reverse-engineer (GitHub has only the executable and supporting files).
I also realize I could just walk the directory tree using C# and the .NET namespaces System.IO and System.Security.AccessControl. But this is way too slow for my purposes.

Coding example for thrift in C, using files

I've started using Thrift for C. I managed to generate the .c and .h files via the compiler. I'm looking to write to a file, preferably JSON. However, there are no examples on the Apache Thrift site. An internet search turns up next to nothing useful. Does anybody have any sample code that I can use? I essentially have a struct that has a bunch of ints and char *.
"However there are no examples on the apache thrift site. A internet search turns up next to nothing useful."
That's simply not true; we have a great tutorial covering most of the supported languages. You can find it quite easily via Google. The tutorial code can be found in the release tarball or in the Git repository, in a top-level directory named tutorial.
Since you are looking specifically for JSON, I recommend having a look at the cross-language test client/server, which can be found under test or lib (a bit inconsistent right now; we are in the process of cleaning that up). AFAIK there is no JSON protocol available for plain C, but there is one for C++.
In order to store things in a file, you basically choose a stream or file transport and the protocol of your choice from the available protocols. It's as simple as this (pseudocode):
var data = InitializeMyDataStructure();
var trans = new TFileTransport("myfile");
var prot = new TJSONProtocol(trans);
data.write(prot);
The support for plain C is still somewhat limited, but there are all kinds of transports/protocols for C++.

Reading complex binary file formats

Is there any book or tutorial that can teach me how to read binary files with a complex structure? I have made many attempts at writing a program that reads a complex file format and saves it in a struct. But it always failed because of heap overruns etc. that made the program crash.
Probably your best bet is to look for information on binary network protocols rather than file formats. The main issues (byte order, structure packing, serializing and unserializing pointers, ...) are the same but networking people tend to be more aware of the issues and more explicit in how they are handled. Reading and writing a blob of binary to or from a wire really isn't much different than dealing with binary blobs on disk.
You could also find a lot of existing examples in open source graphics packages (such as netpbm or The Gimp). An open source office package (such as LibreOffice) would also give you lots of example code that deals with complex and convoluted binary formats.
There might even be something of use for you in Google's Protocol Buffers or old-school ONC RPC and XDR.
I don't know any books or manuals on such things but maybe a bunch of real life working examples will be more useful to you than a HOWTO guide.
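To make the byte order and overrun issues concrete, here is a minimal C sketch of one common pattern: read fields one at a time from a bounds-checked cursor instead of casting the buffer to a struct. The "DEMO" header layout (4-byte magic, 16-bit version, 32-bit payload length, all little-endian) and every name below are invented for this example and do not correspond to any real format.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Cursor over an in-memory copy of the file. */
struct reader {
    const uint8_t *data;
    size_t len;
    size_t pos;
    int failed;             /* set once any read would run past the end */
};

/* Check that n more bytes are available; latch the failure flag if not. */
static int need(struct reader *r, size_t n)
{
    if (r->failed || n > r->len - r->pos) {
        r->failed = 1;
        return 0;
    }
    return 1;
}

/* Little-endian reads that work regardless of host byte order. */
uint16_t read_u16le(struct reader *r)
{
    uint16_t v;
    if (!need(r, 2)) return 0;
    v = (uint16_t)(r->data[r->pos] | (r->data[r->pos + 1] << 8));
    r->pos += 2;
    return v;
}

uint32_t read_u32le(struct reader *r)
{
    uint32_t v;
    if (!need(r, 4)) return 0;
    v = (uint32_t)r->data[r->pos]
      | (uint32_t)r->data[r->pos + 1] << 8
      | (uint32_t)r->data[r->pos + 2] << 16
      | (uint32_t)r->data[r->pos + 3] << 24;
    r->pos += 4;
    return v;
}

/* Hypothetical header: "DEMO", u16 version, u32 payload length. */
struct demo_header { uint16_t version; uint32_t payload_len; };

/* Returns 0 on success, -1 on bad magic, truncation, or a bogus length. */
int parse_demo_header(const uint8_t *buf, size_t len, struct demo_header *out)
{
    struct reader r = { buf, len, 0, 0 };

    if (!need(&r, 4) || memcmp(buf, "DEMO", 4) != 0)
        return -1;
    r.pos += 4;
    out->version = read_u16le(&r);
    out->payload_len = read_u32le(&r);
    if (r.failed || out->payload_len > r.len - r.pos)
        return -1;  /* header cut short, or length points past the data */
    return 0;
}
```

The key design choice is that every length read from the file is validated against the bytes actually present before it is used to allocate or copy, which is where heap overruns in hand-written parsers usually come from.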
One of the best tools to debug memory access problems is valgrind. I'd give that a try next time. As for books, you'd need to be more specific about what formats you want to parse. There are lots of formats and many of them are radically different from each other.
Check out Flavor. It allows you to specify the format using C-like structure and will auto-generate the parser for the data in C++ or Java.

Use an INI file in C on Linux

Is there a standard way of reading a kind of configuration like INI files for Linux using C?
I am working on a Linux based handheld and writing code in C.
Otherwise, I would like to know about any alternatives.
Final update:
I have explored and even used libconfig. But its footprint is large and my usage is very simple. So, to reduce the footprint, I have rolled my own implementation. The implementation is not very generic; in fact it is quite tightly coupled to my application as of now. The configuration file is parsed once when the application starts, and the values are stored in some global variables.
Try libconfig:
a simple library for processing structured configuration files, like this one: test.cfg. This file format is more compact and more readable than XML. And unlike XML, it is type-aware, so it is not necessary to do string parsing in application code.
Libconfig is very compact — a fraction of the size of the expat XML parser library. This makes it well-suited for memory-constrained systems like handheld devices.
The library includes bindings for both the C and C++ languages. It works on POSIX-compliant UNIX and UNIX-like systems (GNU/Linux, Mac OS X, Solaris, FreeBSD), Android, and Windows (2000, XP and later)...
No, there isn't one standard way. I'm sorry, but that is probably the most precise answer :)
You could look at this list of Linux configuration file libraries, though. That might be helpful.
Here are four options:
Iniparser
libini
sdl-cfg
RWini
If you can use the (excellent, in any C-based application) glib, it has a key-value file parser that is suitable for .ini-style files. Of course, you'd also get access to the various (very nice) data structures in glib, "for free".
There is an updated fork of iniparser at CCAN; the original author has not been able to give it much attention over the years. Disclaimer: I maintain it.
Additionally, iniparser contains a dictionary that is very useful on its own.
If you need fast, small code just for reading config files, I suggest inih.
It loads the config file content just once, parses it, and calls a callback function for each key/value pair.
Really small. It can be used on embedded systems too.
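The callback style described above can be sketched in plain C as follows. To be clear, this is not inih's actual API: mini_ini_parse, ini_handler, demo_config, and demo_handler are names invented for this example, and the parser makes simplifying assumptions (lines shorter than 256 bytes, no quoting, no inline comments).

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Called once per key/value pair; return 0 to abort parsing. */
typedef int (*ini_handler)(void *user, const char *section,
                           const char *key, const char *value);

/* Trim leading and trailing whitespace in place; returns the new start. */
static char *trim(char *s)
{
    char *end;
    while (isspace((unsigned char)*s)) s++;
    end = s + strlen(s);
    while (end > s && isspace((unsigned char)end[-1])) end--;
    *end = '\0';
    return s;
}

/* Returns 0 on success, or the 1-based number of the first bad line. */
int mini_ini_parse(FILE *fp, ini_handler handler, void *user)
{
    char line[256], section[64] = "";
    int lineno = 0;

    while (fgets(line, sizeof line, fp) != NULL) {
        char *s = trim(line), *eq, *rb;
        lineno++;
        if (*s == '\0' || *s == ';' || *s == '#')
            continue;                       /* blank line or comment */
        if (*s == '[') {                    /* [section] header */
            rb = strchr(s, ']');
            if (rb == NULL) return lineno;
            *rb = '\0';
            snprintf(section, sizeof section, "%s", trim(s + 1));
            continue;
        }
        eq = strchr(s, '=');                /* key = value */
        if (eq == NULL) return lineno;
        *eq = '\0';
        if (!handler(user, section, trim(s), trim(eq + 1)))
            return lineno;
    }
    return 0;
}

/* Example handler filling a hypothetical config struct. */
struct demo_config { int port; char name[32]; };

int demo_handler(void *user, const char *section,
                 const char *key, const char *value)
{
    struct demo_config *c = user;
    if (strcmp(section, "server") == 0 && strcmp(key, "port") == 0)
        c->port = atoi(value);
    else if (strcmp(section, "server") == 0 && strcmp(key, "name") == 0)
        snprintf(c->name, sizeof c->name, "%s", value);
    return 1;                               /* silently ignore unknown keys */
}
```

Typical use would be to fopen the config file once at startup, call mini_ini_parse(fp, demo_handler, &cfg), and fclose it, which matches the parse-once-into-variables approach the question's final update describes.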
I hate to suggest something entirely different by suggesting XML, but libexpat is pretty minimal and handles XML.
I came to this conclusion as I had the same question as you did, but then I realized the project already had libexpat linked-in--and I should probably just use that.

Resources