Detect whether file contains text or binary - mime-types

I am using Apache Tika to detect whether a given file is binary or text.
I'd like the following extensions (".txt", ".csv", ".log", ".bat", ".m", ".properties", ".inf", ".ini", ".java", ".c", ".cpp", ".h", ".vpp") to be detected as text files.
I am simply using the Tika.detect(file) method to do this. But I notice that some of the above extensions, like .inf (which is clearly text-based) and .vpp, are wrongly detected as 'application'.
Using javax.activation.MimetypesFileTypeMap.MimetypesFileTypeMap(), .vpp files are detected as application/octet-stream (binary).
Using SVNAccessControl svn:mimetype, we get the type as text.
Is there a way to correctly detect these files as text in a Java program using any of these third-party libraries?

Related

Frama-c: save plugin analysis results in c file

I am new to Frama-C, so I apologize in advance for my question.
I would like to write a plugin that modifies the source code (cloning some functions and inserting some function calls), and I would like it to generate a second file containing the modified version of the input file.
I would like to know whether it is possible to generate a new C file with Frama-C. For example, the results of the Sparecode and Semantic constant folding plugins are displayed directly on the terminal and not in a file. So I would like to know whether Frama-C has a function for writing to a file instead of sending the result of the analysis to the standard output.
Of course, we can redirect the output of Frama-C to a file.c, for example, but in that case (for the scf plugin, for instance) the results of the Value plugin are included too, and I found that Frama-C replaces, for example, the "for" loops with "while" loops.
What I would like is for Frama-C to generate a file that contains my original code plus the modifications that I have inserted.
I looked in the directory src/kernel_services/ast_printing but I have not really found functions that can guide me.
Thanks.
On the command line, option -ocode <file> indicates that any subsequent -print will be done in <file> instead of the standard output (use -ocode "" after that if you want to print on stdout again). Note that -print prints the code corresponding to the current project. You can use -then-on <prj> to change the project you're interested in. More information is of course available in the user manual.
All of this is of course available programmatically. In particular, File.pretty_ast by default pretty-prints the AST of the current project (i.e. outputs a C program) on stdout, but takes two optional arguments for changing the project or the formatter to which the output is sent.

How to check whether a given file is a binary file in C?

I'm trying to check whether a given file is binary or not.
I referred to the question linked below to find a solution:
How can I check if file is text (ASCII) or binary in C
But the solutions given there do not work properly. If I pass a .c file as the argument, it gives the wrong output.
The possible files I may pass as argument:
a.out
filename.c
filename.txt
filename.pl
filename.php
So I need to know whether there is any function or other way to solve the problem.
Thanks...
Note: [In case of any query, please ask me before downvoting.]
You need to clearly define what a binary file is for you, and then base your checking on that.
If you just want to filter based on file extensions, then create a list of the ones you consider binary files, and a list of the ones you don't consider binary files and then check based on that.
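As a rough illustration of the extension-based approach, here is a minimal sketch. The function name has_text_extension and the list of extensions are hypothetical, chosen only to match the file names in the question; adjust the list to whatever you consider text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical list: extensions this program chooses to treat as text. */
static const char *const text_extensions[] = { ".c", ".h", ".txt", ".pl", ".php" };

/* Returns true if the file name ends in one of the listed extensions.
   The comparison is case-sensitive, and names without a dot are not text. */
bool has_text_extension(const char *filename)
{
    assert(filename != NULL);

    /* Find the last dot; everything from there on is the extension. */
    const char *dot = strrchr(filename, '.');
    if (dot == NULL)
        return false;

    for (size_t i = 0; i < sizeof text_extensions / sizeof text_extensions[0]; i++) {
        if (strcmp(dot, text_extensions[i]) == 0)
            return true;
    }
    return false;
}
```

With this list, filename.c and filename.php count as text while a.out does not. Note that this only looks at the name, so a renamed executable would still pass.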
If you have a list of known formats rather than file extensions, attempt to parse the file in each of those formats; if it doesn't parse, or the parse result makes no sense, it's a binary file (for your purposes).
Depending on your OS, executable binary files begin with a header that identifies them as executables and contains several pieces of information about them (architecture, size, byte order, etc.). So by trying to parse this header, you can tell whether a file is an executable binary or not. If you are on a Mac, check the Mach-O file format; if you are on Linux, it should be the ELF format. But beware, there is a lot of documentation to read.
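For the ELF case, a full parse is not needed just to recognise the format: every ELF file starts with the four bytes 0x7F 'E' 'L' 'F'. The sketch below checks only that signature. The function names are hypothetical, and Mach-O or other formats would need their own checks; a file that is not ELF is not necessarily text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* The four bytes that open every ELF file: 0x7F followed by "ELF". */
static const unsigned char elf_magic[4] = { 0x7f, 'E', 'L', 'F' };

/* Returns true if the buffer starts with the ELF signature. */
bool has_elf_magic(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    return len >= sizeof elf_magic
        && memcmp(buf, elf_magic, sizeof elf_magic) == 0;
}

/* Reads the first four bytes of a file and checks them.
   Returns 1 for ELF, 0 otherwise, -1 if the file cannot be opened. */
int file_is_elf(const char *path)
{
    unsigned char header[4];
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    size_t n = fread(header, 1, sizeof header, fp);
    fclose(fp);
    return has_elf_magic(header, n) ? 1 : 0;
}
```

On Linux, file_is_elf("a.out") should return 1 for a compiled program and 0 for filename.c.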

Prevent CEDET semantic from parsing certain file types

I have to work with a C/C++ build environment that drops intermediate files all over the place:
.i files containing the output of the C-preprocessor (roughly raw C)
.s files containing the input of the C-assembler
CEDET (I assume the semantic analyzer) eventually finds these files and attempts to index them. This results in jumping to .i files containing raw C for definitions and generally slowing down parsing and loading of the .semanticdb.
I never open these files in Emacs, so they must be getting loaded by the background analyser.
Is it possible to prevent the analyser from loading these files? I can't find any configuration options that define the file-types that are parsed by the background analyser.
If you never need C mode for these files, here's a quick fix:
(add-to-list 'auto-mode-alist '("\\.i\\'" . fundamental-mode))
(add-to-list 'auto-mode-alist '("\\.s\\'" . fundamental-mode))
The answer from abo-abo gave me the clues I needed. The grep implementation (used by EDE) of semantic-symref-perform-search uses auto-mode-alist to find matching files for a given semantic mode (based on the current buffer's mode, e.g. c-mode) when trying to resolve symbols.
The final fix I used is to specifically eliminate the default entries in the auto-mode-alist using:
(setq auto-mode-alist (delete '("\\.i\\'" . c-mode) auto-mode-alist))
(setq auto-mode-alist (delete '("\\.ii\\'" . c++-mode) auto-mode-alist))
Adding fundamental-mode entries as suggested by abo-abo also seems to work. However, I was concerned that, since the c-mode entries were still in the list, a change in implementation could result in them being reactivated.

How to stop file names/paths from appearing in compiled C binary

This may be compiler specific, in which case I am using the IAR EWARM 5.50 compiler (firmware development for the STM32 chip).
Our project consists of a bunch of C-code libraries that we compile first, and then the main application which compiles its C-code and then links in those libraries (pretty standard stuff).
However, if I use a hex editor and open up any of the library object files produced or the final application binary, I find a whole bunch of plain text references inside the output binary to the file paths of the C files that were compiled. (eg. I see "C:\Development\trunk\Common\Encryption\SHA_1.c")
Two issues with this:
we don't really want the file paths being easily readable, as that reveals our design somewhat
the size of the binary grows if you have your C-files located in a long subdirectory (the binary contains the full path, not just the name)...this is especially important when we're dealing with firmware that has a limited amount of code space (256KB).
Any thoughts on this? I've tried all the switches in the compiler I can think of to "remove debug information", etc., but those paths are still in there.
"The command-line option --no_path_in_file_macros has been added. It removes the path leaving only the filename for the symbols __FILE__ and __BASE_FILE__."
This is described in the IAR release notes:
http://supp.iar.com/FilesPublic/UPDINFO/005832/arm/doc/infocenter/iccarm_history.ENU.html
Alternatively, look for uses of the __FILE__ and __BASE_FILE__ macros and remove them if you do not want to use the flag.

How to convert MP4 (h264/aac) file to F4F fragments for HDS (Adobe)

I am looking for some input on how to programmatically convert mp4 files to fragmented f4f files with accompanying manifests.
I currently have an implementation for creating segmented MPEG2-TS files with an accompanying manifest for Apple's HLS, and want to create a similar piece of software for Adobe's HDS.
My code is based on Libav (alternatively, ffmpeg), so I was hoping they had native support for muxing f4f files, but I have not been able to find any resources for it.
What I am specifically looking for:
How (and whether) the format is supported in libav.
Whether there are any special requirements (such as the h264_mp4toannexb filter required for converting MP4 to MPEG2 TS).
Any sample code (even if it's not using libav/ffmpeg)
An easy-to-read manifest specification.
I'm afraid you have to read the MP4 and F4F specifications and implement it yourself.
MP4 file format: ISO/IEC 14496-14
f4f file format: It is included in the f4v specification.(http://www.adobe.com/cn/devnet/f4v.html)
The code of mod_h264_streaming (http://h264.code-shop.com/trac) may be helpful.
