I'm using the Java zlib package to extract text from a PDF file. When I pass the first compressed stream found in the file to inflate(), it returns a Z_NEED_DICT error. After getting this error, I tried passing an arbitrary dictionary array to set_inflate_dictionary(), followed by another inflate() call, but the same "dictionary needed" error appears. The zlib manual says that the decompressing application must provide the same dictionary that was used to compress the data. How can one know exactly which dictionary bytes the author used when producing this PDF file? Or can the dictionary be extracted from the PDF file itself?
You did not correctly locate or extract the zlib stream. PDF does not use zlib streams that require dictionaries.
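As a rough illustration of what locating the stream correctly can look like, here is a minimal Python sketch (Python's zlib module stands in for the Java package from the question). It assumes the stream uses /FlateDecode and that the data sits between the stream and endstream keywords; a robust parser would use the /Length entry instead. The helper names are made up for this example. It also checks the FDICT flag in the zlib header, which is the flag that makes inflate ask for a dictionary; if it is set, the bytes you passed are most likely not the start of the stream.

```python
import re
import zlib

# "stream" keyword, end-of-line, data, then "endstream".
# Simplified assumption: real parsers should honour the /Length entry.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def first_stream_bytes(pdf_bytes):
    """Return the raw bytes of the first stream object, or None."""
    match = STREAM_RE.search(pdf_bytes)
    return match.group(1) if match else None


def wants_preset_dictionary(data):
    """True if the zlib header has FDICT set (bit 5 of the second byte)."""
    return len(data) >= 2 and bool(data[1] & 0x20)


def inflate_first_stream(pdf_bytes):
    """Locate and decompress the first stream (hypothetical helper)."""
    raw = first_stream_bytes(pdf_bytes)
    if raw is None:
        raise ValueError("no stream found")
    if wants_preset_dictionary(raw):
        raise ValueError("FDICT set: probably not the start of a zlib stream")
    # decompressobj stops at the end of the zlib stream, so the end-of-line
    # bytes before "endstream" are ignored rather than raising an error.
    return zlib.decompressobj().decompress(raw)
```

If wants_preset_dictionary() returns True for the bytes you extracted, check the offsets first (for example, whether the end-of-line after the stream keyword was skipped) before looking for a dictionary.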
I have an array of English words that is about 275,000 elements long that I need to use for my iOS app written in Swift. However, Xcode doesn't seem to be able to handle such a large (3+ MB) file. The file will not open in Xcode, and when I attempt to compile the app, it seems to compile indefinitely and never build.
How should I handle this large amount of data?
Don't put a huge literal array in your Swift source code.
Instead, create a text file, drag that into your project as a resource, then open that and convert it into an array at runtime using components(separatedBy:).
For speed and storage efficiency you could instead write a conversion utility that reads your text file and uses components(separatedBy:) to convert it to an Array of Strings. Then you could write the array of Strings to a binary plist.
You could then drag the plist file into your project as a resource, and write code that reads the plist file into an Array at launch.
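The conversion utility only runs on your development machine, so it does not have to be written in Swift. As a sketch, here is the same idea in Python using the standard plistlib module instead of components(separatedBy:). It assumes one word per line in a UTF-8 text file; the function name and file names are placeholders.

```python
import plistlib


def words_to_binary_plist(text_path, plist_path):
    """Read one word per line and write the words as a binary plist array."""
    with open(text_path, encoding="utf-8") as f:
        # Skip blank lines and strip the trailing newline from each word.
        words = [line.strip() for line in f if line.strip()]
    with open(plist_path, "wb") as f:
        plistlib.dump(words, f, fmt=plistlib.FMT_BINARY)
    return len(words)
```

Calling words_to_binary_plist("words.txt", "words.plist") would produce the plist file that you then add to the project as a resource.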
How about putting it in a file and reading it at runtime? For example, put the elements in a JSON array and store the array in a text file. Drag the file into your Xcode project, and it will be copied into the app bundle at build time. Read the JSON array from the file and parse it at runtime.
There are many tutorials on the internet about reading files in the bundle and parsing JSON data.
I am reading a .docx file into a buffer and writing it to a new file successfully (using fread and fwrite in C). Now I want to extend the project to do encryption, so I want to be able to manipulate the buffer and then write it to a new file.
Now one question might be, what manipulation do I need?
It could be anything really. For example, I'd write the character 's' at location 15 of the buffer, as below, and then write this new buffer (with 's' at location 15 and the rest unchanged) to a new .docx file.
buffer[15] = 's';
When I did this, the file that was created was corrupt. Since I am not fully aware of the structure of .docx file, this byte number 15 could be some potential identifier, or header, or any important information of .docx file needed for creating a non-corrupt file.
However, the things I know about .docx internal structure are:
It consists of XML files, zipped together.
The content written in a .docx file is stored in those XML files. For example, if test.docx contains "Hello, how are you?", that text is stored in one of the XML files.
Among the zipped files there is a file with a .rels extension (I have not confirmed this) that tells MS Word where the content is stored, i.e. where to look for it.
Apart from these 3 points I don't know much about the structure of a .docx file. Given all this, I want to extract the contents of a .docx file from the zipped XML files, read them into a buffer (in C), change the buffer as needed, and create a new file with the new content from the buffer.
Can someone guide me through this?
Also kindly mention, if I need to provide code, or any other essential details. Thanks in advance.
EDIT
PURPOSE OF ALL THIS:
I want to do all this for encryption. Encrypting a file (using AES) makes the whole file unreadable, and every byte in it changes. When I decrypt that file, it will not open. My guess is that the AES decryption algorithm does not know how to arrange the recovered contents into a new .docx file, so it cannot put the contents/structure back in the proper place.
I have tried it. The original .docx file was 14 KB, and so were the encrypted and the decrypted files. But when I try to open the decrypted file, it says the file is corrupt. I also checked it in a hex editor: the decrypted file contains only 00 bytes after exactly the first 30 bytes.
DOCX files are based on OPC and OOXML. OPC is based on Zip. OOXML is based on XML. Therefore, you can use Zip and XML tools to operate on DOCX files. Beyond this, you'll have to be more specific about what you wish to do in order to receive better guidance.
Poking characters into random index locations in an XML file is operating at the wrong level of abstraction.
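To make "use Zip and XML tools" concrete, here is a minimal sketch in Python with the standard zipfile module (the question uses C, where a zip library would play the same role). It assumes the main text lives in word/document.xml, which is the usual location but is formally determined by the relationship files. The function names are made up for this example, and the transform is applied to the XML as plain text purely for illustration; a real tool would use an XML parser.

```python
import zipfile

DOCUMENT_PART = "word/document.xml"  # assumed location of the main body part


def read_document_xml(docx_path):
    """Return the XML of the main document part as a string."""
    with zipfile.ZipFile(docx_path) as z:
        return z.read(DOCUMENT_PART).decode("utf-8")


def rewrite_docx(src_path, dst_path, transform):
    """Copy every part of src to dst, passing document.xml through transform."""
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == DOCUMENT_PART:
                data = transform(data.decode("utf-8")).encode("utf-8")
            # Re-adding each part lets zipfile write fresh sizes and CRCs,
            # which is what a raw buffer[15] = 's' edit fails to do.
            dst.writestr(item, data)
```

For example, rewrite_docx("test.docx", "new.docx", lambda xml: xml.replace("Hello", "Howdy")) changes the text while keeping the container valid.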
I'm working on a project that needs to open an .mp4 file, read its frames one by one, decode them, encode them with a better type of lossless compression, and save them to a file.
Please correct me if I'm wrong about the order of operations, because I'm not 100% sure how this should be done. From my understanding it should go like this:
1. Open input .mp4 file
2. Find stream info -> find video stream index
3. Copy codec pointer of found video stream index into AVCodecContext type pointer
4. Find decoder -> allocate codec context -> open codec
5. Read frame by frame -> decode the frame -> encode the frame -> save it into a file
So far I have encountered a couple of problems. For example, if I want to save a frame using av_interleaved_write_frame(), I can't open the input .mp4 file using avformat_open_input(), since it populates the filename field of the AVFormatContext structure with the input file name, and therefore I can't "write" to that file. I tried a different approach using av_guess_format(), but when I dump the format using dump_format() I get nothing, so I can't find the stream information that tells me which codec is in use.
So if anyone has any suggestions, I would really appreciate them. Thank you in advance.
See the "detailed description" in the muxing docs. You:
set ctx->oformat using av_guess_format
set ctx->pb using avio_open2
call avformat_new_stream for each stream in the output file. If you're re-encoding, this is by adding each stream of the input file into the output file.
call avformat_write_header
call av_interleaved_write_frame in a loop
call av_write_trailer
close the file (avio_close) and clear up all allocated memory
You can convert a video to a sequence of lossless images with:
ffmpeg -i video.mp4 image-%05d.png
and then from a series of images back to a video with:
ffmpeg -i image-%05d.png video.mp4
The functionality is also available via wrappers.
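If you would rather drive the two commands above from a script than type them by hand, a thin wrapper is enough. The sketch below only builds the same argument lists and hands them to subprocess; it assumes an ffmpeg binary is on the PATH, and the function names are invented for this example.

```python
import subprocess


def frames_command(video, pattern="image-%05d.png"):
    """Argument list that splits a video into numbered PNG frames."""
    return ["ffmpeg", "-i", video, pattern]


def video_command(pattern, video):
    """Argument list that assembles numbered PNG frames into a video."""
    return ["ffmpeg", "-i", pattern, video]


def run(command):
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(command, check=True)
```

For example, run(frames_command("video.mp4")) performs the first conversion.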
You can see a similar question at: Extracting frames from MP4/FLV?
How does an application detect the file extension?
I know that every file has a header that contains all the information related to that file.
My question is: how does an application use that header to identify the file?
Every file in the file system has some metadata associated with it. For example, I changed an audio file's extension from .mp3 to .txt and then opened it with VLC, and VLC was still able to play it.
I found out that every file has a header section that contains all the information related to that file.
I want to know how I can access that header.
Just to give you some more details:
A file extension is basically a way to indicate the format of the data (for example, TIFF image files have a format specification).
This way an application can check if the file it handles is of the right format.
Some applications don't check the extension (or accept a wrong one) and just try to read the file as the format they need. In the case of your .mp3 file, the data in the file does not change when you simply rename the extension to .txt.
When VLC reads the .txt file byte by byte and interprets it as MP3, it can still extract the correct music data from that file.
Some formats also include a header for extra validation of what kind of data the file contains. For example, a Unicode text file can include a BOM to indicate how the data in the file needs to be handled. An application can check whether the header matches the expected one, and that is how it can tell that your '.txt' file actually contains data in the 'mp3' format.
There are quite a few applications for reading those header tags, but they are often specific to one format. This TIFF Tag Viewer is one example (I used it in the past to check the header tags of my TIFF files).
So you could either open your file with a hex viewer and look up in the format specification what each byte means, or search Google for a header viewer for the format you are interested in.
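As a small illustration of reading a header yourself, the Python sketch below looks at the first bytes of a file and compares them with a few well-known signatures ("magic numbers"). The table is deliberately tiny and the function names are made up for this example. Note that an MP3 file without an ID3v2 tag starts directly with an audio frame and would not be recognised by this table.

```python
# A few well-known file signatures; real tools know hundreds of them.
SIGNATURES = [
    (b"ID3", "MP3 with ID3v2 tag"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"%PDF-", "PDF"),
    (b"PK\x03\x04", "ZIP container (also .docx)"),
    (b"\xef\xbb\xbf", "text with UTF-8 BOM"),
]


def sniff(header):
    """Return a format name for the given leading bytes, or None."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return None


def sniff_file(path, size=16):
    """Read the first bytes of a file and identify them, ignoring its name."""
    with open(path, "rb") as f:
        return sniff(f.read(size))
```

Because sniff_file() never looks at the file name, an MP3 file with an ID3v2 tag that was renamed to .txt is still reported as MP3.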
I have a binary file (.bin) and a (.txt) file.
Using Python3, is there any way to combine these two files into one file (WITHOUT using any compressor tool if possible)?
And if I have to use a compressor, I want to do this with python.
As an example, I have 'file.txt' and 'file.bin', I want a library that gets these two and gives me one file, and also be able to un-merge the file.
Thank you
Just create a tar archive. A module that lets you accomplish this task is already bundled with CPython, and it's called tarfile.
More examples here.
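A minimal sketch of merging and un-merging with tarfile might look like this. The archive is uncompressed (mode "w"), so no compressor is involved; the function names are made up for this example, and files are stored under their base names only.

```python
import os
import tarfile


def merge(paths, archive_path):
    """Pack the given files into one uncompressed tar archive."""
    with tarfile.open(archive_path, "w") as tar:
        for path in paths:
            tar.add(path, arcname=os.path.basename(path))


def unmerge(archive_path, dest_dir):
    """Write every regular file in the archive to dest_dir; return names."""
    os.makedirs(dest_dir, exist_ok=True)
    names = []
    with tarfile.open(archive_path, "r") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            # Use only the base name so an entry cannot escape dest_dir.
            name = os.path.basename(member.name)
            with tar.extractfile(member) as src, \
                    open(os.path.join(dest_dir, name), "wb") as dst:
                dst.write(src.read())
            names.append(name)
    return names
```

With the files from the question, merge(["file.txt", "file.bin"], "merged.tar") creates the single file and unmerge("merged.tar", "out") restores both.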
There are a lot of solutions for compressing!
gzip or zlib would allow compression and decompression and could be a solution for your problem.
Example of how to GZIP compress an existing file, from [http://docs.python.org]:
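import gzip
import shutil

# The with statement closes both files even if copying fails.
with open('file.txt', 'rb') as f_in, gzip.open('file.txt.gz', 'wb') as f_out:
    # Stream the input into the gzip file in chunks.
    shutil.copyfileobj(f_in, f_out)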
But tarfile is also a good solution!
Tar is the best solution if you want a binary file as output.
If you want the output to be text, you can use base64 to transform the binary file into text data, then concatenate the two into one file (using some unique string, or another technique, to mark the point where they were merged).
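Here is one possible sketch of the base64 approach. As a small departure from the description above, it encodes both files, not only the binary one, so that a plain newline can serve as the merge marker: base64 output produced by b64encode never contains a newline. The function names are made up for this example.

```python
import base64


def merge_as_text(txt_path, bin_path, out_path):
    """Write both files as two base64 lines in a single text file."""
    with open(txt_path, "rb") as f:
        text_part = base64.b64encode(f.read())
    with open(bin_path, "rb") as f:
        bin_part = base64.b64encode(f.read())
    with open(out_path, "wb") as f:
        f.write(text_part + b"\n" + bin_part + b"\n")


def split_text(merged_path):
    """Return the original (text_bytes, binary_bytes) from a merged file."""
    with open(merged_path, "rb") as f:
        first, second = f.read().split(b"\n")[:2]
    return base64.b64decode(first), base64.b64decode(second)
```

The merged file is roughly a third larger than the two inputs together, which is the usual cost of base64.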