Writing Matroska to an append-only stream (file)

I need to write a Matroska video file to a stream that only supports an append operation (this is not network streaming; the output is a single MKV file for offline playback). Right now I'm using ffmpeg's libavformat to do the muxing, but the resulting video file is not seekable (in a player) at all.
Going through the Matroska specs, I figured out a layout that yields a seekable (in a player) file with only one (file) seek operation during writing:
SeekHead 1 (without clusters)
...
Clusters
Cues
SeekHead 2 (only clusters)
After the file is written, I need to go back to SeekHead 1 and update it with the positions of SeekHead 2 and the Cues.
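For illustration, here is a minimal C sketch of that single-seek approach as I understand it: reserve a fixed amount of space for SeekHead 1 (padded with a Void element), remember its offset, append everything else, then seek back once and overwrite the reserved region. The element-writing helpers are hypothetical placeholders, not real libavformat or EBML API calls.

/* Sketch of the "one seek" layout described above. Only the offset
 * bookkeeping and the single fseek() at the end are the point;
 * write_seekhead() and the other element writers are hypothetical. */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    FILE *out = fopen("movie.mkv", "wb");
    if (!out) return 1;

    /* ... write EBML header, Segment start, Tracks, etc. ... */

    /* Remember where SeekHead 1 starts and reserve a fixed amount of
     * space for it (padded with an EBML Void element) so it can be
     * rewritten later without shifting anything that follows.        */
    long seekhead1_pos = ftell(out);     /* use ftello/fseeko for >2 GB */
    uint8_t placeholder[64] = {0};       /* would be a Void element     */
    fwrite(placeholder, 1, sizeof placeholder, out);

    /* ... append all Clusters, then Cues, then SeekHead 2 ...          */
    long cues_pos = 0, seekhead2_pos = 0; /* recorded while writing     */

    /* The single seek: go back and overwrite the reserved region with
     * a real SeekHead that points at Cues and SeekHead 2.              */
    fseek(out, seekhead1_pos, SEEK_SET);
    /* write_seekhead(out, cues_pos, seekhead2_pos);   -- hypothetical  */
    (void)cues_pos; (void)seekhead2_pos;

    fclose(out);
    return 0;
}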
My output files can easily get to tens of gigabytes, so buffering the whole thing in memory is not an option.
Is there really no way to create the MKV without seeking in the output file?

Related

Fastest way to read 100's of files accessing them only once?

In my case I pass a path to a folder, and my code reads all the files in the directory and its subdirectories.
I need to read ALL the files, I only need to read them once (I don't need to cache anything), and I don't need the file contents after I process them.
I was thinking of opening all the files (100+ of them) at once with open, each followed by a posix_fadvise(f, 0, 0, POSIX_FADV_WILLNEED); then using threads to read the files, one file per thread (6 cores * 2 threads per core = 12 threads/files being read at a time). Is this the fastest way to do it? Would there be problems if I call posix_fadvise on too many files? Should I read each file or use mmap? I'm not sure how I should go about this or what the limitations are.
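For what it's worth, here is a rough sketch (not benchmarked) of what the per-file read path of one worker thread could look like under those assumptions; process_chunk is a hypothetical stand-in for the real processing.

/* Rough per-file read path: advise the kernel up front, then read
 * sequentially and discard. Error handling kept minimal.            */
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

static void process_chunk(const char *buf, ssize_t len) {
    (void)buf; (void)len;               /* your per-file processing   */
}

int read_one_file(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    /* Hint that the whole file will be read soon, and sequentially.  */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);

    char buf[1 << 16];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        process_chunk(buf, n);

    /* We won't touch this file again, so let the page cache drop it. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
    return 0;
}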

Multithreaded compression, random access and on-the-fly reading

I have a program running on Linux which generates thousands of text files. I want these files to be packed into a single (compressed) file.
The compressed file will later be opened by a C program, which needs to access specific files inside that container, in a random fashion.
The whole thing is working as follows:
Linux program generates thousands of small files
zip -9 out.zip *
C program with libzip accessing specific files inside the .zip, depending on what the user requests. These reads are done in memory (no writing decompressed files to disk).
Works great. However, it takes about 20 minutes for the compression to finish. Because the compression runs on a 40-core server, I have been experimenting with lbzip2, with excellent results in terms of both compression ratio and speed. I have also used zip -0 to pack all the .bz2 files into a single .zip container, which I assume is a better option than tar because of random access.
So my question is, how can I read the .bz2 files compressed inside a .zip file? As far as I can tell, gzopen takes a file path as its first argument.
You could just stick with your current zip format for random access. Run separate zip commands individually on each text file to turn them into many single entry zip files. Launch all those at once, and your 40 cores will be kept busy until done. Once done, use zipmerge to combine them all into a single zip file.
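On the reading side of the question, one possible approach, sketched below and untested, is to pull the still-compressed .bz2 member out of the zip into memory with libzip and then decompress that buffer with libbz2's one-shot BZ2_bzBuffToBuffDecompress. The fixed output-buffer guess is illustrative only; real code would grow the buffer and retry.

/* Sketch: read a stored .bz2 member out of a zip archive with libzip,
 * then decompress the in-memory buffer with libbz2.                  */
#include <zip.h>
#include <bzlib.h>
#include <stdlib.h>

char *read_bz2_member(const char *zip_path, const char *member,
                      unsigned int *out_len) {
    int err = 0;
    zip_t *za = zip_open(zip_path, ZIP_RDONLY, &err);
    if (!za) return NULL;

    zip_stat_t st;
    if (zip_stat(za, member, 0, &st) != 0) { zip_close(za); return NULL; }

    /* Read the (still bzip2-compressed) member into memory. */
    char *packed = malloc(st.size);
    zip_file_t *zf = packed ? zip_fopen(za, member, 0) : NULL;
    if (!zf || zip_fread(zf, packed, st.size) != (zip_int64_t)st.size) {
        free(packed);
        if (zf) zip_fclose(zf);
        zip_close(za);
        return NULL;
    }
    zip_fclose(zf);
    zip_close(za);

    /* Guess at a decompressed size; grow and retry in real code.     */
    unsigned int cap = (unsigned int)st.size * 10 + 1024;
    char *plain = malloc(cap);
    int rc = BZ2_bzBuffToBuffDecompress(plain, &cap, packed,
                                        (unsigned int)st.size, 0, 0);
    free(packed);
    if (rc != BZ_OK) { free(plain); return NULL; }

    *out_len = cap;
    return plain;                       /* caller frees */
}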

HTTP byte range yields a corrupt audio file?

When I open a .m4a or .mp3 file using Chrome (or in VLC's "open new stream") and skip to a certain time mark, it starts streaming the audio right at that time mark, which means that it downloads a chunk of the audio file starting at that position.
When I check the Network tab in Chrome dev tools after I click somewhere in the audio player, and copy the cURL request, it contains a Range HTTP header, for example:
-H "Range: bytes=26500000-30000000"
When I run that cURL command in the terminal, the output is a corrupt audio file. If I try to convert it using ffmpeg, it logs:
[mov,mp4,m4a,3gp,3g2,mj2 @ 00000164d5a3ab40] moov atom not found
The only exception is when I request a byte range starting from 0 (and ending at 300000+), in which case the output file can play.
From what I understand, the headers of the audio file are not downloaded when the byte range is somewhere in the middle or at the end of the file.
I tried specifying two ranges, as follows:
-H "Range: bytes=0-300000,2000000-4000000"
but the output file is corrupt nevertheless.
How do Chrome/VLC deal with this situation, and how can I replicate it? What is the right way to download an audio chunk from a large audio file? I am guessing I would have to make consecutive HTTP requests to build a proper, non-corrupt file; that seems to be what Chrome does.
UPDATE
a "temporary fix" i found to the problem that you might find useful is to use launch VLC through command with a hidden interface, load the audio stream (in this case the audio file url) and transcode the audio to a local file while assigning a start and stop time. an example, with a random podcast:
vlc -Idummy --play-and-exit --sout "#transcode{acodec=s16l,channels=2,samplerate=44100}:std{access=file,mux=wav,dst=C:\audio\output.wav}" --start-time=600 --stop-time=630 "http://traffic.libsyn.com/preview/worldofhardwarestartups/World_of_Hardware_Startups_Podcast_EP01_-_AIR_Ready.mp3"
However, the original issue is still unsolved.
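For experimenting with the "consecutive requests" idea programmatically, the sketch below uses libcurl to fetch the head of the file (which contains the moov atom only if the file was written with the moov at the front) plus a middle chunk into one local file; the URL and ranges are placeholders. Note that simply concatenating the two ranges generally still isn't a valid MP4/M4A, because the sample offsets stored in the moov atom refer to positions in the complete file, so the result usually has to be remuxed (e.g. with ffmpeg) before it plays cleanly.

/* Sketch: fetch two byte ranges of a remote file with libcurl and
 * append both to one local file. URL and ranges are placeholders.    */
#include <curl/curl.h>
#include <stdio.h>

static int fetch_range(const char *url, const char *range, FILE *out) {
    CURL *curl = curl_easy_init();
    if (!curl) return -1;
    curl_easy_setopt(curl, CURLOPT_URL, url);
    curl_easy_setopt(curl, CURLOPT_RANGE, range);    /* e.g. "0-300000" */
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);  /* default cb = fwrite */
    CURLcode rc = curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    return rc == CURLE_OK ? 0 : -1;
}

int main(void) {
    const char *url = "https://example.com/audio.m4a"; /* placeholder */
    FILE *out = fopen("chunk.m4a", "wb");
    if (!out) return 1;

    curl_global_init(CURL_GLOBAL_DEFAULT);
    /* Header/moov region first, then the part the user seeked to.    */
    fetch_range(url, "0-300000", out);
    fetch_range(url, "26500000-30000000", out);
    curl_global_cleanup();

    fclose(out);
    return 0;
}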

FFmpeg decoding .mp4 video file

I'm working on a project that needs to open a .mp4 file, read its frames one by one, decode them, re-encode them with a better type of lossless compression, and save them into a file.
Please correct me if I'm wrong about the order of doing things, because I'm not 100% sure how this particular thing should be done. From my understanding it should go like this:
1. Open input .mp4 file
2. Find stream info -> find video stream index
3. Copy codec pointer of found video stream index into AVCodecContext type pointer
4. Find decoder -> allocate codec context -> open codec
5. Read frame by frame -> decode the frame -> encode the frame -> save it into a file
So far I've encountered a couple of problems. For example, if I want to save a frame using the av_interleaved_write_frame() function, I can't open the input .mp4 file using avformat_open_input(), since it populates the filename part of the AVFormatContext structure with the input file name and therefore I can't "write" into that file. I've tried a different solution using av_guess_format(), but when I dump the format using dump_format() I get nothing, so I can't find stream information about which codec it is using.
So if anyone has any suggestions, I would really appreciate them. Thank you in advance.
See the "detailed description" in the muxing docs. You:
set ctx->oformat using av_guess_format
set ctx->pb using avio_open2
call avformat_new_stream for each stream in the output file. If you're re-encoding, this is done by adding each stream of the input file into the output file.
call avformat_write_header
call av_interleaved_write_frame in a loop
call av_write_trailer
close the file (avio_close) and clear up all allocated memory
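A rough sketch of those steps follows, with error handling and the decode/encode loop omitted; get_next_packet is a hypothetical stand-in for wherever your encoded packets come from.

/* Sketch of the muxing steps above; error checks and encoding omitted. */
#include <libavformat/avformat.h>
#include <libavcodec/avcodec.h>

/* hypothetical: returns the next encoded packet, or NULL when done     */
extern AVPacket *get_next_packet(void);

int write_output(const char *filename) {
    AVFormatContext *ctx = NULL;

    /* The output format is guessed from the filename;
     * avformat_alloc_output_context2 does the av_guess_format step.    */
    avformat_alloc_output_context2(&ctx, NULL, NULL, filename);

    /* One output stream per input stream. In real code, fill codecpar
     * from your encoder, e.g. avcodec_parameters_from_context().       */
    AVStream *st = avformat_new_stream(ctx, NULL);
    st->codecpar->codec_type = AVMEDIA_TYPE_VIDEO;   /* placeholder     */

    /* ctx->pb: the actual output file.                                  */
    avio_open2(&ctx->pb, filename, AVIO_FLAG_WRITE, NULL, NULL);

    avformat_write_header(ctx, NULL);

    AVPacket *pkt;
    while ((pkt = get_next_packet()) != NULL) {      /* your encode loop */
        av_interleaved_write_frame(ctx, pkt);
        av_packet_unref(pkt);
    }

    av_write_trailer(ctx);
    avio_closep(&ctx->pb);
    avformat_free_context(ctx);
    return 0;
}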
You can convert a video to a sequence of lossless images with:
ffmpeg -i video.mp4 image-%05d.png
and then from a series of images back to a video with:
ffmpeg -i image-%05d.png video.mp4
The functionality is also available via wrappers.
You can see a similar question at: Extracting frames from MP4/FLV?

Safely writing to and reading from the same file with multiple processes on Linux and Mac OS X

I have three processes designed to run constantly in both Linux and Mac OS X environments. One process (the Downloader) downloads and stores a local copy of a large XML file every 30 seconds. Two other processes (the Workers) use the stored XML file for input. Each Worker starts and runs at random times. Since the XML file is big, it takes a long time to download. The Workers also take a long time to read and parse it.
What is the safest way to setup the processes so the Downloader doesn't clobber the stored file while the Workers are trying to read it?
For Linux and Mac OS X machines that use inode-based file systems, use temporary files to store the data while it's being downloaded (and is in an incomplete state). Once the download is complete, move the temporary file into its final location with an atomic action.
For a little more detail, there are two main things to watch out for when one process (e.g. Downloader) writes a file that's actively read by other processes (e.g. Workers):
Make sure the Workers don't try to read the file before the Downloader has finished writing it.
Make sure the Downloader doesn't alter the file while the Workers are reading it.
Using temporary files accommodates both of these points.
For a more specific example, when the Downloader is actively pulling the XML file, have it write to a temporary location (e.g. 'data-storage.tmp') on the same device/disk* where the final file will be stored. Once the file is completely downloaded and written, have the Downloader move it to its final location (e.g. 'data-storage.xml') via an atomic (aka linearizable) rename, e.g. the mv command, which uses the rename() system call when source and destination are on the same filesystem.
* Note that the reason the temporary file needs to be on the same device as the final file location is that only then is the move a true rename (the file keeps its inode and the operation is atomic); across devices, mv falls back to a copy-and-delete, which is not atomic.
This methodology ensures that while the file is being downloaded/written the Workers won't see it, since it's in the .tmp location. Because of the way renaming works with inodes, it also makes sure that any Worker that has already opened the file continues to see the old content even if a new version of the data-storage file is put in place.
The Downloader will point 'data-storage.xml' to a new inode number when it does the rename, but a Worker that already has the file open will continue to access 'data-storage.xml' through the previous inode number, thereby continuing to work with the file in that state. At the same time, any Worker that opens a new copy of 'data-storage.xml' after the Downloader has done the rename will see the contents of the new inode, since that is now what the filename references directly in the file system. So, two Workers can be reading from the same filename (data-storage.xml) but each will see a different (and complete) version of the contents of the file, based on which inode the filename pointed to when the file was first opened.
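A minimal sketch of the Downloader side of that pattern in C, using the example file names from above (error handling trimmed):

/* Sketch: write the new data to a temp file on the same filesystem,
 * flush it to disk, then atomically rename it over the old name.     */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

int publish_xml(const char *data, size_t len) {
    const char *tmp_path   = "data-storage.tmp";
    const char *final_path = "data-storage.xml";

    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;

    if (write(fd, data, len) != (ssize_t)len) { close(fd); return -1; }
    fsync(fd);               /* make sure the bytes have hit the disk */
    close(fd);

    /* Atomic within one filesystem: readers see either the old file
     * or the complete new one, never a partial write.                */
    return rename(tmp_path, final_path);
}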
To see this in action, I created a simple set of example scripts that demonstrate this functionality on github. They can also be used to test/verify that using a temporary file solution works in your environment.
An important note is that it's the file system on the particular device that matters. If you are using a Linux or Mac machine but working with a FAT file system (for example, a USB thumb drive), this method won't work.

Resources