I use the Azure Speech SDK to convert a set of text files to voice. All of the texts are converted successfully, and an instance of ArrayBuffer is returned for each text. I then convert each of them to a Buffer and concatenate all of the buffers into one with Buffer.concat(). Then I pass the concatenated buffer to fs.writeFile() to create an mp3 file. However, only the first buffer is audible in the resulting file rather than the whole concatenation.
What should I do?
To provide a little background, audio files generally consist of some header data that contains information about the audio (e.g. the sampling rate, how many channels of audio, etc...), followed by the actual audio data. Generally speaking, you should only have one header per audio file.
If you simply concatenate the audio data together, the header from the first file will be read by your media player. Since this doesn't contain any information about the other audio files you've concatenated, you'll get non-deterministic behaviour: it may play the audio from the first file only, it may give you an error, or it may try to play the header sections of the remaining files as if they were audio data (which is not correct).
For things to work correctly, you'd need to update the header to reflect all of the audio data, and strip out the remaining headers. You'd also need to make sure that the byte alignment, formatting, etc... of all your audio data is consistent.
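For illustration, here is roughly what the canonical 44-byte WAV (RIFF) header looks like in C. This is a hedged sketch, not your exact case (MP3 is frame-based rather than one header plus raw samples), but it shows the principle: the size fields in a header must describe all of the audio data that follows.

    /* A minimal sketch of the canonical 44-byte WAV (RIFF) header.
       Shown for illustration only - the key point is that the size
       fields must cover ALL of the audio data in the file. */
    #include <stdint.h>

    struct wav_header {
        char     riff[4];        /* "RIFF" */
        uint32_t chunk_size;     /* 36 + data size - must cover ALL audio data */
        char     wave[4];        /* "WAVE" */
        char     fmt[4];         /* "fmt " */
        uint32_t fmt_size;       /* 16 for PCM */
        uint16_t audio_format;   /* 1 = PCM */
        uint16_t num_channels;
        uint32_t sample_rate;
        uint32_t byte_rate;      /* sample_rate * num_channels * bits / 8 */
        uint16_t block_align;
        uint16_t bits_per_sample;
        char     data[4];        /* "data" */
        uint32_t data_size;      /* size of the raw samples that follow */
    };

Concatenate two such files byte-for-byte and the first header's chunk_size/data_size still describe only the first clip, so a player stops there - which matches the behaviour you're seeing.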
Your best bet here is to use some software that understands how to read and parse audio files to stitch your files together. Searching with your search engine of choice for e.g. "combine mp3 files" should help.
So far, I understood:
Files have some information in their 'headers' which programs use to detect the file type.
Is it always true? If so, how can I see the header?
Is it the only way of defining a file's type?
Is there a way to manually create an empty file (at least in Linux) with no headers at all?
If so, can I manually write its header and create a simple 'jpg' file?
No. Files simply contain bytes, plus some metadata such as a filename, permissions, and a last-modified time. The format of those bytes is completely free; no convention is imposed. Certainly some file types, like JPEGs, GIFs, audio and video files, have headers specified by their formats. Viewing a header depends completely on the format involved. Headers are normally comprised of byte codes meaningless to the human eye, so some software would normally be required to decode and view them.
Yes:

    touch emptyFile
Sounds painful. Use a library to write a JPEG; headers are not necessarily easy to create. Someone else has done this hard work for you, so I'd use it.
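To make the "viewing a header" point concrete, here is a minimal sketch in C that reads the first bytes of a file and compares them against two well-known signatures (a JPEG starts with FF D8 FF, a PNG with 89 50 4E 47 0D 0A 1A 0A). This is, very roughly, what the file utility does with its much larger magic database:

    /* Minimal magic-number sniffing: read the first bytes of a file
       and compare them to known signatures. */
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }
        unsigned char buf[8] = {0};
        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror("fopen"); return 1; }
        size_t n = fread(buf, 1, sizeof buf, f);
        fclose(f);

        if (n >= 3 && !memcmp(buf, "\xFF\xD8\xFF", 3))
            puts("looks like a JPEG");
        else if (n >= 8 && !memcmp(buf, "\x89PNG\r\n\x1a\n", 8))
            puts("looks like a PNG");
        else
            puts("no signature I recognise (may be headerless, e.g. plain text)");
        return 0;
    }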
A file is nothing more than a sequence of bytes, and it has no default, internal structure. It is an abstraction made by the OS to make it more convenient to store and manipulate data.
Files may represent different types of things like images, videos, audio and plain text, so these need to be interpreted in a certain way in order to interact with their contents. For instance; an image is opened in an image viewer, a PDF document is opened in a PDF viewer; an audio file is opened in a media player. That doesn't mean you cannot open an image in a text editor - the file's contents will only be interpreted differently.
The closest thing to file metadata in UNIX and Linux is the inode - which stores information about files, but is not part of the files themselves - and the file's magic number. Use stat to inspect the inode and file to determine its type (sometimes based on its magic number).
Also check out man file for more information about file types.
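If you want to inspect the inode programmatically rather than through stat(1), a minimal sketch using the stat(2) system call might look like this. Note that nothing here says anything about the file's format - that lives in the bytes themselves:

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        struct stat st;
        if (argc < 2 || stat(argv[1], &st) != 0) {
            perror("stat");
            return 1;
        }
        printf("inode: %ju\n", (uintmax_t)st.st_ino);          /* inode number    */
        printf("size:  %jd bytes\n", (intmax_t)st.st_size);
        printf("mode:  %o\n", (unsigned)(st.st_mode & 07777)); /* permission bits */
        return 0;
    }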
I want to read PCM samples from a file using fread and want to determine the signal strength of the samples. How do I go about it?
For reading, how many bytes constitute one PCM sample? Can I read more than one PCM sample at a time?
This is for WAV and AAC files.
You have to understand that WAV files (and even more so AAC files) are not all the same. I will only explain WAV files; hopefully you'll then understand how it is with AAC files. As you pointed out, a WAV file contains PCM-encoded data. However, that can be 8-bit, 16-bit or 32-bit; mono, stereo or 5.1; 8 kHz, 16 kHz, 44.1 kHz, and so on. Depending on these values you have to interpret the data differently (e.g. when reading it with fread()). That is why WAV files have a header. You have to read that header first, in the standard way (I do not know the details). Then you know how to read the actual data. Since it is not that easy, I suggest you use one of the libraries out there that read WAV files for you, e.g. http://www.mega-nerd.com/libsndfile/ . Of course you can also google or use SO to find others. Or you do it the hard way: find out what WAV file headers look like, decode that data first, then move on to the actual PCM-encoded data.
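To answer the byte-count question concretely: one 16-bit mono sample is 2 bytes, and fread() will happily read thousands of samples at once. Below is a minimal sketch, assuming the simplest case only (a canonical 44-byte header and 16-bit signed PCM), that computes the RMS level as a measure of signal strength. Real files can deviate from these assumptions, which is exactly why a library like libsndfile is the safer route:

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;
        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror("fopen"); return 1; }

        /* Assumption: canonical 44-byte header, 16-bit signed PCM. */
        fseek(f, 44, SEEK_SET);

        int16_t buf[4096];          /* read many samples per fread() call */
        double  sum_sq = 0.0;
        long    count  = 0;
        size_t  n;
        while ((n = fread(buf, sizeof buf[0], 4096, f)) > 0) {
            for (size_t i = 0; i < n; i++)
                sum_sq += (double)buf[i] * buf[i];
            count += (long)n;
        }
        fclose(f);

        if (count > 0 && sum_sq > 0.0) {
            double rms = sqrt(sum_sq / count);
            printf("RMS: %.1f (%.1f dBFS)\n", rms, 20.0 * log10(rms / 32768.0));
        }
        return 0;
    }

(Compile with -lm for the math library.)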
I have no experience with WAV files, but I once read data from an mp3 file. In an mp3 file, each block of 576 PCM samples is encoded into a frame. All the frames are stored directly in the file, along with some side information. To process the encoded data, I read binary data from the mp3 file into a buffer, decoded the buffered data, and extracted what was meaningful to me.
I think processing a WAV file (which, as I understand it, stores PCM samples) is not that different. You can read the binary data from the file directly and transform it according to the WAV encoding specification.
The file itself does not know what kind of data is in it, or even what format that data is in. You can treat everything in a file as bytes (even plain text): read bytes from the file and interpret the binary data yourself.
What I wish to do is to "split" an .mp3 into two separate files, taking duration as an argument to specify how long the first file should be.
However, I'm not entirely sure how to achieve this in C. Is there a particular library that could help with this? Thanks in advance.
I think you should use the GStreamer framework for this purpose. You can write an application that uses existing plugins to do the job. I am not sure of this, but you can give it a try. Check out http://gstreamer.freedesktop.org/
For queries related to gstreamer: http://gstreamer-devel.966125.n4.nabble.com/
If you don't find any library in the end, then learn the layout of an mp3 file's header, divide the original file into any number of parts, and add individual headers to them. It's not going to be easy, but it's possible.
It can be as simple as cutting the file at a position determined by the mp3 bitrate and the requested duration (a sketch follows the list below). But:
- your file should be CBR - that method won't work on ABR or VBR files, because they have varying data densities
- your player should be robust enough not to break if it gets a partial mp3 frame at the start. Most playback libraries will handle mp3s split that way very gracefully.
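As a minimal sketch of that naive CBR cut: at a constant bitrate, one second of audio occupies bitrate/8 bytes, so the cut offset is duration * bitrate / 8. The file names and the 128 kbit/s bitrate below are assumptions for illustration, and any ID3 tag at the front of the file is deliberately ignored:

    #include <stdio.h>

    /* Copy bytes [from, to) of `in` to a new file; to < 0 means until EOF. */
    static void copy_range(FILE *in, const char *path, long from, long to)
    {
        FILE *out = fopen(path, "wb");
        if (!out) { perror("fopen"); return; }
        fseek(in, from, SEEK_SET);
        for (long i = from; to < 0 || i < to; i++) {
            int c = fgetc(in);
            if (c == EOF) break;
            fputc(c, out);
        }
        fclose(out);
    }

    int main(void)
    {
        const long   bitrate = 128000;  /* assumption: a 128 kbit/s CBR file  */
        const double seconds = 60.0;    /* requested length of the first part */
        long cut = (long)(seconds * bitrate / 8.0);

        FILE *in = fopen("input.mp3", "rb");   /* placeholder file name */
        if (!in) { perror("fopen"); return 1; }
        copy_range(in, "part1.mp3", 0, cut);
        copy_range(in, "part2.mp3", cut, -1);
        fclose(in);
        return 0;
    }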
If you need more info, just ask - I am on mobile and can get you more info later.
If you want to be extra precise when cutting, you can use some library to parse the mp3 frame headers, and then write out exactly the frames you need. That way - as with the method mentioned before - frame alignment is the best precision you'll get, and you have to live with that; that's roughly 40 ms.
If that isn't enough, you must decode mp3 to PCM, split at sample boundary, then recompress to mp3 again.
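If you take the frame-parsing route, the frame header is four bytes beginning with an 11-bit sync pattern, and for MPEG-1 Layer III the frame length in bytes is 144 * bitrate / sample_rate, plus one if the padding bit is set. Here is a hedged sketch that walks a file frame by frame under the assumption that it is MPEG-1 Layer III only (no ID3 tags, no MPEG-2/2.5); the file name is a placeholder:

    #include <stdint.h>
    #include <stdio.h>

    static const int kbps[16] = {0,32,40,48,56,64,80,96,112,128,160,192,224,256,320,0};
    static const int hz[4]    = {44100, 48000, 32000, 0};

    /* Frame length in bytes, or 0 if these four bytes aren't a valid
       MPEG-1 Layer III frame header. */
    static long frame_len(const uint8_t h[4])
    {
        if (h[0] != 0xFF || (h[1] & 0xFE) != 0xFA)  /* sync + MPEG-1 + Layer III */
            return 0;
        int br = kbps[h[2] >> 4], sr = hz[(h[2] >> 2) & 3], pad = (h[2] >> 1) & 1;
        if (!br || !sr) return 0;
        return 144L * br * 1000 / sr + pad;
    }

    int main(void)
    {
        FILE *f = fopen("input.mp3", "rb");   /* placeholder file name */
        if (!f) { perror("fopen"); return 1; }
        uint8_t h[4];
        long frames = 0;
        while (fread(h, 1, 4, f) == 4) {
            long len = frame_len(h);
            if (!len) { fseek(f, -3, SEEK_CUR); continue; }  /* resync one byte on  */
            fseek(f, len - 4, SEEK_CUR);                     /* hop over frame body */
            frames++;
        }
        fclose(f);
        /* Each Layer III frame holds 1152 samples: ~26 ms at 44.1 kHz. */
        printf("%ld frames\n", frames);
        return 0;
    }

To actually split, you'd accumulate each frame's duration (1152.0 / sample_rate seconds) while walking, and cut at the byte offset of the first frame past the requested duration.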
Good luck
P.S.
When you say split, I hope you don't expect the two parts to play one after another with no audible artifacts. MP3 frames aren't self-sufficient: they carry information from the frame before (the bit reservoir).
I try to grok that: Apple is talking about "packets" in audio files, and there is a fancy function called AudioFileReadPackets which takes a lot of arguments. One of them specifies the "start packet", and another one the number of packets which you want to read.
So I imagine an audio file to look like this, internally: it's made up of a lot of packets. If it's an audio file in a variable bit rate format, then every packet may have a different size. If the file is in a constant bit rate format, then every packet is the same size. So an audio file is like a truck full of boxes, and every box contains some interesting stuff.
Is that correct? Does it apply to any kind of file? Is this what files actually look like?
The question (even with the "especially audio files" qualification) is far too broad; different file formats are, well, different!
So to answer the question you will first have to specify a particular file type; then the answer to the question will invariably to look at its specification. Proprietary formats may not have a publicly available specification.
Specifications for many files (official and reverse engineered) can be found at the brilliant Wotsit's Format site.
AAC used by Apple iTunes and others is defined by ISO/IEC 13818-7:2006. The document will cost you 252 Swiss Francs (about US$233)! You'd have to be really interested (commercially) to pay that rather than use an existing AAC Codec.
"Packet" is a term commonly used in data transmission, so may be more applicable to audio streaming than audio files, where a "frame" may be more appropriate, or for data files in general a "record", but the terminology is flexible because it means whatever the person that wrote it thought it meant! If enough people misuse a term, it essentially becomes redefined (or multiply defined) to mean that, so I would not get too hung up on that. The author was do doubt using it to define a unit that has a defined format within a file that has multiple such units repeated sequentially.
"Packet" looks to me like Apple-specific terminology. I just did a lot of reading and coding to process WAV and MP3 files and I don't believe I saw the term "packet" once.
Files contain whatever the application that created them chose to place in them. Files are essentially a sequence of bytes; any further organisation is a semantic distinction made by the program that created them. It is wrong to assume that all files share a common structure.
That said, certain data storage problems are similar enough to be solved in similar ways, and patterns start to emerge. Splitting data into records or packets is an example of that.
That's pretty much what audio files look like: a series of chunks of data, or frames. AudioFileReadPacketData and AudioFileReadPackets shield you from the details of, for instance, how big a frame might be in bytes (because you might be reading from a WAV file, which has a different structure to an MP3 file, or your MP3 file uses a variable bit rate).
The concept of frames doesn't apply in general to any file, but then you wouldn't be using the Audio File Services API to access just any old file.
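For a feel of the API, here is a hedged sketch using AudioToolbox's AudioFileReadPackets (since superseded by AudioFileReadPacketData). The buffer size is a guess chosen for illustration; for real VBR data you would size it from the file's maximum packet size:

    #include <AudioToolbox/AudioToolbox.h>

    /* Read up to 64 packets starting at packet 0; the API works out how
       big each packet is on disk from the file's own structure. */
    void read_some_packets(CFURLRef url)
    {
        AudioFileID file;
        if (AudioFileOpenURL(url, kAudioFileReadPermission, 0, &file) != noErr)
            return;

        UInt32 numPackets = 64;
        UInt32 numBytes   = 64 * 1024;
        char   buffer[64 * 1024];                /* illustrative size only */
        AudioStreamPacketDescription descs[64];  /* filled in for VBR data */

        if (AudioFileReadPackets(file, 0 /* don't cache */, &numBytes, descs,
                                 0 /* start packet */, &numPackets, buffer) == noErr) {
            /* numPackets and numBytes now report what was actually read;
               for VBR data, descs[i] gives each packet's offset and size. */
        }
        AudioFileClose(file);
    }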
For MP3 (and MP1, MP2) the file consists of frames. And yes, your understanding is correct: in VBR files the packets have different sizes. In WAV files the packets have the same length, if memory serves (I wrote a decoder/player 11 years ago).
I'm working on a C library that reads tag information from music files. I've already got ID3v2 taken care of, but I can't figure out how Ogg files are structured.
I opened a .ogg file in a hex editor and I could find the tag data, because that part was human-readable. But everything from the beginning of the file to the tag data looked like garbage. How is this data encoded?
I don't need any help with the actual code, I just need help visualizing what an Ogg header looks like and what encoding it uses, so that I can read it. I'd like to use a non-hacky approach to reading Ogg files.
I've been looking at the FLAC format, which has been helpful.
The FLAC file I'm looking at has about 350 bytes between the "fLaC" identifier and the human-readable comments section, and none of it is human-readable in my hex editor, so I'm sure there has to be something important in there.
I'm using Linux, and I have no intention of porting to Windows or OS X. So if I need to use a glibc only function to convert the encoding, I'm fine with that.
The Ogg file format is documented here. There is a very nice graphical visualization as you requested with a detailed written description.
You may also want to look at libogg, which is an open-source BSD-licensed library for reading and writing Ogg files.
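As for the "garbage" before the tags: every Ogg page begins with a fixed 27-byte header (the capture pattern "OggS" followed by little-endian binary fields), then a segment table and the compressed packet data, which is not meant to be human-readable. A minimal sketch of decoding that page header (the file name is a placeholder):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint32_t le32(const uint8_t *p)
    {
        return p[0] | (uint32_t)p[1] << 8 | (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
    }

    int main(void)
    {
        uint8_t h[27];
        FILE *f = fopen("input.ogg", "rb");   /* placeholder file name */
        if (!f || fread(h, 1, 27, f) != 27 || memcmp(h, "OggS", 4) != 0) {
            fprintf(stderr, "not an Ogg page\n");
            return 1;
        }
        printf("version:       %u\n", h[4]);
        printf("header type:   0x%02x\n", h[5]);   /* 0x02 = first page     */
        printf("serial number: %u\n", le32(h + 14));
        printf("page sequence: %u\n", le32(h + 18));
        printf("segments:      %u\n", h[26]);      /* segment table follows */
        fclose(f);
        return 0;
    }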
As is described in the link you provided, the following metadata blocks can occur between the "fLaC" marker and the VORBIS_COMMENT metadata block.
STREAMINFO: This block has information about the whole stream, like sample rate, number of channels, total number of samples, etc. It must be present as the first metadata block in the stream. Other metadata blocks may follow, and ones that the decoder doesn't understand, it will skip.
APPLICATION: This block is for use by third-party applications. The only mandatory field is a 32-bit identifier. This ID is granted upon request to an application by the FLAC maintainers. The remainder of the block is defined by the registered application. Visit the registration page if you would like to register an ID for your application with FLAC.
PADDING: This block allows for an arbitrary amount of padding. The contents of a PADDING block have no meaning. This block is useful when it is known that metadata will be edited after encoding; the user can instruct the encoder to reserve a PADDING block of sufficient size so that when metadata is added, it will simply overwrite the padding (which is relatively quick) instead of having to insert it into the right place in the existing file (which would normally require rewriting the entire file).
SEEKTABLE: This is an optional block for storing seek points. It is possible to seek to any given sample in a FLAC stream without a seek table, but the delay can be unpredictable since the bitrate may vary widely within a stream. By adding seek points to a stream, this delay can be significantly reduced. Each seek point takes 18 bytes, so 1% resolution within a stream adds less than 2k. There can be only one SEEKTABLE in a stream, but the table can have any number of seek points. There is also a special 'placeholder' seekpoint which will be ignored by decoders but which can be used to reserve space for future seek point insertion.
Just after the above description, there's also the specification of the format of each of those blocks. The link also says
All numbers used in a FLAC bitstream are integers; there are no floating-point representations. All numbers are big-endian coded. All numbers are unsigned unless otherwise specified.
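For orientation, here is a minimal sketch that walks those metadata block headers - each is 4 bytes: a 1-bit "last block" flag, a 7-bit block type (4 = VORBIS_COMMENT) and a 24-bit big-endian length - and reports where the comment block sits. (One caveat the spec notes: field lengths inside the VORBIS_COMMENT block itself are little-endian.) The file name is a placeholder:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("input.flac", "rb");   /* placeholder file name */
        uint8_t marker[4], hdr[4];
        if (!f || fread(marker, 1, 4, f) != 4 || memcmp(marker, "fLaC", 4) != 0) {
            fprintf(stderr, "not a FLAC file\n");
            return 1;
        }
        for (;;) {
            if (fread(hdr, 1, 4, f) != 4) break;
            int      last = hdr[0] >> 7;
            int      type = hdr[0] & 0x7F;
            uint32_t len  = (uint32_t)hdr[1] << 16 | (uint32_t)hdr[2] << 8 | hdr[3];
            printf("block type %d, %u bytes%s\n", type, len,
                   type == 4 ? "  <- VORBIS_COMMENT (the tags)" : "");
            if (last) break;
            fseek(f, len, SEEK_CUR);   /* skip the block body */
        }
        fclose(f);
        return 0;
    }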
So, what are you missing? You say
I'd like to use a non-hacky approach to reading Ogg files.
Why re-write a library to do that when they already exist?