Forced alignment using Aeneas with multiple Aeneas text files - dataset

We have started a project to create a Turkish speech recognition dataset to use with DeepSpeech.
We have finished the preprocessing of the ebook, but we couldn't finish the forced alignment process with Aeneas.
According to its tutorials, forced alignment needs a text file and its recorded audio file. While preprocessing the ebook we created 430 text files, edited and cleaned into the Aeneas format (divided into paragraphs and sentences using the nltk library).
However, while processing our Task object and creating its output file (a JSON file), we couldn't merge the output files: for every Aeneas text file, the alignment starts from the beginning of the audio file.
It seems we need to split our audio file into 430 parts, but that is not an easy process.
I tried to merge the JSON files with:
import json
import glob

result = []
for f in glob.glob("*.json"):
    with open(f, "rb") as infile:
        result.append(json.load(infile))

with open("merged_file.json", "w") as outfile:
    json.dump(result, outfile)
But it didn't work, because during the forced alignment process Aeneas starts from the beginning of the audio file for each Aeneas text file.
Is it possible to create a Task object that includes all 430 Aeneas text files and appends them into one output file (a JSON file) with correct timings (their seconds), using one audio file?
Our Task object:
from aeneas.task import Task

# create Task object
config_string = "task_language=tur|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config_string)
task.audio_file_path_absolute = "/content/gdrive/My Drive/TASR/kitaplar/nutuk/Nutuk_sesli.mp3"
task.text_file_path_absolute = "/content/gdrive/My Drive/TASR/kitaplar/nutuk/nutuk_aeneas_data_1.txt"
task.sync_map_file_path_absolute = "/content/gdrive/My Drive/TASR/kitaplar/nutuk/syncmap.json"
Btw, we are working on Google Colab with Python 3.

I figured out how to solve my question and found a solution.
Instead of combining the JSON files, I could combine the Aeneas text files with this code:
with open("/content/gdrive/My Drive/TASR/kitaplar/{0}/{1}/{2}_aeneas_data_all.txt".format(book_name,chapter,
book_name), "wb") as outfile:
for i in range(1,count-1):
file_name = "/content/gdrive/My Drive/TASR/kitaplar/{0}/{1}/{2}_aeneas_data_{3}.txt".format(book_name, chapter, book_name, str(i))
#print(file_name)
with open(file_name, "rb") as infile:
outfile.write(infile.read())
So after combining the Aeneas text files, I can create a single JSON file which contains all the paragraphs.
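For reference, a minimal sketch of running the alignment once over the combined text file (the combined file name follows the naming pattern above for book_name "nutuk"; the output name syncmap_all.json is an assumption):
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# same Task configuration as above, but pointing at the combined text file
config_string = "task_language=tur|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config_string)
task.audio_file_path_absolute = "/content/gdrive/My Drive/TASR/kitaplar/nutuk/Nutuk_sesli.mp3"
task.text_file_path_absolute = "/content/gdrive/My Drive/TASR/kitaplar/nutuk/nutuk_aeneas_data_all.txt"
task.sync_map_file_path_absolute = "/content/gdrive/My Drive/TASR/kitaplar/nutuk/syncmap_all.json"

# run the aligner once over the whole audio and write the single sync map
ExecuteTask(task).execute()
task.output_sync_map_file()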

Related

Reading in the output of a program, saving it as a string, and using that string in the original program

I am trying to get specific information from a group of MP3 files. Currently I am in the main cygwin64 directory, which holds the MP3 files and a .c file that simply contains
FILE * fp;
It contains that single line of code because, when that line is in place and I run "thing.c" on the Cygwin command line, it outputs what seems to be information about the contents of the folder. For example, it outputs:
home: sticky, directory
lib: directory
sbin: directory
setup-x86_64.exe: PE32+ executable (GUI) x86-64 (stripped to external PDB), for MS Windows
song.mp3: Audio file with ID3 version 2.3.0, contains: MPEG ADTS, layer III, v1, 128 kbps, 44.1 kHz, JntStereo
song1.mp3: Audio file with ID3 version 2.3.0, contains: MPEG ADTS, layer III, v1, 128 kbps, 44.1 kHz, JntStereo
thing.c: ASCII text, with CRLF line terminators
thing.txt: empty
What I want to do is pull that output into a string that I can use in my C program, alter, and then print back out in its altered form. However, I'm not sure where the output is really coming from, or how I might capture it, save it to a .txt file, or get it back into a C program.
Any advice is appreciated. Thanks!
This file is not really a C file at all. Because you're in Cygwin, you're likely operating on a case-insensitive filesystem (NTFS). As such, Cygwin's file command is running when you run the .c file. The way you've attempted to declare a variable (apparently) just so happens to be doing a 'file * fp' command. I'm sure you're getting fp: Cannot open "fp" or something similar after the rest of your output.
This is not anything C-related at all but is just being interpreted as a script by your shell.
It sounds like you have a lot to learn if you want to do this in C. More likely, you can probably write a shell script to accomplish what you want. While I've never used it, mp3info (https://github.com/jaalto/cygwin-package--mp3info) exists for pulling tag information from MP3 files. You could possibly get the exact information you want from that, or pipe the output into sed, awk, or a number of other tools.
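If a script is acceptable, here is a minimal Python sketch (Python is what the rest of this thread uses) of capturing the file command's output as a string, altering it, and writing it back out; the alteration and the output file name are purely illustrative:
import glob
import subprocess

# run `file` on every MP3 in the current directory and capture stdout as one string
result = subprocess.run(["file"] + glob.glob("*.mp3"),
                        capture_output=True, text=True)
info = result.stdout

# alter the captured text, then print it and save it (hypothetical alteration)
altered = info.replace("JntStereo", "Joint Stereo")
print(altered)
with open("mp3_info.txt", "w") as f:
    f.write(altered)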

File extension detection mechanism

How does an application detect a file's format?
I know that every file has a header that contains information related to that file.
My question is: how does an application use that header to detect the file type?
Every file in a file system has some metadata associated with it. For example, if I change an audio file's extension from .mp3 to .txt and then open that file with VLC, VLC is still able to play it.
I found out that every file has a header section which contains all the information related to that file.
I want to know: how can I access that header?
Just to give you some more details:
A file extension is basically a way to indicate the format of the data (for example, TIFF image files have a format specification).
This way an application can check if the file it handles is of the right format.
Some applications don't check the file format (or accept wrong formats) and just try to use the file as the format they need. So for your .mp3 file, the data in the file is not changed when you simply change the extension to .txt.
When VLC reads the .txt file byte by byte and interprets it as an .mp3, it can still extract the correct music data from it.
Now, some files include a header for extra validation of what kind of format the data inside the file is in. For example, a Unicode text file should include a BOM to indicate how the data in the file needs to be handled. This way an application can check whether the header tag matches the expected header, so it knows for sure that your .txt file actually contains data in the mp3 format.
There are quite a few applications for reading those header tags, but they are often specific to each format. This TIFF Tag Viewer, for example (I used it in the past to check the header tags of my TIFF files).
So you can either open your file with some kind of hex viewer and look up in the format specification what every byte means, or search for a header viewer for the format you're interested in.
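To illustrate, a minimal Python sketch that reads a file's leading "magic" bytes and compares them against a few well-known signatures (the signature table is just a small sample):
# identify a file by its leading "magic" bytes instead of its extension
SIGNATURES = {
    b"ID3": "MP3 audio with an ID3v2 tag",
    b"\x89PNG": "PNG image",
    b"%PDF": "PDF document",
    b"\xef\xbb\xbf": "UTF-8 text with a BOM",
}

def sniff(path):
    with open(path, "rb") as f:
        head = f.read(8)  # the first few bytes are enough for these signatures
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return "unknown (no matching signature)"

# a renamed .mp3 still reports as MP3, which is why VLC can play it
print(sniff("song.txt"))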

How to extract data from multiple files with Python?

I am new to Python, which is also my first programming language. I have a set of .txt files (academic papers) and I need to extract the paper ID (e.g. ID: a1111111) and the abstract (e.g. ABSTRACT: .....). I have no idea how to extract this data from multiple files across multiple folders. Thanks a lot!
So your question has two parts: reading files and accessing folders.
Reading files
The methods/objects used in Python for reading files are covered in chapter 7 of Python's documentation:
http://docs.python.org/2/tutorial/inputoutput.html
The basic gist is that you use the open function to access files that are in the same directory:
f = open('stuff.txt', 'r')
where stuff.txt is the name of a file in the same directory as your Python file.
Calling print f.read() will display the text of the file (as a string). Feel free to assign f.read() to a variable to capture the data:
>>> x = f.read()
>>> print x
This is the entire file.\n
It's best to read the documentation for all these methods, because there are subtleties. For example, calling f.read() once will return the entire file contents to you, but calling f.read() again will return an empty string, as the end of the file has been reached.
Accessing Folders
Can you explain how exactly you'd like to access folders? In this case, it would be much easier to just put all your files in the same directory you are running your Python file from.
However, the basic way to move around in Python is os.chdir(path), which is basically cd'ing around. You must import os before you use it.
Leave a comment if you'd like some more information
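Putting the two parts together, a minimal sketch for the original question (the "papers" root folder and the exact ID:/ABSTRACT: line formats are assumptions based on the examples given):
import os
import re

# walk a folder tree and pull the ID and abstract out of every .txt file
papers = []
for dirpath, dirnames, filenames in os.walk("papers"):
    for name in filenames:
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(dirpath, name)) as f:
            text = f.read()
        paper_id = re.search(r"^ID:\s*(\S+)", text, re.MULTILINE)
        abstract = re.search(r"^ABSTRACT:\s*(.+)", text, re.MULTILINE)
        if paper_id and abstract:
            papers.append((paper_id.group(1), abstract.group(1)))

for pid, abstract in papers:
    print(pid, "->", abstract[:60])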

Combine a binary file and a .txt file into a single file in Python

I have a binary file (.bin) and a (.txt) file.
Using Python 3, is there any way to combine these two files into one file (WITHOUT using any compressor tool, if possible)?
And if I have to use a compressor, I want to do it with Python.
As an example, I have 'file.txt' and 'file.bin'; I want a library that takes these two and gives me one file, and that is also able to un-merge the file.
Thank you
Just create a tar archive; a module that lets you accomplish this task is already bundled with CPython, and it's called tarfile.
More examples here.
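A minimal sketch of the merge/un-merge round trip with tarfile, using the file names from the question (the archive name and output directory are arbitrary):
import tarfile

# merge: bundle both files into a single uncompressed archive
with tarfile.open("merged.tar", "w") as tar:
    tar.add("file.txt")
    tar.add("file.bin")

# un-merge: extract both files back out
with tarfile.open("merged.tar", "r") as tar:
    tar.extractall("unpacked")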
There are a lot of solutions for compressing!
gzip or zlib allow compression and decompression and could be a solution for your problem.
Example of how to gzip-compress an existing file, from http://docs.python.org:
import gzip

with open('file.txt', 'rb') as f_in:
    with gzip.open('file.txt.gz', 'wb') as f_out:
        f_out.writelines(f_in)
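The reverse direction, decompressing back to a plain file, is just as short with shutil (the output file name is arbitrary):
import gzip
import shutil

# decompress file.txt.gz back into an ordinary file
with gzip.open('file.txt.gz', 'rb') as f_in:
    with open('file_restored.txt', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)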
But tarfile is also a good solution!
Tar is the best solution if the output can be a binary file.
If you want the output to be text, you can use base64 to transform the binary file into text data, then concatenate the two into one file (using some unique string, or another technique, to mark the point where they were merged).
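A minimal sketch of that base64 approach (the marker string and file names are arbitrary choices, not from the original answer):
import base64

MARKER = "\n---BIN-PART---\n"  # assumed unique; must not occur in file.txt

# merge: plain text first, then the base64-encoded binary after the marker
with open("file.txt") as txt, open("file.bin", "rb") as bin_in:
    merged = txt.read() + MARKER + base64.b64encode(bin_in.read()).decode("ascii")
with open("merged.txt", "w") as out:
    out.write(merged)

# un-merge: split on the marker and decode the binary half
with open("merged.txt") as f:
    text_part, b64_part = f.read().split(MARKER, 1)
with open("file_restored.bin", "wb") as out:
    out.write(base64.b64decode(b64_part))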

Concatenate a large number of HDF5 files

I have about 500 HDF5 files, each about 1.5 GB.
Each file has exactly the same structure: 7 compound (int, double, double) datasets with a variable number of samples.
Now I want to concatenate all these files by concatenating each of the datasets, so that in the end I have a single 750 GB file with my 7 datasets.
Currently I am running an h5py script which:
- creates an HDF5 file with the right datasets of unlimited max size
- opens all the files in sequence
- checks the number of samples (as it is variable)
- resizes the global file
- appends the data
This obviously takes many hours; would you have a suggestion for improving it?
I am working on a cluster, so I could use HDF5 in parallel, but I am not good enough at C programming to implement something myself; I would need a tool that is already written.
I found that most of the time was spent in resizing the file, as I was resizing at each step, so I now first go through all my files and get their lengths (which are variable).
Then I create the global h5 file, setting the total length to the sum of all the files.
Only after this phase do I fill the h5 file with the data from all the small files.
Now it takes about 10 seconds per file, so it should take less than 2 hours, while before it was taking much more.
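For a single dataset, that two-pass strategy looks roughly like this in h5py (the dataset name "data", the input file pattern, and the output name are assumptions; the real files have 7 compound datasets):
import glob
import h5py

files = sorted(glob.glob("part-*.h5"))  # hypothetical input file names

# pass 1: sum the (variable) number of samples across all files
total = 0
for name in files:
    with h5py.File(name, "r") as f:
        total += f["data"].shape[0]
        dtype = f["data"].dtype  # identical in every file

# pass 2: create the full-size dataset once, then copy each file into its slice
with h5py.File("concatenated.h5", "w") as out:
    dset = out.create_dataset("data", shape=(total,), dtype=dtype)
    offset = 0
    for name in files:
        with h5py.File(name, "r") as f:
            n = f["data"].shape[0]
            dset[offset:offset + n] = f["data"][...]
            offset += n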
I get that answering this earns me a necro badge - but things have improved for me in this area recently.
In Julia this takes a few seconds.
Create a txt file that lists all the HDF5 file paths (you can use bash to do this in one go if there are lots)
In a loop, read each line of the txt file and use label$i = h5read(original_filepath$i, "/label")
Concatenate all the labels: label = [label label$i]
Then just write: h5write(data_file_path, "/label", label)
The same can be done if you have groups or more complicated HDF5 files.
Ashley's answer worked well for me. Here is an implementation of her suggestion in Julia:
Make a text file listing the files to concatenate, in bash:
ls -rt $somedirectory/$somerootfilename-*.hdf5 >> listofHDF5files.txt
Write a julia script to concatenate multiple files into one file:
# concatenate_HDF5.jl
using HDF5

inputfilepath = ARGS[1]
outputfilepath = ARGS[2]

f = open(inputfilepath)
firstit = true
data = []
for line in eachline(f)
    r = strip(line, ['\n'])
    print(r, "\n")
    datai = h5read(r, "/data")
    if firstit
        data = datai
        firstit = false
    else
        data = cat(4, data, datai)  # in this case, concatenating on the 4th dimension
    end
end
h5write(outputfilepath, "/data", data)
Then execute the script file above using:
julia concatenate_HDF5.jl listofHDF5files.txt final_concatenated_HDF5.hdf5
