Opening a VCF.bgz file in Python

I have downloaded some data from gnomad - https://gnomad.broadinstitute.org/downloads.
It comes in the form of VCF.bgz file and I would like to read it as a vcf file.
I found some code here: Partially expand VCF bgz file in Linux, by @rnorris:
import gzip

ifile = gzip.GzipFile("gnomad.genomes.r2.1.1.sites.2.vcf.bgz")
ofile = open("truncated.vcf", "wb")

LINES_TO_EXTRACT = 100000
for line in range(LINES_TO_EXTRACT):
    ofile.write(ifile.readline())

ifile.close()
ofile.close()
I tried it on my data and got:
Not a gzipped file (b'TB')
Is there any way to fix it? I don't understand what the problem is.

"Not a gzipped file" means it's not a gzipped file. Either it was corrupted, downloaded incorrectly, or isn't a gzip file in the first place. A gzip file starts with b'\x1f\x8b'. Not b'TB'.
A .bgz file should be a gzip file, so you likely did not download what you think you downloaded. How did you download it?
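As a quick check, you can inspect the leading bytes yourself; a minimal sketch, using the filename from the question:

# A gzip (and therefore bgzip) stream must start with the magic
# bytes 0x1f 0x8b; anything else is not gzip data.
with open("gnomad.genomes.r2.1.1.sites.2.vcf.bgz", "rb") as f:
    magic = f.read(2)
print("gzip" if magic == b"\x1f\x8b" else "not gzip, starts with %r" % (magic,))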

The gnomAD datasets should definitely be valid block-gzipped VCF files. Regardless, I think you might have a better time using Hail (disclaimer: I'm a Hail maintainer).
Hail can read directly out of Google Cloud Storage if you install the GCS connector.
curl https://broad.io/install-gcs-connector | bash
Then in Python:
import hail as hl

ht = hl.read_table(
    'gs://gcp-public-data--gnomad/release/2.1.1/ht/exomes/gnomad.exomes.r2.1.1.sites.ht'
)
ht = ht.head(100_000)
sites = ht.collect()
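If you specifically want to work with the VCF itself rather than the Hail-native table, Hail can also import a block-gzipped VCF directly. A minimal sketch, assuming the local file from the question and the GRCh37 reference (gnomAD v2 is on GRCh37):

import hail as hl

# force_bgz tells Hail to treat the .bgz file as block-gzipped
mt = hl.import_vcf(
    'gnomad.genomes.r2.1.1.sites.2.vcf.bgz',
    force_bgz=True,
    reference_genome='GRCh37',
)
mt.rows().show(5)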

Related

Checking that a file is not truncated

I have downloaded many gz files from an FTP address:
http://ftp.ebi.ac.uk/pub/databases/spot/eQTL/sumstats/
How can I check whether the files have been truncated during the download (i.e. wget did not download the entire file because of a network problem)? Thanks.
As you can see, each directory contains a file md5sum.txt.
You can use a command like:
md5sum -c md5sum.txt
This will calculate the hashes and compare them with the values in the file.
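If you would rather do the same check from Python, here is a rough equivalent, assuming md5sum.txt has the usual '<hash>  <filename>' layout:

import hashlib

# Recompute each file's MD5 and compare with the expected hash.
with open("md5sum.txt") as listing:
    for line in listing:
        expected, filename = line.split()
        md5 = hashlib.md5()
        with open(filename, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                md5.update(chunk)
        print(filename, "OK" if md5.hexdigest() == expected else "FAILED")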
How can I check whether the files have been truncated during the download (i.e. wget did not download the entire file because of a network problem)?
You can use wget's spider mode to fetch just the response headers, for example:
wget --spider http://ftp.ebi.ac.uk/pub/databases/spot/eQTL/sumstats/Alasoo_2018/exon/Alasoo_2018_exon_macrophage_naive.permuted.tsv.gz
which gives the output:
Spider mode enabled. Check if remote file exists.
--2022-05-30 09:38:55-- http://ftp.ebi.ac.uk/pub/databases/spot/eQTL/sumstats/Alasoo_2018/exon/Alasoo_2018_exon_macrophage_naive.permuted.tsv.gz
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.193.138|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 645718 (631K) [application/octet-stream]
Remote file exists.
Length is the size of the file in bytes, so by comparing it with your local file's size you can tell whether the download is complete.
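The same comparison is easy to script; a small sketch using requests, with the URL from the example above:

import os
import requests

url = ("http://ftp.ebi.ac.uk/pub/databases/spot/eQTL/sumstats/"
       "Alasoo_2018/exon/Alasoo_2018_exon_macrophage_naive.permuted.tsv.gz")

# A HEAD request fetches only the headers, like wget --spider.
remote_size = int(requests.head(url).headers["Content-Length"])
local_size = os.path.getsize(url.split("/")[-1])
print("complete" if local_size == remote_size else "truncated")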
If you want to download any missing parts, rather than merely check for completeness, take a look at the -c option. From the wget man page:
-c
--continue
Continue getting a partially-downloaded file. This is useful when you want to finish up a download started by a previous instance of
Wget, or by another program.(...)

How to extract (download) data from an IBM Watson notebook

I have a Pandas dataframe in a notebook running in IBM Watson. I need to download the dataframe as a CSV file.
Python file read and write operations work in the notebook environment. If you want to run a system command, prefix it with !; for example, to list files you can do:
!ls -la
So my approach was to create the file in local storage, encode it in base64, and build a download link:
from IPython.display import HTML
import base64

def create_download_link(dataframe, title="Download CSV file", filename="myout222.csv"):
    csv = dataframe.to_csv()              # create the CSV text
    b64 = base64.b64encode(csv.encode())  # encode it as base64
    payload = b64.decode()                # set the payload
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
    html = html.format(payload=payload, title=title, filename=filename)
    return HTML(html)                     # return the link
Now call the function: create_download_link(your_dataframe).
Instead of a dataframe, you can download any file this way by just reading it and encoding its contents, as in the sketch below.
Since system commands work, you can also upload files to a separate server using curl, and download files into local storage with wget.
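For example, a minimal sketch of the same trick for an arbitrary file (the filename here is just an illustration):

from IPython.display import HTML
import base64

def create_file_download_link(path, title="Download file"):
    # Read the raw bytes and embed them in a data: URI, as above.
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode()
    html = '<a download="{name}" href="data:application/octet-stream;base64,{payload}" target="_blank">{title}</a>'
    return HTML(html.format(name=path, payload=payload, title=title))

create_file_download_link("myout222.csv")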

Gzip Files: Extracting Does Not Work as Expected

I'm facing this very strange problem when working with gzip files. I'm trying to download this file https://www.sec.gov/Archives/edgar/daily-index/2014/QTR2/master.20140402.idx.gz
When I view the contents of the file inside the archive, it is perfect.
However when I unzip the contents and try to see them, it is all gibberish.
Is something wrong with the file, or am I missing something here?
If I remember correctly, an .idx file is often a Java index file, but it can also be a plain-text index format, which is what it is in this case.
On Linux, try running
gunzip master.20140402.idx.gz
This will extract it into an idx file, which you should be able to open with any text reader, such as vi, since vi can open pretty much anything.
On Windows, you can, from the command line, use WinZip, with:
wzunzip -d master.20140402.idx.gz
You can then use something like IE, Edge, or WordPad to examine the file; that should automagically load it in a readable form.
EDIT:
So, I downloaded the file and was able to extract and view it in vi, IE, and WordPad using the commands above, so if you are seeing gibberish, try redownloading it. It should be about 104 KB in .gz format and 533 KB extracted.
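If you want to rule out the extractor entirely, you can read the archive straight from Python; a minimal sketch with the filename from the question (the latin-1 encoding is an assumption):

import gzip

# Print the first few lines of the compressed index file; if these
# look right, the .gz archive itself is intact.
with gzip.open("master.20140402.idx.gz", "rt", encoding="latin-1") as f:
    for _ in range(10):
        print(f.readline().rstrip())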

Cannot open downloaded PDF files

I'm trying to download needed PDFs related to a researcher.
But the downloaded PDFs can't be opened; the viewer says the files may be damaged or in the wrong format. Another URL used as a test resulted in normal PDF files. Do you have any suggestions?
import requests
from bs4 import BeautifulSoup

def download_file(url, index):
    local_filename = index+"-"+url.split('/')[-1]
    # NOTE the stream=True parameter
    r = requests.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
                f.flush()
    return local_filename

# For Test: http://ww0.java4.datastructures.net/handouts/
# Can't open: http://flyingv.ucsd.edu/smoura/publications.html
root_link="http://ecal.berkeley.edu/publications.html#journals"
r=requests.get(root_link)
if r.status_code==200:
    soup=BeautifulSoup(r.text)
    # print soup.prettify()
    index=1
    for link in soup.find_all('a'):
        new_link=root_link+link.get('href')
        if new_link.endswith(".pdf"):
            file_path=download_file(new_link,str(index))
            print "downloading:"+new_link+" -> "+file_path
            index+=1
    print "all download finished"
else:
    print "errors occur."
Your code has a comment saying:
# Can't open: http://flyingv.ucsd.edu/smoura/publications.html
Looks like what you can't open is an HTML file. So no wonder that a PDF reader will complain about it...
For any real PDF link that I had a problem with, I would proceed as follows:
- Download the file with a different method (wget, curl, a browser, ...). Can you download it at all, or is there some password hoop to be jumped through? Is the download fast and complete? Does it then open in a PDF viewer?
- If so, compare it to the file your script downloaded. What are the differences? Could they be caused by your script? Are there no differences within the first few hundred lines, but differences later? Is the end of the file a bunch of nul-bytes? Then your download didn't complete...
- If not, still compare the differences. If there are none, your script is not at fault; the PDF may really be corrupted...
- What does it look like when opened in a text editor?
A quick first check along these lines is sketched below.
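A minimal sketch of such a check, assuming a downloaded name like the ones the script produces:

import os

def looks_like_pdf(path):
    # A real PDF starts with "%PDF-" and normally ends with "%%EOF".
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(5)
        f.seek(max(size - 1024, 0))
        tail = f.read()
    return head == b"%PDF-" and b"%%EOF" in tail

print(looks_like_pdf("1-publications.pdf"))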

Ruby Zlib: What is the purpose of orig_name=?

My code is like this:
gz = Zlib::GzipWriter.open('test.zip')
gz.orig_name = "test.csv"
gz.write("testing writing to zipped file")
gz.close
What I am trying to do: when extracting test.zip with an archive-extractor application, it should be unzipped to test.csv.
I used the orig_name method thinking that when I extract the archive with another extractor, such as Archive Utility, the resulting file would be 'test.csv'. But the file is still 'test'.
If by "other zip extractor" you mean the gzip utility, you'd need to use the -N option of gzip to use the name stored in the gzip header. Otherwise it will just use the compressed file name with the .gz removed.
