Can not open downloaded PDF files - file

I'm trying to download the PDFs related to a researcher.
But the downloaded PDFs can't be opened: the viewer says the files may be damaged or in the wrong format. Another URL that I used as a test produced normal PDF files. Do you have any suggestions?
import requests
from bs4 import BeautifulSoup
def download_file(url, index):
    local_filename = index+"-"+url.split('/')[-1]
    # NOTE the stream=True parameter
    r = requests.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
                f.flush()
    return local_filename
# For Test: http://ww0.java4.datastructures.net/handouts/
# Can't open: http://flyingv.ucsd.edu/smoura/publications.html
root_link="http://ecal.berkeley.edu/publications.html#journals"
r=requests.get(root_link)
if r.status_code==200:
    soup=BeautifulSoup(r.text)
    # print soup.prettify()
    index=1
    for link in soup.find_all('a'):
        new_link=root_link+link.get('href')
        if new_link.endswith(".pdf"):
            file_path=download_file(new_link,str(index))
            print "downloading:"+new_link+" -> "+file_path
            index+=1
    print "all download finished"
else:
    print "errors occur."

Your code has a comment saying:
# Can't open: http://flyingv.ucsd.edu/smoura/publications.html
Looks like what you can't open is an HTML file. So no wonder that a PDF reader will complain about it...
For any real PDF link that I had a problem with I would proceed as follows:
Download file with a different method (wget, curl, a browser, ...).
Can you download it even? Or is there some password hoop to be jumped through?
Is the download fast + complete?
Does it then open in a PDF viewer?
If so, compare to the file your script downloaded.
What are the differences?
Could they be caused by your script?
Are there no differences within the first few hundred lines, but differences later? End of the file being a bunch of nul-bytes? Then your download didn't complete...
If not so, still compare the differences. If there are none, your script is not at fault. The PDF may really be corrupted...
What does it look like, when opened in a text editor?
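Some of these checks can be scripted. The following is only a sketch in Python 3, and the helper names (sniff_file, first_difference) and the relative link "papers/x.pdf" are made up for illustration. It looks at the first bytes of a file to tell a PDF from an HTML page, and it finds the first byte at which two downloads differ. It also shows one likely cause in the script from the question: appending an href to a page URL that ends in "#journals" puts the href inside the URL fragment, which is never sent to the server, so the request fetches the HTML page again. urljoin resolves the link against the page instead.

```python
from urllib.parse import urljoin


def sniff_file(path):
    """Guess from the first bytes whether a file is a PDF or an HTML page."""
    with open(path, "rb") as f:
        head = f.read(1024)
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.lstrip()[:1] == b"<":
        return "html"
    return "unknown"


def first_difference(path_a, path_b, chunk_size=65536):
    """Return the offset of the first differing byte, or None if identical."""
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if chunk_a != chunk_b:
                for i, (x, y) in enumerate(zip(chunk_a, chunk_b)):
                    if x != y:
                        return offset + i
                # One file is a prefix of the other (truncated download).
                return offset + min(len(chunk_a), len(chunk_b))
            if not chunk_a:
                return None
            offset += len(chunk_a)


# Resolving a relative href against the page URL (hypothetical href):
page = "http://ecal.berkeley.edu/publications.html#journals"
resolved = urljoin(page, "papers/x.pdf")
```

If sniff_file reports "html" for a file saved with a .pdf name, the server returned a web page rather than the document, and the URL is the first thing to check.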

Related

Opening a VCF.bgz file in python - edit

I have downloaded some data from gnomad - https://gnomad.broadinstitute.org/downloads.
It comes in the form of VCF.bgz file and I would like to read it as a vcf file.
I found some code here: Partially expand VCF bgz file in Linux, by @rnorris.
import gzip
ifile = gzip.GzipFile("gnomad.genomes.r2.1.1.sites.2.vcf.bgz")
ofile = open("truncated.vcf", "wb")
LINES_TO_EXTRACT = 100000
for line in range(LINES_TO_EXTRACT):
    ofile.write(ifile.readline())
ifile.close()
ofile.close()
I tried it on my data and got:
Not a gzipped file (b'TB')
Is there any way to fix it? I don't understand what the problem is.
"Not a gzipped file" means it's not a gzipped file. Either it was corrupted, downloaded incorrectly, or isn't a gzip file in the first place. A gzip file starts with b'\x1f\x8b'. Not b'TB'.
A .bgz file should be a gzip file, so you likely did not download what you think you downloaded. How did you download it?
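A quick way to see what was actually downloaded is to read the first bytes of the file yourself. This is a minimal sketch, with function names chosen here for illustration; it only checks the two-byte gzip signature mentioned above and does not validate the rest of the file.

```python
GZIP_MAGIC = b"\x1f\x8b"


def is_gzip(path):
    """Return True if the file starts with the two-byte gzip signature."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def first_bytes(path, n=16):
    """Return the first n bytes, useful for eyeballing what the file really is."""
    with open(path, "rb") as f:
        return f.read(n)
```

Printing first_bytes for the downloaded file shows the same leading bytes that the gzip module reports in its error message.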
The gnomAD datasets should definitely be valid block-gzipped VCF files. Regardless, I think you might have a better time using Hail (disclaimer: I'm a Hail maintainer).
Hail can read directly out of Google Cloud Storage if you install the GCS connector.
curl https://broad.io/install-gcs-connector | bash
Then in Python:
import hail as hl
# The path ends in .ht, i.e. a Hail Table, so it is read with
# read_table rather than read_matrix_table.
ht = hl.read_table(
    'gs://gcp-public-data--gnomad/release/2.1.1/ht/exomes/gnomad.exomes.r2.1.1.sites.ht'
)
ht = ht.head(100_000)
sites = ht.collect()

Batch command for a GDAL script

I have the following GDAL command, run from the OSGeo console, that I have tested on one image. I now want to change it to run over every image in one folder and write the output to another folder.
Edit: the code I posted was a test I ran on one image to check visual quality. I was happy with the visual quality after compressing my test image. I now want to apply the script to approx. 1000 images located in one folder and output to another folder.
Edit: I am not sure why I have received a downvote for asking a question in a straightforward manner. I have checked numerous other posts on SO and reddit and have not been able to get any process to work within the QGIS OSGeo framework, and would value some advice.
gdal_translate -co COMPRESS=JPEG -co PHOTOMETRIC=YCBCR -co TILED=YES "D:\split outputs\raster_compression test\EQ_GLNG_Photo_2018_MGA550.tif" "D:\split outputs\raster_compression test\0001_JPEG.tif" -scale -ot Byte
In the QGIS Python console, something like the following should work.
It expects your input and output folders to exist, and it expects the input folder to contain all the tif files directly (not in subfolders):
import os
from osgeo import gdal
input_folder = 'path_to_input'
output_folder = 'path_to_output'
options = gdal.TranslateOptions(
    outputType=gdal.GDT_Byte,
    scaleParams=[''],
    creationOptions=['COMPRESS=JPEG', 'PHOTOMETRIC=YCBCR', 'TILED=YES']
)
for entry in os.listdir(input_folder):
    if entry.endswith('.tif'):
        gdal.Translate(
            os.path.join(output_folder, entry),
            os.path.join(input_folder, entry),
            options=options
        )

Gzip Files: Extracting Does Not Work as Expected

I'm facing this very strange problem when working with gzip files. I'm trying to download this file https://www.sec.gov/Archives/edgar/daily-index/2014/QTR2/master.20140402.idx.gz
When I view the contents of the file inside the archive, it is perfect.
However when I unzip the contents and try to see them, it is all gibberish.
Is something wrong with the file or am I missing to see anything here?
If I remember correctly, an idx file is a Java file. It can also be a plain-text archive format, which it is in this case.
On Linux, try running
gunzip master.20140402.idx.gz
This will extract it into an idx file, which you should be able to open with any text reader, such as vi, since vi can open pretty much anything.
On Windows, you can, from the command line, use WinZip, with:
wzunzip -d master.20140402.idx.gz
You can then use something like IE, Edge, or WordPad to examine the file; any of these should display it in readable form.
EDIT:
So, I downloaded the file, and was able to extract it and view it in vi, IE, and WordPad using my above commands. If you are seeing gibberish, try redownloading it. It should be 104 KB in .gz format, and 533 KB extracted.
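If neither gunzip nor WinZip is at hand, the same extraction can be sketched with Python's standard library. The function name gunzip below is just a label chosen for this example; it streams the decompressed data to a new file and leaves the .gz file in place.

```python
import gzip
import shutil


def gunzip(src, dst):
    """Decompress the gzip file src into dst and return dst."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

For the file in the question, the call would be gunzip("master.20140402.idx.gz", "master.20140402.idx"), after which the .idx file can be opened in a text editor.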

Easy way to make a test PDF of a certain file size

I want to test the upload file size limit in an application, and it's a pain finding or making PDFs of specific sizes to test and debug this. Does anybody have a better way?
You can write a simple shell script that converts a set of images to a PDF (see: How can I convert a series of images to a PDF from the command line on linux?) and run it for 1, 2, 3, ... up to all image files in a certain directory.
Creating a directory full of copies of a single image should be simple too: start with one image file of the desired size, e.g. 64KB.
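# untested sketch; assumes ImageMagick's convert and a source file ./image
END=5
# {1..$END} does not expand with a variable in bash, so use seq instead
for i in $(seq 1 "$END"); do cp ./image "./image_$i"; done
for i in $(seq 1 "$END"); do convert $(seq -f './image_%g' 1 "$i") "mydoc_$i.pdf"; done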
I've found an online tool, however, it seems to not be working correctly since it can only generate 10MB files even though you tell it to make a 50MB file.
https://www.fakefilegenerator.com/generate-file.php
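Another option is to generate the file directly. The sketch below is one way to do that in Python, not an established tool: make_blank_pdf is a name invented here, and it writes a one-page blank PDF by hand, then pads it with a PDF comment line so the result is exactly the requested number of bytes. It is intended to be a valid blank page, but viewers differ, so open one generated file before relying on it.

```python
def make_blank_pdf(size):
    """Return the bytes of a one-page blank PDF that is exactly size bytes long."""
    bodies = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(bodies, start=1):
        offsets.append(len(out))  # byte offset of each object, needed by xref
        out += b"%d 0 obj\n%s\nendobj\n" % (number, body)

    count = len(bodies) + 1
    xref = b"xref\n0 %d\n0000000000 65535 f \n" % count
    for offset in offsets:
        xref += b"%010d 00000 n \n" % offset
    trailer = b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n" % count
    eof = b"\n%%EOF\n"

    # The xref offset is written as decimal text, so its own length affects
    # the total size; try each possible digit count until one is consistent.
    fixed = len(xref) + len(trailer) + len(eof)
    for digits in range(1, 20):
        xref_offset = size - fixed - digits
        if xref_offset >= len(out) and len(str(xref_offset)) == digits:
            break
    else:
        raise ValueError("size too small or not reachable; try size + 1")

    pad = xref_offset - len(out)
    if pad:
        # Padding is a single comment line: '%', spaces, newline.
        out += (b"%" + b" " * pad)[:pad - 1] + b"\n"
    out += xref + trailer + str(xref_offset).encode("ascii") + eof
    return bytes(out)
```

For example, writing make_blank_pdf(5 * 1024 * 1024) to a file opened in "wb" mode gives a 5 MiB test file.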

Importing text files or compressed files in QlikView using a batch file

I have multiple text files in a folder. I need those files to be imported into QlikView on a daily basis. Is there any way to import those files using batch/command file?
Moreover, can I import compressed files into QlikView?
I am not sure how your load script is set up, but if you wish to refresh your QlikView document, and you don't have QlikView Server, then you can use a batch file as follows:
"<Path To QlikView>\QV.exe" /r "ReportToReload.qvw"
The /r command parameter tells QlikView to open the document, reload it and then save and close the document. However, you must make sure that the QlikView User Preference option "Keep Progress Open after Reload" is not enabled, otherwise the progress dialogue will wait for you to close it after the document has been reloaded.
You can then schedule this batch file to run via Windows' Task Scheduler, or your favourite scheduling tool.
QlikView cannot import compressed files (e.g. Zip/RAR etc.), so you would need to extract these first using a batch script.
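The extraction step does not have to be a batch script. As a rough sketch, assuming the archives are .zip files sitting directly in one folder, the following Python function unpacks them into a folder that the load script can then read; extract_all and both folder arguments are placeholders chosen for this example.

```python
import zipfile
from pathlib import Path


def extract_all(src_dir, dest_dir):
    """Extract every .zip in src_dir into dest_dir; return the extracted names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(src_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
            extracted.extend(zf.namelist())
    return extracted
```

A scheduled job could call this first and then run the QV.exe /r command shown above.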
You can loop over your directory structure and read the existing files in your load script.
LET vCustCount = NoOfRows('Kunde');
TRACE Number of customers: $(vCustCount);
FOR i=1 TO $(vCustCount)
    LET vNameKunde = FieldValue('name_kunde',$(i));
    FOR each vFile in filelist ('$(vNameKunde)/umsatz.qvd')
        TRACE $(vFile) has an umsatz.qvd;
        LOAD ....
        FROM [$(vFile)] (qvd);
    NEXT vFile
NEXT
In this case I load pre-calculated qvd files but you could do the same with txt, csv ...
And as i_saw_drones mentioned, QlikView cannot import compressed files. If you need to read compressed files, you can extract them in a batch with an unzip tool.
You should have a look at section 21.1, Loading Data from Files, in the Reference Manual.
HTH
The following script checks whether the qvd exists. If it does, the script updates it; otherwise it creates a new qvd.
IF NOT isNull(qvdCreateTime('G:\TestQvd\Data.qvd')) THEN
    data2:
    load * from G:\TestQvd\Data.qvd(qvd);
    FOR each vFille in filelist ('G:\Test\*')
        LOAD * FROM
        [$(vFille)]
        (txt, codepage is 1252, explicit labels, delimiter is spaces, msq);
    NEXT vFille
ELSE
    FOR each vFille in filelist ('G:\Test\*')
        data2:
        LOAD * FROM
        [$(vFille)]
        (txt, codepage is 1252, explicit labels, delimiter is spaces, msq);
    NEXT vFille
ENDIF
STORE data2 into G:\TestQvd\Data.qvd;
exit Script;

Resources