How to upload .gz files into Google Big Query? - google-app-engine

I have a 90 GB .csv file that I want to build on my local computer and then upload into Google BigQuery for analysis. I create this file by combining thousands of smaller .csv files into 10 medium-sized files and then combining those medium-sized files into the 90 GB file, which I then want to move to GBQ. I am struggling with this project because my computer keeps crashing from memory issues. From this video I understood that I should first transform the medium-sized .csv files (about 9 GB each) into .gz files (about 500 MB each), then upload those .gz files into Google Cloud Storage, then create an empty table (in Google BigQuery / Datasets) and append all of those files to it. The issue I am having is finding any tutorial or documentation on how to do this. I am new to the Google platform, so maybe this is a very easy job that can be done with one click somewhere, but all I was able to find was the video I linked above. Where can I find help, documentation, tutorials, or videos on how people do this? Do I have the right idea about the workflow? Is there a better way (like using some downloadable GUI to upload stuff)?

See the instructions here:
https://cloud.google.com/bigquery/bq-command-line-tool#creatingtablefromfile
As Abdou mentions in a comment, you don't need to combine them ahead of time. Just gzip all of your small CSV files, upload them to a GCS bucket, and use the "bq.py load" command to create a new table. Note that you can use a wildcard syntax to avoid listing all of the individual file names to load.
The --autodetect flag may let you avoid specifying a schema manually, although autodetection only samples your input, so you may still need to supply a schema yourself if it guesses wrong in certain cases.
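To make the load step concrete, here is a minimal sketch using the google-cloud-bigquery Python client rather than the bq command-line tool the answer mentions; the bucket, dataset, and table names are placeholders, and the wildcard URI picks up every gzipped CSV under the prefix.

```python
# Minimal sketch: load gzipped CSVs from GCS into BigQuery with schema autodetection.
# Assumes the google-cloud-bigquery package is installed and credentials are configured;
# the bucket, dataset, and table names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,  # sample the input to infer a schema
)

# Wildcard matches every gzipped CSV under the prefix, so no need to list files.
uri = "gs://my-example-bucket/csv-parts/*.csv.gz"

load_job = client.load_table_from_uri(uri, "my_dataset.my_table", job_config=job_config)
load_job.result()  # wait for the load job to finish

table = client.get_table("my_dataset.my_table")
print(f"Loaded {table.num_rows} rows into my_dataset.my_table")
```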

Related

Download small sample of AWS Common Crawl to local machine via http

I'm interested in downloading raw text of a tiny subset, 10's of megs tops, of the AWS Common Crawl, as a corpus for information retrieval tests.
The Common Crawl pages suggest I need an S3 account and/or a Java program to access it, and then I'm looking at sifting through hundreds of GBs of data when all I need is a few dozen megs.
There's some code here, but it requires an S3 account and access (although I do like Python).
Is there a way I can form an http(s) URL that will let me get a tiny cross-section of a crawl for my purposes? I believe I looked at a page that suggested a way to structure the directory with day, hour, minute, but I cannot seem to find that page again.
Thanks!
It's quite easy: just randomly choose a single WARC (or WAT or WET) file from any monthly crawl. The crawls are announced here: https://commoncrawl.org/connect/blog/
take the latest crawl (e.g. April 2019)
navigate to the WARC file list and download it (same for WAT or WET)
unzip the file and randomly select one line (a file path)
prefix the path with https://commoncrawl.s3.amazonaws.com/ (or, since spring 2022, https://data.commoncrawl.org/ - there is a description in the blog post) and download it
You're done, because every WARC/WAT/WET file is a random sample on its own. Need more data? Just pick more files at random.
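A minimal Python sketch of the steps above, assuming the requests package is available; the crawl ID is a placeholder taken from the blog announcement (e.g. CC-MAIN-2019-18 for the April 2019 crawl):

```python
# Minimal sketch: fetch the WARC path listing for one monthly crawl, pick a single
# file at random, and download it. The crawl ID is a placeholder; take the latest
# one announced on the Common Crawl blog.
import gzip
import random
import requests

CRAWL_ID = "CC-MAIN-2019-18"  # placeholder: the April 2019 crawl
PREFIX = "https://data.commoncrawl.org/"  # pre-2022 guides use https://commoncrawl.s3.amazonaws.com/

# 1. Download and decompress the list of WARC file paths for the crawl.
listing_url = f"{PREFIX}crawl-data/{CRAWL_ID}/warc.paths.gz"
paths = gzip.decompress(requests.get(listing_url).content).decode().splitlines()

# 2. Pick one path at random -- each WARC file is already a random sample of the crawl.
path = random.choice(paths)

# 3. Download that single WARC file (roughly 1 GB; use wet.paths.gz above for the
#    smaller, text-only WET files if a few hundred MB is still too much).
with requests.get(PREFIX + path, stream=True) as resp, open("sample.warc.gz", "wb") as out:
    for chunk in resp.iter_content(chunk_size=1 << 20):
        out.write(chunk)
```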

How to process multiple text files at a time to analysis using mapreduce in hadoop

I have lots of small files, say more than 50,000. I need to process these files in a single MapReduce job to generate some analysis based on the input files.
Please suggest a way to do this, and also please let me know how to merge these small files into one big file in HDFS.
See this blog post from Cloudera explaining the problem with small files.
There is a project on GitHub named FileCrush which merges large numbers of small files. From the project's homepage:
Turn many small files into fewer larger ones. Also change from text to sequence and other compression options in one pass.
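FileCrush is the tool named above; as a rougher alternative for plain text files, here is a hedged sketch that shells out to the standard hdfs dfs -getmerge and -put commands to concatenate a directory of small files into one HDFS file. All paths are placeholders, and it only works when the merged file fits on local disk; for processing the small files without merging at all, Hadoop's CombineTextInputFormat is also worth a look.

```python
# Hedged sketch: merge a directory of small text files on HDFS into one large file using
# the standard 'hdfs dfs -getmerge' and 'hdfs dfs -put' commands. FileCrush (mentioned
# above) is the more scalable option; this is only practical when the merged file fits
# on local disk. The HDFS and local paths are placeholders.
import subprocess

SMALL_FILES_DIR = "/data/small-files"      # HDFS directory holding the ~50,000 small files
LOCAL_MERGED = "/tmp/merged.txt"           # temporary local file
MERGED_HDFS_PATH = "/data/merged/big.txt"  # destination on HDFS

# Concatenate every file under the HDFS directory into a single local file.
subprocess.run(["hdfs", "dfs", "-getmerge", SMALL_FILES_DIR, LOCAL_MERGED], check=True)

# Upload the merged file back to HDFS so one MapReduce job can read it as a single input.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/data/merged"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", LOCAL_MERGED, MERGED_HDFS_PATH], check=True)
```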

How many blobs may be submitted to GAE blobstore in one call?

I am trying to upload 1744 small files to the blobstore (the total size of all the files is 4 MB) and I get an HTTP/1.1 503 Service Unavailable error.
This is 100% reproducible.
Is it a bug, or do I violate some constraint? I don't see any constraint in the documentation on the number of blobs submitted in one call.
The answer above claiming that create_upload_url can only accept one file per upload is wrong. You can upload multiple files in a single upload, and this is the way you should be approaching your problem.
That being said, there was a reliability problem with batch uploads that was worked on and fixed around a year or so ago. If possible, I would suggest keeping the batch sizes a little smaller (say 100 or so files per batch). Each file in the batch results in a write to the datastore to record the blob key, so 1744 files == 1744 writes, and if one of them fails then your entire upload will fail.
If you give me the app_id I can take a look at what might be going wrong with your uploads.
So, the answer: currently fewer than 500 files may be submitted in one request.
This is going to be fixed in the scope of the ticket http://code.google.com/p/googleappengine/issues/detail?id=8032 so that an unlimited number of files may be submitted, but it may take a GAE release or two before the fix is deployed.
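Building on the suggestion above to keep batches to around 100 files, here is a hedged client-side sketch of batched uploads. It assumes your app exposes a hypothetical /get_upload_url handler that returns a fresh blobstore.create_upload_url() result for each batch (an upload URL can only be used once), and it uses the requests package:

```python
# Hedged client-side sketch: upload ~1,744 files to the Blobstore in batches of 100 instead
# of one giant multipart request. Assumes the app exposes a (hypothetical) /get_upload_url
# handler that returns a fresh URL from blobstore.create_upload_url() for each batch.
import requests

BATCH_SIZE = 100
APP_URL = "https://your-app-id.appspot.com"  # placeholder

def upload_in_batches(paths):
    for start in range(0, len(paths), BATCH_SIZE):
        batch = paths[start:start + BATCH_SIZE]
        # Fetch a new upload URL for every batch; each one is single-use.
        upload_url = requests.get(f"{APP_URL}/get_upload_url").text.strip()
        # One multipart/form-data POST per batch; each part becomes one blob.
        files = [("file", (path, open(path, "rb"))) for path in batch]
        try:
            resp = requests.post(upload_url, files=files)
            resp.raise_for_status()
        finally:
            for _, (_, fh) in files:
                fh.close()
```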

grails file upload

Hey. I need to upload some files (images/pdf/pp) to my SQLS database and later download them again. I'm not sure what the best solution is - store them as bytes, or store them as files (not sure if that is possible). Later I need to data-bind multiple domain classes together with that file upload.
Any help would be very much appreciated,
JM
Saving files in the file system or in the DB is a general question that has been asked here several times.
Check this: Store images (jpg, gif, png) in filesystem or DB?
I recommend saving the files in the file system and just saving the path in the DB.
(If you want to work with Google App Engine, though, you have to save the file as a byte array in the DB, as saving files in the file system is not possible with Google App Engine.)
To upload file with grails check this: http://www.grails.org/Controllers+-+File+Uploads
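To illustrate the recommendation above in a language-agnostic way (shown here in Python rather than Groovy, with placeholder directory, table, and column names): write the uploaded bytes to the file system and store only the path in the database.

```python
# Language-agnostic sketch of the advice above: persist the uploaded bytes on disk and
# record only the path (not the bytes) in the database. Directory, table, and column
# names are placeholders.
import os
import sqlite3
import uuid

UPLOAD_DIR = "/var/myapp/uploads"  # placeholder

def save_upload(original_name: str, data: bytes, db: sqlite3.Connection) -> str:
    """Write the file to the file system and store its path in the DB."""
    os.makedirs(UPLOAD_DIR, exist_ok=True)
    # Prefix with a UUID so files sharing a name don't collide.
    stored_path = os.path.join(UPLOAD_DIR, f"{uuid.uuid4()}_{original_name}")
    with open(stored_path, "wb") as f:
        f.write(data)
    db.execute(
        "INSERT INTO attachments (original_name, path) VALUES (?, ?)",
        (original_name, stored_path),
    )
    db.commit()
    return stored_path
```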

Plone 4 data is stored on the file system rather than in the database?

According to this post:
http://ifpeople.wordpress.com/2010/10/20/plone-4-best-yet-of-the-best-cms/
It says the following about data storage:
"Plone 4's capacity to handle very large files has improved drastically since all file data is now stored on the file system rather than in the database. This enhances the ability of Plone to scale to handle huge content repositories out of the box!"
I'm not a Plone user. What is the meaning of those words? Is it a flat-file database?
Instead of storing uploaded pdfs and so in the database, these are now stored in a regular file system folder.
So they're stored as regular files on the regular filesystem. Plone's database itself handles those files transparently, so the application code doesn't need to know whether the files are on the filesystem or inside the database. (The technical term is "BLOB storage": binary large objects).
And, yes, it helps a lot with performance :-)
For another explanation, see point 4 on http://jstahl.org/archives/2010/09/01/5-things-that-rock-about-plone-4/ .
By default, files and images uploaded to a Plone 4 site are no longer stored in the traditional 'filestorage' file (e.g. Data.fs), but instead in a specially organised 'blob' storage area on the file system. This is a tremendous help in preventing huge Data.fs files. Everything else is stored in the filestorage as before. The only thing you need to worry about is how to do backups properly, as repozo doesn't support this :-)
No, this quote refers to the inclusion of ZODB "blob support" (http://en.wikipedia.org/wiki/Binary_large_object) in Plone 4. Prior to this release, objects like files and images were stored in the (flat file) Data.fs file (which is part of the ZODB).
Now, they are stored on the filesystem in files (still managed by the ZODB) that look like this:
var/blobstorage
var/blobstorage/.layout
var/blobstorage/0x00
var/blobstorage/0x00/0x00
var/blobstorage/0x00/0x00/0x00
var/blobstorage/0x00/0x00/0x00/0x00
var/blobstorage/0x00/0x00/0x00/0x00/0x00
var/blobstorage/0x00/0x00/0x00/0x00/0x00/0x00
var/blobstorage/0x00/0x00/0x00/0x00/0x00/0x00/0x3b
var/blobstorage/0x00/0x00/0x00/0x00/0x00/0x00/0x3b/0xa5
var/blobstorage/0x00/0x00/0x00/0x00/0x00/0x00/0x3b/0xa5/0x038ba9d72acbdcdd.blob
var/blobstorage/0x00/0x00/0x00/0x00/0x00/0x00/0x3b/0xa9
var/blobstorage/0x00/0x00/0x00/0x00/0x00/0x00/0x3b/0xa9/0x038ba9d836b5cdaa.blob
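To see how blobs behave outside of Plone, here is a hedged sketch using the ZODB API directly: a FileStorage opened with a blob_dir keeps object metadata in Data.fs, while the raw bytes of each ZODB.blob.Blob land in .blob files under that directory, much like the var/blobstorage tree above (file names here are placeholders).

```python
# Hedged sketch of ZODB blob storage outside of Plone: a FileStorage with a blob_dir keeps
# object metadata in Data.fs while the raw file bytes land in files under blobstorage/,
# much like the var/blobstorage tree shown above. Paths and keys are placeholders.
import transaction
import ZODB
import ZODB.blob
import ZODB.FileStorage

storage = ZODB.FileStorage.FileStorage("Data.fs", blob_dir="blobstorage")
db = ZODB.DB(storage)
connection = db.open()
root = connection.root()

# Store an uploaded file as a blob; only a reference lives in Data.fs.
blob = ZODB.blob.Blob()
with blob.open("w") as f:
    f.write(b"fake file contents")
root["report.pdf"] = blob
transaction.commit()

# Reading it back streams straight from the .blob file on the file system.
with root["report.pdf"].open("r") as f:
    data = f.read()

db.close()
```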
