Does CKAN have a size limit for uploaded data? - solr

I have set CKAN and it is running fine, but have two questions.
Both problems below happen only when uploading a file. If I add a new resource by URL, everything runs fine.
1) I can upload small files (around 4 kB) to a given dataset, but when trying with bigger files (65 kB) I get Error 500 An Internal Server Error Occurred. So is there a size limit for uploading files? What can I do to be able to upload bigger files?
2) I get another error for the small uploaded files: when clicking Go to Resource to download the data, it gives me Connection to localhost refused, and I can't visualize the data either. What am I doing wrong?
I appreciate any help. If you need me to provide more info on anything, I'll happily do so.
Many thanks.

CKAN has an upload size limit of 10 MB for resources by default. You can raise that in your ini file with ckan.max_resource_size = XX, for example ckan.max_resource_size = 100 (which means 100 MB).
As for question 2): have you set ckan.site_url correctly in your ini?
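For reference, a minimal sketch of the relevant lines in the CKAN configuration file; the path below is just an example, adjust it to wherever your .ini lives:

    # e.g. /etc/ckan/default/production.ini
    # Maximum size of uploaded resource files, in MB (default is 10)
    ckan.max_resource_size = 100
    # Must be the URL under which clients actually reach the site,
    # otherwise resource links will point at localhost
    ckan.site_url = http://ckan.example.com

Restart the web server after editing the ini so the new values are picked up.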

As far as I'm aware, CKAN can easily cope with terabytes of data (it is used for millions of medical records in hospitals, etc.), so there shouldn't be an issue with your file size. It could be an issue on their end while receiving your data.

Related

AWS Sagemaker failure after successful training "ClientError: Artifact upload failed:Insufficient disk space"

I'm training a network using a custom Docker image. The first training run with 50,000 steps went fine, but when I tried to increase to 80,000 I got the error: "ClientError: Artifact upload failed:Insufficient disk space". I only increased the number of steps, which is weird to me. There are no errors in the CloudWatch log; my last entry is:
Successfully generated graphs: ['pipeline.config', 'tflite_graph.pb',
'frozen_inference_graph.pb', 'tflite_graph.pbtxt',
'tflite_quant_graph.tflite', 'saved_model', 'hyperparameters.json',
'label_map.pbtxt', 'model.ckpt.data-00000-of-00001',
'model.ckpt.meta', 'model.ckpt.index', 'checkpoint']
Which basically means that those files have been created, because it is simply:
graph_files = os.listdir(model_path + '/graph')
Which disk space is it talking about? Also, looking at the training job, I see from the disk utilization chart that the rising curve peaks at 80%...
I expect that after the successful creation of the aforementioned files, everything is uploaded to my S3 bucket, where no disk space issues are present. Why do 50,000 steps work and 80,000 not?
It is my understanding that the number of training steps doesn't influence the size of the model files.
Adding volume size to the training job by setting "additional storage volume per instance (GB)" to 5 GB at creation seems to solve the problem. I still don't understand why, but the problem seems solved.
When the SageMaker training completes, the model from the /opt/ml/model directory in the container is uploaded to S3. If the model to be uploaded is too large, the error ClientError: Artifact upload failed:... is thrown.
Increasing the volume size fixes the problem only superficially; in most cases the model does not have to be that large, right?
Note that the odds are your model itself is not too large, but that you're saving your checkpoints to /opt/ml/model as well (a bug).
In the end, SageMaker tries to pack everything (model and all checkpoints) in order to upload it to S3, does not have sufficient volume for that, and hence throws the error. You can confirm whether this is the reason by checking the size of your uploaded model.tar.gz file on S3.
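To check the size of the uploaded artifact, you can inspect the object on S3, for example with boto3; the bucket and key below are placeholders, so point them at your training job's output path:

    import boto3

    s3 = boto3.client('s3')
    # Placeholder bucket/key: use the S3 output path of your training job
    head = s3.head_object(Bucket='my-sagemaker-bucket',
                          Key='my-training-job/output/model.tar.gz')
    print('model.tar.gz is %.1f MB' % (head['ContentLength'] / 1024.0 / 1024.0))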
Why do 50,000 steps work and 80,000 not?
With 80,000 steps, the number of checkpoints has also increased, and the final model.tar.gz file to be uploaded to S3 has become so big that it can't even fit in the current volume.

Storing Serialized Video files to SQL Server

I currently am faced with a need to host 20 small video files for my website. I know I could just host them with my project in a folder, but I came across this article:
http://www.kindblad.com/2008/04/how-to-store-files-in-ms-sql-server.html
The thought of storing the files in the DB had not occurred to me. My question is: would there be a performance increase or decrease from storing the files as binary data in the DB versus just streaming the data? I like the idea of having the data in the DB for portability and for controlling who gets access to the videos. Thanks in advance.
Unless you have a pressing need to store them in a database, I wouldn't, personally. You can still control who gets access to which files by using a handler to validate access to the file. One big problem that the method in that article has is that it doesn't support reading a byte range - so if someone wants to seek to the middle of a video, for example, they would have to wait for the whole thing to download. You'd want it to be able to support the range header, as described in this question.
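For illustration only, here is a rough sketch of the byte-range mechanics in Python (the handler in question would be ASP.NET, so treat this purely as a sketch of what honoring the Range header involves; suffix ranges like bytes=-500 are ignored):

    import os
    import re

    def read_range(path, range_header):
        """Return (status, headers, body) honoring a simple 'bytes=start-end' Range header."""
        size = os.path.getsize(path)
        match = re.match(r'bytes=(\d+)-(\d*)$', range_header or '')
        if not match:
            # No usable Range header: serve the whole file
            with open(path, 'rb') as f:
                return 200, {'Content-Length': str(size), 'Accept-Ranges': 'bytes'}, f.read()
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else size - 1
        end = min(end, size - 1)
        with open(path, 'rb') as f:
            f.seek(start)
            body = f.read(end - start + 1)
        headers = {
            'Content-Range': 'bytes %d-%d/%d' % (start, end, size),
            'Content-Length': str(len(body)),
            'Accept-Ranges': 'bytes',
        }
        return 206, headers, body  # 206 Partial Content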

How many blobs may be submitted to GAE blobstore in one call?

I am trying to upload 1744 small files to the blobstore (total size of all the files is 4 MB) and get an HTTP/1.1 503 Service Unavailable error.
This is 100% reproducible.
Is this a bug, or am I violating some constraint? I don't see any constraints in the documentation about the number of blobs submitted in one call.
The answer above that claims create_upload_url can only accept one file per upload is wrong. You can upload multiple files in a single upload, and this is the way you should be approaching your problem.
That being said, there was a reliability problem when doing a batch upload that was worked on and fixed around a year or so ago. If possible, I would suggest keeping the batch sizes a little smaller (say, 100 or so files per batch, as in the sketch after this answer). Each file in the batch results in a write to the datastore to record the blob key, so 1744 files == 1744 writes, and if one of them fails then your entire upload will fail.
If you give me the app_id I can take a look at what might be going wrong with your uploads.
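A minimal sketch of splitting the 1744 files into batches of roughly 100 before submitting them; the batch size and the upload_batch helper are assumptions, not an App Engine API:

    BATCH_SIZE = 100  # suggested batch size from the answer above

    def batches(items, size=BATCH_SIZE):
        # Yield successive slices of at most `size` items
        for i in range(0, len(items), size):
            yield items[i:i + size]

    # files = list_of_1744_file_paths
    # for batch in batches(files):
    #     # Hypothetical helper: POST one batch to a freshly created upload URL
    #     upload_batch(batch)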
So, the answer: currently fewer than 500 files may be submitted in one request.
This is going to be fixed in the scope of the ticket http://code.google.com/p/googleappengine/issues/detail?id=8032 so that an unlimited number of files may be submitted, but it may take a GAE release or two before the fix is deployed.

Google App Engine Large File Upload

I am trying to upload data to Google App Engine (using GWT). I am using the FileUploader widget and the servlet uses an InputStream to read the data and insert directly to the datastore. Running it locally, I can upload large files successfully, but when I deploy it to GAE, I am limited by the 30 second request time. Is there any way around this? Or is there any way that I can split the file into smaller chunks and send the smaller chunks?
By using the Blobstore you have a 1 GB size limit and a special handler, unsurprisingly called BlobstoreUploadHandler, which shouldn't give you timeout problems on upload (a minimal sketch follows after this answer).
Also check out http://demofileuploadgae.appspot.com/ (source code, source answer), which does exactly what you are asking.
Also, check out the rest of GWT-Examples.
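For reference, a minimal sketch of the legacy Python (webapp2) Blobstore upload flow; the question uses GWT/Java, but the server-side handler pattern is the same idea:

    import webapp2
    from google.appengine.ext import blobstore
    from google.appengine.ext.webapp import blobstore_handlers

    class FormHandler(webapp2.RequestHandler):
        def get(self):
            # The form must POST to a one-time URL created by the Blobstore service
            upload_url = blobstore.create_upload_url('/upload')
            self.response.write(
                '<form action="%s" method="POST" enctype="multipart/form-data">'
                '<input type="file" name="file"><input type="submit"></form>' % upload_url)

    class UploadHandler(blobstore_handlers.BlobstoreUploadHandler):
        def post(self):
            # get_uploads() returns the BlobInfo records for the uploaded files;
            # the file bytes themselves are handled by the Blobstore, not your code
            blob_info = self.get_uploads('file')[0]
            self.redirect('/serve/%s' % blob_info.key())

    app = webapp2.WSGIApplication([('/', FormHandler), ('/upload', UploadHandler)])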
Currently, GAE imposes a limit of 10 MB on file uploads (and response size), as well as 1 MB limits on many other things; so even if you had a network connection fast enough to push more than 10 MB within a 30-second window, that would be to no avail. Google has said (I heard Guido van Rossum mention it yesterday here at Pycon Italia Tre) that it has plans to overcome these limitations in the future (at least for users of GAE who pay per-use to exceed quotas; I'm not sure whether the plans extend to users of GAE who are not paying and generally have to accept smaller quotas to get their free use of GAE).
You would need to do the upload to another server; I believe the 30-second timeout cannot be worked around. If there is a way, please correct me! I'd love to know how!
If your request is running out of request time, there is little you can do. Maybe your files are too big and you will need to chunk them on the client (with something like Flash or Java, or an upload framework like pupload).
Once you get the file to the application there is another issue: the datastore limitations. Here you have two options:
you can use the Blobstore service, which has quite a nice API for handling uploads up to 50 megabytes
you can use something like bigblobae, which can store blobs of virtually unlimited size in the regular App Engine datastore.
The 30-second response time limit only applies to code execution, so the uploading of the actual file as part of the request body is excluded from it. The timer only starts once the request has been fully sent to the server by the client and your code starts handling the submitted request. Hence it doesn't matter how slow your client's connection is.
Uploading file on Google App Engine using Datastore and 30 sec response time limitation
The closest you could get would be to split the file into chunks as you store it in GAE and then, when you download it, piece it together by issuing separate AJAX requests.
I would agree with chunking the data into smaller blobs and having two tables: one contains the metadata (filename, size, number of downloads, etc.) and the other contains the chunks; the chunks are associated with the metadata table by a foreign key. I think it is doable (a rough sketch follows after this answer)...
Or, when you have uploaded all the chunks, you can simply put them together into one blob and keep a single table.
But the problem is that you will need a thick client to do the chunking, like a Java applet, which needs to be signed and trusted by your clients so it can access the local file system.
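A rough ndb sketch of the two-table idea above; the model and property names are hypothetical:

    from google.appengine.ext import ndb

    class FileMetadata(ndb.Model):
        # One record per logical file
        filename = ndb.StringProperty()
        size = ndb.IntegerProperty()
        num_chunks = ndb.IntegerProperty()
        downloads = ndb.IntegerProperty(default=0)

    class FileChunk(ndb.Model):
        # "Foreign key" back to the metadata record, plus the chunk's position
        file_key = ndb.KeyProperty(kind=FileMetadata)
        index = ndb.IntegerProperty()
        data = ndb.BlobProperty()  # keep each chunk well under the 1 MB entity limit

    def reassemble(file_key):
        # Fetch the chunks in order and stitch the original file back together
        chunks = (FileChunk.query(FileChunk.file_key == file_key)
                  .order(FileChunk.index).fetch())
        return b''.join(c.data for c in chunks)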

Elegant way to determine total size of website?

Is there an elegant way to determine the size of data downloaded from a website, bearing in mind that not all requests will go to the same domain that you originally visited and that other browsers may be polling in the background at the same time? Ideally I'd like to look at the size of each individual page, or for a Flash site the total downloaded over time.
I'm looking for some kind of browser plug-in or Fiddler script. I'm not sure Fiddler would work due to the issues pointed out above.
I want to compare sites similar to mine for total file size, and keep track of my own site as well.
Firebug and HttpFox are two Firefox plugins that can be used to determine the size of data downloaded from a website for a single page. While Firebug is a great tool for any web developer, HttpFox is a more specialized plugin for analyzing HTTP requests and responses (including their sizes).
You can install both and try them out; just be sure to disable one while enabling the other.
If you need a website wide measurement:
If the website is made of plain HTML and assets (like CSS, images, Flash, ...), you can check how big the folder containing the website is on the server (this assumes you can log in to the server)
You can mirror the website locally using wget, curl or some GUI-based application like Site Sucker and check how big the folder containing the mirror is
If you know the website is huge but you don't know how huge, you can estimate its size, e.g. www.mygallery.com has 1000 galleries; each gallery has an average of 20 images loaded; every image is stored in 2 different sizes (thumbnail and full size) at an average of n kB per image; ... (a back-of-the-envelope calculation is sketched below)
Keep in mind that if you download or estimate a dynamic website, you are dealing with what the website produces, not with the real size of the website on the server. A small PHP script can produce tons of HTML.
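As a worked example of that kind of estimate (all numbers are made up):

    # Hypothetical back-of-the-envelope estimate for a gallery site like the
    # www.mygallery.com example above
    galleries = 1000
    images_per_gallery = 20
    sizes_per_image = 2           # thumbnail + full size
    avg_kb_per_image = 150        # assumed average file size in kB

    total_kb = galleries * images_per_gallery * sizes_per_image * avg_kb_per_image
    print('Estimated size: %.1f GB' % (total_kb / 1024.0 / 1024.0))
    # -> roughly 5.7 GB with these assumptions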
Have you tried Firebug for Firefox?
The "Net" panel in Firebug will tell you the size and fetch time of each fetched file, along with the totals.
You can download the entire site and then you will know for sure!
https://www.httrack.com/
