Using huggingface load_dataset in Google Colab notebook - dataset

I am trying to load a training dataset in my Google Colab notebook but keep getting an error. This happens only in Colab; when I run the same notebook in VS Code, the dataset loads without any problem.
Here is the code snippet which returns the error:
dataset_id ="nielsr/funsd-layoutlmv3"
from datasets import load_dataset
dataset = load_dataset(dataset_id)
print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")
And, in Colab, this returns:
Downloading and preparing dataset funsd-layoutlmv3/funsd to /root/.cache/huggingface/datasets/nielsr___funsd-layoutlmv3/funsd/1.0.0/0e3f4efdfd59aa1c3b4952c517894f7b1fc4d75c12ef01bcc8626a69e41c1bb9...
---------------------------------------------------------------------------
UnidentifiedImageError Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
1569 _time = time.time()
-> 1570 for key, record in generator:
1571 if max_shard_size is not None and writer._num_bytes > max_shard_size:
9 frames
UnidentifiedImageError: cannot identify image file '/root/.cache/huggingface/datasets/downloads/extracted/e5bbbc543f8cc95554da124f3e80a57ed24d67d06ae1467da5810703f851e3f9/dataset/training_data/images/0000971160.png'
The above exception was the direct cause of the following exception:
DatasetGenerationError Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
1604 if isinstance(e, SchemaInferenceError) and e.__context__ is not None:
1605 e = e.__context__
-> 1606 raise DatasetGenerationError("An error occurred while generating the dataset") from e
1607
1608 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)
DatasetGenerationError: An error occurred while generating the dataset
I was expecting to receive the following (which I successfully got in VS Code):
Train dataset size: 149
Test dataset size: 50
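One thing worth checking (a debugging sketch I'm adding here, not part of the original notebook; the path is copied from the traceback above) is whether the cached PNG is actually readable, and if not, forcing a fresh download:
from PIL import Image
from datasets import load_dataset

# Path copied from the traceback; adjust if your cache lives elsewhere.
path = ("/root/.cache/huggingface/datasets/downloads/extracted/"
        "e5bbbc543f8cc95554da124f3e80a57ed24d67d06ae1467da5810703f851e3f9/"
        "dataset/training_data/images/0000971160.png")

try:
    with Image.open(path) as img:
        img.verify()  # raises UnidentifiedImageError if the file is corrupt
    print("Image is readable; the problem is elsewhere.")
except Exception as exc:
    print(f"Cached image is unreadable ({exc!r}); re-downloading the dataset.")
    dataset = load_dataset("nielsr/funsd-layoutlmv3",
                           download_mode="force_redownload")
If the forced re-download works, the original cache entry was most likely just truncated during the Colab download.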
Thank you in advance!

Should I reuse the same error number in different stored procedures when throwing errors?

I'm using Giraffe framework to implement a web server in F#. The web server executes some stored procedures based on client requests. Errors are thrown inside stored procedures when the request cannot be processed and I'm catching the errors in Giraffe like this:
try
    // parse request ...
    // data access and execute stored procedure ...
with
| :? System.Data.SqlClient.SqlException as e ->
    match e.Number with
    | 60000 ->
        return! (setStatusCode 400 >=> json e.Message) next ctx
Let's say in stored procedure A, I'm throwing an error with error number 60000, error message Error in stored procedure A. In another stored procedure B, can I reuse the same error number with a different error message or should I use a different one?
If I don't reuse the same error number, the match expression will continue to grow, like the following:
with
| :? System.Data.SqlClient.SqlException as e ->
    match e.Number with
    | 60000 ->
        return! (setStatusCode 400 >=> json e.Message) next ctx
    | 60001 ->
        return! (setStatusCode 400 >=> json e.Message) next ctx
    // more error numbers in the future ...
I've never actually used error codes in SQL, so I don't know what best practice is, but I would wager it is one code for one type of error with one message. What you're suggesting bends that practice for the sole purpose of making an entirely different part of your code shorter.
Even if you do create a lot of error codes, you can avoid duplication in your code by doing
try ...
with
| :? SqlException as e ->
    match e.Number with
    | 60000 | 60001 ->
        return! (setStatusCode 400 >=> json e.Message) next ctx
If this is something you need to do in several places, consider pulling it out into its own separate function. If the list of error codes becomes large, consider putting them in a list instead, e.g.
open System.Linq

let sqlErrorCodes = [ 60000; 60001 ]

try ...
with :? SqlException as e when sqlErrorCodes.Contains(e.Number) ->
    return! (setStatusCode 400 >=> json e.Message) next ctx

volttron scheduling actuator agent with CRON

For my VOLTTRON agent, which I developed with the agent creation wizard, can I get a tip on an error related to this: 'Timezone offset does not match system offset: -18000 != 0. Please, check your config files.'
When testing my script with the from volttron.platform.scheduling import cron feature, I noticed the timezone/computer time was way off on my edge device, so I reset the time zone following a tutorial, which I am thinking definitely screwed things up.
ERROR: volttron.platform.jsonrpc.RemoteError: builtins.ValueError('Timezone offset does not match system offset: -18000 != 0. Please, check your config files.')
Whether or not this makes a difference, this edge device does use the forwarding agent to push the data to a central VOLTTRON instance.
2021-05-14 12:45:00,007 (actuatoragent-1.0 313466) volttron.platform.vip.agent.subsystems.rpc ERROR: unhandled exception in JSON-RPC method 'request_new_schedule':
Traceback (most recent call last):
File "/var/lib/volttron/volttron/platform/vip/agent/subsystems/rpc.py", line 158, in method
return method(*args, **kwargs)
File "/home/volttron/.volttron/agents/8f4ee1c0-74cb-4070-8a8c-57bf9bea8a71/actuatoragent-1.0/actuator/agent.py", line 1343, in request_new_schedule
return self._request_new_schedule(rpc_peer, task_id, priority, requests, publish_result=False)
File "/home/volttron/.volttron/agents/8f4ee1c0-74cb-4070-8a8c-57bf9bea8a71/actuatoragent-1.0/actuator/agent.py", line 1351, in _request_new_schedule
local_tz = get_localzone()
File "/var/lib/volttron/env/lib/python3.8/site-packages/tzlocal/unix.py", line 165, in get_localzone
_cache_tz = _get_localzone()
File "/var/lib/volttron/env/lib/python3.8/site-packages/tzlocal/unix.py", line 90, in _get_localzone
utils.assert_tz_offset(tz)
File "/var/lib/volttron/env/lib/python3.8/site-packages/tzlocal/utils.py", line 46, in assert_tz_offset
raise ValueError(msg)
ValueError: Timezone offset does not match system offset: -18000 != 0. Please, check your config files.
This is my raise_setpoints_up function below, which is a lot like the CSV driver agent code.
def raise_setpoints_up(self):
    _log.info(f'*** [Setter Agent INFO] *** - STARTING raise_setpoints_up function!')
    schedule_request = []
    # create start and end timestamps
    _now = get_aware_utc_now()
    str_start = format_timestamp(_now)
    _end = _now + td(seconds=10)
    str_end = format_timestamp(_end)
    # wrap the topic and timestamps up in a list and add it to the schedules list
    for device in self.jci_device_map.values():
        topic = '/'.join([self.building_topic, device])
        schedule_request.append([topic, str_start, str_end])
    # send the request to the actuator
    result = self.vip.rpc.call(
        'platform.actuator', 'request_new_schedule',
        self.core.identity, 'my_schedule', 'HIGH',
        schedule_request).get(timeout=4)
    _log.info(f'*** [Setter Agent INFO] *** - actuator agent scheduled successfully!')
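For context, wiring a function like this to the cron scheduler usually looks roughly like the sketch below (my own illustration of the common pattern, with a hypothetical SetterAgent class; it is not the poster's actual agent code):
from volttron.platform.scheduling import cron
from volttron.platform.vip.agent import Agent, Core


class SetterAgent(Agent):

    @Core.receiver('onstart')
    def onstart(self, sender, **kwargs):
        # Run raise_setpoints_up every day at 11:45 (cron fields: min hour dom month dow)
        self.core.schedule(cron('45 11 * * *'), self.raise_setpoints_up)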
Thanks for any tips
I suspect the time configured by tzdata is different from the timezone configured by the system, since you changed this manually. Give this a try:
sudo dpkg-reconfigure tzdata
and use it to change the time zone back to UTC.
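If it helps to see what is actually being compared, here is a small sketch (mine, not taken from the VOLTTRON or tzlocal sources) that reproduces the check behind that ValueError:
import time
from datetime import datetime

from tzlocal import get_localzone  # the same call the actuator agent makes

tz = get_localzone()
tz_offset = datetime.now(tz).utcoffset().total_seconds()

# time.timezone / time.altzone hold the system offset west of UTC, in seconds
is_dst = time.daylight and time.localtime().tm_isdst
system_offset = -(time.altzone if is_dst else time.timezone)

print("tzdata offset :", tz_offset)
print("system offset :", system_offset)
# tzlocal raises "Timezone offset does not match system offset: -18000 != 0"
# when these two values disagree, i.e. the tz database and the rest of the
# system configuration no longer agree after the manual change.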

Migration from v6.4 to v7.2 fails with "Column 'execution_id' cannot be null"

Running the migration in a Docker container based on the kiwitcms/kiwi:latest image (v7.2, digest d757b56bc10c).
During the migration, testcases.0010_remove_bug fails with a database constraint problem.
Is this a bug in the migration script or an issue with inconsistent data?
Operations to perform:
Apply all migrations: admin, attachments, auth, bugs, contenttypes, core, django_comments, kiwi_auth, linkreference, management, sessions, sites, testcases, testplans, testruns
Running migrations:
Applying testcases.0010_remove_bug...Traceback (most recent call last):
File "/venv/lib/python3.6/site-packages/django/db/backends/mysql/base.py", line 71, in execute
return self.cursor.execute(query, args)
File "/venv/lib/python3.6/site-packages/MySQLdb/cursors.py", line 209, in execute
res = self._query(query)
File "/venv/lib/python3.6/site-packages/MySQLdb/cursors.py", line 315, in _query
db.query(q)
File "/venv/lib/python3.6/site-packages/MySQLdb/connections.py", line 239, in query
_mysql.connection.query(self, query)
MySQLdb._exceptions.OperationalError: (1048, "Column 'execution_id' cannot be null")
This is a bug in the migration script. The problem lies in the fact that prior to these changes you could attach bugs directly to test cases, without an actual test execution. This resulted in the case_run_id field being None.
With the new changes bugs can only be attached to test executions and their execution_id field should always have a value.
Can you try this patch and let me know if it works for you:
https://github.com/kiwitcms/Kiwi/pull/1266

Items not grouped correctly - CoGroupByKey

CoGroupByKey problem
Data description.
I have two datasets.
Records - the first dataset contains around 0.5-1M records per (key, day). For testing I use 2-3 keys and 5-10 days of data; what I'm aiming for is 1000+ keys. Each record contains a key, a timestamp in microseconds, and some other data.
Configs - the second dataset is rather small. It describes a key over time; you can think of it as a list of tuples: (key, start date, end date, description).
For the exploration I've encoded the data as files of length-prefixed, binary-encoded Protocol Buffer messages. Additionally, the files are compressed with gzip. Data is sharded by date. Each file is around 10MB.
Pipeline
I use Apache Beam to express a pipeline.
First I key both datasets. For the Records dataset the key is (key, timestamp rounded down to the day). For Configs the key is (key, day), where day is every midnight timestamp between the start date and the end date.
The datasets are merged using CoGroupByKey.
As a key type I use org.apache.flink.api.java.tuple.Tuple2 with a Tuple2Coder from repo github.com/orian/tuple-coder.
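To make the keying and join step concrete, here is a minimal sketch using the Beam Python SDK (the real pipeline is Java on the Flink runner; the element shapes below are my own assumptions, since the actual protobuf messages aren't shown):
import apache_beam as beam

DAY_USEC = 24 * 60 * 60 * 1000 * 1000  # one day in microseconds

def key_record(record):
    # record: (key, timestamp_usec, payload); key it by (key, day-rounded timestamp)
    key, ts_usec, payload = record
    return ((key, ts_usec - ts_usec % DAY_USEC), payload)

def expand_config(config):
    # config: (key, start_day_usec, end_day_usec, description); emit one
    # (key, day) entry per midnight between start and end, inclusive
    key, start_day, end_day, description = config
    day = start_day
    while day <= end_day:
        yield ((key, day), description)
        day += DAY_USEC

with beam.Pipeline() as p:
    records = p | "Records" >> beam.Create([
        ("KeyValue3", 1462665600000000 + 123, b"payload-a"),
        ("KeyValue3", 1462752000000000 + 456, b"payload-b"),
    ])
    configs = p | "Configs" >> beam.Create([
        ("KeyValue3", 1462665600000000, 1463184000000000, "desc"),
    ])

    joined = (
        {"records": records | "KeyRecords" >> beam.Map(key_record),
         "configs": configs | "KeyConfigs" >> beam.FlatMap(expand_config)}
        | beam.CoGroupByKey()
    )

    # Each element is ((key, day), {"records": [...], "configs": [...]});
    # a missing Config shows up here as an empty "configs" list.
    joined | beam.Map(print)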
The problem
If the Records dataset is tiny, e.g. 5 days, everything seems fine (check normal_run.log).
INFO [main] (FlinkPipelineRunner.java:124) - Final aggregator values:
INFO [main] (FlinkPipelineRunner.java:127) - item count : 4322332
INFO [main] (FlinkPipelineRunner.java:127) - missing val1 : 0
INFO [main] (FlinkPipelineRunner.java:127) - multiple val1 : 0
When I run the pipeline against 10+ days I encounter an error indicating that for some Records there is no Config (wrong_run.log).
INFO [main] (FlinkPipelineRunner.java:124) - Final aggregator values:
INFO [main] (FlinkPipelineRunner.java:127) - item count : 8577197
INFO [main] (FlinkPipelineRunner.java:127) - missing val1 : 6
INFO [main] (FlinkPipelineRunner.java:127) - multiple val1 : 0
Then I've added some extra logging messages:
(a.java:144) - 68643 items for KeyValue3 on: 1462665600000000
(a.java:140) - no items for KeyValue3 on: 1463184000000000
(a.java:123) - missing for KeyValue3 on: 1462924800000000
(a.java:142) - 753707 items for KeyValue3 on: 1462924800000000 marked as no-loc
(a.java:123) - missing for KeyValue3 on: 1462752000000000
(a.java:142) - 749901 items for KeyValue3 on: 1462752000000000 marked as no-loc
(a.java:144) - 754578 items for KeyValue3 on: 1462406400000000
(a.java:144) - 751574 items for KeyValue3 on: 1463011200000000
(a.java:123) - missing for KeyValue3 on: 1462665600000000
(a.java:142) - 754758 items for KeyValue3 on: 1462665600000000 marked as no-loc
(a.java:123) - missing for KeyValue3 on: 1463184000000000
(a.java:142) - 694372 items for KeyValue3 on: 1463184000000000 marked as no-loc
You can spot that in the first line 68643 items were processed for KeyValue3 at time 1462665600000000.
Later, in line 9, it seems the operation processes the same key again, but reports that no Config was available for these Records.
Line 10 says they've been marked as no-loc.
Line 2 says there were no items for KeyValue3 at time 1463184000000000, but line 11 shows that the items for this (key, day) pair were processed later and lacked a Config.
Some clues
During one of the exploration runs I got an exception (exception_thrown.log).
05/26/2016 03:49:49 GroupReduce (GroupReduce at GroupByKey)(1/5) switched to FAILED
java.lang.Exception: The data preparation for task 'GroupReduce (GroupReduce at GroupByKey)' , caused an error: Error obtaining the sorted input: Thread 'SortMerger spilling thread' terminated due to an exception: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: tried to access field com.esotericsoftware.kryo.io.Input.inputStream from class org.apache.flink.api.java.typeutils.runtime.NoFetchingInput
at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:455)
at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:345)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger spilling thread' terminated due to an exception: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: tried to access field com.esotericsoftware.kryo.io.Input.inputStream from class org.apache.flink.api.java.typeutils.runtime.NoFetchingInput
at org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:619)
at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1079)
at org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:94)
at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:450)
... 3 more
Caused by: java.io.IOException: Thread 'SortMerger spilling thread' terminated due to an exception: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: tried to access field com.esotericsoftware.kryo.io.Input.inputStream from class org.apache.flink.api.java.typeutils.runtime.NoFetchingInput
at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:799)
Caused by: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: tried to access field com.esotericsoftware.kryo.io.Input.inputStream from class org.apache.flink.api.java.typeutils.runtime.NoFetchingInput
at org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:619)
at org.apache.flink.runtime.operators.sort.LargeRecordHandler.finishWriteAndSortKeys(LargeRecordHandler.java:263)
at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$SpillingThread.go(UnilateralSortMerger.java:1409)
at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:796)
Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' terminated due to an exception: tried to access field com.esotericsoftware.kryo.io.Input.inputStream from class org.apache.flink.api.java.typeutils.runtime.NoFetchingInput
at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:799)
Caused by: java.lang.IllegalAccessError: tried to access field com.esotericsoftware.kryo.io.Input.inputStream from class org.apache.flink.api.java.typeutils.runtime.NoFetchingInput
at org.apache.flink.api.java.typeutils.runtime.NoFetchingInput.readBytes(NoFetchingInput.java:122)
at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:297)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:706)
at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:228)
at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:242)
at org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:144)
at org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:30)
at org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:144)
at org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:30)
at org.apache.flink.runtime.io.disk.InputViewIterator.next(InputViewIterator.java:43)
at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ReadingThread.go(UnilateralSortMerger.java:973)
at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:796)
Work-around (after more testing: it doesn't work, staying with Tuple2)
I've switched from using Tuple2 to a Protocol Buffer message:
message KeyDay {
  optional bytes key = 1;
  optional int64 timestamp_usec = 2;
}
But using Tuple2.of() was just easier than KeyDay.newBuilder().setKey(...).setTimestampUsec(...).build().
When I switched to a key that is a class derived from protobuf.Message, the problem disappeared for 10-15 days of data (a size that was already a problem for Tuple2), but increasing the data size to 20 days revealed it is still there.

In Google App Engine, how do I reduce memory consumption as I write a file out to the blobstore rather than exceed the soft memory limit?

I'm using the blobstore to back up and recover entities in csv format. The process is working well for all of my smaller models. However, once I start to work on models with more than 2K entities, I exceed the soft memory limit. I'm only fetching 50 entities at a time and then writing the results out to the blobstore, so I'm not clear why my memory usage would be building up. I can reliably make the method fail just by increasing the "limit" value passed in below, which results in the method running just a little longer to export a few more entities.
Any recommendations on how to optimize this process to reduce memory consumption?
Also, the files produced will be only <500KB in size. Why would the process use 140 MB of memory?
Simplified example:
file_name = files.blobstore.create(mime_type='application/octet-stream')
with files.open(file_name, 'a') as f:
    writer = csv.DictWriter(f, fieldnames=properties)
    for entity in models.Player.all():
        row = backup.get_dict_for_entity(entity)
        writer.writerow(row)
Produces the error:
Exceeded soft private memory limit with 150.957 MB after servicing 7 requests total
Simplified example 2:
The problem seems to be with using files and the with statement in Python 2.5. Factoring out the csv stuff, I can reproduce almost the same error by simply trying to write a 4000-line text file to the blobstore.
from __future__ import with_statement
import StringIO
from google.appengine.api import files
from google.appengine.ext.blobstore import blobstore

file_name = files.blobstore.create(mime_type='application/octet-stream')
myBuffer = StringIO.StringIO()
# Put 4000 lines of text in myBuffer
with files.open(file_name, 'a') as f:
    for line in myBuffer.getvalue().splitlines():
        f.write(line)
files.finalize(file_name)
blob_key = files.blobstore.get_blob_key(file_name)
Produces the error:
Exceeded soft private memory limit with 154.977 MB after servicing 24 requests total
Original:
def backup_model_to_blobstore(model, limit=None, batch_size=None):
    file_name = files.blobstore.create(mime_type='application/octet-stream')
    # Open the file and write to it
    with files.open(file_name, 'a') as f:
        # Get the fieldnames for the csv file.
        query = model.all().fetch(1)
        entity = query[0]
        properties = entity.__class__.properties()
        # Add ID as a property
        properties['ID'] = entity.key().id()
        # For debugging rather than try and catch
        if True:
            writer = csv.DictWriter(f, fieldnames=properties)
            # Write out a header row
            headers = dict((n, n) for n in properties)
            writer.writerow(headers)
            numBatches = int(limit / batch_size)
            if numBatches == 0:
                numBatches = 1
            for x in range(numBatches):
                logging.info("************** querying with offset %s and limit %s", x * batch_size, batch_size)
                query = model.all().fetch(limit=batch_size, offset=x * batch_size)
                for entity in query:
                    # This just returns a small dictionary with the key-value pairs
                    row = get_dict_for_entity(entity)
                    # Write out a row for each entity.
                    writer.writerow(row)
    # Finalize the file. Do this before attempting to read it.
    files.finalize(file_name)
    blob_key = files.blobstore.get_blob_key(file_name)
    return blob_key
The error looks like this in the logs
......
2012-02-02 21:59:19.063
************** querying with offset 2050 and limit 50
I 2012-02-02 21:59:20.076
************** querying with offset 2100 and limit 50
I 2012-02-02 21:59:20.781
************** querying with offset 2150 and limit 50
I 2012-02-02 21:59:21.508
Exception for: Chris (202.161.57.167)
err:
Traceback (most recent call last):
.....
blob_key = backup_model_to_blobstore(model, limit=limit, batch_size=batch_size)
File "/base/data/home/apps/singpath/163.356548765202135434/singpath/backup.py", line 125, in backup_model_to_blobstore
writer.writerow(row)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 281, in __exit__
self.close()
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 275, in close
self._make_rpc_call_with_retry('Close', request, response)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 388, in _make_rpc_call_with_retry
_make_call(method, request, response)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 236, in _make_call
_raise_app_error(e)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 179, in _raise_app_error
raise FileNotOpenedError()
FileNotOpenedError
C 2012-02-02 21:59:23.009
Exceeded soft private memory limit with 149.426 MB after servicing 14 requests total
You'd be better off not doing the batching yourself, but just iterating over the query. The iterator will pick a batch size (probably 20) that should be adequate:
q = model.all()
for entity in q:
    row = get_dict_for_entity(entity)
    writer.writerow(row)
This avoids re-running the query with ever-increasing offset, which is slow and causes quadratic behavior in the datastore.
An oft-overlooked fact about memory usage is that the in-memory representation of an entity can use 30-50 times the RAM compared to the serialized form of the entity; e.g. an entity that is 3KB on disk might use 100KB in RAM. (The exact blow-up factor depends on many factors; it's worse if you have lots of properties with long names and small values, even worse for repeated properties with long names.)
In "What is the proper way to write to the Google App Engine blobstore as a file in Python 2.5?" a similar problem was reported. In an answer there it is suggested that you should try inserting gc.collect() calls occasionally. Given what I know of the files API's implementation, I think that is spot on. Give it a try!
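Putting those two suggestions together, a hedged sketch based on the question's own code (iterate the query directly and call gc.collect() every so often; I haven't run this on App Engine) might look like this:
import csv
import gc

from google.appengine.api import files

def backup_model_to_blobstore(model, properties):
    # Sketch only: combines the iteration advice above with periodic gc.collect().
    # get_dict_for_entity() is the helper from the question.
    file_name = files.blobstore.create(mime_type='application/octet-stream')
    with files.open(file_name, 'a') as f:
        writer = csv.DictWriter(f, fieldnames=properties)
        writer.writerow(dict((n, n) for n in properties))  # header row
        for i, entity in enumerate(model.all()):
            writer.writerow(get_dict_for_entity(entity))
            if i % 500 == 0:
                gc.collect()  # encourage release of buffered request/RPC data
    files.finalize(file_name)
    return files.blobstore.get_blob_key(file_name)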
I can't speak for the memory use in Python, but considering your error message, the error most likely stems from the fact that a blobstore-backed file in GAE can't be kept open for more than around 30 seconds, so you have to close and reopen it periodically if your processing takes longer.
It could also be a deadline exceeded error, due to the 30-second request limit. In my implementation, to get around it, instead of having a webapp handler for the operation I fire a task on the default queue. The nice thing about the queue is that it takes one line of code to invoke, it has a 10-minute time limit, and if a task fails it is retried within that limit. I am not really sure it will solve your problem, but it's worth a try.
from google.appengine.api import taskqueue
...
taskqueue.add(url="the url that invokes your method")
You can find more info about task queues in the App Engine documentation.
Or consider using a backend for serious computations and file operations.
