ijson.common.IncompleteJSONError: lexical error: invalid char in json text - ijson

Hi all, ijson newbie here. I have a very large .json file (168 GB) from which I want to extract all possible keys, but some values in the file are written as NaN. ijson creates a generator and outputs dictionaries; when it reaches one of those items, it throws an error. How can I get the value as a string instead of a parsed dictionary? I tried parser = ijson.items(input_file, '', multiple_values=True, map_type=str), but it didn't help.
import ijson
import json

def parse_json(json_filename):
    with open('max_data_error.txt', 'w') as outfile:
        with open(json_filename, 'r') as input_file:
            # outfile.write('[ ')
            parser = ijson.items(input_file, '', multiple_values=True)
            cont = 0
            max_keys_list = list()
            for value in parser:
                for i in json.loads(json.dumps(value, ensure_ascii=False, default=str)):
                    if i not in max_keys_list:
                        max_keys_list.append(i)
                print(value)
            print(max_keys_list)
            for keys_item in max_keys_list:
                outfile.write(keys_item + '\n')

if __name__ == '__main__':
    parse_json('./email/emailrecords.bson.json')
Traceback (most recent call last):
File "panda read.py", line 29, in <module>
parse_json('./email/emailrecords.bson.json')
File "panda read.py", line 17, in parse_json
for value in parser:
ijson.common.IncompleteJSONError: lexical error: invalid char in json text.
litecashwire.com","lastname":NaN,"firstname":"Mia","zip":"87
(right here) ------^

Your file is not valid JSON (NaN is not a valid JSON value); therefore any JSON parsing library will complain about this, one way or another, unless it has an extension to handle this non-standard content.
The ijson FAQ found in the project description has a question about invalid UTF-8 characters and how to deal with them. Those same answers apply here, so I would suggest you go and try one of those.
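If a quick workaround is acceptable, one option (a sketch, not an official ijson feature) is to wrap the input file and rewrite the bare NaN tokens into null before ijson ever sees them:

import ijson

# Sketch: a file-like wrapper that patches NaN -> null on the fly.
# Caveats: this assumes "NaN" never appears inside a quoted string value,
# and a chunk boundary could split the token; buffering the last few
# bytes of each chunk would make it robust.
class NanToNullFile(object):
    def __init__(self, f):
        self.f = f

    def read(self, size=-1):
        return self.f.read(size).replace('NaN', 'null')

with open('./email/emailrecords.bson.json', 'r') as input_file:
    parser = ijson.items(NanToNullFile(input_file), '', multiple_values=True)
    for value in parser:
        pass  # collect keys as before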

Related

How can I create and shuffle a dataset for triplet mining in TensorFlow 2?

I'm working on a network using triplet mining for training. In order to make it work properly, I need my batches to contain several images of the same class. The problem I'm currently facing is that I have 751 classes, for a total of 12,937 pictures, and a batch size of 48 pictures. When shuffling the dataset using the command below, the odds to get pictures from the same class are really low, making the triplet mining inefficient.
dataset = dataset.shuffle(12937)
What I would need instead is a way of generating batches that contain a specific number of pictures for every class represented in this batch. As an example, let's say here that I want 12 classes per batch, there would be 4 pictures for each of them.
Another problem I'm facing is how would I shuffle this dataset at the end of every epoch so that I can have different batches that still follow the condition fixed above, that is 12 classes, 4 pictures for each one of them?
Is there any proper way to do it? I can't really find one. Please let me know if I'm unclear, and if you need further details.
================ EDIT ================
I've been trying a few things, and came up with something that would do what I want. The function would be the following:
counter = 0.

# Assuming a format such as (data, label)
def predicate(data, label):
    global counter
    allowed_labels = tf.constant([counter])
    isallowed = tf.equal(allowed_labels, tf.cast(label, tf.float32))
    reduced = tf.reduce_sum(tf.cast(isallowed, tf.float32))
    counter += 1
    return tf.greater(reduced, tf.constant(0.))

# @tf.function
def custom_shuffle(train_dataset, batch_size, samples_per_class=4, iterations_in_epoch=100, database='market'):
    assert batch_size % samples_per_class == 0, f'batch size must be a multiple of {samples_per_class}.'
    if database == 'market':
        class_nbr = 751
    else:
        raise Exception('Unsupported database yet')
    all_datasets = [train_dataset.filter(predicate) for _ in range(class_nbr)]  # Every element of this array is a dataset of one class
    for i in range(iterations_in_epoch):
        choice = tf.random.uniform(
            shape=(batch_size // samples_per_class,),
            minval=0,
            maxval=class_nbr,
            dtype=tf.dtypes.int64,
        )  # Which classes will be in the batch
        choice = tf.data.Dataset.from_tensor_slices(tf.concat([choice for _ in range(4)], axis=0))  # Exactly 4 pictures from each class in the batch
        batch = tf.data.experimental.choose_from_datasets(all_datasets, choice)
        if i == 0:
            all_batches = batch
        else:
            all_batches = all_batches.concatenate(batch)
    all_batches = all_batches.batch(batch_size)
    return all_batches
It does what I want; however, the returned dataset is extremely slow to iterate, making model learning impossible. As per this thread, I understood that I needed to decorate custom_shuffle with @tf.function, as the one commented out above. However, when doing so, it raises the following error:
Traceback (most recent call last):
File "training.py", line 137, in <module>
main()
File "training.py", line 80, in main
train_dataset = get_dataset(TRAINING_FILENAMES, IMG_SIZE, BATCH_SIZE, database=database, func_type='train')
File "E:\Morgan\TransReID_TF\tfr_to_dataset.py", line 260, in get_dataset
dataset = custom_shuffle(dataset, batch_size)
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\def_function.py", line 780, in __call__
result = self._call(*args, **kwds)
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\def_function.py", line 846, in _call
return self._concrete_stateful_fn._filtered_call(canon_args, canon_kwds) # pylint: disable=protected-access
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\function.py", line 1843, in _filtered_call
return self._call_flat(
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\function.py", line 1923, in _call_flat
return self._build_call_outputs(self._inference_function.call(
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\function.py", line 545, in call
outputs = execute.execute(
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\execute.py", line 59, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: No unary variant device copy function found for direction: 1 and Variant type_index: class tensorflow::data::`anonymous namespace'::DatasetVariantWrapper
[[{{node BatchDatasetV2/_206}}]] [Op:__inference_custom_shuffle_11485]
Function call stack:
custom_shuffle
Which I don't understand, and don't see how to fix.
Is there something I'm doing wrong?
PS: I'm aware the lack of minimal code to reproduce this behavior makes it hard to debug, I'll try to provide some as soon as possible.
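Not an answer to the tracing error itself, but a sketch of a possible workaround under the same assumptions as the code above (all_datasets is the per-class list from the question, with each dataset repeating indefinitely): draw all class choices for the epoch up front, so the Python loop and the repeated concatenate() calls disappear and no @tf.function decoration is needed.

import tensorflow as tf

# Sketch only: the names batch_size, samples_per_class, class_nbr,
# iterations_in_epoch and all_datasets are taken from the question's code.
classes_per_batch = batch_size // samples_per_class
choice = tf.random.uniform(
    shape=(iterations_in_epoch * classes_per_batch,),
    minval=0, maxval=class_nbr, dtype=tf.int64)  # one class per batch slot
choice = tf.repeat(choice, samples_per_class)    # samples_per_class picks per slot
choice_ds = tf.data.Dataset.from_tensor_slices(choice)
all_batches = tf.data.experimental.choose_from_datasets(all_datasets, choice_ds).batch(batch_size)

Whether this also fixes the slowness depends on the filter()-based per-class datasets, which re-scan the source for every element; caching or pre-splitting the data per class may be needed as well.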

How to send only 1 variable from a set of 3 in TinyDB

[DISCORD.PY and TINYDB]
I have set up a warning system for Discord. If someone gets warned, the data is saved like so:
{'userid': 264452325945376768, 'warnid': 37996302, 'reason': "some reason"}
The problem with this: I want the command "!warnings" to display these warnings, but I don't want it to have this ugly raw JSON formatting; instead I want it to display like so:
{member} has {amount} warnings.
WARN-ID: {warning id here}
REASON: {reason here}
To do this, I need to somehow call only one variable at a time instead of having all 3 (as in the JSON above) at once.
My code is as follows:
@Ralf.command()
async def warnings(ctx, *, member: discord.Member=None):
    if member is None:
        member = Ralf.get_user(ctx.author.id)
        member_id = ctx.author.id
    else:
        member = Ralf.get_user(member.id)
        member_id = member.id
    WarnList = Query()
    Result = warndb.search(WarnList.userid == member_id)
    warnAmt = len(Result)
    if warnAmt == 1:
        await ctx.send(f"**{member}** has `{warnAmt}` warning.")
    else:
        await ctx.send(f"**{member}** has `{warnAmt}` warnings.")
    for item in Result:
        await ctx.send(item)
This code is working, but it shows the ugly {'userid': 264452325945376768, 'warnid': 37996302, 'reason': "some reason"} as output.
MAIN QUESTION: How do I call only userid without having warnid and reason displayed?
EDIT 1:
Trying to index Result like a dict, I get the following:
Ignoring exception in command warnings:
Traceback (most recent call last):
File "C:\Users\entity2k3\AppData\Local\Programs\Python\Python39\lib\site-packages\discord\ext\commands\core.py", line 85, in wrapped
ret = await coro(*args, **kwargs)
File "c:\Users\entity2k3\Desktop\Discord Bots All\Entity2k3's Moderation\main.py", line 201, in warnings
await ctx.send(f"WARN-ID: `{Result['warnid']}` REASON: {Result['reason']}")
TypeError: list indices must be integers or slices, not str
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\entity2k3\AppData\Local\Programs\Python\Python39\lib\site-packages\discord\ext\commands\bot.py", line 903, in invoke
await ctx.command.invoke(ctx)
File "C:\Users\entity2k3\AppData\Local\Programs\Python\Python39\lib\site-packages\discord\ext\commands\core.py", line 859, in invoke
await injected(*ctx.args, **ctx.kwargs)
File "C:\Users\entity2k3\AppData\Local\Programs\Python\Python39\lib\site-packages\discord\ext\commands\core.py", line 94, in wrapped
raise CommandInvokeError(exc) from exc
discord.ext.commands.errors.CommandInvokeError: Command raised an exception: TypeError: list indices must be integers or slices, not str
You are getting the TypeError because your Result is a list of dictionaries.
Make sure to iterate through Result and process each dictionary separately.
Your Result object looks like this: [{}, {}, {}, ...]
Also, you shouldn't capitalize the first letter of your variable; name it results, because it may contain more than one result.
for item in results:
    user_id = item.get("userid")
    warn_id = item.get("warnid")
    reason = item.get("reason")
    # do stuff with those
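Putting it together inside the command (a sketch; the message wording is illustrative, and results is the renamed Result list from above):

for item in results:
    await ctx.send(f"WARN-ID: `{item.get('warnid')}` REASON: {item.get('reason')}")

This sends one formatted line per warning instead of dumping the raw dictionary, and it answers the main question: item.get('userid') gives you just the user id on its own.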

Datastore error: BadValueError: Expected integer, got [0, 1, 2, 3]

Others have reported a similar error, but the solutions given do not solve my problem.
For example, there is a good answer here. The answer in the link mentions how ndb changes from a first use to a later use and suggests there is a problem because a first run produces a None in the Datastore. I cannot reproduce or see that happening in the Datastore for my SDK, but that may be because I am running it here from the interactive console.
I am pretty sure I got an initial good run with the GAE interactive console, but every run since then has failed with the error in the title of this question.
I have left the print statements in the following code because they show good results and assure me that the error is occurring in the put() at the very end.
from google.appengine.ext import ndb

class Account(ndb.Model):
    week = ndb.IntegerProperty(repeated=True)
    weeksNS = ndb.IntegerProperty(repeated=True)
    weeksEW = ndb.IntegerProperty(repeated=True)

terry = Account(week=[], weeksNS=[], weeksEW=[])
terry_key = terry.put()
terry = terry_key.get()
print terry
for t in list(range(4)):  # just dummy input, but like real input
    terry.week.append(t)
print terry.week
region = 1  # same error message for region = 0
if region:
    terry.weeksEW.append(terry.week)
else:
    terry.weeksNS.append(terry.week)
print 'EW' + str(terry.weeksEW)
print 'NS' + str(terry.weeksNS)
terry.week = []
print 'week' + str(terry.week)
terry.put()
The idea of my code is to first build up the terry.week list values incrementally and then later store the whole list to the appropriate region, either NS or EW. So I'm looking for a workaround for this scheme.
The error message is likely of no value but I am reproducing it here.
Traceback (most recent call last):
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/tools/devappserver2/python/runtime/request_handler.py", line 237, in handle_interactive_request
exec(compiled_code, self._command_globals)
File "<string>", line 55, in <module>
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/model.py", line 3458, in _put
return self._put_async(**ctx_options).get_result()
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/tasklets.py", line 383, in get_result
self.check_success()
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/tasklets.py", line 427, in _help_tasklet_along
value = gen.throw(exc.__class__, exc, tb)
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/context.py", line 824, in put
key = yield self._put_batcher.add(entity, options)
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/tasklets.py", line 430, in _help_tasklet_along
value = gen.send(val)
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/context.py", line 358, in _put_tasklet
keys = yield self._conn.async_put(options, datastore_entities)
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/datastore/datastore_rpc.py", line 1858, in async_put
pbs = [entity_to_pb(entity) for entity in entities]
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/model.py", line 697, in entity_to_pb
pb = ent._to_pb()
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/model.py", line 3167, in _to_pb
prop._serialize(self, pb, projection=self._projection)
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/model.py", line 1422, in _serialize
values = self._get_base_value_unwrapped_as_list(entity)
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/model.py", line 1192, in _get_base_value_unwrapped_as_list
wrapped = self._get_base_value(entity)
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/model.py", line 1180, in _get_base_value
return self._apply_to_values(entity, self._opt_call_to_base_type)
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/model.py", line 1352, in _apply_to_values
value[:] = map(function, value)
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/model.py", line 1234, in _opt_call_to_base_type
value = _BaseValue(self._call_to_base_type(value))
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/model.py", line 1255, in _call_to_base_type
return call(value)
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/model.py", line 1331, in call
newvalue = method(self, value)
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/model.py", line 1781, in _validate
(value,))
BadValueError: Expected integer, got [0, 1, 2, 3]
I believe the error comes from these lines:
terry.weeksEW.append(terry.week)
terry.weeksNS.append(terry.week)
You are not appending another integer; you are appending a list where an integer is expected.
>>> aaa = [1,2,3]
>>> bbb = [4,5,6]
>>> aaa.append(bbb)
>>> aaa
[1, 2, 3, [4, 5, 6]]
>>>
This fails the ndb.IntegerProperty test.
Try:
terry.weeksEW += terry.week
terry.weeksNS += terry.week
EDIT: To save a list of lists, do not use IntegerProperty() but JsonProperty() instead. Better still, the ndb datastore is deprecated, so I recommend Firestore, which uses JSON objects by default. At the very least, use Cloud Datastore or Cloud NDB.
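A minimal sketch of the JsonProperty approach (illustrative; it mirrors the question's model, and JsonProperty stores any JSON-serializable value, so a list of lists is fine):

from google.appengine.ext import ndb

class Account(ndb.Model):
    week = ndb.IntegerProperty(repeated=True)
    weeksNS = ndb.JsonProperty()  # JSON can hold a list of lists
    weeksEW = ndb.JsonProperty()

terry = Account(week=[], weeksNS=[], weeksEW=[])
terry.week = [0, 1, 2, 3]
terry.weeksEW.append(terry.week)  # appending a whole list is fine now
terry.put()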

Python3 - IndexError when trying to save a text file

I'm trying to follow this tutorial with my own local data files:
CNTK tutorial
I have the following function to save my data array into a txt file feedable to CNTK:
import os
import numpy as np

# Save the data files into a format compatible with the CNTK text reader
def savetxt(filename, ndarray):
    dir = os.path.dirname(filename)
    if not os.path.exists(dir):
        os.makedirs(dir)
    if not os.path.isfile(filename):
        print("Saving", filename)
        with open(filename, 'w') as f:
            labels = list(map(' '.join, np.eye(11, dtype=np.uint).astype(str)))
            for row in ndarray:
                row_str = row.astype(str)
                label_str = labels[row[-1]]
                feature_str = ' '.join(row_str[:-1])
                f.write('|labels {} |features {}\n'.format(label_str, feature_str))
    else:
        print("File already exists", filename)
I have 2 ndarrays of the following shapes that I want to feed to the model:
train.shape
(1976L, 15104L)
test.shape
(1976L, 15104L)
Then I try to call the function like this:
# Save the train and test files (prefer our default path for the data)
data_dir = os.path.join("C:/Users", 'myself', "OneDrive", "IA Project", 'data', 'train')
if not os.path.exists(data_dir):
    data_dir = os.path.join("data", "IA Project")

print('Writing train text file...')
savetxt(os.path.join(data_dir, "Train-128x118_cntk_text.txt"), train)
print('Writing test text file...')
savetxt(os.path.join(data_dir, "Test-128x118_cntk_text.txt"), test)
print('Done')
and then i get the following error:
Writing train text file...
Saving C:/Users\A702628\OneDrive - Atos\Microsoft Capstone IA\Capstone data\train\Train-128x118_cntk_text.txt
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-24-b53d3c69b8d2> in <module>()
6
7 print ('Writing train text file...')
----> 8 savetxt(os.path.join(data_dir, "Train-128x118_cntk_text.txt"), train)
9
10 print ('Writing test text file...')
<ipython-input-23-610c077db694> in savetxt(filename, ndarray)
12 for row in ndarray:
13 row_str = row.astype(str)
---> 14 label_str = labels[row[-1]]
15 feature_str = ' '.join(row_str[:-1])
16 f.write('|labels {} |features {}\n'.format(label_str, feature_str))
IndexError: list index out of range
Can somebody please tell me what's going wrong with this part of the code, and how I could fix it? Thank you very much in advance.
Since you're using your own input data: are the labels in the range 0 to 10? The labels list built with np.eye(11) only has 11 entries, so any label value outside that range will cause an out-of-range problem.
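A small defensive check along those lines (a sketch; it assumes, as in the question's savetxt, that the last column of each row holds the class label):

import numpy as np

# Size the one-hot label table from the data itself, so labels[row[-1]]
# can never go out of range for any row of `train`.
num_classes = int(train[:, -1].max()) + 1
labels = list(map(' '.join, np.eye(num_classes, dtype=np.uint).astype(str)))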

In Google App Engine, how to check input validity of Key created by urlsafe?

Suppose I create a key from a user-supplied urlsafe string:
key = ndb.Key(urlsafe=some_user_input)
How can I check if the some_user_input is valid?
My current experiment shows that the statement above will throw a ProtocolBufferDecodeError (Unable to merge from string.) exception if some_user_input is invalid, but I could not find anything about this in the API docs. Could someone kindly confirm this, and point me to some better way of checking user input validity instead of catching the exception?
Thanks a lot!
If you try to construct a Key with an invalid urlsafe parameter
key = ndb.Key(urlsafe='bogus123')
you will get an error like
Traceback (most recent call last):
File "/opt/google/google_appengine/google/appengine/runtime/wsgi.py", line 240, in Handle
handler = _config_handle.add_wsgi_middleware(self._LoadHandler())
File "/opt/google/google_appengine/google/appengine/runtime/wsgi.py", line 299, in _LoadHandler
handler, path, err = LoadObject(self._handler)
File "/opt/google/google_appengine/google/appengine/runtime/wsgi.py", line 85, in LoadObject
obj = __import__(path[0])
File "/home/tim/git/project/main.py", line 10, in <module>
from src.tim import handlers as handlers_
File "/home/tim/git/project/src/tim/handlers.py", line 42, in <module>
class ResetHandler(BaseHandler):
File "/home/tim/git/project/src/tim/handlers.py", line 47, in ResetHandler
key = ndb.Key(urlsafe='bogus123')
File "/opt/google/google_appengine/google/appengine/ext/ndb/key.py", line 212, in __new__
self.__reference = _ConstructReference(cls, **kwargs)
File "/opt/google/google_appengine/google/appengine/ext/ndb/utils.py", line 142, in positional_wrapper
return wrapped(*args, **kwds)
File "/opt/google/google_appengine/google/appengine/ext/ndb/key.py", line 642, in _ConstructReference
reference = _ReferenceFromSerialized(serialized)
File "/opt/google/google_appengine/google/appengine/ext/ndb/key.py", line 773, in _ReferenceFromSerialized
return entity_pb.Reference(serialized)
File "/opt/google/google_appengine/google/appengine/datastore/entity_pb.py", line 1710, in __init__
if contents is not None: self.MergeFromString(contents)
File "/opt/google/google_appengine/google/net/proto/ProtocolBuffer.py", line 152, in MergeFromString
self.MergePartialFromString(s)
File "/opt/google/google_appengine/google/net/proto/ProtocolBuffer.py", line 168, in MergePartialFromString
self.TryMerge(d)
File "/opt/google/google_appengine/google/appengine/datastore/entity_pb.py", line 1839, in TryMerge
d.skipData(tt)
File "/opt/google/google_appengine/google/net/proto/ProtocolBuffer.py", line 677, in skipData
raise ProtocolBufferDecodeError, "corrupted"
ProtocolBufferDecodeError: corrupted
Interesting here is
File "/opt/google/google_appengine/google/appengine/ext/ndb/key.py", line 773, in _ReferenceFromSerialized
return entity_pb.Reference(serialized)
which is the last code executed in the key.py module:
def _ReferenceFromSerialized(serialized):
    """Construct a Reference from a serialized Reference."""
    if not isinstance(serialized, basestring):
        raise TypeError('serialized must be a string; received %r' % serialized)
    elif isinstance(serialized, unicode):
        serialized = serialized.encode('utf8')
    return entity_pb.Reference(serialized)
serialized here is the decoded urlsafe string; you can read more about it in the link to the source code.
Another interesting frame is the last one:
File "/opt/google/google_appengine/google/appengine/datastore/entity_pb.py", line 1839, in TryMerge
in the entity_pb.py module which looks like this
def TryMerge(self, d):
    while d.avail() > 0:
        tt = d.getVarInt32()
        if tt == 106:
            self.set_app(d.getPrefixedString())
            continue
        if tt == 114:
            length = d.getVarInt32()
            tmp = ProtocolBuffer.Decoder(d.buffer(), d.pos(), d.pos() + length)
            d.skip(length)
            self.mutable_path().TryMerge(tmp)
            continue
        if tt == 162:
            self.set_name_space(d.getPrefixedString())
            continue
        if (tt == 0): raise ProtocolBuffer.ProtocolBufferDecodeError
        d.skipData(tt)
which is where the actual attempt to merge the input into a Key is made.
You can see in the source code that during the process of constructing a Key from an urlsafe parameter, not a whole lot can go wrong: first it checks whether the input is a string; if it's not, a TypeError is raised; if it is a string but not a valid one, indeed a ProtocolBufferDecodeError is raised.
My current experiment shows that statement above will throw ProtocolBufferDecodeError (Unable to merge from string.) exception if the some_user_input is invalid, but could not find anything about this from the API. Could someone kindly confirm this
Sort of confirmed - we now know that also TypeError can be raised.
and point me some better way for user input validity checking instead of catching the exception?
This is an excellent way to check validity! Why do the checks yourself if they are already done by App Engine? A code snippet could look like this (not working code, just an example):
def get(self):
    # first, fetch the user_input from somewhere
    try:
        key = ndb.Key(urlsafe=user_input)
    except TypeError:
        return 'Sorry, only string is allowed as urlsafe input'
    except ProtocolBufferDecodeError:
        return 'Sorry, the urlsafe string seems to be invalid'
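The same idea as a reusable helper (a sketch; the helper name is ours, and the exception import path matches the traceback above):

from google.appengine.ext import ndb
from google.net.proto.ProtocolBuffer import ProtocolBufferDecodeError

def key_from_urlsafe(user_input):
    """Return an ndb.Key, or None if user_input is not a valid urlsafe key."""
    try:
        return ndb.Key(urlsafe=user_input)
    except (TypeError, ProtocolBufferDecodeError):
        return None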
