Extracting text from big files in Prolog

I would like to extract text between a start and end string with SWI-Prolog, e.g., all the titles from Wikipedia dumps. I don't want to use an XML parser, as I want to deal with different file types in the same way. I got it working for small files, but I run into problems with large files.
For big files (e.g., the Romanian Wikipedia), Prolog runs out of memory (prolog -G1G -L1G -T1G -s main.pl -t main; see the content of main.pl below):
Welcome to SWI-Prolog (threaded, 64 bits, version 7.4.2)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
Please run ?- license. for legal details.
For online help and background, visit http://www.swi-prolog.org
For built-in help, use ?- help(Topic). or ?- apropos(Word).
found: 'Rocarta'
found: 'Muzică'
found: 'Iris (formație românească)'
found: 'Pagina principală'
...[removed hundreds of lines]
found: 'Zadar'
found: 'Australia'
found: 'Slovenia'
found: 'Croația'
ERROR: Out of global stack
Exception: (5,861) between([60, 116, 105, 116, 108, 101, 62], [60, 47, 116, 105, 116, 108, 101, 62], _264890370, [10, 32, 32, 32, 32, 60, 110, 115|...], []) ?
How to accomplish this task with big input files?
MWE (main.pl):
:- use_module(library(pio)).
:- use_module(library(dcg/basics)).
last_call_optimisation(true).
main :-
phrase_from_file(between(`<title>`, `</title>`, _), `wiki.xml`).
between(Start, End, Found) -->
string(_), string(Start), string(Found), string(End),
{ format("found: '~s' \n", [Found]) },
between(Start, End, _).
between(_, _, []) -->
remainder(_),
{ format("finished parsing") }.
example input (wiki.xml):
<mediawiki>
>< Don't use an XML parser! ><
<page><title>Albert Einstein</title></page>
<page><title>Elvis Presley</title></page>
</mediawiki>
example output (expected):
found: 'Albert Einstein'
found: 'Elvis Presley'
finished parsing
Edit:
If we remove the recursive call from between/3, the output changes, and doesn't correspond to what I expect:
found: 'Albert Einstein'
found: 'Albert Einstein</title></page>
<page><title>Elvis Presley'
found: 'Elvis Presley'
finished parsing

This construct
..., string(_), string(Start), ...
is very inefficient. It turns a linear parse into an exponential one.
But we have a really simple solution, since a string literal performs an exact match in a DCG:
:- use_module(library(dcg/basics)).
main(Titles) :-
%phrase_from_file(between(`<title>`, `</title>`, Titles),`wiki.xml`).
phrase(between(`<title>`, `</title>`, Titles), `
<mediawiki>
>< Don't use an XML parser! ><
<page><title>Albert Einstein</title></page>
<page><title>Elvis Presley</title></page>
</mediawiki>
`).
between(_Start, _End, []) --> [].
between(Start, End, [Found|Rest]) -->
Start, string(String), End,
{ atom_codes(Found, String) },
!, between(Start, End, Rest).
between(Start, End, List) --> [_], between(Start, End, List).
I would simplify the code, though:
...
phrase(tag(`title`, Titles), `
...
tag(_Tag, []) --> [].
tag(Tag, [Found|Rest]) -->
"<", Tag, ">", string(String), "</", Tag, ">",
{ atom_codes(Found, String) },
!, tag(Tag, Rest).
tag(Tag, List) --> [_], tag(Tag, List).
My bet is that on large files this is slightly more efficient.
It's also easy to generalize:
...
phrase(tags([`title`, `footnote`], Contents), `
...
tags(_Tags, []) --> [].
tags(Tags, [Key-Found|Rest]) -->
"<", {member(Tag, Tags)}, Tag, ">", string(String), "</", Tag, ">",
{ maplist(atom_codes, [Found,Key], [String,Tag]) },
!, tags(Tags, Rest).
tags(Tags, List) --> [_], tags(Tags, List).
But this is not very efficient. The following might be better (though you should profile to prove it):
...
"<", string(Tag), ">", {memberchk(Tag, Tags)}, string(String), "</", Tag, ">",
...
Edit: at least on a small set of Tags, "<", {member(Tag, Tags)}, Tag, ">" seems to require far fewer inferences than "<", string(Tag), ">", {memberchk(Tag, Tags)}.

Related

How can I create and shuffle a dataset for triplet mining in TensorFlow 2?

I'm working on a network that uses triplet mining for training. For it to work properly, my batches need to contain several images of the same class. The problem I'm currently facing is that I have 751 classes, for a total of 12,937 pictures, and a batch size of 48 pictures. When shuffling the dataset with the command below, the odds of getting pictures from the same class in one batch are really low, which makes the triplet mining inefficient.
dataset = dataset.shuffle(12937)
What I need instead is a way of generating batches that contain a specific number of pictures for every class represented in the batch. For example, if I want 12 classes per batch, there would be 4 pictures for each of them.
Another problem I'm facing is how to shuffle this dataset at the end of every epoch, so that I get different batches that still satisfy the condition above, that is, 12 classes with 4 pictures each.
Is there any proper way to do it? I can't really find one. Please let me know if I'm unclear, and if you need further details.
================ EDIT ================
I've been trying a few things, and came up with something that would do what I want. The function would be the following:
import tensorflow as tf
counter = 0.
# Assuming a format such as (data, label)
def predicate(data, label):
    global counter
    allowed_labels = tf.constant([counter])
    isallowed = tf.equal(allowed_labels, tf.cast(label, tf.float32))
    reduced = tf.reduce_sum(tf.cast(isallowed, tf.float32))
    counter += 1
    return tf.greater(reduced, tf.constant(0.))
# @tf.function
def custom_shuffle(train_dataset, batch_size, samples_per_class = 4, iterations_in_epoch = 100, database='market'):
    assert batch_size%samples_per_class==0, F'batch size must be a {samples_per_class} multiple.'
    if database == 'market':
        class_nbr = 751
    else:
        raise Exception('Unsupported database yet')
    all_datasets = [train_dataset.filter(predicate) for _ in range(class_nbr)] # Every element of this array is a dataset of one class
    for i in range(iterations_in_epoch):
        choice = tf.random.uniform(
            shape=(batch_size//samples_per_class,),
            minval=0,
            maxval=class_nbr,
            dtype=tf.dtypes.int64,
        ) # Which classes will be in batch
        choice = tf.data.Dataset.from_tensor_slices(tf.concat([choice for _ in range(4)], axis=0)) # Exactly 4 picture from each class in the batch
        batch = tf.data.experimental.choose_from_datasets(all_datasets, choice)
        if i==0:
            all_batches = batch
        else:
            all_batches = all_batches.concatenate(batch)
    all_batches = all_batches.batch(batch_size)
    return all_batches
It does what I want, but the returned dataset is extremely slow to iterate over, which makes training the model impossible. As per this thread, I understood that I needed to decorate custom_shuffle with @tf.function, like the commented-out one above. However, when I do so, it raises the following error:
Traceback (most recent call last):
File "training.py", line 137, in <module>
main()
File "training.py", line 80, in main
train_dataset = get_dataset(TRAINING_FILENAMES, IMG_SIZE, BATCH_SIZE, database=database, func_type='train')
File "E:\Morgan\TransReID_TF\tfr_to_dataset.py", line 260, in get_dataset
dataset = custom_shuffle(dataset, batch_size)
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\def_function.py", line 780, in __call__
result = self._call(*args, **kwds)
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\def_function.py", line 846, in _call
return self._concrete_stateful_fn._filtered_call(canon_args, canon_kwds) # pylint: disable=protected-access
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\function.py", line 1843, in _filtered_call
return self._call_flat(
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\function.py", line 1923, in _call_flat
return self._build_call_outputs(self._inference_function.call(
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\function.py", line 545, in call
outputs = execute.execute(
File "D:\Programs\Anaconda3\envs\AlignedReID_TF\lib\site-packages\tensorflow\python\eager\execute.py", line 59, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: No unary variant device copy function found for direction: 1 and Variant type_index: class tensorflow::data::`anonymous namespace'::DatasetVariantWrapper
[[{{node BatchDatasetV2/_206}}]] [Op:__inference_custom_shuffle_11485]
Function call stack:
custom_shuffle
Which I don't understand, and don't see how to fix.
Is there something I'm doing wrong?
PS: I'm aware the lack of minimal code to reproduce this behavior makes it hard to debug, I'll try to provide some as soon as possible.
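For reference, the batch layout described in the question (12 classes per batch, 4 pictures per class) can be sketched in plain Python as an index sampler. This is only an illustration, not a tested fix for the tf.function error: the function name balanced_batches, its parameters, and the toy labels are all made up here, and it assumes the full list of integer labels fits in memory.

```python
import random
from collections import defaultdict

def balanced_batches(labels, classes_per_batch=12, samples_per_class=4,
                     num_batches=100, seed=None):
    """Yield lists of dataset indices; each batch holds classes_per_batch
    distinct classes with samples_per_class indices for each class."""
    rng = random.Random(seed)
    # Group the dataset indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    class_ids = list(by_class)
    for _ in range(num_batches):
        batch = []
        # Draw distinct classes, then draw pictures inside each class.
        for label in rng.sample(class_ids, classes_per_batch):
            pool = by_class[label]
            if len(pool) >= samples_per_class:
                batch.extend(rng.sample(pool, samples_per_class))
            else:
                # Too few pictures in this class: sample with replacement.
                batch.extend(rng.choices(pool, k=samples_per_class))
        yield batch

# Toy data (hypothetical): 20 classes with 5 pictures each.
toy_labels = [c for c in range(20) for _ in range(5)]
first = next(balanced_batches(toy_labels, seed=0))
print(len(first))  # 48
```

Since every batch is a fresh random draw, creating the generator again at the start of each epoch gives different batches that still follow the 12 x 4 layout. The index lists could then be used to look up the images, for instance through tf.data.Dataset.from_generator, though whether that is fast enough for this dataset would need to be measured.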

ijson.common.IncompleteJSONError: lexical error: invalid char in json text

Hi all, ijson newbie here. I have a very large .json file (168 GB) and I want to get all possible keys, but some values in the file are written as NaN. ijson creates a generator that yields dictionaries (value in my code). When it reaches an item containing NaN, it throws an error. How can I get a string instead of a dictionary for value? I tried parser = ijson.items(input_file, '', multiple_values=True, map_type=str), but it didn't help.
import json
import ijson
def parse_json(json_filename):
    with open('max_data_error.txt', 'w') as outfile:
        with open(json_filename, 'r') as input_file:
            # outfile.write('[ '
            parser = ijson.items(input_file, '', multiple_values=True)
            cont = 0
            max_keys_list = list()
            for value in parser:
                for i in json.loads(json.dumps(value, ensure_ascii=False, default=str)) :
                    if i not in max_keys_list:
                        max_keys_list.append(i)
                print(value)
            print(max_keys_list)
            for keys_item in max_keys_list:
                outfile.write(keys_item + '\n')
if __name__ == '__main__':
    parse_json('./email/emailrecords.bson.json')
Traceback (most recent call last):
File "panda read.py", line 29, in <module>
parse_json('./email/emailrecords.bson.json')
File "panda read.py", line 17, in parse_json
for value in parser:
ijson.common.IncompleteJSONError: lexical error: invalid char in json text.
litecashwire.com","lastname":NaN,"firstname":"Mia","zip":"87
(right here) ------^
Your file is not valid JSON (NaN is not a valid JSON value); therefore any JSON parsing library will complain about it, one way or another, unless it has an extension to handle this non-standard content.
The ijson FAQ found in the project description has a question about invalid UTF-8 characters and how to deal with them. Those same answers apply here, so I would suggest you go and try one of those.
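As one example of such an extension, Python's standard json module accepts NaN by default when decoding. The sketch below relies on that to collect keys, under a big assumption that the question does not confirm: that the file holds one JSON record per line, so it can be read line by line without loading 168 GB at once. The function name collect_keys and the sample records are made up for illustration.

```python
import json

def collect_keys(lines):
    """Return the top-level keys seen across newline-delimited JSON
    records, in first-seen order."""
    seen = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The stdlib decoder accepts the non-standard NaN literal by default.
        record = json.loads(line)
        for key in record:
            seen.setdefault(key, None)
    return list(seen)

# Hypothetical records; the second field of the first one is a bare NaN.
sample = [
    '{"email": "a@example.com", "lastname": NaN, "firstname": "Mia"}',
    '{"email": "b@example.com", "zip": "87"}',
]
print(collect_keys(sample))  # ['email', 'lastname', 'firstname', 'zip']
```

With a real file, passing the open file object instead of the sample list would iterate over it one line at a time. If the file is a single huge document rather than one record per line, this approach does not apply.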

Datastore error: BadValueError: Expected integer, got [0, 1, 2, 3]

Others have reported a similar error, but the solutions given do not solve my problem.
For example there is a good answer here. The answer in the link mentions how ndb changes from a first use to a later use and suggests there is a problem because a first run produces a None in the Datastore. I cannot reproduce or see that happening in the Datastore for my sdk, but that may be because I am running it here from the interactive console.
I am pretty sure I got an initial good run with the GAE interactive console, but every run since then has failed with the error in the title of this question.
I have left the print statements in the following code because they show good results and assure me that the error is occurring in the put() at the very end.
from google.appengine.ext import ndb
class Account(ndb.Model):
    week = ndb.IntegerProperty(repeated=True)
    weeksNS = ndb.IntegerProperty(repeated=True)
    weeksEW = ndb.IntegerProperty(repeated=True)
terry=Account(week=[],weeksNS=[],weeksEW=[])
terry_key=terry.put()
terry = terry_key.get()
print terry
for t in list(range(4)): #just dummy input, but like real input
    terry.week.append(t)
print terry.week
region = 1 #same error message for region = 0
if region :
    terry.weeksEW.append(terry.week)
else:
    terry.weeksNS.append(terry.week)
print 'EW'+str(terry.weeksEW)
print 'NS'+str(terry.weeksNS)
terry.week = []
print 'week'+str(terry.week)
terry.put()
The idea of my code is to first build up the terry.week list values incrementally and then later store the whole list to the appropriate region, either NS or EW. So I'm looking for a workaround for this scheme.
The error message is likely of no value but I am reproducing it here.
Traceback (most recent call last):
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/tools/devappserver2/python/runtime/request_handler.py", line 237, in handle_interactive_request
exec(compiled_code, self._command_globals)
File "<string>", line 55, in <module>
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/model.py", line 3458, in _put
return self._put_async(**ctx_options).get_result()
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/tasklets.py", line 383, in get_result
self.check_success()
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/tasklets.py", line 427, in _help_tasklet_along
value = gen.throw(exc.__class__, exc, tb)
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/context.py", line 824, in put
key = yield self._put_batcher.add(entity, options)
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/tasklets.py", line 430, in _help_tasklet_along
value = gen.send(val)
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/context.py", line 358, in _put_tasklet
keys = yield self._conn.async_put(options, datastore_entities)
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/datastore/datastore_rpc.py", line 1858, in async_put
pbs = [entity_to_pb(entity) for entity in entities]
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/model.py", line 697, in entity_to_pb
pb = ent._to_pb()
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/model.py", line 3167, in _to_pb
prop._serialize(self, pb, projection=self._projection)
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/model.py", line 1422, in _serialize
values = self._get_base_value_unwrapped_as_list(entity)
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/model.py", line 1192, in _get_base_value_unwrapped_as_list
wrapped = self._get_base_value(entity)
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/model.py", line 1180, in _get_base_value
return self._apply_to_values(entity, self._opt_call_to_base_type)
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/model.py", line 1352, in _apply_to_values
value[:] = map(function, value)
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/model.py", line 1234, in _opt_call_to_base_type
value = _BaseValue(self._call_to_base_type(value))
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/model.py", line 1255, in _call_to_base_type
return call(value)
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/model.py", line 1331, in call
newvalue = method(self, value)
File "/Users/brian/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/model.py", line 1781, in _validate
(value,))
BadValueError: Expected integer, got [0, 1, 2, 3]
I believe the error comes from these lines:
terry.weeksEW.append(terry.week)
terry.weeksNS.append(terry.week)
You are not appending another integer; you are appending a list where an integer is expected.
>>> aaa = [1,2,3]
>>> bbb = [4,5,6]
>>> aaa.append(bbb)
>>> aaa
[1, 2, 3, [4, 5, 6]]
>>>
This fails the ndb.IntegerProperty test.
Try:
terry.weeksEW += terry.week
terry.weeksNS += terry.week
EDIT: To save a list of lists, do not use the IntegerProperty(), but instead the JsonProperty(). Better still, the ndb datastore is deprecated, so... I recommend Firestore, which uses JSON objects by default. At least use Cloud Datastore, or Cloud NDB.
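To make the difference between the two operations concrete, here is a small sketch with plain lists. It is written in Python 3 syntax (the question's code is Python 2) and does not touch ndb at all; the helper all_ints is a made-up, rough stand-in for the per-item check that a repeated IntegerProperty applies.

```python
week = [0, 1, 2, 3]

nested = []
nested.append(week)      # adds the list itself as a single element
print(nested)            # [[0, 1, 2, 3]]

flat = []
flat += week             # same as flat.extend(week): adds each integer
print(flat)              # [0, 1, 2, 3]

def all_ints(values):
    """Hypothetical stand-in for the per-item integer validation."""
    return all(isinstance(v, int) for v in values)

print(all_ints(nested))  # False
print(all_ints(flat))    # True
```

Note that the flat version loses the boundaries between weeks, which is why a list of lists needs a different property type.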

How do I create an abstract function that has a given derivative in Sage?

I want an abstract function $f$ that has a given derivative. But when I try to substitute for D[0](f)(t), Sage says:
NameError: name 'D' is not defined
R.<t,u1,u2> = PolynomialRing(RR,3,'t' 'u1' 'u2')
tmp1 = r1*k1*u1-(r1/k1)*k1^2*u1^2-r1*b12/k1*k1*u1*k2*u2
f=function('f',t)
a=diff(f)
a.substitute_expression((D[0](f)(t))==tmp1)
tmp1.integral() won't do the job. I also can't substitute the integral, although it gives no warning.
%var u10, u20,r1,r2,k1,k2,b12,b21,t
u1=function('u1',t)
u2=function('u2',t)
tmp1 = r1*k1*u1-(r1/k1)*k1^2*u1^2-r1*b12/k1*k1*u1*k2*u2
tmp2 = r2*u2*k2-r2/k2*k2^2*u2^2-((r2*b21)/k2)*u1*u2*k1*k2
v1=integral(tmp1,t)
v2=integral(tmp2,t)
sep1=tmp1.substitute_expression(u1==v1,u2==v2)
sep2=tmp2.substitute_expression(u1==v1,u2==v2)
trial=diff(sep1,t)
trial.substitute_expression((integrate(-b12*k2*r1*u1(t)*u2(t) - k1*r1*u1(t)^2 + k1*r1*u1(t), t))==v1, (integrate(-b12*k2*r1*u1(t)*u2(t) - k1*r1*u1(t)^2 + k1*r1*u1(t), t))==v2)
Now let's go back to original version:
d1=diff(tmp1,t)
d1.substitute_function((D[0](u1)(t)),tmp1)
Error in lines 13-13
Traceback (most recent call last):
File "/projects/b501d31c-1f5d-48aa-bee3-73a2dcb30a39/.sagemathcloud/sage_server.py", line 733, in execute
exec compile(block+'\n', '', 'single') in namespace, locals
File "", line 1, in <module>
NameError: name 'D' is not defined
I don't know if this is really what you are looking for. But it offers at least some semblance of it.
sage: def myfunc(self, *args, **kwds): return e^(args[0])^2
sage: foo = function('foo', nargs=1, tderivative_func=myfunc)
sage: foo(x)
foo(x)
sage: foo(x).diff(x)
e^(x^2)
sage: foo(x).diff(x,3)
4*x^2*e^(x^2) + 2*e^(x^2)
You'll need to read the documentation of function (gotten by typing function?) very carefully to use this well, especially the comment
Note that custom methods must be instance methods, i.e., expect the
instance of the symbolic function as the first argument.
The doc is quite subtle and could use some improvement.

How can I use sign in array of initial facts with Perl module AI::ExpertSystem::Advanced

I am trying to use the Perl AI::ExpertSystem::Advanced module, and I want to use signs in the array of initial facts. The documentation of this module shows an example:
my $ai = AI::ExpertSystem::Advanced->new(
viewer_class => 'terminal',
knowledge_db => $yaml_kdb,
initial_facts => ['I', ['F', '-'], ['G', '+']);
but there is something wrong with it (a syntax error). I think that one ] is missing at the end of the code.
First question: What is the correct form? When I run the example, my terminal shows me a lot of errors.
Second question: Can I use a file to store the initial facts?
Thanks for your answers.
Error log:
When I use the example from the documentation:
syntax error at mix.pl line 24, near "])"
Global symbol "$ai" requires explicit package name at mix.pl line 26.
Missing right curly or square bracket at mix.pl line 27, at end of line
Execution of mix.pl aborted due to compilation errors.
When I put ] in its correct place at the end of expression: initial_facts => ['I', ['F', '-'], ['G', '+']]);
Attribute (initial_facts) does not pass the type constraint because: Validation failed for 'ArrayRef[Str]' with value ARRAY(0x3268038) at C:/Perl64/lib/Moose/Meta/Attribute.pm line 1274.
Moose::Meta::Attribute::verify_against_type_constraint('Moose::Meta::Attribute=HASH(0x3111108)', 'ARRAY(0x3268038)', 'instance', 'AI::ExpertSystem::Advanced=HASH(0x30ef068)') called at C:/Perl64/lib/Moose/Meta/Attribute.pm line 1261
Moose::Meta::Attribute::_coerce_and_verify('Moose::Meta::Attribute=HASH(0x3111108)', 'ARRAY(0x3268038)', 'AI::ExpertSystem::Advanced=HASH(0x30ef068)') called at C:/Perl64/lib/Moose/Meta/Attribute.pm line 531
Moose::Meta::Attribute::initialize_instance_slot('Moose::Meta::Attribute=HASH(0x3111108)', 'Moose::Meta::Instance=HASH(0x32673d8)', 'AI::ExpertSystem::Advanced=HASH(0x30ef068)', 'HASH(0x3118298)') called at C:/Perl64/lib/Class/MOP/Class.pm line 525
Class::MOP::Class::_construct_instance('Moose::Meta::Class=HASH(0x2eb2418)', 'HASH(0x3118298)') called at C:/Perl64/lib/Class/MOP/Class.pm line 498
Class::MOP::Class::new_object('Moose::Meta::Class=HASH(0x2eb2418)', 'HASH(0x3118298)') called at C:/Perl64/lib/Moose/Meta/Class.pm line 274
Moose::Meta::Class::new_object('Moose::Meta::Class=HASH(0x2eb2418)', 'HASH(0x3118298)') called at C:/Perl64/lib/Moose/Object.pm line 28
Moose::Object::new('AI::ExpertSystem::Advanced', 'viewer_class', 'terminal', 'knowledge_db', 'AI::ExpertSystem::Advanced::KnowledgeDB::YAML=HASH(0x3118478)', 'verbose', 1, 'initial_facts', 'ARRAY(0x3268038)') called at mix.pl line 20
This is a bug in the documentation (and possibly in the module itself).
To set the object up with negative initial facts you need to create the dictionary object first.
my $initial_facts_dict = AI::ExpertSystem::Advanced::Dictionary->new(
stack => [ 'I', ['F', '-'], ['G', '+'] ]);
my $ai = AI::ExpertSystem::Advanced->new(
viewer_class => 'terminal',
knowledge_db => $yaml_kdb,
initial_facts_dict => $initial_facts_dict,
);

Resources