Theory of formal languages - Automaton - language-theory

I'm wondering about formal languages. I have a kind of parser:
It reads an XML-like serialized tree structure and turns it into a multidimensional array.
My question is about the similarities between the algorithm being used and the different kinds of automata (finite-state machines, Turing machines, stack machines, ...).
So the question is: which automaton am I implicitly using here, and which family of formal languages does it correspond to?
And what about recursion?
What I mean by "the automaton I implicitly use" is "the minimal automaton that does the same job".
Here is the complete source :
$words; // an array of XML tags ('<tag>', '</tag>') and simple text content
$tree = array(
    'type' => 'root',
    'sub' => array()
);
$pTree = array(&$tree);
$deep = 0;
foreach ( $words as $elem )
    if ( preg_match($openTag, $elem) ) { // $elem is an open tag
        $pTree[$deep++]['sub'][] = array( // we add an element to the multidim array
            'type' => 'block',
            'content' => $elem,
            'sub' => array()
        );
        $size = sizeof($pTree[$deep - 1]['sub']);
        $pTree[$deep] = &$pTree[$deep - 1]['sub'][$size - 1]; // down one level in the tree
    } elseif ( preg_match($closeTag, $elem) ) { // it is a close tag
        $deep--; // up in the tree
    } else { // simple element
        $pTree[$deep]['sub'][] = array(
            'type' => 'simple',
            'content' => $elem
        );
    }

Please take a look at your question again. You're referring to a $words variable, which is not in your example. Also, there is no code, without knowing what is being done it's hard to answer you.
Judging from the name of the variable $deep, it is probably not the state. The state in an automaton is an element of a set that is specific to the automaton; $deep looks like it could contain a depth, any positive integer. Again, hard to tell without the code.
Anyway, you are probably not "implicitly using" any automaton at all, if you didn't design your code as an implementation of one.
Your simple xml-like files could probably be recognized by a deterministic stack machine, or generated by a deterministic context-free grammar, making them Type-2 in the Chomsky hierarchy. Once again this is just a guess, "a xml-like serialized tree structure" is too vague for any kind of formalism.
In short, if you are looking to use any formal theory, do word your questions more formally.
Edit (after seeing the code):
You're building a tree. That's out of reach for an automaton (at least the “standard” ones). Finite automatons only work with an input and a state, stack machines add a stack to that, and Turing machines have a read-write tape they can move in both directions.
The “output” of an automaton is a simple “Yes” (accepted) or “No” (not accepted, or an infinite loop). (Turing machines can be defined to provide more output on their tape.)
The best I can answer to “which is the minimal automaton to do the same job” is that your language can be accepted by a stack machine; but it would work very differently and not give you trees.
However, you might look into grammars – another formal language construct that introduces the concept of parse trees.
What you are doing here is creating such a parse tree with a top-down parser.
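As a rough illustration of the "accepted by a stack machine" point, here is a minimal Python sketch of a recognizer that only answers yes or no and never builds a tree. The token patterns OPEN_TAG and CLOSE_TAG are assumptions (the question never shows $openTag or $closeTag), and a textbook stack machine would additionally need the set of tag names to be fixed and finite.

```python
import re

# Hypothetical token patterns; the question does not show $openTag / $closeTag.
OPEN_TAG = re.compile(r"^<([A-Za-z][\w-]*)>$")
CLOSE_TAG = re.compile(r"^</([A-Za-z][\w-]*)>$")


def accepts(words):
    """Return True if the token list is a well-nested sequence of tags and text."""
    stack = []  # the only memory besides the current token
    for word in words:
        opened = OPEN_TAG.match(word)
        closed = CLOSE_TAG.match(word)
        if opened:
            stack.append(opened.group(1))      # push on an open tag
        elif closed:
            # pop on a close tag; reject if it does not match the top of the stack
            if not stack or stack.pop() != closed.group(1):
                return False
        # plain text leaves the stack untouched
    return not stack                           # accept only if every tag was closed
```

Unlike the PHP code in the question, this keeps nothing but the stack of open tag names, which is why it can accept or reject the input but cannot hand back the tree.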

Related

Perl: Parentheses vs Brackets for array definition, why is one considered a scalar?

I was following this tutorial on the HTML::Template module for Perl. Here's the template:
<!--template2.tmpl-->
<html>
<body>
<table>
<tr>
<th>Language</th>
<th>Description</th>
</tr>
<tmpl_loop name="language">
<tr>
<td><tmpl_var name="language_name"></td>
<td><tmpl_var name="description"></td>
</tr>
</tmpl_loop>
</table>
</body>
</html>
And here's the CGI test program:
#!"C:\Strawberry\perl\bin\perl.exe" -wT
use CGI qw(:all);
use CGI::Carp qw(fatalsToBrowser);
use HTML::Template;
my @rows = (
    {
        language_name => 'C#',
        description => 'Created by Microsoft'
    },
    {
        language_name => 'PHP',
        description => 'Hypertext Preprocessor'
    },
    {
        language_name => 'Haskell',
        description => 'Functional language'
    },
);
print header;
my $template = HTML::Template->new(filename => 'template2.tmpl');
$template->param(language => @rows);
print $template->output();
This fails with the following error: HTML::Template::param() : attempt to set parameter 'language' with a scalar - parameter is not a TMPL_VAR!
However, when I change the definition of @rows from using parentheses to using square brackets (from my @rows = (...) to my @rows = [...]), the code works fine; it displays a table with the data.
As I understood from reading this article, the first form is an array defined from a list and the second one is a reference to an anonymous array. It's still not clear to me why the first form doesn't work. I'd appreciate you clarifying this for me.
The tutorial you're following contains an error. The line
$template->param( language => @languages );
should be
$template->param( language => \@languages );
Why? Short answer: the value you pass to param alongside a loop name must be a reference to an array, not an array.
Long answer: When you pass arguments to a function or method, all of the arguments get expanded into one long list. This is a common source of mistakes for beginners. So in your code (and in the tutorial's code), you're not passing two parameters to the param method, you're passing four (one for the string 'language', and three for the elements of @languages).
Here's an example of this argument-list unraveling. If you have three variables as follows:
my $scalar = 'bear';
my @array = ('rat', 'moose', 'owl');
my %hash = (mass => 500, units => 'kg');
and you pass them to a function like so:
some_function($scalar, @array, %hash);
then the function will see eight arguments: 'bear', 'rat', 'moose', 'owl', 'mass', 500, 'units', and 'kg'! Perhaps even more surprising, the two key/value pairs from the hash might be passed in a different order, because hashes are not stored or retrieved in a determinate order.
Your solution of changing the parentheses to square brackets works, but not for a very good reason. Parentheses delimit lists (which can be stored in arrays or hashes); square brackets delimit references to arrays. So your square-bracket code creates a reference to an anonymous array, which is then stored as the first (and only) element of your named array @rows. Instead, you should either store the array reference (delimited with square brackets) in a scalar variable (say, $rows), or you should use parentheses, store the list in the array @rows, and pass a reference to that array to the param method (by using a backslash, as I did above with \@languages).
language => @rows
means
'language', $rows[0], $rows[1], $rows[2], ...
or
language => $rows[0],
$rows[1] => $rows[2],
...
You want
language => \@rows
The param() method in HTML::Template takes pairs of arguments. The first value in the pair is the name of a template variable that you want to set and the second is the value that you want to set that variable to.
So you can make a call that sets a single variable:
$template->param(foo => 1);
Or you can set multiple variables in one call:
$template->param(foo => 1, bar => 2, baz => 3);
For reasons that should be obvious, the variable names given in your call to param() should all be variables that are defined in your template (either as standard tmpl_var variables or as looping tmpl_loop variables).
If you're setting a tmpl_loop variable (as you are in this case) then the associated value needs to be a reference to an array containing your values. There is some attempt to explain this in the documentation for param(), but I can see how it might be unclear as it just does it by showing examples in square (array reference constructor) brackets rather than actually explaining the requirement.
The reason for this is that the list of parameters passed to a subroutine in Perl is "flattened", so an array is broken up into its individual elements. This means that when you pass:
$template->param(language => @rows);
Perl sees it as:
$template->param(language => $rows[0], $rows[1] => $rows[2]);
The elements of your array are hash references. This means that $rows[1] will be interpreted as a stringified hash reference (something like "HASH(0x12345678)"), which definitely isn't the name of one of the variables in your template.
So how do we fix this? Well, there are a few alternatives. You have stumbled over a bad one. You have used code like this:
@rows = [ ... ];
This creates @rows as an array with a single element, which is a reference to your real array. This means that:
$template->param(language => @rows);
is interpreted as:
$template->param(language => $rows[0]);
And as $rows[0] is a reference to your array, it all works.
Far better would be to explicitly pass a reference to @rows.
@rows = ( ... ); # your original version
$template->param(language => \@rows);
Or to create an array reference, stored in a scalar.
$rows = [ ... ];
$template->param(language => $rows);
There's really nothing to choose between these two options.
However, I would ask you to consider why you are spending time teaching yourself HTML::Template. It has been many years since I have seen it being used. The Template Toolkit seems to have become the de-facto standard Perl templating engine.

using ruby to extract the values in a hash in a DRY way

My app passes to different methods a json_element for which the keys are different, and sometimes empty.
To handle it, I have been hard-coding the extraction with the following sample code:
def act_on_ruby_tag(json_element)
begin
# logger.progname = __method__
logger.debug json_element
code = json_element['CODE']['$'] unless json_element['CODE'].nil?
predicate = json_element['PREDICATE']['$'] unless json_element['PREDICATE'].nil?
replace = json_element['REPLACE-KEY']['$'] unless json_element['REPLACE-KEY'].nil?
hash = json_element['HASH']['$'] unless json_element['HASH'].nil?
I would like to eliminate hardcoding the values, and not quite sure how.
I started to think through it as follows:
keys = json_element.keys
keys.each do |k|
  set_key = k.downcase
  instance_variable_set("@" + set_key, json_element[k]['$']) unless json_element[k].nil?
end
And then use @code for example in the rest of the method.
I was going to try to turn into a method and then replace all this hardcoded code.
But I wasn't entirely sure if this is a good path.
It's almost always better to return a hash structure from a method where you have things like { code: ... } rather than setting arbitrary instance variables. If you return them in a consistent container, it's easier for callers to deal with delivering that to the right location, storing it for later, or picking out what they want and discarding the rest.
It's also a good idea to try and break up one big, clunky step with a series of smaller, lighter operations. This makes the code a lot easier to follow:
def extract(json)
  json.reject do |k, v|
    v.nil?
  end.map do |k, v|
    [ k.downcase, v['$'] ]
  end.to_h
end
Then you get this:
extract(
'TEST' => { '$' => 'value' },
'CODE' => { '$' => 'code' },
'NULL' => nil
)
# => {"test"=>"value", "code"=>"code"}
If you want to persist this whole thing as an instance variable, that's a fairly typical pattern, but it will have a predictable name that's not at the mercy of whatever arbitrary JSON document you're consuming.
An alternative is to hard-code the keys in a constant like:
KEYS = %w[ CODE PREDICATE ... ]
Then use that instead, or one step further, define that in a YAML or JSON file you can read-in for configuration purposes. It really depends on how often these will change, and what sort of expectations you have about the irregularity of the input.
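To make the fixed-key-list idea concrete, here is a small sketch in Python rather than Ruby. The key names come from the question; the names KEYS and extract, and the choice to skip missing or null entries, are assumptions for illustration only.

```python
# Keys taken from the question; treating them as the complete list is an assumption.
KEYS = ("CODE", "PREDICATE", "REPLACE-KEY", "HASH")


def extract(element, keys=KEYS):
    """Pull the '$' value for each expected key, skipping missing or null entries."""
    return {
        key.lower(): element[key].get("$")
        for key in keys
        if element.get(key) is not None
    }
```

Anything in the input that is not listed in KEYS is ignored, so the result always has predictable names regardless of what the incoming document contains.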
This is a slightly more terse way to do what your original code does.
code, predicate, replace, hash = json_element.values_at(
  *%w{ CODE PREDICATE REPLACE-KEY HASH }
).map { |x| x.fetch("$", nil) if x }

generate C structures from high level definition of objects

I have to define high level definition of objects such as :
obj1 => [
    name => "object1",
    type => "uint64",
    dependents => [
        {type => "uint32", name => "dep1"},
        {type => "uint32", name => "dep2"}
    ],
    default_value => "100"
]
From this I want to generate the C structures and some helper routines such as:
struct dependents {
    int type;
    char name[MAX];
};
struct struct_obj1 {
    char name[MAX];
    int type;
    struct dependents deps[MAX_DEP];
    uint64_t default_value; /* from <stdint.h> */
};
// Some initializations..
Earlier I thought I could define the high level objects in .pm (perl module) files and then use perl to generate C code, but writing code to generate C code this way might be error prone and tough to maintain if object definitions change in future.
What I want to know is: are there any ready-made tools that let us write high-level object definitions and auto-generate their C structures?
There are plenty of code generators for C - you're more likely to find something that uses an intermediary syntax such as xml; A quick google turned up xml2c. You can use XML::Simple for saving your hashes to xml.
More examples can be found on google.
If you wish to roll out your own, code generation using template toolkit provides a flexible approach.
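For a sense of what rolling your own involves, here is a minimal sketch in Python (not Perl or Template Toolkit). The function render_struct and the (type, name, array length) field layout are assumptions made up for this example, not part of any existing tool.

```python
def render_struct(name, fields):
    """Render a C struct definition from (c_type, field_name, array_len) tuples.

    array_len is None for scalar fields, or a size expression such as "MAX".
    """
    lines = [f"struct {name} {{"]
    for c_type, field_name, array_len in fields:
        suffix = f"[{array_len}]" if array_len else ""
        lines.append(f"    {c_type} {field_name}{suffix};")
    lines.append("};")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical description of the 'dependents' record from the question.
    print(render_struct("dependents", [("int", "type", None), ("char", "name", "MAX")]))
```

Keeping the object description as plain data and the C syntax in one rendering function means a changed definition only touches the data, which is the maintainability concern raised in the question.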

parsing strings in c

I have a string like this:
{ "\\"name\\" => \\"{ 'a', 'b', 'c' }\\"**,** \\"age\\" => \\"{6, 7, 8 }\\" " }
It's an hstore, and, for example, 'a' can itself be an hstore. I want to split this string on commas in C.
When parsed, the output should look like this:
array(
array('name' => {'a','b','c'}, 'age' => {6, 7, 8 }) ,
array( ),
array( )...
)
It seems to be some nested JSON-like format. Did you consider using a JSON parsing library, e.g. Jansson?
If you want to parse by commas, strtok() is one possible option. See http://www.daniweb.com/software-development/c/threads/184836. Honestly, I can't see how parsing by comma will make this data any more intelligible, but it can be done regardless.
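One caveat with a plain comma split is that strtok() splits on every delimiter, including the commas inside nested braces or quoted values. A minimal sketch of splitting only on top-level commas follows, written in Python for brevity; the name split_top_level is made up for this example, it assumes quotes are not backslash-escaped, and the same character-by-character loop can be written in C.

```python
def split_top_level(text, sep=","):
    """Split text on sep, ignoring separators inside {...} or quoted strings.

    Assumption: quotes are not backslash-escaped inside the input.
    """
    parts, current = [], []
    depth = 0      # current brace nesting level
    quote = None   # the quote character we are inside, if any
    for ch in text:
        if quote:
            current.append(ch)
            if ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
            current.append(ch)
        elif ch == "{":
            depth += 1
            current.append(ch)
        elif ch == "}":
            depth -= 1
            current.append(ch)
        elif ch == sep and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts
```

Each returned piece can then be split again on "=>" or handed to the same function recursively for nested values.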
Use Ragel to generate a state machine and implementation in C: http://ragel.org/
Bit of a learning curve but well worth it. Being able to visualize the output in state machines helps with debugging.
I'm currently using it to produce yet another JSON library. As mentioned in the answer above, your format seems pretty similar to JSON, so feel free to learn from and copy bits of my code.
I have the ragel code split over 3 files:
https://github.com/matiu2/yajp/blob/master/json.rl
https://github.com/matiu2/yajp/blob/master/string.rl
https://github.com/matiu2/yajp/blob/master/number.rl
Which produces these three c++ files (You can produce C just as easily if needed):
https://github.com/matiu2/yajp/blob/master/json.hpp
https://github.com/matiu2/yajp/blob/master/string.hpp
https://github.com/matiu2/yajp/blob/master/number.hpp
Also, in case it is of interest, here are diagrams of the state machines that Ragel produces: https://github.com/matiu2/yajp/tree/master/images

Should I choose a hash, an object or an array to represent a data instance in Perl?

I was always wondering about this, but never really looked thoroughly into it.
The situation is like this: I have a relatively large set of data instances. Each instance has the same set of properties, e.g.:
# a child instance
name
age
height
weight
hair_color
favorite_color
list_of_hobbies
Usually I would represent a child as a hash and keep all children together in a hash of hashes (or an array of hashes).
What has always bothered me with this approach is that I don't really use the fact that all children (inner hashes) have the same structure. It seems like it might be wasteful memory-wise if the data is really large: if every inner hash is stored from scratch, the key names could take far more space than the data itself...
Also note that when I build such data structures I often nstore them to disk.
I wonder if creating a child object makes more sense in that perspective, even though I don't really need OO. Will it be more compact? Will it be faster to query?
Or perhaps representing each child as an array makes sense? e.g.:
my ($name, $age, $height, $weight, $hair_color, $favorite_color, $list_of_hobbies) = 0..6;
my $children_h = {
James => ["James", 12, 1.62, 73, "dark brown", "blue", ["playing football", "eating ice-cream"]],
Norah => [...],
Billy => [...]
};
print "James height is $children_h->{James}[$height]\n";
Recall my main concerns are space efficiency (RAM or disk when stored), time efficiency (i.e. loading a stored data-set then getting the value of property x from instance y) and ... convenience (code readability etc.).
Thanks!
Perl is smart enough to share keys among hashes. If you have 100,000 hashes with the same five keys, perl stores those five strings once, and references to them a hundred thousand times. Worrying about the space efficiency is not worth your time.
Hash-based objects are the most common kind and the easiest to work with, so you should use them unless you have a damn good reason not to.
You should save yourself a lot of trouble, start using Moose, and stop worrying about the internals of your objects (although, just between you and me, Moose objects are hash-based unless you use special extensions to make them otherwise -- and once again, you shouldn't do that without a really good reason.)
I guess it is mainly personal taste (except of course when other people have to work on your code too)
Anyway, I think you should look into Moose. It is definitely not the most time- or space-efficient option, but it is the most pleasant and most secure way of working.
(By secure, I mean that other people who use your object can't misuse it as easily.)
I personally prefer an object when I'm really representing something.
And when I work with objects in Perl, I prefer Moose.
Gr,
ldx
Unless absolute speed tuning is a requirement, I would make an object using Moose. For pure speed, use constant indexes and an array.
I like objects because they reduce the mental effort needed to work with big deep structures. For example, say you build a data structure to represent the various classrooms in a school. You'll have something like a list of kids, a teacher and a room number. If you have everything in one big structure, you have to know the structure's internals to access the hobbies of the children in a classroom. With objects, you can do something like:
my @all_hobbies = uniq map $_->all_hobbies,
    map $_->all_students, $school->all_classrooms;
I don't care about the internals. And I can concisely generate a unique list of all the kids' hobbies. All the complicated accesses are still happening, but I don't need to worry about what is happening. I can simply use the interface.
Here's a Moose version of your child class. I set up the hobbies attribute to use the array trait, so we get a bunch of methods simply for the asking.
package Child;
use Moose;
has [ 'name', 'hair_color', 'fav_color' ] => (
    is => 'ro',
    isa => 'Str',
    required => 1,
);
has [ 'age', 'height', 'weight' ] => (
    is => 'ro',
    isa => 'Num',
    required => 1,
);
has hobbies => (
    is => 'ro',
    isa => 'ArrayRef[Str]', # the Array trait needs an ArrayRef type
    default => sub {[]},
    traits => ['Array'],
    handles => {
        has_no_hobbies => 'is_empty',
        num_hobbies => 'count',
        has_hobbies => 'count',
        add_hobby => 'push',
        clear_hobbies => 'clear',
        all_hobbies => 'elements',
    },
);
# Good to do these, see moose best practices manual.
__PACKAGE__->meta->make_immutable;
no Moose;
Now to use the Child class:
use List::MoreUtils qw( zip );
# Bit of messing about to make array based child data into objects;
my @attributes = qw( name age height weight hair_color fav_color hobbies );
my @children = map Child->new( %$_ ),
    map { +{ zip @attributes, @$_ } } # one hash ref per child
    ["James", 12, 1.62, 73, "dark brown", "blue", ["playing football", "eating ice-cream"]],
    ["Norah", 13, 1.75, 81, "black", "red", ["computer programming"]],
    ["Billy", 11, 1.31, 63, "red", "green", ["reading", "drawing"]],
;
# Now index by name:
my %children_by_name = map { $_->name, $_ } @children;
# Here we get kids with hobbies and print them.
for my $c ( grep $_->has_hobbies, @children ) {
    my $n = $c->name;
    my $h = join ", ", $c->all_hobbies;
    print "$n likes $h\n";
}
I usually start with a hash and manipulate that, until I find instances where the data I really want is derived from the data that I have. And/or that I want some sort of peculiar--or even polymorphic--behavior.
At that point, I start creating packages to store class behavior, implementing methods as needed.
Another case is where I think this data would be useful in more than one instance. In that case, it's either rewrite all the selection cases everywhere where you think you'll need it or package the behavior in a class, so that you don't have to do too much copying or studying of the cases the next time you want to use that data.
Generally, if you don't need utter efficiency, hashes will be your best bet. In Perl an object is just a $something with a class name attached. The object can be a hash, an array, a scalar, a code reference, or even a glob reference inside. So objects can only possibly be a win in convenience, not efficiency.
If you want to give an array a shot, the typical way of making that somewhat maintainable is using constants for the field names:
use strict;
use warnings;
use constant {
NAME => 0,
AGE => 1,
HEIGHT => 2,
WEIGHT => 3,
HAIR_COLOR => 4,
FAVORITE_COLOR => 5,
LIST_OF_HOBBIES => 6,
};
my $struct = ["James", 12, 1.62, 73, "dark brown", "blue", ["playing football", "eating ice-cream"]];
# And then access it with the constants as index:
print $struct->[NAME], "\n";
$struct->[AGE]++; # happy birthday!
Alternatively, you could try whether using an array (object) as follows makes more sense:
package MyStruct;
use strict;
use warnings;
use Class::XSAccessor::Array
accessors => {
name => 0,
age => 1,
height => 2,
weight => 3,
hair_color => 4,
favorite_color => 5,
list_of_hobbies => 6,
};
sub new {
my $class = shift;
return bless([@_] => $class);
}
package main;
my $s = MyStruct->new;
$s->name("James");
$s->age(12);
$s->height(1.62);
$s->weight(73);
# ... you get the drill, but take care: The following is fine:
$s->list_of_hobbies(["foo", "bar"]);
# This can produce action-at-a-distance:
my $hobbies = ["foo", "bar"];
$s->list_of_hobbies($hobbies);
$hobbies->[1] = "baz"; # $s changed, too (due to reference)
Coming back to my original point: Usually, you want hashes or hash-based objects.
Whenever I try to decide between using a hash or an array to store data, I almost always use a hash. I can almost always find a useful way to index the values in the list for quick lookup. However, your question is more about hashes of array refs vs hashes of hash refs vs hashes of object refs.
In your example above, I would have used a hash of hash refs rather than a hash of array refs. The only time I would use an array is when there is an inherent order in the data that should be maintained, that way I can look things up in order. In this case, there isn't really any inherent order in the arrays you're storing (e.g., you arbitrarily chose height before weight), so it would be more appropriate (in my humble opinion) to store the data as a hash where the keys are descriptions of the data you're storing (name, height, weight, etc).
As to whether you should use a hash of hash refs or a hash of object refs, that can often be a matter of preference. There is some overhead associated with object-oriented Perl, so I try only to use it when I can get a large benefit in, say, usability. I usually only use objects/classes when there are actions inherently associated with the data (so I can write $my_obj->fix(); rather than fix($my_obj);). If you're just storing data, I would say stick with a hash.
There should not be a significant difference in RAM usage or in time to read from/write to disk. In terms of readability, I think you will get a huge benefit using hashes over arrays, since with the hashes the keys actually make sense, but the arrays are just indexed by numbers that have no real relationship with the data. This may require more disk space for storage if you're storing in plain text, but if that's a huge concern you can always compress the data!

Resources