splitting JSON string using regex - arrays

I want to split a JSON document and which has a pattern like [[[1,2],[3,4][5,6]]] using regex. The pairs represent x ad y. What I want to do it to take this string and produce a list with {"1,2", "3,4","5,6"}. Eventually I want to split the pairs. I was thinking I can make a list of {"1,2", “3,4","5,6"} and use the for loop to split the pairs. Is this approach correct to get the x and y separately?

JSON is not a regular language, but a Context free language, and as such, cannot be matched by a regular expresion. You need a full JSON parser like the ones referenced in the comments to your question.
... but, if you are going to have a fixed structure, like only three levels of square brakets only, and with the structure you posted in your question, then there's a regexp that can parse it (It would be a subset of the JSON grammar, not general enough to parse other JSON contents):
You'll have numbers: ([+-]?[0-9]+)
Then you'll have brackets and separators: \[\[\[, ,, \],\[ and \]\]\]
and finally, put all this together:
\[\[\[([+-]?[0-9]+),([+-]?[0-9]+)\],\[([+-]?[0-9]+),([+-]?[0-9]+)\],\[([+-]?[0-9]+),([+-]?[0-9]+)\]\]\]
and if you want to permit spaces between symbols, then you need:
\s*\[\s*\[\s*\[\s*([+-]?\d+)\s*,\s*([+-]?\d+)\s*\]\s*,\s*\[\s*([+-]?\d+)\s*,\s*([+-]?\d+)\s*\]\s*,\s*\[\s*([+-]?\d+)\s*,\s*([+-]?\d+)\s*\]\s*\]\s*\]\s*
This regexp will have six matching groups that will match the corresponding integers in the matching string as the folloging demo
Clarification
Regular languages, and regular grammars, and regular expressions form a class of languages with many practical properties, for example:
You can parse them efficiently in one pass with what is called a finite automaton
You can define the automaton to accept language sentences simply with a regular expression.
You can simply operate with regexps (or with automata) to make more complex acceptors (for the union of language sets, intersection, symmetric difference, concatenation, etc) to make acceptors for them.
You can simply say if one regular expression (the language it defines) is a subset, superset or none of the language of the original.
By contrast, it limits the power of languages that can be defined with it:
you cannot define languages that allow nesting of subexpressions (like the bracketing you allow in JSON expressions or the tag nesting allowed in XML documents)
you cannot define languages which collect context and use it in another place of the sentence (for example, sentences that identify a number and have to match that same number in another place of the sentence)
But, the meaning of my answer is that, if you bind the upper limit of nesting (let's say, for example, to three levels of parenthesis, like the example you posted) you can make your language regular and then parse it with the regular expression. It is not easy to do that, because this often leads to complex expressions (as you have seen in my answer) but not impossible, and you'll gain the possibility of being able to identify parts of the sentence as submatches of the regular subexpressions embedded in the global one.
If you want to allow nesting, you need to switch to context free languages, which are defined with context free grammars and are accepted with a more complex stack based automaton. Then, you loose the complete set of operations you had:
You'll never be able again to say if some language overlaps another (is included)
You'll never be abla again to construct a language from the union, intersection or difference of other context free languages.
But you will be able to match unbounded nested sentences. Normally, programming languages are defined with a context free grammar and a little more work for context checking (for example, to check if some identifier being used is actually defined in the declaration section or to match the starting and ending tag identifiers at matching levels in an XML document)
For context free languages, see this.
For regular languages, see this.
Second clarification
As in your question you didn't expressed you wanted to match real, decimal numbers, I have modified the demo to make it to allow fixed point numbers (not general floating point with exponential notation, you'll need to work it yourself, as an exercise). Just make some tests and modify the regexp to adapt it to your needs.
(well, if you want to see the solution, look at it)

Yeah i tried using the regex in my code but it is not working so I am trying a different approach now. I have an idea of how to approach it but it is not really working. First of let me be more clear on the question. What I am trying to so parse a JSON document. Like the image below. the file has a strings have [[[1,2],[3,4][5,6]]] pattern. What I am trying to get out of this is to have each pair as a list. So the list has an x-y pairs.
the string structure
My approach: first replace the “[[“ and “]]” at the begging and at the end, so I have a string with the same pattern through out. which gives [enter image description here][2]me a string “[1,2],[3,4][5,6]” This is my code but it is not working. How do I fix it? The other thing I though it could be an issue is, the strings are not the same length so. So how do I replace just the beginning and the ending?
my code
Then I can use a regex split method to get a list that has a form {“1,2” , “3,4”, “5,6”}. I am not really sure how to do this though.
Then I take the x, and the y, and add them and add those to the list. So I get of a list pair x-y pair. I will appreciate if you show me how to do this.
This is the approach I am working on but if there is a better way of doing it I will be glad to see it. [enter image description here][4]

Related

String formatting vs String Interpolation

Please help me understand the difference between the two concepts of String Formatting and String Interpolation.
From Stackoverflow tag info for string-interpolation:
String interpolation is the replacement of defined character sequences in a string by given values. This representation may be considered more intuitive for formatting and defining content than the composition of multiple strings and values using concatenation operators. String interpolation is typically implemented as a language feature in many programming languages including PHP, Haxe, Perl, Ruby, Python, C# (as of 6.0) and others.
From Stackoverflow tag info for string-formatting:
Commonly refers to a number of methods to display an arbitrary number of varied data types into a string.
To me, they appear to be similar, but I hope there's some difference.
Also, kindly clarify whether these are some technology-specific concepts, or, technology-agnostic concepts. (I was reading about these concepts in the context of Python. But quick google and bing searches brought up related articles in other programming languages such as Java, C#, etc.)
String formatting is a rather general term of generating string content from data using some parameters. For example, creating date strings from date objects for a specific date format, number strings from numbers with a particular number of decimal digits or a number of leading spaces and zeroes etc. It can also involve templates, like in sprintf function present in C or many other languages, or e.g. str.format in Python. For example, in Ruby:
sprintf("%06.2f", 1.2) # float, length 6, 2 decimals, leading zeroes if needed
# => "001.20"
String interpolation is a much more restricted concept: evaluating expressions embedded inside string literals and replacing them with the result of such evaluation. For example, in Ruby:
"Two plus two is #{2+2}"
# => "Two plus two is 4"
Some languages can perform formatting inside interpolation. For example, in Python:
f"Six divided by five is {6/5:06.2f}"
# => "Six divided by five is 001.20"
The concepts are language-agnostic. However, it is not guaranteed that all programming languages will have a built-in capability for one or both of them. For example, C has no string interpolation, but it has string formatting, using the printf family of functions; and until somewhat recently, JavaScript did not have either, and any formatting was done in a low-tech way, using concatenation and substringing.
String interpolation is one way of doing string formatting. Another way of doing string formatting is called string concatenation. These are technology-agnostic concepts.
In other words, "string formatting" is a goal, and "string interpolation" is a strategy for reaching that goal.

Perl structure flow to C

I've started working on a program which is in Perl and has to be transformed into C.
There are a lot of subroutines which have structure member accessing which is unfamiliar to me, because I have little to no knowledge about Perl syntax and structure flow.
Example:
$ref->{$struct2[$value]->{field1}}->{struct_insideStruct2}->{$ref2->{field}}
$ref is a third structure
$ref2 is a local copy of a parameter which is of type struct1
My question is: How do you create a line like this in C?
Do I need to create nested multiple structures?
I need to understand how multiple access operators in Perl works and if I can create something similiar in C, thanks in advance.
I recommend to not try to directly translate between languages, as this likely results in a clumsy and unnatural code. That would certainly be the case here, as commented further down. The best I can do for this quest is to explain what the expression does
$ref -> { $struct2[$value]->{field1} }
-> { struct_insideStruct2 }
-> { $ref2->{field} }
The $ref is a reference to a hash (associative array); it's OK to think of it as a pointer to a hash. One can tell because the -> ("arrow") operator dereferences, and the {...} on its right means that on its left there must be a hash reference; this returns a value that it points to.
In this case, the key with which it is dereferenced (the index into the associative array) involves an element of the array #struct2 at index $value; that element is another hash reference, being dereferenced (indexed into) with a key field1 (string literal†).
What this returns is another hash reference, which is then indexed into (dereferenced) with the key struct_insideStruct2 (string), and this again returns a hash reference.
That last one is indexed with a key which itself is produced by dereferencing another hash reference, $ref2, with a key field (string).
This is an example of a Perl complex data structure. How do you like it? I don't, not very much. Even in Perl, ideally I'd like to see this rewritten as a class, as it goes too deep and wide and so it packs too much complexity without any supporting structure which a class can provide.
If you still wish to indeed and really do that kinda thing in C, you can. May want to find a good hash implementation (or use structs and nest them carefully), and probably to dust off your function pointer syntax and such. But I would recommend to not get into all that.
Instead, once you understand the deep-nested data structure explained above, and the data it represents, find a way to implement what it means and does in your code in a native C way. We always want to use logic, techniques, and idioms native to the language at hand.
Along with linked documentation also see the short and sweet perlintro. The full reference for Perl's references is perlref.
† Normally such "barewords" need be under quotes, 'string' (or using "", or q() or qq() operators ...). But if that is a sole thing between {} then the quoting may be omitted.

Array parameters in FMI 2.0

In FMI 2.0, array parameters are serialized to scalar variables.
Importing tools can display them as arrays, but their size is fixed and their handling is inefficient.
Better array support is currently in development by a working group of the FMI project, but I would like to know about workarounds how to handle array parameters in the meantime.
Ideas are to
hard code them (disadvantage: the are no paramters any more ...)
put them in a CSV file in the resources folder and read them at the start of the simulation (disadvantage: no parameter mask support, complicated)
put them in a string parameter and parse it at simulation start (disadvantage: limited length of strings, complicated)
Are there other ideas / workarounds? Thanks in advance.
Combinations of the ideas outlined in your question are also possible.
Hard code with selector parameter
Here the idea is to hard code several variants of your array and allow the user to select one with a parameter.
I did this in a recent project where a user needed to choose between different spatially resolved initial conditions (e.g. temperature profiles). We used a model to generate more than 100 different sets of spatially resolved initial conditions (each representing a different "history" of the modeled object), hard coded them as FORTRAN arrays (the inner core of the FMU was in FORTRAN), and used a single integer parameter to select which profile he wanted to use.
It worked very well and the user has no way of breaking it.
Shorten the array and interpolate
If the data in your array is smooth, you might be able to dramatically reduce the number of values you actually need to pass to your simulation - which would make serialization into scalar parameters less painful.
Within the FMU, interpolate to get the resolution you need.
String parameter to select csv file
You can use a string parameter to provide the path to a user-provided csv-file. I would not recommend this, because the user will most likely break it.

Linking filenames or labels to numeric index

In a C99+SDL game, I have an array that contains sound effects (SDL_mixer chunk data and some extra flags and filename string) and is referenced by index such as "sounds[2].data".
I'd like to be able to call sounds by filename, but I don't want to strcmp all the array until a match is found. This way as I add more sounds, or change the order, or allow for player-defined sound mods, they can still be called with a common identifier (such as "SHOT01" or "EXPL04").
What would be the fastest approach for this? I heard about hashing, which would result in something similar to lua's string indexes (such as table["field"]) but I don't know anything about the topic, and seems fairly complicated.
Just in case it matters, I plan to have filenames or labels be from 6 to 8 all caps filenames (such as "SHOT01.wav").
So to summarize, where can I learn about hashing short strings like that, or what would be the fastest way to keep track of something like sound effects so they can be called using arbitrary labels or identifiers?
I think in your case you can probably just keep all the sounds in a sorted data structure and use a fast search algorithm to find matches. Something like a binary search is very simple implement and it gives good performance.
However, if you are interested in hash tables and hashing, the basics of it all are pretty simple. There is no place like Wikipedia to get the basics down and you can then tailor your searches better on Google to find more in depth articles.
The basics are you start out with a fixed size array and store everything in there. To figure out where to store something you take the key (in your case the sound name) and you perform some operation on it such that it gives you an exact location where the value can be found. So the simplest case for string hashing is just adding up all the letters in the string as integer values then take the value and use modulus to give you an index in your array.
position = SUM(string letters) % [array size]
Of course naturally multiple strings will have same sum and thus give you the same position. This is called a collision, and collisions can be handled in many ways. The simplest way is to have an array of lists rather than array of values, and simply append to the list every there there is a collision. When searching for a value, simply iterate the lists and find the value you need.
Ideally a good hashing algorithm will have few collisions and quick hashing algorithm thus providing huge performance boost.
I hope this helps :)
You are right, when it comes to mapping objects with a set of string keys, hash tables are often the way to go.
I think this article on wikipedia is a good starting point to understand hash table mechanism: http://en.wikipedia.org/wiki/Hash_table

What is the actual definition of an array? [duplicate]

This question already has answers here:
Closed 13 years ago.
Possible Duplicate:
Arrays, What’s the point?
I tried to ask this question before in What is the difference between an array and a list? but my question was closed before reaching a conclusive answer (more about that).
I'm trying to understand what is really meant by the word "array" in computer science. I am trying to reach an answer not have a discussion as per the spirit of this website. What I'm asking is language agnostic but you may draw on your knowledge of what arrays are/do in various languages that you've used.
Ways of thinking about this question:
Imagine you're designing a new programming language and you decide to implement arrays in it; what does that mean they do? What will the properties and capabilities of those things be. If it depends on the type of language, how so?
What makes an array an array?
When is an array not an array? When it is, for example, a list, vector, table, map, or collection?
It's possible there isn't one precise definition of what an array is, if that is the case then are there any standard or near-standard assumptions or what an array is? Are there any common areas at least? Maybe there are several definitions, if that is the case I'm looking for the most precision in each of them.
Language examples:
(Correct me if I'm wrong on any of these).
C arrays are contiguous blocks of memory of a single type that can be traversed using pointer arithmetic or accessed at a specific offset point. They have a fixed size.
Arrays in JavaScript, Ruby, and PHP, have a variable size and can store an object/scalar of any type they can also grow or have elements removed from them.
PHP arrays come in two types: numeric and associative. Associative arrays have elements that are stored and retrieved with string keys. Numeric arrays have elements that are stored and retrieved with integers. Interestingly if you have: $eg = array('a', 'b', 'c') and you unset($eg[1]) you still retrieve 'c' with $eg[2], only now $eg[1] is undefined. (You can call array_values() to re-index the array). You can also mix string and integer keys.
At this stage of sort of suspecting that C arrays are the only true array here and that strictly-speaking for an array to be an array it has to have all the characteristics I mention in that first bullet point. If that's the case then — again these are suspicions that I'm looking to have confirmed or rejected — arrays in JS and Ruby are actually vectors, and PHP arrays are probably tables of some kind.
Final note: I've made this community wiki so if answers need to be edited a few times in lieu of comments, go ahead and do that. Consensus is in order here.
It is, or should be, all about abstraction
There is actually a good question hidden in there, a really good one, and it brings up a language pet peeve I have had for a long time.
And it's getting worse, not better.
OK: there is something lowly and widely disrespected Fortran got right that my favorite languages like Ruby still get wrong: they use different syntax for function calls, arrays, and attributes. Exactly how abstract is that? In fortran function(1) has the same syntax as array(1), so you can change one to the other without altering the program. (I know, not for assignments, and in the case of Fortran it was probably an accident of goofy punch card character sets and not anything deliberate.)
The point is, I'm really not sure that x.y, x[y], and x(y) should have different syntax. What is the benefit of attaching a particular abstraction to a specific syntax? To make more jobs for IDE programmers working on refactoring transformations?
Having said all that, it's easy to define array. In its first normal form, it's a contiguous sequence of elements in memory accessed via a numeric offset and using a language-specific syntax. In higher normal forms it is an attribute of an object that responds to a typically-numeric message.
array |əˈrā|
noun
1 an impressive display or range of a particular type of thing : there is a vast array of literature on the topic | a bewildering array of choices.
2 an ordered arrangement, in particular
an arrangement of troops.
Mathematics: an arrangement of quantities or symbols in rows and columns; a matrix.
Computing: an ordered set of related elements.
Law: a list of jurors empaneled.
3 poetic/literary elaborate or beautiful clothing : he was clothed in fine array.
verb
[ trans. ] (usu. be arrayed) display or arrange (things) in a particular way : arrayed across the table was a buffet | the forces arrayed against him.
[ trans. ] (usu. be arrayed in) dress someone in (the clothes specified) : they were arrayed in Hungarian national dress.
[ trans. ] Law empanel (a jury).
ORIGIN Middle English (in the senses [preparedness] and [place in readiness] ): from Old French arei (noun), areer (verb), based on Latin ad- ‘toward’ + a Germanic base meaning ‘prepare.’
From FOLDOC:
array
1. <programming> A collection of identically typed data items
distinguished by their indices (or "subscripts"). The number
of dimensions an array can have depends on the language but is
usually unlimited.
An array is a kind of aggregate data type. A single
ordinary variable (a "scalar") could be considered as a
zero-dimensional array. A one-dimensional array is also known
as a "vector".
A reference to an array element is written something like
A[i,j,k] where A is the array name and i, j and k are the
indices. The C language is peculiar in that each index is
written in separate brackets, e.g. A[i][j][k]. This expresses
the fact that, in C, an N-dimensional array is actually a
vector, each of whose elements is an N-1 dimensional array.
Elements of an array are usually stored contiguously.
Languages differ as to whether the leftmost or rightmost index
varies most rapidly, i.e. whether each row is stored
contiguously or each column (for a 2D array).
Arrays are appropriate for storing data which must be accessed
in an unpredictable order, in contrast to lists which are
best when accessed sequentially. Array indices are
integers, usually natural numbers, whereas the elements of
an associative array are identified by strings.
2. <architecture> A processor array, not to be confused with
an array processor.
Also note that in some languages, when they say "array" they actually mean "associative array":
associative array
<programming> (Or "hash", "map", "dictionary") An array
where the indices are not just integers but may be
arbitrary strings.
awk and its descendants (e.g. Perl) have associative
arrays which are implemented using hash coding for faster
look-up.
If you ignore how programming languages model arrays and lists, and ignore the implementation details (and consequent performance characteristics) of the abstractions, then the concepts of array and list are indistinguishable.
If you introduce implementation details (still independent of programming language) you can compare data structures like linked lists, array lists, regular arrays, sparse arrays and so on. But then you are not longer comparing arrays and lists per se.
The way I see it, you can only talk about a distinction between arrays and lists in the context of a programming language. And of course you are then talking about arrays and lists as supported by that language. You cannot generalize to any other language.
In short, I think this question is based on a false premise, and has no useful answer.
EDIT: in response to Ollie's comments:
I'm not saying that it is not useful to use the words "array" and "list". What I'm saying is the words do not and cannot have precise and distinct definitions ... except in the context of a specific programming language. While you would like the two words to have distinct meaning, it is a fact that they don't. Just take a look at the way the words are actually used. Furthermore, trying to impose a new set of definitions on the world is doomed to fail.
My point about implementation is that when we compare and contrast the different implementations of arrays and lists, we are doing just that. I'm not saying that it is not a useful thing to do. What I am saying is that when we compare and contrast the various implementations we should not get all hung up about whether we call them arrays or lists or whatever. Rather we should use terms that we can agree on ... or not use terms at all.
To me, "array" means "ordered collection of things that is probably efficiently indexable" and "list" means "ordered collection of things that may be efficiently indexable". But there are examples of both arrays and lists that go against the trend; e.g. PHP arrays on the one hand, and Java ArrayLists on the other hand. So if I want to be precise ... in a language-agnostic context, I have to talk about "C-like arrays" or "linked lists" or some other terminology that makes it clear what data structure I really mean. The terms "array" and "list" are of no use if I want to be clear.
An array is an ordered collection of data items indexed by integer. It is not possible to be certain of anything more. Vote for this answer you believe this is the only reasonable outcome of this question.
An array:
is a finite collection of elements
the elements are ordered, and this is their only structure
elements of the same type
supported efficient random access
has no expectation of efficient insertions
may or may not support append
(1) differentiates arrays from things like iterators or generators. (2) differentiates arrays from sets. (3) differentiates arrays from things like tuples where you get an int and a string. (4) differentiates arrays from other types of lists. Maybe it's not always true, but a programmer's expectation is that random access is constant time. (5) and (6) are just there to deny additional requirements.
I would argue that a real array stores values in contiguous memory. Anything else is only called an array because it can be used like array, but they aren't really ("arrays" in PHP are definately not actual arrays (non-associative)). Vectors and such are extensions of arrays, adding additional functionality.
an array is a container, and the objects it holds have no any relationships except the order; the objects are stored in a continuous space abstractly (high level, of course low level may continuous too), so you could access them by slot[x,y,z...].
for example, per array[2,3,5,7,1], you could get 5 using slot[2] (slot[3] in some languages).
for a list, a container too, each object (well, each object-holder exactly such as slot or node) it holds has indicators which "point" to other object(s) and this is the main relationship; in general both high or low level the space is not continuous, but may be continuous; so accessing by slot[x,y,z...] is not recommended.
for example, per |-2-3-5-7-1-|, you need to do a travel from first object to 3rd one to get 5.

Resources