Arrays as separate type - arrays

Some scripting languages, such as Python and JavaScript, have arrays (aka lists) as a separate datatype from hash tables (aka dictionaries, maps, objects). In other scripting languages, such as PHP and Lua, an array is merely a hash table whose keys happen to be integers. (The implementation may be optimized for that special case, as is done in the current version of Lua, but that's transparent to the language semantics.)
Which is the better approach?
The unified approach is more elegant in the sense of having one thing rather than two, though the gain isn't quite as large as it might seem at first glance, since you still need to have the notion of iterating over the numeric keys specifically.
The unified approach is arguably more flexible. You can start off with nested arrays, find you need to annotate them with other stuff, and just add the annotations, without having to rework the data structures to interleave the arrays with hash tables.
In terms of efficiency, it seems to be pretty much a wash (provided the implementation optimizes for the special case, as Lua does).
What am I missing? Does the separate approach have any advantages?

Having separate types means that you can make guarantees about performance, and you know that you will have "normal" semantics for things like array slicing. If you have a unified system, you need to work out what all operations, such as slicing, mean on sparse arrays.
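To make the slicing point concrete, here is a small Python sketch. The dict with integer keys only stands in for a "unified" array, and the two helper functions (`slice_by_key`, `slice_by_position`) are hypothetical names I made up; they are two equally plausible readings of "slice from 1 to 3" that give different answers on a sparse array.

```python
# A dense Python list: slicing has one obvious meaning (positions 1 and 2).
dense = ["a", "b", "c", "d"]
print(dense[1:3])

# A stand-in for a "unified" array: a dict with integer keys, some missing.
sparse = {0: "a", 5: "b", 6: "c", 9: "d"}

def slice_by_key(table, start, stop):
    """Reading 1: keep the entries whose key lies in [start, stop)."""
    return {k: v for k, v in table.items() if start <= k < stop}

def slice_by_position(table, start, stop):
    """Reading 2: sort the keys, then slice by position as a list would."""
    keys = sorted(table)[start:stop]
    return {k: table[k] for k in keys}

print(slice_by_key(sparse, 1, 3))       # no keys fall in that range
print(slice_by_position(sparse, 1, 3))  # the 2nd and 3rd entries
```

A language with a unified type has to pick one of these readings (or a third) and document it; with a separate list type the question never comes up.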

An array is more than a table intentionally restricted to consecutive integer keys. It's a sequence, a collection of n items (not key-value pairs, just the values) with a well-defined order. This is, in my opinion, a data structure that has no place for additional data in the form of non-integer keys. It's conceptually simpler.
Also, implementing the two separately may be simpler, especially once you take into account the extra optimization needed to make arrays perform well in a unified design (an optimization apparently obscure enough that a performance-oriented language like Lua didn't implement it for many years).
Also, the flexibility point is arguable. If the need for more complex annotation arises, it's quite possible that you'll soon also need polymorphism, in which case you should just switch to objects with an array among other attributes.
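As a rough illustration of "an object with an array among other attributes", here is a Python sketch. The `Track` class and its fields are invented for the example; the only point is that the annotations live next to the sequence rather than inside it.

```python
from dataclasses import dataclass, field

@dataclass
class Track:
    """Hypothetical record: an ordered sequence plus named annotations."""
    points: list = field(default_factory=list)  # the array proper
    label: str = ""                             # annotation
    closed: bool = False                        # annotation

track = Track(points=[(0, 0), (1, 2), (3, 1)], label="route A")
track.points.append((4, 4))  # sequence operations stay on the list itself
print(len(track.points), track.label)
```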

As mentioned, there are speed and complexity issues involved in having two separate types. However, one of the things that I find important about having two types is that it expresses the intent of the datastore.
A list is an ordered list of items. The items and their order ARE the data; the keys only exist in a conceptual manner to describe the order of the items.
A map is a mapping of keys to values. The keys and the values they represent ARE the data.
The point to note is that the keys are part of the data for a map; they're not for a list... conceptually. When you choose one data type over the other, you're specifying your intent.
I'll add as an aside that every language that shares a data type for lists and maps has certain... annoyances that come along with it. There are always certain concessions that need to be made to allow the combination, and they can bite you sometimes. It's generally not a big deal, but it can be annoying.

Related

Is the list append feature a feature of the array data structure?

The array data structure has the following features:
Here is the list of most important array features you must know (i.e. be able to program)
copying and cloning
insertion and deletion
searching and sorting
I am wondering, for the list data type, which can be used for the array data structure, is the append method considered a feature of the array data structure, per the insertion and deletion bullet point?
I would argue that it isn't. I would argue that it is entirely the feature of a list to be able to programmatically append, remove, insertAt, etc. Arrays do not require any functionality other than being a collection of similar types, and in some cases merely a collection of things.
For instance, as referenced in this C article we can see that an array is a collection of similar types. These arrays have no given functionality, and in fact there is no standard, given, way to add or remove to/from them.
Functionally speaking, appending an element to a list is the same as inserting it at the end.
That being said: You seem to have got the concepts of arrays and lists backwards:
A list is typically defined as any kind of data structure which can store an ordered group of things.
An array is something more specific. It's typically defined as a data structure which is made up of a fixed number of objects in memory, stored one after another. Java's array type (e.g. int[]) works this way, for instance.
The web page you are referring to is not helping matters. It's very confusingly written; I'd recommend that you look for another, better reference.
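The claim that appending is just inserting at the end is easy to check. A minimal sketch, assuming Python's built-in list (the question doesn't name a language, so that part is my assumption):

```python
a = [1, 2, 3]
b = [1, 2, 3]

a.append(4)          # append to the end
b.insert(len(b), 4)  # insert at the position just past the last element

print(a == b)
```

Both calls are operations of Python's growable list type. A fixed-size array in the sense described above has no room to grow, so neither operation applies to it without allocating a new array.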

Is it bad design to use arrays within a database?

So I'm making a database for a personal project just to get more than my feet wet with PostgreSQL and certain languages and applications that can use a PostgreSQL database.
I've come to the realization that using an array isn't necessarily even compliant (Arrays are not atomic, right?) with 1NF. So my question is: Is there a lack of efficiency or data safety this way? Should I learn early to not use arrays?
Short answer to the title: No
A bit longer answer:
You should learn to use arrays when appropriate. Arrays are not bad design in themselves; they are as atomic as a character varying field (an array of characters, no?), and they exist to make our lives easier and our databases faster and lighter. There are issues concerning portability (most database systems don't support arrays, or do so in a different way than Postgres).
Example:
You have a blog with posts and tags, and each post may have 0 or more tags. The first thing that comes to mind is to make a different table with two columns postid and tagid and assign the tags in that table.
If we need to search through posts with tagid, then the extra table is necessary (with appropriate indexes of course).
But if we only want the tag information to be shown as the post's extra info, then we can easily add an integer array column in the table of posts and extract the information from there. This can still be done with the extra table, but using an array reduces the size of the database (no needed extra tables or extra rows) and simplifies the query by letting us execute our select queries with joining one less table and seems easier to understand by human eye (the last part is in the eye of the beholder, but I think I speak for a majority here). If our tags are preloaded, then not even one join is necessary.
The example may be poor but it's the first that came to mind.
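For reference, here is a sketch of the extra-table design for the search-by-tag case. It uses SQLite through Python's `sqlite3` module purely so that it is self-contained and runnable; it is not PostgreSQL, and SQLite has no array column type, so the array variant is not shown. Only `postid` and `tagid` come from the example above; the table names, the index name and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (postid INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE tags  (tagid  INTEGER PRIMARY KEY, name  TEXT);
    -- The extra table: one row per (post, tag) pair.
    CREATE TABLE post_tags (
        postid INTEGER REFERENCES posts(postid),
        tagid  INTEGER REFERENCES tags(tagid),
        PRIMARY KEY (postid, tagid)
    );
    CREATE INDEX post_tags_by_tag ON post_tags (tagid);

    INSERT INTO posts VALUES (1, 'Hello'), (2, 'Arrays'), (3, 'Untagged');
    INSERT INTO tags  VALUES (10, 'sql'), (20, 'design');
    INSERT INTO post_tags VALUES (1, 10), (2, 10), (2, 20);
""")

def posts_with_tag(tagid):
    """Search posts by tag: the case where the extra table pays off."""
    rows = conn.execute(
        "SELECT p.title FROM posts p "
        "JOIN post_tags pt ON pt.postid = p.postid "
        "WHERE pt.tagid = ? ORDER BY p.postid",
        (tagid,),
    )
    return [title for (title,) in rows]

print(posts_with_tag(10))
print(posts_with_tag(20))
```

With an array column the join disappears when you only display a post's tags, which is the trade-off described above.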
Conclusion:
Arrays are not necessary. They can be harmful if you use them wrong. You can live without them and have a great, fast and optimized database. If portability matters (e.g. rewriting your system to work with other databases), then you must not use arrays.
If you are sure you'll stick with Postgres, then you can safely use arrays where you find appropriate. They exist for a reason and are neither bad design nor non-compliant. When you use them in the right places, they can help a little with simplicity of database structures and your code, as well as space and speed optimization. That is all.
Whether an array is atomic depends on what you're interested in. If you generally want the whole array then it's atomic. If you are more interested in the individual elements then it is being used as structure. A text field is basically a list of characters. However, we're usually interested in the whole string.
Now - from a practical viewpoint, many frameworks and ORMs don't automatically unpack PostgreSQL's array types. Also, if you want to port the database to e.g. MySQL then you'll
Likewise foreign-key constraints can't be added to an array (EDIT: this is still true as of 2021).
Short answer: Yes, it is bad design. Using arrays will guarantee that your design is not 1NF, because to be 1NF there must be no repeating values. Proper design is unequivocal: make another table for the array's values and join when you need them all.
Arrays may be the right tool for the job in certain limited circumstances, but I would still try hard to avoid them. They're a feature of last resort.
The biggest problem with arrays is that they're a crutch. You know them already and you want to use them because they're familiar to you. But they do not work quite like you expect, and they will only allow you to postpone a true understanding of SQL and relational databases. You're much better off waiting until you're forced to use them than learning them and looking for opportunities to rely on them.
I believe arrays are a useful and appropriate design in cases where you're working with array-like data and want to use the power of SQL for efficient queries and analysis. I've begun using PostgreSQL arrays regularly for data science purposes, as well as in PostGIS for edge cases, as examples.
In addition to the well-explained challenges mentioned above, I'm finding the biggest problem in getting third-party client apps to be able to handle the array fields in ways I'd expect. In Tableau and QGIS, for example, arrays are treated as strings, so array operations are unavailable.
Arrays are a first class data type in the SQL standard, and generally allow for a simpler schema and more efficient queries. Arrays, in general, are a great data type. If your implementation is self-contained, and doesn't need to rely on third-party tools without an API or some other middleware that can deal with incompatibilities, then use the array field.
IF, however, you interface with third-party software that directly queries the DB, and arrays are used to produce queries, then I'd avoid them in favor of simpler lookup tables and other traditional relational approaches.

How to automatically translate pure code into code that uses mutable arrays for efficiency?

This is a Haskell question, but I'd also be interested in answers about other languages. Is there a way to automatically translate purely functional code, written to process either lists or immutable arrays without doing any destructive updates, into code that uses mutable arrays for efficiency?
In Haskell the generated code would either run in the ST monad (in which case it would all be wrapped in runST or runSTArray) or in the IO monad, I assume.
I'm most interested in general solutions which work for any element type.
I thought I've seen this before, but I can't remember where. If it doesn't already exist, I'd be interested in creating it.
Implementing a functional language using destructive updates is a memory management optimization. If an old value will no longer be used, it is safe to reuse the old memory to hold a new value. Detecting that a value will not be used anymore is a difficult problem, which is why reuse is still managed manually.
Linear type inference and uniqueness type inference discover some useful information. These analyses discover variables that hold the only reference to some object. After the last use of that variable, either the object is transferred somewhere else, or the object can be reused to hold a new value.
Several languages, including Sisal and SAC, attempt to reuse old array memory to hold new arrays. In SAC, programs are first converted to use explicit memory management (specifically, reference counting) and then the memory management code is optimized.
You say "either lists or immutable arrays", but those are actually two very different things, and in many cases algorithms naturally suited to lists would be no faster (and possibly slower) when used with mutable arrays.
For instance, consider an algorithm consisting of three parts: Constructing a list from some input, transforming the list by combining adjacent elements, then filtering the list by some criterion. A naive approach of fully generating a new list at each step would indeed be inefficient; a mutable array updated in place at each step would be an improvement. But better still is to observe that only a limited number of elements are needed simultaneously and that the linear nature of the algorithm matches the linear structure of a list, which means that all three steps can be merged together and the intermediate lists eliminated entirely. If the initial input used to construct the list and the filtered result are significantly smaller than the intermediate list, you'll save a lot of overhead by avoiding extra allocation, instead of filling a mutable array with elements that are just going to be filtered out later anyway.
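The three-step pipeline described above can be sketched with Python generators, which play roughly the role that lazy lists play in Haskell here: each element flows through all three steps before the next one is produced, so no intermediate list is ever built. The concrete steps (counting up, summing neighbours, keeping multiples of 5) and the function names are invented for illustration.

```python
def numbers(n):
    """Step 1: construct the sequence lazily from the input."""
    for i in range(n):
        yield i

def adjacent_sums(items):
    """Step 2: combine each element with its successor."""
    it = iter(items)
    try:
        prev = next(it)
    except StopIteration:
        return  # empty input: nothing to combine
    for cur in it:
        yield prev + cur
        prev = cur

def multiples_of(k, items):
    """Step 3: keep only the elements that pass the filter."""
    return (x for x in items if x % k == 0)

# Only the final, filtered result is materialised as a list.
result = list(multiples_of(5, adjacent_sums(numbers(12))))
print(result)
```

At any moment only a couple of elements are alive, which is the "limited number of elements needed simultaneously" observation in code form.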
Mutable arrays are most likely to be useful when making a lot of piecemeal, random-access updates to an array, with no obvious linear structure. When using Haskell's immutable arrays, in many cases this can be expressed using the accum function in Data.Array, which I believe is already implemented using ST.
In short, a lot of the simple cases either have better optimizations available or are already handled.
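To show the shape of the accum idea outside Haskell, here is a Python sketch of a function with the same flavour: pure from the caller's point of view, destructive updates on a private copy inside. It only mirrors the interface of `accum` in spirit and is not how `Data.Array` is implemented.

```python
def accum(f, arr, updates):
    """Return a new tuple with f applied at each (index, value) update.

    The caller sees a pure function: `arr` is never modified. Inside,
    a private mutable copy is updated in place, one slot per update.
    """
    work = list(arr)             # private mutable copy
    for i, x in updates:
        work[i] = f(work[i], x)  # destructive update, invisible outside
    return tuple(work)

counts = (0, 0, 0, 0)
hits = [(1, 1), (3, 1), (1, 1)]
print(accum(lambda old, new: old + new, counts, hits))
print(counts)  # the input is unchanged
```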
Edit: I notice this answer was downvoted without comment and I'm curious why. Feedback is appreciated, I'd like to know if I said something dumb.

Array/list vs Dictionary (why we have them in the first place)

To me they are both the same, and that is why I am wondering why we have the dictionary data structure when we can do everything with arrays/lists. What is so fancy about dictionaries?
Arraylists just store a set of objects (that can be accessed randomly). Dictionaries store pairs of objects. This makes array/lists more suitable when you have a group of objects in a set (prime numbers, colors, students, etc.). Dictionaries are better suited for showing relationships between a pair of objects.
Why do we need dictionaries? Let's say you have some data you need to convert from one form to another, like Roman numeral characters to their values. Without dictionaries, you'd have to hack this association together with two arrays, where you first find the position of the key in the first list and then access that position in the second. This is terribly error-prone and inefficient, and dictionaries provide a more direct approach.
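Here is the Roman numeral case both ways in Python, as a small sketch (the function and variable names are just for illustration):

```python
# Two parallel lists: the association lives only in matching positions.
symbols = ["I", "V", "X", "L", "C", "D", "M"]
values = [1, 5, 10, 50, 100, 500, 1000]

def value_from_lists(symbol):
    position = symbols.index(symbol)  # linear search through the keys
    return values[position]           # wrong if the two lists drift apart

# One dictionary: each key is stored together with its value.
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def value_from_dict(symbol):
    return ROMAN[symbol]

print(value_from_lists("C"), value_from_dict("C"))
```

Both give the same answer, but the list version has to scan for the key and silently breaks if someone edits one list and not the other.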
Arrays provide random access of a sequential set of data. Dictionaries (or associative arrays) provide a map from a set of keys to a set of values.
I believe you are comparing apples and oranges - they serve two completely different purposes and are both useful data structures.
Most of the time a dictionary-like type is built as a hash table - this type is very useful as it provides very fast lookups on average (depending on the quality of the hashing algorithm).
The confusion lies in the different naming conventions in different languages. In my understanding, what is called a "Dictionary" in Python is the same as "Associative Array" in PHP.
To build on what Andrew said, in some languages such as PHP and JavaScript, the array can also function as a dictionary (known as an associative array). It also comes down to loose vs. strict typing in the language.
You could in theory do everything with dictionaries.
But do not forget that at some point the program runs on a real machine which has limitations due to the hardware: processor, memory, nature of the storage (disc/SSD) ...
Behind the scenes, dictionaries are often implemented with a hash table.
In some languages you can choose between many different types of list/array and hash table, as there are many different implementations of those structures, each with advantages and disadvantages.
Use an array when you work with a sequence of elements or need to randomly access an element at a given index (0, 1, 2, ...).
Use a dictionary when you have data in key/value format and need fast retrieval by key.
If you want to understand more about these, I recommend you learn more about data structures, as they are fundamental.
NOTE: depending on the language, the names of those structures may vary, which is a source of confusion.

When should I use Scala's Array instead of one of the other collections?

This is more a question of style and preference but here goes: when should I use scala.Array? I use List all the time and occasionally run into Seq, Map and the like, but I've never used nor seen Array in the wild. Is it just there for Java compatibility? Am I missing a common use-case?
First of all, let's make a disclaimer here. Scala 2.7's Array tries to be a Java Array and a Scala Collection at the same time. It mostly succeeds, but fails at both in some corner cases. Unfortunately, these corner cases can happen to good people with normal code, so Scala 2.8 is departing from that.
On Scala 2.8, there's Array, which is Java Array. That means it is a contiguous memory space, which stores either references or primitives (and, therefore, may have different element sizes), and can be randomly accessed pretty fast. It also has lousy methods, a horrible toString implementation, and performs badly when using generics and primitives at the same time (e.g. def f[T](a: Array[T]) = ...; f(Array(1,2,3))).
And, then, there is GenericArray, which is a Scala Collection backed by an Array. It always stores boxed primitives, so it doesn't have the performance problems when mixing primitives and generics but, on the other hand, it doesn't have the performance gains of a purely primitive (non-generic) array.
So, when to use what? An Array has the following characteristics:
O(1) random read and write
O(n) append/prepend/insert/delete
mutable
If you don't need generics, or your generics can be stated as [T <: AnyRef], thus excluding primitives, which are AnyVal, and those characteristics are optimal for your code, then go for it.
If you do need generics, including primitives, and those characteristics are optimal for your code, use GenericArray on Scala 2.8. Also, if you want a true Collection, with all of its methods, you may want to use it as well, instead of depending on implicit conversions.
If you want immutability or if you need good performance for append, prepend, insert or delete, look for some other collection.
An array is appropriate when you have a number of items of the same (or compatible) class, and you know in advance the exact count of those items, or a reasonable upper bound, and you're interested in fast random access and perhaps in-place alteration of items, but after setting it up, you will never ever insert or remove items from somewhere in the list.
Or, stated another way, it's an aggregate data structure with fewer bells and whistles than the Collection types, with slightly less overhead and slightly better performance depending on how it's used.
A very contrived example: You're in the business of producing functions, and quality testing for these functions involves checking their performance or results for a set of 1000 fixed input values. Moreover, you decide not to keep these values in a file, but rather you hard code them into your program. An array would be appropriate.
Interfacing with Java APIs is one case. Also, unlike Java arrays, Scala arrays are invariant, and hence don't have any advantage over lists on that account.

Resources