When to use an array vs database - arrays

I'm a student and starting to relearn again the basics of programming.
The problem I stated above starts when I have read some Facebook posts that most of the programmers use arrays in their application and arrays are useful. And I started to realize that I never use arrays in my program.
I read some books but they only show the syntax of array and didn't discuss on when to apply them in creating real world applications. I tried to research this on the Internet but I cannot find any. Do you guys have circumstance when you use arrays. Can you please share it to me so I can have an idea.
Also, to clear my doubts can you please explain to me why arrays are good to store information because database can also store information. When is the right time for me to use database and arrays?
I hope to get a clear answer because I have one remaining semester before the internship and I want to clear my head on this. I do not include any specific programming language because I know most of the programming language have arrays.
I hope to get an answer that can I can easily understand.

When is the right time for me to use database and arrays?
I can see how databases and arrays may seem like competing solutions to the same problem, but you're comparing apples and oranges. Arrays are a way to represent structured data in memory. Databases are a tool to store data on disk until you need to retrieve it.
The question you pose is kind of like asking: "When is the right time to use an integer to hold a value, vs a piece of paper?" One of them is a structural representation in memory; the other is a storage tool.
Do you guys have circumstance when you use arrays
In most applications, databases and arrays work together. Applications often retrieve data from a database, and hold it in an array for easy processing. Here is a simple example:
Google allows you to receive an alert when something of interest is mentioned on the news. Let's call it the event. Many people can be interested in the event, so Google needs to keep a list of people to alert. How? Probably in a database.
When the event occurs, what does Google do? Well it needs to:
Retrieve the list of interested users from the DB and place it in an array
Loop through the array and send a notification to each user.
In this example, arrays work really well because users form a collection of similarly shaped data structures that needs to be put through a similar process. That's exactly what arrays are for!
Some other common uses of arrays
A bank wants to send invoice and payment due reminders at the end of the day. So it retrieves the users with past due payments from the DB, and loops through the users' array sending notifications.
An IT admin panel wants to check whether all critical websites in a list are still online. So it loops through the array of domains, pings each one and records the results in a log
An educational program wants to perform statistical functions on student test results. So it puts the results in an array to easily perform operations such as average, sum, standardDev...
Arrays are also awesome at keeping things in a predictable order. You can be certain that as you loop forward through an array, you encounter values in the order you put them in. If you're trying to simulate a checkout line at the store, the customers in a queue are a perfect candidate to represent in an array because:
They are similarly shaped data: each customer has a name, cart contents, wait time, and position in line
They will be put through a similar process: each customer needs methods for enter queue, request checkout, approve payment, reject payment, exit queue
Their order should be consistent: When your program executes next(), you should expect that the next customer in line will be the one at the register, not some customer from the back of the line.
Trying to store the checkout queue in a database doesn't make sense because we want to actively work with the queue while we run our simulation, so we need data in memory. The database can hold a historical record of all customers and their checkout outcomes, perhaps for another program to retrieve and use in another way (maybe build customized statistical reports)

There are two different points. Let's me try to explain the simple way:
Array: container objects to keep a fixed number of values. The array is stored in your memory. So it depends on your requirements but when you need a fixed and fast one, just use array.
Database: when you have a relational data or you would like to store it in somewhere and not really worry about the size of the objects. You can store 10, 100, 1000 records to you DB. It's also flexible and you can select/query/update the data flexible. Simple way to use is: have a relational data, large amount and would like to flexible it, use database.
Hope this help.

There are a number of ways to store data when you have multiple instances of the same type of data. (For example, say you want to keep information on all the people in your city. There would be some sort of object to hold the information on each person, and you might want to have a data structure that holds the information on every person.)
Java has two main ways to store multiple instances of data in memory: arrays and Collections.
Databases are something different. The difference between a database and an array or collection, as I see it, are:
databases are persistent, i.e. the data will stay around after your program has finished running;
databases can be shared between programs, often programs running in all different parts of the world;
databases can be extremely large, much, much larger than could fit in your computer's memory.
Arrays and collections, however, are intended only for use by one program as it runs. Your program may want to keep track of some information in order to do its calculations. But the data will be in your computer's memory, and therefore other programs on other computers won't be able to access it. And when your program is done running, the data is gone. However, since the data is in memory, it's much faster to use it than data in a database, which is stored on some sort of external device. (This is really an overgeneralization, and doesn't consider things like virtual memory and caching. But it's good enough for someone learning the basics.)
The Java run time gives you three basic kinds of collections: sets, lists, and maps. A set is an unordered collection of unique elements; you use that when the data doesn't belong in any particular order, and the main operations you want are to see if something is in the set, or return all the data in the set without caring about the order. A list is ordered, though; the data has a particular order, and provides operations like "get the Nth element" for some number N, and adding to the ends of the list or inserting in a particular place in the list. A map is unordered like a set, but it also attaches keys to the data, so that you can look for data by giving the key. (Again, this is an overgeneralization. Some sets do have order, like SortedSet. And I haven't even gotten into queues, trees, multisets, etc., some of which are in third-party libraries.)
Java provides a List type for ordered lists, but there are several ways to implement it. One is ArrayList. Like all lists, it provides the capability to get the Nth item in the list. But an ArrayList provides this capability faster; under the hood, it's able to go directly to the Nth item. Some other list implementations don't do that--they have to go through the first, second, etc., items, until they get to the Nth.
An array is similar to an ArrayList, but it has a different syntax. For an array x, you can get the Nth element by referring to x[n], while for an ArrayList you'd say x.get(n). As far as functionality goes, the biggest difference is that for an array, you have to know how big it is before you create it, while an ArrayList can grow. So you'd want to use an ArrayList if you don't know beforehand how big your list will be. If you do know, then an array is a little more efficient, I think. Although you can probably get by mostly with just ArrayList, arrays are still fundamental structures in every computer language. The implementation of ArrayList depends on arrays under the hood, for instance.

Think of an array as a book, and database as library. You can't share the book with others at the same time, but you can share a library. You can't put the entire library in one book, but you can checkout 1 book at a time.

Related

Perform operations directly on database (esp. Firestore)

Just a question regarding NoSQL DB. As far as I know, operations are done by the app/website outside the DB. For instance, if I need to add an value to a list, I need to
download the intial list
add the new value in the list on my device
upload the whole updated list.
At the end, a lot of data is travelling (twice the initial list) with no added value.
Is there any way to request directly the DB for simple operations like this?
db.collection("collection_key").document("document_key").add("mylist", value)
Or simply increment a field?
Same for knowing the number of documents in a collection: is it needed to download the whole set of document to get the number ?
Couple different answers:
In Firestore, many intrinsic operations can be done "FieldValues", such as increment/decrement (by supplied value, so really Add/subtract). Also array unions, field deletes, etc. Just search the documentation for FieldValue. Whether this is true for NoSQL in general, I can't say.
Knowing the number of documents, on the other hand. is not trivially done in Firestore - but frankly, I can't think of any situations other than artificially contrived examples where you would need to know. Easy enough to setup ways to "count" documents as you create/delete them, and keep that separately, if for some reason you find yourself needing it.
Or were you just trying to generically put down NoSQL as a concept?

Iterating and storing thousands/millions of objects efficiently

I am working on a simulation where I need to be able to handle thousands potentialy millions of objects updating every loop.
All of the objects need to have their logic function called (AI).
But depending on the location of the object determines how detailed the logic will be. For example:
[working with 100 objects to keep it simple]
All objects have a location (x,y)
20 objects are 500 points away
from a 'point of interest' location.
50 objects are 500 points
from the 20 objects (1000 points away).
30 objects are within 100
points from the point of interest.
Now say this was a detailed city simulation with the objects being virtual citizens.
At 6pm it's time for everyone to go home from their jobs and sleep.
So we iterate through all citizens, but I'm wanting them to do different things.
The furtherest away objects (50) Go home from their job and sleep
until morning.
The closer objects (20) Go home from their job, have a
bite to eat then sleep until morning.
The closest objects (30) Go
home from their job, have a bite to eat, brush teeth then sleep until
morning.
As you can see the closer they are to the point of interest the more detailed the logic becomes.
I am trying to work out what the best and most performance efficient way to iterate through all objects would be.
This would be relativly easy with a hand full of objects but as this needs to handle at lest 500,000 objects efficiently, I need some advice.
Also I'm not sure if I should iterate through all objects every loop or maybe it would be better to iterate through the closest objects every loop but only itereate through further away objects every 10 loops?
With the additional requirement of needing the objects to interact between other objects close to them, I have been thinking the best way to do this might be to organise them in a quadtree but I'm not sure. It seems as though quad trees are more for static content, but the objects i'm dealing with, as mentioned have a location and are required to move to other locations.
Am I going down the right track of thinking? or is there a 'better' way?
I am also working in c++ if anyone thinks its relevant.
Any advise would be greatly appreciated.
NOTE:
The point of interest changes regularly, think of it as a camera
view.
Objects are created and destroyed dynamically
If you want to quickly select objects in certain radius from particular point, then quad-tree or just simple square grid will help.
If your problem is how to store millions of objects to make iteration through them efficient, then you probably could use column based technique, when instead of having 1 million objects each having 5 fields, you have 5 arrays of 1 million elements each. In this case each object is just an index in range 0 .. 999999. So, for example, you want to store 1 million object of the following structure:
struct resident
{
int x;
int y;
int flags;
int age;
int health; // This is computer game, right?
}
Then, instead of declaring resident residents [1000000] you declare 5 arrays:
int resident_x [1000000];
int resident_y [1000000];
int resident_flags [1000000];
int resident_age [1000000];
int resident_health [1000000];
And then, instead of, say, residents [n].x you use resident_x [n]. Such way to store objects may be faster when you need to iterate through all objects of the same type and do something with couple of fields in each object (with the same set of fields in each object).
You need to break the problem down into "classes", just like in the real world. Each person's class is computed from the distance. So lower class people are far away and upper class are close. Or more correctly "far class", nearish class and "here class" or whatever you want to name them.
1) Make an array with one slot for each class. This slot will hold a "linked list" of each person in that class. When a persons class changers(social climbers), then it is very rapid to move the object to another list.
2) So put everybody into the proper classes and iterate only the classes close to you. In a proper scenario there are objects which are to far away to care about so you can put those back to disk and only reload when you get nearer.
There's a few questions embedded in there:
-How to deal with large quantities of objects? If there is a constant number of fixed objects, you may be able to simply create an array of them, as long as you have sufficient memory. If you need to dynamically create and destroy them, you put yourself at risk for memory leaks without careful handling of destroyed objects. At a certain point, you may ask yourself whether it is better to use another application, such as a database, to store your objects, and perform just the logic in your C++ code. Databases will provide additional functionality that I will highlight.
-How to find objects in a given distance from others. This is a classic problem for geographic information systems (GIS); it sounds like you are trying to operate a simple GIS to store your objects and your attributes, so it is applicable. It takes computation power to test SQRT((X-x)^2+(Y-y)^2), the distance formula, on every point. Instead, it is common to use a 'windowing function' to extract a square containing all the points you want, and then search within this to find points that lie specifically in a given radius. Some databases are optimized to perform a variety of GIS functions, including returning points within a given radius, or returning points within some other geometry like a polygon. Otherwise you'll have to program this functionality yourself.
-Tree storage of objects. This can improve speed, but you will hit a tradeoff if the objects are constantly moving around, wherein the tree has to be restructured often. It all depends on how often things move versus how often you want to do calculations on them.
-AI code. If you're trying to do AI on millions of objects, this may be your biggest use of performance, instead of the methodology used to store and search the objects. You're right in that simpler code for points farther away will increase performance, as will executing the logic less often for far away points. This is sometimes handled using Monte Carlo analysis, where the logic will be performed on a random subset of points during any given iteration, and you could have the probability of execution decrease as distance from the point of interest increases.
I would consider using a Linear Quadtree with Morton Encoding / Z-Order indexing. You can further optimize this structure by using a Bit Array to represent nodes that contain data and very quickly perform calculations.
I've done this extremely efficiently in the browser using Javascript and I can traverse through 67 million nodes in sub-seconds. Once I've narrowed it down to the region of interest, I look up the data in a different structure. All of it still in milliseconds. I'm using this for spatial vector animation.

When is it a good idea to use a database

I am doing an information retrieval project in c++. What are the advantages of using a database to store terms, as opposed to storing it in a data structure such as a vector? More generally when is it a good idea to use a database rather than a data structure?
(Shawn): Whenever you want to keep the data beyond the length of the instance of the program. (persistence across time)
(Michael Kjörling): Whenever you want many instances of your program, either in the same computer or in many computers, like in a network or the Net, access and manipulate (share) the same data. (persistence across network space)
Whenever you have very big amount of data that do not fit into memory.
Whenever you have very complex data structures and you prefer to not rewrite code to manipulate them, e.g search, update them, when the db programmers have already written such code and probably much faster than the code you (or I)'ll write.
Whenever you want to keep the data beyond the length of the instance of the program?
In addition to Shawn pointing out persistence: whenever you want multiple instances of the program to be able to easily share data?
In-memory data structures are great, but they are not a replacement for persistence.
It really depends on the scope. For example if you're going to have multiple applications accessing the data then a database is better because you won't have to worry about file locks, etc. Also, you'd use a database when you need to do things like joining other data, sorting, etc... unless you like to implement Quicksort.

Practical to save thousands of data structures in a file and do specific lookups?

There's been a discussion between me and some colleagues that are taking the same class as me (and thus have the same project) about saving data to files and read from those files only when we need that specific data.
For instance, the project is something about managing a social network. I'm not going into specifics because it doesn't matter, but the idea is to use the best data structures to manipulate this data.
Let's say I'm using an Hash Table to save the users profile data. Some of them argue that only some specific information should be saved in the data structures, like and ID that represents an user. Everything else should be put on files. We should access the files to get that data we want when we want.
I don't think this is practical... It could be if we were using some library for a database like SQLite or something, but are not and I don't think we are supposed to. We are only supposed to code everything ourselves and use C functions, like these. Nor do I think we are supposed to do a perfect memory management. The requisites of the project are not for us to code a database, or even a pseudo-database. What this project demands of us, are the best data structures (as long as we know how to justify why we picked those instead of others) to store the type of data and the all data specified for the project.
I should let you know that we had 2 classes before where the knowledge we got there is to be applied on this project. One of those dealt with the basis of C, functions, structures, arrays, strings, file IO, recursion, pointers and simple data structures like binary trees and linked lists, stuff like that. The other one was about more complex data structures, hash tables, AVL trees, heaps, graphs, etc... It also talked about time complexity, big O notation and stuff like that.
For instance, let's say all I have in memory is the IDs of the users and then I need to find all friends of a specific user. I'll have to process the whole file (or files) finding out the friends of that user. It would be much easier if I could have all that data in memory already.
It makes no sense to me that we need to pick (and justify) the data structures that we best see fit for the project and then only use them to lookup for an ID. We will then need to do a second lookup, to get the real data we need, which will take it's time, won't it? Why did we bother with the data structures in the first place if we still need to get to search a bunch of files on the hard drive?
How could it be possible, using standard C functions, coding everything manually and still simulate some kind of database? Is this practical at all?
Am I missing something here?
It sounds like the project might be more about how you design the relationships between your data "entities," and not as much about how you store them. I don't think storing data off in files would be a good solution - file IO will be much slower than accessing things in memory. If you had the need to persist data on the disk, you'd probably want to just use a database, rather than files (I know it's an academic course though, so who knows).
I think you should focus more on how you design your data types, and their relationships, to maximize the speed of lookups, searches, etc. For example, you could store all the users in a linked list, or store them in a tree, or a graph, but each will have its implications on how fast you can find users, etc. Depending on what features you want in your social networking site, there will be different designs that will allow different types of behavior to perform better than it would in other designs.
From what you're saying I doubt that you need to store anything on disk.
One thing that I would ask the teacher is if you're optimizing for time or space complexity (there will be a trade off between these two depending on what you're trying to achieve).
That can certainly be done. The resource forks in Mac System 5-8 files were stored as binary indexed databases (general use of the term, don't think SQL!). (I think the interface was actually written in assembly, but I could do it in c).
The only thing is: it's a pain in the butt. Such files typically need to start with some kind of index or header, and then hold a bunch of records at predictable locations. (OK, sometimes the first index just points at some more indexes. How many layers of indirection do you care to manage?)
If you're going to do it, just remember: binary mode access.
Hmm... what about persistent storage?
If your project requires you to be able to remember friend data between two restarts of the app, then don't you think file storage (or whatever else becomes an issue)?
I'm having a very hard time figuring out what you are trying to ask here.
But there is a general rule that may apply:
If all of your data will fit in memory at once, it is usually best to load all of it into memory at once and keep it there. You write out to a file only to save, to exit, or for backup.
There are lots of exceptions to this rule, but for a class project where this is going to be the only major application running on the machine, you may as well store everything in memory. After all, you have already paid for the memory; you don't want it just sitting there idle.
I may have completely misunderstood the question you are trying to ask...

Explaining benefits of an array to a lay person?

I develop code in our proprietry system using a scripting language that is unique to that system.
Our director has allowed us to request enhancements to this language, which currently lacks user definable arrays.
I need to write a concept brief on why we need arrays and how they can benefit us, however I need to explain it in a fashion that someone who has no understanding of code will understand.
I'm a programmer, therefore I suck at documentation and explaining things in a non-technical manner. I tried banging my head on the desk to see if anything useful would come out but it hasn't. Can anyone help?
I love analogies.
Much easier to have a 100 DVD holder that sits neatly on your floor and holds 100 dvds in order than 100 individual DVDs scattered around your house where you last used them
Especially relevant when you need to move the collection from one place to another or share it with a friend.
What's your application area? To speak the users' language you need to know that. Suppose it's stocks trading: then what to you is an array, to the users may be a portfolio -- get the quotes for several stock at once rather than having to do it repeatedly for one at a time. If your application area is CRM, then the array will let the users check on a group of customers at once, rather than do it one at a time. And so on, and so forth.
In every application area there will be cases in which users may want to deal with a bunch of things at once, it being easier than dealing with one thing at a time. Phrase it in the appropriate vocabulary, and you have the case for arrays!
You might want to see if you can move the business away from your custom scripting environment and into a standard scripting environment like LUA or Python. You might be surprised at how much easier it is to get LUA up and running than it is to :
Support an in house system
Create tools for it (do you have an IDE?)
Train new programmers in it
Live without modern features that you lack the time/skills to impliment.
Key to getting that to happen would be to make LUA interoperable with your standard scripting system or writing a translation from your old scripts to LUA scripts.
The benefit is that it makes the code shorter, and thus less money is spent coding and debugging. You can then present some example code that you could make it shorter had the language supported arrays.
Sounds like you've been asked to create code in the past (or anticipate having to create code in the future), where your job would have been faster/easier/cheaper if the system that you used had arrays.
That's the issue: you want to do more for your director and you need arrays to help you.
Your director will understand the business benefits of you having a better toolkit--you'll be able to do more for him or her. And that's how you increase business efficiency.
Tell your director: I want to improved my productivity for you and our team. To do so, arrays would be very helpful.
I like Alex's answer - it has to be put in terms of the user's problems. What problem (that they care about) can they do with it that they cannot do without it?
I used to teach introductory programming in college, and arrays are simply not something that comes easily to non-programmers. They need to understand some other basics first, like the sequential nature of programs, the lego-block way programs are constructed, the idea of run-time (as opposed to write-time) and really importantly the concept of a variable as a container of a value, and how that is different from its name, and how its contents changes with time while its name does not.
I found a useful way to get into this area is to let them program a very simple, decimal, simulated computer, in "machine language". They get the notion of memory address vs. memory contents, and that address is just a number. That makes it a lot easier to introduce arrays in a more "real" language.
Another approach is to have them work on a kind of problem where they really start wishing they could invent variables on-the-fly. Like they don't want to just have a variable A, but they feel a need for A1, A2, etc. and then they would really like to say Ai where i is a another variable. Once they feel the need for that, then they will grasp arrays. (For example, they could take a simple program that asks for their name and has a simple conversation with them, and then extend it to talk to two people at once, then three, and so on.)
Then, a useful next step is "parallel arrays" which can serve as rudimentary arrays of structures. i.e. N$(i) can be name of student i, while A(i) can be age of student i. This makes the idea useful.
Only then would I dare to start to introduce algorithms like sorting, merging, table lookup, and so on.
I think to fully realize the potential of arrays, you must somehow mention two things:
1) Array Algorithms
Sort, Find, etc. All the basics. Equate this in your business brief as structured data that can organize itself. No extra query language. No variable naming conventions. All you need is good standards.
2) Multi-Dimensional Arrays
The power of arrays seems fully realized to me with matrices. With these you can practically hold limitless data.
Plus, depending on the power of the propriety language you are using, arrays can store objects.
The power of an array is that it allows you to put a group of things together so that you can perform the same operation on all of them with less code.
Sorting is one example of an array operation, and is like having a box of index cards that you are putting in order.
Or if you had a collection of letters that need to go out, being able to write a loop that stamps each letter and then sends it is better than writing out
Take first letter, stamp it.
Mail it.
Takes second letter, stamp it,
Mail it.
Basically anything you'ld use to refer to the first, second, third, fifth, etc, is basically like an array.
And then indexed/hashed arrays are like having an index in the book - you know the author describes the Defenistration of Prague somewhere in the volume, but looking in the index shows that it's on page 255.
Here's the easiest-to-understand benefit: It lets you refer to things by a number. Try to emphasize the importance of this.

Resources