How would I push a stream of messages into one array?

I have a stream of objects that come in one at a time to a debugger node. I want to be able to grab the objects and store them into one array once the objects stop streaming from an rss feed. The issue is I won't know how many of these objects will be coming through.
I've tried pushing the objects into an array that I store in flow context, but I have to believe there's a much better and less messy way of doing this in Node-RED.

Look at the join node. It has a number of modes for collecting messages and assembling an array of payloads.
Since you do not know how many messages will arrive, the timeout option may be the best bet.
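To illustrate the idea of collecting an unknown number of messages and releasing them as one array after a timeout, here is a minimal sketch in Python. It is not the join node's implementation: the class name TimeoutJoin and its methods are made up for this example, and it assumes the timer starts when the first message of a batch arrives. Times are passed in explicitly so the behaviour is easy to follow.

```python
class TimeoutJoin:
    """Collects payloads into one list and releases it after a timeout.

    Assumption of this sketch: the timer starts when the first payload
    of a batch arrives.
    """

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.buffer = []
        self.started_at = None

    def push(self, payload, now):
        if not self.buffer:
            self.started_at = now  # first message of a new batch starts the timer
        self.buffer.append(payload)

    def poll(self, now):
        """Return the collected list once the timeout has elapsed, else None."""
        if self.buffer and now - self.started_at >= self.timeout_seconds:
            batch, self.buffer = self.buffer, []
            self.started_at = None
            return batch
        return None


joiner = TimeoutJoin(timeout_seconds=5)
joiner.push({"title": "item 1"}, now=0)
joiner.push({"title": "item 2"}, now=1)
early = joiner.poll(now=3)  # timeout not reached yet
batch = joiner.poll(now=5)  # timeout reached: one array holding both payloads
```

In Node-RED itself you would not write this by hand; the join node's configuration covers it.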

Related

Perform operations directly on database (esp. Firestore)

Just a question regarding NoSQL DBs. As far as I know, operations are done by the app/website outside the DB. For instance, if I need to add a value to a list, I need to:
download the initial list
add the new value to the list on my device
upload the whole updated list.
In the end, a lot of data travels (twice the initial list) with no added value.
Is there any way to ask the DB directly to perform simple operations like this?
db.collection("collection_key").document("document_key").add("mylist", value)
Or simply increment a field?
The same goes for knowing the number of documents in a collection: do I need to download the whole set of documents to get the count?

Couple different answers:
In Firestore, many intrinsic operations can be done with FieldValue, such as increment/decrement (by a supplied value, so really add/subtract). There are also array unions, field deletes, etc. Just search the documentation for FieldValue. Whether this is true for NoSQL in general, I can't say.
Knowing the number of documents, on the other hand, is not trivially done in Firestore, but frankly, I can't think of any situations other than artificially contrived examples where you would need to know. It is easy enough to set up ways to "count" documents as you create/delete them, and keep that count separately, if for some reason you find yourself needing it.
Or were you just trying to generically put down NoSQL as a concept?
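The "count documents as you create/delete them" suggestion could look roughly like the sketch below. It uses a plain in-memory dictionary as a stand-in for a collection, so it shows only the bookkeeping idea; CountedCollection and its methods are hypothetical names and not part of any Firestore API.

```python
class CountedCollection:
    """In-memory stand-in for a collection that tracks its own document count."""

    def __init__(self):
        self.documents = {}
        self.count = 0  # kept separately, updated on every create/delete

    def create(self, doc_id, data):
        if doc_id not in self.documents:
            self.count += 1  # only new documents change the count
        self.documents[doc_id] = data

    def delete(self, doc_id):
        if doc_id in self.documents:
            del self.documents[doc_id]
            self.count -= 1


users = CountedCollection()
users.create("a", {"name": "Ann"})
users.create("b", {"name": "Bob"})
users.create("a", {"name": "Ann B."})  # overwrite, count unchanged
users.delete("b")
```

Reading users.count then never requires touching the documents themselves.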

When to use an array vs database

I'm a student and starting to relearn again the basics of programming.
My question came up when I read some Facebook posts saying that most programmers use arrays in their applications and that arrays are useful. I then realized that I never use arrays in my programs.
I read some books, but they only show the syntax of arrays and don't discuss when to apply them in real-world applications. I tried to research this on the Internet but could not find anything. Do you have circumstances in which you use arrays? Can you please share them with me so I can get an idea?
Also, to clear my doubts, can you please explain why arrays are good for storing information when a database can also store information? When is the right time for me to use a database, and when arrays?
I hope to get a clear answer because I have one remaining semester before my internship and I want to clear my head on this. I did not mention a specific programming language because I know most programming languages have arrays.
I hope to get an answer that I can easily understand.

When is the right time for me to use database and arrays?
I can see how databases and arrays may seem like competing solutions to the same problem, but you're comparing apples and oranges. Arrays are a way to represent structured data in memory. Databases are a tool to store data on disk until you need to retrieve it.
The question you pose is kind of like asking: "When is the right time to use an integer to hold a value, vs a piece of paper?" One of them is a structural representation in memory; the other is a storage tool.
Do you guys have circumstance when you use arrays
In most applications, databases and arrays work together. Applications often retrieve data from a database, and hold it in an array for easy processing. Here is a simple example:
Google allows you to receive an alert when something of interest is mentioned on the news. Let's call it the event. Many people can be interested in the event, so Google needs to keep a list of people to alert. How? Probably in a database.
When the event occurs, what does Google do? Well it needs to:
Retrieve the list of interested users from the DB and place it in an array
Loop through the array and send a notification to each user.
In this example, arrays work really well because users form a collection of similarly shaped data structures that needs to be put through a similar process. That's exactly what arrays are for!
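The two steps above (fetch from the DB into an array, then loop over it) can be sketched as follows. This is only an illustration: the subscribers table, the notify function and the sample addresses are invented for the example, and an in-memory SQLite database stands in for whatever storage is really used.

```python
import sqlite3

# Hypothetical database of people interested in a topic.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE subscribers (email TEXT, topic TEXT)")
db.executemany(
    "INSERT INTO subscribers VALUES (?, ?)",
    [
        ("ann@example.com", "eclipse"),
        ("bob@example.com", "eclipse"),
        ("cy@example.com", "chess"),
    ],
)


def notify(email, topic):
    # Stand-in for sending a real notification.
    return f"To {email}: news about {topic}"


def alert_subscribers(topic):
    # Step 1: retrieve the interested users from the DB into an array (a list).
    rows = db.execute(
        "SELECT email FROM subscribers WHERE topic = ? ORDER BY email", (topic,)
    ).fetchall()
    emails = [row[0] for row in rows]
    # Step 2: loop through the array and notify each user.
    return [notify(email, topic) for email in emails]


sent = alert_subscribers("eclipse")
```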
Some other common uses of arrays
A bank wants to send invoice and payment due reminders at the end of the day. So it retrieves the users with past due payments from the DB, and loops through the users' array sending notifications.
An IT admin panel wants to check whether all critical websites in a list are still online. So it loops through the array of domains, pings each one and records the results in a log.
An educational program wants to perform statistical functions on student test results. So it puts the results in an array to easily perform operations such as average, sum, standardDev...
Arrays are also awesome at keeping things in a predictable order. You can be certain that as you loop forward through an array, you encounter values in the order you put them in. If you're trying to simulate a checkout line at the store, the customers in a queue are a perfect candidate to represent in an array because:
They are similarly shaped data: each customer has a name, cart contents, wait time, and position in line
They will be put through a similar process: each customer needs methods for enter queue, request checkout, approve payment, reject payment, exit queue
Their order should be consistent: When your program executes next(), you should expect that the next customer in line will be the one at the register, not some customer from the back of the line.
Trying to store the checkout queue in a database doesn't make sense because we want to actively work with the queue while we run our simulation, so we need the data in memory. The database can hold a historical record of all customers and their checkout outcomes, perhaps for another program to retrieve and use in another way (maybe to build customized statistical reports).
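A rough sketch of the checkout line, assuming a plain list as the array and made-up customer data; enter_queue and next_customer are names chosen for the example, loosely following the "enter queue" and next() ideas above.

```python
queue = []  # the checkout line, kept in memory while the simulation runs


def enter_queue(name, cart):
    # Similarly shaped data: every customer has a name and cart contents.
    queue.append({"name": name, "cart": cart})


def next_customer():
    # The customer at the front of the line is served first.
    return queue.pop(0)


enter_queue("Ana", ["milk"])
enter_queue("Ben", ["bread", "eggs"])
enter_queue("Cho", ["tea"])

served = []
while queue:
    served.append(next_customer()["name"])
```

For long queues a structure such as collections.deque avoids the cost of removing from the front of a list, but the ordering guarantee is the same.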

These are two different things. Let me try to explain in a simple way:
Array: a container object that keeps a fixed number of values. The array is stored in memory. It depends on your requirements, but when you need a fixed-size, fast container, just use an array.
Database: use it when you have relational data, or when you would like to store data somewhere without worrying much about how many objects there are. You can store 10, 100, or 1000 records in your DB. It is also flexible: you can select, query, and update the data as needed. A simple rule of thumb: if you have a large amount of relational data and want to work with it flexibly, use a database.
Hope this helps.

There are a number of ways to store data when you have multiple instances of the same type of data. (For example, say you want to keep information on all the people in your city. There would be some sort of object to hold the information on each person, and you might want to have a data structure that holds the information on every person.)
Java has two main ways to store multiple instances of data in memory: arrays and Collections.
Databases are something different. The differences between a database and an array or collection, as I see it, are:
databases are persistent, i.e. the data will stay around after your program has finished running;
databases can be shared between programs, often programs running in all different parts of the world;
databases can be extremely large, much, much larger than could fit in your computer's memory.
Arrays and collections, however, are intended only for use by one program as it runs. Your program may want to keep track of some information in order to do its calculations. But the data will be in your computer's memory, and therefore other programs on other computers won't be able to access it. And when your program is done running, the data is gone. However, since the data is in memory, it's much faster to use it than data in a database, which is stored on some sort of external device. (This is really an overgeneralization, and doesn't consider things like virtual memory and caching. But it's good enough for someone learning the basics.)
The Java run time gives you three basic kinds of collections: sets, lists, and maps. A set is an unordered collection of unique elements; you use that when the data doesn't belong in any particular order, and the main operations you want are to see if something is in the set, or return all the data in the set without caring about the order. A list is ordered, though; the data has a particular order, and provides operations like "get the Nth element" for some number N, and adding to the ends of the list or inserting in a particular place in the list. A map is unordered like a set, but it also attaches keys to the data, so that you can look for data by giving the key. (Again, this is an overgeneralization. Some sets do have order, like SortedSet. And I haven't even gotten into queues, trees, multisets, etc., some of which are in third-party libraries.)
Java provides a List type for ordered lists, but there are several ways to implement it. One is ArrayList. Like all lists, it provides the capability to get the Nth item in the list. But an ArrayList provides this capability faster; under the hood, it's able to go directly to the Nth item. Some other list implementations don't do that--they have to go through the first, second, etc., items, until they get to the Nth.
An array is similar to an ArrayList, but it has a different syntax. For an array x, you can get the Nth element by referring to x[n], while for an ArrayList you'd say x.get(n). As far as functionality goes, the biggest difference is that for an array, you have to know how big it is before you create it, while an ArrayList can grow. So you'd want to use an ArrayList if you don't know beforehand how big your list will be. If you do know, then an array is a little more efficient, I think. Although you can probably get by mostly with just ArrayList, arrays are still fundamental structures in every computer language. The implementation of ArrayList depends on arrays under the hood, for instance.

Think of an array as a book, and a database as a library. You can't share the book with others at the same time, but you can share a library. You can't put the entire library in one book, but you can check out one book at a time.

Which is better: sending many small messages or fewer large ones?

I have an app whose messaging granularity could be written two ways - sending many small messages vs. (possibly far) fewer larger ones. Conceptually what moves around is a set of 'alive' vertex IDs that might get filtered at each superstep based on a processed list (vertex value) that vertexes manage. The ones that survive to the end are the lucky winners. compute() calculates a set of 'new-to-me' incoming IDs that are perfect for the outgoing message, but I could easily send each ID one at a time. My guess is that sending fewer messages is more important, but then each set might contain thousands of IDs. Thank you.
P.S. A side question: The few custom message type examples I've found are relatively simple objects with a few primitive instance variables, rather than collections. Is it nutty to send around a collection of IDs as a message?

I have used lists and even maps, either sent as messages or just stored as vertex data, so that isn't a problem. I think it shouldn't matter to Giraph which one you choose, and I'd rather go with many simple small messages, since that is how Giraph is meant to be used. Otherwise, in the compute function you will need to go through the list of messages and, for each message, through its list of IDs.
Performance-wise it shouldn't make any difference. What I have found to make a big difference is trying to compute as much as possible in one cycle, as switching between cycles and synchronising the messages takes a lot of time. As long as that doesn't change, it should be more or less the same, and probably much easier to read and maintain when you keep the messages small.

In order to answer your question, you need to understand the MessageStore interface and its implementations.
In a nutshell, under the hood, it takes the following steps:
The worker receives the raw bytes of the messages and the destination IDs.
The worker sorts the messages and puts them into a map of maps. The first map's key is the partition ID; the second map's key is the vertex ID. (It is kind of like the post office. The worker is like the central hub: it sorts the letters by zip code first, then by address within each zip code.)
When it is the vertex's turn to compute, an Iterable of that vertex's messages is passed to the vertex's compute method, and that's where you get the messages and use them.
So fewer, bigger messages are better because there is less sorting, provided the total number of bytes is the same in both cases.
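The "map of a map" step can be pictured with the following sketch. It is not Giraph code: the partitioning rule, the function names and the sample messages are assumptions made for illustration only.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical number of partitions


def partition_of(vertex_id):
    # Hypothetical partitioning rule, only for this sketch.
    return vertex_id % NUM_PARTITIONS


def store_messages(incoming):
    """Sort (destination vertex ID, message) pairs into partition -> vertex -> messages."""
    store = defaultdict(lambda: defaultdict(list))
    for dest, message in incoming:
        store[partition_of(dest)][dest].append(message)
    return store


# The same IDs sent as many small messages or as fewer large ones.
small = [(7, 1), (7, 2), (7, 3), (9, 4)]
large = [(7, [1, 2, 3]), (9, [4])]

small_store = store_messages(small)  # four messages to sort
large_store = store_messages(large)  # two messages to sort
```

Both stores carry the same IDs to vertices 7 and 9, but the second one required half as many placements into the nested maps.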

Also, you could send many small messages but let Giraph convert them into one long message (almost) automatically. You can use Combiners.
The documentation on this subject on the Giraph site is terrible, but you may be able to extract an example from the book Practical Graph Analytics with Apache Giraph.
This depends on the type of messages that you are sending, mainly.
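The general idea of a combiner, merging messages bound for the same vertex before delivery, can be sketched like this. It is a plain Python illustration and not the Giraph Combiner API; combine and send_all are hypothetical names, and set union is just one possible way to merge ID messages.

```python
def combine(existing, new):
    # Merge two messages for the same vertex; here a message is a set of IDs.
    return existing | new


def send_all(outgoing):
    """Fold many small (destination, ids) messages into one message per destination."""
    combined = {}
    for dest, ids in outgoing:
        if dest in combined:
            combined[dest] = combine(combined[dest], ids)
        else:
            combined[dest] = set(ids)
    return combined


delivered = send_all([(7, {1}), (7, {2}), (9, {4}), (7, {3})])
```

Whether this is possible depends on the message type: the merge has to make sense for your data, as it does for sets of IDs.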

Data structure for large set of stepwise/incremental data and method to store it

I don't know if I don't know the correct terms or if what I'm looking for simply isn't a common structure, so please bear with me as I try to describe what I am looking for.
Right now I have a sorted set. It changes over time with simple modifications. A (k,v) pair is inserted, deleted, or the value of a specific key may change.
No actions are or ever will be executed on more than a single key.
What I need is a way to store each incremental version of the data set and have it be mapped to a point in time. I will need to access any portion of it quickly and be able to generate the exact sorted set that existed at that time, and how it changed over the time period.
It is not feasible to store the actual sorted sets after each mutation themselves as it is about 10kb of data and will have approximately 2-3 mutations per second on average. This is a personal project so writing 2.5 gigabytes of data per set (times 10-20 sets) per day is cost prohibitive.
Now I have come up with a solution - and here lies my question, does the solution I've come up with have a term? Is there a better way to do it?
If I have an initial dataset Orders, the next iteration of data could be written as Orders + (K,V) then instead of storing the entire set twice, I simply store the actual set once, and then the second time it is stored as a reference + the mutation.
Then if I wanted to access Orders[n], I would iterate from Orders[0] to Orders[n], applying each mutation in turn, to generate the set as it existed at that time.
There is a big problem with this however. I need to be able to quickly access any range of data - roughly 250,000 iterations per day * months or years - so it is not practical to calculate the set from 0 -> n when n is large. The obvious solution here is to at some interval cache the resulting set and instead of a given data point recursively being calculated all the way back to Orders[0] it would only need to calculate back to Orders[1,500,000] to find the set which existed at Orders[1,500,100].
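The scheme described so far (one stored base set, a log of single-key mutations, and a cached copy every so often) could be sketched as below. All names are hypothetical, everything is held in memory, and the snapshot interval is an arbitrary parameter rather than a recommendation.

```python
class VersionedSet:
    """Base set + mutation log + periodic cached snapshots."""

    def __init__(self, initial, snapshot_every=100):
        self.snapshot_every = snapshot_every
        self.mutations = []  # mutations[i] turns version i into version i + 1
        self.snapshots = {0: dict(initial)}  # version number -> cached full set
        self.current = dict(initial)

    @staticmethod
    def _apply(state, key, value):
        if value is None:
            state.pop(key, None)  # None means "delete this key"
        else:
            state[key] = value  # insert or change a single key

    def mutate(self, key, value):
        """Record one single-key mutation and return the new version number."""
        self.mutations.append((key, value))
        self._apply(self.current, key, value)
        version = len(self.mutations)
        if version % self.snapshot_every == 0:
            self.snapshots[version] = dict(self.current)
        return version

    def at(self, version):
        """Rebuild the sorted set as it existed at the given version."""
        base = (version // self.snapshot_every) * self.snapshot_every
        state = dict(self.snapshots[base])  # nearest snapshot at or before version
        for key, value in self.mutations[base:version]:
            self._apply(state, key, value)
        return sorted(state.items())


orders = VersionedSet({"a": 1}, snapshot_every=2)
orders.mutate("b", 2)     # version 1
orders.mutate("a", None)  # version 2, snapshot cached here
orders.mutate("c", 3)     # version 3
```

With this layout, reading any version replays at most snapshot_every - 1 mutations, so the interval trades storage against replay time.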
If I decided this was a good way to structure the data, how often should I cache results?
Does something like this exist? In my research a lot of sources said to use linked lists or binary trees. I don't need a tree as my data is 100% continuous, not branching. So if I used a linked list my confusion lies in actually storing the data. And this is where I am completely stuck. What would be the best database & database schema to store this? (could use any system at this point, though having a node.js wrapper would be ideal as that is what is serving the data to the front-end) Or would writing binary data work better?
Even something as simple as an actual term for what I'm looking for or an alternative data structure to research would be helpful. Thanks!

This sounds like an excellent use case for a persistent binary search tree. A persistent data structure is one where after performing an operation on the structure, you get back two structures - the one before the change and the one after the change. Crucially, the internal representations of the two structures share memory, so if you have a 10KB collection, it takes much less than 20KB to store the before and after snapshots.
Since you want a key/value store, a persistent binary search tree might be your best bet. Like a normal BST, all operations run in O(log n) time. You can then store an array of all the snapshots, giving you O(1) access to any time slice you want.
Hope this helps!
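As a rough illustration of the persistent tree idea, the sketch below uses path copying: an insert creates new nodes only along the path it touches and reuses every other subtree. The names are made up for the example, and the tree is not balanced, so it does not guarantee the O(log n) bound; a real implementation would use a balanced variant.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(node, key, value):
    """Return a new tree containing (key, value); the old tree is left untouched."""
    if node is None:
        return Node(key, value)
    if key < node.key:
        return node._replace(left=insert(node.left, key, value))
    if key > node.key:
        return node._replace(right=insert(node.right, key, value))
    return node._replace(value=value)  # same key: new node with the new value


def items(node):
    """In-order traversal, i.e. the sorted (key, value) pairs of one version."""
    if node is None:
        return []
    return items(node.left) + [(node.key, node.value)] + items(node.right)


# An array of snapshots: versions[i] is the tree after i mutations.
versions = [None]
for key, value in [(5, "e"), (2, "b"), (8, "h"), (2, "B")]:
    versions.append(insert(versions[-1], key, value))
```

Here versions[3] and versions[4] share the untouched right subtree, which is why keeping every snapshot costs far less than a full copy per mutation.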

The data structures you are talking about are typically known as "persistent data structures" or sometimes "immutable data structures."

Designing a scalable, frequently updated resource on a key/value DHT

I would like to design a resource that users would continuously update and read. This resource need not always be up to date, but it must scale well; I mean that the nodes responsible for the resource and its replicas should not be overloaded.
The main problem is that I cannot see how this is possible! I could offload these nodes by adding read caches and updating them at a slow pace, but for writing I have no idea how to scale, because values must be at known keys to be recovered by the users, so I can't spread the load across the DHT...
Thanks a lot for your ideas!

After some thought,
one possible idea would be to make something approximate: each time someone wants to post a feed, they also post it under a key equal to a timestamp representing the second at which it was posted (like 25/03/2011 at 5 min and 1 sec), so that the load is automatically spread across the nodes and the responsible node changes quickly.
When someone wants to know the latest feeds posted, they just check the timestamp of the current second and of the 2-3 previous seconds, and can thus recover the latest feeds while distributing the load between the nodes responsible for the different seconds...
That should do the job, I think :).
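A small sketch of this timestamp-key idea, with a dictionary standing in for the DHT. The key format, the function names and the three-second lookback are assumptions made for the example.

```python
from collections import defaultdict

dht = defaultdict(list)  # stand-in for the DHT: key -> list of values


def key_for(second):
    # Hypothetical key format: one key per second.
    return f"feeds:{second}"


def post_feed(feed, now):
    # Writes land on whichever node is responsible for the current second's key.
    dht[key_for(int(now))].append(feed)


def latest_feeds(now, lookback=3):
    # Read the current second and the previous few seconds.
    feeds = []
    for second in range(int(now) - lookback, int(now) + 1):
        feeds.extend(dht.get(key_for(second), []))
    return feeds


post_feed("old", 90)
post_feed("a", 100.2)
post_feed("b", 101.7)
post_feed("c", 103.0)
recent = latest_feeds(now=103.5)
```

In a real DHT each of these keys would hash to a different node, which is what spreads both the writes and the reads.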
Hope my own answer will help someone later ;)