Would reference counting in the App Engine datastore be a good idea?

When I was thinking about implementing a system that lets users send messages to each other, I thought about the amount of data I could save if, instead of saving a copy of the message for each receiver, I saved one message with a list of receivers.
There would actually be three lists: one list of receivers; one list of booleans, where b[i] being true means receiver[i] has read the message; and a third list containing all the users who have not deleted the message. Every day, I would run a cron job that looks for messages whose third list is empty and removes them.
Could there be any problems with this model?

The first schema is like trying to replicate the email architecture, which is outdated and does not work very well.
The second approach is definitely better.
Problems? None, as long as your code has no bugs. But consider replies, if you have to support them. A fourth list might be enough, provided the entity does not exceed the 1 MB size limit.
However, a separate model for replies is more consistent and intuitive. This new model would also have lists such as readed_by, deleted_by, etc.
The cron job may be unnecessary. You could delete the message as soon as a user marks it as "deleted" if message.deleted_by == message.receivers + message.from.
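To make that last check concrete, here is a minimal sketch in plain JavaScript rather than datastore code. The message shape (from, receivers, deletedBy) and both function names are assumptions for illustration. It also compares the lists as sets instead of using ==, so the order in which users delete the message does not matter.

```javascript
// Hypothetical message shape: { from, receivers: [...], deletedBy: [...] }.
function everyoneDeleted(message) {
  const participants = new Set([message.from, ...message.receivers]);
  const deleted = new Set(message.deletedBy);
  // Set comparison: true only when every participant has deleted the message.
  return [...participants].every((user) => deleted.has(user));
}

// Records the deletion and reports whether the stored message can now be removed.
function markDeleted(message, user) {
  if (!message.deletedBy.includes(user)) {
    message.deletedBy.push(user);
  }
  return everyoneDeleted(message);
}
```

When markDeleted returns true, the caller can remove the stored message right away, which is what would make the daily cron job unnecessary.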

Related

Get the "Place in line" of records in the realtime db?

Basically I'm creating a system to manage requests from users in a Firebase Realtime Database, via an Express app loaded into Firebase Functions. The request table will be in a queue format, so FIFO. Records would be ordered chronologically by their keys, a la timecodes that are created from POST requests. I'd like to be able to tell a user their place in line, and I'd like to be able to list this place in line in my app. I expect this queue to have requests numbering in the thousands, so iterating up to the entire length of the queue every time a client requests its place in line seems unattractive.
I've thought of doing a query for the user's UID (which would be saved in each request, naturally), but I can't figure out how to structure that query while maintaining the chronological order of the requests. Something like requestsReference.orderByKey().endAt({UID}, key: "requestorUid") doesn't seem like it'd work from what I'm seeing in the docs; but if I pulled off a query like that then I'd be able to get the place in line just from the length of the returned object. It's worth saying now that I have no idea how efficient this would be compared to just iterating the entire queue in its original chronological order.
I've also thought of taking an arithmetic approach, basically adding the "place in line when requested" and "total fulfillments when requested" as data in the request records. So then I'd be able to retrieve a record by its UID, and determine the place in line via placeInLineAtRequestTime - (totalCurrentFulfillments - totalFulfillmentsAtRequestTime). It'd be a rough approach, and I'd need to fetch the entire fulfillments table in order to get the current count. So again, I'm not sure how this compares.
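For reference, the arithmetic approach boils down to something like the sketch below. The field names come from the formula above; the function name is made up, and it assumes the queue is strictly FIFO with no cancelled requests.

```javascript
// request: { placeInLineAtRequestTime, totalFulfillmentsAtRequestTime }
// Assumes strict FIFO and that no request ever leaves the queue unfulfilled.
function placeInLine(request, totalCurrentFulfillments) {
  const fulfilledSinceRequest =
    totalCurrentFulfillments - request.totalFulfillmentsAtRequestTime;
  return request.placeInLineAtRequestTime - fulfilledSinceRequest;
}
```

A result of zero or less would mean the request has already been fulfilled.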
Anyway, any thoughts? Am I missing some real easy way I could do this? Or would iterating it be cheaper than I think it'd be?

When to use an array vs database

I'm a student relearning the basics of programming.
The question above came up when I read some Facebook posts saying that most programmers use arrays in their applications and that arrays are useful. I then realized that I never use arrays in my programs.
I read some books, but they only show the syntax of arrays and don't discuss when to apply them in real-world applications. I tried to research this on the Internet but could not find anything. Do you guys have circumstance when you use arrays? Could you please share them so I can get an idea?
Also, to clear my doubts, could you please explain why arrays are good for storing information when a database can also store information? When is the right time for me to use database and arrays?
I hope to get a clear answer because I have one remaining semester before my internship and I want to clear my head on this. I did not name a specific programming language because I know most programming languages have arrays.
I hope to get an answer that I can easily understand.
"When is the right time for me to use database and arrays?"
I can see how databases and arrays may seem like competing solutions to the same problem, but you're comparing apples and oranges. Arrays are a way to represent structured data in memory. Databases are a tool to store data on disk until you need to retrieve it.
The question you pose is kind of like asking: "When is the right time to use an integer to hold a value, vs a piece of paper?" One of them is a structural representation in memory; the other is a storage tool.
"Do you guys have circumstance when you use arrays"
In most applications, databases and arrays work together. Applications often retrieve data from a database, and hold it in an array for easy processing. Here is a simple example:
Google allows you to receive an alert when something of interest is mentioned on the news. Let's call it the event. Many people can be interested in the event, so Google needs to keep a list of people to alert. How? Probably in a database.
When the event occurs, what does Google do? Well it needs to:
Retrieve the list of interested users from the DB and place it in an array
Loop through the array and send a notification to each user.
In this example, arrays work really well because users form a collection of similarly shaped data structures that needs to be put through a similar process. That's exactly what arrays are for!
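A rough sketch of those two steps in JavaScript follows. The fetchInterestedUsers function is a hypothetical stand-in for a real database query, and the rows, addresses, and function names are invented for illustration.

```javascript
// Hypothetical stand-in for a database query; a real app would call its DB client here.
function fetchInterestedUsers(eventName) {
  const rows = [
    { email: 'ana@example.com', topic: 'elections' },
    { email: 'ben@example.com', topic: 'sports' },
    { email: 'cho@example.com', topic: 'elections' },
  ];
  return rows.filter((row) => row.topic === eventName);
}

// Step 1: load the users into an array. Step 2: loop and notify each one.
function notifyAll(eventName, send) {
  const users = fetchInterestedUsers(eventName);
  for (const user of users) {
    send(user.email, 'News about ' + eventName);
  }
  return users.length;
}
```

The send callback is left to the caller, for example a function that queues an email.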
Some other common uses of arrays
A bank wants to send invoice and payment due reminders at the end of the day. So it retrieves the users with past due payments from the DB, and loops through the users' array sending notifications.
An IT admin panel wants to check whether all critical websites in a list are still online. So it loops through the array of domains, pings each one and records the results in a log
An educational program wants to perform statistical functions on student test results. So it puts the results in an array to easily perform operations such as average, sum, standardDev...
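As a small illustration of the last case, once the test results are in an array the statistics are a few lines each. This is only a sketch: the function names follow the ones mentioned above, and standardDev here is the population standard deviation, which is an assumption about what the program would want.

```javascript
function sum(values) {
  return values.reduce((total, value) => total + value, 0);
}

function average(values) {
  return sum(values) / values.length;
}

// Population standard deviation: square root of the mean squared deviation.
function standardDev(values) {
  const mean = average(values);
  const squaredDeviations = values.map((value) => (value - mean) ** 2);
  return Math.sqrt(average(squaredDeviations));
}
```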
Arrays are also awesome at keeping things in a predictable order. You can be certain that as you loop forward through an array, you encounter values in the order you put them in. If you're trying to simulate a checkout line at the store, the customers in a queue are a perfect candidate to represent in an array because:
They are similarly shaped data: each customer has a name, cart contents, wait time, and position in line
They will be put through a similar process: each customer needs methods for enter queue, request checkout, approve payment, reject payment, exit queue
Their order should be consistent: When your program executes next(), you should expect that the next customer in line will be the one at the register, not some customer from the back of the line.
Trying to store the checkout queue in a database doesn't make sense because we want to actively work with the queue while we run our simulation, so we need data in memory. The database can hold a historical record of all customers and their checkout outcomes, perhaps for another program to retrieve and use in another way (maybe build customized statistical reports)
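Here is a minimal sketch of the in-memory checkout line, assuming JavaScript. The class name and the customer fields are made up for the example; only enterQueue and next() echo the names used above.

```javascript
class CheckoutLine {
  constructor() {
    this.customers = []; // the array keeps customers in arrival order
  }

  // Adds a customer to the back of the line and returns their 1-based position.
  enterQueue(name, cartContents) {
    this.customers.push({ name, cartContents });
    return this.customers.length;
  }

  // Removes and returns the customer at the front (undefined if the line is empty).
  next() {
    return this.customers.shift();
  }
}
```

Because the array preserves insertion order, next() always hands back whoever has waited longest.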
These are two different things. Let me try to explain in a simple way:
Array: a container object that holds a fixed number of values. The array is stored in memory. It depends on your requirements, but when you need a fixed-size, fast container, use an array.
Database: use one when you have relational data, or when you want to store data somewhere without worrying much about its size. You can store 10, 100, or 1000 records in your DB. It is also flexible: you can select, query, and update the data as needed. A simple rule: if you have a large amount of relational data and want to work with it flexibly, use a database.
Hope this helps.
There are a number of ways to store data when you have multiple instances of the same type of data. (For example, say you want to keep information on all the people in your city. There would be some sort of object to hold the information on each person, and you might want to have a data structure that holds the information on every person.)
Java has two main ways to store multiple instances of data in memory: arrays and Collections.
Databases are something different. The differences between a database and an array or collection, as I see it, are:
databases are persistent, i.e. the data will stay around after your program has finished running;
databases can be shared between programs, often programs running in all different parts of the world;
databases can be extremely large, much, much larger than could fit in your computer's memory.
Arrays and collections, however, are intended only for use by one program as it runs. Your program may want to keep track of some information in order to do its calculations. But the data will be in your computer's memory, and therefore other programs on other computers won't be able to access it. And when your program is done running, the data is gone. However, since the data is in memory, it's much faster to use it than data in a database, which is stored on some sort of external device. (This is really an overgeneralization, and doesn't consider things like virtual memory and caching. But it's good enough for someone learning the basics.)
The Java run time gives you three basic kinds of collections: sets, lists, and maps. A set is an unordered collection of unique elements; you use that when the data doesn't belong in any particular order, and the main operations you want are to see if something is in the set, or return all the data in the set without caring about the order. A list is ordered, though; the data has a particular order, and provides operations like "get the Nth element" for some number N, and adding to the ends of the list or inserting in a particular place in the list. A map is unordered like a set, but it also attaches keys to the data, so that you can look for data by giving the key. (Again, this is an overgeneralization. Some sets do have order, like SortedSet. And I haven't even gotten into queues, trees, multisets, etc., some of which are in third-party libraries.)
Java provides a List type for ordered lists, but there are several ways to implement it. One is ArrayList. Like all lists, it provides the capability to get the Nth item in the list. But an ArrayList provides this capability faster; under the hood, it's able to go directly to the Nth item. Some other list implementations don't do that--they have to go through the first, second, etc., items, until they get to the Nth.
An array is similar to an ArrayList, but it has a different syntax. For an array x, you can get the Nth element by referring to x[n], while for an ArrayList you'd say x.get(n). As far as functionality goes, the biggest difference is that for an array, you have to know how big it is before you create it, while an ArrayList can grow. So you'd want to use an ArrayList if you don't know beforehand how big your list will be. If you do know, then an array is a little more efficient, I think. Although you can probably get by mostly with just ArrayList, arrays are still fundamental structures in every computer language. The implementation of ArrayList depends on arrays under the hood, for instance.
Think of an array as a book, and a database as a library. You can't share the book with others at the same time, but you can share a library. You can't put the entire library in one book, but you can check out one book at a time.

Which is better: sending many small messages or fewer large ones?

I have an app whose messaging granularity could be written two ways - sending many small messages vs. (possibly far) fewer larger ones. Conceptually what moves around is a set of 'alive' vertex IDs that might get filtered at each superstep based on a processed list (vertex value) that vertexes manage. The ones that survive to the end are the lucky winners. compute() calculates a set of 'new-to-me' incoming IDs that are perfect for the outgoing message, but I could easily send each ID one at a time. My guess is that sending fewer messages is more important, but then each set might contain thousands of IDs. Thank you.
P.S. A side question: The few custom message type examples I've found are relatively simple objects with a few primitive instance variables, rather than collections. Is it nutty to send around a collection of IDs as a message?
I have sent lists and even maps as messages, and stored them as vertex data, so that isn't a problem. I don't think it matters to Giraph which option you choose, but I'd go with many simple, small messages, since that is how Giraph is meant to be used. Otherwise, in the compute function you will need to loop through the list of messages and, for each message, through its list of IDs.
Performance-wise it shouldn't make any difference. What I have found to make a big difference is computing as much as possible in one cycle, because switching between cycles and synchronising the messages takes a lot of time. As long as that doesn't change, performance should be more or less the same, and the code is probably much easier to read and maintain when you keep messages small.
To answer your question, you need to understand the MessageStore interface and its implementations.
In a nutshell, under the hood, it takes the following steps:
The worker receives the raw byte input of the messages and the destination IDs.
The worker sorts the messages and puts them into a map of maps. The first map's key is the partition ID; the second map's key is the vertex ID. (It is kind of like the post office: the worker is like the central hub, which first sorts the letters by zip code, and then, within each zip code, by address.)
When it is the vertex's turn to compute, an Iterable of that vertex's messages is passed to the vertex's compute method, which is where you get the messages and use them.
So fewer, bigger messages are better because there is less sorting, assuming the total number of bytes is the same in both cases.
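The map-of-maps idea can be sketched in plain JavaScript as below. This is not Giraph code or the MessageStore API; the message shape and both function names are assumptions made only to show the two-level sort.

```javascript
// messages: array of { partitionId, vertexId, payload } (hypothetical shape).
// Builds partitionId -> (vertexId -> list of payloads).
function sortIntoStore(messages) {
  const store = new Map();
  for (const { partitionId, vertexId, payload } of messages) {
    if (!store.has(partitionId)) store.set(partitionId, new Map());
    const byVertex = store.get(partitionId);
    if (!byVertex.has(vertexId)) byVertex.set(vertexId, []);
    byVertex.get(vertexId).push(payload);
  }
  return store;
}

// What a vertex would be handed when it is its turn to compute.
function messagesFor(store, partitionId, vertexId) {
  const byVertex = store.get(partitionId);
  return (byVertex && byVertex.get(vertexId)) || [];
}
```

Each message costs two lookups regardless of its size, which is the intuition behind preferring fewer, larger messages for the same number of bytes.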
Also, you could send many small messages but let Giraph convert them into one long message (almost) automatically. You can use Combiners.
The documentation on this subject on the Giraph site is terrible, but you may be able to extract an example from the book Practical Graph Analytics with Apache Giraph.
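To show the idea only, here is a plain JavaScript sketch of what combining does conceptually: many single-ID messages headed for the same vertex get merged into one set. It does not use Giraph's actual combiner classes, and the function names and message shape are invented.

```javascript
// Hypothetical combine step: folds incoming IDs into the set already held for a vertex.
function combine(existingIds, incomingIds) {
  for (const id of incomingIds) existingIds.add(id);
  return existingIds;
}

// messages: array of { target, id }, one small message per ID.
// Returns target vertex -> Set of IDs, i.e. one combined message per target.
function combineOutgoing(messages) {
  const combined = new Map();
  for (const { target, id } of messages) {
    if (!combined.has(target)) combined.set(target, new Set());
    combine(combined.get(target), [id]);
  }
  return combined;
}
```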
This depends on the type of messages that you are sending, mainly.

Choosing the right model for storing and querying data?

I am working on my first GAE project using Java and the datastore, and this is my first try with a NoSQL database. Like a lot of people, I have problems understanding the right model to use. So far I've figured out two models, and I need help choosing the right one.
All the data is represented in two classes, User.class and Word.class.
User: a couple of strings with user data (username, email, ...)
Word: two strings
Which is better:
Search in 10,000,000 entities for the 100 I need. For instance, every Word entity has a string property owner, and I query (owner = 'John').
In User.class I add a property List<Word> and a method getWords() that returns the list of words. So I query in 1000 users for the one I need and then call getWords(), which returns a List<Word> with the 100 I need.
Which one uses fewer resources? Or am I going the wrong way with this?
The answer is to use AppStats and find out:
AppStats
"To keep your application fast, you need to know: Is your application making unnecessary RPC calls? Should it be caching data instead of making repeated RPC calls to get the same data? Will your application perform better if multiple requests are executed in parallel rather than serially?"
Run some tests, try it both ways, and see what AppStats says.
But I'd say that your option 2) is better simply because you don't need to search millions of entities. But who knows for sure? The trouble is that "resources" are a dozen different things in App Engine: CPU, datastore reads, datastore writes, and so on.
For your User class, set a unique ID for each user (such as a username or email address). For the Word class, set the parent of each Word entity to a specific User.
So, if you wanted to look up words from a specific user, you would do an ancestor query for all words belonging to that specific user.
By setting an ID for each user, you can get that user by ID as opposed to doing an additional query.
More info on ancestor queries:
https://developers.google.com/appengine/docs/java/datastore/queries#Ancestor_Queries
More info on IDs:
https://developers.google.com/appengine/docs/java/datastore/entities#Kinds_and_Identifiers
It really depends on the queries you're using. I assume that you want to find all the words given a certain owner.
Most likely, 2 would be cheaper, since you only need to fetch the user entity instead of running a query.
2 will be a bit more work on your part, since you'll need to manually keep the list synchronized with the instances of Word.
Off the top of my head I can think of 2 problems with #2, which may or may not apply to you:
A. If you want to find all the owners given a certain word, you'll need to keep that list of words indexed. This affects your costs. If you mostly find words by owner, and rarely find owners by words, it'll still make sense to do it this way. However, if your search pattern flips around and you're searching for owners by words a lot, this may be the wrong design. As you see, you need to design the models based on the queries you will be using.
B. Entities are limited to 1 MB, and there's a limit on the number of indexed properties (5000, I think). Those two limits cap the number of words you can store in your list. Make sure that you won't need more words per user than that. Method 1 allows unlimited words per user.

Building a queue in CouchDB

This might sound like an obvious question, but I'm new to CouchDB, so I thought it was worthwhile asking in case there is something about CouchDB's structure that changes the situation that I didn't know about. For reasons out of my control, I have to build a queue-like structure out of CouchDB. For simplicity's sake, let's say I'm queueing IDs for jobs to be executed later. Note that there will be no duplicates.
I'm trying to figure out what the best way to structure this is. As I currently see it, I have a few options:
Store the queue items as entries in a queue database with the IDs as _id, and store the dequeued items in a similar dequeued database with the IDs as the _id. Each record in each database wouldn't have any other information other than the (mandatory) _id and _rev.
Have a single queueing database, and that database will contain one record with _id = 'queue' and one record with _id = 'dequeued'. Within each of the two records, there will be an arbitrary number of keys, each of which will be an ID for the jobs to be executed (or that were already executed). The values associated in the database with the keys will be irrelevant, possibly just a Boolean.
Have a single queueing database, and within that database, have a single record called queue. Within that record, have two keys: queue and dequeued. Each of those keys will have as its associated value an arbitrary-length list of job execution IDs.
1 is slightly less desirable because it requires two databases, and 2 strikes me as a poor choice because it requires loading the entire list of queued or dequeued items in order to read a list item or make any changes. However, 3 is nice in that it allows for the whole list of IDs to be an ordered list rather than key/value pairs, which makes it easier to pick a random item from the list to be the next job to be executed, since I don't actually need to know any key names (since there are none).
I'm looking for whichever provides the best performance. Any thoughts on this?
Update
For people reading this question in the future, I've built my CouchDB queuing module, CouchQueue, a work in progress.
You can get it with npm install couchqueue.
Take a look (and please comment, pull request, etc.) here at GitHub.
Use one document per element in the queue, and keep one queue database.
I recommend a field to order the elements, for example .created_at with a timestamp in ISO 8601 format.
You can toggle an element's visibility with a .visible flag.
I recommend a map/reduce view, something like this
function(doc) {
  // Only visible elements appear in the view, keyed by creation time.
  if (doc.visible)
    emit(doc.created_at, doc);
}
Now you can query this view, either oldest-first, or newest-first (?descending=true). You can mark an element complete by updating it, setting visible = false.
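To see what rows such a view would hand back, here is a rough in-memory emulation in JavaScript. It is not the CouchDB API: queryView is a made-up helper, emit is passed in as a parameter so the map function can run outside CouchDB, and it assumes all created_at values are ISO 8601 strings in the same time zone so that plain string comparison matches chronological order.

```javascript
// The map function from above, with emit passed in so it can run outside CouchDB.
function mapVisible(doc, emit) {
  if (doc.visible) emit(doc.created_at, doc);
}

// Hypothetical helper: collects the emitted rows and sorts them by key.
function queryView(docs, map, options = {}) {
  const rows = [];
  for (const doc of docs) {
    map(doc, (key, value) => rows.push({ id: doc._id, key, value }));
  }
  rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
  return options.descending ? rows.reverse() : rows;
}
```

Taking the first row of the ascending result gives the oldest visible element, which is the next one to process in a FIFO queue.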
I wrote a CouchDB queue, CQS, whose API is identical to the Amazon SQS API. It is similar to what I describe, except that messages can also be in a checked-out state, in which they are not visible in the queue for a timeout period. I have used CQS in production for about two years, with hundreds of millions of updates.
I suggest using a separate document for each queue entry; it will allow you to avoid conflicts.
If you just need a queue with the interface push(), pop(), top() for inserting an element and taking one, then the solution may be very simple (if you want a list with next(), or access to the n-th element, it gets more complicated). For a scheduling algorithm with linear order (like FIFO or FILO), you can implement push() as the insertion of a new document:
{ type: "queue", inserted: CURRENT_TIME, ... }
top() as the map:
function (doc) {
  if (doc.type == "queue" && doc.inserted) {
    emit(doc.inserted, doc);
  }
}
and reduce as an aggregation (e.g. max for FILO, min for FIFO).
For pop() you can ask the view for the top() and then delete the document.
Map/reduce has to be deterministic, so if you want to choose a random element you can make the reduce depend on the pseudo-random (server-chosen) _id.
I expect two problems:
Mind the concurrency: two processes can ask for the same document with top(); the first will delete the document as part of pop(), and the second will then try to fetch the deleted document.
CouchDB never really deletes a document; it only marks it as deleted. Adding and deleting a document for each push()/pop() will grow the database. You will have to reuse the documents somehow. Perhaps you have some pool of tasks, which are inserted and removed, or reordered in the queue. Then you can add queued: true to the task document instead of using separate documents with type: "queue".
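One way to cope with the first problem is to treat a failed delete as "someone else got it" and retry. The sketch below uses an in-memory stand-in, not the CouchDB API: FakeQueueDb and its methods are invented, and the revision check only imitates the way CouchDB rejects a delete that names a stale revision.

```javascript
// In-memory stand-in for a queue database. Not the CouchDB API.
class FakeQueueDb {
  constructor() {
    this.docs = new Map();
  }

  push(id, inserted) {
    this.docs.set(id, { _id: id, _rev: 1, type: 'queue', inserted });
  }

  // Returns a copy of the oldest document (FIFO), or null if the queue is empty.
  top() {
    let oldest = null;
    for (const doc of this.docs.values()) {
      if (oldest === null || doc.inserted < oldest.inserted) oldest = doc;
    }
    return oldest ? { ...oldest } : null;
  }

  // Succeeds only if the document still exists at the revision the caller saw.
  remove(id, rev) {
    const current = this.docs.get(id);
    if (!current || current._rev !== rev) return false;
    this.docs.delete(id);
    return true;
  }
}

// pop() = top() followed by a delete; retry if another consumer won the race.
function pop(db) {
  for (;;) {
    const doc = db.top();
    if (doc === null) return null;
    if (db.remove(doc._id, doc._rev)) return doc;
  }
}
```

The stand-in really deletes documents, so it says nothing about the second problem of the database growing.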
