Cloud Firestore: better database structure for this case

I am developing an app where users (Firebase Auth) will register their expenses and be notified (via OneSignal) every Sunday about the expenses due that week.
My firestore structure is:
-users (collection)
---xxXXxxX (user document)
-----email
-----OneSignal ID
-----expenses (collection)
-------yyYYYyY (expense document)
---------dueDate
---------value
---------userId
-------aaAAaaA (expense document)
---------dueDate
---------value
---------userId
---bBBbbBB (another user document)
-----email
-----OneSignal ID
-----expenses (collection)
-------wwWWwwW (expense document)
---------dueDate
(...)
Based on this structure, every Sunday a scheduled Google Cloud function will run and query all the expenses due that week (a collection group query, returning a list that can contain more than one expense per user).
With this list, still inside the function, I will manually group the expenses by userId, producing a second list with one entry per user. With that second list, the function will fetch each user's OneSignal ID (another round of Firestore queries, one per user) and register a notification with the OneSignal service for every user.
P.S.: The OneSignal ID can change, which is why I can't store it on each expense document.
I think this structure will work, but it doesn't seem like the best solution, because many queries run in the background and this could become costly in the future.
Does anyone have a better suggestion for this case? Maybe a different Firestore structure...
I hope that I explained the "problem" well. English is not my first language.
Thank you!
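For illustration, the per-user grouping step described in the question could be sketched like this in plain JavaScript (the field names follow the structure above; the sample data and function name are hypothetical):

```javascript
// Group a flat list of expenses (as returned by a collection group
// query) by userId, so that each user receives a single notification
// covering all of their expenses due this week.
function groupExpensesByUser(expenses) {
  const byUser = new Map();
  for (const expense of expenses) {
    if (!byUser.has(expense.userId)) {
      byUser.set(expense.userId, []);
    }
    byUser.get(expense.userId).push(expense);
  }
  return byUser;
}

// Hypothetical sample data, reusing the document ids from the question.
const expenses = [
  { userId: 'xxXXxxX', dueDate: '2023-06-14', value: 50 },
  { userId: 'xxXXxxX', dueDate: '2023-06-16', value: 20 },
  { userId: 'bBBbbBB', dueDate: '2023-06-15', value: 30 },
];
const grouped = groupExpensesByUser(expenses);
// grouped.size === 2: one entry per user
```

Each entry of `grouped` then needs one extra Firestore read to fetch that user's OneSignal ID, which is exactly the per-user query cost the question is worried about.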

From what I've read in the documentation, you're doing fine, and here's why (anyone, correct me if I'm wrong):
Firebase charges you when users download more than X GB of data. As far as I know, you won't be charged for running queries and filtering, so you're good in this respect.
Firestore query time depends on the number of results you get, not on the structure. So if you're fine with this structure, stick with it; I see no problem at all.
EDIT: I just re-read the docs and found this:
When you use Cloud Firestore, you are charged for the following:
The number of reads, writes, and deletes that you perform.
The amount of storage that your database uses, including overhead for metadata and indexes.
The amount of network bandwidth that you use.
So apparently you will be charged for querying after all. In that case, a better way to structure the database might be to flatten the expenses tree. You could have something like this:
-users (collection)
---xxXXxxX (user document)
-----email
-----OneSignal ID
-expenses (collection)
---yyYYYyY (expense document)
-----userId
-----dueDate
-----value
(...)
This way you could still query a single user's expenses by filtering on the userId field.
https://cloud.google.com/firestore/docs/query-data/queries
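As a sketch of how the flattened structure would be queried: with a root-level expenses collection holding a userId field, the weekly job needs only one range filter on dueDate. The filter logic is shown here on plain objects with hypothetical data; the real Firestore query would use the .where() clauses in the comment.

```javascript
// Equivalent of: expenses.where('dueDate', '>=', weekStart)
//                        .where('dueDate', '<', weekEnd)
// ISO date strings compare correctly as plain strings.
function expensesDueInWeek(expenses, weekStart, weekEnd) {
  return expenses.filter(
    (e) => e.dueDate >= weekStart && e.dueDate < weekEnd
  );
}

const all = [
  { userId: 'xxXXxxX', dueDate: '2023-06-12', value: 50 },
  { userId: 'bBBbbBB', dueDate: '2023-06-20', value: 30 },
];
const due = expensesDueInWeek(all, '2023-06-11', '2023-06-18');
// due contains only the first expense
```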


How do I fill in the 'gaps' of missing data in Firestore and make sure that it is complete?

Please pardon me for asking a noob question, or if my explanation is not clear. It is quite complicated for me and I have tried to make it as clear as possible. I'm open to any ideas to improve this question.
I'm working on Flutter App, with Cloud Firestore for the cloud database and Sembast DB for local database.
The scenario is like this:
Supposedly on June 14th I added 5 documents in a collection like this:
Bob,
Sally,
Hannah,
Jane,
Chris,
Then in the cloud there is also a document that keeps track of the latest update, which was June 14th.
Now, the app fetches the data with pagination of 3 documents each time.
Thus, on the first fetch, the app gets the documents Bob, Sally, and Hannah, with the latest update being June 14th.
The app then saves these 3 documents and the latest updated date into its local DB.
Suppose the app was offline for a while after that.
While it was offline, there were 5 new data added on June 16th:
Bobby,
Andy,
John,
Jack,
Danny,
Once the app gets online again, it compares the latest data in the cloud with the local DB and finds that the latest updated date has changed to June 16th. Thus it fetches 3 new documents from June 16th:
Bobby, Andy, and John.
So now the local Database looks like this:
Bobby,
Andy,
John,
Bob,
Sally,
Hannah
You can see that some data is lost in between, namely Jack and Danny, because they are in the middle of the list.
Jane and Chris, however, the app will get when it reaches the end of the list, because then it can use the .startAfter command in Firestore.
So the problem is the data in the middle of the list, such as Jack and Danny. How does the app know that there is missing data there?
What we hope to see is that when the app scrolls down, we hope the app is able to fill in the missing data and make it looks like this:
Bobby,
Andy,
John,
Jack,
Danny,
Bob,
Sally,
Hannah
This example only consists of 10 documents, but in my real case I'm dealing with thousands of documents and tens of thousands of users.
Checking documents at each 'gap' is not performance friendly.
Moreover, I must keep the number of document reads as low as possible.
And for those who want to see code, here is what I use to fetch documents from Cloud Firestore:
Query query = collectionReference
    .orderBy('lastUpdated', descending: true)
    .startAfter([lastDocDateRetrieved])
    .limit(100);

// Getting the documents based on the query.
querySnapshot = await query.getDocuments();
Explanation: this query fetches the data ordered by the lastUpdated date in Cloud Firestore, starting after the latest updated date recorded in the local DB.
Then I save the data in local DB (Sembast DB):
// After converting the querySnapshot into two lists of keys and values:
await db.transaction((txn) async {
  // Save all documents in the local database
  await _store.records(keyList).put(txn, valueList);
});
The key problem is not in the code but in the logic: how do we know that there is missing data between documents?
When using pagination to cycle through documents from a query (using startAfter()), the API cannot tell you if new documents were added in a page that you've already gone through. If you want to know whether new documents were added, you have to start the query over from the beginning.
The only alternative to this (which I have never seen anyone do) is to set up individual listeners on each page query and keep them all active, so that each listener can tell you if something changed within its page. Doing this correctly would require a pretty large amount of code to handle all the boundary cases, and might be expensive if your documents change frequently.
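If you do restart the query from the beginning, pages you already stored will come back again; merging by document id keeps the local copy free of duplicates while picking up Jack and Danny. A minimal sketch in JavaScript (assuming each doc has an id and an ISO lastUpdated string; the Sembast write itself is omitted):

```javascript
// Upsert freshly fetched docs into the locally stored list by id,
// then re-sort by lastUpdated descending (newest first).
function upsertDocs(localDocs, fetchedDocs) {
  const byId = new Map(localDocs.map((d) => [d.id, d]));
  for (const doc of fetchedDocs) {
    byId.set(doc.id, doc); // insert new docs, overwrite changed ones
  }
  return [...byId.values()].sort((a, b) =>
    b.lastUpdated.localeCompare(a.lastUpdated)
  );
}

const local = [
  { id: 'Bobby', lastUpdated: '2020-06-16' },
  { id: 'Andy', lastUpdated: '2020-06-16' },
  { id: 'John', lastUpdated: '2020-06-16' },
  { id: 'Bob', lastUpdated: '2020-06-14' },
];
const fetched = [
  { id: 'Jack', lastUpdated: '2020-06-16' },
  { id: 'Danny', lastUpdated: '2020-06-16' },
  { id: 'Bobby', lastUpdated: '2020-06-16' }, // already stored locally
];
const merged = upsertDocs(local, fetched);
// merged now includes Jack and Danny exactly once
```

The trade-off is re-reading documents you already have, which is why the answer above notes there is no cheap way to detect gaps after the fact.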

What storage mechanism should I use to store a day's worth of user-interaction data for my website?

I store information about which items were accessed; that's it initially. I will store the id and type of each item that was accessed. For example, in a relational table it would be:
id | type          | view
1  | dairy product | 100
2  | meat          | 88
Later on, at the end of the day, I will transfer this data to the actual products table:
products
id | name            | view
1  | Cheesy paradise | 100
This is a website, and I don't want to update the table every time a user visits a product, because the products are in a relational database and that would be bad practice. I want to write a service in Node.js so that when a user visits a product, stays for 5 seconds, and scrolls to the bottom of the page, I increment a counter in some high-speed storage, and at the end of the day I update the related products in one go.
I will handle only about 300 visits to different products a day. But, of course, I want my system to grow until it can keep track of, say, a thousand products per minute. When I first thought about this feature, Mongo came to mind, but it seems like a lot for such a simple task. What technology fits this situation better?
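Whatever store is chosen, the buffering logic itself is small. Here is a sketch of the in-memory counter the question describes (all names are made up; in production the counts would live in the chosen store, e.g. Mongo or Redis, so they survive a restart):

```javascript
// Buffer product view counts in memory and drain them once a day
// for the batch update against the relational products table.
class ViewCounter {
  constructor() {
    this.counts = new Map();
  }

  // Called when a user stays 5 seconds and scrolls to the bottom.
  increment(productId) {
    this.counts.set(productId, (this.counts.get(productId) || 0) + 1);
  }

  // Returns all accumulated counts and resets the buffer, ready for
  // the end-of-day "one go" update.
  flush() {
    const snapshot = Object.fromEntries(this.counts);
    this.counts.clear();
    return snapshot;
  }
}

const counter = new ViewCounter();
counter.increment(1);
counter.increment(1);
counter.increment(2);
const snapshot = counter.flush(); // { '1': 2, '2': 1 }
```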
I would recommend MongoDB, since you are mostly "dumping" data into a database. It also allows you to dump more information in the future than you do now, no matter what kind of documents you dump today. Mongo is totally fine for a "dump" database structure.

Firestore: Running Complex Update Queries With Multiple Retrievals (ReactJS)

I have a grid of data whose endpoints are displayed from data stored in my firestore database. So for instance an outline could be as follows:
| Spent total: $150 |
| Item 1: $80 |
| Item 2: $70 |
So the values for all of these costs (70, 80, and 150) are stored in my Firestore database, with the sub-items in a separate collection from my total spent. Now, I want to be able to update the price of Item 2 to, say, $90, which will then update Item 2's value in Firestore; but I also want this to run a check against the table so that the spent total is updated to $170. What would be the best way to accomplish something like this?
Especially if I were to add multiple rows and columns that all depend on one another, what is the best way to update one part of my grid so that afterwards all of the data endpoints on the grid are updated correctly? Should I be using Cloud Functions somehow?
Additionally, I am building a ReactJS app. Previously I kept my grid endpoints in my Redux store state so that I could run methods that checked each row and column and did some math to update each endpoint correctly; what is the best way to do this now that I have migrated my data to Firestore?
Edit: here are some pictures of how I am currently trying to set up my Firestore layout:
You might want to back up a little and get a better understanding of the type of database that Firestore is. It's NoSQL, so things like rows and columns and tables don't exist.
Try this video: https://youtu.be/v_hR4K4auoQ
and this one: https://youtu.be/haMOUb3KVSo
But yes, you could use a cloud function to update a value for you, or you could make the new Spent total calculation within your app logic and when you write the new value for Item 2, also write the new value for Spent total.
But mostly, you need to understand how firestore stores your data and how it charges you to retrieve it. You are mostly charged for each read/write request, with much less concern for the actual amount of data you have stored overall. So it will probably be better to NOT keep these values in separate collections if you are always going to be utilizing them at the same time.
For example:
Collection(transactions) => Document(transaction133453) {item1: $80, item2: $70, spentTotal: $150}
and then if you needed to update that transaction, you would just update the values for that document all at once and it would only count as 1 write operation. You could store the transactions collection as a subcollection of a customer document, or simply as its own collection. But the bottom line is most of the best practices you would rely on for a SQL database with tables, columns, and rows are 100% irrelevant for a Firestore (NoSQL) database, so you must have a full understanding of what that means before you start to plan the structure of your database.
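To make the single-write idea concrete, here is a sketch of recomputing spentTotal whenever one item changes, so both values go into the same document update (field names follow the example above; the item-prefix naming convention is an assumption):

```javascript
// Return a new transaction document with one item updated and
// spentTotal recomputed from all fields whose names start with "item".
function updateItem(transactionDoc, itemKey, newValue) {
  const updated = { ...transactionDoc, [itemKey]: newValue };
  updated.spentTotal = Object.entries(updated)
    .filter(([key]) => key.startsWith('item'))
    .reduce((sum, [, value]) => sum + value, 0);
  return updated;
}

const tx = { item1: 80, item2: 70, spentTotal: 150 };
const updated = updateItem(tx, 'item2', 90);
// updated is { item1: 80, item2: 90, spentTotal: 170 },
// written back to Firestore as one update (1 write operation)
```

Because both fields live in the same document, the item price and the total can never drift apart between writes.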
I hope this helps!! Happy YouTubing...
Edit in response to comment:
The way I like to think about it is how am I going to use the data as opposed to what is the most logical way to organize the data. I'm not sure I understand the context of your example data, but if I were maybe tracking budgets for projects or something, I might use something like the screenshots I pasted below.
Since I am likely going to have a pretty limited number of team members for each budget, that can be stored in an array within the document, along with ALL of the fields specific to that budget - basically anything that I might like to show in a screen that displays budget details, for instance. Because when you make a query to populate the data for that screen, if everything you need is all in one document, then you only have to make one request! But if you kept your "headers" in one doc and then your "data" in another doc, now you have to make 2 requests just to populate 1 screen.
Then maybe on that screen, I have a link to "View Related Transactions", if the user clicks on that, you would then call a query to your collection of transactions. Something like transactions is best stored in a collection, because you probably don't know if you are going to have 5 transactions or 500. If you wanted to show how many total transactions you had on your budget details page, you might consider adding a field in your budget doc for "totalTransactions: (number)". Then each time a user added a transaction, you would write the transaction details to the appropriate transactions collection, and also increase the totalTransactions field by 1 - this would be 2 writes to your db. Firestore is built around the concept that users are likely reading data way more frequently than writing data. So make two writes when you update your transactions, but only have to read one doc every time you look at your budget and want to know how many transactions have taken place.
Same for something like chats. But you would only make chats a subcollection of the budget document if you wanted to only ever show chats for one budget at a time. If you wanted all your chats to be taking place in one screen to talk about all budgets, you would likely want to make your chats collection at the root level.
As for getting your data from the document: it's basically a JSON object, so (this may vary slightly depending on what kind of app you are working in)
a nested array is referred to by:
documentName.arrayName[index]
budget12345.teamMembers[1]
a nested object:
documentName.objectName.fieldName
budget12345.projectManager.firstName
And then a subcollection is
collection(budgets).document(budget12345).subcollection(transactions)
[Screenshot: FirebaseExample budget doc]
[Screenshot: FirebaseExample remainder of budget doc]
[Screenshot: FirebaseExample team chats collection]
[Screenshot: FirebaseExample transactions collection]

Tricky 1-to-1 unowned relations query

I am currently developing a location based service on GAE/Java. I am quite new to this and I need your help with the JDO query part.
I have two persistent classes, Client and ClientGeolocation. The first one is for storing the client attributes (Key clientId, String name, String settings, etc.) and the second is for storing its geolocation updates (Key clientGeolocationId, Key clientId, Long timestamp, Double latitude, Double longitude). Since one client has thousands of geolocation records (one for each location update) over time, I decided to use 1-to-1 unowned relationship between ClientGeolocation and Client classes.
The service lets a user see whether another user is within range (e.g. within 5 minutes' walking distance). Making this happen with JDO queries for each request would be far too resource-consuming and slow, so I put the last geolocation of each user in memcache and do the checking from there. So far so good.
The problem is when the app cold-starts and memcache is empty: I want to fill the memcache with data from storage (using a JDO query), and I simply do not know how to query for "the last geolocation record for each user who has at least one record not older than 180 minutes".
The best solution I can come up with at the moment is to do this in two parts: first, query the clientId keys of users who have records within the last 180 minutes (this will return distinct clientIds, I hope), then, for each clientId, run a query for the last geolocation record (top 1, ordered by timestamp descending). This means that if the first query returns 10,000 users, I will run 10,000 queries for the last geolocation records. I have a feeling there is a better solution for this in GAE :) .
Can you please help me write this query in a proper way?
Thank you very much for your help!
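One alternative to the 10,000-query plan is to fetch all records from the last 180 minutes in a single query and reduce them to one latest record per client in memory. The service is Java/JDO, but the reduction can be sketched in JavaScript with made-up data (field names follow the ClientGeolocation class above):

```javascript
// Given all geolocation records from the last 180 minutes (fetched
// with ONE datastore query), keep only the latest record per client.
function latestPerClient(records, nowMs) {
  const cutoff = nowMs - 180 * 60 * 1000;
  const latest = new Map();
  for (const record of records) {
    if (record.timestamp < cutoff) continue; // too old, skip
    const current = latest.get(record.clientId);
    if (!current || record.timestamp > current.timestamp) {
      latest.set(record.clientId, record);
    }
  }
  return latest; // Map of clientId -> latest record, ready for memcache
}

const HOUR = 60 * 60 * 1000;
const records = [
  { clientId: 'a', timestamp: 8 * HOUR, latitude: 47.5, longitude: 19.0 },
  { clientId: 'a', timestamp: 9 * HOUR, latitude: 47.6, longitude: 19.1 },
  { clientId: 'b', timestamp: 5 * HOUR, latitude: 48.0, longitude: 20.0 },
];
const latest = latestPerClient(records, 10 * HOUR);
// only client 'a' remains, with its 9-hour record
```

This trades one large read for many small ones: a single timestamp-range query plus one in-memory pass, instead of one query per client.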
Might this be helpful?
http://www.datanucleus.org/products/accessplatform_3_0/jdo/jdoql_subquery.html

What Database / Technology should be used to calculate unique visitors in time scope

I've got a performance problem with my reporting database (50+ tables with millions of records) when I want to calculate a distinct count on the column that indicates visitor uniqueness, say some hash key.
For example:
I have these columns:
hashkey, name, surname, visit_datetime, site, gender, etc...
I need to get the distinct count over a time span of 1 year in less than 5 seconds:
SELECT COUNT(DISTINCT hashkey) FROM table WHERE visit_datetime BETWEEN 'YYYY-MM-DD' AND 'YYYY-MM-DD'
This query is fast for short time ranges, but if the range is bigger than one month it can take more than 30 seconds.
Is there a better technology for calculating something like this than relational databases?
I'm wondering what Google Analytics uses to calculate unique visitors on the fly.
For reporting and analytics, the type of thing you're describing, these statistics tend to be pulled out, aggregated, and stored in a data warehouse, in a fashion optimized for read performance rather than the nice relational storage techniques optimized for OLTP (online transaction processing). This pre-aggregated technique is called OLAP (online analytical processing).
You could have another table store the count of unique visitors for each day, updated daily by a cron job or something.
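The daily rollup such a cron job would compute can be sketched as follows (plain JavaScript, column names from the question; the job would then write one row per day into a summary table):

```javascript
// Pre-aggregate unique visitors per day from raw visit rows, so
// reports read small daily rollups instead of scanning millions
// of raw rows.
function uniqueVisitorsPerDay(visits) {
  const perDay = new Map();
  for (const visit of visits) {
    const day = visit.visit_datetime.slice(0, 10); // 'YYYY-MM-DD'
    if (!perDay.has(day)) perDay.set(day, new Set());
    perDay.get(day).add(visit.hashkey);
  }
  return Object.fromEntries(
    [...perDay].map(([day, visitors]) => [day, visitors.size])
  );
}

const visits = [
  { hashkey: 'h1', visit_datetime: '2020-01-01 10:00:00' },
  { hashkey: 'h1', visit_datetime: '2020-01-01 12:00:00' },
  { hashkey: 'h2', visit_datetime: '2020-01-01 13:00:00' },
  { hashkey: 'h1', visit_datetime: '2020-01-02 09:00:00' },
];
const daily = uniqueVisitorsPerDay(visits);
// daily is { '2020-01-01': 2, '2020-01-02': 1 }
```

One caveat: daily unique counts cannot simply be summed into a yearly unique count, since the same visitor appears on many days; for that you would keep the per-day visitor sets, or use an approximate structure such as HyperLogLog.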
Google Analytics uses a first-party cookie, which you can see if you log Request Headers using LiveHTTPHeaders, etc.
All GA analytics parameters are packed into the Request URL, e.g.,
http://www.google-analytics.com/_utm.gif?utmwv=4&utmn=769876874&utmhn=example.com&utmcs=ISO-8859-1&utmsr=1280x1024&utmsc=32-bit&utmul=en-us&utmje=1&utmfl=9.0%20%20r115&utmcn=1&utmdt=GATC012%20setting%20variables&utmhid=2059107202&utmr=0&utmp=/auto/GATC012.html?utm_source=www.gatc012.org&utm_campaign=campaign+gatc012&utm_term=keywords+gatc012&utm_content=content+gatc012&utm_medium=medium+gatc012&utmac=UA-30138-1&utmcc=__utma%3D97315849.1774621898.1207701397.1207701397.1207701397.1%3B...
Within that URL is a piece keyed to __utmcc; these are the GA cookies. Within __utmcc is a string keyed to __utma, which is a string comprised of six fields, each delimited by a '.'. The second field is the Visitor ID, a random number generated and set by the GA server after looking for GA cookies and not finding them:
__utma%3D97315849.1774621898.1207701397.1207701397.1207701397.1
In this example, 1774621898 is the Visitor ID, intended by Google Analytics as a unique identifier for each visitor.
So you can see the flaws of this technique for identifying unique visitors: entering the site using a different browser, or a different device, or after deleting the cookies, will make you appear to GA as a new unique visitor (i.e., it looks for its cookies, doesn't find any, and so sets them).
There is an excellent article by EFF on this topic--i.e., how uniqueness can be established, and with what degree of certainty, and how it can be defeated.
Finally, one technique I have used to determine whether someone has visited our site before (assuming the hard case, i.e. that they have deleted their cookies, etc.) is to examine the client's request for our favicon. The directories that store favicons are quite often overlooked, whether during a manual sweep or programmatically using a script.
