I want to understand how to optimize read access to my Firebase database.
I have a database containing mountain refuges and routes.
Practically every time the application is opened, all the necessary data are downloaded from the database.
My idea is to save the relevant data in a JSON file, and only perform a read from the database when the user wants more information about an item.
How can I reduce read access to the database? Is it possible to save the most "relevant" data in a JSON file on the device so that it is not downloaded every time the user opens the app?
If you're asking about pricing, Firestore is billed by interactions with the database (reads, writes, deletes), not by the quantity of data (ignoring stored data for this use case).
Additionally, Queries are shallow: they only return documents in a particular collection or collection group and do not return subcollection data.
So as long as your structure supports it, showing the higher level List of Places should be the only reads you're doing until the user actually selects a place to get more details on.
If you have a million places, leverage pagination to only load enough to support the UI - say 100 at a time. That will limit the number of reads needed as well.
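A minimal sketch of that pagination pattern with the Python client, assuming a hypothetical top-level `places` collection with a `routes` subcollection (the collection and field names are illustrative, not from the question):

```python
from google.cloud import firestore

db = firestore.Client()
PAGE_SIZE = 100

def first_page():
    # Only the summary documents in the top-level collection are read;
    # subcollections (e.g. routes under a refuge) are never touched here.
    query = db.collection("places").order_by("name").limit(PAGE_SIZE)
    return list(query.stream())

def next_page(last_doc):
    # Resume after the last document of the previous page, so each page
    # costs at most PAGE_SIZE reads.
    query = (db.collection("places")
               .order_by("name")
               .start_after(last_doc)
               .limit(PAGE_SIZE))
    return list(query.stream())

def place_details(place_id):
    # The detail document and its subcollection are only read when the
    # user actually selects a place.
    doc = db.collection("places").document(place_id).get()
    routes = db.collection("places").document(place_id).collection("routes").stream()
    return doc, list(routes)
```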
I have previously done some very basic real-time applications using the help of sockets and have been reading more about it just out of curiosity. One very interesting article I read was about Operational Transformation, and I learned several new things. After reading it, I kept thinking about when or how this data is really saved to the database if I were to keep it. I have two assumptions/theories about what might be going on, but I'm not sure if they are correct and/or the best solutions to solve this issue. They are as follows:
(For this example, let's assume it's a real-time collaborative whiteboard:)
For every edit that happens (e.g. drawing a line), the socket will send a message to everyone collaborating. But at the same time, I will store the data in my database. The problem I see with this solution is the number of times I would need to access the database. For every line a user draws, I would be required to access the database to store it.
Use polling. For this theory, I think of saving all the data in temporary storage on the server, and then after 'x' amount of time, it will take all the data from the temporary storage and save it in the database. The issue with this theory is the possibility of a failure in the temporary storage (e.g. electrical failure). If the temporary storage loses its data before it is saved in the database, then I would never be able to recover it again.
How do similar real-time collaborative applications like Google Docs, Slides, etc. store the data in their databases? Are they following one of the theories I mentioned, or do they have a completely different way to store the data?
They probably rely on a log of changes + the latest document version + periodic snapshots (if they allow time traveling through the document history).
It is similar to how most databases' transaction systems work. After validating that the change is legitimate, the database writes the change to a very fast data structure on disk, i.e. the log, which only appends the changed values. This log is mirrored in memory with a dedicated data structure to speed up reads.
When a read comes in, the database checks the in-memory data structure and merges the changes with what is stored in the cache or on disk.
Periodically, the changes that are present in memory and in the log are merged into the on-disk data structure.
So to summarize, in your case:
When an Operational Transformation arrives at the server, two things happen:
It is stored in the database as is, to avoid any loss (equivalent of the log).
It updates an in-memory data structure so the change can be replayed quickly when a user requests the latest version (equivalent of the in-memory data structure).
When a user requests the latest document, the server checks the in-memory data structure and replays the changes against the last stored consolidated document, which might be lagging behind because of the following point.
Periodically, the log is applied to the "last stored consolidated document" to reduce the number of OTs that must be replayed to produce the latest document.
Anyway, the best way to have a definitive answer is to look at open-source code that does what you are looking for, e.g. etherpad.
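A minimal, single-process sketch of that log + in-memory replay + periodic snapshot scheme, using local files in place of a real database and a placeholder `apply_op` for the whiteboard example (all names are illustrative):

```python
import json

class OTStore:
    """Every incoming operation is appended to a durable log immediately,
    kept in memory for fast replay, and periodically folded into a
    consolidated snapshot. A real server would also replay the log on
    startup to rebuild the in-memory list."""

    def __init__(self, log_path, snapshot_path):
        self.log_path = log_path
        self.snapshot_path = snapshot_path
        self.pending_ops = []          # ops not yet folded into the snapshot

    def append_op(self, op):
        # 1. Durable, append-only write (equivalent of the database log).
        with open(self.log_path, "a") as log:
            log.write(json.dumps(op) + "\n")
        # 2. Keep it in memory so reads can replay it without touching disk.
        self.pending_ops.append(op)

    def latest_document(self):
        # Load the last consolidated snapshot (may lag behind) ...
        doc = self._load_snapshot()
        # ... and replay the pending operations on top of it.
        for op in self.pending_ops:
            doc = apply_op(doc, op)
        return doc

    def compact(self):
        # Periodically fold the pending ops into the snapshot and truncate the log.
        snapshot = self.latest_document()
        with open(self.snapshot_path, "w") as f:
            json.dump(snapshot, f)
        open(self.log_path, "w").close()
        self.pending_ops = []

    def _load_snapshot(self):
        try:
            with open(self.snapshot_path) as f:
                return json.load(f)
        except FileNotFoundError:
            return {"lines": []}       # empty whiteboard

def apply_op(doc, op):
    # Placeholder OT "apply": for the whiteboard example an op just adds a line.
    doc["lines"].append(op)
    return doc
```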
We are currently developing a tool to count wildlife passing through defined areas. The gadget that automatically counts the animals will be sending data (weather, # of animals passing, etc.) at a 5-minute interval via HTTP to our API. There will be hundreds of these measurement stations, and it should be scalable.
Now the question arose whether to use a filesystem or an RDBMS to save this data.
Pro DB
save exact time and date when the entry was created
directly related to area# via indexed key
Pro Filesystem
Collecting data is not as resource intensive since for every API call only 1 line will be appended to the file
Properties of the data:
only related to 1 DB entry (the area #)
the measurement stations are in remote areas, so we have to account for outages
What will be done with the data
Give an overview over time periods per area#
act as an early warning system if the # of animals is surprisingly low/high
Probably by using a cronjob and comparing to similar data
We are thinking of choosing an RDBMS to save the data, but I am worried that after millions of entries the DB will slow down and eventually stop working. This question was asked here, where 360M entries is not really considered "big data", so I'm not quite sure about my task either.
Should we choose these recommended techniques (MongoDB ...) or can this task be handled by PostgreSQL or MySQL?
I have created such a system for marine buoys. The devices send data over GPRS/Iridium using HTTP or raw TCP sockets (to minimize bandwidth).
The receiving server stores the data in a DB table, with the data provided and a timestamp.
The data is then parsed and records are created in another table.
The devices can also request UTC time from the server, thus not needing an RTC.
Before any storage is made to the "raw" table, a row is appended to a text file. This is purely for logging and for being able to recover from database downtime.
As for database type, I'd recommend a regular RDBMS. Define markers for your data. We use 4-digit codes, which gives headroom for 10,000 types of measured values.
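A minimal sketch of that pipeline (append-only text log, raw table, parsed measurements table with type codes), using sqlite3 as a stand-in for PostgreSQL/MySQL; the schema, the `area;code;value` wire format, and the code values are assumptions:

```python
import sqlite3
import time

RAW_LOG = "incoming.log"

conn = sqlite3.connect("wildlife.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS raw_messages (
        id INTEGER PRIMARY KEY,
        received_at REAL NOT NULL,
        payload TEXT NOT NULL,
        parsed INTEGER NOT NULL DEFAULT 0
    )""")
conn.execute("""
    CREATE TABLE IF NOT EXISTS measurements (
        id INTEGER PRIMARY KEY,
        area_id INTEGER NOT NULL,
        measure_code INTEGER NOT NULL,  -- e.g. 0001 = animal count, 0002 = temperature
        value REAL NOT NULL,
        measured_at REAL NOT NULL
    )""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_meas_area_time "
             "ON measurements(area_id, measured_at)")

def receive(payload: str):
    # 1. Append to a plain text file before anything else, so the message
    #    survives database downtime.
    with open(RAW_LOG, "a") as f:
        f.write(payload + "\n")
    # 2. Store the unparsed message in the raw table with a timestamp.
    conn.execute("INSERT INTO raw_messages (received_at, payload) VALUES (?, ?)",
                 (time.time(), payload))
    conn.commit()

def parse_raw():
    # 3. Periodically turn raw rows into typed measurement records.
    rows = conn.execute("SELECT id, received_at, payload FROM raw_messages "
                        "WHERE parsed = 0").fetchall()
    for row_id, received_at, payload in rows:
        area_id, measure_code, value = payload.split(";")  # assumed wire format
        conn.execute("INSERT INTO measurements (area_id, measure_code, value, measured_at) "
                     "VALUES (?, ?, ?, ?)",
                     (int(area_id), int(measure_code), float(value), received_at))
        conn.execute("UPDATE raw_messages SET parsed = 1 WHERE id = ?", (row_id,))
    conn.commit()
```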
The users on my website do operations like login, logout, update profile, change passwords etc. I am trying to come up with something that can store these user activities for my users and also return the matching records in case some system asks me for them based on userIds.
The problem, from what I can tell, looks more write-intensive (since users keep logging into the website very often and I have to record that). Once in a while (say, when a user clicks on history or some reporting team needs it), the data records are read and returned.
So my application is write-intensive.
There are various approaches I can think of.
Approach 1: The system that receives those user activities keeps writing them into a queue, and another component keeps fetching from that queue (periodically or when it is filled completely) and writes them into the database (which has been sharded, assume based on a hash of userId).
The problem with this approach is that if my activity manager runs on multiple nodes, it has to send those records to various shards, which means a lot of data movement over the network.
Approach 2: The system that receives those user activities keeps writing them into a queue, and another component keeps fetching from that queue (periodically or when it is filled completely) and writes them into a read-through/write-through cache, which would take care of writing into the database.
The problem with this approach is that I do not know if I can control where those records would be written (I mean, to which shard). Basically, I do not know if the write-through cache works that way (does it map to a local DB, or can it manage to send data to shards?).
Approach 3: The login operation is the most common user activity in my system. I can have a separate queue for logins which must be periodically flushed to disk.
Approach 4: Use some cloud-based storage which acts as an in-memory queue where data coming from all nodes is stored. This would be a reliable cache that guarantees no data loss. Periodically read from this cache and store the data into the database shards.
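A minimal sketch of the batching idea behind Approach 1 (enqueue on the request path, flush in bulk per shard in the background). The in-process queue here would lose data on a crash, which is exactly why Approach 4 swaps it for a durable, replicated queue; all names are illustrative:

```python
import hashlib
import queue
import threading

NUM_SHARDS = 4
BATCH_SIZE = 500
activity_queue = queue.Queue()

def shard_for(user_id: str) -> int:
    # Same hash-of-userId sharding assumed in Approach 1.
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_SHARDS

def record_activity(user_id: str, action: str, ts: float):
    # Producers (web nodes) only enqueue; no database work on the request path.
    activity_queue.put((user_id, action, ts))

def flush_worker():
    # A background consumer drains the queue and groups rows per shard so each
    # shard receives one bulk insert instead of many single-row writes.
    while True:
        batch = [activity_queue.get()]
        while len(batch) < BATCH_SIZE and not activity_queue.empty():
            batch.append(activity_queue.get())
        per_shard = {}
        for user_id, action, ts in batch:
            per_shard.setdefault(shard_for(user_id), []).append((user_id, action, ts))
        for shard_id, rows in per_shard.items():
            bulk_insert(shard_id, rows)

def bulk_insert(shard_id, rows):
    # Placeholder for the real bulk write to that shard's database.
    print(f"shard {shard_id}: inserting {len(rows)} rows")

threading.Thread(target=flush_worker, daemon=True).start()
```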
There are many problems to solve:
1. Ensuring I do not lose the data (what kind of data replication to use, i.e. a queue that ensures reliability)
2. Ensuring my frequent writes do not result in performance problems
3. Avoiding a single point of failure
4. Achieving infinite scale
I need suggestions based on the above, drawing on the existing solutions available.
I'm implementing an app that keeps track of how many times a post is viewed. But I'd like a 'smart' way of keeping track. This means I don't want to increase the view counter just because a user refreshes his browser.
So I decided to only increase the view counter if IP and user agent (browser) are unique. Which is working so far.
But then I thought: if YouTube is doing it this way, and they have several videos with thousands or even millions of views, this would mean that their views table in the database would be overly populated with IPs and user agents...
Which brings me to the assumption that their video table has a counter cache for views (i.e. views_count). This means, when a user clicks on a video, the IP and user agent are stored, and the counter cache column in the video table is incremented.
Every time a video is clicked, YouTube would need to query the views table and count the number of entries. Won't this affect performance drastically?
Is this how they do it? Or is there a better way?
I would leverage client side browser fingerprinting to uniquely identify view counts. This library seems to be getting significant traction:
https://github.com/Valve/fingerprintJS
I would also recommend using Redis for anything to do with counts. Its atomic increment commands are easy to use and guarantee your counts never get messed up by race conditions.
This would be the command you would want to use for incrementing your counters:
http://redis.io/commands/incr
The key in this case would be the browser fingerprint hash sent to you from the client. You could then have a Redis "set" that would contain a list of all browser fingerprints known to be associated with a given user_id (the key for the set would be the user_id).
Finally, if you really need to, you can run a cron job or other async process that dumps the view counts for each user into the counter cache field in your relational database.
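A minimal sketch of the Redis side, assuming the fingerprint hash arrives from the client as described; the key names are made up, and the set is keyed per video here (rather than per user as above) so the dedup check and the INCR sit together:

```python
import redis

r = redis.Redis()

def record_view(video_id: str, fingerprint: str) -> int:
    """Count a view only if this browser fingerprint hasn't viewed the video yet."""
    # SADD returns 1 only when the member is new, so it doubles as the
    # "have we already seen this fingerprint for this video?" check.
    is_new_viewer = r.sadd(f"video:{video_id}:viewers", fingerprint)
    if is_new_viewer:
        # Atomic increment: no race conditions even with many app servers.
        return r.incr(f"video:{video_id}:views")
    return int(r.get(f"video:{video_id}:views") or 0)
```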
You could also take the approach where you store user_id, browser fingerprint, and timestamps in a relational database (mysql?) and counter cache them into your user table periodically (probably via cron).
First of all, AFAIK, YouTube uses BigTable, so do not worry about querying the count; we don't know the exact structure of their database anyway.
Assuming that you are on a relational model, create a column view_count, but do not update it on every refresh. Record the visits and periodically update the cache.
Also, you can generate a hash from the IP, browser, date, and any other information you are using to detect whether this is a unique view, instead of storing the whole data.
Also, you can use a session/cookie to record the videos being viewed. Since it will expire, it won't be such a memory problem - I don't believe anyone is viewing thousands of videos in one session.
If you want to store all the IPs and browsers, then make sure you have enough DB storage space, add an index, and that's it.
If not, then you can use the Rails session to store the list of videos that a user has visited, and only increment the view_count attribute of a video when he's visiting a new video.
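A small sketch of the hash-based deduplication suggested above (one compact hash of IP, user agent, and date per video, instead of storing the raw rows). The in-memory set stands in for whatever session, cache, or indexed DB column actually holds the hashes, and the function names are made up:

```python
import hashlib
from datetime import date

def view_key(video_id: int, ip: str, user_agent: str) -> str:
    # One hash per (video, visitor, day); the raw IP and user agent are never stored.
    raw = f"{video_id}|{ip}|{user_agent}|{date.today().isoformat()}"
    return hashlib.sha256(raw.encode()).hexdigest()

seen = set()   # stand-in for a session store, cache, or indexed DB column

def maybe_count_view(video_id: int, ip: str, user_agent: str) -> bool:
    key = view_key(video_id, ip, user_agent)
    if key in seen:
        return False                    # refresh or repeat visit today: don't count it
    seen.add(key)
    increment_view_count(video_id)      # placeholder: bump the counter-cache column
    return True

def increment_view_count(video_id: int):
    print(f"view_count + 1 for video {video_id}")
```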
I am using Google App Engine to create a web application. The app has an entity whose records will be inserted through an upload facility by the user. The user may select up to 5K rows (objects) of data. I am using the DataNucleus project as the JDO implementation. Here is the approach I am taking for inserting the data into the datastore.
Data is read from the CSV and converted to entity objects and stored in a list.
The list is divided into smaller groups of objects, say around 300 per group.
Each group is serialized and stored in memcache using a unique ID as the key.
For each group, a task is created and inserted into the queue along with the key. Each task calls a servlet which takes this key as the input parameter, reads the data from memcache, inserts it into the datastore, and deletes the data from memcache.
The queue has a maximum rate of 2/min and the bucket size is 1. The problem I am facing is that the task is not able to insert all 300 records into the datastore. Out of 300, the maximum that gets inserted is around 50. I have validated the data once it is read from memcache and am able to get all the stored data back from memory. I am using the makePersistent method of the PersistenceManager to save data to the datastore. Can someone please tell me what the issue could be?
Also, I want to know: is there a better way of handling bulk inserts/updates of records? I have used the BulkInsert tool, but in cases like these it will not satisfy the requirement.
This is a perfect use-case for App Engine mapreduce. Mapreduce can read lines of text from a blob as input, and it will shard your input for you and execute it on the taskqueue.
When you say that the bulkloader "will not satisfy the requirement", it would help if you said what requirement you have that it doesn't satisfy, though - I presume in this case the issue is that you need non-admin users to upload data.
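Not the mapreduce approach recommended above, but a minimal Python sketch of the batch-per-task idea from the question, assuming the ndb client and the deferred library instead of Java/JDO (the entity and field names are hypothetical). Passing the rows with the task avoids staging them in memcache, which is not guaranteed to retain entries:

```python
from google.appengine.ext import ndb, deferred

class Record(ndb.Model):              # hypothetical entity
    name = ndb.StringProperty()
    value = ndb.IntegerProperty()

BATCH_SIZE = 300

def enqueue_upload(rows):
    # Split the parsed CSV rows into groups and hand each group to the task
    # queue; the payload travels with the task itself.
    for i in range(0, len(rows), BATCH_SIZE):
        deferred.defer(insert_batch, rows[i:i + BATCH_SIZE])

def insert_batch(rows):
    # One batched datastore call per task instead of one put per entity.
    entities = [Record(name=r["name"], value=int(r["value"])) for r in rows]
    ndb.put_multi(entities)
```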