How to persist large amounts of data through app closure?

I have noticed that apps like Instagram keep some data available across app closures. Even if all internet connection is removed (say, via airplane mode) and the app is closed, reopening it still shows the last loaded data, despite the fact that the app cannot call any loading functions against the database. I am curious how this is achieved, because I would like to implement something similar in my app (Xcode, Swift 4), and I do not know which method is best. I know that NSUserDefaults can persist app data, but I have read that it is meant for small, uncomplicated data, which mine is not. I know that I can store some of the data in a local SQL database via FMDB, but some of the data I would like to persist is image data, which I am not sure how to save into SQL. I also know of Core Data, but after reading through some of the documentation I am still confused about whether it fits my purpose. Which of these (or others?) would be best?
As an additional question: regardless of which persistence method I choose, every time the data is actually loaded from the remote database (when an internet connection is available), which happens in viewDidLoad, I would also need to update the local persistent copy in case the connection later drops. I am concerned that doubling my write operations like this will slow the app down. Is there any validity to this concern, or is it unavoidable anyway?
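For the image part specifically, the usual pattern is not to put the bytes in SQL at all: write them to a file and store only the path in the database. A minimal sketch of the idea, in Python rather than Swift for brevity (on iOS, FileManager plus FMDB or Core Data would play these roles; the table layout and directory here are assumptions):

import os
import sqlite3

conn = sqlite3.connect("cache.db")
conn.execute("CREATE TABLE IF NOT EXISTS images (url TEXT PRIMARY KEY, path TEXT)")
os.makedirs("images", exist_ok=True)

def cache_image(url, data):
    # Write the raw bytes to disk and record only the path in SQL.
    path = os.path.join("images", url.rsplit("/", 1)[-1])
    with open(path, "wb") as f:
        f.write(data)
    conn.execute("REPLACE INTO images (url, path) VALUES (?, ?)", (url, path))
    conn.commit()

def load_cached_image(url):
    # Returns the cached bytes for this URL, or None if never cached.
    row = conn.execute("SELECT path FROM images WHERE url = ?", (url,)).fetchone()
    if row and os.path.exists(row[0]):
        with open(row[0], "rb") as f:
            return f.read()
    return None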

Related

Large data set for web application - use an API for each query or store locally on database?

I'm building an application which uses a publicly available set of data that is rather large. I have two options to query it:
Via an API. For each query, my application would send a request using this dataset's API.
Alternatively, I could download the entire dataset (the CSV files total over 4.0 GB) and store it locally.
The type of operations and analysis I'd like to perform on the data for my web application is easily done with either method. However, I'm wondering which approach is better, and why.
The only thing I can think of is that querying a local database would be faster; on the other hand, using the API would ensure the data is up to date ("valid" data in this dataset is said to expire after 10 years, according to the organisation's website).
As you said, both options are valid; which one is better depends on your use case.
Consider the following questions:
How often is the data updated? Is it completely historical data that will never be updated, or will only new values be added while existing ones never change? How much effort would it take to update your locally stored data automatically?
How time-critical are response time and availability? Locally stored data makes you independent of network delay to the API, of an API outage, of a rate limit the service provider could introduce to throttle requests, and of the provider taking the data offline. Also: how much data is requested on average, and what is the API's response time?
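To make the trade-off concrete, here is a minimal, hypothetical sketch of a middle ground: answer every query from a local cache and hit the API only when the cached copy is older than an acceptable age. The endpoint URL, table layout, and freshness window below are assumptions for illustration, not part of the question:

import json
import sqlite3
import time
import urllib.request
from urllib.parse import urlencode

API_URL = "https://example.org/dataset/api"  # hypothetical endpoint
MAX_AGE = 24 * 3600                          # acceptable staleness: one day (assumption)

conn = sqlite3.connect("cache.db")
conn.execute("CREATE TABLE IF NOT EXISTS cache"
             " (query TEXT PRIMARY KEY, body TEXT, fetched_at REAL)")

def query_dataset(query):
    row = conn.execute("SELECT body, fetched_at FROM cache WHERE query = ?",
                       (query,)).fetchone()
    if row and time.time() - row[1] < MAX_AGE:
        return json.loads(row[0])  # fresh enough: stay local
    url = "%s?%s" % (API_URL, urlencode({"q": query}))
    with urllib.request.urlopen(url) as resp:
        body = resp.read().decode("utf-8")
    conn.execute("REPLACE INTO cache (query, body, fetched_at) VALUES (?, ?, ?)",
                 (query, body, time.time()))
    conn.commit()
    return json.loads(body)

Given that data in this set is "valid" for 10 years, even a very generous MAX_AGE would keep the local copy comfortably fresh while avoiding most API round trips.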

Tracking unused redux data in React components

I'm looking for a good way to track which props received by a component are not being used and can be safely removed.
In a system I maintain, our client single-page app fetches a large amount of data from private endpoints in our backend services via redux-saga. For most endpoints, all data received is passed directly to our React components with no filtering applied. We are working to improve overall system performance, and part of that involves reducing the amount of data returned by our backend-for-frontend services, since those themselves call a large number of services to compose the returned JSON, which adds to the overall response time.
Ideally, we want to fetch only the data we absolutely need and save the server from doing unnecessary calls and data normalization. So far, we've been trimming the backend services' responses by code inspection: we inspect the returned data for each endpoint, then inspect the front-end code, and finally remove the data we identified (as a best guess) as unused. That has proven risky and inefficient; frequently we assume some data is unused, then months later find a corner case in which it was actually needed, and have to reverse the work. I'm looking for a smart, automated way to identify unused props in my app. Has anyone else worked on something like this before? Ideas?
There's an existing library, redux-usage-report (https://github.com/aholachek/redux-usage-report), which wraps the Redux state in a proxy to identify which pieces of state are actually being used.
That may be sufficiently similar to what you're doing to be helpful, or at least give you some ideas that you can take inspiration from.
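The core idea is easy to see in miniature: wrap the state in a proxy that records every key actually read, then compare against the full payload shape. redux-usage-report itself is JavaScript; the sketch below only illustrates the technique, in Python since that's the language used elsewhere on this page:

class TrackingDict(dict):
    """A dict that records every key read from it (including nested dicts)."""

    def __init__(self, data, accessed, path=""):
        super().__init__(data)
        self._accessed = accessed
        self._path = path

    def __getitem__(self, key):
        full = "%s.%s" % (self._path, key) if self._path else str(key)
        self._accessed.add(full)
        value = super().__getitem__(key)
        if isinstance(value, dict):
            return TrackingDict(value, self._accessed, full)
        return value

accessed = set()
state = TrackingDict({"user": {"name": "a", "email": "b"}, "feed": []}, accessed)
_ = state["user"]["name"]
print(accessed)  # prints {'user', 'user.name'}; 'user.email' and 'feed' were never read

Running a real session (or a test suite) with the state wrapped this way and then diffing the accessed set against the full payload is what surfaces the fields you can stop fetching.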

Making Large IndexedDB Persistent in Browser

I am looking at building a LOB HTML5 web application. The main thing we are trying to accomplish is making the application offline-capable. This will mean taking a large chunk of SQL data from the server and storing it in the browser. It will need to live in the browser for quite a while; we don't want to continuously refresh it every time the browser is closed and reopened.
We are looking at storing the data client-side in IndexedDB, but I have read that IndexedDB is kept in temporary storage, so its lifetime cannot be relied on. Does anyone know of any strategies for prolonging its lifetime? Also, we will be pulling down massive chunks of data, so 1–5 MB of storage might not suffice for what we require.
My current thought is to persist it to browser storage using the HTML5 storage APIs and hydrate it into IndexedDB as it's required. We just need to make sure we can grow the storage limit to whatever we need.
Any advice on how to approach this?
We are looking at storing the data client-side in IndexedDB, but I have read that IndexedDB is kept in temporary storage, so its lifetime cannot be relied on.
That is technically true, but in practice I've never seen the browser actually delete data. More commonly, if you're storing a lot of data you will hit quota limits, which are annoying and sometimes inconsistent or buggy.
Regardless, you shouldn't rely on data in IndexedDB always being there forever, because users can always delete data, have their computers break without backups, etc.
If you create a Chrome extension, you can enable unlimited storage (via the unlimitedStorage permission). I've successfully stored several thousand large text documents persistently in IndexedDB using this approach.
This might be a silly question to add, but can you access the storage area outside of the browser? For instance, if I did not want a huge lag when my app starts up and loads a bunch of data, could I make an external app that "refreshes" the local data, so that when the browser starts it is ready to rock and roll?
I assume the answer here will be no, but I had to ask.
Has anyone worked around this for large data sets? For instance, loading in one tab and working in another? Or a Chrome extension to load the data, with access via the app?

When to use a certain type of persistence in Google App Engine?

First of all I'll explain the question. By persistence, I mean storing data beyond the execution of a single request. It might not be the best question title, so feel free to edit it.
The way I see it, there are four types of persistence in GAE, each one "closer" to the request itself:
The datastore
This is where all data ultimately lives. It may move into the higher layers of persistence temporarily, but in the end, this is where it really is. Unfortunately, querying the datastore repeatedly is slow and uses a lot of resources.
Use when...
storing data that should be stored for an indefinite amount of time.
Avoid using when...
getting data that is queried often but rarely updated.
memcache
This is a highly complex caching engine that stores the data in memory and makes sure all users read from/write to the same cache. It's a much faster way to get/set data on a key→value basis than using the datastore. Unfortunately, data can only stay in the memory for so long, and there is no guarantee that it will stay for as long as you tell it to; the data may disappear at any time if memory is needed elsewhere.
Use when...
you need to get data more often than you need to update it. Even when data needs to be updated often it can have its uses (if a few missed updates are acceptable), by setting up a task queue to persist data from the memcache to the datastore; see the sketch after this list.
Avoid using when...
data needs to be updated often and has to be up-to-date when fetched.
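A rough sketch of the write-behind idea mentioned under "Use when", assuming the Python 2-era App Engine SDK; the Counter model and the flush-every-100 threshold are invented for illustration:

from google.appengine.api import memcache
from google.appengine.ext import deferred, ndb

class Counter(ndb.Model):  # hypothetical example model
    count = ndb.IntegerProperty()

def increment(key):
    # Fast path: bump the value in memcache on every request.
    value = memcache.incr(key, initial_value=0)
    # Occasionally enqueue a task to write the value back to the datastore;
    # a few lost updates are considered acceptable in this scheme.
    if value is not None and value % 100 == 0:
        deferred.defer(flush_counter, key)

def flush_counter(key):
    value = memcache.get(key)
    if value is not None:
        Counter(id=key, count=value).put()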
Global variables
This isn't an official method of persisting data, but it works. It is, however, the least reliable method, and since there is no data synchronization across servers, persisted data may show up differently for different users (though from what I've found, the same user rarely switches servers). In theory this should be the method with the least overhead for getting and setting values, so it could have its uses.
Use when...
hell freezes over? I don't know... I haven't enough knowledge about what goes on behind the scenes to actually rely on this method. Discuss!
Avoid using when...
you rely on the data being the same across servers.
Cookies
If the data is user-specific, it can be efficient to store it as a cookie in the user's browser. There are some pitfalls to watch out for though:
Security – the user can meddle with cookies, and malicious people could potentially do the same. To make sure the contents are unreadable and unchangeable to all, the cookie can be encrypted, for example with the PyCrypto library, which is available on GAE (a simpler signing-only sketch follows this list).
Performance – since cookies are sent with every request (even for images), they add to the bandwidth used and slow down requests. One solution is to serve static content from another domain, so the browser won't send the cookie for that content.
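On the security point: the post suggests PyCrypto encryption; if tamper-proofing alone is enough (the contents may be readable, just not changeable), the standard library can sign cookies with no third-party dependency. A minimal sketch, with the secret key obviously a placeholder:

import hashlib
import hmac

SECRET = b"keep-this-on-the-server"  # assumption: loaded from config, not source

def sign_cookie(value):
    # Append an HMAC so any client-side tampering is detectable.
    sig = hmac.new(SECRET, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "%s|%s" % (value, sig)

def verify_cookie(cookie):
    # Returns the original value, or None if the cookie was altered.
    value, _, sig = cookie.rpartition("|")
    expected = hmac.new(SECRET, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return value if hmac.compare_digest(sig, expected) else None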
When should the different types of persistence be used? How can they be combined to reduce/even out the amount of resources being spent?
Datastore
Use the datastore to hold any long-lived information. The datastore should be used like a normal database, to hold the data your site/application works with.
MemCache
Use this to access data much more quickly than the datastore can. Memcache returns data very quickly and can be used for any data that needs to span multiple calls from users. It is normally data that was originally in the datastore and was then moved to memcache.
from google.appengine.api import memcache

def get_data():
    data = memcache.get("key")
    if data is not None:
        return data
    data = query_for_data()  # fetch from the datastore (defined elsewhere)
    memcache.add("key", data, 60)  # cache for 60 seconds
    return data
Memcache evicts the item when it expires; you set the expiry as the last parameter of the add() call shown above.
Global Variables
I wouldn't use these at all, since they can't span instances. In GAE, consecutive requests may be served by different instances (in Python, at least), so a global set during one request may simply not be visible to the next. If you want to share data this way, store it in the memcache instead.
Your post is a good summary of the 3 major options. You mostly have answered the question already. However, if you are currently building an app and stressing over whether or not you should memcache something, try this:
Write your app using the datastore for everything that needs to outlive more than one request.
Once your app (or some usable subset) is working, run some functional tests or simulations to see where the slow spots (or high quota usage) are.
Find the slowest or most inefficient request path and figure out how to make it faster: use memcache, alter your data structures so you can do gets instead of queries (see the sketch below), or possibly store something in a global instance variable*.
goto 2 until you're satisfied.
*Things that might be good for a "global" variable would be something that is relatively expensive to create/fetch, that a substantial portion of your requests will use, and that does not need to be consistent across requests/users.
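On "gets instead of queries": fetching an entity directly by key skips the index lookup a query requires, so it is both faster and cheaper. A small sketch against the ndb API (the model and values are assumptions):

from google.appengine.ext import ndb

class UserProfile(ndb.Model):  # hypothetical model, keyed by email address
    name = ndb.StringProperty()

# A query consults an index first, then fetches the entity:
profile = UserProfile.query(UserProfile.name == "alice").get()

# A get by key fetches the entity directly (and ndb caches it for you):
profile = ndb.Key(UserProfile, "alice@example.com").get()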
I use a global variable to speed up JSON conversion. Before I convert my data structure to JSON, I hash it and check if the JSON is already available. For my app this gives quite a speedup, as the pure-Python implementation is quite slow.
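That trick is plain memoization keyed on a hash of the structure; a rough sketch of what it could look like (hashing repr() is a crude assumption that only works if the structure is built deterministically):

import json

_json_cache = {}  # module-level, so it survives across requests on the same instance

def to_json(data):
    key = hash(repr(data))
    cached = _json_cache.get(key)
    if cached is None:
        cached = json.dumps(data)
        _json_cache[key] = cached
    return cached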
Global variables
To complement AutomatedTester's answer, and also to reply to his further question about how to share information between GETs without memcache or the datastore, below is a quick illustration of how to use global variables:
if 'i' not in globals():
    i = 0

def main():
    global i
    i += 1
    print 'Status: 200'
    print 'Content-type: text/plain\n'
    print i

if __name__ == '__main__':
    main()
Calling this script multiple times will give you 1, 2, 3... Of course, as mentioned earlier by Blixt, you should not count on this trick too much ('i' can sometimes switch back to zero), but it can be useful for storing user-specific information in a dictionary, session data for instance.

.NET CF mobile device application - best methodology to handle potential offline-ness?

I'm building a mobile application in VB.NET (Compact Framework), and I'm wondering how best to approach the potential offline interactions on the device. Basically, the devices have cellular and 802.11 but may still be offline (where there's poor reception, etc.). A driver will scan boxes as they leave his truck, and I want to update the new location: immediately if there's network signal, or queued and handled later if the device is offline. It made me think, though, about how to handle offline-ness in general.
Do I cache as much data to the device as I can so that I can use it while offline, so each device essentially has a copy of the (relevant) production data on it? Or is it better to disable certain functionality when offline, to avoid the headache of synchronization later? I know this is a pretty specific question that depends on my app, but I'm curious whether others have taken this route.
Do I build the application itself to act as though it's always offline, submitting everything to a local queue of sorts that's owned by a local class (essentially abstracting away the online/offline thing), and then have the class submit things to the server as it can? What about data lookups - how can those be handled in a "Semi-live" fashion?
Or should I have the application attempt to submit requests to the server directly, in real time, and handle failures as they occur? I can see the potential problem of making the user wait for a timeout, but is this the most reliable way to do it?
I'm not looking for a specific solution, but really just stories of how developers accomplish this with the smoothest user experience possible, with a link to a how-to or heres-what-to-consider or something like that. Thanks for your pointers on this!
We can't give you a definitive answer because there is no "right" answer that fits all usage scenarios. For example, if you're using SQL Server on the back end and SQL CE locally, you could set up merge replication and have the data engine handle all of this for you. That's pretty clean. Using the Offline Application Block might solve it. Using store-and-forward might be an option.
You could store locally and then roll your own synchronization with a direct connection, web service, or WCF service used when a network is detected. You could use MSMQ for delivery.
What you have to think about is not what the "right" way is, but how your implementation will affect application usability. If you disable features due to lack of connectivity, is the app still usable? If you have stale data, is that a problem? Maybe some critical data needs to be transferred when you have GSM/GPRS (which typically isn't free) and more would be done when you have 802.11. Maybe you can run all day with lookup tables pulled down in the morning and upload only transactions, with the device tracking what changes it's made.
Basically it really depends on how it's used, the nature of the data, the importance of data transactions between fielded devices, the effect of data latency, and probably other factors I can't think of offhand.
So the first step is to determine how the app needs to be used, then determine the infrastructure and architecture to provide the connectivity and data access required.
I haven't used it myself, but have you looked into the "store and forward" capabilities of the CF? It may suit your needs. I believe it uses an Exchange mailbox as a message queue to send SOAP packets to and from the device.
The best way to approach this is to always work offline, then use message queues to handle sending changes to and from the device. When the driver marks something as delivered, for example, update the item as delivered in the local store and also place a message in an outgoing queue to tell the server it's been delivered. When the connection is up, send any queued items to the server and fetch any messages the server has queued up.
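A sketch of that outbox pattern, written in Python for brevity since that's the language used elsewhere on this page (the original context is VB.NET on the Compact Framework; the table layout and the send callback are assumptions):

import json
import sqlite3

conn = sqlite3.connect("device.db")
conn.execute("CREATE TABLE IF NOT EXISTS outbox"
             " (id INTEGER PRIMARY KEY, payload TEXT)")

def mark_delivered(box_id):
    # 1. Update the local store immediately so the UI never waits on the network
    #    (local delivery tables omitted for brevity).
    # 2. Queue the change for the server.
    conn.execute("INSERT INTO outbox (payload) VALUES (?)",
                 (json.dumps({"event": "delivered", "box": box_id}),))
    conn.commit()

def drain_outbox(send):
    # Called whenever connectivity is detected. `send` posts one payload to the
    # server and raises on failure, so undelivered rows stay queued (assumption).
    rows = conn.execute("SELECT id, payload FROM outbox ORDER BY id").fetchall()
    for row_id, payload in rows:
        send(payload)
        conn.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
        conn.commit()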
