Is it safe to cache Gmail API message results? - gmail-api

I am iterating over a Gmail account with many thousands of messages. To save resources I am caching the results of users.messages.get. Is it safe to cache this data indefinitely? Will the data it returns ever change? I assume that it will not but so far I am unable to find anything definitive in the docs or otherwise to confirm this.

It can change, though not necessarily in a way that you care about. A message's content is immutable once it exists, but the API itself lets you (or any other client) change the labels on a message and delete messages outright, so cached labelIds can go stale and a cached message may no longer exist on the server.
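In code, that suggests caching the immutable parts of each message and refreshing only the mutable ones. A minimal sketch, assuming a `fetch_message` callable that wraps users.messages.get (the wiring to a real API client is omitted):

```python
class MessageCache:
    """Cache users.messages.get results, refreshing only the mutable parts."""

    def __init__(self, fetch_message):
        self._fetch = fetch_message          # e.g. a wrapper around users.messages.get
        self._cache = {}                     # message id -> full API response

    def get(self, msg_id):
        fresh = self._fetch(msg_id)          # in real code this could use format='minimal'
        if msg_id not in self._cache:
            self._cache[msg_id] = fresh      # first sight: cache everything
        else:
            # the payload never changes, but labels can; overwrite them
            self._cache[msg_id]['labelIds'] = fresh.get('labelIds', [])
        return self._cache[msg_id]
```

A deleted message would surface as an error from `fetch_message`, at which point the cache entry should be evicted; that handling is left out of the sketch.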

What's the best way to handle caching on updates?

I have a website, but every time I upload a new update or feature I'm afraid it won't show up for the users.
It has happened a few times: we uploaded something new, but for some users it didn't appear. The old content remained, and the update only showed up after a while.
Since I know no user is going to clear their browser cache to prevent this, I would like to know if there is anything I can do on the development side so that every time I upload something new, no user experiences problems or misses the update.
I currently use AWS services: EC2, an S3 bucket, CloudFront and Route 53.
What to do
The actions to perform are summarized, with screenshots, really elegantly here: https://stackoverflow.com/a/60049652/14077491
Why to do it
When someone makes a request to your website, CloudFront caches the result in edge locations to speed up the response time for subsequent requests. The default cache lifetime is 24 hours, but this can be modified.
You can bypass this by either (1) setting the cache expiration to a very short time span, or (2) using cache invalidation. The first is not recommended, since your users would then have to wait longer for a response more often. Cache invalidation can be performed in a couple of ways, depending on which is better for your project; read the AWS docs on cache invalidation to choose one for your use case.
I have previously added an extra cache invalidation task to my CD pipeline, which automates the process and ensures it is never forgotten. Unless you are posting many, many updates per month, it is also free.
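As a sketch of what that CD pipeline step might look like with boto3 (the distribution id and paths below are made up, and the client is passed in as a parameter so it could equally be a test stub):

```python
import time

def invalidate(client, distribution_id, paths):
    """Ask CloudFront to drop its cached copies of `paths` (use ['/*'] for everything)."""
    return client.create_invalidation(
        DistributionId=distribution_id,
        InvalidationBatch={
            'Paths': {'Quantity': len(paths), 'Items': list(paths)},
            # CallerReference must be unique per request
            'CallerReference': str(time.time()),
        },
    )

# In a real pipeline this would be called as, e.g.:
#   invalidate(boto3.client('cloudfront'), 'E123EXAMPLE', ['/*'])
```

Invalidating specific changed paths rather than `/*` keeps you further under the free monthly invalidation allowance.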

Persistent job queue?

The internet says that using a database for queues is an anti-pattern, and that you should use a message broker (RabbitMQ, Beanstalkd, etc.).
But I want all requests stored, so I can later look up how long they took, any failed attempts or errors or notes logged, who requested it and with what metadata, and what the end result was.
It looks like none of the queue libraries have this option: they don't persist the data in a way that lets you query it later.
I want what those queues do, but with a "persist to database" option. Does this not exist? How do people deal with this? Do you use a queue library and copy over all request information into your database when the request finishes?
(the language/database I'm using is anything, whatever works best for this)
If you want to log requests, and metadata about how long they took and so on, then do so: log it to the database once you know the relevant results, and run your analytic queries as you would expect to.
The reason not to use the database as the temporary store is that under high traffic, searching for and locking unprocessed jobs, and then updating or deleting them when they are complete, can take a great deal of effort. That is especially true if you don't remove jobs from the active table, and so have to search through ever more completed jobs to find the ones that have yet to be done.
You can implement a task queue yourself, using a persistent backend (such as a database) to store the queued tasks. But it may not scale well, and it is always better to use a proven implementation than to reinvent the wheel: these are tough problems to solve, and it is better to use an existing framework. For instance, if you are implementing in Python, the typical choice is Celery with a Redis or RabbitMQ backend.

Gmail API all messages

I need to get all the messages in the Inbox with the Gmail API, but I only see one way to do it.
Get the list of messages (id, threadId):
GET https://www.googleapis.com/gmail/v1/users/somebody%40gmail.com/messages?labelIds=INBOX&key={YOUR_API_KEY}
Then, with those ids, get each message in a loop:
GET https://www.googleapis.com/gmail/v1/users/somebody%40gmail.com/messages/147199d21bbaf5a5?key={YOUR_API_KEY}
But this way needs one request per message, which for a large mailbox means a huge number of requests.
Does anybody have an idea how to get all the messages (or just their payload field) with one request?
Use batch and request 100 messages at a time. You will need to make 1000 requests, but the good news is that's quite fine and it'll be easier for everyone (no downloading a 1 GB response in a single request!).
Documented at:
https://developers.google.com/gmail/api/guides/batch
A few other people have asked about batching the Gmail API here on Stack Overflow, so just do a quick search to find answers and examples.
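The batching logic itself is just grouping ids. A rough sketch, with a hypothetical `batch_get` standing in for an actual multipart batch request to the endpoint documented above:

```python
def chunks(ids, size=100):
    """Split the full id list into batch-sized groups (100 is the batch limit)."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def fetch_all(ids, batch_get):
    """Fetch every message by issuing one batch request per group of ids."""
    messages = []
    for group in chunks(ids):        # 100k ids -> 1000 batch requests
        messages.extend(batch_get(group))
    return messages
```

Each `batch_get` call would POST a multipart/mixed body containing up to 100 individual users.messages.get sub-requests, as described in the batch guide.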
The approach you are taking is correct, as there is no 'GetAll' API to download them all at once.
Reasons include:
Unbounded Result Sets
Pulling out an unlimited amount of emails (aka unbounded result set) is a resource hog on Google servers. Did you want the attachments AND images? These could be gigabytes of data.
Network Problems
Google has to read gigabytes from disk, store them in memory and send them over the internet. Google's servers would handle it, but the bandwidth of your internet connection would not. Worst of all, if you issued this request again and again, you could effectively mount a DDoS attack on Google.
Security Risk
If someone gained access to another user's API key, they could download that user's entire mailbox. Paging ensures Google can provide a more secure service and reduce resource contention.
Therefore, paging is there to protect you, other users, and Google itself.

Is there a way to make more than 10K requests on Google search from the same IP?

I am currently working on an app that requires scraping data from Google's search results, for example google.com/search?q=domain.com and so on. But Google blocks my IP address after a number of requests. I know there are Google APIs, but there are many sites around that just scrape the data directly.
Scraping Google search results is a breach of the terms of service. Google actively discourages it and blocks those who do it. They share their information with you free of charge, but they don't appreciate you trying to get a copy of all of it.
Better to do your own crawling of the domain.
Too bad I did not see your question earlier, if it's not too late:
Scraping Google does indeed violate their terms of service; on the other hand, you may choose not to accept them. You accept their ToS when you create a Google account, for example, but as far as I know you can also withdraw that acceptance (at least when they change the terms).
For a smaller amount of data you can use their API, or their commercial API, but if you need the results and rankings exactly as a user would see them (for SEO purposes), I know of no official way to get their permission.
I am not a lawyer, so you might want to consult one if you want to make sure about legal consequences.
However, scraping Google does not usually lead to legal problems. I remember that even Bing (Microsoft's engine) got caught scraping Google for unknown keywords; that happened a few years ago. My personal guess is that the majority of their original results were copied from Google in secret.
There is an open source project, http://google-rank-checker.squabbel.com, which does work to scrape large amounts of Google results. As far as I remember, without modification it is limited to about 50-70k result pages per day.
I suggest to take a look at the code, it's PHP with libcURL.
You will need proper IP addresses (not shared, not previously abused) as well. Scraping with a single IP will result in getting blocked by Google within an hour.
Usually the first thing that happens is a captcha, by solving the captcha you generate a cookie which allows you to keep making requests.
If you continue you will get a complete ban.
And if you "hammer" Google with a huge amount of requests you will alert their staff and they can put a manual ban on the whole ISP or network block.
A proper amount is around 10 requests per hour with an IP, that's what I have been sticking to on my related projects.
So if you do scrape Google, make sure you have functions that validate the results and watch for unexpected responses. In such a case your code should immediately stop accessing Google, to avoid continuing to hit a page that is just showing a captcha.
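A simple way to hold to that roughly 10-requests-per-hour pace is a small pacer object. This sketch takes an injectable clock so the arithmetic can be checked without waiting; in real use you would call `time.sleep(pacer.wait_time())` before each fetch:

```python
import time

class Pacer:
    """Space requests evenly; defaults to ~10 per hour."""

    def __init__(self, per_hour=10, clock=time.time):
        self.interval = 3600.0 / per_hour   # seconds between requests
        self.clock = clock
        self.next_ok = None                 # earliest time the next request may go out

    def wait_time(self):
        """Return how many seconds to sleep before the next request is allowed."""
        t = self.clock()
        if self.next_ok is None or t >= self.next_ok:
            self.next_ok = t + self.interval
            return 0.0
        delay = self.next_ok - t
        self.next_ok += self.interval       # reserve the slot after that one
        return delay
```

Combined with the validation described above (stop entirely on a captcha page), this keeps a single IP under the rates the answer suggests.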

What's the best way for the client app to immediately react to an update in the database?

What is the best way to program an immediate reaction to an update to data in a database?
The simplest method I could think of offhand is a thread that checks the database for a particular change to some data, waiting a predefined length of time between checks. This solution seems wasteful and suboptimal to me, so I was wondering if there is a better way.
I figure there must be some way, after all, a web application like gmail seems to be able to update my inbox almost immediately after a new email was sent to me. Surely my client isn't continually checking for updates all the time. I think the way they do this is with AJAX, but how AJAX can behave like a remote function call I don't know. I'd be curious to know how gmail does this, but what I'd most like to know is how to do this in the general case with a database.
Edit:
Please note I want to immediately react to the update in the client code, not in the database itself, so as far as I know triggers can't do this. Basically I want the USER to get a notification or have his screen updated once the change in the database has been made.
You basically have two issues here:
You want a browser to be able to receive asynchronous events from the web application server without polling in a tight loop.
You want the web application to be able to receive asynchronous events from the database without polling in a tight loop.
For Problem #1
See these wikipedia links for the type of techniques I think you are looking for:
Comet
Reverse AJAX
HTTP Server Push
EDIT: 19 Mar 2009 - Just came across ReverseHTTP which might be of interest for Problem #1.
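The core of those techniques is that the server parks the client's request until something happens, instead of answering "no news" over and over. A minimal, framework-free sketch of that idea using a threading event (a real server would hold one such channel per connected client):

```python
import threading

class Notifier:
    """Park a long-poll request until an event arrives or a timeout expires."""

    def __init__(self):
        self._event = threading.Event()
        self._payload = None

    def publish(self, payload):
        """Called when the database (or anything else) reports a change."""
        self._payload = payload
        self._event.set()

    def long_poll(self, timeout=30.0):
        """A request handler blocks here instead of returning immediately."""
        if self._event.wait(timeout):
            self._event.clear()
            return self._payload
        return None   # timed out: the client simply re-issues the request
```

The browser side then loops: issue a request, process the response (or the timeout), and immediately issue the next one, so there is almost always a request parked on the server.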
For Problem #2
The solution is going to be specific to which database you are using and probably the database driver your server uses too. For instance, with PostgreSQL you would use LISTEN and NOTIFY. (And at the risk of being down-voted, you'd probably use database triggers to call the NOTIFY command upon changes to the table's data.)
Another possible way to do this is if the database has an interface to create stored procedures or triggers that link to a dynamic library (i.e., a DLL or .so file). Then you could write the server signalling code in C or whatever.
On the same theme, some databases allow you to write stored procedures in languages such as Java, Ruby, Python and others. You might be able to use one of these (instead of something that compiles to a machine code DLL like C does) for the signalling mechanism.
Hope that gives you enough ideas to get started.
I figure there must be some way, after all, a web application like gmail seems to update my inbox almost immediately after a new email was sent to me. Surely my client isn't continually checking for updates all the time. I think the way they do this is with AJAX, but how AJAX can behave like a remote function call I don't know. I'd be curious to know how gmail does this, but what I'd most like to know is how to do this in the general case with a database.
Take a peek with Wireshark sometime... there's some Google traffic going on there quite regularly, it appears.
Depending on your DB, triggers might help. An app I wrote relies on triggers but I use a polling mechanism to actually 'know' that something has changed. Unless you can communicate the change out of the DB, some polling mechanism is necessary, I would say.
Just my two cents.
Well, the best way is a database trigger. Depends on the ability of your DBMS, which you haven't specified, to support them.
Re your edit: The way applications like Gmail do it is, in fact, with AJAX polling. Install the Tamper Data Firefox extension to see it in action. The trick there is to keep your polling query blindingly fast in the "no news" case.
Unfortunately there's no way to push data to a web browser - you can only ever send data as a response to a request - that's just the way HTTP works.
AJAX is what you want to use though: calling a web service once a second isn't excessive, provided you design the web service to ensure it receives a small amount of data, sends a small amount back, and can run very quickly to generate that response.
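One way to keep that "no news" case blindingly fast and small: have the client send the last version number it saw, and compare it against a counter that the update path bumps. The names below are illustrative, not from any particular framework:

```python
class VersionedStore:
    """Cheap change detection for a polling endpoint."""

    def __init__(self):
        self.version = 0
        self.data = None

    def update(self, data):
        """Called whenever the underlying data changes (e.g. from a DB hook)."""
        self.data = data
        self.version += 1

    def poll(self, client_version):
        """The AJAX endpoint: a tiny response when nothing has changed."""
        if client_version == self.version:
            return {'changed': False}
        return {'changed': True, 'version': self.version, 'data': self.data}
```

The common case (nothing changed) is a single integer comparison and a few bytes on the wire, which is what makes once-a-second polling tolerable.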
