I have a project hosted on Google App Engine, standard runtime.
I have defined a servlet which responds to a GET request with up to 200 KB of data. The problem is that the servlet buffers a very large amount of the data before actually writing it out.
I tried to put an upper limit on the buffering by doing,
resp.setBufferSize(1024);
But this makes no difference. An immediate log of the buffer size,
LOGGER.info("Using buffer of size " + resp.getBufferSize());
tells me that the buffer is of size 1024, but once the data is written, I log the buffer size again and it has grown to a very large value depending on how much data was written out.
Now, if I increase the output to a sufficiently large amount, I get an exception while writing out that data. The data itself is not stored anywhere; it is generated and written to the ServletOutputStream directly.
Error for /sizeTestServlet
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at java.io.OutputStream.write(OutputStream.java:75)
at com.google.apphosting.runtime.jetty.RpcResponseGenerator.addContent(RpcResponseGenerator.java:65)
at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:644)
at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:589)
Is there some way to disable output buffering?
App Engine instances do not support data streaming - they always return the entire response at once. If you need to stream data, you can use the flexible environment.
From
How Requests are Handled:
Streaming Responses
App Engine does not support streaming responses where data is sent in
incremental chunks to the client while a request is being processed.
All data from your code is collected as described above and sent as a
single HTTP response.
Note, however, that for responses as small as yours (up to 200 KB), streaming may be significantly less efficient and slower than returning all the data at once.
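If you do move to an environment that supports streaming (such as the flexible environment mentioned above), the usual pattern is to write in bounded chunks and flush periodically. Here is a minimal sketch; the data generator is hypothetical, and on the standard runtime the full response is still collected before being sent, so this only bounds the servlet-level buffer:

import java.io.IOException;
import javax.servlet.ServletOutputStream;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical servlet, mapped to /sizeTestServlet in web.xml.
public class SizeTestServlet extends HttpServlet {
    private static final int CHUNK = 8 * 1024;

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.setContentType("application/octet-stream");
        resp.setBufferSize(CHUNK);          // request a small container-side buffer, as in the question
        ServletOutputStream out = resp.getOutputStream();
        byte[] chunk = new byte[CHUNK];
        for (int i = 0; i < 25; i++) {      // ~200 KB total, generated on the fly
            fillWithGeneratedData(chunk);   // hypothetical generator, stands in for the real logic
            out.write(chunk);
            resp.flushBuffer();             // push the chunk; only effective where streaming is supported
        }
    }

    private void fillWithGeneratedData(byte[] buf) {
        java.util.Arrays.fill(buf, (byte) 'x');
    }
}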
Related
I am building an IoT device that will produce 200 KB of data per second, and I need to save this data to storage. I currently have about 500 devices, and I am trying to figure out the best way to store the data, and the best database for this purpose. In the past I have stored data in GCP's BigQuery and done processing with Compute Engine instance groups, but the size of the data was much smaller.
This is my best answer based upon the limited information in your question.
The first step is to document / describe what type of data you are processing. Is it structured data (SQL) or unstructured (NoSQL)? What type of queries do you need to make? How long do you need to store the data, and what is the expected total data size? This will determine the choice of the backend performing the query processing and analytics.
Next you need to look at the rate at which data is being transmitted. At 200 Kbits (or is it 200 KBytes?) times 500 devices, this is 100 Mbit/s (or 800 Mbit/s). How valuable is the data, and how tolerant is your design of data loss? What is the data transfer rate for each device (cellular, wireless, etc.) and how reliable is the connection?
To push the data into the cloud I would use Pub/Sub. Then process the data to merge, combine, compress, purge, etc., and push it to Google Cloud Storage or to BigQuery (though other options may be better, such as Cloud SQL or Cloud Datastore / Bigtable). The answer for the intermediate processor depends on the previous questions, but you will need some horsepower to process that rate of data stream. Options might be Google Cloud Dataproc running Spark or Google Cloud Dataflow.
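As a rough illustration of the ingestion side, here is a minimal publisher sketch using the Java Pub/Sub client. The project, topic, and device names are placeholders, and a real device gateway would batch publishes and handle failures rather than block on every message:

import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class DeviceReadingPublisher {
    public static void main(String[] args) throws Exception {
        // "my-project" and "device-readings" are placeholder names.
        TopicName topic = TopicName.of("my-project", "device-readings");
        Publisher publisher = Publisher.newBuilder(topic).build();
        try {
            byte[] reading = new byte[200 * 1024];   // one 200 KB sample from a device
            PubsubMessage message = PubsubMessage.newBuilder()
                    .setData(ByteString.copyFrom(reading))
                    .putAttributes("deviceId", "device-042")  // lets consumers route or partition
                    .build();
            publisher.publish(message).get();        // blocks until the server acknowledges
        } finally {
            publisher.shutdown();                    // flush pending messages and release resources
        }
    }
}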
There is a lot to consider for this type of design. My answer has raised a bunch of questions; hopefully this will help you architect a suitable solution.
You could also look at IoT Core as a possible way to handle the load-balancing piece (it auto-scales). There would be some up-front overhead registering all your devices, but it then also handles the secure connections (TLS stack plus JWTs for security on devices using IoT Core).
With 500 devices at 200 KB/s each, that sounds well within the capabilities of the system. Pub/Sub is the limiter, and it handles between 1-2 million messages per second, so it should be fine.
My requirement is to create a large number of entities in Google Cloud Datastore. I have CSV files, and combined they amount to around 50k entities. I tried the following:
1. Read a CSV file line by line and create each entity in the Datastore.
Issue: it works well, but the request times out and cannot create all the entities in one go.
2. Uploaded all files to the Blobstore and read them into the Datastore.
Issue: I tried a Mapper function to read the CSV files uploaded to the Blobstore and create entities in the Datastore. The problem is that the Mapper does not work if the file size grows larger than 2 MB. I also tried simply reading the files in a servlet, but again ran into the timeout issue.
I am looking for a way to create this large number (50k+) of entities in the Datastore in one go.
Number of entities isn't the issue here (50K is relatively trivial). Finishing your request within the deadline is the issue.
It is unclear from your question where you are processing your CSVs, so I am guessing it is part of a user request - which means you have a 60 second deadline for task completion.
Task Queues
I would suggest you look into Task Queues: when you upload a CSV that needs processing, push it into a queue for background processing.
When working with Task Queues, the tasks themselves still have a deadline, but one that is larger than 60 seconds (10 minutes when automatically scaled). You should read more about deadlines in the docs to make sure you understand how to handle them, including catching the DeadlineExceededError so that you can save where you are up to in a CSV and resume from that position when the task is retried.
Caveat on catching DeadlineExceededError
Warning: The DeadlineExceededError can potentially be raised from anywhere in your program, including finally blocks, so it could leave your program in an invalid state. This can cause deadlocks or unexpected errors in threaded code (including the built-in threading library), because locks may not be released. Note that (unlike in Java) the runtime may not terminate the process, so this could cause problems for future requests to the same instance. To be safe, you should not rely on the DeadlineExceededError, and instead ensure that your requests complete well before the time limit.
If you are concerned about the above, and cannot ensure your task completes within the 10 min deadline, you have 2 options:
Switch to a manually scaled instance, which gives you a 24-hour deadline.
Ensure your task saves its progress and returns an error well before the 10-minute deadline so that it can be resumed correctly without having to catch the error (a sketch of this pattern follows).
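To make that concrete, here is a minimal Java sketch: each task execution processes a bounded batch of rows, writes them with a single batched put, and re-enqueues itself with an updated offset. The servlet path, parameter names, and CSV helpers are hypothetical:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical task handler, mapped to /tasks/importCsv in web.xml.
public class ImportCsvTaskServlet extends HttpServlet {
    private static final int ROWS_PER_TASK = 500;   // Datastore allows up to 500 entities per put

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        String blobKey = req.getParameter("blobKey");
        int offset = Integer.parseInt(req.getParameter("offset"));

        DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
        try (BufferedReader reader = openCsv(blobKey, offset)) {   // hypothetical helper
            List<Entity> batch = new ArrayList<>();
            String line;
            int processed = 0;
            while (processed < ROWS_PER_TASK && (line = reader.readLine()) != null) {
                batch.add(toEntity(line));
                processed++;
            }
            if (!batch.isEmpty()) {
                datastore.put(batch);                              // one batched write
            }
            if (processed == ROWS_PER_TASK) {
                // More rows remain: re-enqueue with an updated offset instead of
                // risking the task deadline.
                Queue queue = QueueFactory.getDefaultQueue();
                queue.add(TaskOptions.Builder.withUrl("/tasks/importCsv")
                        .param("blobKey", blobKey)
                        .param("offset", String.valueOf(offset + processed)));
            }
        }
    }

    // Stub for illustration: a real implementation would open the blob and skip 'offset' lines.
    private BufferedReader openCsv(String blobKey, int offset) throws IOException {
        throw new UnsupportedOperationException("read the uploaded CSV from the Blobstore");
    }

    // Stub mapping of one CSV line to an entity; kind and property are made up.
    private Entity toEntity(String csvLine) {
        Entity e = new Entity("CsvRow");
        e.setProperty("raw", csvLine);
        return e;
    }
}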
I need to get all messages in the Inbox with the Gmail API, but I see only one way to do it.
Get the list of messages (id, threadId):
GET https://www.googleapis.com/gmail/v1/users/somebody%40gmail.com/messages?labelIds=INBOX&key={YOUR_API_KEY}
With the ids, get each message in a loop:
While
GET https://www.googleapis.com/gmail/v1/users/somebody%40gmail.com/messages/147199d21bbaf5a5?key={YOUR_API_KEY}
End of While
But this way requires a huge number of requests.
Does anybody have an idea how to get all messages (or just the payload field) with one request?
Use batching and request 100 messages at a time. You will still need to make around 1,000 requests, but the good news is that's quite fine and it'll be easier for everyone (no downloading a 1 GB response in a single request!).
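With the Java client library, the batching looks roughly like the sketch below. The gmail client is assumed to be already authorized, and the id list comes from users().messages().list() as in the question:

import com.google.api.client.googleapis.batch.BatchRequest;
import com.google.api.client.googleapis.batch.json.JsonBatchCallback;
import com.google.api.client.googleapis.json.GoogleJsonError;
import com.google.api.client.http.HttpHeaders;
import com.google.api.services.gmail.Gmail;
import com.google.api.services.gmail.model.Message;
import java.io.IOException;
import java.util.List;

public class InboxBatchFetcher {
    // Fetch up to 100 message bodies in a single HTTP round trip.
    static void fetchBatch(Gmail gmail, List<String> ids) throws IOException {
        BatchRequest batch = gmail.batch();
        JsonBatchCallback<Message> callback = new JsonBatchCallback<Message>() {
            @Override
            public void onSuccess(Message message, HttpHeaders responseHeaders) {
                System.out.println(message.getId() + ": " + message.getSnippet());
            }

            @Override
            public void onFailure(GoogleJsonError error, HttpHeaders responseHeaders) {
                System.err.println("Failed: " + error.getMessage());
            }
        };
        for (String id : ids) {                    // keep each batch to <= 100 calls
            gmail.users().messages().get("me", id)
                 .setFormat("full")                // or "metadata" if you only need headers
                 .queue(batch, callback);
        }
        batch.execute();                           // one round trip for the whole batch
    }
}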
Documented at:
https://developers.google.com/gmail/api/guides/batch
A few other people have asked about batching the Gmail API here on Stack Overflow, so do a quick search to find answers and examples.
The approach you are taking is correct, as there is no 'GetAll' API to download them all.
Reasons include:
Unbounded Result Sets
Pulling out an unlimited number of emails (an unbounded result set) is a resource hog on Google's servers. Did you want the attachments AND images? That could be gigabytes of data.
Network Problems
Google would have to read gigabytes from disk, hold them in memory, and send them over the internet. Google's servers could handle it, but the network bandwidth would not. Worst of all, if you issued this request again and again, you could effectively mount a DDoS attack on Google.
Security Risk
If someone obtained another user's API key, they could download that user's entire mailbox. Hence Google provides paging to ensure a more secure service and to reduce resource contention.
Therefore, paging is there to protect you, other users, and Google itself.
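For completeness, a paged listing loop with the Java client might look like the following sketch; the client is assumed to be already authorized, and 500 is the largest page size messages.list accepts:

import com.google.api.services.gmail.Gmail;
import com.google.api.services.gmail.model.ListMessagesResponse;
import com.google.api.services.gmail.model.Message;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class InboxLister {
    // Page through the Inbox and collect message ids.
    static List<Message> listInboxIds(Gmail gmail) throws IOException {
        List<Message> all = new ArrayList<>();
        String pageToken = null;
        do {
            ListMessagesResponse page = gmail.users().messages().list("me")
                    .setLabelIds(Arrays.asList("INBOX"))
                    .setMaxResults(500L)            // server-side page size cap
                    .setPageToken(pageToken)
                    .execute();
            if (page.getMessages() != null) {
                all.addAll(page.getMessages());     // only id/threadId are populated here
            }
            pageToken = page.getNextPageToken();
        } while (pageToken != null);
        return all;
    }
}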
I'm considering building an app on App Engine, and I'm trying to decide whether I should store data in the Datastore or in Google Cloud Storage.
Each object will typically be no more than around a kilobyte, perhaps a few kilobytes at most (and often less). It won't change too often.
I could have the client access the data directly, but although I could live without it, there would be some benefit to the App Engine app accessing the data and using it as part of serving a response.
What are the performance characteristics of Google Cloud Storage? How quickly do requests come back? I was able to find a status dashboard for the Datastore which indicates that it is usually reasonably quick at handling requests, but I've had trouble finding guidance on how fast GCS is.
Under the most recent price reductions, it seems like the Datastore might actually be cheaper for my use case of relatively small chunks of data ($0.06 per 100,000 requests vs $0.01 per 10,000 Class B operations). Am I interpreting that correctly?
This thread might give you some insights. Back in September 2013 I measured about 200-250 ms on average for "blank" sequential inserts. You can get a great speedup if you combine your requests: you can insert up to 500 entities in a single request, which takes roughly 500-900 ms if I remember correctly.
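A minimal Java sketch of such a combined write with the low-level Datastore API; the kind and property names are made up for illustration:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import java.util.ArrayList;
import java.util.List;

public class BatchedInsert {
    // Write up to 500 small entities with a single RPC instead of 500 sequential puts.
    static void insertBatch() {
        DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
        List<Entity> batch = new ArrayList<>();
        for (int i = 0; i < 500; i++) {
            Entity item = new Entity("SmallObject");   // hypothetical kind
            item.setProperty("payload", "about a kilobyte of data ...");
            batch.add(item);
        }
        datastore.put(batch);   // one batched call, roughly 0.5-0.9 s per the answer above
    }
}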
I need to run regularly scheduled tasks that fetch relatively large XML documents (5 MB) and process them.
A problem I am currently experiencing is that I hit the memory limits of the application instance, and the instance is terminated while my task is running.
I did some rough measurements:
The task is usually scheduled onto an instance that already uses 40-50 MB of memory.
Fetching the 5 MB text file over URL Fetch increases instance memory usage to 65-75 MB.
Decoding the fetched text into Unicode increases memory usage to 95-105 MB.
Passing the Unicode string to the lxml parser and accessing its root node increases instance memory usage to about 120-150 MB.
During the actual processing of the document (converting XML nodes to Datastore models, etc.), the instance is terminated.
I could avoid the third step and save some memory by passing the encoded text directly to the lxml parser, but specifying the encoding for the lxml parser causes some problems for me on GAE.
I can probably use the MapReduce library for this job, but is it really worthwhile for a 5 MB file?
Another option could be to split the task into several tasks.
Also, I could probably save the file to the Blobstore and then process it by reading it line by line from the Blobstore? As a side note, it would be convenient if the URL Fetch service allowed reading the response "on demand" to simplify processing of large documents.
So, generally speaking, what is the most convenient way to perform this kind of work?
Thank you!
Is this on a frontend or a backend instance? It looks like a job for a backend instance to me.
Have you considered using different instance types?
frontend types
backend types
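If you go the backend route on the Python runtime of that generation, the backend is declared in a backends.yaml file. A minimal sketch, with the name and instance class purely illustrative (check the linked docs for the current options):

# backends.yaml -- hypothetical backend for the XML-processing task.
backends:
- name: xml-processor
  class: B4          # more memory than a default frontend instance
  instances: 1
  options: dynamic   # start on demand, spin down when idle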