I need to run regularly scheduled tasks that fetch relatively large XML documents (~5 MB) and process them.
A problem I am currently experiencing is that I hit the memory limits of the application instance, and the instance is terminated while my task is running.
I did some rough measurements:
The task is usually scheduled onto an instance that already uses 40-50 MB of memory.
Fetching the 5 MB text file via URLFetch increases instance memory usage to 65-75 MB.
Decoding the fetched text to Unicode increases memory usage to 95-105 MB.
Passing the Unicode string to the lxml parser and accessing its root node increases instance memory usage to about 120-150 MB.
During the actual processing of the document (converting XML nodes to datastore models, etc.) the instance is terminated.
I could avoid the third step and save some memory by passing the encoded text directly to the lxml parser, but specifying the encoding for the lxml parser has some problems for me on GAE.
I could probably use the MapReduce library for this job, but is it really worthwhile for a 5 MB file?
Another option could be to split the task into several tasks.
Also, I could probably save the file to the blobstore and then process it by reading it line by line from the blobstore? As a side note, it would be convenient if the URLFetch service allowed reading the response on demand, to simplify the processing of large documents.
So, generally speaking, what is the most convenient way to perform this kind of work?
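For reference, the blobstore variant I have in mind would look roughly like this (the "item" tag and the save_item() conversion helper are placeholders for my actual schema):

    # Stream the stored response through lxml's iterparse instead of building
    # the whole tree, so memory stays roughly proportional to one element.
    from google.appengine.ext import blobstore
    from lxml import etree

    def save_item(elem):
        pass  # placeholder: convert the XML node to a datastore model here

    def process_blob(blob_key):
        reader = blobstore.BlobReader(blob_key, buffer_size=65536)
        for _event, elem in etree.iterparse(reader, tag="item"):
            save_item(elem)
            elem.clear()  # free the element's children
            while elem.getprevious() is not None:
                del elem.getparent()[0]  # drop already-processed siblings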
Thank you!
Is this on a frontend or a backend instance? It looks like a job for a backend instance to me.
Have you considered using different instance types?
frontend types
backend types
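If you do go the backend route, handing the work off from a frontend handler is just a task queue call with a target; a minimal sketch, assuming a backend named xml-worker defined in backends.yaml and a made-up handler URL:

    # Push the fetch/parse work to a backend instance, which can be given a
    # larger memory class in backends.yaml.
    from google.appengine.api import taskqueue

    def schedule_processing(feed_url):
        taskqueue.add(
            url="/tasks/process_feed",   # handler served by the backend
            params={"url": feed_url},
            target="xml-worker",         # backend name from backends.yaml
        )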
We've been playing a bit with Flink. So far we've been using Spark and standard M/R on Hadoop 2.x / YARN.
Apart from the Flink execution model on YARN, which AFAIK is not dynamic like Spark's, where executors dynamically take and release virtual cores in YARN, the main point of the question is as follows.
Flink seems just amazing: as for the streaming APIs, I'd only say they're brilliant and over the top.
Batch APIs: processing graphs are very powerful, and they are optimised and run in parallel in a unique way, leveraging cluster scalability much more than Spark and others, perfectly optimizing very complex DAGs that share common processing steps.
The only drawback I found, which I hope is just my misunderstanding or lack of knowledge, is that Flink doesn't seem to prefer data-local processing when planning batch jobs that read input from HDFS.
Unfortunately it's not a minor one, because in 90% of use cases you have big, partitioned data on HDFS, and usually you do something like:
read and filter (e.g. take only failures or successes)
aggregate, reduce, work with it
The first part, when done with plain M/R or Spark, is always planned with the "prefer local processing" idiom, so that data is processed by the same node that holds the data blocks, to be faster and to avoid data transfer over the network.
In our tests with a cluster of 3 nodes, set up specifically to test this feature and behaviour, Flink seemed to cope perfectly with HDFS blocks: e.g. if a file was made up of 3 blocks, Flink handled the 3 input splits and scheduled them in parallel.
But without the data-locality pattern.
Please share your opinions; I hope I have just missed something, or maybe it's already coming in a new version.
Thanks in advance to anyone taking the time to answer this.
Flink uses a different approach to local input split processing than Hadoop and Spark. Hadoop creates a map task for each input split, which is preferably scheduled on a node that hosts the data referenced by the split.
In contrast, Flink uses a fixed number of data source tasks, i.e., the number of data source tasks depends on the configured parallelism of the operator, not on the number of input splits. These data source tasks are started on some node in the cluster and request input splits from the master (JobManager). In the case of input splits for files in HDFS, the JobManager assigns the input splits with locality preference, so there is locality-aware reading from HDFS. However, if the number of parallel tasks is much lower than the number of HDFS nodes, many splits will be read remotely, because source tasks remain on the node on which they were started and fetch one split after the other (local ones first, remote ones later). Race conditions may also occur if your splits are very small, as the first data source task might rapidly request and process all splits before the other source tasks make their first request.
IIRC, the number of local and remote input split assignments is written to the JobManager logfile and might also be displayed in the web dashboard. That might help to debug the issue further. In case you identify a problem that does not seem to match with what I explained above, it would be great if you could get in touch with the Flink community via the user mailing list to figure out what the problem is.
My requirement is to create a large number of entities in Google Cloud Datastore. I have CSV files, and combined, the number of entities can be around 50k. I tried the following:
1. Read a CSV file line by line and create entities in the datastore.
Issue: it works well, but the request times out and cannot create all the entities in one go.
2. Uploaded all the files to the Blobstore and read them into the datastore.
Issue: I tried the Mapper function to read the CSV files uploaded to the Blobstore and create entities in the datastore, but the mapper does not work if the file size grows larger than 2 MB. I also simply tried to read the files in a servlet, but again ran into the timeout issue.
I am looking for a way to create the above large number (50k+) of entities in the datastore all in one go.
Number of entities isn't the issue here (50K is relatively trivial). Finishing your request within the deadline is the issue.
It is unclear from your question where you are processing your CSVs, so I am guessing it is part of a user request, which means you have a 60-second deadline for completing it.
Task Queues
I would suggest you look into using Task Queues: when you upload a CSV that needs processing, push a task onto a queue to process it in the background.
When working with Task Queues, the tasks themselves still have a deadline, but one that is larger than 60 seconds (10 minutes when automatically scaled). You should read more about deadlines in the docs to make sure you understand how to handle them, including catching the DeadlineExceededError, so that you can save where you are up to in a CSV and resume from that position when the task is retried.
Caveat on catching DeadlineExceededError
Warning: The DeadlineExceededError can potentially be raised from anywhere in your program, including finally blocks, so it could leave your program in an invalid state. This can cause deadlocks or unexpected errors in threaded code (including the built-in threading library), because locks may not be released. Note that (unlike in Java) the runtime may not terminate the process, so this could cause problems for future requests to the same instance. To be safe, you should not rely on the DeadlineExceededError, and instead ensure that your requests complete well before the time limit.
If you are concerned about the above, and cannot ensure your task completes within the 10 min deadline, you have 2 options:
Switch to a manually scaled instance, which gives you a 24-hour deadline.
Ensure your task saves its progress and returns an error well before the 10-minute deadline so that it can be resumed correctly without having to catch the error.
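For the second option, a rough sketch of a task handler that processes the CSV in batches, checkpoints its position, and re-enqueues itself before the deadline (the handler URL, batch size, model, and row conversion are all placeholders):

    import csv
    import time
    from google.appengine.api import taskqueue
    from google.appengine.ext import ndb

    BATCH = 500
    SOFT_LIMIT_SECS = 8 * 60  # stop well before the 10-minute deadline

    class ImportedRow(ndb.Model):
        data = ndb.StringProperty(repeated=True)

    def make_entity(row):
        # Placeholder: replace with your real row -> model conversion.
        return ImportedRow(data=row)

    def process_csv(reader, filename, start_row=0):
        start = time.time()
        batch = []
        for row_num, row in enumerate(csv.reader(reader)):
            if row_num < start_row:
                continue  # skip rows handled by a previous attempt
            batch.append(make_entity(row))
            if len(batch) >= BATCH:
                ndb.put_multi(batch)
                batch = []
            if time.time() - start > SOFT_LIMIT_SECS:
                # Flush what we have, checkpoint, and hand the rest to a new task.
                if batch:
                    ndb.put_multi(batch)
                taskqueue.add(url="/tasks/import_csv",
                              params={"filename": filename,
                                      "start_row": row_num + 1})
                return
        if batch:
            ndb.put_multi(batch)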
I am looking at using a spell checker for my GAE app. We already have an algorithm for spell checking, but I'm trying to figure out how best to store and load the dictionary files for best performance.
I am considering the following strategies:
Place the dictionary data in text file(s) in local App Engine storage and load/read them using standard IO methods (open(), read(), etc.)
Place the dictionary data in GCS and load/read using GCS IO methods
Place the dictionary data in an ndb.Model and load/cache the information
One cache I don't quite understand is the context cache -- is this a cache that is attached to a given instance? I.e. if I have a resident instance that is spun up, can I go ahead and load the dictionary data into the instance's RAM, so that accessing the data would be extremely fast (microsecond vs millisecond seek/get times)? The dictionary data will probably be a sharded list of some sort that we'll optimize for performance. Are there other data storage methods/structures I'm not considering here that may be more appropriate? Thanks.
The cache (or, by its full name, memcache) isn't exactly RAM, but it's similar. When used with NDB it acts like a buffer: when you do writes, it writes to memcache first and then to the DB. Though this may sound slower, it's not, as writes to the DB take a while before they are accessible. When it reads, it checks memcache; if the data exists there it uses that, otherwise it pulls from the DB, stores it in memcache, and then gives you the data. Just like RAM, though, it's volatile, so you cannot guarantee the information is always accessible; it's limited (depending on what type of instance you have) and can be flushed without warning or reason. You can read more here:
https://developers.google.com/appengine/docs/python/memcache/
https://developers.google.com/appengine/articles/scaling/memcache
Ultimately memcache will be the fastest and most accessible, as it is shared amongst all your instances, so if one instance pulls some data from the datastore then all of them can access it quickly. Even if the data is not yet in memcache it is still the fastest of the options, as the other ones will fill up your memory and may cause errors and performance issues.
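To make that concrete, a minimal sketch of the memcache-backed read path, assuming the dictionary is sharded into ndb entities (the model and key scheme here are made up):

    from google.appengine.api import memcache
    from google.appengine.ext import ndb

    class DictionaryShard(ndb.Model):
        words = ndb.StringProperty(repeated=True)

    def get_shard(shard_id):
        key = "dict-shard-%s" % shard_id
        words = memcache.get(key)
        if words is None:
            # Cache miss: fall back to the datastore, then repopulate memcache.
            shard = DictionaryShard.get_by_id(shard_id)
            words = shard.words if shard else []
            memcache.set(key, words)  # may be evicted at any time; that's fine
        return words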
A client's system will connect to our system via an API for a data pull. For now this data will be stored in a data mart, with, say, 50,000 records per request.
I would like to know the most efficient way of delivering the payload which originates in a SQL Azure database.
The API will be RESTful. After the request is received, I was thinking that the payload would be retrieved from the database, converted to JSON, and GZIP-encoded/transferred over HTTP back to the client.
I'm concerned about the processing this may take with many clients connected, each pulling a lot of data.
Would it be best to just return the straight results in clear text to the client?
Suggestions welcome.
-- UPDATE --
To clarify, this is not a web client that is connecting. The connection is made by another application to receive a one-time, daily data dump, so no pagination.
The data consists primarily of text with one binary field.
First of all: do not optimize prematurely! That means: don't sacrifice the simplicity and maintainability of your code for a gain you don't even know you need.
Let's see. 50,000 records does not really tell us anything without specifying the size of a record. I would advise you to start from a basic implementation and optimize when needed. So try this:
Implement a simple JSON response with those 50,000 records, and try to call it from the consumer app. Measure the size of the data and the response time, and evaluate carefully whether this is really a problem for a once-a-day operation.
If yes, turn on compression for that JSON response; this usually makes a HUGE difference with JSON because of all the repetitive text (see the sketch below). One tip here: set the content-type header to "application/javascript", since Azure has dynamic compression enabled by default for this content type. Again, try it and evaluate whether the size of the data or the response time is a problem.
If it is still a problem, maybe it is time for some serialization optimization after all, but I would strongly recommend something standard and proven here (no custom CSV mess), for example Google Protocol Buffers: https://code.google.com/p/protobuf-net/
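To get a feel for the compression step before touching the service at all, you can gzip a representative JSON sample offline; a quick sketch (the record shape here is invented, substitute a real sample from your data mart):

    import gzip
    import json

    # Fake records just to estimate the compression ratio.
    records = [
        {"id": i, "name": "customer-%d" % i, "status": "active", "balance": 123.45}
        for i in range(50000)
    ]

    raw = json.dumps(records).encode("utf-8")
    compressed = gzip.compress(raw)

    print("uncompressed: %.1f MB" % (len(raw) / 1e6))
    print("gzipped:      %.1f MB" % (len(compressed) / 1e6))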
This is a bit long for a comment, so ...
The best method may well be one of those "it depends" answers.
Is just the database on Azure, or is your entire hosting on Azure? I've never done any production work on Azure myself.
What are you trying to optimize for -- total round-trip response time, total server CPU time, or perhaps something else?
For example, if your database server is on Azure but your web server is local, perhaps you can simply optimize the database request and depend on scaling via multiple web servers if needed.
If the data changes with each request, you should never compress it if you are trying to optimize server CPU load, but you should compress it if you are trying to optimize bandwidth usage -- either can be your bottleneck / expensive resource.
For 50K records, even JSON might be a bit verbose. If your data is a single table, you might see significant data savings by using something like CSV (including the first row as a record header, for a sanity check if nothing else). If your result comes from joining multiple tables, i.e., it is hierarchical, JSON would be recommended simply to avoid the complexity of rolling your own hierarchical representation.
Are you using SSL on your web server? If so, SSL could be your bottleneck (unless this is handled by other hardware).
What is the nature of the data you are sending? Is it mostly text, numbers, images? Text usually compresses well, numbers less so, and images poorly (usually). Since you suggest JSON, I would expect that you have little if any binary data.
If compressed, JSON can be a very efficient format, since the repeated field names mostly compress out of your result. XML likewise (but less so, since the tags come in pairs).
ADDED
If you know what the client will be fetching beforehand and can prepare the packet data in advance, by all means do so (unless storing the prepared data is an issue). You could run this at off-peak hours, create it as a static .gz file, and let IIS serve it directly when needed. Your API could then simply be in two parts: 1) retrieve a list of static .gz files available to the client, 2) confirm processing of said files so you can delete them.
Presumably you know that JSON and XML are not as fragile as CSV, i.e., adding or deleting fields in your API is usually simple. So, if you can compress the files, you should definitely use JSON or XML: XML is easier for some clients to parse, and to be honest, if you use Json.NET or similar tools you can generate either one from the same set of definitions and information, so it is nice to stay flexible. Personally, I like Json.NET quite a lot; it is simple and fast.
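A sketch of that pre-generation step (the output directory and file naming are made up; run it from whatever off-peak job you already have):

    import datetime
    import gzip
    import json

    def dump_daily_extract(records, out_dir="exports"):
        # Write the day's extract as a static .gz file that the web server can
        # serve directly; `records` is an iterable of dicts from the data mart.
        name = "extract-%s.json.gz" % datetime.date.today().isoformat()
        path = "%s/%s" % (out_dir, name)
        with gzip.open(path, "wt", encoding="utf-8") as fh:
            json.dump(list(records), fh)
        return path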
Normally what happens with such large requests is pagination, so included in the JSON response is a URL to request the next lot of information.
Now the next question is: what is your client? E.g. a browser or a behind-the-scenes application?
If it is a browser there are limitations as shown here:
http://www.ziggytech.net/technology/web-development/how-big-is-too-big-for-json/
If it is an application, then your current approach of 50,000 records in a single JSON call would be acceptable; the only thing you need to watch here is the load on the DB from pulling the records, especially if you have many clients.
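A minimal sketch of the pagination idea mentioned above, using offset-based pages and a "next" URL in the response (Flask is used purely for illustration; the endpoint, page size, and data-mart helper are made up):

    from flask import Flask, jsonify, request

    app = Flask(__name__)
    PAGE_SIZE = 5000

    def fetch_from_datamart(offset, limit):
        # Placeholder for the real data-mart query.
        return [{"id": i} for i in range(offset, offset + limit)]

    @app.route("/api/records")
    def records():
        offset = int(request.args.get("offset", 0))
        page = fetch_from_datamart(offset, PAGE_SIZE)
        body = {"records": page}
        if len(page) == PAGE_SIZE:
            body["next"] = "/api/records?offset=%d" % (offset + PAGE_SIZE)
        return jsonify(body)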
If you are willing to use a third-party library, you can try Heavy-HTTP, which solves this problem out of the box (I'm the author of the library).
I have a task endpoint that needs to process data (say, a >1 MB file) uploaded from a frontend request. However, I do not think I can pass the data from the frontend request via TaskOptions.Builder, as I will get the "Task size too large" error.
I need some kind of "temporary" data store for the uploaded data, that can be deleted once the task has successfully processed it.
Option A: Store the uploaded data in memcache and pass the key to the task. This is likely going to work most of the time, except when the data is evicted BEFORE the task is processed. If this can be resolved, it sounds like a great solution.
Option B: Store the data in datastore (an Entity created just for this purpose). Pass the id to the task. The task is responsible for deleting the entity when it is done.
Option C: Use the Blobstore service. This, IMHO, is similar in concept to Option B.
At the moment, I'm thinking option B is the most feasible way.
I'd appreciate any advice on the best way to handle these situations.
If you are storing data larger than 1 MB, you must use the blobstore. (Yes, you can segment the data in the datastore, but it's not worth the work.) There are two things to look out for, however. Make sure that you write the data to the blobstore in chunks of less than 1 MB. Also, since task queue tasks should be idempotent (they may be retried), your task should not fail if the requested blobstore key does not exist, as a previous attempt may have deleted it already.
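For what it's worth, a rough sketch of the temporary-store pattern, written in Python for brevity and using the App Engine GCS client library rather than the raw Blobstore API; the bucket path, handler URL, and handle() function are all placeholders:

    import uuid
    import cloudstorage as gcs
    from google.appengine.api import app_identity, taskqueue

    BUCKET = "/" + app_identity.get_default_gcs_bucket_name()

    def handle(payload):
        pass  # placeholder for the app's actual processing

    def enqueue_upload(data):
        # Frontend request: stash the payload, then enqueue a task with its name.
        filename = "%s/uploads/%s" % (BUCKET, uuid.uuid4().hex)
        with gcs.open(filename, "w", content_type="application/octet-stream") as fh:
            fh.write(data)
        taskqueue.add(url="/tasks/process", params={"filename": filename})

    def process_task(filename):
        # Task handler: read, process, then clean up. Tolerate a missing file,
        # since a retried task may find it already deleted.
        try:
            with gcs.open(filename) as fh:
                handle(fh.read())
        except gcs.NotFoundError:
            return
        gcs.delete(filename)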
Option B doesn't work either, because the maximum entity size is also 1 MB, the same limit as for a task.
Option A can't give you any guarantees: values can be evicted from memcache at any time, and may be evicted prior to the expiration deadline set for the value.
IMHO, option C is best for your needs.