I have a vendor that is feeding me real-time data over SQS with CSV data in the body of the message. It's roughly one message/minute. The size of the body can vary greatly, but let's assume that it's under 512MB.
I first thought about writing a lambda function that's triggered by their SQS queue to upload to S3 and then use Snowpipe to load externally, but that seems like overkill to me. Wouldn't it just be easier to write the body locally to /tmp and then load internally?
I'm leaning towards loading internally, so I'm looking for a convincing argument to use Snowpipe / load externally instead. What would I be missing out on by not using Snowpipe?
An internal load, as referenced in that document link, takes the file, moves it to S3, and then runs a COPY INTO statement. There really isn't much difference between that and having SQS/Lambda store the message on S3 and letting Snowpipe load it for you. I think the easier and more efficient solution would be to have Lambda store the data on S3 (using something like boto3) and have Snowpipe load it. With your method, you are downloading the message and storing it in /tmp, only to then push it back up to S3. That's more movement of the data.
If these messages were smaller in size, I would have recommended that you use a direct connection within your Lambda function and use an insert statement, but this is much slower and 300k records would take too long.
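The Lambda-to-S3 step suggested above could look roughly like the following. This is a minimal sketch, not a definitive implementation: the bucket name, the key prefix, and the helper name store_messages are assumptions made for illustration, and the Snowpipe side (stage, pipe, and notification setup) is not shown.

```python
import uuid


def store_messages(event, s3_client, bucket, prefix="incoming/"):
    """Write each SQS record body in a Lambda event to S3 as its own CSV object."""
    keys = []
    for record in event.get("Records", []):
        # SQS-triggered Lambda events carry the message text in record["body"].
        message_id = record.get("messageId") or uuid.uuid4().hex
        key = f"{prefix}{message_id}.csv"
        s3_client.put_object(
            Bucket=bucket, Key=key, Body=record["body"].encode("utf-8")
        )
        keys.append(key)
    return keys


def lambda_handler(event, context):
    # boto3 is imported here so the helper above can be exercised without it.
    import boto3

    # "my-vendor-landing-bucket" is a hypothetical bucket name.
    return store_messages(event, boto3.client("s3"), bucket="my-vendor-landing-bucket")
```

Passing the S3 client in as a parameter keeps the upload logic easy to test with a stub, and nothing is written to /tmp along the way.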
I am making a bus prediction web application for a college project. The application will use GTFS-R data, which is essentially a transit delay API that is updated regularly. In my application, I plan to use a cron job and a Python script to make regular GET requests and write the response to a JSON file, essentially creating a feed of transit updates. I have set up a GET request where the user inputs trip data that will be searched against the feed to determine if there are transit delays associated with their specific trip.
My question is - if the user sends a request at the same time as the JSON file is being updated, could this lead to issues?
One solution I was thinking of is having an intermediary JSON file, which when fully loaded will replace the file used in the search function.
I am not sure if this is a good solution or if it is even needed. I am also not sure of the semantics needed to search for solutions to similar problems so pointers in the right direction would be useful.
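For reference, the intermediary-file idea described above is usually searched for as an "atomic write" or "write to a temp file, then rename". A minimal sketch, assuming the feed fits in memory and the temp file lives in the same directory as the target (the function names are hypothetical):

```python
import json
import os
import tempfile


def write_feed_atomically(feed, path):
    """Write `feed` as JSON to a temp file, then swap it into place in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    # Keep the temp file on the same filesystem as the target so the rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(feed, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        # On POSIX systems a reader sees either the old file or the new one.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_feed(path):
    """Load the current feed; safe to call while a writer is replacing the file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

The cron script would call write_feed_atomically on each update, and the request handler would call read_feed, so a half-written file is never visible to the search function.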
To move data from Datastore to BigQuery tables I currently follow a manual and time-consuming process, that is, backing up to Google Cloud Storage and restoring to BigQuery. There is scant documentation on the restoring part, so this post is handy: http://sookocheff.com/posts/2014-08-04-restoring-an-app-engine-backup/
Now, there is a seemingly outdated article (with code) to do it https://cloud.google.com/bigquery/articles/datastoretobigquery
I have, however, been waiting for access to this experimental tester program that seems to automate the process, but have gotten no access for months: https://docs.google.com/forms/d/1HpC2B1HmtYv_PuHPsUGz_Odq0Nb43_6ySfaVJufEJTc/viewform?formkey=dHdpeXlmRlZCNWlYSE9BcE5jc2NYOUE6MQ
For some entities, I'd like to push the data to BigQuery as it comes (inserts and possibly updates). For business-intelligence-style analysis, a daily push is fine.
So, what's the best way to do it?
There are three ways of entering data into BigQuery:
through the UI
through the command line
via API
If you choose API, then you can have two different ways: "batch" mode or streaming API.
If you want to send data "as it comes" then you need to use the streaming API. Every time you detect a change in your datastore (or maybe once every few minutes, depending on your needs), you have to call the insertAll method of the API. Please note that you need to have a table created beforehand with the structure of your datastore. (This can be done via the API too if needed.)
For your second requirement, ingesting data once a day, you have the full code in the link you provided. All you need to do is adjust the JSON schema to match that of your datastore and you should be good to go.
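As a rough illustration of the streaming path, the request body for insertAll is a list of rows, each with an optional insertId and a json object holding the column values. The sketch below only builds that body; the helper name and the id_field parameter are assumptions for illustration, and authentication and the HTTP call itself are left out.

```python
import uuid


def build_insert_all_body(rows, id_field=None):
    """Build a request body for BigQuery's tabledata.insertAll method."""
    body_rows = []
    for row in rows:
        # insertId lets BigQuery de-duplicate retried rows on a best-effort basis.
        insert_id = str(row[id_field]) if id_field else uuid.uuid4().hex
        body_rows.append({"insertId": insert_id, "json": row})
    return {"rows": body_rows}


# Hypothetical entity converted to a plain dict before streaming.
body = build_insert_all_body([{"id": 7, "name": "example"}], id_field="id")
print(body)
```

Using a stable key from the entity as the insertId, rather than a random value, is what makes a retried call less likely to produce duplicate rows.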
A client's system will connect to our system via an API for a data pull. For now this data will be stored in a data mart, with, say, 50,000 records per request.
I would like to know the most efficient way of delivering the payload which originates in a SQL Azure database.
The API will be RESTful. After the request is received, I was thinking that the payload would be retrieved from the database, converted to JSON, and GZIP encoded/transferred over HTTP back to the client.
I'm concerned about the processing this may take with many clients connected and pulling a lot of data.
Would it be best to just return the straight results in clear text to the client?
Suggestions welcome.
-- UPDATE --
To clarify, this is not a web client that is connecting. The connection is made by another application to receive a one-time, daily data dump, so no pagination.
The data consists primarily of text with one binary field.
First of all: do not optimize prematurely! That means: don't sacrifice the simplicity and maintainability of your code for a gain you don't even know you need.
Let's see. 50,000 records does not really say anything without specifying the size of a record. I would advise you to start from a basic implementation and optimize when needed. So try this:
Implement a simple JSON response with those 50,000 records and try to call it from the consumer app. Measure the size of the data and the response time, and evaluate carefully whether this is really a problem for a once-a-day operation.
If yes, turn on compression for that JSON response. This usually makes a HUGE difference with JSON because of all the repetitive text. One tip here: set the content type header to "application/javascript"; Azure has dynamic compression enabled by default for this content type. Again, try it and evaluate whether the size of the data or the response time is a problem.
If it is still a problem, maybe it is time for some serialization optimization after all, but I would strongly recommend something standard and proven here (no custom CSV mess), for example Google Protocol Buffers: https://code.google.com/p/protobuf-net/
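The "measure first" steps above can be tried offline before touching the API. A small sketch, assuming made-up record fields (id, name, status) in place of the real data mart rows:

```python
import gzip
import json


def measure_payload(records):
    """Return (raw_bytes, gzipped_bytes) for a list of dicts serialized as JSON."""
    raw = json.dumps(records).encode("utf-8")
    return len(raw), len(gzip.compress(raw))


# Hypothetical sample: 50,000 records with invented fields.
sample = [
    {"id": i, "name": f"customer-{i}", "status": "active"} for i in range(50_000)
]
raw_size, gz_size = measure_payload(sample)
print(f"raw: {raw_size} bytes, gzipped: {gz_size} bytes ({gz_size / raw_size:.1%})")
```

Running this against a realistic export of your own rows gives the numbers needed to decide whether any further optimization is worth it.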
This is a bit long for a comment, so ...
The best method may well be one of those "it depends" answers.
Is it just the database that is on Azure, or is your entire hosting on Azure? I never did any production work on Azure myself.
What are you trying to optimize for -- total round-trip response time, total server CPU time, or perhaps something else?
For example, if your database server is on Azure but your web server is local, perhaps you can simply optimize the database request and depend on scaling via multiple web servers if needed.
If the data changes with each request, you should never compress it if you are trying to optimize server CPU load, but you should compress it if you are trying to optimize bandwidth usage -- either can be your bottleneck / expensive resource.
For 50K records, even JSON might be a bit verbose. If your data is a single table, you might have significant data savings by using something like CSV (including the 1st row as a record header for a sanity check if nothing else). If your result comes from joining multiple tables, i.e., it is hierarchical, using JSON would be recommended simply to avoid the complexity of rolling your own hierarchical representation.
Are you using SSL on your web server? If so, SSL could be your bottleneck (unless this is handled via other hardware).
What is the nature of the data you are sending? Is it mostly text, numbers, images? Text usually compresses well, numbers less so, and images poorly (usually). Since you suggest JSON, I would expect that you have little if any binary data though.
If compressed, JSON can be a very efficient format, since the repeated field names mostly compress out of your result. XML likewise (but less so, since the tags come in pairs).
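To see how much of the JSON overhead compresses away compared with CSV, the two encodings can be compared side by side. This is only a sketch with invented flat records; real ratios depend entirely on your data.

```python
import csv
import gzip
import io
import json


def encode_json(records):
    return json.dumps(records).encode("utf-8")


def encode_csv(records):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()))
    writer.writeheader()  # header row doubles as a sanity check for the consumer
    writer.writerows(records)
    return buf.getvalue().encode("utf-8")


def compare_formats(records):
    """Return raw and gzipped byte counts for the JSON and CSV encodings."""
    sizes = {}
    for name, payload in (("json", encode_json(records)), ("csv", encode_csv(records))):
        sizes[name] = {"raw": len(payload), "gzip": len(gzip.compress(payload))}
    return sizes


# Hypothetical single-table rows.
rows = [{"id": i, "city": "Springfield", "amount": i * 1.5} for i in range(10_000)]
for fmt, size in compare_formats(rows).items():
    print(fmt, size)
```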
ADDED
If you know what the client will be fetching beforehand and can prepare the packet data in advance, by all means do so (unless storing the prepared data is an issue). You could run this at off-peak hours, create it as a static .gz file and let IIS serve it directly when needed. Your API could simply be in 2 parts: 1) retrieve a list of static .gz files available to the client, 2) confirm processing of said files so you can delete them.
Presumably you know that JSON & XML are not as fragile as CSV, i.e., adding or deleting fields from your API is usually simple. So, if you can compress the files, you should definitely use JSON or XML -- XML is easier for some clients to parse, and to be honest if you use Json.NET or similar tools you can generate either one from the same set of definitions and information, so it is nice to be flexible. Personally, I like Json.NET quite a lot; it is simple and fast.
Normally what happens with such large requests is pagination, so included in the JSON response is a URL to request the next lot of information.
Now the next question is what is your client? e.g. a Browser or a behind the scenes application.
If it is a browser there are limitations as shown here:
http://www.ziggytech.net/technology/web-development/how-big-is-too-big-for-json/
If it is an application then your current approach of 50,000 records in a single JSON call would be acceptable. The only thing you need to watch here is the load on the DB pulling the records, especially if you have many clients.
If you are willing to use a third-party library, you can try Heavy-HTTP which solves this problem out of the box. (I'm the author of the library)
I need to run regularly scheduled tasks that fetch relatively large XML documents (5 MB) and process them.
A problem I am currently experiencing is that I hit the memory limits of the application instance, and the instance is terminated while my task is running.
I did some rough measurements:
The task is usually scheduled into an instance that already uses 40-50 MB of memory.
URL fetching of a 5 MB text file increases instance memory usage to 65-75 MB.
Decoding the fetched text into Unicode increases memory usage to 95-105 MB.
Passing the Unicode string to the lxml parser and accessing its root node increases instance memory usage to about 120-150 MB.
During actual processing of the document (converting XML nodes to datastore models, etc.) the instance is terminated.
I could avoid the 3rd step and save some memory by passing the encoded text directly to the lxml parser, but specifying the encoding for the lxml parser has some problems on GAE for me.
I can probably use the MapReduce library for this job, but is it really worthwhile for a 5 MB file?
Another option could be to split the task into several tasks.
Also I could probably save the file to blobstore, and then process it by reading it line by line from blobstore? As a side note it would be convenient if UrlFetch service allowed to read the response "on demand" to simplify processing of large documents.
So generally speaking what is the most convenient way to perform such kind of work?
Thank you!
Is this on frontend or a backend instance? Looks like a job for a backend instance to me.
Have you considered using different instance types?
frontend types
backend types
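Independent of instance type, peak memory can sometimes be reduced by parsing the fetched bytes incrementally instead of decoding them to Unicode and building the whole tree first. The sketch below uses the standard library's xml.etree.ElementTree.iterparse (lxml provides an iterparse with a similar interface). The element name item and the flat child layout are assumptions for illustration, and whether this avoids the encoding problem mentioned in the question on GAE is untested.

```python
import io
import xml.etree.ElementTree as ET


def iter_items(xml_bytes, tag="item"):
    """Yield one dict per <tag> element, releasing each element after use."""
    # Feeding bytes lets the parser honor the encoding declared in the XML prolog,
    # so no separate full-size Unicode copy of the document is needed.
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == tag:
            yield {child.tag: child.text for child in elem}
            elem.clear()  # drop this element's children and text once processed


# Hypothetical document shape.
doc = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b"<feed><item><id>1</id><name>a</name></item>"
    b"<item><id>2</id><name>b</name></item></feed>"
)
for item in iter_items(doc):
    print(item)
```

Each yielded dict could be converted to a datastore model and written in small batches, so only a few parsed records are held in memory at a time.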
I have a task endpoint that needs to process data (say >1MB file) uploaded from a frontend request. However, I do not think I can pass the data from the frontend request via TaskOptions.Builder as I will get the "Task size too large" error.
I need some kind of "temporary" data store for the uploaded data, that can be deleted once the task has successfully processed it.
Option A: Store uploaded data in memcache, pass the key to the task. This is likely going to work most of the time, except when the data is evicted BEFORE the task is processed. If this can be resolved, sounds like a great solution.
Option B: Store the data in datastore (an Entity created just for this purpose). Pass the id to the task. The task is responsible for deleting the entity when it is done.
Option C: Use the Blobstore service. This, IMHO, is similar in concept to Option B.
At the moment, I'm thinking option B is the most feasible way.
I'd appreciate any advice on the best way to handle these situations.
If you are storing data larger than 1 MB, you must use the blobstore. (Yes, you can segment the data in the datastore, but it's not worth the work.) There are two things to look out for, however. Make sure that you write the data to the blobstore in chunks smaller than 1 MB. Also, since the task queue may run a task more than once, your tasks should be idempotent: they should not fail if the requested blobstore key does not exist, since a previous run may have deleted it already.
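The missing-key case mentioned above can be handled by treating "blob not found" as "already done". The sketch below is in Python and uses a plain dict as a stand-in for the blobstore, so the names store, handle, and process_upload are hypothetical; the same pattern applies to the real blobstore API in the Java task handler.

```python
def process_upload(blob_key, store, handle):
    """Process and delete a stored upload; safe to run more than once."""
    data = store.get(blob_key)
    if data is None:
        # A previous run of this task already processed and deleted the blob.
        return False
    handle(data)
    store.pop(blob_key, None)  # delete only after processing succeeded
    return True


# A dict stands in for the temporary blob storage in this sketch.
fake_store = {"upload-1": b"uploaded bytes"}
processed = []
print(process_upload("upload-1", fake_store, processed.append))
print(process_upload("upload-1", fake_store, processed.append))
```

Deleting only after handle returns means a failed run leaves the data in place for the retry, while a duplicate run after success exits quietly.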
Option B doesn't work either, because the maximum entity size is also 1 MB, the same limit as for a task.
Option A can't give any guarantee: values can be evicted from memcache at any time, including before the expiration deadline set for the value.
IMHO, Option C is best for your needs.