How to handle Sagemaker Batch Transform discarding a file with a failed model request - amazon-sagemaker

I have a large number of JSON requests for a model split across multiple files in an S3 bucket. I would like to use Sagemaker's Batch Transform feature to process all of these requests (I have done a couple of test runs using small amounts of data and the transform job succeeds). My main issue is here (https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html#batch-transform-errors), specifically:
If a batch transform job fails to process an input file because of a problem with the dataset, SageMaker marks the job as failed. If an input file contains a bad record, the transform job doesn't create an output file for that input file because doing so prevents it from maintaining the same order in the transformed data as in the input file. When your dataset has multiple input files, a transform job continues to process input files even if it fails to process one. The processed files still generate useable results.
This is not preferable, mainly because if one request fails (whether it's a transient error, a malformed request, or something wrong with the model container) in a file with a large number of requests, the output for the entire file gets discarded, even if every other request succeeded and only the last one failed. I would ideally prefer Sagemaker to just write the output (or an error) for the failed request to the file and keep going, rather than discarding the entire file.
My question is: are there any suggestions for mitigating this issue? I was thinking about storing one request per file in S3, but that seems somewhat ridiculous. Even if I did this, is there a good way of seeing which requests specifically failed after the transform job finishes?

You've got the right idea: the fewer datapoints are in each file, the less likely a given file is to fail. The issue is that while you can pass a prefix with many files to CreateTransformJob, partitioning one datapoint per file at least requires an S3 read per datapoint, plus a model invocation per datapoint, which is probably not great. Be aware also that apparently there are hidden rate limits.
Here are a couple options:
Partition into small-ish files, and plan on failures being rare. Hopefully, not many of your datapoints would actually fail. If you partition your dataset into e.g. 100 files, then a single failure only requires reprocessing 1% of your data. Note that Sagemaker has built-in retries, too, so most of the time failures should be caused by your data/logic, not randomness on Sagemaker's side.
Deal with failures directly in your model. The same doc you quoted in your question also says:
If you are using your own algorithms, you can use placeholder text, such as ERROR, when the algorithm finds a bad record in an input file. For example, if the last record in a dataset is bad, the algorithm places the placeholder text for that record in the output file.
Note that the reason Batch Transform does this whole-file failure is to maintain a 1-1 mapping between rows in the input and the output. If you can substitute the output for failed datapoints with an error message from inside your model, without actually causing the model itself to fail processing, Batch Transform will be happy.
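For example, here is a minimal sketch of that idea in Python, assuming a custom container whose request handler receives a batch of newline-delimited JSON records; handle_batch and run_model are hypothetical names for illustration, not SageMaker APIs:

    import json

    def run_model(record):
        # placeholder for your actual per-record inference
        return {"prediction": record}

    def handle_batch(payload: str) -> str:
        """Return one output line per input line, substituting a placeholder on failure."""
        outputs = []
        for line in payload.splitlines():
            try:
                record = json.loads(line)
                outputs.append(json.dumps(run_model(record)))
            except Exception as exc:
                # Keep the 1-1 input/output mapping instead of failing the whole file.
                outputs.append(json.dumps({"error": "ERROR", "detail": str(exc)}))
        return "\n".join(outputs)

Each failed record then shows up in the output file as an ERROR placeholder, which also answers how to see which requests specifically failed after the job finishes.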

Related

Process past data in multiple window operators in Apache Flink?

Context: the project I'm working on processes timestamped files that are produced periodically (1 min) and they are ingested in real time into a series of cascading window operators. The timestamp of the file indicates the event time, so I don't need to rely on the file creation time. The result of the processing of each window is sent to a sink which stores the data in several tables.
input -> 1 min -> 5 min -> 15 min -> ...
           \-> SQL  \-> SQL   \-> SQL
I am trying to come up with a solution to deal with possible downtime of the real-time process. The input files are generated independently, so in case of severe downtime of the Flink solution, I want to ingest and process the missed files as if they had been ingested by the same process.
My first thought is to configure a mode of operation of the same flow which reads only the missed files and has an allowed lateness that covers the earliest file to be processed. However, once a file has been processed it is guaranteed that no more late files will arrive, so I don't necessarily need to keep the earliest window open for the duration of the whole process, especially since there may be many files to process in this manner. Is there anything I can do to close windows earlier even with allowed lateness set, or should I instead read the whole thing as a batch operation and partition by timestamp?
Since you are ingesting the input files in order, using event time processing, I don't see why there's an issue. When the Flink job recovers, it seems like it should be able to resume from where it left off.
If I've misunderstood the situation, and you sometimes need to go back and process (or reprocess) a file from some point in the past, one way to do this would be to deploy another instance of the same job, configured to only ingest the file(s) needing to be (re)ingested. There shouldn't be any need to rewrite this as a batch job -- most streaming jobs can be run on bounded inputs. And with event time processing, this backfill job will produce the same results as if it had been run in (near) real-time.

how to make 1 million inserts in cassandra

I am parsing thousands of CSV files from my application, and for each parsed row I am making an insert into Cassandra. After letting it run for a while, it stops at 2048 inserts and throws a BusyConnection error.
What's the best way for me to make about 1 million inserts?
Should I export the inserts as strings into a file, then run that file directly from CQL to make these massive inserts, so I don't actually do it over the network?
We solve such issues using a script that goes through the input data and:
1. takes a specific amount of data from the input at a time,
2. waits for a specific amount of time,
3. continues reading and inserting data.
Regarding 1: for our configuration and data (at most 10 columns with mostly numbers and short texts) we found that 500 to 1000 rows per batch are optimal.
Regarding 2: we define the wait time as n * t, where n is the number of rows processed in a single run of the script and t is a time constant in milliseconds. The value of t strongly depends on your configuration; for us, t = 70 ms is enough to keep the process smooth.
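As an illustration, here is a minimal Python sketch of that chunk-and-wait approach using the DataStax Python driver; the contact point, keyspace, table, and tuning constants are placeholders:

    import time
    from cassandra.cluster import Cluster

    CHUNK_SIZE = 500      # "n": rows inserted per run
    T_MS = 70             # "t": per-row wait constant in milliseconds

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("my_keyspace")
    insert_stmt = session.prepare("INSERT INTO my_table (id, value) VALUES (?, ?)")

    def insert_in_chunks(rows):
        """Insert rows in fixed-size chunks, pausing n * t between chunks."""
        for start in range(0, len(rows), CHUNK_SIZE):
            chunk = rows[start:start + CHUNK_SIZE]
            for row_id, value in chunk:
                session.execute(insert_stmt, (row_id, value))
            time.sleep(len(chunk) * T_MS / 1000.0)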
1 million requests is really not such a big number; you can load it from cqlsh using the COPY FROM command. But you can load this data via your Java code as well.
From the error message it looks like you're using the asynchronous API. You can use it for high-performance inserts, but you need to control how many requests are being processed at the same time (so-called in-flight requests).
There are several aspects here:
Starting with version 3 of the protocol, you may have up to 32k in-flight requests per connection instead of the default of 1024. You can configure this when creating the Cluster object.
You need to control how many requests are in flight by wrapping session.executeAsync with some counter, for example as in this example (not the best, because it limits the total requests per session rather than per connection to individual hosts - doing that would require much more logic, especially around token-aware requests).
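For illustration, here is a minimal sketch of the same idea with the Python driver, capping in-flight requests with a semaphore; the contact point, keyspace, table, and MAX_IN_FLIGHT value are placeholders:

    from threading import Semaphore
    from cassandra.cluster import Cluster

    MAX_IN_FLIGHT = 512   # stay well below the per-connection limit

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("my_keyspace")
    insert_stmt = session.prepare("INSERT INTO my_table (id, value) VALUES (?, ?)")
    in_flight = Semaphore(MAX_IN_FLIGHT)

    def on_done(_result):
        in_flight.release()

    def on_error(exc):
        in_flight.release()
        print("insert failed:", exc)

    def insert_all(rows):
        """Issue async inserts, blocking whenever MAX_IN_FLIGHT requests are pending."""
        for row_id, value in rows:
            in_flight.acquire()   # blocks once the cap is reached
            future = session.execute_async(insert_stmt, (row_id, value))
            future.add_callbacks(on_done, on_error)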

How to best process large query results written to intermediate table in App Engine

We are running large query jobs where we hit the 128M response size and BigQuery raises the "Response too large to return. Consider setting allowLargeResults to true in your job configuration" error.
We are opting for the allowLargeResults approach to keep the already complex SQL unchanged (instead of chunking things at this level). The question is what is the best way to process the results written to the intermediate table:
Export the table to GCS, then queue tasks that process chunks of the response file using offsets into the GCS file. This introduces latency from GCS, GCS file maintenance (e.g. cleaning up files), and another point of failure (http errors/timeouts etc).
Query chunks from the intermediate tables also using queued tasks. The question here is what is the best way to chunk the rows (is there an efficient way to do this, e.g. is there an internal row number we can refer to?). We probably end up scanning the entire table for each chunk so this seems more costly than the export to GCS option.
Any experience in this area and/or recommendations?
Note that we are running on Google App Engine (Python).
Thanks!
I understand that https://cloud.google.com/bigquery/docs/reference/v2/tabledata/list will let you read chunks of a table without performing a query (and therefore without incurring data processing charges).
This also lets you read the results of a query in parallel, since every query is written to a temporary table whose ID you can pass to this method while supplying different ranges (with startIndex and maxResults).
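As a rough illustration, a worker could page through the temporary table like this (a sketch using the google-api-python-client; the project/dataset/table IDs and PAGE_SIZE are placeholders, and you would supply credentials as appropriate for App Engine):

    from googleapiclient.discovery import build

    bigquery = build("bigquery", "v2")
    PAGE_SIZE = 10000

    def read_chunk(project_id, dataset_id, table_id, start_index):
        """Read PAGE_SIZE rows starting at start_index; each worker takes its own range."""
        response = bigquery.tabledata().list(
            projectId=project_id,
            datasetId=dataset_id,
            tableId=table_id,
            startIndex=start_index,
            maxResults=PAGE_SIZE,
        ).execute()
        return response.get("rows", [])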
Note that BigQuery is able to export data in chunks - and you can request as many chunks as you have workers.
From https://cloud.google.com/bigquery/exporting-data-from-bigquery#exportingmultiple:
If you ask to export to:
['gs://my-bucket/file-name.json']
you will get an export in one GCS file, as long as it's less than 1GB.
If you ask to export to:
['gs://my-bucket/file-name-*.json']
you will get several files with each having a chunk of the total export. Useful when exporting more than 1GB.
If you ask to export to:
['gs://my-bucket/file-name-1-*.json',
'gs://my-bucket/file-name-2-*.json',
'gs://my-bucket/file-name-3-*.json']
you will get exports optimized for 3 workers. Each of these patterns will receive a series of exported chunks, so each worker can focus on its own chunks.
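For example, a sharded export for 3 workers could be requested roughly like this (a sketch using the google-cloud-bigquery client library; the bucket and table names are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()

    # One wildcard pattern per worker, as described above.
    destination_uris = [
        "gs://my-bucket/file-name-1-*.json",
        "gs://my-bucket/file-name-2-*.json",
        "gs://my-bucket/file-name-3-*.json",
    ]

    job_config = bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON)
    extract_job = client.extract_table(
        "my-project.my_dataset.my_intermediate_table",
        destination_uris,
        job_config=job_config,
    )
    extract_job.result()   # wait for the export to finish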
It is not clear what exactly the reason for "chunk processing" is. If you have some complex SQL logic that needs to run against your data and the result happens to be bigger than the current 128MB limit, you can still just do it with allowLargeResults and then consume the result the way you need to consume it.
Of course you most likely have a reason for chunking, but since it isn't explained, answering is problematic.
Another suggestion is not to ask many questions in one question - this makes answering very problematic, and you have a good chance of not getting an answer at all.
Finally, here is my answer to the only question that is relatively clear (to me, at least):
The question here is what is the best way to chunk the rows (is there an efficient way to do this, e.g. is there an internal row number we can refer to?). We probably end up scanning the entire table for each chunk so this seems more costly than the export to GCS option
It depends on how and when your table was created!
If your table was loaded in one big load, I don't see a way of avoiding scanning the table again and again.
If the table was loaded in increments, and recently, you have a chance to take advantage of so-called table decorators (specifically, you should look at range decorators).
In the very early era of BigQuery there was an expectation of having partition decorators, which would address a lot of users' needs, but they are still not available and I don't know what the plans for them are.

detecting when data has changed

Ok, so the story is like this:
-- I have lots of files (pretty big, around 25GB) that are in a particular format and need to be imported into a datastore
-- these files are continuously updated with data, sometimes new, sometimes the same data
-- I am trying to figure out an algorithm to detect whether something has changed for a particular line in a file, in order to minimize the time spent updating the database
-- the way it currently works is that I drop all the data in the database each time and then reimport it, but this won't work anymore since I'll need a timestamp for when an item changed.
-- the files contain strings and numbers (titles, orders, prices etc.)
The only solutions I could think of are:
-- compute a hash for each row from the database, compare it against the hash of the row from the file, and if they're different, update the database
-- keep 2 copies of the files, the previous one and the current one, and diff them (which is probably faster than updating the db), then update the db based on the diff.
Since the amount of data is very big, I am kind of out of options for now. In the long run I'll get rid of the files and data will be pushed straight into the database, but the problem still remains.
Any advice will be appreciated.
Problem definition as I understand it:
Let's say your file contains
ID,Name,Age
1,Jim,20
2,Tim,30
3,Kim,40
As you stated, rows can be added / updated, hence the file becomes
ID,Name,Age
1,Jim,20 -- to be discarded
2,Tim,35 -- to be updated
3,Kim,40 -- to be discarded
4,Zim,30 -- to be inserted
Now the requirement is to update the database by inserting / updating only the 2 records above, either in two SQL queries or in one batch containing two SQL statements.
I am making the following assumptions here:
You cannot modify the existing process that creates the files.
You are using some batch processing [reading from file - processing in memory - writing to DB] to upload the data into the database.
Store the hash values of each record [Name, Age] against its ID in an in-memory map, where ID is the key and the hash is the value [if you require scalability, use Hazelcast].
Your batch framework for loading the data [again assuming it treats one line of the file as one record] needs to check the computed hash value against the ID in the in-memory map. The first-time population of the map can also be done using your batch framework while reading the files.
If the ID is present in the map:
    compare the hashes
    if they are the same, discard the record
    if they are different, create an UPDATE SQL statement
If the ID is not present in the map:
    create an INSERT SQL statement and store the hash value
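As a rough sketch of this idea in Python (the CSV layout matches the ID,Name,Age example above, and seen_hashes stands in for the in-memory map / Hazelcast):

    import csv
    import hashlib

    seen_hashes = {}   # ID -> hash of (Name, Age); persist or replicate as needed

    def classify_rows(path):
        """Yield (action, row) pairs: 'insert', 'update', or 'skip'."""
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                row_hash = hashlib.md5(
                    f"{row['Name']},{row['Age']}".encode()).hexdigest()
                previous = seen_hashes.get(row["ID"])
                if previous is None:
                    seen_hashes[row["ID"]] = row_hash
                    yield "insert", row
                elif previous != row_hash:
                    seen_hashes[row["ID"]] = row_hash
                    yield "update", row
                else:
                    yield "skip", row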
You might go for parallel processing, chunk processing and in-memory data partitioning using Spring Batch and Hazelcast.
http://www.hazelcast.com/
http://static.springframework.org/spring-batch/
Hope this helps.
Instead of computing the hash for each row from the database on demand, why don't you store the hash value instead?
Then you could just compute the hash value of the file in question and compare it against the database stored ones.
Update:
Another option that came to my mind is to store the Last Modified date/time information on the database and then compare it against that of the file in question. This should work, provided the information cannot be changed either intentionally or by accident.
Well, regardless of what you use, your worst case is going to be O(n), which on n ~ 25GB of data is not so pretty - unless you can modify the process that writes to the files.
Since you are not updating all 25GB all of the time, that is your biggest potential for saving cycles.
1. Don't write randomly
Why don't you make the process that writes the data append-only? This way you'll have more data, but you'll have the full history and can track which data you have already processed (what you have already put in the datastore).
2. Keep a list of changes if you must write randomly
Alternatively, if you really must do random writes, you could keep a list of updated rows. This list can then be processed as in #1, and you can track which changes you have processed. If you want to save some space, you can keep a list of blocks in which the data changed (where a block is a unit you define).
Furthermore, you can keep checksums/hashes of changed blocks/lines. However, this might not be very interesting: it is not so cheap to compute, and direct comparison might be cheaper (if you have free CPU cycles during writing it might save you some reading time later, YMMV).
Note(s)
Both #1 and #2 are only interesting if you can make adjustments to the process that writes the data to disk.
If you cannot modify the process that writes the 25GB of data, then I don't see how checksums/hashes can help - you have to read all the data anyway to compute the hashes (since you don't know what changed), so you can compare directly while you read and come up with a list of rows to update/add (or update/add directly).
Using diff algorithms might be suboptimal: a diff algorithm will not only look for the lines that changed, but also check for the minimal edit distance between two text files given certain formatting options (in diff this can be controlled with -H or --minimal to work slower or faster, i.e. search for the exact minimal solution or use a heuristic algorithm which, if I recall correctly, becomes O(n log n); that's not bad, but still slower than the O(n) available to you if you do a direct line-by-line comparison).
Practically, this is the kind of problem that backup software has to solve, so why not use one of their standard solutions?
The best one would be to hook the WriteFile calls so that you receive a callback on each update. This would work pretty well with binary records.
Something that I cannot understand: the files are actually text files that are not just appended to, but updated? This is highly inefficient (as is the idea of keeping 2 copies of the files, because it will make file caching work even worse).

Storage of many log files

I have a system which is receiving log files from different places through http (>10k producers, 10 logs per day, ~100 lines of text each).
I would like to store them so that I can compute misc. statistics over them nightly and export them (ordered by date of arrival or by first-line content)...
My question is : what's the best way to store them ?
Flat text files (with proper locking), one file per uploaded file, one directory per day/producer
Flat text files, one (big) file per day for all producers (problem here will be indexing and locking)
Database table with text (MySQL is preferred for internal reasons) (problem with DB purge, as deletes can be very long!)
Database table with one record per line of text
Database with sharding (one table per day), allowing simple data purge. (This is partitioning; however, the version of MySQL I have access to (i.e. supported internally) does not support it.)
Document-based DB à la CouchDB or MongoDB (problem could be with indexing / maturity / speed of ingestion)
Any advice ?
(Disclaimer: I work on MongoDB.)
I think MongoDB is the best solution for logging. It is blazingly fast: it can probably insert data faster than you can send it. You can do interesting queries on the data (e.g., ranges of dates or log levels) and index any field or combination of fields. It's also nice because you can add more fields to logs on the fly ("oops, we want a stack trace field for some of these") and it won't cause problems (as it would with flat text files).
As far as stability goes, a lot of people are already using MongoDB in production (see http://www.mongodb.org/display/DOCS/Production+Deployments). We just have a few more features we want to add before we go to 1.0.
I'd pick the very first solution.
I don't see why you would need a DB at all. It seems like all you need is to scan through the data. Keep the logs in the most "raw" state, then process them and create a tarball for each day.
The only reason to aggregate would be to reduce the number of files. On some file systems, if you put more than N files in a directory, performance decreases rapidly. Check your filesystem, and if that's the case, organize a simple 2-level hierarchy, say, using the first 2 digits of the producer ID as the first-level directory name.
I would write one file per upload, and one directory/day as you first suggested. At the end of the day, run your processing over the files, and then tar.bz2 the directory.
The tarball will still be searchable, and will likely be quite small as logs can usually compress quite well.
For total data, you are talking about 1GB [corrected 10MB] a day uncompressed. This will likely compress to 100MB or less. I've seen 200x compression on my log files with bzip2. You could easily store the compressed data on a file system for years without any worries. For additional processing you can write scripts which can search the compressed tarball and generate more stats.
Since you would like to store them to be able to compute misc. statistics over them nightly and export them (ordered by date of arrival or first-line content), and you're expecting 100,000 files a day, at a total of 10,000,000 lines:
I'd suggest:
Store all the files as regular text files using the following layout: yyyymmdd/producerid/fileno.
At the end of the day, clear the database, and load all the textfiles for the day.
After loading the files, it would be easy to get the stats from the database, and post them in any format needed. (maybe even another "stats" database). You could also generate graphs.
To save space, you could compress the daily folder. Since they're text files, they will compress well.
So you would only be using the database to be able to easily aggregate the data. You could also reproduce the reports for an older day if the process didn't work, by going through the same steps.
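As a rough sketch of that layout and the end-of-day compression (the root path, producer IDs, and the incoming upload feed are placeholders):

    import os
    import tarfile
    from datetime import date

    LOG_ROOT = "/var/logs/incoming"

    def store_upload(producer_id, fileno, content):
        """Write one uploaded log file under today's yyyymmdd/producerid/ directory."""
        day_dir = os.path.join(LOG_ROOT, date.today().strftime("%Y%m%d"), str(producer_id))
        os.makedirs(day_dir, exist_ok=True)
        with open(os.path.join(day_dir, str(fileno)), "w") as f:
            f.write(content)

    def compress_day(day):
        """After the nightly load and stats run, compress the day's folder into one archive."""
        day_dir = os.path.join(LOG_ROOT, day)
        with tarfile.open(day_dir + ".tar.bz2", "w:bz2") as tar:
            tar.add(day_dir, arcname=day)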
In my experience, a single large table performs much faster than several linked tables, if we are talking about a database solution, particularly for write and delete operations. For example, splitting one table into three linked tables decreases performance 3-5 times. This is very rough and of course depends on the details, but generally that is the risk, and it gets worse when data volumes get very large. The best way, IMO, to store log data is not as flat text but in a structured form, so that you can do efficient queries and formatting later. Managing log files can be a pain, especially when there are lots of them coming from many sources and locations. Check out our solution; IMO it can save you a lot of development time.
