Create Hugging Face Transformers Tokenizer using Amazon SageMaker in a distributed way

I am using the SageMaker HuggingFace Processor to create a custom tokenizer on a large volume of text data.
Is there a way to make this job data-distributed, i.e., read partitions of the data across nodes and train the tokenizer leveraging multiple CPUs/GPUs?
At the moment, providing more nodes to the processing cluster merely replicates the tokenization process (it basically duplicates the tokenizer creation on every node), which is redundant; you can really only scale vertically.
Any insights into this?

Consider the following behaviour of the HuggingFaceProcessor (an example configuration is sketched at the end of this answer):
If you have 100 large files in S3 and use a ProcessingInput with s3_data_distribution_type="ShardedByS3Key" (instead of FullyReplicated), the objects in your S3 prefix will be sharded and distributed to your instances.
For example, if you have 100 large files and want to filter records from them using HuggingFace on 5 instances, s3_data_distribution_type="ShardedByS3Key" will put 20 objects on each instance. Each instance can then read the files from its own local path, filter out records, and write (uniquely named) files to its output path, and SageMaker Processing will upload the filtered files to S3.
However, if your filtering criteria are stateful or depend on doing a full pass over the dataset first (such as filtering outliers based on the mean and standard deviation of a feature, e.g. when using the SKLearnProcessor), you'll need to pass that information in to the job so each instance knows how to filter. To send information to the launched instances, you can use the /opt/ml/config/resourceconfig.json file:
{ "current_host": "algo-1", "hosts": ["algo-1","algo-2","algo-3"] }

Related

How to efficiently store and retrieve aggregated data in DynamoDB?

How is aggregation achieved with DynamoDB? MongoDB and Couchbase have map-reduce support.
Let's say we are building a tech blog where users can post articles, and articles can be tagged.
user
{
  id: 1235,
  name: "John",
  ...
}
article
{
  id: 789,
  title: "dynamodb use cases",
  author: 12345, // userid
  tags: ["dynamodb", "aws", "nosql", "document database"]
}
In the user interface we want to show, for the current user, the tags and their respective counts.
How to achieve the following aggregation?
{
  userid: 12,
  tag_stats: {
    "dynamodb": 3,
    "nosql": 8
  }
}
We will provide this data through a REST API, and it will be called frequently, since this information is shown on the app's main page.
I can think of extracting all documents and doing the aggregation at the application level, but I fear my read capacity units will be exhausted.
We could use tools like EMR, Redshift, BigQuery, or AWS Lambda, but I think these are meant for data warehousing.
I would like to know about other, better ways of achieving the same thing.
How are people achieving simple dynamic queries like these, having chosen DynamoDB as the primary data store, while keeping cost and response time in check?
Long story short: DynamoDB does not support this. It's not built for this use case; it's intended for quick data access with low latency. It simply does not support any aggregation functionality.
You have three main options:
Export DynamoDB data to Redshift or EMR Hive. Then you can execute SQL queries on stale data. The benefit of this approach is that it consumes RCUs just once, but you will be working with outdated data.
Use the DynamoDB connector for Hive and query DynamoDB directly. Again you can write arbitrary SQL queries, but in this case the data in DynamoDB is accessed directly. The downside is that it consumes read capacity on every query you run.
Maintain aggregated data in a separate table using DynamoDB Streams. For example, you can have a table with UserId as the partition key and a nested map of tags and counts as an attribute. On every update to your original data, DynamoDB Streams will trigger a Lambda function or some code on your hosts to update the aggregate table. This is the most cost-efficient method, but you will need to implement additional code for each new query.
Of course you can extract data at the application level and aggregate it there, but I would not recommend doing it. Unless you have a small table, you will need to think about throttling, using just part of the provisioned capacity (you want to consume, say, 20% of your RCUs for aggregation, not 100%), and how to distribute your work among multiple workers.
Both Redshift and Hive already know how to do this: Redshift relies on multiple worker nodes when it executes a query, while Hive is built on top of MapReduce. Also, both Redshift and Hive can use a predefined percentage of your RCU throughput.
DynamoDB is a pure key/value store and does not support aggregation out of the box.
If you really want to do aggregation using DynamoDB, here are some hints.
For your particular case, let's have a table named articles.
To do the aggregation we need an extra table user-stats holding userId and tag_stats.
Enable DynamoDB Streams on the articles table.
Create a new Lambda function user-stats-aggregate which is subscribed to the articles DynamoDB stream and receives NEW_AND_OLD_IMAGES on every create/update/delete operation over the articles table.
The Lambda will perform the following logic (a rough sketch follows after these steps):
If there is no old image, take the current tags and increment each tag's count by 1 for this user. (Keep in mind there may be no initial record in user-stats for this user.)
If there is an old image, check whether tags were added or removed and apply +1 or -1 for each affected tag of that user.
Stand up an API service that retrieves these user stats.
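A minimal sketch of such a handler could look like the following. It assumes (my assumptions, not part of the answer above) that tags are stored as a DynamoDB string set and that user-stats uses userId as the partition key and tag as the sort key with a numeric count attribute, which sidesteps updating a nested map.

# Hypothetical sketch of the user-stats-aggregate Lambda (table and attribute names are assumptions)
import boto3

dynamodb = boto3.resource("dynamodb")
stats_table = dynamodb.Table("user-stats")

def _tags(image):
    # Assumes tags are stored as a string set ("SS"); adapt if you use a list ("L")
    return set(image.get("tags", {}).get("SS", [])) if image else set()

def handler(event, context):
    for record in event["Records"]:
        data = record["dynamodb"]
        old_image = data.get("OldImage")
        new_image = data.get("NewImage")
        # Stream images use DynamoDB JSON; the numeric author id arrives as a string here
        user_id = (new_image or old_image)["author"]["N"]

        added = _tags(new_image) - _tags(old_image)
        removed = _tags(old_image) - _tags(new_image)

        for tag, delta in [(t, 1) for t in added] + [(t, -1) for t in removed]:
            # ADD creates the counter if it does not exist yet; "count" is reserved, hence the alias
            stats_table.update_item(
                Key={"userId": user_id, "tag": tag},
                UpdateExpression="ADD #c :d",
                ExpressionAttributeNames={"#c": "count"},
                ExpressionAttributeValues={":d": delta},
            )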
Usually, aggregation in DynamoDB can be done using DynamoDB Streams, Lambdas doing the aggregation, and extra tables keeping the aggregated results at different granularities (minutes, hours, days, years, ...).
This gives near-real-time aggregation without the need to compute it on the fly for every request; you query the pre-aggregated data.
Basic aggregation can also be done using scan() and query() in a Lambda.
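For completeness, a small on-the-fly aggregation with scan() could look roughly like this. Table and attribute names are assumptions, and a full scan consumes RCUs in proportion to the table size, so this is only reasonable for small tables.

# Hypothetical scan-based aggregation (small tables only; names are assumptions)
from collections import Counter
import boto3
from boto3.dynamodb.conditions import Attr

articles = boto3.resource("dynamodb").Table("articles")

def tag_stats_for(user_id):
    # Pass user_id with the same type as stored (a number in the example above)
    counts = Counter()
    kwargs = {"FilterExpression": Attr("author").eq(user_id)}
    while True:
        page = articles.scan(**kwargs)
        for item in page.get("Items", []):
            counts.update(item.get("tags", []))   # works for a list or a string set
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
    return dict(counts)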

Elasticsearch - MultiLog design

We are trying to implement a new log system for our IoT devices and for different applications in the cloud (API, SPA, etc.). We are trying to design the "schema" to be as efficient as possible; we feel there are many good solutions, but it's hard to select one.
Here is the general structure: under the devices node we have our 3 different kinds of IoT devices, and similarly for infra: different applications and more.
So we were thinking of creating one index for each leaf (the blue circles in our diagram) and using a hierarchical naming scheme for our indexes, so we can take advantage of wildcards when executing searches.
For example:
logs-devices-modules
logs-devices-edges
logs-devices-modules
logs-infra-api
logs-infra-portal
And for the mapping: we have different log types in each index. Should we map only the common fields or everything? Should we map the common fields and leave dynamic mapping to handle the log-type specifics?
Please share your opinions and tips if you have any!
Thanks.
I would generally map everything to ECS, since Kibana knows the meaning of many fields and it aligns with other inputs.
How much data do you have and how different are your fields? If you don't have too much data (every shard should have >10GB; ideally managed with rollover / ILM) and fewer than 100 fields in total, I would go for a single index and add a field with the different names, so you can easily filter on that. Though different retention lengths for the data would favor multiple indices, so you will have to pick the right tradeoffs for your system.
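A rough sketch of both approaches, assuming the elasticsearch-py 8.x client and made-up index and field names:

# Hypothetical sketch with elasticsearch-py (index and field names are assumptions)
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Variant 1: hierarchical index names, so a wildcard selects a whole branch
es.index(index="logs-devices-modules", document={
    "@timestamp": datetime.now(timezone.utc).isoformat(),
    "message": "sensor timeout",
})
hits = es.search(index="logs-devices-*",                  # all device logs at once
                 query={"match": {"message": "timeout"}})

# Variant 2: one index plus a field identifying the source, and filter on that field
es.index(index="logs", document={
    "@timestamp": datetime.now(timezone.utc).isoformat(),
    "service": {"name": "infra-api"},                     # ECS-style field
    "message": "request failed",
})
# term filter assumes service.name is mapped as a keyword (as it is in ECS)
hits = es.search(index="logs",
                 query={"bool": {"filter": [{"term": {"service.name": "infra-api"}}]}})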

Is it possible to append data to an existing Solr document based on a field value?

Currently, I have two databases that share only one field. I need to append the data from one database into the documents generated by the other, but the mapping is one-to-many, such that multiple documents will have the new data appended to them. Is this possible in Solr? I've read about nested documents; however, in this case the "child" documents would be shared by many "parent" documents.
Thank you.
I see two main options:
you can write some client code using SolrJ that reads all the data needed for a given doc from all data sources (doing a SQL join, looking up a separate DB, whatever), and then writes the doc to Solr. Of course, you can (and should) do this in batches if you can.
you can index the first DB into Solr (using DIH if it's doable, so it's quick to develop). It is important that you store all fields (or use docValues) so you can get all your data back later. Then you write some client code (a rough sketch follows below) that:
a) retrieves all data about a doc
b) gets all the data that must be added from the other DB
c) builds a new representation of the doc (with child docs if needed)
d) updates the doc, overwriting it
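The answer mentions SolrJ; purely as an illustration of steps a)-d), here is a rough sketch using Python and pysolr instead. The core URL, field names and the second data source are placeholders.

# Hypothetical retrieve -> merge -> overwrite sketch (names are placeholders)
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/mycore", always_commit=True)

def fetch_extra_data(shared_key):
    # stand-in for the lookup against the second database
    return {"extra_field": ["value-from-other-db"]}

def enrich(shared_key):
    # a) retrieve all docs that carry this key (all fields must be stored or use docValues)
    for doc in solr.search(f"shared_key:{shared_key}", rows=1000):
        # b) get the data that must be added from the other DB
        extra = fetch_extra_data(shared_key)
        # c) build a new representation of the doc
        doc.pop("_version_", None)   # drop Solr's internal field before re-adding
        doc.update(extra)
        # d) overwrite the doc (same id replaces the old version)
        solr.add([doc])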

Avro hadoop random access to files

I am wondering if Avro supports random access or queries. For example, if I create an Avro file, call it B.avro, which contains 2 binary files X.png and Y.png, is it possible to access Y.png directly, without the need to iterate through the entire file? It would be great if there were a way to access the file contents directly using the file key.
If not, is there any other data structure that would allow me to do this in the Hadoop environment (SequenceFiles, HAR)? I am basically using Avro as a way to deal with a large number of small files in Hadoop, but I would also like to query these files, which makes it tough when storing them in larger collections.
Thanks.
I don't know if there is any OOTB feature that allows us to access a value through its key. But random access to Avro data files is supported via public void seek(long position), provided by DataFileReader.
You might find MapFile useful. The class MapFile.Reader allows us to get the value for a named key.
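To illustrate the idea behind MapFile (a data file plus an index mapping each key to an offset, so a lookup can seek directly instead of scanning), here is a tiny conceptual sketch in Python; it is not Hadoop or Avro code, just the underlying pattern.

# Conceptual sketch of key -> offset lookup (not Hadoop/Avro code)
import json

def pack(files, data_path="bundle.bin", index_path="bundle.idx"):
    index = {}
    with open(data_path, "wb") as out:
        for name, payload in files.items():
            index[name] = (out.tell(), len(payload))   # remember offset and length per key
            out.write(payload)
    with open(index_path, "w") as idx:
        json.dump(index, idx)

def lookup(name, data_path="bundle.bin", index_path="bundle.idx"):
    with open(index_path) as idx:
        offset, length = json.load(idx)[name]
    with open(data_path, "rb") as data:
        data.seek(offset)                              # jump straight to the record
        return data.read(length)

# pack({"X.png": x_bytes, "Y.png": y_bytes}); lookup("Y.png") returns only Y's bytes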
Please don't mind if this is not exactly what you need.

How to save R list object to a database?

Suppose I have a list of R objects which are themselves lists. Each list has a defined structure: data, a model which fits the data, and some attributes identifying the data. One example would be time series of certain economic indicators in particular countries. So my list object has the following elements:
data - the historical time series for the economic indicator
country - the name of the country, USA for example
name - the indicator name, GDP for example
model - the ARIMA orders found by auto.arima in a suitable format; this again may be a list.
This is just an example. As I said, suppose I have a number of such objects combined into a list. I would like to save it in some suitable format. The obvious solution is simply to use save, but this does not scale very well for a large number of objects. For example, if I only wanted to inspect a subset of the objects, I would need to load all of them into memory.
If my data were a data.frame I could save it to a database. If I wanted to work with a particular subset of the data I would use SELECT and rely on the database to deliver the required subset. SQLite has served me well in this regard. Is it possible to replicate this for my described list object with some fancy database like MongoDB? Or should I simply think about how to convert my list into several related tables?
My motivation for this is to be able to easily generate various reports on the fitted models. I can write a bunch of functions which produce a report on a given object and then just use lapply on my list of objects. Ideally I would like to parallelise this process, but that is another problem.
I think I explained the basics of this somewhere once before---the gist of it is that
R has complete serialization and deserialization support built in, so you can in fact take any existing R object and turn it into either a binary or textual serialization. My digest package uses that to turn the serialization into a hash using different functions.
R has all the db connectivity you need.
Now, what a suitable format and db schema is ... will depend on your specifics. But there is (as usual) nothing in R stopping you :)
This question has been inactive for a long time. Since I had a similar concern recently, I want to add the pieces of information that I've found out. I recognise these three demands in the question:
to have the data stored in a suitable structure
scalability in terms of size and access time
the possibility to efficiently read only subsets of the data
Besides the option of using a relational database, one can also use the HDF5 file format, which is designed to store a large number of possibly large objects. The choice depends on the type of data and the intended way of accessing it.
Relational databases should be favoured if:
the atomic data items are small-sized
the different data items possess the same structure
it cannot be anticipated in which subsets the data will be read out
convenient transfer of the data from one computer to another is not an issue or the computers where the data is needed have access to the database.
The HDF5 format should be preferred if:
the atomic data items are themselves large objects (e.g. matrices)
the data items are heterogeneous; it is not possible to combine them into a table-like representation
most of the time the data is read out in groups which are known in advance
moving the data from one computer to another should not require much effort
Furthermore, one can distinguish between relational and hierarchical relationships, where the latter is contained in the former. Within an HDF5 file, the chunks of information can be arranged in a hierarchical way (illustrated by the paths and the short sketch below), e.g.:
/Germany/GDP/model/...
/Germany/GNP/data
/Austria/GNP/model/...
/Austria/GDP/data
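To illustrate such a hierarchy, here is a short sketch using Python's h5py, chosen purely for brevity; the rhdf5 package mentioned below offers equivalent group/dataset operations from R, and the file and group names are just the example paths above.

# Hypothetical sketch with h5py; rhdf5 provides the same kind of group/dataset access from R
import h5py
import numpy as np

with h5py.File("indicators.h5", "w") as f:
    grp = f.create_group("Germany/GDP")
    grp.create_dataset("data", data=np.random.rand(120))   # e.g. 120 months of a series
    grp.attrs["model"] = "ARIMA(1,1,0)"                    # small metadata stored as attributes
    f.create_group("Austria/GNP").create_dataset("data", data=np.random.rand(120))

# Later, read only the subset you need without touching the rest of the file
with h5py.File("indicators.h5", "r") as f:
    germany_gdp = f["Germany/GDP/data"][:]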
The rhdf5 package for handling HDF5 files is available on Bioconductor. General information on the HDF5 format is available here.
Not sure if it is the same, but I had some good experience with time series objects with:
str()
Maybe you can look into that.
