Most effective way to transfer almost 400k images to S3

I am currently in charge of moving a site from its current server to EC2. That part of the project is done and fine; the part I am struggling with is the images. The site has almost 400K images, sorted into different folders under a main userimg folder, and the client wants all of these images stored on S3.
The main problem is how to transfer almost 400,000 images from the server to S3. I have been using http://s3tools.org/s3cmd, which is brilliant, but transferring the userimg folder with s3cmd would take almost 3 days solid, and if the connection breaks (or some similar problem occurs) I will end up with some images on S3 and some not, with no way to continue the process...
Can anyone suggest a solution? Has anyone come up against a problem like this before?

I would suggest you write (or get someone to write) a simple Java utility that:
1. Reads the structure of your client directories (if needed).
2. For every image, creates a corresponding key on S3 (according to the file structure read in step 1) and starts a multi-part upload in parallel using the AWS SDK or the JetS3t API.
I did it for our client. It is less than 200 lines of Java code and it is very reliable.
Below is the part that does the multi-part upload; the part that reads the file structure is trivial (a sketch of how the two could be wired together in parallel follows the code).
/**
 * Uploads file to Amazon S3. Creates the specified bucket if it does not exist.
 * The upload is done in chunks of CHUNK_SIZE size (multi-part upload).
 * Attempts to handle upload exceptions gracefully up to MAX_RETRY times per single chunk.
 *
 * @param accessKey - Amazon account access key
 * @param secretKey - Amazon account secret key
 * @param directoryName - directory path where the file resides
 * @param keyName - the name of the file to upload
 * @param bucketName - the name of the bucket to upload to
 * @throws Exception - in case that something goes wrong
 */
public void uploadFileToS3(String accessKey,
                           String secretKey,
                           String directoryName,
                           String keyName, // that is the file name that will be created after upload completed
                           String bucketName) throws Exception {

    // Create a credentials object and service to access S3 account
    AWSCredentials myCredentials = new BasicAWSCredentials(accessKey, secretKey);
    String filePath = directoryName
            + System.getProperty("file.separator")
            + keyName;
    log.info("uploadFileToS3 is about to upload file [" + filePath + "]");
    AmazonS3 s3Client = new AmazonS3Client(myCredentials);

    // Create a list of UploadPartResponse objects. You get one of these
    // for each part upload.
    List<PartETag> partETags = new ArrayList<PartETag>();

    // make sure that the bucket exists
    createBucketIfNotExists(bucketName, accessKey, secretKey);

    // delete the file from bucket if it already exists there
    s3Client.deleteObject(bucketName, keyName);

    // Initialize.
    InitiateMultipartUploadRequest initRequest = new InitiateMultipartUploadRequest(bucketName, keyName);
    InitiateMultipartUploadResult initResponse = s3Client.initiateMultipartUpload(initRequest);

    File file = new File(filePath);
    long contentLength = file.length();
    long partSize = CHUNK_SIZE; // Set part size to 5 MB.
    int numOfParts = 1;
    if (contentLength > CHUNK_SIZE) {
        numOfParts = (int) (contentLength / partSize);
        if (contentLength % CHUNK_SIZE != 0) {
            numOfParts++;
        }
    }
    try {
        // Step 2: Upload parts.
        long filePosition = 0;
        for (int i = 1; filePosition < contentLength; i++) {
            // Last part can be less than 5 MB. Adjust part size.
            partSize = Math.min(partSize, (contentLength - filePosition));
            log.info("Start uploading part[" + i + "] of [" + numOfParts + "]");

            // Create request to upload a part.
            UploadPartRequest uploadRequest = new UploadPartRequest()
                    .withBucketName(bucketName).withKey(keyName)
                    .withUploadId(initResponse.getUploadId()).withPartNumber(i)
                    .withFileOffset(filePosition)
                    .withFile(file)
                    .withPartSize(partSize);

            // repeat the upload until it succeeds or reaches the retry limit
            boolean anotherPass;
            int retryCount = 0;
            do {
                anotherPass = false; // assume everything is ok
                try {
                    log.info("Uploading part[" + i + "]");
                    // Upload part and add response to our list.
                    partETags.add(s3Client.uploadPart(uploadRequest).getPartETag());
                    log.info("Finished uploading part[" + i + "] of [" + numOfParts + "]");
                } catch (Exception e) {
                    log.error("Failed uploading part[" + i + "] due to exception. Will retry... Exception: ", e);
                    anotherPass = true; // repeat
                    retryCount++;
                }
            } while (anotherPass && retryCount < CloudUtilsService.MAX_RETRY);

            // give up on the whole upload if this part could not be uploaded within the retry limit
            if (anotherPass) {
                throw new Exception("Part[" + i + "] failed after " + retryCount + " retries");
            }
            filePosition += partSize;
            log.info("filePosition=[" + filePosition + "]");
        }
        log.info("Finished uploading file");

        // Complete.
        CompleteMultipartUploadRequest compRequest = new CompleteMultipartUploadRequest(
                bucketName,
                keyName,
                initResponse.getUploadId(),
                partETags);
        s3Client.completeMultipartUpload(compRequest);
        log.info("multipart upload completed. upload id=[" + initResponse.getUploadId() + "]");
    } catch (Exception e) {
        s3Client.abortMultipartUpload(new AbortMultipartUploadRequest(
                bucketName, keyName, initResponse.getUploadId()));
        log.error("Failed to upload due to Exception:", e);
        throw e;
    }
}
/**
 * Creates a new bucket with the name specified if it does not exist.
 *
 * @param bucketName - the name of the bucket to retrieve or create
 * @param accessKey - Amazon account access key
 * @param secretKey - Amazon account secret key
 * @throws S3ServiceException - if something goes wrong
 */
public void createBucketIfNotExists(String bucketName, String accessKey, String secretKey) throws S3ServiceException {
    try {
        // Create a credentials object and service to access S3 account
        org.jets3t.service.security.AWSCredentials myCredentials =
                new org.jets3t.service.security.AWSCredentials(accessKey, secretKey);
        S3Service service = new RestS3Service(myCredentials);

        // Create a new bucket named after a normalized directory path,
        // and include my Access Key ID to ensure the bucket name is unique
        S3Bucket zeBucket = service.getOrCreateBucket(bucketName);
        log.info("the bucket [" + zeBucket.getName() + "] was created (if it was not existing yet...)");
    } catch (S3ServiceException e) {
        log.error("Failed to get or create bucket[" + bucketName + "] due to exception:", e);
        throw e;
    }
}
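As mentioned above, the directory-reading part is trivial; here is a minimal sketch of how it might be wired to uploadFileToS3, assuming both methods live in the same class and using a fixed thread pool so several images upload in parallel. The thread count and the 3-day timeout are illustrative values, not from the original answer.
// Additional imports needed alongside the ones used above:
// import java.io.File;
// import java.util.concurrent.ExecutorService;
// import java.util.concurrent.Executors;
// import java.util.concurrent.TimeUnit;

private static final int UPLOAD_THREADS = 10; // illustrative; tune to your bandwidth

/** Walks rootDir and uploads every file it finds, several at a time. */
public void uploadDirectory(final String accessKey,
                            final String secretKey,
                            final File rootDir,
                            final String bucketName) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(UPLOAD_THREADS);
    submitRecursively(pool, accessKey, secretKey, rootDir, rootDir, bucketName);
    pool.shutdown();
    pool.awaitTermination(3, TimeUnit.DAYS); // generous upper bound for a very large tree
}

private void submitRecursively(final ExecutorService pool,
                               final String accessKey,
                               final String secretKey,
                               final File rootDir,
                               final File current,
                               final String bucketName) {
    File[] children = current.listFiles();
    if (children == null) {
        return;
    }
    for (final File child : children) {
        if (child.isDirectory()) {
            submitRecursively(pool, accessKey, secretKey, rootDir, child, bucketName);
        } else {
            // The relative path (with forward slashes) doubles as the S3 key,
            // so the userimg folder structure is preserved in the bucket.
            final String relativeKey = rootDir.toURI().relativize(child.toURI()).getPath();
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        uploadFileToS3(accessKey, secretKey,
                                rootDir.getAbsolutePath(), relativeKey, bucketName);
                    } catch (Exception e) {
                        log.error("Upload failed for [" + child.getAbsolutePath() + "]", e);
                    }
                }
            });
        }
    }
}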

Sounds like a job for Rsync. I've never used it in combination with S3, but S3Sync seems like what you need.

If you don't want to actually upload all of the files (or indeed, manage the upload), you could use AWS Import/Export, which basically entails just shipping Amazon a hard disk.

You could use Super Flexible File Synchronizer. It is a commercial product, but the Linux version is free.
It can compare and sync the folders, and multiple files can be transferred in parallel. It's fast. The interface is perhaps not the simplest, but that's mainly because it has a million configuration options.
Note: I am not affiliated in any way with this product, but I have used it.

Consider Amazon S3 Bucket Explorer.
It allows you to upload files in parallel, so that should speed up the process.
The program has a job queue, so that if one of the uploads fails it will retry the upload automatically.

Related

Cucumber Extent Report - How to add screenshots to each cucumber step

I am unable to add a screenshot to each step I execute to my Cucumber Extent report. Please see the code snippet below.
public String captureScreenshotMobileWeb(WebDriver driver, String screenShotName) {
    TestLogger tLog = new TestLogger();
    String logFilename = (new SimpleDateFormat("yyyyMMddHHmmsss")).format(new Date());
    try {
        TakesScreenshot ts;
        if (driver.equals(this.wdriver)) {
            ts = (TakesScreenshot) this.wdriver;
        } else {
            ts = (TakesScreenshot) driver;
        }
        File source = ts.getScreenshotAs(OutputType.FILE);
        String dest = "C:\\Users\\nb313260\\Documents\\Projects\\Servicing\\staff-servicing-automation-test\\ScreenShots\\"
                + screenShotName + "_" + logFilename + ".png";
        File destination = new File(dest);
        FileUtils.copyFile(source, destination);
        tLog.logInfo("Screenshot taken: " + screenShotName);
        return dest;
    } catch (Exception e) {
        tLog.logError(e.getMessage());
        return e.getMessage();
    }
}
Feature file:
Feature: Account overview display for all product types

  Scenario Outline: As a user I must be able to view account details
    Given Client is authenticated
    When User launches staff servicing with Account number "<Account_Number>" for "<Product_Type>"
    Then User must be able to view account overview page and details under At a glance, Spotlight & Account information for selected "<Account_Number>" account number on screen

    Examples:
      | Account_Number | Product_Type |
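A hedged sketch of one common way to wire this up: capture the screenshot in a Cucumber @AfterStep hook and attach the returned path to the current Extent node via ExtentTest.addScreenCaptureFromPath. The ExtentTestHolder, DriverHolder and ScreenshotUtil classes below are hypothetical placeholders for however the suite shares those objects; they are not from the question.
import java.io.IOException;

import org.openqa.selenium.WebDriver;

import com.aventstack.extentreports.ExtentTest;

import io.cucumber.java.AfterStep;
import io.cucumber.java.Scenario;

public class ScreenshotHooks {

    @AfterStep
    public void attachScreenshotToStep(Scenario scenario) throws IOException {
        // Hypothetical accessors; replace with however the suite exposes these objects.
        WebDriver driver = DriverHolder.getDriver();
        ExtentTest extentTest = ExtentTestHolder.getCurrentTest();

        // Reuse the method from the question, which returns the saved PNG path.
        String path = new ScreenshotUtil().captureScreenshotMobileWeb(driver, scenario.getName());

        // Attach the file to the current step's Extent node.
        extentTest.addScreenCaptureFromPath(path);
    }
}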

Code/Script to recursively search S3 bucket to unzip (GZIP) and move specific file from ZIP to new location

Hi, I have an S3 bucket containing gzip files. Within each zip there is a single TSV file I want to move to a new folder or bucket (don't really mind which). The S3 bucket will have a new zip file added each hour, so this script needs to be something I can schedule or trigger. Happy to use the CLI, Lambda or any other method! Pointers, links, help very much appreciated.
OK, so the fudge way to do this is with local processing.
Connect to S3:
AmazonS3Config config = new AmazonS3Config();
config.ServiceURL = "https://s3-eu-west-1.amazonaws.com";
AmazonS3Client s3Client = new AmazonS3Client(
    S3AccessKey,
    S3SecretKey,
    config
);
Copy down the files you want to process:
S3DirectoryInfo dir = new S3DirectoryInfo(s3Client, bucketname, "jpbodenukproduction");
dir.CopyToLocal(@"C:\S3Local");
Decompress the gzip (containing the tar, containing the multiple files):
string directorypath = @"C:\S3Local";
DirectoryInfo directoryselected = new DirectoryInfo(directorypath);
foreach (FileInfo FileToDecompress in directoryselected.GetFiles("*.gz"))
{
    Decompress(FileToDecompress);
}
public static void Decompress(FileInfo fileToDecompress)
{
    using (FileStream originalFileStream = fileToDecompress.OpenRead())
    {
        string currentFileName = fileToDecompress.FullName;
        string newFileName = currentFileName.Remove(currentFileName.Length - fileToDecompress.Extension.Length);
        using (FileStream decompressedFileStream = File.Create(newFileName))
        {
            using (GZipStream decompressionStream = new GZipStream(originalFileStream, CompressionMode.Decompress))
            {
                decompressionStream.CopyTo(decompressedFileStream);
                Console.WriteLine("Decompressed: {0}", fileToDecompress.Name);
            }
        }
    }
}
Now deal with the tar file (using ICSharpCode.SharpZipLib):
foreach (FileInfo TarFile in directoryselected.GetFiles("*.tar"))
{
    var stream = File.OpenRead(TarFile.FullName);
    var tarArchive = ICSharpCode.SharpZipLib.Tar.TarArchive.CreateInputTarArchive(stream);
    tb1.Text = "Processing:" + TarFile.Name;
    try
    {
        tarArchive.ExtractContents(@"C:\S3Local\Trash\");
    }
    catch (Exception ziperror)
    {
        tb1.Text = "Delay Error in TarUnzip:" + ziperror;
        Thread.Sleep(10000);
    }
    finally
    {
        tarArchive.Close();
        stream.Close();
    }
}
Finally, do what you want with the unzipped files; I simply extracted the single file I needed, recompressed it and moved it back up to S3.
My plan is to next convert this into Lambda and get it running on a schedule.
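Since the answer stops at the plan to convert this into Lambda, here is a hedged sketch of what that step might look like, written in Java rather than the C# above. It assumes the AWS SDK for Java v1, the aws-lambda-java-events library and Apache Commons Compress are on the classpath, and assumes each incoming .gz object wraps a tar whose .tsv entry simply needs to be copied to an output bucket; OUTPUT_BUCKET, the class name and the ".tsv" filter are placeholders, not part of the original answer.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.utils.IOUtils;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.event.S3EventNotification;
import com.amazonaws.services.s3.model.ObjectMetadata;

// Hypothetical S3-triggered Lambda: for every uploaded .gz, extract the TSV entry
// from the embedded tar and write it to a separate output bucket.
public class TsvExtractHandler implements RequestHandler<S3Event, String> {

    private static final String OUTPUT_BUCKET = "my-extracted-tsv-bucket"; // placeholder
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    @Override
    public String handleRequest(S3Event event, Context context) {
        for (S3EventNotification.S3EventNotificationRecord record : event.getRecords()) {
            String bucket = record.getS3().getBucket().getName();
            String key = record.getS3().getObject().getKey();
            try (InputStream raw = s3.getObject(bucket, key).getObjectContent();
                 TarArchiveInputStream tar = new TarArchiveInputStream(new GZIPInputStream(raw))) {
                TarArchiveEntry entry;
                while ((entry = tar.getNextTarEntry()) != null) {
                    if (entry.isDirectory() || !entry.getName().endsWith(".tsv")) {
                        continue;
                    }
                    // Buffer the single TSV entry and write it to the output bucket.
                    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                    IOUtils.copy(tar, buffer);
                    byte[] bytes = buffer.toByteArray();
                    ObjectMetadata meta = new ObjectMetadata();
                    meta.setContentLength(bytes.length);
                    s3.putObject(OUTPUT_BUCKET, entry.getName(),
                            new ByteArrayInputStream(bytes), meta);
                }
            } catch (IOException e) {
                throw new RuntimeException("Failed to process " + key, e);
            }
        }
        return "done";
    }
}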

How to reload list resource bundle in ADF 12c

I fail to reload my resource bundle class to reflect the changed translations (made by the end user) on the page, even though the getContent method executes, all translations are fetched from the database as key/value pairs, and the Object[][] is returned from getContent successfully. This happens each time I clear the cache and refresh the JSF page through an actionListener.
ResourceBundle.clearCache();
Also I tried to use the below and got the same result.
ResourceBundle.clearCache(Thread.currentThread().getContextClassLoader());
Why does WLS always see the old one? Am I missing something?
Versions: 12.2.1.1.0 and 12.2.1.3.0
After the end user makes the translations (contributing to the internationalization of the project), the translations are saved to the database.
The process to enforce these operations is done through the following steps:
Create a HashMap and load all the resource key/value pairs into the map from the database:
while (rs.next()) {
    bundle.put(rs.getString(1), rs.getString(2));
}
Refresh the Bundle of your application
SoftCache cache =
        (SoftCache) getFieldFromClass(ResourceBundle.class, "cacheList");
synchronized (cache) {
    ArrayList myBundles = new ArrayList();
    Iterator keyIter = cache.keySet().iterator();
    while (keyIter.hasNext()) {
        Object key = keyIter.next();
        String name = (String) getFieldFromObject(key, "searchName");
        if (name.startsWith(bundleName)) {
            myBundles.add(key);
            sLog.info("Resourcebundle " + name + " will be refreshed.");
        }
    }
    cache.keySet().removeAll(myBundles);
}
Get a String from the ResourceBundle of your application:
for (String resourcebundle : bundleNames) {
    String bundleName =
            resourcebundle + (bundlePostfix == null ? "" : bundlePostfix);
    try {
        bundle = ResourceBundle.getBundle(bundleName, locale, getCurrentLoader(bundleName));
    } catch (MissingResourceException e) {
        // bundle with this name not found
    }
    if (bundle == null)
        continue;
    try {
        message = bundle.getString(key);
        if (message != null)
            break;
    } catch (Exception e) {
        // key not present in this bundle; try the next one
    }
}
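The snippets above call getFieldFromClass and getFieldFromObject, which are not shown. A minimal sketch of what they could look like, assuming they are plain reflection wrappers (note that poking at ResourceBundle internals such as cacheList is JDK-version dependent):
// Hypothetical reflection helpers assumed by the cache-refresh snippet above.
private static Object getFieldFromClass(Class<?> clazz, String fieldName) {
    try {
        java.lang.reflect.Field field = clazz.getDeclaredField(fieldName);
        field.setAccessible(true);
        return field.get(null); // static field, e.g. ResourceBundle.cacheList
    } catch (Exception e) {
        throw new IllegalStateException("Cannot read static field " + fieldName, e);
    }
}

private static Object getFieldFromObject(Object target, String fieldName) {
    try {
        java.lang.reflect.Field field = target.getClass().getDeclaredField(fieldName);
        field.setAccessible(true);
        return field.get(target); // instance field, e.g. the cache key's searchName
    } catch (Exception e) {
        throw new IllegalStateException("Cannot read field " + fieldName + " from " + target, e);
    }
}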

Load a million objects from DB

I am using the Drools engine, but only from the implementation side; the framework is set up for me and I can only use the RULES side (I hope I am able to explain myself).
That said, my problem is that I am trying to load about 1 million rows from an Oracle DB into the working memory, and I find that this task takes too long. Here is the rule that I use to load the objects.
(BTW, the requirement to load the million records into the WM is mandatory, since I need to use these DB objects as part of my rules along with other objects that are injected into the engine at runtime.)
rule "Load_CMMObject"
salience 10
no-loop
when
not CMMObjectShouldBeRefreshed() over window:time( 1500m )
then
log("Fetching Load_CMMObject");
ExecutionContext ctx = getExecutionContext();
String getObjectTypeSQL = "select AID , BC_OBJECT_ID , ALR_EQP_NAME , ALR_FROM_SITE from CMM_DB.COR_EQP_VW";
PreparedStatement pStmt = null;
try {
pStmt = MTServiceAccessor.getDBAccessService().prepareNativeStatement(ctx, getObjectTypeSQL, false);
ResultSet rs = pStmt.executeQuery();
while (rs.next()) {
String aid = rs.getString(1);
int objectID = rs.getInt(2);
String eqpName = rs.getString(3);
String fromSite = rs.getString(4);
CMMObject cmmObject = new CMMObject();
cmmObject.setIp(aid);
cmmObject.setObjectID(objectID);
cmmObject.setEqpName(eqpName);
cmmObject.setFromSite(fromSite);
insert(cmmObject);
//log("insert Object ---> " + cmmObject.getIp());
}
log("Finish Loading All cmm_db.BCMED_EQP_VW");
} catch (Exception e) {
log("Failed to Load ServiceName_TBL" + e);
} finally {
if (pStmt != null) {
try {
pStmt.close();
} catch (Exception e) {
log("Failed to close pstmt");
}
}
}
//log(" finished loading trails into the WM1");
CMMObjectShouldBeRefreshed cMMObjectShouldBeRefreshed = new CMMObjectShouldBeRefreshed();
//log(" finished loading trails into the WM2");
insert (cMMObjectShouldBeRefreshed);
//log("finished loading trails into the WM3");
end
I am using a server that allocates about 20 GB of RAM for the Drools engine, and it has eight 1.5 GHz quad-core processors.
The problem is that loading 5,000 rows takes about 1 minute, so loading the 1 million records from the DB would take about 200 minutes to complete, and that is too much.
I would appreciate any help here, thanks a lot!
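One hedged observation, not from the original thread: this kind of row-by-row loading is often dominated by the Oracle JDBC driver's default fetch size of 10 rows, which costs a network round trip roughly every 10 rows. If the framework exposes the PreparedStatement as the rule above suggests, raising the fetch size is a cheap thing to try (the 5000 here is an illustrative value):
pStmt = MTServiceAccessor.getDBAccessService().prepareNativeStatement(ctx, getObjectTypeSQL, false);
// Fetch rows from Oracle in large batches instead of the driver default of 10 per round trip.
pStmt.setFetchSize(5000);
ResultSet rs = pStmt.executeQuery();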

Does flyway migrations support PostgreSQL's COPY?

Having performed a pg_dump of an existing PostgreSQL schema, I have an SQL file containing a number of table population statements using COPY:
COPY test_table (id, itm, factor, created_timestamp, updated_timestamp, updated_by_user, version) FROM stdin;
1 600 0.000 2012-07-17 18:12:42.360828 2012-07-17 18:12:42.360828 system 0
2 700 0.000 2012-07-17 18:12:42.360828 2012-07-17 18:12:42.360828 system 0
\.
Though not standard, this is part of PostgreSQL's SQL dialect.
Performing a Flyway migration (via the Maven plugin) I get:
[ERROR] Caused by org.postgresql.util.PSQLException: ERROR: unexpected message type 0x50 during COPY from stdin
Am I doing something wrong, or is this just not supported?
Thanks.
The short answer is no.
The one definite problem is that the parser is currently not able to deal with this special construct.
The other question is JDBC driver support. Could you try and see if this syntax is generally supported by the JDBC driver with a single createStatement call?
If it is, please file an issue in the issue tracker and I'll extend the parser.
Update: This is now supported
I have accomplished this for Postgres using
public abstract class SeedData implements JdbcMigration {
    protected static final String CSV_COPY_STRING = "COPY %s(%s) FROM STDIN HEADER DELIMITER ',' CSV ENCODING 'UTF-8'";
    protected CopyManager copyManager;

    @Override
    public void migrate(Connection connection) throws Exception {
        log.info(String.format("[%s] Populating database with seed data", getClass().getName()));
        copyManager = new CopyManager((BaseConnection) connection);
        Resource[] resources = scanForResources();
        List<Resource> res = Arrays.asList(resources);
        for (Resource resource : res) {
            load(resource);
        }
    }

    private void load(Resource resource) throws SQLException, IOException {
        String location = resource.getLocation();
        InputStream inputStream = getClass().getClassLoader().getResourceAsStream(location);
        if (inputStream == null) {
            throw new FlywayException("Failure to load seed data. Unable to load from location: " + location);
        }
        if (!inputStream.markSupported()) {
            // Sanity check. We have to be able to mark the stream.
            throw new FlywayException(
                "Failure to load seed data as mark is not supported. Unable to load from location: " + location);
        }
        // set our mark to something big enough to cover the header read
        inputStream.mark(Integer.MAX_VALUE);
        String filename = resource.getFilename();
        // Strip the prefix (e.g. 01_) and the file extension (e.g. .csv)
        String table = filename.substring(3, filename.length() - 4);
        String columns = loadCsvHeader(location, inputStream);
        // reset to the mark
        inputStream.reset();
        // Use the Postgres COPY command to bring it in
        long result = copyManager.copyIn(String.format(CSV_COPY_STRING, table, columns), inputStream);
        log.info(format(" %s - Inserted %d rows", location, result));
    }

    private String loadCsvHeader(String location, InputStream inputStream) {
        try {
            return new BufferedReader(new InputStreamReader(inputStream)).readLine();
        } catch (IOException e) {
            throw new FlywayException("Failure to load seed data. Unable to load from location: " + location, e);
        }
    }

    private Resource[] scanForResources() throws IOException {
        return new ClassPathScanner(getClass().getClassLoader()).scanForResources(getSeedDataLocation(), "", ".csv");
    }

    protected String getSeedDataLocation() {
        return getClass().getPackage().getName().replace('.', '/');
    }
}
To use it, implement the class in the appropriate package on the classpath:
package db.devSeedData.dev;

public class v0_90__seed extends db.devSeedData.v0_90__seed {
}
All that is needed then is to have CSV files on your classpath under db/devSeedData that follow the format 01_tablename.csv. Columns are extracted from the header line of the CSV.
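As a purely illustrative example of that convention, reusing the table from the question above: a classpath resource named db/devSeedData/01_test_table.csv whose header line is id,itm,factor,created_timestamp,updated_timestamp,updated_by_user,version would be loaded with COPY test_table(id,itm,factor,created_timestamp,updated_timestamp,updated_by_user,version) FROM STDIN HEADER DELIMITER ',' CSV ENCODING 'UTF-8'; the table name comes from stripping the 01_ prefix and the .csv extension, and the column list is read from the CSV header.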
