How do I load a big CSV file into WSO2 ML

I was trying to upload a 10GB CSV file into WSO2 ML, but it failed with errors. I followed this link to increase the dataset size limit in WSO2 ML: https://docs.wso2.com/display/ML100/FAQ#FAQ-Isthereafilesizelimittomydataset?Isthereafilesizelimittomydataset?
I am running WSO2 ML on a PC with the following specifications:
- 50GB RAM
- 8 Cores
Thanks

When it comes to uploading datasets into WSO2 Machine Learner, there are three options.
Uploading files from your local file system. As you have mentioned, the maximum upload size is limited to 100MB by default, and you can increase the limit by setting the -Dorg.apache.cxf.io.CachedOutputStream.Threshold option in your wso2server.sh (or wso2server.bat on Windows) file. We have tested this feature with a 1GB file. However, for large files, we don't recommend this option. The main use case of this functionality is to allow users to quickly try out a machine learning algorithm with small datasets.
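For example, to raise the threshold to roughly 1GB (the value below is a hypothetical illustration; the property is specified in bytes), the option can be appended to the JVM arguments in the server startup script:
-Dorg.apache.cxf.io.CachedOutputStream.Threshold=1073741824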
Since you are working with a large dataset, we recommend the following two approaches for uploading it into the WSO2 ML server.
Upload data using the Hadoop file system (HDFS). We have given a detailed description of how to use HDFS files in WSO2 ML in our documentation [1].
If you have an up-and-running WSO2 DAS instance, then by integrating WSO2 ML with WSO2 DAS you can easily point to a DAS table as your source type in WSO2 ML's "Create Dataset" wizard. For more details on integrating WSO2 ML with WSO2 DAS, please refer to [2].
If you need more help regarding this issue, please let me know.
[1]. https://docs.wso2.com/display/ML100/HDFS+Support
[2]. https://docs.wso2.com/display/ML110/Integration+with+WSO2+Data+Analytics+Server

For those who want to use HDP (Hortonworks) as part of an HDFS solution to load a large dataset into WSO2 ML via the NameNode IPC port 8020 (e.g. hdfs://hostname:8020/samples/data/wdbcSample.csv), you may first need to ingest the data file into HDFS using a Java client such as the following:
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileIngestion {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Connect to the HDFS NameNode over IPC.
        FileSystem hdfs = FileSystem.get(new URI("hdfs://hostname:8020"), configuration);
        Path dstPath = new Path("hdfs://hostname:8020/samples/data/wdbcSample.csv");
        // Remove any stale copy of the file before re-uploading.
        if (hdfs.exists(dstPath)) {
            hdfs.delete(dstPath, true);
        } else {
            System.out.println("No such destination ...");
        }
        Path srcPath = new Path("wdbcSample.csv"); // a local file path on the client side
        try {
            hdfs.copyFromLocalFile(srcPath, dstPath);
            System.out.println("Done successfully ...");
        } catch (Exception ex) {
            ex.printStackTrace();
        } finally {
            hdfs.close();
        }
    }
}
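Once the file has been ingested, you can point WSO2 ML at hdfs://hostname:8020/samples/data/wdbcSample.csv as the dataset source in the "Create Dataset" wizard.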

Related

How to upload media (image, video) to Google Cloud Storage with Java

I am facing a problem when uploading media files to Google Cloud Storage using the Google Cloud Storage Java API. To be more specific, the GCS Java API examples only help us upload text files to a Google bucket, and they are not useful for media files. I also saw in the discussion "Written file not showing up in Google Cloud Storage" that the team suggests using the gsutil tool, which is written in Python. I am not using the Blobstore either.
My question is: how can I accomplish the following requirements:
-Creating and deleting buckets.
-Uploading, downloading, and deleting objects (such as media files).
-Listing buckets and objects.
-Moving, copying, and renaming objects.
by implementing them in Java?
I thank you very much for your time and look forward to hearing from you.
Upload/Download files: you can use the Blobstore API, which can be configured to store blobs in Google Cloud Storage by specifying your bucket in the BlobstoreService createUploadUrl call. Similarly, to download you can create a BlobKey via createGsBlobKey with the bucket name + object name, which can then be served by the Blobstore service (see the sketch after this list).
Create/Delete buckets: the Google Cloud Storage Java Client Library does not offer a way to create/delete buckets. You will need to use the Google Cloud Storage REST API to create and delete them programmatically. That said, you might want to consider organizing your data within one bucket.
Moving, copying, and renaming objects: make use of the Google Cloud Storage Java Client Library.
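As a minimal sketch of the Blobstore-backed upload/serve flow from the first point, assuming the App Engine Blobstore API (the bucket name, handler path, and object name below are placeholder values):

import java.io.IOException;
import javax.servlet.http.HttpServletResponse;

import com.google.appengine.api.blobstore.BlobKey;
import com.google.appengine.api.blobstore.BlobstoreService;
import com.google.appengine.api.blobstore.BlobstoreServiceFactory;
import com.google.appengine.api.blobstore.UploadOptions;

public class GcsBlobstoreSketch {
    private static final BlobstoreService blobstore =
            BlobstoreServiceFactory.getBlobstoreService();

    // Returns a URL the browser can POST a file to; the upload is stored
    // in the given Cloud Storage bucket rather than the Blobstore.
    public static String gcsUploadUrl() {
        UploadOptions options =
                UploadOptions.Builder.withGoogleStorageBucketName("my-bucket");
        return blobstore.createUploadUrl("/upload-handler", options);
    }

    // Serves an existing GCS object ("/gs/<bucket>/<object>") through Blobstore.
    public static void serveGcsObject(HttpServletResponse response) throws IOException {
        BlobKey key = blobstore.createGsBlobKey("/gs/my-bucket/my-object");
        blobstore.serve(key, response);
    }
}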
I faced a similar requirement when I had to deal with all sorts of documents, including media objects such as images and videos. This is the implementation I followed, based on the official documentation of the Google Cloud Examples project on GitHub:
Source Link
To upload a file
public boolean uploadFile(String filePath, byte[] file) {
    try {
        setDefaultStorageCredentials();
        // Stream the byte array into a new blob named after filePath.
        storage.create(BlobInfo.newBuilder(bucketName, filePath).build(),
                new ByteArrayInputStream(file));
        return true;
    } catch (Exception e) {
        // Swallowing the exception keeps the caller simple; log it in real code.
        return false;
    }
}
To download a file
public byte[] downloadFile(String filePath) throws FileNotFoundException, IOException {
    setDefaultStorageCredentials();
    // Fetch the blob from the bucket and return its full content.
    return storage.get(bucketName).get(filePath).getContent();
}
To delete a file
public boolean deleteFile(String filePath) {
    setDefaultStorageCredentials();
    // Resolve the blob's BlobId and delete it; returns true if it existed.
    return storage.delete(storage.get(bucketName).get(filePath).getBlobId());
}
To provide temporary access to a file using a signed URL
public String getTemporaryFileLink(String filePath) throws Exception {
    setDefaultStorageCredentials();
    Blob blob = storage.get(bucketName).get(filePath);
    String blobName = blob.getName();
    // Generate a URL that grants read access for five minutes.
    URL signedUrl = storage.signUrl(BlobInfo.newBuilder(bucketName, blobName).build(),
            5, TimeUnit.MINUTES);
    return signedUrl.toExternalForm();
}
Most of these methods are mentioned in this Google GitHub project. I just removed the clutter in my implementation. Hope this helps.
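Note that the snippets above rely on a storage field and a setDefaultStorageCredentials() helper that are not shown. As a rough sketch, assuming the google-cloud-storage client library and Application Default Credentials, the initialization might look like this:

import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

private Storage storage;

// Hypothetical initializer: lazily builds a client from the
// Application Default Credentials available in the environment.
private void setDefaultStorageCredentials() {
    if (storage == null) {
        storage = StorageOptions.getDefaultInstance().getService();
    }
}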

Reading data from XLS stored in Google Cloud Storage

I have uploaded an Excel file into GCS. Using the Apache POI library, I am able to read data from a local Excel file, but I cannot find the available file readers and methods to read data from GCS.
Please suggest methods for reading an Excel file from GCS.
Thanks in advance.
Since this is still unanswered, I'll expand upon the previous comment: the GCS Client Library [1] will give you an InputStream which you can use to read the data from GCS:
GcsService gcsService = GcsServiceFactory.createGcsService();
GcsFilename fileName = new GcsFilename("bucket", "test.xlsx");
int BUFFER_SIZE = 1024 * 1024; // prefetch block size in bytes
GcsInputChannel readChannel = gcsService.openPrefetchingReadChannel(fileName, 0, BUFFER_SIZE);
InputStream inputStream = Channels.newInputStream(readChannel);
// Feed the stream straight into POI.
XSSFWorkbook workbook = new XSSFWorkbook(inputStream);
XSSFSheet sheet = workbook.getSheetAt(0);
Iterator<Row> rowIterator = sheet.iterator();
while (rowIterator.hasNext()) {
    Row row = rowIterator.next(); // advance the iterator to avoid an infinite loop
    // Do stuff...
}
Note that if you want to use POI on the App Engine runtime itself, you will need to use a nightly build of POI or build it from source yourself; otherwise you will run into the issue 'com.sun.org.apache.xerces.internal.util.SecurityManager is a restricted class' [2].
[1] https://cloud.google.com/appengine/docs/java/googlecloudstorageclient/
[2] Google App Engine and Apache Poi loading templates

Silverlight multi-file uploader + Azure Blob storage: occasional corrupt uploads in IE

I am using the Silverlight multi-file uploader and uploading the document to Azure Blob storage as a byte array.
// Copy the incoming stream into a byte array
using (MemoryStream ms = new MemoryStream())
{
    stream.CopyTo(ms);
    return ms.ToArray();
}
// Upload the file
blob.UploadByteArray(bytes);
The uploaded document intermittently appears to be corrupt.
Any suggestions?
The Windows Azure Storage Client Library protects the integrity of the blob being uploaded by verifying an MD5 hash of the data when it is sent to the Windows Azure Storage Service (in most cases). If you are using an HTTPS connection to the service, this would also verify the data was sent without errors.
The details of how MD5 hashes are used: http://blogs.msdn.com/b/windowsazurestorage/archive/2011/02/18/windows-azure-blob-md5-overview.aspx
I believe the corruption you are seeing occurred between the client's web browser and your application. You will need to have the user try their upload again.
By the way, your code unnecessarily creates two additional copies of the data (the MemoryStream and the byte array). Instead, try this:
blob.UploadFromStream(stream);
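One caveat: if stream has already been read (for example, by an earlier CopyTo), remember to reset its position before uploading, e.g. stream.Seek(0, SeekOrigin.Begin), assuming the stream is seekable; otherwise the upload will be empty or truncated.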

Location of GS File in Local/Dev AppEngine

I'm trying to troubleshoot some issues I'm having with an export task I have created. I'm attempting to export CSV data using Google Cloud Storage, and I seem to be unable to export all my data. I'm assuming it has something to do with the (far too low) 30-second file limit when I attempt to restart the task.
I need to troubleshoot, but I can't seem to find where my local/development server is writing the files. I see numerous entries in the GsFileInfo table, so I assume something is going on, but I can't seem to find the actual output file.
Can someone point me to the location of the Google Cloud Storage files in the local AppEngine development environment?
Thanks!
Looking at the dev_appserver code, it looks like you can specify a path, or it will calculate a default based on the OS you are using:
blobstore_path = options.blobstore_path or os.path.join(storage_path, 'blobs')
It then passes this path to blobstore_stub (GCS storage is backed by the blobstore stub), which shards files by their blobstore key:
def _FileForBlob(self, blob_key):
    """Calculate full filename to store blob contents in.

    This method does not check to see if the file actually exists.

    Args:
      blob_key: Blob key of blob to calculate file for.

    Returns:
      Complete path for file used for storing blob.
    """
    blob_key = self._BlobKey(blob_key)
    return os.path.join(self._DirectoryForBlob(blob_key), str(blob_key)[1:])
For example, I'm using Ubuntu and started with dev_appserver.py --storage_path=~/tmp; I was then able to find files under ~/tmp/blobs and the datastore under ~/tmp/datastore.db. Alternatively, you can go to the local admin console; the blobstore viewer link will also display GCS files.
As tkaitchuck mentions above, you can use the included LocalRawGcsService to pull the data out of the local.db. This is the only way to get the files, as they are stored in the local DB using the blobstore. Here's the original answer:
which are the files uri on GAE java emulating cloud storage with GCS client library?

Does Google App Engine allow creation of files and folders on the server?

I know Google App Engine offers free storage space, but I wonder whether it's only for storing data in its database, or whether it also allows me to create files and directories on the server side to store my data. For instance, can I use the following method to save a file?
public static void saveFile(String File_Path, StringBuffer Str_Buf, boolean Append)
{
    FileOutputStream fos = null;
    BufferedOutputStream bos = null;
    try
    {
        fos = new FileOutputStream(File_Path, Append);
        bos = new BufferedOutputStream(fos);
        for (int j = 0; j < Str_Buf.length(); j++) bos.write(Str_Buf.charAt(j));
    }
    catch (Exception e) { e.printStackTrace(); }
    finally
    {
        try
        {
            if (bos != null)
            {
                bos.close();
                bos = null;
            }
            if (fos != null)
            {
                fos.close();
                fos = null;
            }
        }
        catch (Exception ex) { ex.printStackTrace(); }
    }
}
You can read files from your own project, but you cannot write to the file system.
From the FAQ:
Why can't I write to this file?
Writing to local files is not supported in App Engine due to the distributed nature of your application. Instead, data which must be persisted should be stored in the distributed datastore. For more information see the documentation on the runtime sandbox
An App Engine application cannot:
write to the filesystem. Applications must use the App Engine datastore for storing persistent data. Reading from the filesystem is allowed, and all application files uploaded with the application are available.
open a socket or access another host directly. An application can use the App Engine URL fetch service to make HTTP and HTTPS requests to other hosts on ports 80 and 443, respectively.
spawn a sub-process or thread. A web request to an application must be handled in a single process within a few seconds. Processes that take a very long time to respond are terminated to avoid overloading the web server.
make other kinds of system calls.
New information:
The answer is yes, but you have to use their Cloud Storage for write access. You cannot use regular files for this purpose.
https://developers.google.com/appengine/docs/java/googlecloudstorageclient/
It also has a Python API as well as a RESTful API.
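For instance, here is a minimal sketch of writing a file with the GCS client library (the bucket and object names are placeholders, and the snippet assumes the appengine-gcs-client library):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import com.google.appengine.tools.cloudstorage.GcsFileOptions;
import com.google.appengine.tools.cloudstorage.GcsFilename;
import com.google.appengine.tools.cloudstorage.GcsOutputChannel;
import com.google.appengine.tools.cloudstorage.GcsService;
import com.google.appengine.tools.cloudstorage.GcsServiceFactory;

public class GcsWriteSketch {
    public static void main(String[] args) throws Exception {
        GcsService gcsService = GcsServiceFactory.createGcsService();
        // Placeholder bucket/object names.
        GcsFilename fileName = new GcsFilename("my-bucket", "notes/output.txt");
        GcsFileOptions options = new GcsFileOptions.Builder().mimeType("text/plain").build();
        // Write the content, then close the channel to flush it to GCS.
        GcsOutputChannel channel = gcsService.createOrReplace(fileName, options);
        channel.write(ByteBuffer.wrap("hello".getBytes(StandardCharsets.UTF_8)));
        channel.close();
    }
}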
No, file I/O is not allowed. You may use blobs to store images or text.
There is a way if you use the /tmp folder. However, it will store files in the RAM of the instance, so note that it will consume memory and that it is temporary (as the folder's name suggests); a minimal sketch follows at the end of this answer.
For more details, see the documentation here or here.
In most situations, it is preferable to use Google Cloud Storage instead.
Note also that there is no issue with writing to the file system if you choose the flexible environment instead of the standard one (however, be careful of the pricing difference, e.g. this).
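A minimal sketch of the /tmp approach, assuming standard Java NIO (the file name is a placeholder):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TmpWriteSketch {
    public static void main(String[] args) throws IOException {
        // /tmp is writable on App Engine, but it is backed by instance RAM
        // and its contents disappear when the instance is recycled.
        Path tmpFile = Paths.get("/tmp/scratch.txt");
        Files.write(tmpFile, "temporary data".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(Files.readAllBytes(tmpFile), StandardCharsets.UTF_8));
    }
}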
I think it should be mentioned that writing to the Blobstore using the Files API is now deprecated, and that Google is moving to Cloud Storage.
Google App Engine is a scalable and stateless service.
Scalable means multiple parallel instances get started/shutdown/recycled (elastic).
If one instance writes data to a local SQLite DB, the other parallel instances can't see or know about that data. This leads to inconsistency.
You simply cannot have a "local write" pattern in a stateless service.
Hence, you have to write your data (state) outside App Engine, over the network, to durable storage (such as Firebase, Cloud SQL, Cloud Storage, or Redis).
If you want the convenience of "all in one place" compute and local storage, you will have to spin up a VM with durable block storage.
