Using wildcards in the file name, I am trying to read files from a GCS bucket.
On the gsutil command line, wildcards work when specifying file names.
But in the Java client API,
GcsFilename filename = new GcsFilename(BUCKETNAME, "big*");
searches for a file literally named "big*" instead of files starting with "big".
Please help me understand how I can use wildcards with GcsFilename.
Thanks in advance.
Wildcard characters are a feature of gsutil, but they're not an inherent part of the Google Cloud Storage API. You can, however, handle this the same way that gsutil does.
If you want to find the name of every object that begins with a certain prefix, Google Cloud Storage's APIs provide a list method with a "prefix" argument. Only objects matching the prefix will be returned. This doesn't work for arbitrary regular expressions, but it will work for your example.
The documentation for the list method goes into more detail.
As Brandon Yarbrough mentioned, GcsFilename represents the name of a single GCS object, which can include any valid UTF-8 character (excluding a few such as \r and \n, but including '*', though that is not recommended). See https://developers.google.com/storage/docs/bucketnaming#objectnames for more info.
The GAE GCS client does not support listing yet (though that is planned to be added), so for now you can use the GCS XML or JSON API directly (using urlfetch) or use the Java API client for Cloud Storage, https://developers.google.com/api-client-library/java/apis/storage/v1
See an example of the latter option:
public class ListServlet extends HttpServlet {

  public static final List<String> OAUTH_SCOPES =
      ImmutableList.of("https://www.googleapis.com/auth/devstorage.read_write");

  @Override
  protected void doPost(HttpServletRequest req, HttpServletResponse resp)
      throws ServletException, IOException {
    try {
      String bucket = req.getParameter("bucket");
      AppIdentityCredential cred = new AppIdentityCredential(OAUTH_SCOPES);
      Storage storage = new Storage.Builder(new UrlFetchTransport(), new JacksonFactory(), cred)
          .setApplicationName(SystemProperty.applicationId.get()).build();
      Objects.List list = storage.objects().list(bucket);
      for (StorageObject o : list.execute().getItems()) {
        resp.getWriter().println(o.getName() + " -> " + o);
      }
    } catch (Exception ex) {
      throw new ServletException(ex);
    }
  }
}
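To get the effect of the big* wildcard from the question with this client, you can restrict the listing to a prefix via setPrefix. A minimal sketch, assuming the same credential and Storage setup as in the servlet above (the bucket name is a placeholder):
// Sketch: list only objects whose names start with "big" (the closest equivalent of gsutil's big*).
AppIdentityCredential cred = new AppIdentityCredential(OAUTH_SCOPES);
Storage storage = new Storage.Builder(new UrlFetchTransport(), new JacksonFactory(), cred)
    .setApplicationName(SystemProperty.applicationId.get()).build();
Objects.List list = storage.objects().list("your-bucket").setPrefix("big");
for (StorageObject o : list.execute().getItems()) {
  System.out.println(o.getName());
}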
Since version 1.15 of Apache Flink you can use the compaction feature to merge several files into one.
https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/filesystem/#compaction
How can we use compaction with bulk Parquet format?
The existing implementations of RecordWiseFileCompactor.Reader (DecoderBasedReader and InputFormatBasedReader) do not seem suitable for Parquet.
Furthermore, we cannot find any example of compacting Parquet or other bulk formats.
There are two types of file compactor mentioned in Flink's documentation.
OutputStreamBasedFileCompactor : The users can write the compacted results into an output stream. This is useful when the users don’t want to or can’t read records from the input files.
RecordWiseFileCompactor : The compactor can read records one-by-one from the input files and write into the result file similar to the FileWriter.
If I remember correctly, Parquet saves its metadata at the end of the file, so we need to use RecordWiseFileCompactor: we have to read the whole Parquet file to get the metadata at the end (number of row groups, schema) before we can parse the records.
From the Java API, to construct a RecordWiseFileCompactor we need an instance of RecordWiseFileCompactor.Reader.Factory.
There are two implementations of the RecordWiseFileCompactor.Reader.Factory interface: DecoderBasedReader.Factory and InputFormatBasedReader.Factory.
DecoderBasedReader.Factory creates a DecoderBasedReader instance, which reads the whole file content from an InputStream. We would have to load the bytes into a buffer and parse the file from that byte buffer ourselves, which is painful, so we don't use this implementation.
InputFormatBasedReader.Factory creates an InputFormatBasedReader, which reads the whole file content using the FileInputFormat supplier passed to the InputFormatBasedReader.Factory constructor.
The InputFormatBasedReader instance uses the FileInputFormat to read record by record and passes the records to the writer we gave to the forBulkFormat call, until the end of the file.
The writer receives all the records and compacts them into one file.
So the question becomes: what is FileInputFormat and how do we implement it?
Though the FileInputFormat class has many methods and fields, we know from the InputFormatBasedReader source code mentioned above that only four methods are called by InputFormatBasedReader:
open(FileInputSplit fileSplit), which opens the file
reachedEnd(), which checks if we hit end of file
nextRecord(), which reads next record from the opened file
close(), which cleans up
Luckily, there's an AvroParquetReader in the org.apache.parquet.avro package we can use. It already implements open/read/close, so we can wrap the reader inside a FileInputFormat and let the AvroParquetReader do all the dirty work.
Here's an example code snippet:
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.io.FileInputFormat;
import org.apache.flink.core.fs.FileInputSplit;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.InputFile;
import java.io.IOException;
public class ExampleFileInputFormat extends FileInputFormat<GenericRecord> {

  private ParquetReader<GenericRecord> parquetReader;
  private GenericRecord readRecord;

  @Override
  public void open(FileInputSplit split) throws IOException {
    Configuration config = new Configuration();
    // set hadoop config here
    // for example, if you are using gcs, set fs.gs.impl here
    // i haven't tried to use core-site.xml but i believe this is feasible
    InputFile inputFile = HadoopInputFile.fromPath(new org.apache.hadoop.fs.Path(split.getPath().toUri()), config);
    parquetReader = AvroParquetReader.<GenericRecord>builder(inputFile).build();
    readRecord = parquetReader.read();
  }

  @Override
  public void close() throws IOException {
    parquetReader.close();
  }

  @Override
  public boolean reachedEnd() throws IOException {
    return readRecord == null;
  }

  @Override
  public GenericRecord nextRecord(GenericRecord genericRecord) throws IOException {
    GenericRecord r = readRecord;
    readRecord = parquetReader.read();
    return r;
  }
}
Then you can use the ExampleFileInputFormat as shown below:
FileSink<GenericRecord> sink = FileSink.forBulkFormat(
        new Path(path),
        AvroParquetWriters.forGenericRecord(schema))
    .withRollingPolicy(OnCheckpointRollingPolicy.build())
    .enableCompact(
        FileCompactStrategy.Builder.newBuilder()
            .enableCompactionOnCheckpoint(10)
            .build(),
        new RecordWiseFileCompactor<>(
            new InputFormatBasedReader.Factory<>(
                new SerializableSupplierWithException<FileInputFormat<GenericRecord>, IOException>() {
                  @Override
                  public FileInputFormat<GenericRecord> get() throws IOException {
                    FileInputFormat<GenericRecord> format = new ExampleFileInputFormat();
                    return format;
                  }
                })))
    .build();
I have successfully deployed this to Flink on Kubernetes and compacted files on GCS. Here are some notes on deploying.
You need to download the Flink shaded Hadoop jar from https://flink.apache.org/downloads.html (search for "Pre-bundled Hadoop" on the page) and put the jar into $FLINK_HOME/lib/.
If you are writing files to object storage, for example GCS, you need to follow the plugin instructions. Remember to put the plugin jar into the plugins folder, not the lib folder.
If you are writing files to object storage, you also need to download the connector jar from the cloud provider. For example, I'm using GCS and downloaded the gcs-connector jar following the GCP instructions. Put the jar into some folder other than $FLINK_HOME/lib or $FLINK_HOME/plugins; I put the connector jar into a newly made folder, $FLINK_HOME/hadoop-lib.
Set the environment variable HADOOP_CLASSPATH=$FLINK_HOME/lib/YOUR_SHADED_HADOOP_JAR:$FLINK_HOME/hadoop-lib/YOUR_CONNECTOR_JAR
After all these steps, you can start your job and you're good to go.
I'm trying to implement Google's Cloud Connection Server with Google App Engine, following this tutorial -
Implementing an XMPP-based App Server. I copied the latest Smack jars from http://www.igniterealtime.org/projects/smack/ (smack.jar and smackx.jar), put them in WEB-INF/lib and added them to the classpath (I'm using Eclipse).
In the code sample in the first link I posted, the XMPPConnection is initiated in a 'main' method. Since this is not really suitable for GAE, I created a ServletContextListener and added it to web.xml.
public class GCMContextListener implements ServletContextListener {

  private static final String GCM_SENDER_ID = "*GCM_SENDER_ID*";
  private static final String API_KEY = "*API_KEY*";

  private SmackCcsClient ccsClient;

  public GCMContextListener() {
  }

  @Override
  public void contextInitialized(ServletContextEvent arg0) {
    final String userName = GCM_SENDER_ID + "@gcm.googleapis.com";
    final String password = API_KEY;
    ccsClient = new SmackCcsClient();
    try {
      ccsClient.connect(userName, password);
    } catch (XMPPException e) {
      e.printStackTrace();
    }
  }

  @Override
  public void contextDestroyed(ServletContextEvent arg0) {
    try {
      ccsClient.disconnect();
    } catch (XMPPException e) {
      e.printStackTrace();
    }
  }
}
web.xml
<web-app>
  <listener>
    <listener-class>com.myserver.bootstrap.GCMContextListener</listener-class>
  </listener>
</web-app>
Now, when I start the GAE server, I get the following exception:
java.lang.NoClassDefFoundError: javax.naming.directory.InitialDirContext is a restricted class. Please see the Google App Engine developer's guide for more details.
I searched the "Google App Engine developer's guide" but couldn't find anything about this. Can you please help me?
Google App Engine restricts access to certain JRE classes. In fact, they published a whitelist that shows you which classes are usable. It seems to me that the Smack library might require some reference to a directory context (maybe to create the XMPP messages?) and that is why your servlet causes this exception. The javax.naming.directory package is not in the whitelist.
I'm currently working on setting up a GCM Server as well. It seems to me that you need to read through the example and see what that main method is doing. What I see is a connection to the GCM server:
try {
  ccsClient.connect(userName, password);
} catch (XMPPException e) {
  e.printStackTrace();
}
Then a downstream message being sent to a device:
// Send a sample hello downstream message to a device.
String toRegId = "RegistrationIdOfTheTargetDevice";
String messageId = ccsClient.getRandomMessageId();
Map<String, String> payload = new HashMap<String, String>();
payload.put("Hello", "World");
payload.put("CCS", "Dummy Message");
payload.put("EmbeddedMessageId", messageId);
String collapseKey = "sample";
Long timeToLive = 10000L;
Boolean delayWhileIdle = true;
ccsClient.send(createJsonMessage(toRegId, messageId, payload, collapseKey,
timeToLive, delayWhileIdle));
These operations would be completed at some point during your application's lifecycle, so your servlet should support them by providing the methods the example is implementing, such as the connect method that appears in the first piece of code I pasted here. Its implementation is in the example at line 235 if I'm not mistaken.
As the documentation says, the 3rd party application server, which is what you're trying to implement using GAE, should be:
Able to communicate with your client.
Able to fire off properly formatted requests to the GCM server.
Able to handle requests and resend them as needed, using exponential back-off.
Able to store the API key and client registration IDs. The API key is included in the header of POST requests that send messages.
Able to generate message IDs to uniquely identify each message it sends.
So I want to create a java.io.File so that I can use it to generate a multipart-form POST request. I have the file in the form of a com.google.api.services.drive.model.File, so I'm wondering: is there a way I can convert this Google File to a Java File? This is a web app that uses the Google App Engine SDK, which prohibits every approach I've tried to make this work.
No, it doesn't seem like you can convert from com.google.api.services.drive.model.File to java.io.File. But it should still be possible to generate a multipart-form POST request using your data in Drive.
The com.google.api.services.drive.model.File class is used for storing metadata about the file; it does not store the file contents.
If you want to read the contents of your file into memory, this code snippet from the Drive documentation shows how to do it. Once the file is in memory, you can do whatever you want with it.
/**
 * Download the content of the given file.
 *
 * @param service Drive service to use for downloading.
 * @param file File metadata object whose content to download.
 * @return String representation of file content. String is returned here
 *         because this app is setup for text/plain files.
 * @throws IOException Thrown if the request fails for whatever reason.
 */
private String downloadFileContent(Drive service, File file)
    throws IOException {
  GenericUrl url = new GenericUrl(file.getDownloadUrl());
  HttpResponse response = service.getRequestFactory().buildGetRequest(url)
      .execute();
  try {
    return new Scanner(response.getContent()).useDelimiter("\\A").next();
  } catch (java.util.NoSuchElementException e) {
    return "";
  }
}
https://developers.google.com/drive/examples/java
This post might be helpful for making your multi-part POST request from Google AppEngine.
In the Google Drive API v3 you can download the file content into an OutputStream. For that you need the file id, which you can get from your com.google.api.services.drive.model.File:
String fileId = "yourFileId";
OutputStream outputStream = new ByteArrayOutputStream();
driveService.files().get(fileId).executeMediaAndDownloadTo(outputStream);
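If what you ultimately need is a java.io.File (as in the question above), you can stream the download straight into a temporary file instead of a ByteArrayOutputStream. A small sketch; the temp-file name is arbitrary:
// Download the Drive file content into a real java.io.File via a FileOutputStream.
java.io.File target = java.io.File.createTempFile("drive-download", ".tmp");
try (OutputStream outputStream = new FileOutputStream(target)) {
    driveService.files().get(fileId).executeMediaAndDownloadTo(outputStream);
}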
I'm developing a web application using Google App Engine for Java.
I will use Google Cloud Storage and, according to the documentation, I'm using the GCS client library to emulate Cloud Storage on local disk.
I have no problem saving the files: I can see them from Eclipse under the war folder (under the path WEB-INF/appengine-generated) and I can see them from the web admin panel accessible from the URL
localhost:8888/_ah/admin
as indicated in this question
My question is the following: what are the files' URIs under localhost for accessing them with GCS emulation?
Example of one of the uploaded files on localhost:
file key is aglub19hcHBfaWRyJwsSF19haF9GYWtlQ2xvdWRTdG9yYWdlX18xIgpxcmNvZGUuanBnDA
ID/name is encoded_gs_key:L2dzLzEvcXJjb2RlLmpwZw
filename is /gs/1/qrcode.jpg
Thanks in advance.
You can see how this is done here:
https://code.google.com/p/appengine-gcs-client/source/browse/trunk/java/src/main/java/com/google/appengine/tools/cloudstorage/dev/LocalRawGcsService.java
As of today this mapping is maintained using the local datastore. This may change in the future, but you should be able to simply call into this class or one of the higher-level classes provided with the GCS client to get at the data.
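If you only need to read the data back in local development (rather than construct a URI by hand), one option is to go through the GCS client library itself, which resolves the local mapping for you. A rough sketch; the bucket and object names are placeholders:
// Read a locally emulated GCS object through the appengine-gcs-client API.
GcsService gcsService = GcsServiceFactory.createGcsService();
GcsFilename filename = new GcsFilename("your-bucket", "qrcode.jpg");
try (GcsInputChannel channel = gcsService.openReadChannel(filename, 0);
     InputStream in = Channels.newInputStream(channel)) {
  // consume the stream, e.g. copy it to your servlet response
}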
Using getServingUrl()
The local GCS file is saved in a blob format.
When saving it, I can use a location like your filename "/gs/1/qrcode.jpg".
Yet, when accessing it, this fake location does not work.
I found a way. It may not be the best, but it works for me.
BlobKey bk = BlobstoreServiceFactory.getBlobstoreService().createGsBlobKey(location);
String url = ImagesServiceFactory.getImagesService().getServingUrl(bk);
The url will be like:
http://127.0.0.1:8080/_ah/img/encoded_gs_key:yourkey
(I could hardly find any direct solution via Google search.
I hope this answer can help others in need.)
Resources: ImagesServiceFactory, ImagesService, FileServiceFactory
For those who wish to serve the local GCS files that have been created by the GAE GCS library, one solution is to expose a Java Servlet like this:
package my.applicaion.servlet;
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import com.google.appengine.api.blobstore.BlobKey;
import com.google.appengine.api.blobstore.BlobstoreService;
import com.google.appengine.api.blobstore.BlobstoreServiceFactory;
public final class GoogleCloudStorageServlet
    extends HttpServlet
{

  @Override
  protected void doGet(final HttpServletRequest request, final HttpServletResponse response)
      throws ServletException, IOException
  {
    final BlobstoreService blobstoreService = BlobstoreServiceFactory.getBlobstoreService();
    final String fileName = "/gs" + request.getPathInfo();
    final BlobKey blobKey = blobstoreService.createGsBlobKey(fileName);
    blobstoreService.serve(blobKey, response);
  }
}
and in your web.xml:
<servlet>
  <servlet-name>GoogleCloudStorage</servlet-name>
  <servlet-class>my.applicaion.servlet.GoogleCloudStorageServlet</servlet-class>
</servlet>
<servlet-mapping>
  <servlet-name>GoogleCloudStorage</servlet-name>
  <url-pattern>/gcs/*</url-pattern>
</servlet-mapping>
If you host this servlet in your GAE application, the URL for accessing a GCS file in bucket bucket-name with name fileName is http://localhost:8181/gcs/bucket-name/fileName, where 8181 is the local GAE development server port number.
This works at least from GAE v1.9.50.
And if you intend to have the local GCS server working in a unit test with Jetty, here is a work-around, hopefully with the right comments:
final int localGcsPortNumber = 8081;
final Server localGcsServer = new Server(localGcsPortNumber);
final ServletContextHandler context = new ServletContextHandler(ServletContextHandler.NO_SESSIONS);
final String allPathSpec = "/*";
context.addServlet(new ServletHolder(new HttpServlet()
{
  @Override
  protected void service(HttpServletRequest request, HttpServletResponse response)
      throws ServletException, IOException
  {
    final BlobstoreService blobstoreService = BlobstoreServiceFactory.getBlobstoreService();
    final String fileName = "/gs" + request.getRequestURI();
    final BlobKey blobKey = blobstoreService.createGsBlobKey(fileName);
    if (blobKey != null)
    {
      // This is a work-around over the "ServeBlobFilter" which does not take the "Content-Type" from the "blobInfo", but attempts to retrieve it from the "blobKey"
      final BlobInfo blobInfo = BlobStorageFactory.getBlobInfoStorage().loadGsFileInfo(blobKey);
      if (blobInfo != null)
      {
        final String contentType = blobInfo.getContentType();
        if (contentType != null)
        {
          response.addHeader(HttpHeaders.CONTENT_TYPE, contentType);
        }
      }
    }
    blobstoreService.serve(blobKey, response);
  }
}), allPathSpec);
// The filter is responsible for taking the "blobKey" from the HTTP header and for fulfilling the response with the corresponding GCS content
context.addFilter(ServeBlobFilter.class, allPathSpec, EnumSet.of(DispatcherType.REQUEST));
// This attribute must be set, otherwise a "NullPointerException" is thrown
context.getServletContext().setAttribute("com.google.appengine.devappserver.ApiProxyLocal", LocalServiceTestHelper.getApiProxyLocal());
localGcsServer.setHandler(context);
localGcsServer.start();
I am trying to develop an API call using Apache CXF that takes in an attachment along with the request. I followed this tutorial and this is what I have got so far.
@POST
@Path("/upload")
@RequireAuthentication(false)
public Response uploadWadl(MultipartBody multipartBody) {
  List<Attachment> attachments = multipartBody.getAllAttachments();
  DataHandler dataHandler = attachments.get(0).getDataHandler();
  try {
    InputStream is = dataHandler.getInputStream();
  } catch (IOException e) {
    e.printStackTrace();
  }
  return Response.ok("OK").build();
}
I am getting an InputStream object for the attachment and everything is working fine. However, I need to pass the attachment as a java.io.File object to another function. I know I can create a file here, read from the input stream and write to it. But is there a better solution? Has CXF already stored it as a File? If so, I could just go ahead and use that. Any suggestions?
I'm also interested on this matter. While discussing with Sergey on the CXF mailing list, I learned that CXF is using a temporary file if the attachment is over a certain threshold.
In the process I discovered this blog post that explains how to use CXF attachments safely.
You may be interested in the example on this page as well.
That's all I can say at the moment as I'm investigating right now; I hope that helps.
EDIT: At the moment, here's how we handle attachments with CXF 2.6.x, for uploading a file using a multipart content type.
In our REST resource we have defined the following method:
@POST
@Produces(MediaType.APPLICATION_JSON)
@Consumes(MediaType.MULTIPART_FORM_DATA)
@Path("/")
public Response archive(
    @Multipart(value = "title", required = false) String title,
    @Multipart(value = "hash", required = false) @Hash(optional = true) String hash,
    @Multipart(value = "file") @NotNull Attachment attachment) {
  ...
  IncomingFile incomingFile = attachment.getObject(IncomingFile.class);
  ...
}
A few notes on that snippet:
@Multipart is not standard JAX-RS; it's not even in JAX-RS 2, it's part of CXF.
In our code we have implemented bean validation (you have to do it yourself in JAX-RS 1).
You don't have to use a MultipartBody; the key here is to use an argument of type Attachment.
So yes, as far as we know there is not yet a way to get the type we want directly in the method signature. For example, if you just want the InputStream of the attachment, you cannot put it in the signature of the method. You have to use the org.apache.cxf.jaxrs.ext.multipart.Attachment type and write the following statement:
InputStream inputStream = attachment.getObject(InputStream.class);
Also, we discovered with the help of Sergey Beryozkin that we could transform or wrap this InputStream; that's why in the above snippet we wrote:
IncomingFile incomingFile = attachment.getObject(IncomingFile.class);
IncomingFile is our custom wrapper around the InputStream. For that you have to register a MessageBodyReader; a ParamHandler won't help, as they work with Strings rather than streams.
@Component
@Provider
@Consumes
public class IncomingFileAttachmentProvider implements MessageBodyReader<IncomingFile> {

  @Override
  public boolean isReadable(Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
    return type != null && type.isAssignableFrom(IncomingFile.class);
  }

  @Override
  public IncomingFile readFrom(Class<IncomingFile> type,
                               Type genericType,
                               Annotation[] annotations,
                               MediaType mediaType,
                               MultivaluedMap<String, String> httpHeaders,
                               InputStream entityStream
                               ) throws IOException, WebApplicationException {
    return createIncomingFile(entityStream, fixedContentHeaders(httpHeaders)); // the code that will return an IncomingFile
  }
}
Note however that there have been a few trials to understand what was passed, how, and the way to hot-fix bugs (for example, the first letter of the first header of the attachment part was eaten, so you had ontent-Type instead of Content-Type).
Of course the entityStream represents the actual InputStream of the attachment. This stream will read data either from memory or from disk, depending on where CXF put the data; there is a size threshold property (attachment-memory-threshold) for that. You can also say where the temporary attachments will go (attachment-directory).
Just don't forget to close the stream when you are done (some tools do it for you).
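For reference, here is one way those two properties could be set when the endpoint is built programmatically. This is only a hedged sketch, not our actual configuration; the resource class name and the values are made up:
// Configure where CXF buffers large multipart attachments.
JAXRSServerFactoryBean sf = new JAXRSServerFactoryBean();
sf.setResourceClasses(UploadResource.class); // hypothetical resource class
Map<String, Object> properties = new HashMap<>();
properties.put("attachment-memory-threshold", "1048576");       // spill attachments above ~1 MB to disk
properties.put("attachment-directory", "/tmp/cxf-attachments"); // where the temporary files go
sf.setProperties(properties);
sf.setAddress("/api");
sf.create();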
Once everything was configured, we tested it with Rest-Assured from Johan Haleby. (Some of the code is part of our test utils, though):
given().log().all()
    .multiPart("title", "the.title")
    .multiPart("file", file.getName(), file.getBytes(), file.getMimeType())
    .expect().log().all()
    .statusCode(200)
    .body("store_event_id", equalTo("1111111111"))
    .when()
    .post(host().base().endWith("/store").toStringUrl());
Or if you need to upload the file via curl, like this:
curl --trace -v -k -f \
  --header "Authorization: Bearer b46704ff-fd1d-4225-9dd4-e29065532b73" \
  --header "Content-Type: multipart/form-data" \
  --form "hash={SHA256}3e954efb149aeaa99e321ffe6fd581f84d5a497b6fab5c86e0d5ab20201f7eb5" \
  --form "title=fantastic-video.mp4" \
  --form "archive=@/the/path/to/the/file/fantastic-video.mp4;type=video/mp4" \
  -X POST http://localhost:8080/api/video/event/store
To finish this answer, I'd like to mention that it is possible to have a JSON payload in a multipart; for that you can use an Attachment type in the signature and then write
Book book = attachment.getObject(Book.class)
Or you can write an argument like:
#Multipart(value="book", type="application/json") Book book
Just don't forget to add the Content-Type header to the relevant part when performing the request.
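For illustration, a resource method that combines a JSON part with a file part might look like the following sketch (Book, the part names and the path are hypothetical):
@POST
@Path("/books")
@Consumes(MediaType.MULTIPART_FORM_DATA)
public Response createBook(
    @Multipart(value = "book", type = "application/json") Book book,
    @Multipart(value = "cover") Attachment cover) {
  // The JSON part is unmarshalled into Book; the binary part stays an Attachment.
  InputStream coverStream = cover.getObject(InputStream.class);
  // store the book and its cover here
  return Response.ok().build();
}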
It might be worth mentioning that it is possible to have all the parts in a list; just write a method with a single argument of type List<Attachment>, as shown below. However, I prefer to have the actual arguments in the method signature, as it's cleaner and less boilerplate.
@POST
void takeAllParts(List<Attachment> attachments)