Reading files from S3 bucket with camel - apache-camel

I am new to Camel and need some guidance. I need to read some files from an S3 bucket. The structure is like so.
S3 Bucket
```
Incoming
  +xls
    -file1.xls
    -file2.xls
    -file3.xls
  +doc
    -file1.doc
    -file2.doc
    -file3.doc
Processed
  +xls
    ...
  +doc
    ...
```
When a particular Excel file is dropped into the incoming/xls folder (say file1.xls), I need to pick up all the files, do some processing and drop them into a processed folder with the same directory structure.
What components do I need to use for this? I tried reading the documentation, but it's a little difficult to figure out which components I need. I understand that I will use the camel-aws-s3 component, but there are not many examples of it out there.

The component documentation at https://camel.apache.org/components/latest/aws-s3-component.html has some examples of writing to and reading from an S3 bucket.
Besides reading from and writing to S3, you might need a custom processor that uses Apache POI to transform the xls files.

Related

What would be the best way to store JSON files for use within a React application?

I have a Python script that fetches data twice a day from a server of mine. The script returns around 40 JSON files containing various data. The files aren't particularly big; their combined size is around 250KB.
Alongside my script I am developing a dashboard in React that renders the data from each file into a table, allowing me a visual representation of the data.
I have been looking at what would be the best way to store these files, something that allows me to upload and fetch them twice a day.
Someone suggested using MongoDB to store the files, but after some research I feel like Mongo is better at storing the contents of a file than the file itself. I tried to develop a solution, but I couldn't figure out how to do it when each object is stored as a document with no clear way (to me) to tell which document came from which file.
Other options I have considered are:
Storing the files on the server that is hosting my React project and rendering them locally as I am doing now during development
Storing the files using a provider such as AWS/Firebase
Storing them in a different database (I see SQL databases now support storing JSON)
Are there any other solutions that you think would work best for this scenario? If so, why?
Hello,
Consider using an FTP server.
We have clients that send us data in XML files via FTP every 10 minutes, and a NodeJS back end that reads those files.
The same approach would work for your scenario with JSON files.
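On the MongoDB concern raised in the question (not knowing which document came from which file), one possible approach is to wrap each file's parsed contents in a single document that records the file name. The following is a minimal sketch using only the Python standard library. The field names `source_file` and `data` and the function name `files_to_documents` are hypothetical, and the actual insert into a database is left out.

```python
import json
from pathlib import Path


def files_to_documents(json_dir):
    """Return one dict per .json file in json_dir, tagged with its file name.

    The keys "source_file" and "data" are illustrative choices, not a
    requirement of any particular database.
    """
    documents = []
    for path in sorted(Path(json_dir).glob("*.json")):
        with open(path, encoding="utf-8") as handle:
            documents.append({"source_file": path.name, "data": json.load(handle)})
    return documents
```

Each resulting dict could then be stored as one document, so a lookup by `source_file` returns exactly the contents of that file.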

Snowpipe: Importing historical data that may be modified without causing re-imports

To start with, I'm not sure if this is possible with the existing features of Snowpipe.
I have an S3 bucket with years of data, and occasionally some of those files get updated (the contents change, but the file name stays the same). I was hoping to use Snowpipe to import these files into Snowflake, as the "we won't reimport files that have been modified" aspect is appealing to me.
However, I discovered that ALTER PIPE ... REFRESH can only be used to import files staged no earlier than seven days ago, and the only other recommendation Snowflake's documentation has for importing historical data is to use COPY INTO .... If I use that and those old files are later modified, they get imported again via Snowpipe, because the load metadata that prevents COPY INTO ... from re-importing the S3 files is separate from the load metadata for Snowpipe. I can therefore end up with the same file imported twice.
Is there any approach, short of "modify all those files in S3 so they have a recent modified-at timestamp", that would let me use Snowpipe with this?
If you're not opposed to a scripting solution, one option is to write a script that pulls the set of in-scope object names from AWS S3 and feeds them to the Snowpipe REST API. The code for this is very similar to what you would write if you were using an AWS Lambda to call the Snowpipe REST API when triggered by an S3 event notification. You can either use the AWS SDK to get the set of objects from S3, or just use Snowflake's LIST statement on the stage to pull them.
I've used this approach multiple times to backfill historical data from an AWS S3 location where we enabled Snowpipe ingestion after data had already been written there. Even if you don't have to worry about a file being updated in place, this can still be an advantage over falling back to a direct COPY INTO: you don't have to worry about overlap between the moment the PIPE was first enabled and the set of files you push to the Snowpipe REST API, since the PIPE load history takes care of that for you.
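The backfill loop described above could be sketched roughly as follows. This is not a complete Snowpipe client: `submit` is a hypothetical callable standing in for whatever code actually sends a list of staged file names to the Snowpipe REST API, and the default batch size of 5000 is an assumption about the per-request file limit that should be checked against the current Snowflake documentation.

```python
def batched(names, batch_size):
    """Yield consecutive slices of names, each at most batch_size long."""
    for start in range(0, len(names), batch_size):
        yield names[start:start + batch_size]


def backfill(object_names, submit, batch_size=5000):
    """Send object names to `submit` in batches and return how many were sent.

    `submit` is a placeholder for the real Snowpipe REST call; it receives
    one list of file names per invocation.
    """
    submitted = 0
    for batch in batched(sorted(object_names), batch_size):
        submit(batch)
        submitted += len(batch)
    return submitted
```

The list of object names would come from the AWS SDK or from a LIST on the stage, as described above.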

Google cloud storage create zip file with all files from a bucket folder with app engine

I want to create a zip file containing all the files in a bucket folder and write this zip file back to Google Cloud Storage.
I want to do this in the App Engine standard environment, but I didn't find a good example of how to do it.
If the temporary file normally needed while creating the zip fits in the available memory of your instance class, you may be able to use the StringIO facility and avoid writing to the filesystem. For an example, see "How to zip or tar a static folder without writing anything to the filesystem in python?"
It may also be possible to write the zip file directly to GCS, basically using the GAE app as a pipeline, which might circumvent the instance memory limitation mentioned above. You'd have to try it out, though; I don't have an actual example. The things to watch for are picking the right file handler arguments and possibly the buffering options. For an example of accessing a GCS file directly (though you would write to it rather than read from it), see "How to open gzip file on gae cloud?"
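The in-memory approach from the first paragraph could look roughly like this. It is a minimal sketch using only the standard library and assumes Python 3, where `io.BytesIO` plays the role that StringIO had for binary data in Python 2. Reading the source files from the bucket and writing the result back to GCS are left out; the function name `zip_in_memory` and the input mapping are hypothetical.

```python
import io
import zipfile


def zip_in_memory(files):
    """Build a zip archive in memory.

    `files` maps archive member names to their contents as bytes.
    Returns the complete zip file as a bytes object.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    # The archive is finalized when the with-block exits, so the buffer is complete here.
    return buffer.getvalue()
```

The returned bytes could then be written to a GCS object in one call. Because the whole archive lives in memory, this only works when the zip fits within the instance's memory limit.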

How to upload .gz files into Google Big Query?

I want to build a 90 GB .csv file on my local computer and then upload it into Google BigQuery for analysis. I create this file by combining thousands of smaller .csv files into 10 medium-sized files and then combining those medium-sized files into the 90 GB file, which I then want to move to GBQ. I am struggling with this project because my computer keeps crashing from memory issues.
From this video I understood that I should first transform the medium-sized .csv files (about 9 GB each) into .gz files (about 500MB each), and then upload those .gz files into Google Cloud Storage. Next, I would create an empty Table (in Google BigQuery / Datasets) and then append all of those files to the created Table.
The issue I am having is finding any tutorial or documentation on how to do this. I am new to the Google Platform, so maybe this is a very easy job that can be done with one click somewhere, but all I was able to find was the video linked above. Where can I find help, documentation, tutorials or videos on how people do this? Do I have the correct idea of the workflow? Is there a better way (like using some downloadable GUI to upload stuff)?
See the instructions here:
https://cloud.google.com/bigquery/bq-command-line-tool#creatingtablefromfile
As Abdou mentions in a comment, you don't need to combine them ahead of time. Just gzip all of your small CSV files, upload them to a GCS bucket, and use the "bq.py load" command to create a new table. Note that you can use a wildcard syntax to avoid listing all of the individual file names to load.
The --autodetect flag may allow you to avoid specifying a schema manually, although this relies on sampling from your input and may need to be corrected if it fails to detect in certain cases.
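The "gzip all of your small CSV files" step could be done with a short script like the one below. This is a minimal sketch using only the Python standard library; the function name `gzip_csv_files` and the directory arguments are hypothetical. Files are streamed in chunks, so no file is loaded into memory as a whole. Uploading to GCS and running the load command are separate steps not shown here.

```python
import gzip
import shutil
from pathlib import Path


def gzip_csv_files(source_dir, target_dir):
    """Compress every .csv file in source_dir into target_dir as <name>.csv.gz.

    Returns the list of paths that were written.
    """
    source = Path(source_dir)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for csv_path in sorted(source.glob("*.csv")):
        gz_path = target / (csv_path.name + ".gz")
        # copyfileobj streams in chunks, keeping memory use small and constant.
        with open(csv_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(gz_path)
    return written
```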

grails file upload

Hey. I need to upload some files (images/pdf/pp) to my SQLS Database and download them again later. I'm not sure what the best solution is: store them as bytes, or store them as files (not sure if that is possible). Later I also need to databind multiple domain classes together with that file upload.
Any help would be very much appreciated,
JM
Whether to save files in the file system or in the DB is a general question that has been asked here several times.
Check this: Store images(jpg,gif,png) in filesystem or DB?
I recommend saving the files in the file system and storing just the path in the DB.
(If you want to work with Google App Engine, though, you have to save the file as a byte array in the DB, since saving files in the file system is not possible there.)
To upload a file with Grails, check this: http://www.grails.org/Controllers+-+File+Uploads
