Unable to see any columns in table after running AWS Glue crawler - database

I am relatively new to AWS Glue, but after creating my crawler and running it successfully, I can see that a new table has been created but I can't see any columns in that table. It is absolutely blank.
I am using a .csv file from a S3 bucket as my data source.

Is your file UTF-8 encoded? Glue has problems if it is not.
Does your file have at least 2 records?
Does the file have more than one column?
Several factors can prevent the crawler from recognizing a CSV file.
Please refer to this documentation, which describes the built-in classifier and what it needs to crawl a CSV file properly:
https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html
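If it helps, the three checks above can be run locally before pointing the crawler at the file. This is only a rough sketch: the function name check_csv_for_crawler is made up for illustration, and it tests just the conditions listed in this answer, not everything the built-in classifier looks at.

```python
import csv
import io


def check_csv_for_crawler(raw: bytes) -> list[str]:
    """Return a list of problems matching the checklist in this answer."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Nothing else can be checked reliably if the bytes are not UTF-8.
        return ["file is not valid UTF-8"]

    # Parse the text as CSV and ignore completely empty lines.
    rows = [row for row in csv.reader(io.StringIO(text, newline="")) if row]

    problems = []
    if len(rows) < 2:
        problems.append("file has fewer than 2 records")
    if rows and max(len(row) for row in rows) < 2:
        problems.append("file has only one column")
    return problems
```

You would read the file with open(path, "rb").read() and pass the bytes in; an empty list means none of the three checks failed.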

Related

creating AWS glue metadata from the command line

I want to create Glue metadata based on an already existing table, but I want to read in values from a different S3 bucket. So basically the location parameter in my DDL script and the name of the table will be different.
Can someone help me figure out how to do this via the command line? I have been going through the AWS documentation but haven't found anything helpful yet.
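One possible approach, sketched here as a small Python script run from the command line with boto3 (an assumption; the question does not name a tool): fetch the existing table definition, keep only the fields a new table definition accepts, then change the name and the S3 location. The helper names and the list of copied keys are illustrative, so check the key list against the Glue API reference before relying on it.

```python
import copy

# Keys assumed to be accepted in a Glue TableInput. The table returned by
# get_table also carries read-only fields (for example CreateTime) that
# must not be sent back, so only these keys are copied.
TABLE_INPUT_KEYS = (
    "Description",
    "Owner",
    "Retention",
    "StorageDescriptor",
    "PartitionKeys",
    "TableType",
    "Parameters",
)


def build_table_input(existing: dict, new_name: str, new_location: str) -> dict:
    """Copy an existing table definition, changing its name and S3 location."""
    table_input = {
        key: copy.deepcopy(existing[key])
        for key in TABLE_INPUT_KEYS
        if key in existing
    }
    table_input["Name"] = new_name
    table_input.setdefault("StorageDescriptor", {})["Location"] = new_location
    return table_input


def clone_table(database: str, source: str, new_name: str, new_location: str) -> None:
    """Create new_name in the same database, pointing at new_location."""
    import boto3  # imported lazily so build_table_input works without AWS access

    glue = boto3.client("glue")
    existing = glue.get_table(DatabaseName=database, Name=source)["Table"]
    glue.create_table(
        DatabaseName=database,
        TableInput=build_table_input(existing, new_name, new_location),
    )
```

The same two steps should also be possible with the AWS CLI commands aws glue get-table and aws glue create-table, editing the JSON in between.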

What would be the best way to store JSON files for use within a React application?

I have a python script that fetches data twice a day from a server of mine. The script returns around 40 JSON files containing various data. The files aren't particularly big and the combined size of all the files is around 250KB.
Alongside my script I am developing a dashboard in React that renders the data from each file into a table, allowing me a visual representation of the data.
I have been looking at what would be the best way to store these files, something that allows me to upload and fetch them twice a day.
Someone suggested using MongoDB to store the files, but after some research I feel that Mongo is better at storing the contents of a file than the file itself. I tried to develop a solution, but each object is stored as a document and I could see no clear way to tell which document came from which file.
Other options I have considered are:
Storing the files on the server that is hosting my React project and rendering them locally as I am doing now during development
Storing the files using a provider such as AWS/Firebase
Storing them in a different database (I see that SQL databases now support storing JSON)
Are there any other solutions that you think would work best for this scenario? If so, why?
Hello,
Consider using an FTP server.
We have clients that send us data every 10 minutes via FTP as XML files, and a NodeJS back end reads those files.
The same approach would work for your scenario with JSON files.

How to upload .gz files into Google Big Query?

I want to build a 90 GB .csv file on my local computer and then upload it into Google BigQuery for analysis. I create this file by combining thousands of smaller .csv files into 10 medium-sized files and then combining those medium-sized files into the 90 GB file, which I then want to move to GBQ. I am struggling with this project because my computer keeps crashing from memory issues.
From this video I understood that I should first transform the medium-sized .csv files (about 9 GB each) into .gz files (about 500 MB each), and then upload those .gz files into Google Cloud Storage. Next, I would create an empty table (in Google BigQuery / Datasets) and then append all of those files to the created table.
The issue I am having is finding any tutorial or documentation on how to do this. I am new to the Google platform, so maybe this is a very easy job that can be done with one click somewhere, but all I was able to find was the video that I linked above.
Where can I find some help, documentation, tutorials or videos on how people do this? Do I have the correct idea of the workflow? Is there some better way (like using some downloadable GUI to upload stuff)?
See the instructions here:
https://cloud.google.com/bigquery/bq-command-line-tool#creatingtablefromfile
As Abdou mentions in a comment, you don't need to combine them ahead of time. Just gzip all of your small CSV files, upload them to a GCS bucket, and use the "bq.py load" command to create a new table. Note that you can use a wildcard syntax to avoid listing all of the individual file names to load.
The --autodetect flag may allow you to avoid specifying a schema manually, although this relies on sampling from your input and may need to be corrected if it fails to detect in certain cases.
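As a rough sketch of the gzip step described above, something like the following Python could compress every small CSV in a folder without loading whole files into memory. The function name gzip_csv_files and the directory layout are assumptions for illustration only.

```python
import gzip
import shutil
from pathlib import Path


def gzip_csv_files(source_dir: str, target_dir: str) -> list[Path]:
    """Write a .csv.gz copy of every .csv file in source_dir into target_dir."""
    out_dir = Path(target_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for csv_path in sorted(Path(source_dir).glob("*.csv")):
        gz_path = out_dir / (csv_path.name + ".gz")
        # copyfileobj streams in chunks, so a large file never has to fit in memory.
        with open(csv_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(gz_path)
    return written
```

After uploading the resulting files to a bucket, the load would look something like bq load --autodetect --source_format=CSV mydataset.mytable "gs://my-bucket/*.csv.gz", where the dataset, table and bucket names are placeholders.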

lucene indexing security files

I'll try to briefly describe my problem and task.
My task is to create a search engine for different types of files (text file types only): PDF, Word, ODF and XML, but not HTML.
I have a little experience with Lucene: about a year ago I wrote a simple full-text search using Lucene and Hibernate Search. That was a simple project, but now I have a much more difficult search task.
We are using Java 1.7 and GlassFish 3, and I have to concentrate only on the server side, not the client UI. These are my three major problems:
1) All files are stored on a WebDAV server, but the information about each file (name, id, file type, etc.) is stored in a database (PostgreSQL), so when creating the index I need to use both sources. As a query result I only need to return the file id from the database. In summary, the file content is stored on the server but the information about the file is stored in the database, so we must retrieve both.
2) The second problem is that each file has a level of secrecy, and the major difficulty is that this level is calculated dynamically. When calculating the security level for a file we consider several properties: static ones such as the file's location (the folder the file is in), but also dynamic information such as user profiles, user roles and departments. So when user "Maggie" is logged in she can search only the files "test.pdf", "test2.doc", etc., but when user "Stev" is logged in, he has different profiles from Maggie, so he can only search for a phrase in the files "broken.pdf", "mybook.odt", "test2.doc", etc. I think that when a user searches for a phrase such as "lucene +solr", we would search all indexed documents and filter the results afterwards. But I think that solution is not very efficient. What if the result contains 100 files? Do we then filter each file step by step? I do not see any other solution. Maybe you can help me, and Lucene or Solr has a mechanism for this.
3) The last problem is that some files are encrypted, so those files must be indexed only once, before encryption. But I think that indexing secure files creates a security issue, because every word from the file is tokenized.
I have no idea how to secure Lucene documents and the index datastore. Is it possible?
I also have a question: do I need to use Solr for my search engine, or should I use only Lucene and write my own search engine? As you can see, my problem is not with indexing or searching but with file security and the files' security levels.
Thanks for any hints and for the time you spend on this.
To index both the file and its metadata from the DB, check ExtractingRequestHandler.
You can pass the metadata attributes and the file to be indexed in a single request, and they are stored as a single document in the Lucene index.
For security, one option is to store the users/roles that have access to the files/documents within the Solr index.
You can then always filter the results by user/role to retrieve only the permitted results.
Secure your Solr URL so that users don't have direct access to the documents.
Also check SOLR-1872.
For encryption, Solr and the underlying parser, Tika, handle encrypted files when given additional parameters.
Apache Solr uses Apache Tika, which uses the Bouncy Castle generic encryption libraries to extract text content and metadata from encrypted PDF files. See http://www.bouncycastle.org/ for more details on Bouncy Castle.

creating a video database

I am interested in creating a video database. My goal is to have a folder where my videos will be kept, and each time I copy or delete a video, the website that presents them should be updated too. The problem is that I have no idea how to approach it.
Should I..
Use Sql and store a reference to each video location?
Have a script that checks all the time if new changes happen in that folder?
A package like joomla?
I am using Ubuntu, by the way. I already have a simple HTML5 page, and I am presenting the videos using HTML5 video.
It depends on the size and the performance you want.
Way 1: use PHP to scan the folder and generate links on the fly.
Way 2: use a database to store the file names, then retrieve the names from the database and generate the URLs.
Pros and cons:
Way 1: simple to implement, no changes to the upload or download script, no database required.
Way 2: you need a database, and a little coding is required for uploads and when generating a page.
You should create a DB (the format does not matter) and store in it only the file names of the videos; the videos themselves would be stored on the hard drive.
Any operation on the website will go first through the DB to insert/update/delete video records and then (maybe in a transaction context) through the file system.
This would be the standard approach to your problem.
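For what it's worth, here is a small sketch that combines the folder-scanning and database ideas: it treats the folder as the source of truth and brings a table of file names in line with it. It uses Python and SQLite rather than PHP, and the table name, function name and extension list are assumptions for illustration.

```python
import os
import sqlite3

VIDEO_EXTENSIONS = (".mp4", ".webm", ".ogv")  # assumed set of video types


def sync_videos(conn: sqlite3.Connection, folder: str) -> tuple[int, int]:
    """Make the videos table match the folder; return (added, removed) counts."""
    conn.execute("CREATE TABLE IF NOT EXISTS videos (filename TEXT PRIMARY KEY)")

    on_disk = {
        name for name in os.listdir(folder)
        if name.lower().endswith(VIDEO_EXTENSIONS)
    }
    in_db = {row[0] for row in conn.execute("SELECT filename FROM videos")}

    added = on_disk - in_db
    removed = in_db - on_disk
    with conn:  # commit both changes together, or roll back on error
        conn.executemany(
            "INSERT INTO videos (filename) VALUES (?)",
            [(name,) for name in sorted(added)],
        )
        conn.executemany(
            "DELETE FROM videos WHERE filename = ?",
            [(name,) for name in sorted(removed)],
        )
    return len(added), len(removed)
```

Running this from a scheduled job (or on each page load for a small folder) would keep the list the page reads from in step with the files you copy or delete.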

Resources