I'm trying to find an effective way of saving the result of my Spark job as a CSV file. I'm using Spark with Hadoop, and so far all my files are saved as part-00000.
Any ideas how to make Spark save to a file with a specified name?
Since Spark uses Hadoop File System API to write data to files, this is sort of inevitable. If you do
rdd.saveAsTextFile("foo")
It will be saved as "foo/part-XXXXX", with one part-* file for every partition in the RDD you are trying to save. The reason each partition in the RDD is written to a separate file is fault tolerance. If the task writing the 3rd partition (i.e. part-00002) fails, Spark simply re-runs the task and overwrites the partially written/corrupted part-00002, with no effect on the other parts. If they all wrote to the same file, it would be much harder to recover a single task from failure.
The part-XXXXX files are usually not a problem if you are going to consume them again in Spark / Hadoop-based frameworks, because they all use the HDFS API: if you ask them to read "foo", they will read all the part-XXXXX files inside foo as well.
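For example, reading the directory back picks up every part file transparently. A minimal Java sketch (using the "foo" directory from above):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadBack {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("ReadBack"));
        // Passing the directory, not an individual part file, reads
        // foo/part-00000, foo/part-00001, ... in one go.
        JavaRDD<String> lines = sc.textFile("foo");
        System.out.println(lines.count());
        sc.stop();
    }
}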
I'd suggest doing it this way (Java example):
// Write to a single partition first, then merge the part files into one output file.
theRddToPrint.coalesce(1, true).saveAsTextFile(textFileName);

// org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
FileSystem fs = FileSystem.get(new Configuration());
FileUtil.copyMerge(
    fs, new Path(textFileName),        // source: directory of part-* files
    fs, new Path(textFileNameDestiny), // destination: the single merged file
    true,                              // delete the source directory when done
    fs.getConf(), null);
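Two caveats with this approach: coalesce(1, true) forces the whole dataset through a single task, so it only makes sense when the output comfortably fits on one machine, and FileUtil.copyMerge was removed in Hadoop 3.0, so on newer clusters you would merge the part files yourself (for example with hdfs dfs -getmerge).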
Extending Tathagata Das's answer to Spark 2.x and Scala 2.11
Using Spark SQL we can do this in a one-liner:
//implicits for magic functions like .toDF
import spark.implicits._

val df = Seq(
  ("first", 2.0),
  ("choose", 7.0),
  ("test", 1.5)
).toDF("name", "vals")

//write DataFrame/Dataset to external storage
df.write
  .format("csv")
  .save("csv/file/location")
Then you can go ahead and proceed with adoalonso's answer.
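For reference, a minimal sketch of the same write in the Java API, coalescing to a single partition so only one part-* file is produced (the data and output path here are just placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvWriteJava {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("CsvWriteJava").getOrCreate();
        // Any Dataset works here; spark.range is just a stand-in for real data.
        Dataset<Row> df = spark.range(3).toDF("id");
        df.coalesce(1)                 // one partition -> a single part-* file
          .write()
          .format("csv")
          .option("header", "true")
          .save("csv/file/location");
        spark.stop();
    }
}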
I have an idea, but no ready code snippet. Internally (as the name suggests) Spark uses the Hadoop output format (as well as InputFormat when reading from HDFS).
Hadoop's FileOutputFormat has a protected static method, setOutputName, which you can call from a subclass to set a different base name for the output files.
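A minimal sketch of that idea against the mapreduce API, where a thin subclass is enough to expose the protected method (the class and base name below are made up):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Output files become e.g. "sensors-r-00000" instead of "part-r-00000".
public class NamedTextOutputFormat<K, V> extends TextOutputFormat<K, V> {
    public static void setBaseName(Job job, String name) {
        setOutputName(job, name); // protected static in FileOutputFormat
    }
}

You would register it with job.setOutputFormatClass(NamedTextOutputFormat.class) and call NamedTextOutputFormat.setBaseName(job, "sensors") before submitting; from Spark you could pass it to saveAsNewAPIHadoopFile.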
It's not really a clean solution, but inside a foreachRDD() you can basically do whatever you like, including creating a new file.
In my solution this is what I do: I save the output on HDFS (for fault tolerance reasons), and inside a foreachRDD I also create a TSV file with statistics in a local folder.
I think you could probably do the same if that's what you need.
http://spark.apache.org/docs/0.9.1/streaming-programming-guide.html#output-operations
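A minimal sketch of that pattern in the Java streaming API (the socket source and both output paths are illustrative):

import java.io.FileWriter;
import java.io.PrintWriter;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StatsSink {
    public static void main(String[] args) throws Exception {
        JavaStreamingContext ssc = new JavaStreamingContext(
                new SparkConf().setAppName("StatsSink"), Durations.seconds(10));
        JavaDStream<String> stream = ssc.socketTextStream("localhost", 9999);
        stream.foreachRDD((rdd, time) -> {
            // Fault-tolerant copy on HDFS, one directory per batch.
            rdd.saveAsTextFile("hdfs:///stats/batch-" + time.milliseconds());
            // Local TSV with per-batch statistics, appended on each batch.
            try (PrintWriter out = new PrintWriter(new FileWriter("/tmp/stats.tsv", true))) {
                out.println(time.milliseconds() + "\t" + rdd.count());
            }
        });
        ssc.start();
        ssc.awaitTermination();
    }
}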
I have a file ontobible.owl. How can I extract data from that file and then save it to MySQL (I want to display data from ontobible.owl on a website)? Can anyone help me?
Edited:
Here is my ontobible.owl file (https://teamtrainit.com/ontobible.owl)
I've tried opening ontobible.owl with Sublime Text 3, and it contains entries like this:
<Verse rdf:about="http://www.semanticweb.org/budsus/ontologies/2021/7/ontobible#HOS5_2">
    <verseID>HOS5_2</verseID>
    <verse_text>And the revolters are profound to make slaughter, though I have been a rebuker of them all.</verse_text>
</Verse>
<Verse rdf:about="http://www.semanticweb.org/budsus/ontologies/2021/7/ontobible#2CH2_1">
    <hasPerson rdf:resource="http://semanticbible.org/ns/2006/NTNames#god_1324"/>
    <hasPerson rdf:resource="http://www.co-ode.org/roberts/family-tree.owl#solomon_2762"/>
    <verseID>2CH2_1</verseID>
    <verse_text>And Solomon determined to build an house for the name of the LORD, and an house for his kingdom.</verse_text>
</Verse>
How can I convert those XML tags to an array or JSON so I can save them to a MySQL database?
You have several options for extracting data from OWL:
1. Use the OWL API and write Java code (I think the OWL API is accessible from other languages too) to extract the data and pack it in the format you need. You can also extract data with SPARQL queries via the Jena API; a sketch of this option follows below.
2. Install Protégé, open your file in Protégé, and save it in the JSON-LD format. This format is very similar to regular JSON, and you can easily transform it for your needs.
3. Install a Fuseki server, add your file, and extract data from it using SPARQL queries.
I think the second option is the easiest to start with if you don't want to write queries or code, and it won't take long.
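If you do go with the first option, a minimal Jena sketch might look like this; the namespace and property names are inferred from the snippet in the question, and each printed row is ready for an INSERT into MySQL:

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.ResIterator;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.riot.RDFDataMgr;

public class OntoBibleExport {
    // Namespace taken from the rdf:about URIs in the posted snippet.
    static final String NS = "http://www.semanticweb.org/budsus/ontologies/2021/7/ontobible#";

    public static void main(String[] args) {
        Model model = RDFDataMgr.loadModel("ontobible.owl"); // RDF/XML is detected automatically
        Property verseId = model.getProperty(NS + "verseID");
        Property verseText = model.getProperty(NS + "verse_text");
        ResIterator verses = model.listResourcesWithProperty(verseId);
        while (verses.hasNext()) {
            Resource verse = verses.next();
            // One row per verse, e.g. for INSERT INTO verses (verse_id, verse_text).
            System.out.println(verse.getProperty(verseId).getString()
                    + "\t" + verse.getProperty(verseText).getString());
        }
    }
}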
Is there a way to download more than 100MB of data from Snowflake into excel or csv?
I'm able to download up to 100MB through the UI by clicking the 'download or view results' button.
You'll need to consider using what we call "unload", a.k.a. COPY INTO LOCATION
which is documented here:
https://docs.snowflake.net/manuals/sql-reference/sql/copy-into-location.html
Other options might be to use a different type of client (python script or similar).
I hope this helps...Rich
.....EDITS AS FOLLOWS....
Using the unload (COPY INTO LOCATION) isn't quite as overwhelming as it may appear to be. If you can use the SnowSQL client (instead of the web UI), you can "grab" the files from what we call an "INTERNAL STAGE" fairly easily. Example as follows:
CREATE TEMPORARY STAGE my_temp_stage;

COPY INTO @my_temp_stage/output_filex
FROM (SELECT * FROM databaseNameHere.SchemaNameHere.tableNameHere)
FILE_FORMAT = (
  TYPE='CSV'
  COMPRESSION=GZIP
  FIELD_DELIMITER=','
  ESCAPE=NONE
  ESCAPE_UNENCLOSED_FIELD=NONE
  DATE_FORMAT='AUTO'
  TIME_FORMAT='AUTO'
  TIMESTAMP_FORMAT='AUTO'
  BINARY_FORMAT='UTF-8'
  FIELD_OPTIONALLY_ENCLOSED_BY='"'
  NULL_IF=''
  EMPTY_FIELD_AS_NULL=FALSE
)
OVERWRITE=TRUE
SINGLE=FALSE
MAX_FILE_SIZE=5368709120
HEADER=TRUE;

ls @my_temp_stage;

GET @my_temp_stage file:///tmp/ ;
This example:
Creates a temporary stage object in Snowflake, which will be discarded when you close your session.
Takes the results of your query and loads them into one (or more) CSV files in that internal temporary stage, depending on the size of your output. Notice that I didn't create another database object called a "FILE FORMAT"; it's considered a best practice to do so, but you can do one-off extracts like this without creating that separate object if you don't mind the command being so long.
Lists the files in the stage, so you can see what was created.
Pulls the files down using GET. In this case it was run on my Mac and the file(s) were placed in /tmp; if you are using Windows you will need to modify the local path a little (e.g. GET @my_temp_stage file://C:\temp\;).
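One more note: because the FILE_FORMAT above sets COMPRESSION=GZIP, the files GET pulls down will be gzipped (*.csv.gz), so either gunzip them first or set COMPRESSION=NONE if you want to open them directly in Excel.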
I am searching for a simple data storage solution, interfaced from local bash (write and read).
Background: I'm collecting sensor data and saving values with timestamps (currently in a text file; every week a new file is created).
I'd like to visualize the data on request with the help of PHP.
Is there a database (like SQLite) which can be easily written to from bash?
You could try rrdtool. It's a round robin time series db, perfect for visualizations.
sqlite3 can take a query as argument (see "Using sqlite3 in a shell script").
DB='example'
VAR='sensordata'
QUERY="INSERT INTO table(column) VALUES ('${VAR}')"
sqlite3 "$DB" "$QUERY"
What you won't get is string escaping, so you need to be sure that $VAR is safe from SQL injection; at minimum, double any single quotes in it ('' inside a SQLite string literal stands for a literal ') before interpolating it into the query.
I'm trying to create an SSIS package to import some dataset files. However, given that I seem to be hitting a brick wall every time I achieve a small part of the task, I need to take a step back and perform a sanity check on what I'm trying to achieve. If you good people can advise whether SSIS is the way to go about this, then I would appreciate it.
These are my questions from this morning :-
debugging SSIS packages - debug.writeline
Changing an SSIS dts variables
What I'm trying to do is have a For..Each container enumerate over the files in a share on the SQL Server. For each file it finds, a script task runs to check various attributes of the filename, such as looking for a three-letter code, a date in CCYYMM, the name of the data contained therein, and optionally some comments. For example:
ABC_201007_SalesData_[optional comment goes here].csv
I'm looking to parse the name using a regular expression and put the values of 'ABC', '201007', and 'SalesData' in variables.
I then want to move the file to an error folder if it doesn't meet certain criteria :-
Three character code
Six character date
Dataset name (e.g. SalesData, in this example)
CSV extension
I then want to look up the character code, the date (or part thereof), and the dataset name against a lookup table to mark off a 'checklist' of received files from each client.
Then, based on the entry in the checklist, I want to kick off another SSIS package.
So, for example I may have a table called 'Checklist' with these columns :-
Client code   Dataset     SSIS_Package
ABC           SalesData   NorthSalesData.dtsx
DEF           SalesData   SouthSalesData.dtsx
If anyone has a better way of achieving this I am interested in hearing about it.
Thanks in advance
That's an interesting scenario, and should be relatively easy to handle.
First, your choice of the Foreach Loop is a good one. You'll be using the Foreach File Enumerator. You can restrict the files you iterate over to be just CSVs so that you don't have to "filter" for those later.
The Foreach File Enumerator puts the filename (full path or just file name) into a variable - let's call that "FileName". There are (at least) two ways you can parse that - expressions or a Script Task. It depends which one you're more comfortable with. Either way, you'll need to create three variables to hold the "parts" of the filename - I'll call them "FileCode", "FileDate", and "FileDataset".
To do this with expressions, you need to set the EvaluateAsExpression property on FileCode, FileDate, and FileDataset to true. Then in the expressions, you need to use FINDSTRING and SUBSTRING to carve up FileName as you see fit. Expressions don't have Regex capability.
To do this in a Script Task, pass the FileName variable in as a ReadOnly variable, and the other three as ReadWrite. You can use the Regex capabilities of .Net, or just manually use IndexOf and Substring to get what you need.
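For the Regex route, a pattern like the one below captures the three parts plus the optional comment; it's sketched in Java purely for illustration, but the same pattern works verbatim in .NET:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FileNameParser {
    // ABC_201007_SalesData_[optional comment].csv
    private static final Pattern NAME = Pattern.compile(
            "^([A-Za-z]{3})_(\\d{6})_([A-Za-z]+)(?:_(.*))?\\.csv$");

    public static void main(String[] args) {
        Matcher m = NAME.matcher("ABC_201007_SalesData_late resubmission.csv");
        if (m.matches()) {
            String fileCode = m.group(1);    // -> FileCode variable
            String fileDate = m.group(2);    // -> FileDate variable
            String fileDataset = m.group(3); // -> FileDataset variable
            System.out.println(fileCode + " " + fileDate + " " + fileDataset);
        } else {
            System.out.println("no match: route the file to the error folder");
        }
    }
}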
Unfortunately, you have just missed the SQLLunch livemeeting on the ForEach loop: http://www.bidn.com/blogs/BradSchacht/ssis/812/sql-lunch-tomorrow
They are recording the session, however.
I just want to know if there is any way we can read a value from an .xls file using a .bat file.
For example: if I have an .xls file named test.xls which has two columns, namely 'EID' and 'mail ID', then when we give an EID as input to the .xls, it should extract the mail ID which corresponds to that EID and echo the result out.
EID      MailID
E22222   MynameisA@company.com
E33333   MynameisB@company.com
...
...
So, going by the above table, when I give the input E22222 to the .xls file from my .bat file, it should read the corresponding mail ID MynameisA@company.com and echo the value.
I hope I have been able to explain my question. Please get back to me for more clarifications.
Thanks and regards
Maddy
There is no facility to do this directly with traditional .bat files. However, you might investigate PowerShell, which is designed to be able to do this sort of thing. PowerShell integrates well with existing Windows applications (such as Excel) and may provide the tools you need to do this easily.
A quick search turned up this example of reading Excel files from PowerShell.
You can't do this directly from a batch file. Furthermore, to manipulate Excel files in scripting you need Excel to be installed.
What you can do is wrap the Excel-specific stuff in a VBScript and call that from your batch.
You can do it with Alacon - a command-line utility for the Alasql database.
It works with Node.js, so you need to install Node.js and then the Alasql package.
To take data from an Excel file you can use the following command:
> node alacon "SELECT VALUE [mail ID] FROM XLS('mydata.xls', {headers:true}) WHERE EID = ?" "E22222"
The first parameter is a SQL expression, which reads data from the XLS file (with headers) and searches for the second parameter's value, "E22222". The command returns the mail ID value.
This will be hard (very close to impossible) in BAT, especially when using the original XLS file, but even after an export to CSV it will be much easier to use a script/programming language (Perl, C, whatever) to do this.