Is it possible to pass a variable from the Spark interpreter (pyspark or sql) to Markdown? The requirement is to display nicely formatted text (i.e. Markdown) such as "20 events occurred between 2017-01-01 and 2017-01-08", where the 20, 2017-01-01 and 2017-01-08 are dynamically populated based on output from other paragraphs.
Posting this for the benefit of other users; this is what I have been able to find:
Markdown paragraphs can only contain static text.
But it is possible to achieve dynamically formatted text output with the Angular interpreter instead.
(First paragraph)
%spark
// create data frame
val eventLogDF = ...
// register temp table for SQL access
eventLogDF.registerTempTable( "eventlog" )
val query = sql( "select max(Date), min(Date), count(*) from eventlog" ).take(1)(0)
val maxDate = query(0).toString()
val minDate = query(1).toString()
val evCount = query(2).toString()
// bind variables which can be accessed from angular interpreter
z.angularBind( "maxDate", maxDate )
z.angularBind( "minDate", minDate )
z.angularBind( "evCount", evCount )
(Second paragraph)
%angular
<div>There were <b>{{evCount}} events</b> between <b>{{minDate}}</b> and <b>{{maxDate}}</b>.</div>
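The binding does not have to be done from Scala. If your logic lives in a pyspark paragraph, the same ZeppelinContext calls are available there; a minimal sketch (assuming the eventlog temp table registered above already exists):
%pyspark
# query the temp table registered in the first paragraph
row = sqlContext.sql("select max(Date), min(Date), count(*) from eventlog").collect()[0]
# bind variables so the %angular paragraph can read them
z.angularBind("maxDate", str(row[0]))
z.angularBind("minDate", str(row[1]))
z.angularBind("evCount", str(row[2]))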
You can also print out Markdown by translating it into HTML first. This is useful if you already have a Markdown template for output, or if your Zeppelin environment has no Angular interpreter (e.g. a K8s deployment).
First, install markdown2.
%sh
pip install markdown2
Then use it:
%pyspark
import markdown2
# prepare your markdown string
markdown_string = template_mymarkdown.format(**locals())
# use Zeppelin %html for output
print("%html", markdown2.markdown(markdown_string, extras=["tables"]))
The Classic Snowflake Web UI and the new Snowsight are great at importing SQL from a file, but neither allows you to export SQL to a file. Is there a workaround?
You can use an IDE to connect to Snowflake and write queries. The scripts can then be downloaded using IDE features and synced with a Git repo as well.
DBeaver is one such IDE that supports Snowflake:
https://hevodata.com/learn/dbeaver-snowflake/
The query pane is interactive, so the obvious workaround is:
CTRL + A (select all)
CTRL + C (copy)
<open_favourite_text_editor>
CTRL + V (paste)
CTRL + S (save)
This tool can help you while the team develops a native feature to export worksheets:
"Snowflake Snowsight Extensions wrap Snowsight features that do not have API or SQL alternatives, such as manipulating Dashboards and Worksheets, and retrieving Query Profile and step timings."
https://github.com/Snowflake-Labs/sfsnowsightextensions
Further explained on this post:
https://medium.com/snowflake/importing-and-exporting-snowsight-dashboards-and-worksheets-3cd8e34d29c8
For example, to save to a file within PowerShell:
PS > $dashboards | foreach {$_.SaveToFolder("path/to/folder")}
PS > $dashboards[0].SaveToFile("path/to/folder/mydashboard.json")
ETA: I'm adding this edit to the front because this is what actually worked.
Again, BSON was a dead end & punycode is irrelevant. I don't know why punycode is referenced in the metadata file; but my best guess is that they might use punycode to encode the worksheet name itself (though I'm not sure why that would be needed since it shouldn't need to be part of a URL).
After doing terrible things and trying a number of complex ways of dealing with escape character hell, I found that the actual encoding is very simple. It just works as an 8 bit encoding with anything that might cause problems escaped away (null, control codes, double quotes, etc.). To load, treat the file as a text file using an 8-bit encoding; extract the data as a JSON field, then re-encode that extracted data as that same encoding. I just used latin_1 to read; but it may not even matter which encoding you use as long as you are consistent and use the same one to re-encode. The encoded field will then be valid zlib compressed data.
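In code, that approach looks roughly like this (a minimal sketch; worksheet.json stands in for one of the downloaded worksheet files):
import json
import zlib
# read the downloaded file with a plain 8-bit encoding
with open('worksheet.json', 'r', encoding='latin_1') as f:
    data = json.load(f)
# re-encode the extracted field with the same 8-bit encoding;
# the resulting bytes are valid zlib-compressed data
body_bytes = data['wsContents']['body'].encode('latin_1')
sql_text = zlib.decompress(body_bytes).decode('utf8')
print(sql_text)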
I decided that I wanted to start from scratch, so I needed to back up the worksheets first, and I made a Python script based on my findings above. Be warned that this may return even worksheets that you previously closed for good. After running it and verifying that backups were created, I just ran rm @~/worksheet_data/;, closed the tab & reopened it.
Here's the code (fill in the appropriate base directory location):
import os
from collections import OrderedDict
import configparser
from sqlalchemy import create_engine, exc
from snowflake.sqlalchemy import URL
import pathlib
import json
import zlib
import string
def format_filename(s: str) -> str: # From https://gist.github.com/seanh/93666
"""Take a string and return a valid filename constructed from the string.
Uses a whitelist approach: any characters not present in valid_chars are
removed. Also spaces are replaced with underscores.
Note: this method may produce invalid filenames such as ``, `.` or `..`
When I use this method I prepend a date string like '2009_01_15_19_46_32_'
and append a file extension like '.txt', so I avoid the potential of using
an invalid filename.
"""
valid_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
filename = ''.join(c for c in s if c in valid_chars)
# filename = filename.replace(' ','_') # I don't like spaces in filenames.
return filename
def trlng_dash(s: str) -> str:
"""Removes trailing character if present."""
return s[:-1] if s[-1] == '-' else s
sso_authenticate = True
# Assumes CLI config file exists.
config = configparser.ConfigParser()
home = pathlib.Path.home()
config_loc = home/'.snowsql/config' # Assumes it's set up from Snowflake CLI.
base_dir = home/r'{your Desired base directory goes here.}'
json_dir = base_dir/'json' # Location for your worksheet stage JSON files.
sql_dir = base_dir/'sql' # Location for your worksheets.
# Assumes CLI config file exists.
config.read(config_loc)
# Add connection parameters here (assumes CLI config exists).
# Using sso so only 2 are needed.
# If there's no config file, etc. enter by hand here (or however you want to do it).
connection_params = {
'account': config['connections']['accountname'],
'user': config['connections']['username'],
}
if sso_authenticate:
connection_params['authenticator'] = 'externalbrowser'
if config['connections'].get('password', None) is not None:
connection_params['password'] = config['connections']['password']
if config['connections'].get('rolename', None) is not None:
connection_params['role'] = config['connections']['rolename']
if locals().get('database', None) is not None:
connection_params['database'] = database
if locals().get('schema', None) is not None:
connection_params['schema'] = schema
sf_engine = create_engine(URL(**connection_params))
if not base_dir.exists():
base_dir.mkdir()
if not json_dir.exists():
json_dir.mkdir()
if not (sql_dir).exists():
sql_dir.mkdir()
with sf_engine.connect() as connection:
connection.execute(f'get @~/worksheet_data/ \'file://{str(json_dir.as_posix())}\';')
for file in [path for path in json_dir.glob('*') if path.is_file()]:
if file.suffix != '.json':
file.replace(file.with_suffix(file.suffix + '.json'))
with open(json_dir/'metadata.json', 'r') as metadata_file:
files_meta = json.load(metadata_file)
# List of files from metadata file will contain some empty worksheets.
files_description_orig = OrderedDict((file_key_value['name'], file_key_value) for file_key_value in sorted(files_meta['activeWorksheets'] + list(files_meta['inactiveWorksheets'].values()), key=lambda x: x['name']) if file_key_value['name'])
# files_description will only track non empty worksheets
files_description = files_description_orig.copy()
# Create updated files description filtering out empty worksheets.
for item in files_description_orig:
json_file = json_dir/f"{files_description_orig[item]['name']}.json"
# If a file didn't make it or was deleted by hand, we should
# remove from the filtered description & continue to the next item.
if not (json_file.exists() and json_file.is_file()):
del files_description[item]
continue
with open(json_file, 'r', encoding='latin_1') as f:
json_dat = json.load(f)
# If the file represents a worksheet with a body field, we want it.
if not json_dat['wsContents'].get('body'):
del files_description[item]
## Delete JSON files corresponding to empty worksheets.
# f.close()
# try:
# (json_dir/f"{files_description_orig[item]['name']}.json").unlink()
# except:
# pass
# Produce a list of normalized filenames (no illegal or awkward characters).
file_names = set(
format_filename(trlng_dash(files_description[item]['encodedDetails']['scriptName']).strip())
for item in files_description)
# Add useful information to our files_description OrderedDict
for file_name in file_names:
repeats_cnt = 0
file_name_repeats = (
item
for item
in files_description
if file_name == format_filename(trlng_dash(files_description[item]['encodedDetails']['scriptName']).strip())
)
for file_uuid in file_name_repeats:
files_description[file_uuid]['normalizedName'] = file_name
files_description[file_uuid]['stemSuffix'] = '' if repeats_cnt == 0 else f'({repeats_cnt:0>2})'
repeats_cnt += 1
# Now we iterate on non-empty worksheets only.
for item in files_description:
json_file = json_dir/f"{files_description[item]['name']}.json"
with open(json_file, 'r', encoding='latin_1') as f:
json_dat = json.load(f)
body = json_dat['wsContents']['body']
body_bin = body.encode('latin_1')
body_txt = zlib.decompress(body_bin).decode('utf8')
sql_file = sql_dir/f"{files_description[item]['normalizedName']}{files_description[item]['stemSuffix']}.sql"
with open(sql_file, 'w') as sql_f:
sql_f.write(body_txt)
creation_stamp = files_description[item]['created']/1000
os.utime(sql_file, (creation_stamp,creation_stamp))
print('Done!')
As mentioned at Is there any option in snowflake to save or load worksheets? (and in Snowflake's own documentation), in the Classic UI, the worksheets are saved at the user stage under @~/worksheet_data/.
You can download it with a get command like:
get @~/worksheet_data/<name> file:///<your local location>; (though you might need quoting if running from Windows).
The problem is that I do not know how to access it programmatically. The downloaded files look like JSON but they are not valid JSON. The main key is "wsContents" and contains most of the worksheet information. Its value includes two subkeys, "encoding" and "body".
The "encoding" key denotes that gzip is being used. The "body" key seems to be the actual worksheet data which looks a lot like a straight binary representation of the compressed text data. As such, any JSON reader will choke on it.
If it is anything like that, I do not currently know how to access it programmatically using Python.
I do see that a JSON-like format exists, BSON, which is bundled into PyMongo. Trying to use it on these files fails. I even tried bson.is_valid and it returns False, so I am assuming that these files in Snowflake are not actually BSON.
Edited to add: Again, BSON is a dead end.
Examining the "body" value as just binary data, the first two bytes of sample files do seem to correspond to default zlib compression (0x789c). However, attempting to run straight zlib.decompress on the slice created from that first byte to the last corresponding to the first & last characters of the "body" value results in the error:
Error -3 while decompressing data: invalid code lengths set
This makes me think that the bytes there, as is, are at least partly garbage and still need some processing before they can be decompressed.
One clue that I failed to mention earlier is that the metadata file (called "metadata", which serves as an inventory of the remaining files at the @~/worksheet_data/ location) declares that the files use the punycode encoding. However, I have not figured out how to use that information. The data in these files doesn't particularly look like what I would expect punycode to look like, nor does it make sense to me that you would use punycode on binary data that is never meant to directly generate text, such as zlib-compressed data.
I need to do the following:
Read a .csv file into a variable. The CSV file has a single row with a string like (110,111,112,113,114).
Using this String variable, split the content on the comma ",".
What I have done:
1. I have added a Thread Group.
2a. Added a 'User Defined Variables' Config Element.
2b. Added a variable named 'issueIds' with the value ${__FileToString(D:\TestCasesId.csv,,issueIds)}
3a. Now I added a JSR223 Sampler with the following code:
String lineItems1 = ${issueIds};
log.info(lineItems1);
3b. Executing this gives the following error:
Response code:500
Response message:javax.script.ScriptException: In file: inline evaluation of: ``String lineItems1 = 114660,114661,114662,114663; log.info(lineItems1); ;'' Encountered "114661" at line 1, column 28.
in inline evaluation of: ``String lineItems1 = 114660,114661,114662,114663; log.info(lineItems1); ;'' at line number 1
4a. Added a BeanShell Sampler with the following script:
String lineItems2 = ${issueIds};
String[] lineItems2Arr = lineItems2.split(",");
log.info(lineItems2);
log.info(lineItems2Arr[0]);
4b. Executing this gives the following error:
Response code:500
Response message:org.apache.jorphan.util.JMeterException: Error invoking bsh method: eval In file: inline evaluation of: ``String lineItems2 = 114660,114661,114662,114663; String[] lineItems2Arr = lineIt . . . '' Encountered "114661" at line 1, column 28.
What am I doing wrong?
You are doing two things wrong:
Inlining JMeter Functions or Variables into scripting elements is not recommended; you should use the vars shorthand for the JMeterVariables class instance instead, like:
String lineItems1 = vars.get("issueIds");
Since JMeter 3.1 it's recommended to use JSR223 Test Elements and the Groovy language for scripting, so consider choosing groovy from the language drop-down.
Groovy has much better performance compared to Beanshell, supports all modern Java SDK features, and provides some syntactic sugar on top of them; check out the Apache Groovy - Why and How You Should Use It article for more details.
If the number of comma-separated fields is the same for all the CSV files you use, consider using the 'CSV Data Set Config' instead of manual splitting. In that case you will have a separate variable for each column in the CSV, e.g.
id1,id2,id3,id4,id5
110,111,112,113,114
I am using the Table API on Flink 1.4.0. I have some Table objects that need to be converted to a DataSet of type Row. The project was built using Maven and imported into IntelliJ. I have the following code, and the IDE cannot resolve the tableEnvironment.toDataSet() method. Please help me out. Thank you.
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment tableEnvironment = TableEnvironment.getTableEnvironment(env);
...
tableEnvironment.registerTableSource("table1",csvSource);
Table table1 = tableEnvironment.scan("table1");
DataSet<Row> result = tableEnvironment.toDataSet(table1, Row.class);
The last statement causes an error
"Cannot resolve toDataSet() method"
You might not be importing the right BatchTableEnvironment.
Please check that you import org.apache.flink.table.api.java.BatchTableEnvironment instead of org.apache.flink.table.api.BatchTableEnvironment. The latter is the common base class for the Java and Scala variants.
If you want to read a DataSet from a CSV file, do it like the following:
DataSet<YourType> csvInput = env.readCsvFile("hdfs:///the/CSV/file") ...
More on this: https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/batch/#data-sources
I would like to issue one complex query and create multiple graphs from it. Any ideas on how to do that?
Take a look at the Zeppelin Tutorial and the related sample notebook installed by default. In the first paragraph, it shows how to build a data set for both API and SQL use. The same data set is then used in later paragraphs. From the tutorial notebook in version 0.6.2:
import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset
// Zeppelin creates and injects sc (SparkContext) and sqlContext (HiveContext or SqlContext)
// So you don't need create them manually
// load bank data
val bankText = sc.parallelize(
IOUtils.toString(
new URL("https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv"),
Charset.forName("utf8")).split("\n"))
case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)
val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
s => Bank(s(0).toInt,
s(1).replaceAll("\"", ""),
s(2).replaceAll("\"", ""),
s(3).replaceAll("\"", ""),
s(5).replaceAll("\"", "").toInt
)
).toDF()
bank.registerTempTable("bank")
You can then reference this using Spark SQL:
%sql
select * from bank limit 5
Or use the bank DataFrame directly:
%spark
bank.show(5)
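Each paragraph that reuses the registered temp table can be rendered with its own chart type, so one data set supports many graphs. The same works from pyspark; a minimal sketch (this assumes the bank temp table registered above and Zeppelin's z.show() helper are available in your environment):
%pyspark
# render a different aggregation over the same registered table with Zeppelin's chart UI
z.show(sqlContext.sql("select age, count(1) as total from bank where age < 30 group by age order by age"))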
Hi, I wonder whether it is possible to select certain items from an array property of a JSON input in Stream Analytics and return them as an array property of a JSON output.
An example to make it clearer: I send a list of OSGi bundles running on a device, with the name, version and state of each bundle. (I leave out the rest of the content.) Sample message:
{"bundles":[{"name":"org.eclipse.osgi","version":"3.5.1.R35x_v20090827","state":32},{"name":"slf4j.log4j12","version":"1.6.1","state":4}]}
Via Stream Analytics I want to create one JSON output (event hub) for active bundles (state == 32) and put the rest in a different output. The content of those event hubs will be processed later. But in that processing I also need the original device ID, so I fetch it from the IoT Hub message properties.
So my query looks like this:
WITH Step1 AS
(
SELECT
IoTHub.ConnectionDeviceId AS deviceId,
bundles as bundles
FROM
iotHubMessages
)
SELECT
messages.deviceId AS deviceId,
bundle.ArrayValue.name AS name,
bundle.ArrayValue.version AS version
INTO
active
FROM
Step1 as messages
CROSS APPLY GetArrayElements(messages.bundles) AS bundle
WHERE
bundle.ArrayValue.state = 32
SELECT
messages.deviceId AS deviceId,
bundle.ArrayValue.name AS name,
bundle.ArrayValue.version AS version
INTO
other
FROM
Step1 as messages
CROSS APPLY GetArrayElements(messages.bundles) AS bundle
WHERE
bundle.ArrayValue.state != 32
This way there is a row in the active output for every item of the original array, each containing deviceId, name and version properties. So the deviceId property is copied several times, which means extra data in each message. I'd prefer a JSON with one deviceId property and one array property bundles, similar to the original JSON input.
Like active:
{"deviceid":"javadevice","bundles":[{"name":"org.eclipse.osgi","version":"3.5.1.R35x_v20090827"}]}
And other:
{"deviceid":"javadevice","bundles":[{"name":"slf4j.log4j12","version":"1.6.1"}]}
Is there any way to achieve this, i.e. to filter the items of an array and return them as an array in the same format as in the input? (In my code I change the number of properties, but that is not necessary.)
Thanks for any ideas!
I think you can achieve this using the Collect() aggregate function.
The only issue I see is that the deviceId property will be outputted in the bundle array as well.
WITH Step1 AS
(
SELECT
IoTHub.ConnectionDeviceId AS deviceId,
bundles as bundles
FROM
iotHubMessages
),
Step2 AS
(
SELECT
messages.deviceId AS deviceId,
bundle.ArrayValue.name AS name,
bundle.ArrayValue.version AS version,
bundle.ArrayValue.state AS state
FROM
Step1 as messages
CROSS APPLY GetArrayElements(messages.bundles) AS bundle
)
SELECT deviceId, Collect() AS bundles
FROM Step2
WHERE state = 32
GROUP BY deviceId, state, System.Timestamp