How to create a file resource on the fly in Clojure?

I have a database that contains images, and I want to serve these images through some urls, preferably like so:
foobar.com/items?id=whateverTheImageIdIsInTheDatabase.
So I wrote this code:
(defn create-item-image [req]
  (let [item-id (:id (:params req))
        item (find-by-id "items" (ObjectId. item-id))
        file-location (str "resources/" item-id ".jpg")]
    (with-open [o (io/output-stream file-location)]
      (let [;; take the first image. The "image" function simply returns the
            ;; data-url from the id of the image stored in (first (:images item))
            img-string (get (str/split (image (first (:images item))) #",") 1)
            img-bytes (.decode (java.util.Base64/getDecoder) img-string)]
        ;; write to a file with the name whateverTheImageIdIsInTheDatabase.jpg
        ;; (with-open closes the stream, so no explicit (.close o) is needed)
        (.write o img-bytes)))))
(defn image-handler [req]
  (prn "coming to image handler")
  (create-item-image req)
  ;; send the resource whateverTheImageIdIsInTheDatabase.jpg created above
  (assoc (resource-response (str (:_id (:params req)) ".jpg") {:root ""})
         :headers {"Content-Type" "image/jpeg; charset=UTF-8"}))
But this doesn't work: the served resources are broken. Is it because the response is sent before the file is finished being written? How do I send the resource on the fly?

Another issue with writing files to disk is that they have to stay there. If 1000 requests for different images are made, all 1000 files end up stored on the server, which shouldn't be necessary since the images are already in the database.

Ultimately, how can I send these images, stored as data-urls, as files in the response without first writing them to disk?

Resources are static files that do not change while the application is running, and they are packed into the uberjar when the server is compiled for production, so writing new files under resources/ at runtime is not how they are meant to be used.
If you want to serve an image in a response, just convert it into a byte array and send it within the :body of the response.
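For example, here is a minimal sketch of a handler that decodes the stored data-url and streams the bytes straight from memory, with no file on disk. It assumes the same find-by-id, image, and str/split helpers used in the question; Ring accepts any InputStream as a response :body:
(defn image-handler [req]
  (let [item-id (:id (:params req))
        item (find-by-id "items" (ObjectId. item-id))
        ;; a data-url looks like "data:image/jpeg;base64,<payload>";
        ;; everything after the comma is the base64 payload
        img-string (get (str/split (image (first (:images item))) #",") 1)
        img-bytes (.decode (java.util.Base64/getDecoder) img-string)]
    {:status 200
     :headers {"Content-Type" "image/jpeg"}
     ;; nothing is written to disk; the decoded bytes are served directly
     :body (java.io.ByteArrayInputStream. img-bytes)}))
This also answers the cleanup concern: since no file is ever created, there is nothing left on the server after the response is sent.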

Related

In Snowflake: How to access internally staged pre-trained model from UDF, syntax dilemma?

What is the syntax to reference a staged zip file from a UDF? Specifically, I created a UDF in Snowpark and it needs to load the s-bert sentence_transformers pre-trained model (I downloaded the model, zipped it, and uploaded it to an internal stage).
The SentenceTransformer method (see the code line below) takes a parameter that can either be the name of the model, in which case the pre-trained model will be downloaded from the web, or a directory path to a folder that contains the already-downloaded pre-trained model files.
Downloading the files from the web within a UDF is not an option in Snowflake.
So, what is the directory path to the internally staged file that I can use as a parameter to the SentenceTransformer method so it can access the already-downloaded zipped model? "@stagename/filename.zip" is not working.
@udf(...)
def create_embedding(...):
    ...
    model = SentenceTransformer('all-MiniLM-L6-v2')  # THIS IS THE LINE IN THE QUESTION
    ...
    ...
UDFs need to specify the specific files they use when they are created (for now). For your case, that means listing the zipped model in imports and, at runtime, extracting it from the import directory to a writable location such as /tmp before passing that directory path to SentenceTransformer. See:
https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-creating.html#loading-a-file-from-a-stage-into-a-python-udf
Check the example from the docs, which uses imports and the snowflake_import_directory system option to open(import_dir + 'file.txt'):
create or replace function my_udf()
returns string
language python
runtime_version=3.8
imports=('@my_stage/file.txt')
handler='compute'
as
$$
import sys

IMPORT_DIRECTORY_NAME = "snowflake_import_directory"
import_dir = sys._xoptions[IMPORT_DIRECTORY_NAME]

def compute():
    with open(import_dir + 'file.txt', 'r') as file:
        return file.read()
$$;

How to export Snowflake Web UI Worksheet SQL to file

The Classic Snowflake Web UI and the new Snowsight are great at importing SQL from a file, but neither allows you to export SQL to a file. Is there a workaround?
You can use an IDE to connect to Snowflake and write queries. The scripts can then be downloaded using IDE features and synced with a git repo as well.
DBeaver is one such IDE that supports Snowflake:
https://hevodata.com/learn/dbeaver-snowflake/
The query pane is interactive, so the obvious workaround is:
CTRL + A (select all)
CTRL + C (copy)
<open_favourite_text_editor>
CTRL + V (paste)
CTRL + S (save)
This tool can help you while the team develops a native feature to export worksheets:
"Snowflake Snowsight Extensions wrap Snowsight features that do not have API or SQL alternatives, such as manipulating Dashboards and Worksheets, and retrieving Query Profile and step timings."
https://github.com/Snowflake-Labs/sfsnowsightextensions
Further explained on this post:
https://medium.com/snowflake/importing-and-exporting-snowsight-dashboards-and-worksheets-3cd8e34d29c8
For example, to save to a file within PowerShell:
PS > $dashboards | foreach {$_.SaveToFolder("path/to/folder")}
PS > $dashboards[0].SaveToFile("path/to/folder/mydashboard.json")
ETA: I'm adding this edit to the front because this is what actually worked.
Again, BSON was a dead end and punycode is irrelevant. I don't know why punycode is referenced in the metadata file, but my best guess is that they might use punycode to encode the worksheet name itself (though I'm not sure why that would be needed, since it shouldn't need to be part of a URL).
After doing terrible things and trying a number of complex ways of dealing with escape-character hell, I found that the actual encoding is very simple. It just works as an 8-bit encoding with anything that might cause problems (null, control codes, double quotes, etc.) escaped away. To load it, treat the file as a text file in an 8-bit encoding, extract the data as a JSON field, then re-encode that extracted string with the same encoding. I used latin_1 to read, but it may not even matter which encoding you use as long as you are consistent and re-encode with the same one. The re-encoded field will then be valid zlib-compressed data.
I decided that I wanted to start from scratch, so I needed to back up the worksheets first, and I made a Python script based on my findings above. Be warned that this may return even worksheets that you previously closed for good. After running this and verifying that backups were created, I just ran rm @~/worksheet_data/;, closed the tab and reopened it.
Here's the code (fill in the appropriate base directory location):
import os
from collections import OrderedDict
import configparser
from sqlalchemy import create_engine, exc
from snowflake.sqlalchemy import URL
import pathlib
import json
import zlib
import string
def format_filename(s: str) -> str:  # From https://gist.github.com/seanh/93666
    """Take a string and return a valid filename constructed from the string.

    Uses a whitelist approach: any characters not present in valid_chars are
    removed. Also spaces are replaced with underscores.

    Note: this method may produce invalid filenames such as ``, `.` or `..`
    When I use this method I prepend a date string like '2009_01_15_19_46_32_'
    and append a file extension like '.txt', so I avoid the potential of using
    an invalid filename.
    """
    valid_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
    filename = ''.join(c for c in s if c in valid_chars)
    # filename = filename.replace(' ','_') # I don't like spaces in filenames.
    return filename


def trlng_dash(s: str) -> str:
    """Removes a trailing dash if present."""
    return s[:-1] if s[-1] == '-' else s
sso_authenticate = True

# Assumes CLI config file exists.
config = configparser.ConfigParser()
home = pathlib.Path.home()
config_loc = home/'.snowsql/config'  # Assumes it's set up from Snowflake CLI.
base_dir = home/r'{your Desired base directory goes here.}'
json_dir = base_dir/'json'  # Location for your worksheet stage JSON files.
sql_dir = base_dir/'sql'  # Location for your worksheets.

# Assumes CLI config file exists.
config.read(config_loc)

# Add connection parameters here (assumes CLI config exists).
# Using sso so only 2 are needed.
# If there's no config file, etc. enter by hand here (or however you want to do it).
connection_params = {
    'account': config['connections']['accountname'],
    'user': config['connections']['username'],
}

if sso_authenticate:
    connection_params['authenticator'] = 'externalbrowser'
if config['connections'].get('password', None) is not None:
    connection_params['password'] = config['connections']['password']
if config['connections'].get('rolename', None) is not None:
    connection_params['role'] = config['connections']['rolename']
if locals().get('database', None) is not None:
    connection_params['database'] = database
if locals().get('schema', None) is not None:
    connection_params['schema'] = schema

sf_engine = create_engine(URL(**connection_params))

if not base_dir.exists():
    base_dir.mkdir()
if not json_dir.exists():
    json_dir.mkdir()
if not sql_dir.exists():
    sql_dir.mkdir()
with sf_engine.connect() as connection:
    connection.execute(f'get @~/worksheet_data/ \'file://{str(json_dir.as_posix())}\';')

for file in [path for path in json_dir.glob('*') if path.is_file()]:
    if file.suffix != '.json':
        file.replace(file.with_suffix(file.suffix + '.json'))

with open(json_dir/'metadata.json', 'r') as metadata_file:
    files_meta = json.load(metadata_file)

# List of files from metadata file will contain some empty worksheets.
files_description_orig = OrderedDict(
    (file_key_value['name'], file_key_value)
    for file_key_value in sorted(
        files_meta['activeWorksheets'] + list(files_meta['inactiveWorksheets'].values()),
        key=lambda x: x['name'])
    if file_key_value['name'])

# files_description will only track non-empty worksheets.
files_description = files_description_orig.copy()

# Create updated files description filtering out empty worksheets.
for item in files_description_orig:
    json_file = json_dir/f"{files_description_orig[item]['name']}.json"
    # If a file didn't make it or was deleted by hand, we should
    # remove it from the filtered description & continue to the next item.
    if not (json_file.exists() and json_file.is_file()):
        del files_description[item]
        continue
    with open(json_file, 'r', encoding='latin_1') as f:
        json_dat = json.load(f)
    # If the file represents a worksheet with a body field, we want it.
    if not json_dat['wsContents'].get('body'):
        del files_description[item]
        ## Delete JSON files corresponding to empty worksheets.
        # f.close()
        # try:
        #     (json_dir/f"{files_description_orig[item]['name']}.json").unlink()
        # except:
        #     pass

# Produce a list of normalized filenames (no illegal or awkward characters).
file_names = set(
    format_filename(trlng_dash(files_description[item]['encodedDetails']['scriptName']).strip())
    for item in files_description)

# Add useful information to our files_description OrderedDict.
for file_name in file_names:
    repeats_cnt = 0
    file_name_repeats = (
        item
        for item in files_description
        if file_name == format_filename(trlng_dash(files_description[item]['encodedDetails']['scriptName']).strip())
    )
    for file_uuid in file_name_repeats:
        files_description[file_uuid]['normalizedName'] = file_name
        files_description[file_uuid]['stemSuffix'] = '' if repeats_cnt == 0 else f'({repeats_cnt:0>2})'
        repeats_cnt += 1

# Now we iterate on non-empty worksheets only.
for item in files_description:
    json_file = json_dir/f"{files_description[item]['name']}.json"
    with open(json_file, 'r', encoding='latin_1') as f:
        json_dat = json.load(f)
    body = json_dat['wsContents']['body']
    body_bin = body.encode('latin_1')
    body_txt = zlib.decompress(body_bin).decode('utf8')
    sql_file = sql_dir/f"{files_description[item]['normalizedName']}{files_description[item]['stemSuffix']}.sql"
    with open(sql_file, 'w') as sql_f:
        sql_f.write(body_txt)
    creation_stamp = files_description[item]['created']/1000
    os.utime(sql_file, (creation_stamp, creation_stamp))

print('Done!')
As mentioned at Is there any option in snowflake to save or load worksheets? (and in Snowflake's own documentation), in the Classic UI, the worksheets are saved at the user stage under @~/worksheet_data/.
You can download it with a get command like:
get @~/worksheet_data/<name> file:///<your local location>; (though you might need quoting if running from Windows).
The problem is that I do not know how to access it programmatically. The downloaded files look like JSON but are not valid JSON. The main key is "wsContents", which contains most of the worksheet information. Its value includes two subkeys, "encoding" and "body".
The "encoding" key denotes that gzip is being used. The "body" key seems to be the actual worksheet data, which looks a lot like a straight binary representation of the compressed text data. As such, any JSON reader will choke on it.
If that is what it is, I do not currently know how to access it programmatically using Python.
I do see that a JSON-like format, BSON, exists and is bundled with PyMongo. Trying to use it on these files fails. I even tried bson.is_valid and it returns False, so I am assuming these files in Snowflake are not actually BSON.
Edited to add: Again, BSON is a dead end.
Examining the "body" value as just binary data, the first two bytes of sample files do seem to correspond to default zlib compression (0x789c). However, attempting to run straight zlib.decompress on the slice spanning the first through last bytes of the "body" value results in the error:
Error - 3 while decompressing data: invalid code lengths set
This makes me think that the bytes there, as is, are at least partly garbage and still need some processing before they can be decompressed.
One clue that I failed to mention earlier is that the metadata file (called "metadata", which serves as an inventory of the remaining files at the @~/worksheet_data/ location) declares that the files use the punycode encoding. However, I have not figured out how to use that information. The data in these files doesn't particularly look like what I'd expect punycode to look like, nor does it make much sense to me to use punycode on binary data that is never meant to directly generate text, such as zlib-compressed data.

Pause in the middle of a Clojure doseq function?

My program allows users to select files to import into a database. Before now, it only allowed them to import one file at a time. Letting them import more than one file at a time is easy. But the problem I have is that if the database already contains a page with the same title, I want to display a warning and get confirmation before overwriting the version in the database with the one being imported.
Here's what I have so far. It largely follows what I was doing to import single files.
;; Handler for the POST "/import" route.
(defn- post-import-page
  "Import the file(s) specified in the upload dialog. Checks the page
  title of the import file against existing pages in the database. If
  a page of the same name already exists, asks for confirmation before
  importing."
  [{{file-info "file-info"
     referer "referer"} :multipart-params :as req}]
  (let [file-vec (if (map? file-info)
                   (conj [] file-info)
                   file-info)]
    (doseq [fm file-vec]
      (let [file-name (:filename fm)
            import-map (files/load-markdown-from-file (:tempfile fm))
            page-title (get-in import-map [:meta :title])
            id-exists? (db/title->page-id page-title)]
        (if id-exists?
          (build-response
            (layout/compose-import-existing-page-warning
              import-map file-name referer) req)
          (do-the-import import-map file-name req))))))
This function imports any files that don't already exist in the database, but doesn't import anything that would overwrite an existing database entry with the same title. It never shows the warning page asking for confirmation either.
The warning page is constructed like this:
(defn compose-import-existing-page-warning
  "Return a page stating that a page with the same title already exists
  in the wiki. Ask the user to proceed or cancel the import."
  [import-map file-name referer]
  (short-form-template
    [:div {:class "cwiki-form"}
     (form-to {:enctype "multipart/form-data"
               :autocomplete "off"}
              [:post "proceed-with-import"]
              (hidden-field "import-map" import-map)
              (hidden-field "file-name" file-name)
              (hidden-field "referer" referer)
              [:p {:class "form-title"} "Page Already Exists"]
              [:div {:class "form-group"}
               [:p (str "A page with the title \"" (get-in import-map [:meta :title])
                        "\" already exists in the wiki.")]
               [:p (str "Click \"Proceed\" to delete the existing page and "
                        "replace it with the contents of the imported file.")]
               [:div {:class "button-bar-container"}
                (submit-button {:id "proceed-with-import-button"
                                :class "form-button button-bar-item"}
                               "Proceed")
                [:input {:type "button" :name "cancel-button"
                         :value "Cancel"
                         :class "form-button button-bar-item"
                         :autofocus "autofocus"
                         :onclick "window.history.back();"}]]])]))
How would the program be paused in the middle of the doseq (or other looping function) to display the confirmation page and wait for the user to make a selection?
Just use read-line in the middle of your loop, then an if to choose the branching you want. There is plenty of other documentation you may find useful, especially the Clojure CheatSheet.
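For illustration, a minimal console sketch of that pattern, looping over the same file-vec as in the question (page-exists? and import-file! are hypothetical stand-ins for the real database check and import logic):
(doseq [fm file-vec]
  (if (page-exists? fm) ; hypothetical: true when the title is already taken
    (do
      (println (str "Overwrite \"" (:filename fm) "\"? (y/n)"))
      (when (= "y" (read-line)) ; blocks until the user answers
        (import-file! fm)))
    (import-file! fm)))
Note that read-line blocks on standard input, so this suits a console or REPL workflow; a web handler would instead need the confirmation round-trip through the browser, as the warning page above does.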

Open linked data_a data set

I downloaded a data set (http://iw.rpi.edu/wiki/Dataset_1329) which is supposed to be in RDF format. I opened it using Notepad++ but can't read it. Any suggestions?
The file, uncompressed, is about 140 MB. Notepad++ is probably failing due to the size of the file. The RDF format used in this dataset is N-Triples: one triple per line, each with three components (subject, predicate, object), very human-readable. Sample data from the file:
<http://data-gov.tw.rpi.edu/raw/1329/data-1329-00017.rdf#entry8389> <http://data-gov.tw.rpi.edu/vocab/p/1329/race_other_multi_racial> "0" .
<http://data-gov.tw.rpi.edu/raw/1329/data-1329-00017.rdf#entry8389> <http://data-gov.tw.rpi.edu/vocab/p/1329/race_black_and_white> "0" .
<http://data-gov.tw.rpi.edu/raw/1329/data-1329-00017.rdf#entry8389> <http://data-gov.tw.rpi.edu/vocab/p/1329/national_origin_hispanic> "0" .
<http://data-gov.tw.rpi.edu/raw/1329/data-1329-00017.rdf#entry8389> <http://data-gov.tw.rpi.edu/vocab/p/1329/filed_cases> "1" .
If you want to have a look at the data, try opening it with a tool that streams the file rather than loading it all at once, for instance less, or head -n 4 to peek at the first four lines.
If you want to use the data, you might want to look into loading it into a triple store (4store, Virtuoso, Jena TDB, ...) and using SPARQL to query it.
Try Google Refine (possibly with the RDF extension: http://lab.linkeddata.deri.ie/2010/grefine-rdf-extension/).

How do I get a temporary File object (of correct content-type, without writing to disk) directly from a ZipEntry (RubyZip, Paperclip, Rails 3)?

I'm currently trying to attach image files to a model directly from a zip file (i.e. without first saving them on a disk). It seems like there should be a clearer way of converting a ZipEntry to a Tempfile or File that can be stored in memory to be passed to another method or object that knows what to do with it.
Here's my code:
def extract(file = nil)
  Zip::ZipFile.open(file) { |zip_file|
    zip_file.each { |image|
      photo = self.photos.build
      # photo.image = image # this doesn't work
      # photo.image = File.open image # also doesn't work
      # photo.image = File.new image.filename
      photo.save
    }
  }
end
But the problem is that photo.image is an attachment (via Paperclip) to the model, and assigning something as an attachment requires that something to be a File object. However, I cannot for the life of me figure out how to convert a ZipEntry to a File. The only way I've seen to open or create a File is with a string containing its path, meaning I would have to extract the file to some location first. Really, that just seems silly. Why can't I just extract the ZipEntry file to an output stream and convert it to a File there?
So the ultimate question: Can I extract a ZipEntry from a Zip file and turn it directly into a File object (or attach it directly as a Paperclip object)? Or am I stuck actually storing it on the hard drive before I can attach it, even though that version will be deleted in the end?
UPDATE
Thanks to blueberry fields, I think I'm a little closer to my solution. Here's the line of code that I added, and it gives me the Tempfile/File that I need:
photo.image = zip_file.get_output_stream image
However, my Photo object won't accept the file that's getting passed, since it's not an image/jpeg. In fact, checking the content_type of the file shows application/x-empty. I think this may be because getting the output stream seems to append a timestamp to the end of the file, so that it ends up looking like imagename.jpg20110203-20203-hukq0n. Edit: Also, the tempfile that it creates doesn't contain any data and is of size 0. So it's looking like this might not be the answer.
So, next question: does anyone know how to get this to give me an image/jpeg file?
UPDATE:
I've been playing around with this some more. It seems an output stream is not the way to go, but rather an input stream (the difference between the two has always kind of confused me). Using get_input_stream on the ZipEntry, I get the binary data in the file. I think now I just need to figure out how to get this into a Paperclip attachment (as a File object). I've tried pushing the ZipInputStream directly to the attachment, but of course that doesn't work. I really find it hard to believe that no one has tried to cast an extracted ZipEntry as a File. Is there some reason this would be considered bad programming practice? It seems to me that skipping the disk write for a temp file would be perfectly acceptable and supported in something like a Zip archive management library.
Anyway, the question still stands:
Is there a way of converting an Input Stream to a File object (or Tempfile)? Preferably without having to write to a disk.
Try this
Zip::ZipFile.open(params[:avatar].path) do |zipfile|
  zipfile.each do |entry|
    filename = entry.name
    basename = File.basename(filename)
    tempfile = Tempfile.new(basename)
    tempfile.binmode
    tempfile.write entry.get_input_stream.read
    user = User.new
    user.avatar = {
      :tempfile => tempfile,
      :filename => filename
    }
    user.save
  end
end
Check out the get_input_stream and get_output_stream messages on ZipFile.
