Pyodide filesystem for NLTK resources: missing files

I am trying to use NLTK in the browser, thanks to Pyodide.
Pyodide starts well, manages to load NLTK, and prints its version.
Nevertheless, while the package download seems to run fine, invoking nltk.sent_tokenize() makes NLTK raise an error saying it can't find the package "punkt".
I would guess the downloaded resource is lost somewhere, but I don't understand well how Pyodide / WebAssembly manage files. Any insights?
Simple version:
import nltk
pkg = "punkt"
nltk.download(pkg)
for sent in nltk.sent_tokenize("Test string"):
    print(sent)
Version with more details, specifying the download directory and server URL:
import nltk
pkg = "punkt"
downloader = nltk.downloader.Downloader(server_index_url="https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml")
downloader.download(pkg, download_dir='/nltk_data')
downloader.status(pkg)
for sent in nltk.sent_tokenize("Test string"):
    print(sent)
Full sample code:
<!DOCTYPE html>
<html>
<body>
<script type="text/javascript" src="https://cdn.jsdelivr.net/pyodide/v0.18.0/full/pyodide.js"></script>
<script type="text/javascript">
// init Pyodide
async function pyodide_loader() {
    let pyodide_promise = loadPyodide({
        indexURL: "https://cdn.jsdelivr.net/pyodide/v0.18.0/full/",
    });
    let pyodide = await pyodide_promise;
    await pyodide.loadPackage("micropip");
    await pyodide.loadPackage("nltk");
    return pyodide_promise;
}
let pyodideReadyPromise = pyodide_loader();

// run Python code and load NLTK
async function load_packages() {
    let pyodide = await pyodideReadyPromise;
    let output = pyodide.runPython(`
print("*** import nltk")
import nltk
print(f"*** NLTK version {nltk.__version__=} imported, downloading resources now")
pkg = "punkt"
nltk.download(pkg)
text = "Just for testing"
for sent in nltk.sent_tokenize(text):
    print(sent)
`);
}
load_packages()
</script>
</body>
</html>

The short answer is that downloading files with Python currently won't work in Pyodide, because http.client, requests, etc. require POSIX sockets, which are not supported in the browser VM.
It's curious that nltk.download doesn't raise an error, though; it should have.
The workaround is to download the needed resources manually, for instance with the JavaScript fetch API as illustrated in this comment:
from js import fetch

response = await fetch("<url>")
js_buffer = await response.arrayBuffer()
py_buffer = js_buffer.to_py()  # this is a memoryview
stream = py_buffer.tobytes()   # now we have a bytes object
# that we can finally write under the appropriate path
with open("<file_path>", "wb") as fh:
    fh.write(stream)
I didn't understand well how Pyodide / WebAssembly manage files.
By default it's a virtual filesystem (MEMFS) that gets reset on each page load. You can access it with standard Python tools (open, os, etc.). If necessary, you can also mount a persistent filesystem.
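For instance, a minimal sketch (run inside Pyodide; the paths here are only illustrative) showing that the standard tools work on MEMFS:

import os

# MEMFS behaves like an ordinary POSIX filesystem from Python's point of view
os.makedirs("/nltk_data/tokenizers", exist_ok=True)
with open("/nltk_data/hello.txt", "w") as fh:
    fh.write("hello from MEMFS")

# the file is visible with the usual tools...
print(os.listdir("/nltk_data"))  # contains 'tokenizers' and 'hello.txt'
# ...but it lives in memory only and disappears on the next page load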

Here is a working example of loading punkt with Pyodide v0.18.1. I tried to post this as a comment on @rth's accepted answer, but the number of characters exceeded the 240-character limit.
from js import fetch
from pathlib import Path
import os, zipfile
import nltk

# fetch punkt.zip through the browser
response = await fetch('https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt.zip')
js_buffer = await response.arrayBuffer()
py_buffer = js_buffer.to_py()  # this is a memoryview
stream = py_buffer.tobytes()   # now we have a bytes object

# write the archive under NLTK's default search path
d = Path("/nltk_data/tokenizers")
d.mkdir(parents=True, exist_ok=True)
Path('/nltk_data/tokenizers/punkt.zip').write_bytes(stream)

# extract punkt.zip
zipfile.ZipFile('/nltk_data/tokenizers/punkt.zip').extractall(
    path='/nltk_data/tokenizers/'
)

# check file contents in /nltk_data/tokenizers/
# print(os.listdir("/nltk_data/tokenizers/punkt"))

nltk.word_tokenize("some text here")
I received a lot of help from the pyodide maintainers and other wonderful people at https://github.com/pyodide/pyodide/issues/1798 to figure this out. Thanks!

Related

process is undefined in React

I am building a simple React app that generates a QR code from data. I am interested in inspecting the memory usage when the QR code is generated. I am using the built-in process.memoryUsage() function, but the app throws an exception:
Uncaught TypeError: process__WEBPACK_IMPORTED_MODULE_1__.process is undefined
I have tested some different solutions: I tried rolling back the react-scripts version to "4.0.3", and I tried the npm webpack polyfill, but with no success.
I am currently using these imports
import React, { useState, useEffect } from 'react';
import process from 'process';
import './App.css';
const QRCode = require('qrcode');
The function looks like this:
let stringData = JSON.stringify(qrData);
console.log("Number of chars in data" + " " + stringData.length);
QRCode.toDataURL(stringData, function (err, url) {
    if (err) return console.log("error occurred");
    //window.location.href = url;
});
const used = process.memoryUsage();
for (let key in used) {
    console.log(`${key} ${Math.round(used[key] / 1024 / 1024 * 100) / 100} MB`);
}
process is a Node.js API and is not available in the browser where your React app is running, which is why you see that error. If a global process object is available anyway, it is being polyfilled by something in your build tools and will not have the memoryUsage method.
There is no equivalent API for the browser, but some related APIs do exist. Note that this is an evolving space and some are non-standard, so be sure to read both the spec and documentation before considering any usage:
Device Memory API (MDN)
Performance.memory (MDN)
You are using a built-in Node.js package in a React app. Node executes on the server and has access to system-level resources; React runs in the browser and does not. See this article for some tips on measuring performance in React.

React Load Binary File / URL scheme "file" is not supported

Background
I built an app which converts files from type A to type B (a binary file). I want to import a dummy file of type B and use it to fill the data of file type A. The dummy always stays the same. The app has no backend. I want to share the HTML, so anything which requires turning off browser security etc. isn't an option.
Problem
At the moment, I load the files as shown in the link below, but this works only with a backend server:
Requesting blob images and transforming to base64 with fetch API
import dummy from '../templates/Grid2.shp';

let hex = await fetch(dummy)
    .then(response => response.blob())
    .then(blob => new Promise(callback => {
        let reader = new FileReader();
        reader.onload = function () {
            const serumShp = atob(this.result.substring(37)); // 37 strips the base64 header "data:..."
            callback(binaryToHex(serumShp));
        };
        reader.readAsDataURL(blob);
    }));
It works in my development environment but not in the built version, as the browser then requests the file from the filesystem.
I found a solution using a file loader, but that solution also throws an error:
Using file-loader to load binary file in react
import/no-webpack-loader-syntax
Also, I don't see any configuration files for Webpack. As far as I have seen, I would need to eject to get them, which is also not recommended.
Question:
How can I import binary files into my app without a backend server/any changes, etc.?
Sorry, I cannot help beyond pointing out that there is a general discussion in CRA about supporting a more elegant way of importing binary/raw data. Sadly there doesn't seem to be much progress; the proposal is from 2018.

Authentication to serve static files on Next.js?

So, I looked for a few authentication options for Next.js that wouldn't require any work on the server side of things. My goal was to block users from entering the website without a password.
I've set up a few tests with NextAuth (after a few other tries), and apparently I can block pages with sessions and cookies, but after a few hours of research I still can't find out how to block assets (e.g. /image.png from the /public folder) from non-authenticated requests.
Is that even possible without a custom server? Am I missing some core understanding here?
Thanks in advance.
I stumbled upon this problem too. It took me a while, but I figured it out in the end.
As you said, for auth you can use whatever you like, such as NextAuth.
As for file serving: I set up a new API endpoint and used Node.js to read the file and serve it as a pipe, pretty similar to what you would do in Express. Don't forget to set the proper header info in your response.
Here is a little snippet to demonstrate (TypeScript version):
import { NextApiRequest, NextApiResponse } from 'next'
import { stat } from "fs/promises"
import { createReadStream, existsSync } from "fs"
import path from "path"
import mime from "mime"

// basic Next.js API route
export default async function getFile(req: NextApiRequest, res: NextApiResponse) {
    // Don't forget to auth first!
    // for this I created a "private" folder in the root folder (at the same level as the normal Next.js "public" folder), with "somefile.png" in it
    const someFilePath = path.resolve('./private/somefile.png');

    // if the file is not located in the specified folder, stop and end with a 404
    if (!existsSync(someFilePath)) return res.status(404).end();

    // create a read stream from the path; now it's ready to serve to the client
    const file = createReadStream(someFilePath);

    // set cache headers so it gets cached properly (not strictly necessary)
    // 'private' means it should be cached by an individual user (= intended for a single user) and not by a shared cache; more at https://stackoverflow.com/questions/12908766/what-is-cache-control-private#answer-49637255
    res.setHeader('Cache-Control', `private, max-age=5000`);

    // set the size header so the browser knows how large the file really is
    // using the native fs/promises stat here since there's nothing special about it; no need for external packages
    const stats = await stat(someFilePath);
    res.setHeader('Content-Length', stats.size);

    // set the mime type in case the browser can't determine what file it's getting
    const mimetype = mime.getType(someFilePath) ?? 'application/octet-stream';
    res.setHeader('Content-type', mimetype);

    // pipe it to the client using the "res" that has been given
    file.pipe(res);
}
Cheers

How to encode/compress gltf to draco

I want to compress/encode a glTF file using Draco programmatically in three.js with React. I don't want to use any command-line tool; I want it to be done programmatically. Please suggest a solution.
I tried using gltf-pipeline, but it's not working client-side. The Cesium library was showing an error when I used it in React.
Yes, this can be implemented with glTF-Transform. There's also an open feature request on three.js, not yet implemented.
First you'll need to download the Draco encoder/decoder libraries (the versions currently published to NPM do not work client-side), host them in a folder, and then load them as global script tags. There should be six files and two script tags (which will load the remaining files).
Files:
draco_decoder.js
draco_decoder.wasm
draco_wasm_wrapper.js
draco_encoder.js
draco_encoder.wasm
draco_encoder_wrapper.js
<script src="assets/draco_encoder.js"></script>
<script src="assets/draco_decoder.js"></script>
Then you'll need to write code to load a GLB file, apply compression, and do something with the compressed result. This will require first installing the two packages shown below, and then bundling the web application with your tool of choice (I used https://www.snowpack.dev/ here).
import { WebIO } from '@gltf-transform/core';
import { DracoMeshCompression } from '@gltf-transform/extensions';

const io = new WebIO()
    .registerExtensions([DracoMeshCompression])
    .registerDependencies({
        'draco3d.encoder': await new DracoEncoderModule(),
        'draco3d.decoder': await new DracoDecoderModule(),
    });

// Load an uncompressed GLB file.
const document = await io.read('./assets/Duck.glb');

// Configure compression settings.
document.createExtension(DracoMeshCompression)
    .setRequired(true)
    .setEncoderOptions({
        method: DracoMeshCompression.EncoderMethod.EDGEBREAKER,
        encodeSpeed: 5,
        decodeSpeed: 5,
    });

// Create compressed GLB, in an ArrayBuffer.
const arrayBuffer = io.writeBinary(document); // ArrayBuffer
In the latest version, 1.5.0, what are these two files, draco_decoder_gltf.js and draco_encoder_gltf.js? Does this mean we no longer need the draco_encoder and draco_decoder files? And how do we invoke the transcoder interface without using MeshBuilder? A simpler API would be much better.

Upload to Amazon S3 using Boto3 and return public url

I am trying to upload files to S3 using Boto3, make the uploaded file public, and return it as a URL.
class UtilResource(BaseZMPResource):
    class Meta(BaseZMPResource.Meta):
        queryset = Configuration.objects.none()
        resource_name = 'util_resource'
        allowed_methods = ['get']

    def post_list(self, request, **kwargs):
        fileToUpload = request.FILES
        # write code to upload to amazon s3
        # see: https://boto3.readthedocs.org/en/latest/reference/services/s3.html
        self.session = Session(aws_access_key_id=settings.AWS_KEY_ID,
                               aws_secret_access_key=settings.AWS_ACCESS_KEY,
                               region_name=settings.AWS_REGION)
        client = self.session.client('s3')
        client.upload_file('zango-static', 'fileToUpload')
        url = "some/test/url"
        return self.create_response(request, {
            'url': url  # returns the public url of the uploaded file
        })
I searched the whole documentation but couldn't find any links describing how to do this. Can someone explain or provide any resource where I can find the solution?
I'm in the same situation.
I was not able to find anything in the Boto3 docs beyond generate_presigned_url, which is not what I need in my case since I have publicly readable S3 objects.
The best I came up with is:
bucket_location = boto3.client('s3').get_bucket_location(Bucket=s3_bucket_name)
object_url = "https://s3-{0}.amazonaws.com/{1}/{2}".format(
    bucket_location['LocationConstraint'],
    s3_bucket_name,
    key_name)
You might try posting on the boto3 github issues list for a better solution.
I had the same issue.
Assuming you know the bucket name where you want to store your data, you can then use the following:
import boto3
from boto3.s3.transfer import S3Transfer

credentials = {
    'aws_access_key_id': aws_access_key_id,
    'aws_secret_access_key': aws_secret_access_key
}
client = boto3.client('s3', 'us-west-2', **credentials)
transfer = S3Transfer(client)
transfer.upload_file('/tmp/myfile', bucket, key,
                     extra_args={'ACL': 'public-read'})
file_url = '%s/%s/%s' % (client.meta.endpoint_url, bucket, key)
The best solution I found is still to use generate_presigned_url, just with Client.Config.signature_version set to botocore.UNSIGNED.
The following returns the public link without the signing stuff.

import boto3
import botocore
from botocore.client import Config

config = Config(signature_version=botocore.UNSIGNED)
boto3.client('s3', config=config).generate_presigned_url('get_object', ExpiresIn=0, Params={'Bucket': bucket, 'Key': key})
The relevant discussions on the boto3 repository are:
https://github.com/boto/boto3/issues/110
https://github.com/boto/boto3/issues/169
https://github.com/boto/boto3/issues/1415
For somebody who wants to build a direct URL to a publicly accessible object, avoiding generate_presigned_url for some reason:
please build the URL with urllib.parse.quote_plus, to handle the whitespace and special-character issues.
My object key: 2018-11-26 16:34:48.351890+09:00.jpg (note the whitespace and the ':').
S3 public link in the AWS console: https://s3.my_region.amazonaws.com/my_bucket_name/2018-11-26+16%3A34%3A48.351890%2B09%3A00.jpg
The code below was OK for me:
import boto3
from urllib.parse import quote_plus

s3_client = boto3.client('s3')
bucket_location = s3_client.get_bucket_location(Bucket='my_bucket_name')
url = "https://s3.{0}.amazonaws.com/{1}/{2}".format(
    bucket_location['LocationConstraint'],
    'my_bucket_name',
    quote_plus('2018-11-26 16:34:48.351890+09:00.jpg'))
print(url)
Going through the existing answers and their comments, I did the following, which works well for special cases of file names: those with whitespace, those with special (ASCII) characters, and corner cases such as file names of the form "key=value.txt":
import boto3
import botocore.client

config = botocore.client.Config(signature_version=botocore.UNSIGNED)
object_url = boto3.client('s3', config=config).generate_presigned_url(
    'get_object', ExpiresIn=0,
    Params={'Bucket': s3_bucket_name, 'Key': key_name})
print(object_url)
For Django: if you use django-storages with boto3, the code below does exactly what you want:
default_storage.url(name=f.name)
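To spell that out, a minimal sketch of the idea (assuming django-storages is configured as the default storage backend with S3, and f is an uploaded file object; the names are illustrative):

from django.core.files.storage import default_storage

# save the file through the configured backend (S3 when django-storages is set up)
name = default_storage.save(f.name, f)
# ask the backend for the URL of the stored object
# (public if your bucket/ACL settings make it so)
url = default_storage.url(name)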
I used an f-string for the same:

import boto3

#s3_client = boto3.session.Session(profile_name='sssss').client('s3')
s3_client = boto3.client('s3')
s3_bucket_name = 'xxxxx'
s3_website_URL = f"http://{s3_bucket_name}.s3-website.{s3_client.get_bucket_location(Bucket=s3_bucket_name)['LocationConstraint']}.amazonaws.com"
