How can I give Persian stop-words to CountVectorizer as an argument?

I'm trying to use Persian stop-words with CountVectorizer() in Python (Google Colaboratory).
I don't know how to pass Persian stop-words to the function as an argument.
For instance, here is a Persian stop-words list, but I don't know how to give the list to my code:
vect = CountVectorizer(stop_words='persian', tokenizer = hazm.word_tokenize).fit(txt)

You can simply put all of those stop words you're referring to in a python list and then pass the list to the CountVectorizer. For example:
persian_stop_words = ["در", "این"]
vect = CountVectorizer(stop_words=persian_stop_words)

You can use this open-source repository to find a collection of Persian stopwords:
https://github.com/kharazi/persian-stopwords
To load them, copy and paste the rows into a single file (one stop word per line) and call it, for example, "stopwords.dat". Then you can load the file in your project and pass the resulting list as the CountVectorizer "stop_words" argument:
with open('stopwords.dat', encoding='utf-8') as f:
    persian_stop_words = [line.strip() for line in f if line.strip()]
vect = CountVectorizer(stop_words=persian_stop_words)
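As a rough sketch of the loading step, the helper below reads one stop word per line using only the standard library. The name load_stop_words and the demo file contents are hypothetical and only there for illustration.

```python
import os
import tempfile


def load_stop_words(path, encoding="utf-8"):
    """Read one stop word per line, skipping blank lines and duplicates."""
    words = []
    seen = set()
    with open(path, encoding=encoding) as handle:
        for line in handle:
            word = line.strip()
            if word and word not in seen:
                seen.add(word)
                words.append(word)
    return words


# Demo file standing in for a real stop-word file (hypothetical contents).
demo_path = os.path.join(tempfile.mkdtemp(), "stopwords.dat")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write("در\nاین\n\nدر\n")

persian_stop_words = load_stop_words(demo_path)
print(persian_stop_words)
```

The resulting plain Python list can then be passed as CountVectorizer(stop_words=persian_stop_words). If a custom tokenizer such as hazm.word_tokenize is used, the stop words presumably need to match the tokens it produces.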

How to export Snowflake Web UI Worksheet SQL to file

Classic Snowflake Web UI and the new Snowsight are great at importing sql from a file but neither allows you to export sql to a file. Is there a workaround?
You can use an IDE to connect to Snowflake and write queries. The scripts can then be saved using the IDE's features and synced with a git repo as well.
DBeaver is one such IDE that supports Snowflake:
https://hevodata.com/learn/dbeaver-snowflake/
The query pane is interactive so the obvious workaround will be:
CTRL + A (select all)
CTRL + C (copy)
<open_favourite_text_editor>
CTRL + V (paste)
CTRL + S (save)
This tool can help you while the team develops a native feature to export worksheets:
"Snowflake Snowsight Extensions wrap Snowsight features that do not have API or SQL alternatives, such as manipulating Dashboards and Worksheets, and retrieving Query Profile and step timings."
https://github.com/Snowflake-Labs/sfsnowsightextensions
Further explained on this post:
https://medium.com/snowflake/importing-and-exporting-snowsight-dashboards-and-worksheets-3cd8e34d29c8
For example, to save to a file within PowerShell:
PS > $dashboards | foreach {$_.SaveToFolder("path/to/folder")}
PS > $dashboards[0].SaveToFile("path/to/folder/mydashboard.json")
ETA: I'm adding this edit to the front because this is what actually worked.
Again, BSON was a dead end & punycode is irrelevant. I don't know why punycode is referenced in the metadata file; but my best guess is that they might use punycode to encode the worksheet name itself (though I'm not sure why that would be needed since it shouldn't need to be part of a URL).
After doing terrible things and trying a number of complex ways of dealing with escape character hell, I found that the actual encoding is very simple. It just works as an 8 bit encoding with anything that might cause problems escaped away (null, control codes, double quotes, etc.). To load, treat the file as a text file using an 8-bit encoding; extract the data as a JSON field, then re-encode that extracted data as that same encoding. I just used latin_1 to read; but it may not even matter which encoding you use as long as you are consistent and use the same one to re-encode. The encoded field will then be valid zlib compressed data.
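The decoding steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the full export script: the function name decode_worksheet_body and the simulated worksheet text are hypothetical, and real files may contain fields not shown here.

```python
import json
import zlib


def decode_worksheet_body(json_text: str, encoding: str = "latin_1") -> str:
    """Extract the compressed body from worksheet JSON text and decompress it.

    json_text is assumed to have been read from the file using `encoding`.
    """
    data = json.loads(json_text)
    body = data["wsContents"]["body"]
    # Re-encode with the same 8-bit encoding to recover the raw bytes.
    raw = body.encode(encoding)
    return zlib.decompress(raw).decode("utf8")


# Simulated worksheet file content (hypothetical, for illustration only).
sql = "select current_date;"
compressed = zlib.compress(sql.encode("utf8"))
fake_file_text = json.dumps(
    {"wsContents": {"encoding": "gzip", "body": compressed.decode("latin_1")}},
    ensure_ascii=False,
)

print(decode_worksheet_body(fake_file_text))
```

The key point is symmetry: whichever 8-bit encoding is used to read the text must also be used to turn the extracted field back into bytes.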
I decided that I wanted to start from scratch, so I needed to back up the worksheets first, and I made a Python script based on my findings above. Be warned that this may return even worksheets that you previously closed for good. After running this and verifying that backups were created, I just ran rm @~/worksheet_data/;, closed the tab & reopened it.
Here's the code (fill in the appropriate base directory location):
import os
from collections import OrderedDict
import configparser
from sqlalchemy import create_engine, exc
from snowflake.sqlalchemy import URL
import pathlib
import json
import zlib
import string

def format_filename(s: str) -> str:  # From https://gist.github.com/seanh/93666
    """Take a string and return a valid filename constructed from the string.
    Uses a whitelist approach: any characters not present in valid_chars are
    removed. Also spaces are replaced with underscores.
    Note: this method may produce invalid filenames such as ``, `.` or `..`
    When I use this method I prepend a date string like '2009_01_15_19_46_32_'
    and append a file extension like '.txt', so I avoid the potential of using
    an invalid filename.
    """
    valid_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
    filename = ''.join(c for c in s if c in valid_chars)
    # filename = filename.replace(' ','_') # I don't like spaces in filenames.
    return filename

def trlng_dash(s: str) -> str:
    """Removes a trailing dash if present."""
    return s[:-1] if s.endswith('-') else s
sso_authenticate = True
# Assumes CLI config file exists.
config = configparser.ConfigParser()
home = pathlib.Path.home()
config_loc = home/'.snowsql/config' # Assumes it's set up from Snowflake CLI.
base_dir = home/r'{your Desired base directory goes here.}'
json_dir = base_dir/'json' # Location for your worksheet stage JSON files.
sql_dir = base_dir/'sql' # Location for your worksheets.
# Assumes CLI config file exists.
config.read(config_loc)
# Add connection parameters here (assumes CLI config exists).
# Using sso so only 2 are needed.
# If there's no config file, etc. enter by hand here (or however you want to do it).
connection_params = {
    'account': config['connections']['accountname'],
    'user': config['connections']['username'],
}
if sso_authenticate:
    connection_params['authenticator'] = 'externalbrowser'
if config['connections'].get('password', None) is not None:
    connection_params['password'] = config['connections']['password']
if config['connections'].get('rolename', None) is not None:
    connection_params['role'] = config['connections']['rolename']
if locals().get('database', None) is not None:
    connection_params['database'] = database
if locals().get('schema', None) is not None:
    connection_params['schema'] = schema
sf_engine = create_engine(URL(**connection_params))
if not base_dir.exists():
    base_dir.mkdir()
if not json_dir.exists():
    json_dir.mkdir()
if not (sql_dir).exists():
    sql_dir.mkdir()
with sf_engine.connect() as connection:
    connection.execute(f'get @~/worksheet_data/ \'file://{str(json_dir.as_posix())}\';')
# Give every downloaded file a .json suffix (this includes the metadata file).
for file in [path for path in json_dir.glob('*') if path.is_file()]:
    if file.suffix != '.json':
        file.replace(file.with_suffix(file.suffix + '.json'))
with open(json_dir/'metadata.json', 'r') as metadata_file:
    files_meta = json.load(metadata_file)
# List of files from metadata file will contain some empty worksheets.
files_description_orig = OrderedDict((file_key_value['name'], file_key_value) for file_key_value in sorted(files_meta['activeWorksheets'] + list(files_meta['inactiveWorksheets'].values()), key=lambda x: x['name']) if file_key_value['name'])
# files_description will only track non empty worksheets
files_description = files_description_orig.copy()
# Create updated files description filtering out empty worksheets.
for item in files_description_orig:
    json_file = json_dir/f"{files_description_orig[item]['name']}.json"
    # If a file didn't make it or was deleted by hand, we should
    # remove from the filtered description & continue to the next item.
    if not (json_file.exists() and json_file.is_file()):
        del files_description[item]
        continue
    with open(json_file, 'r', encoding='latin_1') as f:
        json_dat = json.load(f)
        # If the file represents a worksheet with a body field, we want it.
        if not json_dat['wsContents'].get('body'):
            del files_description[item]
            ## Delete JSON files corresponding to empty worksheets.
            # f.close()
            # try:
            #     (json_dir/f"{files_description_orig[item]['name']}.json").unlink()
            # except:
            #     pass
# Produce a list of normalized filenames (no illegal or awkward characters).
file_names = set(
    format_filename(trlng_dash(files_description[item]['encodedDetails']['scriptName']).strip())
    for item in files_description)
# Add useful information to our files_description OrderedDict
for file_name in file_names:
    repeats_cnt = 0
    file_name_repeats = (
        item
        for item
        in files_description
        if file_name == format_filename(trlng_dash(files_description[item]['encodedDetails']['scriptName']).strip())
    )
    for file_uuid in file_name_repeats:
        files_description[file_uuid]['normalizedName'] = file_name
        files_description[file_uuid]['stemSuffix'] = '' if repeats_cnt == 0 else f'({repeats_cnt:0>2})'
        repeats_cnt += 1
# Now we iterate on non-empty worksheets only.
for item in files_description:
    json_file = json_dir/f"{files_description[item]['name']}.json"
    with open(json_file, 'r', encoding='latin_1') as f:
        json_dat = json.load(f)
    body = json_dat['wsContents']['body']
    body_bin = body.encode('latin_1')
    body_txt = zlib.decompress(body_bin).decode('utf8')
    sql_file = sql_dir/f"{files_description[item]['normalizedName']}{files_description[item]['stemSuffix']}.sql"
    with open(sql_file, 'w') as sql_f:
        sql_f.write(body_txt)
    creation_stamp = files_description[item]['created']/1000
    os.utime(sql_file, (creation_stamp,creation_stamp))
print('Done!')
As mentioned at "Is there any option in snowflake to save or load worksheets?" (and in Snowflake's own documentation), in the Classic UI, the worksheets are saved in the user stage under @~/worksheet_data/.
You can download them with a get command like:
get @~/worksheet_data/<name> file:///<your local location>; (though you might need quoting if running from Windows).
The problem is that I do not know how to access them programmatically. The downloaded files look like JSON but are not valid JSON. The main key is "wsContents" and contains most of the worksheet information. Its value includes two subkeys, "encoding" and "body".
The "encoding" key denotes that gzip is being used. The "body" key seems to be the actual worksheet data which looks a lot like a straight binary representation of the compressed text data. As such, any JSON reader will choke on it.
If it is anything like that, I do not currently know how to access it programmatically using Python.
I do see that a JSON-like format exists, BSON, which is bundled with PyMongo. Trying to use it on these files fails. I even tried bson.is_valid and it returns False, so I am assuming that these files in Snowflake are not actually BSON.
Edited to add: Again, BSON is a dead end.
Examining the "body" value as just binary data, the first two bytes of sample files do seem to correspond to default zlib compression (0x789c). However, attempting to run straight zlib.decompress on the slice created from that first byte to the last corresponding to the first & last characters of the "body" value results in the error:
Error -3 while decompressing data: invalid code lengths set
This makes me think that the bytes there, as is, are at least partly garbage and still need some processing before they can be decompressed.
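To make the header observation concrete, here is a small sketch of how the first two bytes can be checked against the zlib (RFC 1950) header rules. The helper name has_zlib_header is hypothetical. A valid header only says the stream starts like zlib data; it does not guarantee that the remaining bytes are intact.

```python
import zlib


def has_zlib_header(data: bytes) -> bool:
    """Return True if the first two bytes form a valid zlib (RFC 1950) header."""
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    uses_deflate = (cmf & 0x0F) == 8            # compression method 8 = deflate
    checksum_ok = ((cmf << 8) | flg) % 31 == 0  # header check bits
    return uses_deflate and checksum_ok


sample = zlib.compress(b"select 1;")
print(sample[:2].hex())
print(has_zlib_header(sample))
```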
One clue that I failed to mention earlier is that the metadata file (called "metadata", which serves as an inventory of the remaining files at the @~/worksheet_data/ location) declares that the files use the punycode encoding. However, I do not know how to use that information. The data in these files doesn't particularly look like what I feel punycode should look like, nor does it make sense to me that you would use punycode on binary data, such as zlib compressed data, that is never meant to be used directly as text.

Returning the filename of the current sketch

I am trying to write a GUI that will display the name of the sketch it was generated from using a simple text() command. However, I am running into trouble getting any of the general JS solutions to work for me. Many solutions I have found use the filename reserved word but that does not seem to be reserved in Processing 3.5.4. I have also tried parsing the strings using a similar method to what can be found here. I am very new to processing and this is only my 2nd attempt at using Processing.
Any advice would be greatly appreciated
You can get the path (as a string) to the sketch with sketchPath().
From there you could either parse the string (pull off everything after the last slash) to get the sketch name, or you can use sketchFile() to get a reference to the file itself and get the name from there:
String path = sketchPath();
File file = sketchFile(path);
String sketchName = file.getName();
println(sketchName);
You could combine this all into one line like so:
String sketchName = sketchFile(sketchPath()).getName();

Send XLSX file as mail attachment via ABAP

I have to create an email and attach an XLSX file. I looked at the BCS_EXAMPLE_7 program.
I have transformed the content with the following method:
TRY.
cl_bcs_convert=>string_to_solix(
EXPORTING
iv_string = lv_content
iv_codepage = '4103'
iv_add_bom = 'X'
IMPORTING
et_solix = pt_binary_content
ev_size = pv_size ).
CATCH cx_bcs.
ls_return-type = text-023.
ls_return-message = text-024.
APPEND ls_return TO pt_return.
ENDTRY.
CONCATENATE lv_save_file_name '_' sy-datum '.xlsx' INTO lv_save_file_name.
lv_attachment_subject = lv_save_file_name.
CONCATENATE '&SO_FILENAME=' lv_attachment_subject INTO ls_attachment_header.
APPEND ls_attachment_header TO lt_attachment_header.
lo_document->add_attachment( i_attachment_type = 'XLS'
i_attachment_subject = lv_attachment_subject
i_attachment_size = pv_size
i_att_content_hex = pt_binary_content
i_attachment_header = lt_attachment_header ).
The email is sent correctly but when I open the attachment I see the error
Cannot open the file because the file extension is incorrect
Could you help me? thanks
That's normal Excel behavior, unrelated to ABAP: it happens when the file name has the extension .xlsx but the file doesn't contain data in the XLSX format. Excel does the same kind of check for other extensions. If you need more information about these checks, please search the Web.
Since your program creates the attachment from text converted to the UTF-16LE code page (SAP code page 4103), I guess that you created the Excel data as CSV, tab-separated values or even the old Excel XMLSS/XML 2003 format.
In that case, the extension .xlsx is not valid. To avoid the message, use the appropriate extension: .csv, .txt or .xml, respectively.
If you really need the extension .xlsx for some reason, then you must create the data in XLSX format. You may use the free API abap2xlsx. If you need further assistance about how to use abap2xlsx, please ask a new question (unrelated to email).
NB: maybe you were told to use the extension .xlsx although there is no real need for it (each format has its own features, but simple unformatted values can be handled by all formats). In that case you may propose a simple format like CSV or tab-separated values.
NB: the opposite case may also occur: Excel detects that the file contains XLSX data but the file name doesn't have the extension .xlsx (and likewise for the other formats). I can't say exactly how Excel reacts in each case.
It appears that whatever you have in lv_content isn't actually a valid Excel file. You cannot just take arbitrary data, give it the extension .xlsx and expect MS Excel to know what to do with it.
Unfortunately, creating valid MS Office files is anything but trivial. It's a format which is theoretically open and based on XML (actually a zip archive containing multiple XML files), but in practice the specification is over 5000(!) pages long.
Fortunately, there is a library for that. abap2xlsx is an open source (Apache License) library which provides an easy API to create (and read) valid XLSX files in ABAP.
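To illustrate the "zip archive containing multiple XML files" point, here is a rough Python sketch that checks whether some bytes even have the outer shape of an XLSX file. The function name looks_like_xlsx and the sample data are hypothetical, and the check is deliberately shallow: it only looks for two parts that a typical workbook contains and does not validate the content.

```python
import io
import zipfile


def looks_like_xlsx(content: bytes) -> bool:
    """Rough check: is content a zip archive with the parts an XLSX usually has?"""
    buffer = io.BytesIO(content)
    if not zipfile.is_zipfile(buffer):
        return False
    with zipfile.ZipFile(buffer) as archive:
        names = set(archive.namelist())
    return "[Content_Types].xml" in names and "xl/workbook.xml" in names


# Text data encoded as UTF-16LE with a BOM, similar in spirit to the question.
csv_bytes = "\ufeffname;city\nAnna;Rome\n".encode("utf-16-le")

# A tiny archive with placeholder parts (not a complete workbook).
fake = io.BytesIO()
with zipfile.ZipFile(fake, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("xl/workbook.xml", "<workbook/>")

print(looks_like_xlsx(csv_bytes))
print(looks_like_xlsx(fake.getvalue()))
```

Text converted with a code page, as in the question, fails the very first check because it is not a zip archive at all.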
You could also try to open the file with a text editor (e.g. Notepad++); maybe this gives a hint about the actual content.
But I guess that something went wrong generating the binary table. Maybe you are using the wrong file size or code page.
Possible problems:
First problem: as Sandra correctly said, the content of your lv_content variable may be invalid, i.e. it doesn't correspond to a correct XLSX structure.
Second problem: which you have already solved, as seen from your code: BCS classes do not support 4-character extensions.
Here is a sample showing how to build and send a correct XLSX file by mail:
SELECT * UP TO 100 ROWS
FROM spfli
INTO TABLE @DATA(lt_spfli).
cl_salv_table=>factory( IMPORTING r_salv_table = DATA(lr_table)
CHANGING t_table = lt_spfli ).
DATA: lr_xldimension TYPE REF TO if_ixml_node,
lr_xlworksheet TYPE REF TO if_ixml_element.
DATA(lv_xlsx) = lr_table->to_xml( if_salv_bs_xml=>c_type_xlsx ).
DATA(lr_zip) = NEW cl_abap_zip( ).
lr_zip->load( lv_xlsx ).
lr_zip->get( EXPORTING name = 'xl/worksheets/sheet1.xml' IMPORTING content = DATA(lv_file) ).
DATA(lr_file) = NEW cl_xml_document( ).
lr_file->parse_xstring( lv_file ).
* Row elements are under SheetData
DATA(lr_xlnode) = lr_file->find_node( 'sheetData' ).
DATA(lr_xlrows) = lr_xlnode->get_children( ).
* Create new element in the XML file
lr_xlworksheet ?= lr_file->find_node( 'worksheet' ).
DATA(lr_xlsheetpr) = cl_ixml=>create( )->create_document( )->create_element( name = 'sheetPr' ).
DATA(lr_xloutlinepr) = cl_ixml=>create( )->create_document( )->create_element( name = 'outlinePr' ).
lr_xlsheetpr->if_ixml_node~append_child( lr_xloutlinepr ).
lr_xloutlinepr->set_attribute( name = 'summaryBelow' value = 'false' ).
lr_xldimension ?= lr_file->find_node( 'dimension' ).
lr_xlworksheet->if_ixml_node~insert_child( new_child = lr_xlsheetpr ref_child = lr_xldimension ).
* Create xstring and move it to XLSX
lr_file->render_2_xstring( IMPORTING stream = lv_file ).
lr_zip->delete( EXPORTING name = 'xl/worksheets/sheet1.xml' ).
lr_zip->add( EXPORTING name = 'xl/worksheets/sheet1.xml' content = lv_file ).
lv_xlsx = lr_zip->save( ).
DATA lv_size TYPE i.
DATA lt_bintab TYPE solix_tab.
* Convert to binary
CALL FUNCTION 'SCMS_XSTRING_TO_BINARY'
EXPORTING
buffer = lv_xlsx
IMPORTING
output_length = lv_size
TABLES
binary_tab = lt_bintab.
DATA main_text TYPE bcsy_text.
* create persistent send request
DATA(send_request) = cl_bcs=>create_persistent( ).
* create document object from internal table with text
APPEND 'Valid Excel file' TO main_text.
DATA(document) = cl_document_bcs=>create_document( i_type = 'RAW' i_text = main_text i_subject = 'Test Created for stella' ).
DATA lt_att_head TYPE soli_tab.
APPEND '&SO_FILENAME=MySheet.xlsx' TO lt_att_head.
* add the spread sheet as attachment to document object
document->add_attachment(
i_attachment_type = 'xls'
i_attachment_subject = 'MySheet'
i_attachment_size = CONV so_obj_len( lv_size )
i_attachment_header = lt_att_head
i_att_content_hex = lt_bintab ).
send_request->set_document( document ).
DATA(recipient) = cl_cam_address_bcs=>create_internet_address( 'some_recipient@mail.com' ).
send_request->add_recipient( recipient ).
DATA(sent_to_all) = send_request->send( i_with_error_screen = 'X' ).
COMMIT WORK.

kotlin add elements in array as sections

Can anyone teach me how to initialize an array and add items to it as sections?
Something like array[["john"],["daniel"],["jane"]].
I tried a few things like var d = ArrayList<CustomClass>(arrayListOf<CustomClass>())
but that does not work.
For example, if I add "david", I want it to look like array[["john"],["daniel", "david"],["keith"]] by finding the index of the section that contains the letter "d".
How can I display them in a custom base adapter / ListView? I'm currently using a ViewHolder:
val viewHolder = ViewHolder(row.name) to display.
Thanks!
If I am right, you're trying to create an ArrayList that contains other ArrayLists. So writing:
var d = ArrayList<CustomClass>(arrayListOf<CustomClass>())
won't work, because the element type of the outer ArrayList is not String but ArrayList<String>. So you need to write:
var names = ArrayList<ArrayList<String>>()
names.add(arrayListOf("jane", "john"))
Regarding the second question, check out this course that has a section about RecyclerView: https://classroom.udacity.com/courses/ud9012 If you are a beginner, I strongly encourage you to follow the whole tutorial.

MATLAB extract data from mat-files using logical conditions

I have a lot of data in several hundred .mat-files from which I want to extract specific data. The names of my .mat-files all contain specific numbers that identify the content, in the form Number1_Number2_Number3_Number4.mat:
01_33_06_121.mat
01_24_12_124.mat
02_45_15_118.mat
02_33_11_190.mat
01_33_34_142.mat
Now I want to extract for example all the data from files with Number1=01 or Number1=02 and Number2=33.
Before I start to write a program from scratch, I would like to know, if there is a simple way to do this with Matlab. Does anybody know how I can solve this problem in a fast way?
Thanks a lot!
There are multiple ways you can do this; off the top of my head, the following can work:
Obtain all the file names in a cell array:
allFiles = dir( 'folder' );
allNames = { allFiles.name };
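Loop through your file names and compare each one against the condition using a regular expression:
pattern = '^0[12]_33_\d+_\d+\.mat$'; % assumed pattern: Number1 is 01 or 02 and Number2 is 33
for i = 1:numel(allNames)
    if ~isempty(regexp(allNames{i}, pattern, 'once'))
        disp(allNames{i})
    end
end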
