Load joblib/pickle files from the Databricks file system (DBFS)

I have a problem: I can't seem to load objects from DBFS (the Databricks file system) outside of Spark (I can load the data fine with Spark, but not with pandas).
The objects we want to load are joblib and pickle files.
corps_encoder = "/dbfs:/mnt/xxxx/encoders/corpsEncoder.joblib"
corpsEncoding = joblib.load(f'/dbfs:/mnt/xxxx/encoders/corpsEncoder.joblib')
Here is the error message:
FileNotFoundError: [Errno 2] No such file or directory: '/dbfs:/mnt/xxxx/encoders/corpsEncoder.joblib'
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<command-2364347171462503> in <module>
1 #corps_encoder = "/dbfs:/mnt/xxxx/encoders/corpsEncoder.joblib"
----> 2 corpsEncoding = joblib.load(f'/dbfs:/mnt/xxxx/encoders/corpsEncoder.joblib')
/databricks/python/lib/python3.8/site-packages/joblib/numpy_pickle.py in load(filename, mmap_mode)
575 obj = _unpickle(fobj)
576 else:
--> 577 with open(filename, 'rb') as f:
578 with _read_fileobject(f, filename, mmap_mode) as fobj:
579 if isinstance(fobj, str):
FileNotFoundError: [Errno 2] No such file or directory: '/dbfs:/mnt/xxxx/encoders/corpsEncoder.joblib'
Any idea how to load this type of file on Databricks?
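A minimal sketch of the usual workaround, assuming the standard /dbfs FUSE mount is available on the cluster: local Python file APIs (open, joblib.load, pandas) don't understand the dbfs:/ URI scheme, so /dbfs:/mnt/... is treated as a plain (nonexistent) local path, while /dbfs/mnt/... resolves through the mount.
import joblib

# Assumes the standard FUSE mount: dbfs:/mnt/... is exposed locally as /dbfs/mnt/...
corps_encoder = "/dbfs/mnt/xxxx/encoders/corpsEncoder.joblib"
corpsEncoding = joblib.load(corps_encoder)
If the mount is not exposed that way on your cluster type, another option is to copy the object to local disk first, e.g. dbutils.fs.cp("dbfs:/mnt/xxxx/encoders/corpsEncoder.joblib", "file:/tmp/corpsEncoder.joblib"), and load it from /tmp.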

Related

Odoo 14 connection to database failed

I did a clean install of Odoo 14 and PostgreSQL for a new job, and I had this problem.
This is the Odoo log file:
2022-09-26 22:08:19,881 4564 INFO ? odoo.sql_db: Connection to the database failed
2022-09-26 22:08:19,883 4564 INFO ? werkzeug: 127.0.0.1 - - [26/Sep/2022 22:08:19] "GET /favicon.ico HTTP/1.1" 500 - 0 0.000 0.019
2022-09-26 22:08:19,885 4564 ERROR ? werkzeug: Error on request:
Traceback (most recent call last):
File "D:\odoo\python\lib\site-packages\werkzeug\serving.py", line 306, in run_wsgi
execute(self.server.app)
File "D:\odoo\python\lib\site-packages\werkzeug\serving.py", line 294, in execute
application_iter = app(environ, start_response)
File "D:\odoo\server\odoo\service\server.py", line 441, in app
return self.app(e, s)
File "D:\odoo\server\odoo\service\wsgi_server.py", line 113, in application
return application_unproxied(environ, start_response)
File "D:\odoo\server\odoo\service\wsgi_server.py", line 88, in application_unproxied
result = odoo.http.root(environ, start_response)
File "D:\odoo\server\odoo\http.py", line 1328, in __call__
return self.dispatch(environ, start_response)
File "D:\odoo\server\odoo\http.py", line 1294, in __call__
return self.app(environ, start_wrapped)
File "D:\odoo\python\lib\site-packages\werkzeug\middleware\shared_data.py", line 220, in __call__
return self.app(environ, start_response)
File "D:\odoo\server\odoo\http.py", line 1479, in dispatch
self.setup_db(httprequest)
File "D:\odoo\server\odoo\http.py", line 1385, in setup_db
httprequest.session.db = db_monodb(httprequest)
File "D:\odoo\server\odoo\http.py", line 1567, in db_monodb
dbs = db_list(True, httprequest)
File "D:\odoo\server\odoo\http.py", line 1534, in db_list
dbs = odoo.service.db.list_dbs(force)
File "D:\odoo\server\odoo\service\db.py", line 384, in list_dbs
with closing(db.cursor()) as cr:
File "D:\odoo\server\odoo\sql_db.py", line 678, in cursor
return Cursor(self.__pool, self.dbname, self.dsn, serialized=serialized)
File "D:\odoo\server\odoo\sql_db.py", line 250, in __init__
self._cnx = pool.borrow(dsn)
File "D:\odoo\server\odoo\sql_db.py", line 561, in _locked
return fun(self, *args, **kwargs)
File "D:\odoo\server\odoo\sql_db.py", line 629, in borrow
**connection_info)
File "D:\odoo\python\lib\site-packages\psycopg2\__init__.py", line 127, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError - - -
I'm new to the world of databases and coding... It seems like a small problem, but I don't know how to resolve it. Thanks for reading.
Check the status of the postgresql service. It seems it has not started.
Check status:
sudo service postgresql status
Restart:
sudo service postgresql restart
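If the service turns out to be running and the error persists, a quick way to narrow things down is to try the same connection directly with psycopg2, using the db_host/db_user/db_password values from odoo.conf (the values below are placeholders, not taken from the question):
import psycopg2

try:
    # Placeholders: replace with the db_host/db_port/db_user/db_password from odoo.conf
    conn = psycopg2.connect(host="localhost", port=5432,
                            user="odoo", password="odoo", dbname="postgres")
    print("Connected, server version:", conn.server_version)
    conn.close()
except psycopg2.OperationalError as exc:
    # Same exception class Odoo is hitting; the message usually says whether the
    # server is unreachable, not accepting connections, or rejecting the credentials.
    print("Connection failed:", exc)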

Flink 1.15.1 Checkpoint Problem with gRPC

I am trying to understand the Flink checkpointing system (in PyFlink), which is why I created a playground for it. Here is my environment:
env = StreamExecutionEnvironment.get_execution_environment()
config = Configuration(j_configuration=get_j_env_configuration(env._j_stream_execution_environment))
config.set_integer("python.fn-execution.bundle.size", 1000)
env.set_runtime_mode(RuntimeExecutionMode.STREAMING)
env.enable_checkpointing(3000)
env.get_checkpoint_config().set_checkpointing_mode(CheckpointingMode.AT_LEAST_ONCE)
env.get_checkpoint_config().set_checkpoint_storage(FileSystemCheckpointStorage('file:///home/ubuntu/checkpoints'))
env.get_checkpoint_config().set_fail_on_checkpointing_errors(True)
env.get_checkpoint_config().set_tolerable_checkpoint_failure_number(1)
and I use state with a simple map function:
class RandomRaiserMap(MapFunction):
    def open(self, runtime_context: RuntimeContext):
        state_desc = ValueStateDescriptor('cnt', Types.INT())
        self.cnt_state = runtime_context.get_state(state_desc)

    def map(self, value: Row):
        cnt = self.cnt_state.value()
        self.cnt_state.update(1 if cnt is None else cnt + 1)
        if random.random() < 0.001:
            raise Exception("Planted")
        return Row(count=self.cnt_state.value(), **value.as_dict())
I use the Kafka source (FlinkKafkaConsumer) as follows:
deserialization_schema = JsonRowDeserializationSchema.builder() \
    .type_info(Types.ROW_NAMED(["seq", "payload", "event_time"], [Types.INT(), Types.STRING(), Types.SQL_TIMESTAMP()])).build()
source = FlinkKafkaConsumer(
    topics='test_checkpoint_sequence_2',
    deserialization_schema=deserialization_schema,
    properties={'bootstrap.servers': 'localhost:9092'}
)
Now, when I start to execute the job:
3> +I[1,273328,23fe377010c55a18,2022-08-02 12:36:50.325000]
3> +I[2,273329,79782a6af4a9badb,2022-08-02 12:36:50.336000]
3> +I[3,273330,d3537b0b9aedf6f9,2022-08-02 12:36:50.348000]
3> +I[4,273331,d600824835e4f04d,2022-08-02 12:36:50.359000]
3> +I[5,273332,e1e0851bfbe743ab,2022-08-02 12:36:50.370000]
3> +I[6,273333,27a547b6d4469ac0,2022-08-02 12:36:50.381000]
3> +I[7,273334,f12da6dbccbfc1be,2022-08-02 12:36:50.392000]
Exception in thread read_grpc_client_inputs:
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/py38/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/home/ubuntu/miniconda3/envs/py38/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/miniconda3/envs/py38/lib/python3.8/site-packages/apache_beam/runners/worker/data_plane.py", line 598, in <lambda>
target=lambda: self._read_inputs(elements_iterator),
File "/home/ubuntu/miniconda3/envs/py38/lib/python3.8/site-packages/apache_beam/runners/worker/data_plane.py", line 581, in _read_inputs
for elements in elements_iterator:
File "/home/ubuntu/miniconda3/envs/py38/lib/python3.8/site-packages/grpc/_channel.py", line 426, in __next__
return self._next()
File "/home/ubuntu/miniconda3/envs/py38/lib/python3.8/site-packages/grpc/_channel.py", line 826, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.CANCELLED
details = "Multiplexer hanging up"
debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:38775 {grpc_message:"Multiplexer hanging up", grpc_status:1, created_time:"2022-08-02T12:36:53.274140888+00:00"}"
>
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/py38/lib/python3.8/site-packages/apache_beam/runners/worker/data_plane.py", line 470, in input_elements
element = received.get(timeout=1)
File "/home/ubuntu/miniconda3/envs/py38/lib/python3.8/queue.py", line 178, in get
raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/py38/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 289, in _execute
response = task()
File "/home/ubuntu/miniconda3/envs/py38/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 362, in <lambda>
lambda: self.create_worker().do_instruction(request), request)
File "/home/ubuntu/miniconda3/envs/py38/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 606, in do_instruction
return getattr(self, request_type)(
File "/home/ubuntu/miniconda3/envs/py38/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 644, in process_bundle
bundle_processor.process_bundle(instruction_id))
File "/home/ubuntu/miniconda3/envs/py38/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 988, in process_bundle
for element in data_channel.input_elements(instruction_id,
File "/home/ubuntu/miniconda3/envs/py38/lib/python3.8/site-packages/apache_beam/runners/worker/data_plane.py", line 473, in input_elements
raise RuntimeError('Channel closed prematurely.')
RuntimeError: Channel closed prematurely.
3> +I[188,273515,0fec22df999f7054,2022-08-02 12:36:52.362000]
3> +I[189,273516,9587bc4039b1b8d9,2022-08-02 12:36:52.373000]
3> +I[190,273517,e8e2d9dceb66ba15,2022-08-02 12:36:52.384000]
3> +I[191,273518,569c5513e994fa6b,2022-08-02 12:36:52.395000]
3> +I[192,273519,1f2809cd273204f9,2022-08-02 12:36:52.406000]
3> +I[193,273520,2f63560bdbeb2c96,2022-08-02 12:36:52.417000]
When a periodic checkpoint starts, I get this error. However, it manages to write a checkpoint to the given path (like 1f488f8d38f82d55cc342e1d0f676707), and if an error happens, it can recover as well. I am stuck on what the cause of this error could be.
I use
Ubuntu 22.04
Java 11
Python 3.8
Flink 1.15.1
apache-flink 1.15.1

OwlReady Ontology Parsing Error while loading ontology

I am trying to load the ontology http://semrob-ontology.mpi.aass.oru.se/OntoCity.owl using the OwlReady Python library:
from owlready2 import *
onto = get_ontology("http://semrob-ontology.mpi.aass.oru.se/OntoCity.owl").load()
The traceback is as below:
---------------------------------------------------------------------------
HTTPError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/owlready2/namespace.py in load(self, only_local, fileobj, reload, reload_if_newer, **args)
769 if _LOG_LEVEL: print("* Owlready2 * ...loading ontology %s from %s..." % (self.name, f), file = sys.stderr)
--> 770 try: fileobj = urllib.request.urlopen(f)
771 except: raise OwlReadyOntologyParsingError("Cannot download '%s'!" % f)
HTTPError: HTTP Error 308: Permanent Redirect
During handling of the above exception, another exception occurred:
OwlReadyOntologyParsingError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/owlready2/namespace.py in load(self, only_local, fileobj, reload, reload_if_newer, **args)
769 if _LOG_LEVEL: print("* Owlready2 * ...loading ontology %s from %s..." % (self.name, f), file = sys.stderr)
770 try: fileobj = urllib.request.urlopen(f)
--> 771 except: raise OwlReadyOntologyParsingError("Cannot download '%s'!" % f)
772 try: new_base_iri = self.graph.parse(fileobj, default_base = self.base_iri, **args)
773 finally: fileobj.close()
OwlReadyOntologyParsingError: Cannot download 'http://purl.org/dc/elements/1.1'!
How do I resolve this error?
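One common workaround when an imported ontology cannot be downloaded (here the failure is on the imported http://purl.org/dc/elements/1.1, not on OntoCity itself) is to keep local copies and let Owlready2 resolve them from disk via onto_path; a rough sketch, where the ./ontologies directory and the pre-downloaded files are assumptions:
from owlready2 import get_ontology, onto_path

# Assumption: OntoCity.owl and the Dublin Core OWL file it imports have been
# downloaded into ./ontologies beforehand (browser, curl, etc.).
onto_path.append("./ontologies")

# Owlready2 looks for a local copy in onto_path before trying to download.
onto = get_ontology("http://semrob-ontology.mpi.aass.oru.se/OntoCity.owl").load()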

Airflow Scheduler with SQL Server backend and pyodbc

I have set up Airflow with SQL Server as the backend (SQL Azure). Init DB is successful. I am trying to run a simple DAG every 2 minutes.
The DAG has 2 tasks:
print date
sleep
When I start the airflow scheduler, it creates task instances for both tasks; the first one succeeds and the second one seems to be stuck in the "running" state.
Looking at the scheduler logs, I see the following error repeatedly.
[2019-01-04 11:38:48,253] {jobs.py:397} ERROR - Got an exception! Propagating...
Traceback (most recent call last):
File "/home/sshuser/.local/lib/python2.7/site-packages/airflow/jobs.py", line 389, in helper
pickle_dags)
File "/home/sshuser/.local/lib/python2.7/site-packages/airflow/utils/db.py", line 74, in wrapper
return func(*args, **kwargs)
File "/home/sshuser/.local/lib/python2.7/site-packages/airflow/jobs.py", line 1816, in process_file
dag.sync_to_db()
File "/home/sshuser/.local/lib/python2.7/site-packages/airflow/utils/db.py", line 74, in wrapper
return func(*args, **kwargs)
File "/home/sshuser/.local/lib/python2.7/site-packages/airflow/models.py", line 4296, in sync_to_db
DagModel).filter(DagModel.dag_id == self.dag_id).first()
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2755, in first
ret = list(self[0:1])
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2547, in __getitem__
return list(res)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2855, in __iter__
return self._execute_and_instances(context)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2876, in _execute_and_instances
close_with_result=True)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2885, in _get_bind_args
**kw
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2867, in _connection_from_session
conn = self.session.connection(**kw)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 1019, in connection
execution_options=execution_options)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 1024, in _connection_for_bind
engine, execution_options)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 409, in _connection_for_bind
conn = bind.contextual_connect()
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 2112, in contextual_connect
self._wrap_pool_connect(self.pool.connect, None),
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 2151, in _wrap_pool_connect
e, dialect, self)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1465, in _handle_dbapi_exception_noconnection
exc_info
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 2147, in _wrap_pool_connect
return fn()
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 387, in connect
return _ConnectionFairy._checkout(self)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 768, in _checkout
fairy = _ConnectionRecord.checkout(pool)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 516, in checkout
rec = pool._do_get()
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 1140, in _do_get
self._dec_overflow()
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/util/langhelpers.py", line 66, in __exit__
compat.reraise(exc_type, exc_value, exc_tb)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 1137, in _do_get
return self._create_connection()
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 333, in _create_connection
return _ConnectionRecord(self)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 461, in __init__
self.__connect(first_connect_check=True)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 651, in __connect
connection = pool._invoke_creator(self)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/strategies.py", line 105, in connect
return dialect.connect(*cargs, **cparams)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/default.py", line 393, in connect
return self.dbapi.connect(*cargs, **cparams)
InterfaceError: (pyodbc.InterfaceError) ('28000', u"[28000] [unixODBC][Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Login failed for user 'airflowuser'. (18456) (SQLDriverConnect)")
Airflow is configured to use LocalExecutor and pyodbc to connect to SQL Azure:
# The executor class that airflow should use. Choices include
# SequentialExecutor, LocalExecutor, CeleryExecutor, DaskExecutor
#executor = SequentialExecutor
executor = LocalExecutor
# The SqlAlchemy connection string to the metadata database.
# SqlAlchemy supports many different database engine, more information
# their website
#sql_alchemy_conn = sqlite:////home/sshuser/airflow/airflow.db
# Connection string to MS SQL Server (see the encoding note after this config block)
sql_alchemy_conn = mssql+pyodbc://airflowuser@afdsqlserver76:<password>@afdsqlserver76.database.windows.net:1433/airflowdb?driver=ODBC+Driver+17+for+SQL+Server
# The encoding for the databases
sql_engine_encoding = utf-8
# If SqlAlchemy should pool database connections.
sql_alchemy_pool_enabled = True
# The SqlAlchemy pool size is the maximum number of database connections
# in the pool. 0 indicates no limit.
sql_alchemy_pool_size = 10
# The SqlAlchemy pool recycle is the number of seconds a connection
# can be idle in the pool before it is invalidated. This config does
# not apply to sqlite. If the number of DB connections is ever exceeded,
# a lower config value will allow the system to recover faster.
sql_alchemy_pool_recycle = 180
# How many seconds to retry re-establishing a DB connection after
# disconnects. Setting this to 0 disables retries.
sql_alchemy_reconnect_timeout = 300
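As an aside, Azure SQL logins usually take the user@servername form, and any @ in the credentials has to be escaped in a SQLAlchemy URL; one way to sidestep the quoting is to build the value from a raw ODBC connection string via odbc_connect (a sketch with placeholder values, not taken from the question):
from urllib.parse import quote_plus  # on Python 2: from urllib import quote_plus

# Placeholders - substitute the real server, database and credentials.
odbc = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=tcp:afdsqlserver76.database.windows.net,1433;"
    "DATABASE=airflowdb;"
    "UID=airflowuser@afdsqlserver76;"
    "PWD=<password>"
)
print("mssql+pyodbc:///?odbc_connect=" + quote_plus(odbc))  # paste the result into sql_alchemy_conn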
The DAG is as follows:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2019, 1, 4),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}
dag = DAG('tutorial', default_args=default_args, schedule_interval='*/2 * * * *', max_active_runs=1, catchup=False)
# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)
t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)
t2.set_upstream(t1)
I am a bit lost as to why the scheduler would not be able to connect to the DB after it has run the first task successfully.
Any pointers to resolve this are much appreciated.
I have a sample program that uses SQLAlchemy to connect to SQL Azure using the same credentials, and it works:
import sqlalchemy
from sqlalchemy import create_engine

engine = create_engine("mssql+pyodbc://afdadmin@afdsqlserver76:<password>@afdsqlserver76.database.windows.net:1433/airflowdb?driver=ODBC+Driver+17+for+SQL+Server")
connection = engine.connect()
result = connection.execute("select version_num from alembic_version")
for row in result:
    print("Version:", row['version_num'])
connection.close()
The issue was resolved after pooling was enabled (Pooling = Yes) in odbcinst.ini:
[ODBC]
Pooling = Yes

Looping over with open() throws IOError

I am iterating over several DB records and writing data from their respective BLOB fields into files:
def build(self, records):
    """
    Builds openimmo.anhang
    """
    result = None
    anh_records = [r for r in records if type(r) == anhaenge]
    if not anh_records:
        return result
    anhang = []
    print('RECORDS: ' + str(len(anh_records)))
    for anh_record in anh_records:
        if anh_record.daten:
            __, path = mkstemp()
            with open(path, 'wb') as target:
                target.write(anh_record.daten)
            anh = openimmo.anhang()
            anh.anhangtitel = anh_record.anhangtitel
            anh.format = 'image/jpeg'  # MIMEUtil.getmime(path)
            anh.daten = openimmo.daten()
            anh.daten.pfad = path
            anh.location = id2location.get(anh_record.location)
            anh.gruppe = id2gruppe.get(anh_record.gruppe)
            anhang.append(anh)
    try:
        result.validateBinding()
    except:
        self.log.err('Could not build "anhang": ' + str(result))
    if anhang:
        result = openimmo.anhaenge()
        result.anhang = anhang
    return result
This, however, produces the following error:
RECORDS: 5
Message: "[Errno 24] Too many open files: '/tmp/tmpo54qfq'
daemon panic:
Caught unexpected exception in _main() on 2014-08-20 11:53:37.918353
Message: "[Errno 24] Too many open files: '/tmp/tmpo54qfq'" of type "<class 'IOError'>"
Traceback (most recent call last):
File "/usr/local/lib/python3.2/dist-packages/homie_core-1.0-py3.2.egg/homie/serv/daemon.py", line 345, in __run
File "/usr/local/lib/python3.2/dist-packages/homie_core-1.0-py3.2.egg/homie/serv/service.py", line 72, in _main
File "/usr/local/lib/python3.2/dist-packages/homie_core-1.0-py3.2.egg/homie/api/itf.py", line 127, in export
File "/usr/local/lib/python3.2/dist-packages/homie_openimmodb-0.2_indev-py3.2.egg/openimmodb/itf.py", line 51, in _retrieve
File "/usr/local/lib/python3.2/dist-packages/homie_openimmodb-0.2_indev-py3.2.egg/openimmodb/conv.py", line 27, in decode
File "/usr/local/lib/python3.2/dist-packages/homie_openimmodb-0.2_indev-py3.2.egg/openimmodb/factories/openimmo/immobilie.py", line 60, in build
File "/usr/local/lib/python3.2/dist-packages/homie_openimmodb-0.2_indev-py3.2.egg/openimmodb/factories/openimmo/anhaenge.py", line 30, in build
IOError: [Errno 24] Too many open files: '/tmp/tmpo54qfq'
According to lsof, the whole process has over 5k open files:
# lsof| grep python3| wc -l
5375
I checked it several times: I am using with open(file) as desc everywhere in the code when I open a file.
Shouldn't the files be closed automatically at the end of each with block, or am I missing something?
tempfile.mkstemp() opens a file for you:
fd, path = mkstemp()
with open(fd, 'wb') as target:
    target.write(anh_record.daten)
    # os.close(fd) is called automatically at the end of the with block
You don't need open(path), which opens another file (with the same name).
You could use tempfile.NamedTemporaryFile(delete=False) instead of tempfile.mkstemp().
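For reference, a minimal sketch of that variant (anh_record.daten stands in for the BLOB data from the question):
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as target:
    target.write(anh_record.daten)
    path = target.name
# delete=False keeps the file on disk after the block closes its descriptor,
# so path can still be handed to openimmo afterwards.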
