In Django how can I run a custom clean function on fixture data during import and validation? - database

In a ModelForm I can write a clean_<field_name> member function to automatically validate and clean up data entered by a user, but what can I do about dirty json or csv files (fixtures) during a manage.py loaddata?

Fixtures loaded with loaddata are assumed to contain clean data that doen't need validation (usually as an inverse operation to a prior dumpdata), so the short answer is that loaddata isn't the approach you want if you need to clean your inputs.
However, you probably can use some of the underpinnings of loaddata while implementing your custom data cleaning code--I'm sure you can easily script something using the Django serialization libs to read your existing data files them in and the save the resulting objects normally after the data has been cleaned up.

In case others want to do something similar, I defined a model method to do the cleaning (so it can be called from ModelForms)
MAX_ZIPCODE_DIGITS = 9
MIN_ZIPCODE_DIGITS = 5
def clean_zip_code(self, s=None):
#s = str(s or self.zip_code)
if not s: return None
s = re.sub("\D","",s)
if len(s)>self.MAX_ZIPCODE_DIGITS:
s = s[:self.MAX_ZIPCODE_DIGITS]
if len(s) in (self.MIN_ZIPCODE_DIGITS-1,self.MAX_ZIPCODE_DIGITS-1):
s = '0'+s # FIXME: deal with other intermediate lengths
if len(s)>=self.MAX_ZIPCODE_DIGITS:
s = s[:self.MIN_ZIPCODE_DIGITS]+'-'+s[self.MIN_ZIPCODE_DIGITS:]
return s
Then wrote a standalone python script to clean up my legacy json files using any clean_ methods found among the models.
import os, json
def clean_json(app = 'XYZapp', model='Entity', fields='zip_code', cleaner_prefix='clean_'):
# Set the DJANGO_SETTINGS_MODULE environment variable.
os.environ['DJANGO_SETTINGS_MODULE'] = app+".settings"
settings = __import__(app+'.settings').settings
models = __import__(app+'.models').models
fpath = os.path.join( settings.SITE_PROJECT_PATH, 'fixtures', model+'.json')
if isinstance(fields,(str,unicode)):
fields = [fields]
Ns = []
for field in fields:
try:
instance = getattr(models,model)()
except AttributeError:
print 'No model named %s could be found'%(model,)
continue
try:
cleaner = getattr(instance, cleaner_prefix+field)
except AttributeError:
print 'No cleaner method named %s.%s could be found'%(model,cleaner_prefix+field)
continue
print 'Cleaning %s using %s.%s...'%(fpath,model,cleaner.__name__)
fin = open(fpath,'r')
if fin:
l = json.load(fin)
before = len(l)
cleans = 0
for i in range(len(l)):
if 'fields' in l[i] and field in l[i]['fields']:
l[i]['fields'][field]=cleaner(l[i]['fields'][field]) # cleaner returns None to delete records
cleans += 1
fin.close()
after = len(l)
assert after>.5*before
Ns += [(before, after,cleans)]
print 'Writing %d/%d (new/old) records after %d cleanups...'%Ns[-1]
with open(fpath,'w') as fout:
fout.write(json.dumps(l,indent=2,sort_keys=True))
return Ns
if __name__ == '__main__':
clean_json()

Related

Turning an Array into a Data Frame

I would like to put my summary statistics into a table using the kable function, but I cannot because it comes up as an array.
```{r setup options, include = FALSE}
knitr::opts_chunk$set(fig.width = 8, fig.height = 5, echo = TRUE)
library(mosaic)
library(knitr)
```
```{r}
sum = summary(SwimRecords$time) # generic data set included with mosaic package
kable(sum) # I want this to be printed into a table
```
Any suggestions?
You can do so easily with the broom package which is built to "tidy" these stats-related objects:
#install.packages(broom)
broom::tidy(sum)

Memory-efficient large dataset streaming to S3

I am trying to copy over S3 large dataset (larger than RAM) using SQL alchemy.
My constraints are:
I need to use sqlalchemy
I need to keep memory pressure at lowest
I don't want to use the local filsystem as intermediary step to send data to s3
I just want to pipe data from a DB to S3 in a memory efficient way.
I can to do it normal with data sets (using below logic) but with larger dataset I hit a buffer issue.
The first problem I solved is that executing a query usually buffers the result in memory. I use the fetchmany() method.
engine = sqlalchemy.create_engine(db_url)
engine.execution_options(stream_results=True)
results=engine.execute('SELECT * FROM tableX;')
while True:
chunk = result.fetchmany(10000)
if not chunk:
break
On the other side, I have a StringIO buffer that I feed with the fetchmany data check. Then I send its content to s3.
from io import StringIO
import boto3
import csv
s3_resource = boto3.resource('s3')
csv_buffer = StringIO()
csv_writer = csv.writer(csv_buffer, delimiter=';')
csv_writer.writerows(chunk)
s3_resource.Object(bucket, s3_key).put(Body=csv_buffer.getvalue())
The problem I have is essentially a design issue, how do I make these parts work together. Is it even possible in the same runtime?
engine = sqlalchemy.create_engine(db_url)
s3_resource = boto3.resource('s3')
csv_buffer = StringIO()
csv_writer = csv.writer(csv_buffer, delimiter=';')
engine.execution_options(stream_results=True)
results=engine.execute('SELECT * FROM tableX;')
while True:
chunk = result.fetchmany(10000)
csv_writer = csv.writer(csv_buffer, delimiter=';')
csv_writer.writerows(chunk)
s3_resource.Object(bucket, s3_key).put(Body=csv_buffer.getvalue())
if not chunk:
break
I can make it work for one cycle of fetchmany, but not several. Any idea?
I'm assuming that by "make these parts work together" you mean you want a single file in S3 instead of just parts? All you need to do is to create a file object that, when read, will issue a query for the next batch and buffer that. We can make use of python's generators:
def _generate_chunks(engine):
with engine.begin() as conn:
conn = conn.execution_options(stream_results=True)
results = conn.execute("")
while True:
chunk = results.fetchmany(10000)
if not chunk:
break
csv_buffer = StringIO()
csv_writer = csv.writer(csv_buffer, delimiter=';')
csv_writer.writerows(chunk)
yield csv_buffer.getvalue().encode("utf-8")
This is a stream of chunks of your file, so all we need to do is to stitch these together (lazily, of course) into a file object:
class CombinedFile(io.RawIOBase):
def __init__(self, strings):
self._buffer = ""
self._strings = iter(strings)
def read(self, size=-1):
if size < 0:
return self.readall()
if not self._buffer:
try:
self._buffer = next(self._strings)
except StopIteration:
pass
if len(self._buffer) > size:
ret, self._buffer = self._buffer[:size], self._buffer[size:]
else:
ret, self._buffer = self._buffer, b""
return ret
chunks = _generate_chunks(engine)
file = CombinedFile(chunks)
upload_file_object_to_s3(file)
Streaming the file object to S3 is left as an exercise for the reader. (You can probably use put_object.)

Run another DAG with TriggerDagRunOperator multiple times

i have a DAG (DAG1) where i copy a bunch of files. I would then like to kick off another DAG (DAG2) for each file that was copied. As the number of files copied will vary per DAG1 run, i would like to essentially loop over the files and call DAG2 with the appropriate parameters.
eg:
with DAG( 'DAG1',
description="copy files over",
schedule_interval="* * * * *",
max_active_runs=1
) as dag:
t_rsync = RsyncOperator( task_id='rsync_data',
source='/source/',
target='/destination/' )
t_trigger_preprocessing = TriggerDagRunOperator( task_id='trigger_preprocessing',
trigger_daq_id='DAG2',
python_callable=trigger
)
t_rsync >> t_trigger_preprocessing
i was hoping to use the python_callable trigger to pull the relevant xcom data from t_rsync and then trigger DAG2; but its not clear to me how to do this.
i would prefer to put the logic of calling DAG2 here to simplify the contents of DAG2 (and also provide stacking schematics with the max_active_runs)
ended up writing my own operator:
class TriggerMultipleDagRunOperator(TriggerDagRunOperator):
def execute(self, context):
count = 0
for dro in self.python_callable(context):
if dro:
with create_session() as session:
dbag = DagBag(settings.DAGS_FOLDER)
trigger_dag = dbag.get_dag(self.trigger_dag_id)
dr = trigger_dag.create_dagrun(
run_id=dro.run_id,
state=State.RUNNING,
conf=dro.payload,
external_trigger=True)
session.add(dr)
session.commit()
count = count + 1
else:
self.log.info("Criteria not met, moving on")
if count == 0:
raise AirflowSkipException('No external dags triggered')
with a python_callable like
def trigger_preprocessing(context):
for base_filename,_ in found.items():
exp = context['ti'].xcom_pull( task_ids='parse_config', key='experiment')
run_id='%s__%s' % (exp['microscope'], datetime.utcnow().replace(microsecond=0).isoformat())
dro = DagRunOrder(run_id=run_id)
d = {
'directory': context['ti'].xcom_pull( task_ids='parse_config', key='experiment_directory'),
'base': base_filename,
'experiment': exp['name'],
}
LOG.info('triggering dag %s with %s' % (run_id,d))
dro.payload = d
yield dro
return
and then tie it all together with:
t_trigger_preprocessing = TriggerMultipleDagRunOperator( task_id='trigger_preprocessing',
trigger_dag_id='preprocessing',
python_callable=trigger_preprocessing
)

shiny selecting specific columns from uploaded data frame

I have merged different sources of code to make an app that allows one to upload a file (data frame).
However, beyond this I would also like to make it possible to select specific columns from the data frame and analyse them. This is difficult however as one must predefine the given data frame in order to be able to refer to it in the ui.R script....
So when a previously undefined data frame is uploaded to the site, one can not revere to it in the ui.R as it is defined in the server....
predefined variables
vchoices <- 1:ncol(mtcars)
names(vchoices) <- names(mtcars)
ui.R
runApp(
ui = basicPage(
h2('The uploaded file data'),
dataTableOutput('mytable'),
fileInput('file', 'Choose info-file to upload',
accept = c(
'text/csv',
'text/comma-separated-values',
'text/tab-separated-values',
'text/plain',
'.csv',
'.tsv'
)
),
actionButton("choice", "incorporate external information"),
selectInput("columns", "Select Columns", choices=vchoices, inline = T),
#notice that the 'choices' in selectInput are set to the predefined
#variables above whereas I would like to set them equal to the
#not yet defined uploaded file below in server.R
tableOutput("table_display")
))
Notice that the 'choices' in selectInput are set to the predefined variables above whereas I would like to set them equal to the not yet defined uploaded file below in server.R
server.R
server = function(input, output) {
info <- eventReactive(input$choice, {
inFile <- input$file
if (is.null(inFile))
return(NULL)
isolate(f<-read.table(inFile$datapath, header = T,
sep = "\t"))
f
})
output$table_display<-renderTable({
f<-info()
f<-subset(f, select=input$columns) #subsetting takes place here
head(f)
})
}
Does anyone know of a way to refer to a variable that's defined in in the server, in the ui and thus allow for interactive manipulation?
You can use a family of functions update*Input - in this case updateSelectInput. Its first argument has to be session and you also have to add session to server <- function(input, output) to be able to update your widget.
You could make an update of the widget immediately after clicking on the actionButton - so, you had to use updateSelectInput within eventReactive.
Let's take a look how we can do that:
First, you can save the names of columns of the new uploaded dataset in a variable, say, vars and then pass it to the function updateSelectInput.
(The choices of the selectInput are initially set to NULL - we don't need to specify them before because they are going to be updated anyway)
info <- eventReactive(input$choice, {
inFile <- input$file
# Instead # if (is.null(inFile)) ... use "req"
req(inFile)
# Changes in read.table
f <- read.table(inFile$datapath, header = input$header, sep = input$sep, quote = input$quote)
vars <- names(f)
# Update select input immediately after clicking on the action button.
updateSelectInput(session, "columns","Select Columns", choices = vars)
f
})
I've added a small upload interface to your code.
The other way would be to define widgets on the server side and then to pass them to the client side via renderUI function. You can find here an example.
Full example:
library(shiny)
ui <- fluidPage(
h2('The uploaded file data'),
dataTableOutput('mytable'),
fileInput('file', 'Choose info-file to upload',
accept = c(
'text/csv',
'text/comma-separated-values',
'text/tab-separated-values',
'text/plain',
'.csv',
'.tsv'
)
),
# Taken from: http://shiny.rstudio.com/gallery/file-upload.html
tags$hr(),
checkboxInput('header', 'Header', TRUE),
radioButtons('sep', 'Separator',
c(Comma=',',
Semicolon=';',
Tab='\t'),
','),
radioButtons('quote', 'Quote',
c(None='',
'Double Quote'='"',
'Single Quote'="'"),
'"'),
################################################################
actionButton("choice", "incorporate external information"),
selectInput("columns", "Select Columns", choices = NULL), # no choices before uploading
tableOutput("table_display")
)
server <- function(input, output, session) { # added session for updateSelectInput
info <- eventReactive(input$choice, {
inFile <- input$file
# Instead # if (is.null(inFile)) ... use "req"
req(inFile)
# Changes in read.table
f <- read.table(inFile$datapath, header = input$header, sep = input$sep, quote = input$quote)
vars <- names(f)
# Update select input immediately after clicking on the action button.
updateSelectInput(session, "columns","Select Columns", choices = vars)
f
})
output$table_display <- renderTable({
f <- info()
f <- subset(f, select = input$columns) #subsetting takes place here
head(f)
})
}
shinyApp(ui, server)

How to append or change data within a file

Hiya i have made a program that stores the player name and strength..Here is the code:
data = {
"PLAYER":name2,
"STRENGTH":str(round(strength, 2)),
}
with open("data2.txt", "w", encoding="utf-8") as file:
file.write(repr(data))
file.close()
So this stores the data so what to i do if i wanna append/change the value after a certain action usch as a 'BATTLE'
Is it possible the get the variable of 'STRENGTH' and then change the number?
At the moment to read data from the external file 'DATA1.txt'i am using this code:
with open("data1.txt", "r", encoding="utf-8") as file:
data_string = file.readline()
data = eval(data_string)
# (data["STRENGTH"])
S1 = (float(data["STRENGTH"]))
file.close()
Now i can do something with the variable --> 'S1'
Here is the external text file 'data1.txt'
{'PLAYER': 'Oreo', 'STRENGTH': '11.75'}
... But i wanna change the strength value after a "battle" many thanks
Maybe you're not understanding Python dict semantics?
Seems to me you're doing a lot of unnecessary things like S1 = (float(data['STRENGTH'])) to try to manipulate and change values when you could be doing really simple stuff.
>>> data = {'PLAYER': 'Oreo', 'STRENGTH': '11.75'}
>>> data['STRENGTH'] = float(data['STRENGTH'])
>>> data
{'PLAYER': 'Oreo', 'STRENGTH': 11.75}
>>> data['STRENGTH'] += 1
>>> data
{'PLAYER': 'Oreo', 'STRENGTH': 12.75}
Maybe you should give Native Data Types -- Dive Into Python 3 a read to see if it clears things up.

Resources