I have a CSV file I'm trying to load into Snowflake. It has a column 'custom_fields' where every value is a dictionary, but one row had a None value, for example {'Account_ID': None}, which threw the error:
Error parsing JSON: {'Account_ID': None}
I have the file format set up as below, where I also tried adding 'None' to null_if:
@property
def fileformat_opts(self):
    return {
        'type': 'csv',
        'skip_header': 1,
        'field_delimiter': ',',
        'field_optionally_enclosed_by': '"',
        'escape_unenclosed_field': None,
        'date_format': 'YYYY-MM-DD',
        'record_delimiter': '\n',
        'null_if': ('NULL', '', ' ', 'NULL', '//N', 'None'),
        'empty_field_as_null': True,
        'error_on_column_count_mismatch': False,
        'timestamp_format': 'YYYY-MM-DD HH24:MI:SS'
    }
You need to dump the dictionary to JSON first, e.g. in Python with json.dumps(my_dict); only then can it be loaded into Snowflake as an OBJECT or VARIANT.
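For illustration, a minimal sketch of that conversion, using the dictionary from the question (json.dumps renders Python's None as JSON null, which Snowflake can parse):

import json

custom_fields = {'Account_ID': None}

# None becomes null in the serialized output
json_value = json.dumps(custom_fields)
print(json_value)  # {"Account_ID": null}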
I think you are trying to parse this:
{"Account_ID": None}
None is not a valid JSON value. You can check your JSON string using any online validator. For example:
https://jsonlint.com/
As a solution, you can write null instead of None to your CSV file.
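As a rough sketch of that fix, assuming the 'custom_fields' column holds Python dict literals (the file names below are hypothetical), you could rewrite the column as valid JSON before loading:

import ast
import csv
import json

# Hypothetical file names; adjust the paths and column name for your data
with open('input.csv', newline='') as src, open('fixed.csv', 'w', newline='') as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # ast.literal_eval reads the Python dict literal; json.dumps writes None as null
        row['custom_fields'] = json.dumps(ast.literal_eval(row['custom_fields']))
        writer.writerow(row)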
I have logs in the following type of format:
2021-10-12 14:41:23,903716 [{"Name":"A","Dimen":[{"Name":"in","Value":"348"},{"Name":"ses","Value":"asfju"}]},{"Name":"read","A":[{"Name":"ins","Value":"348"},{"Name":"ses","Value":"asf5u"}]}]
2021-10-12 14:41:23,903716 [{"Name":"B","Dimen":[{"Name":"in","Value":"348"},{"Name":"ses","Value":"a7hju"}]},{"Name":"read","B":[{"Name":"ins","Value":"348"},{"Name":"ses","Value":"ashju"}]}]
Each log entry is on its own line. The problem is that I want each object in the top-level array on a single line to become a separate document, parsed accordingly.
I need to parse this and send it to Elasticsearch. I have tried a number of filters (grok, json, split, etc.) and I cannot get them to work the way I need, and I have little experience with these filters, so any help would be much appreciated.
The JSON codec is what I would need if I could remove the text/timestamp from the file:
"If the data being sent is a JSON array at its root multiple events will be created (one per element)"
If there is a way to do that, it would also be helpful.
Here is a config example for your use case:
input { stdin {} }

filter {
  grok {
    match => { "message" => "%{DATA:date},%{DATA:some_field} %{GREEDYDATA:json_message}" }
  }

  # Use the json plugin to parse the raw string into JSON
  json { source => "json_message" target => "json" }

  # and split the resulting array into dedicated events
  split { field => "json" }
}

output {
  stdout {
    codec => rubydebug
  }
}
If you need to parse the start of the log as a date, you can use grok with a date pattern, or concatenate the two fields and set them as the source for the date plugin.
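For reference, a minimal Python sketch of the same parsing logic (split off the timestamp, decode the JSON array, emit one document per element); the sample line is shortened from the question and the field names simply mirror the config above:

import json
import re

line = '2021-10-12 14:41:23,903716 [{"Name":"A","Dimen":[{"Name":"in","Value":"348"}]}]'

# Split the leading timestamp fields from the JSON payload
match = re.match(r'(?P<date>[^,]+),(?P<some_field>\S+) (?P<json_message>.+)', line)
documents = json.loads(match.group('json_message'))

# Each element of the top-level array becomes its own document
for doc in documents:
    print(doc)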
I use JSON to send data through an API to clients.
My data is a JSON array of objects; each object in the array has the same type, and the keys are the same for every object.
70% of a request is consumed by repeating useless key names.
Is there a way to send the data without this overhead?
"I know some formats exist, like CSV, but I want a general solution for this problem."
For example, my array is 5 MB as JSON but only 500 KB as CSV.
A simple JSON array:
var people = [
    { firstname: "Micro", hasSocialNetworkSite: false, lastname: "Soft", site: "http://microsoft.com" },
    { firstname: "Face", hasSocialNetworkSite: true, lastname: "Book", site: "http://facebook.com" },
    { firstname: "Go", hasSocialNetworkSite: true, lastname: "ogle", site: "http://google.com" },
    { firstname: "Twit", hasSocialNetworkSite: true, lastname: "Ter", site: "http://twitter.com" },
    { firstname: "App", hasSocialNetworkSite: false, lastname: "Le", site: "http://apple.com" }
];
and the array above in CSV format:
"firstname","hasSocialNetworkSite","lastname","site"
"Micro","False","Soft","http://microsoft.com"
"Face","True","Book","http://facebook.com"
"Go","True","ogle","http://google.com"
"Twit","True","Ter","http://twitter.com"
"App","False","Le","http://apple.com"
You can see the overhead of the JSON array of objects in this example.
Why would using a CSV file not be a 'general solution'?
If your data is tabular, you don't really need a hierarchical format like JSON or XML.
You can even shrink your CSV file further by removing the double quotes (they are only needed when a field contains the separator):
firstname,hasSocialNetworkSite,lastname,site
Micro,False,Soft,http://microsoft.com
Face,True,Book,http://facebook.com
Go,True,ogle,http://google.com
Twit,True,Ter,http://twitter.com
App,False,Le,http://apple.com
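As a rough sketch of how to measure the difference yourself, here is a Python comparison of the two encodings; the records mirror the sample above, and the printed byte counts are illustrative, not measured figures:

import csv
import io
import json

people = [
    {"firstname": "Micro", "hasSocialNetworkSite": False, "lastname": "Soft", "site": "http://microsoft.com"},
    {"firstname": "Face", "hasSocialNetworkSite": True, "lastname": "Book", "site": "http://facebook.com"},
]

# JSON repeats every key name in every object
json_payload = json.dumps(people)

# CSV writes the key names once, in the header row
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["firstname", "hasSocialNetworkSite", "lastname", "site"])
writer.writeheader()
writer.writerows(people)
csv_payload = buffer.getvalue()

print(len(json_payload), len(csv_payload))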
I want to split custom logs like this one:
"2016-05-11 02:38:00.617,userTestId,Key-string-test113321,UID-123,10079,0,30096,128,3"
That log means:
Timestamp, String userId, String setlkey, String uniqueId, long providerId, String itemCode1, String itemCode2, String itemCode3, String serviceType
I tried to make a filter using Ruby:
filter {
  ruby {
    code => "
      fieldArray = event['message'].split(',')
      for field in fieldArray
        result = field
        event[field[0]] = result
      end
    "
  }
}
but I have no idea how to split the log while adding a field name to each value, as below:
Timestamp : 2016-05-11 02:38:00.617
userId : userTestId
setlkey : Key-string-test113321
uniqueId : UID-123
providerId : 10079
itemCode1 : 0
itemCode2 : 30096
itemCode3 : 128
serviceType : 3
How can I do this?
Thanks and regards.
You can use the grok filter instead. The grok filter parses the line with a regex, and you can associate each group with a field.
It is possible to parse your log with this pattern:
grok {
  match => {
    "message" => [
      "%{TIMESTAMP_ISO8601:timestamp},%{USERNAME:userId},%{USERNAME:setlkey},%{USERNAME:uniqueId},%{NUMBER:providerId},%{NUMBER:itemCode1},%{NUMBER:itemCode2},%{NUMBER:itemCode3},%{NUMBER:serviceType}"
    ]
  }
}
This will create the fields you wish to have.
Reference: grok patterns on GitHub
To test: Grok constructor
Another solution:
You can use the csv filter, which is even closer to your needs (but I went with the grok filter first since I have more experience with it): Csv filter documentation
The csv filter takes an event field containing CSV data, parses it, and stores it as individual fields (you can optionally specify their names). This filter can also parse data with any separator, not just commas.
I have never used it, but it should look like this:
csv {
  columns => [ "Timestamp", "userId", "setlkey", "uniqueId", "providerId", "itemCode1", "itemCode2", "itemCode3", "serviceType" ]
}
By default, the filter works on the message field with the "," separator, so there is no need to configure them.
I think that the csv filter solution is better.
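For illustration, the field mapping that either filter produces amounts to something like this Python sketch; the column names come from your log description:

message = "2016-05-11 02:38:00.617,userTestId,Key-string-test113321,UID-123,10079,0,30096,128,3"
columns = ["Timestamp", "userId", "setlkey", "uniqueId",
           "providerId", "itemCode1", "itemCode2", "itemCode3", "serviceType"]

# Pair each value with its column name, as the csv/grok filters do
event = dict(zip(columns, message.split(",")))
print(event["setlkey"])  # Key-string-test113321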
I am creating the bulkloader.yaml automatically from my existing schema and have trouble downloading my data due to the repeated=True of my KeyProperty.
class User(ndb.Model):
    firstname = ndb.StringProperty()
    friends = ndb.KeyProperty(kind='User', repeated=True)
The automatically created bulkloader.yaml looks like this:
- kind: User
  connector: csv
  connector_options:
    # TODO: Add connector options here--these are specific to each connector.
  property_map:
    - property: __key__
      external_name: key
      export_transform: transform.key_id_or_name_as_string

    - property: firstname
      external_name: firstname
      # Type: String Stats: 2 properties of this type in this kind.

    - property: friends
      external_name: friends
      # Type: Key Stats: 2 properties of this type in this kind.
      import_transform: transform.create_foreign_key('User')
      export_transform: transform.key_id_or_name_as_string
This is the error message I am getting:
google.appengine.ext.bulkload.bulkloader_errors.ErrorOnTransform: Error on transform. Property: friends External Name: friends. Code: transform.key_id_or_name_as_string Details: 'list' object has no attribute 'to_path'
What can I do please?
Possible Solution:
After Tony's tip I came up with this:
- property: friends
  external_name: friends
  # Type: Key Stats: 2 properties of this type in this kind.
  import_transform: myfriends.stringToValue(';')
  export_transform: myfriends.valueToString(';')
myfriends.py
from google.appengine.ext.bulkload import transform

def valueToString(delimiter):
    def key_list_to_string(value):
        keyStringList = []
        if value == '' or value is None or value == []:
            return None
        for val in value:
            keyStringList.append(transform.key_id_or_name_as_string(val))
        return delimiter.join(keyStringList)
    return key_list_to_string
And this works! The output encoding is UTF-8, though; make sure to open the file as such in LibreOffice, or you will see garbled content.
The biggest challenge is import. This is what I came up with without any luck:
def stringToValue(delimiter):
    def string_to_key_list(value):
        keyvalueList = []
        if value == '' or value is None or value == []:
            return None
        for val in value.split(';'):
            keyvalueList.append(transform.create_foreign_key('User'))
        return keyvalueList
    return string_to_key_list
I get the error message:
BadValueError: Unsupported type for property friends: <type 'function'>
According to Datastore viewer, I need to create something like this:
[datastore_types.Key.from_path(u'User', u'kave#gmail.com', _app=u's~myapp1')]
Update 2:
Tony, you really are an expert on the bulkloader. Thanks for your help; your solution worked!
I have moved my other question to a new thread.
But one crucial problem appears: when I create new users, my friends field is shown as <missing> and it works fine.
However, when I use your solution to upload the data, users without any friend entries get a <null> entry. Unfortunately this seems to break the model, since friends can't be null.
Changing the model to reflect this seems to be ignored:
friends = ndb.KeyProperty(kind='User', repeated=True, required=False)
How can I fix this, please?
Update:
Digging further into it:
When the status <missing> is shown in the Datastore viewer, in code it shows friends = [].
However, when I upload the data via CSV I get a <null>, which translates to friends = [None]. I know this because I exported the data into my local datastore and could follow it in code. Strangely enough, if I empty the list with del user.friends[:], it works as expected. There must be a better way to set it while uploading via CSV, though...
Final Solution
This turns out to be a bug that hasn't been resolved for over a year.
In a nutshell, even though there is no value in the CSV, because a list is expected, GAE creates a list with a None inside. This is game-breaking, since retrieving such a model ends in an instant crash.
The fix is adding a post_import_function that deletes the lists containing a None.
In my case:
def post_import(input_dict, instance, bulkload_state_copy):
    if instance["friends"] is None:
        del instance["friends"]
    return instance
Finally everything works as expected.
When you are using repeated properties and exporting to CSV, you should do some formatting to concatenate the list into a format CSV understands. Please check the example here on importing/exporting a list of dates; I hope it helps you.
EDIT: Adding the suggestion for the import transform from an earlier comment to this answer.
For import, please try something like:
from google.appengine.api import datastore

def stringToValue(delimiter):
    def string_to_key_list(value):
        keyvalueList = []
        if value == '' or value is None or value == []:
            return None
        for val in value.split(';'):
            # Build an actual datastore Key for each delimited id/name in the CSV cell
            keyvalueList.append(datastore.Key.from_path('User', val))
        return keyvalueList
    return string_to_key_list
If you have an id instead of a name, convert it first, e.g. val = int(val).
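As a minimal sketch of that variant, assuming the CSV cell holds numeric ids separated by ';':

from google.appengine.api import datastore

def stringToValue(delimiter):
    def string_to_key_list(value):
        keyvalueList = []
        if value == '' or value is None or value == []:
            return None
        for val in value.split(';'):
            # Numeric ids must be passed to Key.from_path as ints, not strings
            keyvalueList.append(datastore.Key.from_path('User', int(val)))
        return keyvalueList
    return string_to_key_list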
My XML file is:
<?xml version="1.0" encoding="UTF-8" ?>
<root>
  <investors>
    <investor>Active</investor>
    <investor>Aggressive</investor>
    <investor>Conservative</investor>
    <investor>Day Trader</investor>
    <investor>Very Active</investor>
  </investors>
  <events>
    <event>3 Month Expiry</event>
    <event>LEAPS</event>
    <event>Monthlies</event>
    <event>Monthly Expiries</event>
    <event>Weeklies</event>
    <event>Weeklies Expiry</event>
  </events>
  <prices>
    <price>0.05</price>
    <price>0.5</price>
    <price>1</price>
    <price>1</price>
    <price>22</price>
    <price>100.34</price>
  </prices>
</root>
My ExtJS code is:
Ext.regModel('Card', {
    fields: ['investor', 'event', 'price']
});

var store = new Ext.data.Store({
    model: 'Card',
    proxy: {
        type: 'ajax',
        url: 'http:/.../initpicker.php',
        reader: {
            type: 'xml',
            record: 'root'
        }
    },
    listeners: {
        single: true,
        datachanged: function(){
            var items = [];
            store.each(function(r){
                stocks.push({text: '<span style="color:Blue;font-weight:bolder;font-size:30px;">' + r.get('price') + '</span>'});
                values.push({text: '<span style="font-weight: bold;font-size:25px;">' + r.get('investor') + '</span>'});
                points.push({text: '<span style="font-weight: bold;font-size:25px;">' + r.get('event') + '</span>'});
            });
        }
    }
});

store.read();
My question is: if my XML contains the same tag repeated, say, five times, can we still parse the data?
I've tried this code, but it only parsed the first one.
If there is any other way, please suggest it.
Thank you.
It really depends on what your Record looks like.
Is the first investor element supposed to be related to the first event and price elements and bundled into a single Record? What about the second record - would that contain Aggressive, LEAPS and 0.5 as data values? If so, the XML doesn't really make that much sense.
I don't believe Sencha's XmlReader will cope with this that well, which would explain why you're only getting the first record.
There are two solutions:
1. Modify the XML being produced to make more sense to the XmlReader
2. Write your own Reader class to parse and extract records from the format of data available to you
You can parse the XML using ExtJS, but the XML file should be on the same domain.
What are you using this XML for?
I'm assuming it's for a grid. Also your code doesn't seem to match the tags in your XML. What data are you trying to grab in your code? You should be accessing data from the tags in the XML when you set up the data object.
I would suggest that you rethink the structure of your XML. Your tags do not describe the data contained within the tags. In some ways, you seem to be missing the point of XML.
Something like this should be what you need to populate a grid:
<investors>
  <investor>
    <name>Bob</name>
    <style>Aggressive</style>
    <price>0.5</price>
  </investor>
  <investor>
    <name>Francis</name>
    <price>150.00</price>
  </investor>
</investors>
I highly suggest you check out this link:
XML Grid Sample from the Sencha Website