Upload all text file content to mongodb (using pymongo) - file

I'm just getting started with MongoDB.
I have a text file with 3,400,000 lines of data and I want to upload that data to a MongoDB database.
The text file looks like this:
360715;157.55.34.97;Mozilla/5.0;/pub/index.asp;NULL
3360714;157.55.32.233;Mozilla/5.0;/pub/index.asp;NULL
....
and I want to put it on a mongodb database with the following structure :
{'log' : '360715;157.55.34.97;Mozilla/5.0;/pub/index.asp;NULL'}
{'log' : '3360714;157.55.32.233;Mozilla/5.0;/pub/index.asp;NULL'}
....
Currently I am uploading it line by line like this:
import re

for data_line in records:
    # strip everything except the characters we want to keep
    parsed_line = re.sub(r'[^a-zA-Z0-9,.\;():/]', '', data_line)
    to_insert = unicode(parsed_line, "utf-8")  # Python 2; use bytes.decode() on Python 3
    db.data_source.insert({'log': to_insert})  # one round trip per line
Is there a way to upload all the lines at once in the format I want?
The line-by-line approach is far too slow.
Previously I had written a Python script to parse the text file; it takes one second for 100,000 lines and 65 seconds for all 3,400,000.
I considered using MongoDB to improve the process, but with MongoDB it takes as long just to upload 100,000 lines to the database as my script takes to parse all the data. So it's not hard to conclude that I'm doing something badly wrong with MongoDB.
I'm grateful if someone can help me.
Greetings, João

I would try bulk inserting the data into MongoDB. Right now you are calling insert() once for every single line in your text file, which is expensive.
If you are running MongoDB version 2.6.x you can use bulk inserts (http://api.mongodb.org/python/current/examples/bulk.html). Example code from the tutorial:
>>> import pymongo
>>> db = pymongo.MongoClient().bulk_example
>>> db.test.insert(({'i': i} for i in xrange(10000)))
[...]
>>> db.test.count()
10000
In this case a single insert() call inserted 10,000 documents into MongoDB.
Another example below:
>>> new_posts = [{"author": "Mike",
... "text": "Another post!",
... "tags": ["bulk", "insert"],
... "date": datetime.datetime(2009, 11, 12, 11, 14)},
... {"author": "Eliot",
... "title": "MongoDB is fun",
... "text": "and pretty easy too!",
... "date": datetime.datetime(2009, 11, 10, 10, 45)}]
>>> posts.insert(new_posts)
[ObjectId('...'), ObjectId('...')]
http://api.mongodb.org/python/current/tutorial.html
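For your file specifically, a rough sketch with a newer pymongo (3.x or later, where insert_many is available) might look like the following; the file name, batch size, and database/collection names are placeholders:

import re
from pymongo import MongoClient

client = MongoClient()                    # adjust host/port as needed
collection = client.logs.data_source      # hypothetical database/collection names

BATCH_SIZE = 10000                        # number of documents per insert call
batch = []
with open("access_log.txt") as records:   # hypothetical file name
    for data_line in records:
        parsed_line = re.sub(r'[^a-zA-Z0-9,.\;():/]', '', data_line)
        batch.append({'log': parsed_line})
        if len(batch) >= BATCH_SIZE:
            collection.insert_many(batch) # one round trip per 10,000 lines
            batch = []
if batch:                                 # flush whatever is left over
    collection.insert_many(batch)

With the older API the same idea works by passing the whole batch list to a single insert() call, as in the tutorial snippet above.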

Related

Azure Data Factory - converting lookup result array

I'm pretty new to Azure Data Factory (ADF) and have stumbled into something I would have solved with a couple of lines of code.
Background
Main flow:
Lookup Activity fetching an array of IDs to process
ForEach Activity looping over the input array and using a Copy Activity to pull data from a REST API and store it in a database
Step #1 results in an array containing IDs:
{
"count": 10000,
"value": [
{
"id": "799128160"
},
{
"id": "817379102"
},
{
"id": "859061172"
},
... many more...
Step #2: when the lookup returns a lot of IDs, individual REST calls take a lot of time. The REST API supports batching IDs using a comma-separated input.
The question
How can I convert the input array into a new array with comma-separated fields? This will reduce the number of Activities and the time to run.
I'm expecting something like this:
{
"count": 1000,
"value": [
{
"ids": "799128160,817379102,859061172,...."
},
{
"ids": "n,n,n,n,n,n,n,n,n,n,n,n,...."
}
... many more...
EDIT 1 - 19th Dec 22
Using an Until Activity and keeping track of positions, I managed to do it in plain ADF. It would have been nice if this could be done with some simple array manipulation in a code snippet.
The ideal answer might be that we have to do the manipulation with a Data Flow.
My sample input:
First, I took a Data Flow in which I added a Surrogate Key transformation after the source; say the new key field is 'SrcKey'.
Next, add an Aggregate where you group by mod(SrcKey/3); this puts rows with the same remainder into the same bucket.
In the same Aggregate, add a collect column that gathers the IDs into an array, using the expression trim(toString(collect(id)),'[]').
Finally, store the output in a single file in blob storage.
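For what it's worth, the array manipulation the Data Flow performs can be sketched in a few lines of plain Python (purely illustrative, outside ADF; the sample IDs and the bucket size of 3 are stand-ins mirroring the mod(SrcKey/3) grouping above):

# Illustrative only: turn a list of IDs into comma-separated batches,
# mimicking the Surrogate Key + mod grouping + collect steps above.
ids = ["799128160", "817379102", "859061172", "901234567"]  # stand-in sample IDs

BUCKET_SIZE = 3
value = [{"ids": ",".join(ids[i:i + BUCKET_SIZE])}
         for i in range(0, len(ids), BUCKET_SIZE)]
result = {"count": len(value), "value": value}
# result == {'count': 2, 'value': [{'ids': '799128160,817379102,859061172'},
#                                  {'ids': '901234567'}]}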

flatten command takes one variable, how to use it for multiple rows

I'm new to Snowflake and trying to figure out a way to flatten multiple rows. I appreciate any help here!
We are receiving files via Snowpipe, and multiple files land in one staged table. Each file contains multiple records, but all the records are in one array variable. When I use the flatten command on a single file, it works fine: it flattens the array and separates the records. However, when there are multiple files, the flatten command fails with the error "single-row subquery returns more than one row". How do I handle this?
{ "test": [ { "a": true, "b": "20" }, { "a": true, "b": "30"}, { "a": false, "b": "40" } ], "Date": "Sun Jan 02 2022 16:00:30 GMT+0000 (Coordinated Universal Time)" }
This is made-up data for clarification; each file contains data like this. The SQL I'm using is:
select * from table(flatten(select $1:test from #stage))
If there is only one file with the above structure it works fine; however, for multiple files it fails.
Change the order of operations:
SELECT
    s.*,
    f.*
FROM #stage s,
    TABLE(FLATTEN(input=>s.$1:test)) f
This way you get a row s for every staged file, and then get access to the flattened f results.

How to split an annotated dataset into sentences

I have a dataset annotated in spaCy 2 format like below:
td = ["Where is Shaka Khan lived.I Live In London.", {"entities": [(9, 19, "FRIENDS"),(32, 37, "JILLA")]}]
My dataset has sequences longer than 512 tokens, and I am trying to migrate to Hugging Face, so I would like to split each document into sentences and update the entity tagging at the same time. Are there any tools available for that? My expected result should be like below:
td = [["Where is Shaka Khan lived.", {"entities": [(9, 19, "FRIENDS")]}],["I Live In London.", {"entities": [(10, 16, "JILLA")]}],]
Why do it with spaCy? Write a small parser that splits it, then run spaCy on the already-split sentences; it will give you the same result you want.
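A minimal sketch of such a parser, assuming sentences are separated by plain '.' characters and that no entity span crosses a sentence boundary:

def split_annotated(text, entities):
    # Split `text` on '.' (keeping the dot with its sentence) and re-base
    # each (start, end, label) entity offset onto its own sentence.
    results = []
    offset = 0
    for sent in [s + "." for s in text.split(".") if s.strip()]:
        start, end = offset, offset + len(sent)
        sent_entities = [(s - start, e - start, label)
                         for (s, e, label) in entities
                         if start <= s and e <= end]
        results.append([text[start:end], {"entities": sent_entities}])
        offset = end
    return results

td = ["Where is Shaka Khan lived.I Live In London.",
      {"entities": [(9, 19, "FRIENDS"), (32, 37, "JILLA")]}]
split_td = split_annotated(td[0], td[1]["entities"])  # one [sentence, {"entities": ...}] pair per sentence

For real data you would probably swap the naive split on '.' for spaCy's sentencizer or another proper sentence splitter, keeping the same offset re-basing logic.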

Read a Struct JSON with AWS Glue that is on a single line

I have this JSON in a bucket that has been crawled with a classifier that splits arrays into records, using the JSON classifier $[*].
I noticed that the JSON is on a single line (nothing wrong with the syntax), but this results in the created table having a single column of type array containing a struct, which in turn contains the actual fields I need.
In Athena I wasn't able to access the data, and Glue was not able to read the columns as array.field, so I manually changed the structure of the table to a single struct type with the other fields inside. This I am able to query in Athena, and the Glue wizard recognises the single columns as part of the struct.
When I create the job and map the fields accordingly (the mapping below is what is automatically generated; note the array.field notation), I test the output against a table in an S3 bucket. The job does not fail at all, BUT it creates files in the bucket that are empty!
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("array.col1", "long", "col1", "long"), ("array.col2", "string", "col2", "string"), ("array.col3", "string", "col3", "string")], transformation_ctx = "applymapping1")
Another thing I've tried is to modify the source JSON and add line breaks.
this is before:
[{"col1":322,"col2":299,"col3":1613552400000,"col4":"TEST","col5":"TEST"},{"col1":2,"col2":0,"col3":1613552400000,"col4":"TEST","col5":"TEST"}]
this is after:
[
{"col1":322,"col2":299,"col3":1613552400000,"col4":"TEST","col5":"TEST"},
{"col1":2,"col2":0,"col3":1613552400000,"col4":"TEST","col5":"TEST"}
]
Modifying the file as shown above lets me correctly read and write the data, which led me to believe that the problem is having a badly shaped JSON to begin with. Before asking for the JSON to be changed, is there something I can implement in my Glue job (Spark 2.4, Python 3) to handle a JSON on a single line? I've searched everywhere but found nothing.
The end goal is to load the data into Redshift; we're working S3 to S3 to check why the data isn't being read.
Thanks in advance for your time and consideration.
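One thing that might be worth trying (a sketch under assumptions, not a verified fix for this exact job): read the file directly with Spark's JSON reader in multiLine mode, which accepts a file whose whole content is a single JSON array, and then hand the result back to Glue as a DynamicFrame. The bucket path and names below are hypothetical:

# Sketch: parse the single-line JSON array with plain Spark, one row per
# array element, then continue the Glue job with a DynamicFrame.
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

df = (spark.read
      .option("multiLine", "true")                  # whole file is one JSON document
      .json("s3://my-bucket/path/to/source.json"))  # hypothetical path

dyf = DynamicFrame.fromDF(df, glue_context, "source_dyf")
# ...continue with ApplyMapping and the write to the target as before,
# mapping "col1", "col2", ... directly instead of "array.col1", ...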

explode() of pyspark.sql is not working as expected when even one record does not follow the schema

I have a JSON file which I would like to convert (say, to CSV) by expanding one of the fields into columns.
I used explode() for this, but it gives an error when even one of the many records does not have the exact schema.
Input File:
{"place": "KA",
"id": "200",
"swversion": "v.002",
"events":[ {"time": "2020-05-23T22:34:32.770Z", "eid": 24, "app": "testing", "state": 0} ]}
{"place": "AP",
"id": "100",
"swversion": "v.001",
"events":[[]]}
In the above, I want to expand the fields of "events" so that they become columns.
In the ideal case, "events" is an array of struct type.
Expected Output File Columns:
*place, id, swversion, time, eid, app, state*
For this I have used explode() from pyspark.sql, but because the second record in the input file does not follow the schema where "events" is an array of struct type, explode() fails with an error.
The code I have used to explode:
df = spark.read.json("InputFile")
ndf = df.withColumn("event", explode("events")).drop("events")
ndf.select("place", "id", "swversion", "event.*")
The last line fails because of the second record in my input file.
It should not be too difficult for explode() to handle this, I believe.
Can you suggest how to avoid the error
Cannot expand star types
If I change "events":[[]] to "events":[{}], explode() works fine since it is again an array of struct type, but since I have no control over the input data, I need to handle this.
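One way to sidestep the error, sketched under the assumption that the event fields are known up front: give spark.read.json an explicit schema so "events" is always array<struct> (records whose "events" do not match come through with nulls instead of breaking the inferred type), and use explode_outer so those rows are kept:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode_outer
from pyspark.sql.types import ArrayType, LongType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Declare "events" as array<struct> up front instead of relying on inference.
event_type = StructType([
    StructField("time", StringType()),
    StructField("eid", LongType()),
    StructField("app", StringType()),
    StructField("state", LongType()),
])
schema = StructType([
    StructField("place", StringType()),
    StructField("id", StringType()),
    StructField("swversion", StringType()),
    StructField("events", ArrayType(event_type)),
])

df = spark.read.schema(schema).json("InputFile")          # non-conforming "events" end up null
ndf = df.withColumn("event", explode_outer("events")).drop("events")
ndf.select("place", "id", "swversion", "event.*").show()  # null event fields for bad records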
