I need to migrate millions of records from a SQL database to Elasticsearch (ES). Currently we insert records into ES via GELF HTTP, but doing that one record at a time just isn't feasible.
I've been working on this for a couple of days and am new to both Graylog and Elasticsearch. I'm trying to find a way to bulk insert messages into ES and then have them display in Graylog. I've been using Cerebro to monitor the indices and the number of messages in each of them. When I do the bulk insert, the message count does increase in the correct index; however, I cannot see the messages in Graylog.
Here is what I have:
var _elasticsearchContext = new ElasticsearchContext(ConnectionString, new ElasticsearchMappingResolver());
var connectionSettings = new ConnectionSettings(new Uri(ConnectionString))
.MapDefaultTypeIndices(m => m.Add(typeof(Auditing_Dev), "auditing-dev_0"));
var elasticClient = new ElasticClient(connectionSettings);
var items = new List<Auditing_Dev>();
//I loop through a DataReader creating new Auditing_Dev objects
//and add them to the items collection
var bulkResponse = elasticClient.Bulk(b => b.IndexMany(items, (d, doc) => d.Document(doc).Index("auditing-dev_0").Type("message")));
I get back a valid response and I see the document count increase in Cerebro in the auditing-dev_0 index. When I compare a message that I insert via Bulk to one that is inserted via HTTP request, the indexes and types are the same.
Message I insert:
{
"_index" : "auditing-dev_0",
"_type" : "message",
"_id" : "AVsWWn-jNp2NX1vOria1",
"_version" : 1,
"found" : true,
"_source" : {
"level" : 5,
"origin" : "10.80.3.2",
"success" : true,
"type" : "Company.Enterprise",
"user" : "stupid#dropdown.test",
"gl2_source_input" : "57193c1d0cf25a44afc31c15",
"gl2_source_node" : "5866cc80-382e-4287-ae5b-8a0a68a9a1f1",
"gl2_remote_ip" : "10.100.20.164",
"gl2_remote_port" : 52273,
"streams" : [ "578fbabe738a897c6d91336b" ]
}
}
Compared to one inserted via HTTP:
{
"_index" : "auditing-dev_0",
"_type" : "message",
"_id" : "e3d34d50-0a8a-11e7-84bb-00155d007a32",
"_version" : 1,
"found" : true,
"_source" : {
"level" : 5,
"gl2_remote_ip" : "192.168.211.114",
"origin" : "192.168.211.35",
"gl2_remote_port" : 2960,
"streams" : [ "578fbabe738a897c6d91336b" ],
"gl2_source_input" : "57193c1d0cf25a44afc31c15",
"success" : "True",
"gl2_source_node" : "5866cc80-382e-4287-ae5b-8a0a68a9a1f1",
"user" : "admin#purple-pink.test",
"timestamp" : "2017-03-16 22:43:44.000"
}
}
I see the _id is in a different format, but does that matter?
In Graylog there is only one input, and that is the one for GELF HTTP. Do I need to add a new input?
It turned out to be the timestamp field not being present: the bulk-inserted message above has no "timestamp", while the one inserted via HTTP does. Who knew?
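A minimal sketch of adding the missing field before bulk indexing, written in plain JavaScript rather than the C# client used above. The helper names toGraylogTimestamp and withTimestamp are hypothetical, and the format is copied from the HTTP-inserted message ("2017-03-16 22:43:44.000"); that the value should be in UTC is an assumption, so check it against your Graylog setup.

```javascript
// Format a Date like the timestamp of the GELF-inserted message above:
// "yyyy-MM-dd HH:mm:ss.SSS" (assumed here to be UTC).
function toGraylogTimestamp(date) {
  // toISOString() yields e.g. "2017-03-16T22:43:44.000Z"
  return date.toISOString().replace("T", " ").replace("Z", "");
}

// Return the document unchanged if it already has a timestamp,
// otherwise return a copy with a timestamp field added.
function withTimestamp(doc, date) {
  if (doc.timestamp) {
    return doc;
  }
  return Object.assign({}, doc, { timestamp: toGraylogTimestamp(date) });
}

// Hypothetical record, shaped like the bulk-inserted message.
const record = { level: 5, origin: "10.80.3.2", user: "stupid#dropdown.test" };
const createdAt = new Date(Date.UTC(2017, 2, 16, 22, 43, 44));
const stamped = withTimestamp(record, createdAt);
console.log(stamped.timestamp);
```

The same idea applies to the Auditing_Dev objects: give each one a timestamp value taken from the SQL row before passing the list to the bulk call.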
Related
I have connected to mongodb and filter data like this:
val coll = mongoDB.getTable(db, "report")
val query = new BasicDBObject()
query.put("version",getVersion)
val fields = new BasicDBObject()
fields.put("version",1)
fields.put("project",1)
fields.put("uri",1)
fields.put("test",1)
fields.put("browser",1)
val cursor = coll.find(queryObject,fields)
then get a document like this:
{
"_id" : ObjectId("5a968c6cffac6135b09d6cd5"),
"project" : "com.sink.www.helper",
"uri" : "com.sink.www.helper.somketest",
"browser" : "firefox",
"develop" : "scala",
"duration" : "0.61",
"test" : "acceptance",
"version" : "1.0",
"try" : 1
}
{
"_id" : ObjectId("5a968daaffacc7195c2f9b6i"),
"project" : "com.sink.www.helper",
"uri" : "com.sink.www.helper.somketest",
"browser" : "firefox",
"develop" : "scala",
"duration" : "0.62",
"test" : "acceptance",
"version" : "1.0",
"try" : 1
}
{
"_id" : ObjectId("5a968ea5fface1a723ffr246"),
"project" : "com.sink.www.helper",
"uri" : "com.sink.www.helper.somketest",
"browser" : "chrome",
"develop" : "scala",
"duration" : "0.58",
"test" : "acceptance",
"version" : "1.0",
"try" : 1
}
.....
My question: when "version" has a specific value such as "1.0", I want to increment try (try = try + 1). I have written the code below, but it reads every document from MongoDB each time, which takes a long time:
while (cursor.hasNext())
{
val details = cursor.next()
if (details.get("version").toString() == getVersion && details.get("browser").toString() == getBrowser)
{
try += 1
}
}
Because there is a lot of data in MongoDB, this approach is not suitable for our project. Instead, I want to load the matching documents into an array or another container, check them there, and then increment try. I wrote the code below, but it reports errors:
val scenarioArray = coll.find(queryObject,fields).toArray
for (i <- 0 to scenarioArray.length - 1)
{
println(scenarioArray(i))
}
error: value length is not a member of java.util.List[com.mongodb.DBObject]
error: java.util.List[com.mongodb.DBObject] does not take parameters
I don't know how to retrieve the documents into an array, list, or other container. Could anyone help with this? Thanks.
Below is MongoDB document.
{
"_id" : ObjectId("588f09c8d466d7054114b456"),
"phonebook" : [
{
"pb_name_first" : "Aasu bhai",
"pb_phone_number" : [
{
"ph_id" : 2,
"ph_no" : "+91111111",
"ph_type" : "Mobile"
}
],
"pb_email_id" : [
{
"email_id" : "temp#gmail.com",
"email_type" : "Home",
"em_id" : 1
},
{
"email_id" : "test#gmail.com",
"email_type" : "work",
"em_id" : 2
}
],
"pb_name_prefix" : "MR."
}
]
}
I want a MongoDB query that updates the email_id value in the pb_email_id array based on em_id. If I select em_id=1, the temp#gmail.com record should be updated; if I select em_id=2, the test#gmail.com record should be updated.
I don't think you can apply if-else logic in a single update call, but you can run two separate update calls:
db.collection.update({'pb_email_id.em_id':1},{$set : {'pb_email_id.$.email_id' : 'temp#gmail.com'}},{multi:true});
db.collection.update({'pb_email_id.em_id':2},{$set : {'pb_email_id.$.email_id' : 'test#gmail.com'}},{multi:true});
However, you can run a script over the collection to apply more complex logic:
db.collection.find({}).forEach(function(doc){
  if(doc.pb_email_id && doc.pb_email_id.length>0){
    for(var i = 0; i < doc.pb_email_id.length; i++){
      if(doc.pb_email_id[i].em_id === 1){
        doc.pb_email_id[i].email_id = "temp#gmail.com"
      } else if(doc.pb_email_id[i].em_id === 2){
        doc.pb_email_id[i].email_id = "test#gmail.com"
      }
    }
    // save the modified document (not the db object) once all entries are checked
    db.collection.save(doc)
  }
})
If you need to apply more complex logic, you can run the script; otherwise the two update calls are enough.
P.S. Since you didn't mention the collection name, I used db.collection; replace it with your actual collection name, e.g. db.phonebook.find.
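Note that in the sample document pb_email_id sits inside each element of the phonebook array, so any per-entry logic has to go one level deeper than a top-level pb_email_id field. A minimal sketch of that traversal on a plain JavaScript object, outside the database (the helper setEmailById and the new address are hypothetical):

```javascript
// Sample document shaped like the one in the question (trimmed).
const doc = {
  phonebook: [
    {
      pb_name_first: "Aasu bhai",
      pb_email_id: [
        { email_id: "temp#gmail.com", email_type: "Home", em_id: 1 },
        { email_id: "test#gmail.com", email_type: "work", em_id: 2 }
      ]
    }
  ]
};

// Hypothetical helper: set email_id on every pb_email_id entry whose em_id
// matches, in every phonebook entry. Returns the number of entries changed.
function setEmailById(document, emId, newEmail) {
  let changed = 0;
  (document.phonebook || []).forEach(function (entry) {
    (entry.pb_email_id || []).forEach(function (email) {
      if (email.em_id === emId) {
        email.email_id = newEmail;
        changed += 1;
      }
    });
  });
  return changed;
}

const changedCount = setEmailById(doc, 2, "new#example.test");
console.log(changedCount, doc.phonebook[0].pb_email_id);
```

Inside a forEach script the same nested loops would run on each document before saving it. On MongoDB 3.6 or later, the filtered positional operator $[<identifier>] together with the arrayFilters option is meant for this kind of nested array update and may avoid the script entirely.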
I'm using a model tree structure with an array of ancestors, and I need to check whether any document is missing.
{
"_id" : "GbxvxMdQ9rv8p6b8M",
"type" : "article",
"ancestors" : [ ]
}
{
"_id" : "mtmTBW8nA4YoCevf4",
"parent" : "GbxvxMdQ9rv8p6b8M",
"ancestors" : [
"GbxvxMdQ9rv8p6b8M"
]
}
{
"_id" : "J5Dg4fB5Kmdbi8mwj",
"parent" : "mtmTBW8nA4YoCevf4",
"ancestors" : [
"GbxvxMdQ9rv8p6b8M",
"mtmTBW8nA4YoCevf4"
]
}
{
"_id" : "tYmH8fQeTLpe4wxi7",
"refType" : "reference",
"parent" : "J5Dg4fB5Kmdbi8mwj",
"ancestors" : [
"GbxvxMdQ9rv8p6b8M",
"mtmTBW8nA4YoCevf4",
"J5Dg4fB5Kmdbi8mwj"
]
}
My approach would be to check, for each ancestor id, whether a document with that id exists. If the lookup fails, that document is missing and the data structure is corrupted.
let missing = [];
Collection.find().forEach(r => {
if (r.ancestors) {
r.ancestors.forEach(a => {
if (!Collection.findOne(a))
missing.push(r._id);
});
}
});
But doing it like this requires MANY db calls. Is it possible to optimize this?
Maybe I could first get an array of all unique ancestor ids and then check whether these documents exist with a single db call?
First, get all distinct ancestor ids from your collection.
var allAncestorIds = db.<collectionName>.distinct("ancestors");
Then find out which of those ancestor ids actually exist in the collection.
var existingIds = db.<collectionName>.find({_id : {$in : allAncestorIds}}, {_id : 1}).map(function (doc) { return doc._id; });
Any ancestor id that is not among the existing ids belongs to a missing document. Iterate over the ancestor ids and insert the missing ones into a collection.
allAncestorIds.forEach(function (id) {
  if (existingIds.indexOf(id) === -1) db.missing.insert({_id : id});
});
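Finding missing ancestors boils down to a set difference: the ids referenced in any ancestors array minus the ids that actually exist. A minimal sketch in plain JavaScript, with an in-memory array standing in for the collection (the docs array and the helper findMissingAncestors are hypothetical; the "mtmTBW8nA4YoCevf4" document is left out on purpose to simulate corruption):

```javascript
// In-memory stand-in for the collection from the question,
// with the "mtmTBW8nA4YoCevf4" document deliberately removed.
const docs = [
  { _id: "GbxvxMdQ9rv8p6b8M", type: "article", ancestors: [] },
  { _id: "J5Dg4fB5Kmdbi8mwj", parent: "mtmTBW8nA4YoCevf4",
    ancestors: ["GbxvxMdQ9rv8p6b8M", "mtmTBW8nA4YoCevf4"] },
  { _id: "tYmH8fQeTLpe4wxi7", refType: "reference", parent: "J5Dg4fB5Kmdbi8mwj",
    ancestors: ["GbxvxMdQ9rv8p6b8M", "mtmTBW8nA4YoCevf4", "J5Dg4fB5Kmdbi8mwj"] }
];

// Return every id that is referenced as an ancestor but has no document.
function findMissingAncestors(collection) {
  const existingIds = new Set(collection.map(function (d) { return d._id; }));
  const referenced = new Set();
  collection.forEach(function (d) {
    (d.ancestors || []).forEach(function (id) { referenced.add(id); });
  });
  return Array.from(referenced).filter(function (id) {
    return !existingIds.has(id);
  });
}

const missingIds = findMissingAncestors(docs);
console.log(missingIds);
```

Against a real collection, the distinct call supplies the referenced ids and one find with $in supplies the existing ones, so the whole check needs two queries no matter how many documents there are.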
I wanted to run a query that matches documents in one collection with documents in another collection, based on a value that should be present in both sets of documents. However, I have been told that Mongo does not support a JOIN, so I believe I can't do this the way I want to.
My alternative is to insert a document into the collection where I want to run the query and update (col1); this document contains an array of all the unique cycle numbers in the other collection (col2).
Collection 1 (Col 1)
{
"_id" : ObjectId("5670961f910e1f54662c11ag"),
"objectType" : "Account Balance",
"Customer" : "Thomas Brown",
"status" : "unprocessed",
"cycle" : "1234"
},
{
"_id" : ObjectId("5670961f910e1f54662c12fd"),
"objectType" : "Account Balance",
"Customer" : "Luke Underwood",
"status" : "unprocessed",
"cycle" : "1235"
}
Collection 2 (Col 2)
{
"_id" : ObjectId("5670961f910e1f54662c1d9d"),
"objectOrigin" : "Xero",
"Value" : "500.00",
"key" : "grossprofit",
"cycle" : "1234",
"company" : "e56e09ef-5c7c-423e-b699-21469bd2ea00"
},
{
"_id" : ObjectId("5670961f910e1f54662c1d9f"),
"objectOrigin" : "Xero",
"Value" : "500.00",
"key" : "grossprofit",
"cycle" : "1234",
"company" : "0a514db8-1428-4da6-9225-0286dc2662c1"
},
{
"_id" : ObjectId("5670961f910e1f54662c1da0"),
"objectOrigin" : "Xero",
"Value" : "-127.28",
"key" : "grossprofit",
"cycle" : "1234",
"company" : "c2d0561c-dc5d-44b9-beaf-d69a3472a2b8"
},
{
"_id" : ObjectId("5670961f910e1f54662c1da1"),
"objectOrigin" : "Xero",
"Value" : "-127.28",
"key" : "grossprofit",
"cycle" : "1235",
"company" : "c3fbe6e4-962a-45f6-9ce3-71e2a588438c"
}
So I want to create a document in collection 1 which looks like this:
{
"_id" : ObjectId("5670961f910e1f54662c1d9f"),
"objectType" : "Status Updater",
"cycles" : ["1234","1235"]
}
Now what I want to do is query ALL documents where cycle = cycles and update "status" to "processed". I believe I would do this with a findAndModify with multi : true but not entirely sure.
When finished, I will just simply delete any document in the Collection 1 where objectType is "Status Updater".
If I understand correctly, you want to
a) update all documents in collection #1 where the value of cycle exists in collection #2.
b) Furthermore, your document of type "objectType" : "Status Updater" is only a temporary document to keep track of all the cycle values.
I think you can skip b) and just use the following (this code needs to be executed in the mongo shell):
// get all unique cycle values from collection 2
// returns an array containing: ["1234", "1235", ...]
var values = db.coll2.distinct("cycle", {})
// find all documents in collection 1 whose cycle value is in the values array
// and update their status
db.coll1.update({cycle: {$in: values}}, {$set: {status: "processed"}}, {multi: true})
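To see what the distinct-then-update approach does without touching a database, here is a minimal sketch in plain JavaScript with in-memory arrays standing in for the two collections. The documents are trimmed versions of the ones in the question, and the third customer with cycle "9999" is a hypothetical addition to show a non-matching document.

```javascript
// Trimmed stand-ins for collection 1 and collection 2.
const col1 = [
  { Customer: "Thomas Brown", status: "unprocessed", cycle: "1234" },
  { Customer: "Luke Underwood", status: "unprocessed", cycle: "1235" },
  { Customer: "Hypothetical Customer", status: "unprocessed", cycle: "9999" }
];
const col2 = [
  { key: "grossprofit", cycle: "1234" },
  { key: "grossprofit", cycle: "1234" },
  { key: "grossprofit", cycle: "1235" }
];

// Counterpart of distinct("cycle"): the unique cycle values in collection 2.
const cycles = Array.from(new Set(col2.map(function (d) { return d.cycle; })));

// Counterpart of the update with $in and multi: true.
let updated = 0;
col1.forEach(function (d) {
  if (cycles.indexOf(d.cycle) !== -1) {
    d.status = "processed";
    updated += 1;
  }
});
console.log(cycles, updated);
```

Because the list of cycle values is built at query time, no temporary "Status Updater" document has to be stored or cleaned up afterwards.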
I'm very new to MongoDB and am having some problems joining two collections.
I've read some posts on using mapReduce to perform joins the NoSQL way, but I'm still having some difficulties.
Collection 1: attraction
{
"_id" : "0001333b-e485-4fee-a0e2-9b7dc338d5a2",
"types" : "Shops",
"name" : "name",
"geo_location" : {
"lat" : 36.0567700000000002,
"lon" : -112.1354520000000008
},
"overall_rating" : 10.0000000000000000,
"num_of_review" : 6,
"review" : [
{
"review_ids" : [
"66ea1cd8-da34-40dc-8ad6-f30df5de9c2c",
"76f51c8d-d2a8-4609-8b7c-c2b0c386e35c",
"185c962a-fcfe-4d03-a3ac-86398be6312a",
"2212535b-28c6-423e-91f7-cc1dfb407d79",
"7e0f1d85-e79e-4bec-9e9c-7dfb03223816",
"f19a83a6-c6ef-4cbe-b90d-f6187bd50baa"
]
}
]
}
Collection 2: attraction_review
{
"_id" : "7e0f1d85-e79e-4bec-9e9c-7dfb03223816",
"user_id" : "somename",
"review_id" : "r122796525",
"unified_id" : "0001333b-e485-4fee-a0e2-9b7dc338d5a2",
"source_id" : "d1057961",
"review_url" : "someURL",
"title" : "some title",
"overall_rating" : 10,
"review_date" : "dates",
"content" : "some contents here",
"source" : "source",
"traval_date" : "dates",
"sort" : ""
}
Basically I need to keep (or copy) the reviews in attraction_review whose _id appears in the review_ids array of the attraction collection.
In the example above, the matching review is the one with _id "7e0f1d85-e79e-4bec-9e9c-7dfb03223816".
It is guaranteed that the attraction_review collection contains every id in review_ids for all records in the attraction collection.
The difficulty here is that the review_ids array is nested within the review array, and I am not sure how to go about mapping many instances of ids.
I would be grateful for any suggestions.
Many thanks
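One possible way to think about the nesting, sketched in plain JavaScript on in-memory arrays rather than as a mapReduce job: flatten every review_ids array of every attraction into one set of ids, then keep only the reviews whose _id is in that set. The helper collectReviewIds and the second, unreferenced review are hypothetical, and the documents are trimmed versions of the ones above.

```javascript
// Trimmed stand-ins for the two collections.
const attractions = [
  { _id: "0001333b-e485-4fee-a0e2-9b7dc338d5a2",
    review: [ { review_ids: [
      "66ea1cd8-da34-40dc-8ad6-f30df5de9c2c",
      "7e0f1d85-e79e-4bec-9e9c-7dfb03223816"
    ] } ] }
];
const attractionReviews = [
  { _id: "7e0f1d85-e79e-4bec-9e9c-7dfb03223816", title: "some title" },
  { _id: "hypothetical-unreferenced-id", title: "not referenced by any attraction" }
];

// Flatten every review_ids array of every attraction into one set of ids.
function collectReviewIds(items) {
  const ids = new Set();
  items.forEach(function (a) {
    (a.review || []).forEach(function (r) {
      (r.review_ids || []).forEach(function (id) { ids.add(id); });
    });
  });
  return ids;
}

const wantedIds = collectReviewIds(attractions);
const keptReviews = attractionReviews.filter(function (r) {
  return wantedIds.has(r._id);
});
console.log(keptReviews.map(function (r) { return r._id; }));
```

With real collections, the flattened id list would then go into a single $in query on attraction_review, provided the list is small enough to send in one query.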