When we connect to an RDBMS like MySQL from Hadoop, we usually read a record from the DB into a user-defined class that implements DBWritable and Writable. If our SQL query produces N records, then the act of reading a record into the user-defined class is done N times. Is there a way to get more records into the mapper at once instead of one record at a time?
If I understand you correctly, you think Hadoop issues N SELECT statements under the hood. That is not the case. As you can see in DBInputFormat's source, it splits the query into chunks of rows, one chunk per input split, based on how many splits Hadoop deems fit.
Obviously, each mapper will have to execute a query to fetch some data for it to process, and it might do so repeatedly, but that's still definitely nowhere near the number of rows in the table.
However, if performance degrades, you might be better off just dumping the data into HDFS / Hive and processing it from there.
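To make that concrete, here is a minimal sketch (my own illustration, not from the original answer) of how DBInputFormat is typically wired up; MyRecord stands for your DBWritable/Writable class, and the driver, URL, credentials and table are placeholders. Hadoop uses the count query to size the input splits, and each mapper then issues one bounded query for its own chunk (e.g. with LIMIT/OFFSET on MySQL), so the number of SELECTs is on the order of the number of splits, not the number of rows.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;

public class DbImportJob {
    public static Job configure() throws IOException {
        Configuration conf = new Configuration();
        // JDBC driver, connection URL and credentials are placeholders
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://dbhost:3306/mydb", "user", "password");

        Job job = Job.getInstance(conf, "db-import");
        job.setInputFormatClass(DBInputFormat.class);

        // MyRecord is the user-defined class implementing DBWritable and Writable.
        // The count query lets Hadoop decide how many rows each split should cover;
        // each mapper then runs the data query restricted to its own row range.
        DBInputFormat.setInput(job, MyRecord.class,
                "SELECT id, name FROM employees",      // data query (placeholder)
                "SELECT COUNT(*) FROM employees");     // count query (placeholder)
        return job;
    }
}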
I currently have a flow using QueryDatabaseTable that reads from a DB and puts the data into HDFS.
I decided to use QueryDatabaseTable because:
- it keeps state, which lets me use it for delta loads
- it allows fine tuning when tables run into the hundreds of millions of records.
My problem is that I now have 100 tables that require the same flow (DB => HDFS), and I do not want to create the same flow 100 times. I have looked into ListDatabaseTables, which would be perfect, but it seems QueryDatabaseTable doesn't accept any incoming connection as input.
Has anyone encountered something similar?
QueryDatabaseTable is meant to do incremental loading of a table and therefore has to maintain state about the table so it knows what to retrieve on the next execution. As a result, it can't accept dynamic table names, because that would require an unbounded amount of state to be kept.
ListDatabaseTables is meant to be used together with GenerateTableFetch and ExecuteSQL to do bulk loading of DB tables.
I'm using this SELECT statement:
SELECT ID, Code, ParentID,...
FROM myTable WITH (NOLOCK)
WHERE ParentID = 0x0
This statement is repeated every 15 minutes (through a Windows service).
The problem is that the database becomes slow for other users while this query is running.
What is the best way to avoid slow performance while the query is running?
Generate an execution plan for your query and inspect it.
Is the ParentId field indexed?
Are there other ways you might optimize the query?
Is it possible to increase the performance of the server that is hosting SQL Server?
Does it need more disk or RAM?
Do you have separate drives (spindles) for operating system, data, transaction logs, temporary databases?
Something else to consider - must you always retrieve the very latest values from this table for your application, or might it be possible to cache prior results and use those for some length of time?
It seems your table has a huge number of records. You could implement page-wise retrieval of the data: first request, say, the TOP 100 rows, then issue further calls to fetch the rest of the data (a sketch of this follows below).
I still don't understand the need to run such a query every 15 minutes. You could consider a stored procedure that performs the majority of the processing and returns only a small subset of the data. That would be a good improvement if it suits your requirements.
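To make the page-wise suggestion concrete, here is a minimal JDBC sketch (my own illustration, not from either answer). It assumes SQL Server 2012 or later for OFFSET/FETCH; the connection string, page size and ORDER BY column are placeholders, and the column names are taken from the question. Paging spreads the load over many short queries instead of one long one, but each page still needs an index covering the WHERE and ORDER BY columns to stay cheap.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class PagedFetch {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:sqlserver://dbhost;databaseName=mydb;user=me;password=secret"; // placeholder
        String sql = "SELECT ID, Code, ParentID FROM myTable "
                   + "WHERE ParentID = 0x0 "
                   + "ORDER BY ID "                                 // paging needs a stable order
                   + "OFFSET ? ROWS FETCH NEXT ? ROWS ONLY";
        int pageSize = 100;                                         // arbitrary page size
        try (Connection conn = DriverManager.getConnection(url);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            for (int offset = 0; ; offset += pageSize) {
                ps.setInt(1, offset);
                ps.setInt(2, pageSize);
                int rows = 0;
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        rows++;
                        // process rs.getString("Code"), etc.
                    }
                }
                if (rows < pageSize) {
                    break;                                          // last page reached
                }
            }
        }
    }
}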
I have two databases: the main database, which many users work on, and a test database, which is loaded from a dump of the main DB.
I have a SELECT query with join conditions and a UNION ALL on a table TAB11 that contains 40 million rows.
The problem is that the query uses the wrong index in the main DB, but the correct index in the test DB. Note that in both databases the table has freshly gathered statistics and the same row count. I started to dig into histograms and data skew, and I noticed that in the main DB the table has histograms on 37 of its columns, whereas in the test DB only 14 columns have histograms. So apparently those histograms are affecting the query plan and causing it to use the wrong index (right?). (Those histograms were created automatically by Oracle, not by anyone.)
My questions:
- Should I remove the histograms from those columns so that, when I gather statistics again, Oracle creates only the needed ones and the query uses the right index? I am afraid this will affect the performance of the table.
- Should I add method_opt => 'FOR ALL COLUMNS SIZE SKEWONLY' when I gather the table statistics? I am not sure whether the data is actually skewed.
- Should I gather index statistics on the desired index so that Oracle might use it?
How can I make the query use the right index, without dropping the other one or forcing the index with a hint?
There are too many possible reasons for the optimizer choosing a different index in one DB vs. another (including object life-cycle differences, e.g. when data gets loaded, deletions/truncations/inserts, stats gathering, index rebuilds, ...). Having said that, in cases like this I usually do a parameter-by-parameter comparison of the initialization parameters on each DB, and also an object-by-object comparison (you've already observed a delta in the histograms; there may be others as well that are impacting this).
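As an illustration of that parameter-by-parameter comparison (my own sketch, not part of the original answer), something like the following reads V$PARAMETER from both instances over JDBC and prints the settings that differ; the connection details are placeholders and the account needs SELECT access on V$PARAMETER.

import java.sql.*;
import java.util.*;

public class CompareParams {
    // Load all initialization parameters from one database into a sorted map.
    static Map<String, String> load(String url, String user, String pass) throws SQLException {
        Map<String, String> params = new TreeMap<>();
        try (Connection c = DriverManager.getConnection(url, user, pass);
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery("SELECT name, value FROM v$parameter")) {
            while (rs.next()) {
                params.put(rs.getString(1), rs.getString(2));
            }
        }
        return params;
    }

    public static void main(String[] args) throws SQLException {
        // Connection strings and credentials are placeholders.
        Map<String, String> main = load("jdbc:oracle:thin:@mainhost:1521/MAINDB", "user", "password");
        Map<String, String> test = load("jdbc:oracle:thin:@testhost:1521/TESTDB", "user", "password");
        for (String name : main.keySet()) {
            if (!Objects.equals(main.get(name), test.get(name))) {
                System.out.printf("%s: main=%s test=%s%n", name, main.get(name), test.get(name));
            }
        }
    }
}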
I understand there are a lot of issues to consider when benchmarking MongoDB against SQL databases, but I am only trying to understand the right query to use in my setup.
For example, the equivalent of SELECT * FROM table in SQL is db.collection.find() in MongoDB.
When calculating the total time taken by each query, must I iterate over the SQL ResultSet and the MongoDB Cursor respectively, or is just executing the query enough? Below are my sample iteration steps.
ResultSet rs = statement.executeQuery(queryString);
long start = System.nanoTime();
while (rs.next()) { }               // drain every row so the whole result set is fetched
long end = System.nanoTime();
long total = end - start;           // elapsed time in nanoseconds
My understanding is that if I don't iterate over the Cursor in MongoDB, won't that just measure the time taken to process the first batch of results (i.e. 20 documents) instead of all the matched documents?
You get the first batch only because the mongo shell does that for you. Creating a cursor doesn't actually fetch anything; it returns immediately. So yes, you'll have to iterate the Mongo cursor, or invoke toArray() to fetch all of the results as an array.
You might also want to consider how normalization and denormalization affect your query, because a denormalized schema benefits you too: with a denormalized schema you don't need a JOIN, which forces the disk to scan different clusters of blocks, nor do you spend time joining two records. However, denormalizing is not always an option. So it only makes sense to pick an actual scenario, build queries against that scenario, and benchmark those queries.
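For the MongoDB side, here is a minimal sketch (my own illustration) of timing a query the same way as the JDBC loop above: the cursor is drained completely so the measurement covers every batch, not just the first one. It assumes the MongoDB Java sync driver (com.mongodb.client); the host, database and collection names are placeholders.

import com.mongodb.client.*;
import org.bson.Document;

public class MongoTiming {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {   // placeholder host
            MongoCollection<Document> collection =
                    client.getDatabase("mydb").getCollection("mycollection");           // placeholders

            long start = System.nanoTime();
            long count = 0;
            try (MongoCursor<Document> cursor = collection.find().iterator()) {
                while (cursor.hasNext()) {      // each hasNext() pulls the next batch when needed
                    cursor.next();
                    count++;
                }
            }
            long elapsed = System.nanoTime() - start;
            System.out.println(count + " documents in " + elapsed / 1_000_000 + " ms");
        }
    }
}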
I have a pretty large file, 10 GB in size, and I need to load its records into the DB.
I want to have two additional columns:
LoadId, which is a constant (this indicates the file's unique load number)
ChunkNumber, which indicates which batch-sized chunk a row belongs to.
So if I have a batch size of 10,000 records, I want
LoadId = {GUID}
ChunkNumber = 1
For the next 10,000 records I want
LoadId = {GUID}
ChunkNumber = 2
Is this possible in SSIS? I suppose I could write a custom component for this, but there should be a built-in ID I could use, since SSIS is already running things in batches of 10,000.
Can someone help me figure out whether this parameter exists and whether it can be used?
OK, a little more detail on the background of what I'm doing and why.
Once we get the data into a slice of 10,000 records, we can start calling the stored procedures to enrich the data in chunks. All I am trying to find out is whether SSIS can help here by adding a chunk number and a GUID.
This helps the stored proc move the data in chunks. Although I could do this after the fact with a row number, the SELECT would have to travel through the whole set again to update the chunk numbers, which is double the effort.
A GUID will represent the complete dataset, and the individual chunks are related to it.
Some more insight: there is a WorkingTable we import this large file into, and if we start enriching all the data at once, the transaction log would be used up. It is more manageable if we can process the data in chunks, so that the transaction log does not blow up and we can also parallelize the enrichment process.
From here the data moves from a de-normalized format to a normalized format. A stored procedure is more maintainable in terms of release and day-to-day management, so any help is appreciated.
Or is there another, better way of dealing with this?
For the LoadID, you could use the SSIS variable
System::ExecutionInstanceGUID
which is generated by SSIS when the package runs.