MATLAB Database Toolbox fetch max row limit

I am executing the Database Toolbox fetch function simply as
curs = exec(conn,sqlQuery);
fetched = fetch(curs);
so I should be getting the default max row limit of infinity. However, the function is only returning 90,000 rows (there should be more than 180,000).
Does anyone know why the fetch function would be truncating my results?
Thanks.

You should try limiting the batch size with:
bsize = 80000; % where bsize < the number of rows in the result
setdbprefs('FetchInBatches','yes')          % fetch the result in smaller batches
setdbprefs('FetchBatchSize',num2str(bsize)) % set the size of each batch
curs = exec(conn,sqlQuery);
fetched = fetch(curs);
If it still doesn't work, it is possible that the MATLAB variable cannot hold this amount of data, so you can replace the last line with:
vsize = 85000; % maximum number of rows to fetch
fetched = fetch(curs,vsize);

Related

Solve system of equations with data loaded, loop through group IDs and different observations

I have data for a large number of group IDs, and each group ID has anywhere from 4 to 30 observations. I would like to solve a (linear or nonlinear, depending on approach) system of equations using this data in MATLAB. I want to solve a system of three equations and three unknowns, while also loading in data for the known variables. I need observations 2 through 4 in order to solve this, but I would also like to move on to the next set of 3 observations (if it exists) to see how the solutions change, and record these calculations as well.
What is the best way to accomplish this? I have a standard idea of how to solve the system using fsolve, but what is the best way to loop through group IDs with varying numbers of observations?
Here is some sample code I have written when thinking about this issue:
%%Load Data
Data = readtable('dataset.csv'); % Full Dataset
%Define Variables
%Main Data
groupID = Data{:,1};
Known1 = Data{:,7};
Known2 = Data{:,8};
Known3 = Data{:,9};
%%%%%%Function %%%%%
f = [A,B,C];
% Define the function handle for the system of equations
fun = @(f) [A^2 + B*Known3 - 2*C*Known1 + 1/Known2 - D2;
            A + (B^2)*Known3 - C*Known1 + 1/Known2 - D3;
            A - B*Known3 + C^2*Known1 + 1/Known2 - D4];
% Define the initial guess for the solution
f0 = [0; 0; 0];
% Solve the nonlinear system of equations
f = fsolve(fun, f0)
%%%% Create Loop %%%%%%
% Set the number of observations to load at a time
numObservations = 3;
% Set the initial group ID
groupID = 1;
% Set the maximum number of groups
maxGroups = 100;
% Loop through the groups of data
while groupID <= maxGroups
    % Load the data for the current group
    data = loadData(groupID, numObservations);
    % Update the solution using the new data
    x = fsolve(fun, x);
    % Print the updated solution
    disp(x);
    % Move on to the next group of data
    groupID = groupID + 1;
end
What are the pitfalls with writing the code like this, and how can I improve it?
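As a rough illustration of the looping structure only (sketched here in Python with pandas and SciPy rather than MATLAB; the data, column names, and equations are placeholders, not the actual model from the question), one way to iterate over groups with varying numbers of observations is:
import numpy as np
import pandas as pd
from scipy.optimize import fsolve

# Placeholder data: two groups with different numbers of observations
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "groupID": [1, 1, 1, 1, 2, 2, 2, 2, 2],
    "Known1": rng.random(9),
    "Known2": rng.random(9) + 1.0,
    "Known3": rng.random(9),
})

results = []
for gid, grp in data.groupby("groupID"):
    # step through the group's observations in windows of 3
    for start in range(0, len(grp) - 2, 3):
        window = grp.iloc[start:start + 3]
        k1 = window["Known1"].to_numpy()
        k2 = window["Known2"].to_numpy()
        k3 = window["Known3"].to_numpy()

        def equations(f):
            A, B, C = f
            # placeholder system: one equation per observation in the window
            return [A * k1[i] + B * k3[i] + C / k2[i] - 1.0 for i in range(3)]

        sol = fsolve(equations, x0=[0.0, 0.0, 0.0])
        results.append({"groupID": gid, "start": start,
                        "A": sol[0], "B": sol[1], "C": sol[2]})

print(pd.DataFrame(results))
The point is simply to group by ID first and then slide over each group's observations, recording each solution as you go; the same structure carries over to MATLAB, for example with findgroups or logical indexing to select each group.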

Split a query with too many arguments in SQLAlchemy

In my program, when I have not run a database update for a long time and then try to update my data, my SQLAlchemy script generates a PostgreSQL upsert query with more than 4,000 parameter rows of more than 8 items each.
When the query is executed with databases.Database.execute(query) I end up with this error:
asyncpg.exceptions._base.InterfaceError: the number of query arguments cannot exceed 32767
My idea is to automatically split the query, using the number of arguments as a threshold, execute it in two parts, and merge the results.
Do you have an idea how to resolve that problem?
I ended up writing a check for the number of query arguments per row by taking the length of the first dictionary in my list of argument dictionaries, since all list items have the same number of keys (i.e. arguments):
import math

args_per_row = len(args_dict_list[0])
PSQL_QUERY_ALLOWED_MAX_ARGS = 32767
allowed_args_per_query = int(math.floor(PSQL_QUERY_ALLOWED_MAX_ARGS/args_per_row))
Then I divided the args_dict_list into parts that have the size of allowed args per query:
query_args_sets = [
    args_dict_list[x:x + allowed_args_per_query] for x in range(
        0,
        len(args_dict_list),
        allowed_args_per_query
    )
]
and finally looped over the query_args_sets and generated and executed a separate query for each set:
for arg_set in query_args_sets:
    query = query_builder.build_upsert(values=arg_set)
    await database.execute(query)
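For reference, the same splitting logic can be condensed into a small self-contained helper (a sketch only; the query building and execution from the snippets above are assumed to live elsewhere in your code):
import math

PSQL_QUERY_ALLOWED_MAX_ARGS = 32767

def split_args(args_dict_list):
    """Split a list of parameter dicts into chunks that stay under the
    asyncpg limit of 32767 bound arguments per query."""
    if not args_dict_list:
        return []
    args_per_row = len(args_dict_list[0])
    allowed_args_per_query = math.floor(PSQL_QUERY_ALLOWED_MAX_ARGS / args_per_row)
    return [
        args_dict_list[i:i + allowed_args_per_query]
        for i in range(0, len(args_dict_list), allowed_args_per_query)
    ]

# Example: 4000 rows of 9 parameters each -> 32767 // 9 = 3640 rows per query
rows = [{f"col{k}": k for k in range(9)} for _ in range(4000)]
batches = split_args(rows)
print(len(batches), [len(b) for b in batches])  # 2 batches: 3640 rows and 360 rows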

How to Get SSIS Expression Leftover odd Rows right?

I'm trying to come up with an expression in a Conditional Split to return whatever leftover rows remain.
For example, I have a flat file with 10 rows. If I want to split the file into files of 2 rows, I will have 5 files with 2 rows each. But what if I want to split the file by 3? I will have 3 files with 3 rows each, but where does the leftover 1 row (the modulus remainder) go? I tried the Conditional Split and it was fine, but the leftover rows output did not come out right. Please take a look at my images and tell me if I have issues with my expression in the Conditional Split and For Loop.
I have these variables:
@Counter = 0
@ReCount = the total count that comes from the Row Count transformation
@SplitRow = the given number of rows to split by
@EndCount = @ReCount / @SplitRow
I need the right expression for the leftover (modulus remainder) rows whenever the count does not divide evenly.
Given that you already know the row number for each row, as well as the row count of the entire file and the number of rows in each destination file, the expression that flags any leftover rows from your splitting is any row for which the following returns true:
rownumber > (rowcount - (rowcount % split))
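As a quick sanity check of that expression outside SSIS, here is the same modulus logic in a small Python sketch (using the values from the question):
rowcount = 10   # total rows in the flat file
split = 3       # rows per destination file

# Rows numbered above this threshold are the leftover rows.
threshold = rowcount - (rowcount % split)   # 10 - (10 % 3) = 9

leftover = [row for row in range(1, rowcount + 1) if row > threshold]
print(leftover)  # [10] -> the single leftover row goes to its own output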

Unusual SQL Server query with "select top 1 @arastr = k"

select top 1 @arastr = k
from #m
where datalength(k) = (select max(datalength(k)) from #m)
What does this query do, and what is the point of select top 1 @arastr = k? This query is taken from a stored procedure that has been working for 7-8 years, so there is nothing wrong with the query, but I cannot understand what it does.
(#m is a temp table which is created in the early part of the query.)
The query selects one arbitrary value (since TOP is used without an ORDER BY clause) from the column k in the temporary table #m and assigns it to the variable @arastr (which has presumably been declared earlier). The string selected will be one matching the longest string in the table, measured in bytes by the DATALENGTH function.
This is a quite common (but a little old-fashioned) way to get the value of k into the (previously declared!) variable @arastr for later use.
The DATALENGTH function measures the length of, for example, a VARCHAR in bytes.
With TOP 1 you get only one result row in any case, the one with the "longest" k; its value is in @arastr afterwards.
EDIT: As pointed out by @jpw, this will be arbitrary if there is more than one k with the same (longest) length.
Without knowing what #m looks like and what kind of data is in k, I cannot tell you any more.
It probably makes more sense if it looks like this:
SET @arastr = (SELECT TOP 1 k
               FROM #m
               WHERE DATALENGTH(k) = (SELECT MAX(DATALENGTH(k)) FROM #m))

How many times is the DBRecordReader getting created?

Below is my understanding of how the Hadoop framework processes text files. Please correct me if I am going wrong somewhere.
Each mapper acts on an input split, which contains some records.
For each input split, a record reader is created, which starts reading records from the input split.
If there are n records in an input split, the map method of the mapper is called n times, each time reading a key-value pair via the record reader.
Now coming to the database perspective:
I have a database on a single remote node. I want to fetch some data from a table in this database. I would configure the parameters using DBConfigure and specify the input table using DBInputFormat. Now say my table has 100 records in all, and I execute an SQL query which generates 70 records in the output.
I would like to know:
How are the InputSplits created in the above case (database)?
What does the input split creation depend on: the number of records my SQL query generates, or the total number of records in the table (database)?
How many DBRecordReaders are created in the above case (database)?
How are the InputSplits created in the above case (database)?
// Split the rows into n-number of chunks and adjust the last chunk
// accordingly
for (int i = 0; i < chunks; i++) {
  DBInputSplit split;
  if ((i + 1) == chunks)
    split = new DBInputSplit(i * chunkSize, count);
  else
    split = new DBInputSplit(i * chunkSize, (i * chunkSize) + chunkSize);
  splits.add(split);
}
That is the how; to understand what it depends on, let's take a look at chunkSize:
statement = connection.createStatement();
results = statement.executeQuery(getCountQuery());
results.next();
long count = results.getLong(1);
int chunks = job.getConfiguration().getInt("mapred.map.tasks", 1);
long chunkSize = (count / chunks);
So chunkSize is count (the result of SELECT COUNT(*) FROM tableName) divided by chunks (mapred.map.tasks, or 1 if it is not defined in the configuration).
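To make the split arithmetic concrete, here is a small sketch (in Python, with made-up numbers) of the same computation the Java snippet above performs:
count = 70        # result of SELECT COUNT(*) FROM tableName
chunks = 4        # value of mapred.map.tasks (defaults to 1 if unset)
chunk_size = count // chunks   # 17

splits = []
for i in range(chunks):
    if i + 1 == chunks:
        # the last chunk is adjusted to absorb the remainder
        splits.append((i * chunk_size, count))
    else:
        splits.append((i * chunk_size, i * chunk_size + chunk_size))

print(splits)  # [(0, 17), (17, 34), (34, 51), (51, 70)]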
Then, finally, each input split will have a RecordReader created to handle the type of database you are reading from, for instance MySQLDBRecordReader for a MySQL database.
For more info, check out the source.
It appears @Engineiro explained it well using the actual Hadoop source. Just to answer: the number of DBRecordReaders is equal to the number of map tasks.
To explain further, the Hadoop map-side framework creates an instance of DBRecordReader for each map task, in the case where the child JVM is not reused for further map tasks. In other words, the number of input splits equals the value of mapred.map.tasks in the case of DBInputFormat. So each map task's record reader has the meta information needed to construct the query that fetches its subset of data from the table. Each record reader executes a pagination-style SQL query similar to the one below.
SELECT * FROM (SELECT a.*,ROWNUM dbif_rno FROM ( select * from emp ) a WHERE rownum <= 6 + 7 ) WHERE dbif_rno >= 6
The above SQL is for the second map task, returning the rows between 6 and 13.
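For illustration only, here is a hypothetical helper that builds such a pagination query from a split's offset and length (it mirrors the shape of the query above; it is not the actual DBRecordReader code):
def pagination_query(base_query: str, offset: int, length: int) -> str:
    # Wrap a base query in Oracle-style ROWNUM pagination for one split.
    return (
        "SELECT * FROM (SELECT a.*, ROWNUM dbif_rno FROM ( "
        + base_query
        + f" ) a WHERE ROWNUM <= {offset} + {length}) WHERE dbif_rno >= {offset}"
    )

# Second map task in the example above: offset 6, length 7 -> rows 6 to 13
print(pagination_query("select * from emp", 6, 7))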
To generalize to any type of input format: the number of record readers equals the number of map tasks.
This post covers all of this in more detail: http://blog.cloudera.com/blog/2009/03/database-access-with-hadoop/
