Golang large memory usage when inserting into Postgres

I'm using Go's standard library database/sql to insert an attachment into a bytea column in Postgres:
_, err = tx.Exec(`
    INSERT INTO
        election_news_attachments(news_id, file_name, file_type, data, thumbnail)
    VALUES
        ($1, $2, $3, $4, $5)`,
    newsID,
    attachments[i].FileName,
    attachments[i].FileType,
    attachments[i].Data,
    attachments[i].Thumbnail,
)
The problem is attachments[i].Data, which is a []byte of about 70 MB. As soon as Exec is called, the process starts using a lot of RAM; it looks as if the data gets duplicated around ten times.
I'm checking the RAM usage with runtime.MemStats (https://pkg.go.dev/runtime#MemStats).
[MemStats screenshot: the first row is before the Exec call, the second row is after it.]
I know it's not great to put big files like this into a column, but that is not my decision to make; it is something I have to implement.
So if anyone knows what causes Exec to use this much RAM, and whether there is a way around it (a different library, a different approach), I'd appreciate it.
I was thinking about writing the data to a temporary file and calling pg_read_binary_file inside Postgres. That works great locally, but in production the database and the backend are on different servers, so it's not an option.
Also, do you think this RAM usage is worth opening as an issue against the library, or is passing big files like this simply something that shouldn't be done?
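One workaround worth trying (a sketch, not something verified against this exact schema): if the driver behind database/sql sends parameters in PostgreSQL's text format, as lib/pq does, a 70 MB []byte gets hex-encoded into roughly twice its size before it hits the wire, on top of any copies database/sql itself makes while converting arguments. Using jackc/pgx directly, which can transmit []byte parameters in the binary format, avoids the encoding step. A minimal sketch, assuming pgx v5 and the same table; the connection string and file name are placeholders:

package main

import (
    "context"
    "os"

    "github.com/jackc/pgx/v5"
)

// insertAttachment inserts one attachment row; the table and column names
// mirror the question, the rest of the values are placeholders.
func insertAttachment(ctx context.Context, conn *pgx.Conn, newsID int64, name, fileType string, data, thumb []byte) error {
    // pgx can send []byte parameters in the binary wire format, avoiding the
    // hex text encoding that roughly doubles the payload size.
    _, err := conn.Exec(ctx, `
        INSERT INTO election_news_attachments (news_id, file_name, file_type, data, thumbnail)
        VALUES ($1, $2, $3, $4, $5)`,
        newsID, name, fileType, data, thumb)
    return err
}

func main() {
    ctx := context.Background()
    conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL")) // connection string is an assumption
    if err != nil {
        panic(err)
    }
    defer conn.Close(ctx)

    data, err := os.ReadFile("attachment.bin") // hypothetical 70 MB file
    if err != nil {
        panic(err)
    }
    if err := insertAttachment(ctx, conn, 1, "attachment.bin", "application/octet-stream", data, nil); err != nil {
        panic(err)
    }
}

Comparing MemStats before and after the Exec with both drivers should show whether the duplication comes from the driver's parameter encoding or from somewhere else.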

Related

How to download more than 100MB of data into CSV from a Snowflake database table

Is there a way to download more than 100MB of data from Snowflake into Excel or CSV?
I'm able to download up to 100MB through the UI by clicking the 'download or view results' button.
You'll need to consider using what we call "unload", a.k.a. COPY INTO LOCATION, which is documented here:
https://docs.snowflake.net/manuals/sql-reference/sql/copy-into-location.html
Other options might be to use a different type of client (a Python script or similar; see the sketch at the end of this answer).
I hope this helps... Rich
.....EDITS AS FOLLOWS....
Using the unload (COPY INTO LOCATION) isn't quite as overwhelming as it may appear to be, and if you can use the SnowSQL client (instead of the web UI) you can "grab" the files from what we call an "INTERNAL STAGE" fairly easily. Example as follows:
CREATE TEMPORARY STAGE my_temp_stage;

COPY INTO @my_temp_stage/output_filex
FROM (SELECT * FROM databaseNameHere.SchemaNameHere.tableNameHere)
FILE_FORMAT = (
    TYPE='CSV'
    COMPRESSION=GZIP
    FIELD_DELIMITER=','
    ESCAPE=NONE
    ESCAPE_UNENCLOSED_FIELD=NONE
    DATE_FORMAT='AUTO'
    TIME_FORMAT='AUTO'
    TIMESTAMP_FORMAT='AUTO'
    BINARY_FORMAT='UTF-8'
    FIELD_OPTIONALLY_ENCLOSED_BY='"'
    NULL_IF=''
    EMPTY_FIELD_AS_NULL=FALSE
)
OVERWRITE=TRUE
SINGLE=FALSE
MAX_FILE_SIZE=5368709120
HEADER=TRUE;

ls @my_temp_stage;

GET @my_temp_stage file:///tmp/;
This example:
Creates a temporary stage object in Snowflake, which is discarded when you close your session.
Takes the results of your query and loads them into one or more CSV files in that internal temporary stage, depending on the size of your output. Notice that I didn't create a separate database object called a "FILE FORMAT"; doing so is considered a best practice, but you can run one-off extracts like this without it if you don't mind the command being this long.
Lists the files in the stage, so you can see what was created.
Pulls the files down using GET. In this case it was run on my Mac and the file(s) were placed in /tmp; if you are using Windows you will need to adjust the path a little.
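For the "different type of client" route, here is a minimal sketch in Go (the language of the main question above) that streams a query result straight to a CSV file through the Snowflake database/sql driver. The DSN, table name, and output path are assumptions; the point is that rows are written one at a time, so neither the 100MB UI limit nor memory is a concern:

package main

import (
    "database/sql"
    "encoding/csv"
    "log"
    "os"

    _ "github.com/snowflakedb/gosnowflake" // registers the "snowflake" driver
)

func main() {
    // DSN format user:password@account/database/schema is an assumption for this sketch.
    db, err := sql.Open("snowflake", os.Getenv("SNOWFLAKE_DSN"))
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    rows, err := db.Query(`SELECT * FROM databaseNameHere.SchemaNameHere.tableNameHere`)
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()

    cols, err := rows.Columns()
    if err != nil {
        log.Fatal(err)
    }

    out, err := os.Create("/tmp/output.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer out.Close()

    w := csv.NewWriter(out)
    defer w.Flush()
    w.Write(cols) // header row

    // Scan every column as text and stream row by row instead of
    // materializing the whole result set in memory.
    vals := make([]sql.NullString, len(cols))
    ptrs := make([]any, len(cols))
    for i := range vals {
        ptrs[i] = &vals[i]
    }
    for rows.Next() {
        if err := rows.Scan(ptrs...); err != nil {
            log.Fatal(err)
        }
        record := make([]string, len(cols))
        for i, v := range vals {
            record[i] = v.String
        }
        w.Write(record)
    }
    if err := rows.Err(); err != nil {
        log.Fatal(err)
    }
}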

How to create a table of 5 GB in HBase for YCSB benchmarking?

I want to benchmark an HBase using YCSB. It's my first time using either.
I've gone through some online tutorials, and now I need to create a sample table of size 5 GB. But I don't know how to:
Batch-put a bunch of data into a table
Control the size to be around 5 GB
Could anyone give me some help on that?
I've previously used the HBase performance evaluation tool to load data into HBase. Maybe it can help you:
hbase org.apache.hadoop.hbase.PerformanceEvaluation
Various options are available for this tool. For your case you can set the data size to be 5GB.
This is pretty easy: the default (core) workload uses records of roughly 1 KB each (by default, 10 fields of 100 bytes), so to get 5 GB just use 5,000,000 records.
You can do this by specifying the recordcount parameter on the command line, or by creating your own workload file with this parameter inside.
Here's how you would do it on the command line (using the included workload workloada):
./bin/ycsb load hbase12 -P workloads/workloada -p recordcount=5000000
A custom file would look like this:
recordcount=5000000
operationcount=1000000
workload=com.yahoo.ycsb.workloads.CoreWorkload
readproportion=0.8
updateproportion=0.2
scanproportion=0
insertproportion=0
And then you just run:
./bin/ycsb load hbase12 -P myWorkload
This will insert all the data into your database.

Import XML objects in batches

I'm working on a PowerShell script that deals with a very large dataset. I have found that it runs very well until the available memory is consumed. Because of how large the dataset is, and what the script does, it has two arrays that become very large: the original array is around half a gigabyte, and the final object is easily six or seven gigabytes in memory. My idea is that it should work better if I can release rows as they are finished and run the script in increments.
I am able to split the imported XML using a function I've found and tweaked, but I'm not able to change the data actually contained in the array.
This is the script I'm using to split the array into batches currently: https://gallery.technet.microsoft.com/scriptcenter/Split-an-array-into-parts-4357dcc1
And this is the code used to import and split the results.
# Import object which should have been prepared beforehand by the query
# script. (QueryForCombos.ps1)
$SaveObj = "\\server\share$\me\Global\Scripts\Resultant Sets\LatestQuery.xml"
$result_table_import = Import-Clixml $SaveObj
if ($result_table_import.Count -gt 100000) {
    $result_tables = Split-Array -inArray $result_table_import -size 30000
} else {
    $result_tables = Split-Array -inArray $result_table_import -parts 6
}
And then of course there is the processing script which actually uses the data and converts it as desired.
For large XML files, I don't think you want to read it all into memory, as is required with an XmlDocument or Import-Clixml. You should look at XmlTextReader as one way to process the XML file a bit at a time.
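To illustrate the streaming idea with a runnable example (in Go, the language of the main question above; in PowerShell the equivalent is System.Xml.XmlReader/XmlTextReader), here is a minimal sketch that walks a large XML file one element at a time with encoding/xml. The file name and the <row> element with its fields are assumptions:

package main

import (
    "encoding/xml"
    "fmt"
    "os"
)

// Row is a hypothetical record element; adjust the fields to the real schema.
type Row struct {
    ID   string `xml:"id,attr"`
    Name string `xml:"name"`
}

func main() {
    f, err := os.Open("LatestQuery.xml") // hypothetical file name
    if err != nil {
        panic(err)
    }
    defer f.Close()

    dec := xml.NewDecoder(f)
    count := 0
    for {
        tok, err := dec.Token() // one token at a time; the document is never fully in memory
        if err != nil {
            break // io.EOF when the file is exhausted
        }
        if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "row" {
            var r Row
            if err := dec.DecodeElement(&r, &se); err != nil {
                panic(err)
            }
            count++ // process r here, then let it be garbage collected
        }
    }
    fmt.Println("processed", count, "rows")
}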

replacing flat-file db with proper database with record level editing

I cannot install SQLite on a remote machine, so I have to find a way to store a large amount of data in some kind of database structure.
Example data
key,values...
key,values....
..
There are currently about a million rows in a 20MB flat file, and hourly I have to read through each record and value in the file and update or add a record. Since it is a flat file I have to rewrite the whole file each time.
I am looking at the Storable module, but I think it also writes data sequentially. I want to edit only those records which need to be changed.
Reading and updating of random records is a requirement. Additions can go anywhere (order is not important).
Can anyone suggest something? How will I know whether I can set up a native Berkeley DB file on these systems, which are a mixture of Solaris and Linux?
________________finally__________________
Finally I understood things better (thank you all), and based on your suggestions I used AnyDBM_File. It found NDBM_File (a C library) installed on all the OSes. So far so good.
Just to check how it would play out in the real world, I ran a sample script to add 1 million records (the maximum I think I may ever get in a day; normally it's between 500k and 700k). OMG, it created a 110 GB data file on my disk!!!! And all the records looked like:
a628234 = 0.178532683639599
I mean, my real-world records are longer than that. Compare this to a flat file which holds real-life 700k+ records and is only 15 MB on disk.
I am disappointed with the slowness and bloat of this, so for now I think I will pay the price of writing the whole file each time an edit is required.
Thanks again for all your help.
As they said in the comments you may use SDBM_File module. For example:
#!/usr/bin/perl
use strict;
use warnings;
use v5.14;
use Fcntl;
use SDBM_File;
my $filename = "dbdb";
my %h;
tie %h, 'SDBM_File', $filename, O_RDWR|O_CREAT, 0666
or die "Error: $!\n";
# To run only one time to fill the dbdb file.
# Next time you may delete this line and
# the output will be the same "16,40".
$h{$_} = $_ * 2 . "," . $_ * 5 for 1..100;
say $h{8};
untie %h;
Output: 16,40
It depends on what your program logic needs, but one solution is to partition the database based on keys, so you can deal with many smaller files instead of one big file (a sketch of that idea follows).
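A minimal sketch of that partitioning idea, written in Go purely for illustration (the bucket count and file naming are assumptions): hash each key to one of N small files, so an update only rewrites the single partition that holds the key rather than the whole dataset.

package main

import (
    "fmt"
    "hash/fnv"
)

const numBuckets = 64 // assumption: choose so each partition stays small

// bucketFile maps a record key to the partition file that stores it.
func bucketFile(key string) string {
    h := fnv.New32a()
    h.Write([]byte(key))
    return fmt.Sprintf("data/part-%02d.db", h.Sum32()%numBuckets)
}

func main() {
    // An hourly update now only touches the files whose keys changed.
    fmt.Println(bucketFile("a628234"))
}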

Out of memory error when inserting a 600MB file into SQL Server Express as filestream data

(Please read the update section below; I leave the original question too, for clarity.)
I am inserting many files into a SQL Server DB configured for filestream.
In a loop, I am inserting the files from a folder into a database table.
Everything goes fine until I try to insert a 600 MB file.
As that file is inserted, Task Manager shows a +600 MB jump in memory usage, and then I get the error.
The DB size is < 1 GB and the total size of documents is 8 GB. I am using SQL Server Express R2, and according to the documentation I should only have problems when trying to insert a document greater than 10 GB (the Express limitation) minus the current DB size.
Can anyone tell me why I get this error? It is crucial for me.
UPDATE FOR BOUNTY:
I offered 150 because it is very crucial for me!
This seems to be a limitation of the Delphi memory manager when trying to insert a document bigger than 500 MB (I didn't check the exact threshold; anyway, it is somewhere between 500 and 600 MB). I use SDAC components, in particular a TMSQuery (but I think the same can be done with any TDataSet descendant). To insert the document into a table that has a PK (ID_DOC_FILE) and a varbinary(max) field (DOCUMENT), I do:
procedure UploadBigFile;
var
  sFilePath: String;
begin
  sFilePath := 'D:\Test\VeryBigFile.dat';
  sqlInsertDoc.ParamByName('ID_DOC_FILE').AsInteger := 1;
  sqlInsertDoc.ParamByName('DOCUMENT').LoadFromFile(sFilePath, ftBlob);
  sqlInsertDoc.Execute;
  sqlInsertDoc.Close;
end;
The SDAC team told me this is a limitation of the Delphi memory manager. Now, since SDAC doesn't support FILESTREAM, I cannot do what has been suggested in C# in the first answer. Is the only solution to report this to Embarcadero and ask for a bug fix?
FINAL UPDATE:
Thanks, really, to all of you who answered. For sure, inserting big blobs can be a problem for the Express edition (because of its 1 GB RAM limitation); anyway, I had the error on the Enterprise edition too, and it was a "Delphi" error, not a SQL Server one. So I think the answer I accepted really hits the problem, even if I have no time to verify it now.
SDAC team told me this is a limitation of Delphi memory manager
To me that looked like a simplistic answer, so I investigated. I don't have the SDAC components and I also don't use SQL Server; my favorites are Firebird SQL and the IBX component set. I tried inserting a 600 MB blob into a table using IBX, then tried the same using ADO (covering two connection technologies, both TDataSet descendants). I discovered the truth is somewhere in the middle: it's not really the memory manager, and it's not SDAC's fault (well... they are in a position to do something about it, if many more people attempt inserting 600 MB blobs into databases, but that's irrelevant to this discussion). The "problem" is with the DB code in Delphi.
As it turns out, Delphi insists on using a single Variant to hold whatever type of data one might load into a parameter. That makes sense, since we can load lots of different things into a parameter for an INSERT. The second problem is that Delphi wants to treat that Variant like a VALUE type: it copies it around at least twice, and maybe three times! The first copy is made right when the parameter is loaded from the file. The second copy is made when the parameter is prepared to be sent to the database engine.
Writing this is easy:
var V1, V2:Variant;
V1 := V2;
and it works just fine for an Integer, a Date, or a small String, but when V2 is a 600 MB Variant array that assignment apparently makes a full copy! Now think about the memory space available to a 32-bit application that's not running in "3G" mode. Only 2 GB of address space is available; some of it is reserved, some is used by the executable itself, then there are the libraries, and then some space is reserved for the memory manager. After making the first 600 MB allocation, there just might not be enough free address space left to allocate another 600 MB buffer! Because of this it's safe to blame the memory manager, but then again, why exactly does the DB code need another copy of the 600 MB monster?
One possible fix
Try splitting up the file into smaller, more manageable chunks. Set up the database table to have 3 fields: ID_DOCUMENT, SEQUENCE, DOCUMENT. Also make the primary key on the table to be (ID_DOCUMENT, SEQUENCE). Next try this:
procedure UploadBigFile(id_doc: Integer; sFilePath: String);
var
  FS: TFileStream;
  MS: TMemoryStream;
  AvailableSize, ReadNow: Int64;
  Sequence: Integer;
const
  MaxPerSequence = 10 * 1024 * 1024; // 10 MB
begin
  FS := TFileStream.Create(sFilePath, fmOpenRead);
  try
    AvailableSize := FS.Size;
    Sequence := 0;
    while AvailableSize > 0 do
    begin
      if AvailableSize > MaxPerSequence then
      begin
        ReadNow := MaxPerSequence;
        Dec(AvailableSize, MaxPerSequence);
      end
      else
      begin
        ReadNow := AvailableSize;
        AvailableSize := 0;
      end;
      Inc(Sequence); // Prep sequence; first sequence into DB will be "1"
      MS := TMemoryStream.Create;
      try
        MS.CopyFrom(FS, ReadNow);
        sqlInsertDoc.ParamByName('ID_DOC_FILE').AsInteger := id_doc;
        sqlInsertDoc.ParamByName('SEQUENCE').AsInteger := Sequence;
        sqlInsertDoc.ParamByName('DOCUMENT').LoadFromStream(MS, ftBlob);
        sqlInsertDoc.Execute;
      finally
        MS.Free;
      end;
    end;
  finally
    FS.Free;
  end;
  sqlInsertDoc.Close;
end;
You could loop through the byte stream of the object you are trying to insert and essentially buffer a piece of it at a time into your database until you have your entire object stored.
I would take a look at the Buffer.BlockCopy() method if you're using .NET
Off the top of my head, the method to parse your file could look something like this:
var file = new FileStream(@"c:\file.exe", FileMode.Open, FileAccess.Read);
byte[] fileBytes = new byte[(int)file.Length];
byte[] buffer = new byte[100];
file.Read(fileBytes, 0, fileBytes.Length);
for (int i = 0; i < fileBytes.Length; i += 100)
{
    int count = Math.Min(100, fileBytes.Length - i);
    Buffer.BlockCopy(fileBytes, i, buffer, 0, count);
    // Do database processing on the first `count` bytes of buffer
}
Here is an example that reads a disk file and saves it into a FILESTREAM column. (It assumes that you already have the transaction context and the file path in the variables "txContext" and "filePath".)
'Open the FILESTREAM data file for writing
Dim fs As New SqlFileStream(filePath, txContext, FileAccess.Write)
'Open the source file for reading
Dim localFile As New FileStream("C:\temp\microsoftmouse.jpg",
FileMode.Open,
FileAccess.Read)
'Start transferring data from the source file to FILESTREAM data file
Dim bw As New BinaryWriter(fs)
Const bufferSize As Integer = 4096
Dim buffer As Byte() = New Byte(bufferSize) {}
Dim bytes As Integer = localFile.Read(buffer, 0, bufferSize)
While bytes > 0
bw.Write(buffer, 0, bytes)
bw.Flush()
bytes = localFile.Read(buffer, 0, bufferSize)
End While
'Close the files
bw.Close()
localFile.Close()
fs.Close()
You're probably running into memory fragmentation issues somewhere. Playing around with really large blocks of memory, especially in any situation where they might need to be reallocated, tends to cause out-of-memory errors even when in theory you have enough memory to do the job. If it needs a 600 MB block and it can't find a hole that's 600 MB wide, that's it: out of memory.
While I have never tried it, my inclination for a workaround would be to create a very minimal program that does ONLY the one operation. Keep it absolutely as simple as possible to keep the memory allocation minimal. When faced with a risky operation like this, call the external program to do the job. The program runs, does the one operation, and exits. The point is that the new program is in its own address space.
The only true fix is 64 bit and we don't have that option yet.
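A sketch of that "own address space" idea (in Go for illustration; the helper binary name and its flags are assumptions): the parent process shells out to a minimal uploader, so the 600 MB of allocations live and die in a separate process.

package main

import (
    "log"
    "os/exec"
)

func main() {
    // Run a dedicated, minimal uploader so the large allocations happen in a
    // fresh address space and are fully released when the process exits.
    cmd := exec.Command("./upload_blob", "--file", "D:/Test/VeryBigFile.dat", "--id", "1")
    out, err := cmd.CombinedOutput()
    if err != nil {
        log.Fatalf("upload failed: %v\n%s", err, out)
    }
    log.Printf("upload ok:\n%s", out)
}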
I recently experienced a similar problem while running DBCC CHECKDB on a very large table. I would get this error:
There is insufficient system memory in resource pool 'internal' to run this query.
This was on SQL Server 2008 R2 Express. The interesting thing was that I could control the occurrence of the error by adding or deleting a certain number of rows to the table.
After extensive research and discussions with various SQL Server experts, I came to the conclusion that the problem was a combination of memory pressure and the 1 GB memory limitation of SQL Server Express.
The recommendation given to me was to either:
Acquire a machine with more memory and a licensed edition of SQL Server, or
Partition the table into sizeable chunks that DBCC CHECKDB could handle.
Due to the complicated nature of parsing these files into the FILESTREAM object, I would recommend the filesystem method and simply using SQL Server to store the locations of the files.
"While there are no limitations on the number of databases or users supported, it is limited to using one processor, 1 GB memory and 4 GB database files (10 GB database files from SQL Server Express 2008 R2)." It is not the size of the database files that is the issue but the "1 GB memory" limit. Try splitting the 600 MB+ file before putting it into the stream.
