Learning big data for a real case - database

I made a SQLite database (150 GB) to index the Bitcoin blockchain.
table 1 : id, block_height, block_hash : 500,000 rows
table 2 : id, block_height, transaction_hash : 780 million rows
table 3 : id, transaction_hash, address : 480 million rows
On an i7 @ 3 GHz with 16 GB RAM, Windows 10, and a SATA3 SSD, I tried adding an index on table3.address. RAM went to 100% and after 30 hours there was an I/O error and the index was not created. I then tried a SELECT DISTINCT on table3.address; after 86 hours of my SSD and RAM sitting at 100%, I decided to kill the SQLite process.
What can I do? I'm leaning toward a custom solution: plain text files, one file per address, per transaction, and per block. Want all the unique addresses? List the files in the address folder. Want to know what happened in a transaction? Open the file named after its hash (hash.txt).
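Before giving up on SQLite entirely, it may be worth one more attempt at the index with temporary storage and the page cache tuned; a minimal sketch, assuming the schema above (the PRAGMA values are illustrative, not tested against this dataset):
-- keep temporary structures on disk rather than in RAM (an index build this size spills large sort runs)
PRAGMA temp_store = FILE;
-- enlarge the page cache for this connection; a negative value is a size in KiB (~4 GB here)
PRAGMA cache_size = -4000000;
-- build the index on the address column of table 3
CREATE INDEX idx_table3_address ON table3(address);
If the earlier I/O error was the temp area filling up, pointing Windows' TEMP/TMP variables at a drive with plenty of free space before running this may also help.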

Related

Cannot repair specific tables on specific nodes in Cassandra

I'm running 5 nodes in one DC of Cassandra 3.10.
To maintain those nodes, I run the following on every node daily:
nodetool repair -pr
and weekly
nodetool repair -full
This is the only table I have difficulties with:
Table: user_tmp
SSTable count: 4
Space used (live): 366.71 MiB
Space used (total): 366.71 MiB
Space used by snapshots (total): 216.87 MiB
Off heap memory used (total): 5.28 MiB
SSTable Compression Ratio: 0.4690289976332873
Number of keys (estimate): 1968368
Memtable cell count: 2353
Memtable data size: 84.98 KiB
Memtable off heap memory used: 0 bytes
Memtable switch count: 1108
Local read count: 62938927
Local read latency: 0.324 ms
Local write count: 62938945
Local write latency: 0.018 ms
Pending flushes: 0
Percent repaired: 76.94
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 4.51 MiB
Bloom filter off heap memory used: 4.51 MiB
Index summary off heap memory used: 717.62 KiB
Compression metadata off heap memory used: 76.96 KiB
Compacted partition minimum bytes: 51
Compacted partition maximum bytes: 654949
Compacted partition mean bytes: 194
Average live cells per slice (last five minutes): 2.503074492537404
Maximum live cells per slice (last five minutes): 179
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
Dropped Mutations: 19 bytes
Percent repaired never goes above 80% for this table on this node and one other, but on the remaining nodes it is above 85%. RF is 3, and the compaction strategy is SizeTieredCompactionStrategy.
gc_grace_seconds is set to 10 days, and somewhere in that period I get a WriteTimeout on exactly this table; but the consumer that hit the timeout is immediately replaced by another one and everything keeps going as if nothing happened. It's a one-off WriteTimeout.
My question is: do you have a suggestion for a better repair strategy? I'm kind of a noob, so every suggestion is a big win for me, plus any other advice for this table.
Maybe repair -inc instead of repair -pr?
The nodetool repair command in Cassandra 3.10 defaults to running incremental repair. There have been some major issues with incremental repair, and the community currently does not recommend running it. Please see this article for some great insight into repair and the issues with incremental repair: http://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html
I would recommend, as do many others, running:
nodetool repair -full -pr
Please be aware that you need to run repair on every node in your cluster. This means that if you run repair on one node per day, you can have at most 7 nodes (since with the default gc_grace you should aim to finish repair within 7 days). You also have to rely on nothing going wrong during repair, since you would have to restart any failed jobs yourself.
This is why tools like Reaper exist. It solves these issues with ease: it automates repair and makes life simpler. Reaper runs scheduled repairs and provides a web interface to make administration easier. I would highly recommend using Reaper for routine maintenance and nodetool repair for unplanned activities.
Edit: Link http://cassandra-reaper.io/
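As a side note on the gc_grace window mentioned above: it is a per-table setting in CQL, so you can check it and, if repairs need more headroom, widen it. A hedged sketch (the keyspace name my_ks is hypothetical; 864000 seconds is the 10 days from the question):
-- inspect the current value for the table
SELECT table_name, gc_grace_seconds
FROM system_schema.tables
WHERE keyspace_name = 'my_ks' AND table_name = 'user_tmp';
-- widen the window if repairs need more time to finish (10 days shown)
ALTER TABLE my_ks.user_tmp WITH gc_grace_seconds = 864000;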

1. What type of database, if any, for information arrangement? 2. Faster with int instead of string values in a database?

I am collecting data from the internet at regular intervals and putting it into text files.
Columns: Datetime data1 data2 data3
Values: 20170717 1800 text1 text2 int
What kind of database would be best, given that the data is this predictable? I could rearrange the data for a column-based layout if needed. Then there are those saying you should not use databases at all (referring to Google).
Would it be faster and cheaper on disk space if I translated string values into simple integers and kept a separate translation table for when queries are needed?
For instance, writing 1, 2 and 3 instead of hockey stick, football and frisbee, or 1, 2, 3 instead of Newport, Kuala Lumpur, Lumpini Stadium, and when searching doing, in pseudocode, INNER JOIN database AND translation-table WHERE 1 = Newport.
Perhaps there would be trade-offs between saving disk space and increasing the CPU workload when querying the database with INNER JOIN-type queries?
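For what it's worth, the translation-table idea from the question looks roughly like this in SQL; a minimal sketch with hypothetical table and column names:
-- lookup ("translation") table: one row per distinct string value
CREATE TABLE sport (
    sport_id   INTEGER PRIMARY KEY,
    sport_name TEXT NOT NULL UNIQUE
);
-- data table stores the small integer instead of repeating the string
CREATE TABLE measurements (
    dt       TEXT,                               -- e.g. '20170717 1800'
    sport_id INTEGER REFERENCES sport(sport_id),
    value    INTEGER
);
-- query by name: join back through the translation table
SELECT m.dt, m.value
FROM measurements AS m
INNER JOIN sport AS s ON s.sport_id = m.sport_id
WHERE s.sport_name = 'hockey stick';
Whether this saves much in practice depends on the engine; many column-oriented stores already do something similar internally (dictionary encoding).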
The answer is to look further into the subject matter. Data formats I did not know about popped up in the following video, and then in a good text on the matter.
VIDEO: More parallelization file formats matter
TEXT: How to choose data type

Talend Open Studio - Iterate all X rows and not for each row

I use Talend for data extraction and insertion into OTSDB (OpenTSDB). But I need to split my file, and a classic row-by-row iteration takes too much time (40 rows/s, and I have 90 million rows).
Do you know how to send, for example, 50 rows at a time instead of each row individually?
The writing mode can be adjusted for many Talend components.
The tMysqlOutput component, for example, can be configured to do an insert every X rows.
tFileOutputDelimited (e.g. CSV) has a setting for the flush buffer size.
Have a closer look at your component's advanced settings.
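For intuition, "an insert every X rows" on the database side typically amounts to batching rows into one statement (or one transaction) instead of a round trip per row; a hedged sketch with a hypothetical table:
-- one round trip for three rows instead of three single-row inserts
INSERT INTO metrics (ts, metric, value) VALUES
    ('2017-07-17 18:00:00', 'sensor1', 42),
    ('2017-07-17 18:01:00', 'sensor1', 43),
    ('2017-07-17 18:02:00', 'sensor1', 41);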

Finding the location of millions of records from IPAddress

I have a table that has 2.5 million IP addresses mostly in the U.S. I want to use this data to find their time zone. So far I have downloaded the GeoLite city tables from Maxmind and imported them to my server.
http://dev.maxmind.com/geoip/legacy/geolite/
The first MaxMind table (Blocks) has a starting-IP integer column, an ending-IP integer column, and a LocID for each range. The table starts at roughly the integer 16 million and goes to about 1.5 billion. The second table has geographical information corresponding to the LocID in the first table.
In a CTE, I used the code below to convert the IP addresses in my table to integer format. The code seems to output the correct value. I also included the primary key ID column and the regular IP address. The CTE is called CTEIPInteger.
(TRY_CONVERT(bigint, PARSENAME(IpAddress,1)) +
TRY_CONVERT(bigint, PARSENAME(IpAddress,2)) * 256 +
TRY_CONVERT(bigint, PARSENAME(IpAddress,3)) * 65536 +
TRY_CONVERT(bigint, PARSENAME(IpAddress,4)) * 16777216 ) as IPInteger
I then created a non clustered index on both the starting and ending IP integer columns.
I tried using a join as follows.
select IPAddress,IPInteger,LocID
from CTEIPInteger join Blocks
on IPInteger>= StartIpNum and IPInteger<=EndIpNum
The first 1,000 records load pretty quickly, but after that the computer runs forever without outputting anything.
For the Blocks table I have also tried an index on just StartIpNum, and an index on only the LocID.
How should I obtain the time zones? Am I using the right database? If I have to, I might be willing to pay for a geolocation service.
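For context, a pattern that often works better than an open-ended range join in SQL Server is a TOP (1) seek per row via CROSS APPLY against an index on StartIpNum; a hedged sketch reusing the names from the question (the source table name IPTable is hypothetical):
-- assumes an index such as: CREATE INDEX IX_Blocks_Start ON Blocks (StartIpNum) INCLUDE (EndIpNum, LocID);
WITH CTEIPInteger AS (
    SELECT ID,
           IpAddress,
           -- PARSENAME part 1 is the rightmost octet, part 4 the leftmost
           TRY_CONVERT(bigint, PARSENAME(IpAddress, 1)) +
           TRY_CONVERT(bigint, PARSENAME(IpAddress, 2)) * 256 +
           TRY_CONVERT(bigint, PARSENAME(IpAddress, 3)) * 65536 +
           TRY_CONVERT(bigint, PARSENAME(IpAddress, 4)) * 16777216 AS IPInteger
    FROM IPTable                                  -- hypothetical name for the 2.5M-row table
)
SELECT c.IpAddress, c.IPInteger, b.LocID
FROM CTEIPInteger AS c
CROSS APPLY (
    SELECT TOP (1) bl.LocID, bl.EndIpNum
    FROM Blocks AS bl
    WHERE bl.StartIpNum <= c.IPInteger
    ORDER BY bl.StartIpNum DESC                   -- one index seek per row
) AS b
WHERE c.IPInteger <= b.EndIpNum;                  -- keep only addresses that fall inside the block
Because the GeoLite blocks do not overlap, the block with the largest StartIpNum not exceeding the address is the only candidate, so each row becomes a single index seek instead of a range scan; the time zone then comes from joining the location table on LocID.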

SSIS Export all data from one table into multiple files

I have a table called customers which contains around 1,000,000 records. I need to transfer all the records to 8 different flat files that increment the number in the filename, e.g. cust01, cust02, cust03, cust04, etc.
I've been told this can be done using a for loop in SSIS. Can someone please give me a guide to help me accomplish this?
The logic behind this should be something like "count number of rows", "divide by 8", "export that amount of rows to each of the 8 files".
To me, it would be more complex to create a package that loops through, calculates the amount of data, and then queries the top N segments or whatever.
Instead, I'd just create a package with 9 connection managers in total: one for your source database and 8 identical Flat File Connection Managers using the naming pattern FileName1, FileName2, etc. After defining the first FFCM, just copy, paste, and edit the actual file name.
Drag a Data Flow Task onto your Control Flow and wire it up with an OLE DB/ADO/ODBC source. Use a query rather than selecting the table, because you'll need something to partition the data on. I'm assuming your underlying RDBMS supports a ROW_NUMBER() function. Your source query will be:
SELECT
MT.*
, (ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) % 8 AS bucket
FROM
MyTable AS MT;
That query pulls back all of your data and assigns each row a monotonically increasing number from 1 to the row count, to which we then apply the modulo (remainder after division) operator. Taking the generated value modulo 8 guarantees that we only get values from 0 to 7, endpoints inclusive.
You might start to get twitchy about the mix of 0-based and 1-based counting in play here; I know I am.
Connect your source to a Conditional Split. Use the bucket column to segment your data into different streams. I would propose that you map bucket 1 to File 1, bucket 2 to File 2... finally with bucket 0 to file 8. That way, instead of everything being a stair step off, I only have to deal with end point alignment.
Connect each stream to a Flat File Destination and boom goes the dynamite.
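If you want to sanity-check how evenly the modulo spreads the rows before wiring up the split, a quick sketch against the same hypothetical MyTable used above:
-- count how many rows land in each of the 8 buckets (roughly 125,000 each for 1,000,000 rows)
SELECT t.bucket, COUNT(*) AS rows_in_bucket
FROM (
    SELECT (ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) % 8 AS bucket
    FROM MyTable
) AS t
GROUP BY t.bucket
ORDER BY t.bucket;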
You could create a row number with a Script Component (don't worry, it's very easy): http://microsoft-ssis.blogspot.com/2010/01/create-row-id.html
Or you could use a row-number component such as http://microsoft-ssis.blogspot.com/2012/03/custom-ssis-component-rownumber.html or http://www.sqlis.com/post/Row-Number-Transformation.aspx
For dividing it into 8 files you could use the Balanced Data Distributor, or a Conditional Split with a modulo expression on your new row-number column.

Resources