Why Do Two SQLite Databases with the Same Data Have Different Sizes?

I have some financial data for over 6600 stocks stored in a FoxPro database. I was able to download the database views into a set of 15 files, which I did first as .dbf files and then as .txt files (comma-delimited).
For the set of .dbf files I used the SpatiaLite virtualization extension with Python and SQLite to convert them into SQLite tables, then merged them into an 8-table database (let's call it DBF-derived). So, with c as the cursor:
c.execute("CREATE VIRTUAL TABLE temp_virt USING VirtualDbf({}, UTF-8)".format(file))
c.execute("CREATE TABLE {} AS SELECT * FROM temp_virt;".format(table_name))
For the .txt files, I used Pandas to convert and combine 12 of the 15 files into 5 CSV files, then combined them with the remaining 3 .txt files in Python and SQLite to create an 8-table database (let's call it CSV-derived), using a modified version of this code (from this page):
with open(csvfile, "rb") as f:
    reader = csv.reader(f)
    header = True
    for row in reader:
        if header:
            # gather column names from the first row of the csv
            header = False
            sql = "DROP TABLE IF EXISTS %s" % tablename
            c.execute(sql)
            sql = "CREATE TABLE %s (%s)" % (tablename,
                  ", ".join(["%s text" % column for column in row]))
            c.execute(sql)
            for column in row:
                if column.lower().endswith("_id"):
                    index = "%s__%s" % (tablename, column)
                    sql = "CREATE INDEX %s on %s (%s)" % (index, tablename, column)
                    c.execute(sql)
            insertsql = "INSERT INTO %s VALUES (%s)" % (tablename,
                        ", ".join(["?" for column in row]))
        else:
            # data row: insert it using the parameterized statement built above
            c.execute(insertsql, row)
Now, when I examined both SQLite databases, I found the following:
The DBF-derived database retained its ID column (although it was not designated as a primary key).
The ID column did not survive the download to .txt, so in the CSV-derived db I declared the stock ticker column as the primary key.
The DBF-derived db was not indexed in SQLite.
The CSV-derived db got automatic indexing in SQLite.
Dates retained their date format in the CSV-derived db, whereas they turned into a number of days in the DBF-derived db.
The main data type that came through the virtualization process for the DBF-derived db was REAL, which is also the data type I set when creating the CSV-derived db.
All else was identical, except that the CSV-derived db was 22% smaller than the DBF-derived one, and I am puzzled as to why, considering that it is indexed and holds the same data with the same data types.
The two databases display the same information in the DB Browser program.
Any explanation for the difference in size? Is it because of the 3 .txt files that I did not convert to CSV?

It is hard to understand what you are doing, and particularly why you would ever want to go through CSV in between when you could get the data directly from the other database system. Anyway, it is your choice. The difference is probably due to the fact that VFP DBF character fields carry trailing spaces: a 30-character field holding a single letter still has a length of 30. Your conversion to SQLite might not be trimming those trailing spaces, whereas the values in a CSV file are already saved trimmed.
Probably the easiest and most reliable way would be to create the SQLite tables directly and fill them with data from within a VFP program (using VFP is not a must, of course; it could be done in any language).
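If the trailing-space explanation applies, it is easy to test: trim the text columns of the DBF-derived database in place and then VACUUM it; if the file shrinks to roughly the size of the CSV-derived one, the padding was the cause. A rough sketch (the database path is a placeholder, and columns are picked by their declared TEXT type):

import sqlite3

conn = sqlite3.connect("dbf_derived.db")    # placeholder path
c = conn.cursor()

tables = [r[0] for r in c.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

for table in tables:
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
    cols = c.execute("PRAGMA table_info({})".format(table)).fetchall()
    for cid, name, ctype, notnull, dflt, pk in cols:
        if ctype.upper() == "TEXT":
            # strip trailing spaces inherited from fixed-width DBF character fields
            c.execute('UPDATE "{}" SET "{}" = rtrim("{}")'.format(table, name, name))

conn.commit()
c.execute("VACUUM")    # rewrite the file so the freed space is actually released
conn.close()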

Related

Copy data files from internal stage table to Logical tables

I am dealing with JSON and CSV files moving from a Unix/S3 bucket to internal/external stages respectively,
and I have no issue copying JSON files from the internal/external stages to a static or logical table, where I store them as JsonFileName and JsonFileContent.
Copying to the static table works for JSON (parse_json($1) does the trick):
COPY INTO LogicalTable (FILE_NM, JSON_CONTENT)
from (
select METADATA$FILENAME AS FILE_NM, parse_json($1) AS JSON_CONTENT
from #$TSJsonExtStgName
)
file_format = (type='JSON' strip_outer_array = true);
I am looking for something similar for CSV: copy the CSV file name and the CSV file content from the internal/external stage to a static or logical table. I mainly want this to separate the file copy from the file load, since a load may fail due to a column-count mismatch, newline characters, or bad data in one of the records.
If any one of the options below gets clarified, that is fine; please suggest:
1) Trying to copy to Static table (METADATA$?????? not working for CSV)
select METADATA$FILENAME AS FILE_NM, METADATA$?????? AS CSV_CONTENT
from #INT_REF_CSV_UNIX_STG
2) Trying for dynamic columns (T.* not working for CSV)
SELECT METADATA$FILENAME,$1, $2, $3, T.*
FROM #INT_REF_CSV_UNIX_STG(FILE_FORMAT => CSV_STG_FILE_FORMAT)T
Regardless of whether the file is CSV or JSON, you need to make sure that your SELECT matches the layout of the target table. I assume that for your JSON, the target table has 2 columns: the filename and a VARIANT column for the JSON contents. For CSV, you need to do the same thing: list $1, $2, etc. for each column that you want from the file, matching your target table.
I have no idea what you are referencing with METADATA$??????, btw.
---ADDED
Based on your comment below, you have 2 options, which aren't native to a COPY INTO statement:
1) Create a Stored Procedure that looks at the table DDL and generates a COPY INTO statement that has the static columns defined, and then executes the COPY INTO from within the SP.
2) Leverage an External Table. By defining an External Table with the METADATA$FILENAME and the rest of the columns, the External Table will return the CSV contents to you as JSON. From there, you can treat it in the same way you are treating your JSON tables.
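For what it's worth, option 1 can also be scripted from outside Snowflake; a rough Python sketch with the Snowflake connector (the connection parameters, target table, stage, and file format names are all placeholders, and the first target column is assumed to hold the file name):

import snowflake.connector

# placeholder connection parameters
conn = snowflake.connector.connect(account="...", user="...", password="...",
                                   database="MYDB", schema="PUBLIC")
cur = conn.cursor()

# count the target table's columns from the information schema
cur.execute("SELECT COUNT(*) FROM INFORMATION_SCHEMA.COLUMNS "
            "WHERE TABLE_SCHEMA = 'PUBLIC' AND TABLE_NAME = 'MY_TARGET'")
n_cols = cur.fetchone()[0]

# the stage supplies the remaining n_cols - 1 columns as $1, $2, ...
select_list = ", ".join("${}".format(i) for i in range(1, n_cols))
copy_sql = ("COPY INTO MY_TARGET FROM "
            "(SELECT METADATA$FILENAME, {} FROM @INT_REF_CSV_UNIX_STG) "
            "FILE_FORMAT = (FORMAT_NAME = 'CSV_STG_FILE_FORMAT')".format(select_list))
cur.execute(copy_sql)
conn.close()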

SAP Data Services .csv data file load from Excel with special characters

I am trying to load data from an Excel .csv file into a flat file format to use as a data source in a Data Services job data flow, which then transfers the data to a SQL Server (2012) database table.
I consistently lose 1 in 6 records.
I have tried various parameter values in the file format definition and settled on setting Adaptable file scheme to "Yes", file type "delimited", column delimiter "comma", row delimiter {windows new line}, text delimiter ", language eng (English), and all else as defaults.
I have also set "write errors to file" to "yes", but it just creates an empty error file (I expected the 6,000-odd unloaded rows to be in there).
If we strip out three of the columns containing special characters (visible in Excel) it loads a treat, so I think these characters are the problem.
The thing is, we need the data in those columns, and unfortunately this .csv file is as good a data source as we are likely to get; it is always likely to contain special characters in these three columns, so we need to be able to read it in if possible.
Should I try to specifically strip the columns in the Query source component of the dataflow? Am I missing a data-cleansing trick in the query or file format definition?
OK, so I didn't get the answer I was looking for, but I did get it to work by setting the "Row within Text String" parameter to "Row delimiter".

How can I compare one line in one CSV with all lines in another CSV file?

I have two CSV files:
Identity(no,name,Age) which has 10 rows
Location(Address,no,City) which has 100 rows
I need to extract the rows and match the no column in the Identity CSV against the Location CSV.
Take a single row from the Identity CSV file and compare Identity.no with Location.no across the 100 rows in the Location CSV file.
If they match, combine name and Age from Identity with Address and City from Location.
Note: I need to take the 1st row from Identity and compare it with the 100 rows in the Location CSV file, then take the 2nd row and compare it with the 100 rows, and so on up to the 10 rows in the Identity CSV file.
Then convert the overall results into JSON and move them into SQL Server.
Is it possible in Apache Nifi?
Any help appreciated.
You can do this in NiFi by using the DistributedMapCache feature, which implements a key/value store for lookups. The setup requires a distributed map cache, plus two flows - one to populate the cache with your Address records, and one to look up the address by the no field.
The DistributedMapCache is defined by two controller services, a DistributedMapCacheServer and a DistributedMapCacheClientService. If your data set is small, you can just use "localhost" as the server.
Populating the cache requires reading the Address file, splitting the records, extracting the no key, and putting key/value pairs to the cache. An approximate flow might include GetFile -> SplitText -> ExtractText -> UpdateAttribute -> PutDistributedMapCache.
Looking up your identity records is actually fairly similar to the flow above, in that it requires reading the Identity file, splitting the records, extracting the no key, and then fetching the address record. Processor flow might include GetFile -> SplitText -> ExtractText -> UpdateAttribute -> FetchDistributedMapCache.
You can convert the whole or parts from CSV to JSON with AttributesToJSON, or maybe ExecuteScript.
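Outside of NiFi, the same lookup-and-join logic is small enough to prototype directly; a Python sketch (file names are placeholders) that builds the joined records and dumps them as JSON:

import csv
import json

# build a lookup keyed on "no" from the Location file (the 100-row side)
with open("Location.csv", newline="") as f:
    location_by_no = {row["no"]: row for row in csv.DictReader(f)}

# walk the Identity file (the 10-row side) and join on "no"
joined = []
with open("Identity.csv", newline="") as f:
    for identity in csv.DictReader(f):
        location = location_by_no.get(identity["no"])
        if location is not None:
            joined.append({"name": identity["name"], "Age": identity["Age"],
                           "Address": location["Address"], "City": location["City"]})

print(json.dumps(joined, indent=2))    # this JSON can then be loaded into SQL Server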

Convert a large SQL Server table to XML

I have a large table in SQL Server 2008 with roughly 500000 records and 40 columns. Some columns are strings and contain \n and other symbols. I want to convert this table to an XML file for use in a project. When I use FOR XML to export this table, some errors show up.
For example, when I test:
select testData.*
from testData
FOR XML PATH('sample'), TYPE, ELEMENTS, ROOT('TestData')
only 3500 records are converted to XML and the final element (that is, record 3500) is not complete.
When I test without TYPE:
select testData.*
from testData
FOR XML PATH('sample'), ELEMENTS, ROOT('TestData')
All the records are converted to XML, but some CR/LF characters are added to the output, which breaks the XML file. So a tag like Product gets split into Prod CRLF uct.
I searched for a long time but no page was helpful.
If it's a one-shot job, you can use the Altova XMLSpy software, which is free to try for 30 days. The Altova MissionKit suite contains a lot of tools, like MapForce, which can map a DB to XML.
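If scripting is an option instead of a GUI tool, another route is to bypass FOR XML entirely and stream the table out of SQL Server yourself, escaping the text as you go; a rough Python sketch with pyodbc (connection string, table, and file names are placeholders, and column names are assumed to be valid XML element names):

import pyodbc
from xml.sax.saxutils import escape

# placeholder connection string
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;")
cur = conn.cursor()
cur.execute("SELECT * FROM testData")
columns = [d[0] for d in cur.description]

with open("TestData.xml", "w", encoding="utf-8") as out:
    out.write("<TestData>\n")
    while True:
        rows = cur.fetchmany(1000)          # stream in batches to keep memory flat
        if not rows:
            break
        for row in rows:
            out.write("  <sample>")
            for col, val in zip(columns, row):
                if val is not None:
                    # escape() handles &, <, >; embedded newlines stay inside the element
                    out.write("<{0}>{1}</{0}>".format(col, escape(str(val))))
            out.write("</sample>\n")
    out.write("</TestData>\n")
conn.close()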

Handling data truncation in Talend

I am copying data from an Excel sheet to SQL Server tables.
In some of the sheets I have data bigger than what the table schema in SQL allows,
i.e. a table column has data type nvarchar(50) whereas my Excel sheet has data of more than 50 characters in some of the cells.
Now, while copying, the rows which have such data are not being inserted into the database. Instead, I would like to insert such rows by truncating the extra characters. How do I do this?
You can use Java's substring method with a check on the length of the string, with something like:
row1.foobar.length() > 50 ? row1.foobar.substring(0,50) : row1.foobar
This uses Java's String length method to test to see if it's longer than 50. If it is then it uses the substring method to get the characters between 0 and 50 (so the first 50 characters) and if it's not then it returns the whole string.
If you pop this in a tMap or a tJavaRow then you should be able to limit strings to 50 characters (or whatever you want, with some tweaking).
If you'd prefer to remove any rows not compliant with your database schema then you should define your job's schema to match the database schema and then use a tSchemaComplianceCheck component to filter out the rows that don't match that schema.
