How can I dump the .parquet data that is in Azure DataLakeStorage to a Microsoft SQL Server database using Nifi?

I've been looking for information for a long time and I can't get it. I'm starting to think it can't be done if the .parquet are in Azure DataLake Storage.
I have a folder with subfolders in Azure DataLake Storage. These subfolders contain many .parquet files. I manage to retrieve them using the ListAzureDataLakeStorage + FetchAzureDataLakeStorage combination. Then I try to pass them through a PutDatabaseRecord (which I think is the correct processor for loading them into the DB).
I think I have the PutDatabaseRecord well configured. But when executing it gives me an error: "Failed to process session due to Failed to process StandardFlowFileRecord due to java.lang.NullPointerException: Name is null".
I'm not sure I'm using PutDatabaseRecord correctly. I thought that PutDatabaseRecord read the incoming FlowFiles and interpreted their content as Parquet (it is supposed to use a ParquetReader as its RecordReader), so that it could understand the data as records. But it surprises me that I don't have to indicate how to interpret the .parquet files, or how to map their columns to those of the DB table. Does it not work the way I think, and does the FlowFile content need to arrive already as records?
The truth is that I can't explain myself better either because I don't really understand what is considered a record in Nifi or how a record is related to a reading of a .parquet.
Either I am missing a processor or I am configuring something wrong. But the only thing I find is FetchParquet, which seems to be able to read a .parquet file and put it into the FlowFile as records. However, it can only be used with ListHDFS or ListFile, which do not allow me to fetch data from Azure Data Lake Storage.
After several tests (using the ConvertRecord and QueryRecord processors), I have come to the conclusion that the problem lies in how the ParquetReader reads the content of the incoming FlowFiles, since every processor that needs a ParquetReader gives the same error. I downloaded the content of the FlowFile that enters the processor using the ParquetReader (whichever processor it is), opened it in a .parquet viewer, and verified that the content is fine.
Without knowing what to do, I have attached a screenshot of the specific error. I still don't know what "Name" the error refers to.
[Screenshot: error "Name is null"]
Note: I also posted my problem on Cloudera, perhaps better explained. I leave the link in case someone wants to look at it. (https://community.cloudera.com/t5/Support-Questions/How-can-I-dump-the-parquet-data-that-is-in-Azure/td-p/316020)

In the end, the closest thing to the error I was getting was found here (https://issues.apache.org/jira/browse/NIFI-7817). It seems that it is an error related to the creation of the ParquetReader. This makes sense because it would hit any processor that used a ParquetReader. In addition, the FlowFiles did not even enter the processor that used it.
I was using Nifi version 1.12.1. I have downloaded version 1.13.2 and it no longer gives the Name error. In addition, the FlowFiles now enter the processor. On the download page of the new version (https://nifi.apache.org/download.html) you can access the Release Notes and the Migration Guidance to see what has been fixed with respect to previous versions and which processors you have to be careful with when migrating.
However, even though the data goes into the processor, it still gives me an error, but it is different and I will open it in another post.

Related

Will Jira complain if I set the Resolution date to a date before the creation via direct DB write?

Some colleagues were using an Excel file to keep track of some issues, and they have decided to switch to a better system, asking me to set up a Jira project for them and to import all the tickets. One way or another I have done everything, but the resolution date is now wrong, because it is the date on which I ran the script to import the tickets into Jira. They would like to have the original one, so that they can know when an issue was really fixed. Unfortunately there's no way to change it from Jira's interface, so I have to access the DB directly. The command, for the record, is like:
update jiraissue
set RESOLUTIONDATE = '2015-02-16 14:48:40'
where pkey = 'OV001-1';
Now, low-level writes to a database in general are dangerous, and I am wondering whether there can be any risks. Our test server is not available right now, so I'd have to work directly on the production one. One thing I had seen on our test server is that this seemed to work, except that JQL queries such as
resolved < 2015-03-20
are wrong because they still use the old Resolution date. Clearly, I have to reindex; but I'm wondering whether it is safe. Does Jira perform some consistency checks, such as verifying that a ticket is resolved after it is created? In my case, since I have modified the resolution date but not the creation date, there is a clear inconsistency. Will Jira complain about this? Is there a risk of corrupting the DB? And if I also modify the creation date, do I have to watch out for other things?
We are using Jira 5.2.11.

I have access to the test server again, and I have tried it. I have modified all the RESOLUTIONDATE fields I had to fix, and when I reloaded the page the new date was there. Jira didn't complain about anything. I reindexed the server, so that queries yield correct results, and I saw no issues. Then I even ran the integrity checks (Administration -> System -> Integrity Checker), and no error was found.
Finally I did the same on the production server, and again everything is running fine.
I can therefore conclude that the operation is not dangerous at all, and it can be done safely.
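Since Jira itself did not flag the inconsistency, it may be worth listing the issues whose resolution date now precedes their creation date, if only to know which ones they are. The following is a minimal sketch, not part of the procedure described above: it uses an in-memory SQLite table as a stand-in for jiraissue, assumes the table has a CREATED column next to RESOLUTIONDATE, and all sample rows are made up.

```python
import sqlite3


def resolved_before_created(conn):
    """Return (pkey, created, resolutiondate) rows where resolution precedes creation."""
    cur = conn.execute(
        "SELECT pkey, CREATED, RESOLUTIONDATE FROM jiraissue "
        "WHERE RESOLUTIONDATE IS NOT NULL AND RESOLUTIONDATE < CREATED "
        "ORDER BY pkey"
    )
    return cur.fetchall()


# Stand-in for the real database: an in-memory SQLite table with made-up rows.
# Timestamps are stored as ISO-style strings, which compare correctly as text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jiraissue (pkey TEXT, CREATED TEXT, RESOLUTIONDATE TEXT)")
conn.executemany(
    "INSERT INTO jiraissue VALUES (?, ?, ?)",
    [
        ("OV001-1", "2015-03-25 10:00:00", "2015-02-16 14:48:40"),  # resolved before import
        ("OV001-2", "2015-03-25 10:00:00", "2015-03-26 09:00:00"),  # consistent
        ("OV001-3", "2015-03-25 10:00:00", None),                   # still open
    ],
)

inconsistent = resolved_before_created(conn)
print(inconsistent)
```

Against a real Jira database the same SELECT would be run through the appropriate driver, read-only, and ideally on a test server first.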

What can I do with generated error logs?

I'm currently working on a web application which generates daily error (and non error) logs.
The current system outputs a log per task to a text file, and outputs critical errors as well as "start" and "finish" type messages to an email account.
The current workflow is as follows: scour the email box for errors, then go and find the .txt file to look at the associated errors and find the cause.
There are around 30 txt files split across about 5 servers.
This system was set up before me, but I'm looking for any advice on how to deal with the situation.
I have control of the script forming the error logs so can do pretty much anything - but I'm lost where to start: I'd considered some kind of web facing dashboard tool, maybe output the files to RSS or something?
Are there any external or internal tools I should be using?

Of course you may use the SQL Server Reporting Services or review this comparison table, there are some packages which may support SQL Server but they may be overwhelming for your task.

It's not really clear what your problem is or what you want to do, but if I understand correctly, your biggest problem is that some messages are logged to a log file but others are sent by email. Therefore, there is no single location that has all error messages in it and that makes analysis and troubleshooting difficult.
The best solution would be to use a logging framework that supports multiple logging destinations (file, DB, email) and severities. That would allow you to specify a configuration like "all errors are logged to a text file and critical ones are also sent by email", so you can ensure that you have everything in one place for general analysis but critical errors are also handled with priority.
You didn't mention what programming language you use, but assuming it's .NET-based then log4net and Enterprise Library are two common frameworks and there are many questions about them here on SO. Googling should give you a good idea of the pros and cons for your situation. If you're using a different language then you can look for the equivalent package: log4j (Java), logging (Python) etc.
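As a rough illustration of the "everything to a file, critical errors also to a second destination" configuration, here is a minimal sketch using Python's standard logging module. The logger name, file name, and the ListHandler class are made up for the example; ListHandler merely stands in for an email handler so the sketch runs without a mail server (the standard library's logging.handlers.SMTPHandler could take its place).

```python
import logging
import os
import tempfile


class ListHandler(logging.Handler):
    """Stand-in for an email handler: collects formatted messages in a list."""

    def __init__(self, level=logging.NOTSET):
        super().__init__(level)
        self.messages = []

    def emit(self, record):
        self.messages.append(self.format(record))


def build_logger(log_path, alert_handler):
    """Log INFO and above to a file; pass only CRITICAL to the alert handler."""
    logger = logging.getLogger("nightly_tasks")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False

    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")

    file_handler = logging.FileHandler(log_path)
    file_handler.setLevel(logging.INFO)
    file_handler.setFormatter(fmt)

    alert_handler.setLevel(logging.CRITICAL)
    alert_handler.setFormatter(fmt)

    logger.addHandler(file_handler)
    logger.addHandler(alert_handler)
    return logger


log_path = os.path.join(tempfile.mkdtemp(), "tasks.log")
alerts = ListHandler()
logger = build_logger(log_path, alerts)

logger.info("task started")
logger.error("could not parse input row")
logger.critical("database unreachable")

with open(log_path) as f:
    file_lines = f.read().splitlines()
```

With this setup the file receives all three messages while the alert destination receives only the critical one, so there is a single place to look for the full history.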

App Engine backup never finishes only clue is failure in map reduce worker_callback

Over the last few weeks we have repeatedly failed on doing a complete backup of the data store using the datastore admin tool. We thought the issues had to do with quota errors we were running into so we switched our application from a free to a paid app and we still have problems.
Each time, we attempt to back up to the blobstore and the process never finishes. We see the backup in our Pending Backups list but it never actually completes. We only have a total of 43MB of data right now, so we don't see it as a data transfer problem. Our default Task Queues show two pending tasks: one is a call to /_ah/mapreduce/controller_callback and the other is a call to /_ah/mapreduce/worker_callback.
The worker_callback racks up its retry count, and the only error clue we have is on the Previous Run tab, which shows the last HTTP response code to be 500. There is no error message, nothing shows up in our error logs, it just keeps trying over and over again.
We've been able to narrow the backup problems to a specific entity kind for a particular namespace but we can't figure out why that entity kind is failing whereas the others are not. The major difference is the entity kind has a large number of embedded entities, but if the app engine is able to read / put those entities we can't understand why it seems to be having problems backing it up. The particular namespace that the error occurs in has the largest data stored for that entity kind compared to the other namespaces we have setup.
We think if we can see what error is occurring in the worker_callback we may be able to figure out why the backup is failing, or what is wrong with our data that's preventing the backup. Is there something we need to set up / enable through settings / configuration files to give us more detailed information on the backup? Or is there some other avenue we should explore to figure out how to investigate/fix this problem?
I should mention we are using the Java SDK as well as Objectify V3 to work with the data store. We are also backing up data to the Blobstore.
Thank you.

Well with the app engine team's help we figured what the problem was and we worked around the issue. I want to give details in case anyone else runs into this problem.
From issue 8363 the app engine team indicated that from their logs they could see that the map reduce failed because of the large number of properties that our entity kind had. The specific entity kind that was causing the failure had a large number of variable properties that was generating errors when map reduce tried to write out a schema. They indicated that the solution on their end was to ignore entities that were like this in the backup to make it so the backup worked successfully.
What we did to work around the issue and make the backup work was change how we told objectify to store our data. The large number of properties was being created by our use of the @Embedded annotation on a HashMap() class member field. Since @Embedded breaks classes down into individual components, it was generating a large number of properties. We switched the member field to be @Serialized and then ran a conversion process to make it use the new serialized property. This made the backup / restore work again.
You can read more about the differences between embedded and serialized on objectify's website.
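To make the difference in property counts concrete, here is a small conceptual sketch in Python. It is not Objectify code and does not reproduce Objectify's actual storage format: the embed and serialize functions, the dotted property names, and the sample map are all assumptions made for illustration.

```python
import json


def embed(field_name, value):
    """Flatten a nested mapping into one property per leaf, using dotted names."""
    props = {}
    for key, sub in value.items():
        name = f"{field_name}.{key}"
        if isinstance(sub, dict):
            props.update(embed(name, sub))
        else:
            props[name] = sub
    return props


def serialize(field_name, value):
    """Store the whole mapping as a single opaque property."""
    return {field_name: json.dumps(value, sort_keys=True)}


# Hypothetical entity field: a map whose keys vary from entity to entity.
settings = {"color": "red", "size": 3, "limits": {"min": 0, "max": 10}}

embedded_props = embed("settings", settings)
serialized_props = serialize("settings", settings)

print(sorted(embedded_props))   # one property name per leaf value
print(list(serialized_props))   # a single property name
```

Because every distinct map key becomes its own property in the embedded form, the set of property names for the entity kind grows with the variety of keys across entities, whereas the serialized form always contributes exactly one property.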

snielson, would you mind opening an issue on our Public issue tracker here. Remember to add your Application ID so we can further debug this specific scenario.
Thanks!

Getting data from mdb database file in my Windows program

I have for some time helped a customer to export mdb table data to csv files (and then to further process these csv files). I have used Ubuntu, so mdbtools (mdb viewer) has been available to me. Now the customer wants me to automate the work I do in the form of a Windows program. I have run into two problems:
1. After some hours, I still haven't found a free tool on Windows that can export my table data in a way that I can incorporate in a program/script. Jackcess (jackcess.sourceforge.net) looks promising, but when running the downloaded jar a totally unrelated Nokia Suite program pops up...
2. I have managed to open two of the tables in a python program by using the pyodbc module, but all the other tables fail to open because of "no read permissions". Until now I thought that there were no access restrictions on the database, because mdb viewer on Ubuntu opens all tables without any fuss. There is no other file available to me, just the mdb file. One possibility might be that this is not a permissions problem at all, but a problem with special characters in column names. All the tables that I cannot open have at least one column name with a national character, whereas the two tables I can open do not. I tried to use square brackets in the SQL select called from python, like so:
SQL = 'SELECT [colname] from SomeTable;'
but it makes no difference. I cannot fetch data from the columns that do not contain national characters either (except from the two tables that do work).
If it indeed is a permission problem, any solution must also be possible for my program to perform, there must not be any manual steps.
Edit: The developer of the program that produces the mdb files has confirmed that there are no restrictions on any tables. So, the error message "no read permissions" is misleading. I will instead focus on getting around what I presume is a problem with national characters in column names. I will start with the JSDB approach suggested below. Thanks everyone!
Edit 2: I made a discovery that I feel is important: All tables that I can open using pyodbc have Owner=Admin whereas all tables that I cannot open have no owner at all (empty string it seems, "Owner=").
Edit 3: I gave JDBC a shot. Same error again, as one could expect given the finding in Edit 2. Apparently the problem to solve is the table ownership (although MDB Viewer under Linux doesn't seem to care about that...). Since the creator of the files says he didn't introduce any permission settings, I guess the strange table ownership could be the result of using new programs (like 2010) to read data produced in an old program (like sometime in the 90s), or could have been introduced during some migration of the old program. Any ideas on how to solve it?

You might be able to use VBScript. VBScript is usually used in ASP files for web pages, but can be used stand alone as a Windows program as well.
VBScript is free as it's code you write in Notepad.
Others may come up with better answers for you. Good luck.

Creating the Front End MDE

I created a database for tracking metrics, with some automation tricks (email, .doc and .ppt presentations, etc.), a very large main table, and lots of forms/GUI. This is the first time I have ever worried about an MDE/front end for the thing. So if you would be so kind as to answer a few questions, or offer any advice, it would be greatly appreciated (I would hate for all this work not to be utilized).
What is the first thing I need to do? It is the 2000 version, which must be converted to 03 to create the MDE, but does that get done before I use the database splitter?
Will the number of objects in the database affect the ability to do this? I have something like 80 forms, 70 queries, 20+ macros, 12 tables, etc... but does the number of objects prevent some of this from working well once the front end is there?
When I split the database, can I continue to work/make changes and such on the "back end", and have those changes directly affect the front end?
These may be some basic questions, but I don't know the answer so.....Thanks!

Here is my 2 ¢.
Question 1 - I have never used the database splitter as I feel I have more control doing it manually. If you do it manually you can do it to a version that does not have a database splitter. But if you do use the splitter then--yes--you will have to upgrade to a version that has a splitter before doing it.
To do it manually here are the steps.
1. Back up everything.
2. Create a copy of your file in the same directory. So if you have a MyApp.mdb, create a copy in the same directory with a new name, such as MyAppDATA.mdb.
3. Open the new DATA file (MyAppDATA.mdb) and delete all of the objects EXCEPT the TABLES.
4. Open the App file (MyApp.mdb) and delete all of the tables.
5. Also in MyApp.mdb, go to the File/Get External Data/Link Tables menu to link the tables in MyAppDATA.mdb to MyApp.mdb. Select All and create the links.
That should do it. And if you screw up you made a backup...right?
A couple of tips and gotchas...be sure that you go to Tools/Options and that you are NOT showing System and Hidden tables. You just don't want to delete system tables from MyApp. Another way to do it is do NOT delete tables that start with MSys or USys.
Question 2 - It does not matter how many objects you have. In fact you don't have that many objects anyway.
Question 3 - Yes...you will make backend changes in MyAppData.mdb and when you open MyApp.mdb those changes will auto-magically be there to see and query against etc. (In the query designer you may need to save/close/reopen to see new fields if you made the mod while in the query). The EXCEPTION to that is new tables: you will have to use the File/Get External Data/Link Tables option to create links to new tables.
One thing to remember (and that I hope you already realize) is the one downside of splitting the database: when you deploy the front end file, the path to the data will usually vary from machine to machine, and there is no automatic re-linking of tables in Access. If your target clients have full access you can always use Tools/Database Utilities/Linked Table Manager to refresh the links to the right location. If you can't do that then you will have to do one of the following:
1. Write code that does the automatic re-linking for you. Basically it will check the links...if invalid it will prompt the user for the data location (or look it up in an INI file) and re-link the tables.
2. Always deploy your app to the same location on all machines. If you have commercial visions for your application this won't work...I mention it for academic reasons. It might be doable for a limited deployment where you have a lot of control over file placement on each machine.
3. Put the Data file (MyAppDATA.mdb) onto a network share and link the tables across the network using a drive mapping or UNC (\\myserver\mydata\ApplicationData\MyAppData.mdb). The latter is preferred but both of them run the same risks as number two.
Seth
PS This answer assumes Access 2003.
PPS If you have commercial visions for your application then the table linking has got to be REALLY robust.
PPPS I agree with the commenter that you may want to take the plunge and do SQL if it is in your skill set.

One thing that hasn't been discussed is the issue of whether the compile to MDE could fail. Basically, if your code compiles in your front-end MDB, it will convert to an MDE. But I've noticed that lots of people never compile.
Some hints for keeping your VBA code in good shape:
1. in VBE options, turn off COMPILE ON DEMAND.
2. add the COMPILE button to your standard VBE toolbar and USE IT OFTEN.
3. periodically, backup your MDB and decompile/recompile it.
Also, remember that you must keep the MDB source, as the VBA code is not editable in an MDE and not recoverable by any good method.
EDIT:
Steps for a decompile:
1. backup your MDB.
2. start an instance of Access with the /decompile commandline argument. For instance, I have a shortcut on my desktop that has this as the target:
"C:\Program Files\Microsoft Office\OFFICE11\MSACCESS.EXE" /decompile
3. having opened that instance of Access, open the MDB you want to decompile. You will see nothing happen. DO NOTHING FURTHER IN THIS INSTANCE OF ACCESS -- close this instance of Access (the reason for this is that Michael Kaplan, who knows a thing or two about this, recommended that you never do any work in an Access instance opened with the decompile switch because he said there was no guarantee that the Access application code executed under those circumstances in a way that was fully safe for all kinds of Access work).
4. open the just-decompiled MDB holding down the shift key (you want to be sure that startup routines don't run because that would likely recompile the project before you've finished your cleanup) and compact the MDB (holding down the shift key again).
5. open the code editor and compile the project (DEBUG -> COMPILE [db name] for those who haven't done step #2 in my original compiling instructions at the top of the post before the edit).
6. compact the MDB (doesn't matter if you bypass startup, since it's already fully compiled).
Why so many steps?
Because the purpose of the decompile is to get rid of the compiled p-code in order to start afresh from the canonical VBA code. Following the steps above ensures that you have completely cleared the data pages storing the compiled code before you recompile. The reason for this is that without the compact step after the decompile, under some very rare circumstances, the code can behave strangely. I can't imagine that the old discarded p-code is being used again, but there's something about the pointers between the canonical code and the compiled code that apparently doesn't get completely flushed by a decompile without a compact.

This would be a comment to Seth's answer, but my rep isn't high enough to comment yet.
Seth did a great job answering your questions, I just wanted to add a bit more to part #1 about using the Database Splitter. The Database Splitter in the Tools menu works fine. Doing it manually is alright too, but it's a whole lot faster and easier to use the Database Splitter. I've used it a dozen times and never encountered any issues after using it.
http://www.databasedev.co.uk/split_a_database.html has a decent page about some of the pros and cons of splitting your database.
http://www.accessmvp.com/TWickerath/articles/multiuser.htm also has some good info when dealing with a split database in a multi-user environment.

Seth gave you a very good answer. But I'll add a few comments.
The number of objects only becomes relevant when you get close to about 1000 forms, reports and modules which have code. There's a limit about there. If you do get that message when trying to make an MDE then you almost certainly have a code error and need to compile to find the error.
Another resource is "Splitting your app into a front end and back end Tips"
See the Auto FE Updater downloads page to make the process of distributing new FEs relatively painless. The utility also supports Terminal Server/Citrix quite nicely.