I was checking out importing and exporting data from Salesforce. Salesforce provides reasonable documentation at their trail-head and it’s worth the time to understand what’s available with the platform. These solutions provide point solutions to move data in and out of Salesforce. If on the other hand your use case requires additional requirements for data sources, cleaning, performance, error recovery, auditing, (i.e. enterprise type features), look on.
As I looked around. I pulled up the Gartner Magic Quadrant for Data Integration tools as well as Google searches for available options. The ones that caught my interest were:
Talend provides data movement & data quality tools. Gartner classifies Telend as a leader. Each has an open source component as well as a paid one. A beginning comparison matrix is here.
Informatica — Offers a free data wizard. Beyond the free offering, there is a PaaS offering as well as the tried and true PowerCenter (on premise) offering. Gartner classifies Informatica as their top executing leader.
Gartner reviewed SSIS from Microsoft, which is available on premise. The upcoming PaaS offering Data Factory v2 (in preview) also seems quite interesting. Gartner classifies Microsoft as a well executing challenger.
Dataloader.io — free version for a limited number of records. You can increase the number of records or move to MuleSoft (it’s underlying technology) for enterprise use cases.
JitterBit — Free for a small number of use cases, and can upgrade to Harmony for additional use cases and performance.
As always, in making decisions between products, I suggest to:
1) Define the use cases.
2) Reduce the list of vendors using price, customer support (and other criteria).
3) Execute a proof of concept using your data with one or two vendors.
4) Run ROI / TCO calculations.
I had a devil of a time logging into Visual Studio 2017. Turns out my credentials were cached in the .IdentityService folder under AppData. Deleting the .IdentityService folder finally solved the issue for me.
I’ve been working on a Power BI file in Power BI Desktop and it’s gotten a fair amount of data in it. Each time I published the file to the Power BI Service, the time to complete the publishing was getting larger and larger. When it got past 20 minutes to publish, I started looking for ways to speed this up.
One of the data sources that Power BI can consume is another .pbix file that already exists in the Power BI Service. So, I separated my reports and data into 2 different .pbix files and referenced data the data model as a source from the reports .pbix file.
Publishing each .pbix file follows the same flow as any publish and the way to find the data model takes 2 steps. The first is to choose Get Data and Power BI Service:
Next you will see a dialog box showing workspaces in your service. Pick the workspace and correct source file and you’re done.
This has reduced my time of publishing from 20 minutes (or more) to seconds.
I wanted to push data from Treasure Data into Azure SQL. The data was row based and roughly about 1 TB in size.
As I examined my options, there were quite a few, however, in the end, it turned out that the easy options were just not enough. I’ll walk through the options I considered and my thoughts on each.
Pushing from TD
- Push from Treasure Data using MS Sql Connector. The good part of this option is that it’s built into Treasure Data and supports things such as only push updates. As I looked into this option, I was hoping it would work as this would make my job a lot easier. However, as I pushed data it turned out that pushing data to Azure SQL continually timed out. So after contacting TD support (who are *very* responsive), I kept reducing the array size to a point where the data transfer was functional, but the data throughput was so poor, moving the amount of data I wanted was a non-starter. On to option 2.
- The second option I tried was also from the TD console. I pushed data from TD into Blob storage at Azure. This went fairly smoothly, but when I wanted to rename the file I was placing in Azure, I was not able to do so. On to option 3.
- The third option I tried was to write a script using the TD command line. This gave me all the flexibility I wanted for naming the file as well as adding parameters so that I could move other tables in the future. Since I was starting a TD job, all the built in monitoring and logging for TD also kicked in. Also, all the compression & security for transport just worked. Nice.
So, that was it pushing the data out of TD. Now to look at pulling the data in Azure SQL. In the architecture diagram, this covers steps 1 and 2.
Import into Azure SQL
- On the Azure side, I wanted to notice whenever a new file showed up in blob storage. This turned out to be a nice fit for Azure Functions. As I looked into Azure Functions, there were two versions. One is in GA now (v1) and the second in preview (v2). As I looked into them, I realized my requirements were pretty basic, so I took a chance on the preview version (v2). Turns out writing an Azure functions in C# is really easy. I wrote 2 functions, one to uncompress the incoming file and the second to load the .csv file into Azure SQL. This item covers steps 3 and 4 of the architecture diagram.
- Turns out there’s a new shiny on the horizon — Azure Data Factory (currently in preview). I suspect I’ll rewrite the Azure functions into Azure Data Factory as it moves from preview into GA. The main reasons in my mind are improved monitoring and alerting.
If you are behind a proxy, you’ll need to tell Node about the proxy. Normally, I would expect setting HTTP_PROXY and HTTPS_PROXY environment variables to do the trick. However, in the case of node, the commands you’re after are:
npm config set proxy <HTTP_PROXY>
npm config set https-proxy <HTTPS_PROXY>
Also, you can put these commands into your .npmrc file.