Azure Data Factory using self-hosted integration runtime

In my first attempt to create an Azure Data Factory job, I used compute resources in Azure that are managed by Data Factory itself. Alternatively, you can use a self-hosted integration runtime to make use of compute resources elsewhere – such as your own on-premises environment.

In this example, I install a self-hosted integration runtime within my own on-premises environment (which is just my laptop!), and create a Data Factory job to extract content from a table hosted on a SQL Server instance running on that same laptop.

Creating the self-hosted integration runtime

To do this, open Azure Data Factory Studio, select the “Manage” toolbox icon from the left-hand menu, then “Integration runtimes” from the “Connections” list, and then “+ New”.

The “AutoResolveIntegrationRuntime” that you can see present already is the Data Factory-managed Azure resource that I used for the previous copy data pipeline.

Choose “Azure, Self-Hosted”

Of the options that then appear, I am going to select “Self-Hosted”, to install the integration runtime on my laptop.

Copy one of the authentication keys that appear and store it somewhere you can access later. Then click “Download and install integration runtime”.

You’ll be taken to a page on the Microsoft Download Center, where you can choose from a few recent versions of the integration runtime installer.

Once the .msi file has downloaded to your local machine, open it and run through the installation wizard.

Paste the authentication key that you saved earlier into the registration page shown below:

The “Integration Runtime (Self-hosted) node name” that appears below should be the name of the local machine that you installed the integration runtime on (my company laptop is named “DATAMINISTER”)

Click “Finish” to complete the registration.

You now have a new integration runtime (IR) to choose within Data Factory Studio:

Test out the self-hosted IR with a new pipeline

Check that the new IR can be used successfully in a Data Factory pipeline by opening up Azure Data Factory Studio, and selecting “New pipeline” from “Factory Resources”.

Create a “Copy data” activity…

Within that copy data activity, create a new source dataset connection, with a type of “SQL Server”. This will be configured to connect to a SQL Server instance on the same machine that the self-hosted IR was installed on.

Fill in the details of your on-premises SQL Server instance, and the database within that instance that you want to extract data from.

In the “Connect via integration runtime” dropdown, you can select the self-hosted IR that you’ve just created.

You’ll need to provide an authentication method too…

In the above, I’m using a SQL account to connect to the source database. Whether the connection succeeds depends on the SQL account having adequate access to the database.

Checking the account’s credentials within SQL Server Management Studio (SSMS) shows that the “DataFactory_Reader” user has read access within the “AdventureWorksDW2017” database.
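A check along these lines can be run in SSMS by impersonating the Data Factory user — a minimal sketch, using the user and database names from this walkthrough (the table name is an assumption):

```sql
-- Run against AdventureWorksDW2017 in SSMS.
-- Impersonate the Data Factory user and check it can SELECT
-- from a source table (dbo.DimEmployee is illustrative).
USE AdventureWorksDW2017;
EXECUTE AS USER = 'DataFactory_Reader';
SELECT HAS_PERMS_BY_NAME('dbo.DimEmployee', 'OBJECT', 'SELECT') AS CanSelect; -- 1 = has read access
REVERT;
```

If the account turns out to be missing access, running `ALTER ROLE db_datareader ADD MEMBER DataFactory_Reader;` within the source database is one way to grant database-wide read access.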

Back in Azure Data Factory Studio, if the Linked Service you’ve just set up is working OK you should be able to see individual tables under the “Table name” dropdown, within the “Connection” section of the Linked Service.

Back on the “Source” tab of the “Copy data” task for the pipeline, you can enter your own SQL query to just extract a few named columns from the table you’ve specified in the connection above.
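As a sketch, a query of that shape might look like this (the table and column names here are assumptions, not the exact ones used in the post):

```sql
-- Extract just three named columns from a source table
-- (dbo.DimEmployee and these columns are illustrative).
SELECT EmployeeKey, FirstName, LastName
FROM dbo.DimEmployee;
```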

You can even “Preview data” to check this is working:

With the connection to the source database created, you now need a destination database to transfer data to. I am going to use a new table within the same “DataMinister_DB” that I created in this previous blog.

To create the new table, I opened a Query Editor session against the database within the Azure portal, and created a simple table (that has a schema matching the data I’ve selected with my query, which extracts three columns from the source database).
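A minimal sketch of such a table — the table and column names are assumptions, chosen to match a three-column extract:

```sql
-- Destination table in DataMinister_DB; the schema must match the
-- three columns extracted by the source query (names assumed).
CREATE TABLE dbo.Employee_Import (
    EmployeeKey INT,
    FirstName   NVARCHAR(50),
    LastName    NVARCHAR(50)
);
```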

Within the “Sink” tab, I specify a connection to this new table:

Now that I’ve configured the source and destination databases/tables for the “Copy data” activity, I can run “Debug” within Azure Data Factory Studio to check it works OK…

…and it does! Note that the integration runtime used was the self-hosted IR that I created at the start of this article.

120 rows of data have been copied from the on-premises database hosted on my laptop to the Azure SQL Database that I created in the cloud.

It’s worth noting that this process succeeded the first time for me because I’d already added several IP addresses that Data Factory uses to the “approved” IPs list for the destination database (while I was running through the steps in this post). If you haven’t followed the steps in that post first, you may get a few connection errors when trying to connect to the Azure database. The post shows how you can overcome these errors by “approving” the IP addresses.

A basic Azure Data Factory pipeline: Copying data from a csv to an Azure SQL database

This is my first attempt at creating an Azure Data Factory pipeline. It’s not a difficult process, but there are several steps to it. So that I don’t forget them all for next time, and in case this is helpful for anyone else, here are my notes.

Setting up the initial environment

If you don’t have one already, you’ll need a source and destination to copy the data between. In my case, these will be a csv file held within Azure Blob Storage, and an Azure SQL Database. Both of these (along with the Data Factory pipeline itself) will need to live within a Resource Group, so I’ll create that first.

Click on “Resource groups” within the Azure portal, and then “+ Create”.

Give the Resource Group a name and region (preferably the same region as your source and destination).

Create a Storage Account, to hold the source csv file.

Within the new Storage Account, create a container under the Blob service to hold your files.

Upload the csv file, which will act as the source data in the Data Factory pipeline. For the sake of this demo, I’m using a very simple one-column dataset of salaries. The file is called “LFData”.

With the file that you want to upload selected, press the “Upload” button.

The freshly-uploaded file should appear in your container.

Creating an Azure SQL Database

In the Azure portal, search for “Azure SQL”, and then select the “Azure SQL” option that appears in the Marketplace.

In the “SQL databases” window, select “Single database” from the dropdown.

Specify your Subscription, to be used for the database deployment.
Select the Resource Group created earlier.
Give the database a name (we use “DataMinister_DB”).
Choose a server to put the database on. If you haven’t got one already, you’ll need to create a new one, as shown further below…

If you don’t have a server set up already, click “Create new” under the “Server” dropdown, and complete the form below. You’ll need to specify a SQL or Azure AD account to use for authentication.

Back in the “Create SQL Database” wizard, go to the “Networking” tab and allow connectivity through a public endpoint. Under “Firewall rules”, add the current client IP address to the approved list.

Click “Review + Create”, and then you’ll be taken to a “Deployment in progress” screen, which hopefully results in a successfully-created new database.

Within that database, select “Query editor” from the menu on the left:

Log in using the SQL account / password you’ve just created:

Write a simple query to create a basic table (which will need to match the schema of the data you’re importing), and click “Run”. The table should now be visible in the object explorer on the left:
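As a sketch, a table matching the one-column salaries csv might look like this (the table and column names, and the salary precision, are assumptions):

```sql
-- Minimal table matching a one-column csv of salaries (names assumed).
CREATE TABLE dbo.LFData (
    Salary DECIMAL(10, 2)
);
```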

Create the Azure Data Factory Pipeline

Dive into the new Resource Group and click “Create a resource”. Then, from the “Integration” category, choose “Data Factory”.

Create a Data Factory instance inside of the Resource Group.

Once the new resource has been created, click on it, and select “Open Azure Data Factory Studio” from the “Get started” section.

Create a new pipeline.

From the “Move & transform” menu, drag “Copy data” over to the pipeline canvas.

Go to “Source” for the “Copy data” activity, and “+ New”.

On the “New dataset” menu that appears, select “Azure Blob Storage”

Hit “Continue”, and then select “DelimitedText” for the file format:

Give the dataset a name (“LFData_Import”), and create a new linked service.

Point the linked service to the Storage Account that you created previously.

Test the connection.

Once the linked service has been created, select “Open” next to the source dataset.

Specify the exact file within the store that you want to import data from.

You have the option to “preview data”, to check things look OK.

To create a destination, follow a very similar method: select the “Sink” tab of the “Copy data” activity, then “+ New”, and choose “Azure SQL Database” from the options that appear.

Specify the connection details to the Azure SQL Database that you created earlier. You’ll need to provide authentication details too. I’m using the same “admin” account for simplicity here, although in a real environment it may be preferable to use a separate account.

Clicking the “Test connection” button will probably result in an error message at first.

The IP address given in the error message will need to be added to the approved list of the server that contains the database you’re connecting to.
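Besides using the portal, a server-level firewall rule can also be added with T-SQL against the master database of the Azure SQL logical server — a sketch, where the rule name and IP addresses are placeholders:

```sql
-- Run in the master database of the Azure SQL logical server.
-- Creates (or updates) a server-level firewall rule; values are placeholders.
EXECUTE sp_set_firewall_rule
    @name = N'AllowDataFactory',
    @start_ip_address = '203.0.113.0',
    @end_ip_address = '203.0.113.255';
```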

If the IP address is dynamic, you may want to approve the whole IP range…

Try the connection again, and hopefully it succeeds now.

Select the table within the database.

Back in the Data Factory pipeline, hit “Debug”. You’ll probably get another error at first, and have to add another IP address to the approved list for the database server.

Press “Debug” again, and you should hopefully get a successful run of the copy activity.

Back in the Query Editor for the database, a SELECT statement should show that data from the csv held on Azure Blob Storage has now been imported into the Azure SQL Database table, using Data Factory.
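A quick check might look like this (the table name is an assumption):

```sql
-- Confirm the imported rows are present (table name assumed).
SELECT COUNT(*) AS ImportedRows FROM dbo.LFData;
SELECT TOP (10) * FROM dbo.LFData;
```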