In this blog post, we are going to discuss how to copy all files from one storage location to another using the Copy Data activity in Azure Data Factory (ADF) or Synapse Pipelines.

When you search the web, the most commonly suggested pattern is shown below:

Let me briefly describe this pattern:

In this pattern, the Get Metadata activity fetches information about the files within the folder specified by a file path. When you set the Field list argument to ‘Child items’, it returns a list of every item in that folder along with its properties. The Filter activity then takes this list and keeps only the items that match a criterion (for example, the item type is a file and the name follows a specific pattern). The filtered result is passed to a ForEach activity, and inside the ForEach a Copy Data activity reads each file using the path supplied by the current item and copies it to the intended destination.
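For reference, here is a rough sketch of how the Filter activity in this pattern is typically wired up in the underlying pipeline JSON. The activity names and the ‘.csv’ condition are only illustrative:

```json
{
    "name": "Filter1",
    "type": "Filter",
    "dependsOn": [
        { "activity": "Get Metadata1", "dependencyConditions": [ "Succeeded" ] }
    ],
    "typeProperties": {
        "items": {
            "value": "@activity('Get Metadata1').output.childItems",
            "type": "Expression"
        },
        "condition": {
            "value": "@and(equals(item().type, 'File'), endswith(item().name, '.csv'))",
            "type": "Expression"
        }
    }
}
```

The ForEach activity then iterates over @activity('Filter1').output.value, and the Copy Data activity inside it builds its source path from @item().name.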

This pattern works perfectly and is ideal for copying files within a single folder. However, it falls short when you also need to include files within subfolders, because the child items returned by Get Metadata are not listed recursively.

Let’s take a look at how simple it is to achieve this using just the Copy Data activity.

In this example, I’m going to use a Synapse Pipeline, but the steps are exactly the same with Azure Data Factory (ADF).

1. First, let’s add a new pipeline. You can name it whatever you prefer; in this example, I will simply name it ‘CopyFiles’.

2. Next, let’s search for the Copy Data activity in the Activities pane. Once found, drag and drop it onto the white canvas.

3. Next, under the General tab, you can change the activity name to something else, but for now, let’s keep the default name as it is. Then, go to the Source tab and click ‘New’ to create a new dataset.

4. Search for Azure Data Lake Storage Gen2. Select it, then choose Binary. Let’s name the dataset ‘SourceDLakeBinaryFiles’.

5. Under Linked Service, select New. In this example, we will use Azure Data Lake Storage Gen2 as the linked service type and the account key as the authentication type. Let’s name the linked service ‘LinkedServiceADLSGen2’, and for the account selection method, choose ‘Enter manually’.

  • In the storage account key field, enter the storage account key. You can find this in the Access keys section of your storage account, where you can copy the key from either key1 or key2. Once set, click Create.
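Behind the wizard, the service stores this as a JSON linked service definition. A minimal sketch of what ‘LinkedServiceADLSGen2’ ends up looking like (the storage account URL and the key placeholder are illustrative; in practice you would usually keep the key in Azure Key Vault rather than inline):

```json
{
    "name": "LinkedServiceADLSGen2",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://<your-storage-account>.dfs.core.windows.net",
            "accountKey": {
                "type": "SecureString",
                "value": "<storage account key from key1 or key2>"
            }
        }
    }
}
```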

 

In the file path property, add dynamic content that references two dataset parameters, one for the directory and one for the file name (named FilePath and FileName in the steps that follow).

This creates parameters you can use to pass in the directory and the names of the files you want to copy. Click OK.
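If you open the dataset’s JSON (via the code view), the parameterized file path looks roughly like the sketch below. It assumes the two parameters are named FilePath and FileName, as used in the following steps, and that the container is included in the FilePath value; depending on how you fill in the wizard, the container may instead sit in the dataset’s file system box:

```json
{
    "name": "SourceDLakeBinaryFiles",
    "properties": {
        "linkedServiceName": {
            "referenceName": "LinkedServiceADLSGen2",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "FilePath": { "type": "string" },
            "FileName": { "type": "string" }
        },
        "type": "Binary",
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "folderPath": { "value": "@dataset().FilePath", "type": "Expression" },
                "fileName": { "value": "@dataset().FileName", "type": "Expression" }
            }
        }
    }
}
```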

In this example, I am copying all the files from the source directory testfiles/esd_property (with testfiles being the container) to the destination directory testfiles/testdestination.

6. Go to the Source tab. In the FilePath parameter, I’ve entered ‘testfiles/esd_property’. Since I’m copying all files, the file name needs to be left blank and replaced by a wildcard, which we will set in the next property. However, we cannot simply leave the FileName parameter empty because that returns an error, so I added the dynamic expression @trim(' ') to force an empty value. In the File path type, select ‘Wildcard file path’, and in the second text box (the wildcard file name), enter an asterisk (*) as a dynamic expression to indicate that you want to copy all files in the directory. Finally, set the ‘Recursively’ property to true so that files in the subfolders are picked up as well.
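In the Copy Data activity’s JSON, these source settings correspond roughly to the following sketch (the activity name is the default one, and the dataset parameter values are the ones entered above):

```json
{
    "name": "Copy data1",
    "type": "Copy",
    "typeProperties": {
        "source": {
            "type": "BinarySource",
            "storeSettings": {
                "type": "AzureBlobFSReadSettings",
                "recursive": true,
                "wildcardFileName": "*"
            }
        }
    },
    "inputs": [
        {
            "referenceName": "SourceDLakeBinaryFiles",
            "type": "DatasetReference",
            "parameters": {
                "FilePath": "testfiles/esd_property",
                "FileName": { "value": "@trim(' ')", "type": "Expression" }
            }
        }
    ]
}
```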

7. Go to the Sink tab. In the FilePath parameter, I’ve entered ‘testfiles/testdestination’. As in the previous step, since we are copying all files as they are from the source, the FileName does not need to be supplied (unless you need to combine data from multiple CSV files into one file, in which case you would specify a file name). So we again add the expression @trim(' ') to set the value to empty. Then, in the Copy behavior setting, select ‘Preserve hierarchy’ to keep the same folder structure as in the source directory.
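The corresponding sink side of the same activity would look roughly like this; the sink dataset name ‘SinkDLakeBinaryFiles’ is hypothetical, since it is simply a second Binary dataset created the same way as the source one:

```json
{
    "typeProperties": {
        "sink": {
            "type": "BinarySink",
            "storeSettings": {
                "type": "AzureBlobFSWriteSettings",
                "copyBehavior": "PreserveHierarchy"
            }
        }
    },
    "outputs": [
        {
            "referenceName": "SinkDLakeBinaryFiles",
            "type": "DatasetReference",
            "parameters": {
                "FilePath": "testfiles/testdestination",
                "FileName": { "value": "@trim(' ')", "type": "Expression" }
            }
        }
    ]
}
```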

8. Let’s run the pipeline by clicking the Debug button. Once the run completes successfully, inspect the destination folder. You should see that all the files, including those in subfolders, have been copied.

Source

Destination

You can further extend this solution by placing the Copy Data activity within a ForEach loop container, allowing you to loop through:

  • Specific sets of folders and files. You can specify the folder name and specific file names using wildcard search in the FilePath and FileName parameters, respectively, and copy the matching files to a destination folder.
  • A series of folders. You can copy the content of each folder to a folder of the same name in another location. This can be achieved by passing the folder name into the FilePath parameter as a dynamic value and using the same file name settings described in step 6 above.

One thing to note is that the ForEach activity has a setting for running the activities within it in parallel. If you need to copy multiple folders simultaneously to finish the job faster, you may want to turn this setting on. The default batch count is 20, and you can increase it to a maximum of 50. A minimal sketch of such a ForEach wrapper is shown below.
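As a rough illustration of the second option, a ForEach wrapping the Copy Data activity might look like the sketch below. The hard-coded folder list and the trimmed-down inner Copy activity are illustrative only; in a real pipeline the folder list would typically come from a pipeline parameter or a Get Metadata lookup:

```json
{
    "name": "ForEachFolder",
    "type": "ForEach",
    "typeProperties": {
        "items": {
            "value": "@createArray('testfiles/esd_property', 'testfiles/another_folder')",
            "type": "Expression"
        },
        "isSequential": false,
        "batchCount": 20,
        "activities": [
            {
                "name": "Copy data1",
                "type": "Copy",
                "inputs": [
                    {
                        "referenceName": "SourceDLakeBinaryFiles",
                        "type": "DatasetReference",
                        "parameters": {
                            "FilePath": { "value": "@item()", "type": "Expression" },
                            "FileName": { "value": "@trim(' ')", "type": "Expression" }
                        }
                    }
                ]
            }
        ]
    }
}
```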

We hope you find this article useful and that it helps with the task or problem you are currently working on. Please feel free to share it with your colleagues or mention us in your replies to any forum threads. If you have further questions or a specific issue that you think we can help with, don’t hesitate to send a message via our Contact Us page.

Thanks for reading!!!
