Thursday, August 12, 2010

Dynamic SSIS Package to Import Excel files into SQL Server Database

Problem/Scenario
Need to import Excel files to a SQL table. Everyday one file is created at specified location in the source system. We need to copy that file from Source System to Local system and then load to SQL Table.

Conditions:
1.  Each file should be loaded only once. Everynight Job should be executed to load data into reporting Data mart.
2.  Source system will maintain all the history files so files at souce should not be deleted.
3.  If Job failed due to some reason (schema changes, server down, connection issues etc.), it should load all the files from last run date to current date in next successul run. For example, job didn't run last one week then whenever job runs next time successfully, it should load current file as well as all the files of last week which were missing.
4.  All the source files will have the same structure (Schema)
5.  Nomenclature - Each file will have name Transaction followed by current date in YYYY-MM-DD format. For example, if a file was created on 01-Aug-2010 then its name should be Transaction2010-08-01.

Solution
I will take advantage of the ForEach Loop Container. I'll create a new SSIS package to solve the above problem.

STEP 1:
Create following tables in your local database (destination database).
CREATE TABLE [TransactionHistory](
   [TransactionID] [int] NOT NULL,
   [ProductID] [int] NOT NULL,
   [ReferenceOrderID] [int] NOT NULL,
   [TransactionDate] [datetime] NOT NULL,
   [TransactionType] [nchar](1) NOT NULL,
   [Quantity] [int] NOT NULL,
   [ActualCost] [money] NOT NULL,
   CONSTRAINT [PK_TransactionHistory_TransactionID]
   PRIMARY KEY CLUSTERED ([TransactionID] ASC)
) ON [PRIMARY]
GO

CREATE TABLE [ExcelFilesLog](
   [FileName] [varchar](100),
   [FileCreatedDate] [datetime],
   [FileLoadDate] [datetime],
   [RecordCount] [int],
) ON [PRIMARY]
GO

STEP 2: Create a new SSIS package and rename it to LoadDynamicExcelFiles.dtsx

STEP 3: Add following package variables:
VariableNameDescriptionExamle
LastRunDateTo store last run date 2010-07-30
CurrentDateTo hold running date 2010-08-10
Directory
Source
To store source
directory path
D:\SSIS\Hari\SourceFiles
Directory
Destination
To store local
directory path
D:\SSIS\Hari\DestinationFiles
CurrentFile
Name
To store current
file name
D:\SSIS\Hari\SourceFiles\
Transaction2010-08-01

Create one OLE DB connection (I will use (local).TestHN connection manager) for local database where you want to load excel files data. Create one Excel Connection Manager for excel files located in D:\SSIS\Hari\DestinationFiles. At least one file should be there to create excel connection manager.

Click on Excel Connection Manager --> go to Properties window --> Select Expression and set ExcelFilePath with package variable User::CurrentFileName as shown below:

For this article, I'll use two directories - one for sources files and one for destination files. Location of these files are given below:
Source Files: D:\SSIS\Hari\SourceFiles
Destination Files: D:\SSIS\Hari\DestinationFiles

I have created few excel files from Production.TransactionHistory table of AdventureWorks2008R2 database as shown below:
I have created files from 2010-07-20 to 2010-08-01. After executing the package first time, I will create files from 2010-08-02 to 2010-08-09 to test the package.

STEP 4:
Drag and drop File System Task and double click to open File System Task Editor. Enter FST - Delete destination directory content in name and  select Delete directory content as Operation. Set IsSourcePathVariable to True and select SourceVariable as User::DirectoryDestination. Finally click on OK and save changes.

Drag and drop Execute SQL Task and double click to open Execute SQL Task Editor. Enter Execute SQL Task - Get LastRunDate in Name, select Single row as Result Set, Conection Type as OLE DB and Connection as (local).TestHN and SQLSourceType as Direct input. Enter below query in SQLStatement:
SELECT ISNULL(MAX([FileCreatedDate]),'2010-01-01') AS LastRunDate
FROM [dbo].[ExcelFilesLog] (NOLOCK)

In Result Set tab, set Result Name 0 to variable User::LastRunDate.
Finally click OK and save changes.


Drag and drop Script Task and double click to open Script Task Editor. Select User::DirectoryDestination, User::DirectorySource, User::LastRunDate in ReadOnlyVariables. Click on Edit Script... and paste below code:

using System.IO; 

public void Main()
{
   try
   {
      string DirectorySource = Dts.Variables["User::DirectorySource"].Value.ToString();
      string DirectoryDestination = Dts.Variables["User::DirectoryDestination"].Value.ToString();
      DateTime LastRunDate = (DateTime)Dts.Variables["User::LastRunDate"].Value;


      string fileName,fileExtension;
      string[] files = Directory.GetFiles(DirectorySource);
      //Copy source files to destination
      foreach (string f in files)
      {
         fileName = Path.GetFileName(f);
         fileExtension = Path.GetExtension(f);
         DateTime CurrentDate = DateTime.Parse(fileName.Substring("Transaction".Length, 10));


         if ((DateTime.Compare(CurrentDate,LastRunDate)>0) && (fileExtension == ".xlsx"))
        {
            //MessageBox.Show(fileName.ToString());
            File.Copy(f, Path.Combine(DirectoryDestination, fileName), true);
         }
      }
      Dts.TaskResult = (int)ScriptResults.Success;
   }
   catch (Exception ex)
   {
      Dts.Log(ex.Message, 0, null);
      Dts.TaskResult = (int)ScriptResults.Failure;
   }
}

Drag and drop Foreach Loop container. Select Foreach File Enumerator as Enumerator, enter D:\SSIS\Hari\DestinationFiles in Folder path and *.xlsx in files textbox. Select Fully qualified as Retrieve file name. In Variable Mappings, Select User::CurrentFileName for Index 0 to store current file name for each iteration.

Now drag and drop Data Flow Task inside Foreach loop container. Use Excel Source reader to read excel files from destination directory. Use Excel Connection Manager as connection manager for excel files. In connection properties, Select @[User::CurrentFileName] as ExcelFilePath. Use Data Conversion, if required. Use Row Count Task to count number of rows in data flow and store it in User::RecordCount variable. Use OLE DB Destination to load data into SQL table.

Drag and drop Execute SQL Task inside Foreach loop container to log information about current file.
Double click on Execute SQL Task to open Execute SQL Task Editor, enter Execute SQL Task - Insert info into Log table as Name, None as Result Set, OLE DB as connection type, (local).TestHN as connection, Direct input as SQLSourceType and below query as SQLStatement.

DECLARE
@FileName varchar(500)
,@FilePath varchar(500)
,@DestinationPath varchar(500)

SET @DestinationPath = ?
SET @FilePath = ?
SET @FileName = REPLACE(@FilePath,@DestinationPath + '\','')

INSERT INTO [ExcelFilesLog]
(
   [FileName]
   ,[FileCreatedDate]
   ,[FileLoadDate]
   ,[RecordCount]
)
SELECT
@FileName [FileName]
,CAST(SUBSTRING(@FileName,12,10) as datetime) [FileCreatedDate]
,GETDATE() [FileLoadDate]
,?

In Parameter mapping,  map User::DirectoryDestination, User::CurrentFileName, and User::RecordCount with parameter 0,1,2 respectively as shown below.
Finally click OK and save changes.

Below is the final layout of our package:

STEP 6:
We are done with the package development. To execute the package, go to SSIS package location, (in this example, it is D:\SSIS\Hari\Sample\SSIS-Sample1), right click on LoadDynamicExcelFiles.dtsx --> Open with --> SQL Server 2008 Integration Services Package Execution Utility. This will open Execute Package Utility. Click on Execute button to run ssis package. Now you can see the progress of package execution in Package Execution Progress window.

Now you can check ExcelFileLog table to cross check the result of package.

Now I'll add few more excel files in source location - from
Transaction2010-08-02 to Transaction2010-08-09 as shown below:
When I execute this package next time, it will load only new files. Yon can check ExcelFileLog for each iteration:

9 comments:

  1. Hey i have tried the process ,on executing the script task "load excel sheet to table " I have been receiving the status as Failure. Please find the code below :
    Public Class ScriptMain


    Enum ScriptResults
    Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success
    Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
    End Enum

    Public Sub Main()
    '
    ' Add your code here
    '
    ' Dts.TaskResult = Dts.Results.Success


    Try
    Dim DirectorySource As String
    Dim DirectoryDestination As String
    'Dim ScriptResults As ScriptTask


    DirectorySource = Dts.Variables("User::DirectorySource").Value.ToString()
    DirectoryDestination = Dts.Variables("User::DirectoryDestination").Value.ToString()
    Dim LastRunDate As DateTime = CType(Dts.Variables("User::LastRunDate").Value, DateTime)


    Dim fileName As String, fileExtension As String
    Dim files() As String = Directory.GetFiles(DirectorySource)
    'Copy source files to destination
    Dim CurrentDate As Date

    Dim f As String
    For Each f In files

    fileName = Path.GetFileName(f)
    fileExtension = Path.GetExtension(f)

    CurrentDate = DateTime.Parse(fileName.Substring("Transaction".Length, 10))


    If (DateTime.Compare(CurrentDate, LastRunDate) > 0) Then
    'MessageBox.Show(fileName.ToString());
    File.Copy(f, Path.Combine(DirectoryDestination, fileName), True)
    End If
    Next

    Dts.TaskResult = CType(ScriptResults.Success, Integer)

    Catch ex As Exception
    Dts.Log(ex.Message, 0, Nothing)
    Dts.TaskResult = CType(ScriptResults.Failure, Integer)
    End Try

    End Sub

    End Class

    ReplyDelete
  2. Thank you for your article. It is really intersting. Unfortunately if you are a NewB there are couple Steps that aren't detailed. I am getting the following error when tryied to folow your steps project step by step. could you help please?

    Error 1 Validation error. FST - Delete destination directory content File System Task: Variable "DirectoryDestination" is used as a source or destination and is empty.
    Error 2 Validation error. Data Flow Task OLE DB Destination [26]: A connection manager has not been assigned to the runtime connection "OleDbConnection" (34).

    ReplyDelete
  3. Hi Ronaldomar,
    Error1: Did you set variable DirectoryDestination with default value? In my exapmle it is "D:\SSIS\Hari\DestinationFiles". However you can configure this value through package configuration.

    Error2: It seems that you haven't properly configured your connection Manger which is used for OLE DB Destination.

    ReplyDelete
  4. it is an old but no doubt it is quite usefull to learn SSIS but unfortunately it is missing many many steps, and without those one could get easily lost and purpose of education dies.

    I wish you could have explained in segments and without skipping some fundamental tasks.

    You efforts are appreciated very highly but efforts without a fruitful results is not good

    ReplyDelete
    Replies
    1. Thanks ZeeAay for the great feedback. Albeit, I would appreciate if you mention "MANY MANY step" which are missing...

      I wrote this article couple of years back by following actual code I wrote for some real project. I tried to include all the steps I could remember but it would be great if you can mention what steps are actually missing.

      Delete
  5. Please provide the all steps will be appreciated.

    ReplyDelete
  6. Awesome article! You are providing us very valid information. This is worth reading. Keep sharing more such articles.
    how Data Science Changing the World Today

    ReplyDelete

Here are few FREE resources you may find helpful.