The Availability Group Failover Automation Framework (or AGFA) is a tool designed to allow for automation of planned availability group failovers.Often times these actions need to occur during off hours so as to minimize the impact of the failover to end users. Normally this would require a resource to be online and manually process the failover commands as part of a defined sequence of events. AGFA eliminates this need by allowing for scheduled failovers to occur, while ensuring that best practices and rules are followed during the failover itself.
Suggestions for enhancements or fixes are welcome and can be sent to this address.
- SQL Server 2012 Enterprise Edition or higher (not yet tested on SQL 2016 Standard Edition with AlwaysOn Basic Availability Groups).
- Powershell 3.0 or higher.
- Sysadmin access on both the current and intended primary replica for the availability group.
- Update / Select rights on the table where the failover tasks are stored.
SQL Table Creation
To create the table required to store availability group failover tasks, open the script file “dbo.AvailabilityGroupFailoverTask” and execute it against a server and database of your choosing. It is not recommended to create the table in a system database such as ‘master’. Instead I recommend creating it in a utility database.
Any user that will be processing failover tasks requires SELECT and UPDATE privileges on this table as a minimum. To facilitate this a role is created as part of the script called “prole_FailoverTaskProcessor”; membership in this role will include all necessary privileges.
Place the two .ps1 files in the package in a location accessible to the calling user.
Scheduling of Execution
AGFA itself does not have any scheduling mechanism; it must be triggered / executed by an external scheduler, or on demand.
I recommend using SQL Agent, utilizing a CMDEXEC job step to execute a script that imports and calls the functions, and having the script return a non-zero exit code to indicate failure. A sample script showing how to import and call the functions (ProcessAvailabilityGroupFailoverTasks.ps1) is provided as part of the package. As a best practice, consider 1) setting up job failure notifications, 2) making use of a job proxy to execute the step. This allows you to have a special user setup that has the required rights on the replica servers, rather than granting the SQL Agent service account those rights.
AGFA is based in large part on Powershell functions, with SQL Server acting as the repository for configuring scheduled failover actions.
AGFA consists of two main Powershell functions. Both have detailed Powershell help documentation for parameters, but are discussed in more granular terms here.
This connects to the specified server and database, retrieves any qualifying availability groups failover tasks, then passes them on to the Move-AvailabilityGroup function for processing. Task records qualify when the current date and time (in UTC time) is between the values of the FailoverWindowStartTime and FailoverWindowEndTime fields on the AvailabilityGroupFailoverTasks table (see below for documentation).
If the failover succeeds, the database record is marked as processed, and a success message is written to the record. If the failover is not processed due to an error, then the database record is marked as processed and the text of the error is written to the record as well. Once all qualifying tasks are completed, the results will be sent through the Powershell pipeline for capture by other functions or in the context of a calling script. In cases where one or more failovers did not process, the output record will indicate this by means of the ExitCode property being a value greater than zero, and an exception will be thrown.
This function performs the actual failover. The following lists the actions (in order) that it takes when processing a failover.
- Connects to the specified (for new primary status) replica and checks that the specified availability group is present.
- Checks if the specified replica is already the primary. If this conditions evaluates as true, then further processing availability group will be skipped; however, no error will be signaled since technically the group is already in the desired state.
- Checks if the specified replica is an asynchronous replica. If it is, AGFA will switch the group into synchronous mode.
- Checks that the state of the replica is synchronized and healthy. If it is not, AGFA will pause for up to 30 seconds to allow the replica to synchronize (this is mainly for the case where the group was previously in asynchronous mode). If the group is not synchronized at the end of the wait period, AGFA will cease processing this group and an error will be reported.
- Failover the availability group to the designated (new) primary replica.
- If the group was previously in an asynchronous state, it will be returned to that state again. This step occurs regardless of whether any previous steps failed. If the availability group was successfully failed over, then the old primary (new secondary) will be set as an asynchronous replica. If the failover did not process, then the designated replica (the one that the group was to be failed to, but was not due to an error) is set back to asynchronous mode. This is to ensure the group is left in the same state as when the failover operation began.
Upon completion, the function writes an object to the output containing the following properties:
- AvailabilityGroupName: The name of the availability group processed.
- OriginalPrimaryName: The name of the original primary server. If the failover did not succeed, then this will still be the primary server.
- NewPrimaryName: The name of the new (current if the failover was successful) primary server.
- ExitCode: A numerical value indicating success (0) or failure (non zero) of the failover operation.
- StatusText: A textual field containing details around the operation, such as whether the group was failed over successfully and if the replica was set back to asynchronous mode.
SQL Database Table
The framework utilizes a single database table to store information around availability group failover tasks. The table has the following columns.
- AvailabilityGroupFailoverTaskID: an identity column for primary key purposes.
- AvailabilityGroupName: the name of the availability group to process.
- PrimaryServerName: the name of the server which the availability group should be failed over to.
- FailoverWindowStartTime: the date and time (in UTC time) that the group begins to qualify for failover.
- FailoverWindowEndTime: the date and time (in UTC time) that the group ceases to qualify for failover.
- StatusText: contains textual details of the results of the failover operation, including an error message if one occurred.
- ProcessedFl: a bit field indicating if the failover has been processed by the AGFA framework.
Below is a sample INSERT statement (currently records must be created manually).
INSERT INTO dbo.AvailabilityGroupFailoverTask
N'MyAvailabilityGroupName', -- AvailabilityGroupName - sysname
N'DB02', -- PrimaryServerName - sysname
'2016-04-01 17:00:00', -- FailoverWindowStartTime - datetime
'2016-04-03 17:30:00', -- FailoverWindowEndTime - datetime
N'', -- StatusText - nvarchar
0 -- ProcessedFl - bit
Roadmap for Future Enhancements
Allow for pre-failover actions
The framework will allow the registration of function calls to be made before a failover is processed. Functions will need to have a standard set of parameters.
Allow for halting on first error
The framework will optionally allow for all processing of failover tasks to be halted on the first error. Currently all failover tasks are processed even if one of them fails (which may be how some want it, however I suspect some may want it to stop on the first error).
Allow for user configurable synchronization timeout
The framework will allow the user to override the current hard-coded timeout of thirty seconds.
Bundle Powershell functions as a module
The Powershell functions will be bundled as a module for ease of import.
Automation of installation
The installation will be fully scripted so no manual steps will be required.
To be notified of updates, please sign up below.