SD-WAN Automation: Replacing Ticket-Driven Network Changes with Code
Most SD-WAN rollouts stall at manual configuration. This guide covers network-as-code with Ansible and Terraform, branch automation patterns, and how to integrate SD-WAN policy into your CI/CD pipeline.
Introduction
SD-WAN adoption is exploding. Gartner estimates that over 60% of enterprises will have deployed some form of SD-WAN by the end of 2026, up from roughly 30% just three years ago. The promise is compelling: application-aware routing, centralized policy, cheaper bandwidth through broadband and LTE diversity, and a path toward SASE. But there is a gap between the marketing slide deck and the operational reality.
Most SD-WAN deployments are still managed through vendor GUIs and support tickets. A branch office needs a new VLAN? Open a ticket. An application needs a QoS policy change? Open a ticket. A new retail location needs to be onboarded? Open a ticket, wait three days, then open a follow-up ticket when the configuration does not match the template. Network engineers spend days doing what should take minutes. The bottleneck is not the technology — it is the process.
Network-as-code is the answer. The same principles that transformed server provisioning and cloud infrastructure — version control, peer review, automated testing, continuous delivery — apply directly to WAN management. This article walks through the tools, patterns, and practices that make SD-WAN automation work in production, not just in a lab.
Why SD-WAN Deployments Stall
If you have been through a large-scale SD-WAN rollout, you have likely encountered one or more of these failure modes. They are not unique to any single vendor — they are structural problems with how networks are traditionally managed.
GUI-Only Management Does Not Scale Past 20 Branches
Every SD-WAN vendor ships a controller with a web interface. For a proof of concept with five sites, the GUI works fine. At twenty sites, it becomes tedious. At two hundred sites, it becomes a full-time job for multiple engineers. Clicking through forms to configure VLANs, firewall rules, and routing policies one branch at a time does not scale. The GUI was designed for visibility and troubleshooting, not for bulk provisioning.
Configuration Drift Between Branches
When configuration is applied manually, drift is inevitable. Branch A gets the updated QoS policy. Branch B still runs the old one because the engineer was interrupted halfway through the change window. Branch C has a custom override that someone applied six months ago and nobody remembers why. Over time, every branch becomes a unique snowflake. Troubleshooting becomes archaeology.
No Version Control or Rollback Capability
When a change breaks something, the first question is always “what changed?” If configuration lives only in the vendor controller, answering that question requires digging through audit logs — if they exist. There is no diff, no commit history, no blame annotation. Rolling back means manually reconstructing the previous state, often from memory or a spreadsheet. This is not engineering. This is hope-driven operations.
Network-as-Code: The Foundation
Network-as-code means applying software engineering practices to network configuration. The configuration for every device, every policy, and every template lives in a Git repository. Changes go through pull requests with peer review. Automated tests validate syntax, compliance, and connectivity before anything touches production. A CI/CD pipeline handles the deployment.
If you have adopted infrastructure-as-code for your cloud environments — Terraform for AWS, Pulumi for Azure, CloudFormation for GovCloud — the mental model is identical. The network is just another layer of infrastructure that should be declared, versioned, and delivered through a pipeline.
The four pillars of network-as-code:
- Version control — every configuration change is a Git commit with an author, a timestamp, and a message explaining why
- Code review — no change reaches production without at least one peer approval
- Automated testing — linting, schema validation, and topology simulation catch errors before deployment
- Continuous delivery — a pipeline applies changes consistently across staging and production environments
The tooling ecosystem for network automation has matured significantly. Ansible, Terraform, Nornir, and NAPALM each fill a specific niche. For SD-WAN specifically, Ansible and Terraform cover the vast majority of use cases, so that is where we will focus.
Ansible for SD-WAN Configuration
Ansible dominates network automation for good reasons. It is agentless: it connects over SSH or REST APIs, so there is nothing to install on the managed devices. Vendor or community collections exist for Cisco SD-WAN (vManage), Fortinet FortiManager, VMware VeloCloud, Aruba EdgeConnect, and most other major platforms, although maturity and support levels vary. The learning curve is gentle for network engineers who have never written code, because playbooks are YAML, not Python.
The key abstraction in Ansible is the playbook: a declarative description of the desired state. You do not write “log in to the controller, navigate to the branch configuration page, click add VLAN, type 200.” You write “this branch should have VLAN 200 with this subnet and this routing policy.” Ansible figures out how to make it so.
Here is a practical example — a playbook that configures a branch SD-WAN device with a VLAN, a WAN interface, and an application-aware routing policy:
---
- name: Configure branch SD-WAN device
  hosts: branch_routers
  gather_facts: false
  connection: ansible.netcommon.httpapi
  vars:
    vlan_id: 200
    vlan_name: "corp-data"
    # site_id comes from inventory; values above 255 would render an invalid address here
    vlan_subnet: "10.{{ site_id }}.200.0/24"
    wan_interface: "GigabitEthernet0/0/0"
    app_policy_name: "branch-standard"
  tasks:
    - name: Configure corporate data VLAN
      cisco.sdwan.device_template:
        state: present
        template_name: "branch-{{ site_id }}-vlans"
        device_type: vedge-C8000V
        features:
          - type: vlan
            vlan_id: "{{ vlan_id }}"
            name: "{{ vlan_name }}"
            ipv4_address: "{{ vlan_subnet }}"
            shutdown: false
    - name: Configure WAN interface with dual transport
      cisco.sdwan.wan_interface:
        interface: "{{ wan_interface }}"
        tunnel_interface: true
        color: biz-internet
        restrict: false
        carrier: default
        nat_refresh_interval: 5
        hello_interval: 1000
        hello_tolerance: 12
    - name: Apply application-aware routing policy
      cisco.sdwan.app_route_policy:
        name: "{{ app_policy_name }}"
        sequences:
          - seq_id: 10
            match:
              app_list: ["microsoft-365", "salesforce", "zoom"]
            action:
              sla_class: "interactive-video"
              preferred_color: "mpls"
              backup_color: "biz-internet"
          - seq_id: 20
            match:
              app_list: ["general-internet"]
            action:
              sla_class: "best-effort"
              preferred_color: "biz-internet"
    - name: Push configuration to device
      cisco.sdwan.config_push:
        device_id: "{{ inventory_hostname }}"
        wait_for_complete: true
        timeout: 300

Notice the use of variables like site_id. This playbook works across hundreds of branches because the site-specific values come from inventory, not from the playbook itself. The logic is generic; the data is specific. This separation is what makes network-as-code scalable.
Terraform for SD-WAN Fabric Provisioning
The question practitioners always ask is: “when do I use Terraform versus Ansible?” The answer is straightforward. Terraform excels at day-0 provisioning, that is, creating the resources that make up your SD-WAN fabric: hub sites, branch templates, overlay topologies, and controller configurations. It tracks state, handles dependencies, and plans changes before applying them. Ansible excels at day-2 operations: pushing configuration changes to devices that already exist, running operational commands, and orchestrating multi-step workflows.
In practice, many teams use both: Terraform to stand up the fabric and define templates, Ansible to configure individual devices and push policy updates. Here is a Terraform example that provisions an SD-WAN hub and a set of branch templates:
terraform {
  required_providers {
    sdwan = {
      source  = "CiscoDevNet/sdwan"
      version = "~> 0.3"
    }
  }
}

provider "sdwan" {
  url      = var.vmanage_url
  username = var.vmanage_username
  password = var.vmanage_password
}

# Hub site feature template
resource "sdwan_cisco_system_feature_template" "hub_system" {
  name         = "hub-system-template"
  description  = "System template for regional hub sites"
  device_types = ["vedge-C8000V"]
  host_name    = "hub-${var.region}"
  system_ip    = var.hub_system_ip
  site_id      = var.hub_site_id
  console_baud = "115200"
  timezone     = "UTC"
}

# Branch device template: small office profile
resource "sdwan_feature_device_template" "branch_small" {
  name        = "branch-small-office"
  description = "Template for branches with < 50 users"
  device_type = "vedge-C8000V"
  device_role = "sdwan-edge"
  general_templates {
    id   = sdwan_cisco_system_feature_template.hub_system.id
    type = "cisco_system"
  }
  general_templates {
    id   = sdwan_cisco_vpn_feature_template.corp_vpn.id
    type = "cisco_vpn"
  }
  general_templates {
    id   = sdwan_app_route_policy.branch_standard.id
    type = "app-route"
  }
}

# Branch device template: large campus profile
resource "sdwan_feature_device_template" "branch_large" {
  name        = "branch-large-campus"
  description = "Template for campuses with 200+ users"
  device_type = "vedge-C8000V"
  device_role = "sdwan-edge"
  general_templates {
    id   = sdwan_cisco_system_feature_template.hub_system.id
    type = "cisco_system"
  }
  general_templates {
    id   = sdwan_cisco_vpn_feature_template.corp_vpn.id
    type = "cisco_vpn"
  }
  general_templates {
    id   = sdwan_app_route_policy.campus_priority.id
    type = "app-route"
  }
}

output "small_branch_template_id" {
  value = sdwan_feature_device_template.branch_small.id
}

output "large_branch_template_id" {
  value = sdwan_feature_device_template.branch_large.id
}

The critical advantage of Terraform here is the plan phase. Before any change is applied, terraform plan shows you exactly what will be created, modified, or destroyed. This is your pre-change review, and it can be automated as part of a pull request workflow so that reviewers see the planned impact before approving.
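One way to automate that review is to have the pipeline read the plan in machine-readable form and flag anything destructive. The sketch below assumes the plan was saved with terraform plan -out and exported with terraform show -json, whose output lists each planned change under resource_changes with a change.actions array. The function name destructive_changes is hypothetical, and a real check would likely do more than look for deletions.

```python
import json


def destructive_changes(plan_json: str) -> list[str]:
    """Return addresses of resources the plan would delete or replace.

    Expects the JSON text produced by `terraform show -json <planfile>`.
    A replacement appears as both "delete" and "create" in the actions
    list, so checking for "delete" covers both cases.
    """
    plan = json.loads(plan_json)
    flagged = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        if "delete" in actions:
            flagged.append(change["address"])
    return flagged
```

A pull request job could post the returned addresses as a comment, or fail outright when the list is non-empty and the change was not labelled as intentionally destructive.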
Branch Rollout Automation
Rolling out SD-WAN to hundreds of branches requires a systematic approach. The pattern that works in production is:
- Define branch profiles — small office (5-50 users, single WAN link), large campus (200+ users, dual WAN with MPLS), retail (PCI compliance, guest WiFi isolation), and data centre interconnect
- Template the configuration — each profile gets a parameterized template. Site-specific values (IP ranges, VLAN IDs, circuit identifiers) come from inventory
- Automate the rollout stages — staging (lab validation), canary (one branch per region), fleet (all remaining branches in waves)
The inventory file is where all site-specific data lives. Here is an Ansible inventory that organizes branches by profile and region, with variables that feed into the playbooks:
---
all:
  children:
    # Parent group targeted by the playbooks (hosts: branch_routers)
    branch_routers:
      children:
        small_office: {}
        large_campus: {}
        retail: {}
    # Branch profiles
    small_office:
      hosts:
        branch-nyc-001:
          site_id: 101
          wan_links: ["biz-internet"]
          user_count: 35
          vlans: [100, 200]
          region: us-east
        branch-chi-002:
          site_id: 102
          wan_links: ["biz-internet"]
          user_count: 22
          vlans: [100, 200]
          region: us-central
    large_campus:
      hosts:
        campus-sfo-001:
          site_id: 201
          wan_links: ["mpls", "biz-internet"]
          user_count: 450
          vlans: [100, 200, 300, 400]
          region: us-west
          ha_enabled: true
        campus-lon-001:
          site_id: 202
          wan_links: ["mpls", "biz-internet"]
          user_count: 380
          vlans: [100, 200, 300, 400]
          region: eu-west
          ha_enabled: true
    retail:
      hosts:
        store-atl-001:
          site_id: 301
          wan_links: ["biz-internet", "lte"]
          user_count: 12
          vlans: [100, 500]
          pci_zone: true
          guest_wifi: true
          region: us-east
    # Rollout stages
    canary:
      hosts:
        branch-nyc-001: {}
        campus-sfo-001: {}
        store-atl-001: {}
    wave_1:
      children:
        small_office: {}
    wave_2:
      children:
        large_campus: {}
        retail: {}

With this structure, rolling out to the canary group is a single command: ansible-playbook -i inventory/branches.yaml playbooks/branch-sdwan-config.yaml -l canary. Once the canary validates, wave_1 follows, then wave_2. Each wave can be triggered manually or automated through a pipeline with health checks between stages.
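The health-gated wave sequence can be sketched in a few lines of Python. This is a minimal illustration rather than a production orchestrator: rollout and ansible_deploy are hypothetical names, the health check is passed in as a callable so it can wrap whatever SLA validation you use, and the playbook and inventory paths are the ones from the command above.

```python
import subprocess
from typing import Callable, Iterable


def ansible_deploy(group: str) -> bool:
    """Run the branch playbook limited to one inventory group."""
    cmd = [
        "ansible-playbook",
        "-i", "inventory/branches.yaml",
        "playbooks/branch-sdwan-config.yaml",
        "-l", group,
    ]
    return subprocess.run(cmd, check=False).returncode == 0


def rollout(
    waves: Iterable[str],
    deploy: Callable[[str], bool],
    healthy: Callable[[str], bool],
) -> list[str]:
    """Deploy waves in order and stop at the first failure.

    Returns the waves that deployed and passed their health check.
    Later waves are never touched once one wave fails.
    """
    completed = []
    for wave in waves:
        if not deploy(wave):
            break
        if not healthy(wave):
            break
        completed.append(wave)
    return completed
```

Calling rollout(["canary", "wave_1", "wave_2"], ansible_deploy, my_health_check) would run the stages in order, where my_health_check is whatever function you supply. Passing the deploy step as a callable also makes the gating logic easy to test without touching a device.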
Integrating SD-WAN into CI/CD
The highest-maturity network teams treat their SD-WAN configuration the same way application teams treat code: every change flows through a pipeline that lints, tests, stages, and deploys. The pipeline is the single path to production — nobody logs into the controller to make ad hoc changes.
A typical pipeline has four stages. First, lint the configuration files for syntax errors and schema compliance. Second, run the configuration through a simulated topology using tools like Containerlab or GNS3 to catch routing loops, policy conflicts, and connectivity failures. Third, deploy to a staging environment that mirrors a subset of production. Fourth, deploy to production in waves with health checks.
name: SD-WAN Configuration Deployment
on:
  push:
    branches: [main]
    paths: ["network-config/**"]
  pull_request:
    branches: [main]
    paths: ["network-config/**"]
env:
  ANSIBLE_VAULT_PASSWORD: ${{ secrets.ANSIBLE_VAULT_PASSWORD }}
  VMANAGE_URL: ${{ secrets.VMANAGE_URL }}
jobs:
  lint:
    name: Lint and Validate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install ansible-lint yamllint jsonschema
      - name: YAML lint
        run: yamllint -c .yamllint network-config/
      - name: Ansible lint
        run: ansible-lint playbooks/
      - name: Validate against schema
        run: python scripts/validate_inventory.py
  simulate:
    name: Topology Simulation
    needs: lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Containerlab is not preinstalled on hosted runners
      - name: Install Containerlab
        run: bash -c "$(curl -sL https://get.containerlab.dev)"
      - name: Start Containerlab topology
        run: |
          sudo containerlab deploy -t topologies/sdwan-test.yaml
      - name: Run playbook against simulated devices
        run: |
          ansible-playbook -i inventory/test.yaml \
            playbooks/branch-sdwan-config.yaml \
            --check --diff
      - name: Run connectivity tests
        run: python scripts/test_connectivity.py
      - name: Tear down topology
        if: always()
        run: sudo containerlab destroy -t topologies/sdwan-test.yaml
  deploy-staging:
    name: Deploy to Staging
    needs: simulate
    if: github.ref == 'refs/heads/main'
    runs-on: self-hosted
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to staging branches
        run: |
          ansible-playbook -i inventory/staging.yaml \
            playbooks/branch-sdwan-config.yaml
      - name: Validate SLA metrics
        run: python scripts/check_sla.py --env staging --wait 300
  deploy-production:
    name: Deploy to Production (Canary)
    needs: deploy-staging
    runs-on: self-hosted
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to canary branches
        run: |
          ansible-playbook -i inventory/branches.yaml \
            playbooks/branch-sdwan-config.yaml \
            -l canary
      - name: Health check - wait 10 minutes
        run: python scripts/check_sla.py --env production --limit canary --wait 600
      - name: Deploy to all branches (wave rollout)
        run: |
          for wave in wave_1 wave_2; do
            ansible-playbook -i inventory/branches.yaml \
              playbooks/branch-sdwan-config.yaml \
              -l "$wave"
            python scripts/check_sla.py --env production --limit "$wave" --wait 300
          done

The environment keyword ties each deploy job to a GitHub Actions environment. When that environment is configured with required reviewers, the job pauses until a human clicks “approve”, so production deployments wait for someone to review the staging results. This gives you automation with guardrails: speed without recklessness.
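The pipeline calls scripts/validate_inventory.py without showing it. The following is a minimal sketch of the kind of checks such a script might run, not the actual script: the function name validate_hosts and the specific rules are assumptions. It takes host variables as a plain dictionary (in practice loaded from the YAML inventory with a parser of your choice) and checks that each site_id is unique, that the 10.<site_id>.200.0/24 pattern used by the branch playbook renders a valid subnet, and that every site has at least one WAN link.

```python
import ipaddress


def validate_hosts(hosts: dict[str, dict]) -> list[str]:
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    seen: dict[int, str] = {}
    for name, attrs in sorted(hosts.items()):
        site_id = attrs.get("site_id")
        if not isinstance(site_id, int):
            errors.append(f"{name}: site_id missing or not an integer")
            continue
        if site_id in seen:
            errors.append(f"{name}: site_id {site_id} already used by {seen[site_id]}")
        else:
            seen[site_id] = name
        # Render the subnet the same way the playbook template does.
        subnet = f"10.{site_id}.200.0/24"
        try:
            ipaddress.ip_network(subnet)
        except ValueError:
            errors.append(f"{name}: {subnet} is not a valid subnet")
        if not attrs.get("wan_links"):
            errors.append(f"{name}: at least one WAN link is required")
    return errors
```

Run against the sample inventory, a check like this would flag store-atl-001, because site_id 301 does not fit in an IPv4 octet. That is exactly the class of error that is cheap to catch in the lint stage and expensive to discover on a device.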
SASE and Zero Trust Integration
SD-WAN does not exist in isolation anymore. The industry has converged on SASE (Secure Access Service Edge) as the architecture that combines SD-WAN connectivity with cloud-delivered security. Every major vendor — Cisco, Fortinet, Palo Alto, Zscaler, Netskope — now offers an integrated SASE platform or partnership.
For automation, this means your SD-WAN policy definitions need to incorporate security controls from day one. When you define an application-aware routing policy that sends Microsoft 365 traffic directly to the internet rather than backhauling it through a data centre, that traffic must pass through a Secure Web Gateway (SWG) and a Cloud Access Security Broker (CASB). When a remote user connects through the SD-WAN fabric, Zero Trust Network Access (ZTNA) policies should determine what they can reach based on identity, device posture, and context — not just network location.
The automation implication is that your network-as-code repository should include not just routing and connectivity configuration, but also firewall rules, SWG policies, ZTNA application definitions, and CASB controls. These security policies should go through the same review and testing pipeline as your network configuration. A change to an SD-WAN routing policy that bypasses the security stack should be caught in code review, not discovered in a penetration test six months later.
What to automate alongside SD-WAN policy:
- Cloud firewall rules that govern direct internet breakout traffic from branches
- ZTNA application access policies tied to identity provider groups (Okta, Entra ID)
- SWG URL filtering and TLS inspection policies
- DLP rules applied at the CASB layer for SaaS applications
- DNS security policies to block command-and-control communication
Monitoring Automated Networks
Automation without observability is flying blind. When a pipeline pushes a configuration change to two hundred branches, you need to know immediately whether that change improved things, broke things, or had no effect. The monitoring strategy for an automated SD-WAN needs to cover three dimensions: device health, application performance, and configuration compliance.
For device health, traditional SNMP polling still has a place, but streaming telemetry via gNMI (gRPC Network Management Interface) is the modern approach. Devices push metrics to a collector in real time rather than waiting to be polled. This can cut monitoring latency from minutes to seconds and avoids much of the polling overhead that makes SNMP hard to scale across hundreds of sites.
For application performance, synthetic probing and SLA monitoring are essential. Tools like ThousandEyes run synthetic transactions from branch locations to SaaS endpoints, measuring latency, jitter, and packet loss. Kentik provides flow-based analytics that show you how traffic is actually traversing the SD-WAN overlay. LibreNMS offers an open-source alternative for teams that need to keep costs down while still getting comprehensive network monitoring.
For configuration compliance, run regular audits that compare the running configuration on each device against the declared state in your Git repository. Any drift indicates either a manual change that bypassed the pipeline or a failed deployment that was not caught. Both are problems that need to be addressed.
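At its core, a compliance audit is a diff between two texts. The sketch below uses Python's standard difflib to compare the declared configuration from Git with the running configuration from a device. How the running configuration is retrieved depends on your controller API and is left out; the function name config_drift and the git/ and running/ labels are illustrative.

```python
import difflib


def config_drift(declared: str, running: str, name: str = "device") -> list[str]:
    """Return unified-diff lines between declared and running config.

    An empty list means the device matches the repository.
    """
    diff = difflib.unified_diff(
        declared.splitlines(),
        running.splitlines(),
        fromfile=f"git/{name}",
        tofile=f"running/{name}",
        lineterm="",
    )
    return list(diff)
```

A scheduled job could run this per device and open an issue whenever the result is non-empty, attaching the diff so the reviewer sees precisely which lines drifted.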
Rollback and Safety
The fear that stops many network teams from adopting automation is the blast radius of a mistake. If a bad playbook runs against every branch simultaneously, the result is a company-wide outage. This fear is legitimate — and the answer is not to avoid automation, but to build rollback and safety into every step.
The first layer of safety is pre-change snapshots. Before any configuration change, capture the current running configuration and store it as an artifact. The second layer is canary deployment — apply the change to a small set of branches first and validate health metrics before proceeding. The third layer is automatic rollback: if SLA metrics degrade beyond a threshold after a change, revert immediately without waiting for a human.
Ansible's block/rescue pattern is ideal for implementing rollback at the playbook level:
---
- name: Safe SD-WAN configuration deployment
  hosts: branch_routers
  gather_facts: false
  serial: 1
  max_fail_percentage: 0
  tasks:
    - name: Deployment with automatic rollback
      block:
        - name: Snapshot current configuration
          cisco.sdwan.config_backup:
            device_id: "{{ inventory_hostname }}"
            dest: "/tmp/backup-{{ inventory_hostname }}.json"
        - name: Apply new configuration
          cisco.sdwan.device_template:
            state: present
            template_name: "branch-{{ site_id }}-vlans"
            device_type: vedge-C8000V
            features: "{{ branch_features }}"
        - name: Wait for convergence
          ansible.builtin.pause:
            seconds: 60
        - name: Validate SLA metrics
          cisco.sdwan.sla_check:
            device_id: "{{ inventory_hostname }}"
            sla_class: "interactive-video"
            max_latency_ms: 150
            max_jitter_ms: 30
            max_loss_pct: 1
          register: sla_result
        - name: Fail if SLA violated
          ansible.builtin.fail:
            msg: "SLA violation detected: {{ sla_result }}"
          when: not sla_result.within_threshold
      rescue:
        - name: Restore previous configuration
          cisco.sdwan.config_restore:
            device_id: "{{ inventory_hostname }}"
            src: "/tmp/backup-{{ inventory_hostname }}.json"
        - name: Notify team of rollback
          ansible.builtin.uri:
            url: "{{ slack_webhook_url }}"
            method: POST
            body_format: json
            body:
              text: "ROLLBACK: {{ inventory_hostname }} reverted after SLA violation. Check pipeline for details."
        - name: Abort remaining hosts
          ansible.builtin.meta: end_play

The serial: 1 directive ensures branches are configured one at a time. If any task in the block fails, the rescue section restores the backup, notifies the team via Slack, and calls end_play to halt the rollout before the next branch is touched. Because Ansible treats a rescued failure as handled, it does not count toward max_fail_percentage; that setting only matters when a failure escapes the block, for example if the restore itself fails. This is not perfect, since there is still a window of degradation during convergence, but it is far safer than manual changes with no rollback plan.
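The threshold decision behind a check like this is simple enough to sketch. The Python below is an illustration under stated assumptions, not the logic of any vendor module: SlaThresholds and sla_violations are hypothetical names, the metric keys are invented for the example, and the default limits mirror the values in the playbook above. A missing measurement is treated as a violation so that the check fails closed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaThresholds:
    max_latency_ms: float = 150
    max_jitter_ms: float = 30
    max_loss_pct: float = 1


def sla_violations(
    metrics: dict[str, float],
    limits: SlaThresholds = SlaThresholds(),
) -> list[str]:
    """Return one message per breached or missing metric."""
    checks = [
        ("latency_ms", limits.max_latency_ms),
        ("jitter_ms", limits.max_jitter_ms),
        ("loss_pct", limits.max_loss_pct),
    ]
    violations = []
    for key, limit in checks:
        value = metrics.get(key)
        if value is None:
            # No data is not the same as healthy: fail closed.
            violations.append(f"{key}: no measurement")
        elif value > limit:
            violations.append(f"{key}: {value} exceeds {limit}")
    return violations
```

An empty result corresponds to the within-threshold case; anything else should trigger the rollback path. Failing closed on missing data matters because a device that has dropped off the network reports nothing at all.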
Common Pitfalls
After helping multiple enterprises adopt SD-WAN automation, we see the same mistakes repeatedly. Knowing them in advance can save months of wasted effort.
- Vendor lock-in in automation tooling — building your entire automation stack on vendor-specific APIs means starting over when you switch platforms. Use abstraction layers where possible. Write playbooks that separate vendor-specific modules from business logic.
- Testing only in the lab — a lab with three virtual routers does not replicate the complexity of 300 branches with varying circuit quality, asymmetric routing, and legacy equipment. Your test topology needs to model real-world failure conditions, not just happy-path connectivity.
- Not handling partial failures — a deployment that succeeds on 198 of 200 branches is not a success. It is two branches with unknown state. Your pipeline must detect, report, and remediate partial failures automatically.
- Treating network-as-code as a project — automation is not something you do once and declare victory. It is a practice that requires ongoing investment in tooling, testing, and team skills. The playbooks need maintenance. The pipeline needs updates as vendors change APIs. The team needs training as new engineers join.
- Skipping the cultural change — the hardest part of network automation is not the technology. It is convincing network engineers that reviewing YAML in a pull request is a better use of their time than clicking through a GUI. Invest in training, pair programming, and celebrating early wins.
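Detecting the partial failures described in the list above starts with comparing the hosts you targeted against the hosts that actually reported a result. The following Python sketch assumes per-host results are available as a mapping of hostname to "ok" or "failed"; the names rollout_report and rollout_succeeded are hypothetical. Hosts that were targeted but reported nothing are classed as unknown rather than assumed healthy.

```python
def rollout_report(targeted: set[str], results: dict[str, str]) -> dict[str, list[str]]:
    """Classify every targeted host as ok, failed, or unknown."""
    ok = sorted(h for h in targeted if results.get(h) == "ok")
    failed = sorted(h for h in targeted if results.get(h) == "failed")
    unknown = sorted(targeted - set(ok) - set(failed))
    return {"ok": ok, "failed": failed, "unknown": unknown}


def rollout_succeeded(report: dict[str, list[str]]) -> bool:
    """A rollout only counts as successful if every host is confirmed ok."""
    return not report["failed"] and not report["unknown"]
```

The failed and unknown lists are the remediation queue: each host on them needs either a retry or a restore before the change can be called complete.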
From Tickets to Pipelines
SD-WAN is a powerful technology, but technology alone does not solve operational problems. The vendors will sell you the overlay, the controller, and the dashboard. What they will not sell you is the automation, the testing, and the deployment discipline that makes it all work at scale. That is on you — or on the partner you choose to work with.
The path from ticket-driven network changes to pipeline-driven network-as-code is not an overnight change. It starts with putting your configuration in Git, then adding a linter, then a test topology, then a staging environment. Each step reduces risk and increases velocity. Within six months, your team will wonder how they ever managed a WAN without it.
NubisCore helps enterprises automate SD-WAN deployments — from initial architecture and tooling selection through branch rollout automation and ongoing pipeline management. Whether you are starting from scratch or trying to bring order to an existing deployment, we can help.