SD-WAN Automation: Replacing Ticket-Driven Network Changes with Code

Introduction

SD-WAN adoption is exploding. Gartner estimates that over 60% of enterprises will have deployed some form of SD-WAN by the end of 2026, up from roughly 30% just three years ago. The promise is compelling: application-aware routing, centralized policy, cheaper bandwidth through broadband and LTE diversity, and a path toward SASE. But there is a gap between the marketing slide deck and the operational reality.

Most SD-WAN deployments are still managed through vendor GUIs and support tickets. A branch office needs a new VLAN? Open a ticket. An application needs a QoS policy change? Open a ticket. A new retail location needs to be onboarded? Open a ticket, wait three days, then open a follow-up ticket when the configuration does not match the template. Network engineers spend days doing what should take minutes. The bottleneck is not the technology — it is the process.

Network-as-code is the answer. The same principles that transformed server provisioning and cloud infrastructure — version control, peer review, automated testing, continuous delivery — apply directly to WAN management. This article walks through the tools, patterns, and practices that make SD-WAN automation work in production, not just in a lab.

Why SD-WAN Deployments Stall

If you have been through a large-scale SD-WAN rollout, you have likely encountered one or more of these failure modes. They are not unique to any single vendor — they are structural problems with how networks are traditionally managed.

GUI-Only Management Does Not Scale Past 20 Branches

Every SD-WAN vendor ships a controller with a web interface. For a proof of concept with five sites, the GUI works fine. At twenty sites, it becomes tedious. At two hundred sites, it becomes a full-time job for multiple engineers. Clicking through forms to configure VLANs, firewall rules, and routing policies one branch at a time does not scale. The GUI was designed for visibility and troubleshooting, not for bulk provisioning.

Configuration Drift Between Branches

When configuration is applied manually, drift is inevitable. Branch A gets the updated QoS policy. Branch B still runs the old one because the engineer was interrupted halfway through the change window. Branch C has a custom override that someone applied six months ago and nobody remembers why. Over time, every branch becomes a unique snowflake. Troubleshooting becomes archaeology.

No Version Control or Rollback Capability

When a change breaks something, the first question is always “what changed?” If configuration lives only in the vendor controller, answering that question requires digging through audit logs — if they exist. There is no diff, no commit history, no blame annotation. Rolling back means manually reconstructing the previous state, often from memory or a spreadsheet. This is not engineering. This is hope-driven operations.

Network-as-Code: The Foundation

Network-as-code means applying software engineering practices to network configuration. The configuration for every device, every policy, and every template lives in a Git repository. Changes go through pull requests with peer review. Automated tests validate syntax, compliance, and connectivity before anything touches production. A CI/CD pipeline handles the deployment.

If you have adopted infrastructure-as-code for your cloud environments — Terraform for AWS, Pulumi for Azure, CloudFormation for GovCloud — the mental model is identical. The network is just another layer of infrastructure that should be declared, versioned, and delivered through a pipeline.

The four pillars of network-as-code:

Version control: every configuration change is a Git commit with an author, a timestamp, and a message explaining why
Code review: no change reaches production without at least one peer approval
Automated testing: linting, schema validation, and topology simulation catch errors before deployment
Continuous delivery: a pipeline applies changes consistently across staging and production environments

The tooling ecosystem for network automation has matured significantly. Ansible, Terraform, Nornir, and NAPALM each fill a specific niche. For SD-WAN specifically, Ansible and Terraform cover the vast majority of use cases, so that is where we will focus.

Ansible for SD-WAN Configuration

Ansible dominates network automation for good reasons. It is agentless — it connects over SSH or REST APIs, so there is nothing to install on the managed devices. It has vendor-supported collections for Cisco SD-WAN (vManage), Fortinet FortiManager, VMware VeloCloud, Aruba EdgeConnect, and most other major platforms. The learning curve is gentle for network engineers who have never written code, because playbooks are YAML, not Python.

The key abstraction in Ansible is the playbook: a declarative description of the desired state. You do not write “log in to the controller, navigate to the branch configuration page, click add VLAN, type 200.” You write “this branch should have VLAN 200 with this subnet and this routing policy.” Ansible figures out how to make it so.

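Here is a practical example: a playbook that configures a branch SD-WAN device with a VLAN, a WAN interface, and an application-aware routing policy. Module and parameter names are illustrative; the exact names depend on the collection and version you install.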

playbooks/branch-sdwan-config.yaml

---
- name: Configure branch SD-WAN device
  hosts: branch_routers
  gather_facts: false
  connection: ansible.netcommon.httpapi

  vars:
    vlan_id: 200
    vlan_name: "corp-data"
    # site_id becomes the second octet, so this scheme only works for site IDs 0-255
    vlan_subnet: "10.{{ site_id }}.200.0/24"
    wan_interface: "GigabitEthernet0/0/0"
    app_policy_name: "branch-standard"

  tasks:
    - name: Configure corporate data VLAN
      cisco.sdwan.device_template:
        state: present
        template_name: "branch-{{ site_id }}-vlans"
        device_type: vedge-C8000V
        features:
          - type: vlan
            vlan_id: "{{ vlan_id }}"
            name: "{{ vlan_name }}"
            ipv4_address: "{{ vlan_subnet }}"
            shutdown: false

    - name: Configure WAN interface as a tunnel transport (biz-internet)
      cisco.sdwan.wan_interface:
        interface: "{{ wan_interface }}"
        tunnel_interface: true
        color: biz-internet
        restrict: false
        carrier: default
        nat_refresh_interval: 5
        hello_interval: 1000
        hello_tolerance: 12

    - name: Apply application-aware routing policy
      cisco.sdwan.app_route_policy:
        name: "{{ app_policy_name }}"
        sequences:
          - seq_id: 10
            match:
              app_list: ["microsoft-365", "salesforce", "zoom"]
            action:
              sla_class: "interactive-video"
              preferred_color: "mpls"
              backup_color: "biz-internet"
          - seq_id: 20
            match:
              app_list: ["general-internet"]
            action:
              sla_class: "best-effort"
              preferred_color: "biz-internet"

    - name: Push configuration to device
      cisco.sdwan.config_push:
        device_id: "{{ inventory_hostname }}"
        wait_for_complete: true
        timeout: 300

Notice the use of variables like site_id. This playbook works across hundreds of branches because the site-specific values come from inventory, not from the playbook itself. The logic is generic; the data is specific. This separation is what makes network-as-code scalable.

Terraform for SD-WAN Fabric Provisioning

The question practitioners always ask is: “when do I use Terraform versus Ansible?” The answer is straightforward. Terraform excels at day-0 provisioning: creating the resources that make up your SD-WAN fabric, such as hub sites, branch templates, overlay topologies, and controller configurations. It tracks state, handles dependencies, and plans changes before applying them. Ansible excels at day-2 operations: pushing configuration changes to devices that already exist, running operational commands, and orchestrating multi-step workflows.

In practice, many teams use both: Terraform to stand up the fabric and define templates, Ansible to configure individual devices and push policy updates. Here is a Terraform example that provisions an SD-WAN hub and a set of branch templates:

sdwan-fabric/main.tf

terraform {
  required_providers {
    sdwan = {
      source  = "CiscoDevNet/sdwan"
      version = "~> 0.3"
    }
  }
}

provider "sdwan" {
  url      = var.vmanage_url
  username = var.vmanage_username
  password = var.vmanage_password
}

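# NOTE: the var.* inputs, sdwan_cisco_vpn_feature_template.corp_vpn, and the two
# sdwan_app_route_policy resources referenced below are assumed to be declared
# elsewhere in this module; they are omitted here for brevity.

# Hub site feature template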
resource "sdwan_cisco_system_feature_template" "hub_system" {
  name          = "hub-system-template"
  description   = "System template for regional hub sites"
  device_types  = ["vedge-C8000V"]
  host_name     = "hub-${var.region}"
  system_ip     = var.hub_system_ip
  site_id       = var.hub_site_id
  console_baud  = "115200"
  timezone      = "UTC"
}

# Branch device template — small office profile
resource "sdwan_feature_device_template" "branch_small" {
  name          = "branch-small-office"
  description   = "Template for branches with < 50 users"
  device_type   = "vedge-C8000V"
  device_role   = "sdwan-edge"

  general_templates {
    id   = sdwan_cisco_system_feature_template.hub_system.id
    type = "cisco_system"
  }

  general_templates {
    id   = sdwan_cisco_vpn_feature_template.corp_vpn.id
    type = "cisco_vpn"
  }

  general_templates {
    id   = sdwan_app_route_policy.branch_standard.id
    type = "app-route"
  }
}

# Branch device template — large campus profile
resource "sdwan_feature_device_template" "branch_large" {
  name          = "branch-large-campus"
  description   = "Template for campuses with 200+ users"
  device_type   = "vedge-C8000V"
  device_role   = "sdwan-edge"

  general_templates {
    id   = sdwan_cisco_system_feature_template.hub_system.id
    type = "cisco_system"
  }

  general_templates {
    id   = sdwan_cisco_vpn_feature_template.corp_vpn.id
    type = "cisco_vpn"
  }

  general_templates {
    id   = sdwan_app_route_policy.campus_priority.id
    type = "app-route"
  }
}

output "small_branch_template_id" {
  value = sdwan_feature_device_template.branch_small.id
}

output "large_branch_template_id" {
  value = sdwan_feature_device_template.branch_large.id
}

The critical advantage of Terraform here is the plan phase. Before any change is applied, terraform plan shows you exactly what will be created, modified, or destroyed. This is your pre-change review, and it can be automated as part of a pull request workflow so reviewers see the planned impact before approving.
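To make the plan reviewable in a pull request, one option is to export it as JSON (terraform plan -out=tfplan, then terraform show -json tfplan) and summarize it in a comment. The sketch below assumes the documented resource_changes / change.actions layout of that JSON; the function names and the sample data are hypothetical and not taken from a real run.

```python
import json
from collections import Counter


def summarize_plan(plan: dict) -> Counter:
    """Count planned actions in the JSON produced by `terraform show -json`."""
    counts: Counter = Counter()
    for change in plan.get("resource_changes", []):
        actions = change["change"]["actions"]
        if actions in (["no-op"], ["read"]):
            continue  # nothing will be modified for this resource
        if "create" in actions and "delete" in actions:
            counts["replace"] += 1  # destroy-and-recreate, in either order
        else:
            counts[actions[0]] += 1  # "create", "update", or "delete"
    return counts


def needs_extra_review(plan: dict) -> bool:
    """Flag plans that destroy or replace anything."""
    summary = summarize_plan(plan)
    return summary["delete"] + summary["replace"] > 0


if __name__ == "__main__":
    # Hand-written sample in the plan JSON shape; not output from a real run.
    sample = json.loads("""
    {"resource_changes": [
      {"address": "sdwan_feature_device_template.branch_small",
       "change": {"actions": ["create"]}},
      {"address": "sdwan_cisco_system_feature_template.hub_system",
       "change": {"actions": ["delete", "create"]}},
      {"address": "sdwan_feature_device_template.branch_large",
       "change": {"actions": ["no-op"]}}
    ]}
    """)
    print(dict(summarize_plan(sample)))
    print("extra review:", needs_extra_review(sample))
```

A pipeline could post the summary on the pull request and require a second approver whenever needs_extra_review returns True, since replacing a template attached to live devices is rarely a routine change.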

Branch Rollout Automation

Rolling out SD-WAN to hundreds of branches requires a systematic approach. The pattern that works in production is:

Define branch profiles: small office (5-50 users, single WAN link), large campus (200+ users, dual WAN with MPLS), retail (PCI compliance, guest WiFi isolation), and data centre interconnect
Template the configuration: each profile gets a parameterized template. Site-specific values (IP ranges, VLAN IDs, circuit identifiers) come from inventory
Automate the rollout stages: staging (lab validation), canary (one branch per region), fleet (all remaining branches in waves)
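The staging of canary and fleet waves can be derived from data rather than maintained by hand. The following is a minimal sketch, assuming each site is known only by hostname and region; plan_rollout and the wave_size parameter are hypothetical names, and picking the alphabetically first host per region as the canary is an arbitrary illustration.

```python
def plan_rollout(sites: dict[str, str], wave_size: int) -> dict[str, list[str]]:
    """Build a rollout plan from a mapping of hostname -> region.

    The canary group gets one host per region; every other host is
    placed into fixed-size waves in alphabetical order.
    """
    if wave_size < 1:
        raise ValueError("wave_size must be at least 1")

    canary: list[str] = []
    regions_covered: set[str] = set()
    for host in sorted(sites):
        if sites[host] not in regions_covered:
            regions_covered.add(sites[host])
            canary.append(host)

    remaining = [host for host in sorted(sites) if host not in canary]
    plan = {"canary": canary}
    for start in range(0, len(remaining), wave_size):
        plan[f"wave_{start // wave_size + 1}"] = remaining[start:start + wave_size]
    return plan


if __name__ == "__main__":
    sites = {
        "branch-nyc-001": "us-east",
        "branch-chi-002": "us-central",
        "campus-sfo-001": "us-west",
        "campus-lon-001": "eu-west",
        "store-atl-001": "us-east",
    }
    for stage, hosts in plan_rollout(sites, wave_size=2).items():
        print(stage, hosts)
```

In practice a team would probably choose canaries deliberately (low-risk sites, one per profile as well as per region), but generating the waves keeps the grouping consistent as the fleet grows.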

The inventory file is where all site-specific data lives. Here is an Ansible inventory that organizes branches by profile and region, with variables that feed into the playbooks:

inventory/branches.yaml

---
all:
  children:
    # Branch profiles
    small_office:
      hosts:
        branch-nyc-001:
          site_id: 101
          wan_links: ["biz-internet"]
          user_count: 35
          vlans: [100, 200]
          region: us-east
        branch-chi-002:
          site_id: 102
          wan_links: ["biz-internet"]
          user_count: 22
          vlans: [100, 200]
          region: us-central

    large_campus:
      hosts:
        campus-sfo-001:
          site_id: 201
          wan_links: ["mpls", "biz-internet"]
          user_count: 450
          vlans: [100, 200, 300, 400]
          region: us-west
          ha_enabled: true
        campus-lon-001:
          site_id: 202
          wan_links: ["mpls", "biz-internet"]
          user_count: 380
          vlans: [100, 200, 300, 400]
          region: eu-west
          ha_enabled: true

    retail:
      hosts:
        store-atl-001:
          site_id: 301
          wan_links: ["biz-internet", "lte"]
          user_count: 12
          vlans: [100, 500]
          pci_zone: true
          guest_wifi: true
          region: us-east

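    # Parent group targeted by the playbooks (hosts: branch_routers)
    branch_routers:
      children:
        small_office: {}
        large_campus: {}
        retail: {}

    # Rollout stages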
    canary:
      hosts:
        branch-nyc-001: {}
        campus-sfo-001: {}
        store-atl-001: {}

    wave_1:
      children:
        small_office: {}

    wave_2:
      children:
        large_campus: {}
        retail: {}

With this structure, rolling out to the canary group is a single command: ansible-playbook -i inventory/branches.yaml playbooks/branch-sdwan-config.yaml -l canary. Once the canary validates, wave_1 follows, then wave_2. Each wave can be triggered manually or automated through a pipeline with health checks between stages.
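Because every playbook trusts the inventory, it pays to validate it before anything runs. The sketch below is one possible shape for such a check, assuming host variables have already been loaded into plain dictionaries (YAML parsing is left out to keep it standard-library only). REQUIRED_KEYS and validate_hosts are hypothetical names; the subnet rule mirrors the 10.<site_id>.200.0/24 convention used by the playbook earlier.

```python
import ipaddress

REQUIRED_KEYS = {"site_id", "wan_links", "vlans", "region"}


def validate_hosts(hosts: dict[str, dict]) -> list[str]:
    """Return human-readable errors for a mapping of hostname -> host vars."""
    errors: list[str] = []
    owner_of_site_id: dict[int, str] = {}
    for name, host_vars in hosts.items():
        missing = REQUIRED_KEYS - host_vars.keys()
        if missing:
            errors.append(f"{name}: missing {sorted(missing)}")
            continue

        site_id = host_vars["site_id"]
        if site_id in owner_of_site_id:
            errors.append(
                f"{name}: site_id {site_id} duplicates {owner_of_site_id[site_id]}"
            )
        else:
            owner_of_site_id[site_id] = name

        # The playbook derives the VLAN subnet from site_id, so the result
        # has to be a valid IPv4 network (second octet no larger than 255).
        try:
            ipaddress.ip_network(f"10.{site_id}.200.0/24")
        except ValueError:
            errors.append(
                f"{name}: site_id {site_id} does not fit 10.<site_id>.200.0/24"
            )

        if not host_vars["wan_links"]:
            errors.append(f"{name}: at least one WAN link is required")
    return errors


if __name__ == "__main__":
    hosts = {
        "branch-nyc-001": {"site_id": 101, "wan_links": ["biz-internet"],
                           "vlans": [100, 200], "region": "us-east"},
        "store-atl-001": {"site_id": 301, "wan_links": ["biz-internet", "lte"],
                          "vlans": [100, 500], "region": "us-east"},
    }
    for error in validate_hosts(hosts):
        print(error)
```

Run against the sample inventory above, a check like this flags store-atl-001: a site ID of 301 cannot be used as an IPv4 octet, so retail sites need either a different ID range or a different addressing scheme.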

Integrating SD-WAN into CI/CD

The highest-maturity network teams treat their SD-WAN configuration the same way application teams treat code: every change flows through a pipeline that lints, tests, stages, and deploys. The pipeline is the single path to production — nobody logs into the controller to make ad hoc changes.

A typical pipeline has four stages. First, lint the configuration files for syntax errors and schema compliance. Second, run the configuration through a simulated topology using tools like Containerlab or GNS3 to catch routing loops, policy conflicts, and connectivity failures. Third, deploy to a staging environment that mirrors a subset of production. Fourth, deploy to production in waves with health checks.

.github/workflows/sdwan-deploy.yaml

name: SD-WAN Configuration Deployment

on:
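  push:
    branches: [main]
    # Include playbooks and inventory so changes to them also trigger the pipeline
    paths: ["network-config/**", "playbooks/**", "inventory/**"]
  pull_request:
    branches: [main]
    paths: ["network-config/**", "playbooks/**", "inventory/**"]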

env:
  ANSIBLE_VAULT_PASSWORD: ${{ secrets.ANSIBLE_VAULT_PASSWORD }}
  VMANAGE_URL: ${{ secrets.VMANAGE_URL }}

jobs:
  lint:
    name: Lint and Validate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install ansible-lint yamllint jsonschema
      - name: YAML lint
        run: yamllint -c .yamllint network-config/
      - name: Ansible lint
        run: ansible-lint playbooks/
      - name: Validate against schema
        run: python scripts/validate_inventory.py

  simulate:
    name: Topology Simulation
    needs: lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
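      - name: Install Containerlab
        run: bash -c "$(curl -sL https://get.containerlab.dev)"
      - name: Start Containerlab topology
        run: |
          sudo containerlab deploy -t topologies/sdwan-test.yaml
      # No --check here: the connectivity tests below need the change actually applied
      - name: Run playbook against simulated devices
        run: |
          ansible-playbook -i inventory/test.yaml \
            playbooks/branch-sdwan-config.yaml \
            --diff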
      - name: Run connectivity tests
        run: python scripts/test_connectivity.py
      - name: Tear down topology
        if: always()
        run: sudo containerlab destroy -t topologies/sdwan-test.yaml

  deploy-staging:
    name: Deploy to Staging
    needs: simulate
    if: github.ref == 'refs/heads/main'
    runs-on: self-hosted
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to staging branches
        run: |
          ansible-playbook -i inventory/staging.yaml \
            playbooks/branch-sdwan-config.yaml
      - name: Validate SLA metrics
        run: python scripts/check_sla.py --env staging --wait 300

  deploy-production:
    name: Deploy to Production (Canary)
    needs: deploy-staging
    runs-on: self-hosted
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to canary branches
        run: |
          ansible-playbook -i inventory/branches.yaml \
            playbooks/branch-sdwan-config.yaml \
            -l canary
      - name: Health check — wait 10 minutes
        run: python scripts/check_sla.py --env production --limit canary --wait 600
      - name: Deploy to all branches (wave rollout)
        run: |
          for wave in wave_1 wave_2; do
            ansible-playbook -i inventory/branches.yaml \
              playbooks/branch-sdwan-config.yaml \
              -l $wave
            python scripts/check_sla.py --env production --limit $wave --wait 300
          done

The environment keyword in GitHub Actions ties a job to a named deployment environment. When that environment is configured with required reviewers, the job pauses until a human clicks “approve”, so production deployments wait for someone to review the staging results. This gives you automation with guardrails: speed without recklessness.
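The health-check steps in the workflow call scripts/check_sla.py, which is not shown in this article. As a rough sketch of the gating logic such a script might contain, the code below compares per-site metrics against fixed limits. SlaLimits and sla_violations are hypothetical names, the metric keys are assumptions, and collecting the metrics from the controller or a monitoring system is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaLimits:
    """Thresholds for one SLA class (values here are illustrative)."""
    max_latency_ms: float = 150.0
    max_jitter_ms: float = 30.0
    max_loss_pct: float = 1.0


def sla_violations(
    metrics: dict[str, dict[str, float]],
    limits: SlaLimits = SlaLimits(),
) -> dict[str, list[str]]:
    """Return hostname -> list of breached thresholds; empty dict means healthy."""
    violations: dict[str, list[str]] = {}
    for host, sample in metrics.items():
        problems = []
        if sample["latency_ms"] > limits.max_latency_ms:
            problems.append(f"latency {sample['latency_ms']} ms")
        if sample["jitter_ms"] > limits.max_jitter_ms:
            problems.append(f"jitter {sample['jitter_ms']} ms")
        if sample["loss_pct"] > limits.max_loss_pct:
            problems.append(f"loss {sample['loss_pct']}%")
        if problems:
            violations[host] = problems
    return violations


if __name__ == "__main__":
    # Made-up measurements for two canary sites.
    metrics = {
        "branch-nyc-001": {"latency_ms": 42.0, "jitter_ms": 3.1, "loss_pct": 0.0},
        "campus-sfo-001": {"latency_ms": 210.0, "jitter_ms": 12.0, "loss_pct": 2.5},
    }
    for host, problems in sla_violations(metrics).items():
        print(f"{host}: {', '.join(problems)}")
```

A real script would exit non-zero when the returned dictionary is not empty, which is what stops the workflow from proceeding to the next wave.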

SASE and Zero Trust Integration

SD-WAN does not exist in isolation anymore. The industry has converged on SASE (Secure Access Service Edge) as the architecture that combines SD-WAN connectivity with cloud-delivered security. Every major vendor — Cisco, Fortinet, Palo Alto, Zscaler, Netskope — now offers an integrated SASE platform or partnership.

For automation, this means your SD-WAN policy definitions need to incorporate security controls from day one. When you define an application-aware routing policy that sends Microsoft 365 traffic directly to the internet rather than backhauling it through a data centre, that traffic must pass through a Secure Web Gateway (SWG) and a Cloud Access Security Broker (CASB). When a remote user connects through the SD-WAN fabric, Zero Trust Network Access (ZTNA) policies should determine what they can reach based on identity, device posture, and context — not just network location.

The automation implication is that your network-as-code repository should include not just routing and connectivity configuration, but also firewall rules, SWG policies, ZTNA application definitions, and CASB controls. These security policies should go through the same review and testing pipeline as your network configuration. A change to an SD-WAN routing policy that bypasses the security stack should be caught in code review, not discovered in a penetration test six months later.

What to automate alongside SD-WAN policy:

Cloud firewall rules that govern direct internet breakout traffic from branches
ZTNA application access policies tied to identity provider groups (Okta, Entra ID)
SWG URL filtering and TLS inspection policies
DLP rules applied at the CASB layer for SaaS applications
DNS security policies to block command-and-control communication

Monitoring Automated Networks

Automation without observability is flying blind. When a pipeline pushes a configuration change to two hundred branches, you need to know immediately whether that change improved things, broke things, or had no effect. The monitoring strategy for an automated SD-WAN needs to cover three dimensions: device health, application performance, and configuration compliance.

For device health, traditional SNMP polling still has a place, but streaming telemetry via gNMI (gRPC Network Management Interface) is the modern approach. Devices push metrics to a collector in real time rather than waiting to be polled. This reduces monitoring latency from minutes to seconds and eliminates the scaling problems of SNMP at hundreds of sites.

For application performance, synthetic probing and SLA monitoring are essential. Tools like ThousandEyes run synthetic transactions from branch locations to SaaS endpoints, measuring latency, jitter, and packet loss. Kentik provides flow-based analytics that show you how traffic is actually traversing the SD-WAN overlay. LibreNMS offers an open-source alternative for teams that need to keep costs down while still getting comprehensive network monitoring.

For configuration compliance, run regular audits that compare the running configuration on each device against the declared state in your Git repository. Any drift indicates either a manual change that bypassed the pipeline or a failed deployment that was not caught. Both are problems that need to be addressed.
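The comparison itself can be simple once both sides are available as structured data. The sketch below assumes the declared state from Git and the running state fetched from the controller have each been parsed into nested dictionaries; find_drift is a hypothetical helper, and fetching or normalizing the running configuration is out of scope here.

```python
def find_drift(declared: dict, running: dict, path: str = "") -> list[str]:
    """Recursively compare declared (Git) and running (device) configuration."""
    drift: list[str] = []
    for key in sorted(declared.keys() | running.keys(), key=str):
        here = f"{path}.{key}" if path else str(key)
        if key not in running:
            drift.append(f"missing on device: {here}")
        elif key not in declared:
            drift.append(f"not in Git: {here}")
        elif isinstance(declared[key], dict) and isinstance(running[key], dict):
            drift.extend(find_drift(declared[key], running[key], here))
        elif declared[key] != running[key]:
            drift.append(
                f"changed: {here} "
                f"(declared={declared[key]!r}, running={running[key]!r})"
            )
    return drift


if __name__ == "__main__":
    declared = {
        "vlans": {"200": {"name": "corp-data"}},
        "qos": {"policy": "branch-standard"},
    }
    running = {
        "vlans": {"200": {"name": "corp-data"}, "999": {"name": "temp"}},
        "qos": {"policy": "legacy"},
    }
    for finding in find_drift(declared, running):
        print(finding)
```

Entries reported as "not in Git" usually point to a manual change on the controller, while "missing on device" or "changed" more often indicates a deployment that did not complete.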

Rollback and Safety

The fear that stops many network teams from adopting automation is the blast radius of a mistake. If a bad playbook runs against every branch simultaneously, the result is a company-wide outage. This fear is legitimate — and the answer is not to avoid automation, but to build rollback and safety into every step.

The first layer of safety is pre-change snapshots. Before any configuration change, capture the current running configuration and store it as an artifact. The second layer is canary deployment — apply the change to a small set of branches first and validate health metrics before proceeding. The third layer is automatic rollback: if SLA metrics degrade beyond a threshold after a change, revert immediately without waiting for a human.

Ansible's block/rescue pattern is ideal for implementing rollback at the playbook level:

playbooks/safe-deploy.yaml

---
- name: Safe SD-WAN configuration deployment
  hosts: branch_routers
  gather_facts: false
  serial: 1
  max_fail_percentage: 0

  tasks:
    - name: Deployment with automatic rollback
      block:
        - name: Snapshot current configuration
          cisco.sdwan.config_backup:
            device_id: "{{ inventory_hostname }}"
            dest: "/tmp/backup-{{ inventory_hostname }}.json"

        - name: Apply new configuration
          cisco.sdwan.device_template:
            state: present
            template_name: "branch-{{ site_id }}-vlans"
            device_type: vedge-C8000V
            features: "{{ branch_features }}"

        - name: Wait for convergence
          ansible.builtin.pause:
            seconds: 60

        - name: Validate SLA metrics
          cisco.sdwan.sla_check:
            device_id: "{{ inventory_hostname }}"
            sla_class: "interactive-video"
            max_latency_ms: 150
            max_jitter_ms: 30
            max_loss_pct: 1
          register: sla_result

        - name: Fail if SLA violated
          ansible.builtin.fail:
            msg: "SLA violation detected: {{ sla_result }}"
          when: not sla_result.within_threshold

      rescue:
        - name: Restore previous configuration
          cisco.sdwan.config_restore:
            device_id: "{{ inventory_hostname }}"
            src: "/tmp/backup-{{ inventory_hostname }}.json"

        - name: Notify team of rollback
          ansible.builtin.uri:
            url: "{{ slack_webhook_url }}"
            method: POST
            body_format: json
            body:
              text: "ROLLBACK: {{ inventory_hostname }} reverted after SLA violation. Check pipeline for details."

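        - name: Fail the host so the rollout stops with a non-zero exit code
          ansible.builtin.fail:
            msg: "{{ inventory_hostname }} was rolled back; aborting the remaining branches."

The serial: 1 directive ensures branches are configured one at a time. The rescue block restores the backup, notifies the team via Slack, and then fails the host explicitly. That last step matters: an error that is rescued does not count as a failure, so without it the play would report success after a rollback. With the explicit failure, max_fail_percentage: 0 stops the rollout at the first bad branch and the pipeline sees a non-zero exit code. This is not perfect, since there is still a window of degradation during convergence, but it is far safer than manual changes with no rollback plan.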

Common Pitfalls

After helping multiple enterprises adopt SD-WAN automation, we see the same mistakes repeatedly. Knowing them in advance can save months of wasted effort.

Vendor lock-in in automation tooling: building your entire automation stack on vendor-specific APIs means starting over when you switch platforms. Use abstraction layers where possible. Write playbooks that separate vendor-specific modules from business logic.
Testing only in the lab: a lab with three virtual routers does not replicate the complexity of 300 branches with varying circuit quality, asymmetric routing, and legacy equipment. Your test topology needs to model real-world failure conditions, not just happy-path connectivity.
Not handling partial failures: a deployment that succeeds on 198 of 200 branches is not a success. It is two branches with unknown state. Your pipeline must detect, report, and remediate partial failures automatically.
Treating network-as-code as a project: automation is not something you do once and declare victory. It is a practice that requires ongoing investment in tooling, testing, and team skills. The playbooks need maintenance. The pipeline needs updates as vendors change APIs. The team needs training as new engineers join.
Skipping the cultural change: the hardest part of network automation is not the technology. It is convincing network engineers that reviewing YAML in a pull request is a better use of their time than clicking through a GUI. Invest in training, pair programming, and celebrating early wins.
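Detecting partial failures starts with knowing the outcome for every host that should have been touched. The sketch below assumes per-host counters in the shape that Ansible's JSON stdout callback reports under its stats key (ok, failures, unreachable, rescued, and so on); classify_hosts and the expected parameter are hypothetical, and the sample numbers are invented.

```python
def classify_hosts(
    stats: dict[str, dict[str, int]],
    expected: list[str],
) -> dict[str, list[str]]:
    """Sort every expected host into exactly one outcome bucket."""
    result: dict[str, list[str]] = {
        "ok": [], "rolled_back": [], "failed": [],
        "unreachable": [], "not_attempted": [],
    }
    for host in sorted(expected):
        counters = stats.get(host)
        if counters is None:
            # The play stopped before reaching this host: state is unchanged,
            # but the change is still pending.
            result["not_attempted"].append(host)
        elif counters.get("unreachable", 0):
            result["unreachable"].append(host)
        elif counters.get("failures", 0):
            result["failed"].append(host)
        elif counters.get("rescued", 0):
            result["rolled_back"].append(host)
        else:
            result["ok"].append(host)
    return result


if __name__ == "__main__":
    stats = {
        "branch-nyc-001": {"ok": 5, "failures": 0, "unreachable": 0, "rescued": 0},
        "branch-chi-002": {"ok": 3, "failures": 1, "unreachable": 0, "rescued": 1},
        "campus-sfo-001": {"ok": 0, "failures": 0, "unreachable": 1, "rescued": 0},
    }
    expected = ["branch-nyc-001", "branch-chi-002",
                "campus-sfo-001", "store-atl-001"]
    for outcome, hosts in classify_hosts(stats, expected).items():
        print(outcome, hosts)
```

Comparing against the expected host list is the important design choice: hosts that never appear in the results are easy to overlook, and they are exactly the ones left in an unknown state.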

From Tickets to Pipelines

SD-WAN is a powerful technology, but technology alone does not solve operational problems. The vendors will sell you the overlay, the controller, and the dashboard. What they will not sell you is the automation, the testing, and the deployment discipline that makes it all work at scale. That is on you — or on the partner you choose to work with.

The path from ticket-driven network changes to pipeline-driven network-as-code is not an overnight change. It starts with putting your configuration in Git. Then adding a linter. Then a test topology. Then a staging environment. Each step reduces risk and increases velocity. Within six months, your team will wonder how they ever managed a WAN without it.

NubisCore helps enterprises automate SD-WAN deployments, from initial architecture and tooling selection through branch rollout automation and ongoing pipeline management. Whether you are starting from scratch or trying to bring order to an existing deployment, we can help.
