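1. Overview
1.1 Background
In modern enterprise IT architectures, infrastructure complexity is growing exponentially. Teams must manage cloud virtual machines, container clusters, and database instances alongside physical servers and network devices in on-premises data centers. No single automation tool covers all of this: Terraform excels at infrastructure provisioning but is a poor fit for configuration management and application deployment; Ansible is strong at configuration management and orchestration but lacks full lifecycle management for cloud resources.
Our team manages more than 500 cloud hosts across AWS, Azure, and Alibaba Cloud, plus more than 200 physical servers in an on-premises data center. We first tried to manage everything with Terraform, but application deployment, configuration updates, and patch management quickly became tedious. After introducing Ansible, we ran into state-management problems when creating and destroying cloud resources. After more than a year of practice, we distilled a set of best practices for hybrid Terraform + Ansible orchestration.
The core idea is simple: Terraform manages the infrastructure lifecycle (create, update, destroy), and Ansible manages configuration and application deployment. Terraform "builds the house"; Ansible "furnishes it and moves in". The two are integrated through dynamic inventory, Terraform output variables, and the local-exec provisioner, forming a complete automated operations workflow.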
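1.2 Technical Features
Declarative infrastructure management: Terraform describes infrastructure in HCL (HashiCorp Configuration Language), a declarative language. It tracks resource changes through a state file and supports plan previews before changes are applied. Compared with imperative scripts, declarative definitions are easier to maintain and audit.
Idempotent configuration management: Ansible playbooks built from idempotent modules can be run repeatedly without side effects. This matters for configuration-drift repair and disaster recovery, where a playbook can be re-run without fear of damaging the existing configuration.
Unified multi-cloud abstraction: Terraform has providers for 200+ cloud vendors, so AWS EC2, Azure VM, and Alibaba Cloud ECS can all be managed with the same syntax. Ansible's cloud modules also cover the mainstream platforms, and together the two tools enable multi-cloud management.
State management and locking: the Terraform state file records the current state of the infrastructure. Remote backends (S3, Consul, Terraform Cloud) and state locking (DynamoDB, Consul) prevent conflicts when a team works on the same infrastructure.
Modularity and reuse: Terraform modules and Ansible roles package common infrastructure patterns and configuration tasks so they can be reused across projects.
Dynamic inventory integration: Ansible can build its host list dynamically from cloud platforms, Terraform outputs, a CMDB, and other data sources. There is no static inventory file to maintain by hand, and the host list follows infrastructure changes automatically.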
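1.3 Use Cases
Multi-cloud and hybrid architectures: an organization runs on several cloud platforms plus on-premises data centers and needs one automation toolchain. In our setup, Terraform manages VPCs, subnets, and security groups across clouds, and Ansible deploys applications everywhere, which greatly reduces the complexity of multi-cloud management.
Large-scale server cluster deployment: hundreds of servers must be provisioned and configured quickly. Terraform creates cloud hosts in parallel, and Ansible discovers the new hosts through dynamic inventory and configures them. The whole process can finish within 30 minutes.
Immutable infrastructure: following the "infrastructure as code" approach, every change goes through a code commit and a CI/CD pipeline. Terraform's plan preview shows the blast radius before a change, and Ansible's check mode verifies configuration changes before they are applied.
Disaster recovery and environment replication: a full environment must be reproduced quickly in another region or cloud. Because all infrastructure and configuration live in Git, a new environment can be rebuilt by adjusting a few parameters. We completed a cross-region migration of the entire production environment within hours.
Compliance and auditing: industries such as finance and healthcare have strict audit requirements for infrastructure changes. Every change is recorded as a Git commit, the Terraform state file provides a complete resource inventory, and Ansible logs record each operation.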
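1.4 Environment Requirements
| Component | Version | Notes |
|---|---|---|
| Terraform | 1.5+ | 1.6+ recommended for better performance and features |
| Ansible | 2.14+ (ansible-core) | 2.15+ recommended; requires Python 3.9+ |
| Python | 3.9+ | Ansible runtime; install SDKs such as boto3 and the Azure SDK |
| Git | 2.30+ | Version control |
| AWS CLI | 2.x | For AWS resource management (optional) |
| Azure CLI | 2.x | For Azure resource management (optional) |
| Cloud accounts | - | AWS/Azure/GCP/Alibaba Cloud accounts with the corresponding IAM permissions |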
Recommended hardware:
| Environment | Control node | Managed nodes | Notes |
|---|---|---|---|
| Development | 2 vCPU / 4 GB | 1 vCPU / 2 GB × 5 | Suitable for feature validation and learning |
| Testing | 4 vCPU / 8 GB | 2 vCPU / 4 GB × 20 | Suitable for load and integration testing |
| Production | 8 vCPU / 16 GB | As required | The control node should be highly available |
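Network requirements:
The control node must reach the cloud platform APIs (HTTPS, port 443)
The control node must reach the managed nodes over SSH (TCP port 22)
With a Terraform remote backend, the control node must reach S3, Consul, or whichever service hosts the state
With Ansible Tower/AWX, the web interface must be reachable (HTTP/HTTPS)
Managed nodes must reach the package repositories (yum/apt mirrors)
Permission requirements:
Terraform: administrator access on the cloud platform, or a fine-grained IAM policy (create, modify, and delete permissions for EC2, VPC, RDS, S3, and similar resources)
Ansible: SSH access to the managed nodes; key-based authentication is recommended over passwords
State storage: with S3 as the Terraform backend, read/write access to the bucket and access to the DynamoDB lock table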
2. Detailed Steps
2.1 Preparation
2.1.1 System Checks
# Check the operating system version
cat /etc/os-release
# Check the Python version (3.9+ required)
python3 --version
# Check disk space
df -h
# Check network connectivity
ping -c 3 registry.terraform.io
ping -c 3 github.com
2.1.2 Installing Terraform
# Download Terraform (version 1.6.6 as an example)
wget https://releases.hashicorp.com/terraform/1.6.6/terraform_1.6.6_linux_amd64.zip
# Unpack and install
unzip terraform_1.6.6_linux_amd64.zip
sudo mv terraform /usr/local/bin/
# Verify the installation
terraform version
# Expected output:
# Terraform v1.6.6
# on linux_amd64
# Enable shell completion
terraform -install-autocomplete
2.1.3 Installing Ansible
# Install Python dependencies
sudo apt update
sudo apt install -y python3-pip python3-venv
# Create a virtual environment (recommended)
python3 -m venv ~/ansible-venv
source ~/ansible-venv/bin/activate
# Install Ansible (2.15.8 is an ansible-core version number)
pip3 install ansible-core==2.15.8
# Install cloud platform SDKs
pip3 install boto3 botocore # AWS
pip3 install azure-cli # Azure
pip3 install google-auth # GCP
# Verify the installation
ansible --version
# Expected output:
# ansible [core 2.15.8]
# config file = None
# configured module search path = ['/home/user/.ansible/plugins/modules']
# ansible python module location = /home/user/ansible-venv/lib/python3.9/site-packages/ansible
# ansible collection location = /home/user/.ansible/collections
# executable location = /home/user/ansible-venv/bin/ansible
# python version = 3.9.x
2.1.4 Configuring Cloud Credentials
AWS credentials:
# Install the AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
# Configure AWS credentials
aws configure
# AWS Access Key ID: YOUR_ACCESS_KEY
# AWS Secret Access Key: YOUR_SECRET_KEY
# Default region name: us-east-1
# Default output format: json
# Verify the configuration
aws sts get-caller-identity
# Expected output:
# {
# "UserId": "AIDAXXXXXXXXXX",
# "Account": "123456789012",
# "Arn": "arn:aws:iam::123456789012:user/terraform"
# }
Azure credentials:
# Install the Azure CLI
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
# Log in to Azure
az login
# Set the default subscription
az account set --subscription "YOUR_SUBSCRIPTION_ID"
# Create a service principal (used by Terraform)
az ad sp create-for-rbac --name "terraform-sp" --role="Contributor" --scopes="/subscriptions/YOUR_SUBSCRIPTION_ID"
# Record the appId, password, and tenant values from the output
2.1.5 Initializing the Project Structure
# Create the project directories
mkdir -p ~/infra-automation/{terraform,ansible,scripts}
cd ~/infra-automation
# Create the Terraform directory layout
mkdir -p terraform/{modules,environments/{dev,staging,prod}}
# Create the Ansible directory layout
mkdir -p ansible/{inventories,playbooks,roles,group_vars,host_vars}
# Create the Git repository
git init
cat > .gitignore <
2.2 Core Configuration
2.2.1 Configuring the Terraform Remote Backend
Use S3 and DynamoDB for state storage and locking:
# File: terraform/backend.tf
terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
    # The s3 backend has no versioning argument; versioning is enabled
    # on the bucket itself with put-bucket-versioning.
  }
  required_version = ">= 1.5.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}
Create the S3 bucket and the DynamoDB table:
# Create the S3 bucket
aws s3api create-bucket \
  --bucket my-terraform-state-bucket \
  --region us-east-1
# Enable versioning
aws s3api put-bucket-versioning \
  --bucket my-terraform-state-bucket \
  --versioning-configuration Status=Enabled
# Enable encryption
aws s3api put-bucket-encryption \
  --bucket my-terraform-state-bucket \
  --server-side-encryption-configuration '{
    "Rules": [{
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "AES256"
      }
    }]
  }'
# Create the DynamoDB lock table
aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region us-east-1
2.2.2 Creating Terraform Modules
Example VPC module:
# File: terraform/modules/vpc/main.tf
variable "vpc_cidr" {
description = "VPC CIDR block"
type = string
default = "10.0.0.0/16"
}
variable "environment" {
description = "Environment name"
type = string
}
variable "availability_zones" {
description = "List of availability zones"
type = list(string)
default = ["us-east-1a", "us-east-1b", "us-east-1c"]
}
resource "aws_vpc" "main" {
cidr_block = var.vpc_cidr
enable_dns_hostnames = true
enable_dns_support = true
tags = {
Name = "${var.environment}-vpc"
Environment = var.environment
ManagedBy = "Terraform"
}
}
resource "aws_subnet" "public" {
count = length(var.availability_zones)
vpc_id = aws_vpc.main.id
cidr_block = cidrsubnet(var.vpc_cidr, 8, count.index)
availability_zone = var.availability_zones[count.index]
map_public_ip_on_launch = true
tags = {
Name = "${var.environment}-public-subnet-${count.index + 1}"
Environment = var.environment
Type = "public"
}
}
resource "aws_subnet" "private" {
count = length(var.availability_zones)
vpc_id = aws_vpc.main.id
cidr_block = cidrsubnet(var.vpc_cidr, 8, count.index + 10)
availability_zone = var.availability_zones[count.index]
tags = {
Name = "${var.environment}-private-subnet-${count.index + 1}"
Environment = var.environment
Type = "private"
}
}
resource "aws_internet_gateway" "main" {
vpc_id = aws_vpc.main.id
tags = {
Name = "${var.environment}-igw"
Environment = var.environment
}
}
resource "aws_route_table" "public" {
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.main.id
}
tags = {
Name = "${var.environment}-public-rt"
Environment = var.environment
}
}
resource "aws_route_table_association" "public" {
count = length(aws_subnet.public)
subnet_id = aws_subnet.public[count.index].id
route_table_id = aws_route_table.public.id
}
output "vpc_id" {
description = "VPC ID"
value = aws_vpc.main.id
}
output "public_subnet_ids" {
description = "Public subnet IDs"
value = aws_subnet.public[*].id
}
output "private_subnet_ids" {
description = "Private subnet IDs"
value = aws_subnet.private[*].id
}
Example EC2 instance module:
# File: terraform/modules/ec2/main.tf
variable "instance_count" {
description = "Number of instances to create"
type = number
default = 1
}
variable "instance_type" {
description = "EC2 instance type"
type = string
default = "t3.medium"
}
variable "ami_id" {
description = "AMI ID"
type = string
}
variable "subnet_ids" {
description = "List of subnet IDs"
type = list(string)
}
variable "vpc_id" {
description = "VPC ID"
type = string
}
variable "key_name" {
description = "SSH key pair name"
type = string
}
variable "environment" {
description = "Environment name"
type = string
}
variable "application" {
description = "Application name"
type = string
}
# Security group
resource "aws_security_group" "instance" {
name = "${var.environment}-${var.application}-sg"
description = "Security group for ${var.application}"
vpc_id = var.vpc_id
ingress {
description = "SSH from anywhere"
from_port = 22
to_port = 22
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
ingress {
description = "HTTP from anywhere"
from_port = 80
to_port = 80
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
ingress {
description = "HTTPS from anywhere"
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
egress {
description = "Allow all outbound"
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = {
Name = "${var.environment}-${var.application}-sg"
Environment = var.environment
Application = var.application
}
}
# EC2 instances
resource "aws_instance" "app" {
count = var.instance_count
ami = var.ami_id
instance_type = var.instance_type
subnet_id = var.subnet_ids[count.index % length(var.subnet_ids)]
vpc_security_group_ids = [aws_security_group.instance.id]
key_name = var.key_name
root_block_device {
volume_type = "gp3"
volume_size = 50
delete_on_termination = true
encrypted = true
}
tags = {
Name = "${var.environment}-${var.application}-${count.index + 1}"
Environment = var.environment
Application = var.application
ManagedBy = "Terraform"
}
lifecycle {
create_before_destroy = true
}
}
output "instance_ids" {
description = "EC2 instance IDs"
value = aws_instance.app[*].id
}
output "instance_public_ips" {
description = "EC2 instance public IPs"
value = aws_instance.app[*].public_ip
}
output "instance_private_ips" {
description = "EC2 instance private IPs"
value = aws_instance.app[*].private_ip
}
2.2.3 Configuring the Environment Files
# File: terraform/environments/prod/main.tf
provider "aws" {
region = var.aws_region
default_tags {
tags = {
Environment = "production"
ManagedBy = "Terraform"
Project = "infra-automation"
}
}
}
module "vpc" {
source = "../../modules/vpc"
vpc_cidr = "10.0.0.0/16"
environment = "prod"
availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
}
module "web_servers" {
source = "../../modules/ec2"
instance_count = 3
instance_type = "t3.medium"
ami_id = var.ami_id
subnet_ids = module.vpc.public_subnet_ids
vpc_id = module.vpc.vpc_id
key_name = var.key_name
environment = "prod"
application = "web"
}
# Hand the instance IPs to Ansible as a generated inventory file
resource "local_file" "ansible_inventory" {
content = templatefile("${path.module}/inventory.tpl", {
web_servers = module.web_servers.instance_public_ips
})
filename = "${path.module}/../../../ansible/inventories/prod/hosts"
}
output "vpc_id" {
value = module.vpc.vpc_id
}
output "web_server_ips" {
value = module.web_servers.instance_public_ips
}
# File: terraform/environments/prod/variables.tf
variable "aws_region" {
description = "AWS region"
type = string
default = "us-east-1"
}
variable "ami_id" {
description = "AMI ID for EC2 instances"
type = string
}
variable "key_name" {
description = "SSH key pair name"
type = string
}
# File: terraform/environments/prod/terraform.tfvars
aws_region = "us-east-1"
ami_id = "ami-0c55b159cbfafe1f0" # Amazon Linux 2
key_name = "prod-key"
Ansible inventory template:
# File: terraform/environments/prod/inventory.tpl
[web_servers]
%{ for ip in web_servers ~}
${ip} ansible_user=ec2-user ansible_ssh_private_key_file=~/.ssh/prod-key.pem
%{ endfor ~}
[web_servers:vars]
ansible_python_interpreter=/usr/bin/python3
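To check the inventory format without running Terraform, the same layout can be rendered from a plain list of IPs. This is a minimal sketch: `render_inventory` is a hypothetical helper, not part of the original project, and the user name and key path simply mirror the template.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints an INI inventory in the same layout as inventory.tpl.
# Arguments: one IP address per web server.
render_inventory() {
  echo "[web_servers]"
  local ip
  for ip in "$@"; do
    # "~" stays literal inside double quotes; Ansible expands it later.
    echo "${ip} ansible_user=ec2-user ansible_ssh_private_key_file=~/.ssh/prod-key.pem"
  done
  echo
  echo "[web_servers:vars]"
  echo "ansible_python_interpreter=/usr/bin/python3"
}

render_inventory 54.123.45.67 54.123.45.68
```

Redirecting the output to a scratch file and running `ansible-inventory -i <file> --list` against it is one way to confirm that Ansible parses the format as intended.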
2.3 Deployment and Verification
2.3.1 Running the Terraform Deployment
# Enter the environment directory
cd terraform/environments/prod
# Initialize Terraform
terraform init
# Expected output:
# Initializing the backend...
# Successfully configured the backend "s3"!
# Initializing modules...
# Initializing provider plugins...
# Terraform has been successfully initialized!
# Validate the configuration
terraform validate
# Preview the changes
terraform plan
# Expected output:
# Terraform will perform the following actions:
# # module.vpc.aws_vpc.main will be created
# # module.vpc.aws_subnet.public[0] will be created
# # ...
# Plan: 15 to add, 0 to change, 0 to destroy.
# Apply the changes
terraform apply
# Type yes to confirm
# Expected output:
# Apply complete! Resources: 15 added, 0 changed, 0 destroyed.
# Outputs:
# vpc_id = "vpc-0123456789abcdef0"
# web_server_ips = [
# "54.123.45.67",
# "54.123.45.68",
# "54.123.45.69",
# ]
2.3.2 Verifying the Ansible Inventory
# Show the generated inventory file
cat ../../../ansible/inventories/prod/hosts
# Expected output:
# [web_servers]
# 54.123.45.67 ansible_user=ec2-user ansible_ssh_private_key_file=~/.ssh/prod-key.pem
# 54.123.45.68 ansible_user=ec2-user ansible_ssh_private_key_file=~/.ssh/prod-key.pem
# 54.123.45.69 ansible_user=ec2-user ansible_ssh_private_key_file=~/.ssh/prod-key.pem
#
# [web_servers:vars]
# ansible_python_interpreter=/usr/bin/python3
# Test connectivity
cd ../../../ansible
ansible web_servers -i inventories/prod/hosts -m ping
# Expected output:
# 54.123.45.67 | SUCCESS => {
# "changed": false,
# "ping": "pong"
# }
# 54.123.45.68 | SUCCESS => {
# "changed": false,
# "ping": "pong"
# }
# 54.123.45.69 | SUCCESS => {
# "changed": false,
# "ping": "pong"
# }
2.3.3 Running the Ansible Configuration
# Run the playbook (covered in detail in a later section)
ansible-playbook -i inventories/prod/hosts playbooks/web-setup.yml
# Verify the service status
ansible web_servers -i inventories/prod/hosts -m shell -a "systemctl status nginx"
3. Example Code and Configuration
3.1 Complete Configuration Examples
3.1.1 Ansible Playbook Configuration
Web server deployment playbook:
# File: ansible/playbooks/web-setup.yml
---
- name: Setup Web Servers
  hosts: web_servers
  become: yes
  vars:
    nginx_version: "1.24.0"
    app_user: "webapp"
    app_dir: "/opt/webapp"
  tasks:
    - name: Update system packages
      yum:
        name: "*"
        state: latest
        update_cache: yes
    - name: Install required packages
      yum:
        name:
          - nginx
          - python3
          - python3-pip
          - git
        state: present
    - name: Create application user
      user:
        name: "{{ app_user }}"
        system: yes
        shell: /bin/bash
        home: "{{ app_dir }}"
    - name: Create application directory
      file:
        path: "{{ app_dir }}"
        state: directory
        owner: "{{ app_user }}"
        group: "{{ app_user }}"
        mode: '0755'
    - name: Configure Nginx
      template:
        src: ../templates/nginx.conf.j2
        dest: /etc/nginx/nginx.conf
        owner: root
        group: root
        mode: '0644'
      notify: Restart Nginx
    - name: Start and enable Nginx
      systemd:
        name: nginx
        state: started
        enabled: yes
  handlers:
    - name: Restart Nginx
      systemd:
        name: nginx
        state: restarted
Nginx configuration template:
# File: ansible/templates/nginx.conf.j2
user {{ app_user }};
worker_processes auto;
error_log /var/log/nginx/error.log;
pid /run/nginx.pid;
events {
worker_connections 1024;
}
http {
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for"';
access_log /var/log/nginx/access.log main;
sendfile on;
tcp_nopush on;
tcp_nodelay on;
keepalive_timeout 65;
types_hash_max_size 2048;
include /etc/nginx/mime.types;
default_type application/octet-stream;
server {
listen 80 default_server;
listen [::]:80 default_server;
server_name _;
root {{ app_dir }}/public;
location / {
try_files $uri $uri/ =404;
}
error_page 404 /404.html;
location = /404.html {
}
error_page 500 502 503 504 /50x.html;
location = /50x.html {
}
}
}
3.1.2 Ansible Role Structure
Create a reusable Ansible role:
# Scaffold the role
cd ansible/roles
ansible-galaxy init webserver
# Generated directory layout:
# webserver/
# ├── defaults/
# │ └── main.yml
# ├── files/
# ├── handlers/
# │ └── main.yml
# ├── meta/
# │ └── main.yml
# ├── tasks/
# │ └── main.yml
# ├── templates/
# ├── tests/
# │ ├── inventory
# │ └── test.yml
# └── vars/
# └── main.yml
Role tasks:
# File: ansible/roles/webserver/tasks/main.yml
---
- name: Include OS-specific variables
  include_vars: "{{ ansible_os_family }}.yml"
- name: Install webserver packages
  package:
    name: "{{ webserver_packages }}"
    state: present
- name: Configure webserver
  template:
    src: "{{ webserver_config_template }}"
    dest: "{{ webserver_config_path }}"
    owner: root
    group: root
    mode: '0644'
  notify: restart webserver
- name: Ensure webserver is running
  service:
    name: "{{ webserver_service_name }}"
    state: started
    enabled: yes
Role variables:
# File: ansible/roles/webserver/defaults/main.yml
---
webserver_port: 80
webserver_user: nginx
webserver_worker_processes: auto
webserver_worker_connections: 1024
# File: ansible/roles/webserver/vars/RedHat.yml
---
webserver_packages:
  - nginx
  - nginx-mod-stream
webserver_service_name: nginx
webserver_config_path: /etc/nginx/nginx.conf
webserver_config_template: nginx-redhat.conf.j2
3.1.3 Terraform + Ansible Integration Script
Automated deployment script:
#!/bin/bash
# File: scripts/deploy.sh
# Purpose: run the Terraform and Ansible deployment end to end
set -e
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(dirname "$SCRIPT_DIR")"
ENVIRONMENT="${1:-prod}"
echo "==> Deploying environment: ${ENVIRONMENT}"
# 1. Run the Terraform deployment
echo "==> Step 1: run the Terraform deployment"
cd "${PROJECT_ROOT}/terraform/environments/${ENVIRONMENT}"
terraform init -upgrade
terraform validate
terraform plan -out=tfplan
read -r -p "Apply the Terraform changes? (yes/no): " confirm
if [ "$confirm" != "yes" ]; then
  echo "Deployment cancelled"
  exit 1
fi
terraform apply tfplan
rm -f tfplan
# 2. Wait for the instances to become ready
echo "==> Step 2: wait for the EC2 instances to become ready"
sleep 30
# 3. Run the Ansible configuration
echo "==> Step 3: run the Ansible configuration"
cd "${PROJECT_ROOT}/ansible"
INVENTORY_FILE="inventories/${ENVIRONMENT}/hosts"
# Test connectivity
echo "==> Testing SSH connectivity"
ansible all -i "${INVENTORY_FILE}" -m ping
# Run the playbook
echo "==> Running the playbook"
ansible-playbook -i "${INVENTORY_FILE}" playbooks/web-setup.yml
# 4. Verify the deployment
echo "==> Step 4: verify the deployment"
ansible web_servers -i "${INVENTORY_FILE}" -m shell -a "systemctl status nginx"
echo "==> Deployment complete!"
echo "==> Server addresses:"
cd "${PROJECT_ROOT}/terraform/environments/${ENVIRONMENT}"
terraform output web_server_ips
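The fixed `sleep 30` in step 2 is a guess at how long the instances need before SSH is available. A polling loop is usually more robust. The sketch below assumes a hypothetical `retry` helper that is not part of the original script; it re-runs a command until it succeeds or the attempt budget runs out.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the original deploy.sh.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or MAX_ATTEMPTS attempts have been made.
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Possible replacement for `sleep 30` in deploy.sh (up to 20 tries, 15 s apart):
# retry 20 15 ansible all -i "${INVENTORY_FILE}" -m ping
```

Because the command runs as the condition of `until`, a failing attempt does not trip `set -e`. Ansible's built-in `wait_for_connection` module is an alternative that keeps the wait inside the playbook.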
3.2 Real-World Use Cases
Case 1: Multi-environment infrastructure management
Scenario: three environments (development, testing, production) must be managed, each with a slightly different configuration (instance count, instance size, network settings, and so on).
Implementation steps:
Create environment-specific variable files:
# File: terraform/environments/dev/terraform.tfvars
aws_region = "us-east-1"
ami_id = "ami-0c55b159cbfafe1f0"
key_name = "dev-key"
instance_count = 1
instance_type = "t3.small"
vpc_cidr = "10.1.0.0/16"
# File: terraform/environments/prod/terraform.tfvars
aws_region = "us-east-1"
ami_id = "ami-0c55b159cbfafe1f0"
key_name = "prod-key"
instance_count = 5
instance_type = "t3.large"
vpc_cidr = "10.0.0.0/16"
Manage the environments with workspaces:
# Create the dev workspace (workspace new also switches to it)
terraform workspace new dev
terraform workspace select dev
terraform apply -var-file="environments/dev/terraform.tfvars"
# Switch to production (the prod workspace must already exist; create it once with terraform workspace new prod)
terraform workspace select prod
terraform apply -var-file="environments/prod/terraform.tfvars"
# List all workspaces
terraform workspace list
Results:
# Development
$ terraform workspace select dev
$ terraform apply
Apply complete! Resources: 8 added, 0 changed, 0 destroyed.
Outputs:
vpc_id = "vpc-dev-0123456789"
web_server_ips = ["54.123.45.10"]
# Production
$ terraform workspace select prod
$ terraform apply
Apply complete! Resources: 20 added, 0 changed, 0 destroyed.
Outputs:
vpc_id = "vpc-prod-0123456789"
web_server_ips = [
"54.123.45.67",
"54.123.45.68",
"54.123.45.69",
"54.123.45.70",
"54.123.45.71",
]
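4. Best Practices and Caveats
4.1 Best Practices
4.1.1 Performance Tuning
Tuning point 1: Terraform parallelism
Terraform creates resources in parallel by default, and the degree of parallelism can be adjusted:
# Raise the parallelism (default is 10)
terraform apply -parallelism=20
# For large deployments, a higher value can shorten the run considerably,
# but watch out for API rate limits on the cloud platform
In our environment, when managing 500 EC2 instances, raising the parallelism to 30 cut deployment time from 45 minutes to 15 minutes.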
Tuning point 2: Ansible execution efficiency
# File: ansible/ansible.cfg
[defaults]
# Raise the number of parallel forks (default is 5)
forks = 50
# Enable fact caching
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 86400
# Enable SSH connection reuse
[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
pipelining = True
With these settings, configuration time for 100 servers dropped from 30 minutes to 8 minutes.
Tuning point 3: state file layout
# Use partial backend configuration
# File: terraform/backend-config/prod.hcl
bucket = "my-terraform-state-bucket"
key = "prod/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "terraform-state-lock"
# Pass the backend configuration at init time
terraform init -backend-config=backend-config/prod.hcl
# Each environment then uses its own state file, which avoids conflicts
4.1.2 Security Hardening
Security measure 1: encrypt sensitive data with Ansible Vault
# Create an encrypted file
ansible-vault create ansible/group_vars/all/vault.yml
# Edit the encrypted file
ansible-vault edit ansible/group_vars/all/vault.yml
# Example file content
---
vault_db_password: "super_secret_password"
vault_api_key: "api_key_12345"
vault_aws_access_key: "AKIAIOSFODNN7EXAMPLE"
vault_aws_secret_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
# Reference the values in a playbook
# File: ansible/playbooks/app-deploy.yml
---
- name: Deploy Application
  hosts: app_servers
  vars_files:
    - ../group_vars/all/vault.yml
  tasks:
    - name: Configure database connection
      template:
        src: db_config.j2
        dest: /etc/app/db.conf
      vars:
        db_password: "{{ vault_db_password }}"
# Supply the vault password when running the playbook
ansible-playbook -i inventories/prod/hosts playbooks/app-deploy.yml --ask-vault-pass
# Or use a password file (keep it out of version control)
echo "my_vault_password" > .vault_pass
chmod 600 .vault_pass
ansible-playbook -i inventories/prod/hosts playbooks/app-deploy.yml --vault-password-file .vault_pass
Security measure 2: protect sensitive Terraform outputs
# File: terraform/environments/prod/outputs.tf
output "db_password" {
  description = "Database password"
  value       = random_password.db_password.result
  sensitive   = true # marked sensitive: redacted in CLI output (still stored in the state file)
}
output "rds_endpoint" {
  description = "RDS endpoint"
  value       = aws_db_instance.main.endpoint
}
# Read the sensitive output
terraform output -json | jq -r '.db_password.value'
Security measure 3: principle of least privilege
{
"Version":"2012-10-17",
"Statement": [
{
"Effect":"Allow",
"Action": [
"ec2:Describe*",
"ec2:CreateTags",
"ec2:RunInstances",
"ec2:TerminateInstances",
"ec2:StopInstances",
"ec2:StartInstances"
],
"Resource":"*",
"Condition": {
"StringEquals": {
"aws:RequestedRegion":"us-east-1"
}
}
},
{
"Effect":"Allow",
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Resource":"arn:aws:s3:::my-terraform-state-bucket/*"
},
{
"Effect":"Allow",
"Action": [
"dynamodb:GetItem",
"dynamodb:PutItem",
"dynamodb:DeleteItem"
],
"Resource":"arn:aws:dynamodb:us-east-1:123456789012:table/terraform-state-lock"
}
]
}
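4.1.3 High Availability
HA option 1: multi-region deployment
# File: terraform/modules/multi-region/main.tf
# Provider references in a module's providers map must be static, so each
# region gets its own aliased provider and its own module block.
provider "aws" {
  alias  = "us_east_1"
  region = "us-east-1"
}
provider "aws" {
  alias  = "us_west_2"
  region = "us-west-2"
}
module "vpc_us_east_1" {
  source             = "../vpc"
  providers          = { aws = aws.us_east_1 }
  vpc_cidr           = "10.0.0.0/16"
  environment        = "prod"
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
}
module "vpc_us_west_2" {
  source             = "../vpc"
  providers          = { aws = aws.us_west_2 }
  vpc_cidr           = "10.1.0.0/16"
  environment        = "prod"
  availability_zones = ["us-west-2a", "us-west-2b", "us-west-2c"]
}
# eu-west-1 follows the same pattern (for example with vpc_cidr = "10.2.0.0/16").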
HA option 2: automatic failover
# File: ansible/playbooks/ha-setup.yml
---
- name: Setup High Availability
  hosts: web_servers
  become: yes
  tasks:
    - name: Install Keepalived
      yum:
        name: keepalived
        state: present
    - name: Configure Keepalived
      template:
        src: ../templates/keepalived.conf.j2
        dest: /etc/keepalived/keepalived.conf
      notify: Restart Keepalived
    - name: Start Keepalived
      systemd:
        name: keepalived
        state: started
        enabled: yes
  handlers:
    - name: Restart Keepalived
      systemd:
        name: keepalived
        state: restarted
4.2 Caveats
4.2.1 Configuration Caveats
Warning: the following configuration mistakes can cause production outages or data loss.
Caveat 1: avoid Terraform state file conflicts
When several team members run terraform apply at the same time, the state file can end up inconsistent. Always use state locking:
# Wrong: local state file
# State becomes inconsistent when several people collaborate
# Right: remote backend with locking
terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock" # the lock table is required
  }
}
If the lock is already held:
# Inspect the lock entry (the item whose LockID ends in -md5 stores the state digest, not the lock)
aws dynamodb get-item \
  --table-name terraform-state-lock \
  --key '{"LockID":{"S":"my-terraform-state-bucket/prod/terraform.tfstate"}}'
# Force-unlock (use with care; make sure nobody else is running Terraform)
# <LOCK_ID> is the lock ID printed in Terraform's lock error message
terraform force-unlock <LOCK_ID>
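Caveat 2: verify Ansible idempotency
Not every Ansible module is idempotent. Take particular care with the shell and command modules:
# Wrong: non-idempotent operation (appends a duplicate line on every run)
- name: Add line to file
  shell: echo "new line" >> /etc/config.conf
# Right: use an idempotent module
- name: Add line to file
  lineinfile:
    path: /etc/config.conf
    line: "new line"
    state: present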
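When a shell step cannot be avoided, the same guarantee can be built by hand: check for the desired state first and change the file only if needed. A minimal sketch, assuming a hypothetical `ensure_line` helper and a temporary demo file instead of a real configuration file:

```shell
#!/usr/bin/env bash
# Hypothetical helper: append LINE to FILE only if no identical line exists yet.
# Usage: ensure_line FILE LINE
ensure_line() {
  local file="$1" line="$2"
  touch "$file"
  # -q: quiet, -x: match the whole line, -F: fixed string (no regex)
  if ! grep -qxF -- "$line" "$file"; then
    printf '%s\n' "$line" >> "$file"
  fi
}

demo_file="$(mktemp)"
ensure_line "$demo_file" "new line"
ensure_line "$demo_file" "new line"   # second call finds the line and changes nothing
grep -c "new line" "$demo_file"       # counts how many times the line is present
```

This is the same check-then-change pattern that `lineinfile` applies internally; inside a playbook the module remains the better choice because it also reports `changed` correctly.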
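Caveat 3: manage Terraform resource dependencies
# Declare dependencies explicitly where Terraform cannot infer them
resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = "t3.medium"
  subnet_id     = aws_subnet.public[0].id
  # Ensure the VPC and subnets are created first. The subnet_id reference
  # already implies the subnet dependency; depends_on makes the order explicit.
  depends_on = [
    aws_vpc.main,
    aws_subnet.public
  ]
}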
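4.2.2 Common Errors
| Symptom | Cause | Solution |
|---|---|---|
| terraform apply hangs | The state lock is held, or there is a network problem | Check the DynamoDB lock table; release a stale lock with terraform force-unlock <LOCK_ID> |
| Ansible connection times out | Wrong SSH key permissions or a security group misconfiguration | Fix key permissions with chmod 600 key.pem and verify the security group rules |
| Terraform state drift | Cloud resources were modified by hand | Update the state with terraform apply -refresh-only (terraform refresh is deprecated), or bring resources in with terraform import |
| Ansible task fails without an error message | Errors are suppressed, for example by ignore_errors | Remove ignore_errors and inspect the task's return values |
| terraform destroy fails | Resource dependencies or deletion protection | Remove the resource from state with terraform state rm, then delete it manually |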
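4.2.3 Compatibility
Version compatibility:
Terraform 1.x syntax is not fully compatible with 0.x; test before upgrading
Ansible 2.10+ uses collections, so some module paths have changed
AWS Provider 5.x introduced breaking changes to the arguments of some resources
Pin versions in .terraform-version and requirements.txt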
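Pinned versions are easier to enforce when a script checks them before a run. The sketch below assumes a hypothetical `version_ge` helper and relies on `sort -V` (available in GNU coreutils); the version strings are examples, and in practice they would come from `terraform version` and `ansible --version`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when ACTUAL >= MINIMUM for dotted numeric versions.
# Usage: version_ge ACTUAL MINIMUM
version_ge() {
  local actual="$1" minimum="$2"
  # sort -V orders version strings; ACTUAL is new enough when MINIMUM sorts first.
  [ "$(printf '%s\n%s\n' "$minimum" "$actual" | sort -V | head -n1)" = "$minimum" ]
}

version_ge "1.6.6" "1.5.0" && echo "terraform version ok"
version_ge "2.14.0" "2.15.0" || echo "ansible-core too old"
```

A check like this fits at the top of a deployment script or in a CI job, so that a mismatched toolchain fails early instead of partway through an apply.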
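Cloud platform compatibility:
AWS China regions use different API endpoints from the global regions and need dedicated configuration
Azure China requires the environment variable AZURE_ENVIRONMENT=AzureChinaCloud
The Alibaba Cloud Terraform provider covers fewer features, so some resources must be managed by hand
Operating system compatibility:
Ansible has limited support for Windows as a control node; use WSL or a Linux control node
Linux distributions use different package managers (yum/apt/zypper), which playbooks must account for
Some tools on macOS (such as sed) behave differently from their Linux versions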
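5. Troubleshooting and Monitoring
5.1 Troubleshooting
5.1.1 Viewing Logs
# Terraform logs
# Enable verbose logging
export TF_LOG=DEBUG
export TF_LOG_PATH=./terraform-debug.log
terraform apply
# Inspect the Terraform state
terraform show
terraform state list
# Quote the address so the shell does not interpret the brackets
terraform state show 'module.web_servers.aws_instance.app[0]'
# Ansible logs
# Enable verbose output
ansible-playbook -i inventories/prod/hosts playbooks/web-setup.yml -vvv
# Follow the Ansible log file (written only when log_path is set in ansible.cfg)
tail -f /var/log/ansible.log
# Gather facts from the hosts and filter for the distribution
ansible web_servers -i inventories/prod/hosts -m setup | grep ansible_distribution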
5.1.2 Common Problems
Problem 1: Terraform state lock
# Diagnostic commands
# 1. List the entries in the lock table
aws dynamodb scan --table-name terraform-state-lock
# 2. Show the lock details (the -md5 item stores the state digest, not the lock)
aws dynamodb get-item \
  --table-name terraform-state-lock \
  --key '{"LockID":{"S":"my-terraform-state-bucket/prod/terraform.tfstate"}}'
# 3. Once you are sure nobody else is running Terraform, force-unlock
#    (<LOCK_ID> is the lock ID printed in the lock error message)
terraform force-unlock <LOCK_ID>
# 4. If the DynamoDB table is damaged, recreate it
aws dynamodb delete-table --table-name terraform-state-lock
aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
Solutions:
Coordinate within the team before running Terraform to avoid concurrent operations
Run Terraform jobs serially through a CI/CD pipeline
Clean up stale lock records regularly
Problem 2: Ansible connection failures
# Diagnostic steps
# 1. Test the SSH connection
ssh -i ~/.ssh/prod-key.pem ec2-user@54.123.45.67
# 2. Check the key permissions
ls -la ~/.ssh/prod-key.pem
# Should be -rw------- (600)
# 3. Fix the permissions
chmod 600 ~/.ssh/prod-key.pem
# 4. Test the Ansible connection
ansible web_servers -i inventories/prod/hosts -m ping -vvv
# 5. Check the security group rules
aws ec2 describe-security-groups \
  --group-ids sg-0123456789abcdef0 \
  --query 'SecurityGroups[0].IpPermissions'
# 6. Add the SSH rule if it is missing (in production, narrow the CIDR to the control node's address)
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 22 \
  --cidr 0.0.0.0/0
Problem 3: importing resources into Terraform
Resources that were created by hand must be imported before Terraform can manage them:
# 1. Write the resource configuration
cat > import-resource.tf <
5.1.3 Debug Mode
Terraform debugging:
# Enable trace logging
export TF_LOG=TRACE
export TF_LOG_PATH=./terraform-trace.log
# Set the log level for provider plugins
export TF_LOG_PROVIDER=TRACE
# Run the operation
terraform apply
# Analyze the log
grep "ERROR" terraform-trace.log
grep "aws_instance" terraform-trace.log | head -20
Ansible debugging:
# Step through tasks interactively, starting at a given task
ansible-playbook -i inventories/prod/hosts playbooks/web-setup.yml \
  --start-at-task="Install Nginx" \
  --step
# Print variables with the debug module
cat > debug-playbook.yml <
5.2 Performance Monitoring
5.2.1 Key Metrics
Terraform execution time:
# Record the execution time
time terraform apply
# Analyze resource creation times
# (-json needs a saved plan file or -auto-approve, because it disables interactive prompts)
terraform apply -json tfplan | jq -r '.["@message"]' | grep "Creation complete"
# Monitoring with Terraform Cloud
# Terraform Cloud shows a detailed timeline for every run
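Ansible execution performance:
# File: ansible/ansible.cfg
[defaults]
# Enable the profiling callbacks (callbacks_enabled replaces the deprecated callback_whitelist since ansible-core 2.11)
callbacks_enabled = profile_tasks, timer
# Limit the timing report to the 20 slowest tasks
[callback_profile_tasks]
task_output_limit = 20
# Run the playbook and read the timing report
ansible-playbook -i inventories/prod/hosts playbooks/web-setup.yml
# Sample output: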
# PLAY RECAP *********************************************************************
# 54.123.45.67 : ok=10 changed=5 unreachable=0 failed=0
# 54.123.45.68 : ok=10 changed=5 unreachable=0 failed=0
# 54.123.45.69 : ok=10 changed=5 unreachable=0 failed=0
#
# Tuesday 15 January 2024 1045 +0000 (002.345) 023.456 ******
# ===============================================================================
# Install Nginx --------------------------------------------------------- 45.23s
# Update system packages ------------------------------------------------ 32.15s
# Configure Nginx ------------------------------------------------------- 12.34s
# ...
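5.2.2 Monitoring Metrics
| Metric | Normal range | Alert threshold | Notes |
|---|---|---|---|
| Terraform apply time | < 10 minutes | > 30 minutes | Duration of large deployments |
| Terraform plan time | < 2 minutes | > 5 minutes | Duration of configuration validation |
| Ansible execution time | < 5 minutes per 100 hosts | > 15 minutes per 100 hosts | Duration of configuration management runs |
| SSH connection success rate | > 99% | < 95% | Network connectivity |
| State file size | < 10 MB | > 50 MB | State file bloat |
| Resource creation success rate | > 98% | < 90% | Cloud platform stability |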
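The thresholds are straightforward to apply in a wrapper script. The sketch below covers only the Terraform apply time and assumes a hypothetical `classify_apply_seconds` helper; the label "elevated" for the range between the two thresholds is an assumption, since the table does not name that band.

```shell
#!/usr/bin/env bash
# Hypothetical helper: maps a terraform apply duration (in seconds) to a status
# using the thresholds above: under 10 minutes is normal, over 30 minutes alerts.
classify_apply_seconds() {
  local seconds="$1"
  if [ "$seconds" -lt 600 ]; then
    echo "normal"
  elif [ "$seconds" -gt 1800 ]; then
    echo "alert"
  else
    echo "elevated"   # between the two thresholds; this label is an assumption
  fi
}

start=$(date +%s)
:   # stand-in for the real command, e.g. terraform apply tfplan
end=$(date +%s)
classify_apply_seconds "$((end - start))"
```

In a pipeline, the printed status could be forwarded to the SNS topic or Alertmanager setup described in the alerting section.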
5.2.3 Alerting Configuration
CloudWatch alarms:
# File: terraform/modules/monitoring/cloudwatch.tf
# Alarm when average CPU stays above 80% for two consecutive 5-minute periods.
# Note: this alarm covers only the first instance (aws_instance.app[0]).
resource "aws_cloudwatch_metric_alarm" "ec2_cpu_high" {
  alarm_name          = "ec2-cpu-utilization-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "This metric monitors ec2 cpu utilization"
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    InstanceId = aws_instance.app[0].id
  }
}

# SNS topic that receives the alarm notifications
resource "aws_sns_topic" "alerts" {
  name = "infrastructure-alerts"
}

# Email subscription for the operations team
resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = "ops-team@example.com"
}
Prometheus monitoring configuration:
# File path: monitoring/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - '54.123.45.67:9100'
          - '54.123.45.68:9100'
          - '54.123.45.69:9100'
  - job_name: 'nginx'
    static_configs:
      - targets:
          - '54.123.45.67:9113'
          - '54.123.45.68:9113'
          - '54.123.45.69:9113'
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'
rule_files:
  - 'alerts.yml'
# File path: monitoring/alerts.yml
groups:
  - name: infrastructure
    interval: 30s
    rules:
      - alert: HighCPUUsage
        # CPU usage = 100 minus the average idle percentage over the last 5 minutes
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for 5 minutes"
      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space is low"
          description: "Less than 10% disk space available"
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.job }} on {{ $labels.instance }} is down"
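The HighCPUUsage rule is described as firing when CPU usage exceeds 80%. Because node_cpu_seconds_total is a cumulative counter of seconds spent in each mode, a usage percentage has to be derived from how fast the idle counter grows. The arithmetic can be sketched as follows; the function name and the sample numbers are assumptions for illustration only.

```python
def cpu_usage_percent(
    idle_start: float,
    idle_end: float,
    elapsed_seconds: float,
    cpu_count: int = 1,
) -> float:
    """Estimate CPU usage from two samples of the cumulative idle-seconds counter.

    idle_start / idle_end: idle seconds summed over all CPUs at the two sample times.
    elapsed_seconds: wall-clock time between the samples.
    """
    if elapsed_seconds <= 0 or cpu_count <= 0:
        raise ValueError("elapsed_seconds and cpu_count must be positive")
    # Fraction of the available CPU time that was spent idle.
    idle_fraction = (idle_end - idle_start) / (elapsed_seconds * cpu_count)
    # Clamp to [0, 1] to absorb small sampling inaccuracies.
    idle_fraction = min(max(idle_fraction, 0.0), 1.0)
    return 100.0 * (1.0 - idle_fraction)


if __name__ == "__main__":
    # One CPU that was idle for 30 of the last 300 seconds.
    usage = cpu_usage_percent(1000.0, 1030.0, 300.0)
    print(f"usage: {usage:.1f}%  alert: {usage > 80}")
```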
5.3 Backup and Recovery
5.3.1 Backup Strategy
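#!/bin/bash
# File: scripts/backup.sh
# Purpose: back up the Terraform state and the Terraform/Ansible configuration
set -euo pipefail
# Use a single timestamp so the local directory and the S3 prefix match
TIMESTAMP="$(date +%Y%m%d-%H%M%S)"
BACKUP_ROOT="/backup/infra-automation"
BACKUP_DIR="${BACKUP_ROOT}/${TIMESTAMP}"
mkdir -p "${BACKUP_DIR}"/{terraform,ansible}
echo "==> Starting infrastructure configuration backup"
# 1. Back up the Terraform state file
echo "==> Backing up Terraform state"
aws s3 cp s3://my-terraform-state-bucket/prod/terraform.tfstate \
  "${BACKUP_DIR}/terraform/terraform.tfstate"
# 2. Back up the Terraform configuration files
echo "==> Backing up Terraform configuration"
tar -czf "${BACKUP_DIR}/terraform/config.tar.gz" terraform/
# 3. Back up the Ansible configuration
echo "==> Backing up Ansible configuration"
tar -czf "${BACKUP_DIR}/ansible/config.tar.gz" ansible/
# 4. Upload the backup to S3
echo "==> Uploading to S3"
aws s3 sync "${BACKUP_DIR}" \
  "s3://my-backup-bucket/infra-automation/${TIMESTAMP}/"
# 5. Remove local backups older than 7 days (top-level backup directories only)
find "${BACKUP_ROOT}" -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +
echo "==> Backup complete: ${BACKUP_DIR}"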
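The 7-day retention step can also be expressed against the timestamped directory names rather than file modification times, which makes the rule easy to test before anything is deleted. This is a sketch of an alternative, not what the script does: the function name is hypothetical, and it assumes backup directories are named in the %Y%m%d-%H%M%S format used by the script.

```python
from datetime import datetime, timedelta


def expired_backups(names: list[str], now: datetime, keep_days: int = 7) -> list[str]:
    """Return backup directory names older than keep_days, oldest first.

    Names that do not parse as %Y%m%d-%H%M%S are ignored, so unrelated
    directories are never selected for deletion.
    """
    cutoff = now - timedelta(days=keep_days)
    expired = []
    for name in names:
        try:
            created = datetime.strptime(name, "%Y%m%d-%H%M%S")
        except ValueError:
            continue  # not created by the backup script
        if created < cutoff:
            expired.append(name)
    return sorted(expired)


if __name__ == "__main__":
    listing = ["20240101-120000", "20240110-120000", "lost+found"]
    print(expired_backups(listing, now=datetime(2024, 1, 15, 12, 0, 0)))
```

Returning the list instead of deleting directly leaves room for a dry run or a confirmation step.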
5.3.2 Recovery Procedure
Stop automation tasks:
# Pause the CI/CD pipelines
# Ask team members to suspend manual operations
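Restore the Terraform state:
# Download the backup
aws s3 cp s3://my-backup-bucket/infra-automation/20240115-120000/terraform/terraform.tfstate \
  ./terraform.tfstate.backup
# Restore it to the S3 backend
aws s3 cp ./terraform.tfstate.backup \
  s3://my-terraform-state-bucket/prod/terraform.tfstate
# Note: with DynamoDB locking, the lock table also stores a digest of the state.
# If terraform init reports a checksum mismatch, update that digest to match the restored file.
# Verify the state (the plan should show no unexpected changes)
cd terraform/environments/prod
terraform init
terraform plan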
Restore the Ansible configuration:
# Download the backup
aws s3 cp s3://my-backup-bucket/infra-automation/20240115-120000/ansible/config.tar.gz \
  ./ansible-backup.tar.gz
# Extract it (recreates the ansible/ directory)
tar -xzf ansible-backup.tar.gz
# Validate the configuration in check mode (no changes are applied)
ansible-playbook -i ansible/inventories/prod/hosts ansible/playbooks/web-setup.yml --check
Verify the recovery:
# Verify the Terraform resources
terraform state list
terraform show
# Verify Ansible connectivity
ansible all -i ansible/inventories/prod/hosts -m ping
# Verify service status
ansible web_servers -i ansible/inventories/prod/hosts -m shell -a "systemctl status nginx"
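The state check in the recovery procedure can be automated with terraform plan -detailed-exitcode, which exits with 0 when there are no changes, 1 on error, and 2 when changes are pending. The sketch below wraps that convention; the function names and result labels are illustrative, and the injectable runner parameter exists only so the logic can be exercised without Terraform installed.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_RESULTS = {0: "no_changes", 1: "error", 2: "changes_pending"}


def interpret_plan_exit_code(code: int) -> str:
    """Map a plan exit code to a readable result label."""
    return PLAN_RESULTS.get(code, "unknown")


def check_restored_state(workdir: str, runner=subprocess.run) -> str:
    """Run a plan in workdir and report whether the restored state matches reality."""
    completed = runner(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
        check=False,  # non-zero exit codes are expected and interpreted below
    )
    return interpret_plan_exit_code(completed.returncode)


if __name__ == "__main__":
    # Assumes terraform is installed and the directory has been initialised.
    print(check_restored_state("terraform/environments/prod"))
```

After a clean restore the expected result is "no_changes"; "changes_pending" suggests the backup predates some real infrastructure changes and the plan output should be reviewed before applying anything.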
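6. Summary
6.1 Key Technical Points
Infrastructure as Code: With Terraform and Ansible, infrastructure and configuration management are expressed as code, which brings version control, traceability, and repeatable deployments. Every change is recorded as a Git commit, so a complete audit log comes for free. In our practice, this cut environment deployment time from 2 days to 2 hours.
Hybrid orchestration best practice: Terraform handles the infrastructure lifecycle (create, update, destroy), and Ansible handles configuration management and application deployment. The two integrate through dynamic inventory, output variables, and the local-exec provisioner to form a complete automated operations system.
State management and locking: Terraform's remote backend (S3 + DynamoDB) keeps state consistent and safe under concurrent use by a team. The state file records the current state of the infrastructure and supports versioning and rollback (for example through S3 bucket versioning), which makes it the core of infrastructure management.
Modularity and reuse: Terraform Modules and Ansible Roles provide code reuse by packaging common infrastructure patterns and configuration tasks so they can be shared across projects and environments. Our team maintains more than 20 Terraform modules and more than 30 Ansible Roles, covering about 90% of common scenarios.
Multi-environment strategy: Terraform Workspaces, environment-specific variable files, and Ansible inventory groups give unified management of development, test, and production with per-environment differences. The same code deploys to every environment with only a few parameters changed.
Security and compliance: Ansible Vault encrypts sensitive data, Terraform sensitive outputs are protected, and IAM follows the principle of least privilege. All changes are audited through Git, which satisfies compliance requirements.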
6.2 Further Learning
Direction 1: Advanced Terraform features
Go deeper into enterprise use of Terraform:
Terraform Cloud/Enterprise: team collaboration, remote execution, policy as code (Sentinel), private module registry
Dynamic configuration: use for_each, count, and dynamic blocks for more flexible resource management
Custom provider development: write Terraform providers for internal systems so they can be managed the same way
Terraform testing: test infrastructure code with Terratest
Learning resources:
Terraform official documentation
Terraform: Up & Running - a book that covers Terraform in practice in depth
HashiCorp Learn - official tutorials
Practical advice: Adopt Terraform Cloud in production and use its remote execution and policy management. Our team uses Sentinel policies to ensure that every EC2 instance has encryption enabled and every S3 bucket has versioning enabled.
Direction 2: Deeper Ansible automation
Master the advanced features of Ansible:
Ansible Tower/AWX: enterprise automation platform with a web UI, RBAC, job scheduling, and audit logs
Dynamic inventory plugins: fetch host lists from a CMDB, cloud platforms, Kubernetes, and other data sources
Custom module development: write Ansible modules in Python to extend functionality
Ansible Collections: use and publish Ansible Collections for modular management
Learning resources:
Ansible official documentation
Ansible for DevOps - a hands-on guide
Ansible Galaxy - community Roles and Collections
Practical advice: Deploy Ansible Tower/AWX to provide self-service infrastructure management. Development teams can trigger predefined Playbooks from the web UI without direct access to production.
Direction 3: CI/CD integration
Integrate Terraform and Ansible into the CI/CD pipeline:
GitLab CI/CD: run Terraform and Ansible automatically from GitLab pipelines
Jenkins Pipeline: write a Jenkinsfile for continuous delivery of infrastructure
GitHub Actions: implement a GitOps workflow with GitHub Actions
Policy validation: integrate tools such as OPA and Checkov
Learning resources:
GitLab CI/CD documentation
Terraform in CI/CD
Practical advice: Implement a full GitOps pipeline in which every infrastructure change goes through a PR that triggers automated testing and deployment. Our pipeline consists of: code checks → Terraform plan → manual approval → Terraform apply → Ansible configuration → verification tests.
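The pipeline order described in the practical advice for Direction 3 (code checks, plan, manual approval, apply, Ansible configuration, verification) can be modelled as a sequence of gated stages in which a failed stage stops everything after it. The following is only a conceptual sketch: the stage names and the run_pipeline function are hypothetical, and a real pipeline would be defined in the CI system's own configuration.

```python
from typing import Callable

# Stage order taken from the pipeline described in the text.
STAGES = [
    "lint",
    "terraform_plan",
    "manual_approval",
    "terraform_apply",
    "ansible_configure",
    "verification_tests",
]


def run_pipeline(steps: dict[str, Callable[[], bool]]) -> list[str]:
    """Run stages in order and return those that completed.

    Each step returns True on success. The first failure (including a
    rejected approval) stops the pipeline, so apply never runs unapproved.
    """
    completed = []
    for stage in STAGES:
        if not steps[stage]():
            break
        completed.append(stage)
    return completed


if __name__ == "__main__":
    steps = {stage: (lambda: True) for stage in STAGES}
    steps["manual_approval"] = lambda: False  # reviewer rejects the plan
    print(run_pipeline(steps))
```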
Appendix
A. Command Cheat Sheet
# Common Terraform commands
terraform init                     # Initialize the working directory
terraform init -upgrade            # Upgrade provider plugins
terraform validate                 # Validate configuration syntax
terraform fmt                      # Format configuration files
terraform plan                     # Preview changes
terraform plan -out=tfplan         # Save the execution plan
terraform apply                    # Apply changes
terraform apply tfplan             # Apply a saved plan
terraform destroy                  # Destroy all resources
terraform show                     # Show the current state
terraform state list               # List all resources
terraform state show ADDRESS       # Show details of a resource
terraform state rm ADDRESS         # Remove a resource from the state
terraform import ADDRESS ID        # Import an existing resource
terraform output                   # Show output values
terraform workspace list           # List all workspaces
terraform workspace new NAME       # Create a workspace
terraform workspace select NAME    # Switch workspace
terraform force-unlock LOCK_ID     # Force-unlock the state
# Common Ansible commands
ansible --version                                               # Show the version
ansible all -m ping                                             # Test connectivity
ansible all -m setup                                            # Gather host facts
ansible all -m shell -a "uptime"                                # Run a command
ansible-playbook playbook.yml                                   # Run a Playbook
ansible-playbook playbook.yml --check                           # Check mode (no changes applied)
ansible-playbook playbook.yml --diff                            # Show file differences
ansible-playbook playbook.yml -vvv                              # Verbose output
ansible-playbook playbook.yml --start-at-task="task name"       # Start at a given task
ansible-playbook playbook.yml --tags "tag1,tag2"                # Run only the given tags
ansible-playbook playbook.yml --skip-tags "tag1"                # Skip the given tags
ansible-vault create file.yml                                   # Create an encrypted file
ansible-vault edit file.yml                                     # Edit an encrypted file
ansible-vault encrypt file.yml                                  # Encrypt a file
ansible-vault decrypt file.yml                                  # Decrypt a file
ansible-galaxy init role_name                                   # Create a Role skeleton
ansible-galaxy install role_name                                # Install a Role
ansible-doc -l                                                  # List all modules
ansible-doc module_name                                         # Show module documentation
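B. Configuration Parameters
Terraform core parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| required_version | string | Terraform version constraint |
| required_providers | block | Provider version constraints |
| backend | block | Remote backend configuration |
| variable | block | Input variable definition |
| output | block | Output value definition |
| locals | block | Local value definitions |
| resource | block | Resource definition |
| data | block | Data source definition |
| module | block | Module call |
| depends_on | list | Explicit dependency declaration |
| count | number | Number of resource instances |
| for_each | map/set | Mapping of resource instances |
| lifecycle | block | Lifecycle management |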
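Ansible core parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| hosts | string | Target host group |
| become | boolean | Whether to escalate privileges |
| become_user | string | User to become |
| vars | map | Variable definitions |
| vars_files | list | List of variable files |
| tasks | list | Task list |
| handlers | list | Handler list |
| roles | list | Role list |
| tags | list | Task tags |
| when | expression | Conditional execution |
| loop | list | Loop execution |
| register | string | Variable that stores the task result |
| notify | list | Handlers to trigger |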
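C. Glossary
| Term | Definition |
| --- | --- |
| Infrastructure as Code (IaC) | The practice of defining and managing infrastructure through code |
| Declarative | Describing the desired end state rather than the steps to reach it |
| Idempotency | The property that running the same operation multiple times yields the same result |
| State File | The file Terraform uses to track resource state |
| State Lock | A mechanism that prevents concurrent modification of the state file |
| Provider | A Terraform plugin that connects to a cloud platform or service |
| Module | A reusable unit of Terraform configuration |
| Workspace | A Terraform mechanism for managing multiple environments |
| Playbook | A YAML file that defines tasks in Ansible |
| Role | A reusable collection of tasks in Ansible |
| Inventory | A file that defines the host list in Ansible |
| Handler | A special Ansible task that runs in response to a notification |
| Fact | Host information gathered by Ansible |
| Vault | The Ansible tool for encrypting sensitive data |
| Dynamic Inventory | A host list fetched dynamically from an external data source |
| Remote Backend | Remote storage for the Terraform state file |
| Resource Drift | A mismatch between the actual resource state and the code definition |
| Canary Deployment | A strategy that rolls a new version out to production gradually |
| Blue-Green Deployment | A strategy that achieves zero-downtime deployment by switching environments |
| Immutable Infrastructure | The practice of replacing resources with new ones instead of modifying them |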
Original title: Terraform + Ansible Hybrid Orchestration: Building an Automated Operations System for Large-Scale Heterogeneous Environments
Source: WeChat official account 马哥Linux运维 (WeChat ID: magedu-Linux). Please credit the source when reposting.