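🚨 Case study: a deadlock in a payment system
Symptoms
Production alert: order processing was timing out. Response times of some endpoints jumped from 50 ms to 30 s. CPU usage stayed normal, yet large numbers of requests were stuck, which points to threads that are waiting rather than computing.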
Step 1: Check the logs for clues
```bash
# Tail the application log
tail -f /app/logs/order-service.log

# Lots of timeout entries show up
2026-03-09 14:23:45 ERROR [http-nio-8080-exec-12] OrderController - Order processing timed out: orderId=12345
2026-03-09 14:23:46 ERROR [http-nio-8080-exec-24] PaymentController - Payment timed out: paymentId=67890
2026-03-09 14:23:47 WARN [http-nio-8080-exec-56] ThreadPoolTaskExecutor - Queue full, task rejected
```
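Step 2: Capture thread dumps with jstack
```bash
# Find the Java process ID (jps -l prints the main class name or JAR path)
jps -l | grep OrderApplication
# Output: 12345 com.example.order.OrderApplication

# Capture the thread stacks 3 times, 5 seconds apart, to tell a persistent hang from a momentary spike
# (run as the same OS user as the JVM, otherwise jstack cannot attach)
jstack -l 12345 > /tmp/jstack_1.log
sleep 5
jstack -l 12345 > /tmp/jstack_2.log
sleep 5
jstack -l 12345 > /tmp/jstack_3.log
```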
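🔎 Step 3: Analyze the jstack output
Count threads by state
```bash
grep "java.lang.Thread.State" /tmp/jstack_1.log | sort | uniq -c
# Output:
#   45 java.lang.Thread.State: BLOCKED (on object monitor)
#   23 java.lang.Thread.State: RUNNABLE
#   12 java.lang.Thread.State: TIMED_WAITING (sleeping)
#    5 java.lang.Thread.State: WAITING (parking)
# Many threads are BLOCKED on object monitors. That means heavy lock contention and strongly suggests
# a deadlock, but it is not proof: check that the same threads stay BLOCKED on the same monitors
# in all three dumps.
```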
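For comparison, the same histogram can be taken from inside the JVM. The sketch below (the class name `ThreadStateHistogram` is illustrative, not from the article) counts live threads by `Thread.State` using only `Thread.getAllStackTraces()`. Unlike jstack, `Thread.State` does not distinguish `(sleeping)` from `(parking)`, and it only sees the JVM it runs in.

```java
import java.util.EnumMap;
import java.util.Map;

final class ThreadStateHistogram {
    /** Counts the live threads of this JVM by Thread.State (in-process analogue of grep | sort | uniq -c). */
    static Map<Thread.State, Integer> snapshot() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(t.getState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // One line per state, count first; a BLOCKED count that stays high across runs deserves a closer look
        snapshot().forEach((state, n) -> System.out.printf("%5d %s%n", n, state));
    }
}
```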
Look for deadlock clues
```bash
# Method 1: search for the deadlock report directly (jstack detects Java-level deadlocks automatically)
grep -B 10 -A 20 "deadlock" /tmp/jstack_1.log
# Key output:
Found one Java-level deadlock:
=============================
"Thread-123":
  waiting to lock monitor 0x00007f9b0c00b200 (object 0x000000076b5f8b20, a com.example.entity.Payment),
  which is held by "Thread-456"
"Thread-456":
  waiting to lock monitor 0x00007f9b0c00a800 (object 0x000000076b5f8a98, a com.example.entity.Order),
  which is held by "Thread-123"

Java stack information for the threads:
---------------------------------------------------
"Thread-123":
        at com.example.service.OrderService.cancelOrder(OrderService.java:234)
        - waiting to lock <0x000000076b5f8b20> (a com.example.entity.Payment)
        - locked <0x000000076b5f8a98> (a com.example.entity.Order)
"Thread-456":
        at com.example.service.PaymentService.refund(PaymentService.java:123)
        - waiting to lock <0x000000076b5f8a98> (a com.example.entity.Order)
        - locked <0x000000076b5f8b20> (a com.example.entity.Payment)
```
Read it as a cycle: Thread-123 (in cancelOrder) holds the Order and waits for the Payment, while Thread-456 (in refund) holds the Payment and waits for the Order.
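The JDK exposes the same kind of information programmatically through `ThreadMXBean.findDeadlockedThreads()`. The sketch below uses hypothetical names (`DeadlockDetector`, `startDemoDeadlock`): it builds the Order/Payment cycle with two plain objects and two daemon threads, then prints one line per deadlocked thread, similar to jstack's "waiting to lock ... held by ..." lines.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

final class DeadlockDetector {
    /** One line per deadlocked thread: its name, the lock it waits for, and the lock's owner. */
    static List<String> findDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // monitors and ownable synchronizers; null if none
        List<String> report = new ArrayList<>();
        if (ids == null) {
            return report;
        }
        for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
            if (info != null) { // a thread may have terminated in the meantime
                report.add(String.format("\"%s\" waiting to lock %s, held by \"%s\"",
                        info.getThreadName(), info.getLockName(), info.getLockOwnerName()));
            }
        }
        return report;
    }

    /** Polls findDeadlocks() until it reports something or the timeout expires. */
    static List<String> awaitDeadlock(long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        List<String> report = findDeadlocks();
        while (report.isEmpty() && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            report = findDeadlocks();
        }
        return report;
    }

    /** Demo only: two daemon threads lock two plain objects in opposite order, like cancelOrder/refund. */
    static void startDemoDeadlock() {
        Object order = new Object();
        Object payment = new Object();
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
        Thread t1 = new Thread(() -> lockBoth(order, payment, bothHoldFirstLock), "Thread-Order");
        Thread t2 = new Thread(() -> lockBoth(payment, order, bothHoldFirstLock), "Thread-Payment");
        t1.setDaemon(true); // daemon threads let the JVM exit even though they never finish
        t2.setDaemon(true);
        t1.start();
        t2.start();
    }

    private static void lockBoth(Object first, Object second, CountDownLatch latch) {
        synchronized (first) {
            latch.countDown();
            try {
                latch.await(); // after this, both threads hold their first lock
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            synchronized (second) {
                // never reached: each thread waits for the lock the other one holds
            }
        }
    }

    public static void main(String[] args) {
        startDemoDeadlock();
        awaitDeadlock(5_000).forEach(System.out::println);
    }
}
```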
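💻 Step 4: Locate the source code
Follow the stack frames back to the offending code. The code is simplified: with JPA, each transaction normally loads its own Order and Payment instances, and `synchronized` only excludes threads that lock the very same object, so the example assumes both threads share those instances.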
OrderService.java
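```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class OrderService {
    private static final Logger log = LoggerFactory.getLogger(OrderService.class);

    @Autowired
    private OrderRepository orderRepository;
    @Autowired
    private PaymentRepository paymentRepository;

    @Transactional
    public void cancelOrder(Long orderId, Long paymentId) {
        Order order = orderRepository.findById(orderId).orElseThrow();
        Payment payment = paymentRepository.findById(paymentId).orElseThrow();

        // Lock Order first, then Payment
        synchronized (order) {                 // lock 1
            log.info("Order locked: {}", orderId);
            // Simulated business logic
            order.setStatus(OrderStatus.CANCELLED);
            orderRepository.save(order);

            // Refund step: needs the Payment lock while still holding the Order lock
            synchronized (payment) {           // lock 2
                payment.setStatus(PaymentStatus.REFUNDED);
                paymentRepository.save(payment);
            }
        }
    }
}
```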
PaymentService.java
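```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class PaymentService {
    private static final Logger log = LoggerFactory.getLogger(PaymentService.class);

    @Autowired
    private OrderRepository orderRepository;
    @Autowired
    private PaymentRepository paymentRepository;

    @Transactional
    public void refund(Long paymentId, Long orderId) {
        Payment payment = paymentRepository.findById(paymentId).orElseThrow();
        Order order = orderRepository.findById(orderId).orElseThrow();

        // Lock Payment first, then Order: the OPPOSITE order to OrderService.cancelOrder
        synchronized (payment) {               // lock 2
            log.info("Payment locked: {}", paymentId);
            payment.setStatus(PaymentStatus.REFUNDING);
            paymentRepository.save(payment);

            // Order-side update: needs the Order lock while still holding the Payment lock
            synchronized (order) {             // lock 1
                order.setStatus(OrderStatus.CANCELLED);
                orderRepository.save(order);
            }
        }
    }
}
```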
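🔬 Step 5: Reproduce the deadlock
Write a test case to verify it
```java
import java.lang.management.ManagementFactory;
import org.junit.jupiter.api.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

@SpringBootTest
public class DeadlockTest {
    private static final Logger log = LoggerFactory.getLogger(DeadlockTest.class);

    @Autowired
    private OrderService orderService;
    @Autowired
    private PaymentService paymentService;

    @Test
    public void testDeadlock() throws InterruptedException {
        Long orderId = 123L;
        Long paymentId = 456L;

        // Thread 1: starts from the order side
        Thread t1 = new Thread(() -> {
            try {
                orderService.cancelOrder(orderId, paymentId);
            } catch (Exception e) {
                log.error("Thread 1 failed", e);
            }
        }, "Thread-Order");

        // Thread 2: starts from the payment side
        Thread t2 = new Thread(() -> {
            try {
                paymentService.refund(paymentId, orderId);
            } catch (Exception e) {
                log.error("Thread 2 failed", e);
            }
        }, "Thread-Payment");

        // Daemon threads, so a reproduced deadlock does not keep the test JVM alive
        t1.setDaemon(true);
        t2.setDaemon(true);

        // Start both; the interleaving is a race, so not every run deadlocks
        t1.start();
        t2.start();

        // Wait and observe
        Thread.sleep(10_000);

        // Check the thread states and ask the JVM directly
        log.info("Thread 1 state: {}", t1.getState());
        log.info("Thread 2 state: {}", t2.getState());
        long[] deadlocked = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        log.info("Deadlocked threads: {}", deadlocked == null ? 0 : deadlocked.length);
    }
}
// Result: both threads stay BLOCKED; the deadlock is reproduced.
```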
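🎯 Step 6: Solutions
Option 1: Fix the lock acquisition order (recommended)
```java
@Service
public class OrderService {
    // Tie-breaker for the rare case where two objects share an identity hash code.
    // PaymentService must synchronize on this same object (e.g. a shared constant).
    static final Object TIE_LOCK = new Object();

    @Autowired
    private OrderRepository orderRepository;
    @Autowired
    private PaymentRepository paymentRepository;

    public void cancelOrder(Long orderId, Long paymentId) {
        Order order = orderRepository.findById(orderId).orElseThrow();
        Payment payment = paymentRepository.findById(paymentId).orElseThrow();

        // Simplest rule: every code path (refund included) locks Order first, then Payment.
        // Alternatively, derive the order from System.identityHashCode as below;
        // PaymentService.refund must then apply exactly the same rule.
        int orderHash = System.identityHashCode(order);
        int paymentHash = System.identityHashCode(payment);

        if (orderHash < paymentHash) {
            synchronized (order) {
                synchronized (payment) { doCancel(order, payment); }
            }
        } else if (orderHash > paymentHash) {
            synchronized (payment) {
                synchronized (order) { doCancel(order, payment); }
            }
        } else {
            // Equal hashes: serialize through TIE_LOCK so two threads cannot lock in opposite orders
            synchronized (TIE_LOCK) {
                synchronized (order) {
                    synchronized (payment) { doCancel(order, payment); }
                }
            }
        }
    }

    private void doCancel(Order order, Payment payment) {
        // business logic
    }
}
```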
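`System.identityHashCode` orders object instances, and with JPA the Order and Payment instances usually differ from one transaction to the next. A stable alternative is to order by business key. The sketch below assumes a single JVM and uses a hypothetical in-memory registry (`KeyedLocks`) that maps keys such as `ORDER:123` to `ReentrantLock`s and always acquires them in sorted key order; the demo runs the cancelOrder and refund directions concurrently with the keys listed in opposite order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

final class KeyedLocks {
    // One lock per business key such as "ORDER:123". Entries are never evicted here,
    // which is fine for a sketch but would leak memory in a long-running service.
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    /** Locks every key in sorted order (one global order for all callers), runs action, unlocks in reverse. */
    void withLocks(List<String> keys, Runnable action) {
        List<ReentrantLock> acquired = new ArrayList<>();
        try {
            for (String key : new TreeSet<>(keys)) { // sorted and de-duplicated
                ReentrantLock lock = locks.computeIfAbsent(key, k -> new ReentrantLock());
                lock.lock();
                acquired.add(lock);
            }
            action.run();
        } finally {
            for (int i = acquired.size() - 1; i >= 0; i--) {
                acquired.get(i).unlock();
            }
        }
    }

    /** Two threads request the same two keys in opposite order; returns the number of completed actions. */
    static int demo(int iterationsPerThread) {
        KeyedLocks registry = new KeyedLocks();
        int[] completed = {0}; // only touched while both locks are held
        Thread t1 = new Thread(() -> {
            for (int i = 0; i < iterationsPerThread; i++) {
                registry.withLocks(List.of("ORDER:123", "PAYMENT:456"), () -> completed[0]++);
            }
        });
        Thread t2 = new Thread(() -> {
            for (int i = 0; i < iterationsPerThread; i++) {
                registry.withLocks(List.of("PAYMENT:456", "ORDER:123"), () -> completed[0]++);
            }
        });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed[0];
    }

    public static void main(String[] args) {
        System.out.println(demo(100_000)); // 200000: both threads finish, no deadlock
    }
}
```

Because every caller sorts the keys before locking, the two threads never hold the locks in opposite order. Across several service instances an in-memory registry is not enough; that case needs database locks (Option 3) or a distributed lock.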
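Option 2: Use ReentrantLock.tryLock with a timeout
`tryLock` with a timeout does not remove the lock cycle; it turns an indefinite hang into a timeout error that the caller can handle or retry. The locks also only help if every code path that touches orders and payments uses the same lock instances.
```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import org.springframework.stereotype.Service;

@Service
public class PaymentService {
    // These only protect anything if OrderService uses the SAME instances (e.g. a shared bean).
    // As written they are global locks, not per order or per payment.
    private final ReentrantLock orderLock = new ReentrantLock();
    private final ReentrantLock paymentLock = new ReentrantLock();

    public void refund(Long paymentId, Long orderId) {
        boolean gotOrderLock = false;
        boolean gotPaymentLock = false;
        try {
            // Try to acquire both locks, each with a timeout
            gotPaymentLock = paymentLock.tryLock(3, TimeUnit.SECONDS);
            if (!gotPaymentLock) {
                throw new IllegalStateException("Timed out acquiring the payment lock");
            }
            gotOrderLock = orderLock.tryLock(3, TimeUnit.SECONDS);
            if (!gotOrderLock) {
                // finally releases the payment lock, so the other thread can proceed
                throw new IllegalStateException("Timed out acquiring the order lock");
            }
            // Business logic
            doRefund(paymentId, orderId);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while acquiring locks", e);
        } finally {
            // Release in reverse order of acquisition (conventional; release order cannot cause a deadlock)
            if (gotOrderLock) {
                orderLock.unlock();
            }
            if (gotPaymentLock) {
                paymentLock.unlock();
            }
        }
    }

    private void doRefund(Long paymentId, Long orderId) {
        // business logic
    }
}
```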
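A timeout alone only turns a hang into an error. The usual companion is all-or-nothing acquisition with retry: if the second lock is not available in time, release the first one, back off for a random interval and try again. The sketch below (hypothetical `TryLockBoth`; the 10 ms timeout, 1-9 ms back-off and 1,000-attempt limit are arbitrary assumptions) deliberately has the two threads request the locks in opposite order and still lets both finish.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

final class TryLockBoth {
    /**
     * All-or-nothing acquisition: take first, then try second with a timeout. If second is not
     * available, release first, back off for a random 1-9 ms and retry. Returns false after maxAttempts.
     */
    static boolean runWithBoth(ReentrantLock first, ReentrantLock second, Runnable action,
                               int maxAttempts) throws InterruptedException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (first.tryLock(10, TimeUnit.MILLISECONDS)) {
                try {
                    if (second.tryLock(10, TimeUnit.MILLISECONDS)) {
                        try {
                            action.run();
                            return true;
                        } finally {
                            second.unlock();
                        }
                    }
                } finally {
                    first.unlock(); // never keep one lock while waiting indefinitely for the other
                }
            }
            // Random back-off so the two threads do not keep retrying in lockstep (livelock)
            Thread.sleep(ThreadLocalRandom.current().nextInt(1, 10));
        }
        return false;
    }

    /** Two threads take orderLock/paymentLock in OPPOSITE order; returns the number of completed actions. */
    static int demo(int iterationsPerThread) {
        ReentrantLock orderLock = new ReentrantLock();
        ReentrantLock paymentLock = new ReentrantLock();
        int[] completed = {0}; // only touched while both locks are held
        Runnable increment = () -> completed[0]++;
        Thread t1 = new Thread(() -> repeat(orderLock, paymentLock, increment, iterationsPerThread));
        Thread t2 = new Thread(() -> repeat(paymentLock, orderLock, increment, iterationsPerThread));
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed[0];
    }

    private static void repeat(ReentrantLock first, ReentrantLock second, Runnable action, int n) {
        try {
            for (int i = 0; i < n; i++) {
                if (!runWithBoth(first, second, action, 1_000)) {
                    throw new IllegalStateException("gave up after 1000 attempts");
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(200)); // 400 when every action eventually succeeds
    }
}
```

The random back-off matters: with a fixed delay, two threads can keep colliding in lockstep (a livelock). A fixed lock order (Option 1) is still simpler; this pattern helps where an order cannot be enforced.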
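Option 3: Use database row locks instead of application locks
The database does not pick a lock order for you. Databases such as MySQL (InnoDB) and PostgreSQL detect the cycle and roll back one transaction as the deadlock victim, so still take row locks in a consistent order and be ready to retry the victim.
```java
// LockModeType is jakarta.persistence.LockModeType (javax.persistence on Spring Boot 2.x)

@Service
public class OrderService {
    @Autowired
    private OrderRepository orderRepository;
    @Autowired
    private PaymentRepository paymentRepository;

    @Transactional
    public void cancelOrder(Long orderId, Long paymentId) {
        // Pessimistic write locks (SELECT ... FOR UPDATE), always taken in the same order: Order, then Payment
        Order order = orderRepository.findByIdWithLock(orderId).orElseThrow();
        Payment payment = paymentRepository.findByIdWithLock(paymentId).orElseThrow();

        // Business logic; if a deadlock still occurs, the database aborts one transaction
        order.setStatus(OrderStatus.CANCELLED);
        payment.setStatus(PaymentStatus.REFUNDED);
        // Managed entities are flushed on commit; the row locks are released when the transaction ends
    }
}

@Repository
public interface OrderRepository extends JpaRepository<Order, Long> {
    @Lock(LockModeType.PESSIMISTIC_WRITE)
    @Query("select o from Order o where o.id = :id")
    Optional<Order> findByIdWithLock(@Param("id") Long id);
}

@Repository
public interface PaymentRepository extends JpaRepository<Payment, Long> {
    @Lock(LockModeType.PESSIMISTIC_WRITE)
    @Query("select p from Payment p where p.id = :id")
    Optional<Payment> findByIdWithLock(@Param("id") Long id);
}
```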
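Because the database resolves a deadlock by aborting one transaction, the victim has to be retried in a new transaction. The sketch below is a hypothetical, framework-free helper (`DeadlockRetry.withRetry`). In a Spring application the predicate might test for `org.springframework.dao.PessimisticLockingFailureException`, but which exception actually surfaces depends on the JDBC driver and Spring version, so treat that as an assumption to verify. The retry must wrap the call to the `@Transactional` method from the outside.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

final class DeadlockRetry {
    /**
     * Runs work and retries it while isRetryable classifies the failure as a lock conflict or
     * deadlock-victim rollback. Each call to work must start a NEW transaction, so call this
     * from outside the @Transactional method, never from inside it.
     */
    static <T> T withRetry(Supplier<T> work, Predicate<RuntimeException> isRetryable, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return work.get();
            } catch (RuntimeException e) {
                if (attempt >= maxAttempts || !isRetryable.test(e)) {
                    throw e;
                }
                try {
                    Thread.sleep(20L * attempt); // linear back-off before the next attempt
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated work: the first two attempts fail as if the DB had chosen this transaction as the victim
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated deadlock victim");
            }
            return "refunded";
        }, e -> e instanceof IllegalStateException, 5);
        System.out.println(result + " after " + calls[0] + " attempts"); // refunded after 3 attempts
    }
}
```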
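🛠 Step 7: Preventive measures
1. Code convention checks
```java
// --- MultiLockOperation.java ---
import java.lang.annotation.*;

// Custom annotation marking a method that acquires several locks
@Target(ElementType.METHOD)
@Retention(RetentionPolicy.RUNTIME)
public @interface MultiLockOperation {
    String[] lockOrder() default {}; // required acquisition order, e.g. {"order", "payment"}
}

// --- DeadlockPreventionAspect.java ---
import java.util.Arrays;
import java.util.List;
import org.aspectj.lang.annotation.Aspect;
import org.aspectj.lang.annotation.Before;
import org.springframework.stereotype.Component;

// Aspect that checks the lock order.
// LockRecorder is a hypothetical helper (not part of any library) that returns the names of the
// locks the current thread holds, in acquisition order.
@Aspect
@Component
public class DeadlockPreventionAspect {
    @Before("@annotation(multiLock)")
    public void checkLockOrder(MultiLockOperation multiLock) {
        // Runs on method entry, so it only sees locks the caller already holds
        List<String> requiredOrder = Arrays.asList(multiLock.lockOrder());
        List<String> currentOrder = LockRecorder.getCurrentThreadLocks();
        if (!isValidOrder(requiredOrder, currentOrder)) {
            throw new IllegalStateException("Lock acquisition order may cause a deadlock!");
        }
    }

    // Valid when the held locks that appear in requiredOrder were acquired in that order
    private boolean isValidOrder(List<String> requiredOrder, List<String> currentOrder) {
        int lastPos = -1;
        for (String lock : currentOrder) {
            int pos = requiredOrder.indexOf(lock);
            if (pos < 0) {
                continue; // lock not covered by this rule
            }
            if (pos <= lastPos) {
                return false;
            }
            lastPos = pos;
        }
        return true;
    }
}
```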
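The aspect relies on a `LockRecorder` that the article does not define. One concrete way to get the same guarantee without AOP is a lock hierarchy: give each lock a rank and reject any acquisition whose rank is not higher than every lock the thread already holds. The sketch below (`RankedLock` is a hypothetical class, and the ranks Order = 1 and Payment = 2 are assumptions) keeps the per-thread ranks in a `ThreadLocal`.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.ReentrantLock;

/**
 * A lock with a fixed rank. A thread may only acquire locks in strictly increasing rank order,
 * so a circular wait between ranked locks cannot form. Not reentrant by design.
 */
final class RankedLock {
    // Ranks held by the current thread, most recent first (a concrete stand-in for the hypothetical LockRecorder)
    private static final ThreadLocal<Deque<Integer>> HELD = ThreadLocal.withInitial(ArrayDeque::new);

    private final ReentrantLock lock = new ReentrantLock();
    private final String name;
    private final int rank;

    RankedLock(String name, int rank) {
        this.name = name;
        this.rank = rank;
    }

    void lock() {
        Deque<Integer> held = HELD.get();
        Integer highest = held.peekFirst(); // ranks are pushed in increasing order, so the head is the highest
        if (highest != null && highest >= rank) {
            throw new IllegalStateException("Lock order violation: acquiring " + name
                    + " (rank " + rank + ") while holding rank " + highest);
        }
        lock.lock();
        held.push(rank);
    }

    void unlock() {
        lock.unlock(); // throws IllegalMonitorStateException if this thread does not hold it
        HELD.get().removeFirstOccurrence(rank);
    }

    public static void main(String[] args) {
        RankedLock order = new RankedLock("order", 1);
        RankedLock payment = new RankedLock("payment", 2);

        order.lock();   // cancelOrder direction: Order (1) then Payment (2) is allowed
        payment.lock();
        payment.unlock();
        order.unlock();

        payment.lock(); // refund direction: Payment (2) then Order (1) is rejected before it can block
        try {
            order.lock();
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        } finally {
            payment.unlock();
        }
    }
}
```

The check fails fast at the first wrong-order acquisition, typically in a test, instead of deadlocking intermittently in production.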
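2. Monitoring and alerting
```bash
#!/bin/bash
# Check for deadlocks periodically
while true; do
    PID=$(jps | grep OrderApplication | awk '{print $1}')
    if [ -n "$PID" ]; then
        jstack -l "$PID" > /tmp/jstack.log
        # jstack reports e.g. "Found one Java-level deadlock:"
        if grep -q "Found .* deadlock" /tmp/jstack.log; then
            echo "Deadlock detected! Time: $(date)" | mail -s "Deadlock alert" dev@company.com
            cp /tmp/jstack.log /tmp/deadlock_$(date +%Y%m%d_%H%M%S).log
        fi
    fi
    sleep 60
done
```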
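The same check can also run inside the application instead of from an external shell loop. The sketch below (hypothetical `DeadlockWatchdog`) calls `ThreadMXBean.findDeadlockedThreads()` on a daemon scheduler thread and hands any deadlocked thread IDs to a callback; what the callback does (log a dump, send mail) is left to the caller.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

final class DeadlockWatchdog {
    /**
     * Checks the JVM for deadlocked threads every periodMillis and passes their IDs to alert.
     * Returns the scheduler so callers can stop it.
     */
    static ScheduledExecutorService start(long periodMillis, Consumer<long[]> alert) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "deadlock-watchdog");
            t.setDaemon(true); // do not keep the JVM alive just for monitoring
            return t;
        });
        scheduler.scheduleAtFixedRate(() -> {
            try {
                long[] ids = mx.findDeadlockedThreads(); // null when there is no deadlock
                if (ids != null) {
                    alert.accept(ids);
                }
            } catch (RuntimeException e) {
                // An exception escaping here would silently cancel all future runs
                e.printStackTrace();
            }
        }, 0, periodMillis, TimeUnit.MILLISECONDS);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService watchdog = start(1_000, ids ->
                System.err.println("Deadlock detected, thread IDs: " + Arrays.toString(ids)));
        Thread.sleep(3_000); // in a healthy JVM nothing is printed
        watchdog.shutdownNow();
    }
}
```

While the deadlock persists, the callback fires on every run, so a real alert needs de-duplication. The try/catch inside the task matters because `scheduleAtFixedRate` suppresses all future runs after an uncaught exception.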
📊 Deadlock troubleshooting tools

| Tool |
|---|
| jstack |
| VisualVM |
| Arthas |
| FastThread |
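🎓 Lessons learned
The four necessary conditions for deadlock
Mutual exclusion: a resource can be held by only one thread at a time
Hold and wait: a thread holds one lock while waiting for another
No preemption: a lock can only be released by the thread holding it
Circular wait: the threads form a cycle of dependencies
Breaking any one condition prevents deadlock: a fixed lock order (Option 1) breaks circular wait, and giving up the first lock when tryLock times out (Option 2) breaks hold and wait.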
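Troubleshooting mnemonic
Alerts and logs raise the flag; jstack captures the threads.
Many BLOCKED threads: a deadlock is likely.
Search for "deadlock" to pin down the two threads.
Read the stacks to find the code: the lock order is reversed.
Three fixes, and lock ordering matters most.
Monitoring plus coding rules keep deadlocks away.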
夜雨聆风