Edge Video AI Pipelines: DAG Runtimes, Queues, and System Bottlenecks

This note is a practical systems sketch for shipping stable video analytics on embedded SoCs. It stays SDK-agnostic: swap decoder, NPU runtime, and queue policies to fit your platform. Topics: DAG topology, buffer contracts, scheduling, and the limits (DRAM, thermal, fabric) that show up under real load, not only on slides.

Goal: reason about p95/p99 latency, drops, and jitter with the same discipline as OS scheduling and memory: name the bottleneck, then optimize the critical path.

Contents

1. Scenario
2. DAG plus queues, not DAG alone
3. Map stages to executors
4. Sample shape: stages and a minimal scheduler loop
5. Bottlenecks beyond FLOPs
6. Profiling the critical path
7. Checklist

1. Scenario

Assume:

One inbound stream: 1080p30 (could scale to N streams later).
Decode from compressed video (vendor decoder on DSP/VDEC or CPU fallback).
Preprocess: resize, mean/variance normalization, layout conversion.
Detector: INT8 network on NPU (or GPU where that is the only option).
Tracker and business logic: mostly CPU.
Output: overlay, encode, or metadata only.

In production, the painful failure mode is usually not “model accuracy on a slide.” It is p95/p99 latency, frame drops, and jitter when the SoC gets hot or DRAM gets busy.
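Tail latency is easy to talk about and easy to compute wrongly. The sketch below is one minimal way to get p95/p99 from recorded per-frame latencies, using the nearest-rank definition. The function name `percentile_ms` and the millisecond unit are assumptions for illustration, not part of any SDK.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank percentile over per-frame latencies in milliseconds.
// p is expected in (0, 100]. Returns 0 for an empty sample set.
// The vector is taken by value because nth_element reorders it.
double percentile_ms(std::vector<double> samples, double p) {
  if (samples.empty()) return 0.0;
  const std::size_t n = samples.size();
  // Nearest rank: smallest value with at least p% of samples at or below it.
  std::size_t rank =
      static_cast<std::size_t>(std::ceil(p * static_cast<double>(n) / 100.0));
  if (rank < 1) rank = 1;
  if (rank > n) rank = n;
  std::nth_element(samples.begin(), samples.begin() + (rank - 1), samples.end());
  return samples[rank - 1];
}
```

Feeding this a sliding window of end-to-end frame latencies (capture timestamp to output timestamp) gives the numbers that matter here. A mean over the same window can look healthy while p99 has already blown the frame budget.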

2. DAG plus queues, not DAG alone

A DAG drawn on a whiteboard is topology. A shipped engine needs edges with semantics:

Buffer ownership (who allocates, who releases, who may alias).
Backpressure (what happens when the next stage is slower).
Failure policy (drop oldest, block, degrade resolution, skip inference every K frames).

The mental picture to implement is DAG nodes plus bounded queues between them, not only “node A calls node B.”

flowchart LR
  subgraph ingest [Ingest]
    Cam[CameraOrFile]
    Dec[Decoder]
  end
  subgraph vision [Vision]
    Pre[Preprocess]
    Det[DetectorNPU]
    Trk[TrackerCPU]
  end
  subgraph out [Output]
    Post[OverlayOrEncode]
  end
  Cam --> Dec
  Dec -->|Q1| Pre
  Pre -->|Q2| Det
  Det -->|Q3| Trk
  Trk -->|Q4| Post

Q1..Q4 are not decorative. They are where you implement max depth, drop policy, and metrics (wait time, utilization).
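As a sketch of what one such edge might look like, the class below combines a max depth, a policy for the full case, and counters. All names (`EdgeQueue`, `FullPolicy`, `PushResult`) are hypothetical, and a mutex-guarded deque is assumed for simplicity. A real edge may use a lock-free ring or a vendor queue instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>

// Hypothetical policy and result names for one DAG edge.
enum class FullPolicy { Reject, DropOldest };
enum class PushResult { Accepted, AcceptedAfterEvict, Rejected };

template <typename T>
class EdgeQueue {
 public:
  EdgeQueue(std::size_t max_depth, FullPolicy policy)
      : max_depth_(max_depth == 0 ? 1 : max_depth), policy_(policy) {}

  // On AcceptedAfterEvict, `evicted` holds the oldest item so the caller can
  // release its buffer. On Rejected, the queue is left unchanged.
  PushResult push(const T& item, T& evicted) {
    std::lock_guard<std::mutex> lock(mu_);
    PushResult result = PushResult::Accepted;
    if (items_.size() >= max_depth_) {
      if (policy_ == FullPolicy::Reject) {
        ++rejected_;
        return PushResult::Rejected;
      }
      evicted = items_.front();
      items_.pop_front();
      ++dropped_;
      result = PushResult::AcceptedAfterEvict;
    }
    items_.push_back(item);
    return result;
  }

  bool try_pop(T& out) {
    std::lock_guard<std::mutex> lock(mu_);
    if (items_.empty()) return false;
    out = items_.front();
    items_.pop_front();
    return true;
  }

  std::size_t depth() const {
    std::lock_guard<std::mutex> lock(mu_);
    return items_.size();
  }
  std::uint64_t dropped() const {
    std::lock_guard<std::mutex> lock(mu_);
    return dropped_;
  }
  std::uint64_t rejected() const {
    std::lock_guard<std::mutex> lock(mu_);
    return rejected_;
  }

 private:
  const std::size_t max_depth_;
  const FullPolicy policy_;
  mutable std::mutex mu_;
  std::deque<T> items_;
  std::uint64_t dropped_ = 0;
  std::uint64_t rejected_ = 0;
};
```

Drop-oldest tends to suit live video, where a stale frame is worth less than a fresh one. Reject pushes the decision upstream, which is closer to classic backpressure. Either way the evicted or rejected buffer has an owner who must release it.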

3. Map stages to executors

Below is a pattern table, not a universal law. Names change per vendor (DSP, VPU, NPU, GPU).

Stage	Typical executor	Why	Risk if misplaced
Decode	Hardware decoder or DSP	Bitstream work should not starve the rest	CPU pegged, everything else drifts
Preprocess	GPU shader, NEON, or vendor image DSP	High pixel volume	Extra copies through DRAM
Detector	NPU or GPU	Convolution-heavy	Deep queues hide latency until they explode
Tracker	CPU	Small tensors, branching logic	Lock contention if shared with unrelated work

Scheduling here is not only “a thread pool.” It is mapping nodes to executors plus policies (priority, affinity, batching rules).
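One way to make that mapping reviewable is to write it down as data rather than scatter it across thread setup code. The sketch below assumes a hypothetical `StagePlacement` record. The priorities, affinity masks, and batch sizes are placeholder values, not recommendations.

```cpp
#include <cstdint>

enum class Executor { HwDecoder, Gpu, Npu, Cpu };

// Hypothetical placement record: one row per DAG node.
struct StagePlacement {
  const char* stage;
  Executor executor;
  int priority;            // higher runs first; exact meaning is scheduler-specific
  std::uint32_t cpu_mask;  // CPU affinity bitmask; 0 when the stage runs off-CPU
  int max_batch;           // 1 means no batching
};

// Placeholder values mirroring the table above.
constexpr StagePlacement kPlacement[] = {
    {"decode", Executor::HwDecoder, 3, 0x0, 1},
    {"preprocess", Executor::Gpu, 2, 0x0, 1},
    {"detector", Executor::Npu, 2, 0x0, 1},
    {"tracker", Executor::Cpu, 1, 0x3, 1},
};

// Sanity check: batching is at least 1 and CPU stages carry an affinity mask.
constexpr bool placement_is_valid(const StagePlacement& p) {
  return p.max_batch >= 1 && (p.executor != Executor::Cpu || p.cpu_mask != 0);
}
```

A table like this can be validated at startup and logged next to latency metrics, so a regression can be traced to a placement change instead of guessed at.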

4. Sample shape: stages and a minimal scheduler loop

The following is illustrative pseudo-C++ for handles, queues, stage functions, and policy hooks. It is not a drop-in library.

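#include <cstddef>
#include <cstdint>
#include <thread>

// Opaque buffer id: pool index, dmabuf handle, or vendor token.
struct BufferHandle { std::uint32_t id; };

// Interface only: the implementation (ring buffer, locking) is platform-specific.
template <typename T>
struct BoundedQueue {
  bool try_push(const T&);    // returns false when full (backpressure)
  bool try_pop(T& out);       // returns false when empty
  std::size_t depth() const;  // current occupancy, for metrics
};

struct StageCtx {
  BoundedQueue<BufferHandle>* in_q;
  BoundedQueue<BufferHandle>* out_q;
};

// Platform hooks, declared here and implemented per SoC.
bool allocate_output(BufferHandle& out);  // false when the output pool is empty
void preprocess(const BufferHandle& in, const BufferHandle& out);
void release(const BufferHandle& buf);    // hand the buffer back to its pool

// Drains the input queue, then returns; the scheduler calls it again later.
void run_preprocess(StageCtx& ctx) {
  BufferHandle in, out;
  while (ctx.in_q->try_pop(in)) {
    // This stage owns `in` from here until release(in).
    if (!allocate_output(out)) {
      release(in);  // no output buffer: drop this frame, recycle the input
      continue;
    }
    preprocess(in, out);
    while (!ctx.out_q->try_push(out)) {
      // policy: spin, sleep, or drop according to product rules.
      // A drop must release(out) and break, or the buffer leaks.
      std::this_thread::yield();
    }
    release(in);
  }
}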

Review focus for this shape

Clear ownership transitions (release, allocate_output).
A place to attach metrics (depth(), wait in try_push).
No hidden copies on the happy path: buffers move between stages by handle. Real code still needs explicit synchronization between executors.
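To illustrate the metrics hook, the helper below wraps any queue that exposes `try_push` and records how long the producer waited before success or timeout. `PushStats` and `push_with_deadline` are hypothetical names. The yield-based wait is a placeholder for whatever blocking primitive the platform offers.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical per-edge counters for the producer side.
struct PushStats {
  std::uint64_t pushed = 0;
  std::uint64_t timed_out = 0;
  std::chrono::nanoseconds total_wait{0};
};

// Retries q.try_push(item) until it succeeds or `budget` elapses, and records
// the time spent. Returns false on timeout; the caller still owns `item` and
// must release or drop it.
template <typename Queue, typename T>
bool push_with_deadline(Queue& q, const T& item,
                        std::chrono::nanoseconds budget, PushStats& stats) {
  using Clock = std::chrono::steady_clock;
  const Clock::time_point start = Clock::now();
  bool ok = q.try_push(item);
  while (!ok && (Clock::now() - start) < budget) {
    std::this_thread::yield();  // placeholder wait; real code might block on a condvar
    ok = q.try_push(item);
  }
  stats.total_wait += std::chrono::duration_cast<std::chrono::nanoseconds>(
      Clock::now() - start);
  if (ok) {
    ++stats.pushed;
  } else {
    ++stats.timed_out;
  }
  return ok;
}
```

Dividing `total_wait` by the number of attempts gives mean producer wait per edge. A rising value on one edge usually points at the stage downstream of it.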

5. Bottlenecks beyond FLOPs

flowchart TB
  subgraph soc [SoCSharedBudget]
    DRAM[DRAM_bandwidth]
    BUS[IO_and_system_fabric]
    THM[Thermal_headroom]
  end
  Dec[Decoder_path]
  NPU[Inference_path]
  Dec --> DRAM
  NPU --> DRAM
  Dec --> THM
  NPU --> THM

DRAM bandwidth: Large feature maps, suboptimal strides, or extra color converts can saturate memory before the NPU runs out of ops.
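A back-of-envelope traffic estimate helps before any profiling. The sketch below assumes the decoder emits 8-bit NV12 and ignores stride padding, frame-buffer compression, and feature-map traffic, so treat the result as a rough floor for frame traffic rather than a measurement. The function names are illustrative.

```cpp
#include <cstdint>

// Bytes in one 8-bit NV12 frame: full-resolution luma plus half-size
// interleaved chroma (assumes even dimensions, no stride padding).
constexpr std::uint64_t nv12_frame_bytes(std::uint64_t width,
                                         std::uint64_t height) {
  return width * height * 3 / 2;
}

// DRAM traffic in bytes per second when every frame crosses DRAM `passes`
// times. Each full-frame write and each full-frame read counts as one pass.
constexpr std::uint64_t stream_traffic_bytes_per_s(std::uint64_t frame_bytes,
                                                   std::uint64_t fps,
                                                   std::uint64_t passes) {
  return frame_bytes * fps * passes;
}
```

For the 1080p30 stream in the scenario, one pass is 3,110,400 bytes per frame, or about 93 MB/s. A decoder write plus a preprocess read is already about 187 MB/s, and each extra color convert adds another read and another write on top.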

Thermal: Sustained load heats the SoC until thermal throttling lowers clocks, so per-stage latency and jitter rise. A pipeline that looked fine for five minutes can fail after twenty.
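One possible response is to degrade on purpose before the SoC throttles on its own. The sketch below is a hypothetical two-level governor with hysteresis that turns a temperature reading into an inference stride K, meaning the detector runs on every Kth frame. It assumes readings in millidegrees Celsius, the unit Linux sysfs thermal zones report. The thresholds and stride are placeholders to tune per device.

```cpp
// Hypothetical governor: switches to a degraded mode above `hot_mc` and only
// returns to normal once the temperature falls to `cool_mc` or below.
class ThermalGovernor {
 public:
  ThermalGovernor(int hot_mc, int cool_mc, int degraded_stride)
      : hot_mc_(hot_mc), cool_mc_(cool_mc), degraded_stride_(degraded_stride) {}

  // Feed one temperature sample (millidegrees Celsius).
  // Returns the inference stride K: 1 means run the detector on every frame.
  int update(int temp_mc) {
    if (!degraded_ && temp_mc >= hot_mc_) {
      degraded_ = true;
    } else if (degraded_ && temp_mc <= cool_mc_) {
      degraded_ = false;
    }
    return degraded_ ? degraded_stride_ : 1;
  }

  bool degraded() const { return degraded_; }

 private:
  int hot_mc_;
  int cool_mc_;
  int degraded_stride_;
  bool degraded_ = false;
};
```

The gap between the two thresholds keeps the pipeline from flapping between modes when the temperature hovers near the limit, which would itself show up as jitter.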

PCIe / fabric: Matters most when accelerators or cameras are off-chip. Fully integrated designs are not immune: shared on-chip buses still show contention under stress.

6. Profiling the critical path

Use vendor tools where available (GPU/NPU timelines), plus system-wide views (perf, eBPF, Nsight Systems, or equivalents). Then:

Identify the longest dependent chain per frame (critical path).
Decide whether the next optimization belongs there or in queue policy and memory traffic.

Optimizing a kernel that is not on the critical path is a common way to waste weeks.
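Once per-stage durations are measured, finding the longest dependent chain is a small computation. The sketch below assumes stages are listed in topological order (every dependency index is lower than the node's own index) and that durations come from your own traces. `StageNode` and `critical_path` are illustrative names.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct StageNode {
  double duration_ms;             // measured time for this stage on one frame
  std::vector<std::size_t> deps;  // upstream stage indices, each lower than this node's index
};

// Longest dependent chain through a DAG given in topological order.
// Returns the stage indices on the critical path, source first.
// If total_ms is non-null, it receives the length of that path.
std::vector<std::size_t> critical_path(const std::vector<StageNode>& nodes,
                                       double* total_ms = nullptr) {
  const std::size_t n = nodes.size();
  const std::size_t kNone = n;  // sentinel meaning "no predecessor"
  std::vector<double> finish(n, 0.0);
  std::vector<std::size_t> prev(n, kNone);
  std::size_t last = kNone;
  for (std::size_t i = 0; i < n; ++i) {
    double start = 0.0;
    for (std::size_t d : nodes[i].deps) {
      if (finish[d] > start) {  // the slowest upstream stage gates this one
        start = finish[d];
        prev[i] = d;
      }
    }
    finish[i] = start + nodes[i].duration_ms;
    if (last == kNone || finish[i] > finish[last]) last = i;
  }
  std::vector<std::size_t> path;
  for (std::size_t i = last; i != kNone; i = prev[i]) path.push_back(i);
  std::reverse(path.begin(), path.end());
  if (total_ms != nullptr) *total_ms = (last == kNone) ? 0.0 : finish[last];
  return path;
}
```

Any stage missing from the returned path has slack. Speeding it up will not shorten frame latency until the path itself changes.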

7. Checklist

Every edge between stages has max queue depth and policy under pressure.
You can name the one system bottleneck for your workload (DRAM, thermal, PCIe/fabric, or a specific stage).
Latency numbers are reported with percentiles, not only averages.
The team agrees what degradation means before it is needed in the field.

Illustrative architecture note. Tune numbers and policies to your SoC, OS, and product SLA.