ARM CA7/CA15 big.LITTLE Memory Access Throughput Analysis with Cache Hit and Miss

It is desirable to build processors that deliver both high performance and low power, but the two are hard to combine in a single device. ARM Limited answers this requirement with its big.LITTLE architecture, which combines low-power processor cores and high-performance processor cores in one chip and dynamically switches work between them based on task demand.

The processors most commonly found in the current generation of big.LITTLE systems are the ARM Cortex A53 (A53) and the ARM Cortex A57 (A57). Both are 64-bit processors that implement the same architecture, but they differ greatly in microarchitecture complexity, performance, and power consumption.

A typical big.LITTLE system is shown below, with a Cache Coherent Interconnect (CCI) connecting two processor clusters: one of CA53 cores and one of CA57 cores, each containing four cores. The Level One (L1) cache sits inside each processor core, while the Level Two (L2) cache is per cluster, so the four cores in a cluster share one L2 cache. Cache coherence is achieved by extending the Advanced eXtensible Interface (AXI), the common bus interface in ARM devices, with three additional snoop channels (snoop address, snoop data, and snoop response); this extension is called the AXI Coherency Extensions (ACE). The key component tying the clusters together is the CCI, a separate interconnect block on the die that carries all inter-cluster communication and establishes and maintains cache coherency between the processors.

Here is a bigger picture of the SoC architecture. The "bus" below is normally an AXI-based NIC bus interconnect; the big.LITTLE processors access main memory through this "bus" and the "memory controller".

The basic concept of the caches and the CCI is this: when a processor core in a cluster tries to read a memory location, it first checks whether that location is already buffered (cached) in its local L1 cache. If so, there is an L1 hit and the processor simply reads from L1. If not, the transaction goes to cluster level, where the Snoop Control Unit (SCU, not shown in the diagram) checks for a hit in the cluster's L2 cache; on a hit, the data is read from L2. If not, the transaction goes to CCI level, and the CCI checks whether there is a hit in the other cluster's L2 cache and, failing that, in the L1 caches of the other cluster's cores. If nothing hits, the transaction leaves the CCI and travels through the "bus"/NIC and the memory controller, eventually reaching main memory.
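To make the lookup order concrete, here is a minimal Python sketch of the decision sequence just described. It is an illustrative model under simple assumptions (caches reduced to sets of cached addresses), not ARM's actual implementation.

    # Illustrative model of the read lookup order: local L1, local L2 (via the SCU),
    # a CCI snoop of the other cluster's caches, and finally main memory.

    def read(addr, local_cluster, remote_cluster):
        if addr in local_cluster["l1"]:              # 1. hit in the requesting core's L1
            return "L1 hit"
        if addr in local_cluster["l2"]:              # 2. SCU finds it in the cluster's shared L2
            return "L2 hit"
        if addr in remote_cluster["l2"] or addr in remote_cluster["l1"]:
            return "remote cache hit via CCI snoop"  # 3. CCI snoops the other cluster
        return "main memory access"                  # 4. miss everywhere: NIC bus + memory controller

    # Example: a line cached only in the big cluster is served by a CCI snoop.
    little = {"l1": set(),    "l2": set()}
    big    = {"l1": {0x1000}, "l2": set()}
    print(read(0x1000, little, big))                 # -> remote cache hit via CCI snoop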

The diagram below captures this concept. Steps #1-#4 show a read from a little processor with no hit in any cache. Steps #6-#9 show a read from a big processor that hits in a little processor's cache. Steps #10-#14 show a write to a cache, which causes invalidation of the copies held in the other caches, as sketched after the diagram.
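The invalidation in steps #10-#14 can be pictured with an equally simple sketch: before a core's write completes, the CCI snoops the other caches so stale copies are dropped and the writer ends up with the only valid copy. This is a hypothetical invalidate-on-write model of the idea, not the exact ACE transaction sequence.

    # Simplified invalidate-on-write: other caches drop their stale copies, then the
    # writer's cache holds the only valid (dirty) copy of the line.

    def write(addr, writer_cache, other_caches):
        for cache in other_caches:                   # CCI snoop-invalidates stale copies
            cache.discard(addr)
        writer_cache.add(addr)                       # writer now owns the line

    big_l1, little_l1 = {0x2000}, {0x2000}
    write(0x2000, big_l1, [little_l1])
    print(little_l1)                                 # -> set(): little cluster's copy invalidated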

As can be seen, memory access bandwidth can vary quite a bit depending on whether accesses hit or miss in the caches. At the design phase, how do we estimate the access throughput in these different situations and thus decide the clock rates of the processor, SCU/bus, CCI, AXI, and so on? We can use the spreadsheet below to do some simple estimation; here we use the first-generation CA7/CA15 big.LITTLE system as an example.
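A hedged back-of-the-envelope model in the spirit of the spreadsheet is sketched below: each level of the hierarchy is given a latency in cycles of its own clock domain, the per-level cost is weighted by an assumed hit rate, and the result is converted into an average line-fill throughput. Every number in the sketch (hit rates, cycle counts, clock rates, line size) is a placeholder assumption, not a measured CA7/CA15 figure.

    LINE_BYTES = 64                          # bytes moved per cache-line fill (assumed)

    # (level,                 hit rate, latency in cycles, clock of that level in Hz)
    levels = [
        ("L1 (core clock)",       0.90,   4, 1.2e9),
        ("L2 (SCU clock)",        0.06,  20, 0.6e9),
        ("remote cache via CCI",  0.02,  60, 0.6e9),
        ("main memory (DDR)",     0.02, 150, 0.4e9),
    ]

    # Average time per access = sum over levels of (hit rate * latency in seconds).
    avg_latency_s = sum(rate * cycles / hz for _, rate, cycles, hz in levels)

    # With one outstanding access at a time, throughput ~ line size / average latency.
    throughput_mb_s = LINE_BYTES / avg_latency_s / 1e6

    print(f"average access latency: {avg_latency_s * 1e9:.1f} ns")
    print(f"estimated line-fill throughput: {throughput_mb_s:.0f} MB/s")

Raising the clock of a congested level (say the CCI) shrinks its contribution to the average latency, which is exactly the kind of trade-off the spreadsheet is meant to expose.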

Next we will explain how the table above works. But first, here are the big.LITTLE slides from which the pictures above come.
