Some Considerations of ASIC Clock CTS and STA

Let’s say a design has three clocks, clk1, clk2, and clk3, all synchronous to each other. The blog “Turing CTS Recipe” mentions that during CTS (clock tree synthesis) we sometimes want to treat, for example, clk1 and clk2 as synchronous to each other while treating clk3 as asynchronous to both.

 

The reason is that the clk1 and clk2 domains may be placed close to each other while the clk3 domain is placed far away from them. In the diagram above, the three clocks enter from the left side, so clk1 and clk2 have small clock insertion delays while clk3 can have a large one. If we declare all three clocks synchronous during CTS, the tool will add delay to clk1 and clk2 to match the delay of clk3, which is undesirable. Similarly, if the clocks enter from the right side, the clk3 insertion delay is increased to match those of clk1 and clk2, which is not desirable either.
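To make the cost concrete, here is a minimal Python sketch with made-up insertion delays. The numbers and the helper name `padding_needed` are illustrative assumptions, not taken from the blog or from any CTS tool. It computes how much delay a balancing step would have to add to each clock so that every clock in a group matches the slowest one in that group.

```python
def padding_needed(insertion_delay_ps, groups):
    """Return the delay (ps) each clock must gain so that every clock in a
    group matches the largest insertion delay within that group."""
    padding = {}
    for group in groups:
        target = max(insertion_delay_ps[clk] for clk in group)
        for clk in group:
            padding[clk] = target - insertion_delay_ps[clk]
    return padding


# Hypothetical natural insertion delays in ps: the clk3 domain sits far from the clock source.
delays = {"clk1": 400, "clk2": 500, "clk3": 1500}

# All three clocks balanced together versus clk3 balanced on its own.
all_sync = padding_needed(delays, [["clk1", "clk2", "clk3"]])
split = padding_needed(delays, [["clk1", "clk2"], ["clk3"]])

print(all_sync)  # {'clk1': 1100, 'clk2': 1000, 'clk3': 0}
print(split)     # {'clk1': 100, 'clk2': 0, 'clk3': 0}
```

With these assumed numbers, balancing all three clocks together pads clk1 and clk2 by about a nanosecond each, while balancing clk3 separately leaves only the small clk1/clk2 mismatch to fix.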

In the following two cases,

  1. There are not many paths between clk3 and clk1/clk2
  2. There are many paths between clk3 and clk1/clk2, but they meet timing easily, e.g. because they have little combinational logic

it is preferable to make clk3 asynchronous to clk1/clk2 during CTS to avoid the insertion delay problem above. But the blog does not say how to get the STA tool to check timing on the paths between clk3 and clk1/clk2, which are still synchronous by design. In timing constraints it is common to declare two clock domains synchronous and then mark some paths between them as async/false paths, but not the other way around.

Here is the trick. Specify clk3 as asynchronous to clk1/clk2 during CTS so that the clock trees are optimized. In the STA constraints, however, specify clk3 as synchronous to clk1/clk2. STA will then check the interface timing, and violations are fixed locally with the clock trees left untouched.
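A rough way to see what STA then has to deal with: once the clk3 tree is built independently, the latency difference between it and the clk1/clk2 trees enters the interface timing directly. The sketch below uses the textbook single-cycle setup and same-edge hold slack formulas with hypothetical numbers. The function names and all values are assumptions for illustration, and using one data delay for both checks is a simplification.

```python
def setup_slack_ps(period, launch_latency, capture_latency, data_delay, setup_time):
    """Single-cycle setup slack: required time minus arrival time (all in ps)."""
    arrival = launch_latency + data_delay
    required = period + capture_latency - setup_time
    return required - arrival


def hold_slack_ps(launch_latency, capture_latency, data_delay, hold_time):
    """Same-edge hold slack: arrival time minus required time (all in ps)."""
    arrival = launch_latency + data_delay
    required = capture_latency + hold_time
    return arrival - required


# Hypothetical numbers: clk1 tree latency 400 ps, clk3 tree latency 1500 ps.
PERIOD, CLK1_LAT, CLK3_LAT = 2000, 400, 1500
DATA, SETUP, HOLD = 600, 50, 30

# clk3 launches, clk1 captures: the late launch clock eats into the setup budget.
setup_3_to_1 = setup_slack_ps(PERIOD, CLK3_LAT, CLK1_LAT, DATA, SETUP)
# clk1 launches, clk3 captures: the late capture clock makes hold harder.
hold_1_to_3 = hold_slack_ps(CLK1_LAT, CLK3_LAT, DATA, HOLD)

print(setup_3_to_1)  # 250
print(hold_1_to_3)   # -530
```

In this made-up case the clk3-to-clk1 setup path still passes but has lost 1100 ps to skew, and the clk1-to-clk3 hold path is negative and would need extra delay on the data path. Both are handled on the interface paths rather than by rebalancing the clock trees.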

 

The blog gives a good trick of employing clock ordering during CTS. It says the functional clock is faster and the DFT scan clock is slower, so we can let CTS build the clock tree for the faster functional clock first and then build the tree for the slower scan clock with dont_touch_subtree specified. That way the functional clock tree is not affected when the scan clock tree is built. Good technique. One caveat: the scan clock comes from the OCC (on-chip clock controller) and combines the shift clock and the capture clock. The shift clock is normally slower than the functional clock, but for at-speed test the capture clock must run at least at the functional clock rate, if not faster. The trick is still valid, though. You can think of it as having a clock mux between two functional clocks.

 

The blog also mentions that when a clock is used as data, the start of the corresponding data path must be explicitly marked as “exclude_pin” so that CTS excludes everything beyond it from clock tree balancing. This is very important, yet it is still violated a lot of the time. If the pin is not marked, CTS can insert quite a bit of delay on the clock tree while trying to balance it.

But the blog does not list the possible cases of “clock used as data”. One case, as drawn above, is a clock connected to a register D pin. Why would a designer make such a connection? Possibly to let firmware read the clock state for debug, to see whether the clock toggles. (This is a bad design, though. For this purpose one can use a divide-by-2 circuit to check whether the clock toggles.) Another common case is a clock connected to a debug bus that eventually goes off the chip or FPGA. This case is also treated as “clock used as data” in CTS/STA.

 

Another blog on the same site, False Path vs Case Analysis vs Disable Timing, states “both case analysis and disable timing result in fewer timing paths to be analyzed. False path still tries to fix the design rule (max cap, max transition and max fanout) violations”. Good point. As a matter of fact, backend teams normally prefer to use an MCP (multicycle path) instead of a false path where possible.

 

Another good blog on the same site, Common Path Pessimism, states “Ideally speaking, for setup analysis, we would like to take the +5% derated value of the delay of these buffers while considering launching path and -5% derated value while considering the capture path. However, here lies the catch! How can the same buffer or set of buffers be derated differently for launch and capture? Recall from the definition of OCV that it is the intra-chip variation in PVT that STA engineers consider them in the first place.”

 

Actually, this statement is not accurate for the setup timing check. In general, how can a piece of logic, the common clock path in this case, have both a longer-than-normal delay (positive derate) and a shorter-than-normal delay (negative derate) at the same time? It cannot. But for the setup check, the launching clock edge and the capturing clock edge are NOT the same clock edge. In other words, they do not occur at the same time, so the common clock path buffers can have a longer-than-normal delay (positive derate) at the launching edge and a shorter-than-normal delay (negative derate) at the capturing edge. They are still the same clock cells, though, so some of the delay difference can be removed, but certainly not all of it. This is Common Path Pessimism Removal for setup.

Hold timing is different. The launching clock edge and the capturing clock edge are the SAME clock edge, so most of the delay difference on the common clock path can be removed.
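The arithmetic behind this argument can be sketched in a few lines of Python. The common path delay, the function names, and especially the `removable_fraction` values are assumptions chosen only to illustrate the reasoning above. How much pessimism a real STA tool removes for setup and for hold depends on the tool and its settings.

```python
def common_path_pessimism_ps(common_delay_ps, late_pct, early_pct):
    """Gap between the late-derated and early-derated delay of the shared clock path."""
    return common_delay_ps * (late_pct - early_pct) / 100


def cppr_credit_ps(pessimism_ps, removable_fraction):
    """Portion of the pessimism given back to the slack (fraction is an assumption)."""
    return pessimism_ps * removable_fraction


COMMON_DELAY_PS = 1000  # hypothetical delay of the shared clock buffers

# +5% on the late path, -5% on the early path, as in the quoted blog.
pessimism = common_path_pessimism_ps(COMMON_DELAY_PS, 5, -5)

# Hold: same clock edge, so the sketch removes the whole gap (an upper bound).
hold_credit = cppr_credit_ps(pessimism, 1.0)
# Setup: different clock edges, so only part is removed; 0.5 is a placeholder.
setup_credit = cppr_credit_ps(pessimism, 0.5)

print(pessimism, hold_credit, setup_credit)  # 100.0 100.0 50.0
```

The point is only that the 100 ps gap on the common path is the same in both checks, while the share of it that can be credited back differs between hold and setup.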

 

 

 