So, chiplet designs are a way around the issue of small defects making a big chip unusable: split the big chip into several chiplets. But chiplet-based chips can currently only be properly tested once all the chiplets have been assembled into a package - just like monolithic chips. And if one of the chiplets turns out to be bad, the whole package has to be thrown out anyway. So to really benefit from chiplets, the industry has to figure out how to test them before they are assembled.
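To put rough numbers on the yield argument (the defect density and die areas below are made up, just to illustrate the standard exponential yield model - they're not figures from the article):

    import math

    def die_yield(area_cm2, defects_per_cm2):
        # Poisson model: probability a die of this area has zero killer defects
        return math.exp(-area_cm2 * defects_per_cm2)

    D = 0.1                        # assumed defect density, defects/cm^2 (made up)
    big_area = 8.0                 # one big monolithic die, cm^2 (made up)
    chiplet_area = big_area / 4    # same design split across 4 chiplets

    y_big = die_yield(big_area, D)
    y_chiplet = die_yield(chiplet_area, D)

    print(f"monolithic die yield:           {y_big:.1%}")
    print(f"single chiplet yield:           {y_chiplet:.1%}")
    # If chiplets can only be tested after assembly, all 4 must turn out good,
    # which lands you right back at the monolithic yield:
    print(f"package of 4 untested chiplets: {y_chiplet ** 4:.1%}")
    # Testing before assembly ("known good die") is what recovers the advantage:
    # you only ever package chiplets that already passed.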
Chips these days are generally designed to be tested before they're packaged - maybe not at speed but usually some scan vectors and a sanity check for current usage.
Time spent on a tester (a machine costing ~$1M+) is expensive, and so is packaging a chip - so typically one does a quick sanity test before packaging and a longer test afterwards.
For chiplets one is more likely to do more early testing, because the packaging is likely to be more expensive (along with all the other dies in the package).
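A back-of-the-envelope version of that tradeoff - every cost and yield number below is invented, just to show the shape of the calculation:

    # All numbers made up for illustration.
    die_cost     = 20.00   # die cost coming off the wafer
    package_cost = 10.00   # assembling one package
    quick_test   = 0.50    # cheap wafer-level sanity test
    full_test    = 5.00    # long at-speed test after packaging
    die_yield    = 0.80    # fraction of dies that are actually good
    quick_catch  = 0.90    # fraction of bad dies the quick test catches

    def cost_without_pretest():
        # Package and fully test everything; bad dies waste a package and a full test.
        return (die_cost + package_cost + full_test) / die_yield

    def cost_with_pretest():
        # Quick-test everything; only dies that pass get packaged and fully tested.
        pass_rate = die_yield + (1 - die_yield) * (1 - quick_catch)
        spend = die_cost + quick_test + pass_rate * (package_cost + full_test)
        return spend / die_yield   # assumes the quick test never rejects a good die

    print(f"no pre-package test : ${cost_without_pretest():.2f} per good part")
    print(f"with quick pre-test : ${cost_with_pretest():.2f} per good part")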
Hmm... The article is a quick back-of-the-envelope summary of the problems the industry hires people to solve.
Where it points out a problem, read it as "you could be hired to work on X", not as "companies are unable to do X".
Of course, it is also a submarine marketing piece from Cerebras, so it has an incentive to imply that the problems with large chips are solved and the ones with small chips are unsolvable. Just notice that it never actually says this; you just come away with that impression without being able to point to where it came from.
> But these chiplet chips can only be properly tested when all the chiplets have been assembled into a package
Why is that, though? Could you have a physical test harness for a chiplet that mocks out the others, sends input to your chiplet, and makes sure it's operating correctly? Maybe that's harder than it sounds - I'm coming at this from a software perspective.
It’s very easy; this kind of testing is crucial for hardware (in a way that it is not for software).
At the most basic level this is done through scan, which allows you to test every single wire in the design for several faults (stuck at 0 and stuck at 1 being the most important). This covers the vast majority of failures (for example missing vias and missing contacts). For memory other techniques are needed (reading and writing fixed patterns) and various ‘repairs’ can be done by blowing fuses to translate faulty addresses.
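If it helps the software folks, here is a toy sketch of the stuck-at idea. Real scan/ATPG works on gate-level netlists with scan chains and millions of nets; this made-up two-input circuit only shows what a stuck-at-0/stuck-at-1 fault is and what it means for a test pattern to detect one:

    # Toy circuit: out = (a AND b) OR (NOT a)

    NETS = ["a", "b", "and_ab", "not_a", "out"]

    def evaluate(a, b, fault=None):
        """Evaluate the circuit, optionally forcing one net to a stuck value."""
        def net(name, value):
            if fault and fault[0] == name:
                return fault[1]            # this wire is stuck at 0 or at 1
            return value
        a_v    = net("a", a)
        b_v    = net("b", b)
        and_ab = net("and_ab", a_v & b_v)
        not_a  = net("not_a", 1 - a_v)
        return net("out", and_ab | not_a)

    patterns = [(a, b) for a in (0, 1) for b in (0, 1)]   # exhaustive, it's tiny

    # A pattern "detects" a fault if the faulty circuit's output differs
    # from the fault-free output for that input.
    for wire in NETS:
        for stuck in (0, 1):
            detecting = [p for p in patterns
                         if evaluate(*p) != evaluate(*p, fault=(wire, stuck))]
            print(f"{wire} stuck-at-{stuck}: detected by {detecting or 'nothing'}")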
So for example you might do the scan test (the first one) at wafer level, before the dies are even cut, then assume that the majority of dies will have repairable (rather than irreparable) memory errors and only test the memories after packaging.
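Here is a rough software analogue of that memory test-and-repair step. The patterns, spare count, and the remap table (standing in for blown fuses) are all illustrative:

    class RepairableMemory:
        """Remap faulty rows to spare rows; in silicon the remap is burned into fuses."""

        def __init__(self, rows, spares, bad_rows=()):
            self.rows = rows
            self.bad_rows = set(bad_rows)     # physical defects, unknown to the tester
            self.spare_pool = list(range(rows, rows + spares))
            self.remap = {}                   # faulty logical row -> spare row
            self.cells = {}

        def _physical(self, row):
            return self.remap.get(row, row)

        def write(self, row, value):
            phys = self._physical(row)
            # an unrepaired defective row silently corrupts whatever you write
            self.cells[phys] = 0 if phys in self.bad_rows else value

        def read(self, row):
            return self.cells.get(self._physical(row), 0)

        def test_and_repair(self):
            """March-style test with fixed patterns; remap every row that fails."""
            for row in range(self.rows):
                for pattern in (0x55, 0xAA):
                    self.write(row, pattern)
                    if self.read(row) != pattern:
                        if not self.spare_pool:
                            return False      # out of spares: irreparable, scrap the die
                        self.remap[row] = self.spare_pool.pop(0)
                        break
            return True

    mem = RepairableMemory(rows=8, spares=2, bad_rows={3})
    print("repaired OK:", mem.test_and_repair())
    mem.write(3, 0x42)
    print("row 3 after repair:", hex(mem.read(3)))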
Never made a chiplet, but as a general observation in hardware ...
Digital hardware acts predictably only when everything is electrically matched and connected properly. And the integration processes can introduce new defects. It's probably still an analog and materials problem at that point.
Huh, can they not just disable the malfunctioning chiplets and sell them as cheaper models with fewer cores/cache? Sort of what nVidia and AMD did with their GPUs. AMD has kinda been focusing on the mid and high end range with Ryzen, which makes sense given the demand, so the only actually cheap CPUs these days are from Intel.
Given both the demand and AMD/TSMC's apparently higher-than-expected yields, you don't really want to cut down working dies; if their yield is high, there's less silicon to turn into those cut-down cheaper models.
From what I heard, AMD essentially makes a few variants of CCXes, bins them, and sells the bins as different models, mixing and matching in order to increase total yield.
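Roughly something like this - the SKU names and thresholds below are invented, not actual AMD bins, which are far more granular (voltage, leakage, per-core frequency, cache defects, ...):

    SKUS = [                 # (name, min good cores, min sustained clock in GHz)
        ("flagship", 8, 4.7),
        ("midrange", 8, 4.2),
        ("budget",   6, 3.8),   # dies with a defective core get cores fused off
    ]

    def bin_die(good_cores, clock_ghz):
        # Slot each tested die into the best SKU it qualifies for.
        for name, min_cores, min_clock in SKUS:
            if good_cores >= min_cores and clock_ghz >= min_clock:
                return name
        return "scrap"

    # (good cores, sustained clock) as reported by the tester for each die
    for cores, clock in [(8, 4.8), (8, 4.3), (7, 4.6), (6, 4.0), (5, 4.9)]:
        print(f"{cores} cores @ {clock} GHz -> {bin_die(cores, clock)}")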
They may be making a finer point here (I'm not an expert on packaging), but generally speaking there are packaging processes that support using only "known good die". You still have the issue of testing the final integrated product, so the packaging process itself needs to have high yield.
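A quick made-up calculation of why the packaging yield matters even with known good die - the per-die assembly loss compounds with the number of chiplets in the package (both probabilities below are assumptions for illustration):

    def packaged_yield(n_chiplets, kgd_confidence=0.99, assembly_yield_per_die=0.995):
        # kgd_confidence: probability a die that passed pre-assembly test really is good
        # assembly_yield_per_die: probability each die survives being attached
        return (kgd_confidence * assembly_yield_per_die) ** n_chiplets

    for n in (2, 4, 8, 12):
        print(f"{n:2d} known-good chiplets per package -> {packaged_yield(n):.1%}")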
For more info check out Dylan Patel’s series on advanced packaging:
I think this is overlooking two of the most important benefits of multi-chip modules: re-using engineering work and simpler supply chains. Once AMD has designed their eight-core die they've got a design they can re-use between the cheapest desktop CPU without integrated graphics and the highest-end server part. And, modulo binning, you can reallocate your dies between market segments as demand dictates. You've still got different IO dies for different segments, but those are easier to design and are on less in-demand process nodes.