Andrew is co-founder and CEO of Cerebras Systems. He is an entrepreneur dedicated to pushing boundaries in the compute space. Prior to Cerebras, he co-founded and was CEO of SeaMicro, a pioneer of energy-efficient, high-bandwidth microservers. SeaMicro was acquired by AMD in 2012 for $357M. Before SeaMicro, Andrew was the Vice President of Product Management, Marketing and BD at Force10 Networks, which was later sold to Dell Computing for $800M. Prior to Force10 Networks, Andrew was the Vice President of Marketing and Corporate Development at RiverStone Networks from the company's inception through IPO in 2001. Andrew holds a BA and an MBA from Stanford University.
Cerebras Systems is building a new class of computer system, designed from first principles for the singular goal of accelerating AI and changing the future of AI work.
Could you share the genesis story behind Cerebras Systems?
My co-founders and I all worked together at a previous startup that my CTO Gary and I started back in 2007, called SeaMicro (which was sold to AMD in 2012 for $334 million). My co-founders are some of the leading computer architects and engineers in the industry – Gary Lauterbach, Sean Lie, JP Fricker and Michael James. When we got the band back together in 2015, we wrote two things on a whiteboard – that we wanted to work together, and that we wanted to build something that would transform the industry and be in the Computer History Museum, which is the equivalent of the Compute Hall of Fame. We were honored when the Computer History Museum recognized our achievements and added the WSE-2 processor to its collection last year, citing how it has transformed the artificial intelligence landscape.
Cerebras Systems is a team of pioneering computer architects, computer scientists, deep learning researchers, and engineers of all kinds who love doing fearless engineering. Our mission when we came together was to build a new class of computer to accelerate deep learning, which has emerged as one of the most important workloads of our time.
We realized that deep learning has unique, massive, and growing computational requirements. And it is not well matched to legacy machines like graphics processing units (GPUs), which were fundamentally designed for other work. As a result, AI today is constrained not by applications or ideas, but by the availability of compute. Testing a single new hypothesis – training a new model – can take days, weeks, or even months and cost hundreds of thousands of dollars in compute time. That is a major roadblock to innovation.
So the genesis of Cerebras was to build a new kind of computer optimized exclusively for deep learning, starting from a clean sheet of paper. To meet the enormous computational demands of deep learning, we designed and manufactured the largest chip ever built – the Wafer-Scale Engine (WSE). In creating the world's first wafer-scale processor, we overcame challenges across design, fabrication and packaging – all of which had been considered impossible for the entire 70-year history of computers. Every element of the WSE is designed to enable deep learning research at unprecedented speed and scale, powering the industry's fastest AI supercomputer, the Cerebras CS-2.
With every component optimized for AI work, the CS-2 delivers more compute performance in less space and with less power than any other system. It does this while radically reducing programming complexity, wall-clock compute time, and time to solution. Depending on workload, from AI to HPC, the CS-2 delivers hundreds or thousands of times more performance than legacy alternatives. The CS-2 provides the deep learning compute resources equivalent to hundreds of GPUs, while offering the ease of programming, management and deployment of a single device.
Over the past few months Cerebras seems to be all over the news. What can you tell us about the new Andromeda AI supercomputer?
We announced Andromeda in November of last year, and it is one of the largest and most powerful AI supercomputers ever built. Delivering more than 1 exaflop of AI compute and 120 petaflops of dense compute, Andromeda has 13.5 million cores across 16 CS-2 systems, and is the only AI supercomputer ever to demonstrate near-perfect linear scaling on large language model workloads. It is also dead simple to use.
By way of reminder, the largest supercomputer on Earth – Frontier – has 8.7 million cores. In raw core count, Andromeda is about one and a half times as large. It does different work, obviously, but this gives an idea of the scope: nearly 100 terabits of internal bandwidth, nearly 20,000 AMD EPYC cores feeding it, and – unlike the giant supercomputers which take years to stand up – we stood Andromeda up in three days, and immediately thereafter it was delivering near-perfect linear scaling of AI.
Argonne National Laboratory was our first customer to use Andromeda, and they applied it to a problem that was breaking their 2,000-GPU cluster called Polaris. The problem was running very large, GPT-3XL generative models while putting the entire COVID genome in the sequence window, so that each gene could be analyzed in the context of the entire COVID genome. Andromeda ran a unique genetic workload with long sequence lengths (MSL of 10K) across 1, 2, 4, 8 and 16 nodes, with near-perfect linear scaling. Linear scaling is among the most sought-after characteristics of a big cluster. Andromeda delivered 15.87X throughput across 16 CS-2 systems compared to a single CS-2, and a matching reduction in training time.
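To make "near-perfect linear scaling" concrete, here is a minimal sketch (using only the figures quoted above; nothing else is measured) of how scaling efficiency is typically computed from those numbers:

```python
# Minimal sketch: scaling efficiency from the figures quoted above.
# Ideal linear scaling on 16 systems would be a 16x speedup; Andromeda
# reportedly measured 15.87x. Efficiency = measured / ideal.

systems = 16
measured_speedup = 15.87     # throughput vs. a single CS-2, as quoted
ideal_speedup = systems      # perfect linear scaling

efficiency = measured_speedup / ideal_speedup
print(f"Scaling efficiency across {systems} CS-2s: {efficiency:.1%}")  # ~99.2%

# The same ratio implies training time shrinks almost in proportion:
# a job that takes T hours on one CS-2 takes roughly T / 15.87 on sixteen.
```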
Could you tell us about the partnership with Jasper that was unveiled in late November and what it means for both companies?
Jasper is a really interesting company. They're a leader in generative AI content for marketing, and their products are used by more than 100,000 customers around the world to write copy for marketing, ads, books, and more. It's obviously a very exciting and fast-growing space right now. Last year, we announced a partnership with them to accelerate adoption and improve the accuracy of generative AI across enterprise and consumer applications. Jasper is using our Andromeda supercomputer to train its profoundly computationally intensive models in a fraction of the time. This will extend the reach of generative AI models to the masses.
With the power of the Cerebras Andromeda supercomputer, Jasper can dramatically advance its AI work, including training GPT networks to fit AI outputs to all levels of end-user complexity and granularity. This improves the contextual accuracy of generative models and will enable Jasper to personalize content across multiple classes of customers quickly and easily.
Our partnership allows Jasper to invent the future of generative AI by doing things that are impractical or simply impossible with traditional infrastructure, and to accelerate the potential of generative AI, bringing its benefits to its rapidly growing customer base around the globe.
In a recent press release, the National Energy Technology Laboratory and the Pittsburgh Supercomputing Center announced the first-ever computational fluid dynamics simulation on the Cerebras Wafer-Scale Engine. Could you describe what exactly a wafer-scale engine is and how it works?
Our Wafer-Scale Engine (WSE) is the revolutionary AI processor for our deep learning computer system, the CS-2. Unlike legacy, general-purpose processors, the WSE was built from the ground up to accelerate deep learning: it has 850,000 AI-optimized cores for sparse tensor operations, massive high-bandwidth on-chip memory, and interconnect orders of magnitude faster than a traditional cluster could possibly achieve. Altogether, it gives you the deep learning compute resources equivalent to a cluster of legacy machines, all in a single device that is as easy to program as a single node – radically reducing programming complexity, wall-clock compute time, and time to solution.
Our second-generation WSE-2, which powers our CS-2 system, can solve problems extremely fast. Fast enough to allow real-time, high-fidelity models of engineered systems of interest. It is a rare example of successful "strong scaling", which is the use of parallelism to reduce solve time for a fixed-size problem.
And that's what the National Energy Technology Laboratory and the Pittsburgh Supercomputing Center are using it for. We just announced some really exciting results of a computational fluid dynamics (CFD) simulation, made up of about 200 million cells, running at near real-time rates. This video shows a high-resolution simulation of Rayleigh-Bénard convection, which occurs when a fluid layer is heated from the bottom and cooled from the top. These thermally driven fluid flows are all around us – from windy days, to lake-effect snowstorms, to magma currents in the earth's core and plasma motion in the sun. As the narrator says, it's not just the visual beauty of the simulation that's important: it's the speed at which we're able to calculate it. For the first time, using our Wafer-Scale Engine, NETL is able to manipulate a grid of nearly 200 million cells in nearly real time.
What kind of data is being simulated?
The workload tested was thermally driven fluid flow, also known as natural convection, which is an application of computational fluid dynamics (CFD). Fluid flows occur naturally all around us — from windy days, to lake-effect snowstorms, to tectonic plate movement. This simulation, made up of about 200 million cells, focuses on a phenomenon known as Rayleigh-Bénard convection, which occurs when a fluid is heated from the bottom and cooled from the top. In nature, this phenomenon can lead to severe weather events like downbursts, microbursts, and derechos. It's also responsible for magma movement in the earth's core and plasma motion in the sun.
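For readers unfamiliar with the physics, the onset of Rayleigh-Bénard convection is conventionally characterized by the dimensionless Rayleigh number. The short sketch below uses illustrative textbook values for a thin layer of water (assumed, not taken from the NETL simulation) and compares the result against the classical critical value of about 1708:

```python
# Minimal sketch (not the NETL/WFA code): when does Rayleigh-Bénard
# convection start? The Rayleigh number compares buoyancy forcing against
# viscous and thermal damping; above a critical value (~1708 for a fluid
# layer between rigid plates) convection cells form.

g = 9.81          # gravitational acceleration, m/s^2
beta = 2.07e-4    # thermal expansion coefficient of water, 1/K (assumed)
nu = 1.0e-6       # kinematic viscosity of water, m^2/s (assumed)
alpha = 1.43e-7   # thermal diffusivity of water, m^2/s (assumed)
depth = 0.01      # layer thickness, m (illustrative)
delta_T = 5.0     # bottom-minus-top temperature difference, K (illustrative)

Ra = g * beta * delta_T * depth**3 / (nu * alpha)
print(f"Rayleigh number: {Ra:.3g}")
print("Convection expected" if Ra > 1708 else "Conduction only")
```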
Back in November 2022, NETL introduced a new field equation modeling API (the WFA), powered by the CS-2 system, that was as much as 470 times faster than what was possible on NETL's Joule supercomputer. This means it can deliver speeds beyond what clusters of any number of CPUs or GPUs can achieve. Using a simple Python API that enables wafer-scale processing for much of computational science, the WFA delivers gains in performance and cost that cannot be obtained on conventional computers and supercomputers – in fact, it outperformed OpenFOAM on NETL's Joule 2.0 supercomputer by over two orders of magnitude in time to solution.
Thanks to the simplicity of the WFA API, these results were achieved in just a few weeks, and they continue the close collaboration between NETL, PSC and Cerebras Systems.
By transforming the speed of CFD (which has always been a slow, offline task) on our WSE, we can open up a whole raft of new, real-time use cases for this and many other core HPC applications. Our goal is that by enabling more compute power, our customers can run more experiments and produce better science. NETL lab director Brian Anderson has told us that this will dramatically accelerate and improve the design process for some really big projects NETL is working on around mitigating climate change and enabling a secure energy future — projects like carbon sequestration and blue hydrogen production.
Cerebras is consistently outperforming the competition when it comes to releasing supercomputers. What are some of the challenges behind building state-of-the-art supercomputers?
Ironically, one of the hardest challenges of big AI isn't the AI. It's the distributed compute.
To train today's state-of-the-art neural networks, researchers typically use hundreds to thousands of graphics processing units (GPUs). And it isn't easy. Scaling large language model training across a cluster of GPUs requires distributing a workload across many small devices, dealing with device memory sizes and memory bandwidth constraints, and carefully managing communication and synchronization overheads.
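To give a feel for those overheads, here is a back-of-the-envelope sketch of the gradient traffic each device must move every training step under plain data parallelism with a ring all-reduce. The model size, precision, and device count are assumptions chosen only for illustration, not measurements from any specific cluster:

```python
# Back-of-the-envelope sketch: per-step gradient traffic for plain
# data-parallel training with a ring all-reduce (generic estimate,
# not specific to any vendor's interconnect or software stack).

params = 1.3e9            # e.g. a GPT-3XL-class model (~1.3B parameters)
bytes_per_param = 2       # fp16 gradients (assumed)
devices = 64              # GPUs in the data-parallel group (assumed)

grad_bytes = params * bytes_per_param
# A ring all-reduce moves roughly 2*(D-1)/D of the gradient volume per device.
per_device_traffic = 2 * (devices - 1) / devices * grad_bytes

print(f"Gradient volume per step: {grad_bytes / 1e9:.1f} GB")
print(f"All-reduce traffic per device per step: {per_device_traffic / 1e9:.1f} GB")
# Every step pays this cost, so interconnect bandwidth and synchronization
# stalls quickly dominate as the cluster grows.
```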
We've taken a completely different approach to designing our supercomputers through the development of the Cerebras Wafer-Scale Cluster and the Cerebras Weight Streaming execution mode. With these technologies, Cerebras enables a new way to scale based on three key points (a conceptual sketch of the resulting execution flow follows the list):
The replacement of CPU and GPU processing by wafer-scale accelerators such as the Cerebras CS-2 system. This change reduces the number of compute units needed to achieve an acceptable compute speed.
To meet the challenge of model size, we employ a system architecture that disaggregates compute from model storage. A compute service based on a cluster of CS-2 systems (providing ample compute bandwidth) is tightly coupled to a memory service (with large memory capacity) that supplies subsets of the model to the compute cluster on demand. As usual, a data service serves up batches of training data to the compute service as needed.
An innovative model for the scheduling and coordination of training work across the CS-2 cluster that employs data parallelism, layer-at-a-time training with sparse weights streamed in on demand, and retention of activations in the compute service.
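As a rough mental model only — this is a simplified toy sketch under assumptions, not the actual Cerebras runtime — the disaggregated, layer-at-a-time execution described above can be pictured as a loop in which each layer's weights are streamed from a memory service to a compute service on demand, while activations stay resident on the compute side:

```python
import numpy as np

# Conceptual sketch of disaggregated, layer-at-a-time execution: weights
# live in a separate "memory service" and are streamed to the "compute
# service" one layer at a time; activations stay on the compute side.
# Hypothetical toy code, not the Cerebras software stack.

rng = np.random.default_rng(0)

class MemoryService:
    """Holds the full model; hands out one layer's weights on demand."""
    def __init__(self, layer_sizes):
        self.weights = [rng.standard_normal((m, n)) * 0.01
                        for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

    def fetch(self, layer_idx):
        return self.weights[layer_idx]          # streamed in on demand

class ComputeService:
    """Keeps only activations resident; never stores the whole model."""
    def forward(self, x, memory, num_layers):
        activations = x
        for i in range(num_layers):
            w = memory.fetch(i)                 # stream this layer's weights
            activations = np.maximum(activations @ w, 0.0)  # toy ReLU layer
        return activations

layer_sizes = [512, 1024, 1024, 10]             # toy model (assumed sizes)
memory = MemoryService(layer_sizes)
compute = ComputeService()

batch = rng.standard_normal((32, layer_sizes[0]))   # batch from the data service
out = compute.forward(batch, memory, num_layers=len(layer_sizes) - 1)
print(out.shape)                                 # (32, 10)
```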
There have been fears of the end of Moore's Law for close to a decade. How many more years can the industry squeeze out, and what kinds of innovations are needed?
I think the question we're all grappling with is whether Moore's Law – as written by Moore – is dead. It isn't taking two years to get more transistors. It's now taking four or five years. And those transistors aren't coming at the same price – they're coming in at vastly higher prices. So the question becomes: are we still getting the same benefits from moving from seven to five to three nanometers? The benefits are smaller and they cost more, and so the solutions become more complicated than just the chip.
Jack Dongarra, a leading computer architect, gave a talk recently and said, "We've gotten much better at making FLOPs than at making I/O." That's really true. Our ability to move data off-chip lags our ability to increase the performance on a chip by a great deal. At Cerebras, we were happy when he said that, because it validates our decision to make a bigger chip and move less stuff off-chip. It also provides some guidance on future ways to make systems built from chips perform better. There's work to be done, not just in wringing out more FLOPs but also in ways to move them and to move the data from chip to chip — even from very big chip to very big chip.
Is there anything else you would like to share about Cerebras Systems?
For better or worse, people often put Cerebras in this category of "the really big chip guys." We've been able to provide compelling solutions for very, very large neural networks, thereby eliminating the need to do painful distributed computing. I believe that's enormously interesting and at the heart of why our customers love us. The interesting area for 2023 will be how to do big compute to a higher degree of accuracy, using fewer FLOPs.
Our work on sparsity offers an extremely interesting approach. We don't do work that doesn't move us toward the goal line, and multiplying by zero is a bad idea. We'll be releasing a really interesting paper on sparsity soon, and I think there's going to be more effort looking at how we get to these efficient points, and how we do so for less power. And not just for less power in training: how do we cut the cost and power used in inference? I think sparsity helps on both fronts.
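To make the "multiplying by zero" point concrete, here is a tiny illustration of the general idea — a generic sketch with an assumed 90% sparsity level, not a description of Cerebras hardware or of the forthcoming paper — showing how the useful work in a matrix multiply shrinks when zero multiplies are skipped:

```python
import numpy as np

# Tiny illustration of why skipping multiplications by zero matters.
# Generic sketch of the idea, not a description of any particular hardware.

rng = np.random.default_rng(0)
m, k, n = 256, 256, 256
sparsity = 0.9                                  # 90% of weights are zero (assumed)

weights = rng.standard_normal((k, n))
weights[rng.random((k, n)) < sparsity] = 0.0    # prune to the target sparsity

dense_mults = m * k * n                         # every weight participates
useful_mults = m * np.count_nonzero(weights)    # only nonzero weights do real work

print(f"Dense multiplies:  {dense_mults:,}")
print(f"Useful multiplies: {useful_mults:,} "
      f"({useful_mults / dense_mults:.0%} of dense)")
# Hardware that can skip the zeros gets the same answer with ~10% of the FLOPs.
```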
Thank you for these in-depth answers; readers who wish to learn more should visit Cerebras Systems.