Nvidia rose to prominence on the strength of its PC graphics cards, but as the traditional PC market cools, the company has turned its attention to mobile. The Tegra line of ARM systems-on-a-chip (SoCs) from Nvidia has served as a testing ground for some awesome innovations in mobile computing, offering near desktop-class performance in some cases. The newly announced Tegra X1 makes history yet again by bringing Nvidia’s new desktop GPU architecture to mobile in record time. The big surprise was Nvidia’s decision to eschew its own custom “Denver” CPU cores in favor of ARM reference designs. Here’s what you need to know about Tegra X1.
A system-on-a-chip contains a variety of components including the CPU, GPU, memory controller, image signal processor, and more. The Tegra X1 uses 64-bit ARMv8 processing cores just like the second version of the Tegra K1 (the first version of the K1 had 32-bit ARMv7 cores). However, Nvidia went with standard ARM reference cores instead of its own custom “Denver” CPU cores as implemented in the second revision of the Tegra K1—that’s the chip that powers the Nexus 9.
Nvidia paid a pretty penny to ARM for an instruction set license that enabled it to build the custom 64-bit Denver CPU, so why stop using it so soon? According to Nvidia, this is all part of an Intel-style tick-tock hardware strategy. The last Tegra K1 was powered by two Denver CPU cores and was produced on a 28nm manufacturing process (a measurement of the relative size of the features on the chip). With Tegra X1, Nvidia wanted to move to a smaller chip process, in this case 20nm. The smaller manufacturing process allows for more transistors on the chip and less power use at a given level of performance. Thermal design power (TDP) is crucial when you’re talking about mobile devices.
We suspect Nvidia didn’t use its own CPU core design because the Denver architecture isn’t yet ready for the 20nm process. The next version of Tegra is code named “Parker,” and it may feature the follow-on to the Denver CPU cores, but no one knows exactly what Nvidia is up to right now. Previous roadmaps pointed to a 16nm process being adopted for Parker, but it sounds like that has been pushed back.
This is actually the same thing Nvidia did last year with the K1. We first got a quad-core chip using standard ARM CPU cores and a Kepler GPU (that’s the chip in the Shield Tablet), then a few months later the 64-bit Denver cores were paired with Kepler in a newer version of the K1 (that’s the chip in the Nexus 9). So Nvidia has essentially promised that a 20nm Denver variant will come in a future version of the X1, but for the time being we’re looking at perfectly capable, if standard, licensed ARM cores.
The Tegra X1 packs four Cortex-A57 cores (the “big” cores) and four Cortex-A53 cores (the “little” cores). The big cores are fast but use more power, while the little cores are great for background processing and are much more power-efficient. Most chips with this eight-core configuration tie the two clusters together using a system from ARM called big.LITTLE. The newest version of this technology moves data between the two CPU islands with so-called global task scheduling. With global task scheduling (sometimes called heterogeneous multi-processing), the OS can run any mix of the eight big and little cores at once.
Rather than using ARM’s method for controlling all eight cores, Nvidia is using cluster migration with a custom cache coherence system to shuffle data between the two islands. Under this model, the OS scheduler only sees one cluster (either big or little) at a time.
So what does all that mean? The Tegra X1 only runs processes on one set of cores at a time, but data can be moved back and forth between the big power-hungry cores and the small power-efficient ones. Cluster migration is typically less efficient than global task scheduling, but Nvidia says its custom interconnect has vastly improved the power efficiency of cluster management.
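The cluster-migration model can be sketched in a few lines. To be clear, this is an illustrative toy, not Nvidia’s actual policy: the function name, the thresholds, and the hysteresis values are all assumptions made up for this example.

```python
# Toy model of cluster migration: the OS scheduler sees exactly one
# four-core cluster at a time, and the entire workload migrates between
# the big (Cortex-A57) and little (Cortex-A53) islands when load crosses
# a threshold. The 0.75/0.30 thresholds are invented illustrative values.

def active_cluster(load, current, up=0.75, down=0.30):
    """Choose the cluster for the next interval given a normalized load (0..1)."""
    if current == "little" and load > up:
        return "big"     # heavy load: migrate everything to the fast cores
    if current == "big" and load < down:
        return "little"  # light load: migrate everything to the efficient cores
    return current       # hysteresis: between the thresholds, stay put

cluster = "little"
for load in (0.10, 0.50, 0.90, 0.80, 0.20, 0.10):
    cluster = active_cluster(load, cluster)
    print(f"load={load:.2f} -> running on the {cluster} cluster")
```

Contrast this with global task scheduling, where the scheduler would see all eight cores at once and could mix big and little cores within the same interval.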
This isn’t entirely new territory for Nvidia, as it was one of the first SoC makers to devise a system for using a low-power core in conjunction with traditional full-power CPU cores. The company originally called its design the “companion core” before renaming it to the much more boring “4-PLUS-1.” This tech debuted in the Tegra 3 and paired a single low-speed Cortex-A9 with four standard A9s. Nvidia’s engineers have probably learned a lot about ARM cores in the last few years, so hopefully this custom multi-processing setup is better.
Along with the new 64-bit CPU cores, Tegra X1 packs a brand new GPU based on Nvidia’s Maxwell architecture. This is not the first time Nvidia has used its desktop GPU architecture in Tegra: The Tegra K1 implemented a version of Kepler, but that came two years after the first desktop reference cards were out. Maxwell is arriving on mobile in half the time. Nvidia says this is thanks to a refocusing on mobile within the company. Kepler was designed for desktops and then ported to mobile SoCs. Maxwell, on the other hand, was designed from the ground up with a mobile implementation in mind.
Nvidia’s use of licensed ARM CPU cores makes that element of the Tegra X1 very similar to other mobile processors arriving in early 2015. The GPU has to be the differentiator, and Nvidia knows it. Maxwell on mobile supports Unreal Engine 4, DirectX 12, OpenGL 4.5, CUDA, and OpenGL ES 3.1. The Maxwell desktop parts were lauded for their high power efficiency, and that carries over to the mobile version.
A quick look at the specs of the X1’s GPU shows it’s a big leap over the K1’s. Tegra X1’s GPU has far more CUDA cores (256 vs. 192), two geometry units instead of one, 16 texture units (up from 8 in the K1), 16 ROPs (up from only 4), and a big jump in memory bandwidth to 25.6 GB/s (up from 17 GB/s).
There are a few design changes that account for all the gains, but right at the top of the list is the move to 20nm process technology for the GPU. The X1 also moves to LPDDR4 memory and employs a new type of end-to-end memory compression, which allows Nvidia to stick with a memory bus width of 64 bits. This should reduce performance bottlenecks due to memory bandwidth, a common issue with mobile GPUs.
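The 25.6 GB/s figure falls straight out of the bus width if you assume an LPDDR4-3200 transfer rate (the 3200 MT/s rate is our assumption; Nvidia quotes only the bandwidth and the 64-bit bus):

```python
# Back-of-envelope check: peak bandwidth = transfer rate x bus width in bytes.
# LPDDR4-3200 (3200 MT/s) is assumed for illustration.
transfers_per_second = 3200e6       # 3200 million transfers per second (assumed)
bus_width_bytes = 64 / 8            # 64-bit bus -> 8 bytes per transfer
bandwidth_gb_s = transfers_per_second * bus_width_bytes / 1e9
print(bandwidth_gb_s)               # -> 25.6
```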
The 256 CUDA cores in Nvidia’s mobile Maxwell are impressive, but that’s still just a fraction of what you’ll find in a desktop Maxwell design. However, Tegra X1 has one feature those more powerful PC chips don’t: Tegra X1’s CUDA cores can be used to accelerate some floating point operations by a wide margin, specifically the low-precision FP16 operations.
Maxwell features only FP32 and FP64 CUDA cores. On Kepler, each FP16 operation ran alone on an FP32 core, which wasted some capacity. Maxwell in the X1 is able to fuse two FP16 operations of the same type (addition, subtraction, and so on) and run them together on a single FP32 core. Android display drivers and game engines make heavy use of FP16 operations, so this could be a real help.
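A simple way to see the benefit is to count FP32 issue slots with and without pairing. This is a counting sketch, not Nvidia’s scheduler; the function, the slot model, and the op mix below are invented for illustration.

```python
from collections import Counter

def fp32_slots_needed(ops, pair_fp16=True):
    """Count FP32 issue slots for a list of (precision, op) items.
    With pairing, two FP16 ops of the same type share one FP32 slot."""
    slots = 0
    fp16_by_op = Counter()
    for precision, op in ops:
        if precision == "fp32" or not pair_fp16:
            slots += 1                      # FP32 work always takes a full slot
        else:
            fp16_by_op[op] += 1             # FP16 work is grouped by op type
    for count in fp16_by_op.values():
        slots += (count + 1) // 2           # each same-type FP16 pair fuses into one slot
    return slots

# An invented mix: four FP16 adds, three FP16 multiplies, two FP32 adds.
ops = [("fp16", "add")] * 4 + [("fp16", "mul")] * 3 + [("fp32", "add")] * 2
print(fp32_slots_needed(ops, pair_fp16=False))  # Kepler-style: 9 slots
print(fp32_slots_needed(ops, pair_fp16=True))   # Maxwell-style: 6 slots
```

With an FP16-heavy workload, pairing approaches double the throughput per FP32 core; the odd operation left over from each op type still costs a full slot.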
Efficiency and Implementations
When Nvidia announced the Tegra X1, it talked up the efficiency of the chip, lauding the process improvements and new GPU tweaks mentioned above. But how efficient is it? We were told that Tegra X1 hits peak power consumption of 10 watts when rendering an Unreal Engine 4 demo. That’s certainly impressive when you consider the Xbox One needs about 100 watts to do the same thing. However, 10 watts is still way too much for a tablet—most tablet processors consume maybe half that at the very top end.
The Tegra X1’s power envelope really speaks to the wide variety of applications Nvidia envisions. While the 10W number is very impressive in one context, it doesn’t tell the whole story. According to Nvidia, the power consumption in a tablet powered by Tegra X1 will be on par with Tegra K1. In fact, idle power consumption will be even lower thanks to the various architecture improvements. Tegra K1 was designed to operate at around 5-8 watts, with infrequent peaks up to 11 watts when running stressful benchmarks, so the X1 will be well within the realm of tablet power requirements.
There are two official Tegra X1 implementations so far, and neither is for phones or tablets. Rather, they’re part of Nvidia’s new DRIVE platform. The DRIVE PX is a system for fully automated self-driving cars. It’s powered by two Tegra X1 chips and can accept live feeds from up to 12 cameras. The more modest DRIVE CX is a single Tegra X1 chip for infotainment systems, capable of driving high-resolution displays and digital instrument clusters.
The DRIVE applications are at the high-end of what Tegra X1 can do. It will also find its way into tablets and embedded systems just like the Tegra K1 did. In those situations it won’t be able to pull 10 watts of juice, but even at power levels similar to other SoCs, the Tegra X1 will be considerably faster. Early benchmarks suggest the X1 will best the Snapdragon 810 by about 15%, but benchmarks don’t tell the whole story. While impressive on paper, Tegra chips have had trouble getting traction in the market.
There were only a handful of devices that ran the Tegra K1 (both variants), and one of the most notable was made by Nvidia itself—the Shield Tablet. Google’s experimental Project Tango tablet also runs a Tegra K1, but its most prominent use is in the Nexus 9. Is Tegra X1 going to be any different? It’s hard to say, but the DRIVE platform will at least ensure there will be more in-house applications for Tegra going forward. A next-generation Shield Portable or Tablet seems like a safe bet, too.
Even if Tegra X1 does find its way into more tablets and embedded devices this time around, you’re going to see a lot more of the Snapdragon 810. That chip is fully compatible with Qualcomm’s market-leading LTE modems, and the power envelope is better optimized for phones. A Tegra X1 will probably be restricted to tablets that can handle a higher-wattage chip, but it will be fantastic for gaming with that Maxwell-based GPU.