Monday, May 25, 2020

What is Tesla's Project Dojo?


Tesla has made significant investments in artificial intelligence (AI). AI is the key to Tesla's full self-driving (FSD) future. Yet Elon Musk has also called AI humanity's “biggest existential threat.” How do you reconcile this dichotomy? The answer is simple: narrow AI vs. general AI. A narrow AI is trained for a particular task, such as playing a particular game or processing language. These narrow intelligences are not transferable. A narrow chess AI will not know anything about checkers, despite the two games sharing a board. A general AI (sometimes called strong AI or artificial general intelligence (AGI)), by contrast, is a hypothetical system able to learn any intellectual task that a human could learn. Skills an AGI learned in one arena could be applied in new areas, and an artificial superintelligence could quickly develop. An artificial superintelligence might find that humans are irrelevant or, worse, a threat. This is the “existential threat” that concerns Musk.

So Tesla's FSD system will be a narrow AI. It will be able to drive your car, and you'll even be able to tell it where you'd like to go. You won't, however, be able to chat with the FSD AI about your day, but at least you'll know it won't decide that the best way to reduce traffic accidents is to kill all humans.


Tesla's AI investments to date include creating an AI software development and validation team, creating a data labeling team, and creating an FSD hardware team to design their own custom neural network inference engine. Next on Tesla's AI investment list is "Project Dojo."


Project Dojo

We've been given a few hints about Dojo: Musk talked about it in the 2019 financial call and Tesla's Director of Artificial Intelligence and Autopilot, Andrej Karpathy, has talked about it at multiple AI conferences. We'll discuss how neural nets work and then move into some wild speculation; but first, we have to acknowledge the Dad Joke that is the name Project Dojo. We know that Project Dojo is intended to vastly improve the Autopilot Neural Network training. If you want to train, where do you go? A Dojo, of course. 



Before we get into Dojo we need to cover a few basics about neural networks. There are two fundamental phases to neural networks (NN): Training and Inference.

Training

NNs have to be trained, and training is a massive undertaking. This is when the digital ocean of data that is the training dataset must be digested. It can take terabytes of data and quintillions of floating-point operations to train a complex NN. Through training, the NN forms "weights," the numerical strengths of the connections between its nodes. When the training is complete, the resulting NN is tested. A test dataset that was not part of the training dataset, and for which the expected results are known, is thrown at the resulting network; if the NN is properly trained, it infers the correct answer for nearly every test. Depending on the use case, there may be several stages of simulation and testing before the NN is deployed. Since Project Dojo is all about training, we'll dig more into this later. Deploying the NN leads us to our next phase, Inference.
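To make the train-then-test idea concrete, here is a minimal sketch in Python. It is nothing like a real driving network: it assumes a single artificial "neuron" with two weights and a made-up toy task (is x0 + x1 greater than zero?). All of the names (make_dataset, train, accuracy) are hypothetical and chosen only for illustration.

```python
import math
import random

def make_dataset(n, rng):
    """Toy labeled data (an assumption): class 1 when x0 + x1 > 0, else class 0."""
    data = []
    for _ in range(n):
        x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        data.append((x, 1 if x[0] + x[1] > 0 else 0))
    return data

def predict(weights, bias, x):
    """One 'neuron': a weighted sum of the inputs squashed into the range (0, 1)."""
    z = weights[0] * x[0] + weights[1] * x[1] + bias
    return 1.0 / (1.0 + math.exp(-z))

def train(train_set, epochs=50, lr=0.5):
    """Stochastic gradient descent: nudge the weights after every example."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, label in train_set:
            error = predict(weights, bias, x) - label
            weights[0] -= lr * error * x[0]
            weights[1] -= lr * error * x[1]
            bias -= lr * error
    return weights, bias

def accuracy(weights, bias, dataset):
    """Fraction of examples where the thresholded prediction matches the label."""
    correct = sum((predict(weights, bias, x) > 0.5) == bool(label)
                  for x, label in dataset)
    return correct / len(dataset)

rng = random.Random(0)
train_set = make_dataset(200, rng)  # digested during training
test_set = make_dataset(100, rng)   # held out: never seen during training

weights, bias = train(train_set)
test_accuracy = accuracy(weights, bias, test_set)
print(f"learned weights: {weights}, bias: {bias}")
print(f"accuracy on held-out test set: {test_accuracy:.2f}")
```

The important part is the split: the weights are shaped only by train_set, and the score that matters comes from test_set, which the model has never seen.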

Inference

When a neural network receives input, it infers things about the input based on its training; this is known as “inference.” These inferences may or may not be correct. Compared to training, the storage and compute power needed for inference are significantly lower. However, in real-time applications, inference needs to happen within milliseconds, whereas training can take hours, days, or weeks.

Unlike training, inference doesn't modify the neural network based on the results. So when the NN makes mistakes, it is important that they are captured and fed back to the training phase. This brings us to a third (optional) phase, Feedback.
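Here is a small Python sketch of that distinction. The FrozenModel class, its weight and bias, and the observations are all invented for illustration; the point is only that inference reads the weights without changing them, and that wrong answers are set aside for later.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FrozenModel:
    """Hypothetical deployed model. frozen=True means its weights cannot be reassigned."""
    weight: float
    bias: float

    def infer(self, x: float) -> int:
        # Inference only reads the weights; nothing is learned here.
        return 1 if self.weight * x + self.bias > 0 else 0

model = FrozenModel(weight=1.0, bias=-0.5)

# (input, correct answer) pairs seen after deployment. Made-up values.
observations = [(0.9, 1), (0.2, 0), (0.45, 1), (0.7, 1), (0.55, 0)]

mistakes = []  # captured for the training team; the model itself is untouched
for x, truth in observations:
    prediction = model.infer(x)
    if prediction != truth:
        mistakes.append({"input": x, "predicted": prediction, "expected": truth})

print(f"{len(mistakes)} mistakes captured for retraining: {mistakes}")
print(f"model after inference: {model}")
```

No matter how many mistakes the model makes, its weights are the same after the loop as before it. Improvement only happens when the captured mistakes go back into training.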

Feedback

You may have heard the phrase "Data is the new Oil." Nowhere is this more applicable than AI training datasets. If you want an AI that performs well, you have to give it a training set that covers many examples of all of the types of situations that it may encounter. After you have deployed the AI, you have to collect the situations where it did the wrong thing, label each one with the expected result, and add them (and perhaps hundreds or thousands of examples like them) to the training dataset. This allows the AI to iteratively improve. However, it means that your training dataset grows with each iteration and so does the amount of computing horsepower needed for training.
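The loop described above can be sketched in a few lines of Python. Everything here is a toy assumption: the "model" is a single threshold on a number from 0 to 100, the true rule is invented, and "training" just places the threshold halfway between the two classes. What it does show faithfully is the cycle of deploy, collect mistakes, label, add to the dataset, and retrain.

```python
def true_label(x):
    """The answer a human labeler would supply (a made-up rule for this sketch)."""
    return 1 if x >= 50 else 0

def train(training_set):
    """Toy 'training': put the threshold midway between the two classes."""
    highest_zero = max(x for x, label in training_set if label == 0)
    lowest_one = min(x for x, label in training_set if label == 1)
    return (highest_zero + lowest_one) / 2

def infer(threshold, x):
    return 1 if x >= threshold else 0

training_set = [(20, 0), (60, 1)]      # small initial dataset
field_inputs = list(range(0, 101, 5))  # situations met after deployment
history = []                           # (dataset size, mistakes) per iteration

for iteration in range(10):
    threshold = train(training_set)    # train on everything collected so far
    mistakes = [x for x in field_inputs
                if infer(threshold, x) != true_label(x)]
    history.append((len(training_set), len(mistakes)))
    if not mistakes:
        break
    # Label the failures and fold them back into the training set.
    training_set.extend((x, true_label(x)) for x in mistakes)

for size, errors in history:
    print(f"training examples: {size}, mistakes in the field: {errors}")
```

In this toy run the dataset grows from 2 to 4 to 5 examples while the mistakes fall from 2 to 1 to 0. The same trade shows up at any scale: each pass fixes errors, and each pass leaves a bigger dataset to train on next time.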


Tesla's Autopilot Flywheel 

Now that we've ever so briefly covered AI basics, let's look at how these apply to Tesla's FSD.

Let's start with Deploying the Neural Net. Every car that Tesla makes today is a connected car that receives over-the-air updates. This allows the cars to receive new software versions frequently. When a new version of Autopilot is deployed, Tesla collects data about its performance. The AI makes predictions such as the path of travel, where to stop, et cetera. If Autopilot is driving and you disengage it, this may be because it was doing something incorrectly. These disengagements are reported back to Tesla (assuming you have data sharing enabled). The report could be a small file that only has the data labels and a few details or it could be streams of sensor data and clips of video footage depending on the type of disengagement and the types of situations that Tesla is currently adding to their training set.

Even if Autopilot is not engaged, it is running in "shadow mode." In shadow mode, it is still making predictions and taking note when you, the human driver, don't follow those predictions. For example, if it predicts that the road bends to the left, but you go straight, this would be noted and potentially reported back to the mothership. If Autopilot infers that a traffic light is green but you stop, this data would again likely be noted and potentially reported back.
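As a rough illustration of the shadow mode idea, here is a Python sketch. Tesla has not published how this works internally, so the Frame record, its field names, and the action strings are all hypothetical; the sketch only shows the comparison between what the AI would have done and what the human actually did.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    """One moment of driving. A hypothetical record, not Tesla's actual format."""
    timestamp: float
    predicted_action: str  # what the AI would have done
    driver_action: str     # what the human actually did

def shadow_mode_disagreements(frames):
    """Keep only the moments where the human did not follow the prediction."""
    return [f for f in frames if f.predicted_action != f.driver_action]

drive = [
    Frame(0.0, "go_straight", "go_straight"),
    Frame(1.0, "bend_left", "go_straight"),  # AI saw a bend; driver went straight
    Frame(2.0, "go_straight", "go_straight"),
    Frame(3.0, "proceed", "stop"),           # AI saw a green light; driver stopped
]

disagreements = shadow_mode_disagreements(drive)
for frame in disagreements:
    print(f"t={frame.timestamp}: predicted {frame.predicted_action}, "
          f"driver chose {frame.driver_action}")
```

Only the two disagreements would be candidates for reporting; the uneventful frames, where the prediction matched the driver, carry little new information.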

Tesla has about a million vehicles on the road today collectively driving about 15 billion miles each year. The bulk of these cars are from Tesla's Fremont factory. Tesla now has a second factory, Giga Shanghai, putting cars on the road. Soon Giga Berlin and Giga Austin (or will it be Tulsa?) will join them. All of this will result in a large amount of data for the training dataset.

The bigger the training set, the longer it takes to process. However, with a system like this, the best way to improve it is to quickly iterate (deploy it, collect errors, improve, repeat). If training takes months, this slows down the flywheel. How do you resolve this? With a supercomputer dedicated to AI training. This is Project Dojo: make a training system that can drink in the oceans of data and produce a trained NN in days instead of months.
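A bit of back-of-the-envelope Python shows why training time dominates the flywheel. The numbers are purely illustrative assumptions, not Tesla figures: 14 days to deploy a build and collect errors, and either 90 days or 3 days to train.

```python
def iterations_per_year(training_days, collect_days=14):
    """Full turns of the flywheel per year: train, then deploy and collect errors."""
    return 365 / (training_days + collect_days)

# Illustrative assumptions only.
slow = iterations_per_year(training_days=90)  # training takes about three months
fast = iterations_per_year(training_days=3)   # training takes a few days

print(f"months-long training: {slow:.1f} iterations per year")
print(f"days-long training:   {fast:.1f} iterations per year")
print(f"speedup of the flywheel: {fast / slow:.1f}x")
```

Under these assumptions, cutting training from months to days lets the flywheel turn several times as often, and every turn is another chance to fix mistakes.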


A Cerebras Wafer Scale Engine

Cerebras

Earlier, I promised some speculation. Here it is.

The size of the chips used for AI training has been increasing every year. From 2013 to 2019, AI chips increased by about 50% in size. A startup called Cerebras saw this trend and extrapolated it to its natural conclusion of 1 chip per wafer. For comparison, the Cerebras chip is 56 times bigger than the largest GPU made in 2019; it has 3,000 times more on-chip memory and more than 10,000 times the memory bandwidth.

This wafer-scale chip is an AI training accelerator and my conjecture is that a Cerebras chip will be at the heart of Project Dojo. This wafer-scale chip is the biggest (literally and figuratively) breakthrough in AI chip design in a long time.

There is one (albeit tenuous) thread that connects Tesla and Cerebras: both are part of ARK Invest's disruption portfolio. ARK has investments in both companies and meets with their management teams. When two companies could benefit from working together, and their collaboration would also benefit their mutual investor, ARK, you can bet that introductions would be made.