[MUSIC PLAYING] MARTIN GORNER: Hi, everyone, and thank you for being here at 8:30 in the morning, and welcome to this session about TPUs and TPU pods.
So those are custom-made accelerators that Google has designed to accelerate machine learning workloads.
And before I tell you everything about them, me and Kaz, I would like to do something.
Of course, this is live, so you want to see a live demo.
And I would like to train with you here, onstage, using a TPU pod, one of those big models that used to take days to train.
And we'll see if we can finish the training within this session.
So let me start the training.
I will come back to explaining exactly what I'm doing here.
I'm just starting it.
Run all cells.
Seems to be running.
OK, I'm just checking.
I'm running this on a 128 core TPU pod.
So one of the things you see in the logs here is that I have all my TPUs appearing.
0, 1, 2, 6, and all the way down to 128.
All right, so this is running.
I'm happy with it.
Let's hear more about TPUs.
So first of all, what is this piece of silicon? And this is the demo that I've just launched.
It's an object detection demo that is training on a wildlife data set of 300,000 images.
Why wildlife? Because I can show you cute pandas.
And I can show you cute electronics as well.
So this is a TPU v2.
And we have a second version now, a TPU v3.
Those are fairly large boards.
It's large like this, roughly.
And as you can see, they have four chips on them.
Each chip is dual core, so each of these boards has 8 TPU cores on them.
And each core has two units.
The first is a vector processing unit.
That is a fairly standard data-oriented processor, general purpose processor.
What makes this special for machine learning is the matrix multiply unit.
TPUs have a built-in hardware-based matrix multiplier that can multiply two 128 by 128 matrices in one go.
So what is special about this architecture? There are two tricks that we used to make it fast and efficient.
The first one is, I would say, semi-standard.
It's reduced precision.
When you train neural networks, reducing the precision from 32-bit floating points to 16-bit is something that people quite frequently do, because neural networks are quite resistant to the loss of precision.
Actually, it even happens sometimes that the noise that is introduced by reduced precision acts as a kind of regularizer and helps with convergence.
So sometimes you're even lucky when you reduce precision.
But then, as you see on this chart, float16 and float32, the floating point formats, they don't have the same number of exponent bits, which means that they don't cover the same range.
So when you take a model and downgrade all your float32s into float16s, you might get into underflow or overflow problems.
And if it is your model, it's usually not so hard to go in and fix.
But if you're using code from GitHub and you don't know where to fix stuff, this might be very problematic.
So that's why on TPUs, we chose a different– actually we designed a different floating point format called bfloat16.
And as you can see, it's actually exactly the same as float32 with just the fractional bits cut off.
So the point is it has exactly the same number of exponent bits, exactly the same range.
And therefore, usually, it's a drop-in replacement for float32 and reduced precision.
So typically for you, there is nothing to do on your model to benefit from the speed of reduced precision.
The TPU will do this automatically, on chip, in hardware.
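[Editor's sketch] The range argument above can be tried out in plain Python. This is only an illustration, not how the TPU does the conversion: `to_bfloat16` and `float16_roundtrip` are hypothetical helper names, and the sketch truncates the low fraction bits, whereas real hardware may round instead.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Keep only the top 16 bits of the float32 pattern of x.

    That leaves the sign bit, all 8 exponent bits and 7 fraction bits,
    which is the bfloat16 layout. Truncation is an assumption made for
    simplicity; hardware may round to nearest instead.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def float16_roundtrip(x: float):
    """Convert x to IEEE float16 and back, or return None on overflow."""
    try:
        return struct.unpack(">e", struct.pack(">e", x))[0]
    except OverflowError:
        return None


for value in (1.0, 1e38, 1e-10):
    print(value, "bfloat16:", to_bfloat16(value),
          "float16:", float16_roundtrip(value))
```

A large value such as 1e38 survives the bfloat16 conversion with a small relative error but cannot be packed as float16 at all, and a tiny value such as 1e-10 underflows to zero in float16 while staying nonzero in bfloat16.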
And the second trick is architectural.
It's the design of this matrix multiply unit.
So that you understand how this works, try to picture, in your head, how to perform a matrix multiplication.
And one result, one point of the resulting matrix, try to remember calculus from school, is a dot product.
A dot product of one line of one matrix and one column of the second matrix.
Now what is a dot product? A dot product is a series of multiply-accumulate operations, which means that the only operation you need to perform a matrix multiplication is multiply and accumulate.
And multiply-accumulate in 16 bits, because we're using bfloat16 reduced precision.
That is a tiny, tiny piece of silicon.
A 16-bit multiply-accumulator is a tiny piece of silicon.
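[Editor's sketch] To make the point concrete, here is a minimal matrix multiplication in plain Python in which the only arithmetic performed is multiply-accumulate. The function name `matmul_mac` and the counter are illustrative, not part of any TPU API.

```python
def matmul_mac(a, b):
    """Multiply matrices a (rows x inner) and b (inner x cols).

    Every output element is a dot product, and every dot product is a
    chain of multiply-accumulate steps. Returns the result and the
    number of multiply-accumulate operations performed.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    result = [[0] * cols for _ in range(rows)]
    mac_count = 0
    for i in range(rows):
        for j in range(cols):
            acc = 0
            for k in range(inner):
                acc += a[i][k] * b[k][j]  # one multiply-accumulate
                mac_count += 1
            result[i][j] = acc
    return result, mac_count


product, macs = matmul_mac([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product, macs)
```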
And if you wire them together as an array, as you see here.
So this in real life would be a 128 by 128 array.
It's called a systolic array.
Systolic in Greek means flow.
Because you will flow the data through it.
So the way it works is that you load one matrix into the array, and then you flow the second matrix through the array.
And you'll have to believe me, or maybe spend a little bit more time with the animation, by the time the gray dots have finished flowing through those multiply-accumulators, out of the right side come all the dot products that make the resulting matrix.
So it's a one-shot operation.
There are no intermediate values to store anywhere, in memory, in registers.
All the intermediate values flow on the wires from one compute unit to the next compute unit.
It's very efficient.
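[Editor's sketch] For readers who cannot see the animation, the following is a small cycle-by-cycle simulation of a systolic array of multiply-accumulators. It is a sketch under assumptions, not the actual TPU design: one matrix (`w`) is held in place, the rows of the other (`x`) enter from the left with a one-cycle skew per array row, and partial sums move downward and leave at the bottom edge rather than the right side shown on the slide. All names are illustrative.

```python
def systolic_matmul(x, w):
    """Compute x @ w on a simulated K x N grid of multiply-accumulators.

    Cell (k, j) permanently holds w[k][j]. Each cycle, every cell takes
    an activation from its left neighbour and a partial sum from the
    cell above, adds activation * weight, and passes both values on.
    """
    m_rows, k_dim, n_cols = len(x), len(w), len(w[0])
    act = [[0] * n_cols for _ in range(k_dim)]   # activations on the wires
    psum = [[0] * n_cols for _ in range(k_dim)]  # partial sums on the wires
    y = [[0] * n_cols for _ in range(m_rows)]
    total_cycles = m_rows + k_dim + n_cols - 2

    for t in range(total_cycles):
        new_act = [[0] * n_cols for _ in range(k_dim)]
        new_psum = [[0] * n_cols for _ in range(k_dim)]
        for k in range(k_dim):
            for j in range(n_cols):
                if j == 0:
                    m = t - k  # skewed feed: array row k starts k cycles late
                    a_in = x[m][k] if 0 <= m < m_rows else 0
                else:
                    a_in = act[k][j - 1]
                p_in = psum[k - 1][j] if k > 0 else 0
                new_act[k][j] = a_in
                new_psum[k][j] = p_in + a_in * w[k][j]
        act, psum = new_act, new_psum
        # Finished dot products leave the bottom row of the array.
        for j in range(n_cols):
            m = t - (k_dim - 1) - j
            if 0 <= m < m_rows:
                y[m][j] = psum[k_dim - 1][j]
    return y, total_cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, "in", cycles, "cycles")
```

Note that no intermediate value is written to a separate memory: each partial sum exists only on the "wire" between one cell and the next.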
And what is more, it's only made of those tiny 16-bit multiply-accumulators, which means that we can cram a lot of those into one chip.
128 by 128 is roughly 16,000 multiply-accumulators.
And that's how much you get in one TPU core, twice that in two TPU cores.
So this is what makes it dense.
Density means power efficiency.
And power efficiency in the data center means cost.
And of course, you want to know how cheap or how fast these things are.
Some people might remember from last year I did a talk about what I built, this planespotting model, so I'm using this as a benchmark today.
And on Google Cloud's AI platform, it's very easy to get different configurations, so I can test how fast this trains.
My baseline is– on a fast GPU, this model trains in four and a half hours.
But I can also get 5 machines with powerful GPUs in a cluster.
And on those five machines, five GPUs, this model will train in one hour.
And I've chosen this number because one hour is exactly the time it takes for this model to train on one TPU v2.
So the rule of thumb I want you to remember is that 1 TPU v2, with its 4 chips, is roughly as fast as five powerful GPUs.
That's in terms of speed.
But as you can see, it's almost three times cheaper.
And that's the point of optimizing the architecture specifically for neural network workloads.
You might want to know how this works in software as well.
So when you're using TensorFlow, or Keras in TensorFlow, your TensorFlow Python code generates a computational graph.
That is how TensorFlow works.
So your entire neural network is represented as a graph.
Now, this graph is what is sent to the TPU.
Your TPU does not execute Python code.
This graph is processed through XLA, the Accelerated Linear Algebra compiler, and that is how it becomes TPU microcode to be executed on the TPU.
And one nice side-effect of this architecture is that if, in your TensorFlow code, you load your data through the standard tf.data.Dataset API, as you should, and as is required with TPUs, then even the data loading part, or image resizing, or whatever is in your data pipeline, ends up in the graph, ends up executed on the TPU.
And the TPU will be pulling data from Google Cloud Storage directly during training.
So that is very efficient.
How do you actually write this with code? So let me show you in Keras.
And one caveat, this is Keras in TensorFlow 1.14, which should be out in these next days.
The API is slightly different in TensorFlow 1.13 today, but I'd rather show you the one that will be– the new one, as of tomorrow or next week.
So it's only a couple of lines of code.
There is the first line, TPUClusterResolver.
You can call it without parameters on most platforms, and that finds the connected TPU.
The TPU is a remotely-connected accelerator.
This finds it.
You initialize the TPU and then you use the new distribution API in TensorFlow to define a TPU strategy based on this TPU.
And then you say with strategy.scope, and everything that follows is perfectly normal Keras code.
Then you define your model, you compile it, you do model.predict, anything you're used to doing in Keras.
So in Keras, it's literally these four lines of code to add– to work on a TPU.
And I would like to point out that these four lines of code also transform your model into a distributed model.
Remember a TPU, even a single TPU, is a board with eight cores.
So from the get go it's distributed computing.
And these four lines of code put in place all the machinery of distributed computing for you.
One parameter to notice.
You see in the TPU strategy, there is the steps_per_run equals 100.
So that's an optimization.
This tells the TPU, please run 100 batches worth of training and don't report back until you're finished.
Because it's a network attached accelerator, you don't want the TPU to be reporting back after each batch for performance reasons.
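[Editor's sketch] The slide itself is not part of this transcript, so the code below is not the TensorFlow 1.14 code shown on stage. It is a sketch of the same four steps using the API spelling of later TensorFlow 2 releases, which is an assumption on the editor's part: there the "run many batches before reporting back" setting is the `steps_per_execution` argument of `model.compile` rather than `steps_per_run` on the strategy. The function name `train_on_tpu`, the tiny model, and the input shape are placeholders. It only does something useful on a machine with a TPU attached.

```python
def train_on_tpu(train_dataset, steps_per_epoch, epochs=1):
    """Train a small Keras model on the connected TPU (sketch only).

    train_dataset is assumed to be a batched tf.data.Dataset yielding
    (image, label) pairs with images of shape (28, 28, 1).
    """
    import tensorflow as tf  # imported lazily so this file loads without TF

    # 1. Find the connected TPU (no arguments needed on most platforms).
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(resolver)
    # 2. Initialize it.
    tf.tpu.experimental.initialize_tpu_system(resolver)
    # 3. Define a distribution strategy based on this TPU.
    strategy = tf.distribute.TPUStrategy(resolver)

    # 4. Everything inside the scope is ordinary Keras code.
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
            tf.keras.layers.Dense(100, activation="relu"),
            tf.keras.layers.Dense(10, activation="softmax"),
        ])
        model.compile(
            optimizer="adam",
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"],
            # Run 100 batches on the TPU before reporting back to the host.
            steps_per_execution=100,
        )

    model.fit(train_dataset, steps_per_epoch=steps_per_epoch, epochs=epochs)
    return model
```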
So this is the software.
If you would like to write your own code, I encourage you to do so.
But if you don't, we have a whole library of TPU optimized models.
So you will find them on the TensorFlow/tpu GitHub repository.
And there is everything in the image– in the vision space, in the machine translation, and language, and NLP space, in speech recognition.
You can even play with GAN models.
The one that we are demoing on stage, remember we are training the model right now, is RetinaNet.
So this one is an object detection model.
And I like this model, so let me say a few words about how this works.
In object detection, you put in an image, and what you get is not just the label– this is a dog, this is a panda– but you actually get boxes around where those objects are.
In object detection models, you have two kinds.
There are one shot detectors that are usually fast but kind of inaccurate, and then two-stage detectors that are much more accurate but much slower.
And I like RetinaNet because they actually found a trick to make this both the fastest and the most accurate model that you can find in object detection today.
And it's a very simple trick.
I'm not going to explain all the math behind it, but basically in these detection models, you start with candidate detections.
And then you prune them to find only the detections– the boxes that have actual objects in them.
And the thing is that all those blue boxes that you see, there is nothing in them.
So even during training, they will very easily be classified as "nothing to see, move along" boxes, with a fairly small error.
But you've got loads of them, which means that when you compute the loss of this model, in the loss you have a huge sum of very small errors.
And that huge sum of very small errors might in the end be very big and overwhelm the useful signal.
So the two-stage detectors resolve that by being much more careful about those candidate boxes.
In one-stage detectors, you start with a host of candidate boxes.
And the trick they found in RetinaNet is a little mathematical trick on the loss to make sure that the contribution of all those easy boxes stays small.
The upshot, it's both fast and accurate.
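[Editor's sketch] The loss trick is known in the RetinaNet paper as focal loss: the usual cross-entropy term is multiplied by (1 - p_t) raised to a power gamma, where p_t is the probability the model assigns to the correct answer for a box. The sketch below compares the two losses on made-up numbers; the box counts and probabilities are illustrative assumptions, and the paper's additional class-balancing weight is left out for brevity.

```python
import math


def cross_entropy(p_t):
    """Standard cross-entropy for one box, given the probability of the true class."""
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0):
    """Cross-entropy scaled down by (1 - p_t) ** gamma for confident predictions."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Hypothetical batch: many empty boxes the model is already sure about,
# and a handful of boxes with real objects that it still gets wrong.
easy_boxes = [0.99] * 10000
hard_boxes = [0.30] * 10

ce_easy = sum(cross_entropy(p) for p in easy_boxes)
ce_hard = sum(cross_entropy(p) for p in hard_boxes)
fl_easy = sum(focal_loss(p) for p in easy_boxes)
fl_hard = sum(focal_loss(p) for p in hard_boxes)

print("cross-entropy: easy", round(ce_easy, 2), "hard", round(ce_hard, 2))
print("focal loss:    easy", round(fl_easy, 4), "hard", round(fl_hard, 2))
```

With plain cross-entropy the sum over the easy boxes is larger than the sum over the hard ones; with the focal term the easy boxes contribute almost nothing and the hard boxes dominate.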
So let me go back here.
I actually want to say a word now about what I did, exactly, when I launched this demo.
I guess most of you are familiar with the Google Cloud Platform.
So here I am opening the Google Cloud Platform console.
And in the Google Cloud Platform, I have a tool called AI platform, which, for those who know it, has had a facility for running training jobs and for deploying models behind a REST API for serving.
But there is a new functionality called Notebooks.
In AI platform, you can today provision a ready, fully installed notebook for working in– yeah, so let me switch to this one– for working either in TensorFlow, in PyTorch, with GPUs.
It's literally a one click operation.
NEW INSTANCE, I want a TensorFlow instance with Jupyter notebook installed, and what you get is here an instance that is running but with the link to open Jupyter.
For example, this one– and it will open Jupyter, but it's already open.
So it's asking me to select something else, but it's here.
And here, you can actually work normally in your Jupyter environment with a powerful accelerator.
You might have noticed that I don't have a TPU option, actually not here, but here, for adding an accelerator.
But here I am using Jupyter notebook instances that are powered by a TPU v3 128-core pod.
How did I do it? It's actually possible on the command line.
I give you the command line here.
There is nothing fancy about it.
There is one gcloud compute command line to start the instance and a second gcloud compute command line to start the TPU.
You provision a TPU just as you would a virtual machine in Google's cloud.
So this is what I've done.
And that is what is running right now.
So let's see where we are.
Here it's still running.
As you see, enqueue next 100 batches.
And it's training.
We are at step 4,000 out of 6,000 roughly.
So we'll check back on this demo at the end of the session.
When I was preparing this demo to run it on stage, I was also able to run a comparison of how fast TPU v3s are versus v2s.
In theory, v3s are roughly twice as powerful as v2s, but that only works if you feed them enough work to make use of all the hardware.
So here on RetinaNet, you can train on images of various sizes.
Of course, if you train on smaller images, 256 pixel images, it will be much faster, in terms of images per second.
And I've tried both– TPU v2s and v3s.
You see with small images, you get a little bump in performance from TPU v3s, but nowhere near double.
But as you get to bigger and bigger images, you are feeding the hardware with more work.
And on 640 pixel images, the speed up you get from TPU v3 is getting close to the theoretical x2 factor.
So for this reason, I am running this demo here at the 512 pixel image size on a TPU v3 pod.
I'm talking about pods.
But what are these pods, exactly? To show you more about TPU pods, I would like to give the lectern to Kaz.
Thank you, Kaz.
KAZ SATO: Thank you, Martin.
[APPLAUSE] So in my part, I will introduce Cloud TPU pods.
What are pods? It's a large cluster of Cloud TPUs.
The version two pod is now available as public beta, which provides 11.6 petaflops, with 512 TPU cores.
The next generation version three pod is also public beta now, which achieves over 100 petaflops with 2,048 TPU cores. So those performance numbers are as high as the greatest supercomputers.
So Cloud TPU pods are AI supercomputers that Google has built from scratch.
But some of you might think, what's the difference between a bunch of TPU instances and a Cloud TPU pod? The difference is the interconnect.
Google has developed ultra high-speed interconnect hardware derived from supercomputer technology, for connecting thousands of TPUs with very short latency.
What does it do for you? As you can see on the animation, every time you update a single parameter on a single TPU, that will be synchronized with all the other thousands of TPUs, in an instant, by the hardware.
So in short, TensorFlow users can use the whole pod as a single giant machine with thousands of TPU cores inside it.
It's as easy as using a single computer.
And you may wonder, because it's an AI supercomputer, whether it also comes at a super high cost.
But it does not.
You can get started with using TPU pods with 32 cores at $24 per hour, without any initial cost.
So you don't have to pay millions of dollars to build your own supercomputer from scratch.
You can just rent it for a couple of hours from the cloud.
Version three pod also can be provisioned with 32 cores.
That costs only $32 per hour.
For larger sizes, you can ask our service contact for the pricing.
What is the cost benefit of TPU pods over GPUs? Here's a comparison result.
With a full version two pod, with 512 TPU cores, you can train the same ResNet-50 models at 27 times faster speed at 38% lower cost.
This shows the clear advantage of the TPU pods over typical GPU-based solutions.
And there are other benefits you could get from the TPU pods.
Let's take a look at eBay's case.
eBay has over 1 billion product listings.
And to make it easier to search specific products from 1 billion products, they built a new visual search feature.
And to train the models, they have used 55 million images.
So it's a really large scale training for them.
And they have used Cloud TPU pods, and eBay was able to get a 100 times faster training time, compared with their existing GPU service.
And they also got a 10% accuracy boost.
Why is that? TPU itself is not designed to increase the accuracy that much.
But if you can increase the training speed 10 times or 100 times, that means the data scientists or researchers can have 10 times or 100 times more iterations for the trials, such as trying out a different combination of the hyperparameters or different preprocessings and so on.
So that ended up with at least a 10% accuracy boost in eBay's case.
Let's see what kind of TensorFlow code you would write to get those benefits from TPU pods.
And before taking a look at the actual code, I will try to look back.
What are the efforts required, in the past, to implement large scale distributed training? Using many GPUs or TPUs for a single machine learning training, that is so-called distributed training.
And there are two ways.
One is data parallel and another is model parallel.
Let's talk about the data parallel first.
With data parallel, as you can see on the diagram, you have to split the training data across multiple GPU or TPU nodes.
And also you have to share the same parameter set, the model.
And to do that, you have to set up a cluster of GPUs or TPUs by yourself.
And also you have to set up a parameter server that shares all the updates of the parameters among all the GPUs or TPUs.
So it's a complex setup.
And also in many cases, you have to– there's going to be synchronization overhead.
So if you have hundreds or thousands of TPUs or GPUs in a single cluster, that's going to be a huge overhead.
And that limits the scalability.
But with TPU pods, the hardware takes care of it.
The high-speed interconnect synchronizes all of the parameter updates in a single TPU with the other thousands of TPUs in an instant, with very short latency.
So there's no need to set up the parameter server, or there's no need to set up the large cluster of GPUs by yourself.
And also you can get almost linear scalability as you add more and more TPU cores in your training.
And Martin will show you the actual scalability result later.
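[Editor's sketch] What has to be synchronized in data parallel training can be shown with a toy example in plain Python. Each simulated worker computes a gradient on its own shard of the batch, and the averaged result matches the gradient computed on the whole batch, which is why every worker can then apply the same parameter update. The linear model, the data, and the function names are illustrative assumptions; on a pod the averaging step is what the interconnect performs in hardware.

```python
import math


def gradient(w, b, batch):
    """Gradient of the mean squared error of y = w * x + b over a batch."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b


def data_parallel_gradient(w, b, batch, num_workers):
    """Split the batch across workers, then average their local gradients.

    Assumes len(batch) is divisible by num_workers.
    """
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size]
              for i in range(num_workers)]
    local = [gradient(w, b, shard) for shard in shards]  # one per worker
    grad_w = sum(g[0] for g in local) / num_workers      # the "all-reduce"
    grad_b = sum(g[1] for g in local) / num_workers
    return grad_w, grad_b


batch = [(float(x), 3.0 * x + 1.0) for x in range(16)]
full = gradient(0.5, 0.0, batch)
sharded = data_parallel_gradient(0.5, 0.0, batch, num_workers=4)
print(full, sharded)
```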
And as I mentioned earlier, TensorFlow users can use the whole TPU pod as a single giant computer with thousands of TPU cores inside it.
So it's as easy as using a single computer.
For example, if you have Keras code running on a single TPU, it also runs on 2,000 TPU cores without any changes.
This is exactly the same code Martin showed earlier.
So under the hood, all the complexity for the data parallel training, such as splitting the training data into the multiple TPUs, or sharing the same parameters, those are all taken care of by the TPU pods' interconnect, and XLA compilers, and the new TPUStrategy API in the TensorFlow 1.
The one thing you may want to change is the batch size.
As Martin mentioned, a TPU core has a matrix processor that has 128 by 128 matrix multipliers.
So usually, you will get the best performance by setting the batch size to 128 times the number of TPU cores.
So if you have 10 TPU cores, that's going to be 1,280.
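[Editor's sketch] The batch-size rule of thumb is simple arithmetic; the helper below is a hypothetical convenience, not a TensorFlow function.

```python
def global_batch_size(num_cores, per_core_batch=128):
    """Total batch size when every TPU core gets per_core_batch examples."""
    return per_core_batch * num_cores


print(global_batch_size(10))       # the 10-core example from the talk
print(global_batch_size(128, 64))  # 128 cores at 64 examples per core
```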
The benefit of TPU pods is not only the training times.
It also enables the training of giant models by using Mesh TensorFlow.
Data parallel has been a popular way of distributed training, but there's one downside.
It cannot train a big model.
Because all the parameters are shared with all the GPUs or TPUs, you cannot bring in a big model that doesn't fit into the memory of a single GPU or TPU.
So there's another way of distributed training called model parallel.
With model parallel, you can split the giant model across multiple GPUs or TPUs so that you can train much larger models.
But that has not been a popular way.
Why? Because it's much harder to implement.
As you can see on the diagrams, you have to implement all the communications between the fractions of the model.
It's like stitching between the models.
And again, you have to set up the complex cluster, and in many cases, the communication between the models.
Because if you have hundreds or thousands of CPU or GPU or TPU cores, then that's going to be a huge overhead.
So those are the reasons why model parallel has not been so popular.
To solve those problems, the TensorFlow team has developed a new library called Mesh TensorFlow.
It's a new way of distributed training, with multiple computing nodes, such as TPU pods, or multiple GPUs, or multiple CPUs.
Mesh TensorFlow provides an abstraction layer that sees those computing nodes as a logical n-dimensional mesh.
Mesh TensorFlow is now available as open source code on the TensorFlow GitHub repository.
To see how it works, imagine you have a simple neural network like this for recognizing MNIST digits.
This network has the batch size as 512, and data dimension as 784, and one hidden layer with 100 nodes, and output as 10 classes.
And if you want to train that network with the model parallel, you can just specify to Mesh TensorFlow, I want to split the parameters into four TPUs, and that's it.
You don't have to think about how to implement the communication between the split model, or worry about the communication overhead.
What kind of code would you write? Here is the code to use the model parallel.
At first, you have to define the dimensions of both data and the model.
In this code, you are defining the batch dimension as 512, and the data has 784 dimensions, and the hidden layer has 100 nodes, and the 10 classes.
And then you define your own network by using Mesh TensorFlow APIs, such as two sets of weights and one hidden layer, and the logits and loss function, by using those dimensions.
Finally, you define how many TPUs or GPUs you have in the mesh, and what is the layout rule you want to use.
In this code example, it is defining a hidden layer dimension for splitting the model parameters into the four TPUs.
And that's it.
Mesh TensorFlow can then take a look at this code and automatically split the model parameters into the four TPUs.
And it shares the same training data with all the TPUs.
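[Editor's sketch] The slide code is not reproduced in this transcript, and the block below does not use the Mesh TensorFlow API. It is a plain Python simulation of what splitting the hidden dimension across four devices means for the network described above (784 inputs, 100 hidden nodes, 10 classes). Assumptions: a ReLU activation, random weights, and a batch of 2 instead of 512 so that it runs quickly. Each simulated device holds a quarter of both weight matrices, sees the same input data, and the partial results are summed at the end.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def full_forward(x, w1, w2):
    """Reference: the whole model on one device."""
    return matmul(relu(matmul(x, w1)), w2)


def model_parallel_forward(x, w1, w2, num_devices):
    """Split the hidden dimension across devices and sum the partial logits.

    Assumes the hidden size is divisible by num_devices.
    """
    hidden = len(w2)
    shard = hidden // num_devices
    logits = [[0.0] * len(w2[0]) for _ in x]
    for d in range(num_devices):
        lo, hi = d * shard, (d + 1) * shard
        w1_shard = [row[lo:hi] for row in w1]  # (inputs, hidden / devices)
        w2_shard = w2[lo:hi]                   # (hidden / devices, classes)
        partial = matmul(relu(matmul(x, w1_shard)), w2_shard)
        # The only communication needed: add up the partial logits.
        logits = [[a + b for a, b in zip(r1, r2)]
                  for r1, r2 in zip(logits, partial)]
    return logits


random.seed(0)
BATCH, INPUTS, HIDDEN, CLASSES = 2, 784, 100, 10
x = [[random.random() for _ in range(INPUTS)] for _ in range(BATCH)]
w1 = [[random.uniform(-0.1, 0.1) for _ in range(HIDDEN)] for _ in range(INPUTS)]
w2 = [[random.uniform(-0.1, 0.1) for _ in range(CLASSES)] for _ in range(HIDDEN)]

reference = full_forward(x, w1, w2)
split = model_parallel_forward(x, w1, w2, num_devices=4)
print(max(abs(a - b) for r1, r2 in zip(reference, split) for a, b in zip(r1, r2)))
```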
You can also combine both data and model parallel.
For example, you can define the 2D mesh like this.
And you use the rows of the mesh for the data parallel and use the columns of the mesh for the model parallel, so that you can get the benefits from both of them.
And again, it's easy to define with Mesh TensorFlow.
You can just specify the batch dimension for the rows and the hidden layer dimension for the columns.
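[Editor's sketch] A layout on a 2D mesh can be pictured as a mapping from named tensor dimensions to mesh axes. The helper below is not Mesh TensorFlow; it is a hypothetical plain Python function that works out how large each device's piece of a tensor is once the batch dimension is split over the rows and the hidden dimension over the columns. The 2 by 2 mesh size is an assumption chosen for illustration.

```python
def shard_shape(tensor_dims, mesh_shape, layout):
    """Return the per-device shape of a tensor under a layout.

    tensor_dims: list of (dimension name, size) pairs.
    mesh_shape:  dict mapping mesh axis name to number of devices on it.
    layout:      dict mapping tensor dimension name to mesh axis name;
                 dimensions not listed are replicated, not split.
    """
    shape = []
    for name, size in tensor_dims:
        axis = layout.get(name)
        parts = mesh_shape[axis] if axis is not None else 1
        if size % parts != 0:
            raise ValueError(f"{name}={size} is not divisible by {parts}")
        shape.append(size // parts)
    return tuple(shape)


mesh_shape = {"rows": 2, "cols": 2}            # 4 devices in a 2 x 2 mesh
layout = {"batch": "rows", "hidden": "cols"}   # data parallel x model parallel

inputs = [("batch", 512), ("io", 784)]
weights_1 = [("io", 784), ("hidden", 100)]
activations = [("batch", 512), ("hidden", 100)]
weights_2 = [("hidden", 100), ("classes", 10)]

for name, dims in [("inputs", inputs), ("weights_1", weights_1),
                   ("activations", activations), ("weights_2", weights_2)]:
    print(name, shard_shape(dims, mesh_shape, layout))
```

Each device ends up holding half of the batch and half of the hidden units, so both the data and the parameters are divided.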
This is an example where you are using Mesh TensorFlow for training a transformer model.
The transformer model is a very popular language model, and I won't go deeper into the transformer model.
But as you can see, it's so easy to map each layer of a transformer model to the layout of Mesh TensorFlow so that you can efficiently map the large data and large model into the hundreds or thousands of TPU cores by using Mesh TensorFlow.
So what's the benefit? By using Mesh TensorFlow running with TPU pods, the Google AI team was able to train the language model and translation model at the billion word scale.
And they were able to achieve the state-of-the-art scores, as you can see on those numbers.
So for those use cases, the larger the model, the better accuracy you get.
The model parallel with TPU pods gives a big advantage in achieving those state-of-the-art scores.
Let's take a look at another use case of the large scale model parallel, called BigGAN.
And I won't go deeper into what a GAN is or how a GAN works.
But here's the basic idea.
You have two networks defined.
One is called discriminator D and another is called generator G.
And you define a loss function so that D is trained to recognize whether an image is a fake image or a real image.
And at the same time, the generator will be trained to generate a realistic image so that D cannot find it's a fake.
It's like a minimax game you are playing with those two networks.
And eventually, you will have a generator G that can generate photo-realistic fake images, artificial images.
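[Editor's sketch] The minimax game can be written down with the classic GAN losses. This is the textbook formulation, shown here as an assumption for illustration; BigGAN itself may use a different loss variant. `d_real` and `d_fake` stand for the probability that the discriminator assigns to "this image is real" for a real image and for a generated image.

```python
import math


def discriminator_loss(d_real, d_fake):
    """D wants d_real close to 1 and d_fake close to 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss(d_fake):
    """Minimax form: G wants D to rate its fakes as real (d_fake close to 1)."""
    return math.log(1.0 - d_fake)


# A discriminator that tells real from fake has a lower loss than one that guesses.
print(discriminator_loss(0.9, 0.1), discriminator_loss(0.5, 0.5))
# A generator that fools the discriminator has a lower loss than one that does not.
print(generator_loss(0.9), generator_loss(0.1))
```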
Let's take a look at the demo video.
So this is not big spoiler.
I have already loaded the BigGAN model that is trained on the TPU pod.
And as you can see, these are all artificial synthesized images at high quality.
You can also specify the category of the generated images, such as ostrich, so that you can generate the ostrich images.
These are all synthesized artificial images.
None of them are real.
And because BigGAN has the so-called latent space that has the seeds to generate those images, you can interpolate between two seeds.
In this example, it is interpolating between golden retriever and Lhasa.
And you can try out a different combination of the interpolation, such as west highland white terrier and golden retriever.
Again, those are all fake images.
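[Editor's sketch] Interpolating between two seeds amounts to blending two vectors step by step and generating an image for each blend. The sketch below shows only the blending, using straight linear interpolation as an assumption; the demo may interpolate differently, and the generator call is omitted.

```python
def interpolate(seed_a, seed_b, steps):
    """Return `steps` vectors moving linearly from seed_a to seed_b (steps >= 2)."""
    frames = []
    for i in range(steps):
        t = i / (steps - 1)  # 0.0 at seed_a, 1.0 at seed_b
        frames.append([(1 - t) * a + t * b for a, b in zip(seed_a, seed_b)])
    return frames


frames = interpolate([0.0, 0.0], [1.0, 2.0], steps=5)
for frame in frames:
    print(frame)  # each vector would be fed to the generator to get one image
```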
So this BigGAN model was trained with the TPU version three pod with 512 cores.
And that took 24 hours to 48 hours.
Why does BigGAN take so many TPU cores and so long a time? The reasons are the model size and the batch size.
The quality of a GAN model is measured by the inception score, or IS score.
That represents how much an inception model thinks those images are real.
And that also represents the variety of generated images.
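[Editor's sketch] The inception score is commonly defined as the exponential of the average KL divergence between the classifier's prediction for each generated image and the average prediction over all images. The plain Python version below works on made-up class probabilities rather than on outputs of a real Inception network, which is an assumption for illustration.

```python
import math


def inception_score(predictions):
    """predictions: one list of class probabilities per generated image."""
    n = len(predictions)
    num_classes = len(predictions[0])
    marginal = [sum(p[c] for p in predictions) / n for c in range(num_classes)]
    kl_total = 0.0
    for p in predictions:
        kl_total += sum(pc * math.log(pc / marginal[c])
                        for c, pc in enumerate(p) if pc > 0)
    return math.exp(kl_total / n)


sharp_and_varied = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
blurry = [[1 / 3, 1 / 3, 1 / 3]] * 3   # the classifier is unsure about every image
collapsed = [[1.0, 0.0, 0.0]] * 3      # confident, but always the same class

print(inception_score(sharp_and_varied))
print(inception_score(blurry))
print(inception_score(collapsed))
```

The score is high only when each image is classified confidently and the images together cover many classes, which matches the two properties described above.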
The BigGAN paper says that you get a better IS score when you have more parameters in the model and when you are using a larger batch size for the training.
So that means the larger scale model parallel on the hundreds of TPU cores is crucial for the BigGAN model to increase the quality of those generated images.
So we have seen two use cases.
A BigGAN use case and language model use cases.
And those are the first applications of the model parallel on TPU pods.
But they are only the starters.
So TPU pods are available to everyone from now.
So we expect to see more and more exciting use cases coming from the new TPU pods users and also from the applications.
So that's it for my part.
Back to Martin.
MARTIN GORNER: So now it's time to check on our demo.
Did our model actually train? Checking here, yeah, it looks like it has finished training.
A saved model has been saved.
So the only thing left to do is to verify if this model can actually predict something.
So on a second machine I will reload the exact same model.
I believe that's the one.
And let's go and reload it.
So I'll skip training this time and just go here to inference and loading.
Whoops, sorry about that.
I just hope the demo gods will be with me today.
That's because I'm loading the wrong directory.
The demo gods are almost with me.
It's this one where my model has been saved.
It wasn't the same.
Sorry about that.
No training, just inference.
And this time, it looks like my model is loading.
And once it's loaded, I will see if it can actually detect animals in images, and here we are.
So this leopard is actually a leopard.
This bird is a bird.
The lion is a lion.
This is a very tricky image.
So I'm showing you not cherry-picked images.
This is a model I have trained on stage, here with you.
No model is perfect.
We will see bad detections, like this one.
But that's a tricky one.
It's not an actual lion.
The leopard is spot on.
The lion is spot on.
And see that the boxing actually works very well.
The leopard has been perfectly identified in the image.
So let's move to something more challenging.
Even this inflatable artwork lion has been identified, which is not always the case.
This is a complicated image– a flock of birds.
So you see it's not seeing all of them.
But all of them at least are birds, which is a pretty good job.
The leopard is fine.
Oh, and this is the most complex we have.
There is a horse and cattle.
Well, we start seeing a couple of bad detections here.
Of course, that cow is not a pig.
As I said, no model is perfect.
But here the tiger is the tiger, and we have our two cute pandas.
And those two cute pandas are actually quite difficult, because those are baby pandas.
And I don't believe that this model has had a lot of baby animals in its 300,000 images data set.
So I'm quite glad that it managed to find the two pandas.
So moving back, let me finish by giving you a couple of feeds and speeds on those models.
So here, this model has a ResNet-50 backbone, plus all the detection layers that produce the boxes.
And we have been training it on a TPU v3 pod with 128 cores.
It did finish in 20 minutes.
You don't have to just believe me for that.
Let me show you.
Here I had a timer in my script.
Yep, 19 minutes and 18 seconds.
So I'm not cheating.
This was live.
But I could also have run this model on a smaller pod.
Actually, I tried on a TPU v2-32.
On this chart, you see the speed on this axis and the time on this axis.
This is to show you that a TPU v2-32 is actually a very useful tool to have.
We've been talking about huge models up to now.
But it's debatable whether this is a huge model.
This definitely was a huge model a year ago.
Today, with better tools, I can train it in an hour on a fairly modest TPU v2 32-core pod.
So even as an individual data scientist, that is a very useful tool for me to have handy when I need to do a round of trainings on a model like this, because someone wants an animal detection model.
And bringing the training down to the one hour space, or 20 minutes space, allows me to work a lot faster and iterate a lot faster on the hyperparameters, on the fine tuning, and so on.
You see on a single TPU v3, it's the bottom line.
And if we were to train this on a GPU– so remember our rule of thumb from the beginning.
One TPU v2, roughly five GPUs.
Therefore 1 TPU v3, roughly 10 GPUs.
So the GPU line would be one tenth of the lowest line on this graph.
I didn't put it because it would barely register there.
That shows you the change of scale at which you can be training your models using TPUs.
You might be wondering about this.
So as you scale, one thing that might happen is that you have to adjust your learning rate schedule.
So this is actually the learning rate schedule I have used to train the model on the 128 core TPU pod.
Just a couple of words, because it might not be the most usual learning rate schedule you have ever seen.
There is this ramp up.
So the second part is exponential decay.
That's fairly standard.
But the ramp up part, that is because we are starting from ResNet-50, initialized with pre-trained weights.
But we still leave those weights trainable.
So we are training the whole thing.
It's not transfer learning.
It's just fine tuning of pre-trained ResNet-50.
And when you do that, and you train very fast, using big batches, as we do here, the batch size here is 64 times 128.
So it's a very big batch size.
You might actually break those pre-trained weights in ways that harm your precision.
So that's why it's quite usual to have a ramp up period to make sure that the network, in its initial training phases, when it doesn't know what it's doing, does not completely destroy the information in the pre-trained weights.
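[Editor's sketch] A schedule of this shape, a linear ramp up followed by exponential decay, can be written as a small function. All the numbers below (starting rate, peak rate, ramp length, decay rate) are placeholder assumptions; the values used in the demo are not given in the talk.

```python
import math


def learning_rate(step, start_lr=0.001, peak_lr=0.01,
                  ramp_steps=500, decay_rate=0.5, decay_steps=1000):
    """Linear ramp from start_lr to peak_lr, then exponential decay.

    After the ramp, the rate is multiplied by decay_rate every decay_steps steps.
    """
    if step < ramp_steps:
        return start_lr + (peak_lr - start_lr) * step / ramp_steps
    return peak_lr * decay_rate ** ((step - ramp_steps) / decay_steps)


for step in (0, 250, 500, 1500, 2500):
    print(step, learning_rate(step))
```

With Keras, a function like this could be passed to a learning rate scheduler callback, keeping the early updates small while the pre-trained weights are most at risk.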
So we did it.
We did train this model here on stage in 20 minutes.
And the demo worked, I'm really glad about that.
So this is the end.
What we have seen is TPUs and TPU pods.
But mostly cost effective.
A very cost effective way and a good tool to have for any data scientist.
Also, not only for very large models, but for what used to be large models in the past and which are normal models today, such as a ResNet-50 [INAUDIBLE].
It's a very useful tool.
And then Cloud TPU pods, where you can actually enable not only data, but model parallelism, using this new library called Mesh TensorFlow.
A couple of links here with more information if you would like to know more.
Yes, you can take a picture.
And if you have more questions, we will be at the AI/ML pod, the red one, in front of one TPU rack.
So you can see this one live and get a feel for what kind of computer it is.
And with that, thank you very much.
[APPLAUSE] [MUSIC PLAYING].