AI FTP and the cone of uncertainty

While I think a lot of the criticism of the new AI FTP is over the top, I also think TrainerRoad has brought some of this on themselves by doing a poor job of explaining the fundamental nature of prediction.

A reasonable analogy is hurricane forecasting. You are the hurricane. Your AI FTP is the position of the storm. Your upcoming workouts are the terrain the hurricane will pass over. The other things you do and your environment are the atmosphere and surrounding weather.

The further into the future one is predicting, the less certain that prediction is. Thus, providing a bare point estimate of a prediction can be grossly misleading. A cone of uncertainty would help to indicate how much (or how little) confidence there is (How to Understand Hurricane Forecasts and the Cone of Uncertainty | Scientific American):

Right now, TrainerRoad is just giving us the point at the end of the central “most likely” prediction path. Of course that prediction is often wrong. That’s how prediction works, even when it is working. You, me, all of us, are complex dynamical systems.

What would be extremely helpful would be for TrainerRoad to provide some measure of uncertainty around the AI FTP predictions. Even if it isn’t clear how to calculate that directly, surely they have enough data to provide an estimate based on past performance of the algorithm. For example, when AI FTP predictions are made 28 days out, 50% of actual AI FTP values at the end of the 28 days are within x Watts (or x%) of the original prediction.
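A calibration statement like that could be computed from nothing more than the historical prediction errors. A minimal sketch, where `errors_watts` is an invented sample purely for illustration (TR would use its real data):

```python
import statistics

# Hypothetical sample: differences between the 28-day-out AI FTP prediction
# and the actual AI FTP at the end of the window, in Watts.
errors_watts = [-12, -8, -5, -3, -2, 0, 1, 2, 4, 6, 7, 9, 11, 15]

abs_errors = sorted(abs(e) for e in errors_watts)

# The 50th percentile of absolute error: half of the predictions landed
# within this many Watts of the eventual value.
median_abs_error = statistics.median(abs_errors)
print(f"50% of 28-day predictions were within {median_abs_error} W")
```

The same one-liner at the 80th or 95th percentile would give the wider bands of the cone.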

26 Likes

Good analogy. In my mind I tend to imagine a statement kinda like this attached to the FTP prediction:

“At the end of this training block, TR AI estimates that your FTP will be X. This prediction assumes that you complete each of the scheduled workouts as prescribed. Skipping workouts, pausing during the workout or reducing the intensity below 100% will cause your predicted FTP to trend lower. Adding additional workouts to your schedule, increasing intensity above 100% for all or part of a workout, swapping out prescribed workouts for harder/longer variants or extending workouts may increase or decrease your predicted FTP depending on your ability to recover adequately. If you feel like you can do more work and still adequately recover, use the Check Volume feature to adjust the workload your plan is giving you.”

10 Likes

That’s a great paragraph right there.

2 Likes

This is an excellent analogy. Also, to @tjwanek’s comments: they need to be a lot more specific about how much each variable actually impacts the prediction, and with what tolerance. How bad is a 10-minute break in a 3-hour workout, and how did they arrive at that conclusion, etc.?

4 Likes

The problem with even attempting to predict FTP is that not everyone responds to the same training in the same way, and even the same athlete responds differently to training throughout the season or across multiple years.

You can see this play out in any study that reports on an individual basis or if you have actual experience coaching multiple athletes for a few seasons.

I am not remotely surprised that people are having very different experiences with the accuracy of the prediction.

It would already help if people finally came around to the idea of the prediction being a possibility, not a fact. Right now this still seems to elude a significant portion of the user base.

Personally I’d not want to have this paragraph, but only because of the UI annoyance, not because of it being badly written. It just feels like yet another of those “drink may be hot” labels on a freshly brewed coffee.

8 Likes

That’s a well-written statement - but what it still misses is that even if everything goes exactly to plan and you complete everything TR gives you without making changes - it’s still only a prediction and it can still change.

The OP nailed it :clap:

3 Likes

Some great points.

Nate has said they would like to run Monte Carlo simulations in the future, which would sample a number of outcomes, produce a range of results and provide a level of certainty across them. Much like the OP’s image showing the likely storm paths.

Currently, though, we get just one prediction path, or at least one that the ML model is reasonably certain of. Comparing to the image above, TR is showing one of the possible lines the storm can take, not many of them.
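A toy version of what such a Monte Carlo could look like. Every number in `simulate_block` is made up; the point is only the technique of sampling many paths and reporting a band rather than one line:

```python
import random

random.seed(42)  # reproducible for the example

# Hypothetical response model: each workout nudges FTP by a noisy amount.
# None of these numbers come from TR; they only illustrate the technique.
def simulate_block(start_ftp=250.0, n_workouts=12):
    ftp = start_ftp
    for _ in range(n_workouts):
        ftp += random.gauss(0.7, 1.5)  # mean gain per workout, with noise
    return ftp

runs = sorted(simulate_block() for _ in range(10_000))

# A "cone": report the 10th-90th percentile band instead of one central path.
low, mid, high = runs[1_000], runs[5_000], runs[9_000]
print(f"80% of simulated outcomes land between {low:.0f} and {high:.0f} W "
      f"(median {mid:.0f} W)")
```

The cost objection below is real: this runs one athlete with a trivially cheap per-workout model; replaying a full ML model 10,000 times per athlete is a different bill.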

Don’t forget that those kinds of storm predictions can be run in one batch, using large pools of computers; and with lives potentially at risk, the pockets funding the simulations are deep.

I’m fairly sure TR can spin up as many machines as their cloud provider allows, but there are many different simulations for many different athletes. Cost and performance are the higher-priority problems to solve here.

I’m sure we’ll get a similar view of AIFTP prediction in the future, once they’re able to do it quickly and efficiently.

2 Likes

You make a good point. However, if it were only the factors above that made the prediction trend lower, that would be understandable. What I’m struggling with is when it trends lower even when the workouts are nailed (as far as I can see). I really want to know the delta between how it thought I would do the workout and what I actually did.

2 Likes

“they need to be a lot more specific about how much each variable actually impacts the prediction, and with what tolerance. How bad is a 10-minute break in a 3-hour workout, and how did they arrive at that conclusion, etc.”

I think that would be helpful for us. However, that may also be part of the “secret sauce” that they feel provides a business advantage over competing training approaches/software, so I can see why they might be reluctant to provide exact detail. Some guidelines would be super helpful, though. Something like “pauses have a minor effect up to 30 seconds. After that, the effect becomes more and more pronounced” or “skipping a workout has a major effect, whereas turning it down 10% has a moderate effect and turning it down 3-5% has a minor effect”.

2 Likes

Another issue is that this question (“How much does x impact the prediction?”) implicitly assumes that the effects are additive. I have no idea what their model is, but I would doubt this to be the case. In other words, pausing once may have little effect on the prediction, and turning a workout down once may have little effect, but doing both of those things may have an effect greater than the sum of the individual effects. I would expect the model to include interactions, potentially negating the usefulness of the “How much does x impact the prediction?” question.
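To make the interaction point concrete: here is a tiny penalty model where the whole is worse than the sum of the parts. Every number is hypothetical; nothing here is TR’s actual model:

```python
# Hypothetical, non-additive penalty model, purely for illustration.
def ftp_delta(paused: bool, turned_down: bool) -> float:
    delta = 0.0
    if paused:
        delta -= 1.0   # small cost of pausing once
    if turned_down:
        delta -= 2.0   # moderate cost of turning intensity down once
    if paused and turned_down:
        delta -= 3.0   # interaction term: doing both costs extra
    return delta

# Sum of the individual effects:
print(ftp_delta(True, False) + ftp_delta(False, True))  # -3.0
# Effect of doing both, which exceeds that sum:
print(ftp_delta(True, True))                            # -6.0
```

With interactions like that, no lookup table of “x costs y Watts” can ever be complete.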

2 Likes

I agree that the computational issues may practically limit their ability to run lots of simulations for each individual. That’s why I think some postdictive analysis would be useful. For example:

  1. Sample 1,000 riders for whom you have a 1-month ahead AI FTP prediction and an actual AI FTP number at the end of the month. Show us a histogram of the differences between those two numbers as a percent of FTP.
  2. Do the same thing, but only include riders who completed all of the prescribed workouts and no additional ones.

Those two histograms would say a lot about the performance of the one-month out predictions.
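A sketch of that postdictive analysis with a tiny made-up sample; `records` is invented data standing in for TR’s real prediction/outcome pairs:

```python
# Hypothetical records: (prediction error as % of FTP, completed all workouts?)
records = [(-4.0, False), (-2.5, True), (-1.0, True), (0.0, True),
           (0.5, False), (1.0, True), (2.0, False), (3.5, False)]

def histogram(errors, bin_width=2.0):
    """Crude histogram: count errors falling into bins of bin_width percent."""
    bins = {}
    for e in errors:
        b = int(e // bin_width) * bin_width  # left edge of the bin
        bins[b] = bins.get(b, 0) + 1
    return dict(sorted(bins.items()))

all_errors = [e for e, _ in records]
compliant_errors = [e for e, compliant in records if compliant]

print("all riders:      ", histogram(all_errors))
print("compliant riders:", histogram(compliant_errors))
```

Comparing the spread of the two histograms separates algorithm error from compliance error, which is exactly the question people keep arguing about.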

3 Likes

Perhaps rephrase some of your criticism in a constructive and less self-assured way. Personally, I think you’ve missed a key issue:

Whilst we’re on the subject of much larger issues than TrainerRoad, I’d suggest everyone familiarise themselves with the non-deterministic systems flooding the internet and corporate technology. Ask yourselves how you are coping with their existence, and with the inability to know what, how or why they are doing the things they do.

Training AI is a drop in the ocean of ineffability we are currently all swimming in :slightly_smiling_face:


4 Likes

While the logic here makes sense, I think it fails to consider that TrainerRoad only cares about selecting the right next workout. The AI FTP number is irrelevant to TrainerRoad, and we seem to be fixating on it rather than on whether or not TR is consistently giving us good workouts for our fitness level. The workout itself probably accounts for the outer edges of the cone of uncertainty: you can probably complete it, within reason, if your AI FTP is lower, and you could possibly fail it if your AI FTP is higher. That’s the point of the % predictions of how you will feel at the end. I say ignore the FTP. It’s meaningless.

3 Likes

Playing devil’s advocate here, but how do I know that the workouts are correct for my fitness level? If I never do a test, my fitness could be slowly dropping off. What is the measure of a good workout?

The cone of uncertainty would be a lot more applicable here if TR didn’t create an error/bias by assuming a survey response.

The model applies all of its general and athlete-specific knowledge to determine the next workout. Then TR pollutes that by assuming the athlete will select a particular response. This is an unnecessary secondary processing of the data that bifurcates the result and introduces error. The model can do just fine (actually better!) without inventing a response.

Remember that if you have 10 intense workouts in a block that each have a 60% chance of “hard” and a 40% chance of “very hard”, the simulation will assume that you say “hard” EVERY TIME, when statistically you would say “very hard” four of those times. In this case, you would see a significantly over-optimistic prediction of FTP.

The AI workout selection is excellent, but I will take the prediction with a grain of salt until they remove the assumed survey response from the simulation.

4 Likes

I think if we, as athletes, are expected to trust their guidance based on their formulas, they owe it to us to disclose how it is being done, so we can draw our own conclusions or at least have good faith in the methodology. I appreciate that some of it may be a trade secret, but guidelines would be better than the post-workout declaration we get now.

1 Like

The number that TR is showing you should be calculated by figuring out the predicted FTP for every workout rating combination.

E.g. suppose there are two workouts between now and the predicted FTP (an easy example for illustration), and for both the system thinks there is a 60% chance of rating them moderate and a 40% chance of rating them hard. Then the Predicted FTP should equal the sum of the following:

  • 36% * Predicted FTP if you answer both as moderate (60% chance of moderate for the 1st * 60% chance of moderate for the 2nd)
  • 24% * Predicted FTP if you answer the first as moderate and the second as hard (60% * 40%)
  • 24% * Predicted FTP if you answer the first as hard and the second as moderate (40% * 60%)
  • 16% * Predicted FTP if you answer both as hard (40% * 40%)

The above calculation is actually the Expected FTP prediction. If the initial FTP prediction is predicated on you answering both as “moderate”, the system is showing a value that it thinks will only happen 36% of the time.

Now do this same math for 12 workouts (3 a week for 4 weeks) to 20 workouts (5 a week for 4 weeks), and the expected FTP prediction won’t really ever happen in real life.
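That weighted sum is easy to sanity-check in code. The per-combination FTP values below are made up for illustration; only the 60/40 probabilities come from the example above:

```python
from itertools import product

rating_prob = {"moderate": 0.6, "hard": 0.4}
ftp_if = {                          # hypothetical predicted FTP per combination
    ("moderate", "moderate"): 260,
    ("moderate", "hard"): 257,
    ("hard", "moderate"): 257,
    ("hard", "hard"): 253,
}

# Expected FTP = sum over every rating combination of
# P(combination) * predicted FTP for that combination.
expected_ftp = sum(
    rating_prob[a] * rating_prob[b] * ftp_if[(a, b)]
    for a, b in product(rating_prob, repeat=2)
)

print(f"expected FTP: {expected_ftp:.2f} W")  # vs 260 W on the all-"moderate" path
```

The same `product(..., repeat=n)` loop extends to 12 or 20 workouts, though the number of combinations grows as 2^n, which is another reason a Monte Carlo sample beats full enumeration.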

3 Likes

Yes, incorporating the entire projected response distribution would be better than what they are doing now, but the model doesn’t need any response to determine subsequent workouts.

Predicted survey response is useful for us to see when analyzing a future ride, but for the simulation, it’s just a different way to process the same data it used to generate the next power graph (suggested workout). Unfortunately, it happens to be a way that introduces error by tilting the FTP trend up or down by ignoring all of the less common responses.

Notice that the model could also invent an HR graph for future workouts based on all of the same data it used to select the workout, but that would be silly, right?

If we want real insight into how a future workout impacts fitness, it should look at nothing more than the work done, as if the user had forgotten to complete the survey. Remember, it still has all of the user’s past workout responses to use in addition to the new power graph. It can select a great subsequent workout with that data alone. The model is smart… if they would just let it do its thing and stop overly guiding it.

Well, there are a lot of threads and posts where users are saying that the post-workout response impacts the Predicted FTP. If this is true, then TR needs to calculate all of the possibilities, sum them up as in my example, and show you what the system predicts the most likely FTP gain (or loss) will be.

Assuming every workout is executed perfectly and the user responds optimally isn’t a realistic outcome for most people. So why show a number that will (almost) never be realized?