Psychometric analysis of the Ramp test

So I have been struggling with ramp tests myself. I have read most of what I could find on the TR forum and elsewhere, and it seems there is a lot going on that may distort the result of the ramp test.

I am an academic researcher, and I am also somewhat familiar with sports psychology (I am not an expert, but I have worked with people who are and read some papers on it). Mind you, I have not done a literature review of “FTP testing”, so perhaps there is a lot of research on this already. My insights mostly come from what I come across on this forum. So please enlighten me if I am missing out on crucial information.

From a psychometric perspective, it seems that every FTP-test - whether it is the Ramp test or the 20min FTP-test - is a performance test with the aim of finding out the true FTP score. From theory, it follows that every test has a true score and error score component, thus:

FTP score = True FTP score + Error

Now, we are interested in this True FTP score, but we have to deal with error. Classically, error is argued to result from “varying test conditions”. In psychology, this would be how much you studied, or between-person variation such as intelligence and other abilities. It also comprises test anxiety.

Within a cycling context, this may at least entail: fuelling (carbs, proteins), hydration (fluids, minerals, salt, etc.), sleep, illness, mental fatigue, mental stress, physical fatigue, etc.
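To make the true-score-plus-error idea concrete, here is a minimal Python sketch. The true FTP of 230 W and the error standard deviation of 8 W are made-up numbers for illustration, not measured values, and `simulate_tests` is a hypothetical helper name.

```python
import random
import statistics

def simulate_tests(true_ftp, error_sd, n_tests, seed=0):
    """Simulate observed scores as true score plus random error."""
    rng = random.Random(seed)
    return [true_ftp + rng.gauss(0, error_sd) for _ in range(n_tests)]

# Assumed values, for illustration only.
scores = simulate_tests(true_ftp=230, error_sd=8, n_tests=200)

print(round(statistics.mean(scores), 1))   # close to the true score
print(round(statistics.stdev(scores), 1))  # close to the error SD
```

A single test is one draw from this spread. Only the average over many tests settles near the true score, which is why the size of the error component matters so much.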

Another thing I come across is that some people state they are not really a “VO2-type of rider”, meaning that using VO2Max as a reference for FTP does not work for them and that, for example, those riders benefit more from a traditional 30min or 20min FTP-test. This seems to suggest that at least the ramp test suffers from criterion contamination, lowering its criterion validity. It not only measures people’s FTP, it also simultaneously measures VO2-proneness, or people’s genetic/inherent VO2-ability.

There is also the motivational component, entailing test anxiety for instance.

There are a LOT of factors influencing the error component for an FTP-test, perhaps especially the Ramp-test (I have read more problems for people with the Ramp-test than with the 20/30-min FTP test).

From a statistical point of view, this also means that there are a lot of other factors contributing to the proportion of explained variance besides FTP.

What I would like to accomplish, is the following:

  1. Make an inventory of all these factors that can distort FTP-tests, particularly the Ramp test
  2. See if I am able to quantify these distorting factors.
  3. See how large the contribution of these distorting factors actually is.
  4. See whether it is possible to correct for them in a Ramp test or whether previous training data - self-report or physiological data - can predict FTP.
  5. See how these distortion factors influence FTP over time using longitudinal modeling.

Any ideas / suggestions?

iLLucionist, this is an interesting and ambitious research project. But please do remember that there are a lot of people who have thought about and experimented with this kind of project, so if you are going to do something like this, then you need to commit to it. This is a long term piece of research.

The second observation that I would make is that the concept, FTP, needs analysis itself. The first question I would ask is: what is the concept for which FTP is the appropriate measure?

Good luck with that!


Thanks. Yeah, every research project is ambitious. I am not necessarily interested in an academic publication, I just want to “find out how it works”. But commitment for an academic publication is typically 5-7 years, taking into account high quality design, data, writing, and multiple revision rounds across multiple top-tier journals.

I wholeheartedly agree. The concept of FTP itself is also much-debated. There are also a couple of GCN videos about this. For example, I am all about endurance riding, which would be more about lower-intensity (neuro)muscular endurance, “keeping going”. Some say you can just take a % of your FTP and determine how many watts you should be able to output over 4-5…6…8 hours. Others say FTP does not reflect your endurance ability / fitness. Who knows?

Only anecdotal offerings here, but when I started back training this year, after 4 months off, my FTP was surprisingly high. I did a cursory search & investigation into the components of FTP, to see what was supporting my own results. My findings were a bit vague but I ended up assuming my FTP was composed mostly of muscular strength rather than cardio fitness.

As well, FTP can be derived from, I’m guessing, the top (VO2max) or the bottom (endurance) or a mix. I’ve had the same FTP score yet based on very different training.

Trying to pin down FTP might drive you mad.

Yes, exactly. That’s why I want to undertake this endeavor. It IS driving me mad. Particularly ramp tests and FTP-values. Outside, I can ride SO HARD. I have a 4i left-side power meter (I know my left-right balance is essentially the same, I had a bike fit last year), and a tacx 2 neo inside. Yesterday, I rode about 200 watts for 10 mins, and then around 300 watts for 5 mins into a head wind. Heart rate touched 160-165. My FTP from my last ramp test was 212. But when I did a ramp test yesterday, I bailed early and it said my FTP would be 206.

Currently, I don’t know if I am limited by VO2. Would be weird, VO2 intervals feel the EASIEST. The hardest for me is threshold. Or mental anxiety. Because as soon as I see 260 watts or 175 HR, I am like “yup this is probably my max”. Which is around 18:30 mins into the ramp test. Drives me crazy.

EDIT: for clarification, yes I first did a ramp test and then went outside to compare my crappy ramp test to how riding at higher watts feels.

I’m actually going to take issue with the very basis of your line of inquiry. I’m not sure we’re actually interested in the “absolute” FTP physiologic value that the ramp test estimates. The purpose of the ramp test cannot and will never be to come close to an accurate representation of this biological number. It’s a training tool, a number upon which you can pin your training sessions. While I completely agree with your assessment of the sources of error, I’m not sure they’re “distortions” in the way you’re mentioning, because literally any strenuous physiologic test will have anxiety, fueling, etc. as possible sources of variance that aren’t an absolute change in the physiologic value of the FTP.

The ramp test probably does depend on more than just “maximum aerobic power,” it’s probably some combo of that and VO2 and lactate processing etc. etc. Regardless, it still generates a number that allows most people to complete workouts at the maximum level of difficulty without burning up. This number might be close to some actual physiologic value or not, but the fact of the matter is it generally seems to work. I think the nature of the questions you’re asking is so individual that there is unlikely to be an underlying generalizable fact beneath them. Even if there were, the easiest way to address this value is to just tell everyone to troubleshoot it themselves the way they already are: do everything as close to the same every time as possible, and if you feel the number is off, do a workout you’re familiar with and see if the level of effort matches your expectations.

I’m not quite sure how a 20-30 minute traditional FTP test addresses any of the sources of error you bring up, and it also introduces a new, even more difficult to contend with source: you have to guess what you think your new FTP is at the beginning of the test and hold on for dear life. I’m not disputing any of the sources of error you bring up, just disputing how “interesting” these things are from a scientific standpoint in the context of the ramp test. It’s such a multi-variate, complex process, I’m not convinced of your ability to quantify or model these sources of error in any meaningful way that couldn’t be more effectively and easily dealt with by individuals experimenting with their own physiology and seeing what works.

Just my two cents, thanks for the interesting thoughts!


You might want to read the ‘Outside vs Inside FTP’ thread. :wink:

As FTP is a fuzzy concept comprised of fuzzy components*, I don’t envy you in your quest.

*(if one truly wanted to know their “FTP”, they could do a bunch of lab tests to ascertain actual physiological “zones”.)


Exactly. I really think the confusion here is viewing the “FTP” as some sort of physiologic absolute when really it’s an average of a bunch of different things, some physiological, some psychological. There are “real” numbers like lactate threshold or VO2 max that can be measured in a lab, while FTP is just a catch-all number that’s close enough for most people’s training purposes. Viewing the non-aerobic physiologic components and the psychological components as “distortions” rather than “variables I need to try to nail down as much as possible” is a fundamental misunderstanding of the nature of the ramp test.

Disclaimer: I’m a PhD Immunologist and have no particular expertise in exercise physiology outside of listening to Coach Chad.


The complication here is that the ramp test is a method to estimate a parameter which itself is an estimation of a bunch of parameters. The main driver of errors in a ramp test, apart from the rider-state ones (motivation, freshness, fueling, etc), is the underlying assumption that FTP = 75% x best one-minute power. This is a little bit like Max HR = 220-age; perhaps better, perhaps worse, I don’t have the correlation data to say.
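The 75% rule can be sketched in a few lines of Python. This is only an illustration of the assumption as stated in this post, not how any platform actually implements it; the function names, the 1 Hz sampling, and the 20 W per minute steps are all hypothetical.

```python
def best_one_minute_power(power_watts):
    """Highest 60-second average from a list of 1 Hz power samples."""
    window = 60
    if len(power_watts) < window:
        raise ValueError("need at least 60 seconds of data")
    running = sum(power_watts[:window])
    best = running
    for i in range(window, len(power_watts)):
        # Slide the window forward by one second.
        running += power_watts[i] - power_watts[i - window]
        best = max(best, running)
    return best / window

def ramp_ftp_estimate(power_watts, factor=0.75):
    """FTP estimate as a fixed fraction of best one-minute power."""
    return factor * best_one_minute_power(power_watts)

# Hypothetical ramp: steps of 20 W per minute from 100 W to 300 W.
ramp = [100 + 20 * (second // 60) for second in range(11 * 60)]
print(ramp_ftp_estimate(ramp))  # 225.0
```

Every rider gets the same `factor`, so anyone whose real ratio of threshold to one-minute power differs from 0.75 gets a biased estimate no matter how well rested or fueled they are.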


Checking in so I can track this. I’m a behavioral scientist who likewise dabbled in sports psychology, so I know some of the words in the opening post.


I just do that ramp test thingy and pedal until I can’t anymore and then use that number that pops up.


Might be useful?

Have you done an experimental design to understand how many trials it would take to estimate the impact of all these input factors? And defined how to measure each of the inputs in a reasonably objective and consistent way? And how to recruit the test subjects?

Seems like a Herculean task to undertake…

In the construction of a measurement, there is a distinction between the construct, the operationalization, and its psychometric properties.

The point you are raising is about the definition of the construct: what IS FTP? It is also about the operationalization: what IS it measuring?

There are indeed roughly two common uses of FTP as I understand it:
(1) find a physiological value or indication of “strength” or “fitness” on the bike (e.g. “the theoretical maximum power you can sustain over the duration of an hour”)
(2) a number to quantify training progress, either increase or decrease

This is distinct from its psychometric properties, which entail reliability and validity. What you are suggesting - and what I tend to agree with - is that FTP will, in your words, “never be to come close to an accurate representation (…)”.

This is quite problematic still - regardless of the purpose for which FTP is used - because even for it to be used as a measure of progress, we need to ensure that FTP is an accurate (enough) reflection of training progress. If there is much error variance, it is not suitable to measure progress. From a statistical point of view, a large standard deviation means wide confidence intervals around each value, and when the intervals of two values overlap heavily, the difference between them cannot be told apart from noise. Thus, let’s say the ramp test indicates that FTP has increased from 220 to 240. If error is too large, the difference of 20 is not statistically significant. Thus, statistically, you cannot conclude that you have increased your FTP or progressed, even though the absolute difference of 20 is there.
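The 220 to 240 example can be made concrete with a standard error of measurement (SEM). The sketch below assumes the errors of the two tests are independent and roughly normal, and the SEM values of 5 W and 10 W are made up, since nobody in this thread has measured the SEM of a ramp test. The function names are hypothetical.

```python
import math

def smallest_detectable_change(sem_watts, z=1.96):
    """Smallest difference between two single tests that exceeds
    measurement noise at roughly 95% confidence.

    The error of a difference of two independent scores has
    standard deviation sqrt(2) * SEM.
    """
    return z * math.sqrt(2) * sem_watts

def is_real_change(ftp_before, ftp_after, sem_watts):
    """True if the change is larger than the smallest detectable change."""
    return abs(ftp_after - ftp_before) > smallest_detectable_change(sem_watts)

for sem in (5, 10):  # assumed SEM values in watts, not measured
    print(sem, round(smallest_detectable_change(sem), 1),
          is_real_change(220, 240, sem))
# 5 13.9 True
# 10 27.7 False
```

With an SEM of 5 W the 20 W gain stands out from the noise; with an SEM of 10 W it does not. So the whole question hinges on how large the error of a single ramp test really is.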

It is very simple to see that these factors that I refer to as “distortions” are, in fact, contributors to error variance. First, let me break it down a bit.

With regards to the ramp test, there are two distinct issues:
(1) there is a “true” FTP-score (a latent variable), that we want to estimate accurately enough for it to be useable as a measure of progress (let’s assume that is the intent within TR workouts).
(2) Is the ramp test the right way to measure FTP?

For example, if I were to measure intelligence by measuring the height of people - or their brain size, for that matter - intelligence still exists as a separate construct, but the measurement will never capture intelligence. Because we know that body height or brain size is very weakly correlated with intelligence. In other words, the question is whether the ramp test is valid (enough) to measure FTP.

Back to distortions. These have to do with reliability: will I find the same measure again if I were to do the test again now or tomorrow. These “distortions” are relevant, because they can increase error variance, and thus give a quite misleading picture of FTP.

Let’s assume I slept poorly and my FTP after ramp test is 220.
Let’s assume I slept very well and my FTP after ramp test = 240.

(1) What is my true FTP? Is it closer to 220 or closer to 240?

On the assumption it would be possible to filter out this influence (control for it statistically), we would see that error variance for the variable “FTP” would be greatly reduced when we include “sleep quality” in a statistical model. This would indicate that, indeed, sleep quality is a variable that obscures the true FTP score. In other words, sleep quality reduces the RELIABILITY of ramp test in predicting FTP.

This exact same thing holds for mental anxiety. Now, people may differ in the extent to which mental anxiety impacts them PERSONALLY. This would be a trait variable, and a potential moderator in the stability of FTP values. One example may be neuroticism, trait anxiety, or self-efficacy/self-esteem.

Now, you are correct to assert that this is a potential source of variance that does not change DIFFERENCES in FTP IF this is a trait. It would simply mean that this error leads to systematic FTP inflation/deflation (similar to common method variance). This FTP-value can then still be used to track DIFFERENCES, even though it still distorts the true FTP score.

However, mental anxiety may also be due to a major life event (injury, race lost, race won). Then, it would only influence SOME FTP-measurements, and then it is a problem again.
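The difference between a trait-like offset and an occasional one is easy to see with toy numbers. The values below are invented: three true FTP values, a constant 10 W anxiety penalty, and a 10 W penalty that hits only one test.

```python
true_ftp = [220, 230, 240]            # assumed true values over three tests

trait_offset = [-10, -10, -10]        # same anxiety penalty every test
state_offset = [0, -10, 0]            # penalty on one test only

def observed(true_scores, offsets):
    """Observed score = true score + offset for that test."""
    return [t + o for t, o in zip(true_scores, offsets)]

def changes(scores):
    """Test-to-test differences."""
    return [b - a for a, b in zip(scores, scores[1:])]

print(changes(true_ftp))                          # [10, 10]
print(changes(observed(true_ftp, trait_offset)))  # [10, 10]
print(changes(observed(true_ftp, state_offset)))  # [0, 20]
```

The constant offset cancels out of the differences, so progress tracking still works. The one-off offset turns a steady 10 W per test improvement into an apparent stall followed by a jump.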

I also agree that the 20-30 min FTP test is also problematic, but for different reasons. I am not sure what the ramp test exactly depends on, but VO2 seems to be a large component of it. This means that it does not discriminate all people equally: people who are VO2-prone get more accurate results than people who are not VO2-prone. Potentially, VO2-proneness would be a moderator of this. On the assumption that you would take a sample of cyclists into the lab and measure their physiological VO2, you would assume that VO2 correlates strongly with ramp test results, and that the FTP of people with low VO2 is systematically lower than that of people with high VO2, regardless of other indicators (not sure which ones that would be, haven’t looked into that).

What I am interested in, is how accurate and, therefore, how usable the ramp test really is to predict FTP. For that, I am particularly interested in HOW LARGE of an effect these common distortions have, and whether it would be possible to filter them out by correlating them with self-report measures such as sleep quality, or other objective variables, in order to improve ramp tests or FTP-values.

TR actually is a great platform for this: it already records a lot of data over time. If you were to add some self-report questions to that and potentially combine other data sources, you could devise an algorithm to improve the accuracy of ramp tests / FTP-values.

For instance:

  • use Apple Watch or MyFitnessPal to judge food intake
  • use those sleep trackers to input sleep quality
  • ask people about sleep quality
  • ask people about mental anxiety. This could partly be gauged from heart rate, pending research: seeing how much physiological fatigue contributes and what is left of the HR increase before starting the ramp test. Not sure if that would be accurate enough, as there are many factors influencing HR


Finally, it is also a problem that FTP = 75% of the highest one-minute power in the ramp (give or take). I have seen some other platforms and some blogs that do it a bit differently. I have no idea what the best take on this is.

At its simplest, some empirical studies could be conducted to combine a plethora of data sources, estimate error variance components and other contributors to ramp testing / FTP values (over time), and derive potential confidence intervals to correct this for every individual rider in TR, similar to how you can offset your power meter outside. This, of course, requires a representative population, and taking into account moderators and differences amongst people (correct stratification in sampling). But it is possible, at least theoretically.

No, I have not. I would first need to dive into the literature to see how this is typically done in research. But for me, it would probably be best to team up with a sports psychologist, recruit riders from the right population, and test them systematically as a pre-test. That would mean doing a cross-sectional study incorporating self-report data, biographical data, and physiological data.

Take a read, and do this before investing too much more effort into this. The sample sizes you need can get very big in a hurry.

Example: if you have 7 input factors you are looking to evaluate (you listed 7 example factors in your original post), and just 2 levels per factor (eg high/low, or good/bad), a full factorial design requires 2^7 = 128 trials. And that’s just on one person!

Now you can be thoughtful about what interactions between input factors that you care and don’t care about, and do a fractional factorial design. This can reduce the sample size considerably.
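The trial counts can be checked with a short Python sketch. The factor names A to G are placeholders for the seven inputs, and the fractional design shown is one standard 2^(7-4) construction, picked here only as an example of how far the count can drop.

```python
from itertools import product

factors = ["A", "B", "C", "D", "E", "F", "G"]  # placeholder factor names

# Full factorial: every combination of 7 two-level factors
# (-1 = low, +1 = high).
full = list(product((-1, 1), repeat=len(factors)))
print(len(full))  # 128

# Fractional factorial 2^(7-4): vary A, B, C freely and set the other
# four columns from their products (D=AB, E=AC, F=BC, G=ABC).
fractional = [
    (a, b, c, a * b, a * c, b * c, a * b * c)
    for a, b, c in product((-1, 1), repeat=3)
]
print(len(fractional))  # 8
```

The price of going from 128 runs to 8 is that each main effect is mixed up (aliased) with some two-factor interactions, so this only works if you are willing to assume those interactions are negligible.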

Because of the nature of what you are trying to measure - absolute FTP - I think the identity of the test subject will always be a necessary factor. And this adds another complication - let’s say you have a test subject do 8 ramp tests over 8 weeks as you test the impact of different factors. You now also need to add training load as a factor - as it’s plausible someone gets more or less fit over 8 weeks. And if training load is a factor, that’s not something that’s easy to test at different levels (eg high or low) as it takes a long time.

If you have the passion and energy to do a study like this, IMO there are more interesting things to figure out. Like why some people respond very well to training and others don’t.

Thanks for the input! I would definitely read up on this first. I understand the complexities, also of data requirements.

With regards to the identity of the test subject, you are absolutely correct. Fortunately, there are longitudinal structural equation models that allow you to split up variance into a variable component, stable component, and time component, and all of them can be predicted by other variables. This should provide a good method to discern the extent to which these different factors, such as training load, impact these tests. Still, this would also require a large sample size.
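A full longitudinal structural equation model is beyond a short example, but the basic idea of splitting variance into a stable between-rider part and a variable within-rider part can be sketched with a one-way intraclass correlation. This is a much simpler stand-in, not the model described above, and the data are made up.

```python
import statistics

def icc_oneway(scores_by_rider):
    """One-way ICC(1) for riders who each did the same number of tests."""
    k = len(scores_by_rider[0])            # tests per rider
    n = len(scores_by_rider)               # number of riders
    all_scores = [x for rider in scores_by_rider for x in rider]
    grand = statistics.mean(all_scores)
    rider_means = [statistics.mean(r) for r in scores_by_rider]
    # Mean square between riders and mean square within riders.
    ms_between = k * sum((m - grand) ** 2 for m in rider_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2
        for r, m in zip(scores_by_rider, rider_means)
        for x in r
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Made-up data: three riders, three ramp tests each (watts).
data = [[210, 214, 212], [250, 246, 251], [300, 296, 304]]
print(round(icc_oneway(data), 3))
```

A value near 1 means almost all variation is between riders and repeated tests on the same rider agree closely. A low value would mean the test-to-test noise is large relative to the differences between riders.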

I agree that there are other - potentially more - interesting things to figure out. For now, I am more interested in the psychometric properties of the FTP-test, and see whether it needs to be and/or can be improved via ramp testing and additional data sources.

What you are suggesting may really be outside of the scope of my knowledge and skillset, unless it is psychological. I suspect a large part of (lack of) training adaptation has to do with physiological variables, but I may be wrong.

Thanks for expanding your discussion. I agree with pretty much all of your points, but my major emphasis is that which things affect the ramp test for a given individual and the amplitude of this effect on the ramp test for that individual are likely to be incredibly unique to that individual’s genetics, life experience, mentation, etc. I’m not convinced you’re going to be able to ask people about sleep quality and see a reliable trend of “sleep quality X leads to error in FTP test Y.”

I think the most feasible thing to study is the VO2 proneness point, and see in general how people who have a VO2 of 70 compare to people who have a VO2 of 40, but this will still be complicated by factors such as training history, competition history (e.g. competitive endurance athletes with high VO2s may also be psychologically “tougher” and have high ramp tests because of this as well), etc. So essentially you’d need to find completely untrained individuals that have genetically high or low VO2s and use this population to determine the effect of VO2 on ramp test results, maybe in terms of watt/kg FTP. I think you can see that even the “most feasible” of all your testing parameters would require an incredible number of participants to start addressing the question. And even then, how would you know whether the FTP test was “more accurate” in higher VO2 riders? Would you make them do an hour time trial and see which individuals could hold their power? Very few people can actually maintain hour power for an hour untrained.

In the end, I don’t think that what you’re proposing is theoretically unsound, necessarily, just that the practicality of addressing any of these variables would be a huge mess, and I’m not sure how you would determine whether an FTP is “accurate” or “distorted.”
