Reliability Engineering, Scale, and TrainerRoad

@Nate_Pearson & team -

I don’t love doing forum callouts for this stuff, but this is starting to happen with enough frequency that it’s frustrating me as a customer:

Without knowing the specifics of how TR engineering teams are structured, I will assume there is a team that is either loosely or explicitly defined as SRE/WebOps/DevOps/Platform-Engineering (pick your buzzword), otherwise known as the group of people whose shared responsibility is to keep the site online and the data on time.

In my own work life, I lead engineering teams in this space, and have first-hand experience of how difficult, stressful, and non-trivial it can be to keep a platform online and highly available, especially a public-facing one… like TR, especially a growing one… like TR. I have limitless empathy and respect for the engineers who have to respond to alerts at 2am, and live-debug an incident on a critical production system. I have spent many hours on those calls, and probably shaved years off my life as a result. So I get it…

BUT… it is getting frustrating to encounter these platform issues with seemingly (albeit anecdotally) increasing frequency. TR markets itself heavily as a platform for time-crunched athletes, which is definitely me - so when I come home weary from my own day in the trenches, cram some food, get my bottles ready, get changed, and finally get on the bike to train, only to be presented with an authentication failure that makes the entire app fail, I have to ask some questions:

Why doesn’t the TR app fail-open? - Not being able to log in or otherwise talk to the TR servers should have zero bearing on my ability to open the app and do a workout. The app should opportunistically try to phone home and sync, but if that isn’t possible for some reason - either local to me or with the TR mothership - just fail-open and let me carry on! Cache the workout library and other required assets locally, and force a token renewal at some acceptable interval (1-2 weeks)… if I am offline for 3 weeks, fine, force me to log in and renew.
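To make the fail-open idea concrete, here is a minimal sketch of the check I have in mind. Everything in it is an assumption for illustration: `CachedSession`, `can_start_workout`, `renew_token` and the two-week `OFFLINE_GRACE` are made-up names and values, not anything from the actual TR app.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Optional

# Assumed grace period, taken from the "1-2 weeks" suggested above.
OFFLINE_GRACE = timedelta(weeks=2)


@dataclass
class CachedSession:
    """Hypothetical locally stored session, not TR's actual data model."""
    last_renewed: datetime  # when the token was last renewed successfully


def can_start_workout(
    session: Optional[CachedSession],
    renew_token: Callable[[], None],  # hypothetical call to the auth server
    now: datetime,
) -> bool:
    """Decide whether the athlete may start a workout right now."""
    if session is None:
        # Never logged in on this device, so there is nothing to fall back on.
        return False
    try:
        renew_token()  # opportunistic phone-home
    except (ConnectionError, TimeoutError):
        # Fail open: the server is unreachable, so trust the cached session
        # as long as it was renewed within the grace period.
        return now - session.last_renewed <= OFFLINE_GRACE
    session.last_renewed = now
    return True
```

Only connectivity-style failures fail open in this sketch. Whether server-side errors should also be treated that way is a design choice, and an explicit rejection (say, a cancelled subscription) would presumably still need to block.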

Why is there no status page and/or updates on any of the social channels? - Get something… anything… up at status.trainerroad.com. There are dozens of serviceable products in this space: Atlassian Statuspage, Incident.io, PagerDuty, Pulsetic, Uptime.com, etc… just pick one, and throw something up. Have it report basic availability status for however the app is broken down, in a way that is useful to end-users (you can have a different dashboard for internal stuff, maybe you already do). I’m just making this up, but maybe AI FTP detection is an entirely separate service from basic workout syncing. Imagine if, instead of people asking why AI FTP detection is slow or not working, they could just check the status page! Amazing, right!? In addition to the automated availability reporting, engineering teams can post quick blurbs/updates about any incidents being investigated. Doing a scheduled deployment that might cause issues? No problem! Just throw it on the status page.
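As a rough sketch of the per-component reporting described above: each user-facing component gets its own status, and the page headline is simply the worst of them. The component names (`workout_sync`, `ai_ftp_detection`) and the three status levels are my own assumptions; a hosted status page product would provide its own equivalent of this.

```python
from enum import IntEnum


class Status(IntEnum):
    """Assumed severity levels, ordered so that a higher value is worse."""
    OPERATIONAL = 0
    DEGRADED = 1
    OUTAGE = 2


def overall_status(components: dict[str, Status]) -> Status:
    """Headline status for the page: the worst status of any component."""
    return max(components.values(), default=Status.OPERATIONAL)


def render_status_page(components: dict[str, Status]) -> str:
    """Plain-text status page: a headline plus one line per component."""
    lines = [f"Overall: {overall_status(components).name.lower()}"]
    for name, status in sorted(components.items()):
        lines.append(f"  {name}: {status.name.lower()}")
    return "\n".join(lines)


# Hypothetical component breakdown, purely for illustration.
page = render_status_page({
    "workout_sync": Status.OPERATIONAL,
    "ai_ftp_detection": Status.DEGRADED,
})
```

With that breakdown, someone wondering why AI FTP detection is slow would see it flagged as degraded while workout syncing still shows as operational.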

Coinbase has a nice example of this: https://coinbase.statuspage.io/

Is there an SRE/DevOps/WebOps team at TR? - Getting back to my assumption at the top, I assume there is a team or group whose core responsibility is production, or a dog-fooding DevOps type model, which also works well… But if I am wrong, I can’t help but think it’s high time to remedy this, and treat production availability as a primary concern. Choose whatever operational model you like - it doesn’t matter, as long as at the end of the day someone (a team) is responsible for it. Production systems can’t be an off-the-side-of-your-desk-not-my-day-job that some dev gets saddled with, but who can’t really focus on it. If something breaks in production, someone needs to be paged by an automated alerting system immediately, and engaged to drive a resolution.

Before the forum jumps down my throat and says TR is a small company and this feedback is unreasonable: what I’ve outlined above is absolute basic, table-stakes stuff for running web platforms, small or large, and if anything the challenges I am highlighting are ones of scaling up and success… which is great!

@Jonathan you ask on every podcast to get people to sign up for TR, and note that the increased subscription rates will help with feature development and overall value delivery for customers. As one customer (but I am sure I speak for a few), one area I would like to see an increased focus on is operational discipline… at least give me the impression you’re monitoring things!

tl;dr - Get a bloody status page up asap. It would immediately negate this whiny post, and every whiny post like it asking “is the site up?”.


I’d also like to see this, rather than (as I am having right now trying to use the calendar) the spinning red circle for ages and then the “ouch we have taken a nasty spill” page…


Yes! Having had this happen before, it’s a huge bummer, and if it could fail open with the next workout autosaved that would be amazing.

I am finding the app a bit laggy on Android, but I would say I only get the “something went wrong” message when the Wi-Fi/data is a bit flaky.

Hey @mikethewhite,

Thanks for the well-thought-out message. I agree that we should communicate service outages better.

I want to specifically address the issue you ran into this morning, which is actually separate from the website outage we experienced this morning. That website outage is now resolved.

The issue you ran into is an authorization issue. Any time we discover a bug that is blocking athletes from doing a workout, it becomes our highest priority. This one has been an intermittent issue that athletes have gotten around by relaunching the app, so it hasn’t been reported at the sort of frequency to sufficiently ring the alarm bell. That said, we’re working on it and it is the focus of our principal engineer as we speak.

Separately, our DevOps team is looking into recent outages to see what caused them in the first place, and also whether there has been an increase. We’re also discussing options to make it clearer when we have outages.

Again, I really appreciate you taking the time to write such a well-thought-out message. It helps us improve :slight_smile:


I really appreciate the response! I’m sure you guys will handle it.
