Further than that even, right? Most programs don't change that drastically year over year. Coach stays the same, roster evolves, lose some production, gain some production. Those are all fairly "modeling friendly" events.
Then there are the Deions, or the "wow, maybe Geoff Collins really did bring the whole thing down that much" rare events.
They're tougher to quantify, and they might not even make a big impact on your overall model evaluation. Are you testing more for "how accurate was this across the largest number of teams" or "how accurate was this on the random one-off changes"? Or trying to weigh both somehow? From what I've seen, the majority of systems out there right now are squarely in the first camp: predict well for the standard cases. So even if some of them do better on the wildcard cases, that doesn't mean anyone can necessarily tell you which ones those are, or has a good way of ranking for that.
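To put some shape on the "weigh both somehow" idea, here's a rough sketch of what that could look like. Every name in it is made up for illustration, and the `wildcard` flag in particular is an assumption: deciding which teams count as wildcards is the hard, subjective part.

```python
from statistics import mean

def mean_abs_error(rows):
    """Average absolute miss between predicted and actual win totals."""
    return mean(abs(r["predicted_wins"] - r["actual_wins"]) for r in rows)

def blended_score(results, wildcard_weight=0.3):
    """
    results: list of dicts like
      {"team": ..., "predicted_wins": ..., "actual_wins": ..., "wildcard": bool}
    'wildcard' is whatever flag you decide marks a Deion/Collins-type
    offseason; that labeling step is where all the judgment lives.
    """
    standard = [r for r in results if not r["wildcard"]]
    wildcard = [r for r in results if r["wildcard"]]

    standard_mae = mean_abs_error(standard)
    wildcard_mae = mean_abs_error(wildcard) if wildcard else 0.0

    # Lower is better; the weight decides how much you care about
    # nailing the rare cases vs. the bulk of the field.
    return (1 - wildcard_weight) * standard_mae + wildcard_weight * wildcard_mae
```

Nothing fancy, but it makes the tradeoff explicit: a system tuned for the first camp is effectively running this with `wildcard_weight=0`, whether anyone says so or not.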