Hmmm, the Info page shows
Overall Score = 0.60 × RMSE(Goals) + 0.40 × F1(Stage)
but, since a higher RMSE is bad, this should maybe be scored something like
Overall Score = 0.40 × F1(Stage) - 0.60 × RMSE(Goals)
or will the RMSE be 'normalised' the way @Yehoshua does it
1.0 - (raw - lo) / (hi - lo)
Minor detail I guess ...
I think Zindi applies the normalization for every multi-metric competition according to this post: https://zindi.africa/learn/introducing-multi-metric-evaluation-or-one-metric-to-rule-them-all.
From the Zindi multi-metric explanation and my own analysis, RMSE is not used in raw form. It is normalized (bounded using a reference/starter baseline), so it becomes a comparable “higher is better” score. Then it is combined with F1 using a weighted average. So the formula applies to normalized RMSE, not raw RMSE subtraction.
So the formula becomes:
Overall = 0.60 × RMSE_normalized + 0.40 × F1
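To make the arithmetic concrete, here is a minimal sketch of that weighted average, assuming the min-max form quoted earlier in the thread (1.0 - (raw - lo) / (hi - lo)). The function names, the clipping to [0, 1], and the example bounds are my own assumptions for illustration, not Zindi's published implementation.

```python
def normalize_rmse(raw, lo, hi):
    """Min-max scale an RMSE so that a lower error gives a higher score.

    lo and hi are the reference bounds (assumed to be known).
    Clipping to [0, 1] is an assumption, not something the thread confirms.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    scaled = 1.0 - (raw - lo) / (hi - lo)
    return min(1.0, max(0.0, scaled))


def overall_score(rmse, f1, lo, hi, w_rmse=0.60, w_f1=0.40):
    """Weighted average of normalized RMSE and F1 (both 'higher is better')."""
    return w_rmse * normalize_rmse(rmse, lo, hi) + w_f1 * f1


# Hypothetical numbers: RMSE of 1.0 with bounds [0.0, 2.0] and F1 of 0.5.
example = overall_score(rmse=1.0, f1=0.5, lo=0.0, hi=2.0)
```

With these made-up inputs the normalized RMSE is 0.5, so both terms contribute on the same 0 to 1 scale.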
Thanks Shannon - yes, that sounds a lot like @yehosua's code, the snippet I posted above. It seems he uses the highest and lowest RMSE scores from the LB to bound it ... is that how you understand it also?
Yes, that’s my understanding as well. In @yehosua's code, the normalization bounds are derived from the current leaderboard distribution, typically using the minimum and maximum RMSE values observed. That effectively creates a dynamic min–max scaling of RMSE within the cohort. One limitation of this approach is that the normalization is cohort-dependent, since the bounds come from the current set of leaderboard submissions. This makes the scaled score unstable while the competition is active, as updates to the leaderboard can shift the min–max range. It is also sensitive to outliers: a single extreme submission can stretch the scaling range and compress the differences between the other models.
This differs from Zindi’s approach, which uses fixed reference bounds (e.g., a baseline or starter solution). That makes it more stable and reproducible over time, because the scaling no longer depends on the current set of submissions.
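A small sketch of the difference, using made-up RMSE values. The leaderboard numbers and the fixed bounds below are hypothetical and only illustrate how one outlier shifts a leaderboard-derived scale while a fixed scale stays put.

```python
def scale(raw, lo, hi):
    """Min-max scaling where a lower RMSE maps to a higher score."""
    return 1.0 - (raw - lo) / (hi - lo)


# Hypothetical leaderboard RMSE values and the model we want to score.
leaderboard = [1.10, 1.25, 1.40]
my_rmse = 1.25

# Cohort-dependent bounds: min and max of the current leaderboard.
before = scale(my_rmse, min(leaderboard), max(leaderboard))

# One extreme submission arrives and stretches the range.
with_outlier = leaderboard + [4.00]
after = scale(my_rmse, min(with_outlier), max(with_outlier))

# Fixed reference bounds (hypothetical values, e.g. from a starter baseline).
FIXED_LO, FIXED_HI = 0.0, 2.0
fixed = scale(my_rmse, FIXED_LO, FIXED_HI)

print(f"leaderboard bounds, before outlier: {before:.3f}")
print(f"leaderboard bounds, after outlier:  {after:.3f}")
print(f"fixed bounds (unaffected):          {fixed:.3f}")
```

The same model, with the same RMSE, gets a very different scaled score once the outlier appears, while the fixed-bound score does not move.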
That's right -:)
https://github.com/yehoshua0/zindi-world-cup-2026-bot/blob/main/wc2026bot%2Fevaluation.py
PS: It's @yehoshua 😅
@Shannon_Sikadi, thank you for this point of view.
Do you think we can try to approximate this and implement it in our evaluation.py without needing the fixed reference bounds, so I can improve the leaderboard?
Oops my mistake, apologies @yehoshua
Thanks Shannon for the detailed explanation, wow, are you AI?
Hmmmmmm btw @yehoshua should it not perhaps be micro f1, not macro? Given the highly imbalanced labels and the single-column nature of the group stage in this comp, I'm thinking micro is the way to go here?
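For anyone comparing the two averages, here is a minimal pure-Python sketch with made-up stage labels (the label names and the predictions are hypothetical, and this is not the code in evaluation.py). It shows how a model that always predicts the majority stage looks under micro versus macro F1.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 scores, every class weighted equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical imbalanced labels: most teams exit in the group stage.
y_true = ["Group"] * 8 + ["R16", "Final"]
y_pred = ["Group"] * 10  # always predict the majority class

macro, micro = f1_scores(y_true, y_pred)
print(f"macro F1 = {macro:.3f}, micro F1 = {micro:.3f}")
```

For single-label data like this, micro F1 is the same number as plain accuracy, so it rewards getting the common class right, while macro F1 penalises missing the rare stages.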
Okay, let me try it and see how it correlates. I think we could get a much improved evaluation.py once the first update after submissions close is out. I can implement micro F1, but how do I evaluate it?
Yeah, your tool presents well, but also you need some substance.
Right now you should only score teams whose campaigns have ended: those who are not going any further and are not playing any more games, because both their number of goals and their final stage are now known. The others are still fluctuating, so you cannot really include them in the (current) score.
I think (hope) this is how Zindi will update the leaderboard, probably at the end of the group stage, round of 32 etc. The group stage is still underway, but at least some of the groups are done and some of the results are fully known by now so you could use those if you want accuracy in your scores.
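A rough sketch of that filtering idea, assuming each team is a dict with hypothetical fields campaign_over, pred_goals and actual_goals (none of these names come from the competition data or from evaluation.py):

```python
import math


def rmse_on_finished(rows):
    """RMSE of predicted vs actual goals, using only teams whose campaign is over.

    Returns None when no team has finished yet, since there is nothing to score.
    """
    finished = [r for r in rows if r["campaign_over"]]
    if not finished:
        return None
    squared_error = sum((r["pred_goals"] - r["actual_goals"]) ** 2 for r in finished)
    return math.sqrt(squared_error / len(finished))


# Hypothetical data: two teams are out, one is still playing.
rows = [
    {"team": "A", "campaign_over": True, "pred_goals": 3, "actual_goals": 2},
    {"team": "B", "campaign_over": True, "pred_goals": 1, "actual_goals": 4},
    {"team": "C", "campaign_over": False, "pred_goals": 6, "actual_goals": 5},
]

partial_rmse = rmse_on_finished(rows)
```

Team C is skipped because its goal total can still change; the same filter could be applied before computing F1 on the stage labels.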
It is a really complex calculation, but at least now in the knock-out stages it is a lot easier.
To calculate micro is a real small change ... just change
into (you can do it much better, I'm just taking easiest edit of your existing here)