World Cup 2026 Goal Prediction Challenge

$1 000 USD
Reveal coming soon!
Prediction
Feature Engineering
775 joined
393 active
Start: Jun 12, 26
Close: Jun 19, 26
Reveal: Jul 19, 26
skaak
Ferra Solutions
Scoring this ... how?
23 Jun 2026, 14:54 · 11

Hmmm, the Info page shows

Overall Score = 0.60 × RMSE(Goals) + 0.40 × F1(Stage)

but, since a higher RMSE is bad, this should maybe be scored something like

Overall Score = 0.40 × F1(Stage) - 0.60 × RMSE(Goals)

or will the RMSE be 'normalised' the way @Yehoshua does it

1.0 - (raw - lo) / (hi - lo)

Minor detail I guess ...
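
To make the question concrete, here is a minimal sketch comparing the three formulas above on two made-up submissions with the same F1. The bounds `LO` and `HI` and all the numbers are assumptions for illustration only, not values used by Zindi.

```python
W_GOALS, W_STAGE = 0.60, 0.40


def score_as_written(rmse, f1):
    """Info-page formula taken literally, with raw RMSE."""
    return W_GOALS * rmse + W_STAGE * f1


def score_subtracted(rmse, f1):
    """Variant that penalises RMSE by subtracting it."""
    return W_STAGE * f1 - W_GOALS * rmse


def score_normalised(rmse, f1, lo, hi):
    """Variant that first maps RMSE to a higher-is-better score."""
    rmse_score = 1.0 - (rmse - lo) / (hi - lo)
    return W_GOALS * rmse_score + W_STAGE * f1


# Hypothetical bounds and submissions (same F1, different RMSE).
LO, HI = 0.5, 3.0
better = (1.0, 0.5)  # (rmse, f1)
worse = (2.0, 0.5)

print("as written:", score_as_written(*better), score_as_written(*worse))
print("subtracted:", score_subtracted(*better), score_subtracted(*worse))
print("normalised:", score_normalised(*better, LO, HI),
      score_normalised(*worse, LO, HI))
```

Taken literally, the first formula ranks the submission with the larger error higher, while the other two rank the smaller error higher.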

Discussion 11 answers
yehoshua
Heoshua

I think Zindi applies the normalization for every multi-metric competition, according to this post: https://zindi.africa/learn/introducing-multi-metric-evaluation-or-one-metric-to-rule-them-all

23 Jun 2026, 16:11
Upvotes 0
Shannon_Sikadi
Deep Learning IndabaX Zimbabwe

From the Zindi multi-metric explanation and my own analysis, RMSE is not used in raw form. It is normalized (bounded using a reference/starter baseline), so it becomes a comparable “higher is better” score. Then it is combined with F1 using a weighted average. So the formula applies to normalized RMSE, not raw RMSE subtraction.

So the formula becomes:

Overall = 0.60 × RMSE_normalized + 0.40 × F1

24 Jun 2026, 06:15
Upvotes 1
skaak
Ferra Solutions

Thanks Shannon - yes, that sounds a lot like @yehosua's code, the snippet I posted above. It seems he uses the highest and lowest RMSE scores from the LB to bound it ... is that how you understand it too?

Shannon_Sikadi
Deep Learning IndabaX Zimbabwe

Yes, that’s my understanding as well. In that approach (@yehosua's code), the normalization bounds are derived from the current leaderboard distribution, typically using the minimum and maximum RMSE values observed. That effectively creates a dynamic min–max scaling of RMSE within the cohort. One limitation is that the normalization is cohort-dependent: because the bounds come from the current set of leaderboard submissions, the scaled score is unstable during the active competition, as any leaderboard update can shift the min–max range. It is also sensitive to outliers, since a single extreme submission can stretch the scaling range and compress the differences between other models.

This differs from Zindi’s approach, which uses fixed reference bounds (e.g., a baseline or starter solution). That makes it more stable and reproducible over time, because the scaling no longer depends on the current set of submissions.
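
The difference can be sketched in a few lines. This is only an illustration under stated assumptions: the RMSE values and the fixed bounds `FIXED_LO` and `FIXED_HI` are made up, and the helper names are hypothetical, not taken from evaluation.py or from Zindi.

```python
def minmax_score(raw, lo, hi):
    """Higher-is-better score from a raw RMSE and two bounds."""
    return 1.0 - (raw - lo) / (hi - lo)


def cohort_scores(rmses):
    """Bounds taken from the current leaderboard (cohort-dependent)."""
    lo, hi = min(rmses), max(rmses)
    if hi == lo:
        # Everyone has the same error, so there is nothing to separate.
        return [1.0 for _ in rmses]
    return [minmax_score(r, lo, hi) for r in rmses]


def fixed_scores(rmses, lo, hi):
    """Bounds fixed in advance, independent of who has submitted."""
    return [minmax_score(r, lo, hi) for r in rmses]


# Made-up leaderboard, then the same board after one extreme submission joins.
board = [1.0, 1.2, 1.5]
with_outlier = board + [6.0]

FIXED_LO, FIXED_HI = 0.5, 3.0  # hypothetical reference bounds

before = cohort_scores(board)
after = cohort_scores(with_outlier)[:3]
print("cohort, before outlier:", before)
print("cohort, after outlier: ", after)
print("fixed, before outlier: ", fixed_scores(board, FIXED_LO, FIXED_HI))
print("fixed, after outlier:  ",
      fixed_scores(with_outlier, FIXED_LO, FIXED_HI)[:3])
```

With cohort bounds the three original submissions get squeezed together once the outlier arrives; with fixed bounds their scores do not move.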

yehoshua
Heoshua

@Shannon_Sikadi, thank you for this point of view.

Do you think we could approximate this in our evaluation.py without needing the fixed reference bounds, so I can improve the leaderboard?

skaak
Ferra Solutions

Oops my mistake, apologies @yehoshua

Thanks Shannon for the detailed explanation, wow, are you AI?

skaak
Ferra Solutions

Hmmmmmm btw @yehoshua should it not perhaps be micro f1, not macro? Given the highly imbalanced labels and the single-column nature of the group stage in this comp, I'm thinking micro is the way to go here?

yehoshua
Heoshua

Okay, let me try it and see how it correlates. I think we can get a better evaluation.py once the first leaderboard update after submissions closed is out. I can implement micro F1, but how do I evaluate it?

skaak
Ferra Solutions

Yeah, your tool presents well, but it also needs some substance.

Right now you should only score on teams whose campaigns have ended: those who are not going any further and are not playing any more games, because both their number of goals and their final stage are now known. The others are still fluctuating, so you cannot really include them in the (current) score.

I think (hope) this is how Zindi will update the leaderboard, probably at the end of the group stage, round of 32 etc. The group stage is still underway, but at least some of the groups are done and some of the results are fully known by now so you could use those if you want accuracy in your scores.

It is a really complex calculation, but at least now in the knock-out stages it is a lot easier.
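
A rough sketch of that filtering idea, assuming predictions and results are plain dicts keyed by team. The function name, team names and numbers are all made up for illustration and are not taken from evaluation.py.

```python
import math


def rmse_on_finished(pred_goals, actual_goals, finished):
    """RMSE over teams whose campaign has ended; None if there are none yet."""
    keys = [t for t in pred_goals if t in finished and t in actual_goals]
    if not keys:
        return None
    squared_error = sum((pred_goals[t] - actual_goals[t]) ** 2 for t in keys)
    return math.sqrt(squared_error / len(keys))


# Hypothetical data: Team C is still playing, so its goal total is not final.
pred_goals = {"Team A": 3, "Team B": 5, "Team C": 8}
actual_goals = {"Team A": 2, "Team B": 5, "Team C": 4}
finished = {"Team A", "Team B"}

partial_rmse = rmse_on_finished(pred_goals, actual_goals, finished)
print("RMSE on finished teams only:", partial_rmse)
```

The same `finished` set could be used to restrict the stage F1 to teams whose final stage is already known.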

skaak
Ferra Solutions

To calculate micro is a real small change ... just change

    for stage in VALID_STAGES:
        tp = sum(1 for k in keys if pred[k] == stage and actual[k] == stage)
        fp = sum(1 for k in keys if pred[k] == stage and actual[k] != stage)
        fn = sum(1 for k in keys if pred[k] != stage and actual[k] == stage)
        denom = 2 * tp + fp + fn
        f1 = (2 * tp / denom) if denom else 0.0
        total += f1

into (you can do it much better, I'm just making the easiest edit to your existing code here)

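    # Single-label case: every miss is one FP (for the predicted stage) and one
    # FN (for the actual stage), so micro F1 reduces to plain accuracy.
    tp = sum(1 for k in keys if pred[k] == actual[k])
    fp = sum(1 for k in keys if pred[k] != actual[k])  # also equals the FN count
    denom = tp + fp
    f1 = tp / denom if denom else 0.0
    return f1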
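
For anyone who wants to see the two averages side by side, here is a self-contained sketch. It assumes the macro version divides the summed per-stage F1 by the number of stages (the snippet above only shows the `total += f1` part), and the stage labels and teams are made up.

```python
def macro_f1(pred, actual, stages):
    """Unweighted mean of per-stage F1 (assumed averaging step)."""
    total = 0.0
    for stage in stages:
        tp = sum(1 for k in actual if pred[k] == stage and actual[k] == stage)
        fp = sum(1 for k in actual if pred[k] == stage and actual[k] != stage)
        fn = sum(1 for k in actual if pred[k] != stage and actual[k] == stage)
        denom = 2 * tp + fp + fn
        total += (2 * tp / denom) if denom else 0.0
    return total / len(stages) if stages else 0.0


def micro_f1(pred, actual):
    """Micro F1 for single-label predictions, which equals accuracy."""
    if not actual:
        return 0.0
    correct = sum(1 for k in actual if pred[k] == actual[k])
    return correct / len(actual)


# Hypothetical, imbalanced labels: three teams out in the groups, one further.
stages = ["group", "r32"]
actual = {"A": "group", "B": "group", "C": "group", "D": "r32"}
pred = {"A": "group", "B": "group", "C": "group", "D": "group"}

print("macro F1:", macro_f1(pred, actual, stages))
print("micro F1:", micro_f1(pred, actual))
```

Predicting the majority stage for everyone scores well on micro but is pulled down on macro, because the rare stage contributes an F1 of zero with equal weight.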