⚽ Must-Read: Scoring this ... how?

World Cup 2026 Goal Prediction Challenge

$1 000 USD

Start

Jun 12, 26

Jun 19, 26

Reveal

Jul 19, 26

skaak

Ferra Solutions

Scoring this ... how?

23 Jun 2026, 14:54 · 11

Hmmm, the Info page shows

Overall Score = 0.60 × RMSE(Goals) + 0.40 × F1(Stage)

but, since a higher RMSE is bad, this should maybe be scored something like

Overall Score = 0.40 × F1(Stage) - 0.60 × RMSE(Goals)

or will the RMSE be 'normalised' the way @Yehoshua does it

1.0 - (raw - lo) / (hi - lo)

Minor detail I guess ...
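A minimal sketch of the normalised variant, assuming min-max bounds `lo` and `hi` are supplied from somewhere (the Info page does not say where they come from, and the numbers in the usage note are made up). The function names and the clipping to [0, 1] are assumptions for illustration, not Zindi's implementation.

```python
def normalise_rmse(raw, lo, hi):
    """Map a raw RMSE onto [0, 1] so that higher is better."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    scaled = 1.0 - (raw - lo) / (hi - lo)
    # Assumption: values outside the bounds are clipped to [0, 1].
    return min(1.0, max(0.0, scaled))


def overall_score(rmse, f1, lo, hi, w_rmse=0.60, w_f1=0.40):
    """Weighted average of the normalised RMSE and the F1 score."""
    return w_rmse * normalise_rmse(rmse, lo, hi) + w_f1 * f1
```

With hypothetical bounds `lo=1.0` and `hi=3.0`, a submission with RMSE 1.5 scores higher than one with RMSE 2.5 at the same F1, which is the direction the raw formula gets wrong.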

Discussion 11 answers

yehoshua

Heoshua

I think Zindi applies the normalization for every multi-metric competition, according to this post: https://zindi.africa/learn/introducing-multi-metric-evaluation-or-one-metric-to-rule-them-all

23 Jun 2026, 16:11

Upvotes 0

Shannon_Sikadi

Deep Learning IndabaX Zimbabwe

From the Zindi multi-metric explanation and my own analysis, RMSE is not used in raw form. It is normalized (bounded using a reference/starter baseline), so it becomes a comparable “higher is better” score. Then it is combined with F1 using a weighted average. So the formula applies to normalized RMSE, not raw RMSE subtraction.

So the formula becomes:

Overall = 0.60 × RMSE_normalized + 0.40 × F1

24 Jun 2026, 06:15

Upvotes 1

skaak

Ferra Solutions

Thanks Shannon - yes that sounds a lot like @yehosua code, that snippet I posted above. It seems he uses the highest and lowest RMSE scores from the LB to bound it ... is that how you understand it also?

replied to Shannon_Sikadi · 24 Jun 2026, 07:12

Upvotes 0

Shannon_Sikadi

Deep Learning IndabaX Zimbabwe

Yes, that’s my understanding as well. In that approach (@yehosua's code), the normalization bounds are derived from the current leaderboard distribution, typically using the minimum and maximum RMSE values observed. That effectively creates a dynamic min–max scaling of RMSE within the cohort. One limitation is that the normalization is cohort-dependent: the scaled score is unstable during the active competition, because updates to the leaderboard can shift the min–max range. It is also sensitive to outliers, where extreme submissions can distort the scaling range and compress differences between other models.

This differs from Zindi’s approach, which uses fixed reference bounds (e.g., baseline or starter solution). That makes it more stable and reproducible over time, as it removes the dependence on the current set of submissions as the scaling reference.

replied to skaak · 24 Jun 2026, 07:49
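The outlier sensitivity described above can be illustrated with a small sketch. All RMSE values and the fixed bounds `0.5` and `3.0` are made up, and the function names are hypothetical; neither function is taken from Zindi or from the linked evaluation.py.

```python
def min_max_score(raw, lo, hi):
    """Higher-is-better score in [0, 1] for a lower-is-better metric."""
    scaled = 1.0 - (raw - lo) / (hi - lo)
    return min(1.0, max(0.0, scaled))


def cohort_scores(rmses):
    """Scale each RMSE using the min and max of the current cohort."""
    lo, hi = min(rmses), max(rmses)
    if hi == lo:
        return [1.0 for _ in rmses]  # all submissions tie
    return [min_max_score(r, lo, hi) for r in rmses]


def fixed_scores(rmses, lo, hi):
    """Scale each RMSE using bounds that do not depend on the cohort."""
    return [min_max_score(r, lo, hi) for r in rmses]


leaderboard = [1.0, 1.5, 2.0]        # made-up RMSE values
with_outlier = leaderboard + [11.0]  # one extreme submission joins

before = cohort_scores(leaderboard)       # spread of 1.0 between best and worst
after = cohort_scores(with_outlier)[:3]   # same three models, spread of about 0.1
stable = fixed_scores(with_outlier, 0.5, 3.0)[:3]
```

With cohort bounds, the gap between the first and third model shrinks from 1.0 to roughly 0.1 once the outlier appears, while the fixed-bound scores for those three models do not change.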

Upvotes 0

yehoshua

Heoshua

That's right -:)

https://github.com/yehoshua0/zindi-world-cup-2026-bot/blob/main/wc2026bot%2Fevaluation.py

PS: It's @yehoshua 😅

replied to skaak · 24 Jun 2026, 08:11

Upvotes 1

yehoshua

Heoshua

@Shannon_Sikadi, thank you for this point of view.

Do you think we can approximate this in our evaluation.py without needing the fixed reference bounds, so I can improve the leaderboard?

replied to Shannon_Sikadi · 24 Jun 2026, 08:14

Upvotes 0

skaak

Ferra Solutions

Oops my mistake, apologies @yehoshua

Thanks Shannon for the detailed explanation, wow, are you AI?

replied to yehoshua · 24 Jun 2026, 09:59

Upvotes 2

skaak

Ferra Solutions

Hmmmmmm btw @yehoshua should it not perhaps be micro f1, not macro? Given the highly imbalanced labels and the single-column nature of the group stage in this comp, I'm thinking micro is the way to go here?
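To see how much the choice matters on imbalanced labels, here is a small sketch comparing the two averages. The team keys, stage names and predictions are made up, and the function names are hypothetical. It assumes each team has exactly one stage label.

```python
def f1_for_stage(pred, actual, stage):
    """One-vs-rest F1 for a single stage label."""
    tp = sum(1 for k in actual if pred[k] == stage and actual[k] == stage)
    fp = sum(1 for k in actual if pred[k] == stage and actual[k] != stage)
    fn = sum(1 for k in actual if pred[k] != stage and actual[k] == stage)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(pred, actual, stages):
    """Unweighted mean of the per-stage F1 scores."""
    return sum(f1_for_stage(pred, actual, s) for s in stages) / len(stages)


def micro_f1(pred, actual):
    """With one label per team, micro-F1 reduces to accuracy."""
    correct = sum(1 for k in actual if pred[k] == actual[k])
    return correct / len(actual) if actual else 0.0


stages = ["Group", "R32", "Final"]  # hypothetical labels
actual = {f"T{i}": "Group" for i in range(8)}
actual.update({"T8": "R32", "T9": "Final"})
pred = {k: "Group" for k in actual}  # lazy model: everyone exits in the groups

macro = macro_f1(pred, actual, stages)
micro = micro_f1(pred, actual)
```

The lazy all-"Group" prediction gets a micro-F1 of 0.8 but a macro-F1 of about 0.30, because macro gives the two rare stages the same weight as the dominant one.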

replied to yehoshua · 24 Jun 2026, 12:27

Upvotes 0

yehoshua

Heoshua

Okay, let me try it and see how it correlates. I think we can build a better evaluation.py once the first leaderboard update after the submission close is out. I can implement micro F1, but how do we evaluate it?

replied to skaak · 24 Jun 2026, 20:56

Upvotes 0

skaak

Ferra Solutions

Yeah, your tool presents well, but also you need some substance.

Right now you should only score the teams whose campaigns have ended: those who are not going any further and are not playing any more games, because both their number of goals and their final stage are now known. The others are still fluctuating, so you cannot really include them in the (current) score.

I think (hope) this is how Zindi will update the leaderboard, probably at the end of the group stage, round of 32 etc. The group stage is still underway, but at least some of the groups are done and some of the results are fully known by now so you could use those if you want accuracy in your scores.

It is a really complex calculation, but at least now in the knock-out stages it is a lot easier.
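A rough sketch of the "finished teams only" idea, assuming predictions and results are stored as dicts keyed by team and that the set of eliminated teams is maintained by hand. The team names, goal counts and function name are made up for illustration.

```python
import math


def rmse_on_finished(pred_goals, actual_goals, finished):
    """RMSE of goal predictions, restricted to teams whose campaign is over."""
    teams = [t for t in finished if t in pred_goals and t in actual_goals]
    if not teams:
        return None  # nothing can be scored yet
    squared = sum((pred_goals[t] - actual_goals[t]) ** 2 for t in teams)
    return math.sqrt(squared / len(teams))


pred_goals = {"A": 4, "B": 2, "C": 7}
actual_goals = {"A": 3, "B": 2, "C": 5}  # C is still playing, so 5 is provisional
finished = {"A", "B"}                    # eliminated, totals are final

partial_rmse = rmse_on_finished(pred_goals, actual_goals, finished)
```

Team C is left out because its goal tally can still change. The same filter can be applied to the stage labels before computing F1.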

replied to yehoshua · 25 Jun 2026, 06:10

Upvotes 0

skaak

Ferra Solutions

To calculate micro is a real small change ... just change

for stage in VALID_STAGES:
    tp = sum(1 for k in keys if pred[k] == stage and actual[k] == stage)
    fp = sum(1 for k in keys if pred[k] == stage and actual[k] != stage)
    fn = sum(1 for k in keys if pred[k] != stage and actual[k] == stage)
    denom = 2 * tp + fp + fn
    f1 = (2 * tp / denom) if denom else 0.0
    total += f1

into (you can do it much better, I'm just taking easiest edit of your existing here)

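# Single-label case: every wrong prediction is one false positive and one
# false negative, so micro-F1 reduces to correct / total (accuracy).
tp = sum(1 for k in keys if pred[k] == actual[k])
fp = sum(1 for k in keys if pred[k] != actual[k])
denom = tp + fp
f1 = tp / denom if denom else 0.0
return f1

replied to skaak · 25 Jun 2026, 06:30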
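As a sanity check on that shortcut, the sketch below pools TP, FP and FN over all stages (the textbook micro-F1) and compares the result with plain accuracy. The stage names and team keys are made up and the function names are hypothetical. The equality only holds when every team has exactly one label and all labels appear in `stages`.

```python
def micro_f1_pooled(pred, actual, stages):
    """Micro-F1 from TP/FP/FN counts pooled over every stage."""
    tp = fp = fn = 0
    for stage in stages:
        for k in actual:
            if pred[k] == stage and actual[k] == stage:
                tp += 1
            elif pred[k] == stage:
                fp += 1
            elif actual[k] == stage:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def accuracy(pred, actual):
    """Share of teams whose predicted stage matches the actual stage."""
    correct = sum(1 for k in actual if pred[k] == actual[k])
    return correct / len(actual) if actual else 0.0


stages = ["Group", "R16", "Final"]  # hypothetical labels
actual = {"A": "Group", "B": "Group", "C": "R16", "D": "Final"}
pred = {"A": "Group", "B": "R16", "C": "R16", "D": "Group"}

pooled = micro_f1_pooled(pred, actual, stages)
acc = accuracy(pred, actual)
```

Here two of four teams are correct, so both values come out at 0.5: the pooled counts are TP = 2, FP = 2, FN = 2.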

Upvotes 0
