My own post-mortem from 2014 is that judging was tedious. I'd love to get a review of the 2015 process and a bit more insight, as it would probably help determine whether it's a viable alternative. Care to chime in?
It was still tedious. Breaking it down so that each judge handled fewer entries was a life-saver, but at the same time the entries were, on average, a lot more complex, and a number of them took considerable time to play through.
Having clear examples of the expected quality for each judging dimension could help. I felt like I spent a lot of time re-reviewing earlier games after later ones caused my scoring criteria to shift (e.g. I'd score a game highly for graphics early on, then have to revise it down once later entries exceeded expectations).
I feel you; I vividly remember doing this myself in 2014.
The problem I see with setting a "bar" is that we just don't know whether it's a valid one. The bar could be too demanding, so that every entry lands in the 1-5 range, which wouldn't make much sense, or too low, with every game competing for that last half-point in a given field. The byproduct is that if, say, "sound" were gauged correctly but art wasn't, then audio would effectively carry more weight because its scores would be spread out (3-7 on average), while everyone would get roughly the same score for visuals (either 1-3 or 7-10), which largely undermines the validity of the process.
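For what it's worth, one way to keep a miscalibrated dimension from dominating the totals is to normalize each dimension after the fact rather than trying to get the bar exactly right up front. Here's a minimal sketch of that idea in Python; the numbers are made up, and I'm not claiming the competition does or should do this:

```python
# Hypothetical sketch: if one dimension's raw scores are tightly clustered
# (e.g. everyone gets 8-10 for visuals) while another is well spread out
# (3-7 for audio), standardizing each dimension puts them on comparable
# footing before totals are computed.
from statistics import mean, pstdev

def standardize(scores):
    """Convert raw scores for one dimension into z-scores."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:                       # every entry scored identically
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]

audio   = [3, 4, 5, 6, 7]                # well spread out
visuals = [8, 9, 9, 10, 8]               # everyone clustered at the top

print(standardize(audio))
print(standardize(visuals))
```

The trade-off is that this rewards doing better than the other entries in a dimension rather than clearing an absolute bar, so it only shifts the calibration problem rather than solving it.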
Splitting the games across all judges was done out of necessity, but it also makes the process much less "fair". I remember going through something similar in high school: we had two math teachers, and their exams differed drastically. Teacher A's exam was relatively easier, whereas Teacher B's was more demanding. To an outsider, class B (taught by Teacher B) appeared to have poorer results on average, which would suggest those students weren't as good, but as it turned out, that was incorrect.
Scores were then adjusted so that the averages of the two classes would be roughly the same, but that correction was a leap of faith. Ultimately, no one could determine whether one class was actually learning more than the other. Our assessment at the time was that Teacher B was more demanding, and that his class's average would likely have been higher had both classes taken the same exam, but there's just no way to prove that.
Long story short, if judges don't overlap sufficiently, the rankings say less about who did well than about which judge each entry happened to get.
If my understanding is correct, last year's competition mitigated this by having judges overlap (each game was scored by 3 judges, and those judges didn't work as a fixed cohort; they were instead randomly assigned games to test)?
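To make sure I understand the scheme, here's a rough sketch of that kind of overlapping assignment: every game gets 3 judges drawn from the whole pool rather than from fixed cohorts, while keeping everyone's workload balanced. Purely illustrative; I don't know how the actual assignment was done.

```python
# Assign each game to the 3 least-loaded judges, breaking ties randomly,
# so reviewers overlap across the field instead of forming cohorts.
import random
from collections import defaultdict

def assign_games(games, judges, judges_per_game=3):
    load = {j: 0 for j in judges}
    assignment = defaultdict(list)       # judge -> list of games to review
    for game in games:
        pool = sorted(judges, key=lambda j: (load[j], random.random()))
        for judge in pool[:judges_per_game]:
            assignment[judge].append(game)
            load[judge] += 1
    return assignment

games = [f"entry_{i}" for i in range(30)]
judges = ["A", "B", "C", "D", "E", "F"]
for judge, picks in assign_games(games, judges).items():
    print(judge, "reviews", len(picks), "games")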
Assuming that's the case, it implies we need a certain number of qualified judges to go through the entries and ensure a proper distribution, and that number has to scale with how many entries end up being submitted.
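The back-of-the-envelope math is straightforward: total reviews = entries x judges per game, and the judge count has to keep that per-judge load sane. The cap of 15 games per judge below is a made-up number, just for illustration:

```python
from math import ceil

def judges_needed(entries, judges_per_game=3, max_games_per_judge=15):
    total_reviews = entries * judges_per_game
    return ceil(total_reviews / max_games_per_judge)

for entries in (20, 40, 80):
    print(entries, "entries ->", judges_needed(entries), "judges")
```

So doubling the number of entries roughly doubles the number of judges we'd need to recruit, unless we're willing to raise the per-judge workload instead.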