Three Day GTG - The Results - Update Added in Comments
FlyingAl
Posts: 4,087 ✭✭✭✭✭
After a long wait, I finally had time to analyze the results.
Many of you are familiar with the three day GTG concept - I posted a coin, gave three days for forum members to guess, and then posted the answer. I used coins that had CAC approval and I personally agreed with the grade.
It came down to this - I took the median of the forum's guesses for each coin, then subtracted the correct grade from that median. The size of this error was then averaged across the 15 coins - and the forum was 0.875 grades off on average.
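For anyone who wants to reproduce the arithmetic, here is a minimal sketch in Python - the guesses below are made up and only show the shape of the calculation:

```python
import statistics

# Hypothetical guesses - one entry per coin (15 in the real dataset).
results = [
    {"correct": 64, "guesses": [63, 63, 64, 65, 62]},
    {"correct": 45, "guesses": [45, 44, 40, 50, 45]},
    # ...
]

# Error per coin: the median guess minus the correct grade, taken as
# a magnitude, then averaged across all the coins.
errors = [abs(statistics.median(c["guesses"]) - c["correct"]) for c in results]
print(f"Average grades off: {statistics.mean(errors):.3f}")
```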
How good is that?
Compared to NGC's 2024 in person grading contest, it's decent. The 159 in person graders averaged 0.72 grades off across 12 non-details coins.
PCGS's 52 in person graders at the OKC show averaged 0.52 grades off on 18 coins.
As we can see, the forum did well. Very well, in fact. The forum was about as accurate as the NGC panel of 159 graders despite having far fewer people guessing. It did not quite rise to the level of the PCGS group, which happened to do very well; that data may be an outlier in and of itself.
The ultimate question - can you grade from images? I argue the data shows that you can be almost as accurate as an in person grader if you have quality images and the experience/knowledge of image grading. However, you can draw your own conclusions.
A few things here:
1) I used medians because they are more robust (less inclined to swing based on outliers - see the sketch after this list). NGC's data in particular had significant numbers of outliers, which would make them the worst graders by a wide margin if averages alone were used. The graders on the forum were by far the most precise, but did not happen to be the most accurate. Unfortunately, it was accuracy that counted.
2) This is not a perfect experiment. I would ideally have every grader grade every coin in person and then by images, without being able to recognize the same coin. This unfortunately is just not possible, so this is what I have to work with.
3) The coins used are not the same across experiments. Some coins were ultimately harder to grade than others. That will skew the data, and unfortunately I have no control over the in person data.
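Here is the sketch promised in point #1 - a quick, made-up example of why a median shrugs off the wild guess that drags a mean around:

```python
import statistics

# One wild guess among otherwise tight ones (numbers invented):
guesses = [64, 64, 65, 63, 45]

print(statistics.mean(guesses))    # 60.2 - the outlier drags the mean down
print(statistics.median(guesses))  # 64   - the median barely notices
```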
Comments
Cool analysis.
Mr_Spud
Nice job.... That was quite interesting!
You should feed it into AI and have it crunch some more numbers.
Proud follower of Christ!
AI has a tendency to make things up.
Fake news lol
The grading contest participants and the participants on this forum are likely a completely different subset of people. The coins chosen are also different - for example, the NGC contest includes world coins, medals, and counterfeits. This isn’t really conclusive information as a result. Additionally, few if any of the participants in either experiment are professional graders, so there is a general skill cap that renders the data somewhat meaningless.
Gobrecht's Engraved Mature Head Large Cent Model
https://www.instagram.com/rexrarities/?hl=en
See my point #2. Counterfeit data was also removed from the NGC and PCGS experiments. I do agree that the differing coins would yield significantly different results.
I would say that the share of pro graders in any one of these experiments is fairly similar (maybe 5-10% in each). I don't think that played a big factor here, but I would like to see how it would play out if we ran this experiment with a grading team.
One of the reasons I quit participating was because, in truth, I was just giving off-the-cuff responses. When I learned the idea was to try and generate some kind of meaningful data, I had to make a choice: I could put in the kind of analysis I do when making purchases of my own, or I could bow out. GTG exercises have always been just-for-the-fun-of-it exercises for me, and I decided to leave it that way. I do hope the original poster and other participants gained what they wanted. James
If you ran it with a grading team, it would be a very simple test: does the grader’s grade change when grading the coin from a photo vs after seeing in hand? The changes might not be extreme, but I can assure you that the answer will be “yes” enough that it matters.
I also don’t think this is useful data without the inclusion of counterfeits and details grades. Assigning a numerical grade is the easiest part of grading. If you’re buying a raw coin from photos, there is no situation in which you would not consider the possibility of a details grade or a counterfeit, because that would significantly affect the value. Getting the numerical grade correct from a photo when only numerical grades are possible is far less difficult than a real-world situation, and is not conclusive about how well a coin can be “graded” from a photo, because it’s only being partially graded.
Gobrecht's Engraved Mature Head Large Cent Model
https://www.instagram.com/rexrarities/?hl=en
I think that, more and more over time, photogenic properties will become as important as, if not more important than, how coins look in hand. As photography becomes easier for the masses, enlarged pictures of coins will be what people show off to other collectors on social media and forums. I’m not talking about trick photography to hide flaws, nor about deliberately highlighting minor hairlines and problems, but, just like eye appeal, if a coin photographs well it will influence grading and pricing more and more.
I know for me, I used to be able to grade coins in hand very well with no need for magnification because I was extremely nearsighted. The eye doctor told me I had the equivalent of a 5x lens right about 6 inches from my eye with no glasses on. I could also cherrypick varieties really easily. But after cataract operations I’m now farsighted and have to use magnification to grade coins in hand, and it’s not the same, because I used to see the details of the whole coin with my naked eye, but with magnification I can only see parts of the coin well enough and have to move the magnifying lens around. But with a well-lit photograph magnified on my iPad I can see the whole coin like I used to, and can see the overall eye appeal as well as the details/flaws like I used to with just my eye. So photographs are easier for me. I’ve also done so much photography and editing that my mind can interpret things about photos, and I can usually tell what the coin will look like even when the photos are sub-par. I can usually tell when photographers do tricks to hide stuff too, on a good day when the floaters aren’t acting up 🌞
I’m also thinking that as more young people enter the hobby, people who grew up teething on their iPhones and have been on social media since kindergarten, showing off photos of eye-appealing coins will be the only thing they know. The more “likes” they get on their coin photos, the more the reward centers of the brain are stimulated, and the more value the coins will have.
So that’s my take on the whole subject.
Mr_Spud
I agree. Unfortunately, I have yet to image large amounts of counterfeits, which makes that testing impossible for the time being.
I need to keep tabs on it to get a feel for whether I am too tough or too loose.
Or, as another possibility, whether you're both too tough and too loose.
Mark Feld* of Heritage Auctions*Unless otherwise noted, my posts here represent my personal opinions.
I forgot one more statistic, which was really dumb of me because it's rather important.
The standard deviations of the grades assigned by the graders were:
Images: 2.751
PCGS contest: 4.523
NGC contest: 6.977
Standard deviation measures how much each guess deviates from the average of the data, on average. Essentially, it is a measure of the precision of a group of guesses. In this data, the values are skewed upwards by the number of circulated coins (i.e., a group grading more circulated coins will show a higher standard deviation). The image graders graded the most circulated coins.
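For the curious, this is the population standard deviation; a small sketch of the computation, using invented guesses for one coin:

```python
import math

def std_dev(grades):
    # Population standard deviation: the typical distance of each
    # guess from the group's average guess.
    mean = sum(grades) / len(grades)
    return math.sqrt(sum((g - mean) ** 2 for g in grades) / len(grades))

print(std_dev([63, 64, 64, 65, 62]))  # ~1.02 for these invented guesses
```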
Adjusting the standard deviations to cover only the uncirculated coins produces the following:
Images: 1.127
PCGS in person: 3.783
NGC in person: 5.996
Essentially, the image graders are 1.57 times less accurate than the PCGS graders, but 3.356 times more precise. What this means is that the image graders more consistently arrive at a consensus grade (they agree with each other), whereas the in person graders arrive at a grade via large variation (they don't agree with each other, but happen to be "right" in the end).
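To make the accuracy/precision distinction concrete, here is a toy example with invented guesses - a tight-but-biased "image" group versus a scattered-but-centered "in person" group:

```python
import statistics

correct = 64  # invented "true" grade for one coin

# Invented guesses showing the two profiles described above:
image_graders    = [62, 62, 63, 62, 63]  # tight cluster, centered off the mark
inperson_graders = [58, 70, 64, 60, 68]  # scattered, centered on the truth

for name, guesses in [("images", image_graders), ("in person", inperson_graders)]:
    print(name,
          "median:", statistics.median(guesses),             # accuracy
          "std dev:", round(statistics.pstdev(guesses), 2))  # precision
```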
If we scaled down this experiment by choosing five random graders from each dataset, the image graders would likely be the "best" team. This is because reducing team size removes the advantage the in-person graders have of being able to arrive at the correct grade via variation.