PSA Cert Number Search

Recently a card I am interested in was graded and certified a 10 by PSA. I checked the PSA pop report for the 2024 Topps Black and White Rookie Resolution set and the card is listed as having been graded. I would like to pull the card up, but I don't have the cert number. Is there any way to search for the PSA cert number without having seen the graded card itself?
Comments
You'd think someone would have embarked on a major scrape project by now: programmatically feeding every cert # into the PSA lookup and building a corresponding searchable database.
That's my thought as well.
What is the card number?
It's tricky to scrape the data. While the ToS says you cannot, the 9th Circuit has ruled that you can, provided you don't degrade services. But you cannot promote or provide access to the data, so no one could do it to share. PSA should provide an API, but I suspect their system architecture needs a massive update. It's an enormous amount of data: 45TB just for text and almost 240TB of image data.
I'm curious, where did you get those TB totals?
Huh, how do you get 45TB just for the text (HTML) data? Even capping it at cert 100,000,000 and down, at the 150-161KB I'm getting per saved HTML file, that works out to roughly 15TB, and the HTML is the only thing the "scraper" would pull.
There isn't an html file for each cert.
I mean the HTML output from each query. Or page source...
I don't know what you're using to pull the data. Perhaps curl or wget or maybe just using an interface in your browser (which I suspect, because it's easy and the other stuff is hard). I don't know what page or pages you're considering. I don't know if you've thought about indexing the data, or otherwise making it "searchable" and I don't know if you have the expertise to know how to make that efficient for a set of almost a billion objects.
When I said it was 45TB for only the text data, I meant the text as stored, plus some elements of the text indexed for search: cert, year, brand, sport, card number, player, variety, card grade, auto grade (note: it's also necessary to store these as strings because they have changed over the years, and you get them as strings from the scrape anyway), qualifiers, primary signer.
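For a sense of the shape of that base record, here is a minimal Python sketch; the field names just mirror the list above, and everything stays a string for the reasons given:

```python
from dataclasses import dataclass

# Sketch of the per-cert base record described above. All values are kept
# as strings because grade labels and qualifiers have changed over the
# years and arrive as strings from the scrape anyway.

@dataclass
class CertRecord:
    cert: str
    year: str
    brand: str
    sport: str
    card_number: str
    player: str
    variety: str
    card_grade: str       # e.g. "GEM MT 10" - text, not a number
    auto_grade: str
    qualifiers: str
    primary_signer: str
```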
That is the base data. If you want to have auction results and/or pop count you have to pull more HTML and parse. So I'm telling you only what the size on disk is for the data, and this doesn't even include anything from the past 10 months because I have not been running my spiders for a while - I actually started running out of disk space capturing the secure scan images and shut it down while I spec out a new disk array.
I don't actually know how large all of the HTML data is, but I did capture the raw HTML for a while when I was developing this software so that I could go back and re-parse data if I learned there was a variation - and there are a bunch. That raw HTML is stored as ASCII hexadecimal; a base-page sample I just pulled is 894kB, so divide that in half for the binary size. How data sits on disk is a little different, though, and depends on what is stored where - the relational vs. non-relational aspect of disk usage. Store your data in a relational database - think rows and columns - and you have to size your columns to the maximum object size. Store it in a non-relational store - think files on disk - and block sizes add overhead to every file. Want to make that searchable? Now you multiply that pain across many more files.
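Since a few size estimates are flying around, here is a quick back-of-envelope in Python; the cert count and per-page sizes are loose assumptions taken from this thread, not anything official:

```python
# Rough sizing using numbers mentioned in this thread; both per-page sizes
# and the cert count are assumptions, so treat the totals as illustrations.

CERT_COUNT = 450_000_000                    # on the order of half a billion certs

def total_tb(kb_per_cert: float) -> float:
    return CERT_COUNT * kb_per_cert / 1024**3    # KB -> TB

print(f"at 150 KB/page (saved-file size): ~{total_tb(150):.0f} TB")
print(f"at 447 KB/page (894 kB hex / 2):  ~{total_tb(447):.0f} TB")
# Roughly 63 TB and 187 TB respectively - before indexes, duplicated/parsed
# fields, per-file block overhead, or any image data.
```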
I don't run this on your Mom's Dell laptop that she uses for Instacart either. This takes significant compute resources.
Regardless, I don't think I have captured any of that set as graded, since they were likely released after I stopped collecting certs. I searched for the terms black, white, rookie, and resolution and didn't find any from this 2024 set.
I didn't do this to monetize the data in any way. I did it for a simple, perhaps silly, reason. I built an app that monitored the cert IDs and, when a new one became available, displayed it. So many cards are graded that even when I left each one up for only 5 seconds at a time (initially I had them up for 60s), I was falling well behind the grading pace. So I had to add filters like "autograph" or "vintage" or "not pokemon" so that I could stay somewhat up to date. Why doesn't PSA have this...
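In case it helps anyone picture it, that ticker loop is roughly the following shape; fetch_new_certs() and the filter logic are placeholders for illustration, not anything PSA provides:

```python
import time

# Sketch of the "show each new cert as it's graded" ticker described above.
# fetch_new_certs() is a stand-in for whatever store the scraper feeds.

DISPLAY_SECONDS = 5       # started at 60s, cut to 5s to try to keep up

def fetch_new_certs() -> list[dict]:
    """Placeholder: return cert records added since the last poll."""
    return []

def passes_filters(cert: dict) -> bool:
    # Example filters from the post above: autograph-only, skip Pokemon.
    if "pokemon" in cert.get("brand", "").lower():
        return False
    return bool(cert.get("auto_grade"))

while True:
    for cert in fetch_new_certs():
        if passes_filters(cert):
            print(cert.get("cert"), cert.get("player"), cert.get("card_grade"))
            time.sleep(DISPLAY_SECONDS)
    time.sleep(1)
```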
Anyways - sorry OP. I can't find the cards, but maybe someday. In the interim, look at the PSA Set Registry and the PSA APR; maybe some will have sold. Then you can get a cert ID, and from there I would manually scan forward and backward to see if the same submission included others.
Just put https://www.psacard.com/cert/91597648 in your browser and then increment or decrement the cert ID. Add "/psa" to the end to avoid cert IDs that also exist on the DNA-only side.
This particular one doesn't appear to yield anything useful. As you find cert IDs for the series on eBay or elsewhere, search them as well. You might find it that way. Good luck - sorry I couldn't help.
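If you'd rather not click through by hand, here is a minimal sketch of that walk, assuming the /cert/<id>/psa URL pattern above; the string match is a guess, so still eyeball each page, keep the window small, and mind the terms of use:

```python
import time
import requests

# Minimal sketch of walking the neighboring cert IDs around a known cert,
# using the /cert/<id>/psa URL pattern from the post above. This is for
# spot-checking a handful of IDs, not bulk scraping; go slowly and respect
# the site's terms of use.

SEED_CERT = 91597648      # the example cert linked above
WINDOW = 10               # how far to look in each direction

for cert_id in range(SEED_CERT - WINDOW, SEED_CERT + WINDOW + 1):
    url = f"https://www.psacard.com/cert/{cert_id}/psa"
    resp = requests.get(url, timeout=30)
    # The exact on-page text is an assumption - adjust the match, or just
    # open the URL in a browser and look.
    if resp.ok and "Rookie Resolution" in resp.text:
        print("possible hit:", url)
    time.sleep(2)         # be polite between requests
```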
All I did was search a cert and then save the resulting page as HTML. I obviously didn't bother to read the code, but I'd figure some script would look for the category text string or something for the cert #, Sport, Set, Year, etc. Remember that all image tags and multimedia links would be ignored, since this is just a text scrape. As for the form and data size once all the scraped data is in hand, that is way over my head, but I didn't think a simple SQL db or whatever the standard is now (some sort of glorified Excel spreadsheet) would blow up in size for fewer than 10 cells per cert.
But what do I know, I'm not a software engineer, but I did stay at a Holiday Inn Express last night.....
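For what it's worth, pulling those fields out of a saved cert page looks roughly like this; the labels and page structure here are assumptions (the markup isn't documented and has changed over time), so expect to adjust the selectors:

```python
from bs4 import BeautifulSoup

# Rough sketch of parsing a saved cert page for a few basic fields. The
# label names and the "label next to its value" layout are assumptions
# about the page markup.

LABELS = ["Cert Number", "Year", "Brand", "Sport", "Card Number",
          "Player", "Variety", "Grade"]

def parse_cert_page(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for label in LABELS:
        node = soup.find(string=lambda s, lab=label: s and s.strip() == lab)
        if node is None:
            continue
        parent = node.find_parent()
        if parent is None:
            continue
        value_cell = parent.find_next_sibling()
        if value_cell is not None:
            record[label] = value_cell.get_text(strip=True)
    return record

# Usage: parse a page previously saved from the browser.
with open("cert_page.html", encoding="utf-8") as fh:
    print(parse_cert_page(fh.read()))
```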
The HTML is pretty easy to parse, and I don't think the size is really why someone hasn't done it. Kind of sucks to have gone so far down the rabbit hole on a moot subject, though.
The main item is legality. You cannot reproduce the data.
Second, it's the difficulty. Headless, ECMAScript-capable HTTP/HTML clients are tough; I know a handful of people who can reliably bot Cloudflare. I understand that it seems like it should be easy because you just GET the URI and parse the data, but that's the issue: try to automate that. It's tougher than you might think. I recently saw a project to defeat Cloudflare that was controlling a real browser with a program, basically sending the mouse clicks. That's how the know-nothings are getting by.
Then it’s the size of the data.
The smallest you could make the data would be flat files, but that's tough; you would use some Bigtable-style solution for that. I just used a relational database because I could get faster access using a big-data indexed layout strategy. This isn't really a relational database - it's a relational warehouse. You don't just throw 450 million rows in a table; I first partitioned by cert series, 0-9. You have to profile the indexed paths and then tune your database, and you end up with quite a bit of data replication to allow much faster access. You don't want to wait 5 minutes querying half a billion rows to find every %Mantle%. Oh, and you can't just store "Mickey Mantle", because you need a case-insensitive search option, and you can't just tolower() it each time because that's extra cycles for every row it has to check. Because what good is all the data if you can't use it?
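To make that last point concrete, here is the "store a pre-lowered, indexed copy of the name" trick in a toy SQLite form; the table, row, and names are made up for illustration, and the real thing would be a partitioned warehouse, but the principle is the same:

```python
import sqlite3

# Toy illustration of the pre-lowered search column described above.
# SQLite stands in for the real partitioned warehouse; the row is made up.

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE certs_series_0 (      -- one table per leading cert digit, per the post
    cert          TEXT PRIMARY KEY,
    player        TEXT,            -- as displayed, e.g. 'Mickey Mantle'
    player_lower  TEXT             -- lowered once at load time
);
CREATE INDEX idx_series_0_player_lower ON certs_series_0(player_lower);
""")

conn.execute(
    "INSERT INTO certs_series_0 VALUES (?, ?, ?)",
    ("00000001", "Mickey Mantle", "mickey mantle"),   # fabricated example row
)

# Query the pre-lowered column: no tolower()/LOWER() call per row at query
# time. (The index helps prefix searches; a leading % wildcard still scans.)
rows = conn.execute(
    "SELECT cert, player FROM certs_series_0 WHERE player_lower LIKE ?",
    ("%mantle%",),
).fetchall()
print(rows)
```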
What if you have never agreed to the terms of service?
It’s established that you agree by using the service.
Genuine thanks for the detailed answer to my 1 sentence question
What is established, and by what? You can use the pop report without even logging in. You don't have to read anything, sign anything, or click a box or anything like that. If anything were established just by using services like that, companies like Google would be out of business.
The way I would look at it, from a legal perspective, and this has been tested, is that simply using a website implies acceptance of terms and conditions (terms of use). I believe the only requirement is that it's clearly stated on the website. Having a "Terms and Conditions" that states what you're accepting by using the site meets the criteria.
I don't agree or disagree with them because I don't technically care. I don't care because the courts have also established that someone must be served with notice that they're violating the Terms and Conditions before action can be taken against anyone allegedly violating said Terms and Conditions.
Now you know. You might disagree but there's a reality that exists outside of our personal wants and desires.
https://www.psacard.com/termsandconditions
I haven't read it very carefully recently. The trick to understanding the law is developing the ability to suspend disbelief.
The news seems to me to be saying it's a gray area. All of these AI companies are training their AI models by scraping databases just like this. The news is saying it's a big problem for smaller companies who are being bypassed by sites like Google who are scraping databases left and right and representing the data as their own in search results.
How you approach it is up to you. Search and generative AI is perhaps a different topic.
I would just recommend that anyone considering scraping data review the terms and conditions. If those terms prohibit a use of the data that they intend to disregard anyway, they should weigh the consequences and the terms' validity for themselves. I am only sharing my opinion and what I understand to be the current legal opinion on that particular topic. There's always a chance I'm wrong about something, so check my work!
Regarding the nexus of search and generative AI... I think many find it a useful - sometimes amusing - tool. I don't see Google representing the data as their own. They advertise it as "information from multiple sources," which they present as an overview. You know - like what we used to do when researching something before AI: check multiple sites and information sources and synthesize, right? I think we all know a few people who could be upgraded by generative AI with a search engine at its disposal. I don't see this as bad, and the web of usefulness that search engines like Google provide is invaluable to us all. I don't really know what the news is saying about how that's hurting smaller companies. I imagine it could in some ways, but that's not a new thing.
Google is a bit tricky to deal with. They have a monopoly. A year ago I would have said I always use DuckDuckGo, but I really do appreciate Google's AI overview. You get all the links broken down into relevant sections. Learning was never so easy. I had encyclopedias; my kids have a robot!