Add to bibliography update a check of every URL if broken vs. accessible. #337
MattHeffron wants to merge 2 commits into
Sounds like the next step is to separate the return values and, based on that, decide whether it's a paywall or a broken link. We'll also need to be careful that we handle 429s correctly and back off on the number of requests we're making if needed. Might want to add a flag to allow skipping the check; I imagine it takes a while.
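One possible shape for that triage, sketched in Python for brevity rather than in the Perl of bibSplit.pl. The function names, the category labels, and the choice of which status codes count as "restricted" versus "broken" are assumptions for illustration, not anything settled in this PR.

```python
from typing import Optional

# Hypothetical grouping of HTTP status codes; the exact buckets are an assumption.
RESTRICTED = {401, 402, 403}   # paywall, login wall, or bot protection
BROKEN = {404, 410}            # the resource is most likely gone


def classify_status(status: int) -> str:
    """Map a final HTTP status code to a coarse link category."""
    if 200 <= status < 400:
        return "ok"
    if status == 429:
        return "rate_limited"
    if status in RESTRICTED:
        return "restricted"
    if status in BROKEN:
        return "broken"
    return "unknown"


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retrying after a 429.

    Honours a numeric Retry-After header when present; otherwise falls
    back to capped exponential backoff (base * 2**attempt). A Retry-After
    given as an HTTP date is not parsed here and uses the fallback.
    """
    if retry_after is not None:
        try:
            return max(0.0, min(float(int(retry_after)), cap))
        except ValueError:
            pass  # not a plain integer, e.g. an HTTP date
    return min(base * (2 ** attempt), cap)
```

Only "broken" would be reported as a dead link under this scheme; "restricted" and "unknown" would be listed separately for a manual look. A skip flag would simply bypass both functions.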
…ning why they're getting errors from curl, but (most) not when entered in a browser.
Here are the bibSplit.err and .hdr files generated by the bibSplit.pl I just committed. Since the errors are mostly 403s, I had hoped that adding the
I'm coming to the conclusion that URL testing is doomed to fail, or needs significant revisions to be valuable. I've been playing around with curl and comparing it to directly accessing the website. I was using this site as a test case: It works if I access it via a browser. It gets to the final website via two redirects:
Testing the starting URL with The problem, from examining the returned headers, is that curl fails a challenge from Cloudflare: basically they are trying to prevent bots from accessing their site, and curl falls into the bot category. I asked Claude to interpret the resulting headers and its assessment was: Given anything I did with It also suggested we could adjust our handling of 403s based on the server, if
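A rough sketch of what server-aware 403 handling might look like, again in Python rather than Perl. It assumes the response headers have already been parsed into a dictionary and that a `Server: cloudflare` header is a good-enough hint of bot protection; the function name and both labels are hypothetical.

```python
from typing import Mapping


def classify_403(headers: Mapping[str, str]) -> str:
    """Decide how to report a 403 based on which server answered.

    Returns "bot_blocked" when the response appears to come from
    Cloudflare (the link may well work in a browser), else "forbidden".
    """
    # Header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    server = lowered.get("server", "").lower()
    if "cloudflare" in server:
        return "bot_blocked"
    return "forbidden"
```

A "bot_blocked" result says nothing about whether the page exists, so it is probably best reported as "could not verify" rather than as broken.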
Initial attempt. Too many false positives: links reported as broken when they are not.
I tried using the LWP Perl library. It was faster than calling curl, but it gave even more false positives.
It may be having trouble with redirects.