I recently wrote a Python bot that tweets about bioRxiv preprints as soon as they reach the top 10% of Altmetric attention scores for bioRxiv. To check that I made the right choices there, I want to analyse the Altmetric scores, and their history, of all bioRxiv preprints.
As far as I know, the bioRxiv database doesn’t have a convenient API, but I could get a list of all DOIs from Jordan ‘OmnesRes’ Anaya, the author and maintainer of PrePubMed, a searchable database of preprints from various servers. The script I used to query Altmetric is added at the bottom of this post. I executed the code as:
python altmetric-query.py biorxiv-dois.txt
Because the DOI file is passed as an argument rather than hardcoded in the script, I could also do quick tests with bash “process substitution” to query just a few DOIs and see if the code is working:
python altmetric-query.py <(head -n 5 biorxiv-dois.txt)
An alternative would be to have the script read from stdin:
head -n 5 biorxiv-dois.txt | python altmetric-query.py
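Supporting both invocations takes only a few lines. The following is a minimal sketch, not the script from this post; `read_dois` and `load_dois` are hypothetical names, and it assumes one DOI per line:

```python
import sys


def read_dois(lines):
    """Strip whitespace and skip blank lines; accepts any iterable of strings."""
    return [line.strip() for line in lines if line.strip()]


def load_dois(argv):
    """Read DOIs from the file named in argv[1], or from stdin if no file is given."""
    if len(argv) > 1:
        with open(argv[1]) as handle:
            return read_dois(handle)
    return read_dois(sys.stdin)


# In a script this would be called as: dois = load_dois(sys.argv)
```

Process substitution hands the script an ordinary path, so the file branch covers that case as well.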
The code starts by writing out the header for the extracted metrics and then queries the Altmetric database in a loop. I left plenty of time (10 seconds) between API calls to make sure I didn’t hammer their server. As a result the script took rather long to run, but since it was not urgent this was not a real problem.
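The header-then-loop structure could look roughly like this. It is a sketch under assumptions rather than the actual script: the endpoint in `API_URL` is my understanding of Altmetric’s public v1 DOI lookup, the `score` key and the two-column output are illustrative, and `fetch_record` and `query_all` are hypothetical names.

```python
import csv
import json
import time
import urllib.request

# Assumed endpoint for Altmetric's public v1 DOI lookup.
API_URL = "https://api.altmetric.com/v1/doi/"


def fetch_record(doi):
    """Fetch the JSON record for one DOI and return it as a dictionary."""
    with urllib.request.urlopen(API_URL + doi) as response:
        return json.load(response)


def query_all(dois, out, fetch=fetch_record, delay=10, sleep=time.sleep):
    """Write a header row, then one tab-separated row per DOI, pausing between calls."""
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["doi", "score"])
    for doi in dois:
        record = fetch(doi)
        writer.writerow([doi, record.get("score", 0)])
        sleep(delay)  # be polite to the server


# Example call (performs real requests): query_all(dois, sys.stdout)
```

Passing `fetch` and `sleep` as parameters makes it possible to try the loop without touching the network or waiting 10 seconds per DOI.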
Annoyingly, the returned dictionary did not always contain the same keys in the “cited_by” categories. To get an idea of which fields in this category matter most, I had a look at the keys returned for the ExAC preprint; my guess is that such a widely shared preprint covers the most important channels. When a field was not returned by the Altmetric API, I assumed its value to be 0.
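Treating a missing field as 0 is a one-liner with `dict.get`. In this sketch the field list is illustrative, not the exact set taken from the ExAC record, and `extract_counts` is a hypothetical name:

```python
# Illustrative "cited_by" fields; the real list would come from inspecting a well-covered record.
CITED_BY_FIELDS = [
    "cited_by_tweeters_count",
    "cited_by_fbwalls_count",
    "cited_by_feeds_count",
    "cited_by_msm_count",
]


def extract_counts(record, fields=CITED_BY_FIELDS):
    """Return one count per field, using 0 when the API omitted the key."""
    return [record.get(field, 0) for field in fields]
```

With a fixed field list every output row has the same columns, whatever keys a particular record happens to contain.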
I added a bit of error handling. It turned out not to be necessary, but it’s always better to be safe than sorry.
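One way such error handling might look, purely as an illustration and not the handling in the actual script: it assumes the API answers with HTTP 404 for a DOI it has no data on, and treats that case as an empty record so that every count defaults to 0. `fetch_or_empty` is a hypothetical name, and `fetch` is any function that takes a DOI and returns a dictionary.

```python
import urllib.error


def fetch_or_empty(doi, fetch):
    """Call fetch(doi); return {} when the server reports 404, re-raise anything else."""
    try:
        return fetch(doi)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            # Assumption: 404 means "no attention data for this DOI".
            return {}
        raise
```

Re-raising other errors keeps real problems, such as rate limiting or a network outage, visible instead of silently recording zeros.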
The analysis of the data is for another post!