
A Twitter bot to find the most interesting bioRxiv preprints

TLDR: I wrote a Twitter bot to tweet the most interesting bioRxiv preprints. Follow it to stay up to date on recent preprints that receive a lot of attention.

The past few months have seen an explosion of life science research papers that are first posted as a preprint. Statistics about preprints can be found on PrePubMed, from which I also took the image shown below. The majority of preprints are clearly on bioRxiv, with more than 1000 submissions each month.

[Image: monthly preprint submission statistics per server, via PrePubMed]

The idea of posting your article as a preprint is quite simple: get your results out as fast as possible, without waiting months or years for publication in a glamorous journal. And posting on a preprint server doesn’t prevent you from publishing in that glamorous journal afterwards, as illustrated by the growing list of journals that accept preprints. Preprints are not peer reviewed, so perhaps you should be just a bit more critical when reading them. Recent developments such as Oxford Nanopore sequencing and CRISPR genome editing show why preprints are necessary: technological advances happen so fast that by the time your article is through peer review and published, your results may already be outdated. Some perspectives on preprints have been discussed in blog posts such as What’s up with preprints?, Biology’s roiling debate over publishing research early and The selfish scientist’s guide to preprint posting, among many others.

This post is not meant to discuss all the merits of preprints, but rather how to stay on top of this enormous flow of (perhaps) interesting articles. Because obviously, that volume is getting hard to follow, and it is even harder to find the really interesting articles among the 1000 submissions every month. It’s a needle-in-a-haystack problem. Enter Altmetric: altmetrics roughly tell you how much attention a paper or preprint received, by keeping track of all mentions and shares on Twitter, blogs, news articles and so on.
So the hypothesis is: a preprint which gets a lot of attention (e.g. shares on Twitter) is probably a promising preprint worth reading. Based on this I wrote my Twitter bot; a small sketch of what an Altmetric query looks like follows below.
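
To make that concrete, here is a minimal sketch of asking Altmetric’s public v1 API about a DOI directly. This is not the code the bot uses (the scripts below rely on the altmetric Python package instead), and while the endpoint is Altmetric’s documented v1 API, treat the exact JSON fields as an assumption:

import requests


def altmetric_attention(doi):
    # Hypothetical helper: query Altmetric's public v1 API for a DOI
    resp = requests.get("https://api.altmetric.com/v1/doi/{}".format(doi))
    if resp.status_code == 404:
        # Altmetric has no record of this DOI (yet)
        return None
    resp.raise_for_status()
    return resp.json().get("score")  # the overall attention score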

Essentially, my bot will:

  • Monitor the preprints submitted to bioRxiv
  • Check their altmetric score
  • Tweet about the preprint if it scores in the top 10% of altmetric scores within bioRxiv

To make sure the news is still hot when you get it, this check is only performed during the first week after submission; that one-week window is a parameter I will have to evaluate.

The code to achieve this is split into two scripts:

  1. getPreprintsAndSave.py will run hourly, check the bioRxiv RSS feed and save articles to a file
  2. checkScoreAndTweet.py will run daily, check the altmetric score of the saved articles, tweet if the score is high enough, and remove old entries from the file

Both scripts are added at the bottom of this post. I use my Raspberry Pi 3 to run these scripts using a crontab (see the sketch below). I have included some logging to figure out what goes wrong if my bot doesn’t behave as expected. I’ve been thinking about the best way to store my data, but in the end I just went with a flat text file containing doi, URL, title and date. Perhaps something else, such as an SQLite database or a pickled list/dictionary, might be better. Feel free to suggest how I can improve this!
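
For reference, a minimal sketch of what the crontab entries could look like; the paths match those hard-coded in the scripts below, but the exact schedule (hourly RSS checks, one score check each morning) is an assumption you would adapt:

# m h dom mon dow   command
# hourly: fetch new preprints from the bioRxiv RSS feed
0 * * * * python3 /home/pi/projects/PromisingPreprint/getPreprintsAndSave.py
# daily: check altmetric scores and tweet
0 8 * * * python3 /home/pi/projects/PromisingPreprint/checkScoreAndTweet.py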


# Frequently check RSS, save results to database
# Part of PromisingPreprint
# wdecoster
import feedparser
from os import path
import logging


def checkRSS(dois_seen, dbf):
    '''
    Check the RSS feed of bioRxiv and for each paper, save doi, link and title
    Except if the doi was already seen, then don't duplicate
    '''
    feed = feedparser.parse("http://connect.biorxiv.org/biorxiv_xml.php?subject=all")
    if 'bozo_exception' in feed:
        logging.error("Failed to reach the feed")
    else:
        for pub in feed["items"]:
            if pub['dc_identifier'] not in dois_seen:
                save2db(pub['dc_identifier'], pub["link"], pub["title"], pub['updated'], dbf)


def save2db(doi, link, title, date, dbf):
    '''
    Save article metadata to the database
    Add status 'notTweeted', since this publication is new
    '''
    with open(dbf, 'a') as db:
        db.write("{}\t{}\t{}\t{}\t{}\n".format(doi, link, title, date, 'notTweeted'))


def readdb(dbf):
    '''
    Return all DOIs in the database file
    '''
    if path.isfile(dbf):
        with open(dbf, 'r') as db:
            return [i.split('\t')[0] for i in db]
    else:
        logging.warning("Unexpected: Database not found")
        return []


def main():
    try:
        db = "/home/pi/projects/PromisingPreprint/preprintdatabase.txt"
        logging.basicConfig(
            format='%(asctime)s %(message)s',
            filename="/home/pi/projects/PromisingPreprint/getPreprintsAndSave.log",
            level=logging.INFO)
        logging.info('Started.')
        dois_seen = readdb(db)
        checkRSS(dois_seen, db)
        logging.info('Finished.\n')
    except Exception as e:
        logging.error(e, exc_info=True)
        raise


if __name__ == '__main__':
    main()
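
After this script has run, every line in preprintdatabase.txt is one tab-separated record: doi, link, title, date and status. A hypothetical example record (the DOI, URL and title are made up, and I am assuming the feed’s date field looks like 2016-09-01, which is what the second script parses):

10.1101/012345	http://biorxiv.org/content/early/2016/09/01/012345	A hypothetical preprint title	2016-09-01	notTweeted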


## Periodically query Altmetric for all entries in the database younger than 1 week
## As soon as the threshold (a journal percentile of at least 90, i.e. the top 10% within bioRxiv) is passed, tweet about the article
# Part of PromisingPreprint
# wdecoster
from altmetric import Altmetric, AltmetricHTTPException
from os import path
from time import sleep
from datetime import datetime
from secrets import *  # consumer_key, consumer_secret, access_token, access_secret
import tweepy
import logging
import argparse


def queryAltmetric(doi):
    '''
    Check the altmetric journal percentile score of the publication
    '''
    a = Altmetric()
    sleep(2)
    try:
        resp = a.doi(doi)
        if resp:
            return resp["context"]['journal']['pct']  # Percentile of attention within this journal
        else:
            return 0
    except AltmetricHTTPException as e:
        if e.status_code == 403:
            logging.error("You aren't authorized for this call: {}".format(doi))
        elif e.status_code == 420:
            logging.error("You are being rate limited, currently {}".format(doi))
            sleep(60)
        elif e.status_code == 502:
            logging.error("The API version you are using is currently down for maintenance.")
        elif e.status_code == 404:
            logging.error("Invalid API function")
        return 0


def tweet(message, api, dry):
    '''
    Tweet the message to the api, except if dry is True, then just print
    '''
    if dry:
        print(message)
    else:
        api.update_status(message)
        sleep(2)


def cleandb(currentlist, alreadyTweeted, dbf):
    '''
    Using the stored list of entries, keep only the articles younger than 1 week
    and save those to the same file (overwrite)
    Entries which were already tweeted are marked as such, so they are not tweeted twice
    '''
    currentTime = datetime.now()
    with open(dbf, 'w') as db_updated:
        for doi, link, title, date, status in currentlist:
            if (currentTime - datetime.strptime(date.strip(), "%Y-%m-%d")).days <= 7:
                if doi in alreadyTweeted:
                    db_updated.write("{}\t{}\t{}\t{}\t{}\n".format(doi, link, title, date, "tweeted"))
                else:
                    db_updated.write("{}\t{}\t{}\t{}\t{}\n".format(doi, link, title, date, "SeenNotTweeted"))


def readdb(dbf):
    '''
    Get all saved entries from the database file
    '''
    if path.isfile(dbf):
        with open(dbf, 'r') as db:
            return [i.strip().split('\t') for i in db]
    else:
        return []


def getArgs():
    parser = argparse.ArgumentParser(
        description="Check the altmetric score of preprints in the database and tweet if above the cutoff.")
    parser.add_argument("-d", "--dry", help="Print instead of tweeting", action="store_true")
    return parser.parse_args()


def setupTweeting():
    '''
    Set up the tweeting api using the keys and secrets imported from secrets.py
    '''
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    return tweepy.API(auth)


def main():
    try:
        args = getArgs()
        db = "/home/pi/projects/PromisingPreprint/preprintdatabase.txt"
        logging.basicConfig(
            format='%(asctime)s %(message)s',
            filename="/home/pi/projects/PromisingPreprint/checkScoreAndTweet.log",
            level=logging.INFO)
        logging.info('Started.')
        api = setupTweeting()
        currentlist = readdb(db)
        tweeted = []
        for doi, link, title, date, status in currentlist:
            if status == 'tweeted':
                tweeted.append(doi)
                continue
            pct = queryAltmetric(doi)
            assert 0 <= pct <= 100
            if pct >= 90:
                tweet("{}\n{}".format(link, title), api, args.dry)
                tweeted.append(doi)
        cleandb(currentlist, tweeted, db)
        logging.info('Finished.\n')
    except Exception as e:
        logging.error(e, exc_info=True)
        raise


if __name__ == '__main__':
    main()
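
To try the bot out without posting anything, the --dry flag defined in getArgs() prints the candidate tweets instead of sending them:

python3 checkScoreAndTweet.py --dry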
