Retrieving records using the Europe PMC API
Understandably, going through a lot of publications is a necessary evil in every researcher’s life, and I’ve been wondering how to filter through search results and pull as much information out of papers as possible. Fortunately, databases like Europe PMC have made APIs available to facilitate access to their records, including annotations of the literature (e.g. the organisms and chemical structures mentioned in a paper, as well as links to datasets).
I only have a vague idea of how to retrieve information through APIs, so this was a good exercise. The first order of business is to go through the documentation.
Checking the instructions for GET /search, the URL to construct the query from is:
https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=parameters
For example, searching for the string “sloth genomics” and setting the format to JSON (everything else at its default), the request URL is:
https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=sloth%20genomics&resultType=lite&cursorMark=*&pageSize=25&format=json
If you access the link in a browser (I’m using Firefox version 67.0), it renders the JSON response in a way that makes it easy to scrutinize by eye.
The results indicate that there are 131 records matching the query term (as of June 2, 2019). Further, all sorts of information about each record is made available, such as the id of the article, which repository it came from (e.g. MED means the article is indexed in PubMed, and the presence of a pmcid means the full text is available online), as well as details about the authors, title, the journal it was published in, year, etc.
A simple way to go about this in Python is to use the requests library, coupled with json and pprint (pretty print). The idea is to create a message containing the address (request URL) and the details about the information you’re interested in (parameter values), use the requests library to send the HTTP request to the Europe PMC database and get the response back, and then use json and pprint to render it nicely in the terminal.
import requests
import json
from pprint import pprint

URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

# Search parameters; empty values fall back to the API defaults
data = {
    "query": "sloth genomics",
    "resultType": "lite",
    "synonym": "",
    "cursorMark": "*",
    "pageSize": "25",
    "sort": "",
    "format": "json",
    "callback": "",
    "email": "",
}

# Send the GET request and parse the JSON response
response = requests.get(URL, params=data)
results = json.loads(response.text)
pprint(results)
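Since results is now an ordinary Python dictionary, it is easy to pick out just the fields mentioned above. The short sketch below (continuing from the script above) assumes the lite response layout visible in the browser, i.e. a top-level hitCount plus a resultList holding the individual result records with keys such as id, source, pubYear, and title; adjust the keys if your output differs.
# Drill into the parsed response; the key names (hitCount, resultList, result,
# id, source, pubYear, title) are taken from inspecting the JSON in the browser
# and may need adjusting for other result types.
print("Records matching the query:", results.get("hitCount"))
for record in results.get("resultList", {}).get("result", []):
    print(record.get("id"), record.get("source"),
          record.get("pubYear"), record.get("title"))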
Similarly, I can construct a script to retrieve interesting annotations within an article using the Annotations API.
The base URL is:
https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds
So retrieving the annotations about organisms for the PubMed article with id 31031962 goes something like this:
import requests
import json
from pprint import pprint

URL = "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"

# Annotation parameters; empty values fall back to the API defaults
data = {
    "articleIds": "MED:31031962",
    "type": "Organisms",
    "section": "",
    "provider": "",
    "format": "JSON",
}

# Send the GET request and parse the JSON response
response = requests.get(URL, params=data)
results = json.loads(response.text)
pprint(results)
The full request URL is below and can likewise be accessed in a browser.
https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds?articleIds=MED%3A31031962&type=Organisms&format=JSON
This returns information about which organism terms appear in the article (e.g. “mammals”, “Homo sapiens”) and in which part of the article they were mentioned. It might be possible, for example, to use these data to prioritize papers according to how valuable they could be to your research.
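As a rough illustration of what can be done with the parsed output, the sketch below (continuing from the annotations script above) walks the annotation entries for each requested article and prints the matched term and the section it was found in. The key names (annotations, exact, section) are assumptions based on how the JSON looks in the browser, so double-check them against your own output.
# Walk the annotations for each requested article; the response is treated as a
# list with one entry per article, and the key names (annotations, exact, section)
# are assumed from inspecting the JSON in the browser.
for article in results:
    for annotation in article.get("annotations", []):
        print(annotation.get("exact"), "-", annotation.get("section"))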
While the results may look like a lot visually, I can now further process (parse) the variable results (using the json library) to print only the values of the tags I was interested in. Further extensions to this script (a sketch combining two of them follows the list) could include:
- error-handling (e.g. connection timeouts)
- capturing the date of the query
- formatting the results into a table (e.g. tsv or csv)
- printing the results to a file
- processing annotation results for multiple IDs
- retrieving inputs for parameter values from the command line or a file (e.g. YAML or JSON)
- further filtering of the results
- integrating two or more of the functionalities in order to have a flexible query space
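As a starting point, here is a minimal sketch of two of these extensions applied to the search script: basic error handling around the request and writing a few fields to a TSV file. It reuses the URL and data variables from the search example above; the output filename sloth_genomics.tsv and the chosen columns are just placeholders, and the column keys again assume the lite result layout.
import csv
import sys

import requests

# Basic error handling: give up after 30 seconds and exit on connection problems
try:
    response = requests.get(URL, params=data, timeout=30)
    response.raise_for_status()
except requests.exceptions.RequestException as error:
    sys.exit(f"Request failed: {error}")

results = response.json()

# Write a few fields per record to a TSV file (column keys assume the lite result layout)
with open("sloth_genomics.tsv", "w", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t")
    writer.writerow(["id", "source", "pubYear", "title"])
    for record in results.get("resultList", {}).get("result", []):
        writer.writerow([record.get("id"), record.get("source"),
                         record.get("pubYear"), record.get("title")])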
Some sample scripts are in this repository.
My takeaway from this exercise is that it’s possible to mine information from articles through APIs, which makes life easier when it comes to extracting as much as possible from published papers. :grin: