Arxiv.org is a preprint repository for research articles. Researchers can upload their articles here before publishing them in a journal. The website is not complex, so you can scrape Arxiv using plain HTTP requests.
This article will show you how to scrape Arxiv articles using Python.
Data Scraped from Arxiv.org
The tutorial will scrape each article's
- Topic
- Authors
- Abstract
- PDF link
You can get this information from their search results page.
The Environment
The tutorial shows web scraping Arxiv using Python. The code uses Python requests to send the HTTP requests to arxiv.org and BeautifulSoup to parse the results.
You can install both these libraries using pip.
pip install beautifulsoup4 requests
Web Scraping Arxiv: The Code
First, write the import statements. You need the external library Python requests, the BeautifulSoup class from bs4, and the json module from the Python standard library.
The json module lets you write the scraped results into a JSON file.
import requests
from bs4 import BeautifulSoup
import json
After writing the import statements, you can use the Python requests to retrieve the HTML content. To do so, use the get() method to retrieve HTML data from the URL.
query = input("What do you want to search for on Arxiv.org ")
response = requests.get(f"https://arxiv.org/search/?query={query}&searchtype=all&abstracts=show&order=-announced_date_first&size=50")
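Interpolating the raw input into the URL works for simple terms, but a query containing characters such as & or # could change the meaning of the URL. Here is a minimal sketch of building the same search URL with urlencode from the standard library. The helper name build_search_url is made up for illustration; the parameter names and values are the ones from the URL above.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://arxiv.org/search/"


def build_search_url(query, size=50):
    """Return the Arxiv search URL with every parameter percent-encoded."""
    params = {
        "query": query,
        "searchtype": "all",
        "abstracts": "show",
        "order": "-announced_date_first",
        "size": size,
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


# The ampersand in the query is encoded instead of starting a new parameter
print(build_search_url("circle packing & graphs"))
```

Python requests can do the same encoding if you pass the dictionary through the params argument of requests.get() instead of building the string yourself.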
The response will contain the HTML data. Pass it to the BeautifulSoup constructor for parsing and name the parser explicitly; the constructor returns an object you can search.
soup = BeautifulSoup(response.text,"html.parser")
Extract data from the created object.
You can use the find method to get a single element that BeautifulSoup finds first. Use the find_all method to find all the elements matching the criteria.
Find the li tags with the arxiv-result class. This tag element will contain the details of the article. Then, extract sub-elements or child elements using the same method.
articles = soup.find_all("li",attrs={"class":"arxiv-result"})
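The difference between find and find_all is easiest to see on a small document. Here is a minimal sketch on a made-up snippet shaped like an Arxiv search result; the titles and author names are placeholders, not real data.

```python
from bs4 import BeautifulSoup

# Trimmed-down, invented sample in the shape of an Arxiv search result
SAMPLE_HTML = """
<ol>
  <li class="arxiv-result">
    <p class="title is-5 mathjax"> First sample title </p>
    <p class="authors"><span>Authors:</span>
      <a href="#">Ada Example</a>, <a href="#">Bo Sample</a></p>
  </li>
  <li class="arxiv-result">
    <p class="title is-5 mathjax"> Second sample title </p>
    <p class="authors"><span>Authors:</span> <a href="#">Cy Placeholder</a></p>
  </li>
</ol>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find returns only the first match; find_all returns every match
first_result = soup.find("li", attrs={"class": "arxiv-result"})
all_results = soup.find_all("li", attrs={"class": "arxiv-result"})

first_title = first_result.find("p", attrs={"class": "title"}).text.strip()
titles = [r.find("p", attrs={"class": "title"}).text.strip() for r in all_results]

# The same methods work on a sub-element, here the authors paragraph
first_authors = [
    a.text for a in first_result.find("p", attrs={"class": "authors"}).find_all("a")
]

print(first_title, titles, first_authors)
```

Note that the class filter matches a tag when any one of its classes equals the given value, so "title" also matches a tag whose class attribute is "title is-5 mathjax".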
Loop through all the articles and extract the required information.
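- Topic from a p tag with the class title
rawTopic = article.find("p",attrs={"class":"title"}).text
topic = rawTopic.replace("\n","").strip()
- Abstract from a span tag with the class abstract-full, which sits inside a p tag with the class abstract
abstract = article.find("p",attrs={"class":"abstract"})
rawFullAbstract = abstract.find("span",{"class":"abstract-full"}).text
fullAbstract = rawFullAbstract.replace("\n","").strip()
- Authors from the anchor tags inside the p tag with the class authors, so that multi-word names stay intact
authors = [a.text.strip() for a in article.find("p",{"class":"authors"}).find_all("a")]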
The PDF link is inside an anchor tag, but the tag does not have a class. However, you can select the specific anchor tag using its relative location. The anchor tag is inside a span tag inside a p tag inside a div tag. Therefore, chain the tags to select the required anchor tag and extract the href attribute.
pdfURL = article.div.p.span.a['href']
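The text pulled out of these tags usually carries newlines and runs of spaces from the page markup. Here is a small sketch of one way to normalize it using only the standard library. The helper clean_text is hypothetical, and the assumption that the full abstract ends with a "△ Less" toggle label comes from the sample search-result markup shown later in this article.

```python
def clean_text(raw, trailing_label="△ Less"):
    """Collapse all whitespace runs and drop a trailing toggle label if present."""
    text = " ".join(raw.split())
    if trailing_label and text.endswith(trailing_label):
        text = text[: -len(trailing_label)].rstrip()
    return text


# Invented raw text with the kind of whitespace a scraped tag may contain
raw_abstract = "\n   We study properties of\n certain circles.   \n △ Less\n"
print(clean_text(raw_abstract))
```

Splitting and re-joining also collapses runs of spaces in the middle of the text, which replace("\n","") followed by strip() leaves in place.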
Append the extracted information to a list at the end of every iteration.
arxivArticle = {
    "Topic": topic,
    "Abstract": fullAbstract,
    "PDF": pdfURL,
    "Authors": authors
}
articleInfo.append(arxivArticle)
After the loop ends, save this list to a JSON file.
with open("arxiv.json","w",encoding="utf-8") as jsonFile:
    json.dump(articleInfo,jsonFile,indent=4,ensure_ascii=False)
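By default, the json module escapes every non-ASCII character, so an author name with accented letters is written as \u sequences unless you pass ensure_ascii=False. A short sketch of the difference, using json.dumps (the string-returning sibling of json.dump) and one author name taken from the sample results in this article:

```python
import json

records = [{"Topic": "Generalization of Apollonius Circle", "Authors": ["Ömer Avcı"]}]

# Default behaviour: non-ASCII characters become \uXXXX escapes
escaped = json.dumps(records)

# ensure_ascii=False keeps the characters as they are
readable = json.dumps(records, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data; the choice only affects how readable the file is. When writing non-ASCII text to disk, opening the file with encoding="utf-8" avoids depending on the platform's default encoding.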
Here is the complete code for web scraping Arxiv articles.
import json

import requests
from bs4 import BeautifulSoup

query = input("What do you want to search for on Arxiv.org ")
response = requests.get(f"https://arxiv.org/search/?query={query}&searchtype=all&abstracts=show&order=-announced_date_first&size=50")
# Name the parser explicitly so that BeautifulSoup does not have to guess one
soup = BeautifulSoup(response.text,"html.parser")
# Each search result is an li tag with the class arxiv-result
articles = soup.find_all("li",attrs={"class":"arxiv-result"})
articleInfo = []
for article in articles:
    rawTopic = article.find("p",attrs={"class":"title"}).text
    topic = rawTopic.replace("\n","").strip()
    abstract = article.find("p",attrs={"class":"abstract"})
    rawFullAbstract = abstract.find("span",{"class":"abstract-full"}).text
    fullAbstract = rawFullAbstract.replace("\n","").strip()
    # The PDF link is the first anchor inside div > p > span
    pdfURL = article.div.p.span.a['href']
    # Read each author from its own anchor tag so multi-word names stay intact
    authors = [a.text.strip() for a in article.find("p",{"class":"authors"}).find_all("a")]
    arxivArticle = {
        "Topic": topic,
        "Abstract": fullAbstract,
        "PDF": pdfURL,
        "Authors": authors
    }
    articleInfo.append(arxivArticle)
# ensure_ascii=False keeps accented characters readable in the file
with open("arxiv.json","w",encoding="utf-8") as jsonFile:
    json.dump(articleInfo,jsonFile,indent=4,ensure_ascii=False)
The results of scraping arxiv.org will be something like this.
[
{
"Topic": "A Novel Method for Drawing a Circle Tangent to Three Circles Lying on a Plane by Straightedge,Compass,and Inversion Circles",
"Abstract": "In this paper, we present a novel method to draw a circle tangent to three given circles lying on a plane. Using the analytic geometry and inversion (reflection) theorems, the center and radius of the inversion circle are obtained. Inside any one of the three given circles, a circle of the similar radius and concentric with its own corresponding original circle is drawn.The tangent circle to these three similar circles is obtained. Then the inverted circles of the three similar circles and the tangent circle regarding an obtainable point and a computable power of inversion (reflection) constant are obtained. These circles (three inverted circles and an inverted tangent circle)will be tangent together.Just,we obtain another reflection point and power of inversion so that those three reflected circles (inversions of three similar circles) can be reflections of three original circles, respectively. In such a case,the reflected circle tangent to three reflected circles regarding same new inversion system will be tangent to the three original ones. This circle is our desirable circle. A drawing algorithm is also given for drawing desirable circle by straightedge and compass. A survey of conformal mapping theory and inversion in higher dimensions is also accomplished. Although, Laguerre transformation might be used for solution of this problem, but we do not make use of this method. Our novelty is just for drawing a circle tangent to three given circles applying a tangent circle to three identical circles concentric with three given ones and then inverting them as original ones by compass and straightedge not any thing else.",
"PDF": "http://arxiv.org/pdf/1906.00068v1",
"Authors": [
"Ahmad Sabihi"
]
},
{
"Topic": "Generalization of Apollonius Circle",
"Abstract": "Apollonius of Perga, showed that for two given points $A,B$ in the Euclidean plane and a positive real number $k\\neq 1$, geometric locus of the points $X$ that satisfies the equation $|XA|=k|XB|$ is a circle. This circle is called Apollonius circle. In this paper we generalize the definition of the Apollonius circle for two given circles $\\Gamma_1,\\Gamma_2$ and we show that geometric locus of the points $X$ with the ratio of the power with respect to the circles $\\Gamma_1,\\Gamma_2$ is constant, is also a circle. Using this we generalize the definition of Apollonius Circle, and generalize some results about Apollonius Circle.",
"PDF": "http://arxiv.org/pdf/2105.03673v1",
"Authors": [
"Ömer Avcı",
"Ömer Talip Akalın",
"Faruk Avcı",
"Halil Salih Orhan"
]
},
{
"Topic": "When Euler (circle) meets Poncelet (Porism)",
"Abstract": "We describe all triangles that shares the same circumcircle and Euler circle. Although this two circles do not form a poristic pair of circles, we find a poristic circle \"in-between\" that enable to solve this problem using Poncelet porism.",
"PDF": "http://arxiv.org/pdf/2011.01988v1",
"Authors": [
"Liliana Gabriela Gheorghe"
]
},
{
"Topic": "The Six Circles Theorem revisited",
"Abstract": "The Six Circles Theorem of C. Evelyn, G. Money-Coutts, and J. Tyrrell concerns chains of circles inscribed into a triangle: the first circle is inscribed in the first angle, the second circle is inscribed in the second angle and tangent to the first circle, the third circle is inscribed in the third angle and tangent to the second circle, and so on, cyclically. The theorem asserts that if all the circles touch the sides of the triangle, and not their extensions, then the chain is 6-periodic. We show that, in general, the chain is eventually 6-periodic but may have an arbitrarily long pre-period.",
"PDF": "http://arxiv.org/pdf/1312.5260v2",
"Authors": [
"Dennis Ivanov",
"Serge Tabachnikov"
]
},
{
"Topic": "Testing analyticity on circles",
"Abstract": "Consider a continuous one parameter family of circles in complex plane that contains two circles lying in the exterior of one another. Under mild assumptions on the family, we prove that if a continuous function on the union of the above circles extends holomorphically into each circle, then the function is holomorphic in the interior of the union of the circles.",
"PDF": "http://arxiv.org/pdf/math/0502139v1",
"Authors": [
"A. Tumanov"
]
},
{
"Topic": "Balanced Circle Packings for Planar Graphs",
"Abstract": "We study balanced circle packings and circle-contact representations for planar graphs, where the ratio of the largest circle's diameter to the smallest circle's diameter is polynomial in the number of circles. We provide a number of positive and negative results for the existence of such balanced configurations.",
"PDF": "http://arxiv.org/pdf/1408.4902v1",
"Authors": [
"Md. Jawaherul Alam",
"David Eppstein",
"Michael T. Goodrich",
"Stephen G. Kobourov",
"Sergey Pupyrev"
]
},
{
"Topic": "Spinors and Descartes' Theorem",
"Abstract": "Descartes' circle theorem relates the curvatures of four mutually externally tangent circles, three \"petal\" circles around the exterior of a central circle, forming a \"$3$-flower\" configuration. We generalise this theorem to the case of an \"$n$-flower\", consisting of $n$ tangent circles around the exterior of a central circle, and give an explicit equation satisfied by their curvatures. The proof uses a spinorial description of horospheres in hyperbolic geometry.",
"PDF": "http://arxiv.org/pdf/2310.11701v1",
"Authors": [
"Daniel V. Mathews",
"Orion Zymaris"
]
},
{
"Topic": "Generalized problem of Apollonius",
"Abstract": "The aim of this paper is to generalize Apollonius' problem. The problem is to construct a circle that is tangent to three given circles in a plane. We find the maximum possible number of solution circles in the case of more than the three given circles. We show that if all the given circles are not tangent at the same point, then there exist at most six solutions in the case of the four given generalized circles and there exist at most four solutions in the case of the five given generalized circles. We also describe all quadruples of generalized circles with exactly six solutions.",
"PDF": "http://arxiv.org/pdf/1611.03090v2",
"Authors": [
"Egor Morozov"
]
},
{
"Topic": "Fibonacci numbers and Ford circles",
"Abstract": "An amusing connection between Ford circles, Fibonacci numbers, and golden ratio is shown. Namely, certain tangency points of Ford circles are concyclic and involve Fibonacci numbers. They form four circles that cut the x-axis at points related to the golden ratio.",
"PDF": "http://arxiv.org/pdf/2003.00852v1",
"Authors": [
"Jerzy Kocik"
]
},
{
"Topic": "Properties of Ajima Circles",
"Abstract": "We study properties of certain circles associated with a triangle. Each circle is inside the triangle, tangent to two sides of the triangle, and externally tangent to the arc of a circle erected internally on the third side.",
"PDF": "http://arxiv.org/pdf/2310.12896v1",
"Authors": [
"Stanley Rabinowitz",
"Ercole Suppa"
]
}
]
Extracting Using APIs
Arxiv.org also provides APIs for extracting data. The process is similar to scraping their search results page, but the response will contain structured XML data.
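The snippets in this section read from a variable named apiResponse, and the request that creates it is not shown. Here is a minimal sketch of that step, assuming the public query endpoint at export.arxiv.org and its search_query, start, and max_results parameters. The helper names build_api_url and fetch_api_response are made up for illustration.

```python
from urllib.parse import urlencode

# Assumed endpoint of the public Arxiv API
API_URL = "http://export.arxiv.org/api/query"


def build_api_url(query, start=0, max_results=10):
    """Return an Arxiv API URL that searches all fields for the query."""
    params = {
        "search_query": f"all:{query}",
        "start": start,
        "max_results": max_results,
    }
    return f"{API_URL}?{urlencode(params)}"


def fetch_api_response(query):
    """Send the request; requests is imported here so the URL builder works alone."""
    import requests

    return requests.get(build_api_url(query), timeout=30)


print(build_api_url("circle"))
```

Calling apiResponse = fetch_api_response("circle") would then give you the response object used in the rest of this section.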
Details of each article will be inside an entry tag; each detail will be inside the corresponding tag.
<entry> <id>http://arxiv.org/abs/1409.5175v1</id> <updated>2014-09-18T02:08:39Z</updated> <published>2014-09-18T02:08:39Z</published> <title>Colorful Associahedra and Cyclohedra</title> <summary> Every n-edge colored n-regular graph G naturally gives rise to a simple abstract n-polytope, the colorful polytope of G, whose 1-skeleton is isomorphic to G. The paper describes colorful polytope versions of the associahedron and cyclohedron. Like their classical counterparts, the colorful associahedron and cyclohedron encode triangulations and flips, but now with the added feature that the diagonals of the triangulations are colored and adjacency of triangulations requires color preserving flips. The colorful associahedron and cyclohedron are derived as colorful polytopes from the edge colored graph whose vertices represent these triangulations and whose colors on edges represent the colors of flipped diagonals. </summary> <author> <name>Gabriela Araujo-Pardo</name> </author> <author> <name>Isabel Hubard</name> </author> <author> <name>Deborah Oliveros</name> </author> <author> <name>Egon Schulte</name> </author> <arxiv:comment xmlns:arxiv="http://arxiv.org/schemas/atom">21 pp, to appear in Journal Combinatorial Theory A</arxiv:comment> <link href="http://arxiv.org/abs/1409.5175v1" rel="alternate" type="text/html"/> <link href="http://arxiv.org/pdf/1409.5175v1" rel="related" title="pdf" type="application/pdf"/> <arxiv:primary_category scheme="http://arxiv.org/schemas/atom" term="math.CO" xmlns:arxiv="http://arxiv.org/schemas/atom"/> <category scheme="http://arxiv.org/schemas/atom" term="math.CO"/> <category scheme="http://arxiv.org/schemas/atom" term="math.MG"/> </entry>
Because the data is XML, use features="xml" as an argument when passing the response text to the BeautifulSoup constructor. This parser relies on the lxml package, which you can install using pip.
xmlSoup = BeautifulSoup(apiResponse.text,features="xml")
Then, as before, you can find the tags.
Locate the entry tags to find all the tags containing articles. Iterate through the tags and extract each data point.
apiArticles = xmlSoup.find_all("entry")
articlesFromAPI = []
for article in apiArticles:
    title = article.find("title").text
    summary = article.find("summary").text
    pdfLink = article.find("link",attrs={"title":"pdf"})['href']
    authors = article.find_all("name")
    authorList = []
    for author in authors:
        authorList.append(author.text)
Finally, still inside the loop, append the extracted data to a list. Once the loop ends, save the list to a JSON file.
    articlesFromAPI.append(
        {
            "Topic":" ".join(title.split()),
            "Abstract":" ".join(summary.split()),
            "PDF":pdfLink,
            "Authors":authorList
        }
    )
with open("arxivFromAPI.json","w",encoding="utf-8") as jsonFile:
    json.dump(articlesFromAPI,jsonFile,indent=4,ensure_ascii=False)
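If you would rather not install an XML parser for BeautifulSoup, the standard library can read the same feed. The sketch below is an alternative to the code above, not the method this tutorial uses. It assumes the entries live in the standard Atom namespace, and it runs on a shortened copy of the sample entry rather than a live response; parse_entries is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Atom namespace, assumed to be declared on the feed element
ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# Shortened copy of the sample entry, wrapped in a feed element
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Colorful Associahedra
      and Cyclohedra</title>
    <summary> The paper describes colorful polytope versions of the
      associahedron and cyclohedron. </summary>
    <author><name>Gabriela Araujo-Pardo</name></author>
    <author><name>Isabel Hubard</name></author>
    <link href="http://arxiv.org/abs/1409.5175v1" rel="alternate" type="text/html"/>
    <link href="http://arxiv.org/pdf/1409.5175v1" rel="related" title="pdf" type="application/pdf"/>
  </entry>
</feed>
"""


def parse_entries(xml_text):
    """Extract topic, abstract, PDF link, and authors from every entry."""
    root = ET.fromstring(xml_text)
    results = []
    for entry in root.findall("atom:entry", ATOM):
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        summary = entry.findtext("atom:summary", default="", namespaces=ATOM)
        pdf_link = entry.find("atom:link[@title='pdf']", ATOM)
        results.append({
            "Topic": " ".join(title.split()),
            "Abstract": " ".join(summary.split()),
            "PDF": pdf_link.get("href") if pdf_link is not None else None,
            "Authors": [n.text for n in entry.findall("atom:author/atom:name", ATOM)],
        })
    return results


parsed = parse_entries(SAMPLE_FEED)
print(parsed)
```

With a live response, you would pass apiResponse.text to parse_entries in place of SAMPLE_FEED.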
Using APIs vs. Web Scraping
You can also use their API to extract article details from arxiv.org. The API gives you an XML response. The XML will contain article details in a structured format.
For example, each article will be inside an entry tag. Inside the entry tag, the child elements hold the article details. For example, the article’s title will be inside the title tag, and the summary will be inside the summary tag.
In contrast, the article details will be inside HTML tags when you try web scraping.
<li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="https://arxiv.org/abs/2405.10914">arXiv:2405.10914</a> <span> [<a href="https://arxiv.org/pdf/2405.10914">pdf</a>, <a href="https://arxiv.org/format/2405.10914">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="High Energy Physics - Theory">hep-th</span> </div> </div> <p class="title is-5 mathjax"> Global Symmetry and Integral Constraint on Superconformal Lines in Four Dimensions </p> <p class="authors"> <span class="has-text-black-bis has-text-weight-semibold">Authors:</span> <a href="/search/?searchtype=author&query=Dempsey%2C+R">Ross Dempsey</a>, <a href="/search/?searchtype=author&query=Offertaler%2C+B">Bendeguz Offertaler</a>, <a href="/search/?searchtype=author&query=Pufu%2C+S+S">Silviu S. Pufu</a>, <a href="/search/?searchtype=author&query=Wang%2C+Y">Yifan Wang</a> </p> <p class="abstract mathjax"> <span class="search-hit">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2405.10914v1-abstract-short"> …<span class="search-hit mathjax">field</span> <span class="search-hit mathjax">theories</span>. At large distances, such impurities are described by half-BPS superconformal line defects. By working in the $\text{AdS}_2\times \text{S}^2$… <a class="is-size-7" onclick="document.getElementById('2405.10914v1-abstract-full').style.display = 'inline'; document.getElementById('2405.10914v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2405.10914v1-abstract-full"> We study properties of point-like impurities preserving flavor symmetry and supersymmetry in four-dimensional ${\cal N} = 2$ <span class="search-hit mathjax">field</span> <span class="search-hit mathjax">theories</span>. At large distances, such impurities are described by half-BPS superconformal line defects. 
By working in the $\text{AdS}_2\times \text{S}^2$ <span class="search-hit mathjax">conformal</span> frame, we develop a novel and simpler way of deriving the superconformal Ward identities relating the various two-point functions of flavor current multiplet operators in the presence of the defect. We use these relations to simplify a certain integrated two-point function of flavor current multiplet operators that, in Lagrangian <span class="search-hit mathjax">theories</span>, can be computed using supersymmetric localization. The simplification gives an integral constraint on the two-point function of the flavor current multiplet superconformal primary with trivial integration measure in the $\text{AdS}_2\times \text{S}^2$ <span class="search-hit mathjax">conformal</span> frame. We provide several consistency checks on our Ward identities. <a class="is-size-7" onclick="document.getElementById('2405.10914v1-abstract-full').style.display = 'none'; document.getElementById('2405.10914v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 17 May, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> May 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">41 pages + Appendices</span> </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Report number:</span> PUPT-2652 </p> </li>
Therefore, you can see that it is easier to extract data from an API response.
However, the API only returns the details included in its response, whereas web scraping can get any information shown on the website.
Code Limitations
The code can extract data from arxiv.org using the search results URL or their API. Either way, you may have to alter the code in the future, as arxiv.org can change the website’s structure or the API endpoint.
Changing the website structure may require you to analyze their search results page again to find the tags and attributes of the data you want to scrape.
If they change the endpoint, you must use a new one, which also requires altering the code.
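A structure change often shows up as empty fields or an empty result list instead of an error. One way to notice this early is to check the scraped records before saving them. The sketch below assumes records shaped like the dictionaries built in this tutorial; find_incomplete is a hypothetical helper and the sample records are placeholders.

```python
REQUIRED_FIELDS = ("Topic", "Abstract", "PDF", "Authors")


def find_incomplete(records):
    """Return (index, missing_fields) pairs for records with empty or absent fields."""
    problems = []
    for index, record in enumerate(records):
        missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
        if missing:
            problems.append((index, missing))
    return problems


# Placeholder records: the second one mimics selectors that stopped matching
sample_records = [
    {
        "Topic": "Sample title",
        "Abstract": "Sample abstract",
        "PDF": "https://example.org/sample.pdf",
        "Authors": ["Sample Author"],
    },
    {
        "Topic": "Another title",
        "Abstract": "",
        "PDF": "https://example.org/other.pdf",
        "Authors": [],
    },
]

print(find_incomplete(sample_records))
```

An empty list of records is worth flagging too, since it may mean that the tag or class used to locate the articles no longer exists.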
Wrapping Up
You can extract data from arxiv.org using Python requests. You don’t have to use headers, and arxiv.org also provides APIs.
Either web scraping Arxiv.org or using their APIs can provide the required information. However, web scraping can also provide any information available on the website. Meanwhile, APIs can only get you the information they intend to deliver.
Though this code can extract data from arxiv.org, you may need to update it whenever arxiv.org changes its website structure or the API URL. Moreover, the code only gets the topic, the abstract, the authors, and the PDF link. If you need more information, you must change the code accordingly.
Or, you can use ScrapeHero Services. ScrapeHero is an enterprise-grade web scraping service provider. Let us know what and how much data you need, and we will build a high-quality Arxiv.org scraper for you.