Text and code taken from this site.
I’m trying to “scrape” images from Imgur using BeautifulSoup and requests but I’m only getting the first page of results. Why?
The code uses requests to fetch the HTML and BeautifulSoup with html5lib to parse it.
If you don’t have these installed already you can install them using pip, e.g.
pip install beautifulsoup4 requests html5lib --user
- You use it, taking the blue pill: the article ends.
- You take the red pill: you stay in Wonderland, and I show you how deep a JSON response goes.
Remember: all I’m offering is the truth. Nothing more.
Usually the first step is to debug the page in question inside your browser using the “Developer Tools” (although it currently seems to be called “Web Developer” in Firefox), mainly the Inspector tab and the Network tab.
So normally when you scroll down to the end of an Imgur results page it will automatically load the next page of images.
I’m using Firefox as my browser (although Chrome also has “Developer Tools”) with Javascript disabled (using the NoScript extension), so when I scroll to the end of the page I see a “loading” icon and the next page is never loaded.
This means that the “load next page” functionality is implemented using Javascript. Neither requests nor the urllib modules can execute Javascript, meaning we will only get the first page of results when using them to fetch the HTML.
One option is to debug the HTTP requests being made by the Javascript and try to replicate them with requests.
To debug HTTP requests we can view the Network tab.
You may have heard of the Inspector tab, which you can access by right-clicking on a page and selecting Inspect Element.
Do note that the Inspector tab shows your browser’s representation of the page after it has parsed the source HTML, and as such it may differ from the actual source HTML.
With the Inspector tab open we can simply select the Network tab.
After opening the Network tab I have selected the XHR filter to show only those types of requests. XHR stands for XMLHttpRequest, which is what the requests made by Javascript are classed as.
As we know the page is using Javascript to fetch the data, we know it will be an XHR request, so viewing only those requests will simplify things.
I then enable Javascript (as I have it disabled) and reload the page. The Network tab will only show requests that have been made after it was opened.
All that changes in the URL is the page number, i.e. 1 -> 2, and as 1 gives us the second page of results we can assume that 0 will give us the first.
This suggests we could loop through a range() of page numbers to build the needed URLs and then parse out the image links from the HTML.
The images on the results page however are just “preview” images and not the actual content we’re looking for. Also, it’s not just “images” we’re dealing with as some are “videos”, some are “albums” of “images”.
So let’s scroll back up and click on one of the images to open the individual image page. Before clicking we will click the “basket” icon to the left of Inspector to clear out the Network tab, making it easier to focus on the new requests being made.
We can see that a request similar to the previous XHR requests is made, except hit?scrolled is replaced with hit.json and we get back a JSON response.
If you click on the Response tab over on the right-hand side panel you can view details about the response.
Another useful feature is that you can right-click on a request and select Copy as cURL, which gives you the full curl command needed to replicate the request. This can be very useful for testing things out from the command-line.
We can add -o filename to the end of the curl command to store the output in filename instead of just printing the result if needed.
Using the exact command is not always necessary but at the very least you will want to set the User-Agent header, which is what we’ll do here.
We’re also going to pipe the result through jq, which will pretty-print the JSON, and finally use shell redirection to write the result to a file.
$ curl -A Mozilla/5.0 'http://imgur.com/r/funny/new/page/0/hit.json' | jq . > imgur.json
You could of course use requests along with json.dump() to do it directly from Python and there’s also httpie.
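As a rough sketch of the requests plus json.dump() route, something like the following could stand in for the curl | jq pipeline. The save_json name is made up for illustration, and the function takes a requests session as an argument rather than creating one, which is just one way of arranging it.

```python
import json


def save_json(session, url, path):
    """Fetch url with a requests-style session and write pretty-printed JSON to path.

    save_json is a hypothetical helper name, not something from the article.
    """
    data = session.get(url).json()
    with open(path, 'w') as f:
        # indent=2 gives output similar to what `jq .` prints
        json.dump(data, f, indent=2)
    return data


# Usage (this makes a real HTTP request):
#   import requests
#   with requests.Session() as s:
#       s.headers['user-agent'] = 'Mozilla/5.0'
#       save_json(s, 'http://imgur.com/r/funny/new/page/0/hit.json', 'imgur.json')
```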
You may have also noticed the Copy Response entry in the right-click menu which, as the name suggests, copies the response to your clipboard.
The JSON response received has a structure like
{
  "data": [
    {
      "id": 4735008625,
      "hash": "pNuFUQQ",
      "author": "SlimJones123",
      "account_id": null,
      "account_url": null,
      "title": "Abra Kadraba Alakaslam",
      "score": 18352,
      "size": 18087775,
      "views": "18551",
      "is_album": false,
      "album_cover": null,
      "album_cover_width": 0,
      "album_cover_height": 0,
      "mimetype": "image/gif",
      "ext": ".gif",
      "width": 720,
      "height": 720,
      "animated": true,
      "looping": true,
      "reddit": "/r/funny/comments/6a02ma/abra_kadraba_alakaslam/",
      "subreddit": "funny",
      "description": "",
      "create_datetime": "2017-05-05 14:42:36",
      "bandwidth": "312.50 GB",
      "timestamp": "2017-05-08 18:50:03",
      "section": "funny",
      "nsfw": false,
      "prefer_video": true,
      "video_source": "https://www.instagram.com/p/BTW71oaA8DL/",
      "video_host": null,
      "num_images": 1,
[...]
data has 100 entries, meaning hit.json gives us 100 results per page, and each entry contains lots of information: hash, author, views, etc.
$ jq '.data | length' imgur.json
100
We could also use Python’s json.load() to get that information.
>>> import json
>>>
>>> with open('imgur.json') as f:
...     j = json.load(f)
...
>>> len(j['data'])
100
… or just use requests to make the request as mentioned.
>>> import requests
>>>
>>> url = 'http://imgur.com/r/funny/new/page/0/hit.json'
>>> r = requests.get(url, headers={'user-agent': 'Mozilla/5.0'})
>>>
>>> len(r.json()['data'])
100
requests has a .json() method on response objects which saves us having to call json.loads() manually. Also note that HTTP header names are case-insensitive, so feel free to use User-Agent if you prefer.
If you’re not familiar with JSON, it’s just a “file format” that is used for passing data around. It looks similar to a Python structure, and in this case it’s valid Python too.
>>> j = {
...     "data": [
...         {
...             "id": 4735008625,
...             "hash": "pNuFUQQ"
...         }
...     ]
... }
>>> type(j)
<type 'dict'>
So we have a dict with the key data whose value is a list.
>>> type(j['data'])
<type 'list'>
And this is exactly what we get from the call to .json() or one of the json.load methods, as they turn JSON data into an equivalent Python structure.
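To make the “equivalent Python structure” part concrete, here is a small sketch using json.loads() on a hand-trimmed string. The string below is a cut-down stand-in for the real response, not the full thing. Note that JSON’s false and null are not Python names, so a response containing them only becomes usable Python after parsing.

```python
import json

# A trimmed stand-in for the real response, kept as a raw JSON string.
raw = '{"data": [{"hash": "pNuFUQQ", "is_album": false, "album_cover": null}]}'

j = json.loads(raw)
entry = j['data'][0]

print(isinstance(j, dict), isinstance(j['data'], list))  # True True
print(entry['is_album'], entry['album_cover'])           # False None
```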
So as mentioned above we’re not just dealing with single static images. If you “hover” over the images in the list of results it shows you the image “type” and the amount of views.
There appear to be 3 types: image, animated and album.
We can see this information inside the JSON response
"hash" : "pNuFUQQ",
"is_album": false,
"animated": true
So if is_album is false and animated is false it’s a regular single image.
If is_album is false and animated is true, like in this example, it’s a single “video”.
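Those rules could be sketched as a tiny function. The entry_type name is made up here, and the labels are simply the three type names used in this article. It assumes an entry with is_album set to true is an album regardless of its animated flag.

```python
def entry_type(entry):
    """Classify one entry from the results JSON (hypothetical helper)."""
    if entry['is_album']:
        # Assumption: albums are treated as albums whatever `animated` says.
        return 'album'
    if entry['animated']:
        return 'animated'
    return 'image'


example = {"hash": "pNuFUQQ", "is_album": False, "animated": True}
print(entry_type(example))  # animated
```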
"hash" : "pNuFUQQ",
"mimetype": "image/gif",
"ext" : "gif"
It does say it’s a gif but Imgur also serves mp4 versions of the gif files.
As we have the hash and the ext to build the URL, we can simply replace .gif with .mp4
$ curl -I http://i.imgur.com/pNuFUQQ.gif
HTTP/1.1 200 OK
Last-Modified: Fri, 05 May 2017 14:42:40 GMT
ETag: "b5467c538aa52d7ea066a7a7bfa6d574"
Content-Type: image/gif
Fastly-Debug-Digest: 3a99bef1064a006c1e74d60486dfbae1791a3220c40c0fe005a08f3b801784bc
cache-control: public, max-age=31536000
Content-Length: 18087775
[...]
If we change the extension to mp4
$ curl -I http://i.imgur.com/pNuFUQQ.mp4
HTTP/1.1 200 OK
Last-Modified: Fri, 05 May 2017 14:42:38 GMT
ETag: "56a70bd6a18458c2a95c33934df5a692"
Content-Type: video/mp4
Fastly-Debug-Digest: 95f433cb7680d8ce3e9289d820027d124861898b9723658660d494906c2eb2e7
cache-control: public, max-age=31536000
Content-Length: 903448
[...]
Note the substantial difference in file size: 903448 vs. 18087775 bytes. We will just use the mp4 extension when we encounter a gif.
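The same check could be done from Python with HEAD requests, roughly as below. This is only a sketch: content_length and size_ratio are made-up names, the session is expected to be a requests session, and it assumes the server actually sends a Content-Length header as it did in the curl output above. With the two sizes shown, the gif comes out at about 20 times the size of the mp4.

```python
def content_length(session, url):
    """Return the Content-Length reported for url, in bytes, using a HEAD request."""
    response = session.head(url)
    response.raise_for_status()
    # Assumption: the header is present, as in the curl -I output.
    return int(response.headers['Content-Length'])


def size_ratio(session, gif_url, mp4_url):
    """How many times larger the .gif is than the .mp4 (hypothetical helper)."""
    return content_length(session, gif_url) / content_length(session, mp4_url)


# Usage (this makes real HTTP requests):
#   import requests
#   with requests.Session() as s:
#       s.headers['user-agent'] = 'Mozilla/5.0'
#       print(size_ratio(s, 'http://i.imgur.com/pNuFUQQ.gif',
#                           'http://i.imgur.com/pNuFUQQ.mp4'))
```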
So that is the image and animated types taken care of. What about album?
If we look at the Network tab as we click on the Load more images button we can see the XHR request being made.
We’ll use the curl | jq combo from earlier to get a pretty-printed version of the JSON response saved to disk.
$ curl -A Mozilla/5.0 'http://imgur.com/ajaxalbums/getimages/rtYyh/hit.json' | jq . > imgur-gallery.json
httpie, mentioned earlier, is a “curl-like” tool written in Python that gives you a single http command which you could use instead.
$ http --pretty format http://imgur.com/... User-Agent:Mozilla/5.0 > imgur-gallery.json
Either way the JSON response has a structure like
{
  "data": {
    "count": 26,
    "images": [
      {
        "hash": "5MXcg9i",
        "title": "",
        "description": null,
        "width": 701,
        "height": 943,
        "size": 6597,
        "ext": ".png",
        "animated": false,
        "prefer_video": false,
        "looping": false,
        "datetime": "2017-05-03 06:00:48"
      },
      {
        "hash": "otDOcL6",
        "title": "",
[...]
So we have the count of images in the album and the list of images with each hash and ext, meaning we can build their URL e.g. http://i.imgur.com/hash.ext
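As a quick sketch of building those URLs from an already-parsed album response, something like this would do. The album_image_urls name is made up, and the sample dict is trimmed to the first image from the response above with only the keys the function needs.

```python
def album_image_urls(album):
    """Build direct image URLs from a parsed getimages/.../hit.json response."""
    base = 'http://i.imgur.com/{}{}'
    return [base.format(image['hash'], image['ext'])
            for image in album['data']['images']]


# Trimmed sample: only the first image, and only the keys used here.
album = {"data": {"count": 26, "images": [{"hash": "5MXcg9i", "ext": ".png"}]}}
print(album_image_urls(album))  # ['http://i.imgur.com/5MXcg9i.png']
```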
Do recall that we got 100 results from our page/N/hit.json, which would suggest there is a limit of 100 results per call. If an album has multiple Load more images buttons there is probably a similar “page” type syntax for the getimages/hash/hit.json call; however, without debugging such a page we cannot be sure how it works.
So now that we know what requests are being made we can attempt to replicate them with Python.
We will first show the output.
$ python imgur.py | tee imgur.log
http://i.imgur.com/rNM1Z0V.jpg
http://i.imgur.com/SOnDLwR.mp4
http://i.imgur.com/n3PXWRe.mp4
http://i.imgur.com/9i1aYWN.jpg
http://i.imgur.com/JaeMo6M.mp4
http://i.imgur.com/sH4DqXA.jpg
http://i.imgur.com/uw9ppv4.png
http://i.imgur.com/xF0mdmc.mp4
http://i.imgur.com/pNuFUQQ.mp4
http://i.imgur.com/k4j6ZNZ.jpg
http://i.imgur.com/Ac5p6fB.jpg
http://i.imgur.com/2wdzJkM.jpg
http://i.imgur.com/HP3nCsV.jpg
http://i.imgur.com/90p9w6i.jpg
http://i.imgur.com/3V9qV6w.jpg
http://i.imgur.com/ZxfEVSF.jpg
http://i.imgur.com/CH0FWLj.jpg
http://i.imgur.com/vUfHVz2.jpg
http://i.imgur.com/1Ici9Si.jpg
http://i.imgur.com/ZSyEt89.jpg
[...]
$ wc -l imgur.log
152 imgur.log
So from the 100 entries we ended up with 152 “images” due to some of them being albums containing multiple “images”.
Here is the code.
import requests

imgur = 'http://i.imgur.com/{}{}'
page_api = 'http://imgur.com/r/funny/new/page/{}/hit.json'
album_api = 'http://imgur.com/ajaxalbums/getimages/{}/hit.json'

with requests.session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0'
    url = page_api.format(0)
    j = s.get(url).json()
    for entry in j['data']:
        if entry['ext'] == '.gif':
            entry['ext'] = '.mp4'
        if entry['is_album']:
            url = album_api.format(entry['hash'])
            j = s.get(url).json()
            for image in j['data']['images']:
                if image['ext'] == '.gif':
                    image['ext'] = '.mp4'
                url = imgur.format(image['hash'], image['ext'])
                print(url)
        else:
            url = imgur.format(entry['hash'], entry['ext'])
            print(url)
So firstly you’ll notice there is no BeautifulSoup.
This is because we’re kind of making “API calls” directly and getting the data in JSON, meaning there is no HTML involved at all.
The {} in our imgur variable is a placeholder for use with str.format(). Each {} gets replaced with the corresponding argument passed to the format() call e.g.
>>> imgur = 'http://i.imgur.com/{}{}'
>>> imgur.format('hash', '.ext')
'http://i.imgur.com/hash.ext'
Obviously in this example we could also use imgur + hash + ext as we just need to append to the end, but perhaps a better example of format() is when you need variable data “inside” a string.
>>> page_api = 'http://imgur.com/r/funny/new/page/{}/hit.json'
>>> page_api.format(0)
'http://imgur.com/r/funny/new/page/0/hit.json'
When making multiple requests with requests you’ll usually want to use a session object to maintain “state” and keep track of cookies.
You’ll also pretty much always want to set the User-Agent header, which we set here to Mozilla/5.0 as the default User-Agent tends to be blocked.
So we get() the first page of results from page/0/hit.json; however, you could use a loop to process multiple pages.
for n in range(3):
    url = page_api.format(n)
    ...
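Fleshed out a little, that loop might look like the generator below. It is only a sketch: iter_entries is a made-up name, the session argument is expected to be a requests session, and it assumes every page returns a data list like the first one did.

```python
PAGE_API = 'http://imgur.com/r/funny/new/page/{}/hit.json'


def iter_entries(session, pages):
    """Yield every entry from the first `pages` result pages, in page order."""
    for n in range(pages):
        j = session.get(PAGE_API.format(n)).json()
        # Assumption: each page has the same {"data": [...]} shape as page 0.
        for entry in j['data']:
            yield entry


# Usage (this makes real HTTP requests):
#   import requests
#   with requests.Session() as s:
#       s.headers['user-agent'] = 'Mozilla/5.0'
#       for entry in iter_entries(s, 3):
#           print(entry['hash'])
```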
We then loop through each entry, changing any .gif extension to .mp4, but you can of course skip this part if you wish.
If entry['is_album'] is true we then generate the URL for the ajaxalbums/getimages “API call” and process the resulting ['data']['images'] list.
There is some slight code duplication here which suggests we could create a function but we’ll leave that for now.
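For what it’s worth, one possible way to factor out that duplication is sketched below. The media_url and entry_urls names are made up, and unlike the script above this version does not modify the entry dicts in place, although the URLs it produces should be the same.

```python
IMGUR = 'http://i.imgur.com/{}{}'
ALBUM_API = 'http://imgur.com/ajaxalbums/getimages/{}/hit.json'


def media_url(item):
    """Direct URL for one image or album image, preferring .mp4 over .gif."""
    ext = '.mp4' if item['ext'] == '.gif' else item['ext']
    return IMGUR.format(item['hash'], ext)


def entry_urls(session, entry):
    """Yield the URL(s) for one results entry, expanding albums."""
    if entry['is_album']:
        album = session.get(ALBUM_API.format(entry['hash'])).json()
        for image in album['data']['images']:
            yield media_url(image)
    else:
        yield media_url(entry)
```

The main loop then reduces to calling entry_urls(s, entry) for each entry in j['data'] and printing whatever it yields.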
The final goal is probably to save the images to disk, which you could implement using requests instead of just printing the URLs.
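A minimal sketch of that last step might look like the following. The save_image name and the choice to name each file after the last part of its URL are assumptions made for illustration; the session is expected to be a requests session, and streaming is used so that large files are not held in memory all at once.

```python
import os


def save_image(session, url, directory='.'):
    """Stream url into directory, naming the file after the last URL segment."""
    path = os.path.join(directory, url.rsplit('/', 1)[-1])
    response = session.get(url, stream=True)
    response.raise_for_status()
    with open(path, 'wb') as f:
        # Write the body in chunks rather than loading it all into memory.
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    return path


# Usage (this makes a real HTTP request):
#   import requests
#   with requests.Session() as s:
#       s.headers['user-agent'] = 'Mozilla/5.0'
#       save_image(s, 'http://i.imgur.com/pNuFUQQ.mp4')
```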