How to Solve Simple Captchas using Python Tesseract

Share:

how-to-solve-captchas-using-python-tesseract

CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. As the acronym suggests, it is a test used to determine whether the user is human or not. A typical captcha consists of a distorted test, which a computer program cannot interpret but a human can (hopefully) still read. This tutorial will show you how to bypass simple captchas and how to solve captcha code using an OCR in Python.

What is an OCR?

Optical Character Recognition, or OCR, is the recognition of printed or written characters by a computer. It enables you to convert different types of documents, such as scanned paper documents, PDF files, or images captured by a digital camera into editable and searchable data.

Popular open-source OCR tools are Tesseract, GOCR, and Ocrad. We will use Tesseract for this tutorial.

Tesseract

Tesseract is an open source OCR engine for various operating systems. It’s considered one of the most accurate OCR engines currently available, with the precision depending on the clearness of the image. Google has sponsored its development since 2006.

PyTesseract

Python-Tesseract is a Python wrapper that helps you use Tesseract-OCR engine to convert images to the accepted format from Python. It can read all image types – png, jpeg, gif, tiff, bmp, etc.

Using Tesseract to bypass Captchas

Tesseract is designed to read regular printed text. If we want to use Tesseract effectively, we will need to modify the captcha images to remove the background noise, isolate the text, and then pass it over to Tesseract to recognize the captcha.

Below are the package requirements for this tutorial in python.

  • Python 3.0 ( https://www.python.org/downloads/ )
  • PIP to install the following packages in Python ( https://pip.pypa.io/en/stable/installing/)
  • Tesseract-OCR, installation instructions for Tesseract are available at (https://github.com/tesseract-ocr/tesseract/wiki)
  • PyTesseract,  requires Python Imaging Library(PIL) and python version 2.5 or later. To install it use pip install pytesseract
  • Python Imaging Library (PIL), for adding image processing capabilities to your Python interpreter and to support library formats. Install it using pip install pil
  • ImageMagic Tools for processing and resampling image. Find the install instructions here https://www.imagemagick.org/script/download.php

To know more about how to install PyTesseract with Tesseract, read here.

For this tutorial, we will show you how to solve captchas. If the captchas we are trying to interpret are not difficult or messy we can make use of PyTesseract to bypass the captcha.

As an example, we will use the following captcha image

bypass-captcha

After resampling, the image will look like this:

how-to-bypass-captcha

Code

The script below can recognize the captcha and read the captcha image.

https://gist.github.com/scrapehero/b85a280dc0d993f665c91e0332cf618f

If the embed above does not work you can download the script from this link.

The script is named captcha_resolver.py. To run this script in command prompt or terminal you must type in the script name followed by the name of the captcha image as shown below.

python captcha_resolver.py cap.jpg

This will give the output as

Resolving Captcha
Resampling the Image
('Extracted Text', 'Viearer')

This code can resolve on how to solve captchas with sufficient clarity like the one we have just shown above. Let us know in the comments how this script worked for you or if you have a better solution.

If you need professional help with scraping complex websites, contact us by filling up the form below.

Tell us about your complex web scraping projects

Turn the Internet into meaningful, structured and usable data



Please DO NOT contact us for any help with our Tutorials and Code using this form or by calling us, instead please add a comment to the bottom of the tutorial page for help

Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.

Table of content

Scrape any website, any format, no sweat.

ScrapeHero is the real deal for enterprise-grade scraping.

Ready to turn the internet into meaningful and usable data?

Contact us to schedule a brief, introductory call with our experts and learn how we can assist your needs.

Continue Reading

Using NLP to clean and structure scraped data

How to Use NLP to Clean and Structure Scraped Data

Learn how to use NLP to clean and structure scraped data.
Search engine web crawling

From Crawling to Ranking! This is How Search Engines Use Web Crawling to Index Websites!

Search engine crawling indexes web pages, making it essential for ranking and visibility in search results.
Scrape Yelp Reviews

Need to Scrape Yelp Reviews? Check Out This Tutorial

Learn how you can scrape Yelp reviews using Selenium.
ScrapeHero Logo

Can we help you get some data?