source: http://theautomatic.net/2019/05/24/3-ways-to-scrape-tables-from-pdfs-with-python/
tabula-py is a very nice package that lets you scrape tables from PDFs and convert PDFs directly into CSV files. tabula-py can be installed using pip:
    pip install tabula-py
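A common source of trouble at this stage is a missing Java runtime: tabula-py is a wrapper around tabula-java, so it needs Java available on the system. The snippet below is a minimal sketch for checking that, assuming the `java` executable is expected on the PATH. The helper names `java_available` and `describe_java` are made up for illustration and are not part of tabula-py.

```python
import shutil


def java_available(which=shutil.which):
    """Return True if a `java` executable can be found on the PATH.

    `which` is injectable only to make the helper easy to test.
    """
    return which("java") is not None


def describe_java():
    """Return a human-readable status message; does not start Java itself."""
    if java_available():
        return "Java found at " + shutil.which("java")
    return "Java not found on PATH; tabula-py will likely fail to run"
```

Calling `print(describe_java())` before the first extraction gives a quick hint about whether the Java side is in place.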
If you have issues with installation, check this. Once installed, tabula-py is straightforward to use. Below we use it to scrape all the tables from a paper on classification of the Iris dataset (available here).
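    import tabula

    # placeholder: path or URL of the Iris classification paper PDF
    file = "iris_paper.pdf"

    # returns a list of data frames, one per table found
    tables = tabula.read_pdf(file, pages = "all", multiple_tables = True)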
The result stored in tables is a list of data frames, one for each table found in the PDF file. To search for all the tables in a file you have to specify the parameters pages = "all" and multiple_tables = True.
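Since tables is an ordinary Python list of pandas data frames, it can be inspected like any other list. As a rough sketch, the hypothetical helpers below (not part of tabula-py) report the size of each table, which is a quick way to spot empty or badly parsed ones. They rely only on the `shape` attribute that pandas data frames provide.

```python
def summarize_tables(tables):
    """Return one (index, n_rows, n_cols) tuple per extracted table."""
    return [(i, t.shape[0], t.shape[1]) for i, t in enumerate(tables)]


def non_empty(tables):
    """Keep only the tables that contain at least one row."""
    return [t for t in tables if t.shape[0] > 0]


# Typical use, assuming `tables` came from tabula.read_pdf:
# for index, n_rows, n_cols in summarize_tables(tables):
#     print(f"table {index}: {n_rows} rows x {n_cols} columns")
```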
You can also use tabula-py to convert a PDF file directly into a CSV. The first call below only looks at the first page of the PDF, which is the default, and writes the tables found there to a CSV. If we add the parameter pages = "all", we can write the tables from every page of the PDF to the CSV.

    # output the tables on the first page of the PDF to a CSV
    tabula.convert_into(file, "iris_first_table.csv")

    # output all the tables in the PDF to a CSV
    tabula.convert_into(file, "iris_all.csv", pages = "all")
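convert_into puts everything it extracts into a single output file. If one CSV per table is preferred, one alternative is to take the list returned by read_pdf and call each data frame's `to_csv` method. The sketch below assumes that approach. `save_tables` and its file naming scheme are hypothetical choices, not something tabula-py provides.

```python
from pathlib import Path


def save_tables(tables, out_dir, stem="table"):
    """Write each data frame to its own CSV file and return the paths.

    Files are named <stem>_1.csv, <stem>_2.csv, ... (an arbitrary scheme).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, table in enumerate(tables, start=1):
        path = out_dir / f"{stem}_{number}.csv"
        table.to_csv(path, index=False)  # pandas DataFrame.to_csv
        paths.append(path)
    return paths


# Typical use, assuming `tables` came from tabula.read_pdf:
# save_tables(tables, "iris_tables", stem="iris")
```

Numbering the files in extraction order keeps them easy to match back to the tables as they appear in the PDF.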
tabula-py can also scrape all of the PDFs in a directory in just one line of code, and drop the tables from each into CSV files.
    tabula.convert_into_by_batch("/path/to/files", output_format = "csv", pages = "all")
We can perform the same operation but write the tables out to JSON files instead, like below.
    tabula.convert_into_by_batch("/path/to/files", output_format = "json", pages = "all")
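After a batch run it can be useful to confirm that every PDF actually produced an output file. The sketch below assumes the converted files are written next to each PDF with the same base name and the new extension, which is how the batch mode appears to behave. It is worth verifying this against your installed version. The helper names are hypothetical and only the standard library is used.

```python
from pathlib import Path


def expected_outputs(input_dir, output_format="csv"):
    """Map each PDF in input_dir to the output path we expect for it.

    Assumption: output sits beside the PDF, e.g. report.pdf -> report.csv.
    """
    pdfs = sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    return {pdf: pdf.with_suffix("." + output_format) for pdf in pdfs}


def missing_outputs(input_dir, output_format="csv"):
    """Return the PDFs whose expected output file does not exist."""
    expected = expected_outputs(input_dir, output_format)
    return [pdf for pdf, out in expected.items() if not out.exists()]


# Typical use after a batch conversion:
# for pdf in missing_outputs("/path/to/files", "json"):
#     print("no output for", pdf.name)
```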