Converting PDF to CSV using tabula-py

June 12, 2020 - Reading time: 9 minutes

source: http://theautomatic.net/2019/05/24/3-ways-to-scrape-tables-from-pdfs-with-python/

tabula-py

tabula-py is a very nice package that lets you both scrape tables from PDFs and convert PDFs directly into CSV files. tabula-py can be installed using pip:

pip install tabula-py

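If you have issues with installation, check the installation notes in the tabula-py documentation; tabula-py is a wrapper around tabula-java, so a Java runtime has to be installed. Once installed, tabula-py is straightforward to use. Below we use it to scrape all the tables from a paper on classification regarding the Iris dataset (the PDF linked in the code below).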

import tabula
 
file = "http://lab.fs.uni-lj.si/lasin/wp/IMIT_files/neural/doc/seminar8.pdf"
 
tables = tabula.read_pdf(file, pages = "all", multiple_tables = True)

The result stored in tables is a list of data frames, one for each table found in the PDF file. To search for all the tables in a file you have to specify the parameters pages = "all" and multiple_tables = True.
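Since the result is a list of data frames, each table can be written to its own file. Here is a minimal sketch, assuming the entries are pandas DataFrames (anything with a to_csv method works). The names save_tables, out_dir and prefix are hypothetical and not part of tabula-py.

```python
from pathlib import Path


def save_tables(tables, out_dir, prefix="table"):
    """Write each data frame in `tables` to its own CSV file; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, df in enumerate(tables, start=1):
        path = out_dir / f"{prefix}_{i}.csv"
        df.to_csv(path, index=False)  # DataFrame.to_csv, without the row index
        paths.append(path)
    return paths


# Hypothetical usage with the list returned by tabula.read_pdf above:
#   save_tables(tables, "iris_tables")
```

Keeping one file per table avoids mixing tables with different columns in a single CSV.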

You can also use tabula-py to convert a PDF file directly into a CSV. The first call below extracts the tables from the first page only, which is the default when pages is not given, and writes them to a CSV. To write the tables from every page of the PDF to the CSV, pass pages = "all".

# output the tables from the first page of the PDF to a CSV
tabula.convert_into(file, "iris_first_table.csv")

# output all the tables in the PDF to a CSV
tabula.convert_into(file, "iris_all.csv", pages = "all")

tabula-py can also scrape all of the PDFs in a directory in just one line of code, and drop the tables from each into CSV files.

tabula.convert_into_by_batch("/path/to/files", output_format = "csv", pages = "all")
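convert_into_by_batch does not return the tables, so it can help to check afterwards which PDFs actually produced an output file. Here is a minimal sketch using only the standard library. It assumes the outputs are written next to each PDF with the same base name and the new extension, which is worth verifying on your version. batch_outputs is a hypothetical helper, not part of tabula-py.

```python
from pathlib import Path


def batch_outputs(directory, extension="csv"):
    """Map each PDF name in `directory` to its output path, or None if missing."""
    results = {}
    for pdf in sorted(Path(directory).glob("*.pdf")):
        out = pdf.with_suffix("." + extension)  # assumed output location
        results[pdf.name] = out if out.exists() else None
    return results


# Hypothetical usage after the batch call above:
#   missing = [name for name, out in batch_outputs("/path/to/files").items() if out is None]
```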

We can perform the same operation, except drop the files out to JSON instead, like below.

tabula.convert_into_by_batch("/path/to/files", output_format = "json", pages = "all")
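The JSON output is more nested than the CSV, so here is a minimal sketch for flattening it back into plain rows. It rests on an assumption about the layout that you should check against a file you produced: each file holds a list of table objects, and each table has a "data" field containing rows of cells with a "text" field. json_tables_to_rows is a hypothetical helper, not part of tabula-py.

```python
import json


def json_tables_to_rows(json_text):
    """Return one list of rows per table; each row is a list of cell strings."""
    tables = json.loads(json_text)
    return [
        # assumed layout: table["data"] is a list of rows of {"text": ...} cells
        [[cell.get("text", "") for cell in row] for row in table.get("data", [])]
        for table in tables
    ]


# Hypothetical usage on one of the files written by the batch call:
#   with open("/path/to/files/example.json") as fh:
#       rows_per_table = json_tables_to_rows(fh.read())
```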

LazyCoderOZ

I am a Linux guy, been around for 20+ years using Linux as my daily driver.
This is my blog on my discoveries and notes so I don't forget how I have done things :)
