Python Program to Read a Book (docx Word Document) & Store It in a DataFrame

Problem Statement:

The following code reads a book stored on your system as a Word (.docx) document and stores its text in a DataFrame in Python.

Solution:

  • Step 1: Convert a PDF book into .docx format.
  • Step 2: Import the necessary libraries in the code.
  • Step 3: Initialize the address (the path to the file to be read) and the dataframe.
  • Step 4: Create a function that takes the address (the path to the file to be read) as input and stores the file's contents in a dataframe.
  • Step 5: Call the function, passing the address as the parameter.
  • Step 6: End.

How Does It Work?

The “docx” module, which is provided by the python-docx package, lets you read and access .docx documents. To install it, run the following at your command prompt (note that the package name on PyPI is python-docx, not docx):

pip install python-docx

We then import this package in our code to access the .docx document. With it, you can open the document and read all the paragraphs of the Word document.
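To see what the package is reading, it helps to know that a .docx file is a ZIP archive whose main text lives in word/document.xml, with each paragraph stored as a <w:p> element. The sketch below uses only the standard library to illustrate that layout. It is not how the docx package is implemented, and the helper names build_tiny_docx and read_paragraphs are hypothetical, introduced only for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def build_tiny_docx(paragraphs):
    """Hypothetical helper: pack paragraphs into a bare-bones archive holding
    only word/document.xml (enough for this demo, not a file Word would open)."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(text)}</w:t></w:r></w:p>" for text in paragraphs
    )
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer

def read_paragraphs(docx_file):
    """Hypothetical helper: return the text of every <w:p> element found."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # a paragraph's text may be split across several <w:t> elements
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return paragraphs

sample = build_tiny_docx(["First paragraph.", "Second paragraph."])
print(read_paragraphs(sample))
```

In practice the docx package handles these details for you, which is why the program only needs to open the file and loop over its paragraphs.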

Program/Code To Read Paragraphs in Word Docx:

import pandas as pd
from docx import Document

address = 'H:/Work/Practice/OOW/1st_text/Heidi_w.docx'  # path to the file on your system

def doc_to_df(address):  # define a function
    document = Document(address)  # open the document
    text_chunks = []  # empty list that collects the paragraph text
    for paragraph in document.paragraphs:  # read each paragraph and append its text to the list
        text_chunks.append(paragraph.text)
    return pd.DataFrame({'text': text_chunks})  # one row per paragraph, in a column named 'text'

df = doc_to_df(address)  # call the function and keep the resulting DataFrame

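The body text of a Word document is exposed as a sequence of paragraphs through document.paragraphs. Text inside tables, headers, and footers is not part of that sequence. The for loop reads every paragraph and appends its text to text_chunks, so each paragraph of the book ends up as one entry, and therefore one row of the dataframe.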
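Word files often contain empty paragraphs that are used only for spacing, and these would show up as empty rows. The following is a minimal sketch of one way to clean the paragraph list while holding it in a pandas DataFrame. It assumes pandas is installed; the helper chunks_to_df, the column names 'text' and 'n_words', and the sample list standing in for text_chunks are illustrative choices, not part of the original program.

```python
import pandas as pd

def chunks_to_df(text_chunks):
    """Hypothetical helper: build a DataFrame with one row per non-empty paragraph."""
    df = pd.DataFrame({"text": text_chunks})
    # drop paragraphs that are empty or contain only whitespace
    df = df[df["text"].str.strip() != ""].reset_index(drop=True)
    # simple whitespace-based word count for each paragraph
    df["n_words"] = df["text"].str.split().str.len()
    return df

# sample data standing in for the list of paragraph texts read from the book
sample_chunks = ["Chapter 1", "", "A short sample paragraph.", "   "]
df = chunks_to_df(sample_chunks)
print(df)
```

Keeping one paragraph per row makes later steps, such as filtering by length or searching for a phrase, straightforward column operations.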
