Python Program to Read a Book (docx Word Document) & Store It in a DataFrame
Problem Statement:
The following code reads a book stored on the system as a Word document (.docx) and stores its text in a DataFrame in Python.
Solution:
- Step 1: Convert the PDF book into .docx format.
- Step 2: Import the necessary libraries in the code.
- Step 3: Initialize the address (path to the file to be read) and the dataframe.
- Step 4: Create a function that takes the address (path to the file to be read) as input and stores the file's text in a dataframe.
- Step 5: Call the function by passing the address as the parameter.
- Step 6: End.
How Does It Work?
The python-docx package lets Python read and access .docx documents. It is installed under the name python-docx but imported under the name docx. To install it, run the following at your command prompt:
pip install python-docx
Then we import this package in our code to access the .docx document. Using it, you can open the document and read all the paragraphs of the Word document.
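To make "reading the paragraphs" more concrete, here is a rough sketch that uses only the Python standard library. It relies on the fact that a .docx file is a zip archive whose main text lives in word/document.xml, with one w:p element per paragraph. This is an illustration of the file format, not the package's own code. The names read_paragraphs and make_demo_docx are hypothetical, and the demo archive is a stripped-down stand-in, not a complete Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W_URI = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_NS = "{" + W_URI + "}"


def read_paragraphs(docx_file):
    """Return the text of each top-level paragraph in word/document.xml.

    docx_file may be a path or a binary file-like object.
    Tabs, line breaks and table contents are ignored in this sketch.
    """
    with zipfile.ZipFile(docx_file) as archive:
        xml_bytes = archive.read("word/document.xml")
    body = ET.fromstring(xml_bytes).find(W_NS + "body")
    paragraphs = []
    for p in body.findall(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements.
        paragraphs.append("".join(t.text or "" for t in p.iter(W_NS + "t")))
    return paragraphs


def make_demo_docx(texts):
    """Build an in-memory archive holding only word/document.xml (demo helper).

    The texts must not contain XML special characters such as & or <.
    """
    body = "".join(f"<w:p><w:r><w:t>{t}</w:t></w:r></w:p>" for t in texts)
    xml = f'<w:document xmlns:w="{W_URI}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


demo = make_demo_docx(["First paragraph.", "Second paragraph."])
print(read_paragraphs(demo))  # ['First paragraph.', 'Second paragraph.']
```

In practice the package handles these details, plus many this sketch skips, so the sketch is only meant to show where the paragraphs come from.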
Program/Code To Read Paragraphs in Word Docx:
import pandas as pd
from docx import Document

# Path to the file on your system
address = 'H:/Work/Practice/OOW/1st_text/Heidi_w.docx'

def doc_to_df(address):
    """Read every paragraph of the document into a one-column DataFrame."""
    document = Document(address)  # open the document
    text_chunks = []  # list that collects the text of each paragraph
    for paragraph in document.paragraphs:  # read each paragraph in document order
        text_chunks.append(paragraph.text)
    return pd.DataFrame({'text': text_chunks})  # one row per paragraph

df = doc_to_df(address)  # call the function
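In a Word document, the body text is represented as a sequence of paragraphs. The for loop therefore reads each paragraph in turn and appends its text to text_chunks. Note that text inside tables is not part of document.paragraphs, so it is not picked up by this loop.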
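Books converted from PDF often contain many empty paragraphs. As a minimal sketch of one way to handle that, the function below takes paragraph strings that have already been read and builds a DataFrame while skipping blank entries. The function name paragraphs_to_df, the column name "text" and the sample strings are all assumptions made for illustration. It takes plain strings so that it can be tried without a .docx file.

```python
import pandas as pd


def paragraphs_to_df(paragraph_texts):
    """Build a one-column DataFrame from paragraph strings, skipping blank ones."""
    cleaned = [text.strip() for text in paragraph_texts if text.strip()]
    return pd.DataFrame({"text": cleaned})


# Made-up sample input standing in for the paragraphs of a book.
sample = ["Chapter 1", "", "First paragraph of the book.", "   "]
df = paragraphs_to_df(sample)
print(df.shape)  # (2, 1)
```

With a real document, the input could be built as [p.text for p in Document(address).paragraphs].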