Convert PDF to TEXT

October 6, 2016

An Azure ML experiment that converts PDF to text using a Python script.

**Use case**: I needed to extract text from PDF files in order to run text analytics on the extracted text, and I needed to do it within Azure ML.

> **Note:** You do not need to download pdfminer on your machine. When you open the experiment in Studio (by clicking the "Open in Studio" button at the right), all the required modules will be available in your workspace via the "Both_CustomPythonTool_Azure_sdk" dataset.

**Methodology**: To this end I wanted to use [pdfminer](https://pypi.python.org/pypi/pdfminer/), a tool for extracting information from PDF documents. It is written entirely in Python. I wanted to run this tool inside Azure ML, using Azure ML's built-in capability to run Python scripts, so that I could leverage the native text analytics modules in Azure ML. However, the modules required for this tool are not installed on Azure ML.

![image](https://raw.githubusercontent.com/shaheeng/ShaheenGauher/master/SharedPublic/TextAnalytics_AzureML2.png)

Fig 1. Text Analytics modules in Azure ML

There is a great blog post [here](https://blogs.technet.microsoft.com/machinelearning/2016/02/25/running-compiled-code-on-azure-ml-in-r-and-python/) which shows how to source external dependencies, such as Python modules, which are not already installed on Azure ML and require code compilation. A brief description of the process I followed is given below. You can skip all these steps, however, and simply get all the modules required to run the script in your workspace by clicking the "Open in Studio" button at the right.

As a first step, I installed pdfminer on my machine. After installation, the location of each package can be found using "import site; site.getsitepackages()" in Python (the .libPaths() command in R). On my machine it was at C:\Anaconda\Lib\site-packages\ where I found the following.

![image](https://raw.githubusercontent.com/shaheeng/ShaheenGauher/master/SharedPublic/pdfminerdep.PNG)

Fig 2. Dependencies for pdfminer

I collected the above into a folder and named it "Both_CustomPythonTool_Azure_sdk". In this example I also needed the [azure storage sdk](https://github.com/Azure/azure-storage-python) for Python in order to read from and write to blob storage, so I added the contents of the Azure SDK to the folder before zipping it up. I uploaded the zipped file ("Both_CustomPythonTool_Azure_sdk.zip") to Azure ML as a dataset. In the experiment it is connected to the third input port of the Execute Python Script module, as shown below.

![image](https://raw.githubusercontent.com/shaheeng/ShaheenGauher/master/SharedPublic/expscreenshot_pdf2text.PNG)

Fig 3. Screenshot of experiment in Azure ML

If a zip file is connected to the third input port, it is unzipped under ".\Script Bundle". This directory needs to be added to sys.path. Within my Execute Python Script module, I added the following lines:

    import os
    import sys
    path = os.path.dirname("./Script Bundle/") + "/Both_CustomPythonTool_Azure_sdk/"
    sys.path.append(path)

Now all the modules needed to execute my function are available. Within the Execute Python Script module you will find the function pdf2text, which accepts a PDF file and returns the extracted text in a text file.

![image](https://raw.githubusercontent.com/shaheeng/ShaheenGauher/master/SharedPublic/inputpdf2textb.png)

Fig 4. Accepted Input

The experiment accepts as input the location of the PDF file. You can store the PDF files in an Azure storage account and provide the credentials for that account (container name, storage account name and storage access key). Alternatively, the location of the PDF can be specified via a URL. The output from the experiment is a dataframe with one column containing the extracted text.

Contributed by a Microsoft Employee.
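The Script Bundle step described above can be tried out locally before uploading anything. What follows is only a minimal sketch under stated assumptions, not the experiment's actual script: it builds a tiny zip, extracts it into a "Script Bundle" folder the way Azure ML is described as doing, and then adds the inner folder to sys.path. The helper name `add_bundle_to_path` and the module `demo_dep` are hypothetical stand-ins (for the path lines and for pdfminer with its dependencies, respectively).

```python
import importlib
import os
import sys
import tempfile
import zipfile


def add_bundle_to_path(bundle_root, package_folder):
    """Append <bundle_root>/<package_folder> to sys.path and return that path.

    Hypothetical helper: it wraps the two path lines from the write-up and
    adds a check that the folder really exists.
    """
    path = os.path.join(bundle_root, package_folder)
    if not os.path.isdir(path):
        raise FileNotFoundError("bundle folder not found: " + path)
    if path not in sys.path:
        sys.path.append(path)
    return path


# Local simulation only. In Azure ML the zip is uploaded as a dataset and
# unzipped for you; here both steps are done by hand in a temp directory.
work = tempfile.mkdtemp()
zip_path = os.path.join(work, "Both_CustomPythonTool_Azure_sdk.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    # demo_dep is a made-up module standing in for the real dependencies.
    zf.writestr("Both_CustomPythonTool_Azure_sdk/demo_dep.py", "VALUE = 42\n")

bundle_root = os.path.join(work, "Script Bundle")
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(bundle_root)

added = add_bundle_to_path(bundle_root, "Both_CustomPythonTool_Azure_sdk")
importlib.invalidate_caches()  # make sure the new path entry is scanned
demo_dep = importlib.import_module("demo_dep")
print(demo_dep.VALUE)
```

Checking that the folder exists before touching sys.path is a design choice of this sketch: a wrong folder name inside the zip then fails immediately with a clear message instead of surfacing later as an ImportError.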