Wednesday, April 23, 2014

PowerShell example to convert text on PDF to MP3 (OCR)

My wife mentioned that she wanted a device that scans books/documents and converts them to an mp3 file.  She is on the road a fair bit and would like to have work material read to her while she is driving.
When she said that the device cost $3000 I was floored; I didn't think it would be all that hard to throw something together that would accomplish this task.  I threw this example together in PowerShell but will create a more complete and accurate version in C#.
The steps below describe the complete process of going from a PDF to an MP3.  If you only want to pull the text from a jpg, run steps 1 and 6; if you only want to convert a wav file to MP3, run steps 3 and 8; for PDF to jpg conversion, steps 2 and 4 are needed.


Installation Scripts
The first 3 scripts download and install the assemblies/programs needed to convert your pdf(s) to an MP3 audio file.  (All downloads use the 64 bit versions; if your system is not 64 bit you will have to download the 32 bit assemblies/programs manually.)  The scripts create a 'programming' folder and install these programs to that directory.  They also install 7-zip, if it is not already installed, in order to unpack these tools.  Each script contains a commented section at the end which removes the installed package when run.  None of the commented sections removes 7-zip; this needs to be done manually via Add/Remove Programs.
Step 1. Install TessNet2 to convert image to text.
Step 2. Install GhostScript to convert PDF to JPG.
Step 3. Install ffmpeg to convert Wav file to MP3.
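
The checks the install scripts depend on can be sketched roughly as follows.  This is a minimal sketch and not the code from the linked scripts: the function name Initialize-ToolFolder is hypothetical, and the location of the 'programming' folder under $home is my assumption for illustration.  It warns when the system is not 64 bit and makes sure the folder exists.

```powershell
# Hypothetical helper; not part of the linked install scripts.
function Initialize-ToolFolder {
    param(
        # Assumed location: the install scripts only say the folder is called 'programming'.
        [string]$Root = "$home\programming"
    )

    # The downloads are 64 bit builds, so warn early on a 32 bit system.
    # Is64BitOperatingSystem needs .NET 4 or later.
    if (-not [Environment]::Is64BitOperatingSystem) {
        Write-Warning "This system is not 64 bit; download the 32 bit tools manually."
    }

    # Create the folder only when it is missing so the function can be re-run safely.
    if (-not (Test-Path -LiteralPath $Root)) {
        New-Item -ItemType Directory -Path $Root | Out-Null
    }

    # Return the folder so callers can build install paths from it.
    Get-Item -LiteralPath $Root
}
```

Calling Initialize-ToolFolder with no arguments uses the assumed default; pass -Root to point it somewhere else.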


The rest of the scripts (functions) need to be loaded/run whenever PowerShell is reopened; they remain in memory for the life of the PowerShell session.
Step 4. PowerShell function that utilizes GhostScript to convert a PDF to an image.
Step 5. PowerShell function that alters an image; in this example I use it to pull out a specific part of an image.
Step 6. PowerShell functions that utilize TessNet2 to pull the text from the image. (OCR)
Step 7. PowerShell function that converts text to a wav file.
Step 8. PowerShell function that converts the wav file to an MP3.
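
To give a feel for what Steps 4, 7 and 8 involve, here is a minimal sketch under some assumptions.  The function names (Get-GhostscriptArguments, Save-TextAsWav, Get-FfmpegArguments) are hypothetical and are not the functions from the linked scripts, which may work differently.  The Ghostscript and ffmpeg switches are the standard command line options of those tools, and the wav step assumes the System.Speech assembly that ships with the .NET Framework on Windows.

```powershell
# Hypothetical helpers sketching Steps 4, 7 and 8; names and defaults are
# illustrative and are not the functions defined in the linked scripts.

function Get-GhostscriptArguments {
    param(
        [Parameter(Mandatory = $true)][string]$PdfPath,
        [Parameter(Mandatory = $true)][string]$OutputPattern,  # e.g. C:\scans\page-%03d.jpg
        [int]$Resolution = 300
    )
    # Run non-interactively and render each page of the pdf as a JPEG.
    @('-dNOPAUSE', '-dBATCH', '-dSAFER', '-sDEVICE=jpeg',
      "-r$Resolution", "-sOutputFile=$OutputPattern", $PdfPath)
}

function Save-TextAsWav {
    param(
        [Parameter(Mandatory = $true)][string]$Text,
        [Parameter(Mandatory = $true)][string]$WavPath,
        [string]$VoiceName = 'Zira'
    )
    # System.Speech is Windows only.
    Add-Type -AssemblyName System.Speech
    $synth = New-Object System.Speech.Synthesis.SpeechSynthesizer
    try {
        # Pick the first installed voice whose name contains $VoiceName;
        # fall back to the default voice when nothing matches.
        $voice = $synth.GetInstalledVoices() |
            Where-Object { $_.VoiceInfo.Name -like "*$VoiceName*" } |
            Select-Object -First 1
        if ($voice) { $synth.SelectVoice($voice.VoiceInfo.Name) }

        $synth.SetOutputToWaveFile($WavPath)
        $synth.Speak($Text)
    }
    finally {
        # Disposing closes the wav file.
        $synth.Dispose()
    }
}

function Get-FfmpegArguments {
    param(
        [Parameter(Mandatory = $true)][string]$WavPath,
        [Parameter(Mandatory = $true)][string]$Mp3Path
    )
    # -y overwrites an existing output file; ffmpeg picks the MP3 format from the extension.
    @('-y', '-i', $WavPath, $Mp3Path)
}
```

The two argument helpers only build the argument lists.  Assuming gswin64c.exe and ffmpeg.exe are on the path (and that the ffmpeg build includes an MP3 encoder), they could be run as & gswin64c.exe $gsArgs and & ffmpeg.exe $ffArgs.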

After running each of the above scripts you can give the example below a try.  The example script grabs a scanned pdf of a book page from the user's Documents library (Scan0002.pdf) and converts the text on the page to an mp3 file, which is saved in the user's Music library.

 #Scan0002.pdf is a sample document that can be downloaded from here http://drive.google.com/file/d/0B0ZMfV3A7dAhbTZPOXFENkluUGs/edit?usp=sharing
 #To convert your own pdf to mp3 specify the location of the pdf below.  
 $original = Get-ChildItem "$home\Documents\Scan0002.pdf"
 #convert-PdfToImage converts the pdf to jpg; the new jpg file keeps the base name of the pdf and uses jpg as the extension.
 #Tweaking the resolution may give better results depending on your scanner's quality.
 $jpg = convert-PdfToImage -PDFLocation $original -Resolution 300  

 $clean = "$home\Documents\scan0002Clean.jpg"  
 #Use the Update-ImageColor function to create an image that only contains the page itself. 
 #Cropping out the scanner bed removes a lot of noise and improves the OCR accuracy. 
 
 #The parameters for Update-ImageColor are hard coded for the example page from this book; pages from a different book may need different parameters to match the actual page size.   
 #I think it will be fun to create a PowerShell script that recognizes the page size and crops to it automatically so no parameters are needed; I'll post it when I create it.
 $void = Update-ImageColor -CurrentFileName $jpg[$jpg.Count-1].FullName -NewFileName $clean -WidthStartPosition 125 -Width 1610 -Height 2400 -HeightStartPosition 160  
 #Get each word, and all of each word's properties (location, accuracy confidence, size, etc.), from the jpg image.
 $PageWords = Get-TextFromImage $clean  
 [string]$Sentences = ""  
 #Concatenate the words into one long string.
 $PageWords | %{$Sentences += $_.text + " "} 
 #Words that end with '-' are words that continue on the next line, so remove '- ' to concatenate the word.
 $Sentences = $Sentences.Replace("- ","")  
 #Convert $Sentences to a wav file.  We are using Zira's voice; to use a different voice, select one of the voices 
 #that display when you type the -Voice parameter. 
 Convert-TextToWavFile -Text $Sentences -OutputLocation "$home\Music\BookReading.wav" -Voice Zira  
 #Convert the wav file to an mp3 file and delete the wav file.
 Convert-Audio -path "$home\Music" -Source "BookReading.wav" -DeleteOriginal $true  
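
One thing to watch in the example above is that Replace("- ","") removes every "- " in the text, including a dash that stands on its own between two words.  Below is a minimal sketch of a slightly stricter version.  The function name Join-OcrWords is hypothetical, and it takes a plain list of strings rather than the word objects returned by Get-TextFromImage.  It only joins when the hyphen directly follows a letter and the next word starts with a lower case letter.

```powershell
# Hypothetical helper; takes plain strings, not the OCR word objects from the example.
function Join-OcrWords {
    param(
        [Parameter(Mandatory = $true)][string[]]$Words
    )

    # Put the words back together with single spaces between them.
    $text = $Words -join ' '

    # Remove "- " only when it sits between a letter and a lower case letter,
    # e.g. "exam- ple" becomes "example" while "yes - no" is left alone.
    # -creplace is the case sensitive form, so \p{Ll} really means lower case.
    $text -creplace '(?<=\p{L})- (?=\p{Ll})', ''
}
```

With the example script this could be called as Join-OcrWords -Words ($PageWords | ForEach-Object { $_.text }).  A word that is genuinely hyphenated and happens to break at the end of a line (for example "well-" followed by "known") will still be joined without its hyphen, so this is an improvement rather than a complete fix.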


Note: if you can scan the page in as a jpg instead of a pdf, you do not have to use the convert-PdfToImage function, which speeds up the process and avoids any image 'noise' that may be introduced by the pdf to jpg conversion.  I noticed a slight improvement in quality when skipping the pdf to jpg conversion.
If you want, you can try the jpg file of the same page and skip the call to convert-PdfToImage to see the difference.  The page is located here: https://drive.google.com/file/d/0B0ZMfV3A7dAhWlZWdWQ0M2VNaGM/edit?usp=sharing

Food For Thought
If you are scanning many documents to convert to an MP3 and you want to automate the above process, you could set a FileSystemWatcher over the scan directory and set it up so that if no new documents have been scanned for 5 minutes it moves them all to a "Converted" directory and kicks off converting all the documents/images to 1 MP3 file.  To see an example of a PowerShell script that uses FileSystemWatcher you could view this PowerShell example script: http://programmingthisandthat.blogspot.ca/2014/04/create-delegates-that-will-be-used-when.html
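
The "no new scans for 5 minutes" test can be sketched on its own, separate from the watcher.  This is a minimal sketch that checks file timestamps instead of reacting to watcher events (System.IO.FileSystemWatcher in .NET), so it is a simpler polling approach and not the watcher setup from the linked script.  The function name Test-ScanFolderIdle and its parameters are hypothetical.

```powershell
# Hypothetical helper; a polling check rather than a file system watcher.
function Test-ScanFolderIdle {
    param(
        [Parameter(Mandatory = $true)][string]$Path,
        [int]$QuietMinutes = 5,
        # Overridable so the check can be exercised without waiting.
        [datetime]$Now = (Get-Date)
    )

    # Only files count; ignore subfolders such as "Converted".
    $files = @(Get-ChildItem -LiteralPath $Path | Where-Object { -not $_.PSIsContainer })

    # An empty folder has nothing to convert, so it is never reported as ready.
    if ($files.Count -eq 0) { return $false }

    # The folder is idle when even the newest file is older than the quiet period.
    $newest = ($files | Sort-Object LastWriteTime -Descending | Select-Object -First 1).LastWriteTime
    return (($Now - $newest).TotalMinutes -ge $QuietMinutes)
}
```

A scheduled task or a simple loop with Start-Sleep could call this every minute and, once it returns $true, move the files to the "Converted" directory and start the conversion.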
