Helper library to stringify binary document formats
  Description 
This extension serves as a helper to stringify documents so that they can be indexed
by full-text search engines such as Foswiki:Extensions/KinoSearchContrib or Foswiki:Extensions/SolrPlugin.
It supports all major office document formats, i.e. MS-Office and OpenDocument formats.
StringifierContrib is organized into plugins, each of which serializes a document format by delegating the work to a suitable backend.
For some formats there are alternative backends to choose from. For example, a DOC file can be serialized
by any of abiword, antiword, catdoc, soffice or wvWare. Use the one that best serves your needs and
is available on your platform. For instance, soffice is a very good choice as a document converter,
but using it is rather performance demanding. The simpler backends suffice most of the time, but the
quality of the extracted text may be lower.
  Backends for Word Documents 
To index Word documents (.doc) you will need to install one of the following:
 
-  antiword
-  abiword
-  catdoc
-  soffice
-  wvWare
  Backend for PDF 
To index .pdf files you need to install poppler-utils.
  Backend for PPT 
To index .ppt files you may select one of the following:
 
-  ppthtml
-  catdoc
  Backend for HTML 
There are a couple of options to index .html files:
 
-  w3m
-  lynx
-  links
-  html2text
Some of the other backends generate temporary HTML, which is then converted
to plain text using the html stringifier.
  Backends for DOCX, PPTX, XLSX 
To index these file types, you will need to install the following tools:
  
Then set the command path to these tools in configure.
  Backend for OpenDocument and Staroffice documents 
To index these file types you need to install odt2txt.
  Backend for E-books 
To index common ebook file formats you need to install calibre.
  Installing the Contrib 
You do not need to install anything in the browser to use this extension. The following instructions are for the administrator who installs the extension on the server.
Open configure, and open the "Extensions" section. "Extensions Operation and Maintenance" Tab → "Install, Update or Remove extensions" Tab. Click the "Search for Extensions" button.
Enter part of the extension name or description and press search. Select the desired extension(s) and click install. If an extension is already installed, it will not show up in the search results.
You can also install from the shell by running the extension installer. Be sure to run it as the web server user, not as root:
cd /path/to/foswiki
perl tools/extension_installer <NameOfExtension> install
If you have any problems, or if the extension isn't available in configure, then you can still install manually from the command-line. See
https://foswiki.org/Support/ManuallyInstallingExtensions for more help.
  Configuration 
There are a number of settings that need to be set in configure before you can use the Contrib.
  Test of the Installation 
 
-  Test if the installation was successful: 
-  Check that antiword, abiword or wvHtml is in place: Type antiword, abiword or wvHtml at the prompt and check that the command exists.
-  Check that pdftotext is in place: Type pdftotext at the prompt and check that the command exists.
-  Check that ppthtml is in place: Type ppthtml at the prompt and check that the command exists.
-  stringify some files (see below)
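
The checks above can also be scripted. The following is a minimal sketch in POSIX shell and is not part of the Contrib: the function names first_available and report are hypothetical, and the command lists are simply the backends named in this document (wvText is listed next to wvHtml because the change history mentions that switch in 2.10). It only reports what is found on the PATH of the current user; it does not check that configure points at the same binaries.

```shell
#!/bin/sh
# Hypothetical helper: print the first command from the argument list
# that exists on PATH, or "none" if no candidate is installed.
first_available() {
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
            return 0
        fi
    done
    printf 'none\n'
    return 1
}

# Hypothetical helper: print "<label>: <first available backend>".
report() {
    label=$1
    shift
    printf '%-6s %s\n' "$label:" "$(first_available "$@")"
}

# Candidate backends per file type, as named in this document.
report doc  antiword abiword catdoc soffice wvHtml wvText
report pdf  pdftotext
report ppt  ppthtml catdoc
report html w3m lynx links html2text
report odt  odt2txt
report xlsx xlsx2csv
report tiff tesseract
```

Run it as the web server user, since that is the account whose PATH matters when attachments are indexed. A line ending in "none" means no candidate for that file type was found.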
 
  Test of Stringification with stringify 
Some users report problems with stringification: the stringifier script
fails or takes too long on attachments. Sometimes this results from
installation errors, especially in the installation of the stringification
backends.
stringify gives you the opportunity to test stringification in advance.
Usage: 
stringify file_name
The output shows which stringifier was used and the text it
extracted.
Example:
stringify /path/to/foswiki/StringifierContrib/test/unit/StringifierContrib/attachement_examples/Simple_example.doc
Simple example  
Keyword: dummy  
Umlaute: Grober, Uberschall, Anderung
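
To try stringify on many attachments at once, a small wrapper can help spot files that fail or run too long. This is a sketch under several assumptions: STRINGIFY must point at the stringify script of your installation, the 60 second limit is arbitrary, the function names check_file and check_dir are hypothetical, and it is assumed (not verified here) that stringify exits with a non-zero status when it fails. It relies on the timeout command from GNU coreutils.

```shell
#!/bin/sh
# Assumption: path to the stringify script; override via the environment.
STRINGIFY=${STRINGIFY:-stringify}
# Assumption: seconds allowed per file before it counts as a failure.
LIMIT=${LIMIT:-60}

# Run stringify on one file, discarding its output.
# Prints "ok <file>" on success, "FAILED <file>" on error or timeout.
check_file() {
    if timeout "$LIMIT" "$STRINGIFY" "$1" >/dev/null 2>&1; then
        printf 'ok %s\n' "$1"
    else
        printf 'FAILED %s\n' "$1"
        return 1
    fi
}

# Check every regular file directly inside a directory
# and print a summary line with the number of failures.
check_dir() {
    failed=0
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        check_file "$f" || failed=$((failed + 1))
    done
    printf '%d file(s) failed\n' "$failed"
}

# Usage (after sourcing this file or pasting the functions):
#   check_dir /path/to/attachments
```

Files reported as FAILED can then be passed to stringify one by one to see the actual error output.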
  Further Development 
In this extension, a plug-in mechanism is implemented, so that additional
stringifiers can be added without changing the existing code. All stringifier
plugins are stored in the directory 
lib/Foswiki/Contrib/Stringifier/Plugins. 
You can add new stringifier plugins by just adding new files here. The minimum
things to be implemented are:
 
-  The plugin must inherit from Foswiki::Contrib::Stringifier::Base
-  The plugin must register itself by __PACKAGE__->register_handler($application, $file_extension);
-  The plugin must implement the method $text = stringForFile ($filename)
All the stringifiers have unit tests associated with them, and we would
encourage you to provide unit tests for any you wish to contribute. See
Foswiki:Development/UnitTests for more information on unit testing.
  Dependencies 
| Name | Version | Description | 
|---|---|---|
| abiword | >0 | One of antiword, abiword, soffice or wvWare is required for .doc files | 
| antiword | >0 | One of antiword, abiword, soffice or wvWare is required for .doc files | 
| Archive::Zip | >=0 | Required | 
| calibre | >3 | Required for indexing ebooks such as .mobi and .epub files | 
| catdoc | >0 | Optional | 
| Encode | >0 | Required | 
| Error | >0 | Required | 
| File::Which | >0 | Required | 
| html2text | >0 | One of the tools for indexing html files | 
| lynx | >0 | One of the tools for indexing html files | 
| w3m | >0 | One of the tools for indexing html files | 
| HTML::Formatter | >0 | One of the tools for indexing html files | 
| Module::Pluggable | >0 | Required | 
| odt2txt | >0 | Required for indexing OpenDocument and StarOffice documents | 
| pdftotext | >0 | Required for indexing .pdf. Part of poppler-utils | 
| ppthtml | >0 | Required | 
| soffice | >0 | One of antiword, abiword, soffice or wvWare is required for .doc and .docx files | 
| Spreadsheet::ParseExcel | >0 | Required for .xls files | 
| Spreadsheet::XLSX | >0 | One of Spreadsheet::ParseXLSX or xlsx2csv is required for .xlsx files | 
| tesseract | >0 | OCR for tiff files, optional | 
| wvWare | >0 | One of antiword, abiword, soffice or wvWare is required for .doc files | 
| xlsx2csv | >0 | One of Spreadsheet::ParseXLSX or xlsx2csv is required for .xlsx files | 
| XML::Twig | >=3 | Required for indexing PPTX files | 
  Change History 
	
		
| 12 Jul 2024: | (8.00) added a perl based html-to-text helper in case of the other helpers not being available | 
| 17 Jan 2024: | (7.00) reworked configuration settings | 
| 21 Oct 2019: | (6.00) performance fixes; new api canStringify(); added support for common ebook file formats | 
| 16 Aug 2018: | (5.20) register more mime types to the text stringifier | 
| 09 Jan 2018: | (5.10) added support for tiff documents using tesseract | 
| 18 Sep 2017: | (5.00) make html-to-text converter pluggable | 
| 31 Jan 2017: | (4.40) improved XLSX stringifier | 
| 23 Jan 2017: | (4.30) added stringifier to index XLS using soffice | 
| 18 Oct 2015: | (4.20) removed dependency on File::MMagic; now using extension-based mime detection | 
| 01 Oct 2015: | (4.10) don't default to pass-through for non-supported document types; fixed unit tests | 
| 29 Sep 2015: | (4.00) added unicode support with Foswiki > 2.0 | 
| 22 Jul 2015: | (3.10) added support for stringification of ppt using catdoc as ppthtml isn't available on some distros | 
| 29 Aug 2014: | (3.00) added support for stringification using open/libreoffice | 
| 07 May 2012: | (2.20) added configuration parameter to specify the encoding of the output of each external helper in use | 
| 17 Oct 2011: | (2.10) using wvText instead of wvHtml now; encoding stringified files to the site's charset now; fixed unit tests to use utf8 exclusively | 
| 05 Sep 2011: | (2.00) added OpenDocument serializer; removed dependency left-over on Text::Iconv; added dependency on odt2txt; fixed defaults for wv serializer | 
| 01 Dec 2010: | (1.20) moved core from StringifierContrib to Stringifier not to disturb configure | 
| 12 Nov 2010: | (1.14) Foswiki:Main.PadraigLennon - Foswikitask:Item9311 | 
| 23 Oct 2010: | (1.12) made system fault-tolerant in case of missing dependencies for a given file type; doc cleanup -- Foswiki:Main.WillNorris | 
| 12 Feb 2010: | robust parsing of password protected XLS files | 
| 02 Oct 2009: | extracted from Foswiki:Extensions/KinoSearchContrib (MD) |