docsearch: ZIP Files
For zip files, the following small script can be used. A command-line conversion tool must be configured for each document type. Files of the known document types are extracted to a temporary folder, converted to text, and joined into one large text file that can be indexed.
Currently, only conversion tools with the following call style are supported: <cmd> <inputfile> <outputfile>
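A tool that prints its text on stdout instead of writing to an output file does not fit this style directly. One possible workaround is a small wrapper script, sketched below. The names `STDOUT_TOOL` and `convert_to_file` are made up for this illustration, and `cat` only stands in for a real converter.

```shell
#!/bin/bash
# Hypothetical wrapper: adapts a tool that prints text on stdout
# to the "<cmd> <inputfile> <outputfile>" call style.
# STDOUT_TOOL is an assumption; replace it with the real converter.
STDOUT_TOOL="${STDOUT_TOOL:-cat}"

convert_to_file() {
  local inputfile=$1 outputfile=$2
  # Redirect the tool's stdout into the requested output file.
  "$STDOUT_TOOL" "$inputfile" > "$outputfile"
}

# Run only when both arguments are given, so the file can also be sourced.
if [ "$#" -eq 2 ]; then
  convert_to_file "$1" "$2"
fi
```

Saved as an executable file, such a wrapper could then be registered like any other converter command.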
- zip2txt.sh
#!/bin/bash
# This is a converter script to convert the content of a zip file to a single txt file.
# All files whose extensions are defined in this script get unzipped, converted to text
# and joined into one single output file.
# usage: zip2txt.sh <inputfile> <outputfile>

# adapt this:
# Folder where the zip file is unpacked.
# WARNING: DO NOT USE THIS FOLDER FOR ANYTHING ELSE -> all files in there will be converted!
TMPFOLDER="/tmp/zipconverter"
# File which is used as temporary storage.
# DO NOT PLACE THE TMPFILE INSIDE/BELOW THE TMPFOLDER IF YOU DON'T EXACTLY KNOW WHAT YOU ARE DOING
TMPFILE="/tmp/zipconverstion.txt"

# commands needed for this script
UNZIP_CMD="/usr/bin/unzip"
FIND_CMD="/usr/bin/find"

# extend the extension and command arrays for your personal needs
# note: the first parameter of the cmd must be the input, the second the output filename,
# e.g. /opt/office2txt.sh <inputfile> <outputfile>
FILEEXT[0]="doc"; CMD[0]="/opt/office2txt.sh"
FILEEXT[1]="pdf"; CMD[1]="/usr/bin/pdftotext"

# IO definitions
zipfile=$1
outputfile=$2

# generate the filter list from FILEEXT (an array, so the shell does not expand the patterns)
filter=()
for ext in "${FILEEXT[@]}"
do
  filter+=("*.$ext")
done

# make sure the temp folder exists even if nothing gets extracted
mkdir -p "$TMPFOLDER"

# Unzip only content with known extensions into TMPFOLDER, ignoring the case of the filter ("-C").
# The "-P \n" is needed to tell unzip that we do not have a valid password, so it does not ask
# on stdin if a file is encrypted.
"$UNZIP_CMD" -o -qq -C -P \n "$zipfile" "${filter[@]}" -d "$TMPFOLDER"

# Put all filenames inside the TMPFOLDER into an array.
# Whitespace in filenames is handled correctly (from http://mywiki.wooledge.org/BashFAQ/020)
unset filenames i
while IFS= read -r -d '' file; do
  filenames[i++]=$file
  # echo "File: ${filenames[i-1]}"
done < <("$FIND_CMD" "$TMPFOLDER" -type f -print0)

# switch off case sensitivity
shopt -s nocasematch

# convert each file to txt according to the command set in CMD
for file in "${filenames[@]}"
do
  echo "Working on file: $file"
  # get the file extension
  input_filename_w_ext=$(basename "$file")
  input_extension=${input_filename_w_ext##*.}
  # search the extension in the FILEEXT array (case insensitive)
  tLen=${#FILEEXT[@]}   # length of the array
  for (( i=0; i<tLen; i++ )); do
    if [[ "${FILEEXT[$i]}" = "$input_extension" ]]
    then
      rm -f "$TMPFILE"   # make sure it is empty
      # execute the conversion cmd
      echo "${CMD[$i]}" "$file" "$TMPFILE"
      "${CMD[$i]}" "$file" "$TMPFILE"
      # append $TMPFILE to the output file $outputfile
      cat "$TMPFILE" >> "$outputfile"
      break
    fi
  done
done

# switch on case sensitivity
shopt -u nocasematch

# remove everything in the temp folder and the temp file
rm -rf "$TMPFOLDER"/*
rm -f "$TMPFILE"
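Adding another document type means adding one more pair of entries to the FILEEXT and CMD arrays. The sketch below adds plain text files, assuming that `/bin/cp` is acceptable as a "converter" because it already follows the <cmd> <inputfile> <outputfile> style. The helper `cmd_for_file` is not part of zip2txt.sh; it is a made-up function that only shows how an extension is matched to its command without regard to case.

```shell
#!/bin/bash
# Existing entries from zip2txt.sh
FILEEXT[0]="doc"; CMD[0]="/opt/office2txt.sh"
FILEEXT[1]="pdf"; CMD[1]="/usr/bin/pdftotext"
# Assumed addition: plain text files are simply copied.
FILEEXT[2]="txt"; CMD[2]="/bin/cp"

# Hypothetical helper (not part of zip2txt.sh): print the command
# registered for a file name, matching the extension case-insensitively.
cmd_for_file() {
  local name=${1##*/} ext i
  ext=${name##*.}
  shopt -s nocasematch
  for i in "${!FILEEXT[@]}"; do
    if [[ "${FILEEXT[$i]}" == "$ext" ]]; then
      shopt -u nocasematch
      echo "${CMD[$i]}"
      return 0
    fi
  done
  shopt -u nocasematch
  return 1   # no converter registered for this extension
}
```

Calling `cmd_for_file` with a file name prints the registered command, or returns a non-zero status when the extension is unknown. The helper also switches nocasematch off again afterwards, so it is not suitable where that option must stay enabled.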
WARNING: Because this script joins all content found in the zip file into one huge text file, the indexing process (PHP) will need a lot of memory! It is best to write the output of this conversion script to a log file and check it regularly for errors. To increase the memory limit, have a look at the tips at the top of the page.
I had to set the PHP memory limit to 250 MB because the text file generated by this script was 8.8 MB in size. This can easily happen if a zip file contains a lot of PDF documents!
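The logging advice above could be followed with a small wrapper like the one below. It is only a sketch: the install path in `CONVERTER`, the `LOGFILE` location, the `MAX_BYTES` threshold, and the function name `convert_with_log` are all assumptions to adapt. It appends the converter's messages to a log file and adds a warning line when the generated text file is larger than the threshold.

```shell
#!/bin/bash
# Sketch: run a converter, append its messages to a log file, and warn
# when the generated text file exceeds a size threshold.
# CONVERTER, LOGFILE and MAX_BYTES are assumptions; adjust them.
CONVERTER="${CONVERTER:-/opt/zip2txt.sh}"
LOGFILE="${LOGFILE:-/tmp/zip2txt.log}"
MAX_BYTES="${MAX_BYTES:-5000000}"

convert_with_log() {
  local inputfile=$1 outputfile=$2 status=0 size
  # Collect stdout and stderr of the converter in the log file.
  "$CONVERTER" "$inputfile" "$outputfile" >> "$LOGFILE" 2>&1 || status=$?
  if [ "$status" -ne 0 ]; then
    echo "ERROR: converting $inputfile failed (exit $status)" >> "$LOGFILE"
    return "$status"
  fi
  # Nothing to measure if the converter produced no output file.
  [ -f "$outputfile" ] || return 0
  size=$(( $(wc -c < "$outputfile") ))
  if [ "$size" -gt "$MAX_BYTES" ]; then
    echo "WARNING: $outputfile is $size bytes, indexing may need more PHP memory" >> "$LOGFILE"
  fi
  return 0
}
```

Searching the log file for lines starting with ERROR or WARNING then gives a quick overview of problematic zip files. A converter that always exits with status 0 will never trigger the ERROR line, so its own messages in the log still need to be read.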
