Changelog History
Version .41e
(By windymilla):
- Remove FTP tab. Neither DP nor DPC support the use of the FTP tab.
- Make fcannos.bin platform independent (which was the only portion of guiprep that was not platform independent).
- Tidy declaration of Storable in fwordgen.
- Move scrollbar to right-hand side of Select Option tab.
- Remove pause from run_guiprep.bat.
- Filter form feeds from files. (Tesseract support.)
- Make "Convert Windows 1252 codepage glyphs 80-9F" default to off. (UTF-8.)
- Make removal of headers and footers utf safe.
- Adjust various messages to be more accurate.
- Minor bug fixes.
(By mannyack):
- Rewrite of the user guide and other documentation.
Version .41d (396k)
(By windymilla):
- Tidy up/mark dubious spaced curly quotes
- Fix spaced close single curly quotes (not mark as unknown)
- leave unchecked if book has apostrophes at start of word, e.g. 'orrible
Version .41c
(By windymilla):
- Treat a file containing just a BOM as empty so [Blank page] will be added
- Fix '.' not in @INC error reported by later Perl releases
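The BOM handling noted for .41c can be pictured with a minimal Python sketch. This is an illustration only, not guiprep's Perl code: the function names are hypothetical, and it assumes a UTF-8 BOM and the "[Blank page]" marker text mentioned elsewhere in this changelog.

```python
import codecs


def is_effectively_empty(data: bytes) -> bool:
    """True if the file holds nothing, or nothing but a UTF-8 BOM."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return len(data) == 0


def fill_blank_page(data: bytes, marker: str = "[Blank page]") -> bytes:
    """Replace an effectively empty page with the blank-page marker."""
    # Assumption: the marker is written as UTF-8 without a BOM.
    if is_effectively_empty(data):
        return marker.encode("utf-8")
    return data
```

Whether whitespace-only files should also count as empty is not stated in the changelog, so the sketch leaves them alone.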
Version .41b
(By wfarrell):
- Minor reformatting of guiprep source for readability
- Compatibility fix for later Perl releases (remove several uses of "defined" function)
Version .41 through .41a
(By grythumn): fix for directory lock problems when renaming
(By Dave Morgan): fix for images/directories
(By Malcolm Farmer):
- Minor typo fixes
- testart, ivstart & startpngcrush calls made Linux compatible
- Don't dehyphenate numbers (makes indexes work better)
- Page footers removal subroutine added
- merged in these options from rfrank's cpprep:
- Remove HTML markup (bold, italic, small caps)
- Remove space before 'll
- Remove space from I 'm,
- Remove space from (s)he 's
- Remove space from we 've
- Remove space from we 'll
- Remove space before n't
- Remove space from I 'll
- Remove space from I 've
- Remove space from I 's
- Convert '11 -> 'll
- some of the "Spaceyquotes" regexps
- mark possible missing spaces between words/sentences
- remove footers in batch mode.
- mark blank pages after header/footer removal
(By lvl): don't convert solitary l to I followed by ' and text (corrects behaviour for French)
Versions up to .40 were by Stephen Schultz:
Version .40 (643k)
Argh. When I added the option to extract the small caps markup from the RTF files, I broke the handler for small caps if you WEREN'T extracting the markup. Fixed now.
Modified how the Processing functions displayed progress. They used to just print a dot to the screen for each page (file) that was completed. That worked fine as long as there weren't any problems. If there WAS a problem, it was extremely tedious to try to count the dots to figure out which file was causing it. Changed it to print an incremented counter mod 10. It will print the digits from 123456789012345... and so on. That should make it much easier to figure out which file causes a problem when one occurs.
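As a rough illustration of the mod 10 counter (a Python sketch, not the actual Perl code; the function name is hypothetical), the progress string after a given number of pages could be produced like this:

```python
def progress_digits(page_count: int) -> str:
    """Return the progress string shown after page_count pages are done."""
    # Each finished page prints its running count modulo 10.
    return "".join(str(n % 10) for n in range(1, page_count + 1))


print(progress_digits(15))  # 123456789012345
```

Because every tenth page prints a 0, the failing page can be located by counting zeros and then the trailing digits, rather than counting individual dots.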
Fixed an obscure problem with code page handling during RTF extraction. Set it to have a reasonable default if it couldn't determine the codepage.
Tightened up a bunch of code in the font table and codepage handling code. Made it much more memory efficient (and probably faster, though negligibly so.)
Version .39 (643k)
Added option to extract small caps markup from the rtf during the
extraction routine. Markup will be added as <sc> .. </sc>
around the text that is marked as small caps in the RTF file. It
doesn't do too bad, but there are problems trying to convert RTF markup
(which is strictly presentational) into semantic sensitive markup.
Added an entry box to the Process Text tab where you can specify what
number to start with when renaming the text and/or png files. By
default it is set to 1, but if you want to offset the pages by 127,
enter 127 in the box and the files will be renamed starting at
127. If you want to force four digit numbers even for texts that
nominally would only need three (say an early volume of a multi-volume
work,) left pad the start number out to 4 places with zeros, e.g. 0001.
Sorry, no negative numbers, no skipping numbers in the sequence after
the start offset. If you don't like the offset you have, change it and
rename again, filename collisions will be automatically avoided.
Modified file renaming routine to be able to deal with offset start
points. Rewrote it to be more robust about avoiding filename
collisions. As a side effect, I sped it up about two to three times as
fast as it used to be.
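The start-offset and zero-padding behaviour described above might be planned along these lines. This is a Python sketch under stated assumptions: the function name, the default minimum width of three digits, and sorting by old name are all hypothetical choices, not taken from the guiprep source.

```python
def plan_names(old_names, start_text="1", ext=".txt", min_width=3):
    """Pair each old file name with its new sequential name."""
    start = int(start_text)
    last = start + len(old_names) - 1
    # Width is the widest of: the assumed default, the start number as the
    # user typed it (so "0001" forces four digits), and the last number.
    width = max(min_width, len(start_text), len(str(last)))
    return [
        (old, f"{number:0{width}d}{ext}")
        for number, old in enumerate(sorted(old_names), start)
    ]


print(plan_names(["b.txt", "a.txt"], "127"))
```

A real rename also has to avoid overwriting a file whose old name equals another file's new name; one common approach is to rename everything to temporary names first, though the changelog does not say how guiprep does it.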
Modified Search tab to be able to deal with file names that don't
correspond to their index.
Twiddled with the layout of the options tab slightly. Mostly cosmetic
changes.
Got tired of the default palette and changed it. Shouldn't affect most
current users, only new users, and you can still change it to whatever
you prefer.
Version .38 (642k)
Added a whole bunch of tweaks suggested by lorax.
Tweaked "Remove garbage punctuation" regexes a bit. Broke apart the
"Strip from front" and "Strip from end" regexes into separate options.
Modified Header Removal functions to not display pages where the only
text is the "Blank Page" text string from the options page.
Fixed improper calling of nohyph.dict loading function. Sigh.
Included a basic English nohyph.dict courtesy of lorax.
Tweaked quote handling a bit to try to intelligently resolve quote
spacing a bit better.
Added function that will try to find and change the case of ALL CAPS
words at the start of a chapter. It isn't very aggressive to prevent
unwanted case changes, but it should help a little.
Fixed bug with Convert £ to "Pounds" option where it would
erroneously split numeric quantities at commas. E.G., £100,000
would become 100 Pounds ,000 rather than 100,000 Pounds. Note, this
option is little used and somewhat discouraged, but it is available.
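The comma bug above comes down to what the pattern treats as part of the number. A minimal Python sketch of a pattern that keeps comma-grouped thousands together (the regex and function name are illustrative assumptions, not the expression guiprep uses):

```python
import re

# A pound sign, then digits with optional comma-separated thousands groups
# and an optional decimal part.
POUND_AMOUNT = re.compile(r"£\s*(\d+(?:,\d{3})*(?:\.\d+)?)")


def pounds_to_words(text: str) -> str:
    """Rewrite amounts such as £100,000 as 100,000 Pounds."""
    return POUND_AMOUNT.sub(r"\1 Pounds", text)


print(pounds_to_words("He paid £100,000 in cash."))
```

A pattern that stops at the first non-digit would capture only "100" and leave ",000" behind, which is the symptom described.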
Fiddled around with the "Move punctuation outside of markup" functions
to avoid a few undesirable side effects. Most obnoxious of which was ,
the <i</i>> problem.
Fixed a bug in the Extraction routine where if a page contained a
table, any text after the table would have its spaces changed to
non-breaking spaces. Normally this would be a non-issue since the
filter routine changes all non-breaking space back to regular spaces,
however, in rare instances they seemed to be slipping through.
Added an option to save two files during dehyphenization;
hyphens.txt and dehyphen.txt. The hyphens.txt will contain all of the
end-of-line hyphenated words that the script found during the
dehyphenate routine where the words remained hyphenated. The
dehyphen.txt will contain all of the words where a hyphen was removed.
The script has been capable of generating these files for some time as
a debugging aid, however it required editing the source to set a
debugging flag. Since the addition of the nohyph.dict dictionary file
though, these could be more useful to general users so I made the
generation optional in the program. The files will be placed in the
base directory of the project, (the directory that contains the textw,
textwo, text and pngs directories.) They will be overwritten each time
the dehyphenate routine is run.
Messed around with the layout of the options page a bit. The layout
manager I was using was very automatic, but I didn't like the staggered
columns of checkboxes.
Version .37 (638k)
Fixed problem where guiprep would occasionally lock up while running
Filter Files with the "Move punctuation outside of markup" selected.
Added an option for the "Remove garbage punctuation at ends of line" to
the options page. Made filter regex much more aggressive.
Tweaked a few other filters a bit.
Version .36 (638k)
It's a veritable bug fest.
Fixed problem with semicolons being turned into question marks.
Stupidity errer :-(
Think I finally fixed the problem with disappearing punctuation after
hyphenated words. (Actually lorax spotted the error.)
Fixed some other mistakes I made while trying to implement dehyphenate
code modifications submitted by lorax. The problems should not have
caused any errors in the processed texts, though they limited the
effectiveness of the dehyphenate routine a bit.
Added a new filter to the filter routine to try to clean up junk at the
end of lines. Often, OCR will erroneously put a bunch of junk
punctuation at the end of lines, (typically where the page runs off into
the gutter.) This will try to detect and clean up the worst of it.
Was not able to replicate problem with emdash being rendered as
â", so that hasn't been fixed yet if it is truly a problem.
Remembered to update version number this time.
Version .35 (638k)
Phooey. Yet more bugs. (Well, bug fixes, one would hope.)
Fixed bug where Filter function would lock up on certain files.
Root cause was a regex to move punctuation outside of markup that had
adverse reactions to characters outside of Latin-1.
Fixed a few warnings about printing wide (multi-byte UTF-8) characters.
Version .34 (637k)
A few tweaks and bug fixes.
Added option to use an external file of words that are not hyphenated.
If there is a file named nohyph.dict in the guiprep directory, it will
be loaded and used to help determine which words should be dehyphenated
during the dehyphenization routine. (Similar to Nikola's DPEU version.)
Fixed problem with the Convert to ISO-8859-1 routine that was causing
some bizarre u <-> y substitutions.
Revised dehyphen routine to be a little more aggressive. Changed to
aggressively lower false negatives without significantly raising false
positives. Based on code sample by lorax.
Twiddled around with FTP routines a bit. Nothing substantial, most
visible change is the "activity indicator". Used to just append
vertical bars to the log, now just has a "spinning" line.
Version .33 (636k)
Updated program to deal with Unicode files gracefully. Now works
natively in UTF-8. Files for the original DP site NEED to be in ISO
8859-1 (Latin-1). There is an extra button on the Process Text tab
"Convert to ISO 8859-1" PLEASE down convert files for the original DP
site. (At least until the UTF-8 mods get activated.) No such
restrictions for DPEU. UTF-8 files are PREFERRED at DPEU. Note the
Convert to ISO8859-1 function will do transliteration of any Greek it
finds. (It uses the guiguts beta code to denote accented characters.)
Other characters outside of Latin-1 will be converted to question marks
at this time. If I get some transliteration tables, I could make auto
transliteration for other character sets too. I don't really want to
spend lots of time on it though because hopefully, in the near future,
DP will convert to UTF-8. A very large Thank You to Nikola Smolenski, one of the lead
developers for the DPEU site who worked out the bulk of the UTF-8
character extraction code.
Fixed problem with pngcrush under Win2000 and WinXP. It was easy
enough, once I figured out what was causing the problem. The fix
consisted mostly of downloading a version of pngcrush that works
correctly under 32 bit Windows. Argh. Note for Win 95, 98 and ME
users: the 32 bit version will not work correctly under DOS. The old
version is still included as pngcrush16.exe. Rename pngcrush.exe to
pngcrush32.exe and pngcrush16.exe to pngcrush.exe.
A few other small (and mostly invisible) tweaks.
Version .32 (550k)
Fixed bug where if an italicized word was at the start of a line after
a line that ended with a hyphen, the word would be removed during
dehyphenization.
Modified guiprep to fix markup that closes at the end of a line to not
leave the ending markup at the beginning of the next line.
Modified guiprep to use the spawn.pl spawning script for external
programs instead of runner.pl for the same reasons I changed it in
guiguts. More compact, and better Linux compatibility.
Added check for common italicized scholarly abbreviations to move
markup outside of punctuation. (e.g., ibid., loc., cit., Ib., cf., op.,
et seq., viz., etc.)
Cut out 100k of extraneous images from the manual.
Version .31 (659k)
Major update of the code to work with the Tk:804 series. Rewrote and
updated user interface to work with the new unicode aware Tk. The basic
operation is as near to identical to previous versions as I could make
it. It uses the same layout, though button and font sizes are subtly
different.
I have split apart the libraries from the executable version and am
including the windows exe along with the perl script. The executable
version uses the same prl03 perl runtime libraries as guiguts. If you
already have prl03 (prl03.zip) for guiguts installed, there is no need
to download it again.
Added unicode handling code to all of the functions. There was very
basic unicode handling in the extract routines before, but all it would
do was substitute question marks for any unicode character
outside the Latin-1 character space. Will now deal with unicode in all
routines. **NOTE** The PGDP site is still not able to work with multi
byte characters. If you have a unicode encoded text, you are better off
putting it through DPEU.
Puttered around with FTP functions to try to get more accurate tracking
of transfer rates and estimated times.
Worked on making things that SHOULD be impossible to do, harder to do
accidentally. :-\
Lots of little tweaks and tuning that are not worth mentioning
individually but which added up to a substantial amount of time.
Played around with optionally marking up texts with questionable word
markup as determined by ABBYY during OCR but after messing with
it a bit, have serious reservations about its usefulness, and have
removed it again.
Version .30 (590k)
Modified FTP reporting code, now reports on instantaneous and average
speed of file transfers. Reports real throughput after overhead.
Selectable readout in Kilobytes per second (KBps) or Kilobits per second
(Kbps). Makes an estimate of seconds remaining to transfer the current
file. Not going to be very accurate for small files.
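The arithmetic behind such a readout is simple; here is a Python sketch of the average-rate case. The function name is hypothetical, and treating a kilobyte as 1024 bytes is an assumption, since the changelog does not say which convention the program uses.

```python
def transfer_stats(bytes_done, total_bytes, elapsed_seconds, unit="KBps"):
    """Return (average rate in the chosen unit, estimated seconds remaining)."""
    if elapsed_seconds <= 0 or bytes_done <= 0:
        return 0.0, None  # nothing measurable yet
    bytes_per_second = bytes_done / elapsed_seconds
    if unit == "KBps":
        rate = bytes_per_second / 1024      # kilobytes per second
    else:
        rate = bytes_per_second * 8 / 1024  # kilobits per second
    remaining = (total_bytes - bytes_done) / bytes_per_second
    return rate, remaining


print(transfer_stats(102400, 204800, 10))
```

For small files the elapsed time is dominated by connection overhead, so the estimate swings widely, which matches the caveat above.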
Fixed problem where script would dump you in the wrong directory if
processing was interrupted during the scannos routine.
Made rename functions report file counts. Useful to check that you have
the same number of text and image files.
When building a batch for FTP upload, the build routine will now check
for and warn about zero byte files.
Changed Change Directory tab to use double click instead of single
click to navigate. (Made it the same as the navigate function in the
FTP window.)
When making a new directory on the FTP server, the script automatically
issues a CHMOD 0777 command to set the permissions on the new directory.
Version .29 (590k)
Fixed "Change initial X not followed by e to N" to also ignore X
followed by hyphen.
Tweaked a few more things on FTP tab. Added a "percentage done" on
upload or download to status box.
Found and fixed bug where search window would add a blank line to the
bottom of each file every time it was opened.
Ripped out the original two set dehyphenization function and wrote a
new one based on the single set dehyphenization function. Actually both
dehyphenization functions use the same code to perform the
dehyphenization, they just use different dictionary building code. The
new two set function has all of the robustness and flexibility of the
single set, with as good accuracy (potentially even better, in fact)
than the original two set.
Found and fixed bug in dehyphenization where it was getting confused by
italic markup (and likely bold too, though I didn't confirm that.)
Rewrote large portions of the logging and error reporting code to be
much more compact and less error prone. Reduced script size by 10
percent in the process.
Added capability to use German style "=" instead of "-" as the hyphen
symbol for dehyphenization.
Removed some of the more problematic scannos from the scanno
dictionary. "cf" => "of", "au"=>"an" and "dont"=>"don't".
Did a fair amount of updating to the manual.
Version .28 (601k)
Fixed a few spelling errors in the user interface.
Made "Change initial X not followed by e to N" option not change Roman
numerals. (Basically it will ignore an initial X followed by eEIVXDCML
or space.)
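The rule as described for this version can be expressed with a negative lookahead. The Python sketch below is an illustration of that description only (the regex and function name are assumptions, not the option's actual code); the character class lists exactly the characters named above.

```python
import re

# A word-initial X is changed to N unless the next character is one of
# e, E, I, V, X, D, C, M, L or a space (probable Roman numerals or a
# stand-alone X).
INITIAL_X = re.compile(r"\bX(?![eEIVXDCML ])")


def fix_initial_x(text: str) -> str:
    """Apply the initial X to N repair to a piece of text."""
    return INITIAL_X.sub("N", text)


print(fix_initial_x("Xot a word about Xerxes or chapter XIV"))
```

Adding "-" to the character class would also cover the hyphen case mentioned under version .29.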
Made "rnp" to "mp" fix ignore turnpike as a special case.
Tinkered around with the dehyphenate routine to try to figure out what
could be causing the intermittent moving of whole lines instead of just
word halves. Was not really able to find a specific fix. Was not able
to make it fail on any of the texts I have. Still waiting on some
sample files that show the symptom from someone, so I can try to track
it down. Was not able to make it happen, even by downloading some
images from the FTP server that have text files exhibiting the symptom
and OCRing them myself. Oh well, if I can't duplicate it, I can't
rectify it. I made a few changes that may help, but, as it worked for
me both before and after the changes, it is difficult to tell whether
they will be of any use.
Puttered around with the FTP client a bit. Added a preferred "Home"
directory option as suggested by sjg1978. (Actually, adapted a working
patch he submitted) Will automatically switch to this directory on the
FTP server when you log on. Made the client a little more general
purpose. Now able to save and recall different host names. User names,
passwords and Home directories will be saved with the different host
names (if that option is selected.) Status box has been moved down to
just below the log window (to make room for the home directory box up
on the top row) Status box now gives a lot more useful information
during transfers. Actually keeps track of progress instead of just
saying uploading/downloading.
Added ability to customize superscript markup. It still defaults to
^{xx} but can be changed to whatever you want. It is not sanity
checked, so if you put markup like "<<<<KYpR%J>"
"$$$$+=*", it will cheerfully use it without a second glance.
Version .27 (612k)
Added code to handle mouse wheel events in WinXP (and apparently some
installations of Win 2K, though it always worked for me on my Win2K
system).
Fixed problem where zip file name was being incorrectly added to the
FTP batch.
Removed limitation on uploading into root directory.
Changed order of operations for changing / to ,' and change '' to " to
catch some occurrences that were slipping through.
Modified "cb" fixing code to be a little less greedy. Will no longer
"fix" Macbeth to Macheth.
Made "Convert solitary 1 to I" ignore a 1 followed by a full stop.
Added convert initial VV to W option.
Added convert initial !! to H option.
Added convert initial X not followed by e to N option.
Added convert ! in a word to l option.
Changed empty file handling code and average file size calculation to
be more efficient based on suggestions by Elronse.
(Thanks!)
Changed page switching code on search tab to automatically save the
page file if you have made edits.
Changed Search page text window to have some undo capability. WILL ONLY
UNDO CHANGES DONE TO A SINGLE PAGE. Once you switch pages, the changes
are written and the undo buffer is cleared.
Debated quite a bit about how best to implement the spaced double
quotes repair option that papeters requested. Decided to make it
universal rather than hard coding it for double quotes. Added two more
"Alternate" replacement text fields with some more Replace and Replace
& Search buttons beside the corresponding field. Now you can have
up to three alternate replacement terms. The "Replace All" function
uses the first alternate. Tried to make the button layout easy and
quick to use with a mouse.
Changed the FTP tab password entry to be a little more secure. Will now
keep your 5 year old nephew from figuring it out. :roll:
Displays **** instead of the actual password.
Lots and lots of minor tune ups and enhancements to make it more user
friendly. Too many to list (or remember).
Version .26 ( K)
Added option to not extract sub/superscript from RTF files.
Fixed fcanno (Olde Englifh) routine to skip words that have a
capitalized F at the beginning. For instance, Fire will not be changed
to *ire, since the capital F is unambiguous.
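A minimal Python sketch of a lookup that skips words beginning with a capital F and otherwise preserves the case of the original word (as described for the Olde Englifh routine in version .25). The function name and the shape of the dictionary, a plain mapping from lower-case long-s spellings to modern spellings, are assumptions for illustration.

```python
def fix_long_s(word: str, fcannos: dict) -> str:
    """Return the modern spelling of word, or word unchanged."""
    if word[:1] == "F":
        return word  # a capital F is unambiguous, so leave the word alone
    fixed = fcannos.get(word.lower())
    if fixed is None:
        return word  # not a known long-s spelling
    # Carry the original word's case over to the replacement.
    if word.isupper():
        return fixed.upper()
    if word[0].isupper():
        return fixed.capitalize()
    return fixed


sample = {"houfe": "house", "fire": "sire"}
print(fix_long_s("Houfe", sample), fix_long_s("Fire", sample))
```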
Back ported some of the external program calling routines I developed
for guiguts. Now all the external program calls will work in both
guiprep and winprep.
Added "See Image" Button to search page. Allows you to easily compare
text and image for the project pages.
Version .25 (601k)
Added function very similar to Jon Ingram's de-fcanno script he published in
the developers forum. Ported from python to perl and integrated into
the text processing page. Added a new button on text processing page
"Fix Olde Englifh". This will comb through the text and replace any
words spelled with long esses (f) with the modern English equivalent.
(They are not really misspelled. The long s really is an s, it is just very, very
close to looking like an f.) The script will preserve the case of the
original word when it replaces it.
I based the de-fcanno function off of my scannos function, but as
the fcannos dictionary was about 35 times the size of dictionary used
by the scannos function (and that wasn't any speed demon,) running the
fcannos function was nearly grinding my computer to a halt. I couldn't
leave it like that so I went back and optimized both functions a bit and
sped them up by close to 2 orders of magnitude. (Found some really,
really inefficient code in there....) Anyway, they are both pretty
sprightly now. After some
experimentation, I decided not to use the Moby SINGLE.TXT word list to generate my
dictionary. It was TOO complete. There were way too many extremely
uncommon words that were getting pushed as replacements, generating way
too many false positives. After some hunting around I settled on
generating it from the 2of4brif.txt word list from the 12dicts-4.0.zip
package available at Kevin's Word List Page. This was somewhat arbitrary,
but it generated a much more reasonably sized list, (23000 words instead
of 132000) and seems to generate a lot fewer false positives in practice.
It is heavily slanted toward British spellings as well, which fits in
rather well
with the period of most of the texts we are seeing. I've included the
dictionary generation script in the distribution if you want to try
others. It is named fwordgen.pl and requires perl to run. The name of
the word list is hard coded. If you want to try different ones, you'll
need to change the line -- open (WLIST, "<2of4brif.txt"); -- to have
the name of your file instead of 2of4brif.txt. That will generate
fcannos.bin, a serialized hash of words in the format needed by the
script.
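To make the idea of the generated dictionary concrete, here is a Python sketch of one way such a table could be built from a word list. It is not a port of fwordgen.pl: the function names are hypothetical, and the rules used (every s except a final one becomes f, and entries whose f-spelling is itself a real word are dropped) are assumptions chosen for illustration.

```python
def long_s_form(word: str) -> str:
    """Spell word the way OCR tends to read long s: as f, except at the end."""
    return word[:-1].replace("s", "f") + word[-1:]


def build_fcannos(words) -> dict:
    """Map f-spellings to modern spellings for every word in the list."""
    real_words = {w.lower() for w in words}
    table = {}
    for word in sorted(real_words):
        f_form = long_s_form(word)
        # Skip words without a long s, and skip ambiguous cases where the
        # f-spelling is itself a legitimate word ("same" would give "fame").
        if f_form != word and f_form not in real_words:
            table[f_form] = word
    return table


print(build_fcannos(["house", "same", "fame", "dogs"]))
```

This also shows why a very complete word list hurts: the more rare words it contains, the more f-spellings map onto something, and the more false positives the replacement step produces.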
If you are planning to run both the scannos fix up and the Olde Englifh
fixup routines, you should definitely run the scannos routine first. Do
not run the scannos routine after the Olde Englifh routine, it will
find lots of false positives.
Fixed a few other minor user interface bugs.
Version .24 (383k)
More user requests. Improved how script
deals with tabular data. Optionally insert bar "|" surrounding each
"cell" in a table and try to retain original table spacing as much as
possible. Added automated markup for super and sub script text. Right
now these are hard coded to be TEXish markup: caret-braces "^{X}" for
superscript and underscore-braces "_{X}" for subscript. These may be
made editable markup in a future version, similar to the bold and
italics markup so different projects can use different styles.
Found and fixed bug with underscore handling in the filter
routine that made it impossible to use an underscore for italics markup
(the nominal Gutenberg standard).
Added new filter options "Convert double commas to a double
quote", "Remove space after doublequote if it is the first character on
a line" and "Remove space before doublequote if it is the last
character on a line". (Thanks for the suggestions, Curtis.)
Version .23 (376k)
Sigh... fixed bug on search page
where an edited page wouldn't save unless you were in the midst of a
search.
Poked around in the source of gutcheck and stole a few more checks for
unlikely letter combinations - added to options page. (Thanks Jim!)
Fixed last thing keeping script from running under Linux, thanks to
jneves for bug reports and feedback. Still not 100% functionality,
external programs (text editor, image viewer, pngcrush) still are not
functioning, but that's fairly minor. All of the internal routines
should work now. There is essentially a built in text editor on the
search page anyway, and you can run pngcrush as a separate program if
desired.
Version .22 (374k)
Added some more functionality to search
tab. Now allows you to cycle through the text files or jump to a
particular file without actually doing a search. Changed logic to
automatically load the first file from the text directory when search
tab is activated. Now caching the list of filenames between calls to
the different search functions to generally speed up operation,
especially for large numbers of files. Altered changed file save
semantics slightly to better fit with the new functionality.
Added Zip function to batch upload in FTP client in anticipation of the
option being available soon on the site. Automatically adds all the
files in the upload batch to a zip file named the same as your working
directory. Should make uploads a little faster since it is not
constantly having to negotiate transfers with the FTP server for each
file. Added option to build zip file during batch mode. Paves the way
to make the FTP upload batchable along with the pre-processing.
Moved both new batch options to options page where they should have
been originally.
Changed a few more things which were blocking Linux compatibility.
Trapped error which would sometimes result in the saved settings file
being corrupted and losing your personalized settings.
Trapped bizarre behavior if italics or bold markup is extracted with a
blank markup string.
Updated Manual.
Version .21 (350k)
Added a bunch of user requested items.
Tuned a few things in the newer dehyphenization routine. Deals
better with spaced hyphens at end of line now.
You can now choose the directory name where your png files are stored.
It is no longer hard coded to be "pngs". Change it on the Program Prefs
tab.
Header Removal is now selectably automated for batch processing. It
will automatically remove the top line from every text file. THIS MAY
POSSIBLY REMOVE LINES THAT SHOULDN'T BE REMOVED. USE WITH CARE. It is
highly recommended that header removal be done in interactive mode if
feasible.
The header removal function has been made a little smarter. It will no
longer remove lines that contain the zero byte file text marker -
[Blank page], by default.
If header removal is run in batch mode, it will automatically run the
Fix Zero Byte Files routine after
it finishes. In this case, it is not necessary to select it on the
Process Text tab since that will only make it run twice.
There is a new tab with basic search & replace functions that you
can run against the text files. Will automatically search through all
of the text files. Useful for project specific spell checks that you'd
like to run. Select Case Insensitive search or Whole Word search or
combinations thereof to further narrow down the search target.
Disabled the "standard project directory name" check in the "make
remote directory" function of the FTP client. Has become moot with
recent changes to the site code.
Fixed a few inconsistencies in the FTP download logic.
Combed through code trying to reduce Linux incompatibilities. As far as
I can tell without actually trying to run it, there are only three
places where the code is Linux incompatible: the three external program
hook subroutines - testart(), ivstart() & pngcrushstart() [text
editor start, image viewer start and pngcrush start] Need to get access
to a Linux system to get them working. There may be others, but they
are the ones I know about.
Went through most of program, cleaned up code, improved commenting and
indenting. Generally tried to make program more maintainable. Updated
manual.
Version .20 (353k)
Major update. Added new dehyphenate
routine. The original dehyphenate routine is still there and is far
more comprehensive than the new one, but the new one has a huge
advantage in that it only needs one set of text files and is not
dependent on ABBYY FineReader's dehyphenization feature. The new
routine builds a dictionary of all of the words in the text files that do not have a hyphen in them, then
uses that dictionary to decide whether to remove the hyphen from a
split word or not. It will rejoin hyphenated words whether it removes
the hyphen or not. It will make a few educated guesses when it sees
some very common prefixes or suffixes. The new routine looks for a set
of text or RTF files in a "textw" directory. If there is also a "textwo" directory, the
script will automatically use the original dehyphenate routine. Changed
original dehyphenate routine to automatically fall back to the breaking
text if a threshold of synchronization errors was reached (currently 3)
in any one file.
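The single-set approach described above (build a dictionary of unhyphenated words from the text itself, then use it to decide each end-of-line split) can be sketched in Python as follows. This is a simplified illustration, not the routine itself: the names are hypothetical, and the prefix/suffix guessing and error handling mentioned in the entry are left out.

```python
import re
import string

# Punctuation to trim from tokens, keeping hyphens so split words are skipped.
TRIM = string.punctuation.replace("-", "")

# A word fragment ending a line with a hyphen, the fragment starting the
# next line, any punctuation attached to it, and one following space.
SPLIT_WORD = re.compile(r"(\w+)-\n(\w+)(\S*)[ ]?")


def build_dictionary(pages):
    """Collect every word in the pages that contains no hyphen."""
    words = set()
    for text in pages:
        for token in text.split():
            token = token.strip(TRIM)
            if token and "-" not in token:
                words.add(token.lower())
    return words


def dehyphenate(text, words):
    """Rejoin words split across lines, dropping the hyphen only if the
    joined form was seen unhyphenated elsewhere in the text."""
    def rejoin(match):
        first, second, tail = match.groups()
        joined = first + second
        if joined.lower() not in words:
            joined = first + "-" + second  # keep the hyphen
        return joined + tail + "\n"
    return SPLIT_WORD.sub(rejoin, text)


page = "an exam-\nple, and more example text"
print(dehyphenate(page, build_dictionary([page])))
```

Either way the two halves end up on one line, which matches the note that the routine rejoins words whether or not it removes the hyphen.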
Added much better reporting of what is going on during filtering of
"improbable letter combinations" and scanno replacement. Changed order
that routines run in to make reporting more useful. (Moved rename text
files to before any of the routines that do progress reporting so I
could include a file name.) Changed button order to match. Added a
button and logic to save a copy of the processing log to a file from
the process text tab. Added buttons and logic to the process text tab
to save and revert to backups of the text files.
Moved conversion of Windows codepage 1252 glyphs 80-9F (decimal
128-159) from the extract routine to the filter routine where it really
belonged. Added option for it on Select Options tab.
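For readers unfamiliar with the 80-9F range: in ISO 8859-1 these code points are control characters, while Windows codepage 1252 uses most of them for printable glyphs such as curly quotes and dashes. The Python sketch below remaps them to their Unicode equivalents. The function name is hypothetical, and the changelog does not say what guiprep converts the glyphs to, so the Unicode target is an assumption.

```python
def convert_cp1252_glyphs(text: str) -> str:
    """Reinterpret code points 0x80-0x9F as Windows-1252 glyphs."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0x80 <= code <= 0x9F:
            try:
                ch = bytes([code]).decode("cp1252")
            except UnicodeDecodeError:
                pass  # a few positions are unassigned in cp1252; keep as is
        out.append(ch)
    return "".join(out)


print(convert_cp1252_glyphs("\x93Quoted\x94 text \x97 with a dash"))
```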
Made Remove Headers routine more tolerant of filenames with spaces in
them.
When downloading a directory in the FTP client, it will now
automatically make a directory in the selected local directory with the
same name as the selected remote directory and download the files into that directory.
Added a file name filter to the FTP directory download dialog box.
Default (blank) is 'download all files in directory'. If you want to
download only the text files in a directory, put .txt in the filter box. For all of
the PNG files put .png, etc.
You can build more complex pattern matching filters too, if you like.
It uses perl regular expressions to evaluate the pattern, so don't use DOS wildcard expressions
(*.*, *.txt, etc). Added some more word pairs to the scannos list.
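The difference between a regular expression filter and a DOS wildcard can be seen in a short Python sketch. Python's `re` module stands in for Perl's regex engine here (the two differ in some details), and the function name is hypothetical.

```python
import re


def filter_names(names, pattern=""):
    """Keep the names matching pattern; a blank pattern keeps everything."""
    if not pattern:
        return list(names)
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]


names = ["001.txt", "001.png", "002.txt"]
print(filter_names(names, ".txt"))      # the two text files
print(filter_names(names, r"\.png$"))   # a stricter pattern for PNG files
```

A DOS-style pattern such as `*.txt` is not a valid regular expression, because `*` is a quantifier with nothing before it to repeat; in Python it raises `re.error` at compile time. Note also that an unescaped `.` matches any character, so `\.txt$` is the precise form when that matters.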
Version .19 (354k)
Fixed up a bunch of minor non-fatal errors (warnings). Changed default
watchdog timer to allow longer subroutines to run without raising a
fatal timeout exception. Was giving problems with some users. (Well, one
specific user, but I'm sure it would crop up again sooner or later.)
Made a few of the routines a little more robust/error resistant. The
dehyphenate routine now marks the word in question with "**" when it
gets a synchronization error. Added a few more word pairs to the common
scannos list. Removed the check for double backslashes, no longer
necessary after site update.
Version .18 (357k)
Fixed pngcrush feedback mechanism to work consistently across windows
platforms. Changed it to work predictably no matter what your pngcrush
option settings. Added capability to edit pngcrush command line options
to the Program Prefs tab and changed default pngcrush settings to
something a little more generic.
Tweaked a few of the markup filters to catch boundary conditions
better. Fixed FTP client to understand directory names with spaces in
them. Changed FTP directory download dialog box to custom built one, a
little easier to work with, I think. Added directory download list
display. Changed default FTP host to pgdp01.archive.org. Changed client
to allow editing host name. Tuned a bunch of the FTP functions to work
more intuitively. Just does the right thing. Double clicking on a
directory name on the remote server will change to that directory.
Double clicking on a file name will download that file. Double clicking
on a local file name will open a viewer for the file. Made all of the
FTP routines less fragile.
Wrote modified FTP::put and FTP::get routines that won't block the
calling Tk window to replace the ones in the standard FTP module, which
block Tk very badly. The window updates at least once for every 10KB of upload or
download. (You'll get a tick mark in the log box for every 10KB of data
transferred.)
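The general idea of a transfer that reports back after every 10KB can be
sketched as below. This is an illustration in Python, not the Perl
routines described above: copy_in_chunks and on_chunk are hypothetical
names, and in-memory streams stand in for the FTP connection and the
local file. In a Tk program the callback is where the window would be
refreshed and the tick mark written to the log.

```python
import io

CHUNK_SIZE = 10 * 1024  # 10KB per chunk, matching the entry above

def copy_in_chunks(src, dst, on_chunk, chunk_size=CHUNK_SIZE):
    """Copy src to dst one chunk at a time, calling on_chunk after each chunk."""
    total = 0
    while True:
        block = src.read(chunk_size)
        if not block:
            break  # end of stream
        dst.write(block)
        total += len(block)
        # Hand control back to the caller so the GUI can update (assumed hook).
        on_chunk(total)
    return total

# Stand-ins for a remote file and a local file.
src = io.BytesIO(b"x" * 25000)
dst = io.BytesIO()

ticks = []  # one entry per progress callback
copied = copy_in_chunks(src, dst, ticks.append)
```

Splitting the transfer into small reads keeps any single call short, so
the interface gets a chance to redraw between chunks instead of freezing
until the whole file has moved.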
Changed how external programs are invoked on the header removal page to
be more consistent with other pages.
Fixed missing last drive problem under NT / 2K.
Changed some code in the script which caused problems under WinXP and
perl 5.6.
Lots of code cleanup, added and formatted comments, removed some unused
routines, made indenting style more uniform. Updated manual.
In version .17: (377k)
Better resynchronization after error during Dehyphenization and better
trapping of errors. Finally dehyphenization is as stable as I would
like. In the worst case, it will use the text with line breaks as its
fallback if there are too many errors. Provides more information on
exactly what the problem is when a Dehyphenization error occurs. More
efficient markup pattern matching in Filtering routine. Combined about
14 pattern matching searches down to 4. Reworked Pngcrush calling
routine to be compatible with NT based Windows platforms. Provide more
feedback during the pngcrush routine. Improved the FTP client
drastically. Added buttons for Change directory, Download, Rename and
Delete as alternatives to the arcane mouse button - key press
combinations. Added Rename function. Works with both files and
directories. Improved Download function to allow automatic batch
downloading of all the files in a directory. Disabled floppy drive
search on startup. Gets rid of the annoying "No Disk" prompt in XP. Not
really realistic that a project would be on a floppy anyway. Fixed
problem with small caps text not being upper cased on some occasions.
Updated Manual. Added history section. Miscellaneous bug fixes.
In version 16: (374k)
Reworked Process Text tab layout. Combined Process Batch and Do All
Selected button into one Start Processing button. Just does the right
thing depending on mode. Added routine to run pngcrush on your png
image files. Pngcrush is a png size optimizer. Most image generating
programs are not particularly efficient about making the smallest
possible lossless png file. Since the images are uploaded and
downloaded 4 - 6 times during a project, it makes sense to make them as
small as possible. Added pop up help buttons on most pages. Added
download and remote delete functionality to FTP client. Updated Manual.
Miscellaneous bug fixes.
In version 15: (319k)
Added basic FTP client to help automatically upload preprocessed projects to
site. Added hook to link in external Image viewer. Added routine to
automatically rename png files in pngs directory under project. Changed
help box to a button activated pop up window on Change Directory page
to make more room for directory and batch listing boxes. Started
putting version number in program title bar to make it easier to track.
Updated Manual. Miscellaneous bug fixes.
In version 14: (202k)
Improved the hooks for the external programs to run them non-blocking. (Able to
run more than one at once without locking up guiprep.) No longer any
reasonable expectation of Linux compatibility. Added some more
filtering options. Fixed some race conditions. Script now remembers the
window size and location from session to session. Added much better
reporting on processing progress. Renamed guiprepe to winprep. Updated
Manual. Other miscellaneous bug fixes.
In version 13: (202k)
Added hook to link in external text editor so you can view files easily
during Header Removal. Added more filtering options. Improved batch
processing. Added Program Preferences tab to allow you to choose some
settings that don't directly affect the text processing. Script will
remember preference settings. Script now remembers the last directory
you were working in and reopens to there. Modified Interrupt Processing
to interrupt whether in batch OR interactive mode. Script will
interrupt processing if you switch away from the processing window.
Reworked layout to be usable down to VGA resolution. Debut of guiprepe
(guiprep executable), a compiled Windows version of guiprep. Updated
Manual. Miscellaneous bug fixes.
In version 12: (194k)
Jon Ingram edition. Now does batching. Queue up several projects in a batch
and run processing on them sequentially. Updated Manual.
In version 11: (193k)
Added Check For Common Scannos routine and list. Checks for 3400 or so
common scannos. Added lots of new filtering options for improbable
letter combinations and others. Made Text Processing routines batchable
with check boxes to select which one to do. Updated Manual. Lots of bug
fixes.
In version 10: (123k)
First gui version. Made a gui interface to the prep.pl script to allow runtime
option selection without huge command line lists. Renamed to guiprep.pl
to reflect interface change. Linked hrtk.pl header removal tool into
the script as a separate tab. Updated Manual. Created lots and lots of
bugs.
In version 9: (0k)
There was no version nine.
In version 8: (94k)
Last command line version of prep.pl. Added basic header removal command
line scripts and gui tool that implements them (hrtk.pl).