whitakers-words/HOWTO.txt

--  DOCUMENT IN DEVELOPMENT  --

PROCESSES TO
    DO INFLECTIONS
    PREPARE DICTIONARY ADDITIONS
    UPGRADE LATIN DICTLINE
    CHECK LATIN DICTLINE
    MAINTAIN LATIN DICTLINE
    CHECK DICTLINE FOR ENGLISH SPELLING
GENERATE WORDS SYSTEM
        PREPARE LATIN DICTIONARY PHASE
        PREPARE ENGLISH DICTIONARY PHASE

OTHER FORMS OF DICTIONARY
    DICTPAGE
        Like a paper dictionary
    LISTALL
        All words that DICTLINE and INFLECTS can generate
          For spellcheckers
          Will not catch ADDONS and TRICKS words

TOOLS

CHECK.ADB
DUPS.ADB

DICTORD.ADB
FIXORD.ADB
LINEDICT.ADB
LISTORD.ADB

DICTPAGE.ADB

DICTFLAG.ADB

INVERT.ADB
INVSTEMS.ADB

ONERS.ADB

CCC.ADB

SLASH.ADB
PATCH.ADB

SORTER.ADB

-------------------  DO INFLECTIONS  ----------------------

INFLECTS.LAT contains the inflections in human-readable form
with comments, and in a useful order.
This is the input for MAKEINFL, which produces INFLECTS.SEC.


(LINE_INF uses INFLECTS.LAT input to produce INFLECTS.LIN,
clean and ordered, but still readable.

Run

        LINE_INF


which produces
    INFLECTS.LIN
and INFLECTS.SEC)


----------------------------------------------------------
------------PREPARE  DICTIONARY  ADDITIONS----------------
----------------------------------------------------------

This process is to prepare a submission of new dictionary entries
for inclusion in DICTLINE.  The normal starting point is a text file
in DICTLINE (LIN) form, the full entry on one line, spaced appropriately.


The other likely form is an edit file (ED) in which the entry is broken
into three lines

STEMS
PART and TRAN
MEAN

For this form, spacing is not important, as long as there are spaces
separating individual elements.

This is transformed into LIN form by the program LINEDICT.
LINEDICT.IN (ED form) -> LINEDICT.OUT (LIN form)


The inverse of this, LIN to ED, is useful to produce a more easily
editable file (3 lines per entry so it is all on one screen)
LISTDICT.IN (LIN DICTLINE form) -> LISTDICT.OUT (ED form)
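
The relation between the two forms can be pictured with a short Python
sketch.  This is not the LINEDICT source (the tools are Ada programs).
It assumes the column layout implied by the SORTER fields used elsewhere
in this document: STEMS in columns 1-75, PART from column 77 (24 wide),
TRAN from column 101 (10 wide), MEAN from column 111.  The function names
are hypothetical, and no checking or respacing of individual elements
is attempted.

```python
STEMS_WIDTH = 75   # assumed: STEMS occupy columns 1-75
PART_START = 76    # zero-based index of column 77
MEAN_START = 110   # zero-based index of column 111


def lin_to_ed(line):
    """Split one LIN-form line into the three ED-form lines."""
    line = line.rstrip("\n")
    stems = " ".join(line[:STEMS_WIDTH].split())
    part_tran = " ".join(line[PART_START:MEAN_START].split())
    mean = line[MEAN_START:].strip()
    return [stems, part_tran, mean]


def fields_to_lin(stems, part, tran, mean):
    """Lay four already-separated fields out at the assumed columns."""
    return f"{stems:<75} {part:<24}{tran:<10}{mean}".rstrip()
```

Going from ED to LIN also requires separating PART from TRAN on the
second ED line; that step is left out here because it depends on the
entry syntax that LINEDICT itself knows.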

Having a LIN form, one can create a DICTLINE.SPE and do checking on that.

Besides running CHECK to validate syntax, one can run DICTORD and create
a file in which leading words are in dictionary entry form.  One can then
run this against the existing WORDS and DICTLINE to check for overlap.

DICTORD makes # file in long format
DICTORD.IN -> DICTORD.OUT
Takes DICTLINE form, puts # and dictionary form at beginning.

This file can be sorted to produce word order of paper dictionary.

SORTER on (1 300) (with or without U for I/J U/V conversion)

One can then run WORDS against this file using DEV (!) parameters
DO_ONLY_INITIAL_WORD and FOR_WORD_LIST_CHECK,
and (#) parameters
HAVE_OUTPUT_FILE, WRITE_OUTPUT_TO_FILE, WRITE_UNKNOWNS_TO_FILE
The output provides a check of whether the new submissions
are duplicated in the existing dictionary and, even if the forms are,
whether the meanings are the same.

After editorial review in light of the WORDS run, the new submission
is ready for inclusion by the usual process with CHECK and SPELLCHECK.


----------------------------------------------------------
----------------UPGRADE  DICTIONARY ----------------------
----------------------------------------------------------

This is a variation of the additions process.

This process is to prepare a section of DICTLINE for upgrade.
A section (about 100 entries) is extracted and ordered alphabetically.
It is then put in a form for convenient editing and compared to
the OLD and L+S.  Entries are checked and additions are made.
The edit form is returned to DICTLINE form and inserted in
place of the extracted section.

Much the same process is involved in preparing an independent submission
of new entries.


DICTORD makes # file in long format
DICTORD.IN -> DICTORD.OUT
Takes DICTLINE form, puts # and dictionary form at beginning,
a file that can be sorted to produce word order of paper dictionary

SORTER on (1 300)

LISTORD    Takes # (DICTORD) long format to ED file
(3 lines per entry so it is all on one screen)
LISTORD.IN -> LISTORD.OUT

Edit


FIXORD produces clean ED file

LINEDICT makes long format (LINE_DIC/IN/OUT)

----------------------------------------------------------
-------ADDING A BLOCK OF NEW ENTRIES TO DICTIONARY -------
----------------------------------------------------------

This may be in association with the upgrade process or from
a block of new entries submitted by a developer or user.

The format may be strange.  It is usually easiest to reduce/edit
it down to the 3 line ED form, because that has no column restrictions.

From there one does the usual, making LINEDICT format and preparing the addition.

One quirk is that there may be entries duplicate of the current DICTLINE.
This is so even if the supplier was working from and checking his current DICTLINE,
because there may have been later additions to the master.

While DUPS will catch these, that is a big effort for a full DICTLINE.
One would rather check just the new input.

Take the input and DICTORD.  This gives a format with the dictionary entry
word first.  Run the current WORDS against that with NO FIXES/TRICKS and
FIRST_WORD and FOR_WORDLIST parameters.  Anything not UNKNOWN in the output
should be examined.

Then run CHECK and spellcheck the English.


----------------------------------------------------------
------------PREPARE  DICTIONARY (DICTLINE) WITH ADDITIONS-----------
----------------------------------------------------------
Save present copies of DICTLINE.GEN, DICTLINE.SPE, DICT.LOC,
and whatever else, in case you foul up and have to redo.

Add DICT.LOC to DICTLINE.GEN

        Copy DICT.LOC   LINEDICT.IN
        Run LINEDICT

        Copy LINEDICT.OUT+DICTLINE.GEN   DICTLINE.NEW

Or if there is a SPE that you want to integrate

         COPY DICTLINE.GEN+DICTLINE.SPE  DICTLINE.NEW

Or any other combination.


Sort DICTLINE.NEW in the normal fashion (to check for duplicates)

      SORTER
        DICTLINE.NEW   --  Or whatever you call it
            1 75         --  STEMS
           77 24   P     --  PART
          111 80         --  MEAN  --  To order |'s
          101 10         --  TRAN
         DICTLINE.SOR    --  Where to put result
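
For readers without SORTER at hand, the idea of a multi-field column
sort can be sketched in Python.  This is only an approximation: it
compares each field as plain text, in the priority order given, and
ignores the SORTER type letters (such as P), whose exact comparison
rules are not described here.  The names column_sort and
DICTLINE_FIELDS are hypothetical.

```python
def column_sort(lines, fields):
    """Sort lines on (start, length) column fields, in priority order.

    Columns are 1-based, as in the SORTER parameters.  Short lines are
    padded with blanks so that every field has its full width.
    """
    def key(line):
        text = line.rstrip("\n")
        return tuple(text[s - 1:s - 1 + n].ljust(n) for s, n in fields)
    return sorted(lines, key=key)


# The four fields listed above: STEMS, PART, MEAN, TRAN.
DICTLINE_FIELDS = [(1, 75), (77, 24), (111, 80), (101, 10)]
```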

Check the sort for oddities and any blank lines.
(Look for long/run-on lines.)
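
The check for blank and run-on lines can be mechanized.  A minimal
Python sketch follows, assuming that a well-formed line is at most 190
columns (MEAN starting at column 111 and 80 wide, as in the sort fields
above).  The function name and the limit are assumptions, not part of
the WORDS tools.

```python
MAX_WIDTH = 190  # assumed: column 111 plus an 80-column MEAN field


def find_oddities(lines, max_width=MAX_WIDTH):
    """Return (line_number, reason) pairs for blank or over-long lines."""
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text.strip():
            hits.append((number, "blank"))
        elif len(text.rstrip()) > max_width:
            hits.append((number, "too long"))
    return hits
```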

Then run CHECK and examine CHECK.OUT

Run

        CHECK

to produce
   CHECK.OUT

Examine CHECK.OUT and make any corrections required
(The easiest way is to edit CHECK.IN and rerun as necessary.
Then copy the final CHECK.IN to DICTLINE.)
Errors are cited by line number in CHECK.IN.
Edit while examining CHECK.OUT from the bottom, so that changes do not
affect the numbering of the rest of CHECK.IN.
CHECK is very fussy.  The hits are primarily warnings to look for
the possibility of error.  Most will not be wrong.  In fact, over
one percent of correct lines will trigger some warning, more false
positives than real errors.
This makes a full run and edit of DICTLINE a considerable burden.
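
Why editing from the bottom matters can be seen in a small Python
sketch.  The function below is hypothetical and only illustrates the
principle: each edit is keyed by its original line number, and applying
the highest numbers first means that an inserted or deleted line never
shifts a line that is still waiting to be fixed.

```python
def apply_edits(lines, edits):
    """Apply {line_number: replacement_lines} edits from the bottom up.

    An empty replacement list deletes the line, one item replaces it,
    and several items split it into several lines.
    """
    result = list(lines)
    for number in sorted(edits, reverse=True):
        result[number - 1:number] = edits[number]
    return result
```

Applying the same edits from the top down would misplace every edit
that follows a deletion or an insertion.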


Sort the fixed CHECK.IN again if there have been any changes in order.

Check for duplicates in columns 1..100
(DUPS checks for '|' in column 111 so that it does not give
hits on lines known to be continuations, provided the sort is in order.)

   COPY CHECK.IN DUPS.IN
   Run DUPS
          1 100

Examine DUPS.OUT and fix DUPS.IN (again from the bottom).
Resort if necessary.
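
The duplicate test described above can be approximated in Python.  This
is a sketch of the idea, not the DUPS source: it reports a line when its
columns 1..100 equal those of the line before it, unless the line has
'|' in column 111 and is therefore taken to be a continuation.  As with
DUPS, it only works on a sorted file.  The function name is hypothetical.

```python
def find_dups(lines, first=1, last=100, mark_col=111):
    """Return line numbers whose columns first..last repeat the previous line.

    A line with '|' in column mark_col is treated as a continuation of
    the entry above it and is not reported.
    """
    hits = []
    previous = None
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        key = text[first - 1:last]
        is_continuation = len(text) >= mark_col and text[mark_col - 1] == "|"
        if previous is not None and key == previous and not is_continuation:
            hits.append(number)
        previous = key
    return hits
```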

Copy the final product to DICTLINE.GEN

This only checks DICTLINE for syntax.

----------------------------------------------------------
----------CHECK DICTLINE FOR ENGLISH SPELLING-------------
----------------------------------------------------------
To check DICTLINE further, one can check the spelling of MEAN.

The fixed format of DICTLINE facilitates this process.
Just running DICTLINE through a spellchecker is impossible,
since all lines contain Latin stems, which will fail not only
an English spellchecker, but a Latin spellchecker as well
(since they are just stems, not proper words).

The process is to extract the MEAN portion, spellcheck this,
and reassemble, making sure to preserve the exact line order.
I use two personal tools, SLASH and PATCH.

Run SLASH on DICTLINE
SLASH takes a file and cuts it into two, lines or columns.
In this case we want to separate the first 110 columns from the rest.

   SLASH
      c          --  Rows or columns
      110        --  How many in first
      LEFT.      --  Name of left file
      RIGHT.     --  Name of right file
                 --  Or whatever you want to call them

Save LEFT for later and work on RIGHT, which is only MEANs.
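
The column split and the later reassembly can be sketched in Python.
SLASH and PATCH are personal tools whose source is not shown here, so
the functions below are hypothetical stand-ins that only mirror the
behavior described: cut each line after a given column, and later glue
the two halves back together line by line.

```python
def slash_columns(lines, width):
    """Split each line into its first `width` columns and the rest."""
    left, right = [], []
    for line in lines:
        text = line.rstrip("\n")
        # Pad the left half so that the right half rejoins at the same column.
        left.append(text[:width].ljust(width))
        right.append(text[width:])
    return left, right


def patch(left, right):
    """Rejoin the halves line by line; the line counts must match."""
    if len(left) != len(right):
        raise ValueError("LEFT and RIGHT have different numbers of lines")
    return [(l + r).rstrip() for l, r in zip(left, right)]
```

The length check reflects the requirement above that the exact line
order be preserved: if a spellchecker adds or drops a line in RIGHT,
the halves no longer correspond.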

There is one additional complication.
Some MEANs have a translation example element [... => ...]
This will contain some Latin (the left half) as well as English.

The rest I do with editors, but I suppose I should make tools.

Introduce 80 blanks in front of any [
SLASH out the first 80 columns, giving the MEAN omitting the []
Spellcheck that
In the [] file, left justify and add 80 blanks before the =
SLASH out the first 80 columns and spellcheck
Reassemble the three parts of MEAN
Eliminate blanks, leaving a simple MEAN/RIGHT.
PATCH LEFT. and RIGHT together to give DICTLINE.
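
A possible starting point for such a tool, as a Python sketch.  It only
covers the separation step: pulling each [... => ...] example out of a
MEAN so that the Latin half can be skipped and the English parts
spellchecked.  Reassembly after corrections is not covered.  The
function name and the sample text in the usage note are invented for
illustration.

```python
import re

# One translation example: [latin => english]
EXAMPLE = re.compile(r"\[([^\]=]*)=>([^\]]*)\]")


def split_mean(mean):
    """Return (text_outside_brackets, [(latin, english), ...])."""
    examples = [(lat.strip(), eng.strip()) for lat, eng in EXAMPLE.findall(mean)]
    outside = EXAMPLE.sub("", mean)
    return outside, examples
```

For example, split_mean("house, home; [domi => at home];") separates
the plain English from the pair ("domi", "at home").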


___________________________________________

 To Prepare English Dictionary
__________________________________________

The first part of the following procedure is only for those
starting from scratch.  If porting with a full package,
EWDSLIST.GEN will be provided and you can skip down.

---------------------------------------------------------

Preparing the dictionary for the English mode also
involves checks on the syntax of MEAN.

Run MAKEEWDS against DICTLINE.GEN
(There may be some errors cited.  Correct as appropriate.)

This extracts the English words from DICTLINE MEAN (G or S)
Makes EWDSLIST.GEN (or .SPE)

Make sure that if running from DICTLINE.GEN that the extra ESSE line
is added.  If we start from DICTFILE.GEN, it is already in.

 type EWDS_RECORD is
        record
          W    : EWORD;                       --   1
          AUX  : AUXWORD;                     --  40
          N    : INTEGER;                     --  50
          POFS : PART_OF_SPEECH_TYPE := X;    --  62
        end record;

Ah                                                         1 INTERJ
Aulus                                                      2 N
Roman                                                      2 N
praenomen                                                  2 N
abbreviated                                                2 N


__________________________________________________


Sort EWDSLIST making a revised version (same name)

1    24   A
1    24   C
51    6   R
75    2   N  D


(Run ONERS on ONERS.IN if you want to see FREQ)
(Sort ONERS.OUT  1 11 D; 13 99)

_____________________________________________________

If you are supplied with EWDSLIST.GEN as part of a port package,
the above process is not done.

_____________________________________________________


Run MAKE_EWDSFILE against EWDSLIST.GEN
(This also removes some duplicates, entries in which the
key word appears more than once.)

producing EWDSFILE.GEN

(At present these will act to produce a EWDSFILE.SPE, but
WORDS is not yet set up to use that - only English on GEN for now.)

----------------------------------------------------------
------------PREPARE  WORDS SYSTEM-------------------------
----------------------------------------------------------

If using GNAT, otherwise compile with your favorite compiler

gnatmake -O3 words
gnatmake -O3 makedict
gnatmake -O3 makestem
gnatmake -O3 makeewds
gnatmake -O3 makeefil
gnatmake -O3 makeinfl


This produces executables (.EXE files) for
WORDS
MAKEDICT
MAKESTEM
MAKEEWDS
MAKEEFIL
MAKEINFL

(You may also need my SORTER to prepare the data if you are modifying data.
gnatmake -O3 sorter)

(If you have modified DICTLINE, SORTER sort
            1 75         --  STEMS
           77 24   P     --  PART
          111 80         --  MEAN
          101 10         --  TRAN
Actually the order of DICTLINE is not important for the programs;
it is only a convenience for the human user.)


Run MAKEDICT against the DICTLINE.GEN  -  When it asks for dictionary, reply G for GENERAL
This produces DICTFILE.GEN
("against" means that the data file and the program are in the same folder/subdirectory.)

(This assumes that you are using the presorted STEMFILE.GEN
which comes with distribution and matches that DICTLINE.GEN.
Otherwise make and run WAKEDICT (Identical to MAKEDICT with
PORTING parameter set in source).  This produces DICTFILE.GEN
and a STEMLIST.GEN, which has to be sorted by SORTER.
MAKE ABSOLUTELY SURE YOU ARE USING THE RIGHT MAKEDICT/WAKEDICT!

Invoke SORTER to sort the stems with I/J and U/V equivalence
and replace initial STEMLIST with the sorted one.

       SORTER
         STEMLIST.GEN    --  Input
           1    18   U
           20   24   P
           1    18   C
           1    56   A
           58    1   D
         STEMLIST.GEN    --  Output

The output file is also STEMLIST.GEN - Enter/CR for the name works.)
(All SORTER parameters are based on the layout of WORDS 1.97E.
Later versions may have further/expanded fields.)
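
The I/J and U/V equivalence can be illustrated with a Python sort key.
This is a sketch of the idea only: it shows one field, folds J to I and
V to U before comparing, and does not reproduce the other SORTER field
types.  The function names are hypothetical.

```python
_FOLD = str.maketrans("jvJV", "iuIU")


def fold_ij_uv(text):
    """Map J to I and V to U so that both spellings compare equal."""
    return text.translate(_FOLD)


def sort_stems(lines, start=1, length=18):
    """Sort on one column field, treating I/J and U/V as the same letter."""
    def key(line):
        field = line[start - 1:start - 1 + length]
        return fold_ij_uv(field)
    return sorted(lines, key=key)
```

With this key, a stem spelled with J sorts where its I spelling would,
so "jam" comes before "ibi" rather than after "iter".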

Run MAKESTEM against STEMLIST.GEN (with dictionary "G") produces STEMFILE.GEN and INDXFILE.GEN

The same procedures can generate DICTFILE.SPE and STEMFILE.SPE (input S)
if there is a SPECIAL dictionary, DICTLINE.SPE


For the English part, if you use the presorted EWDSLIST.GEN
run MAKEEFIL against it.

(This assumes that you are using the presorted EWDSLIST.GEN
which comes with distribution and matches that DICTLINE.GEN.
Otherwise make and run MAKEEWDS against DICTLINE.GEN
This produces EWDSLIST.GEN which has to be sorted by SORTER.
Check the beginning of EWDSLIST with an editor.
If there are any strange lines, remove them.
Invoke SORTER.  The input file is EWDSLIST.GEN.
The sort fields are

SORTER
    EWDSLIST.GEN
       1   24   A         --  Main word
       1   24   C         --  Main word for CAPS
      51    6   R         --  Part of Speech
      72    5   N    D    --  RANK
      58    1   D         --  FREQ
    EWDSLIST.GEN     --  Store

The output file is also EWDSLIST.GEN - Enter/CR for the name works.)
(For this distribution, there is no facility for English from a SPECIAL dictionary -
there is no D_K field yet)

Run MAKEEFIL against the sorted EWDSLIST.GEN producing EWDSFILE.GEN


Run MAKEINFL against INFLECTS.LAT producing INFLECTS.SEC

Along with ADDONS.LAT and UNIQUES.LAT,
this is the entire set of data for WORDS.

WORDS.EXE
INFLECTS.SEC
ADDONS.LAT
UNIQUES.LAT
DICTFILE.GEN
STEMFILE.GEN
INDXFILE.GEN
EWDSFILE.GEN
--  And whatever .SPE as appropriate
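
A quick way to confirm that the data set is complete is to check the
folder for the files listed above.  The Python sketch below is a
convenience, not part of the WORDS distribution.  It compares names
without regard to case, an assumption made because file names may be
lower case on some systems, and it leaves out the executable and any
.SPE files.

```python
from pathlib import Path

# Data files from the list above (executable and .SPE files omitted).
REQUIRED = [
    "INFLECTS.SEC", "ADDONS.LAT", "UNIQUES.LAT",
    "DICTFILE.GEN", "STEMFILE.GEN", "INDXFILE.GEN", "EWDSFILE.GEN",
]


def missing_files(folder, required=REQUIRED):
    """Return the names in `required` that are not present in `folder`."""
    present = {p.name.upper() for p in Path(folder).iterdir() if p.is_file()}
    return [name for name in required if name not in present]
```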


(If you go through the process and have a working WORDS but it
gives the wrong output, the most likely source of error is
a missing or improper sort.)


--------------------------------------------------------------
Viewing WORD.STA


A view to see what ADDONS and TRICKS were used


Sort WORD.STA on
1    12      --  The STAT name
55   25      --  STAT details
32   20      --  Word in question
16   10      --  Line number


------------------------------------------------------------------
------------------PREPARING DICTPAGE------------------------------
------------------------------------------------------------------

Preparing DICTPAGE, the listing as of a paper dictionary.

IMPORTANT NOTE

During the process, you may find it useful to edit some entries.  Feel free to do so.
But remember that you have to keep the separate files (.TXT) and reassemble at the end
into a new DICTLINE.


For a release, ideally DICTPAGE is done before the final DICTLINE,
because in the process there may be some editing of entries.
To first order, this is accomplished by running DICTPAGE
against DICTLINE, producing a listing of DICTLINE with each
entry preceded by # and the DICTIONARY_FORM.
DICTPAGE is a simple modification of DICTORD to produce a
more readable output.

Some polishing of this process gives a better product.
Extracting a few groups of entries for special handling
will simplify the process.


1) Use the regular DICTLINE sort.
Those entries with first stem zzz may give an output
which sorts to #-.  But it is likely the second term which
you want to represent this entry.  For this and other reasons
these entries will require some hand editing, so extract them
from their place at the end of the regular DICTLINE, run DICTPAGE
on them, sort output on full line, and process separately.
(About 30 entries, but half handled completely by DICTPAGE)
It is likely that this set has not changed much since the last run,
so check to see if you have to do it over.

2)Sort remaining DICTLINE on (77, 8), (110, 80), (1, 75).  Extract ADJ 2 X.
Many Greek adjectives are handled in DICTLINE in two or three parts
(ADJ 2 X), by gender.  The full declension is the
sum of these partials.  (The Greek adjective form 3 6 is handled in the
regular process and does not have to be extracted.) Extract these ADJ declensions
from a sort of DICTLINE by PART.  Sort this output on stem and meaning to group
the constituent parts, run DICTPAGE and polish by hand edit to make
a single paper entry from the parts.  (About 150 entries, half that
after editing, not too hard, but a program could do the modification.)
It is very likely that this has not changed.

3)The qu-/aliqu- PRONOUN/PACKON (PRON/PACK 1) are yet more complicated
than the Greek adjectives, and are handled in the same manner.
Extract them, sort on meaning, DICTPAGE, and polish output by hand.
Also PRON 5 (only 8 of these).  Both of these are sufficiently
unchanging that one could archive the final edit and reuse on a later run.

4)The rest are automatically done by DICTPAGE.

5)UNIQUES are a special case, handled by UNIQPAGE.  This processes UNIQUES.LAT
(as UNIQPAGE.IN) into a raw form compatible with the regular PAGE material
(UNIQPAGE.OUT which is copied into UNIQPAGE.pg), added to, and sorted with.


The various phases are assembled into a whole and sorted on the lead,
producing DICTPAGE.RAW

DICTPAGE.RAW is ZIPped to provide a source for others to process for their purposes.

DICTPAGE.RAW is processed herein by PAGE2HTM (with the addition of PREAMBLE.txt
and an end BODY) to give the presentation form DICTPAGE.HTM


The process:

First do a SORT of DICTLINE on STEM to find zzz stems

      SORTER
        DICTLINE.GEN   --  Or whatever
           1 75         --  STEMS
          77 24   P     --  PART
         111 80         --  MEAN  --  To order |'s
        DICTLINE.TXT    --  Where to put result

Extract the zzz stems from the end of the file into ZZZ.TXT leaving DICTLINE.NOZ
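
The extraction itself is simple enough to sketch in Python.  The
function below is hypothetical; it assumes that an entry belongs to the
zzz group exactly when its line begins with "zzz" (the first stem),
and it keeps the original order within each group.

```python
def split_zzz(lines):
    """Separate entries whose first stem starts with 'zzz' from the rest."""
    zzz, rest = [], []
    for line in lines:
        if line.lower().startswith("zzz"):
            zzz.append(line)
        else:
            rest.append(line)
    return zzz, rest
```

The first list corresponds to ZZZ.TXT and the second to DICTLINE.NOZ.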

Sort these

     SORTER
        ZZZ.TXT
           77 24   P     --  PART
            1 75         --  STEMS
          111 80         --  MEAN  --  To order |'s
          101 10         --  TRAN
        ZZZ.TXT             --  Where to put result

Extract the PRON 5 to a PRON5.TXT  --  More to come


Now sort the rest

      SORTER
        DICTLINE.NOZ
           77 24   P     --  PART
            1 75         --  STEMS
          111 80         --  MEAN  --  To order |'s
          101 10         --  TRAN
        DICTLINE.NOZ    --  Where to put result


Now extract from DICTLINE.NOZ the remaining PRON 5, the Greek adjectives,
and the qui/aliqui PRON/PACK 1, giving

ZZZ.TXT
GKADJ.TXT
PRON1.TXT
PRON5.TXT

After those are removed, the remaining is REST.TXT.


Run DICTPAGE on each of these 5 files
(Copy them to DICTPAGE.IN, run DICTPAGE, copy DICTPAGE.OUT to the appropriate file .PG)


----------------ZZZ

Process the remaining (less PRON 5) ZZZ.TXT with DICTPAGE
(Copy ZZZ.TXT to DICTPAGE.IN, run DICTPAGE, copy DICTPAGE.OUT to ZZZ.PG)
Most of them will be handled.  Hand edit the rest.

Some should be expanded (archaic forms in one stem need to be filled out).
Some should be modified (e.g., the plurals).
Some should be trimmed (adjectives with no positive).
There are some kludges (artificial entries which generate irregular forms)
here.  Some may just be excluded from the .PG .

----------------GKADJ

Sort GKADJ to get the various parts together for a multiple entry


      SORTER
        GKADJ.TXT
            1 75         --  STEMS
          111 80         --  MEAN  --  To order |'s
          101 10         --  TRAN
           77 24   P     --  PART
        GKADJ.TXT            --  Where to put result

Run DICTPAGE and edit.  This edit is straightforward but tedious.
I should prepare a procedure to do this automatically, but have not yet.
It is likely that there are few or no changes
from the previous run and those results can be used/modified.


The product is GKADJ.PG

----------------PRON1

This must be hand edited.  However it may not change much between versions.

----------------PRON5

Very small.

----------------UNIQUES

UNIQUES are treated by UNIQPAGE.EXE, giving UNIQPAGE.PG

----------------

----------------

The resulting files (with extensions appropriate to the phase of the operation,
ending in .PG) are

GKADJ
PRON1
PRON5
REST
UNIQPAGE
ZZZ

----------------FINISH

Assemble the 6 .PG files to DICTPAGE.PG and sort to produce DICTPAGE.RAW


  SORTER
        DICTPAGE.PG
           1 300  C      --  Everything
           1 300  A      --  For Caps
        DICTPAGE.RAW    --  Where to put result


Then process with PAGE2HTM and add PREAMBLE.TXT at the beginning and end BODY at the end
to get DICTPAGE.HTM

---------------------------------------------------------------------


------------------------------------------------------------------
----------------------THE SHORT FORM------------------------------
------------------------------------------------------------------

------  SORT DICTLINE

      SORTER
        DICTLINE.GEN
            1 75         --  STEMS
           77 24   P     --  PART
          111 80         --  MEAN  --  To order |'s
          101 10         --  TRAN
         DICTLINE.GEN    --  Where to put result


WAKEDICT/MAKEDICT

------  SORT STEMLIST IF NOT PROVIDED

       SORTER
         STEMLIST.GEN    --  Input
           1    18   U
           20   24   P
           1    18   A
           1    56   C
         STEMLIST.GEN    --  Output

MAKESTEM

MAKEEWDS

------  SORT EWDSLIST

       SORTER
         EWDSLIST.GEN
           1   24   A         --  Main word
           1   24   C         --  Main word for CAPS
          51    6   R         --  Part of Speech
          72    5   N    D    --  RANK
          58    1   D         --  FREQ
         EWDSLIST.GEN        --  Output

MAKEEFIL