Analysis Tool For Literary Texts

The first problem I wanted to solve was to write a short program that would allow me to perform basic textual analysis of any work of literature.

I wanted to be able to study the richness of different authors’ language by looking at how they used neologisms (their own made-up words) and pseudo-archaisms, invented their own contractions to convey authentic speech, or used hyphenated compound words. I also wanted to be able to list all the characters and place names (proper nouns) mentioned in a text.

The Plan

The goal was not to create a sterile tool for dissecting literary works by converting them into meaningless statistics, but rather to provide a simple means of comparing the lexicons used in different authors’ works. It would also permit a would-be author to analyse his or her own novel for overused adjectives or adverbs, or for inconsistent spelling of contractions or compound words, without having to become an expert in using regular expressions. The program would need access to a dictionary, but would also need to give users the option of using their own.
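The dictionary side of this plan reduces to set membership tests. Here is a minimal sketch, with hypothetical function names and sample words (the program itself reads one word per line from a dictionary file):

```python
# Minimal sketch of a dictionary check: load one word per line into a set,
# then report any words from the text that are not in it.
def load_dictionary(lines):
    """Build a lookup set from an iterable of dictionary lines."""
    return {line.strip() for line in lines if line.strip()}

def unknown_words(words, dictionary):
    """Return the sorted list of distinct words not found in the dictionary."""
    return sorted(w for w in set(words) if w not in dictionary)

# Hypothetical example data:
d = load_dictionary(["abide\n", "wisely\n", "zenith\n"])
print(unknown_words(["abide", "abideth", "zenith", "wooest"], d))
# → ['abideth', 'wooest']
```

Because lookups in a set are constant time, even a 274,512-entry dictionary can be checked word by word very quickly.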

I decided to test it on the following texts: an English translation of Homer’s Odyssey, Mark Twain’s Huckleberry Finn, Charles Dickens’ Great Expectations and James Joyce’s Ulysses. The plan was to give the program a good workout on pseudo-archaic English, authors’ neologisms, conversational contractions and proper nouns. Put simply, if my code could handle these texts, it could handle anything.

Results

Here’s an abridged version of the output my program produced when it was used on Homer’s Odyssey:

The following textual analysis was obtained using the command line:
AnalyseThis_v6.py -i "The_Odyssey.txt" -s "In a Council" -e "empires of the Past" -R 1000 -cadfgyHPt



ANALYSIS of The_Odyssey.txt

In the following analysis of the above text:
 
- Proper nouns have been listed separately and not counted in the word frequency analysis or checked against the dictionary. (-P)
 
- A list of possible linguistic archaisms the author/translator may have used has been listed, but also counted in the word frequency analysis. (-a)
 
- All contractions used by the author/translator have been listed, but also counted in the word frequency analysis. (-c)
 
- All hyphenated compound words the author/translator has used have been listed but then split into their component parts for the word frequency analysis. (-H)
 
- Most adverbs the author/translator has used (those ending in 'ly') have been listed, as well as being counted in the word frequency analysis. (-y)
 
- All gerunds and present tense continuous '-ing' words used by the author/translator have been listed, and counted in the word frequency analysis. (-g)
 
- All upper- and lower-case Roman numerals up to 1000 have been ignored as words. (-R)
 
- A spell check will be performed on all the words used (except proper nouns and compound words if listed separately) against the dictionary of your choice.
 
The word frequency analysis will take the form of an alphabetical list of the words used in the text, against the usage frequency of each. (-f)

 *****************************************************************

The following 523 (probable) proper nouns were used:    Abbott Acastus Achaean Achaeans Achaia Acheron Achilles Acroneus Adraste Aea Aeacus Aeaean Aeetes Aegae Aegisthus Aegyptus Aeolian ... Zacynthus Zethus Zeus

The following 147 (possible) linguistic archaisms were used:    abidest abideth adviseth askest attendeth avowest badest bearest bestoweth biddest biddeth bidst bindeth bringest bringeth ... wieldest winneth wooest

Only one contraction was used:    nurtur'd

The following 289 compound words were used:    a-washing apple-trees assembly-place axe-handle back-bent back-flowing barley-flour ... women-servants wood-clad wood-nymphs woven-work yard-arm

The following 175 (mostly) adverbs were used:    assembly assuredly barely beautifully belly blindly boldly briefly briskly busily carefully ceaselessly ... wisely wistfully woefully wondrously

The following 482 (probable) present tense continuous verbs and gerunds were used:    abiding accepting accomplishing according accosting achieving aiming anointing appearing asking assailing avoiding awaiting ... yearling yearning yelping yielding yoking

The following 205 words were not found in a dictionary of 274,512 entries:    abidest abideth admonitions adventured adviseth askest attendeth avowest ... waxeth wayed wearieth wentest wetteth wieldest winneth wooes wooest younglings


Word Frequency Analysis
 
Below is a sorted list of the author/translator's lexicon, with word frequencies, of their work The_Odyssey.txt:
a(1914) abase(1) abashed(2) abetted(1) abhor(1) abide(51) abides(6) abidest(1) abideth(1) abiding(8) able(6) aboard(9) abode(30) abolished(1) abolishes(1) about(175) ... your(80) yours(3) yourselves(6) youth(13) youths(21) zenith(2)


In this text of 130,713 words, the author/translator of The_Odyssey.txt used a total vocabulary of 5,789 words.

This textual analysis took 8.95 seconds.

 

As you can see, the tool shows how the translator used archaisms to convey the gravitas of classical speech, and felt compelled to invent many compound words to capture the richness and density of the original Ancient Greek. He also avoided contractions in speech, probably to make it sound more formal.
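The archaism hunt above comes down to a suffix test: the program's -a flag uses a regex over words ending in "eth", "dst", "est", "urst" and "ert" that are not found in the dictionary. A standalone sketch of the idea (the function name and sample words are illustrative, not the program's own):

```python
import re

# Words ending in these suffixes are flagged as possible archaisms
# (the same endings the -a flag's regex looks for):
ARCHAIC_SUFFIX = re.compile(r"\w+(eth|dst|est|urst|ert)$")

def possible_archaisms(words):
    """Return words whose endings suggest archaic verb forms."""
    return [w for w in words if ARCHAIC_SUFFIX.match(w)]

print(possible_archaisms(["abideth", "bringest", "abide", "wooing"]))
# → ['abideth', 'bringest']
```

Note that "est" also matches ordinary superlatives, which is exactly why the tool can only claim to find *possible* archaisms, as the Huckleberry Finn results below show.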

The analysis of Huckleberry Finn shows a very different use of language:

The following textual analysis was obtained using the command line:
AnalyseThis.py -i "Huckleberry Finn.txt" -s "David Widger" -e "End of the Project" -R 1000 -cadfgyHPt



ANALYSIS of Huckleberry Finn.txt

In the following analysis of the above text:
 
- Proper nouns have been listed separately and not counted in the word frequency analysis or checked against the dictionary. (-P)
 
- A list of possible linguistic archaisms the author/translator may have used has been listed, but also counted in the word frequency analysis. (-a)
 
- All contractions used by the author/translator have been listed, but also counted in the word frequency analysis. (-c)
 
- All hyphenated compound words the author/translator has used have been listed but then split into their component parts for the word frequency analysis. (-H)
 
- Most adverbs the author/translator has used (those ending in 'ly') have been listed, as well as being counted in the word frequency analysis. (-y)
 
- All gerunds and present tense continuous '-ing' words used by the author/translator have been listed, and counted in the word frequency analysis. (-g)
 
- All upper- and lower-case Roman numerals up to 1000 have been ignored as words. (-R)
 
- A spell check will be performed on all the words used (except proper nouns and compound words if listed separately) against the dictionary of your choice.
 
The word frequency analysis will take the form of an alphabetical list of the words used in the text, against the usage frequency of each. (-f)

 *****************************************************************

The following 350 (probable) proper nouns were used:    A-a-men Ab Abner Abram Acts Adam Adolphus African Ah Amen America Amighty Andy Angelina Ann Antonette Apthorps Arab … Whipple Whitechapel Wilks Wilkses Will William Williams Winn Wunst Yit

The following 20 (possible) linguistic archaisms were used:    awfulest beatenest carelessest cert'nly cussedest dadblamedest glidingest horriblest ignorantest lonesomest naturedest pisonest pitifulest powerfullest schooliest sejest thrillingest treacherousest troublesomest upest

The following 175 contractions were used:    I'd I'll I'm I'se I'spec I've ab'litionist ain't amaz'n and'll b'fo b'lieve b'long b'longs better'n bo'd'n borry'd c'lumbus ca'm cain't can't cert'nly cle'r … who'd whyd'nt won't wouldn't writ'n y'r yes'm yo'sef yo'self you'd you'll you're you've your'n

The following 594 compound words were used:    a-a-men a-barking a-beaming a-bear a-begging a-bilin a-biling a-blazing a-blowing a-blubbering a-booming a-bothering a-bragging a-branching a-brewing a-buyin … witch-things wood-flat wood-pile wood-rank wood-saw wood-yard wood-yards wool-gethering wore-out world-renowned yaller-boys yaller-fever yaller-jackets yard-stick yonder-way

The following 69 (mostly) adverbs were used:    actuly badly barely belly brotherly bully cert'nly certainly chilly chimbly contumely crawly curly deadly deeply desperately devoutly directly drawly … naturally nearly painstakingly partly perfectly piccadilly prob'bly properly really reely rightly ripply shackly sholy sickly silly simply skasely specially trembly truly unlikely wrongfully yearly

The following 518 (probable) present tense continuous verbs and gerunds were used:    abusing according acknowledging acting addling aggravating agoing aiming amazing amusing answering …  whirling whispering whistling whittling whooing whooping willing winding winking wiping wishing wondering working wrapping wrinkling writing yawning yelling

The following 431 words were not found in a dictionary of 274,512 entries:   acrost actuly agwyne alassin allycumpain aluz alwuz amost ancesters anywhers aroun arter asho astonishin awfulest awluz ...wishin woodboats workin worl wouldn woundin wunst wusshup wuth wuz yaller yellocute yellocution yisterday yistiddy yit yuther


Word Frequency Analysis
 
Below is a sorted list of the author/translator's lexicon, with word frequencies, of their work Huckleberry Finn.txt:
 I'd(98)  I'll(120)  I'm(80)  I'se(1)  I'spec(1)  I've(49)  a(3198)  ab'litionist(1)  able(2)  aboard(21)  abolitionist(1)  about(421)  above(19)  abreast(5)  abroad(1)  absent(2)  abuse(1)  abused(1)  … you're(47)  you've(29)  young(57)  younger(3)  your(200)  your'n(1)  yourn(4)  yours(4)  yourself(20)  yourselves(3)  yow(4)  yuther(7)  


In this text of 107,495 words, the author/translator of Huckleberry Finn.txt used a total vocabulary of 5,839 words.

 
This textual analysis took 5.67 seconds.

 

On the one hand, the tool shows how Twain used contractions in speech and compound words to convey the various dialects of the region. On the other, the program’s attempt to list archaisms instead revealed some of the many superlatives (e.g. ignorantest, schooliest) Huckleberry Finn uses as he tells his story. The many words not found in the dictionary also show Twain’s extensive use of misspelled words to reinforce Huck Finn’s lack of formal schooling.
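Spotting contractions like Twain's needs little more than an apostrophe test: the -c flag searches the words found for any containing an apostrophe, ignoring singular possessives. A minimal illustrative sketch of that idea (not the program's own function):

```python
def contractions(words):
    """Return words containing an apostrophe, skipping simple possessives
    ending in 's (which the real program strips out separately)."""
    return sorted({w for w in words
                   if "'" in w and not w.endswith("'s")})

print(contractions(["ain't", "b'lieve", "Tom's", "cert'nly", "river"]))
# → ["ain't", "b'lieve", "cert'nly"]
```

The trade-off, as noted in the program's usage text, is that genuine contractions ending in 's are discarded along with the possessives.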

The analysis of Great Expectations showed that Charles Dickens also liked to use contractions to capture contemporary authentic speech patterns. And at over 700 compound words, he also loved to use a hyphen whenever he could: the majority of his compound words are now either generally accepted as single words (gravedigger, maidservant) or written as two words (kitchen table, paper bags). Thus does language change.
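The compound-word handling used on texts like this (list the hyphenated form, then split it into its components for the frequency count, as the -H flag does) can be sketched as follows; the function name and sample words are illustrative only:

```python
def split_compounds(words):
    """Separate hyphenated compounds from plain words, returning the
    compound list plus the word list with the components folded back in."""
    compounds, plain = [], []
    for w in words:
        if "-" in w:
            compounds.append(w)
            # count each component as a word in its own right:
            plain.extend(part for part in w.split("-") if part)
        else:
            plain.append(w)
    return compounds, plain

print(split_compounds(["wood-pile", "river", "yard-arm"]))
# → (['wood-pile', 'yard-arm'], ['wood', 'pile', 'river', 'yard', 'arm'])
```

Splitting rather than discarding the components keeps the frequency analysis honest: "wood-pile" still contributes to the counts for "wood" and "pile".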

The following textual analysis was obtained using the command line:
AnalyseThis.py -i "Great Expectations.txt" -R 1000 -cadfgyPHt



ANALYSIS of Great Expectations.txt

In the following analysis of the above text:
 
- Proper nouns have been listed separately and not counted in the word frequency analysis or checked against the dictionary. (-P)
 
- A list of possible linguistic archaisms the author/translator may have used has been listed, but also counted in the word frequency analysis. (-a)
 
- All contractions used by the author/translator have been listed, but also counted in the word frequency analysis. (-c)
 
- All hyphenated compound words the author/translator has used have been listed but then split into their component parts for the word frequency analysis. (-H)
 
- Most adverbs the author/translator has used (those ending in 'ly') have been listed, as well as being counted in the word frequency analysis. (-y)
 
- All gerunds and present tense continuous '-ing' words used by the author/translator have been listed, and counted in the word frequency analysis. (-g)
 
- All upper- and lower-case Roman numerals up to 1000 have been ignored as words. (-R)
 
- A spell check will be performed on all the words used (except proper nouns and compound words if listed separately) against the dictionary of your choice.
 
The word frequency analysis will take the form of an alphabetical list of the words used in the text, against the usage frequency of each. (-f)

 *****************************************************************

The following 318 (probable) proper nouns were used:    Abel Aberdeen Abraham Administering African Ah Alexander Alick Amelia Amen America Anne Antony Antwerp April Arabian Argus Arter Arthur As Athens August Australia Ay Ayther Barnard Barnwell Bartholomew Bee Belinda …  Tooby Trabb Tuesday Vauxhall Waldengarver Walworth Waterloo Wednesday Wemmick Wemmicks Wery Westminster Whimple Whitefriars Will William Windsor Wolf Wopsle Wotever Xn Yarmouth Yorkshire

The following 7 (possible) linguistic archaisms were used:    continueth eyeth fleeth giveth knoweth oncommonest placidest

The following 83 contractions were used:    I'd I'll I'm I've a'most ain't an't calc'lated can't chrisen'd couldn't cuthen'th d'ye didn't doesn't don't draw'd for'ard good'un grow'd ha'porth hadn't han't hasn't haven't he'd he'll isn't it'll know'd look'ee ma'am mas'r mayn't mightn't mustn't needn't o'clock …

The following 706 compound words were used:    a-bed a-blazing a-eating a-fine-figure a-going addle-headed after-time ah-h ah-h-h alder-trees all-powerful already-mentioned arm-chair arm-chairs as-is as-ton-ishing back-falls back-room back-water back-yard ballast-lighters bank-note bank-notes ... white-perspiration whooping-cough wicket-gate wicket-keeping wide-awake wild-flowers winding-sheets window-glass window-seat wine-coopering witch-like working-clothes working-day working-days working-dress writing-table young-looking

The following 594 (mostly) adverbs were used:    absolutely absurdly abundantly accidentally accordingly accurately actively actually adequately advisedly affectionately agreeably alarmingly alternately amiably amply angrily anxiously apologetically apparently apply archly artfully ... vigorously violently virtuously vitally vivaciously vividly warily warmly watchfully weakly weekly wholesomely wholly wildly wilfully willingly womanly wonderfully worldly yearly

The following 1,087 (probable) present tense continuous verbs and gerunds were used:    abhorring accepting accompanying according accounting accumulating accusing aching acknowledging addressing adjoining admiring admitting adoring advancing affecting aggravating agonizing …  withdrawing withering wondering working worshipping wounding wrenching wrestling wriggling writing wrongdoing xcepting yelping yielding

The following 245 words were not found in a dictionary of 274,512 entries:   acquirements administered adwise afeerd aggravated alonger annum anwil anythink apartments apostrophizing …  whisperers wholesomer wicious wigor willage winegar wineglasses wiolent wisit wisiting wisitors wisits wittles wotever xcepting


Word Frequency Analysis
 
Below is a sorted list of the author/translator's lexicon, with word frequencies, of their work Great Expectations.txt:
 I'd(26)  I'll(85)  I'm(39)  I've(31)  a(4107)  a'most(5)  aback(1)  abandoned(2)  abased(1)  abashed(1)  abbey(1)  abear(2)  aberration(1)  abet(1)  abeyance(1)  abhorrence(4)  abhorrent(2)  …  younger(9)  your(397)  yourn(6)  yours(14)  yourself(51)  yourselves(2)  youth(9)  youthful(2)  youthfulness(1)  zeal(2)  zealous(2)  zest(1)  zip(1)  


In this text of 177,416 words, the author/translator of Great Expectations.txt used a total vocabulary of 10,676 words.

 
This textual analysis took 9.09 seconds.

 

The most remarkable use of language out of the four texts was in James Joyce’s Ulysses. I’ve included a synopsis of the analysis here:

The following textual analysis was obtained using the command line:
AnalyseThis.py -i "Ulysses.txt" -R 1000 -cadfgyPH



ANALYSIS of Ulysses.txt

In the following analysis of the above text:
 
- Proper nouns have been listed separately and not counted in the word frequency analysis or checked against the dictionary. (-P)
 
- A list of possible linguistic archaisms the author/translator may have used has been listed, but also counted in the word frequency analysis. (-a)
 
- All contractions used by the author/translator have been listed, but also counted in the word frequency analysis. (-c)
 
- All hyphenated compound words the author/translator has used have been listed but then split into their component parts for the word frequency analysis. (-H)
 
- Most adverbs the author/translator has used (those ending in 'ly') have been listed, as well as being counted in the word frequency analysis. (-y)
 
- All gerunds and present tense continuous '-ing' words used by the author/translator have been listed, and counted in the word frequency analysis. (-g)
 
- All upper- and lower-case Roman numerals up to 1000 have been ignored as words. (-R)
 
- A spell check will be performed on all the words used (except proper nouns and compound words if listed separately) against the dictionary of your choice.
 
The word frequency analysis will take the form of an alphabetical list of the words used in the text, against the usage frequency of each. (-f)

 *****************************************************************

The following 3,857 (probable) proper nouns were used:    Aaron Abba Abe Abeakuta Abeakutic Abraham Abram Abramovitz Abrines Abu Abulafia Accep Acclimatised Achates Achilles Acky … Yessex Yogibogeybox Yooka Yorkshire Youkstetter Yous Yrfmstbyes Yulelog Yum Yummyyum Yumyum Yvonne Zarathustra Zaretsky Zermatt Zion Zoe Zouave Zrads Zulu Zulus Zut

The following 45 (possible) linguistic archaisms were used:    allwisest beginneth beholdeth bestest biddeth bloometh broughtedst buckteeth c'est cert dismisseth dureth est farst foldeth followeth gert ... sitteth stealeth takest thoughtest wastest weepest wirst

The following 113 contractions were used:    I'd I'll I'm I've all'erta amn't anch'io aren't bennett'll bless'd bloom's boylan's c'est c'tait can't comb'd couldn't d'aisance d'arcy d'espagne d'hte d'oeil … they're thou'll tutt'amor virag's wasn't we'd we'll we're we've weren't wha'll what'll who'd who'll won't wouldn't y's you'd you'll you're you've

The following 101 compound words were used:    as-is austrian-hungarian austro-hungarian author-journalist believe-on-me bird-in-the-hand borris-in-ossory brother-in-law brothers-in-law brothers-in-love … tweedy-flower twenty-eight twenty-fifth two-in-the-bush vi-cocoa vice-chancellor voulez-vous will-o'-the-wisps zoe-fanny

The following 933 (mostly) adverbs were used:    abjectly abnormally abruptly absently absolutely abstractedly accidentally accommodatingly accordingly accurately actively actually admirably admiringly … winsomely wirily wisely wishly wobbly womanly wonderfully wondrously woolly worldly wrinkly wrongfully yearly yellowly youngly

The following 1,957 (probable) present tense continuous verbs and gerunds were used:    abandoning abetting ableeding abounding absorbing accepting accompanying accomplishing according accosting … worshipping worsting wotting wrastling wrenching wrestling wriggling wrinkling writhing writing wrongdoing yachting yapping yawning yearning yelling yelping yielding youngling zigzagging

The following 4,197 words were not found in a dictionary of 274,512 entries:   ableeding ablossom abovementioned abstrusiosities ac accidens acclimatised accompanable accretions acoming acracking actione acuminated addleaddle adiaphane adiutorium administered ador adrinking adversion afasting affly … wrongways wrynecked wus yachtingcap yadgana yankee yellowjohns yellowkitefaced yellowslobbered yewfronds yilo yoghin youre yous yrs yu yum yumyum yung ywimpled zivio zmellz zoe zrads zurich


Word Frequency Analysis
 
Below is a sorted list of the author/translator's lexicon, with word frequencies, of their work Ulysses.txt:
 I'd(19)  I'll(116)  I'm(152)  I've(12)  a(6581)  aback(1)  abaft(1)  abandon(1)  abandoned(7)  abandoning(1)  abandonment(1)  abasement(2)  abatement(1)  abattoir(1)  abbas(2)  abbess(1)  abbey(13)  … zingari(1)  zip(1)  zivio(1)  zmellz(1)  zodiac(2)  zodiacal(2)  zoe(2)  zones(1)  zoo(2)  zoological(1)  zouave(1)  zrads(3)  zurich(1)  


In this text of 248,037 words, the author/translator of Ulysses.txt used a total vocabulary of 25,761 words.

This textual analysis took 13.31 seconds.

 

He excels in three remarkable areas. First, at 3,857 proper nouns, he mentions more than ten times as many people and place names as Dickens does in Great Expectations (318). Second, at nearly 4,200 words not found in the dictionary, he has invented almost ten times as many words as Mark Twain created in the voice of Huckleberry Finn (431). Third, the size of his lexicon is immense. At over 25,700 words, it simply dwarfs those of The Odyssey (5,800), Great Expectations (10,700) and Huckleberry Finn (5,800). It is as if the English language is simply not big enough for James Joyce.
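All the headline numbers in these analyses (total word count, vocabulary size, word frequencies) fall out of one frequency histogram. A minimal sketch using collections.Counter (the real program builds its own dict and does far more cleaning, as the code below shows; this version deliberately ignores punctuation):

```python
from collections import Counter

def lexicon_stats(text):
    """Return (total word count, vocabulary size, frequency histogram)
    for a naively whitespace-tokenised, lower-cased text."""
    words = text.lower().split()
    hist = Counter(words)
    return sum(hist.values()), len(hist), hist

total, vocab, hist = lexicon_stats("the sea the wine dark sea")
print(total, vocab, hist["sea"])
# → 6 4 2
```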

The Python Code

Here’s the code that was used to generate these textual analyses:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""\nPerforms a textual analysis of any given text, giving a summary of
the frequencies of words used, grouping the words by frequency (the
default) or listing them alphabetically with their individual frequencies (-f).
Calculates the size of the vocabulary used.
 
Additionally, the following optional analyses may be performed: a spellcheck
against a default (-d) or user-specified dictionary (-D), plus lists of any or
all of the following grammatical features used by the author/translator:
archaisms, contractions, gerunds, compound words, adverbs and proper nouns.
 
If called from the command line, the output is printed as an unformatted
continuous string to the std output, with paragraph breaks but no line- or
page-breaks. If redirected, it will therefore self-format to your own page
size. If called as a module by another program, it will return the output to
the calling program as a continuous string.
 
Regarding the dictionary, the code requires a dictionary called
EnglishDictionary.txt in the same folder as this source code, unless you
specify an alternative path to your own named dictionary using the -D flag.
 
Regarding file paths in the command line: relative paths such as
'../../The_Odyssey.txt' are understood. If you're using Windows, make
sure you use the Unix forward slash '/', not the Windows backslash '\\', in
your dictionary paths.
 
 
Usage: from the command line type:
    AnalyseThis_v4.py -i <input_file> [-s <start_string>] [-e <end_string>]
    [-D <your_dict_filepath>] [-R <N>] [-acdfghyHPt] [> output_file]
 
    -i rel_path/input_file
        Mandatory flag: instructs the program to analyse the text in the file
        with name input_file. This can be filename, or a complete or relative
        path name.
 
Optional flags (in order of typical usage):
    -s "start trigger text"
        Don't start analysing the text in the input_file until after you've
        found the start trigger string in the input text file. Defaults to the
        start of the input file.
 
    -e "end trigger text"
        Stop analysing when you have found the end trigger text in the input
        file. This string is not analysed. Defaults to the EOF of the input
        file.
 
    -d
        Check all words against the program's built-in dictionary, listing
        those not in the dictionary. Default is not to spellcheck.
 
    -D rel_path/dictionary
        Use the user-specified dictionary in rel_path/dictionary to check the
        validity of all words found. Default is not to check any words.
        Any word not in the dictionary is listed, in addition to its listing
        in the frequency analysis. Overrides '-d'.
 
    -P
        Omit proper nouns from the word frequency analysis and dictionary
        check and summarise them in a separate list.
 
    -H
        List separately all hyphenated compound words used. Leaves them out of
        the word frequency analysis and dictionary check (if selected) by
        splitting them into their component words.
 
    -a
        List all simple archaisms used. Uses a regex to search for words
        not found in the dictionary ending in: "eth", "dst", "est", "urst"
        and "ert".
 
    -c
        List all contractions used. Searches the list of words found for any
        containing an apostrophe. Ignores singular possessives, which are
        removed from the word frequency analysis.
 
    -g
        List all gerunds and present tense continuous words ending in '-ing',
        removing them from the word frequency analysis.
 
    -y
        List all adverbs ending in '-ly'.
        Note 1: does not find all adverbs. Misses those adverbs that do not
        end in '-ly', e.g. fast, well.
        Note 2: Also counts words that end in '-ly' that are not always adverbs,
        e.g. early, fly, really, kingly.
        Note 3: Still counts the words in the main list in their own right.
        Reason: working out the base words is not always obvious. Many -ly
        adverbs are not formed simply by adding '-ly', e.g. laughably, daily,
        wearily.
 
    -R <N>
        Ignore Roman numerals up to maximum N, e.g. i, ii/II, iii/III, iv/IV.
        Note 1: with capital Romans, only those from II and up are ignored.
        If this flag is not set, they will be counted as words. Even if set,
        all instances of the Roman number I will be counted as occurrences of
        the first person singular 'I'.
 
    -f
        Print the analysis section as an alphabetically sorted list with the
        associated word frequencies, e.g.: apple(4) as(345) ... zebra(1).
        Replaces the normal list grouped by frequency.
 
    -t
        Time the running of the program, time stamping the output file/stream
        with how long the analysis took.
 
    -h
        Help: prints this long version of the usage.
 
 
Example command line, using the Gutenberg edition of the 1879
Butcher and Lang translation of Homer's Odyssey:
 
    AnalyseThis.py -i "The Odyssey.txt" -s "investigation" -e "Homer, thy song"
    -R 1000 -cadfgyHP > OdysseyAnalysis.txt
 
author: matta_idlecoder at protonmail.com
"""
import timeit
import string
import re
import os
import sys
import getopt
 
result_string = ''
 
 
def output(*args):
    """Assembles output result as a returned continuous string, or prints it
 
    Action depends on how the module has been called. By assembling the result
    as an output string, it allows the result to be returned to the calling
    program, if that's how the module is called. Allows a calling GUI wrapper
    to display it however it wants to. If it has been invoked from the
    command line, it prints the result to the std output, separated by
    spaces, rather than the default '\\n' between items. Works for single items,
    or any iterable object.
    """
    global result_string
    if __name__ == '__main__':
        for thing in args:
            print(thing, end=' ')
    else:  # code has been imported by another program. Augment the return string:
        for thing in args:
            result_string += str(thing) + ' '
    return
 
 
def print_list_items(input_list):
    """ Prints out any list, separating the items by spaces
 
    Output is one long text string onto multiple lines, to allow the output
    to be formatted to the page size of the display.
    """
    for item in input_list:
        output(item)
    return
 
 
def to_romancaps(x):
    """ Returns the uppercase Roman numerals equivalent of an Arabic number.
 
    modified from http://rosettacode.org/wiki/Roman_numerals/Encode#Python
    """
    ret = []
    anums = [1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1]
    rnumscaps = "M CM D CD C XC L XL X IX V IV I".split()
    for a,r in zip(anums, rnumscaps):
        n,x = divmod(x,a)
        ret.append(r*n)
    return ''.join(ret)
 
 
def to_romanlower(x):
    """ Returns the lowercase Roman numerals equivalent of an Arabic number.
 
    modified from http://rosettacode.org/wiki/Roman_numerals/Encode#Python
    """
    ret = []
    anums = [1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1]
    rnumslower = "m cm d cd c xc l xl x ix v iv i".split()
    for a,r in zip(anums, rnumslower):
        n,x = divmod(x,a)
        ret.append(r*n)
    return ''.join(ret)
 
 
def GetRomanNumerals(UptoMaxNumeral=100, lowercase=True,
                     uppercase=True):
    """ Returns a list of Roman numerals, in both upper or lower case, or both.
 
    modified from http://rosettacode.org/wiki/Roman_numerals/Encode#Python
    """
    roman_num_caps, roman_num_lower = [], []
    romannums = []
 
    for val in range(1,UptoMaxNumeral+1):
        if lowercase:
            roman_num_lower.append(to_romanlower(val))
        if uppercase:
            roman_num_caps.append(to_romancaps(val))
 
    romannums = roman_num_caps + roman_num_lower
    return romannums
 
 
def tidy_ends(letter_list):
    """strips non-alpha ends from a character list, leaving letters at each end
    """
    if not letter_list:
        return ''
    while letter_list[0] not in string.ascii_letters:
        del letter_list[0]
        if len(letter_list) == 0:
            return ''
    while letter_list[-1] not in string.ascii_letters:
        del letter_list[-1]
        if len(letter_list) == 0:
            return ''
    return letter_list
 
 
def clean_up(word):
    """ Cleans a word of any punctuation & numbers, returning it as lower case
 
    Algorithm is to remove surrounding special chars, then trailing <'s>
    then everything else that isn't an alpha or an apostrophe.
    """
    clean_text_str = ''
    new_text = []
 
    if not word:
        return word
 
    # word is a string, which is immutable in Python, we need a list:
    char_list = list(word)
    CharsToKeep = list(string.ascii_letters) + ["'"] + ['-']
 
    """ A more comprehensive list is obviously possible. Net has been omitted
    as it is also a valid word:  """
    if re.match(r'\b(http|www|com|org|edu|gov)\b', word, flags=0):
        return ''
 
    char_list = tidy_ends(char_list)
    if not char_list:
        # the word was entirely punctuation and/or digits:
        return ''
 
    if char_list[-2:] == ["'", "s"]:
        # remove obvious possessives and contractions that use <'s>
        char_list = char_list[:-2]
 
    # if we got this far, whatever it is, it must have a letter on each end:
    for char in char_list:
        if char in CharsToKeep:
            # get rid of everything except letters and apostrophes:
            new_text.append(char)
 
    if len(new_text) == 0:
        # Should never be true, but just in case:
        return ''
    else:
        # convert mutable list of chars back to immutable string:
        clean_text_str = ''.join(new_text)
 
    return clean_text_str
 
 
def CreateDictionarySet(DictionaryPath):
    """ Fetches a dictionary file, returning a dictionary word set.
    """
    dictionary_wordset, KnownProperNouns = set(), set()
 
    with open(DictionaryPath) as dict_file:
        for line in dict_file:
            word = line.strip()            # assumes one word per line
            dictionary_wordset.add(word)
 
    """ NOTE: This is necessary for the simple reason that you can't modify an
    iterable while indexing it:
    """
    wordset_copy = dictionary_wordset.copy()
 
    """This bit cleans up the dictionary_set by removing single letters other
    than 'a' and 'I'. It then creates a second set of KnownProperNouns after
    removing them from the dictionary:
    """
    for word in wordset_copy:
        if len(word) == 1 and word not in ('a', 'I'):
            dictionary_wordset.discard(word)
        if len(word) > 1 and word.istitle():
            dictionary_wordset.discard(word)
            KnownProperNouns.add(word)
 
    return (dictionary_wordset, KnownProperNouns)  # 2 sets with no intersections
 
 
def create_word_hist_for(text_file_path, dict_word_set, proper_nouns_in_dict,
                         start_trigger='', stop_trigger='',
                         split_n_list_compounds=True,
                         listing_proper_nouns=True,
                         listing_contractions=True,
                         removing_romans=True,  roman_maximux=1000):
    """ Creates a word frequency histogram from the input file
 
    Returns a dictionary with unique words as keys and their frequencies as
    found in the text as the value for that key. Removes all unwanted
    characters from the words. Only scans between the lines containing
    start_trigger and stop_trigger. If stop_trigger="" it scans to the EOF.
    """
    current_line_words = []
    word_freq_hist, in_text = dict(), False
    RomanNumList = []
    compounds_found, ContractionsUsed, proper_nouns_found = set(), set(), set()
 
    # checks for the most likely punctuation marks ',' and '.' first:
    punc_marks_to_split_with = list(string.punctuation[11:] +
                                    string.punctuation[:11])
    punc_marks_to_split_with.remove("'") # leave in apostrophes for contractions
    punc_marks_to_split_with.remove("-") # leave in hyphens for compounds
 
    with open(text_file_path, newline='') as File:
 
        if removing_romans:
            RomanNumList = GetRomanNumerals(roman_maximux)
 
        # start the analysis from the beginning, without looking for a start string:
        if not start_trigger:
            in_text = True
 
        for line in File:
            # this bit is skipped if the start_string is a null string:
            if not in_text:
                # YET - can only be here if start_trigger is set to something:
                if start_trigger not in line:
                    # then you still haven't found what you're looking for, Bono:
                    continue   # go to the next line of the text
                else:
                    # then you've found the first line of the body text:
                    in_text = True
                    # go to the next line of the text. Don't process this one:
                    continue
 
            if stop_trigger != "" and stop_trigger in line:
                # breaks out the current for-loop.
                # Stops reading the lines from the file:
                break
 
            elif not stop_trigger or stop_trigger not in line:  # Analyse the line
                #  Assumes the space char is the delimiter:
                current_line_words = line.split()
 
                for char_cluster in current_line_words:
                    multiple_words = []
                    char_cluster = ''.join(tidy_ends(list(char_cluster)))
 
                    if len(char_cluster) == 0:
                        continue
 
                    elif char_cluster.isalpha():
                        multiple_words.append(char_cluster)  # a list of one word
 
                    else:
                        if '--' in char_cluster:
                            # two hyphens, often found in 19th-century English texts:
                            multiple_words = char_cluster.split('--')
 
                        elif '—' in char_cluster:
                            # an em dash, which some printers use.
                            # Not in string.punctuation because it isn't ASCII:
                            multiple_words = char_cluster.split('—')
 
                        elif '-' in char_cluster:
                            char_cluster = char_cluster.lower()
                            char_cluster = clean_up(char_cluster)
                            if split_n_list_compounds:
                                compounds_found.add(char_cluster)
                                multiple_words = char_cluster.split('-')
                            else:
                                multiple_words.append(char_cluster) # a list of one word
 
                        elif "'" in char_cluster:
                            multiple_words.append(char_cluster) # a list of one word
 
                        else:
                            # algorithm is simply to split on the first punctuation
                            # mark found and delete the rest in clean_up():
                            for punc_mark in punc_marks_to_split_with:
                                if punc_mark in char_cluster:
                                    multiple_words = char_cluster.split(punc_mark)
                                    break
 
 
                    for word in multiple_words:
                        #  treat case as a list of words, even if it's just one,
                        # or one hyphenated word:
                        word = clean_up(word)
                        if len(word) == 0:
                            continue
 
                        """
                        This section decides whether to allow the word to
                        be added to the word frequency histogram. Words are
                        rejected with 'continue' based on simple criteria
                        of what makes a word. Word types the user has listed
                        separately, such as gerunds, are still counted in
                        the word frequency histogram:
                        """
                        if (len(word) == 1) and (word != 'I') and (word.lower() != 'a'):
                            """
                            Eliminates single letters other than <I>, <A>
                            and <a>, such as letters used in bullet points,
                            etc. This will also remove the Roman numerals
                            v/V and x/X:
                            """
                            continue   # don't count it in word_freq_hist below
 
                        # Find all contractions (and O'/M' proper names):
                        if word[:2] in ("O'", "M'"):
                            # you've found O'Neil, O'CONNOR, O'Brien, etc:
                            if word.isupper():
                                # takes care of O'CONNOR:
                                word = word[:3] + word[3:].lower()
                            proper_nouns_found.add(word)
                            if listing_proper_nouns:
                                continue # don't count it in word_freq_hist below
 
                        elif "'" in word and word[0] != 'I' :
                            # finds Didn't, DON'T, o'clock, Wouldn't, wouldn't:
                            # but not I've, I'd, I'm, I'll:
                            word = word.lower()
                            if listing_contractions:
                                # add it to the contraction count, but keep it in the analysis:
                                ContractionsUsed.add(word)
                                # continue
 
                        elif len(word) > 1 and word[0].lower() == 'i' and word[1] == "'":
                            # You've found I've, I'd, I'm, I'll, even if the first
                            # letter is found in lowercase:
                            word = word.capitalize()
                            if listing_contractions:
                                # add it to the contraction count, but keep it in the analysis:
                                ContractionsUsed.add(word)
                                # continue
 
                        elif listing_contractions and "'" in word:
                            # cleanup() means it must be inside the word:
                            word = word.lower()
                            # add it to the contraction count, but keep it in the analysis:
                            if listing_contractions:
                                # add it to the contraction count, but keep it in the analysis:
                                ContractionsUsed.add(word)
                                # continue
 
                        # Find all proper nouns:
                        elif removing_romans and (word.lower() in RomanNumList) and (word != 'I'):
                            # Need to check this before we do proper nouns:
                            continue # don't count it in word_freq_hist below
 
                        elif word in proper_nouns_in_dict: # among those found in the dictionary
                            #  picks up dictionary entries like <Grant> & <Abba> :
                            proper_nouns_found.add(word)
                            if listing_proper_nouns:
                                continue # don't count it in word_freq_hist below
 
                        elif word.isupper():
                            if word.lower() in dict_word_set:
                                word = word.lower()
                            else:
                                word = word.title()
                                # Must assume word is a proper noun
                                proper_nouns_found.add(word)
                                if listing_proper_nouns:
                                    continue # don't count it in word_freq_hist below
 
                        elif word.istitle() :
                            if word.lower() in dict_word_set:
                                word = word.lower()
                            else:  # Catches capitalized words unknown in lower case
                                # don't ignore <I>, even if you're listing_proper_nouns:
                                if listing_proper_nouns and len(word) > 1:
                                    proper_nouns_found.add(word)
                                    continue # don't count it in word_freq_hist below
 
                        elif word[0].isupper() and not word.istitle():
                            # Catches camelcase Scots surnames such as MacDonald, McNeil:
                            proper_nouns_found.add(word)
                            if listing_proper_nouns:
                                continue # don't count it in word_freq_hist below
 
                        # count all the words not filtered out:
                        word_freq_hist[word] = word_freq_hist.get(word, 0) + 1
 
    not_in_dict, compounds_found = CreateNotInList(word_freq_hist,
                    dict_word_set, split_n_list_compounds, compounds_found)
 
    return (word_freq_hist, not_in_dict, proper_nouns_found, compounds_found,
            ContractionsUsed)
 
 
def reverse_dict(input_dict):
    """ Inverts a dictionary, mapping each value to the list of keys that
    shared that value.
    """
    output_dict = dict()
    for key in input_dict:
        value = input_dict[key]
        if value not in output_dict:
            # several keys may share a value, so collect those keys in a list:
            output_dict[value] = [key]
        else:
            output_dict[value].append(key)
    return output_dict
 
 
def PrintInitialSummary(text_name, SplitNListingCompounds,
                        ignore_n_list_propers, IgnoreNListContractions,
                        listing_archaisms, listing_adverbs, listing_gerunds,
                        CheckingSpelling, removing_Romans, RomanMaximus,
                        SortByAlpha):
    """Prints the summary of the analysis, listing all the options chosen.
 
    """
    output('\n\n\nANALYSIS of {}\n\nIn the following analysis of the above text:\n'.
           format(text_name))
 
    if ignore_n_list_propers:
        ProperNounInfo = "\n- Proper nouns have been listed separately and not "
        ProperNounInfo += "counted in the word frequency analysis or checked "
        ProperNounInfo += "against the dictionary. (-P)\n"
        output(ProperNounInfo)
    else:
        output("\n- Proper nouns have been counted in the word frequency analysis.\n")
 
    if listing_archaisms:
        ArchaismInfo = "\n- A list of possible linguistic archaisms the "
        ArchaismInfo += "author/translator may have used has been listed, but "
        ArchaismInfo += "also counted in the word frequency analysis. (-a)\n"
        output(ArchaismInfo)
 
    if IgnoreNListContractions:
        IgnoringContInfo = "\n- All contractions used by the author/translator "
        IgnoringContInfo += "have been listed, but also counted in the word "
        IgnoringContInfo += "frequency analysis. (-c)\n"
        output(IgnoringContInfo)
    else:
        IgnoringContInfo = "\n- All contractions used by the author/translator"
        IgnoringContInfo += " have been counted as words in the word frequency"
        IgnoringContInfo += " analysis.\n"
        output(IgnoringContInfo)
 
    if SplitNListingCompounds:
        CompoundInfo = "\n- All hyphenated compound words the author/translator"
        CompoundInfo += " has used have been listed but then split into their "
        CompoundInfo += "component parts for the word frequency analysis. (-H)\n"
        output(CompoundInfo)
    else:
        CompoundInfo = "\n- All hyphenated compound words the author/translator"
        CompoundInfo += " has used have been left unbroken and included in the "
        CompoundInfo += "word frequency analysis.\n"
        output(CompoundInfo)
 
    if listing_adverbs:
        AdverbInfo = "\n- Most adverbs the author/translator has used (those "
        AdverbInfo += "ending in 'ly') have been listed, as well as being "
        AdverbInfo += "counted in the word frequency analysis. (-y)\n"
        output(AdverbInfo)
 
    if listing_gerunds:
        GerundInfo = "\n- All gerunds and present tense continuous '-ing' "
        GerundInfo += "words used by the author/translator have been listed, "
        GerundInfo += "and also counted in the word frequency analysis. (-g)\n"
        output(GerundInfo)
 
    if removing_Romans:
        RomanInfo = "\n- All upper- and lower-case Roman numerals up to "
        RomanInfo += "{} have been ignored as words. (-R)\n".format(RomanMaximus)
        output(RomanInfo)
    else:
        output("\n- All upper- and lower-case Roman numerals have been counted as words.\n")
 
    if CheckingSpelling:
        CheckSpellInfo = "\n- A spell check will be performed on all the words"
        CheckSpellInfo += " used (except proper nouns and compound words if "
        CheckSpellInfo += "listed separately) against the dictionary of your "
        CheckSpellInfo += "choice.\n"
        output(CheckSpellInfo)
 
    if SortByAlpha:
        SortInfo = "\nThe word frequency analysis will take the form of an "
        SortInfo += "alphabetical list of the words used in the text, against "
        SortInfo += "the usage frequency of each. (-f)\n\n"
        output(SortInfo)
    else:
        SortInfo = "\nThe word frequency analysis will take the form of a list"
        SortInfo += " of the words used sorted by frequency, starting with the"
        SortInfo += " least frequent words.\n\n"
        output(SortInfo)
 
    output(65*'*')
    return
 
 
def PrintResultsFor(ThisList, SubsetName, WithChars=['']):
    """ Returns any elements of ThisList containing a regex listed in WithChars
 
    If the kwarg WithChars[] is not reassigned during the function call,
    the Regex defaults to '', which will always give a positive match in
    re.search(Regex, this_word), and print all of ThisList[]
    """
    ListSubset = set()
 
    for Regex in WithChars:
        for this_word in ThisList:
            if re.search(Regex, this_word):  # always gives a match for  Regex = ''
                ListSubset.add(this_word)
 
    if len(ListSubset) == 0:
        output('\n\nNo {}s were used.'.format(SubsetName))
    elif len(ListSubset) == 1:
        output('\n\nOnly one {} was used:   '.format(SubsetName))
        print_list_items(sorted(list(ListSubset)))
    else:
        output('\n\nThe following {:,} {}s were used:   '.
               format(len(ListSubset), SubsetName))
        print_list_items(sorted(list(ListSubset)))
    return
 
 
def PrintResultLists(text_name, vocab_list, not_in_dict, dict_wordset,
                     proper_nouns_found, compounds_used, contractions_found,
                     listing_proper_nouns=False, listing_archaisms=False,
                     ignore_n_list_contractions=False,
                     listing_and_splitting_compounds=False,
                     listing_adverbs=False, listing_gerunds=False,
                     checking_spelling=False, remove_romans=True,
                     greatest_roman=100, sorting_by_alpha=True):
    """Prints out the various lists of results, based on user options
    """
    PrintInitialSummary(text_name, listing_and_splitting_compounds,
                     listing_proper_nouns, ignore_n_list_contractions,
                     listing_archaisms, listing_adverbs, listing_gerunds,
                     checking_spelling,remove_romans, greatest_roman,
                     sorting_by_alpha)
 
    if listing_proper_nouns:
        PrintResultsFor(list(proper_nouns_found), '(probable) proper noun')
 
    if listing_archaisms:
        """ Checketh for early 17th century Bible English that English
        translators loveth to pepper their works with and giveth it that
        Olde Feele:
        """
        ArchaismList = [r"eth\b", r"dst\b", r"rst\b", r"ert\b", r"est\b"]
        PrintResultsFor(set(not_in_dict), '(probable) linguistic archaism',
                        WithChars=ArchaismList)
 
    if ignore_n_list_contractions:
        PrintResultsFor(contractions_found, 'contraction') # print the whole list
 
    if listing_and_splitting_compounds:
        PrintResultsFor(compounds_used, 'compound word') # print the whole list
 
    if listing_adverbs:
        # RE counts words ending in -ly that may also be contractions and compounds:
        PrintResultsFor(vocab_list, '(mostly) adverb', WithChars=[r"[a-z'-]{3,}ly\b"])
 
    if listing_gerunds:
        PrintResultsFor(vocab_list, '(probable) present tense continuous verbs and gerund',
                        WithChars=[r'[a-z]{3,}ings?\b'])
 
    if checking_spelling:
        if ignore_n_list_contractions:
            not_in_dict -= contractions_found   # set subtraction
        output('\n\nThe following {:,} words were not found in a dictionary of {:,} entries:  '.
               format(len(not_in_dict), len(dict_wordset)))
        print_list_items(sorted(not_in_dict))
    return
 
 
def CreateNotInList(word_histogram, DictWordSet, SplitAndListCompounds, CompoundsFound):
    """ Sort the words not found in the dictionary
    """
    NotInDict = set()
    histo_copy = word_histogram.copy()
    for word, freq in histo_copy.items():
        if word not in DictWordSet:
            # create list of words not in dictionary:
            if SplitAndListCompounds and '-' in word:
                # one that got away earlier if '--' or '—' was in the line:
                CompoundsFound.add(word)
                del word_histogram[word]  # remove compound word from word hist
            else:
                NotInDict.add(word)
    return NotInDict, CompoundsFound
 
 
def PrintWordHistogram(TextName, WordHistogram, AlphaSort):
    """ Print the output of the analysis
 
    """
    RevWordHistogram, FreqList = dict(), []
    output('\n\n\nWord Frequency Analysis\n')
 
    if AlphaSort:  # condensed frequency histogram output
        output("\nBelow is a sorted list of the author/translator's lexicon, with word frequencies, of their work {}:\n".
               format(TextName))
        sorted_hist = sorted(WordHistogram)
        for word in sorted_hist:
            output('{0}({1}) '.format(word, WordHistogram[word]))
 
    else:  # long verbose output, sorting words into frequencies:
        RevWordHistogram = reverse_dict(WordHistogram)  # frequencies are now the keys
        FreqList = list(RevWordHistogram) # a list of the keys (the frequencies)
        FreqList.sort(reverse=True)  # the frequencies, in descending order
 
        output('\nBelow is a word frequency analysis of {}. '.format(TextName))
        output('Starting from the least common words in the text:')
 
        for i in range(len(FreqList)-1, -1, -1):  # count backwards from the last item:
            if FreqList[i] == 1:  # words that appear only once
                output('\n\nThe following {:,} words appear only once: '.format(
                len(RevWordHistogram[FreqList[i]])))
 
            elif FreqList[i] == 2:  # words that appear only twice:
                output('\n\nThe following {:,} words appear only twice:   '.format(
                len(RevWordHistogram[FreqList[i]])))
 
            elif len(RevWordHistogram[FreqList[i]]) == 1:
                # common words that don't share their count with other words:
                output('\n\nThis word appears a total of {:,} times:   '.format(
                FreqList[i]))
 
            else:   #  multiple words sharing their count with other words
                output('\n\nThe following {:,} words each appear {:,} times:  '.format(
                len(RevWordHistogram[FreqList[i]]), FreqList[i]))
 
            print_list_items(sorted(RevWordHistogram[FreqList[i]]))
 
    TotalWordsUsed = sum(WordHistogram.values())
    vocab_size = len(WordHistogram)
    output('\n\n\nIn this text of {:,} words, the author/translator of {} used a total vocabulary of {:,} words.\n\n'.
           format(TotalWordsUsed, TextName, vocab_size))
    return
 
 
def analyse_this(TextFilePath, DictPath, SortByAlpha=True,
                 StartTrigger="", StopTrigger="", CheckingSpelling=True,
                 SplitAndListCompounds=True, ListingArchaisms=True,
                 IgnoreNListProperNouns=True, ListingContractions=True,
                 ListingAdverbs=True, ListingGerunds=True,
                 RemovingRomans=True, RomanMaximus=1000):
    """ Perform a textual analysis, as defined by the user.
    """
    word_histogram, VocabList = dict(), []
    DictWordSet, ProperNounsInDict, ProperNounsFound = set(), set(), set()
    CompoundsFound, ContractionsFound, NotInDict = set(), set(), set()
 
    """Need to reset this explicitly as a workaround for data persistence
    between calls to this function from an external module, such as a GUI
    front end:
    """
    global result_string
    result_string = ''
 
    """ These return sets, not dicts. These 2 sets have no intersections.
    Would benefit from pickling for data persistence, instead of recreating it
    every time: """
    DictWordSet, ProperNounsInDict = CreateDictionarySet(DictPath)
 
 
    (word_histogram, NotInDict, ProperNounsFound, CompoundsFound, ContractionsFound) = \
    create_word_hist_for(TextFilePath, DictWordSet, ProperNounsInDict,
                    start_trigger=StartTrigger, stop_trigger=StopTrigger,
                    split_n_list_compounds=SplitAndListCompounds,
                    listing_proper_nouns=IgnoreNListProperNouns,
                    listing_contractions=ListingContractions,
                    removing_romans=RemovingRomans,
                    roman_maximux=RomanMaximus)
 
 
    VocabList = list(word_histogram.keys())
    text_path, TextName = os.path.split(TextFilePath)
 
    PrintResultLists(TextName, VocabList, NotInDict, DictWordSet,
                     ProperNounsFound, CompoundsFound, ContractionsFound,
                     listing_proper_nouns=IgnoreNListProperNouns,
                     listing_archaisms=ListingArchaisms,
                     ignore_n_list_contractions=ListingContractions,
                     listing_and_splitting_compounds=SplitAndListCompounds,
                     listing_adverbs=ListingAdverbs,
                     listing_gerunds=ListingGerunds,
                     checking_spelling=CheckingSpelling,
                     remove_romans=RemovingRomans,
                     greatest_roman=RomanMaximus,
                     sorting_by_alpha=SortByAlpha)
 
    PrintWordHistogram(TextName, word_histogram, SortByAlpha)
 
    if __name__ == "__main__":
        # Output already printed:
        return
    else:
        # code was imported by another program:
        return result_string
 
 
def usage(progname):
    """Prints a short command line guide.
    """
    output("""\nUsage: from the command line type:
    {} -i <input_file> [-s <start_string>] [-e <end_string>] [-D <your_dict_filepath>]
    [-R <N>] [-cadfghyHPt] [> output_file]
 
    -i rel_path/input_file
        Mandatory flag
 
Optional flags (in order of typical usage):
    -s "start trigger text"
 
    -e "end trigger text"
 
    -d  Check all words against the program's built-in dictionary, listing
        those not in the dictionary. Default is not to spellcheck.
 
    -D rel_path/Dictionary
        Use a different dictionary. Overrides '-d'.
 
    -P  Summarise proper nouns in a separate list from the analysis.
 
    -H  List separately all hyphenated compound words used.
 
    -a  List all archaisms used.
 
    -c  List all contractions used.
 
    -g  List all gerunds and present tense continuous words used.
 
    -y  List all adverbs ending in '-ly'.
 
    -R <N>
        Ignore Roman numerals up to a maximum of N.
 
    -f  Print alphabetically sorted word list with frequencies.
 
    -t  Time the running of the program, time stamping the output file/stream
        with how long the analysis took.
 
    -h  Help: prints a long version of this guide.\n\n""".format(progname))
 
    return
 
 
def main(argv):
    """Interprets the command line, turning flags into variables
    """
    # The flags you want to pass to your program:
    timing, spellcheck = False, False
    ignore_n_list_propers, split_and_list_hyphenates = False, False
    list_contractions, list_archaisms = False, False
    listing_adverbs, listing_gerunds = False, False
    alpha_sort_output = False
    remove_romans, roman_maximus = False, 0
    input_file_path = ''
    analysis_start_trig = analysis_stop_trig = ""
    dictionary_path = './EnglishDictionary.txt'
 
    # Nice trick to get the name of this program:
    path, ProgName = os.path.split(argv[0])
 
    try:
        """ This is where you tell Python about all the valid flags for your
        program. If the flag isn't here, your code should handle it as a
        command line error. A colon after the letter tells it to look for an
        associated value with the flag, e.g.
            -L French
        If there's no associated value, again, your main() code is where you
        handle it. Letters not followed by colons do not need values, and can
        either be listed individually, as in
 
            Analyse this -a -f -g -H
 
        or together, as in
 
            Analyse this -afgH
 
        and getopt() will work it out. It can also handle --linux flags, but
        these have not been implemented here:
        """
        opts, args = getopt.getopt(argv[1:], "i:s:e:D:R:acdfghyHPt")
 
    except getopt.GetoptError as err:
        print('\n', str(err))
        usage(ProgName)
        sys.exit(2)
 
    for opt, arg in opts:
        if opt == '-i':
            input_file_path = arg
        elif opt == '-s':
            analysis_start_trig = arg
        elif opt == '-e':
            analysis_stop_trig = arg
        elif opt == '-f':
            alpha_sort_output = True
        elif opt == '-D':  # use the user's own named dictionary file
            spellcheck = True
            dictionary_path = str(arg)
        elif opt == '-a':
            list_archaisms = True
        elif opt == '-c':
            list_contractions = True
        elif opt == '-d':  # use the program's internal dictionary
            spellcheck = True
        elif opt == '-g':
            listing_gerunds = True
        elif opt == '-h':
            print('\n\n' + ProgName, '\n', __doc__)
            sys.exit(2)
        elif opt == '-H':
            split_and_list_hyphenates = True
        elif opt == '-P':
            ignore_n_list_propers = True
        elif opt == '-y':
            listing_adverbs = True
        elif opt == '-R':
            remove_romans = True  # what did they ever do for us?
            roman_maximus = int(arg)
        elif opt == '-t':
            timing = True
            start = timeit.default_timer()
        else:
            assert False, "unhandled option"
 
    if not input_file_path:
        output("\nCommand line error: requires '-i' flag with the input file name & path.\n")
        usage(ProgName)
        sys.exit(2)
 
    input_files = [input_file_path, dictionary_path]
    for file in input_files:
        try:  # check the user's input files are there before proceeding:
            f = open(file, 'r')
        except IOError:
            print('\nFilename error: no such file: {}\n'.format(file))
            sys.exit(2)  # (Unix convention: 0=no problem, 1=error, 2=cmdline)
        else:
            f.close()
 
    analyse_this(input_file_path, dictionary_path, SortByAlpha=alpha_sort_output,
                 StartTrigger=analysis_start_trig,
                 StopTrigger=analysis_stop_trig, CheckingSpelling=spellcheck,
                 SplitAndListCompounds=split_and_list_hyphenates,
                 ListingArchaisms=list_archaisms,
                 IgnoreNListProperNouns=ignore_n_list_propers,
                 ListingContractions=list_contractions,
                 ListingAdverbs=listing_adverbs,
                 ListingGerunds=listing_gerunds,
                 RemovingRomans=remove_romans, RomanMaximus=roman_maximus)
    if timing:
        stop = timeit.default_timer()
        output('\nThis textual analysis took {} seconds.\n'.format(round(stop - start, 2)))
    return
 
 
def print_cmd_line(args):
    """recreates the command line and prints it out
 
    This recreates the command line as a reference header in the output file,
    to allow the user to reproduce his/her results at a later date. Note that
    since this is called from within __main__, it will not be called if this
    program has been imported as a module from another program, such as a GUI
    front end, which is OK since avoiding a command line is one of the main
    reasons for using a GUI.
    """
    arg_list = []
    path, progname = os.path.split(args[0])
    arg_list += [progname]
    for item in args[1:]:
        if '-' not in item:     # item is therefore either a string or a positive number
            try:
                int(item)   # eliminates integer flags (can easily be changed to floats)
            except ValueError:
                item = '"' + item + '"' # put quotes around anything that isn't a flag or a number
        arg_list.append(item)
    command_line = ' '.join(arg_list) # recreates the command line
    print('\nThe following results were obtained using the command line:\n' + command_line)
    return
 
 
if __name__ == "__main__":
    print_cmd_line(sys.argv)
    main(sys.argv)

 

The program will run on Windows, Linux or Mac. But before you can run it on a text like Homer’s Odyssey, you’ll first have to create a new folder in your file system with the following contents:

AnalyseThis.py
The_Odyssey.txt
EnglishDictionary.txt

As you can see, you’re going to need to get hold of a dictionary file with one word per line, and to rename it EnglishDictionary.txt. Aim for one with around 200,000 words, including plurals, gerunds, superlatives, etc., and with both US and UK spelling conventions. Many Unix systems have a reasonably good dictionary in the folder /usr/share/dict/words.  (1)
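If the word list you find isn’t already in that shape, a few lines of Python will normalise it. This is a minimal sketch, not part of the program itself; the function name and paths are placeholders:

```python
def build_dictionary_file(source_path, dest_path='EnglishDictionary.txt'):
    """Normalise a raw word list into the one-word-per-line format the
    program expects: blank lines and duplicates removed, entries sorted."""
    # Read the raw list, dropping blank lines and duplicates:
    with open(source_path) as src:
        words = {line.strip() for line in src if line.strip()}
    # Write it back out, sorted, one word per line:
    with open(dest_path, 'w') as dest:
        dest.writelines(word + '\n' for word in sorted(words))
    return len(words)
```

Point source_path at /usr/share/dict/words (or any word list you trust), and the program will then find EnglishDictionary.txt in its own folder.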

You’re also going to have to get hold of some texts to analyse. Make sure you read the terms of use for anything you download, and be careful about downloading books still in copyright. Translations of the classics and the Bible are ideal for comparing different translators’ use of pseudo-archaisms. Just make sure that, whatever you want to analyse, you save it as plain text (*.txt).

To run the code, you’re going to need Python 3 (come on, make the switch!). Assuming you have it, start up a terminal window and move to the new folder you’ve just created – the one containing the three files above. Then grant yourself permission to execute AnalyseThis.py by typing:

$ chmod +x AnalyseThis.py

To get instructions on how to use the program’s command line, either read the text at the start of AnalyseThis.py or, depending on how your system is set up, use whichever of the following commands prints the help instructions to your screen:

$ AnalyseThis.py -h
$ ./AnalyseThis.py -h
$ python AnalyseThis.py -h
$ python ./AnalyseThis.py -h
$ python3 AnalyseThis.py -h
$ python3 ./AnalyseThis.py -h

Remember which one works – you’ll need the same format when you run the full command line.

Once you’ve read the instructions, you should be in a position to understand and use the following sample command line, which analyses The Odyssey and saves the output to the file OdysseyAnalysis.txt in the same folder:

$ AnalyseThis.py -i The_Odyssey.txt -cadfgyHPt > OdysseyAnalysis.txt

Remember to put double quotes around any filenames or strings that contain spaces.
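That quoting rule mirrors what print_cmd_line does when it reconstructs the command for the output header. Here’s a minimal sketch of the same idea (quote_args is a hypothetical name of my own, and this is not a verbatim copy of the function above):

```python
def quote_args(argv):
    """Re-quote command-line arguments for display.

    Flags (anything starting with '-') and whole numbers are left alone;
    everything else is wrapped in double quotes so the printed command
    can be copied and pasted back into a shell.
    """
    out = []
    for item in argv:
        if not item.startswith('-'):
            try:
                int(item)           # leave bare integers unquoted
            except ValueError:
                item = '"' + item + '"'
        out.append(item)
    return ' '.join(out)

quote_args(['-i', 'The Odyssey.txt', '-R', '1000'])
# '-i "The Odyssey.txt" -R 1000'
```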

If you want to use a dictionary file with a different name, just use the -D flag to supply your own dictionary’s filename in place of EnglishDictionary.txt.

As an exercise, it would be fairly easy to modify the code to analyse other languages that use variations of the Latin alphabet, provided you can get hold of a dictionary. For example, in French the ‘-ing’ ending is ‘-ant’, the equivalent of the adverbial ‘-ly’ ending is ‘-ment’, and there will be a rich collection of contractions to find without any code changes.
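To illustrate, here’s a hedged sketch of how such endings could be pulled out with a regular expression (words_with_suffix is my own helper, not part of the program):

```python
import re

def words_with_suffix(text, suffix):
    """Return the sorted set of words in text ending with the given suffix."""
    pattern = r'\b\w+' + re.escape(suffix) + r'\b'
    return sorted(set(re.findall(pattern, text.lower())))

sample = "Il marchait lentement, parlant doucement en travaillant."
words_with_suffix(sample, 'ment')   # adverb-like '-ment' endings
words_with_suffix(sample, 'ant')    # participle-like '-ant' endings
```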

If you do want to run the code on a non-English text, note that only the -c, -H, -P and -R flags will work without modification to the code. The other flags are written specifically with English in mind and will probably draw a blank, depending on the language you’re analysing.

Finally, feel free to use the program, modify it in any way you like, or suggest any improvements you think might be worth making.

Happy analysing!

 

(1) Avoid dictionaries full of junk such as acronyms, passwords and foreign words – for example, the 450,000-word dictionary available on Stack Overflow. And don’t worry if your dictionary is full of proper nouns – the code will strip them out. If you want to combine multiple dictionaries at runtime, read the instructions in the function CreateDictionarySet().
