
Lesson plans

LING 360: Programming for Text Processing and Analysis


Earl Kjar Brown, PhD
Professor of Linguistics
Brigham Young University

Intro: Syllabus and such


Objective:
● Students will have a good idea of whether the class meets their needs.

1. Syllabus
2. Look at a few Assignments together, if not all of them (quickly).
3. Why Python?
4. Look at PyCharm together.
1. source file vs. console
2. REPL (read-eval-print loop): View > Tool Windows > Python console
3. preferences: (Windows) File > Settings
1. … Appearance and Behavior
2. … Editor > General
5. Example script: looking for "handsome" (as a regex) in Pride and Prejudice (downloaded from Project Gutenberg), just to show them a simple example of what a few lines of code can do to answer a question in linguistics: Was "handsome" used for both men and women during Jane Austen's time period (or in her dialect, or idiolect)?
import re
with open('[pathway here]', encoding='utf8') as infile:
    for line in infile:
        if re.search(r'\bhandsome', line, re.I):
            print(line, '\n')
6. Challenge Level versus Skill Level
1. https://alifeofproductivity.com/how-to-experience-flow-magical-chart/
2. https://uxdesign.cc/getting-into-the-flow-what-does-that-even-mean-58b16642ef1d

Lesson 1.1: Types


Objective:
● Students will understand Python's basic (or primitive) type system and be capable of
determining the type of a given value (with the type() function).

1. Think Python, chapter 1: The Way of the Program


a. Problem-solving is a central skill to successfully creating a program; analogy of a toolbox and a palette of raw materials (lumber, nails, etc.): a programming language is the toolbox, and a directory full of text files is the raw materials.
b. Values and types: int, float, str, bool
1. Practice:
1. use the type() function to find out the type of the following sets of values:
1. 4, -3, 6
2. 4.5, -3.4, 6.27
3. 'hello', 'world', 'hello world', 'BYU is where I study.'
4. True, False (mind the capitalization or the lack thereof with each letter)
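For reference, a minimal sketch of what this practice looks like in code (output shown in comments):

print(type(4))        # <class 'int'>
print(type(4.5))      # <class 'float'>
print(type('hello'))  # <class 'str'>
print(type(True))     # <class 'bool'>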

Lesson 1.2: Script mode vs. REPL mode


Objective:
● Students will understand the difference between script mode and REPL (read-eval-print loop,
interactive) mode.

1. Script mode vs. REPL (read-eval-print loop, in the console)


a. Practice:
i. Open up a Python console (within PyCharm): View > Tool Windows > Python
Console
ii. Write a few simple arithmetic statements and press Enter (after each one), for example: 2 + 3, 76 - 20, 4 * 7, 56 / 8
b. Open a new Python file: (within PyCharm) In left-hand menu, right-click on the
project name (e.g., "sandbox" if your project name is "sandbox") > New >
Python file
i. write the same arithmetic statements from above, save the file, and run it (Run >
Run… > [select file name and press Enter])
ii. Put the arithmetic statements from above in the print() function and rerun
your script (a.k.a., source file)
c. In a console window:
i. run: 'hello' + 'world'
ii. run: 'hello' * 3
d. In either a console or a script file:
i. run: print('hello' + 4)
ii. Read the last line of the error message that the interpreter throws/prints
iii. What kind of error occurred?
iv. Look up the str() function to see how to convert an int, float, or bool to
str
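A minimal sketch of the fix that the str() lookup should lead to:

# print('hello' + 4) raises a TypeError at run time;
# converting the int to a str first makes the concatenation work
print('hello' + str(4))  # hello4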

Lesson 1.3: Variables


Objective:
● Students will be capable of assigning values to variables.

1. Think Python, chapter 2: Variables, Expressions and Statements


a. Variables:
i. Like good ol' fashioned algebra: 2x + y = 13
ii. Formative quiz: Identify the well-formed variable names and discuss why the
ill-formed variable names are ill-formed.
1. first = 'Earl'
2. 5kids = 'awesome'
3. yourProfession = 'professor'
4. your_name = 'Earl Brown'
5. ca$h = 1000000
6. class = 'text processing and analysis'
7. houses2 = 'expensive'
8. x = 2
9. and = "What else can I tell you?"
10. current_sentence = "How may I help you?"
b. Order of operations
i. Like in math, parentheses have the highest precedence, then multiplication and
division, then addition and subtraction:
1. What does 2 * 3 - 1 return? Why?
2. How about 2 * (3 - 1)?
ii. Practice:
1. What would each of the following statements return? Try them to verify
your guesses.
a. print('hello' + 'world' * 3)
b. print(('hello' + 'world') * 3)
c. Practice:
i. Create a program (in a script) that asks the user for his/her name with a prompt (look up the input() function), saves his/her name to a variable, and prints back to the user the message: "Well, hello there [his/her name]!", replacing "[his/her name]" with his/her actual name.
1. Bonus: If your computer has Python 3.6 or higher (check in PyCharm: File > Settings > Project > Project interpreter), modify your program to use string interpolation (see the sketch after this practice list). Check out this question on Stack Overflow: https://stackoverflow.com/questions/4450592/is-there-a-python-equivalent-to-rubys-string-interpolation
ii. Find a (short) madlib online (possibly from madlibs.com itself) and create a
program that asks the user for the nouns, adjectives, etc., that are needed to
complete the story, and then print out the story once the user has supplied all the
required fields. Change computers with a neighbor and run his/her madlib.
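A minimal sketch of the greeting program, assuming Python 3.6+ for the bonus f-string:

name = input("What's your name? ")        # prompt the user and save the answer
print('Well, hello there ' + name + '!')  # with string concatenation
print(f'Well, hello there {name}!')       # bonus: string interpolation (f-string)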
d. Comments in scripts
i. Make the logic of computer code more easily understood by humans, whether other people who read your code or your later self when you look at code you previously wrote.
ii. Comments can go on their own line (above the code they describe) or at the end of a line, after the code.
iii. docstring for functions: three single or double quotes to begin and end a block of
comments
iv. PyCharm has a keyboard shortcut (Ctrl + /), as does Spyder (Ctrl + 1).
v. Practice:
1. Modify the program you previously wrote by putting in comments that
describe what each line of your script does, and then rerun the program in
order to verify that the interpreter didn't do anything with the comments.

Lesson 2.1: Errors


Objective:
● Students will identify and distinguish three types of errors in code.
1. Three types of errors can occur when writing computer code, as presented in Think Python,
chapter 2, section 2.8:
a. Syntax error
b. Run-time error (e.g., trying to concatenate a str with an int or float without first converting the int or float to a str; trying to divide an int or float by zero)
c. Semantic error
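A minimal sketch contrasting the three error types (the first two are commented out so the script still runs):

# Syntax error: the interpreter can't even parse the code
# print("hello world"         <- missing closing parenthesis

# Run-time error: the code parses, but execution raises an exception
# print('hello' + 4)          <- TypeError

# Semantic error: the code runs, but does the wrong thing
word = 'hello'
print(word[:4])               # wanted the first three letters, got four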
2. Formative quiz: Identify the type of error in the following situations, code, or error messages.
a. The programmer wanted the first three letters of each word, but the program gave
him/her the first four letters.
b. print('hello' + 4) > TypeError: must be str, not int
c. # prints last letter of word
# print("hello"[-1])
d. The programmer tries to take the log of zero.
e. print("hello world"
f. print("What's your name: ") asks the user for his/her name
g. 7 / 0 > ZeroDivisionError

Lesson 2.2: if...else


Objective:
● Students will write conditional execution code using logical operators (and, or, not) and
conditional statements (if...else, if...elif...else).

1. Think Python, Chapter 5: Conditionals and Recursion


a. Boolean expressions
i. Practice:
1. Use boolean operators (==, !=, >, <, >=, <=) with various sets of numbers to understand how they work with values of type int and float, for example:
a. 5 > 8 # False
b. 5 != 8 # True
c. 5.7 <= 8 # True
d. 7 >= 7 # True
e. 9 == 9.02 # False
2. Use the same boolean operators with various letters and words to
understand how they work with values of type str, for example:
a. 'a' < 'b'
b. 'f' > 'y'
c. 'apple' <= 'banana'
d. 'orange' == 'naranja'
e. 'lanky' != 'tall'
f. 'antelope' > 'alpaca'
b. Logical operators: and, or, not
i. Pretty darn similar to their meaning in English. They're used in conditional
execution (see next item!).
c. Conditional execution and alternative execution, also called "conditional statements,"
"control flow," or simply if...else statements
i. See examples in textbook, section 5.4.
1. Note: Mind the indentation and colon (in Python)!
ii. Practice:
1. Create a program that asks the user for a number between 1 and 10 and
prints back to the user one of two messages stating whether the number
is: (a) less than five, or (b) greater than or equal to five.
a. Hint: You will have to use the int() function to change the
input from data type str to int.
2. Modify the previous program so that the program prints back to the user
one of three messages: (a) less than, (b) equal to, or (c) greater than five.
3. Modify the previous program so that the message printed back to the user
indicates whether the number is: (a) between one and three, (b) between
four and six, or (c) between seven and ten.
a. Hint: You will likely need to use and in your conditional
statements.
4. Create a program that asks the user for his/her two favorite fruits and
prints a message stating whether the first fruit given by the user comes
before or after the second one in a dictionary (in alphabetical order).
d. Chained conditionals: if...elif...else
i. See examples in section 5.6.
ii. Practice:
1. Modify the previous program so that the program asks the user for only
one fruit and the message printed back to the user indicates if the fruit
given by the user falls between "guava" and "passion fruit" in the
dictionary.
a. Hint: You might use the logical operator and and several elif
statements.
b. Bonus: Modify your program so that if the user gives "guava" or
"passion fruit," the message states this fact.

Lesson 2.3: String slicing


Objective:
● Students will be capable of slicing strings by indexing in order to extract only part of a string.

1. Think Python, Chapter 8: Strings


a. Strings are sequences of characters; see Figure 8-1 in Downey, Chapter 8, Section 8.4
"String slices".
b. Strings can be sliced from the left side with a positive index, or the right side with a
negative index.
i. 'orange'[2] returns 'a' (Mind the zero-based indexing of Python!)
ii. 'orange'[-1] returns 'e'
iii. 'orange'[-2:] returns 'ge'
c. There are three parameters: 'some_string'[ begin : end : step_size ]
i. 'orange'[2:] returns 'ange'
ii. 'orange'[:2] returns 'or'
iii. 'banana'[::2] returns 'bnn'
iv. 'orange'[:-2] returns 'oran'
v. 'university'[::2] returns 'uiest'
vi. 'apple'[::-1] returns 'elppa' (Shorthand for reversing a string)
d. Practice:
i. Write a program that asks the user for a word of at least six letters in length, then
prints back to the user on one line the first three letters, and then on a second line
prints from the fourth letter to the end of the word.
ii. Modify the program to print out every other letter.
iii. Modify the program to print out the word spelled backward.
e. Strings have useful methods which are accessed with dot notation:
i. 'apple'.upper() returns 'APPLE'
ii. 'apple'.title() returns 'Apple'
iii. 'México'.lower() returns 'méxico'
iv. 'banana'.islower() returns True
v. 'Banana'.islower() returns False
vi. 'Banana'.isupper() returns False
vii. 'BANANA'.isupper() returns True
viii. 'México'.find('x') returns 2
ix. 'banana'.replace('a', 'x') returns 'bxnxnx'
x. ' orange '.strip() returns 'orange' (strips flanking spaces, tabs, and newlines)
xi. 'banana'.count('a') returns 3
f. Practice:
i. Write a program that asks the user for a word and prints back whether the word
is a palindrome (that is, spelled the same way forward and backward, for
example, "noon").
1. Hint: You'll need to force case to uppercase or lowercase to deal with
capitalization. For the Python interpreter, "Hannah" is not a palindrome,
but "HANNAH" and "hannah" are.
ii. Bonus:
1. Write a Pig Latin translator. If/When you get stuck, see an example here.
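A minimal sketch of the palindrome checker from the practice above, combining the [::-1] reversal shorthand with forcing case:

word = input('Give me a word: ').lower()  # force case so "Hannah" counts
if word == word[::-1]:
    print(word, 'is a palindrome')
else:
    print(word, 'is not a palindrome')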

Lesson 3.1: Regexes


Objective:
● Students will be capable of using regular expressions (regexes) to match sequences of
characters within text.

1. Regular expressions (or "regexes")


a. A domain-specific language (or mini language or tiny language) to concisely match
patterns of characters within strings.
b. See "Handout_regexes.pdf" in the CMS and/or the "Regular expression cheatsheet" at
pythex.org.
c. See this tutorial.
d. The most useful regex functions (in my opinion):
i. re.search() returns a match object if a match is found (and None otherwise), so it can be used as a Boolean; the match object has various useful methods
ii. re.findall() returns a list with matches
iii. re.finditer() returns an iterator that must be traversed in order to pull
info with methods
iv. re.sub() returns a string after making substitutions
e. The most useful flag to regex functions:
i. re.IGNORECASE, which has an abbreviation of re.I
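A minimal sketch of the four functions and the flag, on a made-up sentence:

import re
sentence = 'The handsome man and the Handsome woman waved.'
print(re.search(r'handsome', sentence, re.I))       # match object (or None)
print(re.findall(r'handsome \w+', sentence))        # list of matches
for m in re.finditer(r'handsome \w+', sentence, re.I):
    print(m.group(), m.start())                     # pull info with match methods
print(re.sub(r'handsome', 'attractive', sentence))  # string with substitutions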
f. Practice:
i. Write import re at the top of your script and use the re.findall()
function to write a regex that matches the following, and prints the matches to
the user (you). Use pythex.org if you'd like to test regexes (but make sure to run
each regex in Python):
1. a word with "ed" at the end
2. a word with "anti" at the beginning
3. both the American and British spellings of "labor" or "labour"
4. 'Jack', 'Mack', 'Pack'
5. both the American and British spellings of "center/centre"
6. an eight-letter word with 'j' as the third letter and 't' as the sixth letter
(e.g., majestic)
7. words other than "best" that end in "est"
8. a more concise regex than "(make|makes|made)"
9. the same letter twice, next to each other, in a word
10. two words next to each other with the same final letter
a. Hint: Use parentheses to capture and then "\1" to match the
previously caught match.
b. Bonus: Two words next to each other with the same final two
consonants (suggesting a possible rhyme).
11. modify the previous regex to not consume the second word (so that it's
available for another search)
a. Hint: Use positive lookahead, that is, "(?=...)"
12. The underlined portion of "This is a first example sentence."
a. Hint: Use greedy matching.
13. The underlined portions of "This is a first example sentence."
a. Hint: Use lazy/non-greedy matching.
14. Words that begin with "t" regardless of case, that is, "T" and "t", in the
following sentence: "The deal is that I make dinner every night, but
Juanito made it last night."
a. Hint1: Look up the "re.IGNORECASE" parameter of the
re.findall() function.
b. Hint2: "re.IGNORECASE" has an abbreviation of "re.I".
15. Write a program that asks the user for a couple sentences about
him/herself, and you write a regex to match the word "I".
g. Formative quiz: Identify which word or words, if any, are matched by the following
regexes, from the following sentences: "Linguistics is an exciting and thriving field,
especially computational linguistics, which broadly speaking, deals with using
computers to analyze language. Some of the best paying jobs for linguists are in this
field."
i. r"\w+s\b"
ii. r"\w+[dt]\b"
iii. r"\w*[iu][aei]\w*"
iv. r"\w+(\w)\W+\w+\1\b"
v. r"\b[^aeiou\W][aeiou]\w+"
vi. r"\b[aeiou][^aeiou]\w+"

Lesson 3.2: Measuring accuracy


Objective:
● Students will be capable of measuring the accuracy of regex searches.

1. Accuracy
a. Precision
b. Recall
2. Look at "precision_recall.pptx" in the CMS
3. Review "Accuracy.pptx"
4. Practice:
a. Calculate the precision and recall of the first regex in the formative quiz in the previous
lesson for finding word-final /s/ that represents the third-person singular "s" of present
tense verbs in English in the short text given in that quiz.
b. Calculate the precision and recall of that same regex for finding word-final /s/ that
represents a plural morpheme.
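A minimal sketch of the arithmetic, assuming you have hand-counted the matches (the counts below are hypothetical):

true_positives = 4    # matches that really are the target feature
false_positives = 2   # matches that are not the target feature
false_negatives = 1   # target items the regex missed
precision = true_positives / (true_positives + false_positives)  # about 0.67
recall = true_positives / (true_positives + false_negatives)     # 0.8
print(precision, recall)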

Lesson 3.3: Loops


Objective:
● Students will be capable of using the for loop and the while loop to traverse (or iterate) over
strings and collections, like lists.

1. loops
a. while loop (in Think Python, Chapter 7: Iteration)
i. A block of code is executed repeatedly as long as a conditional statement
evaluates to True at the beginning of each iteration.
1. Note: Be careful not to create an infinite loop!
ii. Practice:
1. Using a while loop, write a program that asks the user for a number
between 1 and 10 and prints out from 1 to that number, each on a new
line.
2. Using a while loop, write a program that asks the user to write a
sentence and then asks the user to give a single letter (with a different call
to input()), and prints out the sentence, one letter on a line, until the
letter given by the user is encountered, at which point the loop is exited
and no more of the sentence is printed. If the letter given by the user
doesn't occur in the sentence, the whole sentence should print out.
a. Hint: Look up the break keyword.
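A minimal sketch of the first while practice item:

number = int(input('Give me a number between 1 and 10: '))
counter = 1
while counter <= number:   # condition is checked at the top of each iteration
    print(counter)
    counter += 1           # forgetting this line creates an infinite loop!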
b. for loop (in ch. 8 of textbook)
i. A block of code is executed for each element of a string or collection (such as a
list).
c. Examples: print every letter in a string; see Downey, Chapter 8, Section 8.3 "Traversal
with a for loop"
d. Practice:
i. Write a program that asks the user for a word with at least eight letters, and then
prints on a new line each letter in the word.
ii. Modify the previous program so that only consonants are printed.
1. Hint: Look up the in operator and the continue keyword.
iii. Modify the previous program so that both consonants and vowels are printed, but
with vowels in uppercase.
1. Hint: Don't forget about the string method .upper()
iv. Modify the previous program so that instead of printing on new lines, each letter
is printed next to each other (as words are normally printed, lIkE thIs).
1. Hint: Look up the end parameter of the print() function.
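A minimal sketch combining the hints from the practice items above (the in operator, continue, and the end parameter):

word = input('Give me a word with at least eight letters: ')
for letter in word:
    if letter in 'aeiou':
        print(letter.upper(), end='')  # vowels uppercased, no newline
        continue                       # skip ahead to the next iteration
    print(letter, end='')              # consonants printed as-is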
1. Formative quiz about indentation in Python: Determine how many times "¡Hola Mundo!" will
print to the screen in each of the following blocks of code, and discuss why with a neighbor!
# Block 1
for i in range(2):
    pass
for j in range(2):
    pass
print("¡Hola Mundo!")

# Block 2
for i in range(2):
    pass
    for j in range(2):
        pass
    print("¡Hola Mundo!")

# Block 3
for i in range(2):
    pass
    for j in range(2):
        pass
        print("¡Hola Mundo!")

Lesson 3.4: Lists


Objective:
● Students will be capable of using the list data structure, a collection, to work with many
values.

2. Think Python, chapter 10: Lists


a. A collection, which can have values of several different types:
i. [1, 2, 3]
ii. ['wolf', 'fox', 'coyote']
iii. or even [1, 'wolf', 8.5, 'Hey there!']
iv. or even ['spam', 2.1, 5, [10, 'five']] (a nested list, that is, a
list inside another list)
b. Useful list methods:
i. list_name.append('a') adds 'a' to end of list_name
1. Hint: list_name += ['a'] is a shorthand
ii. list_name.extend(another_list) combines two lists
iii. list_name.sort() modifies in place!
iv. sorted(list_name) creates a copy!
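A minimal sketch of the in-place vs. copy distinction:

animals = ['wolf', 'fox', 'coyote']
print(sorted(animals))  # ['coyote', 'fox', 'wolf'], a sorted copy
print(animals)          # ['wolf', 'fox', 'coyote'], original unchanged
animals.sort()          # sorts in place and returns None
print(animals)          # ['coyote', 'fox', 'wolf']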
c. Useful string methods that create lists:
i. 'How are you today?'.split() returns ['How', 'are',
'you', 'today?']
ii. Split with regex: re.split(r"[aeiou]", "How are you today?")
returns ['H', 'w ', 'r', ' y', '', ' t', 'd', 'y?']
d. Useful string method that takes a list as input:
i. ' '.join(['How', 'are', 'you', 'today?']) returns 'How are you today?' (mind the space between the single quotes!)
e. Practice:
i. Write a program that asks the user for a sentence or two about him/herself, and
simply prints back to the user each word on a new line.
ii. Modify the previous program so that the words printed back to the user are in
alphabetical order.
1. Hint: You'll have to force case (either to lowercase or to uppercase), as,
for example, "H" and "h" are considered to be different letters by the
Python interpreter.
iii. Modify the original program so that every other word is uppercase.
1. Hint1: You'll need to create a counter to keep track of the number of
iterations.
2. Hint2: You'll need to use the modulus operator to determine whether the
current iteration is an odd-numbered one or an even-numbered one, for
example, counter % 2 == 0.
iv. Modify the original program so that the vowels are replaced by dashes.
1. Bonus: Solve this problem in both of the following two ways: (a) with a
nested for loop to traverse over the letters of each word, and (b) with
the re.sub function from the re module.

Lesson 4.1: functions


Objective:
● Students will define their own functions, often referred to as "user-defined" functions.

1. Think Python, chapter 3: Functions


a. Hint: Mind the colon and indentation!
b. Example function 1:
def uppercase_vowels(input_string):
    """Takes as input a string and returns as output a
    string with the vowels converted to uppercase."""
    output_string = ""
    for character in input_string:
        if character in 'aeiou':
            output_string += character.upper()
            # output_string = output_string + character.upper()
        else:
            output_string += character
    return output_string

c. Example function 2:
def dash_vowels(x):
    """Takes as input a string and returns as output a
    string with vowels replaced with dashes."""
    output = ""
    for i in x:
        if i in 'aeiou':
            output += '-'
        else:
            output += i
    return output

Practice:
1. Define a function that takes as input two strings and returns as output a Boolean
value indicating whether the two strings are anagrams of each other (that is, the
letters of one word can be reordered to spell the other word, like with "secure"
and "rescue").
a. Hint1: sorted(list_name) or list_name.sort() will be your friend here.
b. Hint2: You should not ask the user for two strings with input(); rather, your function should take two parameters, one for each word.
c. Hint3: Be sure to account for (possible differences in) capitalization.
d. Bonus: Modify your function so that it throws an error message if the
user doesn't supply two strings as arguments to the function call and/or if
the user doesn't supply values of type str.
2. Possible other functions:
a. Whether the letters of a word are in alphabetical order.
b. Whether the vowels of a word are in alphabetical order.
c. Whether a word has all five (orthographic) vowels, like "sequoia".
d. Whether a shorter word appears in a longer word, in the same order, like
"out" in "counter".

Lesson 4.2: Filtering lists


Objectives:
● Students will be capable of filtering lists with a conditional statement in a for loop.
● Students will be capable of filtering lists with a conditional statement in a list
comprehension.
1. Filtering lists in a for loop
a. Logic:
i. create an empty list, often referred to as an "accumulator" or "collector"
ii. traverse (or loop over) a list that already has elements in it
iii. in each iteration, determine if the current element evaluates to True in a
conditional statement
iv. if True, append the element to the collector list; if False, move on to the next
iteration
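The four steps above in a minimal sketch, with a hypothetical filter (word length):

words = ['one', 'two', 'three', 'four', 'five']
long_words = []                  # step i: the accumulator/collector
for word in words:               # step ii: traverse the existing list
    if len(word) >= 4:           # step iii: the conditional statement
        long_words.append(word)  # step iv: append if True
print(long_words)                # ['three', 'four', 'five']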
b. Practice:
i. Using the technique of filtering lists within a for loop, define a function that
takes as input a string and returns as output a list of repeated words, if any, in the
input string. If there are no repeated words, the function should return the special
value None. Next, write a program that uses the function and that asks the user
for a couple of sentences about him/herself and prints any repeated words.
ii. Modify the previous program to print back to the user words that begin with the syllable structure CV, that is, a consonant followed by a vowel.
iii. Modify the previous program to print words, if any, that begin with the negative prefix "in/im/ir/il" as in "inadequate/impossible/irreal/illegal".
2. Filtering with list comprehensions (in ch. 19, section 19.2 of Think Python)
a. A concise way of filtering collections or sequences, such as lists, if the filtering is
simple, that is, can be written in one line.
b. See examples in Section 19.2 of textbook.
i. capitalize all
ii. only uppercase
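The same length filter from the sketch above, rewritten as a one-line list comprehension:

words = ['one', 'two', 'three', 'four', 'five']
long_words = [word for word in words if len(word) >= 4]
print(long_words)  # ['three', 'four', 'five']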
c. Practice:
i. Write a list comprehension that creates a list with the first character of the words
in the following list: numbers = ['one', 'two', 'three',
'four', 'five']
ii. Using the same numbers list from the previous exercise, write a list
comprehension that creates a list with only words that have an "e".
iii. Rewrite the function above that filters words with a CV syllable at the beginning,
using a list comprehension (instead of a for loop).

Lesson 4.3: Processing sentences


Objective:
● Students will be capable of dividing paragraphs into sentences, in order to work with individual
sentences.

1. Processing sentences
a. Logic:
i. Split up a paragraph at punctuation that indicates the end of a sentence.
1. With a regular expression with re.split()
2. Or with the NLTK sentence tokenizer function. First, you need to
download all the resources used in the NLTK book, using some code
here.
ii. Work with each sentence in a loop, collecting desired information during each
iteration.
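A minimal sketch of the logic with re.split() (the punctuation regex is deliberately naive; abbreviations like "Dr." will trip it up):

import re
paragraph = "Call me Ishmael. Some years ago, I went to sea. Who wouldn't?"
sentences = re.split(r'[.!?]\s+', paragraph)  # split at sentence-final punctuation
lengths = [len(sentence.split()) for sentence in sentences]
print(sum(lengths) / len(lengths))            # average words per sentence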
b. Practice:
i. Define a function that takes as input a string with a paragraph and returns as
output a float with the average number of words per sentence in the paragraph
given.

Lesson 5.1: File I/O (input/output)


Objective:
● Students will be capable of navigating to a directory and reading, writing, and appending to
text files.

1. Think Python, chapter 14: Files


a. Logic:
i. Open a connection to the file with the open() function, specifying whether you
want to "r"ead, "w"rite, "r+"ead and write, or "a"ppend to the file. The default
is "r".
ii. Use the with keyword before the call to open() and the as keyword to assign
the file connection to a variable; see example on SO here.
1. Note: Mind the Python 3 way of printing, that is, print(line).
iii. Do something (that is, read, write, or append) to the file with the "file object"
(also called "file handle"), that you previously saved to a variable.
iv. The file connection will automatically close once the indentation is lined up with
the line with the with keyword.
b. File management operations with os module
i. import os # import the os module before using one of its functions
ii. os.getcwd() # lists the current working directory
iii. os.chdir() # changes current working directory
iv. os.listdir() # list out contents of the current working directory
c. Several file object methods (and ways) of reading in a file:
i. .read() # returns the whole file as a string (even if that file is a whole book!)
ii. .readline() # reads one line at a time; best used in a for loop
iii. loop over the file object in a for loop; this is similar to the .readline()
method in a for loop
iv. .readlines() # (mind the plural!) returns a list with each line as a
separate element
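A minimal sketch of the logic, assuming a hypothetical directory and file name:

import os
os.chdir('/path/to/your/directory')  # hypothetical pathway
print(os.getcwd())                   # confirm where we are
with open('words.txt', encoding='utf8') as infile:  # mode defaults to "r"
    for line in infile:              # one line per iteration
        print(line.strip())
# the file connection closes automatically here, outside the with block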
d. Practice:
i. Download text file from: http://greenteapress.com/thinkpython2/code/words.txt.
Write a program that navigates (using the os module) to the directory with the
words.txt file, reads the file, and prints only the words with 20 or more letters.
ii. Write a function called has_no_e() that takes as input a string and returns as output a Boolean value indicating whether the string has no letter "e" in it. Modify your previous program to print only words that are 15 or more letters long and don't have the letter "e" in them.
iii. Modify the previous program so that instead of simply printing to the screen, the
words are written to a file named long_wds_without_e.txt, in the same
directory with words.txt.
iv. Download the jokes.txt file from the CMS. Write a program that appends to the
file a joke of your own (or one found on the internet, if you must).

Lesson 5.2: Handling exceptions with try...except


Objective:
● Students will be capable of catching exceptions (errors) in their scripts.

2. Think Python, chapter 14, section 14.5: Catching exceptions (errors)


a. Allows the programmer to try to execute a line or block of code and deal with problems,
if they happen.
b. See example from Section 14.5 of textbook.
c. Conceptually similar to if...else statements in that if the conditional statement
evaluates to True, the else block of code isn't executed. If all goes well with the try
block of code, the except block isn't executed.
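A minimal sketch of the try...except pattern the practice item calls for:

number = int(input('Give me a number between 100 and 1000: '))
for divisor in range(-5, 6):  # -5 up to and including 5
    try:
        print(number / divisor)
    except ZeroDivisionError:
        print("Can't divide by zero, but the script keeps going!")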
d. Practice:
i. Write a program that asks the user for a number between 100 and 1000 and
prints out that number divided by -5, and then by -4, and then by -3, until you
reach 5. Use a try...except block so that when the number given by the
user is divided by 0 an error message is thrown, but the script continues.
1. Hint: The range() function is handy here.

Lesson 5.3: Normalizing counts


Objective:
● Students will be capable of producing normalized counts of specific linguistic features in text
files.

1. Normalized counts
a. Let's suppose that:
i. Corpus A has 10,000 words and Corpus B has 5,000 words;
ii. "handsome" occurs 30 times in Corpus A, but only 25 times in Corpus B;
iii. Question: In which corpus is "handsome" more frequent?
iv. According to raw counts, "handsome" is more frequent in Corpus A, because 30
is larger than 25.
v. However, according to normalized counts (per 1,000 words), "handsome" is
more frequent in Corpus B, because it occurs 5 times per 1,000 words in Corpus
B (25 / 5,000 * 1,000 = 5), but only 3 times per 1,000 words in Corpus A (30 /
10,000 * 1,000 = 3).
vi. Conclusion: According to raw counts, "handsome" is more frequent in Corpus A,
but according to normalized counts (per thousand words) "handsome" is more
frequent in Corpus B.
vii. Take-home message: In order to compare counts of words or linguistic features
between different corpora or texts, you first have to normalize counts.
b. Logic:
i. Navigate to the directory with the files and get their names.
ii. Create two counter variables, one for the total number of words and one for the
number of matches of the desired linguistic feature.
iii. Loop over the files.
iv. Within each iteration, split the current file into words.
v. Loop over the words in the current file.
vi. Within each iteration, increment the total words counter. Then, check if the
current word matches the regex, and if so, increment the linguistic feature
counter.
vii. After the loop has finished, divide the linguistic feature counter by the total
words counter to create a ratio, and then multiply that ratio by the base of the
normalized count (for example, by 1,000 if the base is per 1,000 words).
viii. Print out the results to the screen or write them to an output file.
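The logic above in a minimal sketch, with a hypothetical directory and feature regex:

import os, re
os.chdir('/path/to/novels')            # hypothetical directory of .txt files
total_words = 0                        # counter 1: all words
feature_count = 0                      # counter 2: matches of the feature
for filename in os.listdir():          # loop over the files
    if not filename.endswith('.txt'):
        continue
    with open(filename, encoding='utf8') as infile:
        for word in infile.read().split():           # split file into words
            total_words += 1
            if re.search(r'\bhandsome\b', word, re.I):  # hypothetical feature
                feature_count += 1
print(feature_count / total_words * 1000)  # normalized per 1,000 words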

1. Practice:
a. From the Gutenberg Project, download several .txt files of novels from an author
of your choice. Write a program to get a count normalized to 1,000 for a regex
pattern of your choice in one of the files. Write the output to a .csv file,
specifying that the columns be separated by a tab (\t) and that the rows be
separated by a newline break (\n). Import the output .csv file into a spreadsheet
(Excel, google spreadsheet, etc.) of your choice.
b. Modify the program to get normalized counts for each of the novels you
downloaded. Modify the code that produces the output .csv file so it is easy to
compare the normalized counts across the novels.
c. Download several .txt files from another author. Modify your program to get
normalized counts for each novel. Again, modify the code that produces the
output .csv file to make it easy to compare the normalized counts across the
novels.
d. Modify your program to get normalized counts for each author (two normalized
counts, one per author). Again, modify the code that produces the output .csv file
to make it easy to compare the normalized counts across the two authors.

Lesson 6.1: Non-UTF-8 encoding


Objective:
● Students will be capable of reading in files with character encodings other than UTF-8.

1. File encoding standards


a. ASCII (American Standard Code for Information Interchange) dates from the 1960s;
English-centric
i. https://en.wikipedia.org/wiki/ASCII
b. Unicode, dates from the late 1980s; intended to be an "international/multilingual text
character encoding system"; the "name 'Unicode' is intended to suggest a unique,
unified, universal encoding."
i. https://en.wikipedia.org/wiki/Unicode
ii. This video is informative, especially the first 9 minutes:
https://youtu.be/sgHbC6udIqc
c. Several main character encodings:
i. "UTF-8" (also written as "utf-8" or "utf8"); the most popular today (since 2008)
ii. "UTF-16" ("utf16")
iii. "Windows-1252" (also "ISO-8859-1" and "Latin-1" or "latin1")
iv. Many, many more: https://docs.python.org/3/library/codecs.html#standard-encodings
d. The chardet module has the detect() function that makes a guess about the
character encoding of a file. It takes as input bytes, so you have to call open(...,
mode="rb"), to read in the file (with the .read() method of the file object) and
then pass the read data to detect(). Note: The module also comes with a command-
line tool chardetect pathway/to/filename that can be used in a Terminal
window (PyCharm > View > Tool Windows > Terminal) to (try to) figure out the
encoding of a file.
i. Sample code in Python script:
import chardet, os
os.chdir("/Users/ekb5/Documents/LING_360/sample_texts/")
with open("Russian.txt", mode="rb") as infile:
    whole_file_as_bytes = infile.read()
guessed_encoding = chardet.detect(whole_file_as_bytes)
print(guessed_encoding)
e. Practice:
i. Download the zipped file sample_texts.zip from the CMS. It contains 7 .txt
files with text in different languages, with different character encodings.
ii. If your computer doesn't have the chardet module installed, you can install it
with pip install or within PyCharm, or you can use pythonanywhere.com.
(If you use pythonanywhere.com, you'll need to upload the files and refer to
pathways relative to your home directory in pythonanywhere, for example,
/home/ekbrown77/Arabic.txt.)
iii. Using the detect() function from the chardet module, take the encoding that the tool returns and try to open the file with the open() function, specifying an encoding argument.
iv. If the file still doesn't open, or if the characters don't look like they came in
correctly, try a different encoding listed for the language on this doc page:
https://docs.python.org/3/library/codecs.html#standard-encodings

Lesson 6.2: Word documents


Objective:
● Students will be capable of extracting text from Microsoft Word documents.

1. Word documents (with extension .docx)


a. Logic:
i. download python-docx (in PyCharm's Preferences window, or pip
install)
ii. import docx
iii. Create object with the function Document()
iv. Loop over the .paragraphs attribute of the object (mind the lack of
parentheses!)
v. Extract the text of the current paragraph with the .text attribute (again, no
parentheses)
vi. Note: Legacy Word documents with the extension .doc need to be dealt with in a different way. Check out this Stack Overflow thread: https://stackoverflow.com/questions/36001482/read-doc-file-with-python
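A minimal sketch of the logic above, using a file name from the practice below:

import docx  # installed as python-docx

document = docx.Document('Salinas_01_ekb.docx')
for paragraph in document.paragraphs:  # an attribute: no parentheses!
    print(paragraph.text)              # again, an attribute, no parentheses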
b. Practice:
i. Download and unzip the zipped file Salinas.zip in the CMS.
ii. Write a program to print to screen the first file, named Salinas_01_ekb.docx.
iii. Modify the program to print to screen all 11 files.
iv. Modify the program to not print out lines that begin with "#" (the header and
time stamps) and lines that begin with "/" (the questions and comments of the
interviewer).
1. Hint: The caret symbol ^ in regexes identifies the beginning of a string.
v. Modify the program to create a concordance of words that end in "s" or "z" in all
files. Write out to a .csv file the following info, each in a new column (separated
by tabs): filename, paragraph number, preceding ten words, match, following ten
words. When imported into a spreadsheet, it should look something like this:

Lesson 6.3: Indexed PDF files


Objective:
● Students will be capable of extracting text from indexed PDF files.

1. Not all PDF files are created equal:


a. Indexed -> The text is embedded in the file itself and can be extracted.
b. Unindexed -> Just a picture; no text available.
i. Note: In this situation, you would have to run Optical Character Recognition
(OCR), which is not the objective of this lesson. Google Tesseract OCR and the
Python wrapper pytesseract are a good option.
2. Logic with indexed PDF files:
a. Download Xpdf command-line tools (*not XpdfReader), which has the very useful and
easy-to-use pdftotext command-line tool, and unzip the zipped file.
b. Control the pdftotext command-line tool from within a Python script with the
run() function within the subprocess module to produce a TXT file from each
PDF file.
i. Pass in the syntax of a command-line (aka. shell) call into the
subprocess.run() function.
c. After creating TXT files from all the PDF files, loop over the TXT files and perform the
desired analysis.
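A minimal sketch of the subprocess call, assuming a hypothetical pathway to the unzipped pdftotext tool:

import subprocess

# arguments are passed as a list: [tool, input PDF, output TXT]
subprocess.run(['/path/to/xpdf/bin/pdftotext', 'Emma.pdf', 'Emma.txt'])
with open('Emma.txt', encoding='utf8') as infile:
    print(infile.read())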
3. Practice:
a. Download the zipped file Jane_Austen.zip from the CMS and unzip it.
b. Write a program to create a TXT file named Emma.txt from Emma.pdf.
i. Hint 1: See the documentation for pdftotext here.
ii. Hint 2: You need to import subprocess before calling
subprocess.run()
c. Either in the same script or in a new script, read in the newly created Emma.txt file and
print it to the console.
d. Modify the previous program and do some post-processing as you read in Emma.txt so
that the Project Gutenberg header and footer are not printed.
i. Hint: Headers usually end with something like "Start of this Project Gutenberg" and footers usually begin with something like "End of the Project Gutenberg".
ii. BTW: For a single PDF file, you could simply determine on which pages the
novel starts and ends, and have pdftotext read in only those pages by setting
the -f (first page) and -l (last page) flags (see documentation here).
e. Modify the previous program to create a new TXT file named Emma_novel.txt that
only has the novel without the Project Gutenberg header and footer.
f. Modify the previous programs to create corresponding novel-only TXT files for all
PDF files in the directory.

Lesson 7.1: dictionaries


Objective:
● Students will create Python dictionaries with key-value pairs.

1. Think Python, chapter 11: dictionaries


a. Made up of key-value pairs, called items.
b. Analogy from coursera Python data structures course of a purse (dict) with belongings
(value) with a sticky note (key) on them.
i. Watch an informative video about Python dictionaries from Coursera here
(you don't have to join the course to watch the video).
c. Can be created with curly braces or with the function dict()
i. dict_name = {}
ii. dict_name = dict()
d. Several useful dictionary methods:
i. .keys() returns an iterable view with the keys of each item
ii. .values() same, but with the values
iii. .items() returns a dict_items object, which is like a list, with 2-
element tuples (see next section)
iv. Add items to an existing dictionary with: dict_name['key'] =
'value', where dict_name is the variable you assigned your dictionary
to, and key and value have your actual data.
1. See thread on SO here.
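A minimal sketch of creating, filling, and looping over a dictionary:

defs = {}                                  # or: defs = dict()
defs['corpus'] = 'a body of text'          # add an item (key-value pair)
defs['lemma'] = 'the dictionary form of a word'
print(defs.keys())                         # just the keys
print(defs.values())                       # just the values
for word in defs:
    print(word + ':', defs[word])          # key, colon, value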
e. Practice:
i. Create an empty dictionary and then add words as keys and definitions of
those words as values.
ii. Iterate over the dictionary and print to screen the keys. Iterate again, this time
printing the values. Now finally, print the key, a colon, and then the value.

Lesson 7.2: Tuples


Objective:
● Students will be capable of converting dictionaries to lists with tuples.

1. Think Python, chapter 12: tuples


a. Like lists, but are immutable, that is, they cannot be changed after they are created.
b. By convention and, sometimes, by necessity, they are written with parentheses:
i. t = ('a', 'b', 'c', 'd', 'e')
c. dictionaries can be converted into dict_items objects (much like lists) with 2-
element tuples with the .items() dictionary method.
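A minimal sketch of the conversion:

defs = {'corpus': 'a body of text', 'lemma': 'the dictionary form of a word'}
for pair in defs.items():  # each pair is a 2-element tuple
    print(pair)            # e.g., ('corpus', 'a body of text')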
d. Practice:
i. Use the dictionary method .items()to create a dict_items object of 2-
element tuples from your dictionary from above. Iterate over the
dict_items and write to a .csv file each word-definition pair, with the words
in one column and their definitions in the next column.

Lesson 7.3: Lambda functions


Objective:
● Students will be capable of sorting lists with tuples by using lambda functions.

1. Sorting dictionaries and lists with lambda functions


a. The function sorted() and the list method list_name.sort() have an argument key that specifies a function to be applied to each element while sorting. By default, elements are compared directly (for tuples, by the first element, then the second, and so on).
b. Lambda functions are "anonymous" functions because you don't have to give them a name and because they are usually immediately thrown away after use. They are often used with sorting.
c. So, while you could supply a function created with the def keyword, lambda functions
are usually used as the function supplied to the key argument of sorted().
d. In order to sort a dictionary, you must first create a list with 2-element tuples from
the dictionary.
e. Example:
# creates dictionary
fam_dict = {'Lydia': 15, 'Earl': 45, 'Eve': 11, 'Kristi': 45,
            'Esther': 16, 'Hannah': 12, 'Seth': 19}

# returns dict_items object (like a list) with tuples
fam_list = fam_dict.items()

# uses a lambda function to sort by the second element of each tuple
# (mind the zero-basedness of Python!)
print(sorted(fam_list, key=lambda x: x[1]))

The following two definitions create the same function:

# normal way of defining a function
def second_element(x):
    return x[1]

# the lambda way (naming a lambda like this is unusual, but it shows the equivalence)
second_element = lambda x: x[1]

Practice:
1. Create a dictionary with the ages of your family. Convert it to a list with
tuples with the list() function, and print to screen.
2. Sort the list by names and print to screen. Sort again, this time by age, and
print to screen.
3. Sort the list by age in reverse order, and write to a .csv file.
a. Hint: Look up the reverse argument of the function sorted().

Lesson 7.4: Word frequency dictionaries


Objective:
● Students will be capable of creating word frequency lists using dictionaries.

1. Six different (but similar under the hood) ways to create word frequency lists:
a. if...else
i. See example in Section 11.2 of textbook.
b. try...except
i. See example on SO here.
c. dictionary method .get()
i. See example on SO here.
d. defaultdict() function in module collections
i. See example in Section 19.7 of textbook and on SO here.
e. Counter() function in module collections
i. See example in Section 19.6 of textbook and on SO here.
f. FreqDist() function and .most_common() method from nltk library
i. See example in Section 3.1 of NLTK book.
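A minimal sketch of two of the six ways, on a toy sentence (see the links above for the rest):

from collections import Counter

words = 'the cat saw the dog and the dog saw the cat'.split()

# way (c): the dictionary method .get()
freqs = {}
for word in words:
    freqs[word] = freqs.get(word, 0) + 1  # 0 is the default if key is absent
print(freqs)

# way (e): the Counter() function
print(Counter(words))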
2. Formative (review) quiz:
a. In a previous lesson (Lesson 5.1 File I/O) we learned how to read text saved in a file into
a Python program. In an effort to review that material, choose the best way to
accomplish the desired outcome in each scenario. You can assume that a file connection
object is held by the variable infile. While there are more than two ways to read in
text from a file (review Lesson 5.1), only two options are given here because they are
reasonable ways to accomplish the tasks given below.
i. Options:
1. infile.read()
2. for line in infile:
ii. Tasks:
1. Read in each line (i.e., hard return / newline break) in the file, one at a
time.
2. Read in all text of the file as a single string.
3. You want to get frequencies of words in a file (and the file is small
enough to fit in memory).
4. You have a massive file (say, 50 GB) that won't fit in the working
memory of your computer, and you want to get frequencies of words.
3. Practice:
a. Write a few sentences about your favorite restaurant and/or food. Be sure to repeat at
least a few words, in order to double check that your word frequency list works
correctly.
b. Using the if...else technique, create a word frequency list of the words in your few
sentences and print the dictionary to the screen.
i. Hint: Be sure to deal with capitalization by converting all words to either
uppercase or lowercase.
c. Modify your program to use try...except, then the dictionary method .get(),
then the defaultdict() function in the module collections, and then the
Counter() function in the module collections, and finally, the FreqDist()
function from the nltk library.
d. Modify your program to sort the dictionary (after converting it to a list) in reverse
(descending) order based on the frequencies (values).
i. Read docs on sorting lists at python.org.
e. Modify your program to create a word frequency list of a .txt file of your choice
(possibly from http://www.gutenberg.org/).
f. Modify your program to create a word frequency list of several .txt files.
g. Modify your program to write the word frequency list to a .csv file, again in descending
order based on the frequencies.
i. Bonus: Modify your program to first sort in descending order by frequency, and
then, secondly, in ascending order by word, so that words with the same
frequency are arranged in alphabetical order. The first comment on the answer of
this Stack Overflow thread will be useful (but you'll need to modify the code a
bit in order to get both the key and the value, not just the key):
https://stackoverflow.com/questions/9919342/sorting-a-dictionary-by-value-then-key
h. Modify your program to exclude the words in the following list: stop_list =
['the', 'a', 'an', 'of', 'from', 'with', 'to', 'and']

Lesson 8.1: Basic descriptive statistics


Objective:
● Students will be capable of producing basic descriptive statistics with the statistics
module.

1. statistics module
a. Functions to measure central tendency:
i. mean()
ii. median()
iii. mode()
b. Functions to measure dispersion or spread (around the central tendency)
i. stdev() # standard deviation
ii. variance() # this is the square of the standard deviation
iii. the range (no dedicated function in the statistics module; compute it as max() minus min())
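A minimal sketch of the module on hypothetical sentence lengths:

import statistics

sentence_lengths = [12, 7, 19, 7, 25, 14]  # hypothetical words per sentence
print(statistics.mean(sentence_lengths))
print(statistics.median(sentence_lengths))
print(statistics.mode(sentence_lengths))
print(statistics.stdev(sentence_lengths))
print(statistics.variance(sentence_lengths))
print(max(sentence_lengths) - min(sentence_lengths))  # the range, by hand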
2. Practice:
a. Download (or reuse) a .txt file of a novel of your choice from the Gutenberg Project.
b. Calculate the measures of central tendency and the measures of dispersion for sentence
length.
i. Hint1: It might be easiest to read in the whole file as a string with the .read()
method of the file handle, and then re.split() the string into sentences on
punctuation that indicates the end of sentences.
ii. Hint2: Another way to split up a string into sentences is with the
sent_tokenize() function in the nltk library.
c. Download (or reuse) a .txt file of a novel from a different author and calculate the same
measurements.
d. Compare the two authors' sentence lengths, looking at measures of central tendency, as
well as how varied their sentence lengths are, with measures of dispersion.

Lesson 8.2: Part-of-speech tagging with NLTK library


Objective:
● Students will tag words for part-of-speech in a text with the NLTK (Natural Language Toolkit)
library.

1. Part-of-speech tagging
a. Logic/Steps:
i. Download nltk library (in PyCharm: Preferences > Project: [project name] > Project Interpreter > the "+" > search for "nltk" > Install Package)
ii. In a .py file, run the code given here and download "book" ("Everything used in
the NLTK Book")
iii. import nltk
iv. Use the function word_tokenize(str) to tokenize text, that is, put words
and punctuation in separate elements of a list.
v. Use the function pos_tag(list) to tag part-of-speech (POS) of the tokens
(the individual words and punctuation). The returned value is a list with 2-
element tuples, the first element of which is the token, and the second element is
the POS tag.
vi. In a REPL (not in your .py script), you can use the function
nltk.help.upenn_tagset() to see what the tags of the Penn Treebank
tagset mean, for example, that "NNS" is the code for a plural noun. Or, you can
find the tagset online with a simple internet search of keywords like "penn
treebank tagset".
vii. The function pos_tag() takes the argument tagset="universal",
which gives simplified tags.
viii. See examples in chapter 5 of the NLTK book.
1. Beware! Unfortunately, some example code in the NLTK book isn't
completely reproducible, that is, there are some errors.
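A minimal sketch of steps iii-vii (assumes the models from step ii are already downloaded):

import nltk

text = 'The judge must record the new world record.'
tokens = nltk.word_tokenize(text)  # words and punctuation as separate elements
print(nltk.pos_tag(tokens))        # list of (token, POS tag) tuples
print(nltk.pos_tag(tokens, tagset='universal'))  # simplified tags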
b. Practice:
i. Write a sentence or two about the topic of your choice. Try to include a word or
two that has different parts of speech depending on the context, for example,
"The judge must record the new world record" (verb / noun) or "Yesterday I
talked with her, but I haven't talked with her yet today" (simple past / present
perfect).
ii. Tag the part-of-speech of the sentence(s) and look up the tags in the Penn
Treebank tagset. Determine how many of the words were tagged correctly.
iii. Find a paragraph or two online and copy and paste it/them as the input text in
your program. Rerun your program and skim the tags and look for any incorrect
tags.
iv. Modify your previous program to use the "universal" tagset. Next, calculate
how many of each tag the paragraph(s) has.
v. Modify your program to tag a text of your choice (perhaps a novel from
Gutenberg, or a speech from a political leader). Find all comparative adjectives
in the document and (if there aren't too many) report the precision of the tagger
with comparative adjectives.

Lesson 8.3: Part-of-speech tagging with TreeTagger


Objective:
● Students will tag for part-of-speech with the third-party software TreeTagger, using a Python
wrapper.

1. TreeTagger part-of-speech tagger


a. An oldie but a goodie POS tagger that tags many different languages, freely available
here.
b. There are several Python wrappers, for example, the one linked from the TreeTagger website called treetaggerwrapper. That wrapper supports a limited number of languages, listed here.
c. Logic/Example code:
import treetaggerwrapper
tagger = treetaggerwrapper.TreeTagger(TAGDIR='[pathway/to/treetagger/dir]',
                                      TAGLANG='[two-letter_language_code]')
tags = tagger.tag_text('[text_here]')
pretty_tags = treetaggerwrapper.make_tags(tags)
d. Practice:
i. Tag for part-of-speech 1 Nephi 1:1 in the Book of Mormon in several different
languages, from among the languages with parameter files on your hard drive.
1. Hint: In order to tag text in a language, you must have had the parameter file in the TreeTagger directory on your computer when you installed the software, that is, when you ran the bash command sh install-tagger.sh from a Terminal window. If that wasn't the case, you can download the parameter file for the desired language and rerun the installer script, that is, the bash command given above.

Lesson 8.4: Part-of-speech tagging with Stanza


Objective:
● Students will tag words for part-of-speech with Stanza.

1. Stanza is a language analysis software library produced at Stanford University.


2. In addition to part-of-speech tagging, Stanza can return the lemma (see here) of a word (and
other information; see here).
3. Steps to part-of-speech tag words with Stanza:
a. Install Stanza (e.g., in Preferences windows in PyCharm or pip install stanza;
see here)
b. Download the models for languages (see here) you want to perform part-of-speech
tagging on (see code here).
c. Create a Pipeline object (see code here).
d. Use the Pipeline object to annotate text (see code here).
e. Pull out the information you want about the words, e.g., part-of-speech and lemma (see
code here).
i. Note: The gnarly double list comprehension in the sample code can be expanded
out to a more transparent double for loop.
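A minimal sketch of steps a through e for English:

import stanza

stanza.download('en')  # one-time download of the English models
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma')
doc = nlp('She worked her way through college.')
for sentence in doc.sentences:   # the double list comprehension, expanded
    for word in sentence.words:
        print(word.text, word.upos, word.xpos, word.lemma)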
4. Practice:
a. Download Stanza in PyCharm's Preferences window (or with pip install
stanza).
b. Download the language models for the languages that you'd like to perform part-of-
speech tagging on.
c. Write a toy language sample (i.e., a sentence or two) and print out to the console each
token and its corresponding part-of-speech (both upos and xpos tags, if available) and
the lemma.
d. Ramp it up to a file with lots of text (e.g., a novel).
e. Use the parts-of-speech to pull out a linguistic construction of your choice.
i. If you can't think of anything, try to pull out the way construction (e.g., she
worked her way through college, he cheated his way through high school).
f. Save the tokens of the linguistic construction out to a file on your hard drive.

Lesson 9.1: Processing tagged text


Objective:
● Students will be capable of extracting information from text with part-of-speech tags.

1. Logic:
a. The logic of the code will vary according to how the part-of-speech tags are organized.
b. In general, first you need to isolate tokens (words and punctuation) from their
accompanying part-of-speech tags.
c. Then, you search for specific words, likely with regular expressions, that have certain
POS tags, again, likely specified with regexes.
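A hedged sketch of the logic on a made-up tagged line; it assumes word_TAG pairs and Penn-style tags (JJ = adjective, NN... = noun), which may differ from the actual Mini-CORE format:

line = 'the_DT tall_JJ linguist_NN analyzed_VBD old_JJ texts_NNS'
pairs = [token.rsplit('_', 1) for token in line.split()]  # isolate word and tag
for i in range(len(pairs) - 1):        # loop over indexes, as the hint suggests
    word, tag = pairs[i]
    next_word, next_tag = pairs[i + 1]
    if tag == 'JJ' and next_tag.startswith('NN'):
        print(word, next_word)         # tall linguist / old texts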
2. Practice:
a. Download the Mini-CORE_tagd.zip file from the CMS and unzip it on your hard drive.
b. Write a program to print out to the screen all the adjective + noun pairs.
i. Hint: It might be easiest to loop over the indexes of word/POS pairs within each
file, so that you can check if an adjective is followed by a noun, and if so, print
both words based on the indexes.
ii. Hint2: .readlines() is useful here.
c. Modify your program to calculate the frequency of the adjective-noun pairs and write
them to a .csv file, ordered in descending order by frequency.
d. Modify your program to write to a .csv file all the superlative adjective + plural non-
proper noun pairs.

Lesson 9.2: parsing XML data


Objective:
● Students will parse XML-formatted data with XPath syntax.

1. XML (Extensible Markup Language)


a. A markup language for formatting data to be read by computers (and humans, to some
extent).
b. Simple example here and slightly more complex one here, and one with attributes here.
2. XPath
a. XPath is a mini-language, along the lines of regular expressions, as XPath allows users
to extract certain data from an XML or HTML document with specialized syntax.
b. See tutorial here and cheatsheet here.
c. Basic usage:
i. '/breakfast_menu' selects "breakfast_menu" if it's the root element
ii. '/breakfast_menu/food' selects all "food" elements that are children of
the root element "breakfast_menu"
iii. '//food' selects all "food" elements anywhere in the document
iv. '//food/price' selects all "price" elements that are children of "food"
elements
v. '//food/price[@range]' selects all "price" elements that have a "range"
attribute
vi. '//food/price[@range="cheap"]' selects all "price" elements that
have a "range" attribute with a value of "cheap"
vii. '//food[contains(price, "7")]' selects all "food" elements with a
"price" child element whose text contains the character "7"
viii. '//food[starts-with(price, "$7")]/name' selects all the
"name" elements that are children of "food" elements that have a "price" child
element whose text starts with the two characters "$7"
ix. '//food[contains(name, "Waffles")]/name' selects all the
"name" elements whose text contain the characters "Waffles"
x. '//food/name[following-
sibling::price[@range="cheap"]]' selects all "name" elements that
have a following sibling of "price", which has an attribute of "range" with a
value of "cheap"
xi. '//food/name[following-sibling::price[@range="cheap"] and following-sibling::description[@summary="delicious"]]' selects all "name" elements that have following sibling elements "price" and "description", each with the specified attribute value
3. Logic/sample script:
from lxml import etree
tree = etree.parse("pathway_to_xml_file_here")
nodes = tree.xpath("xpath_here")
for i in nodes:
    for j in i.itertext():
        print(j)
4. Practice:
a. Download the sample .xml file from the British National Corpus entitled "AHS.xml",
available in the CMS.
b. Open the .xml in a web browser (so that it will pretty-print) and visually inspect the tags
that fall within the "s" tags. What do you think the "s", "w" and "c" tags stand for? What
do you think the "c5", "hw", and "pos" attributes of "w" tags stand for? Verify with the
BNC XML documentation here.
c. Write a program to print to screen the text within the sentences of the file.
d. Modify your script so that the words are printed horizontally to the right rather than
vertically.
i. Hint: Look at the documentation for the end= parameter of the print()
function here.
e. Modify your script to print to screen all the (unmarked) adjectives.
i. Hint: The CLAWS 5 tagset here will be helpful.
f. Modify your script to print to screen all the (unmarked) adjectives that are followed by a
common singular noun anywhere in the sentence.
g. Modify your script to print to screen all the (unmarked) adjectives that are followed by a
common singular noun that is the first word after the adjective.
h. Modify your script to print to screen any adjective followed by any noun that is the first
word after the adjective.
i. Modify your script to print to screen all the text in sentences with any adjective followed
by any noun that is the first word after the adjective.

Lesson 9.3: Parsing JSON data


Objective:
● Students will parse JSON-formatted data.

1. JSON
a. JSON (Javascript Object Notation) is a data format that is commonly used for sending
information from a web server to a client, that is, a person using a computer or a cell
phone to access a website or a computer program intended to retrieve lots of data.
b. A JSON object is very similar in format to a Python dictionary.
i. See the example on Wikipedia and here.
c. Read more about JSON here:
i. https://www.json.org/
ii. https://en.wikipedia.org/wiki/JSON
d. Logic to parse JSON (see the sketch below):
i. import json
ii. Load a .json file with the json.load() function, or convert a JSON-formatted string to a Python dictionary with the json.loads() function (mind the "s"!).
iii. Iterate over the resulting Python dictionary (which may have a list in it), pulling out values with their corresponding keys.
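A minimal sketch of both loading styles (the file names and the dictionary keys "clients", "gender", and "text" are assumptions here; inspect your file to verify its structure):
import json

# whole file is one JSON object: use json.load()
with open('db.json', encoding='utf8') as infile:
    data = json.load(infile)
for client in data['clients']:       # assumed top-level key
    print(client['gender'])          # assumed field name

# one JSON object per line: use json.loads() on each line instead
with open('tweets.json', encoding='utf8') as infile:
    for line in infile:
        tweet = json.loads(line)
        print(tweet['text'])         # assumed field name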
e. Practice:
i. Download the following file to your hard drive: https://raw.githubusercontent.com/sitepoint-editors/json-examples/master/src/db.json
ii. Write a program to print to screen the gender of the clients.
iii. Modify your program to create a frequency dictionary of those genders.
iv. Modify your program to calculate the mean age of the clients.
v. Modify your program to calculate the mean age by gender, that is, one mean age
for the women and another mean age for the men.
vi. Download the file tweets_queried_anonymous.json from the CMS. Print to
the console the text of each tweet.
1. Hint: The .load() function won't work with this file because the
tweets are each on their own line. Instead, loop over the file line by line
and use the .loads() function.
Lesson 9.4: Parsing CSV files

Objective:
● Students will read data stored in a CSV file into their Python script.
1. CSV: Comma-Separated Values files are tabular datasets (think spreadsheet tables), with
columns and rows.
2. The Python module pandas provides a tabular data structure called a "data frame".
3. Data can be read in from a CSV file or an Excel file.
4. Logic (see the sketch below):
a. Import the pandas module
b. Read in the data in the CSV (or Excel) file
c. Loop over the rows
d. Within the body of the for loop, access the columns you need
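A minimal sketch of this logic (the file name "transcript.csv" and the column name "text" are placeholders; adjust them to your file):
import pandas as pd

# read the CSV file into a data frame; pd.read_excel() works for Excel files
df = pd.read_csv('transcript.csv')

# loop over the rows and access the column(s) you need
for index, row in df.iterrows():
    print(row['text'])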
5. Demo:
a. Instructor shows how to read in the data from a CSV file and pull out text from a
specific column. The CSV file was created by the Whisper Automated Speech
Recognition (ASR) system on the audio of this interview here.
Lesson 10.1: Harvesting tweets from Twitter

Note: In February 2023, Twitter stopped offering free access to their API. I leave this lesson here for reference purposes only.
Objective:
● Students will understand how to harvest tweets using Twitter's API.
1. Twitter has an API (application programming interface) to allow programmers to harvest (or
collect) tweets with computer programs.
2. Python has several modules that allow Python programmers to access Twitter's API; tweepy is a
good one.
3. Twitter's API returns JSON-encoded text.
4. The developer app has a bearer token that is needed to programmatically (that is, with Python)
access Twitter's API.
5. There are two ways to harvest tweets:
a. Capture tweets in real-time with a streaming API listener.
i. Listening for a keyword or hashtag (example here).
b. Retrieve tweets from (approximately) the last week, from the REST API.
i. Query strings here.
ii. Tweet attributes here.
1. Context annotations here.
6. Logic:
a. Apply for and receive a developer's account with Twitter here.
b. Create a developer's app at apps.twitter.com.
c. Two ways to get tweets:
i. Listener:
1. Using your app's bearer token, create a stream listener on the API.
ii. Query:
1. Using your app's bearer token, create a query string and request tweets
from Twitter's API.
d. Save tweets to a .json file (or .csv) on your hard drive.
e. In a different Python script (.py file), parse the .json file saved to your hard drive and
pull out the info you'd like from the tweets.
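For reference only, a minimal sketch of the query approach (6.c.ii) as it worked with tweepy and the v2 API before free access ended (the bearer token and query string are placeholders):
import tweepy

# authenticate with the app's bearer token
client = tweepy.Client(bearer_token='YOUR_BEARER_TOKEN_HERE')

# retrieve up to 100 recent tweets matching the query string
response = client.search_recent_tweets(query='#linguistics', max_results=100)
for tweet in response.data:
    print(tweet.text)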
7. Demo/Practice:
a. Create a streaming listener on the API to listen for tweets with a hashtag of your choice
and save the tweets to a .json file on your hard drive (see example of saving .json files
here). You'll need to manually stop the script after a minute or two (or more or fewer,
depending on the popularity of the hashtag and the amount of data you want).
b. Query the API for recently sent tweets with a query string of your choice and save them
to a .json file on your hard drive.
Lesson 10.2: Harvesting comments from YouTube videos

Objective:
● Students will harvest comments from YouTube videos with YouTube's API.
1. Steps:
a. Get an API key from YouTube at Google's developers console here. Start the free trial to activate the key.
b. Download the google-api-python-client Python module
i. PyCharm > Preferences window > Project: [project_name] > Python interpreter
> "+" > search for "google-api-python-client" > Install package
ii. pip install google-api-python-client
c. Request comments from videos with video ID (get video ID from URL)
i. 100 comments is the maximum per request without looping over additional pages. To get more pages, see the SO answer here.
d. Save JSON object to .json file on your hard drive
e. In a different Python script, parse through .json file and pull out comments.
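A minimal sketch of steps c and d (the API key and video ID are placeholders; the request shown here retrieves up to 100 top-level comment threads):
import json
from googleapiclient.discovery import build

youtube = build('youtube', 'v3', developerKey='YOUR_API_KEY_HERE')
request = youtube.commentThreads().list(
    part='snippet',
    videoId='VIDEO_ID_HERE',
    maxResults=100,
)
response = request.execute()  # a nested Python dictionary

# save the JSON object to the hard drive for parsing in a separate script
with open('youtube_comments.json', 'w', encoding='utf8') as outfile:
    json.dump(response, outfile)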
2. Practice:
a. Using your API key, harvest 100 comments from a YouTube video of your choice and save the JSON object to a .json file on your hard drive. If you don't know which video to choose, why not use this one here from Dude Perfect.
b. In a different script, parse through the .json file and drill down into the nested
dictionaries to get the comments out.
Lesson 10.3: Harvesting posts from Reddit

Objective:
● Students will harvest posts from a subreddit of their choice.
1. Reddit is a social media platform that allows users to hide behind a cloak of anonymity and be rude and crude as they discuss a myriad of topics in so-called "subreddits".
2. For the language researcher who wants access to very unrehearsed and natural (written)
language, Reddit is a goldmine.
3. Steps:
a. Create user account, if you don't already have one.
b. Create an app here and write down/save the client id code (~20 alphanumeric character
code) and the client secret (~30 alphanumeric character code)
c. Download the PRAW Python module, either in:
i. PyCharm: Settings window > Python:ProjectName > Python Interpreter > + >
type "praw" > click "Install Package"
ii. pip install praw
d. Create a Python script following the instructions given in the documentation here.
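A minimal sketch following the PRAW documentation (the credentials are placeholders, and the subreddit name is just an example):
import praw

reddit = praw.Reddit(
    client_id='YOUR_CLIENT_ID',
    client_secret='YOUR_CLIENT_SECRET',
    user_agent='LING360 script by u/your_username',
)

# print the titles of the 10 hottest posts in a subreddit
for submission in reddit.subreddit('linguistics').hot(limit=10):
    print(submission.title)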
4. Demo:
a. The instructor demonstrates how to scrape the posts from a subreddit of his choice.
5. Practice:
a. The students take the instructor's script (available in the CMS) and scrape posts from a
subreddit of their choice.
Lesson 10.4: Sentiment analysis

Objective:
● Students will analyze sentiment in social media text with the vaderSentiment sentiment analyzer.
1. See examples here.
2. Logic after downloading vaderSentiment module:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# create an analyzer, then score a string; the "compound" score
# ranges from -1 (most negative) to +1 (most positive)
analyzer = SentimentIntensityAnalyzer()
sent = analyzer.polarity_scores("I am happy!")
print(sent["compound"])
3. Practice:
a. Set up a sentiment analyzer and analyze some short sentences of your own.
b. Download the following files from the CMS (or use your own .json files with social
media messages):
i. tweets_happy.json
ii. youtube_comments_9X_nbT89X-c.json
c. First, parse the file with tweets and get the mean compound sentiment score.
i. Hint: There are 100 lines in the file, each with a JSON object that contains one
tweet, so it makes sense to read in one line at a time and convert it to a Python
dictionary with json.loads().
d. Second, parse the file with Youtube comments and get the mean compound sentiment
score.
i. Hint: The file has only one line with a big JSON object with all 100 comments
in it, so it makes sense to read in the whole file at once (with .read()) and
then convert it to a Python dictionary with json.loads().
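A minimal sketch for part c, assuming the text of each tweet is stored under the key "text" (an assumption; inspect the file to verify):
import json
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = []
with open('tweets_happy.json', encoding='utf8') as infile:
    for line in infile:                    # one JSON object per line
        tweet = json.loads(line)
        scores.append(analyzer.polarity_scores(tweet['text'])['compound'])
print(sum(scores) / len(scores))           # mean compound sentiment score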
Lesson 11.1: Webscraping

Objective:
● Students will webscrape the text and links out of webpages.
1. Logic:
a. Have Python (specifically the requests module here) act like a web browser and retrieve (a.k.a. scrape) the source code of a webpage.
i. The "User-Agent" of an HTTP request specifies information about the requesting system (see here). Some webpages require a user-agent (e.g., here and here), so we need to specify one in our request (see here).
b. Use the Python module jusText here to retrieve text and do something with it:
i. Print to screen;
ii. Write out to the hard drive in a .txt file;
iii. [Whatever else you might want to do with human-generated text].
c. Use the Python module bs4 (BeautifulSoup) here to retrieve links.
i. Probably have to clean up links:
1. Remove empty strings;
2. Remove duplicate links;
3. Remove same-document links (that begin with "#");
4. Create absolute links (i.e., links that start with "http") from relative links
(i.e., links that start with "/").
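A minimal sketch of this logic (the URL is a placeholder, and the user-agent string is abbreviated here):
import requests
import justext
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://www.example.com/'
headers = {'User-Agent': 'Mozilla/5.0'}  # some sites require a user-agent
response = requests.get(url, headers=headers)

# text: keep only the non-boilerplate paragraphs
for paragraph in justext.justext(response.content, justext.get_stoplist('English')):
    if not paragraph.is_boilerplate:
        print(paragraph.text)

# links: collect href attributes of anchor tags, skipping same-document
# links; the set removes duplicates, and urljoin() makes links absolute
soup = BeautifulSoup(response.text, 'html.parser')
links = {urljoin(url, a['href']) for a in soup.find_all('a', href=True)
         if not a['href'].startswith('#')}
print(links)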
2. Practice:
a. Create a program that webscrapes a blog of your choice and print the text to the screen.
i. Hint1: Be sure to specify a common user-agent in your request; see SO thread
here.
ii. Hint2: The Python module jusText will be your friend here.
b. Modify the previous program to save the text to a .txt file on your hard drive.
i. Hint: File output is the name of the game here (review Lesson 5.1 "File I/O" if
needed).
c. Create a program that prints to screen the links on a webpage, that is, the URLs within
the href attribute of anchor tags <a>.
i. Hint: Choose your own adventure, whether with BeautifulSoup or with the
Python module lxml and xpath.
d. Modify the previous program to clean up the links by:
i. removing empty strings in the list of links, if any;
ii. removing duplicate links, if any;
1. The set data structure is the way to go here.
iii. removing same-document links that start with "#", if any;
iv. creating absolute links from relative links (i.e., links that start with "/") by concatenating the domain with the relative path (e.g., https://www.r-bloggers.com/2021/03/workshop-31-03-21-bring-a-shiny-app-to-production/).
1. Hint: Look up the urljoin() function in the urllib.parse module.
e. Modify the previous program to randomly choose a few links from among the list of
clean links, then retrieve and print to screen the text on those few webpages.
i. Hint: The Python module random, and specifically the shuffle() function,
will be helpful here.
ii. Word of caution: random.shuffle() modifies the list in place and returns None (it does not create a shuffled copy), so don't assign its return value back to your variable (e.g., links = random.shuffle(links) would replace your list with None).
Lesson 11.2: Automating web browsers

Objective:
● Students will automate the web browser Google Chrome with Selenium.
1. Selenium automates web browsers.
2. Python has a wrapper called (you guessed it!) selenium to control Selenium.
3. Automating a web browser, rather than webscraping the HTML code, is necessary in order to run JavaScript functions and get the data that those JavaScript functions return.
4. Very basic video tutorial here.
5. Download ChromeDriver here.
a. Highly important note: Make sure the version of ChromeDriver that you download has
the same version number as your Google Chrome browser, otherwise you'll get an error
message like "This version of ChromeDriver only supports Chrome version NN Current
browser version is NN".
6. Implicit waits cause the driver to wait up to a certain number of seconds while trying to locate
an element; see here.
7. Elements on a webpage can be located in a number of ways: ID attribute, xpath, the text of a
link, part of the text of a link, the name of an HTML tag, class attribute name, and by CSS
selector; see here.
a. Note: Use the driver.find_element() and driver.find_elements() functions, not the deprecated driver.find_element_by_*() functions, etc.
8. You can write text in a search box with the .send_keys() method of an element (after
locating it), and you can click on a button or a link with the .click() method of an element.
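A minimal sketch of locating elements and sending keys (the locators for YouTube's search box and button are assumptions based on inspecting the page; verify them yourself, as they may change):
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # ChromeDriver must match your Chrome version
driver.implicitly_wait(10)   # wait up to 10 seconds when locating elements
driver.get('https://www.youtube.com')

search_box = driver.find_element(By.NAME, 'search_query')  # assumed locator
search_box.send_keys('linguistics')
driver.find_element(By.ID, 'search-icon-legacy').click()   # assumed locator

time.sleep(5)
driver.quit()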
9. Practice:
a. Write a Python script to open a Google Chrome window, navigate to the homepage of YouTube, wait five seconds, and close the window.
b. Modify your script to enter text in the search box at the top of the screen, click the
search button (with magnifying glass), wait five seconds, and then close the window.
i. Hint: You can see the HTML element of specific elements of a webpage by
right-clicking and selecting "Inspect" (or similar).
c. Modify your script to click on the first video on the page, wait five seconds, and then
close the window.
i. Hint: Inspect the title of the video, not the video thumbnail.
d. Modify your script to scroll down a lot, so that (more) comments are loaded.
e. Modify your script to scrape out the comments and print them to the Python console.
Lesson 12.1: Topic modeling with LDA

Objective:
● Students will extract topics from many texts with the gensim module.
1. Topic modeling is an unsupervised machine learning technique.
2. Latent Dirichlet allocation (LDA) is a popular algorithm for performing topic modeling.
3. "Unsupervised" means that no human input is needed for the algorithm to detect topics.
4. Good tutorial here and another one here, and a very detailed one here.
5. How do you find the optimal number of topics to specify?
a. Sophisticated way here and here.
b. Quick and dirty way: trial-and-error and visualizing results.
6. The pyLDAvis module creates an interactive visualization as an HTML page.
a. See the original paper here that gives details on this interactive visualization.
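A minimal sketch of LDA with gensim, using toy documents invented for illustration (the tutorials linked above cover real preprocessing, e.g., stopword removal and lemmatization):
from gensim import corpora, models

docs = [
    ['faith', 'hope', 'charity', 'service'],
    ['python', 'code', 'text', 'processing'],
    ['service', 'charity', 'love'],
]
# map words to integer IDs, then convert each document to bag-of-words counts
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic in lda.print_topics():
    print(topic)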
7. Demo:
a. The instructor leads the students through modifying the code given in the tutorials
linked to above to extract topics from a recent General Conference of the Church.
Lesson 12.2: Document similarity with doc2vec

Objective:
● Students will perform document similarity analysis.
1. Word embeddings are a way to place words into a multi-dimensional space.
2. Similar words are close to each other in that space, while dissimilar words are far from each
other.
a. Instructor gives a demo with fasttext.
3. Documents can also be placed in a multi-dimensional space, and documents that are similar to
each other are close to each other, while dissimilar documents are far from each other.
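A minimal sketch of doc2vec with gensim, using toy documents invented for illustration:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=['the', 'cat', 'sat'], tags=[0]),
    TaggedDocument(words=['the', 'dog', 'ran'], tags=[1]),
    TaggedDocument(words=['a', 'cat', 'slept'], tags=[2]),
]
model = Doc2Vec(docs, vector_size=20, min_count=1, epochs=40)

# infer a vector for a new document and find the most similar training docs
vector = model.infer_vector(['the', 'cat', 'slept'])
print(model.dv.most_similar([vector]))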
4. Demo:
a. Instructor gives a demo of a Google Colab notebook tutorial of doc2vec here.
i. Note: At the bottom of the webpage here, download the .ipynb file and upload it to the Google Colab environment of your personal Google account.
Lesson 12.3: Fuzzy string matching
Objective:
● Students will use fuzzy matching to determine the similarity of strings based on edit distance.
1. What question does string1 == string2 answer?
2. Fuzzy matching allows users to answer the question: "How similar are two strings?"
3. The Levenshtein Distance algorithm is probably the most popular edit distance algorithm.
4. The Python module called thefuzz implements the Levenshtein Distance algorithm and
provides seven useful functions.
a. Good blog post here.
i. Note: thefuzz was previously called fuzzywuzzy.
5. Fuzzy matching is available in other programming languages, for example, in R, Julia, Java,
and many others.
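A minimal sketch of thefuzz (the scores in the comments are approximate; run the code to verify):
from thefuzz import fuzz, process

print(fuzz.ratio('this is a test', 'this is a test!'))          # ~97
print(fuzz.partial_ratio('this is a test', 'this is a test!'))  # 100
print(fuzz.token_sort_ratio('fuzzy wuzzy was a bear', 'wuzzy fuzzy was a bear'))  # 100

# extract() and extractOne() find the best matches in a list of choices
choices = ['Atlanta Falcons', 'New York Jets', 'New York Giants']
print(process.extractOne('new york jets', choices))  # ('New York Jets', 100)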
6. Activity:
a. Take a couple minutes to run the example code for each of the following functions and,
importantly, understand and explain to a neighbor what each function does:
ratio(), partial_ratio(), token_sort_ratio(),
token_set_ratio(), partial_token_sort_ratio().
b. Take a few minutes to understand what extract() and extractOne() do.
i. The default scorer is WRatio() here.
7. Formative quiz with a partner:
a. Discuss what the five functions would return for the following pairs of two strings, and
then check your guess by using the five functions:
i. string1 = "Linguistics is the study of language.",
string2 = "Linguistics is the study of language!"
ii. string1 = "Computers are awesome, like, sweet",
string2 = "Computers are awesome, like, sweet sweet"
iii. "a b c d e f g h i j k l m n o p q r s t u v w x y z",
"z y x w v u t s r q p o n m l k j i h g f e d c b a"
iv. "abcdefghijklmnopqrstuvwxyz",
"zyxwvutsrqponmlkjihgfedcba"
8. Brainstorm: What are possible applications of fuzzy string matching?
a. Spell-checkers, using a big list of words like the one here.
b. Joining two tables based on fuzzy matching of strings; see here.
c. FamilySearch hints that suggest records matching the same person.
d. Search engines auto-correcting search terms.
e. What else?
Lesson 13.1: Yelp reviews

Objective:
● Students will analyze Yelp reviews.

1. Yelp customer reviews
a. Yelp makes available (some) customer reviews in JSON format (and as a SQL database)
for data analysis challenges.
b. I've created a couple .json files:
i. "yelp_AZ_2021.json" has 27,334 reviews written in 2021 about businesses in
Arizona with "Restaurants" and/or "Food" in their "categories" description. The
file is 18 MB.
ii. "yelp_AZ_2018.json" has 119,376 reviews written in 2018 about all businesses
in Arizona. The file is 83 MB.
iii. Download one or both files from the CMS to your hard drive.
2. Logic:
a. import json
b. Open a connection to the .json file on your hard drive.
c. Loop over the connection with a for loop in order to read in each review, as each
review is on a separate line.
d. In each iteration of the for loop, convert the JSON-formatted string to a Python
dictionary with the json.loads() function.
e. Access the value associated with the key that you want to analyze, and add it to a
collector list or use it in some other way.
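A minimal sketch of this logic for practice item a below, assuming each review stores its date under the key "date" (an assumption; inspect a line of the file to verify):
import json

dates = []
with open('yelp_AZ_2021.json', encoding='utf8') as infile:
    for line in infile:                 # one review per line
        review = json.loads(line)
        dates.append(review['date'])    # assumed key

# ISO-formatted date strings sort chronologically, so min/max give the range
print(min(dates), max(dates))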
3. Practice:
a. Find the range of dates during which the reviews in the sample AZ 2021 dataset or the
AZ 2018 dataset were written.
b. Find the number of reviews written on Wednesdays, and the number of reviews written
on Saturdays.
i. Hint: You'll need to use the datetime module. See StackOverflow here and
here, and read the docs here.
c. Find the average number of stars for reviews written on Wednesdays and compare it to
the average number of stars for reviews written on Saturdays.
i. Bonus: Find the average number of stars for each day of the week.
Lesson 13.2: Yelp reviews and linguistic features

Objective:
● Students will correlate the number of stars in Yelp reviews with linguistic features.
1. Logic:
a. Create a .csv file with as many rows as there are reviews in the .json file, and that has
several columns: business id, number of stars, number of occurrences of linguistic
feature(s).
b. Import the .csv file into Python (or into R, Julia, or Excel) and get the Pearson's r
correlation value between the column with the number of stars and the column with the
number of occurrences of the linguistic feature in question.
c. Visualize the same two columns with a scatterplot, with a regression line, and/or a
boxplot.
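A minimal sketch of this logic (the file and column names are placeholders; the "delicious" count matches practice item a below):
import csv
import json
import pandas as pd

# step a: one row per review, with the stars and the count of "delicious"
with open('yelp_AZ_2021.json', encoding='utf8') as infile, \
        open('stars_delicious.csv', 'w', newline='', encoding='utf8') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['stars', 'delicious'])
    for line in infile:
        review = json.loads(line)
        writer.writerow([review['stars'], review['text'].lower().count('delicious')])

# step b: read the .csv back in and get Pearson's r between the two columns
df = pd.read_csv('stars_delicious.csv')
print(df['stars'].corr(df['delicious']))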
2. Practice:
a. Using the .json file with AZ 2021 or AZ 2018 Yelp reviews in the CMS, create a .csv
file with two columns: column A = number of stars in the current review, column B =
number of occurrences of the word "delicious" in the current review.
b. Use the .csv file created in the previous exercise to calculate the correlation between the
number of stars and the number of occurrences of "delicious". For this practice exercise,
you can assume a normal distribution of the data and therefore use the most common
correlation test: Pearson's r. See example of Python code here.
c. Correlate the number of stars in the reviews in the AZ 2018 or AZ 2021 Yelp dataset
with the number of times a vowel is repeated three or more times, for example, "this
restaurant is waaay better than that one" or "that business is soooooo overraaaated".
d. Correlate the number of stars in reviews with the number of times (double) quotation
marks are used.
e. Choose another linguistic feature and correlate it with the number of stars in the reviews
in the sample dataset. Be prepared to share with the class what you find.
Lesson 13.3: Macros in Microsoft Word

Objective:
● Students will run already-made macros for Microsoft Word.
1. Macros in Microsoft Word (and Excel and LibreOffice and Google Docs) are snippets of
computer code that perform specific tasks on a Word document (or other type of file).
2. Paul Beverley has written a veritable plethora of macros for Microsoft Word geared towards
editing tasks; see the complete list of macros here.
3. The main pre-editing macro is FRedit, which is a global find-and-replace macro (Find-and-
Replace edit).
a. Download the FRedit macro and follow the instructions (i.e., "1_instructions.docx" file)
to get it running.
b. Add some new find-and-replace instructions in the "4_Sample_List.docx" file and rerun
the macro.
4. ProperNounAlyse: Identifies proper nouns that might be misspelled.
5. NumberToText: Converts a numeral into the spell-out version of the number.
6. MatchDoubleQuotes: Identifies any paragraphs that contain an odd number of opening and
closing double quotation marks.
7. Practice:
a. Students get FRedit working on their computer.
b. Students get at least one other macro of their choice working on their computer.
