Lesson Plans LING360
1. Syllabus
2. Look at a few Assignments together, if not all of them (quickly).
3. Why Python?
4. Look at PyCharm together.
1. source file vs. console
2. REPL (read-eval-print loop): View > Tool Windows > Python console
3. preferences: (Windows) File > Settings
1. … Appearance and Behavior
2. … Editor > General
5. Example script: looking for “handsome” (as a regex) in Pride and Prejudice (that I download
from the Gutenberg Project), just to show them a simple example of what a few lines of code
can do to answer a question in linguistics: Was "handsome" used for both men and women
during Jane Austen's time period (or in her dialect, or idiolect)?
import re
with open('[pathway here]', encoding='utf8') as infile:
    for line in infile:
        if re.search(r'\bhandsome', line, re.I):
            print(line, '\n')
6. Challenge Level versus Skill Level
1. https://alifeofproductivity.com/how-to-experience-flow-magical-chart/
2. https://uxdesign.cc/getting-into-the-flow-what-does-that-even-mean-58b16642ef1d
1. Accuracy
a. Precision
b. Recall
2. Look at "precision_recall.pptx" in the CMS
3. Review "Accuracy.pptx"
4. Practice:
a. Calculate the precision and recall of the first regex in the formative quiz in the previous
lesson for finding word-final /s/ that represents the third-person singular "s" of present
tense verbs in English in the short text given in that quiz.
b. Calculate the precision and recall of that same regex for finding word-final /s/ that
represents a plural morpheme.
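A quick worked sketch of the two measures, using made-up counts rather than the actual quiz text:
# Hypothetical counts for a regex meant to find 3rd-person singular -s:
true_positives = 8    # regex matches that really are 3rd-person singular -s
false_positives = 4   # regex matches that are something else (e.g., plural -s)
false_negatives = 2   # real 3rd-person singular -s forms the regex missed
precision = true_positives / (true_positives + false_positives)   # 8 / 12 = 0.67
recall = true_positives / (true_positives + false_negatives)      # 8 / 10 = 0.80
print(f"precision = {precision:.2f}, recall = {recall:.2f}")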
1. loops
a. while loop (in Think Python, Chapter 7: Iteration)
i. A block of code is executed repeatedly as long as a conditional statement
evaluates to True at the beginning of each iteration.
1. Note: Be careful not to create an infinite loop!
ii. Practice:
1. Using a while loop, write a program that asks the user for a number
between 1 and 10 and prints out from 1 to that number, each on a new
line.
2. Using a while loop, write a program that asks the user to write a
sentence and then asks the user to give a single letter (with a different call
to input()), and prints out the sentence, one letter on a line, until the
letter given by the user is encountered, at which point the loop is exited
and no more of the sentence is printed. If the letter given by the user
doesn't occur in the sentence, the whole sentence should print out.
a. Hint: Look up the break keyword.
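A minimal sketch of the second while-loop exercise, assuming the user supplies the sentence first and then a single letter (prompt wording is illustrative):
sentence = input("Type a sentence: ")
letter = input("Now type a single letter: ")
i = 0
while i < len(sentence):
    if sentence[i] == letter:
        break                  # stop as soon as the user's letter is reached
    print(sentence[i])         # one character per line
    i += 1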
b. for loop (in Chapter 8 of the textbook)
i. A block of code is executed for each element of a string or collection (such as a
list).
c. Examples: print every letter in a string; see Downey, Chapter 8, Section 8.3 "Traversal
with a for loop"
d. Practice:
i. Write a program that asks the user for a word with at least eight letters, and then
prints on a new line each letter in the word.
ii. Modify the previous program so that only consonants are printed.
1. Hint: Look up the in operator and the continue keyword.
iii. Modify the previous program so that both consonants and vowels are printed, but
with vowels in uppercase.
1. Hint: Don't forget about the string method .upper()
iv. Modify the previous program so that instead of printing on new lines, each letter
is printed next to each other (as words are normally printed, lIkE thIs).
1. Hint: Look up the end parameter of the print() function.
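One possible sketch of the final version of the program (practice item iv), assuming the word may contain uppercase letters:
word = input("Type a word with at least eight letters: ")
for letter in word:
    if letter.lower() in 'aeiou':
        print(letter.upper(), end='')   # vowels in uppercase, no newline
    else:
        print(letter, end='')           # consonants as-is, no newline
print()                                 # final newline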
1. Formative quiz about indentation in Python: Determine how many times "¡Hola Mundo!" will
print to the screen in each of the following blocks of code, and discuss why with a neighbor!
# Block 1
for i in range(2):
    pass
for j in range(2):
    pass
print("¡Hola Mundo!")

# Block 2
for i in range(2):
    pass
for j in range(2):
    pass
    print("¡Hola Mundo!")

# Block 3
for i in range(2):
    pass
    for j in range(2):
        pass
        print("¡Hola Mundo!")
○ Example function 2:
def dash_vowels(x):
    """Takes as input a string and returns as output a
    string with vowels replaced with dashes."""
    output = ""
    for i in x:
        if i in 'aeiou':
            output += '-'
        else:
            output += i
    return output
Practice:
1. Define a function that takes as input two strings and returns as output a Boolean
value indicating whether the two strings are anagrams of each other (that is, the
letters of one word can be reordered to spell the other word, like with "secure"
and "rescue").
a. Hint1: sorted(list_name) or list_name.sort() will be your
friend here.
b. Hint2: Don't ask the user for the two strings with input();
rather, your function should take two parameters, one for each
word.
c. Hint3: Be sure to account for (possible differences in) capitalization.
d. Bonus: Modify your function so that it throws an error message if the
user doesn't supply two strings as arguments to the function call and/or if
the user doesn't supply values of type str.
2. Possible other functions:
a. Whether the letters of a word are in alphabetical order.
b. Whether the vowels of a word are in alphabetical order.
c. Whether a word has all five (orthographic) vowels, like "sequoia".
d. Whether a shorter word appears in a longer word, in the same order, like
"out" in "counter".
1. Processing sentences
a. Logic:
i. Split up a paragraph at punctuation that indicates the end of a sentence.
1. With a regular expression with re.split()
2. Or with the NLTK sentence tokenizer function. First, you need to
download all the resources used in the NLTK book, using some code
here.
ii. Work with each sentence in a loop, collecting desired information during each
iteration.
b. Practice:
i. Define a function that takes as input a string with a paragraph and returns as
output a float with the average number of words per sentence in the paragraph
given.
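A sketch of the practice function, splitting on end-of-sentence punctuation with re.split() (a naive pattern; abbreviations like "Mr." will trip it up):
import re

def mean_sentence_length(paragraph):
    """Return the average number of words per sentence in the paragraph."""
    sentences = [s for s in re.split(r'[.!?]+', paragraph) if s.strip()]
    word_counts = [len(s.split()) for s in sentences]
    return sum(word_counts) / len(word_counts)

print(mean_sentence_length("I saw her. She waved at me! Then we talked for a while."))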
1. Normalized counts
a. Let's suppose that:
i. Corpus A has 10,000 words and Corpus B has 5,000 words;
ii. "handsome" occurs 30 times in Corpus A, but only 25 times in Corpus B;
iii. Question: In which corpus is "handsome" more frequent?
iv. According to raw counts, "handsome" is more frequent in Corpus A, because 30
is larger than 25.
v. However, according to normalized counts (per 1,000 words), "handsome" is
more frequent in Corpus B, because it occurs 5 times per 1,000 words in Corpus
B (25 / 5,000 * 1,000 = 5), but only 3 times per 1,000 words in Corpus A (30 /
10,000 * 1,000 = 3).
vi. Conclusion: According to raw counts, "handsome" is more frequent in Corpus A,
but according to normalized counts (per thousand words) "handsome" is more
frequent in Corpus B.
vii. Take-home message: In order to compare counts of words or linguistic features
between different corpora or texts, you first have to normalize counts.
b. Logic:
i. Navigate to the directory with the files and get their names.
ii. Create two counter variables, one for the total number of words and one for the
number of matches of the desired linguistic feature.
iii. Loop over the files.
iv. Within each iteration, split the current file into words.
v. Loop over the words in the current file.
vi. Within each iteration, increment the total words counter. Then, check if the
current word matches the regex, and if so, increment the linguistic feature
counter.
vii. After the loop has finished, divide the linguistic feature counter by the total
words counter to create a ratio, and then multiply that ratio by the base of the
normalized count (for example, by 1,000 if the base is per 1,000 words).
viii. Print out the results to the screen or write them to an output file.
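A sketch of that logic, assuming a folder of plain-text files and reusing the "handsome" regex from Lesson 1 as the linguistic feature:
import os
import re

directory = '[pathway here]'
total_words = 0
feature_count = 0
for filename in os.listdir(directory):               # loop over the files
    if not filename.endswith('.txt'):
        continue
    with open(os.path.join(directory, filename), encoding='utf8') as infile:
        for line in infile:
            for word in line.split():                # split the file into words
                total_words += 1
                if re.search(r'\bhandsome', word, re.I):
                    feature_count += 1
normalized = feature_count / total_words * 1000      # per 1,000 words
print(f'{feature_count} matches in {total_words} words = {normalized:.2f} per 1,000 words')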
1. Practice:
a. From the Gutenberg Project, download several .txt files of novels from an author
of your choice. Write a program to get a count normalized to 1,000 for a regex
pattern of your choice in one of the files. Write the output to a .csv file,
specifying that the columns be separated by a tab (\t) and that the rows be
separated by a newline break (\n). Import the output .csv file into a spreadsheet
(Excel, google spreadsheet, etc.) of your choice.
b. Modify the program to get normalized counts for each of the novels you
downloaded. Modify the code that produces the output .csv file so it is easy to
compare the normalized counts across the novels.
c. Download several .txt files from another author. Modify your program to get
normalized counts for each novel. Again, modify the code that produces the
output .csv file to make it easy to compare the normalized counts across the
novels.
d. Modify your program to get normalized counts for each author (two normalized
counts, one per author). Again, modify the code that produces the output .csv file
to make it easy to compare the normalized counts across the two authors.
Practice:
1. Create a dictionary with the names and ages of your family members. Convert it
to a list of (name, age) tuples by calling the list() function on the
dictionary's .items(), and print to screen.
2. Sort the list by names and print to screen. Sort again, this time by age, and
print to screen.
3. Sort the list by age in reverse order, and write to a .csv file.
a. Hint: Look up the reverse argument of the function sorted().
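A sketch of the three steps with a made-up family; note that it is the dictionary's .items() that yields the (name, age) tuples:
ages = {'Marta': 47, 'Jorge': 49, 'Ana': 17, 'Luis': 12}     # hypothetical ages
age_list = list(ages.items())                                # list of (name, age) tuples
print(age_list)
print(sorted(age_list))                                      # sorted by name
print(sorted(age_list, key=lambda pair: pair[1]))            # sorted by age
by_age_desc = sorted(age_list, key=lambda pair: pair[1], reverse=True)
with open('family_ages.csv', 'w', encoding='utf8') as outfile:
    for name, age in by_age_desc:
        outfile.write(f'{name},{age}\n')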
1. Six different (but similar under the hood) ways to create word frequency lists:
a. if...else
i. See example in Section 11.2 of textbook.
b. try...except
i. See example on SO here.
c. dictionary method .get()
i. See example on SO here.
d. defaultdict() function in module collections
i. See example in Section 19.7 of textbook and on SO here.
e. Counter() function in module collections
i. See example in Section 19.6 of textbook and on SO here.
f. FreqDist() function and .most_common() method from nltk library
i. See example in Section 3.1 of NLTK book.
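Two of the six techniques in a minimal sketch (the toy sentence is made up):
from collections import Counter

words = "the cat saw the other cat".split()

# (c) the dictionary method .get()
freqs = {}
for word in words:
    freqs[word] = freqs.get(word, 0) + 1
print(freqs)                            # {'the': 2, 'cat': 2, 'saw': 1, 'other': 1}

# (e) Counter() from the collections module
print(Counter(words).most_common())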
2. Formative (review) quiz:
a. In a previous lesson (Lesson 5.1 File I/O) we learned how to read text saved in a file into
a Python program. In an effort to review that material, choose the best way to
accomplish the desired outcome in each scenario. You can assume that a file connection
object is held by the variable infile. While there are more than two ways to read in
text from a file (review Lesson 5.1), only two options are given here because they are
reasonable ways to accomplish the tasks given below.
i. Options:
1. infile.read()
2. for line in infile:
ii. Tasks:
1. Read in each line (i.e., hard return / newline break) in the file, one at a
time.
2. Read in all text of the file as a single string.
3. You want to get frequencies of words in a file (and the file is small
enough to fit in memory).
4. You have a massive file (say, 50 GB) that won't fit in the working
memory of your computer, and you want to get frequencies of words.
3. Practice:
a. Write a few sentences about your favorite restaurant and/or food. Be sure to repeat at
least a few words, in order to double check that your word frequency list works
correctly.
b. Using the if...else technique, create a word frequency list of the words in your few
sentences and print the dictionary to the screen.
i. Hint: Be sure to deal with capitalization by converting all words to either
uppercase or lowercase.
c. Modify your program to use try...except, then the dictionary method .get(),
then the defaultdict() function in the module collections, and then the
Counter() function in the module collections, and finally, the FreqDist()
function from the nltk library.
d. Modify your program to sort the dictionary (after converting it to a list) in reverse
(descending) order based on the frequencies (values).
i. Read docs on sorting lists at python.org.
e. Modify your program to create a word frequency list of a .txt file of your choice
(possibly from http://www.gutenberg.org/).
f. Modify your program to create a word frequency list of several .txt files.
g. Modify your program to write the word frequency list to a .csv file, again in descending
order based on the frequencies.
i. Bonus: Modify your program to first sort in descending order by frequency, and
then, secondly, in ascending order by word, so that words with the same
frequency are arranged in alphabetical order. The first comment on the answer of
this Stack Overflow thread will be useful (but you'll need to modify the code a
bit in order to get both the key and the value, not just the key):
https://stackoverflow.com/questions/9919342/sorting-a-dictionary-by-value-
then-key
h. Modify your program to exclude the words in the following list: stop_list =
['the', 'a', 'an', 'of', 'from', 'with', 'to', 'and']
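A sketch of the two-level sort from the bonus item, on toy data, using a key function that negates the frequency so that ties are broken alphabetically:
items = [('cat', 2), ('ant', 2), ('dog', 5), ('bee', 1)]
items_sorted = sorted(items, key=lambda pair: (-pair[1], pair[0]))
print(items_sorted)     # [('dog', 5), ('ant', 2), ('cat', 2), ('bee', 1)]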
1. statistics module
a. Functions to measure central tendency:
i. mean()
ii. median()
iii. mode()
b. Functions to measure dispersion or spread (around the central tendency)
i. stdev() # standard deviation
ii. variance() # this is the square of the standard deviation
iii. range (the max minus the min; there is no range() function in the statistics module, so compute it with max() and min())
2. Practice:
a. Download (or reuse) a .txt file of a novel of your choice from the Gutenberg Project.
b. Calculate the measures of central tendency and the measures of dispersion for sentence
length.
i. Hint1: It might be easiest to read in the whole file as a string with the .read()
method of the file handle, and then re.split() the string into sentences on
punctuation that indicates the end of sentences.
ii. Hint2: Another way to split up a string into sentences is with the
sent_tokenize() function in the nltk library.
c. Download (or reuse) a .txt file of a novel from a different author and calculate the same
measurements.
d. Compare the two authors' sentence lengths, looking at measures of central tendency, as
well as how varied their sentence lengths are, with measures of dispersion.
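A sketch of the sentence-length measurements, following Hint1 (nltk's sent_tokenize() from Hint2 would slot in where re.split() is used):
import re
import statistics

with open('[pathway here]', encoding='utf8') as infile:
    text = infile.read()
sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
lengths = [len(s.split()) for s in sentences]            # words per sentence
print('mean:', statistics.mean(lengths))
print('median:', statistics.median(lengths))
print('mode:', statistics.mode(lengths))
print('standard deviation:', statistics.stdev(lengths))
print('variance:', statistics.variance(lengths))
print('range:', max(lengths) - min(lengths))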
1. Part-of-speech tagging
a. Logic/Steps:
i. Download the nltk library (in PyCharm: Preferences > Project: [project name] >
Project Interpreter > the "+" > search for "nltk" > Install Package)
ii. In a .py file, run the code given here and download "book" ("Everything used in
the NLTK Book")
iii. import nltk
iv. Use the function word_tokenize(str) to tokenize text, that is, put words
and punctuation in separate elements of a list.
v. Use the function pos_tag(list) to tag part-of-speech (POS) of the tokens
(the individual words and punctuation). The returned value is a list with 2-
element tuples, the first element of which is the token, and the second element is
the POS tag.
vi. In a REPL (not in your .py script), you can use the function
nltk.help.upenn_tagset() to see what the tags of the Penn Treebank
tagset mean, for example, that "NNS" is the code for a plural noun. Or, you can
find the tagset online with a simple internet search of keywords like "penn
treebank tagset".
vii. The function pos_tag() takes the argument tagset="universal",
which gives simplified tags.
viii. See examples in chapter 5 of the NLTK book.
1. Beware! Unfortunately, some example code in the NLTK book isn't
completely reproducible, that is, there are some errors.
b. Practice:
i. Write a sentence or two about the topic of your choice. Try to include a word or
two that has different parts of speech depending on the context, for example,
"The judge must record the new world record" (verb / noun) or "Yesterday I
talked with her, but I haven't talked with her yet today" (simple past / present
perfect).
ii. Tag the part-of-speech of the sentence(s) and look up the tags in the Penn
Treebank tagset. Determine how many of the words were tagged correctly.
iii. Find a paragraph or two online and copy and paste it/them as the input text in
your program. Rerun your program and skim the tags and look for any incorrect
tags.
iv. Modify your previous program to use the "universal" tagset. Next, calculate
how many of each tag the paragraph(s) has.
v. Modify your program to tag a text of your choice (perhaps a novel from
Gutenberg, or a speech from a political leader). Find all comparative adjectives
in the document and (if there aren't too many) report the precision of the tagger
with comparative adjectives.
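A minimal sketch of the tagging steps (the example sentence comes from the practice item; the tags the tagger assigns may or may not be correct, so check them yourself):
import nltk
from collections import Counter

sentence = "The judge must record the new world record."
tokens = nltk.word_tokenize(sentence)               # words and punctuation as separate tokens
tagged = nltk.pos_tag(tokens)                       # list of (token, Penn Treebank tag) tuples
print(tagged)
universal = nltk.pos_tag(tokens, tagset="universal")
print(Counter(tag for token, tag in universal))     # frequency of each simplified tag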
1. Logic:
a. The logic of the code will vary according to how the part-of-speech tags are organized.
b. In general, first you need to isolate tokens (words and punctuation) from their
accompanying part-of-speech tags.
c. Then, you search for specific words, likely with regular expressions, that have certain
POS tags, again, likely specified with regexes.
2. Practice:
a. Download the Mini-CORE_tagd.zip file from the CMS and unzip it on your hard drive.
b. Write a program to print out to the screen all the adjective + noun pairs.
i. Hint: It might be easiest to loop over the indexes of word/POS pairs within each
file, so that you can check if an adjective is followed by a noun, and if so, print
both words based on the indexes.
ii. Hint2: .readlines() is useful here.
c. Modify your program to calculate the frequency of the adjective-noun pairs and write
them to a .csv file, ordered in descending order by frequency.
d. Modify your program to write to a .csv file all the superlative adjective + plural non-
proper noun pairs.
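A sketch of the adjective + noun search under an assumed file format of one "token<TAB>tag" pair per line; open one of the Mini-CORE files first and adjust the splitting to the real layout:
# Assumed (hypothetical) format: one token and its POS tag per line, separated by a tab.
with open('[pathway here]', encoding='utf8') as infile:
    lines = infile.readlines()
pairs = [line.strip().split('\t') for line in lines if line.strip()]
pairs = [p for p in pairs if len(p) == 2]
for i in range(len(pairs) - 1):                     # loop over indexes of word/POS pairs
    word, tag = pairs[i]
    next_word, next_tag = pairs[i + 1]
    if tag.startswith('JJ') and next_tag.startswith('NN'):
        print(word, next_word)                      # adjective followed by a noun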
1. JSON
a. JSON (JavaScript Object Notation) is a data format that is commonly used for sending
information from a web server to a client, that is, to a person using a computer or a cell
phone to access a website, or to a program written to retrieve lots of data.
b. A JSON object is very similar in format to a Python dictionary.
i. See example on wikipedia and here.
c. Read more about JSON here:
i. https://www.json.org/
ii. https://en.wikipedia.org/wiki/JSON
d. Logic to parse JSON:
i. import json
ii. Load a .json file with the json.load() function, or convert a JSON-formatted string to a
Python dictionary with the json.loads() function (mind the "s"!).
iii. Iterate over the Python dictionary (which may have a list in it), pulling out
values with their corresponding keys.
e. Practice:
i. Download the following file to your hard drive:
https://raw.githubusercontent.com/sitepoint-editors/json-examples/master/src/
db.json
ii. Write a program to print to screen the gender of the clients.
iii. Modify your program to create a frequency dictionary of those genders.
iv. Modify your program to calculate the mean age of the clients.
v. Modify your program to calculate the mean age by gender, that is, one mean age
for the women and another mean age for the men.
vi. Download the file tweets_queried_anonymous.json from the CMS. Print to
the console the text of each tweet.
1. Hint: The .load() function won't work with this file because the
tweets are each on their own line. Instead, loop over the file line by line
and use the .loads() function.
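A sketch of the first few practice items, assuming db.json holds a top-level "clients" list whose items include "gender" and "age" keys; open the file first to confirm the real keys:
import json

with open('db.json', encoding='utf8') as infile:
    data = json.load(infile)
clients = data['clients']                           # assumed key; check the actual file
for client in clients:
    print(client['gender'])
mean_age = sum(client['age'] for client in clients) / len(clients)
print('mean age:', mean_age)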
1. CSV: Comma-Separated Values files are tabular datasets (think spreadsheet tables), with
columns and rows.
2. The Python module pandas provides a tabular data structure called a "data frame".
3. Data can be read in from a CSV file or an Excel file.
4. Logic:
a. Importing the pandas module
b. Read in the data in the CSV (or Excel) file
c. Loop over the rows
d. Within the body of the for loop, access the columns you need
5. Demo:
a. Instructor shows how to read in the data from a CSV file and pull out text from a
specific column. The CSV file was created by the Whisper Automated Speech
Recognition (ASR) system on the audio of this interview here.
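A minimal pandas sketch of those steps; the file path and the column name "text" are placeholders for whatever the Whisper CSV actually contains:
import pandas as pd

df = pd.read_csv('[pathway here]')                  # or pd.read_excel() for an Excel file
for index, row in df.iterrows():                    # loop over the rows
    print(row['text'])                              # access the column(s) you need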
Objective:
● Students will understand how to harvest tweets using Twitter's API.
1. Twitter has an API (application programming interface) to allow programmers to harvest (or
collect) tweets with computer programs.
2. Python has several modules that allow Python programmers to access Twitter's API; tweepy is a
good one.
3. Twitter's API returns JSON-encoded text.
4. The developer app has a bearer token that is needed to programmatically (that is, with Python)
access Twitter's API.
5. There are two ways to harvest tweets:
a. Capture tweets in real-time with a streaming API listener.
i. Listening for a keyword or hashtag (example here).
b. Retrieve tweets from (approximately) the last week, from the REST API.
i. Query strings here.
ii. Tweet attributes here.
1. Context annotations here.
6. Logic:
a. Apply for and receive a developer's account with Twitter here.
b. Create a developer's app at apps.twitter.com.
c. Two ways to get tweets:
i. Listener:
1. Using your app's bearer token, create a stream listener on the API.
ii. Query:
1. Using your app's bearer token, create a query string and request tweets
from Twitter's API.
d. Save tweets to a .json file (or .csv) on your hard drive.
e. In a different Python script (.py file), parse the .json file saved to your hard drive and
pull out the info you'd like from the tweets.
7. Demo/Practice:
a. Create a streaming listener on the API to listen for tweets with a hashtag of your choice
and save the tweets to a .json file on your hard drive (see example of saving .json files
here). You'll need to manually stop the script after a minute or two (or more or fewer,
depending on the popularity of the hashtag and the amount of data you want).
b. Query the API for recently sent tweets with a query string of your choice and save them
to a .json file on your hard drive.
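A sketch of the query route with tweepy's Client class (tweepy 4.x); the bearer token, hashtag, and output format here are placeholders, and what the API returns depends on your access level:
import json
import tweepy

client = tweepy.Client(bearer_token='YOUR_BEARER_TOKEN')           # placeholder token
response = client.search_recent_tweets(query='#linguistics', max_results=50)
with open('tweets.json', 'w', encoding='utf8') as outfile:
    for tweet in response.data:                                    # one tweet per line
        outfile.write(json.dumps({'id': tweet.id, 'text': tweet.text}) + '\n')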
1. Steps:
a. Get an API key from YouTube at Google's developers console here. Start the free trial to
activate the key.
b. Download google-api-python-client Python module
i. PyCharm > Preferences window > Project: [project_name] > Python interpreter
> "+" > search for "google-api-python-client" > Install package
ii. pip install google-api-python-client
c. Request comments from videos with video ID (get video ID from URL)
i. 100 comments is the max without having to loop over additional pages. To get
more pages, see SO answer here.
d. Save JSON object to .json file on your hard drive
e. In a different Python script, parse through .json file and pull out comments.
2. Practice:
a. Using your API key, harvest 100 comments from a YouTube video of your choice and
save the JSON object to a .json file on your hard drive. If you don't know which video to
choose, why not use this one here from Dude Perfect.
b. In a different script, parse through the .json file and drill down into the nested
dictionaries to get the comments out.
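A sketch of the request and the drill-down, assuming the standard commentThreads endpoint of the YouTube Data API v3; the API key and video ID are placeholders:
import json
from googleapiclient.discovery import build

youtube = build('youtube', 'v3', developerKey='YOUR_API_KEY')      # placeholder key
request = youtube.commentThreads().list(part='snippet', videoId='VIDEO_ID_HERE', maxResults=100)
response = request.execute()
with open('comments.json', 'w', encoding='utf8') as outfile:
    json.dump(response, outfile)                                   # save the raw JSON

# In a separate script: drill down into the nested dictionaries for the comment text.
for item in response['items']:
    print(item['snippet']['topLevelComment']['snippet']['textDisplay'])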
1. Reddit is a social media platform that allows users to hide behind a cloak of anonymity and be
rude and crude as they discuss a myriad of topics in so-called "subreddits".
2. For the language researcher who wants access to very unrehearsed and natural (written)
language, Reddit is a goldmine.
3. Steps:
a. Create user account, if you don't already have one.
b. Create an app here and write down/save the client id code (~20 alphanumeric character
code) and the client secret (~30 alphanumeric character code)
c. Download the PRAW Python module, either in:
i. PyCharm: Settings window > Python:ProjectName > Python Interpreter > + >
type "praw" > click "Install Package"
ii. pip install praw
d. Create a Python script following the instructions given in the documentation here.
4. Demo:
a. The instructor demonstrates how to scrape the posts from a subreddit of his choice.
5. Practice:
a. The students take the instructor's script (available in the CMS) and scrape posts from a
subreddit of their choice.
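A minimal PRAW sketch (read-only); the credentials, user agent, and subreddit name are placeholders:
import praw

reddit = praw.Reddit(client_id='YOUR_CLIENT_ID',                   # from your Reddit app
                     client_secret='YOUR_CLIENT_SECRET',
                     user_agent='LING360 scraper by u/your_username')
for submission in reddit.subreddit('linguistics').hot(limit=10):   # placeholder subreddit
    print(submission.title)
    print(submission.selftext)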
1. Logic:
a. Have Python (specifically the requests module here) act like a web browser and
retrieve (aka. scrape) the source code of a webpage.
i. A "User-Agent" of an HTTP request specifies information about the requesting
system (see here). Some webpages require a user-agent (e.g., here and here), so
we need to specify one in our request (see here).
b. Use the Python module jusText here to retrieve text and do something with it:
i. Print to screen;
ii. Write out to the hard drive in a .txt file;
iii. [Whatever else you might want to do with human-generated text].
c. Use the Python module bs4 (BeautifulSoup) here to retrieve links.
i. Probably have to clean up links:
1. Remove empty strings;
2. Remove duplicate links;
3. Remove same-document links (that begin with "#");
4. Create absolute links (i.e., links that start with "http") from relative links
(i.e., links that start with "/").
2. Practice:
a. Create a program that webscrapes a blog of your choice and prints the text to the screen.
i. Hint1: Be sure to specify a common user-agent in your request; see SO thread
here.
ii. Hint2: The Python module jusText will be your friend here.
b. Modify the previous program to save the text to a .txt file on your hard drive.
i. Hint: File output is the name of the game here (review Lesson 5.1 "File I/O" if
needed).
c. Create a program that prints to screen the links on a webpage, that is, the URLs within
the href attribute of anchor tags <a>.
i. Hint: Choose your own adventure, whether with BeautifulSoup or with the
Python module lxml and xpath.
d. Modify the previous program to clean up the links by:
i. removing empty strings in the list of links, if any;
ii. removing duplicate links, if any;
1. The set data structure is the way to go here.
iii. removing same-document links that start with "#", if any;
iv. creating absolute links from relative links (i.e., links that start with "/") by
concatenating the domain with the relative path (e.g., https://www.r-
bloggers.com/2021/03/workshop-31-03-21-bring-a-shiny-app-to-production/).
1. Hint: Look up the urljoin() function in the urllib.parse module.
e. Modify the previous program to randomly choose a few links from among the list of
clean links, then retrieve and print to screen the text on those few webpages.
i. Hint: The Python module random, and specifically the shuffle() function,
will be helpful here.
ii. Word of caution: random.shuffle() shuffles the list in place and returns None, so
don't assign its return value back to your list variable (you would overwrite the list with None).
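A sketch of the scraping steps with requests, jusText, and BeautifulSoup; the URL and user-agent string are placeholders:
import requests
import justext
from bs4 import BeautifulSoup

url = 'https://www.r-bloggers.com/'                                # placeholder blog URL
headers = {'User-Agent': 'Mozilla/5.0'}                            # a common user-agent
response = requests.get(url, headers=headers)

# Text: keep only the paragraphs that jusText does not flag as boilerplate.
paragraphs = justext.justext(response.content, justext.get_stoplist('English'))
for paragraph in paragraphs:
    if not paragraph.is_boilerplate:
        print(paragraph.text)

# Links: pull the href attribute out of every anchor tag.
soup = BeautifulSoup(response.content, 'html.parser')
links = [a.get('href') for a in soup.find_all('a') if a.get('href')]
print(links)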
12. Logic:
a. Create a .csv file with as many rows as there are reviews in the .json file, and that has
several columns: business id, number of stars, number of occurrences of linguistic
feature(s).
b. Import the .csv file into Python (or into R, Julia, or Excel) and get the Pearson's r
correlation value between the column with the number of stars and the column with the
number of occurrences of the linguistic feature in question.
c. Visualize the same two columns with a scatterplot, with a regression line, and/or a
boxplot.
13. Practice:
a. Using the .json file with AZ 2021 or AZ 2018 Yelp reviews in the CMS, create a .csv
file with two columns: column A = number of stars in the current review, column B =
number of occurrences of the word "delicious" in the current review.
b. Use the .csv file created in the previous exercise to calculate the correlation between the
number of stars and the number of occurrences of "delicious". For this practice exercise,
you can assume a normal distribution of the data and therefore use the most common
correlation test: Pearson's r. See example of Python code here.
c. Correlate the number of stars in the reviews in the AZ 2018 or AZ 2021 Yelp dataset
with the number of times a vowel is repeated three or more times, for example, "this
restaurant is waaay better than that one" or "that business is soooooo overraaaated".
d. Correlate the number of stars in reviews with the number of times (double) quotation
marks are used.
e. Choose another linguistic feature and correlate it with the number of stars in the reviews
in the sample dataset. Be prepared to share with the class what you find.
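A sketch of the correlation step, assuming the .csv from exercise (a) has columns named "stars" and "delicious" (rename to match your own headers):
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv('[pathway here]')                       # the .csv created in exercise (a)
r, p = pearsonr(df['stars'], df['delicious'])            # hypothetical column names
print(f"Pearson's r = {r:.3f}, p = {p:.3f}")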
1. Macros in Microsoft Word (and Excel and LibreOffice and Google Docs) are snippets of
computer code that perform specific tasks on a Word document (or other type of file).
2. Paul Beverley has written a veritable plethora of macros for Microsoft Word geared towards
editing tasks; see the complete list of macros here.
3. The main pre-editing macro is FRedit, which is a global find-and-replace macro (Find-and-
Replace edit).
a. Download the FRedit macro and follow the instructions (i.e., "1_instructions.docx" file)
to get it running.
b. Add some new find-and-replace instructions in the "4_Sample_List.docx" file and rerun
the macro.
4. ProperNounAlyse: Identifies proper nouns that might be misspelled.
5. NumberToText: Converts a numeral into the spell-out version of the number.
6. MatchDoubleQuotes: Identifies any paragraphs that contain an odd number of opening and
closing double quotation marks.
7. Practice:
a. Students get FRedit working on their computer.
b. Students get at least one other macro of their choice working on their computer.