In this assignment you’ll implement a variant on the Map
interface using the
adapter pattern. You’ll use the implementation to build an n-gram viewer,
which shows data about common words and phrases in text over time.
You can get the starter code here:
https://www.dropbox.com/s/70bajyiayai6pn3/pa6-starter-master.zip?dl=0
DefaultMap
You’ll provide a fast implementation of an interface called DefaultMap in
DefaultMapImpl.java. You can implement it however you like, but we recommend
using one of Java’s built-in Map implementations (like HashMap or TreeMap)
with the adapter pattern.
The special feature of DefaultMap
is that it will be constructed with a
default value that it returns for keys that aren’t found. So, for example, a
default map constructed with a 0 default value will return 0 for all keys that
aren’t present:
DefaultMap<String, Integer> thisMap = new DefaultMapImpl<>(0);
assertEquals(0, (int)thisMap.get("C"));
assertEquals(0, (int)thisMap.get("A"));
thisMap.set("C", 3000);
assertEquals(3000, (int)thisMap.get("C"));
assertEquals(0, (int)thisMap.get("A"));
In addition, if null is provided as the default value, any attempt to get a
key that isn’t set should result in a NoSuchElementException:
DefaultMap<String, Integer> thisMap = new DefaultMapImpl<>(null);
thisMap.get("A"); // Throws NoSuchElementException
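One way to structure the adapter is to wrap a built-in map and intercept lookups. The sketch below is only an illustration of that idea, not the required implementation: the class name and the exact set/get signatures here are our assumptions, since the real ones are specified in DefaultMap.java.

```java
import java.util.HashMap;
import java.util.NoSuchElementException;

// Hypothetical adapter sketch; the real interface and signatures are in
// the provided DefaultMap.java.
class DefaultMapSketch<K, V> {
    private final HashMap<K, V> inner = new HashMap<>(); // the adaptee
    private final V defaultValue;

    DefaultMapSketch(V defaultValue) {
        this.defaultValue = defaultValue;
    }

    void set(K key, V value) {
        inner.put(key, value); // O(1) average for HashMap
    }

    V get(K key) {
        if (inner.containsKey(key)) {
            return inner.get(key);
        }
        if (defaultValue == null) {
            // a null default means missing keys are errors
            throw new NoSuchElementException("key not set: " + key);
        }
        return defaultValue;
    }
}
```

Since every method simply delegates to the wrapped HashMap, the bounds you argue for come directly from HashMap’s documented average-case behavior.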
You will implement all the methods defined in the DefaultMap
interface. You
must ensure that each has the specified big-O bound in the average case, and
argue why based on the documentation of any libraries you use, or based on your
implementation. Note that these are big-O bounds, so doing better (like O(1)
where O(log(n)) is required) is acceptable. In each, n represents the number
of entries in the map.
- set, required O(log(n))
- get, required O(log(n))
- containsKey, required O(log(n))
- keys, required O(n)
- values, required O(n)
- size, required O(1)
- defaultValue, required O(1)

DefaultMapImpl should have a single constructor that takes one argument (the
default value).
Their specifications are defined in header comments in the DefaultMap.java
file, which you should follow. You may use any methods in the Java
collections library to implement them (including helpers like
Collections.sort
). If you don’t know how to use a particular library method,
interface, or class in the standard Java utils, ask! This is an opportunity to
learn about the built-in libraries. You may find these useful:
The pages linked from those may also have useful information. Your
implementation of DefaultMap
will be graded automatically by tests that we
provide. We’ll provide a subset of the tests in the grader.
An n-gram is a sequence of N adjacent words in a document. The 1-grams (when N=1) in a document are just all the words within it. The 2-grams are all the sequences of two adjacent words, and so on. Here’s an example:
this sentence is an example sentence
The 1-grams: "this", "sentence", "is", "an", "example", "sentence"
The 2-grams: "this sentence", "sentence is", "is an", "an example", "example sentence"
The 3-grams: "this sentence is", "sentence is an", "is an example", "an example sentence"
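Mechanically, the n-grams of size n are just every window of n adjacent words. A small sketch (the helper name nGrams is ours, not part of the starter code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NGrams {
    // Every window of n adjacent words, joined with spaces.
    // (Illustrative helper; not part of the starter code.)
    static List<String> nGrams(String[] words, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return result;
    }
}
```

Running it with n = 2 on the example sentence above reproduces the 2-gram list shown.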
One way that text is analyzed is by breaking it into n-grams and counting the frequency at which various n-grams appear. Breaking this data into a time series is particularly interesting – how do particular terms change in frequency in documents over time?
Your task is to build the data structures needed to load a corpus of n-grams
and visualize them. You will build several methods in a class called Loader
to load a corpus of words, and then use a graphing library to plot them. You
will use your implementation of DefaultMap to do so.
The data is a sample of English language from the years 1990 to 2012, containing a few million words in total. It is the COCA sample from this web site. It is laid out in plain text files within a single directory, and the filenames all have the format:
w_<type>_<year>.txt
Where <type>
says something about the source of data (interesting, but not
relevant for the assignment), and <year>
says what year the data is from.
For example w_mag_1991.txt
contains snippets of text from magazines in 1991,
and w_fic_2002.txt
contains snippets from fictional works in 2002.
The files themselves each contain several lines with text and other formatting characters on them, for example:
##4000054 Section : Features Research sheds new light on the origin of humanity 's most intimate quadruped ally <p> The poor dog , " wrote poet Lord Byron in a flight of emotion , " in life the firmest friend , The first to welcome , foremost to defend . " And certainly , few animal lovers would care to differ . The dog , after all , is commonly referred to as man 's best friend , and unquestionably serves a wide range of human purposes . Thanks to artificial selection , there are dogs that guard houses and dogs that herd livestock , dogs that locate game birds for shooting and dogs that retrieve game birds that have been shot , dogs that pull sleds and dogs that sit languidly in human laps . <p> Clearly , the relationship between dog and human runs deep in our culture and our psyches . No surprise , then , that the origin of the domestic dog has long been a matter for speculation and inquiry . But now , new techniques of molecular biology are allowing researchers to trace @ @ @ @ @ @ @ @ @ @ ways previously unavailable to traditional wildlife biologists , taxonomists , and archeologists . Investigators are making great strides in understanding the origin of the domestic dog , even though results are often subject to dispute and controversy , as might be expected of research on a creature that is genetically complex . <p> " No other species is so diverse , " says Robert Wayne , a University of California-Los Angeles evolutionary biologist who has just completed the largest study ever on dog genetics and evolution . " Dogs are a model for how rapid morphological change might take place in a natural population . " They also offer clues as to how genetic vigor can be maintained in domestic species . <p> One of the key questions of dog evolution focuses on the source : From what wild creature did the domestic dog arise ? 
Charles Darwin suggested that the close relationship between wolves , coyotes , and jackals-all of which can interbreed-so muddies questions of which species yielded the dog that " we shall probably never be able to ascertain the dog 's origins with certainty . " Austrian @ @ @ @ @ @ @ @ @ @
You’ll notice a few things about the format, like the sequences of @
characters used to separate passages.

This format will be your source for retrieving the n-grams from the data
set, and these files are all in the directory data/
included with your
repository. Your program will consume this data and produce graphs like
this:
That is, a user can type in multiple n-grams separated by semicolons, and your program will produce a graph that shows the relationship.
You must implement the two following methods:
public static DefaultMap<Integer, DefaultMap<String, Integer>> generateDatabase(Path p)
public static Graph makeGraph(
DefaultMap<Integer, DefaultMap<String, Integer>> db,
String[] query)
The idea here is that generateDatabase
produces a useful structure that can
be efficiently queried later to produce a graph.
generateDatabase
should map from years to maps from n-grams to counts.
That is, the structure might represent something like:
1991 => { "is" => 30, "is a" => 10, "has a" => 20, ... }
1992 => { "is" => 45, "is a" => 12, "has a" => 25, ... }
1993 => { "is" => 60, "is a" => 13, "has a" => 28, ... }
So there will be one entry in the map for each year of data, and one entry in each year’s map for each n-gram that appeared at least once in that year’s data.
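The counting step that fills this structure can be sketched with plain HashMaps for self-containment; in your actual code the same shape would use your DefaultMap, where an inner default value of 0 lets you increment an unseen n-gram without a containsKey check.

```java
import java.util.HashMap;
import java.util.Map;

class CountSketch {
    // Illustrative nested counting with plain HashMaps; in the assignment
    // the same shape would use your DefaultMap implementation.
    static void count(Map<Integer, Map<String, Integer>> db,
                      int year, String gram) {
        Map<String, Integer> counts =
            db.computeIfAbsent(year, y -> new HashMap<>());
        counts.merge(gram, 1, Integer::sum); // increment, starting at 0
    }
}
```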
The Path
that generateDatabase
takes as an argument should be a path with a
directory structure like data
. The expectation is that the directory holds
files of the shape described above (w_<type>_<year>.txt
). The
generateDatabase
method should read all the files in this directory (see
“Reading Files” below), split them up on word boundaries, filter out non-word
strings (see “Filtering Strings” below), and build a map as described above.
You should add all 1- through 3-grams in the documents to the map (ignore grams
of 4 and above). Note that there are multiple files from each year, for example
both w_acad_1990.txt
and w_fic_1990.txt
represent text from the year 1990.
You should make sure to include n-grams from all the files corresponding to a
given year in the map for that year.
We will test your generateDatabase
with other directories than the given data
directory, and you may find it useful to do so as well. The test data we use
will have the expected layout above.
Once you’ve built the database, you can use it to ask interesting questions,
like how many times several particular n-grams appeared in each year from
1990 to 2012 in the dataset. We’ve provided you with a class Graph
that uses
an open-source chart library to draw the graphical elements, but it needs to be
constructed with the right input data to work. You will write makeGraph
to
take a database of the shape constructed by generateDatabase
along with a
query as an array of n-grams, and produce the corresponding Graph
.
A query is an array of n-grams to plot, so, for example, to query for the n-grams “is a” and “has a”, the array argument would be:
new String[]{"is a", "has a"}
The individual n-grams themselves are represented simply as space-separated words.
The goal of makeGraph
is to construct a Graph
, whose constructor has the
following signature:
public Graph(String title, List<Integer> years, DefaultMap<String, List<Integer>> data)
- The title parameter is simply the name that appears at the top.
- The years parameter is a List containing all the years to be plotted.
- The data parameter is a map from n-grams to lists of counts of the
  n-grams, with one element in the list for each year in years.

For example, if the years were 1991, 1992, and 1993, the map might
represent:
"is a" => [34, 67, 92]
"has a" => [44, 55, 33]
where there were 34
instances of "is a"
in 1991, 55
of "has a"
in
1992, and so on.
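Assembling the data parameter amounts to one lookup per (n-gram, year) pair. The sketch below uses plain Maps so it is self-contained; with a DefaultMap whose inner default is 0, the getOrDefault fallbacks would be unnecessary.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GraphDataSketch {
    // Illustrative assembly of per-n-gram count lists aligned with years;
    // plain Maps stand in for your DefaultMap here.
    static Map<String, List<Integer>> assemble(
            Map<Integer, Map<String, Integer>> db,
            List<Integer> years, String[] query) {
        Map<String, List<Integer>> data = new HashMap<>();
        for (String gram : query) {
            List<Integer> counts = new ArrayList<>();
            for (int year : years) {
                // 0 when the year or the n-gram is missing
                counts.add(db.getOrDefault(year, Map.of())
                             .getOrDefault(gram, 0));
            }
            data.put(gram, counts);
        }
        return data;
    }
}
```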
The Graph
constructor can then plot the information (see “Plotting” below).
Note that for the automatically graded part of the assignment, we will simply
check that you created the correct Graph
object.
You’re free to read in the files in any way you like. We found the following methods particularly useful:
- FileSystems.getDefault().getPath(), for getting a Path object from a
  string. Providing the path to the directory containing the test data may
  be useful, for example.
- Files.newDirectoryStream(), which returns an iterator over the paths in
  a directory.
- Files.readAllLines(), which gives back a List<String> containing all the
  lines in a file represented by a given path.

We didn’t need any more than these for the reference implementation. Here’s
an example of a class that you could add to your implementation that shows
an example of using these APIs. There are a few things worth noting:
- We use throws IOException on main here. This tells Java that if a file
  or directory isn’t found, main might just quit with an exception.
- The if within the loop is just to demonstrate what the paths look like,
  and to show what printing out/reading a single file looks like. There’s
  nothing special about that individual file.
- We use toString on a path to get a String representation of it. This can
  be useful for finding out which year a file refers to, since the year
  will always appear just before the .txt.
package cse12pa6student;

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class ReadSomeFiles {
  public static void main(String[] args) throws IOException {
    Path path = FileSystems.getDefault().getPath("./data");
    DirectoryStream<Path> d = Files.newDirectoryStream(path, "*.{txt}");
    for (Path p : d) {
      System.out.println("The path is: " + p);
      if (p.toString().equals("./data/w_acad_1992.txt")) {
        System.out.println(Files.readAllLines(p));
      }
    }
  }
}
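Since the year always appears just before the .txt, one hypothetical way to recover it from a path’s string form (the helper name is ours, not part of the starter code):

```java
class YearFromPath {
    // Pulls the four-digit year that sits just before ".txt" in names
    // shaped like w_<type>_<year>.txt. (Hypothetical helper.)
    static int yearOf(String fileName) {
        int dot = fileName.lastIndexOf(".txt");
        return Integer.parseInt(fileName.substring(dot - 4, dot));
    }
}
```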
The data within each file has extraneous information in it. For example, there
are sequences of @
signs separating passages, punctuation like .
, and
paragraph markers like <p>
. In addition, some words are in uppercase and some
are in lowercase. You should filter out this extraneous information to get the
most useful sequence of n-grams you can.
This process will not be graded automatically, but by reading your code and your argument for it. You should argue in your README about why your filtering rules were good (see the README section below). Note that this requires exercising your judgment, which we cannot do for you, so “Is this filtering rule OK?” kinds of questions need to come with an argument. We will autograde your submission on pre-filtered data that doesn’t have this extraneous information, so that your filtering rules won’t affect that part of your grade.
A reasonable approach to this is to break up each line on spaces to produce an
array or list (using String.split
), and then remove some of the
space-separated strings according to string matching rules you develop. Some
things to think about:
Note that when we test your code for correctness, we’ll focus on providing input that is in all lowercase with no punctuation. The string filtering you do will be graded manually and based on your arguments and judgment.
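As one illustration only, not a recommended rule, a filter might keep tokens made entirely of letters and apostrophes, which drops markers like <p> and @ but also discards hyphenated words; whether that trade-off is acceptable is exactly the judgment you should argue for in your README.

```java
class FilterSketch {
    // One possible rule, for illustration only: keep tokens made entirely
    // of letters and apostrophes (so "dog" and "'s" stay; "<p>", "@",
    // and "," are dropped).
    static boolean isWord(String token) {
        return token.matches("[a-zA-Z']+");
    }
}
```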
You will also implement a main
method that:
So, before the loop you should use generateDatabase
once to create the data
to query, then you can start the loop, and use makeGraph
to show the
responses.
Our loop looks something like:
database = generateDatabase(...);
Scanner in = new Scanner(System.in);
while (true) {
  System.out.print("Enter query: ");
  String query = in.nextLine();
  ... split query ...
  Graph g = makeGraph(database, queryArray);
  g.showChart();
}
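The “split query” step needs to turn a semicolon-separated line into the array makeGraph expects. A sketch (the helper name is ours; trimming handles spaces around the semicolons):

```java
class QuerySplit {
    // Turns a raw line like "is a; has a" into trimmed n-grams.
    // (Hypothetical helper name.)
    static String[] split(String raw) {
        String[] parts = raw.split(";");
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();
        }
        return parts;
    }
}
```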
This allows you to interactively explore relationships – this is a pretty useful data exploration tool! If you wanted to take this further, you could expand the set of data files you read in, or you could build a visualizer for data from many different sources.
You will write a README to answer the following questions:

- For each method in DefaultMapImpl, argue why its performance meets the
  specified required bound.
- How long does it take to build the database (generateDatabase)? What is
  its asymptotic complexity in terms of a tight big-O bound? In particular,
  how do you describe the input size?

This PA has the same style guidelines as PA5.
50 total points:

- DefaultMap [automatically graded]
- generateDatabase [automatically graded]
- makeGraph [automatically graded]
- main [manually graded]

There is no bad implementation testing on this PA. However, we highly recommend
you use the practices you know from testing to thoroughly check that
DefaultMapImpl
and the n-gram helpers you wrote work as expected.