Deep Learning for NLP
Coding Assignment 1
CS 544
Introduction
The goal of this coding assignment is to get you familiar with TensorFlow and to walk you through
some practical Deep Learning techniques. You will be starting with code similar to the one
taught in the first NLP Deep Learning class. Recall that the code we taught in class
implemented a 3-layer neural network over the document vectors; the output layer classified
each document into (positive/negative) and (truthful/deceptive). You will utilize the dataset
from coding assignments 1 and 2. In this assignment, you will:
• Improve the Tokenization.
• Convert the first layer into an Embedding Layer, which makes the model somewhat
more interpretable. Many recent Machine Learning efforts strive to make models more
interpretable, but sometimes at the expense of prediction accuracy.
• Increase the generalization accuracy of the model by implementing input sparse dropout
– TensorFlow’s (dense) dropout layer does not work out-of-the-box, as explained later.
• Visualize the Learned Embeddings using t-SNE.
In order to start the assignment, please download the starter-code from:
• http://sami.haija.org/cs544/DL1/starter.py
You can run this code as:
python starter.py path/to/coding1/and/2/data/
Note: This assignment will automatically be graded by a script, which verifies
the implementation of tasks one-by-one. It is important that you stick to these
guidelines: Only implement your code in places marked by ** TASK. Do not
change the signatures of the methods tagged ** TASK, or else the grading script
will fail to find them and you will get a zero for the corresponding parts.
Otherwise, feel free to create as many helper functions as you wish!
Finally, you might find the first NLP Deep Learning lecture slides useful.
• This assignment is due Thursday, April 4. We are working on Vocareum integration;
nonetheless, you are advised to start early (before we finish the Vocareum integration).
You can submit until April 7, but all submissions after April 4 will receive penalties.
[10 points] Task 1: Improve Tokenization
The current tokenization:
# ** TASK 1.
def Tokenize(comment):
  """Receives a string (comment) and returns array of tokens."""
  words = comment.split()
  return words
is crude: it splits on whitespace only (spaces, tabs, newlines) and leaves all other punctuation
attached to tokens, e.g. single and double quotes, exclamation marks, etc. There should be no reason to
have both terms "house" and "house?" in the vocabulary. While a perfect tokenization can
be quite involved, let us only slightly improve the existing one. Specifically, you should split
on any non-letter. You might find the Python standard re package useful.
• Update the code of Tokenize to work as described. A correct implementation should reduce
the number of tokens by about half.
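For illustration, here is a minimal sketch of one possible implementation. This is our own sketch, not necessarily the exact behavior the grading script expects (e.g. case handling is left as in the original), and the name TokenizeSketch is ours; your submission must keep the original Tokenize signature.

import re

def TokenizeSketch(comment):
  """Splits a comment on any run of non-letter characters and drops empty tokens."""
  # re.split on [^a-zA-Z]+ breaks the string at every non-letter,
  # so "house?" and "house" both yield the token "house".
  return [token for token in re.split(r'[^a-zA-Z]+', comment) if token]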
[20 + 6.5 points] Task 2: Convert the 1st layer into an embedding layer
Our goal here is to replace the first layer with something equivalent to tf.nn.embedding_lookup,
followed by averaging, but without using the function tf.nn.embedding_lookup, as we aim
to understand the underlying mathematics behind embeddings and we do not (yet) want to
discuss variable-length representations in TensorFlow [1].
The end goal of this task is to make the output of this layer represent every comment
(document) by the average embedding of the words appearing in the comment. Specifically, if we
represent a document by a vector $x \in \mathbb{R}^{|V|}$, with $|V|$ being the size of the vocabulary and
entry $x_i$ being the number of times word $i$ appears in the document, then we would like
the output of the embedding layer for document $x$ to be:

$$\sigma\left(\frac{x^\top Y}{\|x\|}\right) \qquad (1)$$

where $\sigma$ is an element-wise activation function. We wish to train the embedding matrix
$Y \in \mathbb{R}^{|V| \times d}$, which embeds each word in a $d$-dimensional space (each word embedding lives in
one row of the matrix $Y$). The denominator $\|x\|$ computes the average; it can be the
L1 or the L2 norm of the vector. In this exercise, use the L2 norm. The above should make
our model more interpretable. Note the following differences between the above embedding
layer and a traditional fully-connected (FC) layer with transformation $\sigma(x^\top W + b)$:
1. FC layers have an additional bias vector b. We do not want the bias vector: its presence
makes the embeddings trickier to visualize or to port to other applications.
Here, W corresponds to the embedding dictionary Y.
2. As mentioned, the input vector x to Equation 1 should be normalized. If x is a matrix,
then normalization should be row-wise. (Hint: you can use tf.nn.l2_normalize.)
3. Modern fully-connected layers typically have σ = ReLU. Embeddings generally have either
(1) no activation or (2) a squashing activation (e.g. tanh, or L2 normalization). We will opt
for (2), specifically the tanh activation, as option (1) might force us to choose an adaptive
learning rate [2] for the embedding layer.
4. The parameter W would be L2-regularized in a standard FC layer, i.e. by adding $\lambda \|W\|_2^2$ to the
overall minimization objective (where the scalar coefficient $\lambda$ is generally set
to a small value such as 0.0001 or 0.00001). When training embeddings, we only want
to regularize the embeddings of words that appear in the document, rather than *all* embeddings, at
every optimization update step. Specifically, we want to replace the
standard L2 regularization with $\lambda \left\| \frac{x^\top Y}{\|x\|} \right\|_2^2$
(a sketch combining Equation (1) with this regularizer appears after this list).
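Putting the pieces of this task together, here is a minimal TensorFlow sketch of an embedding layer implementing Equation (1) with tanh activation, no bias, row-wise L2 normalization, and the sparse regularizer above. It is our own illustration, not the starter code's actual structure: the names vocab_size, embed_dim, reg_coef, embedding_layer, and sparse_l2_regularizer are ours, and we assume the input x is a dense float tensor of word counts with shape [batch_size, vocab_size].

import tensorflow as tf

vocab_size, embed_dim = 10000, 40   # hypothetical sizes; use the ones from the starter code
reg_coef = 1e-4                     # the scalar coefficient lambda

# Embedding dictionary Y: one d-dimensional row per vocabulary word. No bias vector.
Y = tf.Variable(tf.random.normal([vocab_size, embed_dim], stddev=0.1), name='Y')

def embedding_layer(x):
  """Returns tanh(x^T Y / ||x||_2), computed row-wise as in Equation (1)."""
  x_normalized = tf.nn.l2_normalize(x, axis=1)   # divide each row of x by its L2 norm
  doc_embeddings = tf.matmul(x_normalized, Y)    # weighted average of word embeddings
  return tf.tanh(doc_embeddings)                 # squashing activation, option (2)

def sparse_l2_regularizer(x):
  """lambda * || x^T Y / ||x|| ||_2^2: penalizes only embeddings of words present in x."""
  projected = tf.matmul(tf.nn.l2_normalize(x, axis=1), Y)
  return reg_coef * tf.reduce_sum(tf.square(projected))

This regularizer term is added to the training loss in place of the usual lambda * ||W||_2^2 term, so that an update step only shrinks the rows of Y corresponding to words that actually occur in the batch.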
[1] Variable-length representations will likely be covered in the next coding assignment.
[2] Adaptive learning rates are incorporated in training algorithms such as AdaGrad and Adam.