Tutorial

This short tutorial will teach you how to use the gsitk library for Sentiment Analysis.

Concretely, it covers:

  • Managing datasets with the DatasetManager utility
  • Preprocessing textual data (although the included datasets are already preprocessed)
  • Extracting features with the models implemented in gsitk
  • Persisting extracted features to disk, and loading them back
  • Preparing an evaluation using the Evaluation interface

This tutorial was generated from a Jupyter notebook that you can download and run locally from here.

Requirements

To run this tutorial, you need to have gsitk installed. You can install it using pip:

pip install gsitk

Also, you need to set a default data path, which is where gsitk will save all the datasets. You can do so by setting the $DATA_PATH environment variable. If you do not specify $DATA_PATH, the default path is /data.

%env DATA_PATH=/tmp
env: DATA_PATH=/tmp
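
If you are not running this tutorial in a Jupyter notebook, you can also set the variable from plain Python. The snippet below is a minimal equivalent; it assumes gsitk reads $DATA_PATH when it is first imported, so set it beforehand:

import os
os.environ['DATA_PATH'] = '/tmp'  # set before importing gsitk so it picks up the path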

Finally, you need to download some NLTK resources:

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('opinion_lexicon')
[nltk_data] Downloading package punkt to /home/oaraque/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/oaraque/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package opinion_lexicon to
[nltk_data]     /home/oaraque/nltk_data...
[nltk_data]   Package opinion_lexicon is already up-to-date!

True

Dataset management

gsitk greatly simplifies dataset management, as it includes several commonly used Sentiment Analysis datasets. You can view the available datasets here. For this tutorial, we will download two of them:

from gsitk.datasets.datasets import DatasetManager

dm = DatasetManager()
data = dm.prepare_datasets(['vader', 'imdb'])

prepare_datasets downloads the data from their original sources, preprocesses the text and labels, and saves everything to disk (in $DATA_PATH). This way, the next time you call prepare_datasets it will run quickly.

The data variable now holds a Python dict with two keys: vader and imdb.

print('type of data:', type(data))
print('datasets prepared',  data.keys())
type of data: <class 'dict'>
datasets prepared dict_keys(['vader', 'imdb'])

Each dataset is stored as a pandas DataFrame:

data['vader'].head()
polarity text
0 1 [somehow, i, was, blessed, with, some, really,...
1 1 [yay, ., another, good, phone, interview, .]
2 1 [we, were, number, deep, last, night, amp, the...
3 1 [lmao, allcaps, ,, amazing, allcaps, !]
4 -1 [two, words, that, should, die, this, year, :,...
data['vader']['polarity'].value_counts()
 1    2901
-1    1299
Name: polarity, dtype: int64

All datasets available in gsitk can be seen here. This tool facilitates the replicability of sentiment analysis methods, offering researchers a common ground to work on.

Preprocessing

gsitk includes several functionalities for preprocessing text. Although the included datasets are already preprocessed when loaded through gsitk, users may want to preprocess their own datasets. The functionalities presented below make this possible.

gsitk includes three types of preprocessers:

  • Simple: the simplest and most efficient preprocessor.
  • Pre-process Twitter: a preprocessor designed for Twitter text.
  • Normalize: an all-purpose preprocessor.

For more information, please check the documentation.

Direct use

The most direct way to preprocess is as shown below:

from gsitk.preprocess import simple, pprocess_twitter, normalize

text = "The earth is not flat, but almost. Please, believe me!"
twitter_text = "@POTUS can I have a selfie? #thanks"

print('simple', simple.preprocess(text))
print('twitter', pprocess_twitter.preprocess(twitter_text))
print('normalize', normalize.preprocess(text))
simple ['the', 'earth', 'is', 'not', 'flat', ',', 'but', 'almost', '.', 'please', ',', 'believe', 'me', '!']
twitter <user> can i have a selfie? <hastag> thanks
normalize ['the', 'earth', 'is', 'not', 'flat', ',', 'but', 'almost', '.', 'please', ',', 'believe', 'me', '!']

Preprocessor interface

All preprocessing utilities implement the preprocess method, which can be useful for integrating them into your own work pipeline. In addition, gsitk offers the Preprocessor interface, which wraps these preprocessors in line with gsitk's design and makes it possible to include preprocessing in scikit-learn Pipelines. A simple example is shown below:

from gsitk.preprocess import pprocess_twitter, Preprocessor

texts = [
    "@POTUS can I have a selfie? #thanks",
    "If only Bradley's arm was longer. Best photo ever. #oscars"
]
Preprocessor(pprocess_twitter).transform(texts)
array(['<user> can i have a selfie? <hastag> thanks',
       "if only bradley's arm was longer. best photo ever. <hastag> oscars"],
      dtype='<U66')

As mentioned, the Preprocessor utility is fully compatible with scikit-learn's Pipelines. For example:

from sklearn.pipeline import Pipeline
from gsitk.preprocess import normalize, Preprocessor, JoinTransformer

texts = [
    "This cat is crazy, he is not on the mat!",
    "Will no one rid me of this turbulent priest?"
]

preprocessing_pipe = Pipeline([
    ('twitter', Preprocessor(normalize)),
    ('join', JoinTransformer())
])

preprocessing_pipe.fit_transform(texts)
['this cat is crazy , he is not on the mat !',
 'will no one rid me of this turbulent priest ?']

Stop words removal

Removing stop words is a common task in NLP. gsitk includes a utility (StopWordsRemover) that performs this task using NLTK's stop word lists. As before, StopWordsRemover is compatible with scikit-learn's Pipelines.

from gsitk.preprocess.stopwords import StopWordsRemover

texts = [
    "this cat is crazy , he is not on the mat !",
    "will no one rid me of this turbulent priest ?"
]

StopWordsRemover().fit_transform(texts)
['cat crazy , mat !', 'one rid turbulent priest ?']

As it relies on NLTK's stop word collections, several languages are supported, as in this Spanish example:

from gsitk.preprocess.stopwords import StopWordsRemover

texts = [
    "entre el clavel blanco y la rosa roja , su majestad escoja",
    "con diez cañones por banda viento en popa a toda vela",
]

StopWordsRemover(language='spanish').fit_transform(texts)
['clavel blanco rosa roja , majestad escoja',
 'diez cañones banda viento popa toda vela']

Feature extraction

gsitk has several useful feature extractors. It includes implementations of models proposed in research works, which aids replicability and comparison. These techniques have been published in peer-reviewed venues and are oriented towards fostering research. We show an example of the use of a word embedding model, extracting word2vec features (paper here) for Sentiment Analysis, and of the SIMilarity-based sentiment projectiON (SIMON) model (paper here).

Word embedding model (Word2VecFeatures)

This model aggregates the individual word vectors, computing a unified representation that can be used directly by a classical machine learning classifier. It uses a pre-trained word embedding model to extract a vector for each word, and then applies a pooling function across all words, obtaining a document-level representation. By default, the pooling function is the average.
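
To make the default pooling concrete, here is a minimal sketch that averages toy word vectors by hand; toy_embeddings is a made-up dictionary used only for illustration, not part of gsitk:

import numpy as np

# hypothetical 3-dimensional word vectors, for illustration only
toy_embeddings = {
    'my':    np.array([0.1, 0.0, 0.2]),
    'dog':   np.array([0.4, 0.3, 0.1]),
    'is':    np.array([0.0, 0.1, 0.0]),
    'happy': np.array([0.9, 0.8, 0.5]),
}

tokens = ['my', 'dog', 'is', 'happy']
# document-level representation: element-wise mean of the word vectors
doc_vector = np.mean([toy_embeddings[t] for t in tokens], axis=0)
print(doc_vector)  # a single 3-dimensional vector: [0.35, 0.3, 0.2]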

As a first step, import a word embedding model. Using gensim makes this step easier:

import gensim.downloader as api

embedding_model = api.load("glove-wiki-gigaword-50")
from gsitk.features.word2vec import Word2VecFeatures

w2v_transformer = Word2VecFeatures(model=embedding_model)
text = [
    ['my', 'dog', 'is', 'very', 'happy'],
    ['my', 'cat', 'is', 'instead', 'very', 'sad'],
]

w2v_features_test = w2v_transformer.fit_transform(text)
w2v_features_test
array([[ 0.2236732 ,  0.25583891, -0.487614  , -0.301236  ,  0.86721801,
         0.08720819, -0.48012703,  0.03669201,  0.099744  ,  0.0686424 ,
         0.05496354,  0.44522   , -0.11925981,  0.0202894 ,  0.58847399,
         0.292354  ,  0.25116399,  0.470316  , -0.15011139, -0.544134  ,
        -0.553544  ,  0.51258399,  0.36928921,  0.41198533,  0.81134399,
        -1.76617999, -0.923298  ,  0.54818199,  0.5961172 , -0.42874679,
         3.04592001,  0.29102398, -0.22677599,  0.087812  , -0.0293142 ,
        -0.072712  ,  0.1456914 ,  0.24413   ,  0.05948399, -0.600286  ,
        -0.1608764 ,  0.012316  , -0.39915799,  0.3701636 ,  0.3432422 ,
        -0.1038574 , -0.074668  , -0.441143  ,  0.2592026 ,  0.569796  ],
       [ 0.28759499,  0.22734876, -0.42304934, -0.26260117,  0.64443667,
         0.12306683, -0.13500085,  0.05503167, -0.14429333,  0.1364975 ,
         0.011965  ,  0.34951501, -0.09140483,  0.01189917,  0.49902666,
         0.3094525 ,  0.022184  ,  0.38554393, -0.05209548, -0.37077501,
        -0.54204166,  0.40015333,  0.29457433,  0.30822928,  0.70403166,
        -1.60760001, -0.93711665,  0.63221   ,  0.59509267, -0.38296766,
         2.95400006,  0.14456332, -0.09737834, -0.003224  , -0.06279167,
         0.050218  ,  0.14238633,  0.131965  ,  0.14213333, -0.493325  ,
        -0.159679  ,  0.1110655 , -0.28615333,  0.25916317,  0.1393735 ,
         0.07512983,  0.0305865 , -0.33857917,  0.189849  ,  0.43066502]])

SIMON model

The main idea of the SIMON method is that, given a domain lexicon, the input text is measured against it, producing a vector that encodes the similarity (as given by the word embedding model) between each word of the analyzed text and the lexicon words. For more information, please check the documentation section and the original publication.
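
Conceptually, the computation can be sketched as follows; this is a simplified illustration with made-up two-dimensional vectors, not the library's actual implementation:

import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# hypothetical toy embeddings, only to illustrate the idea
emb = {
    'good':  np.array([0.9, 0.1]),
    'bad':   np.array([-0.8, 0.2]),
    'happy': np.array([0.8, 0.3]),
    'dog':   np.array([0.1, 0.9]),
}

lexicon_words = ['good', 'bad']   # the domain lexicon
tokens = ['happy', 'dog']         # the analyzed text

# for each lexicon word, compare it with every input word and keep the highest
# similarity, giving one value per lexicon word as the text representation
simon_like_vector = [max(cosine(emb[t], emb[w]) for t in tokens) for w in lexicon_words]
print(simon_like_vector)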

To use SIMON, you first need a word embedding model; here we reuse the gensim model downloaded above.

SIMON also uses a lexicon of the domain to analyze; in this case, Sentiment Analysis. We can use the Bing Liu lexicon, accessible from NLTK:

from nltk.corpus import opinion_lexicon

lexicon = [list(opinion_lexicon.positive()), list(opinion_lexicon.negative())]

Finally, we need to configure the SIMON feature extractor. You can do it as follows:

from gsitk.features import simon

simon_transformer = simon.Simon(lexicon=lexicon, n_lexicon_words=50, embedding=embedding_model)

Now, we can extract features using the SIMON model. The implementation also supports scikit-learn's Pipelines. For example:

text = [
    ['my', 'dog', 'is', 'very', 'happy'],
    ['my', 'cat', 'is', 'instead', 'very', 'sad'],
]

simon_features_test = simon_transformer.fit_transform(text)
simon_features_test
array([[22.64666   ,  5.5705223 ,  2.2850964 , 10.810273  , 13.582376  ,
        15.125259  ,  8.218765  ,  9.00664   , -0.6937979 ,  3.9982183 ,
         6.4370413 ,  9.553367  , -0.5632133 , 13.165003  , 14.947266  ,
        12.059111  , 12.084166  , 16.94381   , 14.256762  ,  9.00737   ,
        13.533313  , 12.647817  ,  6.6219416 , 10.682739  ,  6.4929757 ,
        15.598402  ,  6.787737  ,  9.797682  ,  8.292459  , 12.26892   ,
        12.590726  ,  7.692454  ,  7.1737566 ,  2.7645953 , 10.165642  ,
         8.804068  ,  8.408757  ,  6.1955237 ,  2.0015996 ,  2.2309947 ,
         1.2787244 ,  4.7324104 , -3.376121  , 13.01586   , 15.482267  ,
        18.318289  , 11.544971  ,  5.715289  ,  1.1589067 ,  2.513846  ,
         2.0361245 ,  7.6383333 ,  3.632421  , -1.3370285 , -3.1686149 ,
        10.457821  ,  8.004176  ,  9.27165   ,  0.26848498, 13.7307205 ,
         0.66586924,  3.258816  , 13.448561  ,  6.2063417 ,  7.4360647 ,
        13.86675   , 11.382271  ,  8.852619  , 10.839699  ,  4.0306945 ,
         3.5801606 ,  5.652316  ,  7.7164745 ,  0.18364441,  0.35577607,
         8.836479  , 10.631449  ,  6.5578055 ,  4.9766135 ,  7.304039  ,
         1.4824042 ,  6.420419  , -0.29708278, 11.140987  , 11.104248  ,
        14.517608  , 14.589582  ,  2.9442854 ],
       [22.64666   ,  5.5705223 ,  5.0429745 , 10.810273  , 13.582375  ,
        15.125259  ,  8.218765  ,  9.006639  , -0.1227181 ,  3.9982183 ,
         6.4370413 ,  9.553365  , -0.43892786, 13.165003  , 14.947266  ,
        12.059112  , 12.084166  , 16.94381   , 14.256763  ,  9.00737   ,
        13.533313  , 12.647815  ,  6.6219416 , 10.682737  ,  6.4929757 ,
        15.598401  ,  6.8988657 ,  9.797681  ,  8.292459  , 12.26892   ,
        12.590726  ,  7.692454  ,  7.1737566 ,  5.890463  , 10.281912  ,
         7.938674  ,  8.35267   ,  5.9062653 ,  2.0015996 ,  2.2309947 ,
         1.2787243 ,  4.7324104 , -0.23842043, 13.015857  , 15.482265  ,
        25.534483  , 11.544971  ,  7.4026384 ,  4.7278094 ,  2.8091795 ,
         6.284559  ,  8.126308  ,  4.070155  ,  0.0385592 , -2.4845915 ,
        10.457822  ,  8.004177  ,  9.27165   , -0.4482143 , 13.730719  ,
         1.4019574 ,  5.206396  , 13.44856   ,  8.668956  ,  7.4360642 ,
        13.86675   , 11.382271  ,  8.85262   , 10.839697  ,  7.406282  ,
         3.5801613 ,  5.652316  ,  6.559163  , -0.39882052,  2.5723372 ,
         8.836479  , 10.631447  ,  7.5776615 ,  5.649705  ,  8.961614  ,
         2.4668307 ,  6.4204206 ,  1.1060963 , 11.383308  , 11.104248  ,
        14.517608  , 14.589582  ,  6.9521146 ]], dtype=float32)

Persist the features

gsitk allows you to save features to disk with a single line of code. This is useful when feature extraction takes a long time and it is not practical to repeat it: you can save the features to disk and load them later.

from gsitk.features import features

features.save_features(simon_features_test, 'simon_features_test') # you need to give the features a unique name

The features are now saved on disk, under the $DATA_PATH/features directory. In our example, they are here:

!ls /tmp/features
simon_features_test.npy

To load them from disk, we use the same name as before:

my_feats = features.load_features('simon_features_test')
(my_feats == simon_features_test).all() # check if they are the same features
True

Classifiers

gsitk includes functionalities to predict sentiment directly from text. One of the most common approaches in Sentiment Analysis is to use a sentiment lexicon (which directly encodes subjective sentiment information), matching the lexicon's words to those of the analyzed texts. gsitk implements this approach in LexiconSum.
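
The underlying idea can be sketched in a few lines; this is a simplification with a tiny made-up lexicon, and the actual LexiconSum implementation may differ in details such as normalization:

# hypothetical word-score lexicon, for illustration only
lexicon_scores = {'good': 1, 'happy': 1, 'sad': -1, 'bad': -1}

def lexicon_sum(tokens):
    # sum the scores of the tokens found in the lexicon; the sign gives the polarity
    score = sum(lexicon_scores.get(t, 0) for t in tokens)
    return 1 if score > 0 else (-1 if score < 0 else 0)

print(lexicon_sum(['my', 'dog', 'is', 'good', 'and', 'happy']))  # 1
print(lexicon_sum(['today', 'i', 'am', 'sad']))                  # -1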

from gsitk.classifiers import LexiconSum

# use the existing Bing Liu lexicon, combining positive and negative words into one dict
bingliu_pos = {word: 1 for word in opinion_lexicon.positive()}
bingliu_neg = {word: -1 for word in opinion_lexicon.negative()}
bingliu_pos.update(bingliu_neg)

ls = LexiconSum(bingliu_pos)
text = [
    ['my', 'dog', 'is', 'a', 'good', 'and', 'happy', 'pet'],
    ['my', 'cat', 'is', 'not', 'sad', 'just', 'mildly', 'bad'],
    ['today' , 'i', 'am', 'sad'],
]

ls.predict(text)
array([ 1., -1., -1.])

This implementation greatly eases the early stages of development, allowing users to quickly develop prototypes.

Evaluation

As mentioned, gsitk provides utilities that let you easily configure a Sentiment Analysis evaluation. In this example, we show how to do that. For more information on evaluation using gsitk, please read the documentation.

As in any evaluation methodology, we need some datasets on which to evaluate our models. We have already loaded two datasets, so let's use one of them: the IMDB dataset.

data['imdb'].head()
id fold text polarity rating
0 4677 train [i, understand, this, film, to, be, a, debut, ... 1 9
1 7632 train [getting, to, work, on, this, film, when, it, ... 1 10
2 1181 train [rachel, griffiths, writes, and, directs, this... 1 9
3 5050 train [we, really, enjoyed, grey, owl, :, a, simple,... 1 7
4 832 train [interesting, how, much, more, realistic, bros... 1 8

Next, we need to declare the feature extraction methods we want to compare. In this example, we compare the SIMON and word2vec features against a straightforward 1-gram method. We prepare the 1-gram method below, building full pipelines that include the classifier:

from gsitk.preprocess import JoinTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

pipelineA = Pipeline([
    ('join', JoinTransformer()), # needed to use CountVectorizer
    ('vect', CountVectorizer(max_features=50)),
    ('scale', StandardScaler(with_mean=False)),
    ('clf', LogisticRegression()),
])
pipelineA.name = '1-gram'

simon_transformer = simon.Simon(lexicon=lexicon, n_lexicon_words=50, embedding=embedding_model)
pipelineB = Pipeline([
    ('simon', simon_transformer),
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(solver='liblinear')),
])
pipelineB.name = 'simon'

w2v_transformer = Word2VecFeatures(model=embedding_model)
pipelineC = Pipeline([
    ('w2v', w2v_transformer),
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(solver='liblinear')),
])
pipelineC.name = 'w2v'

Now, we train our three methods on the dataset. We select the train fold from the data:

pipelineA.fit(
    data['imdb'][data['imdb']['fold'] == 'train']['text'],
    data['imdb'][data['imdb']['fold'] == 'train']['polarity'].values.astype(int),
)

pipelineB.fit(
    data['imdb'][data['imdb']['fold'] == 'train']['text'],
    data['imdb'][data['imdb']['fold'] == 'train']['polarity'].values.astype(int),
)

pipelineC.fit(
    data['imdb'][data['imdb']['fold'] == 'train']['text'],
    data['imdb'][data['imdb']['fold'] == 'train']['polarity'].values.astype(int),
)
print('Finished!')
Finished!

The evaluation is performed as follows:

from gsitk.evaluation.evaluation import Evaluation

# define datasets for evaluation: select test fold
datasets_evaluation = {'imdb': data['imdb'][data['imdb']['fold'] == 'test']}

# configure evaluation
ev = Evaluation(tuples=None,
                datasets=datasets_evaluation,
                pipelines=[pipelineA, pipelineB, pipelineC])

# perform the evaluation; this can take a while
ev.evaluate()

# results are stored in ev, and are in pandas DataFrame format
ev.results
Dataset Features Model CV accuracy precision_macro recall_macro f1_weighted f1_micro f1_macro Description
0 imdb None 1-gram__imdb False 0.64408 0.644085 0.64408 0.644077 0.64408 0.644077 join --> vect --> scale --> clf
1 imdb None simon__imdb False 0.74892 0.749377 0.74892 0.748805 0.74892 0.748805 simon --> scale --> clf
2 imdb None w2v__imdb False 0.75284 0.752973 0.75284 0.752807 0.75284 0.752807 w2v --> scale --> clf

In this way, we have performed a full evaluation on our data, comparing the three approaches. The results table shows the details of each run and how the method names are formed. Please note that the word embedding model and the hyperparameters of the other methods are chosen for illustration only and would not normally be used in a real evaluation; the same applies to the metrics obtained.
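
Since ev.results is a regular pandas DataFrame, you can post-process it with the usual pandas tools. For instance, assuming the column names shown above, you could keep a compact comparison sorted by macro F1:

# keep a few columns and sort by macro F1 (column names taken from the table above)
ev.results[['Model', 'accuracy', 'f1_macro']].sort_values('f1_macro', ascending=False)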

Learn more

This tutorial has shown how to use the main functionalities of gsitk. For more information, please check the documentation.