Tutorial
This short tutorial will teach you how to use the gsitk library for Sentiment Analysis.
Concretely, it covers:
- Managing datasets with the DatasetManager utility
- Preprocessing textual data (although the included datasets are already preprocessed)
- Extracting features with models that are implemented in gsitk
- Persisting extracted features to disk, and loading them from disk
- Preparing an evaluation using the Evaluation interface
This tutorial has been generated from a Jupyter notebook that you can download and run locally from here.
Requirements
To run this tutorial, you need to have gsitk installed. You can install it using pip:
pip install gsitk
Also, you need to set a default data path. This is where gsitk will save all the datasets.
You can do so by setting an environment variable. If you do not specify a $DATA_PATH, the default path is /data.
%env DATA_PATH=/tmp
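The %env magic only works inside a notebook. If you run the tutorial as a regular Python script, a minimal alternative is to set the variable in Python before importing gsitk (this sketch assumes gsitk reads DATA_PATH when it is imported; /tmp is just an example path):
import os
os.environ['DATA_PATH'] = '/tmp'  # set before importing gsitk; /tmp is just an example path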
Finally, you need to download some NLTK resources:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('opinion_lexicon')
Dataset management
gsitk greatly simplifies dataset management, as it includes several commonly used Sentiment Analysis datasets. You can view the available datasets here. For this tutorial, we will download two of them:
from gsitk.datasets.datasets import DatasetManager
dm = DatasetManager()
data = dm.prepare_datasets(['vader', 'imdb'])
prepare_datasets downloads the data from their original sources, preprocesses the text and labels, and saves everything to disk (in $DATA_PATH).
This way, the next time you call prepare_datasets, it will run quickly.
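If you call prepare_datasets again with the same arguments, it should return much faster, since the prepared data is read from $DATA_PATH instead of being downloaded again. In a notebook you can check this with the %time magic:
%time data = dm.prepare_datasets(['vader', 'imdb'])  # second run loads from disk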
The data variable now holds a python dict that contains two keys: vader and imdb.
print('type of data:', type(data))
print('datasets prepared', data.keys())
The datasets are saved in a pandas DataFrame:
data['vader'].head()
data['vader']['polarity'].value_counts()
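You can inspect the fold split of a dataset in the same way; the fold column is what we will use later, in the Evaluation section, to select the train and test partitions:
data['imdb']['fold'].value_counts()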
All datasets available in gsitk can be seen here. This tool eases the replicability of sentiment analysis methods, offering researchers a common ground to work on.
Preprocessing
gsitk includes several functionalities for preprocessing text. Although all the included datasets are already preprocessed when loaded through gsitk, users may want to preprocess their own datasets. The functionalities presented below allow them to do so.
gsitk includes three types of preprocessors:
- Simple: the simplest and most efficient preprocessor.
- Pre-process Twitter: a preprocessor suited to Twitter text.
- Normalize: an all-purpose preprocessor.
For more information, please check the documentation.
Direct use
The most direct way to preprocess is as shown below:
from gsitk.preprocess import simple, pprocess_twitter, normalize
text = "The earth is not flat, but almost. Please, believe me!"
twitter_text = "@POTUS can I have a selfie? #thanks"
print('simple', simple.preprocess(text))
print('twitter', pprocess_twitter.preprocess(twitter_text))
print('normalize', normalize.preprocess(text))
Preprocessor interface
All preprocessing utilities implement the preprocess
method, which can be useful for integrating these methods into your work pipeline. Nevertheless, gsitk offers the Preprocessor
interface to facilitate the use of preprocessers into its philosophy; as well as to include preprocessing into scikit-learn Pipelines. A simple example is shown below:
from gsitk.preprocess import pprocess_twitter, Preprocessor
texts = [
"@POTUS can I have a selfie? #thanks",
"If only Bradley's arm was longer. Best photo ever. #oscars"
]
Preprocessor(pprocess_twitter).transform(texts)
As mentioned, the Preprocessor utility is fully compatible with scikit-learn's Pipelines. For example:
from sklearn.pipeline import Pipeline
from gsitk.preprocess import normalize, Preprocessor, JoinTransformer
texts = [
"This cat is crazy, he is not on the mat!",
"Will no one rid me of this turbulent priest?"
]
preprocessing_pipe = Pipeline([
('twitter', Preprocessor(normalize)),
('join', JoinTransformer())
])
preprocessing_pipe.fit_transform(texts)
Stop words removal
Removing stop words is a common task in NLP. gsitk includes a functionality (StopWordsRemover) that performs this task using NLTK's stop word lists.
As before, StopWordsRemover is compatible with scikit-learn's Pipelines.
from gsitk.preprocess.stopwords import StopWordsRemover
texts = [
"this cat is crazy , he is not on the mat !",
"will no one rid me of this turbulent priest ?"
]
StopWordsRemover().fit_transform(texts)
As it uses the NLTK stop word collections, several languages are supported, as in this Spanish example:
from gsitk.preprocess.stopwords import StopWordsRemover
texts = [
"entre el clavel blanco y la rosa roja , su majestad escoja",
"con diez cañones por banda viento en popa a toda vela",
]
StopWordsRemover(language='spanish').fit_transform(texts)
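Since StopWordsRemover follows the scikit-learn transformer interface, it can also be chained after other preprocessing steps. Below is a minimal sketch of such a pipeline; it assumes, as in the examples above, that StopWordsRemover receives whitespace-separated strings, which is why the tokens are joined first:
from sklearn.pipeline import Pipeline
from gsitk.preprocess import normalize, Preprocessor, JoinTransformer
from gsitk.preprocess.stopwords import StopWordsRemover

stopwords_pipe = Pipeline([
    ('normalize', Preprocessor(normalize)),   # tokenize and normalize the raw text
    ('join', JoinTransformer()),              # join tokens back into a single string
    ('stopwords', StopWordsRemover()),        # remove English stop words
])
stopwords_pipe.fit_transform(["This cat is crazy, he is not on the mat!"])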
Feature extraction
gsitk has several useful feature extractors. It includes implementations of models proposed in research works, which aids replicability and comparison. These techniques have been recently published in peer-reviewed venues and are oriented to foster research. We show an example of using a word embedding model to extract word2vec features (paper here) for Sentiment Analysis, and of the SIMilarity-based sentiment projectiON (SIMON) model (paper here).
Word embedding model (Word2VecFeatures)
This model aggregates the individual word vectors, computing a unified representation that can be used directly by a classical machine learning classifier. It uses a pre-trained word embedding model to extract a vector for each word, and then applies a pooling function to all the word vectors, obtaining a document-level representation. By default, the pooling function is the average.
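To make the pooling idea concrete, here is a small illustration (not gsitk's code) of average pooling with numpy, assuming a gensim-style embedding model that supports word-in-model and model[word] lookups:
import numpy as np

def average_pool(tokens, model):
    # Illustration only: look up each known word and average the resulting vectors
    vectors = [model[word] for word in tokens if word in model]
    return np.mean(vectors, axis=0)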
As a first step, import a word embedding model. Using gensim makes this step easier:
import gensim.downloader as api
embedding_model = api.load("glove-wiki-gigaword-50")
from gsitk.features.word2vec import Word2VecFeatures
w2v_transformer = Word2VecFeatures(model=embedding_model)
text = [
['my', 'dog', 'is', 'very', 'happy'],
['my', 'cat', 'is', 'instead', 'very', 'sad'],
]
w2v_features_test = w2v_transformer.fit_transform(text)
w2v_features_test
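With the default average pooling described above, each document should be represented by a single vector with the dimensionality of the embedding model (50 for glove-wiki-gigaword-50), so we would expect one 50-dimensional row per document:
w2v_features_test.shape  # with average pooling, expect (2, 50): two documents, 50-dimensional vectors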
SIMON model
The main idea of the SIMON method is that given a domain lexicon, the input text is measured against it, computing a vector that encodes the similarity between the input text and the lexicon. Such a vector encodes the similarity, as given by the word embedding model, of each of the words of the analyzed text to the lexicon words. For more information, please check the documentation section and the original publication.
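As a rough illustration of this idea (this is not gsitk's implementation), each feature can be thought of as the highest cosine similarity between one lexicon word and any word of the analyzed document:
import numpy as np

def simon_like_features(tokens, lexicon_words, model):
    # Illustration only: one feature per lexicon word, holding the maximum cosine
    # similarity between that lexicon word and any word of the document
    doc_vectors = [model[w] for w in tokens if w in model]
    features = []
    for lex_word in lexicon_words:
        if lex_word not in model or not doc_vectors:
            features.append(0.0)
            continue
        lex_vec = model[lex_word]
        sims = [np.dot(lex_vec, v) / (np.linalg.norm(lex_vec) * np.linalg.norm(v))
                for v in doc_vectors]
        features.append(max(sims))
    return np.array(features)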
To use SIMON, you first need a word embedding model. The gensim library includes some downloadable models, which can be loaded as shown earlier; here we reuse the GloVe model loaded above.
Also, SIMON uses a lexicon of the domain to analyze; in this case, Sentiment Analysis. We can use the Bing Liu lexicon, accessible from NLTK:
from nltk.corpus import opinion_lexicon
lexicon = [list(opinion_lexicon.positive()), list(opinion_lexicon.negative())]
Finally, we need to configure the SIMON feature extractor. You can do it as follows:
from gsitk.features import simon
simon_transformer = simon.Simon(lexicon=lexicon, n_lexicon_words=50, embedding=embedding_model)
Now, we can extract features using the SIMON model. The implementation also supports scikit-learn's Pipelines, as we will see in the Evaluation section. For example:
text = [
['my', 'dog', 'is', 'very', 'happy'],
['my', 'cat', 'is', 'instead', 'very', 'sad'],
]
simon_features_test = simon_transformer.fit_transform(text)
simon_features_test
Persist the features
gsitk allows you to save features to disk using just one line of code. This is useful when the feature extraction process takes a long time and it is not practical to repeat it: you can save the features to disk and reuse them later.
from gsitk.features import features
features.save_features(simon_features_test, 'simon_features_test') # you need to give the features a unique name
Now, the features are saved on disk, under the $DATA_PATH/features directory. In our example, here:
!ls /tmp/features
To load them from disk, we use the same name as before:
my_feats = features.load_features('simon_features_test')
(my_feats == simon_features_test).all() # check if they are the same features
Classifiers
gsitk includes functionalities to predict sentiment directly from text.
One of the most common approaches in Sentiment Analysis is to use a sentiment lexicon -which directly encodes subjective sentiment information- by matching the lexicon's words to those of the analyzed texts.
gsitk implements this approach in LexiconSum.
from gsitk.classifiers import LexiconSum
# use the existing Bing Liu lexicon
bingliu_pos = {word: 1 for word in opinion_lexicon.positive()}
bingliu_neg = {word: -1 for word in opinion_lexicon.negative()}
bingliu_pos.update(bingliu_neg)
ls = LexiconSum(bingliu_pos)
text = [
['my', 'dog', 'is', 'a', 'good', 'and', 'happy', 'pet'],
['my', 'cat', 'is', 'not', 'sad', 'just', 'mildly', 'bad'],
['today' , 'i', 'am', 'sad'],
]
ls.predict(text)
This implementation greatly eases the early stages of development, allowing users to quickly develop prototypes.
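For intuition, the core of this approach can be sketched in a few lines of plain Python. This is only an illustration of the sum-and-sign idea, not necessarily the exact scoring that LexiconSum applies:
def lexicon_sum(tokens, lexicon):
    # Sum the scores of the tokens found in the lexicon; the sign gives the polarity
    score = sum(lexicon.get(token, 0) for token in tokens)
    return 1 if score > 0 else (-1 if score < 0 else 0)

lexicon_sum(['my', 'dog', 'is', 'a', 'good', 'and', 'happy', 'pet'], bingliu_pos)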
Evaluation
As mentioned, gsitk has useful utilities that allow you to easily configure a Sentiment Analysis evaluation. In this example, we show how to do that. For more information on evaluation using gsitk, please read the documentation.
As in all evaluation methodologies, we need some datasets on which to evaluate our models. We have already loaded two datasets, so let's use one of them: the IMDB dataset.
data['imdb'].head()
Next, we need to declare the feature extraction methods we want to compare. In this example, we compare SIMON and the word2vec features against a straightforward 1-gram method. We prepare the 1-gram method below as a full pipeline that includes the classifier, and do the same for the other two methods:
from gsitk.preprocess import JoinTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
pipelineA = Pipeline([
('join', JoinTransformer()), # needed to use CountVectorizer
('vect', CountVectorizer(max_features=50)),
('scale', StandardScaler(with_mean=False)),
('clf', LogisticRegression()),
])
pipelineA.name = '1-gram'
simon_transformer = simon.Simon(lexicon=lexicon, n_lexicon_words=50, embedding=embedding_model)
pipelineB = Pipeline([
('simon', simon_transformer),
('scale', StandardScaler()),
('clf', LogisticRegression(solver='liblinear')),
])
pipelineB.name = 'simon'
w2v_transformer = Word2VecFeatures(model=embedding_model)
pipelineC = Pipeline([
('w2v', w2v_transformer),
('scale', StandardScaler()),
('clf', LogisticRegression(solver='liblinear')),
])
pipelineC.name = 'w2v'
Now, we train our three methods using the dataset. We select the train fold from the data:
pipelineA.fit(
data['imdb'][data['imdb']['fold'] == 'train']['text'],
data['imdb'][data['imdb']['fold'] == 'train']['polarity'].values.astype(int),
)
pipelineB.fit(
data['imdb'][data['imdb']['fold'] == 'train']['text'],
data['imdb'][data['imdb']['fold'] == 'train']['polarity'].values.astype(int),
)
pipelineC.fit(
data['imdb'][data['imdb']['fold'] == 'train']['text'],
data['imdb'][data['imdb']['fold'] == 'train']['polarity'].values.astype(int),
)
print('Finished!')
The evaluation is performed as follows:
from gsitk.evaluation.evaluation import Evaluation
# define datasets for evaluation: select test fold
datasets_evaluation = {'imdb': data['imdb'][data['imdb']['fold'] == 'test']}
# configure evaluation
ev = Evaluation(tuples=None,
datasets=datasets_evaluation,
pipelines=[pipelineA, pipelineB, pipelineC])
# perform the evaluation; this can take a while
ev.evaluate()
# results are stored in ev, and are in pandas DataFrame format
ev.results
In this way, we have performed a full evaluation on our data, comparing the three approaches. The results table shows the details of each evaluation and how the names of the methods are formed. Please note that the word embedding model and the hyperparameters of the other methods are not the ones you would normally use in a real evaluation; they are set this way only for the example, so the metrics obtained should be read accordingly.
Learn more
This tutorial has shown how to use the main functionalities of gsitk. For more information, please check the documentation.