Sentiment Analysis

Databases

Sentiment databases
Movie reviews: IMDB reviews dataset on Kaggle
SentiWordNet: maps WordNet senses to a polarity model (SentiWordNet site)
Twitter airline sentiment on Kaggle
First GOP Debate Twitter Sentiment
Amazon fine foods reviews

Tools

Many sentiment tools
NLTK sentiment analyzer
VADER (NLTK, standalone)
TextBlob
Comparative opinion mining: a review paper, with some information on unsupervised methods as well
Another reference list, which also covers some unsupervised methods
SentiWordNet 3.0 paper
presentation
Hebrew Psychological Lexicons by Natalie Shapira
This is the official code accompanying a paper on the Hebrew Psychological Lexicons that was presented at CLPsych 2021.

Reference papers:

Twitter as a corpus for SA and opinion mining

Ground Truth

For sentiment in VADER, rater quality was controlled as follows:
1. “Screening for English language reading comprehension – each rater had to individually score an 80% or higher on a standardized college-level reading comprehension test.
2. Complete an online sentiment rating training and orientation session, and score 90% or higher for matching the known (prevalidated) mean sentiment rating of lexical items which included individual words, emoticons, acronyms, sentences, tweets, and text snippets (e.g., sentence segments, or phrases).
3. Every batch of 25 features contained five “golden items” with a known (pre-validated) sentiment rating distribution. If a worker was more than one standard deviation away from the mean of this known distribution on three or more of the five golden items, we discarded all 25 ratings in the batch from this worker.
4. Bonus to incentivize and reward the highest quality work. Asked workers to select the valence score that they thought “most other people” would choose for the given lexical feature (early/iterative pilot testing revealed that wording the instructions in this manner garnered a much tighter standard deviation without significantly affecting the mean sentiment rating, allowing us to achieve higher quality (generalized) results while being more economical).
5. Compensated AMT workers $0.25 for each batch of 25 items they rated, with an additional $0.25 incentive bonus for all workers who successfully matched the group mean (within 1.5 standard deviations) on at least 20 of 25 responses in each batch. Using these four quality control methods, we achieved remarkable value in the data obtained from our AMT workers – we paid incentive bonuses for high quality to at least 90% of raters for most batches.”
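
A minimal Python sketch of the golden-item check (step 3) and the bonus rule (step 5) described above. The function names and the data layout (dictionaries mapping an item id to a known mean and standard deviation) are assumptions made for illustration, not the VADER authors' tooling.

```python
def batch_passes_golden_check(worker_ratings, golden_stats,
                              tolerance_sd=1.0, discard_at=3):
    """Return False if the batch should be discarded.

    worker_ratings: dict item_id -> rating given by the worker (hypothetical layout)
    golden_stats:   dict item_id -> (known_mean, known_sd) for the golden items
    A golden item counts as a miss when the rating is more than
    `tolerance_sd` standard deviations away from the known mean.
    """
    misses = sum(
        1
        for item, (mu, sd) in golden_stats.items()
        if abs(worker_ratings[item] - mu) > tolerance_sd * sd
    )
    return misses < discard_at


def earns_bonus(worker_ratings, group_stats, tolerance_sd=1.5, min_matches=20):
    """Return True if the worker matched the group mean on enough items.

    group_stats: dict item_id -> (group_mean, group_sd) for the items in the batch
    """
    matches = sum(
        1
        for item, (mu, sd) in group_stats.items()
        if abs(worker_ratings[item] - mu) <= tolerance_sd * sd
    )
    return matches >= min_matches


# Illustrative (made-up) golden items and one worker's ratings.
golden = {"g1": (2.0, 1.0), "g2": (-3.0, 0.5), "g3": (0.0, 1.0),
          "g4": (1.0, 0.5), "g5": (-1.0, 1.0)}
ratings = {"g1": 2, "g2": -3, "g3": 0, "g4": 3, "g5": 2}  # two misses: g4, g5
keep_batch = batch_passes_golden_check(ratings, golden)
```

With the default thresholds, a batch is kept when the worker misses at most two of the five golden items, and the bonus is paid when at least 20 of the 25 ratings fall within 1.5 standard deviations of the group mean.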

Multilingual Twitter Sentiment Classification: The Role of Human Annotators

1.6 million tweets labelled
13 languages
Evaluated 6 pretrained classification models
10-fold cross-validation
SVM / NB
Annotator agreements.
- about 15% were intentionally duplicated to be annotated twice,
- by the same annotator
- by two different annotators
Self-agreement from multiple annotations of the same annotator
Inter-agreement from multiple annotations by different annotators
The confidence intervals for the agreements are estimated by bootstrapping [12].
It turns out that self-agreement is a good measure for identifying low-quality annotators,
while inter-annotator agreement provides a good estimate of the objective difficulty of the task, unless it is too low.
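
A rough Python sketch of how self-agreement and inter-annotator agreement could be estimated with bootstrapped confidence intervals. Plain percent agreement is used here only as a simple stand-in measure, and the function names, the pair layout and the percentile-bootstrap details are assumptions, not the paper's exact procedure.

```python
import random


def percent_agreement(pairs):
    """Fraction of (label_1, label_2) pairs in which both labels are equal."""
    return sum(a == b for a, b in pairs) / len(pairs)


def bootstrap_ci(pairs, measure=percent_agreement,
                 n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for an agreement measure.

    pairs: list of (label_1, label_2) for tweets that were annotated twice.
    Resamples the pairs with replacement and returns (lower, upper).
    """
    rng = random.Random(seed)
    n = len(pairs)
    stats = sorted(
        measure([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = stats[int((1 - level) / 2 * n_resamples)]
    upper = stats[int((1 + level) / 2 * n_resamples) - 1]
    return lower, upper


# Self-agreement: both labels come from the same annotator (made-up data).
self_pairs = [("pos", "pos"), ("neg", "neg"), ("neu", "pos"), ("neg", "neg")]
# Inter-agreement: the two labels come from different annotators (made-up data).
inter_pairs = [("pos", "neu"), ("neg", "neg"), ("neu", "pos"), ("neg", "neu")]

self_score, self_ci = percent_agreement(self_pairs), bootstrap_ci(self_pairs)
inter_score, inter_ci = percent_agreement(inter_pairs), bootstrap_ci(inter_pairs)
```

Any other agreement measure that takes a list of label pairs can be passed as `measure`.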

Alpha was developed to measure the agreement between human annotators, but can also be used to measure the agreement between classification models and a gold standard. It generalizes several specialized agreement measures, takes ordering of classes into account, and accounts for the agreement by chance. Alpha is defined as follows:
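
The definition itself is missing from these notes. Assuming Alpha here is Krippendorff's Alpha, which matches the description above, its standard form is Alpha = 1 - D_o / D_e, where D_o is the observed disagreement and D_e is the disagreement expected by chance. Below is a minimal Python sketch for the case of exactly two annotations per item. The function names are illustrative, and the squared-difference distance is only one common way to take class ordering into account, not necessarily the one used in the paper.

```python
from collections import Counter
from itertools import product


def nominal(c, k):
    """Distance for unordered classes: any mismatch costs 1."""
    return 0.0 if c == k else 1.0


def squared_difference(c, k):
    """Distance for ordered classes coded as numbers, e.g. -1, 0, +1."""
    return float((c - k) ** 2)


def krippendorff_alpha(pairs, distance=nominal):
    """Alpha = 1 - D_o / D_e for items that each carry exactly two labels."""
    # Coincidence counts: every pair is entered in both orders.
    coincidences = Counter()
    for a, b in pairs:
        coincidences[(a, b)] += 1
        coincidences[(b, a)] += 1

    # Marginal totals per class and overall number of pairable values.
    totals = Counter()
    for (c, _k), count in coincidences.items():
        totals[c] += count
    n = sum(totals.values())

    observed = sum(
        count * distance(c, k) for (c, k), count in coincidences.items()
    ) / n
    expected = sum(
        totals[c] * totals[k] * distance(c, k)
        for c, k in product(totals, repeat=2)
    ) / (n * (n - 1))

    if expected == 0:
        raise ValueError("Alpha is undefined when only one class occurs")
    return 1.0 - observed / expected


# Made-up example with classes coded as -1 (negative), 0 (neutral), +1 (positive).
example_pairs = [(-1, -1), (0, 0), (1, 1), (0, 1), (-1, 1)]
alpha_nominal = krippendorff_alpha(example_pairs)
alpha_ordered = krippendorff_alpha(example_pairs, distance=squared_difference)
```

Alpha equals 1 for perfect agreement, is close to 0 when agreement is at chance level, and becomes negative for systematic disagreement.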

The method is continued in a second paper.
