TF-IDF

TF-IDF - how important a word is to a document in a corpus

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

Frequency of the word in the document / total words in the document (normalized because documents have different sizes).
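For example, a term that appears 3 times in a 100-word document has TF = 3/100 = 0.03.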

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).

IDF measures how rare, and therefore how informative, a term is across the corpus.
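For example, in a corpus of 10,000 documents, a term that appears in 100 of them has IDF = ln(10,000/100) ≈ 4.6, while a term that appears in every document has IDF = ln(1) = 0.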

TF-IDF(t) = TF(t) * IDF(t) - highest for terms that are frequent in a document but rare in the corpus.
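Continuing the running example, TF-IDF = 0.03 * 4.6 ≈ 0.14. A minimal from-scratch sketch of the three formulas (the toy corpus and whitespace tokenizer are made up for illustration; sklearn's TfidfVectorizer, used below, computes a smoothed variant of the same idea):

```python
import math

docs = ["the cat sat on the mat",
        "the dog ate my homework",
        "the cat ate the fish"]
tokenized = [d.split() for d in docs]

def tf(term, doc_tokens):
    # Term count normalized by document length.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # log_e(total documents / documents containing the term).
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

print(tf_idf("cat", tokenized[0], tokenized))   # in 2 of 3 docs -> lower IDF
print(tf_idf("fish", tokenized[2], tokenized))  # in 1 of 3 docs -> higher IDF
```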

Data sets:

Sparse textual content

  1. mean(IDF(i) * w2v word vector(i)) - the IDF-weighted average of word2vec vectors, with or without removing the first principal component (PC1) from the averaged vectors (Amir Pupko); a sketch of the PC1 removal follows the code below.

```python
import numpy as np

def mean_weighted_embedding(model, words, idf=1.0):
    # Average the word vectors, each scaled by its IDF weight
    # (model is a gensim-style embedding: model[words] returns one row per word).
    if words:
        return np.mean(idf * model[words], axis=0)
    else:
        print('we have an empty list')
        return []

# Map each vocabulary term to its IDF from a fitted TfidfVectorizer
# (get_feature_names() was renamed get_feature_names_out() in sklearn >= 1.0).
idf_mapping = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))

# Per-token IDF lookup for every message.
logs_sequences_df['idf_vectors'] = logs_sequences_df.message.apply(
    lambda x: [idf_mapping[token] for token in splitter(x)])

# IDF-weighted mean embedding per message; the weights are reshaped to a
# column vector so they broadcast across the embedding dimensions.
logs_sequences_df['mean_weighted_idf_w2v'] = [
    mean_weighted_embedding(ft, splitter(logs_sequences_df['message'].iloc[i]),
                            np.array(logs_sequences_df['idf_vectors'].iloc[i]).reshape(-1, 1))
    for i in range(logs_sequences_df.shape[0])]
```
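The "with or without reducing PC1" variant is not shown in the snippet above; here is a minimal sketch of that post-processing step (the SIF-style trick of removing the first principal component of the sentence embeddings), assuming sklearn and the mean_weighted_idf_w2v column computed above; the helper name remove_first_pc is made up for the example:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def remove_first_pc(embeddings):
    # Hypothetical helper: fit the first principal component of the
    # embedding matrix and subtract each row's projection onto it.
    svd = TruncatedSVD(n_components=1)
    svd.fit(embeddings)
    pc1 = svd.components_            # shape (1, dim)
    return embeddings - embeddings @ pc1.T @ pc1

# Stack the per-message embeddings and remove PC1 from all of them.
X = np.vstack(logs_sequences_df['mean_weighted_idf_w2v'].values)
X_no_pc1 = remove_first_pc(X)
```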

  2. Enriching using an LSTM next-word language model (character- or word-wise)

  3. Using external Wiktionary/Wikipedia data for certain words and phrases

  4. Finding clusters of relevant data and checking whether the cluster contents can be used for enrichment (see the sketch below)
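For the clustering idea in item 4, a minimal sketch assuming the embeddings computed above and scikit-learn's KMeans (the cluster count of 10 and the column names are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Cluster the message embeddings, then inspect each cluster's messages to
# decide whether its content can be used for enrichment.
X = np.vstack(logs_sequences_df['mean_weighted_idf_w2v'].values)
logs_sequences_df['cluster'] = KMeans(n_clusters=10, random_state=0).fit_predict(X)

for cluster_id, group in logs_sequences_df.groupby('cluster'):
    print(cluster_id, group['message'].head(3).tolist())
```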
