Vlad's Blog

NLP Feature Engineering

August 28, 2019

Hands-on feature engineering

Susan Li from Kognitiv

Feature engineering for NLP

TF-IDF

  • count how many times a certain word occurs in a document, then multiply that by (total docs / docs that contain the word)
  • you compute this for each token you are searching over
  • allows you to use the resulting feature vector as an input field (see the sketch below)
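A minimal sketch of that scoring on a made-up corpus. Note that the talk's formula is the raw ratio; most libraries (e.g. scikit-learn's TfidfVectorizer) log-dampen and smooth the IDF term, which the sketch does too:

```python
import math

# Toy corpus (made-up data for illustration).
docs = [
    "the hotel has a late check in",
    "the hotel has a pool",
    "early check in on request",
]

def tf_idf(word, doc, docs):
    # TF: how many times the word occurs in this document.
    tf = doc.split().count(word)
    # IDF: total docs divided by docs that contain the word
    # (log-dampened, as most libraries do).
    containing = sum(1 for d in docs if word in d.split())
    idf = math.log(len(docs) / containing)
    return tf * idf

# Compute the score for each token you are searching over.
for word in ["hotel", "check", "pool"]:
    print(word, [round(tf_idf(word, d, docs), 3) for d in docs])
```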
Word2vec

  • not deep learning
  • CBOW predicts the current word given the neighbouring words
  • Skip-gram does the opposite: it predicts the neighbouring words given the current word (see the sketch below)
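A quick sketch of both training modes using gensim (assuming the gensim 4.x API; the corpus is a placeholder):

```python
from gensim.models import Word2Vec

# Placeholder corpus: a list of tokenized sentences.
sentences = [
    ["hotel", "check", "in", "was", "late"],
    ["the", "pool", "was", "clean"],
    ["late", "check", "out", "on", "request"],
]

# sg=0 trains CBOW (predict the current word from its neighbours);
# sg=1 trains skip-gram (predict the neighbours from the current word).
cbow = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

print(cbow.wv["hotel"][:5])           # the learned feature vector
print(skipgram.wv.most_similar("check"))
```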
FastText

  • can generate better embeddings than word2vec
  • can construct a vector even if the word was not in the training data (see the sketch below)
  • available in the gensim library
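A sketch of the out-of-vocabulary behaviour via gensim's FastText (again assuming gensim 4.x; the toy corpus is made up):

```python
from gensim.models import FastText

sentences = [
    ["hotel", "check", "in", "was", "late"],
    ["the", "pool", "was", "clean"],
]

# FastText builds vectors from character n-grams, so it can
# construct an embedding even for a word never seen in training.
model = FastText(sentences, vector_size=50, window=3, min_count=1)

print(model.wv["hotels"][:5])  # "hotels" never appeared, but its n-grams did
```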
Topic Modelling

  • this is an unsupervised learning algorithm
  • use an intertopic distance map to see if your topics overlap (see the sketch below)
  • not all topics can be defined well
  • useful for data exploration
  • inconsistent data can fool the system: if two hotels have the same poor description of check-in time and whatnot, then this technique would assume that they are similar.
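A minimal LDA sketch with gensim on toy documents; the intertopic distance map mentioned above would come from a visualizer such as pyLDAvis:

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy tokenized documents.
texts = [
    ["hotel", "check", "in", "late", "staff"],
    ["pool", "clean", "towels", "sun"],
    ["staff", "friendly", "check", "in", "fast"],
    ["pool", "bar", "towels", "drinks"],
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Unsupervised: we only pick the number of topics, the model finds them.
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

for topic_id, words in lda.print_topics():
    print(topic_id, words)

# pyLDAvis.gensim_models.prepare(lda, corpus, dictionary) would render
# the intertopic distance map to check whether topics overlap.
```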

The future might look like automatic feature engineering. How does one explain end-to-end feature engineering, and how do you debug it when it breaks? Good feature engineering is the backbone of machine learning. See the O'Reilly book Feature Engineering for Machine Learning.

If you have several labels that are so similar that even a human would have trouble correctly labelling them, it might be a good idea to consolidate the labels. To research: the Stack Overflow TensorFlow workshop examples; the problem is that all questions in this dataset have only one tag.

Libraries mentioned: fuzzywuzzy, gensim
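A rough sketch of consolidating near-duplicate labels with fuzzywuzzy; the labels and the similarity threshold of 85 are made up:

```python
from fuzzywuzzy import fuzz

# Hypothetical near-duplicate tags a human would struggle to keep apart.
labels = ["tensorflow", "tensor-flow", "tensorflow2", "pytorch"]

canonical = []
for label in labels:
    # Map each label onto an existing canonical one if it is close enough.
    match = next((c for c in canonical if fuzz.ratio(label, c) >= 85), None)
    if match is None:
        canonical.append(label)
    else:
        print(f"consolidating {label!r} into {match!r}")

print(canonical)
```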

Model explainability Python projects:

LIME, SHAP
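A rough sketch of what using one of them looks like, SHAP in this case (assumes the shap package and a tree-based model; the dataset is just a stand-in):

```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=100).fit(X, y)

# TreeExplainer computes per-feature contributions for each prediction.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:10])

# shap.summary_plot(shap_values, X[:10]) would visualize them.
# Depending on the shap version, this is a list of per-class arrays
# or a single array.
print(type(shap_values))
```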

Understanding predictions: Machine learning interpretability

Pratap Ramamurthy from H2O

ML pipeline: data integration & quality → feature engineering → model training → machine learning interpretability

white-box and black-box models

Why explain?

  • multiplicity of good models
  • fairness and social aspects
  • trust of model producers and consumers
  • security and hacking
  • regulated/controlled environments

Surrogate Model

You can create a surrogate model that is explainable: use your main model's predictions to label a new dataset, then train the surrogate on it so that the surrogate converges to your main model's behaviour (see the sketch below).
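A sketch of that global-surrogate idea, with a random forest standing in for the complex model and a shallow decision tree as the surrogate (all made-up data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=2000, n_features=5, random_state=0)

# The complex model we want to explain.
black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Use the black box to label the data (creating a new dataset), then
# train the surrogate on those labels so it converges to the black
# box's behaviour rather than to the ground truth.
surrogate = DecisionTreeClassifier(max_depth=3)
surrogate.fit(X, black_box.predict(X))

print("fidelity:", surrogate.score(X, black_box.predict(X)))
print(export_text(surrogate))
```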

If you need faithful interpretability for a complex model, you can create an interpretable surrogate model around a specific point; it will be explainable around that point only.

E.g. if you have a model making predictions from profile information, you can hold everything static and change only age in order to understand what the model is doing (see the sketch below).
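A sketch of that probe; the model, the profile, and the feature order here are all hypothetical:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical model trained on profile features: [age, income, tenure].
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * [15, 20000, 5] + [40, 60000, 5]
y = 0.5 * X[:, 0] + 0.0001 * X[:, 1] + rng.normal(size=500)
model = GradientBoostingRegressor().fit(X, y)

# Hold every other feature static at a reference profile, vary only age.
profile = np.array([40.0, 60000.0, 5.0])
for age in range(20, 71, 10):
    point = profile.copy()
    point[0] = age
    print(age, model.predict(point.reshape(1, -1))[0])
```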

Feature ranking based on permutation

Take one feature, shuffle its values, and predict again; the drop in accuracy of your original model indicates how important that feature is (see the sketch below).
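A sketch of that shuffle-and-rescore loop on toy data (scikit-learn also ships this as sklearn.inspection.permutation_importance):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=4, n_informative=2,
                           random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)
baseline = model.score(X, y)

rng = np.random.default_rng(0)
for i in range(X.shape[1]):
    X_shuffled = X.copy()
    # Shuffle one column to break that feature's relationship to y.
    X_shuffled[:, i] = rng.permutation(X_shuffled[:, i])
    drop = baseline - model.score(X_shuffled, y)
    print(f"feature {i}: accuracy drop {drop:.3f}")
```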


Written by Volodymyr Khrystynych, who lives and works in Toronto exploring useful things.