Is That a Duplicate Quora Question?

https://www.linkedin.com/pulse/duplicate-quora-question-abhishek-thakur

 

TL;DR: I achieved near state-of-the-art accuracy using a very deep neural network. The code is available here: https://github.com/abhishekkrthakur/is_that_a_duplicate_quora_question

Quora released its first-ever public dataset on 24th January 2017. The dataset consists of question pairs that are either duplicates or not; duplicate questions are questions that ask the same thing.

For example, the question pairs below are duplicates (from the Quora dataset):

  • How does Quora quickly mark questions as needing improvement?
  • Why does Quora mark my questions as needing improvement/clarification before I have time to give it details? Literally within seconds…

 

  • Why did Trump win the Presidency?
  • How did Donald Trump win the 2016 Presidential Election?

 

  • What practical applications might evolve from the discovery of the Higgs Boson?
  • What are some practical benefits of discovery of the Higgs Boson?

Some examples of non-duplicate question pairs are as follows:

  • Who should I address my cover letter to if I'm applying for a big company like Mozilla?
  • Which car is better from safety view?"swift or grand i10".My first priority is safety?

 

  • Mr. Robot (TV series): Is Mr. Robot a good representation of real-life hacking and hacking culture? Is the depiction of hacker societies realistic?
  • What mistakes are made when depicting hacking in "Mr. Robot" compared to real-life cybersecurity breaches or just a regular use of technologies?

 

  • How can I start an online shopping (e-commerce) website?
  • Which web technology is best suitable for building a big E-Commerce website?

In this article, we discuss methods that can be used to detect duplicate questions in the Quora dataset. Of course, these methods can also be used on other, similar datasets.

The methods discussed in this article range from simple TF-IDF and Singular Value Decomposition, through fuzzy features and word2vec/GloVe features, to LSTMs and 1D CNNs. We also compare the performance of these approaches on the Quora dataset.

Let’s take a look at the data first.

Data

The data consisted of 404,351 question pairs: 255,045 negative samples (non-duplicates) and 149,306 positive samples (duplicates), i.e. roughly 37% positive samples.

First few rows of the data:

[image: first few rows of the question-pair data]

Label distribution:

[image: distribution of duplicate vs. non-duplicate labels]

Average number of characters in question1: 59.57

Minimum number of characters in question1: 1

Maximum number of characters in question1: 623

Average number of characters in question2: 60.14

Minimum number of characters in question2: 1

Maximum number of characters in question2: 1169

Since Quora Engineering chose accuracy to evaluate their models (https://engineering.quora.com/Semantic-Question-Matching-with-Deep-Learning), I did the same.

Basic Feature Engineering

I started with some very basic features. These features included:

  1. Length of question1
  2. Length of question2
  3. Difference in the two lengths
  4. Character length of question1 without spaces
  5. Character length of question2 without spaces
  6. Number of words in question1
  7. Number of words in question2
  8. Number of common words in question1 and question2

These features can be created easily using pandas' apply with lambda functions, as in the sketch below.

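A minimal sketch (the file name and the question1/question2/is_duplicate columns match the released TSV; the exact code is in the repo linked above):

import pandas as pd

df = pd.read_csv('quora_duplicate_questions.tsv', sep='\t')

# 1-3: character lengths and their difference
df['len_q1'] = df['question1'].apply(lambda x: len(str(x)))
df['len_q2'] = df['question2'].apply(lambda x: len(str(x)))
df['diff_len'] = df['len_q1'] - df['len_q2']

# 4-5: character lengths without spaces
df['len_char_q1'] = df['question1'].apply(lambda x: len(str(x).replace(' ', '')))
df['len_char_q2'] = df['question2'].apply(lambda x: len(str(x).replace(' ', '')))

# 6-7: word counts
df['len_word_q1'] = df['question1'].apply(lambda x: len(str(x).split()))
df['len_word_q2'] = df['question2'].apply(lambda x: len(str(x).split()))

# 8: number of words occurring in both questions
df['common_words'] = df.apply(
    lambda x: len(set(str(x['question1']).lower().split()) &
                  set(str(x['question2']).lower().split())), axis=1)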

Let’s call this basic set of features “fs-1”.

Next I created some fuzzy features using the fuzzywuzzy package (https://github.com/seatgeek/fuzzywuzzy). Fuzzywuzzy uses Levenshtein Distance to calculate differences between sequences.

The fuzzy features I used were:

  1. QRatio
  2. WRatio
  3. Partial ratio
  4. Partial token set ratio
  5. Partial token sort ratio
  6. Token set ratio
  7. Token sort ratio

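A sketch of how these can be computed with fuzzywuzzy (the column names are illustrative; each ratio returns an integer similarity score between 0 and 100):

from fuzzywuzzy import fuzz

ratios = {
    'fuzz_qratio': fuzz.QRatio,
    'fuzz_wratio': fuzz.WRatio,
    'fuzz_partial_ratio': fuzz.partial_ratio,
    'fuzz_partial_token_set_ratio': fuzz.partial_token_set_ratio,
    'fuzz_partial_token_sort_ratio': fuzz.partial_token_sort_ratio,
    'fuzz_token_set_ratio': fuzz.token_set_ratio,
    'fuzz_token_sort_ratio': fuzz.token_sort_ratio,
}
for name, fn in ratios.items():
    # apply each fuzzy ratio to the question pair, row by row
    df[name] = df.apply(lambda x, fn=fn: fn(str(x['question1']),
                                            str(x['question2'])), axis=1)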

This set of features will be called “fs-2”.

TF-IDF and SVD Features

I calculated TF-IDF and SVD features in a few different ways.

TF-IDF is an acronym for Term Frequency - Inverse Document Frequency. It's one of the most basic and widely used methods in information retrieval. You can read more about TF-IDF here: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

SVD stands for Singular Value Decomposition (https://en.wikipedia.org/wiki/Singular_value_decomposition). I used a variation of SVD called Truncated SVD which is implemented in scikit-learn.

The following pipelines were implemented and evaluated:

[image: diagram of the TF-IDF / SVD feature pipelines that were evaluated]

I’ll denote these features as “fs3-1”, “fs3-2”, “fs3-3”, “fs3-4” and “fs3-5”. Pretty easy, huh? ;)
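To give a flavor of these pipelines, here's an illustrative TF-IDF-plus-SVD sketch in scikit-learn (the vectorizer settings and the number of SVD components are assumptions, not the exact five variants from the diagram):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# fit a single vectorizer on all question text
corpus = list(df['question1'].astype(str)) + list(df['question2'].astype(str))
tfv = TfidfVectorizer(min_df=3, ngram_range=(1, 2), stop_words='english')
tfv.fit(corpus)

# one pipeline: a separate sparse TF-IDF matrix per question...
q1_tfidf = tfv.transform(df['question1'].astype(str))
q2_tfidf = tfv.transform(df['question2'].astype(str))

# ...then compress each matrix with Truncated SVD
svd = TruncatedSVD(n_components=120)
q1_svd = svd.fit_transform(q1_tfidf)
q2_svd = svd.transform(q2_tfidf)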

Let’s move on to some more complicated features.

Word2Vec Features

Word2vec learns a multi-dimensional vector for every word in the English vocabulary (or, rather, in the corpus it has been trained on). Word2vec embeddings are very popular in natural language processing and often provide great insights. Wikipedia gives a good explanation of what these embeddings are and how they are generated (https://en.wikipedia.org/wiki/Word2vec).

Word2vec can be used to represent words, and words with similar meanings end up close to each other in the word2vec space. An example is shown in the following figure:

[image: words with similar meanings clustered close together in word2vec space]

We can also represent sentences using word2vec.

For the word2vec model, I used gensim (https://radimrehurek.com/gensim/) and a pre-trained word2vec model trained on the Google News corpus.

For sentences, I generated vectors using the following function:

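It was along these lines (a sketch, assuming the 300-dimensional GoogleNews vectors loaded through gensim, with NLTK for tokenization and stopwords):

import numpy as np
import gensim
from nltk.corpus import stopwords
from nltk import word_tokenize

model = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True)
stop_words = set(stopwords.words('english'))

def sent2vec(s):
    """Sum the word2vec vectors of the alphabetic, non-stopword tokens,
    then L2-normalize the result."""
    words = [w for w in word_tokenize(str(s).lower())
             if w not in stop_words and w.isalpha() and w in model]
    if not words:
        return np.zeros(300)
    v = np.sum([model[w] for w in words], axis=0)
    return v / np.sqrt((v ** 2).sum())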

To calculate similarity between the questions, another feature I created was word mover's distance. Word mover's distance uses word2vec embeddings and works on a principle similar to earth mover's distance to give a distance between two text documents. In simple terms, word mover's distance measures the minimum cumulative distance the embedded words of one document need to travel to reach the embedded words of the other document (From Word Embeddings to Document Distances: http://jmlr.org/proceedings/papers/v37/kusnerb15.pdf).

Final word2vec features included:

  1. Word mover's distance
  2. Normalized word mover's distance
  3. Cosine distance between vectors of question1 and question2
  4. Manhattan distance between vectors of question1 and question2
  5. Jaccard similarity between vectors of question1 and question2
  6. Canberra distance between vectors of question1 and question2
  7. Euclidean distance between vectors of question1 and question2
  8. Minkowski distance between vectors of question1 and question2
  9. Braycurtis distance between vectors of question1 and question2
  10. Skew of vector for question1
  11. Skew of vector for question2
  12. Kurtosis of vector for question1
  13. Kurtosis of vector for question2

All the Word2Vec features are denoted by fs4.
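Most of fs4 can be assembled with scipy and gensim. A hedged sketch (sent2vec and model come from the snippets above; wmdistance needs the pyemd package, and the remaining distances from the list follow the same pattern):

import numpy as np
from scipy.spatial.distance import (cosine, cityblock, canberra,
                                    euclidean, minkowski, braycurtis)
from scipy.stats import skew, kurtosis

q1_vecs = np.array([sent2vec(q) for q in df['question1']])
q2_vecs = np.array([sent2vec(q) for q in df['question2']])

# pairwise distances between the two sentence vectors
# (empty questions give zero vectors and NaN distances; impute before modeling)
df['cosine_distance'] = [cosine(a, b) for a, b in zip(q1_vecs, q2_vecs)]
df['cityblock_distance'] = [cityblock(a, b) for a, b in zip(q1_vecs, q2_vecs)]  # Manhattan
df['canberra_distance'] = [canberra(a, b) for a, b in zip(q1_vecs, q2_vecs)]
df['euclidean_distance'] = [euclidean(a, b) for a, b in zip(q1_vecs, q2_vecs)]
df['minkowski_distance'] = [minkowski(a, b, 3) for a, b in zip(q1_vecs, q2_vecs)]
df['braycurtis_distance'] = [braycurtis(a, b) for a, b in zip(q1_vecs, q2_vecs)]

# per-vector distribution statistics
df['skew_q1'] = [skew(v) for v in q1_vecs]
df['skew_q2'] = [skew(v) for v in q2_vecs]
df['kurtosis_q1'] = [kurtosis(v) for v in q1_vecs]
df['kurtosis_q2'] = [kurtosis(v) for v in q2_vecs]

# word mover's distance straight from the gensim model
df['wmd'] = [model.wmdistance(str(a).split(), str(b).split())
             for a, b in zip(df['question1'], df['question2'])]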

 

A separate set of w2v features consisted of the sentence vectors themselves:

  1. Word2vec vector for question1
  2. Word2vec vector for question2

These will be represented by fs5.

A snapshot of the data after adding all the features (except the TF-IDF and SVD features):

[image: snapshot of the dataframe with all the generated features]

Now we have everything we need to start building machine learning models on top of these features.

Machine Learning Models

I evaluated two of my favorite models: logistic regression and xgboost. For logistic regression, the data was first normalized using z-score scaling.
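The evaluation loop looked roughly like this (a sketch; the hyperparameter values shown are placeholders, not tuned ones):

import xgboost as xgb
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# X holds one of the feature sets (here fs-1), y the duplicate labels
X = df[['len_q1', 'len_q2', 'diff_len', 'len_char_q1', 'len_char_q2',
        'len_word_q1', 'len_word_q2', 'common_words']].values
y = df['is_duplicate'].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# logistic regression on z-score-scaled features
scaler = StandardScaler().fit(X_train)
lr = LogisticRegression(C=0.1)
lr.fit(scaler.transform(X_train), y_train)
print('LR accuracy :', lr.score(scaler.transform(X_test), y_test))

# xgboost works fine on the unscaled features
clf = xgb.XGBClassifier(max_depth=7, n_estimators=400)
clf.fit(X_train, y_train)
print('XGB accuracy:', clf.score(X_test, y_test))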

The following table gives the performance of logistic regression and xgboost on different sets of features that were created:

[table: accuracy of logistic regression and xgboost on each feature set]

XGBoost on the basic features, fuzzy features, w2v vectors and w2v features already beats a few deep learning techniques, such as the siamese network discussed here: http://www.erogol.com/duplicate-question-detection-deep-learning/

To be honest, I didn't spend much time tuning the hyperparameters of these models; I believe the scores can be improved further with proper hyperparameter tuning. I wanted to dive into deep neural networks as soon as possible, and that's what I did next!

Deep Learning Models

I tried many different deep learning models, from a simple network with only dense layers to LSTMs, GRUs and 1D CNNs. These models gave an accuracy of around 0.80.

Finally, I was able to get an accuracy of 0.85 with a deep neural network that comprised two translation layers (one for each question) initialized with GloVe embeddings, two LSTMs without GloVe embeddings, and two 1D convolutional branches that were also initialized with GloVe embeddings. This was followed by a series of dense layers with dropout and batch normalization. The final network summary is provided below:

Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
embedding_7 (Embedding)          (None, 40, 300)       28683300                                     
____________________________________________________________________________________________________
timedistributed_3 (TimeDistribut (None, 40, 300)       90300                                        
____________________________________________________________________________________________________
lambda_3 (Lambda)                (None, 300)           0                                            
____________________________________________________________________________________________________
embedding_8 (Embedding)          (None, 40, 300)       28683300                                     
____________________________________________________________________________________________________
timedistributed_4 (TimeDistribut (None, 40, 300)       90300                                        
____________________________________________________________________________________________________
lambda_4 (Lambda)                (None, 300)           0                                            
____________________________________________________________________________________________________
embedding_9 (Embedding)          (None, 40, 300)       28683300                                     
____________________________________________________________________________________________________
convolution1d_3 (Convolution1D)  (None, 36, 64)        96064                                        
____________________________________________________________________________________________________
dropout_10 (Dropout)             (None, 36, 64)        0                                            
____________________________________________________________________________________________________
convolution1d_4 (Convolution1D)  (None, 32, 64)        20544                                        
____________________________________________________________________________________________________
globalmaxpooling1d_3 (GlobalMaxP (None, 64)            0                                            
____________________________________________________________________________________________________
dropout_11 (Dropout)             (None, 64)            0                                            
____________________________________________________________________________________________________
dense_13 (Dense)                 (None, 300)           19500                                        
____________________________________________________________________________________________________
dropout_12 (Dropout)             (None, 300)           0                                            
____________________________________________________________________________________________________
batchnormalization_9 (BatchNorma (None, 300)           1200                                         
____________________________________________________________________________________________________
embedding_10 (Embedding)         (None, 40, 300)       28683300                                     
____________________________________________________________________________________________________
convolution1d_5 (Convolution1D)  (None, 36, 64)        96064                                        
____________________________________________________________________________________________________
dropout_13 (Dropout)             (None, 36, 64)        0                                            
____________________________________________________________________________________________________
convolution1d_6 (Convolution1D)  (None, 32, 64)        20544                                        
____________________________________________________________________________________________________
globalmaxpooling1d_4 (GlobalMaxP (None, 64)            0                                            
____________________________________________________________________________________________________
dropout_14 (Dropout)             (None, 64)            0                                            
____________________________________________________________________________________________________
dense_14 (Dense)                 (None, 300)           19500                                        
____________________________________________________________________________________________________
dropout_15 (Dropout)             (None, 300)           0                                            
____________________________________________________________________________________________________
batchnormalization_10 (BatchNorm (None, 300)           1200                                         
____________________________________________________________________________________________________
embedding_11 (Embedding)         (None, 40, 300)       28683300                                     
____________________________________________________________________________________________________
lstm_3 (LSTM)                    (None, 300)           721200                                       
____________________________________________________________________________________________________
embedding_12 (Embedding)         (None, 40, 300)       28683300                                     
____________________________________________________________________________________________________
lstm_4 (LSTM)                    (None, 300)           721200                                       
____________________________________________________________________________________________________
batchnormalization_11 (BatchNorm (None, 1800)          7200        merge_2[0][0]                    
____________________________________________________________________________________________________
dense_15 (Dense)                 (None, 300)           540300      batchnormalization_11[0][0]      
____________________________________________________________________________________________________
prelu_6 (PReLU)                  (None, 300)           300         dense_15[0][0]                   
____________________________________________________________________________________________________
dropout_16 (Dropout)             (None, 300)           0           prelu_6[0][0]                    
____________________________________________________________________________________________________
batchnormalization_12 (BatchNorm (None, 300)           1200        dropout_16[0][0]                 
____________________________________________________________________________________________________
dense_16 (Dense)                 (None, 300)           90300       batchnormalization_12[0][0]      
____________________________________________________________________________________________________
prelu_7 (PReLU)                  (None, 300)           300         dense_16[0][0]                   
____________________________________________________________________________________________________
dropout_17 (Dropout)             (None, 300)           0           prelu_7[0][0]                    
____________________________________________________________________________________________________
batchnormalization_13 (BatchNorm (None, 300)           1200        dropout_17[0][0]                 
____________________________________________________________________________________________________
dense_17 (Dense)                 (None, 300)           90300       batchnormalization_13[0][0]      
____________________________________________________________________________________________________
prelu_8 (PReLU)                  (None, 300)           300         dense_17[0][0]                   
____________________________________________________________________________________________________
dropout_18 (Dropout)             (None, 300)           0           prelu_8[0][0]                    
____________________________________________________________________________________________________
batchnormalization_14 (BatchNorm (None, 300)           1200        dropout_18[0][0]                 
____________________________________________________________________________________________________
dense_18 (Dense)                 (None, 300)           90300       batchnormalization_14[0][0]      
____________________________________________________________________________________________________
prelu_9 (PReLU)                  (None, 300)           300         dense_18[0][0]                   
____________________________________________________________________________________________________
dropout_19 (Dropout)             (None, 300)           0           prelu_9[0][0]                    
____________________________________________________________________________________________________
batchnormalization_15 (BatchNorm (None, 300)           1200        dropout_19[0][0]                 
____________________________________________________________________________________________________
dense_19 (Dense)                 (None, 300)           90300       batchnormalization_15[0][0]      
____________________________________________________________________________________________________
prelu_10 (PReLU)                 (None, 300)           300         dense_19[0][0]                   
____________________________________________________________________________________________________
dropout_20 (Dropout)             (None, 300)           0           prelu_10[0][0]                   
____________________________________________________________________________________________________
batchnormalization_16 (BatchNorm (None, 300)           1200        dropout_20[0][0]                 
____________________________________________________________________________________________________
dense_20 (Dense)                 (None, 1)             301         batchnormalization_16[0][0]      
____________________________________________________________________________________________________
activation_2 (Activation)        (None, 1)             0           dense_20[0][0]                   
====================================================================================================
Total params: 174,913,917
Trainable params: 60,172,917
Non-trainable params: 114,741,000
____________________________________________________________________________________________________

And the network architecture:

[image: network architecture diagram]
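For readers who prefer code to summaries, here is a compressed Keras functional-API sketch that reproduces the shapes in the summary above (Keras 2 syntax; the dropout rates, kernel sizes and merge order are assumptions, so see the repo for the actual training script):

import numpy as np
from keras.models import Model
from keras.layers import (Input, Embedding, TimeDistributed, Dense, Lambda,
                          Conv1D, GlobalMaxPooling1D, LSTM, Dropout,
                          BatchNormalization, PReLU, concatenate)
import keras.backend as K

MAX_LEN, DIM, VOCAB = 40, 300, 95611    # implied by the shapes/params above
glove_weights = np.zeros((VOCAB, DIM))  # stand-in; load the real GloVe matrix here

def glove_embed(x):
    # frozen embedding layer initialized with GloVe vectors
    return Embedding(VOCAB, DIM, weights=[glove_weights],
                     input_length=MAX_LEN, trainable=False)(x)

def translation_branch(x):
    # "translation" layer: time-distributed dense over GloVe vectors, summed over time
    h = TimeDistributed(Dense(DIM, activation='relu'))(glove_embed(x))
    return Lambda(lambda t: K.sum(t, axis=1), output_shape=(DIM,))(h)

def cnn_branch(x):
    h = Conv1D(64, 5, activation='relu')(glove_embed(x))
    h = Dropout(0.2)(h)
    h = Conv1D(64, 5, activation='relu')(h)
    h = GlobalMaxPooling1D()(h)
    h = Dropout(0.2)(h)
    h = Dense(300)(h)
    h = Dropout(0.2)(h)
    return BatchNormalization()(h)

def lstm_branch(x):
    # the LSTM branches learn their own embeddings (no GloVe initialization)
    return LSTM(DIM)(Embedding(VOCAB, DIM, input_length=MAX_LEN)(x))

q1 = Input(shape=(MAX_LEN,))
q2 = Input(shape=(MAX_LEN,))
merged = concatenate([translation_branch(q1), translation_branch(q2),
                      cnn_branch(q1), cnn_branch(q2),
                      lstm_branch(q1), lstm_branch(q2)])  # -> (None, 1800)

h = BatchNormalization()(merged)
for _ in range(5):  # five dense blocks: Dense -> PReLU -> Dropout -> BatchNorm
    h = Dense(300)(h)
    h = PReLU()(h)
    h = Dropout(0.2)(h)
    h = BatchNormalization()(h)
out = Dense(1, activation='sigmoid')(h)

model = Model(inputs=[q1, q2], outputs=out)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])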

The network was trained on an NVIDIA TitanX; each epoch took approximately 300 seconds, and full training took 10-15 hours. This network achieved an accuracy of 0.848 (~0.85). I tried over 10 different architectures before arriving at this one :)

I'm still training a few configurations and will update this article as soon as the results improve. Code is available on my git repo: https://github.com/abhishekkrthakur/is_that_a_duplicate_quora_question

Major python libraries that I used:

  • scikit-learn
  • keras
  • tensorflow
  • pandas

I would like to thank Alexey Grigorev (https://github.com/alexeygrigorev/) for providing great pointers on word2vec features, and Bradley Pallen for his similar work, which is available on his GitHub (https://github.com/bradleypallen/).
