class: logo-slide --- class: title-slide ### Recurrent Neural Networks ### Applications of Data Science - Class 18 ### Giora Simchoni #### `gsimchoni@gmail.com and add #dsapps in subject` ### Stat. and OR Department, TAU ### 2022-12-26 --- layout: true <div class="my-footer"> <span> <a href="https://dsapps-2023.github.io/Class_Slides/" target="_blank">Applications of Data Science </a> </span> </div> --- class: section-slide # Time keeps moving on #### (Janis Joplin) --- ### What would Simone do next? <img src="images/simone_biles_landing.png" style="width: 60%" /> --- ### What would be the price of GOOG tomorrow? ```python df = pd.read_csv("../data/fortune500_100_days.csv") google = df[df['Fortune500'] == 'GOOG'].values[0][1:] plot_series(google[:-1], 100, google[-1], y_label='') plt.show() ``` <img src="images/Google-1.png" width="100%" /> --- ### What is he going to say next? <img src="images/bibi.png" style="width: 100%" /> --- class: section-slide # Simple RNN --- ### The Setting Suppose we have a univariate time series `\((x_{i1}, x_{i2}, ..., x_{iT})\)` of `\(T\)` time steps, where `\(i = 1, ..., N\)` (e.g. the `\(N = 500\)` Fortune500 stocks' prices over the last `\(T = 100\)` days), and we want to predict `\(y_i\)`, which could be: - The next day's price (regression setting) - A positive or negative outcome (classification setting) - Simone Biles' next move (???) <img src="images/lr_nn_morelayers.png" style="width: 30%" /> .insight[ 💡 What are the disadvantages of a regular network in this setting? ] --- class: section-slide # Detour: Time Series Analysis --- ### Don't reinvent the wheel! Time Series Analysis is a big deal in Statistics. .pull-left[ <img src = "images/tsa_hyndman.png" style="width: 80%"> ] .pull-right[ <img src = "images/tsa_cryer.png" style="width: 78%"> ] --- class: section-slide # End of Detour --- ### A Single-Neuron RNN The most basic Single-Neuron RNN would: - take `\(x_t\)`, learn from it and output `\(\hat{y}_t\)` - then take both the next input `\(x_{t+1}\)` and the previous output `\(\hat{y}_t\)`, and learn from them - by learning we mean forward- and backward-propagating at each stage with some loss `\(L\)` (e.g. MSE) <img src = "images/single_neuron_RNN.png" style="width: 70%"> --- ### Unrolling a Neuron <img src = "images/unrolling_neuron.png" style="width: 70%"> .insight[ 💡 But how many parameters are we actually learning? What important NN principle is demonstrated here? ] --- ### If you build it, they will come. Remember we built a Logistic Regression NN? Guess what! At a high level, nothing changes!
```python def single_rnn(X, y, epochs, alpha): w = np.array([1, 1, 1]) ls = np.zeros(epochs) for i in range(epochs): l, w = optimize(X, y, alpha, w) ls[i] = l return ls, w def optimize(X, y, alpha, w): y_pred_arr, l = forward(X, y, w) grad = backward(X, y, y_pred_arr, w) w = gradient_descent(alpha, w, grad) return l, w def gradient_descent(alpha, w, grad): return w - alpha * grad ``` --- ### Forward Propagation `$$\hat{y}_1 = \tanh(w_0 + w_1 \cdot x_1 + w_2 \cdot y_{0}) \\ \vdots \\ \hat{y}_T = tanh(w_0 + w_1 \cdot x_T + w_2 \cdot \hat{y}_{T - 1}) \\ L = MSE = \frac{1}{N}\sum_{i = 1}^{N}(y_{i} - \hat{y}_{iT})^2$$` ```python def forward(X, y, w): N, T = X.shape y_pred_arr = np.zeros((N, T + 1)) y_pred = np.zeros(N) y_pred_arr[:, 0] = y_pred for t in range(T): y_pred = np.tanh(w[0] + X[:, t] * w[1] + y_pred * w[2]) y_pred_arr[:, t + 1] = y_pred l = np.mean((y - y_pred)**2) return y_pred_arr, l ``` --- ### Backward Propagation `$$\frac{\partial \hat{y}_{i1}}{\partial w_0} = \frac{\partial \tanh(o_{i1})}{\partial o_{i1}}\frac{\partial o_{i1}}{\partial w_0} = [1 - \tanh^2(o_{i1})] \cdot 1 = 1 - \hat{y}^2_{i1} \\ \frac{\partial \hat{y}_{i2}}{\partial w_0} = \frac{\partial \tanh(o_{i2})}{\partial o_{i2}}\frac{\partial o_{i2}}{\partial w_0} = (1 - \hat{y}^2_{i2})(1 + w_2\frac{\partial \hat{y}_{i1}}{\partial w_0})\\ \vdots \\ \frac{\partial L}{\partial w_0} = \sum_{i = 1}^N \frac{\partial L}{\partial \hat{y}_{iT}}\frac{\partial \hat{y}_{iT}}{\partial w_0} = \sum_{i = 1}^N -\frac{2}{N}(y_i - \hat{y}_{iT})\frac{\partial \tanh(o_{iT})}{\partial o_{iT}}\frac{\partial o_{iT}}{\partial w_0} = \\ = \sum_{i = 1}^N -\frac{2}{N}(y_i - \hat{y}_{iT})(1 - \hat{y}^2_{iT})[1 + w_2\frac{\partial \hat{y}_{iT-1}}{\partial w_0}]$$` And you are cordially invited to do the same for `\(\frac{\partial L}{\partial w_1}\)` and `\(\frac{\partial L}{\partial w_2}\)`. .font80percent[Hope you have as much fun as I did.] --- ```python def backward_t(X, y_pred_arr, w, grads, t, N): y_t = y_pred_arr[:, t] if t == 0: grads_w0 = np.ones((N, )) grads_w1 = X[:, t] grads_w2 = y_t else: dot_dyprev = w[2] dyprev_doprev = 1 - y_t ** 2 grads_w0 = np.ones(N) + dot_dyprev * dyprev_doprev * grads[:, 0] grads_w1 = X[:, t] + dot_dyprev * dyprev_doprev * grads[:, 1] grads_w2 = y_t + dot_dyprev * dyprev_doprev * grads[:, 2] return np.stack([grads_w0, grads_w1, grads_w2], axis=1) def backward(X, y, y_pred_arr, w): N, T = X.shape y_pred = y_pred_arr[:, -1] dl_dypred = -2 * (y - y_pred) / N dypred_doT = 1 - y_pred ** 2 grads = np.zeros((N, 3)) for t in range(T): grads = backward_t(X, y_pred_arr, w, grads, t, N) for j in range(3): grads[:, j] *= dl_dypred * dypred_doT final_grads = grads.sum(axis=0) return final_grads ``` --- ### The Fortune500 Stocks ```python df = pd.read_csv("../data/fortune500_100_days.csv") X = df.iloc[:, 1:].values y = X[:, -1] X = X[:, :-1] plt.clf() plot_series(X[1, :], 100, y[1], y_label="MMM") plt.show() ``` <img src="images/Fortune-Example-3.png" width="70%" /> --- ### What would be a good MSE? 
Predicting with the mean of `\(y\)` (the 101st day mean stock price) ```python np.mean((y - y.mean())**2) ``` ``` ## 0.3516210190443937 ``` Predicting with the last column of `\(X\)` (the 100th day stock price) ```python np.mean((y - X[:, -1])**2) ``` ``` ## 0.033289150367058463 ``` --- ### Training with our neuron ```python mses, w = single_rnn(X, y, epochs=1000, alpha=0.01) print(w) ``` ``` ## [-0.07161043 1.15476269 0.25800392] ``` ```python plt.plot(mses) plt.ylabel('MSE-Loss'); plt.xlabel('Epoch') plt.show() ``` <img src="images/Loss-History-5.png" width="60%" /> --- ```python y_pred_arr, mse = forward(X, y, w) y_pred = y_pred_arr[:, -1] print(mse) ``` ``` ## 0.03900926352398098 ``` ```python plt.scatter(y, y_pred) plt.ylabel('y_pred'); plt.xlabel('y_true') plt.show() ``` <img src="images/Scatter-Fit-7.png" width="50%" /> .insight[ 💡 Are you surprised? How could we easily improve? What would be a better approach for this simple dataset? ] --- ### Finally, Keras ```python from tensorflow.keras import Sequential from tensorflow.keras.layers import SimpleRNN from tensorflow.keras.optimizers import Adam model = Sequential([ SimpleRNN(1, input_shape=(None, 1)) ]) model.compile(optimizer=Adam(learning_rate=0.01), loss='mse') X = X[:, :, np.newaxis] y = y[:, np.newaxis] print(X.shape) ``` ``` ## (502, 100, 1) ``` ```python print(y.shape) ``` ``` ## (502, 1) ``` ```python history = model.fit(X, y, epochs=50, verbose=0) ``` --- ```python print(model.get_weights()[2], model.get_weights()[0], model.get_weights()[1]) ``` ``` ## [-0.05025314] [[0.89638436]] [[0.4493832]] ``` ```python y_pred = model.predict(X, verbose=0) print(np.mean((y - y_pred) **2)) ``` ``` ## 0.0437715170171821 ``` ```python plt.plot(history.history['loss']) plt.show() ``` <img src="images/SingleNeuron-Loss-9.png" width="50%" /> --- ```python plt.scatter(y, y_pred) plt.ylabel('y_pred'); plt.xlabel('y_true') plt.show() ``` <img src="images/SingleNeuron-Scatter-Fit-11.png" width="50%" /> --- ### Add inputs, add neurons `\(\hat{Y}_t = \tanh(W_0 + X_t \cdot W_1 + \hat{Y}_{t-1} \cdot W_2)\)` - `\(X_t\)` is `\(n\)` batch size × `\(m\)` inputs - `\(W_1\)` is `\(m\)` inputs × `\(p\)` neurons - `\(\hat{Y}_t\)` is `\(n\)` batch size × `\(p\)` neurons - `\(W_2\)` is `\(p\)` neurons × `\(p\)` neurons - `\(W_0\)` is a `\(p\)` × 1 bias vector <img src = "images/multiple_neurons_rnn.png" style="width: 70%"> --- ### Add layers ```python model = Sequential([ SimpleRNN(10, return_sequences=True, input_shape=[None, 1]), SimpleRNN(5, return_sequences=True), SimpleRNN(1) ]) model.compile(optimizer=Adam(learning_rate=0.01), loss='mse') ``` --- ```python model.summary() ``` ``` ## Model: "sequential_1" ## _________________________________________________________________ ## Layer (type) Output Shape Param # ## ================================================================= ## simple_rnn_1 (SimpleRNN) (None, None, 10) 120 ## ## simple_rnn_2 (SimpleRNN) (None, None, 5) 80 ## ## simple_rnn_3 (SimpleRNN) (None, 1) 7 ## ## ================================================================= ## Total params: 207 ## Trainable params: 207 ## Non-trainable params: 0 ## _________________________________________________________________ ``` .font80percent[ * 10 neurons, each having: 1 input, 1 bias, and the 10 outputs of the layer (including its own): 10 * (1 + 1 + 10) = 120 * 5 neurons, each having: 10 inputs, 1 bias, and the 5 outputs of the layer: 5 * (10 + 1 + 5) = 80 * 1 neuron, having: 5 inputs, 1 bias, and its own single output: 1 * (5 + 1 + 1) = 7 ] --- ```python history = model.fit(X,
y, epochs=30, verbose=0) y_pred = model.predict(X, verbose=0) print(np.mean((y - y_pred) **2)) ``` ``` ## 0.027710174771783818 ``` ```python plt.scatter(y, y_pred) plt.ylabel('y_pred'); plt.xlabel('y_true') plt.show() ``` <img src="images/DeepRNN-Scatter-Fit-13.png" width="50%" /> .insight[ 💡 Though ~200 params for predicting 500 numbers... sounds a bit much. ] --- ### Possibilities are endless <img src = "images/possibilities.png" style="width: 70%"> .font80percent[Source: [CS230](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks)] --- class: section-slide # Detour: Text is a Time Series! --- #### But we first have to tokenize it ```python from tensorflow.keras.preprocessing import text, sequence sentences = [ 'I love Stats so much', 'I love ML too', 'I love DS', 'I love NN' ] tokenizer = text.Tokenizer(num_words = 10) tokenizer.fit_on_texts(sentences) print(tokenizer.word_counts) ``` ``` ## OrderedDict([('i', 4), ('love', 4), ('stats', 1), ('so', 1), ('much', 1), ('ml', 1), ('too', 1), ('ds', 1), ('nn', 1)]) ``` ```python print(tokenizer.word_index) ``` ``` ## {'i': 1, 'love': 2, 'stats': 3, 'so': 4, 'much': 5, 'ml': 6, 'too': 7, 'ds': 8, 'nn': 9} ``` ```python print(tokenizer.index_word) ``` ``` ## {1: 'i', 2: 'love', 3: 'stats', 4: 'so', 5: 'much', 6: 'ml', 7: 'too', 8: 'ds', 9: 'nn'} ``` --- #### Then make it a sequence ```python text_sequences = tokenizer.texts_to_sequences(sentences) print(text_sequences) ``` ``` ## [[1, 2, 3, 4, 5], [1, 2, 6, 7], [1, 2, 8], [1, 2, 9]] ``` ```python X = sequence.pad_sequences(text_sequences, padding='post', truncating='post', maxlen=4) print(X) ``` ``` ## [[1 2 3 4] ## [1 2 6 7] ## [1 2 8 0] ## [1 2 9 0]] ``` --- #### Now we can embed it ```python from tensorflow.keras.layers import Embedding embed_layer = Embedding(input_dim= (10 + 1), output_dim=3) print(embed_layer(np.array([1]))) ``` ``` ## tf.Tensor([[ 0.00612465 0.00696361 -0.04569301]], shape=(1, 3), dtype=float32) ``` ```python X_embedded = embed_layer(X) print(X_embedded.shape) ``` ``` ## (4, 4, 3) ``` ```python print(X_embedded) ``` ``` ## tf.Tensor( ## [[[ 0.00612465 0.00696361 -0.04569301] ## [-0.00974723 -0.00407351 0.03557405] ## [-0.03233683 0.00490437 0.04315286] ## [-0.03733008 0.03002742 -0.02753231]] ## ## [[ 0.00612465 0.00696361 -0.04569301] ## [-0.00974723 -0.00407351 0.03557405] ## [-0.00920048 0.0153161 -0.03797644] ## [ 0.01899211 -0.04871576 -0.04502872]] ## ## [[ 0.00612465 0.00696361 -0.04569301] ## [-0.00974723 -0.00407351 0.03557405] ## [-0.00735121 0.04971125 -0.0313213 ] ## [ 0.0028616 -0.03679866 0.0218447 ]] ## ## [[ 0.00612465 0.00696361 -0.04569301] ## [-0.00974723 -0.00407351 0.03557405] ## [-0.01353997 -0.02620271 -0.04318953] ## [ 0.0028616 -0.03679866 0.0218447 ]]], shape=(4, 4, 3), dtype=float32) ``` --- class: section-slide # End of Detour --- ### Yelp! ~600K (!) [text reviews](https://www.tensorflow.org/datasets/catalog/yelp_polarity_reviews) of shops and restaurants, polarized to negative (1-2 stars) and positive (3-4 stars), 560K in training set. 
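The loading code below uses TensorFlow Datasets; a minimal setup sketch (assuming the package is installed and imported under its usual alias):

```python
# assumed setup for the tfds.load() call below
import tensorflow_datasets as tfds
```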
```python (ds_train, ds_test), ds_info = tfds.load('yelp_polarity_reviews', split=['train', 'test'], with_info=True) df_train = tfds.as_dataframe(ds_train, ds_info) df_test = tfds.as_dataframe(ds_test, ds_info) df_test['text'] = df_test['text'].str.decode('utf-8') df_train['text'] = df_train['text'].str.decode('utf-8') print(df_train.shape) print(df_test.shape) print(df_test.head(3)) ``` ``` ## (560000, 2) ``` ``` ## (38000, 2) ``` ``` ## label text ## 0 0 Was not impressed, and will not return. ## 1 0 I went in to purchase overalls and was treated... ## 2 0 This place really is horrible... Every time I ... ``` --- #### Yelp! But with sequences ```python from sklearn.model_selection import train_test_split max_features = 10000 seq_len = 100 tokenizer = text.Tokenizer(num_words=max_features) tokenizer.fit_on_texts(df_train['text']) text_sequences = tokenizer.texts_to_sequences(df_train['text']) X = sequence.pad_sequences(text_sequences, padding='post', truncating='post', maxlen=seq_len) X_train, X_test, y_train, y_test = train_test_split(X, df_train['label'], test_size = 0.2) print(X_train.shape) print(X_test.shape) ``` ``` ## (448000, 100) ``` ``` ## (112000, 100) ``` In case you're wondering, yes, there are smarter text generators, but even this 0.5M rows `X` matrix is only ~220MB. --- #### Yelp with MLP Remember there's nothing preventing you from using a simple NN, for (almost) everything: ```python from tensorflow.keras.layers import Dense from tensorflow.keras.callbacks import EarlyStopping n_cells = 10 epochs = 100 batch_size = 30 words_embed_dim = 50 callbacks = EarlyStopping(monitor='val_loss', patience=5) mlp = Sequential([ Embedding(max_features + 1, words_embed_dim), Dense(n_cells, activation='relu'), Dense(1, activation='sigmoid') ]) mlp.compile(loss = 'binary_crossentropy', optimizer='adam', metrics='accuracy') ``` --- ```python mlp.summary() ``` ``` ## Model: "sequential_2" ## _________________________________________________________________ ## Layer (type) Output Shape Param # ## ================================================================= ## embedding_1 (Embedding) (None, None, 50) 500050 ## ## dense (Dense) (None, None, 10) 510 ## ## dense_1 (Dense) (None, None, 1) 11 ## ## ================================================================= ## Total params: 500,571 ## Trainable params: 500,571 ## Non-trainable params: 0 ## _________________________________________________________________ ``` ```python history = mlp.fit(X_train, y_train, validation_split=0.1, callbacks=callbacks, batch_size=batch_size, epochs=epochs) mlp.evaluate(X_test, y_test) ``` ``` ## [0.6664084792137146, 0.5871875286102295] ``` --- #### Yelp with RNN ```python rnn = Sequential([ Embedding(max_features + 1, words_embed_dim), SimpleRNN(n_cells, return_sequences=True), SimpleRNN(1, activation='sigmoid') ]) rnn.compile(loss = 'binary_crossentropy', optimizer='adam', metrics='accuracy') rnn.summary() ``` ``` ## Model: "sequential_3" ## _________________________________________________________________ ## Layer (type) Output Shape Param # ## ================================================================= ## embedding_2 (Embedding) (None, None, 50) 500050 ## ## simple_rnn_4 (SimpleRNN) (None, None, 10) 610 ## ## simple_rnn_5 (SimpleRNN) (None, 1) 12 ## ## ================================================================= ## Total params: 500,672 ## Trainable params: 500,672 ## Non-trainable params: 0 ## _________________________________________________________________ ``` --- ```python history = 
rnn.fit(X_train, y_train, validation_split=0.1, callbacks=callbacks, batch_size=batch_size, epochs=epochs) rnn.evaluate(X_test, y_test) ``` ``` ## [0.44967198371887207, 0.8164107203483582] ``` --- class: section-slide # 1D Convolution Layers --- ### If you got it in 2D... <img src="images/conv1d_excel.png" style="width: 60%" /> Or, in a formula: `\(Z_{i} = b + \sum_{v=0}^{f_w-1}X_{i - \frac{f_w - 1}{2} + v} \cdot W_{v}\)` Or, in Numpy: ```python np.convolve([0,1,2,1,0], [0.2,0.6,0.2][::-1], 'same') ``` ``` ## array([0.2, 1. , 1.6, 1. , 0.2]) ``` .insight[ 💡 Why would we want this? ] --- ```python from tensorflow.keras.layers import Conv1D rnn_conv1d = Sequential([ Embedding(max_features + 1, words_embed_dim), Conv1D(filters=5, kernel_size=2, strides=1), SimpleRNN(n_cells, return_sequences=True), SimpleRNN(1, activation='sigmoid') ]) rnn_conv1d.compile(loss = 'binary_crossentropy', optimizer='adam', metrics='accuracy') rnn_conv1d.summary() ``` ``` ## Model: "sequential_4" ## _________________________________________________________________ ## Layer (type) Output Shape Param # ## ================================================================= ## embedding_3 (Embedding) (None, None, 50) 500050 ## ## conv1d (Conv1D) (None, None, 5) 505 ## ## simple_rnn_6 (SimpleRNN) (None, None, 10) 160 ## ## simple_rnn_7 (SimpleRNN) (None, 1) 12 ## ## ================================================================= ## Total params: 500,727 ## Trainable params: 500,727 ## Non-trainable params: 0 ## _________________________________________________________________ ``` --- ```python history = rnn_conv1d.fit(X_train, y_train, validation_split=0.1, callbacks=callbacks, batch_size=batch_size, epochs=epochs) rnn_conv1d.evaluate(X_test, y_test) ``` ``` ## [0.4487762749195099, 0.825705349445343] ``` --- class: section-slide # LSTM --- ### RNN Cell has Short Memory > I hated this bar, though the bartender was so handsome and the drink he made me was absolutely delicious. The RNN sees `[I, hated, this, bar, ..., so, handsome, ... absolutely, delicious]`. What do you think it would predict? Enter Long Short-Term Memory cells (LSTM). LSTM keeps track of its memory, of its state `\(C_t\)`, by constantly updating how much it needs to: - forget from previous state: `\(f_t \cdot C_{t-1}\)` - remember from current "candidate" state: `\(i_t \cdot \tilde{C_t}\)` `\(C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C_t}\)` --- `\(C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C_t}\)` Where `\(i_t\)` and `\(f_t\)` are "gates" between 0 and 1. Finally, the state goes through `\(\tanh()\)` activation and another 0-1 gate `\(o_t\)`, and the output is: `\(\hat{Y}_t = o_t \cdot tanh(C_t)\)` So how do we learn the gates and get `\(\tilde{C_t}\)`? Don't panic: `\(f_t = \sigma(W_{0f} + X_t \cdot W_{1f} + \hat{Y}_{t-1} \cdot W_{2f})\)` `\(i_t = \sigma(W_{0i} + X_t \cdot W_{1i} + \hat{Y}_{t-1} \cdot W_{2i})\)` `\(o_t = \sigma(W_{0o} + X_t \cdot W_{1o} + \hat{Y}_{t-1} \cdot W_{2o})\)` `\(\tilde{C}_t = \tanh(W_{0c} + X_t \cdot W_{1c} + \hat{Y}_{t-1} \cdot W_{2c})\)` Where `\(\sigma\)` is the sigmoid function, shrinking any input to be between 0 and 1. 
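---

#### Or if you prefer NumPy

A minimal sketch of a single LSTM step, directly following the formulas above (toy shapes and random, untrained weights; not how Keras actually implements it):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, y_prev, c_prev, W0, W1, W2):
    # W0: (4, p) biases, W1: (4, m, p) input weights, W2: (4, p, p) recurrent weights,
    # one slice each for the forget / input / output gates and the candidate state
    f = sigmoid(W0[0] + x_t @ W1[0] + y_prev @ W2[0])        # forget gate
    i = sigmoid(W0[1] + x_t @ W1[1] + y_prev @ W2[1])        # input gate
    o = sigmoid(W0[2] + x_t @ W1[2] + y_prev @ W2[2])        # output gate
    c_tilde = np.tanh(W0[3] + x_t @ W1[3] + y_prev @ W2[3])  # candidate state
    c_t = f * c_prev + i * c_tilde                           # forget old, remember new
    y_t = o * np.tanh(c_t)                                   # gated output
    return y_t, c_t

# toy example: n=2 batch, m=3 inputs, p=4 neurons, random weights
rng = np.random.default_rng(0)
n, m, p = 2, 3, 4
x_t, y_prev, c_prev = rng.normal(size=(n, m)), np.zeros((n, p)), np.zeros((n, p))
W0, W1, W2 = rng.normal(size=(4, p)), rng.normal(size=(4, m, p)), rng.normal(size=(4, p, p))
y_t, c_t = lstm_step(x_t, y_prev, c_prev, W0, W1, W2)
print(y_t.shape, c_t.shape)  # (2, 4) (2, 4)
```

Keras packs the four gate kernels into single weight matrices internally, but the arithmetic per step is the same.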
--- #### Or if you prefer a diagram <img src="images/lstm_cell.png" style="width: 100%" /> --- #### Yelp with LSTM ```python from tensorflow.keras.layers import LSTM lstm = Sequential([ Embedding(max_features + 1, words_embed_dim), LSTM(n_cells, return_sequences=True), LSTM(1, activation='sigmoid') ]) ``` ``` ## WARNING:tensorflow:Layer lstm_1 will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU. ``` ```python lstm.compile(loss = 'binary_crossentropy', optimizer='adam', metrics='accuracy') ``` .insight[ 💡 So if a SimpleRNN layer has `\(l\)` parameters, an LSTM layer would have...? ] --- ```python lstm.summary() ``` ``` ## Model: "sequential_5" ## _________________________________________________________________ ## Layer (type) Output Shape Param # ## ================================================================= ## embedding_4 (Embedding) (None, None, 50) 500050 ## ## lstm (LSTM) (None, None, 10) 2440 ## ## lstm_1 (LSTM) (None, 1) 48 ## ## ================================================================= ## Total params: 502,538 ## Trainable params: 502,538 ## Non-trainable params: 0 ## _________________________________________________________________ ``` ```python history = lstm.fit(X_train, y_train, validation_split=0.1, callbacks=callbacks, batch_size=batch_size, epochs=epochs) lstm.evaluate(X_test, y_test) ``` ``` ## [0.22240985929965973, 0.9227678775787354] ``` --- ### Beyond the black box Sending "probes" can help us understand what the LSTM is doing. By standardizing an LSTM neuron's output to 0-1 and translating this to a cold-hot palette as was done [here](https://towardsdatascience.com/visualising-lstm-activations-in-keras-b50206da96ff), we can see what turns it on/off. Here I checked the first neuron for a few test reviews (see full notebook in Colab): <img src="images/lstm_black_box.png" style="width: 100%" /> --- ### Play with it! For example, combining LSTM with the old MLP to predict women's shoe prices: <img src="images/lstm_with_mlp.png" style="width: 50%" /> --- ### Quora! The [Quora Question Pairs](https://www.kaggle.com/c/quora-question-pairs) dataset has over 400K pairs of questions labeled for whether they're duplicates or not: ```python quora_df = pd.read_csv('../data/quora_small.csv') quora_df[['question1', 'question2', 'is_duplicate']].head() ``` ``` ## question1 ... is_duplicate ## 0 What is the step by step guide to invest in sh... ... 0 ## 1 What is the story of Kohinoor (Koh-i-Noor) Dia... ... 0 ## 2 How can I increase the speed of my internet co... ... 0 ## 3 Why am I mentally very lonely? How can I solve... ... 0 ## 4 Which one dissolve in water quikly sugar, salt... ... 0 ## ## [5 rows x 3 columns] ``` ```python quora_df['is_duplicate'].mean() ``` ``` ## 0.38 ``` --- ### Siamese Recurrent Architecture Deciding whether two "things" are the same or not is a challenge for any ML system. NNs are a great framework for this kind of task through a Siamese architecture, and LSTMs are great for text. We're following RStudio's [AI Blog](https://blogs.rstudio.com/ai/posts/2018-01-09-keras-duplicate-questions-quora/).
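The key trick is weight sharing: in the Keras functional API, calling the *same* layer object on two inputs reuses one set of weights. A tiny sketch with made-up toy shapes (the real model, which embeds and encodes both questions with shared layers, follows):

```python
from tensorflow.keras.layers import Dense, Input

shared = Dense(4)                                  # one layer object, one set of weights
left, right = Input(shape=(3,)), Input(shape=(3,))
vec_left, vec_right = shared(left), shared(right)  # applied twice, so both branches share weights
```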
<img src="images/siamese_lstm.png" style="width: 80%" /> --- ```python from tensorflow.keras import Model from tensorflow.keras.layers import Dot, Input from tensorflow.keras.regularizers import l2 max_features = 50000 tokenizer = text.Tokenizer(num_words=max_features) tokenizer.fit_on_texts(pd.concat([quora_df['question1'], quora_df['question2']]).unique()) question1 = tokenizer.texts_to_sequences(quora_df['question1']) question2 = tokenizer.texts_to_sequences(quora_df['question2']) seq_len = 20 question1_padded = sequence.pad_sequences(question1, maxlen=seq_len, value = max_features + 1) question2_padded = sequence.pad_sequences(question2, maxlen=seq_len, value = max_features + 1) input1 = Input(shape = (seq_len,), name = "input_question1") input2 = Input(shape = (seq_len, ), name = "input_question2") word_embedder = Embedding( input_dim = max_features + 2, output_dim = 128, input_length = seq_len, embeddings_regularizer = l2(0.0001) ) seq_embedder = LSTM(units = 128, kernel_regularizer = l2(0.0001) ) ``` --- ```python vector1 = seq_embedder(word_embedder(input1)) vector2 = seq_embedder(word_embedder(input2)) cosine_similarity = Dot(axes=1)([vector1, vector2]) output = Dense(1, activation = 'sigmoid')(cosine_similarity) model = Model(inputs = [input1, input2], outputs = output) model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = 'accuracy') ``` ```python # messy print, see notebook in Colab model.summary() ``` --- ```python train_sample, val_sample = train_test_split(quora_df.index, test_size=0.1) train_question1_padded = question1_padded[train_sample,] train_question2_padded = question2_padded[train_sample,] train_is_duplicate = quora_df['is_duplicate'][train_sample] val_question1_padded = question1_padded[val_sample,] val_question2_padded = question2_padded[val_sample,] val_is_duplicate = quora_df['is_duplicate'][val_sample] history = model.fit( [train_question1_padded, train_question2_padded], train_is_duplicate, batch_size = 64, epochs = 100, callbacks = [EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)], validation_data = ([val_question1_padded, val_question2_padded], val_is_duplicate) ) model.evaluate(([val_question1_padded, val_question2_padded], val_is_duplicate)) ``` ``` ## [0.43971124291419983, 0.8325954079627991] ``` --- Save the model, put this function in the most basic Shiny/Dash app, and... ```python def predict_question_pairs(model, tokenizer, q1, q2): q1 = tokenizer.texts_to_sequences([q1]) q2 = tokenizer.texts_to_sequences([q2]) q1 = sequence.pad_sequences(q1, maxlen=seq_len, value = max_features + 1) q2 = sequence.pad_sequences(q2, maxlen=seq_len, value = max_features + 1) return model.predict([q1, q2])[0][0] predict_question_pairs( model, tokenizer, "What does a LSTM do?", "How does a LSTM work?" ) ``` ``` ## 0.5638776 ```
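---

### Saving for the app

One hedged sketch of the "save the model" part (standard Keras and pickle calls; the file names are just placeholders):

```python
import pickle
from tensorflow.keras.models import load_model

# persist the trained model and the fitted tokenizer...
model.save('quora_siamese.h5')
with open('quora_tokenizer.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)

# ...and reload them inside the Shiny/Dash app
model = load_model('quora_siamese.h5')
with open('quora_tokenizer.pkl', 'rb') as f:
    tokenizer = pickle.load(f)
```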