<h1>A Tutorial on using BERT for Text Classification with Fine Tuning</h1>
<div class="block-paragraph"><div class="rich-text"><p>In this tutorial, we will learn how to use BERT for text classification. We will begin with a brief introduction to BERT, its architecture and fine-tuning mechanism. Then we will learn how to fine-tune BERT for text classification on the following classification tasks:</p><ol><li><b>Binary Text Classification</b>: <a href="#binary-text-classification-using-bert">IMDB sentiment analysis with BERT</a> [88% accuracy].</li><li><b>Multi-class Text Classification:</b> <a href="#multi-class-text-classification-using-bert">20-Newsgroup classification with BERT</a> [90% accuracy].</li><li><b>Multi-label Text Classification:</b> <a href="#multilabel-text-classification-using-bert">Toxic-comment classification with BERT</a> [90% accuracy].</li></ol><p>We will use BERT through the keras-bert Python library, and train and test our model on GPUs provided by Google Colab with a Tensorflow backend.</p></div></div> <div class="block-paragraph"><div class="rich-text"><h2><a id="what-is-bert"></a><a class="body-link" href="#what-is-bert">What is BERT?</a></h2><p>BERT stands for Bidirectional Encoder Representations from Transformers. It is a deep learning based unsupervised language representation model developed by researchers at Google AI Language. It is the first deeply bidirectional unsupervised language model. Language models before BERT learnt from text sequences using either a left-to-right context or a shallow combination of left-to-right and right-to-left contexts; they were therefore either not bidirectional, or not bidirectional in all layers. The diagram below shows its bidirectional architecture as compared to other language models.</p><p></p><img alt="bert-bidirectional" class="richtext-image full-width img-responsive lazyload" height="340" data-src="https://pysnacks-media.s3.amazonaws.com/images/bert-bidirectional-pysnacks.width-1280.png" width="1280"><p></p><p>Deep bi-directionality in BERT (<a href="https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html" target="_blank" rel="noopener noreferrer">Source</a>)</p><p>BERT incorporates deep bi-directionality in learning representations using a novel Masked Language Model (MLM) approach, which allows it to learn a word's representation from both its left and right context. Under the hood, BERT uses the popular Attention mechanism for bidirectional training of Transformers. With this approach, BERT achieved state-of-the-art results on a series of natural language processing and understanding tasks.</p><p></p></div></div> <div class="block-paragraph"><div class="rich-text"><h2><a id="an-overview-of-bert-architecture"></a><a class="body-link" href="#an-overview-of-bert-architecture">An Overview of BERT Architecture</a></h2><p>Before diving into using BERT for text classification, let us take a quick overview of BERT’s architecture. BERT is a multi-layered bidirectional Transformer encoder. The diagram below shows a 12-layered BERT model (BERT-Base version).
Note that each Transformer block is based on the Attention model.</p><img alt="bert-architecture" class="richtext-image full-width img-responsive lazyload" height="720" data-src="https://pysnacks-media.s3.amazonaws.com/images/bert-architecture.width-1280.png" width="1280"><p></p></div></div> <div class="block-paragraph"><div class="rich-text"><p>There are multiple pre-trained model versions available, with varying numbers of encoder layers, attention heads and hidden size dimensions, where:</p></div></div> <div class="block-paragraph"><div class="rich-text"><p>H = The hidden size.</p><p>A = Number of self-attention heads.</p><p>L = Number of layers (Transformer blocks).</p><p>The largest model available is BERT-Large, which has 24 layers, 16 attention heads and 1024-dimensional output hidden vectors. For each model, there are also cased and uncased variants available. In this tutorial we will use BERT-Base, which has 12 encoder layers, 12 attention heads and 768-dimensional hidden representations.</p></div></div> <div class="block-paragraph"><div class="rich-text"><h2><a id="different-ways-to-use-bert"></a><a class="body-link" href="#different-ways-to-use-bert">Different Ways To Use BERT</a></h2><p>BERT can be used for text classification in three ways.</p><ol><li><b>Fine Tuning Approach</b>: In the fine tuning approach, we add a dense layer on top of the last layer of the pretrained BERT model and then train the whole model with a task-specific dataset.</li><li><b>Feature Based Approach</b>: In this approach, fixed features are extracted from the pretrained model. The activations from one or more layers are extracted without fine-tuning, and these contextual embeddings are used as input to the downstream network for specific tasks. A few strategies for feature extraction discussed in the BERT paper are as follows:<ol><li>Extracting the second-to-last hidden layer</li><li>Extracting the last hidden layer</li><li>Concatenating the last four hidden layers</li><li>Weighted sum of all 12 layers</li></ol></li><li><b>As word-embedding</b>: In this approach, the trained model is used to generate token embeddings (vector representations of words) without any fine-tuning for an end-to-end NLP task. The vector representations of tokens can then be used for specific tasks like classification, topic modeling, summarisation etc.
The following code demonstrates using BERT as word-embedding using the bert-embedding library.</li></ol></div></div> <div class="block-code"><div class='row codeblock-header'>Python</div><table class="highlight-xcode responsive-table codehighlight table"><tr><td class="linenos"><div class="linenodiv"><pre> 1 2 3 4 5 6 7 8 9 10 11 12</pre></div></td><td class="code"><div class="highlight-xcode responsive-table codehighlight "><pre><span></span><span class="c1">#Source: https://pypi.org/project/bert-embedding/</span> <span class="n">pip</span> <span class="n">install</span> <span class="n">bert</span><span class="o">-</span><span class="n">embedding</span> <span class="kn">from</span> <span class="nn">bert_embedding</span> <span class="kn">import</span> <span class="n">BertEmbedding</span> <span class="n">text</span> <span class="o">=</span> <span class="s2">"A tutorial on how to generate token embeddings using BERT"</span> <span class="n">bert_embedding</span> <span class="o">=</span> <span class="n">BertEmbedding</span><span class="p">()</span> <span class="n">result</span> <span class="o">=</span> <span class="n">bert_embedding</span><span class="p">(</span><span class="n">text</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">))</span> <span class="n">first_sentence</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="n">embedding</span> <span class="o">=</span> <span class="n">first_sentence</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="nb">print</span> <span class="p">(</span><span class="n">embedding</span><span class="p">)</span> <span class="c1"># array([ 0.4805648 , 0.18369392, -0.28554988, ..., -0.01961522,</span> <span class="c1"># 1.0207764 , -0.67167974], dtype=float32)</span> </pre></div> </td></tr></table></div> <div class="block-paragraph"><div class="rich-text"><p>So which approach to choose for text classification with BERT? The answer depends on the performance requirements and the amount of effort we wish to put in, in terms of resources and time. Fine-tuning and feature-based extraction approaches require training, testing and validating on GPU or TPU and therefore are more time taking and resource intensive as compared to embedding-based approach. However, they are expected to yield better results as they benefit from the use of bidirectional contextual representation of whole sentences, tuned specifically for the task at hand.</p><p></p><p>The BERT paper recommends fine-tuning for better results. A few advantages of fine tuning BERT are as follows:</p><ol><li><b>Better Results:</b> Deeply-bidirectional learning enables it to achieve comparable or even better results than custom architecture tailored to one specific task.</li><li><b>Lesser data:</b> BERT is trained on the BooksCorpus (800M words) and Wikipedia (2,500M words). 
The pre-trained model therefore has weights that allow us to fine-tune on a specific dataset using much smaller training sets, as compared to the case where the model needs to learn its weights from scratch.</li><li><b>Lesser resources:</b> With the advantage of being able to work with less training data, fine-tuning also cuts down the excessive compute and memory resources required to train models from scratch.</li></ol></div></div> <div class="block-paragraph"><div class="rich-text"><h2><a id="understanding-input-to-bert"></a><a class="body-link" href="#understanding-input-to-bert">Understanding Input to BERT</a></h2><p>So, what is the input to BERT? The input to BERT is an embedding representation derived by summing the token embedding, segment embedding and position embedding of the text.</p><p></p><img alt="bert-input" class="richtext-image full-width img-responsive lazyload" height="720" data-src="https://pysnacks-media.s3.amazonaws.com/images/bert-input.width-1280.png" width="1280"><p></p></div></div> <div class="block-paragraph"><div class="rich-text"><p>What are the token embedding, segment embedding and position embedding?</p><ol><li><b>Token Embeddings:</b> Token embeddings are the representations for the word-tokens of the text, derived by tokenizing with the WordPiece token vocabulary. For BERT-Base, the hidden size is 768, so the token embedding has a (SEQ_LEN x 768) representation. The token embedding also includes the [CLS] and [SEP] markers, which denote the class (classification category or label) and sentence separation respectively.</li><li><b>Position Embeddings:</b> The position embedding is a representation for the position of each token in the sentence. For BERT-Base it is a 2D array of size (SEQ_LEN, 768), where the Nth row is a vector representation for the Nth position.</li><li><b>Segment Embeddings:</b> The segment embedding identifies the different unique sentences in the text.</li></ol><p>Note that each of the embeddings (token, position and segment), being summed to derive the input, has dimension (SEQ_LEN x Hidden-Size). The SEQ_LEN value can be changed and is decided based on the length of the sentences in the downstream task dataset; sentences shorter than the sequence length are padded. The Hidden-Size (H) is decided by the choice of the BERT model (BERT Tiny, Small, Base, Large, etc.).</p></div></div> <div class="block-paragraph"><div class="rich-text"><h2><a id="how-to-fine-tune-bert-for-text-classification"></a><a class="body-link" href="#how-to-fine-tune-bert-for-text-classification">How to Fine Tune BERT for Text Classification?</a></h2><p>To fine-tune BERT for text classification, take a pre-trained BERT model, apply an additional fully-connected dense layer on top of its output layer and train the entire model with the task dataset. The diagram below shows how BERT is used for text classification:</p><p></p><p></p><img alt="bert_text_classification_input.png" class="richtext-image full-width img-responsive lazyload" height="720" data-src="https://pysnacks-media.s3.amazonaws.com/images/bert-text-classification-input.width-1280.png" width="1280"><p></p><p></p></div></div> <div class="block-paragraph"><div class="rich-text"><p>Note that only the final hidden state corresponding to the class token ([CLS]) is used as the aggregate sequence representation to feed into a fully connected dense layer for classification tasks.
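In plain Keras terms, this classification head is just a dense layer applied to the [CLS] vector. The snippet below is an illustrative sketch only (it is not the keras-bert code used later in this tutorial; the input layer simply stands in for BERT’s final hidden states, and the sizes are those of BERT-Base):</p></div></div> <div class="block-code"><div class='row codeblock-header'>Python</div><pre>
# Illustrative sketch: a classification head on the [CLS] position of the last hidden layer.
import keras  # assumes the same standalone Keras used elsewhere in this tutorial

SEQ_LEN, HIDDEN, NUM_CLASSES = 128, 768, 2                                 # BERT-Base hidden size is 768
sequence_output = keras.layers.Input(shape=(SEQ_LEN, HIDDEN))              # stand-in for BERT's last layer
cls_vector = keras.layers.Lambda(lambda t: t[:, 0, :])(sequence_output)    # hidden state of the [CLS] token
probs = keras.layers.Dense(units=NUM_CLASSES, activation='softmax')(cls_vector)
head = keras.models.Model(sequence_output, probs)
</pre></div> <div class="block-paragraph"><div class="rich-text"><p>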
To understand it better, let us look at the last layers of BERT(BERT-Base, 12 Layers).</p></div></div> <div class="block-code"><div class='row codeblock-header'>Bash</div><table class="highlight-syntax responsive-table codehighlight table"><tr><td class="linenos"><div class="linenodiv"><pre> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24</pre></div></td><td class="code"><div class="highlight-syntax responsive-table codehighlight "><pre><span></span>Encoder-11-FeedForward-Norm <span class="o">(</span>La <span class="o">(</span>None, <span class="m">128</span>, <span class="m">768</span><span class="o">)</span> <span class="m">1536</span> Encoder-11-FeedForward-Add<span class="o">[</span><span class="m">0</span><span class="o">][</span><span class="m">0</span><span class="o">]</span> __________________________________________________________________________________________________ Encoder-12-MultiHeadSelfAttenti <span class="o">(</span>None, <span class="m">128</span>, <span class="m">768</span><span class="o">)</span> <span class="m">2362368</span> Encoder-11-FeedForward-Norm<span class="o">[</span><span class="m">0</span><span class="o">][</span><span class="m">0</span><span class="o">]</span> __________________________________________________________________________________________________ Encoder-12-MultiHeadSelfAttenti <span class="o">(</span>None, <span class="m">128</span>, <span class="m">768</span><span class="o">)</span> <span class="m">0</span> Encoder-12-MultiHeadSelfAttention __________________________________________________________________________________________________ Encoder-12-MultiHeadSelfAttenti <span class="o">(</span>None, <span class="m">128</span>, <span class="m">768</span><span class="o">)</span> <span class="m">0</span> Encoder-11-FeedForward-Norm<span class="o">[</span><span class="m">0</span><span class="o">][</span><span class="m">0</span><span class="o">]</span> Encoder-12-MultiHeadSelfAttention __________________________________________________________________________________________________ Encoder-12-MultiHeadSelfAttenti <span class="o">(</span>None, <span class="m">128</span>, <span class="m">768</span><span class="o">)</span> <span class="m">1536</span> Encoder-12-MultiHeadSelfAttention __________________________________________________________________________________________________ Encoder-12-FeedForward <span class="o">(</span>FeedFor <span class="o">(</span>None, <span class="m">128</span>, <span class="m">768</span><span class="o">)</span> <span class="m">4722432</span> Encoder-12-MultiHeadSelfAttention __________________________________________________________________________________________________ Encoder-12-FeedForward-Dropout <span class="o">(</span>None, <span class="m">128</span>, <span class="m">768</span><span class="o">)</span> <span class="m">0</span> Encoder-12-FeedForward<span class="o">[</span><span class="m">0</span><span class="o">][</span><span class="m">0</span><span class="o">]</span> __________________________________________________________________________________________________ Encoder-12-FeedForward-Add <span class="o">(</span>Add <span class="o">(</span>None, <span class="m">128</span>, <span class="m">768</span><span class="o">)</span> <span class="m">0</span> Encoder-12-MultiHeadSelfAttention Encoder-12-FeedForward-Dropout<span class="o">[</span><span class="m">0</span><span class="o">]</span> __________________________________________________________________________________________________ 
Encoder-12-FeedForward-Norm <span class="o">(</span>La <span class="o">(</span>None, <span class="m">128</span>, <span class="m">768</span><span class="o">)</span> <span class="m">1536</span> Encoder-12-FeedForward-Add<span class="o">[</span><span class="m">0</span><span class="o">][</span><span class="m">0</span><span class="o">]</span> __________________________________________________________________________________________________ Extract <span class="o">(</span>Extract<span class="o">)</span> <span class="o">(</span>None, <span class="m">768</span><span class="o">)</span> <span class="m">0</span> Encoder-12-FeedForward-Norm<span class="o">[</span><span class="m">0</span><span class="o">][</span><span class="m">0</span><span class="o">]</span> __________________________________________________________________________________________________ NSP-Dense <span class="o">(</span>Dense<span class="o">)</span> <span class="o">(</span>None, <span class="m">768</span><span class="o">)</span> <span class="m">590592</span> Extract<span class="o">[</span><span class="m">0</span><span class="o">][</span><span class="m">0</span><span class="o">]</span> __________________________________________________________________________________________________ </pre></div> </td></tr></table></div> <div class="block-paragraph"><div class="rich-text"><p>For fine-tuning this model for classification tasks, we take the last layer NSP-Dense (Next Sentence Prediction-Dense) and tie its output to a new fully connected dense layer, as shown below.</p></div></div> <div class="block-code"><div class='row codeblock-header'>Python</div><table class="highlight-xcode responsive-table codehighlight table"><tr><td class="linenos"><div class="linenodiv"><pre>1 2 3 4 5</pre></div></td><td class="code"><div class="highlight-xcode responsive-table codehighlight "><pre><span></span><span class="c1"># Add dense layer for classification</span> <span class="n">inputs</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">inputs</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span> <span class="n">dense</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">get_layer</span><span class="p">(</span><span class="s1">'NSP-Dense'</span><span class="p">)</span><span class="o">.</span><span class="n">output</span> <span class="n">outputs</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span><span class="n">units</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s1">'softmax'</span><span class="p">)(</span><span class="n">dense</span><span class="p">)</span> <span class="n">model</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">models</span><span class="o">.</span><span class="n">Model</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">outputs</span><span class="p">)</span> </pre></div> </td></tr></table></div> <div class="block-paragraph"><div class="rich-text"><p>The updated model looks like this for binary text classification:</p></div></div> <div class="block-code"><div class='row codeblock-header'>Bash</div><table class="highlight-syntax responsive-table codehighlight table"><tr><td 
class="linenos"><div class="linenodiv"><pre> 1 2 3 4 5 6 7 8 9 10 11 12 13</pre></div></td><td class="code"><div class="highlight-syntax responsive-table codehighlight "><pre><span></span>Encoder-12-FeedForward-Norm <span class="o">(</span>La <span class="o">(</span>None, <span class="m">128</span>, <span class="m">768</span><span class="o">)</span> <span class="m">1536</span> Encoder-12-FeedForward-Add<span class="o">[</span><span class="m">0</span><span class="o">][</span><span class="m">0</span><span class="o">]</span> __________________________________________________________________________________________________ Extract <span class="o">(</span>Extract<span class="o">)</span> <span class="o">(</span>None, <span class="m">768</span><span class="o">)</span> <span class="m">0</span> Encoder-12-FeedForward-Norm<span class="o">[</span><span class="m">0</span><span class="o">][</span><span class="m">0</span><span class="o">]</span> __________________________________________________________________________________________________ NSP-Dense <span class="o">(</span>Dense<span class="o">)</span> <span class="o">(</span>None, <span class="m">768</span><span class="o">)</span> <span class="m">590592</span> Extract<span class="o">[</span><span class="m">0</span><span class="o">][</span><span class="m">0</span><span class="o">]</span> __________________________________________________________________________________________________ dense <span class="o">(</span>Dense<span class="o">)</span> <span class="o">(</span>None, <span class="m">20</span><span class="o">)</span> <span class="m">15380</span> NSP-Dense<span class="o">[</span><span class="m">0</span><span class="o">][</span><span class="m">0</span><span class="o">]</span> <span class="o">==================================================================================================</span> Total params: <span class="m">109</span>,202,708 Trainable params: <span class="m">109</span>,202,708 Non-trainable params: <span class="m">0</span> __________________________________________________________________________________________________ None </pre></div> </td></tr></table></div> <div class="block-paragraph"><div class="rich-text"><p>The size of the last fully connected dense layer is equal to the number of classification classes or labels.</p><p>So, how do we choose activation and loss function for text classification? For Binary and Multiclass text classification we use the softmax activation function with sparse categorical cross entropy loss function while for multilabel text classification, sigmoid activation function with binary cross entropy loss function is more suitable.</p></div></div> <div class="block-paragraph"><div class="rich-text"><h2><a id="recommended-fine-tuning-hyper-parameters""></a><a class="body-link" href="#recommended-fine-tuning-hyper-parameters">Recommended Fine Tuning Hyper Parameters</a></h2><p>According to the BERT paper, the following range of values are recommended:</p><ol><li>Batch size: 16, 32</li><li>Learning rate (Adam): 5e-5, 3e-5, 2e-5</li><li>Number of epochs: 2, 3, 4</li></ol></div></div> <div class="block-paragraph"><div class="rich-text"><h2><a id="preparing-input-datasets""></a><a class="body-link" href="#preparing-input-datasets">Preparing Input datasets</a></h2><p>Let us take a look at working examples of binary, multiclass and multilabel text classification by fine-tuning BERT. 
We will use the Python-based keras-bert library with a Tensorflow backend and run our examples on Google Colab with GPU accelerators. Some of the code for these examples is taken from the keras-bert documentation.</p><p>One method that is common across all the tasks is the one that prepares the training, test and validation datasets. We need a method that generates these sets in the format BERT expects for text classification.</p><p></p><h3><a id="understanding-the-input-to-keras-bert"></a><a class="body-link" href="#understanding-the-input-to-keras-bert">Understanding the input to keras-bert</a></h3><p>For fine-tuning using keras-bert, the following inputs are required:</p><ol><li><b>Token Embedding:</b> Each sentence in the dataset is tokenized using the WordPiece vocabulary, [CLS] and [SEP] tokens are added, and the sequence is padded.</li><li><b>Segment Mask Embedding:</b> Generate the segment embedding. (An array of zeros for a single-sentence representation.)</li><li><b>Target Labels</b></li></ol><p>The positional embedding is derived internally and does not need to be passed explicitly.</p><p>To do the above three tasks we will use a method called <i>load_data</i>, the input to which varies depending on the dataset format; however, the processing logic and the output are the same across all tasks. The output of the <i>load_data</i> method is a tuple whose first item is a list of size two: the first element is the text’s token embedding and the second is the text’s segment embedding (an array of zeros, as we are classifying or labelling only one sentence at a time). The second item of the tuple is the target class, index-wise paired with the token and segment embeddings.</p><p></p></div></div> <div class="block-paragraph"><div class="rich-text"><h2><a id="binary-text-classification-using-bert"></a><a class="body-link" href="#binary-text-classification-using-bert">Binary Text Classification Using BERT</a></h2><p>To demonstrate using BERT with fine-tuning for binary text classification, we will use the <a href="https://ai.stanford.edu/~amaas/data/sentiment/" target="_blank" rel="noopener noreferrer"><i>Large Movie Review Dataset</i></a>.
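Before preparing it, here is a minimal illustration of the input format described in the previous section, using the keras-bert <i>Tokenizer</i> that the examples below also use (a sketch with a toy vocabulary; the real <i>token_dict</i> is built from the checkpoint’s vocab.txt in the code that follows):</p></div></div> <div class="block-code"><div class='row codeblock-header'>Python</div><pre>
# Illustrative sketch of the [token ids, segment ids] / labels format that load_data produces.
import numpy as np
from keras_bert import Tokenizer

# Toy vocabulary for illustration only; the real token_dict comes from the BERT vocab.txt.
token_dict = {'[PAD]': 0, '[UNK]': 1, '[CLS]': 2, '[SEP]': 3, 'a': 4, 'great': 5, 'movie': 6}
tokenizer = Tokenizer(token_dict)

SEQ_LEN = 8
ids, segments = tokenizer.encode('a great movie', max_len=SEQ_LEN)
print(ids)       # e.g. [2, 4, 5, 6, 3, 0, 0, 0]  -> [CLS] a great movie [SEP] + padding
print(segments)  # [0, 0, 0, 0, 0, 0, 0, 0]      -> single-sentence segment ids

x = [np.array([ids]), np.array([segments])]   # model input: [token ids, segment ids]
y = np.array([1])                              # index-wise paired target label (e.g. positive)
</pre></div> <div class="block-paragraph"><div class="rich-text"><p>Now, back to the dataset.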
This is a dataset for binary sentiment classification and contains a set of 25,000 highly polar movie reviews for training, and 25,000 for testing.</p><p></p><p>Let us begin with first downloading the dataset and preparing the training and test datasets.</p></div></div> <div class="block-code"><div class='row codeblock-header'>Python</div><table class="highlight-xcode responsive-table codehighlight table"><tr><td class="linenos"><div class="linenodiv"><pre> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45</pre></div></td><td class="code"><div class="highlight-xcode responsive-table codehighlight "><pre><span></span><span class="ch">#!wget -q https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip</span> <span class="c1">#!unzip -o uncased_L-12_H-768_A-12.zip</span> <span class="n">dataset</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">keras</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">get_file</span><span class="p">(</span> <span class="n">fname</span><span class="o">=</span><span class="s2">"aclImdb.tar.gz"</span><span class="p">,</span> <span class="n">origin</span><span class="o">=</span><span class="s2">"http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"</span><span class="p">,</span> <span class="n">extract</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="p">)</span> <span class="n">token_dict</span> <span class="o">=</span> <span class="p">{}</span> <span class="k">with</span> <span class="n">codecs</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="n">vocab_path</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">,</span> <span class="s1">'utf8'</span><span class="p">)</span> <span class="k">as</span> <span class="n">reader</span><span class="p">:</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">reader</span><span class="p">:</span> <span class="n">token</span> <span class="o">=</span> <span class="n">line</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span> <span class="n">token_dict</span><span class="p">[</span><span class="n">token</span><span class="p">]</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">token_dict</span><span class="p">)</span> <span class="n">tokenizer</span> <span class="o">=</span> <span class="n">Tokenizer</span><span class="p">(</span><span class="n">token_dict</span><span class="p">)</span> <span class="k">def</span> <span class="nf">load_data</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">tagset</span><span class="p">):</span> <span class="k">global</span> <span class="n">tokenizer</span> <span class="n">indices</span><span class="p">,</span> <span class="n">sentiments</span> <span class="o">=</span> <span class="p">[],</span> <span class="p">[]</span> <span class="k">for</span> <span class="n">folder</span><span class="p">,</span> <span class="n">sentiment</span> <span class="ow">in</span> <span class="n">tagset</span><span class="p">:</span> <span class="n">folder</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span 
class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">folder</span><span class="p">)</span> <span class="k">for</span> <span class="n">name</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">(</span><span class="n">folder</span><span class="p">)):</span> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">folder</span><span class="p">,</span> <span class="n">name</span><span class="p">),</span> <span class="s1">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="n">reader</span><span class="p">:</span> <span class="n">text</span> <span class="o">=</span> <span class="n">reader</span><span class="o">.</span><span class="n">read</span><span class="p">()</span> <span class="n">ids</span><span class="p">,</span> <span class="n">segments</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">max_len</span><span class="o">=</span><span class="n">SEQ_LEN</span><span class="p">)</span> <span class="n">indices</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">ids</span><span class="p">)</span> <span class="n">sentiments</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">sentiment</span><span class="p">)</span> <span class="n">items</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">indices</span><span class="p">,</span> <span class="n">sentiments</span><span class="p">))</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">items</span><span class="p">)</span> <span class="n">indices</span><span class="p">,</span> <span class="n">sentiments</span> <span class="o">=</span> <span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">items</span><span class="p">)</span> <span class="n">indices</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">indices</span><span class="p">)</span> <span class="n">mod</span> <span class="o">=</span> <span class="n">indices</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">%</span> <span class="n">BATCH_SIZE</span> <span class="k">if</span> <span class="n">mod</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span> <span class="n">indices</span><span class="p">,</span> <span class="n">sentiments</span> <span class="o">=</span> <span class="n">indices</span><span class="p">[:</span><span class="o">-</span><span class="n">mod</span><span class="p">],</span> <span class="n">sentiments</span><span class="p">[:</span><span class="o">-</span><span class="n">mod</span><span class="p">]</span> <span class="k">return</span> <span class="p">[</span><span class="n">indices</span><span 
class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">indices</span><span class="p">)],</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">sentiments</span><span class="p">)</span> <span class="n">train_path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">dirname</span><span class="p">(</span><span class="n">dataset</span><span class="p">),</span> <span class="s1">'aclImdb'</span><span class="p">,</span> <span class="s1">'train'</span><span class="p">)</span> <span class="n">test_path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">dirname</span><span class="p">(</span><span class="n">dataset</span><span class="p">),</span> <span class="s1">'aclImdb'</span><span class="p">,</span> <span class="s1">'test'</span><span class="p">)</span> <span class="n">tagset</span> <span class="o">=</span> <span class="p">[(</span><span class="s1">'neg'</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span> <span class="p">(</span><span class="s1">'pos'</span><span class="p">,</span> <span class="mi">1</span><span class="p">)]</span> <span class="n">id_to_labels</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">:</span> <span class="s1">'negative'</span><span class="p">,</span> <span class="mi">1</span><span class="p">:</span> <span class="s1">'positive'</span><span class="p">}</span> <span class="n">train_x</span><span class="p">,</span> <span class="n">train_y</span> <span class="o">=</span> <span class="n">load_data</span><span class="p">(</span><span class="n">train_path</span><span class="p">,</span> <span class="n">tagset</span><span class="p">)</span> <span class="n">test_x</span><span class="p">,</span> <span class="n">test_y</span> <span class="o">=</span> <span class="n">load_data</span><span class="p">(</span><span class="n">test_path</span><span class="p">,</span> <span class="n">tagset</span><span class="p">)</span> </pre></div> </td></tr></table></div> <div class="block-paragraph"><div class="rich-text"><p>Once we have our training data ready, let us define our model training hyper-parameters. We set the batch-size as 16 and learning-rate at 2e-5 as recommended by the BERT paper. 
It's important to not set a high value for learning rate, as it could cause the training to not converge or catastrophic forgetting.</p></div></div> <div class="block-code"><div class='row codeblock-header'>Python</div><table class="highlight-xcode responsive-table codehighlight table"><tr><td class="linenos"><div class="linenodiv"><pre> 1 2 3 4 5 6 7 8 9 10</pre></div></td><td class="code"><div class="highlight-xcode responsive-table codehighlight "><pre><span></span><span class="c1"># Bert Model Constants</span> <span class="n">SEQ_LEN</span> <span class="o">=</span> <span class="mi">128</span> <span class="n">BATCH_SIZE</span> <span class="o">=</span> <span class="mi">16</span> <span class="n">EPOCHS</span> <span class="o">=</span> <span class="mi">3</span> <span class="n">LR</span> <span class="o">=</span> <span class="mf">2e-5</span> <span class="n">pretrained_path</span> <span class="o">=</span> <span class="s1">'uncased_L-12_H-768_A-12'</span> <span class="n">config_path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">pretrained_path</span><span class="p">,</span> <span class="s1">'bert_config.json'</span><span class="p">)</span> <span class="n">checkpoint_path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">pretrained_path</span><span class="p">,</span> <span class="s1">'bert_model.ckpt'</span><span class="p">)</span> <span class="n">vocab_path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">pretrained_path</span><span class="p">,</span> <span class="s1">'vocab.txt'</span><span class="p">)</span> </pre></div> </td></tr></table></div> <div class="block-paragraph"><div class="rich-text"><p>The next step is to build and train the model. We first load the pre-trained BERT-Base model. Then we take its last layer (NSP-Dense) and connect it to binary classification layer. The binary classification layer is essentially a fully-connected dense layer with size 2. 
Since it is a case of binary classification, we want the probabilities of the output nodes to sum upto 1, we use the <i>softmax</i> as the activation function.</p></div></div> <div class="block-code"><div class='row codeblock-header'>Python</div><table class="highlight-xcode responsive-table codehighlight table"><tr><td class="linenos"><div class="linenodiv"><pre> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25</pre></div></td><td class="code"><div class="highlight-xcode responsive-table codehighlight "><pre><span></span><span class="n">model</span> <span class="o">=</span> <span class="n">load_trained_model_from_checkpoint</span><span class="p">(</span> <span class="n">config_path</span><span class="p">,</span> <span class="n">checkpoint_path</span><span class="p">,</span> <span class="n">training</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">trainable</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">seq_len</span><span class="o">=</span><span class="n">SEQ_LEN</span><span class="p">,</span> <span class="p">)</span> <span class="n">inputs</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">inputs</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span> <span class="n">dense</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">get_layer</span><span class="p">(</span><span class="s1">'NSP-Dense'</span><span class="p">)</span><span class="o">.</span><span class="n">output</span> <span class="n">outputs</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span><span class="n">units</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s1">'softmax'</span><span class="p">)(</span><span class="n">dense</span><span class="p">)</span> <span class="n">model</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">models</span><span class="o">.</span><span class="n">Model</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">outputs</span><span class="p">)</span> <span class="n">model</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span> <span class="n">RAdam</span><span class="p">(</span><span class="n">lr</span><span class="o">=</span><span class="n">LR</span><span class="p">),</span> <span class="n">loss</span><span class="o">=</span><span class="s1">'sparse_categorical_crossentropy'</span><span class="p">,</span> <span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s1">'sparse_categorical_accuracy'</span><span class="p">],</span> <span class="p">)</span> <span class="n">history</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span> <span class="n">train_x</span><span class="p">,</span> <span class="n">train_y</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="n">EPOCHS</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">BATCH_SIZE</span><span class="p">,</span> <span class="n">validation_split</span><span 
class="o">=</span><span class="mf">0.20</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="p">)</span> </pre></div> </td></tr></table></div> <div class="block-code"><div class='row codeblock-header'>Bash</div><table class="highlight-syntax responsive-table codehighlight table"><tr><td class="linenos"><div class="linenodiv"><pre>1 2 3 4 5 6 7</pre></div></td><td class="code"><div class="highlight-syntax responsive-table codehighlight "><pre><span></span>Train on <span class="m">19993</span> samples, validate on <span class="m">4999</span> samples Epoch <span class="m">1</span>/3 <span class="m">19993</span>/19993 <span class="o">[==============================]</span> - 426s 21ms/sample - loss: <span class="m">0</span>.3789 - sparse_categorical_accuracy: <span class="m">0</span>.8250 - val_loss: <span class="m">0</span>.3106 - val_sparse_categorical_accuracy: <span class="m">0</span>.8666 Epoch <span class="m">2</span>/3 <span class="m">19993</span>/19993 <span class="o">[==============================]</span> - 410s 20ms/sample - loss: <span class="m">0</span>.2370 - sparse_categorical_accuracy: <span class="m">0</span>.9029 - val_loss: <span class="m">0</span>.2764 - val_sparse_categorical_accuracy: <span class="m">0</span>.8852 Epoch <span class="m">3</span>/3 <span class="m">19993</span>/19993 <span class="o">[==============================]</span> - 408s 20ms/sample - loss: <span class="m">0</span>.1392 - sparse_categorical_accuracy: <span class="m">0</span>.9472 - val_loss: <span class="m">0</span>.3310 - val_sparse_categorical_accuracy: <span class="m">0</span>.8898 </pre></div> </td></tr></table></div> <div class="block-paragraph"><div class="rich-text"><p>One the training is done, let us evaluate the model.</p></div></div> <div class="block-code"><div class='row codeblock-header'>Python</div><table class="highlight-xcode responsive-table codehighlight table"><tr><td class="linenos"><div class="linenodiv"><pre>1 2 3 4 5 6 7</pre></div></td><td class="code"><div class="highlight-xcode responsive-table codehighlight "><pre><span></span><span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">accuracy_score</span><span class="p">,</span> <span class="n">f1_score</span> <span class="n">predicts</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">test_x</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span> <span class="n">accuracy</span> <span class="o">=</span> <span class="n">accuracy_score</span><span class="p">(</span><span class="n">test_y</span><span class="p">,</span> <span class="n">predicts</span><span class="p">)</span> <span class="n">macro_f1</span> <span class="o">=</span> <span class="n">f1_score</span><span class="p">(</span><span class="n">test_y</span><span class="p">,</span> <span class="n">predicts</span><span class="p">,</span> <span class="n">average</span><span class="o">=</span><span class="s1">'macro'</span><span class="p">)</span> <span class="nb">print</span> <span class="p">(</span><span class="s2">"Accuracy: </span><span class="si">%s</span><span 
class="s2">"</span> <span class="o">%</span> <span class="n">accuracy</span><span class="p">)</span> <span class="nb">print</span> <span class="p">(</span><span class="s2">"macro_f1: </span><span class="si">%s</span><span class="s2">"</span> <span class="o">%</span> <span class="n">macro_f1</span><span class="p">)</span> </pre></div> </td></tr></table></div> <div class="block-code"><div class='row codeblock-header'>Bash</div><table class="highlight-syntax responsive-table codehighlight table"><tr><td class="linenos"><div class="linenodiv"><pre>1 2</pre></div></td><td class="code"><div class="highlight-syntax responsive-table codehighlight "><pre><span></span>Accuracy: <span class="m">0</span>.8842429577464789 macro_f1: <span class="m">0</span>.8841799318689518 </pre></div> </td></tr></table></div> <div class="block-paragraph"><div class="rich-text"><p>We could save the model with <i>model.save(modelname.h5).</i> The following code shows how to generate predictions.</p></div></div> <div class="block-code"><div class='row codeblock-header'>Python</div><table class="highlight-xcode responsive-table codehighlight table"><tr><td class="linenos"><div class="linenodiv"><pre> 1 2 3 4 5 6 7 8 9 10 11 12 13 14</pre></div></td><td class="code"><div class="highlight-xcode responsive-table codehighlight "><pre><span></span><span class="n">texts</span> <span class="o">=</span> <span class="p">[</span> <span class="s2">"It's a must watch"</span><span class="p">,</span> <span class="s2">"Can't wait for it's next part!"</span><span class="p">,</span> <span class="s1">'It fell short of expectations.'</span><span class="p">,</span> <span class="s1">'Wish there was more to it!'</span><span class="p">,</span> <span class="s1">'Just wow!'</span><span class="p">,</span> <span class="s1">'Colossial waste of time'</span><span class="p">,</span> <span class="s1">'Save youself from this 90 mins trauma!'</span> <span class="p">]</span> <span class="k">for</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">texts</span><span class="p">:</span> <span class="n">ids</span><span class="p">,</span> <span class="n">segments</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">max_len</span><span class="o">=</span><span class="n">SEQ_LEN</span><span class="p">)</span> <span class="n">inpu</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">ids</span><span class="p">)</span><span class="o">.</span><span class="n">reshape</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="n">SEQ_LEN</span><span class="p">])</span> <span class="n">predicted_id</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">([</span><span class="n">inpu</span><span class="p">,</span><span class="n">np</span><span class="o">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">inpu</span><span class="p">)])</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="nb">print</span> <span class="p">(</span><span class="s2">"</span><span 
class="si">%s</span><span class="s2">: </span><span class="si">%s</span><span class="s2">"</span><span class="o">%</span> <span class="p">(</span><span class="n">id_to_labels</span><span class="p">[</span><span class="n">predicted_id</span><span class="p">],</span> <span class="n">text</span><span class="p">))</span> </pre></div> </td></tr></table></div> <div class="block-code"><div class='row codeblock-header'>Bash</div><table class="highlight-syntax responsive-table codehighlight table"><tr><td class="linenos"><div class="linenodiv"><pre>1 2 3 4 5 6 7</pre></div></td><td class="code"><div class="highlight-syntax responsive-table codehighlight "><pre><span></span>positive: It<span class="s1">'s a must watch</span> <span class="s1">positive: Can'</span>t <span class="nb">wait</span> <span class="k">for</span> it<span class="err">'</span>s next part! negative: It fell short of expectations. positive: Wish there was more to it! positive: Just wow! negative: Colossial waste of <span class="nb">time</span> negative: Save youself from this <span class="m">90</span> mins trauma! </pre></div> </td></tr></table></div> <div class="block-paragraph"><div class="rich-text"><p><a href="https://colab.research.google.com/drive/14b2rbIgwhQ1BI-zkyiMjQv-jV85xj9tf" target="_blank" rel="noopener noreferrer"><b>Google Colab</b></a> <b>for IMDB sentiment analysis with BERT fine tuning.</b></p></div></div> <div class="block-paragraph"><div class="rich-text"><h2><a id="multi-class-text-classification-using-bert""></a><a class="body-link" href="#multi-class-text-classification-using-bert">Multi-class Text Classification Using BERT</a></h2><p>To demonstrate multi-class text classification we will use the <a href="http://qwone.com/~jason/20Newsgroups/" target="_blank" rel="noopener noreferrer">20-Newsgroup dataset</a>. 
It is a collection of about 20,000 newsgroup documents, spread evenly across 20 different newsgroups.</p><p></p><p>Let us first prepare the training and test datasets.</p></div></div> <div class="block-code"><div class='row codeblock-header'>Python</div><table class="highlight-xcode responsive-table codehighlight table"><tr><td class="linenos"><div class="linenodiv"><pre> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48</pre></div></td><td class="code"><div class="highlight-xcode responsive-table codehighlight "><pre><span></span><span class="n">dataset</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">keras</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">get_file</span><span class="p">(</span> <span class="n">fname</span><span class="o">=</span><span class="s2">"20news-18828.tar.gz"</span><span class="p">,</span> <span class="n">origin</span><span class="o">=</span><span class="s2">"http://qwone.com/~jason/20Newsgroups/20news-18828.tar.gz"</span><span class="p">,</span> <span class="n">extract</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="p">)</span> <span class="n">tokenizer</span> <span class="o">=</span> <span class="n">Tokenizer</span><span class="p">(</span><span class="n">token_dict</span><span class="p">)</span> <span class="k">def</span> <span class="nf">load_data</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">tagset</span><span class="p">):</span> <span class="k">global</span> <span class="n">tokenizer</span> <span class="n">indices</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="p">[],</span> <span class="p">[]</span> <span class="k">for</span> <span class="n">folder</span><span class="p">,</span> <span class="n">label</span> <span class="ow">in</span> <span class="n">tagset</span><span class="p">:</span> <span class="n">folder</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">folder</span><span class="p">)</span> <span class="k">for</span> <span class="n">name</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">(</span><span class="n">folder</span><span class="p">)):</span> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">folder</span><span class="p">,</span> <span class="n">name</span><span class="p">),</span> <span class="s1">'r'</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s2">"utf-8"</span><span class="p">,</span> <span class="n">errors</span><span class="o">=</span><span class="s1">'ignore'</span><span class="p">)</span> <span class="k">as</span> <span class="n">reader</span><span class="p">:</span> <span class="n">text</span> <span class="o">=</span> <span class="n">reader</span><span class="o">.</span><span class="n">read</span><span class="p">()</span> <span 
class="n">ids</span><span class="p">,</span> <span class="n">segments</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">max_len</span><span class="o">=</span><span class="n">SEQ_LEN</span><span class="p">)</span> <span class="n">indices</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">ids</span><span class="p">)</span> <span class="n">labels</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">label</span><span class="p">)</span> <span class="n">items</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">indices</span><span class="p">,</span> <span class="n">labels</span><span class="p">))</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">items</span><span class="p">)</span> <span class="n">indices</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">items</span><span class="p">)</span> <span class="n">indices</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">indices</span><span class="p">)</span> <span class="n">mod</span> <span class="o">=</span> <span class="n">indices</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">%</span> <span class="n">BATCH_SIZE</span> <span class="k">if</span> <span class="n">mod</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span> <span class="n">indices</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="n">indices</span><span class="p">[:</span><span class="o">-</span><span class="n">mod</span><span class="p">],</span> <span class="n">labels</span><span class="p">[:</span><span class="o">-</span><span class="n">mod</span><span class="p">]</span> <span class="k">return</span> <span class="p">[</span><span class="n">indices</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">indices</span><span class="p">)],</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">labels</span><span class="p">)</span> <span class="n">path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">dirname</span><span class="p">(</span><span class="n">dataset</span><span class="p">),</span> <span class="s1">'20news-18828'</span><span class="p">)</span> <span class="n">tagset</span> <span class="o">=</span> <span class="p">[(</span><span class="n">x</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span><span 
class="n">x</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">(</span><span class="n">path</span><span class="p">))]</span> <span class="n">id_to_labels</span> <span class="o">=</span> <span class="p">{</span><span class="n">id_</span><span class="p">:</span> <span class="n">label</span> <span class="k">for</span> <span class="n">label</span><span class="p">,</span> <span class="n">id_</span> <span class="ow">in</span> <span class="n">tagset</span><span class="p">}</span> <span class="c1"># Load data, split 80-20 for triaing/testing.</span> <span class="n">all_x</span><span class="p">,</span> <span class="n">all_y</span> <span class="o">=</span> <span class="n">load_data</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">tagset</span><span class="p">)</span> <span class="n">train_perc</span> <span class="o">=</span> <span class="mf">0.8</span> <span class="n">total</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">all_y</span><span class="p">)</span> <span class="n">n_train</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">train_perc</span> <span class="o">*</span> <span class="n">total</span><span class="p">)</span> <span class="n">n_test</span> <span class="o">=</span> <span class="p">(</span><span class="n">total</span> <span class="o">-</span> <span class="n">n_train</span><span class="p">)</span> <span class="n">test_x</span> <span class="o">=</span> <span class="p">[</span><span class="n">all_x</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="n">n_train</span><span class="p">:],</span> <span class="n">all_x</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="n">n_train</span><span class="p">:]]</span> <span class="n">train_x</span> <span class="o">=</span> <span class="p">[</span><span class="n">all_x</span><span class="p">[</span><span class="mi">0</span><span class="p">][:</span><span class="n">n_train</span><span class="p">],</span> <span class="n">all_x</span><span class="p">[</span><span class="mi">1</span><span class="p">][:</span><span class="n">n_train</span><span class="p">]]</span> <span class="n">train_y</span><span class="p">,</span> <span class="n">test_y</span> <span class="o">=</span> <span class="n">all_y</span><span class="p">[:</span><span class="n">n_train</span><span class="p">],</span> <span class="n">all_y</span><span class="p">[</span><span class="n">n_train</span><span class="p">:]</span> <span class="nb">print</span><span class="p">(</span><span class="s2">"# Total: </span><span class="si">%s</span><span class="s2">, # Train: </span><span class="si">%s</span><span class="s2">, # Test: </span><span class="si">%s</span><span class="s2">"</span> <span class="o">%</span> <span class="p">(</span><span class="n">total</span><span class="p">,</span> <span class="n">n_train</span><span class="p">,</span> <span class="n">n_test</span><span class="p">))</span> </pre></div> </td></tr></table></div> <div class="block-code"><div class='row codeblock-header'>Bash</div><table class="highlight-syntax responsive-table codehighlight table"><tr><td class="linenos"><div class="linenodiv"><pre>1</pre></div></td><td class="code"><div class="highlight-syntax responsive-table codehighlight "><pre><span></span><span 
class="c1"># Total: 18816, # Train: 15052, # Test: 3764</span> </pre></div> </td></tr></table></div> <div class="block-paragraph"><div class="rich-text"><p>Next, we build and train our model. We use the recommended BERT fine-tuning parameters and train our model for 4 epochs. The classification layer added on top of pre-trained BERT model is a fully-connected dense layer of size 20 (as 20 output classes) .</p></div></div> <div class="block-code"><div class='row codeblock-header'>Python</div><table class="highlight-xcode responsive-table codehighlight table"><tr><td class="linenos"><div class="linenodiv"><pre> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40</pre></div></td><td class="code"><div class="highlight-xcode responsive-table codehighlight "><pre><span></span><span class="c1">#pip install -q keras-bert keras-rectified-adam</span> <span class="c1"># Bert Model Constants</span> <span class="n">SEQ_LEN</span> <span class="o">=</span> <span class="mi">128</span> <span class="n">BATCH_SIZE</span> <span class="o">=</span> <span class="mi">16</span> <span class="n">EPOCHS</span> <span class="o">=</span> <span class="mi">4</span> <span class="n">LR</span> <span class="o">=</span> <span class="mf">2e-5</span> <span class="n">pretrained_path</span> <span class="o">=</span> <span class="s1">'uncased_L-12_H-768_A-12'</span> <span class="n">config_path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">pretrained_path</span><span class="p">,</span> <span class="s1">'bert_config.json'</span><span class="p">)</span> <span class="n">checkpoint_path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">pretrained_path</span><span class="p">,</span> <span class="s1">'bert_model.ckpt'</span><span class="p">)</span> <span class="n">vocab_path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">pretrained_path</span><span class="p">,</span> <span class="s1">'vocab.txt'</span><span class="p">)</span> <span class="n">model</span> <span class="o">=</span> <span class="n">load_trained_model_from_checkpoint</span><span class="p">(</span> <span class="n">config_path</span><span class="p">,</span> <span class="n">checkpoint_path</span><span class="p">,</span> <span class="n">training</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">trainable</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">seq_len</span><span class="o">=</span><span class="n">SEQ_LEN</span><span class="p">,</span> <span class="p">)</span> <span class="c1"># Add dense layer for classification</span> <span class="n">inputs</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">inputs</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span> <span class="n">dense</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">get_layer</span><span class="p">(</span><span class="s1">'NSP-Dense'</span><span class="p">)</span><span class="o">.</span><span 
class="n">output</span> <span class="n">outputs</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span><span class="n">units</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s1">'softmax'</span><span class="p">)(</span><span class="n">dense</span><span class="p">)</span> <span class="n">model</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">models</span><span class="o">.</span><span class="n">Model</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">outputs</span><span class="p">)</span> <span class="n">model</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span> <span class="n">RAdam</span><span class="p">(</span><span class="n">lr</span><span class="o">=</span><span class="n">LR</span><span class="p">),</span> <span class="n">loss</span><span class="o">=</span><span class="s1">'sparse_categorical_crossentropy'</span><span class="p">,</span> <span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s1">'sparse_categorical_accuracy'</span><span class="p">],</span> <span class="p">)</span> <span class="n">history</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span> <span class="n">train_x</span><span class="p">,</span> <span class="n">train_y</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="n">EPOCHS</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">BATCH_SIZE</span><span class="p">,</span> <span class="n">validation_split</span><span class="o">=</span><span class="mf">0.20</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="p">)</span> </pre></div> </td></tr></table></div> <div class="block-code"><div class='row codeblock-header'>Bash</div><table class="highlight-syntax responsive-table codehighlight table"><tr><td class="linenos"><div class="linenodiv"><pre>1 2 3 4 5 6 7 8 9</pre></div></td><td class="code"><div class="highlight-syntax responsive-table codehighlight "><pre><span></span>Train on <span class="m">12041</span> samples, validate on <span class="m">3011</span> samples Epoch <span class="m">1</span>/4 <span class="m">12041</span>/12041 <span class="o">[==============================]</span> - 765s 64ms/sample - loss: <span class="m">1</span>.6826 - sparse_categorical_accuracy: <span class="m">0</span>.5052 - val_loss: <span class="m">0</span>.6773 - val_sparse_categorical_accuracy: <span class="m">0</span>.7948 Epoch <span class="m">2</span>/4 <span class="m">12041</span>/12041 <span class="o">[==============================]</span> - 749s 62ms/sample - loss: <span class="m">0</span>.4951 - sparse_categorical_accuracy: <span class="m">0</span>.8481 - val_loss: <span class="m">0</span>.4421 - val_sparse_categorical_accuracy: <span class="m">0</span>.8698 Epoch <span class="m">3</span>/4 <span class="m">12041</span>/12041 <span class="o">[==============================]</span> - 748s 62ms/sample - loss: <span class="m">0</span>.2534 - sparse_categorical_accuracy: <span class="m">0</span>.9239 - val_loss: <span 
class="m">0</span>.3752 - val_sparse_categorical_accuracy: <span class="m">0</span>.8947 Epoch <span class="m">4</span>/4 <span class="m">12041</span>/12041 <span class="o">[==============================]</span> - 746s 62ms/sample - loss: <span class="m">0</span>.1386 - sparse_categorical_accuracy: <span class="m">0</span>.9588 - val_loss: <span class="m">0</span>.3471 - val_sparse_categorical_accuracy: <span class="m">0</span>.9083 </pre></div> </td></tr></table></div> <div class="block-paragraph"><div class="rich-text"><p>Once we have our model train, let us evaluate and use for muti-class labelling.</p></div></div> <div class="block-code"><div class='row codeblock-header'>Python</div><table class="highlight-xcode responsive-table codehighlight table"><tr><td class="linenos"><div class="linenodiv"><pre>1 2 3 4</pre></div></td><td class="code"><div class="highlight-xcode responsive-table codehighlight "><pre><span></span><span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">accuracy_score</span><span class="p">,</span> <span class="n">f1_score</span> <span class="n">predicts</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">test_x</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span> <span class="n">accuracy</span> <span class="o">=</span> <span class="n">accuracy_score</span><span class="p">(</span><span class="n">test_y</span><span class="p">,</span> <span class="n">predicts</span><span class="p">)</span> <span class="n">macro_f1</span> <span class="o">=</span> <span class="n">f1_score</span><span class="p">(</span><span class="n">test_y</span><span class="p">,</span> <span class="n">predicts</span><span class="p">,</span> <span class="n">average</span><span class="o">=</span><span class="s1">'macro'</span><span class="p">)</span> </pre></div> </td></tr></table></div> <div class="block-code"><div class='row codeblock-header'>Bash</div><table class="highlight-syntax responsive-table codehighlight table"><tr><td class="linenos"><div class="linenodiv"><pre>1 2</pre></div></td><td class="code"><div class="highlight-syntax responsive-table codehighlight "><pre><span></span>Accuracy: <span class="m">0</span>.9024973432518597 macro_f1: <span class="m">0</span>.9001928370898599 </pre></div> </td></tr></table></div> <div class="block-paragraph"><div class="rich-text"><p>Predict newsgroup labels with the trained model.</p></div></div> <div class="block-code"><div class='row codeblock-header'>Python</div><table class="highlight-xcode responsive-table codehighlight table"><tr><td class="linenos"><div class="linenodiv"><pre> 1 2 3 4 5 6 7 8 9 10 11 12</pre></div></td><td class="code"><div class="highlight-xcode responsive-table codehighlight "><pre><span></span><span class="n">texts</span> <span class="o">=</span> <span class="p">[</span> <span class="s1">'Who scored the maximum goals?'</span><span class="p">,</span> <span class="s1">'Mars might have water and dragons!'</span><span class="p">,</span> <span class="s1">'CPU is over-clocked, causing it to heating too much!'</span><span class="p">,</span> <span class="s1">'I need to buy new prescriptions.'</span><span 
class="p">,</span> <span class="s1">'This is just government propaganda.'</span> <span class="p">]</span> <span class="k">for</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">texts</span><span class="p">:</span> <span class="n">ids</span><span class="p">,</span> <span class="n">segments</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">max_len</span><span class="o">=</span><span class="n">SEQ_LEN</span><span class="p">)</span> <span class="n">inpu</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">ids</span><span class="p">)</span><span class="o">.</span><span class="n">reshape</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="n">SEQ_LEN</span><span class="p">])</span> <span class="n">predicted_id</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">([</span><span class="n">inpu</span><span class="p">,</span><span class="n">np</span><span class="o">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">inpu</span><span class="p">)])</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="nb">print</span> <span class="p">(</span><span class="s2">"</span><span class="si">%s</span><span class="s2">: </span><span class="si">%s</span><span class="s2">"</span><span class="o">%</span> <span class="p">(</span><span class="n">id_to_labels</span><span class="p">[</span><span class="n">predicted_id</span><span class="p">],</span> <span class="n">text</span><span class="p">))</span> </pre></div> </td></tr></table></div> <div class="block-code"><div class='row codeblock-header'>Bash</div><table class="highlight-syntax responsive-table codehighlight table"><tr><td class="linenos"><div class="linenodiv"><pre>1 2 3 4 5 6</pre></div></td><td class="code"><div class="highlight-syntax responsive-table codehighlight "><pre><span></span>rec.sport.hockey: Who scored the maximum goals? sci.space: Mars might have water and dragons! comp.sys.ibm.pc.hardware: CPU is over-clocked, causing it to heating too much! sci.med: I need to buy new prescriptions. talk.politics.misc: This is just government propaganda. talk.politics.misc: This is just government propaganda. </pre></div> </td></tr></table></div> <div class="block-paragraph"><div class="rich-text"><p><a href="https://colab.research.google.com/drive/1VuPv_SInihZIO9gwy1p0YqYQy76bwBuS" target="_blank" rel="noopener noreferrer"><b>Google Colab</b></a> <b>for 20 Newsgroup Multi-class Text Classification using BERT</b></p></div></div> <div class="block-paragraph"><div class="rich-text"><h2><a id="multilabel-text-classification-using-bert""></a><a class="body-link" href="#multilabel-text-classification-using-bert">Multilabel Text Classification Using BERT</a></h2><p>To demonstrate multi-label text classification we will use <a href="https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge" target="_blank" rel="noopener noreferrer">Toxic Comment Classification dataset</a>. 
It is a dataset on Kaggle, with Wikipedia comments which have been labeled by human raters for toxic behaviour. The different types o toxicity are: toxic, severe_toxic, obscene, threat, insult and identity_hate. Each comment can have either none or one or more type of toxicity. The dataset has over 100,000 labelled data, but for this tutorial we will use 25% of it to keep training memory and time requirements manageable.</p><p></p><p>Let us first build the training and test datasets.</p></div></div> <div class="block-code"><div class='row codeblock-header'>Python</div><table class="highlight-xcode responsive-table codehighlight table"><tr><td class="linenos"><div class="linenodiv"><pre> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44</pre></div></td><td class="code"><div class="highlight-xcode responsive-table codehighlight "><pre><span></span><span class="kn">from</span> <span class="nn">google.colab</span> <span class="kn">import</span> <span class="n">drive</span> <span class="n">drive</span><span class="o">.</span><span class="n">mount</span><span class="p">(</span><span class="s1">'/content/gdrive'</span><span class="p">)</span> <span class="n">RESOUCE_DIR</span> <span class="o">=</span> <span class="s2">"/content/gdrive/My\ Drive/resources"</span> <span class="c1"># Train/test Files</span> <span class="n">datasets_dir</span> <span class="o">=</span> <span class="s2">"</span><span class="si">%s</span><span class="s2">/datasets/jigsaw-toxic-comment-classification-challenge"</span> <span class="o">%</span> <span class="p">(</span><span class="n">RESOUCE_DIR</span><span class="p">)</span> <span class="n">test_datapath</span> <span class="o">=</span> <span class="s2">"</span><span class="si">%s</span><span class="s2">/test.csv"</span> <span class="o">%</span> <span class="p">(</span><span class="n">datasets_dir</span><span class="p">)</span> <span class="n">test_labels</span> <span class="o">=</span> <span class="s2">"</span><span class="si">%s</span><span class="s2">/test_labels.csv"</span> <span class="o">%</span> <span class="p">(</span><span class="n">datasets_dir</span><span class="p">)</span> <span class="n">train_datapath</span> <span class="o">=</span> <span class="s2">"</span><span class="si">%s</span><span class="s2">/train.csv"</span> <span class="o">%</span> <span class="p">(</span><span class="n">datasets_dir</span><span class="p">)</span> <span class="n">tokenizer</span> <span class="o">=</span> <span class="n">Tokenizer</span><span class="p">(</span><span class="n">token_dict</span><span class="p">)</span> <span class="k">def</span> <span class="nf">load_data</span><span class="p">(</span><span class="n">comments</span><span class="p">,</span> <span class="n">comment_labels</span><span class="p">):</span> <span class="k">global</span> <span class="n">tokenizer</span> <span class="n">indices</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="p">[],</span> <span class="p">[]</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">comments</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]):</span> <span class="n">ids</span><span class="p">,</span> <span class="n">segments</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span 
class="n">encode</span><span class="p">(</span><span class="n">comments</span><span class="p">[</span><span class="n">x</span><span class="p">],</span> <span class="n">max_len</span><span class="o">=</span><span class="n">SEQ_LEN</span><span class="p">)</span> <span class="n">indices</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">ids</span><span class="p">)</span> <span class="n">labels</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">comment_labels</span><span class="p">[</span><span class="n">x</span><span class="p">])</span> <span class="n">items</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">indices</span><span class="p">,</span> <span class="n">labels</span><span class="p">))</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">items</span><span class="p">)</span> <span class="n">indices</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">items</span><span class="p">)</span> <span class="n">indices</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">indices</span><span class="p">)</span> <span class="n">mod</span> <span class="o">=</span> <span class="n">indices</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">%</span> <span class="n">BATCH_SIZE</span> <span class="k">if</span> <span class="n">mod</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span> <span class="n">indices</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="n">indices</span><span class="p">[:</span><span class="o">-</span><span class="n">mod</span><span class="p">],</span> <span class="n">labels</span><span class="p">[:</span><span class="o">-</span><span class="n">mod</span><span class="p">]</span> <span class="k">return</span> <span class="p">[</span><span class="n">indices</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">indices</span><span class="p">)],</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">labels</span><span class="p">)</span> <span class="n">train_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">train_datapath</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">'</span><span class="se">\\</span><span class="s1">'</span><span class="p">,</span> <span class="s1">''</span><span class="p">))</span> <span class="n">train_df</span> <span class="o">=</span> <span class="n">train_df</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">frac</span><span class="o">=</span><span class="mf">0.25</span><span class="p">,</span><span class="n">random_state</span> <span class="o">=</span> <span class="mi">42</span><span 
class="p">)</span> <span class="n">train_lines</span> <span class="o">=</span> <span class="n">train_df</span><span class="p">[</span><span class="s1">'comment_text'</span><span class="p">]</span><span class="o">.</span><span class="n">values</span> <span class="n">labels_ordered</span> <span class="o">=</span> <span class="p">[</span> <span class="s1">'toxic'</span><span class="p">,</span> <span class="s1">'severe_toxic'</span><span class="p">,</span> <span class="s1">'obscene'</span><span class="p">,</span> <span class="s1">'threat'</span><span class="p">,</span> <span class="s1">'insult'</span><span class="p">,</span> <span class="s1">'identity_hate'</span> <span class="p">]</span> <span class="n">train_labels</span> <span class="o">=</span> <span class="n">train_df</span><span class="p">[</span><span class="n">labels_ordered</span><span class="p">]</span><span class="o">.</span><span class="n">values</span> <span class="n">train_x</span><span class="p">,</span> <span class="n">train_y</span> <span class="o">=</span> <span class="n">load_data</span><span class="p">(</span><span class="n">train_lines</span><span class="p">,</span> <span class="n">train_labels</span><span class="p">)</span> </pre></div> </td></tr></table></div> <div class="block-paragraph"><div class="rich-text"><p>Next we build model and train it. The multi-label classification layer is a fully-connected dense layer of size 6 (6 possible labels), and we use sigmoid activation function to get independent probabilities of each class.</p></div></div> <div class="block-code"><div class='row codeblock-header'>Python</div><table class="highlight-xcode responsive-table codehighlight table"><tr><td class="linenos"><div class="linenodiv"><pre> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31</pre></div></td><td class="code"><div class="highlight-xcode responsive-table codehighlight "><pre><span></span><span class="n">model</span> <span class="o">=</span> <span class="n">load_trained_model_from_checkpoint</span><span class="p">(</span> <span class="n">config_path</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">'</span><span class="se">\\</span><span class="s1">'</span><span class="p">,</span> <span class="s1">''</span><span class="p">),</span> <span class="n">checkpoint_path</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">'</span><span class="se">\\</span><span class="s1">'</span><span class="p">,</span> <span class="s1">''</span><span class="p">),</span> <span class="n">training</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">trainable</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">seq_len</span><span class="o">=</span><span class="n">SEQ_LEN</span><span class="p">,</span> <span class="p">)</span> <span class="c1"># Add dense layer for classification</span> <span class="n">inputs</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">inputs</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span> <span class="n">dense</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">get_layer</span><span class="p">(</span><span class="s1">'NSP-Dense'</span><span class="p">)</span><span class="o">.</span><span class="n">output</span> <span class="n">outputs</span> <span class="o">=</span> 
<span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span> <span class="n">units</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">labels_ordered</span><span class="p">),</span> <span class="n">activation</span><span class="o">=</span><span class="s1">'sigmoid'</span><span class="p">,</span> <span class="n">name</span> <span class="o">=</span> <span class="s1">'Toxic-Categories-Dense'</span> <span class="p">)(</span><span class="n">dense</span><span class="p">)</span> <span class="n">model</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">models</span><span class="o">.</span><span class="n">Model</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">outputs</span><span class="p">)</span> <span class="n">model</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span> <span class="n">RAdam</span><span class="p">(</span><span class="n">lr</span><span class="o">=</span><span class="n">LR</span><span class="p">),</span> <span class="n">loss</span><span class="o">=</span><span class="s1">'binary_crossentropy'</span><span class="p">,</span> <span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s1">'accuracy'</span><span class="p">],</span> <span class="p">)</span> <span class="n">history</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span> <span class="n">train_x</span><span class="p">,</span> <span class="n">train_y</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="n">EPOCHS</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">BATCH_SIZE</span><span class="p">,</span> <span class="n">validation_split</span><span class="o">=</span><span class="mf">0.33</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="p">)</span> </pre></div> </td></tr></table></div> <div class="block-code"><div class='row codeblock-header'>Bash</div><table class="highlight-syntax responsive-table codehighlight table"><tr><td class="linenos"><div class="linenodiv"><pre>1 2 3 4 5</pre></div></td><td class="code"><div class="highlight-syntax responsive-table codehighlight "><pre><span></span>Train on <span class="m">26724</span> samples, validate on <span class="m">13164</span> samples Epoch <span class="m">1</span>/2 <span class="m">26724</span>/26724 <span class="o">[==============================]</span> - 1251s 47ms/sample - loss: <span class="m">0</span>.0858 - acc: <span class="m">0</span>.9660 - val_loss: <span class="m">0</span>.0450 - val_acc: <span class="m">0</span>.9822 Epoch <span class="m">2</span>/2 <span class="m">26724</span>/26724 <span class="o">[==============================]</span> - 1235s 46ms/sample - loss: <span class="m">0</span>.0404 - acc: <span class="m">0</span>.9845 - val_loss: <span class="m">0</span>.0431 - val_acc: <span class="m">0</span>.9827 </pre></div> </td></tr></table></div> <div class="block-paragraph"><div class="rich-text"><p>We see that in just 2 epoch, our model achieved a 98% accuracy on the validation set. 
<div class="block-paragraph"><div class="rich-text"><p>We can save this model and use it to generate labels as follows:</p></div></div> <div class="block-code"><div class='row codeblock-header'>Python</div><pre>
texts = [
    'You are an idiot!',
    'You are a drug addict!',
    'I will kill you!',
    'I want to goto London',
]

for text in texts:
    ids, segments = tokenizer.encode(text, max_len=SEQ_LEN)
    inpu = np.array(ids).reshape([1, SEQ_LEN])
    predicted = (model.predict([inpu, np.zeros_like(inpu)]) >= 0.5).astype(int)
    labels = [
        label for i, label in enumerate(labels_ordered)
        if predicted[0][i]
    ]
    print("%s: %s" % (text, labels))
</pre></div> <div class="block-code"><div class='row codeblock-header'>Bash</div><pre>
You are an idiot!: ['toxic', 'obscene', 'insult']
You are a drug addict!: ['toxic']
I will kill you!: ['toxic', 'threat']
I want to goto London: []
</pre></div>
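<div class="block-paragraph"><div class="rich-text"><p>The original Colab does not show the saving step itself, so here is a minimal sketch of one way to persist the fine-tuned weights with plain Keras calls; the file name is arbitrary and our own.</p></div></div> <div class="block-code"><div class='row codeblock-header'>Python</div><pre>
# Save only the fine-tuned weights; the architecture can be rebuilt later from the
# BERT checkpoint plus the same 6-unit sigmoid layer defined above.
model.save_weights('bert_toxic_weights.h5')

# Later / in a new session: rebuild the identical model with
# load_trained_model_from_checkpoint + the dense layer, then restore the weights:
# model.load_weights('bert_toxic_weights.h5')
</pre></div>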
codehighlight "><pre><span></span>You are an idiot!: <span class="o">[</span><span class="s1">'toxic'</span>, <span class="s1">'obscene'</span>, <span class="s1">'insult'</span><span class="o">]</span> You are a drug addict!: <span class="o">[</span><span class="s1">'toxic'</span><span class="o">]</span> I will <span class="nb">kill</span> you!: <span class="o">[</span><span class="s1">'toxic'</span>, <span class="s1">'threat'</span><span class="o">]</span> I want to goto London: <span class="o">[]</span> </pre></div> </td></tr></table></div> <div class="block-paragraph"><div class="rich-text"><p><a href="https://colab.research.google.com/drive/1UEnLAFs1Hrr1NCCQ2Apu5CtRZSK4DHDi" target="_blank" rel="noopener noreferrer"><b>Google Colab</b></a> <b>for Toxic Comment Classification with BERT fine tuning.</b></p></div></div> <div class="block-paragraph"><div class="rich-text"><h2><a id="conclusion""></a><a class="body-link" href="#conclusion">Conclusion</a></h2><p>In this tutorial, we learnt how to use BERT with fine tuning for text classification. We saw that how using the pre-trained BERT model and just one additional classification layer, we can achieve high classification accuracy for different text classification tasks. BERT proves to be a very powerful language model and can be of immense value for text classification tasks.</p></div></div> <div class="block-paragraph"><div class="rich-text"><h2><a id="resources-amp-references""></a><a class="body-link" href="#resources-amp-references">Resources & References</a></h2><ol><li><a href="https://arxiv.org/abs/1810.04805" target="_blank" rel="noopener noreferrer">Paper on BERT</a></li><li><a href="https://pypi.org/project/keras-bert/" target="_blank" rel="noopener noreferrer">keras-bert</a></li><li><a href="https://colab.research.google.com/drive/14b2rbIgwhQ1BI-zkyiMjQv-jV85xj9tf" target="_blank" rel="noopener noreferrer">Google Colab for IMDB sentiment analysis with BERT fine tuning</a></li><li><a href="https://colab.research.google.com/drive/1VuPv_SInihZIO9gwy1p0YqYQy76bwBuS" target="_blank" rel="noopener noreferrer">Google Colab For 20 Newsgroup Multi-class Text Classification using BERT</a></li><li><a href="https://colab.research.google.com/drive/1UEnLAFs1Hrr1NCCQ2Apu5CtRZSK4DHDi" target="_blank" rel="noopener noreferrer">Google Colab for Toxic Comment Classification with BERT fine tuning.</a></li></ol><p></p></div></div>How to Reverse Python Lists | In-place, slicing & reversed()2020-05-18T17:55:01.148382+00:002020-05-18T18:24:22.321324+00:00https://example.com/python-tutorials/how-reverse-python-lists-in-place-slicing-reversed/Foo<div class="block-paragraph"><div class="rich-text"><p>Python lists can be reversed using built-in methods reverse(), reversed() or by [::-1] list slicing technique. The reverse() built-in method reverses the list in place while the slicing technique creates a copy of the original list. The reversed() method simply returns a list iterator that returns elements in reverse order.</p></div></div> <div class="block-paragraph"><div class="rich-text"><p>Below are the three built-in, common method used for reversing Python lists.</p></div></div> <div class="block-paragraph"><div class="rich-text"><h4><b>1. 
Reversing lists in-place using reverse()</b></h4></div></div> <div class="block-code"><div class='row codeblock-header'>Bash</div><table class="highlight-xcode responsive-table codehighlight table"><tr><td class="linenos"><div class="linenodiv"><pre>1 2 3 4 5 6</pre></div></td><td class="code"><div class="highlight-xcode responsive-table codehighlight "><pre><span></span>>>> <span class="nv">nums</span> <span class="o">=</span> <span class="o">[</span><span class="m">1</span>,2,3,4,5,6,7,8<span class="o">]</span> >>> type<span class="o">(</span>nums.reverse<span class="o">())</span> <<span class="nb">type</span> <span class="s1">'NoneType'</span>> >>> nums <span class="o">[</span><span class="m">8</span>, <span class="m">7</span>, <span class="m">6</span>, <span class="m">5</span>, <span class="m">4</span>, <span class="m">3</span>, <span class="m">2</span>, <span class="m">1</span><span class="o">]</span> >>> </pre></div> </td></tr></table></div> <div class="block-paragraph"><div class="rich-text"><h4><b>2. Reversing lists using slicing (creates a new copy)</b></h4></div></div> <div class="block-code"><div class='row codeblock-header'>Bash</div><table class="highlight-xcode responsive-table codehighlight table"><tr><td class="linenos"><div class="linenodiv"><pre> 1 2 3 4 5 6 7 8 9 10 11</pre></div></td><td class="code"><div class="highlight-xcode responsive-table codehighlight "><pre><span></span>>>> <span class="nv">nums</span> <span class="o">=</span> <span class="o">[</span><span class="m">1</span>,2,3,4,5,6,7,8<span class="o">]</span> >>> >>> <span class="nv">nums_reversed</span> <span class="o">=</span> nums<span class="o">[</span>::-1<span class="o">]</span> >>> nums_reversed <span class="o">[</span><span class="m">8</span>, <span class="m">7</span>, <span class="m">6</span>, <span class="m">5</span>, <span class="m">4</span>, <span class="m">3</span>, <span class="m">2</span>, <span class="m">1</span><span class="o">]</span> >>> type<span class="o">(</span>nums_reversed<span class="o">)</span> <<span class="nb">type</span> <span class="s1">'list'</span>> >>> >>> nums <span class="o">[</span><span class="m">1</span>, <span class="m">2</span>, <span class="m">3</span>, <span class="m">4</span>, <span class="m">5</span>, <span class="m">6</span>, <span class="m">7</span>, <span class="m">8</span><span class="o">]</span> >>> </pre></div> </td></tr></table></div> <div class="block-paragraph"><div class="rich-text"><h4><b>3. 
Reversing lists using reversed</b></h4></div></div> <div class="block-code"><div class='row codeblock-header'>Bash</div><table class="highlight-xcode responsive-table codehighlight table"><tr><td class="linenos"><div class="linenodiv"><pre>1 2 3 4 5 6 7</pre></div></td><td class="code"><div class="highlight-xcode responsive-table codehighlight "><pre><span></span>>>> <span class="nv">nums</span> <span class="o">=</span> <span class="o">[</span><span class="m">1</span>,2,3,4,5,6,7,8<span class="o">]</span> >>> reversed<span class="o">(</span>nums<span class="o">)</span> <listreverseiterator object at 0x10fced990> >>> >>> <span class="o">[</span>n <span class="k">for</span> n in reversed<span class="o">(</span>nums<span class="o">)]</span> <span class="o">[</span><span class="m">8</span>, <span class="m">7</span>, <span class="m">6</span>, <span class="m">5</span>, <span class="m">4</span>, <span class="m">3</span>, <span class="m">2</span>, <span class="m">1</span><span class="o">]</span> >>> </pre></div> </td></tr></table></div> <div class="block-paragraph"><div class="rich-text"><p>Let us look at each in detail to understand pros, cons and when to use a particular method.</p></div></div> <div class="block-paragraph"><div class="rich-text"><h2><a id="using-reverse-for-in-place-list-reversal""></a><a class="body-link" href="#using-reverse-for-in-place-list-reversal">Using reverse() for In-Place List Reversal</a></h2></div></div> <div class="block-code"><div class='row codeblock-header'>Bash</div><table class="highlight-xcode responsive-table codehighlight table"><tr><td class="linenos"><div class="linenodiv"><pre>1 2 3 4 5 6</pre></div></td><td class="code"><div class="highlight-xcode responsive-table codehighlight "><pre><span></span>>>> <span class="nv">nums</span> <span class="o">=</span> <span class="o">[</span><span class="m">1</span>,2,3,4,5,6,7,8<span class="o">]</span> >>> type<span class="o">(</span>nums.reverse<span class="o">())</span> <<span class="nb">type</span> <span class="s1">'NoneType'</span>> >>> nums <span class="o">[</span><span class="m">8</span>, <span class="m">7</span>, <span class="m">6</span>, <span class="m">5</span>, <span class="m">4</span>, <span class="m">3</span>, <span class="m">2</span>, <span class="m">1</span><span class="o">]</span> >>> </pre></div> </td></tr></table></div> <div class="block-paragraph"><div class="rich-text"><h4><b>Time and Space Complexity of Python List reverse()</b></h4><p>The reverse() method works in O(n) time complexity and with O(1) space. Internally, when reverse() is called it operates by swapping i-th element with (n-i)th element. Therefore, the first element is replaced with the last element, the second element is replaced with the second last element and so on. Thus, a total of N/2 swap operations are required for list reversal. 
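<div class="block-paragraph"><div class="rich-text"><p>To make the N/2-swap argument concrete, here is a small pure-Python sketch of the same in-place swapping strategy (the built-in reverse() is implemented in C, so this is purely illustrative):</p></div></div> <div class="block-code"><div class='row codeblock-header'>Python</div><pre>
def reverse_in_place(nums):
    # Swap the i-th element with its mirror element from the end,
    # moving the two pointers towards the middle: N/2 swaps in total.
    left, right = 0, len(nums) - 1
    while left < right:
        nums[left], nums[right] = nums[right], nums[left]
        left += 1
        right -= 1

nums = [1, 2, 3, 4, 5, 6, 7, 8]
reverse_in_place(nums)
print(nums)   # [8, 7, 6, 5, 4, 3, 2, 1]
</pre></div>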
<div class="block-paragraph"><div class="rich-text"><h4><b>Pros of the reverse() method:</b></h4><ul><li>In-place: no extra memory is required.</li><li>Intuitive and easy to understand; it upholds code readability.</li></ul><h4><b>Cons of the reverse() method:</b></h4><ul><li>The order of elements in the original list is changed.</li></ul><h4><b>When to use the reverse() method?</b></h4><p>Scenarios where the order of elements in the original list may be altered and a low memory footprint is desired.</p></div></div> <div class="block-paragraph"><div class="rich-text"><h2><a id="using-slicing-for-python-list-reversal"></a><a class="body-link" href="#using-slicing-for-python-list-reversal">Using Slicing For Python List Reversal</a></h2></div></div> <div class="block-paragraph"><div class="rich-text"><p>Python lists can be reversed using the [::-1] slicing suffix. It creates and returns a new, reversed copy of the list without altering the actual list.</p></div></div> <div class="block-code"><div class='row codeblock-header'>Bash</div><pre>
>>> nums = [1,2,3,4,5,6,7,8]
>>>
>>> nums_reversed = nums[::-1]
>>> nums_reversed
[8, 7, 6, 5, 4, 3, 2, 1]
>>> type(nums_reversed)
<type 'list'>
>>>
>>> nums
[1, 2, 3, 4, 5, 6, 7, 8]
>>>
</pre></div> <div class="block-paragraph"><div class="rich-text"><p>What does the [::-1] notation mean? The list slicing notation is [start:end:step]. Here start and end are left empty and step is -1, which means: walk through the list with a stride of negative one, i.e., in reverse order. With a negative step, the omitted start and end default to the last element and to one position before the first element, so the whole list is traversed from back to front.</p></div></div> <div class="block-paragraph"><div class="rich-text"><p>What are the pros and cons of using slicing for list reversal, and when should we prefer slicing over reverse() or reversed()?</p></div></div> <div class="block-paragraph"><div class="rich-text"><h4><b>Pros of list slicing for list reversal:</b></h4><ul><li>The original list is not altered. The order of elements in the original list is maintained before and after the slicing operation.</li></ul></div></div> <div class="block-paragraph"><div class="rich-text"><h4><b>Cons:</b></h4><ul><li>Takes extra space by creating a new list of the same size.</li><li>While the [::-1] notation is shorter, it is cryptic and requires more attention to understand compared to the English-word syntax of reverse() or reversed(). In short, not the best for code readability.</li></ul></div></div> <div class="block-paragraph"><div class="rich-text"><h4><b>When to use slicing for Python list reversal:</b></h4><ul><li>If it is a requirement to preserve the order of elements in the original list.</li><li>If it is fine to allocate extra memory for the copy of the list.</li></ul></div></div> <div class="block-paragraph"><div class="rich-text"><h2><a id="using-reversed-for-python-list-reversal"></a><a class="body-link" href="#using-reversed-for-python-list-reversal">Using reversed() for Python list reversal</a></h2><p>Python lists can also be reversed using the built-in reversed() method. The reversed() method neither reverses the list in place nor creates a copy of the full list. It instead returns a list iterator (<i>listreverseiterator</i>) that generates the elements of the list in reverse order.</p></div></div> <div class="block-code"><div class='row codeblock-header'>Bash</div><pre>
>>> nums = [1,2,3,4,5,6,7,8]
>>> reversed(nums)
<listreverseiterator object at 0x10fced990>
>>>
>>> [n for n in reversed(nums)]
[8, 7, 6, 5, 4, 3, 2, 1]
>>>
>>>
>>> def reverse_python_list(nums):
...     for num in reversed(nums):
...         yield num
...
>>>
>>> list(reverse_python_list([1,2,3,4,5,6,7,8]))
[8, 7, 6, 5, 4, 3, 2, 1]
>>>
</pre></div> <div class="block-paragraph"><div class="rich-text"><p>Note that calling reversed(<i>nums</i>) simply returns an iterator object. We can see in the example above that the <i>reverse_python_list</i> method, which simply wraps the reversed() method, does not modify the original list or create a copy of it.</p></div></div> <div class="block-paragraph"><div class="rich-text"><h4><b>Pros of reversed() for list reversal:</b></h4><ul><li>No extra space is required.</li><li>The original list remains unchanged.</li><li>The syntax aids code readability.</li></ul><h4><b>Cons:</b></h4><ul><li>None really. Just that extra caution needs to be exercised with iterators: the returned iterator can be used only once (it gets exhausted after being looped over once). So, if the reversed sequence needs to be accessed multiple times, we need to create a copy of the list or call reversed() again, as demonstrated below.</li></ul></div></div>
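<div class="block-paragraph"><div class="rich-text"><p>The snippet below demonstrates that exhaustion: the second pass over the same iterator yields nothing, so reversed() has to be called again (or its result copied into a list) for repeated use.</p></div></div> <div class="block-code"><div class='row codeblock-header'>Bash</div><pre>
>>> nums = [1, 2, 3, 4, 5, 6, 7, 8]
>>> rev = reversed(nums)
>>> list(rev)
[8, 7, 6, 5, 4, 3, 2, 1]
>>> list(rev)              # the iterator is now exhausted
[]
>>> list(reversed(nums))   # call reversed() again for another pass
[8, 7, 6, 5, 4, 3, 2, 1]
>>>
</pre></div>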
<div class="block-paragraph"><div class="rich-text"><h2><a id="common-list-reversal-problems"></a><a class="body-link" href="#common-list-reversal-problems">Common List Reversal Problems</a></h2><p>Let us take a look at a few other common Python list reversal problems.</p></div></div> <div class="block-paragraph"><div class="rich-text"><h3><a id="how-to-reverse-a-list-in-python-using-for-loop"></a><a class="body-link" href="#how-to-reverse-a-list-in-python-using-for-loop">How to reverse a list in python using for loop ?</a></h3><p>To reverse a list of size n using a for loop, iterate over the indices from the (n-1)-th element down to the first element and yield each element.</p></div></div> <div class="block-code"><div class='row codeblock-header'>Bash</div><pre>
>>> def reverse_list_using_for(nums):
...     # Traverse the indices [n-1, -1), i.e. in the opposite direction.
...     for i in range(len(nums)-1, -1, -1):
...         yield nums[i]
...
>>>
>>> print list(reverse_list_using_for([1,2,3,4,5,6,7]))
[7, 6, 5, 4, 3, 2, 1]
>>>
</pre></div> <div class="block-paragraph"><div class="rich-text"><h3><a id="how-to-reverse-python-list-using-recursion"></a><a class="body-link" href="#how-to-reverse-python-list-using-recursion">How to reverse python list using recursion ?</a></h3><p>To reverse a list using recursion, we define a method that returns the concatenation of two lists: the first containing just the last element (selected with the -1 index), and the second being the reverse of the rest of the list up to the last element (selected with :-1). The base condition is met when all the elements are exhausted and the list is empty, upon which we return an empty list. Note that each recursive call slices off a new copy of the list, so this approach uses extra memory and is limited by the recursion depth for long lists. Below is the working code.</p></div></div> <div class="block-code"><div class='row codeblock-header'>Bash</div><pre>
>>> def reverse_list_using_recursion(nums):
...     if not nums:
...         return []
...     return [nums[-1]] + reverse_list_using_recursion(nums[:-1])
...
>>>
>>> print reverse_list_using_recursion([1,2,3,4,5,6,7])
[7, 6, 5, 4, 3, 2, 1]
>>>
</pre></div> <div class="block-paragraph"><div class="rich-text"><h3><a id="how-to-reverse-partsubset-or-slice-of-python-list"></a><a class="body-link" href="#how-to-reverse-partsubset-or-slice-of-python-list">How to reverse part (subset or slice) of a Python list?</a></h3><p>To reverse a part of a list, the built-in reverse(), reversed() or slicing methods can be used on the subset identified by slicing. The following code shows all three approaches:</p></div></div> <div class="block-code"><div class='row codeblock-header'>Bash</div><pre>
>>> nums = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> # Method 1: Using slicing
>>> nums[3:8][::-1]
[8, 7, 6, 5, 4]
>>>
>>> # Method 2: Using reverse()
>>> nums_subset = nums[3:8]
>>> nums_subset.reverse()
>>> nums_subset
[8, 7, 6, 5, 4]
>>>
>>> # Method 3: Using reversed()
>>> list(reversed(nums[3:8]))
[8, 7, 6, 5, 4]
>>>
</pre></div>
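<div class="block-paragraph"><div class="rich-text"><p>Note that all three approaches above reverse a copy of the slice and leave nums itself unchanged. If the goal is to reverse that segment inside the original list, slice assignment does it in place (a small addition of ours):</p></div></div> <div class="block-code"><div class='row codeblock-header'>Bash</div><pre>
>>> nums = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> nums[3:8] = nums[3:8][::-1]   # reverse elements at positions 3..7 within nums
>>> nums
[1, 2, 3, 8, 7, 6, 5, 4, 9, 10]
>>>
</pre></div>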
<div class="block-paragraph"><div class="rich-text"><h3><a id="how-to-reverse-python-numpy-array"></a><a class="body-link" href="#how-to-reverse-python-numpy-array">How to reverse a Python Numpy Array?</a></h3><p>Numpy arrays can be reversed using the slicing technique (the [::-1] slice descriptor) or by using numpy’s flipud method. Note that both return a reversed view of the original array rather than a copy. The following code shows the usage of both:</p></div></div> <div class="block-code"><div class='row codeblock-header'>Bash</div><pre>
>>> np_array = np.array([1, 2, 3, 4, 5, 6])
>>> # Method 1: Using slicing
>>> np_array[::-1]
array([6, 5, 4, 3, 2, 1])
>>> type(np_array[::-1])
<type 'numpy.ndarray'>
>>>
>>> # Method 2: Using flipud
>>> np.flipud(np_array)
array([6, 5, 4, 3, 2, 1])
>>> type(np.flipud(np_array))
<type 'numpy.ndarray'>
>>>
</pre></div> <div class="block-paragraph"><div class="rich-text"><h2><a id="summary"></a><a class="body-link" href="#summary">Summary</a></h2><p>In this tutorial we learnt the three techniques for Python list reversal, viz. reverse(), reversed() and slicing. We also looked at the pros and cons of each method.</p></div></div> <div class="block-paragraph"><div class="rich-text"><h4><b>So, which is the best way to reverse a list in Python?</b></h4><p>The answer depends on the requirements. If the requirement is to maintain the order of elements in the original list, then reversed() or the slicing technique should be used. If the requirement is a minimal memory footprint, reverse() or reversed() are better suited. If both are required, i.e., a minimal memory footprint along with preserving the order of elements in the original list, reversed() should be used. In general, if there is no such preference, reverse() or reversed() can be preferred over the slicing technique as they aid code readability.</p></div></div>Web Scraping at scale using Python Multithreading2020-01-23T12:08:35.287220+00:002020-01-25T22:49:22.548566+00:00https://example.com/web-scraping-scale-using-python-multithreading/Foo<div class="block-heading">Web Scraping at scale using Python Multithreading</div>About2020-01-23T12:12:55.392826+00:002020-05-13T07:00:06.182066+00:00https://example.com/about/Foo<div class="block-two_columns"><div class="row"> <div class="col m6"> <section class="block-paragraph"> <div class="rich-text"><h2><a id="welcome-to-pysnacks"></a><a class="body-link" href="#welcome-to-pysnacks">Welcome to PySnacks!</a></h2><p><b>PySnacks brings quality Python tutorials on Data Structures, Machine Learning, Web and Backend development.</b></p><p></p><p>Hi There! My name is Kundan Kumar and I am the founder, publisher and the gatekeeper of PySnacks. 
I believe learning should never stop. I created PySnacks to share what I learn, with the hope that it may help others with similar interests.</p><p>I am a software engineer. I started in the software industry in 2011, and have worked with Samsung R&D, Ittiam Systems and LeadSift.</p><p>In 2017, I moved to Canada to pursue a Master's in Computer Science. I currently work at LeadSift in the field of data mining/machine learning, data pipelines and web application backends.</p><p></p><p>It would be my pleasure to get to know my readers, and I would love to add you to my <a href="https://www.linkedin.com/in/kundan-linkedin/" target="_blank" rel="noopener noreferrer">LinkedIn</a> network. You can also like and follow us on our social networks:</p><ol><li>PySnacks <a href="https://www.facebook.com/pysnacks/" target="_blank" rel="noopener noreferrer">Facebook Page</a></li><li>PySnacks <a href="https://twitter.com/pysnacks" target="_blank" rel="noopener noreferrer">Twitter</a></li></ol><p></p></div> </section> </div> <div class="col m6"> <section class="block-paragraph"> <div class="rich-text"><p></p><img alt="kundan-kumar" class="richtext-image full-width img-responsive lazyload" height="770" data-src="https://pysnacks-media.s3.amazonaws.com/images/WhatsApp_Image_2020-05-03_at_9.04.12_AM.width-1280.jpg" width="1024"><p>Me with my wife, Amalfi Coast, Italy, Feb-2019</p></div> </section> </div> </div> </div>Contact2020-01-23T12:14:04.676620+00:002020-01-23T12:14:33.079887+00:00https://example.com/contact/Foo<div class="block-heading">Contact</div>Privacy Policy2020-05-03T08:31:56.144946+00:002020-05-03T08:52:33.447702+00:00https://example.com/privacy-policy/Foo<div class="block-paragraph"><div class="rich-text"><h2><a id="welcome-to-our-privacy-policy""></a><a class="body-link" href="#welcome-to-our-privacy-policy">Welcome to our Privacy Policy</a></h2><h3><a id="your-privacy-is-important-to-us""></a><a class="body-link" href="#your-privacy-is-important-to-us">Your privacy is important to us.</a></h3><p>PySnacks is located at:</p><p>PySnacks, North End Halifax, B3K 5X5 - Nova Scotia, Canada</p><p>It is PySnacks's policy to respect your privacy regarding any information we may collect while operating our website. This Privacy Policy applies to <a href="https://www.pysnacks.com" target="_blank" rel="noopener noreferrer">https://www.pysnacks.com</a> (hereinafter, "us", "we", or "https://www.pysnacks.com"). We respect your privacy and are committed to protecting personally identifiable information you may provide us through the Website. We have adopted this privacy policy ("Privacy Policy") to explain what information may be collected on our Website, how we use this information, and under what circumstances we may disclose the information to third parties. This Privacy Policy applies only to information we collect through the Website and does not apply to our collection of information from other sources.</p><p>This Privacy Policy, together with the Terms and Conditions posted on our Website, sets forth the general rules and policies governing your use of our Website. 
Depending on your activities when visiting our Website, you may be required to agree to additional terms and conditions.</p></div></div> <div class="block-paragraph"><div class="rich-text"><h2><a id="website-visitors""></a><a class="body-link" href="#website-visitors">Website Visitors</a></h2><p>Like most website operators, PySnacks collects non-personally-identifying information of the sort that web browsers and servers typically make available, such as the browser type, language preference, referring site, and the date and time of each visitor request. PySnacks's purpose in collecting non-personally identifying information is to better understand how PySnacks's visitors use its website. From time to time, PySnacks may release non-personally-identifying information in the aggregate, e.g., by publishing a report on trends in the usage of its website.</p><p>PySnacks also collects potentially personally-identifying information like Internet Protocol (IP) addresses for logged in users and for users leaving comments on https://www.pysnacks.com blog posts. PySnacks only discloses logged in user and commenter IP addresses under the same circumstances that it uses and discloses personally-identifying information as described below.</p><h2><a id="gathering-of-personally-identifying-information""></a><a class="body-link" href="#gathering-of-personally-identifying-information">Gathering of Personally-Identifying Information</a></h2><p>Certain visitors to PySnacks's websites choose to interact with PySnacks in ways that require PySnacks to gather personally-identifying information. The amount and type of information that PySnacks gathers depends on the nature of the interaction. For example, we ask visitors who sign up for a blog at https://www.pysnacks.com to provide a username and email address.</p><h2><a id="security""></a><a class="body-link" href="#security">Security</a></h2><p>The security of your Personal Information is important to us, but remember that no method of transmission over the Internet, or method of electronic storage is 100% secure. While we strive to use commercially acceptable means to protect your Personal Information, we cannot guarantee its absolute security.</p><h2><a id="advertisements""></a><a class="body-link" href="#advertisements">Advertisements</a></h2><p>Ads appearing on our website may be delivered to users by advertising partners, who may set cookies. These cookies allow the ad server to recognize your computer each time they send you an online advertisement to compile information about you or others who use your computer. This information allows ad networks to, among other things, deliver targeted advertisements that they believe will be of most interest to you. This Privacy Policy covers the use of cookies by PySnacks and does not cover the use of cookies by any advertisers.</p><h2><a id="links-to-external-sites""></a><a class="body-link" href="#links-to-external-sites">Links To External Sites</a></h2><p>Our Service may contain links to external sites that are not operated by us. If you click on a third party link, you will be directed to that third party's site. 
We strongly advise you to review the Privacy Policy and terms and conditions of every site you visit.</p><p>We have no control over, and assume no responsibility for the content, privacy policies or practices of any third party sites, products or services.</p><p></p><h2><a id="httpswwwpysnackscom-uses-google-adwords-for-remarketing""></a><a class="body-link" href="#httpswwwpysnackscom-uses-google-adwords-for-remarketing">Https://www.pysnacks.com uses Google AdWords for remarketing</a></h2><p>Https://www.pysnacks.com uses the remarketing services to advertise on third party websites (including Google) to previous visitors to our site. It could mean that we advertise to previous visitors who haven't completed a task on our site, for example using the contact form to make an enquiry. This could be in the form of an advertisement on the Google search results page, or a site in the Google Display Network. Third-party vendors, including Google, use cookies to serve ads based on someone's past visits. Of course, any data collected will be used in accordance with our own privacy policy and Google's privacy policy.</p><p>You can set preferences for how Google advertises to you using the Google Ad Preferences page, and if you want to you can opt out of interest-based advertising entirely by cookie settings or permanently using a browser plugin.</p><h2><a id="protection-of-certain-personally-identifying-information""></a><a class="body-link" href="#protection-of-certain-personally-identifying-information">Protection of Certain Personally-Identifying Information</a></h2><p>PySnacks discloses potentially personally-identifying and personally-identifying information only to those of its employees, contractors and affiliated organizations that (i) need to know that information in order to process it on PySnacks's behalf or to provide services available at PySnacks's website, and (ii) that have agreed not to disclose it to others. Some of those employees, contractors and affiliated organizations may be located outside of your home country; by using PySnacks's website, you consent to the transfer of such information to them. PySnacks will not rent or sell potentially personally-identifying and personally-identifying information to anyone. Other than to its employees, contractors and affiliated organizations, as described above, PySnacks discloses potentially personally-identifying and personally-identifying information only in response to a subpoena, court order or other governmental request, or when PySnacks believes in good faith that disclosure is reasonably necessary to protect the property or rights of PySnacks, third parties or the public at large.</p><p>If you are a registered user of https://www.pysnacks.com and have supplied your email address, PySnacks may occasionally send you an email to tell you about new features, solicit your feedback, or just keep you up to date with what's going on with PySnacks and our products. We primarily use our blog to communicate this type of information, so we expect to keep this type of email to a minimum. If you send us a request (for example via a support email or via one of our feedback mechanisms), we reserve the right to publish it in order to help us clarify or respond to your request or to help us support other users. 
PySnacks takes all measures reasonably necessary to protect against the unauthorized access, use, alteration or destruction of potentially personally-identifying and personally-identifying information.</p><h2><a id="aggregated-statistics""></a><a class="body-link" href="#aggregated-statistics">Aggregated Statistics</a></h2><p>PySnacks may collect statistics about the behavior of visitors to its website. PySnacks may display this information publicly or provide it to others. However, PySnacks does not disclose your personally-identifying information.</p><h2><a id="cookies""></a><a class="body-link" href="#cookies">Cookies</a></h2><p>To enrich and perfect your online experience, PySnacks uses "Cookies", similar technologies and services provided by others to display personalized content, appropriate advertising and store your preferences on your computer.</p><p>A cookie is a string of information that a website stores on a visitor's computer, and that the visitor's browser provides to the website each time the visitor returns. PySnacks uses cookies to help PySnacks identify and track visitors, their usage of https://www.pysnacks.com, and their website access preferences. PySnacks visitors who do not wish to have cookies placed on their computers should set their browsers to refuse cookies before using PySnacks's websites, with the drawback that certain features of PySnacks's websites may not function properly without the aid of cookies.</p><p>By continuing to navigate our website without changing your cookie settings, you hereby acknowledge and agree to PySnacks's use of cookies.</p><h2><a id="privacy-policy-changes""></a><a class="body-link" href="#privacy-policy-changes">Privacy Policy Changes</a></h2><p>Although most changes are likely to be minor, PySnacks may change its Privacy Policy from time to time, and in PySnacks's sole discretion. PySnacks encourages visitors to frequently check this page for any changes to its Privacy Policy. Your continued use of this site after any change in this Privacy Policy will constitute your acceptance of such change.</p><p></p><h2><a id="contact-information""></a><a class="body-link" href="#contact-information">Contact Information</a></h2><p>If you have any questions about this Privacy Policy, please contact us via <a href="mailto:hello@pysnacks.com">hello@pysnacks.com</a></p></div></div>