Text Web Mining & Data Mining Notes and Records
Text Web Mining
Training Skills
How to use Colab to run ML tasks
- click “Runtime” -> “Change Runtime Type” -> select GPU or TPU
- use the code below to check GPU status:
```
!nvidia-smi
```
- build a connection between Google Drive files and this notebook:
```python
from google.colab import drive
drive.mount('/content/drive/')
```
- change directory:
```python
import os
os.chdir("/content/drive/My Drive/Colab Notebooks/…/")
```
- do not use the same browser to run both Jupyter and Colab; it might crash
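Once the runtime is set, it is worth verifying from inside the notebook that the GPU is actually visible to the framework. A minimal sketch, assuming PyTorch (preinstalled on Colab):
```python
import torch

# True only when the notebook is attached to a GPU runtime
print(torch.cuda.is_available())
if torch.cuda.is_available():
    # name of the assigned card, e.g. Tesla T4
    print(torch.cuda.get_device_name(0))
```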
Take Home Project 2 - sentiment classifier
Preparation
- good examples and guides: bentrevett / pytorch-sentiment-analysis; “Sentiment Analysis with Pytorch — Part 3 — CNN Model”; xalanq / chinese-sentiment-classification; slaysd / pytorch-sentiment-analysis-classification
- NN, RNN, BERT: given Jupyter files
- modify the CNN Jupyter files
- change the surname model to a sentiment model with character-level tokens (see the sketch after this list)
- online reference: a sentiment classifier with 3 output classes
- use Weka: Weka for deep learning / official installation guide / how to quickly install the Weka deep learning toolkit
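For the character-level tokens mentioned above, a minimal sketch of the idea (the vocabulary and padding scheme are illustrative, not the assignment's actual code):
```python
# Map each review to a fixed-length sequence of character ids.
def build_char_vocab(texts, pad="<pad>", unk="<unk>"):
    vocab = {pad: 0, unk: 1}
    for ch in sorted({c for t in texts for c in t}):
        vocab[ch] = len(vocab)
    return vocab

def encode(text, vocab, max_len=100):
    ids = [vocab.get(c, 1) for c in text[:max_len]]  # 1 = <unk>
    return ids + [0] * (max_len - len(ids))          # 0 = <pad>

texts = ["great movie", "terrible plot"]
vocab = build_char_vocab(texts)
print(encode("great", vocab, max_len=10))
```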
Further work directions
- coBerta: Kaggle example using coBerta
- Weka preprocessing + BERT / Weka preprocessing
- evaluation: AMBERT evaluation / AMBERT discussion on Zhihu
Group Project: Taylor Swift lyrics generator
methods: RNN / LSTM / GPT-2
- LSTM: a good GitHub example
- LSTM: a detailed GitHub example - Tom-Chang-Deep-Lyrics | a Tom Chang (張雨生) lyrics-generation model built with LSTM deep learning, as a tribute to Tom Chang
- GPT-2 styled lyrics generator: GitHub + Colab
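For the GPT-2 route, generation itself is only a few lines with the Hugging Face transformers pipeline. A sketch using the stock gpt2 checkpoint as a placeholder; the project would swap in a model fine-tuned on Taylor Swift lyrics:
```python
from transformers import pipeline

# stock GPT-2 as a placeholder; replace with the fine-tuned checkpoint
generator = pipeline("text-generation", model="gpt2")
result = generator("I remember when we", max_length=40, num_return_sequences=1)
print(result[0]["generated_text"])
```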
Lab no.2: adjust hyperparameters to get higher accuracy and lower loss
Optimize target: Classifying Surnames with a Multilayer Perceptron
original code from the textbook website, https://github.com/joosthub/PyTorchNLPBook.
Build the best model (based on test loss and test accuracy) by exploring the following options:
- learning_rate
- batch_size
- dropout (use only if it helps)
- batch norm (use only if it helps)
- weight_decay (L2 regularization) (use only if it helps)
- hidden_dim
- Note that it is not necessary to adjust other parameter values even though you are allowed to do so.
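One way to organize the exploration is a plain grid over the options above. A sketch, where train_and_eval is a hypothetical stand-in for the notebook's training loop:
```python
import itertools

def train_and_eval(lr, batch_size, hidden_dim, dropout, weight_decay):
    # hypothetical stand-in: plug the notebook's training/evaluation
    # code in here and return (test_loss, test_accuracy)
    return 1.61, 57.789  # placeholder values

grid = itertools.product(
    [1e-3, 1e-2],      # learning_rate
    [64, 128, 256],    # batch_size
    [100, 300],        # hidden_dim
    [0.0, 0.5],        # dropout (only if it helps)
    [0.0, 1e-4],       # weight_decay (only if it helps)
)
best = None
for lr, bs, hd, dp, wd in grid:
    loss, acc = train_and_eval(lr, bs, hd, dp, wd)
    if best is None or loss < best[0]:
        best = (loss, acc, (lr, bs, hd, dp, wd))
print(best)
```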
Values of given parameters
Following Professor Jin's suggestion and instruction, I will share some findings from using a slightly different input dataset later on, rather than sharing my answer directly.
Best outcomes
Test loss: 1.61
Test accuracy: 57.789%
Taking “drewer” as an example, the top 15 predictions are:
```
drewer -> German (p=0.46)
drewer -> English (p=0.35)
drewer -> Dutch (p=0.06)
drewer -> Scottish (p=0.04)
drewer -> Czech (p=0.04)
drewer -> Polish (p=0.01)
drewer -> Spanish (p=0.01)
drewer -> French (p=0.01)
drewer -> Portuguese (p=0.01)
drewer -> Russian (p=0.00)
drewer -> Irish (p=0.00)
drewer -> Chinese (p=0.00)
drewer -> Japanese (p=0.00)
drewer -> Italian (p=0.00)
drewer -> Arabic (p=0.00)
```
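A list like this comes from a softmax over the classifier's logits followed by torch.topk. A sketch; classifier, vectorizer, and the nationality vocabulary stand for the notebook's objects and are named here only for illustration:
```python
import torch
import torch.nn.functional as F

def predict_topk(surname, classifier, vectorizer, k=15):
    # vectorize the surname and run one forward pass
    x = torch.tensor(vectorizer.vectorize(surname)).unsqueeze(0)
    probs = F.softmax(classifier(x), dim=1)     # logits -> probabilities
    values, indices = torch.topk(probs, k=k)
    vocab = vectorizer.nationality_vocab        # class id -> nationality
    return [(vocab.lookup_index(i.item()), v.item())
            for v, i in zip(values[0], indices[0])]
```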
Tips
- Running a VPN can block anaconda-navigator from starting, because the VPN may occupy the localhost port it needs.
- use Homebrew to solve the environment path problem when installing graphviz:
```
brew install graphviz
pip install graphviz
```
Data Mining
Useful Info
- Learn - University of Waikato. Course links: YouTube
- Dataset repository of UCI
Classification Problem
Dataset
Final result online presentation (Jupyter)
Basic knowledge and demo testing
Environment: anaconda-navigator / Jupyter-pytorch / sklearn
Build project
Environment: Docker / Jupyter-tensorflow / sklearn / original reference
- preprocessing
- train and test
- gather data and outcomes
- compare and analyze
- draw ROC plot
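For the ROC step, a minimal sketch with scikit-learn and a placeholder dataset/model (swap in the project's own data and classifier):
```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# placeholder data and model, for illustration only
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)

y_score = clf.predict_proba(X_test)[:, 1]   # positive-class scores
fpr, tpr, _ = roc_curve(y_test, y_score)
plt.plot(fpr, tpr, label="AUC = %.3f" % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], linestyle="--")    # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```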
Additional way: use Weka to generate results
- convert the .data file into a .arff file:
  .data -> .csv (use Sublime Text) -> add label names in the .csv (use Sublime Text, not Numbers) -> use tools to convert to .arff, remembering to select “first row contains labels” (a pandas sketch of the first two steps follows this list)
- use Weka to generate results
- compare and analyze
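The first two conversion steps (.data -> .csv with label names) can also be scripted. A pandas sketch; the file name and column labels are illustrative:
```python
import pandas as pd

# UCI .data files are typically comma-separated with no header row
df = pd.read_csv("breast-cancer.data", header=None)
# add label names as the first row (illustrative names)
df.columns = ["attr%d" % i for i in range(df.shape[1] - 1)] + ["class"]
df.to_csv("breast-cancer.csv", index=False)
# then convert the .csv to .arff with Weka or the CSV2ARFF website,
# selecting "first row contains labels"
```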
Reference
- Use conda environment when facing conda install inconsistency
- Using the label encoding function can lead to the error below; it can be fixed by removing the null values from the dataset (see the sketch after this list). Basic guide / Application guide
```
Encoders require their input to be uniformly strings or numbers. Got ['float', 'str']
```
- Use Microsoft Excel and Weka to convert xls/csv to arff / CSV2ARFF website / ARFF2CSV
- Label encoding & One-Hot encoding
- Pandas missing value process methods
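A sketch of the fix for the encoder error above: make the column uniform by dropping (or filling) the nulls before encoding. The column name and values are illustrative:
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"workclass": ["Private", None, "State-gov"]})
# a null mixes float (NaN) and str in one column, which the encoder rejects
df = df.dropna(subset=["workclass"])  # or: df["workclass"].fillna("missing")
df["workclass"] = LabelEncoder().fit_transform(df["workclass"])
print(df)
```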
Association Problem
Weka
- Use Java to launch Weka.jar / YouTube example video / Weka-Apriori parameter meanings / Weka Javadoc - Association
- output the runtime during the association process
Preparation
- Learn - University of Waikato. Course links: YouTube
- Dataset repository of UCI
- Small example of running Apriori on the javaTpoint website
- Detailed ppt of Association Rule Mining from USTC
- Complexity
Apriori
Algorithm steps
Step-1: Determine the support of itemsets in the transactional database, and select the minimum support and confidence.
Step-2: Take all itemsets in the transactions whose support value is higher than the minimum (selected) support value.
Step-3: Find all the rules over these subsets whose confidence value is higher than the threshold (minimum confidence).
Step-4: Sort the rules in decreasing order of lift.
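Outside Weka, the same four steps can be run in Python with the mlxtend library (a sketch with a toy transaction list, not the Weka workflow used below):
```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["milk", "bread"], ["milk", "eggs"], ["milk", "bread", "eggs"]]
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)
# Steps 1-2: frequent itemsets above the minimum support
frequent = apriori(onehot, min_support=0.5, use_colnames=True)
# Step 3: rules above the minimum confidence; Step 4: sort by lift
rules = association_rules(frequent, metric="confidence", min_threshold=0.9)
print(rules.sort_values("lift", ascending=False))
```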
Apriori in WEKA
- WEKA uses datasets in .arff format
  tips: 1) use csv2arff or arff2csv; 2) ARFF functions can also convert xls and csv files directly into arff files
- There are 12 parameters that can be set before running Apriori (reference: Apriori properties in WEKA)
Original dataset test information
| Dataset | labor | breast-cancer | wisconsin-breast-cancer | hypothyroid | mushroom | letter | adult |
| --- | --- | --- | --- | --- | --- | --- | --- |
| attribute number | 17 | 10 | 10 | 30 | 23 | 16 | 15 |
| instance number | 57 | 286 | 699 | 3772 | 8124 | 20000 | 32561 |
| time, setting 1 (ms) | 263 | 302 | 296 | 372 | 343 | 526 | 746 |
| time, setting 2 (ms) | 289 | 337 | 314 | 587 | 723 | 2493 | 4633 |
| brute-force estimate (ms) | 16246 | 81513 | 199221 | 1075057 | 2315421 | 5700200 | 9280211 |
Setting 1: lowerBoundMinSupport = 0.5; classIndex = -1; delta = 0.05; minConfidence = 0.9; minRules = 100
Setting 2: lowerBoundMinSupport = 0.1; classIndex = -1; delta = 0.01; minConfidence = 0.95; minRules = 200
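For reference, setting 1 can also be run from the command line. A sketch, assuming weka.jar is in the current directory and labor.arff is the input; the flags map to the properties above (-M lowerBoundMinSupport, -c classIndex, -D delta, -C minimum confidence, -N number of rules):
```
java -cp weka.jar weka.associations.Apriori -t labor.arff -M 0.5 -c -1 -D 0.05 -C 0.9 -N 100
```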
The reason for the disturbance in the plot is that Apriori in WEKA starts from the upper bound on support and decreases it incrementally. The algorithm stops when either the specified number of rules has been generated or the lower bound for minimum support is reached, so the abnormal run times are due to the different stop criteria [1].
Additional dataset test information
In addition, I chose a set of datasets with a similar structure; the experimental outcome fits the theoretical prediction very well.
| Dataset | spectrum disorder screening - adolescent | spectrum disorder screening - children | spectrum disorder screening - adult | absent at work | SouthGermanCredit | mushroom |
| --- | --- | --- | --- | --- | --- | --- |
| attribute number | 21 | 21 | 21 | 21 | 21 | 22->21 |
| instance number | 104 | 292 | 704 | 740 | 1000 | 8124 |
| time, setting 1 (ms) | 303 | 320 | 319 | 317 | 387 | 409 |
| time, setting 2 (ms) | 307 | 346 | 372 | 472 | 953 | 976 |
Brute force estimation
- The brute-force approach can be summed up as follows: first, generate all possible association rules; then, determine whether the itemsets fit the criteria by checking through all the instances.
- Let d be the number of attributes and N the number of instances; the number of all possible association rules is 3^d - 2^(d+1) + 1, so the overall time complexity is k * (3^d - 2^(d+1) + 1) * N. In a test experiment it took 1 ms to run the brute force for d = 4, N = 4, which gives k = 0.005 ms. With d = 10 and N = 57, the total time for the "labor" dataset is estimated as 16.246 s, and likewise for the other datasets.
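The brute-force row of the tables can be reproduced from this formula. A small sketch, with k = 0.005 ms per rule-instance check as measured above:
```python
def brute_force_ms(d, n, k=0.005):
    # number of possible association rules over d attributes
    rules = 3 ** d - 2 ** (d + 1) + 1
    return k * rules * n  # milliseconds

# "labor": d = 10, N = 57 -> about 16246 ms (16.246 s)
print(brute_force_ms(10, 57))
```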
Plot
Error analysis
References
[1] http://facweb.cs.depaul.edu/mobasher/classes/ect584/WEKA/associate.html
[2] https://www.cnblogs.com/en-heng/p/5719101.html