Breast Cancer CT Scan dataset with fastai
Basic vision model in the scope of Part 1 (2022) fastai Deep learning course
Most of my experience with ML and deep learning has been with tabular data and NLP. Here is a basic vision model built with the fastai library, in the scope of the Part 1 (2022) fastai Deep Learning course.
I am using the Breast Cancer CT dataset from Kaggle (https://www.kaggle.com/datasets/sabermalek/bcfpp) to train a model. The dataset has 3 categorical labels:
- 0 : Cancer
- 1 : Benign
- 2 : Normal
#
# import libraries
import joblib as jlb
from fastai.vision.all import *
#
# Read data
images, labels, masks = jlb.load('../input/bcfpp/BCFPP.jlb')
# labels contains 3 classes (0 for Cancer, 1 for Benign and 2 for normal)
images = np.uint8(images)
# let's see what we got
print(f"images: {images.shape}")
print(f"labels: {labels.shape}")
print(f"masks: {masks.shape}")
im = Image.fromarray(images[0])
im.to_thumb(384)
im = Image.fromarray(np.uint8(masks[0]))
im.to_thumb(384)
I am not sure what to do with the mask data at this point, so we will ignore it for now when training our model.
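If we do come back to the masks later, one simple thing we could do is measure how much of each scan a mask covers. This is only a sketch under an assumption: that each mask is a per-pixel segmentation map where non-zero pixels mark the region of interest (the dataset page should be checked to confirm this).

```python
import numpy as np

def mask_coverage(mask):
    """Return the fraction of non-zero pixels in a 2D mask array.

    Assumes the mask marks the region of interest with non-zero values.
    """
    mask = np.asarray(mask)
    return float((mask > 0).mean())

# toy example: a 4x4 "mask" with a 2x2 block of marked pixels
toy_mask = np.zeros((4, 4))
toy_mask[1:3, 1:3] = 1
print(mask_coverage(toy_mask))  # 0.25
```

On the real data this would be `mask_coverage(masks[0])`, giving a rough per-scan measure of lesion size.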
Let's look at the distribution of our data labels to see if we are dealing with an imbalanced dataset.
(pd.Series(labels)
 .value_counts()
 .rename(index={0:'cancer',1:'benign',2:'normal'})
 .to_frame(name='count')
)
Perfectly balanced, so no issue there.
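The same sanity check can be done with the standard library alone, which is handy when pandas is not around. A small sketch, using a stand-in list since the real label array is loaded from the `.jlb` file:

```python
from collections import Counter

# stand-in for the real label array loaded from BCFPP.jlb
labels = [0, 1, 2] * 100

# count how many scans fall into each class
counts = Counter(labels)
names = {0: 'cancer', 1: 'benign', 2: 'normal'}
for lbl, n in sorted(counts.items()):
    print(f"{names[lbl]}: {n}")
```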
DataBlock
We have not covered DataBlock in depth yet, and all the examples so far involve reading image files from disk. The image data in the Breast CT scan dataset is a NumPy array instead, but we can still use DataBlock by defining our own functions for get_items, get_x and get_y.
Not the most elegant solution, but at this stage it will get us to a model quickly.
I tried using lambda functions, but I got an error from learn.export when I did that: export pickles the Learner, and lambdas cannot be pickled.
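The export failure is easy to reproduce on its own. Pickle serializes a function by its module-level name, which a lambda does not have, so pickling one raises an error:

```python
import pickle

# A lambda has no module-level name for pickle to reference,
# so serializing it fails -- the same failure learn.export hits.
try:
    pickle.dumps(lambda i: i)
    print("lambda pickled (unexpected)")
except Exception as e:
    print("lambda failed to pickle:", type(e).__name__)
```

Named functions defined at module level pickle fine, which is why the `get_items`/`get_x`/`get_y` functions below work with `learn.export`.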
#
# just return the item index for items
def get_items(i):
    return i

# return the image at index i
def get_x(i):
    return images[i]

# return the mapped label for index i
def get_y(i):
    if labels[i] == 0: return "cancer"
    if labels[i] == 1: return "benign"
    if labels[i] == 2: return "normal"
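A more compact alternative to the chained `if`s in `get_y` is a dict lookup; the behaviour is the same for labels 0, 1 and 2. A sketch with a stand-in label list, since the real array comes from the `.jlb` file:

```python
# map integer labels to class names in one place
label_names = {0: "cancer", 1: "benign", 2: "normal"}

# stand-in for the real label array
labels = [2, 0, 1]

def get_y(i):
    return label_names[labels[i]]

print(get_y(0))  # normal
```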
# Build DataBlock and keep 20% of our data for the validation set
dls = DataBlock(
    blocks=(ImageBlock(cls=PILImageBW), CategoryBlock),
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
    get_items=get_items,
    get_x=get_x,
    get_y=get_y,
    item_tfms=Resize(384)
).dataloaders(list(range(images.shape[0])))
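The trick here is that the "items" flowing through the DataBlock are just integer indices, and RandomSplitter partitions that index list. A rough sketch of the idea, not fastai's actual implementation: shuffle the indices with a fixed seed and cut off 20% as the validation set.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=42):
    """Shuffle the index list and split off valid_pct as the validation set."""
    idxs = list(range(n_items))
    random.Random(seed).shuffle(idxs)
    cut = int(n_items * valid_pct)
    return idxs[cut:], idxs[:cut]  # train indices, valid indices

train, valid = random_split(10)
print(len(train), len(valid))  # 8 2
```

Because the split happens on indices, `get_x` and `get_y` can then look up the actual image and label arrays by position.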
# look at a few images in our DataBlock
dls.show_batch(max_n=6)
learn = vision_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(6)
learn.export('model.pkl')
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(cmap='Purples')
In this classification task we want to optimize cancer recall. Misclassifying cancer as benign, or benign as cancer, is less of a problem because a biopsy would most likely be performed to confirm whether cancer is actually present. The bad outcome is our model predicting a normal scan when cancer was actually present.
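That "predicted normal, actually cancer" count can be pulled straight out of a confusion matrix. A sketch with a hypothetical matrix: the 2 missed-cancer scans and the 68 cancer-as-benign cases come from the results below, but the other cell values are made up for illustration (the real matrix is the one interp plots).

```python
# rows = actual class, cols = predicted class, in the order below
classes = ["cancer", "benign", "normal"]
cm = [
    [200, 68, 2],   # actual cancer (2 and 68 from the results; rest hypothetical)
    [10, 230, 5],   # actual benign (hypothetical)
    [0, 3, 196],    # actual normal (hypothetical)
]

# the dangerous cell: actual cancer, predicted normal
missed = cm[classes.index("cancer")][classes.index("normal")]
total = sum(sum(row) for row in cm)
print(f"normal predicted when cancer present: {missed} of {total} scans")
print(f"miss rate: {round(missed / total * 100, 2)}%")
```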
In our confusion matrix this was only the case for 2 scans!
print(f'percent of instances where model predicted normal when cancer was present: {round((2/len(dls.valid_ds))*100,2)}%')
The model missed a cancer diagnosis in only 0.28% of the CT scans! And in the 68 cases where the model mislabelled cancer as benign, the cancer would still have been caught by a biopsy.