Apollo Rock Thin Section Classifier
This document outlines the functionality of the various approaches used within the Apollo Rock Thin Section Classifier project.
Authors: Freja Thoresen, Aidan Cowley, Romeo Haak, Jonas Lewe, Clara Moriceau, Piotr Knapczyk, Victoria Engelschion
Data Sources
- NASA PDS database (https://pdsimage2.wr.usgs.gov/Missions/Apollo/Lunar_Sample_Photographs/)
- Lunar Institute Data (https://www.lpi.usra.edu/lunar/samples/atlas/thin_sections/)
- Virtual Microscope (http://www.virtualmicroscope.org/explore)
Models
Trained models can be downloaded from wandb.
- https://wandb.ai/freja-thoresen/SimCLR
- https://wandb.ai/freja-thoresen/Geological%20Binary%20Classifier
Binary Classifier:
To execute the binary classifier, run the msm_statistics.py script using the following command:
python msm_statistics.py combined_data2x.msm <image_directory_name>
This command will generate a "datasets" folder containing two sub-folders: "grain" and "rock_type". For example, in the "rock_type" sub-folder, images are classified into either a "breccia" or a "basalt" folder based on the generated classification dictionary. Additionally, sample IDs are added as prefixes to the existing file names for easier management. If you are using the binary classifier, be sure to remove the "other" folder.
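For orientation, the generated layout can be pictured roughly as follows (the sub-folders under "grain" and the exact placement of the "other" folder are assumptions, not taken from the script itself):
datasets/
    grain/
        ...
    rock_type/
        basalt/
        breccia/
        other/
Files inside these folders keep their original names, prefixed with the sample ID.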
Machine Learning / AI Components
This section assumes you have completed the Quick Guide in the Setup section below.
stratified_group_kfold.py
This script creates data folds, ensuring that images from the same sample are grouped together in either the training or testing sets. It is utilized by preprocessing_helper.py.
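The grouping idea is the same as scikit-learn's StratifiedGroupKFold. The minimal sketch below uses that class with hypothetical file names, labels, and sample IDs; the project's own script may implement the split differently:
from sklearn.model_selection import StratifiedGroupKFold

# Hypothetical inputs: one entry per image, with the sample ID as the group key
image_files = ["A_1.jpg", "A_2.jpg", "B_1.jpg", "B_2.jpg", "C_1.jpg", "C_2.jpg", "D_1.jpg", "D_2.jpg"]
rock_types  = ["basalt",  "basalt",  "basalt",  "basalt",  "breccia", "breccia", "breccia", "breccia"]
sample_ids  = ["A",       "A",       "B",       "B",       "C",       "C",       "D",       "D"]

# All images of a given sample land in the same fold, so a sample never appears
# in both the training and the testing set of the same split.
cv = StratifiedGroupKFold(n_splits=2)
for train_idx, test_idx in cv.split(image_files, rock_types, groups=sample_ids):
    print("train:", [image_files[i] for i in train_idx])
    print("test: ", [image_files[i] for i in test_idx])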
preprocessing_helper.py
This file is responsible for cleaning and organizing folders for training, testing, and validation. It ensures that images from the same sample are stored together.
networks.py
This is the core component of the binary classifier, which is responsible for training and fine-tuning the networks.
You can adjust the network type and parameters as follows:
network = InceptionResNet(training_directory, validation_directory, test_directory,
epochs=20, finetune_epochs=30, batch_size=32)
The available networks include VGG16, VGG19, and InceptionResNet. Additional functionality is available via the following command-line flags:
- -c: Enables cross-validation training
- -f: Enables fine-tuning after initial training
- -x: Executes experiment 1 to check for repeated false positives
- -t: Runs for only 2 epochs for testing purposes
- -g: Draws precision-recall curves (not available for cross-validation)
- -T: Evaluates model performance on the test set
To train the network for rock type prediction, use:
python networks.py -f -T ../datasets/rock_type
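Similarly, assuming the flags can be combined, a cross-validated run with fine-tuning on the grain-size data could look like:
python networks.py -c -f ../datasets/grain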
Preprocessing
While not necessary for running the classifier, the following information about the saved and processed files can be helpful.
full_database_tree
Contains links to all high-resolution JPEG and TIF images in the PDS database. This is used in the download_labels function within sample_downloader.py.
pds_data.msm
This file contains the moon sample metadata (msm) for the data from the PDS database: information about each sample, specifying its superclass, subclass, sample ID, etc.
See https://pdsimage2.wr.usgs.gov/Missions/Apollo/Lunar_Sample_Photographs/A14VIS_0001/DATA/BASALT/FELDSPATHIC/14053/THIN_SECTIONS/S71-23315.LBL for an example
This file is produced by running lbl_parser.py. It is used in processing_combined.py, and in msm_statistics_combined.py when the MsmStatisticsPdsimage class is selected.
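For orientation only, a single entry can be pictured roughly like the dictionary below; the actual keys and nesting are defined by lbl_parser.py and may well differ:
# Hypothetical shape of one pds_data.msm entry (keys and values are illustrative only)
pds_data = {
    "S71-23315": {
        "sample_id": "14053",
        "superclass": "BASALT",
        "subclass": "FELDSPATHIC",
        "label_url": "https://pdsimage2.wr.usgs.gov/Missions/Apollo/Lunar_Sample_Photographs/...",
    },
}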
lunar_institute_data.msm
This file serves a similar purpose to pds_data.msm but employs a slightly different data structure.
See https://www.lpi.usra.edu/lunar/samples/atlas/thin_section/?mission=Apollo%2011&sample=10058&source_id=JSC04230 for an example of the data saved.
This file is also utilized in processing_combined.py.
combined_data2x.msm
IMPORTANT FILE. This file consolidates data from the lunar sample atlas and NASA's PDS. It also indicates grain size and rock type for the samples.
To create this file:
1. Combine lunar_institute_data.msm and pds_data.msm using combine_pds_and_lunar() in processing_combined.py (see the sketch after these steps).
2. Run the following lines of code in processing_combined.py with the newly created combined_data.msm to generate combined_data2x.msm:
data_msm = load_file("combined_data.msm")
data_msm = {k: change_paths(v) for k, v in data_msm.items()}
write_to_file(data_msm, 'combined_data2x.msm')
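Put together, and assuming combine_pds_and_lunar() takes no arguments and writes combined_data.msm (its exact signature is not documented here), the whole flow inside processing_combined.py is roughly:
# Step 1: merge the two source .msm files into combined_data.msm (assumed call)
combine_pds_and_lunar()

# Step 2: rewrite the stored image paths and save the final file
data_msm = load_file("combined_data.msm")
data_msm = {k: change_paths(v) for k, v in data_msm.items()}
write_to_file(data_msm, 'combined_data2x.msm')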
This file is referenced by msm_statistics_combined.py.
sample_downloader.py
This file is usually the first point of reference for acquiring data. It contains the ImageFinder
class responsible for compiling links from NASA's PDS image database. The full_database_tree
is also available on GitHub for immediate use.
Class: ImageFinder
You can use this class as follows:
image_finder = ImageFinder()
image_finder.director(<mode>)
The director method has three modes: 'combine', 'all', and 'missed only'. Depending on your needs, replace <mode>
with one of these valid options. Running in "all" mode assumes you don't have any data yet and will attempt to scrape all links from NASA's PDS database located at:
https://pdsimage2.wr.usgs.gov/Missions/Apollo/Lunar_Sample_Photographs/.
Please note that the site may occasionally close connections to prevent bot behavior. If this occurs while in "all" mode, any links that could not be accessed will be saved to a file named 'leftover_urls'.
If you then run the director in "missed only" mode, it will retry accessing the missed URLs and attempt to scrape those links and their subdirectories. After calling "all" and "missed only" once, you should have all the links you need. Finally, you can run the director in "combine" mode to merge the results from the "database_tree" file produced by the "all" mode and "database_tree_rest" from the "missed only" mode into a single file called "full_database_tree".
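Putting the modes together, a typical first-time run is sketched below (the import of ImageFinder from the sample_downloader module is assumed):
from sample_downloader import ImageFinder  # import path assumed

image_finder = ImageFinder()
image_finder.director('all')          # initial scrape; unreachable links end up in 'leftover_urls'
image_finder.director('missed only')  # retry the URLs that were missed the first time
image_finder.director('combine')      # merge 'database_tree' and 'database_tree_rest' into 'full_database_tree'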
Once you've created the "full_database_tree" file, you can use it to download label files and images. It is recommended to download the label files first, as they are essential for preprocessing. To download the labels, use:
all_urls = load_file('full_database_tree')
download_labels(all_urls, os.path.join('Data', 'labels'))
Be aware that the same issue of the remote host closing the connection might occur here as well, so you may need to run this function multiple times. Don't worry; it will inform you when everything has succeeded.
The other functions in this module are mainly used by other files and serve to fetch or download the actual images of a sample. The most critical function among them is process_local_samples.
lbl_parser.py
This file is responsible for parsing the label files from the PDS database. You can run the script using:
python lbl_parser.py <Directory where label files are stored> <Desired filename>
If everything is successful, you should receive an output file named pds_data.msm.
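For example, if the label files were downloaded to Data/labels as in the earlier snippet, the invocation might be (output filename shown for illustration):
python lbl_parser.py Data/labels pds_data.msm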
processing_combined.py
This is an important file that assists in cleaning up and combining the two MSM files obtained from the PDS and LPI databases. It standardizes the information saved from both files into one. Call the necessary functions as needed.
msm_statistics_combined.py
This file is tasked with extracting specific statistics and information. Depending on which MSM file you are using, you'll need to select the appropriate class.
For example, if you're using combined_data2x.msm, the relevant class is MsmCombinedStatistics. In that case, you can initialize it like so on line 213:
statistic_type = MsmCombinedStatistics(data_msm)
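Around that point in the script, the pattern is a minimal two-liner; load_file here is assumed to be whatever loader the script already uses for .msm files:
data_msm = load_file("combined_data2x.msm")
statistic_type = MsmCombinedStatistics(data_msm)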
Setup
Installation
- Run make install, which sets up a virtual environment and all Python dependencies therein.
- Run source .venv/bin/activate to activate the virtual environment.
- (Optional) Run make install-pre-commit, which installs pre-commit hooks for linting, formatting and type checking.
Adding and Removing Packages
To install new PyPI packages, run:
uv add <package-name>
To remove them again, run:
uv remove <package-name>
To show all installed packages, run:
uv pip list
All Built-in Commands
The project includes the following convenience commands:
- make install: Install the project and its dependencies in a virtual environment.
- make install-pre-commit: Install pre-commit hooks for linting, formatting and type checking.
- make lint: Lint the code using ruff.
- make format: Format the code using ruff.
- make type-check: Type check the code using mypy.
- make test: Run tests using pytest and update the coverage badge in the readme.
- make docker: Build a Docker image and run the Docker container.
- make docs: View documentation locally in a browser.
- make publish-docs: Publish documentation to GitHub Pages.
- make tree: Show the project structure as a tree.
A Word on Modules and Scripts
In the src directory there are two subdirectories, apollo_rock_thin_section_classifier and scripts. This is a brief explanation of the differences between the two.
Modules
All Python files in the apollo_rock_thin_section_classifier
directory are modules
internal to the project package. Examples here could be a general data loading script,
a definition of a model, or a training function. Think of modules as all the building
blocks of a project.
When a module is importing functions/classes from other modules we use the relative import notation - here's an example:
from .other_module import some_function
Scripts
Python files in the scripts folder are scripts: short code snippets that are external to the project package and are meant to actually run the code. As such, only scripts will be called from the terminal. An analogy here is that the internal numpy code consists entirely of modules, while the Python code you write where you import some numpy functions and actually run them is a script.
When importing module functions/classes from within a script, you do it as you would from any other package:
from apollo_rock_thin_section_classifier import some_function
Note that this is also how we import functions/classes in tests, since each test Python file is also a Python script, rather than a module.
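For instance, a hypothetical test file could import from the package in exactly the same way (some_function is the placeholder name used above):
# tests/test_example.py: hypothetical test file
from apollo_rock_thin_section_classifier import some_function

def test_some_function():
    assert some_function() is not None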
Features
Docker Setup
A Dockerfile is included in the repository, which by default runs src/scripts/main.py. You can build the Docker image and run the Docker container by running make docker.
Automatic Documentation
Run make docs to create the documentation in the docs folder, which is based on the docstrings in your code. You can publish this documentation to GitHub Pages by running make publish-docs. To add more manual documentation pages, simply add more Markdown files to the docs directory; these will automatically be included in the documentation.
Automatic Test Coverage Calculation
Run make test to test your code, which also updates the "coverage badge" in the README, showing you how much of your code base is currently being tested.
Continuous Integration
GitHub CI pipelines are included in the repo, running all the tests in the tests directory, as well as building the online documentation if GitHub Pages has been enabled for the repository (this can be enabled in the repository settings on GitHub).
Code Spaces
Code Spaces is a feature on GitHub that allows you to develop on a project completely in the cloud, without having to do any local setup at all. This repo includes a configuration file for running code spaces on GitHub. When the repo is hosted at esa/apollo_rock_thin_section_classifier, simply press the <> Code button and add a code space to get started; this will open a VSCode window directly in your browser.