miprometheus.problems¶
Problem¶
-
class
miprometheus.problems.
Problem
(params_, name_='Problem')[source]¶ Class representing base class for all Problems.
Inherits from
torch.utils.data.Dataset
as all subclasses will represent a problem with an associated dataset, and the worker will use torch.utils.data.DataLoader
to generate batches. Implements features & attributes used by all subclasses.
-
__init__
(params_, name_='Problem')[source]¶ Initializes problem object.
Parameters:
- params (miprometheus.utils.ParamInterface) – Dictionary of parameters (read from the configuration .yaml file).
- name (str) – Problem name (DEFAULT: ‘Problem’).
This constructor:
stores a pointer to params:
>>> self.params = params_
sets a problem name:
>>> self.name = name_
sets a default loss function:
>>> self.loss_function = None
initializes the size of the dataset:
>>> self.length = None
initializes the logger.
>>> self.logger = logging.Logger(self.name)
initializes the data definitions: this is used for defining the
DataDict
keys.
Note
This dict contains information about the DataDict produced by the current problem class.
This object will be used during handshaking between the model and the problem class to ensure that the model can accept the batches produced by the problem.
This dict should at least contain the targets field:
>>> self.data_definitions = {'targets': {'size': [-1, 1], 'type': [torch.Tensor]}}
initializes the default values: this is used to pass missing parameter values to the model.
Note
You are likely to encounter a case where the model needs a parameter value that is only known once the problem has been instantiated, like the size of a vocabulary set or the number of marker bits.
The user can fill in those values in this dict, which will be passed to the model in its __init__. The model will then be able to fill in its missing parameter values, either from params or from this dict.
>>> self.default_values = {}
sets the access to AppState: for dtype, visualization flag etc.
>>> self.app_state = AppState()
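The handshake mentioned above can be pictured as a key comparison between two definition dicts. The following is only a minimal sketch: `missing_keys` is a hypothetical helper, not part of Mi-Prometheus, and the types are written as strings so that the snippet runs without PyTorch.

```python
# Hypothetical stand-in for the handshake; this is NOT the library's implementation.
def missing_keys(problem_definitions, model_definitions):
    """Return the keys the model expects but the problem does not provide."""
    return sorted(key for key in model_definitions if key not in problem_definitions)


# Dicts shaped like self.data_definitions; types are strings for illustration only.
problem_definitions = {
    "images": {"size": [-1, 3, 32, 32], "type": ["torch.Tensor"]},
    "targets": {"size": [-1, 1], "type": ["torch.Tensor"]},
}
model_definitions = {
    "images": {"size": [-1, 3, -1, -1], "type": ["torch.Tensor"]},
    "targets": {"size": [-1, 1], "type": ["torch.Tensor"]},
    "questions": {"size": [-1, -1], "type": ["torch.Tensor"]},
}

# The model asks for 'questions', which this problem does not produce.
unmet = missing_keys(problem_definitions, model_definitions)
print(unmet)
```

A real handshake would presumably also compare sizes and types; this sketch checks keys only.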
-
create_data_dict
()[source]¶ Returns a
miprometheus.utils.DataDict
object with keys created on the problem data_definitions and empty values (None).
Returns: new miprometheus.utils.DataDict object.
-
set_loss_function
(loss_function)[source]¶ Sets loss function.
Parameters: loss_function – Loss function (e.g. torch.nn.CrossEntropyLoss
) that will be set as the optimization criterion.
-
collate_fn
(batch)[source]¶ Generates a batch of samples from a list of individual samples retrieved by __getitem__().
The default collate_fn is torch.utils.data.dataloader.default_collate().
Note
This base collate_fn() method only calls the default torch.utils.data.dataloader.default_collate(), as it can handle several cases (mainly tensors, numbers, dicts and lists).
If your dataset can yield variable-length samples within a batch, generates batches on-the-fly, or has another irregular characteristic, you will most likely need to override this default collate_fn().
Parameters: batch (list) – list of miprometheus.utils.DataDict retrieved by __getitem__(), each containing tensors, numbers, dicts or lists.
Returns: DataDict containing the created batch.
-
__getitem__
(index)[source]¶ Getter that returns an individual sample from the problem’s associated dataset (which can be generated on-the-fly or retrieved from disk, and can also be composed of several files).
Note
To be redefined in subclasses.
Note
The getter should return a DataDict: its keys should be defined by the self.data_definitions keys.
This ensures consistency of the content of the miprometheus.utils.DataDict when proceeding to the handshake between the miprometheus.problems.Problem class and the miprometheus.models.Model class. For more information, please see miprometheus.models.Model.handshake_definitions().
e.g.:
>>> data_dict = DataDict({key: None for key in self.data_definitions.keys()})
>>> # you can now access each value by its key and assign the corresponding object (e.g. `torch.tensor` etc)
>>> ...
>>> return data_dict
Warning
Mi-Prometheus supports multiprocessing for data loading (through the use of torch.utils.data.DataLoader).
To construct a batch (say 64 samples), the indexes are distributed among several workers (say 4, so that each worker has 16 samples to retrieve). It is best that samples can be accessed individually in the dataset folder so that there is no mutual exclusion between the workers and the performance is not degraded.
If each sample is generated on-the-fly, this shouldn’t cause a problem. There may be an issue with randomness. Please refer to the official PyTorch documentation for this.
Parameters: index (int) – index of the sample to return.
Returns: Empty DataDict, having the same keys as self.data_definitions.
-
worker_init_fn
(worker_id)[source]¶ Function to be called by torch.utils.data.DataLoader on each worker subprocess, after seeding and before data loading (default: None).
Note
Sets the NumPy random seed of the worker equal to the previous NumPy seed + its worker_id to avoid having all workers returning the same random numbers.
Parameters: worker_id (int) – the worker id (in [0, torch.utils.data.DataLoader.num_workers - 1])
Returns: None by default
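The effect of offsetting the seed by the worker id can be illustrated without NumPy. This sketch uses the standard-library `random` module as a stand-in for the NumPy generator, and `make_worker_rng` is a hypothetical helper, not the library's method.

```python
import random


def make_worker_rng(base_seed, worker_id):
    """Return a generator seeded with base_seed + worker_id (one per worker)."""
    return random.Random(base_seed + worker_id)


base_seed = 1234  # arbitrary value chosen for the illustration

# With the offset, each of the 4 workers draws a different first number.
draws = [make_worker_rng(base_seed, worker_id).random() for worker_id in range(4)]

# Without the offset, every worker would draw exactly the same number.
same_seed_draws = [random.Random(base_seed).random() for _ in range(4)]

print(len(set(draws)), len(set(same_seed_draws)))
```

The same idea applies to any per-process generator: the seed stays reproducible, but is no longer shared between workers.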
-
get_data_definitions
()[source]¶ Getter for the data_definitions dict so that it can be accessed by a worker to establish handshaking with the miprometheus.models.Model class.
Returns: self.data_definitions
-
evaluate_loss
(data_dict, logits)[source]¶ Calculates loss between the predictions / logits and targets (from data_dict) using the selected loss function.
Parameters:
- data_dict (miprometheus.utils.DataDict) – DataDict containing (among others) inputs and targets.
- logits – Predictions of the model.
Returns: Loss.
-
add_statistics
(stat_col)[source]¶ Adds statistics to miprometheus.utils.StatisticsCollector.
Note
Empty - To be redefined in inheriting classes.
Parameters: stat_col – miprometheus.utils.StatisticsCollector.
-
collect_statistics
(stat_col, data_dict, logits)[source]¶ Base statistics collection.
Note
Empty - To be redefined in inheriting classes. The user has to ensure that the corresponding entry in the miprometheus.utils.StatisticsCollector has been created with add_statistics() beforehand.
Parameters:
- stat_col – miprometheus.utils.StatisticsCollector.
- data_dict (miprometheus.utils.DataDict) – DataDict containing inputs and targets.
- logits – Predictions being output of the model (torch.Tensor).
-
add_aggregators
(stat_agg)[source]¶ Adds statistical aggregators to miprometheus.utils.StatisticsAggregator.
Note
Empty - To be redefined in inheriting classes.
Parameters: stat_agg – miprometheus.utils.StatisticsAggregator.
-
aggregate_statistics
(stat_col, stat_agg)[source]¶ Aggregates the statistics collected by miprometheus.utils.StatisticsCollector and adds the results to miprometheus.utils.StatisticsAggregator.
Note
Empty - To be redefined in inheriting classes. The user can override this function in subclasses but should call aggregate_statistics() to collect basic statistical aggregators (if set).
Parameters:
- stat_col – miprometheus.utils.StatisticsCollector.
- stat_agg – miprometheus.utils.StatisticsAggregator.
-
initialize_epoch
(epoch)[source]¶ Function called to initialize a new epoch.
Note
Empty - To be redefined in inheriting classes.
Parameters: epoch (int) – current epoch index
-
finalize_epoch
(epoch)[source]¶ Function called at the end of an epoch to execute a few tasks.
Note
Empty - To be redefined in inheriting classes.
Parameters: epoch (int) – current epoch index
-
plot_preprocessing
(data_dict, logits)[source]¶ Allows for some data preprocessing before the model creates a plot for visualization during training or inference.
Note
Empty - To be redefined in inheriting classes.
Parameters:
- data_dict (miprometheus.utils.DataDict) – DataDict.
- logits – Predictions of the model (torch.Tensor).
Returns: data_dict, logits after preprocessing.
-
curriculum_learning_initialize
(curriculum_params)[source]¶ Initializes curriculum learning - simply saves the curriculum params.
Note
This method can be overwritten in the derived classes.
Parameters: curriculum_params – Interface to parameters accessing curriculum learning view of the registry tree.
-
curriculum_learning_update_params
(episode)[source]¶ Updates problem parameters according to curriculum learning.
Note
This method can be overwritten in the derived classes.
Parameters: episode (int) – Number of the current episode. Returns: True informing that Curriculum Learning wasn’t active at all (i.e. is finished).
-
check_and_download
(file_folder_to_check, url='none', download_name='~/data/downloaded')[source]¶ Checks whether a file or folder exists at given path (relative to storage folder), otherwise downloads files from the given URL.
Parameters: Returns: False if file was found, True if a download was necessary.
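The return convention described above (False when the path already exists, True when a download was needed) can be sketched as follows. `check_and_fetch` and `write_placeholder` are hypothetical names, and the download is replaced by a local file write so that the example runs offline; a real implementation might use `urllib.request.urlretrieve` at that point.

```python
import os
import tempfile


def check_and_fetch(path, fetch):
    """Return False if `path` exists; otherwise call `fetch(path)` and return True."""
    full_path = os.path.expanduser(path)
    if os.path.exists(full_path):
        return False
    fetch(full_path)
    return True


def write_placeholder(path):
    # Stand-in for a real download; it only creates a small local file.
    with open(path, "w") as handle:
        handle.write("placeholder")


with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "dataset.bin")
    first_call = check_and_fetch(target, write_placeholder)   # file missing: fetched
    second_call = check_and_fetch(target, write_placeholder)  # file present: skipped

print(first_call, second_call)
```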
-
ProblemFactory¶
-
class
miprometheus.problems.
ProblemFactory
[source]¶ ProblemFactory: Class instantiating the specified problem class using the passed params.
-
static
build
(params)[source]¶ Static method returning a particular problem, depending on the name provided in the list of parameters.
Parameters: params (miprometheus.utils.ParamInterface) – Parameters used to instantiate the Problem class.
Note
params should contain the exact (case-sensitive) class name of the Problem to instantiate.
Returns: Instance of a given problem.
ImageTextToClass Problems¶
-
class
miprometheus.problems.
ImageTextToClassProblem
(params)[source]¶ Abstract base class for VQA (Visual Question Answering) problems.
Problem classes like CLEVR inherit from it.
Provides some basic features useful in all problems of such type.
-
__init__
(params)[source]¶ Initializes problem:
Calls the problems.problem.Problem class constructor,
sets the loss function to CrossEntropy,
sets self.data_definitions to:
>>> self.data_definitions = {'texts': {'size': [-1, -1], 'type': [torch.Tensor]},
>>>                          'images': {'size': [-1, -1, -1, 3], 'type': [torch.Tensor]},
>>>                          'targets': {'size': [-1, 1], 'type': [torch.Tensor]}
>>>                          }
Parameters: params – Dictionary of parameters (read from configuration .yaml file).
-
calculate_accuracy
(data_dict, logits)[source]¶ Calculates the accuracy as the mean number of correct answers in a given batch.
Parameters: - data_dict (DataDict) – DataDict containing the targets.
- logits – Predictions of the model.
Returns: Accuracy.
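As a rough illustration of "mean number of correct answers in a batch", the sketch below takes the argmax of each row of logits and compares it to the target index. It uses plain Python lists instead of tensors, and `batch_accuracy` is a hypothetical helper rather than the method's actual code.

```python
def batch_accuracy(logits, targets):
    """Fraction of samples whose highest-scoring class equals the target index."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(pred == target) for pred, target in zip(predictions, targets))
    return correct / len(targets)


# Four samples, three classes; the third sample is misclassified (predicted 2, target 1).
logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9], [0.4, 0.3, 0.2]]
targets = [1, 0, 1, 0]

accuracy = batch_accuracy(logits, targets)
print(accuracy)
```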
-
add_statistics
(stat_col)[source]¶ Adds accuracy statistic to StatisticsCollector.
Parameters: stat_col – StatisticsCollector.
-
collect_statistics
(stat_col, data_dict, logits)[source]¶ Collects accuracy.
Parameters:
- stat_col – StatisticsCollector.
- data_dict (DataDict) – DataDict containing the targets and the mask.
- logits – Predictions of the model.
-
CLEVR¶
-
class
miprometheus.problems.
CLEVR
(params)[source]¶ CLEVR Dataset class: Represents the CLEVR dataset.
See reference here: https://cs.stanford.edu/people/jcjohns/clevr/
Parameters: params (miprometheus.utils.ParamInterface) – Dictionary of parameters (read from configuration .yaml file).
Given the relative complexity of this class, params should follow a specific template. Here are 2 examples:
>>> params = {'settings': {'data_folder': '~/data/CLEVR_v1.0',
>>>                        'set': 'train',
>>>                        'dataset_variant': 'CLEVR'},
>>>           'images': {'raw_images': False,
>>>                      'feature_extractor': {'cnn_model': 'resnet101',
>>>                                            'num_blocks': 4}},
>>>           'questions': {'embedding_type': 'random', 'embedding_dim': 300}}
>>> params = {'settings': {'data_folder': '~/data/CLEVR_v1.0',
>>>                        'set': 'train',
>>>                        'dataset_variant': 'CLEVR-Humans'},
>>>           'images': {'raw_images': True},
>>>           'questions': {'embedding_type': 'glove.6B.300d'}}
params is separated in 3 sections:
- settings: generic settings for the CLEVR class,
- images: specific parameters for the images,
- questions: specific parameters for the questions.
Here is a breakdown of the available options:
settings:
data_folder: Root folder of the dataset. Will also be used to store generated files (e.g. tokenization of the questions, features extracted from the images etc.)
Warning
As of now, this class doesn’t handle downloading & decompressing the dataset if it is not present in the data_folder. Please make sure that the dataset is already present in this data_folder.
- For CLEVR-Humans, since only the questions change (and the images remain the same), please put the corresponding .json files in ~/CLEVR_v1.0/questions/.
- For CLEVR-CoGenT, this is a fairly separate dataset with different questions & images. Indicate data_folder as the root to ~/CLEVR_CoGenT_v1.0/ in this case.
set: either “train”, “val” in the case of “CLEVR” & “CLEVR-Humans”, and “valA”, “valB” or “trainA” in the case of CLEVR-CoGenT. “test” is not supported yet since ground truth answers are not distributed by the CLEVR authors.
dataset_variant: either “CLEVR”, “CLEVR-CoGenT” or “CLEVR-Humans”.
images:
raw_images: whether or not to use the original images as the visual source. If False, then feature_extractor cannot be empty. The visual source will then be features extracted from the original images using a specified pretrained CNN.
cnn_model: In the case of features extracted from the original images, the specific CNN model to use. Must be part of torchvision.models.
num_blocks: In the case of features extracted from the original images, this represents the number of layers to use from cnn_model.
Warning
This is not verified in any way by this class.
questions:
embedding_type: string to indicate the pretrained embedding to use: either “random” to use nn.Embedding or one of the following:
- “charngram.100d”,
- “fasttext.en.300d”,
- “fasttext.simple.300d”,
- “glove.42B.300d”,
- “glove.840B.300d”,
- “glove.twitter.27B.25d”,
- “glove.twitter.27B.50d”,
- “glove.twitter.27B.100d”,
- “glove.twitter.27B.200d”,
- “glove.6B.50d”,
- “glove.6B.100d”,
- “glove.6B.200d”,
- “glove.6B.300d”
embedding_dim: In the case of a random embedding_type, this is the embedding dimension to use.
embedding_source: In the case of a random embedding_type, this is the source of the embeddings to use. str, equal to one of the dataset variants: “CLEVR”, “CLEVR-CoGenT” or “CLEVR-Humans”.
Warning
If this embedding_source is different from the indicated dataset_variant above:
The class assumes that the following files exist in data_folder/generated_files:
- A file <embedding_source>_embedding_weights.pkl corresponding to the random embedding weights to use,
- A file <embedding_source>_dics.pkl corresponding to the dicts {'words': index} & {'answer': index}.
The class will then skip checking whether the file containing the tokenized questions exists, and instead load the <embedding_source>_dics.pkl file, and use it to tokenize the questions.
Nonetheless, the tokenized questions and dicts will not be saved to file.
The class will also load the <embedding_source>_embedding_weights.pkl file and use it as the weights of the random embedding layer.
This is particularly useful to finetune or test a CLEVR-trained model on CoGenT-A or CoGenT-B.
This should work for both the training & validation samples, although it has only been tested on validation samples so far.
Note
The following is set by default:
>>> params = {'settings': {'data_folder': '~/data/CLEVR_v1.0',
>>>                        'set': 'train',
>>>                        'dataset_variant': 'CLEVR'},
>>>           'images': {'raw_images': True},
>>>           'questions': {'embedding_type': 'random', 'embedding_dim': 300, 'embedding_source': 'CLEVR'}}
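One way to picture how user-provided values combine with these defaults is a recursive dict merge, where user keys win and missing keys fall back to the defaults. This is only a sketch under that assumption: `with_defaults` is a hypothetical helper, and how ParamInterface actually resolves defaults is not described here.

```python
import copy


def with_defaults(user_params, defaults):
    """Return a new dict: defaults overridden, section by section, by user_params."""
    merged = copy.deepcopy(defaults)
    for key, value in user_params.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = with_defaults(value, merged[key])  # merge nested sections
        else:
            merged[key] = copy.deepcopy(value)  # user value replaces the default
    return merged


defaults = {
    "settings": {"data_folder": "~/data/CLEVR_v1.0", "set": "train", "dataset_variant": "CLEVR"},
    "images": {"raw_images": True},
    "questions": {"embedding_type": "random", "embedding_dim": 300, "embedding_source": "CLEVR"},
}
user_params = {
    "settings": {"dataset_variant": "CLEVR-Humans"},
    "questions": {"embedding_type": "glove.6B.300d"},
}

params = with_defaults(user_params, defaults)
print(params["settings"])
```

Only the two overridden keys change; the defaults dict itself is left untouched.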
-
__init__
(params)[source]¶ Instantiate the CLEVR class.
Parameters: params (miprometheus.utils.ParamInterface) – Dictionary of parameters (read from configuration .yaml file).
-
parse_param_tree
(params)[source]¶ Parses the parameters tree passed as input to the constructor.
Due to the relative complexity inherent to the several variants of the CLEVR dataset (Humans, CoGenT) and the processing available to both the images (features extraction or not) and the questions (which type of embedding to use), this step is of relative importance.
Parameters: params (miprometheus.utils.ParamInterface) – Dictionary of parameters (read from configuration .yaml file).
-
generate_questions_dics
(set, word_dic=None, answer_dic=None, save_to_file=True)[source]¶ Loads the questions from the .json file, tokenizes them, creates vocab dicts and saves them to files.
Parameters:
- set (str) – String to specify which dataset to use: train, val (test not handled yet).
- word_dic (dict) – dict {'word': index} to be used to tokenize the questions. Optional. If passed, it is used and unseen words are added. If not passed, an empty one is created.
- answer_dic (dict) – dict {'answer': index} to be used to process the answers. Optional. If passed, it is used and unseen answers are added. If not passed, an empty one is created.
- save_to_file (bool, default: True) – Whether to save the tokenized questions and the dicts to file.
Returns: A dict, containing for each question:
- The tokenized question,
- The answer,
- The original question string,
- The original path to the associated image,
- The question type.
The word_dic.
The answer_dic.
-
generate_feature_maps_file
()[source]¶ Uses miprometheus.utils.GenerateFeatureMaps to pass the CLEVR images through a pretrained CNN model.
-
__getitem__
(index)[source]¶ Getter method to access the dataset and return a sample.
Parameters: index (int) – index of the sample to return.
Returns: DataDict({‘images’, ‘questions’, ‘questions_length’, ‘questions_string’, ‘questions_type’, ‘targets’, ‘targets_string’, ‘index’, ‘imgfiles’}), with:
- images: extracted feature maps from the raw image
- questions: tensor of word indexes
- questions_length: len(question)
- questions_string: original question string
- questions_type: category of the question (query, count…)
- targets: index of the answer in the answers dictionary
- targets_string: None for now
- index: index of the sample
- imgfiles: image filename
-
collate_fn
(batch)[source]¶ Combines a list of DataDict (retrieved with __getitem__()) into a batch.
Note
Because each tokenized question has a variable length, padding is necessary to create batches.
Hence, for a given batch, each question is padded to the length of the longest one.
This length changes between batches, but this shouldn’t be an issue.
Parameters: batch (list) – list of individual samples to combine.
Returns: DataDict({‘images’, ‘questions’, ‘questions_length’, ‘questions_string’, ‘questions_type’, ‘targets’, ‘targets_string’, ‘index’, ‘imgfiles’})
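The per-batch padding described in the note can be sketched on plain lists of word indexes. `pad_questions` is a hypothetical helper and the padding index 0 is an assumption; the actual method operates on tensors inside a DataDict.

```python
def pad_questions(questions, pad_index=0):
    """Pad every tokenized question to the length of the longest one in the batch."""
    lengths = [len(question) for question in questions]
    max_length = max(lengths)
    padded = [question + [pad_index] * (max_length - len(question)) for question in questions]
    return padded, lengths


# Three tokenized questions of lengths 3, 2 and 5.
batch = [[4, 8, 15], [16, 23], [42, 7, 9, 3, 11]]
padded, lengths = pad_questions(batch)

print(padded)
print(lengths)
```

Keeping the original lengths alongside the padded questions mirrors the `questions_length` key of the returned DataDict.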
-
finalize_epoch
(epoch)[source]¶ Empty for now.
Will call get_acc_per_family() to get the accuracy per family once it has been refactored.
Parameters: epoch (int) – current epoch index
-
initialize_epoch
(epoch)[source]¶ Resets the accuracy per category counters.
Parameters: epoch (int) – current epoch index
-
get_acc_per_family
(data_dict, logits)[source]¶ Compute the accuracy per family for the current batch. Also accumulates the number of correct predictions & questions per family in self.correct_pred_families (saved to file).
Note
To refactor.
Parameters:
- data_dict (miprometheus.utils.DataDict) – DataDict({‘images’, ‘questions’, ‘questions_length’, ‘questions_string’, ‘questions_type’, ‘targets’, ‘targets_string’, ‘index’, ‘imgfiles’})
- logits (torch.Tensor) – network predictions.
-
show_sample
(data_dict, sample=0)[source]¶ Show a sample of the current DataDict.
Parameters:
- data_dict (miprometheus.utils.DataDict) – DataDict({‘images’, ‘questions’, ‘questions_length’, ‘questions_string’, ‘questions_type’, ‘targets’, ‘targets_string’, ‘index’, ‘imgfiles’})
- sample (int) – sample index to visualize.
-
plot_preprocessing
(data_dict, logits)[source]¶ Recovers the predicted answer (as a string) from the logits and adds it to the current DataDict. Will be used in models.model.Model.plot().
Parameters:
- data_dict (miprometheus.utils.DataDict) – DataDict({‘images’, ‘questions’, ‘questions_length’, ‘questions_string’, ‘questions_type’, ‘targets’, ‘targets_string’, ‘index’, ‘imgfiles’})
- logits (torch.Tensor) – Predictions of the model.
Returns:
- data_dict with one added predicted answer key,
- logits
Sort-Of-CLEVR¶
-
class
miprometheus.problems.
SortOfCLEVR
(params)[source]¶ Sort-of-CLEVR is a simple VQA problem, where the goal is to answer the question regarding a given image. Implementation of the generation is inspired by: https://github.com/gitlimlab/Relation-Network-Tensorflow
Improvements:
- Generates scenes with dynamic varying number of objects (2-6)
- More types of intra- and inter-relational questions
- More natural interpretation of questions
Parameters: Note
When generating the dataset, this class:
First verifies if a file with a matching filename already exists in the data_folder. The filename follows this template:
>>> filename = '<split>_<size>_<img_size>.hy'
If such a file exists, it is loaded and used as the dataset. If not, it is created and then used.
If regenerate is True, the file is recreated regardless of whether one with the matching filename already exists.
Note
The following is set by default:
>>> params = {'data_folder': '~/data/sort-of-clevr/',
>>>           'split': 'train',
>>>           'regenerate': False,
>>>           'size': 10000,
>>>           'img_size': 128}
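The filename template and the regenerate rule described in the notes above can be sketched as two small functions. Both `dataset_filename` and `must_generate` are hypothetical names; the `exists` argument is only there so that the decision can be tried without touching the disk.

```python
import os


def dataset_filename(split, size, img_size):
    """Build the '<split>_<size>_<img_size>.hy' filename from the parameters."""
    return "{}_{}_{}.hy".format(split, size, img_size)


def must_generate(data_folder, filename, regenerate, exists=os.path.exists):
    """Generate when forced by `regenerate` or when no matching file exists."""
    path = os.path.join(os.path.expanduser(data_folder), filename)
    return regenerate or not exists(path)


# Using the default parameter values listed above.
filename = dataset_filename("train", 10000, 128)
print(filename)
```

For instance, `must_generate('~/data/sort-of-clevr/', filename, regenerate=False)` is True only when the file is absent from the folder.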
-
__init__
(params)[source]¶ Initializes Sort-of-CLEVR problem, calls base class ImageTextToClassProblem initialization, sets properties using the provided parameters.
Parameters: params – Dictionary of parameters (read from configuration .yaml file).
-
load_dataset
(data_folder, data_filename)[source]¶ Loads the dataset from the HDF5-encoded file.
Note
This function first checks whether a dataset with the same filename already exists in the specified data_folder (this filename contains the number of samples and image size of the samples). If no such file exists, it is generated and saved in data_folder (with the specified data_filename).
-
generate_h5py_dataset
(filename)[source]¶ Generates a whole new Sort-of-CLEVR dataset and saves it in the form of a HDF5 file.
Parameters: filename (str) – name of the file containing the samples.
-
__getitem__
(index)[source]¶ Getter method to access the dataset and return a sample.
Warning
HDF5 does not support multi threaded data access with num_workers > 1 on the data loading. A way around this is to move every call for opening the HDF5 file to this __getitem__ method.
See https://discuss.pytorch.org/t/hdf5-multi-threaded-alternative/6189/9 for more info.
Parameters: index – index of the sample to return.
Returns: DataDict({‘images’, ‘questions’, ‘targets’, ‘targets_index’, ‘scenes_description’}), with:
- images: images (self.img_size)
- questions: encoded questions
- targets: one-hot encoded answers
- targets_index: index of the answers
- scenes_description: Scene description.
-
collate_fn
(batch)[source]¶ Combines a list of DataDict (retrieved with __getitem__) into a batch.
Note
This function wraps a call to default_collate and simply returns the batch as a DataDict instead of a dict.
Parameters: batch – list of individual DataDict samples to combine.
Returns: DataDict({'images','questions', 'targets', 'targets_index', 'scenes_description'}) containing the batch.
-
color2str
(color_index)[source]¶ Decodes the specified color index and returns it as a string.
Parameters: color_index (int) – Index of the color. Returns: color name as a string.
-
shape2str
(shape_index)[source]¶ Decodes the specified shape index and returns it as a string.
Parameters: shape_index (int) – Index of the shape. Returns: shape name as a string.
-
question_type_template
(question_index)[source]¶ Decodes the specified question index and returns the corresponding string template.
Parameters: question_index (int) – Index of the question. Returns: corresponding string template.
-
question2str
(encoded_question)[source]¶ Decodes the encoded question, i.e. produces a human-understandable string.
Parameters: encoded_question (tensor) – Concatenation of two one-hot vectors:
- The first one denotes the object of interest (its color),
- The second one denotes the question type.
Returns: The question as a human-understandable string.
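Decoding a concatenation of two one-hot vectors amounts to splitting it at the number of colors and looking up each half. In the sketch below, the color list and the question templates are made up for illustration and may differ from the ones used by the class; `decode_question` is likewise a hypothetical helper.

```python
# Hypothetical vocabularies; the real class may use other colors and templates.
COLORS = ["red", "green", "blue", "orange", "gray", "yellow"]
QUESTION_TEMPLATES = [
    "What is the shape of the {} object?",
    "Is the {} object a circle?",
    "Is the {} object on the left side of the image?",
]


def decode_question(encoded_question):
    """Split the vector into [color one-hot | question-type one-hot] and decode both."""
    color_bits = list(encoded_question[:len(COLORS)])
    type_bits = list(encoded_question[len(COLORS):])
    color = COLORS[color_bits.index(1)]
    template = QUESTION_TEMPLATES[type_bits.index(1)]
    return template.format(color)


# Color index 2 ("blue") followed by question type index 1.
encoded = [0, 0, 1, 0, 0, 0] + [0, 1, 0]
question = decode_question(encoded)
print(question)
```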
-
answer2str
(encoded_answer)[source]¶ Decodes the answer and returns the corresponding label.
Parameters: encoded_answer (np.array) – Answer index, encoded as a one-hot vector. Returns: answer label.
-
scene2str
(objects)[source]¶ Returns a string containing the shape, color and position of every object forming the scene.
Parameters: objects – List of objects - abstract scene representation. Returns: Str containing the scene description.
-
generate_scene_representation
()[source]¶ Generates the scene representation.
Returns: List of objects - abstract scene representation.
-
generate_image
(objects)[source]¶ Generates the image on the basis of a given scene representation.
Parameters: objects – List of objects - abstract scene representation. Returns: np.array
containing the generated image.
-
generate_question_matrix
(objects)[source]¶ Generates the questions matrix: [# of shape * # of Q, # of color + # of Q].
This matrix contains all possible questions for a given scene representation.
Parameters: objects – List of objects - abstract scene representation.
Returns: the questions matrix (np.array)
-
generate_answer_matrix
(objects)[source]¶ Generates the answers matrix: [# of shape * # of Q, # of color + 4]
# of color + 4 = [color 1, color 2, … , circle, rectangle, yes, no]
Parameters: objects (list) – List of objects - abstract scene representation.
Returns: the answer matrix (np.array)
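The column layout given above, [color 1, color 2, … , circle, rectangle, yes, no], implies that each answer row is a one-hot vector over the colors plus four fixed labels. A minimal sketch of that layout follows, with a made-up color list and a hypothetical `decode_answer` helper:

```python
# Hypothetical color list; only the ordering "colors first, then 4 labels" is from the text.
COLORS = ["red", "green", "blue", "orange", "gray", "yellow"]
ANSWER_LABELS = COLORS + ["circle", "rectangle", "yes", "no"]


def decode_answer(encoded_answer):
    """Return the label at the position of the single 1 in the one-hot vector."""
    return ANSWER_LABELS[list(encoded_answer).index(1)]


# One-hot row selecting "yes": all color bits are 0, third of the 4 extra bits is 1.
encoded_yes = [0] * len(COLORS) + [0, 0, 1, 0]
answer = decode_answer(encoded_yes)
print(answer, len(ANSWER_LABELS))
```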
-
plot_preprocessing
(data_dict, logits)[source]¶ Allows for some data preprocessing before the model creates a plot for visualization during training or inference. To be redefined in inheriting classes.
Parameters: - data_dict – DataDict({‘images’,’questions’, ‘targets’, ‘targets_index’, ‘scenes_description’})
- logits (Tensor) – Predictions of the model.
Returns: data_tuple, aux_tuple, logits after preprocessing.
ShapeColorQuery¶
-
class
miprometheus.problems.
ShapeColorQuery
(params)[source]¶ Shape-Color-Query is a variation of the Sort-of-CLEVR problem, where the question is a sequence composed of three items:
- The first two are encoding the object, identified by its color & shape,
- The third is encoding the query.
Please see the SortOfCLEVR documentation for more information.
-
__init__
(params)[source]¶ Initializes the Shape-Color-Query problem, calls base class SortOfCLEVR initialization, sets properties using the provided parameters.
Parameters: params (miprometheus.utils.ParamInterface) – Dictionary of parameters (read from configuration .yaml file).
Note
The following is set by default:
>>> params = {'data_folder': '~/data/shape-color-query/',
>>>           'split': 'train',
>>>           'regenerate': False,
>>>           'size': 10000,
>>>           'img_size': 128}
-
question2str
(encoded_question)[source]¶ Decodes the question, i.e. produces a human-understandable string.
Parameters: encoded_question – A 3D tensor, with 1 row and 3 columns:
- The first two are encoding the object, identified by its shape & color,
- The third is encoding the query.
Returns: Question in the form of a string.
-
generate_question_matrix
(objects)[source]¶ Generates the questions tensor: [# of objects * # of Q, 3, encoding], where the 2nd dimension (temporal) encodes consecutively: shape, color, query
Parameters: objects – List of objects - abstract scene representation. Returns: a 3D tensor [# of questions for the whole scene, 3, num_bits]
ImageToClass Problems¶
-
class
miprometheus.problems.
ImageToClassProblem
(params_, name_)[source]¶ Abstract base class for image classification problems.
Problem classes like MNIST & CIFAR10 inherit from it.
Provides some basic features useful in all problems of such type.
-
__init__
(params_, name_)[source]¶ Initializes problem:
Calls the problems.problem.Problem class constructor,
sets the loss function to CrossEntropy,
sets self.data_definitions to:
>>> self.data_definitions = {'images': {'size': [-1, 3, -1, -1], 'type': [torch.Tensor]},
>>>                          'targets': {'size': [-1, 1], 'type': [torch.Tensor]},
>>>                          'targets_label': {'size': [-1, 1], 'type': [list, str]}
>>>                          }
Parameters:
- params – Dictionary of parameters (read from configuration .yaml file).
- name – Name of the problem.
-
calculate_accuracy
(data_dict, logits)[source]¶ Calculates the accuracy as the mean number of correct classifications in a given batch.
Parameters: - logits – Predictions of the model.
- data_dict (DataDict) – DataDict containing the targets.
Returns: Accuracy.
-
add_statistics
(stat_col)[source]¶ Adds accuracy statistic to StatisticsCollector.
Parameters: stat_col – StatisticsCollector.
-
collect_statistics
(stat_col, data_dict, logits)[source]¶ Collects accuracy.
Parameters:
- stat_col – StatisticsCollector.
- data_dict (DataDict) – DataDict containing the targets and the mask.
- logits – Predictions of the model.
-
add_aggregators
(stat_agg)[source]¶ Adds problem-dependent statistical aggregators to StatisticsAggregator.
Parameters: stat_agg – StatisticsAggregator.
-
CIFAR-10¶
-
class
miprometheus.problems.
CIFAR10
(params)[source]¶ Image classification problem using the CIFAR-10 dataset.
Please see reference here: https://www.cs.toronto.edu/~kriz/cifar.html
Warning
The dataset is not originally split into a training set, validation set and test set; only training and test set. It is recommended to use a validation set.
torch.utils.data.SubsetRandomSampler is recommended.
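Carving a validation set out of the training set comes down to splitting the sample indexes into two disjoint lists. The sketch below does this with the standard library only; `split_indices`, the 10% fraction and the seed are illustrative assumptions, while 50000 is the size of the CIFAR-10 training set.

```python
import random


def split_indices(num_samples, validation_fraction, seed=0):
    """Shuffle the indexes and return (train_indices, validation_indices)."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    num_validation = int(num_samples * validation_fraction)
    return indices[num_validation:], indices[:num_validation]


# CIFAR-10 ships 50000 training images; hold out 10% of them for validation.
train_indices, validation_indices = split_indices(50000, 0.1)
print(len(train_indices), len(validation_indices))
```

Each list can then be wrapped in its own `torch.utils.data.SubsetRandomSampler` and passed to a DataLoader through its `sampler` argument.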
__init__
(params)[source]¶ Initializes the CIFAR-10 problem:
Calls the problems.problem.ImageToClassProblem class constructor,
Sets the following attributes using the provided params:
self.data_folder (string): Root directory of the dataset where the directory cifar-10-batches-py will be saved,
self.use_train_data (bool, optional): If True, creates dataset from training set, otherwise creates from test set,
self.resize: (optional) resize the images to [h, w] if set,
self.default_values:
>>> self.default_values = {'num_classes': 10,
>>>                        'num_channels': 3,
>>>                        'width': self.width,  # (DEFAULT: 32)
>>>                        'height': self.height}  # (DEFAULT: 32)
self.data_definitions:
>>> self.data_definitions = {'images': {'size': [-1, 3, self.height, self.width], 'type': [torch.Tensor]},
>>>                          'targets': {'size': [-1], 'type': [torch.Tensor]},
>>>                          'targets_label': {'size': [-1, 1], 'type': [list, str]}
>>>                          }
Warning
Resizing images might cause a significant slow down in batch generation.
Note
The following is set by default:
>>> params = {'data_folder': '~/data/cifar10',
>>>           'use_train_data': True}
Parameters: params (miprometheus.utils.ParamInterface) – Dictionary of parameters (read from configuration .yaml file).
-
__getitem__
(index)[source]¶ Getter method to access the dataset and return a sample.
Parameters: index (int) – index of the sample to return.
Returns: DataDict({'images','targets', 'targets_label'}), with:
- images: Image, resized if indicated in params,
- targets: Index of the target class
- targets_label: Label of the target class (cf self.labels)
-
collate_fn
(batch)[source]¶ Combines a list of DataDict (retrieved with __getitem__) into a batch.
Note
This function wraps a call to default_collate and simply returns the batch as a DataDict instead of a dict.
Multi-processing is supported as the data sources are small enough to be kept in memory (self.root-dir/cifar-10-batches/data_batch_i have a size of 31.0 MB).
Parameters: batch – list of individual DataDict samples to combine.
Returns: DataDict({'images','targets', 'targets_label'}) containing the batch.
-
MNIST¶
-
class
miprometheus.problems.
MNIST
(params_)[source]¶ Classic MNIST classification problem.
Please see reference here: http://yann.lecun.com/exdb/mnist/
Warning
The dataset is originally split only into a training set and a test set; there is no validation set. It is recommended to hold out a validation set from the training data.
torch.utils.data.SubsetRandomSampler
is recommended for this.
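The validation split recommended above can be sketched as follows. This is only an illustration under stated assumptions: a random `TensorDataset` stands in for the 60,000-sample MNIST training set so that the snippet runs without any download, and the 10% validation share and the batch size of 64 are arbitrary choices, not library defaults.

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Stand-in for the MNIST training set (assumption: random data, 1000 samples).
num_samples = 1000
dataset = TensorDataset(torch.rand(num_samples, 1, 28, 28),
                        torch.randint(0, 10, (num_samples,)))

# Shuffle the indices once, then hold out a hypothetical 10% for validation.
indices = torch.randperm(num_samples).tolist()
num_valid = num_samples // 10
valid_indices, train_indices = indices[:num_valid], indices[num_valid:]

# Each loader only ever draws from its own, disjoint subset of indices.
train_loader = DataLoader(dataset, batch_size=64,
                          sampler=SubsetRandomSampler(train_indices))
valid_loader = DataLoader(dataset, batch_size=64,
                          sampler=SubsetRandomSampler(valid_indices))
```

With an actual problem object, the same samplers would presumably be passed alongside the problem's own `collate_fn`.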
__init__
(params_)[source]¶ Initializes MNIST problem:
Calls
problems.problem.ImageToClassProblem
class constructor,Sets following attributes using the provided
params
:self.data_folder
(string) : Root directory of dataset whereprocessed/training.pt
andprocessed/test.pt
will be saved,self.use_train_data
(bool, optional) : If True, creates dataset fromtraining.pt
, otherwise fromtest.pt
self.resize
: (optional) resize the images to [h, w] if set,
self.default_values
:>>> self.default_values = {'num_classes': 10,
>>>                        'num_channels': 1,
>>>                        'width': self.width,    # (DEFAULT: 28)
>>>                        'height': self.height}  # (DEFAULT: 28)
self.data_definitions
:>>> self.data_definitions = {'images': {'size': [-1, 1, self.height, self.width], 'type': [torch.Tensor]},
>>>                          'targets': {'size': [-1], 'type': [torch.Tensor]},
>>>                          'targets_label': {'size': [-1, 1], 'type': [list, str]}
>>>                         }
Warning
Resizing images might cause a significant slow down in batch generation.
Note
The following is set by default:
>>> self.params.add_default_params({'data_folder': '~/data/mnist',
>>>                                 'use_train_data': True})
Parameters: params – Dictionary of parameters (read from configuration .yaml
file).
-
__getitem__
(index)[source]¶ Getter method to access the dataset and return a sample.
Parameters: index (int) – index of the sample to return. Returns: DataDict({'images','targets', 'targets_label'})
, with:- images: Image, resized if
self.resize
is set, - targets: Index of the target class
- targets_label: Label of the target class (cf
self.labels
)
-
collate_fn
(batch)[source]¶ Combines a list of
DataDict
(retrieved with__getitem__
) into a batch.Note
This function wraps a call to
default_collate
and simply returns the batch as aDataDict
instead of a dict. Multi-processing is supported as the data sources are small enough to be kept in memory (training.pt has a size of 47.5 MB).Parameters: batch – list of individual DataDict
samples to combine.Returns: DataDict({'images','targets', 'targets_label'})
containing the batch.
-
SequenceToSequence Problems¶
-
class
miprometheus.problems.
SeqToSeqProblem
(params)[source]¶ Class representing base class for all sequential problems.
-
__init__
(params)[source]¶ Initializes problem object. Calls base constructor.
Parameters: params – Dictionary of parameters (read from configuration .yaml
file).
-
evaluate_loss
(data_dict, logits)[source]¶ Calculates accuracy equal to mean number of correct predictions in a given batch. WARNING: Applies mask to both logits and targets!
Parameters: - data_dict – DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘mask’}).
- logits – Predictions being output of the model.
-
VQA Problems¶
-
class
miprometheus.problems.
VQAProblem
(params)[source]¶ Abstract base class for sequential VQA problems.
COG inherits from it (for now).
Provides some basic features useful in all problems of such type.
-
__init__
(params)[source]¶ Initializes problem:
- Calls
miprometheus.problems.SeqToSeqProblem
class constructor, - Sets loss function to
torch.nn.CrossEntropyLoss
, - Sets
self.data_definitions
to:
>>> self.data_definitions = {'images': {'size': [-1, -1, 3, -1, -1], 'type': [torch.Tensor]},
>>>                          'mask': {'size': [-1, -1, 1], 'type': [torch.Tensor]},
>>>                          'questions': {'size': [-1, 1], 'type': [list, str]},
>>>                          'targets': {'size': [-1, -1, 1], 'type': [torch.Tensor]},
>>>                          'targets_label': {'size': [-1, 1], 'type': [list, str]}
>>>                         }
Parameters: params ( miprometheus.utils.ParamInterface
) – Dictionary of parameters (read from configuration.yaml
file).
-
show_sample
(data_dict, sample_number=0, sequence_number=0)[source]¶ Shows a sample from the batch.
Parameters: - data_dict (
miprometheus.utils.DataDict
) –DataDict
containing inputs and targets. - sample_number (int) – Number of sample in batch (default: 0)
- sequence_number (int) – Which image in the sequence to display (default: 0)
-
COG¶
-
class
miprometheus.problems.
COG
(params)[source]¶ The COG dataset is a sequential VQA dataset.
Inputs are a sequence of images of simple shapes and characters on a black background, and a question based on these objects that relies on memory which has to be answered at every step of the sequence.
See https://arxiv.org/abs/1803.06092 (A Dataset and Architecture for Visual Reasoning with a Working Memory) for the reference paper.
-
__init__
(params)[source]¶ Initializes the
COG
problem:Calls
miprometheus.problems.VQAProblem
class constructor,Sets the following attributes using the provided
params
:self.data_folder
(string) : Data directory where the dataset is stored.self.set
(string) : ‘val’, ‘test’, or ‘train’self.tasks
(string or list of string) : Which tasks to use. ‘class’, ‘reg’, ‘all’, or a list of tasks such as [‘AndCompareColor’, ‘AndCompareShape’]. Only the selected tasks will be used.self.dataset_type
(string) : Which dataset to use, ‘canonical’, ‘hard’, or ‘generated’. If ‘generated’, please specify ‘examples_per_task’, ‘sequence_length’, ‘memory_length’, and ‘max_distractors’ under ‘generation’. Can also specify ‘nr_processors’ for generation.
Adds the following as default params:
>>> {'data_folder': os.path.expanduser('~/data/cog'),
>>>  'set': 'train',
>>>  'tasks': 'class',
>>>  'dataset_type': 'canonical',
>>>  'initialization_only': False}
Sets:
>>> self.data_definitions = {'images': {'size': [-1, self.sequence_length, 3, self.img_size, self.img_size], 'type': [torch.Tensor]},
>>>                          'tasks': {'size': [-1, 1], 'type': [list, str]},
>>>                          'questions': {'size': [-1, 1], 'type': [list, str]},
>>>                          'targets_reg': {'size': [-1, self.sequence_length, 2], 'type': [torch.Tensor]},
>>>                          'targets_class': {'size': [-1, self.sequence_length, 1], 'type': [list, str]}
>>>                         }
Parameters: params ( miprometheus.utils.ParamInterface
) – Dictionary of parameters (read from configuration.yaml
file).
-
__getitem__
(index)[source]¶ Getter method to access the dataset and return a sample.
Parameters: index (int) – index of the sample to return. Returns: DataDict({'images', 'tasks', 'questions', 'targets_reg', 'targets_class'})
, with:images
: Sequence of images,tasks
: Which task family sample belongs to,questions
: Question on the sequence (this is constant per sequence for COG),targets_reg
: Sequence of targets as tuple of floats for pointing tasks,targets_class
: Sequence of word targets for classification tasks.
-
collate_fn
(batch)[source]¶ Combines a list of
miprometheus.utils.DataDict
(retrieved with__getitem__()
) into a batch.Parameters: batch (list) – individual miprometheus.utils.DataDict
samples to combine.Returns: DataDict({'images', 'tasks', 'questions', 'targets_reg', 'targets_class'})
containing the batch.
-
parse_tasks_and_dataset_type
(params)[source]¶ Parses the task list and dataset type. Then sets folder paths to appropriate values.
Parameters: params ( miprometheus.utils.ParamInterface
) – Dictionary of parameters (read from the configuration.yaml
file).
-
source_dataset
()[source]¶ Handles downloading and unzipping the canonical or hard version of the dataset.
-
add_statistics
(stat_col)[source]¶ Add
COG
-specific stats tomiprometheus.utils.StatisticsCollector
.Parameters: stat_col – miprometheus.utils.StatisticsCollector
.
-
collect_statistics
(stat_col, data_dict, logits)[source]¶ Collects dataset details.
Parameters: - stat_col –
miprometheus.utils.StatisticsCollector
. - data_dict –
miprometheus.utils.DataDict
containing targets. - logits – Prediction of the model (
torch.Tensor
)
-
Algorithmic SequenceToSequence Problems¶
-
class
miprometheus.problems.
AlgorithmicSeqToSeqProblem
(params)[source]¶ Base class for algorithmic sequential problems.
Provides some basic features useful in all problems of such nature.
Note
All derived classes provide two operation modes:
- “optimized”: “__getitem__” in fact does nothing (returns the index), whereas “collate_fn” generates the whole batch.
- “not_optimized”: “__getitem__” generates a single sample, while “collate_fn” collates them.
The advantage of the “not_optimized” mode is that a single batch will contain sequences of varying length. This mode is around 10 times slower, though.
Warning
In both cases the derived classes work as true data generators and do not really care about the indices provided from the list. As a result, each epoch will contain newly generated, and thus different, samples (for the same indices).
Warning
The “optimized” mode is not suited to be used with many dataloader workers, i.e. setting num_workers > 0 will in fact slow down the whole generation (by 3-4 times!).
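The difference between the two modes can be illustrated with a toy generator. This is a simplified sketch, not the library class: `ToyGenerator`, `make_batch` and all method names are hypothetical, samples are flat lists of bits rather than tensors, and a plain loop stands in for the `DataLoader`.

```python
import random


class ToyGenerator:
    """Toy stand-in for an algorithmic problem (hypothetical, not the library class)."""

    def __init__(self, min_len=3, max_len=6, optimized=True):
        self.min_len, self.max_len = min_len, max_len
        if optimized:
            self.get_item, self.collate = self.return_index, self.collate_by_generation
        else:
            self.get_item, self.collate = self.generate_sample, self.collate_samples

    def _random_bits(self, length):
        return [random.randint(0, 1) for _ in range(length)]

    # "optimized": the getter does nothing, the collate step generates the batch.
    def return_index(self, index):
        return index

    def collate_by_generation(self, batch):
        length = random.randint(self.min_len, self.max_len)  # one length per batch
        return [self._random_bits(length) for _ in batch]

    # "not_optimized": every sample draws its own length, collate pads with zeros.
    def generate_sample(self, index):
        return self._random_bits(random.randint(self.min_len, self.max_len))

    def collate_samples(self, batch):
        longest = max(len(sample) for sample in batch)
        return [sample + [0] * (longest - len(sample)) for sample in batch]


def make_batch(problem, indices):
    """Mimics what a data loader does: fetch each item, then collate."""
    return problem.collate([problem.get_item(i) for i in indices])


fast_batch = make_batch(ToyGenerator(optimized=True), range(4))
slow_batch = make_batch(ToyGenerator(optimized=False), range(4))
```

In both cases the indices are ignored, so calling `make_batch` twice with the same indices yields different samples.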
__init__
(params)[source]¶ Initializes problem object. Calls base
SeqToSeqProblem
constructor.Sets
nn.BCEWithLogitsLoss()
as the default loss function.Parameters: params – Dictionary of parameters (read from configuration .yaml
file).
-
pad_collate_tensor_list
(tensor_list, max_seq_len=-1)[source]¶ Method collates list of 2D tensors with varying dimension 0 (“sequence length”). Pads 0 along that dimension.
Parameters: - tensor_list – list [BATCH_SIZE] of tensors [SEQ_LEN, DATA_SIZE] to be padded.
- max_seq_len – max sequence length (DEFAULT: -1 means that it will recalculate it on the fly)
Returns: 3D padded tensor [BATCH_SIZE, MAX_SEQ_LEN, DATA_SIZE]
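A minimal sketch of this kind of zero-padding collation is shown below. It is an assumption-based illustration, not the library's implementation: the function name `pad_collate` is hypothetical, and only the documented input and output shapes are taken from the description above.

```python
import torch


def pad_collate(tensor_list, max_seq_len=-1):
    """Zero-pad 2D tensors [SEQ_LEN, DATA_SIZE] into [BATCH_SIZE, MAX_SEQ_LEN, DATA_SIZE]."""
    if max_seq_len < 0:
        # Recalculate the maximum sequence length on the fly.
        max_seq_len = max(t.size(0) for t in tensor_list)
    data_size = tensor_list[0].size(1)
    batch = torch.zeros(len(tensor_list), max_seq_len, data_size,
                        dtype=tensor_list[0].dtype)
    for i, t in enumerate(tensor_list):
        # Copy the real items; the remaining positions stay zero.
        batch[i, :t.size(0)] = t
    return batch


padded = pad_collate([torch.ones(2, 3), torch.ones(4, 3)])
```

Here `padded` has one row per input tensor, and the shorter sequence is followed by two all-zero items.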
-
generate_batch
(batch_size)[source]¶ Generates a batch of samples of size ‘’batch_size’’ on-the-fly.
Note
To be implemented in the derived algorithmic problem classes.
Parameters: batch_size – Size of the batch to be returned. Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘masks’, ‘num_subsequences’}), with:
- sequences: [BATCH_SIZE, 2*SEQ_LENGTH+2, CONTROL_BITS+DATA_BITS]
- sequences_length: [BATCH_SIZE, 1] (the same random value between self.min_sequence_length and self.max_sequence_length)
- targets: [BATCH_SIZE, 2*SEQ_LENGTH+2, DATA_BITS]
- masks: [BATCH_SIZE, 2*SEQ_LENGTH+2, 1]
- num_subsequences: [BATCH_SIZE, 1]
-
generate_sample_ignore_index
(index)[source]¶ Returns one individual sample generated on-the-fly.
Note
The sequence length is drawn randomly between
self.min_sequence_length
andself.max_sequence_length
.Warning
As the name of the method suggests, ‘’the index’’ will in fact be ignored during generation.
Parameters: index – index of the sample to return (IGNORED). Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘masks’, ‘num_subsequences’}), with: - sequences: [2*SEQ_LENGTH+2, CONTROL_BITS+DATA_BITS],
- sequences_length: [1] (random value between self.min_sequence_length and self.max_sequence_length)
- targets: [2*SEQ_LENGTH+2, DATA_BITS]
- masks: [2*SEQ_LENGTH+2]
- num_subsequences: [1]
-
collate_samples_from_batch
(batch_of_dicts)[source]¶ Generates a batch of samples on-the-fly
Parameters: batch_of_dicts – Should be a list of DataDict retrieved by __getitem__, each containing tensors, numbers, dicts or lists. –> Not Used Here! Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘masks’, ‘num_subsequences’}), with: - sequences: [BATCH_SIZE, 2*MAX_SEQ_LENGTH+2, CONTROL_BITS+DATA_BITS],
- sequences_length: [BATCH_SIZE, 1] (random values between self.min_sequence_length and self.max_sequence_length)
- targets: [BATCH_SIZE, 2*MAX_SEQ_LENGTH+2, DATA_BITS],
- masks: [BATCH_SIZE, 2*MAX_SEQ_LENGTH+2]
- num_subsequences: [BATCH_SIZE, 1]
-
do_not_generate_sample
(index)[source]¶ Method used as __getitem__ in “optimized” mode. It simply returns the received index. The whole generation is done in ‘’collate_fn’’ (i.e. ‘’collate_by_batch_generation’’).
Warning
As the name of the method suggests, the method does not generate the sample.
Parameters: index – index of the sample to return (IGNORED). Returns: index
-
collate_by_batch_generation
(batch)[source]¶ Generates a batch of samples on-the-fly.
Warning
The samples created by
__getitem__
are simply not used in this function. Instead, collate_fn generates a batch of samples on-the-fly, relying on the underlying ‘’generate_batch’’ method; all samples have the same length (randomly selected, though).
Parameters: batch – Not Used Here! Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘masks’, ‘num_subsequences’}), with:
- sequences: [BATCH_SIZE, 2*SEQ_LENGTH+2, CONTROL_BITS+DATA_BITS]
- sequences_length: [BATCH_SIZE, 1] (the same random value between self.min_sequence_length and self.max_sequence_length)
- targets: [BATCH_SIZE, 2*SEQ_LENGTH+2, DATA_BITS]
- masks: [BATCH_SIZE, 2*SEQ_LENGTH+2, 1]
- num_subsequences: [BATCH_SIZE, 1]
-
set_max_length
(max_length)[source]¶ Sets maximum sequence length (property).
Parameters: max_length – Length to be saved as max.
-
curriculum_learning_initialize
(curriculum_params)[source]¶ Initializes curriculum learning - simply saves the curriculum params.
Note
This method can be overwritten in the derived classes.
Parameters: curriculum_params – Interface to parameters accessing curriculum learning view of the registry tree.
-
curriculum_learning_update_params
(episode)[source]¶ Updates problem parameters according to curriculum learning. In the case of algorithmic sequential problems, it updates the max sequence length, depending on configuration parameters.
Parameters: episode (int) – Number of the current episode. Returns: Boolean informing whether curriculum learning is finished (or wasn’t active at all).
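One way such an update could work is a step-wise linear schedule, sketched below. Everything here is an assumption made for illustration: the function name and the parameters `initial_max_length`, `final_max_length` and `interval` are hypothetical and do not claim to match the configuration keys used by the library.

```python
def curriculum_max_length(episode, initial_max_length, final_max_length, interval):
    """Return (max_length, finished) for a step-wise linear curriculum.

    The maximum sequence length grows by one every `interval` episodes until
    it reaches `final_max_length` (all names are illustrative assumptions).
    """
    if interval <= 0:
        # Curriculum learning inactive: use the final length, report "finished".
        return final_max_length, True
    max_length = initial_max_length + episode // interval
    if max_length >= final_max_length:
        return final_max_length, True
    return max_length, False


# Example: start at length 5, grow towards 10, one step every 100 episodes.
schedule = [curriculum_max_length(e, 5, 10, 100) for e in (0, 250, 500)]
```

The boolean mirrors the documented return value: it is `True` once the curriculum has finished or when it was never active.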
-
calculate_accuracy
(data_dict, logits)[source]¶ Calculate accuracy equal to mean difference between outputs and targets.
Warning
Applies mask to both logits and targets.
Parameters: - data_dict – DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘mask’, ‘num_subsequences’}).
- logits (tensor) – Predictions of the model.
Returns: Accuracy.
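A masked accuracy of this kind can be sketched as follows. This is a hedged illustration rather than the library code: it assumes a boolean mask of shape [BATCH_SIZE, SEQ_LENGTH], binary targets, and predictions obtained by rounding the sigmoid of the logits; the function name `masked_accuracy` is hypothetical.

```python
import torch


def masked_accuracy(logits, targets, mask):
    """logits, targets: [BATCH, SEQ, DATA_BITS]; mask: [BATCH, SEQ] (bool)."""
    # Turn logits into hard 0/1 predictions.
    predictions = torch.round(torch.sigmoid(logits))
    # Boolean indexing keeps only the masked time steps: [N_MASKED, DATA_BITS].
    errors = torch.abs(predictions[mask] - targets[mask])
    # Accuracy is one minus the mean absolute difference.
    return 1.0 - errors.mean().item()


logits = torch.tensor([[[5.0, -5.0], [5.0, 5.0]]])
targets = torch.tensor([[[1.0, 0.0], [0.0, 0.0]]])
first_step_only = masked_accuracy(logits, targets, torch.tensor([[True, False]]))
all_steps = masked_accuracy(logits, targets, torch.tensor([[True, True]]))
```

Masking out the second time step hides both wrong bits, which is why the mask has to be applied to logits and targets alike.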
-
add_ctrl
(seq, ctrl, pos)[source]¶ Adds control channels to a sequence.
Parameters: - seq (array_like) – Sequence to which control channels are added.
- ctrl (array_like) – Elements to add.
- pos – Object that defines the index or indices before which ctrl is inserted.
Returns: updated sequence.
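The description of `pos` matches the semantics of `numpy.insert`, so the operation can be sketched with it. This is an assumption about how such a helper might look, not a statement about the library's code; in particular, inserting along the last axis (the channel dimension) is a guess.

```python
import numpy as np


def add_ctrl(seq, ctrl, pos):
    """Insert control values before the indices `pos` along the last axis."""
    return np.insert(seq, pos, ctrl, axis=-1)


# One sample, three items, two data bits each: [BATCH, SEQ_LEN, DATA_BITS].
data = np.ones((1, 3, 2))

# Prepend two control channels with values 0 and 1 to every item.
with_ctrl = add_ctrl(data, [0, 1], [0, 0])
```

Passing `pos` as a list with one entry per inserted channel keeps the order of the control values as given in `ctrl`.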
-
augment
(seq, markers, ctrl_start=None, add_marker_data=False, add_marker_dummy=True)[source]¶ Creates augmented sequence as well as end marker and a dummy sequence.
Parameters: Returns: [augmented_sequence, dummy]
-
add_statistics
(stat_col)[source]¶ Add accuracy, seq_length and max_seq_length statistics to a
StatisticsCollector
.Parameters: stat_col ( StatisticsCollector
) – Statistics collector.
-
collect_statistics
(stat_col, data_dict, logits)[source]¶ Collects accuracy, seq_length and max_seq_length.
Parameters: - stat_col (
StatisticsCollector
) – Statistics collector. - data_dict (DataDict) – DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘mask’, ‘num_subsequences’}).
- logits (tensor) – Predictions of the model.
-
add_aggregators
(stat_agg)[source]¶ Adds problem-dependent statistical aggregators to
StatisticsAggregator
.Parameters: stat_agg – StatisticsAggregator
.
-
aggregate_statistics
(stat_col, stat_agg)[source]¶ Aggregates the statistics collected by
StatisticsCollector
and adds the results toStatisticsAggregator
.Parameters: - stat_col –
StatisticsCollector
. - stat_agg –
StatisticsAggregator
.
Dual Comparison¶
-
class
miprometheus.problems.
SequenceComparisonCommandLines
(params)[source]¶ Class generating sequences of random bit-patterns and targets forcing the system to learn the sequence comparison task. The system needs to compare both subsequences elementwise and return a sequence of 0s and 1s denoting whether items were equal, i.e. x1(0) == x2(0), x1(1) == x2(1), …, x1(n) == x2(n).
Note
Can also work in ‘’inequality’’ mode, i.e. return 1 when x1(n) != x2(n).
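The target of the comparison task can be sketched in plain Python. This is an illustrative assumption, not the library's generator: the function name and the `inequality` flag are hypothetical, and items are represented as small lists of bits.

```python
def compare_sequences(x1, x2, inequality=False):
    """Element-wise comparison of two equally long sequences of bit patterns.

    Returns 1 where the items are equal (or where they differ, if
    `inequality` is set) and 0 elsewhere.
    """
    if len(x1) != len(x2):
        raise ValueError("both subsequences must have the same length")
    return [int((a == b) != inequality) for a, b in zip(x1, x2)]


x1 = [[0, 1], [1, 1], [0, 0]]
x2 = [[0, 1], [1, 0], [0, 0]]
equal_targets = compare_sequences(x1, x2)
unequal_targets = compare_sequences(x1, x2, inequality=True)
```

The two modes always produce complementary targets for the same pair of subsequences.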
-
__init__
(params)[source]¶ Constructor - stores parameters. Calls parent class
AlgorithmicSeqToSeqProblem
initialization.Parameters: params – Dictionary of parameters (read from configuration .yaml
file).
-
generate_batch
(batch_size)[source]¶ - Generates a batch of samples of size ‘’batch_size’’ on-the-fly.
Note
The sequence length is drawn randomly between
self.min_sequence_length
andself.max_sequence_length
.Warning
All the samples within the batch will have the same sequence length.
Parameters: batch_size – Size of the batch to be returned.
Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘masks’, ‘num_subsequences’}), with:
- sequences: [BATCH_SIZE, 2*SEQ_LENGTH+2, CONTROL_BITS+DATA_BITS]
- sequences_length: [BATCH_SIZE, 1] (the same random value between self.min_sequence_length and self.max_sequence_length)
- targets: [BATCH_SIZE, 2*SEQ_LENGTH+2, DATA_BITS]
- masks: [BATCH_SIZE, 2*SEQ_LENGTH+2, 1]
- num_subsequences: [BATCH_SIZE, 1]
-
class
miprometheus.problems.
SequenceEqualityCommandLines
(params)[source]¶ Class generating sequences of random bit-patterns and targets forcing the system to learn the sequence equality task. Two sequences x1 and x2 are equal if x2 == x1.
Note
Can also work in ‘’inequality’’ mode, i.e. return 1 when x1 != x2.
-
class
miprometheus.problems.
SequenceSymmetryCommandLines
(params)[source]¶ Class generating sequences of random bit-patterns and targets forcing the system to learn sequence symmetry task. Two sequences x1 and x2 are symmetric if x2 == reversed(x1).
Note
Can also work in ‘’antisymmetry’’ mode, i.e. return 1 when x1 != reversed(x2).
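The symmetry condition translates directly into a short check. The sketch below is only an illustration: the function name and the `antisymmetry` flag are hypothetical, and sequences are plain lists of bit patterns.

```python
def symmetry_target(x1, x2, antisymmetry=False):
    """Return 1 if x2 is x1 reversed (or if it is not, in antisymmetry mode)."""
    symmetric = list(x2) == list(reversed(x1))
    return int(symmetric != antisymmetry)


x1 = [[0, 1], [1, 1], [1, 0]]
mirrored = [[1, 0], [1, 1], [0, 1]]

symmetric_case = symmetry_target(x1, mirrored)
copied_case = symmetry_target(x1, x1)
antisymmetric_case = symmetry_target(x1, x1, antisymmetry=True)
```

Note that a plain copy of `x1` is not symmetric to it unless the sequence happens to be a palindrome.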
-
__init__
(params)[source]¶ Constructor - stores parameters. Calls parent class
AlgorithmicSeqToSeqProblem
initialization.Parameters: params – Dictionary of parameters (read from configuration .yaml
file).
-
generate_batch
(batch_size)[source]¶ - Generates a batch of samples of size ‘’batch_size’’ on-the-fly.
Note
The sequence length is drawn randomly between
self.min_sequence_length
andself.max_sequence_length
.Warning
All the samples within the batch will have the same sequence length.
Parameters: batch_size – Size of the batch to be returned.
Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘masks’, ‘num_subsequences’}), with:
- sequences: [BATCH_SIZE, 2*SEQ_LENGTH+2, CONTROL_BITS+DATA_BITS]
- sequences_length: [BATCH_SIZE, 1] (the same random value between self.min_sequence_length and self.max_sequence_length)
- targets: [BATCH_SIZE, 2*SEQ_LENGTH+2, DATA_BITS]
- masks: [BATCH_SIZE, 2*SEQ_LENGTH+2, 1]
- num_subsequences: [BATCH_SIZE, 1]
Dual Distraction¶
-
class
miprometheus.problems.
DistractionCarry
(params)[source]¶ - # TODO: THE DOCUMENTATION OF THIS FILE NEEDS TO BE UPDATED & IMPROVED
Class generating successions of sub sequences X and Y of random bit-patterns. The target was designed to force the system to learn to recall the last sub sequence of Y and all sub sequences of X.
-
__init__
(params)[source]¶ Constructor - stores parameters. Calls parent class
AlgorithmicSeqToSeqProblem
initialization.Parameters: params – Dictionary of parameters (read from configuration .yaml
file).
-
__getitem__
(index)[source]¶ Getter that returns one individual sample generated on-the-fly
Note
The sequence length is drawn randomly between
self.min_sequence_length
andself.max_sequence_length
.Parameters: index – index of the sample to return. Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘mask’, ‘num_subsequences’}), with: - sequences: [SEQ_LENGTH, CONTROL_BITS+DATA_BITS],
- sequences_length: random value between self.min_sequence_length and self.max_sequence_length
- targets: [SEQ_LENGTH, DATA_BITS],
- mask: [SEQ_LENGTH]
- num_subsequences: 1
pattern of inputs: # x1 % y1 # x2 % y2 … # xn % yn & d $ d`
pattern of target: dummies … … … … yn all(xi)
mask: used to mask the data part of the target.
xi, yi, and d(d’): sub sequences x of random length, sub sequence y of random length and dummies.
# TODO: THE DOCUMENTATION NEEDS TO BE UPDATED # TODO: This is commented for now to avoid the issue with add_ctrl and augment in AlgorithmicSeqToSeqProblem # TODO: NOT SURE THAT THIS FN IS WORKING WELL (WITHOUT THE PRESENCE OF THE BATCH DIMENSION)
-
collate_fn
(batch)[source]¶ Generates a batch of samples on-the-fly
Warning
Because of the fact that the sequence length is randomly drawn between
self.min_sequence_length
andself.max_sequence_length
and then fixed for one given batch (but varies between batches), we cannot follow the usual scheme of merging together individual samples that can be retrieved in parallel with several workers. Indeed, each sample could have a different sequence length, and merging them together would then not be possible (we cannot have variable-sequence-length samples within one batch without padding). Hence, collate_fn
generates on-the-fly a batch of samples, all having the same length (initially randomly selected). The samples created by__getitem__
are simply not used in this function.
Parameters: batch – Should be a list of DataDict retrieved by __getitem__, each containing tensors, numbers, dicts or lists. –> Not Used Here!
Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘mask’, ‘num_subsequences’}), with:
- sequences: [BATCH_SIZE, 2*SEQ_LENGTH+2, CONTROL_BITS+DATA_BITS],
- sequences_length: random value between self.min_sequence_length and self.max_sequence_length
- targets: [BATCH_SIZE, 2*SEQ_LENGTH+2, DATA_BITS],
- mask: [BATCH_SIZE, 2*SEQ_LENGTH+2]
- num_subsequences: 1
pattern of inputs: # x1 % y1 # x2 % y2 … # xn % yn & d $ d`
pattern of target: dummies … … … … yn all(xi)
mask: used to mask the data part of the target.
xi, yi, and d(d’): sub sequences x of random length, sub sequence y of random length and dummies.
# TODO: THE DOCUMENTATION NEEDS TO BE UPDATED & IMPROVED
-
-
class
miprometheus.problems.
DistractionForget
(params)[source]¶ - # TODO: THE DOCUMENTATION OF THIS FILE NEEDS TO BE UPDATED & IMPROVED
Class generating successions of sub sequences X and Y of random bit-patterns. The target was designed to force the system to learn to recall all sub sequences of Y and X.
-
__init__
(params)[source]¶ Constructor - stores parameters. Calls parent class
AlgorithmicSeqToSeqProblem
initialization.Parameters: params – Dictionary of parameters (read from configuration .yaml
file).
-
__getitem__
(index)[source]¶ Getter that returns one individual sample generated on-the-fly
Note
The sequence length is drawn randomly between
self.min_sequence_length
andself.max_sequence_length
.Parameters: index – index of the sample to return. Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘mask’, ‘num_subsequences’}), with: - sequences: [SEQ_LENGTH, CONTROL_BITS+DATA_BITS],
- sequences_length: random value between self.min_sequence_length and self.max_sequence_length
- targets: [SEQ_LENGTH, DATA_BITS],
- mask: [SEQ_LENGTH]
- num_subsequences: 1
pattern of inputs: # x1 % y1 & d1 # x2 % y2 & d2 … # xn % yn & dn $ d`
pattern of target: d d y1 d d y2 … d d yn all(xi)
mask: used to mask the data part of the target.
xi, yi, and dn(d’): sub sequences x of random length, sub sequence y of random length and dummies.
# TODO: THE DOCUMENTATION NEEDS TO BE UPDATED # TODO: This is commented for now to avoid the issue with add_ctrl and augment in AlgorithmicSeqToSeqProblem # TODO: NOT SURE THAT THIS FN IS WORKING WELL (WITHOUT THE PRESENCE OF THE BATCH DIMENSION)
-
collate_fn
(batch)[source]¶ Generates a batch of samples on-the-fly
Warning
Because of the fact that the sequence length is randomly drawn between
self.min_sequence_length
andself.max_sequence_length
and then fixed for one given batch (but varies between batches), we cannot follow the usual scheme of merging together individual samples that can be retrieved in parallel with several workers. Indeed, each sample could have a different sequence length, and merging them together would then not be possible (we cannot have variable-sequence-length samples within one batch without padding). Hence, collate_fn
generates on-the-fly a batch of samples, all having the same length (initially randomly selected). The samples created by__getitem__
are simply not used in this function.Parameters: batch – Should be a list of DataDict retrieved by __getitem__, each containing tensors, numbers, dicts or lists. –> Not Used Here! Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘mask’, ‘num_subsequences’}), with: - sequences: [BATCH_SIZE, 2*SEQ_LENGTH+2, CONTROL_BITS+DATA_BITS],
- sequences_length: random value between self.min_sequence_length and self.max_sequence_length
- targets: [BATCH_SIZE, 2*SEQ_LENGTH+2, DATA_BITS],
- mask: [BATCH_SIZE, 2*SEQ_LENGTH+2]
- num_subsequences: 1
pattern of inputs: # x1 % y1 & d1 # x2 % y2 & d2 … # xn % yn & dn $ d`
pattern of target: d d y1 d d y2 … d d yn all(xi)
mask: used to mask the data part of the target.
xi, yi, and dn(d’): sub sequences x of random length, sub sequence y of random length and dummies.
# TODO: THE DOCUMENTATION NEEDS TO BE UPDATED & IMPROVED
-
-
class
miprometheus.problems.
DistractionIgnore
(params)[source]¶ # TODO: THE DOCUMENTATION OF THIS FILE NEEDS TO BE UPDATED & IMPROVED
Class generating successions of sub sequences X and Y of random bit-patterns. The target was designed to force the system to learn to recall just the sub sequences X and to ignore Y.
-
__init__
(params)[source]¶ Constructor - stores parameters. Calls parent class
AlgorithmicSeqToSeqProblem
initialization.Parameters: params – Dictionary of parameters (read from configuration .yaml
file).
-
__getitem__
(index)[source]¶ Getter that returns one individual sample generated on-the-fly
Note
The sequence length is drawn randomly between
self.min_sequence_length
andself.max_sequence_length
.Parameters: index – index of the sample to return. Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘mask’, ‘num_subsequences’}), with: - sequences: [SEQ_LENGTH, CONTROL_BITS+DATA_BITS],
- sequences_length: random value between self.min_sequence_length and self.max_sequence_length
- targets: [SEQ_LENGTH, DATA_BITS],
- mask: [SEQ_LENGTH]
- num_subsequences: 1
pattern of inputs: # x1 % y1 # x2 % y2 … # xn % yn & d
pattern of target: dummies … … … … all(xi)
mask: used to mask the data part of the target.
xi, yi, and d: sub sequences x of random length, sub sequence y of random length and dummies.
# TODO: THE DOCUMENTATION NEEDS TO BE UPDATED # TODO: This is commented for now to avoid the issue with add_ctrl and augment in AlgorithmicSeqToSeqProblem # TODO: NOT SURE THAT THIS FN IS WORKING WELL (WITHOUT THE PRESENCE OF THE BATCH DIMENSION)
-
collate_fn
(batch)[source]¶ Generates a batch of samples on-the-fly
Warning
Because of the fact that the sequence length is randomly drawn between
self.min_sequence_length
andself.max_sequence_length
and then fixed for one given batch (but varies between batches), we cannot follow the usual scheme of merging together individual samples that can be retrieved in parallel with several workers. Indeed, each sample could have a different sequence length, and merging them together would then not be possible (we cannot have variable-sequence-length samples within one batch without padding). Hence, collate_fn
generates on-the-fly a batch of samples, all having the same length (initially randomly selected). The samples created by__getitem__
are simply not used in this function.Parameters: batch – Should be a list of DataDict retrieved by __getitem__, each containing tensors, numbers, dicts or lists. –> Not Used Here! Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘mask’, ‘num_subsequences’}), with: - sequences: [BATCH_SIZE, SEQ_LENGTH, CONTROL_BITS+DATA_BITS],
- sequences_length: random value between self.min_sequence_length and self.max_sequence_length
- targets: [BATCH_SIZE, SEQ_LENGTH, DATA_BITS],
- mask: [BATCH_SIZE, SEQ_LENGTH]
- num_subsequences: 1
pattern of inputs: # x1 % y1 # x2 % y2 … # xn % yn & d
pattern of target: dummies … … … … all(xi)
mask: used to mask the data part of the target.
xi, yi, and d: sub sequences x of random length, sub sequence y of random length and dummies.
# TODO: THE DOCUMENTATION NEEDS TO BE UPDATED & IMPROVED
-
Dual Ignore¶
-
class
miprometheus.problems.
InterruptionNot
(params)[source]¶ # TODO: THE DOCUMENTATION OF THIS FILE NEEDS TO BE UPDATED & IMPROVED
Class generating successions of sub sequences X and Y of random bit-patterns. The target was designed to force the system to learn to swap all sub sequences of Y and to recall all sub sequences of X.
The swap is done in the following way: Y is “bitshifted” by num_items to the right.
For example:
num_items = 2 -> seq_items >> 2
num_items = -1 -> seq_items << 1
Offers two modes of operation, depending on the value of the num_items parameter:
- -1 < num_items < 1: relative mode, where num_items represents the fraction of the length of the sequence by which it should be shifted
- otherwise: absolute number of items by which the sequence will be shifted.
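The two modes of `num_items` can be sketched as follows. This is a hedged illustration with several assumptions that the description above does not confirm: the shift is treated as circular (items leaving one end re-enter at the other), the relative shift is truncated towards zero, and the function name `shift_items` is hypothetical.

```python
def shift_items(seq_items, num_items):
    """Shift a sequence right by num_items (left if negative), circularly.

    -1 < num_items < 1 is read as a fraction of the sequence length
    (relative mode); any other value is an absolute number of items.
    """
    items = list(seq_items)
    length = len(items)
    if length == 0:
        return items
    if -1 < num_items < 1:
        shift = int(num_items * length)  # relative mode (assumed truncation)
    else:
        shift = int(num_items)           # absolute mode
    shift %= length
    if shift == 0:
        return items
    return items[-shift:] + items[:-shift]


right_by_two = shift_items([1, 2, 3, 4], 2)
left_by_one = shift_items([1, 2, 3, 4], -1)
half_length = shift_items([1, 2, 3, 4], 0.5)
```

With a sequence of four items, a relative shift of 0.5 and an absolute shift of 2 give the same result.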
-
__init__
(params)[source]¶ Constructor - stores parameters. Calls parent class
AlgorithmicSeqToSeqProblem
initialization.Parameters: params – Dictionary of parameters (read from configuration .yaml
file).
-
__getitem__
(index)[source]¶ Getter that returns one individual sample generated on-the-fly
Note
The sequence length is drawn randomly between
self.min_sequence_length
andself.max_sequence_length
.Parameters: index – index of the sample to return. Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘mask’, ‘num_subsequences’}), with: - sequences: [SEQ_LENGTH, CONTROL_BITS+DATA_BITS],
- sequences_length: random value between self.min_sequence_length and self.max_sequence_length
- targets: [SEQ_LENGTH, DATA_BITS],
- mask: [SEQ_LENGTH]
- num_subsequences: 1
pattern of inputs: # x1 % y1 & d1 # x2 % y2 & d2 … # xn % yn & dn $ d`
pattern of target: d d y1 d d y2 … d d yn all(xi)
mask: used to mask the data part of the target.
xi, yi, and dn(d’): sub sequences x of random length, sub sequence y of random length and dummies.
# TODO: THE DOCUMENTATION NEEDS TO BE UPDATED # TODO: This is commented for now to avoid the issue with add_ctrl and augment in AlgorithmicSeqToSeqProblem # TODO: NOT SURE THAT THIS FN IS WORKING WELL (WITHOUT THE PRESENCE OF THE BATCH DIMENSION)
-
collate_fn
(batch)[source]¶ Generates a batch of samples on-the-fly
Warning
Because of the fact that the sequence length is randomly drawn between
self.min_sequence_length
andself.max_sequence_length
and then fixed for one given batch (but varies between batches), we cannot follow the usual scheme of merging together individual samples that can be retrieved in parallel with several workers. Indeed, each sample could have a different sequence length, and merging them together would then not be possible (we cannot have variable-sequence-length samples within one batch without padding). Hence, collate_fn
generates on-the-fly a batch of samples, all having the same length (initially randomly selected). The samples created by__getitem__
are simply not used in this function.Parameters: batch – Should be a list of DataDict retrieved by __getitem__, each containing tensors, numbers, dicts or lists. –> Not Used Here! Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘mask’, ‘num_subsequences’}), with: - sequences: [BATCH_SIZE, 2*SEQ_LENGTH+2, CONTROL_BITS+DATA_BITS],
- sequences_length: random value between self.min_sequence_length and self.max_sequence_length
- targets: [BATCH_SIZE, 2*SEQ_LENGTH+2, DATA_BITS],
- mask: [BATCH_SIZE, 2*SEQ_LENGTH+2]
- num_subsequences: 1
pattern of inputs: # x1 % y1 & d1 # x2 % y2 & d2 … # xn % yn & dn $ d`
pattern of target: d d y1 d d y2 … d d yn all(xi)
mask: used to mask the data part of the target.
xi, yi, and dn(d’): sub sequences x of random length, sub sequence y of random length and dummies.
# TODO: THE DOCUMENTATION NEEDS TO BE UPDATED & IMPROVED
-
class
miprometheus.problems.
InterruptionReverseRecall
(params)[source]¶ # TODO: THE DOCUMENTATION OF THIS FILE NEEDS TO BE UPDATED & IMPROVED
Class generating successions of sub sequences X and Y of random bit-patterns. The target was designed to force the system to learn to recall all sub sequences of Y in reverse order and to recall all sub sequences of X.
-
__init__
(params)[source]¶ Constructor - stores parameters. Calls parent class
AlgorithmicSeqToSeqProblem
initialization.Parameters: params – Dictionary of parameters (read from configuration .yaml
file).
-
__getitem__
(index)[source]¶ Getter that returns one individual sample generated on-the-fly.
Note
The sequence length is drawn randomly between self.min_sequence_length and self.max_sequence_length.
Parameters: index – index of the sample to return.
Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘mask’, ‘num_subsequences’}), with:
- sequences: [SEQ_LENGTH, CONTROL_BITS+DATA_BITS]. SEQ_LENGTH depends on the number of sub-sequences and their lengths.
- sequences_length: random value between self.min_sequence_length and self.max_sequence_length
- targets: [SEQ_LENGTH, DATA_BITS],
- mask: [SEQ_LENGTH]
- num_subsequences: nb_sub_seq_a + nb_sub_seq_b
pattern of inputs: # x1 % y1 & d1 # x2 % y2 & d2 … # xn % yn & dn $ d’
pattern of target: d d F(y1) d d F(y2) … d d F(yn) all(xi)
F: inversion function
mask: used to mask the data part of the target.
xi, yi, and dn (d’): sub-sequences x of random length, sub-sequences y of random length, and dummies.
-
collate_fn
(batch)[source]¶ Generates a batch of samples on-the-fly.
Warning
Because the sequence length is randomly drawn between self.min_sequence_length and self.max_sequence_length and then fixed for one given batch (but varies between batches), we cannot follow the usual scheme of merging together individual samples retrieved in parallel by several workers. Indeed, each sample could have a different sequence length, and merging them together would then not be possible (we cannot have variable-sequence-length samples within one batch without padding). Hence, collate_fn generates a batch of samples on-the-fly, all having the same length (initially randomly selected). The samples created by __getitem__ are simply not used in this function.
Parameters: batch – Should be a list of DataDict retrieved by __getitem__, each containing tensors, numbers, dicts or lists. Not used here!
Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘mask’, ‘num_subsequences’}), with:
- sequences: [BATCH_SIZE, 2*SEQ_LENGTH+2, CONTROL_BITS+DATA_BITS],
- sequences_length: random value between self.min_sequence_length and self.max_sequence_length
- targets: [BATCH_SIZE, 2*SEQ_LENGTH+2, DATA_BITS],
- mask: [BATCH_SIZE, 2*SEQ_LENGTH+2]
- num_subsequences: 1
pattern of inputs: # x1 % y1 & d1 # x2 % y2 & d2 … # xn % yn & dn $ d’
pattern of target: d d F(y1) d d F(y2) … d d F(yn) all(xi)
F: inversion function
mask: used to mask the data part of the target.
xi, yi, and dn (d’): sub-sequences x of random length, sub-sequences y of random length, and dummies.
# TODO: THE DOCUMENTATION NEEDS TO BE UPDATED & IMPROVED
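The batching scheme described in the warning above can be sketched as follows. This is a minimal, dependency-free illustration rather than the library's implementation: the function name, the extra arguments (batch_size, min_len, max_len, data_bits) and the use of nested lists instead of tensors are assumptions made for the example.

```python
import random


def collate_fn_sketch(batch, batch_size, min_len, max_len, data_bits, rng=random):
    """Ignore `batch` and build a fresh batch whose samples share one length."""
    # A single length is drawn per batch, so all samples can be stacked
    # without padding; the length still varies from one batch to the next.
    seq_length = rng.randint(min_len, max_len)
    sequences = [
        [[rng.randint(0, 1) for _ in range(data_bits)] for _ in range(seq_length)]
        for _ in range(batch_size)
    ]
    return {"sequences": sequences, "sequences_length": seq_length}


# `batch` would normally hold the samples produced by __getitem__; it is unused.
out = collate_fn_sketch(batch=None, batch_size=4, min_len=3, max_len=6, data_bits=8)
```

Drawing the length once per call is what removes the need for padding; the price is that the per-sample work done by the data-loading workers is discarded.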
-
-
class
miprometheus.problems.
InterruptionSwapRecall
(params)[source]¶ # TODO: THE DOCUMENTATION OF THIS FILE NEEDS TO BE UPDATED & IMPROVED
Class generating successions of sub-sequences X and Y of random bit-patterns. The target is designed to force the system to learn to swap all sub-sequences of Y and to recall all sub-sequences of X.
The swap is done in the following way: Y is “bitshifted” by num_items to the right.
For example: num_items = 2 -> seq_items >> 2, and num_items = -1 -> seq_items << 1.
Offers two modes of operation, depending on the value of the num_items parameter:
- -1 < num_items < 1: relative mode, where num_items represents the fraction of the sequence length by which it should be shifted
- otherwise: absolute number of items by which the sequence will be shifted.
-
__init__
(params)[source]¶ Constructor - stores parameters. Calls parent class AlgorithmicSeqToSeqProblem initialization.
Parameters: params – Dictionary of parameters (read from configuration .yaml file).
-
rotate
(seq, rotation, seq_length)[source]¶ Rotates a sequence by shifting the items to the right: seq >> num_items.
For example, num_items = 2 -> seq_items >> 2 and num_items = -1 -> seq_items << 1.
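The two modes of operation (relative and absolute) can be illustrated with a small sketch. It assumes the sequence is a plain Python list and that the relative shift is truncated towards zero; neither detail is confirmed by the documentation, and the function name is hypothetical.

```python
def rotate_sketch(seq, rotation, seq_length):
    """Rotate `seq` to the right; a negative `rotation` rotates to the left."""
    if seq_length == 0:
        return list(seq)
    if -1 < rotation < 1:
        # Relative mode: `rotation` is a fraction of the sequence length
        # (truncation towards zero is an assumption of this sketch).
        shift = int(rotation * seq_length)
    else:
        # Absolute mode: `rotation` is a number of items.
        shift = int(rotation)
    shift %= seq_length  # maps left shifts onto equivalent right shifts
    split = seq_length - shift
    return list(seq[split:]) + list(seq[:split])


right_by_two = rotate_sketch([1, 2, 3, 4, 5], 2, 5)
left_by_one = rotate_sketch([1, 2, 3, 4, 5], -1, 5)
half_length = rotate_sketch([1, 2, 3, 4], 0.5, 4)
```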
-
__getitem__
(index)[source]¶ Getter that returns one individual sample generated on-the-fly.
Note
The sequence length is drawn randomly between self.min_sequence_length and self.max_sequence_length.
Parameters: index – index of the sample to return.
Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘mask’, ‘num_subsequences’}), with:
- sequences: [SEQ_LENGTH, CONTROL_BITS+DATA_BITS],
- sequences_length: random value between self.min_sequence_length and self.max_sequence_length
- targets: [SEQ_LENGTH, DATA_BITS],
- mask: [SEQ_LENGTH]
- num_subsequences: num_seq_a + num_seq_b
pattern of inputs: # x1 % y1 & d1 # x2 % y2 & d2 … # xn % yn & dn $ d’
pattern of target: d d F(y1) d d F(y2) … d d F(yn) all(xi)
F: swap function
mask: used to mask the data part of the target.
xi, yi, and dn (d’): sub-sequences x of random length, sub-sequences y of random length, and dummies.
-
collate_fn
(batch)[source]¶ Generates a batch of samples on-the-fly.
Warning
Because the sequence length is randomly drawn between self.min_sequence_length and self.max_sequence_length and then fixed for one given batch (but varies between batches), we cannot follow the usual scheme of merging together individual samples retrieved in parallel by several workers. Indeed, each sample could have a different sequence length, and merging them together would then not be possible (we cannot have variable-sequence-length samples within one batch without padding). Hence, collate_fn generates a batch of samples on-the-fly, all having the same length (initially randomly selected). The samples created by __getitem__ are simply not used in this function.
Parameters: batch – Should be a list of DataDict retrieved by __getitem__, each containing tensors, numbers, dicts or lists. Not used here!
Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘mask’, ‘num_subsequences’}), with:
- sequences: [BATCH_SIZE, SEQ_LENGTH, CONTROL_BITS+DATA_BITS],
- sequences_length: random value between self.min_sequence_length and self.max_sequence_length
- targets: [BATCH_SIZE, SEQ_LENGTH, DATA_BITS],
- mask: [BATCH_SIZE, SEQ_LENGTH]
- num_subsequences: num_seq_a + num_seq_b
pattern of inputs: # x1 % y1 & d1 # x2 % y2 & d2 … # xn % yn & dn $ d’
pattern of target: d d F(y1) d d F(y2) … d d F(yn) all(xi)
F: swap function
mask: used to mask the data part of the target.
xi, yi, and dn (d’): sub-sequences x of random length, sub-sequences y of random length, and dummies.
# TODO: THE DOCUMENTATION NEEDS TO BE UPDATED
Manipulation Spatial¶
-
class
miprometheus.problems.
ManipulationSpatialNot
(params)[source]¶ # TODO: THE DOCUMENTATION OF THIS FILE NEEDS TO BE UPDATED & IMPROVED
Class generating sequences of random bit-patterns with inverted targets, so that the system is supposed to learn the logical NOT operation.
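As a rough illustration of what an inverted target means here, the sketch below flips every data bit of every item. The helper name and the nested-list representation are assumptions; the class itself produces tensors.

```python
def invert_bits(sequence):
    """Return the bitwise NOT of each item in a sequence of 0/1 bit-patterns."""
    return [[1 - bit for bit in item] for item in sequence]


inputs = [[0, 1, 1, 0], [1, 1, 0, 0]]
targets = invert_bits(inputs)
```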
-
__init__
(params)[source]¶ Constructor - stores parameters. Calls parent class AlgorithmicSeqToSeqProblem initialization.
Parameters: params – Dictionary of parameters (read from configuration .yaml file).
-
__getitem__
(index)[source]¶ Getter that returns one individual sample generated on-the-fly.
Note
The sequence length is drawn randomly between self.min_sequence_length and self.max_sequence_length.
Parameters: index – index of the sample to return.
Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘mask’, ‘num_subsequences’}), with:
- sequences: [2*SEQ_LENGTH+2, CONTROL_BITS+DATA_BITS],
- sequences_length: random value between self.min_sequence_length and self.max_sequence_length
- targets: [2*SEQ_LENGTH+2, DATA_BITS],
- mask: [2*SEQ_LENGTH+2]
- num_subsequences: 1
-
collate_fn
(batch)[source]¶ Generates a batch of samples on-the-fly.
Warning
Because the sequence length is randomly drawn between self.min_sequence_length and self.max_sequence_length and then fixed for one given batch (but varies between batches), we cannot follow the usual scheme of merging together individual samples retrieved in parallel by several workers. Indeed, each sample could have a different sequence length, and merging them together would then not be possible (we cannot have variable-sequence-length samples within one batch without padding). Hence, collate_fn generates a batch of samples on-the-fly, all having the same length (initially randomly selected). The samples created by __getitem__ are simply not used in this function.
Parameters: batch – Should be a list of DataDict retrieved by __getitem__, each containing tensors, numbers, dicts or lists. Not used here!
Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘mask’, ‘num_subsequences’}), with:
- sequences: [BATCH_SIZE, 2*SEQ_LENGTH+2, CONTROL_BITS+DATA_BITS],
- sequences_length: random value between self.min_sequence_length and self.max_sequence_length
- targets: [BATCH_SIZE, 2*SEQ_LENGTH+2, DATA_BITS],
- mask: [BATCH_SIZE, 2*SEQ_LENGTH+2]
- num_subsequences: 1
-
-
class
miprometheus.problems.
ManipulationSpatialRotation
(params)[source]¶ # TODO: THE DOCUMENTATION OF THIS FILE NEEDS TO BE UPDATED & IMPROVED
Creates an input that is a sequence of bit-patterns and a target that is the same sequence, but with the data bits of every item “bitshifted” by num_bits to the right.
Offers two modes of operation, depending on the value of the num_bits parameter:
1. -1 < num_bits < 1: relative mode, where num_bits represents the fraction of data bits by which every item should be shifted
2. otherwise: absolute number of bits by which every item will be shifted.
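Unlike a temporal rotation, which moves whole items along the sequence, this manipulation acts inside each item. The sketch below illustrates the idea on nested lists; the helper name, the truncation in relative mode, and the requirement of a non-empty sequence are assumptions of the example.

```python
def rotate_data_bits(sequence, num_bits):
    """Rotate the bits of every item to the right by `num_bits` positions."""
    data_bits = len(sequence[0])  # assumes a non-empty sequence of equal-width items
    if -1 < num_bits < 1:
        # Relative mode: fraction of the item width (truncated, by assumption).
        shift = int(num_bits * data_bits)
    else:
        # Absolute mode: number of bit positions.
        shift = int(num_bits)
    shift %= data_bits
    split = data_bits - shift
    return [item[split:] + item[:split] for item in sequence]


inputs = [[1, 0, 0, 0], [0, 1, 1, 0]]
shifted_by_one = rotate_data_bits(inputs, 1)
shifted_by_half = rotate_data_bits(inputs, 0.5)
```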
-
__init__
(params)[source]¶ Constructor - stores parameters. Calls parent class AlgorithmicSeqToSeqProblem initialization.
Parameters: params – Dictionary of parameters (read from configuration .yaml file).
-
__getitem__
(index)[source]¶ Getter that returns one individual sample generated on-the-fly.
Note
The sequence length is drawn randomly between self.min_sequence_length and self.max_sequence_length.
Parameters: index – index of the sample to return.
Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘mask’, ‘num_subsequences’}), with:
- sequences: [2*SEQ_LENGTH+2, CONTROL_BITS+DATA_BITS],
- sequences_length: random value between self.min_sequence_length and self.max_sequence_length
- targets: [2*SEQ_LENGTH+2, DATA_BITS],
- mask: [2*SEQ_LENGTH+2]
- num_subsequences: 1
-
collate_fn
(batch)[source]¶ Generates a batch of samples on-the-fly.
Warning
Because the sequence length is randomly drawn between self.min_sequence_length and self.max_sequence_length and then fixed for one given batch (but varies between batches), we cannot follow the usual scheme of merging together individual samples retrieved in parallel by several workers. Indeed, each sample could have a different sequence length, and merging them together would then not be possible (we cannot have variable-sequence-length samples within one batch without padding). Hence, collate_fn generates a batch of samples on-the-fly, all having the same length (initially randomly selected). The samples created by __getitem__ are simply not used in this function.
Parameters: batch – Should be a list of DataDict retrieved by __getitem__, each containing tensors, numbers, dicts or lists. Not used here!
Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘mask’, ‘num_subsequences’}), with:
- sequences: [BATCH_SIZE, 2*SEQ_LENGTH+2, CONTROL_BITS+DATA_BITS],
- sequences_length: random value between self.min_sequence_length and self.max_sequence_length
- targets: [BATCH_SIZE, 2*SEQ_LENGTH+2, DATA_BITS],
- mask: [BATCH_SIZE, 2*SEQ_LENGTH+2]
- num_subsequences: 1
-
Manipulation Temporal¶
-
class
miprometheus.problems.
ManipulationTemporalSwap
(params)[source]¶ # TODO: THE DOCUMENTATION OF THIS FILE NEEDS TO BE UPDATED & IMPROVED
Creates an input that is a sequence of bit-patterns and a target that is the same sequence “bitshifted” by num_items to the right.
For example:
num_items = 2 -> seq_items >> 2, and num_items = -1 -> seq_items << 1.
Offers two modes of operation, depending on the value of the num_items parameter:
- -1 < num_items < 1: relative mode, where num_items represents the fraction of the sequence length by which it should be shifted
- otherwise: absolute number of items by which the sequence will be shifted.
TODO: sequences of different lengths in batch (filling with zeros?)
-
__init__
(params)[source]¶ Constructor - stores parameters. Calls parent class AlgorithmicSeqToSeqProblem initialization.
Parameters: params – Dictionary of parameters (read from configuration .yaml file).
self.name = ‘SerialRecall’
-
__getitem__
(index)[source]¶ Getter that returns one individual sample generated on-the-fly.
Note
The sequence length is drawn randomly between self.min_sequence_length and self.max_sequence_length.
Parameters: index – index of the sample to return.
Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘mask’, ‘num_subsequences’}), with:
- sequences: [2*SEQ_LENGTH+2, CONTROL_BITS+DATA_BITS],
- sequences_length: random value between self.min_sequence_length and self.max_sequence_length
- targets: [2*SEQ_LENGTH+2, DATA_BITS],
- mask: [2*SEQ_LENGTH+2]
- num_subsequences: 1
-
collate_fn
(batch)[source]¶ Generates a batch of samples on-the-fly.
Warning
Because the sequence length is randomly drawn between self.min_sequence_length and self.max_sequence_length and then fixed for one given batch (but varies between batches), we cannot follow the usual scheme of merging together individual samples retrieved in parallel by several workers. Indeed, each sample could have a different sequence length, and merging them together would then not be possible (we cannot have variable-sequence-length samples within one batch without padding). Hence, collate_fn generates a batch of samples on-the-fly, all having the same length (initially randomly selected). The samples created by __getitem__ are simply not used in this function.
Parameters: batch – Should be a list of DataDict retrieved by __getitem__, each containing tensors, numbers, dicts or lists. Not used here!
Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘mask’, ‘num_subsequences’}), with:
- sequences: [BATCH_SIZE, 2*SEQ_LENGTH+2, CONTROL_BITS+DATA_BITS],
- sequences_length: random value between self.min_sequence_length and self.max_sequence_length
- targets: [BATCH_SIZE, 2*SEQ_LENGTH+2, DATA_BITS],
- mask: [BATCH_SIZE, 2*SEQ_LENGTH+2]
- num_subsequences: 1
-
class
miprometheus.problems.
SkipRecallCommandLines
(params)[source]¶ Class generating sequences of random bit-patterns and targets forcing the system to learn the serial recall problem (a.k.a. copy task). The formulation follows the original copy task from the NTM paper, where:
1. There are two markers, indicating:
- beginning of storing/memorization and
- beginning of recalling from memory.
2. Additionally, there is a command line (3rd command bit) indicating whether a given item is to be stored in memory (0) or recalled (1).
-
__init__
(params)[source]¶ Constructor - stores parameters. Calls parent class AlgorithmicSeqToSeqProblem initialization.
Parameters: params – Dictionary of parameters (read from configuration .yaml file).
-
generate_batch
(batch_size)[source]¶ Generates a batch of samples of size batch_size on-the-fly.
Note
The sequence length is drawn randomly between self.min_sequence_length and self.max_sequence_length.
Warning
All the samples within the batch will have the same sequence length.
Parameters: batch_size – Size of the batch to be returned.
Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘masks’, ‘num_subsequences’}), with:
- sequences: [BATCH_SIZE, 2*SEQ_LENGTH+2, CONTROL_BITS+DATA_BITS]
- sequences_length: [BATCH_SIZE, 1] (the same random value between self.min_sequence_length and self.max_sequence_length)
- targets: [BATCH_SIZE, 2*SEQ_LENGTH+2, DATA_BITS]
- masks: [BATCH_SIZE, 2*SEQ_LENGTH+2, 1]
- num_subsequences: [BATCH_SIZE, 1]
Recall¶
-
class
miprometheus.problems.
OperationSpan
(params)[source]¶ Class generating successions of sub-sequences X and Y of random bit-patterns. The target is designed to force the system to learn to swap all sub-sequences of Y and to recall all sub-sequences of X.
The swap is done in the following way: Y is “bitshifted” by num_items to the right.
For example:
num_items = 2 -> seq_items >> 2, and num_items = -1 -> seq_items << 1.
Offers two modes of operation, depending on the value of the num_items parameter:
- -1 < num_items < 1: relative mode, where num_items represents the fraction of the sequence length by which it should be shifted
- otherwise: absolute number of items by which the sequence will be shifted.
-
__init__
(params)[source]¶ Constructor - stores parameters. Calls parent class AlgorithmicSeqToSeqProblem initialization.
Parameters: params – Dictionary of parameters (read from configuration .yaml file).
-
rotate
(seq, rotation, length)[source]¶ Rotates a sequence by shifting the items to the right: seq >> num_items.
For example, num_items = 2 -> seq_items >> 2 and num_items = -1 -> seq_items << 1.
Returns: rotated sequence.
-
__getitem__
(index)[source]¶ Getter that returns one individual sample generated on-the-fly.
Note
The sequence length is drawn randomly between self.min_sequence_length and self.max_sequence_length.
Parameters: index – index of the sample to return.
Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘mask’, ‘num_subsequences’}), with:
- sequences: [SEQ_LENGTH, CONTROL_BITS+DATA_BITS]. SEQ_LENGTH depends on the number of sub-sequences and their lengths.
- sequences_length: random value between self.min_sequence_length and self.max_sequence_length
- targets: [SEQ_LENGTH, DATA_BITS],
- mask: [SEQ_LENGTH]
- num_subsequences: 1
pattern of inputs: # x1 % y1 & d1 # x2 % y2 & d2 … # xn % yn & dn $ d’
pattern of target: d d y1 d d y2 … d d yn all(xi)
mask: used to mask the data part of the target.
xi, yi, and dn (d’): sub-sequences x of random length, sub-sequences y of random length, and dummies.
# TODO: THIS DOCUMENTATION NEEDS TO BE UPDATED
-
collate_fn
(batch)[source]¶ Generates a batch of samples on-the-fly.
Warning
Because the sequence length is randomly drawn between self.min_sequence_length and self.max_sequence_length and then fixed for one given batch (but varies between batches), we cannot follow the usual scheme of merging together individual samples retrieved in parallel by several workers. Indeed, each sample could have a different sequence length, and merging them together would then not be possible (we cannot have variable-sequence-length samples within one batch without padding). Hence, collate_fn generates a batch of samples on-the-fly, all having the same length (initially randomly selected). The samples created by __getitem__ are simply not used in this function.
Parameters: batch – Should be a list of DataDict retrieved by __getitem__, each containing tensors, numbers, dicts or lists. Not used here!
Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘mask’, ‘num_subsequences’}), with:
- sequences: [BATCH_SIZE, 2*SEQ_LENGTH+2, CONTROL_BITS+DATA_BITS],
- sequences_length: random value between self.min_sequence_length and self.max_sequence_length
- targets: [BATCH_SIZE, 2*SEQ_LENGTH+2, DATA_BITS],
- mask: [BATCH_SIZE, 2*SEQ_LENGTH+2]
- num_subsequences: 1
pattern of inputs: # x1 % y1 & d1 # x2 % y2 & d2 … # xn % yn & dn $ d’
pattern of target: d d y1 d d y2 … d d yn all(xi)
mask: used to mask the data part of the target.
xi, yi, and dn (d’): sub-sequences x of random length, sub-sequences y of random length, and dummies.
# TODO: THIS DOCUMENTATION NEEDS TO BE UPDATED
-
class
miprometheus.problems.
ReadingSpan
(params)[source]¶ # TODO : Documentation will be added soon
-
__init__
(params)[source]¶ Constructor - stores parameters. Calls parent class AlgorithmicSeqToSeqProblem initialization.
Parameters: params – Dictionary of parameters (read from configuration .yaml file).
-
__getitem__
(index)[source]¶ Getter that returns one individual sample generated on-the-fly.
Note
The sequence length is drawn randomly between self.min_sequence_length and self.max_sequence_length.
Parameters: index – index of the sample to return.
Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘mask’, ‘num_subsequences’}), with:
- sequences: [SEQ_LENGTH, CONTROL_BITS+DATA_BITS]. SEQ_LENGTH depends on the number of sub-sequences and their lengths.
- sequences_length: random value between self.min_sequence_length and self.max_sequence_length
- targets: [SEQ_LENGTH, DATA_BITS],
- mask: [SEQ_LENGTH]
- num_subsequences: number of subsequences
-
collate_fn
(batch)[source]¶ Generates a batch of samples on-the-fly.
Warning
Because the sequence length is randomly drawn between self.min_sequence_length and self.max_sequence_length and then fixed for one given batch (but varies between batches), we cannot follow the usual scheme of merging together individual samples retrieved in parallel by several workers. Indeed, each sample could have a different sequence length, and merging them together would then not be possible (we cannot have variable-sequence-length samples within one batch without padding). Hence, collate_fn generates a batch of samples on-the-fly, all having the same length (initially randomly selected). The samples created by __getitem__ are simply not used in this function.
Parameters: batch – Should be a list of DataDict retrieved by __getitem__, each containing tensors, numbers, dicts or lists. Not used here!
Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘mask’, ‘num_subsequences’}), with:
- sequences: [BATCH_SIZE, SEQ_LENGTH, CONTROL_BITS+DATA_BITS],
- sequences_length: random value between self.min_sequence_length and self.max_sequence_length
- targets: [BATCH_SIZE, SEQ_LENGTH, DATA_BITS],
- mask: [BATCH_SIZE, SEQ_LENGTH]
- num_subsequences: number of subsequences
# TODO: THE DOCUMENTATION NEEDS TO BE UPDATED
-
-
class
miprometheus.problems.
RepeatReverseRecallCommandLines
(params)[source]¶ Class generating sequences of random bit-patterns and targets forcing the system to learn repeated reverse recall problem.
There are 2 markers, indicating:
- beginning of storing/memorization,
- beginning of forward recalling from memory.
Additionally, there is a command line (3rd command bit) indicating whether given item is to be stored in memory (0) or recalled (1).
-
__init__
(params)[source]¶ Constructor - stores parameters. Calls parent class AlgorithmicSeqToSeqProblem initialization.
Parameters: params – Dictionary of parameters (read from configuration .yaml file).
-
generate_batch
(batch_size)[source]¶ Generates a batch of samples of size batch_size on-the-fly.
Note
The sequence length is drawn randomly between self.min_sequence_length and self.max_sequence_length.
Warning
All the samples within the batch will have the same sequence length.
Parameters: batch_size – Size of the batch to be returned.
Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘masks’, ‘num_subsequences’}), with:
- sequences: [BATCH_SIZE, 2*SEQ_LENGTH+2, CONTROL_BITS+DATA_BITS]
- sequences_length: [BATCH_SIZE, 1] (the same random value between self.min_sequence_length and self.max_sequence_length)
- targets: [BATCH_SIZE, 2*SEQ_LENGTH+2, DATA_BITS]
- masks: [BATCH_SIZE, 2*SEQ_LENGTH+2, 1]
- num_subsequences: [BATCH_SIZE, 1]
-
class
miprometheus.problems.
RepeatSerialRecallCommandLines
(params)[source]¶ Class generating sequences of random bit-patterns and targets forcing the system to learn repeated serial recall problem.
There are 2 markers, indicating:
- beginning of storing/memorization,
- beginning of forward recalling from memory.
Additionally, there is a command line (3rd command bit) indicating whether given item is to be stored in memory (0) or recalled (1).
-
__init__
(params)[source]¶ Constructor - stores parameters. Calls parent class AlgorithmicSeqToSeqProblem initialization.
Parameters: params – Dictionary of parameters (read from configuration .yaml file).
-
generate_batch
(batch_size)[source]¶ Generates a batch of samples of size batch_size on-the-fly.
Note
The sequence length is drawn randomly between self.min_sequence_length and self.max_sequence_length.
Warning
All the samples within the batch will have the same sequence length.
Parameters: batch_size – Size of the batch to be returned.
Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘masks’, ‘num_subsequences’}), with:
- sequences: [BATCH_SIZE, 2*SEQ_LENGTH+2, CONTROL_BITS+DATA_BITS]
- sequences_length: [BATCH_SIZE, 1] (the same random value between self.min_sequence_length and self.max_sequence_length)
- targets: [BATCH_SIZE, 2*SEQ_LENGTH+2, DATA_BITS]
- masks: [BATCH_SIZE, 2*SEQ_LENGTH+2, 1]
- num_subsequences: [BATCH_SIZE, 1]
-
class
miprometheus.problems.
ReverseRecallCommandLines
(params)[source]¶ Class generating sequences of random bit-patterns and targets forcing the system to learn reverse recall problem.
- There are two markers, indicating:
- beginning of storing/memorization and
- beginning of recalling from memory.
- Additionally, there is a command line (3rd command bit) indicating whether given item is to be stored in memory (0) or recalled (1).
-
__init__
(params)[source]¶ Constructor - stores parameters. Calls parent class AlgorithmicSeqToSeqProblem initialization.
Parameters: params – Dictionary of parameters (read from configuration .yaml file).
-
generate_batch
(batch_size)[source]¶ Generates a batch of samples of size batch_size on-the-fly.
Note
The sequence length is drawn randomly between self.min_sequence_length and self.max_sequence_length.
Warning
All the samples within the batch will have the same sequence length.
Parameters: batch_size – Size of the batch to be returned.
Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘masks’, ‘num_subsequences’}), with:
- sequences: [BATCH_SIZE, 2*SEQ_LENGTH+2, CONTROL_BITS+DATA_BITS]
- sequences_length: [BATCH_SIZE, 1] (the same random value between self.min_sequence_length and self.max_sequence_length)
- targets: [BATCH_SIZE, 2*SEQ_LENGTH+2, DATA_BITS]
- masks: [BATCH_SIZE, 2*SEQ_LENGTH+2, 1]
- num_subsequences: [BATCH_SIZE, 1]
-
class
miprometheus.problems.
ScratchPadCommandLines
(params)[source]¶ Class generating sequences of random bit-patterns and targets forcing the system to learn the scratch pad problem (overwriting the memory).
Minor modification I: the target may contain random command lines.
-
__init__
(params)[source]¶ Constructor - stores parameters. Calls parent class AlgorithmicSeqToSeqProblem initialization.
Parameters: params – Dictionary of parameters (read from configuration .yaml file).
-
generate_batch
(batch_size)[source]¶ Generates a batch of samples of size batch_size on-the-fly.
Note
The sequence length is drawn randomly between self.min_sequence_length and self.max_sequence_length.
Warning
All the samples within the batch will have the same sequence length.
Parameters: batch_size – Size of the batch to be returned.
Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘masks’, ‘num_subsequences’}), with:
- sequences: [BATCH_SIZE, SEQ_LENGTH, CONTROL_BITS+DATA_BITS],
- sequences_length: [BATCH_SIZE] (random value between self.min_sequence_length and self.max_sequence_length)
- targets: [BATCH_SIZE, SEQ_LENGTH, DATA_BITS],
- masks: [BATCH_SIZE, SEQ_LENGTH, 1]
- num_subsequences: [BATCH_SIZE, 1] (number of subsequences)
-
-
class
miprometheus.problems.
SerialRecallCommandLines
(params)[source]¶ Class generating sequences of random bit-patterns and targets forcing the system to learn serial recall problem (a.k.a. copy task). The formulation follows the original copy task from NTM paper, where:
There are two markers, indicating:
- beginning of storing/memorization and
- beginning of recalling from memory.
For other elements of the sequence the command bits are set to zero.
Minor modification I: the target may contain random command lines.
Minor modification II: generator returns a mask, which can be used for filtering important elements of the output.
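The description above (two markers, zeroed command bits elsewhere, and a mask selecting the relevant outputs) can be made concrete with a small sketch. The exact layout is an assumption chosen to be consistent with the documented length of 2*SEQ_LENGTH+2: a store marker, the items, a recall marker, then blank inputs during recall. The function name and the list-based representation are hypothetical.

```python
import random


def serial_recall_sample(seq_length, control_bits=2, data_bits=4, rng=random):
    """Build one copy-task sample of length 2*seq_length + 2 (assumed layout)."""
    total = 2 * seq_length + 2
    width = control_bits + data_bits
    sequences = [[0] * width for _ in range(total)]
    targets = [[0] * data_bits for _ in range(total)]
    mask = [False] * total

    items = [[rng.randint(0, 1) for _ in range(data_bits)] for _ in range(seq_length)]

    sequences[0][0] = 1  # first marker: beginning of storing
    for t, item in enumerate(items, start=1):
        sequences[t][control_bits:] = item  # command bits stay at zero
    sequences[seq_length + 1][1] = 1  # second marker: beginning of recalling
    for t, item in enumerate(items, start=seq_length + 2):
        targets[t] = list(item)  # the items are expected back in the same order
        mask[t] = True  # only these positions matter for the loss
    return sequences, targets, mask


seqs, tgts, mask = serial_recall_sample(seq_length=3)
```

The mask is what lets a loss or accuracy computation ignore the storing phase, where the target carries no information.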
-
__init__
(params)[source]¶ Constructor - stores parameters. Calls parent class AlgorithmicSeqToSeqProblem initialization.
Parameters: params – Dictionary of parameters (read from configuration .yaml file).
-
generate_batch
(batch_size)[source]¶ Generates a batch of samples of size batch_size on-the-fly.
Note
The sequence length is drawn randomly between self.min_sequence_length and self.max_sequence_length.
Warning
All the samples within the batch will have the same sequence length.
Parameters: batch_size – Size of the batch to be returned.
Returns: DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘masks’, ‘num_subsequences’}), with:
- sequences: [BATCH_SIZE, 2*SEQ_LENGTH+2, CONTROL_BITS+DATA_BITS]
- sequences_length: [BATCH_SIZE, 1] (the same random value between self.min_sequence_length and self.max_sequence_length)
- targets: [BATCH_SIZE, 2*SEQ_LENGTH+2, DATA_BITS]
- masks: [BATCH_SIZE, 2*SEQ_LENGTH+2, 1]
- num_subsequences: [BATCH_SIZE, 1]
TextToText Problems¶
text_to_text_problem.py: abstract base class for text to text sequential problems, e.g. machine translation.
-
class
miprometheus.problems.seq_to_seq.text2text.text_to_text_problem.
TextToTextProblem
(params)[source]¶ Base class for text to text sequential problems.
Provides some basic features useful in all problems of such type.
-
__init__
(params)[source]¶ Initializes problem object. Calls base SeqToSeqProblem constructor.
Sets nn.NLLLoss() as default loss function.
Parameters: params – Dictionary of parameters (read from configuration .yaml file).
-
compute_BLEU_score
(data_dict, logits)[source]¶ Compute the BLEU score in order to evaluate the translation quality (equivalent of accuracy).
Note
Reference paper: http://www.aclweb.org/anthology/P02-1040.pdf
Implementation inspired from https://machinelearningmastery.com/calculate-bleu-score-for-text-python/
To handle all samples within a batch, we accumulate the individual BLEU score for each pair of sentences and average over the batch size.
Parameters: - data_dict – DataDict({‘inputs’, ‘inputs_length’, ‘inputs_text’, ‘targets’, ‘targets_length’, ‘outputs_text’}).
- logits – Predictions of the model.
Returns: Average BLEU score for the batch (0 ≤ BLEU ≤ 1).
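The batch-averaging step described in the note can be sketched as follows. To stay short and dependency-free, the example uses a simplified unigram-only BLEU (clipped unigram precision times the brevity penalty) rather than the full n-gram score; the function names are hypothetical, and the documented method presumably relies on a complete BLEU implementation.

```python
import math
from collections import Counter


def sentence_bleu1(reference, hypothesis):
    """Unigram BLEU: clipped unigram precision times the brevity penalty."""
    if not hypothesis:
        return 0.0
    ref_counts = Counter(reference)
    hyp_counts = Counter(hypothesis)
    # Each hypothesis word is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[word]) for word, count in hyp_counts.items())
    precision = clipped / len(hypothesis)
    if len(hypothesis) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(reference) / len(hypothesis))
    return brevity_penalty * precision


def batch_bleu(references, hypotheses):
    """Accumulate the per-pair scores and average over the batch size."""
    scores = [sentence_bleu1(r, h) for r, h in zip(references, hypotheses)]
    return sum(scores) / len(scores)


refs = [["the", "cat", "sat"], ["a", "dog"]]
hyps = [["the", "cat", "sat"], ["a", "cat"]]
average = batch_bleu(refs, hyps)
```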
-
evaluate_loss
(data_dict, logits)[source]¶ Computes loss.
By default, the loss function is the Negative Log Likelihood function.
The input given through a forward call is expected to contain log-probabilities (LogSoftmax) of each class.
The input has to be a Tensor of size either (batch_size, C) or (batch_size, C, d1, d2,…,dK) with K ≥ 2 for the K-dimensional case.
The target that this loss expects is a class index (0 to C-1, where C = number of classes).
Parameters: - data_dict – DataDict({‘inputs’, ‘inputs_length’, ‘inputs_text’, ‘targets’, ‘targets_length’, ‘outputs_text’}).
- logits – Predictions of the model.
Returns: loss
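For the (batch_size, C) case, the negative log-likelihood described above reduces to picking the log-probability of the target class for each sample and averaging the negated values. The sketch below shows this on plain lists; the function name is hypothetical, and mean reduction is assumed because it is the default of nn.NLLLoss.

```python
import math


def nll_loss_sketch(log_probs, targets):
    """Mean negative log-likelihood for (batch_size, C) log-probabilities."""
    # For each sample, keep only the log-probability assigned to its target class.
    picked = [row[target] for row, target in zip(log_probs, targets)]
    return -sum(picked) / len(picked)


log_probs = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],
    [math.log(0.1), math.log(0.8), math.log(0.1)],
]
loss = nll_loss_sketch(log_probs, targets=[0, 1])
```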
-
add_statistics
(stat_col)[source]¶ Adds the BLEU score to a StatisticsCollector.
Parameters: stat_col (StatisticsCollector) – Statistics collector.
-
collect_statistics
(stat_col, data_dict, logits)[source]¶ Collects BLEU score.
Parameters: - stat_col – StatisticsCollector.
- data_dict – DataDict({‘inputs’, ‘inputs_length’, ‘inputs_text’, ‘targets’, ‘targets_length’, ‘outputs_text’}).
- logits – Predictions of the model.
-
show_sample
(data_dict, sample=0)[source]¶ Shows the sample (both input and target sequences) using matplotlib. Elementary visualization.
Parameters: - data_dict – DataDict({‘inputs’, ‘inputs_length’, ‘inputs_text’, ‘targets’, ‘targets_length’, ‘outputs_text’}).
- sample – Number of sample in a batch (default: 0)
Note
TODO
-
unicode_to_ascii
(s)[source]¶ Turn a Unicode string to plain ASCII.
See: http://stackoverflow.com/a/518232/2809427.
Parameters: s – Unicode string. Returns: plain ASCII string.
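A common way to perform this conversion with the standard library is to decompose each character and drop the combining marks. The sketch below follows that approach; whether the documented method is implemented exactly this way is an assumption.

```python
import unicodedata


def unicode_to_ascii(s):
    """Strip accents by removing combining marks after NFD decomposition."""
    return "".join(
        c for c in unicodedata.normalize("NFD", s)
        if unicodedata.category(c) != "Mn"  # "Mn" = nonspacing (combining) mark
    )


plain = unicode_to_ascii("Ślusàrski")
```

Characters that have no decomposition (for example "ø") pass through unchanged, so the result is not guaranteed to be pure ASCII.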
-
normalize_string
(s)[source]¶ Lowercase, trim, and remove non-letter characters in string s.
Parameters: s – string. Returns: normalized string.
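A minimal sketch of such a normalization is shown below. The choice to replace each run of non-letter characters with a single space (rather than deleting it) is an assumption, as is the function name.

```python
import re


def normalize_string_sketch(s):
    """Lowercase, trim, and replace runs of non-letter characters with a space."""
    s = s.lower().strip()
    s = re.sub(r"[^a-z]+", " ", s)
    return s.strip()


normalized = normalize_string_sketch("  Hello,  World! 123 ")
```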
-
indexes_from_sentence
(lang, sentence)[source]¶ Constructs a list of indexes using a ‘vocabulary index’ from a specified Lang class instance for the specified sentence (see the Lang class below).
Returns: list of indexes.
-
tensor_from_sentence
(lang, sentence)[source]¶ Uses indexes_from_sentence() to create a tensor of indexes with the EOS token.
Returns: tensor of indexes, terminated by the EOS token.
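The two helpers above can be sketched with a tiny stand-in for the language object. Everything named here is hypothetical: TinyLang only mimics the word2index attribute, the EOS index value is arbitrary, and a plain list is returned where the documented method returns a tensor.

```python
EOS_TOKEN = 1  # hypothetical index reserved for the end-of-sentence token


class TinyLang:
    """Hypothetical stand-in exposing only a word2index dict."""

    def __init__(self, word2index):
        self.word2index = word2index


def indexes_from_sentence(lang, sentence):
    """Look up the vocabulary index of every space-separated word."""
    return [lang.word2index[word] for word in sentence.split(" ")]


def indexes_with_eos(lang, sentence):
    """Same as above, terminated by the EOS index (a list stands in for the tensor)."""
    return indexes_from_sentence(lang, sentence) + [EOS_TOKEN]


french = TinyLang({"je": 2, "suis": 3, "ici": 4})
encoded = indexes_with_eos(french, "je suis ici")
```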
-
tensors_from_pair
(pair, input_lang, output_lang)[source]¶ Creates a tuple of tensors of indexes from a pair of sentences.
Parameters: - pair (tuple) – pair of sentences, one in the input language and one in the output language.
- input_lang – instance of the Lang class, having a word2index dict, representing the input language.
- output_lang – instance of the Lang class, having a word2index dict, representing the output language.
Returns: tuple of tensors of indexes.
-
tensors_from_pairs
(pairs, input_lang, output_lang)[source]¶ Returns a list of tuples of tensors of indexes from a list of pairs of sentences. Uses tensors_from_pair().
Parameters: - pairs (list) – sentence pairs
- input_lang – instance of the class Lang, having a word2index dict, representing the input language.
- output_lang – instance of the class Lang, having a word2index dict, representing the output language.
Returns: list of tensors of indexes.
-
-
class
miprometheus.problems.seq_to_seq.text2text.text_to_text_problem.
Lang
(name)[source]¶ Simple helper class representing a language in a translation task. It contains, for instance, a vocabulary index (the word2index dict) and keeps track of the number of words in the language.
This class is useful as each word in a language will be represented as a one-hot vector: a giant vector of zeros except for a single one (at the index of the word). The dimension of this vector is potentially very high, hence it is generally useful to trim the data to only use a few thousand words per language.
The inputs and targets of the associated sequence to sequence networks will be sequences of indexes, each item representing a word. The attributes of this class (word2index, index2word, word2count) are useful to keep track of this.
-
__init__
(name)[source]¶ Constructor.
Parameters: name – string to name the language (e.g. french, english)
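The bookkeeping described for this class can be sketched as follows. Only `word2index`, `index2word` and `word2count` are named in the documentation; the methods `add_sentence` and `add_word`, the counter `n_words`, the reserved SOS/EOS indexes and the helper `indexes_with_eos` are assumptions borrowed from the PyTorch translation tutorial.

```python
SOS_TOKEN = 0  # assumed index of the start-of-sequence token
EOS_TOKEN = 1  # assumed index of the end-of-sequence token

class Lang:
    """Vocabulary index for one language (illustrative sketch)."""

    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {SOS_TOKEN: "SOS", EOS_TOKEN: "EOS"}
        self.n_words = 2  # SOS and EOS are already counted

    def add_sentence(self, sentence):
        for word in sentence.split(" "):
            self.add_word(word)

    def add_word(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1

def indexes_from_sentence(lang, sentence):
    """Map each word of the sentence to its vocabulary index."""
    return [lang.word2index[word] for word in sentence.split(" ")]

def indexes_with_eos(lang, sentence):
    """Same as above, terminated by the EOS index (plain list, not a tensor)."""
    return indexes_from_sentence(lang, sentence) + [EOS_TOKEN]

english = Lang("english")
english.add_sentence("i am cold .")
english.add_sentence("i am here .")
```

Here `indexes_with_eos` stands in for `tensor_from_sentence()`: the library function returns a tensor, whereas the sketch stays with plain lists to avoid a PyTorch dependency.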
-
TranslationAnki¶
-
class
miprometheus.problems.
TranslationAnki
(params)[source]¶ Class generating sequences of indexes as inputs & targets for an English <-> Other Language translation task.
Warning
The inspiration for this class being an existing PyTorch tutorial, this class is limited.
It currently only supports the files located at http://www.manythings.org/anki/.
It currently only supports the Latin alphabet (because of string normalization) and does not include advanced features like beam search or pretrained embeddings.
Take this class as an example and not as a production-ready application.
-
__init__
(params)[source]¶ Initializes the problem: stores parameters. Calls parent class TextToTextProblem initialization.
Parameters: params – Dictionary of parameters (read from configuration .yaml file).
-
prepare_data
()[source]¶ Prepare the data for generating batches.
Uses filter_pairs() to normalize, trim & filter input sentence pairs. Also fills in Lang() instances for the input & output languages.
Returns: Lang() objects for the input & output languages, and the filtered sentence pairs.
-
download
()[source]¶ Download the specified zip file from http://www.manythings.org/anki/.
Notes: This website hosts data files for English -> other language translation: the main file is named after the other language.
Ex: for an English -> French translation, the main file is named ‘fra.txt’.
Ex: for an English -> German translation, the main file is named ‘deu.txt’, etc.
-
filter_pair
(p)[source]¶ Indicate whether a sentence pair is compliant with some filtering criteria, such as:
- The number of words (that includes ending punctuation) in the sentences,
- The start of the input language sentence.
Parameters: p (list) – pair of sentences Returns: True if the pair respects the filtering constraints else False.
-
filter_pairs
()[source]¶ Filter several pairs at once using filter_pair as a boolean mask.
Returns: list of filtered pairs.
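The two filtering criteria above can be sketched as below. The maximum length and the list of accepted sentence starts are hypothetical values chosen for the example; the actual thresholds come from the problem's parameters.

```python
MAX_LENGTH = 10  # hypothetical maximum number of words (punctuation included)
ENG_PREFIXES = ("i am ", "he is ", "she is ", "you are ", "we are ", "they are ")  # hypothetical

def filter_pair(pair, max_length=MAX_LENGTH, prefixes=ENG_PREFIXES):
    """True if both sentences are short enough and the input starts with a known prefix."""
    source, target = pair
    short_enough = (
        len(source.split(" ")) < max_length
        and len(target.split(" ")) < max_length
    )
    return short_enough and source.startswith(prefixes)

def filter_pairs(pairs):
    """Keep only the pairs accepted by filter_pair."""
    return [pair for pair in pairs if filter_pair(pair)]

pairs = [
    ["i am cold .", "j ai froid ."],
    ["the cat sleeps .", "le chat dort ."],
]
kept = filter_pairs(pairs)
```

Restricting the input sentences to a few prefixes shrinks the dataset considerably, which keeps training quick for a demonstration task.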
-
__getitem__
(index)[source]¶ Retrieves a sample from self.tensor_pairs and gets the associated strings from self.pairs.
Parameters: index (int) – index of the sample to return.
Returns: DataDict({‘inputs’, ‘inputs_length’, ‘inputs_text’, ‘targets’, ‘targets_length’, ‘targets_text’}).
-
collate_fn
(batch)[source]¶ Combines a list of DataDict (retrieved with __getitem__) into a batch.
Note
Because each tokenized sentence has a variable length, padding is necessary to create batches.
Hence, for a given batch, each sentence is padded to the length of the longest one.
The batch is sorted in decreasing order of input sentence length.
The padded length changes between batches, but this shouldn’t be an issue.
Parameters: batch (list) – Individual samples to combine.
Returns: DataDict({'inputs', 'inputs_length', 'inputs_text', 'targets', 'targets_length', 'targets_text'}) containing the batch.
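The padding and sorting described in the note can be sketched with plain lists and dicts. The function name `pad_and_sort` and the padding index are hypothetical; the library version works on tensors and returns a DataDict.

```python
PAD_INDEX = 0  # hypothetical padding index

def pad_and_sort(batch, pad_index=PAD_INDEX):
    """Sort samples by decreasing input length, then pad to the longest sentence.

    batch: list of dicts with 'inputs' and 'targets' lists of word indexes.
    """
    batch = sorted(batch, key=lambda s: len(s["inputs"]), reverse=True)
    max_in = max(len(s["inputs"]) for s in batch)
    max_out = max(len(s["targets"]) for s in batch)
    return {
        "inputs": [s["inputs"] + [pad_index] * (max_in - len(s["inputs"])) for s in batch],
        "inputs_length": [len(s["inputs"]) for s in batch],
        "targets": [s["targets"] + [pad_index] * (max_out - len(s["targets"])) for s in batch],
        "targets_length": [len(s["targets"]) for s in batch],
    }

samples = [
    {"inputs": [5, 6], "targets": [7, 8, 9]},
    {"inputs": [2, 3, 4, 1], "targets": [4, 1]},
]
padded = pad_and_sort(samples)
```

Keeping the original lengths next to the padded sequences lets a model ignore the padded positions later, for instance when packing sequences for a recurrent network.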
-
plot_preprocessing
(data_dict, logits)[source]¶ Preprocesses the logits in order to plot the attention weights for the AttnEncoderDecoder model.
Warning
This function hasn’t been reviewed yet.
Parameters: - data_dict – DataDict({‘sequences’, ‘sequences_length’, ‘targets’, ‘mask’, ‘inputs_text’, ‘outputs_text’}).
- logits – prediction, shape [batch_size x max_seq_length x output_voc_size]
Returns: data_dict, + logits as dict {‘inputs_text’, ‘logits_text’}
-
VideoToClass Problems¶
-
class
miprometheus.problems.
VideoToClassProblem
(params)[source]¶ Abstract base class for sequential vision problems.
Problem classes like Sequential MNIST inherit from it.
Provides some basic features useful in all problems of such type.
-
__init__
(params)[source]¶ Initializes problem:
Calls problems.problem.Problem class constructor,
Sets loss function to CrossEntropy,
Sets self.data_definitions to:
>>> self.data_definitions = {'images': {'size': [-1, -1, 3, -1, -1], 'type': [torch.Tensor]},
>>>                          'mask': {'size': [-1, -1, -1], 'type': [torch.Tensor]},
>>>                          'targets': {'size': [-1, -1, 1], 'type': [torch.Tensor]},
>>>                          'targets_label': {'size': [-1, 1], 'type': [list, str]}
>>>                          }
Parameters: params – Dictionary of parameters (read from configuration .yaml file).
-
calculate_accuracy
(data_dict, logits)[source]¶ Calculates accuracy as the mean number of correct classifications in a given batch.
Warning
Applies a mask to the logits.
Parameters: - logits – Predictions of the model.
- data_dict (DataDict) – DataDict containing the targets and the mask.
Returns: Accuracy.
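How a mask can restrict the accuracy to selected time steps is sketched below with nested lists. This is only one plausible reading of "applies a mask to the logits": positions where the mask is 0 are skipped. The function name `masked_accuracy` and the data layout are assumptions.

```python
def masked_accuracy(logits, targets, mask):
    """Fraction of correct predictions over the positions selected by the mask.

    logits:  [batch][sequence][classes] scores.
    targets: [batch][sequence] class indexes.
    mask:    [batch][sequence] of 0/1 flags; 1 means the step is evaluated.
    """
    correct = 0
    counted = 0
    for sample_logits, sample_targets, sample_mask in zip(logits, targets, mask):
        for scores, target, keep in zip(sample_logits, sample_targets, sample_mask):
            if not keep:
                continue
            predicted = max(range(len(scores)), key=scores.__getitem__)  # argmax
            correct += int(predicted == target)
            counted += 1
    return correct / counted if counted else 0.0

# Two sequences of two steps each; only the last step of each is evaluated.
logits = [[[0.1, 0.9], [0.8, 0.2]], [[0.3, 0.7], [0.6, 0.4]]]
targets = [[1, 1], [0, 0]]
mask = [[0, 1], [0, 1]]
accuracy = masked_accuracy(logits, targets, mask)
```

In sequential classification the mask typically selects the final step, since the class can only be decided once the whole sequence has been seen.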
-
evaluate_loss
(data_dict, logits)[source]¶ Computes loss.
Warning
Applies a mask to the logits.
Parameters: - logits – Predictions of the model.
- data_dict (DataDict) – DataDict containing the targets and the mask.
Returns: Loss.
-
add_statistics
(stat_col)[source]¶ Add accuracy statistic to StatisticsCollector.
Parameters: stat_col – StatisticsCollector.
-
collect_statistics
(stat_col, data_dict, logits)[source]¶ Collects accuracy.
Parameters: - stat_col – StatisticsCollector.
- data_dict (DataDict) – DataDict containing the targets and the mask.
- logits – Predictions of the model.
-
Sequential MNIST¶
-
class
miprometheus.problems.
PermutedSequentialRowMnist
(params)[source]¶ The Permuted MNIST is a sequence of classification tasks in which the rows of the input images are swapped with a random permutation.
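The row permutation can be pictured with the sketch below, which reorders the 28 rows of an image held as a list of lists. Whether the library draws one permutation for the whole dataset or a new one per sample is not stated here, so the fixed seeded permutation is an assumption, as is the helper name `permute_rows`.

```python
import random

def permute_rows(image, permutation):
    """Return the image with its rows reordered according to the permutation."""
    return [image[i] for i in permutation]

rng = random.Random(0)           # fixed seed so the permutation is reproducible
permutation = list(range(28))
rng.shuffle(permutation)

# Dummy 28x28 image whose row r is filled with the value r.
image = [[row] * 28 for row in range(28)]
permuted = permute_rows(image, permutation)
```

Each row then becomes one step of the input sequence, so the model receives the same 28 rows as in the original image but in a scrambled order.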
Warning
The dataset is not originally split into a training set, validation set and test set; it only provides a training set and a test set. It is recommended to use a validation set.
torch.utils.data.SubsetRandomSampler is recommended.
-
__init__
(params)[source]¶ Initializes PermutedSequentialRowMnist problem:
Calls problems.problem.VideoToClassProblem class constructor,
Sets following attributes using the provided params:
self.root_dir (string) : Root directory of dataset where processed/training.pt and processed/test.pt will be saved,
self.use_train_data (bool, optional) : If True, creates dataset from training.pt, otherwise from test.pt
self.default_values:
>>> self.default_values = {'nb_classes': 10,
>>>                        'num_channels': 1,
>>>                        'width': 28,
>>>                        'height': 28}
self.data_definitions:
>>> self.data_definitions = {'images': {'size': [-1, 28, 1, 1, 28], 'type': [torch.Tensor]},
>>>                          'mask': {'size': [-1, 28, 1], 'type': [torch.Tensor]},
>>>                          'targets': {'size': [-1, 28, 1], 'type': [torch.Tensor]},
>>>                          'targets_label': {'size': [-1, 1], 'type': [list, str]}
>>>                          }
Parameters: params – Dictionary of parameters (read from configuration .yaml file).
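Since only training.pt and test.pt exist, a validation subset has to be taken from the training indexes. The sketch below shows one way to produce two disjoint index lists; the helper `split_indices` and the validation size of 6000 are hypothetical choices (the MNIST training set holds 60000 images).

```python
import random

def split_indices(dataset_size, validation_size, seed=0):
    """Shuffle all indexes and split them into (train, validation) lists."""
    indices = list(range(dataset_size))
    random.Random(seed).shuffle(indices)
    return indices[validation_size:], indices[:validation_size]

train_idx, valid_idx = split_indices(60000, 6000)

# With PyTorch available, each list could be wrapped in
# torch.utils.data.SubsetRandomSampler(...) and passed to a DataLoader
# through its `sampler` argument.
```

Using a fixed seed keeps the split identical across runs, so validation results remain comparable.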
-
__getitem__
(index)[source]¶ Getter method to access the dataset and return a sample.
Parameters: index (int) – index of the sample to return.
Returns: DataDict({'images','targets', 'targets_label'}), with:
- images: Image,
- mask,
- targets: Index of the target class
- targets_label: Label of the target class (cf self.labels)
-
collate_fn
(batch)[source]¶ Combines a list of DataDict (retrieved with __getitem__) into a batch.
Note
This function wraps a call to default_collate and simply returns the batch as a DataDict instead of a dict. Multi-processing is supported as the data sources are small enough to be kept in memory (training.pt has a size of 47.5 MB).
Parameters: batch – list of individual DataDict samples to combine.
Returns: DataDict({'images','targets', 'targets_label'}) containing the batch.
-
-
class
miprometheus.problems.
SequentialPixelMNIST
(params)[source]¶ The Sequential MNIST implies that the model does not get to see/generate the whole image at once (like for example a normal 2d-ConvNet would), but only one pixel at a time sequentially.
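Presenting an image one pixel at a time amounts to flattening it into a sequence of 28*28 = 784 steps, each a 1x1 single-channel "image". The sketch below assumes row-major order and uses nested lists in a [sequence length, channels, x, y] layout; the helper name is hypothetical.

```python
def image_to_pixel_sequence(image):
    """Flatten a 2D image into a sequence of [channels][x][y] = 1x1x1 items."""
    return [[[[pixel]]] for row in image for pixel in row]

# Dummy 28x28 image whose pixel value encodes its row-major position.
image = [[r * 28 + c for c in range(28)] for r in range(28)]
sequence = image_to_pixel_sequence(image)
```

The resulting sequence is far longer than in the row-wise variant (784 steps instead of 28), which makes the task a demanding test of long-range memory.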
Warning
The dataset is not originally split into a training set, validation set and test set; it only provides a training set and a test set. It is recommended to use a validation set.
torch.utils.data.SubsetRandomSampler is recommended.
-
__init__
(params)[source]¶ Initializes SequentialPixelMNIST problem:
Calls problems.problem.VideoToClassProblem class constructor,
Sets following attributes using the provided params:
self.root_dir (string) : Root directory of dataset where processed/training.pt and processed/test.pt will be saved,
self.use_train_data (bool, optional) : If True, creates dataset from training.pt, otherwise from test.pt
self.default_values:
>>> self.default_values = {'nb_classes': 10,
>>>                        'length': 28*28}
self.data_definitions:
>>> self.data_definitions = {'images': {'size': [-1, 28*28, 1, 1, 1], 'type': [torch.Tensor]},
>>>                          'mask': {'size': [-1, 28*28, 1], 'type': [torch.Tensor]},
>>>                          'targets': {'size': [-1, 28*28, 1], 'type': [torch.Tensor]},
>>>                          'targets_label': {'size': [-1, 1], 'type': [list, str]}
>>>                          }
Parameters: params – Dictionary of parameters (read from configuration .yaml file).
-
__getitem__
(index)[source]¶ Getter method to access the dataset and return a sample.
Parameters: index (int) – index of the sample to return.
Returns: DataDict({'images', 'mask', 'targets', 'targets_label'}), with:
- images: sequence of ‘images’ in [batch size, sequence length, channels, x, y] format. Single pixels, so x == y == 1
- mask
- targets: Index of the target class
-
collate_fn
(batch)[source]¶ Combines a list of DataDict (retrieved with __getitem__) into a batch.
Note
This function wraps a call to default_collate and simply returns the batch as a DataDict instead of a dict. Multi-processing is supported as the data sources are small enough to be kept in memory (training.pt has a size of 47.5 MB).
Parameters: batch – list of individual DataDict samples to combine.
Returns: DataDict({'sequences','targets', 'targets_label'}) containing the batch.
-