How to organize and reference your data

We use a simple file structure and a JSON file to reference and organize our data as a minimal database. This also lets us record useful biological, experimental or processing information about the object under study.

Note

We chose the JSON format because we believe it is a fairly easy-to-use and human-readable format.

Using this may at first seem like a big constraint and maybe too much work, but later it will:

  • be filled automatically by scripts and tools available in TimageTK, allowing for a higher level of automation (that means less work for you, yes!).

  • become easier as you use it and, who knows, maybe you will start to ~~like it~~ fully use it by adding more and more information like “observation metadata” or some “conclusions”, enriching the dataset even more.

  • save you time and headaches when you need to recall what you did, and how, while writing that paper or preparing that presentation… information matters!

Example

We use a simple and classic Experiment/Dataset/Data hierarchy backed by a JSON file to save biological & processing metadata. A folder tree view of the “database” my_local_db would look like this:

my_local_db/
├── experiment_001/
│   ├── raw/
│   │   ├── Data_001.ext
│   │   ├── ...
│   ├── filtered_dataset/
│   │   ├── FData_001.ext
│   │   ├── ...
│   ├── ...
│   └── experiment_001.json
├── experiment_002/
│   ├── raw/
│   │   ├── Data_001.ext
│   │   ├── ...
│   ├── filtered_dataset/
│   │   ├── FData_001.ext
│   │   ├── ...
│   ├── ...
│   └── experiment_002.json
...

Note

The term “experiment” here refers to a single observation unit, e.g. a single intensity image or a temporal sequence. The datasets are the successive transformation steps leading to the desired analysis; see the next section for details and examples.
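
The layout above can be sketched in a few lines of Python. This is only an illustration, not a TimageTK tool: the function name init_experiment and the JSON keys ("experiment", "datasets") are assumptions made for this example, since the actual content of the experiment JSON file is not specified here.

```python
import json
from pathlib import Path


def init_experiment(db_root, name, datasets=("raw",)):
    """Create <db_root>/<name>/ with one sub-folder per dataset and a <name>.json file.

    Hypothetical helper: the JSON keys used below are placeholders.
    """
    exp_dir = Path(db_root) / name
    exp_dir.mkdir(parents=True, exist_ok=True)
    for dataset in datasets:
        (exp_dir / dataset).mkdir(exist_ok=True)

    json_path = exp_dir / f"{name}.json"
    if not json_path.exists():
        # Assumed minimal layout: one (empty) metadata entry per dataset.
        metadata = {"experiment": name, "datasets": {d: {} for d in datasets}}
        json_path.write_text(json.dumps(metadata, indent=2))
    return json_path


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        created = init_experiment(Path(tmp) / "my_local_db", "experiment_001",
                                  ["raw", "filtered_dataset"])
        print(created.relative_to(tmp))
```

An existing JSON file is never overwritten, so the helper can be re-run safely when a new dataset folder is added to an experiment.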

Dataset names

The following list of names is used and recommended. Except for raw, they are all optional and depend on the type of data you have and the analysis you want to perform:

  • raw: contains the original data from microscopy;

  • cell_intensity_image: the intensity images with a membrane- or wall-targeted marker; this can also be a reference to a channel of the raw data (if a multichannel format) in the related “experiment JSON file”;

  • nuclei_intensity_image: the intensity images with a nuclei-targeted marker; this can also be a reference to a channel of the raw data (if a multichannel format) in the related “experiment JSON file”;

  • multiangle_landmarks: the manually defined landmarks used to perform multi-angle image fusion;

  • multiangle_fusion: the images resulting from a multi-angle image fusion;

  • watershed_segmentation: the watershed segmentation of a cellular tissue (from cell_intensity_image or multiangle_fusion);

  • nuclei_detection: the nuclei segmented image (from cell_intensity_image); # TODO, see with G.C.

  • spatial_properties: the cell-based geometrical properties (as CSV files) like volume or contact areas; # TODO

  • signal_quantification: the cell-based or nuclei-based CSV files with relative or absolute expression signal of targeted genes with a given marker (fluorescent, …); # TODO

  • temporal_landmarks: the manually defined landmarks used to perform temporal registration; # TODO

  • lineage: the cell lineage files;

  • temporal_properties: the cell-based temporal properties (as CSV files) like volumetric growth rates or growth tensors; # TODO

  • temporal_clustering: yes we can! # TODO

Note

Other sub-cellular target names can be defined for signal quantification! Talk to your local developer about it!
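
As a rough illustration of how the naming convention above could be checked, the sketch below lists the dataset folders of one experiment and reports whether raw is present and which folder names are not in the recommended list. The function check_experiment is hypothetical and not part of TimageTK. Since the names are only recommended, unknown folders are reported rather than rejected.

```python
from pathlib import Path

# Dataset names taken from the list above.
RECOMMENDED_DATASETS = {
    "raw", "cell_intensity_image", "nuclei_intensity_image",
    "multiangle_landmarks", "multiangle_fusion", "watershed_segmentation",
    "nuclei_detection", "spatial_properties", "signal_quantification",
    "temporal_landmarks", "lineage", "temporal_properties",
    "temporal_clustering",
}


def check_experiment(exp_dir):
    """Return (has_raw, unrecognized) for the dataset folders found in exp_dir.

    Hypothetical helper: has_raw is True when a 'raw' folder exists, and
    unrecognized is the sorted list of folder names outside the recommended set.
    """
    found = {p.name for p in Path(exp_dir).iterdir() if p.is_dir()}
    return "raw" in found, sorted(found - RECOMMENDED_DATASETS)
```

Calling check_experiment("my_local_db/experiment_001") on the tree shown earlier would flag filtered_dataset as a non-recommended name, which is a hint rather than an error.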

RAW data

Important

It is crucial that we ALWAYS preserve the RAW data!

Warning

Use a properly set up (image) database (with backups, …) like OMERO, or ALWAYS keep at least two copies of your raw data, with one on a secure & unused drive or in a cloud storage service!
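
If you manage the copies yourself, it helps to check from time to time that the second copy still matches the original. The following is a minimal sketch using SHA-256 checksums from the Python standard library. The function names and the idea of comparing two folders are assumptions made for this example, not a TimageTK or OMERO feature.

```python
import hashlib
from pathlib import Path


def sha256sum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_files(raw_dir, backup_dir):
    """List files of raw_dir that are missing from, or differ in, backup_dir.

    Paths are returned relative to raw_dir, in sorted order.
    """
    raw_dir, backup_dir = Path(raw_dir), Path(backup_dir)
    problems = []
    for src in sorted(raw_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(raw_dir)
        dst = backup_dir / rel
        if not dst.is_file() or sha256sum(src) != sha256sum(dst):
            problems.append(rel.as_posix())
    return problems
```

An empty list means every file under the raw folder has an identical counterpart in the backup. Files that exist only in the backup are not reported.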