FAIRLinked.RDFTableConversion.MDS_DF

Submodules

FAIRLinked.RDFTableConversion.MDS_DF.main module

class FAIRLinked.RDFTableConversion.MDS_DF.main.MatDatSciDf(df: DataFrame, metadata_template: dict | None = None, matched_log: list | None = None, unmatched_log: list | None = None, data_relations_dict: dict | None = None, orcid: str = '0000-0000-0000-0000', df_name: str | None = None, metadata_rows: bool | None = False, ontology_graph: Graph | None = None, base_uri='https://cwrusdle.bitbucket.io/mds/', local_unit_file: bool | None = True)[source]

Bases: object

A semantic wrapper for Pandas DataFrames in the Materials Data Science domain.

This class serves as a “Semantic Firewall” for experimental materials data. It bridges tabular data and Linked Data by maintaining synchronized internal objects for measurement data, semantic headers, metadata templates, and column-to-column relationships. It enforces FAIR principles by validating researcher identifiers (ORCID) and ensuring ontological consistency before serialization.

df

The cleaned measurement data, stripped of metadata headers.

Type:: pd.DataFrame

header_df

A 3-row buffer (Type, Unit, Study Stage) used for mapping or pre-allocating metadata for the dataset.

Type:: pd.DataFrame

metadata_obj

The internal manager handling the RDFLib Graph and JSON-LD template synchronization.

Type:: MatDatSciDf.Metadata

data_relations

The internal manager for defining semantic links (Object/Datatype properties) between columns.

Type:: MatDatSciDf.DataRelationsDict

orcid

Validated ORCID iD of the data curator.

Type:: str

orcid_verified

Boolean status of curator identity verification.

Type:: bool

df_name

Descriptive name for the dataset used in file exports.

Type:: str

ontology

The reference ontology graph used for fuzzy matching and property resolution.

Type:: rdflib.Graph

base_uri

The namespace prefix used for generating semantic subjects.

Type:: str

add_column_metadata(col_name: str, rdf_type: str, unit: str = 'UNITLESS', definition: str = 'Definition not available', study_stage: str = 'UNK')[source]

Registers and appends metadata for a specific data column to both the temporary JSON-LD graph and the internal RDFLib Graph.

This method prevents duplicate entries by checking the existing JSON-LD @graph for the column name. If the column does not exist, it constructs a clean Python dictionary representing the JSON-LD entity, appends it to the temporary graph structure, and synchronizes it by parsing it into the internal template_graph.

Parameters:

col_name (str) – The exact name of the data column (e.g., ‘patient_age’). Used as the skos:altLabel identifier to prevent duplicate entries.
rdf_type (str) – The RDF semantic type or class for the column. If a namespace prefix (like ‘mds:’) is omitted, the ‘mds:’ prefix will be automatically prepended.
unit (str, optional) – The measurement unit of the column data, mapped to a QUDT ontology identifier. Defaults to “UNITLESS”.
definition (str, optional) – A human-readable textual description of what the column represents. Defaults to “Definition not available”.
study_stage (str, optional) – The phase or stage of the study lifecycle this data belongs to (e.g., ‘COLLECTION’, ‘ANALYSIS’). Defaults to “UNK”.

Return type:

None

Raises:

ValueError – If required parameters are malformed (handled by downstream JSON/RDF parsers).
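A minimal sketch of the duplicate check and prefix handling described above, using a plain dictionary in place of the library's internal template. The function name `add_column_entry`, the exact set of entity keys, and the `unit:` prefix handling are assumptions for illustration; the synchronization with the internal RDFLib graph is omitted.

```python
def add_column_entry(template, col_name, rdf_type, unit="UNITLESS",
                     definition="Definition not available", study_stage="UNK"):
    """Append a column entity to template['@graph'] unless it already exists.

    Returns True when an entry was added, False when col_name was a duplicate.
    """
    graph = template.setdefault("@graph", [])
    # Duplicate check keyed on skos:altLabel, as the description states.
    if any(entry.get("skos:altLabel") == col_name for entry in graph):
        return False
    # Prepend 'mds:' when no namespace prefix is present.
    if ":" not in rdf_type:
        rdf_type = f"mds:{rdf_type}"
    # Assumption: bare unit names are placed in a 'unit:' namespace.
    unit_id = unit if ":" in unit else f"unit:{unit}"
    graph.append({
        "@type": rdf_type,
        "skos:altLabel": col_name,
        "skos:definition": definition,
        "qudt:hasUnit": {"@id": unit_id},
        "mds:hasStudyStage": study_stage,
    })
    return True
```

Calling the function twice with the same `col_name` leaves a single entry in `@graph`, which mirrors the duplicate protection described for this method.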

add_relations(data_relations: dict)[source]

delete_column_metadata(col_name: str)[source]

Top-level API to remove a column’s semantic metadata definition. Useful for cleaning up incorrect mappings or unwanted discovery columns.

Parameters:: col_name (str) – The column label to remove from the metadata template.

delete_relation(prop_key: str, pair: tuple | None = None)[source]

Top-level API to remove semantic links between columns.

Parameters:

prop_key (str) – The property identifier (e.g., ‘mds:measuredBy’).
pair (tuple, optional) – Specific (subj, obj) columns to un-link. If None, removes all links for that property.

df_name = 'Unnamed_Dataframe'

classmethod from_rdf_dir(input_dir: str, orcid: str, metadata_template: dict | None = None, data_relations_dict: dict | None = None, df_name: str = 'Imported_RDF_Data', ontology_graph: Graph | None = None, base_uri: str = 'https://cwrusdle.bitbucket.io/mds/')[source]

Factory method to reconstruct a MatDatSciDf instance and validate semantic integrity from a directory of RDF files.

This method crawls a directory for supported RDF formats, parses the triples, and reconstructs the tabular data (DataFrame) and metadata (JSON-LD Template). It serves as a data audit pipeline by cross-referencing file-level triples against a master template for unit consistency and a user-provided schema for structural integrity.

Parameters:

input_dir (str) – Path to the directory containing RDF files (JSON-LD, Turtle, etc.).
orcid (str) – The ORCID identifier of the user performing the reconstruction.
data_relations_dict (dict, optional) – The expected Subject-Predicate-Object schema to validate against each file. If provided, mismatches are logged.
df_name (str, optional) – Descriptive name for the resulting DataFrame and validation report. Defaults to “Imported_RDF_Data”.
ontology_graph (rdflib.Graph, optional) – A reference ontology used to resolve labels and CURIEs during validation.
base_uri (str, optional) – The base URI used for semantic subject identification. Defaults to “https://cwrusdle.bitbucket.io/mds/”.

Returns:

A fully initialized and validated instance containing the reconstructed dataset and associated semantic logs.

Return type:

MatDatSciDf

Reports & Logs:

Generates ‘{df_name}_import_validation.txt’ in the input directory.
Logs Unit Conflicts: Flagged if a column unit differs from the first encountered definition.
Logs Schema Mismatches: Flagged if expected semantic links are missing within individual RDF graphs.

Note

Supported extensions: .jsonld, .ttl, .nt, .rdf, .xml.
Missing data columns in specific files are filled with ‘pd.NA’ to maintain tabular integrity.
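The file discovery and unit-conflict rule above can be sketched with the standard library alone. Both helper names are hypothetical, the directory scan is assumed to be non-recursive, and the per-file unit mapping is assumed to have been extracted already; the actual parsing of triples is not shown.

```python
from pathlib import Path

# Extensions listed in the note above.
SUPPORTED_EXTENSIONS = {".jsonld", ".ttl", ".nt", ".rdf", ".xml"}


def find_rdf_files(input_dir):
    """Return supported RDF files in input_dir (assumption: top level only)."""
    return sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


def find_unit_conflicts(units_per_file):
    """Flag units that differ from the first encountered definition.

    units_per_file maps a file name to a {column: unit} dictionary.
    """
    first_seen = {}
    conflicts = []
    for file_name, units in units_per_file.items():
        for column, unit in units.items():
            expected = first_seen.setdefault(column, unit)
            if unit != expected:
                conflicts.append(
                    f"{file_name}: column '{column}' has unit '{unit}', "
                    f"first seen as '{expected}'"
                )
    return conflicts
```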

get_relation_pairs_onto()[source]

Analyzes the ontology and metadata template to discover relationships between columns.

Returns:: { URI: [(subj_col, obj_col), …] }
Return type:: dict

get_relations()[source]

Extracts all Object and Datatype properties from the associated ontology.

This method scans the ontology graph for OWL ObjectProperties and DatatypeProperties, mapping their human-readable rdfs:labels to their full URIs and property types.

Returns:

A dictionary (prop_metadata_dict) where:

Key: Property label (str)
Value: Tuple of (Property URI, Property Type)

Return type:

dict

mds_graph = <Graph identifier=Nfe1a650e28be408b9b318849071c2952 (<class 'rdflib.graph.Graph'>)>

overwrite_metadata(metadata_template: dict)[source]: Wrapper to delete and replace metadata information. WARNING: THIS WILL DELETE ALL CURRENT METADATA

save_mds_df(output_dir: str, metadata_in_output_df: bool = False, formats: list = ['csv', 'parquet', 'arrow'])[source]

Saves the internal DataFrame and associated metadata to the local file system.

This method supports multi-format export (CSV, Parquet, Arrow). It can also generate a ‘semantic’ version of the CSV where the first three rows of the file contain the RDF Type, QUDT Unit, and Study Stage for each column, facilitating human readability and FAIR data principles.

Parameters:

output_dir (str) – The directory path where files will be stored.
metadata_obj (Metadata, optional) – The Metadata management object. If provided, it will also trigger the saving of the JSON-LD template and match logs.
metadata_in_output_df (bool, optional) – If True, prepends three header rows (Type, Units, Study Stage) to the CSV output. Defaults to False.
formats (list, optional) – A list of strings specifying output formats. Supported: ‘csv’, ‘parquet’, ‘arrow’, ‘feather’. Defaults to [“csv”, “parquet”, “arrow”].

Note

When ‘metadata_in_output_df’ is True, only the CSV format will contain the multi-row headers. Parquet and Arrow formats are saved using a ‘clean’ version (data only) to preserve strict schema typing.
For Parquet and Arrow exports, all columns are cast to strings to ensure compatibility with mixed-type metadata fields.
The method automatically standardizes column order alphabetically.

Returns:: None
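As an illustration of the ‘semantic’ CSV layout, the sketch below writes a column-name row followed by the three metadata rows and then the data, with columns ordered alphabetically. It uses only the `csv` module; the helper name, the placement of the column-name row, and the example types are assumptions, not the library's implementation.

```python
import csv
import io


def write_semantic_csv(handle, columns, header_rows, data_rows):
    """Write column names, then Type / Unit / Study Stage rows, then data."""
    ordered = sorted(columns)  # alphabetical column order, as noted above
    writer = csv.writer(handle)
    writer.writerow(ordered)
    for label in ("Type", "Unit", "Study Stage"):
        writer.writerow([header_rows[label].get(col, "") for col in ordered])
    for row in data_rows:
        writer.writerow([row.get(col, "") for col in ordered])


buffer = io.StringIO()
write_semantic_csv(
    buffer,
    columns=["temp", "mass"],
    header_rows={
        "Type": {"temp": "mds:Temperature", "mass": "mds:Mass"},  # hypothetical types
        "Unit": {"temp": "unit:DEG_C", "mass": "unit:KiloGM"},
        "Study Stage": {"temp": "COLLECTION", "mass": "COLLECTION"},
    },
    data_rows=[{"temp": 21.5, "mass": 0.3}],
)
lines = buffer.getvalue().splitlines()
```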

save_metadata(output_path: str, matched_log_path: str | None = None, unmatched_log_path: str | None = None)[source]: Wrapper to export the JSON-LD template and the status logs (matched/unmatched columns) to files.

static search_license(query: str)[source]

Searches the SPDX license database for a matching ID or name.

This is a utility method: it can be called as MatDatSciDf.search_license(“MIT”) without initializing the class (i.e., no DataFrame or ORCID required).

Parameters:: query (str) – The search term (e.g., ‘Creative Commons’, ‘GPL’, ‘MIT’).
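The return format of this method is not documented here. As a rough sketch of an ID-or-name search, the example below matches a query case-insensitively against a small hand-written table of SPDX identifiers; the table, the function name, and the list-of-tuples result are all assumptions.

```python
# A tiny stand-in for the SPDX license list (IDs mapped to full names).
LICENSES = {
    "MIT": "MIT License",
    "CC0-1.0": "Creative Commons Zero v1.0 Universal",
    "CC-BY-4.0": "Creative Commons Attribution 4.0 International",
    "GPL-3.0-only": "GNU General Public License v3.0 only",
}


def search_license_sketch(query):
    """Return (id, name) pairs whose ID or name contains the query."""
    needle = query.strip().lower()
    return [
        (license_id, name)
        for license_id, name in LICENSES.items()
        if needle in license_id.lower() or needle in name.lower()
    ]
```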

semantic_remapping(data_graph: Graph)[source]: Validates types against the reference ontology and remaps unrecognized types to the base BFO Entity class.

serialize_bulk(output_path: str, format='json-ld', row_key_cols: list[str] | None = None, id_cols: list[str] | None = None, label_pairs: list[tuple[str, str]] | None = None, license: str | None = None, write_files: bool | None = True) → Graph[source]

Aggregates all row-level RDF graphs into a single master file while preserving the original context.

This method performs a “Bulk Serialization” by first generating RDF subgraphs for every row in the DataFrame and then merging them into a singular master Graph object. Unlike ‘serialize_row’, which creates multiple files, this method outputs one unified dataset file, ensuring that the JSON-LD ‘@context’ is applied globally to maintain consistent prefixing (e.g., ‘mds:’, ‘qudt:’) across all entries.

Parameters:

output_path (str) – The full destination path, including the filename and extension, where the aggregated graph will be saved.
format (str, optional) – The RDF serialization format (e.g., ‘json-ld’, ‘turtle’, ‘xml’). Defaults to ‘json-ld’.
row_key_cols (list[str], optional) – Column names used to generate unique row identifiers.
id_cols (list[str], optional) – Column names to be used as entity identifiers (@id) instead of row keys.
label_pairs (list[tuple[str, str]], optional) – A list of 2-tuples (X, Y) where column X represents an entity in the dataframe, and column Y contains the literal text string that should be assigned as its ‘rdfs:label’. If a cell in column Y is missing or empty, the label triple for that row is omitted.
license (str, optional) – SPDX license ID or URI to be applied to the triples.
write_files (bool, optional) – Whether to write serialized data to disk. Defaults to True.

Returns:

A single aggregated RDFLib Graph object containing the triples for every row in the dataset.

Return type:

Graph

Note

This method is highly recommended for creating FAIR-compliant datasets destined for Triple Stores or Graph Databases.
It maintains the exact same URI structure and namespace bindings as individual row serializations to ensure interoperability.
The output directory is automatically created if it does not exist.
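The library merges RDFLib graphs; the same idea can be shown on plain JSON-LD dictionaries. In this sketch, each row document contributes its ‘@graph’ entries to one master document that carries a single shared ‘@context’. The function name and the example context are hypothetical.

```python
def merge_row_documents(row_docs, context):
    """Combine per-row JSON-LD documents under one global @context."""
    merged = {"@context": context, "@graph": []}
    for doc in row_docs:
        # Per-row contexts are dropped; only the entities are carried over.
        merged["@graph"].extend(doc.get("@graph", []))
    return merged


context = {"mds": "https://cwrusdle.bitbucket.io/mds/"}
rows = [
    {"@context": context, "@graph": [{"@id": "mds:row-1"}]},
    {"@context": context, "@graph": [{"@id": "mds:row-2"}]},
]
master = merge_row_documents(rows, context)
```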

serialize_row(output_folder: str, format='json-ld', row_key_cols: list[str] | None = None, id_cols: list[str] | None = None, label_pairs: list[tuple[str, str]] | None = None, license: str | None = None, write_files: bool | None = True) → list[Graph][source]

Serializes each row of the DataFrame into individual RDF files using the active semantic metadata template.

This method transforms tabular experimental data into Linked Data. It iterates through the DataFrame, generating a unique row identifier (Subject URI) for each entry based on either specified ‘id_cols’ or a hash of the study-stage metadata. It maps cell values to ‘qudt:value’ triples, applies dynamic ‘rdfs:label’ tags, and establishes inter-column relationships defined in the internal ‘data_relations’ manager.

Parameters:

output_folder (str) – Directory where individual RDF files will be saved.
format (str, optional) – The RDF serialization format. Supported: ‘json-ld’, ‘turtle’, ‘xml’, ‘nt’. Defaults to ‘json-ld’.
row_key_cols (list[str], optional) – Column names used to generate the unique row string used for file naming and internal row indexing.
id_cols (list[str], optional) – Column names whose values should be normalized and used as the primary Subject URI identifier (@id). If None, Subject URIs are generated from the unique row key.
label_pairs (list[tuple[str, str]], optional) – A list of 2-tuples (X, Y) where column X represents an entity in the dataframe, and column Y contains the literal text string that should be assigned as its ‘rdfs:label’. If a cell in column Y is missing or empty, the label triple for that row is omitted.
license (str, optional) – An SPDX license identifier (e.g., ‘MIT’) or a full URI. Defaults to ‘CC0-1.0’.
write_files (bool, optional) – If True, writes each row to a file on disk. If False, only returns the list of RDF Graphs. Defaults to True.

Raises:

ValueError – If the provided license is invalid or if the metadata template is missing required ‘skos:altLabel’ definitions.

Returns:

A list of RDFLib Graph objects, each representing one row of experimental data and its associated semantic context.

Return type:

List[rdflib.Graph]

Note

Parent directories for output_folder are created automatically.
Files are named using the pattern: ‘{random_suffix}-{row_key}.{ext}’.
Triples for ‘pd.NA’ or empty string values are omitted to maintain graph sparsity and data integrity.
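A simplified sketch of two behaviours described above: building a row key from ‘row_key_cols’ and omitting triples for missing or empty values. It works on a plain dictionary per row, uses `None` as a stand-in for ‘pd.NA’, and uses the column name as a placeholder predicate instead of the ‘qudt:value’ structure. Both helper names and the key-joining rule are assumptions.

```python
BASE_URI = "https://cwrusdle.bitbucket.io/mds/"


def build_row_key(row, row_key_cols):
    """Join the key column values into one string (assumed joining rule)."""
    return "-".join(str(row[col]) for col in row_key_cols)


def row_to_triples(row, row_key_cols, base_uri=BASE_URI):
    """Return (subject, predicate, value) tuples, skipping empty cells."""
    subject = base_uri + build_row_key(row, row_key_cols)
    triples = []
    for column, value in row.items():
        if value is None or value == "":
            continue  # omitted to keep the graph sparse
        triples.append((subject, column, value))
    return triples
```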

template_generator(skip_prompts: bool = False)[source]

Generates a semantic metadata template by mapping DataFrame columns to ontology terms.

This method performs a fuzzy match between column headers and the loaded ontology. It attempts to automatically resolve the RDF type (@type), study stage, and units. If a field cannot be resolved automatically and ‘skip_prompts’ is False, it interactively prompts the user to provide the missing metadata.

The resulting template follows the JSON-LD structure, integrating namespaces such as QUDT, SKOS, PROV, and MDS.
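The matching algorithm used by the library is not specified here. As one possible illustration of fuzzy matching column headers to ontology labels, the sketch below uses `difflib.get_close_matches` from the standard library; the function name, the cutoff value, and the example labels are assumptions.

```python
import difflib


def match_column(column, label_to_iri, cutoff=0.8):
    """Return the IRI of the closest label, or None when nothing is close."""
    lowered = {label.lower(): iri for label, iri in label_to_iri.items()}
    hits = difflib.get_close_matches(
        column.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None


# Hypothetical ontology labels mapped to CURIEs.
labels = {"Temperature": "mds:Temperature", "Pressure": "mds:Pressure"}
```

A misspelled header such as `temperture` still resolves to `mds:Temperature`, while an unrelated header returns `None` and would end up in the unmatched log.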

Parameters:

skip_prompts (bool, optional) – If True, suppresses interactive user input for missing units or definitions, instead using ‘UNITLESS’ or placeholders. Defaults to False.

Returns:

A tuple containing:

metadata_template (dict): The complete JSON-LD dictionary with ‘@context’ and ‘@graph’ entries for each column.
matched_log (list): A list of strings documenting successful fuzzy-match associations (Column => IRI).
unmatched_log (list): A list of column names that could not be found in the provided ontology.

Return type:

tuple

Note

The method prioritizes metadata explicitly included in the first three rows of the CSV (type, unit, study stage).
Unit extraction handles both raw strings (e.g., ‘unit:KiloGM’) and string-encoded dictionaries (e.g., “{‘@id’: ‘unit:M’}”).
Time-stamping via ‘prov:generatedAtTime’ is applied to each entry for provenance tracking.
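The two unit encodings mentioned in the note can be handled as sketched below. `ast.literal_eval` safely parses the string-encoded dictionary form; the function name and the fallback to ‘UNITLESS’ for unparseable input are assumptions.

```python
import ast


def extract_unit(raw):
    """Return a unit identifier from a raw string or a dict-like string."""
    if isinstance(raw, dict):
        return raw.get("@id", "UNITLESS")
    text = str(raw).strip()
    if text.startswith("{"):
        try:
            parsed = ast.literal_eval(text)  # evaluates literals only
        except (ValueError, SyntaxError):
            return "UNITLESS"
        if isinstance(parsed, dict):
            return parsed.get("@id", "UNITLESS")
        return "UNITLESS"
    return text or "UNITLESS"
```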

update_metadata(col_name: str, field: str, value: str)[source]

Updates a specific property of a column metadata entry in both the JSON-LD template and the internal RDFLib Graph in a synchronized, lock-step transaction.

This method maps a user-friendly shorthand token (passed via field) to its corresponding JSON-LD schema key and formal RDF ontology predicate. It safely modifies the temporary JSON source dictionary and updates the corresponding triple statement within the template_graph.

Parameters:

col_name (str) – The exact string name of the target data column (e.g., ‘systolic_bp’). Matches against the existing ‘skos:altLabel’ identifier.
field ({'definition', 'unit', 'type', 'stage', 'note'}) – The shorthand token representing the metadata property to modify:
- ‘definition’: Maps to skos:definition (SKOS.definition). Updates the text-based human description. Expects a plain string.
- ‘unit’: Maps to qudt:hasUnit (QUDT.hasUnit). Updates the measurement unit. Accepts a raw value (e.g., ‘KG’) or a prefixed URI (e.g., ‘unit:KG’). Will be transformed into a dictionary block in JSON-LD and a URIRef in RDF.
- ‘type’: Maps to @type / rdf:type (RDF.type). Updates the semantic class or concept type of the column. Autocompletes to the ‘mds:’ namespace if a prefix is missing.
- ‘stage’: Maps to mds:hasStudyStage (MDS.hasStudyStage). Updates the phase of the study lifecycle. Expects a string (e.g., ‘COLLECTION’).
- ‘note’: Maps to skos:note (SKOS.note). Appends an administrative or usage note to the concept. Expects a string value.
value (str) – The new data value to assign to the specified field.

Return type:

None

Raises:

Prints a warning message if the field is unrecognized, or if the col_name was successfully updated in the JSON template but could not be found as a subject node inside the RDF Graph.
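The token-to-key mapping described above can be sketched on the JSON-LD side alone. The mapping table follows the field list; the function name, the boolean return value, and the warning text are assumptions, the RDF graph update is omitted, and ‘note’ is simply set here instead of appended.

```python
FIELD_TO_KEY = {
    "definition": "skos:definition",
    "unit": "qudt:hasUnit",
    "type": "@type",
    "stage": "mds:hasStudyStage",
    "note": "skos:note",
}


def update_entry(template, col_name, field, value):
    """Update one field of the entry whose skos:altLabel equals col_name."""
    key = FIELD_TO_KEY.get(field)
    if key is None:
        print(f"Warning: unrecognized field '{field}'")
        return False
    for entry in template.get("@graph", []):
        if entry.get("skos:altLabel") != col_name:
            continue
        if field == "unit":
            # Raw values such as 'KG' become a prefixed dictionary block.
            unit_id = value if ":" in value else f"unit:{value}"
            entry[key] = {"@id": unit_id}
        elif field == "type":
            entry[key] = value if ":" in value else f"mds:{value}"
        else:
            entry[key] = value
        return True
    print(f"Warning: column '{col_name}' not found in the template")
    return False
```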

update_metadata_bulk(metadata_template: dict)[source]: Wrapper to update metadata template in bulk for multiple columns

validate_data_relations()[source]: Wrapper to validate relations using the instance’s own data and ontology.

validate_metadata() → bool[source]

Performs a two-way integrity check between the DataFrame and the Metadata Template.

Category 1 (Undefined Data Columns):: Columns in the DataFrame that are NOT defined in the Metadata. -> Result: These will be skipped during serialization.
Category 2 (Empty Metadata Entries):: Definitions in the Metadata that have no matching column in the DataFrame. -> Result: These will create ‘empty’ RDF nodes with no measurement values.

Returns:: True if data and metadata are perfectly aligned, False otherwise.
Return type:: bool
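The two categories amount to a pair of set differences between the DataFrame columns and the labels defined in the template. A minimal sketch, assuming entries are keyed by ‘skos:altLabel’ and using a hypothetical function name:

```python
def check_alignment(df_columns, template):
    """Return (undefined_columns, empty_entries, aligned)."""
    defined = {
        entry["skos:altLabel"]
        for entry in template.get("@graph", [])
        if "skos:altLabel" in entry
    }
    columns = set(df_columns)
    undefined_columns = sorted(columns - defined)  # Category 1
    empty_entries = sorted(defined - columns)      # Category 2
    aligned = not undefined_columns and not empty_entries
    return undefined_columns, empty_entries, aligned
```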

view_data_relations()[source]: Displays a visual validation report for the provided DataRelationsDict.

view_metadata(format: str = 'table')[source]

Prints the current metadata template to the standard output.

Depending on the chosen format, this method will either output a pretty-printed JSON-LD structure representing the underlying knowledge graph or a tabular summary compiled into a pandas DataFrame.

Parameters:

format ({'table', 'json'}, default 'table') – The output format for displaying the metadata template.

‘table’: Flattens the nested JSON-LD ‘@graph’ arrays (including handling complex structures like ‘qudt:hasUnit’ sub-dictionaries) and extracts key attributes (Label, Type, Unit, Definition, Study Stage) into a summarized, human-readable table. If executed in a Jupyter Notebook, it renders as an HTML table; in a terminal, it outputs as plain text.
‘json’: Outputs the raw, un-flattened JSON-LD template structure with proper indentation for deep debugging.

Returns:

None

Outputs:

Prints the formatted metadata summary or raw JSON directly to stdout. If an unsupported format string is provided, prints an error message.
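The ‘table’ flattening can be pictured as follows. This sketch turns each ‘@graph’ entry into a flat row and unwraps a ‘qudt:hasUnit’ sub-dictionary; it returns a list of dictionaries instead of building a pandas DataFrame, and the function name and JSON-LD key names are assumptions.

```python
def flatten_template(template):
    """Flatten JSON-LD @graph entries into table-like rows."""
    rows = []
    for entry in template.get("@graph", []):
        unit = entry.get("qudt:hasUnit", "")
        if isinstance(unit, dict):
            unit = unit.get("@id", "")  # unwrap {'@id': 'unit:...'}
        rows.append({
            "Label": entry.get("skos:altLabel", ""),
            "Type": entry.get("@type", ""),
            "Unit": unit,
            "Definition": entry.get("skos:definition", ""),
            "Study Stage": entry.get("mds:hasStudyStage", ""),
        })
    return rows
```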

view_relations()[source]

Prints a formatted list of all semantic relations available in the ontology.

This is a helper method for users to discover which properties can be used in a DataRelationsDict to link columns together.

FAIRLinked.RDFTableConversion.MDS_DF.analysis_tracker module

class FAIRLinked.RDFTableConversion.MDS_DF.analysis_tracker.AnalysisGroup(proj_name: str, home_path: str, orcid: str | None = '0000-0000-0000-0000', metadata_template: dict | None = None, base_uri: str | None = 'https://cwrusdle.bitbucket.io/mds/', ontology_graph: Graph | None = None, prefix: str | None = 'mds', file_events: bool | None = False)[source]

Bases: object

Manages a collection of related AnalysisTracker instances, facilitating group-level reporting and master graph generation.

add_column_metadata(col_name: str, rdf_type: str, unit: str = 'UNITLESS', definition: str = 'No definition provided', study_stage: str = 'UNK')[source]: Top-level API to manually define semantic metadata for a new column. Useful for defining columns found in ‘Discovery Warning’ reports.

create_MatDatSciDf()[source]

Converts the group data into a MatDatSciDf object, integrating ontology-mapped metadata.

Returns:: The semantic-aware DataFrame object.
Return type:: MatDatSciDf

create_group_arg_df() → DataFrame[source]

Aggregates all individual analysis DataFrames into a single master DataFrame.

Returns:: Concatenated data from all tracked analyses.
Return type:: pd.DataFrame

create_group_report()[source]

Consolidates individual analysis reports into one master Markdown document.

Returns:: A full Markdown report for the entire group.
Return type:: str

create_metadata_template()[source]

Automatically generates a metadata template by matching group data columns against the loaded ontology.

Returns:: (metadata_template, matched_log, unmatched_log)
Return type:: tuple

delete_column_metadata(col_name: str)[source]

Top-level API to remove a column’s semantic metadata definition. Useful for cleaning up incorrect mappings or unwanted discovery columns.

Parameters:: col_name (str) – The column label to remove from the metadata template.

get_context() → dict[source]

Defines the JSON-LD context for the group metadata.

Returns:: Prefix to namespace URI mappings.
Return type:: dict

mds_graph = <Graph identifier=N227283a0d969435a9d94531bcc72a4b0 (<class 'rdflib.graph.Graph'>)>

overwrite_metadata(metadata_template: dict)[source]: Wrapper to delete and replace metadata information. WARNING: THIS WILL DELETE ALL CURRENT METADATA

run_and_track(func, *args, tracker: AnalysisTracker | None = None, **kwargs)[source]: Executes a function and stores metadata. Can use an existing tracker to group multiple functions under one ID, or create a new one.

save_jsonld()[source]: Serializes all individual analysis JSON-LDs and creates a master graph file that links all components to the group activity.

save_metadata(output_path: str, matched_log_path: str | None = None, unmatched_log_path: str | None = None)[source]: Wrapper to export the JSON-LD template and the status logs (matched/unmatched columns) to files.

save_report()[source]: Saves the consolidated group report to a dedicated group directory.

track(func)[source]

A decorator to automatically wrap a function with provenance tracking.

Parameters:: func – The function to be decorated.
Returns:: The wrapped function that executes via run_and_track.
Return type:: function

update_metadata(col_name: str, field: str, value: str)[source]: Wrapper to update a metadata property (unit, type, definition, etc.) for a specific column.

view_metadata(format: str = 'table')[source]: Wrapper to print the current metadata template as a formatted table or raw JSON-LD. Change to format = ‘json-ld’ to view metadata template in JSON-LD format.

class FAIRLinked.RDFTableConversion.MDS_DF.analysis_tracker.AnalysisTracker(proj_name: str, home_path: str, orcid: str | None = '0000-0000-0000-0000', metadata_template: dict | None = None, base_uri: str | None = 'https://cwrusdle.bitbucket.io/mds/', ontology_graph: Graph | None = None, prefix: str | None = 'mds', file_events: bool | None = False)[source]

Bases: object

A system for auditing scientific analysis, capturing data provenance, and generating semantic JSON-LD metadata.

add_column_metadata(col_name: str, rdf_type: str, unit: str = 'UNITLESS', definition: str = 'No definition provided', study_stage: str = 'UNK')[source]: Top-level API to manually define semantic metadata for a new column. Useful for defining columns found in ‘Discovery Warning’ reports.

create_analysis_jsonld(license: str | None = None)[source]

Assembles all tracked data and file events into a valid JSON-LD string.

Returns:: A formatted JSON-LD string containing the analysis graph.
Return type:: str

create_arg_df()[source]

Flattens the tracked variables into a single-row Pandas DataFrame for tabular comparison across different runs.

Returns:: A DataFrame row containing run metadata and values.
Return type:: pd.DataFrame

create_metadata_template()[source]

Automatically generates a metadata template by matching group data columns against the loaded ontology.

Returns:: (metadata_template, matched_log, unmatched_log)
Return type:: tuple

create_report() → str[source]

Generates a human-readable Markdown summary of the analysis variables and file system activities.

Returns:: A Markdown formatted report.
Return type:: str

delete_column_metadata(col_name: str)[source]

Top-level API to remove a column’s semantic metadata definition. Useful for cleaning up incorrect mappings or unwanted discovery columns.

Parameters:: col_name (str) – The column label to remove from the metadata template.

detect_all_imports()[source]: Unified Environment Scanner for Jupyter and standard scripts. Identifies top-level software dependencies currently available in the session.

get_context() → dict[source]

Defines the JSON-LD context mapping prefixes to namespace URIs.

Returns:: A dictionary of semantic prefix mappings (e.g., prov, mds, qudt).
Return type:: dict

mds_graph = <Graph identifier=N828a5c56b6f944228eb50f41dfeb686d (<class 'rdflib.graph.Graph'>)>

overwrite_metadata(metadata_template: dict)[source]: Wrapper to delete and replace metadata information. WARNING: THIS WILL DELETE ALL CURRENT METADATA

run_and_track(func, *args, **kwargs)[source]

Executes a function while auditing arguments, results, and environment.

This method acts as a high-level provenance wrapper. It captures the “top-most” (direct) input IRIs from the function signature and the direct output IRIs from the return value. While all internal data structures (like nested dictionary keys) are routed and saved to the global metadata log, only the direct IRIs are linked to the Activity node via CCO and PROV-O properties.

The method performs the following audit steps:

Generates a unique 15-digit numeric activity ID.
Binds and routes direct function arguments to capture input IRIs.
Triggers a live environment scan (imports/sys.modules).
Executes the function while monitoring OS-level file handles.
Routes and captures return value IRIs.
Finalizes a Linked Data Activity node with prov:used and prov:generated.

Parameters:

func (callable) – The scientific function or method to be executed.
*args – Positional arguments to be passed to the target function.
**kwargs – Keyword arguments to be passed to the target function.

Returns:

The original return value of the wrapped function. If an exception occurs, it returns None after logging the error as a provenance event.

Return type:

Any
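A much-reduced sketch of the audit flow: generate a 15-digit ID, bind the arguments to the function signature, execute, and return None when the call fails. The class name, the event dictionary layout, and the random ID scheme are assumptions; the environment scan, file-handle monitoring, and IRI routing are not modelled.

```python
import inspect
import random


class ProvenanceSketch:
    """Records one event per tracked call (illustrative only)."""

    def __init__(self):
        self.events = []

    def run_and_track(self, func, *args, **kwargs):
        # 15-digit numeric activity ID (assumed to be random).
        activity_id = str(random.randrange(10**14, 10**15))
        # Bind arguments to parameter names to capture the direct inputs.
        bound = inspect.signature(func).bind(*args, **kwargs)
        bound.apply_defaults()
        event = {
            "id": activity_id,
            "function": func.__name__,
            "inputs": dict(bound.arguments),
        }
        try:
            result = func(*args, **kwargs)
        except Exception as exc:
            # The error is logged as an event and None is returned.
            event["error"] = repr(exc)
            self.events.append(event)
            return None
        event["output"] = result
        self.events.append(event)
        return result
```

Because a failing call returns None, a caller cannot distinguish it from a function that legitimately returns None without inspecting the logged event.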

save_metadata(output_path: str, matched_log_path: str | None = None, unmatched_log_path: str | None = None)[source]: Wrapper to export the JSON-LD template and the status logs (matched/unmatched columns) to files.

save_report()[source]: Saves the human-readable Markdown report to the reports directory.

semantic_remapping(unmatched_log)[source]: Refines simple Python types by matching them against the current metadata template’s semantic types.

serialize_analysis_jsonld(license: str | None = None)[source]: Writes the JSON-LD metadata to a physical file within the analysis directory.

track(func)[source]

A decorator to automatically wrap a function with provenance tracking.

Parameters:: func – The function to be decorated.
Returns:: The wrapped function that executes via run_and_track.
Return type:: function
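A decorator of this kind is typically a thin wrapper that forwards the call to `run_and_track`. The sketch below shows the pattern with a stand-in class; `DecoratorSketch` and its simplified `run_and_track` are hypothetical and only record the function name.

```python
import functools


class DecoratorSketch:
    def __init__(self):
        self.calls = []

    def run_and_track(self, func, *args, **kwargs):
        # Stand-in for the real audit logic.
        self.calls.append(func.__name__)
        return func(*args, **kwargs)

    def track(self, func):
        @functools.wraps(func)  # keep the original name and docstring
        def wrapper(*args, **kwargs):
            return self.run_and_track(func, *args, **kwargs)
        return wrapper


tracker = DecoratorSketch()


@tracker.track
def normalize(values):
    total = sum(values)
    return [v / total for v in values]
```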

track_dataframe(name, df, parent_id=None)[source]

Logs structural metadata of a Pandas DataFrame, including column names and row counts.

Parameters:

name – DataFrame name.
df – The pandas DataFrame object.
parent_id – ID of the containing process or object.

track_dict(name, val, parent_id=None)[source]

Logs a dictionary’s keys and recursively tracks its nested values.

Parameters:

name – Dictionary name.
val – The dictionary object.
parent_id – ID of the containing process or object.
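The recursive pattern can be sketched as follows: a dictionary is logged with its keys, and each nested value is tracked with the dictionary's ID as ‘parent_id’. The function name, the log structure, and the sequential integer IDs are assumptions.

```python
def track_value(name, val, log, parent_id=None):
    """Append an entry for val to log and recurse into dictionaries."""
    entry_id = len(log)  # simple sequential ID for the sketch
    if isinstance(val, dict):
        log.append({"id": entry_id, "name": name, "kind": "dict",
                    "keys": list(val), "parent_id": parent_id})
        for key, nested in val.items():
            track_value(str(key), nested, log, parent_id=entry_id)
    else:
        log.append({"id": entry_id, "name": name,
                    "kind": type(val).__name__, "parent_id": parent_id})
    return entry_id
```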

track_list_array(name, data, parent_id=None)[source]

Tracks the dimensions and size of lists and NumPy arrays.

Parameters:

name – Array or list name.
data – The sequence or array-like object.
parent_id – ID of the containing process or object.

track_other(name, obj, parent_id=None)[source]

Falls back to inspecting custom objects by logging their public attributes as nested data.

Parameters:

name – Object name.
obj – The Python object to inspect.
parent_id – ID of the containing process or object.

track_simple_datatype(name, val, parent_id=None)[source]

Tracks primitive types (str, int, float, bool) and attempts to map them to ontology terms using fuzzy matching.

Parameters:

name – Variable name.
val – The primitive value.
parent_id – ID of the containing process or object.

update_metadata(col_name: str, field: str, value: str)[source]: Wrapper to update a metadata property (unit, type, definition, etc.) for a specific column.

update_metadata_bulk(metadata_template: dict)[source]: Wrapper to update metadata template in bulk for multiple columns

view_metadata(format: str = 'table')[source]: Wrapper to print the current metadata template as a formatted table or raw JSON-LD. Change to format = ‘json-ld’ to view metadata template in JSON-LD format.

FAIRLinked.RDFTableConversion.MDS_DF.data_relations_manager module

class FAIRLinked.RDFTableConversion.MDS_DF.data_relations_manager.DataRelationsDict(prop_col_pair_dict: dict)[source]

Bases: object

Manages semantic relationships between DataFrame columns for RDF serialization.

This class stores and organizes mappings that define how columns relate to one another using RDF Object or Datatype properties. These relations are later used by the serializer to generate triples that connect different entities within the same row.

prop_pair_dict

A dictionary where keys are property names (URIs or CURIEs) and values are lists of tuples, each containing a (subject_column, object_column) pair.

Type:: dict

add_relations(data_relations: dict, ontology_graph: Graph, onto_props: dict)[source]: Merges new column relationships into the dictionary with ontology validation. Normalizes keys to full URIs and prevents duplicate column pairs.

delete_relation(prop_key: str, pair: tuple | None = None)[source]

Removes semantic relationships from the dictionary.

Parameters:

prop_key (str) – The property label, CURIE, or URI identifying the group.
pair (tuple, optional) – A specific (subject_column, object_column) tuple to remove. If None, the entire property group is deleted.
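On the underlying dictionary of property keys to (subject_column, object_column) lists, the two deletion modes look roughly like this. The sketch matches keys exactly; resolving labels or CURIEs to URIs is not shown, and dropping a property whose list becomes empty is an assumption.

```python
def delete_relation_sketch(prop_pair_dict, prop_key, pair=None):
    """Remove one pair, or the whole property group when pair is None."""
    if prop_key not in prop_pair_dict:
        return False
    if pair is None:
        del prop_pair_dict[prop_key]
        return True
    pairs = prop_pair_dict[prop_key]
    if pair not in pairs:
        return False
    pairs.remove(pair)
    if not pairs:
        del prop_pair_dict[prop_key]  # assumption: drop empty groups
    return True
```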

print_data_relations(df: DataFrame | None = None, df_name: str | None = 'DataFrame', ontology_graph: Graph | None = None, onto_props: dict | None = None)[source]

Displays a human-readable summary of column relationships with integrated validation status.

This method serves two purposes:

1. Simple Visualization: If called without arguments, it prints a clean map of the defined Subject-Predicate-Object relationships.
2. Active Validation: If a DataFrame and Ontology components are provided, it performs a “pre-flight check” to verify that every property exists in the ontology and every column exists in the data.

The output uses status symbols:

- ✅ : The property or column is valid or validation was skipped.
- ❌ [Property Unknown] : The property key could not be resolved as a Label, URI, or CURIE.
- ❌ [Col ‘Name’ missing] : The specified column was not found in the DataFrame headers.

Parameters:

df (pd.DataFrame, optional) – The DataFrame to validate column names against. Defaults to None.
df_name (str, optional) – Name of the DataFrame. Defaults to ‘DataFrame’.
ontology_graph (rdflib.Graph, optional) – The RDF graph used to expand and verify CURIEs (e.g., ‘mds:term’). Required if ‘onto_props’ is provided. Defaults to None.
onto_props (dict, optional) – A dictionary of valid ontology properties (Labels mapped to URIs). Defaults to None.

Note

Validation is case-sensitive for both properties and column names.
If ‘onto_props’ is provided but ‘ontology_graph’ is None, CURIE resolution will be skipped, which may result in false-negative errors for prefixed properties.

save_relations(output_path: str)[source]

Exports the semantic mapping to both JSON (machine-readable) and TXT (human-readable) formats.

Parameters:: output_path (str) – The destination file path (extension is ignored).
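
The dual export can be pictured with the standard library alone. The sketch below is an illustration under stated assumptions, not the library's code: save_relations_sketch and the arrow-style text layout are hypothetical, and only the behaviour described above (extension ignored, one JSON file and one TXT file) is taken from the documentation.

```python
import json
import os
import tempfile


def save_relations_sketch(prop_pair_dict, output_path):
    """Write <stem>.json and <stem>.txt, ignoring any extension on output_path."""
    stem, _ext = os.path.splitext(output_path)
    json_path, txt_path = stem + ".json", stem + ".txt"
    # JSON has no tuple type, so pairs are stored as two-element lists.
    serializable = {p: [list(pair) for pair in pairs]
                    for p, pairs in prop_pair_dict.items()}
    with open(json_path, "w", encoding="utf-8") as fh:
        json.dump(serializable, fh, indent=2)
    with open(txt_path, "w", encoding="utf-8") as fh:
        for prop, pairs in prop_pair_dict.items():
            for subj, obj in pairs:
                fh.write(f"{subj} --[{prop}]--> {obj}\n")  # hypothetical layout
    return json_path, txt_path


with tempfile.TemporaryDirectory() as tmp:
    json_path, txt_path = save_relations_sketch(
        {"mds:measuredBy": [("Hardness", "Indenter")]},
        os.path.join(tmp, "relations.csv"),  # the ".csv" suffix is discarded
    )
    with open(json_path, encoding="utf-8") as fh:
        reloaded = json.load(fh)
    with open(txt_path, encoding="utf-8") as fh:
        txt_lines = fh.read().splitlines()
    written = sorted(os.listdir(tmp))
```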

validate_data_relations(df: DataFrame, ontology_graph: Graph, onto_props: dict, df_name: str | None = 'DataFrame') → bool[source]

Validates the DataRelationsDict against the DataFrame and the Ontology.

This method ensures that:

1. Every property key used can be resolved (via rdfs:label, CURIE, or full URI).
2. Every column name paired with a property actually exists in the DataFrame.

Parameters:

df (pd.DataFrame) – The DataFrame containing the experimental data.
df_name (str, optional) – Name of the DataFrame. Defaults to ‘DataFrame’.
ontology_graph (Graph) – The RDFLib Graph object for the ontology (used for CURIE expansion).
onto_props (dict) – The dictionary from MatDatSciDf.get_relations() mapping labels to (URI, Type).

Returns:

True if all relations and columns are valid, False otherwise.

Return type:

bool
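
The two checks can be sketched without pandas or RDFLib. Everything below is a hypothetical illustration: resolve_property and validate_relations_sketch are made-up names, a plain prefix dictionary stands in for the ontology graph's namespace bindings, a list of column names stands in for the DataFrame, and placing properties under the default base URI is an assumption.

```python
def resolve_property(prop_key, onto_props, prefixes=None):
    """Return a full URI for a label, CURIE, or URI, or None if unresolvable."""
    known_uris = {uri for uri, _ptype in onto_props.values()}
    if prop_key in onto_props:        # 1. exact, case-sensitive label match
        return onto_props[prop_key][0]
    if prop_key in known_uris:        # 2. already a full URI
        return prop_key
    if prefixes and ":" in prop_key:  # 3. CURIE such as "mds:measuredBy"
        prefix, local = prop_key.split(":", 1)
        expanded = prefixes.get(prefix, "") + local
        if expanded in known_uris:
            return expanded
    return None


def validate_relations_sketch(prop_pair_dict, columns, onto_props, prefixes=None):
    """Collect problems; an empty list corresponds to a True validation result."""
    problems = []
    for prop_key, pairs in prop_pair_dict.items():
        if resolve_property(prop_key, onto_props, prefixes) is None:
            problems.append(f"[Property Unknown] {prop_key}")
        for subj, obj in pairs:
            for col in (subj, obj):
                if col not in columns:
                    problems.append(f"[Col '{col}' missing]")
    return problems


MDS = "https://cwrusdle.bitbucket.io/mds/"
onto_props = {"measured by": (MDS + "measuredBy", "Object Property")}
relations = {
    "mds:measuredBy": [("Hardness", "Indenter")],
    "Measured By": [("Strain", "Gauge")],  # wrong case, so the label lookup fails
}
columns = ["Hardness", "Indenter", "Strain"]
with_prefixes = validate_relations_sketch(relations, columns, onto_props, {"mds": MDS})
without_prefixes = validate_relations_sketch(relations, columns, onto_props)
```

Without the prefix mapping, the valid CURIE ‘mds:measuredBy’ is also reported as unknown, which mirrors the effect of omitting ontology_graph.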

FAIRLinked.RDFTableConversion.MDS_DF.metadata_manager module

class FAIRLinked.RDFTableConversion.MDS_DF.metadata_manager.Metadata(metadata_template: dict, matched_log: list | None = None, unmatched_log: list | None = None)[source]

Bases: object

Manages semantic metadata and synchronization between JSON-LD templates and RDF graphs.

This class acts as a specialized container for experimental metadata. It maintains a ‘source of truth’ using an RDFLib Graph to ensure semantic consistency, while providing a standard dictionary interface for JSON-LD serialization. It also tracks the success of metadata mapping through matched and unmatched logs.

metadata_temp

The JSON-LD representation of the metadata template, including @context and @graph.

Type:: dict

matched_log

A historical record of columns successfully mapped to ontology terms during the initialization process.

Type:: list

unmatched_log

A record of columns that failed to find an automated match in the reference ontology.

Type:: list

template_graph

The internal RDFLib Graph used for complex updates, validation, and semantic querying.

Type:: rdflib.Graph

MDS

Namespace for Materials Data Science ontology terms.

Type:: rdflib.Namespace

QUDT

Namespace for Quantities, Units, Dimensions, and Types.

Type:: rdflib.Namespace

UNIT

Namespace for QUDT unit individuals.

Type:: rdflib.Namespace

add_column_metadata(col_name: str, rdf_type: str, unit: str = 'UNITLESS', definition: str = 'Definition not available', study_stage: str = 'UNKNOWN')[source]

Registers and appends metadata for a specific data column to both the temporary JSON-LD graph and the internal RDFLib Graph.

This method prevents duplicate entries by checking the existing JSON-LD @graph for the column name. If the column does not exist, it constructs a clean Python dictionary representing the JSON-LD entity, appends it to the temporary graph structure, and synchronizes it by parsing it into the internal template_graph.

Parameters:

col_name (str) – The exact name of the data column (e.g., ‘patient_age’). Used as the skos:altLabel identifier to prevent duplicate entries.
rdf_type (str) – The RDF semantic type or class for the column. If a namespace prefix (like ‘mds:’) is omitted, the ‘mds:’ prefix will be automatically prepended.
unit (str, optional) – The measurement unit of the column data, mapped to a QUDT ontology identifier. Defaults to “UNITLESS”.
definition (str, optional) – A human-readable textual description of what the column represents. Defaults to “Definition not available”.
study_stage (str, optional) – The phase or stage of the study lifecycle this data belongs to (e.g., ‘COLLECTION’, ‘ANALYSIS’). Defaults to “UNKNOWN”.

Return type:

None

Raises:

ValueError – If required parameters are malformed (handled by downstream JSON/RDF parsers).
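
The duplicate check and prefix completion described above might look as follows on the JSON-LD side. This is a sketch under assumptions, not the actual method: add_column_entry is a hypothetical name, the {‘@id’: …} shape of the unit block and the ‘unit:’ prefix completion are guesses, and the synchronization into the RDFLib template_graph is omitted.

```python
def add_column_entry(template, col_name, rdf_type, unit="UNITLESS",
                     definition="Definition not available", study_stage="UNKNOWN"):
    """Append a JSON-LD node for col_name unless one with that altLabel exists."""
    graph = template.setdefault("@graph", [])
    if any(node.get("skos:altLabel") == col_name for node in graph):
        return False  # duplicate column: leave the template untouched
    if ":" not in rdf_type:
        rdf_type = "mds:" + rdf_type  # default to the mds: namespace
    unit_id = unit if ":" in unit else "unit:" + unit  # assumed prefix handling
    graph.append({
        "@type": rdf_type,
        "skos:altLabel": col_name,
        "skos:definition": definition,
        "qudt:hasUnit": {"@id": unit_id},  # assumed shape of the unit block
        "mds:hasStudyStage": study_stage,
    })
    return True


template = {"@context": {}, "@graph": []}
first = add_column_entry(template, "yield_strength", "YieldStrength", unit="MegaPA")
second = add_column_entry(template, "yield_strength", "mds:YieldStrength")
```

The second call returns False and adds nothing, because a node with the same skos:altLabel is already present.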

delete_column_metadata(col_name: str)[source]: Removes all metadata associated with a specific column from both the internal JSON template and the RDF graph.

print_template(format: str = 'table')[source]

Prints the current metadata template to the standard output.

Depending on the chosen format, this method will either output a pretty-printed JSON-LD structure representing the underlying knowledge graph or a tabular summary compiled into a pandas DataFrame.

Parameters:

format ({'table', 'json'}, default 'table') –

The output format for displaying the metadata template.

‘table’: Flattens the nested JSON-LD ‘@graph’ arrays (including handling complex structures like ‘qudt:hasUnit’ sub-dictionaries) and extracts key attributes (Label, Type, Unit, Definition, Study Stage) into a summarized, human-readable table. If executed in a Jupyter Notebook, it renders as an HTML table; in a terminal, it outputs as plain text.
‘json’: Outputs the raw, un-flattened JSON-LD template structure with proper indentation for deep debugging.

Returns:

None. The formatted metadata summary or raw JSON is printed directly to stdout. If an unsupported format string is provided, an error message is printed instead.

save_metadata(output_path: str, matched_log_path: str | None = None, unmatched_log_path: str | None = None)[source]

Exports the synchronized metadata template and import logs to the file system.

This method performs three primary tasks:

1. Serializes the current JSON-LD metadata template (the source of truth) to a file.
2. Optionally exports a log of all columns successfully matched during initialization.
3. Optionally exports a deduplicated log of columns that were not found in the RDF source.

Parameters:

output_path (str) – File path where the JSON-LD metadata template will be saved.
matched_log_path (str, optional) – File path to save the list of successfully matched columns. If None, no log is created.
unmatched_log_path (str, optional) – File path to save the unique list of columns missing RDF metadata. If None, no log is created.

Note

This method automatically creates any missing parent directories for the provided file paths to prevent ‘FileNotFoundError’.

Returns:: None
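
The export pattern (create parent folders, write the template, optionally write deduplicated logs) can be sketched with the standard library. This is a hypothetical illustration: save_metadata_sketch takes the logs as arguments, whereas the documented method reads them from the instance, and the one-entry-per-line log layout is an assumption.

```python
import json
import os
import tempfile


def save_metadata_sketch(template, output_path, matched_log=None, matched_log_path=None,
                         unmatched_log=None, unmatched_log_path=None):
    """Write the template, plus optional logs, creating parent folders as needed."""
    def _write(path, text):
        parent = os.path.dirname(path)
        if parent:
            os.makedirs(parent, exist_ok=True)  # avoids FileNotFoundError
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(text)

    _write(output_path, json.dumps(template, indent=2))
    if matched_log_path is not None:
        _write(matched_log_path, "\n".join(matched_log or []))
    if unmatched_log_path is not None:
        # dict.fromkeys removes duplicates while keeping first-seen order
        unique = list(dict.fromkeys(unmatched_log or []))
        _write(unmatched_log_path, "\n".join(unique))


with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "exports", "meta", "template.jsonld")
    log = os.path.join(tmp, "logs", "unmatched.txt")
    save_metadata_sketch({"@context": {}, "@graph": []}, out,
                         unmatched_log=["colA", "colB", "colA"], unmatched_log_path=log)
    with open(log, encoding="utf-8") as fh:
        unmatched_lines = fh.read().splitlines()
    template_written = os.path.exists(out)
```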

update_bulk(metadata_template: dict)[source]: Merges an external metadata template into the current instance. Iterates through the ‘@graph’ and decides whether to update existing columns or add new ones.

update_template(col_name: str, field: str, value: str)[source]

Updates a specific property of a column metadata entry in both the JSON-LD template and the internal RDFLib Graph in a synchronized, lock-step transaction.

This method maps a user-friendly shorthand token (passed via field) to its corresponding JSON-LD schema key and formal RDF ontology predicate. It safely modifies the temporary JSON source dictionary and updates the corresponding triple statement within the template_graph.

Parameters:

col_name (str) – The exact string name of the target data column (e.g., ‘systolic_bp’). Matches against the existing ‘skos:altLabel’ identifier.
field ({'definition', 'unit', 'type', 'stage', 'note'}) –
The shorthand token representing the metadata property to modify:
- ’definition’ : Maps to skos:definition (SKOS.definition). Updates the text-based human description. Expects a plain string.
- ’unit’ : Maps to qudt:hasUnit (QUDT.hasUnit). Updates the measurement unit. Accepts a raw value (e.g., ‘KG’) or a prefixed URI (e.g., ‘unit:KG’). Will be transformed into a dictionary block in JSON-LD and a URIRef in RDF.
- ’type’ : Maps to @type / rdf:type (RDF.type). Updates the semantic class or concept type of the column. Autocompletes to the ‘mds:’ namespace if a prefix is missing.
- ’stage’ : Maps to mds:hasStudyStage (MDS.hasStudyStage). Updates the phase of the study lifecycle. Expects a string (e.g., ‘COLLECTION’).
- ’note’ : Maps to skos:note (SKOS.note). Appends an administrative or usage note to the concept. Expects a string value.
value (str) – The new data value to assign to the specified field.

Return type:

None

Note

A warning message is printed if the field is unrecognized, or if col_name was successfully updated in the JSON template but could not be found as a subject node inside the RDF Graph.
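
The shorthand-to-key mapping lends itself to a lookup table. The sketch below covers only the JSON-LD half and is not the library's code: FIELD_TO_KEY and update_entry are hypothetical names, the {‘@id’: …} unit shape is an assumption, and for brevity ‘note’ is overwritten rather than appended.

```python
FIELD_TO_KEY = {
    "definition": "skos:definition",
    "unit": "qudt:hasUnit",
    "type": "@type",
    "stage": "mds:hasStudyStage",
    "note": "skos:note",
}


def update_entry(template, col_name, field, value):
    """Update one field of the node whose skos:altLabel equals col_name."""
    key = FIELD_TO_KEY.get(field)
    if key is None:
        print(f"Warning: unrecognized field '{field}'")
        return False
    for node in template.get("@graph", []):
        if node.get("skos:altLabel") != col_name:
            continue
        if field == "unit":
            unit_id = value if ":" in value else "unit:" + value
            node[key] = {"@id": unit_id}  # dictionary block in JSON-LD
        elif field == "type":
            node[key] = value if ":" in value else "mds:" + value
        else:
            node[key] = value
        return True
    print(f"Warning: column '{col_name}' not found")
    return False


template = {"@graph": [{"skos:altLabel": "mass", "@type": "mds:Thing"}]}
ok_unit = update_entry(template, "mass", "unit", "KG")
ok_type = update_entry(template, "mass", "type", "Mass")
bad_field = update_entry(template, "mass", "colour", "red")
```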

Module contents

FAIRLinked.RDFTableConversion.MDS_DF module

FAIRLinked.MDS_DF Module

This module provides tools for tracking scientific analysis, managing metadata using RDF ontologies, and generating JSON-LD provenance graphs to ensure research data is Findable, Accessible, Interoperable, and Reusable (FAIR).

Main Components:

MatDatSciDf: The core semantic-aware DataFrame that integrates tabular data with ontology-based metadata.
AnalysisTracker: A context manager and decorator system for capturing function execution provenance and file system events.
AnalysisGroup: A management class for aggregating multiple analysis runs into a consolidated master graph.
Metadata: Handles the lifecycle of metadata templates and ontology mappings.
DataRelationsDict: Manages semantic relationships and links between disparate data entities.

Example

>>> from FAIRLinked.RDFTableConversion.MDS_DF import AnalysisTracker
>>> tracker = AnalysisTracker(proj_name="MyExperiment", home_path="./results")
>>> @tracker.track
... def my_analysis_step(data):
...     return data * 2

class FAIRLinked.RDFTableConversion.MDS_DF.AnalysisGroup(proj_name: str, home_path: str, orcid: str | None = '0000-0000-0000-0000', metadata_template: dict | None = None, base_uri: str | None = 'https://cwrusdle.bitbucket.io/mds/', ontology_graph: Graph | None = None, prefix: str | None = 'mds', file_events: bool | None = False)[source]

Bases: object

Manages a collection of related AnalysisTracker instances, facilitating group-level reporting and master graph generation.

add_column_metadata(col_name: str, rdf_type: str, unit: str = 'UNITLESS', definition: str = 'No definition provided', study_stage: str = 'UNK')[source]: Top-level API to manually define semantic metadata for a new column. Useful for defining columns found in ‘Discovery Warning’ reports.

create_MatDatSciDf()[source]

Converts the group data into a MatDatSciDf object, integrating ontology-mapped metadata.

Returns:: The semantic-aware DataFrame object.
Return type:: MatDatSciDf

create_group_arg_df() → DataFrame[source]

Aggregates all individual analysis DataFrames into a single master DataFrame.

Returns:: Concatenated data from all tracked analyses.
Return type:: pd.DataFrame

create_group_report()[source]

Consolidates individual analysis reports into one master Markdown document.

Returns:: A full Markdown report for the entire group.
Return type:: str

create_metadata_template()[source]

Automatically generates a metadata template by matching group data columns against the loaded ontology.

Returns:: (metadata_template, matched_log, unmatched_log)
Return type:: tuple

delete_column_metadata(col_name: str)[source]

Top-level API to remove a column’s semantic metadata definition. Useful for cleaning up incorrect mappings or unwanted discovery columns.

Parameters:: col_name (str) – The column label to remove from the metadata template.

get_context() → dict[source]

Defines the JSON-LD context for the group metadata.

Returns:: Prefix to namespace URI mappings.
Return type:: dict

mds_graph = <Graph identifier=N227283a0d969435a9d94531bcc72a4b0 (<class 'rdflib.graph.Graph'>)>

overwrite_metadata(metadata_template: dict)[source]: Wrapper to delete and replace metadata information. WARNING: THIS WILL DELETE ALL CURRENT METADATA

run_and_track(func, *args, tracker: AnalysisTracker | None = None, **kwargs)[source]: Executes a function and stores metadata. Can use an existing tracker to group multiple functions under one ID, or create a new one.

save_jsonld()[source]: Serializes all individual analysis JSON-LDs and creates a master graph file that links all components to the group activity.

save_metadata(output_path: str, matched_log_path: str | None = None, unmatched_log_path: str | None = None)[source]: Wrapper to export the JSON-LD template and the status logs (matched/unmatched columns) to files.

save_report()[source]: Saves the consolidated group report to a dedicated group directory.

track(func)[source]

A decorator to automatically wrap a function with provenance tracking.

Parameters:: func – The function to be decorated.
Returns:: The wrapped function that executes via run_and_track.
Return type:: function

update_metadata(col_name: str, field: str, value: str)[source]: Wrapper to update a metadata property (unit, type, definition, etc.) for a specific column.

view_metadata(format: str = 'table')[source]: Wrapper to print the current metadata template as a formatted table (the default) or, with format = ‘json-ld’, as raw JSON-LD.

class FAIRLinked.RDFTableConversion.MDS_DF.AnalysisTracker(proj_name: str, home_path: str, orcid: str | None = '0000-0000-0000-0000', metadata_template: dict | None = None, base_uri: str | None = 'https://cwrusdle.bitbucket.io/mds/', ontology_graph: Graph | None = None, prefix: str | None = 'mds', file_events: bool | None = False)[source]

Bases: object

A system for auditing scientific analysis, capturing data provenance, and generating semantic JSON-LD metadata.

add_column_metadata(col_name: str, rdf_type: str, unit: str = 'UNITLESS', definition: str = 'No definition provided', study_stage: str = 'UNK')[source]: Top-level API to manually define semantic metadata for a new column. Useful for defining columns found in ‘Discovery Warning’ reports.

create_analysis_jsonld(license: str | None = None)[source]

Assembles all tracked data and file events into a valid JSON-LD string.

Returns:: A formatted JSON-LD string containing the analysis graph.
Return type:: str

create_arg_df()[source]

Flattens the tracked variables into a single-row Pandas DataFrame for tabular comparison across different runs.

Returns:: A DataFrame row containing run metadata and values.
Return type:: pd.DataFrame

create_metadata_template()[source]

Automatically generates a metadata template by matching group data columns against the loaded ontology.

Returns:: (metadata_template, matched_log, unmatched_log)
Return type:: tuple

create_report() → str[source]

Generates a human-readable Markdown summary of the analysis variables and file system activities.

Returns:: A Markdown formatted report.
Return type:: str

delete_column_metadata(col_name: str)[source]

Top-level API to remove a column’s semantic metadata definition. Useful for cleaning up incorrect mappings or unwanted discovery columns.

Parameters:: col_name (str) – The column label to remove from the metadata template.

detect_all_imports()[source]: Unified Environment Scanner for Jupyter and standard scripts. Identifies top-level software dependencies currently available in the session.

get_context() → dict[source]

Defines the JSON-LD context mapping prefixes to namespace URIs.

Returns:: A dictionary of semantic prefix mappings (e.g., prov, mds, qudt).
Return type:: dict

mds_graph = <Graph identifier=N828a5c56b6f944228eb50f41dfeb686d (<class 'rdflib.graph.Graph'>)>

overwrite_metadata(metadata_template: dict)[source]: Wrapper to delete and replace metadata information. WARNING: THIS WILL DELETE ALL CURRENT METADATA

run_and_track(func, *args, **kwargs)[source]

Executes a function while auditing arguments, results, and environment.

This method acts as a high-level provenance wrapper. It captures the “top-most” (direct) input IRIs from the function signature and the direct output IRIs from the return value. While all internal data structures (like nested dictionary keys) are routed and saved to the global metadata log, only the direct IRIs are linked to the Activity node via CCO and PROV-O properties.

The method performs the following audit steps:

Generates a unique 15-digit numeric activity ID.
Binds and routes direct function arguments to capture input IRIs.
Triggers a live environment scan (imports/sys.modules).
Executes the function while monitoring OS-level file handles.
Routes and captures return value IRIs.
Finalizes a Linked Data Activity node with prov:used and prov:generated.

Parameters:

func (callable) – The scientific function or method to be executed.
*args – Positional arguments to be passed to the target function.
**kwargs – Keyword arguments to be passed to the target function.

Returns:

The original return value of the wrapped function. If an exception occurs, it returns None after logging the error as a provenance event.

Return type:

Any
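
A stripped-down version of the audit steps can be written with the standard library. This is a sketch under assumptions: run_and_track_sketch and the record layout are hypothetical, the random 15-digit ID is only one possible scheme, the environment scan and file-handle monitoring are left out, and the sketch returns the record alongside the result, whereas the documented method returns only the result.

```python
import inspect
import random


def run_and_track_sketch(func, *args, **kwargs):
    """Run func, returning (result, activity_record); result is None on failure."""
    activity = {
        # 15-digit numeric ID; the real generation scheme is an assumption
        "id": str(random.randint(10**14, 10**15 - 1)),
        "function": func.__name__,
        "used": {},
        "generated": None,
        "error": None,
    }
    # Bind positional/keyword arguments to parameter names from the signature.
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()
    activity["used"] = dict(bound.arguments)
    try:
        result = func(*args, **kwargs)
    except Exception as exc:  # record the failure instead of propagating it
        activity["error"] = repr(exc)
        return None, activity
    activity["generated"] = result
    return result, activity


def normalize(values, scale=1.0):
    return [v / scale for v in values]


result, record = run_and_track_sketch(normalize, [2.0, 4.0], scale=2.0)
failed, failed_record = run_and_track_sketch(normalize, [1.0], scale=0.0)
```

The second call divides by zero, so the sketch returns None and stores the exception text in the record.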

save_metadata(output_path: str, matched_log_path: str | None = None, unmatched_log_path: str | None = None)[source]: Wrapper to export the JSON-LD template and the status logs (matched/unmatched columns) to files.

save_report()[source]: Saves the human-readable Markdown report to the reports directory.

semantic_remapping(unmatched_log)[source]: Refines simple Python types by matching them against the current metadata template’s semantic types.

serialize_analysis_jsonld(license: str | None = None)[source]: Writes the JSON-LD metadata to a physical file within the analysis directory.

track(func)[source]

A decorator to automatically wrap a function with provenance tracking.

Parameters:: func – The function to be decorated.
Returns:: The wrapped function that executes via run_and_track.
Return type:: function
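
The decorator-to-run_and_track delegation follows a common Python pattern. The class below is a hypothetical stand-in (MiniTracker is not part of FAIRLinked) that records only function names, to show how a wrapper can route every call through a tracking method.

```python
import functools


class MiniTracker:
    """Hypothetical stand-in that only records which functions were run."""

    def __init__(self):
        self.calls = []

    def run_and_track(self, func, *args, **kwargs):
        self.calls.append(func.__name__)
        return func(*args, **kwargs)

    def track(self, func):
        @functools.wraps(func)  # keep the original name and docstring
        def wrapper(*args, **kwargs):
            return self.run_and_track(func, *args, **kwargs)
        return wrapper


tracker = MiniTracker()


@tracker.track
def my_analysis_step(data):
    return data * 2


doubled = my_analysis_step(21)
```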

track_dataframe(name, df, parent_id=None)[source]

Logs structural metadata of a Pandas DataFrame, including column names and row counts.

Parameters:

name – DataFrame name.
df – The pandas DataFrame object.
parent_id – ID of the containing process or object.

track_dict(name, val, parent_id=None)[source]

Logs a dictionary’s keys and recursively tracks its nested values.

Parameters:

name – Dictionary name.
val – The dictionary object.
parent_id – ID of the containing process or object.
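
Recursive tracking with a parent_id link can be sketched as follows. The function track_value, the record fields, and the counter-based IDs are hypothetical; the sketch only shows how nested values might point back to the dictionary that contains them.

```python
import itertools

_ids = itertools.count(1)  # hypothetical ID source for the sketch


def track_value(log, name, val, parent_id=None):
    """Append a record for val to log and recurse into nested dictionaries."""
    node_id = next(_ids)
    if isinstance(val, dict):
        log.append({"id": node_id, "name": name, "kind": "dict",
                    "keys": list(val), "parent_id": parent_id})
        for key, nested in val.items():
            track_value(log, str(key), nested, parent_id=node_id)
    else:
        log.append({"id": node_id, "name": name, "kind": type(val).__name__,
                    "value": val, "parent_id": parent_id})
    return node_id


log = []
track_value(log, "params", {"alloy": "Al-6061", "anneal": {"temp_C": 415, "hours": 2}})
by_name = {entry["name"]: entry for entry in log}
```

Each nested entry carries the ID of its containing dictionary, so the original nesting can be rebuilt from the flat log.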

track_list_array(name, data, parent_id=None)[source]

Tracks the dimensions and size of lists and NumPy arrays.

Parameters:

name – Array or list name.
data – The sequence or array-like object.
parent_id – ID of the containing process or object.

track_other(name, obj, parent_id=None)[source]

Falls back to inspecting custom objects by logging their public attributes as nested data.

Parameters:

name – Object name.
obj – The Python object to inspect.
parent_id – ID of the containing process or object.

track_simple_datatype(name, val, parent_id=None)[source]

Tracks primitive types (str, int, float, bool) and attempts to map them to ontology terms using fuzzy matching.

Parameters:

name – Variable name.
val – The primitive value.
parent_id – ID of the containing process or object.
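
The documentation does not say which fuzzy matcher is used. As an illustration only, the standard library's difflib can map a variable name to the closest ontology label; fuzzy_match_term, the 0.8 cutoff, and the sample labels are all assumptions.

```python
import difflib


def fuzzy_match_term(name, ontology_labels, cutoff=0.8):
    """Return the closest ontology label for name, or None below the cutoff."""
    # Compare on a normalized form: lowercase, no spaces or underscores.
    normalized = {label.lower().replace(" ", ""): label for label in ontology_labels}
    key = name.lower().replace("_", "").replace(" ", "")
    hits = difflib.get_close_matches(key, list(normalized), n=1, cutoff=cutoff)
    return normalized[hits[0]] if hits else None


labels = ["Yield Strength", "Annealing Temperature", "Grain Size"]
exact = fuzzy_match_term("yield_strength", labels)
near = fuzzy_match_term("grain_sizes", labels)
none = fuzzy_match_term("operator", labels)
```

Names that fall below the cutoff return None, which corresponds to an entry in the unmatched log.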

update_metadata(col_name: str, field: str, value: str)[source]: Wrapper to update a metadata property (unit, type, definition, etc.) for a specific column.

update_metadata_bulk(metadata_template: dict)[source]: Wrapper to update metadata template in bulk for multiple columns

view_metadata(format: str = 'table')[source]: Wrapper to print the current metadata template as a formatted table (the default) or, with format = ‘json-ld’, as raw JSON-LD.

class FAIRLinked.RDFTableConversion.MDS_DF.DataRelationsDict(prop_col_pair_dict: dict)[source]

Bases: object

Manages semantic relationships between DataFrame columns for RDF serialization.

This class stores and organizes mappings that define how columns relate to one another using RDF Object or Datatype properties. These relations are later used by the serializer to generate triples that connect different entities within the same row.

prop_pair_dict

A dictionary where keys are property names (URIs or CURIEs) and values are lists of tuples, each containing a (subject_column, object_column) pair.

Type:: dict

add_relations(data_relations: dict, ontology_graph: Graph, onto_props: dict)[source]: Merges new column relationships into the dictionary with ontology validation. Normalizes keys to full URIs and prevents duplicate column pairs.

delete_relation(prop_key: str, pair: tuple | None = None)[source]

Removes semantic relationships from the dictionary.

Parameters:

prop_key (str) – The property label, CURIE, or URI identifying the group.
pair (tuple, optional) – A specific (subject_column, object_column) tuple to remove. If None, the entire property group is deleted.

print_data_relations(df: DataFrame | None = None, df_name: str | None = 'DataFrame', ontology_graph: Graph | None = None, onto_props: dict | None = None)[source]

Displays a human-readable summary of column relationships with integrated validation status.

This method serves two purposes:

1. Simple Visualization: If called without arguments, it prints a clean map of the defined Subject-Predicate-Object relationships.
2. Active Validation: If a DataFrame and Ontology components are provided, it performs a “pre-flight check” to verify that every property exists in the ontology and every column exists in the data.

The output uses status symbols:

✅ : The property or column is valid or validation was skipped.
❌ [Property Unknown] : The property key could not be resolved as a Label, URI, or CURIE.
❌ [Col ‘Name’ missing] : The specified column was not found in the DataFrame headers.

Parameters:

df (pd.DataFrame, optional) – The DataFrame to validate column names against. Defaults to None.
df_name (str, optional) – Name of the DataFrame. Defaults to ‘DataFrame’.
ontology_graph (rdflib.Graph, optional) – The RDF graph used to expand and verify CURIEs (e.g., ‘mds:term’). Needed for CURIE resolution when ‘onto_props’ is provided. Defaults to None.
onto_props (dict, optional) – A dictionary of valid ontology properties (Labels mapped to URIs). Defaults to None.

Note

Validation is case-sensitive for both properties and column names.
If ‘onto_props’ is provided but ‘ontology_graph’ is None, CURIE resolution will be skipped, so valid prefixed properties may be reported as unknown.

save_relations(output_path: str)[source]

Exports the semantic mapping to both JSON (machine-readable) and TXT (human-readable) formats.

Parameters:: output_path (str) – The destination file path (extension is ignored).

validate_data_relations(df: DataFrame, ontology_graph: Graph, onto_props: dict, df_name: str | None = 'DataFrame') → bool[source]

Validates the DataRelationsDict against the DataFrame and the Ontology.

This method ensures that:

1. Every property key used can be resolved (via rdfs:label, CURIE, or full URI).
2. Every column name paired with a property actually exists in the DataFrame.

Parameters:

df (pd.DataFrame) – The DataFrame containing the experimental data.
df_name (str, optional) – Name of the DataFrame. Defaults to ‘DataFrame’.
ontology_graph (Graph) – The RDFLib Graph object for the ontology (used for CURIE expansion).
onto_props (dict) – The dictionary from MatDatSciDf.get_relations() mapping labels to (URI, Type).

Returns:

True if all relations and columns are valid, False otherwise.

Return type:

bool

class FAIRLinked.RDFTableConversion.MDS_DF.MatDatSciDf(df: DataFrame, metadata_template: dict | None = None, matched_log: list | None = None, unmatched_log: list | None = None, data_relations_dict: dict | None = None, orcid: str = '0000-0000-0000-0000', df_name: str | None = None, metadata_rows: bool | None = False, ontology_graph: Graph | None = None, base_uri='https://cwrusdle.bitbucket.io/mds/', local_unit_file: bool | None = True)[source]

Bases: object

A semantic wrapper for Pandas DataFrames in the Materials Data Science domain.

This class serves as a “Semantic Firewall” for experimental materials data. It bridges tabular data and Linked Data by maintaining synchronized internal objects for measurement data, semantic headers, metadata templates, and column-to-column relationships. It enforces FAIR principles by validating researcher identifiers (ORCID) and ensuring ontological consistency before serialization.

df

The cleaned measurement data, stripped of metadata headers.

Type:: pd.DataFrame

header_df

A 3-row buffer (Type, Unit, Study Stage) used for mapping or pre-allocating metadata for the dataset.

Type:: pd.DataFrame

metadata_obj

The internal manager handling the RDFLib Graph and JSON-LD template synchronization.

Type:: MatDatSciDf.Metadata

data_relations

The internal manager for defining semantic links (Object/Datatype properties) between columns.

Type:: MatDatSciDf.DataRelationsDict

orcid

Validated ORCID iD of the data curator.

Type:: str

orcid_verified

Boolean status of curator identity verification.

Type:: bool

df_name

Descriptive name for the dataset used in file exports.

Type:: str

ontology

The reference ontology graph used for fuzzy matching and property resolution.

Type:: rdflib.Graph

base_uri

The namespace prefix used for generating semantic subjects.

Type:: str

add_column_metadata(col_name: str, rdf_type: str, unit: str = 'UNITLESS', definition: str = 'Definition not available', study_stage: str = 'UNK')[source]

Registers and appends metadata for a specific data column to both the temporary JSON-LD graph and the internal RDFLib Graph.

This method prevents duplicate entries by checking the existing JSON-LD @graph for the column name. If the column does not exist, it constructs a clean Python dictionary representing the JSON-LD entity, appends it to the temporary graph structure, and synchronizes it by parsing it into the internal template_graph.

Parameters:

col_name (str) – The exact name of the data column (e.g., ‘patient_age’). Used as the skos:altLabel identifier to prevent duplicate entries.
rdf_type (str) – The RDF semantic type or class for the column. If a namespace prefix (like ‘mds:’) is omitted, the ‘mds:’ prefix will be automatically prepended.
unit (str, optional) – The measurement unit of the column data, mapped to a QUDT ontology identifier. Defaults to “UNITLESS”.
definition (str, optional) – A human-readable textual description of what the column represents. Defaults to “Definition not available”.
study_stage (str, optional) – The phase or stage of the study lifecycle this data belongs to (e.g., ‘COLLECTION’, ‘ANALYSIS’). Defaults to “UNK”.

Return type:

None

Raises:

ValueError – If required parameters are malformed (handled by downstream JSON/RDF parsers).

add_relations(data_relations: dict)[source]

delete_column_metadata(col_name: str)[source]

Top-level API to remove a column’s semantic metadata definition. Useful for cleaning up incorrect mappings or unwanted discovery columns.

Parameters:: col_name (str) – The column label to remove from the metadata template.

delete_relation(prop_key: str, pair: tuple | None = None)[source]

Top-level API to remove semantic links between columns.

Parameters:

prop_key (str) – The property identifier (e.g., ‘mds:measuredBy’).
pair (tuple, optional) – Specific (subj, obj) columns to un-link. If None, removes all links for that property.

df_name = 'Unnamed_Dataframe'

classmethod from_rdf_dir(input_dir: str, orcid: str, metadata_template: dict | None = None, data_relations_dict: dict | None = None, df_name: str = 'Imported_RDF_Data', ontology_graph: Graph | None = None, base_uri: str = 'https://cwrusdle.bitbucket.io/mds/')[source]

Factory method to reconstruct a MatDatSciDf instance and validate semantic integrity from a directory of RDF files.

This method crawls a directory for supported RDF formats, parses the triples, and reconstructs the tabular data (DataFrame) and metadata (JSON-LD Template). It serves as a data audit pipeline by cross-referencing file-level triples against a master template for unit consistency and a user-provided schema for structural integrity.

Parameters:

input_dir (str) – Path to the directory containing RDF files (JSON-LD, Turtle, etc.).
orcid (str) – The ORCID identifier of the user performing the reconstruction.
data_relations_dict (dict, optional) – The expected Subject-Predicate-Object schema to validate against each file. If provided, mismatches are logged.
df_name (str, optional) – Descriptive name for the resulting DataFrame and validation report. Defaults to “Imported_RDF_Data”.
ontology_graph (rdflib.Graph, optional) – A reference ontology used to resolve labels and CURIEs during validation.
base_uri (str, optional) – The base URI used for semantic subject identification. Defaults to “https://cwrusdle.bitbucket.io/mds/”.

Returns:

A fully initialized and validated instance containing the reconstructed dataset and associated semantic logs.

Return type:

MatDatSciDf

Reports & Logs:

Generates ‘{df_name}_import_validation.txt’ in the input directory.
Logs Unit Conflicts: Flagged if a column unit differs from the first encountered definition.
Logs Schema Mismatches: Flagged if expected semantic links are missing within individual RDF graphs.

Note

Supported extensions: .jsonld, .ttl, .nt, .rdf, .xml.
Missing data columns in specific files are filled with ‘pd.NA’ to maintain tabular integrity.
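
The unit-conflict rule and the filling of missing columns can be sketched on plain dictionaries. Nothing here parses RDF: merge_file_records and its input layout are hypothetical, the report wording is made up, and None stands in for pd.NA so the example needs no third-party packages.

```python
def merge_file_records(file_records):
    """Combine per-file column data into rows and flag unit conflicts.

    file_records maps a file name to {column: (value, unit)}.
    """
    first_units, conflicts = {}, []
    for fname, record in file_records.items():
        for col, (_value, unit) in record.items():
            expected = first_units.setdefault(col, unit)  # first definition wins
            if unit != expected:
                conflicts.append(f"{fname}: '{col}' uses {unit}, expected {expected}")
    all_columns = sorted(first_units)
    rows = [
        # None stands in for pd.NA where a file lacks a column
        {col: (record[col][0] if col in record else None) for col in all_columns}
        for record in file_records.values()
    ]
    return rows, conflicts


files = {
    "a.jsonld": {"Hardness": (210, "unit:MegaPA"), "Load": (5, "unit:N")},
    "b.ttl": {"Hardness": (1.9, "unit:GigaPA")},
}
rows, conflicts = merge_file_records(files)
```

The second file reports Hardness in a different unit than the first and has no Load column, so it produces one conflict entry and one filled-in gap.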

get_relation_pairs_onto()[source]

Analyzes the ontology and metadata template to discover relationships between columns.

Returns:: { URI: [(subj_col, obj_col), …] }
Return type:: dict

get_relations()[source]

Extracts all Object and Datatype properties from the associated ontology.

This method scans the ontology graph for OWL ObjectProperties and DatatypeProperties, mapping their human-readable rdfs:labels to their full URIs and property types.

Returns:

A dictionary (prop_metadata_dict) where:

Key: Property label (str)
Value: Tuple of (Property URI, Property Type)

Return type:

dict

mds_graph = <Graph identifier=Nfe1a650e28be408b9b318849071c2952 (<class 'rdflib.graph.Graph'>)>

overwrite_metadata(metadata_template: dict)[source]: Wrapper to delete and replace metadata information. WARNING: THIS WILL DELETE ALL CURRENT METADATA

save_mds_df(output_dir: str, metadata_in_output_df: bool = False, formats: list = ['csv', 'parquet', 'arrow'])[source]

Saves the internal DataFrame and associated metadata to the local file system.

This method supports multi-format export (CSV, Parquet, Arrow). It can also generate a ‘semantic’ version of the CSV where the first three rows of the file contain the RDF Type, QUDT Unit, and Study Stage for each column, facilitating human readability and FAIR data principles.

Parameters:

output_dir (str) – The directory path where files will be stored.
metadata_obj (Metadata, optional) – The Metadata management object. If provided, it will also trigger the saving of the JSON-LD template and match logs.
metadata_in_output_df (bool, optional) – If True, prepends three header rows (Type, Units, Study Stage) to the CSV output. Defaults to False.
formats (list, optional) – A list of strings specifying output formats. Supported: ‘csv’, ‘parquet’, ‘arrow’, ‘feather’. Defaults to [“csv”, “parquet”, “arrow”].

Note

When ‘metadata_in_output_df’ is True, only the CSV format will contain the multi-row headers. Parquet and Arrow formats are saved using a ‘clean’ version (data only) to preserve strict schema typing.
For Parquet and Arrow exports, all columns are cast to strings to ensure compatibility with mixed-type metadata fields.
The method automatically standardizes column order alphabetically.

Returns:: None
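
The ‘semantic’ CSV layout can be approximated with the csv module. This sketch rests on assumptions: write_semantic_csv and the column_meta structure are hypothetical, the ‘mds:Hardness’ and ‘mds:Load’ types are invented sample values, and placing the column-name line before the three metadata rows is a guess at the layout.

```python
import csv
import io


def write_semantic_csv(rows, column_meta, metadata_in_output=True):
    """Return CSV text with optional Type / Unit / Study Stage rows after the header."""
    columns = sorted(column_meta)  # alphabetical column order
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    if metadata_in_output:
        for field in ("type", "unit", "stage"):
            writer.writerow([column_meta[col][field] for col in columns])
    for row in rows:
        writer.writerow([row.get(col, "") for col in columns])
    return buffer.getvalue()


column_meta = {
    "Load": {"type": "mds:Load", "unit": "unit:N", "stage": "COLLECTION"},
    "Hardness": {"type": "mds:Hardness", "unit": "unit:MegaPA", "stage": "ANALYSIS"},
}
data = [{"Hardness": 210, "Load": 5}]
semantic_lines = write_semantic_csv(data, column_meta).splitlines()
clean_lines = write_semantic_csv(data, column_meta, metadata_in_output=False).splitlines()
```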

save_metadata(output_path: str, matched_log_path: str | None = None, unmatched_log_path: str | None = None)[source]: Wrapper to export the JSON-LD template and the status logs (matched/unmatched columns) to files.

static search_license(query: str)[source]

Searches the SPDX license database for a matching ID or name.

This is a utility method: it can be called as MatDatSciDf.search_license("MIT") without initializing the class (i.e., no DataFrame or ORCID required).

Parameters:: query (str) – The search term (e.g., ‘Creative Commons’, ‘GPL’, ‘MIT’).

semantic_remapping(data_graph: Graph)[source]: Validates types against the reference ontology and remaps unrecognized types to the base BFO Entity class.

serialize_bulk(output_path: str, format='json-ld', row_key_cols: list[str] | None = None, id_cols: list[str] | None = None, label_pairs: list[tuple[str, str]] | None = None, license: str | None = None, write_files: bool | None = True) → Graph[source]

Aggregates all row-level RDF graphs into a single master file while preserving the original context.

This method performs a “Bulk Serialization” by first generating RDF subgraphs for every row in the DataFrame and then merging them into a singular master Graph object. Unlike ‘serialize_row’, which creates multiple files, this method outputs one unified dataset file, ensuring that the JSON-LD ‘@context’ is applied globally to maintain consistent prefixing (e.g., ‘mds:’, ‘qudt:’) across all entries.

Parameters:

output_path (str) – The full destination path, including the filename and extension, where the aggregated graph will be saved.
format (str, optional) – The RDF serialization format (e.g., ‘json-ld’, ‘turtle’, ‘xml’). Defaults to ‘json-ld’.
row_key_cols (list[str], optional) – Column names used to generate unique row identifiers.
id_cols (list[str], optional) – Column names to be used as entity identifiers (@id) instead of row keys.
label_pairs (list[tuple[str, str]], optional) – A list of 2-tuples (X, Y) where column X represents an entity in the dataframe, and column Y contains the literal text string that should be assigned as its ‘rdfs:label’. If a cell in column Y is missing or empty, the label triple for that row is omitted.
license (str, optional) – SPDX license ID or URI to be applied to the triples.
write_files (bool, optional) – Whether to write serialized data to disk. Defaults to True.

Returns:

A single aggregated RDFLib Graph object containing the triples for every row in the dataset.

Return type:

Graph

Note

This method is highly recommended for creating FAIR-compliant datasets destined for Triple Stores or Graph Databases.
It maintains the exact same URI structure and namespace bindings as individual row serializations to ensure interoperability.
The output directory is automatically created if it does not exist.
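
The idea of merging per-row output under one global ‘@context’ can be illustrated on plain JSON-LD dictionaries. This is a simplified sketch: the real method merges RDFLib Graph objects, whereas merge_row_documents, SHARED_CONTEXT, and the sample row nodes below are hypothetical stand-ins.

```python
import json

# Assumed prefixes; the 'mds' IRI is the default base_uri of the class.
SHARED_CONTEXT = {
    "mds": "https://cwrusdle.bitbucket.io/mds/",
    "qudt": "http://qudt.org/schema/qudt/",
}

def merge_row_documents(row_docs, context):
    """Combine the '@graph' nodes of every row document under one '@context'."""
    merged_nodes = []
    for doc in row_docs:
        merged_nodes.extend(doc.get("@graph", []))
    return {"@context": context, "@graph": merged_nodes}

row_docs = [
    {"@context": SHARED_CONTEXT, "@graph": [{"@id": "mds:row-1", "qudt:value": 300}]},
    {"@context": SHARED_CONTEXT, "@graph": [{"@id": "mds:row-2", "qudt:value": 310}]},
]
bulk = merge_row_documents(row_docs, SHARED_CONTEXT)
print(json.dumps(bulk, indent=2))
```

Because the context is written once at the top level, every node in the merged document resolves ‘mds:’ and ‘qudt:’ the same way.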

serialize_row(output_folder: str, format='json-ld', row_key_cols: list[str] | None = None, id_cols: list[str] | None = None, label_pairs: list[tuple[str, str]] | None = None, license: str | None = None, write_files: bool | None = True) → list[Graph][source]

Serializes each row of the DataFrame into individual RDF files using the active semantic metadata template.

This method transforms tabular experimental data into Linked Data. It iterates through the DataFrame, generating a unique row identifier (Subject URI) for each entry based on either specified ‘id_cols’ or a hash of the study-stage metadata. It maps cell values to ‘qudt:value’ triples, applies dynamic ‘rdfs:label’ tags, and establishes inter-column relationships defined in the internal ‘data_relations’ manager.

Parameters:

output_folder (str) – Directory where individual RDF files will be saved.
format (str, optional) – The RDF serialization format. Supported: ‘json-ld’, ‘turtle’, ‘xml’, ‘nt’. Defaults to ‘json-ld’.
row_key_cols (list[str], optional) – Column names used to generate the unique row string used for file naming and internal row indexing.
id_cols (list[str], optional) – Column names whose values should be normalized and used as the primary Subject URI identifier (@id). If None, Subject URIs are generated from the unique row key.
label_pairs (list[tuple[str, str]], optional) – A list of 2-tuples (X, Y) where column X represents an entity in the dataframe, and column Y contains the literal text string that should be assigned as its ‘rdfs:label’. If a cell in column Y is missing or empty, the label triple for that row is omitted.
license (str, optional) – An SPDX license identifier (e.g., ‘MIT’) or a full URI. Defaults to ‘CC0-1.0’.
write_files (bool, optional) – If True, writes each row to a file on disk. If False, only returns the list of RDF Graphs. Defaults to True.

Raises:

ValueError – If the provided license is invalid or if the metadata template is missing required ‘skos:altLabel’ definitions.

Returns:

A list of RDFLib Graph objects, each representing one row of experimental data and its associated semantic context.

Return type:

List[rdflib.Graph]

Note

Parent directories for output_folder are created automatically.
Files are named using the pattern: ‘{random_suffix}-{row_key}.{ext}’.
Triples for ‘pd.NA’ or empty string values are omitted to maintain graph sparsity and data integrity.
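
Two of the behaviours described above, choosing a Subject URI and skipping empty cells, can be sketched as follows. The helpers row_subject and row_values are hypothetical, the normalization and hashing details are assumptions, and None stands in for ‘pd.NA’ so the example needs no third-party packages.

```python
import hashlib

BASE_URI = "https://cwrusdle.bitbucket.io/mds/"  # default base_uri of the class

def row_subject(row, id_cols=None, row_key_cols=None):
    """Build a Subject URI from id_cols if given, else from a hash of the row key."""
    if id_cols:
        # Assumed normalization: trim whitespace and replace spaces.
        parts = [str(row[col]).strip().replace(" ", "_") for col in id_cols]
        return BASE_URI + "-".join(parts)
    key_cols = row_key_cols or sorted(row)
    row_key = "|".join(str(row[col]) for col in key_cols)
    digest = hashlib.sha256(row_key.encode("utf-8")).hexdigest()[:12]
    return BASE_URI + digest

def row_values(row):
    """Keep only cells that hold a value; missing or empty cells yield no triple."""
    return {col: val for col, val in row.items() if val is not None and val != ""}

row = {"SampleID": "A 1", "Mass": 1.5, "Comment": "", "Hardness": None}
subject_from_id = row_subject(row, id_cols=["SampleID"])
subject_from_hash = row_subject(row, row_key_cols=["SampleID", "Mass"])
print(subject_from_id)
print(row_values(row))
```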

template_generator(skip_prompts: bool = False)[source]

Generates a semantic metadata template by mapping DataFrame columns to ontology terms.

This method performs a fuzzy match between column headers and the loaded ontology. It attempts to automatically resolve the RDF type (@type), study stage, and units. If a value cannot be resolved and ‘skip_prompts’ is False, it interactively prompts the user to provide the missing metadata fields.

The resulting template follows the JSON-LD structure, integrating namespaces such as QUDT, SKOS, PROV, and MDS.

Parameters:

skip_prompts (bool, optional) – If True, suppresses interactive user input for missing units or definitions, instead using ‘UNITLESS’ or placeholders. Defaults to False.

Returns:

A tuple containing:

metadata_template (dict): The complete JSON-LD dictionary with ‘@context’ and ‘@graph’ entries for each column.
matched_log (list): A list of strings documenting successful fuzzy-match associations (Column => IRI).
unmatched_log (list): A list of column names that could not be found in the provided ontology.

Return type:

tuple

Note

The method prioritizes metadata explicitly included in the first three rows of the CSV (type, unit, study stage).
Unit extraction handles both raw strings (e.g., ‘unit:KiloGM’) and string-encoded dictionaries (e.g., “{‘@id’: ‘unit:M’}”).
Time-stamping via ‘prov:generatedAtTime’ is applied to each entry for provenance tracking.
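
The fuzzy matching and unit extraction described above can be approximated with the standard library. This sketch uses difflib as a swapped-in matching technique; the library's own matcher, its cutoff, and the helper names match_column and extract_unit are not taken from the source and are assumptions.

```python
import ast
import difflib

def match_column(column, ontology_labels, cutoff=0.6):
    """Return the closest ontology label for a column name, or None."""
    lowered = {label.lower(): label for label in ontology_labels}
    hits = difflib.get_close_matches(column.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None

def extract_unit(raw):
    """Accept a raw unit string or a string-encoded dictionary with an '@id' key."""
    text = str(raw).strip()
    if text.startswith("{"):
        parsed = ast.literal_eval(text)  # safe parsing of the literal dict
        return parsed.get("@id", "unit:UNITLESS")
    return text

labels = ["Temperature", "Mass"]  # hypothetical ontology labels
matched = match_column("temperture", labels)
unmatched = match_column("zzz", labels)
print(matched, unmatched)
print(extract_unit("unit:KiloGM"), extract_unit("{'@id': 'unit:M'}"))
```

A column that returns None here corresponds to an entry in the unmatched log.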

update_metadata(col_name: str, field: str, value: str)[source]

Updates a specific property of a column metadata entry in both the JSON-LD template and the internal RDFLib Graph in a synchronized, lock-step transaction.

This method maps a user-friendly shorthand token (passed via field) to its corresponding JSON-LD schema key and formal RDF ontology predicate. It safely modifies the temporary JSON source dictionary and updates the corresponding triple statement within the template_graph.

Parameters:

col_name (str) – The exact string name of the target data column (e.g., ‘systolic_bp’). Matches against the existing ‘skos:altLabel’ identifier.
field ({'definition', 'unit', 'type', 'stage', 'note'}) –
The shorthand token representing the metadata property to modify:
- ‘definition’ : Maps to skos:definition (SKOS.definition). Updates the text-based human description. Expects a plain string.
- ‘unit’ : Maps to qudt:hasUnit (QUDT.hasUnit). Updates the measurement unit. Accepts a raw value (e.g., ‘KG’) or a prefixed URI (e.g., ‘unit:KG’). Will be transformed into a dictionary block in JSON-LD and a URIRef in RDF.
- ‘type’ : Maps to @type / rdf:type (RDF.type). Updates the semantic class or concept type of the column. Autocompletes to the ‘mds:’ namespace if a prefix is missing.
- ‘stage’ : Maps to mds:hasStudyStage (MDS.hasStudyStage). Updates the phase of the study lifecycle. Expects a string (e.g., ‘COLLECTION’).
- ‘note’ : Maps to skos:note (SKOS.note). Appends an administrative or usage note to the concept. Expects a string value.
value (str) – The new data value to assign to the specified field.

Return type:

None

Raises:

No exception is raised. The method prints a warning message if the field is unrecognized, or if the col_name was successfully updated in the JSON template but could not be found as a subject node inside the RDF Graph.
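
The token-to-key mapping can be sketched on the JSON-LD side alone. This is an illustration under stated assumptions: update_entry is a hypothetical helper, it returns a boolean only so the outcome is easy to check (the documented method returns None), and the synchronized RDFLib Graph update is omitted.

```python
# Shorthand tokens and the JSON-LD keys they map to, as listed above.
FIELD_MAP = {
    "definition": "skos:definition",
    "unit": "qudt:hasUnit",
    "type": "@type",
    "stage": "mds:hasStudyStage",
    "note": "skos:note",
}

def update_entry(template, col_name, field, value):
    """Update one field of the entry whose skos:altLabel equals col_name."""
    key = FIELD_MAP.get(field)
    if key is None:
        print(f"Warning: unrecognized field '{field}'")
        return False
    for entry in template.get("@graph", []):
        if entry.get("skos:altLabel") != col_name:
            continue
        if field == "unit":
            # Raw values get the 'unit:' prefix and become a dictionary block.
            unit = value if value.startswith("unit:") else f"unit:{value}"
            entry[key] = {"@id": unit}
        elif field == "type":
            # Missing prefixes are completed with 'mds:'.
            entry[key] = value if ":" in value else f"mds:{value}"
        else:
            entry[key] = value
        return True
    print(f"Warning: column '{col_name}' not found in template")
    return False

template = {"@graph": [{"skos:altLabel": "systolic_bp", "@type": "mds:Pressure"}]}
update_entry(template, "systolic_bp", "unit", "KG")
update_entry(template, "systolic_bp", "type", "BloodPressure")
print(template["@graph"][0])
```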

update_metadata_bulk(metadata_template: dict)[source]: Wrapper to update metadata template in bulk for multiple columns

validate_data_relations()[source]: Wrapper to validate relations using the instance’s own data and ontology.

validate_metadata() → bool[source]

Performs a two-way integrity check between the DataFrame and the Metadata Template.

Category 1 (Undefined Data Columns):: Columns in the DataFrame that are NOT defined in the Metadata. -> Result: These will be skipped during serialization.
Category 2 (Empty Metadata Entries):: Definitions in the Metadata that have no matching column in the DataFrame. -> Result: These will create ‘empty’ RDF nodes with no measurement values.

Returns:: True if data and metadata are perfectly aligned, False otherwise.
Return type:: bool
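
The two-way check amounts to two set differences between the DataFrame columns and the ‘skos:altLabel’ values in the template. The sketch below assumes that pairing; check_alignment is a hypothetical helper that also returns both categories so they can be inspected, whereas the documented method returns only a bool.

```python
def check_alignment(df_columns, template):
    """Return (aligned, undefined_columns, empty_entries)."""
    defined = {
        entry["skos:altLabel"]
        for entry in template.get("@graph", [])
        if "skos:altLabel" in entry
    }
    columns = set(df_columns)
    undefined_columns = sorted(columns - defined)  # Category 1: skipped on serialization
    empty_entries = sorted(defined - columns)      # Category 2: nodes without values
    aligned = not undefined_columns and not empty_entries
    return aligned, undefined_columns, empty_entries

template = {"@graph": [{"skos:altLabel": "Mass"}, {"skos:altLabel": "Hardness"}]}
aligned, undefined_columns, empty_entries = check_alignment(["Mass", "Temperature"], template)
print(aligned, undefined_columns, empty_entries)
```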

view_data_relations()[source]: Displays a visual validation report for the provided DataRelationsDict.

view_metadata(format: str = 'table')[source]

Prints the current metadata template to the standard output.

Depending on the chosen format, this method will either output a pretty-printed JSON-LD structure representing the underlying knowledge graph or a tabular summary compiled into a pandas DataFrame.

Parameters:

format ({'table', 'json'}, default 'table') –

The output format for displaying the metadata template.

‘table’: Flattens the nested JSON-LD ‘@graph’ arrays (including handling complex structures like ‘qudt:hasUnit’ sub-dictionaries) and extracts key attributes (Label, Type, Unit, Definition, Study Stage) into a summarized, human-readable table. If executed in a Jupyter Notebook, it renders as an HTML table; in a terminal, it outputs as plain text.
‘json’: Outputs the raw, un-flattened JSON-LD template structure with proper indentation for deep debugging.

Returns:

None. The formatted metadata summary or raw JSON is printed directly to stdout. If an unsupported format string is provided, an error message is printed instead.
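
The ‘table’ flattening step can be sketched as a loop over the ‘@graph’ entries that unwraps the ‘qudt:hasUnit’ sub-dictionary. This is a minimal illustration that returns plain dictionaries instead of a pandas DataFrame; flatten_template and the sample template are hypothetical.

```python
def flatten_template(template):
    """Turn each '@graph' entry into one flat row of summary attributes."""
    rows = []
    for entry in template.get("@graph", []):
        unit = entry.get("qudt:hasUnit", "")
        if isinstance(unit, dict):
            unit = unit.get("@id", "")  # unwrap {'@id': 'unit:...'}
        rows.append({
            "Label": entry.get("skos:altLabel", ""),
            "Type": entry.get("@type", ""),
            "Unit": unit,
            "Definition": entry.get("skos:definition", ""),
            "Study Stage": entry.get("mds:hasStudyStage", ""),
        })
    return rows

template = {
    "@graph": [
        {
            "@type": "mds:Mass",
            "skos:altLabel": "Mass",
            "qudt:hasUnit": {"@id": "unit:KiloGM"},
            "skos:definition": "Sample mass",
            "mds:hasStudyStage": "Recipe",
        },
        {"@type": "mds:Label", "skos:altLabel": "Batch"},
    ]
}
table_rows = flatten_template(template)
for table_row in table_rows:
    print(table_row)
```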

view_relations()[source]

Prints a formatted list of all semantic relations available in the ontology.

This is a helper method for users to discover which properties can be used in a DataRelationsDict to link columns together.

class FAIRLinked.RDFTableConversion.MDS_DF.Metadata(metadata_template: dict, matched_log: list | None = None, unmatched_log: list | None = None)[source]

Bases: object

Manages semantic metadata and synchronization between JSON-LD templates and RDF graphs.

This class acts as a specialized container for experimental metadata. It maintains a ‘source of truth’ using an RDFLib Graph to ensure semantic consistency, while providing a standard dictionary interface for JSON-LD serialization. It also tracks the success of metadata mapping through matched and unmatched logs.

metadata_temp

The JSON-LD representation of the metadata template, including @context and @graph.

Type:: dict

matched_log

A historical record of columns successfully mapped to ontology terms during the initialization process.

Type:: list

unmatched_log

A record of columns that failed to find an automated match in the reference ontology.

Type:: list

template_graph

The internal RDFLib Graph used for complex updates, validation, and semantic querying.

Type:: rdflib.Graph

MDS

Namespace for Materials Data Science ontology terms.

Type:: rdflib.Namespace

QUDT

Namespace for Quantities, Units, Dimensions, and Types.

Type:: rdflib.Namespace

UNIT

Namespace for QUDT unit individuals.

Type:: rdflib.Namespace

add_column_metadata(col_name: str, rdf_type: str, unit: str = 'UNITLESS', definition: str = 'Definition not available', study_stage: str = 'UNKNOWN')[source]

Registers and appends metadata for a specific data column to both the temporary JSON-LD graph and the internal RDFLib Graph.

This method prevents duplicate entries by checking the existing JSON-LD @graph for the column name. If the column does not exist, it constructs a clean Python dictionary representing the JSON-LD entity, appends it to the temporary graph structure, and synchronizes it by parsing it into the internal template_graph.

Parameters:

col_name (str) – The exact name of the data column (e.g., ‘patient_age’). Used as the skos:altLabel identifier to prevent duplicate entries.
rdf_type (str) – The RDF semantic type or class for the column. If a namespace prefix (like ‘mds:’) is omitted, the ‘mds:’ prefix will be automatically prepended.
unit (str, optional) – The measurement unit of the column data, mapped to a QUDT ontology identifier. Defaults to “UNITLESS”.
definition (str, optional) – A human-readable textual description of what the column represents. Defaults to “Definition not available”.
study_stage (str, optional) – The phase or stage of the study lifecycle this data belongs to (e.g., ‘COLLECTION’, ‘ANALYSIS’). Defaults to “UNKNOWN”.

Return type:

None

Raises:

ValueError – If required parameters are malformed (handled by downstream JSON/RDF parsers).
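
The duplicate check and prefix completion can be sketched on the JSON-LD dictionary alone. This is an assumption-laden illustration: add_column_entry is a hypothetical helper, the entry layout is inferred from the keys named in this reference, the boolean return value exists only for checking (the documented method returns None), and the parse into template_graph is left out.

```python
def add_column_entry(template, col_name, rdf_type, unit="UNITLESS",
                     definition="Definition not available", study_stage="UNKNOWN"):
    """Append a new column entry unless one with the same altLabel exists."""
    graph = template.setdefault("@graph", [])
    if any(entry.get("skos:altLabel") == col_name for entry in graph):
        return False  # duplicate: leave the template unchanged
    graph.append({
        "@type": rdf_type if ":" in rdf_type else f"mds:{rdf_type}",
        "skos:altLabel": col_name,
        "skos:definition": definition,
        "qudt:hasUnit": {"@id": unit if unit.startswith("unit:") else f"unit:{unit}"},
        "mds:hasStudyStage": study_stage,
    })
    return True

template = {"@context": {}, "@graph": []}
first_add = add_column_entry(template, "patient_age", "Age", unit="YR")
second_add = add_column_entry(template, "patient_age", "Age")
print(first_add, second_add, template["@graph"])
```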

delete_column_metadata(col_name: str)[source]: Removes all metadata associated with a specific column from both the internal JSON template and the RDF graph.

print_template(format: str = 'table')[source]

Prints the current metadata template to the standard output.

Depending on the chosen format, this method will either output a pretty-printed JSON-LD structure representing the underlying knowledge graph or a tabular summary compiled into a pandas DataFrame.

Parameters:

format ({'table', 'json'}, default 'table') –

The output format for displaying the metadata template.

’table’: Flattens the nested JSON-LD ‘@graph’ arrays (including handling complex structures like ‘qudt:hasUnit’ sub-dictionaries) and extracts key attributes (Label, Type, Unit, Definition, Study Stage) into a summarized, human-readable table. If executed in a Jupyter Notebook, it renders as an HTML table; in a terminal, it outputs as plain text.
’json’: Outputs the raw, un-flattened JSON-LD template structure with proper indentation for deep debugging.

Returns:

None. The formatted metadata summary or raw JSON is printed directly to stdout. If an unsupported format string is provided, an error message is printed instead.

save_metadata(output_path: str, matched_log_path: str | None = None, unmatched_log_path: str | None = None)[source]

Exports the synchronized metadata template and import logs to the file system.

This method performs three primary tasks:

1. Serializes the current JSON-LD metadata template (the source of truth) to a file.
2. Optionally exports a log of all columns successfully matched during initialization.
3. Optionally exports a deduplicated log of columns that were not found in the RDF source.

Parameters:

output_path (str) – File path where the JSON-LD metadata template will be saved.
matched_log_path (str, optional) – File path to save the list of successfully matched columns. If None, no log is created.
unmatched_log_path (str, optional) – File path to save the unique list of columns missing RDF metadata. If None, no log is created.

Note

This method automatically creates any missing parent directories for the provided file paths to prevent ‘FileNotFoundError’.

Returns:: None
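
The export behaviour, including automatic directory creation and the deduplicated unmatched log, can be sketched with the standard library. The helper save_template, the one-name-per-line log format, and the file names are assumptions for illustration; the matched log is omitted for brevity.

```python
import json
import os
import tempfile

def save_template(template, output_path, unmatched_log=None, unmatched_log_path=None):
    """Write the template as JSON and, optionally, a deduplicated unmatched log."""
    os.makedirs(os.path.dirname(os.path.abspath(output_path)), exist_ok=True)
    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(template, handle, indent=2)
    if unmatched_log_path is not None:
        os.makedirs(os.path.dirname(os.path.abspath(unmatched_log_path)), exist_ok=True)
        unique = list(dict.fromkeys(unmatched_log or []))  # dedupe, keep order
        with open(unmatched_log_path, "w", encoding="utf-8") as handle:
            handle.write("\n".join(unique))

with tempfile.TemporaryDirectory() as tmp:
    # Neither 'nested/out' nor 'logs' exists yet; both are created on demand.
    template_path = os.path.join(tmp, "nested", "out", "template.jsonld")
    log_path = os.path.join(tmp, "logs", "unmatched.txt")
    save_template({"@context": {}, "@graph": []}, template_path,
                  unmatched_log=["Mass", "Mass", "Batch"], unmatched_log_path=log_path)
    with open(template_path, encoding="utf-8") as handle:
        reloaded = json.load(handle)
    with open(log_path, encoding="utf-8") as handle:
        saved_lines = handle.read().splitlines()

print(reloaded, saved_lines)
```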

update_bulk(metadata_template: dict)[source]: Merges an external metadata template into the current instance. Iterates through the ‘@graph’ and decides whether to update existing columns or add new ones.

update_template(col_name: str, field: str, value: str)[source]

Updates a specific property of a column metadata entry in both the JSON-LD template and the internal RDFLib Graph in a synchronized, lock-step transaction.

This method maps a user-friendly shorthand token (passed via field) to its corresponding JSON-LD schema key and formal RDF ontology predicate. It safely modifies the temporary JSON source dictionary and updates the corresponding triple statement within the template_graph.

Parameters:

col_name (str) – The exact string name of the target data column (e.g., ‘systolic_bp’). Matches against the existing ‘skos:altLabel’ identifier.
field ({'definition', 'unit', 'type', 'stage', 'note'}) –
The shorthand token representing the metadata property to modify:
- ’definition’ : Maps to skos:definition (SKOS.definition). Updates the text-based human description. Expects a plain string.
- ’unit’ : Maps to qudt:hasUnit (QUDT.hasUnit). Updates the measurement unit. Accepts a raw value (e.g., ‘KG’) or a prefixed URI (e.g., ‘unit:KG’). Will be transformed into a dictionary block in JSON-LD and a URIRef in RDF.
- ’type’ : Maps to @type / rdf:type (RDF.type). Updates the semantic class or concept type of the column. Autocompletes to the ‘mds:’ namespace if a prefix is missing.
- ’stage’ : Maps to mds:hasStudyStage (MDS.hasStudyStage). Updates the phase of the study lifecycle. Expects a string (e.g., ‘COLLECTION’).
- ’note’ : Maps to skos:note (SKOS.note). Appends an administrative or usage note to the concept. Expects a string value.
value (str) – The new data value to assign to the specified field.

Return type:

None

Raises:

No exception is raised. The method prints a warning message if the field is unrecognized, or if the col_name was successfully updated in the JSON template but could not be found as a subject node inside the RDF Graph.