FAIRLinked.RDFTableConversion package
Submodules
FAIRLinked.RDFTableConversion.csv_to_jsonld_mapper module
- FAIRLinked.RDFTableConversion.csv_to_jsonld_mapper.extract_qudt_units(url='https://qudt.org/vocab/unit/')[source]
Extract all units from the QUDT ontology programmatically.
- Parameters:
url – The URL of the QUDT unit vocabulary
- Returns:
Dictionary containing unit information
- FAIRLinked.RDFTableConversion.csv_to_jsonld_mapper.extract_terms_from_ontology(ontology_graph)[source]
Extract terms from an RDF graph representing an OWL ontology.
- Parameters:
ontology_graph (rdflib.Graph) – The ontology RDF graph.
- Returns:
A list of dictionaries containing term IRIs, original labels, and normalized labels.
- Return type:
list[dict]
- FAIRLinked.RDFTableConversion.csv_to_jsonld_mapper.find_best_match(column, ontology_terms)[source]
Find the best matching ontology term for a given column name.
- Parameters:
column (str) – The name of the column from the CSV file.
ontology_terms (list[dict]) – List of extracted ontology terms.
- Returns:
The best-matching ontology term, or None if no good match is found.
- Return type:
dict or None
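The matching step can be illustrated with a short, standard-library sketch. It is not the package's implementation: the dictionary keys ("iri", "label", "normalized"), the normalization rule, and the 0.8 similarity cutoff are all assumptions made for the example.

```python
import difflib
import re


def normalize(label: str) -> str:
    """Lowercase a label and drop every non-alphanumeric character."""
    return re.sub(r"[^a-z0-9]", "", label.lower())


def best_match_sketch(column, ontology_terms, cutoff=0.8):
    """Return the term whose normalized label is closest to the column name.

    The keys "iri", "label", "normalized" and the cutoff are assumptions
    for this sketch, not documented FAIRLinked behavior.
    """
    target = normalize(column)
    by_label = {term["normalized"]: term for term in ontology_terms}
    close = difflib.get_close_matches(target, list(by_label), n=1, cutoff=cutoff)
    return by_label[close[0]] if close else None


# Hypothetical ontology terms.
terms = [
    {"iri": "http://example.org/onto#YieldStrength", "label": "Yield Strength",
     "normalized": normalize("Yield Strength")},
    {"iri": "http://example.org/onto#Density", "label": "Density",
     "normalized": normalize("Density")},
]

match = best_match_sketch("yield_strength", terms)
no_match = best_match_sketch("operator_notes", terms)
```

Here "yield_strength" normalizes to the same string as the "Yield Strength" label and is matched, while "operator_notes" falls below the cutoff and yields None.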
- FAIRLinked.RDFTableConversion.csv_to_jsonld_mapper.jsonld_template_generator(csv_path, ontology_graph, output_path, matched_log_path, unmatched_log_path, skip_prompts=False)[source]
Convert a CSV file into a JSON-LD template in which the user can fill out column metadata.
- Parameters:
csv_path (str) – Path to the CSV file to generate JSON-LD template.
ontology_graph (rdflib.Graph) – The ontology RDF graph for matching terms.
output_path (str) – Path to write the resulting JSON-LD file.
matched_log_path (str) – Path to write the log of columns that matched the ontology.
unmatched_log_path (str) – Path to write the log of columns that can’t be found in the ontology.
skip_prompts (bool) – Allow users to skip metadata prompts
FAIRLinked.RDFTableConversion.csv_to_jsonld_template_filler module
- FAIRLinked.RDFTableConversion.csv_to_jsonld_template_filler.extract_data_from_csv(metadata_template, csv_file, orcid, output_folder, row_key_cols=None, id_cols=None, prop_column_pair_dict=None, ontology_graph=None, base_uri='https://cwrusdle.bitbucket.io/mds/', license=None)[source]
Converts CSV rows into RDF graphs using a JSON-LD template and optional property mapping, writing JSON-LD files. This function assumes that the two rows below the header row contain the unit and the proper ontology name.
- Parameters:
metadata_template (dict) – JSON-LD template with “@context” and “@graph”.
csv_file (str) – Path to the input CSV.
row_key_cols (list[str]) – Columns to uniquely identify each row.
id_cols (list[str]) – Columns that contain a unique entity identifier independent of the row.
orcid (str) – ORCID identifier (dashes removed automatically).
output_folder (str) – Directory to save JSON-LD files.
prop_column_pair_dict (dict or None, optional) – Maps property keys to (subject_column, object_column) column pairs. If None or empty, no properties are added.
ontology_graph (RDFLib Graph object or None, optional) – Ontology for property type/URI resolution. Required if prop_column_pair_dict is provided.
base_uri (str, optional) – Base URI used to construct subject and object URIs.
license (str, optional) – License to be used for the dataset.
- Returns:
List of RDFLib Graphs, one per row.
- Return type:
List[rdflib.Graph]
- FAIRLinked.RDFTableConversion.csv_to_jsonld_template_filler.extract_data_from_csv_interface(args)[source]
CLI wrapper for extract_data_from_csv. Loads JSON/CSV/ontology files and calls the core function.
- FAIRLinked.RDFTableConversion.csv_to_jsonld_template_filler.extract_from_folder(csv_folder, metadata_template, orcid, row_key_cols, id_cols, output_base_folder, prop_column_pair_dict=None, ontology_graph=None, base_uri='https://cwrusdle.bitbucket.io/mds/', license=None)[source]
Processes all CSV files in a folder and converts each into RDF/JSON-LD files using a metadata template and optional object/datatype property mappings.
- Parameters:
csv_folder (str) – Path to the folder containing CSV files.
metadata_template (dict) – JSON-LD metadata template with “@context” and “@graph” describing the RDF structure.
row_key_cols (list[str]) – List of CSV column names used to construct a unique key for each row.
id_cols (list[str]) – Columns that contain a unique entity identifier independent of the row.
orcid (str) – ORCID iD of the user (dashes will be removed automatically).
output_base_folder (str) – Directory where output subfolders (one per CSV) will be created for JSON-LD files.
prop_column_pair_dict (dict or None, optional) – Mapping from property key (e.g., predicate label) to list of (subject_column, object_column) tuples. These define additional object or datatype properties to inject based on CSV columns. If None, no extra connections are added.
ontology_graph (rdflib.Graph or None, optional) – RDFLib graph object of the ontology from which property URIs and types are resolved. Required only if prop_column_pair_dict is given.
base_uri (str, optional) – Base URI used to construct RDF subject and object URIs. Defaults to the CWRU MDS base.
- Returns:
Writes JSON-LD files to disk. No return value.
- Return type:
None
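As a hedged illustration of the prop_column_pair_dict shape described above (property key mapped to a list of (subject_column, object_column) tuples), the sketch below builds one such mapping and checks that every referenced column exists. The column names, property labels, and the helper missing_columns are hypothetical and not part of the package.

```python
# Hypothetical column names and property labels, chosen only for illustration.
csv_columns = ["SampleID", "Material", "TensileStrength", "Instrument"]

prop_column_pair_dict = {
    "has material": [("SampleID", "Material")],
    "measured by": [("TensileStrength", "Instrument")],
}


def missing_columns(pair_dict, columns):
    """Return, sorted, every column named in a pair but absent from the CSV."""
    known = set(columns)
    missing = set()
    for pairs in pair_dict.values():
        for subject_col, object_col in pairs:
            missing.update({subject_col, object_col} - known)
    return sorted(missing)


problems = missing_columns(prop_column_pair_dict, csv_columns)
```

Running a check like this before conversion makes typos in column names visible early, rather than leaving the outcome to the conversion step.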
- FAIRLinked.RDFTableConversion.csv_to_jsonld_template_filler.generate_prop_metadata_dict(ontology_graph)[source]
Generates a dictionary where the keys are human-readable labels of object/datatype properties, and the values are 2-tuples that contain the URI of that property in the first entry and the type (object/datatype) in second entry.
- Parameters:
ontology_graph (rdflib.Graph) – RDFLib graph object of the ontology.
- Returns:
Dictionary of the form:
{
"has material": ("http://example.org/ontology#hasMaterial", "Object Property"),
"has value": ("http://example.org/ontology#hasValue", "Datatype Property"),
…
}
- Return type:
dict
- FAIRLinked.RDFTableConversion.csv_to_jsonld_template_filler.hash6(s)[source]
Takes any string and returns a 6-digit number (100000-999999).
- Parameters:
s – Input string to hash
- Returns:
A 6-digit number between 100000 and 999999
- Return type:
int
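One way to obtain a stable 6-digit number from a string is sketched below. The hash function used by the package is not stated here, so SHA-256 is an assumption, and hash6_sketch is a hypothetical name.

```python
import hashlib


def hash6_sketch(s: str) -> int:
    """Map any string to an integer in the range 100000..999999."""
    digest = hashlib.sha256(s.encode("utf-8")).hexdigest()
    # 900000 possible values, shifted so the result always has six digits.
    return 100000 + int(digest, 16) % 900000


value = hash6_sketch("sample-001")
```

Unlike Python's built-in hash(), a digest-based approach gives the same number across interpreter sessions, which matters when the value ends up in file names or identifiers.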
- FAIRLinked.RDFTableConversion.csv_to_jsonld_template_filler.resolve_predicate(key, ontology_graph)[source]
Resolves a given key into a full RDF predicate URI and determines its property type (object or datatype) within a provided ontology graph.
The function accepts either a full IRI (e.g. http://example.org/ontology#hasMaterial) or a CURIE (e.g. ex:hasMaterial). It first checks whether the key is a valid absolute IRI. If not, it attempts to expand the key as a CURIE using the namespace manager attached to the supplied RDFLib ontology graph. If neither expansion succeeds, the function returns (None, None).
Once a valid predicate URI is obtained, the function inspects the ontology graph to determine whether the predicate is an owl:ObjectProperty or an owl:DatatypeProperty. If neither type is declared in the ontology, the label type is returned as None.
- Parameters:
key (str) – Predicate identifier to resolve. Can be a full IRI (e.g. http://...) or a CURIE (e.g. ex:hasMaterial).
ontology_graph (rdflib.Graph) – RDFLib graph object representing the ontology within which the predicate should be resolved. The graph must have a properly configured namespace_manager to expand CURIEs.
- Returns:
A 2-tuple of the form (predicate_uri, label_type) where:
predicate_uri is an rdflib.term.URIRef representing the resolved predicate IRI, or None if the key could not be resolved.
label_type is a string describing the property type: "Object Property", "Datatype Property", or None if no type match was found.
- Return type:
tuple
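The resolution order described above (absolute IRI first, then CURIE expansion, then a type lookup) can be sketched without RDFLib. In this simplified stand-in a plain dictionary replaces the graph's namespace manager and another replaces the owl:ObjectProperty / owl:DatatypeProperty declarations; every name and IRI below is hypothetical.

```python
from urllib.parse import urlparse

# Stand-ins for the ontology graph: a prefix map and declared property types.
PREFIXES = {"ex": "http://example.org/ontology#"}
PROPERTY_TYPES = {
    "http://example.org/ontology#hasMaterial": "Object Property",
    "http://example.org/ontology#hasValue": "Datatype Property",
}


def resolve_predicate_sketch(key, prefixes=PREFIXES, property_types=PROPERTY_TYPES):
    """Return (predicate_iri, label_type) following the order described above."""
    parsed = urlparse(key)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        iri = key  # already an absolute IRI
    else:
        prefix, sep, local = key.partition(":")
        if not sep or prefix not in prefixes:
            return (None, None)  # neither an IRI nor an expandable CURIE
        iri = prefixes[prefix] + local
    # The type is None when the ontology declares neither property type.
    return (iri, property_types.get(iri))


from_curie = resolve_predicate_sketch("ex:hasMaterial")
from_iri = resolve_predicate_sketch("http://example.org/ontology#hasValue")
unknown = resolve_predicate_sketch("nope:hasMaterial")
```

The sketch returns plain strings; the documented function returns an rdflib.term.URIRef for the predicate.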
- FAIRLinked.RDFTableConversion.csv_to_jsonld_template_filler.write_license_triple(output_folder: str, base_uri: str, license_id: str)[source]
Creates a compact JSON-LD file defining a single RDF triple that links a dataset to its license.
This function generates a minimal JSON-LD graph of the form:
mds:Dataset dcterms:license <SPDX_URI>
If a short SPDX identifier (e.g. “MIT”, “CC-BY-4.0”) is provided, the function verifies that the identifier exists in the official SPDX license list (licenses.json, bundled with the package) and converts it to its canonical SPDX URI (e.g. https://spdx.org/licenses/MIT.html). If a full URI beginning with “http” is supplied, the URI is used as-is.
The resulting triple is serialized to a compact JSON-LD file named dataset_license.jsonld in the specified output folder. The JSON-LD document includes a top-level @context containing compact namespace prefixes for mds and dcterms.
- Parameters:
output_folder (str) – Path to the directory where the output JSON-LD file will be written. The directory is created if it does not exist.
base_uri (str) – Base namespace URI of the MDS ontology. The function appends a fragment (“#”) and uses mds:Dataset as the subject IRI of the triple.
license_id (str) – SPDX short identifier (e.g., “MIT”, “CC-BY-4.0”) OR full license URI. Short identifiers are validated against the official SPDX license list before being converted into full URIs.
- Outputs:
dataset_license.jsonld (file) – A JSON-LD file written to output_folder with the structure:
{
"@context": {
"mds": "https://cwrusdle.bitbucket.io/mds/",
"dcterms": "http://purl.org/dc/terms/"
},
"@id": "mds:Dataset",
"dcterms:license": {
"@id": "https://spdx.org/licenses/MIT.html"
}
}
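A rough standard-library sketch of the behavior described above follows. It reproduces the documented output structure, but the tiny KNOWN_SPDX_IDS set stands in for the bundled licenses.json, and raising ValueError for an unknown identifier is an assumption. The function names are hypothetical.

```python
import json
import os
import tempfile

# A tiny stand-in for the bundled SPDX list; the real list is far longer.
KNOWN_SPDX_IDS = {"MIT", "CC-BY-4.0", "CC0-1.0"}


def license_document(license_id, base_uri="https://cwrusdle.bitbucket.io/mds/"):
    """Build the single-triple JSON-LD document described above."""
    if license_id.startswith("http"):
        license_uri = license_id  # full URIs are used as-is
    elif license_id in KNOWN_SPDX_IDS:
        license_uri = f"https://spdx.org/licenses/{license_id}.html"
    else:
        # Assumed failure mode; the documented behavior is only "verifies".
        raise ValueError(f"Unknown SPDX identifier: {license_id}")
    return {
        "@context": {"mds": base_uri, "dcterms": "http://purl.org/dc/terms/"},
        "@id": "mds:Dataset",
        "dcterms:license": {"@id": license_uri},
    }


def write_license_sketch(output_folder, license_id):
    """Write dataset_license.jsonld, creating the folder if needed."""
    os.makedirs(output_folder, exist_ok=True)
    path = os.path.join(output_folder, "dataset_license.jsonld")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(license_document(license_id), handle, indent=2)
    return path


with tempfile.TemporaryDirectory() as tmp:
    written = write_license_sketch(os.path.join(tmp, "out"), "MIT")
    with open(written, encoding="utf-8") as handle:
        loaded = json.load(handle)
```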
FAIRLinked.RDFTableConversion.jsonld_batch_converter module
Module contents
- class FAIRLinked.RDFTableConversion.AnalysisGroup(proj_name: str, home_path: str, orcid: str | None = '0000-0000-0000-0000', metadata_template: dict | None = None, base_uri: str | None = 'https://cwrusdle.bitbucket.io/mds/', ontology_graph: Graph | None = None, prefix: str | None = 'mds', file_events: bool | None = False)[source]
Bases: object
Manages a collection of related AnalysisTracker instances, facilitating group-level reporting and master graph generation.
- add_column_metadata(col_name: str, rdf_type: str, unit: str = 'UNITLESS', definition: str = 'No definition provided', study_stage: str = 'UNK')[source]
Top-level API to manually define semantic metadata for a new column. Useful for defining columns found in ‘Discovery Warning’ reports.
- create_MatDatSciDf()[source]
Converts the group data into a MatDatSciDf object, integrating ontology-mapped metadata.
- Returns:
The semantic-aware DataFrame object.
- Return type:
MatDatSciDf
- create_group_arg_df() DataFrame[source]
Aggregates all individual analysis DataFrames into a single master DataFrame.
- Returns:
Concatenated data from all tracked analyses.
- Return type:
pd.DataFrame
- create_group_report()[source]
Consolidates individual analysis reports into one master Markdown document.
- Returns:
A full Markdown report for the entire group.
- Return type:
str
- create_metadata_template()[source]
Automatically generates a metadata template by matching group data columns against the loaded ontology.
- Returns:
(metadata_template, matched_log, unmatched_log)
- Return type:
tuple
- delete_column_metadata(col_name: str)[source]
Top-level API to remove a column’s semantic metadata definition. Useful for cleaning up incorrect mappings or unwanted discovery columns.
- Parameters:
col_name (str) – The column label to remove from the metadata template.
- get_context() dict[source]
Defines the JSON-LD context for the group metadata.
- Returns:
Prefix to namespace URI mappings.
- Return type:
dict
- mds_graph = <Graph identifier=N8130b61943c04bacb25010f2eae54102 (<class 'rdflib.graph.Graph'>)>
- overwrite_metadata(metadata_template: dict)[source]
Wrapper to delete and replace metadata information. WARNING: THIS WILL DELETE ALL CURRENT METADATA
- run_and_track(func, *args, tracker: AnalysisTracker | None = None, **kwargs)[source]
Executes a function and stores metadata. Can use an existing tracker to group multiple functions under one ID, or create a new one.
- save_jsonld()[source]
Serializes all individual analysis JSON-LDs and creates a master graph file that links all components to the group activity.
- save_metadata(output_path: str, matched_log_path: str | None = None, unmatched_log_path: str | None = None)[source]
Wrapper to export the JSON-LD template and the status logs (matched/unmatched columns) to files.
- track(func)[source]
A decorator to automatically wrap a function with provenance tracking.
- Parameters:
func – The function to be decorated.
- Returns:
The wrapped function that executes via run_and_track.
- Return type:
function
- class FAIRLinked.RDFTableConversion.AnalysisTracker(proj_name: str, home_path: str, orcid: str | None = '0000-0000-0000-0000', metadata_template: dict | None = None, base_uri: str | None = 'https://cwrusdle.bitbucket.io/mds/', ontology_graph: Graph | None = None, prefix: str | None = 'mds', file_events: bool | None = False)[source]
Bases: object
A system for auditing scientific analysis, capturing data provenance, and generating semantic JSON-LD metadata.
- add_column_metadata(col_name: str, rdf_type: str, unit: str = 'UNITLESS', definition: str = 'No definition provided', study_stage: str = 'UNK')[source]
Top-level API to manually define semantic metadata for a new column. Useful for defining columns found in ‘Discovery Warning’ reports.
- create_analysis_jsonld(license: str | None = None)[source]
Assembles all tracked data and file events into a valid JSON-LD string.
- Returns:
A formatted JSON-LD string containing the analysis graph.
- Return type:
str
- create_arg_df()[source]
Flattens the tracked variables into a single-row Pandas DataFrame for tabular comparison across different runs.
- Returns:
A DataFrame row containing run metadata and values.
- Return type:
pd.DataFrame
- create_metadata_template()[source]
Automatically generates a metadata template by matching group data columns against the loaded ontology.
- Returns:
(metadata_template, matched_log, unmatched_log)
- Return type:
tuple
- create_report() str[source]
Generates a human-readable Markdown summary of the analysis variables and file system activities.
- Returns:
A Markdown formatted report.
- Return type:
str
- delete_column_metadata(col_name: str)[source]
Top-level API to remove a column’s semantic metadata definition. Useful for cleaning up incorrect mappings or unwanted discovery columns.
- Parameters:
col_name (str) – The column label to remove from the metadata template.
- detect_all_imports()[source]
Unified Environment Scanner for Jupyter and standard scripts. Identifies top-level software dependencies currently available in the session.
- get_context() dict[source]
Defines the JSON-LD context mapping prefixes to namespace URIs.
- Returns:
A dictionary of semantic prefix mappings (e.g., prov, mds, qudt).
- Return type:
dict
- mds_graph = <Graph identifier=Nfe0eac7054904ee0acf6c381abe14770 (<class 'rdflib.graph.Graph'>)>
- overwrite_metadata(metadata_template: dict)[source]
Wrapper to delete and replace metadata information. WARNING: THIS WILL DELETE ALL CURRENT METADATA
- run_and_track(func, *args, **kwargs)[source]
Executes a function while auditing arguments, results, and environment.
This method acts as a high-level provenance wrapper. It captures the “top-most” (direct) input IRIs from the function signature and the direct output IRIs from the return value. While all internal data structures (like nested dictionary keys) are routed and saved to the global metadata log, only the direct IRIs are linked to the Activity node via CCO and PROV-O properties.
- The method performs the following audit steps:
Generates a unique 15-digit numeric activity ID.
Binds and routes direct function arguments to capture input IRIs.
Triggers a live environment scan (imports/sys.modules).
Executes the function while monitoring OS-level file handles.
Routes and captures return value IRIs.
Finalizes a Linked Data Activity node with prov:used and prov:generated.
- Parameters:
func (callable) – The scientific function or method to be executed.
*args – Positional arguments to be passed to the target function.
**kwargs – Keyword arguments to be passed to the target function.
- Returns:
The original return value of the wrapped function. If an exception occurs, it returns None after logging the error as a provenance event.
- Return type:
Any
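The wrapper pattern described above can be sketched in a few lines. This toy class is not AnalysisTracker: it only records arguments and results in a list, skips the environment scan and file-handle monitoring, and generates the 15-digit ID with random.randint, which is an assumption. It also shows how a decorator can route calls through the same wrapper.

```python
import functools
import random


class TrackerSketch:
    """Toy provenance tracker; not the AnalysisTracker implementation."""

    def __init__(self):
        self.events = []

    def run_and_track(self, func, *args, **kwargs):
        # 15-digit numeric activity ID; the generation scheme is an assumption.
        activity_id = random.randint(10**14, 10**15 - 1)
        event = {"id": activity_id, "function": func.__name__,
                 "args": args, "kwargs": kwargs}
        try:
            result = func(*args, **kwargs)
            event["result"] = result
        except Exception as exc:
            # Record the failure as an event and return None.
            event["error"] = repr(exc)
            result = None
        self.events.append(event)
        return result

    def track(self, func):
        """Decorator form: every call is routed through run_and_track."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return self.run_and_track(func, *args, **kwargs)
        return wrapper


tracker = TrackerSketch()


@tracker.track
def stress(force, area):
    return force / area


ok = stress(100.0, 4.0)
failed = stress(1.0, 0)  # division by zero is logged, not raised
```

Because a failing call returns None instead of raising, callers that need to distinguish "failed" from "legitimately returned None" have to inspect the recorded events.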
- save_metadata(output_path: str, matched_log_path: str | None = None, unmatched_log_path: str | None = None)[source]
Wrapper to export the JSON-LD template and the status logs (matched/unmatched columns) to files.
- semantic_remapping(unmatched_log)[source]
Refines simple Python types by matching them against the current metadata template’s semantic types.
- serialize_analysis_jsonld(license: str | None = None)[source]
Writes the JSON-LD metadata to a physical file within the analysis directory.
- track(func)[source]
A decorator to automatically wrap a function with provenance tracking.
- Parameters:
func – The function to be decorated.
- Returns:
The wrapped function that executes via run_and_track.
- Return type:
function
- track_dataframe(name, df, parent_id=None)[source]
Logs structural metadata of a Pandas DataFrame, including column names and row counts.
- Parameters:
name – DataFrame name.
df – The pandas DataFrame object.
parent_id – ID of the containing process or object.
- track_dict(name, val, parent_id=None)[source]
Logs a dictionary’s keys and recursively tracks its nested values.
- Parameters:
name – Dictionary name.
val – The dictionary object.
parent_id – ID of the containing process or object.
- track_list_array(name, data, parent_id=None)[source]
Tracks the dimensions and size of lists and NumPy arrays.
- Parameters:
name – Array or list name.
data – The sequence or array-like object.
parent_id – ID of the containing process or object.
- track_other(name, obj, parent_id=None)[source]
Falls back to inspecting custom objects by logging their public attributes as nested data.
- Parameters:
name – Object name.
obj – The Python object to inspect.
parent_id – ID of the containing process or object.
- track_simple_datatype(name, val, parent_id=None)[source]
Tracks primitive types (str, int, float, bool) and attempts to map them to ontology terms using fuzzy matching.
- Parameters:
name – Variable name.
val – The primitive value.
parent_id – ID of the containing process or object.
- update_metadata(col_name: str, field: str, value: str)[source]
Wrapper to update a metadata property (unit, type, definition, etc.) for a specific column.
- class FAIRLinked.RDFTableConversion.MatDatSciDf(df: DataFrame, metadata_template: dict | None = None, matched_log: list | None = None, unmatched_log: list | None = None, data_relations_dict: dict | None = None, orcid: str = '0000-0000-0000-0000', df_name: str | None = None, metadata_rows: bool | None = False, ontology_graph: Graph | None = None, base_uri='https://cwrusdle.bitbucket.io/mds/', local_unit_file: bool | None = True)[source]
Bases: object
A semantic wrapper for Pandas DataFrames in the Materials Data Science domain.
This class serves as a “Semantic Firewall” for experimental materials data. It bridges tabular data and Linked Data by maintaining synchronized internal objects for measurement data, semantic headers, metadata templates, and column-to-column relationships. It enforces FAIR principles by validating researcher identifiers (ORCID) and ensuring ontological consistency before serialization.
- df
The cleaned measurement data, stripped of metadata headers.
- Type:
pd.DataFrame
- header_df
A 3-row buffer (Type, Unit, Study Stage) used for mapping or pre-allocating metadata for the dataset.
- Type:
pd.DataFrame
- metadata_obj
The internal manager handling the RDFLib Graph and JSON-LD template synchronization.
- Type:
MatDatSciDf.Metadata
- data_relations
The internal manager for defining semantic links (Object/Datatype properties) between columns.
- Type:
MatDatSciDf.DataRelationsDict
- orcid
Validated ORCID iD of the data curator.
- Type:
str
- orcid_verified
Boolean status of curator identity verification.
- Type:
bool
- df_name
Descriptive name for the dataset used in file exports.
- Type:
str
- ontology
The reference ontology graph used for fuzzy matching and property resolution.
- Type:
rdflib.Graph
- base_uri
The namespace prefix used for generating semantic subjects.
- Type:
str
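The class description mentions ORCID validation, but how the package performs it is not stated here. One offline check that any ORCID iD must pass is the ISO 7064 MOD 11-2 checksum on its final character; the sketch below implements only that check, under a hypothetical function name.

```python
def orcid_checksum_ok(orcid: str) -> bool:
    """Check the final character of an ORCID iD with ISO 7064 MOD 11-2."""
    compact = orcid.replace("-", "").upper()
    body, last = compact[:15], compact[15:]
    if len(compact) != 16 or not (body.isascii() and body.isdigit()):
        return False
    total = 0
    for digit in body:
        total = (total + int(digit)) * 2
    check = (12 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return last == expected


valid = orcid_checksum_ok("0000-0002-1825-0097")
invalid = orcid_checksum_ok("0000-0002-1825-0098")
```

A passing checksum only shows that the identifier is well-formed; it does not confirm that the iD is registered to a particular person.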
- add_column_metadata(col_name: str, rdf_type: str, unit: str = 'UNITLESS', definition: str = 'No definition provided', study_stage: str = 'UNK')[source]
Top-level API to manually define semantic metadata for a new column. Useful for defining columns found in ‘Discovery Warning’ reports.
- delete_column_metadata(col_name: str)[source]
Top-level API to remove a column’s semantic metadata definition. Useful for cleaning up incorrect mappings or unwanted discovery columns.
- Parameters:
col_name (str) – The column label to remove from the metadata template.
- delete_relation(prop_key: str, pair: tuple | None = None)[source]
Top-level API to remove semantic links between columns.
- Parameters:
prop_key (str) – The property identifier (e.g., ‘mds:measuredBy’).
pair (tuple, optional) – Specific (subj, obj) columns to un-link. If None, removes all links for that property.
- df_name = 'Unnamed_Dataframe'
- classmethod from_rdf_dir(input_dir: str, orcid: str, metadata_template: dict | None = None, data_relations_dict: dict | None = None, df_name: str = 'Imported_RDF_Data', ontology_graph: Graph | None = None, base_uri: str = 'https://cwrusdle.bitbucket.io/mds/')[source]
Factory method to reconstruct a MatDatSciDf instance and validate semantic integrity from a directory of RDF files.
This method crawls a directory for supported RDF formats, parses the triples, and reconstructs the tabular data (DataFrame) and metadata (JSON-LD Template). It serves as a data audit pipeline by cross-referencing file-level triples against a master template for unit consistency and a user-provided schema for structural integrity.
- Parameters:
input_dir (str) – Path to the directory containing RDF files (JSON-LD, Turtle, etc.).
orcid (str) – The ORCID identifier of the user performing the reconstruction.
data_relations_dict (dict, optional) – The expected Subject-Predicate-Object schema to validate against each file. If provided, mismatches are logged.
df_name (str, optional) – Descriptive name for the resulting DataFrame and validation report. Defaults to “Imported_RDF_Data”.
ontology_graph (rdflib.Graph, optional) – A reference ontology used to resolve labels and CURIEs during validation.
base_uri (str, optional) – The base URI used for semantic subject identification. Defaults to “https://cwrusdle.bitbucket.io/mds/”.
- Returns:
A fully initialized and validated instance containing the reconstructed dataset and associated semantic logs.
- Return type:
MatDatSciDf
- Reports & Logs:
Generates ‘{df_name}_import_validation.txt’ in the input directory.
Logs Unit Conflicts: Flagged if a column unit differs from the first encountered definition.
Logs Schema Mismatches: Flagged if expected semantic links are missing within individual RDF graphs.
Note
Supported extensions: .jsonld, .ttl, .nt, .rdf, .xml.
Missing data columns in specific files are filled with ‘pd.NA’ to maintain tabular integrity.
- get_relation_pairs_onto()[source]
Analyzes the ontology and metadata template to discover relationships between columns.
- Returns:
{ URI: [(subj_col, obj_col), …] }
- Return type:
dict
- get_relations()[source]
Extracts all Object and Datatype properties from the associated ontology.
This method scans the ontology graph for OWL ObjectProperties and DatatypeProperties, mapping their human-readable rdfs:labels to their full URIs and property types.
- Returns:
- A dictionary (prop_metadata_dict) where:
Key: Property label (str)
Value: Tuple of (Property URI, Property Type)
- Return type:
dict
- mds_graph = <Graph identifier=Nc770c1b294194b809dfb70ba6ff6c57d (<class 'rdflib.graph.Graph'>)>
- overwrite_metadata(metadata_template: dict)[source]
Wrapper to delete and replace metadata information. WARNING: THIS WILL DELETE ALL CURRENT METADATA
- save_mds_df(output_dir: str, metadata_in_output_df: bool = False, formats: list = ['csv', 'parquet', 'arrow'])[source]
Saves the internal DataFrame and associated metadata to the local file system.
This method supports multi-format export (CSV, Parquet, Arrow). It can also generate a ‘semantic’ version of the CSV where the first three rows of the file contain the RDF Type, QUDT Unit, and Study Stage for each column, facilitating human readability and FAIR data principles.
- Parameters:
output_dir (str) – The directory path where files will be stored.
metadata_obj (Metadata, optional) – The Metadata management object. If provided, it will also trigger the saving of the JSON-LD template and match logs.
metadata_in_output_df (bool, optional) – If True, prepends three header rows (Type, Units, Study Stage) to the CSV output. Defaults to False.
formats (list, optional) – A list of strings specifying output formats. Supported: ‘csv’, ‘parquet’, ‘arrow’, ‘feather’. Defaults to [“csv”, “parquet”, “arrow”].
Note
When ‘metadata_in_output_df’ is True, only the CSV format will contain the multi-row headers. Parquet and Arrow formats are saved using a ‘clean’ version (data only) to preserve strict schema typing.
For Parquet and Arrow exports, all columns are cast to strings to ensure compatibility with mixed-type metadata fields.
The method automatically standardizes column order alphabetically.
- Returns:
None
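To make the ‘semantic’ CSV layout concrete, here is a small sketch using only the csv module. It assumes the column names come first and are followed by the three metadata rows (Type, Unit, Study Stage) and then the data; the exact row order written by the package is not confirmed here, and the type and stage values are hypothetical.

```python
import csv
import io

# Columns in alphabetical order, matching the note on column ordering.
columns = ["Pressure", "Temperature"]
# Hypothetical metadata values for the three header rows.
types = ["mds:Pressure", "mds:Temperature"]
units = ["unit:PA", "unit:DEG_C"]
stages = ["Processing", "Processing"]
rows = [[101325, 21.5], [101300, 22.0]]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(columns)                      # column names
writer.writerows([types, units, stages])      # three metadata rows
writer.writerows(rows)                        # measurement data
lines = buffer.getvalue().splitlines()
```

A reader of such a file has to skip the metadata rows before parsing numeric values, which is one reason the Parquet and Arrow exports carry data only.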
- save_metadata(output_path: str, matched_log_path: str | None = None, unmatched_log_path: str | None = None)[source]
Wrapper to export the JSON-LD template and the status logs (matched/unmatched columns) to files.
- static search_license(query: str)[source]
Searches the SPDX license database for a matching ID or name.
This is a utility method: it can be called as MatDatSciDf.search_license(“MIT”) without initializing the class (i.e., no DataFrame or ORCID required).
- Parameters:
query (str) – The search term (e.g., ‘Creative Commons’, ‘GPL’, ‘MIT’).
- semantic_remapping(data_graph: Graph)[source]
Validates types against the reference ontology and remaps unrecognized types to the base BFO Entity class.
- serialize_bulk(output_path: str, format='json-ld', row_key_cols: list[str] | None = None, id_cols: list[str] | None = None, license: str | None = None, write_files: bool | None = True) Graph[source]
Aggregates all row-level RDF graphs into a single master file while preserving the original context.
This method performs a “Bulk Serialization” by first generating RDF subgraphs for every row in the DataFrame and then merging them into a singular master Graph object. Unlike ‘serialize_row’, which creates multiple files, this method outputs one unified dataset file, ensuring that the JSON-LD ‘@context’ is applied globally to maintain consistent prefixing (e.g., ‘mds:’, ‘qudt:’) across all entries.
- Parameters:
output_path (str) – The full destination path, including the filename and extension, where the aggregated graph will be saved.
format (str, optional) – The RDF serialization format (e.g., ‘json-ld’, ‘turtle’, ‘xml’). Defaults to ‘json-ld’.
row_key_cols (list[str], optional) – Column names used to generate unique row identifiers.
id_cols (list[str], optional) – Column names to be used as entity identifiers (@id) instead of row keys.
license (str, optional) – SPDX license ID or URI to be applied to the triples.
write_files (bool, optional) – Whether to write serialized data to disk. Defaults to True.
- Returns:
A single aggregated RDFLib Graph object containing the triples for every row in the dataset.
- Return type:
Graph
Note
This method is highly recommended for creating FAIR-compliant datasets destined for Triple Stores or Graph Databases.
It maintains the exact same URI structure and namespace bindings as individual row serializations to ensure interoperability.
The output directory is automatically created if it does not exist.
- serialize_row(output_folder: str, format='json-ld', row_key_cols: list[str] | None = None, id_cols: list[str] | None = None, license: str | None = None, write_files: bool | None = True) list[Graph][source]
Serializes each row of the DataFrame into individual RDF files using the active semantic metadata template.
This method transforms tabular experimental data into Linked Data. It iterates through the DataFrame, generating a unique row identifier (Subject URI) for each entry based on either specified ‘id_cols’ or a hash of the study-stage metadata. It maps cell values to ‘qudt:value’ triples and establishes inter-column relationships defined in the internal ‘data_relations’ manager.
- Parameters:
output_folder (str) – Directory where individual RDF files will be saved.
format (str, optional) – The RDF serialization format. Supported: ‘json-ld’, ‘turtle’, ‘xml’, ‘nt’. Defaults to ‘json-ld’.
row_key_cols (list[str], optional) – Column names used to generate the unique row string used for file naming and internal row indexing.
id_cols (list[str], optional) – Column names whose values should be normalized and used as the primary Subject URI identifier (@id). If None, Subject URIs are generated from the unique row key.
license (str, optional) – An SPDX license identifier (e.g., ‘MIT’) or a full URI. Defaults to ‘CC0-1.0’.
write_files (bool, optional) – If True, writes each row to a file on disk. If False, only returns the list of RDF Graphs. Defaults to True.
- Raises:
ValueError – If the provided license is invalid or if the metadata template is missing required ‘skos:altLabel’ definitions.
- Returns:
A list of RDFLib Graph objects, each representing one row of experimental data and its associated semantic context.
- Return type:
List[rdflib.Graph]
Note
Parent directories for output_folder are created automatically.
Files are named using the pattern: ‘{random_suffix}-{row_key}.{ext}’.
Triples for ‘pd.NA’ or empty string values are omitted to maintain graph sparsity and data integrity.
- template_generator(skip_prompts: bool = False)[source]
Generates a semantic metadata template by mapping DataFrame columns to ontology terms.
This method performs a fuzzy match between column headers and the loaded ontology. It attempts to automatically resolve the RDF type (@type), study stage, and units. If a direct match is not found, or if ‘skip_prompts’ is False, it can interactively prompt the user to provide missing metadata fields.
The resulting template follows the JSON-LD structure, integrating namespaces such as QUDT, SKOS, PROV, and MDS.
- Parameters:
skip_prompts (bool, optional) – If True, suppresses interactive user input for missing units or definitions, instead using ‘UNITLESS’ or placeholders. Defaults to False.
- Returns:
- A tuple containing:
metadata_template (dict): The complete JSON-LD dictionary with ‘@context’ and ‘@graph’ entries for each column.
matched_log (list): A list of strings documenting successful fuzzy-match associations (Column => IRI).
unmatched_log (list): A list of column names that could not be found in the provided ontology.
- Return type:
tuple
Note
The method prioritizes metadata explicitly included in the first three rows of the CSV (type, unit, study stage).
Unit extraction handles both raw strings (e.g., ‘unit:KiloGM’) and string-encoded dictionaries (e.g., “{‘@id’: ‘unit:M’}”).
Time-stamping via ‘prov:generatedAtTime’ is applied to each entry for provenance tracking.
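The two unit encodings mentioned in the note can be handled with ast.literal_eval, as in the sketch below. This is one plausible approach under a hypothetical function name, not necessarily how the package parses units.

```python
import ast


def parse_unit_sketch(cell):
    """Return the unit identifier from a raw string or a string-encoded dict."""
    text = str(cell).strip()
    if text.startswith("{"):
        try:
            parsed = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            return text  # not a valid literal; keep the raw text
        if isinstance(parsed, dict) and "@id" in parsed:
            return parsed["@id"]
    return text


raw = parse_unit_sketch("unit:KiloGM")
encoded = parse_unit_sketch("{'@id': 'unit:M'}")
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for cell contents read from a file.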
- update_metadata(col_name: str, field: str, value: str)[source]
Wrapper to update a metadata property (unit, type, definition, etc.) for a specific column.
- update_metadata_bulk(metadata_template: dict)[source]
Wrapper to update the metadata template in bulk for multiple columns.
- validate_data_relations()[source]
Wrapper to validate relations using the instance’s own data and ontology.
- validate_metadata() bool[source]
Performs a two-way integrity check between the DataFrame and the Metadata Template.
- Category 1 (Undefined Data Columns):
Columns in the DataFrame that are NOT defined in the Metadata. -> Result: These will be skipped during serialization.
- Category 2 (Empty Metadata Entries):
Definitions in the Metadata that have no matching column in the DataFrame. -> Result: These will create ‘empty’ RDF nodes with no measurement values.
- Returns:
True if data and metadata are perfectly aligned, False otherwise.
- Return type:
bool
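The two-way check amounts to a pair of set differences. The sketch below works on plain lists of names rather than a DataFrame and a JSON-LD template, and it returns the two categories alongside the boolean, whereas the documented method returns only the boolean. The function name is hypothetical.

```python
def validate_alignment(df_columns, metadata_labels):
    """Two-way comparison between data columns and metadata definitions."""
    data = set(df_columns)
    meta = set(metadata_labels)
    undefined_columns = sorted(data - meta)  # Category 1: skipped on serialization
    empty_entries = sorted(meta - data)      # Category 2: nodes without values
    aligned = not undefined_columns and not empty_entries
    return aligned, undefined_columns, empty_entries


aligned, undefined_columns, empty_entries = validate_alignment(
    ["Temperature", "Pressure", "Notes"],
    ["Temperature", "Pressure", "Density"],
)
```

In this example "Notes" falls into Category 1 and "Density" into Category 2, so the data and metadata are not aligned.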
- view_data_relations()[source]
Displays a visual validation report for the provided DataRelationsDict.