FAIRLinked.QBWorkflow package

Submodules

FAIRLinked.QBWorkflow.data_parser module

FAIRLinked.QBWorkflow.data_parser.read_excel_template(file_path)

Reads the Excel data file generated by generate_data_xlsx_template and returns:
  • A flat dictionary containing metadata for each variable, including the category.

  • A DataFrame containing the experimental data, with variable names (without category prefixes) as columns.

Parameters:

file_path (str) – The path to the Excel file.

Returns:

A tuple containing:
  • variable_metadata (dict): Flat dictionary with variable names as keys, and values being metadata dictionaries including ‘Category’.

  • df (DataFrame): Experimental data with variable names as columns (without category prefixes).

Return type:

tuple

FAIRLinked.QBWorkflow.data_template_generator module

FAIRLinked.QBWorkflow.data_template_generator.generate_data_xlsx_template(children_terms, output_xlsx_file_path)

Generates an Excel template for data collection with various features such as merged cells, category headers, colors, formatting, and borders. If no categories are provided, a ‘Miscellaneous’ category is generated.

Parameters:
  • children_terms (dict) – A dictionary where keys represent ontology categories (e.g., ‘mds:tool’) and values are lists of terms (e.g., [‘InstrumentId’, ‘InstrumentName’]). If categories are empty, a default ‘Miscellaneous’ category is created.

  • output_xlsx_file_path (str) – The path where the generated Excel file should be saved.

Returns:

The function creates and saves an Excel file with the specified formatting at the provided path.

Return type:

None

FAIRLinked.QBWorkflow.input_handler module

FAIRLinked.QBWorkflow.input_handler.check_if_running_experiment() → bool

Prompts the user to answer whether they are currently running an experiment, accepting only ‘yes’ or ‘no’.

Parameters:

None

Returns:

Returns True if the user enters ‘yes’, otherwise False.

Return type:

bool

Raises:

ValueError – If the input is not ‘yes’ or ‘no’.

FAIRLinked.QBWorkflow.input_handler.check_ingestion() → str

Asks the user whether the data is intended for ingestion into CRADLE.

Returns:

True if data is for ingesting into CRADLE, False otherwise

Return type:

bool

FAIRLinked.QBWorkflow.input_handler.check_valid_id(user_input)

FAIRLinked.QBWorkflow.input_handler.choose_conversion_mode() → str

Prompts the user to decide whether to convert the entire DataFrame as one dataset or each row as an individual dataset for RDF conversion.

Returns:

‘entire’ or ‘row-by-row’

Return type:

str

FAIRLinked.QBWorkflow.input_handler.get_approved_id_columns(candidate_id_columns: List[str], mode: str) → List[str]
Description:

Given a list of candidate ID columns (those that contain ‘id’ in their name), prompts the user for each column whether they want it included in the dataset naming. The prompt text changes depending on whether we are in ‘row-by-row’ mode or ‘entire’ mode.

Algorithm:
  1. If there are no candidate_id_columns, print a message and return empty list.

  2. Print each candidate column to the user, asking if it should be included in naming.
    • If mode == ‘row-by-row’, mention “row-based dataset naming.”

    • If mode == ‘entire’, mention “slice naming for entire dataset.”

  3. Collect approved columns in a list.

  4. Return that list.

Parameters:
  • candidate_id_columns (List[str]) – Columns that appear to be ID-like (contain “id” or “ID”).

  • mode (str) – ‘row-by-row’ or ‘entire’—used to tailor the prompt text.

Returns:

Subset of candidate_id_columns that the user approves for naming.

Return type:

List[str]
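
The prompt loop described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the function name approve_id_columns, the injectable ask callable, the prompt wording, and the rule that only the answer ‘yes’ approves a column are all assumptions made so the logic can be exercised without a terminal.

```python
from typing import Callable, List


def approve_id_columns(
    candidate_id_columns: List[str],
    mode: str,
    ask: Callable[[str], str] = input,  # hypothetical hook; defaults to the built-in input()
) -> List[str]:
    """Ask, per candidate column, whether it should take part in naming."""
    if not candidate_id_columns:
        print("No candidate ID columns found.")
        return []

    # Tailor the prompt text to the conversion mode.
    if mode == "row-by-row":
        purpose = "row-based dataset naming"
    else:
        purpose = "slice naming for entire dataset"

    approved = []
    for column in candidate_id_columns:
        answer = ask(f"Include '{column}' in {purpose}? (yes/no): ")
        if answer.strip().lower() == "yes":  # assumed acceptance rule
            approved.append(column)
    return approved
```

Passing a scripted ask callable in place of input makes the approval logic testable in isolation.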

FAIRLinked.QBWorkflow.input_handler.get_dataset_name() → str

Prompts the user to enter a name for their dataset. Only allows letters and underscores. Offers ‘SampleDataset’ as a fallback option if invalid input is provided.

Returns:

A valid dataset name containing only letters, numbers, and underscores.

Return type:

str

FAIRLinked.QBWorkflow.input_handler.get_domain(domains_hashset: Set[str]) → str

Prompts the user to select a domain from the available options (present in a hashset), ensuring proper handling of spaces and case.

Parameters:

domains_hashset (set) – A set of available domain names.

Returns:

The selected domain name in lowercase.

Return type:

str

Raises:

ValueError – If the input is not a valid number corresponding to a domain in the list.

FAIRLinked.QBWorkflow.input_handler.get_identifiers(approved_id_cols) → dict

Prompts the user to enter identifiers.

FAIRLinked.QBWorkflow.input_handler.get_input_data_excel() → str

Prompts the user to enter the file path for a data Excel file and validates whether the file exists.

Returns:

The valid file path of the Excel file.

Return type:

str

Raises:

FileNotFoundError – If the file path provided by the user does not exist.

FAIRLinked.QBWorkflow.input_handler.get_input_namespace_excel() → str

Prompts the user to enter the file path for a namespace Excel file and validates whether the file exists.

Returns:

The valid file path of the Excel file.

Return type:

str

Raises:

FileNotFoundError – If the file path provided by the user does not exist.

FAIRLinked.QBWorkflow.input_handler.get_namespace_for_dataset(namespace_map: Dict[str, str]) → str

Prompts the user to select a namespace for their dataset, excluding predefined standard vocabulary namespaces.

Parameters:

namespace_map (dict) – A dictionary where the keys are namespace prefixes and the values are corresponding base URIs.

Returns:

The selected namespace prefix.

Return type:

str

Raises:

ValueError – If the user selects a number outside the valid range.

FAIRLinked.QBWorkflow.input_handler.get_ontology_file(prompt_message: str) → str

Prompts the user to enter the file path for an ontology file and validates whether it exists and has the correct extension (.ttl).

Parameters:

prompt_message (str) – Custom prompt message to specify which ontology file is being requested.

Returns:

The valid file path of the ontology file.

Return type:

str

Raises:
  • FileNotFoundError – If the file path provided does not exist.

  • ValueError – If the file does not have a .ttl extension.
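
The two checks behind these exceptions can be sketched as a small validator. This is an illustration only: the helper name validate_ontology_path, the order of the checks, and the case-insensitive extension comparison are assumptions, and the interactive prompting is left out.

```python
import os


def validate_ontology_path(file_path: str) -> str:
    """Return file_path if it exists and ends in .ttl, otherwise raise."""
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"No such file: {file_path}")
    # Assumption: the extension is compared case-insensitively.
    if not file_path.lower().endswith(".ttl"):
        raise ValueError(f"Expected a .ttl file, got: {file_path}")
    return file_path
```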

FAIRLinked.QBWorkflow.input_handler.get_orcid() → str

Prompts the user to input an ORCID, ensuring that it conforms to the proper format by using a validation function.

Parameters:

None

Returns:

A valid ORCID string.

Return type:

str

Raises:

ValueError – If the ORCID is empty or invalid.

FAIRLinked.QBWorkflow.input_handler.get_output_folder_path() → str

Prompts the user to provide an output folder path, and creates the folder if it does not exist.

Parameters:

None

Returns:

The valid path to the output folder.

Return type:

str

Raises:

NotADirectoryError – If the path provided is not a valid directory.

FAIRLinked.QBWorkflow.input_handler.get_row_identifier_columns(df) → List[str]

FAIRLinked.QBWorkflow.input_handler.has_all_ontology_files() → bool

Prompts the user to confirm availability of all required ontology files.

Parameters:

None

Returns:

True if the user indicates they have both required ontology files (lowest-level, combined), False otherwise.

Return type:

bool

Raises:

ValueError – If the user provides an answer other than ‘yes’ or ‘no’.

FAIRLinked.QBWorkflow.input_handler.has_existing_datacube_file() → Tuple[bool, str]

Prompts the user to specify if they have an existing RDF data cube dataset, which may be either:

  • A single file (.ttl, .jsonld)

  • A directory that contains .ttl/.jsonld files

Returns:

  • (False, “”) if the user says ‘no’.

  • (True, path) if the user says ‘yes’ and provides a valid path that exists (whether it is a file or a directory).

FAIRLinked.QBWorkflow.input_handler.should_save_csv() → bool

Asks the user whether they want to save the DataFrame as CSV.

Returns:

True if user wants to save CSV, False otherwise

Return type:

bool

FAIRLinked.QBWorkflow.mds_ontology_analyzer module

FAIRLinked.QBWorkflow.mds_ontology_analyzer.classify_leaf_nodes(combined_ontology_path: str, leaf_nodes: Set[URIRef], top_level_terms: Set[str]) → Tuple[Dict[str, List[str]], List[str]]

Classifies each leaf node into a top-level category by traversing upward along rdfs:subClassOf and skos:broader relationships until a known top-level category is found.

Algorithm:
  1. Parse the combined ontology into a graph.

  2. For each leaf node, recursively follow rdfs:subClassOf and skos:broader upwards.

  3. If a top-level category is reached, classify the leaf under that category.

  4. If no top-level category is found, mark the leaf as missing.

This uses memoization to avoid repeated traversals of the same class.

Time Complexity: O(N + E) with memoization, where N is number of nodes and E is number of edges.

Space Complexity: O(N) for memoization and classification structures.

Parameters:
  • combined_ontology_path (str) – Path to the combined ontology (.ttl file).

  • leaf_nodes (Set[URIRef]) – Set of leaf node URIs identified from the low-level ontology.

  • top_level_terms (Set[str]) – Set of URIs representing top-level categories.

Returns:

  • Dictionary mapping top-level category URIs to a list of leaf node URIs.

  • List of URIs for leaf nodes that couldn’t be mapped.

Return type:

Tuple[Dict[str, List[str]], List[str]]
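
The memoized upward traversal can be sketched without rdflib by standing in a plain dictionary for the parsed graph. Everything named here is hypothetical: parents_of maps each class to its rdfs:subClassOf and skos:broader parents, and classify_leaves is not the package's function. The sketch assumes an acyclic hierarchy; the path check only guards against infinite recursion.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple


def classify_leaves(
    parents_of: Dict[str, List[str]],
    leaf_nodes: Iterable[str],
    top_level_terms: Set[str],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Map each leaf to the first top-level ancestor found, or mark it missing."""
    memo: Dict[str, Optional[str]] = {}  # class -> top-level category (or None)

    def find_top(node: str, path: frozenset) -> Optional[str]:
        if node in memo:
            return memo[node]
        if node in top_level_terms:
            result: Optional[str] = node
        else:
            result = None
            for parent in parents_of.get(node, []):
                if parent in path:  # skip classes already on this walk
                    continue
                result = find_top(parent, path | {parent})
                if result is not None:
                    break
        memo[node] = result
        return result

    classification: Dict[str, List[str]] = {}
    missing: List[str] = []
    for leaf in sorted(leaf_nodes):  # sorted only to make the output deterministic
        top = find_top(leaf, frozenset({leaf}))
        if top is None:
            missing.append(leaf)
        else:
            classification.setdefault(top, []).append(leaf)
    return classification, missing
```

Because every visited class is cached in memo, shared ancestors are resolved once, which is what keeps the traversal linear in nodes plus edges.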

FAIRLinked.QBWorkflow.mds_ontology_analyzer.find_leaf_nodes(lowest_level_ontology_path: str) → Set[URIRef]

Identifies leaf nodes in the lowest-level ontology. Leaf nodes are classes that do not serve as a superclass of any other class within the MDS namespace.

Algorithm:
  1. Parse the lowest-level ontology and build an RDF graph.

  2. Gather all classes (subjects of RDFS.subClassOf) and their superclasses.

  3. Classes that never appear as an RDFS.subClassOf object are considered leaf nodes.

Time Complexity: O(N + E) where N is the number of classes and E the number of subclass relations.

Space Complexity: O(N) for storing classes and relationships.

Parameters:

lowest_level_ontology_path (str) – Path to the low-level ontology (.ttl file).

Returns:

A set of URIs representing leaf classes.

Return type:

Set[URIRef]
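
The leaf test reduces to a set difference. The sketch below uses (subclass, superclass) string pairs as a stand-in for the rdfs:subClassOf triples of a parsed graph; the function name leaf_classes is hypothetical, and the MDS-namespace filter mentioned above is omitted.

```python
from typing import Iterable, Set, Tuple


def leaf_classes(subclass_pairs: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return classes that are a subclass of something but a superclass of nothing."""
    subjects: Set[str] = set()  # classes appearing on the left of subClassOf
    objects: Set[str] = set()   # classes appearing on the right of subClassOf
    for subclass, superclass in subclass_pairs:
        subjects.add(subclass)
        objects.add(superclass)
    return subjects - objects
```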

FAIRLinked.QBWorkflow.mds_ontology_analyzer.get_classification(lowest_level_ontology_path: str, combined_ontology_path: str) → Tuple[Dict[str, List[str]], List[str]]

High-level function that coordinates:
  1. Finding leaf nodes from the low-level ontology.

  2. Identifying top-level categories directly from the combined ontology.

  3. Classifying leaf nodes under these top-level categories.

  4. Updating category colors and converting URIs to prefixed forms.

Parameters:
  • lowest_level_ontology_path (str) – Path to the lowest-level MDS ontology (.ttl file).

  • combined_ontology_path (str) – Path to the combined MDS ontology (.ttl file).

Returns:

  • classification_prefixed: A dictionary with prefixed category URIs as keys and lists of prefixed leaf nodes as values.

  • missing_top_terms_prefixed: A list of prefixed URIs for terms that couldn’t be mapped.

Return type:

Tuple[Dict[str, List[str]], List[str]]

FAIRLinked.QBWorkflow.mds_ontology_analyzer.get_prefixed_name(uri: str | URIRef) → str

Converts a full URI to its corresponding prefixed form using the global namespace mappings.

Algorithm:
  1. Convert the URIRef to a string if needed.

  2. Iterate over NAMESPACE_MAP to find a prefix whose namespace is a prefix of the given URI.

  3. If found, return prefix:LocalName. Otherwise, return the original URI string.

Parameters:

uri (Union[str, URIRef]) – The URI to convert.

Returns:

Prefixed form of the URI (e.g. ‘mds:SampleSize’).

Return type:

str
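
A minimal sketch of the prefix lookup, under stated assumptions: EXAMPLE_NAMESPACE_MAP is a stand-in for the package's NAMESPACE_MAP, the ‘mds’ base URI in it is a placeholder rather than the real one, and to_prefixed_name is a hypothetical name. With overlapping namespaces, the first match in iteration order wins here.

```python
from typing import Dict

EXAMPLE_NAMESPACE_MAP: Dict[str, str] = {
    "mds": "https://example.org/mds#",          # placeholder URI, not the real MDS namespace
    "qb": "http://purl.org/linked-data/cube#",  # RDF Data Cube vocabulary
}


def to_prefixed_name(uri, namespace_map: Dict[str, str] = EXAMPLE_NAMESPACE_MAP) -> str:
    """Return prefix:LocalName if a known namespace matches, else the URI string."""
    uri_str = str(uri)  # also accepts rdflib URIRef, which subclasses str
    for prefix, base_uri in namespace_map.items():
        if uri_str.startswith(base_uri):
            return f"{prefix}:{uri_str[len(base_uri):]}"
    return uri_str
```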

FAIRLinked.QBWorkflow.mds_ontology_analyzer.get_top_level_terms_from_combined(combined_ontology_path: str) → Set[str]

Derives top-level categories directly from the combined ontology. This removes the need for a separate top-level ontology file.

A top-level category is defined as a class that appears as a ‘broader’ concept (object of SKOS.broader) but does not appear as a narrower concept for any other class within the MDS namespace. If no such classes are found, we consider classes with no broader relations as top-level.

Algorithm:
  1. Parse the combined ontology.

  2. For all triples (narrower SKOS.broader broader), record narrower and broader classes.

  3. Top-level categories are those that appear as broader but never as narrower.

  4. If none are found this way, fall back to classes that never appear as narrower at all.

Time Complexity: O(N + E) where N is number of classes and E is number of SKOS.broader relationships.

Space Complexity: O(N) for storing class sets and relationships.

Parameters:

combined_ontology_path (str) – Path to the combined MDS ontology (.ttl file).

Returns:

A set of URIs for top-level category classes.

Return type:

Set[str]
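
The selection rule and its fallback can be sketched with plain sets. The (narrower, broader) pairs stand in for skos:broader triples; the function name top_level_terms and the all_classes argument (the pool used by the fallback) are assumptions, since the documentation does not say how the fallback's candidate classes are collected.

```python
from typing import Iterable, Set, Tuple


def top_level_terms(
    broader_pairs: Iterable[Tuple[str, str]],
    all_classes: Iterable[str] = (),
) -> Set[str]:
    """Classes that appear as 'broader' but never as 'narrower'."""
    pairs = list(broader_pairs)
    narrower = {n for n, _ in pairs}
    broader = {b for _, b in pairs}
    top = broader - narrower
    if not top:
        # Fallback: any known class that never appears as a narrower concept.
        top = set(all_classes) - narrower
    return top
```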

FAIRLinked.QBWorkflow.mds_ontology_analyzer.update_category_colors(categories: Set[str]) → None

Updates the global CATEGORY_COLORS dictionary with color assignments for each category.

The function ensures that each category (in prefixed form e.g. ‘mds:tool’) is assigned a unique color from the LIGHT_COLORS palette, cycling through colors if needed.

High-level logic:
  1. Clear existing color assignments.

  2. Convert full URIs to prefixed form (e.g. ‘http://…#tool’ -> ‘mds:tool’).

  3. Assign colors from palette to each prefixed category.

  4. Update ONTO_CORE_CATEGORIES with final set of categories.

Parameters:

categories (Set[str]) – Set of category URIs to assign colors to (e.g. {‘http://…#tool’, ‘http://…#recipe’})

Global Effects:
  • Updates CATEGORY_COLORS with mappings like {‘mds:tool’: ‘FFE6E6’, ‘mds:recipe’: ‘E6FFE6’}

  • Updates ONTO_CORE_CATEGORIES with prefixed category names
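
The palette cycling can be sketched as below. This is not the package's code: it returns a new dictionary instead of mutating the CATEGORY_COLORS global, takes an ordered sequence so the result is deterministic, and uses EXAMPLE_PALETTE as a stand-in for LIGHT_COLORS (the first two hex values come from the example above, the third is invented).

```python
from itertools import cycle
from typing import Dict, Sequence

EXAMPLE_PALETTE = ["FFE6E6", "E6FFE6", "E6E6FF"]  # third entry is an assumption


def assign_category_colors(
    prefixed_categories: Sequence[str],
    palette: Sequence[str] = EXAMPLE_PALETTE,
) -> Dict[str, str]:
    """Pair each category with the next palette color, wrapping around when exhausted."""
    colors = cycle(palette)
    return {category: next(colors) for category in prefixed_categories}
```

Once there are more categories than palette entries, colors repeat, so uniqueness holds only up to the palette size.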

FAIRLinked.QBWorkflow.namespace_parser module

FAIRLinked.QBWorkflow.namespace_parser.parse_excel_to_namespace_map(excel_file_path)

Parses the Excel file containing namespaces and base URIs. Updates and returns the namespace map.

Parameters:

excel_file_path (str) – The path to the Excel file to parse.

Returns:

Updated namespace map.

Return type:

dict

FAIRLinked.QBWorkflow.namespace_template_generator module

FAIRLinked.QBWorkflow.namespace_template_generator.generate_namespace_excel(excel_file_path)

Generates an Excel file listing default namespaces and their corresponding base URIs. The Excel file is styled with borders, colored headers, and centered text for better readability.

Parameters:

excel_file_path (str) – The path where the generated Excel file will be saved.

Returns:

The function saves an Excel file containing the namespace-URI mappings to the provided file path.

Return type:

None

FAIRLinked.QBWorkflow.rdf_data_cube_workflow module

FAIRLinked.QBWorkflow.rdf_data_cube_workflow.parse_existing_datacube_workflow(file_path: str)

If the user has an existing RDF data cube file or a directory of such files, parse it/them into a tabular format and optionally save as CSV.

Parameters:

file_path (str) – Either a path to a single .ttl/.jsonld file or a directory containing multiple .ttl/.jsonld/.json-ld files.

FAIRLinked.QBWorkflow.rdf_data_cube_workflow.rdf_data_cube_workflow_start()

Welcome to FAIRLinked 🚀

The entry point for the FAIRLinked data processing workflow using RDF Data Cube.

Steps Overview:
  1. Checks if an existing RDF data cube file/folder is present.
    • If yes, parse it back to tabular format (optionally saving CSV).

  2. If no existing data cube, prompts whether the user is running an experiment or not.
    • If experiment, generate namespace & data templates (with optional ontology analysis).

    • Otherwise, parse existing Excel files for namespaces & data, then convert them to RDF in ‘entire’ or ‘row-by-row’ mode.

FAIRLinked.QBWorkflow.rdf_data_cube_workflow.run_experiment_workflow()

Generates namespace and data templates with optional ontology analysis for FAIRLinked.QBWorkflow.

Steps:
  1. Check if the user has local ontology files (lowest-level & combined).

  2. If found, run classification => map terms to categories.

  3. Generate ‘namespace_template.xlsx’ and ‘data_template.xlsx’, optionally populating with mapped terms.

FAIRLinked.QBWorkflow.rdf_data_cube_workflow.run_ingestion_workflow()

Processes namespace and data Excel files to generate RDF outputs with FAIRLinked.QBWorkflow.

Steps:
  1. Gather user inputs (ORCID, namespace/data Excel, output folder).

  2. Prompt for conversion mode (entire or row-by-row).

  3. If entire mode => ask for dataset name; if row-by-row => skip it.

  4. Parse the Excel templates => produce RDF using convert_dataset_to_rdf_with_mode.

FAIRLinked.QBWorkflow.rdf_data_cube_workflow.run_standard_workflow()

Processes namespace and data Excel files to generate RDF outputs with FAIRLinked.QBWorkflow.

Steps:
  1. Gather user inputs (ORCID, namespace/data Excel, output folder).

  2. Prompt for conversion mode (entire or row-by-row).

  3. If entire mode => ask for dataset name; if row-by-row => skip it.

  4. Parse the Excel templates => produce RDF using convert_dataset_to_rdf_with_mode.

FAIRLinked.QBWorkflow.rdf_to_df module

FAIRLinked.QBWorkflow.rdf_to_df.parse_rdf_to_df(file_path: str, variable_metadata_json_path: str, arrow_output_path: str) → tuple
Description:

Parses one or multiple RDF Data Cube file(s) (TTL or JSON-LD) into a single Pandas DataFrame plus a consolidated variable_metadata dictionary. This function supports both “row-by-row” style RDF (each row => separate qb:DataSet) and “entire” style RDF (one qb:DataSet with many slices), as well as any mixture of them (multiple DataSets across multiple files).

After parsing each file’s DataSets, it merges the partial DataFrames and merges partial metadata:

  • Merges units from different Observations

  • Merges altLabels, categories, and measure/dimension flags

Then sorts the resulting DataFrame and writes:
  1. The final DataFrame => Parquet

  2. The final variable_metadata => JSON

Finally, prints summary stats and previews the first row.

Algorithm (High-Level):
  1. Gather all valid RDF files (.ttl/.jsonld/.json-ld) from either a single file path or a directory (recursively).

  2. Initialize an empty list of partial DataFrames (all_dfs) and an empty dictionary for final_variable_metadata.

  3. For each RDF file:
    1. Determine the rdflib parse format (‘turtle’ or ‘json-ld’).

    2. Parse the Graph.

    3. Pass the Graph to _parse_single_rdf_graph(…) which may produce: (partial_df, partial_metadata).

    4. Concatenate partial_df to the global list (if not empty).

    5. Merge partial_metadata into final_variable_metadata, unifying measure units, altLabels, categories, etc.

  4. Concatenate all partial DataFrames if any => final_df.

  5. Sort final_df by “ExperimentId” if present.

  6. Reorder columns by (Category, ColumnName), with “ExperimentId” forced to front if it exists.

  7. Convert final_df => PyArrow Table => Parquet => arrow_output_path.

  8. Dump final_variable_metadata => JSON => variable_metadata_json_path.

  9. Print summary stats & preview.

Parameters:
  • file_path (str) – Path to either a single .ttl/.jsonld file or a folder containing multiple .ttl/.jsonld files.

  • variable_metadata_json_path (str) – Destination to write the final variable_metadata as JSON.

  • arrow_output_path (str) – Destination to write the final PyArrow Table (saved in Parquet format).

Returns:

  • pa.Table => The final table of observations, after merging across all files.

  • dict => The final merged variable_metadata mapping each column => metadata.

Return type:

(pa.Table, dict)
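
Steps 1 and 3.1 of the algorithm above (collecting the RDF files and choosing a parse format) can be sketched with the standard library alone. The helper names gather_rdf_files and rdflib_format_for are hypothetical, and the case-insensitive extension match and sorted output are assumptions; the format strings ‘turtle’ and ‘json-ld’ are the ones named in the algorithm.

```python
import os
from typing import List

RDF_EXTENSIONS = (".ttl", ".jsonld", ".json-ld")


def gather_rdf_files(path: str) -> List[str]:
    """Collect RDF files from a single file path or, recursively, from a directory."""
    if os.path.isfile(path):
        return [path] if path.lower().endswith(RDF_EXTENSIONS) else []
    found = []
    for folder, _subdirs, filenames in os.walk(path):
        for name in filenames:
            if name.lower().endswith(RDF_EXTENSIONS):
                found.append(os.path.join(folder, name))
    return sorted(found)


def rdflib_format_for(path: str) -> str:
    """Pick the parse format from the file extension."""
    return "turtle" if path.lower().endswith(".ttl") else "json-ld"
```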

FAIRLinked.QBWorkflow.rdf_transformer module

FAIRLinked.QBWorkflow.rdf_transformer.add_component_to_dsd(dsd_graph: Graph, dsd_uri: URIRef, prop_uri: URIRef, component_type: URIRef, prop_type: URIRef) → None
Description:

Adds a dimension/measure/attribute property to the qb:DataStructureDefinition by creating a blank node and linking it accordingly.

Algorithm:
  1. Create a blank node => component.

  2. dsd_uri – qb:component –> component

  3. component – (component_type) –> prop_uri (e.g. dimension => prop_uri)

  4. prop_uri – rdf:type –> prop_type

Parameters:
  • dsd_graph (Graph) – The Graph that holds the DSD.

  • dsd_uri (URIRef) – The DataStructureDefinition node.

  • prop_uri (URIRef) – The property URI for this dimension/measure/attribute.

  • component_type (URIRef) – e.g. qb:dimension, qb:measure, qb:attribute.

  • prop_type (URIRef) – e.g. qb:DimensionProperty, qb:MeasureProperty, qb:AttributeProperty.

Returns:

None

FAIRLinked.QBWorkflow.rdf_transformer.compute_file_hash(file_path: str) → str
Description:

Computes the SHA-256 hash of a given file by reading its contents in chunks.

Algorithm:
  1. Initialize a sha256 object (hashlib.sha256).

  2. Open the file in ‘rb’ mode.

  3. Read the file in 4096-byte chunks, updating the hash object each time.

  4. Return the hex digest of the final hash.

Parameters:

file_path (str) – The path to the file that needs hashing.

Returns:

The SHA-256 hex digest string for the file contents.

Return type:

str
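
The chunked hashing described above follows a common hashlib pattern, sketched here. The function name sha256_of_file and the chunk_size parameter are illustrative; only the 4096-byte default and the SHA-256 choice come from the description.

```python
import hashlib


def sha256_of_file(file_path: str, chunk_size: int = 4096) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use constant regardless of file size, which matters when hashing large serialized graphs.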

FAIRLinked.QBWorkflow.rdf_transformer.convert_dataset_to_rdf_with_mode(df: DataFrame, variable_metadata: dict, namespace_map: dict, user_chosen_prefix: str = 'mds', output_folder_path: str = '.', orcid: str = '', dataset_name: str = 'SampleDataset', fixed_dimensions: list | None = None, conversion_mode: str = 'entire') → None
Description:
Main entry point for converting a Pandas DataFrame to RDF using either:
  • ‘entire’: single qb:DataSet with multiple qb:Slices

  • ‘row-by-row’: each row => a separate qb:DataSet

Also writes a naming conventions .txt file describing how URIs and filenames are formed.

Algorithm:
  1. create_root_folder => get a top-level folder named “Output_{orcidDigits}_{timestamp}”.

  2. prepare_namespaces => validate prefix => namespace URIs.

  3. Branch on conversion_mode:
    • ‘entire’: create_subfolders => ‘ttl’, ‘jsonld’, ‘hash’; build combined_iri => dataset_name_for_iri; call convert_entire_dataset(…).

    • ‘row-by-row’: call convert_row_by_row(…).

    • Any other value: raise ValueError.

  4. write_naming_conventions_doc => describing the chosen naming approach

  5. Print success message.

Parameters:
  • df (pd.DataFrame) – The data to convert.

  • variable_metadata (dict) – column => metadata (like IsMeasure, Unit, Category, etc.).

  • namespace_map (dict) – prefix => base URI for RDF

  • user_chosen_prefix (str) – e.g. ‘mds’

  • output_folder_path (str) – base folder path for new output folder

  • orcid (str) – user’s ORCID

  • dataset_name (str) – used if ‘entire’ mode

  • fixed_dimensions (list or None) – used in entire mode to specify columns that remain the same across slices

  • conversion_mode (str) – ‘entire’ or ‘row-by-row’

Returns:

None

FAIRLinked.QBWorkflow.rdf_transformer.convert_entire_dataset(df: DataFrame, variable_metadata: dict, ns_map: dict, user_chosen_prefix: str, dataset_name: str, orcid: str, output_folder_paths: dict, fixed_dimensions: list | None = None, overall_timestamp: str | None = None)
Description:

Converts the entire DataFrame into a single qb:DataSet, and each row becomes a qb:Slice. Observations are created for each measure in each row. If user picks ID columns => these become part of the slice’s name.

Algorithm:
  1. Identify columns with ‘id’. Prompt user with mode=‘entire’ => get_approved_id_columns(…).

  2. Create a DataStructureDefinition for the entire DF => dimensions + measures.

  3. Add a single qb:DataSet => e.g. mds:Dataset_{datasetName}.

  4. Create a single qb:SliceKey referencing dimension properties.

  5. For each row => create qb:Slice => name derived from (someIDs + orcid + timestamp).

  6. Within that slice => create Observations (one per measure).

  7. Write a single .ttl/.jsonld + .sha256 hash to the respective subfolders.

Parameters:
  • df (pd.DataFrame) – The entire dataset in tabular form.

  • variable_metadata (dict) – column => metadata dictionary.

  • ns_map (dict) – prefix => rdflib.Namespace

  • user_chosen_prefix (str) – e.g. ‘mds’

  • dataset_name (str) – e.g. ‘SampleDataset’ (already sanitized or not).

  • orcid (str) – The user’s ORCID from which we parse digits for naming.

  • output_folder_paths (dict) – subfolders => their paths.

  • fixed_dimensions (list or None) – optionally specify columns that remain fixed in every slice.

  • overall_timestamp (str or None) – If provided, used for consistent naming across slices.

Returns:

None

FAIRLinked.QBWorkflow.rdf_transformer.convert_row_by_row(df: DataFrame, variable_metadata: dict, ns_map: dict, user_chosen_prefix: str, orcid: str, root_folder_path: str, overall_timestamp: str)
Description:

Converts each row of a DataFrame into its own qb:DataSet in RDF, prompting the user to choose which ID columns to incorporate in naming. Each row-based dataset shares the same folder timestamp to keep them grouped.

Algorithm:
  1. Identify candidate ID columns (contain ‘id’), pass ‘row-by-row’ to get_approved_id_columns(…).

  2. Extract which columns are dimensions vs. measures => create a single DSD for entire DF.

  3. For each row => build a new Graph that:
    • Copies the DSD

    • Creates a new qb:DataSet => mds:Dataset_{someIDs}_{orcid}_{timestamp}

    • Creates a SliceKey => mds:SliceKey_{someIDs}_{orcid}_{timestamp}

    • Creates a Slice => mds:Slice_{someIDs}_{orcid}_{timestamp}

    • Adds Observations for each measure

  4. Write each row’s TTL/JSON-LD + .sha256 hash in subfolders.

Parameters:
  • df (pd.DataFrame) – The entire DataFrame to convert row-by-row.

  • variable_metadata (dict) – column => metadata dictionary.

  • ns_map (dict) – prefix => Namespace mapping.

  • user_chosen_prefix (str) – e.g. ‘mds’.

  • orcid (str) – The user’s ORCID, from which we extract digits.

  • root_folder_path (str) – The top-level folder for outputs.

  • overall_timestamp (str) – The run’s global timestamp for consistent naming.

Returns:

None

FAIRLinked.QBWorkflow.rdf_transformer.convert_row_by_row_CRADLE(df: DataFrame, variable_metadata: dict, ns_map: dict, user_chosen_prefix: str, orcid: str, root_folder_path: str, overall_timestamp: str)
Description:

Converts each row of a DataFrame into its own qb:DataSet in RDF, prompting the user to choose which ID columns to incorporate in naming. Each row-based dataset shares the same folder timestamp to keep them grouped.

Algorithm:
  1. Identify candidate ID columns (contain ‘id’), pass ‘row-by-row’ to get_approved_id_columns(…).

  2. Extract which columns are dimensions vs. measures => create a single DSD for entire DF.

  3. For each row => build a new Graph that:
    • Copies the DSD

    • Creates a new qb:DataSet => mds:Dataset_{someIDs}_{orcid}_{timestamp}

    • Creates a SliceKey => mds:SliceKey_{someIDs}_{orcid}_{timestamp}

    • Creates a Slice => mds:Slice_{someIDs}_{orcid}_{timestamp}

    • Adds Observations for each measure

  4. Write each row’s TTL/JSON-LD + .sha256 hash in subfolders.

Parameters:
  • df (pd.DataFrame) – The entire DataFrame to convert row-by-row.

  • variable_metadata (dict) – column => metadata dictionary.

  • ns_map (dict) – prefix => Namespace mapping.

  • user_chosen_prefix (str) – e.g. ‘mds’.

  • orcid (str) – The user’s ORCID, from which we extract digits.

  • root_folder_path (str) – The top-level folder for outputs.

  • overall_timestamp (str) – The run’s global timestamp for consistent naming.

Returns:

None

FAIRLinked.QBWorkflow.rdf_transformer.create_dsd(variable_metadata: dict, dimensions: list, measures: list, ns_map: dict, user_ns: Namespace) → tuple
Description:

Builds a qb:DataStructureDefinition for the given dimensions and measures. Also sets up measureType as a dimension. Adds optional qb:attribute for ‘unitMeasure’ and ‘category’ if present.

Algorithm:
  1. Initialize an empty Graph, bind namespace prefixes.

  2. Create dsd_uri = user_ns[“DataStructureDefinition”], mark it as qb:DataStructureDefinition.

  3. For each dimension => add_component_to_dsd(…).

  4. Add measureType as dimension.

  5. For each measure => add_component_to_dsd(…).

  6. If any columns require ‘unitMeasure’ or ‘category’, add qb:attribute.

  7. Return (dsd_graph, dsd_uri).

Parameters:
  • variable_metadata (dict) – Maps column => metadata (like AltLabel, Category, etc.).

  • dimensions (list) – The dimension column names.

  • measures (list) – The measure column names.

  • ns_map (dict) – prefix => Namespace

  • user_ns (Namespace) – The user-chosen prefix’s Namespace.

Returns:

  • dsd_graph: The Graph containing the DSD definitions.

  • dsd_uri: The URIRef for the qb:DataStructureDefinition.

Return type:

(Graph, URIRef)

FAIRLinked.QBWorkflow.rdf_transformer.create_observation(dataset_graph: Graph, row: Series, variable_metadata: dict, variable_dimensions: list, measures: list, ns_map: dict, user_ns: Namespace, observation_counter: int) → tuple
Description:

For each measure in ‘measures’, if the row has a non-null value, create a qb:Observation node. Link measureType => measure property, store measure value, dimension values (from variable_dimensions), and optionally link the unit measure if found.

Algorithm:
  1. For each measure in ‘measures’:
    a) If row[measure] is not null => create observation_{observation_counter} as qb:Observation.

    b) Add triple (obs_uri, qb:measureType, measure_prop).

    c) Add triple (obs_uri, measure_prop, measure_value).

    d) For each dimension in variable_dimensions => store dimension_value or NotFound.

    e) If UNIT_FIELD => link sdmx-attribute:unitMeasure => unit URI.

  2. Return (list_of_observation_uris, updated_observation_counter).

Parameters:
  • dataset_graph (Graph) – The graph where we store Observations and data.

  • row (pd.Series) – A single row from the DataFrame.

  • variable_metadata (dict) – column => metadata.

  • variable_dimensions (list) – the subset of dimensions that vary in this context.

  • measures (list) – measure column names.

  • ns_map (dict) – prefix => Namespace.

  • user_ns (Namespace) – the user-chosen prefix’s Namespace object.

  • observation_counter (int) – the current global counter for numbering Observations.

Returns:

A list of the newly created observation URIs, plus the incremented observation_counter.

Return type:

(list_of_obs_uris, updated_counter)

FAIRLinked.QBWorkflow.rdf_transformer.create_observation_2(row: Series, variable_metadata: dict, ns_map: dict, user_ns: Namespace, file_name: str) → Graph

FAIRLinked.QBWorkflow.rdf_transformer.create_root_folder(output_folder_path: str, dataset_name: str, orcid: str) → tuple
Description:

Creates a top-level folder named ‘Output_{orcidDigits}_{timestamp}’ to store all outputs for the conversion process. Also prepares standardized strings for dataset and orcid usage throughout the process.

Algorithm:
  1. Generate a timestamp in ‘YYYYmmddHHMMSS’ format.

  2. Extract digits from the user’s ORCID => ‘sanitized_orcid’.

  3. Sanitize ‘dataset_name’ => ‘sanitized_dataset_name’ by replacing non-word chars.

  4. Construct folder name => “Output_{sanitized_orcid}_{timestamp}”.

  5. os.makedirs(…) to create the folder if not existing.

  6. Return (root_folder_path, overall_timestamp, sanitized_dataset_name, sanitized_orcid).

Parameters:
  • output_folder_path (str) – The parent path where we create this top-level output folder.

  • dataset_name (str) – The dataset name to be sanitized (though not used in folder name).

  • orcid (str) – The user’s ORCID, from which we extract digits.

Returns:

A tuple containing:
  • root_folder_path (str): The newly created folder path.

  • overall_timestamp (str): The run-specific timestamp (YYYYmmddHHMMSS).

  • sanitized_dataset_name (str): The sanitized dataset name.

  • sanitized_orcid (str): The numeric portion extracted from the ORCID.

Return type:

tuple
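
The algorithm above can be sketched with the standard library alone. This is a minimal illustration, not the package's source: the function name create_root_folder_sketch, the regular expressions, and the choice of '_' as the replacement character are assumptions.

```python
import os
import re
from datetime import datetime


def create_root_folder_sketch(output_folder_path: str, dataset_name: str, orcid: str) -> tuple:
    """Hypothetical re-implementation of the documented steps."""
    # Step 1: run-specific timestamp in YYYYmmddHHMMSS format.
    overall_timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
    # Step 2: keep only the digits of the ORCID (assumed: every non-digit is dropped).
    sanitized_orcid = re.sub(r"\D", "", orcid)
    # Step 3: replace runs of non-word characters (assumed replacement: "_").
    sanitized_dataset_name = re.sub(r"\W+", "_", dataset_name)
    # Steps 4-5: build the folder name and create the folder if it is missing.
    root_folder_path = os.path.join(
        output_folder_path, f"Output_{sanitized_orcid}_{overall_timestamp}"
    )
    os.makedirs(root_folder_path, exist_ok=True)
    # Step 6: return the four values in the documented order.
    return root_folder_path, overall_timestamp, sanitized_dataset_name, sanitized_orcid
```

Note that the dataset name is sanitized and returned but, as documented, does not appear in the folder name.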

FAIRLinked.QBWorkflow.rdf_transformer.create_subfolders(root_folder_path: str) dict[source]
Description:

Creates three subfolders (‘ttl’, ‘jsonld’, ‘hash’) under the specified root folder to store Turtle files (.ttl), JSON-LD files (.jsonld), and the SHA-256 hash files (.sha256).

Algorithm:
  1. Define a list of subfolder names: [‘ttl’, ‘jsonld’, ‘hash’].

  2. For each subfolder name: construct a path using os.path.join(root_folder_path, folder_name), create the directory if it doesn’t exist (os.makedirs with exist_ok=True), and store the path in a dictionary under the same key.

  3. Return this dictionary of paths.

Parameters:

root_folder_path (str) – The path to the parent output folder where subfolders will be created.

Returns:

A dictionary of subfolder paths, keyed by ‘ttl’, ‘jsonld’, and ‘hash’.

Return type:

dict
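
The three steps map directly onto a short loop. The sketch below follows the documented algorithm; the function name create_subfolders_sketch is hypothetical and the actual source may differ in detail.

```python
import os


def create_subfolders_sketch(root_folder_path: str) -> dict:
    """Create 'ttl', 'jsonld' and 'hash' under the root and return their paths."""
    paths = {}
    for folder_name in ["ttl", "jsonld", "hash"]:
        folder_path = os.path.join(root_folder_path, folder_name)
        # exist_ok=True makes the call safe to repeat on an existing folder.
        os.makedirs(folder_path, exist_ok=True)
        paths[folder_name] = folder_path
    return paths
```

A caller would then write, for example, Turtle output under paths['ttl'] and the matching hash file under paths['hash'].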

FAIRLinked.QBWorkflow.rdf_transformer.extract_variables(variable_metadata: dict, df_columns: list) tuple[source]
Description:

Splits columns into ‘dimensions’ or ‘measures’ by checking variable_metadata[var_name][‘IsMeasure’].

Algorithm:
  1. Initialize dimensions = [] and measures = [].

  2. For each column in df_columns, retrieve meta = variable_metadata.get(column). If meta is found and its ‘IsMeasure’ value is ‘yes’, add the column to measures; otherwise add it to dimensions. If meta is not found, log a warning message.

  3. Return (dimensions, measures).

Parameters:
  • variable_metadata (dict) – A dict mapping column => metadata (with ‘IsMeasure’, etc.).

  • df_columns (list) – The list of column names from the DataFrame.

Returns:

A tuple of (dimensions, measures).

Return type:

(list, list)
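
A minimal sketch of this split is shown below. The name extract_variables_sketch is hypothetical, and two details are assumptions rather than documented behavior: the ‘IsMeasure’ comparison is made case-insensitive, and columns without metadata are skipped after the warning.

```python
import logging

logger = logging.getLogger(__name__)


def extract_variables_sketch(variable_metadata: dict, df_columns: list) -> tuple:
    """Split column names into (dimensions, measures) using the 'IsMeasure' flag."""
    dimensions, measures = [], []
    for column in df_columns:
        meta = variable_metadata.get(column)
        if meta is None:
            # Assumption: a column without metadata is reported and left out.
            logger.warning("No metadata found for column %r", column)
            continue
        # Assumption: 'Yes', 'yes' and ' YES ' are all treated as a measure.
        if str(meta.get("IsMeasure", "")).strip().lower() == "yes":
            measures.append(column)
        else:
            dimensions.append(column)
    return dimensions, measures


# Illustrative metadata; the column names are made up for this example.
example_metadata = {
    "SampleId": {"IsMeasure": "No"},
    "Hardness": {"IsMeasure": "Yes"},
}
dims, meas = extract_variables_sketch(example_metadata, ["SampleId", "Hardness", "Unlisted"])
```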

FAIRLinked.QBWorkflow.rdf_transformer.get_property_uri(var_name: str, meta: dict, ns_map: dict, user_ns: Namespace) URIRef[source]
Description:

Obtains the appropriate URIRef for a given variable. If ‘ExistingURI’ is provided, attempts to parse it as ‘prefix:LocalPart’ or a full URI. Otherwise, uses the user namespace + sanitized var_name.

Algorithm:
  1. If meta has ‘ExistingURI’:

     a) if it contains ‘:’ => split into prefix and local_part, look up the prefix in ns_map, and create ns[local_part];

     b) otherwise, treat it as a full URIRef.

  2. Otherwise => user_ns[var_name], with spaces replaced by underscores.

Parameters:
  • var_name (str) – The DataFrame column name.

  • meta (dict) – The metadata for that variable (keys e.g. ‘ExistingURI’).

  • ns_map (dict) – A dictionary mapping prefix => rdflib.Namespace.

  • user_ns (Namespace) – The user-chosen prefix’s Namespace object.

Returns:

The final property URI for this variable.

Return type:

rdflib.term.URIRef

Raises:

ValueError – If the prefix is not found in ns_map.
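
The resolution logic can be illustrated without rdflib by letting plain strings stand in for Namespace and URIRef objects (string concatenation replaces ns[local_part]). Everything here is a sketch under assumptions: the name get_property_uri_sketch and the example.org namespace are hypothetical, and checking for an http(s) scheme before splitting on ‘:’ is one possible way to tell a full URI from a prefixed name, not necessarily what the package does.

```python
def get_property_uri_sketch(var_name: str, meta: dict, ns_map: dict, user_ns: str) -> str:
    """Resolve a variable to a URI string; strings stand in for rdflib objects."""
    existing = (meta.get("ExistingURI") or "").strip()
    if existing:
        # Assumption: anything starting with http(s):// is already a full URI.
        if existing.startswith(("http://", "https://")):
            return existing
        if ":" in existing:
            prefix, local_part = existing.split(":", 1)
            if prefix not in ns_map:
                raise ValueError(f"Prefix {prefix!r} not found in ns_map")
            return ns_map[prefix] + local_part
        return existing
    # No ExistingURI: mint a term in the user namespace, spaces become underscores.
    return user_ns + var_name.replace(" ", "_")


# Hypothetical namespaces for illustration only.
example_ns_map = {
    "mds": "https://example.org/mds/",
    "qudt": "http://qudt.org/schema/qudt/",
}
example_user_ns = example_ns_map["mds"]
```

With these inputs, a column named "Sample Id" with no ‘ExistingURI’ resolves into the user namespace, while ‘qudt:Unit’ resolves against the qudt entry of the map.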

FAIRLinked.QBWorkflow.rdf_transformer.prepare_namespaces(namespace_map: dict, user_chosen_prefix: str = 'mds') dict[source]
Description:

Validates the user-defined prefix in namespace_map, ensures each URI ends with ‘/’ or ‘#’, and converts them to rdflib.Namespace objects.

Algorithm:
  1. Copy the input namespace_map.

  2. Check if user_chosen_prefix in the map; raise an error if missing.

  3. For each (prefix, uri): validate that the URI starts with http(s) and ensure it ends with ‘/’ or ‘#’.

  4. Convert each to Namespace(…).

  5. Return the new dictionary.

Parameters:
  • namespace_map (dict) – Keys => prefix (str), Values => base URI (str).

  • user_chosen_prefix (str) – The prefix the user wants to use for new IRIs (default: ‘mds’).

Returns:

A mapping of prefix => rdflib.Namespace objects.

Return type:

dict

Raises:

ValueError – If user_chosen_prefix is missing or if any URI is invalid.
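
The validation steps can be sketched as follows. To stay dependency-free, the sketch returns plain strings where the documented function returns rdflib.Namespace objects. The name prepare_namespaces_sketch is hypothetical, and appending ‘/’ to a URI that lacks a terminator is an assumption: the description says only that the ending is ensured, not how.

```python
def prepare_namespaces_sketch(namespace_map: dict, user_chosen_prefix: str = "mds") -> dict:
    """Validate prefixes and base URIs; returns strings instead of Namespace objects."""
    if user_chosen_prefix not in namespace_map:
        raise ValueError(f"Prefix {user_chosen_prefix!r} is missing from namespace_map")
    prepared = {}
    for prefix, uri in namespace_map.items():
        if not uri.startswith(("http://", "https://")):
            raise ValueError(f"Namespace URI for {prefix!r} must start with http(s): {uri!r}")
        # Assumption: a missing terminator is repaired by appending '/'.
        if not uri.endswith(("/", "#")):
            uri = uri + "/"
        # The documented function would wrap the URI in rdflib.Namespace here.
        prepared[prefix] = uri
    return prepared
```

Building a new dictionary rather than editing the argument keeps the caller's namespace_map unchanged, which matches the "copy the input" step.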

FAIRLinked.QBWorkflow.rdf_transformer.process_unit(unit_str: str, ns_map: dict, user_ns: Namespace) URIRef[source]
Description:

Interprets a ‘Unit’ string from variable_metadata and returns an rdflib.URIRef for that unit. If the string has the form ‘prefix:LocalPart’, the prefix is looked up in ns_map. Otherwise, the string is treated as user_ns[unit_str] (with spaces replaced by underscores).

Algorithm:
  1. If unit_str is empty => return None.

  2. If ‘:’ in unit_str => parse prefix, local_part => ns_map[prefix][local_part].

  3. Otherwise => user_ns[unit_str], replacing spaces with underscores.

  4. If prefix not found => raise ValueError.

Parameters:
  • unit_str (str) – e.g. “qudt:MilliM” or “MyLocalUnit”.

  • ns_map (dict) – prefix => Namespace

  • user_ns (Namespace) – the user prefix’s namespace.

Returns:

rdflib.URIRef or None if no unit_str given.

Raises:

ValueError – If the prefix is missing from ns_map.
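
A compact sketch of the unit lookup is given below, again with plain strings standing in for rdflib.Namespace and rdflib.URIRef. The name process_unit_sketch and the example.org namespace are hypothetical; treating a whitespace-only string as empty is an assumption.

```python
from typing import Optional


def process_unit_sketch(unit_str: str, ns_map: dict, user_ns: str) -> Optional[str]:
    """Map a 'Unit' string to a URI string, or None when no unit is given."""
    # Assumption: None and whitespace-only values count as "empty".
    if not unit_str or not unit_str.strip():
        return None
    unit_str = unit_str.strip()
    if ":" in unit_str:
        prefix, local_part = unit_str.split(":", 1)
        if prefix not in ns_map:
            raise ValueError(f"Prefix {prefix!r} is missing from ns_map")
        return ns_map[prefix] + local_part
    # Local unit name: mint it in the user namespace, spaces become underscores.
    return user_ns + unit_str.replace(" ", "_")


# Hypothetical namespaces for illustration only.
unit_ns_map = {"qudt": "http://qudt.org/vocab/unit/"}
unit_user_ns = "https://example.org/mds/"
```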

FAIRLinked.QBWorkflow.rdf_transformer.write_naming_conventions_doc(root_folder_path: str, conversion_mode: str, orcid: str, overall_timestamp: str, dataset_name: str) None[source]
Description:

Writes a text file (naming_conventions_{orcidDigits}_{timestamp}.txt) describing how the DataSet, Slice, SliceKey, and filenames are named in the chosen mode (‘entire’ or ‘row-by-row’).

Algorithm:
  1. Extract numeric digits from ‘orcid’ => numeric_orcid.

  2. Build the .txt filename => naming_conventions_{numeric_orcid}_{overall_timestamp}.txt.

  3. Depending on ‘conversion_mode’, build a descriptive message about naming patterns.

  4. Write this message to the .txt file in ‘root_folder_path’.

Parameters:
  • root_folder_path (str) – The folder where we store the .txt file.

  • conversion_mode (str) – ‘entire’ or ‘row-by-row’ mode.

  • orcid (str) – The user’s ORCID (for numeric extraction).

  • overall_timestamp (str) – The run-specific timestamp used for naming outputs.

  • dataset_name (str) – The sanitized dataset name to reference in the doc.

Returns:

None

FAIRLinked.QBWorkflow.utility module

FAIRLinked.QBWorkflow.utility.validate_orcid_format(orcid)[source]

Module contents