From Spreadsheets to SDMX: Effortless Conversion with Python and .Stat Suite
George Verouchis | Christos Gkatzis
Introduction
As data exchange and reporting requirements evolve, converting spreadsheets such as XLSX files into SDMX formats is becoming increasingly important for institutions. This article explains why this transition is necessary, how it can be efficiently automated, and which tools simplify the process. We focus on the .Stat Suite and walk through a practical example using World Bank data.
Why SDMX Adoption Is Becoming Essential
- A growing number of international organizations, such as Eurostat and the European Central Bank (ECB), mandate SDMX for data reporting. As a result, SDMX compliance is no longer optional but a requirement for many institutions.
- SDMX allows for more efficient and standardized data sharing between organizations, ensuring consistency in data interpretation and reporting across different systems.
Leveraging the .Stat Suite for Automation
The .Stat Suite is a powerful tool for automating XLSX-to-SDMX conversion. One key feature is the ability to upload an XLSX file together with an Excel Data Definition (EDD) file, "edd.xml". This capability can save institutions significant time and effort by automating what would otherwise be a tedious manual conversion process.
How it works:
- XLSX Upload: You upload the raw data in XLSX format.
- edd.xml File: This accompanying file provides the metadata and mapping instructions needed to map the data correctly to SDMX.
However, before utilizing this feature, certain prerequisites must be met.
Preparing Your Dataflow: DSD and Dataflow Creation
Before automating the conversion process, it’s essential to ensure that your Data Structure Definition (DSD) and Dataflow for the dataset are already prepared. Without these components in place, automating XLSX to SDMX conversion via the .Stat Suite will not be possible.
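The structures can be authored with the .Stat Suite's own tooling or generated programmatically, which is the route this article takes. For orientation, here is a minimal sketch of the skeleton of an SDMX v2.1 structure message, built with Python's standard ElementTree and mirroring what the generate_sdmx_xml method shown later produces; the full version, with code lists, concepts, and dimensions, appears in the implementation section.
import xml.etree.ElementTree as ET

# Skeleton of an SDMX v2.1 structure message: a Header plus an (empty)
# Structures element that will later hold Codelists, Concepts and the DSD.
root = ET.Element('message:Structure', {
    'xmlns:message': "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message",
    'xmlns:s': "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/structure",
    'xmlns:c': "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/common",
})
header = ET.SubElement(root, 'message:Header')
ET.SubElement(header, 'message:ID').text = 'IDREF1'
ET.SubElement(header, 'message:Test').text = 'false'
ET.SubElement(header, 'message:Prepared').text = '2021-03-01T13:41:02+00:00'
ET.SubElement(header, 'message:Sender', {'id': 'WBG'})
ET.SubElement(root, 'message:Structures')
ET.ElementTree(root).write('dsd_skeleton.xml', encoding='utf-8', xml_declaration=True)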
A Case Study: Converting World Bank Commodity Markets (Pink Sheet) Data
To illustrate the process, let’s take the example of converting World Bank’s Pink Sheet data. This dataset provides commodity price information and is typically published in XLSX format.
Steps:
- Retrieve the Data: Download the Pink Sheet files from the World Bank's official commodity markets page (https://www.worldbank.org/en/research/commodity-markets).
- Prepare the DSD: Define the structure for the Pink Sheet data, including commodity dimensions, time periods, and price units.
- Create the Dataflow and EDD: Set up a dataflow and EDD for the Pink Sheet dataset, linking it to the DSD.
- Upload: Using the .Stat Suite, upload the XLSX file along with the corresponding structures.
- Conversion: The .Stat Suite processes the files and converts the data into SDMX format, ready for reporting and sharing.
Implementation
The code defines four core classes: Downloader, Converter, DataLoader, and Importer, which provide the foundational structure for downloading, converting, and importing data.
Downloader(ABC)
The Downloader class is an abstract base class, responsible for managing data downloads. It includes methods for logging the end of a download and filling a Reporter with details about the downloaded files. Subclasses must implement the download method.
class Downloader(ABC):
    def __init__(self, data_loader: DataLoader) -> None:
        self.data_loader = data_loader

    def _end_message(self):
        message = f'Downloaded {self.data_loader.get_file_number()} file(s) with data, ' \
                  f'Avg file size: {self.data_loader.get_avg_file_size()}, ' \
                  f'Total Data: {self.data_loader.get_total_size()}'
        logger.info(message)

    def fill_reporter(self, reporter: Reporter):
        reporter.data_file_number = self.data_loader.get_file_number()
        reporter.avg_data_file_size = self.data_loader.get_avg_file_size()
        reporter.total_data = self.data_loader.get_total_size()

    @abstractmethod
    def download(self, data_info: DataInfo) -> ActionResult:
        """Downloads the data from the source to the downloaded_directory."""
Converter(ABC)
The Converter class is an abstract base class that manages dataset conversion using a DataLoader for file operations. It provides logging methods for the conversion process and requires subclasses to implement the convert method to handle the actual conversion.
class Converter(ABC):
    def __init__(self, data_loader: DataLoader) -> None:
        self.data_loader = data_loader

    def _start_message(self):
        message = f'Converting {self.data_loader.get_file_number()} file(s)'
        logger.info(message)

    def _end_message(self):
        message = f'Converted in {self.data_loader.get_file_number(True)} file(s)'
        logger.info(message)

    def fill_reporter(self, reporter: Reporter):
        reporter.file_number = self.data_loader.get_file_number(True)
        reporter.avg_file_size = self.data_loader.get_avg_file_size(True)

    @abstractmethod
    def convert(self) -> ActionResult:
        """
        Converts datasets from the downloaded directory and saves them to the converted directory.
        """
        pass
DataLoader
The DataLoader class manages file-related operations such as finding, deleting, and calculating file sizes. It works with downloaded and converted directories and provides utility methods to assist in managing data flows.
class DataLoader:
    def __init__(self, downloaded_directory: str, converted_directory: str) -> None:
        if not os.path.exists(downloaded_directory):
            raise ValueError('downloaded_directory does not exist')
        elif not os.path.exists(converted_directory):
            raise ValueError('converted_directory does not exist')
        self.downloaded_directory = downloaded_directory
        self.converted_directory = converted_directory

    def find_dataflows(self, converted: bool = False) -> list[str]:
        # Data files follow the naming convention 'data*.<ext>'.
        directory_path = self.converted_directory if converted else self.downloaded_directory
        xml_pattern = os.path.join(directory_path, 'data*.xml')
        zip_pattern = os.path.join(directory_path, 'data*.zip')
        xlsx_pattern = os.path.join(directory_path, 'data*.xlsx')
        csv_pattern = os.path.join(directory_path, 'data*.csv')
        return glob.glob(xml_pattern) + glob.glob(zip_pattern) + glob.glob(xlsx_pattern) + glob.glob(csv_pattern)

    def delete_dataflows(self, converted: bool = False):
        file_paths = self.find_dataflows(converted)
        for file_path in file_paths:
            os.remove(file_path)

    def get_file_size(self, file_path) -> float:
        # File size in MB
        return round(os.path.getsize(file_path) / (1024 * 1024), 4)

    def get_file_number(self, converted: bool = False) -> int:
        return len(self.find_dataflows(converted))

    def get_total_size(self, converted: bool = False) -> float:
        # Total size of all matched data files in MB
        dataflow_paths = self.find_dataflows(converted)
        total = sum(os.path.getsize(path) for path in dataflow_paths)
        return round(total / (1024 * 1024), 3)

    def get_avg_file_size(self, converted: bool = False) -> float:
        file_number = self.get_file_number(converted)
        return round(self.get_total_size(converted) / file_number, 3) if file_number else 0
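A quick usage sketch (the directory names are illustrative and must already exist, since the constructor enforces that):
import os

os.makedirs('downloaded', exist_ok=True)
os.makedirs('converted', exist_ok=True)

loader = DataLoader('downloaded', 'converted')
print(loader.get_file_number())        # files matching data*.{xml,zip,xlsx,csv} in 'downloaded'
print(loader.get_total_size())         # total size in MB
print(loader.get_avg_file_size(True))  # average file size in MB for 'converted'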
Importer(ABC)
The Importer class is an abstract base class designed to handle data imports. It includes a data_loader for managing files and an abstract method import_dataflow to be implemented by subclasses.
class Importer(ABC):
    def __init__(self, data_loader: DataLoader) -> None:
        self.data_loader = data_loader

    def _start_message(self):
        message = f'Importing {self.data_loader.get_file_number(True)} files'
        logger.info(message)

    @abstractmethod
    def import_dataflow(self, config: Config, data_info: DataInfo) -> ActionResult:
        """Imports dataflow to .Stat suite"""
Raw Data Preview
Processing the World Bank Commodity Data from the Pink Sheet poses several challenges. First, the sheer volume of data across multiple categories, such as energy, agriculture, and metals, can be overwhelming, especially when dealing with annual series in both nominal and real terms. Additionally, the complex column structure, with multiple nested headers and abbreviations, makes it difficult to map and clean the data accurately. To convert this into SDMX format, one must correctly define the data structure and dimensions, which requires meticulous attention to detail and an understanding of how each variable relates to the overall dataset.
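To see what the pipeline is up against, a quick peek at one of the raw sheets (assuming the annual file has been saved as data_annual_data.xlsx, the name used by the downloader below) shows the placeholder headers pandas produces, because the real headers live in merged rows near the top of the sheet:
import pandas as pd

df = pd.read_excel('data_annual_data.xlsx', sheet_name='Annual Indices (Nominal)')
print(df.columns[:4].tolist())
# ['World Bank Commodity Price Data (The Pink Sheet)', 'Unnamed: 1',
#  'Unnamed: 2', 'Unnamed: 3']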
Downloading Process
WebToExcelDownloader(Downloader): This class is the base class for downloading Excel files over HTTP. It implements the generic file download and leaves the source-specific download method to subclasses (a minimal example sketch follows the code).
class WebToExcelDownloader(Downloader):
    def __init__(self, data_loader: DataLoader) -> None:
        super().__init__(data_loader)

    def download_excel_file(self, url: str, file_name: str) -> str | None:
        response = requests.get(url)
        if response.status_code == 200:
            file_path = os.path.join(self.data_loader.downloaded_directory, file_name)
            with open(file_path, 'wb') as file:
                file.write(response.content)
            return file_path
        else:
            logger.error(f"Failed to download file from {url}. Status code: {response.status_code}")
            return None

    @abstractmethod
    def download(self, data_info: DataInfo) -> bool:
        pass

    def fill_reporter(self, reporter: Reporter):
        reporter.data_file_number = self.data_loader.get_file_number()
        reporter.avg_data_file_size = self.data_loader.get_avg_file_size()
        reporter.total_data = self.data_loader.get_total_size()
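As a minimal illustration of the contract, a subclass only has to implement download. The hypothetical example below fetches a single file; the URL and file name are placeholders, and the real implementation (WorldBankExcelDownloader) follows.
class SingleFileDownloader(WebToExcelDownloader):
    """Hypothetical example subclass: downloads one fixed file."""

    def download(self, data_info: DataInfo) -> bool:
        # Placeholder URL; a real subclass would derive this from data_info.
        path = self.download_excel_file('https://example.com/data.xlsx', 'data_example.xlsx')
        if path:
            self._end_message()
        return path is not None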
World Bank Data Downloading
WorldBankExcelDownloader(WebToExcelDownloader)
This subclass specifically downloads Excel files from the World Bank’s commodity markets webpage. It scrapes the page, looks for .xlsx links, and downloads monthly and annual files.
class WorldBankExcelDownloader(WebToExcelDownloader):
    def __init__(self, data_loader: DataLoader) -> None:
        super().__init__(data_loader)
download Method
Purpose:
The download method is responsible for fetching the World Bank's Pink Sheet data from their website. It scrapes the HTML content of the page, locates the links to the Excel files, and downloads them; it then processes the downloaded annual data.
How It Works:
- Sends a GET request to the World Bank commodity markets page to retrieve the HTML content.
- Parses the HTML using BeautifulSoup to identify links to .xlsx files.
- Downloads the Excel files using the download_excel_file method.
- If successful, processes the annual data by calling process_annual_data (this method and its helpers are walked through in detail in the WorldBankExcelConverter section below).
def download(self, data_info: DataInfo) -> bool:
    url = "http://www.worldbank.org/en/research/commodity-markets"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    # Note: verify=False disables TLS certificate verification; fine for a quick
    # script, but consider removing it (or supplying a CA bundle) in production.
    response = requests.get(url, headers=headers, verify=False)
    html_content = response.content
    soup = BeautifulSoup(html_content, 'html.parser')
    # Collect the monthly and annual .xlsx links from the page.
    xls_links = {}
    for link in soup.find_all('a', href=True):
        if link['href'].endswith('.xlsx'):
            link_text = link.get_text().strip().lower()
            if "monthly" in link_text:
                xls_links['monthly'] = link['href']
            elif "annual" in link_text:
                xls_links['annual'] = link['href']
    monthly_file = self.download_excel_file(xls_links.get('monthly'), 'data_monthly_data.xlsx')
    annual_file = self.download_excel_file(xls_links.get('annual'), 'data_annual_data.xlsx')
    if monthly_file:
        print(f"Monthly data downloaded to: {monthly_file}")
    if annual_file:
        print(f"Annual data downloaded to: {annual_file}")
        self.process_annual_data(annual_file)
    self._end_message()
    return True
WorldBankExcelConverter Class
The WorldBankExcelConverter class extends the Converter base class and is responsible for converting the downloaded World Bank Excel files into processed formats. It also generates the SDMX structures and metadata for the converted data. This class includes methods to handle data transformation and structure generation.
class WorldBankExcelConverter(Converter):
    def __init__(self, data_loader: DataLoader):
        super().__init__(data_loader)

    def convert(self):
        # The downloader saves the annual file as 'data_annual_data.xlsx'.
        annual_file = os.path.join(self.data_loader.downloaded_directory, 'data_annual_data.xlsx')
        if os.path.exists(annual_file):
            self.process_annual_data(annual_file)
process_annual_data Method
Purpose:
This method processes the annual Pink Sheet data by reading multiple sheets (nominal and real prices and indices) from the Excel file, cleaning and transforming the data, and then saving the processed data into new Excel files. It also generates the SDMX structure and the corresponding XML description for further use.
How It Works:
- Loads the annual Pink Sheet data from four specific sheets in the Excel file: nominal and real prices, and nominal and real indices.
- Cleans and processes the nominal and real indices using the process_indices method and merges them into one DataFrame.
- Cleans and processes the nominal and real prices using the process_prices method and merges them into a single DataFrame.
- Saves the processed indices and prices DataFrames to Excel files.
- Generates the SDMX structure for the indices and prices using the create_sdmx_structure method.
- Finally, creates XML files that describe the Excel data structure using the create_excel_data_description_xml method.
def process_annual_data(self, annual_file: str) -> None:
    an_prices_nominal = pd.read_excel(annual_file, sheet_name="Annual Prices (Nominal)")
    an_indices_nominal = pd.read_excel(annual_file, sheet_name="Annual Indices (Nominal)")
    an_prices_real = pd.read_excel(annual_file, sheet_name="Annual Prices (Real)")
    an_indices_real = pd.read_excel(annual_file, sheet_name="Annual Indices (Real)")
    # Process datasets
    an_indices_nominal, an_indices_real = self.process_indices(an_indices_nominal, an_indices_real)
    an_indices = pd.concat([an_indices_nominal, an_indices_real])
    an_prices_nominal, an_prices_real, merged_df = self.process_prices(an_prices_nominal, an_prices_real)
    # Save merged datasets as Excel files
    indices_xlsx_path = os.path.join(self.data_loader.converted_directory, "an_indices.xlsx")
    prices_xlsx_path = os.path.join(self.data_loader.converted_directory, "an_prices.xlsx")
    with pd.ExcelWriter(indices_xlsx_path) as writer:
        an_indices.to_excel(writer, index=False)
    with pd.ExcelWriter(prices_xlsx_path) as writer:
        merged_df.to_excel(writer, index=False)
    # Generate SDMX structures and Excel data description XMLs for both indices and prices
    self.create_sdmx_structure("INDICES_INFO", an_indices, "PinkSheetIndices", "Annual Indices - The World Bank Pinksheet", "DSD_INDICES_PINKSHEET")
    self.create_sdmx_structure("PRICES_INFO", merged_df, "PinkSheetPrices", "Annual Prices - The World Bank Pinksheet", "DSD_PRICES_PINKSHEET")
    self.create_excel_data_description_xml("PinkSheetIndices", indices_xlsx_path, "PinkSheetIndices_v1", "Sheet1", an_indices)
    self.create_excel_data_description_xml("PinkSheetPrices", prices_xlsx_path, "PinkSheetPrices_v1", "Sheet1", merged_df)
process_indices Method
Purpose:
This method cleans and transforms the nominal and real indices data. It renames columns, removes unnecessary rows, and adds a "Measure" column to indicate whether the data is nominal or real.
How It Works:
- Renames columns using a predefined mapping that standardizes column names.
- Drops irrelevant rows (e.g., header rows or empty rows) and resets the index so the DataFrame is clean and consistent.
- Adds a "Measure" column to distinguish between nominal and real data.
- Returns two cleaned DataFrames: one for nominal indices and one for real indices.
def process_indices(self, an_indices_nominal, an_indices_real) -> tuple:
    new_titles_nominal = {
        'World Bank Commodity Price Data (The Pink Sheet)': 'year',
        'Unnamed: 1': 'World Bank Commodity Price Index',
        'Unnamed: 2': 'Energy',
        'Unnamed: 3': 'Non-energy',
        'Unnamed: 4': 'Agriculture',
        'Unnamed: 5': 'Beverages',
        'Unnamed: 6': 'Food',
        'Unnamed: 7': 'Oils & Meals',
        'Unnamed: 8': 'Grains',
        'Unnamed: 9': 'Other food',
        'Unnamed: 10': 'Raw Materials',
        'Unnamed: 11': 'Timber',
        'Unnamed: 12': 'Other Raw Mat.',
        'Unnamed: 13': 'Fertilizers',
        'Unnamed: 14': 'Metals & Minerals',
        'Unnamed: 15': 'Base Metals (ex. iron ore)',
        'Unnamed: 16': 'Precious Metals',
    }
    an_indices_nominal = an_indices_nominal.rename(columns=new_titles_nominal)
    an_indices_nominal = an_indices_nominal.drop(range(0, 8)).reset_index(drop=True)
    new_titles_real = new_titles_nominal.copy()
    new_titles_real['Unnamed: 17'] = 'Deflator MUV Index'
    an_indices_real = an_indices_real.rename(columns=new_titles_real)
    an_indices_real = an_indices_real.drop(range(0, 8)).reset_index(drop=True)
    an_indices_nominal.insert(1, "Measure", "Nominal")
    an_indices_real.insert(1, "Measure", "Real")
    return an_indices_nominal, an_indices_real
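A toy example of the same rename/drop/insert pattern, using a two-column frame with the placeholder headers pandas assigns (the values are made up for illustration):
import pandas as pd

df = pd.DataFrame({
    'World Bank Commodity Price Data (The Pink Sheet)': ['', '', '', '', '', '', '', '', 1960, 1961],
    'Unnamed: 1': ['', '', '', '', '', '', '', '', 100.0, 101.5],
})
df = df.rename(columns={
    'World Bank Commodity Price Data (The Pink Sheet)': 'year',
    'Unnamed: 1': 'World Bank Commodity Price Index',
})
df = df.drop(range(0, 8)).reset_index(drop=True)  # drop the 8 junk header rows
df.insert(1, 'Measure', 'Nominal')                # tag the measure
print(df)
#    year  Measure  World Bank Commodity Price Index
# 0  1960  Nominal                             100.0
# 1  1961  Nominal                             101.5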
process_prices Method
Purpose:
This method processes the nominal and real prices data by cleaning, transforming, and merging it into a single DataFrame. It ensures the prices are correctly labeled and the data is ready for further processing.
How It Works:
- Extracts the header row (the commodity price names) and cleans up the price columns for both the nominal and real price DataFrames.
- Drops unnecessary rows to ensure the data starts from the correct point.
- Adds a "Measure" column to indicate whether the data is nominal or real.
- Merges the two cleaned DataFrames (nominal and real) into a single DataFrame and returns all three.
def process_prices(self, an_prices_nominal, an_prices_real) -> tuple:
    # Row 5 of each sheet holds the commodity names used as column headers.
    prices_nominal = an_prices_nominal.iloc[5].tolist()
    prices_real = an_prices_real.iloc[5].tolist()
    prices_nominal = self.clean_price_list(prices_nominal)
    prices_real = self.clean_price_list(prices_real)
    an_prices_nominal = an_prices_nominal.iloc[5:]
    an_prices_real = an_prices_real.iloc[5:]
    an_prices_nominal.columns = ['year', 'Measure'] + prices_nominal
    an_prices_real.columns = ['year', 'Measure'] + prices_real
    an_prices_nominal.insert(1, "Measure", "Nominal")
    an_prices_real.insert(1, "Measure", "Real")
    an_prices_nominal = an_prices_nominal.iloc[1:].reset_index(drop=True)
    an_prices_real = an_prices_real.iloc[1:].reset_index(drop=True)
    merged_df = pd.merge(an_prices_nominal, an_prices_real, how='outer')
    return an_prices_nominal, an_prices_real, merged_df
clean_price_list Method
Purpose:
This method cleans a list of price column names by removing unwanted characters and ensuring consistency in the column headers. It strips special characters such as @, #, and !, which can interfere with data processing.
How It Works:
- Removes nan values and special characters (e.g., !, @, #, $) from the price list.
- Trims trailing spaces from column names.
- Returns a cleaned price list that can be used as column headers.
def clean_price_list(self, price_list) -> list:
    price_list = [x for x in price_list if str(x) != 'nan']
    price_list = [x.replace("!", "").replace("@", "").replace("#", "").replace("$", "")
                   .replace("^", "").replace("&", "").replace("*", "") for x in price_list]
    price_list = [x[:-1] if x[-1] == " " else x for x in price_list]
    return price_list
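For example, assuming converter is a WorldBankExcelConverter instance, and with values mimicking the footnote markers found in the raw header row:
raw = ['Crude oil, avg ', 'Cocoa **', float('nan'), 'Aluminum']
print(converter.clean_price_list(raw))
# ['Crude oil, avg', 'Cocoa', 'Aluminum']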
create_sdmx_structure Method
Purpose:
This method generates the SDMX structure for the dataset, defining the concept scheme, dimensions, measures, and other metadata required for SDMX compliance. It calls the generate_sdmx_xml method to create the actual SDMX XML file.
How It Works:
- Generates column names and concept names for the SDMX structure based on the DataFrame.
- Calls generate_sdmx_xml to produce the XML file that contains the SDMX structure for the given dataset.
- The XML includes essential metadata such as the agency, version, dataflow, and structure reference IDs.
def create_sdmx_structure(self, codelist_name: str, dataframe: pd.DataFrame, dataflow_id: str, dataflow_name: str, structure_ref_id: str) -> None:
    column_names = self.generate_column_names(dataframe.columns)
    concept_names = ["Concept " + col for col in dataframe.columns]
    concept_ids = ["C_" + col for col in column_names]
    self.generate_sdmx_xml(
        codelist_name=codelist_name,
        code_names=dataframe.columns,
        code_ids=column_names,
        concept_names=concept_names,
        concept_ids=concept_ids,
        agency="WBG",
        version="1.0",
        dataflow_id=dataflow_id,
        dataflow_name=dataflow_name,
        structure_ref_id=structure_ref_id,
        structure_ref_agency="WBG"
    )
generate_column_names Method
Purpose:
This method generates standardized column names by converting DataFrame column names to uppercase and removing special characters and spaces, so that the names conform to SDMX ID conventions, which require clean, standardized identifiers.
How It Works:
- Converts column names to uppercase.
- Removes or replaces special characters (e.g., spaces, parentheses, slashes) to ensure the column names are valid for SDMX.
- Checks for duplicate column names and prints a warning if any are found.
def generate_column_names(self, columns: pd.Index) -> list:
    column_names = []
    for column in columns:
        column_name = (column.upper()
                       .replace(" ", "_")
                       .replace("(", "").replace(")", "")
                       .replace("-", "").replace("/", "")
                       .replace(".", "").replace(",", "").replace(":", "")
                       .replace("%", "_PERCENT")
                       .replace("&", "_AND"))
        column_names.append(column_name)
    seen = set()
    duplicates = set(x for x in column_names if x in seen or seen.add(x))
    if duplicates:
        print(f"Duplicate column names: {duplicates}")
    return column_names
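Tracing the replacement chain on a few real headers shows the effect; note that '&' expands to '_AND' after spaces have already become underscores, which yields a double underscore. Again, converter is assumed to be a WorldBankExcelConverter instance:
cols = pd.Index(['Base Metals (ex. iron ore)', 'Oils & Meals', 'year'])
print(converter.generate_column_names(cols))
# ['BASE_METALS_EX_IRON_ORE', 'OILS__AND_MEALS', 'YEAR']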
generate_sdmx_xml Method
Purpose:
This method generates an SDMX-compliant XML file that defines the structure of the data, including code lists, concept schemes, and the overall data structure definition. This XML file is essential for reporting and sharing the dataset in the SDMX format.
How It Works:
- Creates the root XML element and adds header information (e.g., ID, prepared date, sender).
- Defines the dataflow, data structure, and code lists for the dataset.
- Adds concepts such as frequency, commodity type, time period, and measure.
- Writes the final XML file to disk.
def generate_sdmx_xml(self, codelist_name: str, code_names: list, code_ids: list,
                      concept_names: list, concept_ids: list, agency: str, version: str,
                      dataflow_id: str, dataflow_name: str, structure_ref_id: str,
                      structure_ref_agency: str) -> None:
    # Create the root element
    root = ET.Element('message:Structure', {
        'xmlns:message': "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message",
        'xmlns:s': "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/structure",
        'xmlns:c': "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/common"
    })
    # Create Header element
    header = ET.SubElement(root, 'message:Header')
    ET.SubElement(header, 'message:ID').text = 'IDREF5'
    ET.SubElement(header, 'message:Test').text = 'false'
    ET.SubElement(header, 'message:Prepared').text = '2021-03-01T13:41:02.8010359+00:00'
    ET.SubElement(header, 'message:Sender', {'id': agency})
    # Create Structures element
    structures = ET.SubElement(root, 'message:Structures')
    # Create Dataflows element
    dataflows = ET.SubElement(structures, 's:Dataflows')
    dataflow = ET.SubElement(dataflows, 's:Dataflow', {
        'id': dataflow_id,
        'agencyID': structure_ref_agency,
        'version': "1.0",
        'isFinal': "true"
    })
    annotations = ET.SubElement(dataflow, 'c:Annotations')
    annotation = ET.SubElement(annotations, 'c:Annotation')
    ET.SubElement(annotation, 'c:AnnotationType').text = 'NonProductionDataflow'
    ET.SubElement(annotation, 'c:AnnotationText', {'xml:lang': "en"}).text = 'true'
    ET.SubElement(dataflow, 'c:Name', {'xml:lang': "en"}).text = dataflow_name
    structure = ET.SubElement(dataflow, 's:Structure')
    ET.SubElement(structure, 'Ref', {
        'id': structure_ref_id,
        'version': "1.0",
        'agencyID': structure_ref_agency,
        'package': "datastructure",
        'class': "DataStructure"
    })
    # Create Codelists element
    codelists = ET.SubElement(structures, 's:Codelists')
    # Codelist for the commodity columns
    codelist1 = ET.SubElement(codelists, 's:Codelist', {
        'agencyID': agency,
        'id': codelist_name,
        'isFinal': 'true',
        'urn': f"urn:sdmx:org.sdmx.infomodel.codelist.Codelist={agency}:{codelist_name}({version})",
        'version': version
    })
    # Add names to Codelist
    ET.SubElement(codelist1, 'c:Name', {'xml:lang': 'en'}).text = codelist_name
    # Add Codes to Codelist
    for code_id, code_name in zip(code_ids, code_names):
        if code_name == "Measure" or code_name == "year":
            continue
        code = ET.SubElement(codelist1, 's:Code', {
            'id': code_id,
            'urn': f"urn:sdmx:org.sdmx.infomodel.codelist.Code={agency}:{codelist_name}({version}).{code_id}"
        })
        ET.SubElement(code, 'c:Name', {'xml:lang': 'en'}).text = code_name
    # Codelist for the MEASURE dimension (Nominal / Real)
    codelist2 = ET.SubElement(codelists, 's:Codelist', {
        'agencyID': agency,
        'id': "MEASURE",
        'isFinal': 'true',
        'urn': f"urn:sdmx:org.sdmx.infomodel.codelist.Codelist={agency}:MEASURE({version})",
        'version': version
    })
    ET.SubElement(codelist2, 'c:Name', {'xml:lang': 'en'}).text = "MEASURE"
    for code_id, code_name in zip(["NOMINAL", "REAL"], ["Nominal", "Real"]):
        code = ET.SubElement(codelist2, 's:Code', {
            'id': code_id,
            'urn': f"urn:sdmx:org.sdmx.infomodel.codelist.Code={agency}:MEASURE({version}).{code_id}"
        })
        ET.SubElement(code, 'c:Name', {'xml:lang': 'en'}).text = code_name
    # Create Concepts element
    concepts = ET.SubElement(structures, 's:Concepts')
    concept_scheme = ET.SubElement(concepts, 's:ConceptScheme', {
        'id': "CS_" + structure_ref_id,
        'urn': f"urn:sdmx:org.sdmx.infomodel.conceptscheme.ConceptScheme=WBG:CS_{structure_ref_id}(1.0)",
        'agencyID': "WBG",
        'version': "1.0",
        'isFinal': "true"
    })
    ET.SubElement(concept_scheme, 'c:Name', {'xml:lang': "en"}).text = 'Concept Scheme for Energy, Agriculture, and Metals'
    ET.SubElement(concept_scheme, 'c:Description', {'xml:lang': "en"}).text = 'List of Concepts to be used in DSD for Energy, Agriculture, and Metals'
    concept_list = [
        ("FREQ", "Frequency", "The frequency of the observation (e.g., monthly, quarterly, annually)."),
        ("COMMODITY_TYPE", "Commodity Type", "The type of commodity (e.g., crude oil, cocoa, aluminum)."),
        ("TIME_PERIOD", "Time Period", "Reference date or date range the observed value refers to."),
        ("OBS_VALUE", "Observation value", "Actual observation value."),
        ("MEASURE", "Measure", "The type of measure (e.g., nominal, real).")
    ]
    for concept_id, concept_name, concept_description in concept_list:
        concept = ET.SubElement(concept_scheme, 's:Concept', {
            'id': concept_id,
            'urn': f"urn:sdmx:org.sdmx.infomodel.conceptscheme.Concept=WBG:CS_{structure_ref_id}(1.0).{concept_id}"
        })
        ET.SubElement(concept, 'c:Name', {'xml:lang': "en"}).text = concept_name
        ET.SubElement(concept, 'c:Description', {'xml:lang': "en"}).text = concept_description
    # Create DataStructures element
    data_structures = ET.SubElement(structures, 's:DataStructures')
    data_structure = ET.SubElement(data_structures, 's:DataStructure', {
        'id': structure_ref_id,
        'urn': f"urn:sdmx:org.sdmx.infomodel.datastructure.DataStructure=WBG:{structure_ref_id}(1.0)",
        'agencyID': "WBG",
        'version': "1.0",
        'isFinal': "true"
    })
    ET.SubElement(data_structure, 'c:Name', {'xml:lang': "en"}).text = 'DSD: Energy, Agriculture, and Metals'
    components = ET.SubElement(data_structure, 's:DataStructureComponents')
    dimension_list = ET.SubElement(components, 's:DimensionList', {
        'id': "DimensionDescriptor",
        'urn': f"urn:sdmx:org.sdmx.infomodel.datastructure.DimensionDescriptor=WBG:{structure_ref_id}(1.0).DimensionDescriptor"
    })
    dimension = ET.SubElement(dimension_list, 's:Dimension', {
        'id': "COMMODITY_TYPE",
        'urn': f"urn:sdmx:org.sdmx.infomodel.datastructure.Dimension=WBG:{structure_ref_id}(1.0).COMMODITY_TYPE",
        'position': "2"
    })
    concept_identity = ET.SubElement(dimension, 's:ConceptIdentity')
    ET.SubElement(concept_identity, 'Ref', {
        'id': "COMMODITY_TYPE",
        'maintainableParentID': "CS_" + structure_ref_id,
        'maintainableParentVersion': "1.0",
        'agencyID': "WBG",
        'package': "conceptscheme",
        'class': "Concept"
    })
    local_representation = ET.SubElement(dimension, 's:LocalRepresentation')
    enumeration = ET.SubElement(local_representation, 's:Enumeration')
    ET.SubElement(enumeration, 'Ref', {
        'id': codelist_name,
        'version': "1.0",
        'agencyID': "WBG",
        'package': "codelist",
        'class': "Codelist"
    })
    dimension = ET.SubElement(dimension_list, 's:Dimension', {
        'id': "MEASURE",
        'urn': f"urn:sdmx:org.sdmx.infomodel.datastructure.Dimension=WBG:{structure_ref_id}(1.0).MEASURE",
        'position': "1"
    })
    concept_identity = ET.SubElement(dimension, 's:ConceptIdentity')
    ET.SubElement(concept_identity, 'Ref', {
        'id': "MEASURE",
        'maintainableParentID': "CS_" + structure_ref_id,
        'maintainableParentVersion': "1.0",
        'agencyID': "WBG",
        'package': "conceptscheme",
        'class': "Concept"
    })
    local_representation = ET.SubElement(dimension, 's:LocalRepresentation')
    enumeration = ET.SubElement(local_representation, 's:Enumeration')
    ET.SubElement(enumeration, 'Ref', {
        'id': "MEASURE",
        'version': "1.0",
        'agencyID': "WBG",
        'package': "codelist",
        'class': "Codelist"
    })
    time_dimension = ET.SubElement(dimension_list, 's:TimeDimension', {
        'id': "TIME_PERIOD",
        'urn': f"urn:sdmx:org.sdmx.infomodel.datastructure.TimeDimension=WBG:{structure_ref_id}(1.0).TIME_PERIOD",
        'position': "3"
    })
    concept_identity = ET.SubElement(time_dimension, 's:ConceptIdentity')
    ET.SubElement(concept_identity, 'Ref', {
        'id': "TIME_PERIOD",
        'maintainableParentID': "CS_" + structure_ref_id,
        'maintainableParentVersion': "1.0",
        'agencyID': "WBG",
        'package': "conceptscheme",
        'class': "Concept"
    })
    local_representation = ET.SubElement(time_dimension, 's:LocalRepresentation')
    ET.SubElement(local_representation, 's:TextFormat', {'textType': "ObservationalTimePeriod"})
    measure_list = ET.SubElement(components, 's:MeasureList', {
        'id': "MeasureDescriptor",
        'urn': f"urn:sdmx:org.sdmx.infomodel.datastructure.MeasureDescriptor=WBG:{structure_ref_id}(1.0).MeasureDescriptor"
    })
    primary_measure = ET.SubElement(measure_list, 's:PrimaryMeasure', {
        'id': "OBS_VALUE",
        'urn': f"urn:sdmx:org.sdmx.infomodel.datastructure.PrimaryMeasure=WBG:{structure_ref_id}(1.0).OBS_VALUE"
    })
    concept_identity = ET.SubElement(primary_measure, 's:ConceptIdentity')
    ET.SubElement(concept_identity, 'Ref', {
        'id': "OBS_VALUE",
        'maintainableParentID': "CS_" + structure_ref_id,
        'maintainableParentVersion': "1.0",
        'agencyID': "WBG",
        'package': "conceptscheme",
        'class': "Concept"
    })
    # Write the result to the converted directory, where the importer expects it.
    tree = ET.ElementTree(root)
    output_path = os.path.join(self.data_loader.converted_directory, f"{codelist_name}_structure.xml")
    with open(output_path, "wb") as files:
        tree.write(files, encoding='utf-8', xml_declaration=True)
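A quick sanity check on the generated file, parsing it back with the standard library and listing the codelists and DSD it declares (the path assumes the file was written to the converted directory, as above):
import xml.etree.ElementTree as ET

S = '{http://www.sdmx.org/resources/sdmxml/schemas/v2_1/structure}'
root = ET.parse('converted/INDICES_INFO_structure.xml').getroot()
print([cl.get('id') for cl in root.iter(S + 'Codelist')])      # ['INDICES_INFO', 'MEASURE']
print([ds.get('id') for ds in root.iter(S + 'DataStructure')]) # ['DSD_INDICES_PINKSHEET']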
create_excel_data_description_xml Method
Purpose:
This method generates the EDD XML file that describes the structure of the Excel data, including the worksheet name, the layout of the sheet (the top-left and bottom-right cell coordinates of the data range), and the mapping between Excel cells and SDMX dimensions. This file drives the Excel-to-SDMX conversion in the .Stat Suite.
How It Works:
- Determines the size of the Excel data by calculating the bottom-right coordinate.
- Constructs the XML with the Excel data description, including the worksheet name and the coordinates of the data range.
- Saves the XML file to disk.
def create_excel_data_description_xml(self, dataflow_name: str, data_file_path: str, coord_mapping_name: str, sheet_name: str, dataframe: pd.DataFrame) -> None:
    max_row = len(dataframe)
    max_column = len(dataframe.columns)
    bottom_right_coord = openpyxl.utils.get_column_letter(max_column + 1) + str(max_row + 2)
    root = ET.Element("Excel-Data-Description", {
        "Dataflow": dataflow_name,
        "Agency": "WBG",
        "Version": "1.0"
    })
    coord_mapping = ET.SubElement(root, "Coord-Mapping", {
        "Name": coord_mapping_name,
        "ExcludeNulls": "false",
        "IsActive": "true"
    })
    worksheets = ET.SubElement(coord_mapping, "Worksheets")
    ET.SubElement(worksheets, "Worksheet", {"Name": sheet_name})
    ET.SubElement(coord_mapping, "Top-Left", {"Coord": "C2"})
    ET.SubElement(coord_mapping, "Bottom-Right", {"Coord": bottom_right_coord})
    # Save the XML file
    xml_file_name = os.path.splitext(data_file_path)[0] + "_excel_data_description.xml"
    tree = ET.ElementTree(root)
    with open(xml_file_name, "wb") as xml_file:
        tree.write(xml_file, encoding="utf-8", xml_declaration=True)
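For a concrete feel of the coordinate arithmetic: a 10-row, 3-column DataFrame yields get_column_letter(3 + 1) = 'D' and 10 + 2 = 12, so the Bottom-Right coordinate becomes D12.
import pandas as pd
from openpyxl.utils import get_column_letter

df = pd.DataFrame({'year': range(10), 'Measure': ['Nominal'] * 10, 'Energy': range(10)})
coord = get_column_letter(len(df.columns) + 1) + str(len(df) + 2)
print(coord)  # D12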
Excel and SDMX Data Importing
The ExcelAndSdmxImporter class is responsible for importing the SDMX structures and the Excel data files into the .Stat Suite. It first imports the XML structure files and then the actual data together with its corresponding EDD file. Let's break down this process step by step.
class ExcelAndSdmxImporter(Importer):
    def import_dataflow(self, config: Config, data_info: DataInfo):
        self._start_message()
        # Paths for indices and prices structure and data
        indices_structure_file_path = os.path.join(self.data_loader.converted_directory, "INDICES_INFO_structure.xml")
        prices_structure_file_path = os.path.join(self.data_loader.converted_directory, "PRICES_INFO_structure.xml")
        indices_excel_file_path = os.path.join(self.data_loader.converted_directory, "an_indices.xlsx")
        prices_excel_file_path = os.path.join(self.data_loader.converted_directory, "an_prices.xlsx")
        indices_excel_data_description_path = os.path.splitext(indices_excel_file_path)[0] + "_excel_data_description.xml"
        prices_excel_data_description_path = os.path.splitext(prices_excel_file_path)[0] + "_excel_data_description.xml"
        # Import the structure XMLs
        self.import_structure(config, indices_structure_file_path)
        self.import_structure(config, prices_structure_file_path)
        # Import the Excel data and associated EDD XMLs
        self.import_excel_data(config, indices_excel_file_path, indices_excel_data_description_path)
        self.import_excel_data(config, prices_excel_file_path, prices_excel_data_description_path)

    def import_structure(self, config, structure_file_path: str):
        if structure_file_path.endswith('.xml'):
            self.import_xml(config, structure_file_path)

    def import_excel_data(self, config: Config, excel_file_path: str, excel_data_description_path: str):
        if excel_file_path.endswith('.xlsx') and excel_data_description_path.endswith('.xml'):
            payload = {
                'excelFile': excel_file_path,
                'eddFile': excel_data_description_path
            }
            resp = send_import_request(config.services.data, payload, main_token_process(config.token))
            print(resp)
            if resp.status_code == 400:
                print(resp.text)
                print(f"Error importing {excel_file_path} with {excel_data_description_path}")
Result
And that's it. By breaking down the technical complexities, we've shown that automation doesn't have to be daunting; it's all about using the right tools effectively.
We look forward to your feedback and are always open to discussing improvements or additional insights. Feel free to reach out via LinkedIn.