From Spreadsheets to SDMX: Effortless Conversion with Python and .Stat Suite
George Verouchis | Christos Gkatzis
Introduction
As data exchange and reporting requirements evolve, converting spreadsheets such as XLSX files into SDMX formats is becoming increasingly important for institutions. This article explains why this transition is necessary, how it can be efficiently automated, and which tools simplify the process. We focus on the .Stat Suite and walk through a practical example using World Bank data.
Why SDMX Adoption Is Becoming Essential
- A growing number of international organizations, such as Eurostat and the European Central Bank (ECB), mandate SDMX for data reporting. As a result, SDMX compliance is no longer optional but a requirement for many institutions.
- SDMX allows for more efficient and standardized data sharing between organizations, ensuring consistency in data interpretation and reporting across different systems.
Leveraging the .Stat Suite for Automation
The .Stat Suite is a powerful tool for automating XLSX-to-SDMX conversion. One key feature is the ability to upload an XLSX file together with an Excel Data Definition (EDD) file, "edd.xml". This capability can save institutions significant time and effort by automating what would otherwise be a tedious manual conversion process.
How it works:
- XLSX Upload: You upload the raw data in XLSX format.
- edd.xml File: This accompanying file provides the metadata and mapping instructions needed to map the data correctly to SDMX.
However, before utilizing this feature, certain prerequisites must be met.
Preparing Your Dataflow: DSD and Dataflow Creation
Before automating the conversion process, it’s essential to ensure that your Data Structure Definition (DSD) and Dataflow for the dataset are already prepared. Without these components in place, automating XLSX to SDMX conversion via the .Stat Suite will not be possible.
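The structures can be authored with the .Stat Suite's own tooling or generated programmatically, which is the route this article takes. For orientation, here is a minimal sketch of the skeleton of an SDMX v2.1 structure message, built with Python's standard ElementTree and mirroring what the generate_sdmx_xml method shown later produces; the full version, with code lists, concepts, and dimensions, appears in the implementation section.
import xml.etree.ElementTree as ET

# Skeleton of an SDMX v2.1 structure message: a Header plus an (empty)
# Structures element that will later hold Codelists, Concepts and the DSD.
root = ET.Element('message:Structure', {
    'xmlns:message': "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message",
    'xmlns:s': "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/structure",
    'xmlns:c': "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/common",
})
header = ET.SubElement(root, 'message:Header')
ET.SubElement(header, 'message:ID').text = 'IDREF1'
ET.SubElement(header, 'message:Test').text = 'false'
ET.SubElement(header, 'message:Prepared').text = '2021-03-01T13:41:02+00:00'
ET.SubElement(header, 'message:Sender', {'id': 'WBG'})
ET.SubElement(root, 'message:Structures')
ET.ElementTree(root).write('dsd_skeleton.xml', encoding='utf-8', xml_declaration=True)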
A Case Study: Converting World Bank Commodity Markets (Pink Sheet) Data
To illustrate the process, let’s take the example of converting World Bank’s Pink Sheet data. This dataset provides commodity price information and is typically published in XLSX format.
Steps:
- Retrieve the Data: Download the Pink Sheet files from the World Bank's official commodity markets page (https://www.worldbank.org/en/research/commodity-markets).
- Prepare the DSD: Define the structure for the Pink Sheet data, including commodity dimensions, time periods, and price units.
- Create the Dataflow and EDD: Set up a dataflow and EDD for the Pink Sheet dataset, linking it to the DSD.
- Upload: Using the .Stat Suite, upload the XLSX file along with the corresponding structures.
- Conversion: The .Stat Suite processes the files and converts the data into SDMX format, ready for reporting and sharing.
Implementation
The code defines four core classes: Downloader, Converter, DataLoader, and Importer, which provide the foundational structure for downloading, converting, and importing data.
Downloader(ABC)
The Downloader class is an abstract base class, responsible for managing data downloads. It includes methods for logging the end of a download and filling a Reporter with details about the downloaded files. Subclasses must implement the download method.
class Downloader(ABC):
    def __init__(self, data_loader: DataLoader) -> None:
        self.data_loader = data_loader

    def _end_message(self):
        message = f'Downloaded {self.data_loader.get_file_number()} file(s) with data, ' \
                  f'Avg file size: {self.data_loader.get_avg_file_size()}, ' \
                  f'Total Data: {self.data_loader.get_total_size()}'
        logger.info(message)

    def fill_reporter(self, reporter: Reporter):
        reporter.data_file_number = self.data_loader.get_file_number()
        reporter.avg_data_file_size = self.data_loader.get_avg_file_size()
        reporter.total_data = self.data_loader.get_total_size()

    @abstractmethod
    def download(self, data_info: DataInfo) -> ActionResult:
        """Downloads the data from the source to the downloaded_directory."""
Converter(ABC)
The Converter class is an abstract base class that manages dataset conversion using a DataLoader for file operations. It provides logging methods for the conversion process and requires subclasses to implement the convert method to handle the actual conversion.
class Converter(ABC):
    def __init__(self, data_loader: DataLoader) -> None:
        self.data_loader = data_loader

    def _start_message(self):
        message = f'Converting {self.data_loader.get_file_number()} file(s)'
        logger.info(message)

    def _end_message(self):
        message = f'Converted in {self.data_loader.get_file_number(True)} file(s)'
        logger.info(message)

    def fill_reporter(self, reporter: Reporter):
        reporter.file_number = self.data_loader.get_file_number(True)
        reporter.avg_file_size = self.data_loader.get_avg_file_size(True)

    @abstractmethod
    def convert(self) -> ActionResult:
        """
        Converts datasets from the downloaded directory and saves them to the converted directory.
        """
        pass
DataLoader
The DataLoader class manages file-related operations such as finding, deleting, and calculating file sizes. It works with downloaded and converted directories and provides utility methods to assist in managing data flows.
class DataLoader:
    def __init__(self, downloaded_directory: str, converted_directory: str) -> None:
        if not os.path.exists(downloaded_directory):
            raise ValueError('downloaded_directory does not exist')
        elif not os.path.exists(converted_directory):
            raise ValueError('converted_directory does not exist')
        self.downloaded_directory = downloaded_directory
        self.converted_directory = converted_directory

    def find_dataflows(self, converted: bool = False) -> list[str]:
        # Data files follow the naming convention 'data*.<ext>'.
        directory_path = self.converted_directory if converted else self.downloaded_directory
        xml_pattern = os.path.join(directory_path, 'data*.xml')
        zip_pattern = os.path.join(directory_path, 'data*.zip')
        xlsx_pattern = os.path.join(directory_path, 'data*.xlsx')
        csv_pattern = os.path.join(directory_path, 'data*.csv')
        return glob.glob(xml_pattern) + glob.glob(zip_pattern) + glob.glob(xlsx_pattern) + glob.glob(csv_pattern)

    def delete_dataflows(self, converted: bool = False):
        file_paths = self.find_dataflows(converted)
        for file_path in file_paths:
            os.remove(file_path)

    def get_file_size(self, file_path) -> float:
        # File size in MB
        return round(os.path.getsize(file_path) / (1024 * 1024), 4)

    def get_file_number(self, converted: bool = False) -> int:
        return len(self.find_dataflows(converted))

    def get_total_size(self, converted: bool = False) -> float:
        # Total size of all matched data files in MB
        dataflow_paths = self.find_dataflows(converted)
        total = sum(os.path.getsize(path) for path in dataflow_paths)
        return round(total / (1024 * 1024), 3)

    def get_avg_file_size(self, converted: bool = False) -> float:
        file_number = self.get_file_number(converted)
        return round(self.get_total_size(converted) / file_number, 3) if file_number else 0
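A quick usage sketch (the directory names are illustrative and must already exist, since the constructor enforces that):
import os

os.makedirs('downloaded', exist_ok=True)
os.makedirs('converted', exist_ok=True)

loader = DataLoader('downloaded', 'converted')
print(loader.get_file_number())        # files matching data*.{xml,zip,xlsx,csv} in 'downloaded'
print(loader.get_total_size())         # total size in MB
print(loader.get_avg_file_size(True))  # average file size in MB for 'converted'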
Importer(ABC)
The Importer class is an abstract base class designed to handle data imports. It includes a data_loader for managing files and an abstract method import_dataflow to be implemented by subclasses.
class Importer(ABC):
    def __init__(self, data_loader: DataLoader) -> None:
        self.data_loader = data_loader

    def _start_message(self):
        message = f'Importing {self.data_loader.get_file_number(True)} files'
        logger.info(message)

    @abstractmethod
    def import_dataflow(self, config: Config, data_info: DataInfo) -> ActionResult:
        """Imports dataflow to .Stat suite"""
Raw Data Preview
Processing the World Bank Commodity Data from the Pink Sheet poses several challenges. First, the sheer volume of data across multiple categories, such as energy, agriculture, and metals, can be overwhelming, especially when dealing with annual series in both nominal and real terms. Additionally, the complex column structure, with multiple nested headers and abbreviations, makes it difficult to map and clean the data accurately. To convert this into SDMX format, one must correctly define the data structure and dimensions, which requires meticulous attention to detail and an understanding of how each variable relates to the overall dataset.
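To see what the pipeline is up against, a quick peek at one of the raw sheets (assuming the annual file has been saved as data_annual_data.xlsx, the name used by the downloader below) shows the placeholder headers pandas produces, because the real headers live in merged rows near the top of the sheet:
import pandas as pd

df = pd.read_excel('data_annual_data.xlsx', sheet_name='Annual Indices (Nominal)')
print(df.columns[:4].tolist())
# ['World Bank Commodity Price Data (The Pink Sheet)', 'Unnamed: 1',
#  'Unnamed: 2', 'Unnamed: 3']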
Downloading Process
WebToExcelDownloader(Downloader): This class is the base class for downloading Excel files over HTTP. It implements the generic file download and leaves the source-specific download method to subclasses (a minimal example sketch follows the code).
class WebToExcelDownloader(Downloader):
    def __init__(self, data_loader: DataLoader) -> None:
        super().__init__(data_loader)

    def download_excel_file(self, url: str, file_name: str) -> str | None:
        response = requests.get(url)
        if response.status_code == 200:
            file_path = os.path.join(self.data_loader.downloaded_directory, file_name)
            with open(file_path, 'wb') as file:
                file.write(response.content)
            return file_path
        else:
            logger.error(f"Failed to download file from {url}. Status code: {response.status_code}")
            return None

    @abstractmethod
    def download(self, data_info: DataInfo) -> bool:
        pass

    def fill_reporter(self, reporter: Reporter):
        reporter.data_file_number = self.data_loader.get_file_number()
        reporter.avg_data_file_size = self.data_loader.get_avg_file_size()
        reporter.total_data = self.data_loader.get_total_size()
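As a minimal illustration of the contract, a subclass only has to implement download. The hypothetical example below fetches a single file; the URL and file name are placeholders, and the real implementation (WorldBankExcelDownloader) follows.
class SingleFileDownloader(WebToExcelDownloader):
    """Hypothetical example subclass: downloads one fixed file."""

    def download(self, data_info: DataInfo) -> bool:
        # Placeholder URL; a real subclass would derive this from data_info.
        path = self.download_excel_file('https://example.com/data.xlsx', 'data_example.xlsx')
        if path:
            self._end_message()
        return path is not None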
World Bank Data Downloading
WorldBankExcelDownloader(WebToExcelDownloader)
This subclass specifically downloads Excel files from the World Bank’s commodity markets webpage. It scrapes the page, looks for .xlsx links, and downloads monthly and annual files.
class WorldBankExcelDownloader(WebToExcelDownloader):
    def __init__(self, data_loader: DataLoader) -> None:
        super().__init__(data_loader)
download Method
Purpose:
The download method is responsible for fetching the World Bank's Pink Sheet data from their website. It scrapes the HTML content of the page, locates the links to the Excel files, and downloads them; it then processes the downloaded annual data.
How It Works:
- Sends a GET request to the World Bank commodity markets page to retrieve the HTML content.
- Parses the HTML using BeautifulSoup to identify links to .xlsx files.
- Downloads the Excel files using the download_excel_file method.
- If successful, processes the annual data by calling process_annual_data (this method and its helpers are walked through in detail in the WorldBankExcelConverter section below).
def download(self, data_info: DataInfo) -> bool:
    url = "http://www.worldbank.org/en/research/commodity-markets"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    # Note: verify=False disables TLS certificate verification; fine for a quick
    # script, but consider removing it (or supplying a CA bundle) in production.
    response = requests.get(url, headers=headers, verify=False)
    html_content = response.content
    soup = BeautifulSoup(html_content, 'html.parser')
    # Collect the monthly and annual .xlsx links from the page.
    xls_links = {}
    for link in soup.find_all('a', href=True):
        if link['href'].endswith('.xlsx'):
            link_text = link.get_text().strip().lower()
            if "monthly" in link_text:
                xls_links['monthly'] = link['href']
            elif "annual" in link_text:
                xls_links['annual'] = link['href']
    monthly_file = self.download_excel_file(xls_links.get('monthly'), 'data_monthly_data.xlsx')
    annual_file = self.download_excel_file(xls_links.get('annual'), 'data_annual_data.xlsx')
    if monthly_file:
        print(f"Monthly data downloaded to: {monthly_file}")
    if annual_file:
        print(f"Annual data downloaded to: {annual_file}")
        self.process_annual_data(annual_file)
    self._end_message()
    return True
WorldBankExcelConverter Class
The WorldBankExcelConverter class extends the Converter base class and is responsible for converting the downloaded World Bank Excel files into processed formats. It also generates the SDMX structures and metadata for the converted data. This class includes methods to handle data transformation and structure generation.
class WorldBankExcelConverter(Converter):
    def __init__(self, data_loader: DataLoader):
        super().__init__(data_loader)

    def convert(self):
        # The downloader saves the annual file as 'data_annual_data.xlsx'.
        annual_file = os.path.join(self.data_loader.downloaded_directory, 'data_annual_data.xlsx')
        if os.path.exists(annual_file):
            self.process_annual_data(annual_file)
process_annual_data Method
Purpose:
This method processes the annual Pink Sheet data by reading multiple sheets (nominal and real prices and indices) from the Excel file, cleaning and transforming the data, and then saving the processed data into new Excel files. It also generates the SDMX structure and the corresponding XML description for further use.
How It Works:
- Loads the annual Pink Sheet data from four specific sheets in the Excel file: nominal and real prices, and nominal and real indices.
- Cleans and processes the nominal and real indices using the process_indices method and merges them into one DataFrame.
- Cleans and processes the nominal and real prices using the process_prices method and merges them into a single DataFrame.
- Saves the processed indices and prices DataFrames to Excel files.
- Generates the SDMX structure for the indices and prices using the create_sdmx_structure method.
- Finally, creates XML files that describe the Excel data structure using the create_excel_data_description_xml method.
def process_annual_data(self, annual_file: str) -> None:
    an_prices_nominal = pd.read_excel(annual_file, sheet_name="Annual Prices (Nominal)")
    an_indices_nominal = pd.read_excel(annual_file, sheet_name="Annual Indices (Nominal)")
    an_prices_real = pd.read_excel(annual_file, sheet_name="Annual Prices (Real)")
    an_indices_real = pd.read_excel(annual_file, sheet_name="Annual Indices (Real)")
    # Process datasets
    an_indices_nominal, an_indices_real = self.process_indices(an_indices_nominal, an_indices_real)
    an_indices = pd.concat([an_indices_nominal, an_indices_real])
    an_prices_nominal, an_prices_real, merged_df = self.process_prices(an_prices_nominal, an_prices_real)
    # Save merged datasets as Excel files
    indices_xlsx_path = os.path.join(self.data_loader.converted_directory, "an_indices.xlsx")
    prices_xlsx_path = os.path.join(self.data_loader.converted_directory, "an_prices.xlsx")
    with pd.ExcelWriter(indices_xlsx_path) as writer:
        an_indices.to_excel(writer, index=False)
    with pd.ExcelWriter(prices_xlsx_path) as writer:
        merged_df.to_excel(writer, index=False)
    # Generate SDMX structures and Excel data description XMLs for both indices and prices
    self.create_sdmx_structure("INDICES_INFO", an_indices, "PinkSheetIndices", "Annual Indices - The World Bank Pinksheet", "DSD_INDICES_PINKSHEET")
    self.create_sdmx_structure("PRICES_INFO", merged_df, "PinkSheetPrices", "Annual Prices - The World Bank Pinksheet", "DSD_PRICES_PINKSHEET")
    self.create_excel_data_description_xml("PinkSheetIndices", indices_xlsx_path, "PinkSheetIndices_v1", "Sheet1", an_indices)
    self.create_excel_data_description_xml("PinkSheetPrices", prices_xlsx_path, "PinkSheetPrices_v1", "Sheet1", merged_df)
process_indices Method
Purpose:
This method cleans and transforms the nominal and real indices data. It renames columns, removes unnecessary rows, and adds a "Measure" column to indicate whether the data is nominal or real.
How It Works:
- Renames columns using a predefined mapping that standardizes column names.
- Drops irrelevant rows (e.g., header rows or empty rows) and resets the index so the DataFrame is clean and consistent.
- Adds a "Measure" column to distinguish between nominal and real data.
- Returns two cleaned DataFrames: one for nominal indices and one for real indices.
def process_indices(self, an_indices_nominal, an_indices_real) -> tuple:
    new_titles_nominal = {
        'World Bank Commodity Price Data (The Pink Sheet)': 'year',
        'Unnamed: 1': 'World Bank Commodity Price Index',
        'Unnamed: 2': 'Energy',
        'Unnamed: 3': 'Non-energy',
        'Unnamed: 4': 'Agriculture',
        'Unnamed: 5': 'Beverages',
        'Unnamed: 6': 'Food',
        'Unnamed: 7': 'Oils & Meals',
        'Unnamed: 8': 'Grains',
        'Unnamed: 9': 'Other food',
        'Unnamed: 10': 'Raw Materials',
        'Unnamed: 11': 'Timber',
        'Unnamed: 12': 'Other Raw Mat.',
        'Unnamed: 13': 'Fertilizers',
        'Unnamed: 14': 'Metals & Minerals',
        'Unnamed: 15': 'Base Metals (ex. iron ore)',
        'Unnamed: 16': 'Precious Metals',
    }
    an_indices_nominal = an_indices_nominal.rename(columns=new_titles_nominal)
    an_indices_nominal = an_indices_nominal.drop(range(0, 8)).reset_index(drop=True)
    new_titles_real = new_titles_nominal.copy()
    new_titles_real['Unnamed: 17'] = 'Deflator MUV Index'
    an_indices_real = an_indices_real.rename(columns=new_titles_real)
    an_indices_real = an_indices_real.drop(range(0, 8)).reset_index(drop=True)
    an_indices_nominal.insert(1, "Measure", "Nominal")
    an_indices_real.insert(1, "Measure", "Real")
    return an_indices_nominal, an_indices_real
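A toy example of the same rename/drop/insert pattern, using a two-column frame with the placeholder headers pandas assigns (the values are made up for illustration):
import pandas as pd

df = pd.DataFrame({
    'World Bank Commodity Price Data (The Pink Sheet)': ['', '', '', '', '', '', '', '', 1960, 1961],
    'Unnamed: 1': ['', '', '', '', '', '', '', '', 100.0, 101.5],
})
df = df.rename(columns={
    'World Bank Commodity Price Data (The Pink Sheet)': 'year',
    'Unnamed: 1': 'World Bank Commodity Price Index',
})
df = df.drop(range(0, 8)).reset_index(drop=True)  # drop the 8 junk header rows
df.insert(1, 'Measure', 'Nominal')                # tag the measure
print(df)
#    year  Measure  World Bank Commodity Price Index
# 0  1960  Nominal                             100.0
# 1  1961  Nominal                             101.5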
process_prices Method
Purpose:
This method processes the nominal and real prices data by cleaning, transforming, and merging it into a single DataFrame. It ensures the prices are correctly labeled and the data is ready for further processing.
How It Works:
- Extracts the header row (the commodity price names) and cleans up the price columns for both the nominal and real price DataFrames.
- Drops unnecessary rows to ensure the data starts from the correct point.
- Adds a "Measure" column to indicate whether the data is nominal or real.
- Merges the two cleaned DataFrames (nominal and real) into a single DataFrame and returns all three.
def process_prices(self, an_prices_nominal, an_prices_real) -> tuple:
    # Row 5 of each sheet holds the commodity names used as column headers.
    prices_nominal = an_prices_nominal.iloc[5].tolist()
    prices_real = an_prices_real.iloc[5].tolist()
    prices_nominal = self.clean_price_list(prices_nominal)
    prices_real = self.clean_price_list(prices_real)
    an_prices_nominal = an_prices_nominal.iloc[5:]
    an_prices_real = an_prices_real.iloc[5:]
    an_prices_nominal.columns = ['year', 'Measure'] + prices_nominal
    an_prices_real.columns = ['year', 'Measure'] + prices_real
    an_prices_nominal.insert(1, "Measure", "Nominal")
    an_prices_real.insert(1, "Measure", "Real")
    an_prices_nominal = an_prices_nominal.iloc[1:].reset_index(drop=True)
    an_prices_real = an_prices_real.iloc[1:].reset_index(drop=True)
    merged_df = pd.merge(an_prices_nominal, an_prices_real, how='outer')
    return an_prices_nominal, an_prices_real, merged_df
clean_price_list Method
Purpose:
This method cleans a list of price column names by removing unwanted characters and ensuring consistency in the column headers. It strips special characters such as @, #, and !, which can interfere with data processing.
How It Works:
- Removes nan values and special characters (e.g., !, @, #, $) from the price list.
- Trims trailing spaces from column names.
- Returns a cleaned price list that can be used as column headers.
def clean_price_list(self, price_list) -> list:
    price_list = [x for x in price_list if str(x) != 'nan']
    price_list = [x.replace("!", "").replace("@", "").replace("#", "").replace("$", "")
                   .replace("^", "").replace("&", "").replace("*", "") for x in price_list]
    price_list = [x[:-1] if x[-1] == " " else x for x in price_list]
    return price_list
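For example, assuming converter is a WorldBankExcelConverter instance, and with values mimicking the footnote markers found in the raw header row:
raw = ['Crude oil, avg ', 'Cocoa **', float('nan'), 'Aluminum']
print(converter.clean_price_list(raw))
# ['Crude oil, avg', 'Cocoa', 'Aluminum']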
create_sdmx_structure Method
Purpose:
This method generates the SDMX structure for the dataset, defining the concept scheme, dimensions, measures, and other metadata required for SDMX compliance. It calls the generate_sdmx_xml method to create the actual SDMX XML file.
How It Works:
- Generates column names and concept names for the SDMX structure based on the DataFrame.
- Calls generate_sdmx_xml to produce the XML file that contains the SDMX structure for the given dataset.
- The XML includes essential metadata such as the agency, version, dataflow, and structure reference IDs.
def create_sdmx_structure(self, codelist_name: str, dataframe: pd.DataFrame, dataflow_id: str, dataflow_name: str, structure_ref_id: str) -> None:
    column_names = self.generate_column_names(dataframe.columns)
    concept_names = ["Concept " + col for col in dataframe.columns]
    concept_ids = ["C_" + col for col in column_names]
    self.generate_sdmx_xml(
        codelist_name=codelist_name,
        code_names=dataframe.columns,
        code_ids=column_names,
        concept_names=concept_names,
        concept_ids=concept_ids,
        agency="WBG",
        version="1.0",
        dataflow_id=dataflow_id,
        dataflow_name=dataflow_name,
        structure_ref_id=structure_ref_id,
        structure_ref_agency="WBG"
    )
generate_column_names Method
Purpose:
This method generates standardized column names by converting DataFrame column names to uppercase and removing special characters and spaces, so that the names conform to SDMX ID conventions, which require clean, standardized identifiers.
How It Works:
- Converts column names to uppercase.
- Removes or replaces special characters (e.g., spaces, parentheses, slashes) to ensure the column names are valid for SDMX.
- Checks for duplicate column names and prints a warning if any are found.
def generate_column_names(self, columns: pd.Index) -> list:
    column_names = []
    for column in columns:
        column_name = (column.upper()
                       .replace(" ", "_")
                       .replace("(", "").replace(")", "")
                       .replace("-", "").replace("/", "")
                       .replace(".", "").replace(",", "").replace(":", "")
                       .replace("%", "_PERCENT")
                       .replace("&", "_AND"))
        column_names.append(column_name)
    seen = set()
    duplicates = set(x for x in column_names if x in seen or seen.add(x))
    if duplicates:
        print(f"Duplicate column names: {duplicates}")
    return column_names
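Tracing the replacement chain on a few real headers shows the effect; note that '&' expands to '_AND' after spaces have already become underscores, which yields a double underscore. Again, converter is assumed to be a WorldBankExcelConverter instance:
cols = pd.Index(['Base Metals (ex. iron ore)', 'Oils & Meals', 'year'])
print(converter.generate_column_names(cols))
# ['BASE_METALS_EX_IRON_ORE', 'OILS__AND_MEALS', 'YEAR']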
generate_sdmx_xml Method
Purpose:
This method generates an SDMX-compliant XML file that defines the structure of the data, including code lists, concept schemes, and the overall data structure definition. This XML file is essential for reporting and sharing the dataset in the SDMX format.
How It Works:
- Creates the root XML element and adds header information (e.g., ID, prepared date, sender).
- Defines the dataflow, data structure, and code lists for the dataset.
- Adds concepts such as frequency, commodity type, time period, and measure.
- Writes the final XML file to disk.
def generate_sdmx_xml(self, codelist_name: str, code_names: list, code_ids: list,
                      concept_names: list, concept_ids: list, agency: str, version: str,
                      dataflow_id: str, dataflow_name: str, structure_ref_id: str,
                      structure_ref_agency: str) -> None:
    # Create the root element
    root = ET.Element('message:Structure', {
        'xmlns:message': "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message",
        'xmlns:s': "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/structure",
        'xmlns:c': "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/common"
    })
    # Create Header element
    header = ET.SubElement(root, 'message:Header')
    ET.SubElement(header, 'message:ID').text = 'IDREF5'
    ET.SubElement(header, 'message:Test').text = 'false'
    ET.SubElement(header, 'message:Prepared').text = '2021-03-01T13:41:02.8010359+00:00'
    ET.SubElement(header, 'message:Sender', {'id': agency})
    # Create Structures element
    structures = ET.SubElement(root, 'message:Structures')
    # Create Dataflows element
    dataflows = ET.SubElement(structures, 's:Dataflows')
    dataflow = ET.SubElement(dataflows, 's:Dataflow', {
        'id': dataflow_id,
        'agencyID': structure_ref_agency,
        'version': "1.0",
        'isFinal': "true"
    })
    annotations = ET.SubElement(dataflow, 'c:Annotations')
    annotation = ET.SubElement(annotations, 'c:Annotation')
    ET.SubElement(annotation, 'c:AnnotationType').text = 'NonProductionDataflow'
    ET.SubElement(annotation, 'c:AnnotationText', {'xml:lang': "en"}).text = 'true'
    ET.SubElement(dataflow, 'c:Name', {'xml:lang': "en"}).text = dataflow_name
    structure = ET.SubElement(dataflow, 's:Structure')
    ET.SubElement(structure, 'Ref', {
        'id': structure_ref_id,
        'version': "1.0",
        'agencyID': structure_ref_agency,
        'package': "datastructure",
        'class': "DataStructure"
    })
    # Create Codelists element
    codelists = ET.SubElement(structures, 's:Codelists')
    # Codelist for the commodity columns
    codelist1 = ET.SubElement(codelists, 's:Codelist', {
        'agencyID': agency,
        'id': codelist_name,
        'isFinal': 'true',
        'urn': f"urn:sdmx:org.sdmx.infomodel.codelist.Codelist={agency}:{codelist_name}({version})",
        'version': version
    })
    # Add names to Codelist
    ET.SubElement(codelist1, 'c:Name', {'xml:lang': 'en'}).text = codelist_name
    # Add Codes to Codelist
    for code_id, code_name in zip(code_ids, code_names):
        if code_name == "Measure" or code_name == "year":
            continue
        code = ET.SubElement(codelist1, 's:Code', {
            'id': code_id,
            'urn': f"urn:sdmx:org.sdmx.infomodel.codelist.Code={agency}:{codelist_name}({version}).{code_id}"
        })
        ET.SubElement(code, 'c:Name', {'xml:lang': 'en'}).text = code_name
    # Codelist for the MEASURE dimension (Nominal / Real)
    codelist2 = ET.SubElement(codelists, 's:Codelist', {
        'agencyID': agency,
        'id': "MEASURE",
        'isFinal': 'true',
        'urn': f"urn:sdmx:org.sdmx.infomodel.codelist.Codelist={agency}:MEASURE({version})",
        'version': version
    })
    ET.SubElement(codelist2, 'c:Name', {'xml:lang': 'en'}).text = "MEASURE"
    for code_id, code_name in zip(["NOMINAL", "REAL"], ["Nominal", "Real"]):
        code = ET.SubElement(codelist2, 's:Code', {
            'id': code_id,
            'urn': f"urn:sdmx:org.sdmx.infomodel.codelist.Code={agency}:MEASURE({version}).{code_id}"
        })
        ET.SubElement(code, 'c:Name', {'xml:lang': 'en'}).text = code_name
    # Create Concepts element
    concepts = ET.SubElement(structures, 's:Concepts')
    concept_scheme = ET.SubElement(concepts, 's:ConceptScheme', {
        'id': "CS_" + structure_ref_id,
        'urn': f"urn:sdmx:org.sdmx.infomodel.conceptscheme.ConceptScheme=WBG:CS_{structure_ref_id}(1.0)",
        'agencyID': "WBG",
        'version': "1.0",
        'isFinal': "true"
    })
    ET.SubElement(concept_scheme, 'c:Name', {'xml:lang': "en"}).text = 'Concept Scheme for Energy, Agriculture, and Metals'
    ET.SubElement(concept_scheme, 'c:Description', {'xml:lang': "en"}).text = 'List of Concepts to be used in DSD for Energy, Agriculture, and Metals'
    concept_list = [
        ("FREQ", "Frequency", "The frequency of the observation (e.g., monthly, quarterly, annually)."),
        ("COMMODITY_TYPE", "Commodity Type", "The type of commodity (e.g., crude oil, cocoa, aluminum)."),
        ("TIME_PERIOD", "Time Period", "Reference date or date range the observed value refers to."),
        ("OBS_VALUE", "Observation value", "Actual observation value."),
        ("MEASURE", "Measure", "The type of measure (e.g., nominal, real).")
    ]
    for concept_id, concept_name, concept_description in concept_list:
        concept = ET.SubElement(concept_scheme, 's:Concept', {
            'id': concept_id,
            'urn': f"urn:sdmx:org.sdmx.infomodel.conceptscheme.Concept=WBG:CS_{structure_ref_id}(1.0).{concept_id}"
        })
        ET.SubElement(concept, 'c:Name', {'xml:lang': "en"}).text = concept_name
        ET.SubElement(concept, 'c:Description', {'xml:lang': "en"}).text = concept_description
    # Create DataStructures element
    data_structures = ET.SubElement(structures, 's:DataStructures')
    data_structure = ET.SubElement(data_structures, 's:DataStructure', {
        'id': structure_ref_id,
        'urn': f"urn:sdmx:org.sdmx.infomodel.datastructure.DataStructure=WBG:{structure_ref_id}(1.0)",
        'agencyID': "WBG",
        'version': "1.0",
        'isFinal': "true"
    })
    ET.SubElement(data_structure, 'c:Name', {'xml:lang': "en"}).text = 'DSD: Energy, Agriculture, and Metals'
    components = ET.SubElement(data_structure, 's:DataStructureComponents')
    dimension_list = ET.SubElement(components, 's:DimensionList', {
        'id': "DimensionDescriptor",
        'urn': f"urn:sdmx:org.sdmx.infomodel.datastructure.DimensionDescriptor=WBG:{structure_ref_id}(1.0).DimensionDescriptor"
    })
    dimension = ET.SubElement(dimension_list, 's:Dimension', {
        'id': "COMMODITY_TYPE",
        'urn': f"urn:sdmx:org.sdmx.infomodel.datastructure.Dimension=WBG:{structure_ref_id}(1.0).COMMODITY_TYPE",
        'position': "2"
    })
    concept_identity = ET.SubElement(dimension, 's:ConceptIdentity')
    ET.SubElement(concept_identity, 'Ref', {
        'id': "COMMODITY_TYPE",
        'maintainableParentID': "CS_" + structure_ref_id,
        'maintainableParentVersion': "1.0",
        'agencyID': "WBG",
        'package': "conceptscheme",
        'class': "Concept"
    })
    local_representation = ET.SubElement(dimension, 's:LocalRepresentation')
    enumeration = ET.SubElement(local_representation, 's:Enumeration')
    ET.SubElement(enumeration, 'Ref', {
        'id': codelist_name,
        'version': "1.0",
        'agencyID': "WBG",
        'package': "codelist",
        'class': "Codelist"
    })
    dimension = ET.SubElement(dimension_list, 's:Dimension', {
        'id': "MEASURE",
        'urn': f"urn:sdmx:org.sdmx.infomodel.datastructure.Dimension=WBG:{structure_ref_id}(1.0).MEASURE",
        'position': "1"
    })
    concept_identity = ET.SubElement(dimension, 's:ConceptIdentity')
    ET.SubElement(concept_identity, 'Ref', {
        'id': "MEASURE",
        'maintainableParentID': "CS_" + structure_ref_id,
        'maintainableParentVersion': "1.0",
        'agencyID': "WBG",
        'package': "conceptscheme",
        'class': "Concept"
    })
    local_representation = ET.SubElement(dimension, 's:LocalRepresentation')
    enumeration = ET.SubElement(local_representation, 's:Enumeration')
    ET.SubElement(enumeration, 'Ref', {
        'id': "MEASURE",
        'version': "1.0",
        'agencyID': "WBG",
        'package': "codelist",
        'class': "Codelist"
    })
    time_dimension = ET.SubElement(dimension_list, 's:TimeDimension', {
        'id': "TIME_PERIOD",
        'urn': f"urn:sdmx:org.sdmx.infomodel.datastructure.TimeDimension=WBG:{structure_ref_id}(1.0).TIME_PERIOD",
        'position': "3"
    })
    concept_identity = ET.SubElement(time_dimension, 's:ConceptIdentity')
    ET.SubElement(concept_identity, 'Ref', {
        'id': "TIME_PERIOD",
        'maintainableParentID': "CS_" + structure_ref_id,
        'maintainableParentVersion': "1.0",
        'agencyID': "WBG",
        'package': "conceptscheme",
        'class': "Concept"
    })
    local_representation = ET.SubElement(time_dimension, 's:LocalRepresentation')
    ET.SubElement(local_representation, 's:TextFormat', {'textType': "ObservationalTimePeriod"})
    measure_list = ET.SubElement(components, 's:MeasureList', {
        'id': "MeasureDescriptor",
        'urn': f"urn:sdmx:org.sdmx.infomodel.datastructure.MeasureDescriptor=WBG:{structure_ref_id}(1.0).MeasureDescriptor"
    })
    primary_measure = ET.SubElement(measure_list, 's:PrimaryMeasure', {
        'id': "OBS_VALUE",
        'urn': f"urn:sdmx:org.sdmx.infomodel.datastructure.PrimaryMeasure=WBG:{structure_ref_id}(1.0).OBS_VALUE"
    })
    concept_identity = ET.SubElement(primary_measure, 's:ConceptIdentity')
    ET.SubElement(concept_identity, 'Ref', {
        'id': "OBS_VALUE",
        'maintainableParentID': "CS_" + structure_ref_id,
        'maintainableParentVersion': "1.0",
        'agencyID': "WBG",
        'package': "conceptscheme",
        'class': "Concept"
    })
    # Write the result to the converted directory, where the importer expects it.
    tree = ET.ElementTree(root)
    output_path = os.path.join(self.data_loader.converted_directory, f"{codelist_name}_structure.xml")
    with open(output_path, "wb") as files:
        tree.write(files, encoding='utf-8', xml_declaration=True)
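A quick sanity check on the generated file, parsing it back with the standard library and listing the codelists and DSD it declares (the path assumes the file was written to the converted directory, as above):
import xml.etree.ElementTree as ET

S = '{http://www.sdmx.org/resources/sdmxml/schemas/v2_1/structure}'
root = ET.parse('converted/INDICES_INFO_structure.xml').getroot()
print([cl.get('id') for cl in root.iter(S + 'Codelist')])      # ['INDICES_INFO', 'MEASURE']
print([ds.get('id') for ds in root.iter(S + 'DataStructure')]) # ['DSD_INDICES_PINKSHEET']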
create_excel_data_description_xml Method
Purpose:
This method generates the EDD XML file that describes the structure of the Excel data, including the worksheet name, the layout of the sheet (the top-left and bottom-right cell coordinates of the data range), and the mapping between Excel cells and SDMX dimensions. This file drives the Excel-to-SDMX conversion in the .Stat Suite.
How It Works:
- Determines the size of the Excel data by calculating the bottom-right coordinate.
- Constructs the XML with the Excel data description, including the worksheet name and the coordinates of the data range.
- Saves the XML file to disk.
def create_excel_data_description_xml(self, dataflow_name: str, data_file_path: str, coord_mapping_name: str, sheet_name: str, dataframe: pd.DataFrame) -> None:
    max_row = len(dataframe)
    max_column = len(dataframe.columns)
    bottom_right_coord = openpyxl.utils.get_column_letter(max_column + 1) + str(max_row + 2)
    root = ET.Element("Excel-Data-Description", {
        "Dataflow": dataflow_name,
        "Agency": "WBG",
        "Version": "1.0"
    })
    coord_mapping = ET.SubElement(root, "Coord-Mapping", {
        "Name": coord_mapping_name,
        "ExcludeNulls": "false",
        "IsActive": "true"
    })
    worksheets = ET.SubElement(coord_mapping, "Worksheets")
    ET.SubElement(worksheets, "Worksheet", {"Name": sheet_name})
    ET.SubElement(coord_mapping, "Top-Left", {"Coord": "C2"})
    ET.SubElement(coord_mapping, "Bottom-Right", {"Coord": bottom_right_coord})
    # Save the XML file
    xml_file_name = os.path.splitext(data_file_path)[0] + "_excel_data_description.xml"
    tree = ET.ElementTree(root)
    with open(xml_file_name, "wb") as xml_file:
        tree.write(xml_file, encoding="utf-8", xml_declaration=True)
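For a concrete feel of the coordinate arithmetic: a 10-row, 3-column DataFrame yields get_column_letter(3 + 1) = 'D' and 10 + 2 = 12, so the Bottom-Right coordinate becomes D12.
import pandas as pd
from openpyxl.utils import get_column_letter

df = pd.DataFrame({'year': range(10), 'Measure': ['Nominal'] * 10, 'Energy': range(10)})
coord = get_column_letter(len(df.columns) + 1) + str(len(df) + 2)
print(coord)  # D12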
Excel and SDMX Data Importing
The ExcelAndSdmxImporter class is responsible for importing the SDMX structures and the Excel data files into the .Stat Suite. It first imports the XML structure files and then the actual data together with its corresponding EDD file. Let's break down this process step by step.
class ExcelAndSdmxImporter(Importer):
    def import_dataflow(self, config: Config, data_info: DataInfo):
        self._start_message()
        # Paths for indices and prices structure and data
        indices_structure_file_path = os.path.join(self.data_loader.converted_directory, "INDICES_INFO_structure.xml")
        prices_structure_file_path = os.path.join(self.data_loader.converted_directory, "PRICES_INFO_structure.xml")
        indices_excel_file_path = os.path.join(self.data_loader.converted_directory, "an_indices.xlsx")
        prices_excel_file_path = os.path.join(self.data_loader.converted_directory, "an_prices.xlsx")
        indices_excel_data_description_path = os.path.splitext(indices_excel_file_path)[0] + "_excel_data_description.xml"
        prices_excel_data_description_path = os.path.splitext(prices_excel_file_path)[0] + "_excel_data_description.xml"
        # Import the structure XMLs
        self.import_structure(config, indices_structure_file_path)
        self.import_structure(config, prices_structure_file_path)
        # Import the Excel data and associated EDD XMLs
        self.import_excel_data(config, indices_excel_file_path, indices_excel_data_description_path)
        self.import_excel_data(config, prices_excel_file_path, prices_excel_data_description_path)

    def import_structure(self, config, structure_file_path: str):
        if structure_file_path.endswith('.xml'):
            self.import_xml(config, structure_file_path)

    def import_excel_data(self, config: Config, excel_file_path: str, excel_data_description_path: str):
        if excel_file_path.endswith('.xlsx') and excel_data_description_path.endswith('.xml'):
            payload = {
                'excelFile': excel_file_path,
                'eddFile': excel_data_description_path
            }
            resp = send_import_request(config.services.data, payload, main_token_process(config.token))
            print(resp)
            if resp.status_code == 400:
                print(resp.text)
                print(f"Error importing {excel_file_path} with {excel_data_description_path}")
Result
And that's it. By breaking down the technical complexities, we've shown that automation doesn't have to be daunting; it's all about using the right tools effectively.
We look forward to your feedback and are always open to discussing improvements or additional insights. Feel free to reach out via LinkedIn.