Assessing trends in the Danish Labour Market with DeiC Interactive HPC

By combining large language models with the computational power of DeiC Interactive HPC, master's student Rentian Zhu analysed 3 million job postings to reveal patterns in the skills Danish employers seek and how these skills relate to firm performance.

Rentian Zhu, PhD fellow, Department of Economics, Copenhagen Business School.
Anne Rahbek-Damm
Journalist and communications consultant
01.07.2025 09:00

“The project was challenging due to the volume of unstructured text data spanning both Danish and English. Without structured tags or categories, these job postings required substantial data processing, far beyond what could be handled by a standard computer, to become useful.” Rentian Zhu, PhD fellow, Department of Economics, CBS.

To address these complexities, Rentian utilized DeiC Interactive HPC to process and analyse the data effectively. Through advanced techniques and the support of large language models like BERT and GPT-3.5/4, he extracted, classified, and quantified skill demands across industries and job roles.

He first became aware of DeiC Interactive through a former colleague:

“I became aware of DeiC Interactive through a former colleague, an NLP enthusiast who highly recommended it,” Rentian says. “The process of resource allocation was incredibly smooth, thanks to the outstanding support from Kristoffer and Lars (CBS FO support, ed.). They were not only extremely helpful but also kind and patient, ensuring that all my questions were addressed and that I could efficiently set up and utilize the resources for my research.”

Building a Data Pipeline for Skill Extraction

To process and analyze the vast amount of data, Rentian developed a robust data pipeline. Key steps included:

Data Preprocessing: Normalizing the text data by removing inconsistencies, identifying the language (Danish or English), and tokenizing the text into manageable pieces. Tokenization provides a structured foundation for the LLMs, enabling them to interpret the text accurately.

Skill Extraction: Using the language model BERT to analyze and extract key features from the text, and enhancing skill-recognition accuracy with GPT-3.5 and GPT-4 through specialized prompt strategies tailored to labor market terminology. These strategies help the models better understand nuanced skill requirements.

Skill Categorization: Creating a hierarchical classification of skills by integrating outputs from BERT and GPT models using LangChain (a library that helps coordinate the flow of information between different AI models). This classification of skills makes it possible to analyze patterns across industries and job roles more effectively.

Linking Skills to Firm Metrics: To analyze the association between aggregated skill categories at the firm level and performance metrics such as revenue growth and profitability, Rentian applied advanced econometric techniques. Employing panel data models, including Fixed Effects, he and his supervisors uncovered how categorized workforce skills are associated with organizational outcomes, shedding light on labor-driven factors that influence profitability, as sketched below.
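A minimal sketch of what such a fixed-effects regression can look like in Python, assuming the linearmodels library and hypothetical firm-level variable and file names (illustrative only, not code from the project):

# Illustrative fixed-effects panel regression; variable and file names are hypothetical
import pandas as pd
from linearmodels.panel import PanelOLS

# One row per firm-year, with skill-category shares and a performance metric
df = pd.read_parquet("firm_year_skills.parquet")
df = df.set_index(["firm_id", "year"])  # entity and time dimensions of the panel

# Regress revenue growth on skill shares, absorbing firm and year fixed effects
model = PanelOLS.from_formula(
    "revenue_growth ~ share_data_analysis + share_programming + share_management"
    " + EntityEffects + TimeEffects",
    data=df,
)
result = model.fit(cov_type="clustered", cluster_entity=True)  # cluster standard errors by firm
print(result.summary)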

About the project

The project began in 2022, and approx. 36K core hours were used for the initial phase. In December 2024, 26K core hours were awarded through the national DeiC call for the next phase of the project, which will take place in January-December 2025.

RDM Support Team at CBS

The RDM Support team at CBS Library is the central resource for CBS researchers and students seeking expertise in high-performance computing (HPC) and research data management (RDM). The HPC support encompasses advising on and allocating resources, addressing technical challenges, assisting with code development, and providing teaching, documentation, and tutorials. In terms of RDM, the team guides researchers on complying with funder and publisher requirements, writing data management plans, navigating legal considerations, and selecting IT infrastructures that best suit their data management needs.

Optimizing Computational Resources for Complex Data Analysis

To meet the heavy computational demands of his project, Rentian took a strategic approach, distributing tasks efficiently across GPUs and CPUs to make the most of DeiC Interactive HPC’s power.

Fine-Tuning BERT on Job Postings: One major task involved fine-tuning BERT to classify skills from over 3 million job postings. This required significant GPU power, which Rentian optimized through parallel processing across several GPUs. By batching data and using techniques like gradient accumulation, he managed to process large amounts of data without exceeding hardware limits. Meanwhile, CPUs handled data preparation and orchestration, ensuring smooth workflow coordination.

Code Example 1: Showcasing how GPUs were utilized for BERT fine-tuning

from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset

# Load dataset and tokenizer
dataset = load_dataset("csv", data_files={"train": "job_postings.csv"})
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Preprocessing function
def preprocess(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)

# Apply preprocessing in batches
tokenized_dataset = dataset.map(preprocess, batched=True)

# Per-device batch size; adjust based on available GPU memory
batch_size = 32
train_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="no",  # no eval dataset is passed to the Trainer below
    save_strategy="epoch",
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=2,  # To simulate larger batch size
    num_train_epochs=3,
    logging_dir="./logs",
    logging_steps=10,
    dataloader_num_workers=4,  # Optimize data loading
    fp16=True  # Enable mixed precision for speed
)

# Load BERT model (num_labels defaults to 2; set it to the number of skill categories in practice)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# Trainer
trainer = Trainer(
    model=model,
    args=train_args,
    train_dataset=tokenized_dataset['train']
)

# Start training
trainer.train()

Prompt Engineering for GPT Models

In addition to fine-tuning BERT, Rentian employed GPT-3.5 and GPT-4 for hierarchical skill classification. Using advanced prompt engineering techniques, he guided the models to provide accurate and structured outputs. These included:

  • Chain of Thought Reasoning: Guiding the model to break down its decision-making process step by step.
  • Few-Shot Learning: Including curated examples within prompts to demonstrate the desired output format and improve clarity.

Code Example 2: Sample prompt used in GPT classification

# Prompt Strategy 2: Chain of Thought
custom_prompt2 = """As an expert in job market analysis, carefully examine the following job advertisement. Your task is to identify and extract key skills and qualifications mentioned in the ad. 
Once identified, categorize these skills into the appropriate categories from the list provided: 'CRM','Computer Support and Networking','Data Analysis',
'Digital Design','Digital Marketing','Machining and Manufacturing Technology','Productivity','Programming and Software Development', 'Character','Cognitive','Customerservice','Financial', 'Management','Social','Writing_language'. For each identified skill or qualification, provide a brief explanation of why it fits into its respective category. Your analysis should be detailed and precise to ensure accuracy in categorization. Format your response as a structured JSON object with each category as a key. Under each key, list the identified skills along with a brief explanation for each. If a skill does not fit any of the listed categories, classify it under 'Other'. Ensure your response is well-organized and easy to understand. Here's an example structure for your JSON output. Now job posting begins:"""

Few-Shot Learning:
# Prompt Strategy 4: Few Shot Learning
custom_prompt4 = """
Example 1:
Job Advertisement: "Seeking a data analyst with experience in SQL, Python, and data visualization. Must possess strong analytical skills and be familiar with machine learning techniques."
Extracted Skills:
  "Data Analysis": ["SQL", "Python”, "Data visualization”,”Machine learning techniques”],
  "Cognitive": ["Analytical skills"]
Example 2:
Job Advertisement: "Digital marketing specialist required with expertise in social media advertising, content creation, and SEO. Must have excellent writing skills and be a creative thinker."
Extracted Skills:
  "Digital Marketing": [“social media advertising", "SEO"],
  "Writing_language": [“content creation", "Excellent writing skills"],
  “Cognitive: [“creative thinker”]
Your task is to identify and extract key skills and qualifications mentioned in the ad. Once identified, categorize these skills into the appropriate categories from the list provided: 'CRM','Computer Support and Networking','Data Analysis','Digital Design','Digital Marketing','Machining and Manufacturing Technology','Productivity','Programming and Software Development', 'Character','Cognitive','Customerservice','Financial', 'Management','Social','Writing_language'
For each identified skill or qualification, provide a brief explanation of why it fits into its respective category. Your analysis should be detailed and precise to ensure accuracy in categorization.
Format your response as a structured JSON object with each category as a key. Under each key, list the identified skills along with a brief explanation for each. If a skill does not fit any of the listed categories, classify it under 'Other'. Ensure your response is well-organized and easy to understand.
Now, analyze the following job advertisement:
""".strip()

By strategically framing prompts, Rentian achieved precise and reliable results without altering the models’ underlying parameters.
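For illustration, a prompt such as custom_prompt2 above could be submitted to a GPT model roughly as follows. This sketch assumes the official OpenAI Python client (openai >= 1.0); the model name and the job_posting variable are placeholders, not details taken from Rentian's pipeline:

# Illustrative call to the OpenAI chat completions API; model name and input text are placeholders
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

job_posting = "Seeking a data analyst with experience in SQL and Python."  # placeholder ad

response = client.chat.completions.create(
    model="gpt-4",  # or "gpt-3.5-turbo"
    messages=[
        {"role": "system", "content": "You are an expert in job market analysis."},
        {"role": "user", "content": custom_prompt2 + "\n" + job_posting},
    ],
    temperature=0,  # deterministic output keeps the classification reproducible
)

print(response.choices[0].message.content)  # structured JSON with skills per category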

"With the additional national HPC resources, I can now venture into more detailed hypotheses, incorporate a richer variety of data, and apply more advanced modeling techniques. This will help me probe the mechanisms that shape work patterns and skill demands and hopefully lead to a more rigorous scientific understanding of Denmark’s evolving labor market."

Rentian Zhu
PhD Fellow
Copenhagen Business School

Advanced Data Management

Efficient data management was critical to streamlining Rentian’s workflow. He used the Parquet format to compress and structure data, reducing 47 GB of raw data to 25 GB after preprocessing. Distributed computing allowed tasks to be split across multiple processors, accelerating execution and ensuring the entire pipeline ran efficiently.
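As a simple illustration of the idea (file and column names are hypothetical, not taken from the project), converting raw CSV data to compressed Parquet with pandas might look like this:

# Illustrative CSV-to-Parquet conversion; file and column names are hypothetical
import pandas as pd

# Read the raw postings in chunks to keep memory usage manageable
chunks = pd.read_csv("job_postings_raw.csv", chunksize=500_000)
cleaned = pd.concat(chunk.dropna(subset=["text"]) for chunk in chunks)

# Write to Parquet with Snappy compression (requires pyarrow or fastparquet)
cleaned.to_parquet("job_postings.parquet", compression="snappy", index=False)

# Downstream steps can then read only the columns they need
texts = pd.read_parquet("job_postings.parquet", columns=["text", "language"])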

Moving Forward: Securing Resources from the National Call

Building on this foundational work, Rentian, now a PhD student, continues to broaden the scope of his research. In December 2024, he was awarded 26K core hours through DeiC's national call, enabling him to combine his extensive data material with registry and time-use data, and to fine-tune the language models for more complex workflows. The expanded setup will not only allow him to deepen his investigation into flexible work arrangements and their broader economic and social effects but also improve efficiency and reduce processing time compared to standard computing environments.

Want to get started using HPC?
Find HPC resources here: Søg om regnekraft (apply for compute resources)