Data Scraping and Parsing Weather Records from the Japan Meteorological Agency (JMA) Website Using Python
Introduction
The Japan Meteorological Agency (JMA) monitors and records various weather data, including rainfall, wind speed and direction, temperature and relative humidity, from over 840 sites across Japan through its Automated Meteorological Data Acquisition System (AMEDAS). Much of this data is publicly available online, and some datasets can be downloaded as CSV files. However, learning to scrape and parse weather data directly from the Japan Meteorological Agency website using Python is useful because:
- Not all JMA datasets are available as CSV downloads
- Downloading hundreds of days of data from multiple weather stations manually can be painful
Data scraping allows you to retrieve any weather data available on the JMA website in a consistent and reproducible manner. Although it will require some time and effort upfront to write the code, once set up, the code can be easily modified to scrape data from different dates or weather stations.
In this tutorial, I’ll walk through a method to scrape and parse 10-minute interval weather data of the Tokyo JMA weather station using Python libraries requests and Beautiful Soup.
What You’ll Learn in This Tutorial
By the end of this tutorial, you’ll learn how to:
- Scrape tabular weather data from JMA's website using the `requests` library
- Parse and give structure to the raw data with the `Beautiful Soup` library
- Clean and save the data as a CSV file
If you prefer to skip the explanations and jump straight to the implementation, you can download this blog post as a Jupyter notebook.
Here is the list of things you'll need to run the code.
Prerequisites
- A copy of this blog as a Jupyter notebook
- ‘SampleData_JMA’ subfolder for saving the CSV output
- Python libraries: `requests`, `bs4` and `pandas`
Jargon
Data scraping: Using code to extract information from a web page or a source that’s primarily designed for human viewing. In this blog post, we’ll be pulling data from an HTML table on the JMA site.
Data parsing: Converting raw or unstructured text into something structured and easy to work with. For example, turning HTML table rows into a tidy pandas DataFrame.
Scrape, Parse and Save Japan Weather Data
The outline for collecting weather data from the JMA website is as follows:
- Find the URL you want to retrieve data from
- Use the `requests` library to scrape all data on the web page
- Use an HTML parser and the `Beautiful Soup` library to give structure to the raw data
- Navigate the HTML to extract just the desired data
- Convert the desired data into a `pandas` DataFrame
- Save the DataFrame as a CSV file
Let’s start with step 1.
Step 1: Designate the URL with the Weather Data
First, we’ll need the URL of the JMA webpage that contains the specific data we want to scrape. If you’d like to know the actual data scraping and parsing methods and do not care much for finding a specific JMA station, skip ahead to step 2.
For those interested in finding your own JMA weather station, you can navigate the records of past weather data from here. Refer to Figure 1 below to:
- Narrow down on the JMA weather site by designating the prefecture (red box)
- Pinpoint the observation site within the prefecture (orange box, activated once you designate the prefecture)
- Choose the year, month and day for the desired data (green box)
- Select the type of weather data to access (blue box)
You may need a combination of Google maps, a translation app and trial and error to navigate to your desired URL. You’ll also notice that the column headers in the tables differ depending on the type of data you’ve selected from the blue box in Figure 1.
For the remainder of this blog post, we'll work with the 10-minute interval weather records. You can select the "10分ごとの値" (10-minute values) option, located in the left column within the blue box, second from the bottom, for a JMA observation site and date of your preference. I've provided a default test URL below, which accesses 10-minute interval weather records from the Tokyo site on March 9, 2024.
```python
# URL containing 10-minute interval weather data from Tokyo site on March 9, 2024
# Can be replaced with 10-minute data from any 's1' type observation site
testURL = 'https://www.data.jma.go.jp/stats/etrn/view/10min_s1.php?prec_no=44&block_no=47662&year=2024&month=3&day=9&view='
```

Accessing this URL, we find the tabular weather data which we will be scraping:
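Incidentally, the query parameters embedded in `testURL` (`prec_no`, `block_no`, `year`, `month`, `day`) can be filled in programmatically when you want other dates or stations. A minimal sketch, where the helper function name is my own and the `prec_no`/`block_no` values come from the station chosen in step 1:

```python
# Build a JMA 10-minute ('s1' type) data URL from its query parameters.
# The parameter names come straight from the test URL above;
# the helper function name is illustrative.
def build_jma_10min_url(prec_no, block_no, year, month, day):
    base = 'https://www.data.jma.go.jp/stats/etrn/view/10min_s1.php'
    return (f'{base}?prec_no={prec_no}&block_no={block_no}'
            f'&year={year}&month={month}&day={day}&view=')

# Reproduce the Tokyo URL used in this tutorial
testURL = build_jma_10min_url(prec_no=44, block_no=47662, year=2024, month=3, day=9)
print(testURL)
```

Looping this helper over a range of dates is what makes scraping "hundreds of days of data" practical.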
Step 2: Data Scraping with the requests Library
Next, we’d like to scrape all the HTML from the webpage. We’ll need to import the requests library to do the scraping.
```python
# Library required for data scraping
import requests
```

The `requests.get()` function from this library sends an HTTP GET request to the designated URL:
```python
# HTTP GET request to testURL
# Take care to avoid sending multiple requests in a very short period of time
step2_resp = requests.get(testURL)
```

This returns a Response object, a class defined by the requests library that stores the server's reply when accessing the designated URL. The Response object has many attributes, including its content.
```python
print("Returned object type:\n\t", type(step2_resp))

# Properties of the Response object
step2_prop = step2_resp.__dict__.keys()
print("\nProperties and attributes in object:")
for prop_now in step2_prop:
    print('\t', prop_now)
```

```
Returned object type:
	 <class 'requests.models.Response'>

Properties and attributes in object:
	 _content
	 _content_consumed
	 _next
	 status_code
	 headers
	 raw
	 url
	 encoding
	 history
	 reason
	 cookies
	 elapsed
	 request
	 connection
```
Some of the other Response object attributes are useful for other applications, such as error handling. For now, we’ll ignore these use cases and parse the HTML content using Beautiful Soup.
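That said, if you do want basic error handling, here is a minimal sketch. The helper name and the pause length are my own choices; `raise_for_status()` checks the HTTP status code for you, and `time.sleep()` keeps repeated calls polite, per the caution in step 2:

```python
import time
import requests

def polite_get(url, pause_sec=2.0):
    """Fetch a URL, fail loudly on HTTP errors, then pause so that
    repeated calls don't overload the server.
    (Helper name and pause length are illustrative choices.)"""
    resp = requests.get(url)
    resp.raise_for_status()  # raises requests.HTTPError on 4xx/5xx status codes
    time.sleep(pause_sec)    # be kind to the JMA servers between requests
    return resp
```

When scraping many dates in a loop, wrapping the fetch this way means a dead link or server hiccup stops the run with a clear error instead of silently filling your DataFrame with garbage.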
Step 3: Initial HTML Parsing with Beautiful Soup
With the response in hand, we can now parse the raw HTML. Import the BeautifulSoup class from bs4:

```python
# Import the BeautifulSoup class from the bs4 library
from bs4 import BeautifulSoup
```

Much like how humans understand a sequence of words only when it follows a set of known grammatical rules, a computer program needs to know what to expect to make sense of raw data. In other words, when a program "reads" HTML, it needs an HTML parser to understand and process it.
We'll use Python's built-in HTML parser, `html.parser`, to make sense of the contents of the Response object from step 2. This will return a BeautifulSoup object.
```python
# HTML parser to make sense of the Response object
step3_soup = BeautifulSoup(step2_resp.content, "html.parser")
print("Returned object type:\n\t", type(step3_soup))
```

```
Returned object type:
	 <class 'bs4.BeautifulSoup'>
```
After parsing the HTML with BeautifulSoup, we can safely navigate its contents. HTML consists of tags that mark specific sections within the content. Most elements follow the pattern:

`<tag_name> ...content... </tag_name>`

Tags are often nested hierarchically. For example, every webpage can be broadly divided into two major sections, each with subsections:

- `<head>` contains metadata, scripts and the page title
- `<body>` contains the visible content, including the tabular data
Let us print just the `<head>` section to see what structured HTML looks like:

```python
print(step3_soup.head.prettify())
```

```html
<head>
 <meta charset="utf-8"/>
 <title>
  気象庁|過去の気象データ検索
 </title>
 <meta content="気象庁 Japan Meteorological Agency" name="Author"/>
 <meta content="気象庁 Japan Meteorological Agency" name="keywords"/>
 <meta content="気象庁|過去の気象データ検索" name="description"/>
 <meta content="text/css" http-equiv="Content-Style-Type"/>
 <meta content="text/javascript" http-equiv="Content-Script-Type"/>
 <link href="/com/css/define.css" media="all" rel="stylesheet" type="text/css"/>
 <link href="../../css/default.css" media="all" rel="stylesheet" type="text/css"/>
 <script language="JavaScript" src="/com/js/jquery.js" type="text/JavaScript">
 </script>
 <script language="JavaScript" src="../js/jquery.tablefix.js" type="text/JavaScript">
 </script>
 <style media="all" type="text/css">
  <!-- @import url(/com/default.css); -->
 </style>
 <link href="../../data/css/kako.css" media="all" rel="stylesheet" type="text/css"/>
 <link href="../../data/css/print.css" media="print" rel="stylesheet" type="text/css"/>
</head>
```
The nested hierarchy structure can be easily spotted in the first lines:
```html
<head>
 ...
 <title>
  気象庁|過去の気象データ検索
 </title>
 ...
</head>
```

In a similar manner, the tabular data to extract is contained within `<body> ...tabular data... </body>`.
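Step 4: Isolate the Table Rows with soup.select()
The step that produces the `step4_tableRows` variable used throughout step 5 isolates the rows of the data table from the parsed HTML. A minimal sketch, assuming (as the rows printed in step 5-1 suggest) that the 10-minute data table marks its rows with the class `mtx`; here I parse a small stand-in table so the snippet runs on its own, but in the tutorial you would call `select()` on `step3_soup` from step 3:

```python
from bs4 import BeautifulSoup

# Small stand-in for the parsed page so this sketch runs on its own;
# in the tutorial, call select() on step3_soup from step 3 instead.
sample_html = """
<table>
  <tr class="mtx"><th>時分</th><th>気温</th></tr>
  <tr class="mtx"><td>00:10</td><td class="data_0_0">3.3</td></tr>
</table>
"""
step3_soup = BeautifulSoup(sample_html, "html.parser")

# Isolate every row of the data table.
# Assumption: the data table marks its rows with class "mtx",
# as the rows printed in step 5-1 suggest.
step4_tableRows = step3_soup.select('tr.mtx')
print("Number of table rows found:", len(step4_tableRows))  # 2 for this stand-in
```

For the real Tokyo page, the row count should be the two header rows plus one row per 10-minute interval.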
Step 5: Create a DataFrame with the Fully Parsed Data Using for Loops
From the isolated table rows, we extract the weather data and place them into a structured pandas DataFrame using a for loop to work through each row.
First, import the pandas library.
```python
import pandas as pd
```

Then, create an empty DataFrame. I've provided the English translation of the column headers below:
```python
# The Tokyo observation site (type 's1') has 11 columns of data
headers_s1 = ['time',
              'local atmospheric pressure (hPa)',
              'sea level pressure (hPa)',
              'precipitation (mm)',
              'temperature (℃)',
              'relative humidity',
              'average wind speed (m/s)',
              'average wind direction',
              'max wind speed (m/s)',
              'max wind direction',
              'sunshine (minutes)']

# If you are using an 'a1' type observation site, use the headers below
headers_a1 = ['time',
              'precipitation (mm)',
              'temperature (℃)',
              'relative humidity',
              'average wind speed (m/s)',
              'average wind direction',
              'max wind speed (m/s)',
              'max wind direction',
              'sunshine (minutes)']

# Specify the headers you'll be using
headers_now = headers_s1  # Change to headers_a1 if using an 'a1' type observation site

# Empty DataFrame to fill with data
step5_df = pd.DataFrame(columns=headers_now)
```

Referring back to the sample table rows from the previous step, a few things stand out:
- The first two table rows contain no data, only headers
- Every data row begins with a `<td>` (table data) tag containing the time in `'HH:MM'` format
- All remaining meteorological values appear inside `<td class="data_0_0">` elements
We can use these patterns to guide our parsing when working row-by-row in the for loop. Start by skipping the header information.
Step 5-1: Skip Rows Corresponding to Column Headers
We'll filter out the rows containing just the column headers. These rows are characterized by not containing any `data_0_0` class data, so the method call `soup.select('td.data_0_0')` returns an empty list for them. Checking for this safely skips the column-header rows, regardless of the number of rows the original table contains.
```python
# Iterate through each row in the table
for rowNow in step4_tableRows:
    # Search for the data_0_0 class (actual weather data)
    list_data = rowNow.select('td.data_0_0')
    if len(list_data) == 0:
        # There is no data in this row (i.e. column headers), so skip
        print("Skipping row: ", rowNow)
        continue
```

```
Skipping row:  <tr class="mtx"><th rowspan="2" scope="col">時分</th><th colspan="2" scope="colgroup">気圧(hPa)</th><th rowspan="2" scope="col">降水量<br/>(mm)</th><th rowspan="2" scope="col">気温<br/>(℃)</th><th rowspan="2" scope="col">相対湿度<br/>(%)</th><th colspan="4" scope="colgroup">風向・風速(m/s)</th><th rowspan="2" scope="col">日照<br/>時間<br/>(分)</th></tr>
Skipping row:  <tr class="mtx" scope="col"><th scope="col">現地</th><th scope="col">海面</th><th scope="col">平均</th><th scope="col">風向</th><th scope="col">最大瞬間</th><th scope="col">風向</th></tr>
```
Once the column headers are skipped, we’d like to extract the first data element in the row.
Step 5-2: Extract the Timestamps
After skipping the rows with no data, we'll extract the first data element: the timestamp at which the weather data were collected. Since the first `<td>` tag in each valid row always contains the timestamp, we can:

- Use `soup.select_one('td')` to pull out just the first element
- Clean the timestamp string with `.text.strip()`
  - Remove HTML tags with `.text`
  - Remove extra spaces with `.strip()`
- Assign it to the first column of the DataFrame using `df.loc[]`
Let’s add this code to the for loop from before:
```python
# Iterate through each row in the table
rowCount = 0
for rowNow in step4_tableRows:
    # Search for the data_0_0 class that contains the weather data
    list_data = rowNow.select('.data_0_0')
    if len(list_data) == 0:
        # There is no data in this row (i.e. column headers), so skip
        #print("Skipping row: ", rowNow)
        continue
    else:
        # Extract time
        time1 = rowNow.select_one('td')  # Select just the first <td> tag element
        time2 = time1.text.strip()       # Remove tags with .text, extra spaces with .strip()
        #print(time2)
        # Assign the time to the first column in the DataFrame
        step5_df.loc[rowCount, headers_now[0]] = time2
        # Increment rowCount
        rowCount = rowCount + 1
```

You can verify that all timestamps from '00:10' to '24:00' have been assigned to the DataFrame by inspecting the first and last time entries:
```python
print("First few lines:")
print(step5_df['time'].head())
print("\nLast few lines:")
print(step5_df['time'].tail())
```

```
First few lines:
0    00:10
1    00:20
2    00:30
3    00:40
4    00:50
Name: time, dtype: object

Last few lines:
139    23:20
140    23:30
141    23:40
142    23:50
143    24:00
Name: time, dtype: object
```
The column headers are omitted, and the first data element, the timestamps, has been successfully transferred from HTML to a pandas DataFrame. All that remains is to migrate the weather data.
Step 5-3: Retrieve the Meteorological Data
The loop can now skip the column headers and collect the time information, so next we'd like to extract the remaining meteorological data from the `data_0_0` class elements. To do so, we'll use a nested for loop that looks for each `<td class="data_0_0">` data element, cleans the string, and inserts it into the DataFrame.
However, the keen-eyed reader may have noticed that the wind direction data are recorded as Japanese kanji. We will translate these into English compass notation (N/E/S/W notation) within the for loop.
I’ve provided you with the ConvertKanji2NESW() function that translates the Japanese kanji into English compass notation:
```python
def ConvertKanji2NESW(kanji_windDir):
    """Function to convert the wind direction from kanji to English
    AUTHOR: Mai Tanaka (www.DataDrivenMai.com)
    DATE: 2026-03-31
    REQUIRES: kanji_windDir = Full kanji indicating wind direction (eg. '北東')
    RETURNS: english_windDir = Wind direction in English in N/E/S/W notation (eg. 'NE')
    """
    # Start with an empty string
    english_windDir = ''

    # Handle the special case: 静穏 (calm or no wind)
    if kanji_windDir == '静穏':
        english_windDir = 'tranquil'
        return english_windDir

    # Kanji to character (NESW) conversion, character-by-character
    for kanji_char in kanji_windDir:
        if kanji_char == '北':
            english_windDir = english_windDir + 'N'
        if kanji_char == '東':
            english_windDir = english_windDir + 'E'
        if kanji_char == '南':
            english_windDir = english_windDir + 'S'
        if kanji_char == '西':
            english_windDir = english_windDir + 'W'

    # Return English wind direction
    return english_windDir
```

Working through the tabular data cell-by-cell, each weather data element is cleaned using `.text` and `.strip()` to remove the HTML tags and surrounding spaces, respectively. If the data element corresponds to wind direction, we apply the ConvertKanji2NESW() function above. The cleaned data can finally be placed in the pandas DataFrame under the appropriate header using `df.loc[]`.
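Before wiring the converter into the loop, it's worth sanity-checking it on a few directions. The snippet below uses a compact dict-based variant of the same character-by-character logic so that it runs on its own; it behaves identically to the function above, including silently skipping any unmapped character:

```python
# Compact, dict-based variant of ConvertKanji2NESW(), repeated
# here only so this sanity check runs on its own.
def ConvertKanji2NESW(kanji_windDir):
    if kanji_windDir == '静穏':  # special case: calm / no wind
        return 'tranquil'
    mapping = {'北': 'N', '東': 'E', '南': 'S', '西': 'W'}
    # Unmapped characters contribute nothing, matching the original
    return ''.join(mapping.get(c, '') for c in kanji_windDir)

print(ConvertKanji2NESW('北東'))    # NE
print(ConvertKanji2NESW('南南西'))  # SSW
print(ConvertKanji2NESW('静穏'))    # tranquil
```

Either version can be dropped into the loop below; the original's if-chain is easier to read aloud, while the dict keeps the mapping in one place.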
```python
# Iterate through each row in the table
rowCount = 0
for rowNow in step4_tableRows:
    # Search for the data_0_0 class
    list_data = rowNow.select('.data_0_0')
    if len(list_data) == 0:
        # There is no data in this row, so skip
        #print("Skipping row: ", rowNow)
        continue
    else:
        # Extract time
        time1 = rowNow.select_one('td')  # Select just the first <td> tag element
        time2 = time1.text.strip()       # Remove tags with .text, extra spaces with .strip()
        #print(time2)
        # Assign the time to the DataFrame
        step5_df.loc[rowCount, headers_now[0]] = time2

        # Work through each data element to extract meteorological data
        columnCount = 1
        for dataNow in list_data:
            # Extract just the data (remove tags and spaces)
            justData = dataNow.text.strip()
            # If the data type is wind direction, convert from kanji to alphabet
            if "wind direction" in headers_now[columnCount]:
                justData = ConvertKanji2NESW(justData)
            # Save the data into the appropriate location and increment columnCount
            step5_df.loc[rowCount, headers_now[columnCount]] = justData
            columnCount = columnCount + 1

        # Increment the rowCount
        rowCount = rowCount + 1
```

A quick check of the first few lines of the resulting DataFrame confirms that all data values of time, atmospheric pressure, temperature, humidity, wind speed and others have been successfully extracted. Furthermore, wind direction has been converted from Japanese kanji to English.
```python
step5_df.head()
```

|   | time | local atmospheric pressure (hPa) | sea level pressure (hPa) | precipitation (mm) | temperature (℃) | relative humidity | average wind speed (m/s) | average wind direction | max wind speed (m/s) | max wind direction | sunshine (minutes) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00:10 | 1003.3 | 1006.3 | -- | 3.3 | 96 | 1.1 | W | 1.7 | WSW | |
| 1 | 00:20 | 1003.5 | 1006.5 | -- | 3.4 | 96 | 1.1 | W | 1.4 | W | |
| 2 | 00:30 | 1003.7 | 1006.7 | -- | 3.5 | 96 | 1.6 | WNW | 2.0 | NW | |
| 3 | 00:40 | 1003.7 | 1006.7 | -- | 3.9 | 97 | 2.1 | NW | 2.6 | NW | |
| 4 | 00:50 | 1003.8 | 1006.8 | -- | 3.9 | 96 | 1.8 | NW | 2.1 | WNW | |
The final step is to save the pandas DataFrame as a CSV for use in subsequent data analysis projects.
Step 6: Save DataFrame as a CSV File
In the final step, we save the fully populated and cleaned DataFrame as a CSV file. We can use pandas' built-in `DataFrame.to_csv()` method to write the data out. Just specify the desired file path (making sure the directory exists in your local environment), disable the index column, and choose an appropriate text encoding (UTF-8 is a safe default).
```python
# Specify a file name in a directory that exists
fileName = 'test_JMADataScraping_step6.csv'
dirName = 'SampleData_JMA/'
savefileName = dirName + fileName

step5_df.to_csv(savefileName, index=False, encoding='utf-8')
```

This saves a CSV file with the scraped and parsed weather data that you can access at any time.
Summary
Congratulations!
You now have a full toolkit for scraping and parsing structured data from the JMA website using a combination of the requests, Beautiful Soup and pandas libraries. A quick summary of the steps we took:

- Designated the URL to access data from
- Accessed the URL using `requests.get()` and obtained the raw HTML
- Parsed the raw HTML using `BeautifulSoup()`
- Navigated the complex HTML structure and isolated the target table using `soup.select()`
- Manually parsed each table row whilst:
  - Skipping rows with no data
  - Extracting and cleaning the timestamps
  - Pulling out meteorological data and assigning them to the correct DataFrame columns
  - Translating the wind direction from Japanese kanji to English
- Exported the fully parsed DataFrame to a CSV file using `df.to_csv()`
For more tips, tricks and tutorials, be sure to check out the blog posts under Further Reading.
Happy data scraping!
Further Reading
- Under construction

