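Overview

This article describes how to download data from JSOC and what to watch out for.

Download methods at a glance

There are three ways to download data from JSOC: through the web page, with a script, or by database synchronization. Web-page download suits small, short-term downloads. Script download suits large data volumes, especially when filtering, periodic runs, or automated downloads are involved. The official site also offers full database synchronization, similar to creating a mirror; because it needs substantial resources, I have not tried it and do not cover it here.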

image-20220322140953669

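Technical implementation (a guess)

The server presumably stores the data in a database and generates the files on each request, rather than serving stored files. The diagram below shows my guess at the implementation.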

image-20220322152057445

Notes

Requests made through the web page or a script are subject to the following constraints.

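| Item | Constraint | Workaround |
| --- | --- | --- |
| Concurrent downloads per email | Possibly only one at a time; simultaneous requests with the same email may fail | Stagger the requests; do not share one email across threads; with multiple threads, use multiple emails |
| Files per request | The total number of files returned by one search is limited; see the official site: "Limit for AIA to about 15,000 and for HMI about 30,000 in each request." | Keep each search small, for example by narrowing the time range or setting a sampling interval |
| Requests per IP | Each IP has a total request quota; too many requests get the IP blocked. The block may lift after about a month or may never lift, in which case you have to email to ask for unblocking | Use a proxy server abroad, several dynamic-IP proxies, or a computer with a different IP |
| Email registration validity | A registration expires after roughly 2-3 months | Check on the official site before each use; if the email shows as unregistered, register it again |
| Storage URL validity | Each generated storage URL expires after a while | Download promptly |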
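
One way to stay under the per-request file limit is to cut a long time range into shorter windows and issue one search per window. The following is a minimal sketch of that idea using only the standard library; the function name `split_time_range` and the 30-day default are my own choices for illustration, not anything JSOC or sunpy prescribes, and the right window size depends on the series and cadence.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"  # e.g. "2010-01-01T00:00:00.000"


def split_time_range(start, end, days_per_request=30):
    """Split a time range into consecutive windows of at most `days_per_request` days.

    `start` must be earlier than `end`; otherwise an empty list is returned.
    `days_per_request` is a hypothetical tuning knob, not a JSOC parameter.
    """
    t0 = datetime.strptime(start, TIME_FORMAT)
    t1 = datetime.strptime(end, TIME_FORMAT)
    step = timedelta(days=days_per_request)
    windows = []
    while t0 < t1:
        t_next = min(t0 + step, t1)
        windows.append((
            t0.isoformat(timespec="milliseconds"),
            t_next.isoformat(timespec="milliseconds"),
        ))
        t0 = t_next
    return windows


# Each (start, end) pair can then be passed to a.Time(start, end) in its own search:
# for start, end in split_time_range("2010-01-01T00:00:00.000", "2011-01-01T00:00:00.000"):
#     ...build the search parameters with a.Time(start, end) and fetch...
```

Adjacent windows share their boundary timestamp, so a record that falls exactly on a boundary may be matched by two requests; deduplicating by file name afterwards is probably the simplest way to handle that.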

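Web-page download

Registering an email

  • Open http://jsoc.stanford.edu/ajax/exportdata.html

  • Enter an email address that can receive mail in field 1

  • Field 2 will show that the address is not registered

  • Click the register button at 3

  • The address entered above receives a verification email with a code string at the end; send it back unchanged to the address it came from

  • You should then receive a registration-success email. Refresh the page and enter the address again; it now shows a green "registered" status, as at 4, and is ready to use

  • Note that the registration expires after roughly 2-3 months; the page then shows the address as unregistered and you need to register again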

image-20220322144620519

image-20220322145032092
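
Since registrations expire, it can help to check the addresses from a script before starting a long job. The sketch below relies on `drms.Client.check_email`, which reports whether an address is registered for export; the helper name `registered_emails` and the optional `client` argument are hypothetical, added here so the function can be exercised without network access.

```python
def registered_emails(emails, client=None):
    """Return the addresses in `emails` that JSOC reports as registered for export.

    `client` is any object with a check_email(email) -> bool method.
    When it is omitted, a drms.Client is created (needs the drms package and network access).
    """
    if client is None:
        import drms  # imported lazily so the helper also works with a stub client
        client = drms.Client()
    return [email for email in emails if client.check_email(email)]


# Usage (performs real requests against JSOC):
# usable = registered_emails(["eg@gmail.com"])
# if not usable:
#     print("No registered address; register at http://jsoc.stanford.edu/ajax/exportdata.html")
```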

Configuring parameters and downloading

Script download

Installing packages

Script download requires the following packages.

# python3 -m pip install requests
# python3 -m pip install bs4
# python3 -m pip install lxml
# python3 -m pip install zeep
# python3 -m pip install drms
# python3 -m pip install sunpy
# python3 -m pip install astropy

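Download process

The process is: build the search parameters, run the search with them, then download the data from the result set.

Building the search parameters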

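from sunpy.net import Fido, attrs as a
import astropy.units as u  # only needed if a.Sample is enabled below

search_tuple = (
    a.Time("2010-01-01T00:00:00.000", "2011-01-01T00:00:00.000"),
    # Keep strictly to this time format: start time first, end time second.
    # If start and end are identical, only the data at that time point is downloaded.

    a.jsoc.Series("hmi.sharp_cea_720s"),
    # Available series are listed at http://jsoc.stanford.edu/JsocSeries_DataProducts_map.html

    a.jsoc.Notify("eg@gmail.com"),
    # Replace with your own address, registered beforehand at http://jsoc.stanford.edu/ajax/exportdata.html

    a.jsoc.Segment("Bp"),  # segments (data products) to fetch from the series
    a.jsoc.Segment("Bt"),
    a.jsoc.Segment("Br"),

    # a.jsoc.PrimeKey("HARPNUM", "180"),  # optionally restrict to one HARPNUM

    # a.jsoc.Keyword("LON_MIN") > -70,
    # a.jsoc.Keyword("LON_MAX") < 70,
    # Optional keyword filters; only numeric comparisons work, not string comparisons.
    # See https://docs.sunpy.org/en/stable/whatsnew/3.1.html?highlight=t_rec#jsoc-keyword-filtering-with-fido

    # a.Sample(96 * u.min),
    # Optional sampling interval (96 minutes here), counted from the start of a.Time(...);
    # it has no effect unless a start and end time are set above.
)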

Running the search

search_results = Fido.search(*search_tuple)

Downloading the data

save_path="./"
downloaded_files = Fido.fetch(search_results, path=save_path)
print(downloaded_files)
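
Because the generated URLs expire, it is worth retrying failed files right away rather than coming back later. `Fido.fetch` returns a results object with an `errors` attribute, and passing that object back to `Fido.fetch` retries only the failed downloads. The wrapper below is a minimal sketch built on that behaviour; the name `fetch_with_retry` and the `max_retries` default are assumptions for illustration, and the fetch function is passed in as an argument so the logic can be tried without sunpy installed.

```python
def fetch_with_retry(fetch, search_results, path, max_retries=2):
    """Download with `fetch` and retry failed files up to `max_retries` times.

    `fetch` is expected to behave like sunpy's Fido.fetch: it returns an object
    with an `errors` attribute, and accepts that object again to retry failures.
    """
    downloaded = fetch(search_results, path=path)
    for _ in range(max_retries):
        if not getattr(downloaded, "errors", None):
            break  # everything arrived, nothing left to retry
        downloaded = fetch(downloaded, path=path)
    return downloaded


# Usage with sunpy (performs real downloads):
# from sunpy.net import Fido
# downloaded_files = fetch_with_retry(Fido.fetch, search_results, path="./")
```

Keeping the retry count small seems prudent here, given that too many requests from one IP can lead to a block.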

Simple wrapper

# coding=utf-8
"""
Purpose: [1] download data Bp Bt Br from jsoc
[2] Demonstrate sunpy for jsoc basic usage,
Officially already has a good package, this is just a demonstration of basic use

Usage: This code depends on the requests bs4 lxml zeep drms sunpy astropy
They can be installed from conda or pip
This code is compatible with python 3.7.x.

Examples: None Now

Adapted: ZhaoZhongRui (zhaozhongrui21@mails.ucas.ac.cn) Edit Python code From Thomas Wiegelmann (2022.03)
"""
# This code is compatible with python 3.7.x.
# python3 -m pip install requests
# python3 -m pip install bs4 # https://stackoverflow.com/questions/11783875/importerror-no-module-named-bs4-beautifulsoup
# python3 -m pip install lxml
# python3 -m pip install zeep
# python3 -m pip install drms # https://docs.sunpy.org/projects/drms/en/stable/
# python3 -m pip install sunpy
# python3 -m pip install astropy
from sunpy.net import Fido, attrs as a
import astropy.units as u
from concurrent.futures import ProcessPoolExecutor
import os


class DownloadJsoc():

    def __init__(self):

        self.is_print_log = True  # set to False to suppress log output

        self.mail_address_list = []  # eg ["demo@demo.com","demo2@demo.com"]
        # Register every address in this list at the URL below before use.
        # Each worker process then uses its own address, because one address
        # can only have one request in flight at a time.
        # http://jsoc.stanford.edu/ajax/exportdata.html
        # ---
        # A registration expires (possibly after about two months);
        # after that you need to re-register on the official website.
        # The number of requests or downloads per IP is also limited; exceed it
        # and the IP may be blocked. You then have to apply for unblocking, or wait
        # a long time for it to lift on its own. Prefer a small number of requests
        # that each fetch a large amount of data.

        self.data_save_root_path = None  # eg r"C:\Users\Zander\PycharmProjects\pynlfff\pyproduct\test"

    def download_one_by_time_point(self, timepoint, harpnum, email=None, save_path=None):
        """
        Download the files for a single point in time
        :param timepoint: eg "2010-01-01T00:00:00.000"
        :param harpnum: eg "220"
        :param email: eg "eg@eg.com" register at http://jsoc.stanford.edu/ajax/exportdata.html
        :param save_path: str path
        :return: downloaded files, or False on failure
        """
        result = False
        try:
            if email is None:
                if isinstance(self.mail_address_list, list) and len(self.mail_address_list) > 0:
                    email = self.mail_address_list[0]
            if save_path is None:
                # guard against the default None, which os.path.exists cannot handle
                if self.data_save_root_path and os.path.exists(self.data_save_root_path):
                    save_path = self.data_save_root_path

            search_tuple = (
                a.Time(timepoint, timepoint),  # same start and end time downloads just this time point

                a.jsoc.Series("hmi.sharp_cea_720s"),
                # choose series http://jsoc.stanford.edu/JsocSeries_DataProducts_map.html

                a.jsoc.Notify(email),  # a.jsoc.Notify("eg@gmail.com"),

                a.jsoc.Segment("Bp"),
                a.jsoc.Segment("Bt"),
                a.jsoc.Segment("Br"),

                a.jsoc.PrimeKey("HARPNUM", str(harpnum)),

                # a.jsoc.Keyword("LON_MIN") > -70,  # JSOC keyword filtering with Fido https://docs.sunpy.org/en/stable/whatsnew/3.1.html?highlight=t_rec#jsoc-keyword-filtering-with-fido
                # a.jsoc.Keyword("LON_MAX") < 70,  # Note: only numeric filters work, not strings

                # a.Sample(96 * u.min),  # time delta, needs a time range set above
            )

            results = Fido.search(*search_tuple)  # search for result
            downloaded_files = Fido.fetch(results, path=save_path)  # download result
            result = downloaded_files  # return result
        except Exception as e:  # Exception, so Ctrl-C (KeyboardInterrupt) still stops the run
            print(e)
        return result

    def download_one_by_time_range(self, timestart, timeend, harpnum, timedelta=96, email=None, save_path=None):
        """
        Download by time range and sampling interval
        :param timestart: str eg "2010-01-01T00:00:00.000"
        :param timeend: str eg "2022-01-01T00:00:00.000"
        :param harpnum: str eg "235"
        :param timedelta: int, sampling interval in minutes, eg 96
        :param email: str eg "eg@eg.com"
        :param save_path: str path in which to save the files
        :return: downloaded files, or False on failure
        """
        result = False
        try:
            if email is None:
                if isinstance(self.mail_address_list, list) and len(self.mail_address_list) > 0:
                    email = self.mail_address_list[0]
            if save_path is None:
                # guard against the default None, which os.path.exists cannot handle
                if self.data_save_root_path and os.path.exists(self.data_save_root_path):
                    save_path = self.data_save_root_path
            search_tuple = (
                a.Time(timestart, timeend),
                # the time range must be set for a.Sample to work, eg ("2010-01-01T00:00:00.000", "2022-01-01T00:00:00.000")

                a.jsoc.Series("hmi.sharp_cea_720s"),
                # choose series http://jsoc.stanford.edu/JsocSeries_DataProducts_map.html

                a.jsoc.Notify(email),  # a.jsoc.Notify("eg@gmail.com"),

                a.jsoc.Segment("Bp"),  # choose segment
                a.jsoc.Segment("Bt"),
                a.jsoc.Segment("Br"),

                a.jsoc.PrimeKey("HARPNUM", str(harpnum)),

                # a.jsoc.Keyword("LON_MIN") > -70,  # JSOC keyword filtering with Fido https://docs.sunpy.org/en/stable/whatsnew/3.1.html?highlight=t_rec#jsoc-keyword-filtering-with-fido
                # a.jsoc.Keyword("LON_MAX") < 70,

                a.Sample(timedelta * u.min),  # eg a.Sample(96 * u.min), needs the time range set above
            )

            results = Fido.search(*search_tuple)
            downloaded_files = Fido.fetch(results, path=save_path)
            result = downloaded_files
        except Exception as e:  # Exception, so Ctrl-C (KeyboardInterrupt) still stops the run
            print(e)
        return result

    def _download_one_by_time_point_concurrent(self, timepoint, harpnum, job_num=0, save_path=None):
        # Single leading underscore on purpose: a double-underscore name is mangled,
        # and on older Python versions such a bound method may fail to pickle
        # when it is submitted to ProcessPoolExecutor.
        # harpnum = harpnum # "7302"
        # stime=timepoint # "2018-09-08T14:24:00.000"
        print('job num: {} / pid : {} is running\n'.format(job_num, os.getpid()))
        email_len = len(self.mail_address_list)
        mail_num = job_num % email_len  # rotate through the registered addresses
        result = self.download_one_by_time_point(timepoint, harpnum, self.mail_address_list[mail_num], save_path)
        return result

    def download_some_by_time_point(self, timepoint_list, harpnum_list):
        """
        Download several time points one after another; the two lists must have the same length
        :param timepoint_list: eg ["2018-09-08T14:24:00.000","2018-09-08T00:00:00.000"]
        :param harpnum_list: eg ["7020","7020"]
        :return: list with one result per job
        """
        result = []
        t_len = len(timepoint_list)
        h_len = len(harpnum_list)
        if t_len == h_len:
            for i in range(t_len):
                this_result = self.download_one_by_time_point(timepoint_list[i], harpnum_list[i])
                result.append(this_result)
        return result

    def download_some_by_time_point_concurrent(self, timepoint_list, harpnum_list):
        """
        Download a list of time points and HARP numbers in parallel; the two lists must have the same length
        :param timepoint_list: eg ["2018-09-08T14:24:00.000","2018-09-08T00:00:00.000"]
        :param harpnum_list: eg ["7020","7020"]
        :return: list of futures, one per submitted job
        """
        result = []
        t_len = len(timepoint_list)
        h_len = len(harpnum_list)
        if t_len == h_len:
            max_workers = len(self.mail_address_list)  # one worker per registered address
            executor = ProcessPoolExecutor(max_workers=max_workers)
            for i in range(t_len):
                future = executor.submit(
                    self._download_one_by_time_point_concurrent,
                    timepoint_list[i],
                    harpnum_list[i],
                    i)
                result.append(future)
            executor.shutdown(True)  # wait for all jobs to finish
        return result

    def tran_json_file_tai_num_time_to_download_format(self, raw_str):
        """
        Convert a record name from the job file into the parameters required for download,
        eg "7020.20170524_222400_TAI" to ["7020","2017-05-24T22:24:00.000"]
        Both items in the list are str
        :param raw_str: str, eg "7020.20170524_222400_TAI"
        :return: [ str, str ] -> [ harpnum , timestr ], eg ["7020","2017-05-24T22:24:00.000"],
                 or None if the input is malformed and cannot be converted
        """
        result = None
        if isinstance(raw_str, str):
            str_list = raw_str.split(".")  # split by . to ["7020","20170524_222400_TAI"]
            if len(str_list) == 2:
                if len(str_list[1]) >= 15:
                    try:
                        hnum = str(int(str_list[0]))  # raises ValueError if str_list[0] is not an integer
                        this_list = str_list[1]
                        stime = "{}-{}-{}T{}:{}:00.000".format(
                            this_list[0:4],  # year
                            this_list[4:6],  # month
                            this_list[6:8],  # day
                            this_list[9:11],  # hour
                            this_list[11:13],  # minute
                        )  # "2017-05-24T22:24:00.000"
                        result = [hnum, stime]
                    except Exception as e:
                        print(e)
        return result

    def get_job_list_from_file(self, file_path):
        """
        The file content is like "7020.20170524_000000_TAI\n7020.20170524_222400_TAI\n"
        :param file_path: str or os.path format
        :return: ["7020.20170524_000000_TAI","7020.20170524_222400_TAI"]
        """
        result = []
        if os.path.exists(file_path):
            with open(file_path, "r") as f:
                all_list = f.read().split("\n")
                result = all_list
        return result

    def demo_download_one(self):
        self.download_one_by_time_point("2018-08-23T17:36:00.000", "7300", email="eg@eg.com", save_path="/data")

    def demo_download_some_from_file(self, file_path=None, save_path=None, mail_list=None):
        """
        Read the jobs from a file and run them concurrently
        The file content is like "7020.20170524_000000_TAI\n7020.20170524_222400_TAI\n"
        :param file_path: str or os.path format
        :param save_path: the path to save file
        :param mail_list: ["eg@eg.com","eg2@eg2.com"] which has registered in jsoc before
        :return: False if no file path is given, otherwise None
        """
        if file_path is None:
            return False
        if save_path is not None:  # the original tested file_path here by mistake
            self.data_save_root_path = save_path
        if mail_list is not None:
            self.mail_address_list = mail_list

        job_raw_list = self.get_job_list_from_file(file_path)
        num_list = []
        time_list = []
        for job_raw in job_raw_list:
            job_tran = self.tran_json_file_tai_num_time_to_download_format(job_raw)
            if job_tran is not None:  # skip blank or malformed lines
                num_list.append(job_tran[0])
                time_list.append(job_tran[1])
        self.download_some_by_time_point_concurrent(time_list, num_list)



if __name__ == "__main__":
    print("start run")

    d = DownloadJsoc()
    file_path = r"/www/wwwroot/app_run/data/a"
    save_path = r"/www/wwwroot/app_run/data"
    mail_list = ["demo1@outlook.com",
                 "demo2@outlook.com",
                 "demo3@outlook.com",
                 "demo4@outlook.com", ]
    d.demo_download_some_from_file(file_path, save_path, mail_list)
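
The job-list handling in the wrapper (turning record names such as "7020.20170524_222400_TAI" into a HARP number and a time string, then rotating through the registered emails) can also be written as a small standalone sketch. The helper names `parse_record_name` and `build_jobs` below are hypothetical and not part of the class above. Unlike the wrapper, this version lets `datetime.strptime` validate the timestamp and keeps the seconds field instead of forcing it to 00.

```python
from datetime import datetime


def parse_record_name(record_name):
    """Turn "7020.20170524_222400_TAI" into ("7020", "2017-05-24T22:24:00.000").

    Returns None for blank or malformed input.
    """
    harpnum, sep, stamp = record_name.strip().partition(".")
    if not sep or not harpnum.isdigit():
        return None
    try:
        t = datetime.strptime(stamp, "%Y%m%d_%H%M%S_TAI")
    except ValueError:
        return None
    return harpnum, t.isoformat(timespec="milliseconds")


def build_jobs(record_names, emails):
    """Pair each valid record with an email, cycling through `emails` in order."""
    if not emails:
        raise ValueError("at least one registered email is required")
    jobs = []
    for name in record_names:
        parsed = parse_record_name(name)
        if parsed is None:
            continue  # skip blank lines and malformed names
        email = emails[len(jobs) % len(emails)]
        jobs.append((parsed[0], parsed[1], email))
    return jobs


jobs = build_jobs(
    ["7020.20170524_000000_TAI", "7020.20170524_222400_TAI"],
    ["demo1@outlook.com", "demo2@outlook.com"],
)
for harpnum, timepoint, email in jobs:
    print(harpnum, timepoint, email)
```

Assigning the email when the job is built, rather than inside the worker, keeps the rotation independent of how many malformed lines the file contains.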