Overview

This article describes how to download data from JSOC and what to watch out for when doing so.

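Overview of download methods

There are three ways to download data from JSOC: web page download, script download, and database synchronization. Web page download suits small, short-term downloads. Script download suits large volumes of data, as well as downloads that involve filtering, periodic runs, or automation. The official site also offers full database synchronization, similar to creating a mirror; because it requires substantial resources, I have not tried it and it is not covered here.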

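Technical implementation (speculation)

The server probably keeps the data in a database and generates the files anew for each request, rather than serving pre-stored files. What follows is a guess at the implementation.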

Notes

Requests made through the web page or through scripts are subject to the following constraints:

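Description	Constraint	Workaround
Concurrent downloads per email	Each email address appears to be limited to one download at a time; simultaneous requests under the same email may fail	Stagger requests and do not share one email across threads; when running multiple threads, use a different email for each
Total files per request	The number of files returned by a single search is capped, so one query must not match too many records. See the official site; the cap is roughly 15,000: `Limit for AIA to about 15,000 and for HMI about 30,000 in each request.`	Keep each search small, for example by narrowing the time range or adding a sampling interval
Requests per IP	Each IP has a total request quota. Too many requests get the IP blocked; it may be unblocked after a month or never, in which case you have to email a request for unblocking	Use a proxy on an overseas server, several dynamic-IP proxies, or a computer with a different IP
Email validity	An email registration expires after roughly 2-3 months	Check on the official site before each use; if the address shows as unregistered, register it again
Export URL lifetime	Each export URL is valid only for a limited time and stops working after that	Download promptly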
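
One way to respect the per-request file limit is to split a long time range into shorter windows and run one search per window. The sketch below uses only the standard library. The helper name `split_time_range` and the 30-day default are assumptions for illustration, not values published by JSOC, so pick a window that keeps your own record count under the limit.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def split_time_range(start, end, days_per_chunk=30):
    """Split [start, end) into consecutive windows of at most days_per_chunk days.

    start and end are strings such as "2010-01-01T00:00:00".
    Returns a list of (window_start, window_end) string pairs.
    """
    if days_per_chunk <= 0:
        raise ValueError("days_per_chunk must be positive")
    t0 = datetime.strptime(start, TIME_FORMAT)
    t1 = datetime.strptime(end, TIME_FORMAT)
    step = timedelta(days=days_per_chunk)
    windows = []
    while t0 < t1:
        # The last window is clipped so it never runs past the requested end
        window_end = min(t0 + step, t1)
        windows.append((t0.strftime(TIME_FORMAT), window_end.strftime(TIME_FORMAT)))
        t0 = window_end
    return windows
```

Each pair can then be used as the start and end time of one search. Adjacent windows share a boundary timestamp, so a record that falls exactly on a boundary may be returned by both searches.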

Web page download

Registering an email address

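Open http://jsoc.stanford.edu/ajax/exportdata.html
At marker 1, enter an email address that can receive mail
Marker 2 will show that the address is not registered
Click the register button at marker 3
The address you entered will receive a verification email with a code string at the end; send it back unmodified to the address it came from
You should then receive a registration-success email. Refresh the page and enter the address again; once it shows a green "registered" status, as at marker 4, the address is ready to use
Note that the registration expires after roughly 2-3 months. An expired address shows as unregistered and has to be registered again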

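Configuring parameters and downloading

Open http://jsoc.stanford.edu/ajax/exportdata.html and set the parameters you need
- The available series are listed at http://jsoc.stanford.edu/JsocSeries_DataProducts_map.html
- The email address must be registered beforehand
Once the record count has been computed, click check. After the check passes, start the download; a download URL will be sent to your email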
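
When many similar queries have to be entered, it can help to generate the query strings instead of typing them. As a rough sketch, the helper below assembles a record-set string for a SHARP series from a HARP number, a time, and a list of segments. The function name `build_recordset` is hypothetical, and the `series[HARPNUM][T_REC]{segments}` layout is my understanding of the JSOC record-set syntax rather than something stated in this article, so compare the result with what the export page accepts before relying on it.

```python
from datetime import datetime


def build_recordset(series, harpnum, when, segments):
    """Build a record-set string such as
    hmi.sharp_cea_720s[7300][2018.08.23_17:36:00_TAI]{Bp,Bt,Br}
    """
    # Assumed TAI timestamp layout: YYYY.MM.DD_HH:MM:SS_TAI
    t_rec = when.strftime("%Y.%m.%d_%H:%M:%S_TAI")
    # Doubled braces in the format string produce literal { and }
    return "{}[{}][{}]{{{}}}".format(series, harpnum, t_rec, ",".join(segments))
```

For example, `build_recordset("hmi.sharp_cea_720s", 7300, datetime(2018, 8, 23, 17, 36), ["Bp", "Bt", "Br"])` produces the string shown in the docstring.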

Script download

Installing packages

Script download requires the following packages:

# python3 -m pip install requests
# python3 -m pip install bs4
# python3 -m pip install lxml
# python3 -m pip install zeep
# python3 -m pip install drms
# python3 -m pip install sunpy
# python3 -m pip install astropy
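
Before running the download scripts, it may be worth confirming that every package in the list is importable. The following is a small sketch using only the standard library; the name `find_missing` is an assumption for illustration.

```python
import importlib.util

# Import names of the packages listed above
REQUIRED_PACKAGES = ["requests", "bs4", "lxml", "zeep", "drms", "sunpy", "astropy"]


def find_missing(packages):
    """Return the names in packages that cannot be found in this environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]
```

Calling `find_missing(REQUIRED_PACKAGES)` returns an empty list when everything is installed, and otherwise the names that still need `pip install`.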

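Download process

The process is: build the search parameters, run the search with them, then download the data in the result set.

Building the search parameters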

from sunpy.net import Fido, attrs as a
import astropy.units as u

search_tuple = (
    a.Time("2010-01-01T00:00:00.000", "2011-01-01T00:00:00.000"),
    # Follow this time format strictly: start time first, then end time.
    # If the start and end times are equal, only that time point is downloaded.

    a.jsoc.Series("hmi.sharp_cea_720s"),
    # Series are listed at http://jsoc.stanford.edu/JsocSeries_DataProducts_map.html

    a.jsoc.Notify("eg@gmail.com"),
    # Replace with your own address, registered beforehand at http://jsoc.stanford.edu/ajax/exportdata.html

    a.jsoc.Segment("Bp"),  # select the data products (segments) within the series
    a.jsoc.Segment("Bt"),
    a.jsoc.Segment("Br"),

    # a.jsoc.PrimeKey("HARPNUM", "180"),  # optionally select a HARPNUM

    # a.jsoc.Keyword("LON_MIN") > -70,
    # a.jsoc.Keyword("LON_MAX") < 70,
    # Optional keyword filters; only numeric comparisons work, not string comparisons
    # See https://docs.sunpy.org/en/stable/whatsnew/3.1.html?highlight=t_rec#jsoc-keyword-filtering-with-fido

    # a.Sample(96 * u.min),
    # Optional sampling interval (96 minutes here), counted from the start of a.Time(...);
    # it has no effect unless a start and end time are set above
)
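
Because the time strings have to follow the format above exactly, it can be safer to generate them from `datetime` objects than to type them by hand. A minimal sketch; the helper name `to_jsoc_time` is an assumption for illustration.

```python
from datetime import datetime


def to_jsoc_time(moment):
    """Format a datetime as "YYYY-MM-DDTHH:MM:SS.mmm" (millisecond precision)."""
    milliseconds = moment.microsecond // 1000
    return moment.strftime("%Y-%m-%dT%H:%M:%S.") + "{:03d}".format(milliseconds)
```

The resulting strings have the same shape as the two arguments passed to `a.Time` above.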

Searching with the parameters

search_results = Fido.search(*search_tuple)

Downloading the data

save_path="./"
downloaded_files = Fido.fetch(search_results, path=save_path)
print(downloaded_files)
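
Export URLs expire and requests can fail transiently, so it can help to retry a failed download a few times with a growing pause instead of hammering the server. The sketch below is generic: `fetch` stands for any zero-argument callable, for example a lambda wrapping the `Fido.fetch` call above. The names `retry_with_backoff`, `attempts`, and `base_delay` are assumptions for illustration.

```python
import time


def retry_with_backoff(fetch, attempts=3, base_delay=60.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    Waits base_delay, then 2 * base_delay, and so on between tries.
    Re-raises the last exception if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise
            # Double the pause after each failed attempt
            sleep(base_delay * (2 ** attempt))
```

This only helps when the callable raises on failure. `Fido.fetch` may instead report failed files inside its return value, in which case the callable would need to inspect the result and raise by itself.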

Simple wrapper

# coding=utf-8
"""
Purpose:   [1] Download the Bp, Bt and Br data from JSOC
           [2] Demonstrate basic sunpy usage for JSOC;
               the official packages are already good, this is only a demonstration of basic use

Usage:     This code depends on requests, bs4, lxml, zeep, drms, sunpy and astropy,
           which can be installed from conda or pip.
           This code is compatible with python 3.7.x.

Examples:  None Now

Adapted:   ZhaoZhongRui (zhaozhongrui21@mails.ucas.ac.cn) Edit Python code From Thomas Wiegelmann (2022.03)
"""
# This code is compatible with python 3.7.x.
# python3 -m pip install requests
# python3 -m pip install bs4  # https://stackoverflow.com/questions/11783875/importerror-no-module-named-bs4-beautifulsoup
# python3 -m pip install lxml
# python3 -m pip install zeep
# python3 -m pip install drms  # https://docs.sunpy.org/projects/drms/en/stable/
# python3 -m pip install sunpy
# python3 -m pip install astropy
from sunpy.net import Fido, attrs as a
import astropy.units as u
from concurrent.futures import ProcessPoolExecutor
import os


class DownloadJsoc():

    def __init__(self):

        self.is_print_log = True  # or False not print log

        self.mail_address_list = []  # eg ["demo@demo.com","demo2@demo.com"]
        # Register every address in this list at
        # http://jsoc.stanford.edu/ajax/exportdata.html before use.
        # An address can only serve one request at a time, so concurrent
        # downloads need a separate address per worker.
        # ---
        # A registration expires (possibly after about two months) and then
        # has to be renewed on the official website.
        # The number of requests or downloads per IP is also limited; exceed it
        # and the IP may be blocked. You then have to apply for unblocking, or
        # wait a long time for an automatic unblock. Prefer a small number of
        # requests that each fetch a lot of data over many small ones.

        self.data_save_root_path = None  # eg r"C:\Users\Zander\PycharmProjects\pynlfff\pyproduct\test"

    def download_one_by_time_point(self, timepoint, harpnum, email=None, save_path=None):
        """
        Download the file based on the point in time
        :param timepoint: eg "2010-01-01T00:00:00.000"
        :param harpnum: eg "220"
        :param email: eg "eg@eg.com" register at http://jsoc.stanford.edu/ajax/exportdata.html
        :param save_path: str path
        :return: the downloaded file results, or False on failure
        """
        result = False
        try:
            if email is None:
                if isinstance(self.mail_address_list, list) and len(self.mail_address_list) > 0:
                    email = self.mail_address_list[0]
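            if save_path is None:
                # Fall back to the configured root path only if it is set and exists;
                # os.path.exists(None) would raise a TypeError
                root_path = self.data_save_root_path
                if root_path is not None and os.path.exists(root_path):
                    save_path = root_path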

            search_tuple = (
                a.Time(timepoint, timepoint),  # set same start time and end time is like to download this time point

                a.jsoc.Series("hmi.sharp_cea_720s"),
                # choose series  http://jsoc.stanford.edu/JsocSeries_DataProducts_map.html

                a.jsoc.Notify(email),  # a.jsoc.Notify("eg@gmail.com"),

                a.jsoc.Segment("Bp"),
                a.jsoc.Segment("Bt"),
                a.jsoc.Segment("Br"),

                a.jsoc.PrimeKey("HARPNUM", str(harpnum)),

                # a.jsoc.Keyword("LON_MIN") > -70, # JSOC keyword filtering with Fido  https://docs.sunpy.org/en/stable/whatsnew/3.1.html?highlight=t_rec#jsoc-keyword-filtering-with-fido
                # a.jsoc.Keyword("LON_MAX") < 70, # Note You can only filter numbers, not strings

                # a.Sample(96 * u.min), # time delta, need set time range before
            )

            results = Fido.search(*search_tuple)  # search for result
            downloaded_files = Fido.fetch(results, path=save_path)  # download result
            result = downloaded_files  # return result
        except Exception as e:  # Exception, so that Ctrl-C (KeyboardInterrupt) still stops the run
            print(e)
        return result

    def download_one_by_time_range(self, timestart, timeend, harpnum, timedelta=96, email=None, save_path=None):
        """
        download by time range and delta
        :param timestart: str eg "2010-01-01T00:00:00.000"
        :param timeend:  str eg "2022-01-01T00:00:00.000"
        :param harpnum: str eg "235"
        :param timedelta: int eg 96
        :param email: str eg "eg@eg.com"
        :param save_path: str path which to save file
        :return: result set or False
        """
        result = False
        try:
            if email is None:
                if isinstance(self.mail_address_list, list) and len(self.mail_address_list) > 0:
                    email = self.mail_address_list[0]
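            if save_path is None:
                # Fall back to the configured root path only if it is set and exists;
                # os.path.exists(None) would raise a TypeError
                root_path = self.data_save_root_path
                if root_path is not None and os.path.exists(root_path):
                    save_path = root_path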
            search_tuple = (
                a.Time(timestart, timeend),
                # set time range, if want use delta must set this ("2010-01-01T00:00:00.000", "2022-01-01T00:00:00.000"),

                a.jsoc.Series("hmi.sharp_cea_720s"),
                # choose series  http://jsoc.stanford.edu/JsocSeries_DataProducts_map.html

                a.jsoc.Notify(email),  # a.jsoc.Notify("eg@gmail.com"),

                a.jsoc.Segment("Bp"),  # choose segment
                a.jsoc.Segment("Bt"),
                a.jsoc.Segment("Br"),

                a.jsoc.PrimeKey("HARPNUM", str(harpnum)),

                # a.jsoc.Keyword("LON_MIN") > -70, # JSOC keyword filtering with Fido  https://docs.sunpy.org/en/stable/whatsnew/3.1.html?highlight=t_rec#jsoc-keyword-filtering-with-fido
                # a.jsoc.Keyword("LON_MAX") < 70,

                a.Sample(timedelta * u.min),  # eg a.Sample(96 * u.min) time delta, need set time range before
            )

            results = Fido.search(*search_tuple)
            downloaded_files = Fido.fetch(results, path=save_path)
            result = downloaded_files
        except Exception as e:  # Exception, so that Ctrl-C (KeyboardInterrupt) still stops the run
            print(e)
        return result

    def _download_one_by_time_point_concurrent(self, timepoint, harpnum, job_num=0, save_path=None):
        # Single leading underscore on purpose: a name-mangled "__" method may
        # fail to pickle when it is handed to ProcessPoolExecutor.
        # harpnum eg "7302", timepoint eg "2018-09-08T14:24:00.000"
        print('job num: {} / pid : {} is running\n'.format(job_num, os.getpid()))
        # Rotate through the registered addresses so that jobs use different emails
        email_len = len(self.mail_address_list)
        mail_num = job_num % email_len
        result = self.download_one_by_time_point(timepoint, harpnum, self.mail_address_list[mail_num], save_path)
        return result

    def download_some_by_time_point(self, timepoint_list, harpnum_list):
        """
        Download one after another for paired lists of time points and HARP numbers
        :param timepoint_list: eg ["2018-09-08T14:24:00.000","2018-09-08T00:00:00.000"]
        :param harpnum_list: eg ["7020","7020"]
        :return: list with one download result (or False) per pair, empty if the lengths differ
        """
        result = []
        t_len = len(timepoint_list)
        h_len = len(harpnum_list)
        if t_len == h_len:
            for i in range(t_len):
                this_result = self.download_one_by_time_point(timepoint_list[i], harpnum_list[i])
                result.append(this_result)
        return result

    def download_some_by_time_point_concurrent(self, timepoint_list, harpnum_list):
        """
        Download in parallel for paired lists of time points and HARP numbers.
        Note that the two lists must have the same length
        :param timepoint_list: eg ["2018-09-08T14:24:00.000","2018-09-08T00:00:00.000"]
        :param harpnum_list: eg ["7020","7020"]
        :return: None
        """
        result = []
        t_len = len(timepoint_list)
        h_len = len(harpnum_list)
        if t_len == h_len:
            # One worker process per registered email address
            max_workers = len(self.mail_address_list)
            executor = ProcessPoolExecutor(max_workers=max_workers)
            for i in range(t_len):
                future = executor.submit(
                    self._download_one_by_time_point_concurrent,
                    timepoint_list[i],
                    harpnum_list[i],
                    i)
                result.append(future)
            executor.shutdown(True)  # wait for all submitted jobs to finish

    def tran_json_file_tai_num_time_to_download_format(self, raw_str):
        """
        Convert a record name (HARP number plus TAI timestamp) into the parameters required for download,
        eg "7020.20170524_222400_TAI" to [7020, "2017-05-24T22:24:00.000"]
        The seconds field is always written as 00
        :param raw_str: str, eg "7020.20170524_222400_TAI"
        :return:  [ int, str ] -> [ harpnum , timestr ], eg [7020, "2017-05-24T22:24:00.000"],
        or None if the input is malformed and the conversion fails
        """
        result = None
        if isinstance(raw_str, str):
            str_list = raw_str.split(".")  # split by . to ["7020","20170524_222400_TAI"]
            if len(str_list) == 2:
                if len(str_list[1]) >= 15:
                    try:
                        hnum = int(str_list[0])  # raises ValueError if the HARP number is not an integer
                        this_list = str_list[1]
                        stime = "20{}-{}-{}T{}:{}:00.000".format(
                            this_list[2:4],
                            this_list[4:6],
                            this_list[6:8],
                            this_list[9:11],
                            this_list[11:13],
                        )  # eg "2017-05-24T22:24:00.000"
                        result = [hnum, stime]
                    except Exception as e:
                        print(e)
        return result

    def get_job_list_from_file(self, file_path):
        """
        The file content is like "7020.20170524_000000_TAI\n7020.20170524_222400_TAI\n"
        :param file_path: str or os.path format
        :return:  ["7020.20170524_000000_TAI","7020.20170524_222400_TAI"]
        """
        result = []
        if os.path.exists(file_path):
            with open(file_path, "r") as f:
                all_list = f.read().split("\n")
            result = all_list
        return result

    def demo_download_one(self):
        self.download_one_by_time_point("2018-08-23T17:36:00.000", "7300",email="eg@eg.com",save_path="/data")

    def demo_download_some_from_file(self, file_path=None, save_path=None, mail_list=None):
        """
        get work job from file and run concurrent
        The file content is like "7020.20170524_000000_TAI\n7020.20170524_222400_TAI\n"
        :param file_path: str or os.path format
        :param save_path: the path to save file
        :param mail_list: ["eg@eg.com","eg2@eg2.com"] which has registered in jsoc before
        :return:  None
        """
        if file_path is None:
            return False
        if save_path is not None:  # was testing file_path, which is never None at this point
            self.data_save_root_path = save_path
        if mail_list is not None:
            self.mail_address_list = mail_list

        job_raw_list = self.get_job_list_from_file(file_path)
        num_list = []
        time_list = []
        for job_raw in job_raw_list:
            job_tran = self.tran_json_file_tai_num_time_to_download_format(job_raw)
            if job_tran is not None:
                num_list.append(job_tran[0])
                time_list.append(job_tran[1])
        self.download_some_by_time_point_concurrent(time_list, num_list)



if __name__ == "__main__":
    print("start run")

    d = DownloadJsoc()
    file_path = r"/www/wwwroot/app_run/data/a"
    save_path = r"/www/wwwroot/app_run/data"
    mail_list = ["demo1@outlook.com",
                 "demo2@outlook.com",
                 "demo3@outlook.com",
                 "demo4@outlook.com", ]
    d.demo_download_some_from_file(file_path, save_path, mail_list)
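
The string slicing in `tran_json_file_tai_num_time_to_download_format` can also be expressed with `datetime.strptime`, which rejects malformed timestamps and keeps the seconds field. A sketch of that alternative follows. The function name `parse_job_line` is hypothetical, and unlike the method above it returns the HARP number as a string.

```python
from datetime import datetime


def parse_job_line(raw_str):
    """Parse "7020.20170524_222400_TAI" into ("7020", "2017-05-24T22:24:00.000").

    Returns None if the line does not match the expected layout.
    """
    parts = raw_str.strip().split(".")
    if len(parts) != 2 or not parts[0].isdigit():
        return None
    try:
        moment = datetime.strptime(parts[1], "%Y%m%d_%H%M%S_TAI")
    except ValueError:
        # Wrong length, non-numeric fields, or an impossible date or time
        return None
    return parts[0], moment.strftime("%Y-%m-%dT%H:%M:%S.000")
```

Blank lines, such as the trailing one left by splitting the job file on newlines, simply come back as None and can be skipped.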