How to Write a Financial Chatbot (First Part): 3 Steps to Crawl Hong Kong Stock Market (HKEX) Real-Time Stock Quotes

rockingdingo 2023-10-16 #financial #chatbot #stock quotes

Navigation

1. Spider URL

2. Secret Token

3. Complete Program

In this post, we show how simple it is to write a Python spider that crawls real-time stock quotes from the official website of the Hong Kong Stock Market (HKEX). We use the common Python libraries "requests" and "BeautifulSoup", and the complete code is attached at the end of the post. Using Tencent (stock code: 700) as an example, we download the real-time stock price quote from HKEX's official Tencent Stock Price page. The approach was valid as of October 2023.

HKEX Website Realtime Quote

1. Spider URL

The most important step in writing a spider is finding the URL that actually serves the data. Suppose we want to download the real-time quote data shown in the summary section of this page: (https://www.hkex.com.hk/Market-Data/Securities-Prices/Equities/Equities-Quote?sym=700&sc_lang=en). Can we simply use the requests library to GET the data from this URL? The answer is no: this page only displays data that it loads from a separate servlet, so the HTML returned for this URL does not contain the quote data itself.

HKEX Website Tencent Stock Quote Summary

The correct way is to open the Network panel of your Chrome browser's developer tools and watch how the data service is called. There you can find the servlet that actually hosts the data, at a URL like (https://www1.hkex.com.hk/hkexwidget/data/getequityquote?sym=700&token=evLtsLsBNAUVTPxtGqVeG9et0X662MH%2fwrT6%2b8XWE0MkNcW%2bCSnLXhogDZ3mSR2L&lang=eng&qid=1690107390362&callback=jQuery351019444984572939505_1690107383211&_=1690107383212). The complete URL includes a secret token, which is issued when you first visit the website from a browser and expires after some time (maybe 24 hours). The token is a string like "evLtsLsBNAUVTPxtGqVeG9et0X662MH%2fwrT6%2b8XWE0MkNcW%2bCSnLXhogDZ3mSR2L". The next difficulty is obtaining a valid token automatically, so that you don't need to copy it manually each time you run the program.
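To make the structure of that URL easier to see, here is a minimal sketch that assembles it from its query parameters. The helper name `build_quote_url` and the dummy token are made up for illustration, and the meaning of `qid`, `callback` and `_` is only inferred from the example URL above (they look like jQuery JSONP bookkeeping values). Because the token in the example URL already contains percent-encoded sequences such as `%2f`, the sketch assumes it should be passed through unchanged and marks "%" as safe.

```python
from urllib.parse import urlencode

HKEX_QUOTE_ENDPOINT = "https://www1.hkex.com.hk/hkexwidget/data/getequityquote"


def build_quote_url(symbol, token, timestamp_ms,
                    callback_prefix="jQuery351019444984572939505"):
    """Assemble the widget URL from its query parameters (hypothetical helper)."""
    params = {
        "sym": symbol,
        "token": token,
        "lang": "eng",
        "qid": timestamp_ms,
        "callback": "%s_%d" % (callback_prefix, timestamp_ms),
        "_": timestamp_ms,
    }
    # safe="%" keeps an already percent-encoded token from being encoded twice.
    return HKEX_QUOTE_ENDPOINT + "?" + urlencode(params, safe="%")


# Dummy token, used only to show that the encoded characters survive.
url = build_quote_url("700", "abc%2fdef%2bghi", 1690107390362)
print(url)
```

Without `safe="%"`, `urlencode` would turn `%2f` into `%252f`, and the server would most likely reject the token.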

Python Code

            import json
            import re
            import time

            import requests
            from bs4 import BeautifulSoup

            def get_equity_quote_data_from_hkex(token, symbol):
                """
                    URL:https://www1.hkex.com.hk/hkexwidget/data/getequityquote?sym=3690&token=evLtsLsBNAUVTPxtGqVeG9et0X662MH%2fwrT6%2b8XWE0MkNcW%2bCSnLXhogDZ3mSR2L&lang=eng&qid=1690107390362&callback=jQuery351019444984572939505_1690107383211&_=1690107383212
                """
                equity_data = None
                try:
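                    # Characters to strip from the response before parsing it.
                    special_token_list = ["\n"]
                    if token is None:
                        # No token supplied: scrape a fresh one from the quote page.
                        token = fetch_clean_token_by_force()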
                    timestamp = int(time.time())
                    url = "https://www1.hkex.com.hk/hkexwidget/data/getequityquote?sym=%s&token=%s&lang=eng&qid=%d&callback=jQuery351019444984572939505_%d&_=%d" % (symbol,token, timestamp, timestamp, timestamp)
                    print ("DEBUG: symbol is %s, url is %s" % (symbol, url))
                    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
                    res = requests.get(url, headers=headers)
                    soup1 = BeautifulSoup(res.text, 'html.parser')
                    output_json_text = soup1.text
                    for special_token in special_token_list:
                        output_json_text = output_json_text.replace(special_token, "")
                    search_result = re.search("[(]", output_json_text)
                    if search_result is None:
                        return equity_data
                    start_index = search_result.span()[0] + 1
                    end_index = len(output_json_text) - 1
                    output_json = output_json_text[start_index:end_index]
                    stock_quote_json = json.loads(output_json)

                    response_code = stock_quote_json["data"]["responsecode"]
                    if response_code == "000":
                        equity_data=stock_quote_json["data"]["quote"]
                    else:
                        equity_data= None
                except Exception as e:
                    print (e)
                    equity_data = None
                return equity_data
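The response handling in the function above amounts to stripping a JSONP wrapper: the servlet returns the JSON payload wrapped in a call to the `callback` name from the URL. As a rough sketch, the same step can be isolated in a small helper. The name `unwrap_jsonp` and the sample response are made up for illustration; only the `data`, `responsecode` and `quote` keys come from the code above.

```python
import json
import re

# Everything before the first "(" is the callback name; the payload runs
# up to the last ")" (optionally followed by ";").
_JSONP_PATTERN = re.compile(r"^[^(]*\((.*)\)\s*;?\s*$", re.DOTALL)


def unwrap_jsonp(text):
    """Return the JSON payload of a JSONP response, or None if there is no wrapper."""
    match = _JSONP_PATTERN.match(text.strip())
    if match is None:
        return None
    return json.loads(match.group(1))


# Shortened, made-up response in the same shape the function above expects.
sample = 'jQuery351_1690107383211({"data": {"responsecode": "000", "quote": {"sym": "700", "ccy": "HKD"}}})'
payload = unwrap_jsonp(sample)
print(payload["data"]["quote"])
```

Matching up to the last closing parenthesis, instead of assuming it is the final character, also tolerates a trailing semicolon or whitespace.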

2. Secret Token

The next question is how to fetch a valid token. The answer is to visit the normal quote page first: the token is returned inside one of its script sections. You can parse the Base64 token with a regex and save it for reuse over a short period, e.g. 24 hours. Search the page source for the keyword "token" to find the right place to apply the regex.
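Saving the token for reuse can be sketched with a small time-based cache. The class name `TokenCache` and the fake fetcher below are hypothetical, and the 24-hour lifetime is only the guess mentioned above, not a documented value. A library such as cachetools provides a ready-made `TTLCache`; this version uses only the standard library so that it runs as is.

```python
import time


class TokenCache:
    """Keep one token in memory and refetch it once it is older than ttl_seconds."""

    def __init__(self, fetch_func, ttl_seconds=24 * 60 * 60, clock=time.time):
        self._fetch_func = fetch_func    # callable that returns a fresh token
        self._ttl_seconds = ttl_seconds  # assumed lifetime of a token
        self._clock = clock              # injectable so expiry can be simulated
        self._token = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        expired = (self._fetched_at is None
                   or now - self._fetched_at >= self._ttl_seconds)
        # An empty token means the last fetch failed, so try again.
        if expired or not self._token:
            self._token = self._fetch_func()
            self._fetched_at = now
        return self._token


# Demo with a fake fetcher and a fake clock instead of the real website.
calls = []


def fake_fetch():
    calls.append(1)
    return "token-%d" % len(calls)


fake_now = [0.0]
cache = TokenCache(fake_fetch, ttl_seconds=60, clock=lambda: fake_now[0])
first = cache.get()    # fetches
second = cache.get()   # served from the cache
fake_now[0] = 61.0     # move the fake clock past the TTL
third = cache.get()    # fetches again
print(first, second, third)
```

In the real program, the function that scrapes the token from the quote page would be passed in place of `fake_fetch`.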

HKEX Website Realtime Quote

Python Code

            def fetch_clean_token_by_force():
                final_token = ""
                try:
                    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
                    page_max = 100
                    house = 'https://www.hkex.com.hk/Market-Data/Securities-Prices/Equities/Equities-Quote?sym=700&sc_lang=en'
                    res = requests.get(house, headers=headers)
                    soup = BeautifulSoup(res.text, 'html.parser')
                
                    script_list = soup.select("script")
                    special_token_mark = "LabCI.getToken"
                    parsed_script_list = []
                    for script in script_list:
                        if special_token_mark in script.text:
                            parsed_script_list.append(script)
                    token_area_text = parsed_script_list[0].text if len(parsed_script_list) > 0 else ""
                    # print("DEBUG: parsed_script_list is: " + str(parsed_script_list))
                    if token_area_text == "":
                        raw_text = str(soup)
                        search_list = re.search(special_token_mark, raw_text)
                        token_area_text = ""
                        if search_list is not None:
                            start_index = search_list.span()[0]
                            token_area_text = raw_text[start_index:start_index+1000]

                    if token_area_text == "":
                        final_token = ""
                    else:
                        """
                            ## return "evLtsLsBNAUVTPxtGqVeG51zOWokjjvirZ7EhA46047jtmVBVWpZxDY3Mqxv4Q57";
                        """
                        search_list_start = re.search(' return ', token_area_text)
                        start_span =search_list_start.span()
                        ## return "evLtsLsBNAUVTPxtGqVeG51zOWokjjvirZ7EhA46047jtmVBVWpZxDY3Mqxv4Q57";
                        # token_len = 68 ## hk stock 
                        start_index = start_span[1] + 1 if len(start_span) > 1 else 0  # num[0] start index of string, num[1] end index of string
                        sub_token_area_text = token_area_text[start_index:]

                        # return
                        end_search_list = re.search('"', sub_token_area_text)
                        end_span =end_search_list.span()
                        end_index = end_span[0]
                        final_token = sub_token_area_text[:end_index]
                    return final_token
                except Exception as e:
                    print ("DEBUG: fetch_clean_token_by_force meet error...")
                    print (e)
                    return final_token
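The index arithmetic above can also be written as a single regular expression. The following is only a sketch: the sample script is a guess at the shape of the page source, built from the `LabCI.getToken` marker and the `return "...";` comment in the function above, and `extract_token` is a hypothetical helper name.

```python
import re

# Find the first double-quoted string returned after the LabCI.getToken marker.
TOKEN_PATTERN = re.compile(r'LabCI\.getToken.*?return\s+"([^"]+)"', re.DOTALL)


def extract_token(script_text):
    """Return the token found in script_text, or "" if none is found."""
    match = TOKEN_PATTERN.search(script_text)
    return match.group(1) if match else ""


# Assumed shape of the script section; the real page may differ.
sample_script = '''
LabCI.getToken = function () {
    return "evLtsLsBNAUVTPxtGqVeG51zOWokjjvirZ7EhA46047jtmVBVWpZxDY3Mqxv4Q57";
};
'''

print(extract_token(sample_script))
```

Returning an empty string when nothing matches mirrors the behavior of the function above, so the caller can treat "" as a failed fetch.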

3. Complete Program

Python Code


        #!/usr/bin/python
        # coding=utf-8
        import json
        import re
        import time

        import requests
        from bs4 import BeautifulSoup

        def get_equity_quote_data_from_hkex(token, symbol):
                """
                    URL:https://www1.hkex.com.hk/hkexwidget/data/getequityquote?sym=3690&token=evLtsLsBNAUVTPxtGqVeG9et0X662MH%2fwrT6%2b8XWE0MkNcW%2bCSnLXhogDZ3mSR2L&lang=eng&qid=1690107390362&callback=jQuery351019444984572939505_1690107383211&_=1690107383212
                """
                equity_data = None
                try:
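                    # Characters to strip from the response before parsing it.
                    special_token_list = ["\n"]
                    if token is None:
                        # No token supplied: scrape a fresh one from the quote page.
                        token = fetch_clean_token_by_force()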
                    timestamp = int(time.time())
                    url = "https://www1.hkex.com.hk/hkexwidget/data/getequityquote?sym=%s&token=%s&lang=eng&qid=%d&callback=jQuery351019444984572939505_%d&_=%d" % (symbol,token, timestamp, timestamp, timestamp)
                    print ("DEBUG: symbol is %s, url is %s" % (symbol, url))
                    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
                    res = requests.get(url, headers=headers)
                    soup1 = BeautifulSoup(res.text, 'html.parser')
                    output_json_text = soup1.text
                    for special_token in special_token_list:
                        output_json_text = output_json_text.replace(special_token, "")
                    search_result = re.search("[(]", output_json_text)
                    if search_result is None:
                        return equity_data
                    start_index = search_result.span()[0] + 1
                    end_index = len(output_json_text) - 1
                    output_json = output_json_text[start_index:end_index]
                    stock_quote_json = json.loads(output_json)

                    response_code = stock_quote_json["data"]["responsecode"]
                    if response_code == "000":
                        equity_data=stock_quote_json["data"]["quote"]
                    else:
                        equity_data= None
                except Exception as e:
                    print (e)
                    equity_data = None
                return equity_data

        def fetch_clean_token_by_force():
                final_token = ""
                try:
                    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
                    page_max = 100
                    house = 'https://www.hkex.com.hk/Market-Data/Securities-Prices/Equities/Equities-Quote?sym=700&sc_lang=en'
                    res = requests.get(house, headers=headers)
                    soup = BeautifulSoup(res.text, 'html.parser')
                
                    script_list = soup.select("script")
                    special_token_mark = "LabCI.getToken"
                    parsed_script_list = []
                    for script in script_list:
                        if special_token_mark in script.text:
                            parsed_script_list.append(script)
                    token_area_text = parsed_script_list[0].text if len(parsed_script_list) > 0 else ""
                    # print("DEBUG: parsed_script_list is: " + str(parsed_script_list))
                    if token_area_text == "":
                        raw_text = str(soup)
                        search_list = re.search(special_token_mark, raw_text)
                        token_area_text = ""
                        if search_list is not None:
                            start_index = search_list.span()[0]
                            token_area_text = raw_text[start_index:start_index+1000]

                    if token_area_text == "":
                        final_token = ""
                    else:
                        """
                            ## return "evLtsLsBNAUVTPxtGqVeG51zOWokjjvirZ7EhA46047jtmVBVWpZxDY3Mqxv4Q57";
                        """
                        search_list_start = re.search(' return ', token_area_text)
                        start_span =search_list_start.span()
                        ## return "evLtsLsBNAUVTPxtGqVeG51zOWokjjvirZ7EhA46047jtmVBVWpZxDY3Mqxv4Q57";
                        # token_len = 68 ## hk stock 
                        start_index = start_span[1] + 1 if len(start_span) > 1 else 0  # num[0] start index of string, num[1] end index of string
                        sub_token_area_text = token_area_text[start_index:]

                        # return
                        end_search_list = re.search('"', sub_token_area_text)
                        end_span =end_search_list.span()
                        end_index = end_span[0]
                        final_token = sub_token_area_text[:end_index]
                    return final_token
                except Exception as e:
                    print ("DEBUG: fetch_clean_token_by_force meet error...")
                    print (e)
                    return final_token

        def main():
            token = fetch_clean_token_by_force()
            entity_data = get_equity_quote_data_from_hkex(token, "700")
            print ("DEBUG: token: %s, result entity data %s" % (token, entity_data))

        if __name__ == '__main__':
            main()

The real-time quote returned for Tencent is shown below:

            DEBUG: token: evLtsLsBNAUVTPxtGqVeG%2fSJGKiOeWzHBmhCQdLaULdEHir%2fITnBt%2fe2wKkffmNY, result entity data {u'ew_underlying_code': None, u'h_share_flag': False, u'shares_issued_date': u'12 Oct 2023', u'ew_desc': None, u'class_A_description': None, u'class_B_description': None, u'tck': u'0.200', u'hc': u'306.800', u'floating_flag': False, u'secondary_listing': False, u'listing_date': u'16 Jun 2004', u'lo': u'', u'ew_strike': u'', u'nom_ccy': None, u'nav_ccy': None, u'ls': u'', u'hsic_sub_sector_classification': u'E-Commerce & Internet Services', u'ew_amt_os': u'', u'premium': u'', u'asset_class': None, u'f_aum_hkd': None, u'rs_stock_flag': False, u'issued_shares_class_B': None, u'issued_shares_class_A': None, u'etp_baseCur': None, u'ew_amt_os_cur': None, u'primary_market': None, u'yield': u'', u'summary': u'Tencent Holdings Ltd is an investment holding company primarily engaged in the provision of value-added (VAS) services, online advertising services, as well as FinTech and business services. The Company primarily operates through four segments. The VAS segment is mainly engaged in the provision of online games, video account live broadcast services, paid video membership services and other social network services. The Online Advertising segment is mainly engaged in media advertising, social and other advertising businesses. The FinTech and Business Services segment mainly provides commercial payment, FinTech and cloud services. 
The Others segment is principally engaged in the investment, production and distribution of films and television program for third parties, copyrights licensing, merchandise sales and various other activities.', u'depositary': None, u'underlying_code': None, u'ew_strike_cur': None, u'transfer_of_listing_date': u'', u'aum_date': u'', u'ew_sub_right': u'', u'primaryexch': u'HKEX', u'fiscal_year_end': u'31 Dec 2022', u'ew_amt_os_dat': u'', u'entitlement_ratio': u'', u'management_fee': u'', u'office_address': u"29/F, Three Pacific PlaceNo 1 Queen's Road EastWanchaiHong Kong", u'ew_sub_per_to': u'', u'nav_date': u'', u'issue': u'', u'bd': u'', u'listing_category': u'Primary Listing', u'strike_price': u'', u'aum_u': u'', u'isin': u'KYG875721634', u'hsic_ind_classification': u'Information Technology - Software & Services', u'updatetime': u'', u'trdstatus': u'N', u'strike_price_ccy': None, u'am_u': u'', u'domicile_country': None, u'registrar': u'Computershare Hong Kong Investor Services Ltd.', u'exotic_type': None, u'wnt_gear': u'', u'base_currency': None, u'eps': 19.7568, u'os': u'', u'multiple_counter': [{u'counter_trading_currency': u'CNY', u'counter_sym': u'80700'}], u'db_updatetime': u'16 Oct 2023 07:20', u'callput_indicator': None, u'coupon': u'', u'product_subtype': None, u'inception_date': u'', u'issuer_name': u'Tencent Holdings Ltd.', u'eps_ccy': u'RMB', u'eff_gear': u'', u'aum': u'', u'pc': u'', u'sedol': u'BMMV2K8', u'lot': u'100', u'pe': u'13.78', u'launch_date': u'', u'call_price': u'', u'product_type': u'EQTY', u'div_yield': u'0.78', u'underlying_index': None, u'nm_s': u'TENCENT', u'secondary_listing_flag': False, u'hist_closedate': u'13 Oct 2023', u'hi': u'', u'vo_u': u'', u'vo': u'', u'as_at_label': u'as at', u'incorpin': u'Cayman Islands', u'underlying_ric': u'0700.HK', u'hdr': False, u'lo52': u'188.626', u'csic_classification': None, u'mkt_cap_u': u'B', u'geographic_focus': None, u'days_to_expiry': None, u'am': u'', u'iv': u'', u'as': u'', 
u'replication_method': None, u'mkt_cap': u'2,923.08', u'board_lot_nominal': None, u'ric': u'0700.HK', u'exotic_warrant_indicator': None, u'inline_lower_strike_price': u'', u'hi52': u'416.600', u'nm': u'Tencent Holdings Ltd.', u'issued_shares_note': None, u'nc': u'', u'nav': u'', u'chairman': u'Ma Huateng', u'investment_focus': None, u'original_offer_price': u'', u'inline_upper_strike_price': u'', u'update_time': u'2023-10-15 23:20:03.0', u'amt_ccy': None, u'sym': u'700', u'expiry_date': u'', u'amt_os': u'9,527,658,306', u'ccy': u'HKD', u'ew_sub_per_from': u'', u'interest_payment_date': u'-', u'counter_label': u'counter', u'op': u'', u'moneyness': u''}
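The quote comes back as a flat dictionary whose values are mostly strings, and many of them are empty in this snapshot. For a chatbot reply you would typically pick out a few fields and convert them to numbers. The sketch below does that for a handful of keys copied from the output above. The helper names are hypothetical, and the reading of `ls` as the last price and of `hi52`/`lo52` as the 52-week high and low is an assumption based on the key names.

```python
def to_float(text):
    """Convert strings such as '2,923.08' to float; return None for empty values."""
    if text is None or str(text).strip() == "":
        return None
    return float(str(text).replace(",", ""))


def summarize_quote(quote):
    """Pick a few fields from the quote dictionary (field meanings are assumed)."""
    return {
        "name": quote.get("nm"),
        "currency": quote.get("ccy"),
        "last_price": to_float(quote.get("ls")),   # assumed: last traded price
        "high_52w": to_float(quote.get("hi52")),   # assumed: 52-week high
        "low_52w": to_float(quote.get("lo52")),    # assumed: 52-week low
        "pe": to_float(quote.get("pe")),
        "market_cap": to_float(quote.get("mkt_cap")),
        "market_cap_unit": quote.get("mkt_cap_u"),
    }


# A few key/value pairs copied from the output above.
sample_quote = {
    "nm": "Tencent Holdings Ltd.",
    "ccy": "HKD",
    "ls": "",
    "hi52": "416.600",
    "lo52": "188.626",
    "pe": "13.78",
    "mkt_cap": "2,923.08",
    "mkt_cap_u": "B",
}

summary = summarize_quote(sample_quote)
print(summary)
```

Empty strings are mapped to None so that the chatbot can distinguish a missing value from a real zero.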
