Python MySQL Scraping & Word Cloud Generation: A Financial Data Analysis Guide


Overview

This tutorial demonstrates how to use Python's requests module to extract financial news flashes from the Jin10 website homepage. We'll implement two key processes: storing the data in a MySQL database and performing word cloud analysis on it. Python's conciseness stands out here: just a few lines of code accomplish what would require extensive setup in many other languages.

Environment Preparation

Required Python libraries:

  1. PIL (Python Imaging Library)
  2. jieba (Chinese text segmentation)
  3. requests (HTTP requests)
  4. wordcloud (word cloud generation)
  5. pymysql (MySQL database connectivity)

Important Note: This tutorial uses Python 3.6. Some libraries like wordcloud may encounter installation issues through standard pip. For Windows users, we recommend downloading the appropriate .whl file from a reliable Python package repository and installing it manually using:

pip install path_to_downloaded_file.whl

Web Scraping Parameter Analysis

When analyzing the Jin10 website's "Load More" functionality, we identified two key API endpoints:

  1. Primary data endpoint
  2. Secondary verification endpoint

Each click of "Load More" sends two requests to the same URL; only one carries the news payload. The critical parameter is max_time, which should be shifted slightly (by a few seconds) between requests to avoid duplicate records.
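The timestamp nudge can be done with the standard datetime module. A minimal sketch, assuming the API's "%Y-%m-%d %H:%M:%S" timestamp format shown later in this tutorial (the 2-second offset is an arbitrary choice, not a value documented by the site):

```python
from datetime import datetime, timedelta

def next_max_time(last_time: str, offset_seconds: int = 2) -> str:
    """Shift the last record's timestamp forward slightly so the
    next request does not return the same records again."""
    t = datetime.strptime(last_time, "%Y-%m-%d %H:%M:%S")
    return (t + timedelta(seconds=offset_seconds)).strftime("%Y-%m-%d %H:%M:%S")

print(next_max_time("2019-08-12 14:18:48"))  # → 2019-08-12 14:18:50
```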

Data Scraping and Database Integration

1. MySQL Table Creation

DROP TABLE IF EXISTS `jin10_data`;
CREATE TABLE `jin10_data` (
  `id` varchar(50) DEFAULT NULL,
  `time` varchar(20) DEFAULT NULL,
  `content` longtext
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

2. Response Data Analysis

The API returns JSON data with this structure:

{
  "status": 200,
  "message": "OK",
  "data": [
    {
      "id": "20190812082759520100",
      "time": "2019-08-12 08:27:59",
      "type": 0,
      "data": {
        "pic": "",
        "content": "Sample financial news content..."
      }
    }
  ]
}
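Given a response shaped like the JSON above, the three fields the scraper needs (id, time, and the nested content) can be pulled out with the standard json module. A quick sketch against a hard-coded sample:

```python
import json

# Hard-coded sample matching the response structure above.
raw = '''{
  "status": 200,
  "message": "OK",
  "data": [
    {"id": "20190812082759520100",
     "time": "2019-08-12 08:27:59",
     "type": 0,
     "data": {"pic": "", "content": "Sample financial news content..."}}
  ]
}'''

payload = json.loads(raw)
for item in payload["data"]:
    # Each record nests its text under a second "data" key.
    record = (item["id"], item["time"], item["data"]["content"])
    print(record)
```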

3. Complete Scraping Script

import requests
import pymysql

# Database storage function
def save(conn, cur, id, time, content):
    sql = '''INSERT INTO jin10_data(id,time,content) VALUES(%s,%s,%s);'''
    try:
        cur.execute(sql, (id, time, content))
        conn.commit()
    except Exception as e:
        print(f'Database error: {e}')

# Main scraping logic
url = "https://flash-api.jin10.com/get_flash_list"
headers = {
    "x-app-id": "SO1EJGmNgCtmpcPF",
    "x-version": "1.0.0"
}
params = {
    "max_time": "2019-08-12 14:18:48",
    "channel": "-8200"
}

# Database connection
conn = pymysql.connect(
    host='127.0.0.1',
    user='root',
    password='123456',
    db='python_data',
    charset="utf8mb4"  # match the table's utf8mb4 charset
)
cur = conn.cursor()

# Continuous scraping loop
total_count = 0
data = requests.get(url, params=params, headers=headers).json()['data']
data_length = len(data)

while data_length > 0:
    for item in data:
        try:
            save(
                conn,
                cur,
                item['id'],
                item['time'],
                item['data']['content']
            )
        except Exception as e:
            print(f'Processing error: {e}')
    
    total_count += data_length
    params['max_time'] = data[-1]['time']  # advance the paging cursor to the last record's timestamp
    print(f'Next query time: {params["max_time"]}')
    
    # Fetch next batch
    data = requests.get(url, params=params, headers=headers).json()['data']
    data_length = len(data)

# Cleanup
cur.close()
conn.close()
print(f'Completed successfully. Total records: {total_count}')

Word Cloud Generation

1. Preparation

You'll need two local files, both referenced in the code below:

  1. A mask image (mask.png) that defines the word cloud's shape
  2. A font file that supports Chinese characters (simsun.ttc), since the default font cannot render them

2. Complete Word Cloud Code

import jieba.analyse
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator

# Initialize content collector
news_content = ''

# Fetch and process data (similar to scraping section)
# ... [scraping code from previous section] ...

# Keyword analysis
result = jieba.analyse.textrank(news_content, topK=50, withWeight=True)
keywords = {word[0]: word[1] for word in result}

# Generate word cloud
mask_image = Image.open('./mask.png')
mask_array = np.array(mask_image)

wc = WordCloud(
    font_path='./simsun.ttc',
    background_color='white',
    max_words=50,
    mask=mask_array
)
wc.generate_from_frequencies(keywords)

# Display and save
plt.imshow(wc)
plt.axis("off")
plt.show()
wc.to_file('financial_wordcloud.png')
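Since the scraped headlines are already stored in MySQL, news_content can also be built from the database instead of re-scraping. A minimal sketch of the joining step; the rows list here is a stand-in for what cur.execute("SELECT content FROM jin10_data") followed by cur.fetchall() would return:

```python
# Stand-in for rows fetched from the jin10_data table;
# each row is a one-element tuple holding the content field.
rows = [
    ("央行今日开展逆回购操作...",),
    ("美元指数短线走高...",),
]

# Concatenate every content field into one string for jieba to analyse.
news_content = ''.join(row[0] for row in rows)
print(len(news_content))
```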

Programming Insights

Key takeaways from this implementation:

  1. Type Checking: Use type(data) to understand your data structure
  2. List Handling: Get a list's length with len(alist); iterating directly with for item in alist is cleaner than indexing through range(len(alist))
  3. Module Organization: Create separate files/modules for better organization
  4. Function Ordering: Define functions before using them
  5. Library Installation: Manual .whl installation often solves pip issues
  6. Database Efficiency: Python's database operations are remarkably concise
  7. Time Parameter Adjustment: Slightly adjust max_time between requests to avoid duplicates
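The first two tips can be illustrated with a toy example of the API's record list:

```python
# A miniature stand-in for the parsed 'data' list from the API response.
data = [{"id": "1"}, {"id": "2"}]

print(type(data))  # <class 'list'> — confirm the structure before indexing into it
print(len(data))   # 2

# Iterating directly is cleaner than indexing via range(len(data)):
for item in data:
    print(item["id"])
```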

FAQ Section

Q: Why is my wordcloud installation failing?

A: Some systems require manual installation of the wordcloud library. Download the appropriate .whl file for your Python version and system architecture, then install using pip install package_name.whl.

Q: How can I avoid duplicate records when scraping?

A: The key is adjusting the max_time parameter slightly (adding a few seconds) between requests to ensure you get new records without overlaps.

Q: What's the best way to handle Chinese text segmentation?

A: The jieba library is currently the most effective solution for Chinese text processing, offering both precise segmentation and keyword extraction capabilities.

Q: How can I improve the visual quality of my word cloud?

A: Experiment with different fonts, mask images, and color schemes. The ImageColorGenerator can create visually striking results by matching the word colors to your mask image.