Overview
This tutorial demonstrates how to use Python's requests module to extract financial data from the Jin10 website homepage. We'll implement two key processes: storing the data in a MySQL database and performing word cloud analysis. The whole pipeline highlights Python's efficiency: a few lines of code cover the scraping, storage, and visualization work that would require extensive setup in many other languages.
Environment Preparation
Required Python libraries:
- PIL (Python Imaging Library)
- jieba (Chinese text segmentation)
- requests (HTTP requests)
- wordcloud (word cloud generation)
- pymysql (MySQL database connectivity)
Important Note: This tutorial uses Python 3.6. Some libraries, such as wordcloud, may fail to install through standard pip. For Windows users, we recommend downloading the appropriate .whl file from a reliable Python package repository and installing it manually:
pip install path_to_downloaded_file.whl

Web Scraping Parameter Analysis
When analyzing the Jin10 website's "Load More" functionality, we identified these key API endpoints:
- Primary data endpoint
- Secondary verification endpoint
The "Load More" feature sends two requests to the same URL:
- The first uses the OPTIONS method (a CORS preflight, returning status code 204)
- The second is the actual data request, carrying specific headers and query parameters
The critical parameter is max_time, which should be slightly adjusted (adding a few seconds) between requests to avoid duplicate records.
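As a minimal sketch of that adjustment (the helper name is an assumption; the timestamp format matches the API's time strings used in this tutorial), the max_time string can be bumped forward with the standard datetime module:

```python
from datetime import datetime, timedelta

# Hypothetical helper: push max_time forward by a few seconds so the
# next request starts just past the last record already saved.
def bump_max_time(time_str, seconds=3):
    t = datetime.strptime(time_str, "%Y-%m-%d %H:%M:%S")
    return (t + timedelta(seconds=seconds)).strftime("%Y-%m-%d %H:%M:%S")

print(bump_max_time("2019-08-12 14:18:48"))  # 2019-08-12 14:18:51
```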
Data Scraping and Database Integration
1. MySQL Table Creation
DROP TABLE IF EXISTS `jin10_data`;
CREATE TABLE `jin10_data` (
  `id` varchar(50) DEFAULT NULL,
  `time` varchar(20) DEFAULT NULL,
  `content` longtext
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

2. Response Data Analysis
The API returns JSON data with this structure:
{
  "status": 200,
  "message": "OK",
  "data": [
    {
      "id": "20190812082759520100",
      "time": "2019-08-12 08:27:59",
      "type": 0,
      "data": {
        "pic": "",
        "content": "Sample financial news content..."
      }
    }
  ]
}

3. Complete Scraping Script
import requests
import pymysql

# Database storage function
def save(conn, cur, id, time, content):
    sql = '''INSERT INTO jin10_data(id, time, content) VALUES(%s, %s, %s);'''
    try:
        cur.execute(sql, (id, time, content))
        conn.commit()
    except Exception as e:
        print(f'Database error: {e}')

# Main scraping logic
url = "https://flash-api.jin10.com/get_flash_list"
headers = {
    "x-app-id": "SO1EJGmNgCtmpcPF",
    "x-version": "1.0.0"
}
params = {
    "max_time": "2019-08-12 14:18:48",
    "channel": "-8200"
}

# Database connection (utf8mb4 matches the table's character set)
conn = pymysql.connect(
    host='127.0.0.1',
    user='root',
    password='123456',
    db='python_data',
    charset="utf8mb4"
)
cur = conn.cursor()

# Continuous scraping loop
total_count = 0
data = requests.get(url, params=params, headers=headers).json()['data']
data_length = len(data)
while data_length > 0:
    for item in data:
        try:
            save(
                conn,
                cur,
                item['id'],
                item['time'],
                item['data']['content']
            )
        except Exception as e:
            print(f'Processing error: {e}')
    total_count += data_length
    # Adding a few seconds here, as noted earlier, avoids re-fetching the last record
    params['max_time'] = data[data_length - 1]['time']
    print(f'Next query time: {params["max_time"]}')
    # Fetch next batch
    data = requests.get(url, params=params, headers=headers).json()['data']
    data_length = len(data)

# Cleanup
cur.close()
conn.close()
print(f'Completed successfully. Total records: {total_count}')

Word Cloud Generation
1. Preparation
You'll need:
- A local font file (e.g., simsun.ttc)
- A mask image (mask.png) for the word cloud shape
2. Complete Word Cloud Code
import jieba.analyse
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator

# Initialize content collector
news_content = ''

# Fetch and process data (similar to scraping section)
# ... [scraping code from previous section] ...

# Keyword analysis
result = jieba.analyse.textrank(news_content, topK=50, withWeight=True)
keywords = {word[0]: word[1] for word in result}

# Generate word cloud
mask_image = Image.open('./mask.png')
mask_array = np.array(mask_image)
wc = WordCloud(
    font_path='./simsun.ttc',
    background_color='white',
    max_words=50,
    mask=mask_array
)
wc.generate_from_frequencies(keywords)

# Display and save
plt.imshow(wc)
plt.axis("off")
plt.show()
wc.to_file('financial_wordcloud.png')

Programming Insights
Key takeaways from this implementation:
- Type Checking: Use type(data) to understand your data structure
- List Handling: Get length with len(alist) and iterate with for i in range(length)
- Module Organization: Create separate files/modules for better organization
- Function Ordering: Define functions before using them
- Library Installation: Manual .whl installation often solves pip issues
- Database Efficiency: Python's database operations are remarkably concise
- Time Parameter Adjustment: Slightly adjust max_time between requests to avoid duplicates
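The type-checking and list-handling takeaways in miniature (the sample list is made up, shaped like one batch of the API's data field):

```python
# A sample batch shaped like the API's "data" list
data = [{"id": "1"}, {"id": "2"}, {"id": "3"}]

# Type checking reveals the structure before you index into it
print(type(data))  # <class 'list'>

# len() gives the batch size; range() drives an index-based loop
length = len(data)
for i in range(length):
    print(data[i]["id"])
```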
FAQ Section
Q: Why is my wordcloud installation failing?
A: Some systems require manual installation of the wordcloud library. Download the appropriate .whl file for your Python version and system architecture, then install using pip install package_name.whl.
Q: How can I avoid duplicate records when scraping?
A: The key is adjusting the max_time parameter slightly (adding a few seconds) between requests to ensure you get new records without overlaps.
Q: What's the best way to handle Chinese text segmentation?
A: The jieba library is currently the most effective solution for Chinese text processing, offering both precise segmentation and keyword extraction capabilities.
Q: How can I improve the visual quality of my word cloud?
A: Experiment with different fonts, mask images, and color schemes. The ImageColorGenerator can create visually striking results by matching the word colors to your mask image.