Processing Excel Data with Python: Automatically Finding and Extracting Keywords

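Excel is a powerful tool for data processing and analysis. When the data set is large or the work has to be repeated, however, manual operations become tedious and error-prone. This article shows how to automate the processing of an Excel file with Python and the Pandas library, focusing on how to find the sentences that contain specific keywords and extract them.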

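Scenario

Suppose we have an Excel file containing a large number of chat records, where each record holds the conversation between an agent (坐席) and a customer (客户). We want to pull out every sentence in these conversations that contains one of a set of keywords and write each matching sentence into its own column, which makes later analysis and processing easier.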

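Implementation Steps

1. Import the required libraries
2. Read the Excel file
3. Define the keywords and separators
4. Iterate over each row and extract the matching sentences
5. Write the extracted sentences to new columns
6. Save the modified Excel file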

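1. Import the required libraries

First, we import Pandas for data handling and the standard-library re module for matching keywords with regular expressions.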

import pandas as pd
import re

2. Read the Excel file

We define the file path and the worksheet name, then read the worksheet into a DataFrame with Pandas.

file_path = '数据模板.xlsx'
sheet_name = 'Sheet1'
df = pd.read_excel(file_path, sheet_name=sheet_name)

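3. Define the keywords and separators

We define the keywords to look for and the separators used to split the conversation text. The keywords are combined into a single case-insensitive regular expression. Because re.escape turns a literal space into a backslash followed by a space, we replace that sequence with \s* so that a keyword still matches when the whitespace inside it varies.

# Placeholder keywords; replace them with the real ones
keywords = ['关键字1', '关键字2', '关键字3', '关键字4', '关键字5']
# re.escape() escapes a space as '\ '; swapping it for '\s*' allows any amount of whitespace
pattern = re.compile('|'.join([re.escape(keyword).replace('\\ ', '\\s*') for keyword in keywords]), re.IGNORECASE)
# Speaker labels in the chat text: "agent:" and "customer:"
separators = ['坐席：', '客户：']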
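To make the whitespace handling easier to check, here is a minimal sketch of the same idea written as a small helper. The function name build_pattern and the English keywords are hypothetical and chosen only for illustration; the helper escapes each whitespace-separated piece of a keyword and rejoins the pieces with \s*, which is one possible way to get the behaviour described above, not the only one.

```python
import re

# Hypothetical keywords for illustration; the second one contains a space.
keywords = ['refund', 'order number']


def build_pattern(words):
    """Compile one alternation that tolerates any whitespace inside a keyword."""
    alternatives = []
    for word in words:
        # Escape each whitespace-separated piece, then rejoin the pieces with \s*.
        pieces = [re.escape(piece) for piece in word.split()]
        alternatives.append(r'\s*'.join(pieces))
    return re.compile('|'.join(alternatives), re.IGNORECASE)


pattern = build_pattern(keywords)

for text in ['Please send my Order   Number',
             'my ordernumber is 42',
             'nothing relevant here']:
    print(repr(text), bool(pattern.search(text)))
```

Note that an empty keyword would produce an empty alternative that matches every string, so it is worth filtering empty entries out of the keyword list before compiling.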

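4. Iterate over each row and extract the matching sentences

We iterate over the rows, split the conversation text on the separators, and keep only the segments that contain a keyword.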

start_col = 'D'
for index, row in df.iterrows():
    cell_content = str(row.iloc[2])  # access the third column (column C) by position
    found_sentences = []
    parts = re.split('|'.join(map(re.escape, separators)), cell_content)
    for part in parts:
        if pattern.search(part):
            found_sentences.append(part.strip())
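Splitting this way discards the speaker labels, so afterwards we no longer know whether a matching sentence came from the agent or the customer. If that matters, one option is to wrap the separators in a capturing group, because re.split then returns the separators along with the text. The sketch below is an illustration under that assumption; the helper name split_turns and the sample conversation are made up for the example.

```python
import re

# Speaker labels as used in the article's chat data ("agent:" and "customer:").
separators = ['坐席：', '客户：']

# A capturing group makes re.split return the separators as well as the text.
splitter = re.compile('(' + '|'.join(map(re.escape, separators)) + ')')


def split_turns(text):
    """Return (speaker, utterance) pairs; text before the first label gets speaker ''."""
    tokens = splitter.split(text)
    turns = []
    lead = tokens[0].strip()
    if lead:
        turns.append(('', lead))
    # After the first element, tokens alternate: label, utterance, label, utterance, ...
    for label, utterance in zip(tokens[1::2], tokens[2::2]):
        utterance = utterance.strip()
        if utterance:
            turns.append((label, utterance))
    return turns


# Made-up sample conversation.
sample = '坐席：您好，请问有什么可以帮您？客户：我想申请退款。坐席：好的，请提供订单号。'
for speaker, utterance in split_turns(sample):
    print(speaker, utterance)
```

Each pair can then be tested against the keyword pattern in the same way as the plain segments, with the speaker kept alongside the sentence.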

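5. Write the extracted sentences to new columns

Still inside the loop, if a row produced any matching sentences, we write them one per column. Note that 'D', 'E', and so on are DataFrame column labels, not Excel positions: the new columns are appended after the existing ones, so they land in Excel columns D, E, ... only when the sheet has exactly three columns. The single-letter scheme also stops making sense after 'Z'.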

    if found_sentences:
        for i, sentence in enumerate(found_sentences, start=1):
            col_letter = chr(ord(start_col) + i - 1)  # column label 'D', 'E', ... (single letters only)
            df.at[index, col_letter] = sentence
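As an alternative to growing the DataFrame cell by cell, steps 4 and 5 can also be sketched without an explicit loop. The version below is only an illustration: the in-memory DataFrame stands in for the workbook, the keyword is a single made-up example, and the column names match_1, match_2, ... are hypothetical labels that avoid the single-letter limit.

```python
import re

import pandas as pd

# Hypothetical data standing in for the workbook; the third column holds the chat text.
df = pd.DataFrame({
    'id': [1, 2, 3],
    'date': ['2024-01-01', '2024-01-02', '2024-01-03'],
    'chat': [
        '坐席：您好。客户：我要退款。坐席：退款已受理。',
        '坐席：您好。客户：谢谢。',
        None,  # an empty cell
    ],
})

pattern = re.compile('退款')  # hypothetical keyword ("refund")
splitter = re.compile('坐席：|客户：')


def find_sentences(text):
    """Split one chat cell on the speaker labels and keep the parts that match."""
    parts = (part.strip() for part in splitter.split(text))
    return [part for part in parts if part and pattern.search(part)]


# fillna('') keeps empty cells from turning into the string 'nan'.
matches = df.iloc[:, 2].fillna('').astype(str).map(find_sentences)

# One row per input row, one column per match; rows with fewer matches are padded.
extracted = pd.DataFrame(matches.tolist(), index=df.index)
extracted.columns = [f'match_{i}' for i in range(1, extracted.shape[1] + 1)]

result = pd.concat([df, extracted], axis=1)
print(result)
```

Because the new columns are built in one go and then concatenated, the number of output columns follows the row with the most matches rather than a fixed range of letters.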

6. Save the modified Excel file

Finally, after the loop has finished, we save the modified DataFrame to a new Excel file.

output_file_path = '处理后的数据模板.xlsx'
df.to_excel(output_file_path, sheet_name=sheet_name, index=False)
print(f"Done. Results saved to {output_file_path}")

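Summary

With the steps above we can automatically extract the sentences that contain specific keywords from an Excel file and write each of them into its own column. This speeds up the processing and avoids the mistakes that manual work tends to introduce.

I hope this article helps you understand and apply Python for data processing. If you have any questions or suggestions, feel free to leave a comment!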

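Source Code

import pandas as pd
import re

# File path and worksheet name
file_path = '数据模板.xlsx'
sheet_name = 'Sheet1'

# Read the Excel file
df = pd.read_excel(file_path, sheet_name=sheet_name)

# Keywords to look for (placeholders; adjust as needed)
keywords = ['关键字1', '关键字2', '关键字3', '关键字4', '关键字5']

# Regular expression that tolerates whitespace inside a keyword.
# re.escape() escapes a space as '\ '; replacing it with '\s*' allows
# any amount of whitespace (including none) at that position.
pattern = re.compile('|'.join([re.escape(keyword).replace('\\ ', '\\s*') for keyword in keywords]), re.IGNORECASE)

# Separators: the speaker labels "agent:" and "customer:"
separators = ['坐席：', '客户：']

# Label of the first output column (lands in Excel column D when the sheet has three columns)
start_col = 'D'

# Iterate over each row
for index, row in df.iterrows():
    cell_content = str(row.iloc[2])  # access the third column (column C) by position
    found_sentences = []

    # Split the content on the separators
    parts = re.split('|'.join(map(re.escape, separators)), cell_content)

    # Keep the segments that contain a keyword
    for part in parts:
        if pattern.search(part):
            found_sentences.append(part.strip())

    # If anything matched, write the sentences to the following columns
    if found_sentences:
        for i, sentence in enumerate(found_sentences, start=1):
            col_letter = chr(ord(start_col) + i - 1)  # column label 'D', 'E', ... (single letters only)
            df.at[index, col_letter] = sentence

# Save the modified Excel file
output_file_path = '处理后的数据模板.xlsx'
df.to_excel(output_file_path, sheet_name=sheet_name, index=False)

print(f"Done. Results saved to {output_file_path}")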

悠米是只猫