AI For Trading: Term 2 NLP (78)

Starting today, I am studying the second term of AI for quantitative trading, which is very exciting.

Industry Experts

We’ve worked with experts from the industry to put together course materials that will introduce you to the exciting and fast-moving world of quant trading!

NLP Overview

Language is an important medium for human communication. It allows us to convey information, express our ideas, and give instructions to others.

Some philosophers argue that it enables us to form complex thoughts and reason about them. It may turn out to be a critical component of human intelligence. Now consider the various artificial systems we interact with every day: phones, cars, websites, coffee machines.

It's natural to expect them to be able to process and understand human language, right? Yet, computers are still lagging behind. No doubt, we have made some incredible progress in the field of natural language processing, but there is still a long way to go. And that's what makes this an exciting and dynamic area of study.

In this lesson you will not only get to know more about the applications and challenges in NLP, you will also learn how to design an intelligent application that uses NLP techniques and deploy it on a scalable platform. Sounds fun? Let's get started.

Structured Languages

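Why is it so hard for computers to understand us? One drawback of human languages, or a feature, depending on how you look at it, is the lack of a precisely defined structure.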

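To understand why that makes things difficult, let's first look at some languages that are more structured. Mathematics, for instance, uses a structured language. When I write y = 2x + 5, there is no ambiguity in what I want to convey: the variable y is related to the variable x as 2x + 5. Formal logic also uses a structured language. For example, consider the expression parent(x, y) ∧ parent(x, z) → sibling(y, z). This statement asserts that if x is a parent of y and x is a parent of z, then y and z are siblings.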
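To make that lack of ambiguity concrete, here is a minimal Python sketch of the two structured expressions above. The names `y_of`, `siblings`, and the sample `parent_facts` are hypothetical and were introduced only for illustration. The check that excludes pairing a child with itself is an added assumption, not part of the original rule.

```python
from itertools import permutations


def y_of(x):
    """The equation y = 2x + 5: each x maps to exactly one y."""
    return 2 * x + 5


def siblings(parent_facts):
    """Apply the rule parent(x, y) AND parent(x, z) -> sibling(y, z).

    parent_facts is a set of (parent, child) pairs. Pairs of the form
    (y, y) are left out (an added assumption), so nobody is reported
    as their own sibling.
    """
    children_by_parent = {}
    for parent, child in parent_facts:
        children_by_parent.setdefault(parent, set()).add(child)

    result = set()
    for children in children_by_parent.values():
        # Every ordered pair of distinct children of the same parent.
        result.update(permutations(sorted(children), 2))
    return result


# Hypothetical sample facts.
parent_facts = {("alice", "bob"), ("alice", "carol"), ("dave", "erin")}

print(y_of(3))
print(sorted(siblings(parent_facts)))
```

Both functions always give the same answer for the same input, which is exactly the property that natural language lacks.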

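A set of structured languages that may be more familiar to you are scripting and programming languages. Consider this SQL statement: SELECT name, email FROM users WHERE name LIKE 'A%'. We are asking the database to return the names and email addresses of all users whose names start with an A. These languages are designed to be as unambiguous as possible, which makes them suitable for computers to process.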
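The SQL statement above can be tried out with Python's built-in `sqlite3` module. This is only a sketch: the `users` table layout and its three rows are made up for illustration, and the `ORDER BY name` clause is an addition so that the result order is predictable.

```python
import sqlite3

# In-memory database with a hypothetical users table and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users (name, email) VALUES (?, ?)",
    [
        ("Alice", "alice@example.com"),
        ("Bob", "bob@example.com"),
        ("Anna", "anna@example.com"),
    ],
)

# The pattern 'A%' matches any name that starts with A.
rows = conn.execute(
    "SELECT name, email FROM users WHERE name LIKE 'A%' ORDER BY name"
).fetchall()
conn.close()

for name, email in rows:
    print(name, email)
```

Note that the pattern has to be quoted as a string literal ('A%'). In SQLite, LIKE is case-insensitive for ASCII letters by default, so a name stored as "alice" would match as well.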

Counting Words

Let's implement a simple function that is often used in Natural Language Processing: Counting word frequencies.

Consider this passage of text:

As I was waiting, a man came out of a side room, and at a glance I was sure he must be Long John. His left leg was cut off close by the hip, and under the left shoulder he carried a crutch, which he managed with wonderful dexterity, hopping about upon it like a bird. He was very tall and strong, with a face as big as a ham—plain and pale, but intelligent and smiling. Indeed, he seemed in the most cheerful spirits, whistling as he moved about among the tables, with a merry word or a slap on the shoulder for the more favoured of his guests.

— Excerpt from Treasure Island, by Robert Louis Stevenson.

In the following coding exercise, we have provided code to load the text from a file, call the function count_words() to obtain word counts (which you need to implement), and print the 10 most common and least common unique words.

Complete the portions marked as TODO to count how many times each unique word occurs in the text.

"""Count words."""

def count_words(text):
    """Count how many times each unique word occurs in text."""
    counts = dict()  # dictionary of { <word>: <count> } pairs to return

    # TODO: Convert to lowercase

    # TODO: Split text into tokens (words), leaving out punctuation
    # (Hint: Use regex to split on non-alphanumeric characters)

    # TODO: Aggregate word counts using a dictionary

    return counts

def test_run():
    with open("input.txt", "r") as f:
        text = f.read()
        counts = count_words(text)
        sorted_counts = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

        print("10 most common words:\nWord\tCount")
        for word, count in sorted_counts[:10]:
            print("{}\t{}".format(word, count))

        print("\n10 least common words:\nWord\tCount")
        for word, count in sorted_counts[-10:]:
            print("{}\t{}".format(word, count))

if __name__ == "__main__":
    test_run()

Implementation:

"""Count words."""

import re

def count_words(text):
    """Count how many times each unique word occurs in text."""
    counts = dict()  # dictionary of { <word>: <count> } pairs to return

    # Convert to lowercase so that "He" and "he" are counted together
    text = text.lower()

    # Split text into tokens (words), leaving out punctuation:
    # split on runs of non-alphanumeric characters and drop empty strings
    tokens = [token for token in re.split(r"[^a-z0-9]+", text) if token]

    # Aggregate word counts using a dictionary
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1

    return counts

def test_run():
    with open("input.txt", "r") as f:
        text = f.read()
        counts = count_words(text)
        sorted_counts = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

        print("10 most common words:\nWord\tCount")
        for word, count in sorted_counts[:10]:
            print("{}\t{}".format(word, count))

        print("\n10 least common words:\nWord\tCount")
        for word, count in sorted_counts[-10:]:
            print("{}\t{}".format(word, count))

if __name__ == "__main__":
    test_run()

Output (on Python 3.7+, where words with equal counts are listed in order of first appearance):

10 most common words:
Word    Count
a   9
he  6
the 6
and 5
as  4
was 4
with    3
i   2
of  2
his 2

10 least common words:
Word    Count
tables  1
merry   1
word    1
or  1
slap    1
on  1
for 1
more    1
favoured    1
guests  1
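As a possible alternative, the same counting can be sketched with the standard library's `collections.Counter` and `re.findall`. The function name `count_words_counter` and the sample sentence are chosen here for illustration. This sketch assumes that only lowercase ASCII letters and digits count as word characters.

```python
import re
from collections import Counter


def count_words_counter(text):
    """Count lowercase alphanumeric tokens in text using collections.Counter."""
    # findall returns every run of letters/digits, so punctuation is dropped.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


sample = "He was very tall and strong, with a face as big as a ham"
counts = count_words_counter(sample)

# most_common(n) returns the n highest counts as (word, count) pairs.
print(counts.most_common(2))
```

`Counter` is a `dict` subclass, so the result can be passed to the same sorting and printing code as a plain dictionary.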

input.txt

As I was waiting, a man came out of a side room, and at a glance I was sure he must be Long John. His left leg was cut off close by the hip, and under the left shoulder he carried a crutch, which he managed with wonderful dexterity, hopping about upon it like a bird. He was very tall and strong, with a face as big as a ham—plain and pale, but intelligent and smiling. Indeed, he seemed in the most cheerful spirits, whistling as he moved about among the tables, with a merry word or a slap on the shoulder for the more favoured of his guests.

Those who keep doing will often succeed; those who keep walking will often arrive.