script.py

2024-6-18

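I am trying to write a cron job that periodically runs a Python script I wrote, which adds some data to a database I am building. The script works, and it also works when I run python /Users/me/Desktop/pythonScript/script.py from the terminal, but the cron job does not. I ran chmod a+x /Users/me/Desktop/pythonScript/script.py to make the script executable. The Python script also starts with #!/usr/bin/python.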

Following a suggestion I found, I added the value of my $PATH to the PATH variable in my crontab, and also added the SHELL and HOME variables.

crontab -l currently returns:

PATH="/Library/Frameworks/Python.framework/Versions/3.6/bin:/Users/cole/anaconda/bin:/Library/Frameworks/Python.framework/Versions/3.5/bin:/Library/Frameworks/Python.framework/Versions/3.5/bin:/Library/Frameworks/Python.framework/Versions/3.5/bin:/Library/Frameworks/Python.framework/Versions/3.5/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:/Library/TeX/texbin"

SHELL="/bin/bash"

HOME = "/Users/me/Desktop/pythonScript/"

* * * * * python script.py
* * * * * env > /tmp/cronenv

The first job should run my script, script.py, and the second job prints the cron environment to the file /tmp/cronenv. That file looks like this:

SHELL=/bin/bash
USER=me

PATH=/Library/Frameworks/Python.framework/Versions/3.6/bin:/Users/cole/anaconda/bin:/Library/Frameworks/Python.framework/Versions/3.5/bin:/Library/Frameworks/Python.framework/Versions/3.5/bin:/Library/Frameworks/Python.framework/Versions/3.5/bin:/Library/Frameworks/Python.framework/Versions/3.5/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:/Library/TeX/texbin
PWD=/Users/cole
SHLVL=1
HOME=/Users/cole
LOGNAME=cole
_=/usr/bin/env
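
The env dump above shows what the shell sees under cron, but not which interpreter ends up running the script. A minimal sketch of the same check made from inside Python follows. The helper names runtime_report and write_report are hypothetical and not part of the original script, and the output path is left to the caller.

```python
import os
import sys


def runtime_report():
    """Collect details that commonly differ between cron and a terminal session."""
    return {
        "executable": sys.executable,        # interpreter actually running this code
        "version": sys.version.split()[0],   # interpreter version string
        "cwd": os.getcwd(),                  # directory that relative paths resolve against
        "path": os.environ.get("PATH", ""),  # PATH as seen by this process
        "home": os.environ.get("HOME", ""),  # HOME as seen by this process
    }


def write_report(path):
    """Write the report as key=value lines, mirroring the layout of the env output."""
    with open(path, "w") as handle:
        for key, value in sorted(runtime_report().items()):
            handle.write("{}={}\n".format(key, value))
```

Calling write_report with an absolute path near the top of the script, once from the terminal and once from cron, gives two files that can be compared line by line.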

However, my database is not being updated, and when I search for cron in the system.log file, I find the following error messages:

Nov  5 20:24:00 Coles-MacBook-Air-2 cron[3301]: no path for address 0x11a77b000
Nov  5 20:24:00 Coles-MacBook-Air-2 cron[3302]: no path for address 0x11a77b000
Nov  5 20:25:00 Coles-MacBook-Air-2 cron[3314]: no path for address 0x11a77b000
Nov  5 20:25:00 Coles-MacBook-Air-2 cron[3315]: no path for address 0x11a77b000

Note that there are two messages per minute, one for each cron job, yet the second job appears to work while the first does not. Any suggestions?

Since it may be relevant, here is the script:

script.py

#!/usr/bin/python
# -*- coding: utf-8 -*-

import requests
import re
from nltk import word_tokenize
import time
import pickle

saveDir = '/Users/me/Desktop/pythonScript/dbfolder' #the folder where I want to save files 
workingDir = '/Users/me/Desktop/pythonScript/' #location of the script

#this function turns integer values into their url location at a gutenberg mirror
home = 'http://mirror.csclub.uwaterloo.ca/gutenberg/'
fileType = '.txt'
def urlMaker(x):
    url = home
    if int(x) > 10:
        for j in [i for i in range(len(x)-1)]:
            url += x[j]+'/'
        url += x+'/'+x+fileType
    else:
        url = home+'0/'+x+'/'+x+fileType
    return(url)

#this function takes a url and returns the .txt files at each url, as well as a list of cleaned paragraphs over 100 words in length.
def process(url):
    try:
        r  = requests.get(url)
    except ConnectionError:
        time.sleep(300)
        try:
            r  = requests.get(url)
        except ConnectionError:
            time.sleep(600)
            try:
                r  = requests.get(url)
            except ConnectionError:
                return(ConnectionError) 
    toprint = r.text
    text = r.text.lower()
    k = re.search('\Send\Sthe small print!',text)
    l = re.search('the project gutenberg etext of the declaration of independence',text)
    m = re.search('start of (.*) project gutenberg (.*)*', text)
    n = re.search('end of (.*) project gutenberg (.*)*', text)
    o = re.search('http://gutenberg.net.au/licence.html', text)
    p = re.search('this site is full of free ebooks', text)
    x = 0
    lst = []
    if m and n:
        start,end = re.escape(m.group(0)), re.escape(n.group(0))
        text = re.search('{}(.*){}'.format(start, end), text, re.S).group(1)
    elif o and p:
        start,end = re.escape(o.group(0)), re.escape(p.group(0))
        text = re.search('{}(.*){}'.format(start, end), text, re.S).group(1)
    elif l and n:
        start,end = re.escape(l.group(0)), re.escape(n.group(0))
        text = re.search('{}(.*){}'.format(start, end), text, re.S).group(1)
    elif k and n:
        start,end = re.escape(k.group(0)), re.escape(n.group(0))
        text = re.search('{}(.*){}'.format(start, end), text, re.S).group(1)
    else:
        text = text
    if text.split('\n\n') != [text]:
        for i in text.split('\n\n'):
            if i != ''\
            and 'gutenberg' not in i\
            and 'ebook' not in i\
            and 'etext' not in i\
            and len(word_tokenize(i)) > 100:
                lst += [i.replace('\n',' ')]
        x = 1
    if text.split('\r\n\r\n') != [text] and x == 0:
        for i in text.split('\r\n\r\n'):
            if i != ''\
            and 'gutenberg' not in i\
            and 'ebook' not in i\
            and 'etext' not in i\
            and len(word_tokenize(i)) > 100:
                lst += [i.replace('\r\n',' ')]
    return((lst,toprint))

####makes an index dictionary of the titles to the title number
indexUrl = 'http://mirror.csclub.uwaterloo.ca/gutenberg/GUTINDEX.ALL'
r  = requests.get(indexUrl)
index = r.text.lower()
#splits index file by beginning and end
start = re.escape(re.search('~ ~ ~ ~ posting dates for the below ebooks:  1 oct 2017 to 31 oct 2017 ~ ~ ~ ~'\
                            ,index).group(0))
end = re.escape(re.search('<==end of gutindex.all==>',index).group(0))
index = re.search('{}(.*){}'.format(start, end), index, re.S).group(1)

#splits file by pc line breaks
lbPC = re.split('\r\n\r\n',index)

#cleans subtitles from line using PC notation
cleanSubsPC = []
for i in lbPC:
    cleanSubsPC += [i.split('\r\n')[0]]

#splits lines which use MAC notation
lbMAC = []
for i in cleanSubsPC:
    if re.split('\n\n',i) == [i]:
        lbMAC += [i]
    else:
        lbMAC += [x for x in re.split('\n\n',i)]

#cleans subtitles etc. which use MAC linebreaks        
cleanSubsMAC = []
for i in lbMAC:
    cleanSubsMAC += [i.split('\n')[0]]

#builds list of strings containing titles and numbers, cleaned of weird unicode stuff
textPairs = []
for i in cleanSubsMAC:
    if len(i) > 1 and not i =='':
        if not i.startswith('~ ~ ~ ~ posting')\
        and not i.startswith('title and author'):
            try:
                int(i[-1])
                textPairs += [i.replace('â','')\
                     .replace('â\xa0',' ').replace('\xa0',' ')]
            except ValueError:
                pass

#builds dic of key:title pairs
inDic = {}
for i in textPairs:
    inDic[int(re.match('.*?([0-9]+)$', i).group(1))] = i.split('   ')[0].replace(',',' ')

#makes dictionary of urls to access
urls = {}
for x in [x for x in range(1,55863)]:
    urls[x] = urlMaker(str(x))

#this opens a saved dictionary of the collected data, so the script will begin where it left off previously
try:
    with open(workingDir+'gutenburgDic', 'rb') as handle:
        data = pickle.load(handle)
except FileNotFoundError:
    data = {} #no saved dictionary yet, so start from an empty one

#actually iterates through urls, saving data, 100 texts at a time. Also saves raw text files for later use
for i in range(len(data)+1,len(data)+101):
    data[i],text = (urls[i],process(urls[i])[0]),process(urls[i])[1]
    f = open(saveDir+urls[i].replace('/','.'),'w')
    f.write(text)
    f.close()

#saves updated dictionary of >100 word paragraphs 
with open(workingDir+'gutenburgDic', 'wb') as handle:
    pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)
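
When a script like the one above runs under cron there is no terminal to show a traceback, so a failure can be easy to miss. The following is a minimal sketch, not part of the original script, of one way to record failures in a file. The name run_logged and the log location are hypothetical.

```python
import traceback
from datetime import datetime


def run_logged(job, log_path):
    """Run job(); on failure append a timestamp and the traceback to log_path.

    Returns True if job() finished without raising, False otherwise.
    """
    try:
        job()
    except Exception:
        with open(log_path, "a") as handle:
            handle.write("{} job failed\n".format(datetime.now().isoformat()))
            handle.write(traceback.format_exc())
        return False
    return True
```

For import failures to be captured as well, third-party imports such as requests or nltk would have to happen inside job rather than at the top of the file. A missing package under the interpreter that cron picks up is one plausible cause of a script that works in the terminal but not under cron.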

Answer 1

I got it working. I reinstalled Python and Anaconda, as well as my operating system (Sierra). Then, following the advice given by @Jet, I updated my crontab to:

PATH="/anaconda3/bin:/Library/Frameworks/Python.framework/Versions/3.6/bin:/Library/Frameworks/Python.framework/Versions/3.6/bin:/Users/cole/anaconda/bin:/Library/Frameworks/Python.framework/Versions/3.5/bin:/Library/Frameworks/Python.framework/Versions/3.5/bin:/Library/Frameworks/Python.framework/Versions/3.5/bin:/Library/Frameworks/Python.framework/Versions/3.5/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:/Library/TeX/texbin"

SHELL="/bin/bash"

HOME = "/Users/me/Desktop/pythonScript/"

* * * * * /anaconda3/bin/python /Users/me/Desktop/pythonScript/script.py

The path to python was taken from the output of which python in the terminal.
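
The same lookup can be sketched in Python. This is only an illustration under stated assumptions: build_cron_line is a hypothetical helper, and shutil.which("python") is used as a rough stand-in for running which python in the terminal.

```python
import shutil
import sys


def build_cron_line(script_path, schedule="* * * * *", interpreter=None):
    """Return a crontab entry that names the interpreter by absolute path."""
    if interpreter is None:
        # Use whatever "python" resolves to on the current PATH,
        # falling back to the interpreter running this code.
        interpreter = shutil.which("python") or sys.executable
    return "{} {} {}".format(schedule, interpreter, script_path)


# Reproduces the crontab entry shown above when given the same paths.
print(build_cron_line("/Users/me/Desktop/pythonScript/script.py",
                      interpreter="/anaconda3/bin/python"))
```

Naming both the interpreter and the script by absolute path means the entry no longer depends on the PATH or working directory that cron happens to provide.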
