Generating a variable to feed into an RSS feed

Problem description:

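This may be a very simple debugging problem (I am not a coder by trade). I have looping code that parses a scraped XML file. The parsing runs on a 5-minute loop but does not return duplicates from one loop to the next, because each user id is stored in a set: if the user id is already present in userset, the script skips to the next row of the XML. I want to output the results of this script as an RSS feed, and I have a possible way of doing that, but first I need to have the data stored as a variable of some kind.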

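I have tried to do this, but every time I run into a problem with the last userid stored in the set. Rather than post broken code, I have attached an example of the working code, which does not include my botched attempts at defining the printed results as a variable.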

import mechanize 
import urllib 
import json 
import re 
import random 
import datetime 
from sched import scheduler 
from time import time, sleep 

######Code to loop the script and set up scheduling time 

s = scheduler(time, sleep) 
random.seed() 

def run_periodically(start, end, interval, func): 
    event_time = start 
    while event_time < end: 
     s.enterabs(event_time, 0, func,()) 
     event_time += interval + random.randrange(-5, 45) 
    s.run() 

###### Code to get the data required from the URL desired 
def getData(): 
    post_url = "URL OF INTEREST" 
    browser = mechanize.Browser() 
    browser.set_handle_robots(False) 
    browser.addheaders = [('User-agent', 'Firefox')] 

######These are the parameters you've got from checking with the aforementioned tools 
    parameters = {'page' : '1', 
       'rp' : '250', 
       'sortname' : 'roi', 
       'sortorder' : 'desc' 
      } 
#####Encode the parameters 
    data = urllib.urlencode(parameters) 
    trans_array = browser.open(post_url,data).read().decode('UTF-8') 

    xmlload1 = json.loads(trans_array) 
    pattern1 = re.compile('>&nbsp;&nbsp;(.*)<') 
    pattern2 = re.compile('/control/profile/view/(.*)\' title=') 
    pattern3 = re.compile('<span style=\'font-size:12px;\'>(.*)<\/span>') 

##### Make the code identify each row, removing the need to hard-code the number of rows in the XML file, 
##### so the number of rows is dynamic (it changes as the list grows, which the looping function needs to run uninterrupted) 

    for row in xmlload1['rows']: 
     cell = row["cell"] 

##### defining the Keys (key is the area from which data is pulled in the XML) for use in the pattern finding/regex 

     user_delimiter = cell['username'] 
     selection_delimiter = cell['race_horse'] 

     if strikeratecalc2 < 12 : continue; 

##### REMAINDER OF THE REGEX DELIMITATIONS 

     username_delimiter_results = re.findall(pattern1, user_delimiter)[0] 
     userid_delimiter_results = (re.findall(pattern2, user_delimiter)[0]) 
     user_selection = re.findall(pattern3, selection_delimiter)[0] 

##### Code to stop duplicate posts of each user throughout the day 

    userset = set ([]) 
    if userid_delimiter_results in userset: continue; 

##### Printing the results of the code at hand 

     print "user id = ",userid_delimiter_results 
     print "username = ",username_delimiter_results 
     print "user selection = ",user_selection 
     print "" 

##### Code to stop duplicate posts of each user throughout the day, part 2 (updating the set to add users already printed to the ignore list) 

    userset.update(userid_delimiter_results) 

    getData() 

    run_periodically(time()+5, time()+1000000, 300, getData) 

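When I tried to produce the variable (I tried to build it as an array), the problem was that the code somehow skipped the final userset.update(userid_delimiter_results). As a result, the last entry in the feed was repeated on every run of the code, because according to userset the user id in question had not been recorded. Any simple way to output the results of this code as a variable would be greatly appreciated. Kind regards, AEA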


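There are many issues. For example, the code always tests against an empty set (userset), the test itself may also be incorrect (or the later call to userset.update() is), and the run_periodically() call is unreachable because of the infinite recursion (getData()). Split your code into small pieces and test each one separately with mock data until you understand what each piece is doing. Consider using [scrapy](http://doc.scrapy.org/en/latest/intro/overview.html) to crawl, scrape the data, and generate the feed, or [scrapely](https://github.com/scrapy/scrapely) to extract data from HTML – jfs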
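To illustrate the first two points in the comment above, here is a minimal sketch, not the asker's actual fix. It assumes each row has already been reduced to a (user id, username, selection) tuple; `seen_users`, `filter_new_rows`, and the sample data are hypothetical names made up for the example, and it is written for Python 3. The set is created once, outside the function, so it survives from one run to the next, and `set.add()` is used because `set.update()` given a string adds each character as a separate element.

```python
# Module-level set: created once, so it persists across scheduled runs.
seen_users = set()

def filter_new_rows(rows):
    """Return only the rows whose user id was not seen in an earlier call."""
    fresh = []
    for user_id, username, selection in rows:
        if user_id in seen_users:
            continue  # already reported in this or an earlier run
        seen_users.add(user_id)  # add() stores the whole string as one element
        fresh.append((user_id, username, selection))
    return fresh

# Mock data standing in for two consecutive 5-minute runs (hypothetical values).
first_run = filter_new_rows([("101", "alice", "Red Rum"),
                             ("102", "bob", "Arkle")])
second_run = filter_new_rows([("102", "bob", "Arkle"),
                              ("103", "carol", "Desert Orchid")])

print(first_run)   # both rows are new
print(second_run)  # only the row for user 103 is new
```

Because the function returns a list instead of printing, the same data can be handed to whatever builds the feed, and the function can be tested on its own with mock rows as the comment suggests.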

I solved it by putting the print section into:

arrayna = [arrayna1, arrayna2, arrayna3, arrayna4] 

    arraym1 = "user id = ",userid_delimiter_results 

Then, to deal with what happens to arrayna on each run of the loop, I created an empty list:

my_array = [] # Create an empty list 

print(my_array) 

So your code might look like:

This works :)
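For readers who want to go from the collected list to an actual feed, the following is a rough sketch of one way to do it with the standard library, under several assumptions: each result is stored as a dict with the hypothetical keys `user_id`, `username`, and `selection`; the function name `build_rss`, the channel title, and the example.com link are placeholders; and it targets Python 3. It is not the code used in the answer above.

```python
import xml.etree.ElementTree as ET

def build_rss(entries, title="User selections", link="http://example.com/"):
    """Render a list of result dicts as an RSS 2.0 document (returned as str)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires title, link and description on the channel.
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Selections collected by the scraper"
    for entry in entries:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = entry["username"]
        ET.SubElement(item, "description").text = entry["selection"]
        # The user id is not a URL, so mark the guid as a non-permalink.
        ET.SubElement(item, "guid", isPermaLink="false").text = entry["user_id"]
    return ET.tostring(rss, encoding="unicode")

# Accumulate results in a list instead of printing them (hypothetical data).
results = []
results.append({"user_id": "101", "username": "alice", "selection": "Red Rum"})
results.append({"user_id": "103", "username": "carol", "selection": "Desert Orchid"})

feed_xml = build_rss(results)
print(feed_xml)
```

Keeping the results in a list that outlives a single loop iteration is what makes this possible: the list can be appended to on each run and rendered to XML whenever the feed needs refreshing.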