Large Model Evaluation
Model | Precision | Hardware | GPU memory | GPU memory used | Context length (tokens) | Output (tokens) | Time to first token (s) | Generation time (s) | Total time (s) | Tokens/s |
---|---|---|---|---|---|---|---|---|---|---|
mistralai/Mistral-7B-Instruct-v0.1 | torch.float32 | V100 * 1 | 32GB | 30GB | 658 | 55 | 1.002687883 | 3.078769755 | 4.081457639 | 174.7 |
mistralai/Mistral-7B-Instruct-v0.1 | torch.float32 | V100 * 1 | 32GB | 30GB | 16 | 512 | 0.114518285 | 22.04399722 | 22.1585155 | 23.8 |
mistralai/Mistral-7B-Instruct-v0.1 | torch.float16 | V100 * 1 | 32GB | 15.44GB | 658 | 55 | 0.279174662 | 2.280483913 | 2.559658575 | 278.55 |
mistralai/Mistral-7B-Instruct-v0.1 | torch.float16 | V100 * 1 | 32GB | 15.44GB | 16 | 512 | 0.087295771 | 17.09277301 | 17.1800687 | 30.89 |
openbmb/MiniCPM-2B-sft-fp32 | torch.float32 | T4 * 1 | 15GB | 10.7GB | 23 | 55 | 0.142048 | 8.082035 | 8.222035 | 9.49 |
openbmb/MiniCPM-2B-sft-bf16 | torch.float32 | T4 * 1 | 15GB | 10.7GB | 23 | 29 | 0.162126 | 1.63576 | 1.798102 | 28.919388 |
openbmb/MiniCPM-2B-sft-bf16 | torch.float32 | T4 * 1 | 15GB | 10.7GB | 16 | 302 | 0.192262 | 36.613799 | 36.803799 | 8.912123 |
openbmb/MiniCPM-2B-dpo-fp32 | torch.float32 | T4 * 1 | 15GB | 10.7GB | 16 | 302 | 0.192262 | 36.613799 | 36.803799 | 8.912123 |
openbmb/MiniCPM-2B-sft-bf16 | torch.float32 | T4 * 1 | 15GB | 10.7GB | 110 | 271 | 0.346282 | 14.88633 | 15.232612 | 25.0121247 |
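For reference, the Tokens/s column appears to be computed over prompt plus output tokens, i.e. (context length + output tokens) / total time. A minimal sketch using the values of the first Mistral row:

```python
# Minimal sketch, assuming Tokens/s = (prompt tokens + output tokens) / total time.
# Values are taken from the first Mistral row of the table.
context_tokens = 658
output_tokens = 55
total_time_s = 4.081457639

tokens_per_second = (context_tokens + output_tokens) / total_time_s
print(f"{tokens_per_second:.1f} tokens/s")  # ~174.7, matching the table
```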
Raw per-run logs for the four Mistral configurations follow. Each line prints two numbers, the time to first token and the total generation time, as emitted by the script at the end of this section; the Mistral rows in the table above are the averages over the 10 runs of each case.

CASE1 (torch.float32, 658-token prompt, 55 output tokens)
total generation time:1.6619608402252197 4.738634347915649 seconds
total generation time:0.9283370971679688 4.004896402359009 seconds
total generation time:0.9297411441802979 4.009643793106079 seconds
total generation time:0.929086446762085 4.006907939910889 seconds
total generation time:0.929163932800293 4.011216640472412 seconds
total generation time:0.9296543598175049 4.0125579833984375 seconds
total generation time:0.9291834831237793 4.006205797195435 seconds
total generation time:0.9296023845672607 4.013807773590088 seconds
total generation time:0.9289572238922119 4.002191543579102 seconds
total generation time:0.9311919212341309 4.008514165878296 seconds
CASE2 (torch.float32, 16-token prompt, 512 output tokens)
total generation time:0.5885086059570312 22.85054087638855 seconds
total generation time:0.06227469444274902 22.096322059631348 seconds
total generation time:0.06183004379272461 22.086422443389893 seconds
total generation time:0.06199240684509277 22.089636087417603 seconds
total generation time:0.06151008605957031 22.07428288459778 seconds
total generation time:0.061719655990600586 22.078858375549316 seconds
total generation time:0.06190967559814453 22.067675828933716 seconds
total generation time:0.061861276626586914 22.09499216079712 seconds
total generation time:0.06190347671508789 22.06703209877014 seconds
total generation time:0.06167292594909668 22.079392194747925 seconds
CASE3 (torch.float16, 658-token prompt, 55 output tokens)
total generation time:1.0182111263275146 3.315606117248535 seconds
total generation time:0.1967155933380127 2.4853999614715576 seconds
total generation time:0.19660735130310059 2.470855951309204 seconds
total generation time:0.19737577438354492 2.479072093963623 seconds
total generation time:0.19767999649047852 2.501652240753174 seconds
total generation time:0.19811129570007324 2.463073253631592 seconds
total generation time:0.19675588607788086 2.4883460998535156 seconds
total generation time:0.1969308853149414 2.4389374256134033 seconds
total generation time:0.1968069076538086 2.463475227355957 seconds
total generation time:0.19655179977416992 2.4901673793792725 seconds
CASE4 (torch.float16, 16-token prompt, 512 output tokens)
total generation time:0.5371785163879395 17.91862177848816 seconds
total generation time:0.03760647773742676 16.991518020629883 seconds
total generation time:0.037310123443603516 17.35347294807434 seconds
total generation time:0.037428855895996094 16.85565209388733 seconds
total generation time:0.03730964660644531 17.12874126434326 seconds
total generation time:0.03709769248962402 16.900318384170532 seconds
total generation time:0.03712606430053711 17.184556007385254 seconds
total generation time:0.037362098693847656 17.176209211349487 seconds
total generation time:0.03728461265563965 17.043495178222656 seconds
total generation time:0.037253618240356445 17.24810290336609 seconds
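Averaging the per-run numbers reproduces the table values. A minimal sketch that parses log lines in the format above (the format is assumed to match the print statement in the benchmark script below):

```python
# Minimal sketch: average the per-run timings printed by the benchmark script.
# Assumes lines of the form "total generation time:<first_token_s> <total_s> seconds".
def average_timings(log_lines):
    first_token_times, total_times = [], []
    for line in log_lines:
        values = line.split(":")[1].split()
        first_token_times.append(float(values[0]))
        total_times.append(float(values[1]))
    return (sum(first_token_times) / len(first_token_times),
            sum(total_times) / len(total_times))

case1 = [
    "total generation time:1.6619608402252197 4.738634347915649 seconds",
    "total generation time:0.9283370971679688 4.004896402359009 seconds",
    # ... remaining CASE1 lines ...
]
print(average_timings(case1))  # with all 10 CASE1 lines this gives ~1.00 s / ~4.08 s
```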
import time
from threading import Thread

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

device = "cuda"  # the device to load the model onto
model_path = "/home/ma-user/Mistral-7B-Instruct-v0.1"

# Load the model and tokenizer from local files.
# For the float16 runs in the table, torch_dtype=torch.float16 was presumably passed here as well.
model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True, use_safetensors=False, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
print(model.dtype)

res = []
for i in range(10):
    start = time.time()
    text = "<s>[INST]You are an operation and maintenance engineer of a carrier. A user will provide you with a text description, which describes a carrier's package (the text may be replaced by an offering). You need to find the rewarded resources of the package based on the text content. The requirements are as follows:\n\n1. The package description of a carrier contains a lot of information. After a user subscribes to a package, some resources or money are rewarded to the user. The resources have a certain amount and can be used free of charge. Generally, the rewarded contents are free of charge.\n2. Generally, there are clear charging standards for resources other than donation. This is not the information you want to extract.\n3. The reward includes only three types: bonus (bonus), voice call duration (voice), and data traffic (data). The bonus indicates the amount of money rewarded in each cycle. The voice call duration includes local calls, intra-provincial calls, inter-provincial calls, national calls, and international calls. Data traffic includes local traffic, intra-provincial traffic, inter-provincial traffic, domestic traffic, and international traffic. Other resources are not what you want to extract;\n4. Rollover information about rewarded resources must also be extracted. Rollover means that rewarded resources can be saved to the next cycle when they are used up. If rollover is not involved, this parameter does not need to be returned.\n5. Rewarded resources need to be classified into bonusInfo, voice call duration (voice) and data traffic (data) are classified into freeUnitInfo. If the rewarded resources are empty, no value is returned.\n6. The broadband is not the information you want to extract, only return null. If you enter a value that does not contain the rewarded resources or bonuses or is empty, the system returns a blank value, that is, {}.\n7. This is an information extraction task. Do not provide code or explanation. Directly provide the JSON result. Do not return any other content.\n\nHere are a few examples:\n` ` `\nExample 1\nInput: This package is 10 euros per month, 10 GB national traffic, 100-minute voice calls, and 500 MB/s broadband. After the package is exceeded, the voice call fee is 0.3 euros per minute.\nOutput: {\"freeUnitInfo\":\"10 GB national traffic, 100-minute voice call\"}\n\nExample 2\nInput: 100 euros per month, 20 GB international traffic, 10 euros per month, 500 MB/s broadband, and Sam's Club are supported. Unused traffic can be rolled over to the next cycle. After the traffic exceeds the package, the voice call fee is 0.3 euros per minute.\nOutput: {\"bonusInfo\":\"Euro10 is rewarded each month\",\"freeUnitInfo\":\"The national traffic is 20 GB. The unused traffic can be rolled over to the next cycle.\"}\n\n` ` `\nThe information provided by the user is as follows:Configure a 5G live TV package with a monthly rental of 299 euros and reward 500 MB/s broadband. [/INST]"
    # text = "<s>[INST]Write a story, about 400 words.[/INST]"
    encodeds = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(device)
    # Non-streaming variant, kept for reference:
    """
    model_inputs = encodeds.to(device)
    model.to(device)
    generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True)
    decoded = tokenizer.batch_decode(generated_ids)
    print(decoded[0])
    """
    # Run generation in a background thread and stream tokens to measure the time to first token.
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    generation_kwargs = dict(encodeds, streamer=streamer, max_new_tokens=512, pad_token_id=tokenizer.eos_token_id)
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()
    first_token = True
    out_str = ""
    first_token_time = 0
    for new_text in streamer:
        out_str += new_text
        if first_token:
            # print(f"first token time:{(time.time()-start)} seconds")
            first_token_time = time.time() - start
            first_token = False
    res.append(out_str)
    print(f"total generation time:{first_token_time} {(time.time()-start)} seconds")

# Dump the generated outputs, one per run, separated by a marker line.
writer = open("output.csv", "w+")
for i in res:
    writer.write(i)
    writer.write("\n==================\n")
writer.close()
